HouseCanary is developing the most accurate, most comprehensive forecasts for residential real estate. Accurate forecasts are the result of combining granular time series data with the best models. The HouseCanary forecast model is designed to use all available market-level time series data to predict the most likely value of the local home price index (HPI) with three-year monthly forecasts. HouseCanary provides HPI forecasts for all 381 US metropolitan statistical areas (MSAs), and for every populated ZIP code within those MSAs. Forecast models are updated and re-estimated each month in order to leverage the latest available data. The forecast models are learning-based, so each new housing cycle gives them more history from which to estimate the next major market inflection point.
The purposes of this paper are threefold:
- Identify the data included in the forecast models.
- Provide context for the logic behind the forecast models.
- Define key model outputs.
HouseCanary builds two sets of forecast models on different time horizons: short-run and long-run. For the purpose of this paper, short-run refers to a near-term forecast over the following quarter, and long-run refers to forecasts beyond those of the short-run forecast.
Home price indices built by HouseCanary do not suffer from the 3-to-5-month lag present in other traditional home price indices. Furthermore, HouseCanary price indices are available for sparsely populated ZIP codes through the use of proprietary statistical methodologies.
HouseCanary forecasts are built upon modern machine learning-based algorithms. These algorithms are able to extract signal and capture high-level interactions among potentially many thousands of inputs. This in turn yields highly accurate forecasts.
HouseCanary’s forecast models are created around the principle of machine learning and model averaging. This process involves randomly generating many simple models, which are then combined into a single prediction model. This process improves both the accuracy and the stability of the estimated outputs.
The primary assumptions behind HouseCanary’s forecast model are the following:
- Models utilize leading predictive information contained in both the history of the HPI series itself and from other locally available time series inputs.
- All input time series data should cover at least one housing cycle, preferably more.
- Machine learning-based algorithms are better suited than classical models to recognizing and exploiting higher-order complex relationships among input variables and their relationship to the response variable.
- Machine learning-based algorithms allow us to consider many more inputs than observations.
Localized time series data with extensive history
Prior to modeling any given market, HouseCanary gathers, cleans, and merges all available time series data we have stored in our database for the target market. All forecast models use localized time series data. Major sources of data include the following: (1) time series data created from raw property/transactional data records within the HouseCanary database; (2) time series data available from government sources, including BLS, ACS, FHA, Federal Reserve, etc.; (3) custom time series data created from linear and non-linear combinations of the prior two categories. Furthermore, smoothing and imputation are often performed on the raw input time series data prior to modeling to help reduce statistical noise and to stabilize model outputs.
Examples of real estate-specific data sent into the short-run models include series such as sales volume, default and foreclosure rates, mortgage rates, median sale and list prices, sale-to-list-price ratio, new home permits, and inventory. Non-real estate data includes series such as labor market indicators, household demographics, household incomes, commodity prices, and local gross metro prices.
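The smoothing and imputation step mentioned above can be illustrated with a minimal sketch. The paper does not state which methods HouseCanary uses, so linear interpolation for gaps and a trailing moving average for smoothing are assumptions chosen for simplicity, and the function names and sample values are hypothetical.

```python
from typing import List, Optional


def impute_linear(series: List[Optional[float]]) -> List[float]:
    """Fill missing months (None) by linear interpolation between the
    nearest observed neighbours; edge gaps copy the nearest observed value."""
    known = [i for i, v in enumerate(series) if v is not None]
    if not known:
        raise ValueError("series has no observed values")
    filled = []
    for i, v in enumerate(series):
        if v is not None:
            filled.append(float(v))
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled.append(float(series[right]))
        elif right is None:
            filled.append(float(series[left]))
        else:
            w = (i - left) / (right - left)
            filled.append(series[left] + w * (series[right] - series[left]))
    return filled


def smooth_trailing(series: List[float], window: int = 3) -> List[float]:
    """Trailing moving average: each point uses only current and past values."""
    out = []
    for i in range(len(series)):
        chunk = series[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


# Hypothetical monthly median sale prices (in $1,000s) with missing months.
raw = [100.0, None, 104.0, 103.0, None, None, 109.0]
filled = impute_linear(raw)
smoothed = smooth_trailing(filled, window=3)
```

A trailing window is used here so that a smoothed value never depends on future observations, which matters when the series later feeds a forecast model.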
Step 1: Fit and generate short-run forecasts
In the short-run time horizon, HouseCanary uses non-parametric machine learning-based models to (1) bring price indices to current (housing data can often be delayed up to six months), and (2) forecast the near-term price trajectory over the next quarter. These models can accommodate very large numbers of potential inputs and do not require variable selection. Hundreds of localized time series, and/or transformations thereof, are sent into the short-run models.
The short-run models function by building up an ensemble of tens of thousands of simpler models. Each simple model is randomly generated by selecting a subset of inputs from the full set of all possible inputs. This process is repeated enough times to ensure that each possible input is considered multiple times. For each randomly generated model, a portion of the historical HPI is held out of the sample. This holdout is used to measure the predictive potential of the randomly created model.
The predictive performance of each simple model is measured by comparing how well it was able to predict the out-of-sample HPI observations. Models found to have the highest relative predictive performance are kept as part of the final ensemble model. Finally, the ensemble model is used to forecast HPI through the upcoming three months. The extended HPI series is then passed into the long-run models.
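The ensemble procedure described in this step can be sketched in a few lines. This is an illustration under stated assumptions, not HouseCanary's implementation: each simple model here is an ordinary least squares fit on a random subset of inputs, the holdout is the most recent 12 observations, models are scored by median absolute holdout error, and the data are synthetic. All function names and parameter values are hypothetical.

```python
import random
from statistics import median


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))  # partial pivoting
        m[c], m[p] = m[p], m[c]
        for r in range(n):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [x - f * y for x, y in zip(m[r], m[c])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_ols(rows, y, ridge=1e-6):
    """Least squares with an intercept; a tiny ridge term keeps it solvable."""
    xs = [[1.0] + list(r) for r in rows]
    k = len(xs[0])
    xtx = [[sum(x[i] * x[j] for x in xs) + (ridge if i == j else 0.0)
            for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * t for x, t in zip(xs, y)) for i in range(k)]
    return solve(xtx, xty)


def predict(coef, row):
    return coef[0] + sum(c * v for c, v in zip(coef[1:], row))


def build_ensemble(X, y, n_models=200, subset_size=2, holdout=12, n_keep=10, seed=0):
    """Fit many random-subset models, score each on held-out data, keep the best."""
    rng = random.Random(seed)
    train_X, train_y = X[:-holdout], y[:-holdout]
    test_X, test_y = X[-holdout:], y[-holdout:]
    scored = []
    for _ in range(n_models):
        cols = rng.sample(range(len(X[0])), subset_size)
        coef = fit_ols([[r[c] for c in cols] for r in train_X], train_y)
        errs = [abs(predict(coef, [r[c] for c in cols]) - t)
                for r, t in zip(test_X, test_y)]
        scored.append((median(errs), cols, coef))
    scored.sort(key=lambda s: s[0])  # lowest holdout error first
    return scored[:n_keep]


def ensemble_predict(models, row):
    """Average the predictions of the retained simple models."""
    preds = [predict(coef, [row[c] for c in cols]) for _, cols, coef in models]
    return sum(preds) / len(preds)


# Synthetic data: only inputs 0 and 1 carry signal; inputs 2-5 are noise.
data_rng = random.Random(1)
X = [[data_rng.gauss(0, 1) for _ in range(6)] for _ in range(60)]
y = [2.0 * r[0] - 1.0 * r[1] + data_rng.gauss(0, 0.1) for r in X]
models = build_ensemble(X, y)
latest_prediction = ensemble_predict(models, X[-1])
```

Random subsets let the procedure consider far more inputs than a single regression could support, and holdout scoring filters out subsets that only fit noise.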
Step 2: Fit and generate long-run forecasts
Conditional on the short-run forecasts generated in Step 1, vector error correction models (VECMs) are used for long-term modeling and for extending forecasts 36 months beyond the current month. Inputs used by the VECMs have been specifically selected, designed, and created by HouseCanary to model major inflection points (cycles) in housing markets. All series entering the long-run models go back at least 35 years. This guarantees that the long-run models have been exposed to one or more housing market cycles within the local market.
The long-run models are governed by a set of data designed to capture long-term fundamental constraints observed through multiple housing cycles.
These data series include the following:
- Smoothed price velocity
- Smoothed price acceleration
- Localized affordability index, measured as the percent of median income required to make the median payment on a 30-year fixed mortgage
- Deviation of home prices from long-term price trajectory
Furthermore, VECMs account for autocorrelation and cross-correlations among inputs, seasonality, and the long-run compound price growth rate. These series are both theoretically and empirically co-integrated.
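The affordability input listed above can be made concrete with the standard fixed-rate amortization formula. The sketch below is an assumption about the mechanics, not a published HouseCanary formula: it counts principal and interest only, assumes a 20% down payment, and divides annual income by 12. The input values are hypothetical.

```python
def monthly_payment(price, annual_rate, down_frac=0.20, years=30):
    """Principal-and-interest payment on a fixed-rate mortgage."""
    loan = price * (1.0 - down_frac)
    r = annual_rate / 12.0          # monthly interest rate
    n = years * 12                  # number of monthly payments
    if r == 0:
        return loan / n
    return loan * r / (1.0 - (1.0 + r) ** -n)


def afford_pct(median_price, median_income, annual_rate):
    """Percent of median monthly income needed for the median payment."""
    payment = monthly_payment(median_price, annual_rate)
    return 100.0 * payment / (median_income / 12.0)


# Hypothetical market: $300,000 median price, $72,000 median income, 4.5% rate.
payment = monthly_payment(300_000, 0.045)
share = afford_pct(300_000, 72_000, 0.045)
```

Under these inputs the payment is roughly $1,216 per month, or about 20% of median monthly income.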
VECM parameters are estimated through maximum likelihood estimation. VECMs are fit for every lag order between 3 and 26 months. Final parameter estimates and predictions are obtained through AICc-based model weighting across all models fit on lags 3 through 26. This model averaging greatly increases the month-to-month stability of the parameter estimates.
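AICc-based weighting is commonly done with Akaike weights, and the sketch below shows that calculation. It assumes the textbook small-sample correction and weight formula; the paper does not spell out its exact weighting scheme. The log-likelihoods, parameter counts, and forecasts are made-up illustrative numbers, and only three lag orders are shown instead of the full range of 3 through 26.

```python
from math import exp


def aicc(log_lik, k, n):
    """Small-sample corrected AIC for a model with k parameters and n observations."""
    return -2.0 * log_lik + 2.0 * k + (2.0 * k * (k + 1)) / (n - k - 1)


def akaike_weights(scores):
    """Convert AICc scores to weights that sum to one (lower score, higher weight)."""
    best = min(scores)
    raw = [exp(-0.5 * (s - best)) for s in scores]
    total = sum(raw)
    return [r / total for r in raw]


# Hypothetical fits: (lag order, log-likelihood, parameter count, 36-month forecast)
fits = [(3, -410.0, 12, 231.0), (4, -405.5, 16, 233.5), (5, -404.8, 20, 236.0)]
n_obs = 420  # 35 years of monthly data

scores = [aicc(ll, k, n_obs) for _, ll, k, _ in fits]
weights = akaike_weights(scores)
averaged_forecast = sum(w * f for w, (_, _, _, f) in zip(weights, fits))
```

Because the weights change smoothly as the scores change, the averaged forecast moves less from month to month than a forecast taken from whichever single lag order happens to score best.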
Machine Learning Primer
Machine learning applications for real estate forecasting
Machine learning algorithms arrive at a solution by learning patterns from large amounts of data in order to make predictions. They seek to minimize the error from predictions by using information from many different inputs. The computer is presented with inputs and outputs. The goal is to learn a general rule, specified by the machine learning algorithm, which maps the inputs to the outputs very accurately. The algorithmic process is iterative and often performs many iterations of error minimization in order to produce a robust and highly accurate prediction.
The patterns that exist in the data are not subject to the same parametric model restrictions as classic statistical models. The learning in the algorithm uncovers interactions among many different variables that cannot be functionally defined in traditional statistical models. The algorithms are naturally suited to capture both high-order relationships with the output variable and interactions among the inputs. Lastly, it is possible to consider many more inputs than actual observations. This means that a model may potentially contain thousands of inputs in order to achieve the most accurate prediction possible.
So what does all this mean for real estate forecasting? If predictive signals within the input dataset truly exist, these methods will find them with enough iterations. It is worth pointing out that the improved predictive performance of these methods comes at the expense of the additional computational time required.
Home Price Indices
A Home Price Index (HPI) is a tool that measures changes in single-family home prices across a designated market. These tools can show you areas where home values are increasing or decreasing so you can estimate prices.
A Home Price Index Forecast can be used to create new investment insights resulting in alpha and can be used to manage exposure to market risk.
HouseCanary Home Price Indices are developed at three levels:
- National
- Metropolitan statistical area (MSA) — Covers all 381 MSAs in the US
- ZIP code — Covers 18,200+ US ZIP codes that have population equal to or above 1,000
For the HPIs, HouseCanary has constructed the indices to include the following features:
- 40+ years (1975–)
- Monthly updates
- No lag — includes the most recent month’s data
- Three-year monthly forecasts
Local Market Indices
HouseCanary’s MSA-level home price indices highlight the difference between all MSAs historically and into the future. HouseCanary focuses on local HPIs at the ZIP code level because we have noted that, on average, there is a 7% to 10% price difference between high-performing and low-performing markets within an MSA. Within each MSA, HouseCanary statistically derives market grade clusters as A, B, C, D, and E. Market grade clusters are built using a vast cross section of variables, including median home price, school scores, crime level, owner/rental ratio, and commutability. These clusters are a proxy for the strength and volatility of the submarkets, where A is the least volatile and E is the most volatile through market cycles.
A submarkets typically comprise households with the highest incomes, greatest wealth, and highest home prices, which experience the least amount of household growth. B and C submarkets represent a large portion of household growth and are primarily composed of families; they often offer the best schools at the most affordable prices. D and E submarkets are often home to the youngest residents with lower incomes and represent the bulk of households. These areas tend to see greater volatility in pricing. Knowledge of the market clusters and their associated behaviors is key in determining investment decisions and risk management. Each market cluster demonstrates different investment returns at various stages of the cycle, with A markets often being the first to grow and last to fall, and the D and E markets being last to rise, while growing the most. Investment strategy can be designed to win in each market cluster.
HouseCanary is able to build price indices for sparsely transacted ZIP codes through machine learning models. These models identify other ZIP codes in the same market with similar characteristics as the target ZIP code and a rich history of data. Once identified, data from these data-rich ZIP codes is used to supplement the construction of the price index in the sparsely transacted ZIP code.
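The idea of borrowing data from similar ZIP codes can be illustrated with a nearest-neighbor search. The paper does not say which similarity model is used, so standardized Euclidean distance is a stand-in technique here, and the ZIP labels, features, and values are hypothetical.

```python
from math import sqrt
from statistics import mean, pstdev


def standardize(features):
    """Z-score each feature column so no single unit dominates the distance."""
    cols = list(zip(*features.values()))
    mu = [mean(c) for c in cols]
    sd = [pstdev(c) or 1.0 for c in cols]   # guard against constant columns
    return {z: [(v - m) / s for v, m, s in zip(vals, mu, sd)]
            for z, vals in features.items()}


def nearest_donors(target, features, data_rich, k=2):
    """Return the k data-rich ZIP codes closest to the target in feature space."""
    z = standardize(features)
    dist = {d: sqrt(sum((a - b) ** 2 for a, b in zip(z[target], z[d])))
            for d in data_rich}
    return sorted(dist, key=dist.get)[:k]


# Hypothetical features per ZIP: [median price in $1,000s, school score]
features = {
    "zip_target": [250.0, 6.0],
    "zip_a": [260.0, 6.2],
    "zip_b": [700.0, 9.0],
    "zip_c": [245.0, 5.9],
    "zip_d": [900.0, 9.5],
}
donors = nearest_donors("zip_target", features, ["zip_a", "zip_b", "zip_c", "zip_d"])
```

Transactions or index movements from the selected donors could then be pooled with the target's own sparse data when constructing its index.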
40+ years of local time series data, generated monthly
Model forecasts are generated and distributed monthly. Depending on the license, model outputs are available via the following:
- Bulk file (example MSA and ZIP code files are included)
- Interactive, web-based reporting on https://hcpro.housecanary.com/
- Static PDF reports, downloadable from https://hcpro.housecanary.com/
At the national, MSA, and ZIP code levels, HouseCanary generates and extracts the quantities described below from the forecast models for each market.
All model-derived outputs below, except for the risk of downturn index labeled “risk,” are forecast out 36 months. The risk index is only available through the current month at the MSA level. Geographic coverage includes 381 MSAs and all populated ZIP codes within those MSAs. To be considered a populated ZIP code, there must be at least 200 housing units and a population greater than 1,000 within the ZIP code. HPI is not seasonally adjusted. HPI is a composite index across all residential property types.
‘hpi_value’ — Nominal housing price index
‘hpi_yoy_pct_chg’ — Year-over-year percent change in the nominal housing price index; formally computed as [hpi_value(t)/hpi_value(t-12)]-1
‘hpi_distance’ — The normalized distance of hpi_value from a long-term linear trend; units are in standard deviations from the mean
‘hpi_returns’ — Monthly returns in the nominal housing price index; formally computed as hpi_value(t)/hpi_value(t-1)
‘hpi_real’ — Real housing price index after adjusting nominal HPI for inflation as measured by the CPI
‘hpi_trend’ — Long-term linear trend in hpi_value
‘afford_detrended’ — Normalized distance of afford_pmt from a long-term linear trend; units are in standard deviations from the mean
‘afford_pmt’ — Raw affordability value; represents the percent of median household income required to make the median home payment on a 30-year fixed-rate mortgage with 20% down
‘acceleration_value’ — Monthly change in velocity_value
‘velocity_value’ — Smoothed year-over-year change in hpi_value
‘risk’ — A model-derived metric providing the probability that hpi_value is lower 12 months from now than its current value; formally, risk = Probability[ hpi_value(t+12) < hpi_value(t) ]
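Several of the HPI-derived outputs defined above follow directly from the index values, as the sketch below illustrates. The year-over-year and monthly return formulas are taken from the definitions above. The trend fit (ordinary least squares on a time index) and the normalization (population standard deviation of the residuals) are assumptions, since the paper does not specify them. The sample index is synthetic.

```python
from statistics import mean, pstdev


def hpi_yoy_pct_chg(hpi, t):
    """[hpi_value(t) / hpi_value(t-12)] - 1"""
    if t < 12:
        raise ValueError("need at least 12 months of history before t")
    return hpi[t] / hpi[t - 12] - 1.0


def hpi_returns(hpi, t):
    """hpi_value(t) / hpi_value(t-1)"""
    if t < 1:
        raise ValueError("need at least 1 month of history before t")
    return hpi[t] / hpi[t - 1]


def hpi_trend(hpi):
    """Fitted values of a least squares line through the index over time."""
    ts = list(range(len(hpi)))
    t_bar, y_bar = mean(ts), mean(hpi)
    slope = (sum((t - t_bar) * (y - y_bar) for t, y in zip(ts, hpi))
             / sum((t - t_bar) ** 2 for t in ts))
    return [y_bar + slope * (t - t_bar) for t in ts]


def hpi_distance(hpi):
    """Deviation from the linear trend, in standard deviations."""
    resid = [y - f for y, f in zip(hpi, hpi_trend(hpi))]
    mu, sd = mean(resid), pstdev(resid)
    return [(r - mu) / sd for r in resid]


# Synthetic index growing 0.5% per month for three years.
hpi = [100.0 * 1.005 ** t for t in range(37)]
yoy = hpi_yoy_pct_chg(hpi, 36)
ret = hpi_returns(hpi, 36)
dist = hpi_distance(hpi)
```

Because compound growth is convex, this synthetic index sits above its linear trend at both ends and below it in the middle, which is what the sign of `dist` shows.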
Testing and Validation
Continuous internal testing
HouseCanary forecast models are back-tested for accuracy through time. The principal accuracy measure is median absolute percentage error (MdAPE) over the 3-, 12-, 24-, and 36-month forecast horizons. Current MdAPE by MSA is provided in the included ‘mdape_msa_all.xlsx’ workbook. Numbers in this workbook represent aggregate back-tested results when letting the system run over the past 20 years. As of May 31, 2018, HouseCanary’s internal testing over the previous 240 months on the national HPI yielded a 3-, 12-, 24-, and 36-month forward MdAPE of 0.2%, 0.8%, 4.0%, and 7.8%, respectively.
Column names from the workbook are defined below:
- ‘err.model.3’ – Median 3-month forward absolute forecast error when tested over the previous rolling 240 months
- ‘err.model.12’ – Median 12-month forward absolute forecast error when tested over the previous rolling 240 months
- ‘err.model.24’ – Median 24-month forward absolute forecast error when tested over the previous rolling 240 months
- ‘err.model.36’ – Median 36-month forward absolute forecast error when tested over the previous rolling 240 months
Model testing is performed as follows. First, for a given market, cut all data as of T months ago. Apply estimated model equations to produce a forecast for the period T+1 through T+36. At time periods T+3, T+12, T+24, and T+36, compare the estimated index values to the actual known index values. If, for example, at period T, the actual T+12 index value is 120, and the equations resulted in a T+12 forecast index value of 118, then the 12-month forward absolute error observation for period T equals abs((118/120)-1)=1.67%. We repeat this process and collect a measurement for every month over the previous 240 months. Finally, the median value of the 240 monthly absolute error observations makes up the MdAPE for that market at the defined forecast time horizon.
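The testing procedure above can be expressed as a short rolling-origin backtest. This is a sketch: the `carry_forward` forecaster is a hypothetical placeholder for the real model equations, and the history is synthetic. Only the error formula and the median aggregation come from the text.

```python
from statistics import median


def ape(forecast, actual):
    """Absolute percentage error: abs(forecast / actual - 1)."""
    return abs(forecast / actual - 1.0)


def backtest_mdape(history, forecaster, horizon, n_origins):
    """Median APE at one horizon over the most recent n_origins cut dates."""
    last_origin = len(history) - 1 - horizon
    first_origin = last_origin - n_origins + 1
    if first_origin < 0:
        raise ValueError("not enough history for the requested backtest")
    errors = []
    for origin in range(first_origin, last_origin + 1):
        known = history[: origin + 1]          # cut all data as of the origin
        path = forecaster(known, horizon)      # forecasts for origin+1 .. origin+horizon
        errors.append(ape(path[-1], history[origin + horizon]))
    return median(errors)


def carry_forward(known, horizon):
    """Placeholder model: repeat the last observed index value."""
    return [known[-1]] * horizon


# Worked example from the text: forecast 118 against an actual of 120.
example_pct = round(100.0 * ape(118.0, 120.0), 2)   # 1.67

# Synthetic index growing 1% per month, tested at the 12-month horizon.
history = [100.0 * 1.01 ** t for t in range(60)]
mdape_12 = backtest_mdape(history, carry_forward, horizon=12, n_origins=24)
```

Using the median across cut dates keeps a handful of unusually large misses, such as those around a market turn, from dominating the reported error.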
Please contact us with any questions or comments at [email protected].