Author: Linsi Lin
Date: 08/02/2021





Data is available at Zillow Research
https://www.zillow.com/research/data/
Data Type: ZHVI 3-Bedroom Time Series ($)
Geography: State


options(warn=-1)
#LIBRARIES
library(fpp3)
## -- Attaching packages -------------------------------------------- fpp3 0.4.0 --
## v tibble      3.1.2      v tsibble     1.0.1 
## v dplyr       1.0.7      v tsibbledata 0.3.0 
## v tidyr       1.1.3      v feasts      0.2.2 
## v lubridate   1.7.10     v fable       0.3.1 
## v ggplot2     3.3.5
## -- Conflicts ------------------------------------------------- fpp3_conflicts --
## x lubridate::date()    masks base::date()
## x dplyr::filter()      masks stats::filter()
## x tsibble::intersect() masks base::intersect()
## x tsibble::interval()  masks lubridate::interval()
## x dplyr::lag()         masks stats::lag()
## x tsibble::setdiff()   masks base::setdiff()
## x tsibble::union()     masks base::union()
library(ggplot2)
library(forecast)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
library(dplyr)
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v readr   1.4.0     v stringr 1.4.0
## v purrr   0.3.4     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x lubridate::as.difftime() masks base::as.difftime()
## x lubridate::date()        masks base::date()
## x dplyr::filter()          masks stats::filter()
## x tsibble::intersect()     masks lubridate::intersect(), base::intersect()
## x tsibble::interval()      masks lubridate::interval()
## x dplyr::lag()             masks stats::lag()
## x tsibble::setdiff()       masks lubridate::setdiff(), base::setdiff()
## x tsibble::union()         masks lubridate::union(), base::union()
library(tseries)
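The data frame df used in the transformation below is read from the downloaded Zillow CSV with readr; the chunk is reproduced here as a sketch (the file name is a placeholder for whichever file you download from the Zillow Research link above), and it is this call that prints the column specification shown next.

#Read the Zillow ZHVI 3-bedroom CSV into df
#"zhvi_3bedroom.csv" is a placeholder file name for the Zillow Research download
df <- read_csv("zhvi_3bedroom.csv")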
## 
## -- Column specification --------------------------------------------------------
## cols(
##   .default = col_double(),
##   RegionName = col_character(),
##   RegionType = col_character(),
##   StateName = col_character(),
##   State = col_character(),
##   Metro = col_character()
## )
## i Use `spec()` for the full column specifications.
#Change into time series data
sf_3bed_timeseries <- df %>%
  filter(RegionID == "3227") %>%
  select(-c(9:(ncol(df) - 315))) %>%
  pivot_longer(
    cols = 1:ncol(.),
    names_to = "date",
    values_to = "housing_price"
  ) %>%
  mutate(Month = yearmonth(date)) %>%
  select(Month, housing_price) %>%
  as_tsibble(index = Month)
head(sf_3bed_timeseries)
## # A tsibble: 6 x 2 [1M]
##      Month housing_price
##      <mth>         <dbl>
## 1 1996 Jan        328776
## 2 1996 Feb        327698
## 3 1996 Mar        326951
## 4 1996 Apr        325638
## 5 1996 May        324615
## 6 1996 Jun        324038
sum(is.na(sf_3bed_timeseries))
## [1] 0

There are no missing values.

Time series graphics

autoplot(sf_3bed_timeseries, housing_price) +
  labs(title = "San Francisco 3-Bedroom Housing Price Time Series",
       y = "Housing price ($)")

There is a clear increasing trend until the 2008 financial crisis; starting around 2012, the 3-bedroom housing price picks up again.
We can also observe cyclic behaviour in the dips, which do not occur at a fixed frequency; for example, there are dips around 2017, 2019, and 2021.
No strong seasonal pattern is observed.

ACF plot

sf_3bed_timeseries %>% ACF(housing_price) %>% autoplot()

The autocorrelations for the first 24 lags are large and positive, which shows that the data has a trend.
There is no "scalloped" shape in the autocorrelations, which indicates no seasonality in the data.
The data appears non-stationary: the ACF of non-stationary data is typically large, positive, and decreases slowly.
We will apply more formal tests for stationarity later.

Lag plots

sf_3bed_timeseries %>%
  gg_lag(housing_price, geom = "point") +
  labs(x = "lag(housing_price, k)")

This plot displays scatterplots of the San Francisco 3-bedroom housing price series against lagged values of itself, with the lagged values on the horizontal axis (e.g. y_t vs y_(t-1), y_t vs y_(t-2), ...).
The colours indicate the month of the variable on the vertical axis. The strong positive relationships in all panels reflect the strong increasing trend in the data.

STL decomposition

We split the time series into several components, namely a trend-cycle component, a seasonal component, and a remainder component, each representing an underlying pattern category.

dcmp <- sf_3bed_timeseries %>%
  model(stl = STL(housing_price))
components(dcmp)
## # A dable: 306 x 7 [1M]
## # Key:     .model [1]
## # :        housing_price = trend + season_year + remainder
##    .model    Month housing_price   trend season_year remainder season_adjust
##    <chr>     <mth>         <dbl>   <dbl>       <dbl>     <dbl>         <dbl>
##  1 stl    1996 Jan        328776 324227.       -89.4     4639.       328865.
##  2 stl    1996 Feb        327698 324680.      -132.      3150.       327830.
##  3 stl    1996 Mar        326951 325133.      -122.      1939.       327073.
##  4 stl    1996 Apr        325638 325587.      -202.       253.       325840.
##  5 stl    1996 May        324615 326188.      -405.     -1168.       325020.
##  6 stl    1996 Jun        324038 326790.      -429.     -2323.       324467.
##  7 stl    1996 Jul        323573 327391.      -313.     -3505.       323886.
##  8 stl    1996 Aug        324088 328095.       -84.3    -3922.       324172.
##  9 stl    1996 Sep        324904 328798.       165.     -4059.       324739.
## 10 stl    1996 Oct        326422 329501.       477.     -3557.       325945.
## # ... with 296 more rows
components(dcmp) %>% autoplot()

The grey bars to the left of each panel show the relative scales of the components. Each grey bar represents the same absolute length, but because the panels are on different scales, the bars appear different sizes. The large grey bar in the seasonal panel shows that the variation in the seasonal component is the smallest relative to the variation in the data.
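Because the seasonal component is so small, the seasonally adjusted series should look almost identical to the raw data. A sketch of how to check this using the decomposition's season_adjust column (not part of the original output):

#Sketch: overlay the seasonally adjusted series on the original data
sf_3bed_timeseries %>%
  autoplot(housing_price, colour = "gray") +
  autolayer(components(dcmp), season_adjust, colour = "#0072B2") +
  labs(y = "Housing price ($)",
       title = "San Francisco 3-bedroom housing price (seasonally adjusted)")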

Stationarity

adf.test(sf_3bed_timeseries$housing_price)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  sf_3bed_timeseries$housing_price
## Dickey-Fuller = -1.4211, Lag order = 6, p-value = 0.8204
## alternative hypothesis: stationary

Test stationarity of the time series (ADF):
Null hypothesis: the time series is non-stationary.
Alternative hypothesis: the time series is stationary.

The p-value is 0.8204, which is greater than 0.05, so we fail to reject the null hypothesis and conclude that our time series is non-stationary and requires differencing. We can also confirm this from the time series plot.

We can also use a unit root test to determine more objectively whether differencing is required.

A number of unit root tests are available, which are based on different assumptions and may lead to conflicting answers. In our analysis, we use the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test.
In this test, the null hypothesis is that the data is stationary.

sf_3bed_timeseries %>%
  features(housing_price, unitroot_kpss)
## # A tibble: 1 x 2
##   kpss_stat kpss_pvalue
##       <dbl>       <dbl>
## 1      4.53        0.01

Since the reported p-value is 0.01 (the smallest value the test reports, so the true p-value is at most 0.01), we reject the null hypothesis and again conclude that the data is non-stationary.

Differencing

sf_3bed_timeseries %>%
  features(housing_price, unitroot_ndiffs)
## # A tibble: 1 x 1
##   ndiffs
##    <int>
## 1      1

Differencing helps to stabilize the mean.

First differences are the change between one observation and the next.

This process of using a sequence of KPSS tests to determine the appropriate number of first differences is carried out using the unitroot_ndiffs() feature.

Since the value of ndiffs is 1, we difference the series once.
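As a quick illustration of what difference() computes (the first difference y_t - y_(t-1), with an NA in the first position), here is a toy example that is not part of the analysis:

#Toy example: difference() returns y_t - y_(t-1), with NA for the first value
y <- c(100, 103, 101, 106)
tsibble::difference(y)
## [1] NA  3 -2  5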

Do we need seasonal differencing?

sf_3bed_timeseries %>%
  features(housing_price,unitroot_nsdiffs)
## # A tibble: 1 x 1
##   nsdiffs
##     <int>
## 1       0

We don’t need seasonal differencing.

sf_3bed_timeseries_after_diff <- sf_3bed_timeseries %>%
  mutate(housing_price_diff = difference(housing_price)) %>%
  select(Month, housing_price_diff)

Once we've differenced, we've effectively removed the trend from our data.

head(sf_3bed_timeseries_after_diff)
## # A tsibble: 6 x 2 [1M]
##      Month housing_price_diff
##      <mth>              <dbl>
## 1 1996 Jan                 NA
## 2 1996 Feb              -1078
## 3 1996 Mar               -747
## 4 1996 Apr              -1313
## 5 1996 May              -1023
## 6 1996 Jun               -577
#Plot the first-differenced series
sf_3bed_timeseries_after_diff %>% autoplot(housing_price_diff)

The variance of the differenced series still appears to grow over time, i.e., it looks multiplicative.
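If we wanted to stabilize that variance, a Box-Cox transformation of the original (undifferenced) series could be applied before differencing. A sketch using the Guerrero method to pick lambda; this transformation is not carried forward in the rest of the analysis:

#Sketch: choose a Box-Cox lambda for the original series (Guerrero method)
lambda <- sf_3bed_timeseries %>%
  features(housing_price, features = guerrero) %>%
  pull(lambda_guerrero)
sf_3bed_timeseries %>%
  autoplot(box_cox(housing_price, lambda)) +
  labs(y = "Box-Cox transformed housing price")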

sf_3bed_timeseries_after_diff %>% gg_tsdisplay(housing_price_diff, plot_type = 'partial')

sf_3bed_timeseries_after_diff %>%
  features(housing_price_diff, unitroot_kpss)
## # A tibble: 1 x 2
##   kpss_stat kpss_pvalue
##       <dbl>       <dbl>
## 1     0.231         0.1

Since the reported p-value is 0.1 (the largest value the test reports, so the true p-value is at least 0.1), we fail to reject the null hypothesis at the 5% significance level and conclude that the differenced data is stationary.

Train/Test Split

train <- sf_3bed_timeseries_after_diff %>%
  filter(year(Month) <= 2017)

test <- sf_3bed_timeseries_after_diff %>%
  filter(year(Month) > 2017)

#Train set in red, test set in black
autoplot(sf_3bed_timeseries_after_diff, housing_price_diff) +
  autolayer(train, housing_price_diff, colour = "red")

Benchmark Methods
Fit all the appropriate benchmark methods to the training set and forecast the periods covered by the test set.

fit <- train %>%
  model(
    naive = NAIVE(housing_price_diff),
    drift = RW(housing_price_diff ~ drift()),
    mean = MEAN(housing_price_diff),
    snaive = SNAIVE(housing_price_diff)
  )
fc <- fit %>% forecast(h="42 months")
#Compute the accuracy of forecasts
fc %>%
  accuracy(sf_3bed_timeseries_after_diff) 
## # A tibble: 4 x 10
##   .model .type     ME   RMSE    MAE   MPE  MAPE  MASE RMSSE  ACF1
##   <chr>  <chr>  <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 drift  Test  -7584. 13104. 11069. 243.   403. 1.29  1.18  0.647
## 2 mean   Test   -484. 10613.  8443. 158.   188. 0.986 0.956 0.638
## 3 naive  Test  -6618. 12498. 10427. 235.   374. 1.22  1.13  0.638
## 4 snaive Test  -5252. 14967. 12239. -15.6  266. 1.43  1.35  0.705

The mean method has the lowest RMSE (10613), so we use it as the benchmark and compare the other models' RMSE against it.

#Plot the benchmark forecasts
fc %>%
  autoplot(
    sf_3bed_timeseries_after_diff,
    level = NULL
  ) +
  labs(
    y = "Change in housing price ($)",
    title = "Benchmark forecasts of the differenced San Francisco 3-bedroom housing price"
  ) +
  guides(colour = guide_legend(title = "Forecast"))

Do the residuals from the best method (mean method) resemble white noise?

fit %>% select(mean) %>% gg_tsresiduals()
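Beyond the residual plots, a Ljung-Box portmanteau test gives a formal check of whether the mean method's residuals are white noise; a sketch (its output is not reproduced here):

#Sketch: formal white-noise check for the mean model's residuals
augment(fit) %>%
  filter(.model == "mean") %>%
  features(.innov, ljung_box, lag = 10)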

Exponential Smoothing

#Because of first differencing, there is one NA in the first row of data. Remove the missing value and apply ETS.
train_ets <- na.omit(train) 

fit_AUTO_ETS <- train_ets %>%
  model(ETS(housing_price_diff))
report(fit_AUTO_ETS)
## Series: housing_price_diff 
## Model: ETS(A,N,N) 
##   Smoothing parameters:
##     alpha = 0.9129703 
## 
##   Initial states:
##       l[0]
##  -78.24595
## 
##   sigma^2:  21139634
## 
##      AIC     AICc      BIC 
## 5905.401 5905.493 5916.117

The automatically selected model is ETS(A,N,N): additive error, no trend (the data was differenced beforehand to remove the trend, which is why the trend component is "N"), and no seasonality.
The estimated smoothing parameter α controls the flexibility of the level, i.e., how quickly the level adapts to new observations.
The closer α is to one, the "rougher" the estimate of the level (larger adjustments take place).
In this case, α ≈ 0.91, meaning the level learns quickly from the month-to-month movements.
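To make the role of α concrete: in ETS(A,N,N) the level is updated as l_t = α*y_t + (1-α)*l_(t-1), so with α near 1 the level largely tracks the most recent observation. Below is a minimal sketch of that recursion using values close to those reported above; the actual fitted model comes from ETS().

#Sketch: simple exponential smoothing level recursion l_t = alpha*y_t + (1-alpha)*l_(t-1)
#alpha and l0 are approximately the values reported above, not taken from the fit object
ses_level <- function(y, alpha = 0.913, l0 = -78.2) {
  level <- numeric(length(y))
  prev <- l0
  for (t in seq_along(y)) {
    level[t] <- alpha * y[t] + (1 - alpha) * prev
    prev <- level[t]
  }
  level
}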

#Check residuals
fit_AUTO_ETS %>% gg_tsresiduals()

#Forecast plot
fit_AUTO_ETS %>% 
  forecast(h="42 months") %>%
  autoplot(sf_3bed_timeseries_after_diff) +
  ylab(" Housing Price ($)") + xlab("Month")

#Get test error
fc_ETS <- fit_AUTO_ETS %>% forecast(h="42 months")
accuracy(fc_ETS, test)
## # A tibble: 1 x 10
##   .model                .type     ME   RMSE    MAE   MPE  MAPE  MASE RMSSE  ACF1
##   <chr>                 <chr>  <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ETS(housing_price_di~ Test  -6647. 12513. 10442.  236.  375.   NaN   NaN 0.638

RMSE = 12513, which is worse than the mean benchmark (RMSE = 10613). (MASE and RMSSE are NaN here because accuracy() was given only the test set, so there is no training data to compute their scaling.)

ARIMA Model

arima_autofit <- train %>%
  model(ARIMA(housing_price_diff))
report(arima_autofit)
## Series: housing_price_diff 
## Model: ARIMA(1,0,0)(2,0,0)[12] w/ mean 
## 
## Coefficients:
##          ar1     sar1     sar2  constant
##       0.9510  -0.7287  -0.4713  507.4407
## s.e.  0.0192   0.0579   0.0616  207.5106
## 
## sigma^2 estimated as 12346473:  log likelihood=-2524.36
## AIC=5058.73   AICc=5058.96   BIC=5076.61
#Check residuals
arima_autofit %>% gg_tsresiduals()

The residual plots reveal the following features:
1. The time plot of the residuals shows that the mean of the residuals is not close to zero, and the variance is not constant: the variation of the residuals changes across the historical data.
2. Many autocorrelations fall outside the blue dashed bounds.
3. The histogram suggests that the residuals are not normally distributed: although the distribution is centred around the mean, both tails are long.
We can therefore conclude that the residuals are correlated and not normally distributed.

augment(arima_autofit) %>%
  features(.resid, ljung_box, lag = 10, dof = 4)
## # A tibble: 1 x 3
##   .model                    lb_stat     lb_pvalue
##   <chr>                       <dbl>         <dbl>
## 1 ARIMA(housing_price_diff)    52.3 0.00000000162

A portmanteau test returns a small p-value, suggesting that the residuals are not white noise.

#Forecast plot
arima_autofit %>% 
  forecast(h="42 months") %>%
  autoplot(sf_3bed_timeseries_after_diff) +
  ylab(" Housing Price ($)") + xlab("Month")

#Get test error
fc_arima <- arima_autofit %>% forecast(h="42 months")
accuracy(fc_arima, test)
## # A tibble: 1 x 10
##   .model                  .type     ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
##   <chr>                   <chr>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ARIMA(housing_price_di~ Test  -2467. 9509. 7065.  129.  171.   NaN   NaN 0.584

The RMSE for the ARIMA model is 9509, lower than both the best benchmark (the mean method, RMSE 10613) and exponential smoothing (RMSE 12513). So far, ARIMA is the best model for this time series, but there is still pattern in the residuals that could be further extracted.
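One possible next step would be to let ARIMA() search more exhaustively rather than using its default stepwise shortcut; a sketch (this wider search is slower and was not run here):

#Sketch: a more exhaustive ARIMA search on the differenced training data
arima_full_search <- train %>%
  model(ARIMA(housing_price_diff, stepwise = FALSE, approximation = FALSE))
report(arima_full_search)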

Finally, generate forecasts for the next twelve months of San Francisco 3-bedroom housing prices by applying auto.arima() from the forecast package to the original (undifferenced) series.

sf_3bed_timeseries %>% auto.arima() %>% forecast(h=12) 
##          Point Forecast   Lo 80   Hi 80   Lo 95   Hi 95
## Jul 2021        1724495 1719820 1729170 1717345 1731644
## Aug 2021        1738482 1728441 1748522 1723126 1753837
## Sep 2021        1750180 1733391 1766969 1724503 1775857
## Oct 2021        1763861 1740640 1787083 1728347 1799376
## Nov 2021        1774769 1744872 1804665 1729046 1820491
## Dec 2021        1783649 1747041 1820257 1727662 1839636
## Jan 2022        1792104 1748278 1835929 1725079 1859129
## Feb 2022        1799907 1748556 1851259 1721372 1878442
## Mar 2022        1806016 1746752 1865279 1715380 1896651
## Apr 2022        1808910 1741565 1876256 1705914 1911907
## May 2022        1810498 1734859 1886137 1694818 1926178
## Jun 2022        1811131 1727072 1895189 1682574 1939687
sf_3bed_timeseries %>% auto.arima() %>% forecast(h = 12) %>% autoplot()