Case-Shiller from 3 months into the future

Wednesday, February 6, 2008

Constructing an HPI

Like the Case-Shiller and OFHEO HPIs, this index is based on repeat sales of the same properties. I restrict my attention to detached single-family houses and exclude condos. (Condo prices may have very different dynamics from SFRs in the same area, and it's less work to do only SFRs. Also, Case-Shiller only includes single-family housing, and I like to be able to check against their numbers.)

First we need some resale data. MLS has everything we want, but one needs to be a realtor to access it directly. (Any friendly realtors out there?) Fortunately for San Diego County, there's a site with a lot of sale records going back to 1997:

It isn't perfect (incomplete, lots of misspellings), but it's better than nothing. We supplement it with another site where we can get recent closings:

We can write a script that pulls web pages from those two sites and parses them to extract resale entries. The extracted data needs a little scrubbing: for example, the same property might appear as "123 10th st", "123 tenth st" and "123 tenth street".
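The scrubbing step might look something like this: lowercase the address and run each token through a substitution table. The table entries below are only illustrative; a real one would need many more.

```cpp
#include <map>
#include <sstream>
#include <string>
#include <cctype>

// Normalize a street address so that "123 10th St" and "123 Tenth Street"
// map to the same key. The substitution table here is a tiny illustrative
// sample, not the actual one used.
std::string normalizeAddress(const std::string& addr)
{
    static const std::map<std::string, std::string> repl = {
        {"street", "st"}, {"avenue", "ave"}, {"drive", "dr"},
        {"first", "1st"}, {"second", "2nd"}, {"third", "3rd"},
        {"tenth", "10th"}
    };
    std::string lower;
    for(char c : addr)
        lower += (char)std::tolower((unsigned char)c);

    std::istringstream in(lower);
    std::string token, out;
    while(in >> token)
    {
        auto it = repl.find(token);
        if(it != repl.end()) token = it->second;
        if(!out.empty()) out += ' ';
        out += token;
    }
    return out;
}
```

With this, all three spellings of the example address collapse to the same key, so their sales can be matched into resale pairs.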

The next step is to identify resale pairs. For every unique property with two or more recorded sales, we sort the entries by date and pair up consecutive sales. Our data probably still contains a lot of junk (incorrectly entered sale prices, non-arms-length transactions), so it is a good idea to apply some filters. We also want to exclude "flips" and any other resale pairs involving a substantial change in the condition of the property. A possible set of filters:

* Exclude all resale pairs separated by less than 6 months
* Exclude all resale pairs where either of the two sale prices is absurdly low (<$2,000) or extremely high (>$10,000,000)
* Exclude all resale pairs with price change of more than 2x in either direction in less than 12 months
* Exclude all resale pairs with an average annual appreciation/depreciation rate of more than 50%/year

The specific choice of filters won't significantly affect the outcome, but proper filters will reduce the noise level in our HPI.
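The filter list above can be sketched as a single predicate. The function name and signature are mine; the thresholds are the ones listed in the bullets.

```cpp
#include <cmath>
#include <cstdlib>

// Returns true if a resale pair should be kept. Months are counted from the
// start of the index period; prices are in dollars. The thresholds follow
// the filter list above; the function itself is just an illustration.
bool keepPair(int month1, int month2, double price1, double price2)
{
    int gap = std::abs(month2 - month1);
    if(gap < 6) return false;                                // sales < 6 months apart
    if(price1 < 2000 || price2 < 2000) return false;         // absurdly low price
    if(price1 > 10000000 || price2 > 10000000) return false; // extremely high price

    double ratio = price2 / price1;
    if(gap < 12 && (ratio > 2.0 || ratio < 0.5))             // >2x move in <12 months
        return false;

    // average annual appreciation/depreciation of more than 50%/year
    double annualRate = std::pow(ratio, 12.0 / gap);
    if(annualRate > 1.5 || annualRate < 1.0 / 1.5)
        return false;

    return true;
}
```

Each pair is tested once, before it enters the least-squares fit below.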

Resale pairs are grouped into areas. The idea is that homes in the same area will experience similar rates of appreciation and depreciation. Running the algorithm on individual zip codes typically produces too much noise because there isn't enough data; the more data we have, the better the quality of the final HPI. This step requires some knowledge of the different neighborhoods. For example, it makes sense to put zip codes 91913 and 91915 into the same group, but neighboring 91911 has very different demographics and housing stock, and its prices may evolve differently.
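A grouping like this is just a lookup table from zip code to area. The assignments below only follow the 91913/91915 vs. 91911 example from the text, and the area names are made up for illustration.

```cpp
#include <map>
#include <string>

// Hypothetical grouping table: zip code -> area name.
// 91913 and 91915 share an area; 91911 is kept separate, per the text.
std::map<std::string, std::string> buildAreaMap()
{
    std::map<std::string, std::string> area;
    area["91913"] = "AreaA";
    area["91915"] = "AreaA";
    area["91911"] = "AreaB";
    return area;
}
```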

Finally we take resale pairs in each area and try to approximate them with a single index using a modification of the weighted least squares algorithm.
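In symbols, this is the standard repeat-sales least-squares fit (a sketch; the extra smoothing "copies" used in the code below are omitted):

```latex
% s_t : log HPI in month t; pair i sold in months t_1(i) < t_2(i)
% at prices p_1(i), p_2(i), with weight w_i
\min_{s_0,\dots,s_{N-1}} \; \sum_i w_i
  \left( s_{t_2(i)} - s_{t_1(i)} - \log\frac{p_2(i)}{p_1(i)} \right)^2 ,
\qquad s_N = 0 \ \text{(normalization)} .
```

Setting the derivatives with respect to each s_t to zero gives the linear system that the snippet below assembles and solves.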

I might expand on this later; for now, here's a code snippet:

// "vn1" and "vn2" - dates of first and second sale
// "price1" and "price2" - recorded prices
// "nmax" - the number of months for which we construct the index (if we're working with 1999-2007, nmax=108)
// "sol" - logarithm of our final HPI
int N = nmax-1;
double* matrix = new double[N*N];
double* rh = new double[N];
double* sol = new double[N+1];
sol[N] = 0;

memset(matrix, 0, N*N*sizeof(double));
memset(rh, 0, N*sizeof(double));

for(i=0; i<nPairs; i++)
 double weight = 1;
 // use lower weights when sales are separated by long periods of time
 weight *= 1 - 0.4 * abs(vn2[i]-vn1[i]) / 120.;

 double ratio = log((double)prices2[i] / (double)prices1[i]);
 // make the index smoother by creating extra "copies" of every resale, shifted one month back and one month forward
 for(int delta=-1; delta<=1; delta++)
  j = vn1[i] + delta;
  int k = vn2[i] + delta;
  if(j<0 || k<0)
  if(j>N || k>N)
  if(vn1[i]==N-1 && j==N)
  if(vn2[i]==N-1 && k==N)
  matrix[j+j*N] += weight;
   matrix[k+j*N] -= weight;
  rh[j] -= ratio*weight;
   matrix[k+k*N] += weight;
    matrix[j+k*N] -= weight;
   rh[k] += ratio*weight;

/** Solve the system of equations **/

for(j=0; j<N; j++)
 double x = matrix[j + j * N];
 if(fabs(x) < 1e-8)
  for(i=0; i<N; i++)
   matrix[i + j*N] = 0;
  x = matrix[j + j*N] = 1;
  rh[j] = 0;
 for(i=0; i<N; i++)
  matrix[i+j*N] /= x;
 rh[j] /= x;
 for(i=j+1; i<N; i++)
  int k;
  double mul = matrix[j+i*N];
  for(k=0; k<N; k++)
   matrix[k+i*N] -= mul * matrix[k+j*N];
  rh[i] -= mul*rh[j];

for(j=N-1; j>=0; j--)
 sol[j] = rh[j] / matrix[j+j*N];
 for(i=j-1; i>=0; i--)
  rh[i] -= sol[j] * matrix[j+i*N];
  matrix[j+i*N] = 0;


Doctor said...

Esmith: Nice blog! I particularly like the charts. I am a fan of Piggington and BMIT, and I think you have something really good here. Especially the fact that you are trying to project the C/S index three months into the future.

Have you considered using a GIS to do some spatial analysis of your data? I thought of this when reading your technical comments on neighborhoods, etc.

Would you be interested in working on creating a broader economic index? My kooky ideas and your coding/math skills might be able to do some damage! If so, shoot me an email:

SB said...

I'm a realtor, and will give you all the data you need.
I'll post your charts on my website.

It's been a long time since we needed something like this!!!

SB said...

If you need to contact me,

858 449 3481

My website, San Diego coverage only, is