First we need some resale data. MLS has everything we want, but one needs to be a realtor to access it directly. (Any friendly realtors out there?) Fortunately for San Diego County, there's a site with a lot of sale records going back to 1997:

http://users.ixpres.com/~gtriphan

It isn't perfect (incomplete, lots of misspellings), but it's better than nothing. We supplement it with another site where we can get recent closings:

http://www.sdlookup.com

We can write a script that pulls web pages from those two sites and parses them to extract resale entries. Extracted data needs to be scrubbed a little bit. For example, the same property might be present in our data as "123 10th st", "123 tenth st" and "123 tenth street".

Next step is to identify resale pairs. For every unique property for which two or more resale entries are recorded, we sort them by date. Our data probably still contains a lot of junk (incorrectly entered sales prices, non-arms-length transactions), so it is a good idea to apply some filters. We also want to exclude "flips" or any other resale pairs involving a substantial change in the condition of the property. A possible set of filters:

* Exclude all resale pairs separated by less than 6 months

* Exclude all resale pairs where either one of two resale prices is absurdly low (<$2000) or extremely high (>$10,000,000)

* Exclude all resale pairs with price change of more than 2x in either direction in less than 12 months

* Exclude all resale pairs with average annual appreciation/depreciation rate of 50%/year

The specific choice of filters won't significantly affect the outcome, but proper filters will reduce the noise level in our HPI.

Resale pairs are grouped into areas. The idea is that homes in the same area will experience similar rates of appreciation and depreciation. Also, running the algorithm on individual zip codes typically results in too much noise because of insufficient amount of data. The more data we have, the better the quality of our final HPI. This step requires some knowledge of different neighborhoods. For example, it makes sense to put zip codes 91913 and 91915 into the same group, but their neighboring 91911 has very different demographics and housing stock and it may evolve differently.

Finally we take resale pairs in each area and try to approximate them with a single index using a modification of the weighted least squares algorithm.

I might expand on this later; for now, here's a code snippet:

// "vn1" and "vn2" - dates of first and second sale

// "price1" and "price2" - recorded prices

// "nmax" - the number of months for which we construct the index (if we're working with 1999-2007, nmax=108)

// "sol" - logarithm of our final HPI

int N = nmax-1;

double* matrix = new double[N*N];

double* rh = new double[N];

double* sol = new double[N+1];

sol[N] = 0;

memset(matrix, 0, N*N*sizeof(double));

memset(rh, 0, N*sizeof(double));

for(i=0; i<nPairs; i++)

{

double weight = 1;

// use lower weights when sales are separated by long periods of time

weight *= 1 - 0.4 * abs(vn2[i]-vn1[i]) / 120.;

double ratio = log((double)prices2[i] / (double)prices1[i]);

// make the index smoother by creating extra "copies" of every resale, shifted one month back and one month forward

for(int delta=-1; delta<=1; delta++)

{

j = vn1[i] + delta;

int k = vn2[i] + delta;

if(j<0 || k<0)

continue;

if(j>N || k>N)

continue;

if(vn1[i]==N-1 && j==N)

continue;

if(vn2[i]==N-1 && k==N)

continue;

matrix[j+j*N] += weight;

if(k<N)

matrix[k+j*N] -= weight;

rh[j] -= ratio*weight;

if(k<N)

{

matrix[k+k*N] += weight;

if(j<N)

matrix[j+k*N] -= weight;

rh[k] += ratio*weight;

}

}

}

/** Solve the system of equations **/

for(j=0; j<N; j++)

{

double x = matrix[j + j * N];

if(fabs(x) < 1e-8)

{

for(i=0; i<N; i++)

matrix[i + j*N] = 0;

x = matrix[j + j*N] = 1;

rh[j] = 0;

}

for(i=0; i<N; i++)

matrix[i+j*N] /= x;

rh[j] /= x;

for(i=j+1; i<N; i++)

{

int k;

double mul = matrix[j+i*N];

for(k=0; k<N; k++)

matrix[k+i*N] -= mul * matrix[k+j*N];

rh[i] -= mul*rh[j];

}

}

for(j=N-1; j>=0; j--)

{

sol[j] = rh[j] / matrix[j+j*N];

for(i=j-1; i>=0; i--)

{

rh[i] -= sol[j] * matrix[j+i*N];

matrix[j+i*N] = 0;

}

}