Racist appraisers or a coin flip?

Another widely circulated article comes from the New York Times, discussing racism in real estate.

The short version of the article is that people of color face discrimination in real estate and it is illustrated with an experiment conducted by a couple. They had their house appraised and it came in at 330K. The wife was a lawyer, told the bank she suspected racism, and the bank ordered a reappraisal. Prior to the appraisal she removed pictures of the black family members and books by black authors. The next appraisal came in at 465K. Clear proof of racism, right?

Reproducing the experiment

Although the article sounds horrible, it suffers from numerous problems and is an example of shoddy journalism.

Let me share a similar story from my life. I walked into a high-end jewelry store in Las Vegas. No one came to help me. My wife walked in later that same day and got instant help. Sexist, right? (My study is better because I’ve reproduced it many times, in many stores, in many countries. My wife is always treated better than I am, deservedly so, although this is an example of profiling, not sexism.)

With both stories the sample size of the experiment is small and not controlled.

The thesis as portrayed in the article (which leans on the woman being a lawyer to lend it credence) is that removing the pictures of black family members and the books by black authors changed the outcome:

Books by Zora Neale Hurston and Toni Morrison were taken off the shelves, and holiday photo cards sent by friends were edited so that only those showing white families were left on display

Now, I have books by both of these authors as well, so I decided to repeat the experiment. I put Zami by Audre Lorde and Tar Baby by Toni Morrison on my table. I flipped a quarter on the books (Washington, slave owner). It came up tails.

I then replaced those books with The Big Short by Michael Lewis and Gaming the Vote by William Poundstone. And I flipped a penny (Lincoln, abolitionist).

Washington landed heads down.

Lincoln ended face up.

Clearly something is going on here.

I better call the NYT.

You should be quick to point out that my experiment is utter nonsense. The face on the coin and the books it is flipped on have no impact on each other. And the story has people in it with much more detail about the experiment, so it is much more valid, right?

Wrong.

Both experiments have two evaluators (my two different coins, the two different appraisers) and both have a set of white and black authors. And both had the same number of samples, two.

In fact, my experiment is much better because the only thing that changed is the books and I could easily flip the coin many times (and in doing so would discover that it lands heads about half the time and tails about half the time, no matter which coin or which books).
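The claim about repeated flips is easy to check with a short simulation. The sketch below assumes a fair coin and uses a hypothetical helper, `flip_many`; the coin and book labels are passed in only to make the point that they play no role in the outcome.

```python
import random

def flip_many(coin, books, n_flips, seed):
    """Return the fraction of heads in n_flips fair flips.

    `coin` and `books` are labels only. They are deliberately unused,
    because neither affects a fair flip (the assumption of this sketch).
    """
    rng = random.Random(seed)
    heads = sum(rng.random() < 0.5 for _ in range(n_flips))
    return heads / n_flips

setups = [
    ("quarter", "Zami / Tar Baby"),
    ("penny", "The Big Short / Gaming the Vote"),
]

results = {}
for i, (coin, books) in enumerate(setups):
    results[(coin, books)] = flip_many(coin, books, n_flips=10_000, seed=i)
    print(f"{coin} on {books}: {results[(coin, books)]:.3f} heads")
```

With two flips, any split is possible. With ten thousand, both setups should settle close to one half.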

Size matters

Sample size and controls matter when conducting experiments. Let's take the experience described by the article and I will tell it several different ways.

Story 1

On day one, an appraiser who usually appraises low visited the property. He appraised it at 20% lower than most people would, but he usually appraises 20% low. On a different day an appraiser who usually appraises high visited the property. He appraised 20% high, like he always does. Now we have a gap equal to 40% of the typical value with no racism involved. Changing the books and pictures had no impact. Think 20% is too much of a gap? Here is one of many articles that suggest that is not abnormal.

If you had a history of the appraisals by the different appraisers you could see their trends. If you brought in many more appraisers you could see the pattern. But with two samples you have no way of knowing, especially since they are conducted at different times and with any number of differing factors.
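To put numbers on Story 1, the arithmetic below asks how far each appraiser would have to sit from a shared typical value to produce the two figures in the article. It is a rough sketch that assumes the typical value is the midpoint of the two appraisals, which the article does not establish.

```python
low, high = 330_000, 465_000  # the two appraisals reported in the article

# Assumption: the "typical" value is the midpoint of the two appraisals.
midpoint = (low + high) / 2

low_bias = low / midpoint - 1    # how far below the midpoint appraiser 1 sits
high_bias = high / midpoint - 1  # how far above the midpoint appraiser 2 sits

print(f"midpoint:  {midpoint:,.0f}")
print(f"low bias:  {low_bias:+.1%}")
print(f"high bias: {high_bias:+.1%}")
```

Under that assumption, each appraiser only needs to be about 17% away from the midpoint, which is inside the 20% habit described in Story 1.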

Story 2

Day one is a hot, overcast day. The appraiser walks into the house with slight food poisoning from a fish taco the night before. The owners have just cooked fish in the microwave. His car had a flat tire on the way over. He appraises lower than he normally does, since he is in a foul mood. On day two a different appraiser arrives. She just got a promotion that morning. The house smells like freshly baked cookies. It is a bright sunny day in the mid-70s. She appraises higher than she normally does.

The key thing here is that details that are not known by the participants or not mentioned had an outsized influence. No racism.

Story 3

Appraiser one chose comparable houses from the East Bay neighborhood. Appraiser two chose from West Bay, where a house just closed the day before for a market high of 510K. These are the neighborhoods they normally use for appraisals, so they didn’t change their pattern during the process.

Again, the appraisal is based on different comparables, on a different day, with different market conditions that can be impacted by the sale of a single house, potentially an outlier.

No racism.

Story 4

The bank, after being accused of racism by the homeowner lawyer, brings in an appraiser, tells them about the accusation, and suggests they should address it. Appraiser 2 comes in high.

No racism, but the threat of public scrutiny or legal action drives the value arbitrarily high. Race is involved.

Story 5

Appraiser 1 is racist and, seeing the photos as well as books by black authors (he also happens to be versed in names of black authors), appraises low. Appraiser 2 is also racist. Seeing only white images and books by white authors (she also knows black authors and looks for them, carefully perusing the book shelf), she appraises much higher. (For some reason, the appraisers or the NYT seem to believe that people only read books by authors that share their race, but that is beside the point.)

Racism. Not only is there racism, but both appraisers are racist and trained in trigger books. (If only one were racist you could get the same outcome, but you need a much larger spread for the math to work.)

Which is it?

With the story given, there is no way to know. There are far too many unknowns. It could be any or all of the factors from the stories, or endless other factors. When conducting an experiment you need to take a number of samples far larger than your variables. For a real world test, such as the one described in the article, you would need a much broader set of data to draw any conclusions.

Having said that, Story 1 and Story 5 are the least likely, because each requires that the two appraisers picked from the whole pool either both sit at the extremes of the spectrum or are both racist. (Just to explain, since this is a case where stories confuse how we think about math: let us suppose there are 100 appraisers in the area, and that 65% of them appraise within 10% of the average and 85% appraise within 20% of the average. Let us also assume that 10% are extreme racists. The chance of picking an appraiser that appraises at least 20% low or high is 15%. Thus, treating the picks as independent, the chance of picking two of them is .15*.15 or 2.25%. The chance of picking two racists is .1*.1 or 1%.)
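The back-of-the-envelope numbers above can be reproduced in a few lines. This sketch uses the assumed shares from the paragraph (15% of appraisers at least 20% off the average, 10% extreme racists) and treats the two picks as independent. The even split between low and high outliers is an extra assumption of mine, added to show the chance of getting specifically one low and one high appraiser, as Story 1 requires.

```python
p_outlier = 0.15  # share of appraisers at least 20% off the average (from the text)
p_racist = 0.10   # share of appraisers assumed to be extreme racists (from the text)

# Picks are treated as independent, as in the text.
p_two_outliers = p_outlier * p_outlier
p_two_racists = p_racist * p_racist

# Extra assumption: outliers split evenly between low and high.
p_low = p_high = p_outlier / 2
# One low and one high appraiser, in either order.
p_low_and_high = 2 * p_low * p_high

print(f"two outliers:      {p_two_outliers:.2%}")
print(f"two racists:       {p_two_racists:.2%}")
print(f"one low, one high: {p_low_and_high:.3%}")
```

Requiring one low and one high outlier, rather than any two outliers, makes Story 1 even less likely than the 2.25% figure suggests.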

The article thus picks a story that has an emotional edge and feels like a controlled experiment. The homeowners made one deliberate change, the hundreds of external factors they did not think about go unmentioned, and the change they made is consistent with the narrative the New York Times is promoting. The NYT then uses that to further its narrative that racism is omnipresent.

Bad measures give bad outcomes

As noted previously, metrics matter. Suppose we took this article as truth, since it was published in the New York Times, a reputable paper, and we decided we must fix this situation. Given that it shows that there is not only one but likely two racists, and that the chance of getting two racists is low unless there are lots of racists, we might need to overhaul the entire system. And after that work, expense, and all of the lost jobs of the existing, presumably irredeemable racists, regardless of their own backgrounds, we still end up with appraisals differing because of the smell of cookies. (Which, by the way, most real estate agents will tell you to bake before an open house. They will also tell you to take down all photos of your family, not because they want to eliminate racial prejudice, but because they want the buyers to imagine photos of their own family in the house and not pay attention to yours.)

Is there racism in housing?

Yes. Redlining existed 50 years ago. You can also find contemporary accounts where buyers are steered to different neighborhoods based on their race, religion, etc. (I’ve experienced that directly, albeit a long time ago.) There is no doubt that it exists, and that there are long-lasting effects on net worth caused by the ability to buy a house and the gain or loss in a house’s value over time. But that doesn’t mean that you are guaranteed or likely to face racism in a real estate sale today, as the article suggests.

If you look for studies on racism and real estate, you will be flooded with page after page of systemic racism articles. The better ones focus on the long term effects of historic denials to choice housing markets. The lower quality ones, like the NYT article, are designed to inflame passions. Undoubtedly there are well designed studies, and I would not be the least surprised if they find racial disparities that disadvantage people of color. The NYT article is just not one of them.

For a better study on racial effects, consider a Harvard study that looks at responses to job applications. This study reduces variables by using consistent resumes with varying names, has a larger set of data, analyzes other studies, and examines its own strengths and flaws. It convincingly concludes that names on resumes can send clues that can trigger racial disparity. In response to this, many recruiting firms anonymize resumes so that when dealing with responses to job offers those screening candidates will see a name such as P.S. instead of the candidate’s real name. There are still other issues that can cause prejudice in the process (for example, men and women tend to write resumes differently, universities attended can suggest country of origin, year of graduation can suggest age, etc.). With a clear study and clear measurements, organizations made focused changes to reduce racial disparity.