CARstat: Tools


Statistical Tools

The USA Today Index

Chi Square

Getting Ready for a Hurricane

Updating the USA TODAY Diversity Index

Phil Meyer and Paul Overberg
January 2001
If we want to write about social diversity, we should be ready for descriptive challenges. By its nature, diversity can be approached from many viewpoints and create much debate about its definition, level and pace of change.
To measure racial and ethnic diversity and capture it in a single number, Phil Meyer of the University of North Carolina and Shawn McIntosh of USA TODAY created the USA TODAY Diversity Index in 1991. It measures the probability that any two people chosen at random from a given census area are of different races or ethnicities.
In 2000, Meyer and Paul Overberg of USA TODAY considered ways of updating the index to handle 2000 Census data, in particular the new multi-race categories. Below we explain the original index and how it has been updated for 2000. We conclude that the changes needed are not substantial.

To calculate the 1990 USA TODAY Diversity Index:
The index uses two basic principles of probability theory: (a) to obtain the probability that all of several independent events will occur, multiply their separate probabilities and (b) to obtain the probability that at least one of several mutually exclusive events will occur, add their separate probabilities.
Example: Flip two coins. The probability of getting heads on the first AND tails on the second is .5 * .5 = .25. Flip one coin. The probability of getting heads OR tails is .5 + .5 = 1.
Applying it to the diversity index:

For each race, calculate its percentage frequency in the area, with the "other race" category removed from the base. This has the same effect as assigning "other race" to the population in proportion to the presence of the officially designated categories.

Convert the percentage to a decimal and treat it as the probability that ONE person chosen at random will be of that race.

Square that probability. The probability that TWO people chosen at random will both be of that particular race is the single probability multiplied by itself (squared).

Sum the squared probabilities for the separate races. This is the probability that two people are of the same race.

Prob (race) = White² + Black² + AmInd² + A-PI²

Because Hispanic origin is a separate Census question, the probability that someone is Hispanic or not must be figured separately. Figure percentages of the Hispanic and non-Hispanic population, square each percentage and add them.

Prob (Hisp) = Hisp² + non-Hisp²

Multiply the sum of the racial probabilities by the sum of the ethnic probabilities. This is the probability that any two people are the SAME by race and ethnicity.

Prob (race) x Prob (Hisp)

Subtract this probability from 1 (100%) to get the probability that your two random people are DIFFERENT.¹

In a final step to simplify presentation, multiply by 100 and round to get an integer.
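The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' own code: the function name `diversity_index`, its inputs and the sample counts are all assumptions invented for the example.

```python
def diversity_index(race_counts, hispanic, non_hispanic):
    """Return a 1990-style diversity index on the 0-100 scale.

    race_counts holds counts for the designated races only, with the
    "other race" category already removed from the base.
    """
    race_total = sum(race_counts.values())
    # Probability that two random people are of the same race.
    prob_race = sum((n / race_total) ** 2 for n in race_counts.values())

    eth_total = hispanic + non_hispanic
    # Probability that two random people match on Hispanic origin.
    prob_hisp = (hispanic / eth_total) ** 2 + (non_hispanic / eth_total) ** 2

    # Probability the pair is the SAME on both questions, then its complement.
    prob_same = prob_race * prob_hisp
    return round((1 - prob_same) * 100)


# Hypothetical area, invented for illustration only.
races = {"White": 80_000, "Black": 15_000, "AmInd": 0, "A-PI": 5_000}
index = diversity_index(races, hispanic=10_000, non_hispanic=90_000)
print(index)
```

Passing counts rather than percentages keeps the "other race" adjustment explicit: whatever is left out of `race_counts` is simply left out of the base.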

USA TODAY applied this formula to 1980 and 1990 Census data.² It found the U.S. index was 40, a 20% increase from 1980. The USA TODAY study found significant gains in diversity in every region and all but five states.³
USA TODAY used states, metropolitan areas and cities, but the analysis can be done on any level at which diversity is worth examining: counties, neighborhoods (census tracts) or locally defined areas such as school districts, wards or precincts.

Update for 2000 Census
The 2000 Census allowed respondents to choose more than one race.⁴ It also split the Asian-Pacific Islander category into Asian and Native Hawaiian-Other Pacific Islander categories.
The Asian-Pacific Islander split is handled simply by treating each new category as a separate term in the index. As before, percentages are figured with the "other" category eliminated.
Coping with the multi-race option is less obvious. Four major options were considered:

Each of the 63 multi-race possibilities could be treated as a separate race. Doing the same separately for both Hispanics and non-Hispanics would produce 126 categories for which to calculate percentages and probabilities. This would capture every conceivable kind of diversity. For instance, it would treat Tiger Woods, a self-described "cablinasian" who presumably checked white (Caucasian), black, American Indian and Asian on his Census form, as distinct (diverse) from his white-black-American Indian father and Asian mother. Logistically, this bore little appeal, but it was intuitively appealing.

A category of "two or more races" could represent all multi-race people, and a single percentage and probability could be calculated as if it were just another race. This is a simpler alternative. But it would treat a black-white mixture and an Asian-American Indian as members of the same race even though they had no racial identity in common.

Probabilities could be calculated on the basis of the five basic race groups as before. The fact that they now add to more than 100 because some people report more than one race would be ignored. The effect would be to count a random pair as diverse only if the two members differed on every racial characteristic. Thus Tiger Woods paired with either parent would be part of a non-diverse match. The justification would be that the new method of classifying race reveals common characteristics that were hidden before.

We chose a fourth option which has the opposite effect. It maximizes diversity by letting mixed-race people default to diversity instead of sameness as in No. 2. To achieve it, we modify the 1990 formula to treat Asians and Hawaiians as separate races and make no other changes. Since there is no provision for mixed-race persons in this scheme, their pairings are classed as diverse by default.

In a test against a composite of areas in the American Community Survey with a population of 8 million, we found that methods 3 and 4, although conceptually quite different, produce similar results. The index was .584 with method 3 and .594 with method 4. The small difference is explained by the low incidence of persons reporting multiple racial categories.
Our conceptual justification for counting all multiple-race pairs as representing diversity, even for pairs of identical mix, is that these individuals are diverse in themselves. Consider a random pair where both members are Black-Asian. This pair would count as diverse because each member has at least one racial characteristic that differs from at least one characteristic in the other. This method closely approximates the result in Option 1 above. It also carries the appeal of leaving the equation basically unchanged from its 1990 form.
This is the method we will use. To restate, the USA TODAY Diversity Index for 2000 is calculated:

Diversity = 1 - ((W² + B² + AmIndI² + A² + PI²) * (H² + Non-H²))
In a simple example just of racial diversity, imagine a city of 100,000 in 1990. (For simplicity, these and all calculations below present diversity scores as a decimal, omitting the final step of multiplying by 100.)

1990 index

                         White    Black    AmInd   A-PI    Diversity Index
Total                    80,000   15,000   -       5,000
Prob of two same race    0.64     0.023    0       0.003   0.335

In 2000, it still has 100,000 people, but 5,000 people have reported more than one race. Below, we compute the index two ways:

-- Treating multi-race people as a racial category.
-- Letting them default to the diversity side of the division.

2000 index

                         White    Black    AmInd   NatH   A-PI    2+ races   Diversity Index
Total                    78,000   13,000   -       -      4,000   5,000
Prob of two same race    0.608    0.017    0       0      0.002   0.003      0.371

                         White    Black    AmInd   NatH   A-PI    2+ races   Diversity Index
Total                    78,000   13,000   -       -      4,000   5,000
Prob of two same race    0.608    0.017    0       0      0.002   X          0.373

Counting multi-race respondents as automatically diverse creates very little difference in this 5% multi-race scenario, which Census Bureau studies indicate will be a high rate for most areas of this size.
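The figures in the example tables can be checked with a short calculation. The sketch below is only an illustration of the arithmetic; the helper name `same_race_probability` is an assumption, and like the tables it looks at race alone and leaves out the Hispanic term.

```python
def same_race_probability(counts, total):
    """Sum of squared population shares for the listed categories."""
    return sum((n / total) ** 2 for n in counts)


total = 100_000

# 1990: White, Black, Asian-Pacific Islander.
index_1990 = 1 - same_race_probability([80_000, 15_000, 5_000], total)

# 2000, first way: "2+ races" (5,000 people) is just another category.
as_category = 1 - same_race_probability([78_000, 13_000, 4_000, 5_000], total)

# 2000, second way: multi-race people stay in the base but get no
# squared term, so every pair that includes one of them counts as diverse.
default_diverse = 1 - same_race_probability([78_000, 13_000, 4_000], total)

print(round(index_1990, 3), round(as_category, 3), round(default_diverse, 3))
```

The two 2000 results differ only by the squared share of the multi-race group (0.05² = 0.0025), which is why the gap stays small whenever that group is a small part of the population.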
We can see more clearly the effect of excluding multi-race respondents from the calculation by analyzing the Census Bureau’s 1999 American Community Survey data. It included the multi-race checkoff just as it was offered on the Census 2000 form. The ACS was completed by a sample of households in 40 communities, including 19 counties whose population totals 8.9 million people.
Combining data for those 19 counties, we calculated a diversity index of .593 by including multi-race respondents as a separate category and .593 by letting them default to diversity. The Census Bureau conducted the ACS in counties with relatively high density to test the effects of the multi-race data on its methods. But in this composite 19-county area with relatively high diversity, the effect of automatically treating multi-race respondents as diverse is nil. This suggests that for states and large metros, our choice is sound.
To test the effect of our method on smaller areas, we also calculated a diversity index for 1990 and 1999 for each of those counties. For the 1990 calculation, we used the original form of the index and data from the 1990 Census. For 1999, we did two calculations. The first used the original index and the Census Bureau’s 1999 county population estimates.⁵ The second used the new form of the index (excluding multi-racial respondents) and 1999 ACS data.

County                      1990 Census   1999 est.   1999 ACS
Bronx County, NY            .762          .770        .783
San Francisco County, CA    .663          .700        .736
Tulare County, CA           .559          .586        .634
Yakima County, WA           .473          .537        .594
Pima County, AZ             .475          .523        .558
Jefferson County, AR        .504          .517        .529
Madison County, MS          .503          .511        .516
Broward County, FL          .404          .489        .510
Rockland County, NY         .353          .414        .438
Hampden County, MA          .321          .394        .422
Calvert County, MD          .293          .371        .379
Franklin County, OH         .319          .359        .373
Multnomah County, OR        .270          .341        .370
Lake County, IL             .292          .349        .365
Douglas County, NE          .266          .327        .345
Black Hawk County, IA       .160          .187        .209
Flathead County, MT         .060          .071        .088
Sevier County, TN           .030          .046        .084
Schuylkill County, PA       .028          .065        .047

These calculations show that the 1999 ACS data reveals a slightly higher diversity at most sites than the 1999 estimates, which we expect. Only one site shows significantly lower diversity in the 1999 ACS data than in the 1999 estimates. That site, Schuylkill County, Pa., shows lower black and Hispanic ACS totals than 1999 estimate totals. Both groups are less than 1% of the population, and small differences there can translate to big changes in percentage-derived measures. In general, this detailed county analysis strengthens our confidence in the method we have chosen for updating the index.
We have not tested the index on smaller areas for comparability between 1990 and 2000 data. It seems likely that proportionally more diversity will be revealed in progressively smaller areas, yielding increasing differences between 1990 and 2000 scores on the USA TODAY Diversity Index. This can be illustrated by examining an extreme case. Imagine a census block of 100 people where everyone reported they were white in 1990. In 2000, only those same people were living there but half decided they were white-American Indian, hoping to take advantage of jobs at a reservation casino nearby. The block’s diversity score would vault from 0 to 50, even though all the people were the same. Even in this case, the index would be faithfully reporting the diversity uncovered by the new data, but any comparison to 1990 would require at least a footnote.

FOOTNOTES

1. In a single equation, the USA TODAY Diversity Index was:

Diversity = 1 - ((W%² + B%² + AmIndI%² + A-PI%²) x (H%² + non-H%²))

2. It eliminated the racial category of "Other," which showed a high error rate because many who chose this option were Hispanics who confused ethnicity with race on the census form. USA TODAY dropped those people and figured percentages based on the remaining races.

3. "Analysis puts a number on population mix," USA TODAY, April 11, 1991, Page 10A.

4. All federal agencies must make this switch by 2003. See Office of Management and Budget rules and guidelines at http://www.whitehouse.gov/OMB/fedreg/ombdir15.html and http://www.whitehouse.gov/OMB/bulletins/b00-02.html

5. "1990 to 1999 Annual Time Series of County Population Estimates: Race by Hispanic Origin," http://www.census.gov/population/www/estimates/co_crh.html
REFERENCES
Fay, Robert, Jeffery Passel and J. Gregory Robinson (1988): The Coverage of Population in the 1980 Census, US Department of Commerce.
Meyer, Philip and Shawn McIntosh (1992): "The USA Today Index of Ethnic Diversity," International Journal of Public Opinion Research. Spring, p. 56.
USA Today (1991): "Analysis puts a number on population mix". 11 April, p. 10A.

Chi Square Test
Significance Tests
Problem: Something that seems newsworthy might be a meaningless coincidence. If the news can be boiled down to a difference between two numbers, a significance test can help you decide whether to take the coincidence possibility seriously.
Comparing two percents: the Chi-Square test.
Chi square is a test of randomness. Its most common application is comparing percents. If they are different, it could be because of some important factor or it could be due to chance. The test helps you compare the difference you have with the difference that would be expected purely by chance.
It's traditional to reject the coincidence explanation whenever the probability is 0.05 or less (i.e., no more than 1 in 20). But there's no law against setting your rejection criterion anywhere you want. Think of it as the probability that you'll be wrong when you assume that coincidence is not the explanation. (This is called "rejecting the null hypothesis.") The smaller the significance level, the lower the probability that you'll make that particular error, which statisticians call "Type I error."
The other kind of error is assuming that coincidence is the explanation when it's really not. The more you protect yourself from Type I error (by setting a low threshold of probability), the greater your risk of making this Type II error and missing a good story.
A finding with a low probability of Type I error is said to be "statistically significant," but "significant" in this sense does not necessarily mean important in a substantive sense. Chi Square is sensitive to sample size, and so in a large enough sample, all kinds of trivial differences will be statistically significant. You still need common sense.
Example: A newspaper reporter noticed a politically connected law firm was more successful than others in getting clients' zoning variances approved by the City Council. Council members claimed that it was a coincidence, and the law firm just happened to get better cases.

                  P-C Firm   All Others
Successful        76%        52%
Not successful    24%        48%

Is the difference between 76% and 52% large enough that something other than chance is going on here? The Chi Square test compares the observed results with the results that would be expected if there were no relationship between the political connectedness of the law firm and its success rate, and any difference were generated by chance alone.
The Chi Square calculation is based on raw numbers.

Outcome           P-C Firm   All Others   Total
Successful        166        319          485
Not successful    52         300          352
Total             218        619          837

There is a shortcut formula for Chi Square in a four-cell table. (We're showing nine cells above because we are including the row, column, and table totals in the margins. These values are called marginals because they are not part of the table but are in its margins.) For purposes of calculation, let's label the values:

A B
C D

Chi Square (X²) = N * [(A*D) - (B*C)]² / [(A+B) * (C+D) * (A+C) * (B+D)]

Here N is the table total (A+B+C+D), and the four terms in the denominator are the marginal totals.
The higher the Chi Square value, the less probable it is that the difference arose by chance. At 3.84, the probability is 0.05. (See Fisher's Critical Values of Chi Square in the back of almost any statistics book.) Here's an excerpt that applies to fourfold tables like the one above.

X²   2.71   3.84   5.02    6.63   7.88    10.83
P    0.10   0.05   0.025   0.01   0.005   0.001

In this case, Chi Square is 40 and p < .001. The coincidence explanation can readily be rejected.
This equation is easier to do by hand than it looks. You need a calculator that allows parenthetical expressions to determine the order of operations. Once you have done the top part, just divide the result by each of the marginal totals, one at a time. That saves you from having to deal with the big hairy number that results from doing the multiplication in the denominator, and it leads to the same result.
You can easily create a spreadsheet to do problems like this one. And statistical programs like SAS and SPSS will do both the calculation and the table look-up for you.
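For readers who would rather script it than build a spreadsheet, here is a minimal Python sketch of the shortcut formula applied to the zoning example. The function names are assumptions made for this illustration. The probability helper relies on the fact that, with one degree of freedom, the chi-square tail probability equals the two-sided tail of a normal curve, so it applies only to fourfold tables.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Shortcut chi-square for a four-cell table laid out as
    A B / C D, with no continuity correction."""
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


def p_value_1df(chi_square):
    """Upper-tail probability for chi-square with one degree of freedom."""
    return math.erfc(math.sqrt(chi_square / 2))


chi2 = chi_square_2x2(166, 319, 52, 300)
print(round(chi2, 1), p_value_1df(chi2) < 0.001)
```

Calling `p_value_1df(3.84)` gives a value very close to 0.05, which matches the critical-value excerpt above.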
The general formula for chi square is

X² = Σ (O - E)² / E

which means that for each cell in the table, you subtract the expected value (E) from the observed value (O), square the result, and divide by the expected value. Then you repeat this operation for each cell and sum the results.
What do we mean by expected value? The mathematical expectancy for any cell in a row-and-column (two-way) table is the row total times the column total divided by the table total. For cell A in the example above (number of successful cases argued by the politically connected firm) the expected value is 218 (the column total) times 485 (the row total) divided by 837 (the table total). The result, 126, is what would be expected if the cell values from the given marginals were determined by chance alone.
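The expected values and the general formula can be worked through for the same table. This is a sketch for illustration only; the variable names are assumptions, and the layout follows the raw-number table above (rows are outcomes, columns are firms).

```python
observed = [[166, 319],   # Successful: P-C firm, all others
            [52, 300]]    # Not successful: P-C firm, all others

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
table_total = sum(row_totals)

# Expected count for each cell: row total * column total / table total.
expected = [[r * c / table_total for c in col_totals] for r in row_totals]

# Sum of (O - E)^2 / E over every cell.
chi2_general = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

print(round(expected[0][0]), round(chi2_general, 1))
```

For a fourfold table the general formula and the shortcut formula are algebraically the same, so both give a Chi Square of about 40 here. The general form is the one that extends to larger tables.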
For tables larger than two by two, the p-value depends on the table's degrees of freedom, which means simply the number of cells whose values you have to know in order to infer the values of all of the cells (given the marginals). The easy formula for degrees of freedom is the number of rows minus one times the number of columns minus one: (R-1)*(C-1). For our two-by-two example, it's (2-1)*(2-1) = 1*1 = 1.
Chi Square is an approximation, and the method was worked out before there were computers. Some newer versions of SPSS and other programs will do the heavy crunching to calculate the exact probability of getting a difference as great or greater than the one in the data.
Both Chi Square and exact probability tests have an advantage that journalists can appreciate: they are nonparametric tests, meaning that they require no assumptions about the nature of the underlying distribution of your variables in the population being examined. The trade-off is that the test is not as sensitive as other tests that do assume certain characteristics in the underlying population parameters.
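As an illustration of an exact probability calculation, the sketch below applies one common exact method, Fisher's exact test, to the zoning table. The text above does not say which exact test the statistical packages use, so treat that choice and the function name `fisher_exact_upper` as assumptions. It computes the one-sided probability of the politically connected firm winning at least as many cases as it did, holding the marginal totals fixed.

```python
from math import comb


def fisher_exact_upper(a, b, c, d):
    """One-sided exact probability that cell A is at least as large as
    observed, with the row and column totals held fixed."""
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2
    ways_total = comb(n, col1)
    # Cell A can be no larger than its row total or its column total.
    largest_a = min(row1, col1)
    ways_extreme = sum(
        comb(row1, k) * comb(row2, col1 - k) for k in range(a, largest_a + 1)
    )
    return ways_extreme / ways_total


p_exact = fisher_exact_upper(166, 319, 52, 300)
print(p_exact < 0.001)
```

Python's integers have no size limit, so the very large counts of possible tables are handled exactly before the final division.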
--Barbara Hansen and Phil Meyer

Getting ready for a hurricane
Natural disasters like hurricanes, floods, earthquakes and such are hard to predict. But chances are, if your area has been hit before, it’ll be hit again someday. And, as the saying goes, chance favors the prepared mind: Smart computer-assisted reporters should prepare even for such unlikely events by acquiring and getting comfortable with the databases they’ll need if – no, when – the unthinkable does happen.
Here are some tips based on my experience at The Miami Herald dealing with the aftermath of Hurricane Andrew in 1992. Obviously, different areas are vulnerable to different kinds of disasters, but many of the same principles hold whether your area is along the coast, in a Midwest flood plain, or near a fault line. Best of all, most of these suggestions are for data that’s useful even if there isn’t a disaster.

Get a copy of your property tax roll, or any other real estate database that has detailed information (value, size, year built, location, etc.) about each property in your area. If you do get hit, and then in the aftermath can get a database describing the damage at each property, you can merge them to produce a profile of the damage, map it, etc.

Learn a good mapping program, like ArcView or Atlas*GIS or MapInfo. Then get a detailed map file of your area from your planning office or whoever has some local GIS ability. You'll want this for mapping damage.

In advance of disaster, prepare a disaster history database of your area. Several years before Andrew hit, I built a hurricane database with information on every storm that had hit Florida since 1900. It included date, category, damage amount, number killed/injured, etc. We used it for a hurricane season information graphic showing the storm tracks by decade, and by months (all June storms, all July storms, etc.), plus a lot of other information about hurricane climatology. After Andrew, we dusted it off, slapped in the new track and some new text, and ran it again.

Be ready with "could have been worse" data. In other words, gather the information you need to be able to say how vulnerable your area is in terms of the number of people, buildings and dollars at risk. Then, when disaster hits, you can make valid estimates of the percentage that actually was affected. After Andrew, I did an analysis that showed if the storm had wobbled in just 20 miles farther north -- a meteorological hairsbreadth -- the damage would have tripled. I did it by totaling the population and property value in mile-wide east-west strips all up the Florida coast, then calculating the total for each potential 20-mile wide hurricane path. It took just a couple of hours because I already had detailed census information and property tax values.

Find out how damage assessment will be done in your area, if you are hit. Which agency would handle something like that? (Don't count on the Red Cross damage assessment files to be very useful for CAR work. We got their files, but it turned out not to be really a database, but rather just descriptive information.) Ideally, you want house-by-house data; chances are, it will be done by your local building and zoning department, though maybe by FEMA.

If you think building standards might be an after-disaster issue (as they turned out to be in Miami) then start laying the groundwork for getting the files of building inspections. We were able to document instances of inspectors supposedly inspecting hundreds of homes per day. Even better than waiting for the disaster, do that story in advance and maybe you can keep a disaster from being worse.

Get those local campaign contribution files up to date. If there is a question about construction quality, you'll want to document the importance of the building industry in local elections, particularly of those officials who decide on building codes.
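The "could have been worse" calculation described above, totaling each possible 20-mile path from mile-wide strips, amounts to a sliding-window sum. Here is a minimal Python sketch of that idea. It is not the original analysis: the function name `path_totals` and the strip values are invented for illustration, and real input would come from your census and property tax data.

```python
def path_totals(strip_values, path_width=20):
    """Total value inside every run of `path_width` adjacent strips."""
    window = sum(strip_values[:path_width])
    totals = [window]
    for i in range(path_width, len(strip_values)):
        # Slide the path one strip north: add the new strip, drop the old.
        window += strip_values[i] - strip_values[i - path_width]
        totals.append(window)
    return totals


# Hypothetical property values (in $ millions) for 25 mile-wide strips,
# listed south to north; invented for illustration only.
strips = [5, 5, 10, 10, 20, 20, 40, 40, 80, 80,
          90, 90, 100, 100, 120, 120, 150, 150, 200, 200,
          250, 250, 300, 300, 300]
totals = path_totals(strips)
print(totals[0], max(totals))
```

Comparing the total for the path the storm actually took with the totals for neighboring paths gives the "how much worse" ratio.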

--Steve Doig
Cronkite School of Journalism
Arizona State University

Page created by:
Greg Makris
UNC-CH School of Journalism and Mass Communication
© 1998
Latest Revision 1/28/01 by PMK