Social Security name data

Boy name counts

Girl name counts

I got this data from the Social Security Administration website and then converted it to flat file format with some scripts. They had it in a sparse format with one file per year. That saves space, but the data sets are not so large that modern computers need a sparse representation.

The original data seems to move around but last I looked I found it here. For more information look here. As of August 2012 there is a page of data qualifications here.

Each column is a year from 1880 to 2010 inclusive. Each row is a name. Each entry is the number of babies given that name in that year, except that a 0 means the true count was 0, 1, 2, 3, or 4. There are separate files for boys and girls. The source does not list very rare names, probably because of privacy concerns.
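To make the layout concrete, here is a small R sketch using a toy table in the same shape. The names and counts are invented placeholders, not values from the real files, and the helper count_range is hypothetical. It returns the range of true counts consistent with a stored entry, so that a stored 0 is read as "somewhere from 0 to 4" rather than as a true zero.

```r
# Toy table in the same layout as the flat files:
# rows are names, columns are years. All values are invented.
toy <- data.frame(
  "1880" = c(120, 0, 7),
  "1881" = c(115, 6, 0),
  row.names = c("NameA", "NameB", "NameC"),
  check.names = FALSE   # keep "1880" as a column name instead of "X1880"
)

# Range of true counts consistent with the stored value.
# A stored 0 is censored: the true count was anywhere from 0 to 4.
count_range <- function(tab, name, year) {
  x <- tab[name, as.character(year)]
  if (x == 0) c(lower = 0, upper = 4) else c(lower = x, upper = x)
}

count_range(toy, "NameA", 1880)   # an exact count
count_range(toy, "NameB", 1880)   # a censored entry
```

Any summary that sums or averages the raw entries treats those censored cells as exact zeros, so totals computed this way are lower bounds.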

It would be great to know how many names appeared exactly 1, 2, 3, or 4 times for boys and for girls in each year. It is hard to see how releasing that would raise any privacy issues. It is possible to make an educated guess by extrapolating the numbers of names at small counts (5, 6, 7, and so on) down to 4 or 3 or even 2. Somehow I expect the 1s might be exceptional (and more numerous than what you'd get by extrapolation).
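One way to make that educated guess is sketched below. It assumes, purely for illustration, that the number of names occurring exactly k times falls off roughly like a power law in k for small k. It fits that form on the observed counts 5 through 20 and extrapolates down to 1 through 4. The power-law form, the fitting range, and the function name extrapolate_small are all assumptions of this sketch, and the demonstration uses synthetic counts, not the real data.

```r
# counts: one year's column of counts for all names (0 = censored, i.e. 0 to 4).
# Returns estimated numbers of names occurring exactly 1, 2, 3, and 4 times,
# under the ASSUMED model log(N_k) = a + b * log(k).
extrapolate_small <- function(counts, fit_range = 5:20) {
  # N_k: number of names observed exactly k times, for each k in fit_range
  n_k <- sapply(fit_range, function(k) sum(counts == k))
  keep <- n_k > 0                        # log(0) is undefined, drop empty cells
  d <- data.frame(k = fit_range[keep], n = n_k[keep])
  fit <- lm(log(n) ~ log(k), data = d)   # straight line on the log-log scale
  est <- exp(predict(fit, newdata = data.frame(k = 1:4)))
  names(est) <- 1:4
  est
}

# Synthetic check: build counts whose N_k is close to 1e5 / k^1.5 for k = 5..20,
# plus some censored zeros, then see what the fit gives for k = 1..4.
k <- 5:20
synthetic <- c(rep(k, times = round(1e5 / k^1.5)), rep(0, 1000))
extrapolate_small(synthetic)
```

If the 1s really are exceptional, as suspected above, an extrapolation like this would understate them, so the k = 1 estimate deserves the least trust.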

There are lots of interesting patterns in the data. There are more girl names than boy names. Girl names seem to change more quickly than boy names.
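Both patterns can be checked roughly with a few lines of R. The sketch below uses two invented toy tables in place of the real files, and the helpers names_per_year, top_names, and top_overlap are hypothetical. Counting nonzero entries per year gives the number of listed names, and the share of one year's top k names that are still in the top k in a later year gives a crude measure of how quickly names change. Using top-k overlap as the measure of change is an assumption of this sketch.

```r
# Toy tables: rows are names, columns are years. All values are invented.
toy_boy <- data.frame(
  "1900" = c(100, 80, 60, 0),
  "1950" = c(90, 85, 0, 40),
  row.names = c("B1", "B2", "B3", "B4"), check.names = FALSE
)
toy_girl <- data.frame(
  "1900" = c(100, 80, 60, 30, 0),
  "1950" = c(0, 20, 10, 90, 70),
  row.names = c("G1", "G2", "G3", "G4", "G5"), check.names = FALSE
)

# Number of names with a nonzero entry in each year.
# Since 0 stands for 0 to 4, this counts names given to at least 5 babies.
names_per_year <- function(tab) colSums(tab > 0)

# The k most common names in a given year.
top_names <- function(tab, year, k) {
  counts <- tab[[as.character(year)]]
  head(rownames(tab)[order(-counts)], k)
}

# Fraction of the top k names in year y1 that are still in the top k in year y2.
top_overlap <- function(tab, y1, y2, k) {
  length(intersect(top_names(tab, y1, k), top_names(tab, y2, k))) / k
}

names_per_year(toy_boy)
names_per_year(toy_girl)
top_overlap(toy_boy, 1900, 1950, 2)
top_overlap(toy_girl, 1900, 1950, 2)
```

A lower overlap for one table than the other, at the same k and the same pair of years, would be consistent with faster turnover in that table.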

In R:

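# First row of each file is the header; first column holds the names (used as row names).
boy  <- read.table("boycounts.txt",  header = TRUE, row.names = 1)
girl <- read.table("girlcounts.txt", header = TRUE, row.names = 1)
# read.table may alter numeric headers (e.g. to "X1880"), so relabel the columns as years.
names(boy) <- names(girl) <- as.character(1880 + seq_len(ncol(boy)) - 1)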


Justin Dyer and I used it here to illustrate the reliability of observed count rankings.
The paper was a featured paper in JASA.

author = {Dyer, Justin S. and Owen, Art B.},
title = {Correct Ordering in the Zipf-Poisson Ensemble},
journal = {Journal of the American Statistical Association},
volume = {107},
number = {500},
pages = {1510--1517},
year = {2012}