My Failed Book Analyzing 580,000 Craigslist Personals

In 2015, I started a new project that I hoped would be fertile ground for data viz analysis work about online dating.

I wanted to scrape a bunch of dating data and feed it through textual analysis to work through interesting ideas about romance in the online age. Heavily inspired by the book Dataclysm by Christian Rudder, a co-founder of OKCupid and author of the sadly now-defunct OKTrends Blog, I thought this would be the perfect angle to break into general interest nonfiction, letting me write about something I thought was truly transforming what love and sex looked like.

I was going to break into publishing through my combined writing ability and sheer analytical chops.

I didn’t realize just how thoroughly my reach exceeded my grasp.

Collecting the Data

Thinking about how I could get access to analyze-able information was my first hurdle.

Sites like OKCupid keep their data close. You could write a bot to page through the app, but you’ll only see the tip of the data iceberg - what’s available from your perspective. You could have multiple bots scrape the site, allowing you to stitch together the information pulled through different accounts, but that requires more sophisticated data collection - I didn’t want to reverse engineer selection algorithms.

As I looked through the different services, I also started to get a sense of the modern dating value proposition: Not just providing the virtual space for people to meet, but collecting information for machine-assisted matchmaking. As I came across sites that follow this model, some through simple personality questions (OKCupid), linguistic analysis (eHarmony), and other criteria, I started to develop an idea of the dating app industry, where matchmakers competed with simple forums - the digital equivalent of paper classified personals (or lonely hearts ads).

And in this latter category, classifieds, the clear winner was the same site that had conquered classifieds generally: Craigslist.

united states of craigslist

Craigslist had a lot of advantages: It’s hyper-local (as Citylab’s wonderful “United States of Craigslist” above shows); anonymous; well-trafficked; at least privacy-aware (it offers email relays, for example); and with a structure that makes it easily scrapeable.

It’s also filthy.

Most of the Craigslist personals section, in its day, was a meat market - a sexual clearing house for different fetishes and a Skinner box for lonely single men. I realized analyzing this data set wouldn’t be looking into the hearts of our digital avatars so much as peering into their beds.

That sounded fascinating. And combined with all the technical advantages of choosing it as a project, I decided to play around with building a scraper, and poking around to see what I discovered.

The scraper was a simple, cringe-worthy affair I’m embarrassed to release. Written in CasperJS, it took a seed file of all the subsite paths (generated by a separate script) and, in very procedural terms, stepped through each regional site, spidered through its Personals content, and vacuumed the contents into a series of category-named, newline-delimited plain text files, organized into directories and subdirectories corresponding to the states and cities where the subsites were located. I chose that method because I was eager to hack something out quickly, I wanted a stable, versionable storage medium (I kept the project files in git, of course), and I wanted to be able to inspect and interact with the data by itself.

It was rudimentary, but worked, and allowed me to collect, over a couple days in March 2015, the contents of about a quarter million posts.
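The storage layout described above is simple enough to sketch. This is not the original scraper (that was CasperJS, and it isn't released); it's a minimal Python sketch of the file layout only, and the function names and the sample state, city, and category values are hypothetical.

```python
from pathlib import Path


def save_posts(root, state, city, category, posts):
    """Append posts to <root>/<state>/<city>/<category>.txt, one post per line."""
    out_dir = Path(root) / state / city
    out_dir.mkdir(parents=True, exist_ok=True)
    out_file = out_dir / f"{category}.txt"
    with out_file.open("a", encoding="utf-8") as fh:
        for post in posts:
            # Collapse internal whitespace (including newlines) so that
            # each post occupies exactly one line of the file.
            fh.write(" ".join(post.split()) + "\n")
    return out_file


def load_posts(root):
    """Yield (state, city, category, post) for every stored line."""
    for path in sorted(Path(root).glob("*/*/*.txt")):
        state, city = path.parts[-3], path.parts[-2]
        with path.open(encoding="utf-8") as fh:
            for line in fh:
                line = line.strip()
                if line:
                    yield state, city, path.stem, line
```

Something like `save_posts("data", "texas", "austin", "m4w", posts)` would then leave a diff-friendly text file at `data/texas/austin/m4w.txt`, which is the whole appeal of the approach: the data stays greppable and plays nicely with git.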

To analyze this I started to play with the nltk (Natural Language Toolkit) Python module, using a Jupyter Notebook that let me iterate through the data just, you know, plugging shit in and seeing what I could find.

What I Found

I did a few different types of analysis.

I played around with frequency and concordances; I also split the data into a bar graph, for a view with a little more space for comparison.
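For anyone unfamiliar with the terms: a frequency count tallies how often each word appears, and a concordance lists every occurrence of a word with a little context on either side. The sketch below shows both using only the Python standard library. It is a stand-in, not the nltk code from my notebook, and the tokenizer, function names, and sample posts are made up for illustration.

```python
import re
from collections import Counter

# A deliberately crude tokenizer: lowercase runs of letters, digits, apostrophes.
TOKEN_RE = re.compile(r"[a-z0-9']+")


def tokenize(text):
    return TOKEN_RE.findall(text.lower())


def term_frequencies(posts):
    """Count how many times each token appears across all posts."""
    counts = Counter()
    for post in posts:
        counts.update(tokenize(post))
    return counts


def concordance(posts, term, window=3):
    """Return (left context, term, right context) for each occurrence of term."""
    term = term.lower()
    hits = []
    for post in posts:
        tokens = tokenize(post)
        for i, tok in enumerate(tokens):
            if tok == term:
                left = " ".join(tokens[max(0, i - window):i])
                right = " ".join(tokens[i + 1:i + 1 + window])
                hits.append((left, tok, right))
    return hits


# Invented sample posts, purely for demonstration.
posts = ["Looking for coffee and a hike", "Coffee first, then we talk"]
print(term_frequencies(posts).most_common(3))
print(concordance(posts, "coffee", window=2))
```

nltk offers richer versions of both ideas, with proper tokenizers on top, but the underlying operations are about this simple.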

But the most fun thing I did was set up the ability to generate some on-demand choropleth maps of the United States, where different states are shaded by the relative frequency with which they use a term. For these graphs, I used a Python visualization library, Altair, that would accept the frequency distribution of a term’s usage across states as a pandas DataFrame, then slice everything up into the appropriate quintile buckets and shade things appropriately.
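The bucketing step is the only real logic in those maps, so here is a minimal sketch of it in plain Python. It leaves out Altair and pandas entirely and only shows how per-state relative frequencies can be sorted into five shades. The function names and the per-1,000-tokens normalization are assumptions for illustration, not a record of what my notebook did.

```python
from bisect import bisect_right
from statistics import quantiles


def relative_frequency(term_counts, total_tokens):
    """Occurrences of a term per 1,000 tokens, for each state with any data."""
    return {
        state: 1000 * term_counts.get(state, 0) / total
        for state, total in total_tokens.items()
        if total > 0
    }


def quintile_buckets(values):
    """Map each state to a bucket from 0 (lowest fifth) to 4 (highest fifth)."""
    # quantiles(..., n=5) returns the four cut points between the five buckets.
    cuts = quantiles(values.values(), n=5)
    return {state: bisect_right(cuts, value) for state, value in values.items()}
```

Each bucket number then maps to one color in the map's scale. In pandas, `pandas.qcut` does a similar job in one call, though its handling of ties and edges may differ from this sketch.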


Racialized sexual preferences are sort of fascinating, and just kicking around terms I discovered some things I expected - and some that really surprised me.

One thing I expected: Oftentimes the frequency of a racial term’s usage mapped to the parts of the country where that ethnicity had the largest population. For example, looking at the term “asian” you can see it clustering around the west coast.

But one of the things I did not expect was that a bunch of racial terms - both “black” and “white” but also racialized sexual slang - were most concentrated in the Deep South.

white frequency choropleth

black frequency choropleth

This is just the general usage of these terms, with no context or logic to detect the unique number of uses, or how they appear. This is simply the beginning of a process that would try to formulate and test a hypothesis around racialized sexual preferences.
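To make that caveat concrete: a raw count treats one post that repeats a term five times the same as five posts that each use it once. A first refinement would be to count distinct posts as well. The sketch below is hypothetical (the function name is invented, and this is not something I ran against the data), but it shows the difference.

```python
import re


def raw_and_unique_counts(posts, term):
    """Return (total occurrences of term, number of posts containing it)."""
    # Whole-word, case-insensitive match; re.escape guards against
    # terms that contain regex metacharacters.
    pattern = re.compile(rf"\b{re.escape(term)}\b", re.IGNORECASE)
    raw = 0
    unique_posts = 0
    for post in posts:
        n = len(pattern.findall(post))
        raw += n
        if n:
            unique_posts += 1
    return raw, unique_posts
```

Dividing the unique-post count by the number of posts per state would give a share of posts rather than a share of words, which is probably closer to what a real hypothesis test would want.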

Still, as a jumping off point, it’s fascinating.


You can spend an entire afternoon playing with the geographic concentration of various drug slang terms, corroborating them with census data or otherwise exploring the relation between drugs and sex.

Although terms like “weed” and “420” are pretty popular across the board, you can definitely detect a light clustering around states like Colorado, California, Oregon, et al., where marijuana has been legalized for either recreational or medicinal use.

420 frequency choropleth


Did you know that some people in our wonderful country harbor elaborate sexual fantasies about our 45th president?

The way that Trump and his hyper-machismo has played out over the past few years has, it turns out, fed the erotic imagination of a whole cohort of Americans. In the 2017 data, you can see clear evidence of Trump’s entry into American public life in the way he and his paleolithic approach to gender politics appear in sexual fantasies. Here are two men advertising for group sex in Long Island, specifically referencing Trump and Clinton and conservatism as this weird sexual shorthand for American masculinity and “real men.”

While you voted for Clinton (and may truly detest Trump) and are very very liberal, you want to be fucked and used by two Republican men who will give you what you have been missing in bed. You know that your friends do not know how kinky and nasty you think. You’re probably in your twenties or thirties but want to give your body up to real men who will use your body as their personal playground.

Many Trump references are also along the lines of “I’ll never fuck a Trump supporter.” Here’s a man looking for a woman in Boulder with a very specific request.

Put something about hating Trump in the subject header….Looking forward to taking care of you.

Partisan self-sorting in action!

The way Trump and politics have seeped into every part of our culture is perhaps best expressed by the fact that he even makes an appearance in our most lurid fantasies. Not even the bedroom is an escape.

Pitching in Modern Publishing

Publishing is feeling a crunch. As editors go for surefire successes like celebrity cookbooks or ripped-from-the-headlines political memoirs, the space for the non-elite to publish is shrinking.

But there are still a good number of literary agencies that accept unsolicited manuscripts. The advantage of nonfiction is that at least the bar for submission is a bit higher: Presumably nonfiction authors have more skills / knowledge to distinguish them than a fiction hopeful submitting the next Great American Novel. Nevertheless, query letters often languish in inboxes alongside a million other unsent-for and unwanted pitches.

After writing my query letter and doing the first rough draft of my book proposal, I created a simple spreadsheet CRM for the agents I was contacting, listing their name, agency, contact info, and whether I had contacted them (and when) and if they’d responded (and when). Then I went down the list.

You can send query letters or book proposal snippets for months without a response. Most agencies won’t take the time to reach out to the authors of unsuccessful submissions; there are simply too many of them.

I was extraordinarily lucky to get interest right away. My query letter had a pretty good response rate (maybe 10%) and I was having a back and forth (even a phone call!) with a few different agents kicking the tires on the idea.

But there were a few things I was missing, and they would become clear as my conversations with interested parties evolved.

Why I Failed

There are several reasons why this project failed.

1. I didn’t have a platform - and I didn’t succeed at borrowing one.

I got the feedback from literary agents early that the book was interesting and held a sort of lurid appeal - but without a platform the project was stillborn.

Now, there is still a way to write this book - or rather, a way that books like this typically get published. The steps would be:

A. Publish an article in a general interest magazine like The Atlantic, The Outline, the New Yorker, or the culture section of the New York Times or the Washington Post.
B. Use that buzz and (ideally) rabid social media discussion to start pitching agents.

It’s just a two-step process - so simple!

In addition to giving you the idea validation you so desperately need, working with a professional magazine editor will make your pitch tighter and more focused. It also parallels the type of audience you’ll be making your appeal to in the actual book, the “educated layman” who is curious, analytical and informed (in their own domain), but ultimately not familiar with the topic at hand, or only superficially.

It also forces you to distill your idea into a narrative. Part of the weakness of this idea (and my proposal) is that I found all these interesting tidbits, but they don’t add up to anything. For this book to work, each fact has to be building on and furthering a larger idea, and not just illuminating some interesting piece of data. This feeds into failure reason #2.

2. I didn’t have a narrative

A story. It turns out you need one. I tried different angles on the overall subject matter (“An Atlas of American Kink”-style coffee table book, a personal memoir about all the weird stuff I was wading through, a topical analysis of the material tying the site to trends in our generally oversexed culture) but the fact that I didn’t have a coherent arc tying together my planned chapters made it a non-starter. So many of the agents I talked to were excited about the general premise, but frankly needed to see more to commit.

3. I don’t have the statistical authority for this project. And I don’t have a partner who does.

Something I love about programming is how dead easy it is to play with ideas and services you simply do not understand. With some of what I was doing, I have the art to apply the tool, but not necessarily to verify it. In addition to the above frequency, concordances, and visualizations, which are pretty straightforward, I also tried things like sentiment analysis. Sentiment analysis and other techniques are fun and can be surprisingly effective out-of-the-box, but you really need someone who understands and works with NLP on a deep enough level to make those careful tweaks necessary for an ironclad, publishable project. There are currently analytical tools and graphics (even in this post!) that are a bit straightforward - to the point of being naive. Some of that is intentional - this is just a first pass to get people interested, and the task simply requires more time and attention - and some of it is because the project needs a technical co-author (or at least a technical editor) to weigh in.

4. The service doesn’t exist anymore!

Though I’d mostly given up before this happened, in early 2018 Craigslist axed its entire personals section over a fear it could be liable under a new law attacking sex trafficking (Craigslist, along with similar sites like Backpage, sometimes acted as a haven for both conventional prostitution and sex trafficking). Although it makes my own data set rarer, the end of Craigslist personals naturally makes anything on the subject seem dated and unimportant.

Or What’s a Heaven For?

Along with Tumblr later deciding to forgo XXX content, Pervert Internet has had a bad year. The Craigslist personals were a weird, messed-up, sometimes abusive and sometimes transcendent repository for our collective lizard brain - or at least, the part of it that fucks.

Losing it, we lose yet another part of the internet whose colorful offensiveness was part of its perhaps gross, always profane, but never boring charm.

Thinking about my experiences with this project, I’m drawn back to Robert Browning’s poem. It might be worth it to pull a few extra lines from “Andrea del Sarto”

I, painting from myself and to myself,
Know what I do, am unmoved by men’s blame
Or their praise either. Somebody remarks
Morello’s outline there is wrongly traced,
His hue mistaken; what of that? or else,
Rightly traced and well ordered; what of that?
Speak as they please, what does the mountain care?
Ah, but a man’s reach should exceed his grasp,
Or what’s a heaven for? All is silver-grey,
Placid and perfect with my art: the worse!
I know both what I want and what might gain,
And yet how profitless to know, to sigh
“Had I been two, another and myself,
“Our head would have o’erlooked the world!” No doubt.

If I’d only been able to partner with UT’s Computational Media Lab or gotten accepted into a conference where I could workshop the talk - or found the right adjunct professor or data scientist to collaborate with - I’m sure I’d be right at the top of the general interest nonfiction bestseller list, sipping champagne with Bill Bryson and Malcolm Gladwell.

No doubt.