Exploring the NYPD Civilian Complaints dataset

In this post is I’m going to use my work-in-progress data visualization WordPress plugin to explore the Civilian Complaints Against New York City Police Officers dataset published by ProPublica (source) to demonstrate how my plugin works.

While the plugin is under active development, you will have to download it from its GitHub repo and upload it manually to your WordPress blog.

The main philosophy behind this plugin is that you will use your favorite command line tools or apps to clean up your data and use the plugin as a minimal presentation layer.

To work with the example dataset, I am going to use the csvkit command line program. I haven’t used it before, but it supports SQL syntax, which I’m fairly proficient with. You can, of course, follow along with your preferred tool.

First, let’s see what we’re working with. This command will list the names of columns in our dataset.

csvcut --names nyc-cops-allegations-2020.csv

# 1: unique_mos_id
# 2: first_name
# 3: last_name
# 4: command_now
# 5: complaint_id
# 6: month_received
# 7: year_received
# 8: month_closed
# 9: year_closed
# 10: command_at_incident
# 11: rank_abbrev_incident
# 12: rank_abbrev_now
# 13: rank_now
# 14: rank_incident
# 15: mos_ethnicity
# 16: mos_gender
# 17: mos_age_incident
# 18: complainant_ethnicity
# 19: complainant_gender
# 20: complainant_age_incident
# 21: fado_type
# 22: allegation
# 23: precinct
# 24: contact_reason
# 25: outcome_description
# 26: board_disposition

Now, let’s look at some of the interesting columns. Let’s start with something simple. Let’s do a breakdown of complaints by officer’s rank.

csvsql --query "SELECT rank_incident AS Rank, count(rank_incident) AS Incidents FROM 'nyc-cops-allegations-2020' GROUP BY rank_incident" nyc-cops-allegations-2020.csv > nypd-rank-incident.csv

This creates a CSV file with a list of officer ranks and a count of complaints for each rank. The exported data doesn’t have to be sorted, the plugin has a few basic options, like sorting or limiting the number of data points.

You can visit the project page for examples of the plugin’s UI, here’s a short clip of adding a sample CSV file.

And here’s our own file visualized.

Civilian Complaints Against NYPD Officers
RankIncidents
Police Officer22,509
Sergeant5,934
Detective3,307
Lieutenant1,301
Captain182
Deputy Inspector96
Inspector27
Chiefs and other ranks2

Note that I am using a logarithmic scale, a common way to deal with datasets with a large difference between values. Remember to disclose when using a logarithmic scale, as this can create confusion.

I can also render this dataset as a table, and since I use Bootstrap on my site, I can use one of the table classes to make the data easier to read.

And here’s the result, styled with Bootstrap’s CSS.

Civilian Complaints Against NYPD Officers
RankIncidents
Police Officer22,509
Sergeant5,934
Detective3,307
Lieutenant1,301
Captain182
Deputy Inspector96
Inspector27
Chiefs and other ranks2

Next, let’s see if we can see a pattern in frequency of complaints throughout the year.

csvsql --query "SELECT month_received AS Month, count(month_received) AS Incidents FROM 'nyc-cops-allegations-2020' GROUP BY month_received" nyc-cops-allegations-2020.csv > nypd-month-received.csv

The result of this query will use numbers for months, so to make the data easier to read, we can open the exported file and change the numbers to names of months.

And now we can upload the dataset.

Civilian Complaints Against NYPD Officers
MonthIncidents
January2,697
February2,844
March3,154
April2,881
May2,968
June2,780
July2,727
August2,980
September2,966
October2,776
November2,353
December2,232

Now, let’s look at the officers with most complaints. We can use the following command to get the full, unsorted list of officers with a count of complaints filed against them.

csvsql --query "SELECT first_name || ' ' || last_name AS Officer, count( first_name || ' ' || last_name ) AS Incidents FROM 'nyc-cops-allegations-2020' GROUP BY first_name || ' ' || last_name" nyc-cops-allegations-2020.csv > nypd-officers.csv

We could then use options provided by the data visualization block to sort the values and limit to a specific number of officers. We could also change our query to apply the sorting directly and trim the dataset, resulting in a smaller file.

csvsql --query "SELECT first_name || ' ' || last_name AS Officer, count( first_name || ' ' || last_name ) AS Incidents FROM 'nyc-cops-allegations-2020' GROUP BY first_name || ' ' || last_name  ORDER BY Incidents DESC LIMIT 10" nyc-cops-allegations-2020.csv > nypd-officers-sorted.csv
Civilian Complaints Against NYPD Officers
OfficerIncidents
Mathew Reich75
Daniel Sbarra75
Joseph Tallarine73
Gary Messina73
Christophe Mccormack72
William Taylor65
David Cheesewright63
Paul Mcmahon56
Mike Civil56
Michael Raso50
# Breakdown by ethnicity
csvsql --query "SELECT complainant_ethnicity AS Ethnicity, count(complainant_ethnicity) AS Incidents FROM 'nyc-cops-allegations-2020' GROUP BY complainant_ethnicity" nyc-cops-allegations-2020.csv > nypd-complainant_ethnicity.csv

# Breakdown by gender
csvsql --query "SELECT complainant_gender AS Gender, count(complainant_gender) AS Incidents FROM 'nyc-cops-allegations-2020' GROUP BY complainant_gender" nyc-cops-allegations-2020.csv > nypd-complainant_gender.csv

# Breakdown by age
csvsql --query "SELECT complainant_age_incident AS Age, count(complainant_age_incident) AS Incidents FROM 'nyc-cops-allegations-2020' GROUP BY complainant_age_incident" nyc-cops-allegations-2020.csv > nypd-complainant_age_incident.csv

The results will need a bit of a cleanup.

First of all, let’s take the age breakdown and group the values, say, 1-18, 19-45, and 45+.

csvsql --query "SELECT [complainant_age_incident], case when [complainant_age_incident] between 1 and 18 then '1-18' when [complainant_age_incident] between 19 and 45 then '19-45' when [complainant_age_incident] > 45 then '45+' end as [age_group]from 'nyc-cops-allegations-2020'" nyc-cops-allegations-2020.csv > nypd-complainant-age-breakdown.csv

This will create a new file which we can then process.

csvsql --query "SELECT age_group AS 'Age Group', count(age_group) AS Incidents FROM 'nypd-complainant-age-breakdown' GROUP BY age_group" nypd-complainant-age-breakdown.csv > nypd-complainant-age-incident.csv

We can also remove empty rows from our files. Importing each dataset will give us this series of charts.

Age GroupIncidents
1-182,679
19-4521,244
45+4,615
GenderIncidents
Female5,021
Gender non-conforming2
Male24,058
Not described57
Transman (FTM)5
Transwoman (MTF)20
EthnicityIncidents
American Indian64
Asian532
Black17,114
Hispanic6,424
Other Race677
Refused259
Unknown1,041
White2,783

The results are pretty predictable here, particularly when looking at breakdown by ethnicity, with Black people being highly overrepresented, as proved by virtually every study ever conducted on this topic.

The chart showing the gender breakdown is hard to read due to the numbers for Male and Female are much bigger than the rest. We could make a note of this and ask the reader to use the filters above the chart to filter out individual columns, or we could perhaps create one chart showing counts for Male, Female, and Others, and a separate chart that breaks down Others. Or we could use a bar chart with a logarithmic scale, as we did with our first chart.

Civilian Complaints Against NYPD Officers
GenderIncidents
Male24,058
Female5,021
Not described57
Transwoman (MTF)20
Transman (FTM)5
Gender non-conforming2

These are all fairly simple examples in terms of what we’re doing with data, and you can go on and create more advanced comparisons, and the plugin will be able to render them for you.

Just to use one example, let’s see if we can spot a correlation between ethnicity of the person filing a complaint, and how long it took to resolve it.

We have the following columns to work with:

  • complainant_ethnicity
  • year_received
  • month_received
  • year_closed
  • month_closed

Without a day of the month, it’s hard to get accurate results, since a complaint opened on March 28th and closed on April 1st would have the same values as one that was opened on March 1st and closed at the end of April.

For the sake of a demonstration, I’m going to use the following query:

csvsql --query "SELECT complainant_ethnicity AS Ethnicity, round( avg( CAST( round( CAST( JulianDay( CAST( year_closed AS INTEGER ) || '-' || substr( '00' || CAST( month_closed AS INTEGER ), -2 ) || '-01' ) - JulianDay( CAST( year_received AS INTEGER ) || '-' || substr( '00' || CAST( month_received AS INTEGER ), -2 ) || '-01' ) AS INTEGER )/30.5, 0 ) AS INTEGER ) ), 1 ) as 'Resolution Length' FROM 'nyc-cops-allegations-2020' GROUP BY complainant_ethnicity" nyc-cops-allegations-2020.csv > nypd-complaint-ethnicity-resolution-length.csv

Again, the results will not be perfect and only serve for demonstration purposes. Do note that I combined values with missing ethnicity information with the “Unkown” column.

EthnicityResolution Length
Black11 months
Hispanic10 months
Other Race10 months
White9 months
American Indian9 months
Asian9 months
Refused9 months
Unknown8 months

And that’s a wrap. If this project looks interesting to you, feel free to reach out via email or on Twitter.

More tutorials

A tinted, zoomed in screenshot of a JSON object showing server information about a Mastodon instance.
A tinted screenshot of two charts, one showing the popularity of various fediverse platforms (with Mastodon far ahead of the rest), and the other chart showing distribution of domain creation dates, mostly clustered around 2023.
A tinted screenshot showing the @mtaupdates Mastodon profile and a few example posts with subway status alerts.

💻 Browse all