In this post is I’m going to use my work-in-progress data visualization WordPress plugin to explore the Civilian Complaints Against New York City Police Officers dataset published by ProPublica (source) to demonstrate how my plugin works.
The main philosophy behind this plugin is that you will use your favorite command line tools or apps to clean up your data and use the plugin as a minimal presentation layer.
To work with the example dataset, I am going to use the csvkit command line program. I haven’t used it before, but it supports SQL syntax, which I’m fairly proficient with. You can, of course, follow along with your preferred tool.
First, let’s see what we’re working with. This command will list the names of columns in our dataset.
csvcut --names nyc-cops-allegations-2020.csv # 1: unique_mos_id # 2: first_name # 3: last_name # 4: command_now # 5: complaint_id # 6: month_received # 7: year_received # 8: month_closed # 9: year_closed # 10: command_at_incident # 11: rank_abbrev_incident # 12: rank_abbrev_now # 13: rank_now # 14: rank_incident # 15: mos_ethnicity # 16: mos_gender # 17: mos_age_incident # 18: complainant_ethnicity # 19: complainant_gender # 20: complainant_age_incident # 21: fado_type # 22: allegation # 23: precinct # 24: contact_reason # 25: outcome_description # 26: board_disposition
Now, let’s look at some of the interesting columns. Let’s start with something simple. Let’s do a breakdown of complaints by officer’s rank.
csvsql --query "SELECT rank_incident AS Rank, count(rank_incident) AS Incidents FROM 'nyc-cops-allegations-2020' GROUP BY rank_incident" nyc-cops-allegations-2020.csv > nypd-rank-incident.csv
This creates a CSV file with a list of officer ranks and a count of complaints for each rank. The exported data doesn’t have to be sorted, the plugin has a few basic options, like sorting or limiting the number of data points.
You can visit the project page for examples of the plugin’s UI, here’s a short clip of adding a sample CSV file.
And here’s our own file visualized.
|Chiefs and other ranks||2|
Note that I am using a logarithmic scale, a common way to deal with datasets with a large difference between values. Remember to disclose when using a logarithmic scale, as this can create confusion.
And here’s the result, styled with Bootstrap’s CSS.
|Chiefs and other ranks||2|
Next, let’s see if we can see a pattern in frequency of complaints throughout the year.
csvsql --query "SELECT month_received AS Month, count(month_received) AS Incidents FROM 'nyc-cops-allegations-2020' GROUP BY month_received" nyc-cops-allegations-2020.csv > nypd-month-received.csv
The result of this query will use numbers for months, so to make the data easier to read, we can open the exported file and change the numbers to names of months.
And now we can upload the dataset.
Now, let’s look at the officers with most complaints. We can use the following command to get the full, unsorted list of officers with a count of complaints filed against them.
csvsql --query "SELECT first_name || ' ' || last_name AS Officer, count( first_name || ' ' || last_name ) AS Incidents FROM 'nyc-cops-allegations-2020' GROUP BY first_name || ' ' || last_name" nyc-cops-allegations-2020.csv > nypd-officers.csv
We could then use options provided by the data visualization block to sort the values and limit to a specific number of officers. We could also change our query to apply the sorting directly and trim the dataset, resulting in a smaller file.
csvsql --query "SELECT first_name || ' ' || last_name AS Officer, count( first_name || ' ' || last_name ) AS Incidents FROM 'nyc-cops-allegations-2020' GROUP BY first_name || ' ' || last_name ORDER BY Incidents DESC LIMIT 10" nyc-cops-allegations-2020.csv > nypd-officers-sorted.csv
# Breakdown by ethnicity csvsql --query "SELECT complainant_ethnicity AS Ethnicity, count(complainant_ethnicity) AS Incidents FROM 'nyc-cops-allegations-2020' GROUP BY complainant_ethnicity" nyc-cops-allegations-2020.csv > nypd-complainant_ethnicity.csv # Breakdown by gender csvsql --query "SELECT complainant_gender AS Gender, count(complainant_gender) AS Incidents FROM 'nyc-cops-allegations-2020' GROUP BY complainant_gender" nyc-cops-allegations-2020.csv > nypd-complainant_gender.csv # Breakdown by age csvsql --query "SELECT complainant_age_incident AS Age, count(complainant_age_incident) AS Incidents FROM 'nyc-cops-allegations-2020' GROUP BY complainant_age_incident" nyc-cops-allegations-2020.csv > nypd-complainant_age_incident.csv
The results will need a bit of a cleanup.
First of all, let’s take the age breakdown and group the values, say, 1-18, 19-45, and 45+.
csvsql --query "SELECT [complainant_age_incident], case when [complainant_age_incident] between 1 and 18 then '1-18' when [complainant_age_incident] between 19 and 45 then '19-45' when [complainant_age_incident] > 45 then '45+' end as [age_group]from 'nyc-cops-allegations-2020'" nyc-cops-allegations-2020.csv > nypd-complainant-age-breakdown.csv
This will create a new file which we can then process.
csvsql --query "SELECT age_group AS 'Age Group', count(age_group) AS Incidents FROM 'nypd-complainant-age-breakdown' GROUP BY age_group" nypd-complainant-age-breakdown.csv > nypd-complainant-age-incident.csv
We can also remove empty rows from our files. Importing each dataset will give us this series of charts.
The results are pretty predictable here, particularly when looking at breakdown by ethnicity, with Black people being highly overrepresented, as proved by virtually every study ever conducted on this topic.
The chart showing the gender breakdown is hard to read due to the numbers for Male and Female are much bigger than the rest. We could make a note of this and ask the reader to use the filters above the chart to filter out individual columns, or we could perhaps create one chart showing counts for Male, Female, and Others, and a separate chart that breaks down Others. Or we could use a bar chart with a logarithmic scale, as we did with our first chart.
These are all fairly simple examples in terms of what we’re doing with data, and you can go on and create more advanced comparisons, and the plugin will be able to render them for you.
Just to use one example, let’s see if we can spot a correlation between ethnicity of the person filing a complaint, and how long it took to resolve it.
We have the following columns to work with:
Without a day of the month, it’s hard to get accurate results, since a complaint opened on March 28th and closed on April 1st would have the same values as one that was opened on March 1st and closed at the end of April.
For the sake of a demonstration, I’m going to use the following query:
csvsql --query "SELECT complainant_ethnicity AS Ethnicity, round( avg( CAST( round( CAST( JulianDay( CAST( year_closed AS INTEGER ) || '-' || substr( '00' || CAST( month_closed AS INTEGER ), -2 ) || '-01' ) - JulianDay( CAST( year_received AS INTEGER ) || '-' || substr( '00' || CAST( month_received AS INTEGER ), -2 ) || '-01' ) AS INTEGER )/30.5, 0 ) AS INTEGER ) ), 1 ) as 'Resolution Length' FROM 'nyc-cops-allegations-2020' GROUP BY complainant_ethnicity" nyc-cops-allegations-2020.csv > nypd-complaint-ethnicity-resolution-length.csv
Again, the results will not be perfect and only serve for demonstration purposes. Do note that I combined values with missing ethnicity information with the “Unkown” column.
|Other Race||10 months|
|American Indian||9 months|