How to Find Data

Embed from Getty Images

There’s a lot of data out there. But where do you start to find what you need?

Some basic strategies that work pretty well: Google it. That’s never a bad place to start and it only takes a second. (And use Google in a smart way. Use key words specific to your data. Use filetype: to narrow your search for specific file types. For example, use filetype: csv for only csv file formats. Use the results to dig deeper and discover related agencies that may have the data).

  • Figure out who should have the data? Who might have it? Is this information only the NYPD or the IRS can collect? The Departments of City Planning, Buildings, Housing, Finance and Taxation all keep tabs on who owns property in New York City, where that property is located and what it can be used for. If you know who ought to have the numbers you’re looking for, you can start your search by asking them.
  • Look at recent reporting about the subject. Who has been releasing reports? Who has been cited in stories? Go ask them for data, or ask them for help finding it.
  • Wikipedia is a fantastic resource. Don’t be afraid of it. Most information there comes with a citation — don’t take some Wikipedia author’s word for it, but do look at the source they cited and confirm that the numbers are there.
  • Look for think tanks and aid organizations that specialize in the issue you’re interested in.
  • Ask a librarian

Know your sources

You can get data anywhere, so it is up to you to decide whether or not you’re working with reliable data. You should know where your sources are coming from — do they have an agenda that can help you understand how they’re framing the data they put out? You can roughly guess who is behind NRA Institute for Legislative Action, but what about Law Center to Prevent Gun Violence? Don’t assume that a think tank is reliable just because it kind of feels professional.

A famous example is the misleading website www.martinlutherking.org. Though the site appears to be an informational site about the civil rights leader Martin Luther King, Jr., it actually is a mouthpiece for the white supremacist group Stormfront.org. You can verify the ownership of domain sites using www.betterwhois.com.

Provenance

It is also up to you to know where your data is coming from. Did the organization hire a research firm to conduct a comprehensive study? Or did they post a little box on their website asking visitors how they feel?

Be skeptical: an advocate (or government agency) insisting that these numbers mean something doesn’t make it so.

Where to look?

The Journalism School’s Research Center maintains an excellent roundup of guides, many of which will point you to great data sets. Check out the census, business and crime guides in particular.

NICAR’s database library is a great resource. So is Amanda’s tumblr’s “data sources” tag.

Here’s a working guide from last semester: https://github.com/amandabee/cunyjdata/wiki/Where-to-Find-Data

Advertisements

Spreadsheets Walkthrough

Delimiters and Functions with Flu Data

To review spreadsheet basics, download the data from Google’s Flu Trends. The data is just text with a lot of commas. The goal is to get your data into tidy rows and columns in a spreadsheet, so you can start looking for interesting patterns or trends.

  1. Copy all the data (Cmd+A, then Cmd+C) and in Excel, paste the data (Cmd+V).
  2. The pasted data appears all in one column.

  1. Select column A, and choose Data > Text to Columns.

  1. The Convert Text to Columns Wizard appears, which allows you to define the “delimiter”, or the character that separates columns of data. Choose the comma as your delimiter. You can see a preview of the re-formatted data in the preview window below.

  1. In the next section of the Text to Column dialog box, you can format the date data so it’s preserved as Year-Month-Date.

  1. Now that all the data has been successfully transferred to your spreadsheet, you can start using formulas and/or other sorting/ filtering functions to explore the data. Let’s find the maximum number of flu searches in each country. At the bottom of the spreadsheet, at the end of column B, enter =Max(). All formulas begin with the equals sign, then the function name, and then a pair of parentheses. In between the parentheses, enter the range of data from which you want to find the maximum. You can select the cells with your mouse, or enter the beginning and end cell, separated with a colon.

  1. To extend your Max function to ALL your columns, simply click the bottom-right corner of the cell and drag it to the right. The function is extended and Excel “auto-increments” the range so the maximum is determined for each appropriate column.

Now that you know the maximum value for each country, you can create an =Max() function to identify which country (column) has the maximum. Your range would be the cells in the row that displays the max values.

You don’t always need to use functions. Use Data > Sort (Shift+Command+R) to sort your spreadsheet along a single column, in either descending or ascending order. That’s an easy way to re-order your data to see maximums and minimums.

Screen Shot 2013-09-12 at 4.30.55 PM

Screen Shot 2013-09-12 at 4.31.27 PM

Charting and Visual Encoding

These are the studies and readings that we discussed in class regarding visual encoding of your data:

Summary findings of encodings, from most accurate to least accurate:

  1. Position
  2. Length
  3. Angle
  4. Area
  5. Density and color saturation
  6. Color hue

Know some of the common chart types:

  • Bar charts: Trends for categories
  • Line charts: Trends for continuous series/continuous changes between x-axis (time series)
  • Scatter plot: correlation
  • Bubble plot: scatter plot + additional variable
  • Pie chart: show proportions
  • Area charts/stacked graphs: proportions

Excel Charts to SVG

Bitmap formats (JPEG, PNG, BMP, GIF) are images that are displayed with pixels, or tiny colored dots. Vector graphics, on the other hand, are rendered by the computer based on mathematical formulas and code. For example, a circle would be defined by a radius of a certain size, the color of the outline, the color of the filled inside, its location, and so on. Because vector graphics are defined mathematically, they don’t have a fixed resolution. You can zoom in or out of a vector image, and the display remains sharp. If you zoom into a bitmap file, you’ll see the pixels that make up the image.An example of charts rendered as vector graphics, from the Guardian.

 

Microsoft Excel can output charts as JPEG, GIF, BMP, or PNG (bitmap formats), or as PDF (vector). You can convert the PDF file into an SVG (Scaleable Vector Graphics), which is a common format for displaying vector graphics in a browser. Your images will be sharper and easier to edit in a vector graphics editing program like Adobe Illustrator.

 

  1. In Microsoft Excel, size your chart (width and height) in the FORMAT tab.
    Screen Shot 2014-03-25 at 10.21.59 AM
  2. Select the text in the horizontal and vertical axes, and choose Format Selection to change the Font to Arial. The default Excel font is a bit unusual and sometimes throws off the conversion of text in later steps.
    Screen Shot 2014-03-25 at 10.23.27 AM
  3. Right-click (Ctrl-click) on the chart frame and choose Save As Picture.
    Screen Shot 2014-03-25 at 10.22.30 AM
  4. Save your chart picture as a PDF.
    Screen Shot 2014-03-25 at 10.22.41 AM
  5. Open the PDF in Adobe Illustrator. If you’re comfortable with Illustrator, you can edit your chart.
    Screen Shot 2014-03-25 at 10.46.38 AM
  6. Choose File > Save As, and save as a SVG (compressed optional).
  7. If you don’t have access to Illustrator, you can use the free online tool CloudConvert(https://cloudconvert.org/svg-to-pdf), which can convert your PDF to SVG. The only downside is that you can’t edit your graphics.
  8. The resulting SVG file is simply an HTML text file that describes the shapes for your chart. You can open it up in a browser, or copy and paste it into another HTML document to display your chart.
    Screen Shot 2014-03-25 at 10.52.10 AM

Unfortunately, you can’t just paste the SVG code directly into a WordPress post, as WP (out of the box) doesn’t support SVG formats. There are plug-ins that allow SVG uploads and other hacks, but a simple solution is to upload your SVG to DigitalStorage and iframe it into your WP post, like this: (but WordPress.com doesn’t allow certain iframes)

http://russellchun.com/cuny/datajournalism/samplevector.html

Data Viz Checklist

Embed from Getty Images

A complete data visualization tells a whole story, not half a story. This checklist will help ensure that your story is complete. You should also review ProPublica’s Style Guide, which quite thorough and thoughtful.

The Data

[ ] Question the context Do you understand the context this data was collected in? Your captions should reflect that context.

[ ] Know and normalize your numbers. Did you say “average” when you really mean “median”? Are your numbers normalized so that you’re comparing apples and apples? If you need to combine some columns or drop others, are your choices rational (and not simply convenient?)

[ ] Talk to someone who understands the numbers Your story isn’t complete unless you’ve actually spoken with at least one source who has some familiarity with the numbers that you’re looking at.

[ ] Link to the raw data Newsrooms will take different positions on this one, but for class assignments, you must include a link to both the source of your data (the site you downloaded it from) and the data you used to generate the visualization. If those are one and the same, great. But if you had to clean the data or manipulate it to make your visualization, share that cleaned data with your readers.

[ ] Cite your sources Who did you get the data from, and when? Provenance matters: if the ACLU has compiled a years worth of quarterly NYPD reports and published them as a single free standing data set, did you get the data from NYPD or ACLU?

The Visualization

[ ] Lighten the cognitive load Don’t make your readers work. Every visual translation is a layer between content and reader. If readers have to flip between tabs to make comparisons, or refer to too many legends, they’re working too hard understand your story.

[ ] Choose visual encoding wisely What do properties like position, length, angle, area, hue, saturation, or brightness represent in your visualization?

[ ] Focus on the data Are you maximizing the data-to-pixel ratio? Use pixels wisely by avoiding chart chunk and decoration that doesn’t add to a reader’s understanding of the data.

The Text

[ ] Add captions that help the story along A good caption expands on your visualization without describing it. I could say “this checklist includes the ingredients for a complete data visualization” but you already know you’re looking at a checklist. So use the caption to add something.

[ ] Hyperbole is the death of a story. Make sure you can back up every statement of fact in your text. If you need to introduce a story with a line like “More Americans than ever are buying pants these days” make sure you know it’s true, even if your story is not about the growth in the pants market at all. Always be accurate.

[ ] Never “click here” — if you have to say “click here” your project is missing something. Show it, don’t say it.

[ ] Put your reporting in context with links. You aren’t reporting in a vacuum — highlight the reports and reporting that put your story in context.

[ ] Edit for redundancy Every word counts. If you find that you’re using the same phrase again and again, that’s a sign that you need to rework your text. Redundant:

Total Arrests in 2008: 15
Total Arrests in 2009: 39
Total Arrests in 2010: 422
Total Arrests in 2011: 3

Much better:

Total arrests:
2008: 15
2009: 39
2010: 422
2011: 3

[ ] Edit for brevity by looking for filler text and rewrite it out. Phrases like “here you can see” or “in this chart” or “this graph shows” just fill up precious space. Tell the reader why the numbers are interesting (and do it without saying “these numbers are interesting because …”)

[ ] Edit for grammar and style Take one last pass and check your grammar.

How to make your Excel charts look more professional

The default settings in Excel spit out pretty terrible looking charts. But it’s not too difficult to turn something like this:

Screen Shot 2013-10-21 at 4.07.54 PM

Into something more like this:

Screen Shot 2013-10-21 at 4.08.03 PM

Note the specific steps for the transformation:

  • The move from a legend to direct labeling
  • The reduction of excessive axes lines and tick mark labels to reduce visual clutter
  • Highlighting the data of interest and de-emphasizing the others
  • Adding the headline, text, and source

Follow this tutorial from Storytelling with Data to see the step-by-step guide that took the default graph style to its more refined, and ultimately more effective, finished state.

Finally: an animated GIF of the process of stripping away unnecessary graphics to focus on the data and the message:

Pivot Table Walk-through

In this walk-through, we’ll use Olympic athlete data from London’s 2012 summer olympics. UseGuardian London Olympic data — look for the “download the data” link and be sure to save a copy so you can edit it. This blog post will use Google Spreadsheet’s Pivot Table function.

You can summarize your data with pivot tables in Excel, Google Spreadsheets or LibreOffice Calc. The details will between software, but the basic steps are the same. Follow these steps to look at the data in pivot tables on Google Spreadsheets:

1. Data > Pivot Table Report

2. A new tab gets created at the bottom with your Pivot table report. The table is empty, and on the right-hand side are fields for rows, columns, values, and filters.
3. Click Rows > Add Field, and choose Country name.
The country names appear on the left side of the matrix (as rows).
4. Click Columm > Add Field, and choose Sex.
The values for Sex, “F” and “M”, appear on the top of the matrix (as columns).
5. Now you have to populate the matrix with values. What to choose? Click Values > Add Field, and choose Name.
The Name category contains each athlete, so if you want to count the number of athletes from each country that are male or female, then this is the right selection. The matrix fills in. However, the values and totals don’t look right. The table is filled with “0”s.
6. For Values, choose Summarize by > CountUnique.
Pivot tables counts each unique entry in the Name category. Now your pivot table tabulates all the athletes according to Country and Sex. You can see how there are many different options for displaying values. (You should really be using COUNTA, which counts all entries, and not CountUnique, which only counts unique entries. As you recall from class, if any athletes have identical names, then you’ll have an undercount!).
7. You can sort the rows differently. Right now, they are sorted by country, alphabetically. Choose Rows > Sort by > CountUnique of Name in… >Sex > F.
Now your pivot table re-orders the rows by the number of Female athletes. Which country had the most female athletes? The United States.
8. Click Filter > Add Field  and choose Sport name. Choose Show to get a list of all the values for Sport name.
9. Choose Clear, and then select only the sports that you want to see. For example, choose Badminton, and the OK. Your pivot table updates to show only the male and female athletes that play Badminton.
You should see that China sent the most female athletes in Badminton (8 athletes).
NOTE: If the “Report Editor” on the right-hand side (with the fields for Rows, Columns, Values, and Filters) disappears, don’t fret! You can get it back by simply clicking on any data inside of your the pivot table.