Lesson goal: Data Science: Messy geography of the U.S.


Here's a data science exercise with some very raw data. Geographical data of the United States (adapted from Turbo Prolog's circa-1990 "GeoBase" code) is in a 700-line CSV file called geo.csv. Lines can have different numbers of columns, and the entries on a given line can be numbers or strings. And, unlike sample.dat and co2.csv from the other exercises, this file is too big to be processed manually (i.e. "by eye" or "by hand").

The clue to what's in a line is the string in the very first column, which can be state, city, river, border, highlow, mountain, road, or lake. The columns that follow this identifier depend on its value:

  • state: name, abbreviation, capital, area, admission-rank, population, city1, city2, city3, city4
  • city: state-it's-in, abbreviation, name, population
  • river: name, length, states-the-river-runs-through
  • border: state, abbreviation, states-that-share-the-border
  • highlow: state, abbreviation, highest-point, height, lowest-point, height
  • mountain: state-it's-in, abbreviation, name, height
  • lake: name, area, states-the-lake-is-in
  • road: number, states-the-road-passes-through
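Because the first column decides how the rest of a line should be read, one possible starting point is to group the rows by that identifier. The sketch below is only an illustration, not the lesson's own code: the sample rows are made up to match the column layout above (the real geo.csv will differ), and the helper names `convert` and `load_records` are hypothetical. It also assumes the file is plain comma-separated text that Python's standard csv module can read.

```python
import csv
import io
from collections import defaultdict

# Made-up rows that only mimic the layout described above; not real geo.csv data.
SAMPLE = """\
state,examplia,ex,capitol city,1000,1,50000,city a,city b,city c,city d
city,examplia,ex,city a,20000
river,long water,1200,ex,zz
river,short water,300,ex
border,examplia,ex,zedland
"""

def convert(field):
    """Return the field as an int or float when it parses as one, else as a string."""
    field = field.strip()
    for cast in (int, float):
        try:
            return cast(field)
        except ValueError:
            pass
    return field

def load_records(handle):
    """Group rows by the identifier in the first column; rows may differ in length."""
    records = defaultdict(list)
    for row in csv.reader(handle):
        if not row:  # skip blank lines
            continue
        kind = row[0].strip()
        records[kind].append([convert(f) for f in row[1:]])
    return records

# For the real file, something like:
#   with open("geo.csv", newline="") as f:
#       records = load_records(f)
records = load_records(io.StringIO(SAMPLE))

print(sorted(records))     # the identifiers that occur in the sample
print(records["river"])    # every river row, minus the identifier column
```

Grouping first keeps later questions simple: each one only needs to look at `records["state"]`, `records["river"]`, and so on, using the column positions from the list above.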


There are many questions one might ask about this data. Here are some you might try:
  • What are the names and abbreviations of all states in the U.S.?
  • What states start with a 'C'?
  • What cities are in California?
  • What is the biggest city in the U.S.?
  • What is the longest river in the U.S.?
  • Which rivers are longer than 1,000 kilometers?
  • What is the name of the state with the lowest point in it?
  • Which states border Alabama?
  • Which rivers do not run through Texas?
Code that answers some of these is in the examples.
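As a rough sketch of how a few of the questions above might be approached (this is not the lesson's own example code), the snippet below works on a handful of made-up, already-parsed rows. The state and river names are invented, the helper `of_kind` is hypothetical, and it assumes that a river's state columns hold abbreviations, which may not match the real file.

```python
# Made-up rows in the column layout described earlier; not real geo.csv data.
rows = [
    ["state", "cedaria", "cd", "capitol city", 1000, 1, 50000, "a", "b", "c", "d"],
    ["state", "zedland", "zd", "zed city", 2000, 2, 70000, "e", "f", "g", "h"],
    ["river", "long water", 1200, "cd", "zd"],
    ["river", "short water", 300, "cd"],
    ["river", "far water", 1500, "zd"],
    ["border", "cedaria", "cd", "zedland"],
]

def of_kind(kind):
    """Return all rows with the given identifier, with the identifier removed."""
    return [r[1:] for r in rows if r[0] == kind]

# Names and abbreviations of all states.
names_and_abbrevs = [(s[0], s[1]) for s in of_kind("state")]

# States whose name starts with 'c' (case-insensitive).
states_with_c = [s[0] for s in of_kind("state") if s[0].lower().startswith("c")]

# Longest river: compare on the length column.
longest_river = max(of_kind("river"), key=lambda r: r[1])[0]

# Rivers longer than 1,000 (in whatever unit the file uses).
long_rivers = [r[0] for r in of_kind("river") if r[1] > 1000]

# Rivers that do not run through the state abbreviated "cd".
rivers_not_in_cd = [r[0] for r in of_kind("river") if "cd" not in r[2:]]

# States that border "cedaria": everything after the name and abbreviation.
neighbors = [n for b in of_kind("border") if b[0] == "cedaria" for n in b[2:]]

print(names_and_abbrevs)
print(states_with_c, longest_river, long_rivers, rivers_not_in_cd, neighbors)
```

Each answer is a filter or a `max` over one kind of row, so the same pattern should carry over to cities, lakes, and the other record types.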

Now you try. Write code to answer some of our proposed questions, or even to just "explore" the data.
