"2","Extracting and Working with CSV files"

In the first part of "Comma Values", we got CSV-formatted data out of MongoDB. Now, in part two, we're going to look at the tools to manipulate those CSV files, specifically CSVKit. CSVKit is a treasure trove of CSV manipulation tools, whether you are importing, exporting or just fixing up CSV data.

What is CSVKit?

CSVKit is a Python-based set of utilities which can be used standalone or pipelined together. It's been around since 2011, when data journalist Christopher Groskopf took a handy utility, csvcut, rewrote it and then began adding tools to create CSVKit.

Now CSVKit is quite mature and it can take data from other tabular formats, or tables in SQL databases, and turn them into RFC-conformant CSV files. It has a CSV cleaner for files that claim to be CSV but are malformed internally. There are commands to filter and truncate CSV files and to search (grep) CSV files for rows of data. There's also a rather splendid CSV join command to merge two files together using common fields. It's quite a tool and, if you are doing anything with CSV, it's worth having to hand. We'll look at some examples of its use here and where it adds value, but first...

Getting CSVKit

This is pretty simple. As long as you have Python on your system, a pip install csvkit should do all the necessary work. If there's a problem, the "Installation" page in the documentation has essential troubleshooting, especially for Ubuntu users and users of older versions of Python. Now, on to the examples.
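In its simplest form, that's just:

$ pip install csvkit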

CSVing your data - in2csv

Getting your data into CSV format can be a chore. There are lots of fixed-width formats that look tabular but are hard to parse. Reformatting them into CSV could well be the best way to get them into a database. The in2csv command has that covered for you with its fixed-width support. Say we had a file called fixed.txt that looked like this:

This is the end    20   Red  1.50  
Designers "& more" 15   Bluee3.20  
The world of radios60000Grey 9.20  

You could rustle up a script to turn that into CSV, but it's simpler to create a schema file for this fixed format which defines where each expected field starts and how long it is. Like this one, schema.csv:

column,start,length  
bookname,1,19  
assigned,20,5  
color,25,5  
price,30,10  

Then we only need to run in2csv like so:

$ in2csv -f fixed -s schema.csv fixed.txt
bookname,assigned,color,price  
This is the end,20,Red,1.50  
"Designers ""& more""",15,Bluee,3.20
The world of radios,60000,Grey,9.20  

And we get a CSV file with headers, properly escaped – look at the second entry, where the double-quotes already in the data were escaped as doubled double-quotes and the entire string value was wrapped in double-quotes.

The in2csv command doesn't just take in fixed-width formats. You probably guessed that the -f parameter above was setting the format. As well as fixed, there's support for csv (to add headers), xls and xlsx (for spreadsheet data), and geojson, json and "ndjson" to extract data from JSON files and turn it into CSV. In the latter case, a JSON file like:

$ cat json1.json
[
  { "Name": "Bowie", "Age": 60, "Researching": "Whippets" },
  { "Name": "Frederick", "Age": 55, "Researching": "Frames" },
  {  "Age": 55, "Researching": "Neatness", "Name": "Galor"}
]
$ in2csv -f json json1.json
Name,Age,Researching  
Bowie,60,Whippets  
Frederick,55,Frames  
Galor,55,Neatness  

It looks simple, but notice that the third object in the JSON data has its keys in a different order, a difference which in2csv removes as it normalises the data to CSV. There are a number of other mutations in2csv can apply to flatten a file too; if the JSON file is a dictionary rather than an array, it can be directed to extract from a particular key, as in the sketch below.
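As a small sketch, assuming a hypothetical file wrapped.json where the records sit under a "results" key, the -k (or --key) option tells in2csv which key to extract from:

$ cat wrapped.json
{ "results": [ { "Name": "Bowie", "Age": 60 }, { "Name": "Galor", "Age": 55 } ] }
$ in2csv -f json -k results wrapped.json
Name,Age
Bowie,60
Galor,55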

CSVing your SQL database

There are quite a few SQL databases supported by Python's SQLAlchemy. As CSVKit is built on this, those databases also work with CSVKit. For Compose users, the most important one will be PostgreSQL, but if you need to get data from SQL Server, MySQL, Oracle, SQLite or Sybase, the same techniques work. There are two commands to look at in CSVKit, sql2csv and csvsql. The first, sql2csv, is the one that enables data extraction as CSV. It lets you run a command against an SQL database and get the results as CSV files. All it needs is a connection string and a query. Here is an example used with a Compose database of films:

$ sql2csv --db "postgres://admin:password@aws-eu-west-1-portal.0.dblayer.com:10199/dvdrental" --query "select * from film where release_year=2006"
film_id,title,description,release_year,language_id,rental_duration,rental_rate,length,replacement_cost,rating,last_update,special_features,fulltext  
133,Chamber Italian,A Fateful Reflection of a Moose And a Husband who must Overcome a Monkey in Nigeria,2006,1,7,4.99,117,14.99,NC-17,2013-05-26 14:50:58.951000,['Trailers'],'chamber':1 'fate':4 'husband':11 'italian':2 'monkey':16 'moos':8 'must':13 'nigeria':18 'overcom':14 'reflect':5  
384,Grosse Wonderful,A Epic Drama of a Cat And a Explorer who must Redeem a Moose in Australia,2006,1,5,4.99,49,19.99,R,2013-05-26 14:50:58.951000,['Behind the Scenes'],'australia':18 'cat':8 'drama':5 'epic':4 'explor':11 'gross':1 'moos':16 'must':13 'redeem':14 'wonder':2  

Here we've queried a table which is quite rich in PostgreSQL types, specifically a text array and a tsvector for full-text search, and asked it for all fields from 2006 films. The sql2csv command has turned the column names into a header row and formatted all our rows into well-formed CSV.

The csvsql command does the opposite of sql2csv by letting CSV data be loaded into an SQL database. It generates the CREATE TABLE command needed to store the contents of a CSV file. If we take the CSV data we just generated and store it in a file films.csv, we can run csvsql like so:

$ csvsql -i postgresql --table newfilms films.csv
CREATE TABLE newfilms (  
        film_id INTEGER NOT NULL,
        title VARCHAR(27) NOT NULL,
        description VARCHAR(130) NOT NULL,
        release_year INTEGER NOT NULL,
        language_id INTEGER NOT NULL,
        rental_duration INTEGER NOT NULL,
        rental_rate FLOAT NOT NULL,
        length INTEGER NOT NULL,
        replacement_cost FLOAT NOT NULL,
        rating VARCHAR(5) NOT NULL,
        last_update TIMESTAMP WITHOUT TIME ZONE NOT NULL,
        special_features VARCHAR(21) NOT NULL,
        fulltext VARCHAR(193) NOT NULL
);

Here we get the CREATE statement, but you will notice that the command can't work out which columns were a text array and a full-text vector. That's the information loss you get from stripping all the type information to make a CSV file, which is why, unless you have a very good reason, you shouldn't extract data as CSV.

To make csvsql actually execute that CREATE TABLE, we need to replace -i postgresql, which just hints at the SQL dialect, with a --db URL like the one we used with the sql2csv command earlier; csvsql will then work out the database dialect from the connection. When you are actually connected to a database, you can add --insert and that will insert the data in the CSV file into the newly created table. If you've already created the table, then adding --no-create to the command line skips creation and goes straight to the inserting.
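Put together, a sketch of that (reusing the placeholder connection string from the sql2csv example above) might look like this:

$ csvsql --db "postgres://admin:password@aws-eu-west-1-portal.0.dblayer.com:10199/dvdrental" --insert --table newfilms films.csv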

There is one more trick up this command's sleeve. You can give csvsql just an SQL query and a CSV file (or two) and it will perform the query on the CSV files:

$ csvsql --query "select title,description,release_year from films where rating='NC-17'" films.csv
title,description,release_year  
Chamber Italian,A Fateful Reflection of a Moose And a Husband who must Overcome a Monkey in Nigeria,2006  
Adaptation Holes,A Astounding Reflection of a Lumberjack And a Car who must Sink a Lumberjack in A Baloon Factory,2006  
Aladdin Calendar,A Action-Packed Tale of a Man And a Lumberjack who must Reach a Feminist in Ancient China,2006  
Alice Fantasia,A Emotional Drama of a A Shark And a Database Administrator who must Vanquish a Pioneer in Soviet Georgia,2006  
...

It's almost magical. Behind the scenes, an in-memory SQLite database is loaded up with the CSV files and the query is then executed against that. It's not fast, but it's yet another blade in CSVKit's set. Of course there are quicker ways to do the filtering we did there; let's move on to the core of CSVKit...

The CSV processing commands

There's a solid set of CSV processing commands in CSVKit.

Csvclean

Our first stop is the csvclean command. We've got to this point extracting data from various databases, but often you may be faced with a CSV file of, shall we say, dubious quality. With csvclean you can separate out the rows with syntax errors or incorrect numbers of values. csvclean filename.csv will put the good rows into filename_out.csv and the bad rows, with their related errors, into filename_err.csv. Add -n and the command will dry-run, just telling you where the errors are.
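As a quick sketch, checking and then cleaning a hypothetical suspect.csv would look like this:

$ csvclean -n suspect.csv
$ csvclean suspect.csv

The first command only reports the errors it finds; the second writes the good rows to suspect_out.csv and the bad rows to suspect_err.csv.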

Csvcut

Next up is csvcut, where the idea of CSVKit originated. This command deals with cutting columns out of CSV files. Take the csvsql command from earlier, where we only wanted the title, description and release_year fields. We can do that a lot more efficiently with csvcut:

csvcut -c title,description,release_year films.csv  

You can use -c (aka --columns) to name the fields you want or -C (that's an upper-case C, or you can use --not-columns) to name the fields you don't want. You can, if you want, also use indices for the fields, as in the sketch below. This is much faster than that SQL command.
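For example, since title, description and release_year are the second, third and fourth columns of films.csv, the same cut can be written with column indices (csvcut counts columns from 1):

$ csvcut -c 2,3,4 films.csv
title,description,release_year
...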

Csvgrep

But we're not filtering the rows as we did in the csvsql example. This is where csvgrep comes in. It's a row-wise filter to complement csvcut's column-wise filter. For our example we can use it like this:

csvgrep -c rating -m "NC-17" films.csv  

The -c parameter lets us pick the fields we want to match with in each row. Here, we use the rating field. Then we use -m to give a string to match with ("NC-17") and finally the CSV file name. This produces the "NC-17" rated movies in the list. If we want to completely replicate the csvsql example, we could pipe the commands together like so:

csvgrep -c rating -m "NC-17" films.csv | csvcut -c title,description,release_year  

Csvjoin

If you don't give these commands a filename to work with, they default to using STDIN, which makes it easy to string them together in classic Unix style. In a more SQL vein, there's CSVKit's csvjoin, capable of doing a join operation on a number of CSV files. So, given two files such as country.csv:

country,short  
United States,US  
United Kingdom,UK  
France,FR  

and characters.csv:

name,job,resident  
Arthur Puty,Presenter,UK  
A.N.Chorman,Newscaster,US  
Crow T. Robot,Robot,US  
Pepe Le Pew,Skunk,FR  

We can join the full name of the country to the character by doing:

$ csvjoin -c resident,short --outer characters.csv country.csv | csvcut -C resident,short
name,job,country  
Arthur Puty,Presenter,United Kingdom  
A.N.Chorman,Newscaster,United States  
Crow T. Robot,Robot,United States  
Pepe Le Pew,Skunk,France  

The -c parameter says which fields are used for the join and --outer sets the style of the join. The join fields both make it through into the output CSV, so we use csvcut to remove both of them with the -C "exclude these fields" option. Tada, CSV file joining. A word of warning though: it all happens in memory, so don't join very large files unless you have masses of memory on hand.

Csvsort and Csvstack

Finally, there's csvsort, which lets you sort a CSV file by a field, and csvstack, which lets you smartly merge similar CSV files into one larger grouped CSV file. Both are handy things to have around as part of CSVKit's Swiss Army knife, but we're heading back into the realm of the database's specialities, so let's look at getting our data back to the database.
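As a brief sketch (the stacked file names here are hypothetical), sorting films.csv by a column in reverse order and stacking two similar files might look like this:

$ csvsort -c release_year -r films.csv
$ csvstack -g 2005,2006 films_2005.csv films_2006.csv

The -g option on csvstack adds a grouping column so each row in the merged output still records which file it came from.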

From CSV to JSON

The most likely route back to the database, assuming you aren't going back to an SQL database where csvsql has you covered, is via JSON. Yes, OK, you got there first: there's a csvjson command and it does what it says, converting CSV files to JSON. More precisely, it turns each line of the CSV file into a JSON object with key names drawn from the header field names in the CSV. So if you run csvjson on our films CSV file we get this:

$ csvjson films.csv
[{"film_id": "133", "title": "Chamber Italian", "description": "A Fateful Reflection of a Moose And a Husband who must Overcome a Monkey in Nigeria", "release_year": "2006", "language_id": "1", "rental_duration": "7", "rental_rate": "4.99", "length": "117", "replacement_cost": "14.99", "rating": "NC-17", "last_update": "2013-05-26 14:50:58.951000", "special_features": "['Trailers']", "fulltext": "'chamber':1 'fate':4 'husband':11 'italian':2 'monkey':16 'moos':8 'must':13 'nigeria':18 'overcom':14 'reflect':5"}, {"film_id": "384",...

Notice the JSON starts with a [, meaning all the objects created will be in a comma-separated JSON array and the file will end with ]. You may want to stream the JSON objects into an application and not have an array surrounding them. To lose the array and just have every JSON object on its own line, add the --stream option – especially helpful if you're using the Compose Transporter's file adapter to import the data. If you find the output hard to read, add --indent 4 to indent with four spaces. Let's try those two together:

$ csvjson --indent 4 --stream films.csv
{
    "film_id": "133",
    "title": "Chamber Italian",
    "description": "A Fateful Reflection of a Moose And a Husband who must Overcome a Monkey in Nigeria",
    "release_year": "2006",
    "language_id": "1",
    ...

You can also make your JSON file into a key/value map using the --key option and the name of a field whose value you want to be the key in that map. You can't --stream this one though. Let's make our films into a map with the title as the key:

$ csvjson --indent 4 --key title films.csv | more
{
    "Chamber Italian": {
        "film_id": "133",
        "title": "Chamber Italian",
        "description": "A Fateful Reflection of a Moose And a Husband who must Overcome a Monkey in Nigeria",
        "release_year": "2006",
...

Not ideal for importing into most databases, unless it's as a single document in its own right, but a useful transformation to be able to do easily.

Of course this isn't the only export option. With csvformat you can create your own custom file format to transform your CSV file, with your own delimiters, quoting styles, tab handling and more. Handy for when you have got your output from a database in CSV format and just need to flatten it out a little more for a legacy system. The csvpy command is one for Pythonistas; it will load up a CSV file with CSVKit's own library and drop you into a Python shell so you can manipulate it.
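As a small sketch, re-delimiting films.csv with a pipe character (the -D flag sets the output delimiter):

$ csvformat -D "|" films.csv
film_id|title|description|release_year|...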

The last two commands are useful for all CSV file users. Getting a well-formatted print of a CSV file can be a chore, but csvlook will take care of that for you. Working out what's in your CSV file is a journey which can begin with a good set of statistics, which is where csvstat comes in. It will do a useful analysis of each field in the CSV file you hand it. Here's a snippet from it running on our films.csv file:

$ csvstat films.csv
  1. film_id
        <class 'int'>
        Nulls: False
        Min: 1
        Max: 1000
        Sum: 500500
        Mean: 500.5
        Median: 500.5
        Standard Deviation: 288.6749902572095
        Unique values: 1000
  2. title
        <class 'str'>
        Nulls: False
        Unique values: 1000
        Max length: 27
  3. description
        <class 'str'>
        Nulls: False
        Unique values: 1000
        Max length: 130
....

Each field is evaluated for what type it appears to be – int, float, str, datetime – and whether it is ever null. For numeric fields, csvstat derives a range of common statistics. Where it's worth doing, the five most common values are noted, and a count of unique values in that field is available. The maximum length of string fields is also shown. Finally, there's a row count. This is all really useful if you plan on sending the data on into a database and want to be precise when defining a schema or validation.

Wrapping up CSVKit

CSVKit, as you can see, is pretty much all you need to handle CSV files on their journey between applications and databases. You'll find much more information in the fine documentation, which includes a walkthrough tutorial that works with a real dataset in Excel format. You'll also find all the common arguments that the various utilities share, and more.

In the next part of this series, we'll look at getting CSV-formatted data in and out of the other databases Compose has available.