Friday, March 4, 2011

One-off programming

We needed to send some data to a partner for an analysis. They need the files in .csv format and some other draconian specifications. To get the data, I had to run a set of SQL statements that resulted in 4 files, and I had to do this for 5 different companies, which resulted in 20 files.

However, our partner company needed all the data in 4 files only. So I needed to take 5 files and combine them all into 1. In addition, I had to add the store number as the first column in the CSV on every record in every CSV.

Sounds like a job for Excel, right? Well, I've never had luck making Excel understand CSV files (along with all of the double quote and comma characters) and when I opend the files in Ultra Edit, it completely fell down on itself. These are BIG files.

Enter Stage Left: C#. I fired up VS 2010 and wrote 132 lines including curly braces, empty lines, comments, using statements, etc. This app first takes all 5 store files for a certain query, adds "MerchantID," to the first line in the file, then appends on a string such as "44000," to every line and saves it out to a new file. Step 1 complete. Now I have 20 files with the Company Identifier as the first column in each.

Step 2, read in all 5 files for each query, use the first line from the first file, and then just append on all of the data lines from all of those files, into 1 giant aggregate file. Do this for all 4 of the query types.

Now I was left with just 4 files, all companies combined together and identified on every record, and the original data was still just as pristine as it ever was (meaning, I didn't have to worry about Excel "helping" me with doubling the double quote character, adding extra commas, etc.)

When I ran the app, it took about 12 seconds to process all of that data. Ultraedit couldn't handle it (in the way that I wanted to use it), and Excel even struggled with the largest file (lots of waiting while Excel played with memory).

This is one of the benefits to being a programmer. If you want to make your life easier while you learn quite a bit about how your machine actually works under the covers, do yourself a favor and start writing programs to deal with text files. You'll learn about how string manipulation affects your processor and memory, about how file IO works, how streams work (in C# at least), and you'll also build yourself a few tools to do file manipulation that you will invaluable in very many unplanned situations.