CSV | Another word for snow.

It’s no secret that the public sector is in love with CSV. You only have to look at sites like data.gov.uk to see that. If it’s not CSV then it’s Excel, which comes down to pretty much the same thing. The reason is simple: there’s loads of CSV about and you can create and consume it easily with office-level tools. However, in the IT world CSV tends to be an intermediate format on the way to something like SQL or, in my case, XML. I often get situations where the ‘seed’ data in a project comes in as CSV from which N XML files need to be made, one from each row, e.g. people.csv into N <person/> files. The follow-on problem is that some of the original CSV files can be big. That’s not big as in Big Data big, but too large to load into an editor or to process with the whole thing in memory, i.e. “We’ll have to send you the CSV, zipped, it’s 2 million rows” irritatingly big.

Now of course most of the platforms that you might want to use to consume this stuff come with tools, but you need to know them. If, as I do, you also want to turn the CSV into XML, there may be a couple of places where you need to spell that out, plus specific idioms to remember from the last time that you didn’t write down. All these things tend to come to a head when you’ve a day to create some whizzy demo site from a file someone emailed you.

If I get a file even vaguely in XML and I want another XML then I tend to use XQuery or XSLT. If not I tend to use Ant or Apache Camel. These days Camel is my favourite as it neatly merges the modern need for both transport and transformation into one system. So, I’ve a CSV file on the system, what to do next?

The first choice is whether you can consume your file whole or need to read it line by line or in chunks. The latter is the normal situation: it’s not often you get just a few hundred lines to read, and streaming it in allows you to read any size of file. Whichever way you go, you can use the CSV data format as your base (there’s also the heavy hitter Bindy module, which I’m ignoring for this post). This adds the ability to marshal (or transform) data between a Java Object (like a file) and CSV in a <route/>. At its simplest, it means you can read a file from a folder into memory and unmarshal it into a Java List (actually a List of Lists, one inner List per line) like so:

<route id="csvfileone">
    <from uri="file:inbox" />
    <unmarshal><csv delimiter="|" skipFirstLine="true"/></unmarshal>
    <log message="First Line is ${body[0][0]}"/>
    <to uri="seda:mlfeed"/>
</route>

Here I’ve used the option to ignore the header line in my file and to use pipe as the delimiter rather than comma. The whole thing is sent to a seda queue and I’ll assume something is processing the List at that end. Just to prove it really is a List (you can talk to Lists in <simple/>, which is also cool), I’ve logged the first field of the first line. Now if you want to read a small file and pick out, say, the first, second and fourth fields from a given line, this might be all you need. The problem with this approach is that you don’t need a huge file before memory and performance become issues.
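To make the shape of that List inside a List concrete without firing up Camel, here is a minimal plain-Java sketch. The class `WholeFileCsv` and its `parse` method are hypothetical names made up for illustration, not part of Camel, and the parsing is deliberately naive: it assumes no quoted fields and no delimiters embedded in values. Like the route above, it holds every row in memory at once.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper (not a Camel class): mimics the List-of-Lists body
// that the whole-file unmarshal produces.
final class WholeFileCsv {

    // Naive parse: assumes no quoted fields and no embedded delimiters.
    static List<List<String>> parse(String content, char delimiter, boolean skipFirstLine) {
        List<List<String>> rows = new ArrayList<>();
        String[] lines = content.split("\\r?\\n");
        // Quote the delimiter so that characters such as '|' are not treated as regex syntax.
        String separator = Pattern.quote(String.valueOf(delimiter));
        for (int i = skipFirstLine ? 1 : 0; i < lines.length; i++) {
            if (lines[i].isEmpty()) {
                continue; // ignore blank lines
            }
            // The -1 limit keeps trailing empty fields instead of dropping them.
            rows.add(new ArrayList<>(Arrays.asList(lines[i].split(separator, -1))));
        }
        return rows;
    }
}
```

With a pipe-delimited input, `WholeFileCsv.parse(text, '|', true).get(0).get(0)` plays the same role as `${body[0][0]}` in the log message.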

If you’re looking at a big file, then what you can do is use the Splitter to, well, split it into lines (or blocks of lines) first and then unmarshal each line afterwards. This is ideal if, as in my case, each line is to become a separate file in your output later. Now the route looks like this:

<route id="csvfilereader">
    <from uri="file:inbox" />
    <split streaming="true" parallelProcessing="true">
        <tokenize token="\r\n|\n" xml="false" trim="true" />
        <filter>
            <simple>${property.CamelSplitIndex} &gt; 0</simple>
            <unmarshal><csv/></unmarshal>
            <to uri="seda:mlfeed"/>
        </filter>
    </split>
</route>

To reduce memory use, the splitter is told to stream the file in chunks rather than load it whole. The splitter has also been told to process each chunk in parallel, which speeds up the process. Note that a side effect of this is that the lines in the file won’t necessarily turn up in the order they were in the input file. The Tokenize language is used to tell the splitter how to perform the split. In this case, it’s to use either Windows or Unix line endings (got to love that) and to trim the results. Each line is then unmarshalled and fed into our queue as before. Note I couldn’t use skipFirstLine here, as each entry is only one line, so I’ve added a <filter> based on the counter from the split instead. One of the things I like about Camel is the way you can start off with a simple route and then add complexity incrementally.
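The same idea, stripped of Camel, can be sketched in plain Java: read one line at a time, keep a counter that plays the part of the split index, and drop the line at index 0. Everything here (`StreamingCsv`, `forEachRecord`) is a hypothetical name for illustration only. Unlike the route, this sketch is sequential, so records arrive in file order, and it assumes comma-separated values with no quoted fields.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper (not a Camel class): a sequential stand-in for the
// streaming split, the index filter and the per-line unmarshal.
final class StreamingCsv {

    // Hands each data row to the handler and returns how many rows were handled.
    static int forEachRecord(Reader source, Consumer<List<String>> handler) {
        int handled = 0;
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            int index = 0; // plays the role of the split index
            // readLine() accepts both Windows and Unix line endings.
            while ((line = reader.readLine()) != null) {
                boolean isHeader = index++ == 0;
                String trimmed = line.trim();
                if (isHeader || trimmed.isEmpty()) {
                    continue; // skip the header row and blank lines
                }
                // Naive split: assumes no quoted fields.
                handler.accept(Arrays.asList(trimmed.split(",", -1)));
                handled++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return handled;
    }
}
```

Because only one line is held at a time, memory use stays flat however long the file is, which is the point of streaming in the first place.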

Now that I’ve a simple and robust way to suck up my CSV file, I just need to turn each record into XML by transforming the data with a bit of self-explanatory Groovy:

import org.apache.camel.Exchange

class handler {
    public void makeXML(Exchange exchange) {
        def response = ""
        def crn = ""

        /* Example data
        CRN, Forename, Surname, DoB, PostCode
        [11340, David, Wright, 1977-10-06, CV6 1LT]
        */

        // The unmarshalled CSV arrives as a List of rows, each row a List of fields
        def csvdata = exchange.getIn().getBody(List.class)

        csvdata.each {
            crn = it.get(0)
            response = "<case>\n"
            response += "<crn>" + crn + "</crn>\n"
            // Field order in the file is Forename (1) then Surname (2)
            response += "<surname>" + it.get(2) + "</surname>\n"
            response += "<forename>" + it.get(1) + "</forename>\n"
            response += "<dob>" + it.get(3) + "</dob>\n"
            response += "<postcode>" + it.get(4) + "</postcode>\n"
            response += "</case>"
        }
        // The XML becomes the message body; the CRN becomes the output file name
        exchange.getIn().setBody(response)
        exchange.getIn().setHeader(Exchange.FILE_NAME, crn + '.xml')
    }
}
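One thing string concatenation like this doesn’t do is escape the values, so a surname containing & or < would produce broken XML. As a hedged illustration, here is a plain-Java sketch of the same row-to-XML step with minimal escaping added. The class `CaseXml` and its methods are hypothetical names for this post, and the column order is assumed to match the example data (CRN, Forename, Surname, DoB, PostCode).

```java
import java.util.List;

// Hypothetical helper: builds the <case> document for one CSV row,
// escaping the characters that are special in XML element content.
final class CaseXml {

    static String escape(String text) {
        // '&' must be replaced first so the other entities are not double-escaped.
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    static String element(String name, String value) {
        return "<" + name + ">" + escape(value.trim()) + "</" + name + ">\n";
    }

    // Assumed column order: CRN, Forename, Surname, DoB, PostCode.
    static String toXml(List<String> row) {
        return "<case>\n"
                + element("crn", row.get(0))
                + element("surname", row.get(2))
                + element("forename", row.get(1))
                + element("dob", row.get(3))
                + element("postcode", row.get(4))
                + "</case>";
    }
}
```

For anything beyond a demo, a proper XML writer would be the safer choice; the sketch only shows where the escaping has to happen.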

As a bonus, I’ve dropped the unique id (CRN) field into a Header so that it will be used as the filename, and each output file will be called something like 11340.xml. Last of all, I need to wrap the code up in a route to read the queue, create the file and spit it out into a folder:

 <route id="xmlfilewriter">
     <from uri="seda:mlfeed"/>
     <log message="hello ${body[0][0]}"/>
     <to uri="bean:gmh?method=makeXML"/>
     <to uri="file:outbox"/>
 </route>
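For comparison, the file-writing step at the end of that route boils down to something like the plain-Java sketch below. `OutboxWriter` is a hypothetical name, and the check on the id is an assumption of mine rather than anything Camel does: since the CRN comes straight from the input file, it seems prudent to refuse values that wouldn’t make a safe file name.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: writes one XML document per record, named after its id.
final class OutboxWriter {

    static Path write(Path outbox, String crn, String xml) {
        // Assumption: ids are simple tokens; anything else (slashes, dots) is rejected.
        if (!crn.matches("[A-Za-z0-9_-]+")) {
            throw new IllegalArgumentException("Unsafe id for a file name: " + crn);
        }
        try {
            Files.createDirectories(outbox); // create the folder if it is missing
            return Files.write(outbox.resolve(crn + ".xml"),
                    xml.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling `OutboxWriter.write(Paths.get("outbox"), "11340", xml)` would leave the document in outbox/11340.xml, much as the route does with the file name header.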

Of course, in the real world, you’d probably not store the file this way; it would go straight to Hadoop, MarkLogic or the like. Also, of course, it could stay as CSV and you could do other cool things to it. That’s what I like about Camel: flexibility.



Camel and CSV when you need XML