Safety in Numbers

Brighter Planet's blog

Posted by Seamus on Tuesday, September 11, 2012.

Split XML files with `sgrep`, a classic UNIX utility from 1995

sgrep is better than split or csplit for breaking up XML files by element – you can even use it to create a constant-memory streaming “parser.”

$ sgrep -o "XXXSTART%rSTOPXXX" '"<TourismEntity" .. "</TourismEntity"' transmission_file.xml
XXXSTART<TourismEntity>
    <State>New York</State>
    <Saying>I♥NY</Saying>
  </TourismEntitySTOPXXXXXXSTART<TourismEntity>
    <State>Virginia</State>
    <Saying>Is For Lovers</Saying>
  </TourismEntitySTOPXXXXXXSTART<TourismEntity>
    <State>Wisconsin</State>
    <Saying>America's Dairyland</Saying>
  </TourismEntitySTOPXXX

(see below for why that output is useful)

tl;dr

sgrep and a simple Ruby program (given below) let you stream XML elements into an #emit method that can do whatever you want. What’s more, the memory usage is constant (and small); memory usage doesn’t grow like if you parse the entire XML document into memory like with nokogiri.

Using sgrep to split XML

Combine sgrep with, for example, a Ruby program:

#!/usr/bin/env ruby

# your target element here
ELEMENT_START = '<TourismEntity'
ELEMENT_STOP = '</TourismEntity'

# your emit code here - in this case I'm just writing it to a separate file named tourism_entity-NUM.txt
def emit(tourism_entity)
  $tourism_entity_count ||= 0
  $tourism_entity_count += 1
  File.open("tourism_entity-#{$tourism_entity_count}.txt", 'w') { |f| f.write tourism_entity }
end

SGREP_BIN = %w{ sgrep sgrep2 }.detect { |bin| `which #{bin}`; $?.success? }
MAGIC_START = 'XXXSTART'
MAGIC_STOP = 'STOPXXX'

leftover = ''
IO.popen([ SGREP_BIN, '-n', '-o', "#{MAGIC_START}%r#{MAGIC_STOP}", %{"#{ELEMENT_START}" .. "#{ELEMENT_STOP}"}, ARGV[0] ]) do |io|
  while additional = io.read(65536)
    buffer = leftover + additional
    while (start = buffer.index(MAGIC_START)) and (stop = buffer.index(MAGIC_STOP))
      element_body = buffer[(start+MAGIC_START.length)...stop] + '>'
      # what "emit" does is up to you
      emit element_body
      buffer = buffer[(stop+MAGIC_STOP.length)..-1]
    end
    leftover = buffer
  end
end

So let’s go back to the example, transmission_file.xml:

<TransmissionFile>
  <TourismEntity>
    <State>New York</State>
    <Saying>I♥NY</Saying>
  </TourismEntity>
  <TourismEntity>
    <State>Virginia</State>
    <Saying>Is For Lovers</Saying>
  </TourismEntity>
  <TourismEntity>
    <State>Wisconsin</State>
    <Saying>America's Dairyland</Saying>
  </TourismEntity>
</TransmissionFile>

You will get:

$ ruby emit_tourism_entity.rb transmission_file.xml 
$ tail +1 tourism_entity-*
==> tourism_entity-1.txt <==
  <TourismEntity>
    <State>New York</State>
    <Saying>I♥NY</Saying>
  </TourismEntity>
==> tourism_entity-2.txt <==
  <TourismEntity>
    <State>Virginia</State>
    <Saying>Is For Lovers</Saying>
  </TourismEntity>
==> tourism_entity-3.txt <==
  <TourismEntity>
    <State>Wisconsin</State>
    <Saying>America's Dairyland</Saying>
  </TourismEntity>

What’s happening is:

Ruby spawns sgrep using a pipe
sgrep spits out a stream of element bodies separated by “XXXSTART” and “STOPXXX” into the pipe
Ruby reads from the pipe and watches for element bodies separated by the aforementioned magic tokens
When Ruby sees a whole element body, it runs #emit

Why are you so amazed by this program from 1995

Because just look at that beautiful syntax:

$ sgrep '"{" .. "}"' eval.c

And because memory usage is really low, and it’s really fast.

I have less than 100 elements and just want to split up the file

Both of these will break up the XML file into separate files without the need for a Ruby wrapper:

$ split -p '<TourismEntity' transmission_file.xml
$ csplit -s -k transmission_file.xml '/<TourismEntity/' '{100}'

But there are little problems, like you max out at 100 separate files (i.e. elements), and other things.

What blog is this?

Safety in Numbers is Brighter Planet's blog about climate science, Ruby, Rails, data, transparency, and, well, us.

Who's behind this?

We're Brighter Planet, the world's leading computational sustainability platform.

Who's blogging here?

Patti Prairie CEO