Posted by Seamus on Tuesday, June 14, 2011.

Stream parser with Ragel and Ruby

You can use Ragel to make simple stream parsers in Ruby. By “stream parser,” I mean one that reads a file a chunk at a time instead of all at once, so memory use is bounded by the chunk size (plus any token still being matched) rather than by the size of the file.
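
To make the "chunk at a time" idea concrete, here is a minimal sketch of the reading loop on its own, with no Ragel involved. The helper name `each_chunk` and the tiny chunk size are assumptions for illustration only; `StringIO` stands in for a real file.

```ruby
require 'stringio'

# Hypothetical helper: yields successive chunks, holding only one at a time.
def each_chunk(io, chunk_size)
  return enum_for(:each_chunk, io, chunk_size) unless block_given?

  # IO#read(n) returns nil at end of input, which ends the loop.
  while (chunk = io.read(chunk_size))
    yield chunk
  end
end

io = StringIO.new('abcdefghij')
sizes = each_chunk(io, 4).map(&:bytesize)
```

With ten bytes of input and a chunk size of 4, the loop sees three chunks, the last one shorter than the rest.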

Say you have a file like this:

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Integer a
tristique lectus. Vestibulum ante ipsum primis in faucibus orci luctus et
aliquet laoreet, iacSTARTFOOThere are lots of great ideas here.ENDFOOulis
a lorem. Integer interdum, dolor aliquam accumsan eleifend, nisl tortor
mollis ipsum, et semper arcu mi nec felis. Nunc scelerisque cursus dolor
eu tristique. Mauris porta pulvinar dolor. Integer egestas lacinia leo, ut
mollis sapien fermentum non. Maecenas ultricies nibh at justo ornare eu
ullamcorper justo aliquet. Cras id augue eget nunc auctor mattis vitae
quis massa. STARTFOOYou just have to look closelyENDFOOMauris suscipit
justo in erat scelerisque imperdiet.

You want to pull out the text between each STARTFOO and ENDFOO:

There are lots of great ideas here.
You just have to look closely

As I show in my ragel_ruby_examples tests, you could even read the stream 1 byte at a time.

Here’s the Ragel part:

machine simple_tokenizer;
action MyTs {
  my_ts = p
}
action MyTe {
  my_te = p
}
action Emit {
  emit data[my_ts...my_te].pack('c*')
  my_ts = nil
  my_te = nil
}
foo = 'STARTFOO' any+ >MyTs :>> 'ENDFOO' >MyTe %Emit;
main := ( foo | any+ )*;

…and here’s the Ruby reading/buffering mechanism…

CHUNK_SIZE = 1_000_000 # bytes per read (instead of reading in the whole file all at once)

# Note: use with simple_tokenizer
def perform
  pe = :ignored
  eof = :ignored
  %% write init;
  # % (this fixes syntax highlighting)
  leftover = []
  my_ts = nil
  my_te = nil
  File.open(path) do |f|
    while (chunk = f.read(CHUNK_SIZE))
      # Prepend whatever was left unfinished by the previous chunk.
      data = leftover + chunk.unpack('c*')
      p = 0
      pe = data.length
      %% write exec;
      # % (this fixes syntax highlighting)
      if my_ts
        # A token is still open: carry it over and rebase its offsets
        # so that they index into the next buffer.
        leftover = data[my_ts..-1]
        my_te = my_te - my_ts if my_te
        my_ts = 0
      else
        leftover = []
      end
    end
  end
end
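
The leftover handling above is the part that makes chunking safe: a marker or a token body can straddle two reads. As a rough illustration of the same carry-over idea without Ragel, here is a hand-rolled sketch using `String#index`. It is a stand-in for the technique, not the generated state machine, and the names `extract_foo`, `START_MARK`, and `END_MARK` are made up for this example.

```ruby
require 'stringio'

START_MARK = 'STARTFOO'
END_MARK   = 'ENDFOO'

# Hypothetical helper: returns every body found between the two markers,
# reading chunk_size bytes at a time.
def extract_foo(io, chunk_size)
  found = []
  buffer = String.new
  while (chunk = io.read(chunk_size))
    buffer << chunk
    loop do
      start = buffer.index(START_MARK)
      if start.nil?
        # No start marker yet: keep only a tail that could be a partial marker.
        keep = START_MARK.bytesize - 1
        buffer = buffer[-keep, keep] if buffer.bytesize > keep
        break
      end
      body_from = start + START_MARK.bytesize
      # Require at least one body byte, mirroring any+ in the grammar.
      finish = buffer.index(END_MARK, body_from + 1)
      if finish.nil?
        # Token still open: carry it over to the next chunk.
        buffer = buffer[start..-1]
        break
      end
      found << buffer[body_from...finish]
      buffer = buffer[(finish + END_MARK.bytesize)..-1]
    end
  end
  found
end

text = 'xxSTARTFOOhello worldENDFOOyySTARTFOObyeENDFOOzz'
one_byte = extract_foo(StringIO.new(text), 1)
five_bytes = extract_foo(StringIO.new(text), 5)
```

Both calls should find the same two bodies regardless of chunk size. As in the Ragel version, an open token is held in memory until its end marker arrives.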

Alternatively, you could use Ragel’s scanner functionality:

machine simple_scanner;
action Emit {
  emit data[(ts+8)..(te-7)].pack('c*')
}
foo = 'STARTFOO' any+ :>> 'ENDFOO';
main := |*
  foo => Emit;
  any;
*|;

The scanner version requires buffering code like this:

CHUNK_SIZE = 1_000_000 # bytes per read (instead of reading in the whole file all at once)

# Note: use with simple_scanner
def perform
  pe = :ignored
  eof = :ignored
  %% write init;
  # % (this fixes syntax highlighting)
  leftover = []
  File.open(path) do |f|
    while (chunk = f.read(CHUNK_SIZE))
      # Prepend whatever was left unfinished by the previous chunk.
      data = leftover + chunk.unpack('c*')
      p ||= 0
      pe = data.length
      %% write exec;
      # % (this fixes syntax highlighting)
      if ts
        # A token is still open: keep it and rebase p, te, and ts
        # so that they index into the next buffer.
        leftover = data[ts..-1]
        p = p - ts
        te = te - ts if te
        ts = 0
      else
        leftover = []
        p = 0
      end
    end
  end
end
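
The offset arithmetic in the scanner version is easy to get wrong, so here is a small worked sketch in plain Ruby. The values of `ts`, `te`, and the position are chosen by hand for illustration rather than captured from a real Ragel run, and `rebase` is a hypothetical helper that only mirrors the carry-over branch above.

```ruby
# Illustrative buffer as signed bytes, the same shape chunk.unpack('c*') produces.
data = 'xxSTARTFOOhiENDFOOyy'.unpack('c*')

ts = 2   # index of the "S" that opens STARTFOO
te = 18  # index one past the final "O" of ENDFOO

# Skip the 8 bytes of STARTFOO and stop before the 6 bytes of ENDFOO.
# te is exclusive and the range is inclusive, hence te - 7 rather than te - 6.
body = data[(ts + 8)..(te - 7)].pack('c*')

# Hypothetical helper: rebase every offset when an open token is carried over.
def rebase(data, ts, te, pos)
  leftover = data[ts..-1]
  [leftover, 0, te && te - ts, pos - ts]
end

# A chunk that ends while the token is still open.
partial = 'xxSTARTFOOhi'.unpack('c*')
leftover, new_ts, new_te, new_pos = rebase(partial, 2, 3, partial.length)
```

After rebasing, the position equals the length of the leftover, so scanning resumes at the first byte of the next chunk instead of rereading the carried-over bytes.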

Again, you can see more at ragel_ruby_examples, specifically simple_tokenizer.rl.

Final note: I realize this isn’t precisely a parser or even a tokenizer… but these sorts of examples are what I would have wanted when I was getting started with Ragel and Ruby. What’s more, hopefully the Ragel community will chime in and improve the examples.
