Posted by Seamus on Wednesday, January 18, 2012.

Fuzzy match in Ruby

Our fuzzy_match library for Ruby helps link (cross-reference) records across data sources. For example, it can match up aircraft records from the Bureau of Transportation Statistics (BTS) and the Federal Aviation Administration (FAA):

[Screenshots: the BTS aircraft data source and the FAA aircraft data source]

90% of the way by default

Let’s look at only the Boeing 737 records for now…

require 'fuzzy_match'

bts_records = [
  'Boeing 737-800', 'Boeing 737-5/600lr', 'Boeing 737-500',
  'Boeing 737-400', 'Boeing 737-300lr', 'Boeing 737-300',
  'Boeing 737-100/200', 'Boeing 737-200c'
]
faa_records = [
  '737-100', '737-200, Surveiller (CT-43, VC-96)',
  '737-300', '737-400', '737-500', '737-600',
  '737-700, BBJ, C-40', '737-800, BBJ2', '737-900',
  '737 Stage 3 (US ONLY)',
]
puts [ 'BTS'.ljust(24), 'FAA' ].join    # print a nice table header
matcher = FuzzyMatch.new(faa_records)   # set up a matcher object
bts_records.each do |bts|
  faa = matcher.find(bts)               # given BTS record as input, find a matching FAA record
  puts [ bts.ljust(24), faa ].join      # print a row showing the match
end

which produces

$ ruby example.rb
BTS                     FAA
Boeing 737-800          737-800, BBJ2
Boeing 737-5/600lr      737-600
Boeing 737-500          737-500
Boeing 737-400          737-400
Boeing 737-300lr        737-300
Boeing 737-300          737-300
Boeing 737-100/200      737-100
Boeing 737-200c         737-100  # <- oops!
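
To see how a pure string-similarity score can produce a miss like that, here is a minimal plain-Ruby sketch of one common measure, Dice's coefficient over character pairs. This is a simplified illustration, not fuzzy_match's actual scoring code; the helper names `bigrams` and `dice` are made up for this example.

```ruby
# Illustrative only: a simplified similarity score, NOT the library's internals.

# Split a string into lowercase words, then into adjacent character pairs.
def bigrams(str)
  str.downcase.split(/\s+/).flat_map { |word| word.chars.each_cons(2).map(&:join) }
end

# Dice's coefficient: 2 * shared pairs / total pairs, from 0.0 to 1.0.
def dice(a, b)
  pairs_a = bigrams(a)
  pairs_b = bigrams(b)
  return 0.0 if pairs_a.empty? && pairs_b.empty? # avoid dividing by zero

  remaining = pairs_b.dup
  shared = 0
  pairs_a.each do |pair|
    i = remaining.index(pair)
    next unless i
    remaining.delete_at(i) # each pair in b can only be matched once
    shared += 1
  end
  2.0 * shared / (pairs_a.size + pairs_b.size)
end

needle = 'Boeing 737-200c'
['737-100', '737-200, Surveiller (CT-43, VC-96)'].each do |candidate|
  puts [candidate.ljust(40), dice(needle, candidate).round(3)].join
end
```

Under this simplified score the short "737-100" beats the longer "737-200, Surveiller (CT-43, VC-96)", because all the extra text in the longer record dilutes the pairs it shares with the BTS name.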

Add rules to get to 95%

Fuzzy matching alone may get roughly 90% of the matches right, but you will have to define rules to get to 95%.

In this case, the error is “Boeing 737-200c” matching “737-100”. Let’s use an “identity” rule for “7X7-XXX”…

identities = [
  # when comparing two records that both contain 7X7-XXX, make sure all
  # the digits (but not the dash) are equal
  %r{(7\d7)-?(\d\d\d)}
]
matcher = FuzzyMatch.new(faa_records, :identities => identities)

which produces the correct match

Boeing 737-200c         737-200, Surveiller (CT-43, VC-96)
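
One way to picture what the identity rule checks, as a plain-Ruby sketch based on the comment above rather than on the library's source. The helper name `identical?` and its three-way return value are assumptions made for this illustration.

```ruby
# Illustrative only: how an identity check could work, NOT fuzzy_match's code.
IDENTITY = %r{(7\d7)-?(\d\d\d)}

# Returns true or false when both records match the rule (captures equal or not),
# and nil when the rule does not apply to at least one of them.
def identical?(a, b, regexp = IDENTITY)
  match_a = regexp.match(a)
  match_b = regexp.match(b)
  return nil unless match_a && match_b

  # Compare only the captured digits, so the optional dash is ignored.
  match_a.captures == match_b.captures
end

faa_records = [
  '737-100', '737-200, Surveiller (CT-43, VC-96)',
  '737-300', '737-400', '737-500', '737-600',
  '737-700, BBJ, C-40', '737-800, BBJ2', '737-900',
  '737 Stage 3 (US ONLY)'
]

# Keep only the candidates that the rule does not rule out.
allowed = faa_records.reject { |faa| identical?('Boeing 737-200c', faa) == false }
puts allowed
```

In this sketch "737-100" is rejected outright because its captured digits differ from "737-200", so only the 737-200 record and the record the rule does not apply to ("737 Stage 3 (US ONLY)") are left for the fuzzy comparison.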

Rules and options

Check out the fuzzy_match documentation for all the kinds of rules…

  • :blockings
  • :normalizers
  • :identities
  • :stop_words

and also options you can pass to find

  • :read
  • :must_match_blocking
  • :must_match_at_least_one_word
  • :first_blocking_decides

That’s it!

What blog is this?

Safety in Numbers is Brighter Planet's blog about climate science, Ruby, Rails, data, transparency, and, well, us.

Who's behind this?

We're Brighter Planet, the world's leading computational sustainability platform.
