Posted by Seamus on Wednesday, January 18, 2012.

Fuzzy match in Ruby

Our fuzzy_match library for Ruby can help link (cross-reference) records across data sources. For example, it can match up aircraft records from the Bureau of Transportation Statistics (BTS) and the Federal Aviation Administration (FAA).

90% of the way by default

Let’s look at only the Boeing 737 records for now…

bts_records = [
  'Boeing 737-800', 'Boeing 737-5/600lr', 'Boeing 737-500',
  'Boeing 737-400', 'Boeing 737-300lr', 'Boeing 737-300',
  'Boeing 737-100/200', 'Boeing 737-200c'
]
faa_records = [
  '737-100', '737-200, Surveiller (CT-43, VC-96)',
  '737-300', '737-400', '737-500', '737-600',
  '737-700, BBJ, C-40', '737-800, BBJ2', '737-900',
  '737 Stage 3 (US ONLY)',
]
require 'fuzzy_match'
puts [ 'BTS'.ljust(24), 'FAA' ].join    # print a nice table header
matcher = FuzzyMatch.new(faa_records)   # set up a matcher object with the FAA records as the haystack
bts_records.each do |bts|
  faa = matcher.find(bts)               # given BTS record as input, find a matching FAA record
  puts [ bts.ljust(24), faa ].join      # print a row showing the match
end

which produces

$ ruby example.rb
BTS                     FAA
Boeing 737-800          737-800, BBJ2
Boeing 737-5/600lr      737-600
Boeing 737-500          737-500
Boeing 737-400          737-400
Boeing 737-300lr        737-300
Boeing 737-300          737-300
Boeing 737-100/200      737-100
Boeing 737-200c         737-100  # <- oops!
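
To get a feel for why a short record like "737-100" can beat the longer, correct one, here is a rough sketch of a bigram-overlap score (Dice's coefficient). This is only an illustration in plain Ruby, not fuzzy_match's actual scoring code, and the helper names bigrams and dice are made up for this example. The point is that the extra text in "737-200, Surveiller (CT-43, VC-96)" adds many bigrams that the BTS record does not share, which drags its score down.

```ruby
# Character bigrams of a downcased string, with duplicates removed.
def bigrams(str)
  str.downcase.chars.each_cons(2).map(&:join).uniq
end

# Dice's coefficient: 2 * |shared bigrams| / (|bigrams of a| + |bigrams of b|).
def dice(a, b)
  x = bigrams(a)
  y = bigrams(b)
  return 0.0 if x.empty? && y.empty?
  2.0 * (x & y).size / (x.size + y.size)
end

needle  = 'Boeing 737-200c'
short   = '737-100'
correct = '737-200, Surveiller (CT-43, VC-96)'

puts [ short.ljust(40),   dice(needle, short).round(3)   ].join
puts [ correct.ljust(40), dice(needle, correct).round(3) ].join

# Under this simplified measure the wrong record scores higher than the right one.
puts dice(needle, short) > dice(needle, correct)
```

A string-similarity score alone has no idea that "200" versus "100" matters more than "Surveiller", which is where rules come in.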

Add rules to get to 95%

Fuzzy matching alone may get about 90% of records matched correctly, but you will have to define rules to get to 95%.

In this case, the error is “Boeing 737-200c” matching “737-100”. Let’s use an “identity” rule for “7X7-XXX”…

identities = [
  # when comparing two records that both contain 7X7, make sure all the digits (but not the dash) are equal
  %r{(7\d7)-?(\d\d\d)}
]
matcher = FuzzyMatch.new(faa_records, :identities => identities)

which produces the correct match

Boeing 737-200c         737-200, Surveiller (CT-43, VC-96)
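
The idea behind an identity rule can be sketched in plain Ruby, without the gem. This is an assumption about the general technique based on the comment in the example, not the library's implementation, and the identical? helper is a made-up name: if both records match the regular expression, their captured groups must be equal, and if either record does not match, the rule has nothing to say.

```ruby
# The same pattern as in the example: 7X7, an optional dash, then three digits.
IDENTITY = %r{(7\d7)-?(\d\d\d)}

# Returns true or false when both records match the pattern,
# and nil when the rule does not apply to this pair.
def identical?(a, b, regexp = IDENTITY)
  ma = regexp.match(a)
  mb = regexp.match(b)
  return nil unless ma && mb
  ma.captures == mb.captures   # compare only the captured digits, so the dash is ignored
end

p identical?('Boeing 737-200c', '737-100')                             # => false
p identical?('Boeing 737-200c', '737-200, Surveiller (CT-43, VC-96)')  # => true
p identical?('Boeing 737-200c', '737 Stage 3 (US ONLY)')               # => nil
```

A matcher can then refuse any candidate for which such a check returns false, however similar the two strings look otherwise.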

Rules and options

Check out the fuzzy_match documentation for all the kinds of rules…

  • :blockings
  • :normalizers
  • :identities
  • :stop_words

and also options you can pass to find

  • :read
  • :must_match_blocking
  • :must_match_at_least_one_word
  • :first_blocking_decides

That’s it!

