Safety in Numbers
Brighter Planet's blog
Fuzzy match in Ruby
Our fuzzy_match library for Ruby can help link (cross-reference) records across data sources. For example, it can match up aircraft records from the Bureau of Transportation Statistics (BTS) and the Federal Aviation Administration (FAA).
90% of the way by default
Let’s look at only the Boeing 737 records for now…
bts_records = [
  'Boeing 737-800', 'Boeing 737-5/600lr', 'Boeing 737-500',
  'Boeing 737-400', 'Boeing 737-300lr', 'Boeing 737-300',
  'Boeing 737-100/200', 'Boeing 737-200c'
]
faa_records = [
  '737-100', '737-200, Surveiller (CT-43, VC-96)',
  '737-300', '737-400', '737-500', '737-600',
  '737-700, BBJ, C-40', '737-800, BBJ2', '737-900',
  '737 Stage 3 (US ONLY)',
]

require 'fuzzy_match'

puts [ 'BTS'.ljust(24), 'FAA' ].join     # print a nice table header
matcher = FuzzyMatch.new(faa_records)    # set up a matcher object
bts_records.each do |bts|
  faa = matcher.find(bts)                # given BTS record as input, find a matching FAA record
  puts [ bts.ljust(24), faa ].join       # print a row showing the match
end
which produces
$ ruby example.rb
BTS                     FAA
Boeing 737-800          737-800, BBJ2
Boeing 737-5/600lr      737-600
Boeing 737-500          737-500
Boeing 737-400          737-400
Boeing 737-300lr        737-300
Boeing 737-300          737-300
Boeing 737-100/200      737-100
Boeing 737-200c         737-100 # <- oops!
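Why might the last row go wrong? Here is a rough sketch, not the library's actual code. It assumes similarity is scored by shared character pairs (Dice's coefficient); fuzzy_match's real scoring may differ in its details. The helper names bigrams and dice are made up for this illustration. Under that assumption, the long FAA record "737-200, Surveiller (CT-43, VC-96)" is penalized for all its extra characters, so the short "737-100" can score higher even though its digits are wrong.

```ruby
require 'set'

# Hypothetical helper: the set of character pairs in a lowercased string.
def bigrams(str)
  str.downcase.chars.each_cons(2).map(&:join).to_set
end

# Hypothetical helper: Dice's coefficient, 2 * |A & B| / (|A| + |B|).
# Ranges from 0.0 (nothing shared) to 1.0 (identical pair sets).
def dice(a, b)
  x = bigrams(a)
  y = bigrams(b)
  return 0.0 if x.empty? || y.empty?
  2.0 * (x & y).size / (x.size + y.size)
end

query      = 'Boeing 737-200c'
wrong_faa  = '737-100'
right_faa  = '737-200, Surveiller (CT-43, VC-96)'

puts dice(query, wrong_faa)  # the short, wrong record scores higher...
puts dice(query, right_faa)  # ...than the long, correct one
```

The short record shares fewer pairs with the query but has far fewer pairs overall, which is what tips the ratio in its favor.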
Add rules to get to 95%
Fuzzy matching alone may get you about 90% of the way, but you will have to define rules to get to 95%.
In this case, the error is “Boeing 737-200c” matching “737-100”. Let’s use an “identity” rule for “7X7-XXX”…
identities = [
  # When comparing two records that both contain 7X7, make sure all the
  # digits (but not the dash) are equal.
  %r{(7\d7)-?(\d\d\d)}
]
matcher = FuzzyMatch.new(faa_records, :identities => identities)
which produces the correct match
Boeing 737-200c         737-200, Surveiller (CT-43, VC-96)
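To see what the identity regexp is checking, here is a small plain-Ruby sketch of the idea. It does not use fuzzy_match at all, and the identical? helper is hypothetical, written only to illustrate the rule as described in the comment above: if both records match the pattern, their captured digits must be equal; if either record does not match, the rule has nothing to say.

```ruby
IDENTITY = %r{(7\d7)-?(\d\d\d)}

# Hypothetical helper, not part of fuzzy_match.
# Returns true or false when both records match the pattern,
# and nil when the rule does not apply to this pair.
def identical?(a, b)
  ma = IDENTITY.match(a)
  mb = IDENTITY.match(b)
  return nil unless ma && mb
  ma.captures == mb.captures   # compare digits only; the dash is not captured
end

p identical?('Boeing 737-200c', '737-100')
# => false (200 vs. 100, so this pair is ruled out)
p identical?('Boeing 737-200c', '737-200, Surveiller (CT-43, VC-96)')
# => true
p identical?('Boeing 737-200c', '737 Stage 3 (US ONLY)')
# => nil (no 7X7-XXX in the second record, so the rule stays silent)
```

Because the dash is optional and sits outside the capture groups, "737-200" and "737200" would be treated as the same identity.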
Rules and options
Check out the fuzzy_match documentation for all the kinds of rules…
:blockings
:normalizers
:identities
:stop_words
and also options you can pass to find…
:read
:must_match_blocking
:must_match_at_least_one_word
:first_blocking_decides
That’s it!