Delivering a Graph Based Search solution to slightly wrong data

When it comes to databases, having good clean data is always important. More so with Graphs which deal with concepts as nodes and their relationships between them. Inevitably, you will run into messy data and have to deal with it. In a lot of the projects our customers work on they are dealing with connecting multiple data sources to get to a “golden record” or single source of truth. A lofty goal, sometimes impossible to achieve, but we can use the relationships of the data to help us come close.

One option is to extract the features (or tags) of a composite object and see if any other object shares most of these features. If that is the case then they are possibly the same object and should be merged instead of creating a new record. A partial subgraph match is something akin to a recommendation engine in Neo4j and pretty trivial to write. Take a look back at a few oldblog posts for ideas.

We’re going to do something a little different today. Let’s take a look at how to deal with addresses. The model is very simple. A User has an Address. A User (or rather the user data we have) can have mistakes. Let’s say we get a second source of address data and we have addresses that don’t match. One thing we can do is to fuzzy match the address properties against known addresses of the user and see if any come close.

Let’s create some sample data. Here I have a user and their address:

CREATE (max:User {username:'maxdemarzi'})
CREATE (a1:Address {line1:'175 North Harbor Dr.', city:'Chicago', state:'IL', zip:'60605', zip_plus_4:'1234'})
CREATE (max)-[:HAS_ADDRESS]->(a1)

Now we want to know if an address slated for the same user should be added or merged in with an existing address:

{"address":"176 Norht Harbor Dr. Chicago IL 60601"}

Let’s start by finding the user in question:

Node user = db.findNode(Labels.User, "username", username);

Next we find all the addresses connected to that user. In this case I am traversing the HAS_ADDRESS relationship, getting the properties of the addresses and adding them to a set.

HashSet<String> addresses = new HashSet<>();
for (Relationship rel : user.getRelationships(Direction.OUTGOING, RelationshipTypes.HAS_ADDRESS)) {
Node address = rel.getEndNode();
addresses.add(
address.getProperty("line1", "") + " " +
address.getProperty("city", "") + " " +
address.getProperty("state", "") + " " +
address.getProperty("zip", "") + "-" +
address.getProperty("zip_plus_4", "")
);
}

I will do a little clean up here and simplify the addresses to make the matching easier.

String potential_address = addressSimplifier.simplify((String) input.get("address"));

We uppercase the strings, and use a regex map to normalize each address. This is what this looks like:

public class AddressSimplifier implements Simplifier {
@Override
public String simplify(String pString) {
String value = pString.toUpperCase();
Iterator<String> i = commonItems.keySet().iterator();
while(i.hasNext()) {
String regex = i.next();
String newWord = commonItems.get(regex);
value = value.replaceAll(regex, newWord);
}
return value;
}
private static HashMap<String, String> commonItems = new HashMap<String,String>() {{
put(" ST$", " STREET");
put(" ST ", " STREET ");
put("^P O| P O|^POST OFFICE| POST OFFICE", "PO");
...

We will borrow the simmetrics library(a Java library of similarity and distance metrics e.g. Levenshtein distance and Cosine similarity, etc) to make this easier. In this case the metric we will use is Damerau Levenshtein .

StringMetric metric = new DamerauLevenshtein();
for (String address : addresses) {
Float score = metric.compare(potential_address, addressSimplifier.simplify(address));
if (score > 0.50) {
HashMap<String, Object> result = new HashMap<>();
result.put("address", address);
result.put("score", score);
results.add(result);
}
}

Finally we can sort it :

Collections.sort(results, scoreComparator);
return Response.ok().entity(objectMapper.writeValueAsString(results)).build();

and return the results to the user:

[{
"score": 0.8478261,
"address": "175 North Harbor Dr. Chicago IL 60605-1234"
}]

Ok… now I know what you are thinking… dude, that’s pretty lame. I can do that with a relational database or search engine pretty easily. Yes you are right. But we only talked about one user’s addresses. What about the addresses of their friends, their family, their workplace, their organizations, etc? What about those? In Neo4j we can just do this:

for (Relationship rel : user.getRelationships(Direction.BOTH,
RelationshipTypes.KNOWS,
RelationshipTypes.BELONGS_TO,
RelationshipTypes.WORKS_FOR)) {
Node addressOwner = rel.getOtherNode(user);
getAddresses(addressOwner, addresses);
}

…and now we can go from a user to anything that is connected to them, find those addresses and compare them to our target address as well. The full source code of this example is on github as always.

This is the concept behind Graph Based Search. We aren’t searching a dumb index that doesn’t know anything beyond the search string. We are filtering our search to the connected values from one or more starting points not doing a global search. There may be 3000 John Smiths in your dataset, but only 1 happens to be a consultant working for a company that is providing services to one of your clients, and that is the John Smith you want to see at the top of the list. Graph Based Search is smarter search, are you using it?

Delivering a Graph Based Search solution to slightly wrong data

Trending Articles

《沈冰自述——我和周永康的故事》全本

Moog - Subsequent 25

出售: 林憶蓮•回來愛的身邊 (東芝1A1頭版)

筆記 - 使用 PowerShell 清除停用 AD 帳號與 OU

df-dferh-01 中国区 Android 安装 Google Play Store 后报错的解决办法

「一棒接一棒、棒棒強棒」108學年度家長會長交接典禮

吸烟与MBTI类型判断捷径 (豆瓣 INFJ的奇幻之旅小组)

acermark龍璿國際展出多款包裝設備

枋寮北勢寮隆山宮睽違12年再辦迎王祭典

日本女优有村千佳COS集锦：狂三&黑白岩&亚丝娜&绫波丽

有遇到过这个问题么。/jsb-videoplayer.js not found, possible missing file.

MAS v2.8 magicgenius 汉化版 - 11.11更新

出售: Monster Cable Interlink Reference 2

福建佛教人士望云和尚(林斌)的九仙禅寺被强行收走，望云妈妈被赶出寺庙

R 语言中的OpenBLAS*和英特尔® 数学核心函数库的性能比较

[转载]煞貢、直星、人專吉日\金神七煞歌

HAKERS哈克士戶外 12月8~14日廠拍

OBS Studio 23.2.1 免安裝中文版 - 免費網路實況廣播軟體實況主必備軟體取代Fraps

<請教>行駛中安卓機會重新開機

Udp2raw-tunnel 及其一键安装脚本