
The Reddit Meme Graph with Neo4j


On Saturday night, after not enough drinks, I came across these tweets by @LeFloatingGhost.


This definitely looks like a meme graph. We can do that too.

Recorded Session

If you want to see me struggle to get this going live, watch my session here.


If you want to see an interactive version of this post, check it out at the Graph Gist Collection.

Find us some memes

There is this really nice CSV from Reddit of the top memes around:

https://github.com/umbrae/reddit-top-2.5-million/blob/master/data/memes.csv

We want to grab the raw URL: https://raw.githubusercontent.com/umbrae/reddit-top-2.5-million/master/data/memes.csv

And grab an empty Neo4j Sandbox from http://neo4jsandbox.com.

What’s the data like?

Check CSV

WITH 'https://raw.githubusercontent.com/umbrae/reddit-top-2.5-million/master/data/memes.csv' as url
LOAD CSV WITH HEADERS FROM url AS row
RETURN count(*);

╒══════════╕
│"count(*)"│
╞══════════╡
│"1000"    │
└──────────┘

WITH 'https://raw.githubusercontent.com/umbrae/reddit-top-2.5-million/master/data/memes.csv' as url
LOAD CSV WITH HEADERS FROM url AS row
RETURN row limit 3;

╒════════════════════════════════════════════════════════════════════════════════════════════════════╕
│"row"                                                                                               │
╞════════════════════════════════════════════════════════════════════════════════════════════════════╡
│{"over_18":"False","name":"t3_1edsw9","permalink":"http://www.reddit.com/r/memes/comments/1edsw9/can│
│_we_please_start_a_crazy_amy_meme_for_amy_of/","url":"http://www.quickmeme.com/meme/3uer85/","domain│
│":"quickmeme.com","distinguished":null,"score":"1831","downs":"1010","link_flair_css_class":null,"su│
│breddit_id":"t5_2qjpg","thumbnail":"http://b.thumbs.redditmedia.com/qpz4enS1CCFIs8Ys.jpg","id":"1eds│
│w9","author_flair_css_class":null,"link_flair_text":null,"selftext":null,"ups":"2841","num_comments"│
│:"120","edited":"False","title":"Can We Please Start a Crazy Amy Meme For Amy of Amy's Baking Compan│
│y?","created_utc":"1368627364.0","is_self":"False"}                                                 │
├────────────────────────────────────────────────────────────────────────────────────────────────────┤
...

Load them memes

WITH 'https://raw.githubusercontent.com/umbrae/reddit-top-2.5-million/master/data/memes.csv' as url
LOAD CSV WITH HEADERS FROM url AS row
WITH row LIMIT 10000
CREATE (m:Meme) SET m=row // we take it all into Meme nodes

Added 100 labels, created 100 nodes, set 1700 properties, statement completed in 120 ms.

Get some memes

MATCH (m:Meme) return m limit 25;

MATCH (m:Meme) return m.id, m.title limit 5;

╒════════╤════════════════════════════════════════════════════════════════════════════════╕
│"m.id"  │"m.title"                                                                       │
╞════════╪════════════════════════════════════════════════════════════════════════════════╡
│"1edsw9"│"Can We Please Start a Crazy Amy Meme For Amy of Amy's Baking Company?"         │
├────────┼────────────────────────────────────────────────────────────────────────────────┤
│"1ihc34"│"Given the competitive nature of redditors, I assume you all feel the same way."│
├────────┼────────────────────────────────────────────────────────────────────────────────┤
│"1gmt99"│"This man left this woman..."                                                   │
├────────┼────────────────────────────────────────────────────────────────────────────────┤
│"1ds9y4"│"How to cure bad breath..."                                                     │
├────────┼────────────────────────────────────────────────────────────────────────────────┤

But we want the words!

Let’s grab the first meme and get going.

Split the text into words.

MATCH (m:Meme) WITH m limit 1
RETURN split(m.title, " ") as words;

["Can","We","Please","Start","a","Crazy","Amy","Meme","For","Amy","of","Amy's","Baking","Company?"]

CAN YOU HEAR ME?

MATCH (m:Meme) WITH m limit 1
RETURN split(toUpper(m.title), " ") as words;

["CAN","WE","PLEASE","START","A","CRAZY","AMY","MEME","FOR","AMY","OF","AMY'S","BAKING","COMPANY?"]

Remove Punctuation

Create an array of punctuation with split on empty string.

return split(",!?'.","") as chars;

[",","!","?","'","."]

And replace each of the characters with nothing (the empty string ''):

with "a?b.c,d" as word
return word, reduce(s=word, c IN split(",!?'.","") | replace(s,c,'')) as no_chars;

╒═════════╤══════════╕
│"word"   │"no_chars"│
╞═════════╪══════════╡
│"a?b.c,d"│"abcd"    │
└─────────┴──────────┘
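For readers who want to check the clean-up logic outside Neo4j, here is a minimal Python sketch of the same idea: fold over the punctuation characters and remove each one in turn, then upper-case and split on spaces. The function names `strip_punctuation` and `title_to_words` are made up for this illustration and are not part of the original post.

```python
from functools import reduce

PUNCTUATION = ",!?'."  # the same five characters the Cypher example splits into a list


def strip_punctuation(text):
    """Remove each punctuation character in turn, like reduce(...) + replace(...) in Cypher."""
    return reduce(lambda s, c: s.replace(c, ""), PUNCTUATION, text)


def title_to_words(title):
    """Upper-case the title, strip punctuation, then split on single spaces."""
    return strip_punctuation(title.upper()).split(" ")


print(strip_punctuation("a?b.c,d"))  # abcd
```

Only those five characters are removed, so anything else (quotes, dashes, brackets) stays attached to its word, exactly as in the Cypher version.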

We got us some nice words

MATCH (m:Meme) WITH m limit 1
// lets split the text into words
RETURN split(reduce(s=toUpper(m.title), c IN split(",!?'.","") | replace(s,c,'')), " ") as words;

╒═════════════════════════════════════════════════════════════════════════════════════════════════╕
│"words"                                                                                          │
╞═════════════════════════════════════════════════════════════════════════════════════════════════╡
│["CAN","WE","PLEASE","START","A","CRAZY","AMY","MEME","FOR","AMY","OF","AMYS","BAKING","COMPANY"]│
└─────────────────────────────────────────────────────────────────────────────────────────────────┘

Enough words, where are the nodes?

Let’s create some word nodes

(MERGE does get-or-create)

MATCH (m:Meme) WITH m limit 1
WITH split(reduce(s=toUpper(m.title), c IN split(",!?'.","") | replace(s,c,'')), " ") as words, m
MERGE (a:Word {text:words[0]})
MERGE (b:Word {text:words[1]});

Our first two words

MATCH (n:Word) RETURN n;

Unwind the ra(n)ge

But we want all the words in the array, so let’s unwind a range.

MATCH (m:Meme) WITH m limit 1
WITH split(reduce(s=toUpper(m.title), c IN split(",!?'.","") | replace(s,c,'')), " ") as words, m
UNWIND range(0,size(words)-2) as idx // turn the range into rows of idx
MERGE (a:Word {text:words[idx]})
MERGE (b:Word {text:words[idx+1]});

MATCH (n:Word) RETURN n;
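The index arithmetic in the UNWIND step is easy to get wrong by one, so here is a small Python sketch of the same pairing. Cypher's range(0, n-2) includes both ends, which corresponds to Python's range(n - 1). The helper name `word_pairs` is invented for this example.

```python
def word_pairs(words):
    """Return (words[idx], words[idx + 1]) for idx = 0 .. len(words) - 2, inclusive."""
    # Python's range(n - 1) covers the same indices as Cypher's range(0, n - 2).
    return [(words[idx], words[idx + 1]) for idx in range(len(words) - 1)]


pairs = word_pairs(["CAN", "WE", "PLEASE", "START"])
print(pairs)
```

A one-word title produces an empty range, so in the Cypher query it yields no rows and therefore no Word nodes.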

No Limits

MATCH (m:Meme) WITH m // no limits
WITH split(reduce(s=toUpper(m.title), c IN split(",!?'.","") | replace(s,c,'')), " ") as words, m
UNWIND range(0,size(words)-2) as idx // turn the range into rows of idx
MERGE (a:Word {text:words[idx]})
MERGE (b:Word {text:words[idx+1]});

MATCH (n:Word) RETURN count(*);

Chain up the memes

Connect the words via :NEXT and store the meme ids on each relationship in an ids property.

For the first word (idx = 0), let’s also connect the Meme node to the first Word.

MATCH (m:Meme) WITH m
WITH split(reduce(s=toUpper(m.title), c IN split(",!?'.","") | replace(s,c,'')), " ") as words, m
UNWIND range(0,size(words)-2) as idx // turn the range into rows of idx
MERGE (a:Word {text:words[idx]})
MERGE (b:Word {text:words[idx+1]})
// Connect the words via :NEXT and store the meme-ids on each rel in an `ids` property
MERGE (a)-[rel:NEXT]->(b)
SET rel.ids = coalesce(rel.ids,[]) + [m.id] // to later recreate the meme along the next chain
// connect the first word to the meme itself
WITH * WHERE idx = 0
MERGE (m)-[:FIRST]->(a);

Set 546 properties, created 614 relationships, statement completed in 65 ms.
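To see what ends up stored on the relationships, the following Python sketch builds the same structure in memory: one entry per (word, next word) pair holding the list of meme ids, plus a map from each meme to its first word. It only mirrors the Cypher logic and is not how Neo4j stores anything. The function name `build_meme_graph`, the input shape, and the two sample titles with ids "a1" and "b2" are invented for illustration.

```python
from collections import defaultdict


def build_meme_graph(memes):
    """memes: dict of meme id -> list of cleaned words (hypothetical input shape)."""
    next_ids = defaultdict(list)  # (word, next_word) -> meme ids, like rel.ids on :NEXT
    first = {}                    # meme id -> first word, like the :FIRST relationship
    for meme_id, words in memes.items():
        for idx in range(len(words) - 1):
            # Same effect as MERGE (a)-[rel:NEXT]->(b) SET rel.ids = rel.ids + [m.id]
            next_ids[(words[idx], words[idx + 1])].append(meme_id)
            if idx == 0:
                first[meme_id] = words[0]
    return dict(next_ids), first


# Made-up sample data, not taken from the CSV.
sample = {
    "a1": ["SCUMBAG", "STEVE", "STRIKES", "AGAIN"],
    "b2": ["SCUMBAG", "STEVE", "BORROWS", "MONEY"],
}
next_ids, first = build_meme_graph(sample)
print(next_ids[("SCUMBAG", "STEVE")])
```

Because both sample titles start with the same two words, they share one edge whose id list contains both memes. That shared list is what later lets a single meme be picked out of the overlapping word chains.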

Yay done!

MATCH (m:Meme)-[:FIRST]->(w:Word)-[:NEXT]->(w2:Word)
RETURN * LIMIT 33;

Which words appear most often

MATCH (w:Word) WHERE size(w.text) > 4
RETURN w.text, size( (w)--() ) as relCount
ORDER BY relCount DESC LIMIT 10;

╒══════════════════╤══════════╕
│"w"               │"relCount"│
╞══════════════════╪══════════╡
│{"text":"AFTER"}  │"56"      │
├──────────────────┼──────────┤
│{"text":"REDDIT"} │"34"      │
├──────────────────┼──────────┤
│{"text":"ABOUT"}  │"33"      │
├──────────────────┼──────────┤
│{"text":"TODAY"}  │"33"      │
├──────────────────┼──────────┤
│{"text":"SCUMBAG"}│"32"      │
├──────────────────┼──────────┤
│{"text":"EVERY"}  │"31"      │
├──────────────────┼──────────┤
│{"text":"FIRST"}  │"30"      │
├──────────────────┼──────────┤
│{"text":"ALWAYS"} │"28"      │
├──────────────────┼──────────┤
│{"text":"FRIEND"} │"27"      │
├──────────────────┼──────────┤
│{"text":"THOUGHT"}│"24"      │
└──────────────────┴──────────┘

Now let’s find our memes again

// first meme
MATCH (m:Meme) WITH m limit 1
// from the :FIRST :Word follow the :NEXT chain
MATCH path = (m)-[:FIRST]->(w)-[rels:NEXT*..15]->()
// let's follow the chain of words starting
// from the meme, where all relationships contain the meme-id
WHERE ALL(r in rels WHERE m.id IN r.ids)
RETURN *;

Show meme by id

We can also pick a meme from the CSV list,

e.g. id '1kc9p2': 'As stupid as memes are they can actually make valid points'

MATCH (m:Meme) WHERE m.id = '1kc9p2'
MATCH path = (m)-[:FIRST]->(w)-[rels:NEXT*..15]->()
WHERE ALL(r in rels WHERE m.id IN r.ids)
RETURN *;
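The lookup by id can be sketched in plain Python as a walk along the word pairs that carry the meme's id, starting from its first word and never reusing an edge. This is a simplified, greedy stand-in for the variable-length path match, which returns every matching path rather than choosing one. The name `rebuild_title` and the in-memory edge dictionary are assumptions made for this illustration; the title and id are the ones from the example.

```python
def rebuild_title(meme_id, first_word, next_ids, max_hops=15):
    """Follow (word, next_word) edges whose id list contains meme_id, at most max_hops times."""
    words = [first_word]
    used = set()  # like a Cypher path, never reuse the same relationship
    for _ in range(max_hops):
        candidates = [
            pair for pair, ids in next_ids.items()
            if pair[0] == words[-1] and meme_id in ids and pair not in used
        ]
        if not candidates:
            break
        used.add(candidates[0])  # greedy: take the earliest unused edge
        words.append(candidates[0][1])
    return " ".join(words)


# Build the edges for the example title only.
title_words = "AS STUPID AS MEMES ARE THEY CAN ACTUALLY MAKE VALID POINTS".split(" ")
next_ids = {}
for a, b in zip(title_words, title_words[1:]):
    next_ids.setdefault((a, b), []).append("1kc9p2")

rebuilt = rebuild_title("1kc9p2", title_words[0], next_ids)
print(rebuilt)
```

This title repeats the word AS, so two edges with the same meme id leave that node. The sketch happens to pick them in the right order because the edges were inserted in title order; the Cypher query instead returns all paths that satisfy the id filter, including the shorter ones.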

Done. Enjoy!

PS: If you want to connect your own stuff, grab a Neo4j Sandbox or use Neo4j on your machine.

If you have questions, ask me, Michael, on Twitter or on Slack.

