Azure DocumentDB powers the modern marketing intelligence platform

Affinio is an advanced marketing intelligence platform that enables brands to understand their users at a deeper and richer level. Affinio’s learning engine extracts marketing insights for its clients by mining billions of points of social media data. To store and process billions of social network connections without the overhead of database management, partitioning, and indexing, the Affinio engineering team chose Azure DocumentDB.

You can learn more about Affinio’s journey in this newly published case study. In this blog post, we provide an excerpt of the case study and discuss some effective patterns for storing and processing social network data.

Why are NoSQL databases a good fit for social data?

Affinio’s marketing platform extracts data from Twitter and other large social networks to feed its learning engine and derive insights about users and their interests. The biggest dataset consisted of approximately one billion social media profiles, growing at 10 million per month. Affinio also needs to store and process a number of other feeds, including Twitter tweets (status messages), geo-location data, and machine learning results of which topics are likely to interest which users.

A NoSQL database is a natural choice for these data feeds for a number of reasons:

The APIs from popular social networks produce data in JSON format.
The data volume is in the TBs and needs to be refreshed frequently (with both the volume and frequency expected to increase rapidly over time).
Data from multiple social media producers is processed downstream, and each social media channel has its own schema that evolves independently.
And crucially, a small development team needs to be able to iterate rapidly on new features, which means that the database must be easy to set up, manage, and scale.

Why does Affinio use DocumentDB over AWS DynamoDB and Elasticsearch?

The Affinio engineering team initially built their storage solution on top of Elasticsearch on AWS EC2 virtual machines. While Elasticsearch addressed their need for scalable JSON storage, they realized that setting up and managing their own Elasticsearch servers took precious time away from their development team. They then evaluated Amazon’s DynamoDB service, which was fully managed but did not have the query capabilities that Affinio needed.

Affinio then tried Microsoft Azure DocumentDB, Microsoft’s planet-scale NoSQL database service. DocumentDB is a fully-managed NoSQL database with automatic indexing of JSON documents, elastic scaling of throughput and storage, and rich query capabilities, which met all their requirements for functionality and performance. As a result, Affinio decided to migrate its entire stack off AWS and onto Microsoft Azure.

“Before moving to DocumentDB, my developers would need to come to me to confirm that our Elasticsearch deployment would support their data or if I would need to scale things to handle it. DocumentDB removed me as a bottleneck, which has been great for me and them.”

-Stephen Hankinson, CTO, Affinio

Modeling Twitter Data in DocumentDB: An Example

As an example, we take a look at how Affinio stored data from Twitter status messages in DocumentDB. Here’s a sample JSON status message (truncated for visibility).

{
  "created_at": "Fri Sep 02 06:43:15 +0000 2016",
  "id": 771599352141721600,
  "id_str": "771599352141721600",
  "text": "RT @DocumentDB: Fresh SDK! #DocumentDB #dotnet SDK v1.9.4 just released!",
  "user": {
    "id": 2557284469,
    "id_str": "2557284469",
    "name": "Azure DocumentDB",
    "screen_name": "DocumentDB",
    "location": "",
    "description": "A blazing fast, planet scale NoSQL service delivered by Microsoft.",
    "url": "http://t.co/30Tvk3gdN0"
  }
}

Storing this data in DocumentDB is straightforward. As a schema-less NoSQL database, DocumentDB consumes JSON data directly from Twitter APIs without requiring schema or index definitions. As a developer, the primary considerations for storing this data in DocumentDB are the choice of partition key and addressing any unique query patterns (in this case, searching within the text of status messages). We'll look at how Affinio addresses these two.

Picking a good partition key: DocumentDB partitioned collections require that you specify a property within your JSON documents as the partition key. Using this partition key value, DocumentDB automatically distributes data and requests across multiple physical servers. A good partition key has a large number of distinct values, which allows DocumentDB to distribute data and requests across a number of partitions. Let’s take a look at a few candidates for a good partition key for social data like Twitter status messages.

"created_at": this property has many distinct values and is useful for accessing data for a certain time range. However, since new status messages are inserted based on the created time, this could result in hot spots for certain time values, such as the current time.
"id": this property corresponds to the ID of a Twitter status message. It is a good candidate for a partition key because there are a large number of unique status messages, and they can be distributed somewhat evenly across any number of partitions/servers.
"user.id": this property corresponds to the ID of a Twitter user. This was ultimately the best choice for a partition key because not only does it allow writes to be distributed, it also allows reads for a certain user’s status messages to be served efficiently via queries from a single partition.

With "user.id" as the partition key, Affinio created a single DocumentDB partitioned collection provisioned with 200,000 request units per second of throughput (both for ingestion and for querying via their learning engine).
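The trade-offs among the three candidate keys can be sketched with a small simulation. This is a minimal illustration, not DocumentDB's actual placement logic: the MD5-based partition_for function, the fixed count of 8 partitions, and the synthetic messages (all created in the same second to mimic a burst of writes) are assumptions made for the example.

```python
import hashlib
from collections import Counter

NUM_PARTITIONS = 8  # hypothetical; the real service manages partitions itself


def partition_for(key_value, num_partitions=NUM_PARTITIONS):
    """Map a partition key value to a partition with a stable hash (illustrative only)."""
    digest = hashlib.md5(str(key_value).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions


def get_path(doc, path):
    """Read a nested property such as 'user.id' from a JSON-like dict."""
    value = doc
    for part in path.split("."):
        value = value[part]
    return value


def make_messages(num_users=50, per_user=20):
    """Synthetic status messages that all share one 'created_at' second."""
    messages = []
    next_id = 1
    for user in range(num_users):
        for _ in range(per_user):
            messages.append({
                "id": next_id,
                "created_at": "Fri Sep 02 06:43:15 +0000 2016",
                "user": {"id": 1000 + user},
            })
            next_id += 1
    return messages


def write_distribution(messages, key_path):
    """Count how many writes land on each partition for a given key."""
    return Counter(partition_for(get_path(m, key_path)) for m in messages)


def partitions_for_user(messages, key_path, user_id):
    """Partitions that must be queried to read one user's messages."""
    return {
        partition_for(get_path(m, key_path))
        for m in messages
        if m["user"]["id"] == user_id
    }


messages = make_messages()
for key_path in ("created_at", "id", "user.id"):
    writes = write_distribution(messages, key_path)
    fan_out = len(partitions_for_user(messages, key_path, 1000))
    print(key_path, "| partitions written:", len(writes),
          "| partitions read for one user:", fan_out)
```

Under these assumptions, a burst of messages sharing one "created_at" value lands on a single partition, "id" spreads writes but scatters one user's messages across partitions, and "user.id" spreads writes across users while keeping each user's messages together.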

Searching within the text message: Affinio needs to be able to search for words within status messages, but doesn’t need to perform advanced text analysis like ranking. Affinio runs a Lucene tokenizer on the relevant fields when it needs to search for terms, and it stores the terms as an array inside a JSON document in DocumentDB. For example, "text" can be tokenized as a "text_terms" array containing the tokens/words in the status message. Here’s an example of what this would look like:

{
  "text": "RT @DocumentDB: Fresh SDK! #DocumentDB #dotnet SDK v1.9.4 just released!",
  "text_terms": [
    "rt",
    "documentdb",
    "dotnet",
    "sdk",
    "v1.9.4",
    "just",
    "released"
  ]
}
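The term-array approach can be sketched as follows. This is a rough illustration under stated assumptions: a simple regular expression stands in for the Lucene tokenizer (so its output can differ from Lucene's, for example in which terms it keeps), and tokenize, with_text_terms, and find_by_term are hypothetical helper names. The in-memory lookup only mimics an array-membership query; DocumentDB's SQL grammar has a built-in ARRAY_CONTAINS function that is one way such a lookup could be expressed, though this excerpt does not say which query Affinio runs.

```python
import re

# One possible server-side form of the lookup (an assumption, shown for reference only).
EXAMPLE_QUERY = "SELECT * FROM c WHERE ARRAY_CONTAINS(c.text_terms, @term)"


def tokenize(text):
    """Lowercase the text and return its distinct terms in first-seen order.

    A simple regex stand-in for a Lucene tokenizer, not Lucene itself.
    Dotted tokens such as version numbers are kept as one term.
    """
    terms = []
    for token in re.findall(r"[a-z0-9]+(?:\.[a-z0-9]+)*", text.lower()):
        if token not in terms:
            terms.append(token)
    return terms


def with_text_terms(status):
    """Return a copy of the status message with a 'text_terms' array added."""
    enriched = dict(status)
    enriched["text_terms"] = tokenize(status["text"])
    return enriched


def find_by_term(documents, term):
    """Local stand-in for an array-membership query on 'text_terms'."""
    wanted = term.lower()
    return [d for d in documents if wanted in d["text_terms"]]


status = {
    "text": "RT @DocumentDB: Fresh SDK! #DocumentDB #dotnet SDK v1.9.4 just released!"
}
doc = with_text_terms(status)
print(doc["text_terms"])
print(len(find_by_term([doc], "DocumentDB")))
```

Storing the terms in lowercase and lowercasing the search term keeps the match case-insensitive without any text analysis at query time.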