Affinio is an advanced marketing intelligence platform that enables brands to understand their users at a deeper and richer level. Affinio’s learning engine extracts marketing insights for its clients by mining billions of points of social media data. In order to store and process billions of social network connections without the overhead of database management, partitioning, and indexing, the Affinio engineering team chose Azure DocumentDB.
You can learn more about Affinio’s journey in this newly published case study. In this blog post, we provide an excerpt of the case study and discuss some effective patterns for storing and processing social network data.

Why are NoSQL databases a good fit for social data?
Affinio’s marketing platform extracts data from social network platforms like Twitter and other large social networks in order to feed into its learning engine and learn insights about users and their interests. The biggest dataset consisted of approximately one billion social media profiles, growing at 10 million per month. Affinio also needs to store and process a number of other feeds, including Twitter tweets (status messages), geo-location data, and machine learning results of which topics are likely to interest which users.
A NoSQL database is a natural choice for these data feeds for a number of reasons:
The APIs from popular social networks produce data in JSON format.
The data volume is in the terabytes and needs to be refreshed frequently (with both the volume and the frequency expected to increase rapidly over time).
Data from multiple social media producers is processed downstream, and each social media channel has its own schema that evolves independently.
And crucially, a small development team needs to be able to iterate rapidly on new features, which means that the database must be easy to set up, manage, and scale.

Why does Affinio use DocumentDB over AWS DynamoDB and Elasticsearch?
The Affinio engineering team initially built their storage solution on top of Elasticsearch on AWS EC2 virtual machines. While Elasticsearch addressed their need for scalable JSON storage, they realized that setting up and managing their own Elasticsearch servers took precious time away from their development team. They then evaluated Amazon’s DynamoDB service, which was fully managed but did not have the query capabilities that Affinio needed.
Affinio then tried Microsoft Azure DocumentDB, Microsoft’s planet-scale NoSQL database service. DocumentDB is a fully managed NoSQL database with automatic indexing of JSON documents, elastic scaling of throughput and storage, and rich query capabilities, which met all their requirements for functionality and performance. As a result, Affinio decided to migrate its entire stack off AWS and onto Microsoft Azure.
“Before moving to DocumentDB, my developers would need to come to me to confirm that our Elasticsearch deployment would support their data or if I would need to scale things to handle it. DocumentDB removed me as a bottleneck, which has been great for me and them.”
-Stephen Hankinson, CTO, Affinio

Modeling Twitter Data in DocumentDB: An Example
As an example, we take a look at how Affinio stored data from Twitter status messages in DocumentDB. Here’s a sample JSON status message (truncated for visibility).
{"created_at":"Fri Sep 02 06:43:15 +0000 2016",
"id":771599352141721600,
"id_str":"771599352141721600",
"text":"RT @DocumentDB: Fresh SDK! #DocumentDB #dotnet SDK v1.9.4 just released!",
"user":{
"id":2557284469,
"id_str":"2557284469",
"name":"Azure DocumentDB",
"screen_name":"DocumentDB",
"location":"",
"description":"A blazing fast, planet scale NoSQL service delivered by Microsoft.",
"url":"http://t.co/30Tvk3gdN0"
}
}
Storing this data in DocumentDB is straightforward. As a schema-less NoSQL database, DocumentDB consumes JSON data directly from Twitter APIs without requiring schema or index definitions. As a developer, the primary considerations for storing this data in DocumentDB are the choice of partition key and addressing any unique query patterns (in this case, searching within text messages). We'll look at how Affinio addresses these two.
Picking a good partition key: DocumentDB partitioned collections require that you specify a property within your JSON documents as the partition key. Using this partition key value, DocumentDB automatically distributes data and requests across multiple physical servers. A good partition key has a large number of distinct values and allows DocumentDB to distribute data and requests evenly across a number of partitions. Let’s take a look at a few candidates for a good partition key for social data like Twitter status messages.
"created_at": This property has a large number of distinct values and is useful for accessing data for a certain time range. However, since new status messages are inserted based on the created time, it could result in hot spots for certain time values, such as the current time.
"id": This property corresponds to the ID for a Twitter status message. It is a good candidate for a partition key because there are a large number of unique IDs, and they can be distributed somewhat evenly across any number of partitions/servers.
"user.id": This property corresponds to the ID for a Twitter user. This was ultimately the best choice for a partition key because not only does it allow writes to be distributed, it also allows reads for a certain user’s status messages to be served efficiently via queries from a single partition.
With "user.id" as the partition key, Affinio created a single DocumentDB partitioned collection provisioned with 200,000 request units per second of throughput (both for ingestion and for querying via their learning engine).
Searching within the text message: Affinio needs to be able to search for words within status messages, but doesn’t need to perform advanced text analysis like ranking. Affinio runs a Lucene tokenizer on the relevant fields when it needs to search for terms, and it stores the terms as an array inside a JSON document in DocumentDB. For example, "text" can be tokenized as a "text_terms" array containing the tokens/words in the status message. Here’s an example of what this would look like:
{"text":"RT @DocumentDB: Fresh SDK! #DocumentDB #dotnet SDK v1.9.4 just released!",
"text_terms":[
"rt",
"documentdb",
"dotnet",
"sdk",
"v1.9.4",
"just",
"released"
]
}
Since DocumentDB