Mongo Metrics is a new series in collaboration with Compose's Resident Data ScientistLisa Smith that shows you how to extract insights and toy with data stored in Compose MongoDB. This series is a MongoDB flavor of our popular Metrics Maven series.
While most of us don't immediately think "Data Science" when we think of MongoDB, it turns out that the MongoDB, through its aggregations pipeline, is a fantastic data store for assembling and analyzing complex data. In this new series, we'll build on our previous Metrics Maven articles onmean,median andmode, as well as our previous article on the MongoDB aggregations pipeline , to achieve Mean, Median, and Mode calculations using MongoDB. In this first article in the series, we'll cover the mean .
Getting to the MeanWe'll borrow the product catalog from a fictional pet supply company used ina previous article and put it in a collection called products .
product | category | productLine | price | numberInStock ----------------------------------------------------------------------- leash | dog wear | Bowser | 15.99 | 48 collar | dog wear | Bowser | 10.99 | 76 name tag | dog wear | Bowser | 5.99 | 204 jacket | dog wear | Bowser | 24.99 | 12 ball | dog toys | Bowser | 6.99 | 27 plushy | dog toys | Bowser | 8.99 | 30 rubber bone | dog toys | Bowser | 4.99 | 52 rubber bone | dog toys | Tippy | 4.99 | 38 plushy | dog toys | Tippy | 6.99 | 16 ball | dog toys | Tippy | 2.99 | 47 leash | dog wear | Tippy | 12.99 | 34 collar | dog wear | Tippy | 6.99 | 88 name tag | dog wear | Tippy | 5.99 | 165 jacket | dog wear | Tippy | 20.99 | 50We'll also create a transaction collection which contains the following data:
order_id | date | item_count | order_value ---------------------------------------------------------------- 50000 | ISODATE("2016-09-02") | 3 | 35.97 50001 | ISODATE("2016-09-02") | 2 | 7.98 50002 | ISODATE("2016-09-02") | 1 | 5.99 50003 | ISODATE("2016-09-02") | 1 | 4.99 50004 | ISODATE("2016-09-02") | 7 | 78.93 50005 | ISODATE("2016-09-02") | 0 | null 50006 | ISODATE("2016-09-02") | 1 | 5.99 50007 | ISODATE("2016-09-02") | 2 | 19.98 50008 | ISODATE("2016-09-02") | 1 | 5.99 50009 | ISODATE("2016-09-02") | 2 | 12.98 50010 | ISODATE("2016-09-02") | 1 | 20.99We'll have to modify the data format slightly for MongoDB - one document in the product collection looks like the following:
{ _id: ObjectID("589cd56b6ca2fef0f7737fbc"), product: "leash", category: "dog wear", productLine: "Bowser", price: 15.99, numberInStock: 48 }and one document in the transactions collection looks like the following:
{ _id: ObjectID("589cd56b6ca2eef0f7737b0a"), orderId: 50001, date: ISODATE("2016-09-02"), itemCount: 3, orderValue: 35.97 } Lean MeanCalculating the Mean, also known outside the mathematical world as the "average", is our first stop on the Mongo Metrics tour. The mean is computed by summing a field in each document returned from a query, and dividing that by the number of documents returned. For example, to calculate the mean of the prices across all of the products in our database, we would add the prices together and divide by the number of documents.
To better understand this section, you'll want to have a foundational understanding of the $match and $group operators in the MongoDB aggregation pipeline. If you need some background, you can check out our previous article on MongoDB aggregations by example .
MongoDB provides an $avg operator that delivers exactly what we want. The first thing we need to do is make sure that we're only running our average calculation on transactions that have a non-null value in the orderValue field:
{ $match: { orderValue: { $exists: true } } }Next, we'll use the $group operation to generate the sum across the orderValue fields.
{ $group: { _id: null, averageTransactionAmount: { $avg: "$orderValue" } } }Combining the two gives us the full aggregation query:
db.transactions.aggregate([ { $match: { orderValue: { $exists: true } } }, { $group: { _id: null, averageTransactionAmount: { $avg: "$orderValue" } } } ]);Which will give us the following output:
{ averageTransactionAmount: 18.16272727272 }Since this is a dollar amount, it would be ideal to have our results rounded to the hundreds place.
Rounding in MongoDBMongoDB doesn't have an operation for rounding so we'll need to do a little bit of fancy footwork with our averageTransactionAmount . The steps we'll take to get down to the hundredths place is as follows:
Multiply the amount by 100 (1816.2727272) Truncate the amount to it's integer value (1816) Divide the amount by 100 (18.16)We aren't doing true rounding here, but that's ok for our purposes.
Since our math operations accept any expression that evaluates to a number, we can add our calculation directly to the group stage of our pipeline:
{ $group: { _id: null, averageTransactionAmount: { $divide: [ { $trunc: { $mult: [ {$avg: "$orderValue"}, 100 ] } }, 100 ] } } }Which produces the following results:
{ averageTransactionAmount: 18.16 }While this may seem like an involved process, it's pretty straight forward if you follow the steps listed above. Our final aggregation query including rounding looks like this:
db.transactions.aggregate([ { $match: { orderValue: { $exists: true } } }, { $group: { _id: null, averageTransactionAmount: { $divide: [ { $trunc: { $mult: [ {$avg: "$orderValue"}, 100 ] } }, 100 ] } } } ]); Next StepsFinding the mean is a great first starting point for metrics, but it's not perfect. One of the major flaws with mean is that adding one value which lies far outside the majority of the data can heavily skew the results.
Let's take a look at an extreme example:
100, 105, 110, 112, 120, 500Looking at the data above, it's obvious that the biggest value (500) lies far outside the rest of the values. This value, called an outlier , affects our mean value since mean takes into account every single data point. With our outlier, the mean value is the following:
(100 + 105 + 110 + 112 + 120 + 500) / 6 = 174.5Without the outlier, the mean more closely represents what we're looking for with the average - an idea of where the majority of the values lie:
(100 + 105 + 110 + 112 + 120) / 5 = 109.4The idea that a single value that lies substantially above or below the majority of values can skew the results makes mean unreliable as a singular metric. To get a better picture of the data, we need another answer.
In our next article in this series, we'll explore using the median to reduce the impact of our outliers.
If you have any feedback about this or any other Compose article, drop the Compose Articles team a line atarticles@compose.com. We're happy to hear from you.
attribution