
Drilling into analytics with NoSQL data stores


One question many app developers find themselves asking is, “OK, I have plenty of data from my application’s users. Now how do I gain insight from my application’s data?” Likely, much of the developer’s dilemma here stems from the NoSQL data stores which support their web or mobile app, because NoSQL databases (like JSON document stores, which we’ll discuss here) are notoriously resistant to traditional analytics approaches.

The rise of NoSQL is intimately tied to the rise of “the app economy.” As the web (and our use of it) evolved, the nature of the data we create and store has changed and it has become much bigger. As relational databases that ran all the desktop client-server applications moved to the web, a new problem developed. Yesterday’s desktop applications came out of a software box from CompUSA (may they rest) and were installed to support an entire company’s user base. But web applications grow to many thousands, if not millions, of users. The traditional relational database simply isn’t efficient at supporting the data needs of web-scale apps. Hence the rise of NoSQL.

The challenge we now have of bringing analytics to NoSQL ultimately boils down to the debate around the merits of schema vs. schema-less data structures. This is an age-old debate which, in our new world of the “Three V’s of Data” (Volume, Variety and Velocity), is here again with a vengeance.

How did we get here?

Let’s take a quick historical tour of the schema vs. schema-less debate and how we got here. Once upon a time, data was stored in a number of files that repeated the same data over and over. For example, let’s imagine a hardware store a few decades ago. This hardware store is very progressive and has implemented a bookkeeping system on a PC.

Back then, if you bought a Stanley Drill, it was made to be the last drill you ever bought (a favorite talking point of my grandfather Earl’s). However, you would still wear out drill bits, which meant you kept going back to the hardware store to replace them. And with our hardware store’s PC bookkeeping system, each time you came back to buy a new drill bit, your customer information would be stored with the sales data in a table like this:


[Figure: sales table with customer information repeated on each row]

But there were two major pains associated with this method.

Pain 1:

This data took up a lot of space in a world where storage was very expensive. When I think back to my first Apple II+ and then look at my iPhone, it’s incredible to think that the storage capacity we hold in our hands today used to fill a room of mainframes. So, the repetition of data in these files back in those days was very costly.

Pain 2:

Updating data. Suppose the hardware store owner wants a list of customers to send out a holiday circular. The data entered above contains an error: unless George Carpenter has gone the path of Prince in 1993 and officially changed his name to “Georg3,” there has been a typo. A query of this sales table now returns him twice, so a second holiday circular gets mailed to the same address, causing unnecessary extra cost to our neighborhood small business and annoying our customer with unwanted junk mail. If we expand this example to something a bit larger, like a department store chain, it grows into a huge problem for bottom-line concerns like marketing expenditures and customer retention.
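To make the duplicate-mailing problem concrete, here is a minimal sketch in Python. The rows, the address and the field names are invented for illustration; only the “George”/“Georg3” typo comes from the example above.

```python
# Hypothetical flat sales records: customer details are repeated on every row.
flat_sales = [
    {"customer": "George Carpenter", "address": "12 Elm St", "item": "Drill bit"},
    {"customer": "Georg3 Carpenter", "address": "12 Elm St", "item": "Drill bit"},
    {"customer": "George Carpenter", "address": "12 Elm St", "item": "Drill bit"},
]

# Build the holiday mailing list from the distinct (name, address) pairs.
mailing_list = sorted({(row["customer"], row["address"]) for row in flat_sales})

# The typo produces two entries for one household, so two circulars go out.
addresses = {address for _, address in mailing_list}
print(len(mailing_list), "circulars for", len(addresses), "address")
```

Nothing in a flat file stops the misspelled name from being saved, which is exactly the gap the relational model set out to close.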

Long story short, the original answer to these pain points was the invention of the relational database. With a relational database, customers could be assigned a CustomerID and separated into their own table, where their records could be updated and edited in one place. Then, using a Foreign Key, you could quickly determine the identity of the customer with a query that joined these tables. We would also do the same for the product data. Let’s look at what that new model looks like.

Breaking up our Customer Sales in a relational architecture implementing our Foreign Keys, we get three tables as follows.


[Figure: the three tables for customers, products and sales, linked by Foreign Keys]

Our new schema looks like this:


[Figure: schema diagram of the three tables]
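As a rough sketch of what such a three-table model might look like, here it is in Python using the standard library’s sqlite3 module. The table names, column names and sample values are my own assumptions for illustration, not the exact schema from the original figures.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    zip_code    TEXT
);
CREATE TABLE products (
    product_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL
);
CREATE TABLE sales (
    sale_id     INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
    product_id  INTEGER NOT NULL REFERENCES products(product_id),
    quantity    INTEGER NOT NULL
);
""")

# Each customer and product is stored once; sales rows only carry the keys.
conn.execute("INSERT INTO customers VALUES (1, 'George Carpenter', '02139')")
conn.execute("INSERT INTO products VALUES (1, 'Drill bit')")
conn.executemany(
    "INSERT INTO sales (customer_id, product_id, quantity) VALUES (?, ?, ?)",
    [(1, 1, 1), (1, 1, 2)],
)

# A join on the foreign keys recovers who bought what.
rows = conn.execute("""
    SELECT c.name, p.name, s.quantity
    FROM sales s
    JOIN customers c ON c.customer_id = s.customer_id
    JOIN products p ON p.product_id = s.product_id
    ORDER BY s.sale_id
""").fetchall()
print(rows)
```

Because the customer’s name lives in a single row of the customers table, correcting a typo there fixes it for every sale at once.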

Now, this may all be basic information, causing you to tilt your head and say:

“Why are you telling me this? Everyone knows this.”

Well, I’ll tell you. This schema is actually helping us police our application. With a relational database management system (RDBMS), you are not allowed to enter data that “doesn’t exist.” Because you have placed that Foreign Key in the sales table, you can’t record a sale that has no valid customerID associated with it. The database rejects the record. This is very helpful if the success of your app depends on not having junk data or not losing data in a sale. You now have much more control over the entities in your application. Adding an attribute to your Products table is also no longer a nightmare of editing giant records, where inserting an extra field is cumbersome, time consuming and prone to error.
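Here is a small sketch of the “schema police” at work, again using Python’s built-in sqlite3 module. The table and column names are assumptions for illustration, and note that SQLite only enforces foreign keys once the pragma below is enabled; other databases enforce them by default.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # required for enforcement in SQLite
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("""
    CREATE TABLE sales (
        sale_id     INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
        amount      REAL NOT NULL
    )
""")
conn.execute("INSERT INTO customers VALUES (1, 'George Carpenter')")

# A sale for a known customer is accepted.
conn.execute("INSERT INTO sales (customer_id, amount) VALUES (1, 4.99)")

# A sale pointing at a customer that does not exist is refused.
try:
    conn.execute("INSERT INTO sales (customer_id, amount) VALUES (99, 4.99)")
    rejected = False
except sqlite3.IntegrityError as exc:
    rejected = True
    print("rejected:", exc)

sale_count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
print("sales stored:", sale_count)
```

The bad row never lands in the table, so every report built on sales can trust that each sale maps to a real customer.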

Going back to our first example, what if you wanted to add the color of the product to that sales table? In our new schema world, we just need to add it to the Product table and set the color. All legacy sales queries now have that data available; you just have to run the report again with the new field included.
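A minimal sketch of that change, assuming a simplified two-table layout whose names and sample data I have made up for illustration: the color is added once to the product, and the sales that were recorded earlier pick it up through the join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (product_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE sales (
    sale_id    INTEGER PRIMARY KEY,
    product_id INTEGER NOT NULL REFERENCES products(product_id)
);
INSERT INTO products VALUES (1, 'Drill bit');
INSERT INTO sales (product_id) VALUES (1), (1);
""")

# Add the attribute once, on the product, instead of on every sales row.
conn.execute("ALTER TABLE products ADD COLUMN product_color TEXT")
conn.execute("UPDATE products SET product_color = 'black' WHERE product_id = 1")

# Sales recorded before the change now report the color as well.
report = conn.execute("""
    SELECT s.sale_id, p.name, p.product_color
    FROM sales s
    JOIN products p ON p.product_id = s.product_id
    ORDER BY s.sale_id
""").fetchall()
print(report)
```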

Schemas are great for reporting and analytics. You know where things are. The schema is your map and you can pick which attributes you are going to analyze. In our example above, it would be very easy to get the sum and count of sales by product, by zip code and by individual customers. And with our new Product_Color field, we can do some digging on what colors sell the most and when.
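To illustrate how directly a schema maps onto these reports, here is a sketch with invented sample data. The table layout, the second customer, the zip codes and the amounts are all hypothetical, and sales_by is a helper of my own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT, zip_code TEXT);
CREATE TABLE products (product_id INTEGER PRIMARY KEY, name TEXT, product_color TEXT);
CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, customer_id INTEGER,
                    product_id INTEGER, amount REAL);
INSERT INTO customers VALUES (1, 'George Carpenter', '02139'), (2, 'Ann Mason', '02140');
INSERT INTO products VALUES (1, 'Drill bit', 'black'), (2, 'Drill', 'yellow');
INSERT INTO sales (customer_id, product_id, amount)
    VALUES (1, 1, 5.0), (1, 1, 5.0), (2, 2, 80.0), (2, 1, 5.0);
""")

# The schema tells us up front which attributes we can slice by.
GROUPABLE = {
    "product": "p.name",
    "zip_code": "c.zip_code",
    "customer": "c.name",
    "color": "p.product_color",
}

def sales_by(dimension):
    """Return (value, sale count, total amount) rows for one dimension."""
    column = GROUPABLE[dimension]  # fixed whitelist, so the f-string is safe
    query = f"""
        SELECT {column}, COUNT(*), SUM(s.amount)
        FROM sales s
        JOIN customers c ON c.customer_id = s.customer_id
        JOIN products p ON p.product_id = s.product_id
        GROUP BY {column}
        ORDER BY {column}
    """
    return conn.execute(query).fetchall()

print(sales_by("product"))
print(sales_by("zip_code"))
print(sales_by("color"))
```

Each new report is one more entry in the lookup table, because the schema already says where every attribute lives.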

So, What About NoSQL Analytics?

“Great. But aren’t we here to discuss getting some insight from a NoSQL database?”

Yes. So one benefit of a NoSQL data store, like a JSON database, is that because there isn’t a pre-defined schema, you can iterate quickly and data can flow into your document store as needed. This is in contrast to the RDBMS “schema police” mentioned above, who make sure those sales records all pass the rules of the defined schema. Acting as sticks in the spokes, they bring data-in-motion, flowing from our apps, our web events, and our devices, to a grinding halt. For many, the move to NoSQL came about because in the time it took the relational database to make sure the user’s data was “okay” for storage, the user had already left and the app failed.

Calling NoSQL “schema-less” is a bit of a misnomer in this use case. The data is being stored in a type of structure that is flexible to additions and subtractions of attributes. The benefit is the ability to perform “fast writes,” meaning incredibly fast insertions of data. With no schema police, you can keep up with your users. Reading from a JSON document store (e.g. IBM Cloudant) is great for apps, and we know front-end JavaScript loves it some JSON. The problem comes when you want to query this data to start gaining insights from it. Creating “design docs” that will grab counts and do some simple metrics aggregations will keep you going, but when you want to hook up an analytics application, in most cases, it’s going to be looking for a schema.
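The sketch below imitates that situation in plain Python. It does not use the Cloudant API or a real design doc; the documents and field names are invented, and the count is only an approximation of the kind of simple aggregation a design doc view might compute.

```python
import json
from collections import Counter

# Hypothetical documents as they might arrive from an app: no two need
# share the same set of attributes.
raw_events = [
    '{"type": "sale", "product": "Drill bit", "amount": 4.99}',
    '{"type": "sale", "product": "Drill", "amount": 79.0, "color": "yellow"}',
    '{"type": "page_view", "path": "/drills", "referrer": "search"}',
]
docs = [json.loads(line) for line in raw_events]

# A simple count per document type, similar in spirit to a map/reduce view.
counts_by_type = Counter(doc["type"] for doc in docs)

# With no schema to consult, a reader has to discover which fields exist.
fields_seen = sorted({key for doc in docs for key in doc})

print(dict(counts_by_type))
print(fields_seen)
```

Writes stay trivial because nothing is checked on the way in. The cost shows up on the read side, where an analytics tool first has to work out which of these fields it can rely on.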

So what about when you want to start aggregating daily statistics into summary tables for charts and graphs? You are definitely going to be looking to move your JSON data into a structure built for speed. (This isn’t just about JSON either. Many relational databases also start to fall down when we enter the world of analytics.)
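One way this move could look, sketched in Python with invented documents, dates and a hypothetical daily_sales summary table: the JSON documents are rolled up per day and loaded into a fixed structure that a charting tool can query directly.

```python
import json
import sqlite3
from collections import defaultdict

# Hypothetical JSON documents exported from the document store.
raw_docs = [
    '{"type": "sale", "date": "2015-06-01", "product": "Drill bit", "amount": 5.0}',
    '{"type": "sale", "date": "2015-06-01", "product": "Drill", "amount": 80.0, "color": "yellow"}',
    '{"type": "sale", "date": "2015-06-02", "product": "Drill bit", "amount": 5.0}',
    '{"type": "page_view", "date": "2015-06-02", "path": "/drills"}',
]

# Roll the sale documents up into [count, revenue] per day.
totals = defaultdict(lambda: [0, 0.0])
for line in raw_docs:
    doc = json.loads(line)
    if doc.get("type") != "sale":
        continue  # ignore documents that are not sales
    day = totals[doc["date"]]
    day[0] += 1
    day[1] += float(doc.get("amount", 0.0))

# Load the roll-up into a summary table with a fixed schema.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE daily_sales (
        sale_date  TEXT PRIMARY KEY,
        sale_count INTEGER NOT NULL,
        revenue    REAL NOT NULL
    )
""")
conn.executemany(
    "INSERT INTO daily_sales VALUES (?, ?, ?)",
    [(date, count, revenue) for date, (count, revenue) in sorted(totals.items())],
)

summary = conn.execute("SELECT * FROM daily_sales ORDER BY sale_date").fetchall()
print(summary)
```

The flexible documents stay where they are for the app, while the reporting side reads from a small table whose columns never change shape.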

Remembering our simple schema above, imagine a schema with a sales table that i
