DSE Graph is a scalable, real-time graph database that was released at the end of June as a new addition to the DSE platform. After recovering from the turbulence of a major release, the time has come to peel back the curtain and look into the engine room: What are the major features and innovations that make DSE Graph an enterprise-grade graph database?

A graph database is a database system purpose-built for managing highly connected data. Unlike other database systems, including RDBMS and NoSQL, graph databases make it easy to model and query for relationships.
DSE Graph uses the property graph data model and Gremlin query language of the Apache TinkerPop project, the open-source, vendor-neutral graph database standard governed by the Apache Software Foundation.

The property graph data model can express complex data models as-is without a logical mapping, a characteristic that’s often described as “whiteboard friendly”. The Gremlin query language can succinctly express query paths and subgraph patterns without the need for cumbersome JOINs or custom application code, making it easy to retrieve entities connected via complex relationships from a big graph of data. Apache TinkerPop is a central ingredient and in many ways the primary interface to DSE Graph.
Implementing the property graph data model and supporting a graph query language is sufficient to expose a database system as a graph database but, like putting lipstick on a pig, this often results in slow performance and unexpected system behavior. What makes a good graph database is a balanced combination of efficient property graph data representation, fast graph-centric index structures, and smart query optimization. DSE Graph achieves this combination in a distributed, scale-out environment with no single point of failure and continuous availability using the following technologies.
Index-Free Adjacency
DSE Graph stores graphs in their adjacency list representation. All properties and edges that touch a particular vertex are stored in a consecutive, sorted list on the node in the cluster to which the vertex is assigned. This representation allows us to navigate through the graph from vertex to vertex without having to call into an index structure. By contrast, storing edges in a large table, which would be the normal approach for RDBMS or NoSQL stores, requires an expensive, global index to locate vertex data.
Adjacency list sort order facilitates efficient retrieval of subsets of the adjacency list. As graphs grow in size, queries often only require small subsets of the entire vertex data. In those cases, we exploit the sort order to limit the data retrieval and speed up query processing.
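To make the benefit of the sort order concrete, here is a minimal in-memory sketch. It is not how DSE Graph lays out data on disk; the tuple layout `(edge_label, sort_key, neighbor_id)`, the sample data, and the helper `edges_with_label` are all assumptions made for illustration. The point is only that a sorted list lets us locate a contiguous slice with binary search instead of scanning every entry.

```python
from bisect import bisect_left, bisect_right

# Hypothetical stand-in for one vertex's adjacency list.
# Each entry is (edge_label, sort_key, neighbor_id); the list is kept sorted.
adjacency = sorted([
    ("follows", 3, "v17"),
    ("wroteMessage", 1001, "m1"),
    ("wroteMessage", 1005, "m2"),
    ("likes", 42, "p9"),
    ("wroteMessage", 1010, "m3"),
    ("follows", 8, "v20"),
])


def edges_with_label(adj, label):
    """Return the contiguous slice of edges carrying `label`.

    Two binary searches find the slice boundaries, so the cost is
    logarithmic in the list length plus the size of the result.
    """
    lo = bisect_left(adj, (label,))                 # first entry with this label
    hi = bisect_right(adj, (label, float("inf")))   # one past the last entry
    return adj[lo:hi]


follows = edges_with_label(adjacency, "follows")
```

Only the `follows` entries are touched here; the `wroteMessage` and `likes` edges are never read.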
A key innovation of DSE Graph is an efficient mapping of the adjacency list representation onto the tabular storage format of Cassandra. Its implementation required changes to Cassandra’s storage engine in 3.0 and changes throughout the entire DSE stack to propagate a graph-optimized data representation.

This innovation allows DSE Graph to stand on top of the powerful distributed database foundation provided by Apache Cassandra without having to sacrifice storage efficiency or query performance.
Furthermore, DSE Graph can plug directly into the enterprise features of the DSE platform: OpsCenter management, data encryption, authentication, secure communication, multi-instance support, and auditing.
Vertex-Centric Indexes
For large graphs, it is not unusual for a single vertex adjacency list to grow to thousands of edges. Iterating over all those edges can be very time consuming for certain access patterns.
For instance, suppose we want to retrieve a customer’s ten most recent messages. If that customer has written thousands of messages, finding those ten can take a significant amount of time and requires retrieving a lot of data.
Vertex-centric indexes are access-specific index structures built and maintained per vertex to speed up such queries. For the example above, we would install a vertex-centric index for `wroteMessage` edges by timestamp.
Unlike index structures in conventional database systems, whose cost scales logarithmically with the size of the entire dataset, vertex-centric indexes are maintained per vertex, and hence the cost of maintenance is logarithmic in the size of the adjacency list of the individual vertex. In other words, maintaining and querying vertex-centric indexes remains inexpensive even as the overall graph grows huge. For that reason, vertex-centric indexes are essential for maintaining fast traversal query performance on very large graphs.
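The "ten most recent messages" example can be sketched in a few lines. This is an illustration of the idea only, not DSE Graph's implementation or schema API: the class `VertexCentricIndex`, its methods, and the in-memory sorted list are hypothetical. Each vertex gets its own small index ordered by timestamp, so a lookup touches only that vertex's entries and never depends on the size of the whole graph.

```python
from bisect import insort
from collections import defaultdict


class VertexCentricIndex:
    """Hypothetical per-vertex index of wroteMessage edges sorted by timestamp."""

    def __init__(self):
        # vertex_id -> list of (timestamp, message_id), kept in ascending order
        self._by_vertex = defaultdict(list)

    def add_edge(self, vertex_id, timestamp, message_id):
        # Binary search finds the slot in O(log d), where d is this vertex's
        # edge count. A Python list still shifts elements on insert; a real
        # store would use a tree or a sorted on-disk structure instead.
        insort(self._by_vertex[vertex_id], (timestamp, message_id))

    def most_recent(self, vertex_id, limit=10):
        # Read only the tail of this vertex's index, newest first.
        entries = self._by_vertex.get(vertex_id, [])
        return [message_id for _, message_id in reversed(entries[-limit:])]


index = VertexCentricIndex()
for t in range(1000):
    index.add_edge("alice", t, f"m{t}")
latest = index.most_recent("alice")  # ten newest message ids for this vertex
```

Adding more customers to the graph creates more per-vertex lists but does not make any single lookup slower, which is the property the paragraph above describes.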
Vertex Partitioning
A vertex and its adjacency list are assigned to and stored on a single machine in the database cluster as the primary replica. This assignment determines data locality. DSE Graph aims to place vertices such that frequently co-traversed vertices end up on the same machine, which improves traversal performance.
DSE Graph’s partitioning techniques will be covered in future posts.
Edge Partitioning
Most natural graphs have a scale-free degree distribution, which means that some vertices are highly connected and have very large adjacency lists. Storing such a vertex and its adjacency list on a single machine would create hotspots and may even be infeasible for huge graphs.
DSE Graph supports edge partitioning, in which fragments of the adjacency list are distributed across all machines in the cluster using a performance-enhancing technique that supports co-processing with locally stored vertices without intra-cluster communication.
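The post does not spell out how fragments are assigned to machines, so the following is a sketch under a stated assumption rather than a description of DSE Graph's technique: each edge of a high-degree vertex is stored on the partition that owns the neighbor vertex at the other end. The names `NUM_PARTITIONS`, `partition_of`, and `partition_edges` are hypothetical, and the CRC32-based placement is a stand-in for whatever partitioner a real cluster uses.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 4  # hypothetical cluster size


def partition_of(vertex_id: str) -> int:
    """Stable hash placement (assumption; not DSE Graph's actual partitioner)."""
    return zlib.crc32(vertex_id.encode("utf-8")) % NUM_PARTITIONS


def partition_edges(source_id, neighbor_ids):
    """Split one vertex's adjacency list into per-partition fragments.

    Each edge lands on the partition that owns its neighbor vertex, so that
    partition can process the edge next to the neighbor's locally stored data.
    """
    fragments = defaultdict(list)
    for neighbor_id in neighbor_ids:
        fragments[partition_of(neighbor_id)].append((source_id, neighbor_id))
    return dict(fragments)


neighbors = [f"fan{i}" for i in range(1000)]
fragments = partition_edges("celebrity", neighbors)
```

Under this assumption, no single machine holds the full adjacency list of the highly connected vertex, and every fragment sits beside the vertices it points to.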
Query Optimizer
In addition to the index structures and partitioning techniques outlined above, DSE Graph also supports materialized view indexes in Cassandra, secondary indexes and the full indexing power of Solr via tight integration through