Graph Databases are great at storing connected data. In this blog post I am going to cover how Neo4j, most widely used graph database, can be used in Document Management System. The most complex part of document management system is the Access Control Management. When a user logs in, he/she should be able to view only folders/document that they have access for. Also if a user searches for documents, he should be shown only those documents for which he has access. I will first start with RDBMS and show how complex it is to model this requirement in RDBMS. And then will move to Graph Database modeling for same requirement and we will see how easy and effortless it is to implement this with Neo4j and Cypher Query Langague.
Let's take a scenario and create it's model in RDBMS. Say we have 2 users, "User1" and "User2". Both are respectively mapped to roles "Role1" and "Role2". Suppose we have following hierarchy of folders and documents
Project1 -> Requirements Doc1 -> Doc1.pdf
Project2 -> Requirements Doc2 -> Doc2.pdf
Let's say, all users with role "Role1" has access to view folder "Project1" and it's sub-folders/documents. And say all users with "Role2" will have access to view folders "Project1" and "Project2" and it's sub-folders/documents. We also assume that if a user with a role has access to a folder, then they will also have access to all it's sub-folders and documents unless a specific configuration restricts a user to access it. We can represent this requirement in following diagram.
According to this requirement, when "User1" performs a search for documents, he should be able to view only "Doc1.pdf". However when "User2" performs similar search, he should be able to view both "Doc1.pdf" and "Doc2.pdf". Now let's try to create RDBMS model for this requirement. In RDBMS world, we normally map entities to tables. So for this particular requirement, we have entities like User, Role and Document (file or folder). Accordingly we will end up with following tables.
User : To store user ids and their details
Role : To store role details
Document : To store file and folder details and has reference to their parent for hierarchy in which they can be navigated. Here we will only keep reference of immediate parent.
User_Role : To assign Roles to Users
Document_Role : To configure what all Roles can access a Document (file or folder).
Here is the sample data to represent our requirement.
|Table : User|
|Table : Role|
|Table : Document|
|Table : User_Role|
|Table : Document_Role|
Now with this data model in place, if we have to write a SQL to get all the documents on which a User has access (may be for search document purpose), we need to check if a user has access to any of the parent folders for a document and include the document in result if so. Writing such a SQL will not be a trivial job. A document can have "n" number of parent folders. In our case, for "Doc1.pdf", there are 2 parent folders i.e. "Requirement Docs1" and "Project 1". So if a role has access on any of these folders, "Doc1.pdf" should be included in result set. Clearly this can not be done with a single SQL statement as there are no fixed number of parent folders. Most likely we will have to write a stored procedure which will have following step.
-> Fetch all the "Folders" and "Documents" where user has direct access as per his role configuration.
-> Insert above result in some temporary table.
-> Iterate through all the records in temporary table where document has a type "FOLDER".
-> For each iterated "folder", write another SQL to fetch all documents which are either directly inside this folder, or they are inside any of it's sub- folder. Again a loop would be required.
-> Finally return distinct documents.
I haven't written a SQL for above requirement, as you can understand it's anyways going to be complex one. If we want to get rid of this and fetch all documents in single SQL, we might have to change our RDBMS model. One of the solution could be to have a separate table for mapping document's with their parents. Something like "document_parent", which will have reference to all possible parents for a document. And then we might be able to write a single SQL to get all documents by joining with this table. This might work if you have limited number of records. But if you have millions of records with very large number of folder/sub-folder hierarchy, this will easily be performance bottleneck. For example if we move "Projects1" and "Projects2" folder inside another folder say "Projects", we will have to update in "document_parent" table for all their sub-folders/documents as a new parent is being added in the hierarchy.
There might be other solutions as well to approach this problem. But clearly RDBMS doesn't provide an easy solution for this problem. Some will have performance issues with "inserts" while other will have performance issues with "selects" if you have large dataset. Now let's see how we can model this in Graph Database (Neo4j) and how easy it is to query this kind of data.
While modeling in Graph Database, we will represent Entities like User, Role, Document etc. as "Nodes" and the way they are connected as "RelationShips". So we will have following nodes and relationships.
I created this in neo4j and following diagram represents this model as viewed from Neo4j "Data Browser".
As can be seen from above Graph Model, it closely resembles the structure of our requirements as mentioned in 1st figure. And that is the beauty of Graph Database. Where RDBMS has a disconnect between how data is stored and how it is viewed in real world, Graph Database stores the data in same way as it is viewed. One of the advantage of representing data in Graph model is that it can be easily queried as per requirements. Let's now write a Cypher Query (Neo4j Query Language) to fetch all the documents that a user has access to.
That's very concise and yet expressive. Isn't it!!!. Let's analyse this Cypher query step by step
--> START - Since we want to fetch all the documents that a user "User1" has access to, we first locate "User1" node from lucene index. By default index name is "node_auto_index" unless you have created it in different index.
--> MATCH - Once we have got the starting node, all we need to do is write a "PATTERN" to match desired paths. So we first traverse from user node to role node by u1-[:HAS_ROLE]->(r1). And then all the folders where role "r1" has access to by (r1)-[:HAS_ACCESS*]->(f) and finally to all the documents for which folder "f" is parent (f)-[:PARENT_OF*]->(d). Note the usage of "*" in relationship. It allows to traverse to the document at all possible depths. If required you can control the depth with say *1..5 to traverse only till 5th level from parent folder in hierarchy.
--> WHERE - Since our documents can be of "FOLDER" or "FILE" type we restrict our query to return only "FILE".
--> RETURN - Lists the nodes that we want to return from this query.
Above query is not only easier to write and understand, it also performs far better than RDBMS SQL. In RDBMS, a join results in index look up which degrades over time as more and more data is added in DB. The same is not true for Graph Database as it takes same time to traverse a path irrespective of overall data size.