Table of contents
- Introduction to NoSQL
  - What is NoSQL
  - Differences from SQL databases
  - NoSQL data models
- Business Drivers for NoSQL
  - Scalability
  - Flexible schema
  - High availability
  - Low latency
- NoSQL Data Architectural Patterns
  - Key-value store
  - Document store
  - Column store
  - Graph database
- Using NoSQL to Manage Big Data
Introduction to NoSQL
What is NoSQL
NoSQL is usually read as "Not only SQL"; the term is also used more loosely to mean "non-relational".
NoSQL databases provide an alternative to traditional relational databases.
They are designed to overcome some limitations of relational databases:
Scaling
Agility
Performance
Flexibility
Examples:
Key-Value stores: Redis, Voldemort
Document stores: MongoDB
Column stores: Cassandra, HBase
Graph databases: Neo4j, Titan
Differences from SQL databases
Schema
NoSQL databases have dynamic schemas
SQL databases have rigid schemas
Scaling
NoSQL scales horizontally
SQL scales vertically
Distribution
NoSQL is designed to be distributed
SQL databases were traditionally designed around a single server, although many now support replication and clustering
Query language
NoSQL databases use store-specific query languages or APIs, which are often less expressive than SQL
SQL uses powerful SQL
ACID compliance
Many NoSQL databases relax ACID guarantees in favor of performance and availability; some, such as MongoDB, also offer ACID transactions
SQL provides ACID transactions
Data model
NoSQL uses non-relational models
SQL uses a relational model
Joins
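Most NoSQL databases have no joins or only limited join support; related data is usually embedded or combined in the application
SQL databases join tables natively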
NoSQL data models
Key-value store
- Data stored as keys associated with values
Document store
- Data stored as documents (JSON/BSON) with dynamic schema
Column store
- Data stored in columns rather than rows
Graph database
- Data stored as nodes and edges
Wide column store
- A variant of column store with column families
The choice of data model depends on requirements.
Each model trades off performance, scalability, and flexibility differently.
Business Drivers for NoSQL
Scalability
NoSQL databases are designed to scale horizontally by adding more servers.
They can scale to massive amounts of data and handle high volumes of read/write operations.
This makes them suitable for applications with very large and fast-growing datasets.
Flexible schema
NoSQL databases have dynamic or flexible schemas as opposed to the rigid schemas of SQL databases.
Schema changes usually need no migration step or downtime: new fields can simply start appearing in newly written records.
This agility makes NoSQL databases suitable for rapidly evolving datasets and models.
High availability
Most NoSQL databases are designed with high availability and fault tolerance in mind.
They employ replication and data distribution techniques to keep data accessible when individual servers fail.
This makes them suitable for applications that require constant uptime and access to data.
Low latency
Because many NoSQL workloads avoid complex joins and multi-record transactions, they can often be served with lower latency.
This is especially true for read operations that can be served from memory.
Low and predictable latency makes NoSQL suitable for real-time applications that have strict response time requirements.
NoSQL Data Architectural Patterns
Key-value store
Data is stored as a collection of key-value pairs.
Simple and fast data model.
Suitable for:
Caching data
Storing session data
Storing user profiles
Example databases: Redis, Voldemort
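The key-value model can be illustrated with a small in-memory sketch. This is not how Redis or Voldemort are implemented; the `KeyValueStore` class, its `ttlMs` parameter, and the injectable clock are hypothetical names chosen for the example, which assumes a session-cache use case with lazy expiry.

```javascript
// Minimal in-memory key-value store with optional expiry (illustrative only).
class KeyValueStore {
  constructor(now = () => Date.now()) {
    this.entries = new Map(); // key -> { value, expiresAt }
    this.now = now;           // injectable clock, handy for testing
  }

  set(key, value, ttlMs = null) {
    const expiresAt = ttlMs === null ? null : this.now() + ttlMs;
    this.entries.set(key, { value, expiresAt });
  }

  get(key) {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (entry.expiresAt !== null && this.now() >= entry.expiresAt) {
      this.entries.delete(key); // lazily evict expired entries on read
      return undefined;
    }
    return entry.value;
  }

  delete(key) {
    return this.entries.delete(key);
  }
}

// Usage: a session that expires after 30 minutes.
const sessions = new KeyValueStore();
sessions.set("session:42", { userId: 7, cart: ["book"] }, 30 * 60 * 1000);
console.log(sessions.get("session:42").userId); // 7
```

Every operation is a lookup by key, which is why the model is simple and fast but offers no way to query by value.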
Document store
Data is stored as documents (JSON/BSON) with dynamic schema.
Documents have a nested structure.
Suitable for:
Storing and querying semi-structured data
Storing log data
Storing product catalogs
Example databases: MongoDB, Couchbase
Column store
Data is organized into columns rather than fixed rows, and each row may hold a different set of columns.
Columns are grouped into column families.
Suitable for:
Handling large datasets with sparse data
Analytical workloads with heavy read loads
Example databases: Cassandra, HBase
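A rough way to picture column families and sparse data is a nested map from row key to column family to column. The sketch below only illustrates the logical layout; it says nothing about how Cassandra or HBase store data on disk, and `put` and `getFamily` are hypothetical helpers.

```javascript
// Sketch of a wide-column layout: row key -> column family -> column -> value.
// Rows are sparse: a column that is absent simply takes no space.
const table = new Map();

function put(rowKey, family, column, value) {
  if (!table.has(rowKey)) table.set(rowKey, new Map());
  const families = table.get(rowKey);
  if (!families.has(family)) families.set(family, new Map());
  families.get(family).set(column, value);
}

// Read all columns of one family for one row as a plain object.
function getFamily(rowKey, family) {
  const families = table.get(rowKey);
  if (!families || !families.has(family)) return {};
  return Object.fromEntries(families.get(family));
}

put("user#1", "profile", "name", "John");
put("user#1", "profile", "city", "Oslo");
put("user#1", "activity", "lastLogin", "2024-01-01");
put("user#2", "profile", "name", "Mary"); // no "city", no "activity" family

console.log(getFamily("user#1", "profile"));  // { name: 'John', city: 'Oslo' }
console.log(getFamily("user#2", "activity")); // {}
```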
Graph database
Data is stored as nodes and edges (relationships).
Suited for data with complex relationships.
Suitable for:
Social networking queries
Recommendation engines
Knowledge graphs
Example databases: Neo4j, Titan
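To show why relationship-heavy queries suit a graph model, here is a minimal sketch of a friends-of-friends recommendation over an adjacency list. It is plain JavaScript, not a Neo4j query; the `addEdge` and `friendsOfFriends` helpers and the sample names are made up for the example.

```javascript
// Undirected graph as an adjacency list: node -> Set of neighbours.
const graph = new Map();

function addEdge(a, b) {
  if (!graph.has(a)) graph.set(a, new Set());
  if (!graph.has(b)) graph.set(b, new Set());
  graph.get(a).add(b);
  graph.get(b).add(a);
}

// Suggest nodes two hops away that are not already direct neighbours.
function friendsOfFriends(person) {
  const direct = graph.get(person) || new Set();
  const suggestions = new Set();
  for (const friend of direct) {
    for (const candidate of graph.get(friend)) {
      if (candidate !== person && !direct.has(candidate)) {
        suggestions.add(candidate);
      }
    }
  }
  return [...suggestions].sort();
}

addEdge("alice", "bob");
addEdge("bob", "carol");
addEdge("alice", "dave");
addEdge("dave", "erin");
addEdge("bob", "dave");

console.log(friendsOfFriends("alice")); // [ 'carol', 'erin' ]
```

The traversal follows edges directly, so its cost depends on the size of the neighbourhood rather than on the total number of records.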
The choice of data model depends on:
The type of data and queries
Performance requirements
Scalability requirements
Data relationships
Using NoSQL to Manage Big Data
Horizontal scaling
NoSQL databases are designed to scale horizontally by adding more servers.
As data and traffic grow, more servers can be added to the cluster.
This lets a NoSQL cluster grow with the dataset instead of requiring ever larger single machines.
Schema-less design
Since NoSQL databases have dynamic schemas, they can accommodate rapidly growing and changing datasets.
New attributes can be added to documents on the fly without affecting existing data.
This makes NoSQL a good fit for big data projects with evolving data models.
Real-time access
Many NoSQL databases are optimized for low latency and high throughput.
They can provide real-time access to big data by caching frequently accessed data in memory.
This makes NoSQL suitable for applications requiring real-time insights from big data.
Flexible queries
SQL queries are powerful, but joins and transactions that span many servers become expensive at big data scale.
NoSQL databases offer query mechanisms that are designed to be distributed across a cluster.
These include map-reduce functions, secondary indexes, and filtering on specific attributes.
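The map-reduce idea mentioned above can be sketched in a few lines on a single machine. In a real cluster the map and reduce steps would run on many nodes in parallel; the function names `mapPhase`, `shufflePhase`, and `reducePhase` and the log records are assumptions made for this illustration.

```javascript
// Map step: emit [key, value] pairs from each record.
function mapPhase(records, mapFn) {
  return records.flatMap(mapFn);
}

// Shuffle step: group emitted values by key.
function shufflePhase(pairs) {
  const groups = new Map();
  for (const [key, value] of pairs) {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  }
  return groups;
}

// Reduce step: collapse each key's values into one result.
function reducePhase(groups, reduceFn) {
  const result = {};
  for (const [key, values] of groups) {
    result[key] = reduceFn(key, values);
  }
  return result;
}

const logs = [
  { level: "error", bytes: 120 },
  { level: "info", bytes: 40 },
  { level: "error", bytes: 80 },
];

// Total bytes logged per level.
const totals = reducePhase(
  shufflePhase(mapPhase(logs, (doc) => [[doc.level, doc.bytes]])),
  (_key, values) => values.reduce((sum, v) => sum + v, 0)
);

console.log(totals); // { error: 200, info: 40 }
```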
Introduction to MongoDB
What is MongoDB?
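MongoDB is a popular document database (a NoSQL database). Its source code is publicly available under the Server Side Public License (SSPL), which replaced the AGPL in late 2018.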
It stores data in flexible, JSON-like documents.
It is horizontally scalable and high-performance.
MongoDB is written in C++.
Data model - documents, collections, schemas
Data is stored in documents rather than tables.
A document is a JSON-like data structure that consists of field-value pairs.
Documents have a dynamic schema - different documents in a collection do not have to have the same fields.
Documents with similar characteristics are grouped into collections.
Collections live within databases.
MongoDB has a dynamic schema - you do not define the schema in advance, it is defined by the data itself.
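As a small illustration of "the schema is defined by the data itself", the sketch below treats a collection as a plain array of objects and derives which fields actually occur. The `fieldCounts` helper and the sample documents are hypothetical; no MongoDB API is involved.

```javascript
// A "collection" of documents whose fields differ from one document to the next.
const users = [
  { name: "John", age: 30 },
  { name: "Mary", email: "mary@example.com" },
  { name: "Steve", age: 35, address: { city: "Berlin" } },
];

// Count how many documents contain each top-level field.
function fieldCounts(docs) {
  const counts = {};
  for (const doc of docs) {
    for (const field of Object.keys(doc)) {
      counts[field] = (counts[field] || 0) + 1;
    }
  }
  return counts;
}

console.log(fieldCounts(users));
// { name: 3, age: 2, email: 1, address: 1 }
```

Because nothing enforces a common shape, application code has to tolerate missing fields such as `age` or `email`.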
CRUD operations in MongoDB
Create: Use the insertOne() and insertMany() methods to insert documents into collections. The older insert() method is deprecated.
Read: The find() and findOne() methods are used to query documents. You can use queries, projections and sorting.
Update: The updateOne() and updateMany() methods are used to update existing documents. The older update() method is deprecated.
Delete: The deleteOne() and deleteMany() methods are used to delete documents from a collection. The older remove() method is deprecated.
MongoDB Architecture
Sharding
Sharding allows MongoDB to split data across multiple servers.
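There is no default shard key: you choose one (a single field or a combination of fields) when you shard a collection, and _id is only one possible choice.
The shard key determines how data is distributed across shards, using either ranged or hashed partitioning.
The mongos router uses the shard key value to direct each operation to the shard that owns the matching range of key values.
Sharding lets MongoDB scale horizontally by adding more shards; how close to linear the scaling is depends on how evenly the shard key spreads data and load.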
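The general idea of routing by a hashed shard key can be sketched as follows. This is a deliberate simplification: MongoDB assigns ranges of key values (or of hashed key values) to shards rather than taking a hash modulo the shard count, and the FNV-1a hash, the `shardFor` helper, and the `userId` field are assumptions made for the example.

```javascript
// 32-bit FNV-1a hash over the UTF-16 code units of a string
// (a simple non-cryptographic hash, used here only for illustration).
function fnv1a(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // keep the value an unsigned 32-bit integer
  }
  return hash >>> 0;
}

// Route a document to a shard by hashing its shard key value.
function shardFor(doc, shardKey, shardCount) {
  return fnv1a(String(doc[shardKey])) % shardCount;
}

// Distribute nine documents over three shards.
const shards = [[], [], []];
for (let id = 1; id <= 9; id++) {
  const doc = { userId: `user-${id}`, score: id * 10 };
  shards[shardFor(doc, "userId", shards.length)].push(doc);
}

console.log(shards.map((shard) => shard.length)); // documents per shard
```

The same key always maps to the same shard, so a query that includes the shard key can be sent to one shard instead of all of them. A key with few distinct values would concentrate documents on a few shards, which is why shard key choice matters.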
Replication
Replication in MongoDB involves copying and mirroring data on multiple servers.
It provides data redundancy and high availability.
MongoDB uses a primary-secondary replication model.
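There is one primary node that receives all writes. By default reads also go to the primary; secondaries replicate its data and can serve reads if the client's read preference allows it.
When the primary node fails, the remaining members hold an election and promote a secondary to be the new primary.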
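The failover behaviour can be pictured with a toy model. It is far simpler than MongoDB's actual election protocol: replication here is synchronous, the "election" just picks the reachable node with the longest log, and the `ReplicaSet` class and its methods are hypothetical names. The one realistic rule it keeps is that a primary requires a majority of members to be reachable.

```javascript
// Toy replica set: one primary accepts writes and copies them to reachable nodes.
class ReplicaSet {
  constructor(names) {
    this.nodes = names.map((name) => ({ name, log: [], up: true }));
    this.primary = this.nodes[0];
  }

  write(entry) {
    if (this.primary === null) throw new Error("no primary: writes are rejected");
    for (const node of this.nodes) {
      if (node.up) node.log.push(entry); // replicate to every reachable node
    }
  }

  // Mark a node as failed, then re-evaluate who (if anyone) is primary.
  fail(name) {
    const node = this.nodes.find((n) => n.name === name);
    if (!node) throw new Error(`unknown node: ${name}`);
    node.up = false;
    this.elect();
  }

  elect() {
    const reachable = this.nodes.filter((n) => n.up);
    // Without a majority of members there can be no primary.
    if (reachable.length * 2 <= this.nodes.length) {
      this.primary = null;
      return;
    }
    if (this.primary && this.primary.up) return; // current primary is still healthy
    // Promote the reachable node with the longest log (first one wins ties).
    this.primary = reachable.reduce((best, n) =>
      n.log.length > best.log.length ? n : best
    );
  }
}

const rs = new ReplicaSet(["a", "b", "c"]);
rs.write("x=1");
rs.fail("a");                 // primary fails, a secondary takes over
console.log(rs.primary.name); // b
rs.write("x=2");              // writes continue on the new primary
```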
Indexing
As in other databases, indexes improve query performance in MongoDB.
MongoDB supports several index types:
Single field index
Compound index
Multikey index
Hashed index
Text index
Geospatial 2d and 2dsphere index
Indexes are created on one or more fields in a collection.
MongoDB automatically creates a unique index on the _id field; all other indexes must be created explicitly, for example with createIndex().
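Why an index helps can be shown with a minimal sketch that compares a full scan with a lookup through a precomputed map. Real MongoDB indexes are ordered tree structures that also support range queries; the `Map`-based index below handles only equality, and `buildIndex`, `scan`, and `lookup` are hypothetical helpers.

```javascript
const people = [
  { _id: 1, name: "John", age: 30 },
  { _id: 2, name: "Mary", age: 25 },
  { _id: 3, name: "Steve", age: 30 },
];

// Build a single-field index: field value -> positions of matching documents.
function buildIndex(docs, field) {
  const index = new Map();
  docs.forEach((doc, position) => {
    const key = doc[field];
    if (!index.has(key)) index.set(key, []);
    index.get(key).push(position);
  });
  return index;
}

// Without an index: examine every document (a collection scan).
function scan(docs, field, value) {
  return docs.filter((doc) => doc[field] === value);
}

// With an index: jump straight to the matching positions.
function lookup(docs, index, value) {
  return (index.get(value) || []).map((position) => docs[position]);
}

const ageIndex = buildIndex(people, "age");
console.log(lookup(people, ageIndex, 30).map((d) => d.name)); // [ 'John', 'Steve' ]
```

The index must be updated on every insert, update, and delete, which is the write-time cost paid for faster reads.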
Storing Data in MongoDB
Insert, update, and delete documents
Documents are inserted into collections using the insertOne() and insertMany() methods.
db.collection.insertOne({ name: "John", age: 30 })
db.collection.insertMany([{ name: "Mary", age: 25 }, { name: "Steve", age: 35 }])
Documents can be updated using the updateOne() and updateMany() methods.
db.collection.updateOne({ name: "John" }, { $set: { age: 31 } })
Documents can be deleted using the deleteOne() and deleteMany() methods.
db.collection.deleteOne({ name: "John" })
db.collection.deleteMany({ age: { $lt: 25 } })
Embedded documents
MongoDB supports embedding related data as sub-documents within a document.
This is useful when the related data has a one-to-one or one-to-few relationship.
For example, a blog post may have embedded comments:
{ "_id": 1, "title": "My First Blog", "body": "Lorem ipsum...", "comments": [ { "user": "john", "message": "Great blog!" }, { "user": "jane", "message": "Nice write up!" } ] }
Referencing other documents
MongoDB also supports referencing related data by storing the ID of the related document.
This is useful when the related data has a one-to-many or many-to-many relationship, or when it is too large or too widely shared to embed.
For example, a user document may reference many post documents:
{ "name": "John", "posts": [ObjectId("5b11ca86aefd4f0474fcc4bb"), ObjectId("5b11d1cdaefd4f0474fcc4bd")] }
{ "_id": ObjectId("5b11ca86aefd4f0474fcc4bb"), "title": "My First Blog" }
{ "_id": ObjectId("5b11d1cdaefd4f0474fcc4bd"), "title": "My Second Blog" }
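With references, the related documents have to be fetched in a second step. The sketch below mimics that application-side lookup with plain objects; the string IDs stand in for ObjectId values, and `resolvePosts` is a hypothetical helper rather than a MongoDB function.

```javascript
// Two "collections": posts keyed by _id, and a user that stores post IDs.
const posts = new Map([
  ["p1", { _id: "p1", title: "My First Blog" }],
  ["p2", { _id: "p2", title: "My Second Blog" }],
]);
const user = { name: "John", posts: ["p1", "p2", "p9"] }; // "p9" is a dangling reference

// Resolve references with a second lookup, skipping IDs that no longer exist.
function resolvePosts(userDoc, postCollection) {
  return userDoc.posts
    .map((id) => postCollection.get(id))
    .filter((post) => post !== undefined);
}

console.log(resolvePosts(user, posts).map((p) => p.title));
// [ 'My First Blog', 'My Second Blog' ]
```

Nothing stops a reference from pointing at a deleted document, so code that follows references should handle missing targets, as the filter step does here.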
Querying MongoDB Data
find() and findOne()
The find() method is used to query documents in a collection. It returns a cursor to the matched documents.
The findOne() method returns a single document (or null if nothing matches) and is useful when you expect only one match.
Queries are expressed as JSON-like filter documents:
Equality:
{"name": "John"}
Comparison:
{"age": {$gt: 30}}
Logical:
{"$or": [{"age": 18}, {"age": 30}]}
Regular expression:
{"name": /^J/}
For example:
db.users.find({"age": {$gt: 30}})
db.users.findOne({"name": "John"})
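To make the selector semantics concrete, here is a minimal sketch of a matcher that evaluates a small subset of these filters against plain objects. It is not MongoDB's query engine: it covers only equality, regular expressions, `$gt`/`$gte`/`$lt`/`$lte`, and `$or`, and the `matches` function is a hypothetical helper.

```javascript
// Comparison operators supported by this sketch.
const comparators = {
  $gt: (a, b) => a > b,
  $gte: (a, b) => a >= b,
  $lt: (a, b) => a < b,
  $lte: (a, b) => a <= b,
};

// Return true if doc satisfies every condition in filter.
function matches(doc, filter) {
  return Object.entries(filter).every(([field, condition]) => {
    if (field === "$or") {
      return condition.some((sub) => matches(doc, sub));
    }
    const value = doc[field];
    if (condition instanceof RegExp) {
      return typeof value === "string" && condition.test(value);
    }
    if (condition !== null && typeof condition === "object") {
      return Object.entries(condition).every(([op, operand]) => {
        if (!(op in comparators)) throw new Error(`unsupported operator: ${op}`);
        return comparators[op](value, operand);
      });
    }
    return value === condition; // plain equality
  });
}

const sample = [
  { name: "John", age: 31 },
  { name: "Jane", age: 18 },
  { name: "Mary", age: 45 },
];

const names = (filter) => sample.filter((d) => matches(d, filter)).map((d) => d.name);

console.log(names({ age: { $gt: 30 } }));                    // [ 'John', 'Mary' ]
console.log(names({ $or: [{ age: 18 }, { name: /^M/ }] })); // [ 'Jane', 'Mary' ]
console.log(names({ name: /^J/, age: { $lt: 30 } }));        // [ 'Jane' ]
```

Multiple fields in one filter are combined with an implicit AND, as the last call shows.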
Projection
Projection allows you to specify which fields to include or exclude in the result documents.
Use 1 to include a field and 0 to exclude it. Apart from _id, inclusions and exclusions cannot be mixed in the same projection:
db.users.find({}, {"name": 1, "age": 1, "_id": 0})
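The effect of an inclusion projection can be sketched on a plain object. The `project` helper below is hypothetical and handles only top-level fields in inclusion mode; it mirrors the rule that `_id` is returned unless it is explicitly set to 0.

```javascript
// Inclusion-style projection: keep fields marked 1; keep _id unless it is marked 0.
function project(doc, spec) {
  const out = {};
  if (spec._id !== 0 && "_id" in doc) out._id = doc._id;
  for (const [field, flag] of Object.entries(spec)) {
    if (flag === 1 && field in doc) out[field] = doc[field];
  }
  return out;
}

const john = { _id: 1, name: "John", age: 30, email: "john@example.com" };

console.log(project(john, { name: 1, age: 1, _id: 0 })); // { name: 'John', age: 30 }
console.log(project(john, { name: 1 }));                 // { _id: 1, name: 'John' }
```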
Sorting, limiting and skipping results
You can sort the results of a query using the sort() method:
db.users.find().sort({"age": 1}) // Sort by age in ascending order
You can limit the number of results using the limit() method:
db.users.find().limit(5) // Limit to 5 results
You can skip the first n results using the skip() method:
db.users.find().skip(5) // Skip first 5 results
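These three methods are commonly chained to page through results. The sketch below reproduces the sort, skip, limit sequence on a plain array so the arithmetic is visible; `page` is a hypothetical helper and it assumes a numeric sort field.

```javascript
// Plain-array equivalent of find().sort({ field: 1 }).skip(n).limit(m).
function page(docs, sortField, pageNumber, pageSize) {
  const sorted = [...docs].sort((a, b) => a[sortField] - b[sortField]); // copy, then sort ascending
  const start = (pageNumber - 1) * pageSize;     // number of documents to skip
  return sorted.slice(start, start + pageSize);  // then limit to one page
}

const members = [
  { name: "A", age: 40 },
  { name: "B", age: 22 },
  { name: "C", age: 35 },
  { name: "D", age: 19 },
  { name: "E", age: 51 },
];

console.log(page(members, "age", 1, 2).map((d) => d.name)); // [ 'D', 'B' ]
console.log(page(members, "age", 2, 2).map((d) => d.name)); // [ 'C', 'A' ]
console.log(page(members, "age", 3, 2).map((d) => d.name)); // [ 'E' ]
```

Sorting before skipping matters: without a defined order, the same document could appear on two pages or on none.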
Aggregation framework
The aggregation framework processes documents through a pipeline of stages, such as:
$group to group by some criteria and apply aggregate functions
$match to filter the data
$project to transform the data
$sort to sort the data
And many more stages.
For example, to calculate the average age by gender:
db.users.aggregate([
  {
    $group: {
      _id: "$gender",
      avgAge: { $avg: "$age" }
    }
  }
])
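What this $group stage computes can be checked against a plain-JavaScript equivalent. The sketch runs on an array instead of a collection; `averageBy` and the sample rows are assumptions made for the example, and the output field is named `avgAge` only to match the pipeline above.

```javascript
// Plain-JavaScript equivalent of { $group: { _id: "$gender", avgAge: { $avg: "$age" } } }.
function averageBy(docs, groupField, valueField) {
  const sums = new Map(); // group key -> { total, count }
  for (const doc of docs) {
    const key = doc[groupField];
    const acc = sums.get(key) || { total: 0, count: 0 };
    acc.total += doc[valueField];
    acc.count += 1;
    sums.set(key, acc);
  }
  return [...sums].map(([key, { total, count }]) => ({
    _id: key,
    avgAge: total / count,
  }));
}

const rows = [
  { gender: "F", age: 30 },
  { gender: "M", age: 40 },
  { gender: "F", age: 20 },
  { gender: "M", age: 50 },
];

console.log(averageBy(rows, "gender", "age"));
// [ { _id: 'F', avgAge: 25 }, { _id: 'M', avgAge: 45 } ]
```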