MongoDB: Comprehensive Guide
Table of Contents
- Introduction to MongoDB
- MongoDB Architecture and Core Concepts
- Key Differences from SQL Databases
- Entity Relationships in MongoDB
- MongoDB Under the Hood
- MongoDB Operations
- MongoDB with Python
- MongoDB Deployment
- MongoDB Security
- Performance Optimization
- Real-world Use Cases
- Best Practices
- Summary
Introduction to MongoDB
MongoDB is a document-oriented, NoSQL database designed for high performance, high availability, and automatic scaling. Developed by MongoDB Inc. (formerly 10gen), it was first released in 2009 and has since become one of the most popular NoSQL databases in the world.
The name "MongoDB" comes from "humongous," reflecting its design goal to handle huge amounts of data efficiently. Unlike traditional relational databases, MongoDB stores data in flexible, JSON-like documents, which allows for variable structure among documents within the same collection.
MongoDB was designed to address several shortcomings of traditional SQL databases:
- Flexibility: The ability to store and process unstructured or semi-structured data
- Scalability: Built from the ground up to scale horizontally across multiple servers
- Performance: Optimized for high write throughput and query performance
- Developer Productivity: Intuitive data model that aligns with modern programming languages
MongoDB Architecture and Core Concepts
Documents
The fundamental unit of data in MongoDB is a document. A document is a set of key-value pairs, similar to JSON objects, but stored in a format called BSON (Binary JSON). Documents allow embedding complex structures like arrays and nested documents.
Example of a MongoDB document:
{
"_id": ObjectId("60a6e3e89f1c6a8d556884b2"),
"name": "John Doe",
"email": "john.doe@example.com",
"age": 30,
"address": {
"street": "123 Main St",
"city": "Anytown",
"state": "CA",
"zip": "12345"
},
"hobbies": ["reading", "hiking", "photography"],
"created_at": ISODate("2021-05-20T15:30:00Z")
}
Key characteristics of documents:
- Maximum size of 16MB per document
- Field names must be strings
- Field values can be any BSON data type
- Field order is preserved during insertion
- Case-sensitive field names
- Each document requires a unique _id field that acts as a primary key
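For example, PyMongo generates an ObjectId automatically when a document is inserted without an _id (a minimal sketch, assuming a running local MongoDB):
from pymongo import MongoClient
client = MongoClient('mongodb://localhost:27017/')
db = client['mydatabase']
# No _id supplied: the driver adds a unique ObjectId before sending the insert
result = db.users.insert_one({"name": "Ada Lovelace"})
print(result.inserted_id)  # e.g. ObjectId('...')
# Re-inserting a document with the same _id would raise DuplicateKeyError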
Collections
Collections are groups of related documents, conceptually similar to tables in relational databases. However, unlike tables, collections don't enforce a schema across documents. Documents within the same collection can have different fields and structures.
For example, a users collection might contain documents representing user profiles, while an orders collection would contain documents representing customer orders.
Collections are organized within databases and follow these naming conventions:
- Cannot be empty strings
- Cannot contain the null character
- Cannot begin with "system." (reserved prefix)
- Cannot contain the $ character (reserved for certain operations)
Databases
A MongoDB instance can host multiple databases, each containing its own collections. Databases are the highest level of data organization in MongoDB and provide isolation for collections.
Some special databases include:
- admin: Used for administrative operations
- local: Stores data specific to a single server
- config: Used by sharded clusters to store configuration information
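These show up alongside application databases and can be listed from Python (a minimal sketch, assuming a local server):
from pymongo import MongoClient
client = MongoClient('mongodb://localhost:27017/')
# The special databases appear next to your own
print(client.list_database_names())  # e.g. ['admin', 'config', 'local', 'mydatabase']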
BSON Format
MongoDB stores data in BSON (Binary JSON) format, which extends the JSON model to provide additional data types and to be more efficient for storage and traversal. BSON supports the following data types:
- String: UTF-8 encoded strings
- Integer: 32-bit or 64-bit integers
- Double: 64-bit IEEE 754 floating point numbers
- Boolean: true or false
- Array: Ordered lists of values
- Object: Embedded documents
- ObjectId: 12-byte identifier typically used for _id fields
- Date: Stored as 64-bit integers representing milliseconds since the Unix epoch
- Null: Represents a null value
- Regular Expression: For pattern matching
- Binary Data: For storing binary data
- Timestamp: MongoDB internal timestamp type
- Decimal128: IEEE 754 decimal-based floating-point number
Example of converting between JSON and BSON in Python:
import json
import datetime
from bson import ObjectId, json_util
# JSON cannot directly encode ObjectId, Date, etc.
document = {
"_id": ObjectId("60a6e3e89f1c6a8d556884b2"),
"name": "John Doe",
"created_at": datetime.datetime.utcnow()
}
# Use json_util from pymongo to handle BSON types
json_str = json.dumps(document, default=json_util.default)
print(json_str)
# Convert back to Python dict with BSON types
parsed_document = json.loads(json_str, object_hook=json_util.object_hook)
print(parsed_document)
Key Differences from SQL Databases
Schema Design
SQL Databases:
- Rigid schema defined at table creation
- Relationships maintained through foreign keys
- Normalization encouraged to avoid data duplication
- Schema changes require migrations
MongoDB:
- Flexible schema-less design
- Documents can evolve over time
- Embedding related data directly in documents
- Denormalization often encouraged for performance
Example of normalized SQL tables vs. MongoDB document:
SQL Tables:
CREATE TABLE customers (
id SERIAL PRIMARY KEY,
name VARCHAR(100),
email VARCHAR(100),
phone VARCHAR(20)
);
CREATE TABLE addresses (
id SERIAL PRIMARY KEY,
customer_id INTEGER REFERENCES customers(id),
street VARCHAR(100),
city VARCHAR(50),
state VARCHAR(20),
zip VARCHAR(10)
);
CREATE TABLE orders (
id SERIAL PRIMARY KEY,
customer_id INTEGER REFERENCES customers(id),
order_date TIMESTAMP,
status VARCHAR(20)
);
MongoDB Document:
{
"_id": ObjectId("60a6e3e89f1c6a8d556884b2"),
"name": "John Doe",
"email": "john.doe@example.com",
"phone": "555-123-4567",
"addresses": [
{
"type": "home",
"street": "123 Main St",
"city": "Anytown",
"state": "CA",
"zip": "12345"
},
{
"type": "work",
"street": "456 Market St",
"city": "Anytown",
"state": "CA",
"zip": "12345"
}
],
"orders": [
{
"order_id": ObjectId("60a6e3e89f1c6a8d556884c1"),
"order_date": ISODate("2021-05-20T15:30:00Z"),
"status": "shipped"
},
{
"order_id": ObjectId("60a6e3e89f1c6a8d556884c2"),
"order_date": ISODate("2021-05-25T10:15:00Z"),
"status": "processing"
}
]
}
Query Language
SQL Databases:
- Standardized SQL language
- Joins for relating data across tables
- Complex transactions with multi-table updates
MongoDB:
- JSON-like query syntax
- No traditional joins (but has $lookup aggregation)
- Query operators to navigate nested documents and arrays
Example query comparison:
SQL:
SELECT customers.name, orders.id, orders.order_date
FROM customers
JOIN orders ON customers.id = orders.customer_id
WHERE customers.email = 'john.doe@example.com'
AND orders.status = 'shipped';
MongoDB:
db.customers.find(
{
"email": "john.doe@example.com",
"orders.status": "shipped"
},
{
"name": 1,
"orders.$": 1
}
)
Transactions and ACID Properties
SQL Databases:
- Strong ACID guarantees
- Long-established transaction support
- Well-suited for financial applications
MongoDB:
- ACID transactions at document level by default
- Multi-document transactions available since version 4.0
- Distributed transactions across shards since version 4.2
Example of a MongoDB transaction:
const session = db.getMongo().startSession();
session.startTransaction();
try {
const accounts = session.getDatabase("bank").accounts;
// Withdraw from account A
accounts.updateOne(
{ account_id: "A" },
{ $inc: { balance: -100 } }
);
// Deposit to account B
accounts.updateOne(
{ account_id: "B" },
{ $inc: { balance: 100 } }
);
session.commitTransaction();
} catch (error) {
session.abortTransaction();
throw error;
} finally {
session.endSession();
}
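A rough Python equivalent with PyMongo (a sketch, not the only pattern; transactions require a replica set, and the connection string here is illustrative):
from pymongo import MongoClient
client = MongoClient('mongodb://localhost:27017/?replicaSet=rs0')
accounts = client.bank.accounts
with client.start_session() as session:
    # start_transaction() commits on clean exit and aborts on exception
    with session.start_transaction():
        accounts.update_one({"account_id": "A"}, {"$inc": {"balance": -100}}, session=session)
        accounts.update_one({"account_id": "B"}, {"$inc": {"balance": 100}}, session=session)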
Scaling Approach
SQL Databases:
- Traditionally scale vertically (bigger machines)
- Replication for high availability
- Partitioning/sharding often complex to set up
MongoDB:
- Built for horizontal scaling (more machines)
- Native sharding capabilities
- Auto-balancing of data across shards
- Replica sets for high availability
Entity Relationships in MongoDB
Unlike relational databases that use tables and foreign keys to model relationships, MongoDB uses two main strategies to represent relationships between entities: embedding and referencing. Understanding when to use each approach is crucial for effective MongoDB schema design.
One-to-One Relationships
In a one-to-one relationship, one document in a collection is related to exactly one document in the same or another collection.
Embedded One-to-One Relationship
For one-to-one relationships, embedding is often the most efficient approach:
// User document with embedded profile (1:1)
{
"_id": ObjectId("60a6e3e89f1c6a8d556884b2"),
"username": "johndoe",
"email": "john@example.com",
"profile": {
"first_name": "John",
"last_name": "Doe",
"date_of_birth": ISODate("1990-01-15"),
"address": {
"street": "123 Main St",
"city": "New York",
"state": "NY",
"zip": "10001"
},
"phone": "+1-555-123-4567"
}
}
Python implementation with Pydantic:
from pydantic import BaseModel, Field
from typing import Optional
from datetime import date
class Address(BaseModel):
street: str
city: str
state: str
zip: str
class Profile(BaseModel):
first_name: str
last_name: str
date_of_birth: date
address: Address
phone: Optional[str] = None
class User(BaseModel):
username: str
email: str
profile: Profile
Referenced One-to-One Relationship
In some cases, referencing is better for one-to-one relationships:
// User document
{
"_id": ObjectId("60a6e3e89f1c6a8d556884b2"),
"username": "johndoe",
"email": "john@example.com"
}
// Profile document
{
"_id": ObjectId("60a6e3e89f1c6a8d556884c3"),
"user_id": ObjectId("60a6e3e89f1c6a8d556884b2"), // Reference to user
"first_name": "John",
"last_name": "Doe",
"date_of_birth": ISODate("1990-01-15"),
"address": {
"street": "123 Main St",
"city": "New York",
"state": "NY",
"zip": "10001"
},
"phone": "+1-555-123-4567"
}
When to use references for one-to-one:
- When the embedded document is large and rarely accessed
- When the embedded document changes frequently
- When the embedded document needs to be accessed separately
Python implementation with PyMongo:
# Create user and profile with reference
user_id = db.users.insert_one({
"username": "johndoe",
"email": "john@example.com"
}).inserted_id
profile = {
"user_id": user_id,
"first_name": "John",
"last_name": "Doe",
"date_of_birth": datetime(1990, 1, 15),
"address": {
"street": "123 Main St",
"city": "New York",
"state": "NY",
"zip": "10001"
},
"phone": "+1-555-123-4567"
}
db.profiles.insert_one(profile)
# Retrieve user with profile
user = db.users.find_one({"username": "johndoe"})
user_profile = db.profiles.find_one({"user_id": user["_id"]})
One-to-Many Relationships
In a one-to-many relationship, one document in a collection is related to multiple documents in another collection.
Embedded One-to-Many Relationship (Array of Embedded Documents)
When the "many" side is relatively small and stable:
// Product document with embedded reviews (1:Many)
{
"_id": ObjectId("60a6e3e89f1c6a8d556884b2"),
"name": "Smartphone X",
"price": 999.99,
"category": "electronics",
"reviews": [
{
"user_id": ObjectId("60a6e3e89f1c6a8d556884c1"),
"username": "user123",
"rating": 5,
"text": "Great product!",
"date": ISODate("2021-05-20T15:30:00Z")
},
{
"user_id": ObjectId("60a6e3e89f1c6a8d556884c2"),
"username": "user456",
"rating": 4,
"text": "Good but expensive",
"date": ISODate("2021-05-25T10:15:00Z")
}
]
}
Python implementation with Pydantic:
from pydantic import BaseModel, Field
from typing import List, Optional
from datetime import datetime
from bson import ObjectId
class PyObjectId(ObjectId):
@classmethod
def __get_validators__(cls):
yield cls.validate
@classmethod
def validate(cls, v):
if not ObjectId.is_valid(v):
raise ValueError("Invalid ObjectId")
return ObjectId(v)
class Review(BaseModel):
user_id: PyObjectId
username: str
rating: int
text: str
date: datetime = Field(default_factory=datetime.now)
class Config:
arbitrary_types_allowed = True
json_encoders = {ObjectId: str}
class Product(BaseModel):
name: str
price: float
category: str
reviews: List[Review] = []
class Config:
arbitrary_types_allowed = True
Referenced One-to-Many Relationship (Child References)
When the "many" side is large or frequently changing:
// Blog post document (parent)
{
"_id": ObjectId("60a6e3e89f1c6a8d556884b2"),
"title": "Introduction to MongoDB",
"content": "MongoDB is a document database...",
"author": "John Doe",
"date": ISODate("2021-05-20T15:30:00Z")
}
// Comment documents (children)
{
"_id": ObjectId("60a6e3e89f1c6a8d556884c1"),
"post_id": ObjectId("60a6e3e89f1c6a8d556884b2"), // Reference to post
"user": "Alice",
"text": "Great article!",
"date": ISODate("2021-05-20T16:30:00Z")
}
{
"_id": ObjectId("60a6e3e89f1c6a8d556884c2"),
"post_id": ObjectId("60a6e3e89f1c6a8d556884b2"), // Reference to post
"user": "Bob",
"text": "Thanks for sharing.",
"date": ISODate("2021-05-21T10:15:00Z")
}
Python implementation with PyMongo:
# Create blog post
post_id = db.posts.insert_one({
"title": "Introduction to MongoDB",
"content": "MongoDB is a document database...",
"author": "John Doe",
"date": datetime.now()
}).inserted_id
# Add comments referencing the post
comments = [
{
"post_id": post_id,
"user": "Alice",
"text": "Great article!",
"date": datetime.now()
},
{
"post_id": post_id,
"user": "Bob",
"text": "Thanks for sharing.",
"date": datetime.now() + timedelta(hours=1)
}
]
db.comments.insert_many(comments)
# Retrieve post with comments
post = db.posts.find_one({"_id": post_id})
post_comments = list(db.comments.find({"post_id": post_id}).sort("date", 1))
Referenced One-to-Many Relationship (Parent Reference)
Another approach for one-to-many is to have children reference their parent:
// Department document (one)
{
"_id": ObjectId("60a6e3e89f1c6a8d556884b2"),
"name": "Engineering",
"location": "Building A"
}
// Employee documents (many) with parent reference
{
"_id": ObjectId("60a6e3e89f1c6a8d556884c1"),
"name": "John Doe",
"position": "Software Engineer",
"department_id": ObjectId("60a6e3e89f1c6a8d556884b2") // Reference to department
}
{
"_id": ObjectId("60a6e3e89f1c6a8d556884c2"),
"name": "Jane Smith",
"position": "QA Engineer",
"department_id": ObjectId("60a6e3e89f1c6a8d556884b2") // Reference to department
}
Python implementation with PyMongo:
# Create department
dept_id = db.departments.insert_one({
"name": "Engineering",
"location": "Building A"
}).inserted_id
# Create employees with department reference
employees = [
{
"name": "John Doe",
"position": "Software Engineer",
"department_id": dept_id
},
{
"name": "Jane Smith",
"position": "QA Engineer",
"department_id": dept_id
}
]
db.employees.insert_many(employees)
# Find all employees in a department
dept_employees = list(db.employees.find({"department_id": dept_id}))
Many-to-Many Relationships
In a many-to-many relationship, documents in both collections can be related to multiple documents in the other collection.
Embedded Many-to-Many Relationship
For many-to-many relationships with limited size:
// Student document with embedded courses
{
"_id": ObjectId("60a6e3e89f1c6a8d556884b2"),
"name": "John Doe",
"courses": [
{
"course_id": ObjectId("60a6e3e89f1c6a8d556884c1"),
"name": "Introduction to MongoDB",
"instructor": "Prof. Smith",
"enrolled_date": ISODate("2021-01-15")
},
{
"course_id": ObjectId("60a6e3e89f1c6a8d556884c2"),
"name": "Web Development",
"instructor": "Prof. Johnson",
"enrolled_date": ISODate("2021-02-10")
}
]
}
// Course document with embedded students
{
"_id": ObjectId("60a6e3e89f1c6a8d556884c1"),
"name": "Introduction to MongoDB",
"instructor": "Prof. Smith",
"students": [
{
"student_id": ObjectId("60a6e3e89f1c6a8d556884b2"),
"name": "John Doe",
"enrolled_date": ISODate("2021-01-15")
},
{
"student_id": ObjectId("60a6e3e89f1c6a8d556884b3"),
"name": "Jane Smith",
"enrolled_date": ISODate("2021-01-20")
}
]
}
Note: This approach duplicates data and can be difficult to maintain as both sides need to be updated when changes occur.
Referenced Many-to-Many Relationship
A better approach is often to use a separate collection to model the relationship:
// Student documents
{
"_id": ObjectId("60a6e3e89f1c6a8d556884b2"),
"name": "John Doe",
"email": "john@example.com"
}
{
"_id": ObjectId("60a6e3e89f1c6a8d556884b3"),
"name": "Jane Smith",
"email": "jane@example.com"
}
// Course documents
{
"_id": ObjectId("60a6e3e89f1c6a8d556884c1"),
"name": "Introduction to MongoDB",
"instructor": "Prof. Smith"
}
{
"_id": ObjectId("60a6e3e89f1c6a8d556884c2"),
"name": "Web Development",
"instructor": "Prof. Johnson"
}
// Enrollments collection (junction/join collection)
{
"_id": ObjectId("60a6e3e89f1c6a8d556884d1"),
"student_id": ObjectId("60a6e3e89f1c6a8d556884b2"),
"course_id": ObjectId("60a6e3e89f1c6a8d556884c1"),
"enrolled_date": ISODate("2021-01-15"),
"grade": "A"
}
{
"_id": ObjectId("60a6e3e89f1c6a8d556884d2"),
"student_id": ObjectId("60a6e3e89f1c6a8d556884b2"),
"course_id": ObjectId("60a6e3e89f1c6a8d556884c2"),
"enrolled_date": ISODate("2021-02-10"),
"grade": "B+"
}
{
"_id": ObjectId("60a6e3e89f1c6a8d556884d3"),
"student_id": ObjectId("60a6e3e89f1c6a8d556884b3"),
"course_id": ObjectId("60a6e3e89f1c6a8d556884c1"),
"enrolled_date": ISODate("2021-01-20"),
"grade": "A-"
}
Python implementation with PyMongo:
# Create students
student1_id = db.students.insert_one({
"name": "John Doe",
"email": "john@example.com"
}).inserted_id
student2_id = db.students.insert_one({
"name": "Jane Smith",
"email": "jane@example.com"
}).inserted_id
# Create courses
course1_id = db.courses.insert_one({
"name": "Introduction to MongoDB",
"instructor": "Prof. Smith"
}).inserted_id
course2_id = db.courses.insert_one({
"name": "Web Development",
"instructor": "Prof. Johnson"
}).inserted_id
# Create enrollments
enrollments = [
{
"student_id": student1_id,
"course_id": course1_id,
"enrolled_date": datetime(2021, 1, 15),
"grade": "A"
},
{
"student_id": student1_id,
"course_id": course2_id,
"enrolled_date": datetime(2021, 2, 10),
"grade": "B+"
},
{
"student_id": student2_id,
"course_id": course1_id,
"enrolled_date": datetime(2021, 1, 20),
"grade": "A-"
}
]
db.enrollments.insert_many(enrollments)
# Find all courses for a student
def get_student_courses(student_id):
# Get all enrollments for the student
enrollments = list(db.enrollments.find({"student_id": student_id}))
# Get the course details for each enrollment
courses = []
for enrollment in enrollments:
course = db.courses.find_one({"_id": enrollment["course_id"]})
# Add enrollment details to the course
course["enrolled_date"] = enrollment["enrolled_date"]
course["grade"] = enrollment["grade"]
courses.append(course)
return courses
# Find all students in a course
def get_course_students(course_id):
# Get all enrollments for the course
enrollments = list(db.enrollments.find({"course_id": course_id}))
# Get the student details for each enrollment
students = []
for enrollment in enrollments:
student = db.students.find_one({"_id": enrollment["student_id"]})
# Add enrollment details to the student
student["enrolled_date"] = enrollment["enrolled_date"]
student["grade"] = enrollment["grade"]
students.append(student)
return students
Self-Referencing Relationships
Self-referencing relationships occur when documents in a collection reference other documents in the same collection.
Tree Structure (Hierarchical Data)
For representing hierarchical data like categories:
// Category documents with parent references
{
"_id": ObjectId("60a6e3e89f1c6a8d556884b2"),
"name": "Electronics",
"parent_id": null // Root category
}
{
"_id": ObjectId("60a6e3e89f1c6a8d556884c1"),
"name": "Computers",
"parent_id": ObjectId("60a6e3e89f1c6a8d556884b2") // Child of Electronics
}
{
"_id": ObjectId("60a6e3e89f1c6a8d556884c2"),
"name": "Laptops",
"parent_id": ObjectId("60a6e3e89f1c6a8d556884c1") // Child of Computers
}
Python implementation to get the full path:
def get_category_path(category_id):
path = []
current_id = category_id
while current_id is not None:
category = db.categories.find_one({"_id": current_id})
if category is None:
break
path.insert(0, category["name"]) # Add to beginning of path
current_id = category["parent_id"]
return " > ".join(path)
# Example: "Electronics > Computers > Laptops"
Graph Structure (Network)
For representing graph-like data such as social networks:
// User documents with friend references
{
"_id": ObjectId("60a6e3e89f1c6a8d556884b2"),
"name": "John Doe",
"friends": [
ObjectId("60a6e3e89f1c6a8d556884c1"),
ObjectId("60a6e3e89f1c6a8d556884c2")
]
}
{
"_id": ObjectId("60a6e3e89f1c6a8d556884c1"),
"name": "Jane Smith",
"friends": [
ObjectId("60a6e3e89f1c6a8d556884b2"),
ObjectId("60a6e3e89f1c6a8d556884c2")
]
}
{
"_id": ObjectId("60a6e3e89f1c6a8d556884c2"),
"name": "Bob Johnson",
"friends": [
ObjectId("60a6e3e89f1c6a8d556884b2"),
ObjectId("60a6e3e89f1c6a8d556884c1")
]
}
Python implementation to find mutual friends:
def get_mutual_friends(user1_id, user2_id):
user1 = db.users.find_one({"_id": user1_id})
user2 = db.users.find_one({"_id": user2_id})
if not user1 or not user2:
return []
# Find the intersection of friend lists
mutual_friend_ids = set(user1["friends"]) & set(user2["friends"])
# Get the details of mutual friends
mutual_friends = list(db.users.find({"_id": {"$in": list(mutual_friend_ids)}}))
return mutual_friends
Choosing Between Embedding and Referencing
When deciding whether to embed or reference related data, consider these factors:
Advantages of Embedding
- Performance: Embedded documents are retrieved in a single query
- Atomicity: All related data is updated in a single operation
- Consistency: Related data is always in sync
Advantages of Referencing
- Document Size: Prevents documents from exceeding the 16MB limit
- Duplication: Avoids data duplication
- Flexibility: Allows independent access and updates to related data
- Complex Relationships: Better for many-to-many relationships
Decision Criteria
Criteria | Embed | Reference
---|---|---
Relationship | One-to-one or one-to-few | One-to-many or many-to-many
Data Size | Small embedded documents | Large related documents
Access Pattern | Always accessed together | Often accessed separately
Update Frequency | Rarely changes | Frequently changes
Growth | Limited, predictable growth | Unbounded growth
Query Requirements | Simple queries on embedded data | Complex queries across collections
Hybrid Approaches
Sometimes a hybrid approach works best:
// Order document with both embedded and referenced data
{
"_id": ObjectId("60a6e3e89f1c6a8d556884b2"),
"order_number": "ORD-12345",
"date": ISODate("2021-05-20T15:30:00Z"),
"status": "shipped",
// Referenced customer (frequently accessed separately)
"customer_id": ObjectId("60a6e3e89f1c6a8d556884c1"),
// Embedded summary of customer info (frequently accessed together)
"customer_summary": {
"name": "John Doe",
"email": "john@example.com",
"shipping_address": {
"street": "123 Main St",
"city": "New York",
"state": "NY",
"zip": "10001"
}
},
// Embedded line items (always accessed with the order)
"items": [
{
"product_id": ObjectId("60a6e3e89f1c6a8d556884d1"),
"name": "Smartphone X",
"price": 999.99,
"quantity": 1
},
{
"product_id": ObjectId("60a6e3e89f1c6a8d556884d2"),
"name": "Wireless Earbuds",
"price": 199.99,
"quantity": 2
}
],
"total": 1399.97
}
This approach gives you the best of both worlds:
- The order document contains embedded items for atomic updates and single-query retrieval
- It references the full customer document for detailed information
- It includes a customer summary to avoid an extra query for common operations
MongoDB Under the Hood
Storage Engine
The storage engine is responsible for managing how data is stored on disk and in memory. MongoDB's default storage engine is WiredTiger (since version 3.2), which offers:
- Document-Level Concurrency: Multiple clients can modify different documents in a collection simultaneously
- Compression: Both data and indexes are compressed by default
- Journaling: Write operations are recorded in a journal for durability
- Checkpoints: Creates consistent snapshots of data files every 60 seconds by default
WiredTiger uses a B-tree data structure for storage, with pages of data cached in RAM and written to disk during checkpoints.
Other important aspects of the storage engine:
- Write Ahead Log (WAL): Ensures data durability by logging operations before they are applied
- Snapshot Isolation: Readers see a consistent snapshot of data at a point in time
- Checkpoint Process: Flushes in-memory changes to disk periodically
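Several of these internals can be observed from Python through the serverStatus command (a minimal sketch; the metric names below are WiredTiger's and may vary by version):
from pymongo import MongoClient
client = MongoClient('mongodb://localhost:27017/')
status = client.admin.command("serverStatus")
print(status["storageEngine"]["name"])  # e.g. 'wiredTiger'
cache = status["wiredTiger"]["cache"]
print(cache["bytes currently in the cache"])  # current cache usage
print(cache["maximum bytes configured"])      # configured cache size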
Indexing
MongoDB supports several types of indexes to optimize query performance:
Single Field Index: Index on one field
db.users.createIndex({ "email": 1 }) // 1 for ascending order
Compound Index: Index on multiple fields
db.products.createIndex({ "category": 1, "price": -1 }) // -1 for descending order
Multikey Index: Automatically created when indexing an array field
db.posts.createIndex({ "tags": 1 }) // Will index each element in the tags array
Text Index: For text search capabilities
db.articles.createIndex({ "content": "text" })
Geospatial Index: For location-based queries
db.places.createIndex({ "location": "2dsphere" })
Hashed Index: For hash-based sharding
db.users.createIndex({ "_id": "hashed" })
Indexes in MongoDB are implemented as B-trees and stored separately from the collection data.
Query Optimization
MongoDB's query optimizer selects the most efficient query plan based on:
- Query Shape: The structure of the query (which fields, operators, etc.)
- Available Indexes: Which indexes could potentially be used
- Collection Statistics: Size of the collection and distribution of values
- Query Execution History: Results of previous similar queries
The query plan cache stores successful query plans to avoid repeated planning for similar queries.
To analyze query performance, MongoDB provides the explain() method:
db.users.find({ "status": "active", "age": { $gt: 21 } }).explain("executionStats")
This returns detailed information about:
- Which indexes were considered
- Which index was chosen
- Number of documents examined
- Execution time
- Stages of the query plan
Memory Management
MongoDB employs a tiered storage model:
- Working Set: Active portion of data and indexes that fits in RAM
- Disk Storage: Full dataset stored on disk
WiredTiger manages memory through:
- Cache: Sized relative to system RAM (by default, the larger of 50% of RAM minus 1 GB, or 256 MB)
- Eviction: Removing less frequently used data from cache when approaching memory limits
- Page Replacement: Algorithm to decide which pages to evict
Memory usage can be monitored with:
db.serverStatus().wiredTiger.cache
MongoDB Operations
CRUD Operations
Create
MongoDB provides several methods to insert documents:
Insert a single document:
db.users.insertOne({ name: "John Doe", email: "john.doe@example.com", age: 30 })
Insert multiple documents:
db.users.insertMany([ { name: "Jane Smith", email: "jane@example.com", age: 28 }, { name: "Bob Johnson", email: "bob@example.com", age: 35 } ])
Python equivalent with PyMongo:
from pymongo import MongoClient
client = MongoClient('mongodb://localhost:27017/')
db = client['mydatabase']
users = db['users']
# Insert one document
result = users.insert_one({
"name": "John Doe",
"email": "john.doe@example.com",
"age": 30
})
print(f"Inserted document with ID: {result.inserted_id}")
# Insert multiple documents
results = users.insert_many([
{"name": "Jane Smith", "email": "jane@example.com", "age": 28},
{"name": "Bob Johnson", "email": "bob@example.com", "age": 35}
])
print(f"Inserted {len(results.inserted_ids)} documents")
Read
MongoDB offers flexible query capabilities:
Find all documents in a collection:
db.users.find()
Find documents matching specific criteria:
db.users.find({ age: { $gt: 25 } }) // Users older than 25
Find one document:
db.users.findOne({ email: "john.doe@example.com" })
Projection (selecting specific fields):
db.users.find({ age: { $gt: 25 } }, { name: 1, email: 1, _id: 0 })
Limit results:
db.users.find().limit(10)
Skip results (for pagination):
db.users.find().skip(10).limit(10) // Second page of 10 results
Sort results:
db.users.find().sort({ age: -1 }) // Sort by age descending
Python equivalent with PyMongo:
# Find all users
all_users = list(users.find())
# Find users older than 25
older_users = list(users.find({"age": {"$gt": 25}}))
# Find one user by email
user = users.find_one({"email": "john.doe@example.com"})
# Projection
user_names = list(users.find({}, {"name": 1, "_id": 0}))
# Pagination
page_size = 10
page_num = 2
paginated_users = list(users.find().skip((page_num - 1) * page_size).limit(page_size))
# Sorting
sorted_users = list(users.find().sort("age", -1)) # -1 for descending
Update
MongoDB provides multiple ways to update documents:
Update a single document:
db.users.updateOne( { email: "john.doe@example.com" }, { $set: { age: 31, status: "active" } } )
Update multiple documents:
db.users.updateMany( { age: { $lt: 30 } }, { $set: { category: "young" } } )
Replace a document:
db.users.replaceOne( { email: "john.doe@example.com" }, { name: "John Doe", email: "john.doe@example.com", age: 31, address: { city: "New York", zip: "10001" } } )
Update operators:
- $set: Set field values
- $inc: Increment field values
- $push: Add elements to arrays
- $pull: Remove elements from arrays
- $addToSet: Add elements to arrays without duplicates
- $unset: Remove fields
Python equivalent with PyMongo:
# Update one document
users.update_one(
{"email": "john.doe@example.com"},
{"$set": {"age": 31, "status": "active"}}
)
# Update multiple documents
result = users.update_many(
{"age": {"$lt": 30}},
{"$set": {"category": "young"}}
)
print(f"Modified {result.modified_count} documents")
# Replace a document
users.replace_one(
{"email": "john.doe@example.com"},
{
"name": "John Doe",
"email": "john.doe@example.com",
"age": 31,
"address": {"city": "New York", "zip": "10001"}
}
)
# Using update operators
users.update_one(
{"email": "john.doe@example.com"},
{
"$inc": {"login_count": 1},
"$push": {"login_history": datetime.now()},
"$set": {"last_login": datetime.now()}
}
)
Delete
MongoDB provides methods to remove documents:
Delete a single document:
db.users.deleteOne({ email: "john.doe@example.com" })
Delete multiple documents:
db.users.deleteMany({ status: "inactive" })
Delete all documents in a collection:
db.users.deleteMany({})
Python equivalent with PyMongo:
# Delete one document
result = users.delete_one({"email": "john.doe@example.com"})
print(f"Deleted {result.deleted_count} document")
# Delete multiple documents
result = users.delete_many({"status": "inactive"})
print(f"Deleted {result.deleted_count} documents")
# Clear collection
users.delete_many({})
Aggregation Framework
MongoDB's Aggregation Framework provides a powerful way to process and transform data within the database. It uses a pipeline approach where documents pass through stages that modify them.
Common aggregation stages:
$match: Filter documents (similar to find)
{ $match: { status: "active" } }
$group: Group documents by a key
{ $group: { _id: "$department", totalEmployees: { $sum: 1 } } }
$sort: Sort documents
{ $sort: { age: -1 } }
$project: Reshape documents (select/compute fields)
{ $project: { name: 1, firstLetter: { $substr: ["$name", 0, 1] } } }
$unwind: Deconstruct array fields
{ $unwind: "$tags" }
$lookup: Perform a join with another collection
{ $lookup: { from: "orders", localField: "_id", foreignField: "customer_id", as: "customer_orders" } }
Example of a complex aggregation pipeline:
db.sales.aggregate([
// Stage 1: Filter for completed sales
{ $match: { status: "completed" } },
// Stage 2: Group by product and calculate revenue
{ $group: {
_id: "$product_id",
totalRevenue: { $sum: { $multiply: ["$price", "$quantity"] } },
count: { $sum: 1 }
}},
// Stage 3: Sort by revenue
{ $sort: { totalRevenue: -1 } },
// Stage 4: Limit to top 5
{ $limit: 5 },
// Stage 5: Join with products collection
{ $lookup: {
from: "products",
localField: "_id",
foreignField: "_id",
as: "product_info"
}},
// Stage 6: Reshape the output
{ $project: {
_id: 0,
product: { $arrayElemAt: ["$product_info.name", 0] },
totalRevenue: 1,
count: 1
}}
])
Python equivalent with PyMongo:
pipeline = [
# Stage 1: Filter for completed sales
{"$match": {"status": "completed"}},
# Stage 2: Group by product and calculate revenue
{"$group": {
"_id": "$product_id",
"totalRevenue": {"$sum": {"$multiply": ["$price", "$quantity"]}},
"count": {"$sum": 1}
}},
# Stage 3: Sort by revenue
{"$sort": {"totalRevenue": -1}},
# Stage 4: Limit to top 5
{"$limit": 5},
# Stage 5: Join with products collection
{"$lookup": {
"from": "products",
"localField": "_id",
"foreignField": "_id",
"as": "product_info"
}},
# Stage 6: Reshape the output
{"$project": {
"_id": 0,
"product": {"$arrayElemAt": ["$product_info.name", 0]},
"totalRevenue": 1,
"count": 1
}}
]
top_products = list(db.sales.aggregate(pipeline))
Text Search
MongoDB provides text search capabilities for string content:
Create a text index:
db.articles.createIndex({ title: "text", content: "text" })
Perform a text search:
db.articles.find({ $text: { $search: "mongodb database" } })
Sort by relevance score:
db.articles.find( { $text: { $search: "mongodb database" } }, { score: { $meta: "textScore" } } ).sort({ score: { $meta: "textScore" } })
Python equivalent with PyMongo:
# Create text index
db.articles.create_index([("title", "text"), ("content", "text")])
# Perform text search
results = list(db.articles.find({"$text": {"$search": "mongodb database"}}))
# Sort by relevance score
results = list(db.articles.find(
{"$text": {"$search": "mongodb database"}},
{"score": {"$meta": "textScore"}}
).sort([("score", {"$meta": "textScore"})]))
Geospatial Queries
MongoDB supports geospatial queries for location-based applications:
Create a geospatial index:
db.places.createIndex({ location: "2dsphere" })
Store location data using GeoJSON:
db.places.insertOne({
  name: "Central Park",
  location: {
    type: "Point",
    coordinates: [-73.97, 40.77] // [longitude, latitude]
  }
})
Find places near a point:
db.places.find({
  location: {
    $near: {
      $geometry: { type: "Point", coordinates: [-73.98, 40.76] },
      $maxDistance: 1000 // in meters
    }
  }
})
Find places within a polygon:
db.places.find({ location: { $geoWithin: { $geometry: { type: "Polygon", coordinates: [[ [-74.0, 40.7], [-74.0, 40.8], [-73.9, 40.8], [-73.9, 40.7], [-74.0, 40.7] ]] } } } })
Python equivalent with PyMongo:
# Create geospatial index
db.places.create_index([("location", "2dsphere")])
# Insert a place with location
db.places.insert_one({
"name": "Central Park",
"location": {
"type": "Point",
"coordinates": [-73.97, 40.77]
}
})
# Find places near a point
nearby_places = list(db.places.find({
"location": {
"$near": {
"$geometry": {
"type": "Point",
"coordinates": [-73.98, 40.76]
},
"$maxDistance": 1000
}
}
}))
MongoDB with Python
PyMongo Basics
PyMongo is the official MongoDB driver for Python:
from pymongo import MongoClient
from bson.objectid import ObjectId
# Connect to MongoDB
client = MongoClient('mongodb://localhost:27017/')
# or with authentication:
# client = MongoClient('mongodb://username:password@localhost:27017/')
# Access a database
db = client['mydatabase']
# Access a collection
collection = db['mycollection']
# Insert a document
result = collection.insert_one({
'name': 'John Doe',
'email': 'john@example.com'
})
print(f"Inserted document with ID: {result.inserted_id}")
# Find documents
documents = collection.find({'name': 'John Doe'})
for doc in documents:
print(doc)
# Find by ID
document = collection.find_one({'_id': ObjectId('60a6e3e89f1c6a8d556884b2')})
# Update a document
result = collection.update_one(
{'email': 'john@example.com'},
{'$set': {'name': 'John Smith'}}
)
print(f"Modified {result.modified_count} document(s)")
# Delete a document
result = collection.delete_one({'email': 'john@example.com'})
print(f"Deleted {result.deleted_count} document(s)")
# Close the connection
client.close()
Motor for Async Operations
Motor is the asynchronous MongoDB driver for Python, perfect for use with async frameworks like FastAPI:
import asyncio
from motor.motor_asyncio import AsyncIOMotorClient
async def main():
# Connect to MongoDB
client = AsyncIOMotorClient('mongodb://localhost:27017/')
db = client['mydatabase']
collection = db['mycollection']
# Insert a document
result = await collection.insert_one({
'name': 'John Doe',
'email': 'john@example.com'
})
print(f"Inserted document with ID: {result.inserted_id}")
# Find documents
async for document in collection.find({'name': 'John Doe'}):
print(document)
# Find one document
document = await collection.find_one({'email': 'john@example.com'})
print(document)
# Close the connection
client.close()
# Run the async function
asyncio.run(main())
Pydantic Integration
Pydantic provides data validation and settings management using Python type annotations. It integrates well with MongoDB for schema validation:
from pydantic import BaseModel, Field
from typing import List, Optional
from datetime import datetime
from bson import ObjectId
from pymongo import MongoClient
# Custom type for handling ObjectId
class PyObjectId(ObjectId):
@classmethod
def __get_validators__(cls):
yield cls.validate
@classmethod
def validate(cls, v):
if not ObjectId.is_valid(v):
raise ValueError("Invalid ObjectId")
return ObjectId(v)
@classmethod
def __modify_schema__(cls, field_schema):
field_schema.update(type="string")
# Pydantic model for User
class User(BaseModel):
id: Optional[PyObjectId] = Field(default_factory=PyObjectId, alias="_id")
name: str
email: str
age: int
is_active: bool = True
created_at: datetime = Field(default_factory=datetime.now)
tags: List[str] = []
class Config:
allow_population_by_field_name = True
arbitrary_types_allowed = True
json_encoders = {
ObjectId: str,
datetime: lambda dt: dt.isoformat()
}
# Example usage with MongoDB and Pydantic
client = MongoClient('mongodb://localhost:27017/')
db = client['mydatabase']
collection = db['users']
# Create a user from Pydantic model
user_data = {
"name": "John Doe",
"email": "john@example.com",
"age": 30,
"tags": ["developer", "python"]
}
user = User(**user_data)
result = collection.insert_one(user.dict(by_alias=True))
print(f"Inserted user with ID: {result.inserted_id}")
# Retrieve and validate from MongoDB
user_from_db = collection.find_one({"email": "john@example.com"})
validated_user = User(**user_from_db)
print(validated_user.json())
# Update using Pydantic model
user_update = User(**user_from_db)
user_update.age = 31
user_update.tags.append("mongodb")
collection.update_one(
{"_id": user_update.id},
{"$set": user_update.dict(by_alias=True, exclude={"id"})}
)
With FastAPI and Motor (async):
from fastapi import FastAPI, HTTPException, status
from motor.motor_asyncio import AsyncIOMotorClient
from pydantic import BaseModel, Field, EmailStr
from typing import List, Optional
from datetime import datetime
from bson import ObjectId
app = FastAPI()
# Database connection
client = AsyncIOMotorClient('mongodb://localhost:27017')
db = client.mydatabase
# Pydantic models
class PyObjectId(ObjectId):
@classmethod
def __get_validators__(cls):
yield cls.validate
@classmethod
def validate(cls, v):
if not ObjectId.is_valid(v):
raise ValueError("Invalid ObjectId")
return ObjectId(v)
@classmethod
def __modify_schema__(cls, field_schema):
field_schema.update(type="string")
class UserBase(BaseModel):
name: str
email: EmailStr
age: int
tags: List[str] = []
is_active: bool = True
class UserCreate(UserBase):
pass
class UserDB(UserBase):
id: PyObjectId = Field(default_factory=PyObjectId, alias="_id")
created_at: datetime = Field(default_factory=datetime.now)
class Config:
allow_population_by_field_name = True
arbitrary_types_allowed = True
json_encoders = {
ObjectId: str,
datetime: lambda dt: dt.isoformat()
}
# FastAPI routes
@app.post("/users/", response_model=UserDB, status_code=status.HTTP_201_CREATED)
async def create_user(user: UserCreate):
user_dict = user.dict()
user_dict["created_at"] = datetime.now()
result = await db.users.insert_one(user_dict)
created_user = await db.users.find_one({"_id": result.inserted_id})
return created_user
@app.get("/users/{user_id}", response_model=UserDB)
async def get_user(user_id: str):
if not ObjectId.is_valid(user_id):
raise HTTPException(status_code=400, detail="Invalid user ID format")
user = await db.users.find_one({"_id": ObjectId(user_id)})
if user is None:
raise HTTPException(status_code=404, detail="User not found")
return user
@app.get("/users/", response_model=List[UserDB])
async def list_users(limit: int = 10, skip: int = 0):
users = await db.users.find().skip(skip).limit(limit).to_list(length=limit)
return users
MongoDB Deployment
Running MongoDB in Docker
Docker provides an easy way to deploy MongoDB:
Basic MongoDB container:
docker run -d --name mongodb \
-p 27017:27017 \
-e MONGO_INITDB_ROOT_USERNAME=admin \
-e MONGO_INITDB_ROOT_PASSWORD=password \
-v mongodb_data:/data/db \
mongo:latest
Using Docker Compose:
# docker-compose.yml
version: '3.8'
services:
mongodb:
image: mongo:latest
container_name: mongodb
restart: always
ports:
- "27017:27017"
environment:
MONGO_INITDB_ROOT_USERNAME: admin
MONGO_INITDB_ROOT_PASSWORD: password
volumes:
- mongodb_data:/data/db
- ./mongo-init.js:/docker-entrypoint-initdb.d/mongo-init.js:ro
mongo-express:
image: mongo-express:latest
container_name: mongo-express
restart: always
ports:
- "8081:8081"
environment:
ME_CONFIG_MONGODB_ADMINUSERNAME: admin
ME_CONFIG_MONGODB_ADMINPASSWORD: password
ME_CONFIG_MONGODB_SERVER: mongodb
depends_on:
- mongodb
volumes:
mongodb_data:
With initialization script:
// mongo-init.js
db = db.getSiblingDB('mydatabase');
db.createUser({
user: 'myuser',
pwd: 'mypassword',
roles: [
{ role: 'readWrite', db: 'mydatabase' }
]
});
db.createCollection('users');
db.users.insertMany([
{
name: 'John Doe',
email: 'john@example.com',
age: 30
},
{
name: 'Jane Smith',
email: 'jane@example.com',
age: 28
}
]);
Running and stopping the containers:
# Start services
docker-compose up -d
# Stop services
docker-compose down
# View logs
docker-compose logs -f mongodb
MongoDB Atlas
MongoDB Atlas is a fully-managed cloud database service provided by MongoDB, Inc. It offers:
- Automated deployment across AWS, Azure, or GCP
- Automated backups and point-in-time recovery
- Auto-scaling based on workload
- Security features like encryption, VPC peering, and IP whitelisting
- Performance optimization with query profiling and suggestions
Connecting to Atlas from Python:
from pymongo import MongoClient
# Connection string from Atlas dashboard
connection_string = "mongodb+srv://username:password@cluster0.mongodb.net/mydatabase?retryWrites=true&w=majority"
client = MongoClient(connection_string)
db = client.mydatabase
collection = db.mycollection
# Test connection
result = collection.find_one()
print(result)
Self-hosted Deployment
For production self-hosted deployments, MongoDB is typically run as a replica set or sharded cluster:
Replica Set provides redundancy and high availability:
# Start MongoDB instances
mongod --replSet myrs --dbpath /data/db1 --port 27017
mongod --replSet myrs --dbpath /data/db2 --port 27018
mongod --replSet myrs --dbpath /data/db3 --port 27019
# Configure replica set
mongo --port 27017
> rs.initiate({
_id: "myrs",
members: [
{ _id: 0, host: "mongodb0:27017" },
{ _id: 1, host: "mongodb1:27018" },
{ _id: 2, host: "mongodb2:27019" }
]
})
Sharded Cluster for horizontal scaling:
# Start config servers
mongod --configsvr --replSet configrs --dbpath /data/configdb --port 27019
# Start shard servers
mongod --shardsvr --replSet shard1rs --dbpath /data/shard1 --port 27018
# Start mongos router
mongos --configdb configrs/config1:27019,config2:27019,config3:27019 --port 27017
# Add shards via mongos
mongo --port 27017
> sh.addShard("shard1rs/shard1:27018")
> sh.enableSharding("mydatabase")
> sh.shardCollection("mydatabase.users", { "_id": "hashed" })
MongoDB Security
Authentication and Authorization
MongoDB provides role-based access control (RBAC):
Authentication Methods:
- Username/Password
- X.509 certificates
- LDAP
- Kerberos
Creating a user with specific role:
db.createUser({
user: "appUser",
pwd: "securePassword",
roles: [
{ role: "readWrite", db: "mydatabase" }
]
})
Built-in roles:
- read: Read data from any collection
- readWrite: Read and write data
- dbAdmin: Perform administrative tasks
- userAdmin: Create and modify users and roles
- clusterAdmin: Administer the whole cluster
- backup: Back up data
- restore: Restore data from backups
Creating custom roles:
db.createRole({
role: "reportingRole",
privileges: [
{
resource: { db: "mydatabase", collection: "" },
actions: [ "find" ]
}
],
roles: []
})
Network Security
Securing MongoDB network access:
- Binding to localhost only:
mongod --bind_ip 127.0.0.1
- Enabling TLS/SSL:
mongod --tlsMode requireTLS --tlsCertificateKeyFile /path/to/server.pem
- Firewall rules to restrict access:
# Allow MongoDB port only from specific IPs
ufw allow from 192.168.1.0/24 to any port 27017
- VPC/Network isolation in cloud environments
Encryption
MongoDB supports encryption at multiple levels:
- Transport Encryption (TLS/SSL) for data in transit
- Storage Encryption for data at rest:
mongod --enableEncryption --encryptionKeyFile /path/to/key
- Client-Side Field Level Encryption for sensitive fields:
const clientEncryption = new ClientEncryption(client, {
keyVaultNamespace: 'encryption.__dataKeys',
kmsProviders: {
local: {
key: localMasterKey
}
}
});
// Encrypt a field
const encryptedField = await clientEncryption.encrypt(
sensitiveData,
{
algorithm: 'AEAD_AES_256_CBC_HMAC_SHA_512-Deterministic',
keyAltName: 'myKey'
}
);
// Store encrypted data
await collection.insertOne({
name: 'John Doe',
ssn: encryptedField
});
Performance Optimization
Indexing Strategies
Effective indexing is crucial for MongoDB performance:
- Single-field indexes for frequently queried fields:
db.users.createIndex({ "email": 1 })
- Compound indexes for multi-field queries:
db.products.createIndex({ "category": 1, "price": -1 })
Index properties:
Unique: Enforce field uniqueness
db.users.createIndex({ "email": 1 }, { unique: true })
Sparse: Only index documents with the field present
db.users.createIndex({ "optional_field": 1 }, { sparse: true })
TTL (Time-To-Live): Automatically expire documents
db.sessions.createIndex({ "last_activity": 1 }, { expireAfterSeconds: 3600 })
Partial: Only index documents matching a filter
db.orders.createIndex( { "status": 1 }, { partialFilterExpression: { "status": "active" } })
Index usage analysis:
db.users.find({ "age": { $gt: 25 } }).explain("executionStats")
- Identifying missing indexes:
db.currentOp(
{
"op" : "query",
"microsecs_running" : { $gt: 100000 }
}
)
Query Optimization Techniques
- Query profiling to identify slow queries:
// Enable profiler
db.setProfilingLevel(1, { slowms: 100 })
// View slow queries
db.system.profile.find().sort({ ts: -1 }).limit(10)
- Covered queries that are satisfied entirely by an index:
// Create an index on both fields
db.users.createIndex({ "email": 1, "name": 1 })
// Query that uses only indexed fields
db.users.find(
{ "email": "john@example.com" },
{ "_id": 0, "email": 1, "name": 1 }
)
- Projection to retrieve only needed fields:
db.products.find({}, { name: 1, price: 1, _id: 0 })
- Limit results to reduce data transfer:
db.logs.find().sort({ timestamp: -1 }).limit(100)
- Avoid negation operators when possible:
// Avoid this (can't use indexes effectively)
db.users.find({ status: { $ne: "inactive" } })
// Better approach
db.users.find({ status: { $in: ["active", "pending"] } })
Performance Monitoring
- Server status metrics:
db.serverStatus()
- Database statistics:
db.stats()
- Collection statistics:
db.users.stats()
- Index usage statistics:
db.users.aggregate([
{ $indexStats: {} }
])
- Monitoring tools:
- MongoDB Compass
- MongoDB Cloud Manager
- Prometheus with MongoDB exporter
- Grafana dashboards
Real-world Use Cases
Content Management Systems
MongoDB is well-suited for content management systems:
- Flexible schema for different content types
- Rich querying for content filtering
- Embedded documents for comments and related content
Example CMS document:
{
"_id": ObjectId("60a6e3e89f1c6a8d556884b2"),
"title": "Getting Started with MongoDB",
"slug": "getting-started-with-mongodb",
"content": "MongoDB is a document database...",
"author": {
"name": "John Doe",
"email": "john@example.com"
},
"tags": ["mongodb", "nosql", "database"],
"status": "published",
"created_at": ISODate("2021-05-20T15:30:00Z"),
"updated_at": ISODate("2021-05-25T10:15:00Z"),
"comments": [
{
"user": "Jane Smith",
"text": "Great article!",
"created_at": ISODate("2021-05-21T08:45:00Z")
}
],
"metadata": {
"featured": true,
"view_count": 1250,
"rating": 4.7
}
}
Real-time Analytics
MongoDB excels for real-time analytics applications:
- Time-series data collection
- Aggregation pipeline for complex analytics
- Sharding for handling large data volumes
Example analytics pipeline:
db.page_views.aggregate([
// Match events from the last 24 hours
{
$match: {
timestamp: {
$gte: new Date(Date.now() - 24 * 60 * 60 * 1000)
}
}
},
// Group by page and calculate stats
{
$group: {
_id: "$page",
views: { $sum: 1 },
unique_users: { $addToSet: "$user_id" },
avg_duration: { $avg: "$duration" }
}
},
// Calculate number of unique users
{
$addFields: {
unique_users: { $size: "$unique_users" }
}
},
// Sort by most viewed
{
$sort: { views: -1 }
},
// Limit to top 10
{
$limit: 10
}
])
IoT Applications
MongoDB is popular for Internet of Things (IoT) applications:
- High write throughput for sensor data
- Time-series collections for time-ordered data (see the sketch after the example document)
- Geospatial queries for location tracking
- TTL indexes for data expiration
Example IoT document:
{
"_id": ObjectId("60a6e3e89f1c6a8d556884b2"),
"device_id": "thermostat-1234",
"type": "temperature",
"value": 22.5,
"unit": "celsius",
"location": {
"type": "Point",
"coordinates": [-73.97, 40.77]
},
"battery": 87,
"timestamp": ISODate("2021-05-20T15:30:00Z")
}
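The time-series collections mentioned above (MongoDB 5.0+) can be created from Python; a minimal sketch with illustrative names:
from datetime import datetime
from pymongo import MongoClient
db = MongoClient('mongodb://localhost:27017/')['iot']
# Readings are bucketed internally by metaField/timeField;
# expireAfterSeconds gives TTL-style automatic expiry
db.create_collection(
    "sensor_readings",
    timeseries={"timeField": "timestamp", "metaField": "device_id", "granularity": "minutes"},
    expireAfterSeconds=30 * 24 * 3600
)
db.sensor_readings.insert_one({
    "device_id": "thermostat-1234",
    "timestamp": datetime.utcnow(),
    "value": 22.5
})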
Mobile Applications
MongoDB works well for mobile apps:
- Flexible schema for rapidly evolving app features
- Offline-first architecture with MongoDB Realm
- Change streams for real-time updates (see the sketch after the example document)
- Horizontal scaling for growing user bases
Example mobile app user document:
{
"_id": ObjectId("60a6e3e89f1c6a8d556884b2"),
"username": "johndoe",
"email": "john@example.com",
"profile": {
"name": "John Doe",
"avatar": "https://example.com/avatars/johndoe.jpg",
"bio": "MongoDB enthusiast"
},
"preferences": {
"notifications": {
"push": true,
"email": false
},
"theme": "dark"
},
"devices": [
{
"type": "android",
"token": "fcm-token-123",
"last_active": ISODate("2021-05-20T15:30:00Z")
}
],
"last_login": ISODate("2021-05-20T15:30:00Z"),
"created_at": ISODate("2021-03-15T10:20:00Z")
}
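The change streams mentioned above can be consumed from Python to push real-time updates to clients (a minimal sketch; change streams require a replica set):
from pymongo import MongoClient
db = MongoClient('mongodb://localhost:27017/?replicaSet=rs0')['mydatabase']
# Block and iterate over update events on the users collection
with db.users.watch([{"$match": {"operationType": "update"}}]) as stream:
    for change in stream:
        print(change["documentKey"], change["updateDescription"]["updatedFields"])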
Catalog Management
MongoDB is excellent for product catalogs:
- Schema flexibility for diverse product types
- Rich querying for faceted search (see the $facet sketch after the example document)
- Horizontal scaling for large catalogs
Example product document:
{
"_id": ObjectId("60a6e3e89f1c6a8d556884b2"),
"sku": "MBP-2021-14-M1",
"name": "MacBook Pro 14-inch",
"description": "Apple MacBook Pro with M1 Pro chip",
"price": 1999.99,
"category": "electronics",
"subcategory": "laptops",
"brand": "Apple",
"attributes": {
"processor": "Apple M1 Pro",
"memory": "16GB",
"storage": "512GB SSD",
"display": "14.2-inch Liquid Retina XDR",
"color": "Space Gray"
},
"images": [
{
"url": "https://example.com/images/mbp-front.jpg",
"alt": "Front view",
"is_primary": true
},
{
"url": "https://example.com/images/mbp-side.jpg",
"alt": "Side view",
"is_primary": false
}
],
"inventory": {
"in_stock": 42,
"warehouse_location": "NYC-1"
},
"metadata": {
"featured": true,
"rating": 4.8,
"reviews_count": 156
},
"created_at": ISODate("2021-10-26T10:00:00Z"),
"updated_at": ISODate("2021-11-15T14:30:00Z")
}
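The faceted search mentioned above can be expressed with the $facet aggregation stage; a hedged sketch with illustrative price boundaries:
pipeline = [
    {"$match": {"category": "electronics"}},
    {"$facet": {
        "by_brand": [{"$group": {"_id": "$brand", "count": {"$sum": 1}}}],
        "by_price": [{"$bucket": {
            "groupBy": "$price",
            "boundaries": [0, 500, 1000, 2000],
            "default": "2000+",
            "output": {"count": {"$sum": 1}}
        }}]
    }}
]
facets = list(db.products.aggregate(pipeline))  # one document with both facet arrays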
Best Practices
Schema Design Patterns
Embedded Documents Pattern:
- Embed related data in a single document for faster reads
- Best for one-to-few relationships
- Example: embedding addresses in a user document
References Pattern:
- Use references between documents for one-to-many or many-to-many relationships
- Example: referencing order IDs in a user document
Bucket Pattern:
- Group related time-series data into buckets
- Prevents having too many small documents
- Example: storing hourly metrics in a daily document
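A sketch of the bucket pattern with PyMongo, upserting hourly readings into a per-day document (field names are illustrative):
db.metrics.update_one(
    {"sensor_id": "thermostat-1234", "day": "2021-05-20"},
    {"$push": {"readings": {"hour": 15, "value": 22.5}}, "$inc": {"count": 1}},
    upsert=True  # creates the day's bucket document on first write
)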
Schema Versioning Pattern:
- Include a version field in documents
- Handle migrations gracefully
- Example:
{ "schema_version": 2, ... }
Computed Pattern:
- Store computed data to avoid expensive calculations
- Update during writes
- Example: storing count of items in a cart
Subset Pattern:
- Store a subset of fields from one collection in another
- Reduces need for joins
- Example: storing essential product info in order documents
Data Modeling Guidelines
Design for the application's queries:
- Start with the queries, then design the schema
- Denormalize when it improves read performance
Balance embedding vs. referencing:
- Embed when data is always accessed together
- Reference when data is large, accessed separately, or shared
Consider document growth:
- Allow for document growth when data will be updated
- Be cautious with unbounded arrays
Limit document size:
- Keep documents below the 16MB limit
- Split large content (like binary data) into GridFS (see the sketch after these guidelines)
Use appropriate data types:
- Use BSON types that match your needs
- Consider ObjectId for unique identifiers
Plan for indexes:
- Index fields used in query filters, sorts, and joins
- Be mindful of index size and maintenance overhead
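For the GridFS suggestion above, a minimal sketch using PyMongo's gridfs module:
import gridfs
from pymongo import MongoClient
db = MongoClient('mongodb://localhost:27017/')['mydatabase']
fs = gridfs.GridFS(db)
# Content is split into chunks in fs.chunks, with metadata in fs.files
file_id = fs.put(b"...large binary payload...", filename="manual.pdf")
data = fs.get(file_id).read()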
Operational Excellence
Monitoring and alerting:
- Monitor system metrics (CPU, memory, disk I/O)
- Monitor MongoDB metrics (operations, connections, queues)
- Set up alerting for critical thresholds
Backup strategy:
- Schedule regular backups
- Test restore processes
- Consider point-in-time recovery for critical data
Capacity planning:
- Estimate data growth
- Plan for increased load
- Provision resources accordingly
Security practices:
- Use authentication and authorization
- Encrypt data in transit and at rest
- Regularly audit access and permissions
Upgrade strategy:
- Stay current with versions
- Test upgrades in non-production environments
- Schedule maintenance windows for upgrades
Summary
MongoDB is a powerful, flexible document database that excels in scenarios requiring:
- Flexible schema for evolving data models
- Horizontal scalability for growing applications
- High write throughput for data-intensive applications
- Rich query capabilities including geospatial and text search
- Developer productivity with intuitive data models
While not suitable for every use case (especially those requiring complex transactions across multiple entities), MongoDB provides a compelling alternative to traditional relational databases for many modern application patterns.
MongoDB usage example with Python
Prehistory
I recently got really interested in the NoSQL database MongoDB, as I had no prior experience working with it. I roughly know the use cases of the technology and am able to use it in a simple way, but what it is prominent for is its support for unstructured data handling. Unlike SQL, it does not enforce a defined structure for the objects in our collections.
Here is an example of one of the use cases in Python (GenAI generated):
Here's an implementation approach using Pydantic for validation with MongoDB:
from typing import Optional, List, Literal, Union, Dict, Any
from datetime import datetime
from pydantic import BaseModel, Field, root_validator
from bson import ObjectId
from pymongo import MongoClient
# Helper for handling ObjectId in Pydantic
class PyObjectId(ObjectId):
@classmethod
def __get_validators__(cls):
yield cls.validate
@classmethod
def validate(cls, v):
if not ObjectId.is_valid(v):
raise ValueError("Invalid ObjectId")
return ObjectId(v)
@classmethod
def __modify_schema__(cls, field_schema):
field_schema.update(type="string")
# Base Product Model with common fields
class BaseProduct(BaseModel):
id: Optional[PyObjectId] = Field(default_factory=PyObjectId, alias="_id")
name: str
brand: str
price: float
stock: int
release_date: datetime
product_type: Literal["phone", "laptop", "tablet"]
description: str
images: List[str] = []
tags: List[str] = []
active: bool = True
created_at: datetime = Field(default_factory=datetime.now)
updated_at: datetime = Field(default_factory=datetime.now)
class Config:
allow_population_by_field_name = True
arbitrary_types_allowed = True
json_encoders = {
ObjectId: str,
datetime: lambda dt: dt.isoformat()
}
# Phone specific model
class PhoneProduct(BaseProduct):
product_type: Literal["phone"] = "phone"
screen_size: float
battery_capacity: int # mAh
camera_mp: float
storage_options: List[int] # GB
colors: List[str]
os: str
network: str # 4G, 5G, etc.
dimensions: Dict[str, float] # height, width, depth
weight: float # grams
# Phone-specific validation
@root_validator
def validate_phone(cls, values):
if values.get("product_type") != "phone":
raise ValueError("Product type must be 'phone'")
return values
# Laptop specific model
class LaptopProduct(BaseProduct):
product_type: Literal["laptop"] = "laptop"
screen_size: float
processor: str
ram_options: List[int] # GB
storage_options: Dict[str, List[int]] # Type (SSD/HDD) -> Sizes in GB
gpu: Optional[str] = None
battery_life: float # hours
os: str
ports: Dict[str, int] # port type -> number of ports
weight: float # kg
is_touchscreen: bool = False
# Laptop-specific validation
@root_validator
def validate_laptop(cls, values):
if values.get("product_type") != "laptop":
raise ValueError("Product type must be 'laptop'")
return values
# Tablet specific model
class TabletProduct(BaseProduct):
product_type: Literal["tablet"] = "tablet"
screen_size: float
battery_capacity: int # mAh
storage_options: List[int] # GB
processor: str
ram: int # GB
camera_mp: Dict[str, float] # front, back
connectivity: List[str] # wifi, cellular, etc.
os: str
pen_support: bool = False
# Tablet-specific validation
@root_validator
def validate_tablet(cls, values):
if values.get("product_type") != "tablet":
raise ValueError("Product type must be 'tablet'")
return values
# Union type for working with any product type
ProductType = Union[PhoneProduct, LaptopProduct, TabletProduct]
# MongoDB Repository for Products
class ProductRepository:
def __init__(self, connection_string: str, db_name: str):
self.client = MongoClient(connection_string)
self.db = self.client[db_name]
self.collection = self.db.products
# Create product of any type
def create_product(self, product: ProductType):
product_dict = product.dict(by_alias=True)
result = self.collection.insert_one(product_dict)
return str(result.inserted_id)
# Get product by ID
    def get_product(self, product_id: str) -> Optional[ProductType]:
product = self.collection.find_one({"_id": ObjectId(product_id)})
if not product:
return None
# Determine product type and return appropriate model
if product["product_type"] == "phone":
return PhoneProduct(**product)
elif product["product_type"] == "laptop":
return LaptopProduct(**product)
elif product["product_type"] == "tablet":
return TabletProduct(**product)
# Get all products of a specific type
def get_products_by_type(self, product_type: str) -> List[ProductType]:
products = self.collection.find({"product_type": product_type})
result = []
for product in products:
if product_type == "phone":
result.append(PhoneProduct(**product))
elif product_type == "laptop":
result.append(LaptopProduct(**product))
elif product_type == "tablet":
result.append(TabletProduct(**product))
return result
# Update product
def update_product(self, product_id: str, updated_product: ProductType):
# Update the updated_at field
product_dict = updated_product.dict(by_alias=True, exclude_unset=True)
product_dict["updated_at"] = datetime.now()
result = self.collection.update_one(
{"_id": ObjectId(product_id)},
{"$set": product_dict}
)
return result.modified_count > 0
# Delete product
def delete_product(self, product_id: str):
result = self.collection.delete_one({"_id": ObjectId(product_id)})
return result.deleted_count > 0
# Example usage
if __name__ == "__main__":
# Create repository
repo = ProductRepository("mongodb://localhost:27017", "electronics_store")
# Create a phone product
phone = PhoneProduct(
name="Galaxy S22",
brand="Samsung",
price=899.99,
stock=100,
release_date=datetime(2022, 2, 25),
description="Samsung's flagship phone for 2022",
tags=["smartphone", "android", "samsung", "flagship"],
screen_size=6.1,
battery_capacity=3700,
camera_mp=50.0,
storage_options=[128, 256],
colors=["Phantom Black", "Phantom White", "Green", "Pink Gold"],
os="Android 12",
network="5G",
dimensions={"height": 146.0, "width": 70.6, "depth": 7.6},
weight=167.0
)
# Insert phone
phone_id = repo.create_product(phone)
print(f"Created phone with ID: {phone_id}")
# Create a laptop product
laptop = LaptopProduct(
name="MacBook Pro 14",
brand="Apple",
price=1999.99,
stock=50,
release_date=datetime(2021, 10, 26),
description="Apple's professional laptop with M1 Pro chip",
tags=["laptop", "macbook", "apple", "professional"],
screen_size=14.2,
processor="Apple M1 Pro",
ram_options=[16, 32, 64],
storage_options={"SSD": [512, 1024, 2048, 4096]},
gpu="M1 Pro 16-core",
battery_life=17.0,
os="macOS Monterey",
ports={"Thunderbolt 4": 3, "HDMI": 1, "SD Card": 1, "MagSafe": 1},
weight=1.6
)
# Insert laptop
laptop_id = repo.create_product(laptop)
print(f"Created laptop with ID: {laptop_id}")
# Get all phones
phones = repo.get_products_by_type("phone")
print(f"Found {len(phones)} phones")
This implementation strikes a good balance between structure and the flexibility that MongoDB offers. Here are the key aspects of this approach:
1. Base Structure with Common Fields
- The BaseProduct class contains all common fields shared across product types
- Core fields like ID, name, price, brand, etc. are consistent
2. Type-Specific Models
- Each product type (phone, laptop, tablet) extends the base model
- Type-specific fields are only present in their respective models
- Each subclass validates its own product_type for data integrity
3. Type Discrimination
- The product_type field serves as the discriminator to identify document types
- This allows you to query all products or filter by specific types (see the discriminated-union sketch after this list)
4. Repository Pattern
- The ProductRepository class encapsulates MongoDB operations
- It handles conversion between Pydantic models and MongoDB documents
- It automatically returns the right model type based on the product_type field
5. Validation Benefits
- Pydantic ensures each document has the required fields for its type
- Type hints provide excellent IDE support and documentation
- The root_validator ensures product types match their models
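As a side note, Pydantic v1.9+ can perform this type discrimination natively through a discriminated (tagged) union, which could replace the if/elif dispatch in the repository; a sketch under that assumption, reusing the models and repo from above:

from typing import Union
from typing_extensions import Annotated
from pydantic import Field, parse_obj_as

# Tagged union keyed on the product_type field
AnyProduct = Annotated[
    Union[PhoneProduct, LaptopProduct, TabletProduct],
    Field(discriminator="product_type"),
]

doc = repo.collection.find_one({"product_type": "phone"})
product = parse_obj_as(AnyProduct, doc)  # dispatches to PhoneProduct automatically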
This approach gives you the best of both worlds:
- The flexibility of MongoDB's schemaless design
- The safety and structure of strong type validation with Pydantic
When retrieving documents, the repository pattern intelligently converts them to the appropriate Pydantic model, ensuring you always get the right fields and validation for each product type.
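For example, constructing a model from an incomplete document fails loudly instead of letting bad data through (the missing fields below are deliberate):

from pydantic import ValidationError

try:
    PhoneProduct(name="Mystery Phone", brand="Acme", price=1.0)  # stock, release_date, etc. are missing
except ValidationError as e:
    print(e)  # lists every missing or invalid field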
Python cloud native engineer - EPAM job description
admin1 • Mar 23, 2025
Python Engineer
We seek a highly skilled Python Engineer with expertise in cloud computing and a strong focus on integrating AI capabilities into our projects. The ideal candidate will possess robust proficiency in Python and its frameworks, coupled with a deep understanding of at least one major cloud provider.
Additionally, familiarity with Large Language Models (LLM) and Retrieval Augmented Generation (RAG) is essential for seamlessly integrating AI capabilities.
Responsibilities
- Facilitate development and deployment of cloud-native solutions, highlighting AI integration in our projects
- Architect and launch AI-driven applications, leveraging Python frameworks such as Django, Flask or FastAPI
- Integrate Large Language Models (LLM) and Retrieval Augmented Generation (RAG) into ongoing and upcoming projects to enhance language understanding and generation capabilities
- Collaborate with cross-functional teams to understand project objectives and convert them into AI-driven technical solutions
- Implement AI-based features and functionalities, utilizing cloud-native architectures and industry-standard practices
- Write maintainable and well-documented code, adhering to coding standards and best practices
- Stay updated with the latest advancements in Python, cloud computing, AI, and Cloud Native architectures, and proactively suggest innovative solutions to enhance our AI capabilities
Requirements
- Proven expertise in Python programming language, with significant experience in AI integration
- Proficiency in cloud computing with hands-on experience in major cloud platforms such as AWS, Azure, or Google Cloud Platform
- Familiarity with Large Language Models (LLM) and Retrieval Augmented Generation (RAG)
- Excellent problem-solving abilities and the capability to effectively collaborate within a team setting
- Superior communication skills and the ability to explain complex technical concepts clearly to non-technical stakeholders
Nice to have
- Knowledge of Cloud Native architectures and experience with tools like Kubernetes, Docker and microservices
We offer
We connect like-minded people:
- Delivering innovative solutions to industry leaders, making a global impact
- Enjoyable working environment, whether it is the vibrant office or the comfort of your own home
- Opportunity to work abroad for up to two months per year
- Relocation opportunities within our offices in 55+ countries
- Corporate and social events
We invest in your growth:
- Leadership development, career advising, soft skills and well-being programs
- Certifications, including GCP, Azure and AWS
- Unlimited access to LinkedIn Learning, Get Abstract, O'Reilly
- Free English classes with certified teachers
- Discounts in local language schools, including offline courses for the Uzbek language
We cover it all:
- Monetary bonuses for engaging in the referral program
- Medical & family care package
- Four trust days per year (sick leave without a medical certificate)
- Discounts for fitness clubs, dance schools and sports programs
- Benefits package (sports activities, a variety of stores and services)