MongoDB: Comprehensive Guide

Apr 11, 2025

Introduction to MongoDB

MongoDB is a document-oriented, NoSQL database designed for high performance, high availability, and automatic scaling. Developed by MongoDB Inc. (formerly 10gen), it was first released in 2009 and has since become one of the most popular NoSQL databases in the world.

The name "MongoDB" comes from "humongous," reflecting its design goal to handle huge amounts of data efficiently. Unlike traditional relational databases, MongoDB stores data in flexible, JSON-like documents, which allows for variable structure among documents within the same collection.

MongoDB was designed to address several shortcomings of traditional SQL databases:

  1. Flexibility: The ability to store and process unstructured or semi-structured data
  2. Scalability: Built from the ground up to scale horizontally across multiple servers
  3. Performance: Optimized for high write throughput and query performance
  4. Developer Productivity: Intuitive data model that aligns with modern programming languages

MongoDB Architecture and Core Concepts

Documents

The fundamental unit of data in MongoDB is a document. A document is a set of key-value pairs, similar to JSON objects, but stored in a format called BSON (Binary JSON). Documents allow embedding complex structures like arrays and nested documents.

Example of a MongoDB document:

{
  "_id": ObjectId("60a6e3e89f1c6a8d556884b2"),
  "name": "John Doe",
  "email": "john.doe@example.com",
  "age": 30,
  "address": {
    "street": "123 Main St",
    "city": "Anytown",
    "state": "CA",
    "zip": "12345"
  },
  "hobbies": ["reading", "hiking", "photography"],
  "created_at": ISODate("2021-05-20T15:30:00Z")
}

Key characteristics of documents:

  • Maximum size of 16MB per document
  • Field names must be strings
  • Field values can be any BSON data type
  • Field order is preserved during insertion
  • Case-sensitive field names
  • Each document requires a unique _id field that acts as a primary key
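
A minimal PyMongo sketch of the auto-generated _id (the connection string, database, and collection names are placeholders):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
users = client["mydatabase"]["users"]

# When _id is omitted, the driver generates an ObjectId before sending the insert
result = users.insert_one({"name": "John Doe"})
print(result.inserted_id)  # e.g. ObjectId('60a6e3e89f1c6a8d556884b2')

# Documents larger than the 16MB BSON limit are rejected
# users.insert_one({"blob": "x" * 17_000_000})  # raises DocumentTooLarge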

Collections

Collections are groups of related documents, conceptually similar to tables in relational databases. However, unlike tables, collections don't enforce a schema across documents. Documents within the same collection can have different fields and structures.

For example, a users collection might contain documents representing user profiles, while an orders collection would contain documents representing customer orders.

Collections are organized within databases and follow these naming conventions:

  • Cannot be empty strings
  • Cannot contain the null character
  • Cannot begin with "system." (reserved prefix)
  • Cannot contain the $ character (reserved for certain operations)
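
Because no schema is enforced, documents of different shapes can sit side by side. A small sketch, reusing the users collection and the db handle assumed throughout this guide:

# Two differently shaped documents in the same collection
db.users.insert_many([
    {"name": "John Doe", "email": "john.doe@example.com"},
    {"name": "Jane Smith", "email": "jane@example.com",
     "address": {"city": "Anytown", "state": "CA"},
     "hobbies": ["reading"]},
])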

Databases

A MongoDB instance can host multiple databases, each containing its own collections. Databases are the highest level of data organization in MongoDB and provide isolation for collections.

Some special databases include:

  • admin: Used for administrative operations
  • local: Stores data specific to a single server
  • config: Used by sharded clusters to store configuration information

BSON Format

MongoDB stores data in BSON (Binary JSON) format, which extends the JSON model to provide additional data types and to be more efficient for storage and traversal. BSON supports the following data types:

  • String: UTF-8 encoded strings
  • Integer: 32-bit or 64-bit integers
  • Double: 64-bit IEEE 754 floating point numbers
  • Boolean: true or false
  • Array: Ordered lists of values
  • Object: Embedded documents
  • ObjectId: 12-byte identifier typically used for _id fields
  • Date: Stored as 64-bit integers representing milliseconds since the Unix epoch
  • Null: Represents a null value
  • Regular Expression: For pattern matching
  • Binary Data: For storing binary data
  • Timestamp: MongoDB internal timestamp type
  • Decimal128: IEEE 754 decimal-based floating-point number

Example of converting between JSON and BSON in Python:

import datetime
import json

from bson import ObjectId, json_util

# JSON cannot directly encode ObjectId, Date, etc.
document = {
    "_id": ObjectId("60a6e3e89f1c6a8d556884b2"),
    "name": "John Doe",
    "created_at": datetime.datetime.utcnow()
}

# Use json_util from pymongo to handle BSON types
json_str = json.dumps(document, default=json_util.default)
print(json_str)

# Convert back to Python dict with BSON types
parsed_document = json.loads(json_str, object_hook=json_util.object_hook)
print(parsed_document)

Key Differences from SQL Databases

Schema Design

SQL Databases:

  • Rigid schema defined at table creation
  • Relationships maintained through foreign keys
  • Normalization encouraged to avoid data duplication
  • Schema changes require migrations

MongoDB:

  • Flexible schema-less design
  • Documents can evolve over time
  • Embedding related data directly in documents
  • Denormalization often encouraged for performance

Example of normalized SQL tables vs. MongoDB document:

SQL Tables:

CREATE TABLE customers (
    id SERIAL PRIMARY KEY,
    name VARCHAR(100),
    email VARCHAR(100),
    phone VARCHAR(20)
);

CREATE TABLE addresses (
    id SERIAL PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(id),
    street VARCHAR(100),
    city VARCHAR(50),
    state VARCHAR(20),
    zip VARCHAR(10)
);

CREATE TABLE orders (
    id SERIAL PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(id),
    order_date TIMESTAMP,
    status VARCHAR(20)
);

MongoDB Document:

{
  "_id": ObjectId("60a6e3e89f1c6a8d556884b2"),
  "name": "John Doe",
  "email": "john.doe@example.com",
  "phone": "555-123-4567",
  "addresses": [
    {
      "type": "home",
      "street": "123 Main St",
      "city": "Anytown",
      "state": "CA",
      "zip": "12345"
    },
    {
      "type": "work",
      "street": "456 Market St",
      "city": "Anytown",
      "state": "CA",
      "zip": "12345"
    }
  ],
  "orders": [
    {
      "order_id": ObjectId("60a6e3e89f1c6a8d556884c1"),
      "order_date": ISODate("2021-05-20T15:30:00Z"),
      "status": "shipped"
    },
    {
      "order_id": ObjectId("60a6e3e89f1c6a8d556884c2"),
      "order_date": ISODate("2021-05-25T10:15:00Z"),
      "status": "processing"
    }
  ]
}

Query Language

SQL Databases:

  • Standardized SQL language
  • Joins for relating data across tables
  • Complex transactions with multi-table updates

MongoDB:

  • JSON-like query syntax
  • No traditional joins (but has $lookup aggregation)
  • Query operators to navigate nested documents and arrays

Example query comparison:

SQL:

SELECT customers.name, orders.id, orders.order_date
FROM customers
JOIN orders ON customers.id = orders.customer_id
WHERE customers.email = 'john.doe@example.com'
AND orders.status = 'shipped';

MongoDB:

db.customers.find(
  { 
    "email": "john.doe@example.com",
    "orders.status": "shipped"
  },
  {
    "name": 1,
    "orders.$": 1
  }
)
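
If orders were stored in their own collection rather than embedded, the $lookup stage mentioned above can approximate the SQL join. A sketch in Python, assuming an orders collection keyed by customer_id:

pipeline = [
    {"$match": {"email": "john.doe@example.com"}},
    {"$lookup": {                      # left outer join against orders
        "from": "orders",
        "localField": "_id",
        "foreignField": "customer_id",
        "as": "orders",
    }},
    {"$unwind": "$orders"},
    {"$match": {"orders.status": "shipped"}},
    {"$project": {"name": 1, "orders._id": 1, "orders.order_date": 1}},
]
results = list(db.customers.aggregate(pipeline))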

Transactions and ACID Properties

SQL Databases:

  • Strong ACID guarantees
  • Long-established transaction support
  • Well-suited for financial applications

MongoDB:

  • Atomic single-document operations by default (a write to one document, including its embedded data, is ACID)
  • Multi-document transactions available since version 4.0
  • Distributed transactions across shards since version 4.2

Example of a MongoDB transaction:

const session = db.getMongo().startSession();
session.startTransaction();

try {
  const accounts = session.getDatabase("bank").accounts;
  
  // Withdraw from account A
  accounts.updateOne(
    { account_id: "A" }, 
    { $inc: { balance: -100 } }
  );
  
  // Deposit to account B
  accounts.updateOne(
    { account_id: "B" }, 
    { $inc: { balance: 100 } }
  );
  
  session.commitTransaction();
} catch (error) {
  session.abortTransaction();
  throw error;
} finally {
  session.endSession();
}
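
Python equivalent with PyMongo (multi-document transactions require a replica set or sharded cluster, so the replicaSet parameter here is a placeholder for your deployment):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
accounts = client["bank"]["accounts"]

with client.start_session() as session:
    # start_transaction() commits on a clean exit and aborts on an exception
    with session.start_transaction():
        accounts.update_one({"account_id": "A"}, {"$inc": {"balance": -100}}, session=session)
        accounts.update_one({"account_id": "B"}, {"$inc": {"balance": 100}}, session=session)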

Scaling Approach

SQL Databases:

  • Traditionally scale vertically (bigger machines)
  • Replication for high availability
  • Partitioning/sharding often complex to set up

MongoDB:

  • Built for horizontal scaling (more machines)
  • Native sharding capabilities
  • Auto-balancing of data across shards
  • Replica sets for high availability

Entity Relationships in MongoDB

Unlike relational databases that use tables and foreign keys to model relationships, MongoDB uses two main strategies to represent relationships between entities: embedding and referencing. Understanding when to use each approach is crucial for effective MongoDB schema design.

One-to-One Relationships

In a one-to-one relationship, one document in a collection is related to exactly one document in the same or another collection.

Embedded One-to-One Relationship

For one-to-one relationships, embedding is often the most efficient approach:

// User document with embedded profile (1:1)
{
  "_id": ObjectId("60a6e3e89f1c6a8d556884b2"),
  "username": "johndoe",
  "email": "john@example.com",
  "profile": {
    "first_name": "John",
    "last_name": "Doe",
    "date_of_birth": ISODate("1990-01-15"),
    "address": {
      "street": "123 Main St",
      "city": "New York",
      "state": "NY",
      "zip": "10001"
    },
    "phone": "+1-555-123-4567"
  }
}

Python implementation with Pydantic:

from pydantic import BaseModel, Field
from typing import Optional
from datetime import date

class Address(BaseModel):
    street: str
    city: str
    state: str
    zip: str

class Profile(BaseModel):
    first_name: str
    last_name: str
    date_of_birth: date
    address: Address
    phone: Optional[str] = None

class User(BaseModel):
    username: str
    email: str
    profile: Profile

Referenced One-to-One Relationship

In some cases, referencing is better for one-to-one relationships:

// User document
{
  "_id": ObjectId("60a6e3e89f1c6a8d556884b2"),
  "username": "johndoe",
  "email": "john@example.com"
}

// Profile document
{
  "_id": ObjectId("60a6e3e89f1c6a8d556884c3"),
  "user_id": ObjectId("60a6e3e89f1c6a8d556884b2"),  // Reference to user
  "first_name": "John",
  "last_name": "Doe",
  "date_of_birth": ISODate("1990-01-15"),
  "address": {
    "street": "123 Main St",
    "city": "New York",
    "state": "NY",
    "zip": "10001"
  },
  "phone": "+1-555-123-4567"
}

When to use references for one-to-one:

  • When the embedded document is large and rarely accessed
  • When the embedded document changes frequently
  • When the embedded document needs to be accessed separately

Python implementation with PyMongo:

# Create user and profile with reference
user_id = db.users.insert_one({
    "username": "johndoe",
    "email": "john@example.com"
}).inserted_id

profile = {
    "user_id": user_id,
    "first_name": "John",
    "last_name": "Doe",
    "date_of_birth": datetime(1990, 1, 15),
    "address": {
        "street": "123 Main St",
        "city": "New York",
        "state": "NY",
        "zip": "10001"
    },
    "phone": "+1-555-123-4567"
}
db.profiles.insert_one(profile)

# Retrieve user with profile
user = db.users.find_one({"username": "johndoe"})
user_profile = db.profiles.find_one({"user_id": user["_id"]})

One-to-Many Relationships

In a one-to-many relationship, one document in a collection is related to multiple documents in another collection.

Embedded One-to-Many Relationship (Array of Embedded Documents)

When the "many" side is relatively small and stable:

// Product document with embedded reviews (1:Many)
{
  "_id": ObjectId("60a6e3e89f1c6a8d556884b2"),
  "name": "Smartphone X",
  "price": 999.99,
  "category": "electronics",
  "reviews": [
    {
      "user_id": ObjectId("60a6e3e89f1c6a8d556884c1"),
      "username": "user123",
      "rating": 5,
      "text": "Great product!",
      "date": ISODate("2021-05-20T15:30:00Z")
    },
    {
      "user_id": ObjectId("60a6e3e89f1c6a8d556884c2"),
      "username": "user456",
      "rating": 4,
      "text": "Good but expensive",
      "date": ISODate("2021-05-25T10:15:00Z")
    }
  ]
}

Python implementation with Pydantic:

from pydantic import BaseModel, Field
from typing import List, Optional
from datetime import datetime
from bson import ObjectId

class PyObjectId(ObjectId):
    @classmethod
    def __get_validators__(cls):
        yield cls.validate
        
    @classmethod
    def validate(cls, v):
        if not ObjectId.is_valid(v):
            raise ValueError("Invalid ObjectId")
        return ObjectId(v)

class Review(BaseModel):
    user_id: PyObjectId
    username: str
    rating: int
    text: str
    date: datetime = Field(default_factory=datetime.now)
    
    class Config:
        arbitrary_types_allowed = True
        json_encoders = {ObjectId: str}

class Product(BaseModel):
    name: str
    price: float
    category: str
    reviews: List[Review] = []
    
    class Config:
        arbitrary_types_allowed = True

Referenced One-to-Many Relationship (Child References)

When the "many" side is large or frequently changing:

// Blog post document (parent)
{
  "_id": ObjectId("60a6e3e89f1c6a8d556884b2"),
  "title": "Introduction to MongoDB",
  "content": "MongoDB is a document database...",
  "author": "John Doe",
  "date": ISODate("2021-05-20T15:30:00Z")
}

// Comment documents (children)
{
  "_id": ObjectId("60a6e3e89f1c6a8d556884c1"),
  "post_id": ObjectId("60a6e3e89f1c6a8d556884b2"),  // Reference to post
  "user": "Alice",
  "text": "Great article!",
  "date": ISODate("2021-05-20T16:30:00Z")
}

{
  "_id": ObjectId("60a6e3e89f1c6a8d556884c2"),
  "post_id": ObjectId("60a6e3e89f1c6a8d556884b2"),  // Reference to post
  "user": "Bob",
  "text": "Thanks for sharing.",
  "date": ISODate("2021-05-21T10:15:00Z")
}

Python implementation with PyMongo:

from datetime import datetime, timedelta

# Create blog post
post_id = db.posts.insert_one({
    "title": "Introduction to MongoDB",
    "content": "MongoDB is a document database...",
    "author": "John Doe",
    "date": datetime.now()
}).inserted_id

# Add comments referencing the post
comments = [
    {
        "post_id": post_id,
        "user": "Alice",
        "text": "Great article!",
        "date": datetime.now()
    },
    {
        "post_id": post_id,
        "user": "Bob",
        "text": "Thanks for sharing.",
        "date": datetime.now() + timedelta(hours=1)
    }
]
db.comments.insert_many(comments)

# Retrieve post with comments
post = db.posts.find_one({"_id": post_id})
post_comments = list(db.comments.find({"post_id": post_id}).sort("date", 1))

Referenced One-to-Many Relationship (Parent Reference)

Another approach for one-to-many is to have children reference their parent:

// Department document (one)
{
  "_id": ObjectId("60a6e3e89f1c6a8d556884b2"),
  "name": "Engineering",
  "location": "Building A"
}

// Employee documents (many) with parent reference
{
  "_id": ObjectId("60a6e3e89f1c6a8d556884c1"),
  "name": "John Doe",
  "position": "Software Engineer",
  "department_id": ObjectId("60a6e3e89f1c6a8d556884b2")  // Reference to department
}

{
  "_id": ObjectId("60a6e3e89f1c6a8d556884c2"),
  "name": "Jane Smith",
  "position": "QA Engineer",
  "department_id": ObjectId("60a6e3e89f1c6a8d556884b2")  // Reference to department
}

Python implementation with PyMongo:

# Create department
dept_id = db.departments.insert_one({
    "name": "Engineering",
    "location": "Building A"
}).inserted_id

# Create employees with department reference
employees = [
    {
        "name": "John Doe",
        "position": "Software Engineer",
        "department_id": dept_id
    },
    {
        "name": "Jane Smith",
        "position": "QA Engineer",
        "department_id": dept_id
    }
]
db.employees.insert_many(employees)

# Find all employees in a department
dept_employees = list(db.employees.find({"department_id": dept_id}))

Many-to-Many Relationships

In a many-to-many relationship, documents in both collections can be related to multiple documents in the other collection.

Embedded Many-to-Many Relationship

For many-to-many relationships with limited size:

// Student document with embedded courses
{
  "_id": ObjectId("60a6e3e89f1c6a8d556884b2"),
  "name": "John Doe",
  "courses": [
    {
      "course_id": ObjectId("60a6e3e89f1c6a8d556884c1"),
      "name": "Introduction to MongoDB",
      "instructor": "Prof. Smith",
      "enrolled_date": ISODate("2021-01-15")
    },
    {
      "course_id": ObjectId("60a6e3e89f1c6a8d556884c2"),
      "name": "Web Development",
      "instructor": "Prof. Johnson",
      "enrolled_date": ISODate("2021-02-10")
    }
  ]
}

// Course document with embedded students
{
  "_id": ObjectId("60a6e3e89f1c6a8d556884c1"),
  "name": "Introduction to MongoDB",
  "instructor": "Prof. Smith",
  "students": [
    {
      "student_id": ObjectId("60a6e3e89f1c6a8d556884b2"),
      "name": "John Doe",
      "enrolled_date": ISODate("2021-01-15")
    },
    {
      "student_id": ObjectId("60a6e3e89f1c6a8d556884b3"),
      "name": "Jane Smith",
      "enrolled_date": ISODate("2021-01-20")
    }
  ]
}

Note: This approach duplicates data and can be difficult to maintain as both sides need to be updated when changes occur.

Referenced Many-to-Many Relationship

A better approach is often to use a separate collection to model the relationship:

// Student documents
{
  "_id": ObjectId("60a6e3e89f1c6a8d556884b2"),
  "name": "John Doe",
  "email": "john@example.com"
}

{
  "_id": ObjectId("60a6e3e89f1c6a8d556884b3"),
  "name": "Jane Smith",
  "email": "jane@example.com"
}

// Course documents
{
  "_id": ObjectId("60a6e3e89f1c6a8d556884c1"),
  "name": "Introduction to MongoDB",
  "instructor": "Prof. Smith"
}

{
  "_id": ObjectId("60a6e3e89f1c6a8d556884c2"),
  "name": "Web Development",
  "instructor": "Prof. Johnson"
}

// Enrollments collection (junction/join collection)
{
  "_id": ObjectId("60a6e3e89f1c6a8d556884d1"),
  "student_id": ObjectId("60a6e3e89f1c6a8d556884b2"),
  "course_id": ObjectId("60a6e3e89f1c6a8d556884c1"),
  "enrolled_date": ISODate("2021-01-15"),
  "grade": "A"
}

{
  "_id": ObjectId("60a6e3e89f1c6a8d556884d2"),
  "student_id": ObjectId("60a6e3e89f1c6a8d556884b2"),
  "course_id": ObjectId("60a6e3e89f1c6a8d556884c2"),
  "enrolled_date": ISODate("2021-02-10"),
  "grade": "B+"
}

{
  "_id": ObjectId("60a6e3e89f1c6a8d556884d3"),
  "student_id": ObjectId("60a6e3e89f1c6a8d556884b3"),
  "course_id": ObjectId("60a6e3e89f1c6a8d556884c1"),
  "enrolled_date": ISODate("2021-01-20"),
  "grade": "A-"
}

Python implementation with PyMongo:

# Create students
student1_id = db.students.insert_one({
    "name": "John Doe",
    "email": "john@example.com"
}).inserted_id

student2_id = db.students.insert_one({
    "name": "Jane Smith",
    "email": "jane@example.com"
}).inserted_id

# Create courses
course1_id = db.courses.insert_one({
    "name": "Introduction to MongoDB",
    "instructor": "Prof. Smith"
}).inserted_id

course2_id = db.courses.insert_one({
    "name": "Web Development",
    "instructor": "Prof. Johnson"
}).inserted_id

# Create enrollments
enrollments = [
    {
        "student_id": student1_id,
        "course_id": course1_id,
        "enrolled_date": datetime(2021, 1, 15),
        "grade": "A"
    },
    {
        "student_id": student1_id,
        "course_id": course2_id,
        "enrolled_date": datetime(2021, 2, 10),
        "grade": "B+"
    },
    {
        "student_id": student2_id,
        "course_id": course1_id,
        "enrolled_date": datetime(2021, 1, 20),
        "grade": "A-"
    }
]
db.enrollments.insert_many(enrollments)

# Find all courses for a student
def get_student_courses(student_id):
    # Get all enrollments for the student
    enrollments = list(db.enrollments.find({"student_id": student_id}))
    
    # Get the course details for each enrollment
    courses = []
    for enrollment in enrollments:
        course = db.courses.find_one({"_id": enrollment["course_id"]})
        # Add enrollment details to the course
        course["enrolled_date"] = enrollment["enrolled_date"]
        course["grade"] = enrollment["grade"]
        courses.append(course)
    
    return courses

# Find all students in a course
def get_course_students(course_id):
    # Get all enrollments for the course
    enrollments = list(db.enrollments.find({"course_id": course_id}))
    
    # Get the student details for each enrollment
    students = []
    for enrollment in enrollments:
        student = db.students.find_one({"_id": enrollment["student_id"]})
        # Add enrollment details to the student
        student["enrolled_date"] = enrollment["enrolled_date"]
        student["grade"] = enrollment["grade"]
        students.append(student)
    
    return students

Self-Referencing Relationships

Self-referencing relationships occur when documents in a collection reference other documents in the same collection.

Tree Structure (Hierarchical Data)

For representing hierarchical data like categories:

// Category documents with parent references
{
  "_id": ObjectId("60a6e3e89f1c6a8d556884b2"),
  "name": "Electronics",
  "parent_id": null  // Root category
}

{
  "_id": ObjectId("60a6e3e89f1c6a8d556884c1"),
  "name": "Computers",
  "parent_id": ObjectId("60a6e3e89f1c6a8d556884b2")  // Child of Electronics
}

{
  "_id": ObjectId("60a6e3e89f1c6a8d556884c2"),
  "name": "Laptops",
  "parent_id": ObjectId("60a6e3e89f1c6a8d556884c1")  // Child of Computers
}

Python implementation to get the full path:

def get_category_path(category_id):
    path = []
    current_id = category_id
    
    while current_id is not None:
        category = db.categories.find_one({"_id": current_id})
        if category is None:
            break
            
        path.insert(0, category["name"])  # Add to beginning of path
        current_id = category["parent_id"]
    
    return " > ".join(path)

# Example: "Electronics > Computers > Laptops"

Graph Structure (Network)

For representing graph-like data such as social networks:

// User documents with friend references
{
  "_id": ObjectId("60a6e3e89f1c6a8d556884b2"),
  "name": "John Doe",
  "friends": [
    ObjectId("60a6e3e89f1c6a8d556884c1"),
    ObjectId("60a6e3e89f1c6a8d556884c2")
  ]
}

{
  "_id": ObjectId("60a6e3e89f1c6a8d556884c1"),
  "name": "Jane Smith",
  "friends": [
    ObjectId("60a6e3e89f1c6a8d556884b2"),
    ObjectId("60a6e3e89f1c6a8d556884c2")
  ]
}

{
  "_id": ObjectId("60a6e3e89f1c6a8d556884c2"),
  "name": "Bob Johnson",
  "friends": [
    ObjectId("60a6e3e89f1c6a8d556884b2"),
    ObjectId("60a6e3e89f1c6a8d556884c1")
  ]
}

Python implementation to find mutual friends:

def get_mutual_friends(user1_id, user2_id):
    user1 = db.users.find_one({"_id": user1_id})
    user2 = db.users.find_one({"_id": user2_id})
    
    if not user1 or not user2:
        return []
    
    # Find the intersection of friend lists
    mutual_friend_ids = set(user1["friends"]) & set(user2["friends"])
    
    # Get the details of mutual friends
    mutual_friends = list(db.users.find({"_id": {"$in": list(mutual_friend_ids)}}))
    
    return mutual_friends

Choosing Between Embedding and Referencing

When deciding whether to embed or reference related data, consider these factors:

Advantages of Embedding

  1. Performance: Embedded documents are retrieved in a single query
  2. Atomicity: All related data is updated in a single operation
  3. Consistency: Related data is always in sync

Advantages of Referencing

  1. Document Size: Prevents documents from exceeding the 16MB limit
  2. Duplication: Avoids data duplication
  3. Flexibility: Allows independent access and updates to related data
  4. Complex Relationships: Better for many-to-many relationships

Decision Criteria

| Criteria           | Embed                           | Reference                          |
|--------------------|---------------------------------|------------------------------------|
| Relationship       | One-to-one or one-to-few        | One-to-many or many-to-many        |
| Data Size          | Small embedded documents        | Large related documents            |
| Access Pattern     | Always accessed together        | Often accessed separately          |
| Update Frequency   | Rarely changes                  | Frequently changes                 |
| Growth             | Limited, predictable growth     | Unbounded growth                   |
| Query Requirements | Simple queries on embedded data | Complex queries across collections |

Hybrid Approaches

Sometimes a hybrid approach works best:

// Order document with both embedded and referenced data
{
  "_id": ObjectId("60a6e3e89f1c6a8d556884b2"),
  "order_number": "ORD-12345",
  "date": ISODate("2021-05-20T15:30:00Z"),
  "status": "shipped",
  
  // Referenced customer (frequently accessed separately)
  "customer_id": ObjectId("60a6e3e89f1c6a8d556884c1"),
  
  // Embedded summary of customer info (frequently accessed together)
  "customer_summary": {
    "name": "John Doe",
    "email": "john@example.com",
    "shipping_address": {
      "street": "123 Main St",
      "city": "New York",
      "state": "NY",
      "zip": "10001"
    }
  },
  
  // Embedded line items (always accessed with the order)
  "items": [
    {
      "product_id": ObjectId("60a6e3e89f1c6a8d556884d1"),
      "name": "Smartphone X",
      "price": 999.99,
      "quantity": 1
    },
    {
      "product_id": ObjectId("60a6e3e89f1c6a8d556884d2"),
      "name": "Wireless Earbuds",
      "price": 199.99,
      "quantity": 2
    }
  ],
  
  "total": 1399.97
}

This approach gives you the best of both worlds:

  • The order document contains embedded items for atomic updates and single-query retrieval
  • It references the full customer document for detailed information
  • It includes a customer summary to avoid an extra query for common operations

MongoDB Under the Hood

Storage Engine

The storage engine is responsible for managing how data is stored on disk and in memory. MongoDB's default storage engine is WiredTiger (since version 3.2), which offers:

  1. Document-Level Concurrency: Multiple clients can modify different documents in a collection simultaneously
  2. Compression: Both data and indexes are compressed by default
  3. Journaling: Write operations are recorded in a journal for durability
  4. Checkpoints: Creates consistent snapshots of data files every 60 seconds by default

WiredTiger uses a B-tree data structure for storage, with pages of data cached in RAM and written to disk during checkpoints.

Other important aspects of the storage engine:

  • Write Ahead Log (WAL): Ensures data durability by logging operations before they are applied
  • Snapshot Isolation: Readers see a consistent snapshot of data at a point in time
  • Checkpoint Process: Flushes in-memory changes to disk periodically
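
From Python, the serverStatus command shows which engine is active (exact field names vary somewhat by server version):

status = db.command("serverStatus")
print(status["storageEngine"]["name"])  # "wiredTiger" on modern deployments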

Indexing

MongoDB supports several types of indexes to optimize query performance:

  1. Single Field Index: Index on one field

    db.users.createIndex({ "email": 1 })  // 1 for ascending order
    
  2. Compound Index: Index on multiple fields

    db.products.createIndex({ "category": 1, "price": -1 })  // -1 for descending order
    
  3. Multikey Index: Automatically created when indexing an array field

    db.posts.createIndex({ "tags": 1 })  // Will index each element in the tags array
    
  4. Text Index: For text search capabilities

    db.articles.createIndex({ "content": "text" })
    
  5. Geospatial Index: For location-based queries

    db.places.createIndex({ "location": "2dsphere" })
    
  6. Hashed Index: For hash-based sharding

    db.users.createIndex({ "_id": "hashed" })
    

Indexes in MongoDB are implemented as B-trees and stored separately from the collection data.

Query Optimization

MongoDB's query optimizer selects the most efficient query plan based on:

  1. Query Shape: The structure of the query (which fields, operators, etc.)
  2. Available Indexes: Which indexes could potentially be used
  3. Collection Statistics: Size of the collection and distribution of values
  4. Query Execution History: Results of previous similar queries

The query plan cache stores successful query plans to avoid repeated planning for similar queries.

To analyze query performance, MongoDB provides the explain() method:

db.users.find({ "status": "active", "age": { $gt: 21 } }).explain("executionStats")

This returns detailed information about:

  • Which indexes were considered
  • Which index was chosen
  • Number of documents examined
  • Execution time
  • Stages of the query plan
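
The Python equivalent uses Cursor.explain(), which wraps the same command:

plan = db.users.find({"status": "active", "age": {"$gt": 21}}).explain()
print(plan["queryPlanner"]["winningPlan"])  # chosen plan, e.g. an IXSCAN stage
# When the server executes the query while explaining, document and timing
# counters appear under plan["executionStats"]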

Memory Management

MongoDB employs a tiered storage model:

  1. Working Set: Active portion of data and indexes that fits in RAM
  2. Disk Storage: Full dataset stored on disk

WiredTiger manages memory through:

  • Cache: Sized relative to system RAM (by default, the larger of 50% of RAM minus 1 GB, or 256 MB)
  • Eviction: Removing less frequently used data from cache when approaching memory limits
  • Page Replacement: Algorithm to decide which pages to evict

Memory usage can be monitored with:

db.serverStatus().wiredTiger.cache

MongoDB Operations

CRUD Operations

Create

MongoDB provides several methods to insert documents:

  1. Insert a single document:

    db.users.insertOne({
      name: "John Doe",
      email: "john.doe@example.com",
      age: 30
    })
    
  2. Insert multiple documents:

    db.users.insertMany([
      { name: "Jane Smith", email: "jane@example.com", age: 28 },
      { name: "Bob Johnson", email: "bob@example.com", age: 35 }
    ])
    

Python equivalent with PyMongo:

from pymongo import MongoClient

client = MongoClient('mongodb://localhost:27017/')
db = client['mydatabase']
users = db['users']

# Insert one document
result = users.insert_one({
    "name": "John Doe",
    "email": "john.doe@example.com",
    "age": 30
})
print(f"Inserted document with ID: {result.inserted_id}")

# Insert multiple documents
results = users.insert_many([
    {"name": "Jane Smith", "email": "jane@example.com", "age": 28},
    {"name": "Bob Johnson", "email": "bob@example.com", "age": 35}
])
print(f"Inserted {len(results.inserted_ids)} documents")

Read

MongoDB offers flexible query capabilities:

  1. Find all documents in a collection:

    db.users.find()
    
  2. Find documents matching specific criteria:

    db.users.find({ age: { $gt: 25 } })  // Users older than 25
    
  3. Find one document:

    db.users.findOne({ email: "john.doe@example.com" })
    
  4. Projection (selecting specific fields):

    db.users.find({ age: { $gt: 25 } }, { name: 1, email: 1, _id: 0 })
    
  5. Limit results:

    db.users.find().limit(10)
    
  6. Skip results (for pagination):

    db.users.find().skip(10).limit(10)  // Second page of 10 results
    
  7. Sort results:

    db.users.find().sort({ age: -1 })  // Sort by age descending
    

Python equivalent with PyMongo:

# Find all users
all_users = list(users.find())

# Find users older than 25
older_users = list(users.find({"age": {"$gt": 25}}))

# Find one user by email
user = users.find_one({"email": "john.doe@example.com"})

# Projection
user_names = list(users.find({}, {"name": 1, "_id": 0}))

# Pagination
page_size = 10
page_num = 2
paginated_users = list(users.find().skip((page_num - 1) * page_size).limit(page_size))

# Sorting
sorted_users = list(users.find().sort("age", -1))  # -1 for descending

Update

MongoDB provides multiple ways to update documents:

  1. Update a single document:

    db.users.updateOne(
      { email: "john.doe@example.com" },
      { $set: { age: 31, status: "active" } }
    )
    
  2. Update multiple documents:

    db.users.updateMany(
      { age: { $lt: 30 } },
      { $set: { category: "young" } }
    )
    
  3. Replace a document:

    db.users.replaceOne(
      { email: "john.doe@example.com" },
      {
        name: "John Doe",
        email: "john.doe@example.com",
        age: 31,
        address: { city: "New York", zip: "10001" }
      }
    )
    
  4. Update operators:

    • $set: Set field values
    • $inc: Increment field values
    • $push: Add elements to arrays
    • $pull: Remove elements from arrays
    • $addToSet: Add elements to arrays without duplicates
    • $unset: Remove fields

Python equivalent with PyMongo:

# Update one document
users.update_one(
    {"email": "john.doe@example.com"},
    {"$set": {"age": 31, "status": "active"}}
)

# Update multiple documents
result = users.update_many(
    {"age": {"$lt": 30}},
    {"$set": {"category": "young"}}
)
print(f"Modified {result.modified_count} documents")

# Replace a document
users.replace_one(
    {"email": "john.doe@example.com"},
    {
        "name": "John Doe",
        "email": "john.doe@example.com",
        "age": 31,
        "address": {"city": "New York", "zip": "10001"}
    }
)

# Using update operators
users.update_one(
    {"email": "john.doe@example.com"},
    {
        "$inc": {"login_count": 1},
        "$push": {"login_history": datetime.now()},
        "$set": {"last_login": datetime.now()}
    }
)

Delete

MongoDB provides methods to remove documents:

  1. Delete a single document:

    db.users.deleteOne({ email: "john.doe@example.com" })
    
  2. Delete multiple documents:

    db.users.deleteMany({ status: "inactive" })
    
  3. Delete all documents in a collection:

    db.users.deleteMany({})
    

Python equivalent with PyMongo:

# Delete one document
result = users.delete_one({"email": "john.doe@example.com"})
print(f"Deleted {result.deleted_count} document")

# Delete multiple documents
result = users.delete_many({"status": "inactive"})
print(f"Deleted {result.deleted_count} documents")

# Clear collection
users.delete_many({})

Aggregation Framework

MongoDB's Aggregation Framework provides a powerful way to process and transform data within the database. It uses a pipeline approach where documents pass through stages that modify them.

Common aggregation stages:

  1. $match: Filter documents (similar to find)

    { $match: { status: "active" } }
    
  2. $group: Group documents by a key

    { $group: { _id: "$department", totalEmployees: { $sum: 1 } } }
    
  3. $sort: Sort documents

    { $sort: { age: -1 } }
    
  4. $project: Reshape documents (select/compute fields)

    { $project: { name: 1, firstLetter: { $substr: ["$name", 0, 1] } } }
    
  5. $unwind: Deconstruct array fields

    { $unwind: "$tags" }
    
  6. $lookup: Perform a join with another collection

    {
      $lookup: {
        from: "orders",
        localField: "_id",
        foreignField: "customer_id",
        as: "customer_orders"
      }
    }
    

Example of a complex aggregation pipeline:

db.sales.aggregate([
  // Stage 1: Filter for completed sales
  { $match: { status: "completed" } },
  
  // Stage 2: Group by product and calculate revenue
  { $group: {
      _id: "$product_id",
      totalRevenue: { $sum: { $multiply: ["$price", "$quantity"] } },
      count: { $sum: 1 }
  }},
  
  // Stage 3: Sort by revenue
  { $sort: { totalRevenue: -1 } },
  
  // Stage 4: Limit to top 5
  { $limit: 5 },
  
  // Stage 5: Join with products collection
  { $lookup: {
      from: "products",
      localField: "_id",
      foreignField: "_id",
      as: "product_info"
  }},
  
  // Stage 6: Reshape the output
  { $project: {
      _id: 0,
      product: { $arrayElemAt: ["$product_info.name", 0] },
      totalRevenue: 1,
      count: 1
  }}
])

Python equivalent with PyMongo:

pipeline = [
    # Stage 1: Filter for completed sales
    {"$match": {"status": "completed"}},
    
    # Stage 2: Group by product and calculate revenue
    {"$group": {
        "_id": "$product_id",
        "totalRevenue": {"$sum": {"$multiply": ["$price", "$quantity"]}},
        "count": {"$sum": 1}
    }},
    
    # Stage 3: Sort by revenue
    {"$sort": {"totalRevenue": -1}},
    
    # Stage 4: Limit to top 5
    {"$limit": 5},
    
    # Stage 5: Join with products collection
    {"$lookup": {
        "from": "products",
        "localField": "_id",
        "foreignField": "_id",
        "as": "product_info"
    }},
    
    # Stage 6: Reshape the output
    {"$project": {
        "_id": 0,
        "product": {"$arrayElemAt": ["$product_info.name", 0]},
        "totalRevenue": 1,
        "count": 1
    }}
]

top_products = list(db.sales.aggregate(pipeline))

Text Search

MongoDB provides text search capabilities for string content:

  1. Create a text index:

    db.articles.createIndex({ title: "text", content: "text" })
    
  2. Perform a text search:

    db.articles.find({ $text: { $search: "mongodb database" } })
    
  3. Sort by relevance score:

    db.articles.find(
      { $text: { $search: "mongodb database" } },
      { score: { $meta: "textScore" } }
    ).sort({ score: { $meta: "textScore" } })
    

Python equivalent with PyMongo:

# Create text index
db.articles.create_index([("title", "text"), ("content", "text")])

# Perform text search
results = list(db.articles.find({"$text": {"$search": "mongodb database"}}))

# Sort by relevance score
results = list(db.articles.find(
    {"$text": {"$search": "mongodb database"}},
    {"score": {"$meta": "textScore"}}
).sort([("score", {"$meta": "textScore"})]))

Geospatial Queries

MongoDB supports geospatial queries for location-based applications:

  1. Create a geospatial index:

    db.places.createIndex({ location: "2dsphere" })
    
  2. Store location data using GeoJSON:

    db.places.insertOne({
      name: "Central Park",
      location: {
        type: "Point",
        coordinates: [-73.97, 40.77]  // [longitude, latitude]
      }
    })
    
  3. Find places near a point:

    db.places.find({
      location: {
        $near: {
          $geometry: {
            type: "Point",
            coordinates: [-73.98, 40.76]
          },
          $maxDistance: 1000  // in meters
        }
      }
    })
    
  4. Find places within a polygon:

    db.places.find({
      location: {
        $geoWithin: {
          $geometry: {
            type: "Polygon",
            coordinates: [[
              [-74.0, 40.7],
              [-74.0, 40.8],
              [-73.9, 40.8],
              [-73.9, 40.7],
              [-74.0, 40.7]
            ]]
          }
        }
      }
    })
    

Python equivalent with PyMongo:

# Create geospatial index
db.places.create_index([("location", "2dsphere")])

# Insert a place with location
db.places.insert_one({
    "name": "Central Park",
    "location": {
        "type": "Point",
        "coordinates": [-73.97, 40.77]
    }
})

# Find places near a point
nearby_places = list(db.places.find({
    "location": {
        "$near": {
            "$geometry": {
                "type": "Point",
                "coordinates": [-73.98, 40.76]
            },
            "$maxDistance": 1000
        }
    }
}))
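
The $geoWithin query translates the same way; a sketch using the polygon from above:

polygon = {
    "type": "Polygon",
    "coordinates": [[
        [-74.0, 40.7], [-74.0, 40.8], [-73.9, 40.8],
        [-73.9, 40.7], [-74.0, 40.7]
    ]]
}
places_in_area = list(db.places.find({
    "location": {"$geoWithin": {"$geometry": polygon}}
}))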

MongoDB with Python

PyMongo Basics

PyMongo is the official MongoDB driver for Python:

from pymongo import MongoClient
from bson.objectid import ObjectId

# Connect to MongoDB
client = MongoClient('mongodb://localhost:27017/')
# or with authentication:
# client = MongoClient('mongodb://username:password@localhost:27017/')

# Access a database
db = client['mydatabase']

# Access a collection
collection = db['mycollection']

# Insert a document
result = collection.insert_one({
    'name': 'John Doe',
    'email': 'john@example.com'
})
print(f"Inserted document with ID: {result.inserted_id}")

# Find documents
documents = collection.find({'name': 'John Doe'})
for doc in documents:
    print(doc)

# Find by ID
document = collection.find_one({'_id': ObjectId('60a6e3e89f1c6a8d556884b2')})

# Update a document
result = collection.update_one(
    {'email': 'john@example.com'},
    {'$set': {'name': 'John Smith'}}
)
print(f"Modified {result.modified_count} document(s)")

# Delete a document
result = collection.delete_one({'email': 'john@example.com'})
print(f"Deleted {result.deleted_count} document(s)")

# Close the connection
client.close()

Motor for Async Operations

Motor is the asynchronous MongoDB driver for Python, perfect for use with async frameworks like FastAPI:

import asyncio
from motor.motor_asyncio import AsyncIOMotorClient

async def main():
    # Connect to MongoDB
    client = AsyncIOMotorClient('mongodb://localhost:27017/')
    db = client['mydatabase']
    collection = db['mycollection']
    
    # Insert a document
    result = await collection.insert_one({
        'name': 'John Doe',
        'email': 'john@example.com'
    })
    print(f"Inserted document with ID: {result.inserted_id}")
    
    # Find documents
    async for document in collection.find({'name': 'John Doe'}):
        print(document)
    
    # Find one document
    document = await collection.find_one({'email': 'john@example.com'})
    print(document)
    
    # Close the connection
    client.close()

# Run the async function
asyncio.run(main())

Pydantic Integration

Pydantic provides data validation and settings management using Python type annotations. It integrates well with MongoDB for schema validation:

from pydantic import BaseModel, Field
from typing import List, Optional
from datetime import datetime
from bson import ObjectId
from pymongo import MongoClient

# Custom type for handling ObjectId
class PyObjectId(ObjectId):
    @classmethod
    def __get_validators__(cls):
        yield cls.validate
        
    @classmethod
    def validate(cls, v):
        if not ObjectId.is_valid(v):
            raise ValueError("Invalid ObjectId")
        return ObjectId(v)
    
    @classmethod
    def __modify_schema__(cls, field_schema):
        field_schema.update(type="string")

# Pydantic model for User
class User(BaseModel):
    id: Optional[PyObjectId] = Field(default_factory=PyObjectId, alias="_id")
    name: str
    email: str
    age: int
    is_active: bool = True
    created_at: datetime = Field(default_factory=datetime.now)
    tags: List[str] = []
    
    class Config:
        allow_population_by_field_name = True
        arbitrary_types_allowed = True
        json_encoders = {
            ObjectId: str,
            datetime: lambda dt: dt.isoformat()
        }

# Example usage with MongoDB and Pydantic
client = MongoClient('mongodb://localhost:27017/')
db = client['mydatabase']
collection = db['users']

# Create a user from Pydantic model
user_data = {
    "name": "John Doe",
    "email": "john@example.com",
    "age": 30,
    "tags": ["developer", "python"]
}
user = User(**user_data)
result = collection.insert_one(user.dict(by_alias=True))
print(f"Inserted user with ID: {result.inserted_id}")

# Retrieve and validate from MongoDB
user_from_db = collection.find_one({"email": "john@example.com"})
validated_user = User(**user_from_db)
print(validated_user.json())

# Update using Pydantic model
user_update = User(**user_from_db)
user_update.age = 31
user_update.tags.append("mongodb")
collection.update_one(
    {"_id": user_update.id},
    {"$set": user_update.dict(by_alias=True, exclude={"id"})}
)

With FastAPI and Motor (async):

from fastapi import FastAPI, HTTPException, status
from motor.motor_asyncio import AsyncIOMotorClient
from pydantic import BaseModel, Field, EmailStr
from typing import List, Optional
from datetime import datetime
from bson import ObjectId

app = FastAPI()

# Database connection
client = AsyncIOMotorClient('mongodb://localhost:27017')
db = client.mydatabase

# Pydantic models
class PyObjectId(ObjectId):
    @classmethod
    def __get_validators__(cls):
        yield cls.validate
        
    @classmethod
    def validate(cls, v):
        if not ObjectId.is_valid(v):
            raise ValueError("Invalid ObjectId")
        return ObjectId(v)
    
    @classmethod
    def __modify_schema__(cls, field_schema):
        field_schema.update(type="string")

class UserBase(BaseModel):
    name: str
    email: EmailStr
    age: int
    tags: List[str] = []
    is_active: bool = True

class UserCreate(UserBase):
    pass

class UserDB(UserBase):
    id: PyObjectId = Field(default_factory=PyObjectId, alias="_id")
    created_at: datetime = Field(default_factory=datetime.now)
    
    class Config:
        allow_population_by_field_name = True
        arbitrary_types_allowed = True
        json_encoders = {
            ObjectId: str,
            datetime: lambda dt: dt.isoformat()
        }

# FastAPI routes
@app.post("/users/", response_model=UserDB, status_code=status.HTTP_201_CREATED)
async def create_user(user: UserCreate):
    user_dict = user.dict()
    user_dict["created_at"] = datetime.now()
    
    result = await db.users.insert_one(user_dict)
    
    created_user = await db.users.find_one({"_id": result.inserted_id})
    return created_user

@app.get("/users/{user_id}", response_model=UserDB)
async def get_user(user_id: str):
    if not ObjectId.is_valid(user_id):
        raise HTTPException(status_code=400, detail="Invalid user ID format")
        
    user = await db.users.find_one({"_id": ObjectId(user_id)})
    if user is None:
        raise HTTPException(status_code=404, detail="User not found")
        
    return user

@app.get("/users/", response_model=List[UserDB])
async def list_users(limit: int = 10, skip: int = 0):
    users = await db.users.find().skip(skip).limit(limit).to_list(length=limit)
    return users

MongoDB Deployment

Running MongoDB in Docker

Docker provides an easy way to deploy MongoDB:

Basic MongoDB container:

docker run -d --name mongodb \
    -p 27017:27017 \
    -e MONGO_INITDB_ROOT_USERNAME=admin \
    -e MONGO_INITDB_ROOT_PASSWORD=password \
    -v mongodb_data:/data/db \
    mongo:latest
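
A quick connectivity check from Python against this container (the credentials match the MONGO_INITDB_ROOT_* values above; root users authenticate against the admin database):

from pymongo import MongoClient

client = MongoClient("mongodb://admin:password@localhost:27017/?authSource=admin")
print(client.admin.command("ping"))  # {'ok': 1.0} when the container is up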

Using Docker Compose:

# docker-compose.yml
version: '3.8'

services:
  mongodb:
    image: mongo:latest
    container_name: mongodb
    restart: always
    ports:
      - "27017:27017"
    environment:
      MONGO_INITDB_ROOT_USERNAME: admin
      MONGO_INITDB_ROOT_PASSWORD: password
    volumes:
      - mongodb_data:/data/db
      - ./mongo-init.js:/docker-entrypoint-initdb.d/mongo-init.js:ro

  mongo-express:
    image: mongo-express:latest
    container_name: mongo-express
    restart: always
    ports:
      - "8081:8081"
    environment:
      ME_CONFIG_MONGODB_ADMINUSERNAME: admin
      ME_CONFIG_MONGODB_ADMINPASSWORD: password
      ME_CONFIG_MONGODB_SERVER: mongodb
    depends_on:
      - mongodb

volumes:
  mongodb_data:

With initialization script:

// mongo-init.js
db = db.getSiblingDB('mydatabase');

db.createUser({
  user: 'myuser',
  pwd: 'mypassword',
  roles: [
    { role: 'readWrite', db: 'mydatabase' }
  ]
});

db.createCollection('users');
db.users.insertMany([
  {
    name: 'John Doe',
    email: 'john@example.com',
    age: 30
  },
  {
    name: 'Jane Smith',
    email: 'jane@example.com',
    age: 28
  }
]);

Running and stopping the containers:

# Start services
docker-compose up -d

# Stop services
docker-compose down

# View logs
docker-compose logs -f mongodb

MongoDB Atlas

MongoDB Atlas is a fully-managed cloud database service provided by MongoDB, Inc. It offers:

  1. Automated deployment across AWS, Azure, or GCP
  2. Automated backups and point-in-time recovery
  3. Auto-scaling based on workload
  4. Security features like encryption, VPC peering, and IP whitelisting
  5. Performance optimization with query profiling and suggestions

Connecting to Atlas from Python:

from pymongo import MongoClient

# Connection string from Atlas dashboard
connection_string = "mongodb+srv://username:password@cluster0.mongodb.net/mydatabase?retryWrites=true&w=majority"

client = MongoClient(connection_string)
db = client.mydatabase
collection = db.mycollection

# Test connection
result = collection.find_one()
print(result)

Self-hosted Deployment

For production self-hosted deployments, MongoDB is typically run as a replica set or sharded cluster:

Replica Set provides redundancy and high availability:

# Start MongoDB instances
mongod --replSet myrs --dbpath /data/db1 --port 27017
mongod --replSet myrs --dbpath /data/db2 --port 27018
mongod --replSet myrs --dbpath /data/db3 --port 27019

# Configure replica set
mongo --port 27017
> rs.initiate({
    _id: "myrs",
    members: [
      { _id: 0, host: "mongodb0:27017" },
      { _id: 1, host: "mongodb1:27018" },
      { _id: 2, host: "mongodb2:27019" }
    ]
  })

Sharded Cluster for horizontal scaling:

# Start config servers
mongod --configsvr --replSet configrs --dbpath /data/configdb --port 27019

# Start shard servers
mongod --shardsvr --replSet shard1rs --dbpath /data/shard1 --port 27018

# Start mongos router
mongos --configdb configrs/config1:27019,config2:27019,config3:27019 --port 27017

# Add shards via mongos
mongo --port 27017
> sh.addShard("shard1rs/shard1:27018")
> sh.enableSharding("mydatabase")
> sh.shardCollection("mydatabase.users", { "_id": "hashed" })

MongoDB Security

Authentication and Authorization

MongoDB provides role-based access control (RBAC):

  1. Authentication Methods:

    • Username/Password
    • X.509 certificates
    • LDAP
    • Kerberos
  2. Creating a user with a specific role:

db.createUser({
  user: "appUser",
  pwd: "securePassword",
  roles: [
    { role: "readWrite", db: "mydatabase" }
  ]
})
  3. Built-in roles:

    • read: Read data from any collection
    • readWrite: Read and write data
    • dbAdmin: Perform administrative tasks
    • userAdmin: Create and modify users and roles
    • clusterAdmin: Administer the whole cluster
    • backup: Backup data
    • restore: Restore data from backups
  4. Creating custom roles:

db.createRole({
  role: "reportingRole",
  privileges: [
    {
      resource: { db: "mydatabase", collection: "" },
      actions: [ "find" ]
    }
  ],
  roles: []
})
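
From PyMongo, the same commands can be sent directly; a sketch that assumes the connection is authenticated as a user with userAdmin privileges:

# Equivalent of the db.createUser() call above
db.command("createUser", "appUser",
           pwd="securePassword",
           roles=[{"role": "readWrite", "db": "mydatabase"}])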

Network Security

Securing MongoDB network access:

  1. Binding to localhost only:
mongod --bind_ip 127.0.0.1
  2. Enabling TLS/SSL:
mongod --tlsMode requireTLS --tlsCertificateKeyFile /path/to/server.pem
  3. Firewall rules to restrict access:
# Allow MongoDB port only from specific IPs
ufw allow from 192.168.1.0/24 to any port 27017
  4. VPC/network isolation in cloud environments

Encryption

MongoDB supports encryption at multiple levels:

  1. Transport Encryption (TLS/SSL) for data in transit
  2. Storage Encryption for data at rest:
mongod --enableEncryption --encryptionKeyFile /path/to/key
  3. Client-Side Field Level Encryption for sensitive fields:
const clientEncryption = new ClientEncryption(client, {
  keyVaultNamespace: 'encryption.__dataKeys',
  kmsProviders: {
    local: {
      key: localMasterKey
    }
  }
});

// Encrypt a field
const encryptedField = await clientEncryption.encrypt(
  sensitiveData,
  {
    algorithm: 'AEAD_AES_256_CBC_HMAC_SHA_512-Deterministic',
    keyAltName: 'myKey'
  }
);

// Store encrypted data
await collection.insertOne({
  name: 'John Doe',
  ssn: encryptedField
});

Performance Optimization

Indexing Strategies

Effective indexing is crucial for MongoDB performance:

  1. Single-field indexes for frequently queried fields:
db.users.createIndex({ "email": 1 })
  2. Compound indexes for multi-field queries:
db.products.createIndex({ "category": 1, "price": -1 })
  3. Index properties:

    • Unique: Enforce field uniqueness

      db.users.createIndex({ "email": 1 }, { unique: true })
      
    • Sparse: Only index documents with the field present

      db.users.createIndex({ "optional_field": 1 }, { sparse: true })
      
    • TTL (Time-To-Live): Automatically expire documents

      db.sessions.createIndex({ "last_activity": 1 }, { expireAfterSeconds: 3600 })
      
    • Partial: Only index documents matching a filter

      db.orders.createIndex(
        { "status": 1 },
        { partialFilterExpression: { "status": "active" } }
      )
      
  4. Index usage analysis:

db.users.find({ "age": { $gt: 25 } }).explain("executionStats")
  5. Identifying long-running queries (often a symptom of missing indexes):
db.currentOp(
  {
    "op" : "query",
    "microsecs_running" : { $gt: 100000 }
  }
)

Query Optimization Techniques

  1. Query profiling to identify slow queries:
// Enable profiler
db.setProfilingLevel(1, { slowms: 100 })

// View slow queries
db.system.profile.find().sort({ ts: -1 }).limit(10)
  2. Covered queries that are satisfied entirely by an index:
// Create an index on both fields
db.users.createIndex({ "email": 1, "name": 1 })

// Query that uses only indexed fields
db.users.find(
  { "email": "john@example.com" },
  { "_id": 0, "email": 1, "name": 1 }
)
  3. Projection to retrieve only needed fields:
db.products.find({}, { name: 1, price: 1, _id: 0 })
  4. Limit results to reduce data transfer:
db.logs.find().sort({ timestamp: -1 }).limit(100)
  5. Avoid negation operators when possible:
// Avoid this (can't use indexes effectively)
db.users.find({ status: { $ne: "inactive" } })

// Better approach
db.users.find({ status: { $in: ["active", "pending"] } })

Performance Monitoring

  1. Server status metrics:
db.serverStatus()
  2. Database statistics:
db.stats()
  3. Collection statistics:
db.users.stats()
  4. Index usage statistics:
db.users.aggregate([
  { $indexStats: {} }
])
  5. Monitoring tools:
    • MongoDB Compass
    • MongoDB Cloud Manager
    • Prometheus with MongoDB exporter
    • Grafana dashboards

Real-world Use Cases

Content Management Systems

MongoDB is well-suited for content management systems:

  1. Flexible schema for different content types
  2. Rich querying for content filtering
  3. Embedded documents for comments and related content

Example CMS document:

{
  "_id": ObjectId("60a6e3e89f1c6a8d556884b2"),
  "title": "Getting Started with MongoDB",
  "slug": "getting-started-with-mongodb",
  "content": "MongoDB is a document database...",
  "author": {
    "name": "John Doe",
    "email": "john@example.com"
  },
  "tags": ["mongodb", "nosql", "database"],
  "status": "published",
  "created_at": ISODate("2021-05-20T15:30:00Z"),
  "updated_at": ISODate("2021-05-25T10:15:00Z"),
  "comments": [
    {
      "user": "Jane Smith",
      "text": "Great article!",
      "created_at": ISODate("2021-05-21T08:45:00Z")
    }
  ],
  "metadata": {
    "featured": true,
    "view_count": 1250,
    "rating": 4.7
  }
}

Real-time Analytics

MongoDB excels for real-time analytics applications:

  1. Time-series data collection
  2. Aggregation pipeline for complex analytics
  3. Sharding for handling large data volumes

Example analytics pipeline:

db.page_views.aggregate([
  // Match events from the last 24 hours
  {
    $match: {
      timestamp: {
        $gte: new Date(Date.now() - 24 * 60 * 60 * 1000)
      }
    }
  },
  
  // Group by page and calculate stats
  {
    $group: {
      _id: "$page",
      views: { $sum: 1 },
      unique_users: { $addToSet: "$user_id" },
      avg_duration: { $avg: "$duration" }
    }
  },
  
  // Calculate number of unique users
  {
    $addFields: {
      unique_users: { $size: "$unique_users" }
    }
  },
  
  // Sort by most viewed
  {
    $sort: { views: -1 }
  },
  
  // Limit to top 10
  {
    $limit: 10
  }
])

IoT Applications

MongoDB is popular for Internet of Things (IoT) applications:

  1. High write throughput for sensor data
  2. Time-series collections for time-ordered data
  3. Geospatial queries for location tracking
  4. TTL indexes for data expiration

Example IoT document:

{
  "_id": ObjectId("60a6e3e89f1c6a8d556884b2"),
  "device_id": "thermostat-1234",
  "type": "temperature",
  "value": 22.5,
  "unit": "celsius",
  "location": {
    "type": "Point",
    "coordinates": [-73.97, 40.77]
  },
  "battery": 87,
  "timestamp": ISODate("2021-05-20T15:30:00Z")
}
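
A sketch of the TTL index mentioned above, using PyMongo; the 30-day retention window and the collection name are assumptions for illustration:

# Documents are removed roughly 30 days after their timestamp value
db.sensor_readings.create_index("timestamp", expireAfterSeconds=30 * 24 * 3600)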

Mobile Applications

MongoDB works well for mobile apps:

  1. Flexible schema for rapidly evolving app features
  2. Offline-first architecture with MongoDB Realm
  3. Change streams for real-time updates
  4. Horizontal scaling for growing user bases

Example mobile app user document:

{
  "_id": ObjectId("60a6e3e89f1c6a8d556884b2"),
  "username": "johndoe",
  "email": "john@example.com",
  "profile": {
    "name": "John Doe",
    "avatar": "https://example.com/avatars/johndoe.jpg",
    "bio": "MongoDB enthusiast"
  },
  "preferences": {
    "notifications": {
      "push": true,
      "email": false
    },
    "theme": "dark"
  },
  "devices": [
    {
      "type": "android",
      "token": "fcm-token-123",
      "last_active": ISODate("2021-05-20T15:30:00Z")
    }
  ],
  "last_login": ISODate("2021-05-20T15:30:00Z"),
  "created_at": ISODate("2021-03-15T10:20:00Z")
}

Catalog Management

MongoDB is excellent for product catalogs:

  1. Schema flexibility for diverse product types
  2. Rich querying for faceted search
  3. Horizontal scaling for large catalogs

Example product document:

{
  "_id": ObjectId("60a6e3e89f1c6a8d556884b2"),
  "sku": "MBP-2021-14-M1",
  "name": "MacBook Pro 14-inch",
  "description": "Apple MacBook Pro with M1 Pro chip",
  "price": 1999.99,
  "category": "electronics",
  "subcategory": "laptops",
  "brand": "Apple",
  "attributes": {
    "processor": "Apple M1 Pro",
    "memory": "16GB",
    "storage": "512GB SSD",
    "display": "14.2-inch Liquid Retina XDR",
    "color": "Space Gray"
  },
  "images": [
    {
      "url": "https://example.com/images/mbp-front.jpg",
      "alt": "Front view",
      "is_primary": true
    },
    {
      "url": "https://example.com/images/mbp-side.jpg",
      "alt": "Side view",
      "is_primary": false
    }
  ],
  "inventory": {
    "in_stock": 42,
    "warehouse_location": "NYC-1"
  },
  "metadata": {
    "featured": true,
    "rating": 4.8,
    "reviews_count": 156
  },
  "created_at": ISODate("2021-10-26T10:00:00Z"),
  "updated_at": ISODate("2021-11-15T14:30:00Z")
}
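
For point 2 above, a hedged sketch of faceted search over documents shaped like the one shown (the price boundaries are illustrative):

db.products.aggregate([
  { $match: { category: "electronics" } },
  {
    $facet: {
      // Product counts per brand
      by_brand: [
        { $group: { _id: "$brand", count: { $sum: 1 } } }
      ],
      // Product counts per price range
      by_price: [
        {
          $bucket: {
            groupBy: "$price",
            boundaries: [0, 500, 1000, 2000, 5000],
            default: "5000+",
            output: { count: { $sum: 1 } }
          }
        }
      ],
      // Top-rated products for the landing page
      top_rated: [
        { $sort: { "metadata.rating": -1 } },
        { $limit: 5 },
        { $project: { name: 1, price: 1, "metadata.rating": 1 } }
      ]
    }
  }
])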

Best Practices

Schema Design Patterns

  1. Embedded Documents Pattern:

    • Embed related data in a single document for faster reads
    • Best for one-to-few relationships
    • Example: embedding addresses in a user document
  2. References Pattern:

    • Use references between documents for one-to-many or many-to-many relationships
    • Example: referencing order IDs in a user document
  3. Bucket Pattern:

    • Group related time-series data into buckets
    • Prevents having too many small documents
    • Example: storing hourly metrics in a daily document (see the sketch after this list)
  4. Schema Versioning Pattern:

    • Include a version field in documents
    • Handle migrations gracefully
    • Example: { "schema_version": 2, ... }
  5. Computed Pattern:

    • Store computed data to avoid expensive calculations
    • Update during writes
    • Example: storing count of items in a cart
  6. Subset Pattern:

    • Store a subset of fields from one collection in another
    • Reduces need for joins
    • Example: storing essential product info in order documents
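
A minimal sketch of the bucket pattern from point 3, with one document per device per day (the field names are illustrative):

{
  "_id": "thermostat-1234:2021-05-20",
  "device_id": "thermostat-1234",
  "date": ISODate("2021-05-20T00:00:00Z"),
  "readings": [
    { "hour": 0, "avg_temp": 21.8 },
    { "hour": 1, "avg_temp": 21.5 }
  ],
  "reading_count": 2
}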

Data Modeling Guidelines

  1. Design for the application's queries:

    • Start with the queries, then design the schema
    • Denormalize when it improves read performance
  2. Balance embedding vs. referencing:

    • Embed when data is always accessed together
    • Reference when data is large, accessed separately, or shared (see the sketch after this list)
  3. Consider document growth:

    • Allow for document growth when data will be updated
    • Be cautious with unbounded arrays
  4. Limit document size:

    • Keep documents below the 16MB limit
    • Split large content (like binary data) into GridFS
  5. Use appropriate data types:

    • Use BSON types that match your needs
    • Consider ObjectId for unique identifiers
  6. Plan for indexes:

    • Index fields used in query filters, sorts, and joins
    • Be mindful of index size and maintenance overhead
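
A minimal shell sketch of guideline 2, reusing the user and order shapes from earlier examples:

// Embed: addresses are few and always read together with the user
db.users.insertOne({
  name: "John Doe",
  addresses: [
    { street: "123 Main St", city: "Anytown", state: "CA" }
  ]
})

// Reference: orders are many, grow over time, and are queried separately
db.orders.insertOne({
  user_id: ObjectId("60a6e3e89f1c6a8d556884b2"),
  total: 1999.99,
  status: "shipped"
})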

Operational Excellence

  1. Monitoring and alerting:

    • Monitor system metrics (CPU, memory, disk I/O)
    • Monitor MongoDB metrics (operations, connections, queues)
    • Set up alerting for critical thresholds (a few shell health checks are sketched after this list)
  2. Backup strategy:

    • Schedule regular backups
    • Test restore processes
    • Consider point-in-time recovery for critical data
  3. Capacity planning:

    • Estimate data growth
    • Plan for increased load
    • Provision resources accordingly
  4. Security practices:

    • Use authentication and authorization
    • Encrypt data in transit and at rest
    • Regularly audit access and permissions
  5. Upgrade strategy:

    • Stay current with versions
    • Test upgrades in non-production environments
    • Schedule maintenance windows for upgrades
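
For point 1, a few health checks built into the mongo shell (alerting thresholds and dashboards are left to your monitoring stack):

db.serverStatus().connections   // current vs. available connections
db.serverStatus().opcounters    // insert/query/update/delete counters
db.stats()                      // per-database storage statistics
rs.status()                     // replica set member health (replica sets only)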

Summary

MongoDB is a powerful, flexible document database that excels in scenarios requiring:

  1. Flexible schema for evolving data models
  2. Horizontal scalability for growing applications
  3. High write throughput for data-intensive applications
  4. Rich query capabilities including geospatial and text search
  5. Developer productivity with intuitive data models

While not suitable for every use case (especially those requiring complex transactions across multiple entities), MongoDB provides a compelling alternative to traditional relational databases for many modern application patterns.

...
MongoDB usage example with Python

17501 • Apr 5, 2025

Background

I recently got really interested in the NoSQL database MongoDB, as I had no prior experience working with it. I roughly know the technology's use cases and can use it in simple ways, but what it is best known for is its support for handling unstructured data. Unlike SQL, it does not enforce a defined structure on the objects in our collections.

Here is an example of one such use case in Python (GenAI-generated):

Here's an implementation approach using Pydantic for validation with MongoDB:


from typing import Optional, List, Literal, Union, Dict, Any
from datetime import datetime
from pydantic import BaseModel, Field, root_validator
from bson import ObjectId
from pymongo import MongoClient
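
# NOTE: this example targets the Pydantic v1 API (root_validator,
# __get_validators__, class Config); Pydantic v2 replaces these with
# model_validator, __get_pydantic_core_schema__, and model_config.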

# Helper for handling ObjectId in Pydantic
class PyObjectId(ObjectId):
    @classmethod
    def __get_validators__(cls):
        yield cls.validate

    @classmethod
    def validate(cls, v):
        if not ObjectId.is_valid(v):
            raise ValueError("Invalid ObjectId")
        return ObjectId(v)

    @classmethod
    def __modify_schema__(cls, field_schema):
        field_schema.update(type="string")


# Base Product Model with common fields
class BaseProduct(BaseModel):
    id: Optional[PyObjectId] = Field(default_factory=PyObjectId, alias="_id")
    name: str
    brand: str
    price: float
    stock: int
    release_date: datetime
    product_type: Literal["phone", "laptop", "tablet"]
    description: str
    images: List[str] = []
    tags: List[str] = []
    active: bool = True
    created_at: datetime = Field(default_factory=datetime.now)
    updated_at: datetime = Field(default_factory=datetime.now)
    
    class Config:
        allow_population_by_field_name = True
        arbitrary_types_allowed = True
        json_encoders = {
            ObjectId: str,
            datetime: lambda dt: dt.isoformat()
        }


# Phone specific model
class PhoneProduct(BaseProduct):
    product_type: Literal["phone"] = "phone"
    screen_size: float
    battery_capacity: int  # mAh
    camera_mp: float
    storage_options: List[int]  # GB
    colors: List[str]
    os: str
    network: str  # 4G, 5G, etc.
    dimensions: Dict[str, float]  # height, width, depth
    weight: float  # grams
    
    # Phone-specific validation
    @root_validator
    def validate_phone(cls, values):
        if values.get("product_type") != "phone":
            raise ValueError("Product type must be 'phone'")
        return values


# Laptop specific model
class LaptopProduct(BaseProduct):
    product_type: Literal["laptop"] = "laptop"
    screen_size: float
    processor: str
    ram_options: List[int]  # GB
    storage_options: Dict[str, List[int]]  # Type (SSD/HDD) -> Sizes in GB
    gpu: Optional[str] = None
    battery_life: float  # hours
    os: str
    ports: Dict[str, int]  # port type -> number of ports
    weight: float  # kg
    is_touchscreen: bool = False
    
    # Laptop-specific validation
    @root_validator
    def validate_laptop(cls, values):
        if values.get("product_type") != "laptop":
            raise ValueError("Product type must be 'laptop'")
        return values


# Tablet specific model
class TabletProduct(BaseProduct):
    product_type: Literal["tablet"] = "tablet"
    screen_size: float
    battery_capacity: int  # mAh
    storage_options: List[int]  # GB
    processor: str
    ram: int  # GB
    camera_mp: Dict[str, float]  # front, back
    connectivity: List[str]  # wifi, cellular, etc.
    os: str
    pen_support: bool = False
    
    # Tablet-specific validation
    @root_validator
    def validate_tablet(cls, values):
        if values.get("product_type") != "tablet":
            raise ValueError("Product type must be 'tablet'")
        return values


# Union type for working with any product type
ProductType = Union[PhoneProduct, LaptopProduct, TabletProduct]


# MongoDB Repository for Products
class ProductRepository:
    def __init__(self, connection_string: str, db_name: str):
        self.client = MongoClient(connection_string)
        self.db = self.client[db_name]
        self.collection = self.db.products
    
    # Create product of any type
    def create_product(self, product: ProductType):
        product_dict = product.dict(by_alias=True)
        result = self.collection.insert_one(product_dict)
        return str(result.inserted_id)
    
    # Get product by ID
    def get_product(self, product_id: str) -> Optional[ProductType]:
        product = self.collection.find_one({"_id": ObjectId(product_id)})
        if not product:
            return None
        
        # Determine product type and return appropriate model
        if product["product_type"] == "phone":
            return PhoneProduct(**product)
        elif product["product_type"] == "laptop":
            return LaptopProduct(**product)
        elif product["product_type"] == "tablet":
            return TabletProduct(**product)
    
    # Get all products of a specific type
    def get_products_by_type(self, product_type: str) -> List[ProductType]:
        products = self.collection.find({"product_type": product_type})
        
        result = []
        for product in products:
            if product_type == "phone":
                result.append(PhoneProduct(**product))
            elif product_type == "laptop":
                result.append(LaptopProduct(**product))
            elif product_type == "tablet":
                result.append(TabletProduct(**product))
                
        return result
    
    # Update product
    def update_product(self, product_id: str, updated_product: ProductType):
        # Update the updated_at field
        product_dict = updated_product.dict(by_alias=True, exclude_unset=True)
        product_dict["updated_at"] = datetime.now()
        
        result = self.collection.update_one(
            {"_id": ObjectId(product_id)},
            {"$set": product_dict}
        )
        return result.modified_count > 0
    
    # Delete product
    def delete_product(self, product_id: str):
        result = self.collection.delete_one({"_id": ObjectId(product_id)})
        return result.deleted_count > 0


# Example usage
if __name__ == "__main__":
    # Create repository
    repo = ProductRepository("mongodb://localhost:27017", "electronics_store")
    
    # Create a phone product
    phone = PhoneProduct(
        name="Galaxy S22",
        brand="Samsung",
        price=899.99,
        stock=100,
        release_date=datetime(2022, 2, 25),
        description="Samsung's flagship phone for 2022",
        tags=["smartphone", "android", "samsung", "flagship"],
        screen_size=6.1,
        battery_capacity=3700,
        camera_mp=50.0,
        storage_options=[128, 256],
        colors=["Phantom Black", "Phantom White", "Green", "Pink Gold"],
        os="Android 12",
        network="5G",
        dimensions={"height": 146.0, "width": 70.6, "depth": 7.6},
        weight=167.0
    )
    
    # Insert phone
    phone_id = repo.create_product(phone)
    print(f"Created phone with ID: {phone_id}")
    
    # Create a laptop product
    laptop = LaptopProduct(
        name="MacBook Pro 14",
        brand="Apple",
        price=1999.99,
        stock=50,
        release_date=datetime(2021, 10, 26),
        description="Apple's professional laptop with M1 Pro chip",
        tags=["laptop", "macbook", "apple", "professional"],
        screen_size=14.2,
        processor="Apple M1 Pro",
        ram_options=[16, 32, 64],
        storage_options={"SSD": [512, 1024, 2048, 4096]},
        gpu="M1 Pro 16-core",
        battery_life=17.0,
        os="macOS Monterey",
        ports={"Thunderbolt 4": 3, "HDMI": 1, "SD Card": 1, "MagSafe": 1},
        weight=1.6
    )
    
    # Insert laptop
    laptop_id = repo.create_product(laptop)
    print(f"Created laptop with ID: {laptop_id}")
    
    # Get all phones
    phones = repo.get_products_by_type("phone")
    print(f"Found {len(phones)} phones")

This implementation provides a good balance between the structure of typed models and the flexibility that MongoDB offers. Here are the key aspects of this approach:

1. Base Structure with Common Fields

  • The BaseProduct class contains all common fields shared across product types
  • Core fields like ID, name, price, brand, etc. are consistent

2. Type-Specific Models

  • Each product type (phone, laptop, tablet) extends the base model
  • Type-specific fields are only present in their respective models
  • Each subclass validates its own product_type for data integrity

3. Type Discrimination

  • The product_type field serves as the discriminator to identify document types
  • This allows you to query all products or filter by specific types
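
For example, filtering on the discriminator directly in the mongo shell (the products collection name comes from the repository above):

db.products.find({ product_type: "phone" })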

4. Repository Pattern

  • The ProductRepository class encapsulates MongoDB operations
  • It handles conversion between Pydantic models and MongoDB documents
  • It automatically returns the right model type based on the product_type field

5. Validation Benefits

  • Pydantic ensures each document has the required fields for its type
  • Type hints provide excellent IDE support and documentation
  • The root_validator ensures product types match their models

This approach gives you the best of both worlds:

  1. The flexibility of MongoDB's schemaless design
  2. The safety and structure of strong type validation with Pydantic

When retrieving documents, the repository pattern intelligently converts them to the appropriate Pydantic model, ensuring you always get the right fields and validation for each product type.

...
Python cloud native engineer - EPAM job description

admin1 • Mar 23, 2025

Python Engineer

We seek a highly skilled Python Engineer with expertise in cloud computing and a strong focus on integrating AI capabilities into our projects. The ideal candidate will possess robust proficiency in Python and its frameworks, coupled with a deep understanding of at least one major cloud provider.

Additionally, familiarity with Large Language Models (LLM) and Retrieval Augmented Generation (RAG) is essential for seamlessly integrating AI capabilities.

Responsibilities

  • Facilitate development and deployment of cloud-native solutions, highlighting AI integration in our projects
  • Architect and launch AI-driven applications, leveraging Python frameworks such as Django, Flask or FastAPI
  • Integrate Large Language Models (LLM) and Retrieval Augmented Generation (RAG) into ongoing and upcoming projects to enhance language understanding and generation capabilities
  • Collaborate with cross-functional teams to understand project objectives and convert them into AI-driven technical solutions
  • Implement AI-based features and functionalities, utilizing cloud-native architectures and industry-standard practices
  • Write maintainable and well-documented code, adhering to coding standards and best practices
  • Stay updated with the latest advancements in Python, cloud computing, AI, and Cloud Native architectures, and proactively suggest innovative solutions to enhance our AI capabilities

Requirements

  • Proven expertise in Python programming language, with significant experience in AI integration
  • Proficiency in cloud computing with hands-on experience in major cloud platforms such as AWS, Azure, or Google Cloud Platform
  • Familiarity with Large Language Models (LLM) and Retrieval Augmented Generation (RAG)
  • Excellent problem-solving abilities and the capability to effectively collaborate within a team setting
  • Strong communication skills and the ability to clearly explain complex technical concepts to non-technical stakeholders

Nice to have

  • Knowledge of Cloud Native architectures and experience with tools like Kubernetes, Docker and microservices

We offer

We connect like-minded people:

  • Delivering innovative solutions to industry leaders, making a global impact
  • Enjoyable working environment, whether it is the vibrant office or the comfort of your own home
  • Opportunity to work abroad for up to two months per year
  • Relocation opportunities within our offices in 55+ countries
  • Corporate and social events

We invest in your growth:

  • Leadership development, career advising, soft skills and well-being programs
  • Certifications, including GCP, Azure and AWS
  • Unlimited access to LinkedIn Learning, Get Abstract, O'Reilly
  • Free English classes with certified teachers
  • Discounts in local language schools, including offline courses for the Uzbek language

We cover it all:

  • Monetary bonuses for engaging in the referral program
  • Medical & family care package
  • Four trust days per year (sick leave without a medical certificate)
  • Discounts for fitness clubs, dance schools and sports programs
  • Benefits package (sports activities, a variety of stores and services)
...