MongoDB: Comprehensive Guide

Apr 11, 2025

Introduction to MongoDB

MongoDB is a document-oriented, NoSQL database designed for high performance, high availability, and automatic scaling. Developed by MongoDB Inc. (formerly 10gen), it was first released in 2009 and has since become one of the most popular NoSQL databases in the world.

The name "MongoDB" comes from "humongous," reflecting its design goal to handle huge amounts of data efficiently. Unlike traditional relational databases, MongoDB stores data in flexible, JSON-like documents, which allows for variable structure among documents within the same collection.

MongoDB was designed to address several shortcomings of traditional SQL databases:

  1. Flexibility: The ability to store and process unstructured or semi-structured data
  2. Scalability: Built from the ground up to scale horizontally across multiple servers
  3. Performance: Optimized for high write throughput and query performance
  4. Developer Productivity: Intuitive data model that aligns with modern programming languages

MongoDB Architecture and Core Concepts

Documents

The fundamental unit of data in MongoDB is a document. A document is a set of key-value pairs, similar to JSON objects, but stored in a format called BSON (Binary JSON). Documents allow embedding complex structures like arrays and nested documents.

Example of a MongoDB document:

{
  "_id": ObjectId("60a6e3e89f1c6a8d556884b2"),
  "name": "John Doe",
  "email": "john.doe@example.com",
  "age": 30,
  "address": {
    "street": "123 Main St",
    "city": "Anytown",
    "state": "CA",
    "zip": "12345"
  },
  "hobbies": ["reading", "hiking", "photography"],
  "created_at": ISODate("2021-05-20T15:30:00Z")
}

Key characteristics of documents:

  • Maximum size of 16MB per document
  • Field names must be strings
  • Field values can be any BSON data type
  • Field order is preserved during insertion
  • Case-sensitive field names
  • Each document requires a unique _id field that acts as a primary key
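
A minimal PyMongo sketch of the auto-generated _id (the connection string, database, and collection names are placeholders):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
users = client["mydatabase"]["users"]

# When _id is omitted, the driver generates an ObjectId before sending the insert
result = users.insert_one({"name": "John Doe"})
print(result.inserted_id)  # e.g. ObjectId('60a6e3e89f1c6a8d556884b2')

# Documents larger than the 16MB BSON limit are rejected
# users.insert_one({"blob": "x" * 17_000_000})  # raises DocumentTooLarge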

Collections

Collections are groups of related documents, conceptually similar to tables in relational databases. However, unlike tables, collections don't enforce a schema across documents. Documents within the same collection can have different fields and structures.

For example, a users collection might contain documents representing user profiles, while an orders collection would contain documents representing customer orders.

Collections are organized within databases and follow these naming conventions:

  • Cannot be empty strings
  • Cannot contain the null character
  • Cannot begin with "system." (reserved prefix)
  • Cannot contain the $ character (reserved for certain operations)
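
Because no schema is enforced, documents of different shapes can sit side by side. A small sketch, reusing the users collection and the db handle assumed throughout this guide:

# Two differently shaped documents in the same collection
db.users.insert_many([
    {"name": "John Doe", "email": "john.doe@example.com"},
    {"name": "Jane Smith", "email": "jane@example.com",
     "address": {"city": "Anytown", "state": "CA"},
     "hobbies": ["reading"]},
])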

Databases

A MongoDB instance can host multiple databases, each containing its own collections. Databases are the highest level of data organization in MongoDB and provide isolation for collections.

Some special databases include:

  • admin: Used for administrative operations
  • local: Stores data specific to a single server
  • config: Used by sharded clusters to store configuration information

BSON Format

MongoDB stores data in BSON (Binary JSON) format, which extends the JSON model to provide additional data types and to be more efficient for storage and traversal. BSON supports the following data types:

  • String: UTF-8 encoded strings
  • Integer: 32-bit or 64-bit integers
  • Double: 64-bit IEEE 754 floating point numbers
  • Boolean: true or false
  • Array: Ordered lists of values
  • Object: Embedded documents
  • ObjectId: 12-byte identifier typically used for _id fields
  • Date: Stored as 64-bit integers representing milliseconds since the Unix epoch
  • Null: Represents a null value
  • Regular Expression: For pattern matching
  • Binary Data: For storing binary data
  • Timestamp: MongoDB internal timestamp type
  • Decimal128: IEEE 754 decimal-based floating-point number

Example of converting between JSON and BSON in Python:

import datetime
import json

from bson import ObjectId, json_util

# JSON cannot directly encode ObjectId, Date, etc.
document = {
    "_id": ObjectId("60a6e3e89f1c6a8d556884b2"),
    "name": "John Doe",
    "created_at": datetime.datetime.utcnow()
}

# Use json_util from pymongo to handle BSON types
json_str = json.dumps(document, default=json_util.default)
print(json_str)

# Convert back to Python dict with BSON types
parsed_document = json.loads(json_str, object_hook=json_util.object_hook)
print(parsed_document)

Key Differences from SQL Databases

Schema Design

SQL Databases:

  • Rigid schema defined at table creation
  • Relationships maintained through foreign keys
  • Normalization encouraged to avoid data duplication
  • Schema changes require migrations

MongoDB:

  • Flexible schema-less design
  • Documents can evolve over time
  • Embedding related data directly in documents
  • Denormalization often encouraged for performance

Example of normalized SQL tables vs. MongoDB document:

SQL Tables:

CREATE TABLE customers (
    id SERIAL PRIMARY KEY,
    name VARCHAR(100),
    email VARCHAR(100),
    phone VARCHAR(20)
);

CREATE TABLE addresses (
    id SERIAL PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(id),
    street VARCHAR(100),
    city VARCHAR(50),
    state VARCHAR(20),
    zip VARCHAR(10)
);

CREATE TABLE orders (
    id SERIAL PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(id),
    order_date TIMESTAMP,
    status VARCHAR(20)
);

MongoDB Document:

{
  "_id": ObjectId("60a6e3e89f1c6a8d556884b2"),
  "name": "John Doe",
  "email": "john.doe@example.com",
  "phone": "555-123-4567",
  "addresses": [
    {
      "type": "home",
      "street": "123 Main St",
      "city": "Anytown",
      "state": "CA",
      "zip": "12345"
    },
    {
      "type": "work",
      "street": "456 Market St",
      "city": "Anytown",
      "state": "CA",
      "zip": "12345"
    }
  ],
  "orders": [
    {
      "order_id": ObjectId("60a6e3e89f1c6a8d556884c1"),
      "order_date": ISODate("2021-05-20T15:30:00Z"),
      "status": "shipped"
    },
    {
      "order_id": ObjectId("60a6e3e89f1c6a8d556884c2"),
      "order_date": ISODate("2021-05-25T10:15:00Z"),
      "status": "processing"
    }
  ]
}

Query Language

SQL Databases:

  • Standardized SQL language
  • Joins for relating data across tables
  • Complex transactions with multi-table updates

MongoDB:

  • JSON-like query syntax
  • No traditional joins (but has $lookup aggregation)
  • Query operators to navigate nested documents and arrays

Example query comparison:

SQL:

SELECT customers.name, orders.id, orders.order_date
FROM customers
JOIN orders ON customers.id = orders.customer_id
WHERE customers.email = 'john.doe@example.com'
AND orders.status = 'shipped';

MongoDB:

db.customers.find(
  { 
    "email": "john.doe@example.com",
    "orders.status": "shipped"
  },
  {
    "name": 1,
    "orders.$": 1
  }
)
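
If orders were stored in their own collection rather than embedded, the $lookup stage mentioned above can approximate the SQL join. A sketch in Python, assuming an orders collection keyed by customer_id:

pipeline = [
    {"$match": {"email": "john.doe@example.com"}},
    {"$lookup": {                      # left outer join against orders
        "from": "orders",
        "localField": "_id",
        "foreignField": "customer_id",
        "as": "orders",
    }},
    {"$unwind": "$orders"},
    {"$match": {"orders.status": "shipped"}},
    {"$project": {"name": 1, "orders._id": 1, "orders.order_date": 1}},
]
results = list(db.customers.aggregate(pipeline))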

Transactions and ACID Properties

SQL Databases:

  • Strong ACID guarantees
  • Long-established transaction support
  • Well-suited for financial applications

MongoDB:

  • Atomic single-document operations by default (a write to one document, including its embedded data, is ACID)
  • Multi-document transactions available since version 4.0
  • Distributed transactions across shards since version 4.2

Example of a MongoDB transaction:

const session = db.getMongo().startSession();
session.startTransaction();

try {
  const accounts = session.getDatabase("bank").accounts;
  
  // Withdraw from account A
  accounts.updateOne(
    { account_id: "A" }, 
    { $inc: { balance: -100 } }
  );
  
  // Deposit to account B
  accounts.updateOne(
    { account_id: "B" }, 
    { $inc: { balance: 100 } }
  );
  
  session.commitTransaction();
} catch (error) {
  session.abortTransaction();
  throw error;
} finally {
  session.endSession();
}
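
Python equivalent with PyMongo (multi-document transactions require a replica set or sharded cluster, so the replicaSet parameter here is a placeholder for your deployment):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
accounts = client["bank"]["accounts"]

with client.start_session() as session:
    # start_transaction() commits on a clean exit and aborts on an exception
    with session.start_transaction():
        accounts.update_one({"account_id": "A"}, {"$inc": {"balance": -100}}, session=session)
        accounts.update_one({"account_id": "B"}, {"$inc": {"balance": 100}}, session=session)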

Scaling Approach

SQL Databases:

  • Traditionally scale vertically (bigger machines)
  • Replication for high availability
  • Partitioning/sharding often complex to set up

MongoDB:

  • Built for horizontal scaling (more machines)
  • Native sharding capabilities
  • Auto-balancing of data across shards
  • Replica sets for high availability

Entity Relationships in MongoDB

Unlike relational databases that use tables and foreign keys to model relationships, MongoDB uses two main strategies to represent relationships between entities: embedding and referencing. Understanding when to use each approach is crucial for effective MongoDB schema design.

One-to-One Relationships

In a one-to-one relationship, one document in a collection is related to exactly one document in the same or another collection.

Embedded One-to-One Relationship

For one-to-one relationships, embedding is often the most efficient approach:

// User document with embedded profile (1:1)
{
  "_id": ObjectId("60a6e3e89f1c6a8d556884b2"),
  "username": "johndoe",
  "email": "john@example.com",
  "profile": {
    "first_name": "John",
    "last_name": "Doe",
    "date_of_birth": ISODate("1990-01-15"),
    "address": {
      "street": "123 Main St",
      "city": "New York",
      "state": "NY",
      "zip": "10001"
    },
    "phone": "+1-555-123-4567"
  }
}

Python implementation with Pydantic:

from pydantic import BaseModel, Field
from typing import Optional
from datetime import date

class Address(BaseModel):
    street: str
    city: str
    state: str
    zip: str

class Profile(BaseModel):
    first_name: str
    last_name: str
    date_of_birth: date
    address: Address
    phone: Optional[str] = None

class User(BaseModel):
    username: str
    email: str
    profile: Profile

Referenced One-to-One Relationship

In some cases, referencing is better for one-to-one relationships:

// User document
{
  "_id": ObjectId("60a6e3e89f1c6a8d556884b2"),
  "username": "johndoe",
  "email": "john@example.com"
}

// Profile document
{
  "_id": ObjectId("60a6e3e89f1c6a8d556884c3"),
  "user_id": ObjectId("60a6e3e89f1c6a8d556884b2"),  // Reference to user
  "first_name": "John",
  "last_name": "Doe",
  "date_of_birth": ISODate("1990-01-15"),
  "address": {
    "street": "123 Main St",
    "city": "New York",
    "state": "NY",
    "zip": "10001"
  },
  "phone": "+1-555-123-4567"
}

When to use references for one-to-one:

  • When the embedded document is large and rarely accessed
  • When the embedded document changes frequently
  • When the embedded document needs to be accessed separately

Python implementation with PyMongo:

# Create user and profile with reference
user_id = db.users.insert_one({
    "username": "johndoe",
    "email": "john@example.com"
}).inserted_id

profile = {
    "user_id": user_id,
    "first_name": "John",
    "last_name": "Doe",
    "date_of_birth": datetime(1990, 1, 15),
    "address": {
        "street": "123 Main St",
        "city": "New York",
        "state": "NY",
        "zip": "10001"
    },
    "phone": "+1-555-123-4567"
}
db.profiles.insert_one(profile)

# Retrieve user with profile
user = db.users.find_one({"username": "johndoe"})
user_profile = db.profiles.find_one({"user_id": user["_id"]})

One-to-Many Relationships

In a one-to-many relationship, one document in a collection is related to multiple documents in another collection.

Embedded One-to-Many Relationship (Array of Embedded Documents)

When the "many" side is relatively small and stable:

// Product document with embedded reviews (1:Many)
{
  "_id": ObjectId("60a6e3e89f1c6a8d556884b2"),
  "name": "Smartphone X",
  "price": 999.99,
  "category": "electronics",
  "reviews": [
    {
      "user_id": ObjectId("60a6e3e89f1c6a8d556884c1"),
      "username": "user123",
      "rating": 5,
      "text": "Great product!",
      "date": ISODate("2021-05-20T15:30:00Z")
    },
    {
      "user_id": ObjectId("60a6e3e89f1c6a8d556884c2"),
      "username": "user456",
      "rating": 4,
      "text": "Good but expensive",
      "date": ISODate("2021-05-25T10:15:00Z")
    }
  ]
}

Python implementation with Pydantic:

from pydantic import BaseModel, Field
from typing import List, Optional
from datetime import datetime
from bson import ObjectId

class PyObjectId(ObjectId):
    @classmethod
    def __get_validators__(cls):
        yield cls.validate
        
    @classmethod
    def validate(cls, v):
        if not ObjectId.is_valid(v):
            raise ValueError("Invalid ObjectId")
        return ObjectId(v)

class Review(BaseModel):
    user_id: PyObjectId
    username: str
    rating: int
    text: str
    date: datetime = Field(default_factory=datetime.now)
    
    class Config:
        arbitrary_types_allowed = True
        json_encoders = {ObjectId: str}

class Product(BaseModel):
    name: str
    price: float
    category: str
    reviews: List[Review] = []
    
    class Config:
        arbitrary_types_allowed = True

Referenced One-to-Many Relationship (Child References)

When the "many" side is large or frequently changing:

// Blog post document (parent)
{
  "_id": ObjectId("60a6e3e89f1c6a8d556884b2"),
  "title": "Introduction to MongoDB",
  "content": "MongoDB is a document database...",
  "author": "John Doe",
  "date": ISODate("2021-05-20T15:30:00Z")
}

// Comment documents (children)
{
  "_id": ObjectId("60a6e3e89f1c6a8d556884c1"),
  "post_id": ObjectId("60a6e3e89f1c6a8d556884b2"),  // Reference to post
  "user": "Alice",
  "text": "Great article!",
  "date": ISODate("2021-05-20T16:30:00Z")
}

{
  "_id": ObjectId("60a6e3e89f1c6a8d556884c2"),
  "post_id": ObjectId("60a6e3e89f1c6a8d556884b2"),  // Reference to post
  "user": "Bob",
  "text": "Thanks for sharing.",
  "date": ISODate("2021-05-21T10:15:00Z")
}

Python implementation with PyMongo:

from datetime import datetime, timedelta

# Create blog post
post_id = db.posts.insert_one({
    "title": "Introduction to MongoDB",
    "content": "MongoDB is a document database...",
    "author": "John Doe",
    "date": datetime.now()
}).inserted_id

# Add comments referencing the post
comments = [
    {
        "post_id": post_id,
        "user": "Alice",
        "text": "Great article!",
        "date": datetime.now()
    },
    {
        "post_id": post_id,
        "user": "Bob",
        "text": "Thanks for sharing.",
        "date": datetime.now() + timedelta(hours=1)
    }
]
db.comments.insert_many(comments)

# Retrieve post with comments
post = db.posts.find_one({"_id": post_id})
post_comments = list(db.comments.find({"post_id": post_id}).sort("date", 1))

Referenced One-to-Many Relationship (Parent Reference)

Another approach for one-to-many is to have children reference their parent:

// Department document (one)
{
  "_id": ObjectId("60a6e3e89f1c6a8d556884b2"),
  "name": "Engineering",
  "location": "Building A"
}

// Employee documents (many) with parent reference
{
  "_id": ObjectId("60a6e3e89f1c6a8d556884c1"),
  "name": "John Doe",
  "position": "Software Engineer",
  "department_id": ObjectId("60a6e3e89f1c6a8d556884b2")  // Reference to department
}

{
  "_id": ObjectId("60a6e3e89f1c6a8d556884c2"),
  "name": "Jane Smith",
  "position": "QA Engineer",
  "department_id": ObjectId("60a6e3e89f1c6a8d556884b2")  // Reference to department
}

Python implementation with PyMongo:

# Create department
dept_id = db.departments.insert_one({
    "name": "Engineering",
    "location": "Building A"
}).inserted_id

# Create employees with department reference
employees = [
    {
        "name": "John Doe",
        "position": "Software Engineer",
        "department_id": dept_id
    },
    {
        "name": "Jane Smith",
        "position": "QA Engineer",
        "department_id": dept_id
    }
]
db.employees.insert_many(employees)

# Find all employees in a department
dept_employees = list(db.employees.find({"department_id": dept_id}))

Many-to-Many Relationships

In a many-to-many relationship, documents in both collections can be related to multiple documents in the other collection.

Embedded Many-to-Many Relationship

For many-to-many relationships with limited size:

// Student document with embedded courses
{
  "_id": ObjectId("60a6e3e89f1c6a8d556884b2"),
  "name": "John Doe",
  "courses": [
    {
      "course_id": ObjectId("60a6e3e89f1c6a8d556884c1"),
      "name": "Introduction to MongoDB",
      "instructor": "Prof. Smith",
      "enrolled_date": ISODate("2021-01-15")
    },
    {
      "course_id": ObjectId("60a6e3e89f1c6a8d556884c2"),
      "name": "Web Development",
      "instructor": "Prof. Johnson",
      "enrolled_date": ISODate("2021-02-10")
    }
  ]
}

// Course document with embedded students
{
  "_id": ObjectId("60a6e3e89f1c6a8d556884c1"),
  "name": "Introduction to MongoDB",
  "instructor": "Prof. Smith",
  "students": [
    {
      "student_id": ObjectId("60a6e3e89f1c6a8d556884b2"),
      "name": "John Doe",
      "enrolled_date": ISODate("2021-01-15")
    },
    {
      "student_id": ObjectId("60a6e3e89f1c6a8d556884b3"),
      "name": "Jane Smith",
      "enrolled_date": ISODate("2021-01-20")
    }
  ]
}

Note: This approach duplicates data and can be difficult to maintain as both sides need to be updated when changes occur.

Referenced Many-to-Many Relationship

A better approach is often to use a separate collection to model the relationship:

// Student documents
{
  "_id": ObjectId("60a6e3e89f1c6a8d556884b2"),
  "name": "John Doe",
  "email": "john@example.com"
}

{
  "_id": ObjectId("60a6e3e89f1c6a8d556884b3"),
  "name": "Jane Smith",
  "email": "jane@example.com"
}

// Course documents
{
  "_id": ObjectId("60a6e3e89f1c6a8d556884c1"),
  "name": "Introduction to MongoDB",
  "instructor": "Prof. Smith"
}

{
  "_id": ObjectId("60a6e3e89f1c6a8d556884c2"),
  "name": "Web Development",
  "instructor": "Prof. Johnson"
}

// Enrollments collection (junction/join collection)
{
  "_id": ObjectId("60a6e3e89f1c6a8d556884d1"),
  "student_id": ObjectId("60a6e3e89f1c6a8d556884b2"),
  "course_id": ObjectId("60a6e3e89f1c6a8d556884c1"),
  "enrolled_date": ISODate("2021-01-15"),
  "grade": "A"
}

{
  "_id": ObjectId("60a6e3e89f1c6a8d556884d2"),
  "student_id": ObjectId("60a6e3e89f1c6a8d556884b2"),
  "course_id": ObjectId("60a6e3e89f1c6a8d556884c2"),
  "enrolled_date": ISODate("2021-02-10"),
  "grade": "B+"
}

{
  "_id": ObjectId("60a6e3e89f1c6a8d556884d3"),
  "student_id": ObjectId("60a6e3e89f1c6a8d556884b3"),
  "course_id": ObjectId("60a6e3e89f1c6a8d556884c1"),
  "enrolled_date": ISODate("2021-01-20"),
  "grade": "A-"
}

Python implementation with PyMongo:

# Create students
student1_id = db.students.insert_one({
    "name": "John Doe",
    "email": "john@example.com"
}).inserted_id

student2_id = db.students.insert_one({
    "name": "Jane Smith",
    "email": "jane@example.com"
}).inserted_id

# Create courses
course1_id = db.courses.insert_one({
    "name": "Introduction to MongoDB",
    "instructor": "Prof. Smith"
}).inserted_id

course2_id = db.courses.insert_one({
    "name": "Web Development",
    "instructor": "Prof. Johnson"
}).inserted_id

# Create enrollments
enrollments = [
    {
        "student_id": student1_id,
        "course_id": course1_id,
        "enrolled_date": datetime(2021, 1, 15),
        "grade": "A"
    },
    {
        "student_id": student1_id,
        "course_id": course2_id,
        "enrolled_date": datetime(2021, 2, 10),
        "grade": "B+"
    },
    {
        "student_id": student2_id,
        "course_id": course1_id,
        "enrolled_date": datetime(2021, 1, 20),
        "grade": "A-"
    }
]
db.enrollments.insert_many(enrollments)

# Find all courses for a student
def get_student_courses(student_id):
    # Get all enrollments for the student
    enrollments = list(db.enrollments.find({"student_id": student_id}))
    
    # Get the course details for each enrollment
    courses = []
    for enrollment in enrollments:
        course = db.courses.find_one({"_id": enrollment["course_id"]})
        # Add enrollment details to the course
        course["enrolled_date"] = enrollment["enrolled_date"]
        course["grade"] = enrollment["grade"]
        courses.append(course)
    
    return courses

# Find all students in a course
def get_course_students(course_id):
    # Get all enrollments for the course
    enrollments = list(db.enrollments.find({"course_id": course_id}))
    
    # Get the student details for each enrollment
    students = []
    for enrollment in enrollments:
        student = db.students.find_one({"_id": enrollment["student_id"]})
        # Add enrollment details to the student
        student["enrolled_date"] = enrollment["enrolled_date"]
        student["grade"] = enrollment["grade"]
        students.append(student)
    
    return students

Self-Referencing Relationships

Self-referencing relationships occur when documents in a collection reference other documents in the same collection.

Tree Structure (Hierarchical Data)

For representing hierarchical data like categories:

// Category documents with parent references
{
  "_id": ObjectId("60a6e3e89f1c6a8d556884b2"),
  "name": "Electronics",
  "parent_id": null  // Root category
}

{
  "_id": ObjectId("60a6e3e89f1c6a8d556884c1"),
  "name": "Computers",
  "parent_id": ObjectId("60a6e3e89f1c6a8d556884b2")  // Child of Electronics
}

{
  "_id": ObjectId("60a6e3e89f1c6a8d556884c2"),
  "name": "Laptops",
  "parent_id": ObjectId("60a6e3e89f1c6a8d556884c1")  // Child of Computers
}

Python implementation to get the full path:

def get_category_path(category_id):
    path = []
    current_id = category_id
    
    while current_id is not None:
        category = db.categories.find_one({"_id": current_id})
        if category is None:
            break
            
        path.insert(0, category["name"])  # Add to beginning of path
        current_id = category["parent_id"]
    
    return " > ".join(path)

# Example: "Electronics > Computers > Laptops"

Graph Structure (Network)

For representing graph-like data such as social networks:

// User documents with friend references
{
  "_id": ObjectId("60a6e3e89f1c6a8d556884b2"),
  "name": "John Doe",
  "friends": [
    ObjectId("60a6e3e89f1c6a8d556884c1"),
    ObjectId("60a6e3e89f1c6a8d556884c2")
  ]
}

{
  "_id": ObjectId("60a6e3e89f1c6a8d556884c1"),
  "name": "Jane Smith",
  "friends": [
    ObjectId("60a6e3e89f1c6a8d556884b2"),
    ObjectId("60a6e3e89f1c6a8d556884c2")
  ]
}

{
  "_id": ObjectId("60a6e3e89f1c6a8d556884c2"),
  "name": "Bob Johnson",
  "friends": [
    ObjectId("60a6e3e89f1c6a8d556884b2"),
    ObjectId("60a6e3e89f1c6a8d556884c1")
  ]
}

Python implementation to find mutual friends:

def get_mutual_friends(user1_id, user2_id):
    user1 = db.users.find_one({"_id": user1_id})
    user2 = db.users.find_one({"_id": user2_id})
    
    if not user1 or not user2:
        return []
    
    # Find the intersection of friend lists
    mutual_friend_ids = set(user1["friends"]) & set(user2["friends"])
    
    # Get the details of mutual friends
    mutual_friends = list(db.users.find({"_id": {"$in": list(mutual_friend_ids)}}))
    
    return mutual_friends

Choosing Between Embedding and Referencing

When deciding whether to embed or reference related data, consider these factors:

Advantages of Embedding

  1. Performance: Embedded documents are retrieved in a single query
  2. Atomicity: All related data is updated in a single operation
  3. Consistency: Related data is always in sync

Advantages of Referencing

  1. Document Size: Prevents documents from exceeding the 16MB limit
  2. Duplication: Avoids data duplication
  3. Flexibility: Allows independent access and updates to related data
  4. Complex Relationships: Better for many-to-many relationships

Decision Criteria

| Criteria           | Embed                           | Reference                          |
|--------------------|---------------------------------|------------------------------------|
| Relationship       | One-to-one or one-to-few        | One-to-many or many-to-many        |
| Data Size          | Small embedded documents        | Large related documents            |
| Access Pattern     | Always accessed together        | Often accessed separately          |
| Update Frequency   | Rarely changes                  | Frequently changes                 |
| Growth             | Limited, predictable growth     | Unbounded growth                   |
| Query Requirements | Simple queries on embedded data | Complex queries across collections |

Hybrid Approaches

Sometimes a hybrid approach works best:

// Order document with both embedded and referenced data
{
  "_id": ObjectId("60a6e3e89f1c6a8d556884b2"),
  "order_number": "ORD-12345",
  "date": ISODate("2021-05-20T15:30:00Z"),
  "status": "shipped",
  
  // Referenced customer (frequently accessed separately)
  "customer_id": ObjectId("60a6e3e89f1c6a8d556884c1"),
  
  // Embedded summary of customer info (frequently accessed together)
  "customer_summary": {
    "name": "John Doe",
    "email": "john@example.com",
    "shipping_address": {
      "street": "123 Main St",
      "city": "New York",
      "state": "NY",
      "zip": "10001"
    }
  },
  
  // Embedded line items (always accessed with the order)
  "items": [
    {
      "product_id": ObjectId("60a6e3e89f1c6a8d556884d1"),
      "name": "Smartphone X",
      "price": 999.99,
      "quantity": 1
    },
    {
      "product_id": ObjectId("60a6e3e89f1c6a8d556884d2"),
      "name": "Wireless Earbuds",
      "price": 199.99,
      "quantity": 2
    }
  ],
  
  "total": 1399.97
}

This approach gives you the best of both worlds:

  • The order document contains embedded items for atomic updates and single-query retrieval
  • It references the full customer document for detailed information
  • It includes a customer summary to avoid an extra query for common operations

MongoDB Under the Hood

Storage Engine

The storage engine is responsible for managing how data is stored on disk and in memory. MongoDB's default storage engine is WiredTiger (since version 3.2), which offers:

  1. Document-Level Concurrency: Multiple clients can modify different documents in a collection simultaneously
  2. Compression: Both data and indexes are compressed by default
  3. Journaling: Write operations are recorded in a journal for durability
  4. Checkpoints: Creates consistent snapshots of data files every 60 seconds by default

WiredTiger uses a B-tree data structure for storage, with pages of data cached in RAM and written to disk during checkpoints.

Other important aspects of the storage engine:

  • Write Ahead Log (WAL): Ensures data durability by logging operations before they are applied
  • Snapshot Isolation: Readers see a consistent snapshot of data at a point in time
  • Checkpoint Process: Flushes in-memory changes to disk periodically
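
From Python, the serverStatus command shows which engine is active (exact field names vary somewhat by server version):

status = db.command("serverStatus")
print(status["storageEngine"]["name"])  # "wiredTiger" on modern deployments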

Indexing

MongoDB supports several types of indexes to optimize query performance:

  1. Single Field Index: Index on one field

    db.users.createIndex({ "email": 1 })  // 1 for ascending order
    
  2. Compound Index: Index on multiple fields

    db.products.createIndex({ "category": 1, "price": -1 })  // -1 for descending order
    
  3. Multikey Index: Automatically created when indexing an array field

    db.posts.createIndex({ "tags": 1 })  // Will index each element in the tags array
    
  4. Text Index: For text search capabilities

    db.articles.createIndex({ "content": "text" })
    
  5. Geospatial Index: For location-based queries

    db.places.createIndex({ "location": "2dsphere" })
    
  6. Hashed Index: For hash-based sharding

    db.users.createIndex({ "_id": "hashed" })
    

Indexes in MongoDB are implemented as B-trees and stored separately from the collection data.

Query Optimization

MongoDB's query optimizer selects the most efficient query plan based on:

  1. Query Shape: The structure of the query (which fields, operators, etc.)
  2. Available Indexes: Which indexes could potentially be used
  3. Collection Statistics: Size of the collection and distribution of values
  4. Query Execution History: Results of previous similar queries

The query plan cache stores successful query plans to avoid repeated planning for similar queries.

To analyze query performance, MongoDB provides the explain() method:

db.users.find({ "status": "active", "age": { $gt: 21 } }).explain("executionStats")

This returns detailed information about:

  • Which indexes were considered
  • Which index was chosen
  • Number of documents examined
  • Execution time
  • Stages of the query plan
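
The Python equivalent uses Cursor.explain(), which wraps the same command:

plan = db.users.find({"status": "active", "age": {"$gt": 21}}).explain()
print(plan["queryPlanner"]["winningPlan"])  # chosen plan, e.g. an IXSCAN stage
# When the server executes the query while explaining, document and timing
# counters appear under plan["executionStats"]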

Memory Management

MongoDB employs a tiered storage model:

  1. Working Set: Active portion of data and indexes that fits in RAM
  2. Disk Storage: Full dataset stored on disk

WiredTiger manages memory through:

  • Cache: Sized relative to system RAM (by default, the larger of 50% of RAM minus 1 GB, or 256 MB)
  • Eviction: Removing less frequently used data from cache when approaching memory limits
  • Page Replacement: Algorithm to decide which pages to evict

Memory usage can be monitored with:

db.serverStatus().wiredTiger.cache

MongoDB Operations

CRUD Operations

Create

MongoDB provides several methods to insert documents:

  1. Insert a single document:

    db.users.insertOne({
      name: "John Doe",
      email: "john.doe@example.com",
      age: 30
    })
    
  2. Insert multiple documents:

    db.users.insertMany([
      { name: "Jane Smith", email: "jane@example.com", age: 28 },
      { name: "Bob Johnson", email: "bob@example.com", age: 35 }
    ])
    

Python equivalent with PyMongo:

from pymongo import MongoClient

client = MongoClient('mongodb://localhost:27017/')
db = client['mydatabase']
users = db['users']

# Insert one document
result = users.insert_one({
    "name": "John Doe",
    "email": "john.doe@example.com",
    "age": 30
})
print(f"Inserted document with ID: {result.inserted_id}")

# Insert multiple documents
results = users.insert_many([
    {"name": "Jane Smith", "email": "jane@example.com", "age": 28},
    {"name": "Bob Johnson", "email": "bob@example.com", "age": 35}
])
print(f"Inserted {len(results.inserted_ids)} documents")

Read

MongoDB offers flexible query capabilities:

  1. Find all documents in a collection:

    db.users.find()
    
  2. Find documents matching specific criteria:

    db.users.find({ age: { $gt: 25 } })  // Users older than 25
    
  3. Find one document:

    db.users.findOne({ email: "john.doe@example.com" })
    
  4. Projection (selecting specific fields):

    db.users.find({ age: { $gt: 25 } }, { name: 1, email: 1, _id: 0 })
    
  5. Limit results:

    db.users.find().limit(10)
    
  6. Skip results (for pagination):

    db.users.find().skip(10).limit(10)  // Second page of 10 results
    
  7. Sort results:

    db.users.find().sort({ age: -1 })  // Sort by age descending
    

Python equivalent with PyMongo:

# Find all users
all_users = list(users.find())

# Find users older than 25
older_users = list(users.find({"age": {"$gt": 25}}))

# Find one user by email
user = users.find_one({"email": "john.doe@example.com"})

# Projection
user_names = list(users.find({}, {"name": 1, "_id": 0}))

# Pagination
page_size = 10
page_num = 2
paginated_users = list(users.find().skip((page_num - 1) * page_size).limit(page_size))

# Sorting
sorted_users = list(users.find().sort("age", -1))  # -1 for descending

Update

MongoDB provides multiple ways to update documents:

  1. Update a single document:

    db.users.updateOne(
      { email: "john.doe@example.com" },
      { $set: { age: 31, status: "active" } }
    )
    
  2. Update multiple documents:

    db.users.updateMany(
      { age: { $lt: 30 } },
      { $set: { category: "young" } }
    )
    
  3. Replace a document:

    db.users.replaceOne(
      { email: "john.doe@example.com" },
      {
        name: "John Doe",
        email: "john.doe@example.com",
        age: 31,
        address: { city: "New York", zip: "10001" }
      }
    )
    
  4. Update operators:

    • $set: Set field values
    • $inc: Increment field values
    • $push: Add elements to arrays
    • $pull: Remove elements from arrays
    • $addToSet: Add elements to arrays without duplicates
    • $unset: Remove fields

Python equivalent with PyMongo:

# Update one document
users.update_one(
    {"email": "john.doe@example.com"},
    {"$set": {"age": 31, "status": "active"}}
)

# Update multiple documents
result = users.update_many(
    {"age": {"$lt": 30}},
    {"$set": {"category": "young"}}
)
print(f"Modified {result.modified_count} documents")

# Replace a document
users.replace_one(
    {"email": "john.doe@example.com"},
    {
        "name": "John Doe",
        "email": "john.doe@example.com",
        "age": 31,
        "address": {"city": "New York", "zip": "10001"}
    }
)

# Using update operators
users.update_one(
    {"email": "john.doe@example.com"},
    {
        "$inc": {"login_count": 1},
        "$push": {"login_history": datetime.now()},
        "$set": {"last_login": datetime.now()}
    }
)

Delete

MongoDB provides methods to remove documents:

  1. Delete a single document:

    db.users.deleteOne({ email: "john.doe@example.com" })
    
  2. Delete multiple documents:

    db.users.deleteMany({ status: "inactive" })
    
  3. Delete all documents in a collection:

    db.users.deleteMany({})
    

Python equivalent with PyMongo:

# Delete one document
result = users.delete_one({"email": "john.doe@example.com"})
print(f"Deleted {result.deleted_count} document")

# Delete multiple documents
result = users.delete_many({"status": "inactive"})
print(f"Deleted {result.deleted_count} documents")

# Clear collection
users.delete_many({})

Aggregation Framework

MongoDB's Aggregation Framework provides a powerful way to process and transform data within the database. It uses a pipeline approach where documents pass through stages that modify them.

Common aggregation stages:

  1. $match: Filter documents (similar to find)

    { $match: { status: "active" } }
    
  2. $group: Group documents by a key

    { $group: { _id: "$department", totalEmployees: { $sum: 1 } } }
    
  3. $sort: Sort documents

    { $sort: { age: -1 } }
    
  4. $project: Reshape documents (select/compute fields)

    { $project: { name: 1, firstLetter: { $substr: ["$name", 0, 1] } } }
    
  5. $unwind: Deconstruct array fields

    { $unwind: "$tags" }
    
  6. $lookup: Perform a join with another collection

    {
      $lookup: {
        from: "orders",
        localField: "_id",
        foreignField: "customer_id",
        as: "customer_orders"
      }
    }
    

Example of a complex aggregation pipeline:

db.sales.aggregate([
  // Stage 1: Filter for completed sales
  { $match: { status: "completed" } },
  
  // Stage 2: Group by product and calculate revenue
  { $group: {
      _id: "$product_id",
      totalRevenue: { $sum: { $multiply: ["$price", "$quantity"] } },
      count: { $sum: 1 }
  }},
  
  // Stage 3: Sort by revenue
  { $sort: { totalRevenue: -1 } },
  
  // Stage 4: Limit to top 5
  { $limit: 5 },
  
  // Stage 5: Join with products collection
  { $lookup: {
      from: "products",
      localField: "_id",
      foreignField: "_id",
      as: "product_info"
  }},
  
  // Stage 6: Reshape the output
  { $project: {
      _id: 0,
      product: { $arrayElemAt: ["$product_info.name", 0] },
      totalRevenue: 1,
      count: 1
  }}
])

Python equivalent with PyMongo:

pipeline = [
    # Stage 1: Filter for completed sales
    {"$match": {"status": "completed"}},
    
    # Stage 2: Group by product and calculate revenue
    {"$group": {
        "_id": "$product_id",
        "totalRevenue": {"$sum": {"$multiply": ["$price", "$quantity"]}},
        "count": {"$sum": 1}
    }},
    
    # Stage 3: Sort by revenue
    {"$sort": {"totalRevenue": -1}},
    
    # Stage 4: Limit to top 5
    {"$limit": 5},
    
    # Stage 5: Join with products collection
    {"$lookup": {
        "from": "products",
        "localField": "_id",
        "foreignField": "_id",
        "as": "product_info"
    }},
    
    # Stage 6: Reshape the output
    {"$project": {
        "_id": 0,
        "product": {"$arrayElemAt": ["$product_info.name", 0]},
        "totalRevenue": 1,
        "count": 1
    }}
]

top_products = list(db.sales.aggregate(pipeline))

Text Search

MongoDB provides text search capabilities for string content:

  1. Create a text index:

    db.articles.createIndex({ title: "text", content: "text" })
    
  2. Perform a text search:

    db.articles.find({ $text: { $search: "mongodb database" } })
    
  3. Sort by relevance score:

    db.articles.find(
      { $text: { $search: "mongodb database" } },
      { score: { $meta: "textScore" } }
    ).sort({ score: { $meta: "textScore" } })
    

Python equivalent with PyMongo:

# Create text index
db.articles.create_index([("title", "text"), ("content", "text")])

# Perform text search
results = list(db.articles.find({"$text": {"$search": "mongodb database"}}))

# Sort by relevance score
results = list(db.articles.find(
    {"$text": {"$search": "mongodb database"}},
    {"score": {"$meta": "textScore"}}
).sort([("score", {"$meta": "textScore"})]))

Geospatial Queries

MongoDB supports geospatial queries for location-based applications:

  1. Create a geospatial index:

    db.places.createIndex({ location: "2dsphere" })
    
  2. Store location data using GeoJSON:

    db.places.insertOne({
      name: "Central Park",
      location: {
        type: "Point",
        coordinates: [-73.97, 40.77]  // [longitude, latitude]
      }
    })
    
  3. Find places near a point:

    db.places.find({
      location: {
        $near: {
          $geometry: {
            type: "Point",
            coordinates: [-73.98, 40.76]
          },
          $maxDistance: 1000  // in meters
        }
      }
    })
    
  4. Find places within a polygon:

    db.places.find({
      location: {
        $geoWithin: {
          $geometry: {
            type: "Polygon",
            coordinates: [[
              [-74.0, 40.7],
              [-74.0, 40.8],
              [-73.9, 40.8],
              [-73.9, 40.7],
              [-74.0, 40.7]
            ]]
          }
        }
      }
    })
    

Python equivalent with PyMongo:

# Create geospatial index
db.places.create_index([("location", "2dsphere")])

# Insert a place with location
db.places.insert_one({
    "name": "Central Park",
    "location": {
        "type": "Point",
        "coordinates": [-73.97, 40.77]
    }
})

# Find places near a point
nearby_places = list(db.places.find({
    "location": {
        "$near": {
            "$geometry": {
                "type": "Point",
                "coordinates": [-73.98, 40.76]
            },
            "$maxDistance": 1000
        }
    }
}))
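
The $geoWithin query translates the same way; a sketch using the polygon from above:

polygon = {
    "type": "Polygon",
    "coordinates": [[
        [-74.0, 40.7], [-74.0, 40.8], [-73.9, 40.8],
        [-73.9, 40.7], [-74.0, 40.7]
    ]]
}
places_in_area = list(db.places.find({
    "location": {"$geoWithin": {"$geometry": polygon}}
}))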

MongoDB with Python

PyMongo Basics

PyMongo is the official MongoDB driver for Python:

from pymongo import MongoClient
from bson.objectid import ObjectId

# Connect to MongoDB
client = MongoClient('mongodb://localhost:27017/')
# or with authentication:
# client = MongoClient('mongodb://username:password@localhost:27017/')

# Access a database
db = client['mydatabase']

# Access a collection
collection = db['mycollection']

# Insert a document
result = collection.insert_one({
    'name': 'John Doe',
    'email': 'john@example.com'
})
print(f"Inserted document with ID: {result.inserted_id}")

# Find documents
documents = collection.find({'name': 'John Doe'})
for doc in documents:
    print(doc)

# Find by ID
document = collection.find_one({'_id': ObjectId('60a6e3e89f1c6a8d556884b2')})

# Update a document
result = collection.update_one(
    {'email': 'john@example.com'},
    {'$set': {'name': 'John Smith'}}
)
print(f"Modified {result.modified_count} document(s)")

# Delete a document
result = collection.delete_one({'email': 'john@example.com'})
print(f"Deleted {result.deleted_count} document(s)")

# Close the connection
client.close()

Motor for Async Operations

Motor is the asynchronous MongoDB driver for Python, perfect for use with async frameworks like FastAPI:

import asyncio
from motor.motor_asyncio import AsyncIOMotorClient

async def main():
    # Connect to MongoDB
    client = AsyncIOMotorClient('mongodb://localhost:27017/')
    db = client['mydatabase']
    collection = db['mycollection']
    
    # Insert a document
    result = await collection.insert_one({
        'name': 'John Doe',
        'email': 'john@example.com'
    })
    print(f"Inserted document with ID: {result.inserted_id}")
    
    # Find documents
    async for document in collection.find({'name': 'John Doe'}):
        print(document)
    
    # Find one document
    document = await collection.find_one({'email': 'john@example.com'})
    print(document)
    
    # Close the connection
    client.close()

# Run the async function
asyncio.run(main())

Pydantic Integration

Pydantic provides data validation and settings management using Python type annotations. It integrates well with MongoDB for schema validation:

from pydantic import BaseModel, Field
from typing import List, Optional
from datetime import datetime
from bson import ObjectId
from pymongo import MongoClient

# Custom type for handling ObjectId
class PyObjectId(ObjectId):
    @classmethod
    def __get_validators__(cls):
        yield cls.validate
        
    @classmethod
    def validate(cls, v):
        if not ObjectId.is_valid(v):
            raise ValueError("Invalid ObjectId")
        return ObjectId(v)
    
    @classmethod
    def __modify_schema__(cls, field_schema):
        field_schema.update(type="string")

# Pydantic model for User
class User(BaseModel):
    id: Optional[PyObjectId] = Field(default_factory=PyObjectId, alias="_id")
    name: str
    email: str
    age: int
    is_active: bool = True
    created_at: datetime = Field(default_factory=datetime.now)
    tags: List[str] = []
    
    class Config:
        allow_population_by_field_name = True
        arbitrary_types_allowed = True
        json_encoders = {
            ObjectId: str,
            datetime: lambda dt: dt.isoformat()
        }

# Example usage with MongoDB and Pydantic
client = MongoClient('mongodb://localhost:27017/')
db = client['mydatabase']
collection = db['users']

# Create a user from Pydantic model
user_data = {
    "name": "John Doe",
    "email": "john@example.com",
    "age": 30,
    "tags": ["developer", "python"]
}
user = User(**user_data)
result = collection.insert_one(user.dict(by_alias=True))
print(f"Inserted user with ID: {result.inserted_id}")

# Retrieve and validate from MongoDB
user_from_db = collection.find_one({"email": "john@example.com"})
validated_user = User(**user_from_db)
print(validated_user.json())

# Update using Pydantic model
user_update = User(**user_from_db)
user_update.age = 31
user_update.tags.append("mongodb")
collection.update_one(
    {"_id": user_update.id},
    {"$set": user_update.dict(by_alias=True, exclude={"id"})}
)

With FastAPI and Motor (async):

from fastapi import FastAPI, HTTPException, status
from motor.motor_asyncio import AsyncIOMotorClient
from pydantic import BaseModel, Field, EmailStr
from typing import List, Optional
from datetime import datetime
from bson import ObjectId

app = FastAPI()

# Database connection
client = AsyncIOMotorClient('mongodb://localhost:27017')
db = client.mydatabase

# Pydantic models
class PyObjectId(ObjectId):
    @classmethod
    def __get_validators__(cls):
        yield cls.validate
        
    @classmethod
    def validate(cls, v):
        if not ObjectId.is_valid(v):
            raise ValueError("Invalid ObjectId")
        return ObjectId(v)
    
    @classmethod
    def __modify_schema__(cls, field_schema):
        field_schema.update(type="string")

class UserBase(BaseModel):
    name: str
    email: EmailStr
    age: int
    tags: List[str] = []
    is_active: bool = True

class UserCreate(UserBase):
    pass

class UserDB(UserBase):
    id: PyObjectId = Field(default_factory=PyObjectId, alias="_id")
    created_at: datetime = Field(default_factory=datetime.now)
    
    class Config:
        allow_population_by_field_name = True
        arbitrary_types_allowed = True
        json_encoders = {
            ObjectId: str,
            datetime: lambda dt: dt.isoformat()
        }

# FastAPI routes
@app.post("/users/", response_model=UserDB, status_code=status.HTTP_201_CREATED)
async def create_user(user: UserCreate):
    user_dict = user.dict()
    user_dict["created_at"] = datetime.now()
    
    result = await db.users.insert_one(user_dict)
    
    created_user = await db.users.find_one({"_id": result.inserted_id})
    return created_user

@app.get("/users/{user_id}", response_model=UserDB)
async def get_user(user_id: str):
    if not ObjectId.is_valid(user_id):
        raise HTTPException(status_code=400, detail="Invalid user ID format")
        
    user = await db.users.find_one({"_id": ObjectId(user_id)})
    if user is None:
        raise HTTPException(status_code=404, detail="User not found")
        
    return user

@app.get("/users/", response_model=List[UserDB])
async def list_users(limit: int = 10, skip: int = 0):
    users = await db.users.find().skip(skip).limit(limit).to_list(length=limit)
    return users

MongoDB Deployment

Running MongoDB in Docker

Docker provides an easy way to deploy MongoDB:

Basic MongoDB container:

docker run -d --name mongodb \
    -p 27017:27017 \
    -e MONGO_INITDB_ROOT_USERNAME=admin \
    -e MONGO_INITDB_ROOT_PASSWORD=password \
    -v mongodb_data:/data/db \
    mongo:latest
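
A quick connectivity check from Python against this container (the credentials match the MONGO_INITDB_ROOT_* values above; root users authenticate against the admin database):

from pymongo import MongoClient

client = MongoClient("mongodb://admin:password@localhost:27017/?authSource=admin")
print(client.admin.command("ping"))  # {'ok': 1.0} when the container is up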

Using Docker Compose:

# docker-compose.yml
version: '3.8'

services:
  mongodb:
    image: mongo:latest
    container_name: mongodb
    restart: always
    ports:
      - "27017:27017"
    environment:
      MONGO_INITDB_ROOT_USERNAME: admin
      MONGO_INITDB_ROOT_PASSWORD: password
    volumes:
      - mongodb_data:/data/db
      - ./mongo-init.js:/docker-entrypoint-initdb.d/mongo-init.js:ro

  mongo-express:
    image: mongo-express:latest
    container_name: mongo-express
    restart: always
    ports:
      - "8081:8081"
    environment:
      ME_CONFIG_MONGODB_ADMINUSERNAME: admin
      ME_CONFIG_MONGODB_ADMINPASSWORD: password
      ME_CONFIG_MONGODB_SERVER: mongodb
    depends_on:
      - mongodb

volumes:
  mongodb_data:

With initialization script:

// mongo-init.js
db = db.getSiblingDB('mydatabase');

db.createUser({
  user: 'myuser',
  pwd: 'mypassword',
  roles: [
    { role: 'readWrite', db: 'mydatabase' }
  ]
});

db.createCollection('users');
db.users.insertMany([
  {
    name: 'John Doe',
    email: 'john@example.com',
    age: 30
  },
  {
    name: 'Jane Smith',
    email: 'jane@example.com',
    age: 28
  }
]);

Running and stopping the containers:

# Start services
docker-compose up -d

# Stop services
docker-compose down

# View logs
docker-compose logs -f mongodb

MongoDB Atlas

MongoDB Atlas is a fully-managed cloud database service provided by MongoDB, Inc. It offers:

  1. Automated deployment across AWS, Azure, or GCP
  2. Automated backups and point-in-time recovery
  3. Auto-scaling based on workload
  4. Security features like encryption, VPC peering, and IP whitelisting
  5. Performance optimization with query profiling and suggestions

Connecting to Atlas from Python:

from pymongo import MongoClient

# Connection string from Atlas dashboard
connection_string = "mongodb+srv://username:password@cluster0.mongodb.net/mydatabase?retryWrites=true&w=majority"

client = MongoClient(connection_string)
db = client.mydatabase
collection = db.mycollection

# Test connection
result = collection.find_one()
print(result)

Self-hosted Deployment

For production self-hosted deployments, MongoDB is typically run as a replica set or sharded cluster:

Replica Set provides redundancy and high availability:

# Start MongoDB instances
mongod --replSet myrs --dbpath /data/db1 --port 27017
mongod --replSet myrs --dbpath /data/db2 --port 27018
mongod --replSet myrs --dbpath /data/db3 --port 27019

# Configure replica set
mongo --port 27017
> rs.initiate({
    _id: "myrs",
    members: [
      { _id: 0, host: "mongodb0:27017" },
      { _id: 1, host: "mongodb1:27018" },
      { _id: 2, host: "mongodb2:27019" }
    ]
  })

Sharded Cluster for horizontal scaling:

# Start config servers
mongod --configsvr --replSet configrs --dbpath /data/configdb --port 27019

# Start shard servers
mongod --shardsvr --replSet shard1rs --dbpath /data/shard1 --port 27018

# Start mongos router
mongos --configdb configrs/config1:27019,config2:27019,config3:27019 --port 27017

# Add shards via mongos
mongo --port 27017
> sh.addShard("shard1rs/shard1:27018")
> sh.enableSharding("mydatabase")
> sh.shardCollection("mydatabase.users", { "_id": "hashed" })

MongoDB Security

Authentication and Authorization

MongoDB provides role-based access control (RBAC):

  1. Authentication Methods:

    • Username/Password
    • X.509 certificates
    • LDAP
    • Kerberos
  2. Creating a user with a specific role:

db.createUser({
  user: "appUser",
  pwd: "securePassword",
  roles: [
    { role: "readWrite", db: "mydatabase" }
  ]
})
  3. Built-in roles:

    • read: Read data from any collection
    • readWrite: Read and write data
    • dbAdmin: Perform administrative tasks
    • userAdmin: Create and modify users and roles
    • clusterAdmin: Administer the whole cluster
    • backup: Backup data
    • restore: Restore data from backups
  4. Creating custom roles:

db.createRole({
  role: "reportingRole",
  privileges: [
    {
      resource: { db: "mydatabase", collection: "" },
      actions: [ "find" ]
    }
  ],
  roles: []
})
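
From PyMongo, the same commands can be sent directly; a sketch that assumes the connection is authenticated as a user with userAdmin privileges:

# Equivalent of the db.createUser() call above
db.command("createUser", "appUser",
           pwd="securePassword",
           roles=[{"role": "readWrite", "db": "mydatabase"}])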

Network Security

Securing MongoDB network access:

  1. Binding to localhost only:
mongod --bind_ip 127.0.0.1
  2. Enabling TLS/SSL:
mongod --tlsMode requireTLS --tlsCertificateKeyFile /path/to/server.pem
  3. Firewall rules to restrict access:
# Allow MongoDB port only from specific IPs
ufw allow from 192.168.1.0/24 to any port 27017
  4. VPC/network isolation in cloud environments

Encryption

MongoDB supports encryption at multiple levels:

  1. Transport Encryption (TLS/SSL) for data in transit
  2. Storage Encryption for data at rest:
mongod --enableEncryption --encryptionKeyFile /path/to/key
  3. Client-Side Field Level Encryption for sensitive fields:
const clientEncryption = new ClientEncryption(client, {
  keyVaultNamespace: 'encryption.__dataKeys',
  kmsProviders: {
    local: {
      key: localMasterKey
    }
  }
});

// Encrypt a field
const encryptedField = await clientEncryption.encrypt(
  sensitiveData,
  {
    algorithm: 'AEAD_AES_256_CBC_HMAC_SHA_512-Deterministic',
    keyAltName: 'myKey'
  }
);

// Store encrypted data
await collection.insertOne({
  name: 'John Doe',
  ssn: encryptedField
});

Performance Optimization

Indexing Strategies

Effective indexing is crucial for MongoDB performance:

  1. Single-field indexes for frequently queried fields:
db.users.createIndex({ "email": 1 })
  2. Compound indexes for multi-field queries:
db.products.createIndex({ "category": 1, "price": -1 })
  3. Index properties:

    • Unique: Enforce field uniqueness

      db.users.createIndex({ "email": 1 }, { unique: true })
      
    • Sparse: Only index documents with the field present

      db.users.createIndex({ "optional_field": 1 }, { sparse: true })
      
    • TTL (Time-To-Live): Automatically expire documents

      db.sessions.createIndex({ "last_activity": 1 }, { expireAfterSeconds: 3600 })
      
    • Partial: Only index documents matching a filter

      db.orders.createIndex(
        { "status": 1 },
        { partialFilterExpression: { "status": "active" } }
      )
      
  4. Index usage analysis:

db.users.find({ "age": { $gt: 25 } }).explain("executionStats")
  5. Identifying long-running queries (often a symptom of missing indexes):
db.currentOp(
  {
    "op" : "query",
    "microsecs_running" : { $gt: 100000 }
  }
)

Query Optimization Techniques

  1. Query profiling to identify slow queries:
// Enable profiler
db.setProfilingLevel(1, { slowms: 100 })

// View slow queries
db.system.profile.find().sort({ ts: -1 }).limit(10)
  2. Covered queries that are satisfied entirely by an index:
// Create an index on both fields
db.users.createIndex({ "email": 1, "name": 1 })

// Query that uses only indexed fields
db.users.find(
  { "email": "john@example.com" },
  { "_id": 0, "email": 1, "name": 1 }
)
  3. Projection to retrieve only needed fields:
db.products.find({}, { name: 1, price: 1, _id: 0 })
  4. Limit results to reduce data transfer:
db.logs.find().sort({ timestamp: -1 }).limit(100)
  5. Avoid negation operators when possible:
// Avoid this (can't use indexes effectively)
db.users.find({ status: { $ne: "inactive" } })

// Better approach
db.users.find({ status: { $in: ["active", "pending"] } })

Performance Monitoring

  1. Server status metrics:
db.serverStatus()
  2. Database statistics:
db.stats()
  3. Collection statistics:
db.users.stats()
  4. Index usage statistics:
db.users.aggregate([
  { $indexStats: {} }
])
  5. Monitoring tools:
    • MongoDB Compass
    • MongoDB Cloud Manager
    • Prometheus with MongoDB exporter
    • Grafana dashboards

Real-world Use Cases

Content Management Systems

MongoDB is well-suited for content management systems:

  1. Flexible schema for different content types
  2. Rich querying for content filtering
  3. Embedded documents for comments and related content

Example CMS document:

{
  "_id": ObjectId("60a6e3e89f1c6a8d556884b2"),
  "title": "Getting Started with MongoDB",
  "slug": "getting-started-with-mongodb",
  "content": "MongoDB is a document database...",
  "author": {
    "name": "John Doe",
    "email": "john@example.com"
  },
  "tags": ["mongodb", "nosql", "database"],
  "status": "published",
  "created_at": ISODate("2021-05-20T15:30:00Z"),
  "updated_at": ISODate("2021-05-25T10:15:00Z"),
  "comments": [
    {
      "user": "Jane Smith",
      "text": "Great article!",
      "created_at": ISODate("2021-05-21T08:45:00Z")
    }
  ],
  "metadata": {
    "featured": true,
    "view_count": 1250,
    "rating": 4.7
  }
}

Real-time Analytics

MongoDB excels for real-time analytics applications:

  1. Time-series data collection
  2. Aggregation pipeline for complex analytics
  3. Sharding for handling large data volumes

Example analytics pipeline:

db.page_views.aggregate([
  // Match events from the last 24 hours
  {
    $match: {
      timestamp: {
        $gte: new Date(Date.now() - 24 * 60 * 60 * 1000)
      }
    }
  },
  
  // Group by page and calculate stats
  {
    $group: {
      _id: "$page",
      views: { $sum: 1 },
      unique_users: { $addToSet: "$user_id" },
      avg_duration: { $avg: "$duration" }
    }
  },
  
  // Calculate number of unique users
  {
    $addFields: {
      unique_users: { $size: "$unique_users" }
    }
  },
  
  // Sort by most viewed
  {
    $sort: { views: -1 }
  },
  
  // Limit to top 10
  {
    $limit: 10
  }
])

IoT Applications

MongoDB is popular for Internet of Things (IoT) applications:

  1. High write throughput for sensor data
  2. Time-series collections for time-ordered data
  3. Geospatial queries for location tracking
  4. TTL indexes for data expiration

Example IoT document:

{
  "_id": ObjectId("60a6e3e89f1c6a8d556884b2"),
  "device_id": "thermostat-1234",
  "type": "temperature",
  "value": 22.5,
  "unit": "celsius",
  "location": {
    "type": "Point",
    "coordinates": [-73.97, 40.77]
  },
  "battery": 87,
  "timestamp": ISODate("2021-05-20T15:30:00Z")
}
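
A sketch of the TTL index mentioned above, using PyMongo; the 30-day retention window and the collection name are assumptions for illustration:

# Documents are removed roughly 30 days after their timestamp value
db.sensor_readings.create_index("timestamp", expireAfterSeconds=30 * 24 * 3600)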

Mobile Applications

MongoDB works well for mobile apps:

  1. Flexible schema for rapidly evolving app features
  2. Offline-first architecture with MongoDB Realm
  3. Change streams for real-time updates
  4. Horizontal scaling for growing user bases

Example mobile app user document:

{
  "_id": ObjectId("60a6e3e89f1c6a8d556884b2"),
  "username": "johndoe",
  "email": "john@example.com",
  "profile": {
    "name": "John Doe",
    "avatar": "https://example.com/avatars/johndoe.jpg",
    "bio": "MongoDB enthusiast"
  },
  "preferences": {
    "notifications": {
      "push": true,
      "email": false
    },
    "theme": "dark"
  },
  "devices": [
    {
      "type": "android",
      "token": "fcm-token-123",
      "last_active": ISODate("2021-05-20T15:30:00Z")
    }
  ],
  "last_login": ISODate("2021-05-20T15:30:00Z"),
  "created_at": ISODate("2021-03-15T10:20:00Z")
}

Catalog Management

MongoDB is excellent for product catalogs:

  1. Schema flexibility for diverse product types
  2. Rich querying for faceted search
  3. Horizontal scaling for large catalogs

Example product document:

{
  "_id": ObjectId("60a6e3e89f1c6a8d556884b2"),
  "sku": "MBP-2021-14-M1",
  "name": "MacBook Pro 14-inch",
  "description": "Apple MacBook Pro with M1 Pro chip",
  "price": 1999.99,
  "category": "electronics",
  "subcategory": "laptops",
  "brand": "Apple",
  "attributes": {
    "processor": "Apple M1 Pro",
    "memory": "16GB",
    "storage": "512GB SSD",
    "display": "14.2-inch Liquid Retina XDR",
    "color": "Space Gray"
  },
  "images": [
    {
      "url": "https://example.com/images/mbp-front.jpg",
      "alt": "Front view",
      "is_primary": true
    },
    {
      "url": "https://example.com/images/mbp-side.jpg",
      "alt": "Side view",
      "is_primary": false
    }
  ],
  "inventory": {
    "in_stock": 42,
    "warehouse_location": "NYC-1"
  },
  "metadata": {
    "featured": true,
    "rating": 4.8,
    "reviews_count": 156
  },
  "created_at": ISODate("2021-10-26T10:00:00Z"),
  "updated_at": ISODate("2021-11-15T14:30:00Z")
}
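
For point 2 above, a hedged sketch of faceted search over documents shaped like the one shown (the price boundaries are illustrative):

db.products.aggregate([
  { $match: { category: "electronics" } },
  {
    $facet: {
      // Product counts per brand
      by_brand: [
        { $group: { _id: "$brand", count: { $sum: 1 } } }
      ],
      // Product counts per price range
      by_price: [
        {
          $bucket: {
            groupBy: "$price",
            boundaries: [0, 500, 1000, 2000, 5000],
            default: "5000+",
            output: { count: { $sum: 1 } }
          }
        }
      ],
      // Top-rated products for the landing page
      top_rated: [
        { $sort: { "metadata.rating": -1 } },
        { $limit: 5 },
        { $project: { name: 1, price: 1, "metadata.rating": 1 } }
      ]
    }
  }
])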

Best Practices

Schema Design Patterns

  1. Embedded Documents Pattern:

    • Embed related data in a single document for faster reads
    • Best for one-to-few relationships
    • Example: embedding addresses in a user document
  2. References Pattern:

    • Use references between documents for one-to-many or many-to-many relationships
    • Example: referencing order IDs in a user document
  3. Bucket Pattern:

    • Group related time-series data into buckets
    • Prevents having too many small documents
    • Example: storing hourly metrics in a daily document (see the sketch after this list)
  4. Schema Versioning Pattern:

    • Include a version field in documents
    • Handle migrations gracefully
    • Example: { "schema_version": 2, ... }
  5. Computed Pattern:

    • Store computed data to avoid expensive calculations
    • Update during writes
    • Example: storing count of items in a cart
  6. Subset Pattern:

    • Store a subset of fields from one collection in another
    • Reduces need for joins
    • Example: storing essential product info in order documents
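
A minimal sketch of the bucket pattern from point 3, with one document per device per day (the field names are illustrative):

{
  "_id": "thermostat-1234:2021-05-20",
  "device_id": "thermostat-1234",
  "date": ISODate("2021-05-20T00:00:00Z"),
  "readings": [
    { "hour": 0, "avg_temp": 21.8 },
    { "hour": 1, "avg_temp": 21.5 }
  ],
  "reading_count": 2
}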

Data Modeling Guidelines

  1. Design for the application's queries:

    • Start with the queries, then design the schema
    • Denormalize when it improves read performance
  2. Balance embedding vs. referencing:

    • Embed when data is always accessed together
    • Reference when data is large, accessed separately, or shared (see the sketch after this list)
  3. Consider document growth:

    • Allow for document growth when data will be updated
    • Be cautious with unbounded arrays
  4. Limit document size:

    • Keep documents below the 16MB limit
    • Split large content (like binary data) into GridFS
  5. Use appropriate data types:

    • Use BSON types that match your needs
    • Consider ObjectId for unique identifiers
  6. Plan for indexes:

    • Index fields used in query filters, sorts, and joins
    • Be mindful of index size and maintenance overhead
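
A minimal shell sketch of guideline 2, reusing the user and order shapes from earlier examples:

// Embed: addresses are few and always read together with the user
db.users.insertOne({
  name: "John Doe",
  addresses: [
    { street: "123 Main St", city: "Anytown", state: "CA" }
  ]
})

// Reference: orders are many, grow over time, and are queried separately
db.orders.insertOne({
  user_id: ObjectId("60a6e3e89f1c6a8d556884b2"),
  total: 1999.99,
  status: "shipped"
})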

Operational Excellence

  1. Monitoring and alerting:

    • Monitor system metrics (CPU, memory, disk I/O)
    • Monitor MongoDB metrics (operations, connections, queues)
    • Set up alerting for critical thresholds (a few shell health checks are sketched after this list)
  2. Backup strategy:

    • Schedule regular backups
    • Test restore processes
    • Consider point-in-time recovery for critical data
  3. Capacity planning:

    • Estimate data growth
    • Plan for increased load
    • Provision resources accordingly
  4. Security practices:

    • Use authentication and authorization
    • Encrypt data in transit and at rest
    • Regularly audit access and permissions
  5. Upgrade strategy:

    • Stay current with versions
    • Test upgrades in non-production environments
    • Schedule maintenance windows for upgrades
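
For point 1, a few health checks built into the mongo shell (alerting thresholds and dashboards are left to your monitoring stack):

db.serverStatus().connections   // current vs. available connections
db.serverStatus().opcounters    // insert/query/update/delete counters
db.stats()                      // per-database storage statistics
rs.status()                     // replica set member health (replica sets only)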

Summary

MongoDB is a powerful, flexible document database that excels in scenarios requiring:

  1. Flexible schema for evolving data models
  2. Horizontal scalability for growing applications
  3. High write throughput for data-intensive applications
  4. Rich query capabilities including geospatial and text search
  5. Developer productivity with intuitive data models

While not suitable for every use case (especially those requiring complex transactions across multiple entities), MongoDB provides a compelling alternative to traditional relational databases for many modern application patterns.

...
MongoDB usage example with Python

17501 • Apr 5, 2025

Background

I recently got really interested in the NoSQL database MongoDB, as I had no prior experience working with it. I roughly know the technology's use cases and can use it in simple ways, but what it is best known for is its support for handling unstructured data. Unlike SQL, it does not enforce a defined structure on the objects in our collections.

Here is an example of one such use case in Python (GenAI-generated):

Here's an implementation approach using Pydantic for validation with MongoDB:


from typing import Optional, List, Literal, Union, Dict, Any
from datetime import datetime
from pydantic import BaseModel, Field, root_validator
from bson import ObjectId
from pymongo import MongoClient
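
# NOTE: this example targets the Pydantic v1 API (root_validator,
# __get_validators__, class Config); Pydantic v2 replaces these with
# model_validator, __get_pydantic_core_schema__, and model_config.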

# Helper for handling ObjectId in Pydantic
class PyObjectId(ObjectId):
    @classmethod
    def __get_validators__(cls):
        yield cls.validate

    @classmethod
    def validate(cls, v):
        if not ObjectId.is_valid(v):
            raise ValueError("Invalid ObjectId")
        return ObjectId(v)

    @classmethod
    def __modify_schema__(cls, field_schema):
        field_schema.update(type="string")


# Base Product Model with common fields
class BaseProduct(BaseModel):
    id: Optional[PyObjectId] = Field(default_factory=PyObjectId, alias="_id")
    name: str
    brand: str
    price: float
    stock: int
    release_date: datetime
    product_type: Literal["phone", "laptop", "tablet"]
    description: str
    images: List[str] = []
    tags: List[str] = []
    active: bool = True
    created_at: datetime = Field(default_factory=datetime.now)
    updated_at: datetime = Field(default_factory=datetime.now)
    
    class Config:
        allow_population_by_field_name = True
        arbitrary_types_allowed = True
        json_encoders = {
            ObjectId: str,
            datetime: lambda dt: dt.isoformat()
        }


# Phone specific model
class PhoneProduct(BaseProduct):
    product_type: Literal["phone"] = "phone"
    screen_size: float
    battery_capacity: int  # mAh
    camera_mp: float
    storage_options: List[int]  # GB
    colors: List[str]
    os: str
    network: str  # 4G, 5G, etc.
    dimensions: Dict[str, float]  # height, width, depth
    weight: float  # grams
    
    # Phone-specific validation
    @root_validator
    def validate_phone(cls, values):
        if values.get("product_type") != "phone":
            raise ValueError("Product type must be 'phone'")
        return values


# Laptop specific model
class LaptopProduct(BaseProduct):
    product_type: Literal["laptop"] = "laptop"
    screen_size: float
    processor: str
    ram_options: List[int]  # GB
    storage_options: Dict[str, List[int]]  # Type (SSD/HDD) -> Sizes in GB
    gpu: Optional[str] = None
    battery_life: float  # hours
    os: str
    ports: Dict[str, int]  # port type -> number of ports
    weight: float  # kg
    is_touchscreen: bool = False
    
    # Laptop-specific validation
    @root_validator
    def validate_laptop(cls, values):
        if values.get("product_type") != "laptop":
            raise ValueError("Product type must be 'laptop'")
        return values


# Tablet specific model
class TabletProduct(BaseProduct):
    product_type: Literal["tablet"] = "tablet"
    screen_size: float
    battery_capacity: int  # mAh
    storage_options: List[int]  # GB
    processor: str
    ram: int  # GB
    camera_mp: Dict[str, float]  # front, back
    connectivity: List[str]  # wifi, cellular, etc.
    os: str
    pen_support: bool = False
    
    # Tablet-specific validation
    @root_validator
    def validate_tablet(cls, values):
        if values.get("product_type") != "tablet":
            raise ValueError("Product type must be 'tablet'")
        return values


# Union type for working with any product type
ProductType = Union[PhoneProduct, LaptopProduct, TabletProduct]


# MongoDB Repository for Products
class ProductRepository:
    def __init__(self, connection_string: str, db_name: str):
        self.client = MongoClient(connection_string)
        self.db = self.client[db_name]
        self.collection = self.db.products
    
    # Create product of any type
    def create_product(self, product: ProductType):
        product_dict = product.dict(by_alias=True)
        result = self.collection.insert_one(product_dict)
        return str(result.inserted_id)
    
    # Get product by ID
    def get_product(self, product_id: str) -> Optional[ProductType]:
        product = self.collection.find_one({"_id": ObjectId(product_id)})
        if not product:
            return None
        
        # Determine product type and return appropriate model
        if product["product_type"] == "phone":
            return PhoneProduct(**product)
        elif product["product_type"] == "laptop":
            return LaptopProduct(**product)
        elif product["product_type"] == "tablet":
            return TabletProduct(**product)
    
    # Get all products of a specific type
    def get_products_by_type(self, product_type: str) -> List[ProductType]:
        products = self.collection.find({"product_type": product_type})
        
        result = []
        for product in products:
            if product_type == "phone":
                result.append(PhoneProduct(**product))
            elif product_type == "laptop":
                result.append(LaptopProduct(**product))
            elif product_type == "tablet":
                result.append(TabletProduct(**product))
                
        return result
    
    # Update product
    def update_product(self, product_id: str, updated_product: ProductType):
        # Update the updated_at field
        product_dict = updated_product.dict(by_alias=True, exclude_unset=True)
        product_dict["updated_at"] = datetime.now()
        
        result = self.collection.update_one(
            {"_id": ObjectId(product_id)},
            {"$set": product_dict}
        )
        return result.modified_count > 0
    
    # Delete product
    def delete_product(self, product_id: str):
        result = self.collection.delete_one({"_id": ObjectId(product_id)})
        return result.deleted_count > 0


# Example usage
if __name__ == "__main__":
    # Create repository
    repo = ProductRepository("mongodb://localhost:27017", "electronics_store")
    
    # Create a phone product
    phone = PhoneProduct(
        name="Galaxy S22",
        brand="Samsung",
        price=899.99,
        stock=100,
        release_date=datetime(2022, 2, 25),
        description="Samsung's flagship phone for 2022",
        tags=["smartphone", "android", "samsung", "flagship"],
        screen_size=6.1,
        battery_capacity=3700,
        camera_mp=50.0,
        storage_options=[128, 256],
        colors=["Phantom Black", "Phantom White", "Green", "Pink Gold"],
        os="Android 12",
        network="5G",
        dimensions={"height": 146.0, "width": 70.6, "depth": 7.6},
        weight=167.0
    )
    
    # Insert phone
    phone_id = repo.create_product(phone)
    print(f"Created phone with ID: {phone_id}")
    
    # Create a laptop product
    laptop = LaptopProduct(
        name="MacBook Pro 14",
        brand="Apple",
        price=1999.99,
        stock=50,
        release_date=datetime(2021, 10, 26),
        description="Apple's professional laptop with M1 Pro chip",
        tags=["laptop", "macbook", "apple", "professional"],
        screen_size=14.2,
        processor="Apple M1 Pro",
        ram_options=[16, 32, 64],
        storage_options={"SSD": [512, 1024, 2048, 4096]},
        gpu="M1 Pro 16-core",
        battery_life=17.0,
        os="macOS Monterey",
        ports={"Thunderbolt 4": 3, "HDMI": 1, "SD Card": 1, "MagSafe": 1},
        weight=1.6
    )
    
    # Insert laptop
    laptop_id = repo.create_product(laptop)
    print(f"Created laptop with ID: {laptop_id}")
    
    # Get all phones
    phones = repo.get_products_by_type("phone")
    print(f"Found {len(phones)} phones")

This implementation provides a good balance between the structure of typed models and the flexibility that MongoDB offers. Here are the key aspects of this approach:

1. Base Structure with Common Fields

  • The BaseProduct class contains all common fields shared across product types
  • Core fields like ID, name, price, brand, etc. are consistent

2. Type-Specific Models

  • Each product type (phone, laptop, tablet) extends the base model
  • Type-specific fields are only present in their respective models
  • Each subclass validates its own product_type for data integrity

3. Type Discrimination

  • The product_type field serves as the discriminator to identify document types
  • This allows you to query all products or filter by specific types
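
For example, filtering on the discriminator directly in the mongo shell (the products collection name comes from the repository above):

db.products.find({ product_type: "phone" })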

4. Repository Pattern

  • The ProductRepository class encapsulates MongoDB operations
  • It handles conversion between Pydantic models and MongoDB documents
  • It automatically returns the right model type based on the product_type field

5. Validation Benefits

  • Pydantic ensures each document has the required fields for its type
  • Type hints provide excellent IDE support and documentation
  • The root_validator ensures product types match their models

This approach gives you the best of both worlds:

  1. The flexibility of MongoDB's schemaless design
  2. The safety and structure of strong type validation with Pydantic

When retrieving documents, the repository pattern intelligently converts them to the appropriate Pydantic model, ensuring you always get the right fields and validation for each product type.

...
Python cloud native engineer - EPAM job description

admin1 • Mar 23, 2025

Python Engineer

We seek a highly skilled Python Engineer with expertise in cloud computing and a strong focus on integrating AI capabilities into our projects. The ideal candidate will possess robust proficiency in Python and its frameworks, coupled with a deep understanding of at least one major cloud provider.

Additionally, familiarity with Large Language Models (LLM) and Retrieval Augmented Generation (RAG) is essential for seamlessly integrating AI capabilities.

Responsibilities

  • Facilitate development and deployment of cloud-native solutions, highlighting AI integration in our projects
  • Architect and launch AI-driven applications, leveraging Python frameworks such as Django, Flask or FastAPI
  • Integrate Large Language Models (LLM) and Retrieval Augmented Generation (RAG) into ongoing and upcoming projects to enhance language understanding and generation capabilities
  • Collaborate with cross-functional teams to understand project objectives and convert them into AI-driven technical solutions
  • Implement AI-based features and functionalities, utilizing cloud-native architectures and industry-standard practices
  • Write maintainable and well-documented code, adhering to coding standards and best practices
  • Stay updated with the latest advancements in Python, cloud computing, AI, and Cloud Native architectures, and proactively suggest innovative solutions to enhance our AI capabilities

Requirements

  • Proven expertise in Python programming language, with significant experience in AI integration
  • Proficiency in cloud computing with hands-on experience in major cloud platforms such as AWS, Azure, or Google Cloud Platform
  • Familiarity with Large Language Models (LLM) and Retrieval Augmented Generation (RAG)
  • Excellent problem-solving abilities and the capability to effectively collaborate within a team setting
  • Strong communication skills and the ability to clearly explain complex technical concepts to non-technical stakeholders

Nice to have

  • Knowledge of Cloud Native architectures and experience with tools like Kubernetes, Docker and microservices

We offer

We connect like-minded people:

  • Delivering innovative solutions to industry leaders, making a global impact
  • Enjoyable working environment, whether it is the vibrant office or the comfort of your own home
  • Opportunity to work abroad for up to two months per year
  • Relocation opportunities within our offices in 55+ countries
  • Corporate and social events

We invest in your growth:

  • Leadership development, career advising, soft skills and well-being programs
  • Certifications, including GCP, Azure and AWS
  • Unlimited access to LinkedIn Learning, Get Abstract, O'Reilly
  • Free English classes with certified teachers
  • Discounts in local language schools, including offline courses for the Uzbek language

We cover it all:

  • Monetary bonuses for engaging in the referral program
  • Medical & family care package
  • Four trust days per year (sick leave without a medical certificate)
  • Discounts for fitness clubs, dance schools and sports programs
  • Benefits package (sports activities, a variety of stores and services)
...