Aggregation Mechanics
This section details internal optimization operations, limits, support for sharded collections, and concurrency concerns for aggregation and map-reduce operations.
Aggregation Pipeline Optimization – Changed in version 2.4. Aggregation pipeline operations have an optimization phase which attempts to rearrange the pipeline for improved performance.
Pipeline Sequence Optimization
$sort + $skip + $limit Sequence Optimization – When you have a sequence with $sort followed by a $skip followed by a $limit, an optimization occurs that moves the $limit operator before the $skip operator. For example, if the pipeline consists of the following stages:
{ $sort: { age : -1 } },
{ $skip: 10 },
{ $limit: 5 }
During the optimization phase, the optimizer transforms the sequence to the following:
{ $sort: { age : -1 } },
{ $limit: 15 },
{ $skip: 10 }
The $limit value has increased to the sum of its initial value and the $skip value (5 + 10 = 15).
The optimized sequence now has $sort immediately preceding the $limit, which allows the sort operation to keep only the top 15 results as it progresses.
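As a minimal sketch, and assuming a hypothetical users collection, the pipeline above could be submitted as follows; the optimizer applies the rearrangement automatically, so the call returns the same documents either way.
db.users.aggregate([
    { $sort: { age : -1 } },
    { $skip: 10 },
    { $limit: 5 }
])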
$limit + $skip + $limit + $skip Sequence Optimization – When you have a continuous sequence of a $limit pipeline stage followed by a $skip pipeline stage, the optimization phase attempts to arrange the pipeline stages to combine the limits and skips. For example, if the pipeline consists of the following stages:
{ $limit: 100 },
{ $skip: 5 },
{ $limit: 10 },
{ $skip: 2 }
During the intermediate step, the optimizer reverses the position of the $skip followed by a $limit to a $limit followed by the $skip:
{ $limit: 100 },
{ $limit: 15 },
{ $skip: 5 },
{ $skip: 2 }
The $limit value has increased to the sum of the initial value and the $skip value (10 + 5 = 15). Then, for the final $limit value, the optimizer selects the minimum of the adjacent $limit values (min(100, 15) = 15). For the final $skip value, the optimizer adds the adjacent $skip values (5 + 2 = 7), transforming the sequence to the following:
{ $limit: 15 },
{ $skip: 7 }
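For illustration only (the collection name is hypothetical), both calls below return the same eight documents: the optimized form keeps the first 15 documents and then skips the first 7 of them.
db.users.aggregate([
    { $limit: 100 },
    { $skip: 5 },
    { $limit: 10 },
    { $skip: 2 }
])
// equivalent after optimization
db.users.aggregate([
    { $limit: 15 },
    { $skip: 7 }
])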
Projection Optimization – The aggregation pipeline can determine if it requires only a subset of the fields in the documents to obtain the results. If so, the pipeline will only use those required fields, reducing the amount of data passing through the pipeline.
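As a hedged illustration (the orders collection and its fields are hypothetical), the pipeline below needs only the status and amount fields, so the optimizer avoids carrying the documents' remaining fields through the pipeline.
db.orders.aggregate([
    { $match: { status: "A" } },
    { $group: { _id: "$status", total: { $sum: "$amount" } } }
])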
Aggregation Pipeline Limits – Aggregation operations with the aggregate command have the following limitations.
Type Restrictions – The aggregation pipeline cannot operate on values of the following types: Symbol, MinKey, MaxKey, DBRef, Code, and CodeWScope. Changed in version 2.4: Removed restriction on Binary type data. In MongoDB 2.2, the pipeline could not operate on Binary type data.
Result Size Restrictions – Changed in version 2.6. If the aggregate command returns a single document that contains the complete result set, the command produces an error if that result set exceeds the BSON Document Size limit, which is currently 16 megabytes. To manage result sets that exceed this limit, the aggregate command can instead return results as a cursor or store the results in a collection; neither of these is subject to the size limit. The db.collection.aggregate() method returns a cursor and can return result sets of any size.
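For example, assuming a hypothetical orders collection, either approach below avoids the 16 megabyte limit on a single returned document: the shell helper returns a cursor that fetches results in batches, and the $out stage (available from version 2.6) writes the results to a collection instead of returning them.
// returns a cursor; results are fetched in batches
var cursor = db.orders.aggregate([ { $match: { status: "A" } } ])
// writes the results to the agg_results collection instead of returning them
db.orders.aggregate([
    { $match: { status: "A" } },
    { $out: "agg_results" }
])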
Memory Restrictions – Changed in version 2.6. Pipeline stages have a limit of 100 megabytes of RAM. If a stage exceeds this limit, MongoDB produces an error. To allow for the handling of large datasets, use the allowDiskUse option to enable aggregation pipeline stages to write data to temporary files.
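A minimal sketch of enabling the option in the shell (the collection and pipeline are hypothetical):
db.orders.aggregate(
    [
        { $group: { _id: "$cust_id", total: { $sum: "$amount" } } },
        { $sort: { total: -1 } }  // a large in-memory sort may exceed the 100 megabyte per-stage limit
    ],
    { allowDiskUse: true }  // lets stages write data to temporary files
)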
Aggregation Pipeline and Sharded Collections – The aggregation pipeline supports operations on sharded collections. This section describes behaviors specific to the aggregation pipeline and sharded collections.
Behavior – Changed in version 2.6. When operating on a sharded collection, the aggregation pipeline is split into two parts. The first pipeline runs on each shard, or, if an early $match can exclude shards through the use of the shard key in the predicate, on only the relevant shards. The second pipeline consists of the remaining pipeline stages and runs on the primary shard. The primary shard merges the cursors from the other shards and runs the second pipeline on these results. The primary shard then forwards the final results to the mongos. In previous versions, the second pipeline would run on the mongos. When splitting the aggregation pipeline into two parts, the split is chosen so that the shards perform as many stages as possible. To retrieve information on how the pipeline was divided, use the explain option of the db.collection.aggregate() method, as in the sketch below.
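For example, passing the explain option to the shell helper (the collection, shard key field, and pipeline are placeholders) reports how the pipeline was divided between the shards and the primary shard.
db.orders.aggregate(
    [
        { $match: { cust_id: { $gte: 100 } } },  // an early $match on the shard key can exclude shards
        { $group: { _id: "$cust_id", total: { $sum: "$amount" } } }
    ],
    { explain: true }
)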
Map-Reduce and Sharded Collections – Map-reduce supports operations on sharded collections, both as an input and as an output. This section describes the behaviors of mapReduce specific to sharded collections.
Sharded Collection as Input – When using a sharded collection as the input for a map-reduce operation, mongos automatically dispatches the map-reduce job to each shard in parallel; no special option is required. mongos waits for jobs on all shards to finish.
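As an illustration (the collection, fields, and output collection name are hypothetical), a map-reduce job over a sharded input collection is written exactly like one over an unsharded collection; mongos handles dispatching it to the shards.
db.orders.mapReduce(
    function () { emit(this.cust_id, this.amount); },        // map
    function (key, values) { return Array.sum(values); },    // reduce
    { out: "order_totals" }
)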
Sharded Collection as Output – Changed in version 2.2. If the out field for mapReduce has the sharded value, MongoDB shards the output collection using the _id field as the shard key. To output to a sharded collection:
- If the output collection does not exist, MongoDB creates and shards the collection on the _id field.
- For a new or an empty sharded collection, MongoDB uses the results of the first stage of the map-reduce operation to create the initial chunks distributed among the shards.
- mongos dispatches, in parallel, a map-reduce post-processing job to every shard that owns a chunk. During the post-processing, each shard will pull the results for its own chunks from the other shards, run the final reduce/finalize, and write locally to the output collection.
During later map-reduce jobs, MongoDB splits chunks as needed. Balancing of chunks for the output collection is automatically prevented during post-processing to avoid concurrency issues.
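A sketch of requesting sharded output (the collection, functions, and output name are hypothetical; reduce is one of the permitted output actions):
db.orders.mapReduce(
    function () { emit(this.cust_id, this.amount); },
    function (key, values) { return Array.sum(values); },
    { out: { reduce: "order_totals", sharded: true } }  // shard the output collection on _id
)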
In MongoDB 2.0:
- mongos retrieves the results from each shard, performs a merge sort to order the results, and proceeds to the reduce/finalize phase as needed. mongos then writes the result to the output collection in sharded mode.
- This model requires only a small amount of memory, even for large data sets.
- Shard chunks are not automatically split during insertion. This requires manual intervention until the chunks are granular and balanced.
For best results, only use the sharded output options for mapReduce in version 2.2 or later.
Map-Reduce Concurrency – The map-reduce operation is composed of many tasks, including reads from the input collection, executions of the map function, executions of the reduce function, writes to a temporary collection during processing, and writes to the output collection. During the operation, map-reduce takes the following locks:
- The read phase takes a read lock. It yields every 100 documents.
- The insert into the temporary collection takes a write lock for a single write.
- If the output collection does not exist, the creation of the output collection takes a write lock.
- If the output collection exists, then the output actions (i.e. merge, replace, reduce) take a write lock.
Changed in version 2.4: The V8 JavaScript engine, which became the default in 2.4, allows multiple JavaScript operations to execute at the same time. Prior to 2.4, JavaScript code (i.e. map, reduce, finalize functions) executed in a single thread. The final write lock during post-processing makes the results appear atomic. However, the merge and reduce output actions may take minutes to process. For the merge and reduce actions, the nonAtomic flag is available, as in the sketch below.
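For example (the names below are hypothetical), the flag is passed inside the out document with a merge or reduce action, so other clients may read intermediate states of the output collection instead of waiting for a single atomic write.
db.orders.mapReduce(
    function () { emit(this.cust_id, this.amount); },
    function (key, values) { return Array.sum(values); },
    { out: { merge: "order_totals", nonAtomic: true } }  // post-processing is not applied atomically
)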