MongoDB

Commands :

show dbs
db
show collections



  • In MongoDB, you don't need to create a collection explicitly. MongoDB creates the collection automatically when you insert the first document.
  • For insert or update operations, use the insertOne()/insertMany() and updateOne()/updateMany()/replaceOne() commands. Use bulkWrite() for bulk operations.
  • By default, insertMany() performs ordered inserts and stops execution on the first error. Use { ordered: false } to continue inserting the remaining documents when an error occurs.
  • For case-insensitive search, use the /i regex flag as below:
          db.movieDetails.find({"genres": /famIly/i})
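A minimal sketch of an unordered insert (collection name `movies` and the duplicate `_id` are illustrative):

```javascript
// With { ordered: false }, the shell still reports a BulkWriteError for
// the duplicate _id, but the remaining documents are inserted anyway.
// With the default ordered inserts, execution would stop at the error.
db.movies.insertMany(
  [
    { _id: 1, title: "A" },
    { _id: 1, title: "duplicate _id, this insert fails" },
    { _id: 2, title: "B" }   // still inserted because ordered is false
  ],
  { ordered: false }
)
```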

** update() and save() methods are deprecated in MongoDB.
  • To query data from a MongoDB collection, use MongoDB's find() method.
db.COLLECTION_NAME.find()
db.mycol.find().pretty()


  • The $and operator allows you to have duplicate key constraints in a filter condition (i.e. two conditions on the same field).
  • Always use $elemMatch when multiple field conditions must match the same element of an array.
  • db.getLogComponents() is used to check the logging level (i.e. the verbosity level).
  • To debug slow operations/queries in the database, we rely on profiling.
  • Journaling and WAL : every 100 ms (storage.journal.commitIntervalMs), mongod syncs buffered journal records to the on-disk journal files. At every checkpoint (60 s interval by default), data is flushed from memory to the data files.
  • The profiler is used to debug slow-running operations in MongoDB. The default level is 0 (disabled). When enabled, it stores information in the system.profile collection of each profiled database.
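A short $elemMatch sketch (collection and field names are illustrative), showing why it matters for multiple conditions on one array:

```javascript
// Matches only when a SINGLE element of results is both >= 80 and < 85.
db.scores.find({ results: { $elemMatch: { $gte: 80, $lt: 85 } } })

// Without $elemMatch, this would also match { results: [70, 90] },
// because 90 satisfies $gte: 80 and 70 satisfies $lt: 85 separately.
db.scores.find({ results: { $gte: 80, $lt: 85 } })
```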


  • To create an index, you need to use the ensureIndex() method of MongoDB.
db.COLLECTION_NAME.ensureIndex({KEY:1})

*** Deprecated since version 3.0.0: db.collection.ensureIndex() is now an alias for db.collection.createIndex().

Here, key is the name of the field on which you want to create the index, and 1 is for ascending order. To create an index in descending order, use -1.
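For example (collection and field names are illustrative), using the non-deprecated createIndex():

```javascript
// Single-field ascending index
db.movies.createIndex({ title: 1 })

// Compound index: ascending on genre, descending on year
db.movies.createIndex({ genre: 1, year: -1 })

// List the indexes on the collection
db.movies.getIndexes()
```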

  • The default configuration file is /etc/mongod.conf, and the default port is 27017.
  • Read uncommitted is the default isolation level and applies to mongod standalone instances as well as to replica sets and sharded clusters.
~/.dbshell and ~/.mongorc.js files :

  • mongo maintains a history of commands in the .dbshell file.
  • mongo will read the .mongorc.js file from the home directory of the user invoking mongo. In the file, users can define variables, customize the mongo shell prompt, or update information that they would like updated every time they launch a shell. If you use the shell to evaluate a JavaScript file or expression either on the command line with mongo --eval or by specifying a .js file to mongo, mongo will read the .mongorc.js file after the JavaScript has finished processing.
  • Specify the --norc option to disable reading .mongorc.js.



Index Types in Mongo DB : 

1. Single Key Index 

In addition to the MongoDB-defined _id index, MongoDB supports the creation of user-defined ascending/descending indexes on a single field of a document.

2.Compound Index

MongoDB also supports user-defined indexes on multiple fields, i.e. compound indexes.

3. Multikey Index

MongoDB uses multikey indexes to index the content stored in arrays. If you index a field that holds an array value, MongoDB creates separate index entries for every element of the array. MongoDB automatically determines whether to create a multikey index if the indexed field contains an array value; you do not need to explicitly specify the multikey type.

4. Geospatial Index

To support efficient queries of geospatial coordinate data, MongoDB provides two special indexes: 2d indexes, which use planar geometry when returning results, and 2dsphere indexes, which use spherical geometry to return results.

5. Text Indexes

MongoDB provides a text index type that supports searching for string content in a collection. 

6. Hashed Indexes

To support hash based sharding, MongoDB provides a hashed index type, which indexes the hash of the value of a field. These indexes have a more random distribution of values along their range, but only support equality matches and cannot support range-based queries.


Index Properties

Unique Indexes
The unique property for an index causes MongoDB to reject duplicate values for the indexed field. Other than the unique constraint, unique indexes are functionally interchangeable with other MongoDB indexes.

Partial Indexes
New in version 3.2.

Partial indexes only index the documents in a collection that meet a specified filter expression. By indexing a subset of the documents in a collection, partial indexes have lower storage requirements and reduced performance costs for index creation and maintenance.

Partial indexes offer a superset of the functionality of sparse indexes and should be preferred over sparse indexes.

Sparse Indexes
The sparse property of an index ensures that the index only contain entries for documents that have the indexed field. The index skips documents that do not have the indexed field.

You can combine the sparse index option with the unique index option to reject documents that have duplicate values for a field but ignore documents that do not have the indexed key.

TTL Indexes
TTL indexes are special indexes that MongoDB can use to automatically remove documents from a collection after a certain amount of time. This is ideal for certain types of information like machine generated event data, logs, and session information that only need to persist in a database for a finite amount of time.
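A hedged sketch combining these index properties (collection/field names and values are illustrative):

```javascript
// TTL index: documents expire 3600 seconds after their createdAt time
db.sessions.createIndex({ createdAt: 1 }, { expireAfterSeconds: 3600 })

// Unique index: rejects documents with a duplicate email value
db.users.createIndex({ email: 1 }, { unique: true })

// Partial index: only index orders whose amount is greater than 100
db.orders.createIndex(
  { customerId: 1 },
  { partialFilterExpression: { amount: { $gt: 100 } } }
)
```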

Install Mongo DB : 



  • MongoDB enables journaling by default. Journaling protects against data loss in the event of service interruptions, such as power failures and unexpected reboots.

Replica Sets : 
  • You need to run the command below after a role switch (Primary to Secondary) to make the "show dbs" command work on the secondary.
rs.slaveOk()
  • You can require that members of replica sets and sharded clusters authenticate to each other. For the internal authentication of the members, MongoDB can use either keyfiles or x.509 certificates.
openssl rand -base64 741 > mongodb.key
chmod 600 mongodb.key
chown mongod:mongod mongodb.key
  • Primary-to-Secondary replication is asynchronous and follows the default pv1 protocol (Raft-based).
  • Make sure you have odd number of voting nodes in replica sets.
  • During topology change ( i.e. Priority change) , Primary/Secondary role can change.
  • Maximum 7 voting members are allowed in replica sets.
  • Replica set member states :
When an instance starts, it goes through STARTUP -> STARTUP2 -> RECOVERING -> SECONDARY

STARTUP: Not yet an active member of any set. All members start up in this state. The mongod parses the replica set configuration document while in STARTUP.
STARTUP2: The member has joined the set and is running an initial sync. Eligible to vote.
RECOVERING: Members either perform startup self-checks, or transition from completing a rollback or resync. Data is not available for reads from this member. Eligible to vote.
SECONDARY: A member in state SECONDARY is replicating the data store. Eligible to vote.
Config file : 
replication:
  replSetName: rs0

security:
  authorization: enabled
  keyFile: /home/mongodb.key
  • Any data written to local database doesn't get replicated to other nodes.
Write concern, Read concern & read preference : 
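A minimal sketch of the three settings (collection name and values are illustrative; exact options depend on your driver and server version):

```javascript
// Write concern: wait for a majority of voting members to acknowledge,
// with journaling, timing out after 5 seconds
db.orders.insertOne(
  { item: "abc" },
  { writeConcern: { w: "majority", j: true, wtimeout: 5000 } }
)

// Read concern: only return majority-committed data
db.orders.find().readConcern("majority")

// Read preference: allow this query to read from secondaries
db.orders.find().readPref("secondaryPreferred")
```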







How to enable authentication in Arbiter : 

By default, when replica set members are added to a cluster, users and roles are replicated to the secondaries but are not replicated to arbiters (arbiters hold no data).

Please follow the steps below to enable user authentication on an arbiter :

1. Deploy your Replica Set in the normal way, see Deploy a Replica Set
2. Shut down the arbiter database
3. Start the arbiter in maintenance mode, see Perform Maintenance on Replica Set Members

4. Mainly, you modify the configuration file and comment out the replication, sharding and security sections. Modify net.port to prevent any other replica set member from connecting to the database.

5. Connect from local host to the arbiter database, without authentication
6. Create the Shard Local Admin User.

db.getSiblingDB("admin").createUser( {user: "ladmin", pwd: "secret", roles: [ "root" ] })
Instead of root you can also grant [ "clusterAdmin", "userAdminAnyDatabase" ] which is stricter.

7. Shut down the arbiter database

8.Restart arbiter database in normal mode.

Localhost Exception

The localhost exception allows you to enable access control and then create the first user in the system. With the localhost exception, after you enable access control, connect to the localhost interface and create the first user in the admin database. The first user must have privileges to create other users, such as a user with the userAdmin or userAdminAnyDatabase role. Connections using the localhost exception only have access to create the first user on the admin database.

Changed in version 3.4: MongoDB 3.4 extended the localhost exception to permit execution of the db.createRole() method. This method allows users authorizing via LDAP to create a role inside of MongoDB that maps to a role defined in LDAP. See LDAP Authorization for more information.

The localhost exception applies only when there are no users created in the MongoDB instance.

In the case of a sharded cluster, the localhost exception applies to each shard individually as well as to the cluster as a whole. Once you create a sharded cluster and add a user administrator through the mongos instance, you must still prevent unauthorized access to the individual shards. Follow one of the following steps for each shard in your cluster:

Create an administrative user, or
Disable the localhost exception at startup. To disable the localhost exception, set the enableLocalhostAuthBypass parameter to 0.

Database Profiler : 

  • The database profiler collects detailed information about Database Commands executed against a running mongod instance. This includes CRUD operations as well as configuration and administration commands. The profiler writes all the data it collects to the system.profile collection, a capped collection in each profiled database.
  • The profiler is off by default. You can enable the profiler on a per-database or per-instance basis at one of several profiling levels.
  • if the mongod process shuts down, such as during a maintenance procedure, the profiling level set with db.setProfilingLevel() will be lost.
  • CRUD operations, Administrative operations, and Configuration operations are all captured by the database profiler.

Level Description

Level 0: The profiler is off and does not collect any data. This is the default profiler level.

Level 1: The profiler collects data for operations that take longer than the value of slowms.

Level 2: The profiler collects data for all operations.
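For example (the slowms threshold is illustrative):

```javascript
// Enable level 1 profiling for the current database, logging
// operations slower than 100 ms
db.setProfilingLevel(1, { slowms: 100 })

// Check the current profiling status
db.getProfilingStatus()

// Inspect the most recent profiled operations
db.system.profile.find().sort({ ts: -1 }).limit(5).pretty()
```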


Mongoimport & mongoexport : 

Export & import data in JSON format.


mongodump  & mongorestore : 

Export & import data in BSON format.



By default, mongodump does not capture the contents of the local database.
mongodump only captures the documents in the database. The resulting backup is space efficient, but mongorestore or mongod must rebuild the indexes after restoring data.

Applications can continue to modify data while mongodump captures the output. For replica sets, mongodump provides the --oplog option to include in its output oplog entries that occur during the mongodump operation. This allows the corresponding mongorestore operation to replay the captured oplog. To restore a backup created with --oplog, use mongorestore with the --oplogReplay option.
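A point-in-time backup/restore sketch (the URI and paths are illustrative):

```shell
# Dump a replica set, including the oplog entries captured while
# the dump is running
mongodump --uri="mongodb://localhost:27017" --oplog --out=/backup/dump

# Restore the dump and replay the captured oplog entries for a
# consistent point-in-time restore
mongorestore --oplogReplay /backup/dump
```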



Sharding : 

  • In MongoDB 4.2 and earlier, you cannot change the shard key after sharding; in no version can you unshard a sharded collection.
  • If queries do not include the shard key or the prefix of a compound shard key, mongos performs a broadcast operation, querying all shards in the sharded cluster. These scatter/gather queries can be long running operations.
  • Starting in version 4.4, documents in sharded collections can be missing the shard key fields. Missing shard key fields are treated as having null values when distributing the documents across shards but not when routing queries. 
  • In version 4.2 and earlier, shard key fields must exist in every document for a sharded collection.
  • Starting in MongoDB 4.4, you can refine a collection’s shard key by adding a suffix field or fields to the existing key.
  • In MongoDB 4.2 and earlier, the choice of shard key cannot be changed after sharding.
  • In MongoDB 4.0 and earlier, a document’s shard key field value is immutable.
  • Starting in MongoDB 4.2, you can update a document’s shard key value unless your shard key field is the immutable _id field.
  • To shard a populated collection, the collection must have an index that starts with the shard key. When sharding an empty collection, MongoDB creates the supporting index if the collection does not already have an appropriate index for the specified shard key. 
  • In an attempt to achieve an even distribution of chunks across all shards in the cluster, a balancer runs in the background to migrate chunks across the shards .
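A minimal sharding sketch run via mongos (the database, collection, and key names are illustrative):

```javascript
// Enable sharding on the database
sh.enableSharding("mydb")

// Shard the collection on a hashed key for a more even
// chunk distribution across shards
sh.shardCollection("mydb.events", { deviceId: "hashed" })

// Check balancer state and chunk distribution
sh.status()
```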



Storage Engine : 

WiredTiger Storage Engine (Default) : 

WiredTiger is the default storage engine starting in MongoDB 3.2. It is well-suited for most workloads and is recommended for new deployments. WiredTiger provides a document-level concurrency model, checkpointing, and compression, among other features.

In-Memory Storage Engine


The In-Memory Storage Engine is available in MongoDB Enterprise. Rather than storing documents on disk, it retains them in memory for more predictable data latencies.


Journaling : 


A sequential, binary transaction log used to bring the database into a valid state in the event of a hard shutdown. Journaling writes data first to the journal and then to the core data files. MongoDB enables journaling by default for 64-bit builds of MongoDB version 2.0 and newer. Journal files are pre-allocated and exist as files in the data directory. Each journal file is limited to approximately 100 MB in size.



WiredTiger syncs the buffered journal records to the on-disk journal files upon any of the following conditions:
  • For replica set members (primary and secondary), if there are operations waiting for oplog entries. Operations that can wait for oplog entries include:
    • forward-scanning queries against the oplog
    • read operations performed as part of causally consistent sessions
  • Additionally, for secondary members, after every batch application of oplog entries.
  • If a write operation includes or implies a write concern of j: true.
  • Every 100 milliseconds (see storage.journal.commitIntervalMs).
  • When WiredTiger creates a new journal file. Because MongoDB uses a journal file size limit of 100 MB, WiredTiger creates a new journal file approximately every 100 MB of data: once a file exceeds that limit, WiredTiger creates a new one. WiredTiger automatically removes old journal files, keeping only the files needed to recover from the last checkpoint.
For the journal files, MongoDB creates a subdirectory named journal under the dbPath directory. WiredTiger journal files have names with the following format WiredTigerLog.<sequence> where <sequence> is a zero-padded number starting from 0000000001.

rs.stepDown()  :  

Steps down the primary to a secondary node. Make sure your secondary instances have the correct priority and voting configuration, otherwise the stepDown command will fail.

replSetStepDown :  

Instructs the primary of the replica set to become a secondary. After the primary steps down, eligible secondaries will hold an election for primary.



Error Msg : 

The show < > commands failed with the error message below when the replica set was started for the first time, at initial sync.

 "errmsg" : "not master and slaveOk=false",

Ans :  rs.slaveOk()  


This allows the current connection to run read operations on secondary members.



Hidden member in replica set : 

A hidden member maintains a copy of the primary’s data set but is invisible to client applications. Hidden members are good for workloads with different usage patterns from the other members in the replica set. Hidden members must always be priority 0 members and so cannot become primary. The db.isMaster() method does not display hidden members. Hidden members, however, may vote in elections.


Automatic Failover : 

Your application connection logic should include tolerance for automatic failovers and the subsequent elections. Starting in MongoDB 3.6, MongoDB drivers can detect the loss of the primary and automatically retry certain write operations a single time, providing additional built-in handling of automatic failovers and elections:

  • MongoDB 4.2-compatible drivers enable retryable writes by default.
  • MongoDB 4.0 and 3.6-compatible drivers must explicitly enable retryable writes by including retryWrites=true in the connection string.
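For example (hosts, database, and replica set name are illustrative):

```shell
mongo "mongodb://host1:27017,host2:27017,host3:27017/mydb?replicaSet=rs0&retryWrites=true"
```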



Journal : 

MongoDB uses write ahead logging to an on-disk journal to guarantee write operation durability.



The WiredTiger storage engine does not require journaling to guarantee a consistent state after a crash. The database will be restored to the last consistent checkpoint during recovery. However, if MongoDB exits unexpectedly in between checkpoints, journaling is required to recover writes that occurred after the last checkpoint.

WiredTiger journal files for MongoDB have a maximum size limit of approximately 100 MB.

Once the file exceeds that limit, WiredTiger creates a new journal file.
WiredTiger automatically removes old journal files to maintain only the files needed to recover from last checkpoint.

WiredTiger pre-allocates journal files.

For the In-Memory Storage Engine, since data is kept in memory, no separate journal file is created.

OPLOG : 

Oplog is a capped collection.
MongoDB applies database operations on the primary and then records the operations on the primary’s oplog. The secondary members then copy and apply these operations in an asynchronous process. All replica set members contain a copy of the oplog, in the local.oplog.rs collection, which allows them to maintain the current state of the database.

Any secondary member can import oplog entries from any other member.

oplog operations produce the same results whether applied once or multiple times to the target dataset.

When you start a replica set member for the first time, MongoDB creates an oplog of a default size (for WiredTiger, 5% of free disk space) if you do not specify the oplog size.

Your replication oplog window tells you how long a secondary member can be offline and still catch up to the primary without doing a full resync. Once you exceed that time, oplog entries that have not yet replicated get overwritten and cannot be applied. Since it's much slower to do a full DB copy than to catch up using the oplog, knowing that time frame can inform your operations policy regarding time to repair down secondaries.

The second, and more subtle, issue is that replication oplog window is also the maximum amount of time it can take to perform the initial phase of a full sync (either when adding a new secondary, or fully resyncing a stale one). In this phase, the entire database is copied to the secondary, while the oplog keeps track of operations performed on the primary since the start of the copy. If it takes longer than your replication oplog window to copy the data from the primary to the secondary, then by the time that copy is done, the oplog will have lost track of data that wasn’t present in the initial copy. This means that you will not be able to *resync* any stale secondaries, or *add* any new secondaries! The only way to recover from this state is to shutdown the primary and allocate a larger oplog.
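To inspect the current oplog size and window on a member:

```javascript
// Prints configured oplog size, used space, and the time range
// (the oplog window) between the first and last oplog entries
rs.printReplicationInfo()

// Programmatic access to the same data
var info = db.getReplicationInfo()
print(info.timeDiffHours + " hours of oplog window")
```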




Auditing : 



  • When auditAuthorizationSuccess is false, the audit system only logs the authorization failures for authCheck. Setting auditAuthorizationSuccess to true degrades performance more than logging only the failures, since every event must be written to the audit log before the oplog.

db.adminCommand( { setParameter: 1, auditAuthorizationSuccess: true } )


Client Authentication Mechanism : 

Note : 
- The localhost exception exists when an instance is started with authorization enabled and no users exist in the database. This exception is exhausted once the first user is created. So always create an admin/root user immediately after logging in to the database, and authenticate the session using the db.auth() command.


For Community edition

1. SCRAM ( Basic password authentication , default one when mongod/mongos started with auth enabled )
2. X.509 

For MongoDB Enterprise edition only 

3. LDAP
4. KERBEROS




Imp Commands : 




  • The following operation runs getParameter on the admin database using a value of saslHostName to retrieve the value for that parameter:

db.adminCommand( { getParameter : 1, "saslHostName" : 1 } )  
db.adminCommand( { getParameter : '*' } )

  • Rotate the log file ( or audit log file) by issuing the logRotate command from the admin database in a mongo shell:

db.adminCommand( { logRotate : 1 } )


  • Starting in 4.2, we can convert command-line options to YAML as below:


mongod --shardsvr --replSet myShard  --dbpath /var/lib/mongodb --bind_ip localhost,My-Example-Hostname --fork --logpath /var/log/mongodb/mongod.log --clusterAuthMode x509 --tlsMode requireTLS  --tlsCAFile /path/to/my/CA/file  --tlsCertificateKeyFile /path/to/my/certificate/file --tlsClusterFile /path/to/my/cluster/membership/file --outputConfig


IMP Facts : 


  • From version 4.2 onwards, no separate background or foreground index creation method exists. Indexes are built with a new method called "hybrid index build", which gives good performance without the locking issues.
  • For better memory utilization, indexes can be built only on secondary nodes, with a few considerations: prevent the secondary from becoming primary by setting priority=0, or by making it a hidden or delayed secondary.
  • hint() can be used to force the usage of a specific index.
  • Benchmark tools : sysbench, iibench, YCSB, TPC, HiBench, JEPSEN.
  • We can use an index for both filtering and sorting only if the keys in our query predicate are equality conditions.
  • Index keys should be defined in this order relative to the query : equality, sort, range.
  • Covered query : a query is covered when the result set is satisfied by the index keys alone, with no need to fetch documents. This is possible when the projection fields are the same as the index fields. You can't cover a query when any of the keys are part of an array/sub-document, or when it is run against a mongos and the index does not contain the shard key.
  • The aggregation pipeline's result document limit is 16 MB. Each stage in the pipeline has a 100 MB RAM usage limit. "allowDiskUse : true" should be the last resort for managing aggregation memory, and it does not work with the $graphLookup stage.
  • Sharded cluster : two types of reads can be performed, routed queries and scatter-gather queries.
  • WiredTiger compresses disk storage, but the data in memory is uncompressed.


    Routed query (query with the shard key): mongos asks one or a few nodes for the information.
    Scatter-gather (query without the shard key): mongos asks all nodes for the information.
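A hedged aggregation sketch illustrating allowDiskUse (collection and field names are illustrative):

```javascript
// Group and sort a large collection; allowDiskUse lets stages spill
// to disk when they exceed the 100 MB per-stage RAM limit
db.events.aggregate(
  [
    { $match: { type: "click" } },
    { $group: { _id: "$userId", total: { $sum: 1 } } },
    { $sort: { total: -1 } }
  ],
  { allowDiskUse: true }
)
```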


Schema Model pattern : 

1. Subset pattern : 

The Subset Pattern solves the problem of having the working set exceed the capacity of RAM due to large documents that have much of the data in the document not being used by the application.

Pros
Reduction in the overall size of the working set.
Shorter disk access time for the most frequently used data.
Cons
We must manage the subset.

Pulling in additional data requires additional trips to the database.

2. Computed pattern : 

When there are very read intensive data access patterns and that data needs to be repeatedly computed by the application, the Computed Pattern is a great option to explore.

Pros
Reduction in CPU workload for frequent computations.
Queries become simpler to write and are generally faster.
Cons
It may be difficult to identify the need for this pattern.

Applying or overusing the pattern should be avoided unless needed.

3. Attribute pattern : 

The Attribute Pattern is useful for problems that are based around having big documents with many similar fields but there is a subset of fields that share common characteristics and we want to sort or query on that subset of fields. When the fields we need to sort on are only found in a small subset of documents. Or when both of those conditions are met within the documents.

Pros
Fewer indexes are needed.
Queries become simpler to write and are generally faster. 

4. Schema Versioning pattern :

Just about every application can benefit from the Schema Versioning Pattern as changes to the data schema frequently occur in an application’s lifetime. This pattern allows for previous and current versions of documents to exist side by side in a collection.

Pros
No downtime needed.
Control of schema migration.
Reduced future technical debt.
Cons

Might need two indexes for the same field during migration.


5.  Extended reference pattern : 

You will find the Extended Reference pattern most useful when your application is experiencing lots of JOIN operations to bring together frequently accessed data.

Pros
Improves performance when there are a lot of JOIN operations.
Faster reads and a reduction in the overall number of JOINs.
Cons

Data duplication.

6. Bucket pattern : 

The Bucket Pattern is a great solution for when needing to manage streaming data, such as time-series, real-time analytics, or Internet of Things (IoT) applications.

Pros
Reduces the overall number of documents in a collection.
Improves index performance.

Can simplify data access by leveraging pre-aggregation.


MongoDB Data Model Thumb Rules : 

Ref : https://www.mongodb.com/blog/post/6-rules-of-thumb-for-mongodb-schema-design-part-3

One: favor embedding unless there is a compelling reason not to
Two: needing to access an object on its own is a compelling reason not to embed it
Three: Arrays should not grow without bound. If there are more than a couple of hundred documents on the “many” side, don’t embed them; if there are more than a few thousand documents on the “many” side, don’t use an array of ObjectID references. High-cardinality arrays are a compelling reason not to embed.
Four: Don’t be afraid of application-level joins: if you index correctly and use the projection specifier (as shown in part 2) then application-level joins are barely more expensive than server-side joins in a relational database.
Five: Consider the write/read ratio when denormalizing. A field that will mostly be read and only seldom updated is a good candidate for denormalization: if you denormalize a field that is updated frequently then the extra work of finding and updating all the instances is likely to overwhelm the savings that you get from denormalizing.
Six: As always with MongoDB, how you model your data depends – entirely – on your particular application’s data access patterns. You want to structure your data to match the ways that your application queries and updates it.

Interesting example for Flat structure Vs Array in MongoDB : 

1. Embedded vs flat data structure: in this case there is not much difference between the two patterns, but with the embedded model we group similar kinds of data, which makes your query a bit simpler and smaller when you $project data from a collection.

For example: if you want to fetch a complete address, then with an embedded document you don't need to $project the address fields individually, and if you want to skip the address while fetching the document, you don't need to skip the address fields individually.

2. Embedded (one-to-one) vs embedded (one-to-many): we discussed the benefits of embedded documents over a flat data structure, but if our users have more than one address, we need to use embedded documents with a one-to-many relationship.



Capacity Planning : 

https://lamada.eu/blog/2016/04/26/mongodb-how-to-perform-sizing-capacity-planning/
