In this blog post, I’ll discuss my top five go-to tips for MongoDB troubleshooting.
Every DBA has a war chest of go-to solutions for the support issues they run into with a specific technology. MongoDB is no different. Even if you picked it because it’s a good fit and it runs well for you, things will change. When things change – sometimes it’s a new version of your application, sometimes a new version of the database itself – you need a solid starting place.
To help new DBAs, I like to point out my top five things that cover the bulk of the requests a DBA might need to work on.
Table of Contents
- Common greps to use
- Did any elections happen? Why did they happen?
- Is replication lagged, do I have enough oplog?
- Taming the profiler
- CurrentOp and killOp Explained
Common greps to use
This first tip is all about ways to pare down the error log and make it a bit more manageable. The error log contains a slew of information, and sometimes, without grep, it’s challenging to correlate events.
Is an index being built?
As a DBA, you will often get a call saying the database has “stopped.” The developer might say, “I didn’t change anything.” Looking at the error log is a great first port of call. With this particular grep, you can see whether all index builds finished, whether a new index is still building, or whether an index was removed – which covers all of the cases in question.
```
> grep -i index mongod.log
2016-11-11T17:08:53.731+0000 I INDEX [conn458] build index on: samples.col1 properties: { v: 1, key: { friends: 1.0 }, name: "friends_1", ns: "samples.col1" }
2016-11-11T17:08:53.733+0000 I INDEX [conn458] building index using bulk method
2016-11-11T17:08:56.045+0000 I - [conn458] Index Build: 24700/1000000 2%
2016-11-11T17:08:59.004+0000 I - [conn458] Index Build: 61000/1000000 6%
2016-11-11T17:09:02.001+0000 I - [conn458] Index Build: 103200/1000000 10%
2016-11-11T17:09:05.013+0000 I - [conn458] Index Build: 130800/1000000 13%
2016-11-11T17:09:08.013+0000 I - [conn458] Index Build: 160300/1000000 16%
2016-11-11T17:09:11.039+0000 I - [conn458] Index Build: 183100/1000000 18%
2016-11-11T17:09:14.009+0000 I - [conn458] Index Build: 209400/1000000 20%
2016-11-11T17:09:17.007+0000 I - [conn458] Index Build: 239400/1000000 23%
2016-11-11T17:09:20.010+0000 I - [conn458] Index Build: 264100/1000000 26%
2016-11-11T17:09:23.001+0000 I - [conn458] Index Build: 286800/1000000 28%
2016-11-11T17:09:30.783+0000 I - [conn458] Index Build: 298900/1000000 29%
2016-11-11T17:09:33.015+0000 I - [conn458] Index Build: 323900/1000000 32%
2016-11-11T17:09:36.000+0000 I - [conn458] Index Build: 336600/1000000 33%
2016-11-11T17:09:39.000+0000 I - [conn458] Index Build: 397000/1000000 39%
2016-11-11T17:09:42.000+0000 I - [conn458] Index Build: 431900/1000000 43%
2016-11-11T17:09:45.002+0000 I - [conn458] Index Build: 489100/1000000 48%
2016-11-11T17:09:48.003+0000 I - [conn458] Index Build: 551200/1000000 55%
2016-11-11T17:09:51.004+0000 I - [conn458] Index Build: 567700/1000000 56%
2016-11-11T17:09:54.004+0000 I - [conn458] Index Build: 589600/1000000 58%
2016-11-11T17:10:00.929+0000 I - [conn458] Index Build: 597800/1000000 59%
2016-11-11T17:10:03.008+0000 I - [conn458] Index Build: 633100/1000000 63%
2016-11-11T17:10:06.001+0000 I - [conn458] Index Build: 647200/1000000 64%
2016-11-11T17:10:09.008+0000 I - [conn458] Index Build: 660000/1000000 66%
2016-11-11T17:10:12.001+0000 I - [conn458] Index Build: 672300/1000000 67%
2016-11-11T17:10:15.009+0000 I - [conn458] Index Build: 686000/1000000 68%
2016-11-11T17:10:18.001+0000 I - [conn458] Index Build: 706100/1000000 70%
2016-11-11T17:10:21.006+0000 I - [conn458] Index Build: 731400/1000000 73%
2016-11-11T17:10:24.006+0000 I - [conn458] Index Build: 750900/1000000 75%
2016-11-11T17:10:27.000+0000 I - [conn458] Index Build: 773900/1000000 77%
2016-11-11T17:10:30.000+0000 I - [conn458] Index Build: 821800/1000000 82%
2016-11-11T17:10:33.026+0000 I - [conn458] Index Build: 843800/1000000 84%
2016-11-11T17:10:36.008+0000 I - [conn458] Index Build: 874000/1000000 87%
2016-11-11T17:10:43.854+0000 I - [conn458] Index Build: 896600/1000000 89%
2016-11-11T17:10:46.009+0000 I - [conn458] Index Build: 921800/1000000 92%
2016-11-11T17:10:49.000+0000 I - [conn458] Index Build: 941600/1000000 94%
2016-11-11T17:10:52.011+0000 I - [conn458] Index Build: 955700/1000000 95%
2016-11-11T17:10:55.007+0000 I - [conn458] Index Build: 965500/1000000 96%
2016-11-11T17:10:58.046+0000 I - [conn458] Index Build: 985200/1000000 98%
2016-11-11T17:11:01.002+0000 I - [conn458] Index Build: 995000/1000000 99%
2016-11-11T17:11:13.000+0000 I - [conn458] Index: (2/3) BTree Bottom Up Progress: 8216900/8996322 91%
2016-11-11T17:11:14.021+0000 I INDEX [conn458] done building bottom layer, going to commit
2016-11-11T17:11:14.023+0000 I INDEX [conn458] build index done. scanned 1000000 total records. 140 secs
2016-11-11T17:11:14.035+0000 I COMMAND [conn458] command samples.$cmd command: createIndexes { createIndexes: "col1", indexes: [ { ns: "samples.col1", key: { friends: 1.0 }, name: "friends_1" } ] } keyUpdates:0 writeConflicts:0 numYields:0 reslen:173 locks:{ Global: { acquireCount: { r: 2, w: 2 } }, MMAPV1Journal: { acquireCount: { w: 9996326 }, acquireWaitCount: { w: 1054 }, timeAcquiringMicros: { w: 811319 } }, Database: { acquireCount: { w: 1, W: 1 } }, Collection: { acquireCount: { W: 1 } }, Metadata: { acquireCount: { W: 12 } }, oplog: { acquireCount: { w: 1 } } } 140306ms
```
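If you don’t have the log handy, you can also ask the server directly what it is doing. Here is a minimal sketch, assuming a 3.x-era mongod where in-progress index builds report their progress in the “msg” field of currentOp output:

```javascript
// Sketch: list any in-progress index builds from the mongo shell.
// Assumes a 3.x-era mongod where index builds expose progress via the "msg" field.
db.currentOp(true).inprog.forEach(function (op) {
    if (op.msg && op.msg.indexOf("Index Build") !== -1) {
        print("opid: " + op.opid +
              " ns: " + op.ns +
              " running: " + op.secs_running + "s" +
              " progress: " + op.msg);
    }
});
```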
What’s happening right now?
Like the index example above, this helps you remove many of the messages you might not care about, or want to block off. MongoDB does have some useful sub-component tags in the logs, such as “ReplicationExecutor” and “connXXX”, but I find it more useful to remove the noisy lines than to filter on the log facility types. In this example, I also opted not to add “| grep -v connection” – typically I will look at the log with connections first to see if they are acting oddly, and then filter them out to see the core of what is happening. If you only want to see the long queries and commands, swap the final “ms” for “connection” to make them easier to find.
```
> grep -v conn mongod.log | grep -v auth | grep -vi health | grep -v ms
2016-11-11T14:41:06.376+0000 I REPL [ReplicationExecutor] This node is localhost:28001 in the config
2016-11-11T14:41:06.377+0000 I REPL [ReplicationExecutor] transition to STARTUP2
2016-11-11T14:41:06.379+0000 I REPL [ReplicationExecutor] Member localhost:28003 is now in state STARTUP
2016-11-11T14:41:06.383+0000 I REPL [ReplicationExecutor] Member localhost:28002 is now in state STARTUP
2016-11-11T14:41:06.385+0000 I STORAGE [FileAllocator] allocating new datafile /Users/dmurphy/Github/dbmurphy/MongoDB32Labs/labs/rs2-1/local.1, filling with zeroes...
2016-11-11T14:41:06.586+0000 I STORAGE [FileAllocator] done allocating datafile /Users/dmurphy/Github/dbmurphy/MongoDB32Labs/labs/rs2-1/local.1, size: 256MB, took 0.196 secs
2016-11-11T14:41:06.610+0000 I REPL [ReplicationExecutor] transition to RECOVERING
2016-11-11T14:41:06.614+0000 I REPL [ReplicationExecutor] transition to SECONDARY
2016-11-11T14:41:08.384+0000 I REPL [ReplicationExecutor] Member localhost:28003 is now in state STARTUP2
2016-11-11T14:41:08.386+0000 I REPL [ReplicationExecutor] Standing for election
2016-11-11T14:41:08.388+0000 I REPL [ReplicationExecutor] Member localhost:28002 is now in state STARTUP2
2016-11-11T14:41:08.390+0000 I REPL [ReplicationExecutor] not electing self, localhost:28002 would veto with 'I don't think localhost:28001 is electable because the member is not currently a secondary (mask 0x8)'
2016-11-11T14:41:08.391+0000 I REPL [ReplicationExecutor] not electing self, we are not freshest
2016-11-11T14:41:10.387+0000 I REPL [ReplicationExecutor] Standing for election
2016-11-11T14:41:10.389+0000 I REPL [ReplicationExecutor] replSet info electSelf
2016-11-11T14:41:10.393+0000 I REPL [ReplicationExecutor] received vote: 1 votes from localhost:28003
2016-11-11T14:41:10.395+0000 I REPL [ReplicationExecutor] replSet election succeeded, assuming primary role
2016-11-11T14:41:10.396+0000 I REPL [ReplicationExecutor] transition to PRIMARY
2016-11-11T14:41:10.631+0000 I REPL [rsSync] transition to primary complete; database writes are now permitted
2016-11-11T14:41:12.390+0000 I REPL [ReplicationExecutor] Member localhost:28003 is now in state SECONDARY
2016-11-11T14:41:12.393+0000 I REPL [ReplicationExecutor] Member localhost:28002 is now in state SECONDARY
```

versus the same stretch of the log with the connection noise left in:

```
2016-11-11T14:41:12.393+0000 I REPL [ReplicationExecutor] Member localhost:28002 is now in state SECONDARY
2016-11-11T14:41:36.433+0000 I NETWORK [conn3] end connection 127.0.0.1:65497 (1 connection now open)
2016-11-11T14:41:36.433+0000 I NETWORK [initandlisten] connection accepted from 127.0.0.1:49191 #8 (3 connections now open)
2016-11-11T14:41:36.490+0000 I NETWORK [conn2] end connection 127.0.0.1:65496 (1 connection now open)
2016-11-11T14:41:36.490+0000 I NETWORK [initandlisten] connection accepted from 127.0.0.1:49192 #9 (3 connections now open)
2016-11-11T14:41:54.480+0000 I NETWORK [initandlisten] connection accepted from 127.0.0.1:49257 #10 (3 connections now open)
2016-11-11T14:41:54.486+0000 I NETWORK [initandlisten] connection accepted from 127.0.0.1:49258 #11 (4 connections now open)
2016-11-11T14:42:06.493+0000 I NETWORK [conn8] end connection 127.0.0.1:49191 (3 connections now open)
2016-11-11T14:42:06.494+0000 I NETWORK [initandlisten] connection accepted from 127.0.0.1:49290 #12 (5 connections now open)
2016-11-11T14:42:06.550+0000 I NETWORK [conn9] end connection 127.0.0.1:49192 (3 connections now open)
2016-11-11T14:42:06.550+0000 I NETWORK [initandlisten] connection accepted from 127.0.0.1:49294 #13 (5 connections now open)
2016-11-11T14:42:36.550+0000 I NETWORK [conn12] end connection 127.0.0.1:49290 (3 connections now open)
2016-11-11T14:42:36.550+0000 I NETWORK [initandlisten] connection accepted from 127.0.0.1:49336 #14 (5 connections now open)
2016-11-11T14:42:36.601+0000 I NETWORK [conn13] end connection 127.0.0.1:49294 (3 connections now open)
2016-11-11T14:42:36.601+0000 I NETWORK [initandlisten] connection accepted from 127.0.0.1:49339 #15 (5 connections now open)
2016-11-11T14:43:06.607+0000 I NETWORK [conn14] end connection 127.0.0.1:49336 (3 connections now open)
2016-11-11T14:43:06.608+0000 I NETWORK [initandlisten] connection accepted from 127.0.0.1:49387 #16 (5 connections now open)
2016-11-11T14:43:06.663+0000 I NETWORK [conn15] end connection 127.0.0.1:49339 (3 connections now open)
2016-11-11T14:43:06.663+0000 I NETWORK [initandlisten] connection accepted from 127.0.0.1:49389 #17 (5 connections now open)
2016-11-11T14:43:36.655+0000 I NETWORK [conn16] end connection 127.0.0.1:49387 (3 connections now open)
2016-11-11T14:43:36.656+0000 I NETWORK [initandlisten] connection accepted from 127.0.0.1:49436 #18 (5 connections now open)
2016-11-11T14:43:36.718+0000 I NETWORK [conn17] end connection 127.0.0.1:49389 (3 connections now open)
2016-11-11T14:43:36.719+0000 I NETWORK [initandlisten] connection accepted from 127.0.0.1:49439 #19 (5 connections now open)
2016-11-11T14:44:06.705+0000 I NETWORK [conn18] end connection 127.0.0.1:49436 (3 connections now open)
2016-11-11T14:44:06.705+0000 I NETWORK [initandlisten] connection accepted from 127.0.0.1:49481 #20 (5 connections now open)
2016-11-11T14:44:06.786+0000 I NETWORK [conn19] end connection 127.0.0.1:49439 (3 connections now open)
2016-11-11T14:44:06.786+0000 I NETWORK [initandlisten] connection accepted from 127.0.0.1:49483 #21 (5 connections now open)
2016-11-11T14:44:36.757+0000 I NETWORK [conn20] end connection 127.0.0.1:49481 (3 connections now open)
2016-11-11T14:44:36.757+0000 I NETWORK [initandlisten] connection accepted from 127.0.0.1:49526 #22 (5 connections now open)
2016-11-11T14:44:36.850+0000 I NETWORK [conn21] end connection 127.0.0.1:49483 (3 connections now open)
```
Did any elections happen? Why did they happen?
While this isn’t the most common command to run, it is very helpful if you aren’t using Percona Monitoring and Management (PMM) to track the historical frequency of elections. In this example, we pull up to 20 lines before and after each occurrence of the word “SECONDARY”, which typically brackets the point where a step-down or election takes place. From there you can see whether, around that time, a command was issued, a network error occurred, a heartbeat failed, or some other scenario played out.
```
> grep -i SECONDARY -A20 -B20 mongod.log
2016-11-11T14:44:38.622+0000 I COMMAND [conn22] Attempting to step down in response to replSetStepDown command
2016-11-11T14:44:38.625+0000 I REPL [ReplicationExecutor] transition to SECONDARY
2016-11-11T14:44:38.627+0000 I NETWORK [conn10] end connection 127.0.0.1:49253 (4 connections now open)
2016-11-11T14:44:38.627+0000 I NETWORK [conn11] end connection 127.0.0.1:49254 (4 connections now open)
2016-11-11T14:44:38.630+0000 I NETWORK [thread1] trying reconnect to localhost:27001 (127.0.0.1) failed
2016-11-11T14:44:38.628+0000 I NETWORK [conn22] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [127.0.0.1:49506]
2016-11-11T14:44:38.630+0000 I NETWORK [initandlisten] connection accepted from 127.0.0.1:49529 #25 (5 connections now open)
2016-11-11T14:44:38.633+0000 I NETWORK [thread1] reconnect localhost:27001 (127.0.0.1) ok
2016-11-11T14:44:40.567+0000 I REPL [ReplicationExecutor] replSetElect voting yea for localhost:27002 (1)
2016-11-11T14:44:42.223+0000 I REPL [ReplicationExecutor] Member localhost:27002 is now in state PRIMARY
2016-11-11T14:44:44.314+0000 I NETWORK [initandlisten] connection accepted from 127.0.0.1:49538 #26 (4 connections now open)
```
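The log tells you why an election happened; to see when the current primary last won one, you can also ask the replica set itself. This is a quick sketch using the member fields that the 3.x-era rs.status() output exposes:

```javascript
// Sketch: print each member's state and, for the primary, when it was elected.
// Field names (stateStr, electionDate) are as reported by rs.status() on 3.x replica sets.
rs.status().members.forEach(function (m) {
    var line = m.name + " " + m.stateStr;
    if (m.stateStr === "PRIMARY" && m.electionDate) {
        line += " (elected at " + m.electionDate + ")";
    }
    print(line);
});
```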
Is replication lagged, do I have enough oplog?
Always write a single test document just to ensure replication has a recent write:
```
db.getSiblingDB('repltest').col.insert({x:1});
db.getSiblingDB('repltest').dropDatabase();
```
Checking lag information:
```
rs1:PRIMARY> db.printSlaveReplicationInfo()
source: localhost:27002
	syncedTo: Fri Nov 11 2016 17:11:14 GMT+0000 (GMT)
	0 secs (0 hrs) behind the primary
source: localhost:27003
	syncedTo: Fri Nov 11 2016 17:11:14 GMT+0000 (GMT)
	0 secs (0 hrs) behind the primary
```
Oplog size and range:
```
rs1:PRIMARY> db.printReplicationInfo()
configured oplog size:   192MB
log length start to end: 2154secs (0.6hrs)
oplog first event time:  Fri Nov 11 2016 16:35:20 GMT+0000 (GMT)
oplog last event time:   Fri Nov 11 2016 17:11:14 GMT+0000 (GMT)
now:                     Fri Nov 11 2016 17:16:46 GMT+0000 (GMT)
```
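If you prefer raw numbers over the printed report – for example, to alert when the oplog window falls below the length of your longest backup or maintenance job – db.getReplicationInfo() returns the same data as an object. A small sketch; the one-hour threshold is purely illustrative:

```javascript
// Sketch: warn when the oplog window drops below a minimum number of hours.
var info = db.getReplicationInfo();     // same data that printReplicationInfo() formats
var windowHours = info.timeDiffHours;   // hours between the oplog's first and last event
var minimumHours = 1;                   // hypothetical threshold; size this to your needs

print("oplog size (MB): " + info.logSizeMB);
print("oplog window (hours): " + windowHours);
if (windowHours < minimumHours) {
    print("WARNING: oplog window is below " + minimumHours + " hour(s); consider a larger oplog");
}
```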
Taming the profiler
MongoDB’s profiler and explain output are filled with tons of data. I have highlighted some key fields to know:
```
{
    "queryPlanner" : {
        "mongosPlannerVersion" : 1,
        "winningPlan" : {
            "stage" : "SINGLE_SHARD",
            "shards" : [
                {
                    "shardName" : "rs3",
                    "connectionString" : "rs3/localhost:29001,localhost:29002,localhost:29003",
                    "serverInfo" : {
                        "host" : "Davids-MacBook-Pro-2.local",
                        "port" : 29001,
                        "version" : "3.0.11",
                        "gitVersion" : "48f8b49dc30cc2485c6c1f3db31b723258fcbf39"
                    },
                    "plannerVersion" : 1,
                    "namespace" : "blah.foo",
                    "indexFilterSet" : false,
                    "parsedQuery" : {
                        "name" : {
                            "$eq" : "Bob"
                        }
                    },
                    "winningPlan" : {
                        "stage" : "COLLSCAN",
                        "filter" : {
                            "name" : {
                                "$eq" : "Bob"
                            }
                        },
                        "direction" : "forward"
                    },
                    "rejectedPlans" : [ ]
                }
            ]
        }
    },
    "executionStats" : {
        "nReturned" : 0,
        "executionTimeMillis" : 0,
        "totalKeysExamined" : 0,
        "totalDocsExamined" : 1,
        "executionStages" : {
            "stage" : "SINGLE_SHARD",
            "nReturned" : 0,
            "executionTimeMillis" : 0,
            "totalKeysExamined" : 0,
            "totalDocsExamined" : 1,
            "totalChildMillis" : NumberLong(0),
            "shards" : [
                {
                    "shardName" : "rs3",
                    "executionSuccess" : true,
                    "executionStages" : {
                        "stage" : "COLLSCAN",
                        "filter" : {
                            "name" : {
                                "$eq" : "Bob"
                            }
                        },
                        "nReturned" : 0,
                        "executionTimeMillisEstimate" : 0,
                        "works" : 3,
                        "advanced" : 0,
                        "needTime" : 2,
                        "needFetch" : 0,
                        "saveState" : 0,
                        "restoreState" : 0,
                        "isEOF" : 1,
                        "invalidates" : 0,
                        "direction" : "forward",
                        "docsExamined" : 1
                    }
                }
            ]
        },
        "allPlansExecution" : [
            {
                "shardName" : "rs3",
                "allPlans" : [ ]
            }
        ]
    },
    "ok" : 1
}
```
| Metric | Description |
|---|---|
| filter | The formulated query that was run. Right above it you can find the parsed query; these should be the same. It’s useful to know exactly what the engine was sent in the end. |
| nReturned | Number of documents returned via the cursor to the client running the query/command. |
| executionTimeMillis | This used to just be called “ms”; it tells you how long the operation took. Treat it like a slow-query time in any other database. |
| total(Keys\|Docs)Examined | Unlike nReturned, these count what was actually scanned. Not all indexes have perfect coverage, and sometimes you examine many keys and documents only to return few or no results. |
| stage | While poorly named, this tells you whether a collection scan (COLLSCAN, i.e. a table scan) or an index was used to answer a given operation. If an index was used, it will name it. |
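To capture this kind of detail for your own slow operations, the profiler needs to be turned on. A minimal sketch – level 1 with a 100 ms slowms threshold is just an illustrative choice; tune it to your own latency goals:

```javascript
// Sketch: enable the profiler for slow operations and review the worst offenders.
// Level 1 = only operations slower than slowms; level 2 = everything (much more overhead).
db.setProfilingLevel(1, 100);                     // 100 ms threshold, illustrative only

// Later: the five slowest captured operations from the system.profile collection.
db.system.profile.find({ millis: { $gt: 100 } })
                 .sort({ millis: -1 })
                 .limit(5)
                 .pretty();
```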
CurrentOp and killOp explained
When using db.currentOp() to see what is running, I frequently call db.currentOp(true) so that I can see everything, not just a limited set of operations. This makes currentOp look and act much more like SELECT * FROM information_schema.processlist in MySQL. One significant difference that commonly catches a new DBA off guard is how operations are killed in MySQL versus MongoDB. While MongoDB does have a handy db.killOp(<op_id>) function, it is important to know that, unlike MySQL – which immediately kills the thread running the query – MongoDB behaves a bit differently. When you run killOp(), MongoDB appends “killed: true” to the operation’s document. Only at the next yield (if one occurs) is the operation told to quit. This is also how a shutdown works: if the server seems like it isn’t shutting down, it might be waiting for an operation to yield and notice the shutdown request.
I’m not arguing that this is bad or good, just that it is different from MySQL and something you should be aware of. One thing to note, however, is that MongoDB has great built-in HA. Sometimes it is better to trigger an election and let the drivers gracefully handle the failover, rather than running killOp (unless it’s a write – then you should always try to use killOp).
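To make the above concrete, here is a rough sketch of how you might hunt down long-running operations and kill them from the shell. The 60-second threshold is arbitrary, and remember the caveat above: the operation only exits at its next yield.

```javascript
// Sketch: find operations running longer than 60 seconds and kill them.
// The threshold is arbitrary; inspect op.ns and op.query before killing anything important.
db.currentOp(true).inprog.forEach(function (op) {
    if (op.active && op.secs_running > 60) {
        print("killing opid " + op.opid + " (" + op.op + " on " + op.ns +
              ", running " + op.secs_running + "s)");
        db.killOp(op.opid);   // marks the op killed: true; it quits at the next yield
    }
});
```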
Conclusion
I hope you have found some of this insightful. Look for future posts from our MongoDB team covering other areas of the system we like to look at to help ourselves and clients get to the root of an issue.