You know how, in addition to the main movie, you get extras on the DVD: extra commentary, bloopers, extra scenes, etc.? Well, welcome to the Tyrant extras. My previous blog posts were meant to make a case for looking at NoSQL tools, not to serve as a decision-making tool. Each solution has pros and cons that will affect how well the technology works for you. Based on some of the comments and questions on the other posts, I thought I would put together a little more detail on some of the deficiencies and strengths of Tokyo Tyrant.
#1 How durable is Tokyo Tyrant?
Well, I went ahead and built a quick script that just inserted data into a TC table (an id and a timestamp), and did a kill -9 on the server in the middle of it.
    Insert:
    159796,1256131127.17329
    159797,1256131127.17338
    159798,1256131127.17345
    159799,1256131127.17355
    put error: recv error
    159800,1256131127.17364
Here the server died right after the record at 1256131127.17355 (id 159799); the put of the next record (159800) failed with a recv error.
After bringing the server up from a crash:
    159795,1256131127.1732
    159796,1256131127.17329
    159797,1256131127.17338
    159798,1256131127.17345
    159799,1256131127.17355
All the records are still there. So we are good, right? Looking at the code, Tokyo Cabinet actually uses memory-mapped files. I personally have not worked with mmapped files, so feel free to correct me if you know better than I do. With mmap in play, a kill -9 seems to preserve the changes held in memory, while powering down the server does not:
    163,1257780699.10123
    164,1257780699.35172
    165,1257780699.60209
    166,1257780699.85246
Insert yanking of the power cord here… which gives us post-crash data of:
    142,1257780693.84303
    143,1257780694.09345
So we lost roughly 5 to 6 seconds of data: everything from record 144 through record 166.
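The difference between the two crashes can be reproduced in miniature. The sketch below is my own illustration, not code from Tokyo Cabinet: the function name, the file path you pass in, and `DEMO_MAP_SIZE` are all hypothetical. A child process writes into a shared file mapping and is then killed with SIGKILL without ever calling msync or fsync; the parent can still read the bytes back, because on systems with a unified page cache (Linux, for example) the dirty pages belong to the kernel, not to the dead process. Nothing here forces those pages onto the disk, which is consistent with the power-cord test losing data.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define DEMO_MAP_SIZE 4096  /* hypothetical mapping size for this demo */

/* Child: write msg into a shared mapping of path, then die from SIGKILL
   without calling msync, fsync, or munmap.
   Parent: wait for the child, then read the file through plain read().
   Returns the number of bytes read into out, or -1 on error. */
ssize_t mmap_write_then_kill(const char *path, const char *msg,
                             char *out, size_t outsiz){
  assert(path && msg && out);
  size_t len = strlen(msg);
  if(len > DEMO_MAP_SIZE || len > outsiz) return -1;
  int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
  if(fd < 0) return -1;
  if(ftruncate(fd, DEMO_MAP_SIZE) == -1){ close(fd); return -1; }
  pid_t pid = fork();
  if(pid < 0){ close(fd); return -1; }
  if(pid == 0){
    char *map = mmap(NULL, DEMO_MAP_SIZE, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    if(map == MAP_FAILED) _exit(1);
    memcpy(map, msg, len);  /* dirties the page; no sync of any kind */
    raise(SIGKILL);         /* same effect as kill -9 on this process */
    _exit(1);               /* not reached */
  }
  int status = 0;
  if(waitpid(pid, &status, 0) < 0 || !WIFSIGNALED(status)){
    close(fd);
    return -1;
  }
  ssize_t n = read(fd, out, len);  /* served from the kernel page cache */
  close(fd);
  return n;
}
```

Calling `mmap_write_then_kill` with a scratch file path and a short record string should hand the same string back to the parent, even though the writer never synced and never exited cleanly.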
Looking at the Tyrant & Cabinet documentation you will see mention of a SYNC command which they say does the following:
“The function `tcrdbsync’ is used in order to synchronize updated contents of a remote database object with the file and the device.”
Let’s dig a little deeper into the code and see what’s going on:
    /* Synchronize updated contents of a hash database object with the file and the device. */
    bool tchdbsync(TCHDB *hdb){
      assert(hdb);
      if(!HDBLOCKMETHOD(hdb, true)) return false;
      if(hdb->fd < 0 || !(hdb->omode & HDBOWRITER) || hdb->tran){
        tchdbsetecode(hdb, TCEINVALID, __FILE__, __LINE__, __func__);
        HDBUNLOCKMETHOD(hdb);
        return false;
      }
      if(hdb->async && !tchdbflushdrp(hdb)){
        HDBUNLOCKMETHOD(hdb);
        return false;
      }
      bool rv = tchdbmemsync(hdb, true);
      HDBUNLOCKMETHOD(hdb);
      return rv;
    }
It first checks whether the file descriptor for the database is less than 0, whether you are not operating as a writer, or whether a transaction is in progress; in any of those cases it errors out. Then it checks whether you are running in async I/O mode. If you are, it flushes the records from the delayed record pool. If you are running async and you do not flush your records, then you are at the mercy of Tokyo Cabinet, or of your application, to call one of the numerous operations that flush the delayed record pool (i.e. all regular synchronous operations like tchdbput will flush it). I did not test with async; in fact, to the best of my knowledge, Tyrant does not appear to support async, even though Cabinet does. That means the meat of the sync command coming from Tyrant is tchdbmemsync.
    /* Synchronize updating contents on memory of a hash database object. */
    bool tchdbmemsync(TCHDB *hdb, bool phys){
      assert(hdb);
      if(hdb->fd < 0 || !(hdb->omode & HDBOWRITER)){
        tchdbsetecode(hdb, TCEINVALID, __FILE__, __LINE__, __func__);
        return false;
      }
      bool err = false;
      char hbuf[HDBHEADSIZ];
      tchdbdumpmeta(hdb, hbuf);
      memcpy(hdb->map, hbuf, HDBOPAQUEOFF);
      if(phys){
        size_t xmsiz = (hdb->xmsiz > hdb->msiz) ? hdb->xmsiz : hdb->msiz;
        if(msync(hdb->map, xmsiz, MS_SYNC) == -1){
          tchdbsetecode(hdb, TCEMMAP, __FILE__, __LINE__, __func__);
          err = true;
        }
        if(fsync(hdb->fd) == -1){
          tchdbsetecode(hdb, TCESYNC, __FILE__, __LINE__, __func__);
          err = true;
        }
      }
      return !err;
    }
Here you see the call to msync. What does msync do? The man page says:
“The msync() function writes all modified data to permanent storage locations, if any, in those whole pages containing any part of the address space of the process starting at address addr and continuing for len bytes.”
Basically, in the Tokyo Tyrant context, msync flushes all the changes made to a memory-mapped object to disk. This msync is crucial, as you cannot guarantee that data ever makes it to disk if it is not called (more below).
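Stripped of the Tokyo Cabinet bookkeeping, the flush pattern in tchdbmemsync is small enough to try on its own. The following is a minimal sketch under my own assumptions: `flush_mapping` and `put_and_flush` are hypothetical names, and the 4096-byte mapping size is arbitrary. It only mirrors the order of calls seen above (msync with MS_SYNC on the mapped region, then fsync on the file descriptor) and is not Cabinet's actual implementation.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Flush a shared file mapping in the same order tchdbmemsync uses:
   msync on the mapped region, then fsync on the descriptor.
   Both calls are attempted even if the first one fails. */
bool flush_mapping(void *map, size_t len, int fd){
  assert(map && len > 0 && fd >= 0);
  bool err = false;
  if(msync(map, len, MS_SYNC) == -1) err = true;
  if(fsync(fd) == -1) err = true;
  return !err;
}

/* Hypothetical helper: write one record at the start of a freshly created
   file through a shared mapping, then flush it. Returns true on success. */
bool put_and_flush(const char *path, const char *rec){
  assert(path && rec);
  const size_t mapsiz = 4096;  /* assumed mapping size for the demo */
  size_t len = strlen(rec);
  if(len > mapsiz) return false;
  int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
  if(fd < 0) return false;
  if(ftruncate(fd, (off_t)mapsiz) == -1){ close(fd); return false; }
  char *map = mmap(NULL, mapsiz, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  if(map == MAP_FAILED){ close(fd); return false; }
  memcpy(map, rec, len);                      /* change lives in memory only */
  bool ok = flush_mapping(map, mapsiz, fd);   /* now ask for it on disk */
  munmap(map, mapsiz);
  close(fd);
  return ok;
}
```

Note that the flush covers the whole mapped region, not just the bytes that changed, which is the same shape as the whole-map msync in the Cabinet code.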
The tchdbmemsync function is the only place I saw calling msync. So what calls tchdbmemsync?
tchdbmemsync is called via:
    tchdboptimize
    tchdbsync
    tchdbtranbegin
    tchdbtrancommit
    tchdbtranabort
    tchdbcloseimpl
    tchdbcopyimpl
The commands that will indirectly call msync are: running the optimize command, calling sync directly, closing a connection to the DB, copying the database, or starting, committing, or aborting a transaction. Note that a transaction in TC is actually a global transaction that locks all write operations (it is used for maintenance). What is missing here is a scheduled call to msync. I traced the calls from Tyrant back into Cabinet and could not find anything that calls it automatically.
The documentation on msync actually says that without calling msync there is no guarantee of the data making it to disk. This implies that it may eventually get written without a direct msync call (for example, when old data is purged from memory on an LRU basis). Testing this theory, I crashed my server several times and found that data written out to disk without calling msync was very flaky indeed. I had anywhere from 5 to 60 seconds of data missing post-crash.
This means that for durability you really need to call the sync command directly. In my previous post someone pointed out a flaw in this approach, saying that they had seen calling sync after writes ruin performance. Looking at the code, you can see why calling sync after each write can severely degrade performance. Before I explain, let's look at the performance hit:

Saying there is a performance hit here is an understatement. The reason, however, is really how msync works and how it is used in Tokyo Cabinet. In a sense it is implemented as a global sync, not a record sync, i.e. all changes to the underlying database are flushed at once. So instead of syncing the record you just changed, all of the changed records in the DB are flushed and synced. Performing this operation requires a lock, which blocks other SYNC calls. So if you have 32 threads, you could have 1 sync running and 31 others blocked. This means calling sync after every write is going to severely degrade performance.
So what can we do to make Cabinet more durable? Well, the best option in my opinion is to steal a trick from InnoDB:
We can easily write a script that calls a background sync every second (i.e. like innodb_flush_log_at_trx_commit = 0/2). I have tested this, and I see almost 0 impact on my gaming benchmark between when this is running and when it is not.
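The background-sync idea boils down to a loop that sleeps for an interval and then issues one sync. Here is a minimal sketch of that loop, assuming a callback stands in for the real sync call; `sync_fn`, `periodic_sync`, and `counting_sync` are hypothetical names of mine. In an actual client the callback would presumably wrap the `tcrdbsync` function quoted from the documentation above, but that part is left out so the example does not depend on the Tyrant headers.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Callback that performs one sync and reports success. In a real client
   this would presumably wrap tcrdbsync on an open connection (assumption). */
typedef bool (*sync_fn)(void *ctx);

/* Sleep interval_ms milliseconds, call do_sync(ctx), and repeat for the
   given number of rounds. Returns how many syncs reported success. */
int periodic_sync(sync_fn do_sync, void *ctx, long interval_ms, int rounds){
  assert(do_sync && interval_ms >= 0 && rounds >= 0);
  struct timespec ts;
  ts.tv_sec = interval_ms / 1000;
  ts.tv_nsec = (interval_ms % 1000) * 1000000L;
  int ok = 0;
  for(int i = 0; i < rounds; i++){
    nanosleep(&ts, NULL);  /* an interrupted sleep just syncs a bit early */
    if(do_sync(ctx)) ok++;
  }
  return ok;
}

/* Stand-in callback for trying the loop out: it only counts its calls. */
bool counting_sync(void *ctx){
  int *calls = ctx;
  (*calls)++;
  return true;
}
```

With an interval of 1000 ms this matches the one-sync-per-second setup described above; the interval is the knob that trades the amount of data at risk against the cost of each global flush.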

You can write this script and run it from cron, or you can use the method ttserver itself provides for calling functions periodically:
    -ext path : specify the script language extension file.
    -extpc name period : specify the function name and the calling period of a periodic command.
Now, while I did not see a drop in my benchmark, heavy write workloads will see a drop in performance. For instance, with 8 threads simply updating/inserting data, I saw this:

Ouch, a 2X hit. But you can configure the frequency of the sync up or down as needed to ensure you have the proper recovery -vs- performance setting.