Tokyo Tyrant – The Extras Part I : Is it Durable?

November 10, 2009

You know how a DVD comes with extras in addition to the main movie: extra commentary, bloopers, extra scenes, and so on? Well, welcome to the Tyrant extras. My previous blog posts were meant to make a case for looking at NoSQL tools, not to serve as a decision-making tool. Each solution has pros and cons that will affect how well the technology works for you. Based on some of the comments and questions on the other posts, I thought I would put together a little more detail on some of the deficiencies and strengths of Tokyo Tyrant.

#1 How durable is Tokyo Tyrant?

Well, I went ahead and built a quick script that just inserted data into a TC table (an id and a timestamp) and did a kill -9 on the server in the middle of it.

Insert:
159796,1256131127.17329
159797,1256131127.17338
159798,1256131127.17345
159799,1256131127.17355
put error: recv error
159800,1256131127.17364

Here the put failed right after the record stamped 1256131127.17355 (id 159799), before the next record was inserted.

After bringing the server up from a crash:

159795,1256131127.1732
159796,1256131127.17329
159797,1256131127.17338
159798,1256131127.17345
159799,1256131127.17355

All the records are still there. So we are good, right? Looking in the code, Tokyo Cabinet actually uses memory-mapped files. I personally have not used mmapped files before, so feel free to correct me if you know better than I do. Using mmap here, a kill -9 seems to preserve the changes held in memory, while powering down the server does not. The last records inserted before the power cut:

163,1257780699.10123
164,1257780699.35172
165,1257780699.60209
166,1257780699.85246

Insert yanking of the power cord here… which gives us post-crash data of:

142,1257780693.84303
143,1257780694.09345

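So we basically lost 5-ish seconds of data: the last surviving record is stamped 1257780694.09, while the last insert before the power cut was at 1257780699.85.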
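Why would kill -9 keep the data while a power cut loses it? A plausible explanation, assuming Linux and assuming Cabinet maps its file with MAP_SHARED: a store into the mapping lands in the kernel’s page cache rather than in memory private to the process, so it outlives the process but not the machine. The sketch below is not Cabinet code; the function name and the file path are made up for illustration. It writes through a mapping, deliberately skips msync and fsync, and reads the bytes back with pread to show that the kernel already holds them.

```c
#define _XOPEN_SOURCE 700   /* expose pread/ftruncate under strict compilers */
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Write msg (including its terminating NUL) into a MAP_SHARED mapping of the
   file at path, skip msync/fsync on purpose, then read the file back through
   the descriptor into out. Returns 0 on success, -1 on any error. */
int mmap_write_without_sync(const char *path, const char *msg,
                            char *out, size_t outsiz){
  assert(path && msg && out);
  size_t len = strlen(msg) + 1;
  if(len > outsiz) return -1;
  int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
  if(fd < 0) return -1;
  /* The file must be at least as large as the region we map and touch. */
  if(ftruncate(fd, (off_t)len) == -1){
    close(fd);
    return -1;
  }
  char *map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  if(map == MAP_FAILED){
    close(fd);
    return -1;
  }
  memcpy(map, msg, len);               /* the "insert": a plain memory store */
  /* No msync() and no fsync() here, on purpose. */
  ssize_t n = pread(fd, out, len, 0);  /* read back via the file descriptor */
  munmap(map, len);
  close(fd);
  return (n == (ssize_t)len) ? 0 : -1;
}
```

Reading the bytes back only proves that the kernel holds them. Nothing here says they have reached the disk, which is exactly the gap a power cut exposes.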

Looking at the Tyrant & Cabinet documentation you will see mention of a SYNC command which they say does the following:

“The function `tcrdbsync’ is used in order to synchronize updated contents of a remote database object with the file and the device.”

Let’s dig a little deeper into the code and see what’s going on:

/* Synchronize updated contents of a hash database object with the file and the device. */
bool tchdbsync(TCHDB *hdb){
  assert(hdb);
  if(!HDBLOCKMETHOD(hdb, true)) return false;
  if(hdb->fd < 0 || !(hdb->omode & HDBOWRITER) || hdb->tran){
    tchdbsetecode(hdb, TCEINVALID, __FILE__, __LINE__, __func__);
    HDBUNLOCKMETHOD(hdb);
    return false;
  }
  if(hdb->async && !tchdbflushdrp(hdb)){
    HDBUNLOCKMETHOD(hdb);
    return false;
  }
  bool rv = tchdbmemsync(hdb, true);
  HDBUNLOCKMETHOD(hdb);
  return rv;
}

It first checks whether the file descriptor for the database is less than 0, whether you are not operating as a writer, or whether a transaction is active… in any of these cases it errors. Then it checks whether you are running in async I/O mode. If you are running async, it flushes the records from the delayed record pool. If you’re running async and you do not flush your records, then you’re at the mercy of Tokyo Cabinet, or your application, to call one of the numerous operations that flush the delayed record pool (i.e. all regular synchronous operations like tchdbput will flush it). I did not test with async; in fact, to the best of my knowledge, it does not look like Tyrant supports async, even though Cabinet does. This means the meat of the sync command coming from Tyrant is tchdbmemsync.

/* Synchronize updating contents on memory of a hash database object. */
bool tchdbmemsync(TCHDB *hdb, bool phys){
  assert(hdb);
  if(hdb->fd < 0 || !(hdb->omode & HDBOWRITER)){
    tchdbsetecode(hdb, TCEINVALID, __FILE__, __LINE__, __func__);
    return false;
  }
  bool err = false;
  char hbuf[HDBHEADSIZ];
  tchdbdumpmeta(hdb, hbuf);
  memcpy(hdb->map, hbuf, HDBOPAQUEOFF);
  if(phys){
    size_t xmsiz = (hdb->xmsiz > hdb->msiz) ? hdb->xmsiz : hdb->msiz;
    if(msync(hdb->map, xmsiz, MS_SYNC) == -1){
      tchdbsetecode(hdb, TCEMMAP, __FILE__, __LINE__, __func__);
      err = true;
    }
    if(fsync(hdb->fd) == -1){
      tchdbsetecode(hdb, TCESYNC, __FILE__, __LINE__, __func__);
      err = true;
    }
  }
  return !err;
}

Here you see the call to msync. What does msync do? The man page says:

“The msync() function writes all modified data to permanent storage locations, if any, in those whole pages containing any part of the address space of the process starting at address addr and continuing for len bytes.”

Basically, in the Tokyo Tyrant context, msync flushes all the changes made to a memory-mapped object to disk. This msync is crucial, as you cannot guarantee data ever makes it to disk if it’s not called (more below).
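Stripped of Cabinet’s bookkeeping, the durable path is a store into the mapping followed by msync with MS_SYNC and then fsync on the descriptor, in the same order tchdbmemsync uses. A minimal sketch, assuming Linux/POSIX; the function name and the path are hypothetical and not part of Tokyo Cabinet:

```c
#define _XOPEN_SOURCE 700   /* expose ftruncate/fsync under strict compilers */
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Store msg into a MAP_SHARED mapping of the file at path, then push it to
   the device with msync(MS_SYNC) followed by fsync().
   Returns 0 if both sync calls succeed, -1 otherwise. */
int mmap_write_and_sync(const char *path, const char *msg){
  assert(path && msg);
  size_t len = strlen(msg) + 1;
  int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
  if(fd < 0) return -1;
  if(ftruncate(fd, (off_t)len) == -1){
    close(fd);
    return -1;
  }
  char *map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  if(map == MAP_FAILED){
    close(fd);
    return -1;
  }
  memcpy(map, msg, len);
  int rv = 0;
  /* MS_SYNC blocks until the dirty pages in the range have been written. */
  if(msync(map, len, MS_SYNC) == -1) rv = -1;
  /* fsync covers the file descriptor itself, as tchdbmemsync also does. */
  if(fsync(fd) == -1) rv = -1;
  munmap(map, len);
  close(fd);
  return rv;
}
```

Note that the range handed to msync here is just the bytes written. Cabinet passes the whole mapped region instead, which matters for the performance discussion further down.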

The tchdbmemsync function is the only place I saw calling msync. What calls tchdbmemsync?

tchdbmemsync is called via:

tchdboptimize
tchdbsync
tchdbtranbegin
tchdbtrancommit
tchdbtranabort
tchdbcloseimpl
tchdbcopyimpl

The commands that will indirectly call an msync are: running the optimize command, calling a sync directly, copying the database, closing the database, or starting, committing, or aborting a transaction. Note that a transaction in TC is actually a global transaction that locks all write operations (used for maintenance). What is missing here is a scheduled call to msync. I traced the calls from Tyrant into Cabinet and could not find anything that calls it automatically.

The documentation on msync actually says that without calling msync there is no guarantee of the data making it to disk. This implies that it may eventually get written without a direct msync call (for example, when old data is purged/LRU’d from memory). Testing this theory, I crashed my server several times and found that relying on data being written out to disk without calling msync was very flaky indeed. Post crash, I was missing anywhere from 5 to 60 seconds of data.

This means that for durability you really need to call the sync command directly. In my previous post someone pointed out a flaw in this approach, saying that they had seen calling a sync after writes ruin performance. Looking at the code, you can see why calling a sync after each write can severely degrade performance. Before I explain, let’s look at the performance hit:

[Chart: sync after every call]

Saying there is a performance hit here is an understatement. The reason for this, however, is really how msync works and how it’s used in Tokyo Cabinet. In a sense it is implemented as a global sync, not a record sync, i.e. all changes to the underlying database are flushed at once. So instead of syncing just the record you changed, all of the changed records in the DB are flushed and synced. Performing this operation requires a lock, which blocks other SYNC calls. So if you have 32 threads, you could have 1 sync running and 31 others blocked. This means calling a sync after every write is going to severely degrade performance.
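To put a rough number on that: if every write is followed by a sync, and syncs are serialized by the lock, then the whole server can complete at most one write per sync duration, no matter how many threads are pushing. The little model below is only back-of-the-envelope arithmetic; the function names are invented, and the sync time you feed it is an assumed figure, not something I measured.

```c
#include <assert.h>

/* Upper bound on writes per second for the whole server when every write
   is followed by a serialized sync that takes sync_usec microseconds. */
long max_synced_writes_per_sec(long sync_usec){
  assert(sync_usec > 0);
  return 1000000L / sync_usec;
}

/* The same bound shared evenly between nthreads writer threads. */
double per_thread_writes_per_sec(long sync_usec, int nthreads){
  assert(nthreads > 0);
  return (double)max_synced_writes_per_sec(sync_usec) / nthreads;
}
```

With an assumed 10 ms per sync, max_synced_writes_per_sec(10000) gives 100 writes per second in total, or about 3 per second for each of 32 threads.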

So what can we do to make Cabinet more durable? Well, the best option in my opinion is to steal a trick from InnoDB:

We can easily write a script that calls a background sync every second (i.e. like innodb_flush_log_at_trx_commit = 0/2). I have tested this and I see almost zero impact on my gaming benchmark between when this is running and when it is not.

[Chart: once-a-second sync]

You can write this as a script and cron it, or ttserver actually provides a method to call functions periodically:

-ext path : specify the script language extension file.
-extpc name period : specify the function name and the calling period of a periodic command.

Now, while I did not see a drop in my benchmark, heavy write operations will see a drop in performance… for instance, with 8 threads simply updating/inserting data I saw this:

[Chart: heavy insert, sync once a second]

Ouch, a 2X hit. But you can configure the frequency of the sync up or down as needed to ensure you have the proper recovery -vs- performance setting.
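One way to picture that knob, sketched under assumptions: gate the msync/fsync pair on elapsed time, so the interval you pick is roughly the most data you can lose. This is a time check meant to be called from a writer or housekeeping loop. It is not a true background thread, and it is not ttserver’s -extpc mechanism. SYNCER and the syncer_* names are invented for this example, and the caller passes in the current time in seconds, which keeps the logic easy to follow.

```c
#define _XOPEN_SOURCE 700   /* expose ftruncate/fsync under strict compilers */
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper type: one mapped file plus its sync schedule. */
typedef struct {
  void *map;            /* start of the MAP_SHARED region */
  size_t len;           /* size of the region in bytes */
  int fd;               /* descriptor backing the mapping */
  double interval;      /* minimum seconds between two syncs */
  double last_sync;     /* time of the last successful sync */
  unsigned long syncs;  /* number of syncs performed so far */
} SYNCER;

/* Create and map the file; now is the starting time. Returns 0 or -1. */
int syncer_open(SYNCER *s, const char *path, size_t len,
                double interval, double now){
  assert(s && path && len > 0 && interval > 0);
  s->fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
  if(s->fd < 0) return -1;
  if(ftruncate(s->fd, (off_t)len) == -1){
    close(s->fd);
    return -1;
  }
  s->map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, s->fd, 0);
  if(s->map == MAP_FAILED){
    close(s->fd);
    return -1;
  }
  s->len = len;
  s->interval = interval;
  s->last_sync = now;
  s->syncs = 0;
  return 0;
}

/* Sync only if at least interval seconds have passed since the last sync.
   Returns 1 if a sync ran, 0 if it was not due yet, -1 on error. */
int syncer_tick(SYNCER *s, double now){
  assert(s);
  if(now - s->last_sync < s->interval) return 0;
  if(msync(s->map, s->len, MS_SYNC) == -1) return -1;
  if(fsync(s->fd) == -1) return -1;
  s->last_sync = now;
  s->syncs++;
  return 1;
}

/* Unmap and close; does not sync on its own. */
void syncer_close(SYNCER *s){
  assert(s);
  munmap(s->map, s->len);
  close(s->fd);
}
```

With interval set to 1.0 this approximates the once-a-second sync; raising it trades a longer potential loss window for fewer syncs.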

5 Comments

Alexis

16 years ago

A minor comment, the “sync”, “nosync” colors used for the charts change from one to the other; this is confusing.

Toru Maesaka

16 years ago

Hi!

Awesome post and analysis of TC/TT. Mikio wrote his thoughts about this matter on his blog. Thought you’d find it interesting.

http://1978th.net/tech-en/promenade.cgi?id=6

Cheers,
Toru

Nicolas

16 years ago

Very good post! I stopped testing TT after getting timeouts when calling sync. However, I was calling it every 5 minutes. I will now try the 1-second sync to see if response time keeps < 10 ms

Mark

16 years ago

FYI: TokyoTyrant / LUA doesn’t provide a mechanism to call sync() or any of the other methods required except optimize but that sounds like something that shouldn’t be called each second on a live database…
Source: http://1978th.net/tokyotyrant/spex.html#luaext

perl does, but using -ext requires a LUA extension.
I’m referencing the latest docs. Did I miss something?

Thanks.

leebert

15 years ago

If ACID conformance is crucial then sync’ing once per second won’t be sufficient, hardware tuning is the last alternative here.

If TC has transaction logging (Berkeley DB has it) you could isolate the tranlog file to a separate “fast” spindle & do an fsync against just that tranlog. The tranlog then plays the actual commits against the main tables.

Another thing to consider is the hotspot in the filesystem (at the end of the data tables or tranlog). Also tuning your I/O buffers & cache to actually be *smaller* might help, avoiding the pitfall of big deferred writes or having to make manual sync/fsync calls from your application.

Likewise, if you can push the table’s indices off to a different set of spindles, that’s yet more I/O wait averted by hardware-based tuning.

As for indices, if the DBM (TC) allows for using clustered PK indices (esp. with tunable fillfactor) the table itself can be tuned so that write hotspots – in conjunction with any striped RAID – are more likely to spread out across the IO system & reduce IO wait. The table bloats more quickly but the performance gains are readily had.

Basically all the tuning tricks that apply to any RDBMS (like DB/2) apply to a file-based DBMS as well. The difference is that with a file-based DBMS you have to tune the OS more in keeping with a database machine. Think: AS/400 or S390. One can even schedule such tuning parameters, where the OS buffers/caches might be tuned at certain times of the day/month/year for write-intensive periods, to flush dirty pages quickly & regularly in the background. VLDB systems are often tuned accordingly.

Another trick is the appropriate use of RAID. RAID5 brings write CRC overhead (pure striping RAID0 doesn’t). RAID1+0 to the rescue – the hotspots can spread out across a LVM on RAID10 – the writes will be appreciably faster & if you mirror twice the odds are much lower readers will contend on disk heads against a big writer. Although RAID10 brings 2x (or 3x) overhead in terms of disk usage, if it’s reliability & speed you want then hardware to the rescue….

As for speed, the type of IO hardware also plays a role here – SSA is going to be faster & more reliable than SCSI b/c SSA runs on a loop. There’s less I/O contention for starters & a failed drive on a SSA loop won’t disrupt the loop.

Also a good HDD controller would have a cache battery that ensures buffers are flushed if there’s power interruption. The beauty of this is that you can tune your buffers & cache down near the size of the controller’s own cache & know your OS is keeping the data pumped at the pace of the IO controller’s ability to safely perform writes.

This requires understanding what parameters to set in sysctl, but it ain’t rocket science either….