
Looking for someone with Chinese knowledge

January 11, 2007 | Posted In: Events and Announcements


We’re looking to implement CJK support in Sphinx, the open source full-text search engine.
Initially we plan to base search on bigram indexing to keep it simple, especially since research papers suggest it offers decent quality in most cases. This is not that complex to implement, but there is no way we can test it, as we have zero knowledge of Chinese or Japanese.
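
To make the idea concrete, here is a minimal sketch of bigram tokenization in Python. It is only an illustration, not Sphinx code: the function names `is_cjk`, `char_class` and `bigram_tokens` are made up for this example, and the Unicode ranges are a rough approximation of what counts as a CJK character.

```python
from itertools import groupby


def is_cjk(ch: str) -> bool:
    """Rough CJK check (assumption: these three ranges are enough for a demo)."""
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF      # CJK Unified Ideographs
        or 0x3040 <= cp <= 0x30FF   # Hiragana and Katakana
        or 0xAC00 <= cp <= 0xD7A3   # Hangul syllables
    )


def char_class(ch: str) -> str:
    """Classify a character as CJK, ordinary word character, or separator."""
    if is_cjk(ch):
        return "cjk"
    if ch.isalnum():
        return "word"
    return "sep"


def bigram_tokens(text: str) -> list[str]:
    """Split text into index tokens.

    Runs of CJK characters become overlapping two-character tokens;
    runs of other alphanumeric characters stay whole words (lowercased).
    """
    tokens = []
    for cls, group in groupby(text, key=char_class):
        run = "".join(group)
        if cls == "word":
            tokens.append(run.lower())
        elif cls == "cjk":
            if len(run) == 1:
                tokens.append(run)  # a lone character is indexed as-is
            else:
                tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
    return tokens


print(bigram_tokens("我的汽车"))      # ['我的', '的汽', '汽车']
print(bigram_tokens("Sphinx 搜索"))  # ['sphinx', '搜索']
```

The appeal of this approach is that it needs no dictionary and no knowledge of the language: every adjacent pair of characters becomes a token, at the cost of a larger index and some false matches across word boundaries.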

If you know Chinese, Japanese or Korean and would like to help us test Sphinx support for these languages, let us know. No special development skills are required. If you’re reading this blog you should be technical enough.

Peter Zaitsev

Peter managed the High Performance Group within MySQL until 2006, when he founded Percona. Peter has a Master's Degree in Computer Science and is an expert in database kernels, computer hardware, and application scaling.

25 Comments

  • Hi Peter, I’m a Chinese guy living in Dalian, China. I’m a big fan of LAMP though only have little
    knowledge of them. But if you just want someone who knows Chinese much better than you and also
    desires to help, please feel free to contact me via email or MSN.

    P.S., please prepare to bear my poor English and I’d better let you know that I just began to learn LAMP
    for a couple of days. 🙂

    Best wishes.

    Nick

  • Peter, just as an FYI, I’ve actually implemented this in Sphinx for edgeio.com. You can see it in action at:

    http://www.edgeio.com/ss/%E6%88%91%E7%9A%84%E6%B1%BD%E8%BD%A6?location=0

    However, I don’t think we’re contributing the code back to Sphinx. We used bigrams along with proximity relevance scoring. Based on what I’ve seen, the relevance ranking is pretty good. So far we’re just doing Chinese UTF-8. We have some folks in China who have done some testing with it.

    My knowledge of Chinese was just good enough to get by here, but I’d be interested in seeing how your effort goes, and helping out a bit if I can.
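
    As a rough illustration of how bigrams and proximity can combine, the Python sketch below scores a document by the longest run of consecutive query bigrams that also appear consecutively in the document. This is an assumption made for the example, not the formula used by edgeio.com or by Sphinx, and the names `bigrams` and `proximity_score` are hypothetical.

```python
def bigrams(text: str) -> list[str]:
    """Overlapping two-character tokens, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]


def proximity_score(query: str, doc: str) -> int:
    """Length of the longest run of query bigrams found adjacent in doc.

    Classic longest-common-substring dynamic programming, applied to
    bigram tokens instead of single characters.
    """
    q, d = bigrams(query), bigrams(doc)
    best = 0
    prev = [0] * (len(d) + 1)
    for i in range(1, len(q) + 1):
        cur = [0] * (len(d) + 1)
        for j in range(1, len(d) + 1):
            if q[i - 1] == d[j - 1]:
                cur[j] = prev[j - 1] + 1  # extend the run ending one step back
                best = max(best, cur[j])
        prev = cur
    return best


print(proximity_score("我的汽车", "我的汽车很新"))  # 3: the whole query matches as a phrase
print(proximity_score("我的汽车", "汽车是我的"))    # 1: both words present, but not adjacent
```

    A document that contains the query as an exact phrase scores higher than one that merely contains the same bigrams scattered around, which is the behavior one wants from proximity ranking.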

  • Thank you guys,

    I did not expect so many people to respond so quickly. We’ll now look into how best to organize this, contact those who provided emails, and post some information here.

  • You should collaborate with the Namazu developers (http://www.namazu.org/index.html.en). Namazu is a search engine made primarily for CJK languages, but also works with English. The engine is written in C, and the indexer is written in perl. I’ve found their code fairly easy to read and follow (and I do not know any of CJK), and submitted a few patches in the past. The developers are quite helpful.

  • Hi Peter, I’m the owner of http://imysql.cn. I’m Chinese, I’m a DBA skilled in MySQL optimization, and I would like to join you 🙂

  • hi, peter, I am a Chinese web programmer, 3 years PHP experience, if you want to test Sphinx CJK Support on Debian AMD64, please contact me.

    epaulin AT gmail dot com

  • How is the progress with this Sphinx Chinese language search test?

    The quality of Chinese language search also pretty much depends on the quality of word segmentation.

    I am wondering if we can do just unigrams when indexing (though with a bigger index) and do word segmentation for the user-submitted search query (or ask users to segment their query, which makes sense as they know what they want to search for). Then we use Sphinx to search using, say, maximum length match, relevance sorting, etc.

    Does it make sense to do it this way if we cannot beat Google/Baidu on word segmentation?
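
    The proposal above can be sketched in Python as follows. Everything here is hypothetical and only meant to show the shape of the idea: the tiny `dictionary`, the `max_len` limit, and the function names are assumptions, and a real setup would rely on Sphinx's own index and phrase matching rather than this in-memory one.

```python
from collections import defaultdict


def build_unigram_index(docs: dict[int, str]) -> dict:
    """Index every single character: char -> {doc_id: [positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, ch in enumerate(text):
            if not ch.isspace():
                index[ch][doc_id].append(pos)
    return index


def segment_max_match(query: str, dictionary: set[str], max_len: int = 4) -> list[str]:
    """Forward maximum matching: at each position take the longest
    dictionary word, falling back to a single character."""
    words, i = [], 0
    while i < len(query):
        if query[i].isspace():
            i += 1
            continue
        for size in range(min(max_len, len(query) - i), 0, -1):
            candidate = query[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


def phrase_docs(index: dict, word: str) -> set[int]:
    """Documents where the characters of `word` occur at consecutive positions."""
    hits = set()
    for doc_id, positions in index.get(word[0], {}).items():
        for start in positions:
            if all(start + k in index.get(word[k], {}).get(doc_id, ())
                   for k in range(1, len(word))):
                hits.add(doc_id)
                break
    return hits


def search(index: dict, query: str, dictionary: set[str]) -> set[int]:
    """Segment the query, then require every segment to match as a phrase."""
    words = segment_max_match(query, dictionary)
    if not words:
        return set()
    result = phrase_docs(index, words[0])
    for word in words[1:]:
        result &= phrase_docs(index, word)
    return result


docs = {1: "我的汽车很新", 2: "汽车站在哪里", 3: "我喜欢火车"}
dictionary = {"我的", "汽车"}  # hypothetical word list
index = build_unigram_index(docs)
print(segment_max_match("我的汽车", dictionary))  # ['我的', '汽车']
print(search(index, "我的汽车", dictionary))      # {1}
```

    The documents themselves are never segmented, so indexing stays language-agnostic; only the (much shorter) query needs a dictionary, and a wrong segmentation there affects a single search rather than the whole index.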
