September 2, 2014

Integers in PHP, running with scissors, and portability

Until recently I thought that the currently popular scripting languages, most of which evolved over the last 10 years or so, must allow for easier portability across platforms than ye good olde C/C++.

After all, their development started a few decades after C, so its notorious caveats are all well-known and should be easy to avoid when designing a new language, right?

However, PHP just brought me a new definition of “portable” – and that was when working with… integers.

PHP is not able to handle unsigned integers, and converts values over 2^31 to signed. So if your IDs go slightly over 2 billion, and PHP decides to treat them as integers, you’re in trouble.

Oh wait, no – that’s on 32-bit platforms only! PHP int size is platform-dependent, and it seems to be 8 bytes on our 64-bit boxes. Yes, the very same ones where C/C++ int is 4 bytes, you know.

That was the easy part. It was mostly documented.

Now, there’s a function called unpack() which essentially lets you convert different types of data from binary strings to PHP variables. What if you try to unpack an unsigned 32-bit big-endian integer (format code “N”)? Let’s check the doc:

“If you specify a number beyond the bounds of the integer type, it will be interpreted as a float instead.”

Having read the doc, I blatantly relied upon it and expected that large unsigned 32-bit numbers would be converted to float, or string, or something, but handled properly. However, a couple of weeks ago the following notice suddenly appeared:

“Note that PHP internally stores integral values as signed. If you unpack a large unsigned long and it is of the same size as PHP internally stored values the result will be a negative number even though unsigned unpacking was specified.”

How sweet. No, it just could not behave as documented and convert a 32-bit unsigned value to float on x32 or keep it an integer on x64 – you now suddenly have to care about value size yourself. Ah, and by the way, there’s no official way to know what the int size is.

To make things even better, 5.2.1 introduced a nice bug in unpack(), which f..ed unpacking less-than-16-bit values on x64. (I assume you understand that “f..ed” means “fixed”). It took some time and several tries to convince the PHP team that x64 has enough bits to hold a 16-bit unpacked value, but thankfully it’s now acknowledged and assigned.

To summarize, if you need to unpack an unsigned 32-bit int from a binary stream, you have to:

  • convert it to float or string manually,
  • do that depending on the int size on the current platform,
  • which cannot be determined using anything documented,
  • and specifically avoid PHP 5.2.1 on x64.

Most people could probably learn all that, and then use sprintf("%u", $id), work with string IDs everywhere, avoid 5.2.1 and be happy.
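A minimal sketch of that sprintf() route, assuming PHP 7 or later. The helper name `unpack_u32_as_string` is made up for illustration and does no input validation. It should give the same decimal string whether unpack() hands back a negative 32-bit int or a positive 64-bit one:

```php
<?php
// Hypothetical helper: read a big-endian unsigned 32-bit value and
// return it as a decimal string, whatever the platform's int size is.
function unpack_u32_as_string(string $bytes): string
{
    $parts = unpack('N', $bytes);     // may come back negative where ints are 32-bit
    return sprintf('%u', $parts[1]);  // %u formats the value as unsigned
}

echo unpack_u32_as_string("\xff\xff\xff\xff"), "\n"; // 4294967295
```

Keeping the ID as a string from this point on avoids any further dependence on the native int size.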

Unfortunately, my final goal was to have support for 64-bit document IDs…

Let’s do a small time travel. Integer types in C/C++ have always been a pain, but back in 1999 the ISO committee ratified the ISO/IEC 9899:1999 standard, also known as ISO C99, which guarantees that the “long long int” integer type must be at least 64 bits in size. By now, most compilers support that part perfectly.

However, the designers of the PHP 5 type system (PHP 5 was released in 2004) were either not aware of this change, or decided not to rely on a standard which had been out for “only” 5 years by then, or just thought that 31 (no typo) bits and 640K should be enough for everybody.

Long story short, it’s 2007 now but there’s no native 64-bit integer type in PHP. Let me remind you that the built-in “int” might be 64-bit, but then again it might not be, and there’s no official way to tell.

This time, there are a number of routes one could take – either use ints (and pray that the app is never run on x32, and that the “platform dependent” size does not change to 4 in the next version); or use the GMP or bcmath extensions if they are available.

Fine, so 99.999% of the world would hit that, compile in bcmath, and be happy again.

Unfortunately, I needed to develop a library which could be deployed in any environment – and still work, and produce reasonable results. The worst case is x32, and neither GMP nor bcmath available.

And this is how the following code was born.

For reference, this is what the equivalent C/C++ snippet would look like:

Portability in year 2007.
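For the worst case described above (32-bit ints, no GMP, no bcmath), one possible approach can be sketched as follows. This is an illustration and not the Sphinx API code: the name `u64_bytes_to_decimal` is made up, it assumes PHP 7+ for intdiv(), and it takes the 8 raw big-endian bytes rather than separate high and low halves. The idea is to split the value into 16-bit words, which unpack() never returns as negative, and divide by 10 repeatedly so that no intermediate value needs more than 31 bits.

```php
<?php
// Hypothetical sketch: convert an unsigned 64-bit big-endian value to a
// decimal string using only arithmetic that fits in a signed 32-bit int.
function u64_bytes_to_decimal(string $bytes): string
{
    if (strlen($bytes) !== 8) {
        throw new InvalidArgumentException('expected exactly 8 bytes');
    }
    // Four 16-bit words, most significant first; "n" is always non-negative.
    $words = array_values(unpack('n4', $bytes));
    $digits = '';
    do {
        // One pass of long division by 10 over the base-65536 number.
        $remainder = 0;
        $nonZero = false;
        foreach ($words as $i => $word) {
            $acc = $remainder * 65536 + $word;  // at most 9 * 65536 + 65535
            $words[$i] = intdiv($acc, 10);
            $remainder = $acc % 10;
            $nonZero = $nonZero || $words[$i] !== 0;
        }
        $digits = $remainder . $digits;         // remainders are the digits, last first
    } while ($nonZero);
    return $digits;
}

echo u64_bytes_to_decimal(str_repeat("\xff", 8)), "\n"; // 18446744073709551615
```

The result is a string, so callers have to treat IDs as strings everywhere, which is the price of not depending on the native int size.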

About Andrew Aksyonoff

Andrew is the creator of the Sphinx full-text search engine, which powers Craigslist, Slashdot, Dealnews, and many other large-scale websites.

Comments

  1. peter says:

    Thank you Andrew for posting your findings. I know you spent quite a while implementing it in a portable way for the Sphinx API.

    I had a similar surprise moving to 64-bit PHP with the crc32() function, which magically started to return different values.

    I think MySQL had a much better approach in this regard. MySQL internal integer math was always 64-bit, even on 32-bit platforms. This carried a bit of a performance penalty, though not much in reality, and at least you get good portability.

  2. Holy crap, your _Make64 function doesn’t look fast at all when GMP or BCMATH isn’t available… After reading your story, I can feel your pain…

  3. shodan says:

    I’ve just done quick-n-dirty speed testing and the slowest “manual” route is 5.5 times slower than bcmath. bcmath yields ~128K calls/sec, and manual ~23K calls/sec. This is on AXP-3200+ under WinXP and PHP 4.4.1.

    Normally there would be only a few records processed (say, 20-100) so both speeds can be tolerated. So I’m much more surprised by the number of issues and workarounds required to perform a very simple 64-bit operation which has been in the C standard for ages now.

  4. shodan says:

    For the record, with PHP 5.1.6 on x64 Linux box it’s ~64K calls/sec for manual, 125K calls/sec for GMP, and 1200K+ calls/sec with 64bit ints.

  5. “Ah, and by the way, there’s no official way to know what’s int size.”

    Since PHP 4.4.0 and PHP 5.0.5, there is a PHP_INT_MAX constant documented on http://www.php.net/manual/en/reserved.constants.php

  6. shodan says:

    Jakub, thanks for the link.

    I stand corrected. There is an official way since 4.4.0, called PHP_INT_SIZE.

    However, this bit is hidden pretty well IMO. PHP_INT_SIZE is neither mentioned in the section on integer types, nor can be easily found in the documentation (results from http://www.php.net/results.php?q=int+size&p=manual&l=en are just irrelevant).
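For readers who have not met these constants, a minimal sketch of how they can be used to branch at runtime, assuming PHP 7 or later for the type declarations. The function names are made up for illustration; only PHP_INT_SIZE and PHP_INT_MAX come from PHP itself.

```php
<?php
// Width of the native integer type, in bits.
function native_int_bits(): int
{
    return PHP_INT_SIZE * 8;
}

// True when every unsigned 32-bit value fits in a native int
// without wrapping around to a negative number.
function int_holds_u32(): bool
{
    return PHP_INT_SIZE > 4;
}

printf("%d-bit ints, PHP_INT_MAX = %d\n", native_int_bits(), PHP_INT_MAX);
```

Code that must also run on versions older than 4.4.0 would still need a fallback, since the constants are simply undefined there.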

  7. peter says:

    Shodan,

    One other thing: PHP_INT_SIZE is a rather recent addition, so relying on it would make sphinx_api incompatible with a large number of old PHP versions that it could otherwise support.

    I guess it was added because these problems started to pop up a lot relatively recently. Three years ago, I guess, the vast majority of PHP users were running 32-bit systems, or at least there was no massive 32-bit to 64-bit migration.

  8. [6] Thanks for the spot, I’ve added the info to the PHP documentation XML sources.

  9. Jan Steemann says:

    Type size depending on architecture is definitely a big problem in PHP if you try to have your code working on both 32 and 64 bit machines in parallel.

    In PHP, the variable size of the integer type is a “feature” (probably the developers wanted to make use of the native C types for efficiency).
    It’s also documented on http://www.php.net/manual/en/language.types.integer.php:
    “The size of an integer is platform-dependent, although a maximum value of about two billion is the usual value (that’s 32 bits signed). PHP does not support unsigned integers.”

    You will also run into problems when doing bit shifting operations in your code or check if a certain bit is set or not – the result may vary depending on the machine you run the code on!

    There are also documented issues with the pack and unpack functions (http://www.php.net/manual/en/function.pack.php) as they allow you using machine dependent formats:
    i signed integer (machine dependent size and byte order)
    I unsigned integer (machine dependent size and byte order)
    f float (machine dependent size and representation)
    d double (machine dependent size and representation)

    If possible, try to avoid using these formats as they may give you different results depending on your architecture.

    You may well have a problem if you rely on third party code that does not care about these issues. We are using some external code to read in binary files produced by Excel and that code happily used pack, unpack and bit shift operations all over the place without caring about int size, e.g.:

    function GetInt4d($data, $pos)
    {
        $res = ord($data[$pos]) | (ord($data[$pos+1]) << 8) | (ord($data[$pos+2]) << 16) | (ord($data[$pos+3]) << 24);
        if (PHP_INT_SIZE > 4 and $res > 2147483647)
            $res -= 2*pow(2,31);
        return $res;
    }

    The problem is that you have to adjust the application to be aware of the architecture and put in nasty and slow workarounds.

    There are also well known issues with the crc32 function which may give you different results on 32 and 64 bit machines if you do not reformat its results with dechex or sprintf accordingly.

    Finally, you may also run into trouble with serialize and unserialize. For example, try
    php -r "var_dump(unserialize('i:234444444444444344;'));"
    It will return either
    int(234444444444444344)
    or
    int(-424884552)
    depending on architecture.
    So you cannot properly unserialize big int values serialized on 64 bit systems on 32 bit systems. Imagine using a mixed environment with 32 and 64 bit servers that need to share their data.

    Some PHP extensions or the underlying library also had issues on 64bit systems in the past, e.g. cracklib did not work on 64 bit some time ago because it used machine dependent ints in structs that read in some file header information.

    There are other nasty portability issues with PHP which are even worse if you ask me.
    For instance, fgetcsv is locale-dependent (documented in the manual):
    “Note: Locale setting is taken into account by this function. If LANG is e.g. en_US.UTF-8, files in one-byte encoding are read wrong by this function.”
    So you have to either switch the system locale by your application or write your own fgetcsv substitution.

    Probably the PHP documentation team can add a page to the PHP manual listing the obvious portability issues?
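The crc32 reformatting mentioned in the comment above can be sketched like this, assuming PHP 7 or later. The wrapper names are made up; the point is only that the value is turned into an unsigned decimal or hex string right away, so a negative int on a 32-bit box and a positive one on a 64-bit box end up looking the same.

```php
<?php
// Hypothetical wrappers that normalise crc32() to one representation.
function crc32_unsigned(string $data): string
{
    return sprintf('%u', crc32($data));    // unsigned decimal string
}

function crc32_hex(string $data): string
{
    return sprintf('%08x', crc32($data));  // 8 hex digits, zero-padded
}

echo crc32_unsigned('123456789'), "\n"; // 3421780262
echo crc32_hex('123456789'), "\n";      // cbf43926
```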

  10. Peufeu says:

    Ah, man, this is so the “PHP way”, it’s ridiculous. How can they do that ? I ask myself this every day as I discover yet another pile of shit hidden under the carpet of this joke of a language. Like their object model which doesn’t support overloading static methods. DUMB DUMB DUMB ! And they won’t fix it !

    Or their supreme idiocy of not making the base types objects ! Unicode (man, your native language sure isn’t US-ASCII, I feel your pain)

    When you want to relax, try a bit of Python.

    >>> 3**172
    11610630703530923996233764322605633554400975674804937772291047972101377433780374641L

    >>> print u"Добрый день, Петр".upper()
    ДОБРЫЙ ДЕНЬ, ПЕТР

    >>> s = u"໗໘໙"
    >>> map( unicodedata.name, s )
    ['LAO DIGIT SEVEN', 'LAO DIGIT EIGHT', 'LAO DIGIT NINE']
    >>> int( s )
    789

    DUH. Like, it works. No pain.

    >>> type( -2147483648 )

    >>> type( 2147483648 )

    Ha, ha. (now you know I use IA32).

    And the one I like best :

    >>> 'a' + 0
    TypeError: cannot concatenate 'str' and 'int' objects

  11. Just a quick note to say thank you. I’m running into exactly the same issues and have had to battle with them while I was writing a bytecode assembler in PHP previously also. This is a really great reference and it’s good to know that I’m not going crazy! (Well, maybe a little!) :)

  12. Bill says:

    Just reviewing the code, I think there is a typo

    line 41;

    41. list(,$a) = unpack ( "N", "\xff\xff\xff\xff" );
    42. list(,$b) = unpack ( "N", "\xff\xff\xff\xff" );

    I think 42 should use “V” to get the low part of the unsigned int. Otherwise, a==b.
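For reference, "N" and "V" differ only in byte order: "N" reads 32 bits big-endian, "V" reads them little-endian. A small sketch with a made-up input, assuming PHP 5.4 or later for the `[1]` dereference:

```php
<?php
$bytes = "\x00\x00\x00\x01";

$big    = unpack('N', $bytes)[1];  // big-endian: 0x00000001 = 1
$little = unpack('V', $bytes)[1];  // little-endian: 0x01000000 = 16777216

echo $big, " ", $little, "\n";     // 1 16777216
```

Note that with four 0xff bytes, as in the two quoted lines, both formats read the same bit pattern, so that particular input cannot tell them apart.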

  13. Xaprb says:

    I am trying to graph some InnoDB stats in Cacti, and must extract them from SHOW INNODB STATUS, which prints them as a hi/lo 64-bit unsigned number, of course. I have to do some math on them to subtract one from the other. I ended up just sending them back to MySQL and doing a SELECT, casting them to string at the same time:


    $sql = "SELECT "
    . "CONCAT('', (($innodb_lsn[0] << 32) + $innodb_lsn[1])) "
    . "AS log_bytes_written, "
    . "CONCAT('', (($flushed_to[0] << 32) + $flushed_to[1])) "
    . "AS log_bytes_flushed, "
    . "CONCAT('', ((($innodb_lsn[0] << 32) + $innodb_lsn[1]) "
    . "- (($flushed_to[0] << 32) + $flushed_to[1]))) "
    . "AS unflushed_log";

    Not high-performance, but the code sure is easy to write.

  14. Jared says:

    If you don’t expect a negative value to be returned from unpack(), then test for it and add 0x100000000:

    list(, $a) = unpack("N", "\xff\xff\xff\xff");
    if ($a < 0)
        $a += 0x100000000;

    The add forces a float conversion, and corrects the sign.
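The same trick wrapped in a function, as a sketch. The name `u32_from_signed` is made up, and 2^32 is written as a float literal so the promotion to float is explicit instead of relying on how an oversized hex literal is parsed.

```php
<?php
// Hypothetical helper: map a value that may have wrapped negative on a
// 32-bit platform back into the 0 .. 2^32-1 range.
function u32_from_signed($value)
{
    // Adding a float yields a float, which holds 2^32 - 1 exactly.
    return $value < 0 ? $value + 4294967296.0 : $value;
}

list(, $a) = unpack('N', "\xff\xff\xff\xff");
$a = u32_from_signed($a);  // 4294967295 either way; a float where ints are 32-bit
```

The result is a float on 32-bit platforms, so it is fine for printing or comparison but not for bitwise operations.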

  15. Damn.

  16. markus says:

    These days I don’t understand anyone who picks PHP instead of Ruby or Python.

    PHP is only a success on the www; other than that, its crippled design is a huge problem.

  17. Christian Kruse says:

    Hi,

    the code you are using has a bug, since integer overflow handling is platform dependent in PHP. I ran php -r 'var_dump((int)4294967296);' on different platforms, and the results vary:

    - Mac OS X, Intel Core2Duo, 32bit: int(-1)
    - Linux, Intel Celeron, 32bit: int(0)
    - Linux, Intel Core2Duo, 64bit: int(4294967296)

  18. Mahdi says:

    Thank you Andrew.

    I’m trying to implement the reverse of function _Make (i.e. I have a string and want to convert it to low and high values), without using any library, just pure PHP or JS.
    What should I do?

  19. Andrew says:

    Mahdi, the easiest way should be to divide it by 2^32 using bcmath functions.

  20. Mahdi says:

    Instead of using “bcmath” and “GMP”, I want to implement the algorithm manually.

  21. Andrew says:

    Implement long division then. http://en.wikipedia.org/wiki/Long_division
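A manual route can be sketched as below. Note that this sketch swaps long division for the mirror-image technique: it reads the decimal digits left to right and does multiply-by-10-and-add on four 16-bit words, which gives the same high and low halves while keeping every intermediate value under 31 bits. The name `decimal_to_hi_lo` is made up, and PHP 7 or later is assumed for intdiv().

```php
<?php
// Hypothetical sketch: split an unsigned 64-bit decimal string into its
// high and low 32-bit halves without bcmath or GMP.
function decimal_to_hi_lo(string $decimal): array
{
    if ($decimal === '' || strspn($decimal, '0123456789') !== strlen($decimal)) {
        throw new InvalidArgumentException('expected a string of decimal digits');
    }
    $words = [0, 0, 0, 0];                        // 16-bit words, most significant first
    foreach (str_split($decimal) as $digit) {
        $carry = (int)$digit;
        for ($i = 3; $i >= 0; $i--) {
            $acc = $words[$i] * 10 + $carry;      // at most 65535 * 10 + 9
            $words[$i] = $acc % 65536;
            $carry = intdiv($acc, 65536);
        }
        if ($carry !== 0) {
            throw new OverflowException('value does not fit in 64 bits');
        }
    }
    // Each half becomes a float where native ints are only 32 bits wide.
    return [$words[0] * 65536 + $words[1], $words[2] * 65536 + $words[3]];
}

list($hi, $lo) = decimal_to_hi_lo('4294967296');  // $hi is 1, $lo is 0
```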

  22. Frans says:

    Thanks for a great solution – I needed to convert 64-bit FileTime values to (32-bit) Unix timestamps, and your _Make64 function solves most of that problem.

    Two remarks:
    the “$r4 = …” line includes “+ +”, which I presume is a typo.

    And it seems to me that everything from the “$l = strlen…” line down to the final return can be replaced by “return ltrim($r, '0');”
    Or is there a special reason for your loop approach?
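One edge case worth checking before making that swap: ltrim() with '0' strips every zero, so an all-zero string comes back empty. Whether the original loop guards against that is not visible here. A sketch of a guarded version, with a made-up function name and PHP 7 or later assumed:

```php
<?php
// Hypothetical helper: strip leading zeros but keep a single "0"
// when the whole string consists of zeros.
function strip_leading_zeros(string $digits): string
{
    $trimmed = ltrim($digits, '0');
    return $trimmed === '' ? '0' : $trimmed;
}

echo strip_leading_zeros('00123'), "\n"; // 123
echo strip_leading_zeros('0000'), "\n";  // 0
```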

  23. Frans says:

    One more thing. After testing with a considerable number of different 64-bit values, I found that some were combined incorrectly via the manual computation route. The bug is in the three “while ( $r4>100000 )” ($r3, $r2) loops, which should test with “>=” rather than “>”.
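To see why the comparison matters, here is a sketch of carry propagation over base-100000 limbs. It is not the loop from the post, and both the function name and the limb layout are assumptions. A limb equal to exactly 100000 is already out of range, so the test has to be ">=".

```php
<?php
// Hypothetical sketch: normalise base-100000 limbs, least significant first.
function normalize_limbs(array $limbs): array
{
    $carry = 0;
    foreach ($limbs as $i => $limb) {
        $limb += $carry;
        $carry = 0;
        while ($limb >= 100000) {  // ">" would leave a limb of exactly 100000 in place
            $limb -= 100000;
            $carry++;
        }
        $limbs[$i] = $limb;
    }
    if ($carry > 0) {
        $limbs[] = $carry;         // overflow out of the top limb becomes a new limb
    }
    return $limbs;
}

print_r(normalize_limbs([100000, 0])); // limbs become 0 and 1
```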

  24. Ellie says:

    We absolutely love your blog and find many of your post’s to be precisely what I’m looking for. can you offer guest writers to write content in your case? I wouldn’t mind publishing a post or elaborating on a lot of the subjects you write concerning here. Again, awesome web log!
