2015年12月31日 星期四

2015年12月30日 星期三

Arbitrary Precision and Big Numbers in PHP

Mathematical questions and applications always trigger me. Recently I just finished a book by Stephen Hawking “God Created The Integers” and being able to “talk” to those great mathematicians in history thrills me. This is the main reason I decided to write about this topic.
In this article, we will review the PHP capability to provide arbitrary precision number calculation / big integer calculation by reviewing 3 PHP modules: GMPBC Math and php-bignumbers. We will demonstrate two real-world examples to see the powers/limitations of each. The first one will be calculating PI to arbitrary precision – well, for the sake of the article, we will restrict the precision, say, to 1000 digits; the second will be a simple demonstration on RSA encryption/decryption.
Let’s get started.

Preparation

To get the GMP PHP module, simply issue the following command in your terminal on any Unix system (or install it manually if on Windows – but please use Vagrant!):
sudo apt-get install php5-gmp
php-bignumbers is distributed via Composer. Create a composer.json file and runphp composer.phar install after adding the requirement. The php-bignumbers Github site has the detailed information to how to write the composer.json file.
BC is usually pre-installed and enabled.
After installation, we can test that all are working fine by writing a simple PHP script and running it in the terminal:
<?php

// Below is to test php-bignumber lib
require_once 'vendor/autoload.php'; // Required for autoload

use \Litipk\BigNumbers\Decimal as Decimal;

$one = Decimal::fromInteger(1);
$two= Decimal::fromInteger(2);
$three=Decimal::fromInteger(3);
$seven = Decimal::fromString('7.0');

$a=$one->div($seven, 100); // Should display something like 0.142857142857 .....and the last digit is 9 in consideration of rounding up the 101th digit. 
$b=$two->div($seven, 100);
$c=$three->div($seven, 100);

echo ($c->sub(($a->add($b)))); //Displays 0.00000... to the last digit

echo "\n";

// Now we test GMP

echo gmp_strval(gmp_div("1.0", "7.0")); //Displays 0. Not really useful!
echo "\n";

// Now we test BC

$oneseven=bcdiv('1', '7', 100); // Should display something like 0.142857142857 .....but note the last digit is 8, not 9.
$twoseven=bcdiv('2','7', 100);
$threeseven=bcdiv('3','7', 100);

echo bcsub(bcadd($oneseven, $twoseven,100), $threeseven, 100); // Displays 0.000000000... to the last digit
echo "\n";
The above code will output like below:
In the first and the third output (using php-bignumbers and BC, respectively), they are identical and accurate (1/7+2/7-3/7=0).
For GMP, it only handles big integers and does not provide arbitrary precision floating numbers calculation method so it says 1/7=0, a result we can expect for one integer dividing another. For this reason, I won’t use GMP for arbitrary precision demo (PI calculation) and will only use it in the RSA demo.

A brief comparison of three libraries

php-bignumber:
  • New, under development;
  • Object Oriented;
  • More typing to create a number;
  • Limited operators and functions (at the time of writing, it is missing pow function and all other important mathematical function families like trigonometry, factorial, etc);
  • Provides arbitrary precision capability, which is what we needed.
BC:
  • Mature, pre-installed and enabled;
  • Not Object Oriented;
  • 10 functions available, missing all functions as php-bignumber, except pow, and missing log/ln;
  • Provides arbitrary precision capability, which is what we needed.
GMP:
  • Mature, easy to install;
  • Not Object Oriented;
  • Over 40 functions available, but all focused on integer manipulation;
  • Provides arbitrary precision for integers, not fully what we want.
Theoretically, with +-*/ provided, there is actually no limitation to simulate other operations on numbers. It may be easier to expand php-bignumbers than to expand the other two. So I am looking forward to the author of php-bignumbers to add in more functions in the future.
If we are only dealing with integers, and particularly not much with the division (or at least, as we can see in RSA demo, we are more interested in the quotient and remainder), all 3 libs are a good candidate and GMP may be even better as it provides the largest number of functions. But if we are determined to get a floating result with arbitrary precision, GMP is out of consideration and we will rely on BC and php-bignumbers.
Now, let’s get into the real world demonstration.

Calculating PI

PI as an irrational number, its value can be calculated infinitely to any number of digits. The calculation of PI itself is useful in testing the speed of a PC, and testing the optimization of the algorithm.
As this calculation involves floating numbers, we will use BC and php-bignumbers. For the two programs, we will use the simplest way to calculate PI:
π^2/6=1+1/4+1/9+......+1/(2n-1)^2
This is not the fastest way to calculate PI as its convergence speed is a bit slow, but never mind.
To benchmark the speed of these two libraries, we will set a stopwatch. As I know it is a lengthy process, I will use the PHP built-in time function to get the timestamp and calculate the difference – I can tell you now that the difference is obvious. We also will compare if the two results are identical (not close enough, but identical).
The biggest n used in the above formula will be set to 1,999,999. This is to avoid too much time spent on finishing the computation on my PC.
Also, to avoid extra function call overhead, we will not use any extra user functions. This may create some redundant lines but let’s bear with it.
I have done some basic optimization in the calculation. More optimization suggestions are more than welcome!
The code (and the RSA code shown later) has been uploaded to Github.
I have tested the performance (both in terms of speed and accuracy) with different n and the below table shows the summary:
Frankly speaking, this result really disappoints me. With a max N set to 2 million, we are only able to achieve an accuracy to the 6th decimal point (3.141592) compared to the real PI. But this is due to the convergence speed of the algorithm itself.
Both methods (BC and php-bignumbers) demonstrate a high level of identity of the results: for 1000+ digits, they only differ after the 990th digit. This is equivalent to 10E-990 difference, which is sufficient enough for us to put the same level of confidence on the identity of their outputs. In other words, given the same algorithm, these two libs give out the “same” result.
However, the speeds are of a huge gap. Roughly, BC takes only 1/3 of the time of that of php-bignumbers. I believe this is because the php-bignumbers lib is spending a lot of overhead on OO-related processing.
Both libs (and also GMP) internally use strings to store arbitrary precision numbers by forming that string to the digits we specified. String manipulation will also slow down the performance. There’s also the fact that BC is written and executed at a low level – in C – while bignumbers is just a high level PHP library.
If we decrease the precision to 100 digits, the time consumed will be tremendously shortened. When n=999,999, 15s and 85s will elapse for BC and php-bignumbers, respectively and the accuracy of the output is still preserved. The results from the two libs will differ from the 95th digit out of 100+ digits. We are still comfortable to say they are identical.
So the suggestion is, at this stage, BC is still recommended. It provides the same accuracy but is much faster. To save time when doing such arbitrary precision calculation, it is highly recommended to set a lower scale of precision. The time will be roughly 1/10 if the scale is 1/10 (1000 digits costs 128s while 100 digits costs 15s for BC. For php-bignumbers, it is only 1/5 though).
Just to bring our discussion to the extreme, I have inserted a few lines of code to see what the result will be if we don’t rely on any libraries and just use the built-in floating numbers.
The result is astonishing. When n is set to 999,999, the built-in floating number gives a result of 3.1415916986586, almost identical to what we can get from BC (3.1415916986595…). But guess what? It finishes almost instantaneously!
Thus my conclusion is this: If we don’t really care about the digits at and beyond the 10th digit, we probably shouldn’t use any of the arbitrary libs at all.

An RSA demo

RSA encryption used to be very much trusted but a piece of recent news put it under the spot light, accusing RSA’s possible cooperation with NSA by planting a backdoor inside RSA’s encryption algorithm (here). Nevertheless, the algorithm itself is still worth discussing here.
A detailed description of RSA is beyond this article. Interested parties can refer to its WIKI page for a detailed explanation. We will just illustrate the code we write to implement the RSA encryption and decryption.
Note: The code and the steps illustration below is inspired by this article, written in Chinese by a Chinese programmer Mr Ruan.
The preparation work for our RSA demo is illustrated as below:
We won’t cover the algorithm that is behind the generation of these numbers. So far, we just need to know that there are 6 numbers generated/selected after these steps: p, q, n, φ(n), e and d. We also know the public key is generated as (3233, 17) and private key generated as (3233, 2753).
Encrypting a number (strings should be converted to ASCII or Unicode character by character) is done in the process to find c in the below calculation:
m^e = c (mod n), and m<n
That is, to find the remainder of m^e divided by n.
Let’s say we would like to encrypt the letter ‘A’; its ASCII code is 65, then:
65^17 = 2790 (mod 3233)
So c=2790. We can send 2790 to my partner and he will use the private key (3233, 2753) to decrypt it. (In practice, we will receive the public key from my partner and he will keep the private key to himself.)
The decryption will be like solving the following:
c^d=m (mod n)
That is, to find out the remainder (m) of c^d divided by n. Thus,
2790^2753 = 65 (mod 3233)
That’s it! My friend decrypted my message (2790) and gets 65, which is exactly the ASCII code for ‘A’!
The PHP program to do this process is shown below. I am using BC first and then GMP:
<?php

$letter = 'A';

$m = ord($letter); // ASCII code of letter 'A'
$d = 2753;

$e = 17;
$n = 3233;

echo "Encryption / Decryption using BC:\n";
$c = bcmod(bcpow($m, $e, 40), $n);
$res = bcmod(bcpow($c, $d, 40), $n);

echo "Original data: $letter\n";
echo "Encrypted data: $c\n";
echo "Decryption:\n";
echo "Input: $c\n";
echo "Decrypted: $res\n";
echo "The letter is: ".chr($res)."\n";

echo "=========================\n";
echo "Encryption / Decryption using GMP:\n";
$c = gmp_powm($m, $e, $n);
$res = gmp_powm($c, $d, $n);

echo "Original data: $letter\n";
echo "Encrypted data:".gmp_strval($c)." \n";
echo "Decryption:\n";
echo "Input: ".gmp_strval($c)."\n";
echo "Decrypted: ".gmp_strval($res)."\n";
echo "The letter is: ".chr(gmp_strval($res))."\n";
The results are shown in below screen shot:
BC and GMP perform equally well in this demo. The encryption and decryption are equally fast and finish instantaneously. As the RSA algorithm only involves big integers, the accuracy is well preserved.
Something not shown here is the capability that both libraries can handle really big numbers. For example, in our decryption process, we calculated a very big number (2790^2753, which is a number of 9486 digits)!
The demo above, of course, can’t be really used as a serious RSA program. The length of the key (length of 3233 expressed in binary) is only 12. In actual usage, the length should be normally set to 1024 digits, or 2048 digits for particularly sensitive information.

Conclusion

In this article, we reviewed the 3 libraries that provide big integer and/or arbitrary precision calculation capabilities: BC Math, GMP, php-bignumbers.
The author’s recommendation is to use BC Math as it supports both integers and floating numbers and it also runs pretty fast. Or, we can choose to rely on PHP’s built-in floating numbers if we only care about the first 10 digits or less.
To conclude this article, I will display an encrypted message below using the above RSA configuration. Let me know if you can decrypt it! (The code to decrypt the below sequence is also included on Github)
The message goes:
1486 1992 745 2185 2578 1313 1992 884 2185 1992 281 2185 2235 884 2412 3179 2570 2160 884 1313 1992 884 2185 1992 2680 3179 884 1313 612 2185 3179 2235 884 1853

Fixed Point Math in PHP with BCMath, precision loss cases

When dealing with fixed point numbers, you have to be very careful – especially if you develop with PHP and MySQL. In this article, obstacles and subtleties of working with the PHP BCMath extension, MySQL fixed point expression handling and persisting fixed point data from PHP to MySQL are described. Despite the occurring barriers we try to figure out how to work with fixed point numbers and not to lose a digit.

Troubles with BCMath

BCMath documentation says:
For arbitrary precision mathematics PHP offers the Binary Calculator which supports numbers of any size and precision, represented as strings.
So BCMath function parameters should be represented as strings. Passing numeric values to bcmathcan lead to wrong results, the same precision loss as when we treat double value as string

Case 1

echo bcmul(776.210000, '100', 10) . PHP_EOL;
    echo bcmul(776.211000, '100', 10) . PHP_EOL;
    echo bcmul(776.210100, '100', 10) . PHP_EOL;

    echo bcmul(50018850776.210000, '100', 10) . PHP_EOL;
    echo bcmul(50018850776.211000, '100', 10) . PHP_EOL;
    echo bcmul(50018850776.210100, '100', 10) . PHP_EOL;
Results are:
77621.00
    77621.100
    77621.0100
    5001885077621.00
    5001885077621.100
    5001885077621.00 //here we can see precision loss
Never pass numeric values to BCMath functions, only string values that represent numbers. Even when not dealing with floating points, BCMath can output strange results:

Case 2

echo bcmul('10', 0.0001, 10) . PHP_EOL;
 echo bcmul('10', 0.00001, 10) . PHP_EOL;
 echo 10*0.00001 . PHP_EOL;
Results are:
0.0010
 0 // thats really strange!!!
 0.0001
The reason for this is that BCMath converts its arguments to strings, and there are cases in which a number’s string representation has exponential notation.

Case 3

echo bcmul('10', '1e-4', 10) . PHP_EOL; //outputs 0 as well
PHP is a weakly typed language and in some cases you can’t control input in a strict way – you want to process as many requests as possible.
For example we can “fix” Case 2 and Case 3 by applying sprintf transformation:
$val = sprintf("%.10f", '1e-5');
 echo bcmul('10', $val, 10) . PHP_EOL;
 // gives us 0.0001000000
but applying the same transformation can break Case 1 “proper” behaviour:
$val = sprintf("%.10f", '50018850776.2100000000');
 echo bcmul('10', $val, 10) . PHP_EOL;
 echo bcmul('10', 50018850776.2100000000, 10) . PHP_EOL;
 500188507762.0999908450 //WRONG
 500188507762.10 //RIGHT
So the sprintf solution is not suitable for BCmath. Assuming all user inputs are strings, we can implement a simple validator, catching all exponential notation numbers and converting them properly. This technique is done in php-bignumbers, so we can safely pass in arguments like 1e-20 and50018850776.2101 without losing precision.
echo bcmul("50018850776.2101", '100', 10) . PHP_EOL;
 echo bcmul(Decimal::create("50018850776.2101"), '100', 10) . PHP_EOL;

 echo bcmul(Decimal::create("1e-8"), '100', 10) . PHP_EOL;
 echo bcmul("1e-8", '100', 10) . PHP_EOL;

 echo bcmul(50018850776.2101, '100', 10) . PHP_EOL;
 echo bcmul(Decimal::create(50018850776.2101), '100', 10) . PHP_EOL;
 
 // Result
 // 5001885077621.0100
 // 5001885077621.0100
 // 0.00000100
 // 0
 // 5001885077621.00
 // 5001885077621.00982700
But the last two lines of the example show us that floating point caveats cannot be avoided by input parsing (which is completely logical – we can not deal with PHP internal double representation).

BCMath final guidelines

Never use floating point numbers as fixed point operation arguments. String conversion does not help, because we can not manage the precision loss in any way.
When using BCMath extension operations, be careful with arguments in exponential representation. BCMath functions do not process exponential arguments (i.e. ‘1e-8’) correctly, so you should convert them manually. Be careful, do not use sprintf or similar conversion techniques, because it leads to precision loss.
You can use the php-bignumbers library which handles input arguments in exponential form and provides users with fixed point math operations functions. However, its performance is worse than that of the BCMath extension, so it’s a kind of compromise between a robust package and performance.

MySQL and fixed point numbers

In MySQL, fixed point numbers are handled with the DECIMAL column type. You can read the official MySQL documentation for data types and precision math operations.
The most interesting part is how MySQL handles expressions:
Handling of a numeric expression depends on the kind of values the expression contains:
If any approximate values are present, the expression is approximate and is evaluated using floating-point arithmetic.
If no approximate values are present, the expression contains only exact values. If any exact value contains a fractional part (a value following the decimal point), the expression is evaluated using DECIMAL exact arithmetic and has a precision of 65 digits. The term “exact” is subject to the limits of what can be represented in binary. For example, 1.0/3.0 can be approximated in decimal notation as .333…, but not written as an exact number, so (1.0/3.0)*3.0 does not evaluate to exactly 1.0.
Otherwise, the expression contains only integer values. The expression is exact and is evaluated using integer arithmetic and has a precision the same as BIGINT (64 bits).
If a numeric expression contains any strings, they are converted to double-precision floating-point values and the expression is approximate.
Here is a short example that demonstrates fractional part cases:
mysql> CREATE TABLE fixed_point (
        ->   amount NUMERIC(40,20) NOT NULL
        -> ) engine=InnoDB, charset=utf8;
    Query OK, 0 rows affected (0.02 sec)

    mysql> INSERT INTO fixed_point (amount) VALUES(0.2);
    Query OK, 1 row affected (0.00 sec)

    mysql> SELECT amount, amount + 0.1, amount + 1e-1, amount + '0.1' FROM fixed_point;
    +------------------------+------------------------+---------------------+---------------------+
    | amount                 | amount + 0.1           | amount + 1e-1       | amount + '0.1'      |
    +------------------------+------------------------+---------------------+---------------------+
    | 0.20000000000000000000 | 0.30000000000000000000 | 0.30000000000000004 | 0.30000000000000004 |
    +------------------------+------------------------+---------------------+---------------------+
    1 row in set (0.00 sec)
It may seen quite straightforward, but let’s look at how to deal with it within PHP.

Precision math in PHP & MySQL

So now we have to persist our fixed point values from PHP into MySQL. The right way is to use prepared statements and placeholders within our queries. Then we do parameter binding and everything is safe and secure.
$amount_to_add = "0.01";
 $stmt = $dbh->prepare("UPDATE fixed_point SET amount = amount + :amount");
 $stmt->bindValue("amount", $amount_to_add);
 $stmt->execute();
When we bind a value to a statement placeholder, we can specify its type by the bindValue third argument. Possible types are represented by constants PDO::PARAM_BOOLPDO::PARAM_NULL,PDO::PARAM_INTPDO::PARAM_STRPDO::PARAM_LOB and PDO::PARAM_STMT. So the problem is that the PHP PDO extension does not have a decimal parameter type for binding. As a result, all math expressions in queries are treated as floating point expressions, not as fixed point expressions.
$dbh = new PDO("mysql:host=localhost;dbname=test", "root", "");
    $dbh->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
    $sql = "
        CREATE TABLE IF NOT EXISTS fixed_point (
            amount DECIMAL(43,20)
        )
    ";
    $dbh->query($sql);

    $dbh->query("DELETE FROM fixed_point");
    $dbh->query("INSERT INTO fixed_point VALUES(0.2)");

    $amount_to_add = "0.1";
    $stmt = $dbh->prepare("UPDATE fixed_point SET amount = amount + :amount");
    $stmt->bindValue("amount", $amount_to_add);
    $stmt->execute();

    $stmt = $dbh->prepare("SELECT amount FROM fixed_point");
    $stmt->execute();
    var_dump($stmt->fetchColumn());

    //output is string(22) "0.30000000000000004000"
If we want to take the advantage of prepared statements and work with fixed point numbers, the best way is to perform all math operations in PHP and save results to MySQL.
$amount_to_add = "0.1";
 $stmt = $dbh->prepare("SELECT amount FROM fixed_point");
 $stmt->execute();
 $amount = $stmt->fetchColumn();
 $new_amount = bcadd($amount, $amount_to_add, 20);
 $stmt = $dbh->prepare("UPDATE fixed_point SET amount=:amount");
 $stmt->bindValue("amount", $new_amount);
 $stmt->execute();
 $stmt = $dbh->prepare("SELECT amount FROM fixed_point");
 $stmt->execute();
 $amount_after_change = $stmt->fetchColumn();
 echo $amount_after_change . PHP_EOL;

Conclusion

We’ve reached the following conclusions:
  • Never use floating point numbers as fixed point operations arguments in BCMath PHP extension funcitons. Only strings.
  • BCMath extension does not work with string numbers in exponential representation
  • MySQL supports fixed point number expressions, but all operands have to be in decimal format. If at least one agrument is in exponential format or string, it is treated as floating point number and the expression is evaluated as floating point number.
  • PHP PDO extension does not have Decimal parameter type, so if you use prepared statements and binding parameters in SQL expressions that contain fixed point operands – you won’t get precise results.
  • To perform precise math operations in PHP+MySQL applications you can choose two ways. The first one is to process all operations in PHP and persist data to MySQL only with INSERT or UPDATEstatements. In this case you can use prepared statements and parameter binding. The second one is to build SQL queries manually (you can still use prepared statements, but you have to escape parameters by yourself) so all SQL math expressions are in decimal number representation.
My personal favorite approach is the first one: all math operations in PHP. I agree that PHP and MySQL may be not the best choice for applications with precision math, but if you chose this technology stack, it’s good to know that there is a way to deal with it the right way.