In most projects we need to be able to generate unique identifiers. For single database or application server setups this is typically not a big deal. But when scaling out we need to make sure that identifiers are “globally unique”. In this post I will try to make a PHP implementation of such ID generator with influences from the Twitter Snowflake project.
Our requirements
Our requirements are as following:
- Time ordered with millisecond resolution (within reason).
- Using 64 bit integer.
- Uncoordinated id generation.
Solution
Using a 64 bit integer we have 63 bits to play with. Why? Because the first bit is reserved to determine if the value is positive or negative in PHP (and most other languages). So using that first bit would just make things hard to manage in various systems and languages.
Please note!
To be able to use this solution you need to run PHP on a 64 bit system with a 64 bit version of PHP. If you are on Windows, this solution will not work. See here why.
Why a Snowflake ripoff will not work in PHP
Twitter Snowflake composes a 64 bit ID in this way:
- 41 bits: Timestamp with a range of about 69 years.
- 10 bits: Machine id supporting 1024 ID generation machines.
- 12 bits: Sequential counter.
The system is written in Scala and the sequential counter is exactly 1 per machine id. PHP on the other hand is typically running multiple processes on the same machine. So using this exact way of generating numbers without moving across process boundaries we could end up with two different PHP processes generating the same ID.
To sum it up, we need to take the process into account with the same number of bits.
Breaking down the bits in PHP
PHP implementation overview:
- 41 bits: Timestamp with a range of about 69 years.
- 6 bits: Machine id supporting 64 machines.
- 10 bits: Process id.
- 6 bits: Sequential counter.
First 41 bits will be used as a timestamp. But using the typical system EPOC we will loose valuable bits. So we need to use a custom EPOC to get this timestamp to be 0 on project start. By doing so we will fit about 69 years forward in time with millisecond resolution.
6 bits are used for machine id allowing a total of 64 different machines.
10 bits are used for process id. Its important to understand that the process id could be any integer value. So 72222 would have the exact same 10 bits as 62222. It means that two processes on the same machine that has the same final 10 bit process id could potentially generate the same ID. The probability on this depends on you PHP setup and threading model.
6 bits are used as a sequential counter. It will add 1 for every ID generated on the exact same millisecond until it reaches 64. When that happens it will pause 1 millisecond and reset the counter. This will limit the numbed of generated ids to 64 per millisecond = 64 000 ids per second, per process, per machine. If you are running a PHP-FPM setup with 4 workers it would be 64 000 * 4 = 256 000 ids per second, max.
Adjust the metrics to your needs
These bit metrics could be altered if you know how much it needs to scale, how fast it needs to generate ids or how safe you need to be concerning process id conflicts.
If you will not have a lot of servers you could lower bits used for machine id, and add to the process id. If you don’t need to be able to generate 64 000 ids per second you could move some counter bits to the process id as well.
Implementation
So here it is, the ID generation class:
class Guid {
const EPOC_OFFSET = 1396783745; // Change to time() of project start.
const MAX_SEQUENCE_NUM = 63;
private static $generator = null;
private $lastTimestamp = 0;
private $sequence = 0;
private function __construct() {
// Only allow instantiation via singleton pattern (getGenerator()).
}
public function generate($machine_id)
{
// Get custom timestamp.
$time = $this->generateTimestamp();
// Reset sequence counter if timstamp is different from last used.
if ($this->lastTimestamp !== $time) {
$this->sequence = 0;
}
// If the same timestamp has been used MAX_SEQUENCE_NUM times, go to
// sleep for 1ms, then generate new timestamp.
if ( $this->sequence === self::MAX_SEQUENCE_NUM + 1 ) {
usleep(1000);
$this->sequence = 0;
$time = $this->generateTimestamp();
}
// Remember this timestamp.
$this->lastTimestamp = $time;
// Machine ID
$mid = ((63 & $machine_id) << 16);
// Process ID
$pid = ((1023 & getmypid()) << 6);
// Sequence.
$seq = (63 & $this->sequence);
$this->sequence++;
return $time | $mid | $pid | $seq;
}
/**
* Generates a custom EPOC timestamp positioned for merging with ID.
*
* @return int Internal timestamp.
*/
private function generateTimestamp()
{
$microtime = explode(' ', microtime());
$microtime[1] = (int)$microtime[1] - self::EPOC_OFFSET;
$time = $microtime[1] . substr($microtime[0], 2, 3);
return ((0x1FFFFFFFFFF & $time) << 22);
}
/**
* Singleton pattern to make sure only one generator is used in the
* current PHP process.
*
* @return Guid
*/
public static function getGenerator()
{
if (is_null(self::$generator)) {
self::$generator = new Guid();
}
return self::$generator;
}
/**
* Takes a generated ID and returns the Unix timestamp.
*
* @param int $id Generated ID.
* @return int Unix timestamp.
*/
public function getUnixTimestamp($id)
{
$time = ($id >> 22);
$time = (int)substr($time, 0, strlen($time) - 3);
return (int)$time + self::EPOC_OFFSET;
}
/**
* Takes a generated ID and returns the equivalence of PHP microtime().
*
* @param int $id Generated ID.
* @return string Microtime in millisecond resolution.
*/
public function getMicrotime($id)
{
$time = ($id >> 22);
$microtime = substr($time, strlen($time) - 3);
return "0.{$microtime}00000 " . $this->getUnixTimestamp($id);
}
}
Make sure to change the EPOC_OFFSET
constant to something close to project start to get the most out of the timestamp range.
Usage
Just include the class, get a reference to the generator and start generating ids.
require_once 'guid.php'
$machine_id = 23;
$guid = Guid::getGenerator();
// Generate three ids.
$id1 = $guid->generate($machine_id);
$id2 = $guid->generate($machine_id);
$id3 = $guid->generate($machine_id);
Once you have an ID you can obtain the timestamp information from it using getUnixTimestamp() and getMicrotime() functions.
require_once 'guid.php'
$machine_id = 23;
$guid = Guid::getGenerator();
$id = $guid->generate($machine_id);
echo "Unix timestamp: " . $guid->getUnixTimestamp($id) . "\n";
echo "Microtime: " . $guid->getMicrotime($id);
This would output something similar to:
Unix timestamp: 1396872143
Microtime: 0.58200000 1396872143
To test it a bit more we could generate 10 000 ids and see how many occurrences of each ID it generated. If functioning properly, each ID should only have been generated once.
require_once 'guid.php'
$machine_id = 23;
$guid = Guid::getGenerator();
for ($i = 0; $i < 10000; $i++) {
$ids[] = $guid->generate($machine_id);
}
$map = array();
for ($i = 0; $i < count($ids); $i++) {
$map[$ids[$i]] = (isset($map[$ids[$i]])) ? $map[$ids[$i]]++ : 1;
}
print_r($map);
Output will look like this:
Array
(
[372540213608129] => 1
[372540213608130] => 1
[372540213608131] => 1
[...]
)
Final thoughts
As far as I see it, the weakness in this implementation is the possibility for multiple processes to be identified as the same process, thus possibly generating the same ID. I typically use PHP-FPM with 2, 4 or 8 static worker processes (depending on CPU) per machine. These tend to get sequential process ids.
The possibility that two processes on the same machine has the same signature and generates the same ID on the same millisecond of execution is fairly small. By adding a check of key existence this is a non-issue (unless that read is done extremely frequently). In such case, relying on something like Twitter Snowflake might be required, although adding to the overall response time by typically hitting the network.
Feel free to leave feedback on this implementation in the comments section below.
This is perfect for my project. I am wondering how would you extract the machine_id from the GUID after is generated?