• http://technews.am Matt (technews.am)

    Thanks Jeff for sharing your code snippet. I do share your views on the subject; it could be quite damaging for a startup if a set of UIDs conflicted with a public press statement.

  • http://solutious.com/ Delano Mandelbaum

    Good post! This stuff should be common knowledge.
    Also, I’m not an expert in hash algorithms, but I understand SHA-1 is a better alternative to MD5 with regards to random distribution.

  • Thom

    SHA-1’s advantages over MD5 have little to do with random distribution. Neither SHA-1 nor MD5 is cryptographically secure: collisions for SHA-1 require a total complexity of 2^52 (paper pending), and the best collision attacks on MD5 have an average total complexity of roughly 2^32.
    Lawson’s intention with MD5 doesn’t appear to be to use it as a pseudorandom function, but rather as an alternative to padding or truncation. I would suggest replacing rand() (which uses libc’s rand() function, notorious for its non-randomness) with mt_rand(), and I would use the base64_encode function to minimize the size of the key and therefore reduce the size of your database.
    base64_encode(md5(uniqid(mt_rand(), true), true))

  • Fizz

    Some guy on php.net’s uniqid page says md5-ing the output of uniqid is a dumb idea…
    John Haugeland
    Generating an MD5 from a unique ID is naive and reduces much of the value of unique IDs, as well as providing significant (attackable) stricture on the MD5 domain. That’s a deeply broken thing to do. The correct approach is to use the unique ID on its own; it’s already geared for non-collision.
    IDs should never be obfuscated for security, so if you’re worried about someone guessing your ID, fix the system, don’t just make it harder to guess (because it’s nowhere near as difficult to guess as you imagine: you can just brute force the 60,000 MD5s that are generatable from millisecond IDs over the course of a given minute, which the typical computer can do in less than 0.1s).
    If you absolutely need to involve a hash somehow – maybe to placate a boss who thinks they understand security much better than they actually do – append it instead.
    function BadIdeaID() { return uniqid() . '_' . md5(mt_rand()); }
    Is he right?
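
    The brute-force point in the quote can be illustrated with a minimal sketch. It assumes a hypothetical scheme in which the ID is simply md5() of a millisecond timestamp and the attacker knows the minute it was generated in; recoverTimestampMs() and the example values are made up for illustration, and a 64-bit PHP 7.1+ build is assumed.

    ```php
    <?php
    // Hypothetical scenario: the ID is md5() of a millisecond timestamp, and the
    // attacker knows the 60-second window in which it was generated.
    function recoverTimestampMs(string $hash, int $windowStartMs): ?int
    {
        // Only 60,000 candidates exist in one minute of millisecond timestamps.
        for ($ms = $windowStartMs; $ms < $windowStartMs + 60000; $ms++) {
            if (md5((string) $ms) === $hash) {
                return $ms; // the "hidden" timestamp is recovered
            }
        }
        return null; // the hash was not produced inside this window
    }

    $start  = 1250000000000;                   // arbitrary start of the window, in ms
    $secret = md5((string) ($start + 41234));  // the supposedly obfuscated ID
    var_dump(recoverTimestampMs($secret, $start)); // int(1250000041234)
    ```

    The loop does at most 60,000 hashes, which is why hashing a low-entropy input adds so little protection.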

  • Thom

    (unless by ‘random distribution’ you meant with respect to differential cryptanalysis, and then I’d agree)

  • http://www.secretstolen.com Christopher Burgess

    Excellent piece. I’m currently writing a book “Why company’s take your intellectual property? ‘Cause you give it to them.” – Your example is prime. Whom do I talk to about use w/attribution of course.
    Thanks for sharing!

  • http://freelock.com John Locke

    I’ve typically not been worried about such information disclosure. It’s easy enough to start a sequence at some arbitrarily bigger number than 1 to make it an inexact thing. But there are other reasons to come up with some sort of hashing system–being able to merge data from multiple sources, for example.
    GUIDs can be good for this kind of thing–combine something unique about the computer and the current time to get something unique.

  • http://blog.bumblebeelabs.com/ Xianhang Zhang

    Hi Jeff,
    While some may take the position that it’s a benefit that such information remain private to your competitors, it could also be argued that the transparency would be beneficial to your customers.
    Also, with your scheme, how do you handle key collisions? On a large enough dataset, the probability that two records will share the same id increases to 1.

  • Joe

    You might want to learn how to spell the title of your book ;)

  • http://profile.typepad.com/mjs Michael S.

    @Xianhang With 820 billion documents, the probability of an MD5 collision (128 bits) is roughly 1 in 10^15; it takes about 820 trillion documents to reach 1 in a billion. See
    However, this requires that you have at least 128 bits of entropy in the first place, which rand() (and probably even mt_rand()) won’t give you. Still, once the random number generating part is fixed you should be okay for most applications.
    (I am also doubtful that feeding the output of one random number generator into another does any good.)
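
    As a rough check on collision odds, the usual birthday approximation p ≈ n^2 / (2 · 2^bits) can be evaluated directly. This is a back-of-the-envelope sketch that assumes the IDs really are uniformly random over all 128 bits; collisionProbability() is a hypothetical helper name, not a PHP built-in.

    ```php
    <?php
    // Birthday-bound approximation: p ≈ n^2 / (2 * 2^bits).
    // Only meaningful while the result is much smaller than 1.
    function collisionProbability(float $n, int $bits): float
    {
        return ($n * $n) / (2.0 * pow(2.0, $bits));
    }

    // 820 billion and 820 trillion uniformly random 128-bit IDs.
    printf("%.2e\n", collisionProbability(8.2e11, 128));
    printf("%.2e\n", collisionProbability(8.2e14, 128));
    ```

    Under that assumption the first figure comes out on the order of 10^-15 and the second on the order of 10^-9. With fewer than 128 bits of real entropy in the input, the odds are correspondingly worse.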

  • http://profile.typepad.com/6p011570b19379970b pbhj

    @Jeff, of course message-id 4 doesn’t necessarily mean the fourth message. It could be the fourth message you’ve sent. Or the fourth message relative to another key, such as an operator key (they may have just taken on a new hire and you’re their 4th contact).
    The personal-id could be concatenated with the first part of your email address (which would show 66 Jeffs or whatever) or with the last part (66 gmail users, say) or with a particular domain name …
    It’s likely your analysis is correct but not an assumption you’d want to put much money on.

    @Joe, lol.

  • Chris Johnson

    John Haugeland’s comments were taken somewhat out of context. The original thread on php.net he was responding to had someone advocating the use of system time as the seed for an ID generator, which has a number of implementation-specific problems including resolution (resistance to collisions) and reverse engineering (being able to determine the date the ID was generated from the ID itself). John’s comments are right in that context.
    To recast the discussion — what we are really discussing is a set of requirements. (1) ability to generate a unique ID with vanishingly small probability of duplication; (2) ID format and content that provide no useful information to a competitor / adversary; (3) ID format that is suitable for embedding in a URL query string; (4) niceties — compactness, human-readable etc.
    The earlier suggestion that this could be accomplished (in pseudo code) as Encode(Hash(Random())) is conceptually correct but over-complicated. The standard UUID function accomplishes all of this, if you use the right version. E.g. UUID version 4 is built from random bits (version 5, by contrast, is a SHA-1 hash of a namespace and a name rather than of random input), and the final format is URL-friendly if not particularly compact.
    If compactness is needed then wrapping all this in a base conversion is a good idea, but straight Base 64 isn’t the best choice because the + and / characters require extra encoding & decoding for use in URLs. Life is a lot simpler, if somewhat less compact, if you use Base 32 or Base 62 encoding. Compactness usually isn’t a real requirement unless the number needs to be readable, speakable, transcribable or memorizable by a person, which doesn’t seem to be the case here.
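
    One way to sidestep the + and / problem without changing base is the URL-safe Base 64 alphabet, which swaps those two characters for - and _. The sketch below is an illustration under assumptions, not the scheme from the post: urlSafeId() is a hypothetical helper, and it draws its bits from random_bytes(), which requires PHP 7 or later.

    ```php
    <?php
    // urlSafeId() is a hypothetical helper name used for illustration only.
    function urlSafeId(int $bytes = 16): string
    {
        $raw = random_bytes($bytes);      // CSPRNG output, 128 bits by default
        $b64 = base64_encode($raw);       // may contain '+', '/' and '=' padding
        // Swap to the URL-safe alphabet and drop the padding characters.
        return rtrim(strtr($b64, '+/', '-_'), '=');
    }

    echo urlSafeId(), "\n"; // 22 characters for the default 16 bytes
    ```

    The result can be placed in a query string without further escaping, at the cost of being case-sensitive and awkward to read aloud.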