pymultihash documentation

Python implementation of the multihash specification

This is an implementation of the multihash specification in Python. The main component in the module is the Multihash class, a named tuple that represents a hash function and a digest created with it. It adds the ability to work with hashlib-compatible hash functions, verify the integrity of data, and encode itself to a byte string in the binary format described in the specification (possibly ASCII-encoded). The decode() function performs the inverse operation, i.e. it converts a (possibly ASCII-encoded) byte string into a Multihash object.

Basic usage

Decoding

One of the basic cases happens when you have a multihash-encoded digest like:

>>> mhash = b'EiAsJrRraP/Gj/mbRTwdMEE0E0ItcGSDv6D5il6IYmbnrg=='

You know beforehand that the multihash is Base64-encoded. You also have some data and you want to check if it matches that digest:

>>> data = b'foo'

To perform this check, you may first decode the multihash (i.e. parse it) into a Multihash object, which provides the verify() method to validate the given byte string against the encoded digest:

>>> import multihash
>>> mh = multihash.decode(mhash, 'base64')
>>> mh.verify(data)
True

Please note that you had to specify that the multihash is Base64-encoded; otherwise, binary encoding is assumed (and decoding will probably fail). Internally, the verification uses a hashlib-compatible implementation of the function indicated by the encoded multihash to check the data. Read more about codecs and hash functions further below.
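
To make the binary layout concrete, here is a stand-alone sketch that parses the same multihash by hand with the standard library only. It does not use the multihash module; it assumes the layout seen in the examples of this document (one byte for the function code, one byte for the digest length, then the raw digest), and the variable names are illustrative.

```python
import base64
import hashlib

mhash = b'EiAsJrRraP/Gj/mbRTwdMEE0E0ItcGSDv6D5il6IYmbnrg=='
data = b'foo'

# Undo the ASCII encoding first; the result is the binary multihash.
raw = base64.b64decode(mhash)

# Assumed layout: function code, digest length, raw digest.
func_code, length, digest = raw[0], raw[1], raw[2:]

# 0x12 is the code shown for sha2_256 in this document,
# so hashlib.sha256 is used for the comparison.
matches = (
    func_code == 0x12
    and length == len(digest)
    and hashlib.sha256(data).digest() == digest
)
print(hex(func_code), length, matches)
```

In practice decode() and verify() do this work for you, including the lookup of the right hash function.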

The function in a Multihash object is stored as a member of the Func enumeration, which contains one member per function listed in the multihash specification. The name of a Func member is the name of that function in the specification (with hyphens replaced by underscores), and its value is the function code. The Multihash object also contains the binary string with the raw hash digest. Application-specific hash functions are also supported, but their numeric code is used instead of a Func member.

>>> mh  # doctest: +ELLIPSIS
Multihash(func=<Func.sha2_256: 18>, digest=b'...')
>>> hex(mh.func.value)
'0x12'
>>> len(mh.digest)
32

The short representation of a Multihash object only shows the function name (or its code if application-specific), and the Base64-encoded version of the raw hash digest:

>>> print(mh)
Multihash(sha2_256, b64:LCa0a2j/xo/5m0U8HTBBNBNCLXBkg7+g+YpeiGJm564=)

If you need a shorter multihash, you may truncate it while keeping the initial bytes of the raw hash digest, but it will no longer be able to validate the same byte strings (unless explicitly instructed to truncate their digests):

>>> mh_trunc = mh.truncate(16)
>>> print(mh_trunc)
Multihash(sha2_256, b64:LCa0a2j/xo/5m0U8HTBBNA==)
>>> mh_trunc.verify(data)
False
>>> mh_trunc.verify_truncated(data)
True
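
The difference between the two checks can be illustrated without the library. This sketch assumes that verify() compares whole digests while verify_truncated() compares only a prefix, as described above; it is not the library's implementation.

```python
import hashlib

data = b'foo'
full = hashlib.sha256(data).digest()   # 32 bytes
short = full[:16]                      # the bytes that truncate(16) would keep

# A whole-digest comparison fails because the lengths differ...
exact_match = (full == short)
# ...while comparing only the first len(short) bytes succeeds.
prefix_match = (full[:len(short)] == short)

print(exact_match, prefix_match)
```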

Encoding

Now imagine that you have some data and you want to create a multihash out of it. First you must create a Multihash instance with the desired function and the computed binary digest. If you already know them, you may create the Multihash instance directly:

>>> mh = multihash.Multihash(multihash.Func.sha2_512, b'...')
>>> print(mh)  # doctest: +ELLIPSIS
Multihash(sha2_512, b64:...)

Instead of the Func member, you may find it more convenient to use the function name (e.g. 'sha2-512' or 'sha2_512') or its code (e.g. 19 or 0x13). You may also create Multihash instances straight from hashlib-compatible objects:

>>> import hashlib
>>> hash = hashlib.sha1(data)
>>> mh = multihash.Multihash.from_hash(hash)
>>> print(mh)
Multihash(sha1, b64:C+7Hteo/D9vJXQ3UfzxbwnXaijM=)

However, the easiest way to get a Multihash instance is with the digest() function, which internally uses a hashlib-compatible implementation of the indicated function to do the job for you:

>>> mh = multihash.digest(data, 'sha1')
>>> print(mh)
Multihash(sha1, b64:C+7Hteo/D9vJXQ3UfzxbwnXaijM=)

In any case, getting the multihash-encoded digest is very simple:

>>> mh.encode('base64')
b'ERQL7se16j8P28ldDdR/PFvCddqKMw=='

As before, an encoding (Base64) was specified to avoid getting the binary version of the multihash.
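
For reference, the same encoded value can be rebuilt by hand with the standard library. This is only a sketch under the assumption that the binary form is the function code (0x11 for sha1), the digest length, and the raw digest, in that order; it does not use the multihash module.

```python
import base64
import hashlib

data = b'foo'
digest = hashlib.sha1(data).digest()

# Assumed layout: function code (0x11 for sha1), digest length, raw digest.
raw = bytes([0x11, len(digest)]) + digest

# ASCII-encode the binary multihash, as encode('base64') does.
encoded = base64.b64encode(raw)
print(encoded)
```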

The hash function registry

As the multihash specification indicates, you may use hash function codes in the range 0x01-0x0f to specify application-specific hash functions. The decode() function allows such multihashes, and the Multihash constructor allows specifying such hash functions by their integer code:

>>> import multihash
>>> import hashlib
>>> class MyMD5:
...     def __init__(self, data=b''):
...         self.name = 'mymd5'
...         self._md5 = hashlib.md5(data)
...     def update(self, data):
...         return self._md5.update(data)
...     def digest(self):
...         return self._md5.digest()
...
>>> data = b'foo'
>>> mh = multihash.Multihash(0x05, MyMD5(data).digest())
>>> print(mh)  # doctest: +ELLIPSIS
Multihash(0x5, b64:rL0Y20zC+Fzt72VPzMSk2A==)

However, this does not allow using more intuitive strings instead of numbers for application-specific functions, and digesting or verifying with such a function is not possible:

>>> multihash.digest(data, 'mymd5')
Traceback (most recent call last):
    ...
KeyError: ('unknown hash function', 'mymd5')
>>> mh.verify(data)
Traceback (most recent call last):
    ...
KeyError: ('unknown hash function', 5)

The FuncReg class works around these problems by providing a registry of hash functions. You may add your application-specific hash functions there with a code and a name, and optionally the name and constructor of a hashlib-compatible implementation:

>>> multihash.FuncReg.register(0x05, 'md-5', 'mymd5', MyMD5)
>>> multihash.digest(data, 'md-5')  # doctest: +ELLIPSIS
Multihash(func=5, digest=b'...')
>>> mh.verify(data)
True

You may remove your application-specific functions from the registry as well:

>>> multihash.FuncReg.unregister(0x05)

FuncReg also allows you to iterate over registered functions (as Func members or function codes), and check if it contains a given function (i.e. whether the Func or code is registered or not).

>>> [f.name for f in multihash.FuncReg if f == multihash.Func.sha3]
['sha3_512']
>>> 0x05 in multihash.FuncReg
False
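
The registry behaviour described in this section can be approximated with a small stand-alone class. ToyFuncReg and all of its internals are hypothetical and exist only for illustration; the sketch assumes the application-specific range 0x01-0x0f and the hyphen/underscore name equivalence mentioned above, and it is not how FuncReg is actually implemented.

```python
import hashlib

APP_RANGE = range(0x01, 0x10)  # application-specific codes 0x01-0x0f


class ToyFuncReg:
    """Hypothetical stand-in for a function registry (not the library's code)."""

    def __init__(self):
        self._by_name = {}   # normalized name -> code
        self._hash_new = {}  # code -> hashlib-compatible constructor (or None)

    @staticmethod
    def _normalize(name):
        # Store a single spelling so that 'md-5' and 'md_5' are equivalent.
        return name.replace('-', '_')

    def register(self, code, name, hash_new=None):
        if code not in APP_RANGE:
            raise ValueError('code out of application-specific range', code)
        self._by_name[self._normalize(name)] = code
        self._hash_new[code] = hash_new

    def unregister(self, code):
        if code not in self._hash_new:
            raise KeyError('unknown hash function', code)
        del self._hash_new[code]
        self._by_name = {n: c for n, c in self._by_name.items() if c != code}

    def get(self, hint):
        # Names are translated to codes; codes are checked as they are.
        if isinstance(hint, str):
            hint = self._by_name.get(self._normalize(hint), hint)
        if hint not in self._hash_new:
            raise KeyError('unknown hash function', hint)
        return hint

    def __contains__(self, code):
        return code in self._hash_new


reg = ToyFuncReg()
reg.register(0x05, 'md-5', hashlib.md5)
print(reg.get('md-5'), reg.get('md_5'), 0x05 in reg)
```

Keeping names normalized to one spelling is a design choice of this sketch; it keeps lookups by either spelling cheap without storing every name twice.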

The codec registry

Although a multihash is properly a binary packing format for a hash digest, it is not normally exchanged in binary form, but in some ASCII-encoded representation of it. As seen above, multihash decoding and encoding calls support an encoding argument to allow ASCII decoding or encoding for your convenience.

The encodings mentioned in the multihash standard are already enabled and available by using their name (a string) as the encoding argument. The base58 encoding requires the base58 package to be installed, though.

The CodecReg class allows you to access the available codecs and register your own (or replace existing ones) with a name plus encoding and decoding callables that take and return byte strings. For instance, to add the uuencode codec:

>>> import multihash
>>> import binascii
>>> multihash.CodecReg.register('uu', binascii.b2a_uu, binascii.a2b_uu)

To use it:

>>> mhash = b'6$10+[L>UZC\\/V\\E=#=1_/%O"==J*,P  \n'
>>> mh = multihash.decode(mhash, 'uu')
>>> print(mh)
Multihash(sha1, b64:C+7Hteo/D9vJXQ3UfzxbwnXaijM=)
>>> mh.encode('uu') == mhash
True

You may remove any codec from the registry as well:

>>> multihash.CodecReg.unregister('uu')

CodecReg also allows you to iterate over registered codec names, and check if it contains a given codec (i.e. whether it is registered or not).

>>> {'hex', 'base64'}.issubset(multihash.CodecReg)
True
>>> 'base32' in multihash.CodecReg
True
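
Any pair of callables that map bytes to bytes and invert each other can serve as a codec. The sketch below checks such a pair from the standard library without touching the registry; choosing URL-safe Base64 is only an example. It also shows a limitation worth keeping in mind for the uuencode codec above: binascii.b2a_uu accepts at most 45 bytes per call.

```python
import base64
import binascii

# A candidate codec: two bytes -> bytes callables that invert each other.
encode, decode = base64.urlsafe_b64encode, base64.urlsafe_b64decode

raw = b'\x01\x04TEST'  # a small binary multihash (code 0x01, length 4)
round_trip_ok = decode(encode(raw)) == raw

# binascii.b2a_uu rejects inputs longer than 45 bytes, so a codec built
# directly on it cannot carry longer multihashes, e.g. a 64-byte sha2-512
# digest plus its two header bytes.
try:
    binascii.b2a_uu(bytes(66))
    uu_handles_66_bytes = True
except binascii.Error:
    uu_handles_66_bytes = False

print(round_trip_ok, uu_handles_66_bytes)
```

A round-trip check like this is a cheap sanity test before registering a custom codec.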

API

class multihash.Multihash

A named tuple representing a multihash function and digest.

The hash function is usually a Func member.

>>> mh = Multihash(Func.sha1, b'BINARY_DIGEST')
>>> mh == (Func.sha1, b'BINARY_DIGEST')
True
>>> mh == (mh.func, mh.digest)
True

However it can also be its integer value (the function code) or its string name (the function name, with either underscore or hyphen).

>>> mhfc = Multihash(Func.sha1.value, mh.digest)
>>> mhfc == mh
True
>>> mhfn = Multihash('sha2-256', b'...')
>>> mhfn.func is Func.sha2_256
True

Application-specific codes (0x01-0x0f) are also accepted. Other codes raise a KeyError.

>>> mhfc = Multihash(0x01, b'...')
>>> mhfc.func
1
>>> mhfc = Multihash(1234, b'...')
Traceback (most recent call last):
    ...
KeyError: ('unknown hash function', 1234)

encode(encoding=None)

Encode into a multihash-encoded digest.

If encoding is None, a binary digest is produced:

>>> mh = Multihash(0x01, b'TEST')
>>> mh.encode()
b'\x01\x04TEST'

If the name of an encoding is specified, it is used to encode the binary digest before returning it (see CodecReg for supported codecs).

>>> mh.encode('base64')
b'AQRURVNU'

If the encoding is not available, a KeyError is raised.

classmethod from_hash(hash)

Create a Multihash from a hashlib-compatible hash object.

>>> import hashlib
>>> data = b'foo'
>>> hash = hashlib.sha1(data)
>>> digest = hash.digest()
>>> mh = Multihash.from_hash(hash)
>>> mh == (Func.sha1, digest)
True

Application-specific hash functions are also supported (see FuncReg).

If there is no matching multihash hash function for the given hash, a ValueError is raised.

truncate(length)

Return a new Multihash with a shorter digest length.

If the given length is greater than the original, a ValueError is raised.

>>> mh1 = Multihash(0x01, b'FOOBAR')
>>> mh2 = mh1.truncate(3)
>>> mh2 == (0x01, b'FOO')
True
>>> mh3 = mh1.truncate(10)
Traceback (most recent call last):
    ...
ValueError: cannot enlarge the original digest by 4 bytes

Please note that a truncated multihash will no longer verify the same data, as the digest lengths will not match. It may still be convenient for representation purposes, though.

As a special case, identity hashes do not support truncation.

>>> mh4 = Multihash('identity', b'FOOBAR')
>>> mh4.truncate(3)
Traceback (most recent call last):
    ...
ValueError: cannot truncate identity digest

verify(data)

Does the given data hash to the digest in this multihash?

>>> import hashlib
>>> data = b'foo'
>>> hash = hashlib.sha1(data)
>>> mh = Multihash.from_hash(hash)
>>> mh.verify(data)
True
>>> mh.verify(b'foobar')
False

Please note that a multihash with a digest shorter than the standard for its hash function will fail to verify valid data, as digest lengths will not match. A warning will be issued in this case. See verify_truncated() for an alternative for such multihashes.

>>> mh1 = mh.truncate(len(mh.digest) // 2)
>>> mh1.verify(data)
False

Application-specific hash functions are also supported (see FuncReg).

verify_truncated(data)

Is the digest of this multihash a prefix of that of data?

Use this instead of verify() if the multihash has a digest shorter than the standard for its hash function (i.e. if it is the result of truncating another multihash).

>>> import hashlib
>>> data = b'foo'
>>> hash = hashlib.sha1(data)
>>> mh = Multihash.from_hash(hash)
>>> mh.verify(data)
True
>>> mh1 = mh.truncate(len(mh.digest) // 2)
>>> mh1.verify(data)
False
>>> mh1.verify_truncated(data)
True

However, please note that this verification may be weaker, and is indeed forbidden for multihashes using the identity function (as finding collisions is trivial).

>>> mh2 = Multihash(Func.identity, b'FOOBAR')
>>> mh2.verify(b'FOOBAR')
True
>>> mh2.verify(b'FOOBARBAZ')
False
>>> mh2.verify_truncated(b'FOOBARBAZ')
Traceback (most recent call last):
    ...
ValueError: cannot truncate data digest for the identity function
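
The reason for this restriction can be seen with plain byte strings. With the identity function the digest is the data itself, so a prefix comparison (shown here as a sketch of what a truncated check would amount to, not the library's code) would accept every extension of the original data.

```python
digest = b'FOOBAR'      # identity "digest": the original data itself
data = b'FOOBARBAZ'     # different data that merely starts with it

# Whole comparison, as in verify(): only the original bytes match.
exact = (data == digest)
# Prefix comparison: any data starting with the digest would pass,
# which makes collisions trivial to produce.
prefix = data.startswith(digest)

print(exact, prefix)
```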

Application-specific hash functions are also supported (see FuncReg).

multihash.digest(data, func)

Hash the given data into a new Multihash.

The given hash function func is used to perform the hashing. It must be a registered hash function (see FuncReg).

>>> data = b'foo'
>>> mh = digest(data, Func.sha1)
>>> mh.encode('base64')
b'ERQL7se16j8P28ldDdR/PFvCddqKMw=='

multihash.decode(mhash, encoding=None)

Decode a multihash-encoded digest into a Multihash.

If encoding is None, a binary digest is assumed.

>>> mhash = b'\x11\x0a\x0b\xee\xc7\xb5\xea?\x0f\xdb\xc9]'
>>> mh = decode(mhash)
>>> mh == (Func.sha1, mhash[2:])
True

If the name of an encoding is specified, it is used to decode the digest before parsing it (see CodecReg for supported codecs).

>>> import base64
>>> emhash = base64.b64encode(mhash)
>>> emh = decode(emhash, 'base64')
>>> emh == mh
True

If the encoding is not available, a KeyError is raised. If the digest has an invalid format or contains invalid data, a ValueError is raised.
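
As an illustration of what an invalid format can mean, the following stand-alone sketch validates a binary multihash before splitting it. parse_multihash is a hypothetical helper, and the two checks (input too short, length field not matching the digest) are assumptions about reasonable validation rather than a description of what decode() checks.

```python
def parse_multihash(raw):
    """Split a binary multihash into (code, digest); hypothetical helper."""
    if len(raw) < 2:
        raise ValueError('multihash is too short')
    code, length, digest = raw[0], raw[1], raw[2:]
    if length != len(digest):
        raise ValueError('length field does not match digest length')
    return code, digest


# The (truncated) sha1 multihash used in the example above.
good = b'\x11\x0a\x0b\xee\xc7\xb5\xea?\x0f\xdb\xc9]'
code, digest = parse_multihash(good)
print(hex(code), len(digest))

# Dropping the last byte makes the length field inconsistent.
try:
    parse_multihash(good[:-1])
    rejected = False
except ValueError:
    rejected = True
```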

Hash functions

class multihash.Func

An enumeration of hash functions supported by multihash.

The name of each member has its hyphens replaced by underscores. The value of each member corresponds to its integer code.

>>> Func.sha2_512.value == 0x13
True

class multihash.FuncReg

Registry of supported hash functions.

classmethod func_from_hash(hash)

Return the multihash Func for the hashlib-compatible hash object.

If no Func is registered for the given hash, a KeyError is raised.

>>> import hashlib
>>> h = hashlib.sha256()
>>> f = FuncReg.func_from_hash(h)
>>> f is Func.sha2_256
True

classmethod get(func_hint)

Return a registered hash function matching the given hint.

The hint may be a Func member, a function name (with hyphens or underscores), or its code. A Func member is returned for standard multihash functions and an integer code for application-specific ones. If no matching function is registered, a KeyError is raised.

>>> fm = FuncReg.get(Func.sha2_256)
>>> fnu = FuncReg.get('sha2_256')
>>> fnh = FuncReg.get('sha2-256')
>>> fc = FuncReg.get(0x12)
>>> fm == fnu == fnh == fc
True

classmethod hash_from_func(func)

Return a hashlib-compatible object for the multihash func.

If the func is registered but no hashlib-compatible constructor is available for it, None is returned. If the func is not registered, a KeyError is raised.

>>> h = FuncReg.hash_from_func(Func.sha2_256)
>>> h.name
'sha256'

classmethod register(code, name, hash_name=None, hash_new=None)

Add an application-specific function to the registry.

Registers a function with the given code (an integer) and name (a string, which is added both with only hyphens and only underscores), as well as an optional hash_name and hash_new constructor for hashlib compatibility. If the application-specific function is already registered, the related data is replaced. Registering a function with a code not in the application-specific range (0x01-0x0f) or with names already registered for a different function raises a ValueError.

>>> class MyMD5:
...     def __init__(self):
...         self.name = 'mymd5'
...
>>> FuncReg.register(0x05, 'md-5', 'mymd5', MyMD5)
>>> FuncReg.get('md-5') == FuncReg.get('md_5') == 0x05
True
>>> hashobj = FuncReg.hash_from_func(0x05)
>>> hashobj.name == 'mymd5'
True
>>> FuncReg.func_from_hash(hashobj) == 0x05
True
>>> FuncReg.reset()
>>> 0x05 in FuncReg
False

classmethod reset()

Reset the registry to the standard multihash functions.

classmethod unregister(code)

Remove an application-specific function from the registry.

Unregisters the function with the given code (an integer). If the function is not registered, a KeyError is raised. Unregistering a function with a code not in the application-specific range (0x01-0x0f) raises a ValueError.

>>> class MyMD5:
...     def __init__(self):
...         self.name = 'mymd5'
...
>>> FuncReg.register(0x05, 'md-5', 'mymd5', MyMD5)
>>> FuncReg.get('md-5')
5
>>> FuncReg.unregister(0x05)
>>> FuncReg.get('md-5')
Traceback (most recent call last):
    ...
KeyError: ('unknown hash function', 'md-5')

Codecs

class multihash.CodecReg

Registry of supported codecs.

classmethod get_decoder(encoding)

Return a decoder for the given encoding.

The decoder gets a bytes object as argument and returns another decoded bytes object. If the encoding is not registered, a KeyError is raised.

>>> decode = CodecReg.get_decoder('hex')
>>> decode(b'464f4f00')
b'FOO\x00'

classmethod get_encoder(encoding)

Return an encoder for the given encoding.

The encoder gets a bytes object as argument and returns another encoded bytes object. If the encoding is not registered, a KeyError is raised.

>>> encode = CodecReg.get_encoder('hex')
>>> encode(b'FOO\x00')
b'464f4f00'

classmethod register(name, encode, decode)

Add a codec to the registry.

Registers a codec with the given name (a string) to be used with the given encode and decode functions, which take a bytes object and return another one. An existing codec is replaced.

>>> import binascii
>>> CodecReg.register('uu', binascii.b2a_uu, binascii.a2b_uu)
>>> CodecReg.get_decoder('uu') is binascii.a2b_uu
True
>>> CodecReg.reset()
>>> 'uu' in CodecReg
False

classmethod reset()

Reset the registry to the standard codecs.

classmethod unregister(name)

Remove a codec from the registry.

Unregisters the codec with the given name (a string). If the codec is not registered, a KeyError is raised.

>>> import binascii
>>> CodecReg.register('uu', binascii.b2a_uu, binascii.a2b_uu)
>>> 'uu' in CodecReg
True
>>> CodecReg.unregister('uu')
>>> 'uu' in CodecReg
False
