pymultihash documentation
Python implementation of the multihash specification
This is an implementation of the multihash specification in Python.
The main component in the module is the Multihash class, a named tuple that
represents a hash function and a digest created with it, with extended
abilities to work with hashlib-compatible hash functions, verify the integrity
of data, and encode itself to a byte string in the binary format described in
the specification (possibly ASCII-encoded). The decode() function can be
used for the inverse operation, i.e. converting a (possibly ASCII-encoded)
byte string into a Multihash object.
Basic usage
Decoding
One of the basic cases happens when you have a multihash-encoded digest like:
>>> mhash = b'EiAsJrRraP/Gj/mbRTwdMEE0E0ItcGSDv6D5il6IYmbnrg=='
You know beforehand that the multihash is Base64-encoded. You also have some data and you want to check if it matches that digest:
>>> data = b'foo'
To perform this check, you may first decode the multihash (i.e. parse it)
into a Multihash object, which provides the verify() method to validate
the given byte string against the encoded digest:
>>> import multihash
>>> mh = multihash.decode(mhash, 'base64')
>>> mh.verify(data)
True
Please note that you needed to specify that the multihash is Base64-encoded, otherwise binary encoding is assumed (and the decoding will probably fail). The verification internally uses a hashlib-compatible implementation of the function indicated by the encoded multihash to check the data. Read more about codecs and hash functions further below.
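For intuition about what decoding and verifying involve, here is a minimal sketch that uses only the standard library. It is not how pymultihash is implemented; it assumes the function code and the digest length each fit in a single byte, and the helper name parse_multihash is hypothetical.

```python
import base64
import hashlib


def parse_multihash(mhash_b64):
    """Split a Base64-encoded multihash into (code, length, digest).

    Assumption: the function code and the digest length each occupy
    exactly one byte at the start of the binary form.
    """
    raw = base64.b64decode(mhash_b64)
    return raw[0], raw[1], raw[2:]


code, length, digest = parse_multihash(
    b'EiAsJrRraP/Gj/mbRTwdMEE0E0ItcGSDv6D5il6IYmbnrg==')

# 0x12 is the sha2-256 code; recompute the digest of the data and compare.
matches = (code == 0x12 and digest == hashlib.sha256(b'foo').digest())
```

The library's verify() method spares you from choosing the hash function by hand, since it looks it up from the function code stored in the multihash.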
The function in a Multihash object is stored as a member of the Func
enumeration, which contains one member per function listed in the multihash
specification. The name of a Func member is the name of that function in
the specification (with hyphens replaced by underscores), and its value is the
function code. The Multihash object also contains the binary string with
the raw hash digest. Application-specific hash functions are also supported,
but their numeric code is used instead of a Func member.
>>> mh # doctest: +ELLIPSIS
Multihash(func=<Func.sha2_256: 18>, digest=b'...')
>>> hex(mh.func.value)
'0x12'
>>> len(mh.digest)
32
The short representation of a Multihash object only shows the function name
(or its code if application-specific), and the Base64-encoded version of the
raw hash digest:
>>> print(mh)
Multihash(sha2_256, b64:LCa0a2j/xo/5m0U8HTBBNBNCLXBkg7+g+YpeiGJm564=)
If you need a shorter multihash, you may truncate it while keeping the initial bytes of the raw hash digest, but it will no longer be able to validate the same byte strings (unless explicitly instructed to truncate their digests):
>>> mh_trunc = mh.truncate(16)
>>> print(mh_trunc)
Multihash(sha2_256, b64:LCa0a2j/xo/5m0U8HTBBNA==)
>>> mh_trunc.verify(data)
False
>>> mh_trunc.verify_truncated(data)
True
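The difference between the two checks can be sketched with the standard library alone. This is an illustration of the idea, not the library's code; the helper names verify_exact and verify_prefix are hypothetical.

```python
import hashlib
import hmac

full = hashlib.sha256(b'foo').digest()  # 32 bytes
short = full[:16]                       # keep only the initial 16 bytes


def verify_exact(expected, data):
    """Compare the stored digest with the full digest of the data."""
    return hmac.compare_digest(expected, hashlib.sha256(data).digest())


def verify_prefix(expected, data):
    """Truncate the digest of the data to the stored length, then compare."""
    actual = hashlib.sha256(data).digest()
    return hmac.compare_digest(expected, actual[:len(expected)])


exact_ok = verify_exact(short, b'foo')    # fails: the lengths differ
prefix_ok = verify_prefix(short, b'foo')  # succeeds: same leading bytes
```

A shorter digest leaves fewer bytes to compare, which is why prefix verification is the weaker of the two checks.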
Encoding
Now imagine that you have some data and you want to create a multihash out of
it. First you must create a Multihash instance with the desired function
and the computed binary digest. If you already know them, you may create the
Multihash instance directly:
>>> mh = multihash.Multihash(multihash.Func.sha2_512, b'...')
>>> print(mh) # doctest: +ELLIPSIS
Multihash(sha2_512, b64:...)
Instead of the Func member, you may find it more convenient to use the
function name (e.g. 'sha2-512' or 'sha2_512') or its code (e.g. 19 or 0x13).
Alternatively, you may create Multihash instances straight from
hashlib-compatible objects:
>>> import hashlib
>>> hash = hashlib.sha1(data)
>>> mh = multihash.Multihash.from_hash(hash)
>>> print(mh)
Multihash(sha1, b64:C+7Hteo/D9vJXQ3UfzxbwnXaijM=)
However, the easiest way to get a Multihash instance is with the digest()
function, which internally uses a hashlib-compatible implementation of the
indicated function to do the job for you:
>>> mh = multihash.digest(data, 'sha1')
>>> print(mh)
Multihash(sha1, b64:C+7Hteo/D9vJXQ3UfzxbwnXaijM=)
In any case, getting the multihash-encoded digest is very simple:
>>> mh.encode('base64')
b'ERQL7se16j8P28ldDdR/PFvCddqKMw=='
As before, an encoding (Base64) was specified to avoid getting the binary version of the multihash.
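As a rough picture of what the encoded value contains, the sketch below rebuilds the same Base64 string with the standard library. It assumes a one-byte function code followed by a one-byte digest length; pack_multihash is a hypothetical helper, not part of pymultihash.

```python
import base64
import hashlib

SHA1_CODE = 0x11  # multihash function code for sha1


def pack_multihash(code, digest):
    """Prefix the digest with its function code and length.

    Assumption: both values fit in a single byte each.
    """
    if not (0 <= code < 0x80 and len(digest) < 0x80):
        raise ValueError('code or length does not fit in one byte')
    return bytes([code, len(digest)]) + digest


binary = pack_multihash(SHA1_CODE, hashlib.sha1(b'foo').digest())
encoded = base64.b64encode(binary)
```

Here encoded holds the same bytes as the result of mh.encode('base64') in the example above.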
The hash function registry
As the multihash specification indicates, you may use hash function codes in
the range 0x01-0x0f to specify application-specific hash functions.
The decode() function allows such multihashes, and the Multihash
constructor allows specifying such hash functions by their integer code:
>>> import multihash
>>> import hashlib
>>> class MyMD5:
...     def __init__(self, data=b''):
...         self.name = 'mymd5'
...         self._md5 = hashlib.md5(data)
...     def update(self, data):
...         return self._md5.update(data)
...     def digest(self):
...         return self._md5.digest()
...
>>> data = b'foo'
>>> mh = multihash.Multihash(0x05, MyMD5(data).digest())
>>> print(mh) # doctest: +ELLIPSIS
Multihash(0x5, b64:rL0Y20zC+Fzt72VPzMSk2A==)
However this does not allow using more intuitive strings instead of numbers for application-specific functions, and digesting or verifying with such a function is not possible:
>>> multihash.digest(data, 'mymd5')
Traceback (most recent call last):
...
KeyError: ('unknown hash function', 'mymd5')
>>> mh.verify(data)
Traceback (most recent call last):
...
KeyError: ('unknown hash function', 5)
The FuncReg class helps work around these problems by providing a registry
of hash functions. You may add your application-specific hash functions there
with a code, a name, and optionally a hashlib name and a callable object for
hashlib-compatible operations:
>>> multihash.FuncReg.register(0x05, 'md-5', 'mymd5', MyMD5)
>>> multihash.digest(data, 'md-5') # doctest: +ELLIPSIS
Multihash(func=5, digest=b'...')
>>> mh.verify(data)
True
You may remove your application-specific functions from the registry as well:
>>> multihash.FuncReg.unregister(0x05)
FuncReg also allows you to iterate over registered functions (as Func
members or function codes), and check if it contains a given function
(i.e. whether the Func or code is registered or not).
>>> [f.name for f in multihash.FuncReg if f == multihash.Func.sha3]
['sha3_512']
>>> 0x05 in multihash.FuncReg
False
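The idea of such a registry can be sketched as a pair of dictionaries. The class below is a hypothetical toy, not FuncReg itself: it only handles application-specific codes, and it uses hashlib.sha256 as a stand-in constructor.

```python
import hashlib

APP_CODES = range(0x01, 0x10)  # 0x01-0x0f, the range given in the text


class ToyFuncRegistry:
    """Hypothetical stand-in for a hash function registry (not FuncReg)."""

    def __init__(self):
        self._entries = {}  # code -> (name, constructor)
        self._codes = {}    # name spelling -> code

    @staticmethod
    def _spellings(name):
        # Accept the name with only hyphens and with only underscores.
        return {name.replace('_', '-'), name.replace('-', '_')}

    def register(self, code, name, hash_new):
        if code not in APP_CODES:
            raise ValueError('not an application-specific code', code)
        self._entries[code] = (name, hash_new)
        for spelling in self._spellings(name):
            self._codes[spelling] = code

    def unregister(self, code):
        name, _ = self._entries.pop(code)  # KeyError if not registered
        for spelling in self._spellings(name):
            self._codes.pop(spelling, None)

    def get(self, hint):
        if hint in self._entries:
            return hint
        if hint in self._codes:
            return self._codes[hint]
        raise KeyError('unknown hash function', hint)

    def digest(self, data, hint):
        code = self.get(hint)
        _, hash_new = self._entries[code]
        return code, hash_new(data).digest()

    def __contains__(self, code):
        return code in self._entries


reg = ToyFuncRegistry()
reg.register(0x05, 'my-hash', hashlib.sha256)  # sha256 as a stand-in
code, raw_digest = reg.digest(b'foo', 'my_hash')
reg.unregister(0x05)
still_registered = 0x05 in reg
```

Keeping the name lookup separate from the constructor lookup is what lets a function be referred to by name even when no hashlib-compatible implementation is available for it.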
The codec registry
Although a multihash is properly a binary packing format for a hash digest, it
is not normally exchanged in binary form, but in some ASCII-encoded
representation of it. As seen above, multihash decoding and encoding calls
support an encoding argument to allow ASCII decoding or encoding for
your convenience.
The encodings mentioned in the multihash standard are already enabled and
available by using their name (a string) as the encoding argument.
The base58 encoding requires the base58 package to be installed, though.
The CodecReg class allows you to access the available codecs and register
your own (or replace existing ones) with a name and encoding and decoding
callables that take and return byte strings. For instance, to add the uuencode
codec:
>>> import multihash
>>> import binascii
>>> multihash.CodecReg.register('uu', binascii.b2a_uu, binascii.a2b_uu)
To use it:
>>> mhash = b'6$10+[L>UZC\\/V\\E=#=1_/%O"==J*,P \n'
>>> mh = multihash.decode(mhash, 'uu')
>>> print(mh)
Multihash(sha1, b64:C+7Hteo/D9vJXQ3UfzxbwnXaijM=)
>>> mh.encode('uu') == mhash
True
You may remove any codec from the registry as well:
>>> multihash.CodecReg.unregister('uu')
CodecReg also allows you to iterate over registered codec names, and check
if it contains a given codec (i.e. whether it is registered or not).
>>> {'hex', 'base64'}.issubset(multihash.CodecReg)
True
>>> 'base32' in multihash.CodecReg
True
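A codec in this sense is just a pair of callables that map bytes to bytes and back. The sketch below checks that property for three standard-library pairs; the codecs dictionary is a hypothetical illustration and does not reflect how CodecReg stores its entries.

```python
import base64
import binascii

# name -> (encode, decode); each callable takes bytes and returns bytes.
codecs = {
    'hex': (binascii.hexlify, binascii.unhexlify),
    'base64': (base64.b64encode, base64.b64decode),
    'uu': (binascii.b2a_uu, binascii.a2b_uu),
}

payload = b'\x01\x04TEST'  # a short binary multihash: code 0x01, length 4

# A usable codec must give back the original bytes after a round trip.
round_trips = {
    name: decode(encode(payload)) == payload
    for name, (encode, decode) in codecs.items()
}
```

Any pair of functions that passes this round-trip check on byte strings is a reasonable candidate for registration.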
API
class multihash.Multihash
A named tuple representing a multihash function and digest.
The hash function is usually a Func member.
>>> mh = Multihash(Func.sha1, b'BINARY_DIGEST')
>>> mh == (Func.sha1, b'BINARY_DIGEST')
True
>>> mh == (mh.func, mh.digest)
True
However it can also be its integer value (the function code) or its string name (the function name, with either underscore or hyphen).
>>> mhfc = Multihash(Func.sha1.value, mh.digest)
>>> mhfc == mh
True
>>> mhfn = Multihash('sha2-256', b'...')
>>> mhfn.func is Func.sha2_256
True
Application-specific codes (0x01-0x0f) are also accepted. Other codes raise a KeyError.
>>> mhfc = Multihash(0x01, b'...')
>>> mhfc.func
1
>>> mhfc = Multihash(1234, b'...')
Traceback (most recent call last):
...
KeyError: ('unknown hash function', 1234)
encode(encoding=None)
Encode into a multihash-encoded digest.
If encoding is None, a binary digest is produced:
>>> mh = Multihash(0x01, b'TEST')
>>> mh.encode()
b'\x01\x04TEST'
If the name of an encoding is specified, it is used to encode the binary digest before returning it (see CodecReg for supported codecs).
>>> mh.encode('base64')
b'AQRURVNU'
If the encoding is not available, a KeyError is raised.
classmethod from_hash(hash)
Create a Multihash from a hashlib-compatible hash object.
>>> import hashlib
>>> data = b'foo'
>>> hash = hashlib.sha1(data)
>>> digest = hash.digest()
>>> mh = Multihash.from_hash(hash)
>>> mh == (Func.sha1, digest)
True
Application-specific hash functions are also supported (see FuncReg).
If there is no matching multihash hash function for the given hash, a ValueError is raised.
truncate(length)
Return a new Multihash with a shorter digest length.
If the given length is greater than the original, a ValueError is raised.
>>> mh1 = Multihash(0x01, b'FOOBAR')
>>> mh2 = mh1.truncate(3)
>>> mh2 == (0x01, b'FOO')
True
>>> mh3 = mh1.truncate(10)
Traceback (most recent call last):
...
ValueError: cannot enlarge the original digest by 4 bytes
Please note that a truncated multihash will no longer verify the same data, as the digest lengths will not match. It may still be convenient for representation purposes, though.
As a special case, identity hashes do not support truncation.
>>> mh4 = Multihash('identity', b'FOOBAR')
>>> mh4.truncate(3)
Traceback (most recent call last):
...
ValueError: cannot truncate identity digest
verify(data)
Does the given data hash to the digest in this multihash?
>>> import hashlib
>>> data = b'foo'
>>> hash = hashlib.sha1(data)
>>> mh = Multihash.from_hash(hash)
>>> mh.verify(data)
True
>>> mh.verify(b'foobar')
False
Please note that a multihash with a digest shorter than the standard for its hash function will fail to verify valid data, as digest lengths will not match. A warning will be issued in this case. See verify_truncated() for an alternative for such multihashes.
>>> mh1 = mh.truncate(len(mh.digest) // 2)
>>> mh1.verify(data)
False
Application-specific hash functions are also supported (see FuncReg).
verify_truncated(data)
Is the digest of this multihash a prefix of that of data?
Use this instead of verify() if the multihash has a digest shorter than the standard for its hash function (i.e. if it is the result of truncating another multihash).
>>> import hashlib
>>> data = b'foo'
>>> hash = hashlib.sha1(data)
>>> mh = Multihash.from_hash(hash)
>>> mh.verify(data)
True
>>> mh1 = mh.truncate(len(mh.digest) // 2)
>>> mh1.verify(data)
False
>>> mh1.verify_truncated(data)
True
However, please note that this verification may be weaker, and is indeed forbidden for multihashes using the identity function (as finding collisions is trivial).
>>> mh2 = Multihash(Func.identity, b'FOOBAR')
>>> mh2.verify(b'FOOBAR')
True
>>> mh2.verify(b'FOOBARBAZ')
False
>>> mh2.verify_truncated(b'FOOBARBAZ')
Traceback (most recent call last):
...
ValueError: cannot truncate data digest for the identity function
Application-specific hash functions are also supported (see FuncReg).
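To see why prefix verification is forbidden for the identity function, consider this small standard-library sketch. It assumes that the identity "digest" of some data is the data itself; identity_digest is a hypothetical helper used only for illustration.

```python
def identity_digest(data):
    """Identity 'hash': the digest is the data itself (assumption)."""
    return bytes(data)


stored = b'FOOBAR'
candidates = [b'FOOBAR', b'FOOBARBAZ', b'FOOBAR-anything']

# A prefix check accepts every input that merely starts with the digest...
prefix_matches = [identity_digest(c)[:len(stored)] == stored
                  for c in candidates]

# ...whereas an exact check accepts only the original data.
exact_matches = [identity_digest(c) == stored for c in candidates]
```

Anyone can build a colliding input by appending bytes to the stored digest, so a prefix match proves nothing about the data.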
multihash.digest(data, func)
Hash the given data into a new Multihash.
The given hash function func is used to perform the hashing. It must be a registered hash function (see FuncReg).
>>> data = b'foo'
>>> mh = digest(data, Func.sha1)
>>> mh.encode('base64')
b'ERQL7se16j8P28ldDdR/PFvCddqKMw=='
multihash.decode(mhash, encoding=None)
Decode a multihash-encoded digest into a Multihash.
If encoding is None, a binary digest is assumed.
>>> mhash = b'\x11\x0a\x0b\xee\xc7\xb5\xea?\x0f\xdb\xc9]'
>>> mh = decode(mhash)
>>> mh == (Func.sha1, mhash[2:])
True
If the name of an encoding is specified, it is used to decode the digest before parsing it (see CodecReg for supported codecs).
>>> import base64
>>> emhash = base64.b64encode(mhash)
>>> emh = decode(emhash, 'base64')
>>> emh == mh
True
If the encoding is not available, a KeyError is raised. If the digest has an invalid format or contains invalid data, a ValueError is raised.
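One kind of "invalid format" a decoder might reject is a length byte that disagrees with the number of digest bytes that follow. The sketch below illustrates such a check with a hypothetical helper, parse_binary_multihash; which checks pymultihash actually performs is not stated here, so treat this only as an example of the idea.

```python
def parse_binary_multihash(raw):
    """Parse a binary multihash, assuming one-byte code and length fields."""
    if len(raw) < 2:
        raise ValueError('multihash is too short')
    code, length, digest = raw[0], raw[1], raw[2:]
    if len(digest) != length:
        raise ValueError('digest length does not match the length byte')
    return code, digest


# The binary multihash from the example above: sha1 code, 10 digest bytes.
good = parse_binary_multihash(b'\x11\x0a\x0b\xee\xc7\xb5\xea?\x0f\xdb\xc9]')

try:
    # Claims 10 digest bytes but carries only 2.
    parse_binary_multihash(b'\x11\x0a\x0b\xee')
except ValueError as exc:
    error = str(exc)
```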
Hash functions
class multihash.Func
An enumeration of hash functions supported by multihash.
The name of each member has its hyphens replaced by underscores. The value of each member corresponds to its integer code.
>>> Func.sha2_512.value == 0x13
True
class multihash.FuncReg
Registry of supported hash functions.
classmethod func_from_hash(hash)
Return the multihash Func for the hashlib-compatible hash object.
If no Func is registered for the given hash, a KeyError is raised.
>>> import hashlib
>>> h = hashlib.sha256()
>>> f = FuncReg.func_from_hash(h)
>>> f is Func.sha2_256
True
classmethod get(func_hint)
Return a registered hash function matching the given hint.
The hint may be a Func member, a function name (with hyphens or underscores), or its code. A Func member is returned for standard multihash functions and an integer code for application-specific ones. If no matching function is registered, a KeyError is raised.
>>> fm = FuncReg.get(Func.sha2_256)
>>> fnu = FuncReg.get('sha2_256')
>>> fnh = FuncReg.get('sha2-256')
>>> fc = FuncReg.get(0x12)
>>> fm == fnu == fnh == fc
True
classmethod hash_from_func(func)
Return a hashlib-compatible object for the multihash func.
If the func is registered but no hashlib-compatible constructor is available for it, None is returned. If the func is not registered, a KeyError is raised.
>>> h = FuncReg.hash_from_func(Func.sha2_256)
>>> h.name
'sha256'
classmethod register(code, name, hash_name=None, hash_new=None)
Add an application-specific function to the registry.
Registers a function with the given code (an integer) and name (a string, which is added both with only hyphens and only underscores), as well as an optional hash_name and hash_new constructor for hashlib compatibility. If the application-specific function is already registered, the related data is replaced. Registering a function with a code not in the application-specific range (0x01-0x0f) or with names already registered for a different function raises a ValueError.
>>> class MyMD5:
...     def __init__(self):
...         self.name = 'mymd5'
...
>>> FuncReg.register(0x05, 'md-5', 'mymd5', MyMD5)
>>> FuncReg.get('md-5') == FuncReg.get('md_5') == 0x05
True
>>> hashobj = FuncReg.hash_from_func(0x05)
>>> hashobj.name == 'mymd5'
True
>>> FuncReg.func_from_hash(hashobj) == 0x05
True
>>> FuncReg.reset()
>>> 0x05 in FuncReg
False
classmethod reset()
Reset the registry to the standard multihash functions.
classmethod unregister(code)
Remove an application-specific function from the registry.
Unregisters the function with the given code (an integer). If the function is not registered, a KeyError is raised. Unregistering a function with a code not in the application-specific range (0x01-0x0f) raises a ValueError.
>>> class MyMD5:
...     def __init__(self):
...         self.name = 'mymd5'
...
>>> FuncReg.register(0x05, 'md-5', 'mymd5', MyMD5)
>>> FuncReg.get('md-5')
5
>>> FuncReg.unregister(0x05)
>>> FuncReg.get('md-5')
Traceback (most recent call last):
...
KeyError: ('unknown hash function', 'md-5')
Codecs
class multihash.CodecReg
Registry of supported codecs.
classmethod get_decoder(encoding)
Return a decoder for the given encoding.
The decoder gets a bytes object as argument and returns another decoded bytes object. If the encoding is not registered, a KeyError is raised.
>>> decode = CodecReg.get_decoder('hex')
>>> decode(b'464f4f00')
b'FOO\x00'
classmethod get_encoder(encoding)
Return an encoder for the given encoding.
The encoder gets a bytes object as argument and returns another encoded bytes object. If the encoding is not registered, a KeyError is raised.
>>> encode = CodecReg.get_encoder('hex')
>>> encode(b'FOO\x00')
b'464f4f00'
classmethod register(name, encode, decode)
Add a codec to the registry.
Registers a codec with the given name (a string) to be used with the given encode and decode functions, which take a bytes object and return another one. An existing codec is replaced.
>>> import binascii
>>> CodecReg.register('uu', binascii.b2a_uu, binascii.a2b_uu)
>>> CodecReg.get_decoder('uu') is binascii.a2b_uu
True
>>> CodecReg.reset()
>>> 'uu' in CodecReg
False
classmethod reset()
Reset the registry to the standard codecs.
classmethod unregister(name)
Remove a codec from the registry.
Unregisters the codec with the given name (a string). If the codec is not registered, a KeyError is raised.
>>> import binascii
>>> CodecReg.register('uu', binascii.b2a_uu, binascii.a2b_uu)
>>> 'uu' in CodecReg
True
>>> CodecReg.unregister('uu')
>>> 'uu' in CodecReg
False