packetcrypt

v1.0.0

Published

3 years ago

Bandwidth-hard proof of work. [![Build Status](https://travis-ci.org/cjdelisle/PacketCrypt.svg?branch=master)](https://travis-ci.org/cjdelisle/PacketCrypt)

Downloads

0High
0Medium
0Low

cjd

PacketCrypt

Bandwidth-hard proof of work.

Abstract

Since the invention of blockchains, there has been research into how to make the proof of work do something useful. Unfortunately, it has been remarkably difficult to make the work useful without allowing miners to influence the nature of the work problem to their own advantage, destroying the fairness of the algorithm.

PacketCrypt takes a different approach, while the work done in PacketCrypt is itself useless, PacketCrypt is designed to encourage investment into the design and deployment of hardware which is useful for other purposes.

PacketCrypt encourages development of hardware solutions for high speed encryption and decryption of messages about the size of an internet packet. It also uses randomized code in order to encourage CPU mining as well as next generation CPU design research. Perhaps most significantly, PacketCrypt encourages cooperation between many mining devices, allowing bandwidth to be expended in lieu of processor effort.

LEMME TEST IT

First, build it:

./autogen.sh
./configure
make

Try out a simple hash function based on random programs:

./simplehash 'seed to generate random program'  'input to the hash program'

Generate some announcements:

-t will generate fake test announcements
replace 8 with the number of processor cores on your machine
./pcann -t 8

How it works

PacketCrypt uses two distinct stages, the first stage mines a 1 KiB proof known as an announcement. The second stage, which is used to mine a block, gets a difficulty advantage based on the number of valid announcements which the miner can prove they had in memory at the time of mining. The default behavior of nodes in the network is to forward announcements, and announcements contain a payload hash, making them a dual use entity which can also be used for broadcasting messages across the network as well as for mining.

The block miner's job is first to collect or create a set of announcements, they then commit the merkle root of their set in the coinbase. They also commit the set size and minimum amount of work done on any announcement in that set. Mining consists of performing a sequence of 4 encryption and memory accesses to announcements in the set. The hash of the work done on the block must be less than the effectiveTarget, where the effectiveTarget is defined as:

    target_for_work( work_for_target(block_header.nBits)**3 / minimum_announcement_work / announcement_count )
      where:
        work_for_target(t) = 2**256 / (t + 1)
        target_for_work(w) = (2**256 - w) / w

If this work meets the target specified in the bitcoin block header, the block is valid and the miner sends a proof containing the 4 announcements which were accessed in the winning cycle as well as the merkle branches necessary to prove that they were included in the committed hash.

Though this proof is probabilistic only, faking the announcement_count or otherwise faking the announcements is not better than simply mining a smaller announcement set.

The Mining Algorithm

The core mining algorithm of PacketCrypt is a 4 cycle encryption and memory lookup loop over a 2 KiB buffer. Each cycle, a 1 KiB item (announcement) is selected and copied over a portion of the encrypted data before the data is encrypted again.

const zero [16]byte
func PacketCryptCycle(announcements [][1024]byte, seed [32]byte, nonce uint32) [32]byte {
  var state [2048]byte
  chacha20.XORKeyStream(state, state, zero, seed)
  writeUInt32LE(state[0:4], nonce)
  CryptoCycle(state)
  for i := 0; i < 4; i++ {
      itemNo := readUInt32LE(state[16:20]) % len(announcements)
      copy(state[32:1056], announcements[itemNo])
      CryptoCycle(state)
  }
  return concat(state[192:208], state[16:32])
}

CryptoCycle

In order to understand how the mining algorithm works and why it makes use of the particular offsets when copying data, its important to understand exactly what CryptoCycle does. CryptoCycle is based on an implementation of chacha20/poly1305 as standardized by the IETF in RFC-7539. Effort was made to ensure that building a device for performing CryptoCycle is not significantly easier than building a device which can encrypt internet packets.

Because encrypting internet traffic requires being able to encrypt messages of differing lengths with or without associated data, CryptoCycle treats the first 16 bytes of the buffer as a header with certain parameters about the encryption to be performed. The header specifies the length of the content to encrypt and additional data to authenticate as well as a flag to indicate whether the algorithm is in encryption or decryption mode. Decryption is slightly different, with the poly1305 authentication occuring before the chacha20 instead of after.

When attempting to "decrypt" random data with a random key, it's clear that the authentication check will almost certainly fail, so this algorithm requires the caller to verify the authenticator manually with an inexpensive constant-time memcmp() operation.

Crypto Request Layout

The following is the data structure which represents a request to encrypt or decrypt data:

     0               1               2               3
     0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
  0 |                                                               |
    +                            nonce                              +
  4 |                                                               |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
  8 |     unused    | add |D| unusd |     len     |T|   version   |F|
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 12 |                             pad                               |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 16 |                                                               |
    +                                                               +
 20 |                                                               |
    +                                                               +
 24 |                                                               |
    +                                                               +
 28 |                                                               |
    +                        encryption_key                         +
 32 |                                                               |
    +                                                               +
 36 |                                                               |
    +                                                               +
 40 |                                                               |
    +                                                               +
 44 |                                                               |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 48 |                                                               |
    +                          content...
 52 |

The meaning of the fields are as follows:

nonce the encryption nonce, as per the IETF chacha20/poly1305 standard
add the length of the additional data in 16 byte elements
D if this bit is set, the message is being decrypted (order of chacha20/poly1305 is reversed)
len the length of the data to encrypt (in 16 byte units)
T (truncated) this field is ignored in the request but is set if the sum of len and add is greater than 2048
version this is always set to zero and exists for possible future use
F this must be cleared by the caller
pad / unused these are unused and ignored
encryption_key this is the key used for the chacha20 encryption
content the first add * 16 bytes of this is additional authenticated data, it is unaltered but included in the Poly1305 authenticator calculation as per RFC-7539. The len * 16 bytes following the end of the authenticated data is the data to be encrypted. If the sum of is greater than 2048 bytes, the len is decreased by the algorithm and the T flag is set.

Crypto Reply Layout

The reply is much like the request:

     0               1               2               3
     0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
  0 |                                                               |
    +                            nonce                              +
  4 |                                                               |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
  8 |     unused    | add |D| unusd |      len    |T|   version   |F|
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 12 |                             pad                               |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 26 |                                                               |
    +                                                               +
 20 |                                                               |
    +                    poly1305_authenticator                     +
 24 |                                                               |
    +                                                               +
 28 |                                                               |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 32 |                                                               |
    +                                                               +
 36 |                                                               |
    +                   key_and_authenticated_data                  +
 40 |                                                               |
    +
 44 |
    +

... 16 + (16 * add) bytes long ...
                                                                    +
                                                                    |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 ?? |                                                               |
    +                       encrypted_content...
 ?? |

The fields which are updated by the algorithm are as follows:

len this is potentially updated if the length of the message plus length of additional data is greater than 2048.
T (truncated) this flag is set if len has been updated, otherwise it is cleared.
F (failed) this flag is set if there is a failure, during mining this should never happen (see: Crypto Request Preparation)
poly1305_authenticator this is the resulting authenticator from encryption/decryption, if the caller is intending to encrypt data, they should send this along with their encrypted message, if they are intending to decrypt data, they should compare this value to the poly1305 authenticator using a constant-time implementation of memcmp().
key_and_authenticated_data this is bytes 32 to 48 of the encryption key plus the additional authenticated data which are untouched.
encrypted_content this is the content as it was before, but encrypted

Crypto Request Preparation

There are a few changes which are made to the data input before it is sent to the CryptoCycle core.

bytes 8 to 12 are set to the value of bytes 16 to 20, this is to make sure that the length of the encrypted data and additional authenticated data are randomized each cycle.
version and F are set to zero because an unexpected version or F flag set will cause the algorithm to give up. In the reference implementation, if the F flag is ever set, it's an assertion failure.
len is OR'd with 32 to make sure that no encryption cycle will ever happen on less than 32 16-byte elements (512 bytes). This is important to make sure that the miner cannot search for hashes where the length is zero, thus avoiding the need to encrypt anything at all.

You can see the real code which does this in CryptoCycle_makeFuzzable().

General Considerations

As you will note from the pseudocode above, the miner copies the 1 KiB announcement over bytes 32 to 1056 of the buffer, this is so the first 16 bytes of the announcement content will be mixed with the Poly1305 authenticator from the previous cycle to form the next cycle's encryption key.

You will also note that the output of the algorithm is taken from bytes 192 to 208 followed by bytes 16 to 32. Bytes 192 to 208 were chosen because they fall beyond the maximum size of additional authenticated data but within the minimum size of encrypted data, meaning they will always be encrypted. Bytes 16 to 32 are of course the Poly1305 authenticator. As Bitcoin's difficulty metric considers the hash as a little endian number, the last byte of the hash is most significant, this is why algorithm places the Poly1305 authenticator at the end of the output.

Announcement Mining

Each announcement contains a commitment of the block height (parent_block_height) as well as the most recent block hash at the time when the announcement was created. An announcement "matures" and becomes usable for mining a block when the block height is parent_block_height + 2, this gives announcement miners a bit of time to mine their announcements and then gives block miners time to receive them. After this height, the work value of an announcement (for the purposes of minimum_announcement_work) is divided by the number of blocks since it's maturity.

Announcement mining itself follows a similar pattern to block mining, reusing many of the same algorithmic primitives but with a few special tweaks.

First, as there are no Announcements to collect, the miner generates 16384 1 KiB data elements using RandMemoHash. These items are generated using the hash of the announcement header with the soft_nonce field blanked.

Secondly, because the verifier can re-generate the 1 KiB data items, the prover need not attach the items for verification, nor even the merkle branches except for the last item. This helps control the size of the resulting announcement which must be precisely 1KiB.

Third, in order to encourage CPU mining and also encourage research into advanced CPU designs, the announcement mining process also involves the execution of randomly generated programs.

The announcement header is as follows:

     0               1               2               3
     0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
  0 |    version    |                   soft_nonce                  |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
  4 |                          hard_nonce                           |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
  8 |                          work_bits                            |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 12 |                     parent_block_height                       |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 16 |                                                               |
    +                         content_type                          +
 20 |                                                               |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 24 |                                                               |
    +                                                               +
 28 |                                                               |
    +                                                               +
 32 |                                                               |
    +                                                               +
 36 |                                                               |
    +                         content_hash                          +
 40 |                                                               |
    +                                                               +
 44 |                                                               |
    +                                                               +
 48 |                                                               |
    +                                                               +
 52 |                                                               |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 56

version is always zero and exists for future usage
soft_nonce is the only part of the announcement header which is not used when creating the table of data items
work_bits is the difficulty target for the announcement hashing, represented in the Bitcoin standard format
parent_block_height is the height of the most recent block at the time that the announcement was hashed, the hash of this block must be committed when making the table of data items for this announcement
content_type is a value which is opaque to the mining algorithm, it represents the meaning of the announcement (if any)
content_hash is the hash of the announcement, for announcements which were created purely for the purpose of creating blocks, miners are encouraged to set both content_type and content_hash to all zeros

Announcement Design Objectives

There are a number of design objectives which went into the design of the announcement:

Multi-use: An announcement which is mined for the purpose of broadcasting a message to the network should not be significantly less useful to a block miner than an announcement which was mined for the purpose of helping the block miner mine a block.
Bandwidth-hardness: The exchange of announcements in order to get a discount on the proof of work necessary to mine a block must be a bandwidth hard activity. This implies announcements should be difficult to compress and also that miners should be incentivised to prefer many announcements rather than few.
CPU-favoritism: Announcements are intended not only to serve as proof of work but also to allow communication, so announcement mining is designed to favor commodity hardware such as PCs and phones.
CPU-evolution: Finally, the announcement mining algorithm is intended to provide a basis for thinking about next generation CPU designs.

Irreducability

For the multi-use and the bandwidth-hardness objectives, it is desirable that an announcement miner cannot cause their announcements to be compressable to significantly less than 1 KiB in size. The announcement payload is separated from the 1 KiB announcement itself so that the only parts of the announcement which an announcement miner can really control are the content_type and content_hash. A block miner who has no interest in using casually crafted announcements can omit these 40 bytes from the memory locations where they store announcements, but these 40 bytes are about all.

The content of the announcement is the 56 byte header plus a Merkle branch 13 hashes long and a Merkle root needed to validate the announcement. The Merkle tree used for announcement hashing uses 64 byte Blake2b hashes which creates a total of 896 bytes of hashes which cannot be omitted without requiring the recipient to regenerate the 16384 data items used to create the announcement. The 56 byte header plus the 896 byte Merkle branch and root is 952 bytes, so the final 72 bytes comes from the 4th data item. While the 4th data item is probably the easiest thing to recreate, it is only 7% of the announcement data and while it can be omitted from the data sent by the announcement miner to the block miner, the work required to redo the RandMemoHash cycle for recreating the data is far too much for the miner to omit it from memory.

Announcement layout:

[ header (56 bytes) ][ merkle branch (832 bytes) ][ merkle root (64 bytes) ][ 4th item prefix (72 bytes) ]

Reusability

With casual announcement miners and "professional" announcement miners both participating in the network, one might expect the typical minimum_announcement_work to skyrocket as professionals generate almost all of the announcements needed for block mining. This is mitigated by the fact that announcements are reusable for mining multiple blocks - but at a degrading value. So if you receive 100 announcements per block period and they're worth 100 CPU-units each, you can start out by mining a block with minimum_announcement_work of 100, then next cycle you can double your announement count by halving minimum_announcment_work, keeping the same total effort but now being able to also include more casually mined announcements which have a lower amount of work done on them.

CPU-favoritism

The announcment mining algorithm uses a couple of tricks to favor CPU mining over GPU and to frustrate ASIC design efforts. However, the announcement mining algorithm is not designed to use every circuit on current generation CPUs, instead it is designed to try to match general purpose computing workloads to encourage research on the next generation of CPUs.

To this end, the announcement mining algorithm creates random programs which must be executed before each encryption cycle. The random programs are generated from the result of the previous cycle, which depend in part on the previous random program.

The programs use a stack of 32 bit elements and an instruction set of about 100 instructions. There are branches (both predictable and unpredictable) and loops. Programs are normally interpreted but they can be output as C language code which looks like the following:

char* RANDHASH_SEED = "2";
#include "prelude.h"
BEGIN
LOOP(loop_1, 3) { // 0x00300042 @ 0
  OP1(l_1_1, MEMORY(loop_1, 0x000044d8, 9, 11)); // 0x89b13641 @ 1
  OP1(l_1_2, MEMORY(loop_1, 0x000044d8, 9, 8)); // 0x89b13041 @ 2
  OP1(l_1_3, MEMORY(loop_1, 0x000044d8, 9, 11)); // 0x89b13641 @ 3
  OP1(l_1_4, IN(7)); // 0x27a9e440 @ 4
  OP1(l_1_5, ROTL32(l_1_4, l_1_2)); // 0x0020081d @ 5
  OP1(l_1_6, CLZ8(l_1_5)); // 0x00000a04 @ 6
  LOOP(loop_2, 22) { // 0x01600042 @ 7
    OP1(l_2_1, MEMORY(loop_2, 0x00003293, 3, 2)); // 0x65266441 @ 8
    OP1(l_2_2, MEMORY(loop_2, 0x00003293, 3, 3)); // 0x65266641 @ 9
    OP1(l_2_3, MEMORY(loop_2, 0x00003293, 3, 7)); // 0x65266e41 @ 10
    OP1(l_2_4, XOR(l_2_2, l_2_3)); // 0x00a01223 @ 11
    OP1(l_2_5, IN(3)); // 0xa4b0b740 @ 12
    OP2(l_2_6, l_2_7, MULU8C(l_1_6, l_2_5)); // 0x00c00c30 @ 13
    OP1(l_2_8, CTZ16(l_2_4)); // 0x00001608 @ 14
    LOOP(loop_3, 2) { // 0x00200042 @ 15
      OP1(l_3_1, MEMORY(loop_3, 0x0000504d, 12, 11)); // 0xa09b9741 @ 16
      OP1(l_3_2, MEMORY(loop_3, 0x0000504d, 12, 4)); // 0xa09b8941 @ 17
      OP1(l_3_3, IN(7)); // 0xde8ab140 @ 18
      OP1(l_3_4, IN(5)); // 0x9755cf40 @ 19
      OP1(l_3_5, IN(7)); // 0xddd89640 @ 20
      OP1(l_3_6, SHRL8(l_3_2, 0x000000cd)); // 0x0cd42415 @ 21
      OP2(l_3_7, l_3_8, MULSU16C(l_3_6, l_3_5)); // 0x01502c2e @ 22
      OP4(l_3_9, l_3_10, l_3_11, l_3_12, SUB64C(l_2_6, l_2_7, l_3_3, l_3_4)); // 0x01401c3c @ 23
      OP4(l_3_13, l_3_14, l_3_15, l_3_16, ADD64C(l_2_5, l_2_6, 0x000001c9, 0x00000000)); // 0x1c941a3b @ 24
      IF_LIKELY(l_1_6) { // 0x00200c43 @ 25
        OP1(l_4_1, IN(0)); // 0xec830e40 @ 27
        OP1(l_4_2, IN(7)); // 0x1b765740 @ 28
        OUT2(l_4_1, l_4_2);
      } // 0x00000046 @ 29
      else { // 0x00000545 @ 30
        OP1(l_4_1, OR(l_3_7, l_2_1)); // 0x00802e22 @ 31
        OP1(l_4_2, SHRL16(l_4_1, l_2_8)); // 0x00f04416 @ 32
        OP1(l_4_3, AND(l_4_2, l_4_2)); // 0x02304621 @ 33
        OP2(l_4_4, l_4_5, ROTR64(l_4_1, l_4_2, 0x000006fb, 0x00000000)); // 0x6fb44639 @ 34
        OUT4(l_4_1, l_4_2, l_4_3, l_4_4);
        OUT(l_4_5);
      } // 0x00000046 @ 35
      OP1(l_3_17, IN(3)); // 0x74322a40 @ 36
      OP1(l_3_18, CLZ8(l_2_8)); // 0x00001e04 @ 37
      OP1(l_3_19, IN(7)); // 0xa78a6d40 @ 38
      OP1(l_3_20, POPCNT8(l_3_8)); // 0x00003001 @ 39
      OP1(l_3_21, SUB32(l_1_3, l_3_9)); // 0x01900611 @ 40
      OP1(l_3_22, CLZ16(l_3_10)); // 0x00003405 @ 41
      OP2(l_3_23, l_3_24, MUL64(l_2_2, l_2_3, l_2_3, l_2_4)); // 0x00b0143a @ 42
      OP1(l_3_25, BSWAP32(l_3_14)); // 0x00003c0b @ 43
      OP1(l_3_26, IN(7)); // 0xad2a6540 @ 44
      IF_RANDOM(l_3_17) { // 0x00204244 @ 45
        OP1(l_4_1, IN(0)); // 0x1c035640 @ 47
        OP1(l_4_2, ADD32(l_2_1, l_4_1)); // 0x02c0100e @ 48
        OP2(l_4_3, l_4_4, MUL64(l_3_11, l_3_12, l_4_1, l_4_2)); // 0x02d0383a @ 49
        OP1(l_4_5, POPCNT32(l_4_3)); // 0x00005c03 @ 50
        OP2(l_4_6, l_4_7, MUL64(l_3_15, l_3_16, 0x000004bf, 0x00000000)); // 0x4bf4403a @ 51
        OP1(l_4_8, IN(5)); // 0xa0d7a540 @ 52
        IF_LIKELY(l_4_8) { // 0x00206643 @ 53
          OP1(l_5_1, IN(6)); // 0x80fe1a40 @ 55
          OP1(l_5_2, CLZ16(l_4_2)); // 0x00005a05 @ 56
          OP1(l_5_3, IN(7)); // 0x085f7e40 @ 57
          OP1(l_5_4, CLZ16(l_5_1)); // 0x00006a05 @ 58
          OP1(l_5_5, POPCNT32(l_5_2)); // 0x00006c03 @ 59
          OP2(l_5_6, l_5_7, ADD8C(l_4_4, l_4_6)); // 0x03105e24 @ 60
          OP2(l_5_8, l_5_9, ADD8C(l_5_3, l_4_5)); // 0x03006e24 @ 61
          OP2(l_5_10, l_5_11, MUL16C(l_5_4, 0xfffffb57)); // 0xb574702b @ 62
          OP2(l_5_12, l_5_13, ADD64(l_5_7, l_5_8, l_5_9, l_5_10)); // 0x03e07833 @ 63
          OP1(l_5_14, POPCNT32(l_5_5)); // 0x00007203 @ 64
          OP4(l_5_15, l_5_16, l_5_17, l_5_18, MULU64C(l_4_6, l_4_7, l_4_7, l_4_8)); // 0x0330643f @ 65
          OP1(l_5_19, IN(7)); // 0x3f4ebe40 @ 66
          OP1(l_5_20, CTZ8(l_3_1)); // 0x00002207 @ 67
          OP1(l_5_21, CLZ16(l_4_3)); // 0x00005c05 @ 68
          OP2(l_5_22, l_5_23, MULSU16C(l_5_19, l_4_8)); // 0x03308e2e @ 69
          OP2(l_5_24, l_5_25, SHRA64(l_4_6, l_4_7, 0xfffffd66, 0xffffffff)); // 0xd6646437 @ 70
          OP2(l_5_26, l_5_27, MULSU8C(l_5_1, l_5_6)); // 0x03a06a2d @ 71
          OP1(l_5_28, POPCNT16(l_5_20)); // 0x00009002 @ 72
          OP2(l_5_29, l_5_30, ROTR64(l_4_1, l_4_2, l_5_12, l_5_13)); // 0x04105a39 @ 73
          OP1(l_5_31, SHRL32(l_5_24, l_5_25)); // 0x04d09817 @ 74
          OP4(l_5_32, l_5_33, l_5_34, l_5_35, MULU64C(l_5_14, l_5_15, 0x0000053d, 0x00000000)); // 0x53d4863f @ 75
          OUT8(l_5_1, l_5_2, l_5_3, l_5_4, l_5_5, l_5_6, l_5_7, l_5_8);
          OUT8(l_5_9, l_5_10, l_5_11, l_5_12, l_5_13, l_5_14, l_5_15, l_5_16);
          OUT8(l_5_17, l_5_18, l_5_19, l_5_20, l_5_21, l_5_22, l_5_23, l_5_24);
          OUT8(l_5_25, l_5_26, l_5_27, l_5_28, l_5_29, l_5_30, l_5_31, l_5_32);
          OUT2(l_5_33, l_5_34);
          OUT(l_5_35);
        } // 0x00000046 @ 76
        else { // 0x00000c45 @ 77
          OP4(l_5_1, l_5_2, l_5_3, l_5_4, ADD64C(l_4_7, l_4_8, 0x00000707, 0x00000000)); // 0x7074663b @ 78
          OP1(l_5_5, ADD16(l_5_1, l_5_2)); // 0x03606a0d @ 79
          OP2(l_5_6, l_5_7, ADD8C(l_2_4, l_5_4)); // 0x03801624 @ 80
          OP1(l_5_8, CTZ8(l_5_6)); // 0x00007407 @ 81
          OP2(l_5_9, l_5_10, MUL64(l_5_4, l_5_5, l_4_6, l_4_7)); // 0x0320723a @ 82
          OP1(l_5_11, ROTL32(l_5_10, l_5_7)); // 0x03b07c1d @ 83
          OP1(l_5_12, CTZ32(l_4_1)); // 0x00005809 @ 84
          OP2(l_5_13, l_5_14, SHRA64(l_5_11, l_5_12, 0x000002f7, 0x00000000)); // 0x2f748037 @ 85
          OP4(l_5_15, l_5_16, l_5_17, l_5_18, SUB64C(l_5_13, l_5_14, l_5_1, l_5_2)); // 0x0360843c @ 86
          OP2(l_5_19, l_5_20, SHLL64(l_5_8, l_5_9, 0x0000058c, 0x00000000)); // 0x58c47a35 @ 87
          OP2(l_5_21, l_5_22, ADD16C(l_5_17, l_5_15)); // 0x04308a25 @ 88
          OUT8(l_5_1, l_5_2, l_5_3, l_5_4, l_5_5, l_5_6, l_5_7, l_5_8);
          OUT8(l_5_9, l_5_10, l_5_11, l_5_12, l_5_13, l_5_14, l_5_15, l_5_16);
          OUT4(l_5_17, l_5_18, l_5_19, l_5_20);
          OUT2(l_5_21, l_5_22);
        } // 0x00000046 @ 89
        OP1(l_4_9, IN(7)); // 0x78dea840 @ 90
        OP2(l_4_10, l_4_11, SHRL64(l_3_11, l_3_12, 0x0000021a, 0x00000000)); // 0x21a43836 @ 91
        OP1(l_4_12, IN(3)); // 0xae2c7c40 @ 92
        OP2(l_4_13, l_4_14, SUB32C(l_4_12, l_4_9)); // 0x03406e29 @ 93
        OP4(l_4_15, l_4_16, l_4_17, l_4_18, MULU64C(l_3_18, l_3_19, l_4_13, l_4_14)); // 0x0390463f @ 94
        OP1(l_4_19, POPCNT32(l_4_15)); // 0x00007403 @ 95
        OP1(l_4_20, POPCNT8(l_4_16)); // 0x00007601 @ 96
        IF_LIKELY(l_4_10) { // 0x00206a43 @ 97
          OP2(l_5_1, l_5_2, MULU8C(l_4_11, 0xfffff9c4)); // 0x9c446c30 @ 99
          OP4(l_5_3, l_5_4, l_5_5, l_5_6, MULU64C(l_5_1, l_5_2, 0xfffffcfe, 0xffffffff)); // 0xcfe4843f @ 100
          OP2(l_5_7, l_5_8, MULSU32C(l_5_3, l_5_4)); // 0x0440862f @ 101
          IF_LIKELY(l_4_9) { // 0x00206843 @ 102
            OP1(l_6_1, IN(0)); // 0xc9853140 @ 104
            OP1(l_6_2, IN(7)); // 0xfddeb140 @ 105
            OP1(l_6_3, CTZ32(l_6_1)); // 0x00009409 @ 106
            OUT2(l_6_1, l_6_2);
            OUT(l_6_3);
          } // 0x00000046 @ 107
          else { // 0x00000445 @ 108
            OP1(l_6_1, IN(7)); // 0x88de6140 @ 109
            OP1(l_6_2, BSWAP16(l_6_1)); // 0x0000940a @ 110
            OP1(l_6_3, IN(1)); // 0x2690f740 @ 111
            OUT2(l_6_1, l_6_2);
            OUT(l_6_3);
          } // 0x00000046 @ 112
          OUT8(l_5_1, l_5_2, l_5_3, l_5_4, l_5_5, l_5_6, l_5_7, l_5_8);
        } // 0x00000046 @ 113
        else { // 0x00001245 @ 114
          OP1(l_5_1, IN(6)); // 0xc17bf440 @ 115
          OP1(l_5_2, IN(1)); // 0xeb173e40 @ 116
          OP1(l_5_3, POPCNT32(l_5_2)); // 0x00008403 @ 117
          OP1(l_5_4, SHLL32(l_5_3, l_3_13)); // 0x01d08614 @ 118
          OP1(l_5_5, POPCNT32(l_5_4)); // 0x00008803 @ 119
          OP1(l_5_6, AND(l_5_1, l_4_16)); // 0x03b08221 @ 120
          OP2(l_5_7, l_5_8, ROTL64(l_4_4, l_4_5, l_5_5, l_5_6)); // 0x04606038 @ 121
          OP2(l_5_9, l_5_10, SHRL64(l_5_6, l_5_7, l_5_2, l_5_3)); // 0x04308e36 @ 122
          OP4(l_5_11, l_5_12, l_5_13, l_5_14, MUL64C(l_5_7, l_5_8, l_5_5, l_5_6)); // 0x0460903d @ 123
          OP1(l_5_15, BSWAP32(l_5_9)); // 0x0000920b @ 124
          IF_LIKELY(l_4_19) { // 0x00207c43 @ 125
            OP1(l_6_1, CLZ32(l_5_15)); // 0x00009e06 @ 127
            OUT(l_6_1);
          } // 0x00000046 @ 128
          else { // 0x00000245 @ 129
            OP1(l_6_1, POPCNT16(l_5_10)); // 0x00009402 @ 130
            OUT(l_6_1);
          } // 0x00000046 @ 131
          OUT8(l_5_1, l_5_2, l_5_3, l_5_4, l_5_5, l_5_6, l_5_7, l_5_8);
          OUT4(l_5_9, l_5_10, l_5_11, l_5_12);
          OUT2(l_5_13, l_5_14);
          OUT(l_5_15);
        } // 0x00000046 @ 132
        OUT8(l_4_1, l_4_2, l_4_3, l_4_4, l_4_5, l_4_6, l_4_7, l_4_8);
        OUT8(l_4_9, l_4_10, l_4_11, l_4_12, l_4_13, l_4_14, l_4_15, l_4_16);
        OUT4(l_4_17, l_4_18, l_4_19, l_4_20);
      } // 0x00000046 @ 133
      else { // 0x00003a45 @ 134
        OP1(l_4_1, IN(7)); // 0xfdf5a340 @ 135
        OP1(l_4_2, CTZ8(l_4_1)); // 0x00005807 @ 136
        OP2(l_4_3, l_4_4, SHLL64(l_4_1, l_4_2, l_4_1, l_4_2)); // 0x02d05a35 @ 137
        OP4(l_4_5, l_4_6, l_4_7, l_4_8, MULSU64C(l_4_3, l_4_4, l_3_20, l_3_21)); // 0x02505e3e @ 138
        OP1(l_4_9, POPCNT16(l_4_6)); // 0x00006202 @ 139
        IF_LIKELY(l_4_2) { // 0x00205a43 @ 140
          OP1(l_5_1, IN(7)); // 0x90ee1140 @ 142
          OP1(l_5_2, SHRA16(l_5_1, l_4_9)); // 0x03406c19 @ 143
          OP1(l_5_3, POPCNT32(l_5_2)); // 0x00006e03 @ 144
          OP2(l_5_4, l_5_5, ADD64(l_5_1, l_5_2, l_5_1, l_5_2)); // 0x03706e33 @ 145
          OP1(l_5_6, SHRL32(l_5_3, l_5_4)); // 0x03907017 @ 146
          OP2(l_5_7, l_5_8, SUB16C(l_5_5, l_5_6)); // 0x03b07428 @ 147
          OP2(l_5_9, l_5_10, ROTL64(l_5_2, l_5_3, 0x00000081, 0x00000000)); // 0x08147038 @ 148
          OP2(l_5_11, l_5_12, MULSU16C(l_2_8, 0xfffffc67)); // 0xc6741e2e @ 149
          IF_LIKELY(l_5_8) { // 0x00207a43 @ 150
            OP1(l_6_1, IN(1)); // 0x1f93bb40 @ 152
            OP1(l_6_2, IN(3)); // 0x67312f40 @ 153
            OP1(l_6_3, CLZ32(l_6_1)); // 0x00008606 @ 154
            OP1(l_6_4, SHLL32(l_6_2, l_6_2)); // 0x04408814 @ 155
            OUT4(l_6_1, l_6_2, l_6_3, l_6_4);
          } // 0x00000046 @ 156
          else { // 0x00000545 @ 157
            OP1(l_6_1, IN(7)); // 0x35739940 @ 158
            OP1(l_6_2, IN(6)); // 0xf2e47d40 @ 159
            OP1(l_6_3, CTZ16(l_3_22)); // 0x00004c08 @ 160
            OP1(l_6_4, IN(7)); // 0x489fdd40 @ 161
            OUT4(l_6_1, l_6_2, l_6_3, l_6_4);
          } // 0x00000046 @ 162
          OP1(l_5_13, IN(5)); // 0xa7d11a40 @ 163
          OP2(l_5_14, l_5_15, ROTR64(l_5_12, l_5_13, l_5_9, l_5_10)); // 0x03f08439 @ 164
          OP2(l_5_16, l_5_17, SHRL64(l_5_13, l_5_14, l_4_2, l_4_3)); // 0x02e08636 @ 165
          OP1(l_5_18, CTZ16(l_5_2)); // 0x00006e08 @ 166
          OUT8(l_5_1, l_5_2, l_5_3, l_5_4, l_5_5, l_5_6, l_5_7, l_5_8);
          OUT8(l_5_9, l_5_10, l_5_11, l_5_12, l_5_13, l_5_14, l_5_15, l_5_16);
          OUT2(l_5_17, l_5_18);
        } // 0x00000046 @ 167
        else { // 0x00000f45 @ 168
          OP4(l_5_1, l_5_2, l_5_3, l_5_4, SUB64C(l_4_7, l_4_8, l_2_7, l_2_8)); // 0x00f0663c @ 169
          OP1(l_5_5, CTZ8(l_5_2)); // 0x00006e07 @ 170
          OP1(l_5_6, AND(l_5_5, l_5_2)); // 0x03707421 @ 171
          OP2(l_5_7, l_5_8, SHLL64(l_5_3, l_5_4, l_5_1, l_5_2)); // 0x03707235 @ 172
          OP4(l_5_9, l_5_10, l_5_11, l_5_12, MUL64C(l_5_6, l_5_7, l_5_6, l_5_7)); // 0x03c0783d @ 173
          OP1(l_5_13, SUB32(l_5_10, l_4_5)); // 0x03007e11 @ 174
          OP4(l_5_14, l_5_15, l_5_16, l_5_17, MULU64C(l_5_8, l_5_9, 0xfffffa37, 0xffffffff)); // 0xa3747c3f @ 175
          OP1(l_5_18, SHRL8(l_5_11, 0xfffffb72)); // 0xb7248015 @ 176
          OP1(l_5_19, SHLL32(l_5_12, l_2_8)); // 0x00f08214 @ 177
          IF_RANDOM(l_5_18) { // 0x00208e44 @ 178
          } // 0x00000046 @ 180
          else { // 0x00000145 @ 181
          } // 0x00000046 @ 182
          OUT8(l_5_1, l_5_2, l_5_3, l_5_4, l_5_5, l_5_6, l_5_7, l_5_8);
          OUT8(l_5_9, l_5_10, l_5_11, l_5_12, l_5_13, l_5_14, l_5_15, l_5_16);
          OUT2(l_5_17, l_5_18);
          OUT(l_5_19);
        } // 0x00000046 @ 183
        LOOP(loop_5, 11) { // 0x00b00042 @ 184
          OP1(l_5_1, MEMORY(loop_5, 0x00003f06, 10, 4)); // 0x7e0d4841 @ 185
          OP1(l_5_2, MEMORY(loop_5, 0x00003f06, 10, 5)); // 0x7e0d4a41 @ 186
          OP1(l_5_3, IN(1)); // 0xe412cf40 @ 187
          OP2(l_5_4, l_5_5, ADD32C(l_2_2, l_5_1)); // 0x03601226 @ 188
          OP1(l_5_6, POPCNT8(l_5_3)); // 0x00007001 @ 189
          OP1(l_5_7, CLZ32(l_5_2)); // 0x00006e06 @ 190
          OUT4(l_5_1, l_5_2, l_5_3, l_5_4);
          OUT2(l_5_5, l_5_6);
          OUT(l_5_7);
        } // 0x00000046 @ 191
        OUT8(l_4_1, l_4_2, l_4_3, l_4_4, l_4_5, l_4_6, l_4_7, l_4_8);
        OUT(l_4_9);
      } // 0x00000046 @ 192
      LOOP(loop_4, 85) { // 0x05500042 @ 193
        OP1(l_4_1, MEMORY(loop_4, 0x00004afd, 4, 11)); // 0x95fa9741 @ 194
        OP1(l_4_2, ADD8(l_4_1, l_4_1)); // 0x02c0580c @ 195
        OP2(l_4_3, l_4_4, SUB8C(l_4_2, l_4_2)); // 0x02d05a27 @ 196
        OUT4(l_4_1, l_4_2, l_4_3, l_4_4);
      } // 0x00000046 @ 197
      OUT8(l_3_1, l_3_2, l_3_3, l_3_4, l_3_5, l_3_6, l_3_7, l_3_8);
      OUT8(l_3_9, l_3_10, l_3_11, l_3_12, l_3_13, l_3_14, l_3_15, l_3_16);
      OUT8(l_3_17, l_3_18, l_3_19, l_3_20, l_3_21, l_3_22, l_3_23, l_3_24);
      OUT2(l_3_25, l_3_26);
    } // 0x00000046 @ 198
    OUT8(l_2_1, l_2_2, l_2_3, l_2_4, l_2_5, l_2_6, l_2_7, l_2_8);
  } // 0x00000046 @ 199
  OUT4(l_1_1, l_1_2, l_1_3, l_1_4);
  OUT2(l_1_5, l_1_6);
} // 0x00000046 @ 200
END

Instructions

Arithmatic instructions have 4 variants for treating 8 bit, 16 bit, 32 bit and 64 bit numbers. If ADD8 for instance will add each 8 bit section of argument one to each 8 bit section of argument 2. Arithmatic instructions also have a variant which returns two outputs, a value and a "carry", for addition, carry is a 1 if the number rolls over when adding, for multiplication is is the high bits of the result.

One argument instructions, one output
- POPCNT count the number of bits which are set in the input value
  - POPCNT8
  - POPCNT16
  - POPCNT32
- CLZ count the number of leading zeros in the input value
  - CLZ8
  - CLZ16
  - CLZ32
- CLZ count the number of trailing zeros in the input
  - CTZ8
  - CTZ16
  - CTZ32
- BSWAP16 swap the high and low bytes of each 16 bit half of the value
- BSWAP32 swap value from big endian to little endian or the reverse
Two argument instructions, one output
- ADD add without carry (number rolls over)
  - ADD8
  - ADD16
  - ADD32
- SUB subtract without carry
  - SUB8
  - SUB16
  - SUB32
- SHLL shift left logical (unsigned)
  - SHLL8
  - SHLL16
  - SHLL32
- SHRL shift right logical (unsigned)
  - SHRL8
  - SHRL16
  - SHRL32
- SHRA shift right arithmatic (signed, extends the sign bit)
  - SHRA8
  - SHRA16
  - SHRA32
- ROTL rotate left
  - ROTL8
  - ROTL16
  - ROTL32
- MUL multiply without carry
  - MUL8
  - MUL16
  - MUL32
- AND logical AND
- OR logical OR
- XOR logical XOR
Two argument instructions with 2 outputs
- ADDC add values and output the result and a carry value
  - ADD8C
  - ADD16C
  - ADD32C
- SUBC subtract and carry
  - SUB8C
  - SUB16C
  - SUB32C
- MULC multiply signed*signed and carry
  - MUL8C
  - MUL16C
  - MUL32C
- MULSUC multiply signed*unsigned and carry
  - MULSU8C
  - MULSU16C
  - MULSU32C
- MULUC multiply unsigned*unsigned and carry
  - MULU8C
  - MULU16C
  - MULU32C
64 bit instructions with one 64 bit output
- ADD64
- SUB64
- SHLL64
- SHRL64
- SHRA64
- ROTL64
- ROTR64
- MUL64
64 bit instructions with two 64 bit outputs
- ADD64C
- SUB64C
- MUL64C
- MULSU64C
- MULU64C
Control instructions
- IN reads a word from the input hash
- MEMORY reads a value from memory, this is used with LOOP to stream memory
- LOOP repeat a set of instructions n times
- IF_LIKELY branch with a 87.5% chance that it will be taken
- IF_RANDOM branch with a 50% chance that it will be taken
- JMP unconditional jump (used with IF to implement the "else" block)
- END end a scope (LOOP, IF)

TODO list

This algorithm is not completely done, there are a few things which are still missing:

[] RandHash
- [] Can we use more memory ?
- [] MIX instruction to frustrate program loop parallelization
- [] ABS, MIN and MAX insns
- [] float instructions ?
- [] amd64 jit