fp16
v0.3.0
Half-precision 16-bit floating point numbers
DataView has APIs for getting and setting float64s and float32s. This library provides the analogous methods for float16s, and utilities for testing how a given float64 value will convert to float32 and float16 values. Conversion implements the IEEE 754 default rounding behavior ("Round-to-Nearest RoundTiesToEven").
NaN is always encoded as 0x7e00, which extends the pattern of how browsers serialize NaN in 32 bits and is the recommendation in the CBOR spec.
This library is TypeScript-native, ESM-only, and has zero dependencies. It works in Node, the browser, and Deno.
Install
npm i fp16
Usage
Set a 16-bit float
declare function setFloat16(
  view: DataView,
  offset: number,
  value: number,
  littleEndian?: boolean,
): void
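To make the encoding concrete, here is a minimal sketch of what writing a float16 involves. It is not fp16's actual implementation: the names encodeFloat16Sketch and setFloat16Sketch are hypothetical, and the arithmetic approach (scale to the local spacing, then round ties to even) is an assumption chosen for readability. It follows the behavior described above, with NaN encoded as 0x7e00 and ties rounded to the even neighbor.

```typescript
// Hypothetical sketch; fp16's real implementation may differ.
function encodeFloat16Sketch(value: number): number {
  if (Number.isNaN(value)) return 0x7e00 // canonical NaN

  const sign = value < 0 || Object.is(value, -0) ? 0x8000 : 0
  const abs = Math.abs(value)
  if (abs === Infinity) return sign | 0x7c00

  // Unbiased binary exponent, read from the float64 bit pattern.
  const scratch = new DataView(new ArrayBuffer(8))
  scratch.setFloat64(0, abs)
  const raw = ((scratch.getUint16(0) >> 4) & 0x7ff) - 1023
  if (raw > 15) return sign | 0x7c00 // too large for float16

  // Below 2^-14 every float16 is subnormal and shares the spacing 2^-24.
  const exponent = Math.max(raw, -14)
  const scaled = abs / 2 ** (exponent - 10) // exact: division by a power of two

  // Round to the nearest integer, ties to even.
  const whole = Math.floor(scaled)
  const rest = scaled - whole
  const rounded = rest > 0.5 || (rest === 0.5 && whole % 2 === 1) ? whole + 1 : whole

  // Adding (not OR-ing) lets a rounding carry bump the exponent field,
  // including the carry from 0x7bff into 0x7c00 (Infinity).
  return sign | (((exponent + 14) << 10) + rounded)
}

function setFloat16Sketch(
  view: DataView,
  offset: number,
  value: number,
  littleEndian?: boolean,
): void {
  view.setUint16(offset, encodeFloat16Sketch(value), littleEndian)
}
```

With this sketch, 1 encodes to 0x3c00, 65504 (the largest finite float16) to 0x7bff, and 65520 sits exactly halfway to the next step and rounds up to 0x7c00 (Infinity).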
Get a 16-bit float
declare function getFloat16(
  view: DataView,
  offset: number,
  littleEndian?: boolean,
): number
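The decoding direction is simpler, since every float16 is exactly representable as a number. The block below is a minimal sketch of the binary16 layout (1 sign bit, 5 exponent bits with bias 15, 10 fraction bits), not the library's code; getFloat16Sketch is a hypothetical name that only mirrors the signature above.

```typescript
// Hypothetical sketch; fp16's real implementation may differ.
function getFloat16Sketch(
  view: DataView,
  offset: number,
  littleEndian?: boolean,
): number {
  const bits = view.getUint16(offset, littleEndian)
  const sign = bits & 0x8000 ? -1 : 1
  const exponent = (bits >> 10) & 0x1f // 5 exponent bits, bias 15
  const fraction = bits & 0x03ff // 10 fraction bits

  // All-ones exponent: Infinity when the fraction is zero, NaN otherwise.
  if (exponent === 0x1f) return fraction === 0 ? sign * Infinity : NaN
  // Zero exponent: signed zero or a subnormal, in steps of 2^-24.
  if (exponent === 0) return sign * fraction * 2 ** -24
  // Normal value with an implicit leading 1.
  return sign * (1 + fraction / 1024) * 2 ** (exponent - 15)
}
```

Note that any NaN bit pattern, regardless of its sign or fraction bits, decodes to the single JavaScript NaN value in this sketch.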
Precision
In addition to methods for getting and setting float16s, fp16 exports two methods for testing how a given number value will convert to 32-bit and 16-bit values.
export const Precision = {
  Exact: 0,
  Inexact: 1,
  Underflow: 2,
  Overflow: 3,
} as const

export type Precision = typeof Precision[keyof typeof Precision]

declare function getFloat32Precision(value: number): Precision
declare function getFloat16Precision(value: number): Precision
Precision.Exact: conversion will not lose precision. The value is guaranteed to round-trip back to the same number value. Positive and negative zero, positive and negative infinity, and NaN all return Precision.Exact. Values that can be represented losslessly as a subnormal value in the target format also return Precision.Exact.
Precision.Overflow: the exponent of the given value is greater than the maximum exponent of the target size (127 for float32 or 15 for float16). Conversion is guaranteed to overflow to +/- Infinity.
Precision.Underflow: the exponent of the given value is less than the minimum exponent minus the number of fractional bits of the target size (-126 - 23 for float32 or -14 - 10 for float16). Conversion is guaranteed to underflow to +/- 0 or to the smallest signed subnormal value (+/- 2^-24 for float16 or +/- 2^-149 for float32).
Precision.Inexact: the exponent is within the target range, but precision bits will be lost during rounding. The value may round to +/- 0 but will never round to +/- Infinity.
Note that the boundaries for overflow and underflow are not necessarily what you might expect. This is because values with exponents just under the minimum exponent for a format map to subnormal values.
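The float32 case can be approximated with built-ins, which makes the boundaries easier to see. The block below is a sketch that follows the definitions above, not the library's implementation: classifyFloat32Sketch is a hypothetical name, the Precision object is copied locally so the block stands alone, and Math.fround is used as the float32 round-trip test.

```typescript
// Local copy of the constants described above, so the sketch is self-contained.
const Precision = { Exact: 0, Inexact: 1, Underflow: 2, Overflow: 3 } as const
type Precision = (typeof Precision)[keyof typeof Precision]

// Hypothetical sketch; fp16's getFloat32Precision may be implemented differently.
function classifyFloat32Sketch(value: number): Precision {
  // Math.fround rounds to float32 and back. Object.is treats NaN as equal
  // to NaN and keeps +0 and -0 distinct.
  if (Object.is(Math.fround(value), value)) return Precision.Exact

  // Unbiased binary exponent, read from the float64 bit pattern.
  const scratch = new DataView(new ArrayBuffer(8))
  scratch.setFloat64(0, value)
  const exponent = ((scratch.getUint16(0) >> 4) & 0x7ff) - 1023

  if (exponent > 127) return Precision.Overflow
  if (exponent < -126 - 23) return Precision.Underflow
  return Precision.Inexact
}
```

In this sketch 2^-149 is classified as Exact even though it lies far below the minimum normal exponent of -126, because it is the smallest float32 subnormal. Only values with an exponent below -149 are reported as Underflow.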
Also note that fp16 treats all NaN values as identical, ignoring sign and signalling bits when decoding, and encoding every NaN value as 0x7e00. This means that not all 16-bit values will round-trip through setFloat16 and getFloat16.
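To get a sense of how many bit patterns are affected, the sketch below models a decode-then-encode round trip purely at the bit level. It is an illustration under the assumption that every non-NaN pattern survives unchanged; roundTripBitsSketch is a hypothetical helper, not part of fp16.

```typescript
// Hypothetical helper: the bit pattern a decode-then-encode round trip would
// yield, assuming all non-NaN values are preserved.
function roundTripBitsSketch(bits: number): number {
  // NaN patterns have an all-ones exponent and a non-zero fraction.
  const isNaNPattern = (bits & 0x7c00) === 0x7c00 && (bits & 0x03ff) !== 0
  return isNaNPattern ? 0x7e00 : bits
}

// Count the 16-bit patterns that come back as a different pattern.
let changed = 0
for (let bits = 0; bits <= 0xffff; bits++) {
  if (roundTripBitsSketch(bits) !== bits) changed++
}
console.log(changed) // 2045: the 2046 NaN patterns minus the canonical 0x7e00
```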
Testing
Tests use AVA and live in the test directory.
npm run test
Tests cover decoding all 65536 possible 16-bit values, rounding behaviour, subnormal values, underflows, and overflows. More tests are always welcome.
Credits
This PDF was extremely helpful as a reference for understanding the float16 format, even though fp16 doesn't use the table-based approach it outlines.
The Golang github.com/x448/float16 package was used as a reference for implementing rounding. The test suite in tests/32to16.js was adapted from its test file float16_test.go.
Contributing
I don't expect to add any additional features to this library, or change any of the exported interfaces. If you encounter a bug or would like to add more tests, please open an issue to discuss it!
License
MIT © 2021 Joel Gustafson