Type Coercions and Floating-Point Types

Consider a language that allows automatic, implicit type conversions (coercions) among numeric types only if every value of the source type can be represented as a value of the target type, i.e., there is no truncation and no round-off. Let us call this kind of coercion value-preserving coercion. With such a strict coercion rule, an unsigned integer with four bytes can be coerced into a signed integer with eight bytes, but the compiler (or interpreter) will not coerce signed integers to unsigned integers. In this post, we highlight a reason why coercions from single-precision to double-precision floating-point types may be undesirable although they are value preserving.

An IEEE 754 single-precision float has 24 bits of precision, whereas an IEEE 754 double-precision float has 53 bits of precision, so from a programmer's perspective, value-preserving coercion is simple:

  • coercion from floats to integers is always forbidden because of fractional values,
  • signed and unsigned integers with 8 or 16 bits can be coerced to single-precision floats,
  • signed and unsigned integers with 8, 16, or 32 bits can be coerced to double-precision floats, and
  • single-precision floats can be coerced to double-precision floats.

These rules are simple and intuitive.
[The number of bits in the precision refers to normalized floats. A floating-point variable representing integer values is always normalized.]
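
The integer rules above can be checked by brute force. The following Python 3 sketch is only an illustration: the helper names to_float32, fits_in_single, and fits_in_double are made up for this post, and rounding to single precision is emulated with the standard struct module because plain Python has no single-precision type.

```python
import struct

def to_float32(value):
    """Round a Python float (double precision) to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", value))[0]

def fits_in_single(n):
    """True if the integer n is exactly representable in single precision."""
    return to_float32(float(n)) == n

def fits_in_double(n):
    """True if the integer n is exactly representable in double precision."""
    return float(n) == n

# Every signed and unsigned 16-bit integer survives the round trip.
all_16_bit_fit = all(fits_in_single(n) for n in range(-2**15, 2**16))
print(all_16_bit_fit)

# 24 bits of precision: 2**24 fits, 2**24 + 1 does not.
print(fits_in_single(2**24), fits_in_single(2**24 + 1))

# 53 bits of precision: the largest unsigned 32-bit integer fits, 2**53 + 1 does not.
print(fits_in_double(2**32 - 1), fits_in_double(2**53 + 1))
```

The first line of output should be True, and the other two lines should each read True False, which is why 32-bit integers may be coerced to double-precision floats but not to single-precision floats.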

We will now highlight a scenario where coercion from single to double precision is undesirable. Let x and y be positive real numbers of similar magnitude such that x+y is an accurate solution to a given problem. Naturally, if both x and y are accurate to seven decimal digits (about 24 binary digits), then x+y will be accurate to about seven digits, too; if both x and y are accurate to 16 decimal digits (about 53 binary digits), then x+y will be accurate to about 16 digits, too. Now, if x is accurate to seven decimal digits and y is accurate to 16 decimal digits, then x+y will be accurate to only about seven digits because the error in x dominates. The observation in this simple example can be extended to more complex problems involving vector-valued quantities and vector norms. Thus, from the point of view of a numerical analyst, it makes sense to coerce double-precision to single-precision floats when combining these two types.
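
To make the accuracy argument concrete, here is a small Python 3 sketch with x = y = 1/3, where x is rounded to single precision and y is kept in double precision. The choice of 1/3 and the helper names to_float32 and relative_error are assumptions made for illustration; the exact reference value is computed with the standard fractions module.

```python
import struct
from fractions import Fraction

def to_float32(value):
    """Round a Python float (double precision) to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", value))[0]

exact = Fraction(2, 3)       # the true value of x + y
y = 1.0 / 3.0                # accurate to about 16 decimal digits
x = to_float32(1.0 / 3.0)    # accurate to about 7 decimal digits

def relative_error(approx):
    """Relative error of a float with respect to the exact sum 2/3."""
    return float(abs(Fraction(approx) - exact) / exact)

mixed_error = relative_error(x + y)    # single-precision x plus double-precision y
double_error = relative_error(y + y)   # both operands in double precision

print(mixed_error)
print(double_error)
```

The mixed sum is carried out in double precision, yet its relative error is on the order of 1e-8, whereas the pure double-precision sum has a relative error below 1e-15. Storing the mixed result as a double-precision float therefore suggests far more accuracy than it has.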

Both lines of thought made their way into programming languages. For example, the following Python 2 code using NumPy prints <type 'numpy.float64'>, meaning the single-precision float was coerced to a double-precision float:

import numpy
x = numpy.float32(1)
y = numpy.float64(1)
print type(x*y)

Contrast this behavior with the following Matlab snippet:

x = single(1);
y = double(1);
disp(class(x*y));

It prints single, implying a coercion from double to single precision.

Let us call a calculation that uses both single-precision and double-precision floats a mixed-precision calculation. Computations with double-precision floats are decidedly more expensive than computations with single-precision floats, so we can assume that mixed-precision calculations by a numerical analyst must be intentional. Consequently, I argue that coercion from single-precision to double-precision floats is undesirable for a numerical analyst because it may hide unintentional mixed-precision calculations. Yet coercions from double to single precision are undesirable as well because they are not value preserving. Hence, explicit type conversions between single- and double-precision floating-point types seem to be the only safe option reconciling the justified but contradictory points of view of programmers and numerical analysts.
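
What explicit-only conversion could look like is sketched below in Python 3. The classes Single and Double and the methods to_double and to_single are hypothetical names invented for this post, not part of any library; single precision is again emulated with the standard struct module.

```python
import struct

class Single:
    """Hypothetical single-precision value that refuses implicit mixing."""
    def __init__(self, value):
        # Round the input to the nearest single-precision value.
        self.value = struct.unpack("<f", struct.pack("<f", float(value)))[0]

    def __mul__(self, other):
        if not isinstance(other, Single):
            return NotImplemented  # no coercion: Python raises TypeError
        # The product is rounded back to single precision by the constructor.
        return Single(self.value * other.value)

    def to_double(self):
        """Explicit, value-preserving conversion."""
        return Double(self.value)

class Double:
    """Hypothetical double-precision value that refuses implicit mixing."""
    def __init__(self, value):
        self.value = float(value)

    def __mul__(self, other):
        if not isinstance(other, Double):
            return NotImplemented  # no coercion: Python raises TypeError
        return Double(self.value * other.value)

    def to_single(self):
        """Explicit conversion that may round."""
        return Single(self.value)

x = Single(2)
y = Double(3)
try:
    x * y                           # mixed precision is rejected
except TypeError:
    print("mixed-precision product rejected")
print((x.to_double() * y).value)    # the intent is spelled out explicitly
```

With these assumed types, a mixed-precision product never happens by accident: the programmer has to write to_double or to_single, which documents the intent in the code.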

As an example of an unintentional mixed-precision computation, consider the following piece of code, where x is a single-precision float:

0.5 * x

The value 0.5 can be represented exactly by all floating-point types with radix 2, so the programmer's intent may be a multiplication of two single-precision floats. Nevertheless, many programming languages such as Python, C, C#, and Java treat 0.5 as a double-precision constant (in C, C#, and Java, 0.5f would be the corresponding single-precision constant), and in conjunction with coercion, 0.5 * x will be a double-precision float. Matlab avoids the problem by interpreting 0.5 as a double-precision value and coercing it to single precision if needed. Haskell avoids the problem, too:

Prelude> let x = 1 :: Float
Prelude> let y = 1 :: Double
Prelude> :t 0.5
0.5 :: Fractional a => a
Prelude> :t 0.5 * x
0.5 * x :: Float
Prelude> :t 0.5 * y
0.5 * y :: Double

Here, we executed statements in the Haskell interpreter GHCi. The command :t prints the type of the expression that follows it. As we can see in line 4, the constant 0.5 is neither a Double nor a Float; it has the polymorphic type Fractional a => a and takes on the required type in the expressions 0.5 * x and 0.5 * y.