Christoph Conrads' Blog – Scientific Computing

Project Post-Mortem: Bash was the Wrong Choice (Mon, 12 Mar 2018)

In one of my recent projects, the goal was to extend the functionality of a set of UNIX shell scripts and to reduce their code duplication. In this blog post, I discuss the errors that I recognized only after completing the project and the opportunities I had to spot them earlier.

The shell scripts were used to manage the build configuration and dependencies of several pieces of software. They also contained logic to work around idiosyncrasies of the build environment, e.g., the OpenBLAS development package libopenblas-dev in Ubuntu Trusty ships without header files (compare the Trusty file list to the Xenial file list). The scripts were expected to work on any Linux distribution and on OS X. Note that there are many UNIX shells but some of their features have been standardized in POSIX.

The existing build scripts consisted of about 2000 lines of mostly POSIX-compliant shell code. Initially, I wondered whether I should continue to use shell scripts and restrict myself to POSIX shell features, continue to use shell scripts but use the more powerful Bash features, or choose a different programming language altogether, e.g., Python. The scripts have to call many different programs, and in this regard writing POSIX-conforming shell scripts would be a natural approach offering the best portability of the three options. A Bash script would be an equally natural approach, and all Linux distributions known to me ship with Bash, though it is not always the default shell. For example, current Debian-based distributions use the much faster Debian Almquist Shell (Dash) and the Debian live system grml uses the Z shell (zsh) by default. The major drawback of shell scripts was the lack of data structures and language features. Specifically, I was looking for

    • arrays or lists,
    • associative arrays, and
    • scoped variables.

When I began the project, I knew that I was not familiar with all Bash-specific features. Therefore, I decided to take a closer look at Bash because it was more portable than, say, Python: if Bash offered all of the features above, then I would use it for the project. It turns out Bash has always supported arrays, associative arrays have been built in since release 4.0 (see the maintainer's release notes), and scoped variables are supported with the local built-in. Hence, I decided to use Bash. At this point I had the first opportunity to decide against Bash: I knew that OS X ships with old releases of UNIX command-line tools. As of February 2018, the latest OS X release still ships with Bash 3.2, and I was saved only by the package manager Homebrew.

The first issue I ran into was passing associative arrays to functions and returning them. Returning values from functions is not possible in Bash because the return value is an integer used for the exit status. Usually I prefer a functional programming style where I pass read-only arguments to a function and use the return value for its outputs, but this is clearly not possible in Bash with associative arrays. Instead of reviewing my choice of Bash, I partially switched to an imperative programming style where a function modifies associative arrays given as arguments.

Passing associative arrays to functions is not directly possible in Bash (see this discussion) and instead you can

  • pass the list of keys and the list of values as two arguments and reconstruct the associative array with it or
  • make the associative array global, pass the variable name to the function, and "dereference" it with a nameref (declare -n, available since Bash 4.3).

Since I only needed two associative arrays, I decided to make both of them global. This was in direct violation of my own requirements and another opportunity to review the choice of Bash scripts.

Finally, error handling proved to be annoying. Error handling is challenging, especially if one wants to recover from an error, but in many cases it is sufficient to print an informative error message and terminate the program. In practice, this is not possible with Bash: a function can end script execution at any time by calling the exit built-in, but functions may be executed in a subshell, and calling exit in a subshell does not terminate the calling shell. This problem was also noticed by the Bash developers, and by calling set -e one can ask the interpreter to abort execution as soon as an error occurs. Unfortunately, this is purely heuristic because the exit status of programs is used during the evaluation of conditionals: executing set -e; false terminates the current shell with exit status 1, whereas set -e; if false; then echo 'unreachable'; fi; echo 'true' prints "true" because a failing command in a conditional does not trigger set -e. Similarly, in pipelines the exit statuses of all programs except the last are ignored, and statements of the form local var="$(program)" swallow the exit status, too, because local has an exit status of its own. I was aware that set -e works purely heuristically, and this should have been a warning as well.
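These pitfalls are easy to reproduce. The snippet below is a minimal demonstration (the function names are mine, not from the project) of how set -e behaves in conditionals and how local swallows the exit status of a command substitution; it assumes Bash is available as bash.

```shell
#!/bin/bash
# Demonstration of the set -e pitfalls described in the text above.

# 1. A failing command used as an if condition does not trigger set -e;
#    the shell continues and the final echo is reached.
bash -c 'set -e; if false; then echo inside; fi; echo conditional-survived'

# 2. A plain failing command terminates the shell immediately.
bash -c 'set -e; false; echo unreachable' || echo "exited with status $?"

# 3. local swallows the exit status of a command substitution because
#    local itself succeeds; separating declaration and assignment keeps it.
swallow() { local var="$(false)"; echo "status after combined local: $?"; }
keep()    { local var; var="$(false)"; echo "status after separate assignment: $?"; }
swallow
keep
```

Running this prints "conditional-survived" and "exited with status 1" for the first two cases, and the two functions report status 0 and 1, respectively, for the same failing command substitution.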

In conclusion, I ignored the following warning signs that Bash may be the wrong language for the project:

  • I did not check if Bash actually fulfilled all project requirements. In particular I did not check if Bash 4.0 is available on all target platforms.
  • I assumed that data structures can be passed to and returned from functions.
  • When I found out that data structures cannot be returned from functions or easily passed as arguments, I did not reconsider my choice of programming language.
  • I introduced global variables contrary to my personal preferences.
  • I assumed that I can stop script execution at any time.
Hylas and the Nymphs (Thu, 08 Feb 2018)

Between January 26 and February 3, 2018, the Manchester Art Gallery (MAG) removed John William Waterhouse's painting Hylas and the Nymphs. In a press release, the gallery states the reason for its action:

This gallery presents the female body as either a ‘passive decorative form’ or a ‘femme fatale’. Let’s challenge this Victorian fantasy!

The removal was filmed and is part of an exhibition by Sonia Boyce.

The painting shows the moment before Heracles' lover Hylas is abducted by the nymphs. Neither the painting Hylas and the Water Nymphs by Henrietta Rae, which shows the very same scene, nor The Sirens and Ulysses by William Etty was removed by the MAG.

You can find all paintings mentioned in this blog post below.

William Etty: "The Sirens and Ulysses" (1837); John William Waterhouse: "Hylas and the Nymphs" (1896); Henrietta Rae: "Hylas and the Water Nymphs" (1910)
Wake-Up Call (Tue, 24 Jan 2017)

Partyzone Berlin: So gesichert wie nie (which translates roughly to Party Zone Berlin: Secured Like Never Before) is the title of a newspaper article about the heightened security measures for the New Year's Eve party in Berlin. Expecting up to one million visitors, there were to be 900 policemen (150 more than the year before), a ban on backpacks as well as big bags, and a fenced-in exclusion zone around the event area. One year later, the expected 200,000 attendees of the New Year's Eve party on December 31, 2016, were to be protected by 1,000 policemen armed with submachine guns, and the access roads to the exclusion zone were to be blocked by armored vehicles, trucks, and concrete blocks (source with photos).

To appreciate the degree of desperation, you need to know that in the last two years, only two of formerly 17 shooting ranges (five of formerly 73 shooting lanes) were available to Berlin's police force, and only for some of that time. On average, police officers had two minutes of shooting practice per year, and about 700 policemen were unable to complete their obligatory once-per-year training (source, source, source, source).

After the September 11 attacks, I started to see groups of policemen patrolling airports, one of them armed with a submachine gun. Next, policemen across Germany started wearing ballistic vests. On New Year's Eve, I was in the Berlin city center, less than one kilometre from the German parliament, and most policemen carried submachine guns; there were even three policemen with submachine guns guarding the club where I celebrated New Year's Eve. Although the Christmas season is over now, you can see policemen with submachine guns in Berlin's government district every day. Berlin's police plans to upgrade its existing ballistic vests and to acquire 6,300 new ballistic vests, 12,000 new pistols, as well as new submachine guns, for 8.8 million euros (there are 17,000 policemen in Berlin).

When I visited New York in 2009, I was dumbfounded seeing two policemen patrolling the streets wearing body armor and carrying machine guns. In 2016, heavily armed police are a common sight in Germany. Something is amiss in Germany, and militarizing the police will only suppress the symptoms but not cure the disease.

Why I Chose the Mozilla Public License 2.0 (Thu, 08 Dec 2016)

Selecting a software license requires a conscious effort because of the various rights and obligations that come with it. For me, the most important aspects of a license were

  • using an existing open source license,
  • commercial use clauses,
  • patent protection,
  • the obligation to disclose changes,
  • the ability to revoke the license as the sole copyright holder, and
  • compatibility with BSD-licensed code.

In this blog post I elaborate on why I chose the Mozilla Public License 2.0 for my eigensolver DCGeig based on these features.

Why Choose an Existing License

A copyright holder can release his code under his very own custom license. In my eyes, this approach has many drawbacks:

  • you need to find a lawyer to devise the license,
  • getting licenses right is hard,
  • there is no foundation enforcing the license, and
  • it abets license proliferation.

The Free Software Foundation ensures compliance for the GNU licenses and, similarly, the Mozilla Foundation backs the Mozilla Public License. Note that all licenses of these two organizations have evolved over the years, and while some of these changes were motivated by new technologies (case in point: the Affero GPL and Software as a Service), I drew the conclusion that getting licenses right is hard. License proliferation makes it difficult to determine license compatibility and to understand what users and developers of the software are agreeing to. As a consequence, I decided that my code should be released under one of the better-known licenses: the GNU licenses (GPLv2, GPLv3, LGPLv3, AGPLv3), the Mozilla Public License 2.0, the Apache License 2.0, or the BSD and MIT licenses.

Commercial Use

It would be great if I could benefit from commercial use of my software, but none of the better-known licenses forbids commercial use, and it seems to me that you must use dual licensing in this case. Since using an established license is more important to me than commercial-use considerations, I decided to ignore this point.

Patent Protection

A patent holder may contribute code to an open source software project, but that does not mean the patent holder granted patent rights to the users of the software. The AGPLv3, GPLv3, LGPLv3, MPL 2.0, and Apache License 2.0 all contain patent-use clauses that either grant patent rights or (in layman's terms) forbid enforcing them. I wanted a license with patent protection, which ruled out the BSD licenses and the MIT license. The GPLv2 provides an implicit patent grant that was deemed not extensive enough for the GPLv3 (see Why Upgrade to GPLv3), but I cannot say how the GPLv2 patent protection compares to the other licenses.

Obligation to Disclose Changes

There is no license that forces changes to be disclosed under all circumstances. The GPL requires the disclosure of changes only if the software is distributed; hence Google can run a customized Linux kernel (see page 44 in the linked document) without ever disclosing these changes, for example. The AGPLv3 requires the disclosure of changes if the compiled code is distributed or if it is used to provide services over a network, so if a company uses programs only internally, then none of the popular licenses requires it to disclose changes. This raises the question of whether there exists useful software that is neither distributed nor provided as a service; if not, then there would be no "disclosure gap". My solver DCGeig is well-suited for certain problems arising, e.g., in the design of automobiles, aircraft, spacecraft, or submarines, and thus it is exactly the kind of software a vehicle manufacturer does not want to make accessible outside of the company. So while there may be a disclosure gap, I wonder how one could prove both the existence of a piece of software used only internally in a company and the fact that the company is violating the software license. In conclusion, this point does not narrow down the set of licenses to choose from.

Revoke License as Copyright Holder

So far, we have found that we cannot earn money from licensing our software without a custom license and that we cannot enforce the disclosure of all changes. Nevertheless, one might be interested in revoking the license for certain individuals or organizations, but such a license cannot be an open source license according to the definition of the Open Source Initiative (see Articles 5 and 6). Moreover, the GPLv3 contains an explicit irrevocability clause in Article 2. Still, can the copyright holder revoke the license or not? I found many contradictory answers (no, yes, no, in 35 years in the US, only in Australia). Assuming licenses cannot be revoked, the only possible approach is to put future releases under a different license. Again, we cannot narrow the set of licenses to choose from.

Compatibility with BSD-licensed Code

My eigensolver DCGeig relies heavily on the Python interpreter CPython, the numerical mathematics libraries BLAS, LAPACK, NumPy, and SciPy including the linear system solver SuperLU, the graph partitioning software Metis, and soon maybe the sparse symmetric matrix library SPRAL. These software packages are released under the following licenses (in order): Python Software Foundation License (BSD-like), BSD, BSD, BSD, BSD, BSD, Apache 2.0 License, BSD. If I were to release DCGeig under a GNU license, then it would be impossible to make my solver available in, say, SciPy; this project comes to mind here because it had to remove its UMFPACK bindings after the copyright holder changed the license to the GPL (see the corresponding GitHub issue #3178). I benefit from these projects and I want to keep the option of allowing them to use my work. This choice leaves me unable to use a GNU license.

Why I Licensed My Code Under the MPL 2.0

At this point, I could use the Apache License 2.0 or the Mozilla Public License 2.0. I wanted to ensure disclosure of changes to my code and thus I decided to put future DCGeig versions under the MPL 2.0. As a bonus, the MPL 2.0 is free of political messages, the license makes concessions if licensees cannot fully comply with it (compare this to the GPLv3, Article 12), the code cannot simply be taken and put under a different license (see MPL 2.0 FAQ, Q14.2), the boilerplate license notice that needs to be put at the top of every file is short, and in my opinion the license is quite readable. The article The Mozilla Public License Version 2.0: A Good Middle Ground? highlights the core ideas of this license well.

The Discretized Laplace Operator on Hyperrectangles with Zero Dirichlet Boundary Conditions (Thu, 24 Nov 2016)

In this blog post, I present the stiffness and mass matrices as well as the eigenvalues and eigenvectors of the Laplace operator (Laplacian) on the domains (0, \ell), (0, \ell_1) \times (0, \ell_2), and so on (hyperrectangles) with zero Dirichlet boundary conditions, discretized with the finite difference method (FDM) and the finite element method (FEM) on equidistant grids. For the FDM discretization, we use the central differences scheme with the standard five-point stencil in 2D. For the FEM, the ansatz functions are the hat functions. The matrices, standard eigenvalue problems A v = \sigma v, and generalized eigenvalue problems K w = \tau M w arising from the discretization lend themselves to use as test problems in numerical linear algebra because they are well-conditioned, not diagonal, and the matrix dimension can be increased arbitrarily.

Python code generating the matrices and their eigenpairs can be found in my git repository discrete-laplacian.


Let d \in \mathbb{N} \setminus \{0\} and let u: \mathbb{R}^d \rightarrow \mathbb{R}. The differential operator \Delta is called the Laplacian; it is the sum of the second partial derivatives of a function:

 \Delta u = \sum_{i=1}^d \frac{\partial^2 u}{\partial x_i^2}.

It is a well-researched differential operator in mathematics and for many domains, the exact eigenvalues \lambda and eigenfunctions u are known such that

 -\Delta u = \lambda u.

In this blog post, I discuss the solutions of -\Delta u = \lambda u on hyperrectangles \Omega with the Dirichlet boundary condition

 u = 0 \text{ on } \partial \Omega.

The combination of Laplace operator and the finite difference method (FDM) can be very well used in an introductory course on the numerical treatment of partial differential equations (PDEs) for the illustration of concepts such as discretization of PDEs, discretization error, the growth of the matrix condition number with finer grids, and sparse solvers.

In numerical linear algebra, the Laplace operator is appealing because the FDM discretization of the operator on a one-dimensional domain yields a standard eigenvalue problem A v = \sigma v with a sparse, real symmetric positive-definite, tridiagonal Toeplitz matrix A and known eigenpairs. For domains in higher dimensions, the matrices can be constructed with the aid of the Kronecker sum and consequently, the eigenpairs can be calculated in this case, too. With this knowledge, the Laplace operator makes for a good, easy test problem in numerical mathematics because we can distinguish between discretization errors and algebraic solution errors. Naturally, the Laplace operator can also be discretized with the finite element method (FEM) yielding a generalized eigenvalue problem Kw = \tau Mw with sparse, real symmetric positive-definite matrices. Here, the eigenpairs are known, too, and furthermore, the eigenvectors are exact in the grid points (so-called superconvergence) providing us with another well-conditioned test problem.

I will present the solutions for the continuous case before I briefly introduce Kronecker products and Kronecker sums since these linear algebra operations are used to construct the matrices corresponding to higher dimensional domains. Finally, I discuss domain discretization and notation before giving closed expressions for the matrices and their eigenpairs created by FDM and FEM. In the end, there is an example demonstrating the use of the quantities \ell_d, n_d, and h_d.

Continuous Case

In 1D, the exact eigenpairs (\lambda_i, u_i(x)) of the Laplace operator on the domain (0, \ell) are

\left(\frac{i^2}{\ell^2}\pi^2, \sin\frac{ix\pi}{\ell}\right), \, i = 1,2,\dotso.

In 2D, the exact eigenpairs (\lambda_{ij}, u_{ij}(x)) of the Laplace operator on the domain (0, \ell_1) \times (0, \ell_2) are

 \left( \left(\frac{i^2}{\ell_1^2} + \frac{j^2}{\ell_2^2}\right)\pi^2, \sin \frac{\pi i x_1}{\ell_1} \sin \frac{\pi j x_2}{\ell_2} \right), \, i, j = 1, 2, \dotso.

On a d-dimensional hyperrectangle, the eigenpairs are

 \left( \pi^2 \sum_{k=1}^d \frac{i_k^2}{\ell_k^2}, \prod_{k=1}^d \sin \frac{\pi i_k x_k}{\ell_k} \right), \, i_k=1,2,\dotso.

These solutions can be found, e.g., in Courant and Hilbert, Methods of Mathematical Physics, Vol. I, Chapter VI, §4.1.

Kronecker Products

The matrices of the discretized Laplacian on higher dimensional domains can be constructed with Kronecker products. Given two matrices A = [a_{ij}] \in \mathbb{R}^{m,n}, B \in \mathbb{R}^{k,\ell}, the Kronecker product C = A \otimes B is defined as follows:

 C = \begin{pmatrix} a_{1,1} B & \cdots & a_{1,n} B \\ \vdots & \ddots & \vdots \\ a_{m,1} B & \cdots & a_{m,n} B \end{pmatrix}.

If A and B are square, then C is square as well; let (\lambda_i, v_i) be the eigenpairs of A and let (\mu_j, w_j) be the eigenpairs of B. Then the eigenpairs of C are

 (\lambda_i \mu_j, v_i \otimes w_j).

If A and B are real symmetric, then C is also real symmetric. For square A and B, the Kronecker sum of A and B is defined as

 D = A \oplus B = I_k \otimes A + B \otimes I_m,

where I_n is the n \times n identity matrix. The eigenpairs of D are

 (\lambda_i + \mu_j, w_j \otimes v_i).

The Kronecker sum occurs during the construction of the 2D FDM matrix. See, e.g., Matrix Analysis for Scientists and Engineers by Alan J. Laub, Chapter 13, for more information on these operations.
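These properties are easy to verify numerically. The following sketch (the matrix sizes and the random seed are my own arbitrary choices) checks the Kronecker-sum eigenvalue property with NumPy:

```python
import numpy as np

# Numerical check of the Kronecker sum property: the eigenvalues of
# D = A (+) B are all sums lambda_i + mu_j of the eigenvalues of A and B.
rng = np.random.default_rng(0)
m, k = 3, 4
A = rng.standard_normal((m, m)); A = A + A.T   # real symmetric
B = rng.standard_normal((k, k)); B = B + B.T   # real symmetric

D = np.kron(np.eye(k), A) + np.kron(B, np.eye(m))  # D = A (+) B

lam = np.linalg.eigvalsh(A)
mu = np.linalg.eigvalsh(B)
expected = np.sort(np.add.outer(mu, lam).ravel())  # all sums lambda_i + mu_j

assert np.allclose(np.sort(np.linalg.eigvalsh(D)), expected)
```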

Domain Discretization

For the 1D case along the d-th axis, we use n_d+2 points uniformly distributed over [0, \ell_d], so that the step size is h_d = \frac{\ell_d}{n_d+1}. We use lexicographic ordering of points; e.g., in 2D let u: \mathbb{R}^2 \rightarrow \mathbb{R} be an eigenfunction of the continuous problem and let v \in \mathbb{R}^{n_1 n_2} be an eigenvector of the algebraic eigenvalue problem. Then v_1 is an approximation to u(x_{1,1}), v_2 approximates u(x_{2,1}), v_{n_1+1} approximates u(x_{1,2}), and so on.

To obtain accurate approximations to the solutions of the continuous eigenvalue problem, the distance between adjacent grid points should always be constant. Thus, if the length of the 1D domain is doubled, the number of grid points should be doubled, too.


In the following sections, we need to deal with matrices arising from discretizations of the Laplace operator on 1D domains with different step sizes and their eigenpairs as well as discretizations of the Laplace operator on higher dimensional domains and their eigenpairs. I denotes the identity matrix; its dimension can be gathered from the context.

Matrices A denote the Laplacian discretized with the FDM: matrices A^{(d)} \in \mathbb{R}^{n_d,n_d} denote the discrete Laplacian on 1D domains along the d-th axis, and matrices A_d denote the discrete Laplacian on a d-dimensional domain. The eigenpairs of A^{(d)} are signified by (\sigma_i^{(d)}, v_i^{(d)}), i = 1, 2, \dotsc, n_d, while (\sigma_{i_1 i_2 \dotsb i_d}, v_{i_1 i_2 \dotsb i_d}) are the eigenpairs of A_d.

Similarly, K and M are the stiffness and mass matrix, respectively, of the Laplace operator discretized with the finite element method. K^{(d)}, M^{(d)} \in \mathbb{R}^{n_d,n_d} denote the discrete Laplacian on a one-dimensional domain along the d-th axis, and K_d, M_d are the stiffness and mass matrix for the discrete Laplacian on a d-dimensional hyperrectangle. We speak of the solutions of Kw = \tau M w or of eigenpairs of the matrix pencil (K, M). The eigenpairs of (K^{(d)}, M^{(d)}) are denoted by (\tau_i^{(d)}, w_i^{(d)}), whereas (\tau_{i_1 i_2 \dotsb i_d}, w_{i_1 i_2 \dotsb i_d}) signify eigenpairs of (K_d, M_d).

Finite Difference Method

In this section, we will construct the matrices of the discretized Laplace operator on a d+1-dimensional domain with the aid of the matrices for the d-dimensional and the one-dimensional case.

Let e_i be the i-th unit vector. Let us consider the discretization along the d-th axis first. Let

 (D^+ u)(x) = \frac{u(x + h_d e_d) - u(x)}{h_d}

be the forward difference quotient, let

 (D^- u)(x) = \frac{u(x) - u(x - h_d e_d)}{h_d},

be the backward difference quotient. Then

 \frac{\partial^2 u}{\partial x_d^2}(x) \approx (D^- D^+ u)(x) = \frac{u(x + h_d e_d) - 2 u(x) + u(x - h_d e_d)}{h_d^2}.

Consequently, the discrete Laplacian has the stencil

 A^{(d)} = \frac{1}{h_d^2} \, \begin{bmatrix} -1 & 2 & -1 \end{bmatrix}

meaning A^{(d)} is a real symmetric tridiagonal n_d \times n_d matrix with the value 2 on the diagonal and -1 on the first sub- and superdiagonal. The eigenvalues of A^{(d)} are

 \sigma_i^{(d)} = \frac{2}{h_d^2} \left(1 - \cos \frac{\pi i}{n_d+1} \right), \, i=1,2,\dotsc,n_d,

the entries of the eigenvector v_i^{(d)} are

 (v_i^{(d)})_k = \sin \frac{\pi k i}{n_d+1}, \, k=1,2,\dotsc,n_d .
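As a sanity check, the closed-form eigenvalues above can be compared against a numerically computed spectrum. The sketch below (with arbitrarily chosen \ell and n) does this for the 1D FDM matrix:

```python
import numpy as np

# 1D FDM Laplacian: tridiag(-1, 2, -1) / h^2 with n interior points on (0, l)
l, n = 1.0, 50
h = l / (n + 1)
A = (2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h**2

# Closed-form eigenvalues sigma_i = 2/h^2 * (1 - cos(pi*i/(n+1)))
i = np.arange(1, n + 1)
sigma = 2.0 / h**2 * (1.0 - np.cos(np.pi * i / (n + 1)))

assert np.allclose(np.sort(np.linalg.eigvalsh(A)), np.sort(sigma))

# For small i, sigma_i approximates the continuous eigenvalue (i*pi/l)^2.
assert abs(sigma[0] - (np.pi / l) ** 2) < 0.01 * (np.pi / l) ** 2
```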

In 2D with lexicographic ordering of the variables, we have

 A_2 = \frac{1}{h_1^2} \begin{bmatrix} 0&0&0 \\ -1&2&-1 \\ 0&0&0 \end{bmatrix} + \frac{1}{h_2^2} \begin{bmatrix} 0&-1&0 \\ 0&2&0 \\ 0&-1&0 \end{bmatrix}

and this matrix can be constructed as follows:

 A_2 = I \otimes A^{(1)} + A^{(2)} \otimes I.

The eigenpairs can be derived directly from the properties of the Kronecker sum: the eigenvalues are

 \sigma_{ij} = \sigma_j^{(2)} + \sigma_i^{(1)},

and the eigenvectors are

 v_{ij} = v_j^{(2)} \otimes v_i^{(1)},

where i = 1,2, \dotsc, n_1, j=1, 2, \dotsc, n_2.

In higher dimensions, it holds that

 A_{d+1} = I \otimes A_d + A^{(d+1)} \otimes I,


 \sigma_{i_1 i_2 \dotsb i_{d+1}} = \sigma_{i_{d+1}}^{(d+1)} + \sigma_{i_1 i_2 \dotsb i_d}

are the eigenvalues and

 v_{i_1 i_2 \dotsb i_{d+1}} = v_{i_{d+1}}^{(d+1)} \otimes v_{i_1 i_2 \dotsb i_d}

are the eigenvectors, i_{d+1} = 1, 2, \dotsc, n_{d+1}.
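The recursion can be checked numerically as well. This sketch (grid parameters chosen arbitrarily) builds A_2 from two 1D matrices and compares its spectrum against all sums \sigma_i^{(1)} + \sigma_j^{(2)}:

```python
import numpy as np

def fdm_1d(l, n):
    """1D FDM Laplacian with n interior points on (0, l)."""
    h = l / (n + 1)
    return (2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h**2

n1, n2 = 4, 3
A1 = fdm_1d(1.0, n1)
B1 = fdm_1d(2.0, n2)

# A_2 = I (x) A^(1) + A^(2) (x) I with lexicographic ordering of grid points
A2 = np.kron(np.eye(n2), A1) + np.kron(B1, np.eye(n1))

s1 = np.linalg.eigvalsh(A1)
s2 = np.linalg.eigvalsh(B1)
assert np.allclose(np.sort(np.linalg.eigvalsh(A2)),
                   np.sort(np.add.outer(s2, s1).ravel()))
```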

Finite Element Method

Similarly to the finite difference method, we can construct the matrices of the discretized equation -\Delta u = \lambda u recursively. We use hat functions as ansatz functions throughout this section.

The 1D stencil is

 K^{(d)} = \frac{1}{h_d} \begin{bmatrix} -1 & 2 & -1 \end{bmatrix}

for the stiffness matrix and

 M^{(d)} = \frac{h_d}{6} \begin{bmatrix} 1 & 4 & 1 \end{bmatrix}

for the mass matrix. The generalized eigenvalues are

 \tau_i^{(d)} = \frac{6}{h_d^2} \frac{1 - \cos \frac{i \pi}{n_d+1}}{2 + \cos \frac{i \pi}{n_d+1}}, \, i = 1, 2, \dotsc, n_d,

and the k-th entry of the eigenvector w_i^{(d)} corresponding to \tau_i^{(d)} is

 (w_i^{(d)})_k = \sin \frac{k i \pi}{n_d+1}, \, k = 1, 2, \dotsc, n_d.

Observe that w_i^{(d)} is not only an eigenvector of the matrix pencil (K^{(d)}, M^{(d)}) but also of the matrices K^{(d)} and M^{(d)} themselves. Thus, the eigenvalues of K^{(d)} are

 \lambda_i(K^{(d)}) = \frac{1}{h_d} \, \left(2 - 2 \cos \frac{i \pi}{n_d+1} \right), \, i=1,2,\dotsc,n_d,

whereas the eigenvalues of M^{(d)} are

 \lambda_i(M^{(d)}) = \frac{h_d}{6} \, \left(4 + 2 \cos \frac{i \pi}{n_d+1} \right), \, i=1,2,\dotsc,n_d.
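The 1D FEM formulas can be checked numerically, too. The sketch below (arbitrary \ell and n; NumPy only, computing the generalized eigenvalues as the eigenvalues of M^{-1}K) verifies the closed-form \tau_i:

```python
import numpy as np

l, n = 1.0, 40
h = l / (n + 1)

# 1D FEM stiffness and mass matrices with hat functions
K = (2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h
M = h / 6.0 * (4 * np.eye(n) + np.eye(n, k=1) + np.eye(n, k=-1))

# Closed-form generalized eigenvalues of the pencil (K, M)
i = np.arange(1, n + 1)
theta = np.pi * i / (n + 1)
tau = 6.0 / h**2 * (1 - np.cos(theta)) / (2 + np.cos(theta))

# Generalized eigenvalues computed numerically as eigenvalues of M^{-1} K
tau_num = np.sort(np.linalg.eigvals(np.linalg.solve(M, K)).real)
assert np.allclose(tau_num, np.sort(tau))
```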

In 2D, the mass matrix stencil is

 M_2 = \frac{h_1 h_2}{36} \begin{pmatrix} 1&4&1 \\ 4&16&4 \\ 1&4&1 \end{pmatrix}.

Judging by the coefficient \frac{1}{36} h_1 h_2 and the diagonal blocks, it must hold that

 M_2 = M^{(2)} \otimes M^{(1)}.

The stiffness matrix stencil is

 K_2 = \frac{h_2}{h_1} \begin{bmatrix} -1&2&-1 \\ -4&8&-4 \\ -1&2&-1 \end{bmatrix} + \frac{h_1}{h_2} \begin{bmatrix} -1&-4&-1 \\ 2&8&2 \\ -1&-4&-1 \end{bmatrix}.

Seeing the factors h_1 and h_2, the stiffness matrix cannot be the Kronecker product or the Kronecker sum of 1D stiffness matrices. However, observe that the 1D mass matrices M^{(1)} and M^{(2)} have coefficients h_1 and h_2, respectively, and indeed, the 2D stiffness matrix can be constructed with the aid of the 1D mass matrices:

 K_2 = M^{(2)} \otimes K^{(1)} + K^{(2)} \otimes M^{(1)}.

Based on the properties of the Kronecker product and using the fact that the 1D mass and stiffness matrices have the same set of eigenvectors, (K_2, M_2) has the eigenvalues

 \tau_{ij} = \tau_j^{(2)} + \tau_i^{(1)}

and eigenvectors

 w_{ij} = w_j^{(2)} \otimes w_i^{(1)},

where i = 1, 2, \dotsc, n_1, j = 1, 2, \dotsc, n_2.

For the Laplacian on hyperrectangles in d+1-dimensional space, the stiffness matrix can be constructed by

 K_{d+1} = M^{(d+1)} \otimes K_d + K^{(d+1)} \otimes M_d

the mass matrix can be constructed using

 M_{d+1} = M^{(d+1)} \otimes M_d,

such that (K_{d+1}, M_{d+1}) has eigenvalues

 \tau_{i_1 i_2 \dotsb i_{d+1}} = \tau_{i_{d+1}}^{(d+1)} + \tau_{i_1 i_2 \dotsb i_d},

and eigenvectors

 w_{i_1 i_2 \dotsb i_{d+1}} = w_{i_{d+1}}^{(d+1)} \otimes w_{i_1 i_2 \dotsb i_d},

where i_{d+1} = 1, 2, \dotsc, n_{d+1}.
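The 2D FEM construction can be verified the same way. This sketch (sizes chosen arbitrarily) builds K_2 and M_2 from 1D matrices and checks that the generalized eigenvalues are all sums \tau_i^{(1)} + \tau_j^{(2)}:

```python
import numpy as np

def fem_1d(l, n):
    """1D FEM stiffness and mass matrices with hat functions."""
    h = l / (n + 1)
    K = (2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h
    M = h / 6.0 * (4 * np.eye(n) + np.eye(n, k=1) + np.eye(n, k=-1))
    return K, M

K1, M1 = fem_1d(1.0, 4)
K2_1d, M2_1d = fem_1d(2.0, 3)

# K_2 = M^(2) (x) K^(1) + K^(2) (x) M^(1) and M_2 = M^(2) (x) M^(1)
K = np.kron(M2_1d, K1) + np.kron(K2_1d, M1)
M = np.kron(M2_1d, M1)

def gen_eigs(K, M):
    """Generalized eigenvalues of (K, M) via the eigenvalues of M^{-1} K."""
    return np.sort(np.linalg.eigvals(np.linalg.solve(M, K)).real)

t1 = gen_eigs(K1, M1)
t2 = gen_eigs(K2_1d, M2_1d)
assert np.allclose(gen_eigs(K, M), np.sort(np.add.outer(t2, t1).ravel()))
```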


Now I will present the matrices of the discrete Laplacian on the domain (0, 6) \times (0, 5). The figure below shows the domain and the grid used for discretization: there are five interior grid points along the first axis, three interior points along the second axis, and the interior grid points are highlighted as black circles. Thus, \ell_1 = 6, \ell_2 = 5, n_1 = 5, n_2 = 3, the step size h_1 = 1, and h_2 = 1.25. In 2D, an eigenvector v of the algebraic eigenvalue problem will possess entries v_i, i = 1, 2, \dotsc, 15, such that v_1 \approx u(x_{1,1}), v_2 \approx u(x_{2,1}), \dotsc, v_5 \approx u(x_{5,1}), v_6 \approx u(x_{1,2}), \dotsc, v_{15} \approx u(x_{5,3}) because of the lexicographic ordering.

An equispaced grid on a 2D domain with five interior grid points along the first dimension and three interior grid points along the second dimension

In the first dimension, the Laplacian discretized with the FDM has the following matrix:

 A^{(1)} = \begin{pmatrix} 2&-1 \\ -1&2&-1 \\ &-1&2&-1 \\ &&-1&2&-1 \\ &&&-1&2 \end{pmatrix}.

Along the second axis, we have

 A^{(2)} = \frac{1}{1.5625} \cdot \begin{pmatrix} 2&-1 \\ -1&2&-1 \\ &-1&2 \end{pmatrix}.

With the FEM, the stiffness matrix is

 K^{(1)} = \begin{pmatrix} 2&-1 \\ -1&2&-1 \\ &-1&2&-1 \\ &&-1&2&-1 \\ &&&-1&2 \end{pmatrix}

and the mass matrix is

 M^{(1)} = \frac{1}{6} \cdot \begin{pmatrix} 4&1 \\ 1&4&1 \\ &1&4&1 \\ &&1&4&1 \\ &&&1&4 \end{pmatrix}

along the first axis. The matrices corresponding to the second dimension are

 K^{(2)} = 0.8 \cdot \begin{pmatrix} 2&-1 \\ -1&2&-1 \\ &-1&2 \end{pmatrix}


 M^{(2)} = \frac{1.25}{6} \cdot \begin{pmatrix} 4&1 \\ 1&4&1 \\ &1&4 \end{pmatrix}.

CMake: "Errors occurred during the last pass" (Sun, 13 Nov 2016)

You executed

ccmake -G ninja /path/to/source-code

on the command line and after pressing

[c] to configure

ccmake presents you an empty window with the message

Errors occurred during the last pass

in the status bar. The cause of this problem is the misspelled generator name on the ccmake command line (argument -G): the generator is called Ninja, with a capital N as the first letter.

Cover Trees (Tue, 06 Sep 2016)

A cover tree is a tree data structure used for the partitioning of metric spaces to speed up operations like nearest-neighbor, k-nearest-neighbor, or range searches. In this blog post, I introduce cover trees, their uses, and their properties, and I measure the effect of the dimension of the metric space on the run time in an experiment with synthetic data.


Let V be a set. A function d: V \times V \rightarrow \mathbb{R} is called a metric for V if

  • d(x, y) \geq 0,
  • d(x, y) = 0 iff x = y,
  • d(x, y) = d(y, x), and
  • d(x, z) \leq d(x, y) + d(y, z),

where x, y, z \in V. The last inequality is known as the triangle inequality. The pair (V, d) is then called a metric space. Let x, y \in \mathbb{R}^n and let x_i, y_i denote the ith component of x and y, respectively, i = 1, 2, \dotsc, n. Often used metrics are the Euclidean metric

 d(x, y) = \sqrt{\sum_{i=1}^n (x_i - y_i)^2},

the Manhattan metric

 d(x, y) = \sum_{i=1}^n \lvert x_i - y_i \rvert,

and the Chebychev metric

 d(x, y) = \max_{i=1,2,\dotsc,n} \lvert x_i - y_i \rvert.

Consider a set of points S \subset V. Given another point p \in V (note that we allow p \in S), we might be interested in the point q \in S closest to p, p \neq q, i.e.,

 \operatorname{arg\,min}_{q \in S, q \neq p} d(p, q).

This problem is known as the nearest-neighbor problem. k-nearest neighbors is a related problem where we look for the k points q_1, q_2, \dotsc, q_k \in S closest to p, q_i \neq p:

 \min_{q_i \in S} \sum_{i=1}^k d(p, q_i).

In the all nearest-neighbors problem, we are given sets S and R and the goal is to determine the nearest neighbor q \in S for each point r \in R. If S = R, then we have a monochromatic all nearest-neighbors problem, otherwise the problem is called bichromatic. Finally, there is also the range problem where we are given scalars 0 < r_1 < r_2 and where we seek the points q_1', q_2', \dotsc, q_{\ell}' \in S such that r_1 \leq d(p, q_i') \leq r_2 holds for all points q_i'.

The nearest-neighbor problem and its variants occur, e.g., during the computation of minimum spanning trees with vertices in a vector space or in n-body simulations, and they are elementary operations in machine learning (nearest centroid classification, k-means clustering, kernel density estimation, ...) as well as in spatial databases. Obviously, computing the nearest neighbor using a sequential search requires linear time. Hence, many space-partitioning data structures were devised that reduce the average complexity to \mathcal{O}(\log n), though the worst-case bound is often \mathcal{O}(n), e.g., for k-d trees or octrees.
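As a baseline, the sequential search mentioned above can be sketched in a few lines of Python; all function names here are mine, not taken from any library:

```python
import math

# The three metrics from the text, for points given as tuples of floats.
def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def chebychev(x, y):
    return max(abs(a - b) for a, b in zip(x, y))

def nearest_neighbor(S, p, d):
    """arg min over q in S, q != p, of d(p, q) by linear scan: O(|S|) time."""
    return min((q for q in S if q != p), key=lambda q: d(p, q))

S = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0), (-2.0, 5.0)]
print(nearest_neighbor(S, (0.0, 0.0), euclidean))  # (1.0, 1.0)
```

Space-partitioning data structures such as cover trees exist precisely to avoid this linear scan.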

Cover Trees

Cover trees are fast in practice and have great theoretical significance because nearest-neighbor queries have guaranteed logarithmic complexity and they allow the solution of the monochromatic all-nearest-neighbors problem in linear time instead of \mathcal{O}(n \log n) (see the paper Linear-time Algorithms for Pairwise Statistical Problems). Furthermore, cover trees require only a metric for proper operation and they are oblivious to the representation of the points. This allows one, e.g., to freely mix Cartesian and polar coordinates, or to use implicitly defined points.

A cover tree T on a data set S with metric d is a tree data structure with levels. Every node in the tree references a point p \in S and in the following, we will identify both the node and the point with the same variable p. A cover tree T is either

  • empty, or
  • at level i the tree has a single root node p.

If p has children, then

  • the children are non-empty cover trees with root nodes at level i-1,
  • (nesting) there is a child tree with p as root node,
  • (cover) for the root node q in every child tree of p, it holds that d(p, q) \leq 2^i, i.e., p covers q,
  • (separation) for each pair of root nodes q \neq q' in child trees of p, it holds that d(q, q') > 2^{i-1}.

Note that the cover tree definition does not prescribe that every descendant q of p must have distance d(p, q) \leq 2^i. Let p = p_0, p_1, \dotsc, p_n be the chain of ancestors of q, where p_{j+1} is a child of p_j and p_n is the parent of q. Since p_j is a root node at level i - j, the triangle inequality and the cover condition yield

 d(p, q) \leq \sum_{j = 0}^{n-1} d(p_j, p_{j+1}) + d(p_n, q) \leq \sum_{j = 0}^n 2^{i-j}.

Even with an unbounded number of levels, the geometric series gives the inequality

 d(p, q) \leq 2^{i+1}.

What is more, given a prescribed parent node p, notice that the separation condition implies that a child node q must be inserted at the lowest possible level, for otherwise we would violate the separation inequality d(p, q) > 2^{i-1}.

The definition of the cover tree uses base 2 for the cover radius and the minimum separation, but this number can be chosen arbitrarily. In fact, the implementation by Beygelzimer/Kakade/Langford uses base 1.3 and MLPACK defaults to 2 but allows user-provided values chosen at run-time. In this blog post and in my implementation I use base 2 because it avoids round-off errors during the calculation of the exponent.

Cover trees have nice theoretical properties, as listed below, where n = \lvert S \rvert and c denotes the expansion constant explained in the next section:

  • construction: \mathcal{O}(c^6 n \log n),
  • insertion: \mathcal{O}(c^6 \log n),
  • removal: \mathcal{O}(c^6 \log n),
  • query: \mathcal{O}(c^{12} \log n),
  • batch query: \mathcal{O}(c^{12} n).

The cover tree requires \mathcal{O}(n) space.

The Expansion Constant c

Let B_S(p, r) denote the set of points q \in S that are at most r > 0 away from p:

 B_S(p, r) := \{ q \in S: d(p, q) \leq r \}.

The expansion constant c of a set S is the smallest scalar c \geq 2 such that

 \lvert B_S(p, 2r) \rvert \leq c \lvert B_S(p, r) \rvert

for all p, r. We will demonstrate that the expansion constant can be large and sensitive to changes in S.

Let V = \mathbb{R} and let

 S = \{ 2^i: i = 0, 1, 2, \dotsc, n \}

for some integer n > 1. In this case, c = n + 1 because \lvert B_S(2^n, 2^{n-1} - 0.5) \rvert = 1 while \lvert B_S(2^n, 2^n - 1) \rvert = n + 1, and this is the worst case since no ball can contain more than the \lvert S \rvert = n + 1 points. Moreover, c can be sensitive to changes of S, e.g., consider a set S whose points are evenly distributed on the surface of a unit hypersphere, and let q \neq 0 be a point arbitrarily close to the origin. The expansion constant of the set S \cup \{0\} is \lvert S \rvert + 1 whereas the expansion constant of the set S \cup \{ 0, q \} is \frac{1}{2} (\lvert S \rvert + 1) (this example was taken from the thesis Improving Dual-Tree Algorithms). With these bounds in mind and assuming the worst-case bounds on the cover tree algorithms are tight, we have to concede that these algorithms may require \mathcal{O}(n^6) operations or worse. Even if the points are regularly spaced, the performance bounds may be bad. Consider a set S forming a d-dimensional hypercubic honeycomb, i.e., with d = 2 this is a regular square tiling. In this case, the expansion constant c is proportional to 2^d. Note that the expansion constant depends on the dimension of the subspace spanned by the points q \in S and not on the dimension of the space containing these points.

Nevertheless, cover trees are used in practice because real-world data sets often have small expansion constants. The expansion constant is related to the doubling dimension of a metric space: given a ball B with unit radius in a metric space V, the doubling constant of V is the smallest number of balls with radius r = 0.5 needed to cover B, and the doubling dimension is its base-2 logarithm.
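For a small finite set, the expansion constant can be computed by brute force. The sketch below (all names are mine) evaluates the ratio at every candidate radius; for the geometric example set S = \{2^i\} it returns \lvert S \rvert, the ratio of the full set to a singleton ball around the largest point.

```python
from itertools import product

def expansion_constant(points, dist):
    """Brute-force expansion constant of a finite metric space (a sketch)."""
    def ball(p, r):
        # number of points q with dist(p, q) <= r; always >= 1 since q = p counts
        return sum(1 for q in points if dist(p, q) <= r)
    pairwise = {dist(p, q) for p, q in product(points, points) if p != q}
    # The ratio |B(p, 2r)| / |B(p, r)| only changes at pairwise distances and
    # their halves; also nudge each candidate down to catch suprema just
    # before the denominator jumps.
    radii = set()
    for d in pairwise:
        radii.update((d, d / 2, d - 1e-9, d / 2 - 1e-9))
    c = 2.0
    for p in points:
        for r in radii:
            if r > 0:
                c = max(c, ball(p, 2 * r) / ball(p, r))
    return c

# The example from the text: S = {2^i : i = 0, ..., n} on the real line.
n = 5
S = [float(2 ** i) for i in range(n + 1)]
print(expansion_constant(S, lambda x, y: abs(x - y)))  # 6.0
```

The maximizing choice is p = 2^n with r = 2^{n-1} - 0.5: the ball B_S(p, r) contains only p itself, while B_S(p, 2r) contains the whole set.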

Implementing Cover Trees

In a cover tree, given a point p on level i, there is a node for p on all levels i, i-1, i-2, \dotso, which raises the question of how we can efficiently represent cover trees on a computer. Furthermore, we need to know if the number of levels in the cover tree can be represented with standard integer types.

Given a point p that occurs on multiple levels in the tree, we can either

  • coalesce all nodes corresponding to p and store the children in an associative array with levels as keys and lists of child nodes as values, or
  • we create one node corresponding to p on each level whenever there are children, storing the level of the node as well as its children.

The memory consumption for the former representation can be calculated as follows: every cover tree node needs to store the corresponding point and the associative array. If the associative array is a binary tree, then for every level i, there is one binary tree node with

  • a list of children of the cover tree node at level i,
  • a pointer to its left child,
  • a pointer to its right child, and
  • a pointer to the parent binary tree node so that we can implement iterators (this is how std::map nodes in the GCC C++ standard library are implemented).

Hence, for every level i in the binary tree, we need to store at least four references and the level. The other representation must store only the level, a reference to the corresponding point, and a reference to the list of children of this cover tree node, so this representation is more economical with respect to memory consumption. There is no difference in the complexity of nearest-neighbor searches because for an efficient nearest-neighbor search, we have to search the cover tree top down, starting at the highest level.
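The two representations can be sketched as small Python data classes; the class and field names are mine and merely illustrate the trade-off described above:

```python
from dataclasses import dataclass, field

@dataclass
class CoalescedNode:
    """One node per point; children grouped by level in an associative array."""
    point: tuple
    children_by_level: dict = field(default_factory=dict)  # level -> [CoalescedNode]

@dataclass
class LeveledNode:
    """One node per (point, level) pair that has children; leaner per level."""
    point: tuple
    level: int
    children: list = field(default_factory=list)           # [LeveledNode]

# A tiny tree: the root at level 3 with its nesting copy and one covered child.
root = LeveledNode(point=(0.0, 0.0), level=3)
root.children.append(LeveledNode(point=(0.0, 0.0), level=2))  # nesting
root.children.append(LeveledNode(point=(5.0, 1.0), level=2))  # cover: d <= 2^3
```

In the coalesced representation the dictionary carries the per-level bookkeeping that a balanced-tree `std::map` would spread over several pointers per level.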

A metric maps its input to non-negative real values. On a computer (in finite precision arithmetic) there are bounds on the range of values that can be represented and we will elaborate on this fact using numbers in the IEEE-754 double-precision floating-point format as an example. An IEEE-754 float is a number \pm m \cdot 2^{(e-b)}, where m is called the mantissa (or significand), e is called the exponent, and b is a fixed number called the bias. The sign of a double-precision float is represented with one bit, the exponent occupies 11 bits, the mantissa 52 bits, and the bias is b = 1023. The limited size of these fields immediately bounds the quantities that can be represented and in fact, the largest finite double-precision value is approximately 2^{1024} and the smallest positive value is 2^{-1074}. Consequently, we will never need more than 1024 + 1074 = 2098 levels in a cover tree when using double-precision floats, irrespective of the number of points stored in the tree. Thus, the levels can be represented with 16-bit integers.
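With base 2, the level belonging to a given distance can be read off exactly from the float's binary exponent instead of computing a logarithm; a minimal sketch in Python (the function name is mine):

```python
import math
import sys

def cover_level(d):
    """Smallest integer i such that d <= 2**i, extracted exactly from the
    binary representation of the positive float d (no log2 round-off)."""
    m, e = math.frexp(d)          # d = m * 2**e with 0.5 <= m < 1
    return e - 1 if m == 0.5 else e

print(cover_level(8.0))                  # 3, since 8 = 2**3
print(cover_level(5.0))                  # 3, since 4 < 5 <= 8
print(cover_level(sys.float_info.max))   # 1024
print(cover_level(5e-324))               # -1074
```

The last two calls confirm the level range [-1074, 1024] derived above for double-precision floats.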

Existing Implementations

The authors of the original cover tree paper Cover Trees for Nearest Neighbors made their C and C++ implementations available on the website. The first author of the paper Faster Cover Trees made the Haskell implementation of a nearest ancestor cover tree used for this paper available on GitHub. The C++ implementation by Manzil Zaheer features k-nearest neighbor search, range search, and parallel construction based on C++ concurrency features (GitHub). The C++ implementation by David Crane can be found in his repository on GitHub. Note that the worst-case complexity of node removal is linear in this implementation because of a conspicuous linear vector search. The best-maintained implementation of a cover tree can probably be found in the MLPACK library (also C++). I implemented a nearest ancestor cover tree in C++14 which takes longer to construct but has superior performance during nearest-neighbor searches. The code can be found in my git repository.

Numerical Experiments

The worst-case complexity bounds of common cover tree operations, e.g., construction and querying, contain terms c^6 or c^{12}, where c is the expansion constant. In this section, I will measure the effect of the expansion constant on the run-time of batch construction and nearest-neighbor search on a synthetic data set.

For the experiment, I implemented a nearest ancestor cover tree described in Faster Cover Trees in C++14 with batch construction, nearest-neighbor search (single-tree algorithm), and without associative arrays. The first point in the data set is chosen as the root of the cover tree and on every level of the cover tree, the construction algorithm attempts to select the points farthest away from the root as children.

The data consists of random points in d-dimensional space with uniformly distributed entries in the interval [-1, +1), i.e., we use random points inside of a hypercube. The reference set (the cover tree) contains n = 10^4 points and we performed nearest-neighbor searches for m = 10^3 random points. The experiments are conducted using the Manhattan, the Euclidean, and the Chebychev metric and the measurements were repeated 25 times for dimensions d = 10, 12, 14, \dotsc, 20.

We do not attempt to measure the expansion constant for every set of points. Instead, we approximate the expansion constant from the dimension d. Let d_p(\cdot, \cdot), p = 1, 2, \infty, be a metric, where

  • d_1(\cdot, \cdot) is the Manhattan metric,
  • d_2(\cdot, \cdot) is the Euclidean metric, and
  • d_{\infty}(\cdot, \cdot) is the Chebychev metric,

and let B_p(d, r) be the ball centered at the origin with radius r:

 B_p(d, r) := \{ x \in \mathbb{R}^d: d_p(0, x) \leq r \}.

The expansion constant c of a set S was defined as the smallest scalar c \geq 2 such that

 \lvert B_S(p, 2r) \rvert \leq c \lvert B_S(p, r) \rvert

for all p \in \mathbb{R}^d, r > 0. We will now simplify both sides of the inequality.

In this experiment, all entries of the points are uniformly distributed around the origin and using the assumption that n and r are sufficiently large, \lvert B_S(p, r) \rvert will be approximately constant everywhere in the hypercube containing the points:

 c \approx \frac{\lvert B_S(p, 2r) \rvert}{\lvert B_S(p, r) \rvert}.

Using the uniform distribution property again, we can set p = 0 without loss of generality. Likewise, since the points are uniformly distributed, \lvert B_S(0, r) \rvert is approximately proportional to the volume of the ball, hence the fraction above will be close to the ratio of the volumes of the balls B_p(d, 2r) and B_p(d, r). B_1(d, r) is called a cross-polytope and its volume can be computed with

 V_1(d, r) = \frac{(2r)^d}{d!}.

The volume of the d-ball in Euclidean space B_2(\cdot, \cdot) is

 V_2(d, r) = \frac{\pi^{d/2}}{\Gamma(d/2+1)} r^d,

where \Gamma is the gamma function. Finally, B_{\infty}(d, r) is a hypercube with volume

 V_{\infty}(d, r) = (2r)^d.

Using our assumptions, it holds that

 c \approx \frac{V_p(d, 2r)}{V_p(d, r)} = 2^d, \, p = 1, 2, \infty.

Consequently, the worst-case bounds are 2^{6d} n \log n for construction and 2^{12d} \log n for nearest-neighbor searches in cover trees with this data set.
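The volume ratio can be confirmed numerically with the three formulas above; a small sketch (the function name is mine):

```python
import math

def volume(p, d, r):
    """Volume of the d-dimensional ball B_p(d, r) for p = 1, 2, or infinity."""
    if p == 1:                    # cross-polytope
        return (2 * r) ** d / math.factorial(d)
    if p == 2:                    # Euclidean d-ball
        return math.pi ** (d / 2) / math.gamma(d / 2 + 1) * r ** d
    return (2 * r) ** d           # p = infinity: hypercube

d, r = 10, 1.0
for p in (1, 2, float('inf')):
    ratio = volume(p, d, 2 * r) / volume(p, d, r)
    # Every volume is homogeneous of degree d in r, so the ratio is 2**d.
    print(p, round(ratio))        # 1024 = 2**10 in each case
```

The ratio is independent of p precisely because each volume formula scales with r^d.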

In the plots, p=1 indicates the Manhattan, p=2 the Euclidean, and p=\infty the Chebychev metric, with the markers corresponding to the shape of the ball B_p(2, 1).

The figures below show mean and standard deviation for the construction and query phase. The construction time of the cover tree is strongly dependent on the metric used: cover tree construction with the Chebychev metric takes considerably more time than with the other norms; construction with the Manhattan metric is slightly faster than with the Euclidean metric. Observe that there is a large variation in the construction time when employing the Euclidean metric and this effect becomes more pronounced the higher the dimension d. Also, the construction time jumps slightly at d=20 for the Manhattan norm, which is notable considering the small standard deviation in the data. In the query time plot, we can see that the results are mostly independent of the metric at hand. What is more, the variance of the query time is without exception small in comparison to the mean. Nevertheless, there is a conspicuous bend at d=16 when using the Manhattan metric. This bend is unlikely to be a random variation because we repeated the measurements 25 times and the variance is small.

Mean and standard deviation of the CPU time needed for the batch construction of a nearest ancestor cover tree with Manhattan, Euclidean, and Chebychev metric
Mean and standard deviation of the CPU time needed for nearest-neighbor searches of a nearest ancestor cover tree with Manhattan, Euclidean, and Chebychev metric

With our measurements we want to determine how the expansion constant influences construction and query time and according to the estimates above, we ought to see an exponential growth in operations. To determine if the data points could have been generated by an exponential function, one can plot the data with a logarithmic scaling along the vertical axis. Then, exponential functions will appear linear and polynomials sub-linear. In the figures below, we added an exponential function for comparison and it seems that the construction time does indeed grow exponentially with the dimension d irrespective of the metric at hand while the query time does not increase exponentially with the dimension.

Plot of the CPU time needed for the batch construction of a nearest ancestor cover tree with Manhattan, Euclidean, and Chebychev metric with logarithmic scaling of the vertical axis
Plot of the CPU time needed for nearest-neighbor searches of a nearest ancestor cover tree with Manhattan, Euclidean, and Chebychev metric with logarithmic scaling of the vertical axis

Seeing the logarithmic plots, I attempted to fit an exponential function C \cdot b^d, C, b > 0, to the construction time data. The fitted exponential did not approximate the data of the Euclidean and Manhattan metric well even when considering the standard deviation. However, the fit for the Chebychev metric was very good. In the face of these results, I decided to fit a monomial C' d^e, C', e > 0, to both construction and query time data and the results can be seen below. Except for the Manhattan metric data, the fit is very good.

Polynomial fit of the CPU time needed for the batch construction of a nearest ancestor cover tree with Manhattan, Euclidean, and Chebychev metric
Polynomial fit of the CPU time needed for nearest-neighbor searches of a nearest ancestor cover tree with Manhattan, Euclidean, and Chebychev metric

The values of the constants for construction are:

  • p=1: C' = 1.05 \cdot 10^{-2}, e = 2.28,
  • p=2: C' = 5.74 \cdot 10^{-5}, e = 4.55,
  • p = \infty: C' = 1.42 \cdot 10^{-5}, e = 5.47.

For nearest-neighbor searches, the constants are

  • p=1: C' = 3.11 \cdot 10^{-2}, e = 2.12,
  • p=2: C' = 1.95 \cdot 10^{-2}, e = 2.34,
  • p = \infty: C' = 3.03 \cdot 10^{-2}, e = 2.16.

In conclusion, for our test data the construction time of a nearest ancestor cover tree is strongly dependent on the metric at hand and the dimension of the underlying space whereas the query time is mostly independent of the metric and roughly a function of the square of the dimension d. The jumps in the data of the Manhattan metric and the increase in variation in the construction time when using the Euclidean metric highlight that there must be non-trivial interactions between dimension, metric, and the cover tree implementation.

Originally, we asked how the expansion constant c impacts the run-time of cover tree operations and determined that we can approximate c by calculating c \approx 2^d. Thus, the run-time t of construction and nearest-neighbor search seems to be polynomial in \log_2 c because d \approx \log_2 c and t \approx C' d^e.


We introduced cover trees and discussed their advantages as well as their unique theoretical properties. We elaborated on the complexity of cover tree operations, the expansion constant, and implementation aspects. Finally, we conducted an experiment on a synthetic data set and found that the construction time is strongly dependent on the metric and the dimension d of the underlying space while the time needed for nearest-neighbor search is almost independent of the metric. Most importantly, the complexity of operations seems to be polynomial in d and hence polynomial in the logarithm of the expansion constant. There are unexplained jumps in the measurements.

SuperLU vs Direct Substructuring Sun, 31 Jul 2016 11:22:19 +0000 Continue reading SuperLU vs Direct Substructuring ]]> The eigenproblem solver in my master's thesis used SuperLU, a direct solver for the solution of systems of linear equations (SLE) Ax = b. For the largest test problems, the eigensolver ran out of memory when decomposing the matrix A, which is why I replaced SuperLU with direct substructuring in an attempt to reduce memory consumption. For this blog post, I measured set-up time, solve time, and memory consumption of SuperLU and direct substructuring with real symmetric positive definite real-world matrices for SLEs with a variable number of right-hand sides. I will highlight that SuperLU was deployed with a suboptimal parameter choice and explain why the memory consumption of the decomposition of A is the wrong objective function when you want to avoid running out of memory.


For my master's thesis, I implemented a solver for large, sparse generalized eigenvalue problems that used SuperLU to solve SLEs A x = b, where A \in \mathbb{C}^{n,n} is large, sparse, and Hermitian positive definite (HPD) or real symmetric positive definite. SuperLU is a direct solver for such SLEs but depending on the structure of the non-zero matrix entries, there may be so much fill-in that the solver runs out of memory when attempting to factorize A and this happened for the largest test problems in my master's thesis. Without carefully measuring the memory consumption, I replaced SuperLU with direct substructuring.

Direct substructuring is a recursive method for solving SLEs Ax = b.  Partition

 A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}, x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}, b = \begin{pmatrix} b_1 \\ b_2 \end{pmatrix}

conformally and let S be the Schur complement of A_{11} in A:

 S := A_{22} - A_{21} A_{11}^{-1} A_{12}.

Then we can compute x as follows: solve A_{11} u = b_1 and calculate v := b_2 - A_{21} u. Then

 x_2 = S^{-1} v

and

 x_1 = u - A_{11}^{-1} A_{12} x_2.

Our aim is to solve SLEs Ax = b and so far, I did not explain how to solve the SLE A_{11} u = b_1. We could solve this SLE directly with a dense or a sparse solver, or we can apply the idea from above again, i.e., we partition A_{11} into a 2 \times 2 block matrix, partition u and b_1 conformally, and recurse. This recursive approach yields a method that is called direct substructuring and it belongs to the class of domain decomposition methods. Usually, substructuring is used in conjunction with Krylov subspace methods; the SLE solver MUMPS is an exception to this rule. See the paper Iterative versus direct parallel substructuring methods in semiconductor device modeling for a comparison of these approaches.
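One level of the recursion can be sketched with dense linear algebra in plain Python; all helper names are mine, and a real implementation would of course use sparse factorizations instead of Gaussian elimination on lists:

```python
def solve(A, B):
    """Solve A X = B (one column of B per right-hand side) by Gaussian
    elimination with partial pivoting; A, B, X are lists of rows."""
    n, m = len(A), len(B[0])
    M = [A[i][:] + B[i][:] for i in range(n)]       # augmented matrix
    for k in range(n):
        piv = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[piv] = M[piv], M[k]
        for i in range(k + 1, n):
            f = M[i][k] / M[k][k]
            for j in range(k, n + m):
                M[i][j] -= f * M[k][j]
    X = [[0.0] * m for _ in range(n)]
    for i in reversed(range(n)):                     # back substitution
        for j in range(m):
            s = M[i][n + j] - sum(M[i][l] * X[l][j] for l in range(i + 1, n))
            X[i][j] = s / M[i][i]
    return X

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def matsub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def substructure_solve(A11, A12, A21, A22, b1, b2):
    u = solve(A11, b1)                   # solve A11 u = b1
    v = matsub(b2, matmul(A21, u))       # v := b2 - A21 u
    W = solve(A11, A12)                  # W := A11^{-1} A12
    S = matsub(A22, matmul(A21, W))      # Schur complement of A11 in A
    x2 = solve(S, v)                     # x2 = S^{-1} v
    x1 = matsub(u, matmul(W, x2))        # x1 = u - W x2
    return x1, x2

# 3 x 3 example partitioned into a 2 x 2 block A11 and a 1 x 1 block A22.
A11 = [[4.0, 1.0], [1.0, 3.0]]
A12 = [[0.0], [1.0]]
A21 = [[0.0, 1.0]]
A22 = [[2.0]]
b1, b2 = [[6.0], [10.0]], [[8.0]]
x1, x2 = substructure_solve(A11, A12, A21, A22, b1, b2)
# x = (x1; x2) solves the full system; here x1 is close to [[1], [2]], x2 to [[3]]
```

Note that the intermediate matrix W appears explicitly here; its memory footprint is exactly the issue discussed later in this post.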

Observe that for direct substructuring, we only need to store the results of the factorizations of the Schur complements, e.g., the factors L and U of the LU decomposition of S, so the memory consumption of the factorization of A is a function of the dimension of A_{22}. Furthermore, we can quickly minimize the dimension of the block A_{22} with the aid of nested dissection orderings, a method based on minimal vertex separators in graph theory. Thus, it is possible to use direct substructuring in practice.

Before presenting the results of the numerical experiments, I elaborate on the suboptimal parameter choice for SuperLU and afterwards, I explain why using direct substructuring was ill-advised.

SuperLU and Hermitian Positive Definite Matrices

SuperLU is a sparse direct solver, meaning it tries to find permutation matrices P and Q as well as triangular matrices L and U with minimal fill-in such that P A Q = LU. When selecting permutation matrices, SuperLU considers fill-in and the modulus of the pivot to ensure numerical stability; P is determined from partial pivoting and Q is chosen independently of P in order to minimize fill-in (see Section 2.5 of the SuperLU User's Guide). SuperLU uses threshold pivoting, meaning it attempts to use the diagonal entry a_{ii} as pivot unless

 \lvert a_{ii} \rvert < c \max_{j} \lvert a_{ji} \rvert,

where c \geq 0 is a small, user-provided constant. Consequently, for sufficiently well-conditioned HPD matrices, SuperLU will use P = Q and compute a Cholesky decomposition such that P^* A P = L L^*, approximately halving the memory consumption in comparison to an LU decomposition. For badly conditioned or indefinite matrices, SuperLU will thus destroy hermiticity as soon as a diagonal element violates the admission criterion for pivoting elements above and compute an LU decomposition.

For HPD matrices small pivots \lvert a_{ii} \rvert do not hurt numerical stability and in my thesis, I used only HPD matrices. For large matrices, SuperLU was often computing LU instead of Cholesky decompositions and while this behavior is completely reasonable, it was also unnecessary and it can be avoided by setting c := 0 thereby forcing SuperLU to compute Cholesky factorizations.
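In SciPy's SuperLU bindings, the parameter corresponding to the constant c above is diag_pivot_thresh. The sketch below (the matrix is a made-up SPD example; behavior assumes a reasonably recent SciPy) forces symmetric pivoting:

```python
import numpy as np
from scipy.sparse import csc_matrix
from scipy.sparse.linalg import splu

# Small made-up symmetric positive definite example.
A = csc_matrix(np.array([[4.0, 1.0, 0.0],
                         [1.0, 3.0, 1.0],
                         [0.0, 1.0, 2.0]]))
b = np.array([6.0, 10.0, 8.0])

# diag_pivot_thresh=0.0 corresponds to c := 0 in the text: the diagonal
# entry is always accepted as pivot, so SuperLU keeps P = Q and the
# factorization stays Cholesky-like for SPD matrices.
lu = splu(A, diag_pivot_thresh=0.0, permc_spec='MMD_AT_PLUS_A')
x = lu.solve(b)          # close to [1, 2, 3]
```

The fill-reducing ordering 'MMD_AT_PLUS_A' works on the symmetrized pattern A + A^T, which suits symmetric matrices.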

Optimizing for the Wrong Objective Function

Given an SLE A x = b and a suitable factorization of A, a direct solver can overwrite b with x such that no temporary storage is needed. Hence as long as there is sufficient memory for storing the factorization of A and the right-hand sides b, we can solve SLEs.

Let us reconsider solving an SLE with direct substructuring. We need to

  • solve A_{11} u = b_1,
  • calculate v := b_2 - A_{21} u,
  • solve S x_2 = v, and
  • compute x_1 = u - A_{11}^{-1} A_{12} x_2.

Let w := A_{11}^{-1} A_{12}. Observe that when computing x_1 in the last step, we need to store u and w simultaneously. Assuming we overwrite the memory location originally occupied by b_1 with u, we still need additional memory to store w. Thus, direct substructuring requires additional memory during a solve because we need to store the factorized Schur complements as well as w. That is, the maximum memory consumption of direct substructuring may be considerably larger than the amount of memory required only for storing the Schur complements. The sparse solver in my thesis was supposed to work with large matrices (ca. 10^5 \times 10^5) so even if the block A_{22} is small in dimension such that A_{12} has only a few columns, storing w requires large amounts of memory, e.g., let m \in \mathbb{N} such that A_{22} \in \mathbb{C}^{m,m}, let m = 200, and let n = 100,000. With double-precision floats (8 bytes), the (n-m) \times m matrix w requires

 8\text{B} \cdot m (n-m) \approx 152\text{MiB}

of storage. For comparison, storing the m \times m matrix S consumes 0.3MiB while the 2,455,670 non-zeros of the test matrix s3dkq4m2 (see the section with the numerical experiments) with dimension 90,449 require 18.7MiB of memory, and the sparse Cholesky factorization of s3dkq4m2 computed by SuperLU occupies 761MiB. Keep in mind that direct substructuring is a recursive method, i.e., we have to store a matrix w at every level of recursion.
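The back-of-the-envelope numbers are easy to reproduce; a short sketch (w is the (n - m) \times m matrix A_{11}^{-1} A_{12}):

```python
m, n = 200, 100_000
bytes_per_double = 8
MiB = 2 ** 20

w_bytes = bytes_per_double * m * (n - m)   # dense w = A11^{-1} A12, (n-m) x m
s_bytes = bytes_per_double * m * m         # dense m x m Schur complement S
print(round(w_bytes / MiB, 1))             # ~152 MiB
print(round(s_bytes / MiB, 1))             # ~0.3 MiB
```

The gap of three orders of magnitude between w and S is the whole point: the factorization is cheap to store, the transient solve workspace is not.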

If we want to avoid running out of memory, then we have to reduce the maximum memory usage. When I decided to replace SuperLU with direct substructuring, I attempted to reduce the memory required for storing the matrix factorization. With SuperLU, the maximum memory consumption and the memory required for storing the matrix factorization are the same, but this is not the case for direct substructuring. In fact, the maximum memory consumption is often orders of magnitude larger than the storage required for the factors of the Schur complements.

Numerical Experiments

I measured the following quantities:

  • fill-in,
  • maximum memory consumption,
  • the time needed to factorize A, and
  • the time needed to solve SLEs with 256, 512, 1024, and 2048 right-hand sides (RHS).

In the plots, LU signifies SuperLU with the default parameters, LL stands for SuperLU with parameters chosen to exploit hermiticity (LL is an allusion to the Cholesky decomposition A = LL^*, see the section on SuperLU and HPD matrices), and DS means direct substructuring. The set of 27 test matrices consists of all real symmetric positive definite Harwell-Boeing BCS structural engineering matrices, as well as

  • gyro_k, gyro_m,
  • vanbody,
  • ct20stif,
  • s3dkq4m2,
  • oilpan,
  • ship_003,
  • bmw7st_1,
  • bmwcra_1, and
  • pwtk.

These matrices can be found at the University of Florida Sparse Matrix Collection. The smallest matrix has dimension 1074, the largest matrix has dimension 217,918, the median is 6458, and the arithmetic mean is 37,566. For the timing tests, I measured wall-clock time and CPU time; both timers have a resolution of 0.1 seconds. The computer used for the tests has an AMD Athlon II X2 270 CPU and 8GB RAM. Direct substructuring was implemented in Python with SciPy 0.17.1 and NumPy 1.10.4 using Intel MKL 11.3. SuperLU ships with SciPy. The right-hand side vectors b are random vectors with entries uniformly distributed in the interval [-1, 1).

With the fill-in plot, it is easy to see that LL requires less memory than LU. For direct substructuring, the fill-in is constant except for a few outliers where the matrix A is diagonal. For the smallest problems, both SuperLU variants create less fill-in but for the larger test matrices, DS is comparable to LL and sometimes better. Unfortunately, we can also gather from the plot that the fill-in seems to be a linear function of the matrix dimension n with SuperLU. If we assume that the number of non-zero entries in a sparse matrix is a linear function of n, too, then the number of non-zero entries in the factorization must be proportional to n^2. When considering maximum memory consumption, DS is significantly worse than LL and LU.

SuperLU vs DS: fill-in for SuperLU with default parameters (LU), SuperLU with forced symmetric pivoting (LL), and direct substructuring (DS)
SuperLU vs DS: maximum memory consumption for SuperLU with default parameters (LU), SuperLU with forced symmetric pivoting (LL), and direct substructuring (DS)

The plots showing the setup time of the solvers are unambiguous: LL needs slightly less time than LU, and both LL and LU require significantly less time than DS (note the logarithmic scale).

SuperLU vs DS: wall-clock time needed for the decomposition of a matrix by SuperLU with default parameters (LU), SuperLU with forced symmetric pivoting (LL), and direct substructuring
SuperLU vs DS: CPU time needed for the decomposition of a matrix by SuperLU with default parameters (LU), SuperLU with forced symmetric pivoting (LL), and direct substructuring

Before I present the measurements for the time needed to solve the SLEs, I have to highlight a limitation of SciPy's SuperLU interface. As soon as one has computed the LU or Cholesky decomposition of a matrix A of an SLE Ax = b, there is no need for additional memory because the RHS b can be overwritten directly with x. This is exploited by software such as LAPACK and SuperLU. Nevertheless, SciPy's SuperLU interface stores the solution x in newly allocated memory and this leaves SuperLU unable to solve seven SLEs (LU) and six SLEs (LL), respectively. For comparison, DS is unable to solve six SLEs.

The time needed to solve an SLE depends on the number of RHS, on the dimension of the matrix A, and on other properties of A such as node degrees. To avoid complex figures, I plotted the solve time normalized by the time needed by LL for solving an SLE with 256 RHS and the same matrix A over the number of RHS; the results can be seen below. Due to the normalization, I removed all SLEs where LL was unable to solve Ax = b with 256 RHS. Moreover, the plot ignores all SLEs where the solver took less than one second to compute the solution in order to reduce jitter caused by the timer resolution (0.1 seconds) and finally, I perturbed the x-coordinate of the data points to improve legibility.

SuperLU vs DS: relative wall-clock time needed for the solution of a system of linear equations with a given number of right-hand sides by SuperLU with default parameters (LU), SuperLU with forced symmetric pivoting (LL), and direct substructuring
SuperLU vs DS: relative CPU time needed for the solution of a system of linear equations with a given number of right-hand sides by SuperLU with default parameters (LU), SuperLU with forced symmetric pivoting (LL), and direct substructuring

For most problems, LL is faster than LU and for both solvers, the run-time increases linearly with the number of RHS, as expected (this fact is more obvious when considering CPU time). For DS, the situation is more interesting: with 256 RHS, DS is most of the time significantly slower than both LL and LU but with 2048 RHS, DS is often significantly faster than both LL and LU. To explain this behavior, let us reconsider the operations performed by DS (A is a 2 \times 2 block matrix):

  • solve A_{11} u = b_1,
  • calculate v := b_2 - A_{21} u,
  • solve S x_2 = v, and
  • compute x_1 = u - A_{11}^{-1} A_{12} x_2.

Let m be the number of RHS (the number of columns of b) and let n_2 be the number of columns of A_{12}. In practice, A_{22} will be considerably smaller than A_{11} in dimension, hence the majority of the computational effort is spent evaluating A_{11}^{-1} b_1 and A_{11}^{-1} A_{12} x_2. If we evaluate the latter term as (A_{11}^{-1} A_{12}) x_2, then DS behaves almost as if there were m + n_2 RHS. For small m, this clearly impacts the run-time, e.g., with m \approx n_2, the number of RHS is effectively doubled.


We introduced direct substructuring, briefly highlighted how SuperLU can be forced to compute Cholesky decompositions of Hermitian positive definite matrices, and showed that fill-in and maximum memory consumption are two different quantities when solving SLEs Ax = b with direct substructuring. We compared maximum memory consumption, setup time, and solve time of SuperLU and direct substructuring for SLEs Ax = b in experiments with real symmetric positive definite real-world matrices and a varying number of right-hand sides. SuperLU with forced symmetric pivoting has the smallest setup time and solves SLEs faster than SuperLU with its default settings. Direct substructuring requires the most storage and it is slow if there are few right-hand sides, but it is by far the fastest solver if there are many right-hand sides in relation to the dimension of the matrix A.

Code and Data

The data and the code that created the plots in this article can be found at Gitlab.

The code generating the measurements can be found in DCGeig commit c1e0, July 16, 2016.


Another Advantage of Garbage Collection Wed, 15 Jun 2016 20:37:12 +0000 Continue reading Another Advantage of Garbage Collection ]]> Most literature about garbage collection contains a list of its advantages, but the lists known to me omit one advantage: certain concurrent algorithms require garbage collection.

I will demonstrate this point with an example in the programming language C, specifically the revision C11. Consider a singly-linked list where the pointer to the head of the list is concurrently read and written by multiple threads:

struct node
{
    size_t value;
    struct node* p_next;
};
typedef struct node node;

_Atomic(node*) p_head = ATOMIC_VAR_INIT(NULL);

The singly-linked list is considered empty if p_head is a null pointer. The thread T1 is only reading the list:

void* T1(void* args)
{
    node* p = atomic_load(&p_head);

    // more computations
    // stop referencing the head of the list

    return NULL;
}

The thread T2 removes the first element of the singly-linked list by trying to overwrite the pointer stored in p_head:

void* T2(void* args)
{
    node* p = NULL;
    node* p_next = NULL;
    node* p_expected = NULL;

    do
    {
        p = atomic_load(&p_head);

        if( !p )
            break; // the list is empty

        p_next = p->p_next;
        p_expected = p;
    } while(!atomic_compare_exchange_strong(&p_head, &p_expected, p_next));
    // ensure other threads stopped referencing p

    return NULL;
}

T2 relies on the compare-and-swap in the loop condition to detect interference from other threads.

After successfully updating p_head, the memory referenced by p needs to be freed after all threads stopped referencing this memory and in general, this requires a garbage collector. Waiting does not help because the threads holding references might have been stopped by the operating system. Scanning the stack, the heap, and the other threads' CPU registers is not possible in many languages or not covered by the programming language standard and besides, such a scan is an integral part of any tracing garbage collector.

In the introduction I wrote that certain concurrent algorithms require garbage collection; more accurately, it should say: in the absence of special guarantees, certain concurrent algorithms require garbage collection. For example, if we can guarantee that threads hold their references to the singly-linked list only for a bounded amount of time, then there is no need for garbage collection; this fact is exploited by the read-copy-update (RCU) mechanism in the Linux kernel.

Master's Thesis: Projection Methods for Generalized Eigenvalue Problems Sat, 16 Apr 2016 22:07:24 +0000 Continue reading Master's Thesis: Projection Methods for Generalized Eigenvalue Problems ]]> My master's thesis deals with dense and sparse solvers for generalized eigenvalue problems (GEPs) with Hermitian positive semidefinite matrices. Key results are

  • structure-preserving backward error bounds computable in linear time,
  • the runtime of GSVD-based dense GEP solvers is within a factor of 5 of the fastest GEP solver with Netlib LAPACK in my tests,
  • computing the GSVD directly is up to 20 times slower than computing it by means of QR factorizations and the CS decomposition with Netlib LAPACK in my tests,
  • given a pair of matrices with 2x2 block structure, I show how to minimize the eigenvalue perturbation caused by the off-diagonal blocks with the aid of graph algorithms, and
  • I propose a new multilevel eigensolver for sparse GEPs that is able to compute up to 1000 eigenpairs for problems with up to 150,000 degrees of freedom in less than eleven hours on a cluster node with two dual-core CPUs and a 16 GB virtual memory limit.

The revised edition of the thesis with fixed typos is here (PDF), the source code is available here, and the abstract is below. In February, I already gave a talk on the preliminary thesis results; more details can be found in the corresponding blog post.


This thesis treats the numerical solution of generalized eigenvalue problems (GEPs) Kx = \lambda Mx, where K, M are Hermitian positive semidefinite (HPSD). We discuss problem and solution properties, accuracy assessment of solutions, aspects of computation in finite precision, the connection to the finite element method (FEM), dense solvers, and projection methods for these GEPs. All results are directly applicable to real-world problems.

We present properties and origins of GEPs with HPSD matrices and briefly mention the FEM as a source of such problems.

With respect to accuracy assessment of solutions, we address quickly computable and structure-preserving backward error bounds and their corresponding condition numbers for GEPs with HPSD matrices. There is an abundance of literature on backward error measures possessing one of these features; the backward error in this thesis provides both.

In Chapter 3, we elaborate on dense solvers for GEPs with HPSD matrices. The standard solver reduces the GEP to a standard eigenvalue problem; it is fast but requires positive definite mass matrices and is only conditionally backward stable. The QZ algorithm for general GEPs is backward stable but it is also much slower and does not preserve any problem properties. We present two new backward stable and structure preserving solvers, one using deflation of infinite eigenvalues, the other one using the generalized singular value decomposition (GSVD). We analyze backward stability and computational complexity. In comparison to the QZ algorithm, both solvers are competitive with the standard solver in our tests. Finally, we propose a new solver combining the speed of deflation with the ability of GSVD-based solvers to handle singular matrix pencils.

Finally, we consider black-box solvers based on projection methods to compute the eigenpairs with the smallest eigenvalues of large, sparse GEPs with Hermitian positive definite (HPD) matrices. After reviewing common methods for spectral approximation, we briefly mention ways to improve numerical stability. We discuss the automated multilevel substructuring method (AMLS) before analyzing the impact of off-diagonal blocks in block matrices on eigenvalues. We use the results of this thesis and insights from recent papers to propose a new divide-and-conquer eigensolver and to suggest a change that makes AMLS more robust. We test the divide-and-conquer eigensolver on sparse structural engineering matrices with 10,000 to 150,000 degrees of freedom.

2010 Mathematics Subject Classification. 65F15, 65F50, 65Y04, 65Y20.

Edit: Revised master's thesis from April 2016 (PDF)