To appreciate the degree of desperation, you need to know that in the last two years, only two out of formerly 17 shooting ranges (five out of formerly 73 shooting lanes) were at times available to Berlin's police force. On average, police officers got two minutes of shooting practice per year and about 700 policemen per year were unable to complete their obligatory annual training (source, source, source, source).

After the September 11 attacks, I started to see groups of policemen patrolling airports, one of them armed with a submachine gun. Next, policemen across Germany started wearing ballistic vests. On New Year's Eve, I was in the Berlin city center, less than one kilometre away from the German parliament, and most policemen carried submachine guns; there were even three policemen with submachine guns guarding the club where I celebrated New Year's Eve. Although the Christmas season is over now, you can see policemen with submachine guns in Berlin's government district every day. Berlin's police plans to upgrade its existing ballistic vests and to acquire 6,300 new ballistic vests, 12,000 new pistols, as well as new submachine guns for 8.8 million euros (there are 17,000 policemen in Berlin).

When I visited New York in 2009, I was dumbfounded seeing two policemen patrolling the streets wearing body armor and carrying machine guns. In 2016, heavily armed policemen are a common sight in Germany. Something is amiss in Germany and militarizing the police will only suppress the symptoms but not cure the disease.

- using an existing open source license,
- commercial use clauses,
- patent protection,
- the obligation to disclose changes,
- the ability to revoke the license as the sole copyright holder, and
- compatibility with BSD-licensed code.

In this blog post I elaborate on why I chose the Mozilla Public License 2.0 for my eigensolver DCGeig based on these features.

A copyright holder can release his code under his very own custom license. In my eyes, this approach has many drawbacks:

- you need to find a lawyer to devise the license,
- getting licenses right is hard,
- there is no foundation enforcing the license, and
- it abets license proliferation.

The Free Software Foundation ensures compliance for the GNU licenses and similarly, the Mozilla Foundation backs the Mozilla Public License. Note that all licenses of these two organizations have evolved over the years and while some of these changes were motivated by new technologies (case in point: the Affero GPL and Software as a Service), I drew the conclusion that getting licenses right is hard. License proliferation makes it difficult to determine license compatibility and to understand what users and developers of the software are agreeing to. As a consequence, I decided that my code should be released under one of the following better known licenses:

- GNU Affero General Public License, version 3 (AGPLv3 for short),
- GNU General Public License, version 2 (GPLv2) or version 3 (GPLv3),
- GNU Lesser General Public License, version 3 (LGPLv3),
- Mozilla Public License 2.0 (MPL 2.0),
- Apache 2.0 License,
- BSD 3-Clause License, or
- MIT License.

It would be great if I could benefit from commercial use of my software, but none of the better known licenses forbids commercial use, so it seems one must resort to dual-licensing in this case. Since using an established license is more important to me than commercial use considerations, I decided to ignore this point.

A patent holder may contribute code to an open source software project but that does not mean the patent holder granted patent rights to the users of the software. The AGPLv3, GPLv3, LGPLv3, MPL 2.0, and Apache License 2.0 all contain patent use clauses that either grant patent rights or (in layman's terms) forbid enforcing these rights. I wanted a license with patent protection meaning I could not use a BSD license or the MIT license. The GPLv2 provides an implicit patent grant that was deemed not extensive enough for the GPLv3 (see *Why Upgrade to GPLv3*) but I cannot say how the GPLv2 patent protection compares to the other licenses.

There is no license that forces changes to be disclosed under all circumstances. The GPL only requires the disclosure of changes if the software is distributed. Hence Google can run a customized Linux kernel (see page 44 in the linked document) without ever disclosing these changes, for example. The AGPLv3 requires the disclosure of changes if the compiled code is distributed or if it is used to provide services over a network, so if a company uses programs only internally, then none of the popular licenses requires the company to disclose changes. This raises the question whether there exists useful software that is neither distributed nor provided as a service; if not, then there would be no "disclosure gap". My solver DCGeig is well-suited for certain problems arising, e.g., in the design of automobiles, aircraft, spacecraft, or submarines and thus, it is exactly the kind of software a vehicle manufacturer does not want to make accessible outside of the company. Now while there may be a disclosure gap, I wonder how one could prove the existence of a piece of software used only internally in a company and the fact that the company is in violation of the software license. In conclusion, this point does not narrow down the set of licenses to choose from.

So far, we found out that we cannot earn money from licensing our software without a custom license and that we cannot enforce the disclosure of all changes. Nevertheless, one might be interested in revoking the license for certain individuals or organizations, but such a license cannot be an open source license according to the definition of the Open Source Initiative (see Articles 5 and 6). Also, the GPLv3 contains an explicit irrevocability clause in Article 2. Nevertheless, can the **copyright holder** revoke the license or not? I found many contradicting answers (no, yes, no, in 35 years in the US, only in Australia). Assuming licenses cannot be revoked, the only possible approach is to put future releases under a different license. Again, we cannot narrow down the set of licenses to choose from.

My eigensolver DCGeig relies heavily on the Python interpreter CPython, the numerical mathematics libraries BLAS, LAPACK, NumPy, and SciPy including the linear system solver SuperLU, the graph partitioning software Metis, and soon maybe the sparse symmetric matrix library SPRAL. These software packages are released under the following licenses (in order): Python Software Foundation License (BSD-like), BSD, BSD, BSD, BSD, BSD, Apache 2.0 License, BSD. If I were to release DCGeig under a GNU license, then it would be impossible to make my solver available in, say, SciPy; this project comes to my mind here because it had to remove its UMFPACK bindings after the copyright holder changed the license to the GPL (see the corresponding GitHub issue #3178). I benefit from these projects and I want to keep the option of allowing these projects to use my work. This choice leaves me unable to use a GNU license.

At this point, I could use the Apache License 2.0 or the Mozilla Public License 2.0. I wanted to ensure disclosure of changes of my code and thus, I decided to put future DCGeig versions under the MPL 2.0. As a bonus, the MPL 2.0 is free of political messages, the license makes concessions if licensees cannot fully comply with it (compare this to the GPLv3, Article 12), the code cannot just be taken and put under a different license (see MPL 2.0 FAQ, Q14.2), the boilerplate license notice that needs to be put at the top of every file is short, and in my opinion the license is quite readable. The article *The Mozilla Public License Version 2.0: A Good Middle Ground?* highlights the core ideas of this license well.

Python code generating the matrices and their eigenpairs can be found in my git repository `discrete-laplacian`.

Let $d \in \mathbb{N}$ and let $\Omega \subset \mathbb{R}^d$ be a domain. The differential operator $\Delta$ is called Laplacian and it is the sum of the second derivatives of a function:

$$\Delta u = \sum_{i=1}^{d} \frac{\partial^2 u}{\partial x_i^2}.$$

It is a well-researched differential operator in mathematics and for many domains, the exact eigenvalues $\lambda$ and eigenfunctions $u$ are known such that

$$-\Delta u = \lambda u.$$

In this blog post, I discuss the solutions of $-\Delta u = \lambda u$ on hyperrectangles with the Dirichlet boundary condition

$$u = 0 \text{ on } \partial\Omega.$$

The combination of the Laplace operator and the finite difference method (FDM) can be put to very good use in an introductory course on the numerical treatment of partial differential equations (PDEs) to illustrate concepts such as the discretization of PDEs, the discretization error, the growth of the matrix condition number with finer grids, and sparse solvers.

In numerical linear algebra, the Laplace operator is appealing because the FDM discretization of the operator on a one-dimensional domain yields a standard eigenvalue problem with a sparse, real symmetric positive-definite, tridiagonal Toeplitz matrix and known eigenpairs. For domains in higher dimensions, the matrices can be constructed with the aid of the Kronecker sum and consequently, the eigenpairs can be calculated in this case, too. With this knowledge, the Laplace operator makes for a good, easy test problem in numerical mathematics because we can distinguish between discretization errors and algebraic solution errors. Naturally, the Laplace operator can also be discretized with the finite element method (FEM) yielding a generalized eigenvalue problem with sparse, real symmetric positive-definite matrices. Here, the eigenpairs are known, too, and furthermore, the eigenvectors are exact in the grid points (so-called *superconvergence*) providing us with another well-conditioned test problem.
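The claims about the 1D FDM discretization can be checked numerically. The following sketch (a minimal example, assuming the domain $(0, 1)$ with $n$ interior grid points; the variable names are my own) builds the tridiagonal matrix and compares its spectrum with the known closed-form eigenvalues:

```python
import numpy as np

# 1D FDM Laplacian on (0, 1) with n interior grid points, h = 1/(n + 1):
# tridiagonal matrix with 2/h^2 on the diagonal, -1/h^2 off the diagonal.
n = 8
h = 1.0 / (n + 1)
T = (2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h**2

# Known eigenvalues of this discretization: (2 - 2 cos(k pi h)) / h^2, k = 1..n.
k = np.arange(1, n + 1)
exact = (2.0 - 2.0 * np.cos(k * np.pi * h)) / h**2

computed = np.linalg.eigvalsh(T)  # ascending order
assert np.allclose(computed, exact)
```

Since the exact eigenvalues of the continuous problem are known as well, the discretization error is directly observable in this setup.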

I will present the solutions for the continuous case before I briefly introduce Kronecker products and Kronecker sums since these linear algebra operations are used to construct the matrices corresponding to higher dimensional domains. Finally, I discuss domain discretization and notation before giving closed-form expressions for the matrices and their eigenpairs created by FDM and FEM. In the end, there is an example demonstrating the use of the quantities , , and .

In 1D, the exact eigenpairs of the Laplace operator on the domain $(0, a)$ are

$$\lambda_k = \left(\frac{k\pi}{a}\right)^2, \quad u_k(x) = \sin\frac{k\pi x}{a}, \quad k \in \mathbb{N}.$$

In 2D, the exact eigenpairs of the Laplace operator on the domain $(0, a_1) \times (0, a_2)$ are

$$\lambda_{k_1 k_2} = \left(\frac{k_1\pi}{a_1}\right)^2 + \left(\frac{k_2\pi}{a_2}\right)^2, \quad u_{k_1 k_2}(x_1, x_2) = \sin\frac{k_1\pi x_1}{a_1} \sin\frac{k_2\pi x_2}{a_2}.$$

On a $d$-dimensional hyperrectangle $(0, a_1) \times \cdots \times (0, a_d)$, the eigenpairs are

$$\lambda_{k_1 \cdots k_d} = \sum_{i=1}^{d} \left(\frac{k_i\pi}{a_i}\right)^2, \quad u_{k_1 \cdots k_d}(x) = \prod_{i=1}^{d} \sin\frac{k_i\pi x_i}{a_i}.$$

These solutions can be found, e.g., in the book

The matrices of the discretized Laplacian on higher dimensional domains can be constructed with Kronecker products. Given two matrices $A \in \mathbb{R}^{m \times n}$, $B \in \mathbb{R}^{p \times q}$, the Kronecker product $A \otimes B$ is defined as follows:

$$A \otimes B = \begin{bmatrix} a_{11} B & \cdots & a_{1n} B \\ \vdots & \ddots & \vdots \\ a_{m1} B & \cdots & a_{mn} B \end{bmatrix}.$$

If $A$ and $B$ are square, then $A \otimes B$ is square as well; let $(\lambda_i, u_i)$ be the eigenpairs of $A$ and let $(\mu_j, v_j)$ be the eigenpairs of $B$. Then the eigenpairs of $A \otimes B$ are

$$(\lambda_i \mu_j, u_i \otimes v_j).$$

If $A$ and $B$ are real symmetric, then $A \otimes B$ is also real symmetric. For square $A$ and $B$, the Kronecker sum $A \oplus B$ of $A$ and $B$ is defined as

$$A \oplus B = A \otimes I + I \otimes B,$$

where $I$ is the identity matrix. The eigenvalues of $A \oplus B$ are

$$\lambda_i + \mu_j$$

with eigenvectors $u_i \otimes v_j$. The Kronecker sum occurs during the construction of the 2D FDM matrix.
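The eigenvalue property of the Kronecker sum is easy to verify numerically; here is a minimal NumPy sketch (random symmetric matrices, my own variable names):

```python
import numpy as np

# Kronecker sum of square matrices A (3 x 3) and B (4 x 4):
# A ⊕ B = A ⊗ I + I ⊗ B.
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3)); A = A + A.T  # real symmetric
B = rng.standard_normal((4, 4)); B = B + B.T

K = np.kron(A, np.eye(4)) + np.kron(np.eye(3), B)

# The eigenvalues of A ⊕ B are all pairwise sums λ_i(A) + μ_j(B).
expected = np.sort(np.add.outer(np.linalg.eigvalsh(A),
                                np.linalg.eigvalsh(B)).ravel())
assert np.allclose(np.sort(np.linalg.eigvalsh(K)), expected)
```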

For the 1D case along the $i$-th axis, we use $n_i$ points uniformly distributed over $(0, a_i)$, such that the step size is $h = a_i / (n_i + 1)$. We use lexicographic ordering of points, e.g., in 2D let $u$ be an eigenfunction of the continuous problem and let $v$ be an eigenvector of the algebraic eigenvalue problem. Then $v_1$ is an approximation to $u(h, h)$, $v_2$ approximates $u(2h, h)$, $v_3$ approximates $u(3h, h)$, and so on.

To obtain accurate approximations to the solutions of the continuous eigenvalue problem, the distance between adjacent grid points should always be constant. Thus, if the length of the 1D domain is doubled, the number of grid points should be doubled, too.

In the following sections, we need to deal with matrices arising from discretizations of the Laplace operator on 1D domains with different step sizes and their eigenpairs as well as discretizations of the Laplace operator on higher dimensional domains and their eigenpairs. denotes the identity matrix; its dimension can be gathered from the context.

Matrices denote the Laplacian discretized with the FDM, matrices denote the discrete Laplacian on 1D domains along the -th axis, and matrices denote the discrete Laplacian on a -dimensional domain. The eigenpairs of are signified by , , are the eigenpairs of .

Similarly, and are the stiffness and mass matrix, respectively, of the Laplace operator discretized with the finite element method. denote the discrete Laplacian on a one-dimensional domain along the -th axis, and , are the stiffness and mass matrix for the discrete Laplacian on a -dimensional hyperrectangle. We speak of the *solutions of* or of *eigenpairs of the matrix pencil* . The eigenpairs of are denoted by , whereas signify eigenpairs of .

In this section, we will construct the matrices of the discretized Laplace operator on a -dimensional domain with the aid of the matrices for the -dimensional and the one-dimensional case.

Let be the -th unit vector. Let us consider the discretization along the -th axis first. Let

be the forward difference quotient, let

be the backward difference quotient. Then

Consequently, the discrete Laplacian has the stencil

$$\frac{1}{h^2} \begin{bmatrix} -1 & 2 & -1 \end{bmatrix},$$

meaning $h^2 T_i$ is a real symmetric tridiagonal matrix with the value 2 on the diagonal and -1 on the first sub- and superdiagonal. The eigenvalues of $T_i$ are

$$\lambda_k = \frac{1}{h^2} \left(2 - 2\cos\frac{k\pi}{n_i + 1}\right), \quad k = 1, \ldots, n_i,$$

the entries of the eigenvector $v_k$ are

$$v_{kj} = \sin\frac{jk\pi}{n_i + 1}, \quad j = 1, \ldots, n_i.$$

In 2D with lexicographic ordering of the variables, we have a block tridiagonal matrix

$$T = T_2 \oplus T_1 = T_2 \otimes I + I \otimes T_1$$

and this matrix can be constructed as follows: the block-diagonal part $I \otimes T_1$ contains the 1D discretization along the first axis while $T_2 \otimes I$ couples adjacent grid rows. The eigenpairs can be derived directly from the properties of the Kronecker sum: the eigenvalues are

$$\lambda_k + \mu_\ell$$

and the eigenvectors are

$$v_\ell \otimes u_k,$$

where $(\lambda_k, u_k)$ are the eigenpairs of $T_1$, $(\mu_\ell, v_\ell)$ are the eigenpairs of $T_2$, $k = 1, \ldots, n_1$, $\ell = 1, \ldots, n_2$.

In higher dimensions, it holds that

$$T^{(d)} = T_d \oplus T^{(d-1)} = T_d \otimes I + I \otimes T^{(d-1)},$$

where

$$\lambda_{k_1 \cdots k_d} = \sum_{i=1}^{d} \lambda_{k_i}$$

are the eigenvalues and

$$v_{k_d} \otimes \cdots \otimes v_{k_1}$$

are the eigenvectors, $k_i = 1, \ldots, n_i$.
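Putting the pieces together, a small sketch (assuming lexicographic ordering and a common step size; the helper and variable names are my own) builds a 2D FDM Laplacian as a Kronecker sum and checks that its eigenvalues are sums of 1D eigenvalues:

```python
import numpy as np

def fdm_laplacian_1d(n, h):
    """1D FDM Laplacian: tridiag(-1, 2, -1) / h^2."""
    return (2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h**2

# 2D discrete Laplacian as a Kronecker sum (first axis varies fastest).
n1, n2, h = 5, 3, 0.125
T1, T2 = fdm_laplacian_1d(n1, h), fdm_laplacian_1d(n2, h)
T = np.kron(T2, np.eye(n1)) + np.kron(np.eye(n2), T1)

# Its eigenvalues are all pairwise sums of the 1D eigenvalues.
expected = np.sort(np.add.outer(np.linalg.eigvalsh(T1),
                                np.linalg.eigvalsh(T2)).ravel())
assert np.allclose(np.sort(np.linalg.eigvalsh(T)), expected)
```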

Similarly to the finite difference method, we can construct the matrices of the discretized equation recursively. We use hat functions as ansatz functions throughout this section.

The 1D stencil is

$$\frac{1}{h} \begin{bmatrix} -1 & 2 & -1 \end{bmatrix}$$

for the stiffness matrix $K_i$ and

$$\frac{h}{6} \begin{bmatrix} 1 & 4 & 1 \end{bmatrix}$$

for the mass matrix $M_i$. The generalized eigenvalues are

$$\lambda_k = \frac{6}{h^2} \cdot \frac{1 - \cos\frac{k\pi}{n_i + 1}}{2 + \cos\frac{k\pi}{n_i + 1}}, \quad k = 1, \ldots, n_i,$$

and the $j$-th entry of the eigenvector $v_k$ corresponding to $\lambda_k$ is

$$v_{kj} = \sin\frac{jk\pi}{n_i + 1}.$$

Observe that $v_k$ is not only an eigenvector of the matrix pencil $(K_i, M_i)$ but also of the matrices $K_i$ and $M_i$ themselves. Thus, the eigenvalues of $K_i$ are

$$\frac{1}{h} \left(2 - 2\cos\frac{k\pi}{n_i + 1}\right),$$

whereas the eigenvalues of $M_i$ are

$$\frac{h}{6} \left(4 + 2\cos\frac{k\pi}{n_i + 1}\right).$$

In 2D, the mass matrix stencil is

$$\frac{h^2}{36} \begin{bmatrix} 1 & 4 & 1 \\ 4 & 16 & 4 \\ 1 & 4 & 1 \end{bmatrix}.$$

Judging by the coefficient and the diagonal blocks, it must hold that

$$M^{(2)} = M_2 \otimes M_1.$$

The stiffness matrix stencil is

$$\frac{1}{3} \begin{bmatrix} -1 & -1 & -1 \\ -1 & 8 & -1 \\ -1 & -1 & -1 \end{bmatrix}.$$

Seeing the factors in this stencil, the stiffness matrix cannot be the Kronecker product or the Kronecker sum of 1D stiffness matrices alone. However, observe that the 1D mass matrices $M_1$ and $M_2$ have the coefficient $h/6$, and indeed, the 2D stiffness matrix can be constructed with the aid of the 1D mass matrices:

$$K^{(2)} = K_2 \otimes M_1 + M_2 \otimes K_1.$$

Based on the properties of the Kronecker product and using the fact that mass and stiffness matrix have the same set of eigenvectors, the pencil $(K^{(2)}, M^{(2)})$ has the eigenvalues

$$\lambda_k^{(1)} + \lambda_\ell^{(2)}$$

and eigenvectors

$$v_\ell^{(2)} \otimes v_k^{(1)},$$

where $k = 1, \ldots, n_1$, $\ell = 1, \ldots, n_2$.
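As a sanity check of the 1D FEM building blocks, the following sketch (assuming the domain $(0, 1)$ and hat functions, so the stencils $\frac{1}{h}[-1\ 2\ -1]$ and $\frac{h}{6}[1\ 4\ 1]$; the closed-form expression in the comment is the standard result for this discretization) compares numerically computed generalized eigenvalues with the known formula:

```python
import numpy as np
from scipy.linalg import eigh

# 1D FEM discretization of -u'' = λ u on (0, 1) with hat functions:
# stiffness K with stencil (1/h)[-1 2 -1], mass M with stencil (h/6)[1 4 1].
n = 16
h = 1.0 / (n + 1)
K = (2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h
M = (4.0 * np.eye(n) + np.eye(n, k=1) + np.eye(n, k=-1)) * h / 6.0

# Generalized eigenvalues of the pencil (K, M); compare with
# λ_k = 6 (1 - cos(k π h)) / (h^2 (2 + cos(k π h))).
k = np.arange(1, n + 1)
exact = 6.0 * (1.0 - np.cos(k * np.pi * h)) \
        / (h**2 * (2.0 + np.cos(k * np.pi * h)))

computed = eigh(K, M, eigvals_only=True)  # ascending order
assert np.allclose(computed, exact)
```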

For the Laplacian on hyperrectangles in $d$-dimensional space, the stiffness matrix can be constructed by

$$K^{(d)} = K_d \otimes M^{(d-1)} + M_d \otimes K^{(d-1)},$$

the mass matrix can be constructed using

$$M^{(d)} = M_d \otimes M^{(d-1)}$$

such that $(K^{(d)}, M^{(d)})$ has eigenvalues

$$\sum_{i=1}^{d} \lambda_{k_i}^{(i)}$$

and eigenvectors

$$v_{k_d}^{(d)} \otimes \cdots \otimes v_{k_1}^{(1)},$$

where $k_i = 1, \ldots, n_i$.

Now I will present the matrices of the discrete Laplacian on the domain . The figure below shows the domain and the grid used for discretization: there are five interior grid points along the first axis, three interior points along the second axis, and the interior grid points are highlighted as black circles. Thus, , , , , the step size , and . In 2D, an eigenvector of the algebraic eigenvalue problem will possess entries , , such that , , , , , , because of the lexicographic ordering.

In the first dimension, the Laplacian discretized with the FDM has the following matrix:

Along the second axis, we have

With the FEM, the stiffness matrix is

and the mass matrix is

along the first axis. The matrices corresponding to the second dimension are

and

]]>

ccmake -G ninja /path/to/source-code

on the command line and after pressing

[c] to configure

ccmake presents you with an empty window with the message

Errors occurred during the last pass

in the status bar. The cause of this problem is the misspelled generator name on the ccmake command line (argument `-G`): the generator is called *Ninja*, with a capital N as the first letter.

Let $X$ be a set. A function $d: X \times X \to \mathbb{R}$ is called a *metric* for $X$ if

- $d(x, y) \geq 0$,
- $d(x, y) = 0$ iff $x = y$,
- $d(x, y) = d(y, x)$, and
- $d(x, z) \leq d(x, y) + d(y, z)$,

where $x, y, z \in X$. The last inequality is known as the triangle inequality. The pair $(X, d)$ is then called a metric space. Let $x, y \in \mathbb{R}^d$ and let $x_i$, $y_i$ denote the $i$-th component of $x$ and $y$, respectively, $i = 1, 2, \ldots, d$. Often used metrics are the Euclidean metric

$$d_2(x, y) = \left( \sum_{i=1}^{d} (x_i - y_i)^2 \right)^{1/2},$$

the Manhattan metric

$$d_1(x, y) = \sum_{i=1}^{d} |x_i - y_i|,$$

and the Chebychev metric

$$d_\infty(x, y) = \max_{i} |x_i - y_i|.$$

Consider a set of points $S$. Given another point $q$ (note that we allow $q \in S$), we might be interested in the point $p^*$ closest to $q$, i.e.,

$$p^* = \operatorname{arg\,min}_{p \in S,\, p \neq q} d(p, q).$$

This problem is known as the *nearest-neighbor problem*.

In the

The nearest-neighbor problem and its variants occur, e.g., during the computation of minimum spanning trees with vertices in a vector space or in -body simulations, and they are elementary operations in machine learning (nearest centroid classification, -means clustering, kernel density estimation, ...) as well as spatial databases. Obviously, computing the nearest neighbor with a sequential search requires linear time. Hence, many space-partitioning data structures have been devised that reduce the average complexity, though the worst-case bound is often still linear, e.g., for k-d trees or octrees.
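For reference, here is the linear-time sequential search that the space-partitioning data structures compete with (a minimal sketch; the function and variable names are my own):

```python
import math

def nearest_neighbor(points, q, metric):
    """Sequential O(|points|) nearest-neighbor search; the baseline that
    structures such as k-d trees, octrees, and cover trees try to beat."""
    return min(points, key=lambda p: metric(p, q))

# The three metrics discussed above.
d1 = lambda p, q: sum(abs(a - b) for a, b in zip(p, q))             # Manhattan
d2 = lambda p, q: math.sqrt(sum((a - b)**2 for a, b in zip(p, q)))  # Euclidean
dinf = lambda p, q: max(abs(a - b) for a, b in zip(p, q))           # Chebychev

S = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0)]
assert nearest_neighbor(S, (0.9, 0.1), d2) == (1.0, 0.0)
```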

Cover trees are fast in practice and have great theoretical significance because nearest-neighbor queries have guaranteed logarithmic complexity and they allow the solution of the monochromatic all-nearest-neighbors problem in linear time instead of (see the paper Linear-time Algorithms for Pairwise Statistical Problems). Furthermore, cover trees require only a metric for proper operation and they are oblivious to the representation of the points. This allows one, e.g., to freely mix cartesian and polar coordinates, or to use implicitly defined points.

A cover tree on a data set $S$ with metric $d$ is a tree data structure with levels. Every node in the tree references a point $p \in S$ and in the following, we will identify both the node and the point with the same variable $p$. A cover tree is either

- empty, or
- at level $\ell$ the tree has a single root node $p$.

If $p$ has children, then

- the children are non-empty cover trees with root nodes at level $\ell - 1$,
- (nesting) there is a child tree with $p$ as root node,
- (cover) for the root node $q$ in every child tree of $p$, it holds that $d(p, q) \leq 2^\ell$, i.e., $p$ *covers* $q$,
- (separation) for each pair of root nodes $q_1 \neq q_2$ in child trees of $p$, it holds that $d(q_1, q_2) > 2^{\ell-1}$.

Note that the cover tree definition does not prescribe that every descendant of $p$ must have distance at most $2^\ell$ to $p$. Let $p = p_\ell, p_{\ell-1}, \ldots, p_m$ be the parent nodes of a descendant $q$ at level $m - 1$. Then the triangle inequality yields

$$d(p, q) \leq \sum_{i=m}^{\ell} 2^{i} = 2^{\ell+1} - 2^{m}.$$

With an infinite number of levels, we get the inequality

$$d(p, q) < 2^{\ell+1}.$$

What is more, given a prescribed parent node, notice that the separation condition implies that child nodes must be inserted at the lowest possible level, for otherwise we violate the separation inequality.

The definition of the cover trees uses the base 2 for the definition of the cover radius and the minimum separation, but this number can be chosen arbitrarily. In fact, the implementation by Beygelzimer/Kakade/Langford uses the base 1.3 and MLPACK defaults to 2 but allows user-provided values chosen at run-time. In this blog post and in my implementation I use the base 2 because it avoids round-off errors during the calculation of the exponent.

Cover trees have nice theoretical properties as you can see below, where , denotes the expansion constant explained in the next section:

- construction: ,
- insertion: ,
- removal: ,
- query: ,
- batch query: .

The cover tree requires space.

Let $B(p, r)$ denote the set of points that are less than $r$ away from $p$:

$$B(p, r) = \{ q \in S : d(p, q) < r \}.$$

The expansion constant $c$ of a set $S$ is the smallest scalar such that

$$|B(p, 2r)| \leq c \, |B(p, r)|$$

for all $p \in S$, $r > 0$. We will demonstrate that the expansion constant can be large and sensitive to changes in $S$.

Let and let

for some integer . In this case, because and and this is obviously the worst case. Moreover, can be sensitive to changes of , e.g., consider a set whose points are evenly distributed on the surface of a unit hypersphere, and let be a point arbitrarily close to the origin. The expansion constant of the set is whereas the expansion constant of the set is (this example was taken from the thesis Improving Dual-Tree Algorithms). With these bounds in mind and assuming the worst-case bounds on the cover tree algorithms are tight, we have to concede that these algorithms may require operations or worse. Even if the points are regularly spaced, the performance bounds may be bad. Consider a set forming a -dimensional hypercubic honeycomb, i.e., with this is a regular square tiling. In this case, the expansion constant is proportional to . Note that the expansion constant depends on the dimension of the subspace spanned by the points of and not on the dimension of the space containing these points.
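To make the examples above concrete, here is a brute-force estimator of the expansion constant (my own helper, not a standard API; it uses closed balls and takes candidate radii from the pairwise distances and their halves, so it is a lower bound on the true constant):

```python
def expansion_constant(points, dist):
    """Estimate the expansion constant c: the smallest c with
    |B(p, 2r)| <= c * |B(p, r)| for all p in S and all r > 0."""
    radii = {dist(p, q) for p in points for q in points if p != q}
    radii |= {r / 2.0 for r in radii}
    c = 1.0
    for p in points:
        for r in radii:
            inner = sum(1 for q in points if dist(p, q) <= r)
            outer = sum(1 for q in points if dist(p, q) <= 2.0 * r)
            c = max(c, outer / inner)
    return c

d = lambda x, y: abs(x - y)
uniform = [float(i) for i in range(8)]    # evenly spaced points on a line
geometric = [2.0**i for i in range(8)]    # exponentially spaced points

# The exponentially spaced set has a larger expansion constant.
assert expansion_constant(uniform, d) < expansion_constant(geometric, d)
```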

Nevertheless, cover trees are used in practice because real-world data sets often have small expansion constants. The expansion constant is related to the doubling dimension of a metric space (given a ball with unit radius in a -dimensional metric space , the doubling dimension of is the smallest number of balls with radius needed to cover ).

In a cover tree, given a point on level , there is a node for on all levels, which raises the question of how we can efficiently represent cover trees on a computer. Furthermore, we need to know if the number of levels in the cover tree can be represented with standard integer types.

Given a point that occurs on multiple levels in the tree, we can either

- coalesce all nodes corresponding to the point and store the children in an associative array with levels as keys and child nodes as values, or
- create one node corresponding to the point on each level with children, storing the level of the node as well as its children.

The memory consumption for the former representation can be calculated as follows: every cover tree node needs to store the corresponding point and the associative array. If the associative array is a binary tree, then for every level , there is one binary tree node with

- a list of children of the cover tree node at level ,
- a pointer to its left child,
- a pointer to its right child, and
- a pointer to the parent binary tree node so that we can implement iterators (this is how `std::map` nodes in the GCC C++ standard library are implemented).

Hence, for every level in the binary tree, we need to store at least four references and the level. The other representation must store the level, a reference to the corresponding point, and a reference to the list of children of this cover tree node, so this representation is more economical with respect to memory consumption. There is no difference in the complexity of nearest-neighbor searches because for an efficient nearest-neighbor search, we have to search the cover tree top down starting at the highest level.

A metric maps its input to non-negative real values. On a computer (in finite precision arithmetic) there are bounds on the range of values that can be represented and we will elaborate on this fact using numbers in the IEEE-754 double-precision floating-point format as an example. An IEEE-754 float is a number $(-1)^s \cdot 1.m \cdot 2^{e - b}$, where $m$ is called the *mantissa* (or significand), $e$ is called the *exponent*, and $b$ is a fixed number called *bias*. The sign bit of a double-precision float is represented with one bit, the exponent occupies 11 bits, the mantissa 52 bits, and the bias is $b = 1023$. The limited size of these two fields immediately bounds the quantities that can be represented and in fact, the largest finite double-precision float value is approximately $1.8 \cdot 10^{308}$ and the smallest positive value is $2^{-1074} \approx 4.9 \cdot 10^{-324}$. Consequently, we will never need more than about 2100 levels in a cover tree when using double-precision floats irrespective of the number of points stored in the tree. Thus, the levels can be represented with 16-bit integers.
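These bounds are easy to verify in Python (a small sketch; the level count follows from the exponent range spanned by normal and subnormal doubles):

```python
import math
import sys

# Double-precision range: the largest finite double is just below 2^1024
# and the smallest positive (subnormal) double is 2^-1074.
assert math.frexp(sys.float_info.max)[1] == 1024
assert 2.0**-1074 > 0.0 and 2.0**-1075 == 0.0

# A cover tree level ℓ corresponds to a distance scale of 2^ℓ, so every
# representable positive distance falls into a level between -1074 and 1024.
num_levels = 1024 - (-1074) + 1
assert num_levels < 2**15  # fits comfortably into a 16-bit signed integer
```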

The authors of the original cover tree paper Cover Trees for Nearest Neighbors made their C and C++ implementations available on the website http://hunch.net/~jl/projects/cover_tree/cover_tree.html. The first author of the paper *Faster Cover Trees* made the Haskell implementation of a nearest ancestor cover tree used for this paper available on GitHub. The C++ implementation by Manzil Zaheer features $k$-nearest neighbor search, range search, and parallel construction based on C++ concurrency features (GitHub). The C++ implementation by David Crane can be found in his repository on GitHub. Note that the worst-case complexity of node removal is linear in this implementation because of a conspicuous linear vector search. The best-maintained implementation of a cover tree can probably be found in the MLPACK library (also C++). I implemented a nearest ancestor cover tree in C++14 which takes longer to construct but has superior performance during nearest-neighbor searches. The code can be found in my git repository.

The worst-case complexity bounds of common cover tree operations, e.g., construction and querying, contain terms or , where is the expansion constant. In this section, I will measure the effect of the expansion constant on the run-time of batch construction and nearest-neighbor search on a synthetic data set.

For the experiment, I implemented a nearest ancestor cover tree described in *Faster Cover Trees* in C++14 with batch construction, nearest-neighbor search (single-tree algorithm), and without associative arrays. The first point in the data set is chosen as the root of the cover tree and on every level of the cover tree, the construction algorithm attempts to select the points farthest away from the root as children.

The data consists of random points in -dimensional space with uniformly distributed entries in the interval , i.e., we use random points inside of a hypercube. The reference set (the cover tree) contains points and we performed nearest-neighbor searches for random points. The experiments are conducted using the Manhattan, the Euclidean, and the Chebychev metric and the measurements were repeated 25 times for dimensions .

We do not attempt to measure the expansion constant for every set of points. Instead, we approximate the expansion constant from the dimension $d$. Let $d_p$, $p \in \{1, 2, \infty\}$, be a metric, where

- $d_1$ is the Manhattan metric,
- $d_2$ is the Euclidean metric, and
- $d_\infty$ is the Chebychev metric,

and let $B_p(r)$ be the ball centered at the origin with radius $r$:

$$B_p(r) = \{ x \in \mathbb{R}^d : d_p(x, 0) < r \}.$$

The expansion constant $c$ of a set $S$ was defined as the smallest scalar such that

$$|B(p, 2r)| \leq c \, |B(p, r)|$$

for all $p \in S$, $r > 0$. We will now simplify both sides of the inequality.

In this experiment, all entries of the points are uniformly distributed around the origin and using the assumption that and are sufficiently large, will be approximately constant everywhere in the hypercube containing the points:

Using the uniform distribution property again, we can center the balls at the origin without loss of generality. Likewise, since the point density is approximately constant, the fraction above will be close to the ratio of the volumes of the balls $B_p(2r)$ and $B_p(r)$. $B_2(r)$ is called a *$d$-ball*.

The volume of the $d$-ball in Euclidean space is

$$V_2(d, r) = \frac{\pi^{d/2}}{\Gamma(d/2 + 1)}\, r^d,$$

where $\Gamma$ is the gamma function. Finally, $B_\infty(r)$ is a hypercube with volume

$$V_\infty(d, r) = (2r)^d.$$

Using our assumptions, it holds that

Consequently, the worst-case bounds are for construction and for nearest-neighbor searches in cover trees with this data set.
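The two volume formulas are easy to evaluate (function names are mine); note how quickly the cube-to-ball volume ratio grows with the dimension:

```python
import math

def ball_volume(d, r):
    """Volume of the Euclidean d-ball: pi^(d/2) / Gamma(d/2 + 1) * r^d."""
    return math.pi**(d / 2.0) / math.gamma(d / 2.0 + 1.0) * r**d

def cube_volume(d, r):
    """Volume of the Chebychev ball, a hypercube with side length 2r."""
    return (2.0 * r)**d

assert abs(ball_volume(2, 1.0) - math.pi) < 1e-12              # disk area
assert abs(ball_volume(3, 1.0) - 4.0 * math.pi / 3.0) < 1e-12  # ball volume
assert cube_volume(10, 1.0) / ball_volume(10, 1.0) > 100.0
```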

In the plots, indicates the Manhattan, the Euclidean, and the Chebychev metric with the markers corresponding to the shape of the ball .

The figures below show mean and standard deviation for construction and query phase. The construction time of the cover tree is strongly dependent on the used metric: cover tree construction using the Chebychev metric takes considerably more time than with the other norms; construction with the Manhattan metric is slightly faster than with the Euclidean metric. Observe that there is a large variation in the construction time when employing the Euclidean metric and this effect becomes more pronounced the higher the dimension . Also, considering the small standard deviation in the data, the construction time slightly jumps at for the Manhattan norm. In the query time plot, we can see that the results are mostly independent of the metric at hand. What is more, the variance of the query time is without exception small in comparison to the mean. Nevertheless, there is a conspicuous bend at when using the Manhattan metric. This bend is unlikely to be a random variation because we repeated the measurements 25 times and the variance is small.

With our measurements we want to determine how the expansion constant influences construction and query time and according to the estimates above, we ought to see an exponential growth in operations. To determine if the data points could have been generated by an exponential function, one can plot the data with a logarithmic scaling along the vertical axis. Then, exponential functions will appear linear and polynomials sub-linear. In the figures below, we added an exponential function for comparison and it seems that the construction time does indeed grow exponentially with the dimension irrespective of the metric at hand while the query time does not increase exponentially with the dimension.

Seeing the logarithmic plots, I attempted to fit an exponential function , , to the construction time data. The fitted exponential did not approximate the data of the Euclidean and Manhattan metric well even when considering the standard deviation. However, the fit for the Chebychev metric was very good. In the face of these results, I decided to fit a monomial , , to both construction and query time data and the results can be seen below. Except for the Manhattan metric data, the fit is very good.

The value of the constants are for construction:

- : , ,
- : , ,
- : , .

For nearest-neighbor searches, the constants are

- : , ,
- : , ,
- : , .

In conclusion, for our test data the construction time of a nearest ancestor cover tree is strongly dependent on the metric at hand and the dimension of the underlying space whereas the query time is mostly independent of the metric and a function of the square of the dimension . The jumps in the data of the Manhattan metric and the increase in variation in the construction time when using the Euclidean metric highlight that there must be non-trivial interactions between dimension, metric, and the cover tree implementation.

Originally, we asked how the expansion constant impacts the run-time of cover tree operations and determined that we can approximate by calculating . Thus, the run-time of construction and nearest-neighbor search seems to be proportional to because and .

We introduced cover trees, discussed their advantages as well as their unique theoretical properties. We elaborated on the complexity of cover tree operations, the expansion constant, and implementation aspects. Finally, we conducted an experiment on a synthetic data set and found that the construction time is strongly dependent on the metric and the dimension of the underlying space while the time needed for nearest-neighbor search is almost independent of the metric. Most importantly, the complexity of operations seems to be polynomial in and proportional to the logarithm of the expansion constant. There are unexplained jumps in the measurements.

]]>

For my master's thesis, I implemented a solver for large, sparse generalized eigenvalue problems that used SuperLU to solve systems of linear equations (SLEs) $Ax = b$, where $A$ is large, sparse, and Hermitian positive definite (HPD) or real symmetric positive definite. SuperLU is a direct solver for such SLEs but depending on the structure of the non-zero matrix entries, there may be so much fill-in that the solver runs out of memory when attempting to factorize $A$, and this happened for the largest test problems in my master's thesis. Without carefully measuring the memory consumption, I replaced SuperLU with direct substructuring.

Direct substructuring is a recursive method for solving SLEs $Ax = b$. Partition

$$A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}, \quad x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}, \quad b = \begin{bmatrix} b_1 \\ b_2 \end{bmatrix}$$

conformally and let $S = A_{22} - A_{21} A_{11}^{-1} A_{12}$ be the Schur complement of $A_{11}$ in $A$.

Then we can compute $x$ as follows: solve $A_{11} z = b_1$ and calculate $\hat{b}_2 = b_2 - A_{21} z$. Then

$$S x_2 = \hat{b}_2.$$

Finally,

$$A_{11} x_1 = b_1 - A_{12} x_2.$$

Our aim is to solve SLEs $Ax = b$ and so far, I did not explain how to solve the SLE involving the Schur complement. We could solve this SLE directly with a dense or a sparse solver or we can apply the idea from above again, i.e., we partition the Schur complement into a $2 \times 2$ block matrix, partition the corresponding right-hand side and solution conformally, and solve the resulting smaller SLEs. This recursive approach yields a method that is called

Observe that for direct substructuring, we only need to store the results of the factorizations of the Schur complements, e.g., the matrices $L$ and $U$ of the LU decomposition of $S$, so the memory consumption of the factorization of $A$ is a function of the dimension of $S$. Furthermore, we can quickly minimize the dimension of the block $A_{22}$ with the aid of nested dissection orderings, a method based on minimal vertex separators in graph theory. Thus, it is possible to use direct substructuring in practice.

Before presenting the results of the numerical experiments, I elaborate on the suboptimal parameter choice for SuperLU and afterwards, I explain why using direct substructuring was ill-advised.

SuperLU is a sparse direct solver meaning it tries to find permutation matrices $P_r$ and $P_c$ as well as triangular matrices $L$ and $U$ with minimal fill-in such that $P_r A P_c = LU$. When selecting permutation matrices, SuperLU considers fill-in and the modulus of the pivot to ensure numerical stability; $P_r$ is determined by partial pivoting and $P_c$ is chosen independently of $P_r$ in order to minimize fill-in (see Section 2.5 of the SuperLU User's Guide). SuperLU uses *threshold pivoting* meaning SuperLU attempts to use the diagonal entry $a_{jj}$ as pivot unless

$$|a_{jj}| < \eta \max_i |a_{ij}|,$$

where $\eta$ is a small, user-provided constant. Consequently, for sufficiently well-conditioned HPD matrices, SuperLU will use $P_r = P_c^T$ and compute a Cholesky decomposition such that $P_c^T A P_c = L L^*$, approximately halving the memory consumption in comparison to an LU decomposition. For badly conditioned or indefinite matrices, SuperLU will thus destroy hermiticity as soon as a diagonal element violates the admission criterion for pivot elements above and compute an LU decomposition.

For HPD matrices, small pivots do not hurt numerical stability and in my thesis, I used only HPD matrices. For large matrices, SuperLU was often computing LU instead of Cholesky decompositions and while this behavior is completely reasonable, it was also unnecessary and it can be avoided by setting $\eta = 0$, thereby forcing SuperLU to compute Cholesky factorizations.
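In SciPy, which ships SuperLU, the pivot threshold corresponds to the `diag_pivot_thresh` parameter of `scipy.sparse.linalg.splu`; a minimal sketch with a small HPD test matrix (the `SymmetricMode` entry is passed through to SuperLU's options structure, and `MMD_AT_PLUS_A` is the column ordering SuperLU recommends for symmetric mode):

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# 1D Laplacian: real symmetric positive definite
n = 100
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csc")

# default settings: threshold pivoting, possibly destroying symmetry
lu = spla.splu(A)
# diag_pivot_thresh=0 forces diagonal pivots; for an HPD matrix this
# yields a Cholesky-like factorization
ll = spla.splu(A, diag_pivot_thresh=0.0, permc_spec="MMD_AT_PLUS_A",
               options=dict(SymmetricMode=True))

b = np.ones(n)
assert np.allclose(A @ ll.solve(b), b)
```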

Given an SLE $Ax = b$ and a suitable factorization of $A$, a direct solver can overwrite $b$ with the solution $x$ such that no temporary storage is needed. Hence, as long as there is sufficient memory for storing the factorization of $A$ and the right-hand sides, we can solve SLEs.
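SciPy's dense LAPACK wrappers expose this behavior: `overwrite_b=True` permits the solver to reuse the storage of the right-hand side for the solution. A small illustration:

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
lu, piv = lu_factor(A)   # L and U are stored together in one array

b = np.array([1.0, 2.0])
# overwrite_b=True allows the wrapped LAPACK routine to overwrite b with
# the solution, so no memory beyond factorization and RHS is required
x = lu_solve((lu, piv), b, overwrite_b=True)
assert np.allclose(A @ x, [1.0, 2.0])
```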

Let us reconsider solving an SLE $AX = B$ with multiple right-hand sides by direct substructuring. We need to

- solve $A_{11} Y = B_1$,
- calculate $\hat{B}_2 = B_2 - A_{21} Y$,
- solve $S X_2 = \hat{B}_2$, and
- compute $X_1 = Y - A_{11}^{-1} A_{12} X_2$.

Let $m$ be the number of columns of $B$. Observe that when computing $X_1$ in the last step, we need to store $Y$ and $A_{11}^{-1} A_{12} X_2$ simultaneously. Assuming we overwrite the memory location originally occupied by $B_1$ with $Y$, we still need additional memory to store $A_{11}^{-1} A_{12} X_2$. Thus, direct substructuring requires additional memory during a solve because we need to store the factorized Schur complements as well as this intermediate matrix. That is, the maximum memory consumption of direct substructuring may be considerably larger than the amount of memory required only for storing the Schur complements. The sparse solver in my thesis was supposed to work with large matrices, so even if the block $A_{22}$ is small in dimension such that $A_{12}$ has only few columns, storing $A_{11}^{-1} A_{12}$ requires large amounts of memory because this matrix is dense and has as many rows as $A_{11}$; with double precision floats (8 bytes), it can easily occupy gibibytes of storage. For comparison, storing the matrix consumes 0.3MiB while the 2,455,670 non-zeros of the test matrix s3dkq4m2 (see the section with the numerical experiments) with dimension 90,449 require 18.7MiB of memory, and the sparse Cholesky factorization of s3dkq4m2 computed by SuperLU occupies 761MiB. Keep in mind that direct substructuring is a recursive method, i.e., we have to store such a matrix at every level of recursion.
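To get a feeling for the numbers, here is the storage arithmetic for such a dense intermediate matrix; the dimensions below are made up for illustration:

```python
# hypothetical dimensions, for illustration only
n1 = 100_000  # dimension of the block A11
n2 = 2_000    # number of columns of A12
bytes_per_double = 8

gib = n1 * n2 * bytes_per_double / 2**30
print(f"{gib:.2f} GiB")  # about 1.49 GiB at a single recursion level
```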

If we want to avoid running out of memory, then we have to reduce the maximum memory usage. When I decided to replace SuperLU with direct substructuring, I only attempted to reduce the memory required for storing the matrix factorization. With SuperLU, the maximum memory consumption and the memory required for storing the matrix factorization are the same, but not for direct substructuring. In fact, the maximum memory consumption is often orders of magnitude larger than the storage required for the factors of the Schur complements.

I measured the following quantities:

- fill-in,
- maximum memory consumption,
- the time needed to factorize $A$, and
- the time needed to solve SLEs with 256, 512, 1024, and 2048 right-hand sides (RHS).

In the plots, LU signifies SuperLU with the default parameters, LL stands for SuperLU with parameters chosen to exploit hermiticity (LL is an allusion to the Cholesky decomposition $A = LL^*$, see the section on SuperLU and HPD matrices), and DS means direct substructuring. The set of 27 test matrices consists of all real symmetric positive definite Harwell-Boeing BCS structural engineering matrices, as well as

- gyro_k, gyro_m,
- vanbody,
- ct20stif,
- s3dkq4m2,
- oilpan,
- ship_003,
- bmw7st_1,
- bmwcra_1, and
- pwtk.

These matrices can be found at the University of Florida Sparse Matrix Collection. The smallest matrix has dimension 1074, the largest matrix has dimension 217,918, the median is 6458, and the arithmetic mean is 37,566. For the timing tests, I measured wall-clock time and CPU time and both timers have a resolution of 0.1 seconds. The computer used for the tests has an AMD Athlon II X2 270 CPU and 8GB RAM. Direct substructuring was implemented in Python with SciPy 0.17.1 and NumPy 1.10.4 using Intel MKL 11.3. SuperLU ships with SciPy. The right-hand side vectors are random vectors with uniformly distributed entries.

From the fill-in plot, it is easy to see that LL requires less memory than LU. For direct substructuring, the fill-in is constant except for a few outliers where the matrix is diagonal. For the smallest problems, both SuperLU variants create less fill-in but for the larger test matrices, DS is comparable to LL and sometimes better. Unfortunately, we can also gather from the plot that the fill-in seems to be a linear function of the matrix dimension $n$ with SuperLU. If we assume that the number of non-zero entries in a sparse matrix is a linear function of $n$, too, then the number of non-zero entries in the factorization must be proportional to $n^2$. When considering maximum memory consumption, DS is significantly worse than LL and LU.

The plot showing the setup time of the solvers is unambiguous: LL needs slightly less time than LU, and both LL and LU require significantly less time than DS (note the logarithmic scale).

Before I present the measurements for the time needed to solve the SLEs, I have to highlight a limitation of SciPy's SuperLU interface. As soon as one has computed the LU or Cholesky decomposition of the matrix $A$ of an SLE $AX = B$, there is no need for additional memory because the RHS can be overwritten directly with the solution $X$. This is exploited by software, e.g., LAPACK as well as SuperLU. Nevertheless, SciPy's SuperLU interface stores the solution in newly allocated memory and this leaves SuperLU unable to solve seven (LU) and six (LL) SLEs, respectively. For comparison, DS is unable to solve six SLEs.
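The limitation is easy to observe: the `solve` method of SciPy's SuperLU object returns a freshly allocated array and leaves the right-hand side untouched:

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(50, 50), format="csc")
lu = spla.splu(A)

b = np.ones(50)
x = lu.solve(b)
# the solution lives in newly allocated memory; b is not overwritten
assert x is not b
assert np.allclose(b, 1.0)
assert np.allclose(A @ x, b)
```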

The time needed to solve an SLE depends on the number of RHS, on the dimension of the matrix $A$, and on other properties of $A$ like node degrees. To avoid complex figures, I plotted the solve time normalized by the time needed by LL for solving an SLE with 256 RHS and the same matrix over the number of RHS; the results can be seen below. Due to the normalization, I removed all SLEs that LL was unable to solve with 256 RHS. Moreover, the plot ignores all SLEs where the solver took less than one second to compute the solution to reduce jitter caused by the timer resolution (0.1 seconds) and finally, I perturbed the $x$-coordinate of the data points to improve legibility.

For most problems, LL is faster than LU and for both solvers, the run-time increases linearly with the number of RHS, as expected (this fact is more obvious when considering CPU time). For DS, the situation is more interesting: with 256 RHS, DS is most of the time significantly slower than both LL and LU but with 2048 RHS, DS is often significantly faster than both LL and LU. To explain this behavior, let us reconsider the operations performed by DS ($A$ is a $2 \times 2$ block matrix):

- solve $A_{11} Y = B_1$,
- calculate $\hat{B}_2 = B_2 - A_{21} Y$,
- solve $S X_2 = \hat{B}_2$, and
- compute $X_1 = Y - A_{11}^{-1} A_{12} X_2$.

Let $m$ be the number of RHS (the number of columns of $B$) and let $n_2$ be the number of columns of $A_{12}$. In practice, $A_{22}$ will be considerably smaller than $A_{11}$ in dimension, hence the majority of the computational effort is required for the evaluation of $A_{11}^{-1} B_1$ and $A_{11}^{-1} A_{12} X_2$. If we evaluate the latter term as $(A_{11}^{-1} A_{12}) X_2$, then DS behaves almost as if there were $m + n_2$ RHS. For small $m$, this will clearly impact the run-time, e.g., with $m = n_2$, the number of RHS is effectively doubled.
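The two evaluation orders compute the same matrix but with a different number of solves against $A_{11}$, as the following dense sketch demonstrates:

```python
import numpy as np

rng = np.random.default_rng(1)
n1, n2, m = 400, 40, 40             # m = n2
A11 = rng.standard_normal((n1, n1)) + n1 * np.eye(n1)
A12 = rng.standard_normal((n1, n2))
X2 = rng.standard_normal((n2, m))

# (A11^{-1} A12) X2: n2 solves with A11, dense n1-by-n2 intermediate
left = np.linalg.solve(A11, A12) @ X2
# A11^{-1} (A12 X2): m solves with A11, n1-by-m intermediate
right = np.linalg.solve(A11, A12 @ X2)
assert np.allclose(left, right)
```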

We introduced direct substructuring, briefly highlighted how SuperLU can be forced to compute Cholesky decompositions of Hermitian positive definite matrices, and showed that fill-in and maximum memory consumption are two different quantities when solving SLEs with direct substructuring. We compared maximum memory consumption, setup time, and solve time of SuperLU and direct substructuring for SLEs in experiments with real symmetric positive definite real-world matrices and a varying number of right-hand sides. SuperLU with forced symmetric pivoting has the smallest setup time and solves SLEs faster than SuperLU with its default settings. Direct substructuring requires the most storage; it is slow if there are few right-hand sides but it is by far the fastest solver if there are many right-hand sides in relation to the dimension of the matrix $A$.

The data and the code that created the plots in this article can be found at Gitlab:

https://gitlab.com/christoph-conrads/christoph-conrads.name/tree/master/superLU-vs-DS

The code generating the measurements can be found in DCGeig commit c1e0, July 16, 2016.

]]>

I will demonstrate this point with an example in the programming language C, specifically the revision C11. Consider a singly-linked list where the pointer to the head of the list is concurrently read and written by multiple threads:

```c
struct node {
    size_t value;
    struct node* p_next;
};
typedef struct node node;

_Atomic(node*) p_head = ATOMIC_VAR_INIT(NULL);
```

The singly-linked list is considered empty if `p_head` is a null pointer. The thread T1 is only reading the list:

```c
void* T1(void* args) {
    node* p = atomic_load(&p_head);

    // more computations

    // stop referencing the head of the list
    return NULL;
}
```

The thread T2 removes the first element of the singly-linked list by trying to overwrite the pointer stored in `p_head`:

```c
void* T2(void* args) {
    node* p = NULL;
    node* p_next = NULL;
    node* p_expected = NULL;

    do {
        p = atomic_load(&p_head);
        if( !p )
            break;

        p_next = p->p_next;
        p_expected = p;
    } while(!atomic_compare_exchange_strong(&p_head, &p_expected, p_next));

    // ensure other threads stopped referencing p
    free(p);
    return NULL;
}
```

T2 relies on compare-and-swap (`atomic_compare_exchange_strong`) to detect interference of other threads.

After successfully updating `p_head`, the memory referenced by `p` needs to be freed after all threads stopped referencing this memory and in general, this requires a garbage collector. Waiting does not help because the threads holding references might have been stopped by the operating system. Scanning the stack, the heap, and the other threads' CPU registers is not possible in many languages or not covered by the programming language standard and besides, such a scan is an integral part of any tracing garbage collector.

In the introduction I wrote that certain concurrent algorithms **require** garbage collection; more accurately, it should say: in the absence of special guarantees, certain concurrent algorithms require garbage collection. For example, if we can guarantee that threads hold their references to the singly-linked list only for a certain amount of time, then there is no need for garbage collection; this fact is exploited in the Linux kernel by the read-copy-update (RCU) mechanism.

- structure-preserving backward error bounds computable in linear time,
- the run-time of GSVD-based dense GEP solvers is within a factor of 5 of the fastest GEP solver with Netlib LAPACK in my tests,
- computing the GSVD directly is up to 20 times slower than the computation by means of QR factorizations and the CS decomposition with Netlib LAPACK in my tests,
- given a pair of matrices with 2x2 block structure, I show how to minimize eigenvalue perturbation by off-diagonal blocks with the aid of graph algorithms, and
- I propose a new multilevel eigensolver for sparse GEPs that is able to compute up to 1000 eigenpairs on a cluster node with two dual-core CPUs and 16 GB virtual memory limit for problems with up to 150,000 degrees of freedom in less than eleven hours.

The revised edition of the thesis with fixed typos is here (PDF), the source code is available here, and the abstract is below. In February, I already gave a talk on the preliminary thesis results; more details can be found in the corresponding blog post.

## Abstract

This thesis treats the numerical solution of generalized eigenvalue problems (GEPs) $Kx = \lambda Mx$, where $K$, $M$ are Hermitian positive semidefinite (HPSD). We discuss problem and solution properties, accuracy assessment of solutions, aspects of computations in finite precision, the connection to the finite element method (FEM), dense solvers, and projection methods for these GEPs. All results are directly applicable to real-world problems.

We present properties and origins of GEPs with HPSD matrices and briefly mention the FEM as a source of such problems.

With respect to accuracy assessment of solutions, we address quickly computable and structure-preserving backward error bounds and their corresponding condition numbers for GEPs with HPSD matrices. There is an abundance of literature on backward error measures possessing one of these features; the backward error in this thesis provides both.

In Chapter 3, we elaborate on dense solvers for GEPs with HPSD matrices. The standard solver reduces the GEP to a standard eigenvalue problem; it is fast but requires positive definite mass matrices and is only conditionally backward stable. The QZ algorithm for general GEPs is backward stable but it is also much slower and does not preserve any problem properties. We present two new backward stable and structure preserving solvers, one using deflation of infinite eigenvalues, the other one using the generalized singular value decomposition (GSVD). We analyze backward stability and computational complexity. In comparison to the QZ algorithm, both solvers are competitive with the standard solver in our tests. Finally, we propose a new solver combining the speed of deflation with the ability of GSVD-based solvers to handle singular matrix pencils.

Finally, we consider black-box solvers based on projection methods to compute the eigenpairs with the smallest eigenvalues of large, sparse GEPs with Hermitian positive definite (HPD) matrices. After reviewing common methods for spectral approximation, we briefly mention ways to improve numerical stability. We discuss the automated multilevel substructuring method (AMLS) before analyzing the impact of off-diagonal blocks in block matrices on eigenvalues. We use the results of this thesis and insights in recent papers to propose a new divide-and-conquer eigensolver and to suggest a change that makes AMLS more robust. We test the divide-and-conquer eigensolver on sparse structural engineering matrices with 10,000 to 150,000 degrees of freedom.

2010 Mathematics Subject Classification. 65F15, 65F50, 65Y04, 65Y20.

**Edit**: Revised master's thesis from April 2016 (PDF)

- using WordPress as a static website generator,
- not loading unused scripts and fonts, and
- employing compression and client-side caches.

According to WebPagetest, a Firefox client in Frankfurt with a DSL connection and an empty cache needed to download 666 kB (28 requests) and had to wait approximately 7.7 seconds before being able to view my frontpage from April 4. With the static website, the same client has to wait about 3.1 seconds and transfer 165 kB (18 requests). As a side effect, the website now offers considerably less attack surface and user privacy has improved.

Earlier this year, I was annoyed by the speed of my blog and this feeling was reinforced by mediocre test results on online website speed tests like PageSpeed Insights, WebPagetest, or Pingdom Website Speed Test. These tests also highlighted the suboptimal configuration of the HTTP server (no compression, unused client-side caches). Since my WordPress blog is static and since there are usually many vulnerabilities in WordPress, in WordPress plugins, or WordPress themes, I decided to replace my online presence with a static copy of my WordPress blog.

The static website is faster, thereby improving user experience, and it provides better privacy: given a link to a file, a web browser sends the address of the website containing the link whenever it downloads the linked file (the so-called HTTP referrer). Consequently, HTTP referrers allow user tracking across different websites. By default, WordPress with the Twenty Fourteen theme loads one set of Google webfonts when viewing the page and another set when logging in, allowing Google to distinguish between regular website visitors and authors (see also this link). Except for the blog posts using MathJax, the static website does not require external data for viewing.

For the sake of completeness I want to mention there are also SSL tests for servers, e.g., by Qualys SSL Labs and by wormly.

Since we will disable the comments later, we will prevent web browsers from loading Gravatar data. To that end, the option `Show Avatars` in `Settings -> Discussion -> Avatar Display` must be unchecked (the discussion settings are not available if the Disable Comments plugin is active). WordPress supports emojis and if you do not use these, then they can be disabled with a plugin called Disable Emojis. Furthermore, WordPress has been loading Google webfonts for some time now and these can be disabled with the plugin Disable Google Fonts.

In order to use WordPress as a static website generator, there must be no dynamic content unless the user is logged in. Consequently, the WordPress blog must

- disable comments,
- avoid attachments pages,
- disable shortlinks, and
- make the search function inaccessible.

Comments can be globally disabled by the plugin Disable Comments. Links to attachment pages have to be removed manually from every article whereas shortlinks can be disabled with a plugin or by adding the following line to the file `functions.php` in the active theme:

```php
remove_action( 'wp_head', 'wp_shortlink_wp_head' );
```

Here, we want to remind the reader that the preferred way to modify standard themes is by means of child themes. Finally, search forms can be removed by customizing the current WordPress theme; the 404 page may contain a search form, too.

Since I want to have my website serve only static content, the WordPress blog needs to move somewhere and I decided to host WordPress offline on my own computer. Hence, I needed to install a LAMP stack or a similar solution stack on my machine, install WordPress, and move the blog afterwards.

I chose a vanilla LAMP stack and I struggled with the setup of the PHP interpreter because there are at least three popular ways to have Apache run PHP code. The how-to by Nathan Zachary describes a state-of-the-art solution using FastCGI over UNIX domain sockets.

I followed the reference guide for installing WordPress. The guide glosses over one question: who owns the directory with the WordPress installation? The PHP interpreter needs to have read and write access to the `wp-content` folder in the WordPress directory because this is the location where WordPress stores its plugins, themes, and uploaded files, so on my computer

- the PHP interpreter called by Apache (`php-fpm`) is executed as user `php-fpm`,
- the subdirectory `wp-content` is owned by the user `php-fpm` and the group `root`, and
- every other file and directory of the WordPress installation is owned by `root` (user and group).

Moving a WordPress blog involves

- creating a backup of the online WordPress blog,
- importing the backup into the offline WordPress blog, and
- replacing the links to the online blog.

Backups can be created manually or you can use WordPress plugins; I used BackUpWordPress for exporting the blog and I manually imported the data on my computer with the aid of the WordPress how-to. After importing the WordPress data, there may still be links to the online website in the blog posts and these need to be updated. Furthermore, if WordPress is hosted offline, then it makes sense to disable the automatic notification of blog update services (`Settings -> Writing`) and search engines.

Given a WordPress blog without dynamic content (when not logged in), the goal is to generate a snapshot of the blog and make it available online. Initially, I used the WordPress plugin Simply Static to generate static copies but later I wanted a fully automatic process for generating snapshots and updating the online website. The automated process has to

- download all pages reachable from the front page,
- download the 404 error page,
- optionally copy other files like the site map, robots.txt, or stylesheets for the XML site maps,
- remove query parameters from filenames,
- replace the links to your offline WordPress blog with links to the online website, and
- upload the static website via SSH.

This can be done with standard UNIX tools and a fully automated script can be found in my git repository. The script does not check for dead links but there are online services for this purpose.
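Two of the steps, stripping query parameters from filenames and rewriting links, can be sketched in a few lines of Python; the addresses and filenames below are hypothetical:

```python
# hypothetical offline and online addresses
OFFLINE_URL = "http://localhost/blog"
ONLINE_URL = "https://example.org"

def strip_query(filename):
    """Remove query parameters: 'style.css?ver=4.5' -> 'style.css'."""
    return filename.split("?", 1)[0]

def rewrite_links(html):
    """Replace links to the offline WordPress blog with online links."""
    return html.replace(OFFLINE_URL, ONLINE_URL)

assert strip_query("style.css?ver=4.5") == "style.css"
assert rewrite_links('<a href="http://localhost/blog/">') == \
       '<a href="https://example.org/">'
```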

Even for completely static websites, HTTP server settings still have a large impact on website performance. I followed the proposals by the online speed tests mentioned above and enabled

- ETags,
- client-side caching, and
- compression.

Moreover, without WordPress and its plugins, I needed to redirect HTTP connections to HTTPS and canonicalize URLs manually. A corresponding `.htaccess` file can be found in my git repository (this file does not enable ETags as they were enabled by default at my webhoster).

https://gitlab.com/christoph-conrads/christoph-conrads.name

The repository contains:

- The WordPress theme of this website, a child theme of "Twenty Fourteen",
- measurement data and scripts of the article Performance and Accuracy of xPTEQR, and
- the Matlab code for the article Spectral Norm Bounds for Hermitian Matrices.

Many commits were modified on April 12; I retrofitted all files of my WordPress theme with license headers in order to avoid licensing issues.

]]>