1 files changed, 1879 insertions, 0 deletions
diff --git a/papers/txt/engelken2023_gradient_flossing.txt b/papers/txt/engelken2023_gradient_flossing.txt
new file mode 100644
index 0000000..6d70608
--- /dev/null
+++ b/papers/txt/engelken2023_gradient_flossing.txt
@@ -0,0 +1,1879 @@
+                                              Gradient Flossing: Improving Gradient Descent
+                                                 through Dynamic Control of Jacobians
+
+
+                                                                                Rainer Engelken
+                                                                      Zuckerman Mind Brain Behavior Institute
+                                                                               Columbia University
+arXiv:2312.17306v1 [cs.LG] 28 Dec 2023
+
+
+
+
+                                                                                New York, USA
+                                                                             re2365@columbia.edu
+
+
+
+                                                                                      Abstract
+                                                  Training recurrent neural networks (RNNs) remains a challenge due to the insta-
+                                                  bility of gradients across long time horizons, which can lead to exploding and
+                                                  vanishing gradients. Recent research has linked these problems to the values
+                                                  of Lyapunov exponents for the forward-dynamics, which describe the growth or
+                                                  shrinkage of infinitesimal perturbations. Here, we propose gradient flossing, a
+                                                  novel approach to tackling gradient instability by pushing Lyapunov exponents
+                                                  of the forward dynamics toward zero during learning. We achieve this by regu-
+                                                  larizing Lyapunov exponents through backpropagation using differentiable linear
+                                                  algebra. This enables us to "floss" the gradients, stabilizing them and thus improv-
+                                                  ing network training. We demonstrate that gradient flossing controls not only the
+                                                  gradient norm but also the condition number of the long-term Jacobian, facilitating
+                                                  multidimensional error feedback propagation. We find that applying gradient
+                                                  flossing prior to training enhances both the success rate and convergence speed for
+                                                  tasks involving long time horizons. For challenging tasks, we show that gradient
+                                                  flossing during training can further increase the time horizon that can be bridged
+                                                  by backpropagation through time. Moreover, we demonstrate the effectiveness of
+                                                  our approach on various RNN architectures and tasks of variable temporal com-
+                                                  plexity. Additionally, we provide a simple implementation of our gradient flossing
+                                                  algorithm that can be used in practice. Our results indicate that gradient flossing
+                                                  via regularizing Lyapunov exponents can significantly enhance the effectiveness of
+                                                  RNN training and mitigate the exploding and vanishing gradients problem.
+
+
+                                         1   Introduction
+                                         Recurrent neural networks are commonly used both in machine learning and computational neu-
+                                         roscience for tasks that involve input-to-output mappings over sequences and dynamic trajectories.
+                                         Training is often achieved through gradient descent by the backpropagation of error information
+                                         across time steps [1, 2, 3, 4]. This amounts to unrolling the network dynamics in time and recursively
+                                         applying the chain rule to calculate the gradient of the loss with respect to the network parameters.
+                                         Mathematically, evaluating the product of Jacobians of the recurrent state update describes how error
+                                         signals travel across time steps. When trained on tasks that have long-range temporal dependencies,
+                                         recurrent neural networks are prone to exploding and vanishing gradients [5, 6, 7, 8]. These arise from
+                                         the exponential amplification or attenuation of recursive derivatives of recurrent network states over
+                                         many time steps. Intuitively, to evaluate how an output error depends on a small parameter change at
+                                         a much earlier point in time, the error information has to be propagated through the recurrent network
+                                         states iteratively. Mathematically, this corresponds to a product of Jacobians that describe how
+                                         changes in one recurrent network state depend on changes in the previous network state. Together,
+                                         this product forms the long-term Jacobian. The singular value spectrum of the long-term Jacobian reg-
+
+                                         37th Conference on Neural Information Processing Systems (NeurIPS 2023).
+ulates how well error signals can propagate backwards along multiple time steps, allowing temporal
+credit assignment. A close mathematical correspondence of these singular values and the Lyapunov
+exponents of the forward dynamics was established recently [9, 10, 11, 12]. Lyapunov exponents
+characterize the asymptotic average rate of exponential divergence or convergence of nearby initial
+conditions and are a cornerstone of dynamical systems theory [13, 14]. We will use this link to
+improve the trainability of RNNs.
+Previous approaches that tackled the problem of exploding or vanishing gradients have suggested
+solutions at different levels. First, specialized units such as LSTM and GRU were introduced,
+which have additional latent variables that can be decoupled from the recurrent network states via
+multiplicative (gating) interactions. The gating interactions shield the latent memory state, which can
+therefore transport information across multiple time steps [5, 6, 15]. Second, exploding gradients can
+be avoided by gradient clipping, which re-scales the gradient norm [16] or their individual elements
+[17] if they become too large [18]. Third, normalization schemes like batch normalization prevent
+saturated nonlinearities that contribute to vanishing gradients [19]. Third, it was suggested that the
+problem of exploding/vanishing gradients can be ameliorated by specialized network architectures,
+for example, antisymmetric networks [20], orthogonal/unitary initializations [21, 22, 23], coupled
+oscillatory RNNs [24], Lipschitz RNNs [25], linear recurrent units [26], echo state networks [27, 28],
+(recurrent) highway networks [29, 30], and stable limit cycle neural networks [11, 31, 32]. Fourth, for
+large networks, a suitable choice of weights can guarantee a well-conditioned Jacobian at initialization
+[21, 33, 34, 35, 36, 37, 38, 39, 40, 41]. These initializations are based on mean-field methods, which
+become exact only in the large-network limit. Such initialization schemes have also been suggested
+for gated networks [40]. However, even when initializing the network with well-behaved gradients,
+gradients will typically not retain their stability during training once the network parameters have
+changed.
+Here, we propose a novel approach to tackling this challenge by introducing gradient flossing, a
+technique that keeps gradients well-behaved throughout training. Gradient flossing is based on
+a recently described link between the gradients of backpropagation through time and Lyapunov
+exponents, which are the time-averaged logarithms of the singular values of the long-term Jacobian
+[9, 11, 12, 32]. Gradient flossing regularizes one or several Lyapunov exponents to keep them close
+to zero during training. This improves not only the error gradient norm but also the condition number
+of the long-term Jacobian. As a result, error signals can be propagated back over longer time horizons.
+We first demonstrate that the Lyapunov exponents can be controlled during training by including an
+additional loss term. We then demonstrate that gradient flossing improves the gradient norm and
+effective dimension of the gradient signal. We find empirically that gradient flossing improves test
+accuracy and convergence speed on synthetic tasks over a range of temporal complexities. Finally,
+we find that gradient flossing during training further helps to bridge long-time horizons and show
+that it combines well with other approaches to ameliorate exploding and vanishing gradients, such as
+dynamic mean-field theory for initialization, orthogonal initialization and gated units.
+Our contributions include:
+
+       • Gradient flossing, a novel approach to the problem of exploding and vanishing gradients in
+         recurrent neural networks based on regularization of Lyapunov exponents.
+
+       • Analytical estimates of the condition number of the long-term Jacobian based on Lyapunov
+         exponents.
+
+       • Empirical evidence that gradient flossing improves training on tasks that involve bridging
+         long time horizons.
+
+
+2   RNN Gradients and Lyapunov Exponents
+
+We begin by revisiting the established mathematical relationship between the gradients of the loss
+function, computed via backpropagation through time, and Lyapunov exponents [9, 12], and how
+it relates to the problem of vanishing and exploding gradients. In backpropagation through time,
+network parameters θ are iteratively updated by stochastic gradient descent such that a loss Lt is
+locally reduced [1, 2, 3, 4]. For RNN dynamics hs+1 = fθ (hs , xs+1 ), with recurrent network state
+h, external input x, and parameters θ, the gradient of the loss Lt with respect to θ is evaluated by
+
+
+                                                   2
+unrolling the network dynamics in time. The resulting expression for the gradient is given by:
+                            τ =t−1   t−1
+                                                  !
+                 ∂Lt    ∂Lt X        Y ∂hτ ′ +1 ∂hτ         ∂Lt X              ∂hτ
+                     =                                   =            Tt (hτ )                        (1)
+                 ∂θ     ∂ht           ′
+                                          ∂hτ  ′     ∂θ      ∂ht τ              ∂θ
+                               τ =t−l   τ =τ
+
+where Tt (hτ ) is composed of a product of one-step Jacobians Ds = ∂hs+1
+                                                                    ∂hs :
+                                               t−1                       t−1
+                                               Y        ∂hτ ′ +1   Y
+                                 Tt (hτ ) =                      =   Dτ ′                             (2)
+                                                         ∂hτ ′
+                                               τ ′ =τ              ′     τ =τ
+
+Due to the chain of matrix multiplications in Tt , the gradients tend to vanish or explode exponentially
+with time. This complicates training particularly when the task loss at time t dependents on inputs x
+or states h from many time steps prior which creates long temporal dependencies [5, 6, 7, 8]. How
+well error signals can propagate back in time is constrained by the tangent space dynamics along
+trajectory ht , which dictate how local perturbations around each point on the trajectory stretch, rotate,
+shear, or compress as the system evolves.
+The singular values of the Jacobian’s product Tt , which determine how quickly gradients vanish or
+explode during backpropagation through time, are directly related to the Lyapunov exponents of the
+forward dynamics [9, 12]: Lyapunov exponents λ1 ≥ λ2 · · · ≥ λN are defined as the asymptotic
+time-averaged logarithms of the singular values of the long-term Jacobian [13, 42, 43]
+                                                         1
+                                        λi = lim              log(σi,t )                              (3)
+                                               t→∞ t − τ
+
+where σi,t denotes the ith singular value of Tt (hτ ) with σ1,t ≥ σ2,t . . . σN,t (See Appendix I
+for details). This means that positive Lyapunov exponents in the forward dynamics correspond to
+exponentially exploding gradient modes, while negative Lyapunov exponents in the forward dynamics
+correspond to exponentially vanishing gradient modes.
+In summary, the Lyapunov exponents give the average asymptotic exponential growth rates of
+infinitesimal perturbations in the tangent space of the forward dynamics, which also constrain the
+signal propagation in backpropagation for long time horizons. Lyapunov exponents close to zero in
+the forward dynamics correspond to tangent space directions along which error signals are neither
+drastically attenuated nor amplified in backpropagation through time. Such close-to-neutral modes in
+the tangent dynamics can propagate information reliably across many time steps.
+
+3   Gradient Flossing: Idea and Algorithm
+We now leverage the mathematical connection established between Lyapunov exponents and the
+prevalent issue of exploding and vanishing gradients for regularizing the singular values of the
+long-term Jacobian. We term this procedure gradient flossing. To prevent exploding and vanishing
+gradients, we constrain Lyapunov exponents to be close to zero. This ensures that the corresponding
+directions in tangent space grow and shrink on average only slowly. This leads to a better-conditioned
+long-term Jacobian Tt (hτ ). We achieve this by using the sum of the squares of the first k largest
+Lyapunov exponent λ1 , λ2 . . . λk as a loss function:
+                                                             k
+                                                             X
+                                            Lflossing =            λ2i                                (4)
+                                                             i=1
+
+and evaluate the gradient obtained from backpropagation through time:
+                                                      k
+                                        ∂Lflossing   X   ∂λ2i
+                                                   =                                                  (5)
+                                          ∂θ         i=1
+                                                         ∂θ
+
+This might seem like an ill-fated enterprise, as the gradient expression in Eq 5 suffers from its
+own problem of exploding and vanishing gradients. However, instead of calculating the Lyapunov
+exponents by directly evaluating the long-term Jacobian Tt (Eq 2), we use an established iterative
+reorthonormalization method involving QR decomposition that avoids directly evaluating the ill-
+conditioned long-term Jacobian [12, 44].
+
+
+                                                         3
+First, we evolve an initially orthonormal system Qs = [q1s , q2s , . . . qks ] in the tangent space along the
+trajectory using the Jacobian Ds = ∂h   s+1
+                                       ∂hs . This means to calculate
+
+                                           Q
+                                           e s+1 = Ds Qs                                       (6)
+at every time-step. Second, we extract the exponential growth rates using the QR decomposition,
+                                       Qe s+1 = Qs+1 Rs+1 ,
+which decomposes Q    e s+1 uniquely into the product of an orthonormal matrix Qs+1 of size N × k
+     ⊤
+so Qs+1 Qs+1 = 1k×k and an upper triangular matrix Rs+1 of size k × k with positive diagonal
+elements. Note that the QR decomposition does not have to be applied at every step, just sufficiently
+often, i.e., once every tONS such that Q
+                                       e does not become ill-conditioned.
+The Lyapunov exponents are given by time-averaged logarithms of the diagonal entries of Rs [43, 44]:
+                                              t                t
+                                       1     Y              1X
+                               λi = lim  log     Rsii = lim       log Rsii .                             (7)
+                                   t→∞ t                t→∞ t
+                                             s=1              s=1
+This way, the Lyapunov exponent can be expressed in terms of a temporal average over the diagonal
+elements of the Rs -matrix of a QR decomposition of the iterated Jacobian. To propagate the gradient
+of the square of the Lyapunov exponents backward through time in gradient flossing, we used an
+analytical expression for the pullback of the QR decomposition [45]: The backward pass of the QR
+decomposition is given by [45, 46, 47, 48]
+                                 Q = Q + Q copyltu(M) R−T ,
+                                                           
+                                                                                                 (8)
+                   T       T
+where M = RR − Q Q and the copyltu function generates a symmetric matrix by copying
+the lower triangle of the input matrix to its upper triangle, with the element [copyltu(M )]ij =
+Mmax(i,j),min(i,j) [45, 46, 47, 48]. We denote here adjoint variable as T = ∂L/∂T . A simple
+implementation of this algorithm in pseudocode is:
+Algorithm 1 Algorithm for gradient flossing of k tangent space directions
+  initialize h, Q
+  for e = 1 → E do
+       for t = 1 → T do
+           h ← fθ (h, x)
+                  dht
+           D ← dh  t−1
+           Q←D·Q
+           if t ≡ 0 (mod tONS ) then
+               Q, R ← qr(Q)
+               γi += log(Rii )
+           end if
+       end for
+       λi = γi /T
+                       ∂Lflossing
+       θe+1 ← θe − η ∂θ
+  end for
+For clarity, we described gradient flossing in terms of stochastic gradient descent, but we actually
+implemented it with the ADAM optimizer using standard hyperparameters η, β1 and β2 . An example
+implementation is available here. Note that this algorithm also works for different recurrent network
+architectures. In this case, the Jacobians D has size n × n, where n is the number of dynamic
+variables of the recurrent network model. For example, in case of a single recurrent network of
+N LSTM units, the Jacobian has size 2N × 2N [9, 12, 41]. The Jacobian matrix D can either be
+calculated analytically or it can be obtained via automatic differentiation.
+
+4   Gradient Flossing: Control of Lyapunov Exponents
+In Fig 1, we demonstrate that gradient flossing can set one or several Lyapunov exponents to a
+target value via gradient descent with the ADAM optimizer in random Vanilla RNNs initialized with
+different weight variances. The N units of the recurrent neural network follow the dynamics
+                             hs+1 = f (hs , xs+1 ) = Wϕ(hs ) + Vxs+1 .                        (9)
+
+
+                                                     4
+                                                                       C
+                                   t−1
+                                            ∂hτ ′ +1                        10 1
+                                                                                                             number of flossed i
+                                   Y
+                   Tt (hτ ) =                                                                                        k = 32
+                                             ∂hτ ′                                                                   k = 16
+                                   τ ′ =τ
+                                                                            10 3                                     k=1
+
+
+
+
+                                                                 2
+                                                                 i
+                                                                      i=1
+                                                                 k
+                                                                 1
+                                                                 k
+                                                                            10 5
+
+
+                                                                                         0     20     40    60             80         100
+                                                                                                       Epochs
+         B                                                               D
+             0.0                                                                   0
+             0.5                                                                   2
+1(1/ )
+
+
+
+
+                                                                      i(1/ )
+             1.0                                                                   4     number of flossed i
+                                                                                                 k = 32
+             1.5            target 1                                               6             k = 16
+                            actual 1                                                             k=1
+             2.0
+                   0   10     20       30     40 50    60   70   80                    0.0     0.2     0.4           0.6        0.8         1.0
+                                            Epochs                                                             i/N
+Figure 1: Gradient flossing controls Lyapunov exponents and gradient signal propagation
+A) Exploding and vanishing gradients in backpropagation through time arise from amplifica-
+                                                                                    Qt−1 ∂h ′
+tion/attenuation of product of Jacobians that form the long-term Jacobian Tt (hτ ) = τ ′ =τ ∂hτ +1 .
+                                                                                               τ′
+B) First Lyapunov exponent of Vanilla RNN as a function of training epochs. Minimizing the
+mean squared error between estimated first Lyapunov exponent and target Lyapunov exponent
+λ1 = −1, −0.5, 0 by gradient descent. 10 Vanilla RNNs were initialized with Gaussian recurrent
+weights Wij ∼ N (0, g 2 /N ) where values of g were drawn g ∼ Unif(0, 1). C) Gradient flossing
+minimizes the square of Lyapunov exponents over epochs. D) Full Lyapunov spectrum of Vanilla
+RNN after a different number of Lyapunov exponents are pushed to zero via gradient flossing. Note,
+the variability of the Lyapunov exponents that were not flossed. Parameters: network size N = 32
+with 10 network realizations. Error bars in C indicate the 25% and 75% percentiles and solid line
+shows median.
+
+
+The initial entries of W are drawn independently from a Gaussian distribution with zero mean and
+variance g 2 /N , where g is a gain parameter that controls the heterogeneity of weights. We here use
+the transfer function ϕ(x) = tanh(x). (See appendix B for gradient flossing with ReLU and LSTM
+units). xs is a sequence of inputs and V is the input weight. xs is a stream of i.i.d. Gaussian input
+xs ∼ N (0, 1) and the input weights V are N (0, 1). Both W and V are trained during gradient
+flossing.
+In Fig 1B, we show that for randomly initialized RNNs, the Lyapunov exponent can be modified by
+gradient flossing to match a desired target value. The networks were initialized with 10 different values
+of initial weight strength g chosen uniformly between 0 and 1. During gradient flossing, they quickly
+approached three different target values of the first Lyapunov exponents λtarget 1     = {−1, −0.5, 0}
+within less than 100 training epochs with batch size B = 1. We note that gradient flossing with
+positive target λtarget
+                  1     seems not to arrive at a positive Lyapunov exponent λ1 .
+Fig 1C shows gradient flossing for different numbers of Lyapunov exponents k. Here, during gradient-
+descent, the sum of the squares of 1, 16, or 32 Lyapunov exponents is used as loss in gradient flossing
+(see Fig 1A). Fig 1D shows the Lyapunov spectrum after flossing, which now has 1, 16, or 32
+Lyapunov exponents close to zero. We conclude that gradient flossing can selectively manipulate one,
+several, or all Lyapunov exponents before or during network training. Gradient flossing also works for
+RNNs of ReLU and LSTM units (See appendix B. Further, we find that the computational bottleneck         
+of gradient flossing is the QR decomposition, which has a computational complexity of O N k 2 ,
+both in the forward pass and in the backward pass. Thus, gradient flossing of the entire Lyapunov
+spectrum is computationally expensive. However, as we will show, not all Lyapunov exponents need
+to be flossed and only short episodes of gradient flossing are sufficient for significantly improving the
+training performance.
+
+
+                                                                               5
+5                           Gradient Flossing: Condition Number of the Long-Term Jacobian
+                             A                                            B                                              C
+condition number 2(Tt( ))
+                            1034                                         1026        initial
+                                                                                     after flossing                      1030
+                            1025                                                     theory
+                                                                         1019        simulations
+
+
+
+
+                                                                                                            2 (theory)
+                                                              2(Tt( ))
+                                                                                                                         1020
+                            1016                                         1012                                                                    k
+                                                  initial                                                                1010                        5
+                            107                   after flossing         105                                                                         10
+                                                  theory                                                                                             15
+                                                  simulations                                                            100
+                                   0    100 200 300                                    5       10      15                       100 1010 1020 1030
+                                     time horizon t                             tangent space dimension m                             2 (numerical)
+Figure 2: Gradient flossing reduces condition number of the long-term Jacobian A) Condition
+number κ2 of long-term Jacobian Tt (hτ ) as a function of time horizon t − τ at initialization (blue)
+and after gradient flossing (orange). Direct numerical simulations are done with arbitrary precision
+floating point arithmetic (transparent lines) with 256 bits per float, asymptotic theory based on
+Lyapunov exponents (dashed lines) (Eq 10). B) Condition number for different number of tangent
+space dimensions m. Simulations (dots) and Lyapunov exponent based theory (dashed lines) at
+initialization (blue) and after gradient flossing (orange). Gradient flossing increases the number of
+tangent space dimensions available for backpropagation for a given condition number (Grey dotted
+line as a guide for eye for κ2 = 105 .) First 15 Lyapunov exponents were flossed. C) Comparison of
+condition number obtained via direct numerical simulations vs. Lyapunov exponent-based. Colors
+denote the number of flossed Lyapunov exponents k. Parameters: g = 1, batch size b = 1, N = 80,
+epochs = 500, T = 500, gradient flossing for Ef = 500 epochs. Input xs identical to delayed XOR
+task in Fig 3D.
+
+A well-conditioned Jacobian is essential for efficient and fast learning [21, 49, 50]. Gradient
+flossing improves the condition number of the long-term Jacobian which constrains the error signal
+propagation across long time horizons in backpropagation (Fig 2). The condition number κ2 of a
+linear map A measures how close the map is to being singular and is given by the ratio of the largest
+singular value σmax and the smallest singular values σmin , so κ2 (A) = σσmax   (A)
+                                                                            min (A)
+                                                                                    . According to the
+                                            p
+rule of thumb given in [51], if κ2 (A) = 10 , one can anticipate losing at least p digits of precision
+when solving the equation Ax = b. Note that the long-term Jacobian Tt is composed of a product of
+Jacobians, which generically makes it ill-conditioned. To nevertheless quantify the condition number
+numerically, we use arbitrary-precision arithmetic with 256 bits per float. We find numerically that
+the condition number of Tt exponentially diverges with the number of time steps (Fig 2A). We
+compare the numerically measured condition number κ2 with an asymptotic approximation of the
+condition number based on Lyapunov exponents that are calculated in the forward pass and find a
+good match (Fig 2A).
+Our theoretical estimate of the condition number κ2 of an orthonormal system Q of size N × m that
+is temporally evolved by the long-term Jacobian Tt is:
+
+                                         e t+τ ) = κ2 Tt (hτ )Qt = σ1 (Tt (hτ )) ≈ exp ((λ1 − λm )(t − τ )) .
+                                                                
+                                     κ2 (Q                                                                                                            (10)
+                                                                   σm (Tt (hτ ))
+where σ1 (Tt (hτ )) and σm (Tt (hτ )) are the first and mth singular value of the long-term Jacobian.
+We note that this theoretical estimate of the condition number follows from the asymptotic definition
+of Lyapunov exponents and should be exact in the limit of long times. We find that gradient flossing
+reduces the condition number by a factor whose magnitude increases exponentially with time (orange
+in Fig 2A). Thus, we can expect that gradient flossing has a stronger effect on problems with a long
+time horizon to bridge. We will later confirm this numerically.
+Moreover, Lyapunov exponents enable the estimation of the number of gradient dimensions available
+for the backpropagation of error signals. Generally, the long-term Jacobian is ill-conditioned, however,
+the Lyapunov spectrum provides for a given number of tangent space dimensions an estimate of the
+condition number. This indicates how close to singular the gradient signal for a given number of
+tangent space dimensions is. Given a fixed acceptable condition number—determined, for example,
+by noise level or floating-point precision—we observe that gradient flossing increases the number of
+usable tangent space dimensions for backpropagation (Fig 2B).
+
+
+                                                                                       6
+Finally, we show that the asymptotic estimate of the condition number based on Lyapunov exponents
+can even predict differences in condition number that originate from finite network size N (Fig 2C).
+We emphasize that this goes beyond mean-field methods, which become exact only in the large-
+network limit N → ∞ and usually do not capture finite-size effects [52] (see appendix G).
+
+6      Initial Gradient Flossing Improves Trainability
+     A             delayed copy task       B                                                          delayed XOR task
+            10 2                                                             10 2
+test loss
+
+
+
+
+                                                                 test loss
+                                          no gradient flossing                                           no gradient flossing
+            10 3                          gradient flossing                                              gradient flossing
+                                                                             10 3
+            10 4
+                   0   2000      4000 6000         8000 10000                            0        2000    4000 6000         8000 10000
+                                    Epochs                                                                   Epochs
+     C                      delayed copy task                         D                               delayed XOR task
+            10 2                                                             10 2
+
+            10 3                                                             10 3
+test loss
+
+
+
+
+                                                                 test loss
+
+            10 4
+                                                                             10 4
+
+                       20           40            60        80                      10       20      30 40 50 60                 70   80
+                            task difficulty (delay d)                                                task difficulty (delay d)
+Figure 3: Gradient flossing improves trainability on tasks that involve long time horizons A) Test
+error for Vanilla RNNs trained on delayed copy task yt = xt−d for d = 40 with and without gradient
+flossing flossing. Solid lines are medians across 5 network realizations. B) Same as A for delayed
+XOR task with yt = |xt−d/2 − xt−d |. C) Mean final test loss as a function of task difficulty (delay d)
+for delayed copy task. D) Mean final test loss as a function of task difficulty (delay d) for delayed
+XOR task. Parameters: g = 1, batch size b = 16, N = 80, epochs = 104 , T = 300, gradient flossing
+for Ef = 500 epochs on k = 75 before training. Shaded regions in C and D indicate the 20% and
+80% percentiles and solid line shows mean. Dots are individual runs. Task loss: MSE(y, ŷ).
+
+We next present numerical results on two tasks with variable spatial and temporal complexity,
+demonstrating that gradient flossing before training improves the trainability of Vanilla RNNs. We
+call gradient flossing before training in the following preflossing. For preflossing, we first initialize the
+                                                  Pk
+network randomly, then minimize Lflossing = i=1 λ2i using the ADAM optimizer and subsequently
+train on the tasks. We deliberately do not use sequential MNIST or similar toy tasks commonly used
+to probe exploding/vanishing gradients, because we want a task where the structure of long-range
+dependencies in the data is transparent and can be varied as desired.
+First, we consider the delayed copy task, where a scalar stream of random input numbers x must be
+reproduced by the output y delayed by d time steps, i.e. yt = xt−d . Although the task itself is trivial
+and can be solved even by a linear network through a delay line (see appendix E), RNNs encounter
+vanishing gradients for large delays d during training even with ’critical’ initialization with g = 1.
+Our experiments show that gradient flossing can substantially improve the performance of RNNs
+on this task (Fig 3A, C). While Vanilla RNNs without gradient flossing fail to train reliably beyond
+d = 20, Vanilla RNNs with gradient flossing can be reliably trained for d = 40 (Fig 3C). Note that
+we flossed here k = 40 Lyapunov exponents before training. We will later investigate the role of the
+number of flossed Lyapunov exponents.
+Second, we consider the temporal XOR task, which requires the RNN to perform a nonlinear input-
+output computation on a sequential stream of scalar inputs, i.e., yt = |xt−d/2 − xt−d |, where d
+denotes a time delay of d time steps (For details see appendix H). Fig 3D demonstrates that gradient
+flossing helps to train networks on a substantially longer delay d. We found similar improvements
+through gradient flossing for RNNs initialized with orthogonal weights (see appendix G).
+
+
+                                                                 7
+7      Gradient Flossing During Training
+     A          delayed temporal XOR     B                                                             delayed spatial XOR
+                    100        flossing during training                                    100
+test accuracy (%)    90        preflossing                                                  90
+
+
+
+
+                                                                       test accuracy (%)
+                               no flossing
+                     80                                                                     80
+                     70                                                                     70
+                     60                                                                     60
+                     50                                                                     50
+                          0       2000     4000 6000      8000 10000                             0   2000   4000 6000    8000 10000
+                                              Epochs                                                           Epochs
+     C                                                                      D
+                    100                                                                    100
+                     90                                                                     90
+test accuracy (%)
+
+
+
+
+                                                                       test accuracy (%)
+                     80                                                                     80                     flossing during training
+                                                                                                                   preflossing
+                     70                                                                     70                     no flossing
+                     60                                                                     60
+                     50                                                                     50
+                          10     20
+                      30 40 50 60 70                         10 20 30 40 50 60 70
+                     complexity (delay d)                               complexity (delay d)
+Figure 4: Gradient flossing during training further improves trainability
+A) Test accuracy for Vanilla RNNs trained on delayed temporal binary XOR task yt = xt−d/2 ⊕ xt−d
+with gradient flossing during training (green), preflossing (gradient flossing before training) (orange),
+and with no gradient flossing (blue) for d = 70. Solid lines are mean across 20 network realizations,
+individual network realizations shown in transparent fine lines. B) Same as A for delayed spatial
+XOR task with yt = x1t−d ⊕ x2t−d ⊕ x3t−d . Parameters (g = 1, batch size b = 16). C) Test accuracy
+as a function of task difficulty (delay d) for delayed temporal XOR task. D) Test accuracy as a
+function of task difficulty (delay d) for delayed spatial XOR task. Parameters: g = 1, batch size
+b = 16, N = 80, epochs = 104 , T = 300, gradient flossing for Ef = 500 epochs on k = 75 before
+training and during training for green lines, and only before training for orange lines. Same plotting
+conventions as previous figure. Task loss: cross-entropy between y and ŷ.
+We next investigate the effects of gradient flossing during the training and find that gradient flossing
+during training can further improve trainability. We trained RNNs on two more challenging tasks
+with variable temporal complexity and performed gradient flossing either both during and before
+training, only before training, or not at all.
+Fig 4A shows the test accuracy for Vanilla RNNs training on the delayed temporal XOR task
+yt = xt−d/2 ⊕ xt−d with random Bernoulli process x ∈ {0, 1}. The accuracy of Vanilla RNNs
+falls to chance level for d ≥ 40 (Fig 4C). With gradient flossing before training, the trainability
+can be improved, but still goes to chance level for d = 70. In contrast, for networks with gradient
+flossing during training, the accuracy is improved to > 80% at d = 70. In this case, we preflossed
+for 500 epochs before task training and again after 500 epochs of training on the task. In Fig 4B,
+D the networks have to perform the nonlinear XOR operation yt = x1t−d ⊕ x2t−d ⊕ x3t−d on a
+three-dimensional binary input signal x1 , x2 , and x3 and generate the correct output with a delay of
+d steps. While the solution of the task itself is not difficult and could even be implemented by hand
+(see appendix), the task is challenging for backpropagation through time because nonlinear temporal
+associations bridging long time horizons have to be formed. Again, we observe that gradient flossing
+before training improves the performance compared to baseline, but starts failing for long delays
+d > 60. In contrast, networks that are also flossed during training can solve even more difficult tasks
+(Fig 4D). We find that after gradient flossing, the norm of the error gradient with respect to initial
+conditions h0 is amplified (appendix C). Interestingly, gradient flossing can also be detrimental to
+task performance if it is continued throughout all training epochs (appendix C)
+We note that merely regularizing the spectral radius of the recurrent weight matrix W or the individual
+one-step Jacobians Ds numerically or based on mean-field theory does not yield such a training
+improvement. This suggests that taking the temporal correlations between Jacobians Ds into account
+is important for improving trainability.
+
+
+                                                                       8
+7.1   Gradient Flossing for Different Numbers of Flossed Lyapunov Exponents
+
+We investigated how many Lyapunov exponents k have to be flossed to achieve an improvement in
+training success (Fig 5). We studied this in the binary temporal delayed XOR task with gradient
+flossing during training (same as Fig 3) and varied the task difficulty by changing the delay d.
+We found that as the task becomes more difficult, networks where not enough Lyapunov exponents
+k are flossed begin to fall below 100% test accuracy (Fig 5A). Correspondingly, when measuring
+final test accuracy as a function of the number of flossed Lyapunov exponents, we observed that
+more Lyapunov exponent k have to be flossed to achieve 100% accuracy as the tasks become more
+difficult (Fig 5B). We also show the entire parameter plane of median test accuracy as a function of
+both number of flossed Lyapunov exponents k and task difficulty (delay d), and found the same trend
+(Fig 5B). Overall, we found that tasks with larger delay d require more Lyapunov exponents close to
+zero. We note that this might also partially be caused by the ’streaming’ nature of the task: in our
+tasks, longer delays automatically imply that more values have to be stored as at any moment all the
+values in the ’delay line’ have to be remembered to successfully solve the tasks. This is different from
+tasks where a single variable has to be stored and recalled after a long delay. It would be interesting
+to study tasks where the number of delay steps and the number of items in memory can be varied
+independently.
+Finally, we did the same analysis on networks with only preflossing (gradient flossing before training)
+and found the same trend (supplement Fig 7D), however, in that case even if all N Lyapunov
+exponents were flossed, thus k = N , they were not able to solve the most difficult tasks. This seems
+to indicate that gradient flossing during training cannot be replaced by just gradient flossing more
+Lyapunov exponents before training.
+
+
+
+
+Figure 5: Gradient flossing for different numbers of flossed Lyapunov exponents
+A) Test accuracy for delayed temporal XOR task as a function of delay d with different numbers
+flossed Lyapunov exponents k. B) Same data as A but here test accuracy as a function of number
+of flossed Lyapunov exponents k. Parameters: g = 1, batch size b = 16, N = 80, epochs = 104
+for delayed temporal XOR, epochs = 5000 for delayed spatial XOR, T = 300, gradient flossing
+for Ef = 500 epochs before training and during training for A, B. Shaded areas are 25% and 75%
+percentile, solid lines are means, transparent dots are individual simulations, task loss: cross-entropy
+between y and ŷ.
+8     Limitations
+The mathematical connection between Lyapunov exponents and backpropagation through time
+exploited in gradient flossing is rigorously established only in the infinite-time limit. It would be
+interesting to extend our analysis to finite-time Lyapunov exponents.
+Furthermore, the backpropagation through time gradient involves a sum over products of Jacobians
+of different time periods t − τ , but the Lyapunov exponent only considers the asymptotic longest
+product. Additionally, Lyapunov exponents characterize the asymptotic dynamics on the attractor of
+the dynamics, whereas RNNs often exploit transient dynamics from some initial conditions outside
+or towards the attractor.
+Although our proposed method focuses on exploiting Lyapunov exponents, it neglects the geometry
+of covariant Lyapunov vectors [53], which could be used to improve training performance, speed,
+and reliability. Additionally, it is important to investigate how sensitive the method is to the choice
+of orthonormal basis employed because it is only guaranteed to become unique asymptotically [54].
+
+
+                                                   9
+Finally, the computational cost of our method scales with O(N k 2 ), where N is the network size
+and k is the number of Lyapunov exponents calculated. To reduce the computational cost, we
+suggest doing QR decomposition only sufficiently often to ensure that the orthonormal system is
+not ill-conditioned and using gradient flossing only intermittently or as pretraining. One could also
+calculate the Lyapunov spectrum for a shorter time interval or use a cheaper proxy for the Lyapunov
+spectrum and investigate more efficient gradient flossing schedules.
+
+
+9   Discussion
+
+We tackle the problem of gradient signal propagation in recurrent neural networks through a dy-
+namical systems lens. We introduce a novel method called gradient flossing that addresses the
+problem of gradient instability during training. Our approach enhances gradient signal stability both
+before and during training by regularizing Lyapunov exponents. By keeping the long-term Jacobian
+well-conditioned, gradient flossing optimizes both training accuracy and speed. To achieve this,
+we combine established dynamical systems methods for calculating Lyapunov exponents with an
+analytical pullback of the QR factorization. This allows us to establish and maintain gradient stability
+in a in a manner that is memory-efficient, numerically stable, and exact across long time horizons.
+Our method is applicable to arbitrary RNN architectures, nonlinearities, and also neural ODEs [55].
+Empirically, pre-training with gradient flossing enhances both training speed and accuracy. For
+difficult temporal credit assignment problems, gradient flossing throughout training further enhances
+signal propagation. We also demonstrate the versatility of our method on a set of synthetic tasks
+with controllable time-complexity and show that it can be combined with other approaches to tackle
+exploding and vanishing gradients, such as dynamic mean-field theory for initialization, orthogonal
+initialization and specialized single units, such as LSTMs.
+Prior research on exploding and vanishing gradients mainly focused on selecting network architectures
+that are less prone to exploding/vanishing gradients or finding parameter initializations that provide
+well-conditioned gradients at least at the beginning of training. Our introduced gradient flossing can
+be seen as a complementary approach that can further enhance gradient stability throughout training.
+Compared to the work on picking good parameter initializations based on random matrix theory [41]
+and mean-field heuristics [40], gradient flossing provides several improvements: First, mean-field
+theory only considers the gradient flow at initialization, while gradient flossing can maintain gradient
+flow and well-conditioned Jacobians throughout the training process. Second, random matrix theory
+and mean-field heuristics are usually confined to the limit of large networks [52], while gradient
+flossing can be used for networks of any size. The link between Lyapunov exponents and the gradients
+of backpropagation through time has been described previously [9, 12] and has been spelled out
+analytically and studied numerically [10, 11, 56, 57, 58]. In contrast, we use Lyapunov exponents
+here not only as a diagnostic tool for gradient stability but also to show that they can directly be part
+of the cure for exploding and vanishing gradients.
+Future investigations could delve further into the roles of the second to N th Lyapunov exponents
+in trainability, and how it is related to the task at hand, the rank of the parameter update, the dimen-
+sionality of the solution space, as well as the network dynamics (see also [32, 59, 60]). Our results
+suggest a trade-off between trainability across long time horizons and the nonlinear task demands
+that is worth exploring in more detail (appendix C). Applying gradient flossing to real-time recurrent
+learning and its biologically plausible variants is another avenue [61]. Extending gradient flossing to
+feedforward networks, state-space models and transformers is a promising avenue for future research
+(see also [62, 63]). While Lyapunov exponents are only strictly defined for dynamical systems, such
+as maps or flows that are endomorphisms, the long-term Jacobian of deep feedforward networks
+could be treated similarly. This could also provide a link between the stability of the network against
+adversarial examples and its dynamic stability, as measured by Lyapunov exponents. Given that
+time-varying input can suppress chaos in recurrent networks [9, 12, 64, 65, 66, 67], we anticipate they
+may exacerbate vanishing gradients. Gradient flossing could also be applied in neural architecture
+search, to identify and optimize trainable networks. Finally, gradient flossing is applicable to other
+model parameters, as well. For instance, gradients of Lyapunov exponents with respect to single-
+unit parameters could optimize the activation function and single-neuron biophysics in biologically
+plausible neuron models.
+
+
+
+                                                   10
+Acknowledgments and Disclosure of Funding
+I thank E. Izhikevich, F. Wolf, S. Goedeke, J. Lindner, L.F. Abbott, L. Logiaco, M. Schottdorf, G.
+Wayne and P. Sokol for fruitful discussions and J. Stone, L.F. Abbott, M. Ding, O. Marshall, S.
+Goedeke, S. Lippl, M.P. Puelma-Touzel, J. Lindner and the reviewers for feedback on the manuscript.
+I thank Jinguo Liu for the Julia package BackwardsLinalg.jl. Research supported by NSF NeuroNex
+Award (DBI-1707398), the Gatsby Charitable Foundation (GAT3708), the Simons Collaboration for
+the Global Brain (542939SPI), and the Swartz Foundation (2021-6).
+
+References
+ [1] P. WERBOS. Beyond Regression :. Ph. D. dissertation, Harvard University, 1974.
+ [2] DB Parker. Learning-logic (TR-47). Center for Computational Research in Economics and Management
+     Science. MIT-Press, Cambridge, Mass, 8, 1985.
+ [3] Y. LECUN. Une procedure d’apprentissage ponr reseau a seuil asymetrique. Proceedings of Cognitiva 85,
+     pages 599–604, 1985.
+ [4] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-
+     propagating errors. Nature, 323(6088):533, October 1986.
+ [5] Sepp Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. PhD thesis, 1991.
+ [6] Sepp Hochreiter and Jürgen Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735–
+     1780, 1997.
+ [7] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult.
+     IEEE Transactions on Neural Networks, 5(2):157–166, March 1994.
+ [8] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training Recurrent Neural
+     Networks. arXiv:1211.5063 [cs], November 2012. arXiv: 1211.5063.
+ [9] Rainer Engelken, Fred Wolf, and L. F. Abbott. Lyapunov spectra of chaotic recurrent neural networks.
+     arXiv:2006.02427 [nlin, q-bio], June 2020. arXiv: 2006.02427.
+[10] Jonas Mikhaeil, Zahra Monfared, and Daniel Durstewitz. On the difficulty of learning chaotic dynamics
+     with RNNs. Advances in Neural Information Processing Systems, 35:11297–11312, December 2022.
+[11] Il Memming Park, Ábel Ságodi, and Piotr Aleksander Sokół. Persistent learning signals and working
+     memory without continuous attractors. ArXiv, page arXiv:2308.12585v1, August 2023.
+[12] Rainer Engelken, Fred Wolf, and L. F. Abbott. Lyapunov spectra of chaotic recurrent neural networks.
+     Physical Review Research, 5(4):043044, October 2023.
+[13] J. P. Eckmann and D. Ruelle. Ergodic theory of chaos and strange attractors. Reviews of Modern Physics,
+     57(3):617–656, July 1985.
+[14] Arkady Pikovsky and Antonio Politi. Lyapunov Exponents: A Tool to Explore Complex Dynamics.
+     Cambridge University Press, Cambridge, February 2016.
+[15] Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. On the Properties of
+     Neural Machine Translation: Encoder-Decoder Approaches. Technical report, September 2014. ADS
+     Bibcode: 2014arXiv1409.1259C Type: article.
+[16] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural
+     networks. In Proceedings of the 30th International Conference on International Conference on Machine
+     Learning - Volume 28, ICML’13, pages III–1310–III–1318, Atlanta, GA, USA, June 2013. JMLR.org.
+[17] Tomáš Mikolov. Statistical language models based on neural networks. Ph.D. thesis, Brno University of
+     Technology, Faculty of Information Technology, Brno, CZ, 2012.
+[18] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, November 2016.
+     Google-Books-ID: omivDQAAQBAJ.
+[19] Sergey Ioffe and Christian Szegedy. Batch Normalization: Accelerating Deep Network Training by
+     Reducing Internal Covariate Shift. In Proceedings of the 32nd International Conference on Machine
+     Learning, pages 448–456. PMLR, June 2015.
+
+
+                                                     11
+[20] Bo Chang, Minmin Chen, Eldad Haber, and Ed H. Chi. AntisymmetricRNN: A Dynamical System View
+     on Recurrent Neural Networks. International Conference on Learning Representations, December 2018.
+
+[21] Andrew M. Saxe, James L. McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of
+     learning in deep linear neural networks. arXiv:1312.6120 [cond-mat, q-bio, stat], December 2013. arXiv:
+     1312.6120.
+
+[22] Martin Arjovsky, Amar Shah, and Yoshua Bengio. Unitary Evolution Recurrent Neural Networks. In
+     Proceedings of The 33rd International Conference on Machine Learning, pages 1120–1128. PMLR, June
+     2016.
+
+[23] Kyle Helfrich, Devin Willmott, and Qiang Ye. Orthogonal Recurrent Neural Networks with Scaled Cayley
+     Transform. In Proceedings of the 35th International Conference on Machine Learning, pages 1969–1978.
+     PMLR, July 2018.
+
+[24] T. Konstantin Rusch and Siddhartha Mishra. Coupled Oscillatory Recurrent Neural Network (coRNN):
+     An accurate and (gradient) stable architecture for learning long time dependencies. arXiv e-prints, page
+     arXiv:2010.00951, October 2020.
+
+[25] N. Benjamin Erichson, Omri Azencot, Alejandro Queiruga, Liam Hodgkinson, and Michael W. Mahoney.
+     Lipschitz Recurrent Neural Networks. International Conference on Learning Representations, January
+     2021.
+
+[26] Antonio Orvieto, Samuel L Smith, Albert Gu, Anushan Fernando, Caglar Gulcehre, Razvan Pascanu, and
+     Soham De. Resurrecting Recurrent Neural Networks for Long Sequences. Technical report, March 2023.
+     ADS Bibcode: 2023arXiv230306349O Type: article.
+
+[27] Herbert Jaeger. The “echo state” approach to analysing and training recurrent neural networks-with an
+     erratum note. Bonn, Germany: German National Research Center for Information Technology GMD
+     Technical Report, 148:34, 2001.
+
+[28] Mustafa C. Ozturk, Dongming Xu, and José C. Príncipe. Analysis and Design of Echo State Networks.
+     Neural Computation, 19(1):111–138, January 2007.
+
+[29] Rupesh K Srivastava, Klaus Greff, and Jürgen Schmidhuber. Training Very Deep Networks. In Advances
+     in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015.
+
+[30] Julian Georg Zilly, Rupesh Kumar Srivastava, Jan Koutnik, and Jürgen Schmidhuber. Recurrent Highway
+     Networks. In Proceedings of the 34th International Conference on Machine Learning, pages 4189–4198.
+     PMLR, July 2017.
+
+[31] Piotr A. Sokół, Ian Jordan, Eben Kadile, and Il Memming Park. Adjoint Dynamics of Stable Limit Cycle
+     Neural Networks. In 2019 53rd Asilomar Conference on Signals, Systems, and Computers, pages 884–887,
+     November 2019. ISSN: 2576-2303.
+
+[32] Piotr A. Sokół. Geometry of Learning and Representations in Neural Networks. PhD thesis, Stony Brook
+     University, May 2023.
+
+[33] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep Sparse Rectifier Neural Networks. volume 15,
+     pages 315–323, April 2011.
+
+[34] Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha Sohl-Dickstein, and Surya Ganguli. Exponential
+     expressivity in deep neural networks through transient chaos. arXiv:1606.05340 [cond-mat, stat], June
+     2016. arXiv: 1606.05340.
+
+[35] Minmin Chen, Jeffrey Pennington, and Samuel S. Schoenholz. Dynamical Isometry and a Mean Field
+     Theory of RNNs: Gating Enables Signal Propagation in Recurrent Neural Networks. arXiv:1806.05394
+     [cs, stat], August 2018. arXiv: 1806.05394.
+
+[36] Boris Hanin and Mihai Nica. Products of Many Large Random Matrices and Gradients in Deep Neural
+     Networks. December 2018.
+
+[37] Samuel S. Schoenholz, Justin Gilmer, Surya Ganguli, and Jascha Sohl-Dickstein. Deep Information
+     Propagation. November 2016.
+
+[38] Boris Hanin and David Rolnick. How to Start Training: The Effect of Initialization and Architecture. In
+     Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.
+
+
+                                                    12
+[39] Piotr A. Sokol and Il Memming Park. Information Geometry of Orthogonal Initializations and Training.
+     Technical report, October 2018. ADS Bibcode: 2018arXiv181003785S Type: article.
+[40] Dar Gilboa, Bo Chang, Minmin Chen, Greg Yang, Samuel S. Schoenholz, Ed H. Chi, and Jeffrey
+     Pennington. Dynamical Isometry and a Mean Field Theory of LSTMs and GRUs. January 2019.
+[41] Tankut Can, Kamesh Krishnamurthy, and David J. Schwab. Gating creates slow modes and controls
+     phase-space complexity in GRUs and LSTMs. arXiv:2002.00025 [cond-mat, stat], January 2020. arXiv:
+     2002.00025.
+[42] Valery Iustinovich Oseledets. A multiplicative ergodic theorem. Characteristic Ljapunov, exponents of
+     dynamical systems. Trudy Moskovskogo Matematicheskogo Obshchestva, 19:179–210, 1968.
+[43] Karlheinz Geist, Ulrich Parlitz, and Werner Lauterborn. Comparison of Different Methods for Computing
+     Lyapunov Exponents. Progress of Theoretical Physics, 83(5):875–893, May 1990.
+[44] Giancarlo Benettin, Luigi Galgani, Antonio Giorgilli, and Jean-Marie Strelcyn. Lyapunov Characteristic
+     Exponents for smooth dynamical systems and for hamiltonian systems; A method for computing all of
+     them. Part 2: Numerical application. Meccanica, 15(1):21–30, March 1980.
+[45] Hai-Jun Liao, Jin-Guo Liu, Lei Wang, and Tao Xiang. Differentiable Programming Tensor Networks.
+     Physical Review X, 9(3):031041, September 2019. arXiv:1903.09650 [cond-mat, physics:quant-ph].
+[46] S. F. Walter and L. Lehmann. Algorithmic Differentiation of Linear Algebra Functions with Application
+     in Optimum Experimental Design (Extended Version). Technical report, January 2010. ADS Bibcode:
+     2010arXiv1001.1654W Type: article.
+[47] Robert Schreiber and Charles Van Loan. A Storage-Efficient $WY$ Representation for Products of
+     Householder Transformations. SIAM Journal on Scientific and Statistical Computing, 10(1):53–57, January
+     1989.
+[48] Matthias Seeger, Asmus Hetzel, Zhenwen Dai, Eric Meissner, and Neil D. Lawrence. Auto-Differentiating
+     Linear Algebra. Technical report, October 2017. ADS Bibcode: 2017arXiv171008717S Type: article.
+[49] Jeffrey Pennington, Samuel Schoenholz, and Surya Ganguli. Resurrecting the sigmoid in deep learning
+     through dynamical isometry: theory and practice. In Advances in Neural Information Processing Systems,
+     volume 30. Curran Associates, Inc., 2017.
+[50] Jeffrey Pennington, Samuel S. Schoenholz, and Surya Ganguli. The Emergence of Spectral Universality in
+     Deep Networks. arXiv:1802.09979 [cs, stat], February 2018. arXiv: 1802.09979.
+[51] E. Cheney and David Kincaid. Numerical Mathematics and Computing. Cengage Learning, August 2007.
+     Google-Books-ID: ZUfVZELlrMEC.
+[52] Yasaman Bahri, Jonathan Kadmon, Jeffrey Pennington, Sam S. Schoenholz, Jascha Sohl-Dickstein, and
+     Surya Ganguli. Statistical Mechanics of Deep Learning. Annual Review of Condensed Matter Physics,
+     11(1):501–528, 2020.
+[53] F. Ginelli, P. Poggi, A. Turchi, H. Chaté, R. Livi, and A. Politi. Characterizing Dynamics with Covariant
+     Lyapunov Vectors. Physical Review Letters, 99(13):130601, September 2007.
+[54] Sergey V. Ershov and Alexei B. Potapov. On the concept of stationary Lyapunov basis. Physica D:
+     Nonlinear Phenomena, 118(3):167–198, July 1998.
+[55] Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural Ordinary Differential
+     Equations. In Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc.,
+     2018.
+[56] Javed Lindner. Investigating the exploding and vanishing gradients problem with Lyapunov exponents.
+     Master’s thesis, RWTH Aaachen, Aaachen/Juelich, 2021.
+[57] Ryan Vogt, Maximilian Puelma Touzel, Eli Shlizerman, and Guillaume Lajoie. On Lyapunov Exponents
+     for RNNs: Understanding Information Propagation Using Dynamical Systems Tools. Frontiers in Applied
+     Mathematics and Statistics, 8, 2022.
+[58] Kamesh Krishnamurthy, Tankut Can, and David J. Schwab. Theory of Gating in Recurrent Neural Networks.
+     Physical Review X, 12(1):011011, January 2022.
+[59] L. S. Pontryagin. Mathematical Theory of Optimal Processes: The Mathematical Theory of Optimal
+     Processes. Routledge, New York, 1st edition edition, March 1987.
+
+
+                                                     13
+[60] Daniel Liberzon. Calculus of Variations and Optimal Control Theory: A Concise Introduction. In Calculus
+     of Variations and Optimal Control Theory. Princeton University Press, December 2011.
+
+[61] Owen Marschall, Kyunghyun Cho, and Cristina Savin. A unified framework of online learning algorithms
+     for training recurrent neural networks. The Journal of Machine Learning Research, 21(1):135:5320–
+     135:5353, January 2020.
+
+[62] Judy Hoffman, Daniel A. Roberts, and Sho Yaida. Robust Learning with Jacobian Regularization. Technical
+     report, August 2019. ADS Bibcode: 2019arXiv190802729H Type: article.
+[63] Hongyi Zhang, Yann N. Dauphin, and Tengyu Ma. Fixup Initialization: Residual Learning Without
+     Normalization. Technical report, January 2019. ADS Bibcode: 2019arXiv190109321Z Type: article.
+
+[64] L. Molgedey, J. Schuchhardt, and H. G. Schuster. Suppressing chaos in neural networks by noise. Physical
+     Review Letters, 69(26):3717–3719, December 1992.
+
+[65] Kanaka Rajan, L. F. Abbott, and Haim Sompolinsky. Stimulus-dependent suppression of chaos in recurrent
+     neural networks. Physical Review E, 82(1):011903, July 2010.
+
+[66] Jannis Schuecker, Sven Goedeke, David Dahmen, and Moritz Helias. Functional methods for disordered
+     neural networks. arXiv:1605.06758 [cond-mat, q-bio], May 2016. arXiv: 1605.06758.
+
+[67] Rainer Engelken, Alessandro Ingrosso, Ramin Khajeh, Sven Goedeke, and L. F. Abbott. Input correlations
+     impede suppression of chaos and learning in balanced firing-rate networks. PLOS Computational Biology,
+     18(12):e1010590, December 2022.
+
+[68] Edward Ott. Chaos in Dynamical Systems. Cambridge University Press, August 2002. Google-Books-ID:
+     PfXoAwAAQBAJ.
+[69] Angelo Vulpiani, Fabio Cecconi, and Massimo Cencini. Chaos: From Simple Models to Complex Systems.
+     World Scientific Pub Co Inc, Hackensack, NJ, September 2009.
+
+
+
+
+                                                     14
+A         Backpropagation Through QR Decomposition
+The backward pass of the QR decomposition is given by [45, 46, 47, 48]
+                                 Q = Q + Q copyltu(M) R−T
+                                                               
+                                                                                                              (11)
+                     T    T
+where M = RR − Q Q and the copyltu function generates a symmetric matrix by copying the lower
+triangle of the input matrix to its upper triangle, with the element [copyltu(M )]ij = Mmax(i,j),min(i,j)
+[45, 46, 47, 48]. Adjoint variable are written here as T = ∂L/∂T .
+Using an analytical pullback is more memory-efficient and less computationally costly than directly doing
+automatic differentiation through the QR-decomposition. Moreover, from a practical perspective, for QR
+decomposition, often BLAS/LAPACK routines are utilized which are not amenable to common differentiable
+programming frameworks like TensorFlow, PyTorch, JAX and Zygote. In our implementation of gradient
+flossing, we used the Julia package BackwardsLinalg.jl by Jinguo Liu available at here .
+
+B       Further Details and Analysis of Gradient Flossing
+An example implementation of gradient flossing in Flux, a machine learning library in Julia is available here.
+We are actively developing implementations for other widely used differentiable programming frameworks.
+
+B.1       Gradient Flossing for recurrent LSTM and ReLU networks
+
+      A                         LSTM                         C                         ReLU
+           0.2                             target 1                 6                             target 1
+                                           actual 1                                               actual 1
+                                                                    4
+           0.0
+                                                                    2
+    1
+
+
+
+
+                                                             1
+
+
+
+
+           0.2
+                                                                    0
+           0.4
+                                                                    2
+                 0   20       40      60   80         100               0       20   40      60   80         100
+                                Epochs                                                 Epochs
+      B                                                      D 101
+          10 1
+          10 2                                                   10 1
+          10 3
+ 2
+
+
+
+
+                                                            2
+ 1
+
+
+
+
+                                                            1
+
+
+
+
+          10 4                                                   10 3
+          10 5
+          10 6                                                   10 5
+                 0   20       40      60   80         100                   0   20   40    60     80    100
+                                Epochs                                                Epochs
+
+
+Figure 6: Gradient flossing for recurrent LSTM networks and recurrent ReLU networks A) First
+Lyapunov exponent of LSTM network as a function of training epochs. Minimizing the mean
+squared error between estimated first Lyapunov exponent and target Lyapunov exponent λ1 = 0
+by gradient descent. First Lyapunov exponent of LSTM network (solid lines) converges to target
+value (thick dashed lines) within less than 100 epochs. 10 random LSTM RNNs were initialized with
+Gaussian recurrent weights, where standard deviations of weight scaling were drawn g ∼ Unif(0, 1).
+B) Gradient flossing minimizes the square of the first Lyapunov exponent of random recurrent
+LSTM networks over epochs. C) Same as A for recurrent ReLU network. Here networks were
+initialized with Gaussian recurrent weights Wij ∼ N (−0.1, g 2 /N ) where values of g were drawn
+g ∼ Unif(0, 1) D) B) for recurrent ReLU network. Parameters: network size N = 32 with 10
+network realizations. Shaded regions in B, D are 25% and 75% percentiles, solid line shows median.
+
+We demonstrate that gradient flossing can also be applied to recurrent LSTM and ReLU networks in Fig 6. To
+this end, we generated random LSTM networks where the weights of all the different gates and biases were
+
+
+                                                        15
+independently and identically distributed (i.i.d.) and sampled from Gaussian distributions of different variance.
+Our results show that gradient flossing can also constrain the Lyapunov exponent to be close to zero. The
+dynamics of each of the N LSTM units follows the map [6]:
+                                      ft    =    σg (Uf ht−1 + Wf xt + bf )                                      (12)
+                                      ot    =    σg (Uo ht−1 + Wo xt + bo )                                      (13)
+                                      it    =    σg (Ui ht−1 + Wi xt + bi )                                      (14)
+                                      c̃t   =    σh (Uc ht−1 + Wc xt + bc )                                      (15)
+                                      ct    =    ft ⊙ ct−1 + it ⊙ c̃t                                            (16)
+                                      ht    =    ot ⊙ ϕ(ct )                                                     (17)
+                                                              1
+where ⊙ denotes the Hadamard product, σg (x) = 1+exp(−x)              is the sigmoid function, σh (x) = tanh(x) and
+                                                                2
+entries of the matrices Ux are drawn from Ux ∼ N (0, gx /N ). For simplicity, the bias terms bx are scalars.
+Subscripts f , o and i denote respectively the forget gate, the output gate, the input gate, and c is the cell state.
+In each LSTM unit, there are two dynamic variables c and h, and three gates f , o, and i that control the flow
+if signals into and out of the cell c. We set the values gih , gix , gf x , bf , gch , gcx , gcx , gox to be uniformly
+distributed between 0 and 1 and initialize bi , gf h ,bc ,b0 as zero.
+During gradient flossing, the actual Lyapunov exponents of different random network realizations converge close
+to the target Lyapunov exponent λtarget
+                                     1   = 0 in fewer than 100 epochs as shown in Fig 6A. Fig 6B shows that
+the squared Lyapunov exponents converge towards zero. We note that for LSTM networks, a target Lyapunov
+exponent of λtarget
+               1    = −1 is achieved after 100 gradient flossing steps only for a subset of random network
+realizations (not shown). We speculate that behavior is influenced by the gating structure of LSTM units,
+which seems to naturally place the first Lyapunov exponent close to zero for certain initializations (See also
+[9, 12, 41, 58]).
+For the recurrent ReLU networks, we considered the same Vanilla RNN dynamics as in the main manuscript in
+Eq 9
+                              hs+1 = f (hs , xs+1 ) = Wϕ(hs ) + Vxs+1 ,
+The initial entries of W are drawn independently from a Gaussian distribution with a negative mean of −0.1
+and variance g 2 /N , where g is a gain parameter that controls the heterogeneity of weights. We use the transfer
+function ϕ(x) = max(x, 0). xs is a sequence of inputs and V is the input weight. xs is a stream of i.i.d.
+Gaussian input xs ∼ N (0, 1) and the input weights V are N (0, 1). Both W and V are trained during gradient
+flossing. We found that some ReLU network had initially unstable dynamics with positive Lyapunov exponents
+Fig 6C. However, during gradient flossing, these unstable networks were quickly stabilized. Fig 6D shows that
+the squared Lyapunov exponents of ReLU networks converge towards zero.
+
+B.2    Additional Results for Different Numbers of Flossed Lyapunov Exponents
+Additionally to the main Fig 5, we did the same analysis on networks with only preflossing (gradient flossing
+before training) and found that more Lyapunov exponent k have to be flossed to achieve 100% accuracy as the
+tasks become more difficult (Fig 7D), however, in that case even if all N Lyapunov exponents were flossed, thus
+k = N , they were not able to solve the most difficult tasks. This seems to indicate that gradient flossing during
+training cannot be replaced by just gradient flossing more Lyapunov exponents before training.
+
+
+C     Additional Results on Gradient Flossing Throughout Training
+We now discuss some additional results on gradient flossing throughout training. First, we analyze how gradient
+flossing affects the gradients and find that during gradient flossing, the norm of gradients that bridge many time
+steps are boosted. Moreover, subordinate singular values of the error norm of the recurrent weights are also
+boosted, indicating that gradient flossing can increase the effective rank of the parameter update. Additionally,
+we show that if gradient flossing is continued throughout training it can be detrimental to the accuracy. Finally,
+we show that Lyapunov exponents of successfully trained networks after training for the spatial delayed XOR
+task have a simple relationship to the delay d.
+
+
+D     Gradient Flossing boosts the Gradient Norm for Long Time Horizons
+In this section, we investigate the impact of gradient flossing on the norm and structure of the gradient. It is
+important to note that the complete error gradient of backpropagation through time is composed of a summation
+of products of one-step Jacobians, reflecting the number of "loops" the error signal traverses through the recurrent
+dynamics before reaching its target. Consequently, when the singular values of the long-term Jacobian are
+smaller than 1, the influence of the shorter loops typically dominates the long-term Jacobian.
+
+
+                                                         16
+                A                                                                                 B
+                     100                                                                               100
+                                     k
+ test accuracy (%)
+
+
+
+
+                                                                                   test accuracy (%)
+                             80          1                                                                     80                           delay
+                                         20                                                                                                     30
+                                         40                                                                                                     50
+                                         60                                                                                                     70
+                             60          80                                                                    60
+
+                                      20       40         60                                                        0       20     40       60   80
+                                       complexity (delay d)                                                                number of flossed i k
+                       C                                          100                                    D                                         100
+                             70                                                                                70
+      complexity (delay d)
+
+
+
+
+                                                                                        complexity (delay d)
+                                                               test accuracy (%)
+
+
+
+
+                                                                                                                                                test accuracy (%)
+                             60                                                                                60
+                             50                                                                                50
+                             40                                                                                40
+                             30                                                                                30
+                             20                                                                                20
+                             10                                                                                10
+                                  1 10 20 30 40 50 60 70 80       50                                                1 10 20 30 40 50 60 70 80      50
+                                   number of flossed i k                                                             number of flossed i k
+
+
+Figure 7: Gradient flossing for different numbers of flossed Lyapunov exponents
+A) Test accuracy for delayed temporal XOR task as a function of delay d with different numbers
+flossed Lyapunov exponents k. B) Same data as A but here test accuracy as a function of number of
+flossed Lyapunov exponents k. C) Median test accuracy for delayed temporal XOR task as a function
+of delay d and k for networks with gradient flossing during training (500 steps of gradient flossing at
+epochs e ∈ {0, 100, 200, 300, 400}). D)Same as B for preflossing only. Parameters: g = 1, batch
+size b = 16, N = 80, epochs = 104 for delayed temporal XOR, epochs = 5000 for delayed spatial
+XOR, T = 300, gradient flossing for Ef = 500 epochs before training and during training for A, B,
+C, and only before training for C. Shaded areas are 25% and 75% percentiles, solid lines are means,
+transparent dots are individual simulations, task loss: cross-entropy btw. y, ŷ.
+
+
+In our tasks, we have full control over the correlation structure of the task and thus know exactly which loop
+length of backpropagation through time is necessary for finding the correct association. We were moreover
+careful in our task design not to have any additional signals in our task that might help to bridge the long time
+scale. In the case of vanishing gradients, the gradient norm is predominantly influenced by the shorter loops,
+even though the actual signal in the gradient originates solely from the loop of length d in our task. To mitigate
+the contamination of spurious signals from shorter loops and effectively extract the gradient that spans long time
+horizons, we focus on the gradient with respect to the initial conditions h0 .
+                   τ =t−1   t−1
+                                          !                τ =t−1   t−1
+                                                                                   !
+     ∂Lt      ∂Lt X          Y ∂hτ ′ +1 ∂hτ           ∂Lt X         Y ∂hτ ′ +1              ∂Lt
+           =                                       =                                 δτ 0 =      Tt (h0 )      (18)
+     ∂h0      ∂ht            ′
+                                   ∂h τ ′    ∂h 0     ∂h t           ′
+                                                                           ∂h  τ ′          ∂ht
+                                      τ =t−l   τ =τ                                           τ =t−l                τ =τ
+
+We note that the sum conveniently drops as only the longest ’loop’, in other words, the only summand that
+contributes is the product of Jacobians going from 0 to t. By considering this gradient, we can therefore ensure
+that no undesired signals stemming from shorter loops interfere with the analysis. Moreover, we note that we
+use the binary cross entropy loss which makes the derivative ∂L
+                                                              ∂ht
+                                                                 t
+                                                                   trivial.
+In Fig 8 we show that gradient flossing boosts the gradient with respect to the initial conditions. Specifically, we
+compare two identical networks trained on the binary delayed temporal XOR task with a loop length of d = 70.
+One network is trained with gradient flossing at epochs e ∈ {0, 100, 200, 300, 400}), while the other is trained
+without gradient flossing.
+
+
+                                                                               17
+                                                                          dL
+For the network without gradient flossing, the gradient norm of | dh        0
+                                                                              | diminishes to extremely small values
+       −6
+(< 10 ) and remains small throughout training. In contrast, for the network trained with gradient flossing, each
+episode of gradient flossing causes the norm | dh dL
+                                                    0
+                                                      | to spike, surpassing values larger than 10−2 . These findings
+are direct evidence that gradient flossing boosts the gradient norm, facilitating to bridge long time horizons in
+challenging temporal credit assignment tasks. We observe that after several episodes of gradient flossing, the
+            dL
+gradient | dh 0
+                | of the networks stays around 10−4 and eventually rise up to values around 10−2 . Subsequent
+in training, the test accuracy surpasses chance level (Fig 8B). We observed this temporal relationship between
+                   dL
+gradient norm | dh    0
+                        | and training success consistently across numerous network realizations (Fig 8C and D).
+                                                     dL
+These findings suggest that the gradient norm | dh      0
+                                                          | can be a good predictor of learning success, sometimes
+hundreds of epochs before the accuracy exceeds the chance level of 50%. Indeed, when depicting the gradient
+norm aligned to the last epoch where accuracy was ≤ 50%, we see for many network realizations a gradual
+growth of gradient norm oven epochs before accuracy surpasses chance level (Fig 9A). Analogously, when
+                                                                                    dL
+plotting the accuracy as a function of epoch aligned with the last epoch with | dh    0
+                                                                                        | < 0.001, we observe for this
+                                             dL
+task that the increase of gradient norm | dh0 | reliably precedes the epoch at which the accuracy surpasses the
+chance level (See Fig 9B). We note that when measuring the overlap of the orientation of the gradient vector
+ dL
+ dh0
+     with the first covariant Lyapunov vector of the forward dynamics, we found a significant increase in overlap
+around the training epoch where the accuracy surpasses the chance level both in networks with and without
+gradient flossing. This does not come as a surprise as the covariant Lyapunov vector measure the most unstable
+(or least stable) direction in the tangent space of a trajectory and perturbations of h0 that have to travel over
+many epochs align
+
+D.1 Gradient Flossing Boosts Effective Dimension of Error Gradient
+To further investigate the effect of gradient flossing on training, we investigated the structure of the error gradient
+                                                                                                                  dL
+and how it is changed by gradient flossing. To this end, we decompose the recurrent weight gradient σi dW
+into in weighted sum of outer products using singular value decomposition (Fig 10).
+As the Lyapunov exponents are the time-averaged logarithms of the singular values of the asymptotic long-term
+Jacobian Tt (hτ ), this allows us to directly link the effect of pushing Lyapunov exponents toward zero during
+gradient flossing to the structure of the error gradient of the recurrent weights, as they are intimately linked:
+                                    τ =t−1    t−1
+                                                            !
+                      ∂Lt      ∂Lt X          Y ∂hτ ′ +1 ∂hτ            ∂Lt X             ∂hτ
+                            =                                        =          Tt (hτ )                        (19)
+                      ∂W       ∂ht             ′
+                                                    ∂h  τ ′    ∂W       ∂h t
+                                                                              τ
+                                                                                          ∂W
+                                     τ =t−l   τ =τ
+
+We again note that different ’loops’ contribute to the total gradient expression and the Lyapunov exponents only
+characterize the longest loop. Further, we note that in our controlled tasks, depending on delay d, only few of the
+summands are relevant for solving the task. We thus expect the relevant gradient summand that carries important
+signals about the task to be contaminated by summands of both shorter and longer chains, which contribute
+irrelevant fluctuations.
+                                                              dL
+                                                                 
+The singular values of the recurrent weight gradient σi dW         as a function of training epoch reveal that the
+subordinate singular values subordinate singular value σ20 and σ40 exhibit peaks at the times of gradient flossing,
+while the first singular value σ1 only shows a slight peak (Fig 10A). This indicates that gradient flossing
+                                                                 dL
+increases the effective rank of the recurrent weight gradient dW     . In other words, gradient flossing facilitates
+high-dimensional parameter updates. Our interpretation is, as gradient flossing pushes Lyapunov exponents
+to zero, the different summands in the total gradient contribute more equitable as long loops have neither a
+dominant contribution (which would happen for exploding gradients) nor a vanishing contribution (which would
+happen for vanishing gradients). This way, the sum of gradient terms has a higher effective rank.
+In contrast, without gradient flossing, the subordinate singular values (in Fig 10A σ20 and σ40 ) rapidly diminish
+to extremely small values over training epochs and remain very small throughout training. Note however that the
+leading singular values σ1 are of comparable size irrespective whether gradient flossing was performed or not.
+                                                                                              dL
+We note that similar to the gradient norm of the loss with respect to the initial condition | dh 0
+                                                                                                   |, the subordinate
+singular values seem to predict when the test accuracy of networks with gradient flossing grows beyond chance
+level (Fig 10B). We confirmed this in multiple other network realizations and give here another example we the
+accuracy grows beyond chance only later during training (Fig 11).
+
+D.2 Gradient Flossing Throughout Training Can Be Detrimental
+We find that gradient flossing continued throughout all training epochs can be detrimental for performance
+(Fig 12). We demonstrate this again in the binary delayed temporal XOR task. We compare three different
+conditions: Either, we floss throughout the training every 100 training epochs for 500 flossing epochs (red), or
+we floss only early during training at training epochs e ∈ {0, 100, 200, 300, 400})(green) or we do not floss at
+all (blue).
+
+
+                                                          18
+            A                                                                     C
+              10 2                                                                          0.05
+              10 4                                                                          0.04
+ |dhd 0 |     10 6                                                                          0.03
+
+
+
+
+                                                                         |dhd 0 |
+              10 8                                                                          0.02
+             10 10                                                                          0.01
+                                        with flossing during training
+             10 12                      without flossing
+                                                                                            0.00
+                           0   500 1000 1500 2000 2500 3000                                        0   500 1000 1500 2000 2500 3000
+                                       Epochs                                                                  Epochs
+            B                                                                     D
+                     100                                                                    100
+      accuracy (%)
+
+
+
+
+                                                                             accuracy (%)
+                               with flossing during training
+                               without flossing
+
+
+                      50                                                                     50
+                           0   500 1000 1500 2000 2500 3000                                        0   500 1000 1500 2000 2500 3000
+                                       Epochs                                                                  Epochs
+                                                                                                dL
+Figure 8: Gradient flossing boosts norm of long-term Jacobian A) Gradient norm of | dh             0
+                                                                                                     | as
+a function of training epochs for networks without flossing (blue) and networks with flossing
+during training (orange). Error gradient norm is boosted after gradient flossing at epochs e ∈
+                                                                                               dL
+{0, 100, 200, 300, 400, 500}). In networks without gradient flossing, the gradient norms | dh    0
+                                                                                                   | are
+much smaller overall. One out of ten random network realizations with solid line, the other 9 with
+transparent line. B) Accuracy as a function of epoch, same depiction and network realizations as in
+A. Note that accuracy of networks with gradient flossing grows beyond chance level approximately
+                            dL
+when the gradient norm | dh   0
+                                | becomes macroscopically large. C) Same as A in linear scale. Mean
+final test loss as a function of task difficulty (delay d) for delayed copy task. Different colors are
+different network realizations with gradient flossing during training. Black lines are without any
+gradient flossing. D) Accuracy as a function of epochs, same colors as in C. Note that for all network
+                                                    dL
+realizations the moments where gradient norm | dh     0
+                                                        | becomes macroscopically large coincides with
+the moment the accuracy is beyond chance level. Parameters: g = 1, batch size b = 16, N = 80,
+epochs = 104 , T = 300, flossing for Ef = 500 epochs on k = 75 Lyapunov exponents before
+training. Task: binary delayed XOR, delay d = 70, loss: cross entropy(y, ŷ).
+
+
+We observe that after every episode of gradient flossing, the accuracy drops down close to chance level of 50%
+(Fig 12A red line). Between flossing, the accuracy quickly recovers but never reaches 100%. Simultaneously,
+                                                                                                                dL
+when the accuracy drops the test error jumps up (Fig 12B). We also observed that the gradient norm | dh           0
+                                                                                                                    | is
+                                                                                                        dL
+initially boosted by gradient flossing, but stays close to indistinguishable once the gradient norm | dh0 | becomes
+macroscopically large (Fig 12C). This suggests that once gradient flossing facilitates signal propagation across
+long time horizons and the network picks up the relevant gradient signal, further gradient flossing can be harmful
+to the actual task execution. We hypothesize that there might be (at least for the Vanilla networks considered
+here), a trade-off between the ability to bridge long time scales which seems to require one or several Lyapunov
+exponents of the forward dynamics close to zero and nonlinear tasks requirements, which require at least a
+fraction of the units to be in the nonlinear regime of the nonlinearity ϕ, where ϕ′ (x) < 1. It would be an
+interesting future research avenue to further investigate this potential trade-off also in other network architectures.
+
+D.3             Lyapunov Exponents after Training With and Without Gradient Flossing
+In Fig 13, we show the first (Fig 13A, B) and the tenth (Fig 13C, D) Lyapunov exponent after training on
+the spatial delayed XOR task both with and without gradient flossing. We find for successful networks with
+gradient flossing a systematic relationship between the first Lyapunov exponent and the delay, that can be fitted
+by approximately by λ1 (d) = −0.2exp.(−0.03delay). Unsuccessful networks with accuracy at chance level
+have a much smaller largest Lyapunov exponent. The same seems to hold true for the tenth Lyapunov exponent.
+In a previous study [57], a similar trend was observed, albeit in the context of a task that did not possess an
+
+
+                                                                        19
+        A                                                       B 80
+
+                                                                           75
+            10 2
+                                                                           70
+            10 3
+
+
+
+
+                                                            accuracy (%)
+                                                                           65
+ |dhd 0 |
+
+
+
+            10 4                                                           60
+                                                                           55
+            10 5
+                                                                           50
+                                                                           45
+                   3000   2000     1000      0       1000                       0   100     200    300     400
+                                 Epochs                                                   Epochs
+
+
+Figure 9: Increase of gradient norm precedes epoch when accuracy exceeds chance level
+                       dL
+A) Gradient norm of | dh 0
+                           | as a function of training epochs for 20 network realizations with flossing
+during training. Epochs are aligned to last epoch where accuracy is ≤ 50%. B) Same task and
+simulations as in A, but here accuracy as a function of epoch, for 20 network realizations with
+                                                                        dL
+flossing during training. Epochs are aligned to the last epoch with | dh  0
+                                                                            | < 0.001. Different colors
+are different network realizations. Parameters: g = 1, batch size b = 16, N = 80, epochs = 104 ,
+T = 300, flossing for Ef = 500 epochs on k = 75 Lyapunov exponents early during training at
+training epochs e ∈ {0, 100, 200, 300, 400}. Task: binary delayed XOR, delay d = 70, loss: cross
+entropy(y, ŷ).
+
+
+analytically tractable temporal correlation structure, which might partially explain the less conclusive results.
+It is important to note that the numerical evaluation of Lyapunov exponents in recurrent LSTM networks in
+[57] was based solely on the N × N Jacobian of the memory state. From a dynamical systems standpoint, a
+2N × 2N Jacobian matrix encompassing interactions between both memory and cell states into account is
+required [9, 12, 58].
+
+
+E Gradient Flossing for Linear Network
+We provide code for gradient flossing in linear networks here. We find that gradient flossing also helps to train
+linear networks on tasks with many time steps that can be solved by linear networks, for example the copy task,
+but not for tasks the require a nonlinear input-output operation like the temporally delayed XOR task. Full
+analytical description of gradient flossing for linear networks would be a promising avenue for future research as
+networks with linear dynamics can still have nonlinear learning dynamics [21]. However this is beyond the cope
+of the presented work.
+
+
+F      Computational Complexity of Gradient Flossing
+We present here a more in-depth scaling analysis of the computational cost of gradient flossing. There are three
+main contributors to the computational cost (table 1): First the RNN step, which has a computational complexity
+of O N 2 b per time step, where N is the dimension of the recurrent network state (which in case of Vanilla
+            
+networks equals the number of units) and b is the batch-size both in the forward and backward pass. Second, the
+Jacobian step which scales with O N 2 k per time step, where k is the number of flossed Lyapunov exponents.
+Third, the QR decomposition, which scales with O N k2 , where k is the number of Lyapunov exponents
+                                                               
+considered.
+Together, this results in a total amortized cost of O N 2 b T per training epoch, where T is the number of
+                                                                 
+
+training time steps and a total amortized costs per flossing epoch of O N 2 Tf (1 + k/tONS + k) where Tf is
+                                                                                                  
+the number of flossing time steps.
+
+
+                                                       20
+  A
+             10 2
+ i dW )
+   d
+      (
+
+
+
+             10 4
+
+
+                        0           500          1000        1500           2000           2500          3000
+  B               100
+                            early training flossing
+                            without flossing
+   accuracy (%)
+
+
+
+
+                  50
+                        0           500          1000         1500          2000           2500          3000
+                                                             Epochs
+Figure 10: Gradient flossing decreases condition number          of recurrent weight error gradient
+                                                     dL
+                                                         
+A) Singular values of recurrent weight gradient σi dW      as a function of training epochs for singular
+values i ∈ 1, 20, 40 for networks without gradient flossing (blue) and early training gradient flossing
+(green). At epochs of gradient flossing, the subordinate singular value σ20 and σ40 are peaked, while
+the first singular value σ1 has only a slight peak. This indicates that gradient flossing increases the
+                                                  dL
+effective rank of the recurrent weight gradient dW   . B) Accuracy as a function of training epochs.
+Note that accuracy of networks with gradient flossing grows beyond chance level approximately
+when the subordinate singular values singular value σ20 and σ40 are peaked increase, which enables
+high-dimensional parameters updates. Parameters: g = 1, batch size b = 16, N = 80, epochs = 103 ,
+T = 300, gradient flossing for Ef = 500 epochs on k = 75 Lyapunov exponents before training.
+Task: binary delayed XOR, delay d = 70, loss: cross entropy(y, ŷ).
+
+
+
+
+In case of preflossing, thus, the total computation cost scale with O N 2 [Eb T + Ep Tf (1 + k/tONS + k)] ,
+                                                                                                         
+where E is the number of training epochs and Ep is the number of preflossing epochs.
+For gradient flossing during training (assuming that there is also preflossing done), the amortized cost scale with
+O N 2 [Eb T + Ep Tp + Ef Tf (1 + k/tONS + k)] , where Ef is the total number of flossing epochs during
+training.
+Empirically, we find that both the number of preflossing epochs Ep and flossing episodes Ef necessary for
+training success is much smaller than the total number of training epochs E. For example, the preflossing for
+500 epochs in the numerical experiment of Fig 3 took ∼ 37 seconds, while the overall training on 10000 training
+epochs with batch size b = 16 took ∼ 1680 seconds. Thus only approximately 2.2% of the total training time
+was spent on gradient flossing. Moreover, Tp can be smaller than T , it just has to be long enough such that the
+temporal correlations in the task can be bridged. In case of the tasks discussed in the manuscript, this would be
+the delay d. It remains an important challenge to infer the suitable number of flossing time steps Tf for tasks
+with unknown temporal correlation structure.
+It would also be interesting to investigate how the CPU hours/wall-clock time/flops/Joule/CO2-emission spent
+on gradient flossing vs on training networks with larger N are trading off against each other. For this, we would
+suggest to first find the smallest network that on median successfully trains on a binary temporal XOR task for
+a fixed given delay d and measure the computational resources involved in training it, e.g. in terms of CPU
+hours. Then compare it to a network with gradient flossing. This would be a promising analysis but is beyond
+our current computational budget. We will start such experiments an might be able to provide results during the
+reviewer period.
+
+
+                                                        21
+  A
+              10 2
+ i dW )
+   d
+      (
+
+
+
+              10 4
+
+
+                         0           500          1000        1500         2000           2500           3000
+  B                100
+                             early training flossing
+                             without flossing
+    accuracy (%)
+
+
+
+
+                   50
+                         0           500          1000         1500        2000           2500           3000
+                                                              Epochs
+Figure 11: Gradient flossing decreases condition number of recurrent weight error gradient
+Same as Fig 10 for different network realization.
+
+
+
+G         Additional controls
+
+We also investigate the effects of gradient flossing during the training with orthogonal weight initializations
+and confirm our finding that gradient flossing improves trainability on tasks that have long time horizons to
+bridge. Moreover, we find that gradient flossing during training can further improve trainability. We replicated
+the two more challenging tasks from the main paper (Fig 4) for orthogonal initialization with variable temporal
+complexity and performed gradient flossing either both during and before training, only before training, or not at
+all.
+Fig 14A shows the test accuracy for Vanilla RNNs with orthogonal initialization trained on the delayed temporal
+XOR task yt = xt−d/2 ⊕ xt−d with random Bernoulli process x ∈ {0, 1}. The accuracy of orthogonal Vanilla
+RNNs falls to chance level for d ≥ 40 (Fig 14C). With gradient flossing before training, the trainability can
+be improved, but still falls close to chance level for d = 70. In contrast, for initially orthogonal networks with
+gradient flossing during training, the accuracy is improved to > 80% at d = 70. In this case, we preflossed for
+500 epochs before task training and again after 500 epochs of training on the task. In Fig 14B, D the networks
+have to perform the nonlinear XOR operation yt = x1t−d ⊕ x2t−d ⊕ x3t−d on a three-dimensional binary input
+signal x1 , x2 , and x3 and generate the correct output with a delay of d steps identical to Fig 4 in the main text.
+Again, we observe similar to networks with Gaussian initialization that flossing before training improves the
+performance compared to baseline, but starts failing for long delays d > 60. In contrast, orthogonal networks
+that are also flossed during training can solve even more difficult tasks (Fig 14D). We note that for Fig 14B and
+D, we trained the network only on 5000 epochs, compared to 10000 epochs in networks with random Gaussian
+initialization because for 10000 epochs, both networks with gradient flossing only before training and with
+gradient flossing before and during training were able to bridge d = 70. These results suggest that orthogonal
+initialization does seem to slightly improve performance for tasks with long time horizons to bridge and gradient
+flossing and additionally boost the performance. Thus orthogonal initialization and gradient flossing seems to go
+well together. It would be interesting to study if orthogonal initialization also reduces the number of gradient
+flossing steps necessary to improve performance.
+
+
+
+H         Additional Details on Training Tasks
+
+In this section, we provide a more rigorous definition of the tasks used for training, as discussed in Section 3:
+
+
+                                                         22
+  A                  100
+
+
+      accuracy (%)
+                                                                      with flossing throughout training
+                                                                      early training flossing
+                                                                      without flossing
+                     50
+                           0       500          1000         1500           2000          2500           3000
+  B
+                     100
+    test error
+
+
+
+
+                 10 1
+                           0       500          1000         1500           2000          2500           3000
+  C
+                 10 3                        with flossing throughout training
+                                             early training flossing
+                                             without flossing
+   |dhd 0 |
+
+
+
+
+                 10 7
+
+                           0       500          1000          1500          2000          2500           3000
+                                                             Epochs
+Figure 12: Gradient flossing throughout training can be detrimental to learning A) Accuracy as a
+function of training epochs for binary temporal delayed XOR task for gradient flossing throughout
+training every 100 training epochs (red). Accuracy drops down close to chance level every time after
+gradient flossing but recovers quickly between. Same for only 5 episodes of gradient flossing at
+epochs e ∈ {0, 100, 200, 300, 400}) (green) and no flossing at all (blue). B) Test error as a function
+                                           dL
+of training epochs. C) Gradient norm of | dh 0
+                                               | as a function of training epochs for networks without
+gradient flossing (blue) and networks with flossing throughout training (red) and early training
+gradient flossing (green). Error gradient norm is boosted after each gradient flossing. In networks
+                                                   dL
+without gradient flossing, the gradient norms | dh   0
+                                                       | are much smaller overall. Parameters: g = 1,
+batch size b = 16, N = 80, epochs = 104 , T = 300, gradient flossing for Ef = 500 epochs on
+k = 75 Lyapunov exponents before training. Task: binary delayed XOR, delay d = 70, loss: cross
+entropy(y, ŷ).
+
+
+H.1        Copy task
+For the copy task, the target network readout at time t is yt = xt−d , where d denotes the delay. We chose the
+input to be sampled i.i.d. from a uniform distribution between 0 and 1.
+
+H.2        Temporal XOR task
+The temporal XOR task requires the target network readout yt at time t to be computed as follows:
+                                              yt = |xt−d/2 − xt−d |                                            (20)
+                                                                                  2
+where again d denotes a time delay of d time steps. In the case of x ∈ {0, 1} and y ∈ {0, 1}, the output yt
+follows the truth table of the XOR digital logic gate (Table 2). Thus, the function f (xa , xb ) = |xa − xb | can be
+seen as an analytical representation of the XOR gate. It is important to note that f (x, 0) = x only for x ≥ 0,
+and that this task requires a nonlinearity. The implementation can easily be constructed analytically, for example,
+using two rectified linear units ϕ(x) = max(x, 0) the outbut can be constructed by
+                               f (xa , xb ) = |xa − xb | = ϕ(xa − xb ) + ϕ(xb − xa ).                          (21)
+Together with a delay line to transmit the signal xt−d over time, this can solve the task.
+
+
+                                                        23
+         A            no gradient flossing                          B gradient flossing during training
+             0.1
+    1
+                                                                        0.1
+
+
+
+
+                                                               1
+             0.2
+                                                                        0.2
+                      20       40         60                                        20       40         60
+                       complexity (delay d)                                          complexity (delay d)
+         C                                                          D
+             0.1
+             0.2                                                        0.1
+    10
+
+
+
+
+                                                               10
+             0.3                                                        0.2
+                      20       40         60                                        20       40         60
+                       complexity (delay d)                                          complexity (delay d)
+Figure 13: Lyapunov exponents of trained networks with and without gradient flossing A) First
+Lyapunov exponents λ1 for Vanilla networks trained on spatial delayed XOR task as a function of
+the delay with no gradient flossing. Colored-coded is test accuracy at the end of training where red
+corresponds to 100% accuracy and blue to chance level (50%). B) Same as A for networks with
+gradient flossing during training. Black dashed line shows that Lyapunov exponents of successfully
+trained networks can be approximated by the empirical fit λ1 (d) = −0.2exp.(−0.03delay). (Proto-
+col for gradient flossing during training same as main text Fig 4B). C) Same as A for tenth Lyapunov
+exponents λ10 . D) Same as B for tenth Lyapunov exponents λ10 . Same fit as in B also describes
+λ10 . Parameters: g = 1, batch size b = 16, N = 80, epochs = 104 , T = 300, gradient flossing for
+Ef = 500 epochs on k = 75 Lyapunov exponents before training. Task: binary spatial delayed XOR,
+loss: cross entropy(y, ŷ).
+
+
+I   Additional Background on Lyapunov Exponents of RNNs
+
+An autonomous dynamical system is usually defined by a set of ordinary differential equations dh/dt =
+F(h), h ∈ RN in the case of continuous-time dynamics, or as a map hs+1 = f (hs ) in the case of discrete-time
+dynamics. In the following, the theory is presented for discrete-time dynamical systems for ease of notation,
+but everything directly extends to continuous-time systems [43]. Together with an initial condition h0 , the
+map forms a trajectory. As a natural extension of linear stability analysis, one can ask how an infinitesimal
+perturbation h′0 = h0 + ϵu0 evolves in time. Chaotic systems are sensitive to initial conditions; almost all
+infinitesimal perturbations ϵu0 of the initial condition grow exponentially with time |ϵut | ≈ exp(λ1 t)|ϵu0 |.
+Finite-size perturbations, therefore, may lead to a drastically different subsequent behavior. The largest Lyapunov
+exponent λ1 measures the average rate of exponential divergence or convergence of nearby initial conditions:
+
+
+                                                          1              ||ϵut ||
+                                        λ1 (h0 ) = lim        lim log                                            (22)
+                                                    t→∞ t ϵ→0            ||ϵu0 ||
+In dynamical systems that are ergodic on the attractor, the Lyapunov exponents do not depend on the initial
+conditions as long as the initial conditions are in the basins of attraction of the attractor. Note that it is crucial
+to first take the limit ϵ → 0 and then t → ∞, as λ1 (h0 ) would be trivially zero for a bounded attractor if the
+                                       ||ϵut ||
+limits are exchanged, as limt→∞ log ||ϵu   0 ||
+                                                is bounded for finite perturbations even if the system is chaotic. To
+measure k Lyapunov exponents, one has to study the evolution of k independent infinitesimal perturbations us
+spanning the tangent space:
+
+
+                                                   us+1 = Ds us                                                  (23)
+
+
+                                                         24
+                                      forward pass                                            backward pass
+                                      O N2 b
+                                             
+    RNN dynamics                                                                              "
+                                      O N2 k
+                                              
+ Jacobian step                                                                                "
+                                      O N k2
+                                              
+    QR step                                                                                   "
+                                      O N2 b T
+                                                
+    total amortized costs                                                                     "
+    per training epoch
+                                      O N 2 Tf (1 + k/tONS + k)
+                                                                    
+    total amortized costs                                                                     "
+    per gradient flossing epoch
+                                      O N 2 [Eb T + Ep Tf (1 + k/tONS + k)]
+                                                                                   
+    total amortized costs                                                                     "
+    of preflossing
+                                      O N 2 [Eb T + Ep Tp + Ef Tf (1 + k/tONS + k)]
+                                                                                        
+    total amortized costs                                                                     "
+    flossing during training
+
+Table 1: Computational cost for gradient flossing and training of RNNs
+N denotes number of neurons, b is the batch size, T is the number of time steps in forward pass of
+training, Tf is the number of time steps in forward pass of flossing, tONS is the reorthonormalization
+interval, k is the number of flossed Lyapunov exponents, E is the number of training epochs, Ep is
+the number of preflossing epochs, Ef is the number of flossing epochs during training. Empirically,
+we find that the necessary number of preflossing epochs Ep and flossing episodes Ef is much smaller
+than both the total number of training epochs E. Moreover, Tp can be smaller than T .
+
+                                                     Table 2: XOR
+                                  input xt−d        input xt−2d     target output yt
+                                  0                 0               0
+                                  0                 1               1
+                                  1                 0               1
+                                  1                 1               0
+
+
+where the N × N Jacobian Ds (hs ) = df (hs )/dh characterizes the evolution of generic infinitesimal perturba-
+tions during one step. Note that this Jacobian along the trajectory is equivalent to a stability matrix only at a
+fixed point, i.e., when hs+1 = f (hs ) = hs .
+We are interested in the asymptotic behavior, and therefore we study the long-term Jacobian
+
+                                      Tt (h0 ) = Dt−1 (ht−1 ) . . . D1 (h1 )D0 (h0 ).                        (24)
+Note that Tt (h0 ) is a product of generally noncommuting matrices. The Lyapunov exponents λ1 ≥ λ2 · · · ≥ λN
+are defined as the logarithms of the eigenvalues of the Oseledets matrix
+                                                                              1
+                                          Λ(h0 ) = lim [Tt (h0 )⊤ Tt (h0 )] 2t ,                             (25)
+                                                    t→∞
+
+where ⊤ denotes the transpose operation. The expression inside the brackets is the Gram matrix of the long-term
+Jacobian Tt (h0 ). Geometrically, the determinant of the Gram matrix is the squared volume of the parallelotope
+spanned by the columns of Tt (h0 ). Thus, the exponential volume growth rate is given by the sum of the
+logarithms of its first k (sorted) eigenvalues. Oseledets’ multiplicative ergodic theorem guarantees the existence
+of the Oseledets matrix Λ(h0 ) for almost all initial conditions h0 [42]. In ergodic systems, the Lyapunov
+exponents λi do not depend on the initial condition h0 . However, for a numerical calculation of the Lyapunov
+spectrum, Eq 25 cannot be used directly because the long-term Jacobian Tt (h0 ) quickly becomes ill-conditioned,
+i.e., the ratio between its largest and smallest singular value diverges exponentially with time.
+
+
+J      Algorithm for Calculating Lyapunov Spectrum of Rate Networks
+For calculating the first k Lyapunov exponents, we exploit the fact that the growth rate of a k-dimensional
+infinitesimal volume element is given by λ(m) = m                               (1)
+                                                                                    , λ2 = λ(2) − λ1 , λ3 =
+                                                   P
+                                                     i=1 λi . Therefore, λ1 = λ
+  (3)
+λ − λ1 − λ2 , . . . [44]. The volume growth rates can be obtained via QR-decomposition.
+
+
+                                                            25
+                                                    orthogonal weight initialization
+       A                           delayed temporal XOR           B                  delayed spatial XOR
+                    100                                                                       100
+test accuracy (%)              flossing during training
+
+
+
+
+                                                                          test accuracy (%)
+                               preflossing
+                     80        no flossing                                                     80
+
+                     60                                                                        60
+
+                          0      2000     4000 6000          8000 10000                             0    1000   2000 3000       4000     5000
+                                             Epochs                                                               Epochs
+       C                                                                         D
+                    100                                                                       100
+test accuracy (%)
+
+
+
+
+                                                                          test accuracy (%)
+                     80                                                                        80                       flossing during training
+                                                                                                                        preflossing
+                                                                                                                        no flossing
+                     60                                                                        60
+
+                          10     20      30 40 50 60                 70                             10   20     30 40 50 60              70
+                                       complexity (delay d)                                                   complexity (delay d)
+
+Figure 14: Gradient flossing before and during training improves trainability for orthogonal
+nets
+A) Test accuracy for orthogonally initialized vanilla RNNs trained on delayed temporal binary XOR
+task yt = xt−d/2 ⊕ xt−d with gradient flossing during training (green), preflossing (orange), and
+with no gradient flossing (blue) for d = 70. Solid lines are mean, transparent thin lines are individual
+network realizations B) Same as A for delayed spatial XOR task with yt = x1t−d ⊕ x2t−d ⊕ x3t−d .
+C) Test accuracy as a function of task difficulty (delay d) for delayed temporal XOR task. D) Test
+accuracy as a function of task difficulty (delay d) for delayed spatial XOR task. Parameters: g = 1,
+batch size b = 16, N = 80, epochs = 104 for delayed temporal XOR, epochs = 5000 for delayed
+spatial XOR, T = 300, flossing for Ef = 500 epochs on k = 75 Lyapunov exponents before training
+and during training for green lines, and only before training for orange lines. Shaded areas are 25%
+and 75% percentiles, solid lines are means, transparent dots are individual simulations, task loss is
+cross-entropy between y, ŷ.
+
+
+First, we evolve an initially orthonormal system Qs = [q1s , q2s , . . . qm
+                                                                          s ] in the tangent space along the trajectory
+using the Jacobian Ds :
+                                                Qe s+1 = Ds Qs                                                     (26)
+A continuous system can be transformed into a discrete system by considering a stroboscopic representation,
+where the trajectory is only considered at certain discrete time points. We use here the notation of discrete
+dynamical systems where this corresponds to performing the product of Jacobians along the trajectory Q
+                                                                                                     e s+1 =
+Ds Qs . We study the discrete network dynamics in the limit of small time step ∆t → 0 and for discrete time
+∆t = 1. The notation can be readily extended to continuous systems [43].
+Second, we extract the exponential growth rates using the QR-decomposition,
+                                                                 e s+1 = Qs+1 Rs+1 ,
+                                                                 Q
+
+which uniquely decomposes Q    e s+1 into an orthonormal matrix Qs+1 of size N × k so Q⊤    s+1 Qs+1 = 1m×m
+and to an upper triangular matrix Rs+1 of size k × k with positive diagonal elements. Geometrically, Qs+1
+describes the rotation of Qs caused by Ds and the diagonal entries of Rs+1 describe the stretching and shrinking
+of the columns of Qs , while the off-diagonal elements represent the shearing. Fig 15 visualizes Ds and the
+QR-decomposition for k = 2.
+The Lyapunov exponents are given by time-averaged logarithms of the diagonal elements of Rs :
+                                                                     t                t
+                                                              1     Y              1X
+                                                 λi = lim       log     Rsii = lim       log Rsii .                                             (27)
+                                                          t→∞ t                t→∞ t
+                                                                    s=1              s=1
+
+
+
+                                                                          26
+                      Ds
+                                                                        QR
+
+                                                         s+1
+                                                        R22
+qs2                              ~2         s+1
+                                                                                    2
+                                                                                   qs+1
+                                 qs+1      R11          ~1
+                                                        qs+1
+                                                                              s+1
+      qs1                                                                    R11             1
+                                                                                            qs+1
+                                                                                                    s+1
+                                                                                                   R22
+
+Figure 15: Geometric illustration of Lyapunov spectrum calculation. An orthonormal matrix
+Qs = [q1s , q2s , . . . qms ], whose columns are the axes of an k-dimensional cube, is rotated and distorted
+by the Jacobian Ds into an k-dimensional parallelotope Q           e s+1 = Ds Qs embedded in RN . The
+figure illustrates this for k = 2, in which case the columns of Q        e s+1 span a parallelogram, which
+can be divided into a right triangle and a trapezoid and rearranged into a rectangle. Thus, the
+area of the gray parallelogram is the same as that of the orange rectangle. The QR-decomposition
+reorthonormalizes Q       e s+1 by decomposing it into the product of an orthonormal matrix Qs+1 =
+[q1s+1 , q2s+1 , . . . qms+1 ] and the upper-triangular matrix R
+                                                                   s+1
+                                                                       . Qs+1 describes the rotation of Qs
+                                                  s+1
+caused by Ds . The diagonal entries of R              gives the stretching/shrinking along the columns of
+Qs+1 ,Qthus the volume of the parallelotope formed by the first k columns of Q            e s+1 is given by
+          m
+Vm = i=1 Rs+1     ii    . The  time-averaged logarithms  of the diagonal  elements of R s
+                                                                                          give the Lyapunov
+                                   1
+                                         Qt       s                1
+                                                                     Pt          s
+spectrum: λi = limtsim →∞ tsim log s=1 Rii = limtsim →∞ t s=1 log Rii .
+
+
+Note that the QR-decomposition does not need to be performed at every simulation step, just sufficiently often,
+i.e., once every sONS steps such that Q     ONS = Ds+sONS −1 · Ds+sONS −2 . . . Ds · Qs remains well-conditioned
+                                      e s+s
+[44]. An appropriate reorthonormalization interval sONS = tONS /∆t thus depends on the condition number, the
+ratio of the smallest and largest singular value:
+                                                                                    s+s
+                              e s+s ) = κ2 (Rs+sONS ) =    σ1 (Rs+sONS )    R ONS
+                         κ2 ( Q                                          = 11          .                   (28)
+                                                                            Rs+s
+                                   ONS                            s+s
+                                                           σm (R ONS )         mm
+                                                                                   ONS
+
+An initial transient should be disregarded in the calculation of the Lyapunov spectrum because h first has to
+converge towards the attractor and Q has to converge to the unique eigenvectors of the Oseledets matrix (Eq 25)
+[54]. A simple example of this algorithm in pseudocode is:
+
+Algorithm 2 Jacobian-based algorithm for Lyapunov spectrum
+  initialize h, Q
+  evolve h until it is on attractor (avoid initial transient)
+  evolve Q until it converges to the eigenvectors of the backward Oseledets matrix
+  set γi = 0
+  for t = 1 → T do
+       h ← f (h)
+              df
+       D ← dh
+       Q←D·Q
+       if s ≡ 0 (mod sONS ) then
+           Q, R ← qr(Q)
+           γi += log(Rii )
+       end if
+  end for
+  λi = γi /T
+
+
+It is guaranteed that under general conditions initially random orthonormal systems will exponentially converge
+towards a unique basis that is given by the eigenvectors of the Oseledets matrix Eq 25 [54]. A minimal example
+of this algorithm in pseudocode is shown in appendix 3. A feasible strategy to determine the reorthonormalization
+time interval tONS is to get first a rough estimate of the Lyapunov spectrum using a short simulation time tsim and
+a small tONS and repeat with a longer simulation time and a tONS based on the Lyapunov spectrum of the rough
+estimate of the Lyapunov spectrum. Another strategy is, to first iteratively adapt tONS on a short simulation
+run to get an acceptable condition number. It should be noted that there exists a diversity of other methods to
+estimate the Lyapunov spectrum [14, 43, 68, 69].
+
+
+                                                        27
+K            Convergence of Lyapunov Exponents of RNNs
+In Fig. 16, we demonstrate the convergence of the Lyapunov exponents. We show the estimate of the Lyapunov
+exponents λi for i = 1, 20, 60, 80 for different initial conditions but identical network realization.
+
+             0
+             10
+i(1/steps)
+
+
+
+
+             20
+             30
+             40
+                  100                                    101                                      102
+                                                        steps
+
+
+Figure 16: Convergence of Lyapunov exponents Convergence of selected Lyapunov exponents
+λi for ten identical network realizations with different initial conditions with simulation time (i =
+1, 20, 60, 80) for σ = 1 and g = 1. (Other parameters: N = 80, tsim = 100 steps, tONS = 1).
+
+
+
+
+                                                   28
+
+\ No newline at end of file