diff options
Diffstat (limited to 'papers/txt/engelken2023_gradient_flossing.txt')
| -rw-r--r-- | papers/txt/engelken2023_gradient_flossing.txt | 1879 |
1 files changed, 1879 insertions, 0 deletions
diff --git a/papers/txt/engelken2023_gradient_flossing.txt b/papers/txt/engelken2023_gradient_flossing.txt new file mode 100644 index 0000000..6d70608 --- /dev/null +++ b/papers/txt/engelken2023_gradient_flossing.txt @@ -0,0 +1,1879 @@ + Gradient Flossing: Improving Gradient Descent + through Dynamic Control of Jacobians + + + Rainer Engelken + Zuckerman Mind Brain Behavior Institute + Columbia University +arXiv:2312.17306v1 [cs.LG] 28 Dec 2023 + + + + + New York, USA + re2365@columbia.edu + + + + Abstract + Training recurrent neural networks (RNNs) remains a challenge due to the insta- + bility of gradients across long time horizons, which can lead to exploding and + vanishing gradients. Recent research has linked these problems to the values + of Lyapunov exponents for the forward-dynamics, which describe the growth or + shrinkage of infinitesimal perturbations. Here, we propose gradient flossing, a + novel approach to tackling gradient instability by pushing Lyapunov exponents + of the forward dynamics toward zero during learning. We achieve this by regu- + larizing Lyapunov exponents through backpropagation using differentiable linear + algebra. This enables us to "floss" the gradients, stabilizing them and thus improv- + ing network training. We demonstrate that gradient flossing controls not only the + gradient norm but also the condition number of the long-term Jacobian, facilitating + multidimensional error feedback propagation. We find that applying gradient + flossing prior to training enhances both the success rate and convergence speed for + tasks involving long time horizons. For challenging tasks, we show that gradient + flossing during training can further increase the time horizon that can be bridged + by backpropagation through time. Moreover, we demonstrate the effectiveness of + our approach on various RNN architectures and tasks of variable temporal com- + plexity. Additionally, we provide a simple implementation of our gradient flossing + algorithm that can be used in practice. Our results indicate that gradient flossing + via regularizing Lyapunov exponents can significantly enhance the effectiveness of + RNN training and mitigate the exploding and vanishing gradients problem. + + + 1 Introduction + Recurrent neural networks are commonly used both in machine learning and computational neu- + roscience for tasks that involve input-to-output mappings over sequences and dynamic trajectories. + Training is often achieved through gradient descent by the backpropagation of error information + across time steps [1, 2, 3, 4]. This amounts to unrolling the network dynamics in time and recursively + applying the chain rule to calculate the gradient of the loss with respect to the network parameters. + Mathematically, evaluating the product of Jacobians of the recurrent state update describes how error + signals travel across time steps. When trained on tasks that have long-range temporal dependencies, + recurrent neural networks are prone to exploding and vanishing gradients [5, 6, 7, 8]. These arise from + the exponential amplification or attenuation of recursive derivatives of recurrent network states over + many time steps. Intuitively, to evaluate how an output error depends on a small parameter change at + a much earlier point in time, the error information has to be propagated through the recurrent network + states iteratively. Mathematically, this corresponds to a product of Jacobians that describe how + changes in one recurrent network state depend on changes in the previous network state. Together, + this product forms the long-term Jacobian. The singular value spectrum of the long-term Jacobian reg- + + 37th Conference on Neural Information Processing Systems (NeurIPS 2023). +ulates how well error signals can propagate backwards along multiple time steps, allowing temporal +credit assignment. A close mathematical correspondence of these singular values and the Lyapunov +exponents of the forward dynamics was established recently [9, 10, 11, 12]. Lyapunov exponents +characterize the asymptotic average rate of exponential divergence or convergence of nearby initial +conditions and are a cornerstone of dynamical systems theory [13, 14]. We will use this link to +improve the trainability of RNNs. +Previous approaches that tackled the problem of exploding or vanishing gradients have suggested +solutions at different levels. First, specialized units such as LSTM and GRU were introduced, +which have additional latent variables that can be decoupled from the recurrent network states via +multiplicative (gating) interactions. The gating interactions shield the latent memory state, which can +therefore transport information across multiple time steps [5, 6, 15]. Second, exploding gradients can +be avoided by gradient clipping, which re-scales the gradient norm [16] or their individual elements +[17] if they become too large [18]. Third, normalization schemes like batch normalization prevent +saturated nonlinearities that contribute to vanishing gradients [19]. Third, it was suggested that the +problem of exploding/vanishing gradients can be ameliorated by specialized network architectures, +for example, antisymmetric networks [20], orthogonal/unitary initializations [21, 22, 23], coupled +oscillatory RNNs [24], Lipschitz RNNs [25], linear recurrent units [26], echo state networks [27, 28], +(recurrent) highway networks [29, 30], and stable limit cycle neural networks [11, 31, 32]. Fourth, for +large networks, a suitable choice of weights can guarantee a well-conditioned Jacobian at initialization +[21, 33, 34, 35, 36, 37, 38, 39, 40, 41]. These initializations are based on mean-field methods, which +become exact only in the large-network limit. Such initialization schemes have also been suggested +for gated networks [40]. However, even when initializing the network with well-behaved gradients, +gradients will typically not retain their stability during training once the network parameters have +changed. +Here, we propose a novel approach to tackling this challenge by introducing gradient flossing, a +technique that keeps gradients well-behaved throughout training. Gradient flossing is based on +a recently described link between the gradients of backpropagation through time and Lyapunov +exponents, which are the time-averaged logarithms of the singular values of the long-term Jacobian +[9, 11, 12, 32]. Gradient flossing regularizes one or several Lyapunov exponents to keep them close +to zero during training. This improves not only the error gradient norm but also the condition number +of the long-term Jacobian. As a result, error signals can be propagated back over longer time horizons. +We first demonstrate that the Lyapunov exponents can be controlled during training by including an +additional loss term. We then demonstrate that gradient flossing improves the gradient norm and +effective dimension of the gradient signal. We find empirically that gradient flossing improves test +accuracy and convergence speed on synthetic tasks over a range of temporal complexities. Finally, +we find that gradient flossing during training further helps to bridge long-time horizons and show +that it combines well with other approaches to ameliorate exploding and vanishing gradients, such as +dynamic mean-field theory for initialization, orthogonal initialization and gated units. +Our contributions include: + + • Gradient flossing, a novel approach to the problem of exploding and vanishing gradients in + recurrent neural networks based on regularization of Lyapunov exponents. + + • Analytical estimates of the condition number of the long-term Jacobian based on Lyapunov + exponents. + + • Empirical evidence that gradient flossing improves training on tasks that involve bridging + long time horizons. + + +2 RNN Gradients and Lyapunov Exponents + +We begin by revisiting the established mathematical relationship between the gradients of the loss +function, computed via backpropagation through time, and Lyapunov exponents [9, 12], and how +it relates to the problem of vanishing and exploding gradients. In backpropagation through time, +network parameters θ are iteratively updated by stochastic gradient descent such that a loss Lt is +locally reduced [1, 2, 3, 4]. For RNN dynamics hs+1 = fθ (hs , xs+1 ), with recurrent network state +h, external input x, and parameters θ, the gradient of the loss Lt with respect to θ is evaluated by + + + 2 +unrolling the network dynamics in time. The resulting expression for the gradient is given by: + τ =t−1 t−1 + ! + ∂Lt ∂Lt X Y ∂hτ ′ +1 ∂hτ ∂Lt X ∂hτ + = = Tt (hτ ) (1) + ∂θ ∂ht ′ + ∂hτ ′ ∂θ ∂ht τ ∂θ + τ =t−l τ =τ + +where Tt (hτ ) is composed of a product of one-step Jacobians Ds = ∂hs+1 + ∂hs : + t−1 t−1 + Y ∂hτ ′ +1 Y + Tt (hτ ) = = Dτ ′ (2) + ∂hτ ′ + τ ′ =τ ′ τ =τ + +Due to the chain of matrix multiplications in Tt , the gradients tend to vanish or explode exponentially +with time. This complicates training particularly when the task loss at time t dependents on inputs x +or states h from many time steps prior which creates long temporal dependencies [5, 6, 7, 8]. How +well error signals can propagate back in time is constrained by the tangent space dynamics along +trajectory ht , which dictate how local perturbations around each point on the trajectory stretch, rotate, +shear, or compress as the system evolves. +The singular values of the Jacobian’s product Tt , which determine how quickly gradients vanish or +explode during backpropagation through time, are directly related to the Lyapunov exponents of the +forward dynamics [9, 12]: Lyapunov exponents λ1 ≥ λ2 · · · ≥ λN are defined as the asymptotic +time-averaged logarithms of the singular values of the long-term Jacobian [13, 42, 43] + 1 + λi = lim log(σi,t ) (3) + t→∞ t − τ + +where σi,t denotes the ith singular value of Tt (hτ ) with σ1,t ≥ σ2,t . . . σN,t (See Appendix I +for details). This means that positive Lyapunov exponents in the forward dynamics correspond to +exponentially exploding gradient modes, while negative Lyapunov exponents in the forward dynamics +correspond to exponentially vanishing gradient modes. +In summary, the Lyapunov exponents give the average asymptotic exponential growth rates of +infinitesimal perturbations in the tangent space of the forward dynamics, which also constrain the +signal propagation in backpropagation for long time horizons. Lyapunov exponents close to zero in +the forward dynamics correspond to tangent space directions along which error signals are neither +drastically attenuated nor amplified in backpropagation through time. Such close-to-neutral modes in +the tangent dynamics can propagate information reliably across many time steps. + +3 Gradient Flossing: Idea and Algorithm +We now leverage the mathematical connection established between Lyapunov exponents and the +prevalent issue of exploding and vanishing gradients for regularizing the singular values of the +long-term Jacobian. We term this procedure gradient flossing. To prevent exploding and vanishing +gradients, we constrain Lyapunov exponents to be close to zero. This ensures that the corresponding +directions in tangent space grow and shrink on average only slowly. This leads to a better-conditioned +long-term Jacobian Tt (hτ ). We achieve this by using the sum of the squares of the first k largest +Lyapunov exponent λ1 , λ2 . . . λk as a loss function: + k + X + Lflossing = λ2i (4) + i=1 + +and evaluate the gradient obtained from backpropagation through time: + k + ∂Lflossing X ∂λ2i + = (5) + ∂θ i=1 + ∂θ + +This might seem like an ill-fated enterprise, as the gradient expression in Eq 5 suffers from its +own problem of exploding and vanishing gradients. However, instead of calculating the Lyapunov +exponents by directly evaluating the long-term Jacobian Tt (Eq 2), we use an established iterative +reorthonormalization method involving QR decomposition that avoids directly evaluating the ill- +conditioned long-term Jacobian [12, 44]. + + + 3 +First, we evolve an initially orthonormal system Qs = [q1s , q2s , . . . qks ] in the tangent space along the +trajectory using the Jacobian Ds = ∂h s+1 + ∂hs . This means to calculate + + Q + e s+1 = Ds Qs (6) +at every time-step. Second, we extract the exponential growth rates using the QR decomposition, + Qe s+1 = Qs+1 Rs+1 , +which decomposes Q e s+1 uniquely into the product of an orthonormal matrix Qs+1 of size N × k + ⊤ +so Qs+1 Qs+1 = 1k×k and an upper triangular matrix Rs+1 of size k × k with positive diagonal +elements. Note that the QR decomposition does not have to be applied at every step, just sufficiently +often, i.e., once every tONS such that Q + e does not become ill-conditioned. +The Lyapunov exponents are given by time-averaged logarithms of the diagonal entries of Rs [43, 44]: + t t + 1 Y 1X + λi = lim log Rsii = lim log Rsii . (7) + t→∞ t t→∞ t + s=1 s=1 +This way, the Lyapunov exponent can be expressed in terms of a temporal average over the diagonal +elements of the Rs -matrix of a QR decomposition of the iterated Jacobian. To propagate the gradient +of the square of the Lyapunov exponents backward through time in gradient flossing, we used an +analytical expression for the pullback of the QR decomposition [45]: The backward pass of the QR +decomposition is given by [45, 46, 47, 48] + Q = Q + Q copyltu(M) R−T , + + (8) + T T +where M = RR − Q Q and the copyltu function generates a symmetric matrix by copying +the lower triangle of the input matrix to its upper triangle, with the element [copyltu(M )]ij = +Mmax(i,j),min(i,j) [45, 46, 47, 48]. We denote here adjoint variable as T = ∂L/∂T . A simple +implementation of this algorithm in pseudocode is: +Algorithm 1 Algorithm for gradient flossing of k tangent space directions + initialize h, Q + for e = 1 → E do + for t = 1 → T do + h ← fθ (h, x) + dht + D ← dh t−1 + Q←D·Q + if t ≡ 0 (mod tONS ) then + Q, R ← qr(Q) + γi += log(Rii ) + end if + end for + λi = γi /T + ∂Lflossing + θe+1 ← θe − η ∂θ + end for +For clarity, we described gradient flossing in terms of stochastic gradient descent, but we actually +implemented it with the ADAM optimizer using standard hyperparameters η, β1 and β2 . An example +implementation is available here. Note that this algorithm also works for different recurrent network +architectures. In this case, the Jacobians D has size n × n, where n is the number of dynamic +variables of the recurrent network model. For example, in case of a single recurrent network of +N LSTM units, the Jacobian has size 2N × 2N [9, 12, 41]. The Jacobian matrix D can either be +calculated analytically or it can be obtained via automatic differentiation. + +4 Gradient Flossing: Control of Lyapunov Exponents +In Fig 1, we demonstrate that gradient flossing can set one or several Lyapunov exponents to a +target value via gradient descent with the ADAM optimizer in random Vanilla RNNs initialized with +different weight variances. The N units of the recurrent neural network follow the dynamics + hs+1 = f (hs , xs+1 ) = Wϕ(hs ) + Vxs+1 . (9) + + + 4 + C + t−1 + ∂hτ ′ +1 10 1 + number of flossed i + Y + Tt (hτ ) = k = 32 + ∂hτ ′ k = 16 + τ ′ =τ + 10 3 k=1 + + + + + 2 + i + i=1 + k + 1 + k + 10 5 + + + 0 20 40 60 80 100 + Epochs + B D + 0.0 0 + 0.5 2 +1(1/ ) + + + + + i(1/ ) + 1.0 4 number of flossed i + k = 32 + 1.5 target 1 6 k = 16 + actual 1 k=1 + 2.0 + 0 10 20 30 40 50 60 70 80 0.0 0.2 0.4 0.6 0.8 1.0 + Epochs i/N +Figure 1: Gradient flossing controls Lyapunov exponents and gradient signal propagation +A) Exploding and vanishing gradients in backpropagation through time arise from amplifica- + Qt−1 ∂h ′ +tion/attenuation of product of Jacobians that form the long-term Jacobian Tt (hτ ) = τ ′ =τ ∂hτ +1 . + τ′ +B) First Lyapunov exponent of Vanilla RNN as a function of training epochs. Minimizing the +mean squared error between estimated first Lyapunov exponent and target Lyapunov exponent +λ1 = −1, −0.5, 0 by gradient descent. 10 Vanilla RNNs were initialized with Gaussian recurrent +weights Wij ∼ N (0, g 2 /N ) where values of g were drawn g ∼ Unif(0, 1). C) Gradient flossing +minimizes the square of Lyapunov exponents over epochs. D) Full Lyapunov spectrum of Vanilla +RNN after a different number of Lyapunov exponents are pushed to zero via gradient flossing. Note, +the variability of the Lyapunov exponents that were not flossed. Parameters: network size N = 32 +with 10 network realizations. Error bars in C indicate the 25% and 75% percentiles and solid line +shows median. + + +The initial entries of W are drawn independently from a Gaussian distribution with zero mean and +variance g 2 /N , where g is a gain parameter that controls the heterogeneity of weights. We here use +the transfer function ϕ(x) = tanh(x). (See appendix B for gradient flossing with ReLU and LSTM +units). xs is a sequence of inputs and V is the input weight. xs is a stream of i.i.d. Gaussian input +xs ∼ N (0, 1) and the input weights V are N (0, 1). Both W and V are trained during gradient +flossing. +In Fig 1B, we show that for randomly initialized RNNs, the Lyapunov exponent can be modified by +gradient flossing to match a desired target value. The networks were initialized with 10 different values +of initial weight strength g chosen uniformly between 0 and 1. During gradient flossing, they quickly +approached three different target values of the first Lyapunov exponents λtarget 1 = {−1, −0.5, 0} +within less than 100 training epochs with batch size B = 1. We note that gradient flossing with +positive target λtarget + 1 seems not to arrive at a positive Lyapunov exponent λ1 . +Fig 1C shows gradient flossing for different numbers of Lyapunov exponents k. Here, during gradient- +descent, the sum of the squares of 1, 16, or 32 Lyapunov exponents is used as loss in gradient flossing +(see Fig 1A). Fig 1D shows the Lyapunov spectrum after flossing, which now has 1, 16, or 32 +Lyapunov exponents close to zero. We conclude that gradient flossing can selectively manipulate one, +several, or all Lyapunov exponents before or during network training. Gradient flossing also works for +RNNs of ReLU and LSTM units (See appendix B. Further, we find that the computational bottleneck +of gradient flossing is the QR decomposition, which has a computational complexity of O N k 2 , +both in the forward pass and in the backward pass. Thus, gradient flossing of the entire Lyapunov +spectrum is computationally expensive. However, as we will show, not all Lyapunov exponents need +to be flossed and only short episodes of gradient flossing are sufficient for significantly improving the +training performance. + + + 5 +5 Gradient Flossing: Condition Number of the Long-Term Jacobian + A B C +condition number 2(Tt( )) + 1034 1026 initial + after flossing 1030 + 1025 theory + 1019 simulations + + + + + 2 (theory) + 2(Tt( )) + 1020 + 1016 1012 k + initial 1010 5 + 107 after flossing 105 10 + theory 15 + simulations 100 + 0 100 200 300 5 10 15 100 1010 1020 1030 + time horizon t tangent space dimension m 2 (numerical) +Figure 2: Gradient flossing reduces condition number of the long-term Jacobian A) Condition +number κ2 of long-term Jacobian Tt (hτ ) as a function of time horizon t − τ at initialization (blue) +and after gradient flossing (orange). Direct numerical simulations are done with arbitrary precision +floating point arithmetic (transparent lines) with 256 bits per float, asymptotic theory based on +Lyapunov exponents (dashed lines) (Eq 10). B) Condition number for different number of tangent +space dimensions m. Simulations (dots) and Lyapunov exponent based theory (dashed lines) at +initialization (blue) and after gradient flossing (orange). Gradient flossing increases the number of +tangent space dimensions available for backpropagation for a given condition number (Grey dotted +line as a guide for eye for κ2 = 105 .) First 15 Lyapunov exponents were flossed. C) Comparison of +condition number obtained via direct numerical simulations vs. Lyapunov exponent-based. Colors +denote the number of flossed Lyapunov exponents k. Parameters: g = 1, batch size b = 1, N = 80, +epochs = 500, T = 500, gradient flossing for Ef = 500 epochs. Input xs identical to delayed XOR +task in Fig 3D. + +A well-conditioned Jacobian is essential for efficient and fast learning [21, 49, 50]. Gradient +flossing improves the condition number of the long-term Jacobian which constrains the error signal +propagation across long time horizons in backpropagation (Fig 2). The condition number κ2 of a +linear map A measures how close the map is to being singular and is given by the ratio of the largest +singular value σmax and the smallest singular values σmin , so κ2 (A) = σσmax (A) + min (A) + . According to the + p +rule of thumb given in [51], if κ2 (A) = 10 , one can anticipate losing at least p digits of precision +when solving the equation Ax = b. Note that the long-term Jacobian Tt is composed of a product of +Jacobians, which generically makes it ill-conditioned. To nevertheless quantify the condition number +numerically, we use arbitrary-precision arithmetic with 256 bits per float. We find numerically that +the condition number of Tt exponentially diverges with the number of time steps (Fig 2A). We +compare the numerically measured condition number κ2 with an asymptotic approximation of the +condition number based on Lyapunov exponents that are calculated in the forward pass and find a +good match (Fig 2A). +Our theoretical estimate of the condition number κ2 of an orthonormal system Q of size N × m that +is temporally evolved by the long-term Jacobian Tt is: + + e t+τ ) = κ2 Tt (hτ )Qt = σ1 (Tt (hτ )) ≈ exp ((λ1 − λm )(t − τ )) . + + κ2 (Q (10) + σm (Tt (hτ )) +where σ1 (Tt (hτ )) and σm (Tt (hτ )) are the first and mth singular value of the long-term Jacobian. +We note that this theoretical estimate of the condition number follows from the asymptotic definition +of Lyapunov exponents and should be exact in the limit of long times. We find that gradient flossing +reduces the condition number by a factor whose magnitude increases exponentially with time (orange +in Fig 2A). Thus, we can expect that gradient flossing has a stronger effect on problems with a long +time horizon to bridge. We will later confirm this numerically. +Moreover, Lyapunov exponents enable the estimation of the number of gradient dimensions available +for the backpropagation of error signals. Generally, the long-term Jacobian is ill-conditioned, however, +the Lyapunov spectrum provides for a given number of tangent space dimensions an estimate of the +condition number. This indicates how close to singular the gradient signal for a given number of +tangent space dimensions is. Given a fixed acceptable condition number—determined, for example, +by noise level or floating-point precision—we observe that gradient flossing increases the number of +usable tangent space dimensions for backpropagation (Fig 2B). + + + 6 +Finally, we show that the asymptotic estimate of the condition number based on Lyapunov exponents +can even predict differences in condition number that originate from finite network size N (Fig 2C). +We emphasize that this goes beyond mean-field methods, which become exact only in the large- +network limit N → ∞ and usually do not capture finite-size effects [52] (see appendix G). + +6 Initial Gradient Flossing Improves Trainability + A delayed copy task B delayed XOR task + 10 2 10 2 +test loss + + + + + test loss + no gradient flossing no gradient flossing + 10 3 gradient flossing gradient flossing + 10 3 + 10 4 + 0 2000 4000 6000 8000 10000 0 2000 4000 6000 8000 10000 + Epochs Epochs + C delayed copy task D delayed XOR task + 10 2 10 2 + + 10 3 10 3 +test loss + + + + + test loss + + 10 4 + 10 4 + + 20 40 60 80 10 20 30 40 50 60 70 80 + task difficulty (delay d) task difficulty (delay d) +Figure 3: Gradient flossing improves trainability on tasks that involve long time horizons A) Test +error for Vanilla RNNs trained on delayed copy task yt = xt−d for d = 40 with and without gradient +flossing flossing. Solid lines are medians across 5 network realizations. B) Same as A for delayed +XOR task with yt = |xt−d/2 − xt−d |. C) Mean final test loss as a function of task difficulty (delay d) +for delayed copy task. D) Mean final test loss as a function of task difficulty (delay d) for delayed +XOR task. Parameters: g = 1, batch size b = 16, N = 80, epochs = 104 , T = 300, gradient flossing +for Ef = 500 epochs on k = 75 before training. Shaded regions in C and D indicate the 20% and +80% percentiles and solid line shows mean. Dots are individual runs. Task loss: MSE(y, ŷ). + +We next present numerical results on two tasks with variable spatial and temporal complexity, +demonstrating that gradient flossing before training improves the trainability of Vanilla RNNs. We +call gradient flossing before training in the following preflossing. For preflossing, we first initialize the + Pk +network randomly, then minimize Lflossing = i=1 λ2i using the ADAM optimizer and subsequently +train on the tasks. We deliberately do not use sequential MNIST or similar toy tasks commonly used +to probe exploding/vanishing gradients, because we want a task where the structure of long-range +dependencies in the data is transparent and can be varied as desired. +First, we consider the delayed copy task, where a scalar stream of random input numbers x must be +reproduced by the output y delayed by d time steps, i.e. yt = xt−d . Although the task itself is trivial +and can be solved even by a linear network through a delay line (see appendix E), RNNs encounter +vanishing gradients for large delays d during training even with ’critical’ initialization with g = 1. +Our experiments show that gradient flossing can substantially improve the performance of RNNs +on this task (Fig 3A, C). While Vanilla RNNs without gradient flossing fail to train reliably beyond +d = 20, Vanilla RNNs with gradient flossing can be reliably trained for d = 40 (Fig 3C). Note that +we flossed here k = 40 Lyapunov exponents before training. We will later investigate the role of the +number of flossed Lyapunov exponents. +Second, we consider the temporal XOR task, which requires the RNN to perform a nonlinear input- +output computation on a sequential stream of scalar inputs, i.e., yt = |xt−d/2 − xt−d |, where d +denotes a time delay of d time steps (For details see appendix H). Fig 3D demonstrates that gradient +flossing helps to train networks on a substantially longer delay d. We found similar improvements +through gradient flossing for RNNs initialized with orthogonal weights (see appendix G). + + + 7 +7 Gradient Flossing During Training + A delayed temporal XOR B delayed spatial XOR + 100 flossing during training 100 +test accuracy (%) 90 preflossing 90 + + + + + test accuracy (%) + no flossing + 80 80 + 70 70 + 60 60 + 50 50 + 0 2000 4000 6000 8000 10000 0 2000 4000 6000 8000 10000 + Epochs Epochs + C D + 100 100 + 90 90 +test accuracy (%) + + + + + test accuracy (%) + 80 80 flossing during training + preflossing + 70 70 no flossing + 60 60 + 50 50 + 10 20 + 30 40 50 60 70 10 20 30 40 50 60 70 + complexity (delay d) complexity (delay d) +Figure 4: Gradient flossing during training further improves trainability +A) Test accuracy for Vanilla RNNs trained on delayed temporal binary XOR task yt = xt−d/2 ⊕ xt−d +with gradient flossing during training (green), preflossing (gradient flossing before training) (orange), +and with no gradient flossing (blue) for d = 70. Solid lines are mean across 20 network realizations, +individual network realizations shown in transparent fine lines. B) Same as A for delayed spatial +XOR task with yt = x1t−d ⊕ x2t−d ⊕ x3t−d . Parameters (g = 1, batch size b = 16). C) Test accuracy +as a function of task difficulty (delay d) for delayed temporal XOR task. D) Test accuracy as a +function of task difficulty (delay d) for delayed spatial XOR task. Parameters: g = 1, batch size +b = 16, N = 80, epochs = 104 , T = 300, gradient flossing for Ef = 500 epochs on k = 75 before +training and during training for green lines, and only before training for orange lines. Same plotting +conventions as previous figure. Task loss: cross-entropy between y and ŷ. +We next investigate the effects of gradient flossing during the training and find that gradient flossing +during training can further improve trainability. We trained RNNs on two more challenging tasks +with variable temporal complexity and performed gradient flossing either both during and before +training, only before training, or not at all. +Fig 4A shows the test accuracy for Vanilla RNNs training on the delayed temporal XOR task +yt = xt−d/2 ⊕ xt−d with random Bernoulli process x ∈ {0, 1}. The accuracy of Vanilla RNNs +falls to chance level for d ≥ 40 (Fig 4C). With gradient flossing before training, the trainability +can be improved, but still goes to chance level for d = 70. In contrast, for networks with gradient +flossing during training, the accuracy is improved to > 80% at d = 70. In this case, we preflossed +for 500 epochs before task training and again after 500 epochs of training on the task. In Fig 4B, +D the networks have to perform the nonlinear XOR operation yt = x1t−d ⊕ x2t−d ⊕ x3t−d on a +three-dimensional binary input signal x1 , x2 , and x3 and generate the correct output with a delay of +d steps. While the solution of the task itself is not difficult and could even be implemented by hand +(see appendix), the task is challenging for backpropagation through time because nonlinear temporal +associations bridging long time horizons have to be formed. Again, we observe that gradient flossing +before training improves the performance compared to baseline, but starts failing for long delays +d > 60. In contrast, networks that are also flossed during training can solve even more difficult tasks +(Fig 4D). We find that after gradient flossing, the norm of the error gradient with respect to initial +conditions h0 is amplified (appendix C). Interestingly, gradient flossing can also be detrimental to +task performance if it is continued throughout all training epochs (appendix C) +We note that merely regularizing the spectral radius of the recurrent weight matrix W or the individual +one-step Jacobians Ds numerically or based on mean-field theory does not yield such a training +improvement. This suggests that taking the temporal correlations between Jacobians Ds into account +is important for improving trainability. + + + 8 +7.1 Gradient Flossing for Different Numbers of Flossed Lyapunov Exponents + +We investigated how many Lyapunov exponents k have to be flossed to achieve an improvement in +training success (Fig 5). We studied this in the binary temporal delayed XOR task with gradient +flossing during training (same as Fig 3) and varied the task difficulty by changing the delay d. +We found that as the task becomes more difficult, networks where not enough Lyapunov exponents +k are flossed begin to fall below 100% test accuracy (Fig 5A). Correspondingly, when measuring +final test accuracy as a function of the number of flossed Lyapunov exponents, we observed that +more Lyapunov exponent k have to be flossed to achieve 100% accuracy as the tasks become more +difficult (Fig 5B). We also show the entire parameter plane of median test accuracy as a function of +both number of flossed Lyapunov exponents k and task difficulty (delay d), and found the same trend +(Fig 5B). Overall, we found that tasks with larger delay d require more Lyapunov exponents close to +zero. We note that this might also partially be caused by the ’streaming’ nature of the task: in our +tasks, longer delays automatically imply that more values have to be stored as at any moment all the +values in the ’delay line’ have to be remembered to successfully solve the tasks. This is different from +tasks where a single variable has to be stored and recalled after a long delay. It would be interesting +to study tasks where the number of delay steps and the number of items in memory can be varied +independently. +Finally, we did the same analysis on networks with only preflossing (gradient flossing before training) +and found the same trend (supplement Fig 7D), however, in that case even if all N Lyapunov +exponents were flossed, thus k = N , they were not able to solve the most difficult tasks. This seems +to indicate that gradient flossing during training cannot be replaced by just gradient flossing more +Lyapunov exponents before training. + + + + +Figure 5: Gradient flossing for different numbers of flossed Lyapunov exponents +A) Test accuracy for delayed temporal XOR task as a function of delay d with different numbers +flossed Lyapunov exponents k. B) Same data as A but here test accuracy as a function of number +of flossed Lyapunov exponents k. Parameters: g = 1, batch size b = 16, N = 80, epochs = 104 +for delayed temporal XOR, epochs = 5000 for delayed spatial XOR, T = 300, gradient flossing +for Ef = 500 epochs before training and during training for A, B. Shaded areas are 25% and 75% +percentile, solid lines are means, transparent dots are individual simulations, task loss: cross-entropy +between y and ŷ. +8 Limitations +The mathematical connection between Lyapunov exponents and backpropagation through time +exploited in gradient flossing is rigorously established only in the infinite-time limit. It would be +interesting to extend our analysis to finite-time Lyapunov exponents. +Furthermore, the backpropagation through time gradient involves a sum over products of Jacobians +of different time periods t − τ , but the Lyapunov exponent only considers the asymptotic longest +product. Additionally, Lyapunov exponents characterize the asymptotic dynamics on the attractor of +the dynamics, whereas RNNs often exploit transient dynamics from some initial conditions outside +or towards the attractor. +Although our proposed method focuses on exploiting Lyapunov exponents, it neglects the geometry +of covariant Lyapunov vectors [53], which could be used to improve training performance, speed, +and reliability. Additionally, it is important to investigate how sensitive the method is to the choice +of orthonormal basis employed because it is only guaranteed to become unique asymptotically [54]. + + + 9 +Finally, the computational cost of our method scales with O(N k 2 ), where N is the network size +and k is the number of Lyapunov exponents calculated. To reduce the computational cost, we +suggest doing QR decomposition only sufficiently often to ensure that the orthonormal system is +not ill-conditioned and using gradient flossing only intermittently or as pretraining. One could also +calculate the Lyapunov spectrum for a shorter time interval or use a cheaper proxy for the Lyapunov +spectrum and investigate more efficient gradient flossing schedules. + + +9 Discussion + +We tackle the problem of gradient signal propagation in recurrent neural networks through a dy- +namical systems lens. We introduce a novel method called gradient flossing that addresses the +problem of gradient instability during training. Our approach enhances gradient signal stability both +before and during training by regularizing Lyapunov exponents. By keeping the long-term Jacobian +well-conditioned, gradient flossing optimizes both training accuracy and speed. To achieve this, +we combine established dynamical systems methods for calculating Lyapunov exponents with an +analytical pullback of the QR factorization. This allows us to establish and maintain gradient stability +in a in a manner that is memory-efficient, numerically stable, and exact across long time horizons. +Our method is applicable to arbitrary RNN architectures, nonlinearities, and also neural ODEs [55]. +Empirically, pre-training with gradient flossing enhances both training speed and accuracy. For +difficult temporal credit assignment problems, gradient flossing throughout training further enhances +signal propagation. We also demonstrate the versatility of our method on a set of synthetic tasks +with controllable time-complexity and show that it can be combined with other approaches to tackle +exploding and vanishing gradients, such as dynamic mean-field theory for initialization, orthogonal +initialization and specialized single units, such as LSTMs. +Prior research on exploding and vanishing gradients mainly focused on selecting network architectures +that are less prone to exploding/vanishing gradients or finding parameter initializations that provide +well-conditioned gradients at least at the beginning of training. Our introduced gradient flossing can +be seen as a complementary approach that can further enhance gradient stability throughout training. +Compared to the work on picking good parameter initializations based on random matrix theory [41] +and mean-field heuristics [40], gradient flossing provides several improvements: First, mean-field +theory only considers the gradient flow at initialization, while gradient flossing can maintain gradient +flow and well-conditioned Jacobians throughout the training process. Second, random matrix theory +and mean-field heuristics are usually confined to the limit of large networks [52], while gradient +flossing can be used for networks of any size. The link between Lyapunov exponents and the gradients +of backpropagation through time has been described previously [9, 12] and has been spelled out +analytically and studied numerically [10, 11, 56, 57, 58]. In contrast, we use Lyapunov exponents +here not only as a diagnostic tool for gradient stability but also to show that they can directly be part +of the cure for exploding and vanishing gradients. +Future investigations could delve further into the roles of the second to N th Lyapunov exponents +in trainability, and how it is related to the task at hand, the rank of the parameter update, the dimen- +sionality of the solution space, as well as the network dynamics (see also [32, 59, 60]). Our results +suggest a trade-off between trainability across long time horizons and the nonlinear task demands +that is worth exploring in more detail (appendix C). Applying gradient flossing to real-time recurrent +learning and its biologically plausible variants is another avenue [61]. Extending gradient flossing to +feedforward networks, state-space models and transformers is a promising avenue for future research +(see also [62, 63]). While Lyapunov exponents are only strictly defined for dynamical systems, such +as maps or flows that are endomorphisms, the long-term Jacobian of deep feedforward networks +could be treated similarly. This could also provide a link between the stability of the network against +adversarial examples and its dynamic stability, as measured by Lyapunov exponents. Given that +time-varying input can suppress chaos in recurrent networks [9, 12, 64, 65, 66, 67], we anticipate they +may exacerbate vanishing gradients. Gradient flossing could also be applied in neural architecture +search, to identify and optimize trainable networks. Finally, gradient flossing is applicable to other +model parameters, as well. For instance, gradients of Lyapunov exponents with respect to single- +unit parameters could optimize the activation function and single-neuron biophysics in biologically +plausible neuron models. + + + + 10 +Acknowledgments and Disclosure of Funding +I thank E. Izhikevich, F. Wolf, S. Goedeke, J. Lindner, L.F. Abbott, L. Logiaco, M. Schottdorf, G. +Wayne and P. Sokol for fruitful discussions and J. Stone, L.F. Abbott, M. Ding, O. Marshall, S. +Goedeke, S. Lippl, M.P. Puelma-Touzel, J. Lindner and the reviewers for feedback on the manuscript. +I thank Jinguo Liu for the Julia package BackwardsLinalg.jl. Research supported by NSF NeuroNex +Award (DBI-1707398), the Gatsby Charitable Foundation (GAT3708), the Simons Collaboration for +the Global Brain (542939SPI), and the Swartz Foundation (2021-6). + +References + [1] P. WERBOS. Beyond Regression :. Ph. D. dissertation, Harvard University, 1974. + [2] DB Parker. Learning-logic (TR-47). Center for Computational Research in Economics and Management + Science. MIT-Press, Cambridge, Mass, 8, 1985. + [3] Y. LECUN. Une procedure d’apprentissage ponr reseau a seuil asymetrique. Proceedings of Cognitiva 85, + pages 599–604, 1985. + [4] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back- + propagating errors. Nature, 323(6088):533, October 1986. + [5] Sepp Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. PhD thesis, 1991. + [6] Sepp Hochreiter and Jürgen Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735– + 1780, 1997. + [7] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. + IEEE Transactions on Neural Networks, 5(2):157–166, March 1994. + [8] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training Recurrent Neural + Networks. arXiv:1211.5063 [cs], November 2012. arXiv: 1211.5063. + [9] Rainer Engelken, Fred Wolf, and L. F. Abbott. Lyapunov spectra of chaotic recurrent neural networks. + arXiv:2006.02427 [nlin, q-bio], June 2020. arXiv: 2006.02427. +[10] Jonas Mikhaeil, Zahra Monfared, and Daniel Durstewitz. On the difficulty of learning chaotic dynamics + with RNNs. Advances in Neural Information Processing Systems, 35:11297–11312, December 2022. +[11] Il Memming Park, Ábel Ságodi, and Piotr Aleksander Sokół. Persistent learning signals and working + memory without continuous attractors. ArXiv, page arXiv:2308.12585v1, August 2023. +[12] Rainer Engelken, Fred Wolf, and L. F. Abbott. Lyapunov spectra of chaotic recurrent neural networks. + Physical Review Research, 5(4):043044, October 2023. +[13] J. P. Eckmann and D. Ruelle. Ergodic theory of chaos and strange attractors. Reviews of Modern Physics, + 57(3):617–656, July 1985. +[14] Arkady Pikovsky and Antonio Politi. Lyapunov Exponents: A Tool to Explore Complex Dynamics. + Cambridge University Press, Cambridge, February 2016. +[15] Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. On the Properties of + Neural Machine Translation: Encoder-Decoder Approaches. Technical report, September 2014. ADS + Bibcode: 2014arXiv1409.1259C Type: article. +[16] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural + networks. In Proceedings of the 30th International Conference on International Conference on Machine + Learning - Volume 28, ICML’13, pages III–1310–III–1318, Atlanta, GA, USA, June 2013. JMLR.org. +[17] Tomáš Mikolov. Statistical language models based on neural networks. Ph.D. thesis, Brno University of + Technology, Faculty of Information Technology, Brno, CZ, 2012. +[18] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, November 2016. + Google-Books-ID: omivDQAAQBAJ. +[19] Sergey Ioffe and Christian Szegedy. Batch Normalization: Accelerating Deep Network Training by + Reducing Internal Covariate Shift. In Proceedings of the 32nd International Conference on Machine + Learning, pages 448–456. PMLR, June 2015. + + + 11 +[20] Bo Chang, Minmin Chen, Eldad Haber, and Ed H. Chi. AntisymmetricRNN: A Dynamical System View + on Recurrent Neural Networks. International Conference on Learning Representations, December 2018. + +[21] Andrew M. Saxe, James L. McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of + learning in deep linear neural networks. arXiv:1312.6120 [cond-mat, q-bio, stat], December 2013. arXiv: + 1312.6120. + +[22] Martin Arjovsky, Amar Shah, and Yoshua Bengio. Unitary Evolution Recurrent Neural Networks. In + Proceedings of The 33rd International Conference on Machine Learning, pages 1120–1128. PMLR, June + 2016. + +[23] Kyle Helfrich, Devin Willmott, and Qiang Ye. Orthogonal Recurrent Neural Networks with Scaled Cayley + Transform. In Proceedings of the 35th International Conference on Machine Learning, pages 1969–1978. + PMLR, July 2018. + +[24] T. Konstantin Rusch and Siddhartha Mishra. Coupled Oscillatory Recurrent Neural Network (coRNN): + An accurate and (gradient) stable architecture for learning long time dependencies. arXiv e-prints, page + arXiv:2010.00951, October 2020. + +[25] N. Benjamin Erichson, Omri Azencot, Alejandro Queiruga, Liam Hodgkinson, and Michael W. Mahoney. + Lipschitz Recurrent Neural Networks. International Conference on Learning Representations, January + 2021. + +[26] Antonio Orvieto, Samuel L Smith, Albert Gu, Anushan Fernando, Caglar Gulcehre, Razvan Pascanu, and + Soham De. Resurrecting Recurrent Neural Networks for Long Sequences. Technical report, March 2023. + ADS Bibcode: 2023arXiv230306349O Type: article. + +[27] Herbert Jaeger. The “echo state” approach to analysing and training recurrent neural networks-with an + erratum note. Bonn, Germany: German National Research Center for Information Technology GMD + Technical Report, 148:34, 2001. + +[28] Mustafa C. Ozturk, Dongming Xu, and José C. Príncipe. Analysis and Design of Echo State Networks. + Neural Computation, 19(1):111–138, January 2007. + +[29] Rupesh K Srivastava, Klaus Greff, and Jürgen Schmidhuber. Training Very Deep Networks. In Advances + in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015. + +[30] Julian Georg Zilly, Rupesh Kumar Srivastava, Jan Koutnik, and Jürgen Schmidhuber. Recurrent Highway + Networks. In Proceedings of the 34th International Conference on Machine Learning, pages 4189–4198. + PMLR, July 2017. + +[31] Piotr A. Sokół, Ian Jordan, Eben Kadile, and Il Memming Park. Adjoint Dynamics of Stable Limit Cycle + Neural Networks. In 2019 53rd Asilomar Conference on Signals, Systems, and Computers, pages 884–887, + November 2019. ISSN: 2576-2303. + +[32] Piotr A. Sokół. Geometry of Learning and Representations in Neural Networks. PhD thesis, Stony Brook + University, May 2023. + +[33] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep Sparse Rectifier Neural Networks. volume 15, + pages 315–323, April 2011. + +[34] Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha Sohl-Dickstein, and Surya Ganguli. Exponential + expressivity in deep neural networks through transient chaos. arXiv:1606.05340 [cond-mat, stat], June + 2016. arXiv: 1606.05340. + +[35] Minmin Chen, Jeffrey Pennington, and Samuel S. Schoenholz. Dynamical Isometry and a Mean Field + Theory of RNNs: Gating Enables Signal Propagation in Recurrent Neural Networks. arXiv:1806.05394 + [cs, stat], August 2018. arXiv: 1806.05394. + +[36] Boris Hanin and Mihai Nica. Products of Many Large Random Matrices and Gradients in Deep Neural + Networks. December 2018. + +[37] Samuel S. Schoenholz, Justin Gilmer, Surya Ganguli, and Jascha Sohl-Dickstein. Deep Information + Propagation. November 2016. + +[38] Boris Hanin and David Rolnick. How to Start Training: The Effect of Initialization and Architecture. In + Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. + + + 12 +[39] Piotr A. Sokol and Il Memming Park. Information Geometry of Orthogonal Initializations and Training. + Technical report, October 2018. ADS Bibcode: 2018arXiv181003785S Type: article. +[40] Dar Gilboa, Bo Chang, Minmin Chen, Greg Yang, Samuel S. Schoenholz, Ed H. Chi, and Jeffrey + Pennington. Dynamical Isometry and a Mean Field Theory of LSTMs and GRUs. January 2019. +[41] Tankut Can, Kamesh Krishnamurthy, and David J. Schwab. Gating creates slow modes and controls + phase-space complexity in GRUs and LSTMs. arXiv:2002.00025 [cond-mat, stat], January 2020. arXiv: + 2002.00025. +[42] Valery Iustinovich Oseledets. A multiplicative ergodic theorem. Characteristic Ljapunov, exponents of + dynamical systems. Trudy Moskovskogo Matematicheskogo Obshchestva, 19:179–210, 1968. +[43] Karlheinz Geist, Ulrich Parlitz, and Werner Lauterborn. Comparison of Different Methods for Computing + Lyapunov Exponents. Progress of Theoretical Physics, 83(5):875–893, May 1990. +[44] Giancarlo Benettin, Luigi Galgani, Antonio Giorgilli, and Jean-Marie Strelcyn. Lyapunov Characteristic + Exponents for smooth dynamical systems and for hamiltonian systems; A method for computing all of + them. Part 2: Numerical application. Meccanica, 15(1):21–30, March 1980. +[45] Hai-Jun Liao, Jin-Guo Liu, Lei Wang, and Tao Xiang. Differentiable Programming Tensor Networks. + Physical Review X, 9(3):031041, September 2019. arXiv:1903.09650 [cond-mat, physics:quant-ph]. +[46] S. F. Walter and L. Lehmann. Algorithmic Differentiation of Linear Algebra Functions with Application + in Optimum Experimental Design (Extended Version). Technical report, January 2010. ADS Bibcode: + 2010arXiv1001.1654W Type: article. +[47] Robert Schreiber and Charles Van Loan. A Storage-Efficient $WY$ Representation for Products of + Householder Transformations. SIAM Journal on Scientific and Statistical Computing, 10(1):53–57, January + 1989. +[48] Matthias Seeger, Asmus Hetzel, Zhenwen Dai, Eric Meissner, and Neil D. Lawrence. Auto-Differentiating + Linear Algebra. Technical report, October 2017. ADS Bibcode: 2017arXiv171008717S Type: article. +[49] Jeffrey Pennington, Samuel Schoenholz, and Surya Ganguli. Resurrecting the sigmoid in deep learning + through dynamical isometry: theory and practice. In Advances in Neural Information Processing Systems, + volume 30. Curran Associates, Inc., 2017. +[50] Jeffrey Pennington, Samuel S. Schoenholz, and Surya Ganguli. The Emergence of Spectral Universality in + Deep Networks. arXiv:1802.09979 [cs, stat], February 2018. arXiv: 1802.09979. +[51] E. Cheney and David Kincaid. Numerical Mathematics and Computing. Cengage Learning, August 2007. + Google-Books-ID: ZUfVZELlrMEC. +[52] Yasaman Bahri, Jonathan Kadmon, Jeffrey Pennington, Sam S. Schoenholz, Jascha Sohl-Dickstein, and + Surya Ganguli. Statistical Mechanics of Deep Learning. Annual Review of Condensed Matter Physics, + 11(1):501–528, 2020. +[53] F. Ginelli, P. Poggi, A. Turchi, H. Chaté, R. Livi, and A. Politi. Characterizing Dynamics with Covariant + Lyapunov Vectors. Physical Review Letters, 99(13):130601, September 2007. +[54] Sergey V. Ershov and Alexei B. Potapov. On the concept of stationary Lyapunov basis. Physica D: + Nonlinear Phenomena, 118(3):167–198, July 1998. +[55] Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural Ordinary Differential + Equations. In Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., + 2018. +[56] Javed Lindner. Investigating the exploding and vanishing gradients problem with Lyapunov exponents. + Master’s thesis, RWTH Aaachen, Aaachen/Juelich, 2021. +[57] Ryan Vogt, Maximilian Puelma Touzel, Eli Shlizerman, and Guillaume Lajoie. On Lyapunov Exponents + for RNNs: Understanding Information Propagation Using Dynamical Systems Tools. Frontiers in Applied + Mathematics and Statistics, 8, 2022. +[58] Kamesh Krishnamurthy, Tankut Can, and David J. Schwab. Theory of Gating in Recurrent Neural Networks. + Physical Review X, 12(1):011011, January 2022. +[59] L. S. Pontryagin. Mathematical Theory of Optimal Processes: The Mathematical Theory of Optimal + Processes. Routledge, New York, 1st edition edition, March 1987. + + + 13 +[60] Daniel Liberzon. Calculus of Variations and Optimal Control Theory: A Concise Introduction. In Calculus + of Variations and Optimal Control Theory. Princeton University Press, December 2011. + +[61] Owen Marschall, Kyunghyun Cho, and Cristina Savin. A unified framework of online learning algorithms + for training recurrent neural networks. The Journal of Machine Learning Research, 21(1):135:5320– + 135:5353, January 2020. + +[62] Judy Hoffman, Daniel A. Roberts, and Sho Yaida. Robust Learning with Jacobian Regularization. Technical + report, August 2019. ADS Bibcode: 2019arXiv190802729H Type: article. +[63] Hongyi Zhang, Yann N. Dauphin, and Tengyu Ma. Fixup Initialization: Residual Learning Without + Normalization. Technical report, January 2019. ADS Bibcode: 2019arXiv190109321Z Type: article. + +[64] L. Molgedey, J. Schuchhardt, and H. G. Schuster. Suppressing chaos in neural networks by noise. Physical + Review Letters, 69(26):3717–3719, December 1992. + +[65] Kanaka Rajan, L. F. Abbott, and Haim Sompolinsky. Stimulus-dependent suppression of chaos in recurrent + neural networks. Physical Review E, 82(1):011903, July 2010. + +[66] Jannis Schuecker, Sven Goedeke, David Dahmen, and Moritz Helias. Functional methods for disordered + neural networks. arXiv:1605.06758 [cond-mat, q-bio], May 2016. arXiv: 1605.06758. + +[67] Rainer Engelken, Alessandro Ingrosso, Ramin Khajeh, Sven Goedeke, and L. F. Abbott. Input correlations + impede suppression of chaos and learning in balanced firing-rate networks. PLOS Computational Biology, + 18(12):e1010590, December 2022. + +[68] Edward Ott. Chaos in Dynamical Systems. Cambridge University Press, August 2002. Google-Books-ID: + PfXoAwAAQBAJ. +[69] Angelo Vulpiani, Fabio Cecconi, and Massimo Cencini. Chaos: From Simple Models to Complex Systems. + World Scientific Pub Co Inc, Hackensack, NJ, September 2009. + + + + + 14 +A Backpropagation Through QR Decomposition +The backward pass of the QR decomposition is given by [45, 46, 47, 48] + Q = Q + Q copyltu(M) R−T + + (11) + T T +where M = RR − Q Q and the copyltu function generates a symmetric matrix by copying the lower +triangle of the input matrix to its upper triangle, with the element [copyltu(M )]ij = Mmax(i,j),min(i,j) +[45, 46, 47, 48]. Adjoint variable are written here as T = ∂L/∂T . +Using an analytical pullback is more memory-efficient and less computationally costly than directly doing +automatic differentiation through the QR-decomposition. Moreover, from a practical perspective, for QR +decomposition, often BLAS/LAPACK routines are utilized which are not amenable to common differentiable +programming frameworks like TensorFlow, PyTorch, JAX and Zygote. In our implementation of gradient +flossing, we used the Julia package BackwardsLinalg.jl by Jinguo Liu available at here . + +B Further Details and Analysis of Gradient Flossing +An example implementation of gradient flossing in Flux, a machine learning library in Julia is available here. +We are actively developing implementations for other widely used differentiable programming frameworks. + +B.1 Gradient Flossing for recurrent LSTM and ReLU networks + + A LSTM C ReLU + 0.2 target 1 6 target 1 + actual 1 actual 1 + 4 + 0.0 + 2 + 1 + + + + + 1 + + + + + 0.2 + 0 + 0.4 + 2 + 0 20 40 60 80 100 0 20 40 60 80 100 + Epochs Epochs + B D 101 + 10 1 + 10 2 10 1 + 10 3 + 2 + + + + + 2 + 1 + + + + + 1 + + + + + 10 4 10 3 + 10 5 + 10 6 10 5 + 0 20 40 60 80 100 0 20 40 60 80 100 + Epochs Epochs + + +Figure 6: Gradient flossing for recurrent LSTM networks and recurrent ReLU networks A) First +Lyapunov exponent of LSTM network as a function of training epochs. Minimizing the mean +squared error between estimated first Lyapunov exponent and target Lyapunov exponent λ1 = 0 +by gradient descent. First Lyapunov exponent of LSTM network (solid lines) converges to target +value (thick dashed lines) within less than 100 epochs. 10 random LSTM RNNs were initialized with +Gaussian recurrent weights, where standard deviations of weight scaling were drawn g ∼ Unif(0, 1). +B) Gradient flossing minimizes the square of the first Lyapunov exponent of random recurrent +LSTM networks over epochs. C) Same as A for recurrent ReLU network. Here networks were +initialized with Gaussian recurrent weights Wij ∼ N (−0.1, g 2 /N ) where values of g were drawn +g ∼ Unif(0, 1) D) B) for recurrent ReLU network. Parameters: network size N = 32 with 10 +network realizations. Shaded regions in B, D are 25% and 75% percentiles, solid line shows median. + +We demonstrate that gradient flossing can also be applied to recurrent LSTM and ReLU networks in Fig 6. To +this end, we generated random LSTM networks where the weights of all the different gates and biases were + + + 15 +independently and identically distributed (i.i.d.) and sampled from Gaussian distributions of different variance. +Our results show that gradient flossing can also constrain the Lyapunov exponent to be close to zero. The +dynamics of each of the N LSTM units follows the map [6]: + ft = σg (Uf ht−1 + Wf xt + bf ) (12) + ot = σg (Uo ht−1 + Wo xt + bo ) (13) + it = σg (Ui ht−1 + Wi xt + bi ) (14) + c̃t = σh (Uc ht−1 + Wc xt + bc ) (15) + ct = ft ⊙ ct−1 + it ⊙ c̃t (16) + ht = ot ⊙ ϕ(ct ) (17) + 1 +where ⊙ denotes the Hadamard product, σg (x) = 1+exp(−x) is the sigmoid function, σh (x) = tanh(x) and + 2 +entries of the matrices Ux are drawn from Ux ∼ N (0, gx /N ). For simplicity, the bias terms bx are scalars. +Subscripts f , o and i denote respectively the forget gate, the output gate, the input gate, and c is the cell state. +In each LSTM unit, there are two dynamic variables c and h, and three gates f , o, and i that control the flow +if signals into and out of the cell c. We set the values gih , gix , gf x , bf , gch , gcx , gcx , gox to be uniformly +distributed between 0 and 1 and initialize bi , gf h ,bc ,b0 as zero. +During gradient flossing, the actual Lyapunov exponents of different random network realizations converge close +to the target Lyapunov exponent λtarget + 1 = 0 in fewer than 100 epochs as shown in Fig 6A. Fig 6B shows that +the squared Lyapunov exponents converge towards zero. We note that for LSTM networks, a target Lyapunov +exponent of λtarget + 1 = −1 is achieved after 100 gradient flossing steps only for a subset of random network +realizations (not shown). We speculate that behavior is influenced by the gating structure of LSTM units, +which seems to naturally place the first Lyapunov exponent close to zero for certain initializations (See also +[9, 12, 41, 58]). +For the recurrent ReLU networks, we considered the same Vanilla RNN dynamics as in the main manuscript in +Eq 9 + hs+1 = f (hs , xs+1 ) = Wϕ(hs ) + Vxs+1 , +The initial entries of W are drawn independently from a Gaussian distribution with a negative mean of −0.1 +and variance g 2 /N , where g is a gain parameter that controls the heterogeneity of weights. We use the transfer +function ϕ(x) = max(x, 0). xs is a sequence of inputs and V is the input weight. xs is a stream of i.i.d. +Gaussian input xs ∼ N (0, 1) and the input weights V are N (0, 1). Both W and V are trained during gradient +flossing. We found that some ReLU network had initially unstable dynamics with positive Lyapunov exponents +Fig 6C. However, during gradient flossing, these unstable networks were quickly stabilized. Fig 6D shows that +the squared Lyapunov exponents of ReLU networks converge towards zero. + +B.2 Additional Results for Different Numbers of Flossed Lyapunov Exponents +Additionally to the main Fig 5, we did the same analysis on networks with only preflossing (gradient flossing +before training) and found that more Lyapunov exponent k have to be flossed to achieve 100% accuracy as the +tasks become more difficult (Fig 7D), however, in that case even if all N Lyapunov exponents were flossed, thus +k = N , they were not able to solve the most difficult tasks. This seems to indicate that gradient flossing during +training cannot be replaced by just gradient flossing more Lyapunov exponents before training. + + +C Additional Results on Gradient Flossing Throughout Training +We now discuss some additional results on gradient flossing throughout training. First, we analyze how gradient +flossing affects the gradients and find that during gradient flossing, the norm of gradients that bridge many time +steps are boosted. Moreover, subordinate singular values of the error norm of the recurrent weights are also +boosted, indicating that gradient flossing can increase the effective rank of the parameter update. Additionally, +we show that if gradient flossing is continued throughout training it can be detrimental to the accuracy. Finally, +we show that Lyapunov exponents of successfully trained networks after training for the spatial delayed XOR +task have a simple relationship to the delay d. + + +D Gradient Flossing boosts the Gradient Norm for Long Time Horizons +In this section, we investigate the impact of gradient flossing on the norm and structure of the gradient. It is +important to note that the complete error gradient of backpropagation through time is composed of a summation +of products of one-step Jacobians, reflecting the number of "loops" the error signal traverses through the recurrent +dynamics before reaching its target. Consequently, when the singular values of the long-term Jacobian are +smaller than 1, the influence of the shorter loops typically dominates the long-term Jacobian. + + + 16 + A B + 100 100 + k + test accuracy (%) + + + + + test accuracy (%) + 80 1 80 delay + 20 30 + 40 50 + 60 70 + 60 80 60 + + 20 40 60 0 20 40 60 80 + complexity (delay d) number of flossed i k + C 100 D 100 + 70 70 + complexity (delay d) + + + + + complexity (delay d) + test accuracy (%) + + + + + test accuracy (%) + 60 60 + 50 50 + 40 40 + 30 30 + 20 20 + 10 10 + 1 10 20 30 40 50 60 70 80 50 1 10 20 30 40 50 60 70 80 50 + number of flossed i k number of flossed i k + + +Figure 7: Gradient flossing for different numbers of flossed Lyapunov exponents +A) Test accuracy for delayed temporal XOR task as a function of delay d with different numbers +flossed Lyapunov exponents k. B) Same data as A but here test accuracy as a function of number of +flossed Lyapunov exponents k. C) Median test accuracy for delayed temporal XOR task as a function +of delay d and k for networks with gradient flossing during training (500 steps of gradient flossing at +epochs e ∈ {0, 100, 200, 300, 400}). D)Same as B for preflossing only. Parameters: g = 1, batch +size b = 16, N = 80, epochs = 104 for delayed temporal XOR, epochs = 5000 for delayed spatial +XOR, T = 300, gradient flossing for Ef = 500 epochs before training and during training for A, B, +C, and only before training for C. Shaded areas are 25% and 75% percentiles, solid lines are means, +transparent dots are individual simulations, task loss: cross-entropy btw. y, ŷ. + + +In our tasks, we have full control over the correlation structure of the task and thus know exactly which loop +length of backpropagation through time is necessary for finding the correct association. We were moreover +careful in our task design not to have any additional signals in our task that might help to bridge the long time +scale. In the case of vanishing gradients, the gradient norm is predominantly influenced by the shorter loops, +even though the actual signal in the gradient originates solely from the loop of length d in our task. To mitigate +the contamination of spurious signals from shorter loops and effectively extract the gradient that spans long time +horizons, we focus on the gradient with respect to the initial conditions h0 . + τ =t−1 t−1 + ! τ =t−1 t−1 + ! + ∂Lt ∂Lt X Y ∂hτ ′ +1 ∂hτ ∂Lt X Y ∂hτ ′ +1 ∂Lt + = = δτ 0 = Tt (h0 ) (18) + ∂h0 ∂ht ′ + ∂h τ ′ ∂h 0 ∂h t ′ + ∂h τ ′ ∂ht + τ =t−l τ =τ τ =t−l τ =τ + +We note that the sum conveniently drops as only the longest ’loop’, in other words, the only summand that +contributes is the product of Jacobians going from 0 to t. By considering this gradient, we can therefore ensure +that no undesired signals stemming from shorter loops interfere with the analysis. Moreover, we note that we +use the binary cross entropy loss which makes the derivative ∂L + ∂ht + t + trivial. +In Fig 8 we show that gradient flossing boosts the gradient with respect to the initial conditions. Specifically, we +compare two identical networks trained on the binary delayed temporal XOR task with a loop length of d = 70. +One network is trained with gradient flossing at epochs e ∈ {0, 100, 200, 300, 400}), while the other is trained +without gradient flossing. + + + 17 + dL +For the network without gradient flossing, the gradient norm of | dh 0 + | diminishes to extremely small values + −6 +(< 10 ) and remains small throughout training. In contrast, for the network trained with gradient flossing, each +episode of gradient flossing causes the norm | dh dL + 0 + | to spike, surpassing values larger than 10−2 . These findings +are direct evidence that gradient flossing boosts the gradient norm, facilitating to bridge long time horizons in +challenging temporal credit assignment tasks. We observe that after several episodes of gradient flossing, the + dL +gradient | dh 0 + | of the networks stays around 10−4 and eventually rise up to values around 10−2 . Subsequent +in training, the test accuracy surpasses chance level (Fig 8B). We observed this temporal relationship between + dL +gradient norm | dh 0 + | and training success consistently across numerous network realizations (Fig 8C and D). + dL +These findings suggest that the gradient norm | dh 0 + | can be a good predictor of learning success, sometimes +hundreds of epochs before the accuracy exceeds the chance level of 50%. Indeed, when depicting the gradient +norm aligned to the last epoch where accuracy was ≤ 50%, we see for many network realizations a gradual +growth of gradient norm oven epochs before accuracy surpasses chance level (Fig 9A). Analogously, when + dL +plotting the accuracy as a function of epoch aligned with the last epoch with | dh 0 + | < 0.001, we observe for this + dL +task that the increase of gradient norm | dh0 | reliably precedes the epoch at which the accuracy surpasses the +chance level (See Fig 9B). We note that when measuring the overlap of the orientation of the gradient vector + dL + dh0 + with the first covariant Lyapunov vector of the forward dynamics, we found a significant increase in overlap +around the training epoch where the accuracy surpasses the chance level both in networks with and without +gradient flossing. This does not come as a surprise as the covariant Lyapunov vector measure the most unstable +(or least stable) direction in the tangent space of a trajectory and perturbations of h0 that have to travel over +many epochs align + +D.1 Gradient Flossing Boosts Effective Dimension of Error Gradient +To further investigate the effect of gradient flossing on training, we investigated the structure of the error gradient + dL +and how it is changed by gradient flossing. To this end, we decompose the recurrent weight gradient σi dW +into in weighted sum of outer products using singular value decomposition (Fig 10). +As the Lyapunov exponents are the time-averaged logarithms of the singular values of the asymptotic long-term +Jacobian Tt (hτ ), this allows us to directly link the effect of pushing Lyapunov exponents toward zero during +gradient flossing to the structure of the error gradient of the recurrent weights, as they are intimately linked: + τ =t−1 t−1 + ! + ∂Lt ∂Lt X Y ∂hτ ′ +1 ∂hτ ∂Lt X ∂hτ + = = Tt (hτ ) (19) + ∂W ∂ht ′ + ∂h τ ′ ∂W ∂h t + τ + ∂W + τ =t−l τ =τ + +We again note that different ’loops’ contribute to the total gradient expression and the Lyapunov exponents only +characterize the longest loop. Further, we note that in our controlled tasks, depending on delay d, only few of the +summands are relevant for solving the task. We thus expect the relevant gradient summand that carries important +signals about the task to be contaminated by summands of both shorter and longer chains, which contribute +irrelevant fluctuations. + dL + +The singular values of the recurrent weight gradient σi dW as a function of training epoch reveal that the +subordinate singular values subordinate singular value σ20 and σ40 exhibit peaks at the times of gradient flossing, +while the first singular value σ1 only shows a slight peak (Fig 10A). This indicates that gradient flossing + dL +increases the effective rank of the recurrent weight gradient dW . In other words, gradient flossing facilitates +high-dimensional parameter updates. Our interpretation is, as gradient flossing pushes Lyapunov exponents +to zero, the different summands in the total gradient contribute more equitable as long loops have neither a +dominant contribution (which would happen for exploding gradients) nor a vanishing contribution (which would +happen for vanishing gradients). This way, the sum of gradient terms has a higher effective rank. +In contrast, without gradient flossing, the subordinate singular values (in Fig 10A σ20 and σ40 ) rapidly diminish +to extremely small values over training epochs and remain very small throughout training. Note however that the +leading singular values σ1 are of comparable size irrespective whether gradient flossing was performed or not. + dL +We note that similar to the gradient norm of the loss with respect to the initial condition | dh 0 + |, the subordinate +singular values seem to predict when the test accuracy of networks with gradient flossing grows beyond chance +level (Fig 10B). We confirmed this in multiple other network realizations and give here another example we the +accuracy grows beyond chance only later during training (Fig 11). + +D.2 Gradient Flossing Throughout Training Can Be Detrimental +We find that gradient flossing continued throughout all training epochs can be detrimental for performance +(Fig 12). We demonstrate this again in the binary delayed temporal XOR task. We compare three different +conditions: Either, we floss throughout the training every 100 training epochs for 500 flossing epochs (red), or +we floss only early during training at training epochs e ∈ {0, 100, 200, 300, 400})(green) or we do not floss at +all (blue). + + + 18 + A C + 10 2 0.05 + 10 4 0.04 + |dhd 0 | 10 6 0.03 + + + + + |dhd 0 | + 10 8 0.02 + 10 10 0.01 + with flossing during training + 10 12 without flossing + 0.00 + 0 500 1000 1500 2000 2500 3000 0 500 1000 1500 2000 2500 3000 + Epochs Epochs + B D + 100 100 + accuracy (%) + + + + + accuracy (%) + with flossing during training + without flossing + + + 50 50 + 0 500 1000 1500 2000 2500 3000 0 500 1000 1500 2000 2500 3000 + Epochs Epochs + dL +Figure 8: Gradient flossing boosts norm of long-term Jacobian A) Gradient norm of | dh 0 + | as +a function of training epochs for networks without flossing (blue) and networks with flossing +during training (orange). Error gradient norm is boosted after gradient flossing at epochs e ∈ + dL +{0, 100, 200, 300, 400, 500}). In networks without gradient flossing, the gradient norms | dh 0 + | are +much smaller overall. One out of ten random network realizations with solid line, the other 9 with +transparent line. B) Accuracy as a function of epoch, same depiction and network realizations as in +A. Note that accuracy of networks with gradient flossing grows beyond chance level approximately + dL +when the gradient norm | dh 0 + | becomes macroscopically large. C) Same as A in linear scale. Mean +final test loss as a function of task difficulty (delay d) for delayed copy task. Different colors are +different network realizations with gradient flossing during training. Black lines are without any +gradient flossing. D) Accuracy as a function of epochs, same colors as in C. Note that for all network + dL +realizations the moments where gradient norm | dh 0 + | becomes macroscopically large coincides with +the moment the accuracy is beyond chance level. Parameters: g = 1, batch size b = 16, N = 80, +epochs = 104 , T = 300, flossing for Ef = 500 epochs on k = 75 Lyapunov exponents before +training. Task: binary delayed XOR, delay d = 70, loss: cross entropy(y, ŷ). + + +We observe that after every episode of gradient flossing, the accuracy drops down close to chance level of 50% +(Fig 12A red line). Between flossing, the accuracy quickly recovers but never reaches 100%. Simultaneously, + dL +when the accuracy drops the test error jumps up (Fig 12B). We also observed that the gradient norm | dh 0 + | is + dL +initially boosted by gradient flossing, but stays close to indistinguishable once the gradient norm | dh0 | becomes +macroscopically large (Fig 12C). This suggests that once gradient flossing facilitates signal propagation across +long time horizons and the network picks up the relevant gradient signal, further gradient flossing can be harmful +to the actual task execution. We hypothesize that there might be (at least for the Vanilla networks considered +here), a trade-off between the ability to bridge long time scales which seems to require one or several Lyapunov +exponents of the forward dynamics close to zero and nonlinear tasks requirements, which require at least a +fraction of the units to be in the nonlinear regime of the nonlinearity ϕ, where ϕ′ (x) < 1. It would be an +interesting future research avenue to further investigate this potential trade-off also in other network architectures. + +D.3 Lyapunov Exponents after Training With and Without Gradient Flossing +In Fig 13, we show the first (Fig 13A, B) and the tenth (Fig 13C, D) Lyapunov exponent after training on +the spatial delayed XOR task both with and without gradient flossing. We find for successful networks with +gradient flossing a systematic relationship between the first Lyapunov exponent and the delay, that can be fitted +by approximately by λ1 (d) = −0.2exp.(−0.03delay). Unsuccessful networks with accuracy at chance level +have a much smaller largest Lyapunov exponent. The same seems to hold true for the tenth Lyapunov exponent. +In a previous study [57], a similar trend was observed, albeit in the context of a task that did not possess an + + + 19 + A B 80 + + 75 + 10 2 + 70 + 10 3 + + + + + accuracy (%) + 65 + |dhd 0 | + + + + 10 4 60 + 55 + 10 5 + 50 + 45 + 3000 2000 1000 0 1000 0 100 200 300 400 + Epochs Epochs + + +Figure 9: Increase of gradient norm precedes epoch when accuracy exceeds chance level + dL +A) Gradient norm of | dh 0 + | as a function of training epochs for 20 network realizations with flossing +during training. Epochs are aligned to last epoch where accuracy is ≤ 50%. B) Same task and +simulations as in A, but here accuracy as a function of epoch, for 20 network realizations with + dL +flossing during training. Epochs are aligned to the last epoch with | dh 0 + | < 0.001. Different colors +are different network realizations. Parameters: g = 1, batch size b = 16, N = 80, epochs = 104 , +T = 300, flossing for Ef = 500 epochs on k = 75 Lyapunov exponents early during training at +training epochs e ∈ {0, 100, 200, 300, 400}. Task: binary delayed XOR, delay d = 70, loss: cross +entropy(y, ŷ). + + +analytically tractable temporal correlation structure, which might partially explain the less conclusive results. +It is important to note that the numerical evaluation of Lyapunov exponents in recurrent LSTM networks in +[57] was based solely on the N × N Jacobian of the memory state. From a dynamical systems standpoint, a +2N × 2N Jacobian matrix encompassing interactions between both memory and cell states into account is +required [9, 12, 58]. + + +E Gradient Flossing for Linear Network +We provide code for gradient flossing in linear networks here. We find that gradient flossing also helps to train +linear networks on tasks with many time steps that can be solved by linear networks, for example the copy task, +but not for tasks the require a nonlinear input-output operation like the temporally delayed XOR task. Full +analytical description of gradient flossing for linear networks would be a promising avenue for future research as +networks with linear dynamics can still have nonlinear learning dynamics [21]. However this is beyond the cope +of the presented work. + + +F Computational Complexity of Gradient Flossing +We present here a more in-depth scaling analysis of the computational cost of gradient flossing. There are three +main contributors to the computational cost (table 1): First the RNN step, which has a computational complexity +of O N 2 b per time step, where N is the dimension of the recurrent network state (which in case of Vanilla + +networks equals the number of units) and b is the batch-size both in the forward and backward pass. Second, the +Jacobian step which scales with O N 2 k per time step, where k is the number of flossed Lyapunov exponents. +Third, the QR decomposition, which scales with O N k2 , where k is the number of Lyapunov exponents + +considered. +Together, this results in a total amortized cost of O N 2 b T per training epoch, where T is the number of + + +training time steps and a total amortized costs per flossing epoch of O N 2 Tf (1 + k/tONS + k) where Tf is + +the number of flossing time steps. + + + 20 + A + 10 2 + i dW ) + d + ( + + + + 10 4 + + + 0 500 1000 1500 2000 2500 3000 + B 100 + early training flossing + without flossing + accuracy (%) + + + + + 50 + 0 500 1000 1500 2000 2500 3000 + Epochs +Figure 10: Gradient flossing decreases condition number of recurrent weight error gradient + dL + +A) Singular values of recurrent weight gradient σi dW as a function of training epochs for singular +values i ∈ 1, 20, 40 for networks without gradient flossing (blue) and early training gradient flossing +(green). At epochs of gradient flossing, the subordinate singular value σ20 and σ40 are peaked, while +the first singular value σ1 has only a slight peak. This indicates that gradient flossing increases the + dL +effective rank of the recurrent weight gradient dW . B) Accuracy as a function of training epochs. +Note that accuracy of networks with gradient flossing grows beyond chance level approximately +when the subordinate singular values singular value σ20 and σ40 are peaked increase, which enables +high-dimensional parameters updates. Parameters: g = 1, batch size b = 16, N = 80, epochs = 103 , +T = 300, gradient flossing for Ef = 500 epochs on k = 75 Lyapunov exponents before training. +Task: binary delayed XOR, delay d = 70, loss: cross entropy(y, ŷ). + + + + +In case of preflossing, thus, the total computation cost scale with O N 2 [Eb T + Ep Tf (1 + k/tONS + k)] , + +where E is the number of training epochs and Ep is the number of preflossing epochs. +For gradient flossing during training (assuming that there is also preflossing done), the amortized cost scale with +O N 2 [Eb T + Ep Tp + Ef Tf (1 + k/tONS + k)] , where Ef is the total number of flossing epochs during +training. +Empirically, we find that both the number of preflossing epochs Ep and flossing episodes Ef necessary for +training success is much smaller than the total number of training epochs E. For example, the preflossing for +500 epochs in the numerical experiment of Fig 3 took ∼ 37 seconds, while the overall training on 10000 training +epochs with batch size b = 16 took ∼ 1680 seconds. Thus only approximately 2.2% of the total training time +was spent on gradient flossing. Moreover, Tp can be smaller than T , it just has to be long enough such that the +temporal correlations in the task can be bridged. In case of the tasks discussed in the manuscript, this would be +the delay d. It remains an important challenge to infer the suitable number of flossing time steps Tf for tasks +with unknown temporal correlation structure. +It would also be interesting to investigate how the CPU hours/wall-clock time/flops/Joule/CO2-emission spent +on gradient flossing vs on training networks with larger N are trading off against each other. For this, we would +suggest to first find the smallest network that on median successfully trains on a binary temporal XOR task for +a fixed given delay d and measure the computational resources involved in training it, e.g. in terms of CPU +hours. Then compare it to a network with gradient flossing. This would be a promising analysis but is beyond +our current computational budget. We will start such experiments an might be able to provide results during the +reviewer period. + + + 21 + A + 10 2 + i dW ) + d + ( + + + + 10 4 + + + 0 500 1000 1500 2000 2500 3000 + B 100 + early training flossing + without flossing + accuracy (%) + + + + + 50 + 0 500 1000 1500 2000 2500 3000 + Epochs +Figure 11: Gradient flossing decreases condition number of recurrent weight error gradient +Same as Fig 10 for different network realization. + + + +G Additional controls + +We also investigate the effects of gradient flossing during the training with orthogonal weight initializations +and confirm our finding that gradient flossing improves trainability on tasks that have long time horizons to +bridge. Moreover, we find that gradient flossing during training can further improve trainability. We replicated +the two more challenging tasks from the main paper (Fig 4) for orthogonal initialization with variable temporal +complexity and performed gradient flossing either both during and before training, only before training, or not at +all. +Fig 14A shows the test accuracy for Vanilla RNNs with orthogonal initialization trained on the delayed temporal +XOR task yt = xt−d/2 ⊕ xt−d with random Bernoulli process x ∈ {0, 1}. The accuracy of orthogonal Vanilla +RNNs falls to chance level for d ≥ 40 (Fig 14C). With gradient flossing before training, the trainability can +be improved, but still falls close to chance level for d = 70. In contrast, for initially orthogonal networks with +gradient flossing during training, the accuracy is improved to > 80% at d = 70. In this case, we preflossed for +500 epochs before task training and again after 500 epochs of training on the task. In Fig 14B, D the networks +have to perform the nonlinear XOR operation yt = x1t−d ⊕ x2t−d ⊕ x3t−d on a three-dimensional binary input +signal x1 , x2 , and x3 and generate the correct output with a delay of d steps identical to Fig 4 in the main text. +Again, we observe similar to networks with Gaussian initialization that flossing before training improves the +performance compared to baseline, but starts failing for long delays d > 60. In contrast, orthogonal networks +that are also flossed during training can solve even more difficult tasks (Fig 14D). We note that for Fig 14B and +D, we trained the network only on 5000 epochs, compared to 10000 epochs in networks with random Gaussian +initialization because for 10000 epochs, both networks with gradient flossing only before training and with +gradient flossing before and during training were able to bridge d = 70. These results suggest that orthogonal +initialization does seem to slightly improve performance for tasks with long time horizons to bridge and gradient +flossing and additionally boost the performance. Thus orthogonal initialization and gradient flossing seems to go +well together. It would be interesting to study if orthogonal initialization also reduces the number of gradient +flossing steps necessary to improve performance. + + + +H Additional Details on Training Tasks + +In this section, we provide a more rigorous definition of the tasks used for training, as discussed in Section 3: + + + 22 + A 100 + + + accuracy (%) + with flossing throughout training + early training flossing + without flossing + 50 + 0 500 1000 1500 2000 2500 3000 + B + 100 + test error + + + + + 10 1 + 0 500 1000 1500 2000 2500 3000 + C + 10 3 with flossing throughout training + early training flossing + without flossing + |dhd 0 | + + + + + 10 7 + + 0 500 1000 1500 2000 2500 3000 + Epochs +Figure 12: Gradient flossing throughout training can be detrimental to learning A) Accuracy as a +function of training epochs for binary temporal delayed XOR task for gradient flossing throughout +training every 100 training epochs (red). Accuracy drops down close to chance level every time after +gradient flossing but recovers quickly between. Same for only 5 episodes of gradient flossing at +epochs e ∈ {0, 100, 200, 300, 400}) (green) and no flossing at all (blue). B) Test error as a function + dL +of training epochs. C) Gradient norm of | dh 0 + | as a function of training epochs for networks without +gradient flossing (blue) and networks with flossing throughout training (red) and early training +gradient flossing (green). Error gradient norm is boosted after each gradient flossing. In networks + dL +without gradient flossing, the gradient norms | dh 0 + | are much smaller overall. Parameters: g = 1, +batch size b = 16, N = 80, epochs = 104 , T = 300, gradient flossing for Ef = 500 epochs on +k = 75 Lyapunov exponents before training. Task: binary delayed XOR, delay d = 70, loss: cross +entropy(y, ŷ). + + +H.1 Copy task +For the copy task, the target network readout at time t is yt = xt−d , where d denotes the delay. We chose the +input to be sampled i.i.d. from a uniform distribution between 0 and 1. + +H.2 Temporal XOR task +The temporal XOR task requires the target network readout yt at time t to be computed as follows: + yt = |xt−d/2 − xt−d | (20) + 2 +where again d denotes a time delay of d time steps. In the case of x ∈ {0, 1} and y ∈ {0, 1}, the output yt +follows the truth table of the XOR digital logic gate (Table 2). Thus, the function f (xa , xb ) = |xa − xb | can be +seen as an analytical representation of the XOR gate. It is important to note that f (x, 0) = x only for x ≥ 0, +and that this task requires a nonlinearity. The implementation can easily be constructed analytically, for example, +using two rectified linear units ϕ(x) = max(x, 0) the outbut can be constructed by + f (xa , xb ) = |xa − xb | = ϕ(xa − xb ) + ϕ(xb − xa ). (21) +Together with a delay line to transmit the signal xt−d over time, this can solve the task. + + + 23 + A no gradient flossing B gradient flossing during training + 0.1 + 1 + 0.1 + + + + + 1 + 0.2 + 0.2 + 20 40 60 20 40 60 + complexity (delay d) complexity (delay d) + C D + 0.1 + 0.2 0.1 + 10 + + + + + 10 + 0.3 0.2 + 20 40 60 20 40 60 + complexity (delay d) complexity (delay d) +Figure 13: Lyapunov exponents of trained networks with and without gradient flossing A) First +Lyapunov exponents λ1 for Vanilla networks trained on spatial delayed XOR task as a function of +the delay with no gradient flossing. Colored-coded is test accuracy at the end of training where red +corresponds to 100% accuracy and blue to chance level (50%). B) Same as A for networks with +gradient flossing during training. Black dashed line shows that Lyapunov exponents of successfully +trained networks can be approximated by the empirical fit λ1 (d) = −0.2exp.(−0.03delay). (Proto- +col for gradient flossing during training same as main text Fig 4B). C) Same as A for tenth Lyapunov +exponents λ10 . D) Same as B for tenth Lyapunov exponents λ10 . Same fit as in B also describes +λ10 . Parameters: g = 1, batch size b = 16, N = 80, epochs = 104 , T = 300, gradient flossing for +Ef = 500 epochs on k = 75 Lyapunov exponents before training. Task: binary spatial delayed XOR, +loss: cross entropy(y, ŷ). + + +I Additional Background on Lyapunov Exponents of RNNs + +An autonomous dynamical system is usually defined by a set of ordinary differential equations dh/dt = +F(h), h ∈ RN in the case of continuous-time dynamics, or as a map hs+1 = f (hs ) in the case of discrete-time +dynamics. In the following, the theory is presented for discrete-time dynamical systems for ease of notation, +but everything directly extends to continuous-time systems [43]. Together with an initial condition h0 , the +map forms a trajectory. As a natural extension of linear stability analysis, one can ask how an infinitesimal +perturbation h′0 = h0 + ϵu0 evolves in time. Chaotic systems are sensitive to initial conditions; almost all +infinitesimal perturbations ϵu0 of the initial condition grow exponentially with time |ϵut | ≈ exp(λ1 t)|ϵu0 |. +Finite-size perturbations, therefore, may lead to a drastically different subsequent behavior. The largest Lyapunov +exponent λ1 measures the average rate of exponential divergence or convergence of nearby initial conditions: + + + 1 ||ϵut || + λ1 (h0 ) = lim lim log (22) + t→∞ t ϵ→0 ||ϵu0 || +In dynamical systems that are ergodic on the attractor, the Lyapunov exponents do not depend on the initial +conditions as long as the initial conditions are in the basins of attraction of the attractor. Note that it is crucial +to first take the limit ϵ → 0 and then t → ∞, as λ1 (h0 ) would be trivially zero for a bounded attractor if the + ||ϵut || +limits are exchanged, as limt→∞ log ||ϵu 0 || + is bounded for finite perturbations even if the system is chaotic. To +measure k Lyapunov exponents, one has to study the evolution of k independent infinitesimal perturbations us +spanning the tangent space: + + + us+1 = Ds us (23) + + + 24 + forward pass backward pass + O N2 b + + RNN dynamics " + O N2 k + + Jacobian step " + O N k2 + + QR step " + O N2 b T + + total amortized costs " + per training epoch + O N 2 Tf (1 + k/tONS + k) + + total amortized costs " + per gradient flossing epoch + O N 2 [Eb T + Ep Tf (1 + k/tONS + k)] + + total amortized costs " + of preflossing + O N 2 [Eb T + Ep Tp + Ef Tf (1 + k/tONS + k)] + + total amortized costs " + flossing during training + +Table 1: Computational cost for gradient flossing and training of RNNs +N denotes number of neurons, b is the batch size, T is the number of time steps in forward pass of +training, Tf is the number of time steps in forward pass of flossing, tONS is the reorthonormalization +interval, k is the number of flossed Lyapunov exponents, E is the number of training epochs, Ep is +the number of preflossing epochs, Ef is the number of flossing epochs during training. Empirically, +we find that the necessary number of preflossing epochs Ep and flossing episodes Ef is much smaller +than both the total number of training epochs E. Moreover, Tp can be smaller than T . + + Table 2: XOR + input xt−d input xt−2d target output yt + 0 0 0 + 0 1 1 + 1 0 1 + 1 1 0 + + +where the N × N Jacobian Ds (hs ) = df (hs )/dh characterizes the evolution of generic infinitesimal perturba- +tions during one step. Note that this Jacobian along the trajectory is equivalent to a stability matrix only at a +fixed point, i.e., when hs+1 = f (hs ) = hs . +We are interested in the asymptotic behavior, and therefore we study the long-term Jacobian + + Tt (h0 ) = Dt−1 (ht−1 ) . . . D1 (h1 )D0 (h0 ). (24) +Note that Tt (h0 ) is a product of generally noncommuting matrices. The Lyapunov exponents λ1 ≥ λ2 · · · ≥ λN +are defined as the logarithms of the eigenvalues of the Oseledets matrix + 1 + Λ(h0 ) = lim [Tt (h0 )⊤ Tt (h0 )] 2t , (25) + t→∞ + +where ⊤ denotes the transpose operation. The expression inside the brackets is the Gram matrix of the long-term +Jacobian Tt (h0 ). Geometrically, the determinant of the Gram matrix is the squared volume of the parallelotope +spanned by the columns of Tt (h0 ). Thus, the exponential volume growth rate is given by the sum of the +logarithms of its first k (sorted) eigenvalues. Oseledets’ multiplicative ergodic theorem guarantees the existence +of the Oseledets matrix Λ(h0 ) for almost all initial conditions h0 [42]. In ergodic systems, the Lyapunov +exponents λi do not depend on the initial condition h0 . However, for a numerical calculation of the Lyapunov +spectrum, Eq 25 cannot be used directly because the long-term Jacobian Tt (h0 ) quickly becomes ill-conditioned, +i.e., the ratio between its largest and smallest singular value diverges exponentially with time. + + +J Algorithm for Calculating Lyapunov Spectrum of Rate Networks +For calculating the first k Lyapunov exponents, we exploit the fact that the growth rate of a k-dimensional +infinitesimal volume element is given by λ(m) = m (1) + , λ2 = λ(2) − λ1 , λ3 = + P + i=1 λi . Therefore, λ1 = λ + (3) +λ − λ1 − λ2 , . . . [44]. The volume growth rates can be obtained via QR-decomposition. + + + 25 + orthogonal weight initialization + A delayed temporal XOR B delayed spatial XOR + 100 100 +test accuracy (%) flossing during training + + + + + test accuracy (%) + preflossing + 80 no flossing 80 + + 60 60 + + 0 2000 4000 6000 8000 10000 0 1000 2000 3000 4000 5000 + Epochs Epochs + C D + 100 100 +test accuracy (%) + + + + + test accuracy (%) + 80 80 flossing during training + preflossing + no flossing + 60 60 + + 10 20 30 40 50 60 70 10 20 30 40 50 60 70 + complexity (delay d) complexity (delay d) + +Figure 14: Gradient flossing before and during training improves trainability for orthogonal +nets +A) Test accuracy for orthogonally initialized vanilla RNNs trained on delayed temporal binary XOR +task yt = xt−d/2 ⊕ xt−d with gradient flossing during training (green), preflossing (orange), and +with no gradient flossing (blue) for d = 70. Solid lines are mean, transparent thin lines are individual +network realizations B) Same as A for delayed spatial XOR task with yt = x1t−d ⊕ x2t−d ⊕ x3t−d . +C) Test accuracy as a function of task difficulty (delay d) for delayed temporal XOR task. D) Test +accuracy as a function of task difficulty (delay d) for delayed spatial XOR task. Parameters: g = 1, +batch size b = 16, N = 80, epochs = 104 for delayed temporal XOR, epochs = 5000 for delayed +spatial XOR, T = 300, flossing for Ef = 500 epochs on k = 75 Lyapunov exponents before training +and during training for green lines, and only before training for orange lines. Shaded areas are 25% +and 75% percentiles, solid lines are means, transparent dots are individual simulations, task loss is +cross-entropy between y, ŷ. + + +First, we evolve an initially orthonormal system Qs = [q1s , q2s , . . . qm + s ] in the tangent space along the trajectory +using the Jacobian Ds : + Qe s+1 = Ds Qs (26) +A continuous system can be transformed into a discrete system by considering a stroboscopic representation, +where the trajectory is only considered at certain discrete time points. We use here the notation of discrete +dynamical systems where this corresponds to performing the product of Jacobians along the trajectory Q + e s+1 = +Ds Qs . We study the discrete network dynamics in the limit of small time step ∆t → 0 and for discrete time +∆t = 1. The notation can be readily extended to continuous systems [43]. +Second, we extract the exponential growth rates using the QR-decomposition, + e s+1 = Qs+1 Rs+1 , + Q + +which uniquely decomposes Q e s+1 into an orthonormal matrix Qs+1 of size N × k so Q⊤ s+1 Qs+1 = 1m×m +and to an upper triangular matrix Rs+1 of size k × k with positive diagonal elements. Geometrically, Qs+1 +describes the rotation of Qs caused by Ds and the diagonal entries of Rs+1 describe the stretching and shrinking +of the columns of Qs , while the off-diagonal elements represent the shearing. Fig 15 visualizes Ds and the +QR-decomposition for k = 2. +The Lyapunov exponents are given by time-averaged logarithms of the diagonal elements of Rs : + t t + 1 Y 1X + λi = lim log Rsii = lim log Rsii . (27) + t→∞ t t→∞ t + s=1 s=1 + + + + 26 + Ds + QR + + s+1 + R22 +qs2 ~2 s+1 + 2 + qs+1 + qs+1 R11 ~1 + qs+1 + s+1 + qs1 R11 1 + qs+1 + s+1 + R22 + +Figure 15: Geometric illustration of Lyapunov spectrum calculation. An orthonormal matrix +Qs = [q1s , q2s , . . . qms ], whose columns are the axes of an k-dimensional cube, is rotated and distorted +by the Jacobian Ds into an k-dimensional parallelotope Q e s+1 = Ds Qs embedded in RN . The +figure illustrates this for k = 2, in which case the columns of Q e s+1 span a parallelogram, which +can be divided into a right triangle and a trapezoid and rearranged into a rectangle. Thus, the +area of the gray parallelogram is the same as that of the orange rectangle. The QR-decomposition +reorthonormalizes Q e s+1 by decomposing it into the product of an orthonormal matrix Qs+1 = +[q1s+1 , q2s+1 , . . . qms+1 ] and the upper-triangular matrix R + s+1 + . Qs+1 describes the rotation of Qs + s+1 +caused by Ds . The diagonal entries of R gives the stretching/shrinking along the columns of +Qs+1 ,Qthus the volume of the parallelotope formed by the first k columns of Q e s+1 is given by + m +Vm = i=1 Rs+1 ii . The time-averaged logarithms of the diagonal elements of R s + give the Lyapunov + 1 + Qt s 1 + Pt s +spectrum: λi = limtsim →∞ tsim log s=1 Rii = limtsim →∞ t s=1 log Rii . + + +Note that the QR-decomposition does not need to be performed at every simulation step, just sufficiently often, +i.e., once every sONS steps such that Q ONS = Ds+sONS −1 · Ds+sONS −2 . . . Ds · Qs remains well-conditioned + e s+s +[44]. An appropriate reorthonormalization interval sONS = tONS /∆t thus depends on the condition number, the +ratio of the smallest and largest singular value: + s+s + e s+s ) = κ2 (Rs+sONS ) = σ1 (Rs+sONS ) R ONS + κ2 ( Q = 11 . (28) + Rs+s + ONS s+s + σm (R ONS ) mm + ONS + +An initial transient should be disregarded in the calculation of the Lyapunov spectrum because h first has to +converge towards the attractor and Q has to converge to the unique eigenvectors of the Oseledets matrix (Eq 25) +[54]. A simple example of this algorithm in pseudocode is: + +Algorithm 2 Jacobian-based algorithm for Lyapunov spectrum + initialize h, Q + evolve h until it is on attractor (avoid initial transient) + evolve Q until it converges to the eigenvectors of the backward Oseledets matrix + set γi = 0 + for t = 1 → T do + h ← f (h) + df + D ← dh + Q←D·Q + if s ≡ 0 (mod sONS ) then + Q, R ← qr(Q) + γi += log(Rii ) + end if + end for + λi = γi /T + + +It is guaranteed that under general conditions initially random orthonormal systems will exponentially converge +towards a unique basis that is given by the eigenvectors of the Oseledets matrix Eq 25 [54]. A minimal example +of this algorithm in pseudocode is shown in appendix 3. A feasible strategy to determine the reorthonormalization +time interval tONS is to get first a rough estimate of the Lyapunov spectrum using a short simulation time tsim and +a small tONS and repeat with a longer simulation time and a tONS based on the Lyapunov spectrum of the rough +estimate of the Lyapunov spectrum. Another strategy is, to first iteratively adapt tONS on a short simulation +run to get an acceptable condition number. It should be noted that there exists a diversity of other methods to +estimate the Lyapunov spectrum [14, 43, 68, 69]. + + + 27 +K Convergence of Lyapunov Exponents of RNNs +In Fig. 16, we demonstrate the convergence of the Lyapunov exponents. We show the estimate of the Lyapunov +exponents λi for i = 1, 20, 60, 80 for different initial conditions but identical network realization. + + 0 + 10 +i(1/steps) + + + + + 20 + 30 + 40 + 100 101 102 + steps + + +Figure 16: Convergence of Lyapunov exponents Convergence of selected Lyapunov exponents +λi for ten identical network realizations with different initial conditions with simulation time (i = +1, 20, 60, 80) for σ = 1 and g = 1. (Other parameters: N = 80, tsim = 100 steps, tONS = 1). + + + + + 28 +
\ No newline at end of file |
