papers/txt/engelken2023_gradient_flossing.txt


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
1001
1002
1003
1004
1005
1006
1007
1008
1009
1010
1011
1012
1013
1014
1015
1016
1017
1018
1019
1020
1021
1022
1023
1024
1025
1026
1027
1028
1029
1030
1031
1032
1033
1034
1035
1036
1037
1038
1039
1040
1041
1042
1043
1044
1045
1046
1047
1048
1049
1050
1051
1052
1053
1054
1055
1056
1057
1058
1059
1060
1061
1062
1063
1064
1065
1066
1067
1068
1069
1070
1071
1072
1073
1074
1075
1076
1077
1078
1079
1080
1081
1082
1083
1084
1085
1086
1087
1088
1089
1090
1091
1092
1093
1094
1095
1096
1097
1098
1099
1100
1101
1102
1103
1104
1105
1106
1107
1108
1109
1110
1111
1112
1113
1114
1115
1116
1117
1118
1119
1120
1121
1122
1123
1124
1125
1126
1127
1128
1129
1130
1131
1132
1133
1134
1135
1136
1137
1138
1139
1140
1141
1142
1143
1144
1145
1146
1147
1148
1149
1150
1151
1152
1153
1154
1155
1156
1157
1158
1159
1160
1161
1162
1163
1164
1165
1166
1167
1168
1169
1170
1171
1172
1173
1174
1175
1176
1177
1178
1179
1180
1181
1182
1183
1184
1185
1186
1187
1188
1189
1190
1191
1192
1193
1194
1195
1196
1197
1198
1199
1200
1201
1202
1203
1204
1205
1206
1207
1208
1209
1210
1211
1212
1213
1214
1215
1216
1217
1218
1219
1220
1221
1222
1223
1224
1225
1226
1227
1228
1229
1230
1231
1232
1233
1234
1235
1236
1237
1238
1239
1240
1241
1242
1243
1244
1245
1246
1247
1248
1249
1250
1251
1252
1253
1254
1255
1256
1257
1258
1259
1260
1261
1262
1263
1264
1265
1266
1267
1268
1269
1270
1271
1272
1273
1274
1275
1276
1277
1278
1279
1280
1281
1282
1283
1284
1285
1286
1287
1288
1289
1290
1291
1292
1293
1294
1295
1296
1297
1298
1299
1300
1301
1302
1303
1304
1305
1306
1307
1308
1309
1310
1311
1312
1313
1314
1315
1316
1317
1318
1319
1320
1321
1322
1323
1324
1325
1326
1327
1328
1329
1330
1331
1332
1333
1334
1335
1336
1337
1338
1339
1340
1341
1342
1343
1344
1345
1346
1347
1348
1349
1350
1351
1352
1353
1354
1355
1356
1357
1358
1359
1360
1361
1362
1363
1364
1365
1366
1367
1368
1369
1370
1371
1372
1373
1374
1375
1376
1377
1378
1379
1380
1381
1382
1383
1384
1385
1386
1387
1388
1389
1390
1391
1392
1393
1394
1395
1396
1397
1398
1399
1400
1401
1402
1403
1404
1405
1406
1407
1408
1409
1410
1411
1412
1413
1414
1415
1416
1417
1418
1419
1420
1421
1422
1423
1424
1425
1426
1427
1428
1429
1430
1431
1432
1433
1434
1435
1436
1437
1438
1439
1440
1441
1442
1443
1444
1445
1446
1447
1448
1449
1450
1451
1452
1453
1454
1455
1456
1457
1458
1459
1460
1461
1462
1463
1464
1465
1466
1467
1468
1469
1470
1471
1472
1473
1474
1475
1476
1477
1478
1479
1480
1481
1482
1483
1484
1485
1486
1487
1488
1489
1490
1491
1492
1493
1494
1495
1496
1497
1498
1499
1500
1501
1502
1503
1504
1505
1506
1507
1508
1509
1510
1511
1512
1513
1514
1515
1516
1517
1518
1519
1520
1521
1522
1523
1524
1525
1526
1527
1528
1529
1530
1531
1532
1533
1534
1535
1536
1537
1538
1539
1540
1541
1542
1543
1544
1545
1546
1547
1548
1549
1550
1551
1552
1553
1554
1555
1556
1557
1558
1559
1560
1561
1562
1563
1564
1565
1566
1567
1568
1569
1570
1571
1572
1573
1574
1575
1576
1577
1578
1579
1580
1581
1582
1583
1584
1585
1586
1587
1588
1589
1590
1591
1592
1593
1594
1595
1596
1597
1598
1599
1600
1601
1602
1603
1604
1605
1606
1607
1608
1609
1610
1611
1612
1613
1614
1615
1616
1617
1618
1619
1620
1621
1622
1623
1624
1625
1626
1627
1628
1629
1630
1631
1632
1633
1634
1635
1636
1637
1638
1639
1640
1641
1642
1643
1644
1645
1646
1647
1648
1649
1650
1651
1652
1653
1654
1655
1656
1657
1658
1659
1660
1661
1662
1663
1664
1665
1666
1667
1668
1669
1670
1671
1672
1673
1674
1675
1676
1677
1678
1679
1680
1681
1682
1683
1684
1685
1686
1687
1688
1689
1690
1691
1692
1693
1694
1695
1696
1697
1698
1699
1700
1701
1702
1703
1704
1705
1706
1707
1708
1709
1710
1711
1712
1713
1714
1715
1716
1717
1718
1719
1720
1721
1722
1723
1724
1725
1726
1727
1728
1729
1730
1731
1732
1733
1734
1735
1736
1737
1738
1739
1740
1741
1742
1743
1744
1745
1746
1747
1748
1749
1750
1751
1752
1753
1754
1755
1756
1757
1758
1759
1760
1761
1762
1763
1764
1765
1766
1767
1768
1769
1770
1771
1772
1773
1774
1775
1776
1777
1778
1779
1780
1781
1782
1783
1784
1785
1786
1787
1788
1789
1790
1791
1792
1793
1794
1795
1796
1797
1798
1799
1800
1801
1802
1803
1804
1805
1806
1807
1808
1809
1810
1811
1812
1813
1814
1815
1816
1817
1818
1819
1820
1821
1822
1823
1824
1825
1826
1827
1828
1829
1830
1831
1832
1833
1834
1835
1836
1837
1838
1839
1840
1841
1842
1843
1844
1845
1846
1847
1848
1849
1850
1851
1852
1853
1854
1855
1856
1857
1858
1859
1860
1861
1862
1863
1864
1865
1866
1867
1868
1869
1870
1871
1872
1873
1874
1875
1876
1877
1878
1879

                                              Gradient Flossing: Improving Gradient Descent
                                                 through Dynamic Control of Jacobians


                                                                                Rainer Engelken
                                                                      Zuckerman Mind Brain Behavior Institute
                                                                               Columbia University
arXiv:2312.17306v1 [cs.LG] 28 Dec 2023


                                                                                New York, USA
                                                                             re2365@columbia.edu


                                                                                      Abstract
                                                  Training recurrent neural networks (RNNs) remains a challenge due to the insta-
                                                  bility of gradients across long time horizons, which can lead to exploding and
                                                  vanishing gradients. Recent research has linked these problems to the values
                                                  of Lyapunov exponents for the forward-dynamics, which describe the growth or
                                                  shrinkage of infinitesimal perturbations. Here, we propose gradient flossing, a
                                                  novel approach to tackling gradient instability by pushing Lyapunov exponents
                                                  of the forward dynamics toward zero during learning. We achieve this by regu-
                                                  larizing Lyapunov exponents through backpropagation using differentiable linear
                                                  algebra. This enables us to "floss" the gradients, stabilizing them and thus improv-
                                                  ing network training. We demonstrate that gradient flossing controls not only the
                                                  gradient norm but also the condition number of the long-term Jacobian, facilitating
                                                  multidimensional error feedback propagation. We find that applying gradient
                                                  flossing prior to training enhances both the success rate and convergence speed for
                                                  tasks involving long time horizons. For challenging tasks, we show that gradient
                                                  flossing during training can further increase the time horizon that can be bridged
                                                  by backpropagation through time. Moreover, we demonstrate the effectiveness of
                                                  our approach on various RNN architectures and tasks of variable temporal com-
                                                  plexity. Additionally, we provide a simple implementation of our gradient flossing
                                                  algorithm that can be used in practice. Our results indicate that gradient flossing
                                                  via regularizing Lyapunov exponents can significantly enhance the effectiveness of
                                                  RNN training and mitigate the exploding and vanishing gradients problem.


                                         1   Introduction
                                         Recurrent neural networks are commonly used both in machine learning and computational neu-
                                         roscience for tasks that involve input-to-output mappings over sequences and dynamic trajectories.
                                         Training is often achieved through gradient descent by the backpropagation of error information
                                         across time steps [1, 2, 3, 4]. This amounts to unrolling the network dynamics in time and recursively
                                         applying the chain rule to calculate the gradient of the loss with respect to the network parameters.
                                         Mathematically, evaluating the product of Jacobians of the recurrent state update describes how error
                                         signals travel across time steps. When trained on tasks that have long-range temporal dependencies,
                                         recurrent neural networks are prone to exploding and vanishing gradients [5, 6, 7, 8]. These arise from
                                         the exponential amplification or attenuation of recursive derivatives of recurrent network states over
                                         many time steps. Intuitively, to evaluate how an output error depends on a small parameter change at
                                         a much earlier point in time, the error information has to be propagated through the recurrent network
                                         states iteratively. Mathematically, this corresponds to a product of Jacobians that describe how
                                         changes in one recurrent network state depend on changes in the previous network state. Together,
                                         this product forms the long-term Jacobian. The singular value spectrum of the long-term Jacobian reg-

                                         37th Conference on Neural Information Processing Systems (NeurIPS 2023).
ulates how well error signals can propagate backwards along multiple time steps, allowing temporal
credit assignment. A close mathematical correspondence of these singular values and the Lyapunov
exponents of the forward dynamics was established recently [9, 10, 11, 12]. Lyapunov exponents
characterize the asymptotic average rate of exponential divergence or convergence of nearby initial
conditions and are a cornerstone of dynamical systems theory [13, 14]. We will use this link to
improve the trainability of RNNs.
Previous approaches that tackled the problem of exploding or vanishing gradients have suggested
solutions at different levels. First, specialized units such as LSTM and GRU were introduced,
which have additional latent variables that can be decoupled from the recurrent network states via
multiplicative (gating) interactions. The gating interactions shield the latent memory state, which can
therefore transport information across multiple time steps [5, 6, 15]. Second, exploding gradients can
be avoided by gradient clipping, which re-scales the gradient norm [16] or their individual elements
[17] if they become too large [18]. Third, normalization schemes like batch normalization prevent
saturated nonlinearities that contribute to vanishing gradients [19]. Third, it was suggested that the
problem of exploding/vanishing gradients can be ameliorated by specialized network architectures,
for example, antisymmetric networks [20], orthogonal/unitary initializations [21, 22, 23], coupled
oscillatory RNNs [24], Lipschitz RNNs [25], linear recurrent units [26], echo state networks [27, 28],
(recurrent) highway networks [29, 30], and stable limit cycle neural networks [11, 31, 32]. Fourth, for
large networks, a suitable choice of weights can guarantee a well-conditioned Jacobian at initialization
[21, 33, 34, 35, 36, 37, 38, 39, 40, 41]. These initializations are based on mean-field methods, which
become exact only in the large-network limit. Such initialization schemes have also been suggested
for gated networks [40]. However, even when initializing the network with well-behaved gradients,
gradients will typically not retain their stability during training once the network parameters have
changed.
Here, we propose a novel approach to tackling this challenge by introducing gradient flossing, a
technique that keeps gradients well-behaved throughout training. Gradient flossing is based on
a recently described link between the gradients of backpropagation through time and Lyapunov
exponents, which are the time-averaged logarithms of the singular values of the long-term Jacobian
[9, 11, 12, 32]. Gradient flossing regularizes one or several Lyapunov exponents to keep them close
to zero during training. This improves not only the error gradient norm but also the condition number
of the long-term Jacobian. As a result, error signals can be propagated back over longer time horizons.
We first demonstrate that the Lyapunov exponents can be controlled during training by including an
additional loss term. We then demonstrate that gradient flossing improves the gradient norm and
effective dimension of the gradient signal. We find empirically that gradient flossing improves test
accuracy and convergence speed on synthetic tasks over a range of temporal complexities. Finally,
we find that gradient flossing during training further helps to bridge long-time horizons and show
that it combines well with other approaches to ameliorate exploding and vanishing gradients, such as
dynamic mean-field theory for initialization, orthogonal initialization and gated units.
Our contributions include:

       • Gradient flossing, a novel approach to the problem of exploding and vanishing gradients in
         recurrent neural networks based on regularization of Lyapunov exponents.

       • Analytical estimates of the condition number of the long-term Jacobian based on Lyapunov
         exponents.

       • Empirical evidence that gradient flossing improves training on tasks that involve bridging
         long time horizons.


2   RNN Gradients and Lyapunov Exponents

We begin by revisiting the established mathematical relationship between the gradients of the loss
function, computed via backpropagation through time, and Lyapunov exponents [9, 12], and how
it relates to the problem of vanishing and exploding gradients. In backpropagation through time,
network parameters θ are iteratively updated by stochastic gradient descent such that a loss Lt is
locally reduced [1, 2, 3, 4]. For RNN dynamics hs+1 = fθ (hs , xs+1 ), with recurrent network state
h, external input x, and parameters θ, the gradient of the loss Lt with respect to θ is evaluated by


                                                   2
unrolling the network dynamics in time. The resulting expression for the gradient is given by:
                            τ =t−1   t−1
                                                  !
                 ∂Lt    ∂Lt X        Y ∂hτ ′ +1 ∂hτ         ∂Lt X              ∂hτ
                     =                                   =            Tt (hτ )                        (1)
                 ∂θ     ∂ht           ′
                                          ∂hτ  ′     ∂θ      ∂ht τ              ∂θ
                               τ =t−l   τ =τ

where Tt (hτ ) is composed of a product of one-step Jacobians Ds = ∂hs+1
                                                                    ∂hs :
                                               t−1                       t−1
                                               Y        ∂hτ ′ +1   Y
                                 Tt (hτ ) =                      =   Dτ ′                             (2)
                                                         ∂hτ ′
                                               τ ′ =τ              ′     τ =τ

Due to the chain of matrix multiplications in Tt , the gradients tend to vanish or explode exponentially
with time. This complicates training particularly when the task loss at time t dependents on inputs x
or states h from many time steps prior which creates long temporal dependencies [5, 6, 7, 8]. How
well error signals can propagate back in time is constrained by the tangent space dynamics along
trajectory ht , which dictate how local perturbations around each point on the trajectory stretch, rotate,
shear, or compress as the system evolves.
The singular values of the Jacobian’s product Tt , which determine how quickly gradients vanish or
explode during backpropagation through time, are directly related to the Lyapunov exponents of the
forward dynamics [9, 12]: Lyapunov exponents λ1 ≥ λ2 · · · ≥ λN are defined as the asymptotic
time-averaged logarithms of the singular values of the long-term Jacobian [13, 42, 43]
                                                         1
                                        λi = lim              log(σi,t )                              (3)
                                               t→∞ t − τ

where σi,t denotes the ith singular value of Tt (hτ ) with σ1,t ≥ σ2,t . . . σN,t (See Appendix I
for details). This means that positive Lyapunov exponents in the forward dynamics correspond to
exponentially exploding gradient modes, while negative Lyapunov exponents in the forward dynamics
correspond to exponentially vanishing gradient modes.
In summary, the Lyapunov exponents give the average asymptotic exponential growth rates of
infinitesimal perturbations in the tangent space of the forward dynamics, which also constrain the
signal propagation in backpropagation for long time horizons. Lyapunov exponents close to zero in
the forward dynamics correspond to tangent space directions along which error signals are neither
drastically attenuated nor amplified in backpropagation through time. Such close-to-neutral modes in
the tangent dynamics can propagate information reliably across many time steps.

3   Gradient Flossing: Idea and Algorithm
We now leverage the mathematical connection established between Lyapunov exponents and the
prevalent issue of exploding and vanishing gradients for regularizing the singular values of the
long-term Jacobian. We term this procedure gradient flossing. To prevent exploding and vanishing
gradients, we constrain Lyapunov exponents to be close to zero. This ensures that the corresponding
directions in tangent space grow and shrink on average only slowly. This leads to a better-conditioned
long-term Jacobian Tt (hτ ). We achieve this by using the sum of the squares of the first k largest
Lyapunov exponent λ1 , λ2 . . . λk as a loss function:
                                                             k
                                                             X
                                            Lflossing =            λ2i                                (4)
                                                             i=1

and evaluate the gradient obtained from backpropagation through time:
                                                      k
                                        ∂Lflossing   X   ∂λ2i
                                                   =                                                  (5)
                                          ∂θ         i=1
                                                         ∂θ

This might seem like an ill-fated enterprise, as the gradient expression in Eq 5 suffers from its
own problem of exploding and vanishing gradients. However, instead of calculating the Lyapunov
exponents by directly evaluating the long-term Jacobian Tt (Eq 2), we use an established iterative
reorthonormalization method involving QR decomposition that avoids directly evaluating the ill-
conditioned long-term Jacobian [12, 44].


                                                         3
First, we evolve an initially orthonormal system Qs = [q1s , q2s , . . . qks ] in the tangent space along the
trajectory using the Jacobian Ds = ∂h   s+1
                                       ∂hs . This means to calculate

                                           Q
                                           e s+1 = Ds Qs                                       (6)
at every time-step. Second, we extract the exponential growth rates using the QR decomposition,
                                       Qe s+1 = Qs+1 Rs+1 ,
which decomposes Q    e s+1 uniquely into the product of an orthonormal matrix Qs+1 of size N × k
     ⊤
so Qs+1 Qs+1 = 1k×k and an upper triangular matrix Rs+1 of size k × k with positive diagonal
elements. Note that the QR decomposition does not have to be applied at every step, just sufficiently
often, i.e., once every tONS such that Q
                                       e does not become ill-conditioned.
The Lyapunov exponents are given by time-averaged logarithms of the diagonal entries of Rs [43, 44]:
                                              t                t
                                       1     Y              1X
                               λi = lim  log     Rsii = lim       log Rsii .                             (7)
                                   t→∞ t                t→∞ t
                                             s=1              s=1
This way, the Lyapunov exponent can be expressed in terms of a temporal average over the diagonal
elements of the Rs -matrix of a QR decomposition of the iterated Jacobian. To propagate the gradient
of the square of the Lyapunov exponents backward through time in gradient flossing, we used an
analytical expression for the pullback of the QR decomposition [45]: The backward pass of the QR
decomposition is given by [45, 46, 47, 48]
                                 Q = Q + Q copyltu(M) R−T ,
                                                           
                                                                                                 (8)
                   T       T
where M = RR − Q Q and the copyltu function generates a symmetric matrix by copying
the lower triangle of the input matrix to its upper triangle, with the element [copyltu(M )]ij =
Mmax(i,j),min(i,j) [45, 46, 47, 48]. We denote here adjoint variable as T = ∂L/∂T . A simple
implementation of this algorithm in pseudocode is:
Algorithm 1 Algorithm for gradient flossing of k tangent space directions
  initialize h, Q
  for e = 1 → E do
       for t = 1 → T do
           h ← fθ (h, x)
                  dht
           D ← dh  t−1
           Q←D·Q
           if t ≡ 0 (mod tONS ) then
               Q, R ← qr(Q)
               γi += log(Rii )
           end if
       end for
       λi = γi /T
                       ∂Lflossing
       θe+1 ← θe − η ∂θ
  end for
For clarity, we described gradient flossing in terms of stochastic gradient descent, but we actually
implemented it with the ADAM optimizer using standard hyperparameters η, β1 and β2 . An example
implementation is available here. Note that this algorithm also works for different recurrent network
architectures. In this case, the Jacobians D has size n × n, where n is the number of dynamic
variables of the recurrent network model. For example, in case of a single recurrent network of
N LSTM units, the Jacobian has size 2N × 2N [9, 12, 41]. The Jacobian matrix D can either be
calculated analytically or it can be obtained via automatic differentiation.

4   Gradient Flossing: Control of Lyapunov Exponents
In Fig 1, we demonstrate that gradient flossing can set one or several Lyapunov exponents to a
target value via gradient descent with the ADAM optimizer in random Vanilla RNNs initialized with
different weight variances. The N units of the recurrent neural network follow the dynamics
                             hs+1 = f (hs , xs+1 ) = Wϕ(hs ) + Vxs+1 .                        (9)


                                                     4
                                                                       C
                                   t−1
                                            ∂hτ ′ +1                        10 1
                                                                                                             number of flossed i
                                   Y
                   Tt (hτ ) =                                                                                        k = 32
                                             ∂hτ ′                                                                   k = 16
                                   τ ′ =τ
                                                                            10 3                                     k=1


                                                                 2
                                                                 i
                                                                      i=1
                                                                 k
                                                                 1
                                                                 k
                                                                            10 5


                                                                                         0     20     40    60             80         100
                                                                                                       Epochs
         B                                                               D
             0.0                                                                   0
             0.5                                                                   2
1(1/ )


                                                                      i(1/ )
             1.0                                                                   4     number of flossed i
                                                                                                 k = 32
             1.5            target 1                                               6             k = 16
                            actual 1                                                             k=1
             2.0
                   0   10     20       30     40 50    60   70   80                    0.0     0.2     0.4           0.6        0.8         1.0
                                            Epochs                                                             i/N
Figure 1: Gradient flossing controls Lyapunov exponents and gradient signal propagation
A) Exploding and vanishing gradients in backpropagation through time arise from amplifica-
                                                                                    Qt−1 ∂h ′
tion/attenuation of product of Jacobians that form the long-term Jacobian Tt (hτ ) = τ ′ =τ ∂hτ +1 .
                                                                                               τ′
B) First Lyapunov exponent of Vanilla RNN as a function of training epochs. Minimizing the
mean squared error between estimated first Lyapunov exponent and target Lyapunov exponent
λ1 = −1, −0.5, 0 by gradient descent. 10 Vanilla RNNs were initialized with Gaussian recurrent
weights Wij ∼ N (0, g 2 /N ) where values of g were drawn g ∼ Unif(0, 1). C) Gradient flossing
minimizes the square of Lyapunov exponents over epochs. D) Full Lyapunov spectrum of Vanilla
RNN after a different number of Lyapunov exponents are pushed to zero via gradient flossing. Note,
the variability of the Lyapunov exponents that were not flossed. Parameters: network size N = 32
with 10 network realizations. Error bars in C indicate the 25% and 75% percentiles and solid line
shows median.


The initial entries of W are drawn independently from a Gaussian distribution with zero mean and
variance g 2 /N , where g is a gain parameter that controls the heterogeneity of weights. We here use
the transfer function ϕ(x) = tanh(x). (See appendix B for gradient flossing with ReLU and LSTM
units). xs is a sequence of inputs and V is the input weight. xs is a stream of i.i.d. Gaussian input
xs ∼ N (0, 1) and the input weights V are N (0, 1). Both W and V are trained during gradient
flossing.
In Fig 1B, we show that for randomly initialized RNNs, the Lyapunov exponent can be modified by
gradient flossing to match a desired target value. The networks were initialized with 10 different values
of initial weight strength g chosen uniformly between 0 and 1. During gradient flossing, they quickly
approached three different target values of the first Lyapunov exponents λtarget 1     = {−1, −0.5, 0}
within less than 100 training epochs with batch size B = 1. We note that gradient flossing with
positive target λtarget
                  1     seems not to arrive at a positive Lyapunov exponent λ1 .
Fig 1C shows gradient flossing for different numbers of Lyapunov exponents k. Here, during gradient-
descent, the sum of the squares of 1, 16, or 32 Lyapunov exponents is used as loss in gradient flossing
(see Fig 1A). Fig 1D shows the Lyapunov spectrum after flossing, which now has 1, 16, or 32
Lyapunov exponents close to zero. We conclude that gradient flossing can selectively manipulate one,
several, or all Lyapunov exponents before or during network training. Gradient flossing also works for
RNNs of ReLU and LSTM units (See appendix B. Further, we find that the computational bottleneck         
of gradient flossing is the QR decomposition, which has a computational complexity of O N k 2 ,
both in the forward pass and in the backward pass. Thus, gradient flossing of the entire Lyapunov
spectrum is computationally expensive. However, as we will show, not all Lyapunov exponents need
to be flossed and only short episodes of gradient flossing are sufficient for significantly improving the
training performance.


                                                                               5
5                           Gradient Flossing: Condition Number of the Long-Term Jacobian
                             A                                            B                                              C
condition number 2(Tt( ))
                            1034                                         1026        initial
                                                                                     after flossing                      1030
                            1025                                                     theory
                                                                         1019        simulations


                                                                                                            2 (theory)
                                                              2(Tt( ))
                                                                                                                         1020
                            1016                                         1012                                                                    k
                                                  initial                                                                1010                        5
                            107                   after flossing         105                                                                         10
                                                  theory                                                                                             15
                                                  simulations                                                            100
                                   0    100 200 300                                    5       10      15                       100 1010 1020 1030
                                     time horizon t                             tangent space dimension m                             2 (numerical)
Figure 2: Gradient flossing reduces condition number of the long-term Jacobian A) Condition
number κ2 of long-term Jacobian Tt (hτ ) as a function of time horizon t − τ at initialization (blue)
and after gradient flossing (orange). Direct numerical simulations are done with arbitrary precision
floating point arithmetic (transparent lines) with 256 bits per float, asymptotic theory based on
Lyapunov exponents (dashed lines) (Eq 10). B) Condition number for different number of tangent
space dimensions m. Simulations (dots) and Lyapunov exponent based theory (dashed lines) at
initialization (blue) and after gradient flossing (orange). Gradient flossing increases the number of
tangent space dimensions available for backpropagation for a given condition number (Grey dotted
line as a guide for eye for κ2 = 105 .) First 15 Lyapunov exponents were flossed. C) Comparison of
condition number obtained via direct numerical simulations vs. Lyapunov exponent-based. Colors
denote the number of flossed Lyapunov exponents k. Parameters: g = 1, batch size b = 1, N = 80,
epochs = 500, T = 500, gradient flossing for Ef = 500 epochs. Input xs identical to delayed XOR
task in Fig 3D.

A well-conditioned Jacobian is essential for efficient and fast learning [21, 49, 50]. Gradient
flossing improves the condition number of the long-term Jacobian which constrains the error signal
propagation across long time horizons in backpropagation (Fig 2). The condition number κ2 of a
linear map A measures how close the map is to being singular and is given by the ratio of the largest
singular value σmax and the smallest singular values σmin , so κ2 (A) = σσmax   (A)
                                                                            min (A)
                                                                                    . According to the
                                            p
rule of thumb given in [51], if κ2 (A) = 10 , one can anticipate losing at least p digits of precision
when solving the equation Ax = b. Note that the long-term Jacobian Tt is composed of a product of
Jacobians, which generically makes it ill-conditioned. To nevertheless quantify the condition number
numerically, we use arbitrary-precision arithmetic with 256 bits per float. We find numerically that
the condition number of Tt exponentially diverges with the number of time steps (Fig 2A). We
compare the numerically measured condition number κ2 with an asymptotic approximation of the
condition number based on Lyapunov exponents that are calculated in the forward pass and find a
good match (Fig 2A).
Our theoretical estimate of the condition number κ2 of an orthonormal system Q of size N × m that
is temporally evolved by the long-term Jacobian Tt is:

                                         e t+τ ) = κ2 Tt (hτ )Qt = σ1 (Tt (hτ )) ≈ exp ((λ1 − λm )(t − τ )) .
                                                                
                                     κ2 (Q                                                                                                            (10)
                                                                   σm (Tt (hτ ))
where σ1 (Tt (hτ )) and σm (Tt (hτ )) are the first and mth singular value of the long-term Jacobian.
We note that this theoretical estimate of the condition number follows from the asymptotic definition
of Lyapunov exponents and should be exact in the limit of long times. We find that gradient flossing
reduces the condition number by a factor whose magnitude increases exponentially with time (orange
in Fig 2A). Thus, we can expect that gradient flossing has a stronger effect on problems with a long
time horizon to bridge. We will later confirm this numerically.
Moreover, Lyapunov exponents enable the estimation of the number of gradient dimensions available
for the backpropagation of error signals. Generally, the long-term Jacobian is ill-conditioned, however,
the Lyapunov spectrum provides for a given number of tangent space dimensions an estimate of the
condition number. This indicates how close to singular the gradient signal for a given number of
tangent space dimensions is. Given a fixed acceptable condition number—determined, for example,
by noise level or floating-point precision—we observe that gradient flossing increases the number of
usable tangent space dimensions for backpropagation (Fig 2B).


                                                                                       6
Finally, we show that the asymptotic estimate of the condition number based on Lyapunov exponents
can even predict differences in condition number that originate from finite network size N (Fig 2C).
We emphasize that this goes beyond mean-field methods, which become exact only in the large-
network limit N → ∞ and usually do not capture finite-size effects [52] (see appendix G).

6      Initial Gradient Flossing Improves Trainability
     A             delayed copy task       B                                                          delayed XOR task
            10 2                                                             10 2
test loss


                                                                 test loss
                                          no gradient flossing                                           no gradient flossing
            10 3                          gradient flossing                                              gradient flossing
                                                                             10 3
            10 4
                   0   2000      4000 6000         8000 10000                            0        2000    4000 6000         8000 10000
                                    Epochs                                                                   Epochs
     C                      delayed copy task                         D                               delayed XOR task
            10 2                                                             10 2

            10 3                                                             10 3
test loss


                                                                 test loss

            10 4
                                                                             10 4

                       20           40            60        80                      10       20      30 40 50 60                 70   80
                            task difficulty (delay d)                                                task difficulty (delay d)
Figure 3: Gradient flossing improves trainability on tasks that involve long time horizons A) Test
error for Vanilla RNNs trained on delayed copy task yt = xt−d for d = 40 with and without gradient
flossing flossing. Solid lines are medians across 5 network realizations. B) Same as A for delayed
XOR task with yt = |xt−d/2 − xt−d |. C) Mean final test loss as a function of task difficulty (delay d)
for delayed copy task. D) Mean final test loss as a function of task difficulty (delay d) for delayed
XOR task. Parameters: g = 1, batch size b = 16, N = 80, epochs = 104 , T = 300, gradient flossing
for Ef = 500 epochs on k = 75 before training. Shaded regions in C and D indicate the 20% and
80% percentiles and solid line shows mean. Dots are individual runs. Task loss: MSE(y, ŷ).

We next present numerical results on two tasks with variable spatial and temporal complexity,
demonstrating that gradient flossing before training improves the trainability of Vanilla RNNs. We
call gradient flossing before training in the following preflossing. For preflossing, we first initialize the
                                                  Pk
network randomly, then minimize Lflossing = i=1 λ2i using the ADAM optimizer and subsequently
train on the tasks. We deliberately do not use sequential MNIST or similar toy tasks commonly used
to probe exploding/vanishing gradients, because we want a task where the structure of long-range
dependencies in the data is transparent and can be varied as desired.
First, we consider the delayed copy task, where a scalar stream of random input numbers x must be
reproduced by the output y delayed by d time steps, i.e. yt = xt−d . Although the task itself is trivial
and can be solved even by a linear network through a delay line (see appendix E), RNNs encounter
vanishing gradients for large delays d during training even with ’critical’ initialization with g = 1.
Our experiments show that gradient flossing can substantially improve the performance of RNNs
on this task (Fig 3A, C). While Vanilla RNNs without gradient flossing fail to train reliably beyond
d = 20, Vanilla RNNs with gradient flossing can be reliably trained for d = 40 (Fig 3C). Note that
we flossed here k = 40 Lyapunov exponents before training. We will later investigate the role of the
number of flossed Lyapunov exponents.
Second, we consider the temporal XOR task, which requires the RNN to perform a nonlinear input-
output computation on a sequential stream of scalar inputs, i.e., yt = |xt−d/2 − xt−d |, where d
denotes a time delay of d time steps (For details see appendix H). Fig 3D demonstrates that gradient
flossing helps to train networks on a substantially longer delay d. We found similar improvements
through gradient flossing for RNNs initialized with orthogonal weights (see appendix G).


                                                                 7
7      Gradient Flossing During Training
     A          delayed temporal XOR     B                                                             delayed spatial XOR
                    100        flossing during training                                    100
test accuracy (%)    90        preflossing                                                  90


                                                                       test accuracy (%)
                               no flossing
                     80                                                                     80
                     70                                                                     70
                     60                                                                     60
                     50                                                                     50
                          0       2000     4000 6000      8000 10000                             0   2000   4000 6000    8000 10000
                                              Epochs                                                           Epochs
     C                                                                      D
                    100                                                                    100
                     90                                                                     90
test accuracy (%)


                                                                       test accuracy (%)
                     80                                                                     80                     flossing during training
                                                                                                                   preflossing
                     70                                                                     70                     no flossing
                     60                                                                     60
                     50                                                                     50
                          10     20
                      30 40 50 60 70                         10 20 30 40 50 60 70
                     complexity (delay d)                               complexity (delay d)
Figure 4: Gradient flossing during training further improves trainability
A) Test accuracy for Vanilla RNNs trained on delayed temporal binary XOR task yt = xt−d/2 ⊕ xt−d
with gradient flossing during training (green), preflossing (gradient flossing before training) (orange),
and with no gradient flossing (blue) for d = 70. Solid lines are mean across 20 network realizations,
individual network realizations shown in transparent fine lines. B) Same as A for delayed spatial
XOR task with yt = x1t−d ⊕ x2t−d ⊕ x3t−d . Parameters (g = 1, batch size b = 16). C) Test accuracy
as a function of task difficulty (delay d) for delayed temporal XOR task. D) Test accuracy as a
function of task difficulty (delay d) for delayed spatial XOR task. Parameters: g = 1, batch size
b = 16, N = 80, epochs = 104 , T = 300, gradient flossing for Ef = 500 epochs on k = 75 before
training and during training for green lines, and only before training for orange lines. Same plotting
conventions as previous figure. Task loss: cross-entropy between y and ŷ.
We next investigate the effects of gradient flossing during the training and find that gradient flossing
during training can further improve trainability. We trained RNNs on two more challenging tasks
with variable temporal complexity and performed gradient flossing either both during and before
training, only before training, or not at all.
Fig 4A shows the test accuracy for Vanilla RNNs training on the delayed temporal XOR task
yt = xt−d/2 ⊕ xt−d with random Bernoulli process x ∈ {0, 1}. The accuracy of Vanilla RNNs
falls to chance level for d ≥ 40 (Fig 4C). With gradient flossing before training, the trainability
can be improved, but still goes to chance level for d = 70. In contrast, for networks with gradient
flossing during training, the accuracy is improved to > 80% at d = 70. In this case, we preflossed
for 500 epochs before task training and again after 500 epochs of training on the task. In Fig 4B,
D the networks have to perform the nonlinear XOR operation yt = x1t−d ⊕ x2t−d ⊕ x3t−d on a
three-dimensional binary input signal x1 , x2 , and x3 and generate the correct output with a delay of
d steps. While the solution of the task itself is not difficult and could even be implemented by hand
(see appendix), the task is challenging for backpropagation through time because nonlinear temporal
associations bridging long time horizons have to be formed. Again, we observe that gradient flossing
before training improves the performance compared to baseline, but starts failing for long delays
d > 60. In contrast, networks that are also flossed during training can solve even more difficult tasks
(Fig 4D). We find that after gradient flossing, the norm of the error gradient with respect to initial
conditions h0 is amplified (appendix C). Interestingly, gradient flossing can also be detrimental to
task performance if it is continued throughout all training epochs (appendix C)
We note that merely regularizing the spectral radius of the recurrent weight matrix W or the individual
one-step Jacobians Ds numerically or based on mean-field theory does not yield such a training
improvement. This suggests that taking the temporal correlations between Jacobians Ds into account
is important for improving trainability.


                                                                       8
7.1   Gradient Flossing for Different Numbers of Flossed Lyapunov Exponents

We investigated how many Lyapunov exponents k have to be flossed to achieve an improvement in
training success (Fig 5). We studied this in the binary temporal delayed XOR task with gradient
flossing during training (same as Fig 3) and varied the task difficulty by changing the delay d.
We found that as the task becomes more difficult, networks where not enough Lyapunov exponents
k are flossed begin to fall below 100% test accuracy (Fig 5A). Correspondingly, when measuring
final test accuracy as a function of the number of flossed Lyapunov exponents, we observed that
more Lyapunov exponent k have to be flossed to achieve 100% accuracy as the tasks become more
difficult (Fig 5B). We also show the entire parameter plane of median test accuracy as a function of
both number of flossed Lyapunov exponents k and task difficulty (delay d), and found the same trend
(Fig 5B). Overall, we found that tasks with larger delay d require more Lyapunov exponents close to
zero. We note that this might also partially be caused by the ’streaming’ nature of the task: in our
tasks, longer delays automatically imply that more values have to be stored as at any moment all the
values in the ’delay line’ have to be remembered to successfully solve the tasks. This is different from
tasks where a single variable has to be stored and recalled after a long delay. It would be interesting
to study tasks where the number of delay steps and the number of items in memory can be varied
independently.
Finally, we did the same analysis on networks with only preflossing (gradient flossing before training)
and found the same trend (supplement Fig 7D), however, in that case even if all N Lyapunov
exponents were flossed, thus k = N , they were not able to solve the most difficult tasks. This seems
to indicate that gradient flossing during training cannot be replaced by just gradient flossing more
Lyapunov exponents before training.


Figure 5: Gradient flossing for different numbers of flossed Lyapunov exponents
A) Test accuracy for delayed temporal XOR task as a function of delay d with different numbers
flossed Lyapunov exponents k. B) Same data as A but here test accuracy as a function of number
of flossed Lyapunov exponents k. Parameters: g = 1, batch size b = 16, N = 80, epochs = 104
for delayed temporal XOR, epochs = 5000 for delayed spatial XOR, T = 300, gradient flossing
for Ef = 500 epochs before training and during training for A, B. Shaded areas are 25% and 75%
percentile, solid lines are means, transparent dots are individual simulations, task loss: cross-entropy
between y and ŷ.
8     Limitations
The mathematical connection between Lyapunov exponents and backpropagation through time
exploited in gradient flossing is rigorously established only in the infinite-time limit. It would be
interesting to extend our analysis to finite-time Lyapunov exponents.
Furthermore, the backpropagation through time gradient involves a sum over products of Jacobians
of different time periods t − τ , but the Lyapunov exponent only considers the asymptotic longest
product. Additionally, Lyapunov exponents characterize the asymptotic dynamics on the attractor of
the dynamics, whereas RNNs often exploit transient dynamics from some initial conditions outside
or towards the attractor.
Although our proposed method focuses on exploiting Lyapunov exponents, it neglects the geometry
of covariant Lyapunov vectors [53], which could be used to improve training performance, speed,
and reliability. Additionally, it is important to investigate how sensitive the method is to the choice
of orthonormal basis employed because it is only guaranteed to become unique asymptotically [54].


                                                   9
Finally, the computational cost of our method scales with O(N k 2 ), where N is the network size
and k is the number of Lyapunov exponents calculated. To reduce the computational cost, we
suggest doing QR decomposition only sufficiently often to ensure that the orthonormal system is
not ill-conditioned and using gradient flossing only intermittently or as pretraining. One could also
calculate the Lyapunov spectrum for a shorter time interval or use a cheaper proxy for the Lyapunov
spectrum and investigate more efficient gradient flossing schedules.


9   Discussion

We tackle the problem of gradient signal propagation in recurrent neural networks through a dy-
namical systems lens. We introduce a novel method called gradient flossing that addresses the
problem of gradient instability during training. Our approach enhances gradient signal stability both
before and during training by regularizing Lyapunov exponents. By keeping the long-term Jacobian
well-conditioned, gradient flossing optimizes both training accuracy and speed. To achieve this,
we combine established dynamical systems methods for calculating Lyapunov exponents with an
analytical pullback of the QR factorization. This allows us to establish and maintain gradient stability
in a in a manner that is memory-efficient, numerically stable, and exact across long time horizons.
Our method is applicable to arbitrary RNN architectures, nonlinearities, and also neural ODEs [55].
Empirically, pre-training with gradient flossing enhances both training speed and accuracy. For
difficult temporal credit assignment problems, gradient flossing throughout training further enhances
signal propagation. We also demonstrate the versatility of our method on a set of synthetic tasks
with controllable time-complexity and show that it can be combined with other approaches to tackle
exploding and vanishing gradients, such as dynamic mean-field theory for initialization, orthogonal
initialization and specialized single units, such as LSTMs.
Prior research on exploding and vanishing gradients mainly focused on selecting network architectures
that are less prone to exploding/vanishing gradients or finding parameter initializations that provide
well-conditioned gradients at least at the beginning of training. Our introduced gradient flossing can
be seen as a complementary approach that can further enhance gradient stability throughout training.
Compared to the work on picking good parameter initializations based on random matrix theory [41]
and mean-field heuristics [40], gradient flossing provides several improvements: First, mean-field
theory only considers the gradient flow at initialization, while gradient flossing can maintain gradient
flow and well-conditioned Jacobians throughout the training process. Second, random matrix theory
and mean-field heuristics are usually confined to the limit of large networks [52], while gradient
flossing can be used for networks of any size. The link between Lyapunov exponents and the gradients
of backpropagation through time has been described previously [9, 12] and has been spelled out
analytically and studied numerically [10, 11, 56, 57, 58]. In contrast, we use Lyapunov exponents
here not only as a diagnostic tool for gradient stability but also to show that they can directly be part
of the cure for exploding and vanishing gradients.
Future investigations could delve further into the roles of the second to N th Lyapunov exponents
in trainability, and how it is related to the task at hand, the rank of the parameter update, the dimen-
sionality of the solution space, as well as the network dynamics (see also [32, 59, 60]). Our results
suggest a trade-off between trainability across long time horizons and the nonlinear task demands
that is worth exploring in more detail (appendix C). Applying gradient flossing to real-time recurrent
learning and its biologically plausible variants is another avenue [61]. Extending gradient flossing to
feedforward networks, state-space models and transformers is a promising avenue for future research
(see also [62, 63]). While Lyapunov exponents are only strictly defined for dynamical systems, such
as maps or flows that are endomorphisms, the long-term Jacobian of deep feedforward networks
could be treated similarly. This could also provide a link between the stability of the network against
adversarial examples and its dynamic stability, as measured by Lyapunov exponents. Given that
time-varying input can suppress chaos in recurrent networks [9, 12, 64, 65, 66, 67], we anticipate they
may exacerbate vanishing gradients. Gradient flossing could also be applied in neural architecture
search, to identify and optimize trainable networks. Finally, gradient flossing is applicable to other
model parameters, as well. For instance, gradients of Lyapunov exponents with respect to single-
unit parameters could optimize the activation function and single-neuron biophysics in biologically
plausible neuron models.


                                                   10
Acknowledgments and Disclosure of Funding
I thank E. Izhikevich, F. Wolf, S. Goedeke, J. Lindner, L.F. Abbott, L. Logiaco, M. Schottdorf, G.
Wayne and P. Sokol for fruitful discussions and J. Stone, L.F. Abbott, M. Ding, O. Marshall, S.
Goedeke, S. Lippl, M.P. Puelma-Touzel, J. Lindner and the reviewers for feedback on the manuscript.
I thank Jinguo Liu for the Julia package BackwardsLinalg.jl. Research supported by NSF NeuroNex
Award (DBI-1707398), the Gatsby Charitable Foundation (GAT3708), the Simons Collaboration for
the Global Brain (542939SPI), and the Swartz Foundation (2021-6).

References
 [1] P. WERBOS. Beyond Regression :. Ph. D. dissertation, Harvard University, 1974.
 [2] DB Parker. Learning-logic (TR-47). Center for Computational Research in Economics and Management
     Science. MIT-Press, Cambridge, Mass, 8, 1985.
 [3] Y. LECUN. Une procedure d’apprentissage ponr reseau a seuil asymetrique. Proceedings of Cognitiva 85,
     pages 599–604, 1985.
 [4] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-
     propagating errors. Nature, 323(6088):533, October 1986.
 [5] Sepp Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. PhD thesis, 1991.
 [6] Sepp Hochreiter and Jürgen Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735–
     1780, 1997.
 [7] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult.
     IEEE Transactions on Neural Networks, 5(2):157–166, March 1994.
 [8] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training Recurrent Neural
     Networks. arXiv:1211.5063 [cs], November 2012. arXiv: 1211.5063.
 [9] Rainer Engelken, Fred Wolf, and L. F. Abbott. Lyapunov spectra of chaotic recurrent neural networks.
     arXiv:2006.02427 [nlin, q-bio], June 2020. arXiv: 2006.02427.
[10] Jonas Mikhaeil, Zahra Monfared, and Daniel Durstewitz. On the difficulty of learning chaotic dynamics
     with RNNs. Advances in Neural Information Processing Systems, 35:11297–11312, December 2022.
[11] Il Memming Park, Ábel Ságodi, and Piotr Aleksander Sokół. Persistent learning signals and working
     memory without continuous attractors. ArXiv, page arXiv:2308.12585v1, August 2023.
[12] Rainer Engelken, Fred Wolf, and L. F. Abbott. Lyapunov spectra of chaotic recurrent neural networks.
     Physical Review Research, 5(4):043044, October 2023.
[13] J. P. Eckmann and D. Ruelle. Ergodic theory of chaos and strange attractors. Reviews of Modern Physics,
     57(3):617–656, July 1985.
[14] Arkady Pikovsky and Antonio Politi. Lyapunov Exponents: A Tool to Explore Complex Dynamics.
     Cambridge University Press, Cambridge, February 2016.
[15] Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. On the Properties of
     Neural Machine Translation: Encoder-Decoder Approaches. Technical report, September 2014. ADS
     Bibcode: 2014arXiv1409.1259C Type: article.
[16] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural
     networks. In Proceedings of the 30th International Conference on International Conference on Machine
     Learning - Volume 28, ICML’13, pages III–1310–III–1318, Atlanta, GA, USA, June 2013. JMLR.org.
[17] Tomáš Mikolov. Statistical language models based on neural networks. Ph.D. thesis, Brno University of
     Technology, Faculty of Information Technology, Brno, CZ, 2012.
[18] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, November 2016.
     Google-Books-ID: omivDQAAQBAJ.
[19] Sergey Ioffe and Christian Szegedy. Batch Normalization: Accelerating Deep Network Training by
     Reducing Internal Covariate Shift. In Proceedings of the 32nd International Conference on Machine
     Learning, pages 448–456. PMLR, June 2015.


                                                     11
[20] Bo Chang, Minmin Chen, Eldad Haber, and Ed H. Chi. AntisymmetricRNN: A Dynamical System View
     on Recurrent Neural Networks. International Conference on Learning Representations, December 2018.

[21] Andrew M. Saxe, James L. McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of
     learning in deep linear neural networks. arXiv:1312.6120 [cond-mat, q-bio, stat], December 2013. arXiv:
     1312.6120.

[22] Martin Arjovsky, Amar Shah, and Yoshua Bengio. Unitary Evolution Recurrent Neural Networks. In
     Proceedings of The 33rd International Conference on Machine Learning, pages 1120–1128. PMLR, June
     2016.

[23] Kyle Helfrich, Devin Willmott, and Qiang Ye. Orthogonal Recurrent Neural Networks with Scaled Cayley
     Transform. In Proceedings of the 35th International Conference on Machine Learning, pages 1969–1978.
     PMLR, July 2018.

[24] T. Konstantin Rusch and Siddhartha Mishra. Coupled Oscillatory Recurrent Neural Network (coRNN):
     An accurate and (gradient) stable architecture for learning long time dependencies. arXiv e-prints, page
     arXiv:2010.00951, October 2020.

[25] N. Benjamin Erichson, Omri Azencot, Alejandro Queiruga, Liam Hodgkinson, and Michael W. Mahoney.
     Lipschitz Recurrent Neural Networks. International Conference on Learning Representations, January
     2021.

[26] Antonio Orvieto, Samuel L Smith, Albert Gu, Anushan Fernando, Caglar Gulcehre, Razvan Pascanu, and
     Soham De. Resurrecting Recurrent Neural Networks for Long Sequences. Technical report, March 2023.
     ADS Bibcode: 2023arXiv230306349O Type: article.

[27] Herbert Jaeger. The “echo state” approach to analysing and training recurrent neural networks-with an
     erratum note. Bonn, Germany: German National Research Center for Information Technology GMD
     Technical Report, 148:34, 2001.

[28] Mustafa C. Ozturk, Dongming Xu, and José C. Príncipe. Analysis and Design of Echo State Networks.
     Neural Computation, 19(1):111–138, January 2007.

[29] Rupesh K Srivastava, Klaus Greff, and Jürgen Schmidhuber. Training Very Deep Networks. In Advances
     in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015.

[30] Julian Georg Zilly, Rupesh Kumar Srivastava, Jan Koutnik, and Jürgen Schmidhuber. Recurrent Highway
     Networks. In Proceedings of the 34th International Conference on Machine Learning, pages 4189–4198.
     PMLR, July 2017.

[31] Piotr A. Sokół, Ian Jordan, Eben Kadile, and Il Memming Park. Adjoint Dynamics of Stable Limit Cycle
     Neural Networks. In 2019 53rd Asilomar Conference on Signals, Systems, and Computers, pages 884–887,
     November 2019. ISSN: 2576-2303.

[32] Piotr A. Sokół. Geometry of Learning and Representations in Neural Networks. PhD thesis, Stony Brook
     University, May 2023.

[33] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep Sparse Rectifier Neural Networks. volume 15,
     pages 315–323, April 2011.

[34] Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha Sohl-Dickstein, and Surya Ganguli. Exponential
     expressivity in deep neural networks through transient chaos. arXiv:1606.05340 [cond-mat, stat], June
     2016. arXiv: 1606.05340.

[35] Minmin Chen, Jeffrey Pennington, and Samuel S. Schoenholz. Dynamical Isometry and a Mean Field
     Theory of RNNs: Gating Enables Signal Propagation in Recurrent Neural Networks. arXiv:1806.05394
     [cs, stat], August 2018. arXiv: 1806.05394.

[36] Boris Hanin and Mihai Nica. Products of Many Large Random Matrices and Gradients in Deep Neural
     Networks. December 2018.

[37] Samuel S. Schoenholz, Justin Gilmer, Surya Ganguli, and Jascha Sohl-Dickstein. Deep Information
     Propagation. November 2016.

[38] Boris Hanin and David Rolnick. How to Start Training: The Effect of Initialization and Architecture. In
     Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.


                                                    12
[39] Piotr A. Sokol and Il Memming Park. Information Geometry of Orthogonal Initializations and Training.
     Technical report, October 2018. ADS Bibcode: 2018arXiv181003785S Type: article.
[40] Dar Gilboa, Bo Chang, Minmin Chen, Greg Yang, Samuel S. Schoenholz, Ed H. Chi, and Jeffrey
     Pennington. Dynamical Isometry and a Mean Field Theory of LSTMs and GRUs. January 2019.
[41] Tankut Can, Kamesh Krishnamurthy, and David J. Schwab. Gating creates slow modes and controls
     phase-space complexity in GRUs and LSTMs. arXiv:2002.00025 [cond-mat, stat], January 2020. arXiv:
     2002.00025.
[42] Valery Iustinovich Oseledets. A multiplicative ergodic theorem. Characteristic Ljapunov, exponents of
     dynamical systems. Trudy Moskovskogo Matematicheskogo Obshchestva, 19:179–210, 1968.
[43] Karlheinz Geist, Ulrich Parlitz, and Werner Lauterborn. Comparison of Different Methods for Computing
     Lyapunov Exponents. Progress of Theoretical Physics, 83(5):875–893, May 1990.
[44] Giancarlo Benettin, Luigi Galgani, Antonio Giorgilli, and Jean-Marie Strelcyn. Lyapunov Characteristic
     Exponents for smooth dynamical systems and for hamiltonian systems; A method for computing all of
     them. Part 2: Numerical application. Meccanica, 15(1):21–30, March 1980.
[45] Hai-Jun Liao, Jin-Guo Liu, Lei Wang, and Tao Xiang. Differentiable Programming Tensor Networks.
     Physical Review X, 9(3):031041, September 2019. arXiv:1903.09650 [cond-mat, physics:quant-ph].
[46] S. F. Walter and L. Lehmann. Algorithmic Differentiation of Linear Algebra Functions with Application
     in Optimum Experimental Design (Extended Version). Technical report, January 2010. ADS Bibcode:
     2010arXiv1001.1654W Type: article.
[47] Robert Schreiber and Charles Van Loan. A Storage-Efficient $WY$ Representation for Products of
     Householder Transformations. SIAM Journal on Scientific and Statistical Computing, 10(1):53–57, January
     1989.
[48] Matthias Seeger, Asmus Hetzel, Zhenwen Dai, Eric Meissner, and Neil D. Lawrence. Auto-Differentiating
     Linear Algebra. Technical report, October 2017. ADS Bibcode: 2017arXiv171008717S Type: article.
[49] Jeffrey Pennington, Samuel Schoenholz, and Surya Ganguli. Resurrecting the sigmoid in deep learning
     through dynamical isometry: theory and practice. In Advances in Neural Information Processing Systems,
     volume 30. Curran Associates, Inc., 2017.
[50] Jeffrey Pennington, Samuel S. Schoenholz, and Surya Ganguli. The Emergence of Spectral Universality in
     Deep Networks. arXiv:1802.09979 [cs, stat], February 2018. arXiv: 1802.09979.
[51] E. Cheney and David Kincaid. Numerical Mathematics and Computing. Cengage Learning, August 2007.
     Google-Books-ID: ZUfVZELlrMEC.
[52] Yasaman Bahri, Jonathan Kadmon, Jeffrey Pennington, Sam S. Schoenholz, Jascha Sohl-Dickstein, and
     Surya Ganguli. Statistical Mechanics of Deep Learning. Annual Review of Condensed Matter Physics,
     11(1):501–528, 2020.
[53] F. Ginelli, P. Poggi, A. Turchi, H. Chaté, R. Livi, and A. Politi. Characterizing Dynamics with Covariant
     Lyapunov Vectors. Physical Review Letters, 99(13):130601, September 2007.
[54] Sergey V. Ershov and Alexei B. Potapov. On the concept of stationary Lyapunov basis. Physica D:
     Nonlinear Phenomena, 118(3):167–198, July 1998.
[55] Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural Ordinary Differential
     Equations. In Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc.,
     2018.
[56] Javed Lindner. Investigating the exploding and vanishing gradients problem with Lyapunov exponents.
     Master’s thesis, RWTH Aaachen, Aaachen/Juelich, 2021.
[57] Ryan Vogt, Maximilian Puelma Touzel, Eli Shlizerman, and Guillaume Lajoie. On Lyapunov Exponents
     for RNNs: Understanding Information Propagation Using Dynamical Systems Tools. Frontiers in Applied
     Mathematics and Statistics, 8, 2022.
[58] Kamesh Krishnamurthy, Tankut Can, and David J. Schwab. Theory of Gating in Recurrent Neural Networks.
     Physical Review X, 12(1):011011, January 2022.
[59] L. S. Pontryagin. Mathematical Theory of Optimal Processes: The Mathematical Theory of Optimal
     Processes. Routledge, New York, 1st edition edition, March 1987.


                                                     13
[60] Daniel Liberzon. Calculus of Variations and Optimal Control Theory: A Concise Introduction. In Calculus
     of Variations and Optimal Control Theory. Princeton University Press, December 2011.

[61] Owen Marschall, Kyunghyun Cho, and Cristina Savin. A unified framework of online learning algorithms
     for training recurrent neural networks. The Journal of Machine Learning Research, 21(1):135:5320–
     135:5353, January 2020.

[62] Judy Hoffman, Daniel A. Roberts, and Sho Yaida. Robust Learning with Jacobian Regularization. Technical
     report, August 2019. ADS Bibcode: 2019arXiv190802729H Type: article.
[63] Hongyi Zhang, Yann N. Dauphin, and Tengyu Ma. Fixup Initialization: Residual Learning Without
     Normalization. Technical report, January 2019. ADS Bibcode: 2019arXiv190109321Z Type: article.

[64] L. Molgedey, J. Schuchhardt, and H. G. Schuster. Suppressing chaos in neural networks by noise. Physical
     Review Letters, 69(26):3717–3719, December 1992.

[65] Kanaka Rajan, L. F. Abbott, and Haim Sompolinsky. Stimulus-dependent suppression of chaos in recurrent
     neural networks. Physical Review E, 82(1):011903, July 2010.

[66] Jannis Schuecker, Sven Goedeke, David Dahmen, and Moritz Helias. Functional methods for disordered
     neural networks. arXiv:1605.06758 [cond-mat, q-bio], May 2016. arXiv: 1605.06758.

[67] Rainer Engelken, Alessandro Ingrosso, Ramin Khajeh, Sven Goedeke, and L. F. Abbott. Input correlations
     impede suppression of chaos and learning in balanced firing-rate networks. PLOS Computational Biology,
     18(12):e1010590, December 2022.

[68] Edward Ott. Chaos in Dynamical Systems. Cambridge University Press, August 2002. Google-Books-ID:
     PfXoAwAAQBAJ.
[69] Angelo Vulpiani, Fabio Cecconi, and Massimo Cencini. Chaos: From Simple Models to Complex Systems.
     World Scientific Pub Co Inc, Hackensack, NJ, September 2009.


                                                     14
A         Backpropagation Through QR Decomposition
The backward pass of the QR decomposition is given by [45, 46, 47, 48]
                                 Q = Q + Q copyltu(M) R−T
                                                               
                                                                                                              (11)
                     T    T
where M = RR − Q Q and the copyltu function generates a symmetric matrix by copying the lower
triangle of the input matrix to its upper triangle, with the element [copyltu(M )]ij = Mmax(i,j),min(i,j)
[45, 46, 47, 48]. Adjoint variable are written here as T = ∂L/∂T .
Using an analytical pullback is more memory-efficient and less computationally costly than directly doing
automatic differentiation through the QR-decomposition. Moreover, from a practical perspective, for QR
decomposition, often BLAS/LAPACK routines are utilized which are not amenable to common differentiable
programming frameworks like TensorFlow, PyTorch, JAX and Zygote. In our implementation of gradient
flossing, we used the Julia package BackwardsLinalg.jl by Jinguo Liu available at here .

B       Further Details and Analysis of Gradient Flossing
An example implementation of gradient flossing in Flux, a machine learning library in Julia is available here.
We are actively developing implementations for other widely used differentiable programming frameworks.

B.1       Gradient Flossing for recurrent LSTM and ReLU networks

      A                         LSTM                         C                         ReLU
           0.2                             target 1                 6                             target 1
                                           actual 1                                               actual 1
                                                                    4
           0.0
                                                                    2
    1


                                                             1


           0.2
                                                                    0
           0.4
                                                                    2
                 0   20       40      60   80         100               0       20   40      60   80         100
                                Epochs                                                 Epochs
      B                                                      D 101
          10 1
          10 2                                                   10 1
          10 3
 2


                                                            2
 1


                                                            1


          10 4                                                   10 3
          10 5
          10 6                                                   10 5
                 0   20       40      60   80         100                   0   20   40    60     80    100
                                Epochs                                                Epochs


Figure 6: Gradient flossing for recurrent LSTM networks and recurrent ReLU networks A) First
Lyapunov exponent of LSTM network as a function of training epochs. Minimizing the mean
squared error between estimated first Lyapunov exponent and target Lyapunov exponent λ1 = 0
by gradient descent. First Lyapunov exponent of LSTM network (solid lines) converges to target
value (thick dashed lines) within less than 100 epochs. 10 random LSTM RNNs were initialized with
Gaussian recurrent weights, where standard deviations of weight scaling were drawn g ∼ Unif(0, 1).
B) Gradient flossing minimizes the square of the first Lyapunov exponent of random recurrent
LSTM networks over epochs. C) Same as A for recurrent ReLU network. Here networks were
initialized with Gaussian recurrent weights Wij ∼ N (−0.1, g 2 /N ) where values of g were drawn
g ∼ Unif(0, 1) D) B) for recurrent ReLU network. Parameters: network size N = 32 with 10
network realizations. Shaded regions in B, D are 25% and 75% percentiles, solid line shows median.

We demonstrate that gradient flossing can also be applied to recurrent LSTM and ReLU networks in Fig 6. To
this end, we generated random LSTM networks where the weights of all the different gates and biases were


                                                        15
independently and identically distributed (i.i.d.) and sampled from Gaussian distributions of different variance.
Our results show that gradient flossing can also constrain the Lyapunov exponent to be close to zero. The
dynamics of each of the N LSTM units follows the map [6]:
                                      ft    =    σg (Uf ht−1 + Wf xt + bf )                                      (12)
                                      ot    =    σg (Uo ht−1 + Wo xt + bo )                                      (13)
                                      it    =    σg (Ui ht−1 + Wi xt + bi )                                      (14)
                                      c̃t   =    σh (Uc ht−1 + Wc xt + bc )                                      (15)
                                      ct    =    ft ⊙ ct−1 + it ⊙ c̃t                                            (16)
                                      ht    =    ot ⊙ ϕ(ct )                                                     (17)
                                                              1
where ⊙ denotes the Hadamard product, σg (x) = 1+exp(−x)              is the sigmoid function, σh (x) = tanh(x) and
                                                                2
entries of the matrices Ux are drawn from Ux ∼ N (0, gx /N ). For simplicity, the bias terms bx are scalars.
Subscripts f , o and i denote respectively the forget gate, the output gate, the input gate, and c is the cell state.
In each LSTM unit, there are two dynamic variables c and h, and three gates f , o, and i that control the flow
if signals into and out of the cell c. We set the values gih , gix , gf x , bf , gch , gcx , gcx , gox to be uniformly
distributed between 0 and 1 and initialize bi , gf h ,bc ,b0 as zero.
During gradient flossing, the actual Lyapunov exponents of different random network realizations converge close
to the target Lyapunov exponent λtarget
                                     1   = 0 in fewer than 100 epochs as shown in Fig 6A. Fig 6B shows that
the squared Lyapunov exponents converge towards zero. We note that for LSTM networks, a target Lyapunov
exponent of λtarget
               1    = −1 is achieved after 100 gradient flossing steps only for a subset of random network
realizations (not shown). We speculate that behavior is influenced by the gating structure of LSTM units,
which seems to naturally place the first Lyapunov exponent close to zero for certain initializations (See also
[9, 12, 41, 58]).
For the recurrent ReLU networks, we considered the same Vanilla RNN dynamics as in the main manuscript in
Eq 9
                              hs+1 = f (hs , xs+1 ) = Wϕ(hs ) + Vxs+1 ,
The initial entries of W are drawn independently from a Gaussian distribution with a negative mean of −0.1
and variance g 2 /N , where g is a gain parameter that controls the heterogeneity of weights. We use the transfer
function ϕ(x) = max(x, 0). xs is a sequence of inputs and V is the input weight. xs is a stream of i.i.d.
Gaussian input xs ∼ N (0, 1) and the input weights V are N (0, 1). Both W and V are trained during gradient
flossing. We found that some ReLU network had initially unstable dynamics with positive Lyapunov exponents
Fig 6C. However, during gradient flossing, these unstable networks were quickly stabilized. Fig 6D shows that
the squared Lyapunov exponents of ReLU networks converge towards zero.

B.2    Additional Results for Different Numbers of Flossed Lyapunov Exponents
Additionally to the main Fig 5, we did the same analysis on networks with only preflossing (gradient flossing
before training) and found that more Lyapunov exponent k have to be flossed to achieve 100% accuracy as the
tasks become more difficult (Fig 7D), however, in that case even if all N Lyapunov exponents were flossed, thus
k = N , they were not able to solve the most difficult tasks. This seems to indicate that gradient flossing during
training cannot be replaced by just gradient flossing more Lyapunov exponents before training.


C     Additional Results on Gradient Flossing Throughout Training
We now discuss some additional results on gradient flossing throughout training. First, we analyze how gradient
flossing affects the gradients and find that during gradient flossing, the norm of gradients that bridge many time
steps are boosted. Moreover, subordinate singular values of the error norm of the recurrent weights are also
boosted, indicating that gradient flossing can increase the effective rank of the parameter update. Additionally,
we show that if gradient flossing is continued throughout training it can be detrimental to the accuracy. Finally,
we show that Lyapunov exponents of successfully trained networks after training for the spatial delayed XOR
task have a simple relationship to the delay d.


D     Gradient Flossing boosts the Gradient Norm for Long Time Horizons
In this section, we investigate the impact of gradient flossing on the norm and structure of the gradient. It is
important to note that the complete error gradient of backpropagation through time is composed of a summation
of products of one-step Jacobians, reflecting the number of "loops" the error signal traverses through the recurrent
dynamics before reaching its target. Consequently, when the singular values of the long-term Jacobian are
smaller than 1, the influence of the shorter loops typically dominates the long-term Jacobian.


                                                         16
                A                                                                                 B
                     100                                                                               100
                                     k
 test accuracy (%)


                                                                                   test accuracy (%)
                             80          1                                                                     80                           delay
                                         20                                                                                                     30
                                         40                                                                                                     50
                                         60                                                                                                     70
                             60          80                                                                    60

                                      20       40         60                                                        0       20     40       60   80
                                       complexity (delay d)                                                                number of flossed i k
                       C                                          100                                    D                                         100
                             70                                                                                70
      complexity (delay d)


                                                                                        complexity (delay d)
                                                               test accuracy (%)


                                                                                                                                                test accuracy (%)
                             60                                                                                60
                             50                                                                                50
                             40                                                                                40
                             30                                                                                30
                             20                                                                                20
                             10                                                                                10
                                  1 10 20 30 40 50 60 70 80       50                                                1 10 20 30 40 50 60 70 80      50
                                   number of flossed i k                                                             number of flossed i k


Figure 7: Gradient flossing for different numbers of flossed Lyapunov exponents
A) Test accuracy for delayed temporal XOR task as a function of delay d with different numbers
flossed Lyapunov exponents k. B) Same data as A but here test accuracy as a function of number of
flossed Lyapunov exponents k. C) Median test accuracy for delayed temporal XOR task as a function
of delay d and k for networks with gradient flossing during training (500 steps of gradient flossing at
epochs e ∈ {0, 100, 200, 300, 400}). D)Same as B for preflossing only. Parameters: g = 1, batch
size b = 16, N = 80, epochs = 104 for delayed temporal XOR, epochs = 5000 for delayed spatial
XOR, T = 300, gradient flossing for Ef = 500 epochs before training and during training for A, B,
C, and only before training for C. Shaded areas are 25% and 75% percentiles, solid lines are means,
transparent dots are individual simulations, task loss: cross-entropy btw. y, ŷ.


In our tasks, we have full control over the correlation structure of the task and thus know exactly which loop
length of backpropagation through time is necessary for finding the correct association. We were moreover
careful in our task design not to have any additional signals in our task that might help to bridge the long time
scale. In the case of vanishing gradients, the gradient norm is predominantly influenced by the shorter loops,
even though the actual signal in the gradient originates solely from the loop of length d in our task. To mitigate
the contamination of spurious signals from shorter loops and effectively extract the gradient that spans long time
horizons, we focus on the gradient with respect to the initial conditions h0 .
                   τ =t−1   t−1
                                          !                τ =t−1   t−1
                                                                                   !
     ∂Lt      ∂Lt X          Y ∂hτ ′ +1 ∂hτ           ∂Lt X         Y ∂hτ ′ +1              ∂Lt
           =                                       =                                 δτ 0 =      Tt (h0 )      (18)
     ∂h0      ∂ht            ′
                                   ∂h τ ′    ∂h 0     ∂h t           ′
                                                                           ∂h  τ ′          ∂ht
                                      τ =t−l   τ =τ                                           τ =t−l                τ =τ

We note that the sum conveniently drops as only the longest ’loop’, in other words, the only summand that
contributes is the product of Jacobians going from 0 to t. By considering this gradient, we can therefore ensure
that no undesired signals stemming from shorter loops interfere with the analysis. Moreover, we note that we
use the binary cross entropy loss which makes the derivative ∂L
                                                              ∂ht
                                                                 t
                                                                   trivial.
In Fig 8 we show that gradient flossing boosts the gradient with respect to the initial conditions. Specifically, we
compare two identical networks trained on the binary delayed temporal XOR task with a loop length of d = 70.
One network is trained with gradient flossing at epochs e ∈ {0, 100, 200, 300, 400}), while the other is trained
without gradient flossing.


                                                                               17
                                                                          dL
For the network without gradient flossing, the gradient norm of | dh        0
                                                                              | diminishes to extremely small values
       −6
(< 10 ) and remains small throughout training. In contrast, for the network trained with gradient flossing, each
episode of gradient flossing causes the norm | dh dL
                                                    0
                                                      | to spike, surpassing values larger than 10−2 . These findings
are direct evidence that gradient flossing boosts the gradient norm, facilitating to bridge long time horizons in
challenging temporal credit assignment tasks. We observe that after several episodes of gradient flossing, the
            dL
gradient | dh 0
                | of the networks stays around 10−4 and eventually rise up to values around 10−2 . Subsequent
in training, the test accuracy surpasses chance level (Fig 8B). We observed this temporal relationship between
                   dL
gradient norm | dh    0
                        | and training success consistently across numerous network realizations (Fig 8C and D).
                                                     dL
These findings suggest that the gradient norm | dh      0
                                                          | can be a good predictor of learning success, sometimes
hundreds of epochs before the accuracy exceeds the chance level of 50%. Indeed, when depicting the gradient
norm aligned to the last epoch where accuracy was ≤ 50%, we see for many network realizations a gradual
growth of gradient norm oven epochs before accuracy surpasses chance level (Fig 9A). Analogously, when
                                                                                    dL
plotting the accuracy as a function of epoch aligned with the last epoch with | dh    0
                                                                                        | < 0.001, we observe for this
                                             dL
task that the increase of gradient norm | dh0 | reliably precedes the epoch at which the accuracy surpasses the
chance level (See Fig 9B). We note that when measuring the overlap of the orientation of the gradient vector
 dL
 dh0
     with the first covariant Lyapunov vector of the forward dynamics, we found a significant increase in overlap
around the training epoch where the accuracy surpasses the chance level both in networks with and without
gradient flossing. This does not come as a surprise as the covariant Lyapunov vector measure the most unstable
(or least stable) direction in the tangent space of a trajectory and perturbations of h0 that have to travel over
many epochs align

D.1 Gradient Flossing Boosts Effective Dimension of Error Gradient
To further investigate the effect of gradient flossing on training, we investigated the structure of the error gradient
                                                                                                                  dL
and how it is changed by gradient flossing. To this end, we decompose the recurrent weight gradient σi dW
into in weighted sum of outer products using singular value decomposition (Fig 10).
As the Lyapunov exponents are the time-averaged logarithms of the singular values of the asymptotic long-term
Jacobian Tt (hτ ), this allows us to directly link the effect of pushing Lyapunov exponents toward zero during
gradient flossing to the structure of the error gradient of the recurrent weights, as they are intimately linked:
                                    τ =t−1    t−1
                                                            !
                      ∂Lt      ∂Lt X          Y ∂hτ ′ +1 ∂hτ            ∂Lt X             ∂hτ
                            =                                        =          Tt (hτ )                        (19)
                      ∂W       ∂ht             ′
                                                    ∂h  τ ′    ∂W       ∂h t
                                                                              τ
                                                                                          ∂W
                                     τ =t−l   τ =τ

We again note that different ’loops’ contribute to the total gradient expression and the Lyapunov exponents only
characterize the longest loop. Further, we note that in our controlled tasks, depending on delay d, only few of the
summands are relevant for solving the task. We thus expect the relevant gradient summand that carries important
signals about the task to be contaminated by summands of both shorter and longer chains, which contribute
irrelevant fluctuations.
                                                              dL
                                                                 
The singular values of the recurrent weight gradient σi dW         as a function of training epoch reveal that the
subordinate singular values subordinate singular value σ20 and σ40 exhibit peaks at the times of gradient flossing,
while the first singular value σ1 only shows a slight peak (Fig 10A). This indicates that gradient flossing
                                                                 dL
increases the effective rank of the recurrent weight gradient dW     . In other words, gradient flossing facilitates
high-dimensional parameter updates. Our interpretation is, as gradient flossing pushes Lyapunov exponents
to zero, the different summands in the total gradient contribute more equitable as long loops have neither a
dominant contribution (which would happen for exploding gradients) nor a vanishing contribution (which would
happen for vanishing gradients). This way, the sum of gradient terms has a higher effective rank.
In contrast, without gradient flossing, the subordinate singular values (in Fig 10A σ20 and σ40 ) rapidly diminish
to extremely small values over training epochs and remain very small throughout training. Note however that the
leading singular values σ1 are of comparable size irrespective whether gradient flossing was performed or not.
                                                                                              dL
We note that similar to the gradient norm of the loss with respect to the initial condition | dh 0
                                                                                                   |, the subordinate
singular values seem to predict when the test accuracy of networks with gradient flossing grows beyond chance
level (Fig 10B). We confirmed this in multiple other network realizations and give here another example we the
accuracy grows beyond chance only later during training (Fig 11).

D.2 Gradient Flossing Throughout Training Can Be Detrimental
We find that gradient flossing continued throughout all training epochs can be detrimental for performance
(Fig 12). We demonstrate this again in the binary delayed temporal XOR task. We compare three different
conditions: Either, we floss throughout the training every 100 training epochs for 500 flossing epochs (red), or
we floss only early during training at training epochs e ∈ {0, 100, 200, 300, 400})(green) or we do not floss at
all (blue).


                                                          18
            A                                                                     C
              10 2                                                                          0.05
              10 4                                                                          0.04
 |dhd 0 |     10 6                                                                          0.03


                                                                         |dhd 0 |
              10 8                                                                          0.02
             10 10                                                                          0.01
                                        with flossing during training
             10 12                      without flossing
                                                                                            0.00
                           0   500 1000 1500 2000 2500 3000                                        0   500 1000 1500 2000 2500 3000
                                       Epochs                                                                  Epochs
            B                                                                     D
                     100                                                                    100
      accuracy (%)


                                                                             accuracy (%)
                               with flossing during training
                               without flossing


                      50                                                                     50
                           0   500 1000 1500 2000 2500 3000                                        0   500 1000 1500 2000 2500 3000
                                       Epochs                                                                  Epochs
                                                                                                dL
Figure 8: Gradient flossing boosts norm of long-term Jacobian A) Gradient norm of | dh             0
                                                                                                     | as
a function of training epochs for networks without flossing (blue) and networks with flossing
during training (orange). Error gradient norm is boosted after gradient flossing at epochs e ∈
                                                                                               dL
{0, 100, 200, 300, 400, 500}). In networks without gradient flossing, the gradient norms | dh    0
                                                                                                   | are
much smaller overall. One out of ten random network realizations with solid line, the other 9 with
transparent line. B) Accuracy as a function of epoch, same depiction and network realizations as in
A. Note that accuracy of networks with gradient flossing grows beyond chance level approximately
                            dL
when the gradient norm | dh   0
                                | becomes macroscopically large. C) Same as A in linear scale. Mean
final test loss as a function of task difficulty (delay d) for delayed copy task. Different colors are
different network realizations with gradient flossing during training. Black lines are without any
gradient flossing. D) Accuracy as a function of epochs, same colors as in C. Note that for all network
                                                    dL
realizations the moments where gradient norm | dh     0
                                                        | becomes macroscopically large coincides with
the moment the accuracy is beyond chance level. Parameters: g = 1, batch size b = 16, N = 80,
epochs = 104 , T = 300, flossing for Ef = 500 epochs on k = 75 Lyapunov exponents before
training. Task: binary delayed XOR, delay d = 70, loss: cross entropy(y, ŷ).


We observe that after every episode of gradient flossing, the accuracy drops down close to chance level of 50%
(Fig 12A red line). Between flossing, the accuracy quickly recovers but never reaches 100%. Simultaneously,
                                                                                                                dL
when the accuracy drops the test error jumps up (Fig 12B). We also observed that the gradient norm | dh           0
                                                                                                                    | is
                                                                                                        dL
initially boosted by gradient flossing, but stays close to indistinguishable once the gradient norm | dh0 | becomes
macroscopically large (Fig 12C). This suggests that once gradient flossing facilitates signal propagation across
long time horizons and the network picks up the relevant gradient signal, further gradient flossing can be harmful
to the actual task execution. We hypothesize that there might be (at least for the Vanilla networks considered
here), a trade-off between the ability to bridge long time scales which seems to require one or several Lyapunov
exponents of the forward dynamics close to zero and nonlinear tasks requirements, which require at least a
fraction of the units to be in the nonlinear regime of the nonlinearity ϕ, where ϕ′ (x) < 1. It would be an
interesting future research avenue to further investigate this potential trade-off also in other network architectures.

D.3             Lyapunov Exponents after Training With and Without Gradient Flossing
In Fig 13, we show the first (Fig 13A, B) and the tenth (Fig 13C, D) Lyapunov exponent after training on
the spatial delayed XOR task both with and without gradient flossing. We find for successful networks with
gradient flossing a systematic relationship between the first Lyapunov exponent and the delay, that can be fitted
by approximately by λ1 (d) = −0.2exp.(−0.03delay). Unsuccessful networks with accuracy at chance level
have a much smaller largest Lyapunov exponent. The same seems to hold true for the tenth Lyapunov exponent.
In a previous study [57], a similar trend was observed, albeit in the context of a task that did not possess an


                                                                        19
        A                                                       B 80

                                                                           75
            10 2
                                                                           70
            10 3


                                                            accuracy (%)
                                                                           65
 |dhd 0 |


            10 4                                                           60
                                                                           55
            10 5
                                                                           50
                                                                           45
                   3000   2000     1000      0       1000                       0   100     200    300     400
                                 Epochs                                                   Epochs


Figure 9: Increase of gradient norm precedes epoch when accuracy exceeds chance level
                       dL
A) Gradient norm of | dh 0
                           | as a function of training epochs for 20 network realizations with flossing
during training. Epochs are aligned to last epoch where accuracy is ≤ 50%. B) Same task and
simulations as in A, but here accuracy as a function of epoch, for 20 network realizations with
                                                                        dL
flossing during training. Epochs are aligned to the last epoch with | dh  0
                                                                            | < 0.001. Different colors
are different network realizations. Parameters: g = 1, batch size b = 16, N = 80, epochs = 104 ,
T = 300, flossing for Ef = 500 epochs on k = 75 Lyapunov exponents early during training at
training epochs e ∈ {0, 100, 200, 300, 400}. Task: binary delayed XOR, delay d = 70, loss: cross
entropy(y, ŷ).


analytically tractable temporal correlation structure, which might partially explain the less conclusive results.
It is important to note that the numerical evaluation of Lyapunov exponents in recurrent LSTM networks in
[57] was based solely on the N × N Jacobian of the memory state. From a dynamical systems standpoint, a
2N × 2N Jacobian matrix encompassing interactions between both memory and cell states into account is
required [9, 12, 58].


E Gradient Flossing for Linear Network
We provide code for gradient flossing in linear networks here. We find that gradient flossing also helps to train
linear networks on tasks with many time steps that can be solved by linear networks, for example the copy task,
but not for tasks the require a nonlinear input-output operation like the temporally delayed XOR task. Full
analytical description of gradient flossing for linear networks would be a promising avenue for future research as
networks with linear dynamics can still have nonlinear learning dynamics [21]. However this is beyond the cope
of the presented work.


F      Computational Complexity of Gradient Flossing
We present here a more in-depth scaling analysis of the computational cost of gradient flossing. There are three
main contributors to the computational cost (table 1): First the RNN step, which has a computational complexity
of O N 2 b per time step, where N is the dimension of the recurrent network state (which in case of Vanilla
            
networks equals the number of units) and b is the batch-size both in the forward and backward pass. Second, the
Jacobian step which scales with O N 2 k per time step, where k is the number of flossed Lyapunov exponents.
Third, the QR decomposition, which scales with O N k2 , where k is the number of Lyapunov exponents
                                                               
considered.
Together, this results in a total amortized cost of O N 2 b T per training epoch, where T is the number of
                                                                 

training time steps and a total amortized costs per flossing epoch of O N 2 Tf (1 + k/tONS + k) where Tf is
                                                                                                  
the number of flossing time steps.


                                                       20
  A
             10 2
 i dW )
   d
      (


             10 4


                        0           500          1000        1500           2000           2500          3000
  B               100
                            early training flossing
                            without flossing
   accuracy (%)


                  50
                        0           500          1000         1500          2000           2500          3000
                                                             Epochs
Figure 10: Gradient flossing decreases condition number          of recurrent weight error gradient
                                                     dL
                                                         
A) Singular values of recurrent weight gradient σi dW      as a function of training epochs for singular
values i ∈ 1, 20, 40 for networks without gradient flossing (blue) and early training gradient flossing
(green). At epochs of gradient flossing, the subordinate singular value σ20 and σ40 are peaked, while
the first singular value σ1 has only a slight peak. This indicates that gradient flossing increases the
                                                  dL
effective rank of the recurrent weight gradient dW   . B) Accuracy as a function of training epochs.
Note that accuracy of networks with gradient flossing grows beyond chance level approximately
when the subordinate singular values singular value σ20 and σ40 are peaked increase, which enables
high-dimensional parameters updates. Parameters: g = 1, batch size b = 16, N = 80, epochs = 103 ,
T = 300, gradient flossing for Ef = 500 epochs on k = 75 Lyapunov exponents before training.
Task: binary delayed XOR, delay d = 70, loss: cross entropy(y, ŷ).


In case of preflossing, thus, the total computation cost scale with O N 2 [Eb T + Ep Tf (1 + k/tONS + k)] ,
                                                                                                         
where E is the number of training epochs and Ep is the number of preflossing epochs.
For gradient flossing during training (assuming that there is also preflossing done), the amortized cost scale with
O N 2 [Eb T + Ep Tp + Ef Tf (1 + k/tONS + k)] , where Ef is the total number of flossing epochs during
training.
Empirically, we find that both the number of preflossing epochs Ep and flossing episodes Ef necessary for
training success is much smaller than the total number of training epochs E. For example, the preflossing for
500 epochs in the numerical experiment of Fig 3 took ∼ 37 seconds, while the overall training on 10000 training
epochs with batch size b = 16 took ∼ 1680 seconds. Thus only approximately 2.2% of the total training time
was spent on gradient flossing. Moreover, Tp can be smaller than T , it just has to be long enough such that the
temporal correlations in the task can be bridged. In case of the tasks discussed in the manuscript, this would be
the delay d. It remains an important challenge to infer the suitable number of flossing time steps Tf for tasks
with unknown temporal correlation structure.
It would also be interesting to investigate how the CPU hours/wall-clock time/flops/Joule/CO2-emission spent
on gradient flossing vs on training networks with larger N are trading off against each other. For this, we would
suggest to first find the smallest network that on median successfully trains on a binary temporal XOR task for
a fixed given delay d and measure the computational resources involved in training it, e.g. in terms of CPU
hours. Then compare it to a network with gradient flossing. This would be a promising analysis but is beyond
our current computational budget. We will start such experiments an might be able to provide results during the
reviewer period.


                                                        21
  A
              10 2
 i dW )
   d
      (


              10 4


                         0           500          1000        1500         2000           2500           3000
  B                100
                             early training flossing
                             without flossing
    accuracy (%)


                   50
                         0           500          1000         1500        2000           2500           3000
                                                              Epochs
Figure 11: Gradient flossing decreases condition number of recurrent weight error gradient
Same as Fig 10 for different network realization.


G         Additional controls

We also investigate the effects of gradient flossing during the training with orthogonal weight initializations
and confirm our finding that gradient flossing improves trainability on tasks that have long time horizons to
bridge. Moreover, we find that gradient flossing during training can further improve trainability. We replicated
the two more challenging tasks from the main paper (Fig 4) for orthogonal initialization with variable temporal
complexity and performed gradient flossing either both during and before training, only before training, or not at
all.
Fig 14A shows the test accuracy for Vanilla RNNs with orthogonal initialization trained on the delayed temporal
XOR task yt = xt−d/2 ⊕ xt−d with random Bernoulli process x ∈ {0, 1}. The accuracy of orthogonal Vanilla
RNNs falls to chance level for d ≥ 40 (Fig 14C). With gradient flossing before training, the trainability can
be improved, but still falls close to chance level for d = 70. In contrast, for initially orthogonal networks with
gradient flossing during training, the accuracy is improved to > 80% at d = 70. In this case, we preflossed for
500 epochs before task training and again after 500 epochs of training on the task. In Fig 14B, D the networks
have to perform the nonlinear XOR operation yt = x1t−d ⊕ x2t−d ⊕ x3t−d on a three-dimensional binary input
signal x1 , x2 , and x3 and generate the correct output with a delay of d steps identical to Fig 4 in the main text.
Again, we observe similar to networks with Gaussian initialization that flossing before training improves the
performance compared to baseline, but starts failing for long delays d > 60. In contrast, orthogonal networks
that are also flossed during training can solve even more difficult tasks (Fig 14D). We note that for Fig 14B and
D, we trained the network only on 5000 epochs, compared to 10000 epochs in networks with random Gaussian
initialization because for 10000 epochs, both networks with gradient flossing only before training and with
gradient flossing before and during training were able to bridge d = 70. These results suggest that orthogonal
initialization does seem to slightly improve performance for tasks with long time horizons to bridge and gradient
flossing and additionally boost the performance. Thus orthogonal initialization and gradient flossing seems to go
well together. It would be interesting to study if orthogonal initialization also reduces the number of gradient
flossing steps necessary to improve performance.


H         Additional Details on Training Tasks

In this section, we provide a more rigorous definition of the tasks used for training, as discussed in Section 3:


                                                         22
  A                  100


      accuracy (%)
                                                                      with flossing throughout training
                                                                      early training flossing
                                                                      without flossing
                     50
                           0       500          1000         1500           2000          2500           3000
  B
                     100
    test error


                 10 1
                           0       500          1000         1500           2000          2500           3000
  C
                 10 3                        with flossing throughout training
                                             early training flossing
                                             without flossing
   |dhd 0 |


                 10 7

                           0       500          1000          1500          2000          2500           3000
                                                             Epochs
Figure 12: Gradient flossing throughout training can be detrimental to learning A) Accuracy as a
function of training epochs for binary temporal delayed XOR task for gradient flossing throughout
training every 100 training epochs (red). Accuracy drops down close to chance level every time after
gradient flossing but recovers quickly between. Same for only 5 episodes of gradient flossing at
epochs e ∈ {0, 100, 200, 300, 400}) (green) and no flossing at all (blue). B) Test error as a function
                                           dL
of training epochs. C) Gradient norm of | dh 0
                                               | as a function of training epochs for networks without
gradient flossing (blue) and networks with flossing throughout training (red) and early training
gradient flossing (green). Error gradient norm is boosted after each gradient flossing. In networks
                                                   dL
without gradient flossing, the gradient norms | dh   0
                                                       | are much smaller overall. Parameters: g = 1,
batch size b = 16, N = 80, epochs = 104 , T = 300, gradient flossing for Ef = 500 epochs on
k = 75 Lyapunov exponents before training. Task: binary delayed XOR, delay d = 70, loss: cross
entropy(y, ŷ).


H.1        Copy task
For the copy task, the target network readout at time t is yt = xt−d , where d denotes the delay. We chose the
input to be sampled i.i.d. from a uniform distribution between 0 and 1.

H.2        Temporal XOR task
The temporal XOR task requires the target network readout yt at time t to be computed as follows:
                                              yt = |xt−d/2 − xt−d |                                            (20)
                                                                                  2
where again d denotes a time delay of d time steps. In the case of x ∈ {0, 1} and y ∈ {0, 1}, the output yt
follows the truth table of the XOR digital logic gate (Table 2). Thus, the function f (xa , xb ) = |xa − xb | can be
seen as an analytical representation of the XOR gate. It is important to note that f (x, 0) = x only for x ≥ 0,
and that this task requires a nonlinearity. The implementation can easily be constructed analytically, for example,
using two rectified linear units ϕ(x) = max(x, 0) the outbut can be constructed by
                               f (xa , xb ) = |xa − xb | = ϕ(xa − xb ) + ϕ(xb − xa ).                          (21)
Together with a delay line to transmit the signal xt−d over time, this can solve the task.


                                                        23
         A            no gradient flossing                          B gradient flossing during training
             0.1
    1
                                                                        0.1


                                                               1
             0.2
                                                                        0.2
                      20       40         60                                        20       40         60
                       complexity (delay d)                                          complexity (delay d)
         C                                                          D
             0.1
             0.2                                                        0.1
    10


                                                               10
             0.3                                                        0.2
                      20       40         60                                        20       40         60
                       complexity (delay d)                                          complexity (delay d)
Figure 13: Lyapunov exponents of trained networks with and without gradient flossing A) First
Lyapunov exponents λ1 for Vanilla networks trained on spatial delayed XOR task as a function of
the delay with no gradient flossing. Colored-coded is test accuracy at the end of training where red
corresponds to 100% accuracy and blue to chance level (50%). B) Same as A for networks with
gradient flossing during training. Black dashed line shows that Lyapunov exponents of successfully
trained networks can be approximated by the empirical fit λ1 (d) = −0.2exp.(−0.03delay). (Proto-
col for gradient flossing during training same as main text Fig 4B). C) Same as A for tenth Lyapunov
exponents λ10 . D) Same as B for tenth Lyapunov exponents λ10 . Same fit as in B also describes
λ10 . Parameters: g = 1, batch size b = 16, N = 80, epochs = 104 , T = 300, gradient flossing for
Ef = 500 epochs on k = 75 Lyapunov exponents before training. Task: binary spatial delayed XOR,
loss: cross entropy(y, ŷ).


I   Additional Background on Lyapunov Exponents of RNNs

An autonomous dynamical system is usually defined by a set of ordinary differential equations dh/dt =
F(h), h ∈ RN in the case of continuous-time dynamics, or as a map hs+1 = f (hs ) in the case of discrete-time
dynamics. In the following, the theory is presented for discrete-time dynamical systems for ease of notation,
but everything directly extends to continuous-time systems [43]. Together with an initial condition h0 , the
map forms a trajectory. As a natural extension of linear stability analysis, one can ask how an infinitesimal
perturbation h′0 = h0 + ϵu0 evolves in time. Chaotic systems are sensitive to initial conditions; almost all
infinitesimal perturbations ϵu0 of the initial condition grow exponentially with time |ϵut | ≈ exp(λ1 t)|ϵu0 |.
Finite-size perturbations, therefore, may lead to a drastically different subsequent behavior. The largest Lyapunov
exponent λ1 measures the average rate of exponential divergence or convergence of nearby initial conditions:


                                                          1              ||ϵut ||
                                        λ1 (h0 ) = lim        lim log                                            (22)
                                                    t→∞ t ϵ→0            ||ϵu0 ||
In dynamical systems that are ergodic on the attractor, the Lyapunov exponents do not depend on the initial
conditions as long as the initial conditions are in the basins of attraction of the attractor. Note that it is crucial
to first take the limit ϵ → 0 and then t → ∞, as λ1 (h0 ) would be trivially zero for a bounded attractor if the
                                       ||ϵut ||
limits are exchanged, as limt→∞ log ||ϵu   0 ||
                                                is bounded for finite perturbations even if the system is chaotic. To
measure k Lyapunov exponents, one has to study the evolution of k independent infinitesimal perturbations us
spanning the tangent space:


                                                   us+1 = Ds us                                                  (23)


                                                         24
                                      forward pass                                            backward pass
                                      O N2 b
                                             
    RNN dynamics                                                                              "
                                      O N2 k
                                              
 Jacobian step                                                                                "
                                      O N k2
                                              
    QR step                                                                                   "
                                      O N2 b T
                                                
    total amortized costs                                                                     "
    per training epoch
                                      O N 2 Tf (1 + k/tONS + k)
                                                                    
    total amortized costs                                                                     "
    per gradient flossing epoch
                                      O N 2 [Eb T + Ep Tf (1 + k/tONS + k)]
                                                                                   
    total amortized costs                                                                     "
    of preflossing
                                      O N 2 [Eb T + Ep Tp + Ef Tf (1 + k/tONS + k)]
                                                                                        
    total amortized costs                                                                     "
    flossing during training

Table 1: Computational cost for gradient flossing and training of RNNs
N denotes number of neurons, b is the batch size, T is the number of time steps in forward pass of
training, Tf is the number of time steps in forward pass of flossing, tONS is the reorthonormalization
interval, k is the number of flossed Lyapunov exponents, E is the number of training epochs, Ep is
the number of preflossing epochs, Ef is the number of flossing epochs during training. Empirically,
we find that the necessary number of preflossing epochs Ep and flossing episodes Ef is much smaller
than both the total number of training epochs E. Moreover, Tp can be smaller than T .

                                                     Table 2: XOR
                                  input xt−d        input xt−2d     target output yt
                                  0                 0               0
                                  0                 1               1
                                  1                 0               1
                                  1                 1               0


where the N × N Jacobian Ds (hs ) = df (hs )/dh characterizes the evolution of generic infinitesimal perturba-
tions during one step. Note that this Jacobian along the trajectory is equivalent to a stability matrix only at a
fixed point, i.e., when hs+1 = f (hs ) = hs .
We are interested in the asymptotic behavior, and therefore we study the long-term Jacobian

                                      Tt (h0 ) = Dt−1 (ht−1 ) . . . D1 (h1 )D0 (h0 ).                        (24)
Note that Tt (h0 ) is a product of generally noncommuting matrices. The Lyapunov exponents λ1 ≥ λ2 · · · ≥ λN
are defined as the logarithms of the eigenvalues of the Oseledets matrix
                                                                              1
                                          Λ(h0 ) = lim [Tt (h0 )⊤ Tt (h0 )] 2t ,                             (25)
                                                    t→∞

where ⊤ denotes the transpose operation. The expression inside the brackets is the Gram matrix of the long-term
Jacobian Tt (h0 ). Geometrically, the determinant of the Gram matrix is the squared volume of the parallelotope
spanned by the columns of Tt (h0 ). Thus, the exponential volume growth rate is given by the sum of the
logarithms of its first k (sorted) eigenvalues. Oseledets’ multiplicative ergodic theorem guarantees the existence
of the Oseledets matrix Λ(h0 ) for almost all initial conditions h0 [42]. In ergodic systems, the Lyapunov
exponents λi do not depend on the initial condition h0 . However, for a numerical calculation of the Lyapunov
spectrum, Eq 25 cannot be used directly because the long-term Jacobian Tt (h0 ) quickly becomes ill-conditioned,
i.e., the ratio between its largest and smallest singular value diverges exponentially with time.


J      Algorithm for Calculating Lyapunov Spectrum of Rate Networks
For calculating the first k Lyapunov exponents, we exploit the fact that the growth rate of a k-dimensional
infinitesimal volume element is given by λ(m) = m                               (1)
                                                                                    , λ2 = λ(2) − λ1 , λ3 =
                                                   P
                                                     i=1 λi . Therefore, λ1 = λ
  (3)
λ − λ1 − λ2 , . . . [44]. The volume growth rates can be obtained via QR-decomposition.


                                                            25
                                                    orthogonal weight initialization
       A                           delayed temporal XOR           B                  delayed spatial XOR
                    100                                                                       100
test accuracy (%)              flossing during training


                                                                          test accuracy (%)
                               preflossing
                     80        no flossing                                                     80

                     60                                                                        60

                          0      2000     4000 6000          8000 10000                             0    1000   2000 3000       4000     5000
                                             Epochs                                                               Epochs
       C                                                                         D
                    100                                                                       100
test accuracy (%)


                                                                          test accuracy (%)
                     80                                                                        80                       flossing during training
                                                                                                                        preflossing
                                                                                                                        no flossing
                     60                                                                        60

                          10     20      30 40 50 60                 70                             10   20     30 40 50 60              70
                                       complexity (delay d)                                                   complexity (delay d)

Figure 14: Gradient flossing before and during training improves trainability for orthogonal
nets
A) Test accuracy for orthogonally initialized vanilla RNNs trained on delayed temporal binary XOR
task yt = xt−d/2 ⊕ xt−d with gradient flossing during training (green), preflossing (orange), and
with no gradient flossing (blue) for d = 70. Solid lines are mean, transparent thin lines are individual
network realizations B) Same as A for delayed spatial XOR task with yt = x1t−d ⊕ x2t−d ⊕ x3t−d .
C) Test accuracy as a function of task difficulty (delay d) for delayed temporal XOR task. D) Test
accuracy as a function of task difficulty (delay d) for delayed spatial XOR task. Parameters: g = 1,
batch size b = 16, N = 80, epochs = 104 for delayed temporal XOR, epochs = 5000 for delayed
spatial XOR, T = 300, flossing for Ef = 500 epochs on k = 75 Lyapunov exponents before training
and during training for green lines, and only before training for orange lines. Shaded areas are 25%
and 75% percentiles, solid lines are means, transparent dots are individual simulations, task loss is
cross-entropy between y, ŷ.


First, we evolve an initially orthonormal system Qs = [q1s , q2s , . . . qm
                                                                          s ] in the tangent space along the trajectory
using the Jacobian Ds :
                                                Qe s+1 = Ds Qs                                                     (26)
A continuous system can be transformed into a discrete system by considering a stroboscopic representation,
where the trajectory is only considered at certain discrete time points. We use here the notation of discrete
dynamical systems where this corresponds to performing the product of Jacobians along the trajectory Q
                                                                                                     e s+1 =
Ds Qs . We study the discrete network dynamics in the limit of small time step ∆t → 0 and for discrete time
∆t = 1. The notation can be readily extended to continuous systems [43].
Second, we extract the exponential growth rates using the QR-decomposition,
                                                                 e s+1 = Qs+1 Rs+1 ,
                                                                 Q

which uniquely decomposes Q    e s+1 into an orthonormal matrix Qs+1 of size N × k so Q⊤    s+1 Qs+1 = 1m×m
and to an upper triangular matrix Rs+1 of size k × k with positive diagonal elements. Geometrically, Qs+1
describes the rotation of Qs caused by Ds and the diagonal entries of Rs+1 describe the stretching and shrinking
of the columns of Qs , while the off-diagonal elements represent the shearing. Fig 15 visualizes Ds and the
QR-decomposition for k = 2.
The Lyapunov exponents are given by time-averaged logarithms of the diagonal elements of Rs :
                                                                     t                t
                                                              1     Y              1X
                                                 λi = lim       log     Rsii = lim       log Rsii .                                             (27)
                                                          t→∞ t                t→∞ t
                                                                    s=1              s=1


                                                                          26
                      Ds
                                                                        QR

                                                         s+1
                                                        R22
qs2                              ~2         s+1
                                                                                    2
                                                                                   qs+1
                                 qs+1      R11          ~1
                                                        qs+1
                                                                              s+1
      qs1                                                                    R11             1
                                                                                            qs+1
                                                                                                    s+1
                                                                                                   R22

Figure 15: Geometric illustration of Lyapunov spectrum calculation. An orthonormal matrix
Qs = [q1s , q2s , . . . qms ], whose columns are the axes of an k-dimensional cube, is rotated and distorted
by the Jacobian Ds into an k-dimensional parallelotope Q           e s+1 = Ds Qs embedded in RN . The
figure illustrates this for k = 2, in which case the columns of Q        e s+1 span a parallelogram, which
can be divided into a right triangle and a trapezoid and rearranged into a rectangle. Thus, the
area of the gray parallelogram is the same as that of the orange rectangle. The QR-decomposition
reorthonormalizes Q       e s+1 by decomposing it into the product of an orthonormal matrix Qs+1 =
[q1s+1 , q2s+1 , . . . qms+1 ] and the upper-triangular matrix R
                                                                   s+1
                                                                       . Qs+1 describes the rotation of Qs
                                                  s+1
caused by Ds . The diagonal entries of R              gives the stretching/shrinking along the columns of
Qs+1 ,Qthus the volume of the parallelotope formed by the first k columns of Q            e s+1 is given by
          m
Vm = i=1 Rs+1     ii    . The  time-averaged logarithms  of the diagonal  elements of R s
                                                                                          give the Lyapunov
                                   1
                                         Qt       s                1
                                                                     Pt          s
spectrum: λi = limtsim →∞ tsim log s=1 Rii = limtsim →∞ t s=1 log Rii .


Note that the QR-decomposition does not need to be performed at every simulation step, just sufficiently often,
i.e., once every sONS steps such that Q     ONS = Ds+sONS −1 · Ds+sONS −2 . . . Ds · Qs remains well-conditioned
                                      e s+s
[44]. An appropriate reorthonormalization interval sONS = tONS /∆t thus depends on the condition number, the
ratio of the smallest and largest singular value:
                                                                                    s+s
                              e s+s ) = κ2 (Rs+sONS ) =    σ1 (Rs+sONS )    R ONS
                         κ2 ( Q                                          = 11          .                   (28)
                                                                            Rs+s
                                   ONS                            s+s
                                                           σm (R ONS )         mm
                                                                                   ONS

An initial transient should be disregarded in the calculation of the Lyapunov spectrum because h first has to
converge towards the attractor and Q has to converge to the unique eigenvectors of the Oseledets matrix (Eq 25)
[54]. A simple example of this algorithm in pseudocode is:

Algorithm 2 Jacobian-based algorithm for Lyapunov spectrum
  initialize h, Q
  evolve h until it is on attractor (avoid initial transient)
  evolve Q until it converges to the eigenvectors of the backward Oseledets matrix
  set γi = 0
  for t = 1 → T do
       h ← f (h)
              df
       D ← dh
       Q←D·Q
       if s ≡ 0 (mod sONS ) then
           Q, R ← qr(Q)
           γi += log(Rii )
       end if
  end for
  λi = γi /T


It is guaranteed that under general conditions initially random orthonormal systems will exponentially converge
towards a unique basis that is given by the eigenvectors of the Oseledets matrix Eq 25 [54]. A minimal example
of this algorithm in pseudocode is shown in appendix 3. A feasible strategy to determine the reorthonormalization
time interval tONS is to get first a rough estimate of the Lyapunov spectrum using a short simulation time tsim and
a small tONS and repeat with a longer simulation time and a tONS based on the Lyapunov spectrum of the rough
estimate of the Lyapunov spectrum. Another strategy is, to first iteratively adapt tONS on a short simulation
run to get an acceptable condition number. It should be noted that there exists a diversity of other methods to
estimate the Lyapunov spectrum [14, 43, 68, 69].


                                                        27
K            Convergence of Lyapunov Exponents of RNNs
In Fig. 16, we demonstrate the convergence of the Lyapunov exponents. We show the estimate of the Lyapunov
exponents λi for i = 1, 20, 60, 80 for different initial conditions but identical network realization.

             0
             10
i(1/steps)


             20
             30
             40
                  100                                    101                                      102
                                                        steps


Figure 16: Convergence of Lyapunov exponents Convergence of selected Lyapunov exponents
λi for ten identical network realizations with different initial conditions with simulation time (i =
1, 20, 60, 80) for σ = 1 and g = 1. (Other parameters: N = 80, tsim = 100 steps, tONS = 1).


                                                   28