refs/paper_2603.12934.txt



Photonic Exponential Approximation via Cascaded TFLN Microring Resonators toward Softmax

Hyoseok Park¹ and Yeonsang Park¹,∗
¹Department of Physics, Chungnam National University, Daejeon 34134, Republic of Korea
(Dated: March 26, 2026)

The rapid growth of large-scale AI models has intensified energy consumption and data-movement challenges in modern datacenters. Photonic accelerators offer a promising path by executing the linear matrix multiplications of transformer inference at high throughput and low energy. However, the softmax attention layer, which requires element-wise exponentiation followed by normalization, still relies on electronic post-processing, creating an electro-optic conversion bottleneck that negates much of the potential photonic advantage.

We present a cascaded micro-ring resonator (MRR) architecture that synthesizes the per-channel exponential function required by softmax, $e^{x_n - \max(x)}$, over a finite interval with tunable worst-case relative error. A control signal detunes each ring via an electro-optic mechanism; a weak probe at fixed frequency experiences Lorentzian transmission, and cascading $N$ identical stages yields a multiplicative transfer function whose logarithm is approximately linear.

We derive mapping rules, depth-scaling estimates, and a minimax fitting formulation, and validate the framework with three-dimensional FDTD simulations of X-cut thin-film lithium niobate (TFLN) add-drop micro-ring resonators. Direct multi-ring FDTD validation extends to a five-ring cascade and confirms agreement with theory primarily over the upper operating range; deeper cascades and higher quality factors are assessed analytically. The cascade implements the per-channel exponential block, the key missing nonlinearity for photonic softmax. We further present a WDM-parallel chip architecture with closed-loop PI feedback that completes the full softmax (exponentiation, summation, and normalization) on a single photonic chip without per-channel normalization circuitry.

arXiv:2603.12934v3 [physics.optics] 25 Mar 2026


∗ Corresponding author: yeonsang.park@cnu.ac.kr

I. INTRODUCTION

Transformer inference is often limited by power and memory traffic, motivating optical accelerators that exploit parallel propagation and multiplexing [1, 2, 4, 5, 7, 9]. Recent perspective articles also discuss data-center power consumption as one motivation for optical computing [3, 8]. While linear operators are comparatively amenable to photonic implementation [4–6], the softmax function used in attention layers requires an exponential mapping together with global normalization, both of which are difficult to realize in passive photonic circuits, where transmission is fundamentally bounded by unity. Parallel digital-hardware studies treat the exponential/softmax stage as a bottleneck and propose dedicated approximations [11–19]. Many integrated-photonic classifier demonstrations still rely on electronic post-processing for the final nonlinear readout [10]; the resulting electro-optic conversion overhead can negate the throughput and energy benefits of the photonic front-end. Notably, the SOFTONIC architecture [11] explicitly argues that “the inability of MRRs and MZMs to handle SMA’s exponential and division functions” necessitates alternative approaches based on microdisk modulators and polynomial approximation, achieving 89.7% accuracy with a third-degree Chebyshev polynomial. Here we challenge this premise: we show that a passive Lorentzian cascade of microring resonators can be tuned so that its logarithm is approximately linear over a finite interval, enabling exponential-function synthesis with sub-2% worst-case error (an order of magnitude more accurate than SOFTONIC’s polynomial approach) while remaining compatible with integrated microring platforms [20–24]. We term this cascade block an approximate exponential function (AEF) unit. We further propose a WDM-parallel architecture with a single PI feedback loop that realizes the complete softmax function, including summation and normalization, without per-channel electronic processing.

We extend the theoretical framework with three-dimensional FDTD simulations of a single X-cut TFLN add-drop micro-ring resonator. The simulated device parameters (quality factor, free spectral range, and electro-optic sensitivity) calibrate the cascade design parameters, bridging analytical fitting and physically realizable hardware. Two operating regimes emerge from this calibration: an FDTD-characterized regime with moderate drop-port depth ($D_{\max} \approx 0.36$), where the analytic error stays below ∼5% for $N \le 7$ but the power budget limits practical cascades to $N \le 5$; and a projected high-Q regime ($D_{\max} \ge 0.95$), enabling deeper cascades ($N \le 30$) with sub-percent error. Cascade performance is predicted analytically and validated by a five-ring cascade 3D FDTD simulation (Sec. IV).

The paper is organized as follows: Section II presents the mapping, transfer model, and depth-design rules; Section III provides numerical fits and validation; Section IV describes the single-ring TFLN device design and FDTD validation; Section V assesses physical feasibility including voltage requirements, insertion loss, and energy efficiency;

Section VI discusses implementation scope, platform comparisons, and limits; and Section VII concludes.

II. MODEL AND DESIGN FRAMEWORK

Target mapping. Let $x = (x_1, \dots, x_K) \in \mathbb{R}^K$ be an arbitrary real-valued sequence (or vector). Directly generating $\exp(x_n)$ as a passive optical transmission is impossible in general because $\exp(x)$ grows beyond unity while a passive transmission satisfies $0 < T \le 1$ [25]. However, for softmax,

\[
\mathrm{softmax}(x)_n = \frac{e^{x_n}}{\sum_j e^{x_j}}, \tag{1}
\]

a common shift cancels:

\[
\frac{e^{x_n + c}}{\sum_j e^{x_j + c}} = \frac{e^{x_n}}{\sum_j e^{x_j}} \qquad (\forall c \in \mathbb{R}). \tag{2}
\]

Thus it suffices to generate

\[
e^{x_n - m}, \qquad m \equiv \max_j x_j, \tag{3}
\]

since the global factor $e^{m}$ cancels.

To ensure a nonnegative control-signal amplitude, define

\[
u_n \equiv x_n - m \le 0, \qquad L \equiv -\min_n u_n = m - \min_n x_n \ge 0, \tag{4}
\]

and map each scalar to a nonnegative control-signal amplitude

\[
I_n \equiv u_n + L \in [0, L]. \tag{5}
\]

Then

\[
e^{x_n - m} = e^{u_n} = e^{I_n - L}. \tag{6}
\]

Hence the optical design task is to realize, for $I \in [0, L]$,

\[
f(I) = e^{I - L} \in [e^{-L}, 1]. \tag{7}
\]

Control–probe transfer. Consider a weak probe at fixed angular frequency $\omega_L$. For the $k$th ring, let $\omega_{0,k}$ denote its resonance frequency and $\Gamma > 0$ its loaded half-width at half maximum (HWHM). Define the detuning

\[
\Delta\omega_k \equiv \omega_L - \omega_{0,k}. \tag{8}
\]

Near resonance, the normalized Lorentzian transmission is modeled as [20, 21]

\[
T_k(\Delta\omega_k) = \frac{1}{1 + \left(\dfrac{\Delta\omega_k}{\Gamma}\right)^{2}}. \tag{9}
\]

In a control–probe architecture, a nonnegative control-signal amplitude $I \ge 0$ shifts the ring resonance. Here $I$ denotes a generic control amplitude: for optical-pump operation it maps to optical intensity, while for EO operation it maps to electrical control level (e.g., voltage). Across many physical mechanisms (optical pump via Kerr/XPM, EO drive via Pockels effect, thermal, carrier tuning), the shift can be linearized on a working range [20, 26–30]:

\[
\omega_{0,k}(I) = \omega_{0,k}^{(0)} + \eta I, \tag{10}
\]

where $\omega_{0,k}^{(0)}$ is the cold-cavity resonance and $\eta$ is the control-to-resonance sensitivity. In practice, the control channel can be optical or electrical (optical pump, EO/Pockels drive, thermal, or carrier tuning); a quantitative EO feasibility example is given in the Discussion. With $\Delta\omega_{0,k} \equiv \omega_L - \omega_{0,k}^{(0)}$, the control-dependent detuning becomes

\[
\Delta\omega_k(I) = \Delta\omega_{0,k} - \eta I. \tag{11}
\]

Define dimensionless parameters

\[
a_k \equiv \frac{\Delta\omega_{0,k}}{\Gamma}, \qquad b \equiv -\frac{\eta}{\Gamma}. \tag{12}
\]

Then Eq. (9) yields the control-to-probe transfer of a single ring,

\[
T_k(I) = \frac{1}{1 + (a_k + bI)^2}. \tag{13}
\]

Physical meaning: $a_k$ is a static detuning in linewidth units (set by heater/carrier tuning/fabrication), and $|b|$ is the normalized sensitivity magnitude (linewidths of resonance shift per unit control-signal amplitude); the sign convention is absorbed into the detuning expression. For “same-material/same-geometry” rings, $b$ is often common, while $a_k$ can be tuned per ring.

Sign convention. Simultaneously flipping $(a_k, b) \mapsto (-a_k, -b)$ leaves $T_k(I)$ unchanged, so we may take $b > 0$ without loss of generality.

Let $N$ rings be cascaded in a serial add-drop topology: $T_k(I)$ denotes the add-to-drop transmission of ring $k$, and the drop output of ring $k$ feeds the add (input bus) port of ring $k+1$. Assuming the probe is sufficiently weak so the control channel dominates the resonance shift, the normalized probe output is the product

\[
y(I) \equiv \frac{P_{\mathrm{out}}^{(\mathrm{probe})}(I)}{P_{\mathrm{in}}^{(\mathrm{probe})}} = \prod_{k=1}^{N} T_k(I) = \prod_{k=1}^{N} \frac{1}{1 + (a_k + bI)^2}. \tag{14}
\]
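As a minimal sketch of the target mapping in Eqs. (3)–(7), the electronic preprocessing can be written in a few lines of Python. The function names are illustrative and not from the paper, and the block output is assumed to be the ideal $f(I) = e^{I-L}$ rather than the cascade approximation.

```python
import math

def control_amplitudes(x):
    """Map raw scores x_n to control amplitudes I_n in [0, L] (Eqs. 3-5)."""
    m = max(x)                    # m = max_j x_j, Eq. (3)
    u = [xn - m for xn in x]      # u_n = x_n - m <= 0, Eq. (4)
    L = -min(u)                   # L = m - min_n x_n >= 0, Eq. (4)
    I = [un + L for un in u]      # I_n = u_n + L, Eq. (5)
    return I, L

def softmax_from_controls(I, L):
    """Softmax built from the ideal block output f(I) = exp(I - L), Eq. (7)."""
    f = [math.exp(In - L) for In in I]
    total = sum(f)
    return [fn / total for fn in f]

def softmax_reference(x):
    """Direct evaluation of Eq. (1), used only for comparison."""
    e = [math.exp(xn) for xn in x]
    total = sum(e)
    return [en / total for en in e]

x = [-3.2, 1.2, 4.8, -0.9]        # example input sequence
I, L = control_amplitudes(x)      # every I_n lies in [0, L]
p = softmax_from_controls(I, L)   # agrees with softmax_reference(x)
```

Because the common factor $e^{-m}$ cancels in Eq. (2), the probabilities computed from the shifted and biased controls agree with the direct softmax up to floating-point rounding.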
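The control–probe transfer of Eqs. (9)–(14) can be checked numerically with a short sketch. The numbers below are hypothetical values in arbitrary consistent units, not device parameters from the paper, and the function names are illustrative.

```python
def lorentzian(delta_omega, gamma):
    """Normalized Lorentzian transmission, Eq. (9); gamma is the loaded HWHM."""
    return 1.0 / (1.0 + (delta_omega / gamma) ** 2)

def ring_transfer(I, a_k, b):
    """Control-to-probe transfer of a single ring, Eq. (13)."""
    return 1.0 / (1.0 + (a_k + b * I) ** 2)

def cascade_output(I, a, b):
    """Normalized probe output y(I) = prod_k T_k(I), Eq. (14)."""
    y = 1.0
    for a_k in a:
        y *= ring_transfer(I, a_k, b)
    return y

# Hypothetical numbers (assumptions for illustration only).
gamma = 2.0           # loaded HWHM
eta = -0.4            # control-to-resonance sensitivity, Eq. (10)
delta_omega0 = -3.0   # cold-cavity detuning omega_L - omega_0^(0)

a_k = delta_omega0 / gamma    # static detuning in linewidth units, Eq. (12)
b = -eta / gamma              # normalized sensitivity, Eq. (12)

I = 2.5
via_detuning = lorentzian(delta_omega0 - eta * I, gamma)  # Eqs. (9) and (11)
via_params = ring_transfer(I, a_k, b)                     # Eq. (13)

y5 = cascade_output(I, [a_k] * 5, b)            # five identical rings
y5_flipped = cascade_output(I, [-a_k] * 5, -b)  # sign-flipped parameters
```

The two single-ring evaluations coincide because $(\Delta\omega_{0,k} - \eta I)/\Gamma = a_k + bI$, and the flipped cascade gives the same output, which is the sign-convention statement in numerical form.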


[Figure 1 schematic. (a) Electronic preprocessing: $\{x_n\}$ → find max, $m = \max(x_n)$ → shift, $u_n = x_n - m$ → bias, $I_n = u_n + L$ → control in. (b) $N$-MRR cascade: probe at fixed $\omega_L$ passing through $N$ EO-tuned stages (MRR #1 to MRR #5 drawn). (c) Output: photodetector (PD), $\tilde{y}(I_n) \approx \exp(I_n - L) \to \exp(x_n - m)$.]

FIG. 1: Overview of the control–probe add-drop cascade $N$-MRR exponential block. (a) Electronic preprocessing maps an arbitrary input sequence $\{x_n\}$ to a nonnegative control signal via $m = \max_n x_n$, $u_n = x_n - m$, and $I_n = u_n + L$ with $L = m - \min_n x_n$. (b) The control signal $I_n$ induces resonance shifts in a cascade of $N$ rings, while a weak fixed-frequency probe propagates through the serial add-drop cascade (the drop output of each ring feeds the next stage), experiencing multiplicative transmission. (c) After photodetection, the block implements $y(I_n) \approx \exp(I_n - L) \approx \exp(x_n - m)$, i.e., the normalized exponential used in softmax.


To focus on the shape of the approximation, we allow a global scale factor $C > 0$:

\[
\tilde{y}(I) \equiv C\, y(I). \tag{15}
\]

In softmax, $p_n = C e^{I_n - L} / \sum_j C e^{I_j - L}$, so $C$ cancels between numerator and denominator and is physically inessential; nevertheless it is convenient for error analysis. For a fixed $(N, b, \{a_k\})$, the optimal $C$ for the minimax log-error in Eq. (18) can be written in closed form. Let $g(I) \equiv \ln y(I) - (I - L)$ on $[0, L]$. Then the minimax-optimal shift is $\ln C^{\star} = -(\max_I g(I) + \min_I g(I))/2$, yielding $E_\infty = (\max_I g(I) - \min_I g(I))/2$.

Taking logarithms,

\[
\ln \tilde{y}(I) = \ln C - \sum_{k=1}^{N} \ln\!\left(1 + (a_k + bI)^2\right). \tag{16}
\]

The target $\ln f(I) = I - L$ is linear; hence exponential approximation is equivalent to the log-linearization goal

\[
\ln \tilde{y}(I) \approx I - L \quad \text{uniformly on } I \in [0, L]. \tag{17}
\]

Error metric. Define the worst-case log-error on $[0, L]$:

\[
E_\infty \equiv \sup_{I \in [0, L]} \left| \ln \tilde{y}(I) - (I - L) \right|. \tag{18}
\]

If $E_\infty \le \varepsilon_{\log}$, then for all $I \in [0, L]$,

\[
e^{-\varepsilon_{\log}} \le \frac{\tilde{y}(I)}{f(I)} \le e^{\varepsilon_{\log}}
\;\Rightarrow\;
\left| \frac{\tilde{y}(I)}{f(I)} - 1 \right| \le e^{\varepsilon_{\log}} - 1. \tag{19}
\]

Thus achieving a prescribed worst-case relative error $\varepsilon$ is guaranteed by

\[
E_\infty \le \varepsilon_{\log} \equiv \ln(1 + \varepsilon) \approx \varepsilon. \tag{20}
\]

Depth scaling. We derive depth-related constraints and design rules for a prescribed approximation tolerance.

Necessary slope condition. Differentiate Eq. (16):

\[
\frac{d}{dI} \ln y(I) = -\sum_{k=1}^{N} \frac{2b\,(a_k + bI)}{1 + (a_k + bI)^2}. \tag{21}
\]

Since $|2u/(1 + u^2)| \le 1$ for all real $u$,

\[
\left| \frac{d}{dI} \ln y(I) \right| \le N|b|. \tag{22}
\]
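The closed-form scale factor and the error metric of Eqs. (15)–(20) can be evaluated with a short sketch. Two assumptions are made here that are not in the text: the supremum over $[0, L]$ is approximated on a uniform grid, and the parameters $(N, a, b, L)$ are illustrative, unoptimized values. Function names are hypothetical.

```python
import math

def uniform_grid(L, samples=2001):
    """Uniform samples of [0, L]; a stand-in for the continuous interval."""
    return [L * i / (samples - 1) for i in range(samples)]

def log_cascade(I, a, b):
    """ln y(I) for the cascade of Eq. (14), i.e. Eq. (16) with ln C = 0."""
    return -sum(math.log1p((a_k + b * I) ** 2) for a_k in a)

def minimax_scale(a, b, L, samples=2001):
    """Return (ln C*, E_inf) from g(I) = ln y(I) - (I - L) sampled on a grid."""
    g = [log_cascade(I, a, b) - (I - L) for I in uniform_grid(L, samples)]
    g_max, g_min = max(g), min(g)
    ln_c = -(g_max + g_min) / 2.0   # minimax-optimal shift ln C*
    e_inf = (g_max - g_min) / 2.0   # resulting worst-case log-error
    return ln_c, e_inf

def relative_error_bound(e_inf):
    """Bound of Eq. (19): |y~/f - 1| <= exp(E_inf) - 1."""
    return math.expm1(e_inf)

# Illustrative, unoptimized parameters (assumptions, not fitted values).
N, L = 10, 8.0
b = 0.1
a = [-1.4] * N

ln_c, e_inf = minimax_scale(a, b, L)
bound = relative_error_bound(e_inf)
```

With this choice of $\ln C^{\star}$ the sampled residual $\ln \tilde{y}(I) - (I - L)$ is centered, so its largest positive and negative excursions are equal in magnitude and both equal $E_\infty$ on the grid.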
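The slope bound of Eqs. (21)–(22) can also be checked numerically. The sketch below evaluates the analytic derivative on a uniform grid (a sampling assumption) for two hypothetical identical-ring cascades; the parameter values and function names are illustrative only.

```python
def log_slope(I, a, b):
    """Analytic d/dI ln y(I), Eq. (21)."""
    return -sum(
        2.0 * b * (a_k + b * I) / (1.0 + (a_k + b * I) ** 2) for a_k in a
    )

def max_abs_slope(a, b, L, samples=2001):
    """Largest |d/dI ln y(I)| on a uniform grid over [0, L] (sampled)."""
    return max(
        abs(log_slope(L * i / (samples - 1), a, b)) for i in range(samples)
    )

def can_reach_unit_slope(N, b):
    """Necessary (not sufficient) condition N|b| >= 1 implied by Eq. (22)."""
    return N * abs(b) >= 1.0

L = 8.0
shallow = max_abs_slope([-1.4] * 5, 0.1, L)    # N|b| = 0.5, slope stays below 1
deeper = max_abs_slope([-1.4] * 10, 0.1, L)    # N|b| = 1.0, slope can reach 1

# With a + b*I = -1 each ring contributes its maximum slope b, so the
# ten-ring example attains N*b = 1 at I = 4.
slope_at_center = log_slope(4.0, [-1.4] * 10, 0.1)
```

The five-ring example cannot track the unit slope of $\ln f(I) = I - L$ anywhere on the interval, which is the content of the necessary condition that follows from Eq. (22).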

The target ln f(I) = I − L has constant slope +1, so a necessary condition to track it is

\[ N|b| \gtrsim 1. \tag{23} \]

   Near-optimal parameterization. The full design problem can be written as a minimax fit in the log domain [31]:

\[
\min_{a_1,\dots,a_N,\ \ln C}\ \sup_{I\in[0,L]} |r(I)|, \qquad
r(I) \equiv \ln C - \sum_{k=1}^{N} \ln\!\left[1 + (a_k + bI)^2\right] - (I - L). \tag{24}
\]

This objective is permutation-invariant in the a_k's (ring index k). In practice (and in numerical experiments reported below), the optimizer frequently collapses to a permutation-symmetric solution

\[ a_1 = \cdots = a_N \equiv a, \tag{25} \]

reducing the design to two parameters (a, b) (plus C). With Eq. (25),

\[ \tilde{y}(I) = C\,y(I) = C\left[\frac{1}{1 + (a + bI)^2}\right]^{N}. \tag{26} \]

A robust initialization is obtained by placing the midpoint of the interval on the Lorentzian half-maximum flank and matching the slope:

\[ a + b\,\frac{L}{2} \approx -1, \qquad N b \approx 1. \tag{27} \]

These two equations already yield a good design; a small (two-parameter) refinement can then enforce the desired worst-case tolerance.
   Local expansion and depth scaling. A Taylor expansion of the log-domain residual around the flank-centered point I0 = L/2 (with a + bI0 = −1 and Nb = 1) shows that the quadratic term vanishes identically, leaving a leading cubic residual r(δ) ∼ δ³/(6N²). Over I ∈ [0, L], this implies E∞ ∼ L³/N², so that achieving a prescribed tolerance εlog requires N ∝ L^(3/2)/√εlog, which explains the scaling used in Eq. (28). The full derivation is provided in Supplementary Sec. S0; an intuitive local-expansion summary appears in Sec. S1.
   Practical engineering estimate. Given L and a target worst-case relative error ε, define εlog = ln(1 + ε). A heuristic engineering estimate (not a rigorous bound) that matched our percent-level numerical designs is

\[ N \approx \max\!\left( \frac{1}{b_{\max}},\ \kappa\,\frac{L^{3/2}}{\sqrt{\varepsilon_{\log}}} \right), \tag{28} \]

where bmax is the physically achievable sensitivity bound and κ ≃ 0.07 for the identical-detuning flank design with a minimax refinement. After choosing N, set b = min(bmax, 1/N) and a = −1 − bL/2 as initialization, then refine (a, b) by a two-parameter minimax fit on [0, L].
   A heuristic conservative screening bound N ≥ ⌈(L²/4 + 1/(2b²))/ln(1 + ε)⌉ (derived via the same local-expansion argument; see Supplementary Sec. S1) provides a quick upper estimate but is not a rigorous guarantee.


              III.   NUMERICAL FITS AND VALIDATION

   We validate the analytical framework with minimax numerical fits and sampled robustness checks. Figure 2 shows the fitted approximation quality at L = 8: the top (linear) panel plots N = 1, 3, 5, 7 over I ∈ [0, 20], the middle (log) panel compares N = 5, 10, 20, 30 on I ∈ [0, 8], and the bottom panel shows the pointwise relative error with the characteristic Chebyshev equioscillation pattern.
   We fit identical-detuning cascades (Eq. 25) on I ∈ [0, L] and compare several depths using a minimax criterion.
   Table I makes the accuracy–depth trade-off explicit at L = 8. A worked input-to-output example demonstrating the mapping from an arbitrary input sequence x = [−3.2, 1.2, 4.8, −0.9] through the cascade is provided in Supplementary Sec. S2. The example shows that the N = 10 cascade keeps the worst-case relative error below 2.7% across all channels.
   Empirical calibration. We calibrate the effective logit range Leff from autoregressive Transformers (distilgpt2/gpt2) [1, 32–35] at context length 128, finding Leff,0.999 ≈ 7–9 at the 50th–90th percentiles (Supplementary Sec. S2). A clipping threshold t∗ = −12 preserves p99 softmax accuracy below 0.1%. Full protocol details, clipping-sweep tables/plots, and per-run statistics are provided in Supplementary Sec. S3.
   A synthetic design-space map (Supplementary Table S3) shows that near L ≈ 8, moderate depth (N ≥ 10) reaches few-percent error, whereas L ≳ 12 requires deeper cascades. All fits follow the same pipeline: minimize the worst-case log-error on a uniform grid, initialize from the flank rules in Eq. (27), perform multi-start global search, and apply bounded local refinement; implementation details and scripts are provided in a public repository [36] (commit: 585e695).


TABLE I: Depth comparison for L = 8 using fitted ỹ(I) = C[1 + (a + bI)²]^(−N) (same fitting pipeline for all N).

 N        a          b       max rel. err.   mean rel. err.
 5     −2.0789    0.21658       10.9%            6.43%
10     −1.4588    0.10202       2.68%            1.65%
20     −1.2135    0.05025       0.67%            0.42%
30     −1.1392    0.03341       0.30%            0.19%
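The sizing rule of Eq. (28), the flank initialization of Eq. (27), and the fitted values of Table I can be cross-checked with a short script. What follows is a minimal sketch under stated assumptions, not the fitting pipeline from the repository: the function names, the dense-grid evaluation, the rounding of Eq. (28) up to an integer, and the placeholder b_max = 1 are all illustrative choices. For fixed (a, b), ln C is set so that the residual is centred, which is the best constant offset in the worst-case sense.

```python
import math

def depth_estimate(L, eps, b_max, kappa=0.07):
    """Heuristic depth from Eq. (28). Rounding up is our assumption."""
    eps_log = math.log1p(eps)                      # eps_log = ln(1 + eps)
    n = max(1.0 / b_max, kappa * L ** 1.5 / math.sqrt(eps_log))
    return math.ceil(n)

def flank_init(L, N, b_max):
    """Flank initialization: b = min(b_max, 1/N), a = -1 - b*L/2."""
    b = min(b_max, 1.0 / N)
    return -1.0 - b * L / 2.0, b

def worst_case_log_error(a, b, N, L, samples=4001):
    """Worst-case |r(I)| of Eq. (24) for identical detunings a_k = a.

    With g(I) = -N ln[1 + (a + b I)^2] - (I - L), the best ln C centres
    the residual, giving E_inf = (max g - min g) / 2 on the sampled grid.
    """
    g = []
    for i in range(samples):
        I = L * i / (samples - 1)
        g.append(-N * math.log(1.0 + (a + b * I) ** 2) - (I - L))
    return 0.5 * (max(g) - min(g))

L = 8.0
N = depth_estimate(L, eps=0.0268, b_max=1.0)       # b_max = 1 is a placeholder
a0, b0 = flank_init(L, N, b_max=1.0)
e_init = worst_case_log_error(a0, b0, N, L)        # before refinement
e_fit = worst_case_log_error(-1.4588, 0.10202, 10, L)  # Table I row, N = 10
print(N, (a0, b0), e_init, e_fit, math.expm1(e_fit))
```

Under these assumptions the estimate returns N = 10 for a 2.68% target at L = 8, the unrefined flank initialization gives a log-error of about 0.11, and the Table I parameters bring it down to about 0.026, which corresponds to the tabulated relative error of roughly 2.7%.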

                                                            TABLE II: Waveguide and ring parameters of the X-cut
                                                             TFLN micro-ring resonator. Electro-optic electrode
                                                                parameters are listed separately in Table III.

                                                            Parameter                  Symbol       Value      Unit
                                                            Total TFLN thickness       tTFLN         600       nm
                                                            Etch depth                 tetch         500       nm
                                                            Slab thickness             tslab         100       nm
                                                            Waveguide width            w              1.4      µm
                                                            Bend radius                R              20       µm
                                                            Coupling gap               g             100       nm
                                                            Circumference              Lring        125.7      µm
                                                            Free spectral range        FSR          8.29       nm
                                                            Effective index (TE0 )     neff         1.903      —
                                                            Group index (TE0 )         ng            2.24      —
                                                            Extraordinary index        ne           2.138      —


                                                            IV.   TFLN SINGLE-RING DEVICE DESIGN AND
                                                                          FDTD VALIDATION

                                                                     A.    Waveguide and ring geometry


                                                               The device is based on an X-cut thin-film lithium nio-
                                                            bate (LiNbO3 ) on insulator wafer with a 600 nm-thick
                                                            LiNbO3 film on SiO2 . A 500 nm-deep rib etch defines
                                                            a 1.4 µm-wide single-mode waveguide with a 100 nm un-
                                                            etched slab (Fig. 3). Lumerical MODE simulations yield
                                                            neff = 1.903 and ng = 2.24 at λ = 1550 nm for the funda-
                                                            mental TE0 mode.
                                                               The ring resonator (R = 20 µm, Lring = 125.7 µm) is
                                                            configured as an add-drop resonator with 100 nm coupling
                                                            gaps (Fig. 4). The FDTD-measured free spectral range
                                                            is FSR = 8.29 nm, corresponding to ng ≈ 2.30, which is
                                                            slightly above the MODE value (ng = 2.24) due to
                                                            bend-induced dispersion.
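The link between the free spectral range and the group index can be checked with the standard ring-resonator relation FSR = λ²/(ng Lring). The snippet below is a small sketch: the function names are illustrative, and evaluating at λ = 1550 nm is an assumption (the text quotes the FSR over the 1530 nm to 1570 nm band without naming a reference wavelength).

```python
import math

def group_index_from_fsr(wavelength_nm, fsr_nm, circumference_um):
    """n_g = lambda^2 / (FSR * L_ring); circumference converted to nm."""
    return wavelength_nm ** 2 / (fsr_nm * circumference_um * 1e3)

def fsr_from_group_index(wavelength_nm, n_g, circumference_um):
    """FSR = lambda^2 / (n_g * L_ring), returned in nm."""
    return wavelength_nm ** 2 / (n_g * circumference_um * 1e3)

L_ring_um = 2.0 * math.pi * 20.0        # R = 20 um gives about 125.7 um

# Group index implied by the FDTD free spectral range of 8.29 nm.
n_g_fdtd = group_index_from_fsr(1550.0, 8.29, L_ring_um)

# FSR that the straight-waveguide MODE value n_g = 2.24 would predict.
fsr_mode_nm = fsr_from_group_index(1550.0, 2.24, L_ring_um)

print(round(L_ring_um, 1), round(n_g_fdtd, 3), round(fsr_mode_nm, 2))
```

With these inputs the implied group index comes out close to 2.30, while ng = 2.24 would predict a somewhat larger FSR of roughly 8.5 nm, consistent with the direction of the difference stated above.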


FIG. 2: Minimax cascade fits at L = 8. (a) Linear scale: shallow cascades (N = 1, 3, 5, 7) over I ∈ [0, 20]. The target e^(I−L) (black) is progressively better matched as N increases. (b) Log scale: depth comparison (N = 5, 10, 20, 30) on I ∈ [0, 8]. Inset zooms into I ∈ [6, 8] showing convergence. (c) Pointwise relative error showing the Chebyshev equioscillation pattern characteristic of minimax optimality.
                                                            FIG. 3: Cross-section of the X-cut TFLN rib waveguide
                                                            on a SiO2 substrate. The 600 nm LiNbO3 film is etched
                                                            500 nm to form a 1.4 µm-wide single-mode rib waveguide.
                                                            Lateral signal (S) and ground (G) electrode positions are
                                                               indicated; electrode design details are discussed in
                                                                                    Sec. IV D.

  Table II summarizes the waveguide and ring parame-
ters.


              B.   3D FDTD Methodology

   The ring resonator response is simulated using Lumeri-
cal 3D FDTD with conformal variant 1 meshing. A broad-
band TE0 mode source (1530 nm to 1570 nm) is injected
into the input bus waveguide, and through- and drop-port
spectra are recorded. A “z-refined 3-fix” meshing strat-
egy ensures convergence in the thin-film geometry [37];
detailed simulation setup is provided in Supplementary
Sec. S4 (Table S6).


                                                              FIG. 5: Simulated through-port (blue) and drop-port
                                                                 (red) transmission spectra of the single add-drop
                                                              micro-ring resonator from 3D FDTD. Top: logarithmic
                                                              scale; bottom: linear scale. Five resonances are visible
                                                                               with FSR ≈ 8.29 nm.


FIG. 4: Top view of the single add-drop micro-ring resonator used in the 3D FDTD simulation. The ring waveguide (R = 20 µm, w = 1.4 µm) is evanescently coupled to input and drop bus waveguides through 100 nm gaps at coupling points CP1 and CP2.


         C.    Single-Ring Add-Drop Results

   Figure 5 shows the through- and drop-port spectra from 3D FDTD. Five resonances are resolved across 1530 nm to 1570 nm with FSR = 8.29 nm (ng ≈ 2.30).
   Lorentzian fitting of the drop-port peaks yields QL = 10,300–15,500, with the best resonance at λ = 1566 nm reaching QL = 15,500 (FWHM = 101 pm, Dmax = 0.360, −4.4 dB). The through-port extinction ratio is 1.6 dB to 2.6 dB, and the five-resonance mean is QL = 12,500 ± 1,800 (Dmax = 0.29–0.36). CMT analysis of the best resonance gives Qi = QL/(1 − √Dmax) = 15,500/0.400 ≈ 38,800, confirming that the 500 nm etch provides sufficient confinement and that the 100 nm gap places the ring in the coupling-limited regime. The cascade analysis below adopts the best-case FDTD calibration (QL = 15,500, Dmax = 0.360); using the five-resonance mean would increase required voltages by ∼24% (see Table IV caption).
   The simulation time of 50 ps exceeds the loaded photon lifetime τL = QL λ0/(2πc) ≈ 12.7 ps by ∼4×, but the intrinsic lifetime τi ≈ 32 ps is comparable, so the extracted Qi may be slightly conservative. An independent eigenmode (FDE) analysis of the same cross-section at R = 20 µm, using a 300 × 300 mesh (∆y ≈ 10 nm, 5× finer than the FDTD vertical grid), yields Qrad+leak = 2.4 × 10⁷; including bulk LiNbO3 absorption (Γ = 0.89) gives a theoretical Qi > 10⁷ [37–42], confirming that the gap between the numerical Qi and published values (> 10⁶) originates from mesh discretization (Supplementary S4.5, Table S8). In the CMT framework, Dmax = [2κ/(2κ + γ)]² increases as Qi rises; at the present coupling gap, increasing Qi to 10⁶ would raise Dmax from 0.36 to ∼0.95 and QL from 15,500 to ∼25,200.
   Figure 6(a) shows a Lorentzian fit to the best drop-port resonance at λ = 1566 nm, validating the cascade model (Eq. 9). Figure 6(b) demonstrates that cascading N copies of this FDTD-extracted Lorentzian reproduces the target exponential e^(I−L) with increasing fidelity as N grows.
   To validate the cascade prediction directly, a five-ring cascade 3D FDTD simulation was performed using Tidy3D [43]; the full simulation notebook is publicly available [43]. The |E|² field at λ = 1549 nm [Fig. 6(d)] confirms resonant excitation across all five rings. Mapping the drop-port spectrum onto the control variable I yields 11 data points within the AEF operating range [Fig. 6(e, f)], with the FDTD transmission closely tracking the N = 5 theoretical curve near I ≈ L = 8.
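The coupled-mode arithmetic quoted for the best resonance can be reproduced in a few lines. This is a sketch only: the helper names are illustrative, and λ0 = 1550 nm is an assumed reference wavelength (the best resonance itself sits at 1566 nm, which shifts the lifetimes by about 1%).

```python
import math

C_LIGHT = 299_792_458.0          # speed of light in vacuum, m/s

def intrinsic_q(q_loaded, d_max):
    """Q_i = Q_L / (1 - sqrt(D_max)), the CMT relation used in the text."""
    return q_loaded / (1.0 - math.sqrt(d_max))

def photon_lifetime_ps(q, wavelength_m):
    """tau = Q * lambda0 / (2 * pi * c), returned in picoseconds."""
    return q * wavelength_m / (2.0 * math.pi * C_LIGHT) * 1e12

q_l, d_max = 15_500.0, 0.360     # best FDTD resonance
lam0 = 1550e-9                   # assumed reference wavelength

q_i = intrinsic_q(q_l, d_max)
tau_l = photon_lifetime_ps(q_l, lam0)
tau_i = photon_lifetime_ps(q_i, lam0)
sim_ratio = 50.0 / tau_l         # 50 ps simulation window vs. loaded lifetime

print(round(q_i), round(tau_l, 1), round(tau_i, 1), round(sim_ratio, 1))
```

The intrinsic lifetime is only about 1.6 times shorter than the 50 ps simulation window, which is why the extracted Qi is described as possibly conservative.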


FIG. 6: FDTD-based AEF validation. (a) Lorentzian fit to the drop-port resonance at λ = 1566 nm from 3D FDTD (Lumerical) (QL = 15,500, Dmax = 0.360, bV = 0.180 V⁻¹). (b) Five-ring cascade drop-port spectrum near λ0 ≈ 1550 nm with Lorentzian⁵ fit (red curve), confirming the expected T⁵ line shape. (c) Five-ring cascade MRR layout with diagonal zigzag bus waveguides. (d) |E|² field profile at λ = 1549 nm from a five-ring cascade 3D FDTD simulation (Tidy3D [43]). (e, f) AEF validation of the five-ring cascade on log (e) and linear (f) scales with 11 spectral FDTD data points.

     D.   X-cut electrode design and EO parameters

   We employ lateral signal–ground (S–G) arc electrodes on the slab surface alongside the ring waveguide (Fig. 7). In the X-cut orientation, the crystal Z-axis is at 45° from the horizontal in the substrate plane, giving a lateral-field projection proportional to cos(θ − 45°) at azimuthal angle θ. The cos(θ − 45°) = 0 boundaries at θ = 135° and 315° naturally separate the coupling regions from the electrode regions. Each ring carries a full semicircular arc electrode on the side opposite to its coupling points, engaging the large r33 = 30.9 pm V⁻¹ Pockels coefficient [37, 38]. The effective EO fill factor follows from integrating |cos(θ − 45°)| over the semicircle:

\[ f_{\mathrm{EO}} = \frac{1}{\pi} \approx 0.318 \tag{29} \]

(see Supplementary Sec. S4 for derivation). The electrode gap is gel = 5 µm (deff ≈ 2.5 µm), and the electro-optic overlap integral is ΓEO = 0.7. Table III lists the electrode parameters.


TABLE III: Electro-optic electrode parameters for the X-cut TFLN micro-ring with lateral S–G arc electrodes.

Parameter                      Symbol    Value          Unit
Crystal orientation            -         X-cut          -
EO coefficient                 r33       30.9           pm V⁻¹
EO fill factor                 fEO       1/π ≈ 0.318    -
EO overlap factor              ΓEO       0.7            -
Electrode gap                  gel       5              µm
Effective electrode distance   deff      2.5            µm


FIG. 7: Illustrative two-ring cascade layout showing the lateral S–G arc electrode placement on X-cut TFLN (the cascade design extends to N rings; this two-ring example clarifies the electrode geometry). The crystal Z-axis is oriented at 45° from the horizontal in the substrate plane. The cos(θ − 45°) = 0 boundaries at θ = 135° and 315° naturally separate the bus-waveguide coupling regions from the electrode semicircles: each ring carries a full semicircular arc electrode on the side opposite to its coupling points. The resulting effective EO fill factor is fEO = 1/π ≈ 0.318.


     E.    FDTD-Calibrated bV and Cascade Optimization

   From the device parameters in Tables II and III and the FDTD-calibrated ng ≈ 2.30, the effective normalized voltage sensitivity is (Supplementary Sec. S4; here dλ/dV = 28.5 pm/V is the straight-section value and fEO accounts for partial electrode coverage of the ring circumference):

\[ b_V = \frac{2Q\,(d\lambda/dV)}{\lambda_0}\, f_{\mathrm{EO}} \approx 0.182\ \mathrm{V^{-1}} \tag{30} \]

at QL = 15,500. This estimate relies on a first-order electrostatic model (deff ≈ 2.5 µm, ΓEO = 0.7); a ±30% variation in bV would shift the cascade depth by one to two rings at constant εmax (Table IV), leaving the qualitative design conclusions unchanged. With the cascade framework of Sec. II (Eqs. 14–18), the N-ring drop-port transmission ỹ(I) = C[1 + (a + bI)²]^(−N) approximates e^(I−L) over I ∈ [0, L], with (a, b) optimized by minimax fitting for each N.
   Table IV presents the optimization results for the standard dynamic range L = 8 (e⁸ ≈ 2981, 34.7 dB).


TABLE IV: Cascade optimization results for L = 8. The bias voltage Vbias = |a|/bV sets the DC offset, and Vctrl = bL/bV is the maximum control voltage at I = L. Voltages computed with bV = 0.182 V⁻¹ (X-cut arc electrode, FDTD-calibrated best resonance QL = 15,500, ng = 2.30). The mean FDTD quality factor across five resonances is QL = 12,500 ± 1,800; using the mean would increase voltages by ∼24%.

 N       a          b        E∞      εmax (%)   Vbias (V)   Vctrl (V)
 5    −2.0789    0.21658   0.1035     10.91       11.4         9.5
10    −1.4588    0.10202   0.0265      2.68        8.0         4.5
12    −1.3731    0.08450   0.0184      1.86        7.5         3.7
20    −1.2136    0.05025   0.0067      0.67        6.7         2.2
25    −1.1685    0.04013   0.0043      0.43        6.4         1.8
30    −1.141     0.03340   0.0030      0.30        6.3         1.5
32    −1.1301    0.03131   0.0026      0.26        6.2         1.4

a The complete cascade optimization results for all N values are listed in Supplementary Table S7.

   The approximation quality across different cascade depths is shown in Fig. 2 (Sec. III). Key thresholds (e.g., ε < 2% at N ≥ 12, ε < 1% at N ≥ 17) and the complete optimization results are listed in Supplementary Sec. S4.
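The sensitivity in Eq. (30) and the voltage columns of Table IV are simple to recompute. The sketch below uses illustrative function names and assumes λ0 = 1550 nm; the table rows are copied from Table IV, and the voltages are evaluated with the quoted bV = 0.182 V⁻¹ rather than with the recomputed value.

```python
import math

def voltage_sensitivity(q_loaded, dlam_dv_pm, lam0_nm, f_eo):
    """Eq. (30): b_V = 2 Q (dlambda/dV) f_EO / lambda0, in 1/V."""
    return 2.0 * q_loaded * (dlam_dv_pm * 1e-3) * f_eo / lam0_nm

def drive_voltages(a, b, L, b_v):
    """V_bias = |a| / b_V and V_ctrl = b L / b_V (Table IV caption)."""
    return abs(a) / b_v, b * L / b_v

# Eq. (30) with dlambda/dV = 28.5 pm/V, f_EO = 1/pi, lambda0 = 1550 nm (assumed).
b_v_est = voltage_sensitivity(15_500.0, 28.5, 1550.0, 1.0 / math.pi)

# (N, a, b) rows copied from Table IV; L = 8, b_V = 0.182 1/V as quoted.
rows = [(5, -2.0789, 0.21658), (12, -1.3731, 0.08450), (30, -1.141, 0.03340)]
voltages = {n: drive_voltages(a, b, 8.0, 0.182) for n, a, b in rows}

print(round(b_v_est, 4))
for n, (v_bias, v_ctrl) in voltages.items():
    print(n, round(v_bias, 2), round(v_ctrl, 2))
```

Because bV scales linearly with Q, rerunning the sketch with the five-resonance mean QL = 12,500 lowers bV by the ratio 12,500/15,500 and raises both voltages by the ∼24% mentioned in the Table IV caption.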

             V.    PHYSICAL FEASIBILITY

   Having established the cascade approximation theory (Sec. II) and the FDTD-calibrated device parameters (Sec. IV), we now assess the physical feasibility of the proposed architecture in terms of voltage requirements, insertion loss, and energy efficiency.


       A.     Electro-optic voltage requirements

   For the primary target of ε < 2% (N = 12), minimax optimization gives a = −1.373, b = 0.0845. With the FDTD-calibrated QL = 15,500 (bV = 0.182 V⁻¹), the required voltages are

\[ V_{\mathrm{bias}} = \frac{|a|}{b_V} = \frac{1.373}{0.182} = 7.5\ \mathrm{V}, \tag{31} \]

\[ V_{\mathrm{ctrl,max}} = \frac{bL}{b_V} = \frac{0.0845 \times 8}{0.182} = 3.7\ \mathrm{V}. \tag{32} \]

Since bV ∝ Q, voltage scales inversely with quality factor:

\[ V_{\mathrm{ctrl}} = \frac{bL}{b_V} = \frac{bL\,\lambda_0}{2Q\,|d\lambda_0/dV|}. \tag{33} \]

CMOS-compatible control voltages (Vctrl < 3.3 V) are achievable at N ≥ 14 with QL = 15,500; at the design point N = 30 (εmax = 0.30%), Vctrl = 1.47 V.


       B.     Power budget: two-regime analysis

   The on-resonance cascade transmission Dmax^N is the dominant contribution to total insertion loss. Table V presents two regimes: the FDTD-characterized regime (Dmax = 0.36) and the fabricated high-Q regime (Dmax = 0.95, achievable with Qi > 10⁶ and gap-optimized coupling).
   In the FDTD-characterized regime, Dmax = 0.36 limits practical cascades to N ≤ 5: at N = 5 the output is 0.61 µW (−22.2 dB) with ε = 10.9%, suited for proof-of-concept validation. In the fabricated high-Q regime (Dmax ≥ 0.95), deep cascades become practical: N = 30 yields Pout = 21.5 µW (−6.7 dB) with εmax ≈ 0.30%. The transition to fabricated high-Q devices is therefore critical for achieving both high accuracy and sufficient output power.


TABLE V: Two-regime power budget for the MRR cascade. Pout assumes per-channel input Pin,ch = 100 µW (from a shared Pin,tot = 1 mW CW laser split across M = 10 parallel channels via a 1×M splitter, or equivalently multiplexed as d WDM channels sharing a single cascade) and accounts only for the ideal on-resonance cascade transmission Dmax^N (upper bound); additional inter-ring coupling loss (ηcoupling ≈ 0.9 per stage, ∼0.46 dB/stage) and off-resonance propagation loss (0.08–0.25 dB/stage) are analyzed separately in Sec. V C.

Regime        Dmax    N    Dmax^N         (dB)     Pout       εmax
I (FDTD)      0.36     3   0.0467         −13.3    4.67 µW    ∼15%
I (FDTD)      0.36     5   0.00605        −22.2    0.61 µW    10.9%
I (FDTD)      0.36     7   7.84 × 10⁻⁴    −31.1    78 nW      ∼5%
II (high-Q)   0.95    10   0.599          −2.2     59.9 µW    2.68%
II (high-Q)   0.95    20   0.358          −4.5     35.8 µW    0.67%
II (high-Q)   0.95    30   0.215          −6.7     21.5 µW    ∼0.30%

Regime I: FDTD-characterized (Qi = 38,800). Regime II: fabricated high-Q (Qi > 10⁶). Pout scales linearly with Pin,ch.


                   C.    Feasibility outlook

   Published TFLN micro-ring resonators achieve Qi ≥ 10⁶–10⁸ using optimized fabrication [39–42]. At Qi = 10⁶ with the present coupling geometry, CMT predicts Dmax ≈ 0.95 and QL ≈ 25,200 (Supplementary Sec. S5, Tables S4–S7), enabling deep cascades (N ≤ 30) with sub-percent error. The literature values provide strong independent evidence that intrinsic quality factors in the projected range are physically achievable in TFLN, albeit with wider waveguides and larger ring radii than the present design. Transferring comparable sidewall quality to our geometry (R = 20 µm, W = 1.4 µm) is an open fabrication challenge; the projections should be read as design targets contingent on achieving it.
   The total insertion loss comprises on-resonance cascade transmission Dmax^N, inter-ring coupling loss (∼0.46 dB/stage for the present diagonal-bus layout), off-resonance propagation loss (0.08–0.25 dB/stage), and fiber-to-chip coupling (1.5–3.0 dB). For the fabricated high-Q regime (N = 30), the total ranges from ∼13 dB (optimized layout) to ∼24 dB (current geometry); see Supplementary Sec. S6 for detailed scenarios.


                   D.    Energy comparison

   For N = 30 X-cut TFLN micro-ring resonators in the fabricated high-Q regime (QL ≈ 25,200 at Qi = 10⁶; Supplementary Sec. S5), the three energy components are EO tuning (EEO = 0.22 pJ), amortized laser (Elaser = 0.07 pJ, shared across M = 10 channels), and photodetector (EPD = 0.50 pJ), yielding Ephotonic = 0.79 pJ (derivations in Supplementary Sec. S7). Including thermal stabilization for N = 30 rings (0.15–0.60 pJ; Supplementary Sec. S7), the total rises to 0.94–1.39 pJ.
   Table S12 compares the photonic cascade with digital implementations. Including thermal stabilization (0.94–1.39 pJ), the advantage over INT8 (2.3 pJ) is 1.7–2.4×, while operating at 10 GHz bandwidth and 58× lower than digital FP32 (46 pJ). At fabricated Q ≥ 30,000, EEO drops to 0.16 pJ and Etotal ≈ 0.73 pJ (excluding thermal; Supplementary Table S11), recovering a 3.2× advantage

over INT8. Since EEO ∝ 1/Q², improving Q beyond ∼30,000 yields diminishing energy returns but continues to relax CMOS driver voltage requirements.


TABLE VI: Energy per exponential operation: single-channel comparison.

Implementation            E/op (pJ)    Bandwidth    Notes
Digital FP32 (Taylor)     ∼46          1 GHz        10 FP MACs
Digital INT8 (Taylor)     ∼2.3         1 GHz        10 INT MACs
Photonic MRR (N = 30)     0.94–1.39    10 GHz       Analog†

† 0.79 pJ excluding thermal; 0.94–1.39 pJ including thermal. Self-consistent with fabricated high-Q regime (QL = 25,200); see Supplementary Sec. S7.


                      VI. DISCUSSION

   Practical design procedure. For a given input sequence x = (x1, . . . , xK), the design proceeds as follows:

    1. Compute m = max_n xn, un = xn − m, and L = −min_n un.

    2. Map to nonnegative control-signal amplitudes: In = un + L ∈ [0, L].

    3. Choose tolerance ε and set εlog = ln(1 + ε).

    4. Select a physically feasible bmax and estimate N using Eq. (28).

    5. Initialize b = min(bmax, 1/N) and a = −1 − bL/2, then refine (a, b) by a two-parameter minimax fit if required.

    6. The optical block yields ỹ(In) ≈ e^(xn − m), and softmax weights follow as

with a distinct FSR order of the same ring set, traverse a single N-ring cascade simultaneously (Fig. 8). Because each channel λj sees its own Lorentzian detuning set by an independent control voltage Vj, the cascade output per channel is ỹj = C ∏_{k=1}^{N} Tdrop,k(λj, Vj) ≈ e^(Vj), and all d exponentials are computed in parallel on the same physical waveguide. Compared with a 1×M power-splitter architecture that replicates the cascade for each channel, the WDM approach reduces the total ring count from N × d to N (a factor-d saving) and eliminates the splitter insertion loss (10 log10 d dB). At the output, a WDM demultiplexer or wavelength-selective photodetector array separates the channels for electrical readout. Figure 8 shows a representative chip layout for N = 5 cascade stages and d = 8 WDM channels, where alternating U-turn bus connections route the drop-port output of each stage into the input bus of the next.
   Why cascade helps. A single Lorentzian in I is too rigid to mimic the log-linear target over a wide interval. Cascading turns the transfer into a product; taking a logarithm gives a sum of smooth terms, and the approximation improves as N increases. The slope constraint N|b| ≳ 1 is an immediate feasibility check.
   Global softmax normalization via WDM feedback. The WDM-parallel architecture (Fig. 8) integrates naturally with a closed-loop normalization scheme to complete the full softmax function. After the N-stage cascade, a WDM demultiplexer (e.g., arrayed-waveguide grating or ring-filter bank) routes each channel λj to a dedicated photodetector, producing photocurrents Iλj ∝ ỹj ≈ C Pin e^(Vj). The d photocurrents are summed electrically:

\[ S = \sum_{j=1}^{d} I_{\lambda_j} \propto C\,P_{\mathrm{in}} \sum_{j=1}^{d} e^{V_j}. \tag{35} \]

A proportional–integral (PI) controller compares S with a fixed reference Sref and adjusts the shared WDM laser
                                                                  power Pin so that S → Sref [44, 45]. Because all d channels
                                                                  share the same probe source, scaling Pin multiplies every
                            ỹ(In )
                      pn = P           .                 (34)     ỹj by the same factor; upon convergence
                             j ỹ(Ij )
                                                                                   Iλj      eVj
                                                                            pj =        = Pd        = softmax(V )j ,        (36)
   Scope and limits. The approximation is for a fi-                                Sref          Vk
                                                                                           k=1 e
nite interval I ∈ [0, L], where L is determined by the
input batch via Eq. (4). In practice, one designs for a           realizing the complete softmax with a single feedback loop
worst-case L expected in operation (or retunes a and              and no per-channel normalization circuitry. Compared
rescales the control signal to adapt L). Noise, insertion         with the replicated-cascade approach (one AEF block per
loss, and control-induced parasitics limit accuracy and           channel), WDM feedback offers two additional benefits:
dynamic range; we treat these effects as platform-specific        (i) the splitter-induced power imbalance that would bias
margins. Detailed non-ideality assumptions, parameter             the Iλj ratios is absent, since all channels traverse the
distributions, and robustness statistics are reported in          same optical path; and (ii) a single laser control point
Supplementary Sec. S8. With K channels in parallel,               replaces d independent probe adjustments. Design de-
one can form softmax by summing channel powers and                tails and stability analysis of the PI loop are provided in
applying a shared reciprocal scale factor, depending on           Supplementary Sec. S9.
the chosen mixed-signal normalization scheme.                        Beyond ring-resonator AEF implementations, the same
   WDM parallelism. A particularly hardware-efficient             cascade principle can be extended to other cavity-based
realization exploits wavelength-division multiplexing             photonic platforms, such as serial 1D photonic-crystal cav-
(WDM): d probe wavelengths λ1 , . . . , λd , each resonant        ities and other cascaded resonant architectures [21, 46].
                                                                                                                                  11

What these platforms share is transfer-function shaping through cascaded resonances; loss, tuning range, fabrication tolerance, and calibration overhead remain platform-dependent.

    The insertion loss budget (Sec. V C) and electro-optic voltage requirements (Sec. V A) suggest that the cascade architecture is feasible under optimized coupling and layout conditions. Using monolithic TFLN microring data from Bahadori et al. [47] ($Q \approx 5432$, $d\lambda_0/dV \approx 9$–$20$ pm/V), the normalized sensitivity is $b_V \simeq 0.063$–$0.14$ V$^{-1}$, within the range required by the cascade design.

TABLE VII: Summary of evidence levels for key claims.

Claim                            Evidence               Sec.
Cascade → exp. approx.           Analytic               II
Depth scaling                    Analytic + num.        II, III
$Q_L$, $D_{\max}$, $b_V$         3-D FDTD               IV
5-ring line shape                3-D FDTD               IV
N ≤ 30 deep cascade              CMT proj.∗             V
Energy < 1 pJ                    Estimate               V
Full softmax (WDM + feedback)    Conceptual + layout    VI
∗ Based on published $Q_i \ge 10^6$ values [39, 42] and CMT coupling model.
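As a rough cross-check of the normalized sensitivity quoted from the Bahadori et al. data, the sketch below assumes the common Lorentzian detuning coordinate $u = 2Q(\lambda - \lambda_0)/\lambda_0$, so that $b_V = (2Q/\lambda_0)\, d\lambda_0/dV$, and a probe wavelength $\lambda_0$ = 1550 nm. Both the formula and the wavelength are assumptions made for illustration, since this section states neither, and the function name is hypothetical.

```python
# Hypothetical cross-check: the b_V formula and lambda0 are assumptions, not taken from the text.
Q = 5432                        # quality factor quoted for the Bahadori et al. rings
LAMBDA0_M = 1550e-9             # assumed probe wavelength in metres
TUNING_PM_PER_V = (9.0, 20.0)   # quoted d(lambda0)/dV range in pm/V


def normalized_sensitivity(q, tuning_pm_per_v, lambda0_m=LAMBDA0_M):
    """Return b_V = 2 Q (d lambda0 / dV) / lambda0 in 1/V (assumed definition)."""
    return 2.0 * q * (tuning_pm_per_v * 1e-12) / lambda0_m


b_lo, b_hi = (normalized_sensitivity(Q, t) for t in TUNING_PM_PER_V)
print(f"b_V range: {b_lo:.3f} to {b_hi:.3f} 1/V")
```

Under these assumptions the two ends of the tuning range give roughly 0.063 and 0.140 V$^{-1}$, matching the quoted interval; a different wavelength or detuning convention would shift the numbers.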
Crystal orientation and electrode design. The X-cut TFLN platform was chosen for several reasons. First, X-cut is the prevailing industry standard for integrated TFLN modulators, with well-established fabrication processes and commercial wafer availability [37, 38]. Second, the TE0 mode, which is strongly confined in the rib waveguide geometry, can engage the large $r_{33}$ coefficient via lateral electric fields aligned with the crystal Z-axis. In contrast, Z-cut geometry with TE polarization can only access the smaller $r_{13}$ coefficient (∼10 pm/V), resulting in significantly lower electro-optic efficiency. The arc electrode design (Sec. IV D) addresses the phase-cancellation problem inherent to X-cut circular rings [47] by orienting the crystal Z-axis at 45° from the horizontal in the substrate plane. This rotation places the $\cos(\theta - 45^\circ) = 0$ boundaries at θ = 135° and 315°, naturally separating the bus-waveguide coupling regions from the electrode regions. Each ring carries a full semicircular arc electrode on the side opposite to its coupling points, yielding an effective fill factor $f_{EO} = 1/\pi \approx 0.318$. While this reduces the round-trip EO efficiency compared to a hypothetical full-circumference design, it preserves the compact footprint of a circular ring resonator. The cascade performance can be further improved beyond the R = 20 µm circular-ring design presented here. Increasing the ring radius reduces bending loss and raises the intrinsic quality factor $Q_i$, which directly increases $b_V$ ($\propto Q$) and lowers the required control voltage. Alternatively, adopting a racetrack geometry with extended straight coupling sections strengthens the bus–ring coupling, pushing the drop-port maximum $D_{\max}$ closer to critical coupling and improving the per-stage transfer efficiency. Either approach, or their combination, would yield higher $b_V$ and $D_{\max}$, enabling lower $N$ or tighter approximation accuracy at reduced operating voltages.

Fabrication considerations. The X-cut TFLN rib waveguide (600 nm total thickness, 500 nm etch, w = 1.4 µm) follows established fabrication processes for commercial TFLN wafers on SiO2 [37, 38]. The lateral signal–ground (SG) electrode configuration is fabricated in a single metal layer, which is standard in TFLN foundry processes. The primary fabrication challenge for the cascade architecture is maintaining uniform coupling gaps (g = 100 nm) across $N$ rings to ensure identical Lorentzian transfer functions. Post-fabrication trimming via UV exposure or localized thermal oxidation can compensate residual detuning variations [30], as quantified in the Monte Carlo robustness analysis (Supplementary Sec. S8). Monte Carlo simulations (Supplementary Sec. S8) show that under nominal non-ideality levels ($\sigma_a = 0.020$, $\sigma_{b,\mathrm{rel}} = 0.020$), a single-point calibration of $C$ per chip keeps the median softmax KL divergence below $2.2 \times 10^{-4}$, with 95th-percentile max probability error under 0.32%. Even under stress conditions ($\sigma_a = 0.032$), 95th-percentile errors remain below 0.42%, demonstrating that the identical-detuning design is robust to realistic fabrication variations provided a per-chip calibration step is performed. Conversely, if coupling gaps are intentionally varied across rings, the per-ring parameters $(a_k, b_k)$ become independent degrees of freedom. A Taylor-expansion analysis shows that $K$ non-identical rings can cancel curvature terms up to order $2K$ in the Taylor series of $g(I) = \sum_k \ln T_k$, one order higher than $K$ identical rings, so that fewer rings suffice for a given error target.


[Figure 8 graphic: "Softmax Full Chip Layout – N = 5 × d = 8 (TFLN)". A WDM input (λ1–λ8, power Pin) feeds N = 5 cascade stages (n = 1 to 5) with per-channel bias voltages Vλ1–Vλ8; a WDM demux (AWG / ring filter) feeds photodetectors PD1–PD8 with outputs p1–p8; the summed photocurrent S is compared with Sref and a PI controller feeds back to adjust Pin.]

FIG. 8: WDM-parallel MRR-AEF system with closed-loop softmax normalization (N = 5 cascade stages, d = 8 WDM channels) on X-cut TFLN. A single WDM source (λ1–λ8) enters the top input bus waveguide; each stage applies a Lorentzian drop-port transfer, and alternating U-turn connections route the drop-port output into the next stage's input bus. Per-channel EO bias voltages ($V_{\lambda_1}$–$V_{\lambda_8}$) independently tune each column of rings. The final drop output passes through a WDM demultiplexer (AWG / ring filter) and is detected by a PD array, producing per-channel photocurrents $I_{\lambda_j} \propto e^{V_j}$. The photocurrents are summed (Σ) and compared with a reference $S_{\mathrm{ref}}$; a PI controller adjusts the shared laser power $P_{\mathrm{in}}$ until $S = S_{\mathrm{ref}}$, at which point each PD output directly yields $p_j = I_{\lambda_j}/S_{\mathrm{ref}} = \mathrm{softmax}(V)_j$ (Eq. 36).
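The closed-loop normalization described in this caption can be illustrated with a small discrete-time simulation. This is only a sketch under stated assumptions: each channel is taken to respond as an ideal exponential, $I_{\lambda_j} = P_{\mathrm{in}} C e^{V_j}$, the controller is a velocity-form PI update, and the gains `kp` and `ki`, the step count, and the function names are hypothetical values chosen so that this particular example converges. It does not reproduce the loop design or stability analysis of Supplementary Sec. S9.

```python
import math


def pi_normalized_softmax(V, s_ref=1.0, kp=0.02, ki=0.05, steps=200, C=1.0):
    """Drive S = sum_j I_j toward s_ref by adjusting the shared power p_in.

    Assumes ideal exponential channels, I_j = p_in * C * exp(V_j).
    """
    gains = [C * math.exp(v) for v in V]
    p_in, prev_err = 0.0, 0.0
    for _ in range(steps):
        S = p_in * sum(gains)          # electrically summed photocurrent
        err = s_ref - S                # comparison with the reference
        # Velocity-form PI update of the shared laser power.
        p_in += kp * (err - prev_err) + ki * err
        prev_err = err
    # After convergence, each photocurrent divided by s_ref is a softmax weight.
    return [p_in * g / s_ref for g in gains]


def softmax(V):
    """Reference softmax, computed with the usual max subtraction."""
    m = max(V)
    e = [math.exp(v - m) for v in V]
    total = sum(e)
    return [x / total for x in e]


V = [0.0, 1.0, 2.0, -1.0]              # hypothetical bias voltages
p_loop = pi_normalized_softmax(V)
p_ref = softmax(V)
print(max(abs(a - b) for a, b in zip(p_loop, p_ref)))
```

Because the plant gain seen by the controller is $C \sum_j e^{V_j}$, fixed gains that are stable for one input vector can become unstable for a much larger sum, so a practical loop would need gain scheduling or a bounded input range.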

                 VII.    CONCLUSION

   We have presented a cascaded micro-ring resonator architecture that approximates the exponential function $e^{x_n - m}$ on a finite interval $[0, L]$ using multiplicative Lorentzian transfer functions. Increasing the cascade depth $N$ systematically reduces the worst-case relative error, and an identical-detuning design initialized by flank and slope matching provides a practical two-parameter design.

   Three-dimensional FDTD simulations of a single X-cut TFLN add-drop ring (R = 20 µm, g = 100 nm) yield $Q_L$ = 10,300–15,500 and $D_{\max} \approx 0.36$, calibrating the cascade transfer model. A five-ring cascade 3D FDTD simulation directly validates the multi-ring framework: all five rings exhibit resonant excitation, and mapping the drop-port spectrum onto the dimensionless control variable reproduces the theoretical $N = 5$ curve with ∼11% integrated relative-area error over the upper operating range ($I \ge 5.8$), providing the first multi-ring confirmation of the cascade exponential approximation. At the present FDTD-characterized quality factor, practical cascades are limited to $N$ = 5–7 ($\varepsilon \lesssim 11\%$). If high-Q TFLN resonators reported in the literature ($Q_i \ge 10^6$, $D_{\max} \ge 0.95$) are realized in the cascade geometry, deeper cascades ($N \approx$ 20–30) would reach sub-percent approximation error with an estimated per-operation energy of 0.79–1.39 pJ, which is 1.7–2.4× lower than an INT8 MAC at the 7 nm node. Monte Carlo analysis shows that the identical-detuning design tolerates realistic fabrication variations ($\sigma_a = 0.020$, $\sigma_{b,\mathrm{rel}} = 0.020$) with a single per-chip calibration, keeping the 95th-percentile softmax probability error below 0.32%.

   The formulation is not restricted to electro-optic tuning: it requires only a controllable detuning coordinate with local linearization, so both Pockels and optical (Kerr/XPM) mechanisms are compatible [37, 38, 47, 48]. We demonstrate a photonic exponential block and present a WDM-parallel chip architecture (Fig. 8) in which $d$ wavelength channels share a single $N$-ring cascade, reducing the total ring count by a factor of $d$ and eliminating power-splitter loss. Combined with a single-loop PI feedback that adjusts the shared WDM laser power, the architecture realizes the complete softmax function (exponentiation, summation, and normalization) without per-channel normalization circuitry. Max-finding and digital interfacing remain open for future experimental validation.


 [1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems 30 (NeurIPS 2017), pages 5998–6008, 2017.
 [2] Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems 35 (NeurIPS 2022), pages 16344–16359, 2022.
 [3] Neil Savage. Light could lower AI's appetite for power. Nature Nanotechnology, 21:6–8, 2026.
 [4] Yichen Shen et al. Deep learning with coherent nanophotonic circuits. Nature Photonics, 11(7):441–446, 2017.
 [5] Johannes Feldmann et al. Parallel convolutional processing using an integrated photonic tensor core. Nature, 589(7840):52–58, 2021.
 [6] Nicholas C. Harris et al. Linear programmable nanophotonic processors. Optica, 5(12):1623–1631, 2018.
 [7] Bowei Dong, Samarth Aggarwal, Wen Zhou, Utku Emre Ali, Nikolaos Farmakidis, June Sang Lee, Yuhan He, Xuan Li, Dim-Lee Kwong, C. D. Wright, Wolfram H. P. Pernice, and H. Bhaskaran. Higher-dimensional processing using a photonic tensor core with continuous-time data. Nature Photonics, 17(12):1080–1088, 2023.
 [8] Sudip Shekhar, Wim Bogaerts, Lukas Chrostowski, John E. Bowers, Michael Hochberg, Richard Soref, and Bhavin J. Shastri. Roadmapping the next generation of silicon photonics. Nature Communications, 15:751, 2024.
 [9] Mario Miscuglio and Volker J. Sorger. Photonic tensor cores for machine learning. Applied Physics Reviews, 7(3):031404, 2020.
[10] Yaowen Hu, Yunxiang Song, Xinrui Zhu, Xiangwen Guo, Shengyuan Lu, Qihang Zhang, Lingyan He, C. A. A. Franken, Keith Powell, Hana Warner, Daniel Assumpcao, Dylan Renaud, Ying Wang, et al. Integrated lithium niobate photonic computing circuit based on efficient and high-speed electro-optic conversion. Nature Communications, 16:8178, 2025.
[11] Priyabrata Dash, Anxiao Jiang, and Dharanidhar Dang. SOFTONIC: A photonic design approach to softmax activation for high-speed fully analog AI acceleration. In Proceedings of the Great Lakes Symposium on VLSI (GLSVLSI '25), pages 118–125, 2025.
[12] Ziyu Zhan, Hao Wang, Qiang Liu, and Xing Fu. Optoelectronic nonlinear softmax operator based on diffractive neural networks. Optics Express, 32(15):26458–26469, 2024.
[13] Ye Tian, Shuiying Xiang, Xingxing Guo, Yahui Zhang, Jiashang Xu, Shangxuan Shi, Haowen Zhao, Yizhi Wang, Xinran Niu, Wenzhuo Liu, and Yue Hao. Photonic transformer chip: interference is all you need. PhotoniX, 6:45, 2025.
[14] Jacob R. Stevens, Rangharajan Venkatesan, Steve Dai, Brucek Khailany, and Anand Raghunathan. Softermax: Hardware/software co-design of an efficient softmax for transformers. In Proceedings of the 58th ACM/IEEE Design Automation Conference (DAC), pages 469–474, 2021.
[15] Nazim Altar Koca, Anh Tuan Do, and Chip-Hong Chang. Hardware-efficient softmax approximation for self-attention networks. In Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS), pages 1–5, 2023.
[16] Wenxun Wang, Shuchang Zhou, Wenyu Sun, Peiqin Sun, and Yongpan Liu. SOLE: Hardware-software co-design of softmax and layernorm for efficient transformer inference. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pages 1–9, 2023.
[17] Yuan Zhang, Yonggang Zhang, Lele Peng, Lianghua Quan, Shubin Zheng, Zhonghai Lu, and Hui Chen. Base-2 softmax function: Suitability for training and efficient hardware implementation. IEEE Transactions on Circuits and Systems I: Regular Papers, 69(9):3605–3618, 2022.
[18] Zhengyu Mei, Hongxi Dong, Yuxuan Wang, and Hongbing Pan. TEA-S: A tiny and efficient architecture for PLAC-based softmax in transformers. IEEE Transactions on Circuits and Systems II: Express Briefs, 70:3594–3598, 2023.
[19] Ke Chen, Yue Gao, Haroon Waris, Weiqiang Liu, and Fabrizio Lombardi. Approximate softmax functions for energy-efficient deep neural networks. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 31:4–16, 2023.
[20] Wim Bogaerts et al. Silicon microring resonators. Laser & Photonics Reviews, 6(1):47–73, 2012.
[21] John E. Heebner, Robert W. Boyd, and Q.-Han Park. Scissor solitons and other propagation effects in microresonator-modified waveguides. Journal of the Optical Society of America B, 19(4):722–731, 2002.
[22] Jiahui Wang, Sean P. Rodrigues, Ercan M. Dede, and Shanhui Fan. Microring-based programmable coherent optical neural networks. Optics Express, 31(12):18871, 2023.
[23] Pengxing Guo, Niujie Zhou, Weigang Hou, and Lei Guo. StarLight: a photonic neural network accelerator featuring a hybrid mode-wavelength division multiplexing and photonic nonvolatile memory. Optics Express, 30:37051, 2022.
[24] Weizhen Yu, Shuang Zheng, Zhenyu Zhao, Bin Wang, and Weifeng Zhang. Reconfigurable low-threshold all-optical nonlinear activation functions based on an add-drop silicon microring resonator. IEEE Photonics Journal, 14(6):1–7, 2022.
[25] Bahaa E. A. Saleh and Malvin C. Teich. Fundamentals of Photonics. Wiley, Hoboken, NJ, 2nd edition, 2007.
[26] Vítor R. Almeida, Carlos A. Barrios, Roberto R. Panepucci, and Michal Lipson. All-optical control of light on a silicon chip. Nature, 431(7012):1081–1084, 2004.
[27] Qianfan Xu, Bradley Schmidt, Sameer Pradhan, and Michal Lipson. Micrometre-scale silicon electro-optic modulator. Nature, 435(7040):325–327, 2005.
[28] Kishore Padmaraju and Keren Bergman. Resolving the thermal challenges for silicon microring resonator devices. Nanophotonics, 3:269–281, 2014.
[29] Erwen Li, Behzad Ashrafi Nia, Bokun Zhou, and Alan X. Wang. Transparent conductive oxide-gated silicon microring with extreme resonance wavelength tunability. Photonics Research, 7(4):473, 2019.
[30] Lahiru Jayatilleka et al. Post-fabrication trimming of silicon photonic ring resonators at wafer-scale. Journal of Lightwave Technology, 39:5083–5088, 2021.
[31] Elliott W. Cheney. Introduction to Approximation Theory. McGraw–Hill, New York, 1966.
[32] Alec Radford et al. Language models are unsupervised multitask learners. Technical report, OpenAI, 2019.
[33] Hugging Face. distilgpt2 model card, 2025. accessed 2026-02-21.
[34] Andrej Karpathy. Tiny shakespeare dataset (char-rnn), 2025. accessed 2026-02-21.
[35] Jane Austen. Pride and prejudice. Project Gutenberg eBook No. 1342, 2025. accessed 2026-02-21.
[36] Hyoseok Park. MRR-AEF: reproducible MRR depth-sweep fitting and supplementary validation scripts. GitHub repository, 2025. commit 585e695, accessed 2026-02-21.
[37] Di Zhu et al. Integrated photonics on thin-film lithium niobate. Advances in Optics and Photonics, 13(2):242–352, 2021.
[38] Yaowen Hu, Di Zhu, Shengyuan Lu, Xinrui Zhu, Yunxiang Song, Dylan Renaud, Daniel Assumpcao, Rebecca Cheng, CJ Xin, Matthew Yeh, Hana Warner, Xiangwen Guo, Amirhassan Shams-Ansari, David Barton, Neil Sinclair, and Marko Lončar. Integrated electro-optics on thin-film lithium niobate. Nature Reviews Physics, 2025.
[39] Mian Zhang, Cheng Wang, Rebecca Cheng, Amirhassan Shams-Ansari, and Marko Lončar. Monolithic ultra-high-Q lithium niobate microring resonator. Optica, 4(12):1536–1537, 2017.
[40] Rongjin Zhuang, Jinze He, Yifan Qi, and Yang Li. High-Q thin-film lithium niobate microrings fabricated with wet etching. Adv. Mater., 35(3):2208113, 2023.
[41] Xinrui Zhu, Yaowen Hu, Shengyuan Lu, Hana K. Warner, Xudong Li, Yunxiang Song, Letícia S. Magalhães, Amirhassan Shams-Ansari, Neil Sinclair, and Marko Lončar. Twenty-nine million intrinsic Q-factor monolithic microresonators on thin-film lithium niobate. Photon. Res., 12(8):A63–A68, 2024.
[42] Renhong Gao, Ni Yao, Jianglin Guan, Li Deng, Jintian Lin, Min Wang, Lingling Qiao, Wei Fang, and Ya Cheng. Lithium niobate microring with ultra-high Q factor above 10^8. Chin. Opt. Lett., 20(1):011902, 2022.
[43] Flexcompute Inc. Tidy3D: electromagnetic simulation software. https://www.flexcompute.com/tidy3d/, 2024. v2.10; cloud GPU FDTD. Accompanying notebook: https://www.flexcompute.com/tidy3d/community/notebooks/CascadedMRRTFLN/.
[44] John K. Doylend, Paul E. Jessop, and Andrew P. Knights. Silicon photonic dynamic optical channel leveler with external feedback loop. Optics Express, 18(13):13805–13812, 2010.
[45] Karl J. Åström and Richard M. Murray. Feedback Systems: An Introduction for Scientists and Engineers. Princeton University Press, Princeton, NJ, 2008.
[46] Amnon Yariv, Yong Xu, Reginald K. Lee, and Axel Scherer. Coupled-resonator optical waveguide: a proposal and analysis. Optics Letters, 24(11):711–713, 1999.
[47] Meisam Bahadori, Yansong Yang, Ahmed E. Hassanien, Lynford L. Goddard, and Songbin Gong. Ultra-efficient and fully isotropic monolithic microring modulators in a thin-film lithium niobate photonics platform. Optics Express, 28(20):29644–29661, 2020.
[48] Abu Naim R. Ahmed, Shouyuan Shi, Mathew Zablocki, Peng Yao, and Dennis W. Prather. Tunable hybrid silicon nitride and thin-film lithium niobate electro-optic microresonator. Optics Letters, 44(3):618, 2019.

                                      SUPPLEMENTARY INFORMATION

Supplementary material accompanying “Photonic Exponential Approximation via Cascaded TFLN Microring Resonators
toward Softmax.”


                           S0. RIGOROUS DERIVATION AND VALIDITY SCOPE

  This section derives the depth-scaling relations and screening bounds used in the main text, and states the assumptions
under which they apply, together with validity scope and failure cases. We separate proved statements (Lemma,
Proposition, Theorem) from heuristic engineering estimates that rely on empirical calibration.


                                                  S0.1 Assumptions

Assumption 1 (Lorentzian single-ring transfer). Each ring $k$ has a normalized add-to-drop transmission of the form $T_k(I) = [1 + (a_k + bI)^2]^{-1}$, where $a_k \in \mathbb{R}$ is the dimensionless static detuning, $b > 0$ is the common normalized sensitivity, and $I \ge 0$ is a nonnegative control-signal amplitude.
Assumption 2 (Multiplicative cascade). The $N$ rings are cascaded in a serial add-drop topology (the drop output of ring $k$ feeds the add input of ring $k+1$), and the probe is sufficiently weak that cross-ring and nonlinear probe-induced effects are negligible; thus the total normalized transmission is $y(I) = \prod_{k=1}^{N} T_k(I)$.
Assumption 3 (Identical-detuning family). All rings share the same static detuning: $a_1 = \cdots = a_N \equiv a$. This reduces the design space to $(a, b)$ and a global scale $C > 0$; the scaled output is $\tilde{y}(I) = C\,[1 + (a + bI)^2]^{-N}$.
Assumption 4 (Linear control-to-resonance mapping). Within the operating range $I \in [0, L]$, the resonance shift is a linear function of the control-signal amplitude (Eq. (10) of main text), i.e., higher-order detuning nonlinearity is negligible.
Assumption 5 (Finite interval and bounded L). The approximation target is $f(I) = e^{I - L}$ on a finite interval $I \in [0, L]$ with $L > 0$ determined by the input batch ($L = \max(x) - \min(x)$). The depth-scaling results are derived for fixed, finite $L$.
Assumption 6 (Flank-centered operating regime). The design uses the “flank-centered” initialization: a+b(L/2) = −1
(midpoint on the Lorentzian half-maximum) and N b = 1 (slope matching). This places the operating point in the
steepest-slope region of the Lorentzian, where the log-transfer is most nearly linear.
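
The flank-centered initialization of Assumption 6, combined with the input mapping of Assumption 5, can be sketched numerically as follows. This is a minimal illustration and not the authors' fitting code: it uses only the closed-form starting point ($b = 1/N$, $a = -1 - bL/2$) without the optional minimax refinement, it ignores any physical cap on $b$, and the function names and the example logits are hypothetical. The global scale $C$ is omitted because it cancels when the outputs are normalized.

```python
import math


def flank_centered_design(x, N):
    """Map inputs x to control amplitudes and flank-centered (a, b)."""
    m = max(x)
    u = [xi - m for xi in x]       # shifted inputs, u_n <= 0
    L = -min(u)                    # interval length, L = max(x) - min(x)
    I = [ui + L for ui in u]       # control amplitudes in [0, L]
    b = 1.0 / N                    # slope matching, N b = 1
    a = -1.0 - b * L / 2.0         # midpoint sits at a + b L / 2 = -1
    return I, a, b, L


def cascade_softmax(x, N):
    """Softmax weights from the identical-detuning cascade [1 + (a + b I)^2]^(-N)."""
    I, a, b, _ = flank_centered_design(x, N)
    y = [(1.0 + (a + b * Ii) ** 2) ** (-N) for Ii in I]
    total = sum(y)
    return [yi / total for yi in y]


def exact_softmax(x):
    """Reference softmax with max subtraction."""
    m = max(x)
    e = [math.exp(xi - m) for xi in x]
    total = sum(e)
    return [ei / total for ei in e]


x = [1.0, 2.0, 0.5, 3.0]           # hypothetical logits, L = 2.5
approx = cascade_softmax(x, N=30)
exact = exact_softmax(x)
worst = max(abs(p - q) for p, q in zip(approx, exact))
print(worst)
```

For this small interval and $N = 30$ the cascade weights stay close to the exact softmax; widening the logit range at fixed $N$ degrades the match, consistent with the finite-interval scope stated in Assumption 5.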


                                                S0.2 Rigorous results

  Throughout, define the log-domain residual

\[ r(I) \equiv \ln \tilde{y}(I) - (I - L) = \ln C - N \ln\!\left[1 + (a + bI)^2\right] - (I - L), \tag{S0.1} \]

and the worst-case log-error $E_\infty = \sup_{I \in [0, L]} |r(I)|$. We set $C$ to the minimax-optimal value $\ln C^\star = -\left[\max_I g(I) + \min_I g(I)\right]/2$, where $g(I) = \ln y(I) - (I - L)$, throughout.
Lemma 1 (Slope bound, rigorous). Under Assumptions 1–3, for every $I \ge 0$,

\[ \left| \frac{d}{dI} \ln y(I) \right| \le N|b|. \]

Proof. From Assumption 3, $\ln y(I) = -N \ln\!\left[1 + (a + bI)^2\right]$. Differentiating:

\[ \frac{d}{dI} \ln y(I) = -N\, \frac{2b(a + bI)}{1 + (a + bI)^2}. \]

Let $u \equiv a + bI \in \mathbb{R}$. The function $h(u) = 2u/(1 + u^2)$ satisfies $|h(u)| \le 1$ for all $u \in \mathbb{R}$ (since $1 + u^2 \ge 2|u|$ by AM–GM). Therefore $|d(\ln y)/dI| = N|b|\,|h(u)| \le N|b|$.
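
The bound of Lemma 1 is easy to check numerically. The sketch below evaluates the analytic slope from the proof on a uniform grid and compares its largest magnitude with $N|b|$; the parameter values are hypothetical and chosen so that the grid passes through $u = -1$, where the bound is attained.

```python
def log_slope(I, a, b, N):
    """d/dI ln y(I) for y(I) = [1 + (a + b I)^2]^(-N)."""
    u = a + b * I
    return -N * 2.0 * b * u / (1.0 + u * u)


def max_abs_slope(a, b, N, I_max, steps=10001):
    """Largest |slope| on a uniform grid over [0, I_max]."""
    return max(
        abs(log_slope(I_max * k / (steps - 1), a, b, N)) for k in range(steps)
    )


# Hypothetical parameters: u runs from -1.5 to -0.5 and crosses -1 at I = 5.
N, b, a, I_max = 10, 0.1, -1.5, 10.0
peak = max_abs_slope(a, b, N, I_max)
print(peak, "<=", N * abs(b))

# Away from the flank (u between 3 and 4) the slope stays well below the bound.
off_flank = max_abs_slope(3.0, b, N, I_max)
print(off_flank)
```

The second case shows why the bound is only a necessary condition: with the same $N|b| = 1$ but a poorly chosen static detuning, the attainable slope on the interval falls short of the unit slope of the target.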

Remark 1 (Necessary condition for approximation). Since the target ln f (I) = I − L has constant slope +1, a
necessary condition for the cascade log-transfer to track this slope at any point is N |b| ≥ 1. This is Eq. (23) of the
main text and is a rigorous (not heuristic) necessary condition.
Proposition 1 (Log-domain Taylor expansion at flank center). Under Assumptions 1–6, define I0 = L/2 and
δ = I − I0 . Then
\[
\ln \tilde{y}(I) = \mathrm{const} + \delta - \frac{\delta^{3}}{6N^{2}} + R_4(\delta), \tag{S0.2}
\]
where |R4(δ)| ≤ M4 δ^4 with M4 = N |b|^4 · sup_{|δ|≤L/2} |ϕ^{(4)}(u(δ))| and ϕ(u) = − ln(1 + u^2). In particular, the quadratic
term vanishes identically at the flank point u0 = a + bI0 = −1.
Proof. Set u(δ) = a + b(I0 + δ) = −1 + bδ (using a + bI0 = −1). With ϕ(u) = − ln(1 + u^2), we have ln y(I) = N ϕ(u(δ))
and u′(δ) = b. Compute derivatives of ϕ at u0 = −1:
\[
\begin{aligned}
\phi'(u) &= -\frac{2u}{1+u^{2}}, &\qquad \phi'(-1) &= 1,\\
\phi''(u) &= \frac{2(u^{2}-1)}{(1+u^{2})^{2}}, &\qquad \phi''(-1) &= 0,\\
\phi'''(u) &= \frac{4u(3-u^{2})}{(1+u^{2})^{3}}, &\qquad \phi'''(-1) &= \frac{4(-1)(3-1)}{(1+1)^{3}} = -1.
\end{aligned}
\]
By the chain rule, writing F(δ) = N ϕ(u(δ)):
\[
\begin{aligned}
F'(0) &= N b\,\phi'(-1) = N b = 1,\\
F''(0) &= N b^{2}\,\phi''(-1) = 0,\\
F'''(0) &= N b^{3}\,\phi'''(-1) = -N b^{3} = -\frac{1}{N^{2}},
\end{aligned}
\]
where we used b = 1/N from Assumption 6. Hence the Taylor expansion with the minimax-optimal C
is
\[
\ln \tilde{y}(I) = \mathrm{const} + \delta + 0\cdot\frac{\delta^{2}}{2} - \frac{1}{N^{2}}\cdot\frac{\delta^{3}}{6} + R_4(\delta).
\]
Subtracting the target δ (the linear part of I − L around I0) gives a leading residual −δ^3/(6N^2), of magnitude
|δ|^3/(6N^2). The remainder is bounded by the standard Taylor remainder estimate.
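The size of the leading residual can be checked numerically. The sketch below assumes the flank-centered setup with b = 1/N, so that u = −1 + δ/N, and compares the magnitude of the deviation from the straight line const + δ with δ^3/(6N^2). The function names are illustrative.

```python
import math

def log_transfer(delta, N):
    """ln y at I = I0 + delta for the flank-centered cascade with b = 1/N."""
    u = -1.0 + delta / N
    return -N * math.log(1.0 + u * u)

def residual(delta, N):
    """Deviation of ln y from the straight line const + delta."""
    return log_transfer(delta, N) - log_transfer(0.0, N) - delta

N = 10
for delta in (0.5, 1.0, 2.0):
    r = residual(delta, N)
    cubic = delta**3 / (6.0 * N**2)
    print(f"delta={delta}: |residual|={abs(r):.3e}, delta^3/(6 N^2)={cubic:.3e}")

# An even-order (quadratic) term would survive in the symmetric combination;
# here only a small fourth-order contribution remains.
print(f"symmetric part at delta=0.5: {residual(0.5, N) + residual(-0.5, N):.3e}")
```

For small |δ| the two magnitudes agree to within a few percent; the gap grows with |δ| as the fourth-order remainder R4 becomes noticeable.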
Theorem 1 (Heuristic depth-scaling law). Under Assumptions 1–6 and ignoring the fourth-order remainder R4 , the
leading-order worst-case log-error on I ∈ [0, L] satisfies
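\[
E_\infty^{(\mathrm{leading})} \sim \frac{1}{6N^{2}}\left(\frac{L}{2}\right)^{3} = \frac{L^{3}}{48\,N^{2}} . \tag{S0.3}
\]
Setting E_∞^(leading) ≤ ε_log = ln(1 + ε) and solving for N gives
\[
N \ge \frac{L^{3/2}}{\sqrt{48\,\varepsilon_{\log}}} . \tag{S0.4}
\]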
Derivation (heuristic). From Proposition 1, the residual with respect to the target is dominated in magnitude by
|δ|^3/(6N^2) for |δ| ≤ L/2. The maximum of |δ|^3 on [−L/2, L/2] is (L/2)^3. Setting the bound equal to ε_log and solving:
\[
\frac{L^{3}}{48\,N^{2}} \le \varepsilon_{\log} \quad\Longrightarrow\quad N \ge \frac{L^{3/2}}{\sqrt{48\,\varepsilon_{\log}}} .
\]
With 1/√48 ≈ 0.144, and accounting for the fact that the minimax-optimal residual is typically smaller than the
one-sided Taylor bound by a factor of ∼ 2 (equi-oscillation), the effective prefactor becomes κ ≈ 0.07, yielding the
main-text engineering estimate Eq. (28): N ≈ ⌈max(1/b_max, κ L^{3/2}/√ε_log)⌉.
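The two depth estimates above can be compared side by side. In the sketch below, the function names, the example tolerance ε = 2.68% and the value b_max = 0.2 are assumptions chosen for illustration; only the formulas come from Eq. (S0.4) and the main-text Eq. (28).

```python
import math

def taylor_depth(L, eps):
    """Smallest integer N with L^3 / (48 N^2) <= ln(1 + eps), following Eq. (S0.4)."""
    eps_log = math.log1p(eps)
    return math.ceil(L**1.5 / math.sqrt(48.0 * eps_log))

def engineering_depth(L, eps, b_max, kappa=0.07):
    """Engineering estimate ceil(max(1/b_max, kappa * L^1.5 / sqrt(eps_log)))."""
    eps_log = math.log1p(eps)
    return math.ceil(max(1.0 / b_max, kappa * L**1.5 / math.sqrt(eps_log)))

# Assumed inputs: L = 8, eps = 0.0268, b_max = 0.2 (hypothetical device limit).
n_taylor = taylor_depth(8.0, 0.0268)
n_eng = engineering_depth(8.0, 0.0268, b_max=0.2)
print(n_taylor, n_eng)
```

With these inputs the one-sided Taylor bound asks for N = 21, while the κ-calibrated estimate gives N = 10, which illustrates the roughly twofold gap attributed to equi-oscillation.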
Remark 2 (Status of Theorem 1). This is a heuristic scaling law, not a rigorous minimax guarantee. The
derivation truncates the Taylor series at third order and approximates the equi-oscillation factor empirically (κ ≈ 0.07).
For a rigorous bound one would need explicit control of R4 over the full interval [0, L], which depends on L, N, and
higher derivatives of the Lorentzian; we do not claim such a bound here. The scaling N ∼ L^{3/2}/√ε_log is supported by
numerical evidence (Table I) but should be treated as an engineering design rule.

                                 S0.3 Derivation of the conservative screening bound

  We now derive the conservative screening bound (Eqs. S0.7–S0.8 below), which is stated inline in Sec. II of the main
text.
Proposition 2 (Conservative log-error bound). Under Assumptions 1–5 (identical detuning, but not restricted to the
flank-centered choice), fix b > 0 and choose the normalization ỹ(L) = 1. Define ϕ(u) = − ln(1 + u2 ) and write
                                                                          
\[
\ln \tilde{y}(I) = N\,\bigl[\phi(a + bI) - \phi(a + bL)\bigr] .
\]

The target in this normalization is (I − L). Denoting the residual r(I) = ln ỹ(I) − (I − L), we have r(L) = 0 and
r(0) = N [ϕ(a) − ϕ(a + bL)] + L.
   For any choice of a such that the operating range {a + bI : I ∈ [0, L]} lies in the region where ϕ is concave (i.e.,
ϕ′′ (u) ≤ 0 throughout), the worst-case log-error satisfies

\[
E_\infty \le \frac{N\,\lVert\phi''\rVert_\infty\, b^{2} L^{2}}{8} + \frac{\bigl|N\,\phi'(a + bL)\cdot b - 1\bigr|}{2}\cdot L, \tag{S0.5}
\]

where ∥ϕ′′ ∥∞ = supu∈[a, a+bL] |ϕ′′ (u)|.
Derivation sketch. Write h(I) ≡ N ϕ(a + bI). The slope is h′(I) = N b ϕ′(a + bI). At I = L, we want the slope to
match the target slope 1; define the slope mismatch ∆s ≡ h′(L) − 1 = N b ϕ′(a + bL) − 1. By the fundamental theorem
of calculus on [I, L]:
\[
r(I) - r(L) = r(I) = h(I) - h(L) - (I - L) = \int_{I}^{L} \bigl[1 - h'(t)\bigr]\,dt .
\]
Write 1 − h′(t) = (1 − h′(L)) + (h′(L) − h′(t)) = −∆s + ∫_t^L h′′(s) ds. Since h′′(s) = N b^2 ϕ′′(a + bs), we bound
|h′′(s)| ≤ N b^2 ∥ϕ′′∥∞. Integrating twice and applying the triangle inequality gives (S0.5).
Corollary 1 (Main-text conservative bound). Under slope matching at I = L (i.e., N b ϕ′(a + bL) = 1, so ∆s = 0),
and using ∥ϕ′′∥∞ ≤ 2 (which holds since |ϕ′′(u)| = |2(u^2 − 1)/(1 + u^2)^2| ≤ 2 for all u ∈ ℝ), the bound simplifies to
\[
E_\infty \le \frac{N b^{2} L^{2}}{4} . \tag{S0.6}
\]
Using b = 1/N (the slope-matching choice from N b = 1) gives E∞ ≤ L^2/(4N). If instead we retain a general b but add
the penalty from imperfect slope matching (e.g., from the constraint b ≤ b_max), a combined conservative bound is
\[
E_\infty \le \frac{L^{2}}{4N} + \frac{1}{2b^{2}N}, \tag{S0.7}
\]
which provides a conservative heuristic bound on the log-error. Setting this ≤ ln(1 + ε) and solving for N yields the
conservative screening depth:
\[
N_{\mathrm{safe}} \ge \frac{L^{2}/4 + 1/(2b^{2})}{\ln(1 + \varepsilon)} . \tag{S0.8}
\]

Remark 3 (Status of the conservative bound). Equation (S0.7) is a conservative heuristic design rule. It is
conservative because: (i) we use a global upper bound ∥ϕ′′ ∥∞ ≤ 2 instead of the actual curvature, (ii) we do not exploit
the minimax-optimal C shift. It is heuristic (not a formal guarantee) because: (i) the derivation assumes the operating
range lies in the concavity region of ϕ, which may not hold for all detuning choices; (ii) the second term 1/(2b2 N )
arises from a simplified penalty model for flank-curvature mismatch that has not been proved to be a rigorous upper
bound in all parameter regimes. Nsafe from Eq. (S0.8) is therefore a screening estimate, suitable for preliminary
design-space exploration but not a certified minimax guarantee.
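To see how pessimistic the screening depth can be, the sketch below evaluates Eqs. (S0.7) and (S0.8) directly. The function names and the example values (L = 8, b = 0.5, ε = 5%) are assumptions for illustration, and rounding N up to an integer is a convenience added here.

```python
import math

def conservative_error_bound(L, b, N):
    """Right-hand side of Eq. (S0.7): L^2/(4N) + 1/(2 b^2 N)."""
    return L**2 / (4.0 * N) + 1.0 / (2.0 * b**2 * N)

def n_safe(L, b, eps):
    """Smallest integer N for which the bound (S0.7) is <= ln(1 + eps), cf. Eq. (S0.8)."""
    return math.ceil((L**2 / 4.0 + 1.0 / (2.0 * b**2)) / math.log1p(eps))

# Assumed example values.
L, b, eps = 8.0, 0.5, 0.05
N = n_safe(L, b, eps)
print(N, conservative_error_bound(L, b, N), math.log1p(eps))
```

For these inputs the screening depth is N = 369, far above the depths suggested by the heuristic scaling law, which is consistent with treating (S0.8) as a backstop rather than a design target.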


                                            S0.4 Validity scope and failure cases

  The derivations above hold under the stated assumptions. We now identify the regimes where each assumption may
break down.

(V1) Lorentzian model (A1). The single-ring Lorentzian form T = [1 + (a + bI)^2]^{−1} is a near-resonance approximation
     valid when the probe frequency is within a few linewidths of the resonance. Far from resonance, higher-order
     dispersion, Fano interference, or multi-mode effects introduce deviations. Failure case: operation with very large
     detuning (|a + bI| ≫ 1 across the full interval), where the Lorentzian tails may not be accurate for high-Q rings.

(V2) Multiplicative cascade (A2). Requires that inter-ring reflections and back-scattering are negligible (forward-
     propagating coupling only, i.e. negligible back-reflection at each inter-ring junction). Failure case: very high ring
     count N with non-negligible back-reflection per stage, which can produce Fabry–Pérot-like ripples in the cascade
     transfer function.

(V3) Identical-detuning family (A3). The Taylor expansion and conservative bound both assume a1 = · · · = aN .
     In practice, fabrication variations introduce per-ring detuning spread σa . The Monte Carlo analysis in Sec. S8
     quantifies robustness, but the analytical bounds (S0.2)–(S0.8) are strictly valid only for identical detuning.
(V4) Linear control-to-resonance mapping (A4). The linearized model ω0(I) = ω0^{(0)} + ηI introduces systematic
     error at large control amplitudes. For carrier-injection (free-carrier plasma effect) or thermal tuning over wide
     ranges, second-order nonlinearity in the control-to-detuning mapping can exceed 1%. Failure case: large L
     requiring a control swing exceeding the linearity range of the tuning mechanism.

(V5) Finite interval (A5). All bounds scale with L (typically as L^2 or L^{3/2}). As L → ∞, N grows without bound
     and insertion loss accumulates (∼ N · ILstage ), eventually degrading the probe SNR below the useful regime.
     There is no finite N that works for all L simultaneously. Practical regime: L ≲ 10–12 (consistent with Leff at
     p90–p95 from Sec. S3) is the primary target; L ≳ 16 requires N ≳ 30 even for moderate tolerance, pushing loss
     budgets.

(V6) Flank-centered initialization (A6). The Taylor-based scaling (Theorem 1) relies on the cancellation
     ϕ′′ (−1) = 0 at the half-maximum point. If the operating point deviates (e.g., due to fabrication offset pushing
     a + bI0 away from −1), a nonzero quadratic residual appears and the effective scaling worsens to E∞ ∼ L^2/N
     rather than L^3/N^2. Mitigation: heater/bias trimming to restore the flank condition.


                                        S0.5 Mapping to main-text equations

For reference, the results derived here correspond to the following main-text equations:

    • Slope bound (Lemma 1): rigorous; corresponds to main-text Eqs. (22)–(23). This is a guaranteed necessary
      condition.

    • Engineering N -estimate (Theorem 1): heuristic scaling with empirical prefactor κ ≈ 0.07; corresponds to
      main-text Eq. (28). This is a heuristic design rule calibrated against numerical fits.

    • Conservative bound (Corollary 1): conservative but not rigorously certified as a minimax upper bound; derived
      as Eq. (S0.7) in this supplement, stated inline in Sec. II. This is a conservative heuristic screening condition.

    • Nsafe (Corollary 1, Eq. S0.8): the safe screening depth derived from the conservative bound; derived as Eq. (S0.8)
      in this supplement, stated inline in Sec. II. This is a conservative backstop estimate for preliminary design.

Summary of guarantee status:
Result                                 Status                                        Main-text Eq.
Slope bound N|b| ≥ 1                   Rigorous (proved)                             (23)
Scaling N ∼ κ L^{3/2}/√ε_log           Heuristic (Taylor truncation + empirical κ)   (28)
Bound E∞ ≤ L^2/(4N) + 1/(2b^2 N)       Conservative heuristic                        (S0.7)
N_safe screening depth                 Conservative backstop                         (S0.8)


            S1. DEPTH-SCALING DERIVATION AND CONSERVATIVE SCREENING BOUND

  This section provides the detailed derivations underlying the depth-scaling relations and conservative screening
bounds summarized in the main text (Sec. II). These results complement the rigorous treatment in Sec. S0.

                                S1.1 Local expansion and exponential-like behavior

   To provide immediate local intuition (without changing the global minimax objective), let δ = I − I0 around the
flank-centered point I0 = L/2 and impose a + bI0 = −1. With the local normalization C = 2^N (so that ỹ(I0) = 1), a
third-order expansion of ỹ(I) = C[1 + (a + bI)^2]^{−N} gives

\[
\tilde{y}(I) \approx 1 + N b\,\delta + \frac{N^{2}}{2}\,b^{2}\delta^{2} + \frac{N(N^{2}-1)}{6}\,b^{3}\delta^{3} + O(\delta^{4}), \tag{S1.1}
\]
so with b ∼ 1/N, the linear and quadratic coefficients align with those of e^δ = 1 + δ + δ^2/2 + δ^3/6 + · · · , explaining
why the initialization is already close before refinement.
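The local expansion (S1.1) can be compared numerically with the exact cascade response and with e^δ. The sketch below assumes N = 10 and b = 1/N; the function names are illustrative.

```python
import math

def y_tilde(delta, N, b):
    """Flank-centered cascade with C = 2^N and a + b*I0 = -1, at I = I0 + delta."""
    u = -1.0 + b * delta
    return (2.0 / (1.0 + u * u)) ** N

def cubic_model(delta, N, b):
    """Third-order polynomial of Eq. (S1.1)."""
    x = b * delta
    return 1.0 + N * x + (N**2 / 2.0) * x**2 + (N * (N**2 - 1) / 6.0) * x**3

N = 10
b = 1.0 / N
for delta in (0.1, 0.5, 1.0):
    exact = y_tilde(delta, N, b)
    print(f"delta={delta}: exact={exact:.6f}, cubic={cubic_model(delta, N, b):.6f}, "
          f"exp={math.exp(delta):.6f}")
```

The gap between the exact response and the cubic model shrinks like δ^4, and both stay close to e^δ near the flank point, in line with the statement that the initialization is already a good starting point.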


                                  S1.2 Log-domain analysis and scaling derivation

  For depth scaling, the logarithmic domain is more transparent. Under the same flank centering (a + bI0 = −1),
expand around I0 = L/2 with δ = I − I0 to obtain

\[
\ln \tilde{y}(I) = \mathrm{const} + N b\,\delta - \frac{N b^{3}}{6}\,\delta^{3} + O(\delta^{4}). \tag{S1.2}
\]
At a + bI0 = −1, the quadratic term cancels identically in the log expansion; imposing slope matching (N b = 1) gives

\[
\ln \tilde{y}(I) = \mathrm{const} + \delta - \frac{\delta^{3}}{6N^{2}} + O(\delta^{4}). \tag{S1.3}
\]
Hence the leading log-domain residual scales as |r(δ)| ∼ |δ|^3/N^2. Over I ∈ [0, L] with |δ| ≤ L/2, this implies E∞ ∼ L^3/N^2.
Requiring E∞ ≤ εlog leads to

\[
N \propto \frac{L^{3/2}}{\sqrt{\varepsilon_{\log}}}, \tag{S1.4}
\]

which explains the scaling used in the main-text engineering estimate (Eq. (28)). This derivation is heuristic (not a
formal guarantee), and the prefactor remains platform- and fitting-criterion dependent.


                                S1.3 Conservative upper bound and screening depth

   For fixed b and the identical-detuning family (a1 = · · · = aN ≡ a), one can write a conservative heuristic condition
for achieving a prescribed log-tolerance. A simple normalization is to enforce ỹ(L) = 1 (matching the target f (L) = 1).
For a particular constructive choice of a that keeps (a + bI) large and negative across [0, L], one can bound the
worst-case log-error as

\[
E_\infty \le \frac{L^{2}}{4N} + \frac{1}{2b^{2}N} . \tag{S1.5}
\]
(This is a conservative rule of thumb; obtaining a formal guarantee would require a separate proof.) As a screening
estimate (not a formal guarantee), one may use
\[
N \ge \frac{L^{2}/4 + 1/(2b^{2})}{\ln(1 + \varepsilon)} . \tag{S1.6}
\]

While this bound is typically pessimistic, it provides a conservative backstop-style estimate for preliminary design
screening. The detailed derivation of these bounds, including the concavity conditions and slope-matching assumptions,
is given in Sec. S0.3.

              S2. WORKED EXAMPLE AND EMPIRICAL LOGIT-RANGE CALIBRATION

  This section provides the detailed worked example for the input-to-output mapping and the empirical logit-range
calibration tables referenced in the main text (Sec. III).


                                 S2.1 Worked input-to-output mapping example

  As a worked example, consider

                                                x = [−3.2, 1.2, 4.8, −0.9].                                    (S2.1)

Compute m = max xn = 4.8. Then u = x − m = [−8.0, −3.6, 0, −5.7] and L = − min un = 8.0. The mapped
control-signal levels are

                                               I = u + L = [0, 4.4, 8.0, 2.3],                                 (S2.2)

and the required normalized exponentials are e^{x_n − m} = e^{u_n} = e^{I_n − L}. Using the fitted model directly,
\[
T_k(I_n) = \frac{1}{1 + (a_k + bI_n)^{2}}, \qquad y(I_n) = \prod_{k=1}^{N} T_k(I_n).
\]

Under the identical-detuning fit (a1 = · · · = aN ≡ a), this becomes
\[
\tilde{y}(I_n) = C\,y(I_n) = C\left[\frac{1}{1 + (a + bI_n)^{2}}\right]^{N} .
\]
For the re-fitted parameters used in this example,

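\[
a = -1.4588, \qquad b = 0.10202, \qquad N = 10, \qquad C = 3.0896 \times 10^{1}, \tag{S2.3}
\]

which gives
\[
\tilde{y}(I_n) = C\left[\frac{1}{1 + (a + bI_n)^{2}}\right]^{N}
\approx \bigl[\,3.44 \times 10^{-4},\; 2.73 \times 10^{-2},\; 9.74 \times 10^{-1},\; 3.26 \times 10^{-3}\,\bigr]. \tag{S2.4}
\]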

  For reference, the corresponding target terms are

                                           In − L = [−8.0, −3.6, 0, −5.7],                                     (S2.5)

and
\[
e^{I_n - L} \approx \bigl[\,3.35 \times 10^{-4},\; 2.73 \times 10^{-2},\; 1.00,\; 3.35 \times 10^{-3}\,\bigr]. \tag{S2.6}
\]
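The worked example can be reproduced in a few lines. The sketch below uses the parameters quoted in Eq. (S2.3) and follows the mapping described in the text; the variable and function names are illustrative, and the fitting step that produced the parameters is not reproduced here.

```python
import math

# Input logits and fitted parameters as quoted in the text (Eqs. S2.1 and S2.3).
x = [-3.2, 1.2, 4.8, -0.9]
a, b, N, C = -1.4588, 0.10202, 10, 30.896

m = max(x)
u = [xn - m for xn in x]        # shifted logits, all <= 0
L = -min(u)                     # logit range
I = [un + L for un in u]        # control levels in [0, L]

def y_tilde(I_n):
    """Identical-detuning cascade response C * [1 + (a + b I)^2]^(-N)."""
    return C * (1.0 / (1.0 + (a + b * I_n) ** 2)) ** N

target = [math.exp(In - L) for In in I]
approx = [y_tilde(In) for In in I]
rel_err = [abs(p / t - 1.0) for p, t in zip(approx, target)]

for xn, In, t, p, e in zip(x, I, target, approx, rel_err):
    print(f"x={xn:5.1f}  I={In:4.1f}  target={t:.4e}  approx={p:.4e}  rel.err={100 * e:.3f}%")
```

The printed relative errors should sit near 2.6 to 2.7% at three of the four points, matching the near-equal error magnitudes expected from a minimax fit; small differences from Table S1 can arise from the rounding of the quoted parameters.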
                                                                        

                            S2.2 Effective-range percentiles and clipping calibration

   We first estimate the logit range observed in data and then choose clipping accordingly. From two autoregressive
Transformers (distilgpt2 and gpt2) and two public corpora (Tiny Shakespeare and Pride and Prejudice) [1–5] at context
length 128, the effective range

                               Leff,α = max(log pkept ) − min(log pkept ),              α = 0.999,             (S2.7)

fell in a relatively narrow band, summarized in Table S2.

 TABLE S1: Example (N = 10): approximating e^{x_n − m} = e^{I_n − L} using ỹ(I) = C[1 + (a + bI)^2]^{−N} with parameters
                        re-fitted on I ∈ [0, 8.0] using the same minimax pipeline.

  x_n       I_n       target e^{x_n − m}       approx ỹ(I_n)         rel. err.
 −3.2       0.0       3.3546 × 10^−4           3.4443 × 10^−4        2.673%
  1.2       4.4       2.7324 × 10^−2           2.7325 × 10^−2        0.004%
  4.8       8.0       1.0000                   0.9739                2.608%
 −0.9       2.3       3.3460 × 10^−3           3.2585 × 10^−3        2.614%


                       TABLE S2: Effective-range percentiles (Leff,0.999 ) at context length 128.

                                           Percentile All runs (4 runs) GPT-2
                                           p50            6.92–7.23    7.09–7.23
                                           p90            8.60–8.75    8.73–8.75
                                           p95            8.97–9.12    9.06–9.12
                                           p99            9.50–9.69    9.58–9.69


  We then test clipping on the same rows with

\[
E_{\mathrm{cum}}(t) = \tfrac{1}{2}\,\bigl\lVert \mathrm{softmax}(u^{(t)}) - \mathrm{softmax}(u) \bigr\rVert_{1},
\qquad u^{(t)} = \max(u, t), \qquad u = s - \max(s), \tag{S2.8}
\]

and require p99{E_cum} ≤ 10^−3 (0.1% budget). This criterion is satisfied at t = −12 (p99 ≈ 4.27 × 10^−4) and violated
at t = −11 (p99 ≈ 1.24 × 10^−3), so we set t∗ = −12 (Nclip = 12).
  In practice, we (i) estimate an effective L from data, (ii) verify that fixed clipping keeps softmax error small, and (iii)
choose representative design points (e.g., L ≈ 8 or L ≈ 12) while treating the clipped tail as negligible. Full protocol
details, clipping-sweep tables/plots, and per-run statistics are provided in Sec. S3.
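The clipping test in Eq. (S2.8) amounts to a total-variation distance between the clipped and unclipped softmax. A minimal pure-Python sketch is given below; the function names and the example logit row are hypothetical and are not taken from the paper's data.

```python
import math

def softmax(v):
    """Numerically stable softmax of a list of floats."""
    m = max(v)
    e = [math.exp(x - m) for x in v]
    z = sum(e)
    return [x / z for x in e]

def clipping_error(s, t):
    """E_cum(t) = 0.5 * || softmax(max(u, t)) - softmax(u) ||_1 with u = s - max(s)."""
    top = max(s)
    u = [x - top for x in s]
    u_clip = [max(x, t) for x in u]          # floor the shifted logits at t <= 0
    p, p_clip = softmax(u), softmax(u_clip)
    return 0.5 * sum(abs(pc - pi) for pc, pi in zip(p_clip, p))

# Hypothetical attention-logit row.
s = [2.0, 0.5, -1.0, -9.0, -15.0]
for t in (-6, -12, -20):
    print(t, clipping_error(s, t))
```

On a real run one would collect `clipping_error` over all attention rows and take the p99 of the resulting values for each threshold, as in the sweep reported in Sec. S3.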


                                        S2.3 Illustrative synthetic range map
  As a design-space reference, we consider synthetic logit-range regimes using L = max(x) − min(x) after QK^⊤/√d_k
scaling. These regimes are illustrative rather than corpus-level percentiles; using the same fitting pipeline, Table S3
summarizes achievable approximation error versus depth.

   TABLE S3: Synthetic softmax logit-range regimes (L = max(x) − min(x)) and fitted worst-case relative error
                      (design-space illustration; not intended as corpus-level statistics).

L regime                       N =5                       N = 10                        N = 20                      N = 30
   L=8                         10.9%                       2.68%                        0.67%                        0.30%
  L = 12                       40.0%                       9.25%                        2.27%                        1.01%
  L = 16                       113%                        23.0%                        5.44%                        2.41%


  Table S3 suggests a simple rule of thumb: the required depth depends mainly on the target L regime. Near L ≈ 8,
moderate depth reaches a few-percent error, whereas L ≳ 12 typically requires deeper cascades to approach < 1%
error.
  We include Table S3 as a synthetic design map rather than an empirical benchmark.
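The entries of Table S3 can be checked against the heuristic scaling E∞ ∼ L^3/N^2 by estimating log-log slopes. The sketch below transcribes the table values; treating relative error as a proxy for log-error is an assumption that is reasonable only for the small-error entries, and the helper name `loglog_slope` is illustrative.

```python
import math

# Worst-case relative errors (percent) transcribed from Table S3.
table_s3 = {
    8:  {5: 10.9,  10: 2.68, 20: 0.67, 30: 0.30},
    12: {5: 40.0,  10: 9.25, 20: 2.27, 30: 1.01},
    16: {5: 113.0, 10: 23.0, 20: 5.44, 30: 2.41},
}

def loglog_slope(x1, y1, x2, y2):
    """Exponent p in y ~ x^p estimated from two points."""
    return math.log(y2 / y1) / math.log(x2 / x1)

# Depth exponent at fixed L, from N = 10 to N = 30 (about -2 under the scaling law).
depth_exponents = {L: loglog_slope(10, row[10], 30, row[30]) for L, row in table_s3.items()}

# Range exponent at fixed N = 30, from L = 8 to L = 16 (about +3 under the scaling law).
range_exponent = loglog_slope(8, table_s3[8][30], 16, table_s3[16][30])

print(depth_exponents, range_exponent)
```

The estimated exponents land close to −2 in N and +3 in L, which is consistent with the depth-scaling relation, although three range values are too few to treat this as more than a plausibility check.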

         S3. EMPIRICAL LOGIT-RANGE EXTRACTION FROM REAL TRANSFORMER RUNS

  We extracted empirical attention-logit ranges from real model runs to complement the synthetic L-regime map in
the main text. We used two open-source autoregressive Transformers (distilgpt2 and gpt2) and two public corpora
(Tiny Shakespeare and Pride and Prejudice), with context length 128 and causal masking. For each valid attention
row, if p = softmax(s) then the raw range is
                                 Lraw = max(s) − min(s) = max(log p) − min(log p),                                (37)
where max/min are taken over valid causal keys only. Because very small tail probabilities can dominate min(log p),
we additionally report an effective range:
                                         Leff,α = max(log pkept ) − min(log pkept ),                              (38)
where keys are sorted by attention weight and retained until cumulative mass reaches α = 0.999.
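A minimal sketch of the two range definitions for a single attention row follows. The function names and the example row are hypothetical, and including the key that crosses the cumulative-mass threshold α is an assumption about the retention rule, which the text does not spell out.

```python
import math

def raw_range(s):
    """L_raw = max(s) - min(s) over the valid keys of one row."""
    return max(s) - min(s)

def effective_range(s, alpha=0.999):
    """L_eff,alpha: log-probability range of the top keys holding cumulative mass alpha."""
    top = max(s)
    log_z = math.log(sum(math.exp(x - top) for x in s))
    logp = sorted((x - top - log_z for x in s), reverse=True)
    kept, mass = [], 0.0
    for lp in logp:                  # largest attention weights first
        kept.append(lp)
        mass += math.exp(lp)
        if mass >= alpha:            # assumed: the crossing key is retained
            break
    return max(kept) - min(kept)

# Hypothetical row: two tail keys carry negligible mass.
s = [2.0, 0.5, -1.0, -9.0, -15.0]
print(raw_range(s), effective_range(s))
```

In this example the raw range is 17 while the effective range is 3, which mirrors the gap between the raw and effective percentiles reported for the real runs.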
  To stay within a 16 GB RAM budget, we processed one model at a time, batch size 1, fixed windowing (stride 128),
and streaming histogram quantiles. Observed process RSS stayed below 1.24 GB in these runs.

  TABLE S4: Empirical global logit-range percentiles from real model–dataset runs (context length 128): raw vs
                                             effective (α = 0.999).

                     Model     Dataset             raw p95 raw p99 Leff p50 Leff p90 Leff p95 Leff p99
                     distilgpt2 tiny shakespeare     22.82      69.00    7.10        8.60     8.97   9.50
                     distilgpt2 pride prejudice      21.76      68.60    6.92        8.60     9.03   9.57
                     gpt2       tiny shakespeare     25.48      43.34    7.23        8.73     9.06   9.58
                     gpt2       pride prejudice      24.13      40.92    7.09        8.75     9.12   9.69

  For quick linkage to the main manuscript: the effective-range summary quoted in the main text corresponds to this
table (all runs: p50 = 6.92–7.23, p90 = 8.60–8.75, p95 = 8.97–9.12, p99 = 9.50–9.69), and the GPT-2 subset is p50
= 7.09–7.23, p90 = 8.73–8.75, p95 = 9.06–9.12, p99 = 9.58–9.69.
Clipping-validity sweep (additional justification). To test whether practical clipping magnitudes can be used
without materially changing softmax outputs, we evaluated a thresholded-logit approximation. For each row, define
u = s − max(s) and, for threshold t ≤ 0,
\[
u^{(t)} = \max(u, t), \qquad p^{(t)} = \mathrm{softmax}(u^{(t)}). \tag{39}
\]
We report the cumulative softmax error
\[
E_{\mathrm{cum}}(t) = \tfrac{1}{2}\,\bigl\lVert p^{(t)} - p \bigr\rVert_{1}, \tag{40}
\]
then sweep t ∈ {−14, −13, . . . , −6} and compute p50/p90/p95/p99 of Ecum over all extracted rows.

       TABLE S5: Global clipping-validity sweep: percentile statistics of Ecum (t) versus clipping threshold t.

                                    t      p50             p90             p95             p99
                                   −14     2.53 × 10^−5    4.55 × 10^−5    4.80 × 10^−5    5.18 × 10^−5
                                   −13     2.69 × 10^−5    4.85 × 10^−5    7.38 × 10^−5    1.48 × 10^−4
                                   −12     2.99 × 10^−5    1.21 × 10^−4    2.13 × 10^−4    4.27 × 10^−4
                                   −11     3.31 × 10^−5    3.95 × 10^−4    6.55 × 10^−4    1.24 × 10^−3
                                   −10     3.72 × 10^−5    1.28 × 10^−3    2.01 × 10^−3    3.58 × 10^−3
                                   −9      4.41 × 10^−5    4.04 × 10^−3    6.11 × 10^−3    1.03 × 10^−2
                                   −8      2.25 × 10^−4    1.26 × 10^−2    1.83 × 10^−2    2.91 × 10^−2
                                   −7      2.76 × 10^−3    3.85 × 10^−2    5.30 × 10^−2    7.89 × 10^−2
                                   −6      1.88 × 10^−2    1.11 × 10^−1    1.43 × 10^−1    1.95 × 10^−1

   Under a conservative budget criterion p99{E_cum} ≤ 10^−3, the least negative admissible threshold in this sweep
is t∗ = −12 (p99 ≈ 4.27 × 10^−4). Equivalently, the operational clipping magnitude is Nclip ≡ −t∗ = 12. Notably,
this is closely aligned with the empirical effective-range scale (Table S4: p99 of Leff,0.999 up to ≈ 9.69), indicating
that clipping-constrained implementation and effective-range statistics operate in the same order-of-magnitude range
budget. This supports using a practical clipping magnitude comparable to the design range scale (L ≈ Nclip ) while
keeping aggregate softmax distortion below 0.1%.


    FIG. S1: Global CDFs of raw Lraw (dashed) and effective Leff,0.999 (solid) for the four model–dataset runs.


FIG. S2: Percentile curves of cumulative softmax error Ecum (t) versus clipping threshold t. The dashed line marks the
                                                0.1% budget (10−3 ).

                    S4. FDTD METHODOLOGY DETAILS AND X-CUT bV DERIVATION

  This section provides the detailed FDTD simulation methodology, the step-by-step X-cut arc electrode voltage
sensitivity derivation, and the full cascade optimization table referenced in the main text (Sec. IV–V).


                                       S4.1 z-refined 3-fix simulation strategy

   For thin-film LiNbO3 structures, special care is required in the vertical (z) direction due to the high index contrast
between LiNbO3 (no ≈ 2.21) and SiO2 (n ≈ 1.44) and the sub-micron film thickness. We apply a “z-refined 3-fix”
strategy:
   1. Ordinary index correction: the material model uses the corrected ordinary refractive index no appropriate
      for the TE mode in X-cut geometry, rather than the extraordinary index ne that governs TM propagation;
   2. z-span expansion: the simulation z-span is extended beyond the minimal waveguide region to include sufficient
      substrate and superstrate so that evanescent field tails are captured without PML truncation artifacts;
   3. Auto-mesh: accuracy level 3; conformal variant 1 meshing is enabled, and no manual mesh override is applied.
      The resulting vertical grid spacing in the slab region is approximately 55 nm, providing ∼2 cells across the 100 nm
      slab.
This refinement strategy is critical for obtaining converged results in TFLN ring resonators, where the high-Q spectral
features are sensitive to numerical dispersion in under-resolved thin films [6]. Table S6 lists the full simulation
parameters.

                              TABLE S6: 3D FDTD simulation parameters (Lumerical).

Parameter                                                                                  Value
Solver                                                                                     Lumerical 3D FDTD
Mesh type                                                                                  Conformal variant 1
Mesh accuracy                                                                              3 (auto-mesh)
z-mesh override                                                                            None (auto-mesh)
Simulation time                                                                            50 ps
Auto shutoff                                                                               $1 \times 10^{-6}$
Wavelength range                                                                           1530 nm to 1570 nm
Grid size                                                                                  532 × 816 × 44
Source                                                                                     Broadband mode source (TE0 )


                                S4.2 X-cut arc electrode bV step-by-step derivation

   For the X-cut circular ring with lateral S–G arc electrodes (Table II), the crystal Z-axis (c-axis) is oriented at 45°
from the horizontal axis in the substrate plane. At azimuthal angle θ around the ring, the projection of the lateral
electric field onto the Z-axis is proportional to cos(θ − 45°). The cos(θ − 45°) = 0 boundaries fall at θ = 135° and
θ = 315°, naturally separating the bus-waveguide coupling regions from the electrode regions. Each ring carries a full
semicircular arc electrode on the side opposite to its coupling points. By the substitution φ = θ − 45°, the effective
EO fill factor is
\[
f_{\mathrm{EO}} = \frac{1}{2\pi}\int_{\text{semicircle}} \lvert\cos(\theta - 45^\circ)\rvert \, d\theta
= \frac{1}{2\pi}\int_{-\pi/2}^{+\pi/2} \cos\varphi \, d\varphi
= \frac{1}{2\pi}\Bigl[\sin\varphi\Bigr]_{-\pi/2}^{+\pi/2}
= \frac{1}{\pi} \approx 0.318. \tag{S4.1}
\]
The 45° rotation ensures that the electrode semicircle does not overlap with the coupling points, while the fill factor
integral is identical to the standard cos θ case by the change of variable.
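The fill-factor integral can be checked numerically. The following is a minimal sketch, assuming a midpoint-rule quadrature; the function name `eo_fill_factor` and the sample count are illustrative choices, not part of the original analysis.

```python
import math

def eo_fill_factor(rotation_deg=45.0, samples=200_000):
    """Average |cos(theta - rotation)| over the electrode semicircle,
    normalised by the full circumference 2*pi (midpoint rule)."""
    rot = math.radians(rotation_deg)
    # Electrode semicircle: phi = theta - rot runs from -pi/2 to +pi/2.
    lo, hi = rot - math.pi / 2, rot + math.pi / 2
    h = (hi - lo) / samples
    total = sum(abs(math.cos(lo + (i + 0.5) * h - rot)) for i in range(samples))
    return total * h / (2 * math.pi)

f_eo = eo_fill_factor()
print(f"f_EO = {f_eo:.4f}, 1/pi = {1 / math.pi:.4f}")
```

Calling `eo_fill_factor(0.0)` returns the same value, which illustrates that the rotation only relocates the electrode arc and does not change the fill factor.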
   The lateral S–G electrodes have gap gel = 5 µm, giving an effective electrode–waveguide distance deff ≈ gel /2 = 2.5 µm.
The lateral field geometry yields an EO overlap factor ΓEO = 0.7, compared to 0.5 for a vertical electrode configuration.
   The refractive index change per volt in the electrode-covered section is
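\[
\frac{\Delta n_{\mathrm{eff}}}{V} = -\frac{1}{2}\, n_e^{3}\, r_{33}\, \frac{\Gamma_{\mathrm{EO}}}{d_{\mathrm{eff}}}
= -\frac{1}{2} \times 2.138^{3} \times 30.9 \times 10^{-12} \times \frac{0.7}{2.5 \times 10^{-6}}
= -4.226 \times 10^{-5}\ \mathrm{V^{-1}}. \tag{S4.2}
\]
The corresponding resonance wavelength shift is
\[
\left.\frac{d\lambda_0}{dV}\right|_{\mathrm{straight}} = \frac{1550\ \mathrm{nm} \times 4.226 \times 10^{-5}\ \mathrm{V^{-1}}}{2.30} = 28.48\ \mathrm{pm\,V^{-1}}, \tag{S4.3}
\]
giving an intrinsic (straight-section) voltage sensitivity of
\[
b_V^{\mathrm{straight}} = \frac{2 Q_L}{\lambda_0} \left.\frac{d\lambda_0}{dV}\right|_{\mathrm{straight}}
= \frac{2 \times 15{,}500}{1550\ \mathrm{nm}} \times 0.02848\ \mathrm{nm\,V^{-1}} = 0.570\ \mathrm{V^{-1}}. \tag{S4.4}
\]
However, only the arc-electrode portion of the ring circumference contributes to the round-trip phase shift. The
effective voltage sensitivity is therefore
\[
b_V = b_V^{\mathrm{straight}} \times f_{\mathrm{EO}} = 0.570 \times \frac{1}{\pi} \approx 0.182\ \mathrm{V^{-1}}. \tag{S4.5}
\]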
A 1 V applied voltage shifts the normalized detuning by ∆a ≈ 0.182. Despite the fill-factor penalty (fEO = 1/π ≈ 0.318),
the X-cut arc design benefits from a smaller effective electrode distance (2.5 µm vs. 4 µm for vertical configurations)
and a higher overlap factor (0.7 vs. 0.5), which partially compensate the reduced active length.
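The chain from Eq. (S4.2) to Eq. (S4.5) can be reproduced in a few lines. This is a sketch under stated assumptions: n_e = 2.138 is read off the cubed term in Eq. (S4.2), and all variable names are illustrative. Without intermediate rounding the result agrees with the quoted values to about three significant figures.

```python
import math

n_e = 2.138          # extraordinary index (assumed from the n_e^3 term in Eq. S4.2)
r33 = 30.9e-12       # electro-optic coefficient, m/V
gamma_eo = 0.7       # EO overlap factor for the lateral field geometry
d_eff = 2.5e-6       # effective electrode-waveguide distance, m
lambda0_nm = 1550.0  # resonance wavelength, nm
n_g = 2.30           # group index (FDTD FSR-derived)
q_loaded = 15_500    # loaded quality factor
f_eo = 1 / math.pi   # EO fill factor from Eq. S4.1

# Eq. S4.2: effective-index change per volt in the electrode-covered section.
dn_per_volt = -0.5 * n_e**3 * r33 * gamma_eo / d_eff
# Eq. S4.3: resonance shift per volt, converted from nm/V to pm/V.
dlambda_pm_per_volt = lambda0_nm * abs(dn_per_volt) / n_g * 1e3
# Eq. S4.4: straight-section sensitivity in normalized detuning per volt.
b_straight = 2 * q_loaded / lambda0_nm * (dlambda_pm_per_volt * 1e-3)
# Eq. S4.5: scale by the fill factor of the arc electrode.
b_v = b_straight * f_eo

print(f"dn/V = {dn_per_volt:.3e} 1/V, dlambda/dV = {dlambda_pm_per_volt:.2f} pm/V")
print(f"b_straight = {b_straight:.3f} 1/V, b_V = {b_v:.3f} 1/V")
```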


                                           S4.3 Full cascade optimization table

  Table S7 presents the complete optimization results for the standard dynamic range L = 8 (corresponding to
$e^{8} \approx 2981$, i.e., 34.7 dB), covering all cascade depths from N = 5 to N = 30.

     TABLE S7: Cascade optimization results for L = 8. The bias voltage $V_{\mathrm{bias}} = |a|/b_V$ sets the DC offset, and
$V_{\mathrm{ctrl}} = bL/b_V$ is the maximum control voltage at I = L. Voltages computed with $b_V = 0.182\ \mathrm{V^{-1}}$ (FDTD-calibrated
                                          best resonance $Q_L$ = 15,500).

N                a                    b                 E∞              εmax (%)          Vbias (V)            Vctrl (V)
 5            −2.0789              0.21658             0.1035             10.91             11.4                  9.5
 8            −1.5959              0.12896             0.0412              4.20              8.8                  5.7
10            −1.4588              0.10202             0.0265              2.68              8.0                  4.5
12            −1.3731              0.08450             0.0184              1.86              7.5                  3.7
15            −1.2914              0.06726             0.0118              1.19              7.1                  3.0
17            −1.2543              0.05923             0.0092              0.92              6.9                  2.6
20            −1.2136              0.05025             0.0067              0.67              6.7                  2.2
25            −1.1685              0.04013             0.0043              0.43              6.4                  1.8
30            −1.1392              0.03341             0.0030              0.30              6.3                  1.5


  Key thresholds for the minimum number of rings at various error targets are:
     • ε < 10%: N ≥ 6,
     • ε < 5%: N ≥ 8,
     • ε < 2%: N ≥ 12,
     • ε < 1%: N ≥ 17,
     • ε < 0.5%: N ≥ 24.
These thresholds are independent of the quality factor Q, since the minimax approximation operates entirely in
normalized detuning space. The Q factor affects only the physical voltage required to achieve the necessary detuning
range, through bV .
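The voltage columns of Table S7 follow directly from (a, b) and b_V, and the error columns can be cross-checked in normalized detuning space. The sketch below makes one assumption that is not stated in this section: that E∞ is the half-range of the log-domain residual N·φ(a + bI) − I on [0, L], with φ(u) = −ln(1 + u²), and that εmax = exp(E∞) − 1. Function names are illustrative.

```python
import math

B_V = 0.182      # 1/V, FDTD-calibrated sensitivity quoted in Table S7
L_RANGE = 8.0    # dynamic range L

# (a, b) pairs copied from Table S7.
TABLE_S7 = {
    5: (-2.0789, 0.21658),
    10: (-1.4588, 0.10202),
    20: (-1.2136, 0.05025),
    30: (-1.1392, 0.03341),
}

def drive_voltages(a, b, b_v=B_V, l_range=L_RANGE):
    """Return (V_bias, V_ctrl) in volts from the normalized fit parameters."""
    return abs(a) / b_v, b * l_range / b_v

def log_error_half_range(n_rings, a, b, l_range=L_RANGE, samples=2001):
    """Half-range of N*phi(a + b*I) - I on [0, L], phi(u) = -ln(1 + u^2).

    Assumption: E_inf is the log-domain residual with a free constant offset;
    the definition is inferred, not quoted from the source.
    """
    residuals = []
    for k in range(samples):
        i_in = l_range * k / (samples - 1)
        residuals.append(-n_rings * math.log(1 + (a + b * i_in) ** 2) - i_in)
    return (max(residuals) - min(residuals)) / 2

for n, (a, b) in TABLE_S7.items():
    v_bias, v_ctrl = drive_voltages(a, b)
    e_inf = log_error_half_range(n, a, b)
    print(f"N={n:2d}  V_bias={v_bias:5.1f} V  V_ctrl={v_ctrl:4.1f} V  "
          f"E_inf={e_inf:.4f}  eps_max={100 * math.expm1(e_inf):.2f}%")
```

Because the residual is evaluated purely in the normalized variables a + bI, the quality factor never enters the error calculation; it appears only through `b_v` in `drive_voltages`.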


                                              S4.4 Lorentzian fit validation

  Figure S3 shows the Lorentzian fit to the FDTD drop-port resonance near λ = 1566 nm. The analytical Lorentzian
$T_{\mathrm{drop}}(\Delta\lambda) = A/[1 + (2Q_L\,\Delta\lambda/\lambda_0)^{2}]$ with $Q_L$ = 15,500 closely tracks the FDTD data, validating the single-ring transfer
function model used in the cascade analysis.


  FIG. S3: Lorentzian fit to the FDTD drop-port resonance. Markers: FDTD data; solid line: Lorentzian fit. The
                           extracted quality factor is QL = 15,500 with FWHM = 101 pm.
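The Lorentzian line shape and its linewidth can be written down directly. This is a minimal sketch, assuming the resonance sits exactly at λ0 = 1566 nm and a unit peak amplitude A = 1; neither value is fitted here, and the function name is illustrative.

```python
LAMBDA0_NM = 1566.0   # resonance wavelength (assumed exactly 1566 nm)
Q_LOADED = 15_500     # loaded quality factor from the fit

def lorentzian_drop(delta_lambda_nm, lambda0_nm=LAMBDA0_NM, q_loaded=Q_LOADED, amplitude=1.0):
    """Drop-port transmission A / [1 + (2 Q_L dlambda / lambda0)^2]."""
    x = 2 * q_loaded * delta_lambda_nm / lambda0_nm
    return amplitude / (1 + x**2)

# Full width at half maximum: the response falls to A/2 at dlambda = lambda0 / (2 Q_L).
fwhm_nm = LAMBDA0_NM / Q_LOADED
print(f"FWHM = {fwhm_nm * 1e3:.0f} pm")
print(f"T at half-width = {lorentzian_drop(fwhm_nm / 2):.3f}")
```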


                                 S4.5 Eigenmode (FDE) analysis of theoretical Qi

   To quantify how far below the physical limit the FDTD-extracted Qi = 38,800 lies, we perform a two-dimensional
finite-difference eigenmode (FDE) analysis of the bent rib waveguide cross-section using Lumerical MODE Solutions.
   a. Setup. The FDE solver models the cross-section of the rib waveguide at the design bend radius R = 20 µm
and wavelength λ = 1550 nm, with perfectly matched layer (PML) boundaries on all four edges. The geometry is
identical to the 3D FDTD model: 600 nm total LiNbO3 (no = 2.211, lossless dielectric), 100 nm slab, 500 nm rib etch,
waveguide width W = 1.4 µm, on a 2 µm SiO2 substrate (n = 1.444) with air cladding. The mesh is set to 300 × 300
cells over a 6 µm × 3 µm cross-section, yielding effective grid spacings ∆x ≈ 20 nm and ∆y ≈ 10 nm—substantially
finer than the 3D FDTD auto-mesh (55 nm vertical).
   b. Complex effective index. The FDE solver returns a complex effective index neff = nr + i ni for each guided
mode, where the imaginary part ni encodes propagation loss. For the fundamental TE mode at R = 20 µm:
\[
n_{\mathrm{eff}} = 1.9653 + i\,(4.73 \times 10^{-8}), \tag{41}
\]
\[
\alpha_{\mathrm{rad+leak}} = \frac{4\pi n_i}{\lambda} = 0.383\ \mathrm{m^{-1}} \quad \bigl(0.017\ \mathrm{dB\,cm^{-1}}\bigr). \tag{42}
\]
Since the material is set as lossless, this α captures only bending radiation loss and substrate leakage through the
100 nm slab. The corresponding quality factor is
\[
Q_{\mathrm{rad+leak}} = \frac{2\pi n_g}{\alpha_{\mathrm{rad+leak}}\,\lambda} = 2.43 \times 10^{7}, \tag{43}
\]
where ng = 2.354 is the group index from the FDE solver (consistent with the FDTD FSR-derived ng = 2.30; the
small difference arises from the straight-section approximation inherent to 2D FDE).
  c. Decomposition into bending and leakage. A separate FDE run with R = 1 mm (effectively straight) yields
$Q_{\mathrm{leak}} = 2.93 \times 10^{7}$, isolating the substrate leakage contribution. The pure bending radiation quality factor follows from
\[
\frac{1}{Q_{\mathrm{bend}}} = \frac{1}{Q_{\mathrm{rad+leak}}} - \frac{1}{Q_{\mathrm{leak}}}, \qquad Q_{\mathrm{bend}} = 1.43 \times 10^{8}. \tag{44}
\]
This confirms that bending radiation loss at R = 20 µm is negligible; substrate leakage through the thin slab is the
dominant geometric loss channel.
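The conversion from the imaginary index to loss and quality factor is sketched below; function names are illustrative. Evaluating Eq. (43) with the printed inputs suggests that the quoted 2.43 × 10^7 corresponds to n_g = 2.30, while n_g = 2.354 gives about 2.49 × 10^7; both are computed so the difference is visible.

```python
import math

WAVELENGTH_M = 1.55e-6

def alpha_from_ni(n_i, wavelength_m=WAVELENGTH_M):
    """Power attenuation coefficient (1/m) from the imaginary effective index."""
    return 4 * math.pi * n_i / wavelength_m

def q_from_alpha(alpha_per_m, n_g, wavelength_m=WAVELENGTH_M):
    """Quality factor 2*pi*n_g / (alpha * lambda)."""
    return 2 * math.pi * n_g / (alpha_per_m * wavelength_m)

alpha = alpha_from_ni(4.73e-8)                      # Eq. (42), 1/m
alpha_db_per_cm = alpha * 10 / math.log(10) / 100   # 1/m -> dB/m -> dB/cm
q_with_fde_ng = q_from_alpha(alpha, 2.354)          # FDE group index
q_with_fdtd_ng = q_from_alpha(alpha, 2.30)          # FDTD FSR-derived group index

# Eq. (44): remove the leakage contribution to isolate bending radiation.
q_rad_leak, q_leak = 2.43e7, 2.93e7
q_bend = 1 / (1 / q_rad_leak - 1 / q_leak)

print(f"alpha = {alpha:.3f} 1/m = {alpha_db_per_cm:.3f} dB/cm")
print(f"Q(n_g=2.354) = {q_with_fde_ng:.3e}, Q(n_g=2.30) = {q_with_fdtd_ng:.3e}")
print(f"Q_bend = {q_bend:.3e}")
```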
   d. Material absorption. The FDE mode profile yields a confinement factor Γ = 0.887 (fraction of the optical
intensity within the LiNbO3 core and slab regions). The material-absorption-limited quality factor is
\[
Q_{\mathrm{abs}} = \frac{2\pi n_g}{\Gamma\,\alpha_{\mathrm{mat}}\,\lambda}, \tag{45}
\]
where αmat is the bulk material power-attenuation coefficient of LiNbO3 at 1550 nm. Table S8 evaluates Eq. (45) for
representative TFLN absorption values from the literature [6, 7].

TABLE S8: Theoretical intrinsic quality factor $Q_i$ of the R = 20 µm TFLN ring, decomposed into radiation ($Q_{\mathrm{bend}}$),
 substrate leakage ($Q_{\mathrm{leak}}$), and material absorption ($Q_{\mathrm{abs}}$). Sidewall scattering (fabrication-dependent) is excluded.
                       The total is $1/Q_i = 1/Q_{\mathrm{rad+leak}} + 1/Q_{\mathrm{abs}}$ with $Q_{\mathrm{rad+leak}} = 2.43 \times 10^{7}$.

Material condition                         αmat (dB/cm)                    Qabs                          Qi (total)
Bulk LiNbO3 (pristine)                         0.002                       $2.3 \times 10^{8}$           $2.2 \times 10^{7}$
High-quality TFLN                               0.01                       $4.7 \times 10^{7}$           $1.6 \times 10^{7}$
Good TFLN                                       0.03                       $1.6 \times 10^{7}$           $9.5 \times 10^{6}$
Typical TFLN                                     0.1                       $4.7 \times 10^{6}$           $3.9 \times 10^{6}$


   For high-quality TFLN ($\alpha_{\mathrm{mat}} \lesssim 0.01\ \mathrm{dB\,cm^{-1}}$), the theoretical $Q_i$ exceeds $10^{7}$, more than 400× higher than the
FDTD-extracted value of 38,800. This confirms that the FDTD result is dominated by numerical mesh artifacts
(approximately two cells across the 100 nm slab), not by physical loss mechanisms. Bending radiation loss at R = 20 µm
is negligible ($Q_{\mathrm{bend}} = 1.43 \times 10^{8}$); the dominant geometric loss channel in the ideal structure is substrate leakage
through the thin slab ($Q_{\mathrm{leak}} = 2.93 \times 10^{7}$).
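The entries of Table S8 can be regenerated from Eq. (45). The sketch below assumes the FDE group index n_g = 2.354 for the absorption term, which reproduces the tabulated Q_abs to two significant figures; function names are illustrative.

```python
import math

N_G = 2.354            # FDE group index (assumed for Eq. 45)
GAMMA = 0.887          # confinement factor in the LiNbO3 core and slab
WAVELENGTH_M = 1.55e-6
Q_RAD_LEAK = 2.43e7    # from Eq. (43)
Q_FDTD = 38_800        # FDTD-extracted intrinsic Q

def q_abs(alpha_db_per_cm):
    """Material-absorption-limited Q for a bulk loss given in dB/cm."""
    alpha_per_m = alpha_db_per_cm * 100 * math.log(10) / 10   # dB/cm -> 1/m
    return 2 * math.pi * N_G / (GAMMA * alpha_per_m * WAVELENGTH_M)

def q_intrinsic(alpha_db_per_cm, q_rad_leak=Q_RAD_LEAK):
    """Combine radiation/leakage and absorption as parallel loss channels."""
    return 1 / (1 / q_rad_leak + 1 / q_abs(alpha_db_per_cm))

for label, alpha in [("pristine bulk", 0.002), ("high-quality", 0.01),
                     ("good", 0.03), ("typical", 0.1)]:
    print(f"{label:14s} Q_abs = {q_abs(alpha):.2e}  Q_i = {q_intrinsic(alpha):.2e}")

print(f"ratio to FDTD value at 0.01 dB/cm: {q_intrinsic(0.01) / Q_FDTD:.0f}x")
```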

                               S5. FABRICATED HIGH-Q DESIGN PROJECTIONS

   Reproducing $Q_i > 10^{5}$ in three-dimensional FDTD is computationally impractical: at accuracy level 3 the 100 nm
slab requires ∆z ≲ 20 nm to suppress staircase-induced scattering, inflating wall times beyond 30 days per run. The
numerically extracted $Q_i$ = 38,800 therefore represents a simulation floor, not a physical one. A two-dimensional
MODE-solver bend analysis confirms $Q_{\mathrm{bend}} > 4.5 \times 10^{7}$ for R = 20 µm, placing bending radiation loss far below any
realistic intrinsic loss.
   Table S9 surveys recent high-Q TFLN microring demonstrations. These studies show that $Q_i \geq 9 \times 10^{6}$ has been
demonstrated in X-cut TFLN using multiple fabrication routes, including Ar$^{+}$ milling, wet etching, and
ICP-RIE/CMP-based processes.

  TABLE S9: Demonstrated intrinsic quality factors in TFLN micro-ring resonators. “EO compatible” indicates
                whether the fabrication process preserves electrode patterning capability.

Ref.                              Qi                           R (µm)                      w (µm)                           Etch
Zhang [8]                         $10^{7}$                       80                          ∼2                           Ar$^{+}$ mill
Gao [9]                           $10^{8}$                      100                          ∼3                            CMP∗
Zhuang [10]                       $9 \times 10^{6}$             100                          ∼2                           Wet etch
Song [11]                         $2.9 \times 10^{7}$           200                          4.5                       ICP-RIE+CMP
   All processes except ∗ are EO-electrode compatible. ∗ CMP-only (no dry etch); subsequent electrode patterning may degrade $Q_i$.

  To project cascade performance into the fabricated regime, we fix $Q_{\mathrm{ext}}$ = 25,800 (the FDTD-extracted coupling
quality factor at gap = 100 nm) and compute $D_{\max} = [Q_i/(Q_i + Q_{\mathrm{ext}})]^{2}$ for three representative intrinsic quality
factors (Table S10).

  TABLE S10: Projected cascade transmission for fabricated $Q_i$ values at fixed $Q_{\mathrm{ext}}$ = 25,800. $D_{\max}^{N}$ is the ideal
on-resonance cascade transmission in dB. The minimax approximation error εmax depends only on N and L (not on
                                 $Q_i$); at N = 20, L = 8: εmax = 0.67% (Table I).

                                                                       $D_{\max}^{N}$ (dB)
Projection                     Qi                        Dmax                  N = 10                 N = 20                N = 30
FDTD baseline                  $3.88 \times 10^{4}$      0.36                  −44.3                  −88.5                 −132.8
Conservative                   $5 \times 10^{5}$         0.90                  −4.4                   −8.8                  −13.2
Moderate                       $10^{6}$                  0.95                  −2.2                   −4.5                   −6.7
Optimistic                     $5 \times 10^{6}$         0.99                  −0.44                  −0.88                  −1.3


  Even in the conservative scenario ($Q_i = 5 \times 10^{5}$), $D_{\max}$ = 0.90 and the N = 10 cascade transmission is −4.4 dB, roughly
40 dB better than the FDTD baseline (−44.3 dB). The moderate projection ($Q_i = 10^{6}$) matches the “fabricated
high-Q” column in Table V. Because $Q_{\mathrm{bend}} \approx 4.5 \times 10^{7} \gg Q_i$ for all projections, bending loss is never the bottleneck;
the dominant loss mechanism is sidewall scattering, which is determined entirely by fabrication quality. The literature
values in Table S9 support the view that intrinsic quality factors in the projected range are physically achievable
in TFLN—albeit with wider waveguides (w ≥ 2 µm) and larger ring radii (R ≥ 80 µm) than the present design.
Transferring comparable sidewall quality to our geometry (R = 20 µm, w = 1.4 µm) is an open fabrication challenge;
the projections in Table S10 should be read as design targets contingent on achieving it.
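The projections in Table S10 follow from two short formulas. The sketch below is illustrative (function names are not from the source) and also evaluates the loaded quality factor 1/Q_L = 1/Q_i + 1/Q_ext. Unrounded D_max values give cascade figures a few hundredths of a dB smaller in magnitude than the tabulated ones, which appear to have been computed from D_max rounded to two decimals.

```python
import math

Q_EXT = 25_800  # FDTD-extracted coupling quality factor at gap = 100 nm

def loaded_q(q_i, q_ext=Q_EXT):
    """Loaded Q from intrinsic and coupling Q (parallel combination)."""
    return 1 / (1 / q_i + 1 / q_ext)

def d_max(q_i, q_ext=Q_EXT):
    """Single-ring on-resonance drop transmission [Q_i / (Q_i + Q_ext)]^2."""
    return (q_i / (q_i + q_ext)) ** 2

def cascade_transmission_db(q_i, n_rings, q_ext=Q_EXT):
    """On-resonance transmission of N identical rings, in dB."""
    return 10 * n_rings * math.log10(d_max(q_i, q_ext))

for label, q_i in [("FDTD baseline", 3.88e4), ("Conservative", 5e5),
                   ("Moderate", 1e6), ("Optimistic", 5e6)]:
    cols = "  ".join(f"{cascade_transmission_db(q_i, n):8.2f}" for n in (10, 20, 30))
    print(f"{label:14s} Q_L={loaded_q(q_i):8.0f}  D_max={d_max(q_i):.3f}  {cols}")
```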

                                   S6. INSERTION LOSS BUDGET DETAILS

  For a cascade of N rings, the total insertion loss is modeled as

                                           ILtot ≈ N · ILstage + ILcoupling ,                                      (S6.1)

where ILstage is the per-ring insertion loss at off-resonance operation and ILcoupling accounts for fiber-to-chip and
chip-to-fiber coupling losses. Using typical loss numbers from the literature [12–16], we consider two scenarios:

   • Optimistic: ILstage = 0.08 dB, ILcoupling = 1.5 dB. Then ILtot ≈ 1.90 dB (N = 5), 2.30 dB (N = 10), 3.10 dB
     (N = 20), and 3.90 dB (N = 30).
   • Conservative: ILstage = 0.25 dB, ILcoupling = 3.0 dB. Then ILtot ≈ 4.25 dB (N = 5), 5.50 dB (N = 10),
     8.00 dB (N = 20), and 10.5 dB (N = 30).

   In both scenarios, N = 5–10 is manageable for probe-power budgeting, whereas N = 20 and N = 30 require tighter
power budgeting and more amplification margin. Higher ILtot raises the required probe SNR and pushes operation
closer to the detector noise floor, reducing usable dynamic range.
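Equation (S6.1) is simple enough to tabulate directly for both scenarios. A minimal sketch; the dictionary layout and function name are illustrative.

```python
SCENARIOS = {
    "optimistic": {"il_stage_db": 0.08, "il_coupling_db": 1.5},
    "conservative": {"il_stage_db": 0.25, "il_coupling_db": 3.0},
}

def total_insertion_loss_db(n_rings, il_stage_db, il_coupling_db):
    """Eq. (S6.1): N per-ring stages plus a fixed fiber-chip coupling loss."""
    return n_rings * il_stage_db + il_coupling_db

for name, params in SCENARIOS.items():
    row = ", ".join(f"N={n}: {total_insertion_loss_db(n, **params):.2f} dB"
                    for n in (5, 10, 20, 30))
    print(f"{name:12s} {row}")
```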
   e. Four-component loss breakdown. The total insertion loss of the cascade has four components:
   1. On-resonance cascade transmission $D_{\max}^{N}$ (dominant; see Table V);
   2. Inter-ring coupling loss (N − 1) × (−10 log10 ηcoupling ), where ηcoupling is the power transfer efficiency at each
      inter-ring bus section. Two-ring FDTD yields ηcoupling ≈ 0.9 for the present diagonal-bus geometry, corresponding
      to ∼0.46 dB per inter-ring stage;
   3. Off-resonance propagation loss N × ILstage , where ILstage = 0.08–0.25 dB per ring [12–14, 16];
   4. Fiber-to-chip coupling loss ILcoupling = 1.5–3.0 dB [15].
Table V presents the ideal on-resonance budget ($D_{\max}^{N}$ only). Including all four components for the present diagonal-bus
layout: in the FDTD-characterized regime (Dmax = 0.36, N = 5), the total loss is approximately 22.2 + 1.8 + 0.4 + 1.5 ≈
26 dB; in the fabricated high-Q regime (Dmax = 0.95, N = 30), the total loss is 6.7 + 13.3 + 2.4 + 1.5 ≈ 24 dB. The
inter-ring coupling loss dominates in the high-Q regime, underscoring that layout optimization (e.g., adiabatic tapers or
straight-bus coupling) is as important as achieving Dmax ≥ 0.95 through quality-factor improvement. For an optimized
layout with ηcoupling ≥ 0.98 (≤0.09 dB per stage), the N = 30 total loss would reduce to ∼13 dB.
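The four-component budget can be assembled as follows. This sketch uses the optimistic per-stage and fiber-coupling figures (0.08 dB and 1.5 dB), which is what the quoted sums imply; the function name is illustrative.

```python
import math

def four_component_loss_db(d_max, n_rings, eta_coupling,
                           il_stage_db=0.08, il_coupling_db=1.5):
    """Return the four loss terms (all positive, in dB) for an N-ring cascade."""
    on_resonance = -10 * n_rings * math.log10(d_max)
    inter_ring = (n_rings - 1) * (-10 * math.log10(eta_coupling))
    propagation = n_rings * il_stage_db
    return on_resonance, inter_ring, propagation, il_coupling_db

cases = {
    "FDTD regime, N=5": (0.36, 5, 0.90),
    "high-Q regime, N=30": (0.95, 30, 0.90),
    "high-Q, optimized layout": (0.95, 30, 0.98),
}
for label, (d_max, n, eta) in cases.items():
    parts = four_component_loss_db(d_max, n, eta)
    terms = " + ".join(f"{p:.1f}" for p in parts)
    print(f"{label:26s} {terms} = {sum(parts):.1f} dB")
```

In the high-Q case with η = 0.9 the inter-ring term (about 13 dB) is roughly twice the on-resonance term, which is the basis for the layout-optimization remark above.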

                             S7. ENERGY EFFICIENCY DETAILED DERIVATION

  This section provides the detailed energy-per-operation derivations for both electrical analog exponential circuits
and the photonic MRR cascade, as summarized in the main text (Sec. V).


                                         S7.1 Electrical analog exponential circuits

  Three main families of electrical circuits realize the exponential function in the analog domain:
  f. BJT translinear / Gilbert cell circuits. The collector current of a bipolar junction transistor is IC =
IS exp(VBE /VT ), providing an intrinsic exponential map [17, 18]. A Gilbert cell multiplier—the core building
block of translinear exponential circuits—dissipates 250–325 µW in typical CMOS/BiCMOS implementations [19]. At
a signal bandwidth of B ≈ 100 MHz, the energy per operation is
\[
E_{\mathrm{Gilbert}} = \frac{P}{B} = \frac{300\ \mathrm{\mu W}}{100\ \mathrm{MHz}} = 3\ \mathrm{pJ}. \tag{S7.1}
\]
  g. CMOS subthreshold exponential circuits. A MOSFET in weak inversion exhibits ID ∝ exp(VGS /nVT ), enabling
direct exponential computation at ultra-low power [18]. A reconfigurable softmax circuit in 180 nm CMOS implements
a 10-input softmax at VDD = 500 mV with P = 3 µW [20]. Per-channel: Pexp ≈ 0.43 µW. At B ≈ 1 MHz (limited by
subthreshold fT ):
\[
E_{\mathrm{sub\text{-}VT}} = \frac{0.43\ \mathrm{\mu W}}{1\ \mathrm{MHz}} = 0.43\ \mathrm{pJ}. \tag{S7.2}
\]
This is the most energy-efficient electrical approach, but at severely limited bandwidth (∼1 MHz).
  h. Digital CMOS (for reference). A digital exponential via Taylor series requires ∼10 multiply-add operations.
Using Horowitz’s energy figures [21] for 45 nm at 0.9 V: 32-bit FP multiply costs 3.7 pJ, FP add costs 0.9 pJ, giving
                                           Edigital ≈ 10 × (3.7 pJ + 0.9 pJ) = 46 pJ.                               (S7.3)
At 8-bit precision (sufficient for inference): ∼2.3 pJ.
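The three electrical estimates reduce to a power-over-bandwidth ratio or an operation count. A minimal sketch with illustrative names:

```python
def energy_per_op_pj(power_w, bandwidth_hz):
    """Energy per operation in pJ for a circuit dissipating power_w at bandwidth_hz."""
    return power_w / bandwidth_hz * 1e12

e_gilbert_pj = energy_per_op_pj(300e-6, 100e6)    # Eq. (S7.1)
e_subvt_pj = energy_per_op_pj(0.43e-6, 1e6)       # Eq. (S7.2)
e_digital_fp32_pj = 10 * (3.7 + 0.9)              # Eq. (S7.3): 10 multiply-adds

print(f"Gilbert cell: {e_gilbert_pj:.2f} pJ")
print(f"Subthreshold CMOS: {e_subvt_pj:.2f} pJ")
print(f"Digital FP32 Taylor: {e_digital_fp32_pj:.0f} pJ")
```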


                          S7.2 Photonic MRR cascade: single-channel energy derivation

   We evaluate the energy at N = 30 cascaded X-cut TFLN micro-ring resonators with R = 20 µm in the fabricated
high-Q regime ($Q_i = 10^{6}$, $Q_L \approx 25{,}200$; Supplementary Sec. S5), which achieves εmax = 0.30% with $V_{\mathrm{ctrl}}$ = 0.91 V
(fully CMOS-compatible). The energy per exponential operation has three components:
   (i) Electro-optic tuning energy. Each ring is tuned by charging the arc electrode capacitance to Vctrl . For the lateral
S–G arc electrodes covering one semicircle (Larc = πR = 62.8 µm), the electrode capacitance is estimated as
                                                            Cel ≈ 18 fF,                                            (S7.4)
based on coplanar electrode modeling for TFLN lateral S–G geometries with gel = 5 µm (comparable to values reported
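by Bahadori et al. [22] for similar geometries). The switching energy per ring at $V_{\mathrm{ctrl}}$ = 0.91 V (using the projected
$Q_L$ = 25,200, which gives $b_V = 0.295\ \mathrm{V^{-1}}$) is
\[
E_{\mathrm{ring}} = \tfrac{1}{2} C_{\mathrm{el}} V_{\mathrm{ctrl}}^{2} = \tfrac{1}{2} \times 18\ \mathrm{fF} \times (0.91\ \mathrm{V})^{2} = 7.4\ \mathrm{fJ}. \tag{S7.5}
\]
For N = 30 rings: $E_{\mathrm{EO}} = 30 \times 7.4\ \mathrm{fJ} \approx 0.22\ \mathrm{pJ}$.
  Note the scaling with cascade depth: because b ∝ 1/N under minimax optimization, the EO energy falls as 1/N,
\[
E_{\mathrm{EO}} \propto N \times V_{\mathrm{ctrl}}^{2} \propto N \times (b/b_V)^{2} \propto 1/N. \tag{S7.6}
\]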
The bias voltage (3.9 V) is static and does not contribute per-operation energy.
   (ii) Laser source energy (amortized). Because every cascade channel uses the same fixed probe wavelength, a single
CW laser can be shared among M parallel softmax channels via a 1 × M optical power splitter. With wall-plug
efficiency ηWPE ≈ 15% [23], the per-channel optical power is Popt = Pin /M ≈ 100 µW (for Pin = 1 mW, M = 10),
requiring Plaser ≈ 667 µW per channel. At fmod = 10 GHz: Elaser = 667 µW / 10 GHz = 67 fJ.
   (iii) Photodetector energy. Integrated SiGe photodetectors with TIA achieve sub-pJ reception [24]: EPD ≈ 0.5 pJ.
   The total single-channel energy is
\[
E_{\mathrm{photonic}}^{(1\mathrm{ch})} = E_{\mathrm{EO}} + E_{\mathrm{laser}} + E_{\mathrm{PD}} = 0.22 + 0.07 + 0.50 = 0.79\ \mathrm{pJ}. \tag{S7.7}
\]
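The three contributions can be summed numerically as a consistency check. This sketch uses only the values quoted in this subsection; variable names are illustrative.

```python
C_EL = 18e-15        # electrode capacitance, F
V_CTRL = 0.91        # maximum control voltage, V
N_RINGS = 30
P_OPT_CH = 100e-6    # per-channel optical power, W (1 mW split over M = 10)
ETA_WPE = 0.15       # laser wall-plug efficiency
F_MOD = 10e9         # modulation rate, Hz
E_PD = 0.5e-12       # photodetector plus TIA energy, J

e_ring = 0.5 * C_EL * V_CTRL**2          # Eq. (S7.5), J per ring
e_eo = N_RINGS * e_ring                  # EO tuning energy for the cascade
p_laser = P_OPT_CH / ETA_WPE             # electrical laser power per channel, W
e_laser = p_laser / F_MOD                # amortized laser energy per operation
e_total = e_eo + e_laser + E_PD          # Eq. (S7.7)

print(f"E_ring  = {e_ring * 1e15:.1f} fJ")
print(f"E_EO    = {e_eo * 1e12:.2f} pJ")
print(f"E_laser = {e_laser * 1e15:.0f} fJ")
print(f"E_total = {e_total * 1e12:.2f} pJ")
```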

                                       S7.3 Q-factor scaling of energy efficiency

  Since $V_{\mathrm{ctrl}} \propto 1/Q$ and $E_{\mathrm{EO}} \propto V_{\mathrm{ctrl}}^{2}$, the EO energy scales as $1/Q^{2}$. Table S11 shows the total energy for N = 30 at
various quality factors.

TABLE S11: Energy per exponential operation vs. quality factor (N = 30, εmax = 0.30%, X-cut arc electrode with $b_V$
 scaling linearly with Q; $C_{\mathrm{el}}$ = 18 fF). $E_{\mathrm{laser}} + E_{\mathrm{PD}}$ = 0.57 pJ is the Q-independent floor. The dagger (†) marks the
FDTD-calibrated quality factor; the double dagger (‡) marks the high-Q design point ($Q_i = 10^{6}$). Excludes thermal
                                       stabilization (0.15–0.60 pJ for N = 30).

      Q                    Vctrl (V)                  Vbias (V)                   EEO (pJ)                    Etotal (pJ)
   5,000                     4.57                       19.5                        5.64                         6.21
 10,000                      2.28                        9.7                        1.40                         1.97
 12,500                      1.83                        7.8                        0.90                         1.47
15,500†                      1.47                        6.3                        0.58                         1.15
 20,000                      1.14                        4.9                        0.35                         0.92
25,200‡                      0.91                        3.9                        0.22                         0.79
 30,000                      0.76                        3.2                        0.16                         0.73
 50,000                      0.46                        1.9                        0.06                         0.63


   At QL = 15,500 (FDTD-calibrated), the EO contribution (0.58 pJ) is comparable to the optical floor, placing the
design in the efficient operating regime. Beyond Q ≈ 30,000, the EO contribution becomes negligible and the total
energy saturates near the floor; further Q improvement primarily benefits CMOS driver voltage compatibility rather
than energy.
   i. Additional energy contributions. The estimates above exclude two further contributions: (i) DAC energy
for setting the per-ring control voltages, typically 0.1–1 pJ per conversion at 10 GHz bandwidth; and (ii) thermal
stabilization power for maintaining resonance alignment, estimated at ∼50–200 µW per ring for TFLN (lower than
silicon due to the small thermo-optic coefficient of LiNbO3, $dn/dT \approx 3.9 \times 10^{-6}\ \mathrm{K^{-1}}$). At 10 GHz modulation rate,
the thermal contribution amounts to ∼0.005–0.02 pJ per ring per operation. For the N = 30 cascade, this sums to
0.15–0.60 pJ, which is comparable to EEO and must be included in the total: Etotal ≈ 0.94–1.39 pJ. The total energy
comparison should therefore be treated as an order-of-magnitude estimate.
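The rows of Table S11 and the thermal range can be regenerated from the scaling argument. The sketch below assumes b_V scales linearly with Q from the reference point b_V = 0.295 V^-1 at Q_L = 25,200 and takes (a, b) from the N = 30 row of Table S7; names are illustrative. Results agree with the table to within rounding of the voltages.

```python
C_EL = 18e-15                      # electrode capacitance, F
N_RINGS = 30
A_FIT, B_FIT = -1.1392, 0.03341    # N = 30 row of Table S7
L_RANGE = 8.0
B_V_REF, Q_REF = 0.295, 25_200     # reference sensitivity (1/V) and loaded Q
FLOOR_PJ = 0.57                    # E_laser + E_PD, independent of Q

def energy_row(q_loaded):
    """Return (V_ctrl, V_bias, E_EO in pJ, E_total in pJ) for a given loaded Q."""
    b_v = B_V_REF * q_loaded / Q_REF            # assumed linear scaling with Q
    v_ctrl = B_FIT * L_RANGE / b_v
    v_bias = abs(A_FIT) / b_v
    e_eo_pj = N_RINGS * 0.5 * C_EL * v_ctrl**2 * 1e12
    return v_ctrl, v_bias, e_eo_pj, e_eo_pj + FLOOR_PJ

def thermal_pj(power_per_ring_w, f_mod_hz=10e9, n_rings=N_RINGS):
    """Thermal stabilization energy per operation for the whole cascade, in pJ."""
    return n_rings * power_per_ring_w / f_mod_hz * 1e12

for q in (5_000, 10_000, 15_500, 25_200, 50_000):
    v_ctrl, v_bias, e_eo, e_tot = energy_row(q)
    print(f"Q={q:6d}  V_ctrl={v_ctrl:4.2f} V  V_bias={v_bias:4.1f} V  "
          f"E_EO={e_eo:4.2f} pJ  E_total={e_tot:4.2f} pJ")

print(f"thermal: {thermal_pj(50e-6):.2f} to {thermal_pj(200e-6):.2f} pJ")
```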


                                S7.4 Comparison with electronic implementations

   Here we provide an order-of-magnitude energy comparison between electrical analog exponential circuits and our
photonic MRR cascade, grounding the analysis in published device data and first-principles estimates. We assume
a shared CW laser with total optical output Pin,tot = 1 mW, split across M = 10 parallel softmax channels via a
1 × M power splitter, yielding per-channel input $P_{\mathrm{in,ch}}$ = 100 µW. The output power at the cascade drop port is
$P_{\mathrm{out}} = P_{\mathrm{in,ch}} \times D_{\max}^{N}$, which ranges from 0.61 µW (FDTD regime, N = 5) to 21.5 µW (fabricated regime, N = 30)
(Table V).
   j. Electrical analog exponential circuits. Three main families of electrical analog exponential circuits are compared:
BJT translinear/Gilbert cell (∼ 3 pJ at 100 MHz [17–19]), CMOS subthreshold (∼ 0.43 pJ at 1 MHz [18, 20]), and
digital FP32 Taylor series (∼ 46 pJ at 1 GHz [21]).
   k. Photonic MRR cascade: single-channel energy. For N = 30 X-cut TFLN micro-ring resonators in the self-
consistent high-Q regime (QL = 25,200), the three energy components are EO tuning (EEO = 0.22 pJ), amortized
laser (Elaser = 0.07 pJ, shared across M = 10 parallel channels), and photodetector (EPD = 0.50 pJ), yielding
Ephotonic = 0.79 pJ. Including thermal stabilization for N = 30 rings (0.15–0.60 pJ), the total rises to 0.94–1.39 pJ.
Notably, EEO ∝ 1/N since b ∝ 1/N from minimax optimization.
   l. Single-channel comparison. Table S12 presents the comparison. The photonic cascade at N = 30 achieves
0.79 pJ baseline—3.8× lower than the BJT Gilbert cell (3 pJ) and 58× lower than digital FP32 (46 pJ). Including
thermal stabilization (0.94–1.39 pJ), the advantage over INT8 (2.3 pJ) is 1.7–2.4×, while operating at 10 GHz
bandwidth. At fabricated Q ≥ 30,000, EEO drops to 0.16 pJ and Etotal ≈ 0.73 pJ (excluding thermal; Table S11),
recovering a 3.2× advantage over INT8. Subthreshold CMOS achieves the lowest energy (0.43 pJ) but at 10,000×
lower bandwidth.
                      TABLE S12: Energy per exponential operation: single-channel comparison.

Implementation                                    E/op (pJ)                        Bandwidth                             Notes
Digital FP32 (Taylor)                                ∼46                             1 GHz                           10 FP MACs
BJT Gilbert cell                                     ∼3                             100 MHz                              Analog
Digital INT8 (Taylor)                                ∼2.3                            1 GHz                           10 INT MACs
Photonic MRR (N = 30)                             0.94–1.39                         10 GHz                             Analog†
Subthreshold CMOS                                   ∼0.43                            1 MHz                               Analog
    † 0.79 pJ excluding thermal; 0.94–1.39 pJ including thermal. Self-consistent with fabricated high-Q regime ($Q_L$ = 25,200); see
                                                      Supplementary Sec. S7.

   m. Caveats. These values are order-of-magnitude estimates, not device-accurate predictions. The photonic
estimate excludes DAC energy for voltage generation (typically 0.1–1 pJ per conversion at 10 GHz bandwidth, shared
with any analog approach) and thermal tuning power for maintaining resonance alignment (∼50–200 µW per ring for
TFLN, lower than silicon due to the small thermo-optic coefficient of LiNbO3, $dn/dT \approx 3.9 \times 10^{-6}\ \mathrm{K^{-1}}$). Effective
precision at the photodetector is limited to ∼6–8 bits by shot noise and receiver electronics. The energy advantage
over electrical implementations is strongest in the fabricated high-Q regime ($D_{\max} \geq 0.95$), where N = 30 is practical
and $V_{\mathrm{ctrl}}$ remains CMOS-compatible.

                  S8. MONTE CARLO ROBUSTNESS UNDER DEVICE NON-IDEALITIES

   This section describes the robustness model summarized in the main text. For the fitted L = 8, N = 10 design
(a = −1.4588, b = 0.10202), each Monte Carlo chip sample includes: (i) per-ring static detuning spread, (ii) per-
ring sensitivity spread, (iii) global thermal drift and crosstalk-like slope drift, (iv) stage insertion-loss variation, (v)
control-channel noise, and (vi) detector noise with one-point calibration at I = L.
   For ring k, we use
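\[
T_k(I) = \frac{1}{1 + \left(a_k + b_k I + d_{\mathrm{th}} + d_{\mathrm{xt}}\, I/L\right)^{2}}, \tag{46}
\]
with
\[
y(I) = \prod_{k=1}^{N} T_k(I) \times 10^{-\mathrm{IL}_{\mathrm{tot}}/10}, \tag{47}
\]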

and one-point calibration ỹ(I) = Ccal y(I) such that ỹ(L) = 1 for the same chip instance.

                       TABLE S13: Non-ideality distributions used in the Monte Carlo sweeps.

                              Parameter                 Nominal         Stress
                              σ_a                       0.020           0.032
                              σ_b,rel                   0.020           0.032
                              σ_th                      0.015           0.025
                              σ_xt                      0.012           0.020
                              σ_I                       0.004           0.007
                              IL_stage (dB, µ ± σ)      0.12 ± 0.03     0.18 ± 0.05
                              σ_det                     3.0 × 10⁻⁶      6.0 × 10⁻⁶
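
The chip-sample model of Eqs. (46)–(47) with the nominal column of Table S13 can be sketched as follows. This is a minimal illustration, not the authors' `scripts/` implementation: the zero-mean Gaussian form of every draw, the per-stage summation of insertion loss into IL_tot, the additive detector noise on the raw output, and the noise-free calibration read are all assumptions of this sketch, and the function names are hypothetical.

```python
import numpy as np

A_FIT, B_FIT = -1.4588, 0.10202   # fitted detuning offset a and slope b (from the text)
L_RANGE, N_RINGS = 8.0, 10        # logit range L and cascade depth N (from the text)

# Nominal column of Table S13.
NOMINAL = dict(sigma_a=0.020, sigma_b_rel=0.020, sigma_th=0.015, sigma_xt=0.012,
               sigma_I=0.004, il_mu=0.12, il_sigma=0.03, sigma_det=3.0e-6)


def ideal_response(I):
    """Calibrated ideal cascade: ln y~(I) = N*[phi(a+bI) - phi(a+bL)], phi(u) = -ln(1+u^2)."""
    def phi(u):
        return -np.log1p(u ** 2)
    return np.exp(N_RINGS * (phi(A_FIT + B_FIT * I) - phi(A_FIT + B_FIT * L_RANGE)))


def sample_chip(rng, p=NOMINAL):
    """Draw one chip instance. Gaussian draws are an ASSUMPTION of this sketch."""
    return dict(
        a=A_FIT + rng.normal(0.0, p["sigma_a"], N_RINGS),              # static detuning spread
        b=B_FIT * (1.0 + rng.normal(0.0, p["sigma_b_rel"], N_RINGS)),  # sensitivity spread
        d_th=rng.normal(0.0, p["sigma_th"]),                           # global thermal drift
        d_xt=rng.normal(0.0, p["sigma_xt"]),                           # crosstalk-like slope drift
        il_tot=np.sum(rng.normal(p["il_mu"], p["il_sigma"], N_RINGS)), # dB, summed over stages
    )


def raw_output(chip, I):
    """Eqs. (46)-(47): product of ring transmissions times the total insertion loss."""
    u = chip["a"] + chip["b"] * I + chip["d_th"] + chip["d_xt"] * I / L_RANGE
    return np.prod(1.0 / (1.0 + u ** 2)) * 10.0 ** (-chip["il_tot"] / 10.0)


def calibrated_output(chip, I, rng, p=NOMINAL):
    """One-point calibration at I = L, then a noisy read at logit I."""
    c_cal = 1.0 / raw_output(chip, L_RANGE)                          # noise-free calibration (assumption)
    I_noisy = I + rng.normal(0.0, p["sigma_I"])                      # control-channel noise
    y = raw_output(chip, I_noisy) + rng.normal(0.0, p["sigma_det"])  # additive detector noise
    return c_cal * y


rng = np.random.default_rng(0)
chip = sample_chip(rng)
logits = np.linspace(0.0, L_RANGE, 5)
y_tilde = np.array([calibrated_output(chip, I, rng) for I in logits])
```

Repeating `sample_chip` over many seeds and comparing the normalized `y_tilde` against a reference softmax would give error statistics of the kind summarized in Table S14, although this sketch is not expected to reproduce those numbers.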


                        TABLE S14: Monte Carlo summary (same run reported in main text).

                              Metric                          Nominal         Stress
                              Median KL(p_ref ∥ p_approx)     2.17 × 10⁻⁴     7.39 × 10⁻⁴
                              p95 KL(p_ref ∥ p_approx)        5.92 × 10⁻⁴     2.21 × 10⁻³
                              Median max |∆p|                 0.170%          0.193%
                              p95 max |∆p|                    0.319%          0.419%

Conservative-bound sketch used for the main-text screening equation. For the identical-detuning family
with fixed b, define
\[
\ln \tilde{y}(I) = N\,\phi(a + bI) - N\,\phi(a + bL), \qquad \phi(u) = -\ln(1 + u^{2}), \tag{48}
\]
so that $\tilde{y}(L) = 1$ by construction. Around a constructive choice with a + bI < 0 on [0, L], a second-order remainder
argument for the mismatch between the target slope and the fitted slope yields a term scaling as $L^{2}/(4N)$, while the
flank-curvature penalty contributes a term scaling as $1/(2b^{2}N)$. Combining the two contributions gives the screening
inequality
\[
E_{\infty} \lesssim \frac{L^{2}}{4N} + \frac{1}{2b^{2}N}, \tag{49}
\]
which leads to the conservative screening equation reported in the main manuscript. We emphasize that this is a
conservative heuristic design rule (not a formal minimax theorem), used only for preliminary depth screening.
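
Because both terms on the right-hand side of Eq. (49) scale as 1/N, the inequality can be inverted for a minimum depth at a chosen error target. The helper below is a small sketch of that arithmetic only; the function names and the `target` argument are hypothetical, and no particular target value is implied by the text.

```python
import math


def screening_bound(L, b, N):
    """Right-hand side of Eq. (49): L^2/(4N) + 1/(2 b^2 N)."""
    return L ** 2 / (4.0 * N) + 1.0 / (2.0 * b ** 2 * N)


def min_depth(L, b, target):
    """Smallest integer N for which screening_bound(L, b, N) <= target."""
    return math.ceil((L ** 2 / 4.0 + 1.0 / (2.0 * b ** 2)) / target)
```

For a fixed L, the second term dominates whenever b is small, so the flank slope b sets the depth requirement more strongly than the logit range in this heuristic.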


FIG. S4: CDF of end-to-end softmax probability error under the same non-ideality samples.

                      S9. DELAY-AWARE FEEDBACK NORMALIZATION VALIDATION

  We model global normalization as a delayed PI-controlled loop:
\begin{align}
S(t) &= G(t)\,P(t) + n(t), \tag{50}\\
\tau \frac{dP}{dt} &= -P(t) + u(t - T_d), \tag{51}\\
u(t) &= K_p\, e(t) + K_i \int e(t)\,dt, \qquad e(t) = S_{\mathrm{ref}} - S(t), \tag{52}
\end{align}
with actuator saturation $0 \le u \le P_{\max}$. A piecewise G(t) profile is used to emulate workload changes. For physical
intuition, Table S15 converts normalized delay/settling metrics into absolute-time examples.
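
A minimal forward-Euler sketch of Eqs. (50)–(52) is given below for orientation. It is not the authors' validation script: the time step, the zero initial state, the choice S_ref = 1 and P_max = 2, the absence of anti-windup, and the interpretation of K_i in units of 1/τ are assumptions of this sketch, so the settling time it reports need not match the ∼12.4τ value quoted for the representative case.

```python
import numpy as np


def simulate_pi_loop(kp, ki, td, tau=1.0, s_ref=1.0, p_max=2.0,
                     gain=lambda t: 1.0, t_end=60.0, dt=1e-3, noise_std=0.0, seed=0):
    """Forward-Euler simulation of Eqs. (50)-(52); times share the units of tau."""
    rng = np.random.default_rng(seed)
    n = int(round(t_end / dt))
    nd = int(round(td / dt))                  # loop delay expressed in samples
    t = np.arange(n) * dt
    noise = noise_std * rng.normal(size=n)    # measurement noise n(t)
    s = np.zeros(n)                           # measured signal S(t)
    u = np.zeros(n)                           # saturated actuator command u(t)
    p, integ = 0.0, 0.0                       # plant state P and integral of the error
    for i in range(n):
        s[i] = gain(t[i]) * p + noise[i]                      # Eq. (50)
        e = s_ref - s[i]
        integ += e * dt
        u[i] = min(max(kp * e + ki * integ, 0.0), p_max)      # Eq. (52) with saturation
        u_delayed = u[i - nd] if i >= nd else 0.0             # u(t - Td), zero before t = Td
        p += dt * (-p + u_delayed) / tau                      # Eq. (51)
    return t, s


def settling_time(t, s, s_ref=1.0, band=0.02):
    """First time after which S stays within +/- band of s_ref; None if it never does."""
    outside = np.abs(s - s_ref) > band * abs(s_ref)
    if not outside.any():
        return t[0]
    last = np.flatnonzero(outside)[-1]
    return None if last == len(t) - 1 else t[last + 1]


# Representative stable case from the text: (Kp, Ki, Td/tau) = (0.55, 0.8, 0.2).
t, s = simulate_pi_loop(0.55, 0.8, 0.2)
t_settle = settling_time(t, s)
```

A piecewise workload profile can be emulated by passing, for example, `gain=lambda t: 1.0 if t < 30.0 else 1.25`; `settling_time` then reports the time of the last excursion outside the band.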

TABLE S15: Example absolute-time interpretation of normalized PI-loop metrics using one representative stable case
            ($(K_p, K_i, T_d/\tau) = (0.55, 0.8, 0.2)$) and a ±2% settling-time definition ($T_{\mathrm{settle}} \sim 12.4\tau$).

                         Assumed τ     Delay T_d = 0.2τ     Settling ∼ 12.4τ     Interpretation
                         100 ns        20 ns                1.24 µs              fast loop
                         1 µs          200 ns               12.4 µs              moderate loop
                         5 µs          1 µs                 62 µs                slower loop

Reference-backed latency context for bottleneck screening. To place the delayed-loop times against mixed-
signal system latencies, Table S16 summarizes representative time scales with explicit path classes (on-chip vs off-chip)
for memory and interconnect paths, alongside conversion latency ranges. These are intentionally order-of-magnitude
ranges (not fixed constants), and can shift with architecture, clocking, and protocol stack choices.

    TABLE S16: Representative subsystem latency ranges used for conservative bottleneck screening in Sec. S9.

                         Subsystem path                  T_sys          Sources
                         On-chip memory (L1/L2)          20–200 ns      [25]
                         Off-chip memory (DRAM)          200–700 ns     [25, 26]
                         ADC conversion                  10–710 ns      [27, 28]
                         DAC + driver/settling           1–200 ns       [29]
                         On-chip interconnect (NoC)      5–100 ns       [30]
                         Off-chip I/O (PCIe/CXL)         1–10 µs        [25, 31]

Conservative risk-screening heuristic for loop latency. As a screening heuristic, we use the settling time from
one representative stable case ($(K_p, K_i, T_d/\tau) = (0.55, 0.8, 0.2)$; Table S18), with settling defined as the first time
entering and remaining within a ±2% band around $S_{\mathrm{ref}}$, as a normalization-loop latency proxy:
\[
T_{\mathrm{norm}} \approx 12.4\,\tau. \tag{53}
\]
This value is not a universal bound; different gain settings, delay ratios, or loop architectures will yield different settling
times. It is used only as a reference point for order-of-magnitude risk screening. Define the conservative screening
condition
\[
T_{\mathrm{norm}} \ge \beta\, T_{\mathrm{sys}}, \tag{54}
\]
with β = 1 (high-risk screening line) and β = 0.5 (early-warning line); this is a heuristic risk indicator, not a formal
dominance proof. Setting $T_{\mathrm{norm}} = \beta\, T_{\mathrm{sys}}$ in Eq. (53) gives the corresponding threshold
\[
\tau_{\mathrm{crit}}(\beta) = \frac{\beta\, T_{\mathrm{sys}}}{12.4}. \tag{55}
\]
Table S17 gives the resulting numeric ranges.

                        TABLE S17: Computed τcrit ranges from Eq. (55) using Table S16.

                    Subsystem                      T_sys range     τ_crit (β = 0.5)     τ_crit (β = 1)
                    On-chip memory path            20–200 ns       0.81–8.06 ns         1.61–16.13 ns
                    Off-chip memory path           200–700 ns      8.06–28.23 ns        16.13–56.45 ns
                    ADC conversion                 10–710 ns       0.40–28.63 ns        0.81–57.26 ns
                    DAC + driver/settling          1–200 ns        0.04–8.06 ns         0.08–16.13 ns
                    On-chip interconnect (NoC)     5–100 ns        0.20–4.03 ns         0.40–8.06 ns
                    Off-chip I/O fabric            1–10 µs         0.04–0.40 µs         0.08–0.81 µs

For the explicit examples in Table S15, τ = 0.1 µs gives T_norm ≈ 1.24 µs, τ = 1 µs gives T_norm ≈ 12.4 µs, and τ = 5 µs
gives T_norm ≈ 62 µs. These numbers indicate a risk trend (not a hard boundary): for this representative case, the
normalization loop is typically non-dominant when τ is well below the relevant τ_crit band, and it may become dominant
as τ approaches or exceeds that band. The transition depends on path class (on-chip vs off-chip) and on
architecture-specific timing closure, including whether the normalization path lies on the end-to-end critical path (Table S16).
Accordingly, this analysis is intended for preliminary risk screening only; concrete implementations
require full timing validation.
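
The τ_crit ranges follow from Eq. (55) by direct arithmetic on the Table S16 latency ranges. The short sketch below reproduces that calculation and adds a hypothetical `risk_flag` helper that applies the two screening lines (β = 0.5 and β = 1) to a given τ; the helper and its label strings are illustrative and do not appear in the original analysis.

```python
T_NORM_FACTOR = 12.4   # T_norm ≈ 12.4 τ for the representative stable case, Eq. (53)

# Representative latency ranges from Table S16, in nanoseconds (min, max).
T_SYS_NS = {
    "On-chip memory (L1/L2)": (20.0, 200.0),
    "Off-chip memory (DRAM)": (200.0, 700.0),
    "ADC conversion": (10.0, 710.0),
    "DAC + driver/settling": (1.0, 200.0),
    "On-chip interconnect (NoC)": (5.0, 100.0),
    "Off-chip I/O (PCIe/CXL)": (1000.0, 10000.0),
}


def tau_crit(beta, t_sys):
    """Eq. (55): loop time constant at which T_norm equals beta * T_sys."""
    return beta * t_sys / T_NORM_FACTOR


def risk_flag(tau, t_sys):
    """Classify tau against the beta = 0.5 and beta = 1 screening lines of Eq. (54)."""
    t_norm = T_NORM_FACTOR * tau
    if t_norm >= t_sys:
        return "high-risk"
    if t_norm >= 0.5 * t_sys:
        return "early-warning"
    return "below screening lines"


for name, (lo, hi) in T_SYS_NS.items():
    half = (tau_crit(0.5, lo), tau_crit(0.5, hi))
    full = (tau_crit(1.0, lo), tau_crit(1.0, hi))
    print(f"{name:28s} beta=0.5: {half[0]:.2f}-{half[1]:.2f} ns   beta=1: {full[0]:.2f}-{full[1]:.2f} ns")
```

All values are in nanoseconds, so the off-chip I/O row prints 40.32–403.23 ns and 80.65–806.45 ns, which correspond to the 0.04–0.40 µs and 0.08–0.81 µs entries of Table S17.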

TABLE S18: Representative step-response cases for the delayed PI loop (settling defined by a ±2% band around S_ref).

                         Case         (K_p, K_i, T_d/τ)      Overshoot     Settling        Stable
                         Stable       (0.55, 0.8, 0.2)       25.6%         ∼ 12.4τ         Yes
                         Marginal     (0.95, 1.6, 0.45)      25.6%         ∼ 12.8τ         Yes
                         Unstable     (1.2, 2.2, 0.75)       45.1%         not settled     No


                   TABLE S19: Stable-region fraction from gain-map scans at each delay ratio.

                                                  T_d/τ  Stable fraction
                                                   0.0        88.1%
                                                   0.2        88.0%
                                                   0.5        72.4%
                                                   0.8        47.5%
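
A gain-map scan of this kind can be sketched by simulating the loop of Eqs. (50)–(52) on a grid of (K_p, K_i) pairs and counting the pairs that end inside the ±2% band. The scanned ranges, grid density, run length, constant G = 1, noise-free signal, and the "inside the band over the last quarter of the run" criterion are all assumptions of this sketch, since the scan settings are not stated here; the fractions it produces are therefore not expected to match Table S19.

```python
import numpy as np


def settled_map(kp, ki, td, tau=1.0, s_ref=1.0, p_max=2.0, t_end=80.0, dt=0.01, band=0.02):
    """Boolean map: True where the loop stays within the band over the last quarter of the run."""
    kp, ki = np.asarray(kp, dtype=float), np.asarray(ki, dtype=float)
    n, nd = int(round(t_end / dt)), int(round(td / dt))
    p = np.zeros_like(kp)                              # plant state P for every gain pair
    integ = np.zeros_like(kp)                          # integral of the error
    buf = [np.zeros_like(kp) for _ in range(nd + 1)]   # circular buffer spanning one delay
    ok = np.ones(kp.shape, dtype=bool)
    for i in range(n):
        e = s_ref - p                                  # G = 1 and no noise, so S = P
        integ += e * dt
        buf[i % (nd + 1)] = np.clip(kp * e + ki * integ, 0.0, p_max)
        p = p + dt * (-p + buf[(i - nd) % (nd + 1)]) / tau
        if i >= int(0.75 * n):
            ok &= np.abs(p - s_ref) <= band * abs(s_ref)
    return ok


def stable_fraction(td, kp_range=(0.1, 2.0), ki_range=(0.1, 3.0), points=21):
    """Fraction of scanned (Kp, Ki) pairs that settle; the ranges are illustrative only."""
    kp, ki = np.meshgrid(np.linspace(*kp_range, points),
                         np.linspace(*ki_range, points), indexing="ij")
    return float(settled_map(kp, ki, td).mean())
```

Calling `stable_fraction(td)` for each delay ratio of interest gives one number per row, in the same layout as the table above.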


FIG. S5: Step-response examples of the delayed PI normalization loop.


FIG. S6: Delay-dependent stability maps over scanned (Kp , Ki ) ranges.

                                               S10. REPRODUCIBILITY

  Scripts used for this Supplementary validation:
    • scripts/nonideality montecarlo.py

    • scripts/feedback loop validation.py

    • scripts/extract logit range effective.py

    • scripts/analyze softmax clipping validity.py
Public code repository: https://github.com/hyoseokp/MRR-AEF (commit 585e695). Empirical extraction outputs
are stored under:
    • paper/empirical L v3/


 [1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia
     Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems 30 (NeurIPS 2017), pages
     5998–6008, 2017.
 [2] Alec Radford et al. Language models are unsupervised multitask learners. Technical report, OpenAI, 2019.
 [3] Hugging Face. distilgpt2 model card, 2025. accessed 2026-02-21.
 [4] Andrej Karpathy. Tiny shakespeare dataset (char-rnn), 2025. accessed 2026-02-21.
 [5] Jane Austen. Pride and prejudice. Project Gutenberg eBook No. 1342, 2025. accessed 2026-02-21.
 [6] Di Zhu et al. Integrated photonics on thin-film lithium niobate. Advances in Optics and Photonics, 13(2):242–352, 2021.
 [7] Yaowen Hu, Di Zhu, Shengyuan Lu, Xinrui Zhu, Yunxiang Song, Dylan Renaud, Daniel Assumpcao, Rebecca Cheng,
     CJ Xin, Matthew Yeh, Hana Warner, Xiangwen Guo, Amirhassan Shams-Ansari, David Barton, Neil Sinclair, and Marko
     Lončar. Integrated electro-optics on thin-film lithium niobate. Nature Reviews Physics, 2025.
 [8] Mian Zhang, Cheng Wang, Rebecca Cheng, Amirhassan Shams-Ansari, and Marko Lončar. Monolithic ultra-high-Q lithium
     niobate microring resonator. Optica, 4(12):1536–1537, 2017.
 [9] Renhong Gao, Ni Yao, Jianglin Guan, Li Deng, Jintian Lin, Min Wang, Lingling Qiao, Wei Fang, and Ya Cheng. Lithium
     niobate microring with ultra-high Q factor above 10⁸. Chin. Opt. Lett., 20(1):011902, 2022.
[10] Rongjin Zhuang, Jinze He, Yifan Qi, and Yang Li. High-Q thin-film lithium niobate microrings fabricated with wet etching.
     Adv. Mater., 35(3):2208113, 2023.
[11] Xinrui Zhu, Yaowen Hu, Shengyuan Lu, Hana K. Warner, Xudong Li, Yunxiang Song, Letícia S. Magalhães, Amirhassan
     Shams-Ansari, Neil Sinclair, and Marko Lončar. Twenty-nine million intrinsic Q-factor monolithic microresonators on
     thin-film lithium niobate. Photon. Res., 12(8):A63–A68, 2024.
[12] Sudip Shekhar, Wim Bogaerts, Lukas Chrostowski, John E. Bowers, Michael Hochberg, Richard Soref, and Bhavin J.
     Shastri. Roadmapping the next generation of silicon photonics. Nature Communications, 15:751, 2024.
[13] Xaveer Leijtens et al. Multimode silicon photonics. Nanophotonics, 7:1571–1580, 2018.
[14] Haoqian Li et al. In-memory photonic dot-product engine with electrically programmable weight banks. Nature Communi-
     cations, 14:2389, 2023.
[15] Daan Vermeulen et al. High-efficiency fiber-to-chip grating couplers realized using an advanced CMOS-compatible
     silicon-on-insulator platform. Optics Express, 18(17):18278–18283, 2010.
[16] F. S. Tan, D. J. W. Klunder, H. F. Bulthuis, G. Sengo, H. J. W. M. Hoekstra, and A. Driessen. Direct measurement of
     the on-chip insertion loss of high finesse microring resonators in Si₃N₄-SiO₂ technology. In Proceedings of the IEEE LEOS
     Benelux Chapter, 2001.
[17] B. Gilbert. Translinear circuits: a proposed classification. Electron. Lett., 11(1):14–16, 1975.
[18] C. Mead. Analog VLSI and Neural Systems. Addison-Wesley, 1989.
[19] B. Razavi. Design of Analog CMOS Integrated Circuits. McGraw-Hill, 2 edition, 2017.
[20] Massimo Vatalaro, Tatiana Moposita, Sebastiano Strangio, Lionel Trojman, Andrei Vladimirescu, Marco Lanuzza, and
     Felice Crupi. A low-voltage, low-power reconfigurable current-mode softmax circuit for analog neural networks. Electronics,
     10(9):1004, 2021.
[21] M. Horowitz. 1.1 computing’s energy problem (and what we can do about it). In 2014 IEEE International Solid-State
     Circuits Conference (ISSCC), pages 10–14, 2014.
[22] Meisam Bahadori, Yansong Yang, Ahmed E. Hassanien, Lynford L. Goddard, and Songbin Gong. Ultra-efficient and fully
     isotropic monolithic microring modulators in a thin-film lithium niobate photonics platform. Optics Express, 28(20):29644–
     29661, 2020.
[23] A. Biberman and K. Bergman. Optical interconnection networks for high-performance computing systems. Rep. Prog.
     Phys., 75(4):046402, 2012.

[24] D. A. B. Miller. Attojoule optoelectronics for low-energy information processing and communications. J. Lightwave Technol.,
     35(3):346–396, 2017.
[25] Zhe Jia, Marco Maggioni, Benjamin Staiger, and Daniele P. Scarpazza. Dissecting the NVIDIA Volta GPU architecture via
     microbenchmarking. arXiv preprint arXiv:1804.06826, 2018.
[26] Yoongu Kim, Vivek Seshadri, Donghyuk Lee, Jamie Liu, and Onur Mutlu. A case for exploiting subarray-level parallelism
     (SALP) in DRAM. In Proceedings of the 39th Annual International Symposium on Computer Architecture (ISCA), pages
     368–379, 2012.
[27] Texas Instruments. ADC12DJ3200: 6.4-GSPS single-channel or 3.2-GSPS dual-channel, 12-bit, RF-sampling analog-to-digital
     converter. Datasheet (SLVSD97A, revised April 2020), 2020. Accessed 2026-02-22.
[28] Texas Instruments. ADS8881: 18-bit, 1-MSPS, low-power, true-differential SAR ADC. Datasheet (SBAS547D, revised
     August 2015), 2015. Accessed 2026-02-22.
[29] Texas Instruments. DAC38RF82/DAC38RF89: Dual-channel, 14-bit, 9-GSPS and 6-GSPS RF DACs. Datasheet
     (SLASEA6D, revised June 2020), 2020. Accessed 2026-02-22.
[30] W. J. Dally and B. Towles. Route packets, not wires: on-chip interconnection networks. In Proceedings of the 38th Design
     Automation Conference (DAC), pages 684–689, 2001.
[31] Shintaro Sano, Yosuke Bando, Kazuhiro Hiwada, Hirotsugu Kajihara, Tomoya Suzuki, Yu Nakanishi, Daisuke Taki, and
     Akiyuki Kaneko. GPU graph processing on CXL-based microsecond-latency external memory. In Proceedings of the SC ’23
     Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2023.