Initial commit: ept — backprop-free equilibrium transformer (EP)

Code (ep_run/), organized docs (docs/{method,campaign,hardware,outreach,paper}), analysis scripts (scripts/), ONBOARDING.md entry point. Large data/checkpoints git-ignored (share separately). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_014FAPDWQ49M5Ye3NpTndTpn
author: Yuren Hao <yurenh2@illinois.edu> 2026-07-03 05:56:50 -0500
committer: Yuren Hao <yurenh2@illinois.edu> 2026-07-03 05:56:50 -0500
commit: b83947778e2c776f757a07d4719b7ce961d7ed55 (patch)
tree: b9cc01d7adda691d9156d9d04f4fb2f644674e96 /ep_run/extracted_paper.txt
1 files changed, 2039 insertions, 0 deletions
diff --git a/ep_run/extracted_paper.txt b/ep_run/extracted_paper.txt
new file mode 100644
index 0000000..4f521d8
--- /dev/null
+++ b/ep_run/extracted_paper.txt
@@ -0,0 +1,2039 @@
+                                                      Photonic Exponential Approximation via Cascaded TFLN Microring Resonators
+                                                                                   toward Softmax
+                                                                                            Hyoseok Park¹ and Yeonsang Park¹,∗
+                                                                 ¹Department of Physics, Chungnam National University, Daejeon 34134, Republic of Korea
+                                                                                                   (Dated: March 26, 2026)
+                                                                  The rapid growth of large-scale AI models has intensified energy consumption and data-movement
+                                                               challenges in modern datacenters. Photonic accelerators offer a promising path by executing the linear
+                                                               matrix multiplications of transformer inference at high throughput and low energy. However, the
+                                                               softmax attention layer—which requires element-wise exponentiation followed by normalization—still
+                                                               relies on electronic post-processing, creating an electro-optic conversion bottleneck that negates much
+                                                               of the potential photonic advantage.
+arXiv:2603.12934v3 [physics.optics] 25 Mar 2026
+
+
+
+
+                                                                  We present a cascaded micro-ring resonator (MRR) architecture that synthesizes the per-channel
+                                                               exponential function required by softmax, e^{x_n − max(x)}, over a finite interval with tunable worst-case
+                                                               relative error. A control signal detunes each ring via an electro-optic mechanism; a weak probe
+                                                               at fixed frequency experiences Lorentzian transmission, and cascading N identical stages yields a
+                                                               multiplicative transfer function whose logarithm is approximately linear.
+                                                                  We derive mapping rules, depth-scaling estimates, and a minimax fitting formulation, and validate
+                                                               the framework with three-dimensional FDTD simulations of X-cut thin-film lithium niobate (TFLN)
+                                                               add-drop micro-ring resonators. Direct multi-ring FDTD validation extends to a five-ring cascade
+                                                               and confirms agreement with theory primarily over the upper operating range; deeper cascades and
+                                                               higher quality factors are assessed analytically. The cascade implements the per-channel exponential
+                                                               block—the key missing nonlinearity for photonic softmax. We further present a WDM-parallel
+                                                               chip architecture with closed-loop PI feedback that completes the full softmax—exponentiation,
+                                                               summation, and normalization—on a single photonic chip without per-channel normalization circuitry.
+
+
+∗ yeonsang.park@cnu.ac.kr; Corresponding author
+
+I.   INTRODUCTION
+
+   Transformer inference is often limited by power and memory traffic, motivating optical accelerators that
+exploit parallel propagation and multiplexing [1, 2, 4, 5, 7, 9]. Recent perspective articles also discuss
+data-center power consumption as one motivation for optical computing [3, 8]. While linear operators are
+comparatively amenable to photonic implementation [4–6], the softmax function used in attention layers
+requires an exponential mapping together with global normalization, both difficult to realize in passive
+photonic circuits, where transmission is fundamentally bounded by unity. Parallel digital-hardware studies
+treat the exponential/softmax stage as a bottleneck and propose dedicated approximations [11–19]. Many
+integrated-photonic classifier demonstrations still rely on electronic post-processing for the final
+nonlinear readout [10]; the resulting electro-optic conversion overhead can negate the throughput and energy
+benefits of the photonic front-end. Notably, the SOFTONIC architecture [11] explicitly argues that “the
+inability of MRRs and MZMs to handle SMA’s exponential and division functions” necessitates alternative
+approaches based on microdisk modulators and polynomial approximation, achieving 89.7% accuracy with a
+third-degree Chebyshev polynomial. Here we challenge this premise: we show that a passive Lorentzian cascade
+of microring resonators can be tuned so that its logarithm is approximately linear over a finite interval,
+enabling exponential-function synthesis with sub-2% worst-case error (an order of magnitude more accurate
+than SOFTONIC’s polynomial approach) while remaining compatible with integrated microring platforms [20–24].
+We term this cascade block an approximate exponential function (AEF) unit. We further propose a WDM-parallel
+architecture with a single PI feedback loop that realizes the complete softmax function, including summation
+and normalization, without per-channel electronic processing.
+   We extend the theoretical framework with three-dimensional FDTD simulations of a single X-cut TFLN
+add-drop micro-ring resonator. The simulated device parameters (quality factor, free spectral range, and
+electro-optic sensitivity) calibrate the cascade design parameters, bridging analytical fitting and
+physically realizable hardware. Two operating regimes emerge from this calibration: an FDTD-characterized
+regime with moderate drop-port depth (D_max ≈ 0.36), where the analytic error stays below ∼5% for N ≤ 7 but
+the power budget limits practical cascades to N ≤ 5; and a projected high-Q regime (D_max ≥ 0.95), enabling
+deeper cascades (N ≤ 30) with sub-percent error. Cascade performance is predicted analytically and validated
+by a five-ring cascade 3D FDTD simulation (Sec. IV).
+   The paper is organized as follows: Section II presents the mapping, transfer model, and depth-design
+rules; Section III provides numerical fits and validation; Section IV describes the single-ring TFLN device
+design and FDTD validation; Section V assesses physical feasibility including voltage requirements,
+insertion loss, and energy efficiency;
+Section VI discusses implementation scope, platform comparisons, and limits; and Section VII concludes.
+
+II.   MODEL AND DESIGN FRAMEWORK
+
+Target mapping. Let x = (x_1, ..., x_K) ∈ R^K be an arbitrary real-valued sequence (or vector). Directly
+generating exp(x_n) as a passive optical transmission is impossible in general because exp(x) grows beyond
+unity while a passive transmission satisfies 0 < T ≤ 1 [25]. However, for softmax,
+
+    \mathrm{softmax}(x)_n = \frac{e^{x_n}}{\sum_j e^{x_j}},    (1)
+
+a common shift cancels:
+
+    \frac{e^{x_n + c}}{\sum_j e^{x_j + c}} = \frac{e^{x_n}}{\sum_j e^{x_j}}    (\forall c \in \mathbb{R}).    (2)
+
+Thus it suffices to generate
+
+    e^{x_n - m},    m \equiv \max_j x_j,    (3)
+
+since the global factor e^m cancels.
+   To ensure a nonnegative control-signal amplitude, define
+
+    u_n \equiv x_n - m \le 0,    L \equiv -\min_n u_n = m - \min_n x_n \ge 0,    (4)
+
+and map each scalar to a nonnegative control-signal amplitude
+
+    I_n \equiv u_n + L \in [0, L].    (5)
+
+Then
+
+    e^{x_n - m} = e^{u_n} = e^{I_n - L}.    (6)
+
+Hence the optical design task is to realize, for I ∈ [0, L],
+
+    f(I) = e^{I - L} \in [e^{-L}, 1].    (7)
+
+Control–probe transfer. Consider a weak probe at fixed angular frequency ω_L. For the kth ring, let ω_{0,k}
+denote its resonance frequency and Γ > 0 its loaded half-width at half maximum (HWHM). Define the detuning
+
+    \Delta\omega_k \equiv \omega_L - \omega_{0,k}.    (8)
+
+Near resonance, the normalized Lorentzian transmission is modeled as [20, 21]
+
+    T_k(\Delta\omega_k) = \frac{1}{1 + \left( \Delta\omega_k / \Gamma \right)^2}.    (9)
+
+In a control–probe architecture, a nonnegative control-signal amplitude I ≥ 0 shifts the ring resonance.
+Here I denotes a generic control amplitude: for optical-pump operation it maps to optical intensity, while
+for EO operation it maps to electrical control level (e.g., voltage). Across many physical mechanisms
+(optical pump via Kerr/XPM, EO drive via Pockels effect, thermal, carrier tuning), the shift can be
+linearized on a working range [20, 26–30]:
+
+    \omega_{0,k}(I) = \omega_{0,k}^{(0)} + \eta I,    (10)
+
+where ω_{0,k}^{(0)} is the cold-cavity resonance and η is the control-to-resonance sensitivity. In practice,
+the control channel can be optical or electrical (optical pump, EO/Pockels drive, thermal, or carrier
+tuning); a quantitative EO feasibility example is given in the Discussion. With
+Δω_{0,k} ≡ ω_L − ω_{0,k}^{(0)}, the control-dependent detuning becomes
+
+    \Delta\omega_k(I) = \Delta\omega_{0,k} - \eta I.    (11)
+
+Define dimensionless parameters
+
+    a_k \equiv \frac{\Delta\omega_{0,k}}{\Gamma},    b \equiv -\frac{\eta}{\Gamma}.    (12)
+
+Then Eq. (9) yields the control-to-probe transfer of a single ring,
+
+    T_k(I) = \frac{1}{1 + (a_k + b I)^2}.    (13)
+
+Physical meaning: a_k is a static detuning in linewidth units (set by heater/carrier tuning/fabrication),
+and |b| is the normalized sensitivity magnitude (linewidths of resonance shift per unit control-signal
+amplitude); the sign convention is absorbed into the detuning expression. For “same-material/same-geometry”
+rings, b is often common, while a_k can be tuned per ring.
+Sign convention. Simultaneously flipping (a_k, b) ↦ (−a_k, −b) leaves T_k(I) unchanged, so we may take
+b > 0 without loss of generality.
+   Let N rings be cascaded in a serial add-drop topology: T_k(I) denotes the add-to-drop transmission of
+ring k, and the drop output of ring k feeds the add (input bus) port of ring k+1. Assuming the probe is
+sufficiently weak so the control channel dominates the resonance shift, the normalized probe output is the
+product
+
+    y(I) \equiv \frac{P_{\mathrm{out}}^{(\mathrm{probe})}(I)}{P_{\mathrm{in}}^{(\mathrm{probe})}}
+         = \prod_{k=1}^{N} T_k(I) = \prod_{k=1}^{N} \frac{1}{1 + (a_k + b I)^2}.    (14)
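The preprocessing map of Eqs. (3)-(7) and the cascade product of Eq. (14) can be sketched numerically. The following is a minimal illustration, not the authors' code: the function names `preprocess` and `cascade` are hypothetical, the input vector is the one quoted for the worked example in Sec. III, and the ring parameters are the N = 10 row of Table I, used here as an assumption about a reasonable operating point.

```python
import numpy as np

def preprocess(x):
    """Eqs. (3)-(5): shift by the maximum, then bias into [0, L]."""
    x = np.asarray(x, dtype=float)
    m = x.max()
    u = x - m            # u_n <= 0
    L = -u.min()         # L = m - min(x) >= 0
    I = u + L            # I_n in [0, L]
    return I, L

def cascade(I, a, b):
    """Eq. (14): product of Lorentzian stages 1 / (1 + (a_k + b I)^2)."""
    I = np.atleast_1d(np.asarray(I, dtype=float))
    a = np.asarray(a, dtype=float)
    u = a[:, None] + b * I[None, :]          # shape (N, number of inputs)
    return np.prod(1.0 / (1.0 + u ** 2), axis=0)

x = np.array([-3.2, 1.2, 4.8, -0.9])
I, L = preprocess(x)
target = np.exp(I - L)                       # Eq. (6): equals exp(x - max(x))

shifted = np.exp(x - x.max())
softmax_ref = shifted / shifted.sum()

# Assumed parameters: identical detuning, N = 10 row of Table I (fitted for L = 8).
a = np.full(10, -1.4588)
b = 0.10202
y = cascade(I, a, b)                         # unnormalized, 0 < y <= 1
softmax_cascade = y / y.sum()                # any global scale C cancels here
```

Because the normalization divides out any common factor, the sketch never needs the scale C; it only matters when comparing the cascade against the target pointwise.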
+
+
+[Figure 1 schematic. (a) Electronic Preprocessing: {x_n} → Find max: m = max(x_n) → Shift: u_n = x_n − m →
+Bias: I_n = u_n + L → Control In. (b) N-MRR Cascade: a probe at fixed ω_L passes through N EO-tuned stages
+(MRR #1 to #5 drawn). (c) Output: ỹ(I_n) ≈ exp(I_n − L) → exp(x_n − m), detected by a PD.]
+
+FIG. 1: Overview of the control–probe add-drop cascade N-MRR exponential block. (a) Electronic preprocessing
+maps an arbitrary input sequence {x_n} to a nonnegative control signal via m = max_n x_n, u_n = x_n − m, and
+I_n = u_n + L with L = m − min_n x_n. (b) The control signal I_n induces resonance shifts in a cascade of N
+rings, while a weak fixed-frequency probe propagates through the serial add-drop cascade (the drop output of
+each ring feeds the next stage), experiencing multiplicative transmission. (c) After photodetection, the
+block implements y(I_n) ≈ exp(I_n − L) ≈ exp(x_n − m), i.e., the normalized exponential used in softmax.
+
+
+To focus on the shape of the approximation, we allow a global scale factor C > 0:
+
+    \tilde{y}(I) \equiv C\, y(I).    (15)
+
+In softmax, p_n = C e^{I_n − L} / Σ_j C e^{I_j − L}, so C cancels between numerator and denominator and is
+physically inessential; nevertheless it is convenient for error analysis. For a fixed (N, b, {a_k}), the
+optimal C for the minimax log-error in Eq. (18) can be written in closed form. Let
+g(I) ≡ ln y(I) − (I − L) on [0, L]. Then the minimax-optimal shift is
+ln C⋆ = −(max_I g(I) + min_I g(I))/2, yielding E∞ = (max_I g(I) − min_I g(I))/2.
+   Taking logarithms,
+
+    \ln \tilde{y}(I) = \ln C - \sum_{k=1}^{N} \ln\!\left[ 1 + (a_k + b I)^2 \right].    (16)
+
+The target ln f(I) = I − L is linear; hence exponential approximation is equivalent to the
+log-linearization goal
+
+    \ln \tilde{y}(I) \approx I - L    \text{uniformly on } I \in [0, L].    (17)
+
+Error metric. Define the worst-case log-error on [0, L]:
+
+    E_\infty \equiv \sup_{I \in [0, L]} \left| \ln \tilde{y}(I) - (I - L) \right|.    (18)
+
+If E∞ ≤ ε_log, then for all I ∈ [0, L],
+
+    e^{-\varepsilon_{\log}} \le \frac{\tilde{y}(I)}{f(I)} \le e^{\varepsilon_{\log}}
+    \;\Rightarrow\; \left| \frac{\tilde{y}(I)}{f(I)} - 1 \right| \le e^{\varepsilon_{\log}} - 1.    (19)
+
+Thus achieving a prescribed worst-case relative error ε is guaranteed by
+
+    E_\infty \le \varepsilon_{\log} \equiv \ln(1 + \varepsilon) \approx \varepsilon.    (20)
+
+Depth scaling. We derive depth-related constraints and design rules for a prescribed approximation
+tolerance.
+Necessary slope condition. Differentiate Eq. (16):
+
+    \frac{d}{dI} \ln y(I) = -\sum_{k=1}^{N} \frac{2 b (a_k + b I)}{1 + (a_k + b I)^2}.    (21)
+
+Since |2u/(1 + u²)| ≤ 1 for all real u,
+
+    \left| \frac{d}{dI} \ln y(I) \right| \le N |b|.    (22)
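The closed-form optimal scale, the error conversion of Eqs. (18)-(20), and the slope bound of Eq. (22) can be checked on a grid. This is a sketch under stated assumptions: the parameter values (N = 10, a_k = −1.5, b = 0.1, L = 8) are illustrative choices, not values from the paper, and the supremum is approximated by a maximum over a uniform grid.

```python
import numpy as np

def log_cascade(I, a, b):
    """ln y(I) = -sum_k ln(1 + (a_k + b I)^2), i.e. Eq. (16) with C = 1."""
    u = np.asarray(a, dtype=float)[:, None] + b * np.asarray(I, dtype=float)[None, :]
    return -np.log1p(u ** 2).sum(axis=0)

def log_slope(I, a, b):
    """Analytic derivative d/dI ln y(I) from Eq. (21)."""
    u = np.asarray(a, dtype=float)[:, None] + b * np.asarray(I, dtype=float)[None, :]
    return -(2.0 * b * u / (1.0 + u ** 2)).sum(axis=0)

def optimal_scale(I, a, b, L):
    """Minimax-optimal ln C and the resulting worst-case log-error on the grid."""
    g = log_cascade(I, a, b) - (I - L)
    ln_c = -(g.max() + g.min()) / 2.0
    e_inf = (g.max() - g.min()) / 2.0
    return ln_c, e_inf

# Illustrative (assumed) parameters, not taken from the paper.
L, N, b = 8.0, 10, 0.1
a = np.full(N, -1.5)
I = np.linspace(0.0, L, 2001)

ln_c, e_inf = optimal_scale(I, a, b, L)
residual = ln_c + log_cascade(I, a, b) - (I - L)   # ln(y_tilde / f)
rel_err = np.abs(np.exp(residual) - 1.0)           # |y_tilde / f - 1|
slope = log_slope(I, a, b)
```

Centering the residual between its extremes is what makes the positive and negative deviations equal in magnitude; shifting ln C in either direction can only enlarge one of them.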
+
+The target ln f(I) = I − L has constant slope +1, so a necessary condition to track it is
+
+    N |b| \gtrsim 1.    (23)
+
+Near-optimal parameterization. The full design problem can be written as a minimax fit in the log
+domain [31]:
+
+    \min_{a_1, \ldots, a_N,\ \ln C} \; \sup_{I \in [0, L]} |r(I)|,
+    r(I) \equiv \ln C - \sum_{k=1}^{N} \ln\!\left[ 1 + (a_k + b I)^2 \right] - (I - L).    (24)
+
+This objective is permutation-invariant in the a_k’s (ring index k). In practice (and in numerical
+experiments reported below), the optimizer frequently collapses to a permutation-symmetric solution
+
+    a_1 = \cdots = a_N \equiv a,    (25)
+
+reducing the design to two parameters (a, b) (plus C). With Eq. (25),
+
+    \tilde{y}(I) = C\, y(I) = C \left[ \frac{1}{1 + (a + b I)^2} \right]^{N}.    (26)
+
+A robust initialization is obtained by placing the midpoint of the interval on the Lorentzian half-maximum
+flank and matching the slope:
+
+    a + b \frac{L}{2} \approx -1,    N b \approx 1.    (27)
+
+These two equations already yield a good design; a small (two-parameter) refinement can then enforce the
+desired worst-case tolerance.
+   Local expansion and depth scaling. A Taylor expansion of the log-domain residual around the
+flank-centered point I_0 = L/2 (with a + b I_0 = −1 and N b = 1) shows that the quadratic term vanishes
+identically, leaving a leading cubic residual r(δ) ∼ δ³/(6N²). Over I ∈ [0, L], this implies E∞ ∼ L³/N², so
+that achieving a prescribed tolerance ε_log requires N ∝ L^{3/2}/√ε_log, which explains the scaling used in
+Eq. (28). The full derivation is provided in Supplementary Sec. S0; an intuitive local-expansion summary
+appears in Sec. S1.
+   Practical engineering estimate. Given L and a target worst-case relative error ε, define
+ε_log = ln(1 + ε). A heuristic engineering estimate (not a rigorous bound) that matched our percent-level
+numerical designs is
+
+    N \approx \max\!\left( \frac{1}{b_{\max}},\; \kappa \frac{L^{3/2}}{\sqrt{\varepsilon_{\log}}} \right),    (28)
+
+where b_max is the physically achievable sensitivity bound and κ ≃ 0.07 for the identical-detuning flank
+design with a minimax refinement. After choosing N, set b = min(b_max, 1/N) and a = −1 − bL/2 as
+initialization, then refine (a, b) by a two-parameter minimax fit on [0, L].
+   A heuristic conservative screening bound N ≥ ⌈(L²/4 + 1/(2b²))/ln(1 + ε)⌉ (derived via the same
+local-expansion argument; see Supplementary Sec. S1) provides a quick upper estimate but is not a rigorous
+guarantee.
+
+III.   NUMERICAL FITS AND VALIDATION
+
+   We validate the analytical framework with minimax numerical fits and sampled robustness checks. Figure 2
+shows the fitted approximation quality at L = 8: the top (linear) panel plots N = 1, 3, 5, 7 over
+I ∈ [0, 20], the middle (log) panel compares N = 5, 10, 20, 30 on I ∈ [0, 8], and the bottom panel shows the
+pointwise relative error with the characteristic Chebyshev equioscillation pattern.
+   We fit identical-detuning cascades (Eq. 25) on I ∈ [0, L] and compare several depths using a minimax
+criterion.
+   Table I makes the accuracy–depth trade-off explicit at L = 8. A worked input-to-output example
+demonstrating the mapping from an arbitrary input sequence x = [−3.2, 1.2, 4.8, −0.9] through the cascade is
+provided in Supplementary Sec. S2. The example shows that the N = 10 cascade keeps the worst-case relative
+error below 2.7% across all channels.
+Empirical calibration. We calibrate the effective logit range L_eff from autoregressive Transformers
+(distilgpt2/gpt2) [1, 32–35] at context length 128, finding L_{eff,0.999} ≈ 7–9 at the 50th–90th percentiles
+(Supplementary Sec. S2). A clipping threshold t* = −12 keeps the p99 softmax error below 0.1%. Full protocol
+details, clipping-sweep tables/plots, and per-run statistics are provided in Supplementary Sec. S3.
+   A synthetic design-space map (Supplementary Table S3) shows that near L ≈ 8, moderate depth (N ≥ 10)
+reaches few-percent error, whereas L ≳ 12 requires deeper cascades. All fits follow the same pipeline:
+minimize the worst-case log-error on a uniform grid, initialize from the flank rules in Eq. (27), perform
+multi-start global search, and apply bounded local refinement; implementation details and scripts are
+provided in a public repository [36] (commit: 585e695).
+
+TABLE I: Depth comparison for L = 8 using fitted ỹ(I) = C[1 + (a + bI)²]^{−N} (same fitting pipeline for
+all N).
+
+N     a          b          max rel. err.   mean rel. err.
+5     −2.0789    0.21658    10.9%           6.43%
+10    −1.4588    0.10202    2.68%           1.65%
+20    −1.2135    0.05025    0.67%           0.42%
+30    −1.1392    0.03341    0.30%           0.19%
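The depth estimate of Eq. (28), the flank initialization of Eq. (27), and the Table I entries can be cross-checked with a short script. This is a sketch under stated assumptions, not the repository's fitting pipeline: `B_MAX = 1.0` is a hypothetical sensitivity bound, rounding the estimate up with `ceil` is a choice made here, and the worst-case error is evaluated on a uniform grid with the closed-form optimal scale C.

```python
import math
import numpy as np

KAPPA = 0.07   # value quoted in the text for the identical-detuning flank design
B_MAX = 1.0    # hypothetical sensitivity bound, chosen only for illustration

def depth_estimate(L, eps, b_max, kappa=KAPPA):
    """Heuristic Eq. (28): N ~ max(1/b_max, kappa * L^(3/2) / sqrt(ln(1 + eps)))."""
    eps_log = math.log1p(eps)
    return max(1.0 / b_max, kappa * L ** 1.5 / math.sqrt(eps_log))

def flank_init(L, N, b_max):
    """Initialization b = min(b_max, 1/N), a = -1 - b L / 2."""
    b = min(b_max, 1.0 / N)
    return -1.0 - b * L / 2.0, b

def max_rel_error(a, b, N, L, n_grid=4001):
    """Worst-case relative error of C [1 + (a + b I)^2]^(-N) with the optimal C."""
    I = np.linspace(0.0, L, n_grid)
    g = -N * np.log1p((a + b * I) ** 2) - (I - L)
    e_inf = (g.max() - g.min()) / 2.0
    return math.expm1(e_inf)

L = 8.0
TABLE_I = {5: (-2.0789, 0.21658), 10: (-1.4588, 0.10202),
           20: (-1.2135, 0.05025), 30: (-1.1392, 0.03341)}

n_needed = math.ceil(depth_estimate(L, eps=0.0268, b_max=B_MAX))
errors = {N: max_rel_error(a, b, N, L) for N, (a, b) in TABLE_I.items()}

a0, b0 = flank_init(L, 10, B_MAX)
err_init = max_rel_error(a0, b0, 10, L)   # before any minimax refinement
```

In this sketch the flank initialization alone gives a noticeably larger worst-case error than the fitted N = 10 row, which is consistent with the text treating Eq. (27) as a starting point for the two-parameter refinement.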
+
+                                                            TABLE II: Waveguide and ring parameters of the X-cut
+                                                             TFLN micro-ring resonator. Electro-optic electrode
+                                                                parameters are listed separately in Table III.
+
+                                                            Parameter                  Symbol       Value      Unit
+                                                            Total TFLN thickness       tTFLN         600       nm
+                                                            Etch depth                 tetch         500       nm
+                                                            Slab thickness             tslab         100       nm
+                                                            Waveguide width            w              1.4      µm
+                                                            Bend radius                R              20       µm
+                                                            Coupling gap               g             100       nm
+                                                            Circumference              Lring        125.7      µm
+                                                            Free spectral range        FSR          8.29       nm
+                                                            Effective index (TE0 )     neff         1.903      —
+                                                            Group index (TE0 )         ng            2.24      —
+                                                            Extraordinary index        ne           2.138      —
+
+
+
+                                                            IV.   TFLN SINGLE-RING DEVICE DESIGN AND
+                                                                          FDTD VALIDATION
+
+                                                                     A.    Waveguide and ring geometry
+
+
+                                                               The device is based on an X-cut thin-film lithium nio-
+                                                            bate (LiNbO3 ) on insulator wafer with a 600 nm-thick
+                                                            LiNbO3 film on SiO2 . A 500 nm-deep rib etch defines
+                                                            a 1.4 µm-wide single-mode waveguide with a 100 nm un-
+                                                            etched slab (Fig. 3). Lumerical MODE simulations yield
+                                                            neff = 1.903 and ng = 2.24 at λ = 1550 nm for the funda-
+                                                            mental TE0 mode.
+                                                               The ring resonator (R = 20 µm, Lring = 125.7 µm) is
+                                                            configured as an add-drop resonator with 100 nm coupling
+                                                            gaps (Fig. 4). The FDTD-measured free spectral range
+                                                            is FSR = 8.29 nm, which implies ng ≈ 2.30, slightly above
+                                                            the MODE value of 2.24 owing to bend-induced dispersion.
+
+
+
+
+FIG. 2: Minimax cascade fits at L = 8. (a) Linear scale:
+  shallow cascades (N = 1, 3, 5, 7) over I ∈ [0, 20]. The
+target e^{I−L} (black) is progressively better matched as N
+       increases. (b) Log scale: depth comparison
+    (N = 5, 10, 20, 30) on I ∈ [0, 8]. Inset zooms into
+  I ∈ [6, 8] showing convergence. (c) Pointwise relative
+  error showing the Chebyshev equioscillation pattern
+           characteristic of minimax optimality.
+                                                            FIG. 3: Cross-section of the X-cut TFLN rib waveguide
+                                                            on a SiO2 substrate. The 600 nm LiNbO3 film is etched
+                                                            500 nm to form a 1.4 µm-wide single-mode rib waveguide.
+                                                            Lateral signal (S) and ground (G) electrode positions are
+                                                               indicated; electrode design details are discussed in
+                                                                                    Sec. IV D.
+
+  Table II summarizes the waveguide and ring parame-
+ters.
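+
+  As a quick consistency check on Table II, the minimal Python sketch below recomputes the ring
+circumference and the free spectral range from the standard ring-resonator relation
+FSR ≈ λ0^2 /(ng Lring ). The function names are illustrative, and λ0 = 1550 nm is an assumed
+evaluation wavelength rather than a value stated for the FSR in the text.
+

```python
import math

def ring_circumference_um(radius_um: float) -> float:
    """Circumference of a circular ring, L = 2*pi*R, in micrometres."""
    return 2.0 * math.pi * radius_um

def fsr_nm(wavelength_nm: float, group_index: float, length_um: float) -> float:
    """Free spectral range FSR = lambda^2 / (n_g * L), returned in nm."""
    length_nm = length_um * 1e3
    return wavelength_nm ** 2 / (group_index * length_nm)

L_ring = ring_circumference_um(20.0)       # R = 20 um (Table II)
fsr_fdtd = fsr_nm(1550.0, 2.30, L_ring)    # FDTD-calibrated n_g ~ 2.30
fsr_mode = fsr_nm(1550.0, 2.24, L_ring)    # MODE-solver n_g = 2.24 (assumed lambda0 = 1550 nm)

print(f"L_ring = {L_ring:.1f} um")
print(f"FSR (n_g = 2.30) = {fsr_fdtd:.2f} nm")
print(f"FSR (n_g = 2.24) = {fsr_mode:.2f} nm")
```

+
+  With these inputs the circumference rounds to 125.7 µm, and the ng ≈ 2.30 estimate lies within
+about 0.03 nm of the FDTD value of 8.29 nm.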
+
+
+              B.   3D FDTD Methodology
+
+   The ring resonator response is simulated using Lumeri-
+cal 3D FDTD with conformal variant 1 meshing. A broad-
+band TE0 mode source (1530 nm to 1570 nm) is injected
+into the input bus waveguide, and through- and drop-port
+spectra are recorded. A “z-refined 3-fix” meshing strat-
+egy ensures convergence in the thin-film geometry [37];
+detailed simulation setup is provided in Supplementary
+Sec. S4 (Table S6).
+
+
+                                                              FIG. 5: Simulated through-port (blue) and drop-port
+                                                                 (red) transmission spectra of the single add-drop
+                                                              micro-ring resonator from 3D FDTD. Top: logarithmic
+                                                              scale; bottom: linear scale. Five resonances are visible
+                                                                               with FSR ≈ 8.29 nm.
+
+
+
+                                                              15,500, Dmax = 0.360); using the five-resonance mean
+                                                              would increase required voltages by ∼24% (see Table IV
+                                                              caption).
+                                                                 The simulation time of 50 ps exceeds the loaded pho-
+                                                              ton lifetime τL = QL λ0 /(2πc) ≈ 12.7 ps by ∼4×, but
+                                                              the intrinsic lifetime τi ≈ 32 ps is comparable, so the ex-
+                                                              tracted Qi may be slightly conservative. An independent
+                                                              eigenmode (FDE) analysis of the same cross-section at
+                                                              R = 20 µm—using a 300 × 300 mesh (∆y ≈ 10 nm, 5×
+  FIG. 4: Top view of the single add-drop micro-ring          finer than the FDTD vertical grid)—yields Qrad+leak =
+ resonator used in the 3D FDTD simulation. The ring           2.4 × 10^7 ; including bulk LiNbO3 absorption (Γ = 0.89)
+  waveguide (R = 20 µm, w = 1.4 µm) is evanescently           gives a theoretical Qi > 10^7 [37–42], confirming that
+  coupled to input and drop bus waveguides through            the gap between the numerical Qi and published values
+     100 nm gaps at coupling points CP1 and CP2.              (> 10^6 ) originates from mesh discretization
+                                                              (Supplementary S4.5, Table S8). In the CMT framework,
+                                                              Dmax = [2κ/(2κ+γ)]^2 increases as Qi rises; at the present
+                                                              coupling gap, increasing Qi to 10^6 would raise Dmax from
+                                                              0.36 to ∼0.95 and QL from 15,500 to ∼25,200.
+         C.    Single-Ring Add-Drop Results
+                                                                Figure 6(a) shows a Lorentzian fit to the best drop-
+   Figure 5 shows the through- and drop-port spectra from     port resonance at λ = 1566 nm, validating the cascade
+3D FDTD. Five resonances are resolved across 1530 nm          model (Eq. 9). Figure 6(b) demonstrates that cascading
+to 1570 nm with FSR = 8.29 nm (ng ≈ 2.30).                    N copies of this FDTD-extracted Lorentzian reproduces
+                                                              the target exponential e^{I−L} with increasing fidelity as N
+   Lorentzian fitting of the drop-port peaks yields QL =
+                                                              grows.
+10,300–15,500, with the best resonance at λ = 1566 nm
+reaching QL = 15,500 (FWHM = 101 pm, Dmax = 0.360,               To validate the cascade prediction directly, a five-
+−4.4 dB). The through-port extinction ratio is 1.6 dB to      ring cascade 3D FDTD simulation was performed us-
+2.6 dB, and the five-resonance mean is QL = 12,500 ±          ing Tidy3D [43]; the full simulation notebook is publicly
+1,800 (Dmax = 0.29–0.36). CMT analysis of the best            available [43]. The |E|^2 field at λ = 1549 nm [Fig. 6(d)]
+resonance gives Qi = QL /(1 − √Dmax ) = 15,500/0.400 ≈        confirms resonant excitation across all five rings.
+38,800, confirming that the 500 nm etch provides sufficient   Mapping the drop-port spectrum onto the control variable I
+confinement and that the 100 nm gap places the ring           yields 11 data points within the AEF operating range
+in the coupling-limited regime. The cascade analysis          [Fig. 6(e, f)], with the FDTD transmission closely tracking
+below adopts the best-case FDTD calibration (QL =             the N = 5 theoretical curve near I ≈ L = 8.
+
+
+
+
+FIG. 6: FDTD-based AEF validation. (a) Lorentzian fit to the drop-port resonance at λ = 1566 nm from 3D FDTD
+    (Lumerical) (QL = 15,500, Dmax = 0.360, bV = 0.180 V−1 ). (b) Five-ring cascade drop-port spectrum near
+ λ0 ≈ 1550 nm with Lorentzian^5 fit (red curve), confirming the expected T^5 line shape. (c) Five-ring cascade MRR
+layout with diagonal zigzag bus waveguides. (d) |E|^2 field profile at λ = 1549 nm from a five-ring cascade 3D FDTD
+    simulation (Tidy3D [43]). (e, f ) AEF validation of the five-ring cascade on log (e) and linear (f) scales with
+                                          11 spectral FDTD data points.
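+
+  The coupled-mode relations quoted in Sec. IV C can be checked numerically. The Python sketch below
+assumes a symmetric add-drop ring, for which Dmax = (1 − QL /Qi )^2, and assumes that the coupling
+rate stays fixed when Qi is raised. The helper names and the choice λ0 = 1550 nm are illustrative
+and are not taken from the paper.
+

```python
import math

C_LIGHT = 299_792_458.0  # speed of light, m/s

def intrinsic_q(q_loaded, d_max):
    """Q_i = Q_L / (1 - sqrt(D_max)) for a symmetric add-drop ring."""
    return q_loaded / (1.0 - math.sqrt(d_max))

def photon_lifetime_ps(q, wavelength_nm):
    """tau = Q * lambda0 / (2*pi*c), in picoseconds."""
    return q * wavelength_nm * 1e-9 / (2.0 * math.pi * C_LIGHT) * 1e12

def project(q_loaded, d_max, q_i_new):
    """Loaded Q and D_max after raising Q_i while the coupling gap is unchanged."""
    q_i = intrinsic_q(q_loaded, d_max)
    inv_q_coupling = 1.0 / q_loaded - 1.0 / q_i   # both bus couplers combined
    q_l_new = 1.0 / (inv_q_coupling + 1.0 / q_i_new)
    d_new = (1.0 - q_l_new / q_i_new) ** 2
    return q_l_new, d_new

q_i = intrinsic_q(15_500, 0.360)
tau_l = photon_lifetime_ps(15_500, 1550.0)   # assumed lambda0 = 1550 nm
tau_i = photon_lifetime_ps(q_i, 1550.0)
q_l_hi, d_hi = project(15_500, 0.360, 1e6)

print(f"Q_i = {q_i:.0f}")
print(f"tau_L = {tau_l:.1f} ps, tau_i = {tau_i:.1f} ps")
print(f"At Q_i = 1e6: Q_L = {q_l_hi:.0f}, D_max = {d_hi:.3f}")
```

+
+  Under these assumptions the sketch returns values consistent with the quoted Qi ≈ 38,800,
+QL ≈ 25,200, and Dmax ≈ 0.95.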
+
+     D.   X-cut electrode design and EO parameters               TABLE III: Electro-optic electrode parameters for the
+                                                                X-cut TFLN micro-ring with lateral S–G arc electrodes.
+   We employ lateral signal–ground (S–G) arc electrodes
+on the slab surface alongside the ring waveguide (Fig. 7).      Parameter                      Symbol    Value          Unit
+In the X-cut orientation, the crystal Z-axis is at 45◦ from     Crystal orientation            —         X-cut          —
+the horizontal in the substrate plane, giving a lateral-        EO coefficient                 r33       30.9           pm V−1
+field projection proportional to cos(θ − 45◦ ) at azimuthal     EO fill factor                 fEO    1/π ≈ 0.318       —
+angle θ. The cos(θ − 45◦ ) = 0 boundaries at θ = 135◦           EO overlap factor              ΓEO        0.7           —
+and 315◦ naturally separate the coupling regions from           Electrode gap                  gel         5            µm
+                                                                Effective electrode distance   deff       2.5           µm
+the electrode regions. Each ring carries a full semicir-
+cular arc electrode on the side opposite to its coupling
+points, engaging the large r33 = 30.9 pm V−1 Pockels co-
+efficient [37, 38]. The effective EO fill factor follows from   ized voltage sensitivity is (Supplementary Sec. S4; here
+integrating | cos(θ − 45◦ )| over the semicircle:               dλ/dV = 28.5 pm/V is the straight-section value and
+                     f_{EO} = \frac{1}{\pi} \approx 0.318    (29)        fEO accounts for partial electrode coverage of the ring
+                                                                circumference):
+(see Supplementary Sec. S4 for derivation). The electrode
+gap is gel = 5 µm (deff ≈ 2.5 µm), and the electro-optic        b_V = \frac{2Q\,(d\lambda/dV)}{\lambda_0}\, f_{EO} \approx 0.182\ \mathrm{V}^{-1}    (30)
+overlap integral is ΓEO = 0.7. Table III lists the electrode
+parameters.
+                                                                at QL = 15,500. This estimate relies on a first-order
+                                                                electrostatic model (deff ≈ 2.5 µm, ΓEO = 0.7); a ±30%
+                                                                variation in bV would shift the cascade depth by one to
+                                                                two rings at constant εmax (Table IV), leaving the quali-
+                                                                tative design conclusions unchanged. With the cascade
+                                                                framework of Sec. II (Eqs. 14–18), the N -ring drop-port
+                                                                transmission ỹ(I) = C [1 + (a + bI)^2 ]^{−N} approximates
+                                                                e^{I−L} over I ∈ [0, L], with (a, b) optimized by minimax
+                                                                fitting for each N .
+                                                                   Table IV presents the optimization results for the
+                                                                standard dynamic range L = 8 (e^8 ≈ 2981, 34.7 dB).
+
+                                                                TABLE IV: Cascade optimization results for L = 8. The
+                                                                   bias voltage Vbias = |a|/bV sets the DC offset, and
+                                                                Vctrl = bL/bV is the maximum control voltage at I = L.
+                                                                   Voltages computed with bV = 0.182 V−1 (X-cut arc
+                                                                electrode, FDTD-calibrated best resonance QL = 15,500,
+                                                                 ng = 2.30). The mean FDTD quality factor across five
+FIG. 7: Illustrative two-ring cascade layout showing the        resonances is QL = 12,500 ± 1,800; using the mean would
+lateral S–G arc electrode placement on X-cut TFLN (the                         increase voltages by ∼24%.
+cascade design extends to N rings; this two-ring example
+  clarifies the electrode geometry). The crystal Z-axis is      N     a       b     E∞ εmax (%) Vbias (V) Vctrl (V)
+   oriented at 45◦ from the horizontal in the substrate          5 −2.0789 0.21658 0.1035 10.91   11.4       9.5
+plane. The cos(θ − 45◦ ) = 0 boundaries at θ = 135◦ and         10 −1.4588 0.10202 0.0265  2.68    8.0       4.5
+   315◦ naturally separate the bus-waveguide coupling           12 −1.3731 0.08450 0.0184  1.86    7.5       3.7
+regions from the electrode semicircles: each ring carries a     20 −1.2136 0.05025 0.0067  0.67    6.7       2.2
+                                                                25 −1.1685 0.04013 0.0043  0.43    6.4       1.8
+full semicircular arc electrode on the side opposite to its
+                                                                30 −1.141 0.03340 0.0030   0.30    6.3       1.5
+ coupling points. The resulting effective EO fill factor is     32 −1.1301 0.03131 0.0026  0.26    6.2       1.4
+                      fEO = 1/π ≈ 0.318.
+                                                                a The complete cascade optimization results for all N values are
+
+                                                                  listed in Supplementary Table S7.
+
+
+E.    FDTD-Calibrated bV and Cascade Optimization                 The approximation quality across different cascade
+                                                                depths is shown in Fig. 2 (Sec. III). Key thresholds (e.g.,
+  From the device parameters in Tables II and III and           ε < 2% at N ≥ 12, ε < 1% at N ≥ 17) and the complete
+the FDTD-calibrated ng ≈ 2.30, the effective normal-            optimization results are listed in Supplementary Sec. S4.
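+
+  Equation (30) and the voltage columns of Table IV can be reproduced with a few lines of Python.
+This is a minimal sketch: λ0 = 1550 nm is an assumption (the best resonance sits at 1566 nm), the
+function names are illustrative, and the voltages use the rounded value bV = 0.182 V−1 quoted in
+the Table IV caption.
+

```python
import math

def b_v(q_loaded, dlam_dv_pm, f_eo, wavelength_nm):
    """Eq. (30): b_V = 2 Q (dlambda/dV) f_EO / lambda0, in 1/V."""
    dlam_dv_nm = dlam_dv_pm * 1e-3          # pm/V -> nm/V
    return 2.0 * q_loaded * dlam_dv_nm * f_eo / wavelength_nm

def voltages(a, b, b_v_per_volt, L=8.0):
    """Table IV definitions: V_bias = |a| / b_V and V_ctrl = b L / b_V."""
    return abs(a) / b_v_per_volt, b * L / b_v_per_volt

# Straight-section tuning 28.5 pm/V, f_EO = 1/pi, assumed lambda0 = 1550 nm
bv_est = b_v(15_500, 28.5, 1.0 / math.pi, 1550.0)
print(f"b_V from Eq. (30): {bv_est:.3f} 1/V")

# (N, a, b) rows copied from Table IV
rows = [(5, -2.0789, 0.21658), (12, -1.3731, 0.08450), (30, -1.141, 0.03340)]
table = {n: voltages(a, b, 0.182) for n, a, b in rows}
for n, (v_bias, v_ctrl) in table.items():
    print(f"N = {n:2d}: V_bias = {v_bias:.1f} V, V_ctrl = {v_ctrl:.1f} V")
```

+
+  The N = 12 row evaluates to Vbias = 7.5 V and Vctrl = 3.7 V, matching Table IV.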
+
+             V.    PHYSICAL FEASIBILITY                          TABLE V: Two-regime power budget for the MRR
+                                                                       cascade. Pout assumes per-channel input
+  Having established the cascade approximation theory           Pin,ch = 100 µW (from a shared Pin,tot = 1 mW CW
+(Sec. II) and the FDTD-calibrated device parameters            laser split across M = 10 parallel channels via a 1×M
+(Sec. IV), we now assess the physical feasibility of the      splitter, or equivalently multiplexed as d WDM channels
+proposed architecture in terms of voltage requirements,       sharing a single cascade) and accounts only for the ideal
+insertion loss, and energy efficiency.                        on-resonance cascade transmission Dmax^N (upper bound);
+                                                                additional inter-ring coupling loss (ηcoupling ≈ 0.9 per
+                                                               stage, ∼0.46 dB/stage) and off-resonance propagation
+       A.     Electro-optic voltage requirements                 loss (0.08–0.25 dB/stage) are analyzed separately in
+                                                                                        Sec. V C.
+  For the primary target of ε < 2% (N = 12), minimax
+optimization gives a = −1.373, b = 0.0845. With the           Regime        Dmax    N   Dmax^N         (dB)    Pout      εmax
+FDTD-calibrated QL = 15,500 (bV = 0.182 V−1 ), the            I (FDTD)      0.36    3   0.0467         −13.3   4.67 µW   ∼15%
+required voltages are                                         I (FDTD)      0.36    5   0.00605        −22.2   0.61 µW   10.9%
+                                                              I (FDTD)      0.36    7   7.84 × 10^−4   −31.1   78 nW     ∼5%
+                                                              II (high-Q)   0.95   10   0.599          −2.2    59.9 µW   2.68%
+                                                              II (high-Q)   0.95   20   0.358          −4.5    35.8 µW   0.67%
+                                                              II (high-Q)   0.95   30   0.215          −6.7    21.5 µW   ∼0.30%
+                                                              Regime I: FDTD-characterized (Qi = 38,800). Regime II:
+                                                              fabricated high-Q (Qi > 10^6 ). Pout scales linearly with Pin,ch .
+
+  V_{bias} = \frac{|a|}{b_V} = \frac{1.373}{0.182} = 7.5\ \mathrm{V},      (31)
+
+  V_{ctrl,max} = \frac{bL}{b_V} = \frac{0.0845 \times 8}{0.182} = 3.7\ \mathrm{V}.      (32)
+
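+  The Dmax^N, dB, and Pout columns of Table V follow from simple arithmetic, sketched below in
+Python. The sketch covers only the ideal on-resonance product (no inter-ring coupling or
+propagation loss) and assumes Pin,ch = 100 µW as in the Table V caption; the function name is
+illustrative.
+

```python
import math

def cascade_budget(d_max, n_rings, p_in_uw=100.0):
    """Ideal on-resonance cascade: transmission D_max**N, its dB value, and P_out in uW."""
    transmission = d_max ** n_rings
    loss_db = 10.0 * math.log10(transmission)
    return transmission, loss_db, p_in_uw * transmission

# (D_max, N) pairs of Table V: regime I (FDTD) and regime II (high-Q)
cases = [(0.36, 3), (0.36, 5), (0.36, 7), (0.95, 10), (0.95, 20), (0.95, 30)]
budget = {case: cascade_budget(*case) for case in cases}
for (d_max, n), (t, db, p_out) in budget.items():
    print(f"D_max = {d_max:.2f}, N = {n:2d}: T = {t:.3g}, {db:6.1f} dB, P_out = {p_out:.3g} uW")
```

+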
+Since bV ∝ Q, voltage scales inversely with quality factor:
+
+                                                              independent evidence that intrinsic quality factors in
+                                                              the projected range are physically achievable in TFLN,
+  V_{ctrl} = \frac{bL}{b_V} = \frac{bL\,\lambda_0}{2Q\,|d\lambda_0/dV|} .      (33)
+                                                              albeit with wider waveguides and larger ring radii than the
+CMOS-compatible control voltages (Vctrl < 3.3 V) are          present design. Transferring comparable sidewall quality
+achievable at N ≥ 14 with QL = 15,500; at the design          to our geometry (R = 20 µm, w = 1.4 µm) is an open
+point N = 30 (εmax = 0.30%), Vctrl = 1.47 V.                  fabrication challenge; the projections should be read as
+                                                              design targets contingent on achieving it.
+                                                                 The total insertion loss comprises on-resonance
+       B.     Power budget: two-regime analysis               cascade transmission Dmax^N , inter-ring coupling loss
+                                                              (∼0.46 dB/stage for the present diagonal-bus layout),
+   The on-resonance cascade transmission Dmax^N is the       off-resonance propagation loss (0.08–0.25 dB/stage), and
+dominant contribution to total insertion loss. Table V        fiber-to-chip coupling (1.5–3.0 dB). For the fabricated
+presents two regimes: the FDTD-characterized regime           high-Q regime (N = 30), the total ranges from ∼13 dB
+(Dmax = 0.36) and the fabricated high-Q regime (Dmax =        (optimized layout) to ∼24 dB (current geometry); see
+0.95, achievable with Qi > 10^6 and gap-optimized             Supplementary Sec. S6 for detailed scenarios.
+coupling).
+   In the FDTD-characterized regime, Dmax = 0.36 limits
+practical cascades to N ≤ 5: at N = 5 the output is                             D.    Energy comparison
+0.61 µW (−22.2 dB) with ε = 10.9%, suited for proof-
+of-concept validation. In the fabricated high-Q regime           For N = 30 X-cut TFLN micro-ring resonators in the
+(Dmax ≥ 0.95), deep cascades become practical: N = 30         fabricated high-Q regime (QL ≈ 25,200 at Qi = 10^6 ;
+yields Pout = 21.5 µW (−6.7 dB) with εmax ≈ 0.30%.            Supplementary Sec. S5), the three energy components are EO
+The transition to fabricated high-Q devices is therefore      tuning (EEO = 0.22 pJ), amortized laser (Elaser = 0.07 pJ,
+critical for achieving both high accuracy and sufficient      shared across M = 10 channels), and photodetector
+output power.                                                 (EPD = 0.50 pJ), yielding Ephotonic = 0.79 pJ (deriva-
+                                                              tions in Supplementary Sec. S7). Including thermal stabi-
+                                                              lization for N = 30 rings (0.15–0.60 pJ; Supplementary
+                   C.    Feasibility outlook                  Sec. S7), the total rises to 0.94–1.39 pJ.
+                                                                 Table S12 compares the photonic cascade with digital
+  Published TFLN micro-ring resonators achieve Qi ≥           implementations. Including thermal stabilization (0.94–
+10^6 –10^8 using optimized fabrication [39–42]. At Qi =       1.39 pJ), the advantage over INT8 (2.3 pJ) is 1.7–2.4× at
+10^6 with the present coupling geometry, CMT predicts         10 GHz bandwidth; excluding thermal stabilization (0.79 pJ),
+Dmax ≈ 0.95 and QL ≈ 25,200 (Supplementary Sec. S5,           the energy is 58× below digital FP32 (46 pJ). At fabricated Q ≥ 30,000, EEO
+Tables S4–S7), enabling deep cascades (N ≤ 30) with           drops to 0.16 pJ and Etotal ≈ 0.73 pJ (excluding thermal;
+sub-percent error. The literature values provide strong       Supplementary Table S11), recovering a 3.2× advantage
+
+     TABLE VI: Energy per exponential operation:                    with a distinct FSR order of the same ring set, traverse a
+            single-channel comparison.                              single N -ring cascade simultaneously (Fig. 8). Because
+                                                                    each channel λj sees its own Lorentzian detuning set by
+ Implementation           E/op (pJ)   Bandwidth   Notes            an independent control voltage Vj , the cascade output
+ Digital FP32 (Taylor)    ∼46         1 GHz       10 FP MACs       per channel is ỹj = C ∏_{k=1}^{N} Tdrop,k (λj , Vj ) ≈ e^{Vj} , and all
+ Digital INT8 (Taylor)    ∼2.3        1 GHz       10 INT MACs      d exponentials are computed in parallel on the same
+ Photonic MRR (N = 30)    0.94–1.39   10 GHz      Analog†          physical waveguide. Compared with a 1×M power-splitter
+    † 0.79 pJ excluding thermal; 0.94–1.39 pJ including thermal.    architecture that replicates the cascade for each channel,
+ Self-consistent with fabricated high-Q regime (QL = 25,200); see   the WDM approach reduces the total ring count from
+                       Supplementary Sec. S7.                       N × d to N (a factor-d saving) and eliminates the splitter
+                                                                    insertion loss (10 log10 d dB). At the output, a WDM
+                                                                    demultiplexer or wavelength-selective photodetector array
+over INT8. Since EEO ∝ 1/Q2 , improving Q beyond                    separates the channels for electrical readout. Figure 8
+∼30,000 yields diminishing energy returns but continues             shows a representative chip layout for N = 5 cascade
+to relax CMOS driver voltage requirements.                          stages and d = 8 WDM channels, where alternating U-
+                                                                    turn bus connections route the drop-port output of each
+                                                                    stage into the input bus of the next.
+                      VI. DISCUSSION                                   Why cascade helps. A single Lorentzian in I is too
+                                                                    rigid to mimic the log-linear target over a wide interval.
+   Practical design procedure. For a given input se-                Cascading turns the transfer into a product; taking a
+quence x = (x1 , . . . , xK ), the design proceeds as follows:      logarithm gives a sum of smooth terms, and the approx-
+                                                                    imation improves as N increases. The slope constraint
+    1. Compute m = maxn xn , un = xn − m, and L =                   N |b| ≳ 1 is an immediate feasibility check.
+         − minn un .                                                   Global softmax normalization via WDM feed-
+    2. Map to nonnegative control-signal amplitudes: In =           back.   The WDM-parallel architecture (Fig. 8) integrates
+         un + L ∈ [0, L].                                           naturally   with a closed-loop normalization scheme to com-
+                                                                    plete the full softmax function. After the N -stage cascade,
+    3. Choose tolerance ε and set εlog = ln(1 + ε).                 a WDM demultiplexer (e.g., arrayed-waveguide grating or
+                                                                    ring-filter bank) routes each channel λj to a dedicated
+    4. Select a physically feasible bmax and estimate N             photodetector, producing photocurrents Iλj ∝ ỹj ≈ C Pin e^{Vj} .
+         using Eq. (28).                                            The d photocurrents are summed electrically:
+   5. Initialize b = min(bmax , 1/N ) and a = −1 − bL/2,
+      then refine (a, b) by a two-parameter minimax fit if
+      required.
+                                                                  S = \sum_{j=1}^{d} I_{\lambda_j} \propto C P_{in} \sum_{j=1}^{d} e^{V_j} .     (35)
+
+   6. The optical block yields ỹ(In ) ≈ e^{xn −m} , and          A proportional–integral (PI) controller compares S with
+      softmax weights follow as                                   a fixed reference Sref and adjusts the shared WDM laser
+                                                                  power Pin so that S → Sref [44, 45]. Because all d channels
+                                                                  share the same probe source, scaling Pin multiplies every
+      p_n = \frac{\tilde{y}(I_n)}{\sum_j \tilde{y}(I_j)} .   (34)     ỹj by the same factor; upon convergence
+                                                                  p_j = \frac{I_{\lambda_j}}{S_{ref}} = \frac{e^{V_j}}{\sum_{k=1}^{d} e^{V_k}} = \mathrm{softmax}(V)_j ,     (36)
+   Scope and limits. The approximation is for a finite
+interval I ∈ [0, L], where L is determined by the
+input batch via Eq. (4). In practice, one designs for a           realizing the complete softmax with a single feedback loop
+worst-case L expected in operation (or retunes a and              and no per-channel normalization circuitry. Compared
+rescales the control signal to adapt L). Noise, insertion         with the replicated-cascade approach (one AEF block per
+loss, and control-induced parasitics limit accuracy and           channel), WDM feedback offers two additional benefits:
+dynamic range; we treat these effects as platform-specific        (i) the splitter-induced power imbalance that would bias
+margins. Detailed non-ideality assumptions, parameter             the Iλj ratios is absent, since all channels traverse the
+distributions, and robustness statistics are reported in          same optical path; and (ii) a single laser control point
+Supplementary Sec. S8. With K channels in parallel,               replaces d independent probe adjustments. Design de-
+one can form softmax by summing channel powers and                tails and stability analysis of the PI loop are provided in
+applying a shared reciprocal scale factor, depending on           Supplementary Sec. S9.
+the chosen mixed-signal normalization scheme.                        Beyond ring-resonator AEF implementations, the same
+   WDM parallelism. A particularly hardware-efficient             cascade principle can be extended to other cavity-based
+realization exploits wavelength-division multiplexing             photonic platforms, such as serial 1D photonic-crystal cav-
+(WDM): d probe wavelengths λ1 , . . . , λd , each resonant        ities and other cascaded resonant architectures [21, 46].
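+
+  The design procedure above can be sketched end to end in Python. This is an illustration only: it
+uses the N = 30 row of Table IV (a = −1.141, b = 0.03340) for a fixed design range L = 8, omits the
+constant C because it cancels in Eq. (34), ignores noise and insertion loss, and uses hypothetical
+function names.
+

```python
import math

def cascade_response(i_ctrl, a, b, n_rings):
    """Unnormalized N-ring drop response [1 + (a + b I)^2]^(-N); the constant C is omitted."""
    return (1.0 + (a + b * i_ctrl) ** 2) ** (-n_rings)

def photonic_softmax(x, a=-1.141, b=0.03340, n_rings=30, l_design=8.0):
    """Steps 1, 2 and 6 of the procedure for a cascade designed for a worst-case L = l_design."""
    m = max(x)
    u = [xn - m for xn in x]                  # step 1: shifted inputs, u_n <= 0
    if -min(u) > l_design:
        raise ValueError("input range exceeds the design dynamic range L")
    i_ctrl = [un + l_design for un in u]      # step 2: control amplitudes I_n in [0, L]
    y = [cascade_response(i, a, b, n_rings) for i in i_ctrl]
    total = sum(y)
    return [yn / total for yn in y]           # step 6: normalization, Eq. (34)

def exact_softmax(x):
    """Reference softmax with the usual max-shift for numerical stability."""
    m = max(x)
    e = [math.exp(xn - m) for xn in x]
    s = sum(e)
    return [en / s for en in e]

x = [2.0, 1.0, 0.1, -3.0]                     # example logits (arbitrary)
approx = photonic_softmax(x)
exact = exact_softmax(x)
max_err = max(abs(p - q) for p, q in zip(approx, exact))

print("approx:", [round(p, 4) for p in approx])
print("exact: ", [round(q, 4) for q in exact])
print(f"max abs error = {max_err:.2e}")
```

+
+  For this example input the cascade weights agree with the exact softmax to better than 10^−2 in
+absolute probability; inputs whose range exceeds the design L are rejected rather than clipped.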
+
+What these platforms share is transfer-function shaping          TABLE VII: Summary of evidence levels for key claims.
+through cascaded resonances; loss, tuning range, fabrica-
+tion tolerance, and calibration overhead remain platform-        Claim                              Evidence       Sec.
+dependent.                                                       Cascade → exp. approx.             Analytic        II
+    The insertion loss budget (Sec. V C) and electro-optic       Depth scaling                  Analytic + num. II, III
+voltage requirements (Sec. V A) suggest that the cas-            QL , Dmax , bV                    3-D FDTD         IV
+cade architecture is feasible under optimized coupling           5-ring line shape                 3-D FDTD         IV
+and layout conditions. Using monolithic TFLN microring           N ≤ 30 deep cascade              CMT proj.∗         V
+                                                                 Energy < 1 pJ                      Estimate        V
+data from Bahadori et al. [47] (Q ≈ 5432, dλ0 /dV ≈
+9–20 pm/V), the normalized sensitivity is bV ≃ 0.063–            Full softmax (WDM + feedback) Conceptual + layout VI
+0.14 V−1 , within the range required by the cascade design.      ∗ Based on published Qi ≥ 10^6 values [39, 42] and CMT coupling
+                                                                   model.
+Crystal orientation and electrode design. The X-cut TFLN platform was chosen for several reasons. First,
+X-cut is the prevailing industry standard for integrated TFLN modulators, with well-established fabrication
+processes and commercial wafer availability [37, 38]. Second, the TE0 mode, which is strongly confined in the rib
+waveguide geometry, can engage the large r33 coefficient via lateral electric fields aligned with the crystal Z-axis.
+In contrast, Z-cut geometry with TE polarization can only access the smaller r13 coefficient (∼10 pm/V), resulting
+in significantly lower electro-optic efficiency. The arc electrode design (Sec. IV D) addresses the phase-cancellation
+problem inherent to X-cut circular rings [47] by orienting the crystal Z-axis at 45° from the horizontal in the
+substrate plane. This rotation places the cos(θ − 45°) = 0 boundaries at θ = 135° and 315°, naturally separating the
+bus-waveguide coupling regions from the electrode regions. Each ring carries a full semicircular arc electrode on the
+side opposite to its coupling points, yielding an effective fill factor f_EO = 1/π ≈ 0.318. While this reduces the
+round-trip EO efficiency compared to a hypothetical full-circumference design, it preserves the compact footprint
+of a circular ring resonator. The cascade performance can be further improved beyond the R = 20 µm circular-ring
+design presented here. Increasing the ring radius reduces bending loss and raises the intrinsic quality factor
+Q_i, which directly increases b_V (∝ Q) and lowers the required control voltage. Alternatively, adopting a racetrack
+geometry with extended straight coupling sections strengthens the bus–ring coupling, pushing the drop-port
+maximum D_max closer to critical coupling and improving the per-stage transfer efficiency. Either approach, or their
+combination, would yield higher b_V and D_max, enabling lower N or tighter approximation accuracy at reduced
+operating voltages.
+Fabrication considerations. The X-cut TFLN rib waveguide (600 nm total thickness, 500 nm etch, w = 1.4 µm)
+follows established fabrication processes for commercial TFLN wafers on SiO2 [37, 38]. The lateral signal–ground
+(SG) electrode configuration is fabricated in a single metal layer, which is standard in TFLN foundry processes.
+The primary fabrication challenge for the cascade architecture is maintaining uniform coupling gaps (g = 100 nm)
+across N rings to ensure identical Lorentzian transfer functions. Post-fabrication trimming via UV exposure or
+localized thermal oxidation can compensate residual detuning variations [30], as quantified in the Monte Carlo
+robustness analysis (Supplementary Sec. S8). Monte Carlo simulations (Supplementary Sec. S8) show that under
+nominal non-ideality levels (σ_a = 0.020, σ_b,rel = 0.020), a single-point calibration of C per chip keeps the median
+softmax KL divergence below 2.2 × 10^−4, with 95th-percentile max probability error under 0.32%. Even under
+stress conditions (σ_a = 0.032), 95th-percentile errors remain below 0.42%, demonstrating that the identical-detuning
+design is robust to realistic fabrication variations provided a per-chip calibration step is performed. Conversely,
+if coupling gaps are intentionally varied across rings, the per-ring parameters (a_k, b_k) become independent degrees
+of freedom. A Taylor-expansion analysis shows that K non-identical rings can cancel curvature terms up to order 2K
+in the Taylor series of g(I) = Σ_k ln T_k, one order higher than K identical rings, so that fewer rings suffice for a
+given error target.
+
+
+
+
+[Figure 8 graphic: "Softmax Full Chip Layout – N = 5 × d = 8 (TFLN)". Labels: WDM input λ1–λ8 with power P_in;
+ d = 8 WDM channels with bias voltages V_λ1–V_λ8; N = 5 cascade stages (n = 1 to 5); WDM demux (AWG / ring filter);
+ photodiodes PD1–PD8 with outputs p1–p8; summation Σ, reference S_ref, error e, and PI block; feedback adjusts P_in;
+ output p_j = I_λj/S_ref = softmax(V)_j.]
+
+FIG. 8: WDM-parallel MRR-AEF system with closed-loop softmax normalization (N = 5 cascade stages, d = 8 WDM
+channels) on X-cut TFLN. A single WDM source (λ1–λ8) enters the top input bus waveguide; each stage applies a
+Lorentzian drop-port transfer, and alternating U-turn connections route the drop-port output into the next stage’s
+input bus. Per-channel EO bias voltages (V_λ1–V_λ8) independently tune each column of rings. The final drop output
+passes through a WDM demultiplexer (AWG / ring filter) and is detected by a PD array, producing per-channel
+photocurrents I_λj ∝ e^{V_j}. The photocurrents are summed (Σ) and compared with a reference S_ref; a PI controller
+adjusts the shared laser power P_in until S = S_ref, at which point each PD output directly yields
+p_j = I_λj/S_ref = softmax(V)_j (Eq. 36).
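+
+The closed-loop normalization described in the caption can be sketched numerically. The example below is a minimal
+model, not the authors' controller: it assumes an ideal exponential block (I_j = P_in · e^{V_j − m} with m = max V)
+in place of the ring cascade, and uses an incremental discrete-time PI update with illustrative gains kp and ki
+chosen to be stable for d = 8 channels. The function name and all numerical values are hypothetical.
+
```python
import math

def closed_loop_softmax(v, s_ref=1.0, kp=0.02, ki=0.1, steps=500):
    """Discrete PI loop acting on the shared input power P_in (illustrative gains)."""
    m = max(v)
    # ASSUMPTION: ideal exponential block, I_j = P_in * exp(V_j - m).
    gains = [math.exp(x - m) for x in v]
    p_in = 0.0
    prev_err = s_ref - p_in * sum(gains)
    for _ in range(steps):
        total = p_in * sum(gains)                  # summed photocurrent S
        err = s_ref - total                        # comparison with S_ref
        p_in += kp * (err - prev_err) + ki * err   # incremental PI update
        prev_err = err
    # Once S = S_ref, each normalized photocurrent is a softmax probability.
    return [p_in * g / s_ref for g in gains]

v = [0.3, -1.2, 2.0, 0.0, 1.1, -0.5, 0.7, 1.9]   # d = 8 example inputs
p = closed_loop_softmax(v)
print([round(x, 4) for x in p], sum(p))
```
+
+Because the loop only rescales a single shared power, the per-channel ratios are untouched; the controller's job is
+limited to driving the sum to S_ref.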
+
+                 VII.    CONCLUSION
+
+   We have presented a cascaded microring resonator architecture that approximates the exponential function
+e^{x_n − m} on a finite interval [0, L] using multiplicative Lorentzian transfer functions. Increasing the cascade
+depth N systematically reduces the worst-case relative error, and an identical-detuning design initialized by flank
+and slope matching provides a practical two-parameter design.
+   Three-dimensional FDTD simulations of a single X-cut TFLN add-drop ring (R = 20 µm, g = 100 nm) yield
+Q_L = 10,300–15,500 and D_max ≈ 0.36, calibrating the cascade transfer model. A five-ring cascade 3D FDTD
+simulation directly validates the multi-ring framework: all five rings exhibit resonant excitation, and mapping
+the drop-port spectrum onto the dimensionless control variable reproduces the theoretical N = 5 curve with
+∼11% integrated relative-area error over the upper operating range (I ≥ 5.8), providing the first multi-ring
+confirmation of the cascade exponential approximation. At the present FDTD-characterized quality factor, practical
+cascades are limited to N = 5–7 (ε ≲ 11%). If high-Q TFLN resonators reported in the literature (Q_i ≥ 10^6,
+D_max ≥ 0.95) are realized in the cascade geometry, deeper cascades (N ≈ 20–30) would reach sub-percent
+approximation error with an estimated per-operation energy of 0.79–1.39 pJ, which is 1.7–2.4× lower than an INT8
+MAC at the 7 nm node. Monte Carlo analysis shows that the identical-detuning design tolerates realistic fabrication
+variations (σ_a = 0.020, σ_b,rel = 0.020) with a single per-chip calibration, keeping the 95th-percentile softmax
+probability error below 0.32%.
+   The formulation is not restricted to electro-optic tuning: it requires only a controllable detuning coordinate
+with local linearization, so both Pockels and optical (Kerr/XPM) mechanisms are compatible [37, 38, 47, 48]. We
+demonstrate a photonic exponential block and present a WDM-parallel chip architecture (Fig. 8) in which d wavelength
+channels share a single N-ring cascade, reducing the total ring count by a factor of d and eliminating power-splitter
+loss. Combined with a single-loop PI feedback that adjusts the shared WDM laser power, the architecture realizes the
+complete softmax function (exponentiation, summation, and normalization) without per-channel normalization
+circuitry. Max-finding and digital interfacing remain open for future experimental validation.
+
+
+
+
+ [1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser,
+     and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems 30
+     (NeurIPS 2017), pages 5998–6008, 2017.
+ [2] Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and
+     memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems 35
+     (NeurIPS 2022), pages 16344–16359, 2022.
+ [3] Neil Savage. Light could lower AI’s appetite for power. Nature Nanotechnology, 21:6–8, 2026.
+ [4] Yichen Shen et al. Deep learning with coherent nanophotonic circuits. Nature Photonics, 11(7):441–446, 2017.
+ [5] Johannes Feldmann et al. Parallel convolutional processing using an integrated photonic tensor core. Nature,
+     589(7840):52–58, 2021.
+ [6] Nicholas C. Harris et al. Linear programmable nanophotonic processors. Optica, 5(12):1623–1631, 2018.
+ [7] Bowei Dong, Samarth Aggarwal, Wen Zhou, Utku Emre Ali, Nikolaos Farmakidis, June Sang Lee, Yuhan He, Xuan
+     Li, Dim-Lee Kwong, C. D. Wright, Wolfram H. P. Pernice, and H. Bhaskaran. Higher-dimensional processing using
+     a photonic tensor core with continuous-time data. Nature Photonics, 17(12):1080–1088, 2023.
+ [8] Sudip Shekhar, Wim Bogaerts, Lukas Chrostowski, John E. Bowers, Michael Hochberg, Richard Soref, and
+     Bhavin J. Shastri. Roadmapping the next generation of silicon photonics. Nature Communications, 15:751, 2024.
+ [9] Mario Miscuglio and Volker J. Sorger. Photonic tensor cores for machine learning. Applied Physics Reviews,
+     7(3):031404, 2020.
+[10] Yaowen Hu, Yunxiang Song, Xinrui Zhu, Xiangwen Guo, Shengyuan Lu, Qihang Zhang, Lingyan He, C. A. A.
+     Franken, Keith Powell, Hana Warner, Daniel Assumpcao, Dylan Renaud, Ying Wang, et al. Integrated lithium
+     niobate photonic computing circuit based on efficient and high-speed electro-optic conversion. Nature
+     Communications, 16:8178, 2025.
+[11] Priyabrata Dash, Anxiao Jiang, and Dharanidhar Dang. SOFTONIC: A photonic design approach to softmax
+     activation for high-speed fully analog AI acceleration. In Proceedings of the Great Lakes Symposium on VLSI
+     (GLSVLSI ’25), pages 118–125, 2025.
+[12] Ziyu Zhan, Hao Wang, Qiang Liu, and Xing Fu. Optoelectronic nonlinear softmax operator based on diffractive
+     neural networks. Optics Express, 32(15):26458–26469, 2024.
+[13] Ye Tian, Shuiying Xiang, Xingxing Guo, Yahui Zhang, Jiashang Xu, Shangxuan Shi, Haowen Zhao, Yizhi Wang,
+     Xinran Niu, Wenzhuo Liu, and Yue Hao. Photonic transformer chip: interference is all you need. PhotoniX, 6:45,
+     2025.
+[14] Jacob R. Stevens, Rangharajan Venkatesan, Steve Dai, Brucek Khailany, and Anand Raghunathan. Softermax:
+     Hardware/software co-design of an efficient softmax for transformers. In Proceedings of the 58th ACM/IEEE
+     Design Automation Conference (DAC), pages 469–474, 2021.
+[15] Nazim Altar Koca, Anh Tuan Do, and Chip-Hong Chang. Hardware-efficient softmax approximation for
+     self-attention networks. In Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS),
+     pages 1–5, 2023.
+[16] Wenxun Wang, Shuchang Zhou, Wenyu Sun, Peiqin Sun, and Yongpan Liu. SOLE: Hardware-software co-design
+     of softmax and layernorm for efficient transformer inference. In Proceedings of the IEEE/ACM International
+     Conference on Computer-Aided Design (ICCAD), pages 1–9, 2023.
+[17] Yuan Zhang, Yonggang Zhang, Lele Peng, Lianghua Quan, Shubin Zheng, Zhonghai Lu, and Hui Chen. Base-2
+     softmax function: Suitability for training and efficient hardware implementation. IEEE Transactions on
+     Circuits and Systems I: Regular Papers, 69(9):3605–3618, 2022.
+[18] Zhengyu Mei, Hongxi Dong, Yuxuan Wang, and Hongbing Pan. TEA-S: A tiny and efficient architecture for
+     PLAC-based softmax in transformers. IEEE Transactions on Circuits and Systems II: Express Briefs,
+     70:3594–3598, 2023.
+[19] Ke Chen, Yue Gao, Haroon Waris, Weiqiang Liu, and Fabrizio Lombardi. Approximate softmax functions for
+     energy-efficient deep neural networks. IEEE Transactions on Very Large Scale Integration (VLSI) Systems,
+     31:4–16, 2023.
+[20] Wim Bogaerts et al. Silicon microring resonators. Laser & Photonics Reviews, 6(1):47–73, 2012.
+[21] John E. Heebner, Robert W. Boyd, and Q.-Han Park. Scissor solitons and other propagation effects in
+     microresonator-modified waveguides. Journal of the Optical Society of America B, 19(4):722–731, 2002.
+[22] Jiahui Wang, Sean P. Rodrigues, Ercan M. Dede, and Shanhui Fan. Microring-based programmable coherent
+     optical neural networks. Optics Express, 31(12):18871, 2023.
+[23] Pengxing Guo, Niujie Zhou, Weigang Hou, and Lei Guo. StarLight: a photonic neural network accelerator
+     featuring a hybrid mode-wavelength division multiplexing and photonic nonvolatile memory. Optics Express,
+     30:37051, 2022.
+[24] Weizhen Yu, Shuang Zheng, Zhenyu Zhao, Bin Wang, and Weifeng Zhang. Reconfigurable low-threshold
+     all-optical nonlinear activation functions based on an add-drop silicon microring resonator. IEEE Photonics
+     Journal, 14(6):1–7, 2022.
+[25] Bahaa E. A. Saleh and Malvin C. Teich. Fundamentals of Photonics. Wiley, Hoboken, NJ, 2nd edition, 2007.
+[26] Vítor R. Almeida, Carlos A. Barrios, Roberto R. Panepucci, and Michal Lipson. All-optical control of light
+     on a silicon chip. Nature, 431(7012):1081–1084, 2004.
+[27] Qianfan Xu, Bradley Schmidt, Sameer Pradhan, and Michal Lipson. Micrometre-scale silicon electro-optic
+     modulator. Nature, 435(7040):325–327, 2005.
+[28] Kishore Padmaraju and Keren Bergman. Resolving the thermal challenges for silicon microring resonator
+     devices. Nanophotonics, 3:269–281, 2014.
+[29] Erwen Li, Behzad Ashrafi Nia, Bokun Zhou, and Alan X. Wang. Transparent conductive oxide-gated silicon
+     microring with extreme resonance wavelength tunability. Photonics Research, 7(4):473, 2019.
+[30] Lahiru Jayatilleka et al. Post-fabrication trimming of silicon photonic ring resonators at wafer-scale.
+     Journal of Lightwave Technology, 39:5083–5088, 2021.
+[31] Elliott W. Cheney. Introduction to Approximation Theory. McGraw–Hill, New York, 1966.
+[32] Alec Radford et al. Language models are unsupervised multitask learners. Technical report, OpenAI, 2019.
+[33] Hugging Face. distilgpt2 model card, 2025. accessed 2026-02-21.
+[34] Andrej Karpathy. Tiny shakespeare dataset (char-rnn), 2025. accessed 2026-02-21.
+[35] Jane Austen. Pride and prejudice. Project Gutenberg eBook No. 1342, 2025. accessed 2026-02-21.
+[36] Hyoseok Park. MRR-AEF: reproducible MRR depth-sweep fitting and supplementary validation scripts.
+     GitHub repository, 2025. commit 585e695, accessed 2026-02-21.
+[37] Di Zhu et al. Integrated photonics on thin-film lithium niobate. Advances in Optics and Photonics,
+     13(2):242–352, 2021.
+[38] Yaowen Hu, Di Zhu, Shengyuan Lu, Xinrui Zhu, Yunxiang Song, Dylan Renaud, Daniel Assumpcao, Rebecca Cheng,
+     CJ Xin, Matthew Yeh, Hana Warner, Xiangwen Guo, Amirhassan Shams-Ansari, David Barton, Neil Sinclair,
+     and Marko Lončar. Integrated electro-optics on thin-film lithium niobate. Nature Reviews Physics, 2025.
+[39] Mian Zhang, Cheng Wang, Rebecca Cheng, Amirhassan Shams-Ansari, and Marko Lončar. Monolithic
+     ultra-high-Q lithium niobate microring resonator. Optica, 4(12):1536–1537, 2017.
+[40] Rongjin Zhuang, Jinze He, Yifan Qi, and Yang Li. High-Q thin-film lithium niobate microrings fabricated with
+     wet etching. Adv. Mater., 35(3):2208113, 2023.
+[41] Xinrui Zhu, Yaowen Hu, Shengyuan Lu, Hana K. Warner, Xudong Li, Yunxiang Song, Letícia S. Magalhães,
+     Amirhassan Shams-Ansari, Neil Sinclair, and Marko Lončar. Twenty-nine million intrinsic Q-factor
+     monolithic microresonators on thin-film lithium niobate. Photon. Res., 12(8):A63–A68, 2024.
+[42] Renhong Gao, Ni Yao, Jianglin Guan, Li Deng, Jintian Lin, Min Wang, Lingling Qiao, Wei Fang, and Ya Cheng.
+     Lithium niobate microring with ultra-high Q factor above 10^8. Chin. Opt. Lett., 20(1):011902, 2022.
+[43] Flexcompute Inc. Tidy3D: electromagnetic simulation software. https://www.flexcompute.com/tidy3d/,
+     2024. v2.10; cloud GPU FDTD. Accompanying notebook:
+     https://www.flexcompute.com/tidy3d/community/notebooks/CascadedMRRTFLN/.
+[44] John K. Doylend, Paul E. Jessop, and Andrew P. Knights. Silicon photonic dynamic optical channel leveler
+     with external feedback loop. Optics Express, 18(13):13805–13812, 2010.
+[45] Karl J. Åström and Richard M. Murray. Feedback Systems: An Introduction for Scientists and Engineers.
+     Princeton University Press, Princeton, NJ, 2008.
+[46] Amnon Yariv, Yong Xu, Reginald K. Lee, and Axel Scherer. Coupled-resonator optical waveguide: a proposal
+     and analysis. Optics Letters, 24(11):711–713, 1999.
+[47] Meisam Bahadori, Yansong Yang, Ahmed E. Hassanien, Lynford L. Goddard, and Songbin Gong. Ultra-efficient
+     and fully isotropic monolithic microring modulators in a thin-film lithium niobate photonics platform.
+     Optics Express, 28(20):29644–29661, 2020.
+[48] Abu Naim R. Ahmed, Shouyuan Shi, Mathew Zablocki, Peng Yao, and Dennis W. Prather. Tunable hybrid
+     silicon nitride and thin-film lithium niobate electro-optic microresonator. Optics Letters, 44(3):618, 2019.
+
+                                      SUPPLEMENTARY INFORMATION
+
+Supplementary material accompanying “Photonic Exponential Approximation via Cascaded TFLN Microring Resonators
+toward Softmax.”
+
+
+                           S0. RIGOROUS DERIVATION AND VALIDITY SCOPE
+
+  This section derives the depth-scaling relations and screening bounds used in the main text, and states the assumptions
+under which they apply, together with validity scope and failure cases. We separate proved statements (Lemma,
+Proposition, Theorem) from heuristic engineering estimates that rely on empirical calibration.
+
+
+                                                  S0.1 Assumptions
+
+Assumption 1 (Lorentzian single-ring transfer). Each ring k has a normalized add-to-drop transmission of the form
+T_k(I) = [1 + (a_k + bI)^2]^{−1}, where a_k ∈ R is the dimensionless static detuning, b > 0 is the common normalized
+sensitivity, and I ≥ 0 is a nonnegative control-signal amplitude.
+Assumption 2 (Multiplicative cascade). The N rings are cascaded in a serial add-drop topology (the drop output of
+ring k feeds the add input of ring k+1), and the probe is sufficiently weak that cross-ring and nonlinear probe-induced
+effects are negligible; thus the total normalized transmission is y(I) = ∏_{k=1}^{N} T_k(I).
+Assumption 3 (Identical-detuning family). All rings share the same static detuning: a_1 = · · · = a_N ≡ a. This reduces
+the design space to (a, b) and a global scale C > 0; the scaled output is ỹ(I) = C [1 + (a + bI)^2]^{−N}.
+Assumption 4 (Linear control-to-resonance mapping). Within the operating range I ∈ [0, L], the resonance shift is
+a linear function of the control-signal amplitude (Eq. (10) of main text), i.e., higher-order detuning nonlinearity is
+negligible.
+Assumption 5 (Finite interval and bounded L). The approximation target is f(I) = e^{I−L} on a finite interval
+I ∈ [0, L] with L > 0 determined by the input batch (L = max(x) − min(x)). The depth-scaling results are derived for
+fixed, finite L.
+Assumption 6 (Flank-centered operating regime). The design uses the “flank-centered” initialization: a + b(L/2) = −1
+(midpoint on the Lorentzian half-maximum) and N b = 1 (slope matching). This places the operating point in the
+steepest-slope region of the Lorentzian, where the log-transfer is most nearly linear.
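+
+As an illustration of Assumptions 1–3 and 6, the sketch below builds the flank-centered parameters (a, b) for a given
+depth and interval, evaluates the cascade as an explicit product of identical Lorentzians, and checks the two defining
+properties numerically: each ring sits at half maximum at the midpoint, and the log-transfer slope there equals 1.
+The function names and the values N = 5, L = 4 are illustrative choices, not taken from the paper.
+
```python
import math

def flank_centered_design(n_rings, interval_len):
    """Assumption 6: N*b = 1 and a + b*L/2 = -1."""
    b = 1.0 / n_rings
    a = -1.0 - b * interval_len / 2.0
    return a, b

def cascade_transmission(i, a, b, n_rings):
    """Assumptions 1-3: product of N identical Lorentzian drop responses."""
    y = 1.0
    for _ in range(n_rings):
        y *= 1.0 / (1.0 + (a + b * i) ** 2)
    return y

N, L = 5, 4.0                                   # illustrative values
a, b = flank_centered_design(N, L)
mid = cascade_transmission(L / 2, a, b, N)      # each ring at half maximum
h = 1e-6                                        # central-difference step
slope_mid = (math.log(cascade_transmission(L / 2 + h, a, b, N))
             - math.log(cascade_transmission(L / 2 - h, a, b, N))) / (2 * h)
print(a, b, mid, slope_mid)
```
+
+At the midpoint the product reduces to 2^{−N}, which is the insertion penalty that the global scale C absorbs.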
+
+
+                                                S0.2 Rigorous results
+
+  Throughout, define the log-domain residual
+
+      r(I) \equiv \ln \tilde{y}(I) - (I - L) = \ln C - N \ln\left[ 1 + (a + bI)^2 \right] - (I - L),          (S0.1)
+
+and the worst-case log-error E_∞ = sup_{I∈[0,L]} |r(I)|. We set C to the minimax-optimal value
+ln C⋆ = −[max_I g(I) + min_I g(I)]/2, where g(I) = ln y(I) − (I − L), throughout.
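+
+The residual and the minimax-optimal scale can be evaluated directly on a grid. The sketch below is one way to do
+so for the identical-detuning family; the grid resolution, function name, and the example values N = 5, L = 4 are
+assumptions made for illustration.
+
```python
import math

def log_residual_stats(a, b, n_rings, interval_len, samples=2001):
    """Return (ln C*, E_inf) for r(I) = ln C + g(I) on a uniform grid over [0, L]."""
    g = []
    for k in range(samples):
        i = interval_len * k / (samples - 1)
        ln_y = -n_rings * math.log(1.0 + (a + b * i) ** 2)
        g.append(ln_y - (i - interval_len))       # g(I) = ln y(I) - (I - L)
    ln_c = -(max(g) + min(g)) / 2.0               # centers the residual band
    e_inf = max(abs(ln_c + gi) for gi in g)       # worst-case |r(I)|
    return ln_c, e_inf

N, L = 5, 4.0                                     # illustrative values
b = 1.0 / N
a = -1.0 - b * L / 2.0                            # flank-centered choice
ln_c, e_inf = log_residual_stats(a, b, N, L)
print(ln_c, e_inf, L ** 3 / (48 * N ** 2))        # compare with leading estimate
```
+
+With this choice of C the residual band is symmetric about zero, so E_∞ equals half the spread of g over the interval.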
+Lemma 1 (Slope bound, rigorous). Under Assumptions 1–3, for every I ≥ 0,
+
+      \left| \frac{d}{dI} \ln y(I) \right| \le N |b|.
+
+Proof. From Assumption 3, ln y(I) = −N ln[1 + (a + bI)^2]. Differentiating:
+
+      \frac{d}{dI} \ln y(I) = -N \, \frac{2b(a + bI)}{1 + (a + bI)^2}.
+
+Let u ≡ a + bI ∈ R. The function h(u) = 2u/(1 + u^2) satisfies |h(u)| ≤ 1 for all u ∈ R (since 1 + u^2 ≥ 2|u| by AM–GM).
+Therefore |d(ln y)/dI| = N |b| |h(u)| ≤ N |b|.
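+
+The slope bound is easy to spot-check numerically. The sketch below evaluates the analytic log-slope over a grid
+of control values and several static detunings and records the largest magnitude found; the parameter values and
+function name are illustrative assumptions.
+
```python
def log_slope(i, a, b, n_rings):
    """d/dI ln y(I) for N identical Lorentzians with u = a + b*I."""
    u = a + b * i
    return -n_rings * 2.0 * b * u / (1.0 + u * u)

N, b = 8, 0.125                       # illustrative values with N*b = 1
worst = 0.0
for a in (-3.0, -1.0, 0.0, 0.5, 2.0): # several static detunings
    for k in range(1001):
        i = 10.0 * k / 1000           # I sampled on [0, 10]
        worst = max(worst, abs(log_slope(i, a, b, N)))
print(worst, N * abs(b))
```
+
+The bound is attained only where |u| = 1, which is why the flank point u = −1 is the natural operating point.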
+
+Remark 1 (Necessary condition for approximation). Since the target ln f (I) = I − L has constant slope +1, a
+necessary condition for the cascade log-transfer to track this slope at any point is N |b| ≥ 1. This is Eq. (23) of the
+main text and is a rigorous (not heuristic) necessary condition.
+Proposition 1 (Log-domain Taylor expansion at flank center). Under Assumptions 1–6, define I_0 = L/2 and
+δ = I − I_0. Then
+
+      \ln \tilde{y}(I) = \mathrm{const} + \delta - \frac{\delta^3}{6N^2} + R_4(\delta),                    (S0.2)
+
+where |R_4(δ)| ≤ M_4 δ^4 with M_4 = N |b|^4 · sup_{|δ|≤L/2} |ϕ^{(4)}(u(δ))| and ϕ(u) = −ln(1 + u^2). In particular, the
+quadratic term vanishes identically at the flank point u_0 = a + bI_0 = −1.
+Proof. Set u(δ) = a + b(I_0 + δ) = −1 + bδ (using a + bI_0 = −1). With ϕ(u) = −ln(1 + u^2), ln y(I) = N ϕ(u(δ))
+and u′(δ) = b. Compute derivatives of ϕ at u_0 = −1:
+
+      \phi'(u)   = -\frac{2u}{1 + u^2},               \phi'(-1)   = 1,
+      \phi''(u)  = \frac{2(u^2 - 1)}{(1 + u^2)^2},    \phi''(-1)  = 0,
+      \phi'''(u) = \frac{4u(3 - u^2)}{(1 + u^2)^3},   \phi'''(-1) = \frac{4(-1)(3 - 1)}{(1 + 1)^3} = -1.
+
+By the chain rule, writing F(δ) = N ϕ(u(δ)):
+
+      F'(0)   = N b \, \phi'(-1)     = N b = 1,
+      F''(0)  = N b^2 \, \phi''(-1)  = 0,
+      F'''(0) = N b^3 \, \phi'''(-1) = -N b^3 = -\frac{1}{N^2},
+
+where we used b = 1/N from Assumption 6 in the last step. Hence the Taylor expansion with the minimax-optimal C
+is
+
+      \ln \tilde{y}(I) = \mathrm{const} + \delta + 0 \cdot \frac{\delta^2}{2} - \frac{1}{N^2} \cdot \frac{\delta^3}{6} + R_4(\delta).
+
+Subtracting the target δ (the linear part of I − L around I_0) gives a leading residual −δ^3/(6N^2), of magnitude
+|δ|^3/(6N^2). The remainder is bounded by the standard Taylor remainder estimate.
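+
+The expansion can be checked numerically. The sketch below evaluates F(δ) = N ϕ(−1 + δ/N) with ϕ(u) = −ln(1 + u^2)
+and compares F(δ) − F(0) − δ against the cubic term. Direct evaluation of ϕ′′′(u) = 4u(3 − u^2)/(1 + u^2)^3 at
+u = −1 gives −1, so the cubic term enters as −δ^3/(6N^2); its magnitude |δ|^3/(6N^2) is what the error estimates
+use. The function names and the test values N = 10, δ = 0.2 are illustrative.
+
```python
import math

def phi(u):
    """Log of a single Lorentzian: phi(u) = -ln(1 + u^2)."""
    return -math.log(1.0 + u * u)

def residual_vs_cubic(n_rings, delta):
    """Return (F(delta) - F(0) - delta, -delta^3 / (6 N^2)) at the flank center."""
    b = 1.0 / n_rings                              # slope matching, N*b = 1
    f = lambda d: n_rings * phi(-1.0 + b * d)      # F(delta) = N * phi(u(delta))
    residual = f(delta) - f(0.0) - delta
    cubic = -delta ** 3 / (6.0 * n_rings ** 2)
    return residual, cubic

residual, cubic = residual_vs_cubic(10, 0.2)       # illustrative values
print(residual, cubic)
```
+
+The small gap between the two numbers is the fourth-order remainder R_4(δ).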
+Theorem 1 (Heuristic depth-scaling law). Under Assumptions 1–6 and ignoring the fourth-order remainder R_4, the
+leading-order worst-case log-error on I ∈ [0, L] satisfies
+
+      E_\infty^{(\mathrm{leading})} \sim \frac{1}{6N^2} \left( \frac{L}{2} \right)^3 = \frac{L^3}{48 N^2}.          (S0.3)
+
+Setting E_∞^{(leading)} ≤ ε_log = ln(1 + ε) and solving for N gives
+
+      N \ge \frac{L^{3/2}}{\sqrt{48 \, \varepsilon_{\log}}}.                                                (S0.4)
+
+Derivation (heuristic). From Proposition 1, the residual with respect to the target is dominated in magnitude by
+|δ|^3/(6N^2) for |δ| ≤ L/2. The maximum of |δ|^3 on [−L/2, L/2] is (L/2)^3. Requiring the bound to be at most
+ε_log and solving:
+
+      \frac{L^3}{48 N^2} \le \varepsilon_{\log} \quad \Longrightarrow \quad N \ge \frac{L^{3/2}}{\sqrt{48 \, \varepsilon_{\log}}}.
+
+With 1/√48 ≈ 0.144, and accounting for the fact that the minimax-optimal residual is typically smaller than the
+one-sided Taylor bound by a factor of ∼2 (equi-oscillation), the effective prefactor becomes κ ≈ 0.07, yielding the
+main-text engineering estimate Eq. (28): N ≈ ⌈max(1/b_max, κ L^{3/2}/√ε_log)⌉.
+Remark 2 (Status of Theorem 1). This is a heuristic scaling law, not a rigorous minimax guarantee. The
+derivation truncates the Taylor series at third order and approximates the equi-oscillation factor empirically (κ ≈ 0.07).
+For a rigorous bound one would need explicit control of R_4 over the full interval [0, L], which depends on L, N, and
+higher derivatives of the Lorentzian; we do not claim such a bound here. The scaling N ∼ L^{3/2}/√ε_log is supported by
+numerical evidence (Table I) but should be treated as an engineering design rule.
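+
+The two depth rules can be written as small helper functions. The sketch below implements the leading-order
+requirement (S0.4) and the engineering estimate quoted as Eq. (28); rounding (S0.4) up to an integer, the function
+names, and the example inputs (L = 10, ε = 1%, b_max = 0.14) are assumptions for illustration.
+
```python
import math

def depth_leading_order(interval_len, eps):
    """Smallest integer N with L^3 / (48 N^2) <= ln(1 + eps), per (S0.4)."""
    eps_log = math.log1p(eps)
    return math.ceil(interval_len ** 1.5 / math.sqrt(48.0 * eps_log))

def depth_engineering(interval_len, eps, b_max, kappa=0.07):
    """Eq. (28) as quoted: ceil(max(1/b_max, kappa * L^1.5 / sqrt(eps_log)))."""
    eps_log = math.log1p(eps)
    return math.ceil(max(1.0 / b_max,
                         kappa * interval_len ** 1.5 / math.sqrt(eps_log)))

n_taylor = depth_leading_order(10.0, 0.01)       # illustrative inputs
n_design = depth_engineering(10.0, 0.01, 0.14)
print(n_taylor, n_design)
```
+
+The two results differ by roughly the factor of two attributed to equi-oscillation, which is the only difference
+between the prefactors 1/√48 and κ.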
+
+                                 S0.3 Derivation of the conservative screening bound
+
+  We now derive the conservative screening bound (Eqs. S0.7–S0.8 below), which is stated inline in Sec. II of the main
+text.
+Proposition 2 (Conservative log-error bound). Under Assumptions 1–5 (identical detuning, but not restricted to the
+flank-centered choice), fix b > 0 and choose the normalization ỹ(L) = 1. Define ϕ(u) = −ln(1 + u^2) and write
+
+      \ln \tilde{y}(I) = N \left[ \phi(a + bI) - \phi(a + bL) \right].
+
+The target in this normalization is (I − L). Denoting the residual r(I) = ln ỹ(I) − (I − L), we have r(L) = 0 and
+r(0) = N [ϕ(a) − ϕ(a + bL)] + L.
+   For any choice of a such that the operating range {a + bI : I ∈ [0, L]} lies in the region where ϕ is concave (i.e.,
+ϕ′′(u) ≤ 0 throughout), the worst-case log-error satisfies
+
+      E_\infty \le \frac{N \|\phi''\|_\infty \, b^2 L^2}{8} + \frac{\left| N \phi'(a + bL) \cdot b - 1 \right|}{2} \cdot L,          (S0.5)
+
+where ∥ϕ′′∥_∞ = sup_{u∈[a, a+bL]} |ϕ′′(u)|.
+Derivation sketch. Write h(I) ≡ N ϕ(a + bI). The slope is h′(I) = N b ϕ′(a + bI). At I = L, we want the slope to
+match the target slope 1; define the slope mismatch ∆s ≡ h′(L) − 1 = N b ϕ′(a + bL) − 1. By the fundamental theorem
+of calculus on [I, L]:
+
+      r(I) - r(L) = r(I) = h(I) - h(L) - (I - L) = \int_I^L \left[ 1 - h'(t) \right] dt.
+
+Write 1 − h′(t) = (1 − h′(L)) + (h′(L) − h′(t)) = −∆s + ∫_t^L h′′(s) ds. Since h′′(s) = N b^2 ϕ′′(a + bs), we bound
+|h′′(s)| ≤ N b^2 ∥ϕ′′∥_∞. Integrating twice and applying the triangle inequality gives (S0.5).
+Corollary 1 (Main-text conservative bound). Under slope matching at I = L (i.e., N b ϕ′(a + bL) = 1, so ∆s = 0),
+and using ∥ϕ′′∥_∞ ≤ 2 (which holds since |ϕ′′(u)| = |2(u^2 − 1)/(1 + u^2)^2| ≤ 2 for all u ∈ R), the bound simplifies to
+
+      E_\infty \le \frac{N b^2 L^2}{4}.                                                              (S0.6)
+
+Using b = 1/N (the slope-matching choice from N b = 1) gives E_∞ ≤ L^2/(4N). If instead we retain a general b but add
+the penalty from imperfect slope matching (e.g., from the constraint b ≤ b_max), a combined conservative bound is
+
+      E_\infty \le \frac{L^2}{4N} + \frac{1}{2 b^2 N},                                                (S0.7)
+
+which provides a conservative heuristic bound on the log-error. Setting this ≤ ln(1 + ε) and solving for N yields the
+conservative screening depth:
+
+      N_{\mathrm{safe}} \ge \frac{L^2/4 + 1/(2b^2)}{\ln(1 + \varepsilon)}.                            (S0.8)
+
+Remark 3 (Status of the conservative bound). Equation (S0.7) is a conservative heuristic design rule. It is
+conservative because: (i) we use a global upper bound ∥ϕ′′ ∥∞ ≤ 2 instead of the actual curvature, (ii) we do not exploit
+the minimax-optimal C shift. It is heuristic (not a formal guarantee) because: (i) the derivation assumes the operating
+range lies in the concavity region of ϕ, which may not hold for all detuning choices; (ii) the second term 1/(2b^2 N)
+arises from a simplified penalty model for flank-curvature mismatch that has not been proved to be a rigorous upper
+bound in all parameter regimes. Nsafe from Eq. (S0.8) is therefore a screening estimate, suitable for preliminary
+design-space exploration but not a certified minimax guarantee.
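+
+The screening rule of Eqs. (S0.7) and (S0.8) can be evaluated directly. The sketch below is illustrative only: the
+function names are hypothetical, rounding up to an integer depth is an assumption of the sketch, and the value of b is
+borrowed from the N = 10 fit in Eq. (S2.3) purely as an example input.

```python
import math


def conservative_bound(L, b, N):
    """Right-hand side of Eq. (S0.7): L^2/(4N) + 1/(2 b^2 N)."""
    return L**2 / (4.0 * N) + 1.0 / (2.0 * b**2 * N)


def n_safe(L, b, eps):
    """Smallest integer N satisfying Eq. (S0.8).

    Rounding up to an integer is an assumption of this sketch.
    """
    return math.ceil((L**2 / 4.0 + 1.0 / (2.0 * b**2)) / math.log1p(eps))


if __name__ == "__main__":
    # Illustrative inputs: L = 8, b from Eq. (S2.3), 1% relative tolerance.
    L, b, eps = 8.0, 0.10202, 0.01
    N = n_safe(L, b, eps)
    print("N_safe =", N)
    print("bound at N_safe =", conservative_bound(L, b, N))
    print("ln(1 + eps) =", math.log1p(eps))
```

+
+For small b the 1/(2b^2) term dominates, which is one reason the resulting depth is far more pessimistic than the
+fitted depths and is best read as a backstop.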
+
+
+                                            S0.4 Validity scope and failure cases
+
+  The derivations above hold under the stated assumptions. We now identify the regimes where each assumption may
+break down.
+
+(V1) Lorentzian model (A1). The single-ring Lorentzian form T = [1 + (a + bI)^2]^{-1} is a near-resonance approximation
+     valid when the probe frequency is within a few linewidths of the resonance. Far from resonance, higher-order
+     dispersion, Fano interference, or multi-mode effects introduce deviations. Failure case: operation with very large
+     detuning (|a + bI| ≫ 1 across the full interval), where the Lorentzian tails may not be accurate for high-Q rings.
+
+(V2) Multiplicative cascade (A2). Requires that inter-ring reflections and back-scattering are negligible (forward-
+     propagating coupling only, i.e. negligible back-reflection at each inter-ring junction). Failure case: very high ring
+     count N with non-negligible back-reflection per stage, which can produce Fabry–Pérot-like ripples in the cascade
+     transfer function.
+
+(V3) Identical-detuning family (A3). The Taylor expansion and conservative bound both assume a1 = · · · = aN .
+     In practice, fabrication variations introduce per-ring detuning spread σa . The Monte Carlo analysis in Sec. S8
+     quantifies robustness, but the analytical bounds (S0.2)–(S0.8) are strictly valid only for identical detuning.
+(V4) Linear control-to-resonance mapping (A4). The linearized model ω_0(I) = ω_0^{(0)} + ηI introduces systematic
+     error at large control amplitudes. For carrier-injection (free-carrier plasma effect) or thermal tuning over wide
+     ranges, second-order nonlinearity in the control-to-detuning mapping can exceed 1%. Failure case: large L
+     requiring a control swing exceeding the linearity range of the tuning mechanism.
+
+(V5) Finite interval (A5). All bounds scale with L (typically as L^2 or L^{3/2}). As L → ∞, N grows without bound
+     and insertion loss accumulates (∼ N · IL_stage), eventually degrading the probe SNR below the useful regime.
+     There is no finite N that works for all L simultaneously. Practical regime: L ≲ 10–12 (consistent with Leff at
+     p90–p95 from Sec. S3) is the primary target; L ≳ 16 requires N ≳ 30 even for moderate tolerance, pushing loss
+     budgets.
+
+(V6) Flank-centered initialization (A6). The Taylor-based scaling (Theorem 1) relies on the cancellation
+     ϕ′′ (−1) = 0 at the half-maximum point. If the operating point deviates (e.g., due to fabrication offset pushing
+     a + bI_0 away from −1), a nonzero quadratic residual appears and the effective scaling worsens to E∞ ∼ L^2/N
+     rather than L^3/N^2. Mitigation: heater/bias trimming to restore the flank condition.
+
+
+                                        S0.5 Mapping to main-text equations
+
+For reference, the results derived here correspond to the following main-text equations:
+
+    • Slope bound (Lemma 1): rigorous; corresponds to main-text Eqs. (22)–(23). This is a guaranteed necessary
+      condition.
+
+    • Engineering N -estimate (Theorem 1): heuristic scaling with empirical prefactor κ ≈ 0.07; corresponds to
+      main-text Eq. (28). This is a heuristic design rule calibrated against numerical fits.
+
+    • Conservative bound (Corollary 1): not rigorously certified as a minimax upper bound; given as Eq. (S0.7) in
+      this supplement and stated inline in Sec. II. It serves as a conservative heuristic screening condition.
+
+    • Nsafe (Corollary 1): the screening depth obtained from the conservative bound; given as Eq. (S0.8) in this
+      supplement and stated inline in Sec. II. It serves as a backstop estimate for preliminary design.
+
+Summary of guarantee status:
+Result                                 Status                                       Main-text Eq.
+Slope bound N|b| ≥ 1                   Rigorous (proved)                            (23)
+Scaling N ∼ κ L^{3/2} / √ε_log         Heuristic (Taylor truncation + empirical κ)  (28)
+Bound E∞ ≤ L^2/(4N) + 1/(2b^2 N)       Conservative heuristic                       (S0.7)
+Nsafe screening depth                  Conservative backstop                        (S0.8)
+
+
+            S1. DEPTH-SCALING DERIVATION AND CONSERVATIVE SCREENING BOUND
+
+  This section provides the detailed derivations underlying the depth-scaling relations and conservative screening
+bounds summarized in the main text (Sec. II). These results complement the rigorous treatment in Sec. S0.
+
+                                S1.1 Local expansion and exponential-like behavior
+
+   To provide immediate local intuition (without changing the global minimax objective), let δ = I − I_0 around the
+flank-centered point I_0 = L/2 and impose a + bI_0 = −1. With the local normalization C = 2^N (so that ỹ(I_0) = 1), a
+third-order expansion of ỹ(I) = C[1 + (a + bI)^2]^{-N} gives
+
+    \tilde{y}(I) \approx 1 + N b\,\delta + \frac{N^2}{2} b^2 \delta^2 + \frac{N(N^2 - 1)}{6} b^3 \delta^3 + O(\delta^4),     (S1.1)
+
+so with b ∼ 1/N, the linear and quadratic coefficients align with those of e^δ = 1 + δ + δ^2/2 + δ^3/6 + · · · , explaining
+why the initialization is already close before refinement.
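+
+The local expansion can be checked numerically. The following is a minimal sketch, assuming exact slope matching
+(N b = 1) and the normalization C = 2^N; the helper names are hypothetical. It compares the closed form, the cubic
+series of Eq. (S1.1), and e^δ at a few offsets.

```python
import math


def y_tilde(delta, N, b):
    """C [1 + (a + b I)^2]^(-N) with a + b*I0 = -1 and C = 2^N, versus delta = I - I0."""
    u = -1.0 + b * delta
    return (2.0 ** N) * (1.0 + u * u) ** (-N)


def cubic_series(delta, N, b):
    """Third-order expansion of Eq. (S1.1)."""
    x = b * delta
    return 1.0 + N * x + 0.5 * N**2 * x**2 + N * (N**2 - 1) / 6.0 * x**3


if __name__ == "__main__":
    N = 10
    b = 1.0 / N  # slope matching, N b = 1 (assumed for this check)
    for delta in (-0.5, -0.1, 0.1, 0.5):
        print(delta, y_tilde(delta, N, b), cubic_series(delta, N, b), math.exp(delta))
```
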
+
+
+                                  S1.2 Log-domain analysis and scaling derivation
+
+  For depth scaling, the logarithmic domain is more transparent. Under the same flank centering (a + bI0 = −1),
+expand around I0 = L/2 with δ = I − I0 to obtain
+
+    \ln \tilde{y}(I) = \mathrm{const} + N b\,\delta - \frac{N b^3}{6} \delta^3 + O(\delta^4).                          (S1.2)
+
+At a + bI_0 = −1, the quadratic term cancels identically in the log expansion; imposing slope matching (N b = 1) gives
+
+    \ln \tilde{y}(I) = \mathrm{const} + \delta - \frac{\delta^3}{6N^2} + O(\delta^4).                                  (S1.3)
+
+Hence the leading log-domain residual scales as |r(δ)| ∼ |δ|^3/N^2. Over I ∈ [0, L] with |δ| ≤ L/2, this implies E∞ ∼ L^3/N^2.
+Requiring E∞ ≤ ε_log leads to
+
+    N \propto \frac{L^{3/2}}{\sqrt{\varepsilon_{\log}}},                                                             (S1.4)
+
+which explains the scaling used in the main-text engineering estimate (Eq. (28)). This derivation is heuristic (not a
+formal guarantee), and the prefactor remains platform- and fitting-criterion dependent.
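+
+As a rough consistency check of this scaling, the sketch below evaluates N ≈ κ L^{3/2}/√ε_log with the empirical
+prefactor κ ≈ 0.07 quoted in Sec. S0.5 and compares it with the fitted (N, E∞) pairs listed in Table S7 for L = 8.
+The function name is hypothetical, and using E∞ from the table as ε_log is an assumption of this sketch.

```python
import math

KAPPA = 0.07  # empirical prefactor quoted in Sec. S0.5


def depth_estimate(L, eps_log, kappa=KAPPA):
    """Heuristic depth kappa * L^(3/2) / sqrt(eps_log), following Eq. (S1.4)."""
    return kappa * L**1.5 / math.sqrt(eps_log)


# (N, E_inf) pairs copied from Table S7 (L = 8)
TABLE_S7 = [(5, 0.1035), (10, 0.0265), (20, 0.0067), (30, 0.0030)]

if __name__ == "__main__":
    for N, e_inf in TABLE_S7:
        print(N, round(depth_estimate(8.0, e_inf), 2))
```

+
+For these rows the estimate stays within a few percent of the tabulated depth, consistent with κ having been
+calibrated against numerical fits.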
+
+
+                                S1.3 Conservative upper bound and screening depth
+
+   For fixed b and the identical-detuning family (a1 = · · · = aN ≡ a), one can write a conservative heuristic condition
+for achieving a prescribed log-tolerance. A simple normalization is to enforce ỹ(L) = 1 (matching the target f (L) = 1).
+For a particular constructive choice of a that keeps (a + bI) large and negative across [0, L], one can bound the
+worst-case log-error as
+
+    E_\infty \le \frac{L^2}{4N} + \frac{1}{2 b^2 N} .                                                                (S1.5)
+
+(This is a conservative rule of thumb; obtaining a formal guarantee would require a separate proof.) As a screening
+estimate (not a formal guarantee), one may use
+
+    N \ge \frac{L^2/4 + 1/(2b^2)}{\ln(1 + \varepsilon)} .                                                            (S1.6)
+
+While this bound is typically pessimistic, it provides a conservative backstop estimate for preliminary design
+screening. The detailed derivation of these bounds, including the concavity conditions and slope-matching assumptions,
+is given in Sec. S0.3.
+
+              S2. WORKED EXAMPLE AND EMPIRICAL LOGIT-RANGE CALIBRATION
+
+  This section provides the detailed worked example for the input-to-output mapping and the empirical logit-range
+calibration tables referenced in the main text (Sec. III).
+
+
+                                 S2.1 Worked input-to-output mapping example
+
+  As a worked example, consider
+
+                                                x = [−3.2, 1.2, 4.8, −0.9].                                    (S2.1)
+
+Compute m = max_n x_n = 4.8. Then u = x − m = [−8.0, −3.6, 0, −5.7] and L = −min_n u_n = 8.0. The mapped
+control-signal levels are
+
+                                               I = u + L = [0, 4.4, 8.0, 2.3],                                 (S2.2)
+
+and the required normalized exponentials are e^{x_n − m} = e^{u_n} = e^{I_n − L}. Using the fitted model directly,
+
+    T_k(I_n) = \frac{1}{1 + (a_k + b I_n)^2}, \qquad y(I_n) = \prod_{k=1}^{N} T_k(I_n).
+
+Under the identical-detuning fit (a_1 = · · · = a_N ≡ a), this becomes
+
+    \tilde{y}(I_n) = C\, y(I_n) = C \left[ \frac{1}{1 + (a + b I_n)^2} \right]^{N} .
+
+For the re-fitted parameters used in this example,
+
+    a = -1.4588, \quad b = 0.10202, \quad N = 10, \quad C = 3.0896 \times 10^{1},                              (S2.3)
+
+which gives
+
+    \tilde{y}(I_n) = C \left[ \frac{1}{1 + (a + b I_n)^2} \right]^{N}
+                   \approx [3.44 \times 10^{-4},\ 2.73 \times 10^{-2},\ 9.74 \times 10^{-1},\ 3.26 \times 10^{-3}].   (S2.4)
+
+  For reference, the corresponding target terms are
+
+                                           In − L = [−8.0, −3.6, 0, −5.7],                                     (S2.5)
+
+and
+    e^{I_n - L} \approx [3.35 \times 10^{-4},\ 2.73 \times 10^{-2},\ 1.00,\ 3.35 \times 10^{-3}].                   (S2.6)
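+
+The worked example can be reproduced in a few lines. This is a minimal sketch using the parameters of Eq. (S2.3);
+the function names are hypothetical and the code is not the fitting pipeline itself, only an evaluation of the fitted
+model against the target e^{I_n − L}.

```python
import math

# Re-fitted parameters from Eq. (S2.3)
A, B, N, C = -1.4588, 0.10202, 10, 3.0896e1


def map_logits(x):
    """Map logits to control levels I = x - max(x) + L, with L = max(x) - min(x)."""
    m = max(x)
    u = [xi - m for xi in x]
    L = -min(u)
    return [ui + L for ui in u], L


def y_tilde(I):
    """Scaled identical-detuning cascade C * [1 + (a + b I)^2]^(-N)."""
    return C * (1.0 + (A + B * I) ** 2) ** (-N)


if __name__ == "__main__":
    x = [-3.2, 1.2, 4.8, -0.9]
    I, L = map_logits(x)
    for xn, In in zip(x, I):
        target = math.exp(In - L)
        approx = y_tilde(In)
        rel_err = abs(approx / target - 1.0)
        print(f"{xn:5.1f} {In:4.1f} {target:.4e} {approx:.4e} {rel_err:.3%}")
```
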
+
+
+
+
+                            S2.2 Effective-range percentiles and clipping calibration
+
+   We first estimate the logit range observed in data and then choose clipping accordingly. From two autoregressive
+Transformers (distilgpt2 and gpt2) and two public corpora (Tiny Shakespeare and Pride and Prejudice) [1–5] at context
+length 128, the effective range
+
+    L_{\mathrm{eff},\alpha} = \max(\log p_{\mathrm{kept}}) - \min(\log p_{\mathrm{kept}}), \qquad \alpha = 0.999,          (S2.7)
+
+fell in a relatively narrow band, summarized in Table S2.
+
+ TABLE S1: Example (N = 10): approximating e^{x_n − m} = e^{I_n − L} using ỹ(I) = C[1 + (a + bI)^2]^{-N} with parameters
+                        re-fitted on I ∈ [0, 8.0] using the same minimax pipeline.
+
+  x_n        I_n        target e^{x_n − m}        approx ỹ(I_n)           rel. err.
+ −3.2        0.0        3.3546 × 10^{-4}          3.4443 × 10^{-4}        2.673%
+  1.2        4.4        2.7324 × 10^{-2}          2.7325 × 10^{-2}        0.004%
+  4.8        8.0        1.0000                    0.9739                  2.608%
+ −0.9        2.3        3.3460 × 10^{-3}          3.2585 × 10^{-3}        2.614%
+
+
+                       TABLE S2: Effective-range percentiles (Leff,0.999 ) at context length 128.
+
+                                           Percentile All runs (4 runs) GPT-2
+                                           p50            6.92–7.23    7.09–7.23
+                                           p90            8.60–8.75    8.73–8.75
+                                           p95            8.97–9.12    9.06–9.12
+                                           p99            9.50–9.69    9.58–9.69
+
+
+  We then test clipping on the same rows with
+
+    E_{\mathrm{cum}}(t) = \tfrac{1}{2} \bigl\| \mathrm{softmax}(u^{(t)}) - \mathrm{softmax}(u) \bigr\|_1 , \qquad u^{(t)} = \max(u, t), \quad u = s - \max(s),     (S2.8)
+
+and require p99{E_cum} ≤ 10^{-3} (0.1% budget). This criterion is satisfied at t = −12 (p99 ≈ 4.27 × 10^{-4}) and violated
+at t = −11 (p99 ≈ 1.24 × 10^{-3}), so we set t* = −12 (N_clip = 12).
+  In practice, we (i) estimate an effective L from data, (ii) verify that fixed clipping keeps softmax error small, and (iii)
+choose representative design points (e.g., L ≈ 8 or L ≈ 12) while treating the clipped tail as negligible. Full protocol
+details, clipping-sweep tables/plots, and per-run statistics are provided in Sec. S3.
+
+
+                                        S2.3 Illustrative synthetic range map
+  As a design-space reference, we consider synthetic logit-range regimes using L = max(x) − min(x) after QK^⊤/√d_k
+scaling. These regimes are illustrative rather than corpus-level percentiles; using the same fitting pipeline, Table S3
+summarizes achievable approximation error versus depth.
+
+   TABLE S3: Synthetic softmax logit-range regimes (L = max(x) − min(x)) and fitted worst-case relative error
+                      (design-space illustration; not intended as corpus-level statistics).
+
+L regime                       N =5                       N = 10                        N = 20                      N = 30
+   L=8                         10.9%                       2.68%                        0.67%                        0.30%
+  L = 12                       40.0%                       9.25%                        2.27%                        1.01%
+  L = 16                       113%                        23.0%                        5.44%                        2.41%
+
+
+  Table S3 suggests a simple rule of thumb: the required depth depends mainly on the target L regime. Near L ≈ 8,
+moderate depth reaches a few-percent error, whereas L ≳ 12 typically requires deeper cascades to approach < 1%
+error.
+  We include Table S3 as a synthetic design map rather than an empirical benchmark.
+
+         S3. EMPIRICAL LOGIT-RANGE EXTRACTION FROM REAL TRANSFORMER RUNS
+
+  We extracted empirical attention-logit ranges from real model runs to complement the synthetic L-regime map in
+the main text. We used two open-source autoregressive Transformers (distilgpt2 and gpt2) and two public corpora
+(Tiny Shakespeare and Pride and Prejudice), with context length 128 and causal masking. For each valid attention
+row, if p = softmax(s) then the raw range is
+                                 Lraw = max(s) − min(s) = max(log p) − min(log p),                                (37)
+where max/min are taken over valid causal keys only. Because very small tail probabilities can dominate min(log p),
+we additionally report an effective range:
+                                         Leff,α = max(log pkept ) − min(log pkept ),                              (38)
+where keys are sorted by attention weight and retained until cumulative mass reaches α = 0.999.
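+
+The two range definitions can be sketched for a single attention row as follows. This is an illustration, not the
+extraction code used for the runs: the function names are hypothetical, and including the key that crosses the
+cumulative-mass threshold in the kept set is an assumption of the sketch.

```python
import math


def softmax(s):
    """Numerically stable softmax over one row of logits."""
    m = max(s)
    e = [math.exp(v - m) for v in s]
    z = sum(e)
    return [v / z for v in e]


def effective_range(probs, alpha=0.999):
    """Log-probability span of the top keys holding cumulative mass alpha."""
    kept, mass = [], 0.0
    for p in sorted(probs, reverse=True):  # largest attention weights first
        kept.append(p)
        mass += p
        if mass >= alpha:  # assumption: the crossing key is retained
            break
    return math.log(kept[0]) - math.log(kept[-1])


if __name__ == "__main__":
    s = [0.0, -1.0, -2.0, -30.0]  # toy logits with one far-tail key
    print("L_raw =", max(s) - min(s))
    print("L_eff =", effective_range(softmax(s)))
```

+
+In the toy row the far-tail key sets the raw range but is dropped from the kept set, which is the effect the
+effective range is meant to capture.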
+  To stay within a 16 GB RAM budget, we processed one model at a time, batch size 1, fixed windowing (stride 128),
+and streaming histogram quantiles. Observed process RSS stayed below 1.24 GB in these runs.
+
+  TABLE S4: Empirical global logit-range percentiles from real model–dataset runs (context length 128): raw vs
+                                             effective (α = 0.999).
+
+                     Model     Dataset             raw p95 raw p99 Leff p50 Leff p90 Leff p95 Leff p99
+                     distilgpt2 tiny shakespeare     22.82      69.00    7.10        8.60     8.97   9.50
+                     distilgpt2 pride prejudice      21.76      68.60    6.92        8.60     9.03   9.57
+                     gpt2       tiny shakespeare     25.48      43.34    7.23        8.73     9.06   9.58
+                     gpt2       pride prejudice      24.13      40.92    7.09        8.75     9.12   9.69
+
+  For quick linkage to the main manuscript: the effective-range summary quoted in the main text corresponds to this
+table (all runs: p50 = 6.92–7.23, p90 = 8.60–8.75, p95 = 8.97–9.12, p99 = 9.50–9.69), and the GPT-2 subset is p50
+= 7.09–7.23, p90 = 8.73–8.75, p95 = 9.06–9.12, p99 = 9.58–9.69.
+Clipping-validity sweep (additional justification). To test whether practical clipping magnitudes can be used
+without materially changing softmax outputs, we evaluated a thresholded-logit approximation. For each row, define
+u = s − max(s) and, for threshold t ≤ 0,
+    u^{(t)} = \max(u, t), \qquad p^{(t)} = \mathrm{softmax}(u^{(t)}).                                           (39)
+We report the cumulative softmax error
+    E_{\mathrm{cum}}(t) = \tfrac{1}{2} \bigl\| p^{(t)} - p \bigr\|_1 ,                                           (40)
+then sweep t ∈ {−14, −13, . . . , −6} and compute p50/p90/p95/p99 of Ecum over all extracted rows.
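+
+A minimal sketch of the per-row error in Eqs. (39) and (40) is given below. The function names are hypothetical and
+the logit row is a toy example rather than data from the corpora.

```python
import math


def softmax(u):
    """Numerically stable softmax over one row."""
    m = max(u)
    e = [math.exp(v - m) for v in u]
    z = sum(e)
    return [v / z for v in e]


def clipping_error(s, t):
    """E_cum(t) = 0.5 * ||softmax(max(u, t)) - softmax(u)||_1 with u = s - max(s), t <= 0."""
    m = max(s)
    u = [v - m for v in s]
    u_clip = [max(v, t) for v in u]
    p, p_clip = softmax(u), softmax(u_clip)
    return 0.5 * sum(abs(a - b) for a, b in zip(p_clip, p))


if __name__ == "__main__":
    s = [2.0, 0.5, -3.0, -9.0, -15.0]  # toy logit row
    for t in range(-14, -5):
        print(t, clipping_error(s, float(t)))
```
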
+
+       TABLE S5: Global clipping-validity sweep: percentile statistics of Ecum (t) versus clipping threshold t.
+
+                                   t      p50               p90               p95               p99
+                                  −14     2.53 × 10^{-5}    4.55 × 10^{-5}    4.80 × 10^{-5}    5.18 × 10^{-5}
+                                  −13     2.69 × 10^{-5}    4.85 × 10^{-5}    7.38 × 10^{-5}    1.48 × 10^{-4}
+                                  −12     2.99 × 10^{-5}    1.21 × 10^{-4}    2.13 × 10^{-4}    4.27 × 10^{-4}
+                                  −11     3.31 × 10^{-5}    3.95 × 10^{-4}    6.55 × 10^{-4}    1.24 × 10^{-3}
+                                  −10     3.72 × 10^{-5}    1.28 × 10^{-3}    2.01 × 10^{-3}    3.58 × 10^{-3}
+                                  −9      4.41 × 10^{-5}    4.04 × 10^{-3}    6.11 × 10^{-3}    1.03 × 10^{-2}
+                                  −8      2.25 × 10^{-4}    1.26 × 10^{-2}    1.83 × 10^{-2}    2.91 × 10^{-2}
+                                  −7      2.76 × 10^{-3}    3.85 × 10^{-2}    5.30 × 10^{-2}    7.89 × 10^{-2}
+                                  −6      1.88 × 10^{-2}    1.11 × 10^{-1}    1.43 × 10^{-1}    1.95 × 10^{-1}
+
+   Under the conservative budget criterion p99{E_cum} ≤ 10^{-3}, the least negative admissible threshold in this sweep
+is t* = −12 (p99 ≈ 4.27 × 10^{-4}). Equivalently, the operational clipping magnitude is N_clip ≡ −t* = 12. This
+magnitude is of the same order as the empirical effective-range scale (Table S4: p99 of L_eff,0.999 up to ≈ 9.69), so
+the clipping-constrained implementation and the effective-range statistics call for comparable range budgets.
+This supports using a practical clipping magnitude comparable to the design range scale (L ≈ N_clip) while
+keeping the p99 cumulative softmax error below 0.1%.
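+
+The threshold selection can be written as a small lookup over the p99 column of Table S5. The sketch below is
+illustrative; the function name is hypothetical and the values are copied from the table.

```python
# p99 of E_cum(t) from Table S5, keyed by clipping threshold t
P99 = {
    -14: 5.18e-5, -13: 1.48e-4, -12: 4.27e-4, -11: 1.24e-3, -10: 3.58e-3,
    -9: 1.03e-2, -8: 2.91e-2, -7: 7.89e-2, -6: 1.95e-1,
}


def least_negative_threshold(p99_by_t, budget=1e-3):
    """Largest (least negative) t whose p99 error stays within the budget, or None."""
    admissible = [t for t, err in p99_by_t.items() if err <= budget]
    return max(admissible) if admissible else None


if __name__ == "__main__":
    t_star = least_negative_threshold(P99)
    print("t* =", t_star, " N_clip =", -t_star)
```
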
+
+
+
+
+    FIG. S1: Global CDFs of raw Lraw (dashed) and effective Leff,0.999 (solid) for the four model–dataset runs.
+
+
+
+
+FIG. S2: Percentile curves of cumulative softmax error E_cum(t) versus clipping threshold t. The dashed line marks the
+                                                0.1% budget (10^{-3}).
+
+                    S4. FDTD METHODOLOGY DETAILS AND X-CUT bV DERIVATION
+
+  This section provides the detailed FDTD simulation methodology, the step-by-step X-cut arc electrode voltage
+sensitivity derivation, and the full cascade optimization table referenced in the main text (Sec. IV–V).
+
+
+                                       S4.1 z-refined 3-fix simulation strategy
+
+   For thin-film LiNbO3 structures, special care is required in the vertical (z) direction due to the high index contrast
+between LiNbO3 (no ≈ 2.21) and SiO2 (n ≈ 1.44) and the sub-micron film thickness. We apply a “z-refined 3-fix”
+strategy:
+   1. Ordinary index correction: the material model uses the corrected ordinary refractive index no appropriate
+      for the TE mode in X-cut geometry, rather than the extraordinary index ne that governs TM propagation;
+   2. z-span expansion: the simulation z-span is extended beyond the minimal waveguide region to include sufficient
+      substrate and superstrate so that evanescent field tails are captured without PML truncation artifacts;
+   3. Auto-mesh: accuracy level 3; conformal variant 1 meshing is enabled, and no manual mesh override is applied.
+      The resulting vertical grid spacing in the slab region is approximately 55 nm, providing ∼2 cells across the 100 nm
+      slab.
+This refinement strategy is critical for obtaining converged results in TFLN ring resonators, where the high-Q spectral
+features are sensitive to numerical dispersion in under-resolved thin films [6]. Table S6 lists the full simulation
+parameters.
+
+                              TABLE S6: 3D FDTD simulation parameters (Lumerical).
+
+Parameter                                                                                  Value
+Solver                                                                                     Lumerical 3D FDTD
+Mesh type                                                                                  Conformal variant 1
+Mesh accuracy                                                                              3 (auto-mesh)
+z-mesh override                                                                            None (auto-mesh)
+Simulation time                                                                            50 ps
+Auto shutoff                                                                               1 × 10^{-6}
+Wavelength range                                                                           1530 nm to 1570 nm
+Grid size                                                                                  532 × 816 × 44
+Source                                                                                     Broadband mode source (TE0 )
+
+
+
+
+                                S4.2 X-cut arc electrode bV step-by-step derivation
+
+   For the X-cut circular ring with lateral S–G arc electrodes (Table II), the crystal Z-axis (c-axis) is oriented at 45°
+from the horizontal axis in the substrate plane. At azimuthal angle θ around the ring, the projection of the lateral
+electric field onto the Z-axis is proportional to cos(θ − 45°). The cos(θ − 45°) = 0 boundaries fall at θ = 135° and
+θ = 315°, naturally separating the bus-waveguide coupling regions from the electrode regions. Each ring carries a full
+semicircular arc electrode on the side opposite to its coupling points. By the substitution φ = θ − 45°, the effective
+EO fill factor is
+
+    f_{\mathrm{EO}} = \frac{1}{2\pi} \int_{\mathrm{semicircle}} |\cos(\theta - 45^\circ)| \, d\theta = \frac{1}{2\pi} \int_{-\pi/2}^{+\pi/2} \cos\varphi \, d\varphi = \frac{1}{2\pi} \bigl[ \sin\varphi \bigr]_{-\pi/2}^{+\pi/2} = \frac{1}{\pi} \approx 0.318.     (S4.1)
+
+The 45◦ rotation ensures that the electrode semicircle does not overlap with the coupling points, while the fill factor
+integral is identical to the standard cos θ case by the change of variable.
+   The lateral S–G electrodes have gap g_el = 5 µm, giving an effective electrode–waveguide distance d_eff ≈ g_el/2 = 2.5 µm.
+The lateral field geometry yields an EO overlap factor Γ_EO = 0.7, compared to 0.5 for a vertical electrode configuration.
+   The refractive index change per volt in the electrode-covered section is
+
+    \frac{\Delta n_{\mathrm{eff}}}{V} = -\frac{1}{2} n_e^3 r_{33} \frac{\Gamma_{\mathrm{EO}}}{d_{\mathrm{eff}}} = -\frac{1}{2} \times 2.138^3 \times 30.9 \times 10^{-12} \times \frac{0.7}{2.5 \times 10^{-6}} = -4.226 \times 10^{-5}\ \mathrm{V}^{-1}.     (S4.2)
+
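+The corresponding resonance wavelength shift is
+
+    \left. \frac{d\lambda_0}{dV} \right|_{\mathrm{straight}} = \frac{1550 \times 4.226 \times 10^{-5}}{2.30} = 28.48\ \mathrm{pm\,V^{-1}},     (S4.3)
+
+giving an intrinsic (straight-section) voltage sensitivity of
+
+    b_V^{\mathrm{straight}} = \frac{2 Q_L}{\lambda_0} \left. \frac{d\lambda_0}{dV} \right|_{\mathrm{straight}} = \frac{2 \times 15{,}500}{1550} \times 0.02848 = 0.570\ \mathrm{V}^{-1}.     (S4.4)
+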
+However, only the arc-electrode portion of the ring circumference contributes to the round-trip phase shift. The
+effective voltage sensitivity is therefore
+
+    b_V = b_V^{\mathrm{straight}} \times f_{\mathrm{EO}} = 0.570 \times \frac{1}{\pi} \approx 0.182\ \mathrm{V}^{-1}.     (S4.5)
+
+A 1 V applied voltage shifts the normalized detuning by ∆a ≈ 0.182. Despite the fill-factor penalty (f_EO = 1/π ≈ 0.318),
+the X-cut arc design benefits from a smaller effective electrode distance (2.5 µm vs. 4 µm for vertical configurations)
+and a higher overlap factor (0.7 vs. 0.5), which partially compensate for the reduced active length.
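+
+The chain of Eqs. (S4.1) to (S4.5) can be re-evaluated numerically as a sanity check. This is a sketch under stated
+assumptions: the variable names are hypothetical, and the divisor 2.30 in Eq. (S4.3) is read here as a group index,
+which the text does not state explicitly. Last-digit differences from the quoted values are expected because the
+text rounds intermediate results.

```python
import math

# Values quoted in Sec. S4.2
N_E = 2.138          # extraordinary index
R33 = 30.9e-12       # electro-optic coefficient, m/V
GAMMA_EO = 0.7       # EO overlap factor
D_EFF = 2.5e-6       # effective electrode-waveguide distance, m
LAMBDA0_NM = 1550.0  # wavelength, nm
N_G = 2.30           # divisor in Eq. (S4.3); interpreting it as a group index is an assumption
Q_L = 15_500         # loaded quality factor

# Eq. (S4.1): fill factor of a semicircular arc electrode
f_eo = (math.sin(math.pi / 2) - math.sin(-math.pi / 2)) / (2.0 * math.pi)
# Eq. (S4.2): index change per volt (1/V)
dn_per_volt = -0.5 * N_E**3 * R33 * GAMMA_EO / D_EFF
# Eq. (S4.3): resonance shift per volt, converted from nm to pm
dlam_pm_per_volt = LAMBDA0_NM * abs(dn_per_volt) / N_G * 1e3
# Eq. (S4.4): straight-section sensitivity (1/V), shift converted back to nm
b_straight = 2.0 * Q_L / LAMBDA0_NM * (dlam_pm_per_volt * 1e-3)
# Eq. (S4.5): effective sensitivity including the fill factor
b_v = b_straight * f_eo

if __name__ == "__main__":
    print(f"f_EO       = {f_eo:.4f}")
    print(f"dn/V       = {dn_per_volt:.4e} 1/V")
    print(f"dlam/dV    = {dlam_pm_per_volt:.2f} pm/V")
    print(f"b_straight = {b_straight:.4f} 1/V")
    print(f"b_V        = {b_v:.4f} 1/V")
```
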
+
+
+                                           S4.3 Full cascade optimization table
+
+  Table S7 presents the optimization results for the standard dynamic range L = 8 (corresponding to
+e^8 ≈ 2981, i.e., 34.7 dB), covering cascade depths from N = 5 to N = 30.
+
+     TABLE S7: Cascade optimization results for L = 8. The bias voltage V_bias = |a|/b_V sets the DC offset, and
+V_ctrl = bL/b_V is the maximum control voltage at I = L. Voltages computed with b_V = 0.182 V^{-1} (FDTD-calibrated
+                                          best resonance Q_L = 15,500).
+
+N                a                    b                 E∞              εmax (%)          Vbias (V)            Vctrl (V)
+ 5            −2.0789              0.21658             0.1035             10.91             11.4                  9.5
+ 8            −1.5959              0.12896             0.0412              4.20              8.8                  5.7
+10            −1.4588              0.10202             0.0265              2.68              8.0                  4.5
+12            −1.3731              0.08450             0.0184              1.86              7.5                  3.7
+15            −1.2914              0.06726             0.0118              1.19              7.1                  3.0
+17            −1.2543              0.05923             0.0092              0.92              6.9                  2.6
+20            −1.2136              0.05025             0.0067              0.67              6.7                  2.2
+25            −1.1685              0.04013             0.0043              0.43              6.4                  1.8
+30            −1.1392              0.03341             0.0030              0.30              6.3                  1.5
+
+
+  Key thresholds for the minimum number of rings at various error targets are:
+     • ε < 10%: N ≥ 6,
+     • ε < 5%: N ≥ 8,
+     • ε < 2%: N ≥ 12,
+     • ε < 1%: N ≥ 17,
+     • ε < 0.5%: N ≥ 24.
+These thresholds are independent of the quality factor Q, since the minimax approximation operates entirely in
+normalized detuning space. The Q factor affects only the physical voltage required to achieve the necessary detuning
+range, through bV .
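+
+The voltage columns of Table S7 follow from the fitted (a, b) and b_V alone, and the relative error follows from the
+log-error. The sketch below recomputes them for a few rows; the function name is hypothetical, and the relation
+ε_max = e^{E∞} − 1 is taken from the log-error convention of Sec. S0.3.

```python
import math

B_V = 0.182  # 1/V, from Eq. (S4.5)
L_RANGE = 8.0

# (N, a, b, E_inf) rows copied from Table S7
ROWS = [
    (5, -2.0789, 0.21658, 0.1035),
    (10, -1.4588, 0.10202, 0.0265),
    (20, -1.2136, 0.05025, 0.0067),
    (30, -1.1392, 0.03341, 0.0030),
]


def operating_point(a, b, e_inf, b_v=B_V, L=L_RANGE):
    """Return (eps_max in percent, V_bias in V, V_ctrl in V) for one cascade design."""
    eps_max = 100.0 * math.expm1(e_inf)  # relative error implied by log-error e_inf
    return eps_max, abs(a) / b_v, b * L / b_v


if __name__ == "__main__":
    for N, a, b, e_inf in ROWS:
        eps_max, v_bias, v_ctrl = operating_point(a, b, e_inf)
        print(f"N={N:2d}  eps_max={eps_max:5.2f}%  V_bias={v_bias:4.1f} V  V_ctrl={v_ctrl:3.1f} V")
```

+
+Because Q_L enters only through b_V, changing the quality factor rescales both voltages and leaves ε_max unchanged.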
+
+
+                                              S4.4 Lorentzian fit validation
+
+  Figure S3 shows the Lorentzian fit to the FDTD drop-port resonance near λ = 1566 nm. The analytical Lorentzian
+Tdrop (∆λ) = A/[1 + (2Q∆λ/λ0 )2 ] with QL = 15,500 closely tracks the FDTD data, validating the single-ring transfer
+function model used in the cascade analysis.
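+
+   As a quick consistency check on the quoted linewidth, the sketch below evaluates the Lorentzian and its full
+width at half maximum, FWHM = λ0/QL. The amplitude A = 1 and the function name are illustrative assumptions.
+
```python
# Sketch: Lorentzian drop-port lineshape and its full width at half maximum.
LAMBDA0_NM = 1566.0   # resonance wavelength quoted in the text
Q_LOADED = 15_500     # loaded quality factor from the fit


def t_drop(dlam_nm, amp=1.0, q=Q_LOADED, lam0_nm=LAMBDA0_NM):
    """Drop-port transmission at detuning dlam_nm (amp = 1 is an assumption)."""
    return amp / (1.0 + (2.0 * q * dlam_nm / lam0_nm) ** 2)


# The half-maximum points sit at |dlam| = lam0 / (2 Q), so FWHM = lam0 / Q.
fwhm_nm = LAMBDA0_NM / Q_LOADED
print(f"FWHM = {fwhm_nm * 1e3:.0f} pm")
print(f"T at half-width = {t_drop(fwhm_nm / 2):.2f}")
```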
+
+
+
+
+  FIG. S3: Lorentzian fit to the FDTD drop-port resonance. Markers: FDTD data; solid line: Lorentzian fit. The
+                           extracted quality factor is QL = 15,500 with FWHM = 101 pm.
+
+
+                                 S4.5 Eigenmode (FDE) analysis of theoretical Qi
+
+   To quantify how far below the physical limit the FDTD-extracted Qi = 38,800 lies, we perform a two-dimensional
+finite-difference eigenmode (FDE) analysis of the bent rib waveguide cross-section using Lumerical MODE Solutions.
+   a. Setup. The FDE solver models the cross-section of the rib waveguide at the design bend radius R = 20 µm
+and wavelength λ = 1550 nm, with perfectly matched layer (PML) boundaries on all four edges. The geometry is
+identical to the 3D FDTD model: 600 nm total LiNbO3 (no = 2.211, lossless dielectric), 100 nm slab, 500 nm rib etch,
+waveguide width W = 1.4 µm, on a 2 µm SiO2 substrate (n = 1.444) with air cladding. The mesh is set to 300 × 300
+cells over a 6 µm × 3 µm cross-section, yielding effective grid spacings ∆x ≈ 20 nm and ∆y ≈ 10 nm—substantially
+finer than the 3D FDTD auto-mesh (55 nm vertical).
+   b. Complex effective index. The FDE solver returns a complex effective index neff = nr + i ni for each guided
+mode, where the imaginary part ni encodes propagation loss. For the fundamental TE mode at R = 20 µm:
+\[ n_{\mathrm{eff}} = 1.9653 + i\,(4.73 \times 10^{-8}), \tag{41} \]
+\[ \alpha_{\mathrm{rad+leak}} = \frac{4\pi n_i}{\lambda} = 0.383\ \mathrm{m^{-1}} \quad (0.017\ \mathrm{dB\,cm^{-1}}). \tag{42} \]
+Since the material is set as lossless, this α captures only bending radiation loss and substrate leakage through the
+100 nm slab. The corresponding quality factor is
+\[ Q_{\mathrm{rad+leak}} = \frac{2\pi n_g}{\alpha_{\mathrm{rad+leak}}\,\lambda} = 2.43 \times 10^{7}, \tag{43} \]
+where ng = 2.354 is the group index from the FDE solver (consistent with the FDTD FSR-derived ng = 2.30; the
+small difference arises from the straight-section approximation inherent to 2D FDE).
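+
+   The arithmetic behind Eqs. (42) and (43) can be sketched as follows. Both group-index values quoted above are
+evaluated; with the rounded inputs printed here, ng = 2.30 lands closest to 2.43 × 10^7, while ng = 2.354 gives a
+value a few percent higher. The helper name is illustrative.
+
```python
import math

LAMBDA_M = 1550e-9   # wavelength used in the FDE run
N_IMAG = 4.73e-8     # imaginary part of n_eff from Eq. (41)

# Eq. (42): power attenuation coefficient from the imaginary index.
alpha = 4 * math.pi * N_IMAG / LAMBDA_M           # 1/m
alpha_db_cm = alpha * 10 / math.log(10) / 100     # 1/m -> dB/cm


def q_from_alpha(n_group, alpha_per_m, lam_m=LAMBDA_M):
    """Eq. (43): quality factor limited by a power attenuation alpha."""
    return 2 * math.pi * n_group / (alpha_per_m * lam_m)


print(f"alpha = {alpha:.3f} 1/m = {alpha_db_cm:.3f} dB/cm")
for ng in (2.354, 2.30):
    print(f"n_g = {ng}: Q = {q_from_alpha(ng, alpha):.2e}")
```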
+  c. Decomposition into bending and leakage. A separate FDE run with R = 1 mm (effectively straight) yields
+Qleak = 2.93 × 10^7, isolating the substrate leakage contribution. The pure bending radiation quality factor follows from
+\[ \frac{1}{Q_{\mathrm{bend}}} = \frac{1}{Q_{\mathrm{rad+leak}}} - \frac{1}{Q_{\mathrm{leak}}}, \qquad Q_{\mathrm{bend}} = 1.43 \times 10^{8}. \tag{44} \]
+This confirms that bending radiation loss at R = 20 µm is negligible; substrate leakage through the thin slab is the
+dominant geometric loss channel.
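+
+   A minimal sketch of the subtraction in Eq. (44), using the rounded Q values quoted above (so the last digit can
+differ slightly from the printed result). The function name is illustrative.
+
```python
def remaining_q(q_total, q_known):
    """Q of the remaining loss channel when 1/q_total = 1/q_known + 1/q_rest."""
    return 1.0 / (1.0 / q_total - 1.0 / q_known)


Q_RAD_LEAK = 2.43e7   # Eq. (43)
Q_LEAK = 2.93e7       # straight-waveguide (R = 1 mm) FDE run

q_bend = remaining_q(Q_RAD_LEAK, Q_LEAK)
leak_share = Q_RAD_LEAK / Q_LEAK   # fraction of the geometric loss rate due to leakage
print(f"Q_bend = {q_bend:.2e}")
print(f"leakage share of geometric loss = {leak_share:.0%}")
```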
+   d. Material absorption. The FDE mode profile yields a confinement factor Γ = 0.887 (fraction of the optical
+intensity within the LiNbO3 core and slab regions). The material-absorption-limited quality factor is
+\[ Q_{\mathrm{abs}} = \frac{2\pi n_g}{\Gamma\, \alpha_{\mathrm{mat}}\, \lambda}, \tag{45} \]
+
+where αmat is the bulk material power-attenuation coefficient of LiNbO3 at 1550 nm. Table S8 evaluates Eq. (45) for
+representative TFLN absorption values from the literature [6, 7].
+
+TABLE S8: Theoretical intrinsic quality factor Qi of the R = 20 µm TFLN ring, decomposed into radiation (Qbend),
+ substrate leakage (Qleak), and material absorption (Qabs). Sidewall scattering (fabrication-dependent) is excluded.
+                       The total is 1/Qi = 1/Qrad+leak + 1/Qabs with Qrad+leak = 2.43 × 10^7.
+
+Material condition                         αmat (dB/cm)                         Qabs                        Qi (total)
+Bulk LiNbO3 (pristine)                         0.002                          2.3 × 10^8                    2.2 × 10^7
+High-quality TFLN                               0.01                          4.7 × 10^7                    1.6 × 10^7
+Good TFLN                                       0.03                          1.6 × 10^7                    9.5 × 10^6
+Typical TFLN                                     0.1                          4.7 × 10^6                    3.9 × 10^6
+
+
+   For high-quality TFLN (αmat ≲ 0.01 dB cm^-1), the theoretical Qi exceeds 10^7, more than 400× higher than the
+FDTD-extracted value of 38,800. This confirms that the FDTD result is dominated by numerical mesh artifacts
+(approximately two cells across the 100 nm slab), not by physical loss mechanisms. Bending radiation loss at R = 20 µm
+is negligible (Qbend = 1.43 × 10^8); the dominant geometric loss channel in the ideal structure is substrate leakage
+through the thin slab (Qleak = 2.93 × 10^7).
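+
+   The entries of Table S8 follow from Eq. (45) and the reciprocal sum of loss channels. The sketch below
+recomputes them from the quoted ng = 2.354, Γ = 0.887, and Qrad+leak; function names are illustrative.
+
```python
import math

LAMBDA_M, N_GROUP, GAMMA = 1550e-9, 2.354, 0.887
Q_RAD_LEAK = 2.43e7
Q_FDTD = 38_800


def q_abs(alpha_db_cm):
    """Eq. (45): absorption-limited Q for a bulk attenuation given in dB/cm."""
    alpha_per_m = alpha_db_cm * 100 * math.log(10) / 10   # dB/cm -> 1/m
    return 2 * math.pi * N_GROUP / (GAMMA * alpha_per_m * LAMBDA_M)


def q_intrinsic(alpha_db_cm):
    """Total intrinsic Q: 1/Q_i = 1/Q_rad+leak + 1/Q_abs."""
    return 1.0 / (1.0 / Q_RAD_LEAK + 1.0 / q_abs(alpha_db_cm))


for a in (0.002, 0.01, 0.03, 0.1):
    print(f"{a:5.3f} dB/cm: Q_abs = {q_abs(a):.1e}, Q_i = {q_intrinsic(a):.1e}")
print(f"Q_i(0.01 dB/cm) / Q_FDTD = {q_intrinsic(0.01) / Q_FDTD:.0f}")
```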
+
+                               S5. FABRICATED HIGH-Q DESIGN PROJECTIONS
+
+   Reproducing Qi > 10^5 in three-dimensional FDTD is computationally impractical: at accuracy level 3 the 100 nm
+slab requires ∆z ≲ 20 nm to suppress staircase-induced scattering, inflating wall times beyond 30 days per run. The
+numerically extracted Qi = 38,800 therefore represents a simulation floor, not a physical one. A two-dimensional
+MODE-solver bend analysis confirms Qbend > 4.5 × 10^7 for R = 20 µm, placing bending radiation loss far below any
+realistic intrinsic loss.
+   Table S9 surveys recent high-Q TFLN microring demonstrations. These studies show that Qi ≥ 9 × 10^6 has been
+demonstrated in X-cut TFLN using multiple fabrication routes, including Ar+ milling, wet etching, and
+ICP-RIE/CMP-based processes.
+
+  TABLE S9: Demonstrated intrinsic quality factors in TFLN micro-ring resonators. “EO compatible” indicates
+                whether the fabrication process preserves electrode patterning capability.
+
+Ref.                              Qi                       R (µm)                      w (µm)                           Etch
+Zhang [8]                        10^7                        80                          ∼2                           Ar+ mill
+Gao [9]                          10^8                       100                          ∼3                            CMP∗
+Zhuang [10]                    9 × 10^6                     100                          ∼2                           Wet etch
+Song [11]                     2.9 × 10^7                    200                          4.5                       ICP-RIE+CMP
+   All processes except ∗ are EO-electrode compatible. ∗ CMP-only (no dry etch); subsequent electrode patterning may degrade Qi.
+
+  To project cascade performance into the fabricated regime, we fix Qext = 25,800 (the FDTD-extracted coupling
+quality factor at gap = 100 nm) and compute Dmax = [Qi /(Qi + Qext )]2 for three representative intrinsic quality
+factors (Table S10).
+
+  TABLE S10: Projected cascade transmission for fabricated Qi values at fixed Qext = 25,800. Dmax^N is the ideal
+on-resonance cascade transmission in dB. The minimax approximation error εmax depends only on N and L (not on
+                                 Qi); at N = 20, L = 8: εmax = 0.67% (Table I).
+
+Projection                     Qi                        Dmax              N = 10 (dB)            N = 20 (dB)           N = 30 (dB)
+FDTD baseline                  3.88 × 10^4               0.36                  −44.3                  −88.5                 −132.8
+Conservative                   5 × 10^5                  0.90                  −4.4                   −8.8                  −13.2
+Moderate                       10^6                      0.95                  −2.2                   −4.5                   −6.7
+Optimistic                     5 × 10^6                  0.99                  −0.44                  −0.88                  −1.3
+
+
+  Even in the conservative scenario (Qi = 5 × 10^5), Dmax = 0.90 and the N = 10 cascade loss is only −4.4 dB, roughly
+40 dB better than the FDTD baseline (−44.3 dB). The moderate projection (Qi = 10^6) matches the “fabricated
+high-Q” column in Table V. Because Qbend ≈ 4.5 × 10^7 ≫ Qi for all projections, bending loss is never the bottleneck;
+the dominant loss mechanism is sidewall scattering, which is determined entirely by fabrication quality. The literature
+values in Table S9 support the view that intrinsic quality factors in the projected range are physically achievable
+in TFLN—albeit with wider waveguides (w ≥ 2 µm) and larger ring radii (R ≥ 80 µm) than the present design.
+Transferring comparable sidewall quality to our geometry (R = 20 µm, w = 1.4 µm) is an open fabrication challenge;
+the projections in Table S10 should be read as design targets contingent on achieving it.
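+
+   The projection rule Dmax = [Qi/(Qi + Qext)]^2 and the cascade transmission Dmax^N in dB can be sketched as below.
+Because the code keeps Dmax unrounded, individual entries can differ from Table S10 by about 0.1 dB in the last
+digit. Function names are illustrative.
+
```python
import math

Q_EXT = 25_800   # FDTD-extracted coupling quality factor at gap = 100 nm


def d_max(q_i, q_ext=Q_EXT):
    """On-resonance single-ring drop transmission."""
    return (q_i / (q_i + q_ext)) ** 2


def cascade_db(q_i, n_rings):
    """Ideal on-resonance transmission of n_rings identical rings, in dB."""
    return n_rings * 10 * math.log10(d_max(q_i))


projections = [("FDTD baseline", 3.88e4), ("Conservative", 5e5),
               ("Moderate", 1e6), ("Optimistic", 5e6)]
for label, q_i in projections:
    cols = ", ".join(f"{cascade_db(q_i, n):7.1f}" for n in (10, 20, 30))
    print(f"{label:14s} D_max = {d_max(q_i):.2f}  [{cols}] dB")
```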
+
+                                   S6. INSERTION LOSS BUDGET DETAILS
+
+  For a cascade of N rings, the total insertion loss is modeled as
+
+                                           ILtot ≈ N · ILstage + ILcoupling ,                                      (S6.1)
+
+where ILstage is the per-ring insertion loss at off-resonance operation and ILcoupling accounts for fiber-to-chip and
+chip-to-fiber coupling losses. Using typical loss numbers from the literature [12–16], we consider two scenarios:
+
+   • Optimistic: ILstage = 0.08 dB, ILcoupling = 1.5 dB. Then ILtot ≈ 1.90 dB (N = 5), 2.30 dB (N = 10), 3.10 dB
+     (N = 20), and 3.90 dB (N = 30).
+   • Conservative: ILstage = 0.25 dB, ILcoupling = 3.0 dB. Then ILtot ≈ 4.25 dB (N = 5), 5.50 dB (N = 10),
+     8.00 dB (N = 20), and 10.5 dB (N = 30).
+
+   In both scenarios, N = 5–10 is manageable for probe-power budgeting, whereas N = 20 and N = 30 require tighter
+power budgeting and more amplification margin. Higher ILtot raises the required probe SNR and pushes operation
+closer to the detector noise floor, reducing usable dynamic range.
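+
+   Equation (S6.1) is simple enough to tabulate directly. The sketch below evaluates both scenarios for the four
+cascade depths discussed; the function and dictionary names are illustrative.
+
```python
def il_total_db(n_rings, il_stage_db, il_coupling_db):
    """Eq. (S6.1): per-ring off-resonance loss plus fiber/chip coupling loss."""
    return n_rings * il_stage_db + il_coupling_db


# (IL_stage, IL_coupling) in dB for the two scenarios in the text.
scenarios = {"optimistic": (0.08, 1.5), "conservative": (0.25, 3.0)}
for name, (stage, coupling) in scenarios.items():
    cols = ", ".join(f"N={n}: {il_total_db(n, stage, coupling):.2f} dB"
                     for n in (5, 10, 20, 30))
    print(f"{name}: {cols}")
```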
+   e. Four-component loss breakdown. The total insertion loss of the cascade has four components:
+   1. On-resonance cascade transmission Dmax^N (dominant; see Table V);
+   2. Inter-ring coupling loss (N − 1) × (−10 log10 ηcoupling ), where ηcoupling is the power transfer efficiency at each
+      inter-ring bus section. Two-ring FDTD yields ηcoupling ≈ 0.9 for the present diagonal-bus geometry, corresponding
+      to ∼0.46 dB per inter-ring stage;
+   3. Off-resonance propagation loss N × ILstage , where ILstage = 0.08–0.25 dB per ring [12–14, 16];
+   4. Fiber-to-chip coupling loss ILcoupling = 1.5–3.0 dB [15].
+Table V presents the ideal on-resonance budget (Dmax^N only). Including all four components for the present diagonal-bus
+layout: in the FDTD-characterized regime (Dmax = 0.36, N = 5), the total loss is approximately 22.2 + 1.8 + 0.4 + 1.5 ≈
+26 dB; in the fabricated high-Q regime (Dmax = 0.95, N = 30), the total loss is 6.7 + 13.3 + 2.4 + 1.5 ≈ 24 dB. The
+inter-ring coupling loss dominates in the high-Q regime, underscoring that layout optimization (e.g., adiabatic tapers or
+straight-bus coupling) is as important as achieving Dmax ≥ 0.95 through quality-factor improvement. For an optimized
+layout with ηcoupling ≥ 0.98 (≤0.09 dB per stage), the N = 30 total loss would reduce to ∼13 dB.
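+
+   The four-component totals quoted above can be reproduced with the sketch below. It assumes the optimistic
+per-ring and fiber-coupling values (0.08 dB and 1.5 dB), which is what the quoted sums imply; the function name is
+illustrative.
+
```python
import math


def loss_components_db(n_rings, d_max, eta_coupling,
                       il_stage_db=0.08, il_fiber_db=1.5):
    """Return (cascade, inter-ring, propagation, fiber) losses in dB."""
    cascade = -n_rings * 10 * math.log10(d_max)
    inter_ring = (n_rings - 1) * (-10 * math.log10(eta_coupling))
    propagation = n_rings * il_stage_db
    return cascade, inter_ring, propagation, il_fiber_db


cases = [("FDTD regime, N=5", (5, 0.36, 0.90)),
         ("high-Q regime, N=30", (30, 0.95, 0.90)),
         ("high-Q regime, N=30, eta=0.98", (30, 0.95, 0.98))]
for label, args in cases:
    parts = loss_components_db(*args)
    print(label, [round(p, 1) for p in parts], f"total = {sum(parts):.1f} dB")
```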
+
+                             S7. ENERGY EFFICIENCY DETAILED DERIVATION
+
+  This section provides the detailed energy-per-operation derivations for both electrical analog exponential circuits
+and the photonic MRR cascade, as summarized in the main text (Sec. V).
+
+
+                                         S7.1 Electrical analog exponential circuits
+
+  Three main families of electrical circuits realize the exponential function in the analog domain:
+  f. BJT translinear / Gilbert cell circuits. The collector current of a bipolar junction transistor is IC =
+IS exp(VBE /VT ), providing an intrinsic exponential map [17, 18]. A Gilbert cell multiplier—the core building
+block of translinear exponential circuits—dissipates 250–325 µW in typical CMOS/BiCMOS implementations [19]. At
+a signal bandwidth of B ≈ 100 MHz, the energy per operation is
+\[ E_{\mathrm{Gilbert}} = \frac{P}{B} = \frac{300\ \mu\mathrm{W}}{100\ \mathrm{MHz}} = 3\ \mathrm{pJ}. \tag{S7.1} \]
+  g. CMOS subthreshold exponential circuits. A MOSFET in weak inversion exhibits ID ∝ exp(VGS /nVT ), enabling
+direct exponential computation at ultra-low power [18]. A reconfigurable softmax circuit in 180 nm CMOS implements
+a 10-input softmax at VDD = 500 mV with P = 3 µW [20]. Per-channel: Pexp ≈ 0.43 µW. At B ≈ 1 MHz (limited by
+subthreshold fT ):
+\[ E_{\text{sub-}V_T} = \frac{0.43\ \mu\mathrm{W}}{1\ \mathrm{MHz}} = 0.43\ \mathrm{pJ}. \tag{S7.2} \]
+This is the most energy-efficient electrical approach, but at severely limited bandwidth (∼1 MHz).
+  h. Digital CMOS (for reference). A digital exponential via Taylor series requires ∼10 multiply-add operations.
+Using Horowitz’s energy figures [21] for 45 nm at 0.9 V: 32-bit FP multiply costs 3.7 pJ, FP add costs 0.9 pJ, giving
+                                           Edigital ≈ 10 × (3.7 pJ + 0.9 pJ) = 46 pJ.                               (S7.3)
+At 8-bit precision (sufficient for inference): ∼2.3 pJ.
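+
+   The three electrical estimates above all reduce to power divided by bandwidth, or to a count of arithmetic
+operations. A minimal sketch using only the numbers quoted in this subsection (names are illustrative):
+
```python
def energy_per_op_pj(power_w, bandwidth_hz):
    """Energy per operation, E = P / B, returned in picojoules."""
    return power_w / bandwidth_hz * 1e12


e_gilbert = energy_per_op_pj(300e-6, 100e6)   # Eq. (S7.1)
e_sub_vt = energy_per_op_pj(0.43e-6, 1e6)     # Eq. (S7.2)
e_fp32 = 10 * (3.7 + 0.9)                     # Eq. (S7.3): 10 multiply-adds, pJ

print(f"Gilbert cell:      {e_gilbert:.2f} pJ")
print(f"Subthreshold CMOS: {e_sub_vt:.2f} pJ")
print(f"Digital FP32:      {e_fp32:.0f} pJ")
```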
+
+
+                          S7.2 Photonic MRR cascade: single-channel energy derivation
+
+   We evaluate the energy at N = 30 cascaded X-cut TFLN micro-ring resonators with R = 20 µm in the fabricated
+high-Q regime (Qi = 106 , QL ≈ 25,200; Supplementary Sec. S5), which achieves εmax = 0.30% with Vctrl = 0.91 V
+(fully CMOS-compatible). The energy per exponential operation has three components:
+   (i) Electro-optic tuning energy. Each ring is tuned by charging the arc electrode capacitance to Vctrl . For the lateral
+S–G arc electrodes covering one semicircle (Larc = πR = 62.8 µm), the electrode capacitance is estimated as
+                                                            Cel ≈ 18 fF,                                            (S7.4)
+based on coplanar electrode modeling for TFLN lateral S–G geometries with gel = 5 µm (comparable to values reported
+by Bahadori et al. [22] for similar geometries). The switching energy per ring at Vctrl = 0.91 V (using the projected
+QL = 25,200, which gives bV = 0.295 V−1 ):
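+\[ E_{\mathrm{ring}} = \tfrac{1}{2} C_{\mathrm{el}} V_{\mathrm{ctrl}}^{2} = \tfrac{1}{2} \times 18\ \mathrm{fF} \times (0.91\ \mathrm{V})^{2} = 7.4\ \mathrm{fJ}. \tag{S7.5} \]
+For N = 30 rings: EEO = 30 × 7.4 fJ = 0.22 pJ.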
+  Note the important scaling: EEO ∝ 1/N since b ∝ 1/N from minimax optimization, because
+\[ E_{\mathrm{EO}} \propto N \times V_{\mathrm{ctrl}}^{2} \propto N \times (b/b_V)^{2} \propto 1/N. \tag{S7.6} \]
+The bias voltage (3.9 V) is static and does not contribute per-operation energy.
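+
+   The switching energy and its approximate 1/N scaling can be checked numerically. The sketch assumes
+Vctrl = bL/bV with L = 8 (inferred from the tabulated values, not stated explicitly here) and uses the minimax b
+coefficients listed for N = 10, 20, and 30; function names are illustrative.
+
```python
C_EL = 18e-15    # F, electrode capacitance from Eq. (S7.4)
B_V = 0.295      # 1/V at Q_L = 25,200
L_RANGE = 8.0    # ASSUMPTION: input range L = 8


def ring_energy_fj(v_ctrl):
    """Eq. (S7.5): capacitive switching energy of one ring, in fJ."""
    return 0.5 * C_EL * v_ctrl ** 2 * 1e15


def cascade_energy_pj(n_rings, v_ctrl):
    """EO tuning energy of the whole cascade, in pJ."""
    return n_rings * ring_energy_fj(v_ctrl) * 1e-3


print(f"E_ring = {ring_energy_fj(0.91):.2f} fJ, "
      f"E_EO(N=30) = {cascade_energy_pj(30, 0.91):.2f} pJ")

# If E_EO scales as 1/N, the product N * E_EO stays roughly constant.
b_of_n = {10: 0.10202, 20: 0.05025, 30: 0.03341}
products = {}
for n, b in b_of_n.items():
    v = b * L_RANGE / B_V            # ASSUMPTION: V_ctrl = b * L / b_V
    products[n] = n * cascade_energy_pj(n, v)
    print(f"N={n}: V_ctrl = {v:.2f} V, N * E_EO = {products[n]:.2f} pJ")
```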
+   (ii) Laser source energy (amortized). Because every cascade channel uses the same fixed probe wavelength, a single
+CW laser can be shared among M parallel softmax channels via a 1 × M optical power splitter. With wall-plug
+efficiency ηWPE ≈ 15% [23], the per-channel optical power is Popt = Pin /M ≈ 100 µW (for Pin = 1 mW, M = 10),
+requiring Plaser ≈ 667 µW per channel. At fmod = 10 GHz: Elaser = 667 µW / 10 GHz = 67 fJ.
+   (iii) Photodetector energy. Integrated SiGe photodetectors with TIA achieve sub-pJ reception [24]: EPD ≈ 0.5 pJ.
+   The total single-channel energy is
+\[ E_{\mathrm{photonic}}^{(1\mathrm{ch})} = E_{\mathrm{EO}} + E_{\mathrm{laser}} + E_{\mathrm{PD}} = 0.22 + 0.07 + 0.50 = 0.79\ \mathrm{pJ}. \tag{S7.7} \]
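+
+   A minimal sketch of the laser amortization and the three-term sum, using only the values quoted in this
+subsection (variable names are illustrative):
+
```python
ETA_WPE = 0.15    # laser wall-plug efficiency
P_OPT_W = 100e-6  # per-channel optical power (1 mW shared by M = 10 channels)
F_MOD_HZ = 10e9   # modulation rate

p_laser_w = P_OPT_W / ETA_WPE                 # electrical power per channel
e_laser_pj = p_laser_w / F_MOD_HZ * 1e12      # amortized laser energy per op

E_EO_PJ = 0.22    # EO tuning energy for N = 30
E_PD_PJ = 0.50    # photodetector + TIA

e_total_pj = E_EO_PJ + e_laser_pj + E_PD_PJ
print(f"P_laser = {p_laser_w * 1e6:.0f} uW, E_laser = {e_laser_pj * 1e3:.0f} fJ")
print(f"E_photonic = {e_total_pj:.2f} pJ")
```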
+
+                                       S7.3 Q-factor scaling of energy efficiency
+
+  Since Vctrl ∝ 1/Q and EEO ∝ Vctrl^2, the EO energy scales as 1/Q^2. Table S11 shows the total energy for N = 30 at
+various quality factors.
+
+TABLE S11: Energy per exponential operation vs. quality factor (N = 30, εmax = 0.30%, X-cut arc electrode with bV
+ scaling linearly with Q; Cel = 18 fF). Elaser + EPD = 0.57 pJ is the Q-independent floor. The dagger (†) marks the
+FDTD-calibrated quality factor; the double dagger (‡) marks the high-Q design point (Qi = 10^6). Excludes thermal
+                                       stabilization (0.15–0.60 pJ for N = 30).
+
+      Q                    Vctrl (V)                  Vbias (V)                   EEO (pJ)                    Etotal (pJ)
+   5,000                     4.57                       19.5                        5.64                         6.21
+ 10,000                      2.28                        9.7                        1.40                         1.97
+ 12,500                      1.83                        7.8                        0.90                         1.47
+15,500†                      1.47                        6.3                        0.58                         1.15
+ 20,000                      1.14                        4.9                        0.35                         0.92
+25,200‡                      0.91                        3.9                        0.22                         0.79
+ 30,000                      0.76                        3.2                        0.16                         0.73
+ 50,000                      0.46                        1.9                        0.06                         0.63
+
+
+   At QL = 15,500 (FDTD-calibrated), the EO contribution (0.58 pJ) is comparable to the optical floor, placing the
+design in the efficient operating regime. Beyond Q ≈ 30,000, the EO contribution becomes negligible and the total
+energy saturates near the floor; further Q improvement primarily benefits CMOS driver voltage compatibility rather
+than energy.
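+
+   The 1/Q^2 trend in Table S11 can be regenerated with the sketch below. It assumes Vctrl = bL/bV and
+Vbias = |a|/bV with L = 8 and the N = 30 minimax coefficients (a = −1.1392, b = 0.03341), relations inferred from
+the tabulated values. Because the voltage is kept unrounded, energies can differ from the table by about 0.01 pJ.
+
```python
# ASSUMPTIONS: V_ctrl = b*L/b_V, V_bias = |a|/b_V, L = 8,
# b_V = 0.295 1/V at Q = 25,200 and proportional to Q.
A30, B30, L_RANGE = -1.1392, 0.03341, 8.0
B_V_REF, Q_REF = 0.295, 25_200
C_EL, N_RINGS, FLOOR_PJ = 18e-15, 30, 0.57


def table_row(q):
    """Return (V_ctrl, V_bias, E_EO in pJ, E_total in pJ) at quality factor q."""
    b_v = B_V_REF * q / Q_REF
    v_ctrl = B30 * L_RANGE / b_v
    v_bias = abs(A30) / b_v
    e_eo_pj = N_RINGS * 0.5 * C_EL * v_ctrl ** 2 * 1e12
    return v_ctrl, v_bias, e_eo_pj, e_eo_pj + FLOOR_PJ


for q in (5_000, 10_000, 12_500, 15_500, 20_000, 25_200, 30_000, 50_000):
    v_ctrl, v_bias, e_eo, e_tot = table_row(q)
    print(f"{q:6d}  {v_ctrl:5.2f} V  {v_bias:5.1f} V  {e_eo:5.2f} pJ  {e_tot:5.2f} pJ")
```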
+   i. Additional energy contributions. The estimates above exclude two further contributions: (i) DAC energy
+for setting the per-ring control voltages, typically 0.1–1 pJ per conversion at 10 GHz bandwidth; and (ii) thermal
+stabilization power for maintaining resonance alignment, estimated at ∼50–200 µW per ring for TFLN (lower than
+silicon due to the small thermo-optic coefficient of LiNbO3, dn/dT ≈ 3.9 × 10^-6 K^-1). At 10 GHz modulation rate,
+the thermal contribution amounts to ∼0.005–0.02 pJ per ring per operation. For the N = 30 cascade, this sums to
+0.15–0.60 pJ, which is comparable to EEO and must be included in the total: Etotal ≈ 0.94–1.39 pJ. The total energy
+comparison should therefore be treated as an order-of-magnitude estimate.
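+
+   The thermal contribution is a static power divided by the operation rate. A minimal sketch of that conversion
+for the quoted 50–200 µW per ring (names are illustrative):
+
```python
BASELINE_PJ = 0.79   # E_EO + E_laser + E_PD at the high-Q design point


def thermal_energy_pj(power_per_ring_w, n_rings=30, f_mod_hz=10e9):
    """Static stabilization power amortized over one operation, in pJ."""
    return n_rings * power_per_ring_w / f_mod_hz * 1e12


low, high = thermal_energy_pj(50e-6), thermal_energy_pj(200e-6)
print(f"thermal: {low:.2f} to {high:.2f} pJ")
print(f"total:   {BASELINE_PJ + low:.2f} to {BASELINE_PJ + high:.2f} pJ")
```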
+
+
+                                S7.4 Comparison with electronic implementations
+
+   Here we provide an order-of-magnitude energy comparison between electrical analog exponential circuits and our
+photonic MRR cascade, grounding the analysis in published device data and first-principles estimates. We assume
+a shared CW laser with total optical output Pin,tot = 1 mW, split across M = 10 parallel softmax channels via a
+1 × M power splitter, yielding per-channel input Pin,ch = 100 µW. The output power at the cascade drop port is
+Pout = Pin,ch × Dmax^N, which ranges from 0.61 µW (FDTD regime, N = 5) to 21.5 µW (fabricated regime, N = 30)
+(Table V).
+   j. Electrical analog exponential circuits. Three main families of electrical analog exponential circuits are compared:
+BJT translinear/Gilbert cell (∼ 3 pJ at 100 MHz [17–19]), CMOS subthreshold (∼ 0.43 pJ at 1 MHz [18, 20]), and
+digital FP32 Taylor series (∼ 46 pJ at 1 GHz [21]).
+   k. Photonic MRR cascade: single-channel energy. For N = 30 X-cut TFLN micro-ring resonators in the self-
+consistent high-Q regime (QL = 25,200), the three energy components are EO tuning (EEO = 0.22 pJ), amortized
+laser (Elaser = 0.07 pJ, shared across M = 10 parallel channels), and photodetector (EPD = 0.50 pJ), yielding
+Ephotonic = 0.79 pJ. Including thermal stabilization for N = 30 rings (0.15–0.60 pJ), the total rises to 0.94–1.39 pJ.
+Notably, EEO ∝ 1/N since b ∝ 1/N from minimax optimization.
+   l. Single-channel comparison. Table S12 presents the comparison. The photonic cascade at N = 30 achieves
+0.79 pJ baseline—3.8× lower than the BJT Gilbert cell (3 pJ) and 58× lower than digital FP32 (46 pJ). Including
+thermal stabilization (0.94–1.39 pJ), the advantage over INT8 (2.3 pJ) is 1.7–2.4×, while operating at 10 GHz
+bandwidth. At fabricated Q ≥ 30,000, EEO drops to 0.16 pJ and Etotal ≈ 0.73 pJ (excluding thermal; Table S11),
+recovering a 3.2× advantage over INT8. Subthreshold CMOS achieves the lowest energy (0.43 pJ) but at 10,000×
+lower bandwidth.
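+
+   The ratios quoted in this comparison are plain quotients of the per-operation energies. A short sketch that
+recomputes them from the numbers above (names are illustrative):
+
```python
def advantage(e_reference_pj, e_photonic_pj):
    """How many times lower the photonic energy is than the reference."""
    return e_reference_pj / e_photonic_pj


E_GILBERT, E_FP32, E_INT8 = 3.0, 46.0, 2.3   # pJ, from Sec. S7.1
print(f"vs Gilbert cell (0.79 pJ baseline): {advantage(E_GILBERT, 0.79):.1f}x")
print(f"vs FP32 (0.79 pJ baseline):         {advantage(E_FP32, 0.79):.0f}x")
print(f"vs INT8 with thermal:               "
      f"{advantage(E_INT8, 1.39):.1f}x to {advantage(E_INT8, 0.94):.1f}x")
print(f"vs INT8 at Q >= 30,000 (0.73 pJ):   {advantage(E_INT8, 0.73):.1f}x")
print(f"bandwidth ratio, 10 GHz / 1 MHz:    {10e9 / 1e6:.0f}x")
```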
+   m. Caveats. These values are order-of-magnitude estimates, not device-accurate predictions. The photonic
+estimate excludes DAC energy for voltage generation (typically 0.1–1 pJ per conversion at 10 GHz bandwidth, shared
+with any analog approach) and thermal tuning power for maintaining resonance alignment (∼50–200 µW per ring for
+
+                      TABLE S12: Energy per exponential operation: single-channel comparison.
+
+Implementation                                    E/op (pJ)                        Bandwidth                             Notes
+Digital FP32 (Taylor)                                ∼46                             1 GHz                           10 FP MACs
+BJT Gilbert cell                                     ∼3                             100 MHz                              Analog
+Digital INT8 (Taylor)                                ∼2.3                            1 GHz                           10 INT MACs
+Photonic MRR (N = 30)                             0.94–1.39                         10 GHz                             Analog†
+Subthreshold CMOS                                   ∼0.43                            1 MHz                               Analog
+    † 0.79 pJ excluding thermal; 0.94–1.39 pJ including thermal. Self-consistent with fabricated high-Q regime (QL = 25,200); see
+                                                      Supplementary Sec. S7.
+
+
+TFLN, lower than silicon due to the small thermo-optic coefficient of LiNbO3, dn/dT ≈ 3.9 × 10^-6 K^-1). Effective
+precision at the photodetector is limited to ∼6–8 bits by shot noise and receiver electronics. The energy advantage
+over electrical implementations is strongest in the fabricated high-Q regime (Dmax ≥ 0.95), where N = 30 is practical
+and Vctrl remains CMOS-compatible.
+
+                  S8. MONTE CARLO ROBUSTNESS UNDER DEVICE NON-IDEALITIES
+
+   This section describes the robustness model summarized in the main text. For the fitted L = 8, N = 10 design
+(a = −1.4588, b = 0.10202), each Monte Carlo chip sample includes: (i) per-ring static detuning spread, (ii) per-
+ring sensitivity spread, (iii) global thermal drift and crosstalk-like slope drift, (iv) stage insertion-loss variation, (v)
+control-channel noise, and (vi) detector noise with one-point calibration at I = L.
+   For ring k, we use
+\[ T_k(I) = \frac{1}{1 + \left(a_k + b_k I + d_{\mathrm{th}} + d_{\mathrm{xt}}\, I/L\right)^{2}}, \tag{46} \]
+
+with
+\[ y(I) = \prod_{k=1}^{N} T_k(I) \times 10^{-\mathrm{IL}_{\mathrm{tot}}/10}, \tag{47} \]
+
+and one-point calibration ỹ(I) = Ccal y(I) such that ỹ(L) = 1 for the same chip instance.
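+
+   A reduced sketch of one Monte Carlo chip sample built from Eqs. (46) and (47) is given below. It is not the
+authors' simulation code: it assumes zero-mean Gaussian spreads with the nominal σ values of Table S13, takes
+exp(I − L) as the calibrated target, and omits insertion-loss variation, control-channel noise, and detector noise.
+All names are illustrative.
+
```python
import math
import random

A_NOM, B_NOM, N_RINGS, L_RANGE = -1.4588, 0.10202, 10, 8.0
# Nominal spreads from Table S13 (ASSUMPTION: zero-mean Gaussian).
SIGMA = {"a": 0.020, "b_rel": 0.020, "th": 0.015, "xt": 0.012}


def sample_chip(rng):
    """Draw per-ring (a_k, b_k) plus global thermal and crosstalk offsets."""
    rings = [(A_NOM + rng.gauss(0, SIGMA["a"]),
              B_NOM * (1 + rng.gauss(0, SIGMA["b_rel"])))
             for _ in range(N_RINGS)]
    return rings, rng.gauss(0, SIGMA["th"]), rng.gauss(0, SIGMA["xt"])


def response(chip, x, il_tot_db=0.0):
    """Eq. (47): product of the ring Lorentzians of Eq. (46) times a loss factor."""
    rings, d_th, d_xt = chip
    y = 10 ** (-il_tot_db / 10)
    for a_k, b_k in rings:
        y /= 1.0 + (a_k + b_k * x + d_th + d_xt * x / L_RANGE) ** 2
    return y


def calibrated(chip, x):
    """One-point calibration so that the output equals 1 at I = L."""
    return response(chip, x) / response(chip, L_RANGE)


rng = random.Random(0)
chip = sample_chip(rng)
for x in (0.0, 4.0, 8.0):
    print(f"I = {x}: calibrated = {calibrated(chip, x):.4e}, "
          f"target = {math.exp(x - L_RANGE):.4e}")
```
+
+   Repeating `sample_chip` over many seeds and collecting the deviation from the target would give a distribution
+in the spirit of Table S14, although the omitted noise terms mean the numbers are not expected to match.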
+
+                       TABLE S13: Non-ideality distributions used in the Monte Carlo sweeps.
+
+                                        Parameter                 Nominal          Stress
+                                        σa                         0.020           0.032
+                                        σb,rel                     0.020           0.032
+                                        σth                        0.015           0.025
+                                        σxt                        0.012           0.020
+                                        σI                         0.004           0.007
+                                        ILstage (dB, µ ± σ)     0.12 ± 0.03     0.18 ± 0.05
+                                        σdet                    3.0 × 10^-6     6.0 × 10^-6
+
+
+
+                        TABLE S14: Monte Carlo summary (same run reported in main text).
+
+                                     Metric                         Nominal          Stress
+                                     Median KL(pref ∥ papprox)    2.17 × 10^-4     7.39 × 10^-4
+                                     p95 KL(pref ∥ papprox)       5.92 × 10^-4     2.21 × 10^-3
+                                     Median max |∆p|                0.170%           0.193%
+                                     p95 max |∆p|                   0.319%           0.419%
+
+Conservative-bound sketch used for the main-text screening equation. For the identical-detuning family
+with fixed b, define
+
+\[ \ln \tilde{y}(I) = N\,\phi(a + bI) - N\,\phi(a + bL), \qquad \phi(u) = -\ln(1 + u^{2}), \tag{48} \]
+
+so that ỹ(L) = 1 by construction. Around a constructive choice with a + bI < 0 on [0, L], a second-order remainder
+argument for the mismatch between the target slope and the fitted slope yields a term scaling as L^2/(4N), while the
+flank-curvature penalty contributes a term scaling as 1/(2 b^2 N). Combining the two contributions gives the screening
+inequality
+
+\[ E_\infty \lesssim \frac{L^{2}}{4N} + \frac{1}{2 b^{2} N}, \tag{49} \]
+which leads to the conservative screening equation reported in the main manuscript. We emphasize that this is a
+conservative heuristic design rule (not a formal minimax theorem), used only for preliminary depth screening.
+
+
+
+
+FIG. S4: CDF of end-to-end softmax probability error under the same non-ideality samples.
+
+                      S9. DELAY-AWARE FEEDBACK NORMALIZATION VALIDATION
+
+  We model global normalization as a delayed PI-controlled loop:
+
+\[ S(t) = G(t)\,P(t) + n(t), \tag{50} \]
+\[ \tau \frac{dP}{dt} = -P(t) + u(t - T_d), \tag{51} \]
+\[ u(t) = K_p\, e(t) + K_i \int e(t)\, dt, \qquad e(t) = S_{\mathrm{ref}} - S(t), \tag{52} \]
+
+with actuator saturation 0 ≤ u ≤ Pmax . A piecewise G(t) profile is used to emulate workload changes. For physical
+intuition, Table S15 converts normalized delay/settling metrics into absolute-time examples.
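+
+   A minimal forward-Euler sketch of Eqs. (50)–(52) is shown below for a constant gain G = 1 and no noise. It
+assumes Ki is expressed in units of 1/τ, Pmax = 2, and zero actuator history before t = 0; none of these are
+specified in this section, so the sketch illustrates the loop structure and is not expected to reproduce the
+tabulated overshoot or settling values.
+
```python
from collections import deque


def simulate(kp=0.55, ki=0.8, td=0.2, tau=1.0, s_ref=1.0, gain=1.0,
             p_max=2.0, t_end=40.0, dt=1e-3):
    """Integrate the delayed PI loop; all times are in units of tau."""
    delay_steps = max(1, round(td / dt))
    # FIFO of past actuator values; index 0 holds u(t - Td).
    u_hist = deque([0.0] * delay_steps, maxlen=delay_steps)
    p, integral, trace = 0.0, 0.0, []
    for _ in range(round(t_end / dt)):
        s = gain * p                      # Eq. (50) with n(t) = 0
        e = s_ref - s
        integral += e * dt
        u = min(max(kp * e + ki * integral, 0.0), p_max)   # Eq. (52) + saturation
        u_delayed = u_hist[0]
        u_hist.append(u)                  # oldest entry is dropped automatically
        p += dt * (-p + u_delayed) / tau  # Eq. (51)
        trace.append(s)
    return trace


trace = simulate()
print(f"final S / S_ref = {trace[-1]:.3f}, peak S / S_ref = {max(trace):.3f}")
```
+
+   A piecewise G(t) profile, as used in the text, could be emulated by replacing the constant `gain` with a function
+of the step index.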
+
+TABLE S15: Example absolute-time interpretation of normalized PI-loop metrics using one representative stable case
+            ((Kp , Ki , Td /τ ) = (0.55, 0.8, 0.2)) and a ±2% settling-time definition (Tsettle ∼ 12.4τ ).
+
+                                 Assumed τ     Delay Td = 0.2τ     Settling ∼ 12.4τ     Interpretation
+                                   100 ns          20 ns               1.24 µs           fast loop
+                                    1 µs           200 ns              12.4 µs           moderate loop
+                                    5 µs            1 µs                62 µs            slower loop
+
+Reference-backed latency context for bottleneck screening. To place the delayed-loop times against mixed-
+signal system latencies, Table S16 summarizes representative time scales with explicit path classes (on-chip vs off-chip)
+for memory and interconnect paths, alongside conversion latency ranges. These are intentionally order-of-magnitude
+ranges (not fixed constants), and can shift with architecture, clocking, and protocol stack choices.
+
+    TABLE S16: Representative subsystem latency ranges used for conservative bottleneck screening in Sec. S9.
+
+                                       Subsystem path                  Tsys          Sources
+                                       On-chip memory (L1/L2)          20–200 ns     [25]
+                                       Off-chip memory (DRAM)          200–700 ns    [25, 26]
+                                       ADC conversion                  10–710 ns     [27, 28]
+                                       DAC + driver/settling           1–200 ns      [29]
+                                       On-chip interconnect (NoC)      5–100 ns      [30]
+                                       Off-chip I/O (PCIe/CXL)         1–10 µs       [25, 31]
+
+Conservative risk-screening heuristic for loop latency. As a screening heuristic, we use the settling time from
+one representative stable case ((Kp , Ki , Td /τ ) = (0.55, 0.8, 0.2); Table S18), with settling defined as the first time
+entering and remaining within a ±2% band around Sref , as a normalization-loop latency proxy:
+
+                                                        Tnorm ≈ 12.4 τ.                                                   (53)
+
+This value is not a universal bound; different gain settings, delay ratios, or loop architectures will yield different settling
+times. It is used only as a reference point for order-of-magnitude risk screening. Define the conservative screening
+metric
+
+                                                       Tnorm ≥ β Tsys ,                                                   (54)
+
+with β = 1 (high-risk screening line) and β = 0.5 (early-warning line); this is a heuristic risk indicator, not a formal
+dominance proof. The corresponding threshold is
+\[ \tau_{\mathrm{crit}}(\beta) = \frac{\beta\, T_{\mathrm{sys}}}{12.4}. \tag{55} \]
+Table S17 gives the resulting numeric ranges.
+For the explicit examples in Table S15, τ = 0.1 µs gives Tnorm ≈ 1.24 µs, τ = 1 µs gives Tnorm ≈ 12.4 µs, and τ = 5 µs
+gives Tnorm ≈ 62 µs. These numbers indicate a risk trend (not a hard boundary): for this representative case, the
+normalization loop is typically non-dominant when τ is well below the relevant τcrit band, and it may become dominant
+as τ approaches or exceeds that band. The transition depends on path class (on-chip vs off-chip) and on
+architecture-specific timing closure, including whether the normalization path lies on the end-to-end critical path
+(Table S16). Accordingly, this analysis is intended for preliminary risk screening only; concrete implementations
+require full timing validation.
+
+                        TABLE S17: Computed τcrit ranges from Eq. (55) using Table S16.
+
+                         Subsystem                     Tsys range    τcrit (β = 0.5)    τcrit (β = 1)
+                         On-chip memory path           20–200 ns     0.81–8.06 ns       1.61–16.13 ns
+                         Off-chip memory path          200–700 ns    8.06–28.23 ns      16.13–56.45 ns
+                         ADC conversion                10–710 ns     0.40–28.63 ns      0.81–57.26 ns
+                         DAC + driver/settling         1–200 ns      0.04–8.06 ns       0.08–16.13 ns
+                         On-chip interconnect (NoC)    5–100 ns      0.20–4.03 ns       0.40–8.06 ns
+                         Off-chip I/O fabric           1–10 µs       0.04–0.40 µs       0.08–0.81 µs
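The screening arithmetic of Eqs. (53)–(55) can be checked directly. The Python sketch below recomputes the τcrit ranges from the Tsys ranges of Table S16 and applies the two screening lines (β = 0.5 and β = 1) to a given τ. The function names, the dictionary layout, and the three risk labels are illustrative assumptions; they are not taken from the authors' scripts.

```python
# Tsys ranges from Table S16, as (low, high) in nanoseconds.
T_SYS_NS = {
    "On-chip memory (L1/L2)": (20.0, 200.0),
    "Off-chip memory (DRAM)": (200.0, 700.0),
    "ADC conversion": (10.0, 710.0),
    "DAC + driver/settling": (1.0, 200.0),
    "On-chip interconnect (NoC)": (5.0, 100.0),
    "Off-chip I/O (PCIe/CXL)": (1000.0, 10000.0),
}

SETTLING_FACTOR = 12.4  # Tnorm ≈ 12.4 τ for the representative stable case, Eq. (53)


def t_norm(tau_ns):
    """Normalization-loop latency proxy, Eq. (53)."""
    return SETTLING_FACTOR * tau_ns


def tau_crit(t_sys_ns, beta):
    """Threshold on τ at which Tnorm = β Tsys, Eq. (55)."""
    return beta * t_sys_ns / SETTLING_FACTOR


def screen(tau_ns, t_sys_ns):
    """Apply Eq. (54) with β = 1 and β = 0.5. The labels are hypothetical."""
    tn = t_norm(tau_ns)
    if tn >= 1.0 * t_sys_ns:
        return "high-risk"
    if tn >= 0.5 * t_sys_ns:
        return "early-warning"
    return "below screening lines"


def tau_crit_table():
    """Recompute the τcrit ranges (in ns) for every subsystem and both β values."""
    return {
        name: {beta: (tau_crit(lo, beta), tau_crit(hi, beta)) for beta in (0.5, 1.0)}
        for name, (lo, hi) in T_SYS_NS.items()
    }


if __name__ == "__main__":
    for name, rows in tau_crit_table().items():
        lo_half, hi_half = rows[0.5]
        lo_one, hi_one = rows[1.0]
        print(f"{name:28s} beta=0.5: {lo_half:8.2f}-{hi_half:8.2f} ns   "
              f"beta=1: {lo_one:8.2f}-{hi_one:8.2f} ns")
```

For example, `screen(100.0, 10000.0)` (τ = 0.1 µs against a 10 µs off-chip I/O path) falls below both screening lines, because Tnorm ≈ 1.24 µs is smaller than 0.5 × 10 µs.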
+
+TABLE S18: Representative step-response cases for the delayed PI loop (settling defined by a ±2% band around Sref ).
+
+                                Case      (Kp , Ki , Td /τ ) Overshoot    Settling     Stable
+                                Stable    (0.55, 0.8, 0.2)     25.6%      ∼ 12.4τ       Yes
+                                Marginal (0.95, 1.6, 0.45)     25.6%      ∼ 12.8τ       Yes
+                                Unstable (1.2, 2.2, 0.75)      45.1%     not settled    No
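The settling definition used in Table S18 (the first time the response enters and remains within a ±2% band around Sref) and the overshoot column can be evaluated on any sampled step response. The sketch below is a minimal illustration; the helper names and the sampled-trace interface are assumptions, and it is not the authors' feedback-loop validation script.

```python
import numpy as np


def settling_time(t, y, ref, band=0.02):
    """Return the first time after which y stays within ±band*|ref| of ref.

    Returns None if the trace is still outside the band at its last sample.
    """
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    outside = np.abs(y - ref) > band * abs(ref)
    if outside[-1]:
        return None  # not settled within the simulated window
    idx = np.flatnonzero(outside)
    if idx.size == 0:
        return float(t[0])  # inside the band from the first sample
    return float(t[idx[-1] + 1])  # first sample after the last excursion


def overshoot_percent(y, ref):
    """Peak excursion above ref, in percent of |ref| (assumes a positive step)."""
    y = np.asarray(y, dtype=float)
    return max(0.0, float((y.max() - ref) / abs(ref) * 100.0))
```

As a sanity check, a first-order response `y = 1 - exp(-t)` settles close to t = ln 50 ≈ 3.91 under this definition and has zero overshoot.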
+
+
+
+                   TABLE S19: Stable-region fraction from gain-map scans at each delay ratio.
+
+                                                  Td /τ Stable fraction
+                                                   0.0        88.1%
+                                                   0.2        88.0%
+                                                   0.5        72.4%
+                                                   0.8        47.5%
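A stable fraction of this kind is the share of scanned (Kp, Ki) grid points whose step response is classified as stable at a given delay ratio. This section does not state the loop model, so the sketch below assumes one: a first-order plant dS/dt = −S + u(t − Td), with time in units of τ, driven by a PI controller and integrated by forward Euler. The plant, the tail-in-band stability test, and all function names are assumptions. The sketch illustrates the scan procedure only and is not expected to reproduce the values in Tables S18 and S19.

```python
import numpy as np


def simulate_delayed_pi(kp, ki, delay, t_end=60.0, dt=0.01, ref=1.0):
    """Step response of an ASSUMED delayed PI loop (time in units of tau).

    Assumed plant (not from the source): dS/dt = -S + u(t - delay).
    Controller: u = kp * e + ki * integral(e), with e = ref - S.
    """
    n_steps = int(round(t_end / dt))
    n_delay = int(round(delay / dt))
    s = 0.0
    integral = 0.0
    u_hist = np.zeros(n_steps + 1)  # past controller outputs (delay line)
    y = np.zeros(n_steps + 1)
    for k in range(n_steps):
        e = ref - s
        integral += e * dt
        u_hist[k] = kp * e + ki * integral
        u_delayed = u_hist[k - n_delay] if k >= n_delay else 0.0
        s += dt * (-s + u_delayed)  # forward-Euler plant update
        if abs(s) > 1e6:  # diverged: freeze the trace and stop early
            y[k + 1:] = s
            break
        y[k + 1] = s
    t = np.arange(n_steps + 1) * dt
    return t, y


def is_stable(t, y, ref=1.0, band=0.02, tail=0.2):
    """Heuristic: stable if the last `tail` fraction of the trace stays in the band."""
    start = int(len(y) * (1.0 - tail))
    return bool(np.all(np.abs(y[start:] - ref) <= band * abs(ref)))


def stable_fraction(kp_values, ki_values, delay):
    """Fraction of (kp, ki) grid points classified as stable at one delay ratio."""
    flags = [
        is_stable(*simulate_delayed_pi(kp, ki, delay))
        for kp in kp_values
        for ki in ki_values
    ]
    return sum(flags) / len(flags)
```

A scan would then call, for instance, `stable_fraction(np.linspace(0.1, 2.0, 10), np.linspace(0.1, 3.0, 10), 0.5)`. These grid limits are hypothetical, since the scanned gain ranges are not listed in this section.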
+
+
+
+
+FIG. S5: Step-response examples of the delayed PI normalization loop.
+
+
+
+
+FIG. S6: Delay-dependent stability maps over scanned (Kp , Ki ) ranges.
+
+                                               S10. REPRODUCIBILITY
+
+  Scripts used for this Supplementary validation:
+    • scripts/nonideality_montecarlo.py
+    • scripts/feedback_loop_validation.py
+    • scripts/extract_logit_range_effective.py
+    • scripts/analyze_softmax_clipping_validity.py
+Public code repository: https://github.com/hyoseokp/MRR-AEF (commit 585e695). Empirical extraction outputs
+are stored under:
+    • paper/empirical_L_v3/
+
+
+
+
+ [1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia
+     Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems 30 (NeurIPS 2017), pages
+     5998–6008, 2017.
+ [2] Alec Radford et al. Language models are unsupervised multitask learners. Technical report, OpenAI, 2019.
+ [3] Hugging Face. distilgpt2 model card, 2025. accessed 2026-02-21.
+ [4] Andrej Karpathy. Tiny shakespeare dataset (char-rnn), 2025. accessed 2026-02-21.
+ [5] Jane Austen. Pride and prejudice. Project Gutenberg eBook No. 1342, 2025. accessed 2026-02-21.
+ [6] Di Zhu et al. Integrated photonics on thin-film lithium niobate. Advances in Optics and Photonics, 13(2):242–352, 2021.
+ [7] Yaowen Hu, Di Zhu, Shengyuan Lu, Xinrui Zhu, Yunxiang Song, Dylan Renaud, Daniel Assumpcao, Rebecca Cheng,
+     CJ Xin, Matthew Yeh, Hana Warner, Xiangwen Guo, Amirhassan Shams-Ansari, David Barton, Neil Sinclair, and Marko
+     Lončar. Integrated electro-optics on thin-film lithium niobate. Nature Reviews Physics, 2025.
+ [8] Mian Zhang, Cheng Wang, Rebecca Cheng, Amirhassan Shams-Ansari, and Marko Lončar. Monolithic ultra-high-Q lithium
+     niobate microring resonator. Optica, 4(12):1536–1537, 2017.
+ [9] Renhong Gao, Ni Yao, Jianglin Guan, Li Deng, Jintian Lin, Min Wang, Lingling Qiao, Wei Fang, and Ya Cheng. Lithium
+     niobate microring with ultra-high Q factor above 10^8. Chin. Opt. Lett., 20(1):011902, 2022.
+[10] Rongjin Zhuang, Jinze He, Yifan Qi, and Yang Li. High-Q thin-film lithium niobate microrings fabricated with wet etching.
+     Adv. Mater., 35(3):2208113, 2023.
+[11] Xinrui Zhu, Yaowen Hu, Shengyuan Lu, Hana K. Warner, Xudong Li, Yunxiang Song, Letícia S. Magalhães, Amirhassan
+     Shams-Ansari, Neil Sinclair, and Marko Lončar. Twenty-nine million intrinsic Q-factor monolithic microresonators on
+     thin-film lithium niobate. Photon. Res., 12(8):A63–A68, 2024.
+[12] Sudip Shekhar, Wim Bogaerts, Lukas Chrostowski, John E. Bowers, Michael Hochberg, Richard Soref, and Bhavin J.
+     Shastri. Roadmapping the next generation of silicon photonics. Nature Communications, 15:751, 2024.
+[13] Xaveer Leijtens et al. Multimode silicon photonics. Nanophotonics, 7:1571–1580, 2018.
+[14] Haoqian Li et al. In-memory photonic dot-product engine with electrically programmable weight banks. Nature Communi-
+     cations, 14:2389, 2023.
+[15] Daan Vermeulen et al. High-efficiency fiber-to-chip grating couplers realized using an advanced CMOS-compatible
+     silicon-on-insulator platform. Optics Express, 18(17):18278–18283, 2010.
+[16] F. S. Tan, D. J. W. Klunder, H. F. Bulthuis, G. Sengo, H. J. W. M. Hoekstra, and A. Driessen. Direct measurement of
+     the on-chip insertion loss of high finesse microring resonators in Si3N4-SiO2 technology. In Proceedings of the IEEE LEOS
+     Benelux Chapter, 2001.
+[17] B. Gilbert. Translinear circuits: a proposed classification. Electron. Lett., 11(1):14–16, 1975.
+[18] C. Mead. Analog VLSI and Neural Systems. Addison-Wesley, 1989.
+[19] B. Razavi. Design of Analog CMOS Integrated Circuits. McGraw-Hill, 2nd edition, 2017.
+[20] Massimo Vatalaro, Tatiana Moposita, Sebastiano Strangio, Lionel Trojman, Andrei Vladimirescu, Marco Lanuzza, and
+     Felice Crupi. A low-voltage, low-power reconfigurable current-mode softmax circuit for analog neural networks. Electronics,
+     10(9):1004, 2021.
+[21] M. Horowitz. 1.1 computing’s energy problem (and what we can do about it). In 2014 IEEE International Solid-State
+     Circuits Conference (ISSCC), pages 10–14, 2014.
+[22] Meisam Bahadori, Yansong Yang, Ahmed E. Hassanien, Lynford L. Goddard, and Songbin Gong. Ultra-efficient and fully
+     isotropic monolithic microring modulators in a thin-film lithium niobate photonics platform. Optics Express, 28(20):29644–
+     29661, 2020.
+[23] A. Biberman and K. Bergman. Optical interconnection networks for high-performance computing systems. Rep. Prog.
+     Phys., 75(4):046402, 2012.
+[24] D. A. B. Miller. Attojoule optoelectronics for low-energy information processing and communications. J. Lightwave Technol.,
+     35(3):346–396, 2017.
+[25] Zhe Jia, Marco Maggioni, Benjamin Staiger, and Daniele P. Scarpazza. Dissecting the NVIDIA Volta GPU architecture via
+     microbenchmarking. arXiv preprint arXiv:1804.06826, 2018.
+[26] Yoongu Kim, Vivek Seshadri, Donghyuk Lee, Jamie Liu, and Onur Mutlu. A case for exploiting subarray-level parallelism
+     (SALP) in DRAM. In Proceedings of the 39th Annual International Symposium on Computer Architecture (ISCA), pages
+     368–379, 2012.
+[27] Texas Instruments. ADC12DJ3200: 6.4-GSPS single-channel or 3.2-GSPS dual-channel, 12-bit, RF-sampling analog-to-digital
+     converter. Datasheet (SLVSD97A, revised April 2020), 2020. Accessed 2026-02-22.
+[28] Texas Instruments. ADS8881: 18-bit, 1-MSPS, low-power, true-differential SAR ADC. Datasheet (SBAS547D, revised
+     August 2015), 2015. Accessed 2026-02-22.
+[29] Texas Instruments. DAC38RF82/DAC38RF89: Dual-channel, 14-bit, 9-GSPS and 6-GSPS RF DACs. Datasheet
+     (SLASEA6D, revised June 2020), 2020. Accessed 2026-02-22.
+[30] W. J. Dally and B. Towles. Route packets, not wires: on-chip interconnection networks. In Proceedings of the 38th Design
+     Automation Conference (DAC), pages 684–689, 2001.
+[31] Shintaro Sano, Yosuke Bando, Kazuhiro Hiwada, Hirotsugu Kajihara, Tomoya Suzuki, Yu Nakanishi, Daisuke Taki, and
+     Akiyuki Kaneko. GPU graph processing on CXL-based microsecond-latency external memory. In Proceedings of the SC ’23
+     Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2023.
+
+\ No newline at end of file