Photonic Exponential Approximation via Cascaded TFLN Microring Resonators toward Softmax

Hyoseok Park¹ and Yeonsang Park¹,∗
¹Department of Physics, Chungnam National University, Daejeon 34134, Republic of Korea
(Dated: March 26, 2026)
                                                                  The rapid growth of large-scale AI models has intensified energy consumption and data-movement
                                                               challenges in modern datacenters. Photonic accelerators offer a promising path by executing the linear
                                                               matrix multiplications of transformer inference at high throughput and low energy. However, the
                                                               softmax attention layer—which requires element-wise exponentiation followed by normalization—still
                                                               relies on electronic post-processing, creating an electro-optic conversion bottleneck that negates much
                                                               of the potential photonic advantage.
arXiv:2603.12934v3 [physics.optics] 25 Mar 2026


   We present a cascaded micro-ring resonator (MRR) architecture that synthesizes the per-channel
exponential function required by softmax, $e^{x_n - \max(x)}$, over a finite interval with tunable worst-case
relative error. A control signal detunes each ring via an electro-optic mechanism; a weak probe
at fixed frequency experiences Lorentzian transmission, and cascading N identical stages yields a
multiplicative transfer function whose logarithm is approximately linear.
                                                                  We derive mapping rules, depth-scaling estimates, and a minimax fitting formulation, and validate
                                                               the framework with three-dimensional FDTD simulations of X-cut thin-film lithium niobate (TFLN)
                                                               add-drop micro-ring resonators. Direct multi-ring FDTD validation extends to a five-ring cascade
                                                               and confirms agreement with theory primarily over the upper operating range; deeper cascades and
                                                               higher quality factors are assessed analytically. The cascade implements the per-channel exponential
                                                               block—the key missing nonlinearity for photonic softmax. We further present a WDM-parallel
                                                               chip architecture with closed-loop PI feedback that completes the full softmax—exponentiation,
                                                               summation, and normalization—on a single photonic chip without per-channel normalization circuitry.


∗ yeonsang.park@cnu.ac.kr; Corresponding author

I. INTRODUCTION

Transformer inference is often limited by power and memory traffic, motivating optical accelerators that exploit parallel propagation and multiplexing [1, 2, 4, 5, 7, 9]. Recent perspective articles also discuss data-center power consumption as one motivation for optical computing [3, 8]. While linear operators are comparatively amenable to photonic implementation [4–6], the softmax function used in attention layers requires an exponential mapping together with global normalization, both of which are difficult to realize in passive photonic circuits, where transmission is fundamentally bounded by unity. Parallel digital-hardware studies treat the exponential/softmax stage as a bottleneck and propose dedicated approximations [11–19]. Many integrated-photonic classifier demonstrations still rely on electronic post-processing for the final nonlinear readout [10]; the resulting electro-optic conversion overhead can negate the throughput and energy benefits of the photonic front-end. Notably, the SOFTONIC architecture [11] explicitly argues that “the inability of MRRs and MZMs to handle SMA’s exponential and division functions” necessitates alternative approaches based on microdisk modulators and polynomial approximation, achieving 89.7% accuracy with a third-degree Chebyshev polynomial. Here we challenge this premise: we show that a passive Lorentzian cascade of microring resonators can be tuned so that its logarithm is approximately linear over a finite interval, enabling exponential-function synthesis with sub-2% worst-case error (an order of magnitude more accurate than SOFTONIC’s polynomial approach) while remaining compatible with integrated microring platforms [20–24]. We term this cascade block an approximate exponential function (AEF) unit. We further propose a WDM-parallel architecture with a single PI feedback loop that realizes the complete softmax function, including summation and normalization, without per-channel electronic processing.

We extend the theoretical framework with three-dimensional FDTD simulations of a single X-cut TFLN add-drop micro-ring resonator. The simulated device parameters (quality factor, free spectral range, and electro-optic sensitivity) calibrate the cascade design parameters, bridging analytical fitting and physically realizable hardware. Two operating regimes emerge from this calibration: an FDTD-characterized regime with moderate drop-port depth ($D_{\max} \approx 0.36$), where the analytic error stays below ∼5% for N ≤ 7 but the power budget limits practical cascades to N ≤ 5; and a projected high-Q regime ($D_{\max} \ge 0.95$), enabling deeper cascades (N ≤ 30) with sub-percent error. Cascade performance is predicted analytically and validated by a five-ring cascade 3D FDTD simulation (Sec. IV).

The paper is organized as follows: Section II presents the mapping, transfer model, and depth-design rules; Section III provides numerical fits and validation; Section IV describes the single-ring TFLN device design and FDTD validation; Section V assesses physical feasibility including voltage requirements, insertion loss, and energy efficiency;

Section VI discusses implementation scope, platform comparisons, and limits; and Section VII concludes.

II. MODEL AND DESIGN FRAMEWORK

Target mapping. Let $x = (x_1, \dots, x_K) \in \mathbb{R}^K$ be an arbitrary real-valued sequence (or vector). Directly generating $\exp(x_n)$ as a passive optical transmission is impossible in general because $\exp(x)$ grows beyond unity while a passive transmission satisfies $0 < T \le 1$ [25]. However, for softmax,

$$\mathrm{softmax}(x)_n = \frac{e^{x_n}}{\sum_j e^{x_j}}, \tag{1}$$

a common shift cancels:

$$\frac{e^{x_n + c}}{\sum_j e^{x_j + c}} = \frac{e^{x_n}}{\sum_j e^{x_j}} \qquad (\forall c \in \mathbb{R}). \tag{2}$$

Thus it suffices to generate

$$e^{x_n - m}, \qquad m \equiv \max_j x_j, \tag{3}$$

since the global factor $e^{m}$ cancels.

To ensure a nonnegative control-signal amplitude, define

$$u_n \equiv x_n - m \le 0, \qquad L \equiv -\min_n u_n = m - \min_n x_n \ge 0, \tag{4}$$

and map each scalar to a nonnegative control-signal amplitude

$$I_n \equiv u_n + L \in [0, L]. \tag{5}$$

Then

$$e^{x_n - m} = e^{u_n} = e^{I_n - L}. \tag{6}$$

Hence the optical design task is to realize, for $I \in [0, L]$,

$$f(I) = e^{I - L} \in [e^{-L}, 1]. \tag{7}$$

Control–probe transfer. Consider a weak probe at fixed angular frequency $\omega_L$. For the kth ring, let $\omega_{0,k}$ denote its resonance frequency and $\Gamma > 0$ its loaded half-width at half maximum (HWHM). Define the detuning

$$\Delta\omega_k \equiv \omega_L - \omega_{0,k}. \tag{8}$$

Near resonance, the normalized Lorentzian transmission is modeled as [20, 21]

$$T_k(\Delta\omega_k) = \frac{1}{1 + \left(\Delta\omega_k / \Gamma\right)^2}. \tag{9}$$

In a control–probe architecture, a nonnegative control-signal amplitude $I \ge 0$ shifts the ring resonance. Here $I$ denotes a generic control amplitude: for optical-pump operation it maps to optical intensity, while for EO operation it maps to electrical control level (e.g., voltage). Across many physical mechanisms (optical pump via Kerr/XPM, EO drive via Pockels effect, thermal, carrier tuning), the shift can be linearized on a working range [20, 26–30]:

$$\omega_{0,k}(I) = \omega_{0,k}^{(0)} + \eta I, \tag{10}$$

where $\omega_{0,k}^{(0)}$ is the cold-cavity resonance and $\eta$ is the control-to-resonance sensitivity. In practice, the control channel can be optical or electrical (optical pump, EO/Pockels drive, thermal, or carrier tuning); a quantitative EO feasibility example is given in the Discussion. With $\Delta\omega_{0,k} \equiv \omega_L - \omega_{0,k}^{(0)}$, the control-dependent detuning becomes

$$\Delta\omega_k(I) = \Delta\omega_{0,k} - \eta I. \tag{11}$$

Define dimensionless parameters

$$a_k \equiv \frac{\Delta\omega_{0,k}}{\Gamma}, \qquad b \equiv -\frac{\eta}{\Gamma}. \tag{12}$$

Then Eq. (9) yields the control-to-probe transfer of a single ring,

$$T_k(I) = \frac{1}{1 + (a_k + bI)^2}. \tag{13}$$

Physical meaning: $a_k$ is a static detuning in linewidth units (set by heater/carrier tuning/fabrication), and $|b|$ is the normalized sensitivity magnitude (linewidths of resonance shift per unit control-signal amplitude); the sign convention is absorbed into the detuning expression. For “same-material/same-geometry” rings, $b$ is often common, while $a_k$ can be tuned per ring.

Sign convention. Simultaneously flipping $(a_k, b) \mapsto (-a_k, -b)$ leaves $T_k(I)$ unchanged, so we may take $b > 0$ without loss of generality.

Let N rings be cascaded in a serial add-drop topology: $T_k(I)$ denotes the add-to-drop transmission of ring k, and the drop output of ring k feeds the add (input bus) port of ring k+1. Assuming the probe is sufficiently weak so the control channel dominates the resonance shift, the normalized probe output is the product

$$y(I) \equiv \frac{P_{\mathrm{out}}^{(\mathrm{probe})}(I)}{P_{\mathrm{in}}^{(\mathrm{probe})}} = \prod_{k=1}^{N} T_k(I) = \prod_{k=1}^{N} \frac{1}{1 + (a_k + bI)^2}. \tag{14}$$
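As a minimal numerical sketch of the single-ring transfer in Eq. (13) and the cascade product in Eq. (14), the following Python snippet evaluates y(I) for a set of rings. The function names and the demo values of a_k and b are illustrative assumptions, not fitted parameters from this work.

```python
import numpy as np

def ring_transmission(I, a_k, b):
    """Single-ring Lorentzian control-to-probe transfer, Eq. (13)."""
    return 1.0 / (1.0 + (a_k + b * I) ** 2)

def cascade_transmission(I, a, b):
    """Normalized probe output y(I) of Eq. (14): product over all rings."""
    I = np.atleast_1d(np.asarray(I, dtype=float))
    a = np.atleast_1d(np.asarray(a, dtype=float))
    # Shape (N, len(I)): one row per ring, multiplied down the ring axis.
    per_ring = ring_transmission(I[None, :], a[:, None], b)
    return per_ring.prod(axis=0)

# Illustrative (assumed) values, chosen only to exercise the formulas.
I_grid = np.linspace(0.0, 8.0, 5)
a_demo = np.array([-1.4, -1.4, -1.4])
b_demo = 0.3

y = cascade_transmission(I_grid, a_demo, b_demo)
# Flipping the signs of (a_k, b) together should leave y(I) unchanged.
y_flipped = cascade_transmission(I_grid, -a_demo, -b_demo)
print(y)
```

Because the rings here share one detuning, the product reduces to a single-ring response raised to the power N, which is the identical-detuning form used later for the fits.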


[Figure 1 schematic. (a) Electronic preprocessing: find max $m = \max(x_n)$, shift $u_n = x_n - m$, bias $I_n = u_n + L$, producing the control input. (b) N-MRR cascade: a probe at fixed $\omega_L$ passes through N EO-tuned stages (MRR #1 to #5 drawn). (c) Output: photodetector (PD) reading $\tilde{y}(I_n) \approx \exp(I_n - L) \to \exp(x_n - m)$.]

FIG. 1: Overview of the control–probe add-drop cascade N-MRR exponential block. (a) Electronic preprocessing maps an arbitrary input sequence $\{x_n\}$ to a nonnegative control signal via $m = \max_n x_n$, $u_n = x_n - m$, and $I_n = u_n + L$ with $L = m - \min_n x_n$. (b) The control signal $I_n$ induces resonance shifts in a cascade of N rings, while a weak fixed-frequency probe propagates through the serial add-drop cascade (the drop output of each ring feeds the next stage), experiencing multiplicative transmission. (c) After photodetection, the block implements $y(I_n) \approx \exp(I_n - L) = \exp(x_n - m)$, i.e., the normalized exponential used in softmax.
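The electronic preprocessing summarized in panel (a) can be sketched in a few lines of Python. The helper name `to_control_amplitudes` is hypothetical; the input vector is the example sequence quoted in Sec. III, and the ideal exponential stands in for the optical cascade so that only the mapping itself is exercised.

```python
import numpy as np

def to_control_amplitudes(x):
    """Map logits x_n to control amplitudes I_n in [0, L], Eqs. (3)-(5)."""
    x = np.asarray(x, dtype=float)
    m = x.max()          # m = max_j x_j
    u = x - m            # u_n = x_n - m <= 0
    L = -u.min()         # L = m - min_n x_n >= 0
    I = u + L            # I_n = u_n + L
    return I, L

x = np.array([-3.2, 1.2, 4.8, -0.9])
I, L = to_control_amplitudes(x)

# Ideal target f(I) = exp(I - L); normalizing it recovers softmax(x).
target = np.exp(I - L)
p = target / target.sum()
reference = np.exp(x) / np.exp(x).sum()
print(I, L, p)
```

For this input the mapping gives L = 8, which is the interval length used for the fits reported in Sec. III.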


To focus on the shape of the approximation, we allow a global scale factor $C > 0$:

$$\tilde{y}(I) \equiv C\, y(I). \tag{15}$$

In softmax, $p_n = C e^{I_n - L} / \sum_j C e^{I_j - L}$, so $C$ cancels between numerator and denominator and is physically inessential; nevertheless it is convenient for error analysis. For a fixed $(N, b, \{a_k\})$, the optimal $C$ for the minimax log-error in Eq. (18) can be written in closed form. Let $g(I) \equiv \ln y(I) - (I - L)$ on $[0, L]$. Then the minimax-optimal shift is $\ln C^\star = -(\max_I g(I) + \min_I g(I))/2$, yielding $E_\infty = (\max_I g(I) - \min_I g(I))/2$.

Taking logarithms,

$$\ln \tilde{y}(I) = \ln C - \sum_{k=1}^{N} \ln\left[1 + (a_k + bI)^2\right]. \tag{16}$$

The target $\ln f(I) = I - L$ is linear; hence exponential approximation is equivalent to the log-linearization goal

$$\ln \tilde{y}(I) \approx I - L \quad \text{uniformly on } I \in [0, L]. \tag{17}$$

Error metric. Define the worst-case log-error on $[0, L]$:

$$E_\infty \equiv \sup_{I \in [0, L]} \left| \ln \tilde{y}(I) - (I - L) \right|. \tag{18}$$

If $E_\infty \le \varepsilon_{\log}$, then for all $I \in [0, L]$,

$$e^{-\varepsilon_{\log}} \le \frac{\tilde{y}(I)}{f(I)} \le e^{\varepsilon_{\log}} \;\Rightarrow\; \left| \frac{\tilde{y}(I)}{f(I)} - 1 \right| \le e^{\varepsilon_{\log}} - 1. \tag{19}$$

Thus achieving a prescribed worst-case relative error $\varepsilon$ is guaranteed by

$$E_\infty \le \varepsilon_{\log} \equiv \ln(1 + \varepsilon) \approx \varepsilon. \tag{20}$$

Depth scaling. We derive depth-related constraints and design rules for a prescribed approximation tolerance.

Necessary slope condition. Differentiate Eq. (16):

$$\frac{d}{dI} \ln y(I) = -\sum_{k=1}^{N} \frac{2b\,(a_k + bI)}{1 + (a_k + bI)^2}. \tag{21}$$

Since $|2u/(1 + u^2)| \le 1$ for all real $u$,

$$\left| \frac{d}{dI} \ln y(I) \right| \le N |b|. \tag{22}$$
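The error metric of Eq. (18), the closed-form optimal scale $\ln C^\star$, and the slope bound of Eq. (22) can be checked numerically. The sketch below assumes an identical-ring cascade and uses the rounded N = 10 parameters from Table I; the grid resolution is an arbitrary choice, and a sampled grid only approximates the supremum.

```python
import numpy as np

L = 8.0
N, a, b = 10, -1.4588, 0.10202   # identical-ring fit, N = 10 row of Table I

I = np.linspace(0.0, L, 4001)    # uniform grid on [0, L] (assumed resolution)
u = a + b * I

# g(I) = ln y(I) - (I - L) for N identical rings.
g = -N * np.log1p(u**2) - (I - L)

# Closed-form minimax shift and the resulting worst-case log-error.
ln_C_star = -(g.max() + g.min()) / 2.0
E_inf = (g.max() - g.min()) / 2.0

# Residual after applying C*, and the pointwise relative error of Eq. (19).
residual = ln_C_star + g
rel_err = np.abs(np.exp(residual) - 1.0)

# Analytic slope of ln y(I), Eq. (21), for comparison with the bound N|b|.
slope = -N * 2.0 * b * u / (1.0 + u**2)

print(f"E_inf = {E_inf:.4f}, max rel. err. = {rel_err.max():.2%}")
print(f"max |slope| = {np.abs(slope).max():.4f} <= N|b| = {N * abs(b):.4f}")
```

With these rounded parameters the printed worst-case relative error should land close to the value listed for N = 10 in Table I, and the centered residual should swing symmetrically between $-E_\infty$ and $+E_\infty$.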

The target $\ln f(I) = I - L$ has constant slope +1, so a necessary condition to track it is

$$N |b| \gtrsim 1. \tag{23}$$

Near-optimal parameterization. The full design problem can be written as a minimax fit in the log domain [31]:

$$\min_{a_1, \dots, a_N,\ \ln C}\ \sup_{I \in [0, L]} |r(I)|, \qquad r(I) \equiv \ln C - \sum_{k=1}^{N} \ln\left[1 + (a_k + bI)^2\right] - (I - L). \tag{24}$$

This objective is permutation-invariant in the $a_k$’s (ring index k). In practice (and in numerical experiments reported below), the optimizer frequently collapses to a permutation-symmetric solution

$$a_1 = \cdots = a_N \equiv a, \tag{25}$$

reducing the design to two parameters $(a, b)$ (plus $C$). With Eq. (25),

$$\tilde{y}(I) = C\, y(I) = C \left[ \frac{1}{1 + (a + bI)^2} \right]^{N}. \tag{26}$$

A robust initialization is obtained by placing the midpoint of the interval on the Lorentzian half-maximum flank and matching the slope:

$$a + b\,\frac{L}{2} \approx -1, \qquad N b \approx 1. \tag{27}$$

These two equations already yield a good design; a small (two-parameter) refinement can then enforce the desired worst-case tolerance.

Local expansion and depth scaling. A Taylor expansion of the log-domain residual around the flank-centered point $I_0 = L/2$ (with $a + bI_0 = -1$ and $Nb = 1$) shows that the quadratic term vanishes identically, leaving a leading cubic residual $r(\delta) \sim \delta^3/(6N^2)$. Over $I \in [0, L]$, this implies $E_\infty \sim L^3/N^2$, so that achieving a prescribed tolerance $\varepsilon_{\log}$ requires $N \propto L^{3/2}/\sqrt{\varepsilon_{\log}}$, which explains the scaling used in Eq. (28). The full derivation is provided in Supplementary Sec. S0; an intuitive local-expansion summary appears in Sec. S1.

Practical engineering estimate. Given $L$ and a target worst-case relative error $\varepsilon$, define $\varepsilon_{\log} = \ln(1 + \varepsilon)$. A heuristic engineering estimate (not a rigorous bound) that matched our percent-level numerical designs is

$$N \approx \max\left( \frac{1}{b_{\max}},\ \kappa\, \frac{L^{3/2}}{\sqrt{\varepsilon_{\log}}} \right), \tag{28}$$

where $b_{\max}$ is the physically achievable sensitivity bound and $\kappa \simeq 0.07$ for the identical-detuning flank design with a minimax refinement. After choosing $N$, set $b = \min(b_{\max}, 1/N)$ and $a = -1 - bL/2$ as initialization, then refine $(a, b)$ by a two-parameter minimax fit on $[0, L]$.

A heuristic conservative screening bound $N \ge \lceil (L^2/4 + 1/(2b^2)) / \ln(1 + \varepsilon) \rceil$ (derived via the same local-expansion argument; see Supplementary Sec. S1) provides a quick upper estimate but is not a rigorous guarantee.

III. NUMERICAL FITS AND VALIDATION

We validate the analytical framework with minimax numerical fits and sampled robustness checks. Figure 2 shows the fitted approximation quality at L = 8: the top (linear) panel plots N = 1, 3, 5, 7 over I ∈ [0, 20], the middle (log) panel compares N = 5, 10, 20, 30 on I ∈ [0, 8], and the bottom panel shows the pointwise relative error with the characteristic Chebyshev equioscillation pattern.

We fit identical-detuning cascades (Eq. 25) on I ∈ [0, L] and compare several depths using a minimax criterion.

Table I makes the accuracy–depth trade-off explicit at L = 8. A worked input-to-output example demonstrating the mapping from an arbitrary input sequence x = [−3.2, 1.2, 4.8, −0.9] through the cascade is provided in Supplementary Sec. S2. The example shows that the N = 10 cascade keeps the worst-case relative error below 2.7% across all channels.

Empirical calibration. We calibrate the effective logit range $L_{\mathrm{eff}}$ from autoregressive Transformers (distilgpt2/gpt2) [1, 32–35] at context length 128, finding $L_{\mathrm{eff},0.999} \approx$ 7–9 at the 50th–90th percentiles (Supplementary Sec. S2). A clipping threshold $t^* = -12$ keeps the p99 softmax error below 0.1%. Full protocol details, clipping-sweep tables/plots, and per-run statistics are provided in Supplementary Sec. S3.

A synthetic design-space map (Supplementary Table S3) shows that near L ≈ 8, moderate depth (N ≥ 10) reaches few-percent error, whereas L ≳ 12 requires deeper cascades. All fits follow the same pipeline: minimize the worst-case log-error on a uniform grid, initialize from the flank rules in Eq. (27), perform multi-start global search, and apply bounded local refinement; implementation details and scripts are provided in a public repository [36] (commit: 585e695).

TABLE I: Depth comparison for L = 8 using fitted $\tilde{y}(I) = C[1 + (a + bI)^2]^{-N}$ (same fitting pipeline for all N).

N     a          b          max rel. err.   mean rel. err.
5     −2.0789    0.21658    10.9%           6.43%
10    −1.4588    0.10202    2.68%           1.65%
20    −1.2135    0.05025    0.67%           0.42%
30    −1.1392    0.03341    0.30%           0.19%
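The depth estimate of Eq. (28) and the flank initialization of Eq. (27) can be turned into a short starting-point calculation. This is a sketch only: the function names are hypothetical, rounding the estimate up to an integer is an assumption made here, and `b_max = 1.0` is a placeholder rather than a device value.

```python
import math

def estimate_depth(L, eps, b_max, kappa=0.07):
    """Heuristic cascade depth from Eq. (28), rounded up (assumed rounding)."""
    eps_log = math.log1p(eps)                      # eps_log = ln(1 + eps)
    n_est = max(1.0 / b_max, kappa * L**1.5 / math.sqrt(eps_log))
    return math.ceil(n_est)

def flank_initialization(N, L, b_max):
    """Initial (a, b) before minimax refinement: b = min(b_max, 1/N), a = -1 - bL/2."""
    b = min(b_max, 1.0 / N)
    a = -1.0 - b * L / 2.0
    return a, b

# L = 8 with a 2.68% target; b_max = 1.0 is a placeholder sensitivity bound.
N0 = estimate_depth(L=8.0, eps=0.0268, b_max=1.0)
a0, b0 = flank_initialization(N0, L=8.0, b_max=1.0)
print(N0, a0, b0)
```

For these inputs the estimate gives a depth of 10 and a starting point of roughly (a, b) = (−1.4, 0.1), which sits near the refined N = 10 entry of Table I; the refinement step itself is not reproduced here.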

TABLE II: Waveguide and ring parameters of the X-cut TFLN micro-ring resonator. Electro-optic electrode parameters are listed separately in Table III.

Parameter                   Symbol               Value    Unit
Total TFLN thickness        $t_{\mathrm{TFLN}}$  600      nm
Etch depth                  $t_{\mathrm{etch}}$  500      nm
Slab thickness              $t_{\mathrm{slab}}$  100      nm
Waveguide width             $w$                  1.4      µm
Bend radius                 $R$                  20       µm
Coupling gap                $g$                  100      nm
Circumference               $L_{\mathrm{ring}}$  125.7    µm
Free spectral range         FSR                  8.29     nm
Effective index (TE0)       $n_{\mathrm{eff}}$   1.903    —
Group index (TE0)           $n_g$                2.24     —
Extraordinary index         $n_e$                2.138    —


                                                            IV.   TFLN SINGLE-RING DEVICE DESIGN AND
                                                                          FDTD VALIDATION

                                                                     A.    Waveguide and ring geometry


                                                               The device is based on an X-cut thin-film lithium nio-
                                                            bate (LiNbO3 ) on insulator wafer with a 600 nm-thick
                                                            LiNbO3 film on SiO2 . A 500 nm-deep rib etch defines
                                                            a 1.4 µm-wide single-mode waveguide with a 100 nm un-
                                                            etched slab (Fig. 3). Lumerical MODE simulations yield
                                                            neff = 1.903 and ng = 2.24 at λ = 1550 nm for the funda-
                                                            mental TE0 mode.
   The ring resonator ($R$ = 20 µm, $L_{\mathrm{ring}}$ = 125.7 µm) is configured as an add-drop resonator with 100 nm coupling gaps (Fig. 4). The FDTD-measured free spectral range is FSR = 8.29 nm, corresponding to $n_g \approx 2.30$, which is slightly above the MODE value of 2.24 due to bend-induced dispersion.
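As a quick consistency check on the quoted ring parameters, the circumference and the group index implied by the FDTD free spectral range can be recomputed. The sketch assumes the standard ring-resonator relation FSR ≈ λ²/(n_g L_ring), which is not stated explicitly in the text; variable names are illustrative.

```python
import math

wavelength_nm = 1550.0   # design wavelength
radius_um = 20.0         # bend radius R from Table II
fsr_nm = 8.29            # FDTD-measured free spectral range

# Circumference of the ring, L_ring = 2*pi*R.
circumference_um = 2.0 * math.pi * radius_um

# Group index implied by FSR ~ lambda^2 / (n_g * L_ring); 1 um = 1e3 nm.
n_g = wavelength_nm**2 / (fsr_nm * circumference_um * 1e3)

print(f"L_ring = {circumference_um:.1f} um, n_g = {n_g:.3f}")
```

Under that assumed relation the implied group index comes out near 2.3, in line with the value quoted alongside the FDTD free spectral range.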


FIG. 2: Minimax cascade fits at L = 8. (a) Linear scale:
  shallow cascades (N = 1, 3, 5, 7) over I ∈ [0, 20]. The
target e^{I−L} (black) is progressively better matched as N
       increases. (b) Log scale: depth comparison
    (N = 5, 10, 20, 30) on I ∈ [0, 8]. Inset zooms into
  I ∈ [6, 8] showing convergence. (c) Pointwise relative
  error showing the Chebyshev equioscillation pattern
           characteristic of minimax optimality.
                                                            FIG. 3: Cross-section of the X-cut TFLN rib waveguide
                                                            on a SiO2 substrate. The 600 nm LiNbO3 film is etched
                                                            500 nm to form a 1.4 µm-wide single-mode rib waveguide.
                                                            Lateral signal (S) and ground (G) electrode positions are
                                                               indicated; electrode design details are discussed in
                                                                                    Sec. IV D.

  Table II summarizes the waveguide and ring parame-
ters.
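
The relation between the quoted FSR, ring length, and group index can be checked with a short script. This is a minimal sketch, assuming the standard ring relation FSR = λ²/(ng·L) evaluated at λ = 1550 nm (the wavelength at which the FDTD FSR was read off is an assumption here); the function names are illustrative and not taken from the paper.

```python
import math

WAVELENGTH = 1550e-9   # m, assumed evaluation wavelength
RADIUS = 20e-6         # m, ring radius quoted in the text
FSR_FDTD = 8.29e-9     # m, FDTD-measured free spectral range
NG_MODE = 2.24         # group index from the MODE solver


def ring_length(radius_m: float) -> float:
    """Circumference of a circular ring."""
    return 2.0 * math.pi * radius_m


def fsr(wavelength_m: float, group_index: float, length_m: float) -> float:
    """Free spectral range FSR = lambda^2 / (ng * L)."""
    return wavelength_m ** 2 / (group_index * length_m)


def group_index_from_fsr(wavelength_m: float, fsr_m: float, length_m: float) -> float:
    """Invert the FSR relation to recover the group index."""
    return wavelength_m ** 2 / (fsr_m * length_m)


length = ring_length(RADIUS)
ng_fdtd = group_index_from_fsr(WAVELENGTH, FSR_FDTD, length)
fsr_mode = fsr(WAVELENGTH, NG_MODE, length)

print(f"L_ring = {length * 1e6:.1f} um")
print(f"ng implied by FDTD FSR = {ng_fdtd:.3f}")
print(f"FSR implied by MODE ng = {fsr_mode * 1e9:.2f} nm")
```

Under these assumptions the FDTD FSR implies a group index close to 2.30, while the MODE group index of 2.24 would imply a somewhat larger FSR, consistent with the ordering stated in the text.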


              B.   3D FDTD Methodology

   The ring resonator response is simulated using Lumeri-
cal 3D FDTD with conformal variant 1 meshing. A broad-
band TE0 mode source (1530 nm to 1570 nm) is injected
into the input bus waveguide, and through- and drop-port
spectra are recorded. A “z-refined 3-fix” meshing strat-
egy ensures convergence in the thin-film geometry [37];
detailed simulation setup is provided in Supplementary
Sec. S4 (Table S6).


  FIG. 4: Top view of the single add-drop micro-ring resonator used in
  the 3D FDTD simulation. The ring waveguide (R = 20 µm, w = 1.4 µm) is
  evanescently coupled to input and drop bus waveguides through 100 nm
  gaps at coupling points CP1 and CP2.

  FIG. 5: Simulated through-port (blue) and drop-port (red) transmission
  spectra of the single add-drop micro-ring resonator from 3D FDTD. Top:
  logarithmic scale; bottom: linear scale. Five resonances are visible
  with FSR ≈ 8.29 nm.

         C.    Single-Ring Add-Drop Results

   Figure 5 shows the through- and drop-port spectra from 3D FDTD. Five
resonances are resolved across 1530 nm to 1570 nm with FSR = 8.29 nm
(ng ≈ 2.30).
   Lorentzian fitting of the drop-port peaks yields QL = 10,300–15,500,
with the best resonance at λ = 1566 nm reaching QL = 15,500 (FWHM =
101 pm, Dmax = 0.360, −4.4 dB). The through-port extinction ratio is
1.6 dB to 2.6 dB, and the five-resonance mean is QL = 12,500 ± 1,800
(Dmax = 0.29–0.36). CMT analysis of the best resonance gives
Qi = QL /(1 − √Dmax ) = 15,500/0.400 ≈ 38,800, confirming that the
500 nm etch provides sufficient confinement and that the 100 nm gap
places the ring in the coupling-limited regime. The cascade analysis
below adopts the best-case FDTD calibration (QL = 15,500, Dmax = 0.360);
using the five-resonance mean would increase required voltages by ∼24%
(see Table IV caption).
   The simulation time of 50 ps exceeds the loaded photon lifetime
τL = QL λ0 /(2πc) ≈ 12.7 ps by ∼4×, but the intrinsic lifetime
τi ≈ 32 ps is comparable, so the extracted Qi may be slightly
conservative. An independent eigenmode (FDE) analysis of the same
cross-section at R = 20 µm, using a 300 × 300 mesh (∆y ≈ 10 nm, 5× finer
than the FDTD vertical grid), yields Qrad+leak = 2.4 × 10^7; including
bulk LiNbO3 absorption (Γ = 0.89) gives a theoretical Qi > 10^7 [37–42],
confirming that the gap between the numerical Qi and published values
(> 10^6) originates from mesh discretization (Supplementary S4.5,
Table S8). In the CMT framework, Dmax = [2κ/(2κ+γ)]^2 increases as Qi
rises; at the present coupling gap, increasing Qi to 10^6 would raise
Dmax from 0.36 to ∼0.95 and QL from 15,500 to ∼25,200.
   Figure 6(a) shows a Lorentzian fit to the best drop-port resonance at
λ = 1566 nm, validating the cascade model (Eq. 9). Figure 6(b)
demonstrates that cascading N copies of this FDTD-extracted Lorentzian
reproduces the target exponential e^{I−L} with increasing fidelity as N
grows.
   To validate the cascade prediction directly, a five-ring cascade 3D
FDTD simulation was performed using Tidy3D [43]; the full simulation
notebook is publicly available [43]. The |E|^2 field at λ = 1549 nm
[Fig. 6(d)] confirms resonant excitation across all five rings. Mapping
the drop-port spectrum onto the control variable I yields 11 data points
within the AEF operating range [Fig. 6(e, f)], with the FDTD
transmission closely tracking the N = 5 theoretical curve near
I ≈ L = 8.
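
The CMT numbers quoted in this subsection follow from a few algebraic relations, which the sketch below evaluates. It assumes a symmetric add-drop ring, so that 1/QL = 1/Qi + 1/Qc with Qc the total (two-bus) coupling Q and √Dmax = 1 − QL/Qi, which is equivalent to Dmax = [2κ/(2κ+γ)]^2. Lifetimes are evaluated at λ0 = 1550 nm, which is an assumption; all function names are illustrative.

```python
import math

C_LIGHT = 299_792_458.0  # speed of light, m/s


def intrinsic_q(q_loaded: float, d_max: float) -> float:
    """Qi = QL / (1 - sqrt(Dmax)) for a symmetric add-drop ring."""
    return q_loaded / (1.0 - math.sqrt(d_max))


def coupling_q(q_loaded: float, q_intrinsic: float) -> float:
    """Total external coupling Q from 1/QL = 1/Qi + 1/Qc."""
    return 1.0 / (1.0 / q_loaded - 1.0 / q_intrinsic)


def loaded_state(q_intrinsic: float, q_coupling: float) -> tuple[float, float]:
    """Return (QL, Dmax) for a given intrinsic Q at fixed coupling."""
    q_loaded = 1.0 / (1.0 / q_intrinsic + 1.0 / q_coupling)
    d_max = (1.0 - q_loaded / q_intrinsic) ** 2
    return q_loaded, d_max


def photon_lifetime(q: float, wavelength_m: float) -> float:
    """tau = Q * lambda0 / (2 * pi * c)."""
    return q * wavelength_m / (2.0 * math.pi * C_LIGHT)


# FDTD-calibrated best resonance from the text.
qi_fdtd = intrinsic_q(15_500, 0.360)
qc = coupling_q(15_500, qi_fdtd)

# Projection: same coupling gap, intrinsic Q raised to 1e6.
ql_proj, dmax_proj = loaded_state(1e6, qc)

tau_loaded = photon_lifetime(15_500, 1550e-9)
tau_intrinsic = photon_lifetime(qi_fdtd, 1550e-9)

print(f"Qi = {qi_fdtd:.0f}, Qc = {qc:.0f}")
print(f"Projected QL = {ql_proj:.0f}, Dmax = {dmax_proj:.3f}")
print(f"tau_L = {tau_loaded * 1e12:.1f} ps, tau_i = {tau_intrinsic * 1e12:.1f} ps")
```

Holding Qc fixed while raising Qi is what "at the present coupling gap" is taken to mean here; the projected values land near QL ≈ 25,200 and Dmax ≈ 0.95.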


FIG. 6: FDTD-based AEF validation. (a) Lorentzian fit to the drop-port resonance at λ = 1566 nm from 3D FDTD
    (Lumerical) (QL = 15,500, Dmax = 0.360, bV = 0.180 V^−1). (b) Five-ring cascade drop-port spectrum near
 λ0 ≈ 1550 nm with Lorentzian^5 fit (red curve), confirming the expected T^5 line shape. (c) Five-ring cascade MRR
layout with diagonal zigzag bus waveguides. (d) |E|^2 field profile at λ = 1549 nm from a five-ring cascade 3D FDTD
    simulation (Tidy3D [43]). (e, f) AEF validation of the five-ring cascade on log (e) and linear (f) scales with
                                          11 spectral FDTD data points.

     D.   X-cut electrode design and EO parameters

   We employ lateral signal–ground (S–G) arc electrodes on the slab
surface alongside the ring waveguide (Fig. 7). In the X-cut orientation,
the crystal Z-axis is at 45° from the horizontal in the substrate plane,
giving a lateral-field projection proportional to cos(θ − 45°) at
azimuthal angle θ. The cos(θ − 45°) = 0 boundaries at θ = 135° and 315°
naturally separate the coupling regions from the electrode regions. Each
ring carries a full semicircular arc electrode on the side opposite to
its coupling points, engaging the large r33 = 30.9 pm V^−1 Pockels
coefficient [37, 38]. The effective EO fill factor follows from
integrating |cos(θ − 45°)| over the semicircle:

\[ f_{\mathrm{EO}} = \frac{1}{\pi} \approx 0.318 \tag{29} \]

(see Supplementary Sec. S4 for derivation). The electrode gap is
gel = 5 µm (deff ≈ 2.5 µm), and the electro-optic overlap integral is
ΓEO = 0.7. Table III lists the electrode parameters.

TABLE III: Electro-optic electrode parameters for the X-cut TFLN
micro-ring with lateral S–G arc electrodes.

Parameter                      Symbol   Value          Unit
Crystal orientation            —        X-cut          —
EO coefficient                 r33      30.9           pm V^−1
EO fill factor                 fEO      1/π ≈ 0.318    —
EO overlap factor              ΓEO      0.7            —
Electrode gap                  gel      5              µm
Effective electrode distance   deff     2.5            µm

FIG. 7: Illustrative two-ring cascade layout showing the lateral S–G arc
electrode placement on X-cut TFLN (the cascade design extends to N
rings; this two-ring example clarifies the electrode geometry). The
crystal Z-axis is oriented at 45° from the horizontal in the substrate
plane. The cos(θ − 45°) = 0 boundaries at θ = 135° and 315° naturally
separate the bus-waveguide coupling regions from the electrode
semicircles: each ring carries a full semicircular arc electrode on the
side opposite to its coupling points. The resulting effective EO fill
factor is fEO = 1/π ≈ 0.318.

E.    FDTD-Calibrated bV and Cascade Optimization

   From the device parameters in Tables II and III and the
FDTD-calibrated ng ≈ 2.30, the effective normalized voltage sensitivity
is (Supplementary Sec. S4; here dλ/dV = 28.5 pm/V is the
straight-section value and fEO accounts for partial electrode coverage
of the ring circumference):

\[ b_V = \frac{2\,Q\,(d\lambda/dV)}{\lambda_0}\, f_{\mathrm{EO}} \approx 0.182\ \mathrm{V}^{-1} \tag{30} \]

at QL = 15,500. This estimate relies on a first-order electrostatic
model (deff ≈ 2.5 µm, ΓEO = 0.7); a ±30% variation in bV would shift the
cascade depth by one to two rings at constant εmax (Table IV), leaving
the qualitative design conclusions unchanged. With the cascade framework
of Sec. II (Eqs. 14–18), the N-ring drop-port transmission
ỹ(I) = C [1 + (a + bI)^2]^{−N} approximates e^{I−L} over I ∈ [0, L],
with (a, b) optimized by minimax fitting for each N.
   Table IV presents the optimization results for the standard dynamic
range L = 8 (e^8 ≈ 2981, 34.7 dB).

TABLE IV: Cascade optimization results for L = 8. The bias voltage
Vbias = |a|/bV sets the DC offset, and Vctrl = bL/bV is the maximum
control voltage at I = L. Voltages computed with bV = 0.182 V^−1 (X-cut
arc electrode, FDTD-calibrated best resonance QL = 15,500, ng = 2.30).
The mean FDTD quality factor across five resonances is
QL = 12,500 ± 1,800; using the mean would increase voltages by ∼24%.

N     a          b          E∞       εmax (%)   Vbias (V)   Vctrl (V)
5     −2.0789    0.21658    0.1035   10.91      11.4        9.5
10    −1.4588    0.10202    0.0265   2.68       8.0         4.5
12    −1.3731    0.08450    0.0184   1.86       7.5         3.7
20    −1.2136    0.05025    0.0067   0.67       6.7         2.2
25    −1.1685    0.04013    0.0043   0.43       6.4         1.8
30    −1.141     0.03340    0.0030   0.30       6.3         1.5
32    −1.1301    0.03131    0.0026   0.26       6.2         1.4

a The complete cascade optimization results for all N values are listed
  in Supplementary Table S7.

   The approximation quality across different cascade depths is shown in
Fig. 2 (Sec. III). Key thresholds (e.g., ε < 2% at N ≥ 12, ε < 1% at
N ≥ 17) and the complete optimization results are listed in
Supplementary Sec. S4.
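
The quantities in Eq. (30) and Table IV can be cross-checked numerically. The sketch below is not the authors' optimizer: it only evaluates the cascade model ỹ(I) = C [1 + (a + bI)^2]^{−N} for a given (a, b) from Table IV, estimates the log-domain error E∞ as half the range of the log residual on a dense grid (which corresponds to choosing the best constant C), and converts (a, b) into voltages. It assumes λ0 = 1550 nm in Eq. (30); function names are illustrative.

```python
import math


def normalized_sensitivity(q_loaded, dlambda_dv_m, f_eo, wavelength_m):
    """Eq. (30): bV = 2 Q (dlambda/dV) fEO / lambda0, in 1/V."""
    return 2.0 * q_loaded * dlambda_dv_m * f_eo / wavelength_m


def cascade_log_error(n_rings, a, b, span, samples=4001):
    """Half-range of ln y(I) - (I - L) on [0, L] for the best constant C."""
    residuals = []
    for k in range(samples):
        i = span * k / (samples - 1)
        log_y = -n_rings * math.log1p((a + b * i) ** 2)
        residuals.append(log_y - (i - span))
    return 0.5 * (max(residuals) - min(residuals))


def drive_voltages(a, b, span, b_v):
    """Return (Vbias, Vctrl) = (|a| / bV, b L / bV)."""
    return abs(a) / b_v, b * span / b_v


b_v_est = normalized_sensitivity(15_500, 28.5e-12, 1.0 / math.pi, 1550e-9)

# Table IV row for N = 12, evaluated at L = 8.
e_inf = cascade_log_error(12, -1.3731, 0.08450, 8.0)
eps_max = math.expm1(e_inf)  # relative error implied by the log error

# Voltages with the bV value quoted in the Table IV caption.
v_bias, v_ctrl = drive_voltages(-1.3731, 0.08450, 8.0, 0.182)

print(f"bV estimate = {b_v_est:.3f} 1/V")
print(f"E_inf = {e_inf:.4f}, eps_max = {100 * eps_max:.2f} %")
print(f"Vbias = {v_bias:.1f} V, Vctrl = {v_ctrl:.1f} V")
```

The estimated bV depends slightly on which wavelength is inserted for λ0 (1550 nm versus the 1566 nm resonance), which may explain the small spread between the 0.180 and 0.182 V^−1 values quoted in the text.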

             V.    PHYSICAL FEASIBILITY

   Having established the cascade approximation theory (Sec. II) and the
FDTD-calibrated device parameters (Sec. IV), we now assess the physical
feasibility of the proposed architecture in terms of voltage
requirements, insertion loss, and energy efficiency.

       A.     Electro-optic voltage requirements

   For the primary target of ε < 2% (N = 12), minimax optimization gives
a = −1.373, b = 0.0845. With the FDTD-calibrated QL = 15,500
(bV = 0.182 V^−1), the required voltages are

\[ V_{\mathrm{bias}} = \frac{|a|}{b_V} = \frac{1.373}{0.182} = 7.5\ \mathrm{V}, \tag{31} \]

\[ V_{\mathrm{ctrl,max}} = \frac{bL}{b_V} = \frac{0.0845 \times 8}{0.182} = 3.7\ \mathrm{V}. \tag{32} \]

Since bV ∝ Q, voltage scales inversely with quality factor:

\[ V_{\mathrm{ctrl}} = \frac{bL}{b_V} = \frac{bL\,\lambda_0}{2Q\,|d\lambda_0/dV|}. \tag{33} \]

CMOS-compatible control voltages (Vctrl < 3.3 V) are achievable at
N ≥ 14 with QL = 15,500; at the design point N = 30 (εmax = 0.30%),
Vctrl = 1.47 V.

       B.     Power budget: two-regime analysis

   The on-resonance cascade transmission Dmax^N is the dominant
contribution to total insertion loss. Table V presents two regimes: the
FDTD-characterized regime (Dmax = 0.36) and the fabricated high-Q regime
(Dmax = 0.95, achievable with Qi > 10^6 and gap-optimized coupling).
   In the FDTD-characterized regime, Dmax = 0.36 limits practical
cascades to N ≤ 5: at N = 5 the output is 0.61 µW (−22.2 dB) with
ε = 10.9%, suited for proof-of-concept validation. In the fabricated
high-Q regime (Dmax ≥ 0.95), deep cascades become practical: N = 30
yields Pout = 21.5 µW (−6.7 dB) with εmax ≈ 0.30%. The transition to
fabricated high-Q devices is therefore critical for achieving both high
accuracy and sufficient output power.

                   C.    Feasibility outlook

   Published TFLN micro-ring resonators achieve Qi ≥ 10^6–10^8 using
optimized fabrication [39–42]. At Qi = 10^6 with the present coupling
geometry, CMT predicts Dmax ≈ 0.95 and QL ≈ 25,200 (Supplementary
Sec. S5, Tables S4–S7), enabling deep cascades (N ≤ 30) with sub-percent
error. The literature values provide strong independent evidence that
intrinsic quality factors in the projected range are physically
achievable in TFLN, albeit with wider waveguides and larger ring radii
than the present design. Transferring comparable sidewall quality to our
geometry (R = 20 µm, W = 1.4 µm) is an open fabrication challenge; the
projections should be read as design targets contingent on achieving it.
   The total insertion loss comprises on-resonance cascade transmission
Dmax^N, inter-ring coupling loss (∼0.46 dB/stage for the present
diagonal-bus layout), off-resonance propagation loss
(0.08–0.25 dB/stage), and fiber-to-chip coupling (1.5–3.0 dB). For the
fabricated high-Q regime (N = 30), the total ranges from ∼13 dB
(optimized layout) to ∼24 dB (current geometry); see Supplementary
Sec. S6 for detailed scenarios.

                    D.    Energy comparison

   For N = 30 X-cut TFLN micro-ring resonators in the fabricated high-Q
regime (QL ≈ 25,200 at Qi = 10^6; Supplementary Sec. S5), the three
energy components are EO tuning (EEO = 0.22 pJ), amortized laser
(Elaser = 0.07 pJ, shared across M = 10 channels), and photodetector
(EPD = 0.50 pJ), yielding Ephotonic = 0.79 pJ (derivations in
Supplementary Sec. S7). Including thermal stabilization for N = 30 rings
(0.15–0.60 pJ; Supplementary Sec. S7), the total rises to 0.94–1.39 pJ.
   Table S12 compares the photonic cascade with digital implementations.
Including thermal stabilization (0.94–1.39 pJ), the advantage over INT8
(2.3 pJ) is 1.7–2.4×, while operating at 10 GHz bandwidth and 58× lower
than digital FP32 (46 pJ). At fabricated Q ≥ 30,000, EEO drops to
0.16 pJ and Etotal ≈ 0.73 pJ (excluding thermal; Supplementary
Table S11), recovering a 3.2× advantage

TABLE V: Two-regime power budget for the MRR cascade. Pout assumes
per-channel input Pin,ch = 100 µW (from a shared Pin,tot = 1 mW CW laser
split across M = 10 parallel channels via a 1×M splitter, or
equivalently multiplexed as d WDM channels sharing a single cascade) and
accounts only for the ideal on-resonance cascade transmission Dmax^N
(upper bound); additional inter-ring coupling loss (ηcoupling ≈ 0.9 per
stage, ∼0.46 dB/stage) and off-resonance propagation loss
(0.08–0.25 dB/stage) are analyzed separately in Sec. V C.

Regime        Dmax   N    Dmax^N         (dB)     Pout      εmax
I (FDTD)      0.36   3    0.0467         −13.3    4.67 µW   ∼15%
I (FDTD)      0.36   5    0.00605        −22.2    0.61 µW   10.9%
I (FDTD)      0.36   7    7.84 × 10^−4   −31.1    78 nW     ∼5%
II (high-Q)   0.95   10   0.599          −2.2     59.9 µW   2.68%
II (high-Q)   0.95   20   0.358          −4.5     35.8 µW   0.67%
II (high-Q)   0.95   30   0.215          −6.7     21.5 µW   ∼0.30%

Regime I: FDTD-characterized (Qi = 38,800). Regime II: fabricated high-Q
(Qi > 10^6). Pout scales linearly with Pin,ch.
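
The Table V entries and the energy totals of Sec. V D are simple arithmetic on quantities stated in the text, and the sketch below reproduces them. It covers only the ideal on-resonance transmission Dmax^N at Pin,ch = 100 µW, exactly as the Table V caption describes; coupling and propagation losses are not included. Variable and function names are illustrative.

```python
import math

P_IN_CH = 100e-6  # W, per-channel input power from the Table V caption


def cascade_power(d_max: float, n_rings: int, p_in_w: float = P_IN_CH):
    """Return (Dmax^N, loss in dB, output power in W) for an ideal cascade."""
    transmission = d_max ** n_rings
    return transmission, 10.0 * math.log10(transmission), p_in_w * transmission


rows = [(0.36, 3), (0.36, 5), (0.36, 7), (0.95, 10), (0.95, 20), (0.95, 30)]
budget = {(d, n): cascade_power(d, n) for d, n in rows}

for (d, n), (t, db, p_out) in budget.items():
    print(f"Dmax={d:.2f} N={n:2d}  T={t:.3e}  {db:6.1f} dB  Pout={p_out * 1e6:.3f} uW")

# Energy per exponential operation (pJ), components as quoted in Sec. V D.
E_EO, E_LASER, E_PD = 0.22, 0.07, 0.50
THERMAL_RANGE = (0.15, 0.60)

e_photonic = E_EO + E_LASER + E_PD
e_total = tuple(e_photonic + t for t in THERMAL_RANGE)

E_INT8 = 2.3  # pJ, digital INT8 figure quoted in the text
advantage_int8 = (E_INT8 / e_total[1], E_INT8 / e_total[0])

print(f"E_photonic = {e_photonic:.2f} pJ, with thermal = {e_total[0]:.2f}-{e_total[1]:.2f} pJ")
print(f"Advantage over INT8 = {advantage_int8[0]:.2f}x to {advantage_int8[1]:.2f}x")
```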

over INT8. Since EEO ∝ 1/Q^2, improving Q beyond ∼30,000 yields
diminishing energy returns but continues to relax CMOS driver voltage
requirements.

TABLE VI: Energy per exponential operation: single-channel comparison.

Implementation           E/op (pJ)    Bandwidth   Notes
Digital FP32 (Taylor)    ∼46          1 GHz       10 FP MACs
Digital INT8 (Taylor)    ∼2.3         1 GHz       10 INT MACs
Photonic MRR (N = 30)    0.94–1.39    10 GHz      Analog†

† 0.79 pJ excluding thermal; 0.94–1.39 pJ including thermal.
Self-consistent with fabricated high-Q regime (QL = 25,200); see
Supplementary Sec. S7.

                      VI. DISCUSSION

   Practical design procedure. For a given input sequence
x = (x1, . . . , xK), the design proceeds as follows:

    1. Compute m = max_n x_n, u_n = x_n − m, and L = −min_n u_n.

    2. Map to nonnegative control-signal amplitudes:
       I_n = u_n + L ∈ [0, L].

    3. Choose tolerance ε and set εlog = ln(1 + ε).

    4. Select a physically feasible bmax and estimate N using Eq. (28).

    5. Initialize b = min(bmax, 1/N) and a = −1 − bL/2, then refine
       (a, b) by a two-parameter minimax fit if required.

    6. The optical block yields ỹ(I_n) ≈ e^{x_n − m}, and softmax
       weights follow as

\[ p_n = \frac{\tilde{y}(I_n)}{\sum_j \tilde{y}(I_j)}. \tag{34} \]

   Scope and limits. The approximation is for a finite interval
I ∈ [0, L], where L is determined by the input batch via Eq. (4). In
practice, one designs for a worst-case L expected in operation (or
retunes a and rescales the control signal to adapt L). Noise, insertion
loss, and control-induced parasitics limit accuracy and dynamic range;
we treat these effects as platform-specific margins. Detailed
non-ideality assumptions, parameter distributions, and robustness
statistics are reported in Supplementary Sec. S8. With K channels in
parallel, one can form softmax by summing channel powers and applying a
shared reciprocal scale factor, depending on the chosen mixed-signal
normalization scheme.
   WDM parallelism. A particularly hardware-efficient realization
exploits wavelength-division multiplexing (WDM): d probe wavelengths
λ1, . . . , λd, each resonant with a distinct FSR order of the same ring
set, traverse a single N-ring cascade simultaneously (Fig. 8). Because
each channel λj sees its own Lorentzian detuning set by an independent
control voltage Vj, the cascade output per channel is
ỹj = C ∏_{k=1}^{N} Tdrop,k(λj, Vj) ≈ e^{Vj}, and all d exponentials are
computed in parallel on the same physical waveguide. Compared with a
1×M power-splitter architecture that replicates the cascade for each
channel, the WDM approach reduces the total ring count from N × d to N
(a factor-d saving) and eliminates the splitter insertion loss
(10 log10 d dB). At the output, a WDM demultiplexer or
wavelength-selective photodetector array separates the channels for
electrical readout. Figure 8 shows a representative chip layout for
N = 5 cascade stages and d = 8 WDM channels, where alternating U-turn
bus connections route the drop-port output of each stage into the input
bus of the next.
   Why cascade helps. A single Lorentzian in I is too rigid to mimic the
log-linear target over a wide interval. Cascading turns the transfer
into a product; taking a logarithm gives a sum of smooth terms, and the
approximation improves as N increases. The slope constraint N|b| ≳ 1 is
an immediate feasibility check.
   Global softmax normalization via WDM feedback. The WDM-parallel
architecture (Fig. 8) integrates naturally with a closed-loop
normalization scheme to complete the full softmax function. After the
N-stage cascade, a WDM demultiplexer (e.g., arrayed-waveguide grating or
ring-filter bank) routes each channel λj to a dedicated photodetector,
producing photocurrents Iλj ∝ ỹj ≈ C Pin e^{Vj}. The d photocurrents are
summed electrically:

\[ S = \sum_{j=1}^{d} I_{\lambda_j} \propto C\,P_{\mathrm{in}} \sum_{j=1}^{d} e^{V_j}. \tag{35} \]

A proportional–integral (PI) controller compares S with a fixed
reference Sref and adjusts the shared WDM laser power Pin so that
S → Sref [44, 45]. Because all d channels share the same probe source,
scaling Pin multiplies every ỹj by the same factor; upon convergence

\[ p_j = \frac{I_{\lambda_j}}{S_{\mathrm{ref}}} = \frac{e^{V_j}}{\sum_{k=1}^{d} e^{V_k}} = \mathrm{softmax}(V)_j, \tag{36} \]

realizing the complete softmax with a single feedback loop and no
per-channel normalization circuitry. Compared with the
replicated-cascade approach (one AEF block per channel), WDM feedback
offers two additional benefits: (i) the splitter-induced power imbalance
that would bias the Iλj ratios is absent, since all channels traverse
the same optical path; and (ii) a single laser control point replaces d
independent probe adjustments. Design details and stability analysis of
the PI loop are provided in Supplementary Sec. S9.
   Beyond ring-resonator AEF implementations, the same cascade principle
can be extended to other cavity-based photonic platforms, such as serial
1D photonic-crystal cavities and other cascaded resonant architectures
[21, 46].
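
The design procedure can be illustrated end to end in software. The sketch below is a numerical stand-in for the optical block, not a device model: it evaluates the cascade response [1 + (a + bI)^2]^{−N} with the N = 12 parameters from Table IV, maps inputs to control amplitudes using the worst-case design span L = 8 rather than the per-batch L (an assumption consistent with the "Scope and limits" remark), and normalizes as in Eq. (34). The constant C cancels in the normalization, so it is omitted. Function names are illustrative.

```python
import math

# Table IV parameters for N = 12, designed for L = 8.
N_RINGS, A, B, L_DESIGN = 12, -1.3731, 0.08450, 8.0


def cascade_response(i: float) -> float:
    """Unnormalized N-ring drop-port response [1 + (a + b I)^2]^(-N)."""
    return (1.0 + (A + B * i) ** 2) ** (-N_RINGS)


def photonic_softmax(x: list[float]) -> list[float]:
    """Approximate softmax via the cascade response and Eq. (34)."""
    m = max(x)
    u = [xn - m for xn in x]                     # step 1: shift by the maximum
    if -min(u) > L_DESIGN:
        raise ValueError("input span exceeds the design range L")
    controls = [un + L_DESIGN for un in u]       # step 2, with the design span
    y = [cascade_response(i) for i in controls]  # step 6: optical block
    total = sum(y)
    return [yn / total for yn in y]              # Eq. (34); C cancels


def exact_softmax(x: list[float]) -> list[float]:
    m = max(x)
    w = [math.exp(xn - m) for xn in x]
    total = sum(w)
    return [wn / total for wn in w]


scores = [1.0, 3.5, -2.0, 0.5]
p_approx = photonic_softmax(scores)
p_exact = exact_softmax(scores)
max_abs_err = max(abs(pa - pe) for pa, pe in zip(p_approx, p_exact))

print("approx:", [round(p, 4) for p in p_approx])
print("exact: ", [round(p, 4) for p in p_exact])
print(f"max abs error = {max_abs_err:.2e}")
```

Inputs whose span exceeds the design L are rejected here; a real system would instead clip, rescale, or retune (a, b), as the text notes.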

What these platforms share is transfer-function shaping          TABLE VII: Summary of evidence levels for key claims.
through cascaded resonances; loss, tuning range, fabrica-
tion tolerance, and calibration overhead remain platform-        Claim                              Evidence       Sec.
dependent.                                                       Cascade → exp. approx.             Analytic        II
    The insertion loss budget (Sec. V C) and electro-optic       Depth scaling                  Analytic + num. II, III
voltage requirements (Sec. V A) suggest that the cas-            QL , Dmax , bV                    3-D FDTD         IV
cade architecture is feasible under optimized coupling           5-ring line shape                 3-D FDTD         IV
and layout conditions. Using monolithic TFLN microring           N ≤ 30 deep cascade              CMT proj.∗         V
                                                                 Energy < 1 pJ                      Estimate        V
data from Bahadori et al. [47] (Q ≈ 5432, dλ0 /dV ≈
                                                                 Full softmax (WDM + feedback) Conceptual + layout VI
9–20 pm/V), the normalized sensitivity bV ≃ 0.063–
                                                                 ∗ Based on published Q
0.14 V−1 , within the range required by the cascade design.                               i ≥ 10
                                                                                                   6 values [39, 42] and CMT coupling

                                                                                                   model.
Crystal orientation and electrode design. The X-
cut TFLN platform was chosen for several reasons. First,
X-cut is the prevailing industry standard for integrated         tified in the Monte Carlo robustness analysis (Supple-
TFLN modulators, with well-established fabrication pro-          mentary Sec. S8). Monte Carlo simulations (Supplemen-
cesses and commercial wafer availability [37, 38]. Second,       tary Sec. S8) show that under nominal non-ideality levels
the TE0 mode—which is strongly confined in the rib               (σa = 0.020, σb,rel = 0.020), a single-point calibration of
waveguide geometry—can engage the large r33 coefficient          C per chip keeps the median softmax KL divergence below
via lateral electric fields aligned with the crystal Z-axis.     2.2 × 10−4 , with 95th-percentile max probability error
In contrast, Z-cut geometry with TE polarization can only        under 0.32%. Even under stress conditions (σa = 0.032),
access the smaller r13 coefficient (∼ 10 pm/V), resulting        95th-percentile errors remain below 0.42%, demonstrat-
in significantly lower electro-optic efficiency. The arc elec-   ing that the identical-detuning design is robust to realis-
trode design (Sec. IV D) addresses the phase-cancellation        tic fabrication variations provided a per-chip calibration
problem inherent to X-cut circular rings [47] by orienting       step is performed. Conversely, if coupling gaps are in-
the crystal Z-axis at 45◦ from the horizontal in the sub-        tentionally varied across rings, the per-ring parameters
strate plane. This rotation places the cos(θ − 45◦ ) = 0         (ak , bk ) become independent degrees of freedom. A Taylor-
boundaries at θ = 135◦ and 315◦ , naturally separating the       expansion analysis shows that K non-identical rings can
bus-waveguide coupling regions from the electrode regions.       cancel curvature
                                                                               P terms up to order 2K in the Taylor series
Each ring carries a full semicircular arc electrode on the       of g(I) = k ln Tk , one order higher than K identical
side opposite to its coupling points, yielding an effective      rings, so that fewer rings suffice for a given error target.
fill factor fEO = 1/π ≈ 0.318. While this reduces the round-trip EO efficiency compared to a hypothetical
full-circumference design, it preserves the compact footprint of a circular ring resonator.

The cascade performance can be further improved beyond the R = 20 µm circular-ring design presented here. Increasing
the ring radius reduces bending loss and raises the intrinsic quality factor Qi, which directly increases bV (∝ Q) and
lowers the required control voltage. Alternatively, adopting a racetrack geometry with extended straight coupling
sections strengthens the bus–ring coupling, pushing the drop-port maximum Dmax closer to critical coupling and
improving the per-stage transfer efficiency. Either approach, or their combination, would yield higher bV and Dmax,
enabling lower N or tighter approximation accuracy at reduced operating voltages.
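
The value fEO = 1/π quoted for the semicircular arc electrode can be checked numerically. The sketch below rests on an assumption that is not stated explicitly in the text: that the effective fill factor is the full-circumference average of the EO projection cos(θ − 45°), with only the electrode arc between the zero-projection boundaries (θ = 315° and θ = 135°) contributing. The function name and the midpoint-rule integration are illustrative choices.

```python
import math

def effective_fill_factor(arc_start_deg=-45.0, arc_end_deg=135.0,
                          axis_deg=45.0, steps=100_000):
    """Circumference-averaged EO projection over an electrode arc.

    Assumption: the local EO weight at ring angle theta is
    cos(theta - axis_deg), and only the electrode arc contributes.
    """
    start = math.radians(arc_start_deg)
    end = math.radians(arc_end_deg)
    axis = math.radians(axis_deg)
    dtheta = (end - start) / steps
    # Midpoint rule for the integral of cos(theta - axis) over the arc.
    integral = sum(
        math.cos(start + (i + 0.5) * dtheta - axis) for i in range(steps)
    ) * dtheta
    # Normalize by the full circumference (2*pi radians).
    return integral / (2.0 * math.pi)

f_eo = effective_fill_factor()
print(f"f_EO = {f_eo:.4f}, 1/pi = {1 / math.pi:.4f}")
```

Under this assumption the semicircle contributes an integral of 2, so the average over 2π is 1/π, matching the quoted 0.318.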
Fabrication considerations. The X-cut TFLN rib waveguide (600 nm total thickness, 500 nm etch, w = 1.4 µm) follows
established fabrication processes for commercial TFLN wafers on SiO2 [37, 38]. The lateral signal–ground (SG)
electrode configuration is fabricated in a single metal layer, which is standard in TFLN foundry processes. The
primary fabrication challenge for the cascade architecture is maintaining uniform coupling gaps (g = 100 nm) across
N rings to ensure identical Lorentzian transfer functions. Post-fabrication trimming via UV exposure or localized
thermal oxidation can compensate residual detuning variations [30], as quan-


[Figure 8 schematic: "Softmax Full Chip Layout – N = 5 × d = 8 (TFLN)". Labeled elements: WDM input λ1–λ8 with
input power Pin; d = 8 WDM channels with per-channel bias voltages Vλ1–Vλ8; N = 5 cascade stages (n = 1–5); WDM demux
(AWG / ring filter); photodetectors PD1–PD8 with outputs p1–p8; summation Σ of the photocurrents Iλj compared with
Sref, with the error e fed to a PI controller; feedback adjusts Pin; output pj = Iλj/Sref = softmax(V)j.]

FIG. 8: WDM-parallel MRR-AEF system with closed-loop softmax normalization (N = 5 cascade stages, d = 8 WDM
channels) on X-cut TFLN. A single WDM source (λ1–λ8) enters the top input bus waveguide; each stage applies a
Lorentzian drop-port transfer, and alternating U-turn connections route the drop-port output into the next stage's
input bus. Per-channel EO bias voltages (Vλ1–Vλ8) independently tune each column of rings. The final drop output
passes through a WDM demultiplexer (AWG / ring filter) and is detected by a PD array, producing per-channel
photocurrents Iλj ∝ e^{Vj}. The photocurrents are summed (Σ) and compared with a reference Sref; a PI controller
adjusts the shared laser power Pin until S = Sref, at which point each PD output directly yields
pj = Iλj/Sref = softmax(V)j (Eq. 36).
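
The closed-loop normalization described in the caption can be illustrated with a small discrete-time simulation. This is a minimal sketch under stated assumptions, not the authors' controller: the cascade is idealized as an exact exponential (Iλj = Pin · exp(Vj − max V)), the loop is a velocity-form PI update, and the names `photocurrents`, `pi_normalized_softmax`, and the gains `kp`, `ki` are hypothetical. The gains are chosen so that the loop is stable for the plant gain Σj exp(Vj − max V), which lies between 1 and d.

```python
import math

def photocurrents(voltages, p_in):
    """Idealized per-channel photocurrents, I_j = p_in * exp(V_j - max V)."""
    v_max = max(voltages)
    return [p_in * math.exp(v - v_max) for v in voltages]

def pi_normalized_softmax(voltages, s_ref=1.0, kp=0.05, ki=0.1, steps=500):
    """Adjust the shared input power until sum(I_j) = s_ref, then read p_j."""
    p_in, prev_err = 0.0, 0.0
    for _ in range(steps):
        total = sum(photocurrents(voltages, p_in))
        err = s_ref - total
        # Velocity-form PI update: proportional term acts on the change
        # in error, integral term acts on the error itself.
        p_in += kp * (err - prev_err) + ki * err
        prev_err = err
    return [i / s_ref for i in photocurrents(voltages, p_in)]

def softmax(voltages):
    """Reference softmax for comparison."""
    v_max = max(voltages)
    exps = [math.exp(v - v_max) for v in voltages]
    return [e / sum(exps) for e in exps]

V = [0.5, 1.0, -0.3, 2.0]
p_loop = pi_normalized_softmax(V)
p_ref = softmax(V)
print([round(p, 4) for p in p_loop])
print([round(p, 4) for p in p_ref])
```

Once the loop settles at S = Sref, dividing each photocurrent by the fixed reference is equivalent to dividing by the sum, which is why no per-channel normalization circuit is needed in this picture.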

                                                  VII. CONCLUSION

   We have presented a cascaded micro-ring resonator architecture that approximates the exponential function
$e^{x_n - m}$ on a finite interval [0, L] using multiplicative Lorentzian transfer functions. Increasing the cascade
depth N systematically reduces the worst-case relative error, and an identical-detuning design initialized by flank
and slope matching provides a practical two-parameter design.
   Three-dimensional FDTD simulations of a single X-cut TFLN add-drop ring (R = 20 µm, g = 100 nm) yield
QL = 10,300–15,500 and Dmax ≈ 0.36, calibrating the cascade transfer model. A five-ring cascade 3D FDTD simulation
directly validates the multi-ring framework: all five rings exhibit resonant excitation, and mapping the drop-port
spectrum onto the dimensionless control variable reproduces the theoretical N = 5 curve with ∼11% integrated
relative-area error over the upper operating range (I ≥ 5.8), providing the first multi-ring confirmation of the
cascade exponential approximation. At the present FDTD-characterized quality factor, practical cascades are limited
to N = 5–7 (ε ≲ 11%). If high-Q TFLN resonators reported in the literature (Qi ≥ 10^6, Dmax ≥ 0.95) are realized in
the cascade geometry, deeper cascades (N ≈ 20–30) would reach sub-percent approximation error with an estimated
per-operation energy of 0.79–1.39 pJ, which is 1.7–2.4× lower than an INT8 MAC at the 7 nm node. Monte Carlo analysis
shows that the identical-detuning design tolerates realistic fabrication variations (σa = 0.020, σb,rel = 0.020) with
a single per-chip calibration, keeping the 95th-percentile softmax probability error below 0.32%.
   The formulation is not restricted to electro-optic tuning: it requires only a controllable detuning coordinate with
local linearization, so both Pockels and optical (Kerr/XPM) mechanisms are compatible [37, 38, 47, 48]. We demonstrate
a photonic exponential block and present a WDM-parallel chip architecture (Fig. 8) in which d wavelength channels
share a single N-ring cascade, reducing the total ring count by a factor of d and eliminating power-splitter loss.
Combined with a single-loop PI feedback that adjusts the shared WDM laser power, the architecture realizes the
complete softmax function (exponentiation, summation, and normalization) without per-channel normalization
circuitry. Max-finding and digital interfacing remain open for future experimental validation.


 [1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser,
     and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems 30
     (NeurIPS 2017), pages 5998–6008, 2017.
 [2] Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and
     memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems 35
     (NeurIPS 2022), pages 16344–16359, 2022.
 [3] Neil Savage. Light could lower AI's appetite for power. Nature Nanotechnology, 21:6–8, 2026.
 [4] Yichen Shen et al. Deep learning with coherent nanophotonic circuits. Nature Photonics, 11(7):441–446, 2017.
 [5] Johannes Feldmann et al. Parallel convolutional processing using an integrated photonic tensor core. Nature,
     589(7840):52–58, 2021.
 [6] Nicholas C. Harris et al. Linear programmable nanophotonic processors. Optica, 5(12):1623–1631, 2018.
 [7] Bowei Dong, Samarth Aggarwal, Wen Zhou, Utku Emre Ali, Nikolaos Farmakidis, June Sang Lee, Yuhan He, Xuan
     Li, Dim-Lee Kwong, C. D. Wright, Wolfram H. P. Pernice, and H. Bhaskaran. Higher-dimensional processing using
     a photonic tensor core with continuous-time data. Nature Photonics, 17(12):1080–1088, 2023.
 [8] Sudip Shekhar, Wim Bogaerts, Lukas Chrostowski, John E. Bowers, Michael Hochberg, Richard Soref, and
     Bhavin J. Shastri. Roadmapping the next generation of silicon photonics. Nature Communications, 15:751, 2024.
 [9] Mario Miscuglio and Volker J. Sorger. Photonic tensor cores for machine learning. Applied Physics Reviews,
     7(3):031404, 2020.
[10] Yaowen Hu, Yunxiang Song, Xinrui Zhu, Xiangwen Guo, Shengyuan Lu, Qihang Zhang, Lingyan He, C. A. A.
     Franken, Keith Powell, Hana Warner, Daniel Assumpcao, Dylan Renaud, Ying Wang, et al. Integrated lithium
     niobate photonic computing circuit based on efficient and high-speed electro-optic conversion. Nature
     Communications, 16:8178, 2025.
[11] Priyabrata Dash, Anxiao Jiang, and Dharanidhar Dang. SOFTONIC: A photonic design approach to softmax
     activation for high-speed fully analog AI acceleration. In Proceedings of the Great Lakes Symposium on VLSI
     (GLSVLSI '25), pages 118–125, 2025.
[12] Ziyu Zhan, Hao Wang, Qiang Liu, and Xing Fu. Optoelectronic nonlinear softmax operator based on diffractive
     neural networks. Optics Express, 32(15):26458–26469, 2024.
[13] Ye Tian, Shuiying Xiang, Xingxing Guo, Yahui Zhang, Jiashang Xu, Shangxuan Shi, Haowen Zhao, Yizhi Wang,
     Xinran Niu, Wenzhuo Liu, and Yue Hao. Photonic transformer chip: interference is all you need. PhotoniX, 6:45,
     2025.
[14] Jacob R. Stevens, Rangharajan Venkatesan, Steve Dai, Brucek Khailany, and Anand Raghunathan. Softermax:
     Hardware/software co-design of an efficient softmax for transformers. In Proceedings of the 58th ACM/IEEE
     Design Automation Conference (DAC), pages 469–474, 2021.
[15] Nazim Altar Koca, Anh Tuan Do, and Chip-Hong Chang. Hardware-efficient softmax approximation for
     self-attention networks. In Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS),
     pages 1–5, 2023.
[16] Wenxun Wang, Shuchang Zhou, Wenyu Sun, Peiqin Sun, and Yongpan Liu. SOLE: Hardware-software co-design
     of softmax and layernorm for efficient transformer inference. In Proceedings of the IEEE/ACM International
     Conference on Computer-Aided Design (ICCAD), pages 1–9, 2023.
[17] Yuan Zhang, Yonggang Zhang, Lele Peng, Lianghua Quan, Shubin Zheng, Zhonghai Lu, and Hui Chen. Base-2
     softmax function: Suitability for training and efficient hardware implementation. IEEE Transactions on
     Circuits and Systems I: Regular Papers, 69(9):3605–3618, 2022.
[18] Zhengyu Mei, Hongxi Dong, Yuxuan Wang, and Hongbing Pan. TEA-S: A tiny and efficient architecture for
     PLAC-based softmax in transformers. IEEE Transactions on Circuits and Systems II: Express Briefs,
     70:3594–3598, 2023.
[19] Ke Chen, Yue Gao, Haroon Waris, Weiqiang Liu, and Fabrizio Lombardi. Approximate softmax functions for
     energy-efficient deep neural networks. IEEE Transactions on Very Large Scale Integration (VLSI) Systems,
     31:4–16, 2023.
[20] Wim Bogaerts et al. Silicon microring resonators. Laser & Photonics Reviews, 6(1):47–73, 2012.
[21] John E. Heebner, Robert W. Boyd, and Q.-Han Park. Scissor solitons and other propagation effects in
     microresonator-modified waveguides. Journal of the Optical Society of America B, 19(4):722–731, 2002.
[22] Jiahui Wang, Sean P. Rodrigues, Ercan M. Dede, and Shanhui Fan. Microring-based programmable coherent
     optical neural networks. Optics Express, 31(12):18871, 2023.
[23] Pengxing Guo, Niujie Zhou, Weigang Hou, and Lei Guo. StarLight: a photonic neural network accelerator
     featuring a hybrid mode-wavelength division multiplexing and photonic nonvolatile memory. Optics Express,
     30:37051, 2022.
[24] Weizhen Yu, Shuang Zheng, Zhenyu Zhao, Bin Wang, and Weifeng Zhang. Reconfigurable low-threshold
     all-optical nonlinear activation functions based on an add-drop silicon microring resonator. IEEE Photonics
     Journal, 14(6):1–7, 2022.
[25] Bahaa E. A. Saleh and Malvin C. Teich. Fundamentals of Photonics. Wiley, Hoboken, NJ, 2nd edition, 2007.
[26] Vítor R. Almeida, Carlos A. Barrios, Roberto R. Panepucci, and Michal Lipson. All-optical control of light
     on a silicon chip. Nature, 431(7012):1081–1084, 2004.
[27] Qianfan Xu, Bradley Schmidt, Sameer Pradhan, and Michal Lipson. Micrometre-scale silicon electro-optic
     modulator. Nature, 435(7040):325–327, 2005.
[28] Kishore Padmaraju and Keren Bergman. Resolving the thermal challenges for silicon microring resonator
     devices. Nanophotonics, 3:269–281, 2014.
[29] Erwen Li, Behzad Ashrafi Nia, Bokun Zhou, and Alan X. Wang. Transparent conductive oxide-gated silicon
     microring with extreme resonance wavelength tunability. Photonics Research, 7(4):473, 2019.
[30] Lahiru Jayatilleka et al. Post-fabrication trimming of silicon photonic ring resonators at wafer-scale.
     Journal of Lightwave Technology, 39:5083–5088, 2021.
[31] Elliott W. Cheney. Introduction to Approximation Theory. McGraw–Hill, New York, 1966.
[32] Alec Radford et al. Language models are unsupervised multitask learners. Technical report, OpenAI, 2019.
[33] Hugging Face. distilgpt2 model card, 2025. accessed 2026-02-21.
[34] Andrej Karpathy. Tiny shakespeare dataset (char-rnn), 2025. accessed 2026-02-21.
[35] Jane Austen. Pride and prejudice. Project Gutenberg eBook No. 1342, 2025. accessed 2026-02-21.
[36] Hyoseok Park. MRR-AEF: reproducible MRR depth-sweep fitting and supplementary validation scripts. GitHub
     repository, 2025. commit 585e695, accessed 2026-02-21.
[37] Di Zhu et al. Integrated photonics on thin-film lithium niobate. Advances in Optics and Photonics,
     13(2):242–352, 2021.
[38] Yaowen Hu, Di Zhu, Shengyuan Lu, Xinrui Zhu, Yunxiang Song, Dylan Renaud, Daniel Assumpcao, Rebecca Cheng,
     CJ Xin, Matthew Yeh, Hana Warner, Xiangwen Guo, Amirhassan Shams-Ansari, David Barton, Neil Sinclair,
     and Marko Lončar. Integrated electro-optics on thin-film lithium niobate. Nature Reviews Physics, 2025.
[39] Mian Zhang, Cheng Wang, Rebecca Cheng, Amirhassan Shams-Ansari, and Marko Lončar. Monolithic
     ultra-high-Q lithium niobate microring resonator. Optica, 4(12):1536–1537, 2017.
[40] Rongjin Zhuang, Jinze He, Yifan Qi, and Yang Li. High-Q thin-film lithium niobate microrings fabricated with
     wet etching. Adv. Mater., 35(3):2208113, 2023.
[41] Xinrui Zhu, Yaowen Hu, Shengyuan Lu, Hana K. Warner, Xudong Li, Yunxiang Song, Letícia S. Magalhães,
     Amirhassan Shams-Ansari, Neil Sinclair, and Marko Lončar. Twenty-nine million intrinsic Q-factor monolithic
     microresonators on thin-film lithium niobate. Photon. Res., 12(8):A63–A68, 2024.
[42] Renhong Gao, Ni Yao, Jianglin Guan, Li Deng, Jintian Lin, Min Wang, Lingling Qiao, Wei Fang, and Ya Cheng.
     Lithium niobate microring with ultra-high Q factor above 10^8. Chin. Opt. Lett., 20(1):011902, 2022.
[43] Flexcompute Inc. Tidy3D: electromagnetic simulation software. https://www.flexcompute.com/tidy3d/, 2024.
     v2.10; cloud GPU FDTD. Accompanying notebook:
     https://www.flexcompute.com/tidy3d/community/notebooks/CascadedMRRTFLN/.
[44] John K. Doylend, Paul E. Jessop, and Andrew P. Knights. Silicon photonic dynamic optical channel leveler
     with external feedback loop. Optics Express, 18(13):13805–13812, 2010.
[45] Karl J. Åström and Richard M. Murray. Feedback Systems: An Introduction for Scientists and Engineers.
     Princeton University Press, Princeton, NJ, 2008.
[46] Amnon Yariv, Yong Xu, Reginald K. Lee, and Axel Scherer. Coupled-resonator optical waveguide: a proposal
     and analysis. Optics Letters, 24(11):711–713, 1999.
[47] Meisam Bahadori, Yansong Yang, Ahmed E. Hassanien, Lynford L. Goddard, and Songbin Gong. Ultra-efficient
     and fully isotropic monolithic microring modulators in a thin-film lithium niobate photonics platform.
     Optics Express, 28(20):29644–29661, 2020.
[48] Abu Naim R. Ahmed, Shouyuan Shi, Mathew Zablocki, Peng Yao, and Dennis W. Prather. Tunable hybrid
     silicon nitride and thin-film lithium niobate electro-optic microresonator. Optics Letters, 44(3):618, 2019.

                                      SUPPLEMENTARY INFORMATION

Supplementary material accompanying “Photonic Exponential Approximation via Cascaded TFLN Microring Resonators
toward Softmax.”


                           S0. RIGOROUS DERIVATION AND VALIDITY SCOPE

  This section derives the depth-scaling relations and screening bounds used in the main text, and states the assumptions
under which they apply, together with validity scope and failure cases. We separate proved statements (Lemma,
Proposition, Theorem) from heuristic engineering estimates that rely on empirical calibration.


                                                  S0.1 Assumptions

Assumption 1 (Lorentzian single-ring transfer). Each ring $k$ has a normalized add-to-drop transmission of the form
$T_k(I) = [1 + (a_k + bI)^2]^{-1}$, where $a_k \in \mathbb{R}$ is the dimensionless static detuning, $b > 0$ is the
common normalized sensitivity, and $I \ge 0$ is a nonnegative control-signal amplitude.
Assumption 2 (Multiplicative cascade). The $N$ rings are cascaded in a serial add-drop topology (the drop output of
ring $k$ feeds the add input of ring $k+1$), and the probe is sufficiently weak that cross-ring and nonlinear
probe-induced effects are negligible; thus the total normalized transmission is $y(I) = \prod_{k=1}^{N} T_k(I)$.
Assumption 3 (Identical-detuning family). All rings share the same static detuning: $a_1 = \cdots = a_N \equiv a$.
This reduces the design space to $(a, b)$ and a global scale $C > 0$; the scaled output is
$\tilde{y}(I) = C\,[1 + (a + bI)^2]^{-N}$.
Assumption 4 (Linear control-to-resonance mapping). Within the operating range $I \in [0, L]$, the resonance shift is
a linear function of the control-signal amplitude (Eq. (10) of main text), i.e., higher-order detuning nonlinearity is
negligible.
Assumption 5 (Finite interval and bounded L). The approximation target is $f(I) = e^{I-L}$ on a finite interval
$I \in [0, L]$ with $L > 0$ determined by the input batch ($L = \max(x) - \min(x)$). The depth-scaling results are
derived for fixed, finite $L$.
Assumption 6 (Flank-centered operating regime). The design uses the “flank-centered” initialization:
$a + b(L/2) = -1$ (midpoint on the Lorentzian half-maximum) and $Nb = 1$ (slope matching). This places the operating
point in the steepest-slope region of the Lorentzian, where the log-transfer is most nearly linear.
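
As a concrete reading of Assumptions 1–3 and 6, the following minimal sketch builds the identical-ring cascade and checks the two flank-centered conditions numerically. The function names and the values N = 5, L = 6 are illustrative assumptions, not parameters taken from the main text.

```python
import math

def flank_centered_params(n_rings, interval_length):
    """Assumption 6: a + b*(L/2) = -1 and N*b = 1."""
    b = 1.0 / n_rings
    a = -1.0 - b * interval_length / 2.0
    return a, b

def ring_transfer(control, a, b):
    """Assumption 1: Lorentzian T(I) = 1 / (1 + (a + b*I)^2)."""
    return 1.0 / (1.0 + (a + b * control) ** 2)

def cascade_transfer(control, a, b, n_rings):
    """Assumptions 2-3: product of N identical ring transfers."""
    return ring_transfer(control, a, b) ** n_rings

N, L = 5, 6.0
a, b = flank_centered_params(N, L)
mid = L / 2.0
h = 1e-5
# Central-difference slope of ln y at the midpoint; slope matching targets 1.
slope_mid = (math.log(cascade_transfer(mid + h, a, b, N))
             - math.log(cascade_transfer(mid - h, a, b, N))) / (2 * h)
print(f"a = {a:.3f}, b = {b:.3f}, T(mid) = {ring_transfer(mid, a, b):.3f}, "
      f"slope = {slope_mid:.6f}")
```

The midpoint sits at the half-maximum of each ring (T = 1/2) and the log-slope of the cascade there equals the target slope of 1, which is what the two conditions in Assumption 6 encode.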


                                                S0.2 Rigorous results

  Throughout, define the log-domain residual
\[
r(I) \equiv \ln \tilde{y}(I) - (I - L) = \ln C - N \ln\!\left[1 + (a + bI)^2\right] - (I - L), \tag{S0.1}
\]
and the worst-case log-error $E_\infty = \sup_{I \in [0,L]} |r(I)|$. We set $C$ to the minimax-optimal value
$\ln C^\star = -\left[\max_I g(I) + \min_I g(I)\right]/2$, where $g(I) = \ln y(I) - (I - L)$, throughout.
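
The minimax-optimal scale can be evaluated directly on a grid. The sketch below is an illustrative numerical estimate (the grid resolution, function names, and the values N = 5, L = 6 are assumptions); it computes g(I), the shift ln C⋆, and the resulting worst-case log-error, which after the shift equals half the spread of g.

```python
import math

def log_cascade(control, a, b, n_rings):
    """ln y(I) for N identical Lorentzian rings (unscaled, C = 1)."""
    return -n_rings * math.log(1.0 + (a + b * control) ** 2)

def minimax_scale_and_error(a, b, n_rings, interval_length, samples=2001):
    """Return (ln C*, E_inf) estimated on a uniform grid over [0, L]."""
    grid = [interval_length * i / (samples - 1) for i in range(samples)]
    g = [log_cascade(x, a, b, n_rings) - (x - interval_length) for x in grid]
    ln_c_star = -(max(g) + min(g)) / 2.0   # centers the residual about zero
    e_inf = (max(g) - min(g)) / 2.0        # worst-case |r(I)| after the shift
    return ln_c_star, e_inf

N, L = 5, 6.0
b = 1.0 / N
a = -1.0 - b * L / 2.0           # flank-centered choice (Assumption 6)
ln_c, e_inf = minimax_scale_and_error(a, b, N, L)
rel_err = math.exp(e_inf) - 1.0  # relative error eps with ln(1 + eps) = E_inf
print(f"ln C* = {ln_c:.4f}, E_inf = {e_inf:.4f}, relative error = {rel_err:.3%}")
```

Because a grid maximum can only underestimate the true supremum, a finer grid (or a bounded scalar optimizer) would be the natural refinement if a tighter estimate is needed.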
Lemma 1 (Slope bound; rigorous). Under Assumptions 1–3, for every $I \ge 0$,
\[
\left| \frac{d}{dI} \ln y(I) \right| \le N |b|.
\]
Proof. From Assumption 3, $\ln y(I) = -N \ln\!\left[1 + (a + bI)^2\right]$. Differentiating:
\[
\frac{d}{dI} \ln y(I) = -N \, \frac{2b(a + bI)}{1 + (a + bI)^2}.
\]
Let $u \equiv a + bI \in \mathbb{R}$. The function $h(u) = 2u/(1 + u^2)$ satisfies $|h(u)| \le 1$ for all
$u \in \mathbb{R}$ (since $1 + u^2 \ge 2|u|$ by AM–GM). Therefore $|d(\ln y)/dI| = N |b| \, |h(u)| \le N |b|$.
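
The slope bound is easy to confirm numerically. The following sketch evaluates the analytic log-slope on a grid and compares its largest magnitude with N|b|; the parameter values (N = 5, a = −1.6, b = 0.2) and the scan range are illustrative assumptions.

```python
import math

def log_slope(control, a, b, n_rings):
    """Analytic d/dI ln y(I) = -N * 2b(a + bI) / (1 + (a + bI)^2)."""
    u = a + b * control
    return -n_rings * 2.0 * b * u / (1.0 + u * u)

N, a, b = 5, -1.6, 0.2
controls = [0.01 * i for i in range(2001)]      # I in [0, 20]
slopes = [log_slope(x, a, b, N) for x in controls]
max_abs_slope = max(abs(s) for s in slopes)
print(f"max |slope| = {max_abs_slope:.6f}, bound N|b| = {N * abs(b):.6f}")
```

The bound is attained where a + bI = −1 (here I = 3), i.e. at the Lorentzian half-maximum, and is strictly larger than the slope everywhere else.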

Remark 1 (Necessary condition for approximation). Since the target ln f (I) = I − L has constant slope +1, a
necessary condition for the cascade log-transfer to track this slope at any point is N |b| ≥ 1. This is Eq. (23) of the
main text and is a rigorous (not heuristic) necessary condition.
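Proposition 1 (Log-domain Taylor expansion at flank center). Under Assumptions 1–6, define $I_0 = L/2$ and
$\delta = I - I_0$. Then
\[
\ln \tilde{y}(I) = \mathrm{const} + \delta - \frac{\delta^3}{6N^2} + R_4(\delta), \tag{S0.2}
\]
where $|R_4(\delta)| \le M_4 \delta^4$ with $M_4 = N |b|^4 \cdot \sup_{|\delta| \le L/2} |\phi^{(4)}(u(\delta))|$ and
$\phi(u) = -\ln(1 + u^2)$. In particular, the quadratic term vanishes identically at the flank point
$u_0 = a + bI_0 = -1$.
Proof. Set $u(\delta) = a + b(I_0 + \delta) = -1 + b\delta$ (using $a + bI_0 = -1$). With $\phi(u) = -\ln(1 + u^2)$,
we have $\ln y(I) = N \phi(u(\delta))$ and $u'(\delta) = b$. Compute derivatives of $\phi$ at $u_0 = -1$:
\begin{align*}
\phi'(u) &= -\frac{2u}{1 + u^2}, & \phi'(-1) &= 1, \\
\phi''(u) &= \frac{2(u^2 - 1)}{(1 + u^2)^2}, & \phi''(-1) &= 0, \\
\phi'''(u) &= \frac{4u(3 - u^2)}{(1 + u^2)^3}, & \phi'''(-1) &= \frac{4(-1)(3 - 1)}{(1 + 1)^3} = -1.
\end{align*}
By the chain rule, writing $F(\delta) = N \phi(u(\delta))$:
\begin{align*}
F'(0) &= N b \, \phi'(-1) = N b = 1, \\
F''(0) &= N b^2 \, \phi''(-1) = 0, \\
F'''(0) &= N b^3 \, \phi'''(-1) = -N b^3 = -\frac{1}{N^2},
\end{align*}
where we used $b = 1/N$ from Assumption 6 in the last step. Hence the Taylor expansion with the minimax-optimal $C$
is
\[
\ln \tilde{y}(I) = \mathrm{const} + \delta + 0 \cdot \frac{\delta^2}{2} - \frac{1}{N^2} \cdot \frac{\delta^3}{6} + R_4(\delta).
\]
Subtracting the target $\delta$ (the linear part of $I - L$ around $I_0$) gives a leading residual
$-\delta^3/(6N^2)$, of magnitude $|\delta|^3/(6N^2)$. The sign is negative because $\phi'$ attains its maximum value 1
at $u = -1$, so the log-slope falls below the target slope on either side of the flank point. The remainder is
bounded by the standard Taylor remainder estimate.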
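
The size of the leading cubic residual can be checked numerically without any symbolic algebra. This is a minimal sketch assuming the flank-centered parameters (b = 1/N, u0 = −1); the function names and N = 10 are illustrative. It compares the magnitude of ln y − const − δ with |δ|³/(6N²) for shrinking δ.

```python
import math

def log_cascade_flank(delta, n_rings):
    """ln y at offset delta from the flank center, with b = 1/N and u0 = -1."""
    u = -1.0 + delta / n_rings
    return -n_rings * math.log(1.0 + u * u)

def cubic_ratio(delta, n_rings):
    """|ln y - const - delta| divided by |delta|^3 / (6 N^2)."""
    const = log_cascade_flank(0.0, n_rings)
    residual = log_cascade_flank(delta, n_rings) - const - delta
    return abs(residual) / (abs(delta) ** 3 / (6.0 * n_rings ** 2))

N = 10
for delta in (1.0, 0.5, 0.1):
    print(f"delta = {delta:4.1f}: ratio = {cubic_ratio(delta, N):.4f}")
```

The ratio approaches 1 as δ shrinks, confirming that the quadratic term is absent and that the cubic term has the stated magnitude; the deviation from 1 at larger δ is the fourth-order remainder.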
Theorem 1 (Heuristic depth-scaling law). Under Assumptions 1–6 and ignoring the fourth-order remainder $R_4$, the
leading-order worst-case log-error on $I \in [0, L]$ satisfies
\[
E_\infty^{(\mathrm{leading})} \sim \frac{1}{6N^2} \left(\frac{L}{2}\right)^3 = \frac{L^3}{48\,N^2}. \tag{S0.3}
\]
Setting $E_\infty^{(\mathrm{leading})} \le \varepsilon_{\log} = \ln(1 + \varepsilon)$ and solving for $N$ gives
\[
N \ge \frac{L^{3/2}}{\sqrt{48\,\varepsilon_{\log}}}. \tag{S0.4}
\]
Derivation (heuristic). From Proposition 1, the residual with respect to the target is dominated in magnitude by
$|\delta|^3/(6N^2)$ for $|\delta| \le L/2$. The maximum of $|\delta|^3$ on $[-L/2, L/2]$ is $(L/2)^3$. Setting the
bound equal to $\varepsilon_{\log}$ and solving:
\[
\frac{L^3}{48\,N^2} \le \varepsilon_{\log} \quad\Longrightarrow\quad N \ge \frac{L^{3/2}}{\sqrt{48\,\varepsilon_{\log}}}.
\]
With $1/\sqrt{48} \approx 0.144$, and accounting for the fact that the minimax-optimal residual is typically smaller
than the one-sided Taylor bound by a factor of ∼ 2 (equi-oscillation), the effective prefactor becomes
$\kappa \approx 0.07$, yielding the main-text engineering estimate Eq. (28):
$N \approx \lceil \max(1/b_{\max},\; \kappa L^{3/2}/\sqrt{\varepsilon_{\log}}) \rceil$.
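
Both depth estimates are simple closed-form expressions and can be tabulated directly. The sketch below is illustrative only: the function names are hypothetical, and the inputs L = 10, ε = 0.11, and b_max = 0.2 are placeholder values rather than device parameters from the main text.

```python
import math

def depth_leading_order(interval_length, eps):
    """Smallest integer N with L^3 / (48 N^2) <= ln(1 + eps), Eq. (S0.4)."""
    eps_log = math.log1p(eps)
    return math.ceil(interval_length ** 1.5 / math.sqrt(48.0 * eps_log))

def depth_engineering(interval_length, eps, b_max, kappa=0.07):
    """Engineering estimate: ceil(max(1/b_max, kappa * L^1.5 / sqrt(eps_log)))."""
    eps_log = math.log1p(eps)
    return math.ceil(max(1.0 / b_max,
                         kappa * interval_length ** 1.5 / math.sqrt(eps_log)))

L, eps = 10.0, 0.11
n_lead = depth_leading_order(L, eps)
n_eng = depth_engineering(L, eps, b_max=0.2)   # b_max = 0.2 is a placeholder
print(f"leading-order N = {n_lead}, engineering N = {n_eng}")
```

The engineering estimate is the smaller of the two whenever the κ term dominates, since κ ≈ 0.07 is roughly half of 1/√48; the 1/b_max floor takes over when the available sensitivity, rather than the curvature error, limits the design.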
Remark 2 (Status of Theorem 1). This is a heuristic scaling law, not a rigorous minimax guarantee. The
derivation truncates the Taylor series at third order and approximates the equi-oscillation factor empirically
($\kappa \approx 0.07$). For a rigorous bound one would need explicit control of $R_4$ over the full interval
$[0, L]$, which depends on $L$, $N$, and higher derivatives of the Lorentzian; we do not claim such a bound here. The
scaling $N \sim L^{3/2}/\sqrt{\varepsilon_{\log}}$ is supported by numerical evidence (Table I) but should be treated
as an engineering design rule.

                                 S0.3 Derivation of the conservative screening bound

  We now derive the conservative screening bound (Eqs. S0.7–S0.8 below), which is stated inline in Sec. II of the main
text.
Proposition 2 (Conservative log-error bound). Under Assumptions 1–5 (identical detuning, but not restricted to the
flank-centered choice), fix $b > 0$ and choose the normalization $\tilde{y}(L) = 1$. Define
$\phi(u) = -\ln(1 + u^2)$ and write
\[
\ln \tilde{y}(I) = N \left[ \phi(a + bI) - \phi(a + bL) \right].
\]
The target in this normalization is $(I - L)$. Denoting the residual $r(I) = \ln \tilde{y}(I) - (I - L)$, we have
$r(L) = 0$ and $r(0) = N[\phi(a) - \phi(a + bL)] + L$.
   For any choice of $a$ such that the operating range $\{a + bI : I \in [0, L]\}$ lies in the region where $\phi$ is
concave (i.e., $\phi''(u) \le 0$ throughout), the worst-case log-error satisfies
\[
E_\infty \le \frac{N \, \|\phi''\|_\infty \, b^2 L^2}{8} + \frac{\left| N b \, \phi'(a + bL) - 1 \right|}{2} \cdot L, \tag{S0.5}
\]
where $\|\phi''\|_\infty = \sup_{u \in [a,\, a + bL]} |\phi''(u)|$.
Derivation sketch. Write $h(I) \equiv N \phi(a + bI)$. The slope is $h'(I) = N b \, \phi'(a + bI)$. At $I = L$, we
want the slope to match the target slope 1; define the slope mismatch
$\Delta s \equiv h'(L) - 1 = N b \, \phi'(a + bL) - 1$. By the fundamental theorem of calculus on $[I, L]$:
\[
r(I) - r(L) = r(I) = h(I) - h(L) - (I - L) = \int_I^L \left[ 1 - h'(t) \right] dt.
\]
Write $1 - h'(t) = (1 - h'(L)) + (h'(L) - h'(t)) = -\Delta s + \int_t^L h''(s)\, ds$. Since
$h''(s) = N b^2 \phi''(a + bs)$, we bound $|h''(s)| \le N b^2 \|\phi''\|_\infty$. Integrating twice and applying the
triangle inequality gives (S0.5).
Corollary 1 (Main-text conservative bound). Under slope matching at $I = L$ (i.e., $N b \, \phi'(a + bL) = 1$, so
$\Delta s = 0$), and using $\|\phi''\|_\infty \le 2$ (which holds since
$|\phi''(u)| = |2(u^2 - 1)/(1 + u^2)^2| \le 2$ for all $u \in \mathbb{R}$), the bound simplifies to
\[
E_\infty \le \frac{N b^2 L^2}{4}. \tag{S0.6}
\]
Using $b = 1/N$ (the slope-matching choice from $N b = 1$) gives $E_\infty \le L^2/(4N)$. If instead we retain a
general $b$ but add the penalty from imperfect slope matching (e.g., from the constraint $b \le b_{\max}$), a combined
conservative bound is
\[
E_\infty \le \frac{L^2}{4N} + \frac{1}{2 b^2 N}, \tag{S0.7}
\]
which provides a conservative heuristic bound on the log-error. Setting this $\le \ln(1 + \varepsilon)$ and solving
for $N$ yields the conservative screening depth:
\[
N_{\mathrm{safe}} \ge \frac{L^2/4 + 1/(2b^2)}{\ln(1 + \varepsilon)}. \tag{S0.8}
\]
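
The screening depth follows from Eq. (S0.7) by direct rearrangement, which the short sketch below makes explicit. The function names are hypothetical and the inputs (L = 4, ε = 0.11, b = 1) are illustrative values only; the result inherits the heuristic status of the bound itself.

```python
import math

def conservative_bound(n_rings, interval_length, b):
    """Right-hand side of Eq. (S0.7): L^2/(4N) + 1/(2 b^2 N)."""
    return (interval_length ** 2 / (4.0 * n_rings)
            + 1.0 / (2.0 * b ** 2 * n_rings))

def screening_depth(interval_length, eps, b):
    """Smallest integer N satisfying Eq. (S0.8)."""
    numerator = interval_length ** 2 / 4.0 + 1.0 / (2.0 * b ** 2)
    return math.ceil(numerator / math.log1p(eps))

L, eps, b = 4.0, 0.11, 1.0   # illustrative values only
n_safe = screening_depth(L, eps, b)
print(f"N_safe = {n_safe}, "
      f"bound at N_safe = {conservative_bound(n_safe, L, b):.4f}, "
      f"target = {math.log1p(eps):.4f}")
```

Because the bound scales as 1/N rather than 1/N², the screening depth grows much faster with tightening tolerance than the leading-order estimate of Theorem 1, which is consistent with its intended use as a coarse upper screen.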

Remark 3 (Status of the conservative bound). Equation (S0.7) is a conservative heuristic design rule. It is
conservative because: (i) we use a global upper bound ∥ϕ′′ ∥∞ ≤ 2 instead of the actual curvature, (ii) we do not exploit
the minimax-optimal C shift. It is heuristic (not a formal guarantee) because: (i) the derivation assumes the operating
range lies in the concavity region of ϕ, which may not hold for all detuning choices; (ii) the second term 1/(2b²N)
arises from a simplified penalty model for flank-curvature mismatch that has not been proved to be a rigorous upper
bound in all parameter regimes. Nsafe from Eq. (S0.8) is therefore a screening estimate, suitable for preliminary
design-space exploration but not a certified minimax guarantee.


                                            S0.4 Validity scope and failure cases

  The derivations above hold under the stated assumptions. We now identify the regimes where each assumption may
break down.

(V1) Lorentzian model (A1). The single-ring Lorentzian form $T = [1 + (a + bI)^2]^{-1}$ is a near-resonance
     approximation valid when the probe frequency is within a few linewidths of the resonance. Far from resonance,
     higher-order dispersion, Fano interference, or multi-mode effects introduce deviations. Failure case: operation
     with very large detuning ($|a + bI| \gg 1$ across the full interval), where the Lorentzian tails may not be
     accurate for high-Q rings.

(V2) Multiplicative cascade (A2). Requires that inter-ring reflections and back-scattering are negligible (forward-
     propagating coupling only, i.e. negligible back-reflection at each inter-ring junction). Failure case: very high ring
     count N with non-negligible back-reflection per stage, which can produce Fabry–Pérot-like ripples in the cascade
     transfer function.

(V3) Identical-detuning family (A3). The Taylor expansion and conservative bound both assume a1 = · · · = aN .
     In practice, fabrication variations introduce per-ring detuning spread σa . The Monte Carlo analysis in Sec. S8
     quantifies robustness, but the analytical bounds (S0.2)–(S0.8) are strictly valid only for identical detuning.
(V4) Linear control-to-resonance mapping (A4). The linearized model $\omega_0(I) = \omega_0^{(0)} + \eta I$ introduces systematic
     error at large control amplitudes. For carrier-injection (free-carrier plasma effect) or thermal tuning over wide
     ranges, second-order nonlinearity in the control-to-detuning mapping can exceed 1%. Failure case: large L
     requiring a control swing exceeding the linearity range of the tuning mechanism.

(V5) Finite interval (A5). All bounds scale with L (typically as $L^{2}$ or $L^{3/2}$). As L → ∞, N grows without bound
     and insertion loss accumulates (∼ N · ILstage ), eventually degrading the probe SNR below the useful regime.
     There is no finite N that works for all L simultaneously. Practical regime: L ≲ 10–12 (consistent with Leff at
     p90–p95 from Sec. S3) is the primary target; L ≳ 16 requires N ≳ 30 even for moderate tolerance, pushing loss
     budgets.

(V6) Flank-centered initialization (A6). The Taylor-based scaling (Theorem 1) relies on the cancellation
     ϕ′′ (−1) = 0 at the half-maximum point. If the operating point deviates (e.g., due to fabrication offset pushing
     a + bI0 away from −1), a nonzero quadratic residual appears and the effective scaling worsens to $E_\infty \sim L^{2}/N$
     rather than $L^{3}/N^{2}$. Mitigation: heater/bias trimming to restore the flank condition.


                                        S0.5 Mapping to main-text equations

For reference, the results derived here correspond to the following main-text equations:

    • Slope bound (Lemma 1): rigorous; corresponds to main-text Eqs. (22)–(23). This is a guaranteed necessary
      condition.

    • Engineering N -estimate (Theorem 1): heuristic scaling with empirical prefactor κ ≈ 0.07; corresponds to
      main-text Eq. (28). This is a heuristic design rule calibrated against numerical fits.

    • Conservative bound (Corollary 1): conservative, but not rigorously certified as a minimax upper bound; derived
      as Eq. (S0.7) in this supplement and stated inline in Sec. II. It serves as a heuristic screening condition.

    • Nsafe (Corollary 1): the safe screening depth that follows from the conservative bound; given as Eq. (S0.8)
      in this supplement and stated inline in Sec. II. It serves as a backstop estimate for preliminary design.

Summary of guarantee status:

Result                                                       Status                                        Equation
Slope bound $N|b| \ge 1$                                     Rigorous (proved)                             Main text (23)
Scaling $N \sim \kappa L^{3/2}/\sqrt{\varepsilon_{\log}}$    Heuristic (Taylor truncation + empirical κ)   Main text (28)
Bound $E_\infty \le L^{2}/(4N) + 1/(2b^{2}N)$                Conservative heuristic                        (S0.7)
Nsafe screening depth                                        Conservative backstop                         (S0.8)


            S1. DEPTH-SCALING DERIVATION AND CONSERVATIVE SCREENING BOUND

  This section provides the detailed derivations underlying the depth-scaling relations and conservative screening
bounds summarized in the main text (Sec. II). These results complement the treatment in Sec. S0, which separates the
rigorous slope bound from the heuristic estimates.

                                S1.1 Local expansion and exponential-like behavior

   To provide immediate local intuition (without changing the global minimax objective), let δ = I − I0 around the
flank-centered point I0 = L/2 and impose a + bI0 = −1. With the local normalization $C = 2^{N}$ (so that ỹ(I0) = 1), a
third-order expansion of $\tilde{y}(I) = C[1 + (a + bI)^{2}]^{-N}$ gives

\[
  \tilde{y}(I) \approx 1 + N b\,\delta + \frac{N^{2}}{2} b^{2}\delta^{2} + \frac{N(N^{2}-1)}{6} b^{3}\delta^{3} + O(\delta^{4}), \tag{S1.1}
\]
so with b ∼ 1/N, the linear and quadratic coefficients align with those of $e^{\delta} = 1 + \delta + \delta^{2}/2 + \delta^{3}/6 + \cdots$, explaining
why the initialization is already close before refinement.
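The expansion in Eq. (S1.1) can be checked numerically. The sketch below is only an illustration: the values L = 8 and N = 10 and the helper names `y_tilde` and `cubic_model` are assumptions chosen for the check, with b fixed by slope matching (N b = 1) and a by the flank condition a + bI0 = −1.

```python
import math

def y_tilde(I, a, b, N, C):
    """Normalized cascade response C * [1 + (a + b*I)^2]^(-N)."""
    return C * (1.0 + (a + b * I) ** 2) ** (-N)

def cubic_model(delta, b, N):
    """Right-hand side of Eq. (S1.1), truncated after the cubic term."""
    w = b * delta
    return 1.0 + N * w + 0.5 * N**2 * w**2 + N * (N**2 - 1) / 6.0 * w**3

# Illustrative values (assumptions, not fitted parameters).
L, N = 8.0, 10
b = 1.0 / N                 # slope matching N*b = 1
I0 = L / 2.0                # flank-centered point
a = -1.0 - b * I0           # enforces a + b*I0 = -1
C = 2.0 ** N                # local normalization so that y_tilde(I0) = 1

for delta in (0.05, 0.1, 0.2, 0.4):
    exact = y_tilde(I0 + delta, a, b, N, C)
    approx = cubic_model(delta, b, N)
    print(f"delta={delta:4.2f}  exact={exact:.8f}  cubic={approx:.8f}  "
          f"diff={exact - approx:+.2e}  exp(delta)={math.exp(delta):.8f}")
```

If the cubic truncation is correct, the difference between the exact response and the cubic model should shrink roughly as δ⁴ when δ is reduced.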


                                  S1.2 Log-domain analysis and scaling derivation

  For depth scaling, the logarithmic domain is more transparent. Under the same flank centering (a + bI0 = −1),
expand around I0 = L/2 with δ = I − I0 to obtain

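\[
  \ln \tilde{y}(I) = \mathrm{const} + N b\,\delta - \frac{N b^{3}}{6}\,\delta^{3} + O(\delta^{4}). \tag{S1.2}
\]
At a + bI0 = −1, the quadratic term cancels identically in the log expansion; imposing slope matching (N b = 1) gives
\[
  \ln \tilde{y}(I) = \mathrm{const} + \delta - \frac{\delta^{3}}{6N^{2}} + O(\delta^{4}). \tag{S1.3}
\]
The cubic term is negative, consistent with the coefficient N(N² − 1)/6 in Eq. (S1.1) once the log expansion is
exponentiated. Hence the leading log-domain residual scales as $|r(\delta)| \sim |\delta|^{3}/N^{2}$. Over I ∈ [0, L] with |δ| ≤ L/2, this implies $E_\infty \sim L^{3}/N^{2}$.
Requiring $E_\infty \le \varepsilon_{\log}$ leads to

\[
  N \propto \frac{L^{3/2}}{\sqrt{\varepsilon_{\log}}}, \tag{S1.4}
\]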

which explains the scaling used in the main-text engineering estimate (Eq. (28)). This derivation is heuristic (not a
formal guarantee), and the prefactor remains platform- and fitting-criterion dependent.
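As a rough illustration of how the scaling in Eq. (S1.4) is used, the sketch below evaluates N ≈ κ L^{3/2}/√ε_log with the empirical prefactor κ ≈ 0.07 quoted in Sec. S0.5. The function name `depth_estimate` and the rounding up to an integer are assumptions; the (N, E∞) pairs are the L = 8 entries of Table S7, used only for comparison.

```python
import math

def depth_estimate(L, eps_log, kappa=0.07):
    """Heuristic depth N ~ kappa * L^(3/2) / sqrt(eps_log).

    Rounding up to the next integer is an assumption of this sketch.
    """
    return math.ceil(kappa * L ** 1.5 / math.sqrt(eps_log))

# (N, E_inf) pairs taken from Table S7 for L = 8.
table_s7 = [(10, 0.0265), (20, 0.0067), (30, 0.0030)]

for n_fit, e_inf in table_s7:
    n_est = depth_estimate(8.0, e_inf)
    print(f"E_inf={e_inf:.4f}  fitted N={n_fit:2d}  heuristic N={n_est:2d}")
```

The estimate is a screening aid only: κ is platform- and fitting-criterion dependent, so the heuristic depth should be confirmed against the minimax fit.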


                                S1.3 Conservative upper bound and screening depth

   For fixed b and the identical-detuning family (a1 = · · · = aN ≡ a), one can write a conservative heuristic condition
for achieving a prescribed log-tolerance. A simple normalization is to enforce ỹ(L) = 1 (matching the target f (L) = 1).
For a particular constructive choice of a that keeps (a + bI) large and negative across [0, L], one can bound the
worst-case log-error as

\[
  E_\infty \le \frac{L^{2}}{4N} + \frac{1}{2b^{2}N}. \tag{S1.5}
\]
(This is a conservative rule of thumb; obtaining a formal guarantee would require a separate proof.) As a screening
estimate (not a formal guarantee), one may use
\[
  N \ge \frac{L^{2}/4 + 1/(2b^{2})}{\ln(1+\varepsilon)}. \tag{S1.6}
\]

While this bound is typically pessimistic, it provides a conservative backstop estimate for preliminary design
screening. The fuller derivation of these bounds, including the concavity conditions and slope-matching assumptions,
is given in Sec. S0.3; as stated there, the result is a heuristic rather than a certified guarantee.
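To make the pessimism of the screening rule concrete, the following sketch evaluates Eqs. (S1.5) and (S1.6) for fixed b. The function names are hypothetical, rounding N up to an integer is an assumption, and the inputs (L = 8, b = 0.10202, ε = 2.68%) are borrowed from the N = 10 row of Table S7 purely for illustration.

```python
import math

def conservative_bound(L, b, N):
    """Right-hand side of Eq. (S1.5): L^2/(4N) + 1/(2 b^2 N)."""
    return L**2 / (4.0 * N) + 1.0 / (2.0 * b**2 * N)

def n_safe(L, b, eps):
    """Screening depth from Eq. (S1.6); the ceiling is an assumption of this sketch."""
    return math.ceil((L**2 / 4.0 + 1.0 / (2.0 * b**2)) / math.log1p(eps))

L, b, eps = 8.0, 0.10202, 0.0268    # illustrative inputs (Table S7, N = 10 row)
N = n_safe(L, b, eps)
print(f"N_safe = {N}")
print(f"bound at N_safe = {conservative_bound(L, b, N):.5f}  "
      f"vs ln(1+eps) = {math.log1p(eps):.5f}")
```

For these inputs the screening depth comes out orders of magnitude larger than the fitted N = 10, which is why the rule is described as a backstop rather than a design target.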

              S2. WORKED EXAMPLE AND EMPIRICAL LOGIT-RANGE CALIBRATION

  This section provides the detailed worked example for the input-to-output mapping and the empirical logit-range
calibration tables referenced in the main text (Sec. III).


                                 S2.1 Worked input-to-output mapping example

  As a worked example, consider

                                                x = [−3.2, 1.2, 4.8, −0.9].                                    (S2.1)

Compute m = max xn = 4.8. Then u = x − m = [−8.0, −3.6, 0, −5.7] and L = − min un = 8.0. The mapped
control-signal levels are

                                               I = u + L = [0, 4.4, 8.0, 2.3],                                 (S2.2)

and the required normalized exponentials are $e^{x_n - m} = e^{u_n} = e^{I_n - L}$. Using the fitted model directly,
\[
  T_k(I_n) = \frac{1}{1 + (a_k + b I_n)^{2}}, \qquad y(I_n) = \prod_{k=1}^{N} T_k(I_n).
\]
Under the identical-detuning fit (a1 = · · · = aN ≡ a), this becomes
\[
  \tilde{y}(I_n) = C\, y(I_n) = C \left[ \frac{1}{1 + (a + b I_n)^{2}} \right]^{N}.
\]
For the re-fitted parameters used in this example,
\[
  a = -1.4588, \quad b = 0.10202, \quad N = 10, \quad C = 3.0896 \times 10^{1}, \tag{S2.3}
\]
which gives
\[
  \tilde{y}(I_n) = C \left[ \frac{1}{1 + (a + b I_n)^{2}} \right]^{N}
  \approx [\,3.44 \times 10^{-4},\; 2.73 \times 10^{-2},\; 9.74 \times 10^{-1},\; 3.26 \times 10^{-3}\,]. \tag{S2.4}
\]

  For reference, the corresponding target terms are

                                           In − L = [−8.0, −3.6, 0, −5.7],                                     (S2.5)

and
\[
  e^{I_n - L} \approx [\,3.35 \times 10^{-4},\; 2.73 \times 10^{-2},\; 1.00,\; 3.35 \times 10^{-3}\,]. \tag{S2.6}
\]
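The worked example can be reproduced in a few lines. This is a minimal sketch using the parameters quoted in Eq. (S2.3); the function name `cascade_response` is hypothetical, and small differences from the tabulated values can arise because the quoted parameters are rounded.

```python
import math

def cascade_response(I, a, b, N, C):
    """Identical-detuning cascade: C * [1 / (1 + (a + b*I)^2)]^N."""
    return C * (1.0 / (1.0 + (a + b * I) ** 2)) ** N

x = [-3.2, 1.2, 4.8, -0.9]
m = max(x)                          # running maximum
u = [xn - m for xn in x]            # shifted logits, all <= 0
L = -min(u)                         # dynamic range of this row
I = [un + L for un in u]            # control-signal levels in [0, L]

a, b, N, C = -1.4588, 0.10202, 10, 3.0896e1   # parameters from Eq. (S2.3)

approx = [cascade_response(In, a, b, N, C) for In in I]
target = [math.exp(In - L) for In in I]
rel_err = [abs(y - t) / t for y, t in zip(approx, target)]

for xn, In, t, y, e in zip(x, I, target, approx, rel_err):
    print(f"x={xn:5.1f}  I={In:4.1f}  target={t:.4e}  approx={y:.4e}  rel.err={100 * e:.3f}%")
```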
                                                                        

                            S2.2 Effective-range percentiles and clipping calibration

   We first estimate the logit range observed in data and then choose clipping accordingly. From two autoregressive
Transformers (distilgpt2 and gpt2) and two public corpora (Tiny Shakespeare and Pride and Prejudice) [1–5] at context
length 128, the effective range

                               Leff,α = max(log pkept ) − min(log pkept ),              α = 0.999,             (S2.7)

fell in a relatively narrow band, summarized in Table S2.

 TABLE S1: Example (N = 10): approximating $e^{x_n - m} = e^{I_n - L}$ using $\tilde{y}(I) = C[1 + (a + bI)^{2}]^{-N}$ with parameters
                        re-fitted on I ∈ [0, 8.0] using the same minimax pipeline.

  x_n      I_n      target e^{x_n − m}      approx ỹ(I_n)        rel. err.
 −3.2      0.0      3.3546 × 10^{-4}        3.4443 × 10^{-4}     2.673%
  1.2      4.4      2.7324 × 10^{-2}        2.7325 × 10^{-2}     0.004%
  4.8      8.0      1.0000                  0.9739               2.608%
 −0.9      2.3      3.3460 × 10^{-3}        3.2585 × 10^{-3}     2.614%


                       TABLE S2: Effective-range percentiles (Leff,0.999 ) at context length 128.

                                           Percentile All runs (4 runs) GPT-2
                                           p50            6.92–7.23    7.09–7.23
                                           p90            8.60–8.75    8.73–8.75
                                           p95            8.97–9.12    9.06–9.12
                                           p99            9.50–9.69    9.58–9.69


  We then test clipping on the same rows with

\[
  E_{\mathrm{cum}}(t) = \tfrac{1}{2} \left\| \mathrm{softmax}(u^{(t)}) - \mathrm{softmax}(u) \right\|_{1}, \qquad
  u^{(t)} = \max(u, t), \quad u = s - \max(s), \tag{S2.8}
\]

and require p99{Ecum } ≤ 10−3 (0.1% budget). This criterion is satisfied at t = −12 (p99 ≈ 4.27 × 10−4 ) and violated
at t = −11 (p99 ≈ 1.24 × 10−3 ), so we set t∗ = −12 (Nclip = 12).
  In practice, we (i) estimate an effective L from data, (ii) verify that fixed clipping keeps softmax error small, and (iii)
choose representative design points (e.g., L ≈ 8 or L ≈ 12) while treating the clipped tail as negligible. Full protocol
details, clipping-sweep tables/plots, and per-run statistics are provided in Sec. S3.
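The clipping metric of Eq. (S2.8) is straightforward to evaluate for a single attention row. The sketch below assumes a made-up logit row `s_row` and hypothetical helper names; it is not the extraction pipeline of Sec. S3, only an illustration of the definition.

```python
import math

def softmax(v):
    """Numerically stable softmax of a list of floats."""
    m = max(v)
    e = [math.exp(x - m) for x in v]
    z = sum(e)
    return [x / z for x in e]

def cumulative_clip_error(s, t):
    """E_cum(t) = 0.5 * || softmax(max(u, t)) - softmax(u) ||_1 with u = s - max(s)."""
    m = max(s)
    u = [x - m for x in s]
    u_clipped = [max(x, t) for x in u]      # element-wise lower clip at threshold t <= 0
    p, p_t = softmax(u), softmax(u_clipped)
    return 0.5 * sum(abs(pt - pi) for pt, pi in zip(p_t, p))

# Hypothetical logit row (assumption, for illustration only).
s_row = [2.0, 0.5, -1.0, -9.0, -14.0, -20.0]

for t in (-30, -14, -12, -10, -8, -6, 0):
    print(f"t={t:4d}  E_cum={cumulative_clip_error(s_row, t):.3e}")
```

A threshold below every shifted logit leaves the row unchanged, so E_cum is zero there; clipping at t = 0 collapses the row to a uniform distribution and gives the largest error.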


                                        S2.3 Illustrative synthetic range map
  As a design-space reference, we consider synthetic logit-range regimes using L = max(x) − min(x) after $QK^{\top}/\sqrt{d_k}$
scaling. These regimes are illustrative rather than corpus-level percentiles; using the same fitting pipeline, Table S3
summarizes achievable approximation error versus depth.

   TABLE S3: Synthetic softmax logit-range regimes (L = max(x) − min(x)) and fitted worst-case relative error
                      (design-space illustration; not intended as corpus-level statistics).

L regime                       N =5                       N = 10                        N = 20                      N = 30
   L=8                         10.9%                       2.68%                        0.67%                        0.30%
  L = 12                       40.0%                       9.25%                        2.27%                        1.01%
  L = 16                       113%                        23.0%                        5.44%                        2.41%


  Table S3 suggests a simple rule of thumb: the required depth depends mainly on the target L regime. Near L ≈ 8,
moderate depth reaches a few-percent error, whereas L ≳ 12 typically requires deeper cascades to approach < 1%
error.
  We include Table S3 as a synthetic design map rather than an empirical benchmark.

         S3. EMPIRICAL LOGIT-RANGE EXTRACTION FROM REAL TRANSFORMER RUNS

  We extracted empirical attention-logit ranges from real model runs to complement the synthetic L-regime map in
the main text. We used two open-source autoregressive Transformers (distilgpt2 and gpt2) and two public corpora
(Tiny Shakespeare and Pride and Prejudice), with context length 128 and causal masking. For each valid attention
row, if p = softmax(s) then the raw range is
                                 Lraw = max(s) − min(s) = max(log p) − min(log p),                                (37)
where max/min are taken over valid causal keys only. Because very small tail probabilities can dominate min(log p),
we additionally report an effective range:
                                         Leff,α = max(log pkept ) − min(log pkept ),                              (38)
where keys are sorted by attention weight and retained until cumulative mass reaches α = 0.999.
  To stay within a 16 GB RAM budget, we processed one model at a time with batch size 1, fixed windowing (stride 128),
and streaming histogram quantiles. Observed process RSS stayed below 1.24 GB in these runs.
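The two range definitions in Eqs. (37) and (38) can be sketched for a single row as follows. The function name `logit_ranges` and the example row are hypothetical, and one detail is an assumption: the key whose weight pushes the cumulative mass past α is counted as kept.

```python
import math

def logit_ranges(s, alpha=0.999):
    """Return (L_raw, L_eff_alpha) for one row of attention logits s."""
    m = max(s)
    log_z = math.log(sum(math.exp(x - m) for x in s))
    logp = [x - m - log_z for x in s]            # log-softmax of the row
    l_raw = max(logp) - min(logp)

    kept, mass = [], 0.0
    for lp in sorted(logp, reverse=True):        # largest attention weight first
        kept.append(lp)
        mass += math.exp(lp)
        if mass >= alpha:                        # assumption: the crossing key is retained
            break
    l_eff = max(kept) - min(kept)
    return l_raw, l_eff

# Hypothetical logit row (assumption, for illustration only).
row = [2.0, 0.5, -1.0, -9.0, -14.0, -20.0]
l_raw, l_eff = logit_ranges(row)
print(f"L_raw = {l_raw:.2f}, L_eff(0.999) = {l_eff:.2f}")
```

In this example the three largest keys already carry more than 99.9% of the mass, so the effective range is set by them while the raw range is dominated by the negligible tail.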

  TABLE S4: Empirical global logit-range percentiles from real model–dataset runs (context length 128): raw vs
                                             effective (α = 0.999).

                     Model     Dataset             raw p95 raw p99 Leff p50 Leff p90 Leff p95 Leff p99
                     distilgpt2 tiny shakespeare     22.82      69.00    7.10        8.60     8.97   9.50
                     distilgpt2 pride prejudice      21.76      68.60    6.92        8.60     9.03   9.57
                     gpt2       tiny shakespeare     25.48      43.34    7.23        8.73     9.06   9.58
                     gpt2       pride prejudice      24.13      40.92    7.09        8.75     9.12   9.69

  For quick linkage to the main manuscript: the effective-range summary quoted in the main text corresponds to this
table (all runs: p50 = 6.92–7.23, p90 = 8.60–8.75, p95 = 8.97–9.12, p99 = 9.50–9.69), and the GPT-2 subset is p50
= 7.09–7.23, p90 = 8.73–8.75, p95 = 9.06–9.12, p99 = 9.58–9.69.
Clipping-validity sweep (additional justification). To test whether practical clipping magnitudes can be used
without materially changing softmax outputs, we evaluated a thresholded-logit approximation. For each row, define
u = s − max(s) and, for threshold t ≤ 0,
                                       u(t) = max(u, t),           p(t) = softmax(u(t) ).                         (39)
We report the cumulative softmax error
\[
  E_{\mathrm{cum}}(t) = \tfrac{1}{2} \left\| p^{(t)} - p \right\|_{1}, \tag{40}
\]
then sweep t ∈ {−14, −13, . . . , −6} and compute p50/p90/p95/p99 of Ecum over all extracted rows.

       TABLE S5: Global clipping-validity sweep: percentile statistics of Ecum (t) versus clipping threshold t.

                                   t      p50               p90               p95               p99
                                  −14     2.53 × 10^{-5}    4.55 × 10^{-5}    4.80 × 10^{-5}    5.18 × 10^{-5}
                                  −13     2.69 × 10^{-5}    4.85 × 10^{-5}    7.38 × 10^{-5}    1.48 × 10^{-4}
                                  −12     2.99 × 10^{-5}    1.21 × 10^{-4}    2.13 × 10^{-4}    4.27 × 10^{-4}
                                  −11     3.31 × 10^{-5}    3.95 × 10^{-4}    6.55 × 10^{-4}    1.24 × 10^{-3}
                                  −10     3.72 × 10^{-5}    1.28 × 10^{-3}    2.01 × 10^{-3}    3.58 × 10^{-3}
                                  −9      4.41 × 10^{-5}    4.04 × 10^{-3}    6.11 × 10^{-3}    1.03 × 10^{-2}
                                  −8      2.25 × 10^{-4}    1.26 × 10^{-2}    1.83 × 10^{-2}    2.91 × 10^{-2}
                                  −7      2.76 × 10^{-3}    3.85 × 10^{-2}    5.30 × 10^{-2}    7.89 × 10^{-2}
                                  −6      1.88 × 10^{-2}    1.11 × 10^{-1}    1.43 × 10^{-1}    1.95 × 10^{-1}

   Under a conservative budget criterion p99{Ecum} ≤ $10^{-3}$, the least negative admissible threshold in this sweep
is t∗ = −12 (p99 ≈ $4.27 \times 10^{-4}$). Equivalently, the operational clipping magnitude is Nclip ≡ −t∗ = 12. This is
comparable to the empirical effective-range scale (Table S4: p99 of Leff,0.999 up to ≈ 9.69), so the clipping threshold
and the effective-range statistics point to the same range budget of roughly 10–12. This supports using a practical
clipping magnitude comparable to the design range scale (L ≈ Nclip) while keeping the p99 cumulative softmax error
below 0.1%.
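The threshold selection can be written as a one-line rule over the p99 column of Table S5. The sketch below hard-codes that column; the function name `least_negative_admissible` is hypothetical.

```python
# p99 of E_cum(t) versus clipping threshold t, copied from Table S5.
P99_ECUM = {
    -14: 5.18e-5, -13: 1.48e-4, -12: 4.27e-4, -11: 1.24e-3, -10: 3.58e-3,
    -9: 1.03e-2, -8: 2.91e-2, -7: 7.89e-2, -6: 1.95e-1,
}

def least_negative_admissible(p99_by_t, budget=1e-3):
    """Return the largest (least negative) t whose p99 error stays within the budget."""
    admissible = [t for t, p99 in p99_by_t.items() if p99 <= budget]
    return max(admissible) if admissible else None

t_star = least_negative_admissible(P99_ECUM)
n_clip = -t_star
print(f"t* = {t_star}, N_clip = {n_clip}")
```

Tightening or loosening the budget argument shows how sensitive the chosen clipping magnitude is to the error criterion.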


    FIG. S1: Global CDFs of raw Lraw (dashed) and effective Leff,0.999 (solid) for the four model–dataset runs.


FIG. S2: Percentile curves of cumulative softmax error Ecum (t) versus clipping threshold t. The dashed line marks the
0.1% budget ($10^{-3}$).

                    S4. FDTD METHODOLOGY DETAILS AND X-CUT bV DERIVATION

  This section provides the detailed FDTD simulation methodology, the step-by-step X-cut arc electrode voltage
sensitivity derivation, and the full cascade optimization table referenced in the main text (Sec. IV–V).


                                       S4.1 z-refined 3-fix simulation strategy

   For thin-film LiNbO3 structures, special care is required in the vertical (z) direction due to the high index contrast
between LiNbO3 (no ≈ 2.21) and SiO2 (n ≈ 1.44) and the sub-micron film thickness. We apply a “z-refined 3-fix”
strategy:
   1. Ordinary index correction: the material model uses the corrected ordinary refractive index no appropriate
      for the TE mode in X-cut geometry, rather than the extraordinary index ne that governs TM propagation;
   2. z-span expansion: the simulation z-span is extended beyond the minimal waveguide region to include sufficient
      substrate and superstrate so that evanescent field tails are captured without PML truncation artifacts;
   3. Auto-mesh: accuracy level 3; conformal variant 1 meshing is enabled, and no manual mesh override is applied.
      The resulting vertical grid spacing in the slab region is approximately 55 nm, providing ∼2 cells across the 100 nm
      slab.
This refinement strategy is critical for obtaining converged results in TFLN ring resonators, where the high-Q spectral
features are sensitive to numerical dispersion in under-resolved thin films [6]. Table S6 lists the full simulation
parameters.

                              TABLE S6: 3D FDTD simulation parameters (Lumerical).

Parameter                                                                                  Value
Solver                                                                                     Lumerical 3D FDTD
Mesh type                                                                                  Conformal variant 1
Mesh accuracy                                                                              3 (auto-mesh)
z-mesh override                                                                            None (auto-mesh)
Simulation time                                                                            50 ps
Auto shutoff                                                                               1 × 10−6
Wavelength range                                                                           1530 nm to 1570 nm
Grid size                                                                                  532 × 816 × 44
Source                                                                                     Broadband mode source (TE0 )


                                S4.2 X-cut arc electrode bV step-by-step derivation

   For the X-cut circular ring with lateral S–G arc electrodes (Table II), the crystal Z-axis (c-axis) is oriented at 45◦
from the horizontal axis in the substrate plane. At azimuthal angle θ around the ring, the projection of the lateral
electric field onto the Z-axis is proportional to cos(θ − 45◦ ). The cos(θ − 45◦ ) = 0 boundaries fall at θ = 135◦ and
θ = 315◦ , naturally separating the bus-waveguide coupling regions from the electrode regions. Each ring carries a full
semicircular arc electrode on the side opposite to its coupling points. By the substitution φ = θ − 45◦ , the effective
EO fill factor is
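\[
  f_{\mathrm{EO}} = \frac{1}{2\pi} \int_{\mathrm{semicircle}} \left| \cos(\theta - 45^{\circ}) \right| d\theta
  = \frac{1}{2\pi} \int_{-\pi/2}^{+\pi/2} \cos\varphi \, d\varphi
  = \frac{1}{2\pi} \Big[ \sin\varphi \Big]_{-\pi/2}^{+\pi/2} = \frac{1}{\pi} \approx 0.318. \tag{S4.1}
\]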
The 45◦ rotation ensures that the electrode semicircle does not overlap with the coupling points, while the fill factor
integral is identical to the standard cos θ case by the change of variable.
   The lateral S–G electrodes have gap gel = 5 µm, giving an effective electrode–waveguide distance deff ≈ gel /2 = 2.5 µm.
The lateral field geometry yields an EO overlap factor ΓEO = 0.7, compared to 0.5 for a vertical electrode configuration.
   The refractive index change per volt in the electrode-covered section is
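\[
  \frac{\Delta n_{\mathrm{eff}}}{V} = -\frac{1}{2}\, n_e^{3}\, r_{33}\, \frac{\Gamma_{\mathrm{EO}}}{d_{\mathrm{eff}}}
  = -\frac{1}{2} \times 2.138^{3} \times 30.9 \times 10^{-12} \times \frac{0.7}{2.5 \times 10^{-6}}
  = -4.226 \times 10^{-5}\ \mathrm{V^{-1}}. \tag{S4.2}
\]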

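The corresponding resonance wavelength shift is
\[
  \left. \frac{d\lambda_0}{dV} \right|_{\mathrm{straight}} = \frac{1550\ \mathrm{nm} \times 4.226 \times 10^{-5}\ \mathrm{V^{-1}}}{2.30} = 28.48\ \mathrm{pm\,V^{-1}}, \tag{S4.3}
\]
giving an intrinsic (straight-section) voltage sensitivity of
\[
  b_V^{\mathrm{straight}} = \frac{2 Q_L}{\lambda_0} \left. \frac{d\lambda_0}{dV} \right|_{\mathrm{straight}}
  = \frac{2 \times 15{,}500}{1550} \times 0.02848 = 0.570\ \mathrm{V^{-1}}. \tag{S4.4}
\]
However, only the arc-electrode portion of the ring circumference contributes to the round-trip phase shift. The
effective voltage sensitivity is therefore
\[
  b_V = b_V^{\mathrm{straight}} \times f_{\mathrm{EO}} = 0.570 \times \frac{1}{\pi} \approx 0.182\ \mathrm{V^{-1}}. \tag{S4.5}
\]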
A 1 V applied voltage shifts the normalized detuning by ∆a ≈ 0.182. Despite the fill-factor penalty (fEO = 1/π ≈ 0.318),
the X-cut arc design benefits from a smaller effective electrode distance (2.5 µm vs. 4 µm for vertical configurations)
and a higher overlap factor (0.7 vs. 0.5), which partially compensate for the reduced active length.
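The chain from Eq. (S4.1) to Eq. (S4.5) can be retraced numerically. The sketch below uses the values quoted in the text; the variable names are hypothetical, the 2.30 in the denominator of Eq. (S4.3) is labelled `n_g` here as an assumption, and the fill factor is also integrated numerically as a cross-check of the closed form 1/π.

```python
import math

def eo_fill_factor(n_steps=10_000):
    """Midpoint-rule estimate of (1/2pi) * integral of |cos(theta - 45 deg)| over one semicircle."""
    lo, hi = -math.pi / 4, 3 * math.pi / 4      # between the zeros at -45 deg (= 315 deg) and 135 deg
    h = (hi - lo) / n_steps
    total = sum(abs(math.cos(lo + (k + 0.5) * h - math.pi / 4)) for k in range(n_steps))
    return total * h / (2 * math.pi)

n_e = 2.138            # index used in Eq. (S4.2)
r33 = 30.9e-12         # EO coefficient, m/V
gamma_eo = 0.7         # EO overlap factor
d_eff = 2.5e-6         # effective electrode-waveguide distance, m
lambda0_nm = 1550.0    # wavelength, nm
n_g = 2.30             # assumed to be the group index appearing in Eq. (S4.3)
q_loaded = 15_500      # loaded quality factor

dn_per_volt = -0.5 * n_e**3 * r33 * gamma_eo / d_eff          # Eq. (S4.2), 1/V
dlambda_nm_per_volt = lambda0_nm * abs(dn_per_volt) / n_g     # Eq. (S4.3), nm/V
b_straight = 2 * q_loaded / lambda0_nm * dlambda_nm_per_volt  # Eq. (S4.4), 1/V
b_v = b_straight * eo_fill_factor()                           # Eq. (S4.5), 1/V

print(f"dn/dV      = {dn_per_volt:.4e} 1/V")
print(f"dlambda/dV = {1e3 * dlambda_nm_per_volt:.2f} pm/V")
print(f"b_straight = {b_straight:.3f} 1/V")
print(f"b_V        = {b_v:.3f} 1/V")
```

Carrying full precision through the chain reproduces the quoted b_V to within about half a percent; the residual difference comes from rounding of the intermediate values.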


                                           S4.3 Full cascade optimization table

  Table S7 presents the complete optimization results for the standard dynamic range L = 8 (corresponding to
e8 ≈ 2981, i.e., 34.7 dB), covering all cascade depths from N = 5 to N = 30.

     TABLE S7: Cascade optimization results for L = 8. The bias voltage Vbias = |a|/bV sets the DC offset, and
Vctrl = bL/bV is the maximum control voltage at I = L. Voltages computed with bV = 0.182 V−1 (FDTD-calibrated
                                          best resonance QL = 15,500).

N                a                    b                 E∞              εmax (%)          Vbias (V)            Vctrl (V)
 5            −2.0789              0.21658             0.1035             10.91             11.4                  9.5
 8            −1.5959              0.12896             0.0412              4.20              8.8                  5.7
10            −1.4588              0.10202             0.0265              2.68              8.0                  4.5
12            −1.3731              0.08450             0.0184              1.86              7.5                  3.7
15            −1.2914              0.06726             0.0118              1.19              7.1                  3.0
17            −1.2543              0.05923             0.0092              0.92              6.9                  2.6
20            −1.2136              0.05025             0.0067              0.67              6.7                  2.2
25            −1.1685              0.04013             0.0043              0.43              6.4                  1.8
30            −1.1392              0.03341             0.0030              0.30              6.3                  1.5


  Key thresholds for the minimum number of rings at various error targets are:
     • ε < 10%: N ≥ 6,
     • ε < 5%: N ≥ 8,
     • ε < 2%: N ≥ 12,
     • ε < 1%: N ≥ 17,
     • ε < 0.5%: N ≥ 24.
These thresholds are independent of the quality factor Q, since the minimax approximation operates entirely in
normalized detuning space. The Q factor affects only the physical voltage required to achieve the necessary detuning
range, through bV .
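The voltage columns of Table S7 follow directly from the caption formulas Vbias = |a|/bV and Vctrl = bL/bV. The sketch below recomputes them for a few rows; the conversion εmax = exp(E∞) − 1 is an assumption inferred from the ln(1 + ε) term in the screening formula, and the function names are hypothetical.

```python
import math

B_V = 0.182     # effective voltage sensitivity, 1/V (Eq. S4.5)
L_RANGE = 8.0   # dynamic range of Table S7

# Selected rows of Table S7: (N, a, b, E_inf).
ROWS = [
    (5, -2.0789, 0.21658, 0.1035),
    (10, -1.4588, 0.10202, 0.0265),
    (20, -1.2136, 0.05025, 0.0067),
    (30, -1.1392, 0.03341, 0.0030),
]

def drive_voltages(a, b, L=L_RANGE, b_v=B_V):
    """Return (V_bias, V_ctrl) = (|a| / b_V, b * L / b_V) in volts."""
    return abs(a) / b_v, b * L / b_v

def relative_error_percent(e_inf):
    """Worst-case relative error implied by a log-domain error (assumed relation)."""
    return 100.0 * math.expm1(e_inf)

for n, a, b, e_inf in ROWS:
    v_bias, v_ctrl = drive_voltages(a, b)
    print(f"N={n:2d}  V_bias={v_bias:5.1f} V  V_ctrl={v_ctrl:4.1f} V  "
          f"eps_max={relative_error_percent(e_inf):5.2f}%")
```

Because a and b are fixed by the fit in normalized detuning space, a different Q only rescales both voltages through bV.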


                                              S4.4 Lorentzian fit validation

  Figure S3 shows the Lorentzian fit to the FDTD drop-port resonance near λ = 1566 nm. The analytical Lorentzian
$T_{\mathrm{drop}}(\Delta\lambda) = A/[1 + (2 Q_L \Delta\lambda/\lambda_0)^{2}]$ with QL = 15,500 closely tracks the FDTD data, validating the single-ring transfer
function model used in the cascade analysis.
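For reference, the Lorentzian model and its linewidth can be sketched as follows. The function name `lorentzian_drop` is hypothetical, and the relation FWHM = λ0/Q_L is the standard linewidth of this Lorentzian form rather than a fitted quantity.

```python
def lorentzian_drop(delta_lambda_nm, lambda0_nm, q_loaded, amplitude=1.0):
    """Drop-port Lorentzian A / [1 + (2 * Q_L * dlambda / lambda0)^2]."""
    x = 2.0 * q_loaded * delta_lambda_nm / lambda0_nm
    return amplitude / (1.0 + x * x)

lambda0_nm = 1566.0     # resonance wavelength used for the fit
q_loaded = 15_500       # loaded quality factor

fwhm_nm = lambda0_nm / q_loaded     # full width at half maximum
half = lorentzian_drop(fwhm_nm / 2.0, lambda0_nm, q_loaded)

print(f"FWHM = {1e3 * fwhm_nm:.1f} pm")
print(f"T_drop at half-width offset = {half:.3f} (relative to peak)")
```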


  FIG. S3: Lorentzian fit to the FDTD drop-port resonance. Markers: FDTD data; solid line: Lorentzian fit. The
                           extracted quality factor is QL = 15,500 with FWHM = 101 pm.


                                 S4.5 Eigenmode (FDE) analysis of theoretical Qi

   To quantify how far below the physical limit the FDTD-extracted Qi = 38,800 lies, we perform a two-dimensional
finite-difference eigenmode (FDE) analysis of the bent rib waveguide cross-section using Lumerical MODE Solutions.
   a. Setup. The FDE solver models the cross-section of the rib waveguide at the design bend radius R = 20 µm
and wavelength λ = 1550 nm, with perfectly matched layer (PML) boundaries on all four edges. The geometry is
identical to the 3D FDTD model: 600 nm total LiNbO3 (no = 2.211, lossless dielectric), 100 nm slab, 500 nm rib etch,
waveguide width W = 1.4 µm, on a 2 µm SiO2 substrate (n = 1.444) with air cladding. The mesh is set to 300 × 300
cells over a 6 µm × 3 µm cross-section, yielding effective grid spacings ∆x ≈ 20 nm and ∆y ≈ 10 nm—substantially
finer than the 3D FDTD auto-mesh (55 nm vertical).
   b. Complex effective index. The FDE solver returns a complex effective index neff = nr + i ni for each guided
mode, where the imaginary part ni encodes propagation loss. For the fundamental TE mode at R = 20 µm:
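\[
  n_{\mathrm{eff}} = 1.9653 + i\,(4.73 \times 10^{-8}), \tag{41}
\]
\[
  \alpha_{\mathrm{rad+leak}} = \frac{4\pi n_i}{\lambda} = 0.383\ \mathrm{m^{-1}} \quad (0.017\ \mathrm{dB\,cm^{-1}}). \tag{42}
\]
Since the material is set as lossless, this α captures only bending radiation loss and substrate leakage through the
100 nm slab. The corresponding quality factor is
\[
  Q_{\mathrm{rad+leak}} = \frac{2\pi n_g}{\alpha_{\mathrm{rad+leak}}\,\lambda} = 2.43 \times 10^{7}, \tag{43}
\]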
where ng = 2.354 is the group index from the FDE solver (consistent with the FDTD FSR-derived ng = 2.30; the
small difference arises from the straight-section approximation inherent to 2D FDE).
  c. Decomposition into bending and leakage. A separate FDE run with R = 1 mm (effectively straight) yields
Qleak = 2.93 × 10^7, isolating the substrate leakage contribution. The pure bending radiation quality factor follows from
\[ \frac{1}{Q_{\mathrm{bend}}} = \frac{1}{Q_{\mathrm{rad+leak}}} - \frac{1}{Q_{\mathrm{leak}}}, \qquad Q_{\mathrm{bend}} = 1.43 \times 10^{8}. \tag{44} \]
This confirms that bending radiation loss at R = 20 µm is negligible; substrate leakage through the thin slab is the
dominant geometric loss channel.
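
The arithmetic behind Eqs. (41)–(44) can be checked with a few lines of Python. This is a minimal sketch rather than the authors' script: the function names are illustrative, and the inputs are the values quoted above. Evaluated with ng = 2.354, Eq. (43) gives about 2.49 × 10^7, roughly 2% above the quoted value, while the FDTD-derived ng = 2.30 reproduces 2.43 × 10^7.

```python
import math

WAVELENGTH_M = 1550e-9  # design wavelength from the text

def alpha_from_ni(n_imag, wavelength_m=WAVELENGTH_M):
    """Power attenuation coefficient (1/m) from the imaginary index, Eq. (42)."""
    return 4.0 * math.pi * n_imag / wavelength_m

def per_m_to_db_per_cm(alpha_per_m):
    """Convert a power attenuation coefficient from 1/m to dB/cm."""
    return alpha_per_m * 10.0 / math.log(10.0) / 100.0

def q_from_alpha(alpha_per_m, n_group, wavelength_m=WAVELENGTH_M):
    """Loss-limited quality factor, Eq. (43)."""
    return 2.0 * math.pi * n_group / (alpha_per_m * wavelength_m)

def q_bend(q_rad_leak, q_leak):
    """Isolate the bending contribution via Eq. (44)."""
    return 1.0 / (1.0 / q_rad_leak - 1.0 / q_leak)

alpha = alpha_from_ni(4.73e-8)
print(f"alpha = {alpha:.3f} 1/m = {per_m_to_db_per_cm(alpha):.4f} dB/cm")
print(f"Q (ng = 2.354) = {q_from_alpha(alpha, 2.354):.3e}")
print(f"Q (ng = 2.30)  = {q_from_alpha(alpha, 2.30):.3e}")
print(f"Q_bend = {q_bend(2.43e7, 2.93e7):.3e}")
```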
   d. Material absorption. The FDE mode profile yields a confinement factor Γ = 0.887 (fraction of the optical
intensity within the LiNbO3 core and slab regions). The material-absorption-limited quality factor is
\[ Q_{\mathrm{abs}} = \frac{2\pi n_g}{\Gamma\,\alpha_{\mathrm{mat}}\,\lambda}, \tag{45} \]

where αmat is the bulk material power-attenuation coefficient of LiNbO3 at 1550 nm. Table S8 evaluates Eq. (45) for
representative TFLN absorption values from the literature [6, 7].

TABLE S8: Theoretical intrinsic quality factor Qi of the R = 20 µm TFLN ring, decomposed into radiation (Qbend),
substrate leakage (Qleak), and material absorption (Qabs). Sidewall scattering (fabrication-dependent) is excluded.
The total is 1/Qi = 1/Qrad+leak + 1/Qabs with Qrad+leak = 2.43 × 10^7.

Material condition           αmat (dB/cm)     Qabs           Qi (total)
Bulk LiNbO3 (pristine)       0.002            2.3 × 10^8     2.2 × 10^7
High-quality TFLN            0.01             4.7 × 10^7     1.6 × 10^7
Good TFLN                    0.03             1.6 × 10^7     9.5 × 10^6
Typical TFLN                 0.1              4.7 × 10^6     3.9 × 10^6


   For high-quality TFLN (αmat ≲ 0.01 dB cm^-1), the theoretical Qi exceeds 10^7, more than 400× higher than the
FDTD-extracted value of 38,800. This confirms that the FDTD result is dominated by numerical mesh artifacts
(approximately two cells across the 100 nm slab), not by physical loss mechanisms. Bending radiation loss at R = 20 µm
is negligible (Qbend = 1.43 × 10^8); the dominant geometric loss channel in the ideal structure is substrate leakage
through the thin slab (Qleak = 2.93 × 10^7).
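
As a cross-check of Table S8, the following sketch evaluates Eq. (45) and the combined Qi for the four absorption values. It assumes the quoted ng = 2.354, Γ = 0.887, and Qrad+leak = 2.43 × 10^7; the helper names are hypothetical.

```python
import math

WAVELENGTH_M = 1550e-9
N_GROUP = 2.354       # FDE group index quoted in the text
GAMMA = 0.887         # confinement factor quoted in the text
Q_RAD_LEAK = 2.43e7   # Eq. (43)

def db_per_cm_to_per_m(alpha_db_cm):
    """Convert a power attenuation from dB/cm to 1/m."""
    return alpha_db_cm * 100.0 * math.log(10.0) / 10.0

def q_abs(alpha_db_cm):
    """Material-absorption-limited Q, Eq. (45)."""
    alpha = db_per_cm_to_per_m(alpha_db_cm)
    return 2.0 * math.pi * N_GROUP / (GAMMA * alpha * WAVELENGTH_M)

def q_total(alpha_db_cm):
    """Combine absorption with radiation + leakage: 1/Qi = 1/Qrad+leak + 1/Qabs."""
    return 1.0 / (1.0 / Q_RAD_LEAK + 1.0 / q_abs(alpha_db_cm))

for label, a_mat in [("Bulk LiNbO3", 0.002), ("High-quality TFLN", 0.01),
                     ("Good TFLN", 0.03), ("Typical TFLN", 0.1)]:
    print(f"{label:18s} Qabs = {q_abs(a_mat):.2e}  Qi = {q_total(a_mat):.2e}")
```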

                               S5. FABRICATED HIGH-Q DESIGN PROJECTIONS

   Reproducing Qi > 10^5 in three-dimensional FDTD is computationally impractical: at accuracy level 3 the 100 nm
slab requires ∆z ≲ 20 nm to suppress staircase-induced scattering, inflating wall times beyond 30 days per run. The
numerically extracted Qi = 38,800 therefore represents a simulation floor, not a physical one. A two-dimensional
MODE-solver bend analysis confirms Qbend > 4.5 × 10^7 for R = 20 µm, placing bending radiation loss far below any
realistic intrinsic loss.
   Table S9 surveys recent high-Q TFLN microring demonstrations. These studies show that Qi ≥ 9 × 10^6 has been
demonstrated in X-cut TFLN using multiple fabrication routes, including Ar+ milling, wet etching, and
ICP-RIE/CMP-based processes.

  TABLE S9: Demonstrated intrinsic quality factors in TFLN micro-ring resonators. “EO compatible” indicates
                whether the fabrication process preserves electrode patterning capability.

Ref.            Qi             R (µm)     w (µm)     Etch
Zhang [8]       10^7           80         ∼2         Ar+ mill
Gao [9]         10^8           100        ∼3         CMP∗
Zhuang [10]     9 × 10^6       100        ∼2         Wet etch
Song [11]       2.9 × 10^7     200        4.5        ICP-RIE+CMP
   All processes except ∗ are EO-electrode compatible. ∗ CMP-only (no dry etch); subsequent electrode patterning may degrade Qi.

  To project cascade performance into the fabricated regime, we fix Qext = 25,800 (the FDTD-extracted coupling
quality factor at gap = 100 nm) and compute Dmax = [Qi/(Qi + Qext)]^2 for three representative intrinsic quality
factors (Table S10).

  TABLE S10: Projected cascade transmission for fabricated Qi values at fixed Qext = 25,800. Dmax^N is the ideal
on-resonance cascade transmission in dB. The minimax approximation error εmax depends only on N and L (not on
                                 Qi); at N = 20, L = 8: εmax = 0.67% (Table I).

Projection        Qi              Dmax     N = 10 (dB)     N = 20 (dB)     N = 30 (dB)
FDTD baseline     3.88 × 10^4     0.36     −44.3           −88.5           −132.8
Conservative      5 × 10^5        0.90     −4.4            −8.8            −13.2
Moderate          10^6            0.95     −2.2            −4.5            −6.7
Optimistic        5 × 10^6        0.99     −0.44           −0.88           −1.3


  Even in the conservative scenario (Qi = 5 × 10^5), Dmax = 0.90 and the N = 10 cascade transmission is −4.4 dB,
roughly 40 dB better than the FDTD baseline (−44.3 dB). The moderate projection (Qi = 10^6) matches the “fabricated
high-Q” column in Table V. Because Qbend > 4.5 × 10^7 ≫ Qi for all projections, bending loss is never the bottleneck;
the dominant loss mechanism is sidewall scattering, which is determined entirely by fabrication quality. The literature
values in Table S9 support the view that intrinsic quality factors in the projected range are physically achievable
in TFLN—albeit with wider waveguides (w ≥ 2 µm) and larger ring radii (R ≥ 80 µm) than the present design.
Transferring comparable sidewall quality to our geometry (R = 20 µm, w = 1.4 µm) is an open fabrication challenge;
the projections in Table S10 should be read as design targets contingent on achieving it.
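
The entries of Table S10 follow directly from Dmax = [Qi/(Qi + Qext)]^2. The sketch below recomputes them under that formula alone; the function names are illustrative. Because it uses the unrounded Dmax, some values can differ from the table in the last digit.

```python
import math

Q_EXT = 25_800  # FDTD-extracted coupling Q at gap = 100 nm (from the text)

def d_max(q_i, q_ext=Q_EXT):
    """Single-ring on-resonance transmission."""
    return (q_i / (q_i + q_ext)) ** 2

def cascade_db(q_i, n_rings, q_ext=Q_EXT):
    """Ideal on-resonance transmission of an N-ring cascade, in dB."""
    return 10.0 * n_rings * math.log10(d_max(q_i, q_ext))

for label, q_i in [("FDTD baseline", 3.88e4), ("Conservative", 5e5),
                   ("Moderate", 1e6), ("Optimistic", 5e6)]:
    row = "  ".join(f"N={n}: {cascade_db(q_i, n):7.2f} dB" for n in (10, 20, 30))
    print(f"{label:14s} Dmax = {d_max(q_i):.3f}  {row}")
```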

                                   S6. INSERTION LOSS BUDGET DETAILS

  For a cascade of N rings, the total insertion loss is modeled as

                                           ILtot ≈ N · ILstage + ILcoupling ,                                      (S6.1)

where ILstage is the per-ring insertion loss at off-resonance operation and ILcoupling accounts for fiber-to-chip and
chip-to-fiber coupling losses. Using typical loss numbers from the literature [12–16], we consider two scenarios:

   • Optimistic: ILstage = 0.08 dB, ILcoupling = 1.5 dB. Then ILtot ≈ 1.90 dB (N = 5), 2.30 dB (N = 10), 3.10 dB
     (N = 20), and 3.90 dB (N = 30).
   • Conservative: ILstage = 0.25 dB, ILcoupling = 3.0 dB. Then ILtot ≈ 4.25 dB (N = 5), 5.50 dB (N = 10),
     8.00 dB (N = 20), and 10.5 dB (N = 30).

   In both scenarios, N = 5–10 is manageable for probe-power budgeting, whereas N = 20 and N = 30 require tighter
power budgeting and more amplification margin. Higher ILtot raises the required probe SNR and pushes operation
closer to the detector noise floor, reducing usable dynamic range.
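
Equation (S6.1) is simple enough to tabulate directly. A minimal sketch, with the two scenario parameter pairs taken from the list above and the dictionary layout chosen only for illustration:

```python
def il_total_db(n_rings, il_stage_db, il_coupling_db):
    """Total insertion loss of an N-ring cascade, Eq. (S6.1)."""
    return n_rings * il_stage_db + il_coupling_db

# (IL_stage, IL_coupling) in dB for the two scenarios in the text
SCENARIOS = {"optimistic": (0.08, 1.5), "conservative": (0.25, 3.0)}

for name, (stage, coupling) in SCENARIOS.items():
    row = ", ".join(f"N={n}: {il_total_db(n, stage, coupling):.2f} dB"
                    for n in (5, 10, 20, 30))
    print(f"{name}: {row}")
```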
   e. Four-component loss breakdown. The total insertion loss of the cascade has four components:
   1. On-resonance cascade transmission Dmax^N (dominant; see Table V);
   2. Inter-ring coupling loss (N − 1) × (−10 log10 ηcoupling ), where ηcoupling is the power transfer efficiency at each
      inter-ring bus section. Two-ring FDTD yields ηcoupling ≈ 0.9 for the present diagonal-bus geometry, corresponding
      to ∼0.46 dB per inter-ring stage;
   3. Off-resonance propagation loss N × ILstage , where ILstage = 0.08–0.25 dB per ring [12–14, 16];
   4. Fiber-to-chip coupling loss ILcoupling = 1.5–3.0 dB [15].
Table V presents the ideal on-resonance budget (Dmax^N only). Including all four components for the present diagonal-bus
layout: in the FDTD-characterized regime (Dmax = 0.36, N = 5), the total loss is approximately 22.2 + 1.8 + 0.4 + 1.5 ≈
26 dB; in the fabricated high-Q regime (Dmax = 0.95, N = 30), the total loss is 6.7 + 13.3 + 2.4 + 1.5 ≈ 24 dB. The
inter-ring coupling loss dominates in the high-Q regime, underscoring that layout optimization (e.g., adiabatic tapers or
straight-bus coupling) is as important as achieving Dmax ≥ 0.95 through quality-factor improvement. For an optimized
layout with ηcoupling ≥ 0.98 (≤0.09 dB per stage), the N = 30 total loss would reduce to ∼13 dB.
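
The four-component totals quoted above can be reproduced with the sketch below. It assumes the optimistic per-stage and fiber-coupling values (0.08 dB and 1.5 dB) that the quoted sums use; the function and key names are hypothetical.

```python
import math

def loss_budget_db(d_max, n_rings, eta_coupling, il_stage_db, il_coupling_db):
    """Return the four loss components (positive dB) and their total."""
    parts = {
        "on_resonance": -10.0 * n_rings * math.log10(d_max),
        "inter_ring": (n_rings - 1) * (-10.0 * math.log10(eta_coupling)),
        "off_resonance": n_rings * il_stage_db,
        "fiber_coupling": il_coupling_db,
    }
    parts["total"] = sum(parts.values())
    return parts

cases = {
    "FDTD regime, N = 5": (0.36, 5, 0.90),
    "High-Q regime, N = 30": (0.95, 30, 0.90),
    "High-Q, optimized layout": (0.95, 30, 0.98),
}
for label, (d, n, eta) in cases.items():
    budget = loss_budget_db(d, n, eta, il_stage_db=0.08, il_coupling_db=1.5)
    print(label, {k: round(v, 1) for k, v in budget.items()})
```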

                             S7. ENERGY EFFICIENCY DETAILED DERIVATION

  This section provides the detailed energy-per-operation derivations for both electrical analog exponential circuits
and the photonic MRR cascade, as summarized in the main text (Sec. V).


                                         S7.1 Electrical analog exponential circuits

  Three main families of electrical circuits realize the exponential function in the analog domain:
  f. BJT translinear / Gilbert cell circuits. The collector current of a bipolar junction transistor is IC =
IS exp(VBE /VT ), providing an intrinsic exponential map [17, 18]. A Gilbert cell multiplier—the core building
block of translinear exponential circuits—dissipates 250–325 µW in typical CMOS/BiCMOS implementations [19]. At
a signal bandwidth of B ≈ 100 MHz, the energy per operation is
\[ E_{\mathrm{Gilbert}} = \frac{P}{B} = \frac{300~\mu\mathrm{W}}{100~\mathrm{MHz}} = 3~\mathrm{pJ}. \tag{S7.1} \]
  g. CMOS subthreshold exponential circuits. A MOSFET in weak inversion exhibits ID ∝ exp(VGS /nVT ), enabling
direct exponential computation at ultra-low power [18]. A reconfigurable softmax circuit in 180 nm CMOS implements
a 10-input softmax at VDD = 500 mV with P = 3 µW [20]. Per-channel: Pexp ≈ 0.43 µW. At B ≈ 1 MHz (limited by
subthreshold fT ):
\[ E_{\mathrm{sub\text{-}VT}} = \frac{0.43~\mu\mathrm{W}}{1~\mathrm{MHz}} = 0.43~\mathrm{pJ}. \tag{S7.2} \]
This is the most energy-efficient electrical approach, but at severely limited bandwidth (∼1 MHz).
  h. Digital CMOS (for reference). A digital exponential via Taylor series requires ∼10 multiply-add operations.
Using Horowitz’s energy figures [21] for 45 nm at 0.9 V: 32-bit FP multiply costs 3.7 pJ, FP add costs 0.9 pJ, giving
                                           Edigital ≈ 10 × (3.7 pJ + 0.9 pJ) = 46 pJ.                               (S7.3)
At 8-bit precision (sufficient for inference): ∼2.3 pJ.
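
The three electrical estimates reduce to E = P/B or to a count of arithmetic operations. The short sketch below restates Eqs. (S7.1)–(S7.3) numerically; the helper name is illustrative and the inputs are the figures quoted above.

```python
def energy_per_op_pj(power_w, bandwidth_hz):
    """Energy per operation in pJ for a circuit dissipating power_w at bandwidth_hz."""
    return power_w / bandwidth_hz * 1e12

e_gilbert = energy_per_op_pj(300e-6, 100e6)   # Eq. (S7.1)
e_subvt = energy_per_op_pj(0.43e-6, 1e6)      # Eq. (S7.2)
e_fp32 = 10 * (3.7 + 0.9)                     # Eq. (S7.3): 10 FP32 multiply-adds

print(f"Gilbert cell:      {e_gilbert:.2f} pJ")
print(f"Subthreshold CMOS: {e_subvt:.2f} pJ")
print(f"Digital FP32:      {e_fp32:.1f} pJ")
```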


                          S7.2 Photonic MRR cascade: single-channel energy derivation

   We evaluate the energy at N = 30 cascaded X-cut TFLN micro-ring resonators with R = 20 µm in the fabricated
high-Q regime (Qi = 10^6, QL ≈ 25,200; Supplementary Sec. S5), which achieves εmax = 0.30% with Vctrl = 0.91 V
(fully CMOS-compatible). The energy per exponential operation has three components:
   (i) Electro-optic tuning energy. Each ring is tuned by charging the arc electrode capacitance to Vctrl . For the lateral
S–G arc electrodes covering one semicircle (Larc = πR = 62.8 µm), the electrode capacitance is estimated as
                                                            Cel ≈ 18 fF,                                            (S7.4)
based on coplanar electrode modeling for TFLN lateral S–G geometries with gel = 5 µm (comparable to values reported
by Bahadori et al. [22] for similar geometries). The switching energy per ring at Vctrl = 0.91 V (using the projected
QL = 25,200, which gives bV = 0.295 V^-1):
\[ E_{\mathrm{ring}} = \tfrac{1}{2} C_{\mathrm{el}} V_{\mathrm{ctrl}}^{2} = \tfrac{1}{2} \times 18~\mathrm{fF} \times (0.91~\mathrm{V})^{2} = 7.4~\mathrm{fJ}. \tag{S7.5} \]
For N = 30 rings: EEO = 30 × 7.4 fJ = 0.22 pJ.
  Note the important scaling: EEO ∝ 1/N since b ∝ 1/N from minimax optimization, because
\[ E_{\mathrm{EO}} \propto N \times V_{\mathrm{ctrl}}^{2} \propto N \times (b/b_V)^{2} \propto 1/N. \tag{S7.6} \]
The bias voltage (3.9 V) is static and does not contribute per-operation energy.
   (ii) Laser source energy (amortized). Because every cascade channel uses the same fixed probe wavelength, a single
CW laser can be shared among M parallel softmax channels via a 1 × M optical power splitter. With wall-plug
efficiency ηWPE ≈ 15% [23], the per-channel optical power is Popt = Pin /M ≈ 100 µW (for Pin = 1 mW, M = 10),
requiring Plaser ≈ 667 µW per channel. At fmod = 10 GHz: Elaser = 667 µW / 10 GHz = 67 fJ.
   (iii) Photodetector energy. Integrated SiGe photodetectors with TIA achieve sub-pJ reception [24]: EPD ≈ 0.5 pJ.
   The total single-channel energy is
\[ E^{(\mathrm{1ch})}_{\mathrm{photonic}} = E_{\mathrm{EO}} + E_{\mathrm{laser}} + E_{\mathrm{PD}} = 0.22 + 0.07 + 0.50 = 0.79~\mathrm{pJ}. \tag{S7.7} \]
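
The single-channel budget of Eqs. (S7.5) and (S7.7) can be recomputed as follows. This is a sketch under the stated design-point values (N = 30, Cel = 18 fF, Vctrl = 0.91 V, 100 µW per channel, 15% wall-plug efficiency, 10 GHz, EPD = 0.5 pJ); the function names are not from the paper.

```python
def eo_energy_j(n_rings, c_el_f, v_ctrl_v):
    """Total electro-optic switching energy: N rings, each (1/2) C V^2."""
    return n_rings * 0.5 * c_el_f * v_ctrl_v ** 2

def laser_energy_j(p_opt_w, wall_plug_eff, f_mod_hz):
    """Electrical laser energy per operation for one channel."""
    return p_opt_w / wall_plug_eff / f_mod_hz

E_PD_J = 0.5e-12  # photodetector + TIA energy per operation, from the text

e_eo = eo_energy_j(30, 18e-15, 0.91)
e_laser = laser_energy_j(100e-6, 0.15, 10e9)
e_total = e_eo + e_laser + E_PD_J

print(f"E_EO    = {e_eo * 1e12:.3f} pJ")
print(f"E_laser = {e_laser * 1e15:.1f} fJ")
print(f"E_total = {e_total * 1e12:.2f} pJ")
```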

                                       S7.3 Q-factor scaling of energy efficiency

  Since Vctrl ∝ 1/Q and EEO ∝ Vctrl^2, the EO energy scales as 1/Q^2. Table S11 shows the total energy for N = 30 at
various quality factors.

TABLE S11: Energy per exponential operation vs. quality factor (N = 30, εmax = 0.30%, X-cut arc electrode with bV
 scaling linearly with Q; Cel = 18 fF). Elaser + EPD = 0.57 pJ is the Q-independent floor. The dagger (†) marks the
FDTD-calibrated quality factor; the double dagger (‡) marks the high-Q design point (Qi = 10^6). Excludes thermal
                                       stabilization (0.15–0.60 pJ for N = 30).

      Q                    Vctrl (V)                  Vbias (V)                   EEO (pJ)                    Etotal (pJ)
   5,000                     4.57                       19.5                        5.64                         6.21
 10,000                      2.28                        9.7                        1.40                         1.97
 12,500                      1.83                        7.8                        0.90                         1.47
15,500†                      1.47                        6.3                        0.58                         1.15
 20,000                      1.14                        4.9                        0.35                         0.92
25,200‡                      0.91                        3.9                        0.22                         0.79
 30,000                      0.76                        3.2                        0.16                         0.73
 50,000                      0.46                        1.9                        0.06                         0.63


   At QL = 15,500 (FDTD-calibrated), the EO contribution (0.58 pJ) is comparable to the optical floor, placing the
design in the efficient operating regime. Beyond Q ≈ 30,000, the EO contribution becomes negligible and the total
energy saturates near the floor; further Q improvement primarily benefits CMOS driver voltage compatibility rather
than energy.
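
The 1/Q^2 trend in Table S11 can be sketched by scaling the design point (Q = 25,200, Vctrl = 0.91 V) inversely with Q. The reference-point scaling is an assumption made here for illustration; it reproduces the tabulated values to within rounding of Vctrl.

```python
Q_REF, V_REF = 25_200, 0.91    # high-Q design point from Sec. S7.2
N_RINGS, C_EL_F = 30, 18e-15   # ring count and electrode capacitance
FLOOR_PJ = 0.57                # E_laser + E_PD, independent of Q

def v_ctrl(q):
    """Control voltage, assuming V_ctrl scales as 1/Q from the design point."""
    return V_REF * Q_REF / q

def e_eo_pj(q):
    """Electro-optic tuning energy in pJ for the full cascade."""
    return N_RINGS * 0.5 * C_EL_F * v_ctrl(q) ** 2 * 1e12

def e_total_pj(q):
    """Total energy per operation in pJ, excluding thermal stabilization."""
    return e_eo_pj(q) + FLOOR_PJ

for q in (5_000, 10_000, 15_500, 25_200, 50_000):
    print(f"Q = {q:6d}  V_ctrl = {v_ctrl(q):.2f} V  "
          f"E_EO = {e_eo_pj(q):.2f} pJ  E_total = {e_total_pj(q):.2f} pJ")
```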
   i. Additional energy contributions. The estimates above exclude two further contributions: (i) DAC energy
for setting the per-ring control voltages, typically 0.1–1 pJ per conversion at 10 GHz bandwidth; and (ii) thermal
stabilization power for maintaining resonance alignment, estimated at ∼50–200 µW per ring for TFLN (lower than
silicon due to the small thermo-optic coefficient of LiNbO3, dn/dT ≈ 3.9 × 10^-6 K^-1). At 10 GHz modulation rate,
the thermal contribution amounts to ∼0.005–0.02 pJ per ring per operation. For the N = 30 cascade, this sums to
0.15–0.60 pJ, which is comparable to EEO and must be included in the total: Etotal ≈ 0.94–1.39 pJ. The total energy
comparison should therefore be treated as an order-of-magnitude estimate.


                                S7.4 Comparison with electronic implementations

   Here we provide an order-of-magnitude energy comparison between electrical analog exponential circuits and our
photonic MRR cascade, grounding the analysis in published device data and first-principles estimates. We assume
a shared CW laser with total optical output Pin,tot = 1 mW, split across M = 10 parallel softmax channels via a
1 × M power splitter, yielding per-channel input Pin,ch = 100 µW. The output power at the cascade drop port is
Pout = Pin,ch × Dmax^N, which ranges from 0.61 µW (FDTD regime, N = 5) to 21.5 µW (fabricated regime, N = 30)
(Table V).
   j. Electrical analog exponential circuits. Three main families of electrical analog exponential circuits are compared:
BJT translinear/Gilbert cell (∼ 3 pJ at 100 MHz [17–19]), CMOS subthreshold (∼ 0.43 pJ at 1 MHz [18, 20]), and
digital FP32 Taylor series (∼ 46 pJ at 1 GHz [21]).
   k. Photonic MRR cascade: single-channel energy. For N = 30 X-cut TFLN micro-ring resonators in the self-
consistent high-Q regime (QL = 25,200), the three energy components are EO tuning (EEO = 0.22 pJ), amortized
laser (Elaser = 0.07 pJ, shared across M = 10 parallel channels), and photodetector (EPD = 0.50 pJ), yielding
Ephotonic = 0.79 pJ. Including thermal stabilization for N = 30 rings (0.15–0.60 pJ), the total rises to 0.94–1.39 pJ.
Notably, EEO ∝ 1/N since b ∝ 1/N from minimax optimization.
   l. Single-channel comparison. Table S12 presents the comparison. The photonic cascade at N = 30 achieves
0.79 pJ baseline—3.8× lower than the BJT Gilbert cell (3 pJ) and 58× lower than digital FP32 (46 pJ). Including
thermal stabilization (0.94–1.39 pJ), the advantage over INT8 (2.3 pJ) is 1.7–2.4×, while operating at 10 GHz
bandwidth. At fabricated Q ≥ 30,000, EEO drops to 0.16 pJ and Etotal ≈ 0.73 pJ (excluding thermal; Table S11),
recovering a 3.2× advantage over INT8. Subthreshold CMOS achieves the lowest energy (0.43 pJ) but at 10,000×
lower bandwidth.

                      TABLE S12: Energy per exponential operation: single-channel comparison.

Implementation               E/op (pJ)       Bandwidth     Notes
Digital FP32 (Taylor)        ∼46             1 GHz         10 FP MACs
BJT Gilbert cell             ∼3              100 MHz       Analog
Digital INT8 (Taylor)        ∼2.3            1 GHz         10 INT MACs
Photonic MRR (N = 30)        0.94–1.39       10 GHz        Analog†
Subthreshold CMOS            ∼0.43           1 MHz         Analog
    † 0.79 pJ excluding thermal; 0.94–1.39 pJ including thermal. Self-consistent with fabricated high-Q regime (QL = 25,200); see
      Supplementary Sec. S7.

   m. Caveats. These values are order-of-magnitude estimates, not device-accurate predictions. The photonic
estimate excludes DAC energy for voltage generation (typically 0.1–1 pJ per conversion at 10 GHz bandwidth, shared
with any analog approach) and thermal tuning power for maintaining resonance alignment (∼50–200 µW per ring for
TFLN, lower than silicon due to the small thermo-optic coefficient of LiNbO3, dn/dT ≈ 3.9 × 10^-6 K^-1). Effective
precision at the photodetector is limited to ∼6–8 bits by shot noise and receiver electronics. The energy advantage
over electrical implementations is strongest in the fabricated high-Q regime (Dmax ≥ 0.95), where N = 30 is practical
and Vctrl remains CMOS-compatible.

                  S8. MONTE CARLO ROBUSTNESS UNDER DEVICE NON-IDEALITIES

   This section describes the robustness model summarized in the main text. For the fitted L = 8, N = 10 design
(a = −1.4588, b = 0.10202), each Monte Carlo chip sample includes: (i) per-ring static detuning spread, (ii) per-
ring sensitivity spread, (iii) global thermal drift and crosstalk-like slope drift, (iv) stage insertion-loss variation, (v)
control-channel noise, and (vi) detector noise with one-point calibration at I = L.
   For ring k, we use
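\[ T_k(I) = \frac{1}{1 + \left(a_k + b_k I + d_{\mathrm{th}} + d_{\mathrm{xt}}\, I/L\right)^{2}}, \tag{46} \]

with

\[ y(I) = \prod_{k=1}^{N} T_k(I) \times 10^{-\mathrm{IL}_{\mathrm{tot}}/10}, \tag{47} \]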

and one-point calibration ỹ(I) = Ccal y(I) such that ỹ(L) = 1 for the same chip instance.

                       TABLE S13: Non-ideality distributions used in the Monte Carlo sweeps.

                                        Parameter               Nominal         Stress
                                        σa                      0.020           0.032
                                        σb,rel                  0.020           0.032
                                        σth                     0.015           0.025
                                        σxt                     0.012           0.020
                                        σI                      0.004           0.007
                                        ILstage (dB, µ ± σ)     0.12 ± 0.03     0.18 ± 0.05
                                        σdet                    3.0 × 10^-6     6.0 × 10^-6
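
To make the chip model of Eqs. (46) and (47) concrete, the sketch below draws one chip instance from the nominal column of Table S13 and applies the one-point calibration at I = L. It is not the authors' `scripts` code. It assumes zero-mean Gaussian spreads, a multiplicative per-ring sensitivity error, and ILtot taken as the sum of per-stage draws; control-channel noise (σI) and detector noise (σdet) are omitted for brevity.

```python
import math
import random

A_FIT, B_FIT, N_RINGS, L_RANGE = -1.4588, 0.10202, 10, 8.0
NOMINAL = dict(sigma_a=0.020, sigma_b_rel=0.020, sigma_th=0.015,
               sigma_xt=0.012, il_mu_db=0.12, il_sigma_db=0.03)

def sample_chip(rng, p=NOMINAL):
    """Draw one chip instance (assumption: all spreads are zero-mean Gaussian)."""
    return {
        "a": [A_FIT + rng.gauss(0.0, p["sigma_a"]) for _ in range(N_RINGS)],
        "b": [B_FIT * (1.0 + rng.gauss(0.0, p["sigma_b_rel"])) for _ in range(N_RINGS)],
        "d_th": rng.gauss(0.0, p["sigma_th"]),
        "d_xt": rng.gauss(0.0, p["sigma_xt"]),
        "il_tot_db": sum(rng.gauss(p["il_mu_db"], p["il_sigma_db"])
                         for _ in range(N_RINGS)),
    }

def y(i_ctrl, chip):
    """Raw cascade output, Eqs. (46)-(47)."""
    prod = 1.0
    for a_k, b_k in zip(chip["a"], chip["b"]):
        u = a_k + b_k * i_ctrl + chip["d_th"] + chip["d_xt"] * i_ctrl / L_RANGE
        prod *= 1.0 / (1.0 + u * u)
    return prod * 10.0 ** (-chip["il_tot_db"] / 10.0)

def calibrated(i_ctrl, chip):
    """One-point calibration so that the output equals 1 at I = L."""
    return y(i_ctrl, chip) / y(L_RANGE, chip)

IDEAL = {"a": [A_FIT] * N_RINGS, "b": [B_FIT] * N_RINGS,
         "d_th": 0.0, "d_xt": 0.0, "il_tot_db": 0.0}
chip = sample_chip(random.Random(0))
for i_ctrl in (0.0, 4.0, 8.0):
    print(f"I = {i_ctrl}: ideal = {calibrated(i_ctrl, IDEAL):.3e}  "
          f"sampled = {calibrated(i_ctrl, chip):.3e}  "
          f"target = {math.exp(i_ctrl - L_RANGE):.3e}")
```

Repeating `sample_chip` over many seeds and comparing the calibrated outputs with exp(I − L) would give error statistics of the kind summarized in the main text, subject to the assumptions listed above.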


                        TABLE S14: Monte Carlo summary (same run reported in main text).

                                     Metric                        Nominal           Stress
                                     Median KL(pref ∥ papprox)     2.17 × 10^-4      7.39 × 10^-4
                                     p95 KL(pref ∥ papprox)        5.92 × 10^-4      2.21 × 10^-3
                                     Median max |∆p|               0.170%            0.193%
                                     p95 max |∆p|                  0.319%            0.419%

Conservative-bound sketch used for the main-text screening equation. For the identical-detuning family
with fixed b, define

\[ \ln \tilde{y}(I) = N\,\phi(a + bI) - N\,\phi(a + bL), \qquad \phi(u) = -\ln(1 + u^{2}), \tag{48} \]

so that ỹ(L) = 1 by construction. Around a constructive choice with a + bI < 0 on [0, L], a second-order remainder
argument for the mismatch between the target slope and the fitted slope yields a term scaling as L^2/(4N), while the
flank-curvature penalty contributes a term scaling as 1/(2b^2 N). Combining the two contributions gives the screening
inequality

\[ E_{\infty} \lesssim \frac{L^{2}}{4N} + \frac{1}{2 b^{2} N}, \tag{49} \]
which leads to the conservative screening equation reported in the main manuscript. We emphasize that this is a
conservative heuristic design rule (not a formal minimax theorem), used only for preliminary depth screening.


FIG. S4: CDF of end-to-end softmax probability error under the same non-ideality samples.

                      S9. DELAY-AWARE FEEDBACK NORMALIZATION VALIDATION

  We model global normalization as a delayed PI-controlled loop:

\[ S(t) = G(t)\,P(t) + n(t), \tag{50} \]
\[ \tau \frac{dP}{dt} = -P(t) + u(t - T_d), \tag{51} \]
\[ u(t) = K_p\, e(t) + K_i \int e(t)\, dt, \qquad e(t) = S_{\mathrm{ref}} - S(t), \tag{52} \]

with actuator saturation 0 ≤ u ≤ Pmax . A piecewise G(t) profile is used to emulate workload changes. For physical
intuition, Table S15 converts normalized delay/settling metrics into absolute-time examples.

TABLE S15: Example absolute-time interpretation of normalized PI-loop metrics using one representative stable case
            ((Kp , Ki , Td /τ ) = (0.55, 0.8, 0.2)) and a ±2% settling-time definition (Tsettle ∼ 12.4τ ).

                                 Assumed τ Delay Td = 0.2τ Settling ∼ 12.4τ Interpretation
                                   100 ns         20 ns              1.24 µs           fast loop
                                    1 µs          200 ns             12.4 µs        moderate loop
                                    5 µs           1 µs               62 µs          slower loop
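
A forward-Euler simulation gives a feel for the loop of Eqs. (50)–(52). The sketch below is not the authors' validation script and is not claimed to reproduce the tabulated settling times. It assumes normalized time (τ = 1), zero noise, Sref = 1, Pmax = 2, a single step in G(t) at t = 30τ, and no anti-windup logic.

```python
from collections import deque

def simulate_pi_loop(kp, ki, td_over_tau, g_profile, s_ref=1.0, p_max=2.0,
                     t_end=60.0, dt=1e-3):
    """Forward-Euler simulation of the delayed PI loop in units of tau (n(t) = 0)."""
    n_delay = max(1, round(td_over_tau / dt))
    pending = deque([0.0] * n_delay)  # control values still in transit
    p, integral, trace = 0.0, 0.0, []
    for step in range(int(round(t_end / dt))):
        t = step * dt
        s = g_profile(t) * p                      # Eq. (50) without noise
        e = s_ref - s
        integral += e * dt
        u = min(max(kp * e + ki * integral, 0.0), p_max)  # saturated Eq. (52)
        pending.append(u)
        u_delayed = pending.popleft()             # u(t - Td)
        p += dt * (-p + u_delayed)                # Eq. (51) with tau = 1
        trace.append((t, s))
    return trace

def workload(t):
    """Piecewise gain profile emulating one workload change (illustrative values)."""
    return 1.0 if t < 30.0 else 1.5

trace = simulate_pi_loop(0.55, 0.8, 0.2, workload)
print(f"S just before the step: {trace[29_999][1]:.4f}")
print(f"S at the end:           {trace[-1][1]:.4f}")
```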

Reference-backed latency context for bottleneck screening. To place the delayed-loop times against mixed-
signal system latencies, Table S16 summarizes representative time scales with explicit path classes (on-chip vs off-chip)
for memory and interconnect paths, alongside conversion latency ranges. These are intentionally order-of-magnitude
ranges (not fixed constants), and can shift with architecture, clocking, and protocol stack choices.

    TABLE S16: Representative subsystem latency ranges used for conservative bottleneck screening in Sec. S9.

                                       Subsystem path                  Tsys          Sources
                                       On-chip memory (L1/L2)     20–200 ns [25]
                                       Off-chip memory (DRAM) 200–700 ns [25, 26]
                                       ADC conversion             10–710 ns [27, 28]
                                       DAC + driver/settling      1–200 ns [29]
                                       On-chip interconnect (NoC) 5–100 ns [30]
                                       Off-chip I/O (PCIe/CXL) 1–10 µs      [25, 31]

Conservative risk-screening heuristic for loop latency. As a screening heuristic, we use the settling time from
one representative stable case ((Kp , Ki , Td /τ ) = (0.55, 0.8, 0.2); Table S18), with settling defined as the first time
entering and remaining within a ±2% band around Sref , as a normalization-loop latency proxy:

                                                        Tnorm ≈ 12.4 τ.                                                   (53)

This value is not a universal bound; different gain settings, delay ratios, or loop architectures will yield different settling
times. It is used only as a reference point for order-of-magnitude risk screening. Define the conservative screening
metric

                                                       Tnorm ≥ β Tsys ,                                                   (54)

with β = 1 (high-risk screening line) and β = 0.5 (early-warning line); this is a heuristic risk indicator, not a formal
dominance proof. The corresponding threshold is
\[ \tau_{\mathrm{crit}}(\beta) = \frac{\beta\, T_{\mathrm{sys}}}{12.4}. \tag{55} \]
Table S17 gives the resulting numeric ranges.

                        TABLE S17: Computed τcrit ranges from Eq. (55) using Table S16.

                         Subsystem                      Tsys range     τcrit (β = 0.5)     τcrit (β = 1)
                         On-chip memory path            20–200 ns      0.81–8.06 ns        1.61–16.13 ns
                         Off-chip memory path           200–700 ns     8.06–28.23 ns       16.13–56.45 ns
                         ADC conversion                 10–710 ns      0.40–28.63 ns       0.81–57.26 ns
                         DAC+driver/settling            1–200 ns       0.04–8.06 ns        0.08–16.13 ns
                         On-chip interconnect (NoC)     5–100 ns       0.20–4.03 ns        0.40–8.06 ns
                         Off-chip I/O fabric            1–10 µs        0.04–0.40 µs        0.08–0.81 µs

For the explicit examples in Table S15, τ = 0.1 µs gives Tnorm ≈ 1.24 µs, τ = 1 µs gives Tnorm ≈ 12.4 µs, and τ = 5 µs
gives Tnorm ≈ 62 µs. These numbers indicate a risk trend (not a hard boundary): for this representative case, the
normalization loop is typically non-dominant when τ is well below the relevant τcrit band, and it may become dominant
as τ approaches or exceeds that band. The transition depends on path class (on-chip vs off-chip) and on architecture-
specific timing closure, including whether the normalization path lies on the end-to-end critical path (Table S16).
Accordingly, this analysis is intended for preliminary risk screening only; concrete implementations
require full timing validation.
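
The τcrit ranges of Table S17 follow from Eq. (55) applied to the Tsys ranges of Table S16. A minimal sketch, with the dictionary layout and function name chosen for illustration:

```python
T_SETTLE_OVER_TAU = 12.4  # representative stable case, Eq. (53)

def tau_crit_ns(t_sys_ns, beta):
    """Critical loop time constant in ns, Eq. (55)."""
    return beta * t_sys_ns / T_SETTLE_OVER_TAU

# (min, max) subsystem latency in ns, from Table S16
T_SYS_NS = {
    "On-chip memory path": (20, 200),
    "Off-chip memory path": (200, 700),
    "ADC conversion": (10, 710),
    "DAC+driver/settling": (1, 200),
    "On-chip interconnect (NoC)": (5, 100),
    "Off-chip I/O fabric": (1_000, 10_000),
}

for name, (lo, hi) in T_SYS_NS.items():
    half = (tau_crit_ns(lo, 0.5), tau_crit_ns(hi, 0.5))
    full = (tau_crit_ns(lo, 1.0), tau_crit_ns(hi, 1.0))
    print(f"{name:28s} beta=0.5: {half[0]:.2f}-{half[1]:.2f} ns   "
          f"beta=1: {full[0]:.2f}-{full[1]:.2f} ns")
```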

TABLE S18: Representative step-response cases for the delayed PI loop (settling defined by a ±2% band around Sref ).

                                Case      (Kp , Ki , Td /τ ) Overshoot    Settling     Stable
                                Stable    (0.55, 0.8, 0.2)     25.6%      ∼ 12.4τ       Yes
                                Marginal (0.95, 1.6, 0.45)     25.6%      ∼ 12.8τ       Yes
                                Unstable (1.2, 2.2, 0.75)      45.1%     not settled    No


                   TABLE S19: Stable-region fraction from gain-map scans at each delay ratio.

                                                  Td /τ Stable fraction
                                                   0.0        88.1%
                                                   0.2        88.0%
                                                   0.5        72.4%
                                                   0.8        47.5%


FIG. S5: Step-response examples of the delayed PI normalization loop.


FIG. S6: Delay-dependent stability maps over scanned (K_p, K_i) ranges.

                                               S10. REPRODUCIBILITY

  Scripts used for this Supplementary validation:
    • scripts/nonideality_montecarlo.py
    • scripts/feedback_loop_validation.py
    • scripts/extract_logit_range_effective.py
    • scripts/analyze_softmax_clipping_validity.py
Public code repository: https://github.com/hyoseokp/MRR-AEF (commit 585e695). Empirical extraction outputs
are stored under:
    • paper/empirical_L_v3/


 [1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia
     Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems 30 (NeurIPS 2017), pages
     5998–6008, 2017.
 [2] Alec Radford et al. Language models are unsupervised multitask learners. Technical report, OpenAI, 2019.
 [3] Hugging Face. distilgpt2 model card, 2025. Accessed 2026-02-21.
 [4] Andrej Karpathy. Tiny shakespeare dataset (char-rnn), 2025. Accessed 2026-02-21.
 [5] Jane Austen. Pride and prejudice. Project Gutenberg eBook No. 1342, 2025. Accessed 2026-02-21.
 [6] Di Zhu et al. Integrated photonics on thin-film lithium niobate. Advances in Optics and Photonics, 13(2):242–352, 2021.
 [7] Yaowen Hu, Di Zhu, Shengyuan Lu, Xinrui Zhu, Yunxiang Song, Dylan Renaud, Daniel Assumpcao, Rebecca Cheng,
     CJ Xin, Matthew Yeh, Hana Warner, Xiangwen Guo, Amirhassan Shams-Ansari, David Barton, Neil Sinclair, and Marko
     Lončar. Integrated electro-optics on thin-film lithium niobate. Nature Reviews Physics, 2025.
 [8] Mian Zhang, Cheng Wang, Rebecca Cheng, Amirhassan Shams-Ansari, and Marko Lončar. Monolithic ultra-high-Q lithium
     niobate microring resonator. Optica, 4(12):1536–1537, 2017.
 [9] Renhong Gao, Ni Yao, Jianglin Guan, Li Deng, Jintian Lin, Min Wang, Lingling Qiao, Wei Fang, and Ya Cheng. Lithium
     niobate microring with ultra-high Q factor above 10^8. Chin. Opt. Lett., 20(1):011902, 2022.
[10] Rongjin Zhuang, Jinze He, Yifan Qi, and Yang Li. High-Q thin-film lithium niobate microrings fabricated with wet etching.
     Adv. Mater., 35(3):2208113, 2023.
[11] Xinrui Zhu, Yaowen Hu, Shengyuan Lu, Hana K. Warner, Xudong Li, Yunxiang Song, Letícia S. Magalhães, Amirhassan
     Shams-Ansari, Neil Sinclair, and Marko Lončar. Twenty-nine million intrinsic Q-factor monolithic microresonators on
     thin-film lithium niobate. Photon. Res., 12(8):A63–A68, 2024.
[12] Sudip Shekhar, Wim Bogaerts, Lukas Chrostowski, John E. Bowers, Michael Hochberg, Richard Soref, and Bhavin J.
     Shastri. Roadmapping the next generation of silicon photonics. Nature Communications, 15:751, 2024.
[13] Xaveer Leijtens et al. Multimode silicon photonics. Nanophotonics, 7:1571–1580, 2018.
[14] Haoqian Li et al. In-memory photonic dot-product engine with electrically programmable weight banks. Nature
     Communications, 14:2389, 2023.
[15] Daan Vermeulen et al. High-efficiency fiber-to-chip grating couplers realized using an advanced CMOS-compatible
     silicon-on-insulator platform. Optics Express, 18(17):18278–18283, 2010.
[16] F. S. Tan, D. J. W. Klunder, H. F. Bulthuis, G. Sengo, H. J. W. M. Hoekstra, and A. Driessen. Direct measurement of
     the on-chip insertion loss of high finesse microring resonators in Si3N4-SiO2 technology. In Proceedings of the IEEE LEOS
     Benelux Chapter, 2001.
[17] B. Gilbert. Translinear circuits: a proposed classification. Electron. Lett., 11(1):14–16, 1975.
[18] C. Mead. Analog VLSI and Neural Systems. Addison-Wesley, 1989.
[19] B. Razavi. Design of Analog CMOS Integrated Circuits. McGraw-Hill, 2nd edition, 2017.
[20] Massimo Vatalaro, Tatiana Moposita, Sebastiano Strangio, Lionel Trojman, Andrei Vladimirescu, Marco Lanuzza, and
     Felice Crupi. A low-voltage, low-power reconfigurable current-mode softmax circuit for analog neural networks. Electronics,
     10(9):1004, 2021.
[21] M. Horowitz. 1.1 computing’s energy problem (and what we can do about it). In 2014 IEEE International Solid-State
     Circuits Conference (ISSCC), pages 10–14, 2014.
[22] Meisam Bahadori, Yansong Yang, Ahmed E. Hassanien, Lynford L. Goddard, and Songbin Gong. Ultra-efficient and fully
     isotropic monolithic microring modulators in a thin-film lithium niobate photonics platform. Optics Express,
     28(20):29644–29661, 2020.
[23] A. Biberman and K. Bergman. Optical interconnection networks for high-performance computing systems. Rep. Prog.
     Phys., 75(4):046402, 2012.

[24] D. A. B. Miller. Attojoule optoelectronics for low-energy information processing and communications. J. Lightwave Technol.,
     35(3):346–396, 2017.
[25] Zhe Jia, Marco Maggioni, Benjamin Staiger, and Daniele P. Scarpazza. Dissecting the NVIDIA Volta GPU architecture via
     microbenchmarking. arXiv preprint arXiv:1804.06826, 2018.
[26] Yoongu Kim, Vivek Seshadri, Donghyuk Lee, Jamie Liu, and Onur Mutlu. A case for exploiting subarray-level parallelism
     (SALP) in DRAM. In Proceedings of the 39th Annual International Symposium on Computer Architecture (ISCA), pages
     368–379, 2012.
[27] Texas Instruments. ADC12DJ3200: 6.4-GSPS single-channel or 3.2-GSPS dual-channel, 12-bit, RF-sampling analog-to-digital
     converter. Datasheet (SLVSD97A, revised April 2020), 2020. Accessed 2026-02-22.
[28] Texas Instruments. ADS8881: 18-bit, 1-MSPS, low-power, true-differential SAR ADC. Datasheet (SBAS547D, revised
     August 2015), 2015. Accessed 2026-02-22.
[29] Texas Instruments. DAC38RF82/DAC38RF89: Dual-channel, 14-bit, 9-GSPS and 6-GSPS RF DACs. Datasheet
     (SLASEA6D, revised June 2020), 2020. Accessed 2026-02-22.
[30] W. J. Dally and B. Towles. Route packets, not wires: on-chip interconnection networks. In Proceedings of the 38th Design
     Automation Conference (DAC), pages 684–689, 2001.
[31] Shintaro Sano, Yosuke Bando, Kazuhiro Hiwada, Hirotsugu Kajihara, Tomoya Suzuki, Yu Nakanishi, Daisuke Taki, and
     Akiyuki Kaneko. GPU graph processing on CXL-based microsecond-latency external memory. In Proceedings of the SC ’23
     Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2023.