Photonic Exponential Approximation via Cascaded TFLN Microring Resonators toward Softmax

Hyoseok Park¹ and Yeonsang Park¹,∗
¹Department of Physics, Chungnam National University, Daejeon 34134, Republic of Korea
(Dated: March 26, 2026)
arXiv:2603.12934v3 [physics.optics] 25 Mar 2026

The rapid growth of large-scale AI models has intensified energy consumption and data-movement challenges in modern datacenters. Photonic accelerators offer a promising path by executing the linear matrix multiplications of transformer inference at high throughput and low energy. However, the softmax attention layer, which requires element-wise exponentiation followed by normalization, still relies on electronic post-processing, creating an electro-optic conversion bottleneck that negates much of the potential photonic advantage. We present a cascaded micro-ring resonator (MRR) architecture that synthesizes the per-channel exponential function required by softmax, $e^{x_n - \max(x)}$, over a finite interval with tunable worst-case relative error. A control signal detunes each ring via an electro-optic mechanism; a weak probe at fixed frequency experiences Lorentzian transmission, and cascading $N$ identical stages yields a multiplicative transfer function whose logarithm is approximately linear. We derive mapping rules, depth-scaling estimates, and a minimax fitting formulation, and validate the framework with three-dimensional FDTD simulations of X-cut thin-film lithium niobate (TFLN) add-drop micro-ring resonators. Direct multi-ring FDTD validation extends to a five-ring cascade and confirms agreement with theory primarily over the upper operating range; deeper cascades and higher quality factors are assessed analytically. The cascade implements the per-channel exponential block, the key missing nonlinearity for photonic softmax.
We further present a WDM-parallel chip architecture with closed-loop PI feedback that completes the full softmax (exponentiation, summation, and normalization) on a single photonic chip without per-channel normalization circuitry.

∗ yeonsang.park@cnu.ac.kr; Corresponding author

I. INTRODUCTION

Transformer inference is often limited by power and memory traffic, motivating optical accelerators that exploit parallel propagation and multiplexing [1, 2, 4, 5, 7, 9]. Recent perspective articles also discuss data-center power consumption as one motivation for optical computing [3, 8]. While linear operators are comparatively amenable to photonic implementation [4–6], the softmax function used in attention layers requires an exponential mapping together with global normalization, both difficult to realize in passive photonic circuits, where transmission is fundamentally bounded by unity. Parallel digital-hardware studies treat the exponential/softmax stage as a bottleneck and propose dedicated approximations [11–19]. Many integrated-photonic classifier demonstrations still rely on electronic post-processing for the final nonlinear readout [10]; the resulting electro-optic conversion overhead can negate the throughput and energy benefits of the photonic front-end. Notably, the SOFTONIC architecture [11] explicitly argues that “the inability of MRRs and MZMs to handle SMA’s exponential and division functions” necessitates alternative approaches based on microdisk modulators and polynomial approximation, achieving 89.7% accuracy with a third-degree Chebyshev polynomial. Here we challenge this premise: we show that a passive Lorentzian cascade of microring resonators can be tuned so that its logarithm is approximately linear over a finite interval, enabling exponential-function synthesis with sub-2% worst-case error (an order of magnitude more accurate than SOFTONIC’s polynomial approach) while remaining compatible with integrated microring platforms [20–24]. We term this cascade block an approximate exponential function (AEF) unit. We further propose a WDM-parallel architecture with a single PI feedback loop that realizes the complete softmax function, including summation and normalization, without per-channel electronic processing.

We extend the theoretical framework with three-dimensional FDTD simulations of a single X-cut TFLN add-drop micro-ring resonator. The simulated device parameters (quality factor, free spectral range, and electro-optic sensitivity) calibrate the cascade design parameters, bridging analytical fitting and physically realizable hardware. Two operating regimes emerge from this calibration: an FDTD-characterized regime with moderate drop-port depth ($D_{\max} \approx 0.36$), where the analytic error stays below ∼5% for $N \le 7$ but the power budget limits practical cascades to $N \le 5$; and a projected high-Q regime ($D_{\max} \ge 0.95$), enabling deeper cascades ($N \le 30$) with sub-percent error. Cascade performance is predicted analytically and validated by a five-ring cascade 3D FDTD simulation (Sec. IV).

The paper is organized as follows: Section II presents the mapping, transfer model, and depth-design rules; Section III provides numerical fits and validation; Section IV describes the single-ring TFLN device design and FDTD validation; Section V assesses physical feasibility including voltage requirements, insertion loss, and energy efficiency; Section VI discusses implementation scope, platform comparisons, and limits; and Section VII concludes.

II. MODEL AND DESIGN FRAMEWORK

Target mapping. Let $\mathbf{x} = (x_1, \ldots, x_K) \in \mathbb{R}^K$ be an arbitrary real-valued sequence (or vector). Directly generating $\exp(x_n)$ as a passive optical transmission is impossible in general because $\exp(x)$ grows beyond unity while a passive transmission satisfies $0 < T \le 1$ [25]. However, for softmax,

\[ \mathrm{softmax}(\mathbf{x})_n = \frac{e^{x_n}}{\sum_j e^{x_j}}, \tag{1} \]

a common shift cancels:

\[ \frac{e^{x_n + c}}{\sum_j e^{x_j + c}} = \frac{e^{x_n}}{\sum_j e^{x_j}} \quad (\forall c \in \mathbb{R}). \tag{2} \]

Thus it suffices to generate

\[ e^{x_n - m}, \qquad m \equiv \max_j x_j, \tag{3} \]

since the global factor $e^{m}$ cancels.

To ensure a nonnegative control-signal amplitude, define

\[ u_n \equiv x_n - m \le 0, \qquad L \equiv -\min_n u_n = m - \min_n x_n \ge 0, \tag{4} \]

and map each scalar to a nonnegative control-signal amplitude

\[ I_n \equiv u_n + L \in [0, L]. \tag{5} \]

Then

\[ e^{x_n - m} = e^{u_n} = e^{I_n - L}. \tag{6} \]

Hence the optical design task is to realize, for $I \in [0, L]$,

\[ f(I) = e^{I - L} \in [e^{-L}, 1]. \tag{7} \]

Control–probe transfer. Consider a weak probe at fixed angular frequency $\omega_L$. For the $k$th ring, let $\omega_{0,k}$ denote its resonance frequency and $\Gamma > 0$ its loaded half-width at half maximum (HWHM). Define the detuning

\[ \Delta\omega_k \equiv \omega_L - \omega_{0,k}. \tag{8} \]

Near resonance, the normalized Lorentzian transmission is modeled as [20, 21]

\[ T_k(\Delta\omega_k) = \frac{1}{1 + \left(\dfrac{\Delta\omega_k}{\Gamma}\right)^2}. \tag{9} \]

In a control–probe architecture, a nonnegative control-signal amplitude $I \ge 0$ shifts the ring resonance. Here $I$ denotes a generic control amplitude: for optical-pump operation it maps to optical intensity, while for EO operation it maps to electrical control level (e.g., voltage). Across many physical mechanisms (optical pump via Kerr/XPM, EO drive via Pockels effect, thermal, carrier tuning), the shift can be linearized on a working range [20, 26–30]:

\[ \omega_{0,k}(I) = \omega_{0,k}^{(0)} + \eta I, \tag{10} \]

where $\omega_{0,k}^{(0)}$ is the cold-cavity resonance and $\eta$ is the control-to-resonance sensitivity. In practice, the control channel can be optical or electrical (optical pump, EO/Pockels drive, thermal, or carrier tuning); a quantitative EO feasibility example is given in the Discussion. With $\Delta\omega_{0,k} \equiv \omega_L - \omega_{0,k}^{(0)}$, the control-dependent detuning becomes

\[ \Delta\omega_k(I) = \Delta\omega_{0,k} - \eta I. \tag{11} \]

Define dimensionless parameters

\[ a_k \equiv \frac{\Delta\omega_{0,k}}{\Gamma}, \qquad b \equiv -\frac{\eta}{\Gamma}. \tag{12} \]

Then Eq. (9) yields the control-to-probe transfer of a single ring,

\[ T_k(I) = \frac{1}{1 + (a_k + bI)^2}. \tag{13} \]

Physical meaning: $a_k$ is a static detuning in linewidth units (set by heater/carrier tuning/fabrication), and $|b|$ is the normalized sensitivity magnitude (linewidths of resonance shift per unit control-signal amplitude); the sign convention is absorbed into the detuning expression. For “same-material/same-geometry” rings, $b$ is often common, while $a_k$ can be tuned per ring.

Sign convention. Simultaneously flipping $(a_k, b) \mapsto (-a_k, -b)$ leaves $T_k(I)$ unchanged, so we may take $b > 0$ without loss of generality.

Let $N$ rings be cascaded in a serial add-drop topology: $T_k(I)$ denotes the add-to-drop transmission of ring $k$, and the drop output of ring $k$ feeds the add (input bus) port of ring $k+1$. Assuming the probe is sufficiently weak so the control channel dominates the resonance shift, the normalized probe output is the product

\[ y(I) \equiv \frac{P_{\mathrm{out}}^{(\mathrm{probe})}(I)}{P_{\mathrm{in}}^{(\mathrm{probe})}} = \prod_{k=1}^{N} T_k(I) = \prod_{k=1}^{N} \frac{1}{1 + (a_k + bI)^2}. \tag{14} \]

FIG. 1: Overview of the control–probe add-drop cascade $N$-MRR exponential block. (a) Electronic preprocessing maps an arbitrary input sequence $\{x_n\}$ to a nonnegative control signal via $m = \max_n x_n$, $u_n = x_n - m$, and $I_n = u_n + L$ with $L = m - \min_n x_n$. (b) The control signal $I_n$ induces resonance shifts in a cascade of $N$ rings, while a weak fixed-frequency probe propagates through the serial add-drop cascade (the drop output of each ring feeds the next stage), experiencing multiplicative transmission. (c) After photodetection, the block implements $y(I_n) \approx \exp(I_n - L) \approx \exp(x_n - m)$, i.e., the normalized exponential used in softmax.
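The mapping of Eqs. (4)–(6) and the cascade product of Eq. (14) can be sketched numerically. The following is a minimal illustration rather than the authors' code: the function names `to_control` and `cascade_transmission` are hypothetical, the input vector is an arbitrary example whose range happens to give $L = 8$, and the identical-ring parameters ($N = 10$, $a = -1.4588$, $b = 0.10202$) are borrowed from the paper's $L = 8$ fit purely as example values.

```python
import math

def to_control(x):
    """Map raw inputs x_n to control amplitudes I_n in [0, L] (Eqs. 4-5)."""
    m = max(x)
    u = [xn - m for xn in x]           # u_n = x_n - m <= 0
    L = -min(u)                         # L = m - min_n x_n >= 0
    return [un + L for un in u], L      # I_n = u_n + L

def cascade_transmission(I, a_list, b):
    """Product of single-ring Lorentzian transmissions (Eqs. 13-14)."""
    y = 1.0
    for a_k in a_list:
        y *= 1.0 / (1.0 + (a_k + b * I) ** 2)
    return y

# Example values (assumptions for illustration only).
x = [-3.2, 1.2, 4.8, -0.9]
N, a, b = 10, -1.4588, 0.10202

controls, L = to_control(x)
y = [cascade_transmission(I, [a] * N, b) for I in controls]
target = [math.exp(I - L) for I in controls]      # f(I) = e^{I-L}, Eq. (7)

# A nearly constant ratio y/f means the cascade matches the exponential
# up to a global scale factor, which cancels after normalization.
ratios = [yi / ti for yi, ti in zip(y, target)]
p_cascade = [yi / sum(y) for yi in y]
p_exact = [ti / sum(target) for ti in target]
```

In this example the ratios `y / target` differ from one another by only a few percent even though the raw transmissions span several orders of magnitude, which is why the normalized weights `p_cascade` stay close to the exact softmax `p_exact`.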
To focus on the shape of the approximation, we allow a global scale factor $C > 0$:

\[ \tilde{y}(I) \equiv C\, y(I). \tag{15} \]

In softmax, $p_n = C e^{I_n - L} / \sum_j C e^{I_j - L}$, so $C$ cancels between numerator and denominator and is physically inessential; nevertheless it is convenient for error analysis. For a fixed $(N, b, \{a_k\})$, the optimal $C$ for the minimax log-error in Eq. (18) can be written in closed form. Let $g(I) \equiv \ln y(I) - (I - L)$ on $[0, L]$. Then the minimax-optimal shift is $\ln C^\star = -(\max_I g(I) + \min_I g(I))/2$, yielding $E_\infty = (\max_I g(I) - \min_I g(I))/2$.

Taking logarithms,

\[ \ln \tilde{y}(I) = \ln C - \sum_{k=1}^{N} \ln\!\left[1 + (a_k + bI)^2\right]. \tag{16} \]

The target $\ln f(I) = I - L$ is linear; hence exponential approximation is equivalent to the log-linearization goal

\[ \ln \tilde{y}(I) \approx I - L \quad \text{uniformly on } I \in [0, L]. \tag{17} \]

Error metric. Define the worst-case log-error on $[0, L]$:

\[ E_\infty \equiv \sup_{I \in [0, L]} \left| \ln \tilde{y}(I) - (I - L) \right|. \tag{18} \]

If $E_\infty \le \varepsilon_{\log}$, then for all $I \in [0, L]$,

\[ e^{-\varepsilon_{\log}} \le \frac{\tilde{y}(I)}{f(I)} \le e^{\varepsilon_{\log}} \;\Rightarrow\; \left| \frac{\tilde{y}(I)}{f(I)} - 1 \right| \le e^{\varepsilon_{\log}} - 1. \tag{19} \]

Thus achieving a prescribed worst-case relative error $\varepsilon$ is guaranteed by

\[ E_\infty \le \varepsilon_{\log} \equiv \ln(1 + \varepsilon) \approx \varepsilon. \tag{20} \]

Depth scaling. We derive depth-related constraints and design rules for a prescribed approximation tolerance.

Necessary slope condition. Differentiate Eq. (16):

\[ \frac{d}{dI} \ln y(I) = -\sum_{k=1}^{N} \frac{2b\,(a_k + bI)}{1 + (a_k + bI)^2}. \tag{21} \]

Since $|2u/(1 + u^2)| \le 1$ for all real $u$,

\[ \left| \frac{d}{dI} \ln y(I) \right| \le N|b|. \tag{22} \]

The target $\ln f(I) = I - L$ has constant slope $+1$, so a necessary condition to track it is

\[ N|b| \gtrsim 1. \tag{23} \]

Near-optimal parameterization. The full design problem can be written as a minimax fit in the log domain [31]:

\[ \min_{a_1, \ldots, a_N,\ \ln C}\ \sup_{I \in [0, L]} |r(I)|, \qquad r(I) \equiv \ln C - \sum_{k=1}^{N} \ln\!\left[1 + (a_k + bI)^2\right] - (I - L). \tag{24} \]

This objective is permutation-invariant in the $a_k$’s (ring index $k$). In practice (and in numerical experiments reported below), the optimizer frequently collapses to a permutation-symmetric solution

\[ a_1 = \cdots = a_N \equiv a, \tag{25} \]

reducing the design to two parameters $(a, b)$ (plus $C$). With Eq. (25),

\[ \tilde{y}(I) = C\, y(I) = C \left[ \frac{1}{1 + (a + bI)^2} \right]^{N}. \tag{26} \]

A robust initialization is obtained by placing the midpoint of the interval on the Lorentzian half-maximum flank and matching the slope:

\[ a + b\,\frac{L}{2} \approx -1, \qquad N b \approx 1. \tag{27} \]

These two equations already yield a good design; a small (two-parameter) refinement can then enforce the desired worst-case tolerance.

Local expansion and depth scaling. A Taylor expansion of the log-domain residual around the flank-centered point $I_0 = L/2$ (with $a + bI_0 = -1$ and $Nb = 1$) shows that the quadratic term vanishes identically, leaving a leading cubic residual $r(\delta) \sim \delta^3/(6N^2)$. Over $I \in [0, L]$, this implies $E_\infty \sim L^3/N^2$, so that achieving a prescribed tolerance $\varepsilon_{\log}$ requires $N \propto L^{3/2}/\sqrt{\varepsilon_{\log}}$, which explains the scaling used in Eq. (28). The full derivation is provided in Supplementary Sec. S0; an intuitive local-expansion summary appears in Sec. S1.

Practical engineering estimate. Given $L$ and a target worst-case relative error $\varepsilon$, define $\varepsilon_{\log} = \ln(1 + \varepsilon)$. A heuristic engineering estimate (not a rigorous bound) that matched our percent-level numerical designs is

\[ N \approx \max\!\left( \frac{1}{b_{\max}},\ \kappa\, \frac{L^{3/2}}{\sqrt{\varepsilon_{\log}}} \right), \tag{28} \]

where $b_{\max}$ is the physically achievable sensitivity bound and $\kappa \simeq 0.07$ for the identical-detuning flank design with a minimax refinement. After choosing $N$, set $b = \min(b_{\max}, 1/N)$ and $a = -1 - bL/2$ as initialization, then refine $(a, b)$ by a two-parameter minimax fit on $[0, L]$.

A heuristic conservative screening bound $N \ge \lceil (L^2/4 + 1/(2b^2)) / \ln(1 + \varepsilon) \rceil$ (derived via the same local-expansion argument; see Supplementary Sec. S1) provides a quick upper estimate but is not a rigorous guarantee.

III. NUMERICAL FITS AND VALIDATION

We validate the analytical framework with minimax numerical fits and sampled robustness checks. Figure 2 shows the fitted approximation quality at $L = 8$: the top (linear) panel plots $N = 1, 3, 5, 7$ over $I \in [0, 20]$, the middle (log) panel compares $N = 5, 10, 20, 30$ on $I \in [0, 8]$, and the bottom panel shows the pointwise relative error with the characteristic Chebyshev equioscillation pattern. We fit identical-detuning cascades (Eq. 25) on $I \in [0, L]$ and compare several depths using a minimax criterion.

Table I makes the accuracy–depth trade-off explicit at $L = 8$. A worked input-to-output example demonstrating the mapping from an arbitrary input sequence $\mathbf{x} = [-3.2, 1.2, 4.8, -0.9]$ through the cascade is provided in Supplementary Sec. S2. The example shows that the $N = 10$ cascade keeps the worst-case relative error below 2.7% across all channels.

Empirical calibration. We calibrate the effective logit range $L_{\mathrm{eff}}$ from autoregressive Transformers (distilgpt2/gpt2) [1, 32–35] at context length 128, finding $L_{\mathrm{eff},0.999} \approx 7$–$9$ at the 50th–90th percentiles (Supplementary Sec. S2). A clipping threshold $t^\ast = -12$ keeps the p99 softmax error below 0.1%. Full protocol details, clipping-sweep tables/plots, and per-run statistics are provided in Supplementary Sec. S3.

A synthetic design-space map (Supplementary Table S3) shows that near $L \approx 8$, moderate depth ($N \ge 10$) reaches few-percent error, whereas $L \gtrsim 12$ requires deeper cascades. All fits follow the same pipeline: minimize the worst-case log-error on a uniform grid, initialize from the flank rules in Eq. (27), perform multi-start global search, and apply bounded local refinement; implementation details and scripts are provided in a public repository [36] (commit: 585e695).

TABLE I: Depth comparison for $L = 8$ using fitted $\tilde{y}(I) = C[1 + (a + bI)^2]^{-N}$ (same fitting pipeline for all $N$).

N     a          b          max rel. err.   mean rel. err.
5     −2.0789    0.21658    10.9%           6.43%
10    −1.4588    0.10202    2.68%           1.65%
20    −1.2135    0.05025    0.67%           0.42%
30    −1.1392    0.03341    0.30%           0.19%

TABLE II: Waveguide and ring parameters of the X-cut TFLN micro-ring resonator. Electro-optic electrode parameters are listed separately in Table III.

Parameter                Symbol    Value    Unit
Total TFLN thickness     t_TFLN    600      nm
Etch depth               t_etch    500      nm
Slab thickness           t_slab    100      nm
Waveguide width          w         1.4      µm
Bend radius              R         20       µm
Coupling gap             g         100      nm
Circumference            L_ring    125.7    µm
Free spectral range      FSR       8.29     nm
Effective index (TE0)    n_eff     1.903    —
Group index (TE0)        n_g       2.24     —
Extraordinary index      n_e       2.138    —

IV. TFLN SINGLE-RING DEVICE DESIGN AND FDTD VALIDATION

A. Waveguide and ring geometry

The device is based on an X-cut thin-film lithium niobate (LiNbO3) on insulator wafer with a 600 nm-thick LiNbO3 film on SiO2. A 500 nm-deep rib etch defines a 1.4 µm-wide single-mode waveguide with a 100 nm unetched slab (Fig. 3). Lumerical MODE simulations yield $n_{\mathrm{eff}} = 1.903$ and $n_g = 2.24$ at $\lambda = 1550$ nm for the fundamental TE0 mode.

The ring resonator ($R = 20$ µm, $L_{\mathrm{ring}} = 125.7$ µm) is configured as an add-drop resonator with 100 nm coupling gaps (Fig. 4). The FDTD-measured free spectral range is FSR = 8.29 nm ($n_g \approx 2.30$), slightly above the MODE value due to bend-induced dispersion.

Table II summarizes the waveguide and ring parameters.

FIG. 2: Minimax cascade fits at $L = 8$. (a) Linear scale: shallow cascades ($N = 1, 3, 5, 7$) over $I \in [0, 20]$. The target $e^{I-L}$ (black) is progressively better matched as $N$ increases. (b) Log scale: depth comparison ($N = 5, 10, 20, 30$) on $I \in [0, 8]$. Inset zooms into $I \in [6, 8]$ showing convergence. (c) Pointwise relative error showing the Chebyshev equioscillation pattern characteristic of minimax optimality.

FIG. 3: Cross-section of the X-cut TFLN rib waveguide on a SiO2 substrate. The 600 nm LiNbO3 film is etched 500 nm to form a 1.4 µm-wide single-mode rib waveguide. Lateral signal (S) and ground (G) electrode positions are indicated; electrode design details are discussed in Sec. IV D.

B. 3D FDTD Methodology

The ring resonator response is simulated using Lumerical 3D FDTD with conformal variant 1 meshing.
A broadband TE0 mode source (1530 nm to 1570 nm) is injected into the input bus waveguide, and through- and drop-port spectra are recorded. A “z-refined 3-fix” meshing strategy ensures convergence in the thin-film geometry [37]; detailed simulation setup is provided in Supplementary Sec. S4 (Table S6).

FIG. 4: Top view of the single add-drop micro-ring resonator used in the 3D FDTD simulation. The ring waveguide ($R = 20$ µm, $w = 1.4$ µm) is evanescently coupled to input and drop bus waveguides through 100 nm gaps at coupling points CP1 and CP2.

FIG. 5: Simulated through-port (blue) and drop-port (red) transmission spectra of the single add-drop micro-ring resonator from 3D FDTD. Top: logarithmic scale; bottom: linear scale. Five resonances are visible with FSR ≈ 8.29 nm.

C. Single-Ring Add-Drop Results

Figure 5 shows the through- and drop-port spectra from 3D FDTD. Five resonances are resolved across 1530 nm to 1570 nm with FSR = 8.29 nm ($n_g \approx 2.30$).

Lorentzian fitting of the drop-port peaks yields $Q_L = 10{,}300$–$15{,}500$, with the best resonance at $\lambda = 1566$ nm reaching $Q_L = 15{,}500$ (FWHM = 101 pm, $D_{\max} = 0.360$, −4.4 dB). The through-port extinction ratio is 1.6 dB to 2.6 dB, and the five-resonance mean is $Q_L = 12{,}500 \pm 1{,}800$ ($D_{\max} = 0.29$–$0.36$). CMT analysis of the best resonance gives $Q_i = Q_L/(1 - \sqrt{D_{\max}}) = 15{,}500/0.400 \approx 38{,}800$, confirming that the 500 nm etch provides sufficient confinement and that the 100 nm gap places the ring in the coupling-limited regime. The cascade analysis below adopts the best-case FDTD calibration ($Q_L = 15{,}500$, $D_{\max} = 0.360$); using the five-resonance mean would increase required voltages by ∼24% (see Table IV caption).

The simulation time of 50 ps exceeds the loaded photon lifetime $\tau_L = Q_L \lambda_0/(2\pi c) \approx 12.7$ ps by ∼4×, but the intrinsic lifetime $\tau_i \approx 32$ ps is comparable, so the extracted $Q_i$ may be slightly conservative. An independent eigenmode (FDE) analysis of the same cross-section at $R = 20$ µm, using a 300 × 300 mesh ($\Delta y \approx 10$ nm, 5× finer than the FDTD vertical grid), yields $Q_{\mathrm{rad+leak}} = 2.4 \times 10^{7}$; including bulk LiNbO3 absorption ($\Gamma = 0.89$) gives a theoretical $Q_i > 10^{7}$ [37–42], confirming that the gap between the numerical $Q_i$ and published values ($> 10^{6}$) originates from mesh discretization (Supplementary S4.5, Table S8). In the CMT framework, $D_{\max} = [2\kappa/(2\kappa + \gamma)]^2$ increases as $Q_i$ rises; at the present coupling gap, increasing $Q_i$ to $10^{6}$ would raise $D_{\max}$ from 0.36 to ∼0.95 and $Q_L$ from 15,500 to ∼25,200.

Figure 6(a) shows a Lorentzian fit to the best drop-port resonance at $\lambda = 1566$ nm, validating the cascade model (Eq. 9). Figure 6(b) demonstrates that cascading $N$ copies of this FDTD-extracted Lorentzian reproduces the target exponential $e^{I-L}$ with increasing fidelity as $N$ grows.

To validate the cascade prediction directly, a five-ring cascade 3D FDTD simulation was performed using Tidy3D [43]; the full simulation notebook is publicly available [43]. The $|E|^2$ field at $\lambda = 1549$ nm [Fig. 6(d)] confirms resonant excitation across all five rings. Mapping the drop-port spectrum onto the control variable $I$ yields 11 data points within the AEF operating range [Fig. 6(e, f)], with the FDTD transmission closely tracking the $N = 5$ theoretical curve near $I \approx L = 8$.

FIG. 6: FDTD-based AEF validation. (a) Lorentzian fit to the drop-port resonance at $\lambda = 1566$ nm from 3D FDTD (Lumerical) ($Q_L = 15{,}500$, $D_{\max} = 0.360$, $b_V = 0.180$ V$^{-1}$). (b) Five-ring cascade drop-port spectrum near $\lambda_0 \approx 1550$ nm with Lorentzian$^5$ fit (red curve), confirming the expected $T^5$ line shape. (c) Five-ring cascade MRR layout with diagonal zigzag bus waveguides. (d) $|E|^2$ field profile at $\lambda = 1549$ nm from a five-ring cascade 3D FDTD simulation (Tidy3D [43]). (e, f) AEF validation of the five-ring cascade on log (e) and linear (f) scales with 11 spectral FDTD data points.

D. X-cut electrode design and EO parameters

We employ lateral signal–ground (S–G) arc electrodes on the slab surface alongside the ring waveguide (Fig. 7). In the X-cut orientation, the crystal Z-axis is at 45° from the horizontal in the substrate plane, giving a lateral-field projection proportional to $\cos(\theta - 45^\circ)$ at azimuthal angle $\theta$. The $\cos(\theta - 45^\circ) = 0$ boundaries at $\theta = 135^\circ$ and $315^\circ$ naturally separate the coupling regions from the electrode regions. Each ring carries a full semicircular arc electrode on the side opposite to its coupling points, engaging the large $r_{33} = 30.9$ pm/V Pockels coefficient [37, 38]. The effective EO fill factor follows from integrating $|\cos(\theta - 45^\circ)|$ over the semicircle:

\[ f_{\mathrm{EO}} = \frac{1}{\pi} \approx 0.318 \tag{29} \]

(see Supplementary Sec. S4 for derivation). The electrode gap is $g_{\mathrm{el}} = 5$ µm ($d_{\mathrm{eff}} \approx 2.5$ µm), and the electro-optic overlap integral is $\Gamma_{\mathrm{EO}} = 0.7$. Table III lists the electrode parameters.

TABLE III: Electro-optic electrode parameters for the X-cut TFLN micro-ring with lateral S–G arc electrodes.

Parameter                       Symbol    Value          Unit
Crystal orientation             —         X-cut          —
EO coefficient                  r_33      30.9           pm/V
EO fill factor                  f_EO      1/π ≈ 0.318    —
EO overlap factor               Γ_EO      0.7            —
Electrode gap                   g_el      5              µm
Effective electrode distance    d_eff     2.5            µm

FIG. 7: Illustrative two-ring cascade layout showing the lateral S–G arc electrode placement on X-cut TFLN (the cascade design extends to $N$ rings; this two-ring example clarifies the electrode geometry). The crystal Z-axis is oriented at 45° from the horizontal in the substrate plane. The $\cos(\theta - 45^\circ) = 0$ boundaries at $\theta = 135^\circ$ and $315^\circ$ naturally separate the bus-waveguide coupling regions from the electrode semicircles: each ring carries a full semicircular arc electrode on the side opposite to its coupling points. The resulting effective EO fill factor is $f_{\mathrm{EO}} = 1/\pi \approx 0.318$.

E. FDTD-Calibrated bV and Cascade Optimization

From the device parameters in Tables II and III and the FDTD-calibrated $n_g \approx 2.30$, the effective normalized voltage sensitivity is (Supplementary Sec. S4; here $d\lambda/dV = 28.5$ pm/V is the straight-section value and $f_{\mathrm{EO}}$ accounts for partial electrode coverage of the ring circumference):

\[ b_V = f_{\mathrm{EO}}\, \frac{2Q\,(d\lambda/dV)}{\lambda_0} \approx 0.182\ \mathrm{V}^{-1} \tag{30} \]

at $Q_L = 15{,}500$. This estimate relies on a first-order electrostatic model ($d_{\mathrm{eff}} \approx 2.5$ µm, $\Gamma_{\mathrm{EO}} = 0.7$); a ±30% variation in $b_V$ would shift the cascade depth by one to two rings at constant $\varepsilon_{\max}$ (Table IV), leaving the qualitative design conclusions unchanged.

With the cascade framework of Sec. II (Eqs. 14–18), the $N$-ring drop-port transmission $\tilde{y}(I) = C\,[1 + (a + bI)^2]^{-N}$ approximates $e^{I-L}$ over $I \in [0, L]$, with $(a, b)$ optimized by minimax fitting for each $N$.

Table IV presents the optimization results for the standard dynamic range $L = 8$ ($e^{8} \approx 2981$, 34.7 dB).

TABLE IV: Cascade optimization results for $L = 8$. The bias voltage $V_{\mathrm{bias}} = |a|/b_V$ sets the DC offset, and $V_{\mathrm{ctrl}} = bL/b_V$ is the maximum control voltage at $I = L$. Voltages computed with $b_V = 0.182$ V$^{-1}$ (X-cut arc electrode, FDTD-calibrated best resonance $Q_L = 15{,}500$, $n_g = 2.30$). The mean FDTD quality factor across five resonances is $Q_L = 12{,}500 \pm 1{,}800$; using the mean would increase voltages by ∼24%.

N     a          b          E∞        ε_max (%)   V_bias (V)   V_ctrl (V)
5     −2.0789    0.21658    0.1035    10.91       11.4         9.5
10    −1.4588    0.10202    0.0265    2.68        8.0          4.5
12    −1.3731    0.08450    0.0184    1.86        7.5          3.7
20    −1.2136    0.05025    0.0067    0.67        6.7          2.2
25    −1.1685    0.04013    0.0043    0.43        6.4          1.8
30    −1.141     0.03340    0.0030    0.30        6.3          1.5
32    −1.1301    0.03131    0.0026    0.26        6.2          1.4

a The complete cascade optimization results for all N values are listed in Supplementary Table S7.

The approximation quality across different cascade depths is shown in Fig. 2 (Sec. III). Key thresholds (e.g., $\varepsilon < 2\%$ at $N \ge 12$, $\varepsilon < 1\%$ at $N \ge 17$) and the complete optimization results are listed in Supplementary Sec. S4.

V. PHYSICAL FEASIBILITY

TABLE V: Two-regime power budget for the MRR cascade.
$P_{\mathrm{out}}$ assumes per-channel input $P_{\mathrm{in,ch}} = 100$ µW (from a shared $P_{\mathrm{in,tot}} = 1$ mW CW laser split across $M = 10$ parallel channels via a 1×M splitter, or equivalently multiplexed as $d$ WDM channels sharing a single cascade) and accounts only for the ideal on-resonance cascade transmission $D_{\max}^{N}$ (upper bound); additional inter-ring coupling loss ($\eta_{\mathrm{coupling}} \approx 0.9$ per stage, ∼0.46 dB/stage) and off-resonance propagation loss (0.08–0.25 dB/stage) are analyzed separately in Sec. V C.

Regime         D_max    N     D_max^N         (dB)     P_out      ε_max
I (FDTD)       0.36     3     0.0467          −13.3    4.67 µW    ∼15%
I (FDTD)       0.36     5     0.00605         −22.2    0.61 µW    10.9%
I (FDTD)       0.36     7     7.84 × 10^−4    −31.1    78 nW      ∼5%
II (high-Q)    0.95     10    0.599           −2.2     59.9 µW    2.68%
II (high-Q)    0.95     20    0.358           −4.5     35.8 µW    0.67%
II (high-Q)    0.95     30    0.215           −6.7     21.5 µW    ∼0.30%

Regime I: FDTD-characterized ($Q_i = 38{,}800$). Regime II: fabricated high-Q ($Q_i > 10^{6}$). $P_{\mathrm{out}}$ scales linearly with $P_{\mathrm{in,ch}}$.

Having established the cascade approximation theory (Sec. II) and the FDTD-calibrated device parameters (Sec. IV), we now assess the physical feasibility of the proposed architecture in terms of voltage requirements, insertion loss, and energy efficiency.

A. Electro-optic voltage requirements

For the primary target of $\varepsilon < 2\%$ ($N = 12$), minimax optimization gives $a = -1.373$, $b = 0.0845$. With the FDTD-calibrated $Q_L = 15{,}500$ ($b_V = 0.182$ V$^{-1}$), the required voltages are

\[ V_{\mathrm{bias}} = \frac{|a|}{b_V} = \frac{1.373}{0.182} = 7.5\ \mathrm{V}, \tag{31} \]

\[ V_{\mathrm{ctrl,max}} = \frac{bL}{b_V} = \frac{0.0845 \times 8}{0.182} = 3.7\ \mathrm{V}. \tag{32} \]

Since $b_V \propto Q$, voltage scales inversely with quality factor:

\[ V_{\mathrm{ctrl}} = \frac{bL}{b_V} = \frac{bL\,\lambda_0}{2Q\,|d\lambda_0/dV|}. \tag{33} \]

CMOS-compatible control voltages ($V_{\mathrm{ctrl}} < 3.3$ V) are achievable at $N \ge 14$ with $Q_L = 15{,}500$; at the design point $N = 30$ ($\varepsilon_{\max} = 0.30\%$), $V_{\mathrm{ctrl}} = 1.47$ V.

B. Power budget: two-regime analysis

The on-resonance cascade transmission $D_{\max}^{N}$ is the dominant contribution to total insertion loss. Table V presents two regimes: the FDTD-characterized regime ($D_{\max} = 0.36$) and the fabricated high-Q regime ($D_{\max} = 0.95$, achievable with $Q_i > 10^{6}$ and gap-optimized coupling).

In the FDTD-characterized regime, $D_{\max} = 0.36$ limits practical cascades to $N \le 5$: at $N = 5$ the output is 0.61 µW (−22.2 dB) with $\varepsilon = 10.9\%$, suited for proof-of-concept validation. In the fabricated high-Q regime ($D_{\max} \ge 0.95$), deep cascades become practical: $N = 30$ yields $P_{\mathrm{out}} = 21.5$ µW (−6.7 dB) with $\varepsilon_{\max} \approx 0.30\%$. The transition to fabricated high-Q devices is therefore critical for achieving both high accuracy and sufficient output power.

C. Feasibility outlook

Published TFLN micro-ring resonators achieve $Q_i \ge 10^{6}$–$10^{8}$ using optimized fabrication [39–42]. At $Q_i = 10^{6}$ with the present coupling geometry, CMT predicts $D_{\max} \approx 0.95$ and $Q_L \approx 25{,}200$ (Supplementary Sec. S5, Tables S4–S7), enabling deep cascades ($N \le 30$) with sub-percent error. The literature values provide strong independent evidence that intrinsic quality factors in the projected range are physically achievable in TFLN, albeit with wider waveguides and larger ring radii than the present design. Transferring comparable sidewall quality to our geometry ($R = 20$ µm, $W = 1.4$ µm) is an open fabrication challenge; the projections should be read as design targets contingent on achieving it.

The total insertion loss comprises on-resonance cascade transmission $D_{\max}^{N}$, inter-ring coupling loss (∼0.46 dB/stage for the present diagonal-bus layout), off-resonance propagation loss (0.08–0.25 dB/stage), and fiber-to-chip coupling (1.5–3.0 dB). For the fabricated high-Q regime ($N = 30$), the total ranges from ∼13 dB (optimized layout) to ∼24 dB (current geometry); see Supplementary Sec. S6 for detailed scenarios.

D. Energy comparison

For $N = 30$ X-cut TFLN micro-ring resonators in the fabricated high-Q regime ($Q_L \approx 25{,}200$ at $Q_i = 10^{6}$; Supplementary Sec. S5), the three energy components are EO tuning ($E_{\mathrm{EO}} = 0.22$ pJ), amortized laser ($E_{\mathrm{laser}} = 0.07$ pJ, shared across $M = 10$ channels), and photodetector ($E_{\mathrm{PD}} = 0.50$ pJ), yielding $E_{\mathrm{photonic}} = 0.79$ pJ (derivations in Supplementary Sec. S7). Including thermal stabilization for $N = 30$ rings (0.15–0.60 pJ; Supplementary Sec. S7), the total rises to 0.94–1.39 pJ.

Table S12 compares the photonic cascade with digital implementations. Including thermal stabilization (0.94–1.39 pJ), the advantage over INT8 (2.3 pJ) is 1.7–2.4× while operating at 10 GHz bandwidth, and the energy is 58× lower than digital FP32 (46 pJ). At fabricated $Q \ge 30{,}000$, $E_{\mathrm{EO}}$ drops to 0.16 pJ and $E_{\mathrm{total}} \approx 0.73$ pJ (excluding thermal; Supplementary Table S11), recovering a 3.2× advantage over INT8. Since $E_{\mathrm{EO}} \propto 1/Q^2$, improving $Q$ beyond ∼30,000 yields diminishing energy returns but continues to relax CMOS driver voltage requirements.

TABLE VI: Energy per exponential operation: single-channel comparison.

Implementation            E/op (pJ)    Bandwidth    Notes
Digital FP32 (Taylor)     ∼46          1 GHz        10 FP MACs
Digital INT8 (Taylor)     ∼2.3         1 GHz        10 INT MACs
Photonic MRR (N = 30)     0.94–1.39    10 GHz       Analog†

† 0.79 pJ excluding thermal; 0.94–1.39 pJ including thermal. Self-consistent with fabricated high-Q regime ($Q_L = 25{,}200$); see Supplementary Sec. S7.

VI. DISCUSSION

Practical design procedure. For a given input sequence $\mathbf{x} = (x_1, \ldots, x_K)$, the design proceeds as follows:

1. Compute $m = \max_n x_n$, $u_n = x_n - m$, and $L = -\min_n u_n$.
2. Map to nonnegative control-signal amplitudes: $I_n = u_n + L \in [0, L]$.
3. Choose tolerance $\varepsilon$ and set $\varepsilon_{\log} = \ln(1 + \varepsilon)$.
4. Select a physically feasible $b_{\max}$ and estimate $N$ using Eq. (28).
5. Initialize $b = \min(b_{\max}, 1/N)$ and $a = -1 - bL/2$, then refine $(a, b)$ by a two-parameter minimax fit if required.
6. The optical block yields $\tilde{y}(I_n) \approx e^{x_n - m}$, and softmax weights follow as

\[ p_n = \frac{\tilde{y}(I_n)}{\sum_j \tilde{y}(I_j)}. \tag{34} \]

Scope and limits. The approximation is for a finite interval $I \in [0, L]$, where $L$ is determined by the input batch via Eq. (4). In practice, one designs for a worst-case $L$ expected in operation (or retunes $a$ and rescales the control signal to adapt $L$). Noise, insertion loss, and control-induced parasitics limit accuracy and dynamic range; we treat these effects as platform-specific margins. Detailed non-ideality assumptions, parameter distributions, and robustness statistics are reported in Supplementary Sec. S8. With $K$ channels in parallel, one can form softmax by summing channel powers and applying a shared reciprocal scale factor, depending on the chosen mixed-signal normalization scheme.

WDM parallelism. A particularly hardware-efficient realization exploits wavelength-division multiplexing (WDM): $d$ probe wavelengths $\lambda_1, \ldots, \lambda_d$, each resonant with a distinct FSR order of the same ring set, traverse a single $N$-ring cascade simultaneously (Fig. 8). Because each channel $\lambda_j$ sees its own Lorentzian detuning set by an independent control voltage $V_j$, the cascade output per channel is $\tilde{y}_j = C \prod_{k=1}^{N} T_{\mathrm{drop},k}(\lambda_j, V_j) \approx e^{V_j}$, and all $d$ exponentials are computed in parallel on the same physical waveguide. Compared with a 1×M power-splitter architecture that replicates the cascade for each channel, the WDM approach reduces the total ring count from $N \times d$ to $N$ (a factor-$d$ saving) and eliminates the splitter insertion loss ($10 \log_{10} d$ dB). At the output, a WDM demultiplexer or wavelength-selective photodetector array separates the channels for electrical readout. Figure 8 shows a representative chip layout for $N = 5$ cascade stages and $d = 8$ WDM channels, where alternating U-turn bus connections route the drop-port output of each stage into the input bus of the next.

Why cascade helps. A single Lorentzian in $I$ is too rigid to mimic the log-linear target over a wide interval. Cascading turns the transfer into a product; taking a logarithm gives a sum of smooth terms, and the approximation improves as $N$ increases. The slope constraint $N|b| \gtrsim 1$ is an immediate feasibility check.

Global softmax normalization via WDM feedback. The WDM-parallel architecture (Fig. 8) integrates naturally with a closed-loop normalization scheme to complete the full softmax function. After the $N$-stage cascade, a WDM demultiplexer (e.g., arrayed-waveguide grating or ring-filter bank) routes each channel $\lambda_j$ to a dedicated photodetector, producing photocurrents $I_{\lambda_j} \propto \tilde{y}_j \approx C P_{\mathrm{in}} e^{V_j}$. The $d$ photocurrents are summed electrically:

\[ S = \sum_{j=1}^{d} I_{\lambda_j} \propto C P_{\mathrm{in}} \sum_{j=1}^{d} e^{V_j}. \tag{35} \]

A proportional–integral (PI) controller compares $S$ with a fixed reference $S_{\mathrm{ref}}$ and adjusts the shared WDM laser power $P_{\mathrm{in}}$ so that $S \to S_{\mathrm{ref}}$ [44, 45]. Because all $d$ channels share the same probe source, scaling $P_{\mathrm{in}}$ multiplies every $\tilde{y}_j$ by the same factor; upon convergence

\[ p_j = \frac{I_{\lambda_j}}{S_{\mathrm{ref}}} = \frac{e^{V_j}}{\sum_{k=1}^{d} e^{V_k}} = \mathrm{softmax}(\mathbf{V})_j, \tag{36} \]

realizing the complete softmax with a single feedback loop and no per-channel normalization circuitry. Compared with the replicated-cascade approach (one AEF block per channel), WDM feedback offers two additional benefits: (i) the splitter-induced power imbalance that would bias the $I_{\lambda_j}$ ratios is absent, since all channels traverse the same optical path; and (ii) a single laser control point replaces $d$ independent probe adjustments. Design details and stability analysis of the PI loop are provided in Supplementary Sec. S9.

Beyond ring-resonator AEF implementations, the same cascade principle can be extended to other cavity-based photonic platforms, such as serial 1D photonic-crystal cavities and other cascaded resonant architectures [21, 46]. What these platforms share is transfer-function shaping through cascaded resonances; loss, tuning range, fabrication tolerance, and calibration overhead remain platform-dependent.

The insertion loss budget (Sec. V C) and electro-optic voltage requirements (Sec. V A) suggest that the cascade architecture is feasible under optimized coupling and layout conditions. Using monolithic TFLN microring data from Bahadori et al. [47] ($Q \approx 5432$, $d\lambda_0/dV \approx 9$–$20$ pm/V), the normalized sensitivity is $b_V \simeq 0.063$–$0.14$ V$^{-1}$, within the range required by the cascade design.

Crystal orientation and electrode design. The X-cut TFLN platform was chosen for several reasons. First, X-cut is the prevailing industry standard for integrated TFLN modulators, with well-established fabrication processes and commercial wafer availability [37, 38]. Second, the TE0 mode, which is strongly confined in the rib waveguide geometry, can engage the large $r_{33}$ coefficient via lateral electric fields aligned with the crystal Z-axis. In contrast, Z-cut geometry with TE polarization can only access the smaller $r_{13}$ coefficient (∼10 pm/V), resulting in significantly lower electro-optic efficiency. The arc electrode design (Sec. IV D) addresses the phase-cancellation problem inherent to X-cut circular rings [47] by orienting

TABLE VII: Summary of evidence levels for key claims.

Claim                            Evidence               Sec.
Cascade → exp. approx.           Analytic               II
Depth scaling                    Analytic + num.        II, III
Q_L, D_max, b_V                  3-D FDTD               IV
5-ring line shape                3-D FDTD               IV
N ≤ 30 deep cascade              CMT proj.∗             V
Energy < 1 pJ                    Estimate               V
Full softmax (WDM + feedback)    Conceptual + layout    VI

∗ Based on published $Q_i \ge 10^{6}$ values [39, 42] and CMT coupling model.

tified in the Monte Carlo robustness analysis (Supplementary Sec. S8). Monte Carlo simulations (Supplementary Sec. S8) show that under nominal non-ideality levels ($\sigma_a = 0.020$, $\sigma_{b,\mathrm{rel}} = 0.020$), a single-point calibration of $C$ per chip keeps the median softmax KL divergence below $2.2 \times 10^{-4}$, with 95th-percentile max probability error under 0.32%. Even under stress conditions ($\sigma_a = 0.032$), 95th-percentile errors remain below 0.42%, demonstrating that the identical-detuning design is robust to realistic fabrication variations provided a per-chip calibration step is performed.
Conversely, if coupling gaps are in- the crystal Z-axis at 45◦ from the horizontal in the sub- tentionally varied across rings, the per-ring parameters strate plane. This rotation places the cos(θ − 45◦ ) = 0 (ak , bk ) become independent degrees of freedom. A Taylor- boundaries at θ = 135◦ and 315◦ , naturally separating the expansion analysis shows that K non-identical rings can bus-waveguide coupling regions from the electrode regions. cancel curvature P terms up to order 2K in the Taylor series Each ring carries a full semicircular arc electrode on the of g(I) = k ln Tk , one order higher than K identical side opposite to its coupling points, yielding an effective rings, so that fewer rings suffice for a given error target. fill factor fEO = 1/π ≈ 0.318. While this reduces the round-trip EO efficiency compared to a hypothetical full- circumference design, it preserves the compact footprint of a circular ring resonator. The cascade performance can be further improved beyond the R = 20 µm circular- ring design presented here. Increasing the ring radius reduces bending loss and raises the intrinsic quality factor Qi , which directly increases bV (∝ Q) and lowers the required control voltage. Alternatively, adopting a race- track geometry with extended straight coupling sections strengthens the bus–ring coupling, pushing the drop-port maximum Dmax closer to critical coupling and improving the per-stage transfer efficiency. Either approach—or their combination—would yield higher bV and Dmax , enabling lower N or tighter approximation accuracy at reduced operating voltages. Fabrication considerations. The X-cut TFLN rib waveguide (600 nm total thickness, 500 nm etch, w = 1.4 µm) follows established fabrication processes for com- mercial TFLN wafers on SiO2 [37, 38]. The lateral signal– ground (SG) electrode configuration is fabricated in a single metal layer, which is standard in TFLN foundry processes. 
The primary fabrication challenge for the cascade architecture is maintaining uniform coupling gaps (g = 100 nm) across N rings to ensure identi- cal Lorentzian transfer functions. Post-fabrication trim- ming via UV exposure or localized thermal oxidation can compensate residual detuning variations [30], as quan- 12 Softmax Full Chip Layout – N = 5 × d = 8 (TFLN) d = 8 WDM channels Vλ1 Vλ2 Vλ3 Vλ4 Vλ5 Vλ6 Vλ7 Vλ8 WDM λ1−λ8 n=1 Pin n=2 N = 5 cascade n=3 stages n=4 n=5 WDM Demux (AWG / ring filter) Sref PD1 PD2 PD3 PD4 PD5 PD6 PD7 PD8 Iλ j S e Σ − PI p1 p2 p3 p4 p5 p6 p7 p8 Feedback: adjust Pin Iλj Output: pj = = softmax(V )j Sref FIG. 8: WDM-parallel MRR-AEF system with closed-loop softmax normalization (N = 5 cascade stages, d = 8 WDM channels) on X-cut TFLN. A single WDM source (λ1 –λ8 ) enters the top input bus waveguide; each stage applies a Lorentzian drop-port transfer, and alternating U-turn connections route the drop-port output into the next stage’s input bus. Per-channel EO bias voltages (Vλ1 –Vλ8 ) independently tune each column of rings. The final drop output passes through a WDM demultiplexer (AWG / ring filter) and is detected by a PD array, producing per-channel photocurrents Iλj ∝ eVj . The photocurrents are summed (Σ) and compared with a reference Sref ; a PI controller adjusts the shared laser power Pin until S = Sref , at which point each PD output directly yields pj = Iλj /Sref = softmax(V )j (Eq. 36). 13 VII. CONCLUSION Dmax ≥ 0.95) are realized in the cascade geometry, deeper cascades (N ≈ 20–30) would reach sub-percent approx- We have presented a cascaded micro-ring resonator ar- imation error with an estimated per-operation energy chitecture that approximates the exponential function of 0.79–1.39 pJ, which is 1.7–2.4× lower than an INT8 exn −m on a finite interval [0, L] using multiplicative MAC at the 7 nm node. Monte Carlo analysis shows that Lorentzian transfer functions. 
Increasing the cascade the identical-detuning design tolerates realistic fabrica- depth N systematically reduces the worst-case relative tion variations (σa = 0.020, σb,rel = 0.020) with a single error, and an identical-detuning design initialized by flank per-chip calibration, keeping the 95th-percentile softmax and slope matching provides a practical two-parameter probability error below 0.32%. design. Three-dimensional FDTD simulations of a single X-cut The formulation is not restricted to electro-optic tuning: TFLN add-drop ring (R = 20 µm, g = 100 nm) yield it requires only a controllable detuning coordinate with lo- QL = 10,300–15,500 and Dmax ≈ 0.36, calibrating the cal linearization, so both Pockels and optical (Kerr/XPM) cascade transfer model. A five-ring cascade 3D FDTD mechanisms are compatible [37, 38, 47, 48]. We demon- simulation directly validates the multi-ring framework: strate a photonic exponential block and present a WDM- all five rings exhibit resonant excitation, and mapping parallel chip architecture (Fig. 8) in which d wavelength the drop-port spectrum onto the dimensionless control channels share a single N -ring cascade, reducing the total variable reproduces the theoretical N = 5 curve with ring count by a factor of d and eliminating power-splitter ∼11% integrated relative-area error over the upper op- loss. Combined with a single-loop PI feedback that adjusts erating range (I ≥ 5.8), providing the first multi-ring the shared WDM laser power, the architecture realizes the confirmation of the cascade exponential approximation. complete softmax function—exponentiation, summation, At the present FDTD-characterized quality factor, practi- and normalization—without per-channel normalization cal cascades are limited to N = 5–7 (ε ≲ 11%). If high-Q circuitry. Max-finding and digital interfacing remain open TFLN resonators reported in the literature (Qi ≥ 106 , for future experimental validation. 
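As an illustration of steps 1–5 of the practical design procedure in Sec. VI, the following Python sketch maps an input vector to control amplitudes, estimates the depth with the engineering rule of Eq. (28), and returns the flank-centered initialization. It is a minimal sketch under stated assumptions, not the authors' fitting pipeline: the function name `design_cascade`, the tolerance `eps = 0.027`, and the sensitivity cap `b_max = 0.14` are hypothetical choices made for this example, and the minimax refinement of $(a, b)$ is omitted.

```python
import math

def design_cascade(x, eps, b_max, kappa=0.07):
    """Steps 1-5 of the design procedure (illustrative helper, not from the paper's code).

    kappa = 0.07 is the empirical prefactor quoted for Eq. (28);
    eps must be > 0 so that eps_log is positive.
    """
    m = max(x)                          # step 1: running maximum
    u = [xn - m for xn in x]            # shifted logits, all <= 0
    L = -min(u)                         # interval length
    I = [un + L for un in u]            # step 2: control amplitudes in [0, L]
    eps_log = math.log1p(eps)           # step 3: log-domain tolerance ln(1 + eps)
    # step 4: depth estimate, Eq. (28)
    N = math.ceil(max(1.0 / b_max, kappa * L ** 1.5 / math.sqrt(eps_log)))
    b = min(b_max, 1.0 / N)             # step 5: slope matching, N*b = 1 if feasible
    a = -1.0 - b * L / 2.0              # flank centering, a + b*L/2 = -1
    return {"m": m, "L": L, "I": I, "N": N, "a": a, "b": b}

# Hypothetical inputs: the four-element vector of the supplementary worked example,
# an assumed 2.7% tolerance, and an assumed cap b_max = 0.14.
design = design_cascade([-3.2, 1.2, 4.8, -0.9], eps=0.027, b_max=0.14)
print(design)
```

For these assumed inputs the estimate gives a depth of 10 with initial values $b = 0.1$ and $a = -1.4$, which is the starting point that a two-parameter minimax fit would then refine.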
[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems 30 (NeurIPS 2017), pages 5998–6008, 2017.
[2] Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems 35 (NeurIPS 2022), pages 16344–16359, 2022.
[3] Neil Savage. Light could lower AI's appetite for power. Nature Nanotechnology, 21:6–8, 2026.
[4] Yichen Shen et al. Deep learning with coherent nanophotonic circuits. Nature Photonics, 11(7):441–446, 2017.
[5] Johannes Feldmann et al. Parallel convolutional processing using an integrated photonic tensor core. Nature, 589(7840):52–58, 2021.
[6] Nicholas C. Harris et al. Linear programmable nanophotonic processors. Optica, 5(12):1623–1631, 2018.
[7] Bowei Dong, Samarth Aggarwal, Wen Zhou, Utku Emre Ali, Nikolaos Farmakidis, June Sang Lee, Yuhan He, Xuan Li, Dim-Lee Kwong, C. D. Wright, Wolfram H. P. Pernice, and H. Bhaskaran. Higher-dimensional processing using a photonic tensor core with continuous-time data. Nature Photonics, 17(12):1080–1088, 2023.
[8] Sudip Shekhar, Wim Bogaerts, Lukas Chrostowski, John E. Bowers, Michael Hochberg, Richard Soref, and Bhavin J. Shastri. Roadmapping the next generation of silicon photonics. Nature Communications, 15:751, 2024.
[9] Mario Miscuglio and Volker J. Sorger. Photonic tensor cores for machine learning. Applied Physics Reviews, 7(3):031404, 2020.
[10] Yaowen Hu, Yunxiang Song, Xinrui Zhu, Xiangwen Guo, Shengyuan Lu, Qihang Zhang, Lingyan He, C. A. A. Franken, Keith Powell, Hana Warner, Daniel Assumpcao, Dylan Renaud, Ying Wang, et al. Integrated lithium niobate photonic computing circuit based on efficient and high-speed electro-optic conversion. Nature Communications, 16:8178, 2025.
[11] Priyabrata Dash, Anxiao Jiang, and Dharanidhar Dang. SOFTONIC: A photonic design approach to softmax activation for high-speed fully analog AI acceleration. In Proceedings of the Great Lakes Symposium on VLSI (GLSVLSI '25), pages 118–125, 2025.
[12] Ziyu Zhan, Hao Wang, Qiang Liu, and Xing Fu. Optoelectronic nonlinear softmax operator based on diffractive neural networks. Optics Express, 32(15):26458–26469, 2024.
[13] Ye Tian, Shuiying Xiang, Xingxing Guo, Yahui Zhang, Jiashang Xu, Shangxuan Shi, Haowen Zhao, Yizhi Wang, Xinran Niu, Wenzhuo Liu, and Yue Hao. Photonic transformer chip: interference is all you need. PhotoniX, 6:45, 2025.
[14] Jacob R. Stevens, Rangharajan Venkatesan, Steve Dai, Brucek Khailany, and Anand Raghunathan. Softermax: Hardware/software co-design of an efficient softmax for transformers. In Proceedings of the 58th ACM/IEEE Design Automation Conference (DAC), pages 469–474, 2021.
[15] Nazim Altar Koca, Anh Tuan Do, and Chip-Hong Chang. Hardware-efficient softmax approximation for self-attention networks. In Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS), pages 1–5, 2023.
[16] Wenxun Wang, Shuchang Zhou, Wenyu Sun, Peiqin Sun, and Yongpan Liu. SOLE: Hardware-software co-design of softmax and layernorm for efficient transformer inference. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pages 1–9, 2023.
[17] Yuan Zhang, Yonggang Zhang, Lele Peng, Lianghua Quan, Shubin Zheng, Zhonghai Lu, and Hui Chen. Base-2 softmax function: Suitability for training and efficient hardware implementation. IEEE Transactions on Circuits and Systems I: Regular Papers, 69(9):3605–3618, 2022.
[18] Zhengyu Mei, Hongxi Dong, Yuxuan Wang, and Hongbing Pan. TEA-S: A tiny and efficient architecture for PLAC-based softmax in transformers. IEEE Transactions on Circuits and Systems II: Express Briefs, 70:3594–3598, 2023.
[19] Ke Chen, Yue Gao, Haroon Waris, Weiqiang Liu, and Fabrizio Lombardi. Approximate softmax functions for energy-efficient deep neural networks. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 31:4–16, 2023.
[20] Wim Bogaerts et al. Silicon microring resonators. Laser & Photonics Reviews, 6(1):47–73, 2012.
[21] John E. Heebner, Robert W. Boyd, and Q.-Han Park. Scissor solitons and other propagation effects in microresonator-modified waveguides. Journal of the Optical Society of America B, 19(4):722–731, 2002.
[22] Jiahui Wang, Sean P. Rodrigues, Ercan M. Dede, and Shanhui Fan. Microring-based programmable coherent optical neural networks. Optics Express, 31(12):18871, 2023.
[23] Pengxing Guo, Niujie Zhou, Weigang Hou, and Lei Guo. StarLight: a photonic neural network accelerator featuring a hybrid mode-wavelength division multiplexing and photonic nonvolatile memory. Optics Express, 30:37051, 2022.
[24] Weizhen Yu, Shuang Zheng, Zhenyu Zhao, Bin Wang, and Weifeng Zhang. Reconfigurable low-threshold all-optical nonlinear activation functions based on an add-drop silicon microring resonator. IEEE Photonics Journal, 14(6):1–7, 2022.
[25] Bahaa E. A. Saleh and Malvin C. Teich. Fundamentals of Photonics. Wiley, Hoboken, NJ, 2nd edition, 2007.
[26] Vítor R. Almeida, Carlos A. Barrios, Roberto R. Panepucci, and Michal Lipson. All-optical control of light on a silicon chip. Nature, 431(7012):1081–1084, 2004.
[27] Qianfan Xu, Bradley Schmidt, Sameer Pradhan, and Michal Lipson. Micrometre-scale silicon electro-optic modulator. Nature, 435(7040):325–327, 2005.
[28] Kishore Padmaraju and Keren Bergman. Resolving the thermal challenges for silicon microring resonator devices. Nanophotonics, 3:269–281, 2014.
[29] Erwen Li, Behzad Ashrafi Nia, Bokun Zhou, and Alan X. Wang. Transparent conductive oxide-gated silicon microring with extreme resonance wavelength tunability. Photonics Research, 7(4):473, 2019.
[30] Lahiru Jayatilleka et al. Post-fabrication trimming of silicon photonic ring resonators at wafer-scale. Journal of Lightwave Technology, 39:5083–5088, 2021.
[31] Elliott W. Cheney. Introduction to Approximation Theory. McGraw–Hill, New York, 1966.
[32] Alec Radford et al. Language models are unsupervised multitask learners. Technical report, OpenAI, 2019.
[33] Hugging Face. distilgpt2 model card, 2025. Accessed 2026-02-21.
[34] Andrej Karpathy. Tiny shakespeare dataset (char-rnn), 2025. Accessed 2026-02-21.
[35] Jane Austen. Pride and prejudice. Project Gutenberg eBook No. 1342, 2025. Accessed 2026-02-21.
[36] Hyoseok Park. MRR-AEF: reproducible MRR depth-sweep fitting and supplementary validation scripts. GitHub repository, 2025. Commit 585e695, accessed 2026-02-21.
[37] Di Zhu et al. Integrated photonics on thin-film lithium niobate. Advances in Optics and Photonics, 13(2):242–352, 2021.
[38] Yaowen Hu, Di Zhu, Shengyuan Lu, Xinrui Zhu, Yunxiang Song, Dylan Renaud, Daniel Assumpcao, Rebecca Cheng, CJ Xin, Matthew Yeh, Hana Warner, Xiangwen Guo, Amirhassan Shams-Ansari, David Barton, Neil Sinclair, and Marko Lončar. Integrated electro-optics on thin-film lithium niobate. Nature Reviews Physics, 2025.
[39] Mian Zhang, Cheng Wang, Rebecca Cheng, Amirhassan Shams-Ansari, and Marko Lončar. Monolithic ultra-high-Q lithium niobate microring resonator. Optica, 4(12):1536–1537, 2017.
[40] Rongjin Zhuang, Jinze He, Yifan Qi, and Yang Li. High-Q thin-film lithium niobate microrings fabricated with wet etching. Adv. Mater., 35(3):2208113, 2023.
[41] Xinrui Zhu, Yaowen Hu, Shengyuan Lu, Hana K. Warner, Xudong Li, Yunxiang Song, Letícia S. Magalhães, Amirhassan Shams-Ansari, Neil Sinclair, and Marko Lončar. Twenty-nine million intrinsic Q-factor monolithic microresonators on thin-film lithium niobate. Photon. Res., 12(8):A63–A68, 2024.
[42] Renhong Gao, Ni Yao, Jianglin Guan, Li Deng, Jintian Lin, Min Wang, Lingling Qiao, Wei Fang, and Ya Cheng. Lithium niobate microring with ultra-high Q factor above $10^8$. Chin. Opt. Lett., 20(1):011902, 2022.
[43] Flexcompute Inc. Tidy3D: electromagnetic simulation software. https://www.flexcompute.com/tidy3d/, 2024. v2.10; cloud GPU FDTD. Accompanying notebook: https://www.flexcompute.com/tidy3d/community/notebooks/CascadedMRRTFLN/.
[44] John K. Doylend, Paul E. Jessop, and Andrew P. Knights. Silicon photonic dynamic optical channel leveler with external feedback loop. Optics Express, 18(13):13805–13812, 2010.
[45] Karl J. Åström and Richard M. Murray. Feedback Systems: An Introduction for Scientists and Engineers. Princeton University Press, Princeton, NJ, 2008.
[46] Amnon Yariv, Yong Xu, Reginald K. Lee, and Axel Scherer. Coupled-resonator optical waveguide: a proposal and analysis. Optics Letters, 24(11):711–713, 1999.
[47] Meisam Bahadori, Yansong Yang, Ahmed E. Hassanien, Lynford L. Goddard, and Songbin Gong. Ultra-efficient and fully isotropic monolithic microring modulators in a thin-film lithium niobate photonics platform. Optics Express, 28(20):29644–29661, 2020.
[48] Abu Naim R. Ahmed, Shouyuan Shi, Mathew Zablocki, Peng Yao, and Dennis W. Prather. Tunable hybrid silicon nitride and thin-film lithium niobate electro-optic microresonator. Optics Letters, 44(3):618, 2019.

SUPPLEMENTARY INFORMATION

Supplementary material accompanying “Photonic Exponential Approximation via Cascaded TFLN Microring Resonators toward Softmax.”

S0. RIGOROUS DERIVATION AND VALIDITY SCOPE

This section derives the depth-scaling relations and screening bounds used in the main text, and states the assumptions under which they apply, together with validity scope and failure cases. We separate proved statements (Lemma, Proposition, Theorem) from heuristic engineering estimates that rely on empirical calibration.

S0.1 Assumptions

Assumption 1 (Lorentzian single-ring transfer).
Each ring $k$ has a normalized add-to-drop transmission of the form $T_k(I) = [1 + (a_k + bI)^2]^{-1}$, where $a_k \in \mathbb{R}$ is the dimensionless static detuning, $b > 0$ is the common normalized sensitivity, and $I \ge 0$ is a nonnegative control-signal amplitude.

Assumption 2 (Multiplicative cascade). The $N$ rings are cascaded in a serial add-drop topology (the drop output of ring $k$ feeds the add input of ring $k+1$), and the probe is sufficiently weak that cross-ring and nonlinear probe-induced effects are negligible; thus the total normalized transmission is $y(I) = \prod_{k=1}^{N} T_k(I)$.

Assumption 3 (Identical-detuning family). All rings share the same static detuning: $a_1 = \cdots = a_N \equiv a$. This reduces the design space to $(a, b)$ and a global scale $C > 0$; the scaled output is $\tilde{y}(I) = C\,[1 + (a + bI)^2]^{-N}$.

Assumption 4 (Linear control-to-resonance mapping). Within the operating range $I \in [0, L]$, the resonance shift is a linear function of the control-signal amplitude (Eq. (10) of main text), i.e., higher-order detuning nonlinearity is negligible.

Assumption 5 (Finite interval and bounded $L$). The approximation target is $f(I) = e^{I-L}$ on a finite interval $I \in [0, L]$ with $L > 0$ determined by the input batch ($L = \max(x) - \min(x)$). The depth-scaling results are derived for fixed, finite $L$.

Assumption 6 (Flank-centered operating regime). The design uses the “flank-centered” initialization: $a + b(L/2) = -1$ (midpoint on the Lorentzian half-maximum) and $Nb = 1$ (slope matching). This places the operating point in the steepest-slope region of the Lorentzian, where the log-transfer is most nearly linear.

S0.2 Rigorous results

Throughout, define the log-domain residual

\[ r(I) \equiv \ln \tilde{y}(I) - (I - L) = \ln C - N \ln\!\left[1 + (a + bI)^2\right] - (I - L), \tag{S0.1} \]

and the worst-case log-error $E_\infty = \sup_{I \in [0,L]} |r(I)|$. We set $C$ to the minimax-optimal value $\ln C^\star = -\left[\max_I g(I) + \min_I g(I)\right]/2$, where $g(I) = \ln y(I) - (I - L)$, throughout.

Lemma 1 (Slope bound, rigorous). Under Assumptions 1–3, for every $I \ge 0$,

\[ \left| \frac{d}{dI} \ln y(I) \right| \le N|b|. \]

Proof. From Assumption 3, $\ln y(I) = -N \ln\!\left[1 + (a + bI)^2\right]$. Differentiating:

\[ \frac{d}{dI} \ln y(I) = -N\,\frac{2b(a + bI)}{1 + (a + bI)^2}. \]

Let $u \equiv a + bI \in \mathbb{R}$. The function $h(u) = 2u/(1 + u^2)$ satisfies $|h(u)| \le 1$ for all $u \in \mathbb{R}$ (since $1 + u^2 \ge 2|u|$ by AM–GM). Therefore $|d(\ln y)/dI| = N|b|\,|h(u)| \le N|b|$.

Remark 1 (Necessary condition for approximation). Since the target $\ln f(I) = I - L$ has constant slope $+1$, a necessary condition for the cascade log-transfer to track this slope at any point is $N|b| \ge 1$. This is Eq. (23) of the main text and is a rigorous (not heuristic) necessary condition.

Proposition 1 (Log-domain Taylor expansion at flank center). Under Assumptions 1–6, define $I_0 = L/2$ and $\delta = I - I_0$. Then

\[ \ln \tilde{y}(I) = \text{const} + \delta - \frac{\delta^3}{6N^2} + R_4(\delta), \tag{S0.2} \]

where $|R_4(\delta)| \le M_4\,\delta^4$ with $M_4 = N|b|^4 \cdot \sup_{|\delta| \le L/2} |\phi^{(4)}(u(\delta))|$ and $\phi(u) = -\ln(1 + u^2)$. In particular, the quadratic term vanishes identically at the flank point $u_0 = a + bI_0 = -1$.

Proof. Set $u(\delta) = a + b(I_0 + \delta) = -1 + b\delta$ (using $a + bI_0 = -1$). With $\phi(u) = -\ln(1 + u^2)$, we have $\ln y(I) = N\phi(u(\delta))$ and $u'(\delta) = b$. Compute derivatives of $\phi$ at $u_0 = -1$:

\[ \phi'(u) = -\frac{2u}{1 + u^2}, \qquad \phi'(-1) = 1, \]
\[ \phi''(u) = \frac{2(u^2 - 1)}{(1 + u^2)^2}, \qquad \phi''(-1) = 0, \]
\[ \phi'''(u) = \frac{4u(3 - u^2)}{(1 + u^2)^3}, \qquad \phi'''(-1) = \frac{4(-1)(3 - 1)}{(1 + 1)^3} = -1. \]

By the chain rule, writing $F(\delta) = N\phi(u(\delta))$:

\[ F'(0) = Nb\,\phi'(-1) = Nb = 1, \qquad F''(0) = Nb^2\,\phi''(-1) = 0, \qquad F'''(0) = Nb^3\,\phi'''(-1) = -Nb^3 = -\frac{1}{N^2}, \]

where we used $b = 1/N$ from Assumption 6 in the last step. Hence the Taylor expansion with the minimax-optimal $C$ is

\[ \ln \tilde{y}(I) = \text{const} + \delta + 0 \cdot \frac{\delta^2}{2} - \frac{1}{N^2} \cdot \frac{\delta^3}{6} + R_4(\delta). \]

Subtracting the target $\delta$ (the linear part of $I - L$ around $I_0$) gives a leading residual $-\delta^3/(6N^2)$, of magnitude $|\delta|^3/(6N^2)$. The remainder is bounded by the standard Taylor remainder estimate.

Theorem 1 (Heuristic depth-scaling law). Under Assumptions 1–6 and ignoring the fourth-order remainder $R_4$, the leading-order worst-case log-error on $I \in [0, L]$ satisfies

\[ E_\infty^{(\text{leading})} \sim \frac{1}{6N^2}\left(\frac{L}{2}\right)^3 = \frac{L^3}{48N^2}. \tag{S0.3} \]

Setting $E_\infty^{(\text{leading})} \le \varepsilon_{\log} = \ln(1 + \varepsilon)$ and solving for $N$ gives

\[ N \ge \frac{L^{3/2}}{\sqrt{48\,\varepsilon_{\log}}}. \tag{S0.4} \]

Derivation (heuristic). From Proposition 1, the residual with respect to the target is dominated in magnitude by $|\delta|^3/(6N^2)$ for $|\delta| \le L/2$. The maximum of $|\delta|^3$ on $[-L/2, L/2]$ is $(L/2)^3$. Setting the bound equal to $\varepsilon_{\log}$ and solving:

\[ \frac{L^3}{48N^2} \le \varepsilon_{\log} \quad\Longrightarrow\quad N \ge \frac{L^{3/2}}{\sqrt{48\,\varepsilon_{\log}}}. \]

With $1/\sqrt{48} \approx 0.144$, and accounting for the fact that the minimax-optimal residual is typically smaller than the one-sided Taylor bound by a factor of ∼2 (equi-oscillation), the effective prefactor becomes $\kappa \approx 0.07$, yielding the main-text engineering estimate Eq. (28): $N \approx \lceil \max(1/b_{\max},\ \kappa L^{3/2}/\sqrt{\varepsilon_{\log}}) \rceil$.

Remark 2 (Status of Theorem 1). This is a heuristic scaling law, not a rigorous minimax guarantee. The derivation truncates the Taylor series at third order and approximates the equi-oscillation factor empirically ($\kappa \approx 0.07$). For a rigorous bound one would need explicit control of $R_4$ over the full interval $[0, L]$, which depends on $L$, $N$, and higher derivatives of the Lorentzian; we do not claim such a bound here. The scaling $N \sim L^{3/2}/\sqrt{\varepsilon_{\log}}$ is supported by numerical evidence (Table I) but should be treated as an engineering design rule.

S0.3 Derivation of the conservative screening bound

We now derive the conservative screening bound (Eqs. S0.7–S0.8 below), which is stated inline in Sec. II of the main text.

Proposition 2 (Conservative log-error bound). Under Assumptions 1–5 (identical detuning, but not restricted to the flank-centered choice), fix $b > 0$ and choose the normalization $\tilde{y}(L) = 1$. Define $\phi(u) = -\ln(1 + u^2)$ and write

\[ \ln \tilde{y}(I) = N\left[\phi(a + bI) - \phi(a + bL)\right]. \]

The target in this normalization is $(I - L)$. Denoting the residual $r(I) = \ln \tilde{y}(I) - (I - L)$, we have $r(L) = 0$ and $r(0) = N[\phi(a) - \phi(a + bL)] + L$.
For any choice of $a$ such that the operating range $\{a + bI : I \in [0, L]\}$ lies in the region where $\phi$ is concave (i.e., $\phi''(u) \le 0$ throughout), the worst-case log-error satisfies

\[ E_\infty \le \frac{N \|\phi''\|_\infty\, b^2 L^2}{8} + \frac{\left| N \phi'(a + bL) \cdot b - 1 \right|}{2} \cdot L, \tag{S0.5} \]

where $\|\phi''\|_\infty = \sup_{u \in [a,\, a+bL]} |\phi''(u)|$.

Derivation sketch. Write $h(I) \equiv N\phi(a + bI)$. The slope is $h'(I) = Nb\,\phi'(a + bI)$. At $I = L$, we want the slope to match the target slope 1; define the slope mismatch $\Delta_s \equiv h'(L) - 1 = Nb\,\phi'(a + bL) - 1$. By the mean-value theorem on $[0, L]$:

\[ r(I) - r(L) = r(I) = \left[h(I) - h(L)\right] - (I - L) = \int_I^L \left[1 - h'(t)\right] dt. \]

Write $1 - h'(t) = (1 - h'(L)) + (h'(L) - h'(t)) = -\Delta_s + \int_t^L h''(s)\,ds$. Since $h''(s) = Nb^2 \phi''(a + bs)$, we bound $|h''(s)| \le Nb^2 \|\phi''\|_\infty$. Integrating twice and applying the triangle inequality gives (S0.5).

Corollary 1 (Main-text conservative bound). Under slope matching at $I = L$ (i.e., $Nb\,\phi'(a + bL) = 1$, so $\Delta_s = 0$), and using $\|\phi''\|_\infty \le 2$ (which holds since $|\phi''(u)| = |2(u^2 - 1)/(1 + u^2)^2| \le 2$ for all $u \in \mathbb{R}$), the bound simplifies to

\[ E_\infty \le \frac{N b^2 L^2}{4}. \tag{S0.6} \]

Using $b = 1/N$ (the slope-matching choice from $Nb = 1$) gives $E_\infty \le L^2/(4N)$. If instead we retain a general $b$ but add the penalty from imperfect slope matching (e.g., from the constraint $b \le b_{\max}$), a combined conservative bound is

\[ E_\infty \le \frac{L^2}{4N} + \frac{1}{2b^2 N}, \tag{S0.7} \]

which provides a conservative heuristic bound on the log-error. Setting this $\le \ln(1 + \varepsilon)$ and solving for $N$ yields the conservative screening depth:

\[ N_{\text{safe}} \ge \left\lceil \frac{L^2/4 + 1/(2b^2)}{\ln(1 + \varepsilon)} \right\rceil. \tag{S0.8} \]

Remark 3 (Status of the conservative bound). Equation (S0.7) is a conservative heuristic design rule. It is conservative because: (i) we use a global upper bound $\|\phi''\|_\infty \le 2$ instead of the actual curvature, (ii) we do not exploit the minimax-optimal $C$ shift.
It is heuristic (not a formal guarantee) because: (i) the derivation assumes the operating range lies in the concavity region of $\phi$, which may not hold for all detuning choices; (ii) the second term $1/(2b^2 N)$ arises from a simplified penalty model for flank-curvature mismatch that has not been proved to be a rigorous upper bound in all parameter regimes. $N_{\text{safe}}$ from Eq. (S0.8) is therefore a screening estimate, suitable for preliminary design-space exploration but not a certified minimax guarantee.

S0.4 Validity scope and failure cases

The derivations above hold under the stated assumptions. We now identify the regimes where each assumption may break down.

(V1) Lorentzian model (A1). The single-ring Lorentzian form $T = [1 + (a + bI)^2]^{-1}$ is a near-resonance approximation valid when the probe frequency is within a few linewidths of the resonance. Far from resonance, higher-order dispersion, Fano interference, or multi-mode effects introduce deviations. Failure case: operation with very large detuning ($|a + bI| \gg 1$ across the full interval), where the Lorentzian tails may not be accurate for high-Q rings.

(V2) Multiplicative cascade (A2). Requires that inter-ring reflections and back-scattering are negligible (forward-propagating coupling only, i.e. negligible back-reflection at each inter-ring junction). Failure case: very high ring count $N$ with non-negligible back-reflection per stage, which can produce Fabry–Pérot-like ripples in the cascade transfer function.

(V3) Identical-detuning family (A3). The Taylor expansion and conservative bound both assume $a_1 = \cdots = a_N$. In practice, fabrication variations introduce per-ring detuning spread $\sigma_a$. The Monte Carlo analysis in Sec. S8 quantifies robustness, but the analytical bounds (S0.2)–(S0.8) are strictly valid only for identical detuning.

(V4) Linear control-to-resonance mapping (A4). The linearized model $\omega_0(I) = \omega_0^{(0)} + \eta I$ introduces systematic error at large control amplitudes.
For carrier-injection (free-carrier plasma effect) or thermal tuning over wide ranges, second-order nonlinearity in the control-to-detuning mapping can exceed 1%. Failure case: large $L$ requiring a control swing exceeding the linearity range of the tuning mechanism.

(V5) Finite interval (A5). All bounds scale with $L$ (typically as $L^2$ or $L^{3/2}$). As $L \to \infty$, $N$ grows without bound and insertion loss accumulates ($\sim N \cdot \mathrm{IL}_{\text{stage}}$), eventually degrading the probe SNR below the useful regime. There is no finite $N$ that works for all $L$ simultaneously. Practical regime: $L \lesssim 10$–$12$ (consistent with $L_{\text{eff}}$ at p90–p95 from Sec. S3) is the primary target; $L \gtrsim 16$ requires $N \gtrsim 30$ even for moderate tolerance, pushing loss budgets.

(V6) Flank-centered initialization (A6). The Taylor-based scaling (Theorem 1) relies on the cancellation $\phi''(-1) = 0$ at the half-maximum point. If the operating point deviates (e.g., due to fabrication offset pushing $a + bI_0$ away from $-1$), a nonzero quadratic residual appears and the effective scaling worsens to $E_\infty \sim L^2/N$ rather than $L^3/N^2$. Mitigation: heater/bias trimming to restore the flank condition.

S0.5 Mapping to main-text equations

For reference, the results derived here correspond to the following main-text equations:

• Slope bound (Lemma 1): rigorous; corresponds to main-text Eqs. (22)–(23). This is a guaranteed necessary condition.

• Engineering $N$-estimate (Theorem 1): heuristic scaling with empirical prefactor $\kappa \approx 0.07$; corresponds to main-text Eq. (28). This is a heuristic design rule calibrated against numerical fits.

• Conservative bound (Corollary 1): conservative but not rigorously certified as a minimax upper bound; derived as Eq. (S0.7) in this supplement, stated inline in Sec. II. This is a conservative heuristic screening condition.

• $N_{\text{safe}}$ (Corollary 1, Eq. S0.8): the safe screening depth derived from the conservative bound; derived as Eq. (S0.8) in this supplement, stated inline in Sec. II.
This is a conservative backstop estimate for preliminary design.

Summary of guarantee status:

Result                                          Status                                         Main-text Eq.
Slope bound N|b| ≥ 1                            Rigorous (proved)                              (23)
Scaling N ∼ κ L^{3/2} / √ε_log                  Heuristic (Taylor truncation + empirical κ)    (28)
Bound E∞ ≤ L²/(4N) + 1/(2b²N)                   Conservative heuristic                         (S0.7)
N_safe screening depth                          Conservative backstop                          (S0.8)

S1. DEPTH-SCALING DERIVATION AND CONSERVATIVE SCREENING BOUND

This section provides the detailed derivations underlying the depth-scaling relations and conservative screening bounds summarized in the main text (Sec. II). These results complement the rigorous treatment in Sec. S0.

S1.1 Local expansion and exponential-like behavior

To provide immediate local intuition (without changing the global minimax objective), let $\delta = I - I_0$ around the flank-centered point $I_0 = L/2$ and impose $a + bI_0 = -1$. With the local normalization $C = 2^N$ (so that $\tilde{y}(I_0) = 1$), a third-order expansion of $\tilde{y}(I) = C[1 + (a + bI)^2]^{-N}$ gives

\[ \tilde{y}(I) \approx 1 + Nb\,\delta + \frac{N^2}{2} b^2 \delta^2 + \frac{N(N^2 - 1)}{6} b^3 \delta^3 + O(\delta^4), \tag{S1.1} \]

so with $b \sim 1/N$, the linear and quadratic coefficients align with those of $e^\delta = 1 + \delta + \delta^2/2 + \delta^3/6 + \cdots$, explaining why the initialization is already close before refinement.

S1.2 Log-domain analysis and scaling derivation

For depth scaling, the logarithmic domain is more transparent. Under the same flank centering ($a + bI_0 = -1$), expand around $I_0 = L/2$ with $\delta = I - I_0$ to obtain

\[ \ln \tilde{y}(I) = \text{const} + Nb\,\delta - \frac{N b^3}{6} \delta^3 + O(\delta^4), \tag{S1.2} \]

which is the logarithm of Eq. (S1.1) to this order. At $a + bI_0 = -1$, the quadratic term cancels identically in the log expansion; imposing slope matching ($Nb = 1$) gives

\[ \ln \tilde{y}(I) = \text{const} + \delta - \frac{\delta^3}{6N^2} + O(\delta^4). \tag{S1.3} \]

Hence the leading log-domain residual scales as $|r(\delta)| \sim |\delta|^3/N^2$. Over $I \in [0, L]$ with $|\delta| \le L/2$, this implies $E_\infty \sim L^3/N^2$. Requiring $E_\infty \le \varepsilon_{\log}$ leads to

\[ N \propto \frac{L^{3/2}}{\sqrt{\varepsilon_{\log}}}, \tag{S1.4} \]

which explains the scaling used in the main-text engineering estimate (Eq. (28)).
This derivation is heuristic (not a formal guarantee), and the prefactor remains platform- and fitting-criterion dependent.

S1.3 Conservative upper bound and screening depth

For fixed $b$ and the identical-detuning family ($a_1 = \cdots = a_N \equiv a$), one can write a conservative heuristic condition for achieving a prescribed log-tolerance. A simple normalization is to enforce $\tilde{y}(L) = 1$ (matching the target $f(L) = 1$). For a particular constructive choice of $a$ that keeps $(a + bI)$ large and negative across $[0, L]$, one can bound the worst-case log-error as

\[ E_\infty \le \frac{L^2}{4N} + \frac{1}{2b^2 N}. \tag{S1.5} \]

(This is a conservative rule of thumb; obtaining a formal guarantee would require a separate proof.) As a screening estimate (not a formal guarantee), one may use

\[ N \ge \left\lceil \frac{L^2/4 + 1/(2b^2)}{\ln(1 + \varepsilon)} \right\rceil. \tag{S1.6} \]

While this bound is typically pessimistic, it provides a conservative backstop-style estimate for preliminary design screening. The rigorous derivation of these bounds, including the concavity conditions and slope-matching assumptions, is given in Sec. S0.3.

S2. WORKED EXAMPLE AND EMPIRICAL LOGIT-RANGE CALIBRATION

This section provides the detailed worked example for the input-to-output mapping and the empirical logit-range calibration tables referenced in the main text (Sec. III).

S2.1 Worked input-to-output mapping example

As a worked example, consider

\[ x = [-3.2,\ 1.2,\ 4.8,\ -0.9]. \tag{S2.1} \]

Compute $m = \max_n x_n = 4.8$. Then $u = x - m = [-8.0, -3.6, 0, -5.7]$ and $L = -\min_n u_n = 8.0$. The mapped control-signal levels are

\[ I = u + L = [0,\ 4.4,\ 8.0,\ 2.3], \tag{S2.2} \]

and the required normalized exponentials are $e^{x_n - m} = e^{u_n} = e^{I_n - L}$. Using the fitted model directly,

\[ T_k(I_n) = \frac{1}{1 + (a_k + bI_n)^2}, \qquad y(I_n) = \prod_{k=1}^{N} T_k(I_n). \]

Under the identical-detuning fit ($a_1 = \cdots = a_N \equiv a$), this becomes

\[ \tilde{y}(I_n) = C\,y(I_n) = C\left[\frac{1}{1 + (a + bI_n)^2}\right]^{N}. \]

For the re-fitted parameters used in this example,

\[ a = -1.4588, \quad b = 0.10202, \quad N = 10, \quad C = 3.0896 \times 10^{1}, \tag{S2.3} \]

which gives

\[ \tilde{y}(I_n) = C\left[\frac{1}{1 + (a + bI_n)^2}\right]^{N} \approx [3.44 \times 10^{-4},\ 2.73 \times 10^{-2},\ 9.74 \times 10^{-1},\ 3.26 \times 10^{-3}]. \tag{S2.4} \]

For reference, the corresponding target terms are

\[ I_n - L = [-8.0,\ -3.6,\ 0,\ -5.7], \tag{S2.5} \]

and

\[ e^{I_n - L} \approx [3.35 \times 10^{-4},\ 2.73 \times 10^{-2},\ 1.00,\ 3.35 \times 10^{-3}]. \tag{S2.6} \]

TABLE S1: Example ($N = 10$): approximating $e^{x_n - m} = e^{I_n - L}$ using $\tilde{y}(I) = C[1 + (a + bI)^2]^{-N}$ with parameters re-fitted on $I \in [0, 8.0]$ using the same minimax pipeline.

x_n     I_n    target e^{x_n - m}    approx ỹ(I_n)      rel. err.
−3.2    0.0    3.3546 × 10^−4        3.4443 × 10^−4     2.673%
 1.2    4.4    2.7324 × 10^−2        2.7325 × 10^−2     0.004%
 4.8    8.0    1.0000                0.9739             2.608%
−0.9    2.3    3.3460 × 10^−3        3.2585 × 10^−3     2.614%

S2.2 Effective-range percentiles and clipping calibration

We first estimate the logit range observed in data and then choose clipping accordingly. From two autoregressive Transformers (distilgpt2 and gpt2) and two public corpora (Tiny Shakespeare and Pride and Prejudice) [1–5] at context length 128, the effective range

\[ L_{\mathrm{eff},\alpha} = \max(\log p_{\mathrm{kept}}) - \min(\log p_{\mathrm{kept}}), \qquad \alpha = 0.999, \tag{S2.7} \]

fell in a relatively narrow band, summarized in Table S2.

TABLE S2: Effective-range percentiles ($L_{\mathrm{eff},0.999}$) at context length 128.

Percentile    All runs (4 runs)    GPT-2
p50           6.92–7.23            7.09–7.23
p90           8.60–8.75            8.73–8.75
p95           8.97–9.12            9.06–9.12
p99           9.50–9.69            9.58–9.69

We then test clipping on the same rows with

\[ E_{\mathrm{cum}}(t) = \tfrac{1}{2}\,\bigl\| \mathrm{softmax}(u^{(t)}) - \mathrm{softmax}(u) \bigr\|_1, \qquad u^{(t)} = \max(u, t), \quad u = s - \max(s), \tag{S2.8} \]

and require $\mathrm{p99}\{E_{\mathrm{cum}}\} \le 10^{-3}$ (0.1% budget). This criterion is satisfied at $t = -12$ (p99 $\approx 4.27 \times 10^{-4}$) and violated at $t = -11$ (p99 $\approx 1.24 \times 10^{-3}$), so we set $t^* = -12$ ($N_{\mathrm{clip}} = 12$). In practice, we (i) estimate an effective $L$ from data, (ii) verify that fixed clipping keeps softmax error small, and (iii) choose representative design points (e.g., $L \approx 8$ or $L \approx 12$) while treating the clipped tail as negligible. Full protocol details, clipping-sweep tables/plots, and per-run statistics are provided in Sec. S3.
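The clipping test of Eq. (S2.8) can be sketched in a few lines of NumPy. This is a minimal illustration, not the pipeline used to produce the reported percentiles; the function names `softmax` and `clipping_error` and the toy logit row are assumptions introduced here.

```python
import numpy as np

def softmax(u):
    """Numerically stable softmax of a 1-D array of logits."""
    z = np.exp(u - np.max(u))
    return z / z.sum()

def clipping_error(s, t):
    """Cumulative softmax error E_cum(t) for one attention row.

    s : raw logits of the row (valid keys only)
    t : clipping threshold, t <= 0, applied to u = s - max(s)
    """
    u = s - np.max(s)             # shifted logits, max is 0
    u_clipped = np.maximum(u, t)  # u^(t) = max(u, t)
    return 0.5 * np.abs(softmax(u_clipped) - softmax(u)).sum()

# Hypothetical toy row: one dominant key and a few weak ones.
row = np.array([3.0, 1.5, -4.0, -9.0, -17.0])
for t in (-6.0, -12.0, -30.0):
    print(t, clipping_error(row, t))
```

A sweep over real attention rows would collect `clipping_error` per row and take percentiles per threshold; raising the floor only lifts tail probabilities, so the error shrinks as the threshold becomes more negative.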
S2.3 Illustrative synthetic range map

As a design-space reference, we consider synthetic logit-range regimes using $L = \max(x) - \min(x)$ after $QK^{\top}/\sqrt{d_k}$ scaling. These regimes are illustrative rather than corpus-level percentiles; using the same fitting pipeline, Table S3 summarizes achievable approximation error versus depth.

TABLE S3: Synthetic softmax logit-range regimes ($L = \max(x) - \min(x)$) and fitted worst-case relative error (design-space illustration; not intended as corpus-level statistics).

L regime    N = 5    N = 10    N = 20    N = 30
L = 8       10.9%    2.68%     0.67%     0.30%
L = 12      40.0%    9.25%     2.27%     1.01%
L = 16      113%     23.0%     5.44%     2.41%

Table S3 suggests a simple rule of thumb: the required depth depends mainly on the target $L$ regime. Near $L \approx 8$, moderate depth reaches a few-percent error, whereas $L \gtrsim 12$ typically requires deeper cascades to approach $< 1\%$ error. We include Table S3 as a synthetic design map rather than an empirical benchmark.

S3. EMPIRICAL LOGIT-RANGE EXTRACTION FROM REAL TRANSFORMER RUNS

We extracted empirical attention-logit ranges from real model runs to complement the synthetic $L$-regime map in the main text. We used two open-source autoregressive Transformers (distilgpt2 and gpt2) and two public corpora (Tiny Shakespeare and Pride and Prejudice), with context length 128 and causal masking. For each valid attention row, if $p = \mathrm{softmax}(s)$ then the raw range is

\[ L_{\mathrm{raw}} = \max(s) - \min(s) = \max(\log p) - \min(\log p), \tag{37} \]

where max/min are taken over valid causal keys only. Because very small tail probabilities can dominate $\min(\log p)$, we additionally report an effective range:

\[ L_{\mathrm{eff},\alpha} = \max(\log p_{\mathrm{kept}}) - \min(\log p_{\mathrm{kept}}), \tag{38} \]

where keys are sorted by attention weight and retained until cumulative mass reaches $\alpha = 0.999$. To stay within a 16 GB RAM budget, we processed one model at a time, batch size 1, fixed windowing (stride 128), and streaming histogram quantiles. Observed process RSS stayed below 1.24 GB in these runs.
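The raw and effective ranges of Eqs. (37) and (38) can be computed per attention row as in the sketch below. It assumes the key that first brings the cumulative mass to α is itself retained, which is one reading of "retained until cumulative mass reaches α"; the function names and the toy row are hypothetical.

```python
import numpy as np

def raw_range(s):
    """L_raw = max(s) - min(s) over the valid keys of one row."""
    return float(np.max(s) - np.min(s))

def effective_range(s, alpha=0.999):
    """L_eff,alpha: log-probability span of the heaviest keys whose
    cumulative attention mass first reaches alpha."""
    shifted = s - np.max(s)
    log_p = shifted - np.log(np.exp(shifted).sum())  # log-softmax
    order = np.argsort(log_p)[::-1]                  # heaviest key first
    cum_mass = np.cumsum(np.exp(log_p[order]))
    # Index of the first key at which the cumulative mass reaches alpha.
    k = min(int(np.searchsorted(cum_mass, alpha)) + 1, len(s))
    kept = log_p[order[:k]]
    return float(kept.max() - kept.min())

# Toy row: the -30 logit carries negligible mass and is dropped.
row = np.array([0.0, -1.0, -2.0, -30.0])
print(raw_range(row), effective_range(row))
```

The toy row shows why the two measures diverge: a single far-tail key sets the raw range, while the effective range reflects only keys that carry meaningful probability mass.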
TABLE S4: Empirical global logit-range percentiles from real model–dataset runs (context length 128): raw vs effective ($\alpha = 0.999$).

Model         Dataset             raw p95    raw p99    L_eff p50    L_eff p90    L_eff p95    L_eff p99
distilgpt2    tiny shakespeare    22.82      69.00      7.10         8.60         8.97         9.50
distilgpt2    pride prejudice     21.76      68.60      6.92         8.60         9.03         9.57
gpt2          tiny shakespeare    25.48      43.34      7.23         8.73         9.06         9.58
gpt2          pride prejudice     24.13      40.92      7.09         8.75         9.12         9.69

For quick linkage to the main manuscript: the effective-range summary quoted in the main text corresponds to this table (all runs: p50 = 6.92–7.23, p90 = 8.60–8.75, p95 = 8.97–9.12, p99 = 9.50–9.69), and the GPT-2 subset is p50 = 7.09–7.23, p90 = 8.73–8.75, p95 = 9.06–9.12, p99 = 9.58–9.69.

Clipping-validity sweep (additional justification). To test whether practical clipping magnitudes can be used without materially changing softmax outputs, we evaluated a thresholded-logit approximation. For each row, define $u = s - \max(s)$ and, for threshold $t \le 0$,

\[ u^{(t)} = \max(u, t), \qquad p^{(t)} = \mathrm{softmax}(u^{(t)}). \tag{39} \]

We report the cumulative softmax error

\[ E_{\mathrm{cum}}(t) = \tfrac{1}{2}\,\bigl\| p^{(t)} - p \bigr\|_1, \tag{40} \]

then sweep $t \in \{-14, -13, \ldots, -6\}$ and compute p50/p90/p95/p99 of $E_{\mathrm{cum}}$ over all extracted rows.

TABLE S5: Global clipping-validity sweep: percentile statistics of $E_{\mathrm{cum}}(t)$ versus clipping threshold $t$.

t      p50             p90             p95             p99
−14    2.53 × 10^−5    4.55 × 10^−5    4.80 × 10^−5    5.18 × 10^−5
−13    2.69 × 10^−5    4.85 × 10^−5    7.38 × 10^−5    1.48 × 10^−4
−12    2.99 × 10^−5    1.21 × 10^−4    2.13 × 10^−4    4.27 × 10^−4
−11    3.31 × 10^−5    3.95 × 10^−4    6.55 × 10^−4    1.24 × 10^−3
−10    3.72 × 10^−5    1.28 × 10^−3    2.01 × 10^−3    3.58 × 10^−3
−9     4.41 × 10^−5    4.04 × 10^−3    6.11 × 10^−3    1.03 × 10^−2
−8     2.25 × 10^−4    1.26 × 10^−2    1.83 × 10^−2    2.91 × 10^−2
−7     2.76 × 10^−3    3.85 × 10^−2    5.30 × 10^−2    7.89 × 10^−2
−6     1.88 × 10^−2    1.11 × 10^−1    1.43 × 10^−1    1.95 × 10^−1

Under a conservative budget criterion $\mathrm{p99}\{E_{\mathrm{cum}}\} \le 10^{-3}$, the least negative admissible threshold in this sweep is $t^* = -12$ (p99 $\approx 4.27 \times 10^{-4}$).
Equivalently, the operational clipping magnitude is Nclip ≡ −t∗ = 12. Notably, this is closely aligned with the empirical effective-range scale (Table S4: p99 of Leff,0.999 up to ≈ 9.69), indicating that clipping-constrained implementation and effective-range statistics operate in the same order-of-magnitude range budget. This supports using a practical clipping magnitude comparable to the design range scale (L ≈ Nclip ) while keeping aggregate softmax distortion below 0.1%. 23 FIG. S1: Global CDFs of raw Lraw (dashed) and effective Leff,0.999 (solid) for the four model–dataset runs. FIG. S2: Percentile curves of cumulative softmax error Ecum (t) versus clipping threshold t. The dashed line marks the 0.1% budget (10−3 ). 24 S4. FDTD METHODOLOGY DETAILS AND X-CUT bV DERIVATION This section provides the detailed FDTD simulation methodology, the step-by-step X-cut arc electrode voltage sensitivity derivation, and the full cascade optimization table referenced in the main text (Sec. IV–V). S4.1 z-refined 3-fix simulation strategy For thin-film LiNbO3 structures, special care is required in the vertical (z) direction due to the high index contrast between LiNbO3 (no ≈ 2.21) and SiO2 (n ≈ 1.44) and the sub-micron film thickness. We apply a “z-refined 3-fix” strategy: 1. Ordinary index correction: the material model uses the corrected ordinary refractive index no appropriate for the TE mode in X-cut geometry, rather than the extraordinary index ne that governs TM propagation; 2. z-span expansion: the simulation z-span is extended beyond the minimal waveguide region to include sufficient substrate and superstrate so that evanescent field tails are captured without PML truncation artifacts; 3. Auto-mesh: accuracy level 3; conformal variant 1 meshing is enabled, and no manual mesh override is applied. The resulting vertical grid spacing in the slab region is approximately 55 nm, providing ∼2 cells across the 100 nm slab. 
This refinement strategy is critical for obtaining converged results in TFLN ring resonators, where the high-Q spectral features are sensitive to numerical dispersion in under-resolved thin films [6]. Table S6 lists the full simulation parameters.

TABLE S6: 3D FDTD simulation parameters (Lumerical).

Parameter           Value
Solver              Lumerical 3D FDTD
Mesh type           Conformal variant 1
Mesh accuracy       3 (auto-mesh)
z-mesh override     None (auto-mesh)
Simulation time     50 ps
Auto shutoff        1 × 10^−6
Wavelength range    1530 nm to 1570 nm
Grid size           532 × 816 × 44
Source              Broadband mode source (TE0)

S4.2 X-cut arc electrode $b_V$ step-by-step derivation

For the X-cut circular ring with lateral S–G arc electrodes (Table II), the crystal Z-axis (c-axis) is oriented at 45° from the horizontal axis in the substrate plane. At azimuthal angle $\theta$ around the ring, the projection of the lateral electric field onto the Z-axis is proportional to $\cos(\theta - 45^\circ)$. The $\cos(\theta - 45^\circ) = 0$ boundaries fall at $\theta = 135^\circ$ and $\theta = 315^\circ$, naturally separating the bus-waveguide coupling regions from the electrode regions. Each ring carries a full semicircular arc electrode on the side opposite to its coupling points. By the substitution $\varphi = \theta - 45^\circ$, the effective EO fill factor is

\[ f_{\mathrm{EO}} = \frac{1}{2\pi}\int_{\mathrm{semicircle}} |\cos(\theta - 45^\circ)|\,d\theta = \frac{1}{2\pi}\int_{-\pi/2}^{+\pi/2} \cos\varphi\,d\varphi = \frac{1}{2\pi}\Bigl[\sin\varphi\Bigr]_{-\pi/2}^{+\pi/2} = \frac{1}{\pi} \approx 0.318. \tag{S4.1} \]

The 45° rotation ensures that the electrode semicircle does not overlap with the coupling points, while the fill factor integral is identical to the standard $\cos\theta$ case by the change of variable.

The lateral S–G electrodes have gap $g_{\mathrm{el}} = 5$ µm, giving an effective electrode–waveguide distance $d_{\mathrm{eff}} \approx g_{\mathrm{el}}/2 = 2.5$ µm. The lateral field geometry yields an EO overlap factor $\Gamma_{\mathrm{EO}} = 0.7$, compared to 0.5 for a vertical electrode configuration. The refractive index change per volt in the electrode-covered section is

\[ \frac{\Delta n_{\mathrm{eff}}}{V} = -\frac{1}{2}\, n_e^3\, r_{33}\, \frac{\Gamma_{\mathrm{EO}}}{d_{\mathrm{eff}}} = -\frac{1}{2} \times 2.138^3 \times 30.9 \times 10^{-12} \times \frac{0.7}{2.5 \times 10^{-6}} = -4.226 \times 10^{-5}\ \mathrm{V^{-1}}. \tag{S4.2} \]

The corresponding resonance wavelength shift is

\[ \left.\frac{d\lambda_0}{dV}\right|_{\mathrm{straight}} = \frac{1550 \times 4.226 \times 10^{-5}}{2.30} = 28.48\ \mathrm{pm\,V^{-1}}, \tag{S4.3} \]

giving an intrinsic (straight-section) voltage sensitivity of

\[ b_V^{\mathrm{straight}} = \frac{2Q_L}{\lambda_0} \left.\frac{d\lambda_0}{dV}\right|_{\mathrm{straight}} = \frac{2 \times 15{,}500}{1550} \times 0.02848 = 0.570\ \mathrm{V^{-1}}. \tag{S4.4} \]

However, only the arc-electrode portion of the ring circumference contributes to the round-trip phase shift. The effective voltage sensitivity is therefore

\[ b_V = b_V^{\mathrm{straight}} \times f_{\mathrm{EO}} = 0.570 \times \frac{1}{\pi} \approx 0.182\ \mathrm{V^{-1}}. \tag{S4.5} \]

A 1 V applied voltage shifts the normalized detuning by $\Delta a \approx 0.182$. Despite the fill-factor penalty ($f_{\mathrm{EO}} = 1/\pi \approx 0.318$), the X-cut arc design benefits from a smaller effective electrode distance (2.5 µm vs. 4 µm for vertical configurations) and a higher overlap factor (0.7 vs. 0.5), which partially compensate the reduced active length.

S4.3 Full cascade optimization table

Table S7 presents the complete optimization results for the standard dynamic range $L = 8$ (corresponding to $e^8 \approx 2981$, i.e., 34.7 dB), covering all cascade depths from $N = 5$ to $N = 30$.

TABLE S7: Cascade optimization results for $L = 8$. The bias voltage $V_{\mathrm{bias}} = |a|/b_V$ sets the DC offset, and $V_{\mathrm{ctrl}} = bL/b_V$ is the maximum control voltage at $I = L$. Voltages computed with $b_V = 0.182\ \mathrm{V^{-1}}$ (FDTD-calibrated best resonance $Q_L = 15{,}500$).

N     a          b          E_∞       ε_max (%)    V_bias (V)    V_ctrl (V)
5     −2.0789    0.21658    0.1035    10.91        11.4          9.5
8     −1.5959    0.12896    0.0412    4.20         8.8           5.7
10    −1.4588    0.10202    0.0265    2.68         8.0           4.5
12    −1.3731    0.08450    0.0184    1.86         7.5           3.7
15    −1.2914    0.06726    0.0118    1.19         7.1           3.0
17    −1.2543    0.05923    0.0092    0.92         6.9           2.6
20    −1.2136    0.05025    0.0067    0.67         6.7           2.2
25    −1.1685    0.04013    0.0043    0.43         6.4           1.8
30    −1.1392    0.03341    0.0030    0.30         6.3           1.5

Key thresholds for the minimum number of rings at various error targets are:
• ε < 10%: N ≥ 6,
• ε < 5%: N ≥ 8,
• ε < 2%: N ≥ 12,
• ε < 1%: N ≥ 17,
• ε < 0.5%: N ≥ 24.
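The entries of a single row of Table S7 can be cross-checked numerically. The sketch below evaluates the log-domain residual of the identical-detuning model on a dense grid for the N = 10 row, centers it with the normalization constant C, and converts (a, b) to voltages. The grid-based evaluation and the function names are assumptions made for illustration; this is not the minimax fitting pipeline itself, only a check of an already fitted pair (a, b).

```python
import numpy as np

def log_residual(a, b, n_rings, span, num=4001):
    """ln of the un-normalized cascade output minus the target (I - L) on [0, L]."""
    grid = np.linspace(0.0, span, num)
    log_y = -n_rings * np.log1p((a + b * grid) ** 2)
    return log_y - (grid - span)

def minimax_metrics(a, b, n_rings, span):
    """Return (C, E_inf, eps_max) with C chosen to center the log residual."""
    r = log_residual(a, b, n_rings, span)
    log_c = -0.5 * (r.max() + r.min())  # equalizes the +/- extremes
    e_inf = 0.5 * (r.max() - r.min())   # worst-case log error
    return np.exp(log_c), e_inf, np.expm1(e_inf)

def drive_voltages(a, b, span, b_v):
    """V_bias = |a| / b_V and V_ctrl = b L / b_V."""
    return abs(a) / b_v, b * span / b_v

# N = 10 row of Table S7, with b_V = 0.182 1/V
C, e_inf, eps_max = minimax_metrics(-1.4588, 0.10202, 10, 8.0)
v_bias, v_ctrl = drive_voltages(-1.4588, 0.10202, 8.0, 0.182)
print(C, e_inf, 100 * eps_max, v_bias, v_ctrl)
```

Because the residual depends only on a, b, N, and L, the same check applies unchanged at any quality factor; only `drive_voltages` changes through b_V.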
These thresholds are independent of the quality factor Q, since the minimax approximation operates entirely in normalized detuning space. The Q factor affects only the physical voltage required to achieve the necessary detuning range, through $b_V$.

S4.4 Lorentzian fit validation

Figure S3 shows the Lorentzian fit to the FDTD drop-port resonance near $\lambda = 1566$ nm. The analytical Lorentzian $T_{\mathrm{drop}}(\Delta\lambda) = A/[1 + (2Q_L\Delta\lambda/\lambda_0)^2]$ with $Q_L = 15{,}500$ closely tracks the FDTD data, validating the single-ring transfer function model used in the cascade analysis.

FIG. S3: Lorentzian fit to the FDTD drop-port resonance. Markers: FDTD data; solid line: Lorentzian fit. The extracted quality factor is $Q_L = 15{,}500$ with FWHM = 101 pm.

S4.5 Eigenmode (FDE) analysis of theoretical $Q_i$

To quantify how far below the physical limit the FDTD-extracted $Q_i = 38{,}800$ lies, we perform a two-dimensional finite-difference eigenmode (FDE) analysis of the bent rib waveguide cross-section using Lumerical MODE Solutions.

a. Setup. The FDE solver models the cross-section of the rib waveguide at the design bend radius $R = 20$ µm and wavelength $\lambda = 1550$ nm, with perfectly matched layer (PML) boundaries on all four edges. The geometry is identical to the 3D FDTD model: 600 nm total LiNbO3 ($n_o = 2.211$, lossless dielectric), 100 nm slab, 500 nm rib etch, waveguide width $W = 1.4$ µm, on a 2 µm SiO2 substrate ($n = 1.444$) with air cladding. The mesh is set to 300 × 300 cells over a 6 µm × 3 µm cross-section, yielding effective grid spacings $\Delta x \approx 20$ nm and $\Delta y \approx 10$ nm, substantially finer than the 3D FDTD auto-mesh (55 nm vertical).

b. Complex effective index. The FDE solver returns a complex effective index $n_{\mathrm{eff}} = n_r + i\,n_i$ for each guided mode, where the imaginary part $n_i$ encodes propagation loss. For the fundamental TE mode at $R = 20$ µm:

\[ n_{\mathrm{eff}} = 1.9653 + i\,(4.73 \times 10^{-8}), \tag{41} \]

\[ \alpha_{\mathrm{rad+leak}} = \frac{4\pi n_i}{\lambda} = 0.383\ \mathrm{m^{-1}} \quad (0.017\ \mathrm{dB\,cm^{-1}}). \tag{42} \]

Since the material is set as lossless, this $\alpha$ captures only bending radiation loss and substrate leakage through the 100 nm slab. The corresponding quality factor is

\[ Q_{\mathrm{rad+leak}} = \frac{2\pi n_g}{\alpha_{\mathrm{rad+leak}}\,\lambda} = 2.43 \times 10^{7}, \tag{43} \]

where $n_g = 2.354$ is the group index from the FDE solver (consistent with the FDTD FSR-derived $n_g = 2.30$; the small difference arises from the straight-section approximation inherent to 2D FDE).

c. Decomposition into bending and leakage. A separate FDE run with $R = 1$ mm (effectively straight) yields $Q_{\mathrm{leak}} = 2.93 \times 10^{7}$, isolating the substrate leakage contribution. The pure bending radiation quality factor follows from

\[ \frac{1}{Q_{\mathrm{bend}}} = \frac{1}{Q_{\mathrm{rad+leak}}} - \frac{1}{Q_{\mathrm{leak}}}, \qquad Q_{\mathrm{bend}} = 1.43 \times 10^{8}. \tag{44} \]

This confirms that bending radiation loss at $R = 20$ µm is negligible; substrate leakage through the thin slab is the dominant geometric loss channel.

d. Material absorption. The FDE mode profile yields a confinement factor $\Gamma = 0.887$ (fraction of the optical intensity within the LiNbO3 core and slab regions). The material-absorption-limited quality factor is

\[ Q_{\mathrm{abs}} = \frac{2\pi n_g}{\Gamma\,\alpha_{\mathrm{mat}}\,\lambda}, \tag{45} \]

where $\alpha_{\mathrm{mat}}$ is the bulk material power-attenuation coefficient of LiNbO3 at 1550 nm. Table S8 evaluates Eq. (45) for representative TFLN absorption values from the literature [6, 7].

TABLE S8: Theoretical intrinsic quality factor $Q_i$ of the $R = 20$ µm TFLN ring, decomposed into radiation ($Q_{\mathrm{bend}}$), substrate leakage ($Q_{\mathrm{leak}}$), and material absorption ($Q_{\mathrm{abs}}$). Sidewall scattering (fabrication-dependent) is excluded. The total is $1/Q_i = 1/Q_{\mathrm{rad+leak}} + 1/Q_{\mathrm{abs}}$ with $Q_{\mathrm{rad+leak}} = 2.43 \times 10^{7}$.

Material condition        α_mat (dB/cm)    Q_abs         Q_i (total)
Bulk LiNbO3 (pristine)    0.002            2.3 × 10^8    2.2 × 10^7
High-quality TFLN         0.01             4.7 × 10^7    1.6 × 10^7
Good TFLN                 0.03             1.6 × 10^7    9.5 × 10^6
Typical TFLN              0.1              4.7 × 10^6    3.9 × 10^6

For high-quality TFLN ($\alpha_{\mathrm{mat}} \lesssim 0.01\ \mathrm{dB\,cm^{-1}}$), the theoretical $Q_i$ exceeds $10^7$, more than 400× higher than the FDTD-extracted value of 38,800.
This confirms that the FDTD result is dominated by numerical mesh artifacts (approximately two cells across the 100 nm slab), not by physical loss mechanisms. Bending radiation loss at R = 20 µm is negligible (Qbend = 1.43 × 108 ); the dominant geometric loss channel in the ideal structure is substrate leakage through the thin slab (Qleak = 2.93 × 107 ). 28 S5. FABRICATED HIGH-Q DESIGN PROJECTIONS Reproducing Qi > 105 in three-dimensional FDTD is computationally impractical: at accuracy level 3 the 100 nm slab requires ∆z ≲ 20 nm to suppress staircase-induced scattering, inflating wall times beyond 30 days per run. The numerically extracted Qi = 38,800 therefore represents a simulation floor, not a physical one. A two-dimensional MODE-solver bend analysis confirms Qbend > 4.5 × 107 for R = 20 µm, placing bending radiation loss far below any realistic intrinsic loss. Table S9 surveys recent high-Q TFLN microring demonstrations. These studies show that Qi ≥ 9 × 106 has been demonstrated in X-cut TFLN using multiple fabrication routes, including Ar+ milling, wet etching, and ICP-RIE/CMP- based processes. TABLE S9: Demonstrated intrinsic quality factors in TFLN micro-ring resonators. “EO compatible” indicates whether the fabrication process preserves electrode patterning capability. Ref. Qi R (µm) w (µm) Etch Zhang [8] 107 80 ∼2 Ar+ mill Gao [9] 108 100 ∼3 CMP∗ Zhuang [10] 9×106 100 ∼2 Wet etch Song [11] 2.9×107 200 4.5 ICP-RIE+CMP All processes except ∗ are EO-electrode compatible. ∗ CMP-only (no dry etch); subsequent electrode patterning may degrade Qi . To project cascade performance into the fabricated regime, we fix Qext = 25,800 (the FDTD-extracted coupling quality factor at gap = 100 nm) and compute Dmax = [Qi /(Qi + Qext )]2 for three representative intrinsic quality factors (Table S10). N TABLE S10: Projected cascade transmission for fabricated Qi values at fixed Qext = 25,800. Dmax is the ideal on-resonance cascade transmission in dB. 
The minimax approximation error $\varepsilon_{\max}$ depends only on $N$ and $L$ (not on $Q_i$); at $N = 20$, $L = 8$: $\varepsilon_{\max} = 0.67\%$ (Table I).

                                        D_max^N (dB)
Projection       Q_i           D_max    N = 10    N = 20    N = 30
FDTD baseline    3.88 × 10^4   0.36     −44.3     −88.5     −132.8
Conservative     5 × 10^5      0.90     −4.4      −8.8      −13.2
Moderate         10^6          0.95     −2.2      −4.5      −6.7
Optimistic       5 × 10^6      0.99     −0.44     −0.88     −1.3

Even in the conservative scenario ($Q_i = 5 \times 10^5$), $D_{\max} = 0.90$ and the $N = 10$ cascade loss is only −4.4 dB, roughly 40 dB better than the FDTD baseline (−44.3 dB). The moderate projection ($Q_i = 10^6$) matches the “fabricated high-Q” column in Table V. Because $Q_{\mathrm{bend}} \approx 4.5 \times 10^{7} \gg Q_i$ for all projections, bending loss is never the bottleneck; the dominant loss mechanism is sidewall scattering, which is determined entirely by fabrication quality.

The literature values in Table S9 support the view that intrinsic quality factors in the projected range are physically achievable in TFLN, albeit with wider waveguides ($w \ge 2$ µm) and larger ring radii ($R \ge 80$ µm) than the present design. Transferring comparable sidewall quality to our geometry ($R = 20$ µm, $w = 1.4$ µm) is an open fabrication challenge; the projections in Table S10 should be read as design targets contingent on achieving it.

S6. INSERTION LOSS BUDGET DETAILS

For a cascade of $N$ rings, the total insertion loss is modeled as

\[ \mathrm{IL}_{\mathrm{tot}} \approx N \cdot \mathrm{IL}_{\mathrm{stage}} + \mathrm{IL}_{\mathrm{coupling}}, \tag{S6.1} \]

where $\mathrm{IL}_{\mathrm{stage}}$ is the per-ring insertion loss at off-resonance operation and $\mathrm{IL}_{\mathrm{coupling}}$ accounts for fiber-to-chip and chip-to-fiber coupling losses. Using typical loss numbers from the literature [12–16], we consider two scenarios:
• Optimistic: $\mathrm{IL}_{\mathrm{stage}} = 0.08$ dB, $\mathrm{IL}_{\mathrm{coupling}} = 1.5$ dB. Then $\mathrm{IL}_{\mathrm{tot}} \approx$ 1.90 dB (N = 5), 2.30 dB (N = 10), 3.10 dB (N = 20), and 3.90 dB (N = 30).
• Conservative: $\mathrm{IL}_{\mathrm{stage}} = 0.25$ dB, $\mathrm{IL}_{\mathrm{coupling}} = 3.0$ dB. Then $\mathrm{IL}_{\mathrm{tot}} \approx$ 4.25 dB (N = 5), 5.50 dB (N = 10), 8.00 dB (N = 20), and 10.5 dB (N = 30).
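The two scenarios follow directly from Eq. (S6.1). The short sketch below evaluates them; the function name and the scenario dictionary are illustrative choices, and only the loss values themselves come from the text. Note that 30 × 0.08 dB + 1.5 dB evaluates to 3.90 dB for the optimistic N = 30 case.

```python
def total_insertion_loss_db(n_rings, il_stage_db, il_coupling_db):
    """Eq. (S6.1): N * IL_stage + IL_coupling, all in dB."""
    return n_rings * il_stage_db + il_coupling_db

# (IL_stage, IL_coupling) in dB for the two scenarios in the text
SCENARIOS = {
    "optimistic": (0.08, 1.5),
    "conservative": (0.25, 3.0),
}

for name, (il_stage, il_coupling) in SCENARIOS.items():
    budget = {
        n: round(total_insertion_loss_db(n, il_stage, il_coupling), 2)
        for n in (5, 10, 20, 30)
    }
    print(name, budget)
```

Keeping the per-stage and fixed coupling terms separate makes it easy to see that the coupling loss dominates for shallow cascades, while the per-ring term takes over as N grows.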
In both scenarios, N = 5–10 is manageable for probe-power budgeting, whereas N = 20 and N = 30 require tighter power budgeting and more amplification margin. Higher ILtot raises the required probe SNR and pushes operation closer to the detector noise floor, reducing usable dynamic range. e. Four-component loss breakdown. The total insertion loss of the cascade has four components: N 1. On-resonance cascade transmission Dmax (dominant; see Table V); 2. Inter-ring coupling loss (N − 1) × (−10 log10 ηcoupling ), where ηcoupling is the power transfer efficiency at each inter-ring bus section. Two-ring FDTD yields ηcoupling ≈ 0.9 for the present diagonal-bus geometry, corresponding to ∼0.46 dB per inter-ring stage; 3. Off-resonance propagation loss N × ILstage , where ILstage = 0.08–0.25 dB per ring [12–14, 16]; 4. Fiber-to-chip coupling loss ILcoupling = 1.5–3.0 dB [15]. N Table V presents the ideal on-resonance budget (Dmax only). Including all four components for the present diagonal-bus layout: in the FDTD-characterized regime (Dmax = 0.36, N = 5), the total loss is approximately 22.2 + 1.8 + 0.4 + 1.5 ≈ 26 dB; in the fabricated high-Q regime (Dmax = 0.95, N = 30), the total loss is 6.7 + 13.3 + 2.4 + 1.5 ≈ 24 dB. The inter-ring coupling loss dominates in the high-Q regime, underscoring that layout optimization (e.g., adiabatic tapers or straight-bus coupling) is as important as achieving Dmax ≥ 0.95 through quality-factor improvement. For an optimized layout with ηcoupling ≥ 0.98 (≤0.09 dB per stage), the N = 30 total loss would reduce to ∼13 dB. 30 S7. ENERGY EFFICIENCY DETAILED DERIVATION This section provides the detailed energy-per-operation derivations for both electrical analog exponential circuits and the photonic MRR cascade, as summarized in the main text (Sec. V). S7.1 Electrical analog exponential circuits Three main families of electrical circuits realize the exponential function in the analog domain: f. BJT translinear / Gilbert cell circuits. 
The collector current of a bipolar junction transistor is $I_C = I_S \exp(V_{BE}/V_T)$, providing an intrinsic exponential map [17, 18]. A Gilbert cell multiplier, the core building block of translinear exponential circuits, dissipates 250–325 µW in typical CMOS/BiCMOS implementations [19]. At a signal bandwidth of $B \approx 100$ MHz, the energy per operation is

\[ E_{\mathrm{Gilbert}} = \frac{P}{B} = \frac{300\ \mathrm{\mu W}}{100\ \mathrm{MHz}} = 3\ \mathrm{pJ}. \tag{S7.1} \]

g. CMOS subthreshold exponential circuits. A MOSFET in weak inversion exhibits $I_D \propto \exp(V_{GS}/nV_T)$, enabling direct exponential computation at ultra-low power [18]. A reconfigurable softmax circuit in 180 nm CMOS implements a 10-input softmax at $V_{DD} = 500$ mV with $P = 3$ µW [20]. Per-channel: $P_{\exp} \approx 0.43$ µW. At $B \approx 1$ MHz (limited by subthreshold $f_T$):

\[ E_{\mathrm{sub\text{-}VT}} = \frac{0.43\ \mathrm{\mu W}}{1\ \mathrm{MHz}} = 0.43\ \mathrm{pJ}. \tag{S7.2} \]

This is the most energy-efficient electrical approach, but at severely limited bandwidth (∼1 MHz).

h. Digital CMOS (for reference). A digital exponential via Taylor series requires ∼10 multiply-add operations. Using Horowitz’s energy figures [21] for 45 nm at 0.9 V: 32-bit FP multiply costs 3.7 pJ, FP add costs 0.9 pJ, giving

\[ E_{\mathrm{digital}} \approx 10 \times (3.7\ \mathrm{pJ} + 0.9\ \mathrm{pJ}) = 46\ \mathrm{pJ}. \tag{S7.3} \]

At 8-bit precision (sufficient for inference): ∼2.3 pJ.

S7.2 Photonic MRR cascade: single-channel energy derivation

We evaluate the energy at $N = 30$ cascaded X-cut TFLN micro-ring resonators with $R = 20$ µm in the fabricated high-Q regime ($Q_i = 10^6$, $Q_L \approx 25{,}200$; Supplementary Sec. S5), which achieves $\varepsilon_{\max} = 0.30\%$ with $V_{\mathrm{ctrl}} = 0.91$ V (fully CMOS-compatible). The energy per exponential operation has three components:

(i) Electro-optic tuning energy. Each ring is tuned by charging the arc electrode capacitance to $V_{\mathrm{ctrl}}$. For the lateral S–G arc electrodes covering one semicircle ($L_{\mathrm{arc}} = \pi R = 62.8$ µm), the electrode capacitance is estimated as

\[ C_{\mathrm{el}} \approx 18\ \mathrm{fF}, \tag{S7.4} \]

based on coplanar electrode modeling for TFLN lateral S–G geometries with $g_{\mathrm{el}} = 5$ µm (comparable to values reported by Bahadori et al. [22] for similar geometries). The switching energy per ring at $V_{\mathrm{ctrl}} = 0.91$ V (using the projected $Q_L = 25{,}200$, which gives $b_V = 0.295\ \mathrm{V^{-1}}$) is

\[ E_{\mathrm{ring}} = \tfrac{1}{2} C_{\mathrm{el}} V_{\mathrm{ctrl}}^2 = \tfrac{1}{2} \times 18\ \mathrm{fF} \times (0.91\ \mathrm{V})^2 = 7.4\ \mathrm{fJ}. \tag{S7.5} \]

For $N = 30$ rings: $E_{\mathrm{EO}} = 30 \times 7.4\ \mathrm{fJ} = 0.22$ pJ. Note the important scaling: because $b \propto 1/N$ from the minimax optimization,

\[ E_{\mathrm{EO}} \propto N \times V_{\mathrm{ctrl}}^2 \propto N \times (b/b_V)^2 \propto 1/N. \tag{S7.6} \]

The bias voltage (3.9 V) is static and does not contribute per-operation energy.

(ii) Laser source energy (amortized). Because every cascade channel uses the same fixed probe wavelength, a single CW laser can be shared among $M$ parallel softmax channels via a $1 \times M$ optical power splitter. With wall-plug efficiency $\eta_{\mathrm{WPE}} \approx 15\%$ [23], the per-channel optical power is $P_{\mathrm{opt}} = P_{\mathrm{in}}/M \approx 100$ µW (for $P_{\mathrm{in}} = 1$ mW, $M = 10$), requiring $P_{\mathrm{laser}} \approx 667$ µW per channel. At $f_{\mathrm{mod}} = 10$ GHz: $E_{\mathrm{laser}} = 667\ \mathrm{\mu W} / 10\ \mathrm{GHz} = 67$ fJ.

(iii) Photodetector energy. Integrated SiGe photodetectors with TIA achieve sub-pJ reception [24]: $E_{\mathrm{PD}} \approx 0.5$ pJ.

The total single-channel energy is

\[ E_{\mathrm{photonic}}^{(1\mathrm{ch})} = E_{\mathrm{EO}} + E_{\mathrm{laser}} + E_{\mathrm{PD}} = 0.22 + 0.07 + 0.50 = 0.79\ \mathrm{pJ}. \tag{S7.7} \]

S7.3 Q-factor scaling of energy efficiency

Since $V_{\mathrm{ctrl}} \propto 1/Q$ and $E_{\mathrm{EO}} \propto V_{\mathrm{ctrl}}^2$, the EO energy scales as $1/Q^2$. Table S11 shows the total energy for $N = 30$ at various quality factors.

TABLE S11: Energy per exponential operation vs. quality factor ($N = 30$, $\varepsilon_{\max} = 0.30\%$, X-cut arc electrode with $b_V$ scaling linearly with $Q$; $C_{\mathrm{el}} = 18$ fF). $E_{\mathrm{laser}} + E_{\mathrm{PD}} = 0.57$ pJ is the Q-independent floor. The dagger (†) marks the FDTD-calibrated quality factor; the double dagger (‡) marks the high-Q design point ($Q_i = 10^6$). Excludes thermal stabilization (0.15–0.60 pJ for $N = 30$).
Q          V_ctrl (V)    V_bias (V)    E_EO (pJ)    E_total (pJ)
5,000      4.57          19.5          5.64         6.21
10,000     2.28          9.7           1.40         1.97
12,500     1.83          7.8           0.90         1.47
15,500†    1.47          6.3           0.58         1.15
20,000     1.14          4.9           0.35         0.92
25,200‡    0.91          3.9           0.22         0.79
30,000     0.76          3.2           0.16         0.73
50,000     0.46          1.9           0.06         0.63

At $Q_L = 15{,}500$ (FDTD-calibrated), the EO contribution (0.58 pJ) is comparable to the optical floor, placing the design in the efficient operating regime. Beyond $Q \approx 30{,}000$, the EO contribution becomes negligible and the total energy saturates near the floor; further Q improvement primarily benefits CMOS driver voltage compatibility rather than energy.

i. Additional energy contributions. The estimates above exclude two further contributions: (i) DAC energy for setting the per-ring control voltages, typically 0.1–1 pJ per conversion at 10 GHz bandwidth; and (ii) thermal stabilization power for maintaining resonance alignment, estimated at ∼50–200 µW per ring for TFLN (lower than silicon due to the small thermo-optic coefficient of LiNbO3, $dn/dT \approx 3.9 \times 10^{-6}\ \mathrm{K^{-1}}$). At 10 GHz modulation rate, the thermal contribution amounts to ∼0.005–0.02 pJ per ring per operation. For the $N = 30$ cascade, this sums to 0.15–0.60 pJ, which is comparable to $E_{\mathrm{EO}}$ and must be included in the total: $E_{\mathrm{total}} \approx$ 0.94–1.39 pJ. The total energy comparison should therefore be treated as an order-of-magnitude estimate.

S7.4 Comparison with electronic implementations

Here we provide an order-of-magnitude energy comparison between electrical analog exponential circuits and our photonic MRR cascade, grounding the analysis in published device data and first-principles estimates. We assume a shared CW laser with total optical output $P_{\mathrm{in,tot}} = 1$ mW, split across $M = 10$ parallel softmax channels via a $1 \times M$ power splitter, yielding per-channel input $P_{\mathrm{in,ch}} = 100$ µW. The output power at the cascade drop port is $P_{\mathrm{out}} = P_{\mathrm{in,ch}} \times D_{\max}^{N}$, which ranges from 0.61 µW (FDTD regime, $N = 5$) to 21.5 µW (fabricated regime, $N = 30$) (Table V).

j.
Electrical analog exponential circuits. Three main families of electrical analog exponential circuits are compared: BJT translinear/Gilbert cell (∼3 pJ at 100 MHz [17–19]), CMOS subthreshold (∼0.43 pJ at 1 MHz [18, 20]), and digital FP32 Taylor series (∼46 pJ at 1 GHz [21]).

k. Photonic MRR cascade: single-channel energy. For $N = 30$ X-cut TFLN micro-ring resonators in the self-consistent high-Q regime ($Q_L = 25{,}200$), the three energy components are EO tuning ($E_{\mathrm{EO}} = 0.22$ pJ), amortized laser ($E_{\mathrm{laser}} = 0.07$ pJ, shared across $M = 10$ parallel channels), and photodetector ($E_{\mathrm{PD}} = 0.50$ pJ), yielding $E_{\mathrm{photonic}} = 0.79$ pJ. Including thermal stabilization for $N = 30$ rings (0.15–0.60 pJ), the total rises to 0.94–1.39 pJ. Notably, $E_{\mathrm{EO}} \propto 1/N$ since $b \propto 1/N$ from minimax optimization.

l. Single-channel comparison. Table S12 presents the comparison. The photonic cascade at $N = 30$ achieves a 0.79 pJ baseline, which is 3.8× lower than the BJT Gilbert cell (3 pJ) and 58× lower than digital FP32 (46 pJ). Including thermal stabilization (0.94–1.39 pJ), the advantage over INT8 (2.3 pJ) is 1.7–2.4×, while operating at 10 GHz bandwidth. At fabricated $Q \ge 30{,}000$, $E_{\mathrm{EO}}$ drops to 0.16 pJ and $E_{\mathrm{total}} \approx 0.73$ pJ (excluding thermal; Table S11), recovering a 3.2× advantage over INT8. Subthreshold CMOS achieves the lowest energy (0.43 pJ) but at 10,000× lower bandwidth.

TABLE S12: Energy per exponential operation: single-channel comparison.

Implementation           E/op (pJ)    Bandwidth    Notes
Digital FP32 (Taylor)    ∼46          1 GHz        10 FP MACs
BJT Gilbert cell         ∼3           100 MHz      Analog
Digital INT8 (Taylor)    ∼2.3         1 GHz        10 INT MACs
Photonic MRR (N = 30)    0.94–1.39    10 GHz       Analog†
Subthreshold CMOS        ∼0.43        1 MHz        Analog

† 0.79 pJ excluding thermal; 0.94–1.39 pJ including thermal. Self-consistent with fabricated high-Q regime ($Q_L = 25{,}200$); see Supplementary Sec. S7.

m. Caveats. These values are order-of-magnitude estimates, not device-accurate predictions. The photonic estimate excludes DAC energy for voltage generation (typically 0.1–1 pJ per conversion at 10 GHz bandwidth, shared with any analog approach) and thermal tuning power for maintaining resonance alignment (∼50–200 µW per ring for TFLN, lower than silicon due to the small thermo-optic coefficient of LiNbO3, $dn/dT \approx 3.9 \times 10^{-6}\ \mathrm{K^{-1}}$). Effective precision at the photodetector is limited to ∼6–8 bits by shot noise and receiver electronics. The energy advantage over electrical implementations is strongest in the fabricated high-Q regime ($D_{\max} \ge 0.95$), where $N = 30$ is practical and $V_{\mathrm{ctrl}}$ remains CMOS-compatible.

S8. MONTE CARLO ROBUSTNESS UNDER DEVICE NON-IDEALITIES

This section describes the robustness model summarized in the main text. For the fitted $L = 8$, $N = 10$ design ($a = -1.4588$, $b = 0.10202$), each Monte Carlo chip sample includes: (i) per-ring static detuning spread, (ii) per-ring sensitivity spread, (iii) global thermal drift and crosstalk-like slope drift, (iv) stage insertion-loss variation, (v) control-channel noise, and (vi) detector noise with one-point calibration at $I = L$. For ring $k$, we use

\[ T_k(I) = \frac{1}{1 + \left(a_k + b_k I + d_{\mathrm{th}} + d_{\mathrm{xt}}\, I/L\right)^2}, \tag{46} \]

with

\[ y(I) = \prod_{k=1}^{N} T_k(I) \times 10^{-\mathrm{IL}_{\mathrm{tot}}/10}, \tag{47} \]

and one-point calibration $\tilde{y}(I) = C_{\mathrm{cal}}\, y(I)$ such that $\tilde{y}(L) = 1$ for the same chip instance.

TABLE S13: Non-ideality distributions used in the Monte Carlo sweeps.

Parameter                 Nominal         Stress
σ_a                       0.020           0.032
σ_b,rel                   0.020           0.032
σ_th                      0.015           0.025
σ_xt                      0.012           0.020
σ_I                       0.004           0.007
IL_stage (dB, µ ± σ)      0.12 ± 0.03     0.18 ± 0.05
σ_det                     3.0 × 10^−6     6.0 × 10^−6

TABLE S14: Monte Carlo summary (same run reported in main text).
| Metric | Nominal | Stress |
|---|---|---|
| Median $\mathrm{KL}(p_{\mathrm{ref}} \parallel p_{\mathrm{approx}})$ | $2.17 \times 10^{-4}$ | $7.39 \times 10^{-4}$ |
| p95 $\mathrm{KL}(p_{\mathrm{ref}} \parallel p_{\mathrm{approx}})$ | $5.92 \times 10^{-4}$ | $2.21 \times 10^{-3}$ |
| Median max $|\Delta p|$ | 0.170% | 0.193% |
| p95 max $|\Delta p|$ | 0.319% | 0.419% |

Conservative-bound sketch used for the main-text screening equation. For the identical-detuning family with fixed b, define

\[ \ln \tilde{y}(I) = N\,\phi(a + bI) - N\,\phi(a + bL), \qquad \phi(u) = -\ln(1 + u^2), \tag{48} \]

so that $\tilde{y}(L) = 1$ by construction. Around a constructive choice with $a + bI < 0$ on $[0, L]$, a second-order remainder argument for the mismatch between the target slope and the fitted slope yields a term scaling as $L^2/(4N)$, while the flank-curvature penalty contributes a term scaling as $1/(2b^2 N)$. Combining the two contributions gives the screening inequality

\[ E_\infty \lesssim \frac{L^2}{4N} + \frac{1}{2 b^2 N}, \tag{49} \]

which leads to the conservative screening equation reported in the main manuscript. We emphasize that this is a conservative heuristic design rule (not a formal minimax theorem), used only for preliminary depth screening.

FIG. S4: CDF of end-to-end softmax probability error under the same non-ideality samples.

S9. DELAY-AWARE FEEDBACK NORMALIZATION VALIDATION

We model global normalization as a delayed PI-controlled loop:

\[ S(t) = G(t)\,P(t) + n(t), \tag{50} \]

\[ \tau \frac{dP}{dt} = -P(t) + u(t - T_d), \tag{51} \]

\[ u(t) = K_p\, e(t) + K_i \int e(t)\, dt, \qquad e(t) = S_{\mathrm{ref}} - S(t), \tag{52} \]

with actuator saturation $0 \leq u \leq P_{\max}$. A piecewise $G(t)$ profile is used to emulate workload changes. For physical intuition, Table S15 converts normalized delay/settling metrics into absolute-time examples.

TABLE S15: Example absolute-time interpretation of normalized PI-loop metrics using one representative stable case ($(K_p, K_i, T_d/\tau) = (0.55, 0.8, 0.2)$) and a ±2% settling-time definition ($T_{\mathrm{settle}} \sim 12.4\tau$).

| Assumed τ | Delay $T_d = 0.2\tau$ | Settling ∼12.4τ | Interpretation |
|---|---|---|---|
| 100 ns | 20 ns | 1.24 µs | fast loop |
| 1 µs | 200 ns | 12.4 µs | moderate loop |
| 5 µs | 1 µs | 62 µs | slower loop |

Reference-backed latency context for bottleneck screening.
To place the delayed-loop times against mixed-signal system latencies, Table S16 summarizes representative time scales with explicit path classes (on-chip vs off-chip) for memory and interconnect paths, alongside conversion latency ranges. These are intentionally order-of-magnitude ranges (not fixed constants), and can shift with architecture, clocking, and protocol stack choices.

TABLE S16: Representative subsystem latency ranges used for conservative bottleneck screening in Sec. S9.

| Subsystem path | $T_{\mathrm{sys}}$ | Sources |
|---|---|---|
| On-chip memory (L1/L2) | 20–200 ns | [25] |
| Off-chip memory (DRAM) | 200–700 ns | [25, 26] |
| ADC conversion | 10–710 ns | [27, 28] |
| DAC + driver/settling | 1–200 ns | [29] |
| On-chip interconnect (NoC) | 5–100 ns | [30] |
| Off-chip I/O (PCIe/CXL) | 1–10 µs | [25, 31] |

Conservative risk-screening heuristic for loop latency. As a screening heuristic, we use the settling time from one representative stable case ($(K_p, K_i, T_d/\tau) = (0.55, 0.8, 0.2)$; Table S18), with settling defined as the first time entering and remaining within a ±2% band around $S_{\mathrm{ref}}$, as a normalization-loop latency proxy:

\[ T_{\mathrm{norm}} \approx 12.4\,\tau. \tag{53} \]

This value is not a universal bound; different gain settings, delay ratios, or loop architectures will yield different settling times. It is used only as a reference point for order-of-magnitude risk screening. Define the conservative screening metric

\[ T_{\mathrm{norm}} \geq \beta\, T_{\mathrm{sys}}, \tag{54} \]

with β = 1 (high-risk screening line) and β = 0.5 (early-warning line); this is a heuristic risk indicator, not a formal dominance proof. The corresponding threshold is

\[ \tau_{\mathrm{crit}}(\beta) = \frac{\beta\, T_{\mathrm{sys}}}{12.4}. \tag{55} \]

Table S17 gives the resulting numeric ranges. For the explicit examples in Table S15, τ = 0.1 µs gives $T_{\mathrm{norm}}$ ≈ 1.24 µs, τ = 1 µs gives $T_{\mathrm{norm}}$ ≈ 12.4 µs, and τ = 5 µs gives $T_{\mathrm{norm}}$ ≈ 62 µs.
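The screening arithmetic of Eqs. (53)–(55) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' script: the function names `t_norm`, `tau_crit`, and `is_flagged` and the dictionary `T_SYS_NS` are hypothetical, and the latency ranges are simply the Table S16 values expressed in nanoseconds.

```python
# Settling factor T_norm / tau for the representative stable case, Eq. (53).
SETTLING_FACTOR = 12.4

# Representative subsystem latency ranges copied from Table S16 (nanoseconds).
T_SYS_NS = {
    "On-chip memory (L1/L2)": (20.0, 200.0),
    "Off-chip memory (DRAM)": (200.0, 700.0),
    "ADC conversion": (10.0, 710.0),
    "DAC + driver/settling": (1.0, 200.0),
    "On-chip interconnect (NoC)": (5.0, 100.0),
    "Off-chip I/O (PCIe/CXL)": (1000.0, 10000.0),
}


def t_norm(tau):
    """Normalization-loop latency proxy, Eq. (53); same time unit as tau."""
    return SETTLING_FACTOR * tau


def tau_crit(t_sys, beta):
    """Critical loop time constant, Eq. (55); same time unit as t_sys."""
    return beta * t_sys / SETTLING_FACTOR


def is_flagged(tau, t_sys, beta=1.0):
    """Screening condition of Eq. (54): True when T_norm >= beta * T_sys."""
    return t_norm(tau) >= beta * t_sys


# Print the tau_crit range of each subsystem for both screening lines.
for name, (lo, hi) in T_SYS_NS.items():
    row = [f"{tau_crit(t, beta):.2f}" for beta in (0.5, 1.0) for t in (lo, hi)]
    print(f"{name:28s} beta=0.5: {row[0]}-{row[1]} ns   "
          f"beta=1: {row[2]}-{row[3]} ns")
```

For example, `is_flagged(100.0, 700.0)` checks a τ = 100 ns loop against the upper DRAM latency with β = 1. The result is only a heuristic flag in the sense of Eq. (54), not a timing guarantee.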
These numbers indicate a risk trend (not a hard boundary): for this representative case, the normalization loop is typically non-dominant when τ is well below the relevant $\tau_{\mathrm{crit}}$ band, and it may become dominant as τ approaches or exceeds that band. The transition depends on path class (on-chip vs off-chip) and on architecture-specific timing closure, including whether the normalization path lies on the end-to-end critical path (Table S16). Accordingly, this analysis is intended for preliminary risk screening only; concrete implementations require full timing validation.

TABLE S17: Computed $\tau_{\mathrm{crit}}$ ranges from Eq. (55) using Table S16.

| Subsystem | $T_{\mathrm{sys}}$ range | $\tau_{\mathrm{crit}}$ (β = 0.5) | $\tau_{\mathrm{crit}}$ (β = 1) |
|---|---|---|---|
| On-chip memory path | 20–200 ns | 0.81–8.06 ns | 1.61–16.13 ns |
| Off-chip memory path | 200–700 ns | 8.06–28.23 ns | 16.13–56.45 ns |
| ADC conversion | 10–710 ns | 0.40–28.63 ns | 0.81–57.26 ns |
| DAC+driver/settling | 1–200 ns | 0.04–8.06 ns | 0.08–16.13 ns |
| On-chip interconnect (NoC) | 5–100 ns | 0.20–4.03 ns | 0.40–8.06 ns |
| Off-chip I/O fabric | 1–10 µs | 0.04–0.40 µs | 0.08–0.81 µs |

TABLE S18: Representative step-response cases for the delayed PI loop (settling defined by a ±2% band around $S_{\mathrm{ref}}$).

| Case | $(K_p, K_i, T_d/\tau)$ | Overshoot | Settling | Stable |
|---|---|---|---|---|
| Stable | (0.55, 0.8, 0.2) | 25.6% | ∼12.4τ | Yes |
| Marginal | (0.95, 1.6, 0.45) | 25.6% | ∼12.8τ | Yes |
| Unstable | (1.2, 2.2, 0.75) | 45.1% | not settled | No |

TABLE S19: Stable-region fraction from gain-map scans at each delay ratio.

| $T_d/\tau$ | Stable fraction |
|---|---|
| 0.0 | 88.1% |
| 0.2 | 88.0% |
| 0.5 | 72.4% |
| 0.8 | 47.5% |

FIG. S5: Step-response examples of the delayed PI normalization loop.

FIG. S6: Delay-dependent stability maps over scanned $(K_p, K_i)$ ranges.

S10. REPRODUCIBILITY

Scripts used for this Supplementary validation:
• scripts/nonideality_montecarlo.py
• scripts/feedback_loop_validation.py
• scripts/extract_logit_range_effective.py
• scripts/analyze_softmax_clipping_validity.py

Public code repository: https://github.com/hyoseokp/MRR-AEF (commit 585e695).
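Independently of the repository scripts, the delayed PI loop of Eqs. (50)–(52) can be illustrated with a short forward-Euler simulation. The sketch below is an assumption-laden toy, not a reproduction of scripts/feedback_loop_validation.py: time is normalized to τ, the noise n(t) is set to zero, and the gain profile, `s_ref`, `p_max`, `t_end`, and `dt` are illustrative choices. Because these choices may differ from the authors' setup, the overshoot and settling values it produces are not expected to match Table S18.

```python
def simulate_pi_loop(kp=0.55, ki=0.8, delay=0.2, s_ref=1.0, p_max=10.0,
                     gain=lambda t: 1.0, t_end=40.0, dt=1e-3):
    """Forward-Euler integration of Eqs. (50)-(52) with n(t) = 0.

    Time is measured in units of tau, so `delay` is Td/tau and `ki` is
    per unit of normalized time. All defaults other than (kp, ki, delay)
    are illustrative assumptions.
    """
    n_delay = int(round(delay / dt))   # transport delay in samples
    steps = int(round(t_end / dt))
    p, integ = 0.0, 0.0                # plant state P and error integral
    u_log, s_log = [], []
    for i in range(steps):
        t = i * dt
        s = gain(t) * p                # Eq. (50), noise-free
        e = s_ref - s                  # error signal of Eq. (52)
        integ += e * dt                # running integral of the error
        u = kp * e + ki * integ        # PI law, Eq. (52)
        u = min(max(u, 0.0), p_max)    # actuator saturation 0 <= u <= Pmax
        u_log.append(u)
        # The plant is driven by the command issued Td earlier.
        u_delayed = u_log[i - n_delay] if i >= n_delay else 0.0
        p += dt * (-p + u_delayed)     # Eq. (51) with tau = 1
        s_log.append(s)
    return s_log


def settling_time(s_log, s_ref=1.0, band=0.02, dt=1e-3):
    """Time (in units of tau) after which S stays within +/-band of s_ref.

    Returns the full simulated duration if the trace never settles.
    """
    last_outside = -1
    for i, s in enumerate(s_log):
        if abs(s - s_ref) > band * s_ref:
            last_outside = i
    return (last_outside + 1) * dt


trace = simulate_pi_loop()
print("final S:", trace[-1], " settling time / tau:", settling_time(trace))
```

A workload change can be emulated by passing a piecewise `gain`, for example `gain=lambda t: 1.0 if t < 20.0 else 2.0`, which plays the role of the piecewise G(t) profile mentioned in Sec. S9.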
Empirical extraction outputs are stored under:
• paper/empirical_L_v3/

[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems 30 (NeurIPS 2017), pages 5998–6008, 2017.
[2] Alec Radford et al. Language models are unsupervised multitask learners. Technical report, OpenAI, 2019.
[3] Hugging Face. distilgpt2 model card, 2025. Accessed 2026-02-21.
[4] Andrej Karpathy. Tiny shakespeare dataset (char-rnn), 2025. Accessed 2026-02-21.
[5] Jane Austen. Pride and prejudice. Project Gutenberg eBook No. 1342, 2025. Accessed 2026-02-21.
[6] Di Zhu et al. Integrated photonics on thin-film lithium niobate. Advances in Optics and Photonics, 13(2):242–352, 2021.
[7] Yaowen Hu, Di Zhu, Shengyuan Lu, Xinrui Zhu, Yunxiang Song, Dylan Renaud, Daniel Assumpcao, Rebecca Cheng, CJ Xin, Matthew Yeh, Hana Warner, Xiangwen Guo, Amirhassan Shams-Ansari, David Barton, Neil Sinclair, and Marko Lončar. Integrated electro-optics on thin-film lithium niobate. Nature Reviews Physics, 2025.
[8] Mian Zhang, Cheng Wang, Rebecca Cheng, Amirhassan Shams-Ansari, and Marko Lončar. Monolithic ultra-high-Q lithium niobate microring resonator. Optica, 4(12):1536–1537, 2017.
[9] Renhong Gao, Ni Yao, Jianglin Guan, Li Deng, Jintian Lin, Min Wang, Lingling Qiao, Wei Fang, and Ya Cheng. Lithium niobate microring with ultra-high Q factor above 10^8. Chin. Opt. Lett., 20(1):011902, 2022.
[10] Rongjin Zhuang, Jinze He, Yifan Qi, and Yang Li. High-Q thin-film lithium niobate microrings fabricated with wet etching. Adv. Mater., 35(3):2208113, 2023.
[11] Xinrui Zhu, Yaowen Hu, Shengyuan Lu, Hana K. Warner, Xudong Li, Yunxiang Song, Letícia S. Magalhães, Amirhassan Shams-Ansari, Neil Sinclair, and Marko Lončar. Twenty-nine million intrinsic Q-factor monolithic microresonators on thin-film lithium niobate. Photon. Res., 12(8):A63–A68, 2024.
[12] Sudip Shekhar, Wim Bogaerts, Lukas Chrostowski, John E. Bowers, Michael Hochberg, Richard Soref, and Bhavin J. Shastri. Roadmapping the next generation of silicon photonics. Nature Communications, 15:751, 2024.
[13] Xaveer Leijtens et al. Multimode silicon photonics. Nanophotonics, 7:1571–1580, 2018.
[14] Haoqian Li et al. In-memory photonic dot-product engine with electrically programmable weight banks. Nature Communications, 14:2389, 2023.
[15] Daan Vermeulen et al. High-efficiency fiber-to-chip grating couplers realized using an advanced CMOS-compatible silicon-on-insulator platform. Optics Express, 18(17):18278–18283, 2010.
[16] F. S. Tan, D. J. W. Klunder, H. F. Bulthuis, G. Sengo, H. J. W. M. Hoekstra, and A. Driessen. Direct measurement of the on-chip insertion loss of high finesse microring resonators in Si3N4-SiO2 technology. In Proceedings of the IEEE LEOS Benelux Chapter, 2001.
[17] B. Gilbert. Translinear circuits: a proposed classification. Electron. Lett., 11(1):14–16, 1975.
[18] C. Mead. Analog VLSI and Neural Systems. Addison-Wesley, 1989.
[19] B. Razavi. Design of Analog CMOS Integrated Circuits. McGraw-Hill, 2nd edition, 2017.
[20] Massimo Vatalaro, Tatiana Moposita, Sebastiano Strangio, Lionel Trojman, Andrei Vladimirescu, Marco Lanuzza, and Felice Crupi. A low-voltage, low-power reconfigurable current-mode softmax circuit for analog neural networks. Electronics, 10(9):1004, 2021.
[21] M. Horowitz. 1.1 Computing's energy problem (and what we can do about it). In 2014 IEEE International Solid-State Circuits Conference (ISSCC), pages 10–14, 2014.
[22] Meisam Bahadori, Yansong Yang, Ahmed E. Hassanien, Lynford L. Goddard, and Songbin Gong. Ultra-efficient and fully isotropic monolithic microring modulators in a thin-film lithium niobate photonics platform. Optics Express, 28(20):29644–29661, 2020.
[23] A. Biberman and K. Bergman. Optical interconnection networks for high-performance computing systems. Rep. Prog. Phys., 75(4):046402, 2012.
[24] D. A. B. Miller. Attojoule optoelectronics for low-energy information processing and communications. J. Lightwave Technol., 35(3):346–396, 2017.
[25] Zhe Jia, Marco Maggioni, Benjamin Staiger, and Daniele P. Scarpazza. Dissecting the NVIDIA Volta GPU architecture via microbenchmarking. arXiv preprint arXiv:1804.06826, 2018.
[26] Yoongu Kim, Vivek Seshadri, Donghyuk Lee, Jamie Liu, and Onur Mutlu. A case for exploiting subarray-level parallelism (SALP) in DRAM. In Proceedings of the 39th Annual International Symposium on Computer Architecture (ISCA), pages 368–379, 2012.
[27] Texas Instruments. ADC12DJ3200: 6.4-GSPS single-channel or 3.2-GSPS dual-channel, 12-bit, RF-sampling analog-to-digital converter. Datasheet (SLVSD97A, revised April 2020), 2020. Accessed 2026-02-22.
[28] Texas Instruments. ADS8881: 18-bit, 1-MSPS, low-power, true-differential SAR ADC. Datasheet (SBAS547D, revised August 2015), 2015. Accessed 2026-02-22.
[29] Texas Instruments. DAC38RF82/DAC38RF89: Dual-channel, 14-bit, 9-GSPS and 6-GSPS RF DACs. Datasheet (SLASEA6D, revised June 2020), 2020. Accessed 2026-02-22.
[30] W. J. Dally and B. Towles. Route packets, not wires: on-chip interconnection networks. In Proceedings of the 38th Design Automation Conference (DAC), pages 684–689, 2001.
[31] Shintaro Sano, Yosuke Bando, Kazuhiro Hiwada, Hirotsugu Kajihara, Tomoya Suzuki, Yu Nakanishi, Daisuke Taki, and Akiyuki Kaneko. GPU graph processing on CXL-based microsecond-latency external memory. In Proceedings of the SC '23 Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2023.