Diffstat (limited to 'ep_run/extracted_paper.txt')
-rw-r--r--ep_run/extracted_paper.txt2039
1 files changed, 2039 insertions, 0 deletions
diff --git a/ep_run/extracted_paper.txt b/ep_run/extracted_paper.txt
new file mode 100644
index 0000000..4f521d8
--- /dev/null
+++ b/ep_run/extracted_paper.txt
@@ -0,0 +1,2039 @@
+ Photonic Exponential Approximation via Cascaded TFLN Microring Resonators
+ toward Softmax
+                                                       Hyoseok Park^1 and Yeonsang Park^{1,*}
+                             ^1 Department of Physics, Chungnam National University, Daejeon 34134, Republic of Korea
+ (Dated: March 26, 2026)
+ The rapid growth of large-scale AI models has intensified energy consumption and data-movement
+ challenges in modern datacenters. Photonic accelerators offer a promising path by executing the linear
+ matrix multiplications of transformer inference at high throughput and low energy. However, the
+ softmax attention layer—which requires element-wise exponentiation followed by normalization—still
+ relies on electronic post-processing, creating an electro-optic conversion bottleneck that negates much
+ of the potential photonic advantage.
+arXiv:2603.12934v3 [physics.optics] 25 Mar 2026
+
+                                       We present a cascaded micro-ring resonator (MRR) architecture that synthesizes the per-channel
+                                     exponential function required by softmax, $e^{x_n - \max(x)}$, over a finite interval with tunable worst-case
+                                     relative error. A control signal detunes each ring via an electro-optic mechanism; a weak probe
+                                     at fixed frequency experiences Lorentzian transmission, and cascading N identical stages yields a
+                                     multiplicative transfer function whose logarithm is approximately linear.
+ We derive mapping rules, depth-scaling estimates, and a minimax fitting formulation, and validate
+ the framework with three-dimensional FDTD simulations of X-cut thin-film lithium niobate (TFLN)
+ add-drop micro-ring resonators. Direct multi-ring FDTD validation extends to a five-ring cascade
+ and confirms agreement with theory primarily over the upper operating range; deeper cascades and
+ higher quality factors are assessed analytically. The cascade implements the per-channel exponential
+ block—the key missing nonlinearity for photonic softmax. We further present a WDM-parallel
+ chip architecture with closed-loop PI feedback that completes the full softmax—exponentiation,
+ summation, and normalization—on a single photonic chip without per-channel normalization circuitry.
+
+I. INTRODUCTION
+
+Transformer inference is often limited by power and memory traffic, motivating optical
+accelerators that exploit parallel propagation and multiplexing [1, 2, 4, 5, 7, 9]. Recent
+perspective articles also discuss data-center power consumption as one motivation for optical
+computing [3, 8]. While linear operators are comparatively amenable to photonic implementation
+[4–6], the softmax function used in attention layers requires an exponential mapping together
+with global normalization, both difficult to realize in passive photonic circuits, where
+transmission is fundamentally bounded by unity. Parallel digital-hardware studies treat the
+exponential/softmax stage as a bottleneck and propose dedicated approximations [11–19]. Many
+integrated-photonic classifier demonstrations still rely on electronic post-processing for the
+final nonlinear readout [10]; the resulting electro-optic conversion overhead can negate the
+throughput and energy benefits of the photonic front-end. Notably, the SOFTONIC architecture
+[11] explicitly argues that “the inability of MRRs and MZMs to handle SMA’s exponential and
+division functions” necessitates alternative approaches based on microdisk modulators and
+polynomial approximation, achieving 89.7% accuracy with a third-degree Chebyshev polynomial.
+Here we challenge this premise: we show that a passive Lorentzian cascade of microring
+resonators can be tuned so that its logarithm is approximately linear over a finite interval,
+enabling exponential-function synthesis with sub-2% worst-case error (an order of magnitude
+more accurate than SOFTONIC’s polynomial approach) while remaining compatible with integrated
+microring platforms [20–24]. We term this cascade block an approximate exponential function
+(AEF) unit. We further propose a WDM-parallel architecture with a single PI feedback loop that
+realizes the complete softmax function, including summation and normalization, without
+per-channel electronic processing.
+
+We extend the theoretical framework with three-dimensional FDTD simulations of a single X-cut
+TFLN add-drop micro-ring resonator. The simulated device parameters (quality factor, free
+spectral range, and electro-optic sensitivity) calibrate the cascade design parameters,
+bridging analytical fitting and physically realizable hardware. Two operating regimes emerge
+from this calibration: an FDTD-characterized regime with moderate drop-port depth
+($D_{\max} \approx 0.36$), where the analytic error falls to ∼5% at N = 7 (Table V) but the
+power budget limits practical cascades to N ≤ 5; and a projected high-Q regime
+($D_{\max} \ge 0.95$), enabling deeper cascades (N ≤ 30) with sub-percent error. Cascade
+performance is predicted analytically and validated by a five-ring cascade 3D FDTD simulation
+(Sec. IV).
+
+The paper is organized as follows: Section II presents the mapping, transfer model, and
+depth-design rules; Section III provides numerical fits and validation; Section IV describes
+the single-ring TFLN device design and FDTD validation; Section V assesses physical feasibility
+including voltage requirements, insertion loss, and energy efficiency; Section VI discusses
+implementation scope, platform comparisons, and limits; and Section VII concludes.
+
+* yeonsang.park@cnu.ac.kr; Corresponding author
+
+II. MODEL AND DESIGN FRAMEWORK
+
+Target mapping. Let $x = (x_1, \dots, x_K) \in \mathbb{R}^K$ be an arbitrary real-valued
+sequence (or vector). Directly generating $\exp(x_n)$ as a passive optical transmission is
+impossible in general because $\exp(x)$ grows beyond unity while a passive transmission
+satisfies $0 < T \le 1$ [25]. However, for softmax,
+    \[ \mathrm{softmax}(x)_n = \frac{e^{x_n}}{\sum_j e^{x_j}}, \tag{1} \]
+a common shift cancels:
+    \[ \frac{e^{x_n + c}}{\sum_j e^{x_j + c}} = \frac{e^{x_n}}{\sum_j e^{x_j}} \quad (\forall c \in \mathbb{R}). \tag{2} \]
+Thus it suffices to generate
+    \[ e^{x_n - m}, \qquad m \equiv \max_j x_j, \tag{3} \]
+since the global factor $e^{m}$ cancels.
+To ensure a nonnegative control-signal amplitude, define
+    \[ u_n \equiv x_n - m \le 0, \qquad L \equiv -\min_n u_n = m - \min_n x_n \ge 0, \tag{4} \]
+and map each scalar to a nonnegative control-signal amplitude
+    \[ I_n \equiv u_n + L \in [0, L]. \tag{5} \]
+Then
+    \[ e^{x_n - m} = e^{u_n} = e^{I_n - L}. \tag{6} \]
+Hence the optical design task is to realize, for $I \in [0, L]$,
+    \[ f(I) = e^{I - L} \in [e^{-L}, 1]. \tag{7} \]
+
+Control–probe transfer. Consider a weak probe at fixed angular frequency $\omega_L$. For the
+kth ring, let $\omega_{0,k}$ denote its resonance frequency and $\Gamma > 0$ its loaded
+half-width at half maximum (HWHM). Define the detuning
+    \[ \Delta\omega_k \equiv \omega_L - \omega_{0,k}. \tag{8} \]
+Near resonance, the normalized Lorentzian transmission is modeled as [20, 21]
+    \[ T_k(\Delta\omega_k) = \frac{1}{1 + \left( \Delta\omega_k / \Gamma \right)^2}. \tag{9} \]
+In a control–probe architecture, a nonnegative control-signal amplitude $I \ge 0$ shifts the
+ring resonance. Here I denotes a generic control amplitude: for optical-pump operation it maps
+to optical intensity, while for EO operation it maps to electrical control level (e.g.,
+voltage). Across many physical mechanisms (optical pump via Kerr/XPM, EO drive via Pockels
+effect, thermal, carrier tuning), the shift can be linearized on a working range [20, 26–30]:
+    \[ \omega_{0,k}(I) = \omega_{0,k}^{(0)} + \eta I, \tag{10} \]
+where $\omega_{0,k}^{(0)}$ is the cold-cavity resonance and $\eta$ is the control-to-resonance
+sensitivity; a quantitative EO feasibility example is given in the Discussion. With
+$\Delta\omega_{0,k} \equiv \omega_L - \omega_{0,k}^{(0)}$, the control-dependent detuning
+becomes
+    \[ \Delta\omega_k(I) = \Delta\omega_{0,k} - \eta I. \tag{11} \]
+Define dimensionless parameters
+    \[ a_k \equiv \frac{\Delta\omega_{0,k}}{\Gamma}, \qquad b \equiv -\frac{\eta}{\Gamma}. \tag{12} \]
+Then Eq. (9) yields the control-to-probe transfer of a single ring,
+    \[ T_k(I) = \frac{1}{1 + (a_k + bI)^2}. \tag{13} \]
+Physical meaning: $a_k$ is a static detuning in linewidth units (set by heater/carrier
+tuning/fabrication), and $|b|$ is the normalized sensitivity magnitude (linewidths of resonance
+shift per unit control-signal amplitude); the sign convention is absorbed into the detuning
+expression. For “same-material/same-geometry” rings, b is often common, while $a_k$ can be
+tuned per ring.
+Sign convention. Simultaneously flipping $(a_k, b) \mapsto (-a_k, -b)$ leaves $T_k(I)$
+unchanged, so we may take $b > 0$ without loss of generality.
+Let N rings be cascaded in a serial add-drop topology: $T_k(I)$ denotes the add-to-drop
+transmission of ring k, and the drop output of ring k feeds the add (input bus) port of ring
+k+1. Assuming the probe is sufficiently weak that the control channel dominates the resonance
+shift, the normalized probe output is the product
+    \[ y(I) \equiv \frac{P_{\mathrm{out}}^{(\mathrm{probe})}(I)}{P_{\mathrm{in}}^{(\mathrm{probe})}} = \prod_{k=1}^{N} T_k(I) = \prod_{k=1}^{N} \frac{1}{1 + (a_k + bI)^2}. \tag{14} \]
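The mapping of Eqs. (3)-(5) and the cascade product of Eq. (14) can be sketched numerically. The snippet below is a minimal illustration, not the authors' implementation: the function names are hypothetical, the input sequence is the worked-example vector quoted in Sec. III, and the identical-ring parameters (N = 10, a = −1.4588, b = 0.10202) are taken from the N = 10 row of Table I. Because any global scale cancels in the softmax ratio, the raw cascade outputs are normalized directly.

```python
import math

def preprocess(x):
    """Eqs. (3)-(5): shift by the maximum, then bias into [0, L]."""
    m = max(x)
    u = [x_n - m for x_n in x]          # u_n <= 0
    L = -min(u)                         # dynamic range L >= 0
    I = [u_n + L for u_n in u]          # control amplitudes in [0, L]
    return I, L

def cascade_transmission(I, detunings, b):
    """Eq. (14): product of single-ring Lorentzians 1 / (1 + (a_k + b*I)^2)."""
    y = 1.0
    for a_k in detunings:
        y *= 1.0 / (1.0 + (a_k + b * I) ** 2)
    return y

# Example input (the sequence quoted for the worked example in Sec. III).
x = [-3.2, 1.2, 4.8, -0.9]
I_vals, L_range = preprocess(x)

# Identical rings; (N, a, b) copied from the N = 10 row of Table I.
N, a, b = 10, -1.4588, 0.10202
y = [cascade_transmission(I_n, [a] * N, b) for I_n in I_vals]
p_cascade = [y_n / sum(y) for y_n in y]   # global scale C cancels here

# Reference: exact softmax computed with the usual max shift.
exact = [math.exp(x_n - max(x)) for x_n in x]
p_exact = [e_n / sum(exact) for e_n in exact]
max_abs_diff = max(abs(p - q) for p, q in zip(p_cascade, p_exact))
```

For this input the mapping gives L = 8 and control amplitudes of approximately [0, 4.4, 8, 2.3], so the example sits exactly on the L = 8 design range used throughout the paper.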
+
+[Figure 1 graphic. Panels: (a) Electronic Preprocessing (find max, shift, bias; control In);
+(b) N-MRR Cascade (EO tuning; probe at fixed ωL; rings MRR #1 to #5); (c) Output (PD).]
+
+FIG. 1: Overview of the control–probe add-drop cascade N-MRR exponential block. (a) Electronic
+preprocessing maps an arbitrary input sequence $\{x_n\}$ to a nonnegative control signal via
+$m = \max_n x_n$, $u_n = x_n - m$, and $I_n = u_n + L$ with $L = m - \min_n x_n$. (b) The
+control signal $I_n$ induces resonance shifts in a cascade of N rings, while a weak
+fixed-frequency probe propagates through the serial add-drop cascade (the drop output of each
+ring feeds the next stage), experiencing multiplicative transmission. (c) After photodetection,
+the block implements $\tilde{y}(I_n) \approx \exp(I_n - L) = \exp(x_n - m)$, i.e., the
+normalized exponential used in softmax.
+
+
+To focus on the shape of the approximation, we allow a global scale factor $C > 0$:
+    \[ \tilde{y}(I) \equiv C\, y(I). \tag{15} \]
+In softmax, $p_n = C e^{I_n - L} / \sum_j C e^{I_j - L}$, so C cancels between numerator and
+denominator and is physically inessential; nevertheless it is convenient for error analysis.
+For a fixed $(N, b, \{a_k\})$, the optimal C for the minimax log-error in Eq. (18) can be
+written in closed form. Let $g(I) \equiv \ln y(I) - (I - L)$ on $[0, L]$. Then the
+minimax-optimal shift is $\ln C^{\star} = -(\max_I g(I) + \min_I g(I))/2$, yielding
+$E_\infty = (\max_I g(I) - \min_I g(I))/2$.
+Taking logarithms,
+    \[ \ln \tilde{y}(I) = \ln C - \sum_{k=1}^{N} \ln\!\left[ 1 + (a_k + bI)^2 \right]. \tag{16} \]
+The target $\ln f(I) = I - L$ is linear; hence exponential approximation is equivalent to the
+log-linearization goal
+    \[ \ln \tilde{y}(I) \approx I - L \quad \text{uniformly on } I \in [0, L]. \tag{17} \]
+
+Error metric. Define the worst-case log-error on $[0, L]$:
+    \[ E_\infty \equiv \sup_{I \in [0, L]} \left| \ln \tilde{y}(I) - (I - L) \right|. \tag{18} \]
+If $E_\infty \le \varepsilon_{\log}$, then for all $I \in [0, L]$,
+    \[ e^{-\varepsilon_{\log}} \le \frac{\tilde{y}(I)}{f(I)} \le e^{\varepsilon_{\log}} \;\Rightarrow\; \left| \frac{\tilde{y}(I)}{f(I)} - 1 \right| \le e^{\varepsilon_{\log}} - 1. \tag{19} \]
+Thus achieving a prescribed worst-case relative error $\varepsilon$ is guaranteed by
+    \[ E_\infty \le \varepsilon_{\log} \equiv \ln(1 + \varepsilon) \approx \varepsilon. \tag{20} \]
+
+Depth scaling. We derive depth-related constraints and design rules for a prescribed
+approximation tolerance.
+Necessary slope condition. Differentiate Eq. (16):
+    \[ \frac{d}{dI} \ln y(I) = -\sum_{k=1}^{N} \frac{2b\,(a_k + bI)}{1 + (a_k + bI)^2}. \tag{21} \]
+Since $|2u/(1 + u^2)| \le 1$ for all real u,
+    \[ \left| \frac{d}{dI} \ln y(I) \right| \le N |b|. \tag{22} \]
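The closed-form scale $\ln C^\star$ and the resulting $E_\infty$ stated above, together with the log-to-relative error conversion of Eqs. (19)-(20), can be evaluated on a sampling grid. The sketch below assumes identical rings ($a_k = a$) and uses the N = 10 parameters listed in Table I; the function names and the grid size are illustrative choices, not part of the paper.

```python
import math

def log_residual(I, N, a, b, L):
    """g(I) = ln y(I) - (I - L) for N identical rings (a_k = a)."""
    return -N * math.log(1.0 + (a + b * I) ** 2) - (I - L)

def minimax_scale_and_error(N, a, b, L, samples=4001):
    """Closed-form ln C* and E_inf from the sampled extremes of g on [0, L]."""
    grid = [L * i / (samples - 1) for i in range(samples)]
    g = [log_residual(I, N, a, b, L) for I in grid]
    g_max, g_min = max(g), min(g)
    ln_C = -(g_max + g_min) / 2.0     # centers the residual around zero
    E_inf = (g_max - g_min) / 2.0     # worst-case log-error after centering
    return ln_C, E_inf

# N = 10 row of Table I (L = 8).
ln_C, E_inf = minimax_scale_and_error(N=10, a=-1.4588, b=0.10202, L=8.0)

# Eq. (19): worst-case relative error implied by the log-error.
rel_err_bound = math.exp(E_inf) - 1.0
```

With these parameters the sampled log-error comes out near 0.0265 and the implied relative error near 2.7%, in line with the values the paper reports for N = 10.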
+
+The target $\ln f(I) = I - L$ has constant slope +1, so a necessary condition to track it is
+    \[ N |b| \gtrsim 1. \tag{23} \]
+Near-optimal parameterization. The full design problem can be written as a minimax fit in the
+log domain [31]:
+    \[ \min_{a_1, \dots, a_N,\, \ln C} \; \sup_{I \in [0, L]} |r(I)|, \qquad r(I) \equiv \ln C - \sum_{k=1}^{N} \ln\!\left[ 1 + (a_k + bI)^2 \right] - (I - L). \tag{24} \]
+This objective is permutation-invariant in the $a_k$'s (ring index k). In practice (and in
+numerical experiments reported below), the optimizer frequently collapses to a
+permutation-symmetric solution
+    \[ a_1 = \cdots = a_N \equiv a, \tag{25} \]
+reducing the design to two parameters (a, b) (plus C). With Eq. (25),
+    \[ \tilde{y}(I) = C\, y(I) = C \left[ \frac{1}{1 + (a + bI)^2} \right]^{N}. \tag{26} \]
+A robust initialization is obtained by placing the midpoint of the interval on the Lorentzian
+half-maximum flank and matching the slope:
+    \[ a + b\,\frac{L}{2} \approx -1, \qquad N b \approx 1. \tag{27} \]
+These two equations already yield a good design; a small (two-parameter) refinement can then
+enforce the desired worst-case tolerance.
+Local expansion and depth scaling. A Taylor expansion of the log-domain residual around the
+flank-centered point $I_0 = L/2$ (with $a + bI_0 = -1$ and $Nb = 1$) shows that the quadratic
+term vanishes identically, leaving a leading cubic residual $r(\delta) \sim \delta^3/(6N^2)$.
+Over $I \in [0, L]$, this implies $E_\infty \sim L^3/N^2$, so that achieving a prescribed
+tolerance $\varepsilon_{\log}$ requires $N \propto L^{3/2}/\sqrt{\varepsilon_{\log}}$, which
+explains the scaling used in Eq. (28). The full derivation is provided in Supplementary
+Sec. S0; an intuitive local-expansion summary appears in Sec. S1.
+Practical engineering estimate. Given L and a target worst-case relative error
+$\varepsilon$, define $\varepsilon_{\log} = \ln(1 + \varepsilon)$. A heuristic engineering
+estimate (not a rigorous bound) that matched our percent-level numerical designs is
+    \[ N \approx \max\!\left( \frac{1}{b_{\max}},\; \kappa\, \frac{L^{3/2}}{\sqrt{\varepsilon_{\log}}} \right), \tag{28} \]
+where $b_{\max}$ is the physically achievable sensitivity bound and $\kappa \simeq 0.07$ for
+the identical-detuning flank design with a minimax refinement. After choosing N, set
+$b = \min(b_{\max}, 1/N)$ and $a = -1 - bL/2$ as initialization, then refine (a, b) by a
+two-parameter minimax fit on $[0, L]$.
+A heuristic conservative screening bound
+$N \ge \lceil (L^2/4 + 1/(2b^2)) / \ln(1 + \varepsilon) \rceil$ (derived via the same
+local-expansion argument; see Supplementary Sec. S1) provides a quick upper estimate but is
+not a rigorous guarantee.
+
+III. NUMERICAL FITS AND VALIDATION
+
+We validate the analytical framework with minimax numerical fits and sampled robustness
+checks. Figure 2 shows the fitted approximation quality at L = 8: the top (linear) panel plots
+N = 1, 3, 5, 7 over $I \in [0, 20]$, the middle (log) panel compares N = 5, 10, 20, 30 on
+$I \in [0, 8]$, and the bottom panel shows the pointwise relative error with the
+characteristic Chebyshev equioscillation pattern. We fit identical-detuning cascades (Eq. 25)
+on $I \in [0, L]$ and compare several depths using a minimax criterion.
+Table I makes the accuracy–depth trade-off explicit at L = 8. A worked input-to-output example
+demonstrating the mapping from an arbitrary input sequence x = [−3.2, 1.2, 4.8, −0.9] through
+the cascade is provided in Supplementary Sec. S2. The example shows that the N = 10 cascade
+keeps the worst-case relative error below 2.7% across all channels.
+Empirical calibration. We calibrate the effective logit range $L_{\mathrm{eff}}$ from
+autoregressive Transformers (distilgpt2/gpt2) [1, 32–35] at context length 128, finding
+$L_{\mathrm{eff},0.999} \approx 7$–$9$ at the 50th–90th percentiles (Supplementary Sec. S2).
+A clipping threshold $t^{*} = -12$ keeps the p99 softmax error below 0.1%. Full protocol
+details, clipping-sweep tables/plots, and per-run statistics are provided in Supplementary
+Sec. S3.
+A synthetic design-space map (Supplementary Table S3) shows that near L ≈ 8, moderate depth
+(N ≥ 10) reaches few-percent error, whereas L ≳ 12 requires deeper cascades. All fits follow
+the same pipeline: minimize the worst-case log-error on a uniform grid, initialize from the
+flank rules in Eq. (27), perform multi-start global search, and apply bounded local
+refinement; implementation details and scripts are provided in a public repository [36]
+(commit: 585e695).
+
+TABLE I: Depth comparison for L = 8 using fitted $\tilde{y}(I) = C[1 + (a + bI)^2]^{-N}$
+(same fitting pipeline for all N).
+
+   N     a          b          max rel. err.   mean rel. err.
+   5     −2.0789    0.21658    10.9%           6.43%
+   10    −1.4588    0.10202    2.68%           1.65%
+   20    −1.2135    0.05025    0.67%           0.42%
+   30    −1.1392    0.03341    0.30%           0.19%
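The depth estimate of Eq. (28) and the flank initialization of Eq. (27) can be sketched as follows. This is an illustration under stated assumptions: rounding the estimate up to an integer is a choice made here, `b_max = 0.25` is a hypothetical sensitivity bound used only to exercise the code, and the function names are not from the paper. The subsequent two-parameter minimax refinement is not shown.

```python
import math

KAPPA = 0.07  # value quoted with Eq. (28) for the identical-detuning flank design

def estimate_depth(L, eps, b_max):
    """Heuristic depth from Eq. (28); rounding up is an assumption of this sketch."""
    eps_log = math.log1p(eps)                        # eps_log = ln(1 + eps)
    n_slope = 1.0 / b_max                            # from the slope condition N*b >~ 1
    n_accuracy = KAPPA * L ** 1.5 / math.sqrt(eps_log)
    return math.ceil(max(n_slope, n_accuracy))

def flank_initialization(N, L, b_max):
    """Starting point (a, b) from Eq. (27), before any minimax refinement."""
    b = min(b_max, 1.0 / N)
    a = -1.0 - b * L / 2.0
    return a, b

# Target: L = 8 with ~2.68% worst-case relative error (the N = 10 row of Table I).
N_est = estimate_depth(L=8.0, eps=0.0268, b_max=0.25)   # b_max is hypothetical
a0, b0 = flank_initialization(N_est, L=8.0, b_max=0.25)
```

For this target the estimate returns N = 10, and the initialization (a, b) = (−1.4, 0.1) lies close to the fitted Table I values (−1.4588, 0.10202), which is consistent with the paper's remark that the flank rules already give a good starting design.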
+
+ TABLE II: Waveguide and ring parameters of the X-cut
+ TFLN micro-ring resonator. Electro-optic electrode
+ parameters are listed separately in Table III.
+
+ Parameter Symbol Value Unit
+ Total TFLN thickness tTFLN 600 nm
+ Etch depth tetch 500 nm
+ Slab thickness tslab 100 nm
+ Waveguide width w 1.4 µm
+ Bend radius R 20 µm
+ Coupling gap g 100 nm
+ Circumference Lring 125.7 µm
+ Free spectral range FSR 8.29 nm
+ Effective index (TE0 ) neff 1.903 —
+ Group index (TE0 ) ng 2.24 —
+ Extraordinary index ne 2.138 —
+
+
+
+ IV. TFLN SINGLE-RING DEVICE DESIGN AND
+ FDTD VALIDATION
+
+ A. Waveguide and ring geometry
+
+
+The device is based on an X-cut thin-film lithium niobate (LiNbO3) on insulator wafer with a
+600 nm-thick LiNbO3 film on SiO2. A 500 nm-deep rib etch defines a 1.4 µm-wide single-mode
+waveguide with a 100 nm unetched slab (Fig. 3). Lumerical MODE simulations yield neff = 1.903
+and ng = 2.24 at λ = 1550 nm for the fundamental TE0 mode.
+The ring resonator (R = 20 µm, Lring = 125.7 µm) is configured as an add-drop resonator with
+100 nm coupling gaps (Fig. 4). The FDTD-measured free spectral range is FSR = 8.29 nm, which
+corresponds to ng ≈ 2.30, slightly above the MODE value of 2.24; the difference is attributed
+to bend-induced dispersion.
+
+
+
+
+FIG. 2: Minimax cascade fits at L = 8. (a) Linear scale: shallow cascades (N = 1, 3, 5, 7)
+over $I \in [0, 20]$. The target $e^{I-L}$ (black) is progressively better matched as N
+increases. (b) Log scale: depth comparison (N = 5, 10, 20, 30) on $I \in [0, 8]$. Inset zooms
+into $I \in [6, 8]$ showing convergence. (c) Pointwise relative error showing the Chebyshev
+equioscillation pattern characteristic of minimax optimality.
+
+FIG. 3: Cross-section of the X-cut TFLN rib waveguide on a SiO2 substrate. The 600 nm LiNbO3
+film is etched 500 nm to form a 1.4 µm-wide single-mode rib waveguide. Lateral signal (S) and
+ground (G) electrode positions are indicated; electrode design details are discussed in
+Sec. IV D.
+
+Table II summarizes the waveguide and ring parameters.
+
+B. 3D FDTD Methodology
+
+The ring resonator response is simulated using Lumerical 3D FDTD with conformal variant 1
+meshing. A broadband TE0 mode source (1530 nm to 1570 nm) is injected into the input bus
+waveguide, and through- and drop-port spectra are recorded. A “z-refined 3-fix” meshing
+strategy ensures convergence in the thin-film geometry [37]; the detailed simulation setup is
+provided in Supplementary Sec. S4 (Table S6).
+
+FIG. 4: Top view of the single add-drop micro-ring resonator used in the 3D FDTD simulation.
+The ring waveguide (R = 20 µm, w = 1.4 µm) is evanescently coupled to input and drop bus
+waveguides through 100 nm gaps at coupling points CP1 and CP2.
+
+C. Single-Ring Add-Drop Results
+
+Figure 5 shows the through- and drop-port spectra from 3D FDTD. Five resonances are resolved
+across 1530 nm to 1570 nm with FSR = 8.29 nm (ng ≈ 2.30).
+
+FIG. 5: Simulated through-port (blue) and drop-port (red) transmission spectra of the single
+add-drop micro-ring resonator from 3D FDTD. Top: logarithmic scale; bottom: linear scale. Five
+resonances are visible with FSR ≈ 8.29 nm.
+
+Lorentzian fitting of the drop-port peaks yields $Q_L$ = 10,300–15,500, with the best
+resonance at λ = 1566 nm reaching $Q_L$ = 15,500 (FWHM = 101 pm, $D_{\max}$ = 0.360,
+−4.4 dB). The through-port extinction ratio is 1.6 dB to 2.6 dB, and the five-resonance mean
+is $Q_L$ = 12,500 ± 1,800 ($D_{\max}$ = 0.29–0.36). CMT analysis of the best resonance gives
+$Q_i = Q_L/(1 - \sqrt{D_{\max}}) = 15{,}500/0.400 \approx 38{,}800$, confirming that the
+500 nm etch provides sufficient confinement and that the 100 nm gap places the ring in the
+coupling-limited regime. The cascade analysis below adopts the best-case FDTD calibration
+($Q_L$ = 15,500, $D_{\max}$ = 0.360); using the five-resonance mean would increase required
+voltages by ∼24% (see Table IV caption).
+The simulation time of 50 ps exceeds the loaded photon lifetime
+$\tau_L = Q_L \lambda_0/(2\pi c) \approx 12.7$ ps by ∼4×, but the intrinsic lifetime
+$\tau_i \approx 32$ ps is comparable, so the extracted $Q_i$ may be slightly conservative. An
+independent eigenmode (FDE) analysis of the same cross-section at R = 20 µm, using a
+300 × 300 mesh (Δy ≈ 10 nm, 5× finer than the FDTD vertical grid), yields
+$Q_{\mathrm{rad+leak}} = 2.4 \times 10^{7}$; including bulk LiNbO3 absorption (Γ = 0.89) gives
+a theoretical $Q_i > 10^{7}$ [37–42], confirming that the gap between the numerical $Q_i$ and
+published values ($> 10^{6}$) originates from mesh discretization (Supplementary S4.5,
+Table S8). In the CMT framework, $D_{\max} = [2\kappa/(2\kappa + \gamma)]^2$ increases as
+$Q_i$ rises; at the present coupling gap, increasing $Q_i$ to $10^{6}$ would raise
+$D_{\max}$ from 0.36 to ∼0.95 and $Q_L$ from 15,500 to ∼25,200.
+Figure 6(a) shows a Lorentzian fit to the best drop-port resonance at λ = 1566 nm, validating
+the cascade model (Eq. 9). Figure 6(b) demonstrates that cascading N copies of this
+FDTD-extracted Lorentzian reproduces the target exponential $e^{I-L}$ with increasing fidelity
+as N grows.
+To validate the cascade prediction directly, a five-ring cascade 3D FDTD simulation was
+performed using Tidy3D [43]; the full simulation notebook is publicly available [43]. The
+$|E|^2$ field at λ = 1549 nm [Fig. 6(d)] confirms resonant excitation across all five rings.
+Mapping the drop-port spectrum onto the control variable I yields 11 data points within the
+AEF operating range [Fig. 6(e, f)], with the FDTD transmission closely tracking the N = 5
+theoretical curve near I ≈ L = 8.
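The coupled-mode arithmetic quoted in this subsection can be reproduced with a few lines. The sketch below assumes symmetric input and drop couplers lumped into a single coupling Q, and assumes that this coupling Q stays fixed when the intrinsic Q changes (the "present coupling gap" condition in the text). The wavelength of 1550 nm used for the lifetimes and all function names are assumptions of this example.

```python
import math

C_LIGHT = 299_792_458.0  # speed of light in vacuum, m/s

def intrinsic_q(q_loaded, d_max):
    """Q_i = Q_L / (1 - sqrt(D_max)), the relation used in the text."""
    return q_loaded / (1.0 - math.sqrt(d_max))

def coupling_q(q_loaded, q_int):
    """Lumped coupling Q from 1/Q_L = 1/Q_i + 1/Q_c."""
    return 1.0 / (1.0 / q_loaded - 1.0 / q_int)

def loaded_q(q_int, q_cpl):
    """Loaded Q for a given intrinsic Q at fixed coupling."""
    return 1.0 / (1.0 / q_int + 1.0 / q_cpl)

def drop_depth(q_loaded, q_int):
    """D_max = (1 - Q_L/Q_i)^2, equivalent to [2k/(2k + gamma)]^2."""
    return (1.0 - q_loaded / q_int) ** 2

def photon_lifetime_ps(q, wavelength_m=1550e-9):
    """tau = Q * lambda0 / (2*pi*c), returned in picoseconds."""
    return q * wavelength_m / (2.0 * math.pi * C_LIGHT) * 1e12

# FDTD-calibrated best resonance.
Q_i = intrinsic_q(15_500, 0.360)
Q_c = coupling_q(15_500, Q_i)

# Projection: raise Q_i to 1e6 while keeping the same coupling gap.
Q_L_high = loaded_q(1e6, Q_c)
D_max_high = drop_depth(Q_L_high, 1e6)

tau_loaded = photon_lifetime_ps(15_500)
tau_intrinsic = photon_lifetime_ps(Q_i)
```

Under these assumptions the script returns an intrinsic Q near 38,800, a projected loaded Q near 25,200 with a drop depth near 0.95, and lifetimes of roughly 13 ps and 32 ps, matching the figures quoted above to within rounding.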
+
+
+
+
+FIG. 6: FDTD-based AEF validation. (a) Lorentzian fit to the drop-port resonance at
+λ = 1566 nm from 3D FDTD (Lumerical) ($Q_L$ = 15,500, $D_{\max}$ = 0.360,
+$b_V$ = 0.180 V$^{-1}$). (b) Five-ring cascade drop-port spectrum near $\lambda_0 \approx$
+1550 nm with Lorentzian$^5$ fit (red curve), confirming the expected $T^5$ line shape.
+(c) Five-ring cascade MRR layout with diagonal zigzag bus waveguides. (d) $|E|^2$ field
+profile at λ = 1549 nm from a five-ring cascade 3D FDTD simulation (Tidy3D [43]). (e, f) AEF
+validation of the five-ring cascade on log (e) and linear (f) scales with 11 spectral FDTD
+data points.
+
+D. X-cut electrode design and EO parameters
+
+We employ lateral signal–ground (S–G) arc electrodes on the slab surface alongside the ring
+waveguide (Fig. 7). In the X-cut orientation, the crystal Z-axis is at 45° from the horizontal
+in the substrate plane, giving a lateral-field projection proportional to cos(θ − 45°) at
+azimuthal angle θ. The cos(θ − 45°) = 0 boundaries at θ = 135° and 315° naturally separate the
+coupling regions from the electrode regions. Each ring carries a full semicircular arc
+electrode on the side opposite to its coupling points, engaging the large
+$r_{33}$ = 30.9 pm V$^{-1}$ Pockels coefficient [37, 38]. The effective EO fill factor follows
+from integrating |cos(θ − 45°)| over the semicircle:
+    \[ f_{\mathrm{EO}} = \frac{1}{\pi} \approx 0.318 \tag{29} \]
+(see Supplementary Sec. S4 for derivation). The electrode gap is $g_{\mathrm{el}}$ = 5 µm
+($d_{\mathrm{eff}} \approx$ 2.5 µm), and the electro-optic overlap integral is
+$\Gamma_{\mathrm{EO}}$ = 0.7. Table III lists the electrode parameters.
+
+TABLE III: Electro-optic electrode parameters for the X-cut TFLN micro-ring with lateral S–G
+arc electrodes.
+
+   Parameter                       Symbol   Value          Unit
+   Crystal orientation             —        X-cut          —
+   EO coefficient                  r33      30.9           pm V−1
+   EO fill factor                  fEO      1/π ≈ 0.318    —
+   EO overlap factor               ΓEO      0.7            —
+   Electrode gap                   gel      5              µm
+   Effective electrode distance    deff     2.5            µm
+
+FIG. 7: Illustrative two-ring cascade layout showing the lateral S–G arc electrode placement
+on X-cut TFLN (the cascade design extends to N rings; this two-ring example clarifies the
+electrode geometry). The crystal Z-axis is oriented at 45° from the horizontal in the
+substrate plane. The cos(θ − 45°) = 0 boundaries at θ = 135° and 315° naturally separate the
+bus-waveguide coupling regions from the electrode semicircles: each ring carries a full
+semicircular arc electrode on the side opposite to its coupling points. The resulting
+effective EO fill factor is fEO = 1/π ≈ 0.318.
+
+E. FDTD-Calibrated bV and Cascade Optimization
+
+From the device parameters in Tables II and III and the FDTD-calibrated ng ≈ 2.30, the
+effective normalized voltage sensitivity is (Supplementary Sec. S4; here dλ/dV = 28.5 pm/V is
+the straight-section value and fEO accounts for partial electrode coverage of the ring
+circumference):
+    \[ b_V = \frac{2\,Q\,(d\lambda/dV)}{\lambda_0}\, f_{\mathrm{EO}} \approx 0.182\ \mathrm{V}^{-1} \tag{30} \]
+at $Q_L$ = 15,500. This estimate relies on a first-order electrostatic model
+($d_{\mathrm{eff}} \approx$ 2.5 µm, $\Gamma_{\mathrm{EO}}$ = 0.7); a ±30% variation in $b_V$
+would shift the cascade depth by one to two rings at constant $\varepsilon_{\max}$ (Table IV),
+leaving the qualitative design conclusions unchanged. With the cascade framework of Sec. II
+(Eqs. 14–18), the N-ring drop-port transmission $\tilde{y}(I) = C[1 + (a + bI)^2]^{-N}$
+approximates $e^{I-L}$ over $I \in [0, L]$, with (a, b) optimized by minimax fitting for
+each N.
+Table IV presents the optimization results for the standard dynamic range L = 8
+($e^{8} \approx 2981$, 34.7 dB).
+
+TABLE IV: Cascade optimization results for L = 8. The bias voltage $V_{\mathrm{bias}} = |a|/b_V$
+sets the DC offset, and $V_{\mathrm{ctrl}} = bL/b_V$ is the maximum control voltage at I = L.
+Voltages computed with $b_V$ = 0.182 V$^{-1}$ (X-cut arc electrode, FDTD-calibrated best
+resonance $Q_L$ = 15,500, ng = 2.30). The mean FDTD quality factor across five resonances is
+$Q_L$ = 12,500 ± 1,800; using the mean would increase voltages by ∼24%.
+
+   N     a          b          E∞        εmax (%)   Vbias (V)   Vctrl (V)
+   5     −2.0789    0.21658    0.1035    10.91      11.4        9.5
+   10    −1.4588    0.10202    0.0265    2.68       8.0         4.5
+   12    −1.3731    0.08450    0.0184    1.86       7.5         3.7
+   20    −1.2136    0.05025    0.0067    0.67       6.7         2.2
+   25    −1.1685    0.04013    0.0043    0.43       6.4         1.8
+   30    −1.141     0.03340    0.0030    0.30       6.3         1.5
+   32    −1.1301    0.03131    0.0026    0.26       6.2         1.4
+
+   a The complete cascade optimization results for all N values are listed in Supplementary
+     Table S7.
+
+The approximation quality across different cascade depths is shown in Fig. 2 (Sec. III). Key
+thresholds (e.g., ε < 2% at N ≥ 12, ε < 1% at N ≥ 17) and the complete optimization results
+are listed in Supplementary Sec. S4.
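The voltage columns of Table IV follow from Eq. (30) and the definitions in the table caption. The sketch below recomputes them for the N = 12 row; the wavelength of 1550 nm inserted into Eq. (30) is an assumption of this example (the calibrated resonance sits at 1566 nm), and the function names are illustrative.

```python
import math

def normalized_voltage_sensitivity(q_loaded, dlambda_dV_m, f_eo, wavelength_m):
    """Eq. (30): b_V = 2 * Q * (dlambda/dV) * f_EO / lambda0, in 1/V."""
    return 2.0 * q_loaded * dlambda_dV_m * f_eo / wavelength_m

def drive_voltages(a, b, L, b_V):
    """Table IV caption: V_bias = |a| / b_V and V_ctrl = b * L / b_V."""
    v_bias = abs(a) / b_V
    v_ctrl = b * L / b_V
    return v_bias, v_ctrl

# Eq. (30) with Q_L = 15,500, dlambda/dV = 28.5 pm/V, f_EO = 1/pi.
# lambda0 = 1550 nm is an assumed value for this sketch.
b_V_est = normalized_voltage_sensitivity(15_500, 28.5e-12, 1.0 / math.pi, 1550e-9)

# N = 12 row of Table IV, using the quoted b_V = 0.182 1/V.
v_bias, v_ctrl = drive_voltages(a=-1.3731, b=0.08450, L=8.0, b_V=0.182)
```

The recomputed sensitivity lands at about 0.18 V⁻¹, and the N = 12 voltages come out near 7.5 V and 3.7 V, as tabulated. Because b_V is proportional to Q, both voltages scale as 1/Q.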
+
+V. PHYSICAL FEASIBILITY
+
+Having established the cascade approximation theory (Sec. II) and the FDTD-calibrated device
+parameters (Sec. IV), we now assess the physical feasibility of the proposed architecture in
+terms of voltage requirements, insertion loss, and energy efficiency.
+
+A. Electro-optic voltage requirements
+
+For the primary target of ε < 2% (N = 12), minimax optimization gives a = −1.373, b = 0.0845.
+With the FDTD-calibrated $Q_L$ = 15,500 ($b_V$ = 0.182 V$^{-1}$), the required voltages are
+    \[ V_{\mathrm{bias}} = \frac{|a|}{b_V} = \frac{1.373}{0.182} = 7.5\ \mathrm{V}, \tag{31} \]
+    \[ V_{\mathrm{ctrl,max}} = \frac{bL}{b_V} = \frac{0.0845 \times 8}{0.182} = 3.7\ \mathrm{V}. \tag{32} \]
+Since $b_V \propto Q$, voltage scales inversely with quality factor:
+    \[ V_{\mathrm{ctrl}} = \frac{bL}{b_V} = \frac{bL\,\lambda_0}{2Q\,|d\lambda_0/dV|}, \tag{33} \]
+where $|d\lambda_0/dV|$ denotes the ring resonance shift per volt, i.e.,
+$f_{\mathrm{EO}}\,(d\lambda/dV)$ in the notation of Eq. (30).
+CMOS-compatible control voltages ($V_{\mathrm{ctrl}}$ < 3.3 V) are achievable at N ≥ 14 with
+$Q_L$ = 15,500; at the design point N = 30 ($\varepsilon_{\max}$ = 0.30%),
+$V_{\mathrm{ctrl}}$ = 1.47 V.
+
+B. Power budget: two-regime analysis
+
+The on-resonance cascade transmission $D_{\max}^{N}$ is the dominant contribution to total
+insertion loss. Table V presents two regimes: the FDTD-characterized regime
+($D_{\max}$ = 0.36) and the fabricated high-Q regime ($D_{\max}$ = 0.95, achievable with
+$Q_i > 10^{6}$ and gap-optimized coupling).
+In the FDTD-characterized regime, $D_{\max}$ = 0.36 limits practical cascades to N ≤ 5: at
+N = 5 the output is 0.61 µW (−22.2 dB) with ε = 10.9%, suited for proof-of-concept
+validation. In the fabricated high-Q regime ($D_{\max} \ge 0.95$), deep cascades become
+practical: N = 30 yields $P_{\mathrm{out}}$ = 21.5 µW (−6.7 dB) with
+$\varepsilon_{\max} \approx 0.30\%$. The transition to fabricated high-Q devices is therefore
+critical for achieving both high accuracy and sufficient output power.
+
+TABLE V: Two-regime power budget for the MRR cascade. $P_{\mathrm{out}}$ assumes per-channel
+input $P_{\mathrm{in,ch}}$ = 100 µW (from a shared $P_{\mathrm{in,tot}}$ = 1 mW CW laser split
+across M = 10 parallel channels via a 1×M splitter, or equivalently multiplexed as d WDM
+channels sharing a single cascade) and accounts only for the ideal on-resonance cascade
+transmission $D_{\max}^{N}$ (upper bound); additional inter-ring coupling loss
+($\eta_{\mathrm{coupling}} \approx 0.9$ per stage, ∼0.46 dB/stage) and off-resonance
+propagation loss (0.08–0.25 dB/stage) are analyzed separately in Sec. V C.
+
+   Regime        Dmax    N     Dmax^N          (dB)     Pout       εmax
+   I (FDTD)      0.36    3     0.0467          −13.3    4.67 µW    ∼15%
+   I (FDTD)      0.36    5     0.00605         −22.2    0.61 µW    10.9%
+   I (FDTD)      0.36    7     7.84 × 10^−4    −31.1    78 nW      ∼5%
+   II (high-Q)   0.95    10    0.599           −2.2     59.9 µW    2.68%
+   II (high-Q)   0.95    20    0.358           −4.5     35.8 µW    0.67%
+   II (high-Q)   0.95    30    0.215           −6.7     21.5 µW    ∼0.30%
+
+   Regime I: FDTD-characterized (Qi = 38,800). Regime II: fabricated high-Q (Qi > 10^6).
+   Pout scales linearly with Pin,ch.
+
+C. Feasibility outlook
+
+Published TFLN micro-ring resonators achieve $Q_i \ge 10^{6}$–$10^{8}$ using optimized
+fabrication [39–42]. At $Q_i = 10^{6}$ with the present coupling geometry, CMT predicts
+$D_{\max} \approx 0.95$ and $Q_L \approx 25{,}200$ (Supplementary Sec. S5, Tables S4–S7),
+enabling deep cascades (N ≤ 30) with sub-percent error. The literature values provide strong
+independent evidence that intrinsic quality factors in the projected range are physically
+achievable in TFLN, albeit with wider waveguides and larger ring radii than the present
+design. Transferring comparable sidewall quality to our geometry (R = 20 µm, W = 1.4 µm) is an
+open fabrication challenge; the projections should be read as design targets contingent on
+achieving it.
+The total insertion loss comprises on-resonance cascade transmission $D_{\max}^{N}$,
+inter-ring coupling loss (∼0.46 dB/stage for the present diagonal-bus layout), off-resonance
+propagation loss (0.08–0.25 dB/stage), and fiber-to-chip coupling (1.5–3.0 dB). For the
+fabricated high-Q regime (N = 30), the total ranges from ∼13 dB (optimized layout) to ∼24 dB
+(current geometry); see Supplementary Sec. S6 for detailed scenarios.
+
+D. Energy comparison
+
+For N = 30 X-cut TFLN micro-ring resonators in the fabricated high-Q regime
+($Q_L \approx 25{,}200$ at $Q_i = 10^{6}$; Supplementary Sec. S5), the three energy components
+are EO tuning ($E_{\mathrm{EO}}$ = 0.22 pJ), amortized laser ($E_{\mathrm{laser}}$ = 0.07 pJ,
+shared across M = 10 channels), and photodetector ($E_{\mathrm{PD}}$ = 0.50 pJ), yielding
+$E_{\mathrm{photonic}}$ = 0.79 pJ (derivations in Supplementary Sec. S7). Including thermal
+stabilization for N = 30 rings (0.15–0.60 pJ; Supplementary Sec. S7), the total rises to
+0.94–1.39 pJ.
+Table S12 compares the photonic cascade with digital implementations. Including thermal
+stabilization (0.94–1.39 pJ), the advantage over INT8 (2.3 pJ) is 1.7–2.4× while operating at
+10 GHz bandwidth, and the 0.79 pJ figure excluding thermal is 58× lower than digital FP32
+(46 pJ). At fabricated Q ≥ 30,000, $E_{\mathrm{EO}}$ drops to 0.16 pJ and
+$E_{\mathrm{total}} \approx 0.73$ pJ (excluding thermal; Supplementary Table S11), recovering
+a 3.2× advantage
+over INT8. Since $E_{\mathrm{EO}} \propto 1/Q^{2}$, improving Q beyond ∼30,000 yields
+diminishing energy returns but continues to relax CMOS driver voltage requirements.
+
+TABLE VI: Energy per exponential operation: single-channel comparison.
+
+   Implementation             E/op (pJ)    Bandwidth   Notes
+   Digital FP32 (Taylor)      ∼46          1 GHz       10 FP MACs
+   Digital INT8 (Taylor)      ∼2.3         1 GHz       10 INT MACs
+   Photonic MRR (N = 30)      0.94–1.39    10 GHz      Analog†
+
+   † 0.79 pJ excluding thermal; 0.94–1.39 pJ including thermal. Self-consistent with
+     fabricated high-Q regime (QL = 25,200); see Supplementary Sec. S7.
+
+VI. DISCUSSION
+
+Practical design procedure. For a given input sequence $x = (x_1, \dots, x_K)$, the design
+proceeds as follows:
+   1. Compute $m = \max_n x_n$, $u_n = x_n - m$, and $L = -\min_n u_n$.
+   2. Map to nonnegative control-signal amplitudes: $I_n = u_n + L \in [0, L]$.
+   3. Choose tolerance ε and set $\varepsilon_{\log} = \ln(1 + \varepsilon)$.
+   4. Select a physically feasible $b_{\max}$ and estimate N using Eq. (28).
+   5. Initialize $b = \min(b_{\max}, 1/N)$ and $a = -1 - bL/2$, then refine (a, b) by a
+      two-parameter minimax fit if required.
+   6. The optical block yields $\tilde{y}(I_n) \approx e^{x_n - m}$, and softmax weights
+      follow as
+    \[ p_n = \frac{\tilde{y}(I_n)}{\sum_j \tilde{y}(I_j)}. \tag{34} \]
+
+with a distinct FSR order of the same ring set, traverse a single N-ring cascade
+simultaneously (Fig. 8). Because each channel $\lambda_j$ sees its own Lorentzian detuning set
+by an independent control voltage $V_j$, the cascade output per channel is
+$\tilde{y}_j = C \prod_{k=1}^{N} T_{\mathrm{drop},k}(\lambda_j, V_j) \approx e^{V_j}$, and all
+d exponentials are computed in parallel on the same physical waveguide. Compared with a 1×M
+power-splitter architecture that replicates the cascade for each channel, the WDM approach
+reduces the total ring count from N × d to N (a factor-d saving) and eliminates the splitter
+insertion loss ($10 \log_{10} d$ dB). At the output, a WDM demultiplexer or
+wavelength-selective photodetector array separates the channels for electrical readout.
+Figure 8 shows a representative chip layout for N = 5 cascade stages and d = 8 WDM channels,
+where alternating U-turn bus connections route the drop-port output of each stage into the
+input bus of the next.
+Why cascade helps. A single Lorentzian in I is too rigid to mimic the log-linear target over a
+wide interval. Cascading turns the transfer into a product; taking a logarithm gives a sum of
+smooth terms, and the approximation improves as N increases. The slope constraint
+$N|b| \gtrsim 1$ is an immediate feasibility check.
+Global softmax normalization via WDM feedback. The WDM-parallel architecture (Fig. 8)
+integrates naturally with a closed-loop normalization scheme to complete the full softmax
+function. After the N-stage cascade, a WDM demultiplexer (e.g., arrayed-waveguide grating or
+ring-filter bank) routes each channel $\lambda_j$ to a dedicated photodetector, producing
+photocurrents $I_{\lambda_j} \propto \tilde{y}_j \approx C\, P_{\mathrm{in}}\, e^{V_j}$. The
+d photocurrents are summed electrically:
+    \[ S = \sum_{j=1}^{d} I_{\lambda_j} \propto C\, P_{\mathrm{in}} \sum_{j=1}^{d} e^{V_j}. \tag{35} \]
+A proportional–integral (PI) controller compares S with a fixed reference
+$S_{\mathrm{ref}}$ and adjusts the shared WDM laser power $P_{\mathrm{in}}$ so that
+$S \to S_{\mathrm{ref}}$ [44, 45]. Because all d channels share the same probe source, scaling
+$P_{\mathrm{in}}$ multiplies every $\tilde{y}_j$ by the same factor; upon convergence
+ Iλj eVj
+ pj = = Pd = softmax(V )j , (36)
+ Scope and limits. The approximation is for a fi- Sref Vk
+ k=1 e
+nite interval I ∈ [0, L], where L is determined by the
+input batch via Eq. (4). In practice, one designs for a realizing the complete softmax with a single feedback loop
+worst-case L expected in operation (or retunes a and and no per-channel normalization circuitry. Compared
+rescales the control signal to adapt L). Noise, insertion with the replicated-cascade approach (one AEF block per
+loss, and control-induced parasitics limit accuracy and channel), WDM feedback offers two additional benefits:
+dynamic range; we treat these effects as platform-specific (i) the splitter-induced power imbalance that would bias
+margins. Detailed non-ideality assumptions, parameter the Iλj ratios is absent, since all channels traverse the
+distributions, and robustness statistics are reported in same optical path; and (ii) a single laser control point
+Supplementary Sec. S8. With K channels in parallel, replaces d independent probe adjustments. Design de-
+one can form softmax by summing channel powers and tails and stability analysis of the PI loop are provided in
+applying a shared reciprocal scale factor, depending on Supplementary Sec. S9.
+the chosen mixed-signal normalization scheme. Beyond ring-resonator AEF implementations, the same
+ WDM parallelism. A particularly hardware-efficient cascade principle can be extended to other cavity-based
+realization exploits wavelength-division multiplexing photonic platforms, such as serial 1D photonic-crystal cav-
+(WDM): d probe wavelengths λ1 , . . . , λd , each resonant ities and other cascaded resonant architectures [21, 46].
+
+What these platforms share is transfer-function shaping through cascaded resonances; loss, tuning range, fabrication
+tolerance, and calibration overhead remain platform-dependent.
+ The insertion loss budget (Sec. V C) and electro-optic voltage requirements (Sec. V A) suggest that the cascade
+architecture is feasible under optimized coupling and layout conditions. Using monolithic TFLN microring data from
+Bahadori et al. [47] (Q ≈ 5432, dλ0/dV ≈ 9–20 pm/V), the normalized sensitivity is bV ≃ 0.063–0.14 V^{−1}, within
+the range required by the cascade design.
+Crystal orientation and electrode design. The X-cut TFLN platform was chosen for several reasons. First, X-cut is
+the prevailing industry standard for integrated TFLN modulators, with well-established fabrication processes and
+commercial wafer availability [37, 38]. Second, the TE0 mode, which is strongly confined in the rib waveguide geometry,
+can engage the large r33 coefficient via lateral electric fields aligned with the crystal Z-axis. In contrast, Z-cut
+geometry with TE polarization can only access the smaller r13 coefficient (∼10 pm/V), resulting in significantly lower
+electro-optic efficiency. The arc electrode design (Sec. IV D) addresses the phase-cancellation problem inherent to
+X-cut circular rings [47] by orienting the crystal Z-axis at 45° from the horizontal in the substrate plane. This
+rotation places the cos(θ − 45°) = 0 boundaries at θ = 135° and 315°, naturally separating the bus-waveguide coupling
+regions from the electrode regions. Each ring carries a full semicircular arc electrode on the side opposite to its
+coupling points, yielding an effective fill factor fEO = 1/π ≈ 0.318. While this reduces the round-trip EO efficiency
+compared to a hypothetical full-circumference design, it preserves the compact footprint of a circular ring resonator.
+The cascade performance can be further improved beyond the R = 20 µm circular-ring design presented here. Increasing
+the ring radius reduces bending loss and raises the intrinsic quality factor Qi, which directly increases bV (∝ Q) and
+lowers the required control voltage. Alternatively, adopting a racetrack geometry with extended straight coupling
+sections strengthens the bus–ring coupling, pushing the drop-port maximum Dmax closer to critical coupling and
+improving the per-stage transfer efficiency. Either approach, or their combination, would yield higher bV and Dmax,
+enabling lower N or tighter approximation accuracy at reduced operating voltages.
+Fabrication considerations. The X-cut TFLN rib waveguide (600 nm total thickness, 500 nm etch, w = 1.4 µm) follows
+established fabrication processes for commercial TFLN wafers on SiO2 [37, 38]. The lateral signal–ground (SG) electrode
+configuration is fabricated in a single metal layer, which is standard in TFLN foundry processes. The primary
+fabrication challenge for the cascade architecture is maintaining uniform coupling gaps (g = 100 nm) across N rings to
+ensure identical Lorentzian transfer functions. Post-fabrication trimming via UV exposure or localized thermal
+oxidation can compensate residual detuning variations [30], as quantified in the Monte Carlo robustness analysis
+(Supplementary Sec. S8). Monte Carlo simulations (Supplementary Sec. S8) show that under nominal non-ideality levels
+(σa = 0.020, σb,rel = 0.020), a single-point calibration of C per chip keeps the median softmax KL divergence below
+2.2 × 10^{−4}, with 95th-percentile max probability error under 0.32%. Even under stress conditions (σa = 0.032),
+95th-percentile errors remain below 0.42%, demonstrating that the identical-detuning design is robust to realistic
+fabrication variations provided a per-chip calibration step is performed. Conversely, if coupling gaps are
+intentionally varied across rings, the per-ring parameters (ak, bk) become independent degrees of freedom. A
+Taylor-expansion analysis shows that K non-identical rings can cancel curvature terms up to order 2K in the Taylor
+series of g(I) = Σ_k ln Tk, one order higher than K identical rings, so that fewer rings suffice for a given error
+target.
+
+ TABLE VII: Summary of evidence levels for key claims.
+
+ Claim                            Evidence               Sec.
+ Cascade → exp. approx.           Analytic               II
+ Depth scaling                    Analytic + num.        II, III
+ QL, Dmax, bV                     3-D FDTD               IV
+ 5-ring line shape                3-D FDTD               IV
+ N ≤ 30 deep cascade              CMT proj.∗             V
+ Energy < 1 pJ                    Estimate               V
+ Full softmax (WDM + feedback)    Conceptual + layout    VI
+ ∗ Based on published Qi ≥ 10^6 values [39, 42] and CMT coupling model.
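+ The normalized-sensitivity range quoted in Sec. VI from the data of Ref. [47] can be reproduced with a short
+calculation. The sketch below assumes the common convention bV = 2Q (dλ0/dV)/λ0 and an operating wavelength of
+λ0 = 1550 nm; neither is stated in this section, and the function name is purely illustrative. Under these
+assumptions the quoted 0.063–0.14 V^{−1} range is recovered.

```python
def normalized_sensitivity(q_factor, dlambda_dv_pm, lambda0_nm=1550.0):
    """Dimensionless detuning per volt, b_V = 2 Q (dlambda0/dV) / lambda0.

    The formula and the 1550 nm default are assumptions made for this sketch.
    """
    dlambda_dv_nm = dlambda_dv_pm * 1e-3   # convert pm/V to nm/V
    return 2.0 * q_factor * dlambda_dv_nm / lambda0_nm

# Q and tuning efficiencies as quoted from Ref. [47] in the text
b_low = normalized_sensitivity(5432, 9.0)
b_high = normalized_sensitivity(5432, 20.0)
print(f"b_V = {b_low:.3f} to {b_high:.3f} per volt")
```

+Because bV scales linearly with Q in this convention, the same function shows how a higher-Q ring lowers the
+voltage swing needed to cover a given interval L.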
+
+ [Figure 8 graphic: "Softmax Full Chip Layout – N = 5 × d = 8 (TFLN)". Labeled elements: WDM input λ1–λ8 at power Pin;
+ d = 8 WDM channels with bias voltages Vλ1–Vλ8; N = 5 cascade stages (n = 1 to 5); WDM demux (AWG / ring filter);
+ photodetectors PD1–PD8 with photocurrents Iλj; summation Σ giving S; error e = Sref − S fed to a PI controller that
+ adjusts Pin; outputs p1–p8 with pj = Iλj/Sref = softmax(V)j.]
+
+FIG. 8: WDM-parallel MRR-AEF system with closed-loop softmax normalization (N = 5 cascade stages, d = 8 WDM
+channels) on X-cut TFLN. A single WDM source (λ1–λ8) enters the top input bus waveguide; each stage applies a
+Lorentzian drop-port transfer, and alternating U-turn connections route the drop-port output into the next stage's
+input bus. Per-channel EO bias voltages (Vλ1–Vλ8) independently tune each column of rings. The final drop output
+passes through a WDM demultiplexer (AWG / ring filter) and is detected by a PD array, producing per-channel
+photocurrents Iλj ∝ e^{Vj}. The photocurrents are summed (Σ) and compared with a reference Sref; a PI controller
+adjusts the shared laser power Pin until S = Sref, at which point each PD output directly yields
+pj = Iλj/Sref = softmax(V)j (Eq. 36).
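+ The feedback normalization of Eqs. (35)–(36) can be illustrated with a small numerical sketch. It assumes an
+idealized, instantaneous plant in which each photocurrent is exactly C · Pin · e^{Vj} (after max-subtraction), and a
+discrete velocity-form PI update of Pin. The gains kp and ki, the step count, and the function name are hypothetical
+choices for illustration; they are not taken from Supplementary Sec. S9.

```python
import math

def softmax_by_power_feedback(V, C=1.0, S_ref=1.0, kp=0.02, ki=0.1, steps=500):
    """Idealized loop: photocurrents I_j = C * P_in * exp(V_j - max V), PI control of P_in.

    The gains are illustrative and chosen so the loop is stable for C = 1 and up to 8 channels.
    """
    v_max = max(V)
    gain = [C * math.exp(v - v_max) for v in V]    # per-channel cascade response
    p_in, err_prev = 0.0, 0.0
    for _ in range(steps):
        S = sum(g * p_in for g in gain)            # summed photocurrent, Eq. (35)
        err = S_ref - S
        p_in += kp * (err - err_prev) + ki * err   # velocity-form PI update of the laser power
        err_prev = err
    return [g * p_in / S_ref for g in gain]        # p_j = I_j / S_ref, Eq. (36)

V = [0.3, -1.2, 2.0, 0.0, -0.5, 1.1, -2.0, 0.7]
p = softmax_by_power_feedback(V)
exact = [math.exp(v) / sum(math.exp(w) for w in V) for v in V]
print(max(abs(a - b) for a, b in zip(p, exact)))
```

+Only the shared power Pin is adjusted, so the ratios between channels are never touched by the controller; the
+loop merely fixes the common scale, which is the property the WDM scheme relies on.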
+
+ VII. CONCLUSION
+
+ We have presented a cascaded micro-ring resonator architecture that approximates the exponential function
+e^{xn − m} on a finite interval [0, L] using multiplicative Lorentzian transfer functions. Increasing the cascade
+depth N systematically reduces the worst-case relative error, and an identical-detuning design initialized by flank
+and slope matching provides a practical two-parameter design.
+ Three-dimensional FDTD simulations of a single X-cut TFLN add-drop ring (R = 20 µm, g = 100 nm) yield
+QL = 10,300–15,500 and Dmax ≈ 0.36, calibrating the cascade transfer model. A five-ring cascade 3D FDTD simulation
+directly validates the multi-ring framework: all five rings exhibit resonant excitation, and mapping the drop-port
+spectrum onto the dimensionless control variable reproduces the theoretical N = 5 curve with ∼11% integrated
+relative-area error over the upper operating range (I ≥ 5.8), providing the first multi-ring confirmation of the
+cascade exponential approximation. At the present FDTD-characterized quality factor, practical cascades are limited to
+N = 5–7 (ε ≲ 11%). If high-Q TFLN resonators reported in the literature (Qi ≥ 10^6, Dmax ≥ 0.95) are realized in the
+cascade geometry, deeper cascades (N ≈ 20–30) would reach sub-percent approximation error with an estimated
+per-operation energy of 0.79 pJ excluding thermal stabilization (0.94–1.39 pJ including it); the latter is 1.7–2.4×
+lower than the ∼2.3 pJ digital INT8 implementation at the 7 nm node. Monte Carlo analysis shows that the
+identical-detuning design tolerates realistic fabrication variations (σa = 0.020, σb,rel = 0.020) with a single
+per-chip calibration, keeping the 95th-percentile softmax probability error below 0.32%.
+ The formulation is not restricted to electro-optic tuning: it requires only a controllable detuning coordinate with
+local linearization, so both Pockels and optical (Kerr/XPM) mechanisms are compatible [37, 38, 47, 48]. We demonstrate
+a photonic exponential block and present a WDM-parallel chip architecture (Fig. 8) in which d wavelength channels share
+a single N-ring cascade, reducing the total ring count by a factor of d and eliminating power-splitter loss. Combined
+with a single-loop PI feedback that adjusts the shared WDM laser power, the architecture realizes the complete softmax
+function (exponentiation, summation, and normalization) without per-channel normalization circuitry. Max-finding and
+digital interfacing remain open for future experimental validation.
+
+
+
+
+ [1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and
+     Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems 30
+     (NeurIPS 2017), pages 5998–6008, 2017.
+ [2] Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient
+     exact attention with IO-awareness. In Advances in Neural Information Processing Systems 35 (NeurIPS 2022),
+     pages 16344–16359, 2022.
+ [3] Neil Savage. Light could lower AI's appetite for power. Nature Nanotechnology, 21:6–8, 2026.
+ [4] Yichen Shen et al. Deep learning with coherent nanophotonic circuits. Nature Photonics, 11(7):441–446, 2017.
+ [5] Johannes Feldmann et al. Parallel convolutional processing using an integrated photonic tensor core. Nature,
+     589(7840):52–58, 2021.
+ [6] Nicholas C. Harris et al. Linear programmable nanophotonic processors. Optica, 5(12):1623–1631, 2018.
+ [7] Bowei Dong, Samarth Aggarwal, Wen Zhou, Utku Emre Ali, Nikolaos Farmakidis, June Sang Lee, Yuhan He, Xuan Li,
+     Dim-Lee Kwong, C. D. Wright, Wolfram H. P. Pernice, and H. Bhaskaran. Higher-dimensional processing using a
+     photonic tensor core with continuous-time data. Nature Photonics, 17(12):1080–1088, 2023.
+ [8] Sudip Shekhar, Wim Bogaerts, Lukas Chrostowski, John E. Bowers, Michael Hochberg, Richard Soref, and
+     Bhavin J. Shastri. Roadmapping the next generation of silicon photonics. Nature Communications, 15:751, 2024.
+ [9] Mario Miscuglio and Volker J. Sorger. Photonic tensor cores for machine learning. Applied Physics Reviews,
+     7(3):031404, 2020.
+[10] Yaowen Hu, Yunxiang Song, Xinrui Zhu, Xiangwen Guo, Shengyuan Lu, Qihang Zhang, Lingyan He, C. A. A. Franken,
+     Keith Powell, Hana Warner, Daniel Assumpcao, Dylan Renaud, Ying Wang, et al. Integrated lithium niobate photonic
+     computing circuit based on efficient and high-speed electro-optic conversion. Nature Communications, 16:8178,
+     2025.
+[11] Priyabrata Dash, Anxiao Jiang, and Dharanidhar Dang. SOFTONIC: A photonic design approach to softmax activation
+     for high-speed fully analog AI acceleration. In Proceedings of the Great Lakes Symposium on VLSI (GLSVLSI '25),
+     pages 118–125, 2025.
+[12] Ziyu Zhan, Hao Wang, Qiang Liu, and Xing Fu. Optoelectronic nonlinear softmax operator based on diffractive
+     neural networks. Optics Express, 32(15):26458–26469, 2024.
+[13] Ye Tian, Shuiying Xiang, Xingxing Guo, Yahui Zhang, Jiashang Xu, Shangxuan Shi, Haowen Zhao, Yizhi Wang,
+     Xinran Niu, Wenzhuo Liu, and Yue Hao. Photonic transformer chip: interference is all you need. PhotoniX, 6:45,
+     2025.
+[14] Jacob R. Stevens, Rangharajan Venkatesan, Steve Dai, Brucek Khailany, and Anand Raghunathan. Softermax:
+     Hardware/software co-design of an efficient softmax for transformers. In Proceedings of the 58th ACM/IEEE Design
+     Automation Conference (DAC), pages 469–474, 2021.
+[15] Nazim Altar Koca, Anh Tuan Do, and Chip-Hong Chang. Hardware-efficient softmax approximation for self-attention
+     networks. In Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS), pages 1–5, 2023.
+[16] Wenxun Wang, Shuchang Zhou, Wenyu Sun, Peiqin Sun, and Yongpan Liu. SOLE: Hardware-software co-design of softmax
+     and layernorm for efficient transformer inference. In Proceedings of the IEEE/ACM International Conference on
+     Computer-Aided Design (ICCAD), pages 1–9, 2023.
+[17] Yuan Zhang, Yonggang Zhang, Lele Peng, Lianghua Quan, Shubin Zheng, Zhonghai Lu, and Hui Chen. Base-2 softmax
+     function: Suitability for training and efficient hardware implementation. IEEE Transactions on Circuits and
+     Systems I: Regular Papers, 69(9):3605–3618, 2022.
+[18] Zhengyu Mei, Hongxi Dong, Yuxuan Wang, and Hongbing Pan. TEA-S: A tiny and efficient architecture for PLAC-based
+     softmax in transformers. IEEE Transactions on Circuits and Systems II: Express Briefs, 70:3594–3598, 2023.
+[19] Ke Chen, Yue Gao, Haroon Waris, Weiqiang Liu, and Fabrizio Lombardi. Approximate softmax functions for
+     energy-efficient deep neural networks. IEEE Transactions on Very Large Scale Integration (VLSI) Systems,
+     31:4–16, 2023.
+[20] Wim Bogaerts et al. Silicon microring resonators. Laser & Photonics Reviews, 6(1):47–73, 2012.
+[21] John E. Heebner, Robert W. Boyd, and Q.-Han Park. Scissor solitons and other propagation effects in
+     microresonator-modified waveguides. Journal of the Optical Society of America B, 19(4):722–731, 2002.
+[22] Jiahui Wang, Sean P. Rodrigues, Ercan M. Dede, and Shanhui Fan. Microring-based programmable coherent optical
+     neural networks. Optics Express, 31(12):18871, 2023.
+[23] Pengxing Guo, Niujie Zhou, Weigang Hou, and Lei Guo. StarLight: a photonic neural network accelerator featuring
+     a hybrid mode-wavelength division multiplexing and photonic nonvolatile memory. Optics Express, 30:37051, 2022.
+[24] Weizhen Yu, Shuang Zheng, Zhenyu Zhao, Bin Wang, and Weifeng Zhang. Reconfigurable low-threshold all-optical
+     nonlinear activation functions based on an add-drop silicon microring resonator. IEEE Photonics Journal,
+     14(6):1–7, 2022.
+[25] Bahaa E. A. Saleh and Malvin C. Teich. Fundamentals of Photonics. Wiley, Hoboken, NJ, 2nd edition, 2007.
+[26] Vítor R. Almeida, Carlos A. Barrios, Roberto R. Panepucci, and Michal Lipson. All-optical control of light on a
+     silicon chip. Nature, 431(7012):1081–1084, 2004.
+[27] Qianfan Xu, Bradley Schmidt, Sameer Pradhan, and Michal Lipson. Micrometre-scale silicon electro-optic modulator.
+     Nature, 435(7040):325–327, 2005.
+[28] Kishore Padmaraju and Keren Bergman. Resolving the thermal challenges for silicon microring resonator devices.
+     Nanophotonics, 3:269–281, 2014.
+[29] Erwen Li, Behzad Ashrafi Nia, Bokun Zhou, and Alan X. Wang. Transparent conductive oxide-gated silicon microring
+     with extreme resonance wavelength tunability. Photonics Research, 7(4):473, 2019.
+[30] Lahiru Jayatilleka et al. Post-fabrication trimming of silicon photonic ring resonators at wafer-scale. Journal
+     of Lightwave Technology, 39:5083–5088, 2021.
+[31] Elliott W. Cheney. Introduction to Approximation Theory. McGraw–Hill, New York, 1966.
+[32] Alec Radford et al. Language models are unsupervised multitask learners. Technical report, OpenAI, 2019.
+[33] Hugging Face. distilgpt2 model card, 2025. Accessed 2026-02-21.
+[34] Andrej Karpathy. Tiny shakespeare dataset (char-rnn), 2025. Accessed 2026-02-21.
+[35] Jane Austen. Pride and prejudice. Project Gutenberg eBook No. 1342, 2025. Accessed 2026-02-21.
+[36] Hyoseok Park. MRR-AEF: reproducible MRR depth-sweep fitting and supplementary validation scripts. GitHub
+     repository, 2025. Commit 585e695, accessed 2026-02-21.
+[37] Di Zhu et al. Integrated photonics on thin-film lithium niobate. Advances in Optics and Photonics,
+     13(2):242–352, 2021.
+[38] Yaowen Hu, Di Zhu, Shengyuan Lu, Xinrui Zhu, Yunxiang Song, Dylan Renaud, Daniel Assumpcao, Rebecca Cheng,
+     CJ Xin, Matthew Yeh, Hana Warner, Xiangwen Guo, Amirhassan Shams-Ansari, David Barton, Neil Sinclair, and
+     Marko Lončar. Integrated electro-optics on thin-film lithium niobate. Nature Reviews Physics, 2025.
+[39] Mian Zhang, Cheng Wang, Rebecca Cheng, Amirhassan Shams-Ansari, and Marko Lončar. Monolithic ultra-high-Q
+     lithium niobate microring resonator. Optica, 4(12):1536–1537, 2017.
+[40] Rongjin Zhuang, Jinze He, Yifan Qi, and Yang Li. High-Q thin-film lithium niobate microrings fabricated with wet
+     etching. Adv. Mater., 35(3):2208113, 2023.
+[41] Xinrui Zhu, Yaowen Hu, Shengyuan Lu, Hana K. Warner, Xudong Li, Yunxiang Song, Letícia S. Magalhães,
+     Amirhassan Shams-Ansari, Neil Sinclair, and Marko Lončar. Twenty-nine million intrinsic Q-factor monolithic
+     microresonators on thin-film lithium niobate. Photon. Res., 12(8):A63–A68, 2024.
+[42] Renhong Gao, Ni Yao, Jianglin Guan, Li Deng, Jintian Lin, Min Wang, Lingling Qiao, Wei Fang, and Ya Cheng.
+     Lithium niobate microring with ultra-high Q factor above 10^8. Chin. Opt. Lett., 20(1):011902, 2022.
+[43] Flexcompute Inc. Tidy3D: electromagnetic simulation software. https://www.flexcompute.com/tidy3d/, 2024. v2.10;
+     cloud GPU FDTD. Accompanying notebook:
+     https://www.flexcompute.com/tidy3d/community/notebooks/CascadedMRRTFLN/.
+[44] John K. Doylend, Paul E. Jessop, and Andrew P. Knights. Silicon photonic dynamic optical channel leveler with
+     external feedback loop. Optics Express, 18(13):13805–13812, 2010.
+[45] Karl J. Åström and Richard M. Murray. Feedback Systems: An Introduction for Scientists and Engineers. Princeton
+     University Press, Princeton, NJ, 2008.
+[46] Amnon Yariv, Yong Xu, Reginald K. Lee, and Axel Scherer. Coupled-resonator optical waveguide: a proposal and
+     analysis. Optics Letters, 24(11):711–713, 1999.
+[47] Meisam Bahadori, Yansong Yang, Ahmed E. Hassanien, Lynford L. Goddard, and Songbin Gong. Ultra-efficient and
+     fully isotropic monolithic microring modulators in a thin-film lithium niobate photonics platform. Optics
+     Express, 28(20):29644–29661, 2020.
+[48] Abu Naim R. Ahmed, Shouyuan Shi, Mathew Zablocki, Peng Yao, and Dennis W. Prather. Tunable hybrid silicon
+     nitride and thin-film lithium niobate electro-optic microresonator. Optics Letters, 44(3):618, 2019.
+
+ SUPPLEMENTARY INFORMATION
+
+Supplementary material accompanying “Photonic Exponential Approximation via Cascaded TFLN Microring Resonators
+toward Softmax.”
+
+
+ S0. RIGOROUS DERIVATION AND VALIDITY SCOPE
+
+ This section derives the depth-scaling relations and screening bounds used in the main text, and states the assumptions
+under which they apply, together with validity scope and failure cases. We separate proved statements (Lemma,
+Proposition, Theorem) from heuristic engineering estimates that rely on empirical calibration.
+
+
+ S0.1 Assumptions
+
+Assumption 1 (Lorentzian single-ring transfer). Each ring k has a normalized add-to-drop transmission of the form
+Tk(I) = [1 + (ak + bI)^2]^{−1}, where ak ∈ R is the dimensionless static detuning, b > 0 is the common normalized
+sensitivity, and I ≥ 0 is a nonnegative control-signal amplitude.
+Assumption 2 (Multiplicative cascade). The N rings are cascaded in a serial add-drop topology (the drop output of
+ring k feeds the add input of ring k+1), and the probe is sufficiently weak that cross-ring and nonlinear probe-induced
+effects are negligible; thus the total normalized transmission is y(I) = ∏_{k=1}^{N} Tk(I).
+Assumption 3 (Identical-detuning family). All rings share the same static detuning: a1 = · · · = aN ≡ a. This reduces
+the design space to (a, b) and a global scale C > 0; the scaled output is ỹ(I) = C [1 + (a + bI)^2]^{−N}.
+Assumption 4 (Linear control-to-resonance mapping). Within the operating range I ∈ [0, L], the resonance shift is
+a linear function of the control-signal amplitude (Eq. (10) of main text), i.e., higher-order detuning nonlinearity is
+negligible.
+Assumption 5 (Finite interval and bounded L). The approximation target is f(I) = e^{I−L} on a finite interval
+I ∈ [0, L] with L > 0 determined by the input batch (L = max(x) − min(x)). The depth-scaling results are derived for
+fixed, finite L.
+Assumption 6 (Flank-centered operating regime). The design uses the "flank-centered" initialization: a + b(L/2) = −1
+(midpoint on the Lorentzian half-maximum) and N b = 1 (slope matching). This places the operating point in the
+steepest-slope region of the Lorentzian, where the log-transfer is most nearly linear.
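+ Assumptions 1–6 together define a concrete transfer model that is easy to evaluate numerically. The sketch below
+computes the worst-case log-error of the flank-centered identical-detuning cascade on [0, L] after centering with the
+optimal scale C, and compares it with the leading-order estimate L^3/(48 N^2) derived later in this supplement. The
+function names, the grid size, and the example values of N and L are illustrative choices, not the authors' scripts.

```python
import math

def log_transfer(I, a, b, N):
    """ln y(I) for N identical Lorentzian stages (Assumptions 1-3), without the scale C."""
    return -N * math.log(1.0 + (a + b * I) ** 2)

def worst_case_log_error(N, L, samples=2001):
    """E_inf on [0, L] for the flank-centered design (Assumption 6) with the optimal scale C."""
    b = 1.0 / N                      # slope matching, N b = 1
    a = -1.0 - b * L / 2.0           # the interval midpoint sits at u = -1
    g = []
    for i in range(samples):
        I = L * i / (samples - 1)
        g.append(log_transfer(I, a, b, N) - (I - L))
    return (max(g) - min(g)) / 2.0   # centering with ln C leaves half of the spread

for N in (5, 10, 30):
    print(N, worst_case_log_error(N, L=4.0), 4.0 ** 3 / (48 * N * N))
```

+The two printed columns should approach each other as N grows, since the neglected higher-order terms shrink
+faster than the cubic one.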
+
+
+ S0.2 Rigorous results
+
+ Throughout, define the log-domain residual
+
+      r(I) \equiv \ln\tilde{y}(I) - (I - L) = \ln C - N \ln\left[1 + (a + bI)^2\right] - (I - L),    (S0.1)
+
+and the worst-case log-error E∞ = sup_{I∈[0,L]} |r(I)|. We set C to the minimax-optimal value
+ln C⋆ = −[max_I g(I) + min_I g(I)]/2, where g(I) = ln y(I) − (I − L), throughout.
+Lemma 1 (Slope bound, rigorous). Under Assumptions 1–3, for every I ≥ 0,
+
+      \left| \frac{d}{dI} \ln y(I) \right| \le N|b|.
+
+Proof. From Assumption 3, ln y(I) = −N ln[1 + (a + bI)^2]. Differentiating:
+
+      \frac{d}{dI} \ln y(I) = -N \, \frac{2b(a + bI)}{1 + (a + bI)^2}.
+
+Let u ≡ a + bI ∈ R. The function h(u) = 2u/(1 + u^2) satisfies |h(u)| ≤ 1 for all u ∈ R (since 1 + u^2 ≥ 2|u| by
+AM–GM). Therefore |d(ln y)/dI| = N|b| |h(u)| ≤ N|b|.
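+ The slope bound of Lemma 1 is simple to check numerically. The sketch below evaluates the closed-form log-slope on
+a grid and confirms that its magnitude never exceeds N|b|, with the bound approached where a + bI = ±1. The values of
+N, a, and b are arbitrary illustrative numbers, not parameters from the paper.

```python
def log_slope(I, a, b, N):
    """d/dI ln y(I) for N identical Lorentzian stages: -N * 2 b u / (1 + u^2) with u = a + b I."""
    u = a + b * I
    return -N * 2.0 * b * u / (1.0 + u * u)

N, a, b = 12, -2.5, 0.08                       # illustrative values only
grid = [0.05 * k for k in range(1001)]         # control amplitudes I in [0, 50]
slopes = [abs(log_slope(I, a, b, N)) for I in grid]
print(max(slopes), "<=", N * abs(b))
```

+A design with N|b| < 1 can therefore be rejected before any fitting, because its log-transfer can never reach the
+unit slope of the target.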
+
+Remark 1 (Necessary condition for approximation). Since the target ln f (I) = I − L has constant slope +1, a
+necessary condition for the cascade log-transfer to track this slope at any point is N |b| ≥ 1. This is Eq. (23) of the
+main text and is a rigorous (not heuristic) necessary condition.
+Proposition 1 (Log-domain Taylor expansion at flank center). Under Assumptions 1–6, define I0 = L/2 and
+δ = I − I0. Then
+
+      \ln\tilde{y}(I) = \mathrm{const} + \delta - \frac{\delta^3}{6N^2} + R_4(\delta),    (S0.2)
+
+where |R4(δ)| ≤ M4 δ^4 with M4 = N|b|^4 · sup_{|δ|≤L/2} |ϕ^{(4)}(u(δ))| and ϕ(u) = −ln(1 + u^2). In particular, the
+quadratic term vanishes identically at the flank point u0 = a + bI0 = −1.
+Proof. Set u(δ) = a + b(I0 + δ) = −1 + bδ (using a + bI0 = −1). With ϕ(u) = −ln(1 + u^2), ln y(I) = N ϕ(u(δ))
+and u′(δ) = b. Compute derivatives of ϕ at u0 = −1:
+
+      \phi'(u) = -\frac{2u}{1 + u^2},                \phi'(-1) = 1,
+      \phi''(u) = \frac{2(u^2 - 1)}{(1 + u^2)^2},    \phi''(-1) = 0,
+      \phi'''(u) = \frac{4u(3 - u^2)}{(1 + u^2)^3},  \phi'''(-1) = \frac{4(-1)(3 - 1)}{(1 + 1)^3} = -1.
+
+The negative sign of ϕ′′′(−1) is expected: by Lemma 1, ϕ′ attains its maximum value 1 at u = −1. By the chain rule,
+writing F(δ) = N ϕ(u(δ)):
+
+      F'(0) = N b \, \phi'(-1) = N b = 1,
+      F''(0) = N b^2 \, \phi''(-1) = 0,
+      F'''(0) = N b^3 \, \phi'''(-1) = -N b^3 = -\frac{1}{N^2},
+
+where we used b = 1/N from Assumption 6 in the last step. Hence the Taylor expansion with the minimax-optimal C
+is
+
+      \ln\tilde{y}(I) = \mathrm{const} + \delta + 0 \cdot \frac{\delta^2}{2} - \frac{1}{N^2} \cdot \frac{\delta^3}{6} + R_4(\delta).
+
+Subtracting the target δ (the linear part of I − L around I0) gives a leading residual −δ^3/(6N^2), of magnitude
+|δ|^3/(6N^2). The remainder is bounded by the standard Taylor remainder estimate.
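+ The cancellation of the quadratic term and the size of the cubic residual in Proposition 1 can be checked directly.
+The sketch below evaluates the exact log-transfer about the flank point with b = 1/N, removes the constant and linear
+parts, and compares the magnitude of what remains with |δ|^3/(6N^2). The depth N = 20 and the sample offsets are
+illustrative choices.

```python
import math

def F(delta, N):
    """Cascade log-transfer N * phi(u) about the flank point, u = -1 + delta / N (b = 1/N)."""
    u = -1.0 + delta / N
    return -N * math.log(1.0 + u * u)

N = 20
rows = []
for delta in (-2.0, -1.0, 1.0, 2.0):
    residual = F(delta, N) - F(0.0, N) - delta       # remainder after constant and linear terms
    cubic = abs(delta) ** 3 / (6 * N * N)            # leading-order magnitude from Proposition 1
    rows.append((delta, abs(residual), cubic))
    print(delta, abs(residual), cubic)
```

+The small mismatch between the two columns is the fourth-order remainder R4, which is why the depth-scaling law
+that follows is presented as a leading-order estimate.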
+Theorem 1 (Heuristic depth-scaling law). Under Assumptions 1–6 and ignoring the fourth-order remainder R4, the
+leading-order worst-case log-error on I ∈ [0, L] satisfies
+
+      E_\infty^{(\mathrm{leading})} \sim \frac{1}{6N^2} \left(\frac{L}{2}\right)^3 = \frac{L^3}{48 N^2}.    (S0.3)
+
+Setting E∞^{(leading)} ≤ εlog = ln(1 + ε) and solving for N gives
+
+      N \ge \frac{L^{3/2}}{\sqrt{48\,\varepsilon_{\log}}}.    (S0.4)
+
+Derivation (heuristic). From Proposition 1, the residual with respect to the target is dominated in magnitude by
+|δ|^3/(6N^2) for |δ| ≤ L/2. The maximum of |δ|^3 on [−L/2, L/2] is (L/2)^3. Setting the bound equal to εlog and solving:
+
+      \frac{L^3}{48 N^2} \le \varepsilon_{\log} \implies N \ge \frac{L^{3/2}}{\sqrt{48\,\varepsilon_{\log}}}.
+
+With 1/√48 ≈ 0.144, and accounting for the fact that the minimax-optimal residual is typically smaller than the
+one-sided Taylor bound by a factor of ∼2 (equi-oscillation), the effective prefactor becomes κ ≈ 0.07, yielding the
+main-text engineering estimate Eq. (28): N ≈ ⌈max(1/bmax, κ L^{3/2}/√εlog)⌉.
+Remark 2 (Status of Theorem 1). This is a heuristic scaling law, not a rigorous minimax guarantee. The
+derivation truncates the Taylor series at third order and approximates the equi-oscillation factor empirically (κ ≈ 0.07).
+For a rigorous bound one would need explicit control of R4 over the full interval [0, L], which depends on L, N, and
+higher derivatives of the Lorentzian; we do not claim such a bound here. The scaling N ∼ L^{3/2}/√εlog is supported by
+numerical evidence (Table I) but should be treated as an engineering design rule.
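+ The six-step design procedure of main-text Sec. VI, combined with the Eq. (28) depth estimate above, can be
+sketched end to end as follows. This is a minimal illustration under stated assumptions: it uses the flank-centered
+initialization without the optional minimax refit of (a, b), takes κ = 0.07 as quoted, and treats the cascade as the
+ideal Lorentzian product of Assumptions 1–3. The function names, the example input x, and the value bmax = 0.1 are
+hypothetical.

```python
import math

def design_cascade(x, eps, b_max, kappa=0.07):
    """Steps 1-5 of the design procedure with the Eq. (28) depth estimate (no minimax refit)."""
    m = max(x)
    u = [xn - m for xn in x]                   # shifted inputs, all <= 0
    L = -min(u)                                # interval length
    I = [un + L for un in u]                   # control amplitudes in [0, L]
    eps_log = math.log(1.0 + eps)
    N = math.ceil(max(1.0 / b_max, kappa * L ** 1.5 / math.sqrt(eps_log)))
    b = min(b_max, 1.0 / N)
    a = -1.0 - b * L / 2.0                     # flank-centered initialization
    return I, N, a, b

def cascade_softmax(x, eps=0.01, b_max=0.1):
    """Step 6: normalize the cascade outputs; the global scale C cancels in the ratio."""
    I, N, a, b = design_cascade(x, eps, b_max)
    y = [(1.0 + (a + b * In) ** 2) ** (-N) for In in I]
    total = sum(y)
    return [yn / total for yn in y], N

x = [2.0, 0.5, -1.0, 1.2]
p, N = cascade_softmax(x)
exact = [math.exp(v - max(x)) / sum(math.exp(w - max(x)) for w in x) for v in x]
print(N, max(abs(pa - pe) for pa, pe in zip(p, exact)))
```

+Because softmax only depends on ratios of the outputs, the scale C never has to be computed in this sketch; in
+hardware the same role is played by the normalization stage.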
+
+ S0.3 Derivation of the conservative screening bound
+
+ We now derive the conservative screening bound (Eqs. S0.7–S0.8 below), which is stated inline in Sec. II of the main
+text.
+Proposition 2 (Conservative log-error bound). Under Assumptions 1–5 (identical detuning, but not restricted to the
+flank-centered choice), fix b > 0 and choose the normalization ỹ(L) = 1. Define ϕ(u) = −ln(1 + u^2) and write
+
+      \ln\tilde{y}(I) = N \left[ \phi(a + bI) - \phi(a + bL) \right].
+
+The target in this normalization is (I − L). Denoting the residual r(I) = ln ỹ(I) − (I − L), we have r(L) = 0 and
+r(0) = N[ϕ(a) − ϕ(a + bL)] + L.
+ For any choice of a such that the operating range {a + bI : I ∈ [0, L]} lies in the region where ϕ is concave (i.e.,
+ϕ′′(u) ≤ 0 throughout), the worst-case log-error satisfies
+
+      E_\infty \le \frac{N \|\phi''\|_\infty b^2 L^2}{8} + \frac{\left| N \phi'(a + bL) \cdot b - 1 \right|}{2} \cdot L,    (S0.5)
+
+where ∥ϕ′′∥∞ = sup_{u∈[a, a+bL]} |ϕ′′(u)|.
+Derivation sketch. Write h(I) ≡ N ϕ(a + bI). The slope is h′(I) = N b ϕ′(a + bI). At I = L, we want the slope to
+match the target slope 1; define the slope mismatch ∆s ≡ h′(L) − 1 = N b ϕ′(a + bL) − 1. By the fundamental theorem
+of calculus on [I, L]:
+
+      r(I) - r(L) = r(I) = \left[ h(I) - h(L) \right] - (I - L) = \int_I^L \left[ 1 - h'(t) \right] dt.
+
+Write 1 − h′(t) = (1 − h′(L)) + (h′(L) − h′(t)) = −∆s + ∫_t^L h′′(s) ds. Since h′′(s) = N b^2 ϕ′′(a + bs), we bound
+|h′′(s)| ≤ N b^2 ∥ϕ′′∥∞. Integrating twice and applying the triangle inequality gives (S0.5).
+Corollary 1 (Main-text conservative bound). Under slope matching at I = L (i.e., N b ϕ′(a + bL) = 1, so ∆s = 0),
+and using ∥ϕ′′∥∞ ≤ 2 (which holds since |ϕ′′(u)| = |2(u^2 − 1)/(1 + u^2)^2| ≤ 2 for all u ∈ R), the bound simplifies to
+
+      E_\infty \le \frac{N b^2 L^2}{4}.    (S0.6)
+
+Using b = 1/N (the slope-matching choice from N b = 1) gives E∞ ≤ L^2/(4N). If instead we retain a general b but add
+the penalty from imperfect slope matching (e.g., from the constraint b ≤ bmax), a combined conservative bound is
+
+      E_\infty \le \frac{L^2}{4N} + \frac{1}{2 b^2 N},    (S0.7)
+
+which provides a conservative heuristic bound on the log-error. Setting this ≤ ln(1 + ε) and solving for N yields the
+conservative screening depth:
+
+      N_{\mathrm{safe}} \ge \left\lceil \frac{L^2/4 + 1/(2b^2)}{\ln(1 + \varepsilon)} \right\rceil.    (S0.8)
+
+
+Remark 3 (Status of the conservative bound). Equation (S0.7) is a conservative heuristic design rule. It is
+conservative because: (i) we use a global upper bound ∥ϕ′′ ∥∞ ≤ 2 instead of the actual curvature, (ii) we do not exploit
+the minimax-optimal C shift. It is heuristic (not a formal guarantee) because: (i) the derivation assumes the operating
+range lies in the concavity region of ϕ, which may not hold for all detuning choices; (ii) the second term 1/(2b2 N )
+arises from a simplified penalty model for flank-curvature mismatch that has not been proved to be a rigorous upper
+bound in all parameter regimes. Nsafe from Eq. (S0.8) is therefore a screening estimate, suitable for preliminary
+design-space exploration but not a certified minimax guarantee.
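+ To see how far apart the screening depth and the engineering estimate can lie, both can be evaluated side by side.
+The sketch below implements Eq. (S0.8) and the main-text Eq. (28) as written; the input values L = 4, b = bmax = 0.1,
+and ε = 1% are illustrative only, and the function names are hypothetical.

```python
import math

def n_safe(L, b, eps):
    """Conservative screening depth from Eq. (S0.8)."""
    return math.ceil((L * L / 4.0 + 1.0 / (2.0 * b * b)) / math.log(1.0 + eps))

def n_heuristic(L, b_max, eps, kappa=0.07):
    """Engineering estimate of main-text Eq. (28), for comparison."""
    return math.ceil(max(1.0 / b_max, kappa * L ** 1.5 / math.sqrt(math.log(1.0 + eps))))

L, b, eps = 4.0, 0.1, 0.01           # illustrative values, not taken from the paper
print("N_safe =", n_safe(L, b, eps), " N_heuristic =", n_heuristic(L, b, eps))
```

+For small b the 1/(2b^2) term dominates Eq. (S0.8), so the screening depth can exceed the engineering estimate by
+orders of magnitude; this is consistent with its stated role as a backstop rather than a design target.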
+
+
+ S0.4 Validity scope and failure cases
+
+ The derivations above hold under the stated assumptions. We now identify the regimes where each assumption may
+break down.
+
+(V1) Lorentzian model (A1). The single-ring Lorentzian form T = [1+(a+bI)2 ]−1 is a near-resonance approximation
+ valid when the probe frequency is within a few linewidths of the resonance. Far from resonance, higher-order
+ dispersion, Fano interference, or multi-mode effects introduce deviations. Failure case: operation with very large
+ detuning (|a + bI| ≫ 1 across the full interval), where the Lorentzian tails may not be accurate for high-Q rings.
+
+(V2) Multiplicative cascade (A2). Requires that inter-ring reflections and back-scattering are negligible (forward-
+ propagating coupling only, i.e. negligible back-reflection at each inter-ring junction). Failure case: very high ring
+ count N with non-negligible back-reflection per stage, which can produce Fabry–Pérot-like ripples in the cascade
+ transfer function.
+
+(V3) Identical-detuning family (A3). The Taylor expansion and conservative bound both assume a1 = · · · = aN .
+ In practice, fabrication variations introduce per-ring detuning spread σa . The Monte Carlo analysis in Sec. S8
+ quantifies robustness, but the analytical bounds (S0.2)–(S0.8) are strictly valid only for identical detuning.
+(V4) Linear control-to-resonance mapping (A4). The linearized model ω0(I) = ω0^{(0)} + ηI introduces systematic
+ error at large control amplitudes. For carrier-injection (free-carrier plasma effect) or thermal tuning over wide
+ ranges, second-order nonlinearity in the control-to-detuning mapping can exceed 1%. Failure case: large L
+ requiring a control swing exceeding the linearity range of the tuning mechanism.
+
+(V5) Finite interval (A5). All bounds scale with L (typically as L^2 or L^{3/2}). As L → ∞, N grows without bound
+ and insertion loss accumulates (∼ N · ILstage ), eventually degrading the probe SNR below the useful regime.
+ There is no finite N that works for all L simultaneously. Practical regime: L ≲ 10–12 (consistent with Leff at
+ p90–p95 from Sec. S3) is the primary target; L ≳ 16 requires N ≳ 30 even for moderate tolerance, pushing loss
+ budgets.
+
+(V6) Flank-centered initialization (A6). The Taylor-based scaling (Theorem 1) relies on the cancellation
+ ϕ′′ (−1) = 0 at the half-maximum point. If the operating point deviates (e.g., due to fabrication offset pushing
+ a + bI0 away from −1), a nonzero quadratic residual appears and the effective scaling worsens to E∞ ∼ L^2/N
+ rather than L^3/N^2. Mitigation: heater/bias trimming to restore the flank condition.
+
+
+ S0.5 Mapping to main-text equations
+
+For reference, the results derived here correspond to the following main-text equations:
+
+ • Slope bound (Lemma 1): rigorous; corresponds to main-text Eqs. (22)–(23). This is a guaranteed necessary
+ condition.
+
+ • Engineering N -estimate (Theorem 1): heuristic scaling with empirical prefactor κ ≈ 0.07; corresponds to
+ main-text Eq. (28). This is a heuristic design rule calibrated against numerical fits.
+
+ • Conservative bound (Corollary 1): conservative but not rigorously certified as a minimax upper bound; derived
+ as Eq. (S0.7) in this supplement, stated inline in Sec. II. This is a conservative heuristic screening condition.
+
+ • Nsafe (Corollary 1, Eq. S0.8): the safe screening depth derived from the conservative bound; derived as Eq. (S0.8)
+ in this supplement, stated inline in Sec. II. This is a conservative backstop estimate for preliminary design.
+
+Summary of guarantee status:
+Result                                  Status                                         Main-text Eq.
+Slope bound N|b| ≥ 1                    Rigorous (proved)                              (23)
+Scaling N ∼ κ L^{3/2}/√εlog             Heuristic (Taylor truncation + empirical κ)    (28)
+Bound E∞ ≤ L^2/(4N) + 1/(2b^2 N)        Conservative heuristic                         (S0.7)
+Nsafe screening depth                   Conservative backstop                          (S0.8)
+
+
+ S1. DEPTH-SCALING DERIVATION AND CONSERVATIVE SCREENING BOUND
+
+ This section provides the detailed derivations underlying the depth-scaling relations and conservative screening
+bounds summarized in the main text (Sec. II). These results complement the rigorous treatment in Sec. S0.
+
+ S1.1 Local expansion and exponential-like behavior
+
+ To provide immediate local intuition (without changing the global minimax objective), let δ = I − I0 around the
+flank-centered point I0 = L/2 and impose a + bI0 = −1. With the local normalization C = 2^N (so that ỹ(I0) = 1), a
+third-order expansion of ỹ(I) = C[1 + (a + bI)^2]^{−N} gives
+
+    \tilde{y}(I) \approx 1 + N b\,\delta + \frac{N^2}{2}\, b^2 \delta^2 + \frac{N(N^2-1)}{6}\, b^3 \delta^3 + O(\delta^4),    (S1.1)
+
+so with b ∼ 1/N, the linear and quadratic coefficients align with those of e^δ = 1 + δ + δ^2/2 + δ^3/6 + · · · , explaining
+why the initialization is already close before refinement.
+
+
+ S1.2 Log-domain analysis and scaling derivation
+
+ For depth scaling, the logarithmic domain is more transparent. Under the same flank centering (a + bI0 = −1),
+expand around I0 = L/2 with δ = I − I0 to obtain
+
+    \ln \tilde{y}(I) = \mathrm{const} + N b\,\delta - \frac{N b^3}{6}\, \delta^3 + O(\delta^4).    (S1.2)
+
+At a + bI0 = −1, the quadratic term cancels identically in the log expansion; imposing slope matching (N b = 1) gives
+
+    \ln \tilde{y}(I) = \mathrm{const} + \delta - \frac{\delta^3}{6 N^2} + O(\delta^4).    (S1.3)
+
+The cubic coefficient is negative, consistent with the N(N^2 − 1)/6 term of Eq. (S1.1). Hence the leading log-domain
+residual scales as |r(δ)| ∼ |δ|^3/N^2. Over I ∈ [0, L] with |δ| ≤ L/2, this implies E∞ ∼ L^3/N^2.
+Requiring E∞ ≤ εlog leads to
+
+    N \propto \frac{L^{3/2}}{\sqrt{\varepsilon_{\log}}},    (S1.4)
+
+which explains the scaling used in the main-text engineering estimate (Eq. (28)). This derivation is heuristic (not a
+formal guarantee), and the prefactor remains platform- and fitting-criterion dependent.
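+
+ As an illustrative numerical check (a sketch, not part of the original analysis), the snippet below evaluates ln ỹ
+exactly for the flank-centered, slope-matched case (a + bI0 = −1, N b = 1, C = 2^N) and compares the magnitude of its
+deviation from δ with the cubic estimate |δ|^3/(6N^2). The function names are hypothetical, and the sign of the
+residual is left to the computation rather than assumed.

```python
import math


def log_cascade(delta: float, n_rings: int) -> float:
    """ln y~ at I = I0 + delta for a + b*I0 = -1, N*b = 1 and C = 2**N."""
    b = 1.0 / n_rings                 # slope matching N*b = 1
    u = -1.0 + b * delta              # normalized detuning a + b*I
    return n_rings * (math.log(2.0) - math.log1p(u * u))


def residual(delta: float, n_rings: int) -> float:
    """Deviation of ln y~ from the ideal exponential, whose log is delta."""
    return log_cascade(delta, n_rings) - delta


def cubic_scale(delta: float, n_rings: int) -> float:
    """Leading-order magnitude |delta|**3 / (6 N**2) of the cubic term."""
    return abs(delta) ** 3 / (6.0 * n_rings ** 2)


for n in (10, 20, 40):
    r = residual(1.0, n)
    ratio = abs(r) / cubic_scale(1.0, n)
    print(f"N={n:2d}  residual={r:+.3e}  |residual|/scale={ratio:.3f}")
```

+Doubling N at fixed δ should reduce the residual by roughly a factor of four, which is the 1/N^2 behavior behind
+Eq. (S1.4); the ratio to the cubic estimate drifts from unity as b δ grows because higher-order terms enter.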
+
+
+ S1.3 Conservative upper bound and screening depth
+
+ For fixed b and the identical-detuning family (a1 = · · · = aN ≡ a), one can write a conservative heuristic condition
+for achieving a prescribed log-tolerance. A simple normalization is to enforce ỹ(L) = 1 (matching the target f (L) = 1).
+For a particular constructive choice of a that keeps (a + bI) large and negative across [0, L], one can bound the
+worst-case log-error as
+
+    E_\infty \le \frac{L^2}{4N} + \frac{1}{2 b^2 N}.    (S1.5)
+(This is a conservative rule of thumb; obtaining a formal guarantee would require a separate proof.) As a screening
+estimate (not a formal guarantee), one may use
+
+    N \ge \frac{L^2/4 + 1/(2 b^2)}{\ln(1 + \varepsilon)}.    (S1.6)
+
+While this bound is typically pessimistic, it provides a conservative backstop-style estimate for preliminary design
+screening. The rigorous derivation of these bounds, including the concavity conditions and slope-matching assumptions,
+is given in Sec. S0.3.
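+
+ To see how pessimistic the screening estimate can be, the following sketch evaluates Eq. (S1.6) directly. The
+function name and the rounding up to an integer ring count are assumptions made for illustration; the example
+inputs (L = 8, b = 0.10202, ε = 2.68%) are borrowed from the N = 10 fit quoted elsewhere in this supplement.

```python
import math


def screening_depth(L: float, b: float, eps: float) -> int:
    """Screening depth from E_inf <= (L^2/4 + 1/(2 b^2)) / N <= ln(1 + eps).

    Rounding up to the next integer is an illustrative choice, not a
    statement about how N_safe is rounded in the original analysis.
    """
    numerator = L ** 2 / 4.0 + 1.0 / (2.0 * b ** 2)
    return math.ceil(numerator / math.log1p(eps))


n_screen = screening_depth(L=8.0, b=0.10202, eps=0.0268)
print("screening depth:", n_screen)
```

+For these inputs the screening depth comes out in the thousands, whereas the minimax fit reaches the same tolerance
+with N = 10, which illustrates why the bound is used only as a backstop.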
+
+ S2. WORKED EXAMPLE AND EMPIRICAL LOGIT-RANGE CALIBRATION
+
+ This section provides the detailed worked example for the input-to-output mapping and the empirical logit-range
+calibration tables referenced in the main text (Sec. III).
+
+
+ S2.1 Worked input-to-output mapping example
+
+ As a worked example, consider
+
+ x = [−3.2, 1.2, 4.8, −0.9]. (S2.1)
+
+Compute m = max xn = 4.8. Then u = x − m = [−8.0, −3.6, 0, −5.7] and L = − min un = 8.0. The mapped
+control-signal levels are
+
+ I = u + L = [0, 4.4, 8.0, 2.3], (S2.2)
+
+and the required normalized exponentials are e^{x_n − m} = e^{u_n} = e^{I_n − L}. Using the fitted model directly,
+
+    T_k(I_n) = \frac{1}{1 + (a_k + b I_n)^2}, \qquad y(I_n) = \prod_{k=1}^{N} T_k(I_n).
+
+Under the identical-detuning fit (a1 = · · · = aN ≡ a), this becomes
+
+    \tilde{y}(I_n) = C\, y(I_n) = C \left[ \frac{1}{1 + (a + b I_n)^2} \right]^{N}.
+For the re-fitted parameters used in this example,
+
+    a = -1.4588, \quad b = 0.10202, \quad N = 10, \quad C = 3.0896 \times 10^{1},    (S2.3)
+
+which gives
+
+    \tilde{y}(I_n) = C \left[ \frac{1}{1 + (a + b I_n)^2} \right]^{N} \approx [3.44 \times 10^{-4},\ 2.73 \times 10^{-2},\ 9.74 \times 10^{-1},\ 3.26 \times 10^{-3}].    (S2.4)
+
+ For reference, the corresponding target terms are
+
+ In − L = [−8.0, −3.6, 0, −5.7], (S2.5)
+
+and
+
+    e^{I_n - L} \approx [3.35 \times 10^{-4},\ 2.73 \times 10^{-2},\ 1.00,\ 3.35 \times 10^{-3}].    (S2.6)
+
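+ The worked example can be reproduced with a few lines of code. This is a minimal sketch using the parameters of
+Eq. (S2.3); the helper names map_logits and cascade are hypothetical and chosen only for illustration.

```python
import math

# Fitted parameters quoted in Eq. (S2.3).
A, B, N_RINGS, C = -1.4588, 0.10202, 10, 30.896


def map_logits(x):
    """Shift logits by their maximum and map them to control levels I = u + L."""
    m = max(x)
    u = [xi - m for xi in x]
    L = -min(u)
    return [ui + L for ui in u], L


def cascade(I):
    """Identical-detuning cascade output C * [1 + (a + b I)^2]^(-N)."""
    return C * (1.0 + (A + B * I) ** 2) ** (-N_RINGS)


x = [-3.2, 1.2, 4.8, -0.9]
I, L = map_logits(x)
approx = [cascade(i) for i in I]
target = [math.exp(i - L) for i in I]
rel_err = [abs(a - t) / t for a, t in zip(approx, target)]

for xi, i, t, a, e in zip(x, I, target, approx, rel_err):
    print(f"x={xi:+.1f}  I={i:.1f}  target={t:.4e}  approx={a:.4e}  rel.err={100 * e:.3f}%")
```

+Small differences in the last printed digit relative to Table S1 can arise because the quoted parameters are rounded.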
+ S2.2 Effective-range percentiles and clipping calibration
+
+ We first estimate the logit range observed in data and then choose clipping accordingly. From two autoregressive
+Transformers (distilgpt2 and gpt2) and two public corpora (Tiny Shakespeare and Pride and Prejudice) [1–5] at context
+length 128, the effective range
+
+ Leff,α = max(log pkept ) − min(log pkept ), α = 0.999, (S2.7)
+
+fell in a relatively narrow band, summarized in Table S2.
+
+ TABLE S1: Example (N = 10): approximating e^{x_n − m} = e^{I_n − L} using ỹ(I) = C[1 + (a + bI)^2]^{−N} with parameters
+ re-fitted on I ∈ [0, 8.0] using the same minimax pipeline.
+
+ xn In target e^{x_n − m} approx ỹ(In ) rel. err.
+−3.2 0.0 3.3546 × 10−4 3.4443 × 10−4 2.673%
+ 1.2 4.4 2.7324 × 10−2 2.7325 × 10−2 0.004%
+ 4.8 8.0 1.0000 0.9739 2.608%
+−0.9 2.3 3.3460 × 10−3 3.2585 × 10−3 2.614%
+
+
+ TABLE S2: Effective-range percentiles (Leff,0.999 ) at context length 128.
+
+ Percentile All runs (4 runs) GPT-2
+ p50 6.92–7.23 7.09–7.23
+ p90 8.60–8.75 8.73–8.75
+ p95 8.97–9.12 9.06–9.12
+ p99 9.50–9.69 9.58–9.69
+
+
+ We then test clipping on the same rows with
+
+    E_{\mathrm{cum}}(t) = \tfrac{1}{2} \left\| \mathrm{softmax}(u^{(t)}) - \mathrm{softmax}(u) \right\|_1, \qquad u^{(t)} = \max(u, t), \quad u = s - \max(s),    (S2.8)
+
+and require p99{Ecum } ≤ 10−3 (0.1% budget). This criterion is satisfied at t = −12 (p99 ≈ 4.27 × 10−4 ) and violated
+at t = −11 (p99 ≈ 1.24 × 10−3 ), so we set t∗ = −12 (Nclip = 12).
+ In practice, we (i) estimate an effective L from data, (ii) verify that fixed clipping keeps softmax error small, and (iii)
+choose representative design points (e.g., L ≈ 8 or L ≈ 12) while treating the clipped tail as negligible. Full protocol
+details, clipping-sweep tables/plots, and per-run statistics are provided in Sec. S3.
+
+
+ S2.3 Illustrative synthetic range map
+ As a design-space reference, we consider synthetic logit-range regimes using L = max(x) − min(x) after QK⊤/√dk
+scaling. These regimes are illustrative rather than corpus-level percentiles; using the same fitting pipeline, Table S3
+summarizes achievable approximation error versus depth.
+
+ TABLE S3: Synthetic softmax logit-range regimes (L = max(x) − min(x)) and fitted worst-case relative error
+ (design-space illustration; not intended as corpus-level statistics).
+
+L regime N =5 N = 10 N = 20 N = 30
+ L=8 10.9% 2.68% 0.67% 0.30%
+ L = 12 40.0% 9.25% 2.27% 1.01%
+ L = 16 113% 23.0% 5.44% 2.41%
+
+
+ Table S3 suggests a simple rule of thumb: the required depth depends mainly on the target L regime. Near L ≈ 8,
+moderate depth reaches a few-percent error, whereas L ≳ 12 typically requires deeper cascades to approach < 1%
+error.
+ We include Table S3 as a synthetic design map rather than an empirical benchmark.
+
+ S3. EMPIRICAL LOGIT-RANGE EXTRACTION FROM REAL TRANSFORMER RUNS
+
+ We extracted empirical attention-logit ranges from real model runs to complement the synthetic L-regime map in
+the main text. We used two open-source autoregressive Transformers (distilgpt2 and gpt2) and two public corpora
+(Tiny Shakespeare and Pride and Prejudice), with context length 128 and causal masking. For each valid attention
+row, if p = softmax(s) then the raw range is
+ Lraw = max(s) − min(s) = max(log p) − min(log p), (37)
+where max/min are taken over valid causal keys only. Because very small tail probabilities can dominate min(log p),
+we additionally report an effective range:
+ Leff,α = max(log pkept ) − min(log pkept ), (38)
+where keys are sorted by attention weight and retained until cumulative mass reaches α = 0.999.
+ To stay within a 16 GB RAM budget, we processed one model at a time, batch size 1, fixed windowing (stride 128),
+and streaming histogram quantiles. Observed process RSS stayed below 1.24 GB in these runs.
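+
+ For a single attention row, the raw and effective ranges of Eqs. (37)–(38) can be sketched as below. The function
+name is hypothetical, and the convention of keeping the smallest set of top-weight keys whose cumulative mass first
+reaches α is our reading of the text, not a confirmed detail of the extraction scripts.

```python
import math


def logit_ranges(s, alpha=0.999):
    """Return (L_raw, L_eff) for one row of valid causal logits s."""
    m = max(s)
    log_z = m + math.log(sum(math.exp(v - m) for v in s))   # stable log-sum-exp
    logp = sorted((v - log_z for v in s), reverse=True)     # log-probabilities, descending
    l_raw = logp[0] - logp[-1]

    mass, kept_min = 0.0, logp[0]
    for lp in logp:                     # keep keys until cumulative mass reaches alpha
        mass += math.exp(lp)
        kept_min = lp
        if mass >= alpha:
            break
    return l_raw, logp[0] - kept_min


l_raw, l_eff = logit_ranges([0.0, -1.0, -2.0, -20.0])
print(l_raw, l_eff)
```

+In this toy row the single very negative logit sets the raw range, while the effective range is determined by the
+three keys that carry essentially all of the attention mass.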
+
+ TABLE S4: Empirical global logit-range percentiles from real model–dataset runs (context length 128): raw vs
+ effective (α = 0.999).
+
+ Model Dataset raw p95 raw p99 Leff p50 Leff p90 Leff p95 Leff p99
+ distilgpt2 tiny shakespeare 22.82 69.00 7.10 8.60 8.97 9.50
+ distilgpt2 pride prejudice 21.76 68.60 6.92 8.60 9.03 9.57
+ gpt2 tiny shakespeare 25.48 43.34 7.23 8.73 9.06 9.58
+ gpt2 pride prejudice 24.13 40.92 7.09 8.75 9.12 9.69
+
+ For quick linkage to the main manuscript: the effective-range summary quoted in the main text corresponds to this
+table (all runs: p50 = 6.92–7.23, p90 = 8.60–8.75, p95 = 8.97–9.12, p99 = 9.50–9.69), and the GPT-2 subset is p50
+= 7.09–7.23, p90 = 8.73–8.75, p95 = 9.06–9.12, p99 = 9.58–9.69.
+Clipping-validity sweep (additional justification). To test whether practical clipping magnitudes can be used
+without materially changing softmax outputs, we evaluated a thresholded-logit approximation. For each row, define
+u = s − max(s) and, for threshold t ≤ 0,
+ u(t) = max(u, t), p(t) = softmax(u(t) ). (39)
+We report the cumulative softmax error
+    E_{\mathrm{cum}}(t) = \tfrac{1}{2} \left\| p^{(t)} - p \right\|_1,    (40)
+then sweep t ∈ {−14, −13, . . . , −6} and compute p50/p90/p95/p99 of Ecum over all extracted rows.
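+
+ A minimal sketch of the per-row error in Eqs. (39)–(40) follows; the function names are illustrative, and the
+percentile aggregation over all extracted rows is omitted.

```python
import math


def softmax(u):
    """Numerically stable softmax of a list of logits."""
    m = max(u)
    e = [math.exp(v - m) for v in u]
    z = sum(e)
    return [v / z for v in e]


def clipping_error(s, t):
    """E_cum(t) = 0.5 * || softmax(max(u, t)) - softmax(u) ||_1 with u = s - max(s)."""
    m = max(s)
    u = [v - m for v in s]
    u_clip = [max(v, t) for v in u]     # clip logits from below at threshold t <= 0
    p, p_clip = softmax(u), softmax(u_clip)
    return 0.5 * sum(abs(a - b) for a, b in zip(p_clip, p))


row = [0.0, -1.0, -2.0, -20.0]
for t in (-14, -10, -6):
    print(t, clipping_error(row, t))
```

+Because clipping only raises the smallest logits, the error grows monotonically as t becomes less negative, which is
+the trend visible in Table S5.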
+
+ TABLE S5: Global clipping-validity sweep: percentile statistics of Ecum (t) versus clipping threshold t.
+
+ t p50 p90 p95 p99
+ −14 2.53 × 10−5 4.55 × 10−5 4.80 × 10−5 5.18 × 10−5
+ −13 2.69 × 10−5 4.85 × 10−5 7.38 × 10−5 1.48 × 10−4
+ −12 2.99 × 10−5 1.21 × 10−4 2.13 × 10−4 4.27 × 10−4
+ −11 3.31 × 10−5 3.95 × 10−4 6.55 × 10−4 1.24 × 10−3
+ −10 3.72 × 10−5 1.28 × 10−3 2.01 × 10−3 3.58 × 10−3
+ −9 4.41 × 10−5 4.04 × 10−3 6.11 × 10−3 1.03 × 10−2
+ −8 2.25 × 10−4 1.26 × 10−2 1.83 × 10−2 2.91 × 10−2
+ −7 2.76 × 10−3 3.85 × 10−2 5.30 × 10−2 7.89 × 10−2
+ −6 1.88 × 10−2 1.11 × 10−1 1.43 × 10−1 1.95 × 10−1
+
+ Under a conservative budget criterion p99{Ecum } ≤ 10−3 , the least negative admissible threshold in this sweep
+is t∗ = −12 (p99 ≈ 4.27 × 10−4 ). Equivalently, the operational clipping magnitude is Nclip ≡ −t∗ = 12. This is
+comparable to the empirical effective-range scale (Table S4: p99 of Leff,0.999 up to ≈ 9.69), so the clipping
+magnitude and the effective-range statistics are of the same order of magnitude. This supports using a practical
+clipping magnitude comparable to the design range scale (L ≈ Nclip ) while keeping the p99 cumulative softmax
+error below 0.1%.
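+
+ The threshold selection above amounts to a lookup over the p99 column of Table S5. A short sketch, with a
+hypothetical function name and the tabulated p99 values copied from the table:

```python
# p99 of E_cum(t) from Table S5, keyed by clipping threshold t.
P99_ECUM = {
    -14: 5.18e-5, -13: 1.48e-4, -12: 4.27e-4, -11: 1.24e-3, -10: 3.58e-3,
    -9: 1.03e-2, -8: 2.91e-2, -7: 7.89e-2, -6: 1.95e-1,
}


def least_negative_threshold(p99_by_t, budget=1e-3):
    """Largest (least negative) t whose p99 error stays within the budget, or None."""
    admissible = [t for t, err in p99_by_t.items() if err <= budget]
    return max(admissible) if admissible else None


t_star = least_negative_threshold(P99_ECUM)
print("t* =", t_star, " N_clip =", -t_star)
```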
+
+
+
+
+ FIG. S1: Global CDFs of raw Lraw (dashed) and effective Leff,0.999 (solid) for the four model–dataset runs.
+
+
+
+
+FIG. S2: Percentile curves of cumulative softmax error Ecum (t) versus clipping threshold t. The dashed line marks the
+ 0.1% budget (10−3 ).
+
+ S4. FDTD METHODOLOGY DETAILS AND X-CUT bV DERIVATION
+
+ This section provides the detailed FDTD simulation methodology, the step-by-step X-cut arc electrode voltage
+sensitivity derivation, and the full cascade optimization table referenced in the main text (Sec. IV–V).
+
+
+ S4.1 z-refined 3-fix simulation strategy
+
+ For thin-film LiNbO3 structures, special care is required in the vertical (z) direction due to the high index contrast
+between LiNbO3 (no ≈ 2.21) and SiO2 (n ≈ 1.44) and the sub-micron film thickness. We apply a “z-refined 3-fix”
+strategy:
+ 1. Ordinary index correction: the material model uses the corrected ordinary refractive index no appropriate
+ for the TE mode in X-cut geometry, rather than the extraordinary index ne that governs TM propagation;
+ 2. z-span expansion: the simulation z-span is extended beyond the minimal waveguide region to include sufficient
+ substrate and superstrate so that evanescent field tails are captured without PML truncation artifacts;
+ 3. Auto-mesh: accuracy level 3; conformal variant 1 meshing is enabled, and no manual mesh override is applied.
+ The resulting vertical grid spacing in the slab region is approximately 55 nm, providing ∼2 cells across the 100 nm
+ slab.
+This refinement strategy is critical for obtaining converged results in TFLN ring resonators, where the high-Q spectral
+features are sensitive to numerical dispersion in under-resolved thin films [6]. Table S6 lists the full simulation
+parameters.
+
+ TABLE S6: 3D FDTD simulation parameters (Lumerical).
+
+Parameter Value
+Solver Lumerical 3D FDTD
+Mesh type Conformal variant 1
+Mesh accuracy 3 (auto-mesh)
+z-mesh override None (auto-mesh)
+Simulation time 50 ps
+Auto shutoff 1 × 10−6
+Wavelength range 1530 nm to 1570 nm
+Grid size 532 × 816 × 44
+Source Broadband mode source (TE0 )
+
+
+
+
+ S4.2 X-cut arc electrode bV step-by-step derivation
+
+ For the X-cut circular ring with lateral S–G arc electrodes (Table II), the crystal Z-axis (c-axis) is oriented at 45◦
+from the horizontal axis in the substrate plane. At azimuthal angle θ around the ring, the projection of the lateral
+electric field onto the Z-axis is proportional to cos(θ − 45◦ ). The cos(θ − 45◦ ) = 0 boundaries fall at θ = 135◦ and
+θ = 315◦ , naturally separating the bus-waveguide coupling regions from the electrode regions. Each ring carries a full
+semicircular arc electrode on the side opposite to its coupling points. By the substitution φ = θ − 45◦ , the effective
+EO fill factor is
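+    f_{EO} = \frac{1}{2\pi} \int_{\mathrm{semicircle}} |\cos(\theta - 45^\circ)| \, d\theta = \frac{1}{2\pi} \int_{-\pi/2}^{+\pi/2} \cos\varphi \, d\varphi = \frac{1}{2\pi} \big[ \sin\varphi \big]_{-\pi/2}^{+\pi/2} = \frac{1}{\pi} \approx 0.318.    (S4.1)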
+The 45◦ rotation ensures that the electrode semicircle does not overlap with the coupling points, while the fill factor
+integral is identical to the standard cos θ case by the change of variable.
+ The lateral S–G electrodes have gap gel = 5 µm, giving an effective electrode–waveguide distance deff ≈ gel /2 = 2.5 µm.
+The lateral field geometry yields an EO overlap factor ΓEO = 0.7, compared to 0.5 for a vertical electrode configuration.
+ The refractive index change per volt in the electrode-covered section is
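+    \frac{\Delta n_{\mathrm{eff}}}{V} = -\frac{1}{2} n_e^3 r_{33} \frac{\Gamma_{EO}}{d_{\mathrm{eff}}} = -\frac{1}{2} \times 2.138^3 \times 30.9 \times 10^{-12} \times \frac{0.7}{2.5 \times 10^{-6}} = -4.226 \times 10^{-5}\ \mathrm{V^{-1}}.    (S4.2)
+
+The corresponding resonance wavelength shift is
+
+    \left. \frac{d\lambda_0}{dV} \right|_{\mathrm{straight}} = \frac{1550\ \mathrm{nm} \times 4.226 \times 10^{-5}\ \mathrm{V^{-1}}}{2.30} = 28.48\ \mathrm{pm\,V^{-1}},    (S4.3)
+
+giving an intrinsic (straight-section) voltage sensitivity of
+
+    b_V^{\mathrm{straight}} = \frac{2 Q_L}{\lambda_0} \left. \frac{d\lambda_0}{dV} \right|_{\mathrm{straight}} = \frac{2 \times 15{,}500}{1550\ \mathrm{nm}} \times 0.02848\ \mathrm{nm\,V^{-1}} = 0.570\ \mathrm{V^{-1}}.    (S4.4)
+
+However, only the arc-electrode portion of the ring circumference contributes to the round-trip phase shift. The
+effective voltage sensitivity is therefore
+
+    b_V = b_V^{\mathrm{straight}} \times f_{EO} = 0.570 \times \frac{1}{\pi} \approx 0.182\ \mathrm{V^{-1}}.    (S4.5)
+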
+A 1 V applied voltage shifts the normalized detuning by ∆a ≈ 0.182. Despite the fill-factor penalty (fEO = 1/π ≈ 0.318),
+the X-cut arc design benefits from a smaller effective electrode distance (2.5 µm vs. 4 µm for vertical configurations)
+and a higher overlap factor (0.7 vs. 0.5), which partially compensate the reduced active length.
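+
+ The chain of Eqs. (S4.1)–(S4.5) is short enough to re-evaluate numerically. The sketch below uses the values quoted
+in the text; the variable names are ours, and the denominator 2.30 in Eq. (S4.3) is interpreted as the
+FSR-derived group index, which is an assumption based on the value quoted in Sec. S4.5.

```python
import math

N_E, R33 = 2.138, 30.9e-12          # extraordinary index and r33 (m/V), as quoted
GAMMA_EO, D_EFF = 0.7, 2.5e-6       # EO overlap factor and effective distance (m)
LAMBDA0_NM, N_G, Q_L = 1550.0, 2.30, 15_500


def fill_factor(samples: int = 100_000) -> float:
    """Midpoint-rule estimate of (1/2pi) * integral of cos(phi) over [-pi/2, +pi/2]."""
    h = math.pi / samples
    area = h * sum(math.cos(-math.pi / 2 + (k + 0.5) * h) for k in range(samples))
    return area / (2.0 * math.pi)


dn_per_volt = -0.5 * N_E ** 3 * R33 * GAMMA_EO / D_EFF        # Eq. (S4.2), 1/V
dlambda_nm_per_volt = LAMBDA0_NM * abs(dn_per_volt) / N_G     # Eq. (S4.3), nm/V
b_straight = 2.0 * Q_L / LAMBDA0_NM * dlambda_nm_per_volt     # Eq. (S4.4), 1/V
b_v = b_straight * fill_factor()                              # Eq. (S4.5), 1/V

print(f"dn/V = {dn_per_volt:.4e} 1/V, dlambda/dV = {1e3 * dlambda_nm_per_volt:.2f} pm/V")
print(f"b_straight = {b_straight:.3f} 1/V, b_V = {b_v:.3f} 1/V")
```

+The last digit of each intermediate value depends on where rounding is applied, so agreement with the quoted numbers
+should be expected only to about three significant figures.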
+
+
+ S4.3 Full cascade optimization table
+
+ Table S7 presents the complete optimization results for the standard dynamic range L = 8 (corresponding to
+e8 ≈ 2981, i.e., 34.7 dB), covering all cascade depths from N = 5 to N = 30.
+
+ TABLE S7: Cascade optimization results for L = 8. The bias voltage Vbias = |a|/bV sets the DC offset, and
+Vctrl = bL/bV is the maximum control voltage at I = L. Voltages computed with bV = 0.182 V−1 (FDTD-calibrated
+ best resonance QL = 15,500).
+
+N a b E∞ εmax (%) Vbias (V) Vctrl (V)
+ 5 −2.0789 0.21658 0.1035 10.91 11.4 9.5
+ 8 −1.5959 0.12896 0.0412 4.20 8.8 5.7
+10 −1.4588 0.10202 0.0265 2.68 8.0 4.5
+12 −1.3731 0.08450 0.0184 1.86 7.5 3.7
+15 −1.2914 0.06726 0.0118 1.19 7.1 3.0
+17 −1.2543 0.05923 0.0092 0.92 6.9 2.6
+20 −1.2136 0.05025 0.0067 0.67 6.7 2.2
+25 −1.1685 0.04013 0.0043 0.43 6.4 1.8
+30 −1.1392 0.03341 0.0030 0.30 6.3 1.5
+
+
+ Key thresholds for the minimum number of rings at various error targets are:
+ • ε < 10%: N ≥ 6,
+ • ε < 5%: N ≥ 8,
+ • ε < 2%: N ≥ 12,
+ • ε < 1%: N ≥ 17,
+ • ε < 0.5%: N ≥ 24.
+These thresholds are independent of the quality factor Q, since the minimax approximation operates entirely in
+normalized detuning space. The Q factor affects only the physical voltage required to achieve the necessary detuning
+range, through bV .
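+
+ A single row of Table S7 can be cross-checked from (a, b, N) alone. The sketch below assumes that the scale C is
+chosen to center the log-domain residual, so that E∞ is half its peak-to-peak value, and that εmax = exp(E∞) − 1;
+both assumptions are ours, although they are consistent with the tabulated numbers. The function names are
+hypothetical.

```python
import math

B_V = 0.182   # V^-1, FDTD-calibrated voltage sensitivity quoted in the caption


def log_error(a: float, b: float, n_rings: int, L: float, samples: int = 4001) -> float:
    """Half the peak-to-peak of ln(cascade) - (I - L) on a uniform grid over [0, L]."""
    g = []
    for k in range(samples):
        I = L * k / (samples - 1)
        g.append(-n_rings * math.log1p((a + b * I) ** 2) - (I - L))
    return 0.5 * (max(g) - min(g))


def table_row(a: float, b: float, n_rings: int, L: float = 8.0) -> dict:
    e_inf = log_error(a, b, n_rings, L)
    return {
        "E_inf": e_inf,
        "eps_max_pct": 100.0 * math.expm1(e_inf),
        "V_bias": abs(a) / B_V,          # DC offset voltage
        "V_ctrl": b * L / B_V,           # control swing at I = L
    }


row10 = table_row(a=-1.4588, b=0.10202, n_rings=10)
print(row10)
```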
+
+
+ S4.4 Lorentzian fit validation
+
+ Figure S3 shows the Lorentzian fit to the FDTD drop-port resonance near λ = 1566 nm. The analytical Lorentzian
+Tdrop (∆λ) = A/[1 + (2Q∆λ/λ0 )2 ] with QL = 15,500 closely tracks the FDTD data, validating the single-ring transfer
+function model used in the cascade analysis.
+
+
+
+
+ FIG. S3: Lorentzian fit to the FDTD drop-port resonance. Markers: FDTD data; solid line: Lorentzian fit. The
+ extracted quality factor is QL = 15,500 with FWHM = 101 pm.
+
+
+ S4.5 Eigenmode (FDE) analysis of theoretical Qi
+
+ To quantify how far below the physical limit the FDTD-extracted Qi = 38,800 lies, we perform a two-dimensional
+finite-difference eigenmode (FDE) analysis of the bent rib waveguide cross-section using Lumerical MODE Solutions.
+ a. Setup. The FDE solver models the cross-section of the rib waveguide at the design bend radius R = 20 µm
+and wavelength λ = 1550 nm, with perfectly matched layer (PML) boundaries on all four edges. The geometry is
+identical to the 3D FDTD model: 600 nm total LiNbO3 (no = 2.211, lossless dielectric), 100 nm slab, 500 nm rib etch,
+waveguide width W = 1.4 µm, on a 2 µm SiO2 substrate (n = 1.444) with air cladding. The mesh is set to 300 × 300
+cells over a 6 µm × 3 µm cross-section, yielding effective grid spacings ∆x ≈ 20 nm and ∆y ≈ 10 nm—substantially
+finer than the 3D FDTD auto-mesh (55 nm vertical).
+ b. Complex effective index. The FDE solver returns a complex effective index neff = nr + i ni for each guided
+mode, where the imaginary part ni encodes propagation loss. For the fundamental TE mode at R = 20 µm:
+    n_{\mathrm{eff}} = 1.9653 + i\,(4.73 \times 10^{-8}),    (41)
+
+    \alpha_{\mathrm{rad+leak}} = \frac{4\pi n_i}{\lambda} = 0.383\ \mathrm{m^{-1}} \quad (0.017\ \mathrm{dB\,cm^{-1}}).    (42)
+
+Since the material is set as lossless, this α captures only bending radiation loss and substrate leakage through the
+100 nm slab. The corresponding quality factor is
+
+    Q_{\mathrm{rad+leak}} = \frac{2\pi n_g}{\alpha_{\mathrm{rad+leak}}\, \lambda} = 2.43 \times 10^{7},    (43)
+
+where ng = 2.354 is the group index from the FDE solver (consistent with the FDTD FSR-derived ng = 2.30; the
+small difference arises from the straight-section approximation inherent to 2D FDE).
+ c. Decomposition into bending and leakage. A separate FDE run with R = 1 mm (effectively straight) yields
+Qleak = 2.93 × 107 , isolating the substrate leakage contribution. The pure bending radiation quality factor follows from
+    \frac{1}{Q_{\mathrm{bend}}} = \frac{1}{Q_{\mathrm{rad+leak}}} - \frac{1}{Q_{\mathrm{leak}}}, \qquad Q_{\mathrm{bend}} = 1.43 \times 10^{8}.    (44)
+
+This confirms that bending radiation loss at R = 20 µm is negligible; substrate leakage through the thin slab is the
+dominant geometric loss channel.
+ d. Material absorption. The FDE mode profile yields a confinement factor Γ = 0.887 (fraction of the optical
+intensity within the LiNbO3 core and slab regions). The material-absorption-limited quality factor is
+    Q_{\mathrm{abs}} = \frac{2\pi n_g}{\Gamma\, \alpha_{\mathrm{mat}}\, \lambda},    (45)
+
+where αmat is the bulk material power-attenuation coefficient of LiNbO3 at 1550 nm. Table S8 evaluates Eq. (45) for
+representative TFLN absorption values from the literature [6, 7].
+
+TABLE S8: Theoretical intrinsic quality factor Qi of the R = 20 µm TFLN ring, decomposed into radiation (Qbend ),
+ substrate leakage (Qleak ), and material absorption (Qabs ). Sidewall scattering (fabrication-dependent) is excluded.
+ The total is 1/Qi = 1/Qrad+leak + 1/Qabs with Qrad+leak = 2.43 × 107 .
+
+Material condition αmat (dB/cm) Qabs Qi (total)
+Bulk LiNbO3 (pristine) 0.002 2.3 × 108 2.2 × 107
+High-quality TFLN 0.01 4.7 × 107 1.6 × 107
+Good TFLN 0.03 1.6 × 107 9.5 × 106
+Typical TFLN 0.1 4.7 × 106 3.9 × 106
+
+
+ For high-quality TFLN (αmat ≲ 0.01 dB cm−1 ), the theoretical Qi exceeds 107 —more than 400× higher than the
+FDTD-extracted value of 38,800. This confirms that the FDTD result is dominated by numerical mesh artifacts
+(approximately two cells across the 100 nm slab), not by physical loss mechanisms. Bending radiation loss at R = 20 µm
+is negligible (Qbend = 1.43 × 108 ); the dominant geometric loss channel in the ideal structure is substrate leakage
+through the thin slab (Qleak = 2.93 × 107 ).
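+
+ The loss decomposition of Eqs. (42)–(45) and Table S8 can be retraced with the sketch below. Names are illustrative.
+The quoted Qrad+leak and Qleak are taken as inputs for the decomposition, so the result does not depend on which
+group-index value entered Eq. (43); the FDE group index ng = 2.354 is assumed for Qabs.

```python
import math

LAMBDA_M = 1550e-9
Q_RAD_LEAK, Q_LEAK = 2.43e7, 2.93e7      # quoted FDE results
N_G_FDE, GAMMA = 2.354, 0.887            # FDE group index and confinement factor


def alpha_from_imag_index(n_i: float) -> float:
    """Power attenuation coefficient (1/m) from the imaginary effective index."""
    return 4.0 * math.pi * n_i / LAMBDA_M


def per_m_to_db_per_cm(alpha: float) -> float:
    return alpha * 10.0 / math.log(10.0) / 100.0


def db_per_cm_to_per_m(alpha_db_cm: float) -> float:
    return alpha_db_cm * 100.0 * math.log(10.0) / 10.0


def q_abs(alpha_db_cm: float) -> float:
    """Material-absorption-limited Q, Eq. (45)."""
    return 2.0 * math.pi * N_G_FDE / (GAMMA * db_per_cm_to_per_m(alpha_db_cm) * LAMBDA_M)


def parallel(*qs: float) -> float:
    """Combine independent loss channels: 1/Q = sum of 1/Q_k."""
    return 1.0 / sum(1.0 / q for q in qs)


alpha = alpha_from_imag_index(4.73e-8)
q_bend = 1.0 / (1.0 / Q_RAD_LEAK - 1.0 / Q_LEAK)      # Eq. (44)
print(f"alpha = {alpha:.3f} 1/m = {per_m_to_db_per_cm(alpha):.4f} dB/cm, Q_bend = {q_bend:.3e}")
for a_db in (0.002, 0.01, 0.03, 0.1):
    print(f"alpha_mat = {a_db} dB/cm: Q_abs = {q_abs(a_db):.2e}, Q_i = {parallel(Q_RAD_LEAK, q_abs(a_db)):.2e}")
```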
+
+ S5. FABRICATED HIGH-Q DESIGN PROJECTIONS
+
+ Reproducing Qi > 105 in three-dimensional FDTD is computationally impractical: at accuracy level 3 the 100 nm
+slab requires ∆z ≲ 20 nm to suppress staircase-induced scattering, inflating wall times beyond 30 days per run. The
+numerically extracted Qi = 38,800 therefore represents a simulation floor, not a physical one. A two-dimensional
+MODE-solver bend analysis confirms Qbend > 4.5 × 107 for R = 20 µm, placing bending radiation loss far below any
+realistic intrinsic loss.
+ Table S9 surveys recent high-Q TFLN microring demonstrations. These studies show that Qi ≥ 9 × 106 has been
+demonstrated in X-cut TFLN using multiple fabrication routes, including Ar+ milling, wet etching, and ICP-RIE/CMP-
+based processes.
+
+ TABLE S9: Demonstrated intrinsic quality factors in TFLN micro-ring resonators. “EO compatible” indicates
+ whether the fabrication process preserves electrode patterning capability.
+
+Ref. Qi R (µm) w (µm) Etch
+Zhang [8] 10^7 80 ∼2 Ar+ mill
+Gao [9] 10^8 100 ∼3 CMP∗
+Zhuang [10] 9×10^6 100 ∼2 Wet etch
+Song [11] 2.9×10^7 200 4.5 ICP-RIE+CMP
+ All processes except ∗ are EO-electrode compatible. ∗ CMP-only (no dry etch); subsequent electrode patterning may degrade Qi .
+
+ To project cascade performance into the fabricated regime, we fix Qext = 25,800 (the FDTD-extracted coupling
+quality factor at gap = 100 nm) and compute Dmax = [Qi /(Qi + Qext )]2 for three representative intrinsic quality
+factors (Table S10).
+
+ TABLE S10: Projected cascade transmission for fabricated Qi values at fixed Qext = 25,800. D_max^N is the ideal
+on-resonance cascade transmission in dB. The minimax approximation error εmax depends only on N and L (not on
+ Qi ); at N = 20, L = 8: εmax = 0.67% (Table I).
+
+Projection Qi Dmax N=10 (dB) N=20 (dB) N=30 (dB)
+FDTD baseline 3.88×10^4 0.36 −44.3 −88.5 −132.8
+Conservative 5×10^5 0.90 −4.4 −8.8 −13.2
+Moderate 10^6 0.95 −2.2 −4.5 −6.7
+Optimistic 5×10^6 0.99 −0.44 −0.88 −1.3
+
+
+ Even in the conservative scenario (Qi = 5 × 10^5), Dmax = 0.90 and the N = 10 on-resonance cascade transmission is
+−4.4 dB, about 40 dB better than the FDTD baseline (−44.3 dB). The moderate projection (Qi = 10^6) matches the “fabricated
+high-Q” column in Table V. Because Qbend > 4.5 × 10^7 ≫ Qi for all projections, bending loss is never the bottleneck;
+the dominant loss mechanism is sidewall scattering, which is determined entirely by fabrication quality. The literature
+values in Table S9 support the view that intrinsic quality factors in the projected range are physically achievable
+in TFLN—albeit with wider waveguides (w ≥ 2 µm) and larger ring radii (R ≥ 80 µm) than the present design.
+Transferring comparable sidewall quality to our geometry (R = 20 µm, w = 1.4 µm) is an open fabrication challenge;
+the projections in Table S10 should be read as design targets contingent on achieving it.
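+
+ The entries of Table S10 follow from Dmax = [Qi /(Qi + Qext )]^2 raised to the N-th power. A minimal sketch is given
+below; the function names are ours, and the loaded-Q helper uses the standard relation 1/QL = 1/Qi + 1/Qext , which
+is an assumption that happens to reproduce the loaded Q values quoted in this supplement.

```python
import math

Q_EXT = 25_800   # FDTD-extracted coupling quality factor at gap = 100 nm


def d_max(q_i: float, q_ext: float = Q_EXT) -> float:
    """On-resonance single-ring drop transmission [Qi / (Qi + Qext)]^2."""
    return (q_i / (q_i + q_ext)) ** 2


def cascade_db(q_i: float, n_rings: int) -> float:
    """Ideal on-resonance transmission of N identical rings, in dB (negative)."""
    return 10.0 * n_rings * math.log10(d_max(q_i))


def loaded_q(q_i: float, q_ext: float = Q_EXT) -> float:
    return 1.0 / (1.0 / q_i + 1.0 / q_ext)


for label, q_i in [("FDTD baseline", 3.88e4), ("Conservative", 5e5),
                   ("Moderate", 1e6), ("Optimistic", 5e6)]:
    cols = "  ".join(f"{cascade_db(q_i, n):8.2f}" for n in (10, 20, 30))
    print(f"{label:14s} Dmax={d_max(q_i):.3f}  QL={loaded_q(q_i):8.0f}  {cols}")
```

+Values computed from the unrounded Dmax can differ from the table in the last digit, since the tabulated dB figures
+appear to be derived from Dmax rounded to two decimals.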
+
+ S6. INSERTION LOSS BUDGET DETAILS
+
+ For a cascade of N rings, the total insertion loss is modeled as
+
+ ILtot ≈ N · ILstage + ILcoupling , (S6.1)
+
+where ILstage is the per-ring insertion loss at off-resonance operation and ILcoupling accounts for fiber-to-chip and
+chip-to-fiber coupling losses. Using typical loss numbers from the literature [12–16], we consider two scenarios:
+
+ • Optimistic: ILstage = 0.08 dB, ILcoupling = 1.5 dB. Then ILtot ≈ 1.90 dB (N = 5), 2.30 dB (N = 10), 3.10 dB
+ (N = 20), and 3.90 dB (N = 30).
+ • Conservative: ILstage = 0.25 dB, ILcoupling = 3.0 dB. Then ILtot ≈ 4.25 dB (N = 5), 5.50 dB (N = 10),
+ 8.00 dB (N = 20), and 10.5 dB (N = 30).
+
+ In both scenarios, N = 5–10 is manageable for probe-power budgeting, whereas N = 20 and N = 30 require tighter
+power budgeting and more amplification margin. Higher ILtot raises the required probe SNR and pushes operation
+closer to the detector noise floor, reducing usable dynamic range.
+ e. Four-component loss breakdown. The total insertion loss of the cascade has four components:
+ 1. On-resonance cascade transmission D_max^N (dominant; see Table V);
+ 2. Inter-ring coupling loss (N − 1) × (−10 log10 ηcoupling ), where ηcoupling is the power transfer efficiency at each
+ inter-ring bus section. Two-ring FDTD yields ηcoupling ≈ 0.9 for the present diagonal-bus geometry, corresponding
+ to ∼0.46 dB per inter-ring stage;
+ 3. Off-resonance propagation loss N × ILstage , where ILstage = 0.08–0.25 dB per ring [12–14, 16];
+ 4. Fiber-to-chip coupling loss ILcoupling = 1.5–3.0 dB [15].
+Table V presents the ideal on-resonance budget (D_max^N only). Including all four components for the present diagonal-bus
+layout: in the FDTD-characterized regime (Dmax = 0.36, N = 5), the total loss is approximately 22.2 + 1.8 + 0.4 + 1.5 ≈
+26 dB; in the fabricated high-Q regime (Dmax = 0.95, N = 30), the total loss is 6.7 + 13.3 + 2.4 + 1.5 ≈ 24 dB. The
+inter-ring coupling loss dominates in the high-Q regime, underscoring that layout optimization (e.g., adiabatic tapers or
+straight-bus coupling) is as important as achieving Dmax ≥ 0.95 through quality-factor improvement. For an optimized
+layout with ηcoupling ≥ 0.98 (≤0.09 dB per stage), the N = 30 total loss would reduce to ∼13 dB.
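+
+ Both the simple model of Eq. (S6.1) and the four-component budget are sums in dB and can be tabulated with the
+sketch below. Function and parameter names are illustrative; the inputs are the values quoted in this section.

```python
import math


def to_db_loss(transmission: float) -> float:
    """Positive loss in dB for a power transmission between 0 and 1."""
    return -10.0 * math.log10(transmission)


def simple_budget_db(n_rings: int, il_stage_db: float, il_coupling_db: float) -> float:
    """Eq. (S6.1): N * IL_stage + IL_coupling."""
    return n_rings * il_stage_db + il_coupling_db


def full_budget_db(n_rings: int, d_max: float, eta_coupling: float,
                   il_stage_db: float, il_fiber_db: float) -> float:
    on_resonance = n_rings * to_db_loss(d_max)              # component 1
    inter_ring = (n_rings - 1) * to_db_loss(eta_coupling)   # component 2
    off_resonance = n_rings * il_stage_db                   # component 3
    return on_resonance + inter_ring + off_resonance + il_fiber_db   # + component 4


print("FDTD regime, N=5:       ", round(full_budget_db(5, 0.36, 0.90, 0.08, 1.5), 1), "dB")
print("High-Q regime, N=30:    ", round(full_budget_db(30, 0.95, 0.90, 0.08, 1.5), 1), "dB")
print("High-Q, optimized layout:", round(full_budget_db(30, 0.95, 0.98, 0.08, 1.5), 1), "dB")
```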
+
+ S7. ENERGY EFFICIENCY DETAILED DERIVATION
+
+ This section provides the detailed energy-per-operation derivations for both electrical analog exponential circuits
+and the photonic MRR cascade, as summarized in the main text (Sec. V).
+
+
+ S7.1 Electrical analog exponential circuits
+
+ Three main families of electrical circuits realize the exponential function in the analog domain:
+ f. BJT translinear / Gilbert cell circuits. The collector current of a bipolar junction transistor is IC =
+IS exp(VBE /VT ), providing an intrinsic exponential map [17, 18]. A Gilbert cell multiplier—the core building
+block of translinear exponential circuits—dissipates 250–325 µW in typical CMOS/BiCMOS implementations [19]. At
+a signal bandwidth of B ≈ 100 MHz, the energy per operation is
+    E_{\mathrm{Gilbert}} = \frac{P}{B} = \frac{300\ \mu\mathrm{W}}{100\ \mathrm{MHz}} = 3\ \mathrm{pJ}.    (S7.1)
+ g. CMOS subthreshold exponential circuits. A MOSFET in weak inversion exhibits ID ∝ exp(VGS /nVT ), enabling
+direct exponential computation at ultra-low power [18]. A reconfigurable softmax circuit in 180 nm CMOS implements
+a 10-input softmax at VDD = 500 mV with P = 3 µW [20]. Per-channel: Pexp ≈ 0.43 µW. At B ≈ 1 MHz (limited by
+subthreshold fT ):
+    E_{\mathrm{sub\text{-}VT}} = \frac{0.43\ \mu\mathrm{W}}{1\ \mathrm{MHz}} = 0.43\ \mathrm{pJ}.    (S7.2)
+This is the most energy-efficient electrical approach, but at severely limited bandwidth (∼1 MHz).
+ h. Digital CMOS (for reference). A digital exponential via Taylor series requires ∼10 multiply-add operations.
+Using Horowitz’s energy figures [21] for 45 nm at 0.9 V: 32-bit FP multiply costs 3.7 pJ, FP add costs 0.9 pJ, giving
+ Edigital ≈ 10 × (3.7 pJ + 0.9 pJ) = 46 pJ. (S7.3)
+At 8-bit precision (sufficient for inference): ∼2.3 pJ.
+
+
+ S7.2 Photonic MRR cascade: single-channel energy derivation
+
+ We evaluate the energy at N = 30 cascaded X-cut TFLN micro-ring resonators with R = 20 µm in the fabricated
+high-Q regime (Qi = 106 , QL ≈ 25,200; Supplementary Sec. S5), which achieves εmax = 0.30% with Vctrl = 0.91 V
+(fully CMOS-compatible). The energy per exponential operation has three components:
+ (i) Electro-optic tuning energy. Each ring is tuned by charging the arc electrode capacitance to Vctrl . For the lateral
+S–G arc electrodes covering one semicircle (Larc = πR = 62.8 µm), the electrode capacitance is estimated as
+ Cel ≈ 18 fF, (S7.4)
+based on coplanar electrode modeling for TFLN lateral S–G geometries with gel = 5 µm (comparable to values reported
+by Bahadori et al. [22] for similar geometries). The switching energy per ring at Vctrl = 0.91 V (using the projected
+QL = 25,200, which gives bV = 0.295 V−1 ):
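+    E_{\mathrm{ring}} = \tfrac{1}{2} C_{\mathrm{el}} V_{\mathrm{ctrl}}^2 = \tfrac{1}{2} \times 18\ \mathrm{fF} \times (0.91\ \mathrm{V})^2 = 7.4\ \mathrm{fJ}.    (S7.5)
+For N = 30 rings: EEO = 30 × 7.4 fJ = 0.22 pJ.
+ Note the important scaling: EEO ∝ 1/N since b ∝ 1/N from minimax optimization, because
+    E_{EO} \propto N \times V_{\mathrm{ctrl}}^2 \propto N \times (b / b_V)^2 \propto 1/N.    (S7.6)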
+The bias voltage (3.9 V) is static and does not contribute per-operation energy.
+ (ii) Laser source energy (amortized). Because every cascade channel uses the same fixed probe wavelength, a single
+CW laser can be shared among M parallel softmax channels via a 1 × M optical power splitter. With wall-plug
+efficiency ηWPE ≈ 15% [23], the per-channel optical power is Popt = Pin /M ≈ 100 µW (for Pin = 1 mW, M = 10),
+requiring Plaser ≈ 667 µW per channel. At fmod = 10 GHz: Elaser = 667 µW / 10 GHz = 67 fJ.
+ (iii) Photodetector energy. Integrated SiGe photodetectors with TIA achieve sub-pJ reception [24]: EPD ≈ 0.5 pJ.
+ The total single-channel energy is
+    E_{\mathrm{photonic}}^{(1\mathrm{ch})} = E_{EO} + E_{\mathrm{laser}} + E_{PD} = 0.22 + 0.07 + 0.50 = 0.79\ \mathrm{pJ}.    (S7.7)
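+
+ The three contributions can be recomputed from the quoted inputs. This is a sketch with illustrative variable
+names; all numerical values are those stated in this subsection.

```python
FEMTO, PICO = 1e-15, 1e-12

N_RINGS = 30
C_EL = 18 * FEMTO        # electrode capacitance per ring (F)
V_CTRL = 0.91            # control voltage swing (V)
P_OPT = 100e-6           # per-channel optical power (W)
WPE = 0.15               # laser wall-plug efficiency
F_MOD = 10e9             # modulation rate (Hz)
E_PD = 0.5 * PICO        # photodetector + TIA energy (J)

e_ring = 0.5 * C_EL * V_CTRL ** 2        # switching energy of one ring
e_eo = N_RINGS * e_ring                  # all rings of one channel
e_laser = P_OPT / WPE / F_MOD            # amortized electrical laser energy
e_total = e_eo + e_laser + E_PD

print(f"E_ring = {e_ring / FEMTO:.2f} fJ, E_EO = {e_eo / PICO:.3f} pJ")
print(f"E_laser = {e_laser / FEMTO:.1f} fJ, E_total = {e_total / PICO:.3f} pJ")
```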
+
+ S7.3 Q-factor scaling of energy efficiency
+
+ Since Vctrl ∝ 1/Q and EEO ∝ Vctrl^2, the EO energy scales as 1/Q^2. Table S11 shows the total energy for N = 30 at
+various quality factors.
+
+TABLE S11: Energy per exponential operation vs. quality factor (N = 30, εmax = 0.30%, X-cut arc electrode, bV
+ scaling linearly with Q, Cel = 18 fF). Elaser + EPD = 0.57 pJ is the Q-independent floor. The dagger (†) marks the
+FDTD-calibrated quality factor; the double dagger (‡) marks the high-Q design point (Qi = 10^6). Excludes thermal
+ stabilization (0.15–0.60 pJ for N = 30).
+
+ Q Vctrl (V) Vbias (V) EEO (pJ) Etotal (pJ)
+ 5,000 4.57 19.5 5.64 6.21
+ 10,000 2.28 9.7 1.40 1.97
+ 12,500 1.83 7.8 0.90 1.47
+15,500† 1.47 6.3 0.58 1.15
+ 20,000 1.14 4.9 0.35 0.92
+25,200‡ 0.91 3.9 0.22 0.79
+ 30,000 0.76 3.2 0.16 0.73
+ 50,000 0.46 1.9 0.06 0.63
+
+
+ At QL = 15,500 (FDTD-calibrated), the EO contribution (0.58 pJ) is comparable to the optical floor, placing the
+design in the efficient operating regime. Beyond Q ≈ 30,000, the EO contribution becomes negligible and the total
+energy saturates near the floor; further Q improvement primarily benefits CMOS driver voltage compatibility rather
+than energy.
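The scaling behind Table S11 can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes the EO energy takes the capacitive form E_EO = N · ½ · Cel · Vctrl^2 (this section only states the proportionality E_EO ∝ N Vctrl^2), anchors Vctrl ∝ 1/Q at the Q = 5,000 row of the table, and uses hypothetical function names. Under those assumptions it tracks the tabulated values to within rounding.

```python
# Sketch of the Q-scaling arithmetic behind Table S11.
# Assumption (not stated in this section): E_EO = N * (1/2) * C_el * V_ctrl^2.
# Function and constant names are illustrative, not taken from the authors' scripts.

N_RINGS = 30                  # cascade depth used in Table S11
C_EL = 18e-15                 # electrode capacitance in farads (table caption)
Q_REF, V_REF = 5_000, 4.57    # anchor row of Table S11: Q and V_ctrl in volts
ETA_WPE = 0.15                # laser wall-plug efficiency
P_OPT = 100e-6                # per-channel optical power in watts
F_MOD = 10e9                  # modulation rate in hertz
E_PD = 0.50e-12               # photodetector + TIA energy in joules


def v_ctrl(q):
    """Control voltage, scaled as 1/Q from the anchor row."""
    return V_REF * Q_REF / q


def e_eo(q):
    """EO tuning energy per operation in joules (assumed capacitive form)."""
    return N_RINGS * 0.5 * C_EL * v_ctrl(q) ** 2


def e_laser():
    """Amortized laser energy per operation in joules."""
    return P_OPT / ETA_WPE / F_MOD


def e_total(q):
    """EO + laser + photodetector energy in joules (thermal excluded)."""
    return e_eo(q) + e_laser() + E_PD


if __name__ == "__main__":
    for q in (5_000, 10_000, 12_500, 15_500, 20_000, 25_200, 30_000, 50_000):
        print(f"Q={q:>6}  V_ctrl={v_ctrl(q):5.2f} V  "
              f"E_EO={e_eo(q) * 1e12:5.2f} pJ  E_total={e_total(q) * 1e12:5.2f} pJ")
```

Because the laser and photodetector terms do not depend on Q, the total flattens toward the 0.57 pJ floor once the EO term falls below a few tenths of a picojoule.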
+ i. Additional energy contributions. The estimates above exclude two further contributions: (i) DAC energy
+for setting the per-ring control voltages, typically 0.1–1 pJ per conversion at 10 GHz bandwidth; and (ii) thermal
+stabilization power for maintaining resonance alignment, estimated at ∼50–200 µW per ring for TFLN (lower than
+silicon due to the small thermo-optic coefficient of LiNbO3, dn/dT ≈ 3.9 × 10^{-6} K^{-1}). At 10 GHz modulation rate,
+the thermal contribution amounts to ∼0.005–0.02 pJ per ring per operation. For the N = 30 cascade, this sums to
+0.15–0.60 pJ, which is comparable to EEO and must be included in the total: Etotal ≈ 0.94–1.39 pJ. The total energy
+comparison should therefore be treated as an order-of-magnitude estimate.
+
+
+ S7.4 Comparison with electronic implementations
+
+ Here we provide an order-of-magnitude energy comparison between electrical analog exponential circuits and our
+photonic MRR cascade, grounding the analysis in published device data and first-principles estimates. We assume
+a shared CW laser with total optical output Pin,tot = 1 mW, split across M = 10 parallel softmax channels via a
+1 × M power splitter, yielding per-channel input Pin,ch = 100 µW. The output power at the cascade drop port is
+Pout = Pin,ch × Dmax^N, which ranges from 0.61 µW (FDTD regime, N = 5) to 21.5 µW (fabricated regime, N = 30)
+(Table V).
+ j. Electrical analog exponential circuits. Three electronic reference implementations are compared: two analog
+families, the BJT translinear/Gilbert cell (∼ 3 pJ at 100 MHz [17–19]) and CMOS subthreshold (∼ 0.43 pJ at 1 MHz [18, 20]),
+and a digital FP32 Taylor series (∼ 46 pJ at 1 GHz [21]).
+ k. Photonic MRR cascade: single-channel energy. For N = 30 X-cut TFLN micro-ring resonators in the self-
+consistent high-Q regime (QL = 25,200), the three energy components are EO tuning (EEO = 0.22 pJ), amortized
+laser (Elaser = 0.07 pJ, shared across M = 10 parallel channels), and photodetector (EPD = 0.50 pJ), yielding
+Ephotonic = 0.79 pJ. Including thermal stabilization for N = 30 rings (0.15–0.60 pJ), the total rises to 0.94–1.39 pJ.
+Notably, EEO ∝ 1/N since b ∝ 1/N from minimax optimization.
+ l. Single-channel comparison. Table S12 presents the comparison. The photonic cascade at N = 30 achieves
+0.79 pJ baseline—3.8× lower than the BJT Gilbert cell (3 pJ) and 58× lower than digital FP32 (46 pJ). Including
+thermal stabilization (0.94–1.39 pJ), the advantage over INT8 (2.3 pJ) is 1.7–2.4×, while operating at 10 GHz
+bandwidth. At fabricated Q ≥ 30,000, EEO drops to 0.16 pJ and Etotal ≈ 0.73 pJ (excluding thermal; Table S11),
+recovering a 3.2× advantage over INT8. Subthreshold CMOS achieves the lowest energy (0.43 pJ) but at 10,000×
+lower bandwidth.
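The ratios quoted above follow from simple division. The short sketch below redoes that arithmetic, including the thermal-stabilization range (50–200 µW per ring at 10 GHz for N = 30). The helper names are hypothetical and the energy values are copied from the text rather than derived independently.

```python
# Sketch of the single-channel comparison arithmetic; values copied from the text.
# Helper names are illustrative, not taken from the authors' scripts.

F_MOD = 10e9          # modulation rate in hertz
N_RINGS = 30          # cascade depth
E_BASE_PJ = 0.79      # photonic baseline (EO + laser + PD) in pJ
E_HIGHQ_PJ = 0.73     # Q = 30,000 row of Table S11 in pJ, thermal excluded

REFERENCE_PJ = {
    "BJT Gilbert cell": 3.0,
    "Digital FP32 (Taylor)": 46.0,
    "Digital INT8 (Taylor)": 2.3,
}


def thermal_pj(power_per_ring_w, n_rings=N_RINGS, f_mod=F_MOD):
    """Thermal stabilization energy per operation for the whole cascade, in pJ."""
    return n_rings * power_per_ring_w / f_mod * 1e12


def advantage(e_ref_pj, e_photonic_pj):
    """How many times lower the photonic energy is than the reference."""
    return e_ref_pj / e_photonic_pj


if __name__ == "__main__":
    low = E_BASE_PJ + thermal_pj(50e-6)
    high = E_BASE_PJ + thermal_pj(200e-6)
    print(f"photonic total incl. thermal: {low:.2f}-{high:.2f} pJ")
    for name, e_ref in REFERENCE_PJ.items():
        print(f"{name}: {advantage(e_ref, E_BASE_PJ):.1f}x vs baseline, "
              f"{advantage(e_ref, high):.1f}-{advantage(e_ref, low):.1f}x incl. thermal")
```

The comparison ignores the bandwidth difference between implementations, which is why the text reports bandwidth alongside each energy figure.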
+ m. Caveats. These values are order-of-magnitude estimates, not device-accurate predictions. The photonic
+estimate excludes DAC energy for voltage generation (typically 0.1–1 pJ per conversion at 10 GHz bandwidth, shared
+with any analog approach) and thermal tuning power for maintaining resonance alignment (∼50–200 µW per ring for
+TFLN, lower than silicon due to the small thermo-optic coefficient of LiNbO3, dn/dT ≈ 3.9 × 10^{-6} K^{-1}). Effective
+precision at the photodetector is limited to ∼6–8 bits by shot noise and receiver electronics. The energy advantage
+over electrical implementations is strongest in the fabricated high-Q regime (Dmax ≥ 0.95), where N = 30 is practical
+and Vctrl remains CMOS-compatible.
+
+ TABLE S12: Energy per exponential operation: single-channel comparison.
+
+Implementation           E/op (pJ)    Bandwidth   Notes
+Digital FP32 (Taylor)    ∼46          1 GHz       10 FP MACs
+BJT Gilbert cell         ∼3           100 MHz     Analog
+Digital INT8 (Taylor)    ∼2.3         1 GHz       10 INT MACs
+Photonic MRR (N = 30)    0.94–1.39    10 GHz      Analog†
+Subthreshold CMOS        ∼0.43        1 MHz       Analog
+ † 0.79 pJ excluding thermal; 0.94–1.39 pJ including thermal. Self-consistent with fabricated high-Q regime (QL = 25,200); see
+ Supplementary Sec. S7.
+
+ S8. MONTE CARLO ROBUSTNESS UNDER DEVICE NON-IDEALITIES
+
+ This section describes the robustness model summarized in the main text. For the fitted L = 8, N = 10 design
+(a = −1.4588, b = 0.10202), each Monte Carlo chip sample includes: (i) per-ring static detuning spread, (ii) per-
+ring sensitivity spread, (iii) global thermal drift and crosstalk-like slope drift, (iv) stage insertion-loss variation, (v)
+control-channel noise, and (vi) detector noise with one-point calibration at I = L.
+ For ring k, we use
+ T_k(I) = \frac{1}{1 + (a_k + b_k I + d_{th} + d_{xt} I/L)^2}, (46)
+
+with
+ y(I) = \prod_{k=1}^{N} T_k(I) \times 10^{-IL_{tot}/10}, (47)
+
+and one-point calibration ỹ(I) = Ccal y(I) such that ỹ(L) = 1 for the same chip instance.
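A minimal sketch of one Monte Carlo chip sample built from Eqs. (46) and (47) is given below. It is not the authors' script. It assumes independent Gaussian draws with the nominal spreads of Table S13, clips negative per-stage insertion loss to zero, and omits control-channel and detector noise to keep the example short. All function and variable names are hypothetical.

```python
import math
import random

# Fitted L = 8, N = 10 design quoted in the text.
A_FIT, B_FIT = -1.4588, 0.10202
L_RANGE, N_RINGS = 8.0, 10

# Nominal spreads from Table S13; Gaussian sampling is an assumption of this sketch.
NOMINAL = {"sigma_a": 0.020, "sigma_b_rel": 0.020, "sigma_th": 0.015,
           "sigma_xt": 0.012, "il_mean_db": 0.12, "il_sigma_db": 0.03}


def sample_chip(rng, params=NOMINAL, n_rings=N_RINGS):
    """Draw one chip instance: per-ring (a_k, b_k), global drifts, total loss in dB."""
    return {
        "a": [rng.gauss(A_FIT, params["sigma_a"]) for _ in range(n_rings)],
        "b": [B_FIT * (1.0 + rng.gauss(0.0, params["sigma_b_rel"]))
              for _ in range(n_rings)],
        "d_th": rng.gauss(0.0, params["sigma_th"]),
        "d_xt": rng.gauss(0.0, params["sigma_xt"]),
        # Negative loss draws are clipped to zero (assumption).
        "il_tot_db": sum(max(0.0, rng.gauss(params["il_mean_db"], params["il_sigma_db"]))
                         for _ in range(n_rings)),
    }


def cascade_output(chip, i_ctrl, l_range=L_RANGE):
    """Eq. (47): product of the Lorentzian stages of Eq. (46) times total loss."""
    y = 10.0 ** (-chip["il_tot_db"] / 10.0)
    for a_k, b_k in zip(chip["a"], chip["b"]):
        u = a_k + b_k * i_ctrl + chip["d_th"] + chip["d_xt"] * i_ctrl / l_range
        y *= 1.0 / (1.0 + u * u)
    return y


def calibrated_output(chip, i_ctrl, l_range=L_RANGE):
    """One-point calibration so that the output equals 1 at I = L for this chip."""
    c_cal = 1.0 / cascade_output(chip, l_range)
    return c_cal * cascade_output(chip, i_ctrl, l_range)


# Ideal chip without any spread, for reference.
IDEAL_CHIP = {"a": [A_FIT] * N_RINGS, "b": [B_FIT] * N_RINGS,
              "d_th": 0.0, "d_xt": 0.0, "il_tot_db": 0.0}

if __name__ == "__main__":
    chip = sample_chip(random.Random(0))
    for i_ctrl in (0.0, 2.0, 4.0, 6.0, 8.0):
        print(i_ctrl,
              math.log(calibrated_output(IDEAL_CHIP, i_ctrl)),
              math.log(calibrated_output(chip, i_ctrl)))
```

For the ideal chip the log of the calibrated output stays close to I − L over [0, L], which is the exponential target after one-point calibration. A full sweep would repeat `sample_chip` many times and feed the calibrated outputs into a softmax normalization before computing KL divergence.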
+
+ TABLE S13: Non-ideality distributions used in the Monte Carlo sweeps.
+
+ Parameter Nominal Stress
+ σa 0.020 0.032
+ σb,rel 0.020 0.032
+ σth 0.015 0.025
+ σxt 0.012 0.020
+ σI 0.004 0.007
+ ILstage (dB, µ ± σ) 0.12 ± 0.03 0.18 ± 0.05
+ σdet 3.0 × 10^{-6} 6.0 × 10^{-6}
+
+
+
+ TABLE S14: Monte Carlo summary (same run reported in main text).
+
+ Metric Nominal Stress
+ Median KL(pref ∥ papprox) 2.17 × 10^{-4} 7.39 × 10^{-4}
+ p95 KL(pref ∥ papprox) 5.92 × 10^{-4} 2.21 × 10^{-3}
+ Median max |∆p| 0.170% 0.193%
+ p95 max |∆p| 0.319% 0.419%
+
+Conservative-bound sketch used for the main-text screening equation. For the identical-detuning family
+with fixed b, define
+
+ \ln \tilde{y}(I) = N \phi(a + bI) - N \phi(a + bL), \quad \phi(u) = -\ln(1 + u^2), (48)
+
+so that ỹ(L) = 1 by construction. Around a constructive choice with a + bI < 0 on [0, L], a second-order remainder
+argument for the mismatch between the target slope and the fitted slope yields a term scaling as L^2/(4N), while the
+flank-curvature penalty contributes a term scaling as 1/(2 b^2 N). Combining the two contributions gives the screening
+inequality
+
+ E_\infty \lesssim \frac{L^2}{4N} + \frac{1}{2 b^2 N}, (49)
+which leads to the conservative screening equation reported in the main manuscript. We emphasize that this is a
+conservative heuristic design rule (not a formal minimax theorem), used only for preliminary depth screening.
+
+
+
+
+FIG. S4: CDF of end-to-end softmax probability error under the same non-ideality samples.
+
+ S9. DELAY-AWARE FEEDBACK NORMALIZATION VALIDATION
+
+ We model global normalization as a delayed PI-controlled loop:
+
+ S(t) = G(t) P(t) + n(t), (50)
+ \tau \frac{dP}{dt} = -P(t) + u(t - T_d), (51)
+ u(t) = K_p e(t) + K_i \int e(t)\, dt, \quad e(t) = S_{ref} - S(t), (52)
+
+with actuator saturation 0 ≤ u ≤ Pmax . A piecewise G(t) profile is used to emulate workload changes. For physical
+intuition, Table S15 converts normalized delay/settling metrics into absolute-time examples.
+
+TABLE S15: Example absolute-time interpretation of normalized PI-loop metrics using one representative stable case
+ ((Kp , Ki , Td /τ ) = (0.55, 0.8, 0.2)) and a ±2% settling-time definition (Tsettle ∼ 12.4τ ).
+
+ Assumed τ Delay Td = 0.2τ Settling ∼ 12.4τ Interpretation
+ 100 ns 20 ns 1.24 µs fast loop
+ 1 µs 200 ns 12.4 µs moderate loop
+ 5 µs 1 µs 62 µs slower loop
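A minimal time-domain sketch of the delayed PI loop of Eqs. (50)–(52) is shown below, integrated with forward Euler in normalized time t/τ. It makes several assumptions that this section does not specify: constant G(t) = 1, no noise, Ki expressed in units of 1/τ, zero initial state, and no anti-windup. It is therefore not expected to reproduce the overshoot or settling values reported in Table S18. The function names are hypothetical.

```python
from collections import deque


def simulate_pi_loop(kp=0.55, ki=0.8, td_over_tau=0.2, s_ref=1.0,
                     gain=lambda t: 1.0, p_max=5.0, t_end=40.0, dt=1e-3):
    """Forward-Euler integration of the delayed PI loop in normalized time t/tau.

    Assumptions: n(t) = 0, Ki in units of 1/tau, P(0) = 0, no anti-windup.
    Returns a list of (t/tau, S) samples.
    """
    n_delay = max(1, round(td_over_tau / dt))
    # Ring buffer holding the last n_delay actuator commands; index 0 is u(t - Td).
    u_hist = deque([0.0] * n_delay, maxlen=n_delay)
    p = 0.0
    integral = 0.0
    trace = []
    for k in range(int(round(t_end / dt))):
        t = k * dt
        s = gain(t) * p                      # Eq. (50) without noise
        e = s_ref - s
        integral += e * dt
        u = min(max(kp * e + ki * integral, 0.0), p_max)   # Eq. (52) with saturation
        u_delayed = u_hist[0]
        u_hist.append(u)
        p += dt * (-p + u_delayed)           # Eq. (51) with tau = 1
        trace.append((t, s))
    return trace


def settling_time(trace, s_ref=1.0, band=0.02):
    """Time after which S stays within +/- band of s_ref; None if it never settles."""
    last_outside = None
    for i, (_, s) in enumerate(trace):
        if abs(s - s_ref) > band * s_ref:
            last_outside = i
    if last_outside is None:
        return trace[0][0]
    if last_outside == len(trace) - 1:
        return None
    return trace[last_outside + 1][0]


if __name__ == "__main__":
    trace = simulate_pi_loop()
    peak = max(s for _, s in trace)
    print("overshoot (%):", 100.0 * max(0.0, peak - 1.0))
    print("settling time (units of tau):", settling_time(trace))
```

Multiplying the normalized settling time by an assumed τ gives absolute times in the same way as Table S15; a piecewise `gain` callable can be passed in to emulate workload changes.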
+
+Reference-backed latency context for bottleneck screening. To place the delayed-loop times against mixed-
+signal system latencies, Table S16 summarizes representative time scales with explicit path classes (on-chip vs off-chip)
+for memory and interconnect paths, alongside conversion latency ranges. These are intentionally order-of-magnitude
+ranges (not fixed constants), and can shift with architecture, clocking, and protocol stack choices.
+
+ TABLE S16: Representative subsystem latency ranges used for conservative bottleneck screening in Sec. S9.
+
+ Subsystem path Tsys Sources
+ On-chip memory (L1/L2) 20–200 ns [25]
+ Off-chip memory (DRAM) 200–700 ns [25, 26]
+ ADC conversion 10–710 ns [27, 28]
+ DAC + driver/settling 1–200 ns [29]
+ On-chip interconnect (NoC) 5–100 ns [30]
+ Off-chip I/O (PCIe/CXL) 1–10 µs [25, 31]
+
+Conservative risk-screening heuristic for loop latency. As a screening heuristic, we use the settling time from
+one representative stable case ((Kp , Ki , Td /τ ) = (0.55, 0.8, 0.2); Table S18), with settling defined as the first time
+entering and remaining within a ±2% band around Sref , as a normalization-loop latency proxy:
+
+ Tnorm ≈ 12.4 τ. (53)
+
+This value is not a universal bound; different gain settings, delay ratios, or loop architectures will yield different settling
+times. It is used only as a reference point for order-of-magnitude risk screening. Define the conservative screening
+metric
+
+ Tnorm ≥ β Tsys , (54)
+
+with β = 1 (high-risk screening line) and β = 0.5 (early-warning line); this is a heuristic risk indicator, not a formal
+dominance proof. The corresponding threshold is
+ \tau_{crit}(\beta) = \frac{\beta T_{sys}}{12.4}. (55)
+Table S17 gives the resulting numeric ranges.
+For the explicit examples in Table S15, τ = 0.1 µs gives Tnorm ≈ 1.24 µs, τ = 1 µs gives Tnorm ≈ 12.4 µs, and τ = 5 µs
+gives Tnorm ≈ 62 µs. These numbers indicate a risk trend (not a hard boundary): for this representative case, the
+normalization loop is typically non-dominant when τ is well below the relevant τcrit band, and it may become dominant
+as τ approaches or exceeds that band. The transition depends on path class (on-chip vs off-chip) and on architecture-
+specific timing closure, including whether the normalization path lies on the end-to-end critical path (Table S16).
+Accordingly, this analysis is intended for preliminary risk screening only; concrete implementations
+require full timing validation.
+
+ TABLE S17: Computed τcrit ranges from Eq. (55) using Table S16.
+
+ Subsystem                    Tsys range    τcrit (β = 0.5)    τcrit (β = 1)
+ On-chip memory path          20–200 ns     0.81–8.06 ns       1.61–16.13 ns
+ Off-chip memory path         200–700 ns    8.06–28.23 ns      16.13–56.45 ns
+ ADC conversion               10–710 ns     0.40–28.63 ns      0.81–57.26 ns
+ DAC+driver/settling          1–200 ns      0.04–8.06 ns       0.08–16.13 ns
+ On-chip interconnect (NoC)   5–100 ns      0.20–4.03 ns       0.40–8.06 ns
+ Off-chip I/O fabric          1–10 µs       0.04–0.40 µs       0.08–0.81 µs
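The τcrit ranges follow directly from Eq. (55). The sketch below recomputes them from the Tsys ranges of Table S16 and adds a small classifier based on Eq. (54). The label "non-dominant" and the function names are illustrative choices, and the 12.4 factor applies only to the representative stable case quoted in the text.

```python
# Sketch of the screening arithmetic in Eqs. (53)-(55).
# The factor 12.4 is the settling time (in units of tau) of one representative case.
SETTLING_FACTOR = 12.4

# T_sys ranges in nanoseconds, copied from Table S16.
T_SYS_NS = {
    "On-chip memory (L1/L2)": (20.0, 200.0),
    "Off-chip memory (DRAM)": (200.0, 700.0),
    "ADC conversion": (10.0, 710.0),
    "DAC + driver/settling": (1.0, 200.0),
    "On-chip interconnect (NoC)": (5.0, 100.0),
    "Off-chip I/O (PCIe/CXL)": (1_000.0, 10_000.0),
}


def tau_crit_ns(t_sys_ns, beta=1.0):
    """Eq. (55): loop time constant at which T_norm reaches beta * T_sys."""
    return beta * t_sys_ns / SETTLING_FACTOR


def screen(tau_ns, t_sys_ns):
    """Classify a loop time constant against one subsystem latency, per Eq. (54)."""
    t_norm = SETTLING_FACTOR * tau_ns
    if t_norm >= t_sys_ns:
        return "high-risk"          # beta = 1 screening line
    if t_norm >= 0.5 * t_sys_ns:
        return "early-warning"      # beta = 0.5 screening line
    return "non-dominant"           # illustrative label, not from the source


if __name__ == "__main__":
    for name, (lo, hi) in T_SYS_NS.items():
        print(f"{name:28s} "
              f"beta=0.5: {tau_crit_ns(lo, 0.5):7.2f}-{tau_crit_ns(hi, 0.5):7.2f} ns   "
              f"beta=1: {tau_crit_ns(lo):7.2f}-{tau_crit_ns(hi):7.2f} ns")
```

As the text notes, these thresholds indicate a trend rather than a hard boundary, so the classifier is only a first-pass filter ahead of full timing validation.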
+
+TABLE S18: Representative step-response cases for the delayed PI loop (settling defined by a ±2% band around Sref ).
+
+ Case (Kp , Ki , Td /τ ) Overshoot Settling Stable
+ Stable (0.55, 0.8, 0.2) 25.6% ∼ 12.4τ Yes
+ Marginal (0.95, 1.6, 0.45) 25.6% ∼ 12.8τ Yes
+ Unstable (1.2, 2.2, 0.75) 45.1% not settled No
+
+
+
+ TABLE S19: Stable-region fraction from gain-map scans at each delay ratio.
+
+ Td /τ Stable fraction
+ 0.0 88.1%
+ 0.2 88.0%
+ 0.5 72.4%
+ 0.8 47.5%
+
+
+
+
+FIG. S5: Step-response examples of the delayed PI normalization loop.
+
+
+
+
+FIG. S6: Delay-dependent stability maps over scanned (Kp , Ki ) ranges.
+
+ S10. REPRODUCIBILITY
+
+ Scripts used for this Supplementary validation:
+ • scripts/nonideality_montecarlo.py
+ • scripts/feedback_loop_validation.py
+ • scripts/extract_logit_range_effective.py
+ • scripts/analyze_softmax_clipping_validity.py
+Public code repository: https://github.com/hyoseokp/MRR-AEF (commit 585e695). Empirical extraction outputs
+are stored under:
+ • paper/empirical_L_v3/
+
+
+
+ [1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia
+ Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems 30 (NeurIPS 2017), pages
+ 5998–6008, 2017.
+ [2] Alec Radford et al. Language models are unsupervised multitask learners. Technical report, OpenAI, 2019.
+ [3] Hugging Face. distilgpt2 model card, 2025. accessed 2026-02-21.
+ [4] Andrej Karpathy. Tiny shakespeare dataset (char-rnn), 2025. accessed 2026-02-21.
+ [5] Jane Austen. Pride and prejudice. Project Gutenberg eBook No. 1342, 2025. accessed 2026-02-21.
+ [6] Di Zhu et al. Integrated photonics on thin-film lithium niobate. Advances in Optics and Photonics, 13(2):242–352, 2021.
+ [7] Yaowen Hu, Di Zhu, Shengyuan Lu, Xinrui Zhu, Yunxiang Song, Dylan Renaud, Daniel Assumpcao, Rebecca Cheng,
+ CJ Xin, Matthew Yeh, Hana Warner, Xiangwen Guo, Amirhassan Shams-Ansari, David Barton, Neil Sinclair, and Marko
+ Lončar. Integrated electro-optics on thin-film lithium niobate. Nature Reviews Physics, 2025.
+ [8] Mian Zhang, Cheng Wang, Rebecca Cheng, Amirhassan Shams-Ansari, and Marko Lončar. Monolithic ultra-high-Q lithium
+ niobate microring resonator. Optica, 4(12):1536–1537, 2017.
+ [9] Renhong Gao, Ni Yao, Jianglin Guan, Li Deng, Jintian Lin, Min Wang, Lingling Qiao, Wei Fang, and Ya Cheng. Lithium
+ niobate microring with ultra-high Q factor above 10^8. Chin. Opt. Lett., 20(1):011902, 2022.
+[10] Rongjin Zhuang, Jinze He, Yifan Qi, and Yang Li. High-Q thin-film lithium niobate microrings fabricated with wet etching.
+ Adv. Mater., 35(3):2208113, 2023.
+[11] Xinrui Zhu, Yaowen Hu, Shengyuan Lu, Hana K. Warner, Xudong Li, Yunxiang Song, Letícia S. Magalhães, Amirhassan
+ Shams-Ansari, Neil Sinclair, and Marko Lončar. Twenty-nine million intrinsic Q-factor monolithic microresonators on
+ thin-film lithium niobate. Photon. Res., 12(8):A63–A68, 2024.
+[12] Sudip Shekhar, Wim Bogaerts, Lukas Chrostowski, John E. Bowers, Michael Hochberg, Richard Soref, and Bhavin J.
+ Shastri. Roadmapping the next generation of silicon photonics. Nature Communications, 15:751, 2024.
+[13] Xaveer Leijtens et al. Multimode silicon photonics. Nanophotonics, 7:1571–1580, 2018.
+[14] Haoqian Li et al. In-memory photonic dot-product engine with electrically programmable weight banks. Nature Communi-
+ cations, 14:2389, 2023.
+[15] Daan Vermeulen et al. High-efficiency fiber-to-chip grating couplers realized using an advanced CMOS-compatible silicon-on-
+ insulator platform. Optics Express, 18(17):18278–18283, 2010.
+[16] F. S. Tan, D. J. W. Klunder, H. F. Bulthuis, G. Sengo, H. J. W. M. Hoekstra, and A. Driessen. Direct measurement of
+ the on-chip insertion loss of high finesse microring resonators in Si3N4-SiO2 technology. In Proceedings of the IEEE LEOS
+ Benelux Chapter, 2001.
+[17] B. Gilbert. Translinear circuits: a proposed classification. Electron. Lett., 11(1):14–16, 1975.
+[18] C. Mead. Analog VLSI and Neural Systems. Addison-Wesley, 1989.
+[19] B. Razavi. Design of Analog CMOS Integrated Circuits. McGraw-Hill, 2nd edition, 2017.
+[20] Massimo Vatalaro, Tatiana Moposita, Sebastiano Strangio, Lionel Trojman, Andrei Vladimirescu, Marco Lanuzza, and
+ Felice Crupi. A low-voltage, low-power reconfigurable current-mode softmax circuit for analog neural networks. Electronics,
+ 10(9):1004, 2021.
+[21] M. Horowitz. 1.1 Computing’s energy problem (and what we can do about it). In 2014 IEEE International Solid-State
+ Circuits Conference (ISSCC), pages 10–14, 2014.
+[22] Meisam Bahadori, Yansong Yang, Ahmed E. Hassanien, Lynford L. Goddard, and Songbin Gong. Ultra-efficient and fully
+ isotropic monolithic microring modulators in a thin-film lithium niobate photonics platform. Optics Express, 28(20):29644–
+ 29661, 2020.
+[23] A. Biberman and K. Bergman. Optical interconnection networks for high-performance computing systems. Rep. Prog.
+ Phys., 75(4):046402, 2012.
+
+[24] D. A. B. Miller. Attojoule optoelectronics for low-energy information processing and communications. J. Lightwave Technol.,
+ 35(3):346–396, 2017.
+[25] Zhe Jia, Marco Maggioni, Benjamin Staiger, and Daniele P. Scarpazza. Dissecting the NVIDIA Volta GPU architecture via
+ microbenchmarking. arXiv preprint arXiv:1804.06826, 2018.
+[26] Yoongu Kim, Vivek Seshadri, Donghyuk Lee, Jamie Liu, and Onur Mutlu. A case for exploiting subarray-level parallelism
+ (SALP) in DRAM. In Proceedings of the 39th Annual International Symposium on Computer Architecture (ISCA), pages
+ 368–379, 2012.
+[27] Texas Instruments. ADC12DJ3200: 6.4-GSPS single-channel or 3.2-GSPS dual-channel, 12-bit, RF-sampling analog-to-digital
+ converter. Datasheet (SLVSD97A, revised April 2020), 2020. Accessed 2026-02-22.
+[28] Texas Instruments. ADS8881: 18-bit, 1-MSPS, low-power, true-differential SAR ADC. Datasheet (SBAS547D, revised
+ August 2015), 2015. Accessed 2026-02-22.
+[29] Texas Instruments. DAC38RF82/DAC38RF89: Dual-channel, 14-bit, 9-GSPS and 6-GSPS RF DACs. Datasheet
+ (SLASEA6D, revised June 2020), 2020. Accessed 2026-02-22.
+[30] W. J. Dally and B. Towles. Route packets, not wires: on-chip interconnection networks. In Proceedings of the 38th Design
+ Automation Conference (DAC), pages 684–689, 2001.
+[31] Shintaro Sano, Yosuke Bando, Kazuhiro Hiwada, Hirotsugu Kajihara, Tomoya Suzuki, Yu Nakanishi, Daisuke Taki, and
+ Akiyuki Kaneko. GPU graph processing on CXL-based microsecond-latency external memory. In Proceedings of the SC ’23
+ Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2023.
+ \ No newline at end of file