| author | Yuren Hao <yurenh2@illinois.edu> | 2026-07-03 05:56:50 -0500 |
|---|---|---|
| committer | Yuren Hao <yurenh2@illinois.edu> | 2026-07-03 05:56:50 -0500 |
| commit | b83947778e2c776f757a07d4719b7ce961d7ed55 (patch) | |
| tree | b9cc01d7adda691d9156d9d04f4fb2f644674e96 /ep_run/extracted_paper.txt | |
Initial commit: ept — backprop-free equilibrium transformer (EP)
Code (ep_run/), organized docs (docs/{method,campaign,hardware,outreach,paper}),
analysis scripts (scripts/), ONBOARDING.md entry point. Large data/checkpoints
git-ignored (share separately).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_014FAPDWQ49M5Ye3NpTndTpn
Diffstat (limited to 'ep_run/extracted_paper.txt')
| -rw-r--r-- | ep_run/extracted_paper.txt | 2039 |
1 files changed, 2039 insertions, 0 deletions
diff --git a/ep_run/extracted_paper.txt b/ep_run/extracted_paper.txt
new file mode 100644
index 0000000..4f521d8
--- /dev/null
+++ b/ep_run/extracted_paper.txt
@@ -0,0 +1,2039 @@
+Photonic Exponential Approximation via Cascaded TFLN Microring Resonators toward Softmax
+
+Hyoseok Park (1) and Yeonsang Park (1, *)
+(1) Department of Physics, Chungnam National University, Daejeon 34134, Republic of Korea
+(*) yeonsang.park@cnu.ac.kr; Corresponding author
+(Dated: March 26, 2026)
+arXiv:2603.12934v3 [physics.optics] 25 Mar 2026
+
+The rapid growth of large-scale AI models has intensified energy consumption and data-movement challenges in modern datacenters. Photonic accelerators offer a promising path by executing the linear matrix multiplications of transformer inference at high throughput and low energy. However, the softmax attention layer, which requires element-wise exponentiation followed by normalization, still relies on electronic post-processing, creating an electro-optic conversion bottleneck that negates much of the potential photonic advantage.
+
+We present a cascaded micro-ring resonator (MRR) architecture that synthesizes the per-channel exponential function required by softmax, $e^{x_n - \max(x)}$, over a finite interval with tunable worst-case relative error. A control signal detunes each ring via an electro-optic mechanism; a weak probe at fixed frequency experiences Lorentzian transmission, and cascading $N$ identical stages yields a multiplicative transfer function whose logarithm is approximately linear.
+
+We derive mapping rules, depth-scaling estimates, and a minimax fitting formulation, and validate the framework with three-dimensional FDTD simulations of X-cut thin-film lithium niobate (TFLN) add-drop micro-ring resonators. Direct multi-ring FDTD validation extends to a five-ring cascade and confirms agreement with theory primarily over the upper operating range; deeper cascades and higher quality factors are assessed analytically. The cascade implements the per-channel exponential block, the key missing nonlinearity for photonic softmax. We further present a WDM-parallel chip architecture with closed-loop PI feedback that completes the full softmax (exponentiation, summation, and normalization) on a single photonic chip without per-channel normalization circuitry.
+
+I. INTRODUCTION
+
+Transformer inference is often limited by power and memory traffic, motivating optical accelerators that exploit parallel propagation and multiplexing [1, 2, 4, 5, 7, 9]. Recent perspective articles also discuss data-center power consumption as one motivation for optical computing [3, 8]. While linear operators are comparatively amenable to photonic implementation [4–6], the softmax function used in attention layers requires an exponential mapping together with global normalization, both difficult to realize in passive photonic circuits, where transmission is fundamentally bounded by unity. Parallel digital-hardware studies treat the exponential/softmax stage as a bottleneck and propose dedicated approximations [11–19]. Many integrated-photonic classifier demonstrations still rely on electronic post-processing for the final nonlinear readout [10]; the resulting electro-optic conversion overhead can negate the throughput and energy benefits of the photonic front-end. Notably, the SOFTONIC architecture [11] explicitly argues that “the inability of MRRs and MZMs to handle SMA’s exponential and division functions” necessitates alternative approaches based on microdisk modulators and polynomial approximation, achieving 89.7% accuracy with a third-degree Chebyshev polynomial. Here we challenge this premise: we show that a passive Lorentzian cascade of microring resonators can be tuned so that its logarithm is approximately linear over a finite interval, enabling exponential-function synthesis with sub-2% worst-case error (an order of magnitude more accurate than SOFTONIC’s polynomial approach) while remaining compatible with integrated microring platforms [20–24]. We term this cascade block an approximate exponential function (AEF) unit. We further propose a WDM-parallel architecture with a single PI feedback loop that realizes the complete softmax function, including summation and normalization, without per-channel electronic processing.
+
+We extend the theoretical framework with three-dimensional FDTD simulations of a single X-cut TFLN add-drop micro-ring resonator. The simulated device parameters (quality factor, free spectral range, and electro-optic sensitivity) calibrate the cascade design parameters, bridging analytical fitting and physically realizable hardware. Two operating regimes emerge from this calibration: an FDTD-characterized regime with moderate drop-port depth ($D_{\max} \approx 0.36$), where the analytic error falls to ~5% only at $N = 7$ while the power budget limits practical cascades to $N \le 5$; and a projected high-Q regime ($D_{\max} \ge 0.95$), enabling deeper cascades ($N \le 30$) with sub-percent error. Cascade performance is predicted analytically and validated by a five-ring cascade 3D FDTD simulation (Sec. IV).
+
+The paper is organized as follows: Section II presents the mapping, transfer model, and depth-design rules; Section III provides numerical fits and validation; Section IV describes the single-ring TFLN device design and FDTD validation; Section V assesses physical feasibility including voltage requirements, insertion loss, and energy efficiency; Section VI discusses implementation scope, platform comparisons, and limits; and Section VII concludes.
+
+II. MODEL AND DESIGN FRAMEWORK
+
+Target mapping. Let $x = (x_1, \ldots, x_K) \in \mathbb{R}^K$ be an arbitrary real-valued sequence (or vector). Directly generating $\exp(x_n)$ as a passive optical transmission is impossible in general because $\exp(x)$ grows beyond unity while a passive transmission satisfies $0 < T \le 1$ [25]. However, for softmax,
+$$\mathrm{softmax}(x)_n = \frac{e^{x_n}}{\sum_j e^{x_j}}, \tag{1}$$
+a common shift cancels:
+$$\frac{e^{x_n + c}}{\sum_j e^{x_j + c}} = \frac{e^{x_n}}{\sum_j e^{x_j}} \quad (\forall c \in \mathbb{R}). \tag{2}$$
+Thus it suffices to generate
+$$e^{x_n - m}, \qquad m \equiv \max_j x_j, \tag{3}$$
+since the global factor $e^{m}$ cancels.
+
+To ensure a nonnegative control-signal amplitude, define
+$$u_n \equiv x_n - m \le 0, \qquad L \equiv -\min_n u_n = m - \min_n x_n \ge 0, \tag{4}$$
+and map each scalar to a nonnegative control-signal amplitude
+$$I_n \equiv u_n + L \in [0, L]. \tag{5}$$
+Then
+$$e^{x_n - m} = e^{u_n} = e^{I_n - L}. \tag{6}$$
+Hence the optical design task is to realize, for $I \in [0, L]$,
+$$f(I) = e^{I - L} \in [e^{-L}, 1]. \tag{7}$$
+
+Control–probe transfer. Consider a weak probe at fixed angular frequency $\omega_L$. For the $k$th ring, let $\omega_{0,k}$ denote its resonance frequency and $\Gamma > 0$ its loaded half-width at half maximum (HWHM). Define the detuning
+$$\Delta\omega_k \equiv \omega_L - \omega_{0,k}. \tag{8}$$
+Near resonance, the normalized Lorentzian transmission is modeled as [20, 21]
+$$T_k(\Delta\omega_k) = \frac{1}{1 + \left(\Delta\omega_k / \Gamma\right)^2}. \tag{9}$$
+
+In a control–probe architecture, a nonnegative control-signal amplitude $I \ge 0$ shifts the ring resonance. Here $I$ denotes a generic control amplitude: for optical-pump operation it maps to optical intensity, while for EO operation it maps to electrical control level (e.g., voltage). Across many physical mechanisms (optical pump via Kerr/XPM, EO drive via Pockels effect, thermal, carrier tuning), the shift can be linearized on a working range [20, 26–30]:
+$$\omega_{0,k}(I) = \omega_{0,k}^{(0)} + \eta I, \tag{10}$$
+where $\omega_{0,k}^{(0)}$ is the cold-cavity resonance and $\eta$ is the control-to-resonance sensitivity. In practice, the control channel can be optical or electrical (optical pump, EO/Pockels drive, thermal, or carrier tuning); a quantitative EO feasibility example is given in the Discussion. With $\Delta\omega_{0,k} \equiv \omega_L - \omega_{0,k}^{(0)}$, the control-dependent detuning becomes
+$$\Delta\omega_k(I) = \Delta\omega_{0,k} - \eta I. \tag{11}$$
+Define dimensionless parameters
+$$a_k \equiv \frac{\Delta\omega_{0,k}}{\Gamma}, \qquad b \equiv -\frac{\eta}{\Gamma}. \tag{12}$$
+Then Eq. (9) yields the control-to-probe transfer of a single ring,
+$$T_k(I) = \frac{1}{1 + (a_k + bI)^2}. \tag{13}$$
+
+Physical meaning: $a_k$ is a static detuning in linewidth units (set by heater/carrier tuning/fabrication), and $|b|$ is the normalized sensitivity magnitude (linewidths of resonance shift per unit control-signal amplitude); the sign convention is absorbed into the detuning expression. For “same-material/same-geometry” rings, $b$ is often common, while $a_k$ can be tuned per ring.
+
+Sign convention. Simultaneously flipping $(a_k, b) \mapsto (-a_k, -b)$ leaves $T_k(I)$ unchanged, so we may take $b > 0$ without loss of generality.
+
+Let $N$ rings be cascaded in a serial add-drop topology: $T_k(I)$ denotes the add-to-drop transmission of ring $k$, and the drop output of ring $k$ feeds the add (input bus) port of ring $k+1$. Assuming the probe is sufficiently weak so the control channel dominates the resonance shift, the normalized probe output is the product
+$$y(I) \equiv \frac{P_{\mathrm{out}}^{(\mathrm{probe})}(I)}{P_{\mathrm{in}}^{(\mathrm{probe})}} = \prod_{k=1}^{N} T_k(I) = \prod_{k=1}^{N} \frac{1}{1 + (a_k + bI)^2}. \tag{14}$$
+
+[Figure 1: block diagram. Panels: (a) Electronic Preprocessing (find max $m = \max(x_n)$; shift $u_n = x_n - m$; bias $I_n = u_n + L$; control $I_n$); (b) $N$-MRR Cascade (EO tuning, $N$ stages, probe at fixed $\omega_L$, MRR #1 to #5); (c) Output ($\tilde{y}(I_n) \approx \exp(I_n - L) \to \exp(x_n - m)$, photodetector PD).]
+FIG. 1: Overview of the control–probe add-drop cascade $N$-MRR exponential block. (a) Electronic preprocessing maps an arbitrary input sequence $\{x_n\}$ to a nonnegative control signal via $m = \max_n x_n$, $u_n = x_n - m$, and $I_n = u_n + L$ with $L = m - \min_n x_n$.
+(b) The control signal $I_n$ induces resonance shifts in a cascade of $N$ rings, while a weak fixed-frequency probe propagates through the serial add-drop cascade (the drop output of each ring feeds the next stage), experiencing multiplicative transmission. (c) After photodetection, the block implements $\tilde{y}(I_n) \approx \exp(I_n - L) \approx \exp(x_n - m)$, i.e., the normalized exponential used in softmax.
+
+To focus on the shape of the approximation, we allow a global scale factor $C > 0$:
+$$\tilde{y}(I) \equiv C\, y(I). \tag{15}$$
+In softmax, $p_n = C e^{I_n - L} / \sum_j C e^{I_j - L}$, so $C$ cancels between numerator and denominator and is physically inessential; nevertheless it is convenient for error analysis. For a fixed $(N, b, \{a_k\})$, the optimal $C$ for the minimax log-error in Eq. (18) can be written in closed form. Let $g(I) \equiv \ln y(I) - (I - L)$ on $[0, L]$. Then the minimax-optimal shift is $\ln C^\star = -(\max_I g(I) + \min_I g(I))/2$, yielding $E_\infty = (\max_I g(I) - \min_I g(I))/2$.
+
+Taking logarithms,
+$$\ln \tilde{y}(I) = \ln C - \sum_{k=1}^{N} \ln\!\left[1 + (a_k + bI)^2\right]. \tag{16}$$
+The target $\ln f(I) = I - L$ is linear; hence exponential approximation is equivalent to the log-linearization goal
+$$\ln \tilde{y}(I) \approx I - L \quad \text{uniformly on } I \in [0, L]. \tag{17}$$
+
+Error metric. Define the worst-case log-error on $[0, L]$:
+$$E_\infty \equiv \sup_{I \in [0, L]} \left| \ln \tilde{y}(I) - (I - L) \right|. \tag{18}$$
+If $E_\infty \le \varepsilon_{\log}$, then for all $I \in [0, L]$,
+$$e^{-\varepsilon_{\log}} \le \frac{\tilde{y}(I)}{f(I)} \le e^{\varepsilon_{\log}} \;\Rightarrow\; \left| \frac{\tilde{y}(I)}{f(I)} - 1 \right| \le e^{\varepsilon_{\log}} - 1. \tag{19}$$
+Thus achieving a prescribed worst-case relative error $\varepsilon$ is guaranteed by
+$$E_\infty \le \varepsilon_{\log} \equiv \ln(1 + \varepsilon) \approx \varepsilon. \tag{20}$$
+
+Depth scaling. We derive depth-related constraints and design rules for a prescribed approximation tolerance.
+
+Necessary slope condition. Differentiate Eq. (16):
+$$\frac{d}{dI} \ln y(I) = -\sum_{k=1}^{N} \frac{2b\,(a_k + bI)}{1 + (a_k + bI)^2}. \tag{21}$$
+Since $|2u/(1 + u^2)| \le 1$ for all real $u$,
+$$\left| \frac{d}{dI} \ln y(I) \right| \le N|b|. \tag{22}$$
+The target $\ln f(I) = I - L$ has constant slope $+1$, so a necessary condition to track it is
+$$N|b| \gtrsim 1. \tag{23}$$
+
+Near-optimal parameterization. The full design problem can be written as a minimax fit in the log domain [31]:
+$$\min_{a_1, \ldots, a_N,\ \ln C}\ \sup_{I \in [0, L]} |r(I)|, \qquad r(I) \equiv \ln C - \sum_{k=1}^{N} \ln\!\left[1 + (a_k + bI)^2\right] - (I - L). \tag{24}$$
+This objective is permutation-invariant in the $a_k$’s (ring index $k$). In practice (and in numerical experiments reported below), the optimizer frequently collapses to a permutation-symmetric solution
+$$a_1 = \cdots = a_N \equiv a, \tag{25}$$
+reducing the design to two parameters $(a, b)$ (plus $C$). With Eq. (25),
+$$\tilde{y}(I) = C\, y(I) = C \left[ \frac{1}{1 + (a + bI)^2} \right]^{N}. \tag{26}$$
+A robust initialization is obtained by placing the midpoint of the interval on the Lorentzian half-maximum flank and matching the slope:
+$$a + b\,\frac{L}{2} \approx -1, \qquad N b \approx 1. \tag{27}$$
+These two equations already yield a good design; a small (two-parameter) refinement can then enforce the desired worst-case tolerance.
+
+Local expansion and depth scaling. A Taylor expansion of the log-domain residual around the flank-centered point $I_0 = L/2$ (with $a + bI_0 = -1$ and $Nb = 1$) shows that the quadratic term vanishes identically, leaving a leading cubic residual $r(\delta) \sim \delta^3/(6N^2)$. Over $I \in [0, L]$, this implies $E_\infty \sim L^3/N^2$, so that achieving a prescribed tolerance $\varepsilon_{\log}$ requires $N \propto L^{3/2}/\sqrt{\varepsilon_{\log}}$, which explains the scaling used in Eq. (28). The full derivation is provided in Supplementary Sec. S0; an intuitive local-expansion summary appears in Sec. S1.
+
+Practical engineering estimate. Given $L$ and a target worst-case relative error $\varepsilon$, define $\varepsilon_{\log} = \ln(1 + \varepsilon)$. A heuristic engineering estimate (not a rigorous bound) that matched our percent-level numerical designs is
+$$N \approx \max\!\left( \frac{1}{b_{\max}},\ \kappa\, \frac{L^{3/2}}{\sqrt{\varepsilon_{\log}}} \right), \tag{28}$$
+where $b_{\max}$ is the physically achievable sensitivity bound and $\kappa \simeq 0.07$ for the identical-detuning flank design with a minimax refinement. After choosing $N$, set $b = \min(b_{\max}, 1/N)$ and $a = -1 - bL/2$ as initialization, then refine $(a, b)$ by a two-parameter minimax fit on $[0, L]$.
+
+A heuristic conservative screening bound $N \ge \lceil (L^2/4 + 1/(2b^2)) / \ln(1 + \varepsilon) \rceil$ (derived via the same local-expansion argument; see Supplementary Sec. S1) provides a quick upper estimate but is not a rigorous guarantee.
+
+III. NUMERICAL FITS AND VALIDATION
+
+We validate the analytical framework with minimax numerical fits and sampled robustness checks. Figure 2 shows the fitted approximation quality at $L = 8$: the top (linear) panel plots $N = 1, 3, 5, 7$ over $I \in [0, 20]$, the middle (log) panel compares $N = 5, 10, 20, 30$ on $I \in [0, 8]$, and the bottom panel shows the pointwise relative error with the characteristic Chebyshev equioscillation pattern.
+
+We fit identical-detuning cascades (Eq. 25) on $I \in [0, L]$ and compare several depths using a minimax criterion. Table I makes the accuracy–depth trade-off explicit at $L = 8$. A worked input-to-output example demonstrating the mapping from an arbitrary input sequence $x = [-3.2, 1.2, 4.8, -0.9]$ through the cascade is provided in Supplementary Sec. S2. The example shows that the $N = 10$ cascade keeps the worst-case relative error below 2.7% across all channels.
+
+Empirical calibration. We calibrate the effective logit range $L_{\mathrm{eff}}$ from autoregressive Transformers (distilgpt2/gpt2) [1, 32–35] at context length 128, finding $L_{\mathrm{eff},0.999} \approx 7$–$9$ at the 50th–90th percentiles (Supplementary Sec. S2). A clipping threshold $t^* = -12$ preserves p99 softmax accuracy below 0.1%. Full protocol details, clipping-sweep tables/plots, and per-run statistics are provided in Supplementary Sec. S3.
+
+A synthetic design-space map (Supplementary Table S3) shows that near $L \approx 8$, moderate depth ($N \ge 10$) reaches few-percent error, whereas $L \gtrsim 12$ requires deeper cascades. All fits follow the same pipeline: minimize the worst-case log-error on a uniform grid, initialize from the flank rules in Eq. (27), perform multi-start global search, and apply bounded local refinement; implementation details and scripts are provided in a public repository [36] (commit: 585e695).
+
+TABLE I: Depth comparison for $L = 8$ using fitted $\tilde{y}(I) = C[1 + (a + bI)^2]^{-N}$ (same fitting pipeline for all $N$).
+| N | a | b | max rel. err. | mean rel. err. |
+|---|---|---|---|---|
+| 5 | −2.0789 | 0.21658 | 10.9% | 6.43% |
+| 10 | −1.4588 | 0.10202 | 2.68% | 1.65% |
+| 20 | −1.2135 | 0.05025 | 0.67% | 0.42% |
+| 30 | −1.1392 | 0.03341 | 0.30% | 0.19% |
+
+TABLE II: Waveguide and ring parameters of the X-cut TFLN micro-ring resonator. Electro-optic electrode parameters are listed separately in Table III.
+| Parameter | Symbol | Value | Unit |
+|---|---|---|---|
+| Total TFLN thickness | $t_{\mathrm{TFLN}}$ | 600 | nm |
+| Etch depth | $t_{\mathrm{etch}}$ | 500 | nm |
+| Slab thickness | $t_{\mathrm{slab}}$ | 100 | nm |
+| Waveguide width | $w$ | 1.4 | µm |
+| Bend radius | $R$ | 20 | µm |
+| Coupling gap | $g$ | 100 | nm |
+| Circumference | $L_{\mathrm{ring}}$ | 125.7 | µm |
+| Free spectral range | FSR | 8.29 | nm |
+| Effective index (TE0) | $n_{\mathrm{eff}}$ | 1.903 | — |
+| Group index (TE0) | $n_g$ | 2.24 | — |
+| Extraordinary index | $n_e$ | 2.138 | — |
+
+IV. TFLN SINGLE-RING DEVICE DESIGN AND FDTD VALIDATION
+
+A. Waveguide and ring geometry
+
+The device is based on an X-cut thin-film lithium niobate (LiNbO3) on insulator wafer with a 600 nm-thick LiNbO3 film on SiO2. A 500 nm-deep rib etch defines a 1.4 µm-wide single-mode waveguide with a 100 nm unetched slab (Fig. 3). Lumerical MODE simulations yield $n_{\mathrm{eff}} = 1.903$ and $n_g = 2.24$ at $\lambda = 1550$ nm for the fundamental TE0 mode.
+
+The ring resonator ($R = 20$ µm, $L_{\mathrm{ring}} = 125.7$ µm) is configured as an add-drop resonator with 100 nm coupling gaps (Fig. 4). The FDTD-measured free spectral range is FSR = 8.29 nm ($n_g \approx 2.30$), slightly above the MODE value due to bend-induced dispersion.
+
+FIG. 2: Minimax cascade fits at $L = 8$. (a) Linear scale: shallow cascades ($N = 1, 3, 5, 7$) over $I \in [0, 20]$. The target $e^{I-L}$ (black) is progressively better matched as $N$ increases. (b) Log scale: depth comparison ($N = 5, 10, 20, 30$) on $I \in [0, 8]$. Inset zooms into $I \in [6, 8]$ showing convergence.
+(c) Pointwise relative error showing the Chebyshev equioscillation pattern characteristic of minimax optimality.
+
+FIG. 3: Cross-section of the X-cut TFLN rib waveguide on a SiO2 substrate. The 600 nm LiNbO3 film is etched 500 nm to form a 1.4 µm-wide single-mode rib waveguide. Lateral signal (S) and ground (G) electrode positions are indicated; electrode design details are discussed in Sec. IV D.
+
+Table II summarizes the waveguide and ring parameters.
+
+B. 3D FDTD Methodology
+
+The ring resonator response is simulated using Lumerical 3D FDTD with conformal variant 1 meshing. A broadband TE0 mode source (1530 nm to 1570 nm) is injected into the input bus waveguide, and through- and drop-port spectra are recorded. A “z-refined 3-fix” meshing strategy ensures convergence in the thin-film geometry [37]; detailed simulation setup is provided in Supplementary Sec. S4 (Table S6).
+
+FIG. 4: Top view of the single add-drop micro-ring resonator used in the 3D FDTD simulation. The ring waveguide ($R = 20$ µm, $w = 1.4$ µm) is evanescently coupled to input and drop bus waveguides through 100 nm gaps at coupling points CP1 and CP2.
+
+C. Single-Ring Add-Drop Results
+
+Figure 5 shows the through- and drop-port spectra from 3D FDTD. Five resonances are resolved across 1530 nm to 1570 nm with FSR = 8.29 nm ($n_g \approx 2.30$).
+
+FIG. 5: Simulated through-port (blue) and drop-port (red) transmission spectra of the single add-drop micro-ring resonator from 3D FDTD. Top: logarithmic scale; bottom: linear scale. Five resonances are visible with FSR ≈ 8.29 nm.
+
+Lorentzian fitting of the drop-port peaks yields $Q_L = 10{,}300$–$15{,}500$, with the best resonance at $\lambda = 1566$ nm reaching $Q_L = 15{,}500$ (FWHM = 101 pm, $D_{\max} = 0.360$, −4.4 dB). The through-port extinction ratio is 1.6 dB to 2.6 dB, and the five-resonance mean is $Q_L = 12{,}500 \pm 1{,}800$ ($D_{\max} = 0.29$–$0.36$). CMT analysis of the best resonance gives $Q_i = Q_L/(1 - \sqrt{D_{\max}}) = 15{,}500/0.400 \approx 38{,}800$, confirming that the 500 nm etch provides sufficient confinement and that the 100 nm gap places the ring in the coupling-limited regime. The cascade analysis below adopts the best-case FDTD calibration ($Q_L = 15{,}500$, $D_{\max} = 0.360$); using the five-resonance mean would increase required voltages by ~24% (see Table IV caption).
+
+The simulation time of 50 ps exceeds the loaded photon lifetime $\tau_L = Q_L \lambda_0/(2\pi c) \approx 12.7$ ps by ~4×, but the intrinsic lifetime $\tau_i \approx 32$ ps is comparable, so the extracted $Q_i$ may be slightly conservative. An independent eigenmode (FDE) analysis of the same cross-section at $R = 20$ µm, using a 300 × 300 mesh ($\Delta y \approx 10$ nm, 5× finer than the FDTD vertical grid), yields $Q_{\mathrm{rad+leak}} = 2.4 \times 10^{7}$; including bulk LiNbO3 absorption ($\Gamma = 0.89$) gives a theoretical $Q_i > 10^{7}$ [37–42], confirming that the gap between the numerical $Q_i$ and published values ($> 10^{6}$) originates from mesh discretization (Supplementary S4.5, Table S8). In the CMT framework, $D_{\max} = [2\kappa/(2\kappa + \gamma)]^2$ increases as $Q_i$ rises; at the present coupling gap, increasing $Q_i$ to $10^{6}$ would raise $D_{\max}$ from 0.36 to ~0.95 and $Q_L$ from 15,500 to ~25,200.
+
+Figure 6(a) shows a Lorentzian fit to the best drop-port resonance at $\lambda = 1566$ nm, validating the cascade model (Eq. 9). Figure 6(b) demonstrates that cascading $N$ copies of this FDTD-extracted Lorentzian reproduces the target exponential $e^{I-L}$ with increasing fidelity as $N$ grows.
+
+To validate the cascade prediction directly, a five-ring cascade 3D FDTD simulation was performed using Tidy3D [43]; the full simulation notebook is publicly available [43]. The $|E|^2$ field at $\lambda = 1549$ nm [Fig. 6(d)] confirms resonant excitation across all five rings. Mapping the drop-port spectrum onto the control variable $I$ yields 11 data points within the AEF operating range [Fig. 6(e, f)], with the FDTD transmission closely tracking the $N = 5$ theoretical curve near $I \approx L = 8$.
+
+FIG. 6: FDTD-based AEF validation. (a) Lorentzian fit to the drop-port resonance at $\lambda = 1566$ nm from 3D FDTD (Lumerical) ($Q_L = 15{,}500$, $D_{\max} = 0.360$, $b_V = 0.180$ V⁻¹). (b) Five-ring cascade drop-port spectrum near $\lambda_0 \approx 1550$ nm with Lorentzian$^5$ fit (red curve), confirming the expected $T^5$ line shape. (c) Five-ring cascade MRR layout with diagonal zigzag bus waveguides. (d) $|E|^2$ field profile at $\lambda = 1549$ nm from a five-ring cascade 3D FDTD simulation (Tidy3D [43]). (e, f) AEF validation of the five-ring cascade on log (e) and linear (f) scales with 11 spectral FDTD data points.
+
+D. X-cut electrode design and EO parameters
+
+We employ lateral signal–ground (S–G) arc electrodes on the slab surface alongside the ring waveguide (Fig. 7). In the X-cut orientation, the crystal Z-axis is at 45° from the horizontal in the substrate plane, giving a lateral-field projection proportional to $\cos(\theta - 45^\circ)$ at azimuthal angle $\theta$. The $\cos(\theta - 45^\circ) = 0$ boundaries at $\theta = 135^\circ$ and $315^\circ$ naturally separate the coupling regions from the electrode regions. Each ring carries a full semicircular arc electrode on the side opposite to its coupling points, engaging the large $r_{33} = 30.9$ pm V⁻¹ Pockels coefficient [37, 38]. The effective EO fill factor follows from integrating $|\cos(\theta - 45^\circ)|$ over the semicircle:
+$$f_{EO} = \frac{1}{\pi} \approx 0.318 \tag{29}$$
+(see Supplementary Sec. S4 for derivation). The electrode gap is $g_{el} = 5$ µm ($d_{\mathrm{eff}} \approx 2.5$ µm), and the electro-optic overlap integral is $\Gamma_{EO} = 0.7$. Table III lists the electrode parameters.
+
+TABLE III: Electro-optic electrode parameters for the X-cut TFLN micro-ring with lateral S–G arc electrodes.
+| Parameter | Symbol | Value | Unit |
+|---|---|---|---|
+| Crystal orientation | — | X-cut | — |
+| EO coefficient | $r_{33}$ | 30.9 | pm V⁻¹ |
+| EO fill factor | $f_{EO}$ | 1/π ≈ 0.318 | — |
+| EO overlap factor | $\Gamma_{EO}$ | 0.7 | — |
+| Electrode gap | $g_{el}$ | 5 | µm |
+| Effective electrode distance | $d_{\mathrm{eff}}$ | 2.5 | µm |
+
+FIG. 7: Illustrative two-ring cascade layout showing the lateral S–G arc electrode placement on X-cut TFLN (the cascade design extends to $N$ rings; this two-ring example clarifies the electrode geometry). The crystal Z-axis is oriented at 45° from the horizontal in the substrate plane. The $\cos(\theta - 45^\circ) = 0$ boundaries at $\theta = 135^\circ$ and $315^\circ$ naturally separate the bus-waveguide coupling regions from the electrode semicircles: each ring carries a full semicircular arc electrode on the side opposite to its coupling points. The resulting effective EO fill factor is $f_{EO} = 1/\pi \approx 0.318$.
+
+E. FDTD-Calibrated $b_V$ and Cascade Optimization
+
+From the device parameters in Tables II and III and the FDTD-calibrated $n_g \approx 2.30$, the effective normalized voltage sensitivity is (Supplementary Sec. S4; here $d\lambda/dV = 28.5$ pm/V is the straight-section value and $f_{EO}$ accounts for partial electrode coverage of the ring circumference):
+$$b_V = \frac{2\,Q\,(d\lambda/dV)}{\lambda_0}\, f_{EO} \approx 0.182\ \mathrm{V^{-1}} \tag{30}$$
+at $Q_L = 15{,}500$. This estimate relies on a first-order electrostatic model ($d_{\mathrm{eff}} \approx 2.5$ µm, $\Gamma_{EO} = 0.7$); a ±30% variation in $b_V$ would shift the cascade depth by one to two rings at constant $\varepsilon_{\max}$ (Table IV), leaving the qualitative design conclusions unchanged. With the cascade framework of Sec. II (Eqs. 14–18), the $N$-ring drop-port transmission $\tilde{y}(I) = C\,[1 + (a + bI)^2]^{-N}$ approximates $e^{I-L}$ over $I \in [0, L]$, with $(a, b)$ optimized by minimax fitting for each $N$.
+
+Table IV presents the optimization results for the standard dynamic range $L = 8$ ($e^{8} \approx 2981$, 34.7 dB).
+
+TABLE IV: Cascade optimization results for $L = 8$. The bias voltage $V_{\mathrm{bias}} = |a|/b_V$ sets the DC offset, and $V_{\mathrm{ctrl}} = bL/b_V$ is the maximum control voltage at $I = L$. Voltages computed with $b_V = 0.182$ V⁻¹ (X-cut arc electrode, FDTD-calibrated best resonance $Q_L = 15{,}500$, $n_g = 2.30$). The mean FDTD quality factor across five resonances is $Q_L = 12{,}500 \pm 1{,}800$; using the mean would increase voltages by ~24%.
+| N | a | b | $E_\infty$ | $\varepsilon_{\max}$ (%) | $V_{\mathrm{bias}}$ (V) | $V_{\mathrm{ctrl}}$ (V) |
+|---|---|---|---|---|---|---|
+| 5 | −2.0789 | 0.21658 | 0.1035 | 10.91 | 11.4 | 9.5 |
+| 10 | −1.4588 | 0.10202 | 0.0265 | 2.68 | 8.0 | 4.5 |
+| 12 | −1.3731 | 0.08450 | 0.0184 | 1.86 | 7.5 | 3.7 |
+| 20 | −1.2136 | 0.05025 | 0.0067 | 0.67 | 6.7 | 2.2 |
+| 25 | −1.1685 | 0.04013 | 0.0043 | 0.43 | 6.4 | 1.8 |
+| 30 | −1.141 | 0.03340 | 0.0030 | 0.30 | 6.3 | 1.5 |
+| 32 | −1.1301 | 0.03131 | 0.0026 | 0.26 | 6.2 | 1.4 |
+a The complete cascade optimization results for all $N$ values are listed in Supplementary Table S7.
+
+The approximation quality across different cascade depths is shown in Fig. 2 (Sec. III). Key thresholds (e.g., $\varepsilon < 2\%$ at $N \ge 12$, $\varepsilon < 1\%$ at $N \ge 17$) and the complete optimization results are listed in Supplementary Sec. S4.
+
+V. PHYSICAL FEASIBILITY
+
+Having established the cascade approximation theory (Sec. II) and the FDTD-calibrated device parameters (Sec. IV), we now assess the physical feasibility of the proposed architecture in terms of voltage requirements, insertion loss, and energy efficiency.
+
+A. Electro-optic voltage requirements
+
+For the primary target of $\varepsilon < 2\%$ ($N = 12$), minimax optimization gives $a = -1.373$, $b = 0.0845$. With the FDTD-calibrated $Q_L = 15{,}500$ ($b_V = 0.182$ V⁻¹), the required voltages are
+$$V_{\mathrm{bias}} = \frac{|a|}{b_V} = \frac{1.373}{0.182} = 7.5\ \mathrm{V}, \tag{31}$$
+$$V_{\mathrm{ctrl,max}} = \frac{bL}{b_V} = \frac{0.0845 \times 8}{0.182} = 3.7\ \mathrm{V}. \tag{32}$$
+Since $b_V \propto Q$, voltage scales inversely with quality factor:
+$$V_{\mathrm{ctrl}} = \frac{bL}{b_V} = \frac{bL\,\lambda_0}{2Q\,|d\lambda_0/dV|}. \tag{33}$$
+CMOS-compatible control voltages ($V_{\mathrm{ctrl}} < 3.3$ V) are achievable at $N \ge 14$ with $Q_L = 15{,}500$; at the design point $N = 30$ ($\varepsilon_{\max} = 0.30\%$), $V_{\mathrm{ctrl}} = 1.47$ V.
+
+B. Power budget: two-regime analysis
+
+The on-resonance cascade transmission $D_{\max}^{N}$ is the dominant contribution to total insertion loss. Table V presents two regimes: the FDTD-characterized regime ($D_{\max} = 0.36$) and the fabricated high-Q regime ($D_{\max} = 0.95$, achievable with $Q_i > 10^{6}$ and gap-optimized coupling).
+
+TABLE V: Two-regime power budget for the MRR cascade. $P_{\mathrm{out}}$ assumes per-channel input $P_{\mathrm{in,ch}} = 100$ µW (from a shared $P_{\mathrm{in,tot}} = 1$ mW CW laser split across $M = 10$ parallel channels via a 1×$M$ splitter, or equivalently multiplexed as $d$ WDM channels sharing a single cascade) and accounts only for the ideal on-resonance cascade transmission $D_{\max}^{N}$ (upper bound); additional inter-ring coupling loss ($\eta_{\mathrm{coupling}} \approx 0.9$ per stage, ~0.46 dB/stage) and off-resonance propagation loss (0.08–0.25 dB/stage) are analyzed separately in Sec. V C.
+| Regime | $D_{\max}$ | N | $D_{\max}^{N}$ | (dB) | $P_{\mathrm{out}}$ | $\varepsilon_{\max}$ |
+|---|---|---|---|---|---|---|
+| I (FDTD) | 0.36 | 3 | 0.0467 | −13.3 | 4.67 µW | ~15% |
+| I (FDTD) | 0.36 | 5 | 0.00605 | −22.2 | 0.61 µW | 10.9% |
+| I (FDTD) | 0.36 | 7 | 7.84 × 10⁻⁴ | −31.1 | 78 nW | ~5% |
+| II (high-Q) | 0.95 | 10 | 0.599 | −2.2 | 59.9 µW | 2.68% |
+| II (high-Q) | 0.95 | 20 | 0.358 | −4.5 | 35.8 µW | 0.67% |
+| II (high-Q) | 0.95 | 30 | 0.215 | −6.7 | 21.5 µW | ~0.30% |
+Regime I: FDTD-characterized ($Q_i = 38{,}800$). Regime II: fabricated high-Q ($Q_i > 10^{6}$). $P_{\mathrm{out}}$ scales linearly with $P_{\mathrm{in,ch}}$.
+
+In the FDTD-characterized regime, $D_{\max} = 0.36$ limits practical cascades to $N \le 5$: at $N = 5$ the output is 0.61 µW (−22.2 dB) with $\varepsilon = 10.9\%$, suited for proof-of-concept validation. In the fabricated high-Q regime ($D_{\max} \ge 0.95$), deep cascades become practical: $N = 30$ yields $P_{\mathrm{out}} = 21.5$ µW (−6.7 dB) with $\varepsilon_{\max} \approx 0.30\%$. The transition to fabricated high-Q devices is therefore critical for achieving both high accuracy and sufficient output power.
+
+C. Feasibility outlook
+
+Published TFLN micro-ring resonators achieve $Q_i \ge 10^{6}$–$10^{8}$ using optimized fabrication [39–42]. At $Q_i = 10^{6}$ with the present coupling geometry, CMT predicts $D_{\max} \approx 0.95$ and $Q_L \approx 25{,}200$ (Supplementary Sec. S5, Tables S4–S7), enabling deep cascades ($N \le 30$) with sub-percent error. The literature values provide strong independent evidence that intrinsic quality factors in the projected range are physically achievable in TFLN, albeit with wider waveguides and larger ring radii than the present design. Transferring comparable sidewall quality to our geometry ($R = 20$ µm, $W = 1.4$ µm) is an open fabrication challenge; the projections should be read as design targets contingent on achieving it.
+
+The total insertion loss comprises on-resonance cascade transmission $D_{\max}^{N}$, inter-ring coupling loss (~0.46 dB/stage for the present diagonal-bus layout), off-resonance propagation loss (0.08–0.25 dB/stage), and fiber-to-chip coupling (1.5–3.0 dB). For the fabricated high-Q regime ($N = 30$), the total ranges from ~13 dB (optimized layout) to ~24 dB (current geometry); see Supplementary Sec. S6 for detailed scenarios.
+
+D. Energy comparison
+
+For $N = 30$ X-cut TFLN micro-ring resonators in the fabricated high-Q regime ($Q_L \approx 25{,}200$ at $Q_i = 10^{6}$; Supplementary Sec. S5), the three energy components are EO tuning ($E_{EO} = 0.22$ pJ), amortized laser ($E_{\mathrm{laser}} = 0.07$ pJ, shared across $M = 10$ channels), and photodetector ($E_{PD} = 0.50$ pJ), yielding $E_{\mathrm{photonic}} = 0.79$ pJ (derivations in Supplementary Sec. S7). Including thermal stabilization for $N = 30$ rings (0.15–0.60 pJ; Supplementary Sec. S7), the total rises to 0.94–1.39 pJ.
+
+Table S12 compares the photonic cascade with digital implementations. Including thermal stabilization (0.94–1.39 pJ), the advantage over INT8 (2.3 pJ) is 1.7–2.4×, while operating at 10 GHz bandwidth and 58× lower than digital FP32 (46 pJ). At fabricated $Q \ge 30{,}000$, $E_{EO}$ drops to 0.16 pJ and $E_{\mathrm{total}} \approx 0.73$ pJ (excluding thermal; Supplementary Table S11), recovering a 3.2× advantage over INT8. Since $E_{EO} \propto 1/Q^2$, improving $Q$ beyond ~30,000 yields diminishing energy returns but continues to relax CMOS driver voltage requirements.
+
+TABLE VI: Energy per exponential operation: single-channel comparison.
+| Implementation | E/op (pJ) | Bandwidth | Notes |
+|---|---|---|---|
+| Digital FP32 (Taylor) | ~46 | 1 GHz | 10 FP MACs |
+| Digital INT8 (Taylor) | ~2.3 | 1 GHz | 10 INT MACs |
+| Photonic MRR (N = 30) | 0.94–1.39 | 10 GHz | Analog† |
+† 0.79 pJ excluding thermal; 0.94–1.39 pJ including thermal. Self-consistent with fabricated high-Q regime ($Q_L = 25{,}200$); see Supplementary Sec. S7.
+
+VI. DISCUSSION
+
+Practical design procedure. For a given input sequence $x = (x_1, \ldots, x_K)$, the design proceeds as follows:
+
+1. Compute $m = \max_n x_n$, $u_n = x_n - m$, and $L = -\min_n u_n$.
+2. Map to nonnegative control-signal amplitudes: $I_n = u_n + L \in [0, L]$.
+3. Choose tolerance $\varepsilon$ and set $\varepsilon_{\log} = \ln(1 + \varepsilon)$.
+4. Select a physically feasible $b_{\max}$ and estimate $N$ using Eq. (28).
+5. Initialize $b = \min(b_{\max}, 1/N)$ and $a = -1 - bL/2$, then refine $(a, b)$ by a two-parameter minimax fit if required.
+6. The optical block yields $\tilde{y}(I_n) \approx e^{x_n - m}$, and softmax weights follow as
+$$p_n = \frac{\tilde{y}(I_n)}{\sum_j \tilde{y}(I_j)}. \tag{34}$$
+
+Scope and limits. The approximation is for a finite interval $I \in [0, L]$, where $L$ is determined by the input batch via Eq. (4). In practice, one designs for a worst-case $L$ expected in operation (or retunes $a$ and rescales the control signal to adapt $L$). Noise, insertion loss, and control-induced parasitics limit accuracy and dynamic range; we treat these effects as platform-specific margins. Detailed non-ideality assumptions, parameter distributions, and robustness statistics are reported in Supplementary Sec. S8. With $K$ channels in parallel, one can form softmax by summing channel powers and applying a shared reciprocal scale factor, depending on the chosen mixed-signal normalization scheme.
+
+WDM parallelism. A particularly hardware-efficient realization exploits wavelength-division multiplexing (WDM): $d$ probe wavelengths $\lambda_1, \ldots, \lambda_d$, each resonant with a distinct FSR order of the same ring set, traverse a single $N$-ring cascade simultaneously (Fig. 8). Because each channel $\lambda_j$ sees its own Lorentzian detuning set by an independent control voltage $V_j$, the cascade output per channel is $\tilde{y}_j = C \prod_{k=1}^{N} T_{\mathrm{drop},k}(\lambda_j, V_j) \approx e^{V_j}$, and all $d$ exponentials are computed in parallel on the same physical waveguide. Compared with a 1×$M$ power-splitter architecture that replicates the cascade for each channel, the WDM approach reduces the total ring count from $N \times d$ to $N$ (a factor-$d$ saving) and eliminates the splitter insertion loss ($10 \log_{10} d$ dB). At the output, a WDM demultiplexer or wavelength-selective photodetector array separates the channels for electrical readout. Figure 8 shows a representative chip layout for $N = 5$ cascade stages and $d = 8$ WDM channels, where alternating U-turn bus connections route the drop-port output of each stage into the input bus of the next.
+
+Why cascade helps. A single Lorentzian in $I$ is too rigid to mimic the log-linear target over a wide interval. Cascading turns the transfer into a product; taking a logarithm gives a sum of smooth terms, and the approximation improves as $N$ increases. The slope constraint $N|b| \gtrsim 1$ is an immediate feasibility check.
+
+Global softmax normalization via WDM feedback. The WDM-parallel architecture (Fig. 8) integrates naturally with a closed-loop normalization scheme to complete the full softmax function. After the $N$-stage cascade, a WDM demultiplexer (e.g., arrayed-waveguide grating or ring-filter bank) routes each channel $\lambda_j$ to a dedicated photodetector, producing photocurrents $I_{\lambda_j} \propto \tilde{y}_j \approx C P_{\mathrm{in}} e^{V_j}$. The $d$ photocurrents are summed electrically:
+$$S = \sum_{j=1}^{d} I_{\lambda_j} \propto C P_{\mathrm{in}} \sum_{j=1}^{d} e^{V_j}. \tag{35}$$
+A proportional–integral (PI) controller compares $S$ with a fixed reference $S_{\mathrm{ref}}$ and adjusts the shared WDM laser power $P_{\mathrm{in}}$ so that $S \to S_{\mathrm{ref}}$ [44, 45]. Because all $d$ channels share the same probe source, scaling $P_{\mathrm{in}}$ multiplies every $\tilde{y}_j$ by the same factor; upon convergence
+$$p_j = \frac{I_{\lambda_j}}{S_{\mathrm{ref}}} = \frac{e^{V_j}}{\sum_{k=1}^{d} e^{V_k}} = \mathrm{softmax}(V)_j, \tag{36}$$
+realizing the complete softmax with a single feedback loop and no per-channel normalization circuitry. Compared with the replicated-cascade approach (one AEF block per channel), WDM feedback offers two additional benefits: (i) the splitter-induced power imbalance that would bias the $I_{\lambda_j}$ ratios is absent, since all channels traverse the same optical path; and (ii) a single laser control point replaces $d$ independent probe adjustments. Design details and stability analysis of the PI loop are provided in Supplementary Sec. S9.
+
+Beyond ring-resonator AEF implementations, the same cascade principle can be extended to other cavity-based photonic platforms, such as serial 1D photonic-crystal cavities and other cascaded resonant architectures [21, 46]. What these platforms share is transfer-function shaping through cascaded resonances; loss, tuning range, fabrication tolerance, and calibration overhead remain platform-dependent.
+
+TABLE VII: Summary of evidence levels for key claims.
+| Claim | Evidence | Sec. |
+|---|---|---|
+| Cascade → exp. approx. | Analytic | II |
+| Depth scaling | Analytic + num. | II, III |
+| $Q_L$, $D_{\max}$, $b_V$ | 3-D FDTD | IV |
+| 5-ring line shape | 3-D FDTD | IV |
+| N ≤ 30 deep cascade | CMT proj.* | V |
+| Energy < 1 pJ | Estimate | V |
+
+The insertion loss budget (Sec. V C) and electro-optic voltage requirements (Sec. V A) suggest that the cascade architecture is feasible under optimized coupling and layout conditions. Using monolithic TFLN microring data from Bahadori et al.
[47] (Q ≈ 5432, dλ0 /dV ≈ + Full softmax (WDM + feedback) Conceptual + layout VI +9–20 pm/V), the normalized sensitivity bV ≃ 0.063– + ∗ Based on published Q +0.14 V−1 , within the range required by the cascade design. i ≥ 10 + 6 values [39, 42] and CMT coupling + + model. +Crystal orientation and electrode design. The X- +cut TFLN platform was chosen for several reasons. First, +X-cut is the prevailing industry standard for integrated tified in the Monte Carlo robustness analysis (Supple- +TFLN modulators, with well-established fabrication pro- mentary Sec. S8). Monte Carlo simulations (Supplemen- +cesses and commercial wafer availability [37, 38]. Second, tary Sec. S8) show that under nominal non-ideality levels +the TE0 mode—which is strongly confined in the rib (σa = 0.020, σb,rel = 0.020), a single-point calibration of +waveguide geometry—can engage the large r33 coefficient C per chip keeps the median softmax KL divergence below +via lateral electric fields aligned with the crystal Z-axis. 2.2 × 10−4 , with 95th-percentile max probability error +In contrast, Z-cut geometry with TE polarization can only under 0.32%. Even under stress conditions (σa = 0.032), +access the smaller r13 coefficient (∼ 10 pm/V), resulting 95th-percentile errors remain below 0.42%, demonstrat- +in significantly lower electro-optic efficiency. The arc elec- ing that the identical-detuning design is robust to realis- +trode design (Sec. IV D) addresses the phase-cancellation tic fabrication variations provided a per-chip calibration +problem inherent to X-cut circular rings [47] by orienting step is performed. Conversely, if coupling gaps are in- +the crystal Z-axis at 45◦ from the horizontal in the sub- tentionally varied across rings, the per-ring parameters +strate plane. This rotation places the cos(θ − 45◦ ) = 0 (ak , bk ) become independent degrees of freedom. 
A Taylor- +boundaries at θ = 135◦ and 315◦ , naturally separating the expansion analysis shows that K non-identical rings can +bus-waveguide coupling regions from the electrode regions. cancel curvature + P terms up to order 2K in the Taylor series +Each ring carries a full semicircular arc electrode on the of g(I) = k ln Tk , one order higher than K identical +side opposite to its coupling points, yielding an effective rings, so that fewer rings suffice for a given error target. +fill factor fEO = 1/π ≈ 0.318. While this reduces the +round-trip EO efficiency compared to a hypothetical full- +circumference design, it preserves the compact footprint +of a circular ring resonator. The cascade performance +can be further improved beyond the R = 20 µm circular- +ring design presented here. Increasing the ring radius +reduces bending loss and raises the intrinsic quality factor +Qi , which directly increases bV (∝ Q) and lowers the +required control voltage. Alternatively, adopting a race- +track geometry with extended straight coupling sections +strengthens the bus–ring coupling, pushing the drop-port +maximum Dmax closer to critical coupling and improving +the per-stage transfer efficiency. Either approach—or their +combination—would yield higher bV and Dmax , enabling +lower N or tighter approximation accuracy at reduced +operating voltages. +Fabrication considerations. The X-cut TFLN rib +waveguide (600 nm total thickness, 500 nm etch, w = +1.4 µm) follows established fabrication processes for com- +mercial TFLN wafers on SiO2 [37, 38]. The lateral signal– +ground (SG) electrode configuration is fabricated in a +single metal layer, which is standard in TFLN foundry +processes. The primary fabrication challenge for the +cascade architecture is maintaining uniform coupling +gaps (g = 100 nm) across N rings to ensure identi- +cal Lorentzian transfer functions. 
Post-fabrication trim- +ming via UV exposure or localized thermal oxidation can +compensate residual detuning variations [30], as quan- + 12 + + + + + Softmax Full Chip Layout – N = 5 × d = 8 (TFLN) + d = 8 WDM channels + + + Vλ1 Vλ2 Vλ3 Vλ4 Vλ5 Vλ6 Vλ7 Vλ8 + + WDM + λ1−λ8 n=1 + Pin + + + n=2 + N = 5 + cascade + n=3 stages + + + + + n=4 + + + n=5 + + + + + WDM Demux (AWG / ring filter) + + Sref + PD1 PD2 PD3 PD4 PD5 PD6 PD7 PD8 + Iλ + j S e + Σ − PI + p1 p2 p3 p4 p5 p6 p7 p8 + + + + + Feedback: adjust Pin + Iλj + Output: pj = = softmax(V )j + Sref + +FIG. 8: WDM-parallel MRR-AEF system with closed-loop softmax normalization (N = 5 cascade stages, d = 8 WDM + channels) on X-cut TFLN. A single WDM source (λ1 –λ8 ) enters the top input bus waveguide; each stage applies a + Lorentzian drop-port transfer, and alternating U-turn connections route the drop-port output into the next stage’s +input bus. Per-channel EO bias voltages (Vλ1 –Vλ8 ) independently tune each column of rings. The final drop output + passes through a WDM demultiplexer (AWG / ring filter) and is detected by a PD array, producing per-channel + photocurrents Iλj ∝ eVj . The photocurrents are summed (Σ) and compared with a reference Sref ; a PI controller + adjusts the shared laser power Pin until S = Sref , at which point each PD output directly yields + pj = Iλj /Sref = softmax(V )j (Eq. 36). + 13 + + VII. CONCLUSION Dmax ≥ 0.95) are realized in the cascade geometry, deeper + cascades (N ≈ 20–30) would reach sub-percent approx- + We have presented a cascaded micro-ring resonator ar- imation error with an estimated per-operation energy +chitecture that approximates the exponential function of 0.79–1.39 pJ, which is 1.7–2.4× lower than an INT8 +exn −m on a finite interval [0, L] using multiplicative MAC at the 7 nm node. Monte Carlo analysis shows that +Lorentzian transfer functions. 
Increasing the cascade the identical-detuning design tolerates realistic fabrica- +depth N systematically reduces the worst-case relative tion variations (σa = 0.020, σb,rel = 0.020) with a single +error, and an identical-detuning design initialized by flank per-chip calibration, keeping the 95th-percentile softmax +and slope matching provides a practical two-parameter probability error below 0.32%. +design. + Three-dimensional FDTD simulations of a single X-cut The formulation is not restricted to electro-optic tuning: +TFLN add-drop ring (R = 20 µm, g = 100 nm) yield it requires only a controllable detuning coordinate with lo- +QL = 10,300–15,500 and Dmax ≈ 0.36, calibrating the cal linearization, so both Pockels and optical (Kerr/XPM) +cascade transfer model. A five-ring cascade 3D FDTD mechanisms are compatible [37, 38, 47, 48]. We demon- +simulation directly validates the multi-ring framework: strate a photonic exponential block and present a WDM- +all five rings exhibit resonant excitation, and mapping parallel chip architecture (Fig. 8) in which d wavelength +the drop-port spectrum onto the dimensionless control channels share a single N -ring cascade, reducing the total +variable reproduces the theoretical N = 5 curve with ring count by a factor of d and eliminating power-splitter +∼11% integrated relative-area error over the upper op- loss. Combined with a single-loop PI feedback that adjusts +erating range (I ≥ 5.8), providing the first multi-ring the shared WDM laser power, the architecture realizes the +confirmation of the cascade exponential approximation. complete softmax function—exponentiation, summation, +At the present FDTD-characterized quality factor, practi- and normalization—without per-channel normalization +cal cascades are limited to N = 5–7 (ε ≲ 11%). If high-Q circuitry. Max-finding and digital interfacing remain open +TFLN resonators reported in the literature (Qi ≥ 106 , for future experimental validation. 
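The design procedure of Sec. VI (steps 1–6) and the feedback normalization of Eqs. (35)–(36) can be sketched numerically as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the input vector is the sample from Supplementary Sec. S2.1, the depth estimate uses the form of Eq. (28) as quoted in Supplementary Sec. S0.2 with κ ≈ 0.07, and `b_max = 0.14` and `eps = 0.01` are illustrative values. The two-parameter minimax refinement of step 5 is skipped, so the residual error is that of the initialization only, and a plain integral update stands in for the PI controller of Supplementary Sec. S9.

```python
import math

def design_cascade(x, eps=0.01, b_max=0.14, kappa=0.07):
    """Steps 1-5: map logits to control amplitudes and initialize (N, a, b)."""
    m = max(x)                                   # step 1: maximum logit
    u = [xn - m for xn in x]                     # shifted logits, all <= 0
    L = -min(u)                                  # interval length
    I = [un + L for un in u]                     # step 2: amplitudes in [0, L]
    eps_log = math.log1p(eps)                    # step 3: log-domain tolerance
    # step 4: depth estimate in the form quoted for Eq. (28); kappa is empirical
    N = math.ceil(max(1.0 / b_max, kappa * L ** 1.5 / math.sqrt(eps_log)))
    b = min(b_max, 1.0 / N)                      # step 5: slope matching
    a = -1.0 - b * L / 2.0                       #         flank centering
    return I, L, N, a, b

def cascade_output(I, L, N, a, b):
    """Identical-detuning N-ring transfer, scaled so that y(L) = 1."""
    ref = 1.0 + (a + b * L) ** 2
    return [(ref / (1.0 + (a + b * In) ** 2)) ** N for In in I]

def feedback_normalize(y, s_ref=1.0, steps=200):
    """Integral-only stand-in for the PI loop: adjust P_in until S = S_ref."""
    gain = 1.0 / len(y)      # assumed step size; stable here because every y_j <= 1
    p_in = 1.0
    for _ in range(steps):
        s = p_in * sum(y)    # summed photocurrent S (Eq. 35), unit responsivity assumed
        p_in += gain * (s_ref - s)
    return [p_in * yj / s_ref for yj in y]       # p_j = I_j / S_ref (Eq. 36)

x = [-3.2, 1.2, 4.8, -0.9]
I, L, N, a, b = design_cascade(x)
y = cascade_output(I, L, N, a, b)
p_direct = [yn / sum(y) for yn in y]             # step 6, Eq. (34)
p_loop = feedback_normalize(y)

z = [math.exp(xn - max(x)) for xn in x]          # exact reference softmax
p_exact = [zn / sum(z) for zn in z]
max_err = max(abs(pa - pe) for pa, pe in zip(p_direct, p_exact))
print("N =", N, "a =", a, "b =", b, "max |p - softmax| =", max_err)
```

Because the common scale factor cancels in Eq. (34), `p_direct` and `p_loop` should agree to numerical precision once the loop has settled; the remaining gap to the exact softmax reflects only the Lorentzian approximation error.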

+
+ [1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser,
+     and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems 30
+     (NeurIPS 2017), pages 5998–6008, 2017.
+ [2] Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and
+     memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems 35
+     (NeurIPS 2022), pages 16344–16359, 2022.
+ [3] Neil Savage. Light could lower AI's appetite for power. Nature Nanotechnology, 21:6–8, 2026.
+ [4] Yichen Shen et al. Deep learning with coherent nanophotonic circuits. Nature Photonics, 11(7):441–446, 2017.
+ [5] Johannes Feldmann et al. Parallel convolutional processing using an integrated photonic tensor core.
+     Nature, 589(7840):52–58, 2021.
+ [6] Nicholas C. Harris et al. Linear programmable nanophotonic processors. Optica, 5(12):1623–1631, 2018.
+ [7] Bowei Dong, Samarth Aggarwal, Wen Zhou, Utku Emre Ali, Nikolaos Farmakidis, June Sang Lee, Yuhan He, Xuan
+     Li, Dim-Lee Kwong, C. D. Wright, Wolfram H. P. Pernice, and H. Bhaskaran. Higher-dimensional processing
+     using a photonic tensor core with continuous-time data. Nature Photonics, 17(12):1080–1088, 2023.
+ [8] Sudip Shekhar, Wim Bogaerts, Lukas Chrostowski, John E. Bowers, Michael Hochberg, Richard Soref, and
+     Bhavin J. Shastri. Roadmapping the next generation of silicon photonics. Nature Communications, 15:751, 2024.
+ [9] Mario Miscuglio and Volker J. Sorger. Photonic tensor cores for machine learning. Applied Physics Reviews,
+     7(3):031404, 2020.
+[10] Yaowen Hu, Yunxiang Song, Xinrui Zhu, Xiangwen Guo, Shengyuan Lu, Qihang Zhang, Lingyan He, C. A. A.
+     Franken, Keith Powell, Hana Warner, Daniel Assumpcao, Dylan Renaud, Ying Wang, et al. Integrated lithium
+     niobate photonic computing circuit based on efficient and high-speed electro-optic conversion. Nature
+     Communications, 16:8178, 2025.
+[11] Priyabrata Dash, Anxiao Jiang, and Dharanidhar Dang. SOFTONIC: A photonic design approach to softmax
+     activation for high-speed fully analog AI acceleration. In Proceedings of the Great Lakes Symposium on VLSI
+     (GLSVLSI '25), pages 118–125, 2025.
+[12] Ziyu Zhan, Hao Wang, Qiang Liu, and Xing Fu. Optoelectronic nonlinear softmax operator based on diffractive
+     neural networks. Optics Express, 32(15):26458–26469, 2024.
+[13] Ye Tian, Shuiying Xiang, Xingxing Guo, Yahui Zhang, Jiashang Xu, Shangxuan Shi, Haowen Zhao, Yizhi Wang,
+     Xinran Niu, Wenzhuo Liu, and Yue Hao. Photonic transformer chip: interference is all you need. PhotoniX,
+     6:45, 2025.
+[14] Jacob R. Stevens, Rangharajan Venkatesan, Steve Dai, Brucek Khailany, and Anand Raghunathan. Softermax:
+     Hardware/software co-design of an efficient softmax for transformers. In Proceedings of the 58th ACM/IEEE
+     Design Automation Conference (DAC), pages 469–474, 2021.
+[15] Nazim Altar Koca, Anh Tuan Do, and Chip-Hong Chang. Hardware-efficient softmax approximation for
+     self-attention networks. In Proceedings of the IEEE International Symposium on Circuits and Systems
+     (ISCAS), pages 1–5, 2023.
+[16] Wenxun Wang, Shuchang Zhou, Wenyu Sun, Peiqin Sun, and Yongpan Liu. SOLE: Hardware-software co-design of
+     softmax and layernorm for efficient transformer inference. In Proceedings of the IEEE/ACM International
+     Conference on Computer-Aided Design (ICCAD), pages 1–9, 2023.
+[17] Yuan Zhang, Yonggang Zhang, Lele Peng, Lianghua Quan, Shubin Zheng, Zhonghai Lu, and Hui Chen. Base-2
+     softmax function: Suitability for training and efficient hardware implementation. IEEE Transactions on
+     Circuits and Systems I: Regular Papers, 69(9):3605–3618, 2022.
+[18] Zhengyu Mei, Hongxi Dong, Yuxuan Wang, and Hongbing Pan. TEA-S: A tiny and efficient architecture for
+     PLAC-based softmax in transformers. IEEE Transactions on Circuits and Systems II: Express Briefs,
+     70:3594–3598, 2023.
+[19] Ke Chen, Yue Gao, Haroon Waris, Weiqiang Liu, and Fabrizio Lombardi. Approximate softmax functions for
+     energy-efficient deep neural networks. IEEE Transactions on Very Large Scale Integration (VLSI) Systems,
+     31:4–16, 2023.
+[20] Wim Bogaerts et al. Silicon microring resonators. Laser & Photonics Reviews, 6(1):47–73, 2012.
+[21] John E. Heebner, Robert W. Boyd, and Q.-Han Park. Scissor solitons and other propagation effects in
+     microresonator-modified waveguides. Journal of the Optical Society of America B, 19(4):722–731, 2002.
+[22] Jiahui Wang, Sean P. Rodrigues, Ercan M. Dede, and Shanhui Fan. Microring-based programmable coherent
+     optical neural networks. Optics Express, 31(12):18871, 2023.
+[23] Pengxing Guo, Niujie Zhou, Weigang Hou, and Lei Guo. StarLight: a photonic neural network accelerator
+     featuring a hybrid mode-wavelength division multiplexing and photonic nonvolatile memory. Optics Express,
+     30:37051, 2022.
+[24] Weizhen Yu, Shuang Zheng, Zhenyu Zhao, Bin Wang, and Weifeng Zhang. Reconfigurable low-threshold
+     all-optical nonlinear activation functions based on an add-drop silicon microring resonator. IEEE
+     Photonics Journal, 14(6):1–7, 2022.
+[25] Bahaa E. A. Saleh and Malvin C. Teich. Fundamentals of Photonics. Wiley, Hoboken, NJ, 2nd edition, 2007.
+[26] Vítor R. Almeida, Carlos A. Barrios, Roberto R. Panepucci, and Michal Lipson. All-optical control of light
+     on a silicon chip. Nature, 431(7012):1081–1084, 2004.
+[27] Qianfan Xu, Bradley Schmidt, Sameer Pradhan, and Michal Lipson. Micrometre-scale silicon electro-optic
+     modulator. Nature, 435(7040):325–327, 2005.
+[28] Kishore Padmaraju and Keren Bergman. Resolving the thermal challenges for silicon microring resonator
+     devices. Nanophotonics, 3:269–281, 2014.
+[29] Erwen Li, Behzad Ashrafi Nia, Bokun Zhou, and Alan X. Wang. Transparent conductive oxide-gated silicon
+     microring with extreme resonance wavelength tunability. Photonics Research, 7(4):473, 2019.
+[30] Lahiru Jayatilleka et al. Post-fabrication trimming of silicon photonic ring resonators at wafer-scale.
+     Journal of Lightwave Technology, 39:5083–5088, 2021.
+[31] Elliott W. Cheney. Introduction to Approximation Theory. McGraw–Hill, New York, 1966.
+[32] Alec Radford et al. Language models are unsupervised multitask learners. Technical report, OpenAI, 2019.
+[33] Hugging Face. distilgpt2 model card, 2025. Accessed 2026-02-21.
+[34] Andrej Karpathy. Tiny shakespeare dataset (char-rnn), 2025. Accessed 2026-02-21.
+[35] Jane Austen. Pride and prejudice. Project Gutenberg eBook No. 1342, 2025. Accessed 2026-02-21.
+[36] Hyoseok Park. MRR-AEF: reproducible MRR depth-sweep fitting and supplementary validation scripts. GitHub
+     repository, 2025. Commit 585e695, accessed 2026-02-21.
+[37] Di Zhu et al. Integrated photonics on thin-film lithium niobate. Advances in Optics and Photonics,
+     13(2):242–352, 2021.
+[38] Yaowen Hu, Di Zhu, Shengyuan Lu, Xinrui Zhu, Yunxiang Song, Dylan Renaud, Daniel Assumpcao, Rebecca Cheng,
+     CJ Xin, Matthew Yeh, Hana Warner, Xiangwen Guo, Amirhassan Shams-Ansari, David Barton, Neil Sinclair, and
+     Marko Lončar. Integrated electro-optics on thin-film lithium niobate. Nature Reviews Physics, 2025.
+[39] Mian Zhang, Cheng Wang, Rebecca Cheng, Amirhassan Shams-Ansari, and Marko Lončar. Monolithic ultra-high-Q
+     lithium niobate microring resonator. Optica, 4(12):1536–1537, 2017.
+[40] Rongjin Zhuang, Jinze He, Yifan Qi, and Yang Li. High-Q thin-film lithium niobate microrings fabricated
+     with wet etching. Adv. Mater., 35(3):2208113, 2023.
+[41] Xinrui Zhu, Yaowen Hu, Shengyuan Lu, Hana K. Warner, Xudong Li, Yunxiang Song, Letícia S. Magalhães,
+     Amirhassan Shams-Ansari, Neil Sinclair, and Marko Lončar. Twenty-nine million intrinsic Q-factor monolithic
+     microresonators on thin-film lithium niobate. Photon. Res., 12(8):A63–A68, 2024.
+[42] Renhong Gao, Ni Yao, Jianglin Guan, Li Deng, Jintian Lin, Min Wang, Lingling Qiao, Wei Fang, and Ya Cheng.
+     Lithium niobate microring with ultra-high Q factor above 10^8. Chin. Opt. Lett., 20(1):011902, 2022.
+[43] Flexcompute Inc. Tidy3D: electromagnetic simulation software. https://www.flexcompute.com/tidy3d/, 2024.
+     v2.10; cloud GPU FDTD. Accompanying notebook:
+     https://www.flexcompute.com/tidy3d/community/notebooks/CascadedMRRTFLN/.
+[44] John K. Doylend, Paul E. Jessop, and Andrew P. Knights. Silicon photonic dynamic optical channel leveler
+     with external feedback loop. Optics Express, 18(13):13805–13812, 2010.
+[45] Karl J. Åström and Richard M. Murray. Feedback Systems: An Introduction for Scientists and Engineers.
+     Princeton University Press, Princeton, NJ, 2008.
+[46] Amnon Yariv, Yong Xu, Reginald K. Lee, and Axel Scherer. Coupled-resonator optical waveguide: a proposal
+     and analysis. Optics Letters, 24(11):711–713, 1999.
+[47] Meisam Bahadori, Yansong Yang, Ahmed E. Hassanien, Lynford L. Goddard, and Songbin Gong. Ultra-efficient
+     and fully isotropic monolithic microring modulators in a thin-film lithium niobate photonics platform.
+     Optics Express, 28(20):29644–29661, 2020.
+[48] Abu Naim R. Ahmed, Shouyuan Shi, Mathew Zablocki, Peng Yao, and Dennis W. Prather. Tunable hybrid silicon
+     nitride and thin-film lithium niobate electro-optic microresonator. Optics Letters, 44(3):618, 2019.
+
+                              SUPPLEMENTARY INFORMATION
+
+Supplementary material accompanying “Photonic Exponential Approximation via Cascaded TFLN Microring Resonators
+toward Softmax.”
+
+
+                    S0. RIGOROUS DERIVATION AND VALIDITY SCOPE
+
+  This section derives the depth-scaling relations and screening bounds used in the main text, and states the
+assumptions under which they apply, together with validity scope and failure cases.
We separate proved statements (Lemma, Proposition, Theorem) from heuristic engineering estimates that rely on
+empirical calibration.
+
+
+                              S0.1 Assumptions
+
+Assumption 1 (Lorentzian single-ring transfer). Each ring $k$ has a normalized add-to-drop transmission of the
+form $T_k(I) = [1 + (a_k + bI)^2]^{-1}$, where $a_k \in \mathbb{R}$ is the dimensionless static detuning,
+$b > 0$ is the common normalized sensitivity, and $I \ge 0$ is a nonnegative control-signal amplitude.
+Assumption 2 (Multiplicative cascade). The $N$ rings are cascaded in a serial add-drop topology (the drop
+output of ring $k$ feeds the add input of ring $k+1$), and the probe is sufficiently weak that cross-ring and
+nonlinear probe-induced effects are negligible; thus the total normalized transmission is
+$y(I) = \prod_{k=1}^{N} T_k(I)$.
+Assumption 3 (Identical-detuning family). All rings share the same static detuning:
+$a_1 = \cdots = a_N \equiv a$. This reduces the design space to $(a, b)$ and a global scale $C > 0$; the scaled
+output is $\tilde{y}(I) = C\,[1 + (a + bI)^2]^{-N}$.
+Assumption 4 (Linear control-to-resonance mapping). Within the operating range $I \in [0, L]$, the resonance
+shift is a linear function of the control-signal amplitude (Eq. (10) of main text), i.e., higher-order detuning
+nonlinearity is negligible.
+Assumption 5 (Finite interval and bounded $L$). The approximation target is $f(I) = e^{I-L}$ on a finite
+interval $I \in [0, L]$ with $L > 0$ determined by the input batch ($L = \max(x) - \min(x)$). The depth-scaling
+results are derived for fixed, finite $L$.
+Assumption 6 (Flank-centered operating regime). The design uses the “flank-centered” initialization:
+$a + b(L/2) = -1$ (midpoint on the Lorentzian half-maximum) and $Nb = 1$ (slope matching). This places the
+operating point in the steepest-slope region of the Lorentzian, where the log-transfer is most nearly linear.
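To make Assumptions 3 and 6 concrete, the following minimal sketch evaluates the log of the identical-detuning cascade and checks by finite differences that the flank-centered choice gives unit slope and vanishing curvature at the interval midpoint. The helper names and the values `L = 8`, `N = 16` are illustrative assumptions, not taken from the paper's tables.

```python
import math

def log_cascade(I, a, b, N, C=1.0):
    """ln of the scaled identical-detuning transfer C * [1 + (a + b*I)^2]^(-N)."""
    return math.log(C) - N * math.log(1.0 + (a + b * I) ** 2)

def flank_centered(L, N):
    """Assumption 6: N*b = 1 and a + b*L/2 = -1."""
    b = 1.0 / N
    a = -1.0 - b * L / 2.0
    return a, b

L, N = 8.0, 16                     # illustrative interval length and depth
a, b = flank_centered(L, N)
I0, h = L / 2.0, 1e-3              # midpoint and finite-difference step

# Single-ring transmission at the midpoint: half maximum of the Lorentzian.
single_ring_mid = 1.0 / (1.0 + (a + b * I0) ** 2)

f_minus, f_zero, f_plus = (log_cascade(I0 + d, a, b, N) for d in (-h, 0.0, h))
slope = (f_plus - f_minus) / (2.0 * h)                    # central first difference
curvature = (f_plus - 2.0 * f_zero + f_minus) / h ** 2    # central second difference
print(single_ring_mid, slope, curvature)
```

The unit slope matches the target $\ln f(I) = I - L$, and the vanishing curvature is what leaves a cubic term as the leading log-domain residual.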

+
+                              S0.2 Rigorous results
+
+  Throughout, define the log-domain residual
+  \[ r(I) \equiv \ln \tilde{y}(I) - (I - L) = \ln C - N \ln\!\left[1 + (a + bI)^2\right] - (I - L), \tag{S0.1} \]
+and the worst-case log-error $E_\infty = \sup_{I \in [0,L]} |r(I)|$. We set $C$ to the minimax-optimal value
+$\ln C^\star = -\left[\max_I g(I) + \min_I g(I)\right]/2$, where $g(I) = \ln y(I) - (I - L)$, throughout.
+Lemma 1 (Slope bound, rigorous). Under Assumptions 1–3, for every $I \ge 0$,
+  \[ \left| \frac{d}{dI} \ln y(I) \right| \le N|b|. \]
+Proof. From Assumption 3, $\ln y(I) = -N \ln\!\left[1 + (a + bI)^2\right]$. Differentiating:
+  \[ \frac{d}{dI} \ln y(I) = -N\,\frac{2b(a + bI)}{1 + (a + bI)^2}. \]
+Let $u \equiv a + bI \in \mathbb{R}$. The function $h(u) = 2u/(1 + u^2)$ satisfies $|h(u)| \le 1$ for all
+$u \in \mathbb{R}$ (since $1 + u^2 \ge 2|u|$ by AM–GM). Therefore $|d(\ln y)/dI| = N|b|\,|h(u)| \le N|b|$.
+Remark 1 (Necessary condition for approximation). Since the target $\ln f(I) = I - L$ has constant slope $+1$,
+a necessary condition for the cascade log-transfer to track this slope at any point is $N|b| \ge 1$. This is
+Eq. (23) of the main text and is a rigorous (not heuristic) necessary condition.
+Proposition 1 (Log-domain Taylor expansion at flank center). Under Assumptions 1–6, define $I_0 = L/2$ and
+$\delta = I - I_0$. Then
+  \[ \ln \tilde{y}(I) = \mathrm{const} + \delta - \frac{\delta^3}{6N^2} + R_4(\delta), \tag{S0.2} \]
+where $|R_4(\delta)| \le M_4 \delta^4$ with $M_4 = N|b|^4 \cdot \sup_{|\delta| \le L/2} |q^{(4)}(u(\delta))|$ and
+$q(u) = -\ln(1 + u^2)$. In particular, the quadratic term vanishes identically at the flank point
+$u_0 = a + bI_0 = -1$.
+Proof. Set $u(\delta) = a + b(I_0 + \delta) = -1 + b\delta$ (using $a + bI_0 = -1$). Define
+$\phi(u) = -\ln(1 + u^2)$. Then $\ln y(I) = N\phi(u(\delta))$ and $u'(\delta) = b$. Compute derivatives of $\phi$
+at $u_0 = -1$:
+  \[ \phi'(u) = -\frac{2u}{1 + u^2}, \qquad \phi'(-1) = 1, \]
+  \[ \phi''(u) = \frac{2(u^2 - 1)}{(1 + u^2)^2}, \qquad \phi''(-1) = 0, \]
+  \[ \phi'''(u) = \frac{4u(3 - u^2)}{(1 + u^2)^3}, \qquad \phi'''(-1) = \frac{4(-1)(3 - 1)}{(1 + 1)^3} = -1. \]
+By the chain rule, writing $F(\delta) = N\phi(u(\delta))$:
+  \[ F'(0) = Nb\,\phi'(-1) = Nb = 1, \]
+  \[ F''(0) = Nb^2\,\phi''(-1) = 0, \]
+  \[ F'''(0) = Nb^3\,\phi'''(-1) = -Nb^3 = -\frac{1}{N^2}, \]
+where we used $b = 1/N$ from Assumption 6 in the last step. Hence the Taylor expansion with the minimax-optimal
+$C$ is
+  \[ \ln \tilde{y}(I) = \mathrm{const} + \delta + 0 \cdot \frac{\delta^2}{2} - \frac{1}{N^2} \cdot \frac{\delta^3}{6}
+     + R_4(\delta). \]
+Subtracting the target $\delta$ (the linear part of $I - L$ around $I_0$) gives a leading residual
+$-\delta^3/(6N^2)$, of magnitude $|\delta|^3/(6N^2)$. The remainder is bounded by the standard Taylor remainder
+estimate.
+Theorem 1 (Heuristic depth-scaling law). Under Assumptions 1–6 and ignoring the fourth-order remainder $R_4$,
+the leading-order worst-case log-error on $I \in [0, L]$ satisfies
+  \[ E_\infty^{(\mathrm{leading})} \sim \frac{1}{6N^2} \left(\frac{L}{2}\right)^{3} = \frac{L^3}{48\,N^2}.
+     \tag{S0.3} \]
+Setting $E_\infty^{(\mathrm{leading})} \le \varepsilon_{\log} = \ln(1 + \varepsilon)$ and solving for $N$ gives
+  \[ N \ge \frac{L^{3/2}}{\sqrt{48\,\varepsilon_{\log}}}. \tag{S0.4} \]
+Derivation (heuristic). From Proposition 1, the residual with respect to the target is dominated in magnitude by
+$|\delta|^3/(6N^2)$ for $|\delta| \le L/2$. The maximum of $|\delta|^3$ on $[-L/2, L/2]$ is $(L/2)^3$. Setting
+the bound equal to $\varepsilon_{\log}$ and solving:
+  \[ \frac{L^3}{48\,N^2} \le \varepsilon_{\log} \quad\Longrightarrow\quad
+     N \ge \frac{L^{3/2}}{\sqrt{48\,\varepsilon_{\log}}}. \]
+With $1/\sqrt{48} \approx 0.144$, and accounting for the fact that the minimax-optimal residual is typically
+smaller than the one-sided Taylor bound by a factor of ∼2 (equi-oscillation), the effective prefactor becomes
+$\kappa \approx 0.07$, yielding the main-text engineering estimate Eq. (28):
+$N \approx \lceil \max(1/b_{\max},\ \kappa L^{3/2}/\sqrt{\varepsilon_{\log}}) \rceil$.
+Remark 2 (Status of Theorem 1). This is a heuristic scaling law, not a rigorous minimax guarantee. The
+derivation truncates the Taylor series at third order and approximates the equi-oscillation factor empirically
+($\kappa \approx 0.07$). For a rigorous bound one would need explicit control of $R_4$ over the full interval
+$[0, L]$, which depends on $L$, $N$, and higher derivatives of the Lorentzian; we do not claim such a bound
+here. The scaling $N \sim L^{3/2}/\sqrt{\varepsilon_{\log}}$ is supported by numerical evidence (Table I) but
+should be treated as an engineering design rule.
+
+                    S0.3 Derivation of the conservative screening bound
+
+  We now derive the conservative screening bound (Eqs. S0.7–S0.8 below), which is stated inline in Sec. II of
+the main text.
+Proposition 2 (Conservative log-error bound).
Under Assumptions 1–5 (identical detuning, but not restricted to the flank-centered choice), fix $b > 0$ and
+choose the normalization $\tilde{y}(L) = 1$. Define $\phi(u) = -\ln(1 + u^2)$ and write
+  \[ \ln \tilde{y}(I) = N\left[\phi(a + bI) - \phi(a + bL)\right]. \]
+The target in this normalization is $(I - L)$. Denoting the residual $r(I) = \ln \tilde{y}(I) - (I - L)$, we
+have $r(L) = 0$ and $r(0) = N[\phi(a) - \phi(a + bL)] + L$.
+  For any choice of $a$ such that the operating range $\{a + bI : I \in [0, L]\}$ lies in the region where
+$\phi$ is concave (i.e., $\phi''(u) \le 0$ throughout), the worst-case log-error satisfies
+  \[ E_\infty \le \frac{N\,\|\phi''\|_\infty\, b^2 L^2}{8}
+     + \frac{\left|N\,\phi'(a + bL)\cdot b - 1\right|}{2} \cdot L, \tag{S0.5} \]
+where $\|\phi''\|_\infty = \sup_{u \in [a,\, a + bL]} |\phi''(u)|$.
+Derivation sketch. Write $h(I) \equiv N\phi(a + bI)$. The slope is $h'(I) = Nb\,\phi'(a + bI)$. At $I = L$, we
+want the slope to match the target slope 1; define the slope mismatch
+$\Delta_s \equiv h'(L) - 1 = Nb\,\phi'(a + bL) - 1$. By the fundamental theorem of calculus on $[I, L]$:
+  \[ r(I) - r(L) = r(I) = h(I) - h(L) - (I - L) = \int_I^L \left[1 - h'(t)\right] dt. \]
+Write $1 - h'(t) = (1 - h'(L)) + (h'(L) - h'(t)) = -\Delta_s + \int_t^L h''(s)\,ds$. Since
+$h''(s) = Nb^2\,\phi''(a + bs)$, we bound $|h''(s)| \le Nb^2 \|\phi''\|_\infty$. Integrating twice and applying
+the triangle inequality gives (S0.5).
+Corollary 1 (Main-text conservative bound). Under slope matching at $I = L$ (i.e., $Nb\,\phi'(a + bL) = 1$, so
+$\Delta_s = 0$), and using $\|\phi''\|_\infty \le 2$ (which holds since
+$|\phi''(u)| = |2(u^2 - 1)/(1 + u^2)^2| \le 2$ for all $u \in \mathbb{R}$), the bound simplifies to
+  \[ E_\infty \le \frac{N b^2 L^2}{4}. \tag{S0.6} \]
+Using $b = 1/N$ (the slope-matching choice from $Nb = 1$) gives $E_\infty \le L^2/(4N)$. If instead we retain a
+general $b$ but add the penalty from imperfect slope matching (e.g., from the constraint $b \le b_{\max}$), a
+combined conservative bound is
+  \[ E_\infty \le \frac{L^2}{4N} + \frac{1}{2b^2 N}, \tag{S0.7} \]
+which provides a conservative heuristic bound on the log-error. Setting this $\le \ln(1 + \varepsilon)$ and
+solving for $N$ yields the conservative screening depth:
+  \[ N_{\mathrm{safe}} \ge \frac{L^2/4 + 1/(2b^2)}{\ln(1 + \varepsilon)}. \tag{S0.8} \]
+Remark 3 (Status of the conservative bound). Equation (S0.7) is a conservative heuristic design rule. It is
+conservative because: (i) we use a global upper bound $\|\phi''\|_\infty \le 2$ instead of the actual
+curvature, (ii) we do not exploit the minimax-optimal $C$ shift. It is heuristic (not a formal guarantee)
+because: (i) the derivation assumes the operating range lies in the concavity region of $\phi$, which may not
+hold for all detuning choices; (ii) the second term $1/(2b^2 N)$ arises from a simplified penalty model for
+flank-curvature mismatch that has not been proved to be a rigorous upper bound in all parameter regimes.
+$N_{\mathrm{safe}}$ from Eq. (S0.8) is therefore a screening estimate, suitable for preliminary design-space
+exploration but not a certified minimax guarantee.
+
+
+                    S0.4 Validity scope and failure cases
+
+  The derivations above hold under the stated assumptions. We now identify the regimes where each assumption
+may break down.
+
+(V1) Lorentzian model (A1). The single-ring Lorentzian form $T = [1 + (a + bI)^2]^{-1}$ is a near-resonance
+     approximation valid when the probe frequency is within a few linewidths of the resonance. Far from
+     resonance, higher-order dispersion, Fano interference, or multi-mode effects introduce deviations. Failure
+     case: operation with very large detuning ($|a + bI| \gg 1$ across the full interval), where the Lorentzian
+     tails may not be accurate for high-Q rings.
+
+(V2) Multiplicative cascade (A2). Requires that inter-ring reflections and back-scattering are negligible
+     (forward-propagating coupling only, i.e. negligible back-reflection at each inter-ring junction). Failure
+     case: very high ring count $N$ with non-negligible back-reflection per stage, which can produce
+     Fabry–Pérot-like ripples in the cascade transfer function.
+
+(V3) Identical-detuning family (A3). The Taylor expansion and conservative bound both assume
+     $a_1 = \cdots = a_N$. In practice, fabrication variations introduce per-ring detuning spread $\sigma_a$.
The Monte Carlo analysis in Sec. S8 + quantifies robustness, but the analytical bounds (S0.2)–(S0.8) are strictly valid only for identical detuning. + (0) +(V4) Linear control-to-resonance mapping (A4). The linearized model ω0 (I) = ω0 + ηI introduces systematic + error at large control amplitudes. For carrier-injection (free-carrier plasma effect) or thermal tuning over wide + ranges, second-order nonlinearity in the control-to-detuning mapping can exceed 1%. Failure case: large L + requiring a control swing exceeding the linearity range of the tuning mechanism. + +(V5) Finite interval (A5). All bounds scale with L (typically as L2 or L3/2 ). As L → ∞, N grows without bound + and insertion loss accumulates (∼ N · ILstage ), eventually degrading the probe SNR below the useful regime. + There is no finite N that works for all L simultaneously. Practical regime: L ≲ 10–12 (consistent with Leff at + p90–p95 from Sec. S3) is the primary target; L ≳ 16 requires N ≳ 30 even for moderate tolerance, pushing loss + budgets. + +(V6) Flank-centered initialization (A6). The Taylor-based scaling (Theorem 1) relies on the cancellation + ϕ′′ (−1) = 0 at the half-maximum point. If the operating point deviates (e.g., due to fabrication offset pushing + a + bI0 away from −1), a nonzero quadratic residual appears and the effective scaling worsens to E∞ ∼ L2 /N + rather than L3 /N 2 . Mitigation: heater/bias trimming to restore the flank condition. + + + S0.5 Mapping to main-text equations + +For reference, the results derived here correspond to the following main-text equations: + + • Slope bound (Lemma 1): rigorous; corresponds to main-text Eqs. (22)–(23). This is a guaranteed necessary + condition. + + • Engineering N -estimate (Theorem 1): heuristic scaling with empirical prefactor κ ≈ 0.07; corresponds to + main-text Eq. (28). This is a heuristic design rule calibrated against numerical fits. 

 • Conservative bound (Corollary 1): conservative but not rigorously certified as a minimax upper bound; derived
 as Eq. (S0.7) in this supplement, stated inline in Sec. II. This is a conservative heuristic screening condition.

 • Nsafe (Corollary 1, Eq. S0.8): the safe screening depth derived from the conservative bound; derived as Eq. (S0.8)
 in this supplement, stated inline in Sec. II. This is a conservative backstop estimate for preliminary design.

Summary of guarantee status:

Result                               Status                                        Main-text Eq.
Slope bound N|b| ≥ 1                 Rigorous (proved)                             (23)
Scaling N ∼ κ L^(3/2) / √εlog        Heuristic (Taylor truncation + empirical κ)   (28)
Bound E∞ ≤ L^2/(4N) + 1/(2b^2 N)     Conservative heuristic                        (S0.7)
Nsafe screening depth                Conservative backstop                         (S0.8)


 S1. DEPTH-SCALING DERIVATION AND CONSERVATIVE SCREENING BOUND

 This section provides the detailed derivations underlying the depth-scaling relations and conservative screening
bounds summarized in the main text (Sec. II). These results complement the rigorous treatment in Sec. S0.

 S1.1 Local expansion and exponential-like behavior

 To provide immediate local intuition (without changing the global minimax objective), let δ = I − I0 around the
flank-centered point I0 = L/2 and impose a + bI0 = −1. With the local normalization C = 2^N (so that ỹ(I0) = 1), a
third-order expansion of ỹ(I) = C[1 + (a + bI)^2]^-N gives

    \tilde{y}(I) \approx 1 + N b\,\delta + \frac{N^2}{2} b^2 \delta^2
                 + \frac{N(N^2 - 1)}{6} b^3 \delta^3 + O(\delta^4),                                  (S1.1)

so with b ∼ 1/N, the linear and quadratic coefficients align with those of e^δ = 1 + δ + δ^2/2 + δ^3/6 + ···, explaining
why the initialization is already close before refinement.


 S1.2 Log-domain analysis and scaling derivation

 For depth scaling, the logarithmic domain is more transparent. Under the same flank centering (a + bI0 = −1),
expand around I0 = L/2 with δ = I − I0 to obtain

    \ln \tilde{y}(I) = \text{const} + N b\,\delta - \frac{N b^3}{6} \delta^3 + O(\delta^4).          (S1.2)

At a + bI0 = −1, the quadratic term cancels identically in the log expansion; imposing slope matching (N b = 1) gives

    \ln \tilde{y}(I) = \text{const} + \delta - \frac{\delta^3}{6 N^2} + O(\delta^4).                 (S1.3)

Hence the leading log-domain residual scales as |r(δ)| ∼ |δ|^3/N^2. Over I ∈ [0, L] with |δ| ≤ L/2, this implies
E∞ ∼ L^3/N^2. Requiring E∞ ≤ εlog leads to

    N \propto \frac{L^{3/2}}{\sqrt{\varepsilon_{\log}}},                                             (S1.4)

which explains the scaling used in the main-text engineering estimate (Eq. (28)). This derivation is heuristic (not a
formal guarantee), and the prefactor remains platform- and fitting-criterion dependent.


 S1.3 Conservative upper bound and screening depth

 For fixed b and the identical-detuning family (a1 = ··· = aN ≡ a), one can write a conservative heuristic condition
for achieving a prescribed log-tolerance. A simple normalization is to enforce ỹ(L) = 1 (matching the target f(L) = 1).
For a particular constructive choice of a that keeps (a + bI) large and negative across [0, L], one can bound the
worst-case log-error as

    E_\infty \le \frac{L^2}{4N} + \frac{1}{2 b^2 N}.                                                 (S1.5)

(This is a conservative rule of thumb; obtaining a formal guarantee would require a separate proof.) As a screening
estimate (not a formal guarantee), one may use

    N \ge \frac{L^2/4 + 1/(2 b^2)}{\ln(1 + \varepsilon)}.                                            (S1.6)

While this bound is typically pessimistic, it provides a conservative backstop-style estimate for preliminary design
screening. The rigorous derivation of these bounds, including the concavity conditions and slope-matching assumptions,
is given in Sec. S0.3.

 S2. WORKED EXAMPLE AND EMPIRICAL LOGIT-RANGE CALIBRATION

 This section provides the detailed worked example for the input-to-output mapping and the empirical logit-range
calibration tables referenced in the main text (Sec. III).


 S2.1 Worked input-to-output mapping example

 As a worked example, consider

    x = [-3.2,\ 1.2,\ 4.8,\ -0.9].                                                                   (S2.1)

Compute m = max xn = 4.8. Then u = x − m = [−8.0, −3.6, 0, −5.7] and L = − min un = 8.0.
The mapped
control-signal levels are

    I = u + L = [0,\ 4.4,\ 8.0,\ 2.3],                                                               (S2.2)

and the required normalized exponentials are e^(xn − m) = e^(un) = e^(In − L). Using the fitted model directly,

    T_k(I_n) = \frac{1}{1 + (a_k + b I_n)^2}, \qquad y(I_n) = \prod_{k=1}^{N} T_k(I_n).

Under the identical-detuning fit (a1 = ··· = aN ≡ a), this becomes

    \tilde{y}(I_n) = C\, y(I_n) = C \left[ \frac{1}{1 + (a + b I_n)^2} \right]^{N}.

For the re-fitted parameters used in this example,

    a = -1.4588, \quad b = 0.10202, \quad N = 10, \quad C = 3.0896 \times 10^{1},                    (S2.3)

which gives

    \tilde{y}(I_n) = C \left[ \frac{1}{1 + (a + b I_n)^2} \right]^{N}
                   \approx [3.44 \times 10^{-4},\ 2.73 \times 10^{-2},\ 9.74 \times 10^{-1},\ 3.26 \times 10^{-3}].   (S2.4)

 For reference, the corresponding target terms are

    I_n - L = [-8.0,\ -3.6,\ 0,\ -5.7],                                                              (S2.5)

and

    e^{I_n - L} \approx [3.35 \times 10^{-4},\ 2.73 \times 10^{-2},\ 1.00,\ 3.35 \times 10^{-3}].    (S2.6)


 S2.2 Effective-range percentiles and clipping calibration

 We first estimate the logit range observed in data and then choose clipping accordingly. From two autoregressive
Transformers (distilgpt2 and gpt2) and two public corpora (Tiny Shakespeare and Pride and Prejudice) [1–5] at context
length 128, the effective range

    L_{\mathrm{eff},\alpha} = \max(\log p_{\mathrm{kept}}) - \min(\log p_{\mathrm{kept}}), \qquad \alpha = 0.999,   (S2.7)

fell in a relatively narrow band, summarized in Table S2.

 TABLE S1: Example (N = 10): approximating e^(xn − m) = e^(In − L) using ỹ(I) = C[1 + (a + bI)^2]^-N with parameters
 re-fitted on I ∈ [0, 8.0] using the same minimax pipeline.

  xn     In     target e^(xn − m)    approx ỹ(In)       rel. err.
 −3.2    0.0    3.3546 × 10^-4       3.4443 × 10^-4     2.673%
  1.2    4.4    2.7324 × 10^-2       2.7325 × 10^-2     0.004%
  4.8    8.0    1.0000               0.9739             2.608%
 −0.9    2.3    3.3460 × 10^-3       3.2585 × 10^-3     2.614%


 TABLE S2: Effective-range percentiles (Leff,0.999) at context length 128.

 Percentile    All runs (4 runs)    GPT-2
 p50           6.92–7.23            7.09–7.23
 p90           8.60–8.75            8.73–8.75
 p95           8.97–9.12            9.06–9.12
 p99           9.50–9.69            9.58–9.69


 We then test clipping on the same rows with

    E_{\mathrm{cum}}(t) = \tfrac{1}{2} \left\| \mathrm{softmax}(u^{(t)}) - \mathrm{softmax}(u) \right\|_1,
    \qquad u^{(t)} = \max(u, t), \quad u = s - \max(s),                                              (S2.8)

and require p99{Ecum} ≤ 10^-3 (0.1% budget). This criterion is satisfied at t = −12 (p99 ≈ 4.27 × 10^-4) and violated
at t = −11 (p99 ≈ 1.24 × 10^-3), so we set t∗ = −12 (Nclip = 12).
 In practice, we (i) estimate an effective L from data, (ii) verify that fixed clipping keeps softmax error small, and (iii)
choose representative design points (e.g., L ≈ 8 or L ≈ 12) while treating the clipped tail as negligible. Full protocol
details, clipping-sweep tables/plots, and per-run statistics are provided in Sec. S3.


 S2.3 Illustrative synthetic range map

 As a design-space reference, we consider synthetic logit-range regimes using L = max(x) − min(x) after QK⊤/√dk
scaling. These regimes are illustrative rather than corpus-level percentiles; using the same fitting pipeline, Table S3
summarizes achievable approximation error versus depth.

 TABLE S3: Synthetic softmax logit-range regimes (L = max(x) − min(x)) and fitted worst-case relative error
 (design-space illustration; not intended as corpus-level statistics).

 L regime    N = 5     N = 10    N = 20    N = 30
 L = 8       10.9%     2.68%     0.67%     0.30%
 L = 12      40.0%     9.25%     2.27%     1.01%
 L = 16      113%      23.0%     5.44%     2.41%


 Table S3 suggests a simple rule of thumb: the required depth depends mainly on the target L regime. Near L ≈ 8,
moderate depth reaches a few-percent error, whereas L ≳ 12 typically requires deeper cascades to approach < 1%
error.
 We include Table S3 as a synthetic design map rather than an empirical benchmark.

 S3. EMPIRICAL LOGIT-RANGE EXTRACTION FROM REAL TRANSFORMER RUNS

 We extracted empirical attention-logit ranges from real model runs to complement the synthetic L-regime map in
the main text.
We used two open-source autoregressive Transformers (distilgpt2 and gpt2) and two public corpora
(Tiny Shakespeare and Pride and Prejudice), with context length 128 and causal masking. For each valid attention
row, if p = softmax(s) then the raw range is

    L_{\mathrm{raw}} = \max(s) - \min(s) = \max(\log p) - \min(\log p),                              (37)

where max/min are taken over valid causal keys only. Because very small tail probabilities can dominate min(log p),
we additionally report an effective range:

    L_{\mathrm{eff},\alpha} = \max(\log p_{\mathrm{kept}}) - \min(\log p_{\mathrm{kept}}),           (38)

where keys are sorted by attention weight and retained until cumulative mass reaches α = 0.999.
 To stay within a 16 GB RAM budget, we processed one model at a time, batch size 1, fixed windowing (stride 128),
and streaming histogram quantiles. Observed process RSS stayed below 1.24 GB in these runs.

 TABLE S4: Empirical global logit-range percentiles from real model–dataset runs (context length 128): raw vs
 effective (α = 0.999).

 Model        Dataset             raw p95    raw p99    Leff p50    Leff p90    Leff p95    Leff p99
 distilgpt2   tiny shakespeare    22.82      69.00      7.10        8.60        8.97        9.50
 distilgpt2   pride prejudice     21.76      68.60      6.92        8.60        9.03        9.57
 gpt2         tiny shakespeare    25.48      43.34      7.23        8.73        9.06        9.58
 gpt2         pride prejudice     24.13      40.92      7.09        8.75        9.12        9.69

 For quick linkage to the main manuscript: the effective-range summary quoted in the main text corresponds to this
table (all runs: p50 = 6.92–7.23, p90 = 8.60–8.75, p95 = 8.97–9.12, p99 = 9.50–9.69), and the GPT-2 subset is
p50 = 7.09–7.23, p90 = 8.73–8.75, p95 = 9.06–9.12, p99 = 9.58–9.69.
Clipping-validity sweep (additional justification). To test whether practical clipping magnitudes can be used
without materially changing softmax outputs, we evaluated a thresholded-logit approximation. For each row, define
u = s − max(s) and, for threshold t ≤ 0,

    u^{(t)} = \max(u, t), \qquad p^{(t)} = \mathrm{softmax}(u^{(t)}).                                (39)

We report the cumulative softmax error

    E_{\mathrm{cum}}(t) = \tfrac{1}{2} \left\| p^{(t)} - p \right\|_1,                               (40)

then sweep t ∈ {−14, −13, …, −6} and compute p50/p90/p95/p99 of Ecum over all extracted rows.

 TABLE S5: Global clipping-validity sweep: percentile statistics of Ecum(t) versus clipping threshold t.

  t      p50              p90              p95              p99
 −14     2.53 × 10^-5     4.55 × 10^-5     4.80 × 10^-5     5.18 × 10^-5
 −13     2.69 × 10^-5     4.85 × 10^-5     7.38 × 10^-5     1.48 × 10^-4
 −12     2.99 × 10^-5     1.21 × 10^-4     2.13 × 10^-4     4.27 × 10^-4
 −11     3.31 × 10^-5     3.95 × 10^-4     6.55 × 10^-4     1.24 × 10^-3
 −10     3.72 × 10^-5     1.28 × 10^-3     2.01 × 10^-3     3.58 × 10^-3
 −9      4.41 × 10^-5     4.04 × 10^-3     6.11 × 10^-3     1.03 × 10^-2
 −8      2.25 × 10^-4     1.26 × 10^-2     1.83 × 10^-2     2.91 × 10^-2
 −7      2.76 × 10^-3     3.85 × 10^-2     5.30 × 10^-2     7.89 × 10^-2
 −6      1.88 × 10^-2     1.11 × 10^-1     1.43 × 10^-1     1.95 × 10^-1

 Under a conservative budget criterion p99{Ecum} ≤ 10^-3, the least negative admissible threshold in this sweep
is t∗ = −12 (p99 ≈ 4.27 × 10^-4). Equivalently, the operational clipping magnitude is Nclip ≡ −t∗ = 12. Notably,
this is closely aligned with the empirical effective-range scale (Table S4: p99 of Leff,0.999 up to ≈ 9.69), indicating
that clipping-constrained implementation and effective-range statistics operate in the same order-of-magnitude range
budget. This supports using a practical clipping magnitude comparable to the design range scale (L ≈ Nclip) while
keeping aggregate softmax distortion below 0.1%.


 FIG. S1: Global CDFs of raw Lraw (dashed) and effective Leff,0.999 (solid) for the four model–dataset runs.


FIG. S2: Percentile curves of cumulative softmax error Ecum(t) versus clipping threshold t. The dashed line marks the
 0.1% budget (10^-3).

 S4. FDTD METHODOLOGY DETAILS AND X-CUT bV DERIVATION

 This section provides the detailed FDTD simulation methodology, the step-by-step X-cut arc electrode voltage
sensitivity derivation, and the full cascade optimization table referenced in the main text (Sec. IV–V).


 S4.1 z-refined 3-fix simulation strategy

 For thin-film LiNbO3 structures, special care is required in the vertical (z) direction due to the high index contrast
between LiNbO3 (no ≈ 2.21) and SiO2 (n ≈ 1.44) and the sub-micron film thickness. We apply a “z-refined 3-fix”
strategy:
 1. Ordinary index correction: the material model uses the corrected ordinary refractive index no appropriate
 for the TE mode in X-cut geometry, rather than the extraordinary index ne that governs TM propagation;
 2. z-span expansion: the simulation z-span is extended beyond the minimal waveguide region to include sufficient
 substrate and superstrate so that evanescent field tails are captured without PML truncation artifacts;
 3. Auto-mesh: accuracy level 3; conformal variant 1 meshing is enabled, and no manual mesh override is applied.
 The resulting vertical grid spacing in the slab region is approximately 55 nm, providing ∼2 cells across the 100 nm
 slab.
This refinement strategy is critical for obtaining converged results in TFLN ring resonators, where the high-Q spectral
features are sensitive to numerical dispersion in under-resolved thin films [6]. Table S6 lists the full simulation
parameters.

 TABLE S6: 3D FDTD simulation parameters (Lumerical).

 Parameter           Value
 Solver              Lumerical 3D FDTD
 Mesh type           Conformal variant 1
 Mesh accuracy       3 (auto-mesh)
 z-mesh override     None (auto-mesh)
 Simulation time     50 ps
 Auto shutoff        1 × 10^-6
 Wavelength range    1530 nm to 1570 nm
 Grid size           532 × 816 × 44
 Source              Broadband mode source (TE0)


 S4.2 X-cut arc electrode bV step-by-step derivation

 For the X-cut circular ring with lateral S–G arc electrodes (Table II), the crystal Z-axis (c-axis) is oriented at 45°
from the horizontal axis in the substrate plane. At azimuthal angle θ around the ring, the projection of the lateral
electric field onto the Z-axis is proportional to cos(θ − 45°).
The cos(θ − 45°) = 0 boundaries fall at θ = 135° and
θ = 315°, naturally separating the bus-waveguide coupling regions from the electrode regions. Each ring carries a full
semicircular arc electrode on the side opposite to its coupling points. By the substitution φ = θ − 45°, the effective
EO fill factor is

    f_{EO} = \frac{1}{2\pi} \int_{\text{semicircle}} |\cos(\theta - 45^\circ)|\, d\theta
           = \frac{1}{2\pi} \int_{-\pi/2}^{+\pi/2} \cos\varphi\, d\varphi
           = \frac{1}{2\pi} \Big[ \sin\varphi \Big]_{-\pi/2}^{+\pi/2}
           = \frac{1}{\pi} \approx 0.318.                                                            (S4.1)

The 45° rotation ensures that the electrode semicircle does not overlap with the coupling points, while the fill factor
integral is identical to the standard cos θ case by the change of variable.
 The lateral S–G electrodes have gap gel = 5 µm, giving an effective electrode–waveguide distance deff ≈ gel/2 = 2.5 µm.
The lateral field geometry yields an EO overlap factor ΓEO = 0.7, compared to 0.5 for a vertical electrode configuration.
 The refractive index change per volt in the electrode-covered section is

    \frac{\Delta n_{\mathrm{eff}}}{V} = -\frac{1}{2} n_e^3 r_{33} \frac{\Gamma_{EO}}{d_{\mathrm{eff}}}
        = -\frac{1}{2} \times 2.138^3 \times 30.9 \times 10^{-12} \times \frac{0.7}{2.5 \times 10^{-6}}
        = -4.226 \times 10^{-5}\ \mathrm{V^{-1}}.                                                    (S4.2)

The corresponding resonance wavelength shift is

    \left. \frac{d\lambda_0}{dV} \right|_{\text{straight}} = \frac{1550 \times 4.226 \times 10^{-5}}{2.30}
        = 28.48\ \mathrm{pm\,V^{-1}},                                                                (S4.3)

giving an intrinsic (straight-section) voltage sensitivity of

    b_V^{\text{straight}} = \frac{2 Q_L}{\lambda_0} \left. \frac{d\lambda_0}{dV} \right|_{\text{straight}}
        = \frac{2 \times 15{,}500}{1550} \times 0.02848 = 0.570\ \mathrm{V^{-1}}.                    (S4.4)

However, only the arc-electrode portion of the ring circumference contributes to the round-trip phase shift. The
effective voltage sensitivity is therefore

    b_V = b_V^{\text{straight}} \times f_{EO} = 0.570 \times \frac{1}{\pi} \approx 0.182\ \mathrm{V^{-1}}.   (S4.5)

A 1 V applied voltage shifts the normalized detuning by ∆a ≈ 0.182. Despite the fill-factor penalty (fEO = 1/π ≈ 0.318),
the X-cut arc design benefits from a smaller effective electrode distance (2.5 µm vs. 4 µm for vertical configurations)
and a higher overlap factor (0.7 vs. 0.5), which partially compensate the reduced active length.
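
 The chain of Eqs. (S4.1)–(S4.5) can be re-evaluated numerically as a quick consistency check. The following is a
minimal sketch, not the authors' script: the variable and function names are illustrative assumptions, the inputs are
the values quoted above, and the fill-factor integral is estimated with a simple midpoint rule.

```python
import math

# Inputs quoted in Sec. S4.2 (names are illustrative, not from the paper).
n_e = 2.138           # index value used in Eq. (S4.2)
r33 = 30.9e-12        # electro-optic coefficient, m/V
gamma_eo = 0.7        # EO overlap factor for lateral S-G electrodes
d_eff = 2.5e-6        # effective electrode-waveguide distance, m
lambda0_nm = 1550.0   # resonance wavelength, nm
n_g = 2.30            # group index (FDTD FSR-derived)
q_loaded = 15_500     # loaded quality factor


def fill_factor(samples: int = 20_000) -> float:
    """Midpoint-rule estimate of (1/2pi) * integral of cos(phi) over [-pi/2, pi/2]."""
    h = math.pi / samples
    total = sum(math.cos(-math.pi / 2 + (i + 0.5) * h) for i in range(samples))
    return total * h / (2 * math.pi)


f_eo = fill_factor()                                              # Eq. (S4.1)
dn_per_volt = -0.5 * n_e**3 * r33 * gamma_eo / d_eff              # Eq. (S4.2), 1/V
dlambda_nm_per_volt = lambda0_nm * abs(dn_per_volt) / n_g         # Eq. (S4.3), nm/V
b_straight = 2 * q_loaded / lambda0_nm * dlambda_nm_per_volt      # Eq. (S4.4), 1/V
b_v = b_straight * f_eo                                           # Eq. (S4.5), 1/V

print(f"f_EO = {f_eo:.4f}")
print(f"dn/V = {dn_per_volt:.4e} 1/V")
print(f"dlambda/dV = {dlambda_nm_per_volt * 1e3:.2f} pm/V")
print(f"b_V(straight) = {b_straight:.3f} 1/V, b_V = {b_v:.3f} 1/V")
```

 Evaluated this way, the intermediate quantities agree with the quoted values to within rounding; the final bV comes
out within about 1% of the quoted 0.182 V^-1, with the small difference attributable to rounding of the intermediates.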


 S4.3 Full cascade optimization table

 Table S7 presents the complete optimization results for the standard dynamic range L = 8 (corresponding to
e^8 ≈ 2981, i.e., 34.7 dB), covering all cascade depths from N = 5 to N = 30.

 TABLE S7: Cascade optimization results for L = 8. The bias voltage Vbias = |a|/bV sets the DC offset, and
Vctrl = bL/bV is the maximum control voltage at I = L. Voltages computed with bV = 0.182 V^-1 (FDTD-calibrated
 best resonance QL = 15,500).

  N     a          b          E∞        εmax (%)    Vbias (V)    Vctrl (V)
  5     −2.0789    0.21658    0.1035    10.91       11.4         9.5
  8     −1.5959    0.12896    0.0412    4.20        8.8          5.7
 10     −1.4588    0.10202    0.0265    2.68        8.0          4.5
 12     −1.3731    0.08450    0.0184    1.86        7.5          3.7
 15     −1.2914    0.06726    0.0118    1.19        7.1          3.0
 17     −1.2543    0.05923    0.0092    0.92        6.9          2.6
 20     −1.2136    0.05025    0.0067    0.67        6.7          2.2
 25     −1.1685    0.04013    0.0043    0.43        6.4          1.8
 30     −1.1392    0.03341    0.0030    0.30        6.3          1.5


 Key thresholds for the minimum number of rings at various error targets are:
 • ε < 10%: N ≥ 6,
 • ε < 5%: N ≥ 8,
 • ε < 2%: N ≥ 12,
 • ε < 1%: N ≥ 17,
 • ε < 0.5%: N ≥ 24.
These thresholds are independent of the quality factor Q, since the minimax approximation operates entirely in
normalized detuning space. The Q factor affects only the physical voltage required to achieve the necessary detuning
range, through bV.


 S4.4 Lorentzian fit validation

 Figure S3 shows the Lorentzian fit to the FDTD drop-port resonance near λ = 1566 nm. The analytical Lorentzian
Tdrop(∆λ) = A/[1 + (2Q∆λ/λ0)^2] with QL = 15,500 closely tracks the FDTD data, validating the single-ring transfer
function model used in the cascade analysis.


 FIG. S3: Lorentzian fit to the FDTD drop-port resonance. Markers: FDTD data; solid line: Lorentzian fit. The
 extracted quality factor is QL = 15,500 with FWHM = 101 pm.
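
 The quoted linewidth follows directly from the Lorentzian form in Sec. S4.4: the transmission falls to A/2 when
2Q∆λ/λ0 = ±1, so FWHM = λ0/Q. The short sketch below checks this relation against the quoted numbers; the function
and variable names are illustrative assumptions, and the amplitude A is set to 1 because only the line shape matters.

```python
# Lorentzian drop-port line shape from Sec. S4.4 (names are illustrative).
lambda0_nm = 1566.0   # resonance wavelength of the fitted FDTD peak, nm
q_loaded = 15_500     # loaded quality factor from the fit


def lorentzian_drop(delta_lambda_nm: float, amplitude: float = 1.0) -> float:
    """T_drop = A / [1 + (2 Q dlambda / lambda0)^2]."""
    x = 2.0 * q_loaded * delta_lambda_nm / lambda0_nm
    return amplitude / (1.0 + x * x)


# Full width at half maximum implied by the line shape: FWHM = lambda0 / Q.
fwhm_nm = lambda0_nm / q_loaded
fwhm_pm = fwhm_nm * 1e3

# Transmission at half the FWHM away from resonance should be half the peak.
half_point = lorentzian_drop(fwhm_nm / 2.0)

print(f"FWHM = {fwhm_pm:.1f} pm")
print(f"T_drop at +FWHM/2 = {half_point:.3f} (peak = {lorentzian_drop(0.0):.3f})")
```

 The same relation can be inverted to extract Q from a measured or simulated linewidth, which is how a loaded quality
factor is commonly read off a fitted resonance.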


 S4.5 Eigenmode (FDE) analysis of theoretical Qi

 To quantify how far below the physical limit the FDTD-extracted Qi = 38,800 lies, we perform a two-dimensional
finite-difference eigenmode (FDE) analysis of the bent rib waveguide cross-section using Lumerical MODE Solutions.
 a. Setup. The FDE solver models the cross-section of the rib waveguide at the design bend radius R = 20 µm
and wavelength λ = 1550 nm, with perfectly matched layer (PML) boundaries on all four edges. The geometry is
identical to the 3D FDTD model: 600 nm total LiNbO3 (no = 2.211, lossless dielectric), 100 nm slab, 500 nm rib etch,
waveguide width W = 1.4 µm, on a 2 µm SiO2 substrate (n = 1.444) with air cladding. The mesh is set to 300 × 300
cells over a 6 µm × 3 µm cross-section, yielding effective grid spacings ∆x ≈ 20 nm and ∆y ≈ 10 nm, substantially
finer than the 3D FDTD auto-mesh (55 nm vertical).
 b. Complex effective index. The FDE solver returns a complex effective index neff = nr + i ni for each guided
mode, where the imaginary part ni encodes propagation loss. For the fundamental TE mode at R = 20 µm:

    n_{\mathrm{eff}} = 1.9653 + i\,(4.73 \times 10^{-8}),                                            (41)

    \alpha_{\mathrm{rad+leak}} = \frac{4\pi n_i}{\lambda} = 0.383\ \mathrm{m^{-1}}
        \quad (0.017\ \mathrm{dB\,cm^{-1}}).                                                         (42)

Since the material is set as lossless, this α captures only bending radiation loss and substrate leakage through the
100 nm slab. The corresponding quality factor is

    Q_{\mathrm{rad+leak}} = \frac{2\pi n_g}{\alpha_{\mathrm{rad+leak}}\, \lambda} = 2.43 \times 10^{7},   (43)

where ng = 2.354 is the group index from the FDE solver (consistent with the FDTD FSR-derived ng = 2.30; the
small difference arises from the straight-section approximation inherent to 2D FDE).
 c. Decomposition into bending and leakage. A separate FDE run with R = 1 mm (effectively straight) yields
Qleak = 2.93 × 10^7, isolating the substrate leakage contribution. The pure bending radiation quality factor follows from

    \frac{1}{Q_{\mathrm{bend}}} = \frac{1}{Q_{\mathrm{rad+leak}}} - \frac{1}{Q_{\mathrm{leak}}},
    \qquad Q_{\mathrm{bend}} = 1.43 \times 10^{8}.                                                   (44)

This confirms that bending radiation loss at R = 20 µm is negligible; substrate leakage through the thin slab is the
dominant geometric loss channel.
 d. Material absorption. The FDE mode profile yields a confinement factor Γ = 0.887 (fraction of the optical
intensity within the LiNbO3 core and slab regions). The material-absorption-limited quality factor is

    Q_{\mathrm{abs}} = \frac{2\pi n_g}{\Gamma\, \alpha_{\mathrm{mat}}\, \lambda},                    (45)

where αmat is the bulk material power-attenuation coefficient of LiNbO3 at 1550 nm. Table S8 evaluates Eq. (45) for
representative TFLN absorption values from the literature [6, 7].

TABLE S8: Theoretical intrinsic quality factor Qi of the R = 20 µm TFLN ring, decomposed into radiation (Qbend),
 substrate leakage (Qleak), and material absorption (Qabs). Sidewall scattering (fabrication-dependent) is excluded.
 The total is 1/Qi = 1/Qrad+leak + 1/Qabs with Qrad+leak = 2.43 × 10^7.

 Material condition        αmat (dB/cm)    Qabs          Qi (total)
 Bulk LiNbO3 (pristine)    0.002           2.3 × 10^8    2.2 × 10^7
 High-quality TFLN         0.01            4.7 × 10^7    1.6 × 10^7
 Good TFLN                 0.03            1.6 × 10^7    9.5 × 10^6
 Typical TFLN              0.1             4.7 × 10^6    3.9 × 10^6


 For high-quality TFLN (αmat ≲ 0.01 dB cm^-1), the theoretical Qi exceeds 10^7, more than 400× higher than the
FDTD-extracted value of 38,800. This confirms that the FDTD result is dominated by numerical mesh artifacts
(approximately two cells across the 100 nm slab), not by physical loss mechanisms. Bending radiation loss at R = 20 µm
is negligible (Qbend = 1.43 × 10^8); the dominant geometric loss channel in the ideal structure is substrate leakage
through the thin slab (Qleak = 2.93 × 10^7).

 S5. FABRICATED HIGH-Q DESIGN PROJECTIONS

 Reproducing Qi > 10^5 in three-dimensional FDTD is computationally impractical: at accuracy level 3 the 100 nm
slab requires ∆z ≲ 20 nm to suppress staircase-induced scattering, inflating wall times beyond 30 days per run.
The
numerically extracted Qi = 38,800 therefore represents a simulation floor, not a physical one. A two-dimensional
MODE-solver bend analysis confirms Qbend > 4.5 × 10^7 for R = 20 µm, placing bending radiation loss far below any
realistic intrinsic loss.
 Table S9 surveys recent high-Q TFLN microring demonstrations. These studies show that Qi ≥ 9 × 10^6 has been
demonstrated in X-cut TFLN using multiple fabrication routes, including Ar+ milling, wet etching, and
ICP-RIE/CMP-based processes.

 TABLE S9: Demonstrated intrinsic quality factors in TFLN micro-ring resonators. “EO compatible” indicates
 whether the fabrication process preserves electrode patterning capability.

 Ref.            Qi            R (µm)    w (µm)    Etch
 Zhang [8]       10^7          80        ∼2        Ar+ mill
 Gao [9]         10^8          100       ∼3        CMP∗
 Zhuang [10]     9 × 10^6      100       ∼2        Wet etch
 Song [11]       2.9 × 10^7    200       4.5       ICP-RIE+CMP
 All processes except ∗ are EO-electrode compatible. ∗ CMP-only (no dry etch); subsequent electrode patterning may
 degrade Qi.

 To project cascade performance into the fabricated regime, we fix Qext = 25,800 (the FDTD-extracted coupling
quality factor at gap = 100 nm) and compute Dmax = [Qi/(Qi + Qext)]^2 for three representative intrinsic quality
factors (Table S10).

 TABLE S10: Projected cascade transmission for fabricated Qi values at fixed Qext = 25,800. Dmax^N is the ideal
on-resonance cascade transmission in dB. The minimax approximation error εmax depends only on N and L (not on
 Qi); at N = 20, L = 8: εmax = 0.67% (Table I).

 Projection       Qi             Dmax    Dmax^N (dB): N = 10    N = 20    N = 30
 FDTD baseline    3.88 × 10^4    0.36    −44.3                  −88.5     −132.8
 Conservative     5 × 10^5       0.90    −4.4                   −8.8      −13.2
 Moderate         10^6           0.95    −2.2                   −4.5      −6.7
 Optimistic       5 × 10^6       0.99    −0.44                  −0.88     −1.3


 Even in the conservative scenario (Qi = 5 × 10^5), Dmax = 0.90 and the N = 10 cascade transmission is −4.4 dB, a
tenfold reduction in dB loss relative to the FDTD baseline (−44.3 dB). The moderate projection (Qi = 10^6) matches the
“fabricated high-Q” column in Table V.
Because Qbend ≈ 4.5 × 10^7 ≫ Qi for all projections, bending loss is never the bottleneck;
the dominant loss mechanism is sidewall scattering, which is determined entirely by fabrication quality. The literature
values in Table S9 support the view that intrinsic quality factors in the projected range are physically achievable
in TFLN, albeit with wider waveguides (w ≥ 2 µm) and larger ring radii (R ≥ 80 µm) than the present design.
Transferring comparable sidewall quality to our geometry (R = 20 µm, w = 1.4 µm) is an open fabrication challenge;
the projections in Table S10 should be read as design targets contingent on achieving it.

 S6. INSERTION LOSS BUDGET DETAILS

 For a cascade of N rings, the total insertion loss is modeled as

    IL_{\mathrm{tot}} \approx N \cdot IL_{\mathrm{stage}} + IL_{\mathrm{coupling}},                  (S6.1)

where ILstage is the per-ring insertion loss at off-resonance operation and ILcoupling accounts for fiber-to-chip and
chip-to-fiber coupling losses. Using typical loss numbers from the literature [12–16], we consider two scenarios:

 • Optimistic: ILstage = 0.08 dB, ILcoupling = 1.5 dB. Then ILtot ≈ 1.90 dB (N = 5), 2.30 dB (N = 10), 3.10 dB
 (N = 20), and 3.90 dB (N = 30).
 • Conservative: ILstage = 0.25 dB, ILcoupling = 3.0 dB. Then ILtot ≈ 4.25 dB (N = 5), 5.50 dB (N = 10),
 8.00 dB (N = 20), and 10.5 dB (N = 30).

 In both scenarios, N = 5–10 is manageable for probe-power budgeting, whereas N = 20 and N = 30 require tighter
power budgeting and more amplification margin. Higher ILtot raises the required probe SNR and pushes operation
closer to the detector noise floor, reducing usable dynamic range.
 e. Four-component loss breakdown. The total insertion loss of the cascade has four components:
 1. On-resonance cascade transmission Dmax^N (dominant; see Table V);
 2. Inter-ring coupling loss (N − 1) × (−10 log10 ηcoupling), where ηcoupling is the power transfer efficiency at each
 inter-ring bus section.
Two-ring FDTD yields ηcoupling ≈ 0.9 for the present diagonal-bus geometry, corresponding
 to ∼0.46 dB per inter-ring stage;
 3. Off-resonance propagation loss N × ILstage, where ILstage = 0.08–0.25 dB per ring [12–14, 16];
 4. Fiber-to-chip coupling loss ILcoupling = 1.5–3.0 dB [15].
Table V presents the ideal on-resonance budget (Dmax^N only). Including all four components for the present
diagonal-bus layout: in the FDTD-characterized regime (Dmax = 0.36, N = 5), the total loss is approximately
22.2 + 1.8 + 0.4 + 1.5 ≈ 26 dB; in the fabricated high-Q regime (Dmax = 0.95, N = 30), the total loss is
6.7 + 13.3 + 2.4 + 1.5 ≈ 24 dB. The inter-ring coupling loss dominates in the high-Q regime, underscoring that layout
optimization (e.g., adiabatic tapers or straight-bus coupling) is as important as achieving Dmax ≥ 0.95 through
quality-factor improvement. For an optimized layout with ηcoupling ≥ 0.98 (≤0.09 dB per stage), the N = 30 total loss
would reduce to ∼13 dB.

 S7. ENERGY EFFICIENCY DETAILED DERIVATION

 This section provides the detailed energy-per-operation derivations for both electrical analog exponential circuits
and the photonic MRR cascade, as summarized in the main text (Sec. V).


 S7.1 Electrical analog exponential circuits

 Three main families of electrical circuits realize the exponential function in the analog domain:
 f. BJT translinear / Gilbert cell circuits. The collector current of a bipolar junction transistor is
IC = IS exp(VBE/VT), providing an intrinsic exponential map [17, 18]. A Gilbert cell multiplier (the core building
block of translinear exponential circuits) dissipates 250–325 µW in typical CMOS/BiCMOS implementations [19]. At
a signal bandwidth of B ≈ 100 MHz, the energy per operation is

    E_{\mathrm{Gilbert}} = \frac{P}{B} = \frac{300\ \mathrm{\mu W}}{100\ \mathrm{MHz}} = 3\ \mathrm{pJ}.   (S7.1)

 g. CMOS subthreshold exponential circuits.
A MOSFET in weak inversion exhibits ID ∝ exp(VGS/nVT), enabling
direct exponential computation at ultra-low power [18]. A reconfigurable softmax circuit in 180 nm CMOS implements
a 10-input softmax at VDD = 500 mV with P = 3 µW [20]. Per-channel: Pexp ≈ 0.43 µW. At B ≈ 1 MHz (limited by
subthreshold fT):

    E_{\mathrm{sub\text{-}VT}} = \frac{0.43\ \mathrm{\mu W}}{1\ \mathrm{MHz}} = 0.43\ \mathrm{pJ}.    (S7.2)

This is the most energy-efficient electrical approach, but at severely limited bandwidth (∼1 MHz).
 h. Digital CMOS (for reference). A digital exponential via Taylor series requires ∼10 multiply-add operations.
Using Horowitz’s energy figures [21] for 45 nm at 0.9 V: 32-bit FP multiply costs 3.7 pJ, FP add costs 0.9 pJ, giving

    E_{\mathrm{digital}} \approx 10 \times (3.7\ \mathrm{pJ} + 0.9\ \mathrm{pJ}) = 46\ \mathrm{pJ}.   (S7.3)

At 8-bit precision (sufficient for inference): ∼2.3 pJ.


 S7.2 Photonic MRR cascade: single-channel energy derivation

 We evaluate the energy at N = 30 cascaded X-cut TFLN micro-ring resonators with R = 20 µm in the fabricated
high-Q regime (Qi = 10^6, QL ≈ 25,200; Supplementary Sec. S5), which achieves εmax = 0.30% with Vctrl = 0.91 V
(fully CMOS-compatible). The energy per exponential operation has three components:
 (i) Electro-optic tuning energy. Each ring is tuned by charging the arc electrode capacitance to Vctrl. For the lateral
S–G arc electrodes covering one semicircle (Larc = πR = 62.8 µm), the electrode capacitance is estimated as

    C_{\mathrm{el}} \approx 18\ \mathrm{fF},                                                         (S7.4)

based on coplanar electrode modeling for TFLN lateral S–G geometries with gel = 5 µm (comparable to values reported
by Bahadori et al. [22] for similar geometries). The switching energy per ring at Vctrl = 0.91 V (using the projected
QL = 25,200, which gives bV = 0.295 V^-1):

    E_{\mathrm{ring}} = \tfrac{1}{2} C_{\mathrm{el}} V_{\mathrm{ctrl}}^2
        = \tfrac{1}{2} \times 18\ \mathrm{fF} \times (0.91\ \mathrm{V})^2 = 7.4\ \mathrm{fJ}.        (S7.5)

For N = 30 rings: EEO = 30 × 7.4 fJ = 0.22 pJ.
 Note the important scaling: EEO ∝ 1/N since b ∝ 1/N from minimax optimization, because

    E_{EO} \propto N \times V_{\mathrm{ctrl}}^2 \propto N \times (b / b_V)^2 \propto 1/N.            (S7.6)

The bias voltage (3.9 V) is static and does not contribute per-operation energy.
 (ii) Laser source energy (amortized). Because every cascade channel uses the same fixed probe wavelength, a single
CW laser can be shared among M parallel softmax channels via a 1 × M optical power splitter. With wall-plug
efficiency ηWPE ≈ 15% [23], the per-channel optical power is Popt = Pin/M ≈ 100 µW (for Pin = 1 mW, M = 10),
requiring Plaser ≈ 667 µW per channel. At fmod = 10 GHz: Elaser = 667 µW / 10 GHz = 67 fJ.
 (iii) Photodetector energy. Integrated SiGe photodetectors with TIA achieve sub-pJ reception [24]: EPD ≈ 0.5 pJ.
 The total single-channel energy is

    E_{\mathrm{photonic}}^{(1\mathrm{ch})} = E_{EO} + E_{\mathrm{laser}} + E_{PD}
        = 0.22 + 0.07 + 0.50 = 0.79\ \mathrm{pJ}.                                                    (S7.7)

 S7.3 Q-factor scaling of energy efficiency

 Since Vctrl ∝ 1/Q and EEO ∝ Vctrl^2, the EO energy scales as 1/Q^2. Table S11 shows the total energy for N = 30 at
various quality factors.

TABLE S11: Energy per exponential operation vs. quality factor (N = 30, εmax = 0.30%, X-cut arc electrode with bV
 scaling linearly with Q; Cel = 18 fF). Elaser + EPD = 0.57 pJ is the Q-independent floor. The dagger (†) marks the
FDTD-calibrated quality factor; the double dagger (‡) marks the high-Q design point (Qi = 10^6). Excludes thermal
 stabilization (0.15–0.60 pJ for N = 30).

 Q           Vctrl (V)    Vbias (V)    EEO (pJ)    Etotal (pJ)
  5,000      4.57         19.5         5.64        6.21
 10,000      2.28         9.7          1.40        1.97
 12,500      1.83         7.8          0.90        1.47
 15,500†     1.47         6.3          0.58        1.15
 20,000      1.14         4.9          0.35        0.92
 25,200‡     0.91         3.9          0.22        0.79
 30,000      0.76         3.2          0.16        0.73
 50,000      0.46         1.9          0.06        0.63


 At QL = 15,500 (FDTD-calibrated), the EO contribution (0.58 pJ) is comparable to the optical floor, placing the
design in the efficient operating regime. Beyond Q ≈ 30,000, the EO contribution becomes negligible and the total
energy saturates near the floor; further Q improvement primarily benefits CMOS driver voltage compatibility rather
than energy.
 i.
Additional energy contributions. The estimates above exclude two further contributions: (i) DAC energy
for setting the per-ring control voltages, typically 0.1–1 pJ per conversion at 10 GHz bandwidth; and (ii) thermal
stabilization power for maintaining resonance alignment, estimated at ∼50–200 µW per ring for TFLN (lower than
silicon due to the small thermo-optic coefficient of LiNbO3, dn/dT ≈ 3.9 × 10^-6 K^-1). At 10 GHz modulation rate,
the thermal contribution amounts to ∼0.005–0.02 pJ per ring per operation. For the N = 30 cascade, this sums to
0.15–0.60 pJ, which is comparable to EEO and must be included in the total: Etotal ≈ 0.94–1.39 pJ. The total energy
comparison should therefore be treated as an order-of-magnitude estimate.


 S7.4 Comparison with electronic implementations

 Here we provide an order-of-magnitude energy comparison between electrical analog exponential circuits and our
photonic MRR cascade, grounding the analysis in published device data and first-principles estimates. We assume
a shared CW laser with total optical output Pin,tot = 1 mW, split across M = 10 parallel softmax channels via a
1 × M power splitter, yielding per-channel input Pin,ch = 100 µW. The output power at the cascade drop port is
Pout = Pin,ch × Dmax^N, which ranges from 0.61 µW (FDTD regime, N = 5) to 21.5 µW (fabricated regime, N = 30)
(Table V).
 j. Electrical analog exponential circuits. Three main families of electrical analog exponential circuits are compared:
BJT translinear/Gilbert cell (∼3 pJ at 100 MHz [17–19]), CMOS subthreshold (∼0.43 pJ at 1 MHz [18, 20]), and
digital FP32 Taylor series (∼46 pJ at 1 GHz [21]).
 k. Photonic MRR cascade: single-channel energy.
For N = 30 X-cut TFLN micro-ring resonators in the self-consistent
+high-Q regime (QL = 25,200), the three energy components are EO tuning (EEO = 0.22 pJ), amortized
+laser (Elaser = 0.07 pJ, shared across M = 10 parallel channels), and photodetector (EPD = 0.50 pJ), yielding
+Ephotonic = 0.79 pJ. Including thermal stabilization for N = 30 rings (0.15–0.60 pJ), the total rises to 0.94–1.39 pJ.
+Notably, EEO ∝ 1/N since b ∝ 1/N from minimax optimization.
 l. Single-channel comparison. Table S12 presents the comparison. The photonic cascade at N = 30 achieves a
+0.79 pJ baseline, 3.8× lower than the BJT Gilbert cell (3 pJ) and 58× lower than digital FP32 (46 pJ). Including
+thermal stabilization (0.94–1.39 pJ), the advantage over INT8 (2.3 pJ) is 1.7–2.4×, while operating at 10 GHz
+bandwidth. At fabricated Q ≥ 30,000, EEO drops to 0.16 pJ or below and Etotal to about 0.73 pJ or below (excluding
+thermal; Table S11), recovering at least a 3.2× advantage over INT8. Subthreshold CMOS achieves the lowest energy
+(0.43 pJ) but at 10,000× lower bandwidth.

 TABLE S12: Energy per exponential operation: single-channel comparison.

 Implementation            E/op (pJ)    Bandwidth   Notes
 Digital FP32 (Taylor)     ∼46          1 GHz       10 FP MACs
 BJT Gilbert cell          ∼3           100 MHz     Analog
 Digital INT8 (Taylor)     ∼2.3         1 GHz       10 INT MACs
 Photonic MRR (N = 30)     0.94–1.39    10 GHz      Analog†
 Subthreshold CMOS         ∼0.43        1 MHz       Analog
 † 0.79 pJ excluding thermal; 0.94–1.39 pJ including thermal. Self-consistent with fabricated high-Q regime (QL = 25,200); see
 Supplementary Sec. S7.

 m. Caveats. These values are order-of-magnitude estimates, not device-accurate predictions. The baseline photonic
+estimate (0.79 pJ) excludes DAC energy for voltage generation (typically 0.1–1 pJ per conversion at 10 GHz bandwidth, shared
+with any analog approach) and thermal tuning power for maintaining resonance alignment (∼50–200 µW per ring for
+TFLN, lower than silicon due to the small thermo-optic coefficient of LiNbO3 , dn/dT ≈ 3.9 × 10^−6 K^−1 ). 
Effective
+precision at the photodetector is limited to ∼6–8 bits by shot noise and receiver electronics. The energy advantage
+over electrical implementations is strongest in the fabricated high-Q regime (Dmax ≥ 0.95), where N = 30 is practical
+and Vctrl remains CMOS-compatible.

 S8. MONTE CARLO ROBUSTNESS UNDER DEVICE NON-IDEALITIES

 This section describes the robustness model summarized in the main text. For the fitted L = 8, N = 10 design
+(a = −1.4588, b = 0.10202), each Monte Carlo chip sample includes: (i) per-ring static detuning spread, (ii) per-ring
+sensitivity spread, (iii) global thermal drift and crosstalk-like slope drift, (iv) stage insertion-loss variation, (v)
+control-channel noise, and (vi) detector noise with one-point calibration at I = L.
 For ring k, we use

   T_k(I) = \frac{1}{1 + \left(a_k + b_k I + d_{\mathrm{th}} + d_{\mathrm{xt}}\, I/L\right)^2},   (46)

+with

   y(I) = \prod_{k=1}^{N} T_k(I) \times 10^{-\mathrm{IL}_{\mathrm{tot}}/10},   (47)

+and one-point calibration ỹ(I) = Ccal y(I) such that ỹ(L) = 1 for the same chip instance.

 TABLE S13: Non-ideality distributions used in the Monte Carlo sweeps.

 Parameter                 Nominal         Stress
 σa                        0.020           0.032
 σb,rel                    0.020           0.032
 σth                       0.015           0.025
 σxt                       0.012           0.020
 σI                        0.004           0.007
 ILstage (dB, µ ± σ)       0.12 ± 0.03     0.18 ± 0.05
 σdet                      3.0 × 10^−6     6.0 × 10^−6


 TABLE S14: Monte Carlo summary (same run reported in main text).

 Metric                        Nominal          Stress
 Median KL(pref ∥ papprox)     2.17 × 10^−4     7.39 × 10^−4
 p95 KL(pref ∥ papprox)        5.92 × 10^−4     2.21 × 10^−3
 Median max |∆p|               0.170%           0.193%
 p95 max |∆p|                  0.319%           0.419%

+Conservative-bound sketch used for the main-text screening equation. For the identical-detuning family
+with fixed b, define

   \ln \tilde{y}(I) = N\,\phi(a + bI) - N\,\phi(a + bL), \qquad \phi(u) = -\ln(1 + u^2),   (48)

+so that ỹ(L) = 1 by construction. 
Around a constructive choice with a + bI < 0 on [0, L], a second-order remainder
+argument for the mismatch between the target slope and the fitted slope yields a term scaling as L^2 /(4N ), while the
+flank-curvature penalty contributes a term scaling as 1/(2b^2 N ). Combining the two contributions gives the screening
+inequality

   E_{\infty} \lesssim \frac{L^2}{4N} + \frac{1}{2 b^2 N},   (49)

+which leads to the conservative screening equation reported in the main manuscript. We emphasize that this is a
+conservative heuristic design rule (not a formal minimax theorem), used only for preliminary depth screening.

+FIG. S4: CDF of end-to-end softmax probability error under the same non-ideality samples.

 S9. DELAY-AWARE FEEDBACK NORMALIZATION VALIDATION

 We model global normalization as a delayed PI-controlled loop:

   S(t) = G(t)\,P(t) + n(t),   (50)
   \tau \frac{dP}{dt} = -P(t) + u(t - T_d),   (51)
   u(t) = K_p\, e(t) + K_i \int e(t)\, dt, \qquad e(t) = S_{\mathrm{ref}} - S(t),   (52)

+with actuator saturation 0 ≤ u ≤ Pmax . A piecewise G(t) profile is used to emulate workload changes. For physical
+intuition, Table S15 converts normalized delay/settling metrics into absolute-time examples.

+TABLE S15: Example absolute-time interpretation of normalized PI-loop metrics using one representative stable case
 ((Kp , Ki , Td /τ ) = (0.55, 0.8, 0.2)) and a ±2% settling-time definition (Tsettle ∼ 12.4τ ).

 Assumed τ     Delay Td = 0.2τ     Settling ∼ 12.4τ     Interpretation
 100 ns        20 ns               1.24 µs              fast loop
 1 µs          200 ns              12.4 µs              moderate loop
 5 µs          1 µs                62 µs                slower loop

+Reference-backed latency context for bottleneck screening. To place the delayed-loop times against mixed-signal
+system latencies, Table S16 summarizes representative time scales with explicit path classes (on-chip vs off-chip)
+for memory and interconnect paths, alongside conversion latency ranges. These are intentionally order-of-magnitude
+ranges (not fixed constants), and can shift with architecture, clocking, and protocol stack choices. 
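For readers who want to experiment with the delayed PI loop of Eqs. (50)–(52), the following is a minimal forward-Euler sketch in Python. It is not the authors' validation script; the function names simulate_pi_loop and settling_time are hypothetical. It assumes that time is measured in units of τ, that Ki is expressed in units of 1/τ, that G is constant, that n(t) = 0, and that Pmax = 2. Because the G(t) profile, noise level, and saturation limit behind Table S18 are not specified in this section, the sketch should not be expected to reproduce the tabulated overshoot or settling values.

```python
"""Forward-Euler sketch of the delayed PI normalization loop, Eqs. (50)-(52).

Assumptions (not from the paper): time is in units of tau, Ki is in units of
1/tau, G is constant, the noise n(t) is zero, and P_max = 2.
"""


def simulate_pi_loop(kp, ki, delay, s_ref=1.0, gain=1.0, p_max=2.0,
                     t_end=40.0, dt=0.005):
    """Return (times, S) for a step in S_ref applied at t = 0, with tau = 1."""
    n_steps = int(round(t_end / dt))
    d_steps = int(round(delay / dt))      # transport delay Td in samples
    u_hist = []                           # past actuator commands u(t)
    p = 0.0                               # loop state P(t)
    integ = 0.0                           # running integral of e(t)
    times, s_out = [], []
    for k in range(n_steps):
        s = gain * p                      # Eq. (50) with n(t) = 0
        e = s_ref - s
        integ += e * dt
        u = kp * e + ki * integ           # Eq. (52)
        u = min(max(u, 0.0), p_max)       # actuator saturation 0 <= u <= P_max
        u_hist.append(u)
        # Before Td has elapsed, no command has reached the plant yet.
        u_delayed = u_hist[k - d_steps] if k >= d_steps else 0.0
        p += dt * (-p + u_delayed)        # Eq. (51) with tau = 1
        times.append(k * dt)
        s_out.append(s)
    return times, s_out


def settling_time(times, s_out, s_ref=1.0, band=0.02):
    """First time after which S stays within +/-band of S_ref; None if never."""
    last_outside = -1
    for i, s in enumerate(s_out):
        if abs(s - s_ref) > band * s_ref:
            last_outside = i
    if last_outside == len(s_out) - 1:
        return None
    return times[last_outside + 1]


# Representative gains and delay ratio quoted for the stable case.
t, s = simulate_pi_loop(kp=0.55, ki=0.8, delay=0.2)
print("settling time (units of tau):", settling_time(t, s))
```

The returned settling time is in units of τ, so multiplying it by an assumed τ gives an absolute time in the same way Table S15 does. Simple clamping without anti-windup is used here for brevity; a design study would likely need to model the integrator behavior under saturation explicitly.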

 TABLE S16: Representative subsystem latency ranges used for conservative bottleneck screening in Sec. S9.

 Subsystem path                  Tsys           Sources
 On-chip memory (L1/L2)          20–200 ns      [25]
 Off-chip memory (DRAM)          200–700 ns     [25, 26]
 ADC conversion                  10–710 ns      [27, 28]
 DAC + driver/settling           1–200 ns       [29]
 On-chip interconnect (NoC)      5–100 ns       [30]
 Off-chip I/O (PCIe/CXL)         1–10 µs        [25, 31]

+Conservative risk-screening heuristic for loop latency. As a screening heuristic, we use the settling time from
+one representative stable case ((Kp , Ki , Td /τ ) = (0.55, 0.8, 0.2); Table S18), with settling defined as the first time
+entering and remaining within a ±2% band around Sref , as a normalization-loop latency proxy:

   T_{\mathrm{norm}} \approx 12.4\,\tau.   (53)

+This value is not a universal bound; different gain settings, delay ratios, or loop architectures will yield different settling
+times. It is used only as a reference point for order-of-magnitude risk screening. Define the conservative screening
+criterion

   T_{\mathrm{norm}} \ge \beta\, T_{\mathrm{sys}},   (54)

+with β = 1 (high-risk screening line) and β = 0.5 (early-warning line); this is a heuristic risk indicator, not a formal
+dominance proof. Substituting Eq. (53) into Eq. (54) at equality gives the corresponding threshold

   \tau_{\mathrm{crit}}(\beta) = \frac{\beta\, T_{\mathrm{sys}}}{12.4}.   (55)

+Table S17 gives the resulting numeric ranges.
+For the explicit examples in Table S15, τ = 0.1 µs gives Tnorm ≈ 1.24 µs, τ = 1 µs gives Tnorm ≈ 12.4 µs, and τ = 5 µs
+gives Tnorm ≈ 62 µs. These numbers indicate a risk trend (not a hard boundary): for this representative case, the
+normalization loop is typically non-dominant when τ is well below the relevant τcrit band, and it may become dominant

 TABLE S17: Computed τcrit ranges from Eq. (55) using Table S16. 

 Subsystem                       Tsys range     τcrit (β = 0.5)     τcrit (β = 1)
 On-chip memory path             20–200 ns      0.81–8.06 ns        1.61–16.13 ns
 Off-chip memory path            200–700 ns     8.06–28.23 ns       16.13–56.45 ns
 ADC conversion                  10–710 ns      0.40–28.63 ns       0.81–57.26 ns
 DAC+driver/settling             1–200 ns       0.04–8.06 ns        0.08–16.13 ns
 On-chip interconnect (NoC)      5–100 ns       0.20–4.03 ns        0.40–8.06 ns
 Off-chip I/O fabric             1–10 µs        0.04–0.40 µs        0.08–0.81 µs


+as τ approaches or exceeds that band. The transition depends on path class (on-chip vs off-chip) and on
+architecture-specific timing closure, including whether the normalization path lies on the end-to-end critical path (Table S16).
+Accordingly, this analysis is intended for preliminary risk screening only; concrete implementations
+require full timing validation.

+TABLE S18: Representative step-response cases for the delayed PI loop (settling defined by a ±2% band around Sref ).

 Case         (Kp , Ki , Td /τ )       Overshoot     Settling        Stable
 Stable       (0.55, 0.8, 0.2)         25.6%         ∼ 12.4τ         Yes
 Marginal     (0.95, 1.6, 0.45)        25.6%         ∼ 12.8τ         Yes
 Unstable     (1.2, 2.2, 0.75)         45.1%         not settled     No


 TABLE S19: Stable-region fraction from gain-map scans at each delay ratio.

 Td /τ     Stable fraction
 0.0       88.1%
 0.2       88.0%
 0.5       72.4%
 0.8       47.5%

+FIG. S5: Step-response examples of the delayed PI normalization loop.

+FIG. S6: Delay-dependent stability maps over scanned (Kp , Ki ) ranges.

 S10. REPRODUCIBILITY

 Scripts used for this Supplementary validation:
 • scripts/nonideality montecarlo.py

 • scripts/feedback loop validation.py

 • scripts/extract logit range effective.py

 • scripts/analyze softmax clipping validity.py
+Public code repository: https://github.com/hyoseokp/MRR-AEF (commit 585e695). Empirical extraction outputs
+are stored under:
 • paper/empirical L v3/


 [1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia
 Polosukhin. 
Attention is all you need. In Advances in Neural Information Processing Systems 30 (NeurIPS 2017), pages
 5998–6008, 2017.
 [2] Alec Radford et al. Language models are unsupervised multitask learners. Technical report, OpenAI, 2019.
 [3] Hugging Face. distilgpt2 model card, 2025. Accessed 2026-02-21.
 [4] Andrej Karpathy. Tiny shakespeare dataset (char-rnn), 2025. Accessed 2026-02-21.
 [5] Jane Austen. Pride and prejudice. Project Gutenberg eBook No. 1342, 2025. Accessed 2026-02-21.
 [6] Di Zhu et al. Integrated photonics on thin-film lithium niobate. Advances in Optics and Photonics, 13(2):242–352, 2021.
 [7] Yaowen Hu, Di Zhu, Shengyuan Lu, Xinrui Zhu, Yunxiang Song, Dylan Renaud, Daniel Assumpcao, Rebecca Cheng,
 CJ Xin, Matthew Yeh, Hana Warner, Xiangwen Guo, Amirhassan Shams-Ansari, David Barton, Neil Sinclair, and Marko
 Lončar. Integrated electro-optics on thin-film lithium niobate. Nature Reviews Physics, 2025.
 [8] Mian Zhang, Cheng Wang, Rebecca Cheng, Amirhassan Shams-Ansari, and Marko Lončar. Monolithic ultra-high-Q lithium
 niobate microring resonator. Optica, 4(12):1536–1537, 2017.
 [9] Renhong Gao, Ni Yao, Jianglin Guan, Li Deng, Jintian Lin, Min Wang, Lingling Qiao, Wei Fang, and Ya Cheng. Lithium
 niobate microring with ultra-high Q factor above 10^8 . Chin. Opt. Lett., 20(1):011902, 2022.
+[10] Rongjin Zhuang, Jinze He, Yifan Qi, and Yang Li. High-Q thin-film lithium niobate microrings fabricated with wet etching.
 Adv. Mater., 35(3):2208113, 2023.
+[11] Xinrui Zhu, Yaowen Hu, Shengyuan Lu, Hana K. Warner, Xudong Li, Yunxiang Song, Letícia S. Magalhães, Amirhassan
 Shams-Ansari, Neil Sinclair, and Marko Lončar. Twenty-nine million intrinsic Q-factor monolithic microresonators on
 thin-film lithium niobate. Photon. Res., 12(8):A63–A68, 2024.
+[12] Sudip Shekhar, Wim Bogaerts, Lukas Chrostowski, John E. Bowers, Michael Hochberg, Richard Soref, and Bhavin J.
 Shastri. Roadmapping the next generation of silicon photonics. 
Nature Communications, 15:751, 2024.
+[13] Xaveer Leijtens et al. Multimode silicon photonics. Nanophotonics, 7:1571–1580, 2018.
+[14] Haoqian Li et al. In-memory photonic dot-product engine with electrically programmable weight banks. Nature
 Communications, 14:2389, 2023.
+[15] Daan Vermeulen et al. High-efficiency fiber-to-chip grating couplers realized using an advanced CMOS-compatible
 silicon-on-insulator platform. Optics Express, 18(17):18278–18283, 2010.
+[16] F. S. Tan, D. J. W. Klunder, H. F. Bulthuis, G. Sengo, H. J. W. M. Hoekstra, and A. Driessen. Direct measurement of
 the on-chip insertion loss of high finesse microring resonators in Si3N4-SiO2 technology. In Proceedings of the IEEE LEOS
 Benelux Chapter, 2001.
+[17] B. Gilbert. Translinear circuits: a proposed classification. Electron. Lett., 11(1):14–16, 1975.
+[18] C. Mead. Analog VLSI and Neural Systems. Addison-Wesley, 1989.
+[19] B. Razavi. Design of Analog CMOS Integrated Circuits. McGraw-Hill, 2nd edition, 2017.
+[20] Massimo Vatalaro, Tatiana Moposita, Sebastiano Strangio, Lionel Trojman, Andrei Vladimirescu, Marco Lanuzza, and
 Felice Crupi. A low-voltage, low-power reconfigurable current-mode softmax circuit for analog neural networks. Electronics,
 10(9):1004, 2021.
+[21] M. Horowitz. 1.1 computing’s energy problem (and what we can do about it). In 2014 IEEE International Solid-State
 Circuits Conference (ISSCC), pages 10–14, 2014.
+[22] Meisam Bahadori, Yansong Yang, Ahmed E. Hassanien, Lynford L. Goddard, and Songbin Gong. Ultra-efficient and fully
 isotropic monolithic microring modulators in a thin-film lithium niobate photonics platform. Optics Express, 28(20):29644–
 29661, 2020.
+[23] A. Biberman and K. Bergman. Optical interconnection networks for high-performance computing systems. Rep. Prog.
 Phys., 75(4):046402, 2012.
+[24] D. A. B. Miller. Attojoule optoelectronics for low-energy information processing and communications. J. 
Lightwave Technol.,
 35(3):346–396, 2017.
+[25] Zhe Jia, Marco Maggioni, Benjamin Staiger, and Daniele P. Scarpazza. Dissecting the NVIDIA Volta GPU architecture via
 microbenchmarking. arXiv preprint arXiv:1804.06826, 2018.
+[26] Yoongu Kim, Vivek Seshadri, Donghyuk Lee, Jamie Liu, and Onur Mutlu. A case for exploiting subarray-level parallelism
 (SALP) in DRAM. In Proceedings of the 39th Annual International Symposium on Computer Architecture (ISCA), pages
 368–379, 2012.
+[27] Texas Instruments. ADC12DJ3200: 6.4-GSPS single-channel or 3.2-GSPS dual-channel, 12-bit, RF-sampling analog-to-digital
 converter. Datasheet (SLVSD97A, revised April 2020), 2020. Accessed 2026-02-22.
+[28] Texas Instruments. ADS8881: 18-bit, 1-MSPS, low-power, true-differential SAR ADC. Datasheet (SBAS547D, revised
 August 2015), 2015. Accessed 2026-02-22.
+[29] Texas Instruments. DAC38RF82/DAC38RF89: Dual-channel, 14-bit, 9-GSPS and 6-GSPS RF DACs. Datasheet
 (SLASEA6D, revised June 2020), 2020. Accessed 2026-02-22.
+[30] W. J. Dally and B. Towles. Route packets, not wires: on-chip interconnection networks. In Proceedings of the 38th Design
 Automation Conference (DAC), pages 684–689, 2001.
+[31] Shintaro Sano, Yosuke Bando, Kazuhiro Hiwada, Hirotsugu Kajihara, Tomoya Suzuki, Yu Nakanishi, Daisuke Taki, and
 Akiyuki Kaneko. GPU graph processing on CXL-based microsecond-latency external memory. In Proceedings of the SC ’23
 Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2023.
