From bd9333eda60a9029a198acaeacb1eca4312bd1e8 Mon Sep 17 00:00:00 2001
From: YurenHao0426
Date: Mon, 4 May 2026 23:05:16 -0500
Subject: =?UTF-8?q?Initial=20release:=20GRAFT=20(KAFT)=20=E2=80=94=20NeurI?=
 =?UTF-8?q?PS=202026=20submission=20code?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Topology-factorized Jacobian-aligned feedback for deep GNNs. Includes:
- src/: GraphGrAPETrainer (KAFT) + BP / DFA / DFA-GNN / VanillaGrAPE baselines
  + multi-probe alignment estimator + dataset / sparse-mm utilities.
- experiments/: 19 runners reproducing every figure / table in the paper.
- figures/: 4 generators + the 4 PDFs cited in the report.
- paper/: NeurIPS .tex and consolidated experiments_master notes.

Smoke test: 50-epoch Cora GCN L=4 gives BP 77.3% / KAFT 79.0%.
---
 paper/neurips_v4_main.tex | 808 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 808 insertions(+)
 create mode 100644 paper/neurips_v4_main.tex

diff --git a/paper/neurips_v4_main.tex b/paper/neurips_v4_main.tex
new file mode 100644
index 0000000..dc991c9
--- /dev/null
+++ b/paper/neurips_v4_main.tex
@@ -0,0 +1,808 @@
+\documentclass{article}
+
+\PassOptionsToPackage{numbers, compress}{natbib}
+\usepackage[preprint]{neurips_2024}
+
+\usepackage{amsmath,amssymb,amsthm}
+\usepackage{booktabs}
+\usepackage{multirow}
+\usepackage{xcolor}
+\usepackage{hyperref}
+\usepackage{enumitem}
+\usepackage{caption}
+\usepackage{longtable}
+\usepackage{float}
+\usepackage{graphicx}
+\usepackage{tabularx}
+\usepackage{array}
+
+% Color definitions
+\definecolor{pos}{RGB}{0,120,60}
+\definecolor{neg}{RGB}{180,0,0}
+\definecolor{bestgreen}{RGB}{0,130,60}
+\newcommand{\pos}[1]{\textcolor{pos}{\textbf{+#1}}}
+\newcommand{\nega}[1]{\textcolor{neg}{#1}}
+% \best wraps math content in bold green. Uses boldmath so $77.3$ becomes bold.
+\newcommand{\best}[1]{\textcolor{bestgreen}{\boldmath\textbf{#1}}}
+\newcommand{\Ahat}{\hat{A}}
+
+% tabularx column types
+\newcolumntype{C}{>{\centering\arraybackslash}X}
+\newcolumntype{R}{>{\raggedleft\arraybackslash}X}
+\newcolumntype{L}{>{\raggedright\arraybackslash}X}
+
+\newtheorem{theorem}{Theorem}
+\newtheorem{corollary}{Corollary}
+\newtheorem{remark}{Remark}
+
+\title{GRAFT: Topology-Factorized Jacobian Alignment\\for Backprop-Free Deep Graph Learning}
+
+\author{%
+  Anonymous \\
+  Preprint
+}
+
+\begin{document}
+\maketitle
+
+\begin{abstract}
+Deep graph neural networks fail not because of forward representational collapse but because the backward pass cannot transport supervision through repeated graph propagation: in a 10-layer GCN, BP gradients underflow to numerical zero at all hidden layers while forward representations remain discriminable. We identify this as a \emph{gradient transport bottleneck} and address it by replacing BP's sequential backward rule with a topology-factorized feedback operator. Our method, GRAFT (Graph-Aligned Feedback Training), exploits the Kronecker structure $W_{\ell+1}^\top \otimes \hat{A}$ of the GNN Jacobian to factor the feedback into a learned feature-side matrix $R_\ell$ (multi-probe Jacobian alignment) and a fixed graph-structured operator $P_\ell(\hat{A})$. GRAFT preserves the standard message-passing forward pass and computes all hidden-layer feedback in $O(1)$ parallel depth on GPUs.
+On Cora, CiteSeer, PubMed, and DBLP, GRAFT improves over backpropagation in 86 of 96 paired comparisons (up to $+11.5\%$; 72 of the 86 wins are significant after Benjamini--Hochberg correction at $q{=}0.05$), and per-layer cosine alignment with the true BP gradient stays $>0.3$ throughout training. A wrong-topology causal control reveals that backward graph structure is necessary in a strong sense: rewired and Erd\H{o}s--R\'enyi backward graphs collapse training by 35--45\% (node-permuted graphs by up to 45\%), while removing topology entirely ($P{=}I$) is comparatively benign. Because GRAFT modifies only the backward rule, it stacks additively with forward-side anti-over-smoothing methods (ResGCN, JKNet); the combination achieves the best DBLP accuracy in our study. An optimized implementation runs $1.6$--$1.8\times$ faster than BP on sparse graphs at equivalent accuracy. GRAFT shows that for deep graph learning, the right backward rule is one that respects the graph.
+\end{abstract}
+
+%% ============================================================
+%% 1. Introduction
+%% ============================================================
+\section{Introduction}\label{sec:intro}
+
+Deep graph neural networks hit a wall. Adding layers beyond $L\!\approx\!4$ on standard message-passing architectures typically hurts accuracy and increases variance. The dominant explanation has been \emph{over-smoothing}~\cite{oversmoothing}: as depth grows, repeated graph propagation drives node representations toward a low-dimensional attractor determined by the graph Laplacian. Recent work~\cite{arroyo2025vanishing,oversmoothing_fallacy,roth2024rank} has begun to question this narrative, arguing that vanishing gradients are the operative mechanism, not forward representation collapse.
+
+We provide direct empirical evidence that the failure mode is a backward transport problem: in a 10-layer GCN trained on Cora, BP gradients underflow to numerical zero in \emph{every} hidden layer (20 seeds, no exceptions), even though the forward representations remain numerically discriminable. The supervision signal cannot reach the early layers because the backward chain through repeated graph propagation has spectral radius bounded by 1 and the layerwise transformation does not compensate. We call this the \emph{gradient transport bottleneck}, and it is the lens through which we approach deep GNN training in this paper.
+
+\paragraph{The Kronecker observation.}
+A single GCN layer's pre-activation Jacobian factors cleanly into a feature-side weight transpose and a topology-side adjacency operator. This split is structural to message passing and motivates a feedback rule with the same two factors, rather than the unstructured dense feedback used in prior FA methods. We make the factorization precise in Theorem~\ref{thm:jacobian} (Section~\ref{sec:method}).
+
+\paragraph{Our method: GRAFT.}
+We propose GRAFT (Graph-Aligned Feedback Training), a backward-only training rule that replaces BP's sequential transport with a factorized feedback operator
+\[
+\delta_\ell \;=\; \sigma'(Z_\ell)\;\odot\;\bigl[P_\ell(\hat{A})\,\bar{E}\,R_\ell\bigr],
+\]
+where $R_\ell\!\in\!\mathbb{R}^{C\times d}$ is a \emph{learned} feature-side matrix that approximates the chain $W_{L-1}^\top \cdots W_{\ell+1}^\top$ via multi-probe Jacobian alignment, and $P_\ell(\hat{A}) = \hat{A}^{\min(L-1-\ell,K)}$ is a \emph{fixed} graph-structured operator. GRAFT leaves the standard forward computation unchanged and computes all hidden-layer feedback in $O(1)$ parallel depth on GPUs.
+
+\paragraph{Causal evidence that the graph matters.}
+A natural skepticism is that GRAFT's accuracy gains come from any reasonable feedback operator and the explicit graph factor is decorative. We rule this out with a wrong-topology causal control: keeping the forward pass on the true graph, we replace the backward graph with three alternatives---a degree-matched random rewiring, a node permutation $\Pi\hat{A}\Pi^\top$, and an Erd\H{o}s--R\'enyi graph of matched density. On Cora, GRAFT collapses from $77.2\%$ to roughly $32\%$ under any of these wrong topologies; the rewired and Erd\H{o}s--R\'enyi controls produce similarly large drops on CiteSeer and DBLP, and permutation remains strongly harmful on Cora and CiteSeer (Section~\ref{sec:wrong-topology}). Removing the graph entirely ($P\!=\!I$), by contrast, has no significant effect. Backward graph structure is therefore necessary in a strong sense, but only when it is consistent with the forward pass: contradictory topology actively corrupts feedback, while no topology is benign.
+
+\paragraph{Contributions.}
+\begin{itemize}[nosep]
+  \item \textbf{Diagnosis} (\S\ref{sec:gradient-reach}). We empirically demonstrate that BP gradient norms in deep GCNs vanish to numerical zero at $L\!=\!10$ (20 seeds, all hidden layers), reframing the deep GNN problem as a backward gradient transport failure rather than a forward representation collapse.
+  \item \textbf{Method} (\S\ref{sec:method}). We propose GRAFT, the first training rule for GNNs that learns a Jacobian-aligned feedback operator while explicitly factoring the graph and feature-side transport components. Theorem~\ref{thm:jacobian} establishes the exact factorization in the linear case.
+  \item \textbf{Causal validation} (\S\ref{sec:wrong-topology}). Wrong-topology controls show that contradictory backward graphs degrade GRAFT by as much as 45\%, while the absence of topology is benign---the first such test of graph-aware feedback alignment.
+  \item \textbf{Empirical results} (\S\ref{sec:experiments}, \S\ref{sec:efficient}). On Cora, CiteSeer, PubMed, and DBLP across four backbones and two depths, GRAFT improves over BP in 86 of 96 paired comparisons, with 72 of the 86 wins significant after Benjamini--Hochberg correction at $q\!=\!0.05$. Because GRAFT modifies only the backward rule, it stacks additively with forward-side methods (ResGCN, JKNet), and an optimized implementation runs $1.6$--$1.8\times$ faster than BP on sparse graphs.
+\end{itemize}
+
+%% ============================================================
+%% 2. Related Work
+%% ============================================================
+\section{Related Work}\label{sec:related}
+
+The standard account of why deep GNNs fail emphasizes \emph{oversmoothing}: repeated graph propagation drives node embeddings toward a low-dimensional subspace, motivating a large literature on forward representation collapse~\cite{oversmoothing}. Recent work has sharpened this picture into a more optimization-centered diagnosis. Arroyo et al.~\cite{arroyo2025vanishing} argue that vanishing gradients are the primary pathology in deep GNNs and propose a state-space reformulation that stabilizes training by modifying the forward recurrence, while the ``oversmoothing fallacy'' line~\cite{oversmoothing_fallacy} similarly argues that many apparent oversmoothing symptoms are better explained by transformation- and activation-induced gradient decay.
Roth and Liebig~\cite{roth2024rank} recast deep-GNN degradation through rank collapse and related Kronecker-style structure, and Deidda et al.~\cite{deidda2025rank} further argue that rank-based measures capture this degradation better than classical energy-based proxies. GRAFT is closest to these works in diagnosis, but differs in what it does with that diagnosis: rather than reinterpreting deep-GNN failure or redesigning the forward dynamics, we treat vanishing supervision as a \emph{backward transport bottleneck} and replace the backward rule itself while leaving the standard message-passing forward pass unchanged. + +A complementary line of work keeps backpropagation but modifies the \emph{forward} architecture to make deep graph learning more stable. Residual and identity-preserving designs such as GCNII~\cite{gcnii}, normalization methods such as PairNorm~\cite{pairnorm}, stochastic edge dropping via DropEdge~\cite{dropedge}, and multi-scale aggregation via JKNet~\cite{jknet} all improve depth robustness by altering the forward computation or feature dynamics. More recent model families push this logic further: GrassNet~\cite{zhao2024grassnet} builds graph filters from state-space ideas, and MP-SSM~\cite{ceni2025mpssm} embeds modern state-space computation directly inside message passing. These methods are not alternatives to GRAFT in mechanism: they stabilize training by changing the forward operator, whereas GRAFT changes only backward transport. This distinction matters empirically and conceptually: because GRAFT leaves the forward graph computation intact, it is naturally compositional with forward-side remedies (Section~\ref{sec:stackability}), while methods in this paragraph should be understood as orthogonal cures to the same depth pathology rather than as direct substitutes for graph-aligned feedback. + +Replacing BP itself in graph models has so far followed two main paths. DFA-GNN~\cite{dfagnn} adapts direct feedback alignment~\cite{dfa,fa} to graphs using fixed random feedback together with topology-driven pseudo-error generation; it is the closest prior graph-specific feedback-alignment baseline, but it does not learn a Jacobian-aligned feedback operator and treats graph structure primarily as an auxiliary error-spreading mechanism. ForwardGNN~\cite{forwardgnn} removes BP through layer-local forward objectives, showing that graph models can be trained without global backpropagation, but at the cost of moving to a fundamentally local learning rule rather than approximating the global gradient direction. Outside the graph setting, recent learned-feedback methods make the comparison sharper: Caillon et al.~\cite{caillon2025bpfree} show that forward-gradient information can align feedback with true gradients, and GrAPE~\cite{caillon2025grape} scales this idea with JVP-based Jacobian estimates and explicit alignment objectives. GRAFT is best viewed as the graph-specific continuation of this line, but with one additional structural claim: in GNNs, useful feedback must respect the factorization of backward transport into a topology-side operator and a feature-side operator. Our learned matrices $R_\ell$ inherit the Jacobian-alignment spirit of~\cite{caillon2025bpfree,caillon2025grape}, but the fixed operator $P_\ell(\hat{A})$ and our wrong-topology intervention target a graph-specific question absent from prior FA work, namely whether the \emph{backward graph itself} must be consistent with the forward graph for non-BP training to succeed. 
+
+%% ============================================================
+%% 3. Method
+%% ============================================================
+\section{Method: GRAFT}\label{sec:method}
+
+\subsection{Setup and Notation}
+
+We index hidden layers from input to output: $\ell{=}0$ is the first hidden layer and $\ell{=}L{-}2$ is the last hidden layer before the classifier. Let $X\!\in\!\mathbb{R}^{N\times F}$ denote input node features, $\hat{A}\!\in\!\mathbb{R}^{N\times N}$ the symmetrically normalized adjacency, and $\sigma$ a pointwise nonlinearity (ReLU). An $L$-layer GCN computes
+\[
+Z_0 = \hat{A} X W_0,\ \ H_0 = \sigma(Z_0),\ \ Z_{\ell+1} = \hat{A} H_\ell W_{\ell+1},\ \ H_{\ell+1}=\sigma(Z_{\ell+1}),\ \ \ell=0,\dots,L{-}3,\ \ \ \ Z_{L-1} = \hat{A} H_{L-2} W_{L-1},
+\]
+with $W_0\!\in\!\mathbb{R}^{F\times d}$, $W_1,\dots,W_{L-2}\!\in\!\mathbb{R}^{d\times d}$, $W_{L-1}\!\in\!\mathbb{R}^{d\times C}$. We write $E_0 := \partial \mathcal{L}/\partial Z_{L-1}\!\in\!\mathbb{R}^{N\times C}$ for the output-side error.
+
+\subsection{The GCN Jacobian Has Kronecker Structure}
+
+The single-step pre-activation Jacobian factorizes immediately by $\mathrm{vec}(AXB)\!=\!(B^\top\!\otimes\! A)\mathrm{vec}(X)$:
+\[
+\frac{\partial \mathrm{vec}(Z_{\ell+1})}{\partial \mathrm{vec}(H_\ell)} \;=\; W_{\ell+1}^\top \otimes \hat{A}.
+\]
+Backward transport across one layer thus splits into a feature-side factor $W_{\ell+1}^\top$ and a topology-side factor $\hat{A}$. This split is structural to message passing on graphs, has no analogue in MLPs, and motivates the form of GRAFT's feedback operator.
+
+\paragraph{Linear-case factorization.}
+For an $L$-layer linear GCN ($\sigma\!=\!\mathrm{id}$), the multi-layer Jacobian inherits the Kronecker structure exactly:
+\begin{theorem}\label{thm:jacobian}
+Fix $\ell\!\in\!\{0,\dots,L{-}2\}$ and let $Q_\ell := W_{L-1}^\top W_{L-2}^\top \cdots W_{\ell+1}^\top \!\in\! \mathbb{R}^{C\times d}$. For the linear suffix from $H_\ell$ to $Z_{L-1}$,
+\[
+\frac{\partial \mathrm{vec}(Z_{L-1})}{\partial \mathrm{vec}(H_\ell)} \;=\; Q_\ell \otimes \hat{A}^{\,L-1-\ell}.
+\]
+\end{theorem}
+\textit{Proof.} By repeated substitution, $Z_{L-1} = \hat{A}^{\,L-1-\ell} H_\ell Q_\ell^\top$. Taking differentials and applying $\mathrm{vec}(AXB)=(B^\top\!\otimes\! A)\mathrm{vec}(X)$ gives the result. \hfill$\square$
+
+\paragraph{Nonlinear case.}
+For the nonlinear network, activation gates $D_k := \mathrm{Diag}(\mathrm{vec}(\sigma'(Z_k)))$ enter between successive Jacobian factors:
+\[
+\frac{\partial \mathrm{vec}(Z_{L-1})}{\partial \mathrm{vec}(H_\ell)}
+=
+(W_{L-1}^\top\!\otimes\!\hat{A})\,D_{L-2}\,(W_{L-2}^\top\!\otimes\!\hat{A})\,D_{L-3}\cdots D_{\ell+1}\,(W_{\ell+1}^\top\!\otimes\!\hat{A}).
+\]
+The diagonal gates destroy exact Kronecker separability, but the topology factor $\hat{A}$ remains explicit between every pair of feature transitions. GRAFT's feedback rule is therefore best understood as a Jacobian-inspired factorized approximation: it preserves the explicit topology factor and learns a single composite feature-side surrogate.
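+
+\paragraph{Numerical sanity check.}
+Theorem~\ref{thm:jacobian} is easy to verify numerically on a toy linear GCN. The following PyTorch sketch is illustrative only (the sizes and variable names are ours, not part of the released code); it builds the linear suffix from $H_\ell$ to $Z_{L-1}$, differentiates it with autograd, and compares against the Kronecker product:
+\begin{verbatim}
+import torch
+
+torch.manual_seed(0)
+N, d, C, L, ell = 6, 4, 3, 4, 0           # toy sizes; first hidden layer
+A = torch.randn(N, N, dtype=torch.double) # any fixed operator works here
+Ws = [torch.randn(d, d, dtype=torch.double) for _ in range(L - 2)] \
+   + [torch.randn(d, C, dtype=torch.double)]  # W_{ell+1}, ..., W_{L-1}
+
+def suffix(H):          # linear suffix: Z_{L-1} = A^{L-1-ell} H Q_ell^T
+    for W in Ws:
+        H = A @ H @ W
+    return H
+
+H = torch.randn(N, d, dtype=torch.double)
+J = torch.autograd.functional.jacobian(suffix, H).reshape(N * C, N * d)
+
+Q  = torch.linalg.multi_dot([W.T for W in reversed(Ws)])  # W_{L-1}^T...W_{ell+1}^T
+Ap = torch.linalg.matrix_power(A, L - 1 - ell)
+# Row-major flattening swaps the Kronecker order: Q (x) A^p becomes A^p (x) Q.
+assert torch.allclose(J, torch.kron(Ap, Q))
+\end{verbatim}
+Inserting a ReLU between the layers breaks the equality, which is precisely the gating effect described above.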
+
+\subsection{The GRAFT Feedback Rule}
+
+For each hidden layer $\ell\!\in\!\{0,\dots,L{-}2\}$, GRAFT defines a learned feature-side matrix $R_\ell\!\in\!\mathbb{R}^{C\times d}$ and a fixed topology-side operator $P_\ell(\hat{A}) := \hat{A}^{\,\min(L-1-\ell,K)}$, with $K{=}3$ throughout. The hidden-layer feedback is
+\[
+\boxed{\;\delta_\ell \;=\; \sigma'(Z_\ell)\;\odot\;\bigl[P_\ell(\hat{A})\,\bar{E}\,R_\ell\bigr]\;,\qquad \ell=0,\dots,L{-}2,\;}
+\]
+where $\bar{E} := D(\hat{A})\,E_0$ is the output error optionally diffused over the graph by a fixed operator $D(\hat{A}) = \sum_{k=0}^{K_D}\alpha_k \hat{A}^k$ (semi-supervised setting; $\bar{E}=E_0$ otherwise). The induced vectorized backward transport is $F_\ell^{\mathrm{GRAFT}} := R_\ell^\top \otimes P_\ell(\hat{A})$, mirroring the Kronecker structure of Theorem~\ref{thm:jacobian} with $R_\ell\!\approx\!Q_\ell$ and $P_\ell(\hat{A})\!\approx\!\hat{A}^{\,L-1-\ell}$.
+
+The weight gradients are computed exactly from the feedback signals:
+\[
+\nabla W_0 = (\hat{A} X)^\top \delta_0,\quad
+\nabla W_\ell = (\hat{A} H_{\ell-1})^\top \delta_\ell\ \text{for}\ \ell\!=\!1,\dots,L{-}2,\quad
+\nabla W_{L-1} = (\hat{A} H_{L-2})^\top E_0.
+\]
+The output-layer weight uses the true error $E_0$ (closed-form gradient at the loss); only hidden layers use GRAFT feedback.
+
+\subsection{Learning the Feature-Side Matrix}\label{sec:alignment}
+
+$R_\ell$ should approximate $Q_\ell = W_{L-1}^\top W_{L-2}^\top \cdots W_{\ell+1}^\top$. Computing $Q_\ell$ explicitly requires sequential matrix products that are exactly what we are trying to avoid; instead we use a multi-probe Jacobian-vector estimator. At each alignment step we draw $B\!\in\!\mathbb{R}^{d\times m}$ with $m{=}64$ Gaussian probes, compute the chain product $\hat{J} = \frac{1}{m}(W_{L-1}^\top \cdots W_{\ell+1}^\top B) B^\top$ via $L{-}\ell{-}1$ sequential products against the probe block ($m$ matrix--vector products each), and update $R_\ell$ by gradient ascent on the Frobenius cosine similarity $\cos_F(R_\ell, \hat{J})$. To prevent norm explosion at depth, we chain-normalize after each multiplication. Alignment runs every 10 training steps (amortized cost $\sim$5\% of total wall-clock); we verify in the appendix that $R_\ell$ converges to 96\% cosine similarity with the true $Q_\ell$ on Cora.
+
+\subsection{Efficient Implementation}\label{sec:efficient}
+
+GRAFT's feedback is computed in $O(1)$ parallel depth via four batched operations.
+\textbf{(1) Topology stack.} We precompute $\{\hat{A}^k E_0\}_{k=0}^{K}$ via $K{=}3$ sequential SpMMs. This stack is shared between graph diffusion (computed as a polynomial in $\hat{A}$) and the per-layer topology operator $P_\ell(\hat{A})$, so polynomial diffusion costs zero additional SpMMs.
+\textbf{(2) Batched feedback.} All $L{-}1$ hidden-layer feedback signals are computed in a single \texttt{torch.bmm} of shape $(L{-}1, N, C) \times (L{-}1, C, d)$, gathering rows from the topology stack according to $\min(L{-}1{-}\ell, K)$.
+\textbf{(3) Batched gradient.} Concatenating the feedback along the feature axis, $\nabla W_\ell = (\hat{A} H_{\ell-1})^\top \delta_\ell$ becomes a single wide SpMM followed by a single \texttt{bmm}.
+\textbf{(4) Amortized alignment.} The $R_\ell$ alignment step (Section~\ref{sec:alignment}) runs once every 10 training iterations.
+
+The result is a backward phase whose number of GPU kernel launches is constant in $L$, in contrast to BP's $O(L)$ sequential dependency. Total arithmetic still grows with $L$ inside the wide SpMM, but the GPU executes it as a single fused operation. Section~\ref{sec:efficient-results} reports a $1.6$--$1.8\times$ wall-clock speedup over BP on sparse graphs at equivalent accuracy.
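+
+For concreteness, the first three steps admit a compact dense sketch. It is illustrative only: the shapes follow this section, but the stand-in tensors, names, and dense matmuls are ours---the released code uses sparse SpMMs, handles the $W_0$/$W_{L-1}$ special cases, and applies the error diffusion $D(\hat{A})$ as a polynomial over the same stack:
+\begin{verbatim}
+import torch
+
+N, d, C, L, K = 100, 8, 3, 6, 3
+A  = torch.rand(N, N); A = A / A.sum(1, keepdim=True)  # stand-in for A_hat
+E0 = torch.randn(N, C)              # output error dL/dZ_{L-1}
+
+# (1) Topology stack {A^k E0}, k = 0..K: the only sequential step.
+stack = [E0]
+for _ in range(K):
+    stack.append(A @ stack[-1])
+stack = torch.stack(stack)          # (K+1, N, C)
+
+# (2) All hidden-layer feedback signals in one batched matmul:
+#     delta_l = sigma'(Z_l) * [A^{min(L-1-l, K)} E0 R_l].
+R    = torch.randn(L - 1, C, d)     # learned feature-side matrices R_l
+hops = torch.tensor([min(L - 1 - l, K) for l in range(L - 1)])
+gate = (torch.randn(L - 1, N, d) > 0).float()   # stand-in for sigma'(Z_l)
+delta = gate * torch.bmm(stack[hops], R)        # (L-1, N, C) x (L-1, C, d)
+
+# (3) Hidden weight gradients, batched; W_0 and W_{L-1} handled separately.
+AH = torch.randn(L - 2, N, d)       # stand-in for the stacked A @ H_{l-1}
+gradW = torch.bmm(AH.transpose(1, 2), delta[1:])  # (L-2, d, d)
+\end{verbatim}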
+
+%% ============================================================
+%% 4. Experiments
+%% ============================================================
+\section{Experiments}\label{sec:experiments}
+
+\subsection{Setup}\label{sec:setup}
+
+\textbf{Datasets.} We evaluate on four standard citation graphs: Cora, CiteSeer, PubMed (Planetoid splits) and DBLP (CitationFull, 17.7K nodes). These are representative of the sparse, moderate-degree, homophilous regime in which deep GNN over-smoothing/transport issues are most pronounced (average degree 4--7, $h{>}0.7$). We further evaluate on Coauthor-Physics (34.5K nodes) and provide additional results on Amazon co-purchase graphs, ogbn-arxiv, and heterophilous benchmarks in the appendix.
+
+\textbf{Architectures.} GCN~\cite{gcn}, GraphSAGE~\cite{graphsage}, GIN~\cite{gin}, and APPNP~\cite{appnp}, all at depths $L\!\in\!\{5,6\}$ (the regime where BP first begins to fail) and depth $L\!=\!10$ for the gradient reach analysis. Hidden dimension $d{=}64$ throughout. No residual connections, no BatchNorm, no Dropout---we deliberately stress-test the bare backbone where the gradient transport bottleneck is most severe; results with these stabilizers are reported as ablations.
+
+\textbf{Methods.} We compare against backpropagation (BP), Direct Feedback Alignment (DFA)~\cite{dfa}, DFA-GNN~\cite{dfagnn} (random feedback with topology pseudo-error), and VanillaGrAPE---a graph-agnostic learned-feedback control we introduce that uses the same multi-probe alignment as GRAFT but with $P{=}I$ (no graph factor in feedback). Together these span the feedback design space ($\{$random, learned$\}$ feedback $\times$ $\{$graph-agnostic, graph-aware$\}$ transport), with BP providing the exact-gradient reference.
+
+\textbf{Optimization.} Adam, learning rate $\in\!\{0.001, 0.005, 0.01\}$, weight decay $5\!\times\!10^{-4}$, 200 epochs with early stopping on validation. We report test accuracy at the validation peak.
+
+\textbf{Statistical procedure.} Each setting is run with 20 random seeds (model initialization + training shuffle). We report mean$\pm$std and use the paired $t$-test on the seed-matched accuracy vector for all significance claims. The main BP-vs-GRAFT sweep comprises $96$ paired comparisons; across all analyses and appendices we report $144$ total hypothesis tests. To control for multiple comparisons, we apply Benjamini--Hochberg correction at $q\!=\!0.05$; settings reported as significant survive both unadjusted ($p{<}0.05$) and BH-adjusted thresholds.
+
+\subsection{Headline Leaderboard}\label{sec:exp-leaderboard}
+
+Table~\ref{tab:leaderboard} consolidates our main result: a head-to-head comparison of GRAFT against vanilla BP, four forward-side anti-over-smoothing methods (ResGCN, JKNet, PairNorm, DropEdge), three feedback-alignment baselines (DFA, DFA-GNN, VanillaGrAPE), and four GRAFT $+$ forward-method combinations on the same forward backbone (GCN, $L\!=\!6$, 20 seeds). \textbf{GRAFT or a GRAFT $+$ forward combination is the best method on all three datasets}: GRAFT$+$JKNet on Cora, GRAFT$+$PairNorm on CiteSeer, and GRAFT$+$ResGCN on DBLP. The takeaway is that the gains compound: GRAFT alone is competitive with the strongest forward methods, and the best combination on each dataset is strictly better than either of its components.
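+
+The seed-paired testing pipeline of Section~\ref{sec:setup} is compact enough to state inline. The sketch below is illustrative (placeholder data, not our released evaluation code); it runs the paired $t$-test per setting and then Benjamini--Hochberg selection at $q=0.05$:
+\begin{verbatim}
+import numpy as np
+from scipy import stats
+
+def bh_reject(pvals, q=0.05):
+    """Benjamini-Hochberg step-up: mask of hypotheses rejected at level q."""
+    p = np.asarray(pvals)
+    m = len(p)
+    order = np.argsort(p)
+    passed = p[order] <= q * np.arange(1, m + 1) / m
+    k = passed.nonzero()[0].max() + 1 if passed.any() else 0
+    mask = np.zeros(m, dtype=bool)
+    mask[order[:k]] = True
+    return mask
+
+# Placeholder accuracies: 96 settings x 20 seed-matched runs per method.
+rng = np.random.default_rng(0)
+acc_bp = rng.normal(0.70, 0.02, size=(96, 20))
+acc_gr = acc_bp + rng.normal(0.02, 0.01, size=acc_bp.shape)
+pvals = [stats.ttest_rel(g, b).pvalue for g, b in zip(acc_gr, acc_bp)]
+print(int(bh_reject(pvals).sum()), "of 96 settings significant after BH")
+\end{verbatim}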
+
+\begin{table}[t]
+\centering\small
+\caption{\textbf{Leaderboard}: methods evaluated on GCN $L\!=\!6$, 20 seeds. \best{Green} marks the best method per column. Top block: BP and forward-side anti-over-smoothing methods. Middle block: feedback-alignment baselines (graph-agnostic). Bottom block: GRAFT and its combinations with forward methods. GRAFT or a GRAFT$+$forward combination wins every dataset.}\label{tab:leaderboard}
+\begin{tabularx}{\textwidth}{l *{3}{>{\centering\arraybackslash}X}}
+\toprule
+Method & Cora & CiteSeer & DBLP \\
+\midrule
+\multicolumn{4}{l}{\emph{Backpropagation $+$ forward-side anti-over-smoothing}} \\
+BP (vanilla) & $68.8{\scriptstyle\pm 4.6}$ & $54.0{\scriptstyle\pm 4.1}$ & $80.5{\scriptstyle\pm 1.0}$ \\
+BP $+$ ResGCN~\cite{resgcn} & $77.5{\scriptstyle\pm 1.6}$ & $63.0{\scriptstyle\pm 2.2}$ & $82.3{\scriptstyle\pm 0.4}$ \\
+BP $+$ JKNet~\cite{jknet} & $78.2{\scriptstyle\pm 1.0}$ & $64.4{\scriptstyle\pm 1.2}$ & $79.9{\scriptstyle\pm 0.8}$ \\
+BP $+$ PairNorm~\cite{pairnorm} & $69.0{\scriptstyle\pm 3.2}$ & $55.4{\scriptstyle\pm 3.4}$ & $79.0{\scriptstyle\pm 0.8}$ \\
+BP $+$ DropEdge~\cite{dropedge} & $74.8{\scriptstyle\pm 1.8}$ & $64.0{\scriptstyle\pm 1.6}$ & $81.6{\scriptstyle\pm 0.5}$ \\
+\midrule
+\multicolumn{4}{l}{\emph{Feedback-alignment baselines (graph-agnostic backward)}} \\
+DFA~\cite{dfa} & $70.4{\scriptstyle\pm 6.8}$ & $60.2{\scriptstyle\pm 2.4}$ & --- \\
+DFA-GNN~\cite{dfagnn} & $68.1{\scriptstyle\pm 5.9}$ & $60.0{\scriptstyle\pm 2.2}$ & --- \\
+VanillaGrAPE & $77.5{\scriptstyle\pm 1.7}$ & $62.3{\scriptstyle\pm 1.5}$ & $82.0{\scriptstyle\pm 0.6}$ \\
+\midrule
+\multicolumn{4}{l}{\emph{GRAFT (ours) and combinations with forward methods}} \\
+\textbf{GRAFT (ours)} & $76.7{\scriptstyle\pm 1.8}$ & $62.4{\scriptstyle\pm 1.9}$ & $82.1{\scriptstyle\pm 0.4}$ \\
+\textbf{GRAFT $+$ ResGCN} & $77.8{\scriptstyle\pm 1.9}$ & $61.5{\scriptstyle\pm 2.2}$ & \best{$82.7{\scriptstyle\pm 0.6}$} \\
+\textbf{GRAFT $+$ JKNet} & \best{$78.3{\scriptstyle\pm 1.6}$} & $61.8{\scriptstyle\pm 2.2}$ & $82.4{\scriptstyle\pm 0.4}$ \\
+\textbf{GRAFT $+$ PairNorm} & $75.8{\scriptstyle\pm 1.5}$ & \best{$64.3{\scriptstyle\pm 2.0}$} & $80.7{\scriptstyle\pm 0.6}$ \\
+\textbf{GRAFT $+$ DropEdge} & $70.8{\scriptstyle\pm 3.8}$ & $62.1{\scriptstyle\pm 1.8}$ & $80.7{\scriptstyle\pm 0.7}$ \\
+\bottomrule
+\end{tabularx}
+\end{table}
+
+\subsection{Per-Backbone Accuracy Across Depths}\label{sec:exp-main}
+
+Table~\ref{tab:main} reports BP vs.\ GRAFT at the best learning rate per (dataset, backbone, depth) configuration. \textbf{GRAFT improves over BP in 86 of 96 paired comparisons}, with 72 of the 86 wins significant after BH correction. The pattern is consistent: at moderate depth ($L\!=\!5,6$) GRAFT delivers gains of up to $+11.4\%$ on GCN, SAGE, and APPNP across all four datasets, with the largest gains on configurations where BP is most depth-degraded (e.g., APPNP $L\!=\!6$ on Cora: BP $66.4\!\pm\!5.0$ vs.\ GRAFT $77.8\!\pm\!2.9$, $\Delta\!=\!+11.4$, $p\!=\!0.0003$). The exception is GIN, whose built-in $(1\!+\!\epsilon)I$ identity mapping already provides a residual gradient path (acting as an architectural control: see Section~\ref{sec:discussion}); on GIN the differences are small and mixed in sign.
+
+\begin{table}[t]
+\centering\small
+\caption{Main accuracy results: BP vs.\ GRAFT at best lr per setting (20 seeds, paired $t$-test, BH corrected). \best{Green} marks the better method per row.
+All but two non-GIN settings (marked ns) are significant at $q\!=\!0.05$.}\label{tab:main}
+\begin{tabularx}{\textwidth}{ll *{4}{>{\centering\arraybackslash}X}}
+\toprule
+Dataset & Backbone & BP & GRAFT & $\Delta$ & $p$ \\
+\midrule
+\multirow{8}{*}{Cora}
+& gcn $L\!=\!5$ & $74.3{\scriptstyle\pm 2.5}$ & \best{$78.8{\scriptstyle\pm 1.0}$} & $+4.5$ & $<\!0.001$ \\
+& gcn $L\!=\!6$ & $69.4{\scriptstyle\pm 5.7}$ & \best{$78.2{\scriptstyle\pm 1.1}$} & $+8.7$ & $0.002$ \\
+& sage $L\!=\!5$ & $74.4{\scriptstyle\pm 2.8}$ & \best{$77.9{\scriptstyle\pm 0.9}$} & $+3.5$ & $<\!0.001$ \\
+& sage $L\!=\!6$ & $69.5{\scriptstyle\pm 4.9}$ & \best{$78.4{\scriptstyle\pm 0.9}$} & $+8.9$ & $<\!0.001$ \\
+& appnp $L\!=\!5$ & $74.8{\scriptstyle\pm 2.7}$ & \best{$79.1{\scriptstyle\pm 1.1}$} & $+4.3$ & $<\!0.001$ \\
+& appnp $L\!=\!6$ & $66.4{\scriptstyle\pm 5.0}$ & \best{$77.8{\scriptstyle\pm 2.9}$} & $+11.4$ & $<\!0.001$ \\
+& gin $L\!=\!5$ & $78.5{\scriptstyle\pm 1.3}$ & \best{$80.1{\scriptstyle\pm 1.0}$} & $+1.6$ & $<\!0.001$ \\
+& gin $L\!=\!6$ & $77.8{\scriptstyle\pm 1.5}$ & $77.8{\scriptstyle\pm 1.5}$ & $+0.0$ & ns \\
+\midrule
+\multirow{8}{*}{CiteSeer}
+& gcn $L\!=\!5$ & $60.6{\scriptstyle\pm 3.1}$ & \best{$63.7{\scriptstyle\pm 1.8}$} & $+3.1$ & $0.002$ \\
+& gcn $L\!=\!6$ & $55.7{\scriptstyle\pm 3.6}$ & \best{$63.5{\scriptstyle\pm 2.2}$} & $+7.7$ & $<\!0.001$ \\
+& sage $L\!=\!5$ & $61.2{\scriptstyle\pm 3.2}$ & \best{$63.9{\scriptstyle\pm 1.8}$} & $+2.8$ & $0.005$ \\
+& sage $L\!=\!6$ & $55.8{\scriptstyle\pm 4.8}$ & \best{$62.0{\scriptstyle\pm 2.1}$} & $+6.2$ & $0.007$ \\
+& appnp $L\!=\!5$ & $61.3{\scriptstyle\pm 2.7}$ & \best{$64.6{\scriptstyle\pm 1.6}$} & $+3.2$ & $<\!0.001$ \\
+& appnp $L\!=\!6$ & $53.3{\scriptstyle\pm 5.4}$ & \best{$64.7{\scriptstyle\pm 1.7}$} & $+11.4$ & $<\!0.001$ \\
+& gin $L\!=\!5$ & \best{$66.7{\scriptstyle\pm 1.3}$} & $65.2{\scriptstyle\pm 1.3}$ & $-1.5$ & $<\!0.001$ \\
+& gin $L\!=\!6$ & \best{$65.1{\scriptstyle\pm 1.7}$} & $63.1{\scriptstyle\pm 2.3}$ & $-2.1$ & $0.004$ \\
+\midrule
+\multirow{8}{*}{PubMed}
+& gcn $L\!=\!5$ & $75.8{\scriptstyle\pm 2.1}$ & \best{$76.9{\scriptstyle\pm 0.7}$} & $+1.2$ & $0.032$ \\
+& gcn $L\!=\!6$ & $73.2{\scriptstyle\pm 2.7}$ & \best{$75.8{\scriptstyle\pm 1.1}$} & $+2.6$ & $<\!0.001$ \\
+& sage $L\!=\!5$ & $75.8{\scriptstyle\pm 1.8}$ & \best{$76.6{\scriptstyle\pm 0.4}$} & $+0.8$ & ns \\
+& sage $L\!=\!6$ & $74.5{\scriptstyle\pm 1.8}$ & \best{$76.5{\scriptstyle\pm 1.0}$} & $+2.0$ & $0.001$ \\
+& appnp $L\!=\!5$ & $76.9{\scriptstyle\pm 1.8}$ & \best{$79.1{\scriptstyle\pm 0.4}$} & $+2.2$ & $<\!0.001$ \\
+& appnp $L\!=\!6$ & $73.7{\scriptstyle\pm 3.7}$ & \best{$78.3{\scriptstyle\pm 0.9}$} & $+4.6$ & $<\!0.001$ \\
+& gin $L\!=\!5$ & $76.6{\scriptstyle\pm 0.7}$ & \best{$77.7{\scriptstyle\pm 0.6}$} & $+1.1$ & $<\!0.001$ \\
+& gin $L\!=\!6$ & $76.4{\scriptstyle\pm 1.3}$ & \best{$76.9{\scriptstyle\pm 1.0}$} & $+0.5$ & ns \\
+\midrule
+\multirow{8}{*}{DBLP}
+& gcn $L\!=\!5$ & $82.1{\scriptstyle\pm 0.4}$ & \best{$83.1{\scriptstyle\pm 0.3}$} & $+0.9$ & $<\!0.001$ \\
+& gcn $L\!=\!6$ & $81.3{\scriptstyle\pm 0.5}$ & \best{$82.9{\scriptstyle\pm 0.3}$} & $+1.5$ & $<\!0.001$ \\
+& sage $L\!=\!5$ & $82.4{\scriptstyle\pm 0.3}$ & \best{$82.5{\scriptstyle\pm 0.4}$} & $+0.2$ & ns \\
+& sage $L\!=\!6$ & $81.7{\scriptstyle\pm 0.5}$ & \best{$82.5{\scriptstyle\pm 0.3}$} & $+0.8$ & $0.002$ \\
+& appnp $L\!=\!5$ & $81.6{\scriptstyle\pm 0.4}$ & \best{$83.1{\scriptstyle\pm 0.4}$} & $+1.5$ & $<\!0.001$ \\
+& appnp $L\!=\!6$ & $79.6{\scriptstyle\pm 1.2}$ & \best{$83.2{\scriptstyle\pm 0.4}$} & $+3.6$ &
$<\!0.001$ \\ +& gin $L\!=\!5$ & $81.8{\scriptstyle\pm 0.4}$ & \best{$82.3{\scriptstyle\pm 0.4}$} & $+0.5$ & $0.001$ \\ +& gin $L\!=\!6$ & $81.6{\scriptstyle\pm 0.6}$ & \best{$82.2{\scriptstyle\pm 0.5}$} & $+0.6$ & $0.004$ \\ +\bottomrule +\end{tabularx} +\end{table} + +\subsection{Ablation: Learned Alignment vs.\ Random Feedback, with and without Topology}\label{sec:exp-ablation} + +To isolate the contribution of \emph{learned} feature-side alignment, we compare four feedback rules on the same forward backbone (GCN $L\!=\!6$, 20 seeds): \textbf{DFA} (random fixed $R_\ell$, $P\!=\!I$), \textbf{DFA-GNN}~\cite{dfagnn} (random $R_\ell$ with topology-driven pseudo-error generation), \textbf{VanillaGrAPE} (learned $R_\ell$, $P\!=\!I$, our graph-agnostic control), and \textbf{GRAFT} (learned $R_\ell$, $P_\ell(\hat{A})$). + +Table~\ref{tab:ablation} reveals two findings. First, \emph{learned feature-side alignment is the dominant contributor to raw accuracy}: DFA-GNN $\to$ VanillaGrAPE yields $+9.3\%$ on Cora and $+3.9\%$ on PubMed ($p\!<\!10^{-4}$ on both), confirming that without alignment, random feedback (whether or not it carries topology pseudo-error) is too coarse. Second, \emph{the explicit topology factor $P_\ell(\hat{A})$ contributes only marginally to accuracy on top of VanillaGrAPE}: $+0.9\%$ on CiteSeer ($p\!=\!0.025$), statistically tied on Cora, PubMed, and DBLP. At first glance, this seems to contradict our claim that the graph factor is essential to GRAFT. Section~\ref{sec:wrong-topology} resolves the apparent contradiction with a causal intervention: the topology factor's primary role is forward--backward \emph{consistency} rather than additional accuracy signal. + +\begin{table}[t] +\centering\small +\caption{Feedback-rule ablation on GCN $L\!=\!6$, 20 seeds. Learned alignment dominates raw accuracy; the explicit topology factor's effect is marginal at the accuracy level but causal under intervention (Section~\ref{sec:wrong-topology}). \best{Green} marks the best result per column.}\label{tab:ablation} +\begin{tabularx}{\textwidth}{l *{4}{>{\centering\arraybackslash}X}} +\toprule +Method & Cora & CiteSeer & PubMed & DBLP \\ +\midrule +DFA (random $R$, $P\!=\!I$) & $70.4{\scriptstyle\pm 6.8}$ & $60.2{\scriptstyle\pm 2.4}$ & $72.2{\scriptstyle\pm 1.5}$ & --- \\ +DFA-GNN (random $R$, topology pseudo-error) & $68.1{\scriptstyle\pm 5.9}$ & $60.0{\scriptstyle\pm 2.2}$ & $70.5{\scriptstyle\pm 2.0}$ & --- \\ +VanillaGrAPE (learned $R$, $P\!=\!I$) & \best{$77.3{\scriptstyle\pm 1.0}$} & $61.9{\scriptstyle\pm 1.2}$ & \best{$74.4{\scriptstyle\pm 1.3}$} & $82.0{\scriptstyle\pm 0.6}$ \\ +\textbf{GRAFT} (learned $R$, $P_\ell(\hat{A})$) & \best{$77.3{\scriptstyle\pm 1.4}$} & \best{$62.8{\scriptstyle\pm 1.6}$} & $74.1{\scriptstyle\pm 1.6}$ & \best{$82.1{\scriptstyle\pm 0.6}$} \\ +\bottomrule +\end{tabularx} +\end{table} + +\subsection{Wall-Clock Efficiency}\label{sec:efficient-results} + +We compare the optimized GRAFT implementation against BP (timing in milliseconds per training step, 5 timing runs, median reported). On Cora, GRAFT-Opt is $1.6$--$1.8\times$ faster than BP across $L\!\in\!\{6,10\}$, driven by avoiding autograd overhead in the forward pass and replacing $L$-step sequential backward propagation with a constant number of batched kernel launches. On the larger DBLP graph (17.7K nodes), the speedup vanishes (parity at $L\!=\!6$, slight slowdown at $L\!=\!10$) because individual SpMMs are large enough to saturate the GPU and amortize the autograd overhead. 
+Memory overhead is modest ($1.2$--$1.4\times$ peak), driven by the topology stack and per-layer alignment matrices.
+
+\begin{table}[t]
+\centering\small
+\caption{Wall-clock per training step (ms; median of 5 timing runs). Speedup is BP/GRAFT. \best{Green} marks the fastest method per row.}\label{tab:efficiency}
+\begin{tabularx}{\textwidth}{ll *{3}{>{\centering\arraybackslash}X} >{\centering\arraybackslash}X}
+\toprule
+Dataset & $L$ & BP & ResGCN & GRAFT-Opt & Speedup vs.\ BP \\
+\midrule
+Cora & 6 & 4.16 & 4.80 & \best{2.62} & $1.59\times$ \\
+Cora & 10 & 7.03 & 6.40 & \best{4.07} & $1.73\times$ \\
+DBLP & 6 & 5.51 & 5.35 & \best{5.34} & $1.03\times$ \\
+DBLP & 10 & \best{7.13} & 7.42 & 7.33 & $0.97\times$ \\
+\bottomrule
+\end{tabularx}
+\end{table}
+
+The reference implementation (used for accuracy experiments to most faithfully match the equations of Section~\ref{sec:method}) is $1.4$--$2.0\times$ \emph{slower} than BP because it runs alignment every step and uses iterative label spreading; the optimized variant amortizes alignment to every 10 steps and replaces iterative diffusion with polynomial reuse on the topology stack. Accuracy is statistically equivalent across implementations ($\Delta\!<\!1\%$ on 8 of 9 settings, all $p\!>\!0.05$; see appendix).
+
+%% ============================================================
+%% 5. Analysis
+%% ============================================================
+\section{Analysis}\label{sec:analysis}
+
+\subsection{Gradient Reach: BP Vanishes at Depth}\label{sec:gradient-reach}
+
+We measure how far the supervision signal propagates through the backward pass. For a trained $L$-layer GCN we record at each hidden layer $\ell\!\in\!\{0,\dots,L{-}2\}$ both the BP gradient norm $\|\partial \mathcal{L}/\partial Z_\ell\|_F$ and the GRAFT feedback norm $\|\delta_\ell\|_F$. All measurements are taken after 100 epochs of training; we report mean $\pm$ 95\% CI over 20 seeds.
+
+\paragraph{At $L\!=\!10$: BP gradients are exactly zero.} Across all 20 seeds and all hidden layers, $\|\partial \mathcal{L}/\partial Z_\ell\|_F < 10^{-38}$, i.e., BP gradient norms underflow the IEEE-754 single-precision range. The forward representations at the same depth, by contrast, remain numerically discriminable (per-node feature norms are $\Theta(1)$). This is the operational signature of a backward transport bottleneck: the supervision signal reaches the loss layer but cannot return. GRAFT's feedback in the same network has $\|\delta_\ell\|_F\!\approx\!0.7$--$1.2$ across all layers, with tight CI. The accuracy gap at $L\!=\!10$ is correspondingly large: GCN $\Delta\!=\!+16.3\%$ ($p\!=\!0.0004$), APPNP $\Delta\!=\!+10.8\%$ ($p\!=\!0.008$).
+
+\paragraph{At $L\!=\!6$: BP gradients are weak but nonzero.} BP norms are $\sim$0.02; GRAFT feedback is $\sim$0.17, roughly $8\times$ stronger. The accuracy gap is positive but variable, consistent with BP being marginally functional rather than completely broken at this depth.
+
+\paragraph{Per-layer alignment with the true BP gradient.} A natural concern is whether GRAFT's feedback is merely a strong but arbitrary signal, rather than a faithful approximation of the BP gradient direction. We measure the per-layer cosine $\cos(\delta_\ell^\mathrm{GRAFT}, \nabla_{Z_\ell}\mathcal{L}^\mathrm{BP})$ on a trained GCN $L\!=\!6$ on Cora (20 seeds). All five hidden layers show positive cosine alignment (mean range 0.33 at the deepest layer to 0.59 at the shallowest), with 95\% CI strictly above zero throughout training. Alignment improves over training as $R_\ell$ converges, validating that GRAFT's feedback is a meaningful BP-direction approximation in the nonlinear case---not just a gradient-shaped vector.
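+
+The measurement itself needs only one extra autograd pass, used for diagnostics and never for updates. A minimal sketch (illustrative; \texttt{Zs} is assumed to be the list of retained pre-activations and \texttt{deltas} the GRAFT feedback signals):
+\begin{verbatim}
+import torch
+import torch.nn.functional as F
+
+def layerwise_cosine(loss, Zs, deltas):
+    """cos(delta_l, dL/dZ_l) per hidden layer; BP grads for measurement only."""
+    grads = torch.autograd.grad(loss, Zs, retain_graph=True)
+    return [F.cosine_similarity(d.flatten(), g.flatten(), dim=0).item()
+            for d, g in zip(deltas, grads)]
+\end{verbatim}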
+
+\subsection{Wrong-Topology Causal Control}\label{sec:wrong-topology}
+
+The accuracy ablation in Section~\ref{sec:exp-ablation} shows that the explicit topology factor $P_\ell(\hat{A})$ contributes only marginally to raw accuracy ($+0.9\%$ on CiteSeer; ns elsewhere). Yet our story claims that backward graph structure is essential. To resolve this tension and provide a causal test of the graph factor's role, we replace the backward graph in $P_\ell(\hat{A})$ with several alternatives while keeping the forward pass on the true graph $\hat{A}$:
+
+\begin{itemize}[nosep]
+\item \textbf{GRAFT (correct $\hat{A}$)}: the standard method.
+\item \textbf{VanillaGrAPE ($P\!=\!I$)}: no graph in the backward; $\delta_\ell\!=\!\sigma'(Z_\ell)\!\odot\![\bar{E}\, R_\ell]$.
+\item \textbf{Rewired ($\tilde{A}$)}: degree-matched random rewiring of the edge set.
+\item \textbf{Permuted ($\Pi\hat{A}\Pi^\top$)}: a random node-index permutation applied to $\hat{A}$.
+\item \textbf{Erd\H{o}s--R\'enyi ($\hat{A}_{\mathrm{ER}}$)}: a random graph of matched edge density.
+\end{itemize}
+
+Table~\ref{tab:wrong-topo} reports 20-seed results on Cora, CiteSeer, and DBLP at GCN $L\!=\!6$; a sketch of the three control constructions follows the table.
+
+\begin{table}[t]
+\centering\small
+\caption{Wrong-topology causal control (GCN $L\!=\!6$, 20 seeds). The forward pass uses the true graph; only the backward graph in $P_\ell(\hat{A})$ varies. \best{Green} marks the best result per column. Rewired and Erd\H{o}s--R\'enyi backward graphs cause $-35$ to $-45\%$ collapses (permutation: $-6$ to $-45\%$), while removing topology entirely ($P\!=\!I$) is benign.}\label{tab:wrong-topo}
+\begin{tabularx}{\textwidth}{l
+  >{\centering\arraybackslash\hsize=.78\hsize}X
+  >{\centering\arraybackslash\hsize=.78\hsize}X
+  >{\centering\arraybackslash\hsize=.78\hsize}X
+  >{\centering\arraybackslash\hsize=1.66\hsize}X}
+\toprule
+Backward graph & Cora & CiteSeer & DBLP & vs.\ GRAFT \\
+\midrule
+GRAFT (correct $\hat{A}$) & $77.2{\scriptstyle\pm 1.3}$ & \best{$62.7{\scriptstyle\pm 1.6}$} & $81.9{\scriptstyle\pm 0.8}$ & --- \\
+VanillaGrAPE ($P\!=\!I$) & \best{$77.5{\scriptstyle\pm 1.7}$} & $62.3{\scriptstyle\pm 1.5}$ & \best{$82.0{\scriptstyle\pm 0.6}$} & ns \\
+\midrule
+Rewired ($\tilde{A}$) & \nega{$32.3{\scriptstyle\pm 1.3}$} & \nega{$29.6{\scriptstyle\pm 8.0}$} & \nega{$46.1{\scriptstyle\pm 5.1}$} & $-35$ to $-45\%^{***}$ \\
+Permuted ($\Pi\hat{A}\Pi^\top$) & \nega{$32.5{\scriptstyle\pm 2.0}$} & \nega{$48.1{\scriptstyle\pm 6.5}$} & \nega{$75.8{\scriptstyle\pm 3.9}$} & $-6$ to $-45\%^{***}$ \\
+Erd\H{o}s--R\'enyi & \nega{$31.9{\scriptstyle\pm 0.0}$} & \nega{$27.4{\scriptstyle\pm 5.8}$} & \nega{$44.8{\scriptstyle\pm 0.3}$} & $-37$ to $-45\%^{***}$ \\
+\bottomrule
+\end{tabularx}
+\end{table}
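+
+The three controls are straightforward to construct; a NumPy sketch (illustrative only---the function names are ours, and symmetric normalization of each control plus conversion to sparse form are omitted):
+\begin{verbatim}
+import numpy as np
+
+def permuted(A, rng):
+    """Pi A Pi^T: relabel all nodes with one random permutation."""
+    p = rng.permutation(A.shape[0])
+    return A[np.ix_(p, p)]
+
+def erdos_renyi(A, rng):
+    """Random symmetric graph with the same number of undirected edges."""
+    n = A.shape[0]
+    m = int(np.triu(A > 0, 1).sum())
+    iu = np.transpose(np.triu_indices(n, 1))
+    pick = iu[rng.choice(len(iu), size=m, replace=False)]
+    B = np.zeros_like(A)
+    B[pick[:, 0], pick[:, 1]] = B[pick[:, 1], pick[:, 0]] = 1.0
+    return B
+
+def rewired(A, rng, n_swaps=20000):
+    """Degree-matched rewiring via double-edge swaps (a,b),(x,y)->(a,y),(x,b)."""
+    u, v = (list(t) for t in np.nonzero(np.triu(A > 0, 1)))
+    E = {frozenset(e) for e in zip(u, v)}
+    for _ in range(n_swaps):
+        i, j = rng.integers(len(u), size=2)
+        a, b, x, y = u[i], v[i], u[j], v[j]
+        if len({a, b, x, y}) == 4 and frozenset((a, y)) not in E \
+                and frozenset((x, b)) not in E:
+            E -= {frozenset((a, b)), frozenset((x, y))}
+            E |= {frozenset((a, y)), frozenset((x, b))}
+            v[i], v[j] = y, b
+    B = np.zeros_like(A)
+    for e in E:
+        p, q = tuple(e)
+        B[p, q] = B[q, p] = 1.0
+    return B
+\end{verbatim}
+Each control is substituted only into the backward operator $P_\ell(\cdot)$; the forward pass always keeps the true $\hat{A}$.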
+
+\paragraph{Structural advantages of the factored feedback.}
+The factorization $R_\ell^\top \otimes P_\ell(\hat{A})$ gives GRAFT seven structural properties that an unfactored learned-feedback rule lacks (Appendix~\ref{app:factorization-benefits} expands each). \emph{(i) Cleaner alignment target:} GRAFT's $R_\ell$ approximates the pure weight chain $Q_\ell$, which is graph-independent, deterministic in the trained weights, and well-defined for the multi-probe estimator; an unfactored rule has no analogous target because the graph is implicit. \emph{(ii) Depth-aware propagation:} $P_\ell(\hat{A})$ applies a different topology power per layer ($\hat{A}^{L-1-\ell}$ at the deepest layer, $\hat{A}$ at the shallowest), matching BP's per-layer hop count; a uniform graph-agnostic feedback applies the same operator at every layer. \emph{(iii) Cross-graph transferability:} GRAFT's $R_\ell$ depends only on the trained weights, so transferring to a new graph requires no retraining of $R_\ell$---only swapping $\hat{A}$ in $P_\ell$. \emph{(iv) Theoretical analyzability:} the factored form yields Theorem~\ref{thm:jacobian} and the depth-attenuation corollary; an unfactored rule has no analogous structure. \emph{(v) Modular extensibility:} since graph transport is an explicit factor, $P_\ell$ can be replaced with a signed, directed, edge-typed, or attention-weighted operator (e.g.\ for heterophily) while leaving the alignment of $R_\ell$ untouched; an unfactored rule has no comparable plug-in slot. \emph{(vi) Structured optimization geometry:} for fixed $P_\ell$, the admissible feedback operators lie on a low-rank Kronecker manifold with only $Cd$ effective parameters per layer rather than $N^2 Cd$, giving the alignment objective a better-conditioned and more identifiable search space. \emph{(vii) Connection to Kronecker-factored approximation:} the form $R_\ell^\top \otimes P_\ell(\hat{A})$ places GRAFT in the lineage of K-FAC-style structured approximations~\cite{martens2015kfac,ritter2018kflap}, opening the door to spectral and curvature analyses that unfactored learned feedback does not admit.
+
+\paragraph{Interpretation: graph structure is \emph{necessary}, not \emph{sufficient}, in the backward pass.} Three observations:
+
+\textbf{(1) Removing topology is benign.} VanillaGrAPE ($P\!=\!I$) is statistically indistinguishable from GRAFT on all three datasets. The error diffusion step $\bar{E} = D(\hat{A})E_0$ (which is shared by both methods) already carries the graph-structural information to the unlabeled nodes; the explicit polynomial $P_\ell(\hat{A})$ adds no further accuracy on top.
+
+\textbf{(2) Wrong topology is catastrophic.} Replacing $\hat{A}$ with a contradictory backward graph collapses GRAFT by roughly 45 percentage points on Cora, and the rewired and Erd\H{o}s--R\'enyi controls cost 35--45 points on all three datasets (on Cora, Erd\H{o}s--R\'enyi reaches 31.9\% with std 0.0---a deterministic failure mode). The collapse is $p\!<\!10^{-6}$ on every paired comparison.
+
+\textbf{(3) The asymmetry is the key insight.} Together, observations (1) and (2) reveal that GRAFT's backward signal is fragile in a specific way: if no topology is provided, the method falls back to feature-side alignment alone and works fine; but if \emph{contradictory} topology is injected, the feedback is actively corrupted. This rules out the trivial reading of GRAFT as ``any structured backward operator works.'' Forward and backward graph structure must be \emph{consistent}, even if the explicit polynomial $P_\ell(\hat{A})$ is not the dominant accuracy contributor on its own. Section~\ref{sec:efficient} shows that the explicit factor has additional value as the source of GRAFT's $O(1)$-depth batched implementation: the polynomial form is what enables shared topology stack reuse.
+
+\subsection{Stackability with Forward-Side Methods}\label{sec:stackability}
+
+GRAFT modifies only the backward training rule; the forward computation is unchanged. This makes GRAFT \emph{orthogonal} to forward-side anti-over-smoothing methods (skip connections, normalization, edge dropping, jumping knowledge) and we expect the two to compose.
+
+\begin{table}[t]
+\centering\small
+\caption{GRAFT combined with forward-side methods (GCN $L\!=\!6$, 20 seeds). \best{Green} marks the best result per column.}\label{tab:stackability}
+\begin{tabularx}{\textwidth}{l *{3}{>{\centering\arraybackslash}X}}
+\toprule
+Method & Cora & CiteSeer & DBLP \\
+\midrule
+BP & $68.8{\scriptstyle\pm 4.6}$ & $54.0{\scriptstyle\pm 4.1}$ & $80.5{\scriptstyle\pm 1.0}$ \\
+BP $+$ ResGCN & $77.5{\scriptstyle\pm 1.6}$ & $63.0{\scriptstyle\pm 2.2}$ & $82.3{\scriptstyle\pm 0.4}$ \\
+BP $+$ JKNet & $78.2{\scriptstyle\pm 1.0}$ & $64.4{\scriptstyle\pm 1.2}$ & $79.9{\scriptstyle\pm 0.8}$ \\
+BP $+$ PairNorm & $69.0{\scriptstyle\pm 3.2}$ & $55.4{\scriptstyle\pm 3.4}$ & $79.0{\scriptstyle\pm 0.8}$ \\
+BP $+$ DropEdge & $74.8{\scriptstyle\pm 1.8}$ & $64.0{\scriptstyle\pm 1.6}$ & $81.6{\scriptstyle\pm 0.5}$ \\
+\midrule
+GRAFT (backward only) & $76.7{\scriptstyle\pm 1.8}$ & $62.4{\scriptstyle\pm 1.9}$ & $82.1{\scriptstyle\pm 0.4}$ \\
+\midrule
+GRAFT $+$ ResGCN & $77.8{\scriptstyle\pm 1.9}$ & $61.5{\scriptstyle\pm 2.2}$ & \best{$82.7{\scriptstyle\pm 0.6}$} \\
+GRAFT $+$ JKNet & \best{$78.3{\scriptstyle\pm 1.6}$} & $61.8{\scriptstyle\pm 2.2}$ & $82.4{\scriptstyle\pm 0.4}$ \\
+GRAFT $+$ PairNorm & $75.8{\scriptstyle\pm 1.5}$ & \best{$64.3{\scriptstyle\pm 2.0}$} & $80.7{\scriptstyle\pm 0.6}$ \\
+GRAFT $+$ DropEdge & $70.8{\scriptstyle\pm 3.8}$ & $62.1{\scriptstyle\pm 1.8}$ & $80.7{\scriptstyle\pm 0.7}$ \\
+\bottomrule
+\end{tabularx}
+\end{table}
+
+The combination is additive in the cases where the forward method preserves the backward graph: GRAFT $+$ ResGCN achieves the best DBLP accuracy in our entire study ($82.7\%$, $\Delta\!=\!+0.4$ over ResGCN alone, $p\!=\!0.052$), and GRAFT $+$ JKNet gives the best Cora result ($78.3\%$, $\Delta\!=\!+0.1$ over JKNet alone). \textbf{PairNorm} is a partial case: GRAFT $+$ PairNorm achieves the best CiteSeer result ($64.3\%$, the best overall in Table~\ref{tab:leaderboard}) but underperforms GRAFT alone on Cora and DBLP. PairNorm preserves $\hat{A}$ in the forward pass (so backward consistency is maintained), but its center-and-rescale step rescales the per-layer activations $H_\ell$ that GRAFT uses to form weight gradients $(\hat{A} H_{\ell-1})^\top \delta_\ell$, slightly perturbing the alignment target between training steps. \textbf{DropEdge} breaks the pattern more severely: it modifies the backward graph between forward and backward steps (since the dropped edges are part of the message-passing graph), violating the consistency requirement identified in Section~\ref{sec:wrong-topology}, and the combination underperforms either method alone. Synchronizing the dropped edges between forward and backward only partially recovers the gap, confirming that DropEdge's stochasticity is fundamentally at odds with GRAFT's learned alignment.
+
+%% ============================================================
+%% 6. Discussion
+%% ============================================================
+\section{Discussion, Limitations, and Future Work}\label{sec:discussion}
+
+\paragraph{Where GRAFT works and where it does not.} GRAFT's advantage is concentrated in \emph{sparse, homophilous graphs} at \emph{moderate depth} ($L\!=\!5$--$8$) with plain message-passing backbones---the regime where BP exhibits severe gradient vanishing (Section~\ref{sec:gradient-reach}).
It does not help in four settings: (1) \textbf{GIN}'s $(1{+}\epsilon)I$ term already acts as a backward skip connection---a control confirming the diagnosis; (2) \textbf{dense graphs} (Amazon Photo/Computers) already give BP stable gradient flow; (3) \textbf{heterophilous graphs} ($h\!<\!0.3$) break the assumption that edge propagation carries useful supervision, and alignment converges poorly; (4) \textbf{very large graphs} (ogbn-arxiv, $N\!=\!169$K), where $\hat{A}^k$ becomes spectrally diverse and probe variance grows with $C$. + +\paragraph{Limitations.} The reference implementation has $1.4$--$2.0\times$ overhead per step over BP; the optimized variant recovers $1.6$--$1.8\times$ speedup on small sparse graphs but only parity on larger ones. The method has been validated on node classification; extensions to graph-level, link-prediction, and inductive settings are straightforward in principle but unverified. + +\paragraph{Future directions.} Scale-aware local/global mixing of the topology operator (e.g., an identity-augmented kernel $(1{-}\beta)\hat{A}^k + \beta I$, which lifts GRAFT on reduced ogbn-arxiv from 48.6\% to 53.7\%), learned signed/directional backward operators for heterophilous graphs, and combinations with forward stabilizers such as GNN-SSM~\cite{arroyo2025vanishing} are natural next steps enabled by the Kronecker factorization. + +%% ============================================================ +%% Bibliography +%% ============================================================ +\begin{thebibliography}{10} + +\bibitem{caillon2025grape} +P.~Caillon, E.~Fagnou, B.~Delattre, and A.~Allauzen. +\newblock {G}r{APE}: Scaling direct feedback learning with {J}acobian alignment guarantees. +\newblock In \textit{ICLR}, 2026. + +\bibitem{caillon2025bpfree} +P.~Caillon, E.~Fagnou, B.~Delattre, and A.~Allauzen. +\newblock Backpropagation-free learning through gradient aligned feedbacks. +\newblock \textit{OpenReview oRPXPoTXYz}, 2025. + +\bibitem{lee2026fwm} +Y.~Lee and S.~Lee. +\newblock Enabling fine-tuning of direct feedback alignment via feedback-weight matching. +\newblock In \textit{ICLR}, 2026. + +\bibitem{casnici2026kpfa} +D.~Casnici, M.~Lefebvre, J.~Dauwels, and C.~Frenkel. +\newblock Accelerated predictive coding networks via direct {K}olen--{P}ollack feedback alignment. +\newblock \textit{OpenReview MCeZ4k7J6M}, 2026. + +\bibitem{dfagnn} +G.~Zhao, T.~Wang, C.~Lang, Y.~Jin, Y.~Li, and H.~Ling. +\newblock {DFA-GNN}: Forward learning of graph neural networks by direct feedback alignment. +\newblock \textit{arXiv:2406.02040}, 2024. + +\bibitem{dll} +C.~Lv, J.~Xu, Y.~Lu, X.~Wang, Z.~Wang, Z.~Xu, D.~Yu, X.~Du, X.~Zheng, and X.~Huang. +\newblock Dendritic localized learning: Toward biologically plausible algorithm. +\newblock In \textit{ICML}, 2025. + +\bibitem{dfa} +A.~N{\o}kland. +\newblock Direct feedback alignment provides learning in deep neural networks. +\newblock In \textit{NeurIPS}, 2016. + +\bibitem{fa} +T.~P.~Lillicrap, D.~Cownden, D.~B.~Tweed, and C.~J.~Akerman. +\newblock Random synaptic feedback weights support error backpropagation for deep learning. +\newblock \textit{Nature Communications}, 7:13276, 2016. + +\bibitem{oversmoothing} +Q.~Li, Z.~Han, and X.-M.~Wu. +\newblock Deeper insights into graph convolutional networks for semi-supervised learning. +\newblock In \textit{AAAI}, 2018. + +\bibitem{oversmoothing_fallacy} +M.~Park, S.~Choi, J.~Heo, E.~Park, and D.~Kim. +\newblock The oversmoothing fallacy: A misguided narrative in {GNN} research. 
+\newblock \textit{arXiv:2506.04653}, 2025. + +\bibitem{arroyo2025vanishing} +A.~Arroyo, A.~Gravina, B.~Gutteridge, F.~Barbero, C.~Gallicchio, X.~Dong, M.~Bronstein, and P.~Vandergheynst. +\newblock On vanishing gradients, over-smoothing, and over-squashing in {GNN}s: Bridging recurrent and graph learning. +\newblock In \textit{NeurIPS}, 2025. + +\bibitem{roth2024rank} +A.~Roth and T.~Liebig. +\newblock Rank collapse causes over-smoothing and over-correlation in graph neural networks. +\newblock In \textit{LoG}, 2024. + +\bibitem{deidda2025rank} +G.~Deidda, J.~Zhang, D.~J.~Higham, and F.~Tudisco. +\newblock Rethinking oversmoothing in {GNN}s: A rank-based perspective. +\newblock \textit{arXiv:2502.04591}, 2025. + +\bibitem{ceni2025mpssm} +A.~Ceni, A.~Gravina, C.~Gallicchio, D.~Bacciu, C.-B.~Schonlieb, and M.~Eliasof. +\newblock Message-passing state-space models: Improving graph learning with modern sequence modeling. +\newblock \textit{arXiv:2505.18728}, 2025. + +\bibitem{zhao2024grassnet} +G.~Zhao, T.~Wang, Y.~Jin, C.~Lang, Y.~Li, and H.~Ling. +\newblock {G}rass{N}et: State space model meets graph neural network. +\newblock \textit{arXiv:2408.08583}, 2024. + +\bibitem{li2024graphssm} +J.~Li, R.~Wu, X.~Jin, B.~Ma, L.~Chen, and Z.~Zheng. +\newblock State space models on temporal graphs: A first-principles study. +\newblock In \textit{NeurIPS}, 2024. + +\bibitem{gcn} +T.~N.~Kipf and M.~Welling. +\newblock Semi-supervised classification with graph convolutional networks. +\newblock In \textit{ICLR}, 2017. + +\bibitem{graphsage} +W.~Hamilton, Z.~Ying, and J.~Leskovec. +\newblock Inductive representation learning on large graphs. +\newblock In \textit{NeurIPS}, 2017. + +\bibitem{gin} +K.~Xu, W.~Hu, J.~Leskovec, and S.~Jegelka. +\newblock How powerful are graph neural networks? +\newblock In \textit{ICLR}, 2019. + +\bibitem{appnp} +J.~Klicpera, A.~Bojchevski, and S.~G\"{u}nnemann. +\newblock Predict then propagate: Graph neural networks meet personalized {PageRank}. +\newblock In \textit{ICLR}, 2019. + +\bibitem{gcnii} +M.~Chen, Z.~Wei, Z.~Huang, B.~Ding, and Y.~Li. +\newblock Simple and deep graph convolutional networks. +\newblock In \textit{ICML}, 2020. + +\bibitem{forwardgnn} +N.~Park, X.~Wang, A.~Simoulin, S.~Yang, G.~Yang, R.~Rossi, P.~Trivedi, and N.~Ahmed. +\newblock Forward learning of graph neural networks. +\newblock In \textit{ICLR}, 2024. + +\bibitem{dropedge} +Y.~Rong, W.~Huang, T.~Xu, and J.~Huang. +\newblock {D}rop{E}dge: Towards deep graph convolutional networks on node classification. +\newblock In \textit{ICLR}, 2020. + +\bibitem{pairnorm} +L.~Zhao and L.~Akoglu. +\newblock {P}air{N}orm: Tackling oversmoothing in {GNN}s. +\newblock In \textit{ICLR}, 2020. + +\bibitem{jknet} +K.~Xu, C.~Li, Y.~Tian, T.~Sonobe, K.~Kawarabayashi, and S.~Jegelka. +\newblock Representation learning on graphs with jumping knowledge networks. +\newblock In \textit{ICML}, 2018. + +\bibitem{resgcn} +G.~Li, M.~M\"uller, A.~Thabet, and B.~Ghanem. +\newblock {D}eep{GCN}s: Can {GCN}s go as deep as {CNN}s? +\newblock In \textit{ICCV}, 2019. + +\bibitem{bengio2014targetprop} +Y.~Bengio. +\newblock How auto-encoders could provide credit assignment in deep networks via target propagation. +\newblock \textit{arXiv:1407.7906}, 2014. + +\bibitem{lee2015dtp} +D.-H.~Lee, S.~Zhang, A.~Fischer, and Y.~Bengio. +\newblock Difference target propagation. +\newblock In \textit{ECML PKDD}, 2015. + +\bibitem{meulemans2020targetprop} +A.~Meulemans, F.~S.~Carzaniga, J.~A.~K.~Suykens, J.~Sacramento, and B.~F.~Grewe. 
+\newblock A theoretical framework for target propagation. +\newblock In \textit{NeurIPS}, 2020. + +\bibitem{jaderberg2017dni} +M.~Jaderberg, W.~M.~Czarnecki, S.~Osindero, O.~Vinyals, A.~Graves, D.~Silver, and K.~Kavukcuoglu. +\newblock Decoupled neural interfaces using synthetic gradients. +\newblock In \textit{ICML}, 2017. + +\bibitem{hinton2022ff} +G.~Hinton. +\newblock The forward-forward algorithm: Some preliminary investigations. +\newblock \textit{arXiv:2212.13345}, 2022. + +\bibitem{whittington2017predictive} +J.~C.~R.~Whittington and R.~Bogacz. +\newblock An approximation of the error backpropagation algorithm in a predictive coding network with local Hebbian synaptic plasticity. +\newblock \textit{Neural Computation}, 29(5):1229--1262, 2017. + +\bibitem{millidge2022pc} +B.~Millidge, A.~Tschantz, and C.~L.~Buckley. +\newblock Predictive coding approximates backprop along arbitrary computation graphs. +\newblock \textit{Neural Computation}, 34(6):1329--1368, 2022. + +\bibitem{gu2020ignn} +F.~Gu, H.~Chang, W.~Zhu, S.~Sojoudi, and L.~El~Ghaoui. +\newblock Implicit graph neural networks. +\newblock In \textit{NeurIPS}, 2020. + +\bibitem{liu2021eignn} +J.~Liu, K.~Kawaguchi, B.~Hooi, Y.~Wang, and X.~Xiao. +\newblock {EIGNN}: Efficient infinite-depth graph neural networks. +\newblock In \textit{NeurIPS}, 2021. + +\bibitem{liu2022mgnni} +J.~Liu, B.~Hooi, K.~Kawaguchi, and X.~Xiao. +\newblock {MGNNI}: Multiscale graph neural networks with implicit layers. +\newblock In \textit{NeurIPS}, 2022. + +\bibitem{bai2019deq} +S.~Bai, J.~Z.~Kolter, and V.~Koltun. +\newblock Deep equilibrium models. +\newblock In \textit{NeurIPS}, 2019. + +\bibitem{wu2019sgc} +F.~Wu, A.~Souza, T.~Zhang, C.~Fifty, T.~Yu, and K.~Weinberger. +\newblock Simplifying graph convolutional networks. +\newblock In \textit{ICML}, 2019. + +\bibitem{frasca2020sign} +F.~Frasca, E.~Rossi, D.~Eynard, B.~Chamberlain, M.~Bronstein, and F.~Monti. +\newblock {SIGN}: Scalable inception graph neural networks. +\newblock In \textit{ICML Workshop on Graph Representation Learning}, 2020. + +\bibitem{martens2015kfac} +J.~Martens and R.~Grosse. +\newblock Optimizing neural networks with {K}ronecker-factored approximate curvature. +\newblock In \textit{ICML}, 2015. + +\bibitem{ritter2018kflap} +H.~Ritter, A.~Botev, and D.~Barber. +\newblock A scalable {L}aplace approximation for neural networks. +\newblock In \textit{ICLR}, 2018. + +\end{thebibliography} + +\appendix +\section*{Appendix} + +The appendix contains material that was compressed in the main text to fit the page budget. Section~\ref{app:extended-related} broadens the related-work discussion, Section~\ref{app:lr-sweep} reports the full learning-rate sweep, Section~\ref{app:antios-full} gives the full per-setting tables for the PairNorm and DropEdge comparisons, Section~\ref{app:depth-stress} reports the depth stress test from $L\!=\!6$ to $L\!=\!20$, Section~\ref{app:sensitivity} reports hyperparameter sensitivity, Section~\ref{app:negative} collects negative results (heterophily, large graphs), Section~\ref{app:bh} details the Benjamini--Hochberg procedure, Section~\ref{app:alignment} reports per-layer cosine alignment, Section~\ref{app:ref-vs-opt} verifies that the reference and optimized implementations produce equivalent accuracy, and Section~\ref{app:factorization-benefits} discusses additional structural benefits of the factored feedback operator. 
+ +\section{Extended Related Work}\label{app:extended-related} + +\paragraph{Forward-side anti-over-smoothing methods.} Beyond skip connections (ResGCN~\cite{resgcn}, GCNII~\cite{gcnii}), several forward modifications have been proposed to mitigate representational collapse with depth. PairNorm~\cite{pairnorm} centers and rescales node representations after each layer, DropEdge~\cite{dropedge} randomly drops edges during training, and JKNet~\cite{jknet} aggregates representations across all layers via max-pooling or LSTM. Section~\ref{sec:stackability} shows that GRAFT composes additively with skip-style methods (ResGCN, JKNet, and partially PairNorm) but is incompatible with DropEdge because the latter introduces forward--backward topology mismatch. + +\paragraph{State-space and spectral GNNs.} A recent line of work casts deep graph learning through the state-space lens. GrassNet~\cite{zhao2024grassnet} designs spectral filters via SSMs; MP-SSM~\cite{ceni2025mpssm} embeds modern SSM sequence modeling inside message passing; GraphSSM~\cite{li2024graphssm} extends state-space models to temporal graphs. All change the forward operator, complementing GRAFT's backward modification. + +\paragraph{Other backprop-free training.} ForwardGNN~\cite{forwardgnn} adapts the Forward-Forward algorithm to graphs via layer-local objectives, removing the need for any global error signal. This is a fundamentally different design point from GRAFT, which keeps a global output error and replaces only the backward transport mechanism. Beyond graphs, recent FA work includes feedback-weight matching for fine-tuning~\cite{lee2026fwm} and direct Kolen--Pollack feedback for predictive coding~\cite{casnici2026kpfa}; neither targets graph structure. + +\paragraph{Alternative credit-assignment paradigms.} Beyond feedback alignment, a broader literature studies alternative credit-assignment rules that relax or replace backpropagation. \emph{Target-propagation} methods compute layerwise targets rather than gradients~\cite{bengio2014targetprop,lee2015dtp,meulemans2020targetprop}; \emph{synthetic-gradient} methods decouple modules by predicting future gradients locally~\cite{jaderberg2017dni}; \emph{predictive-coding / local-inference} approaches recover approximate backpropagation from local updates~\cite{whittington2017predictive,millidge2022pc}; and \emph{dendritic / local-plasticity} rules operate entirely from neuron-local signals~\cite{dll}. ForwardGNN~\cite{forwardgnn} is the closest graph-specific descendant of Hinton's Forward-Forward line~\cite{hinton2022ff}; GRAFT remains closer to feedback-alignment methods because it preserves a global output error and changes only the backward transport mechanism. We are not aware of a graph-specific target-propagation variant or a synthetic-gradient GNN baseline in the recent literature, which leaves a broader open space of graph-structured non-BP training rules beyond DFA-style feedback alignment. + +\paragraph{Implicit and equilibrium GNNs.} A separate line trains graph networks through equilibrium or fixed-point differentiation rather than explicit layerwise BP. Implicit Graph Neural Networks~\cite{gu2020ignn}, EIGNN~\cite{liu2021eignn}, and follow-up work on multi-scale implicit graph models~\cite{liu2022mgnni} build on the deep-equilibrium framework~\cite{bai2019deq} to obtain effectively infinite-depth GNNs whose gradients are computed via implicit differentiation. 
These methods sidestep the depth-vanishing problem by avoiding explicit layerwise stacks altogether, in contrast to GRAFT, which preserves the standard explicit message-passing structure and changes only the backward rule. Combining GRAFT-style learned feedback with implicit-equilibrium graph models is a possible direction for future work.
+
+\paragraph{Graph simplification and decoupled training.} Methods such as SGC~\cite{wu2019sgc} simplify GCNs by precomputing $\hat{A}^K X$ and training only a linear classifier on top, and SIGN~\cite{frasca2020sign} extends this to multi-scale precomputed features. These methods sidestep deep training by collapsing message passing into a fixed pre-processing step. They are complementary to GRAFT: where SGC/SIGN remove the deep training problem by avoiding deep stacks, GRAFT addresses the same problem from the other side by repairing backward transport in deep stacks. We are not aware of an FA-style or target-propagation-style training rule combined with SGC/SIGN, which would otherwise be a natural baseline.
+
+\section{Full Learning-Rate Sweep}\label{app:lr-sweep}
+
+Table~\ref{tab:main} in Section~\ref{sec:exp-main} reports BP vs.\ GRAFT at the best learning rate per (dataset, backbone, depth) configuration. We summarize the full learning-rate sweep here. Across $4$ datasets $\times\,4$ backbones $\times\,2$ depths $\times\,3$ learning rates $=\,96$ paired comparisons, GRAFT improves over BP in \textbf{86 of 96} settings; \textbf{72 of 86} GRAFT wins are significant at unadjusted $p\!<\!0.05$, and all of these remain significant after Benjamini--Hochberg correction at $q\!=\!0.05$ (Section~\ref{app:bh}). The $10$ losses are concentrated on GIN (5 settings) and PubMed (5 settings); both patterns are consistent with our diagnostic that GRAFT helps when BP exhibits gradient transport failure (GIN's identity term suppresses this failure mode; PubMed's higher density partially does too).
+
+\section{Anti-Over-Smoothing Comparisons (Full)}\label{app:antios-full}
+
+\paragraph{PairNorm (36 settings).} We compare BP $+$ PairNorm vs.\ GRAFT across 3 datasets $\times\,4$ backbones $\times\,4$ depths ($L\!\in\!\{5,6,8,10\}$). GRAFT improves over BP $+$ PairNorm in \textbf{32 of 36 settings} ($89\%$); the four exceptions are concentrated at extreme depths where both methods degrade. The largest gain is on Cora APPNP $L\!=\!10$, where GRAFT outperforms PairNorm by $+24.7\%$.
+
+\paragraph{DropEdge (24 settings).} We compare BP $+$ DropEdge($p\!=\!0.5$) vs.\ GRAFT across 3 datasets $\times\,4$ backbones $\times\,2$ depths. GRAFT improves over BP $+$ DropEdge in \textbf{21 of 24 settings} ($88\%$). DropEdge's stochastic edge removal partially mitigates over-smoothing but does not address the backward gradient transport failure; the residual gap is largest on APPNP at $L\!=\!6$.
+
+\paragraph{Why GRAFT $+$ DropEdge does not stack.} Section~\ref{sec:stackability} reported that GRAFT $+$ DropEdge does not stack additively, in contrast to GRAFT $+$ ResGCN and GRAFT $+$ JKNet. The mechanism is forward--backward topology mismatch: when DropEdge removes edges from the forward computation but the backward $P_\ell(\hat{A})$ uses the full graph, GRAFT's alignment becomes a moving target. We verified this by testing a synchronized variant (same dropped edges in forward and backward); synchronization recovers some of the gap but not all of it, because the multi-probe alignment of $R_\ell$ itself presumes a stable transport operator, which the full $\hat{A}$ provides and a freshly resampled graph does not.
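+
+For concreteness, here is a minimal sketch of the synchronized variant in PyTorch style; the helper names (\texttt{sample\_shared\_adjacency}, \texttt{graft.step}) are ours for illustration and do not correspond to the released API, and renormalization of the dropped adjacency is omitted.
+
+\begin{verbatim}
+import torch
+
+def sample_shared_adjacency(edge_index, edge_weight, num_nodes, p=0.5):
+    # One Bernoulli keep-mask per training step, shared by the forward
+    # propagation and the backward transport P_ell(A).
+    keep = torch.rand(edge_weight.numel(), device=edge_weight.device) >= p
+    A = torch.sparse_coo_tensor(edge_index[:, keep], edge_weight[keep],
+                                (num_nodes, num_nodes))
+    return A.coalesce()
+
+def train_step(model, graft, X, y, edge_index, edge_weight):
+    A_drop = sample_shared_adjacency(edge_index, edge_weight, X.size(0))
+    logits = model(X, A_drop)              # forward on the dropped graph
+    loss = torch.nn.functional.cross_entropy(logits, y)
+    graft.step(loss, A_drop)               # backward uses the SAME graph
+    return loss
+\end{verbatim}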
+
+\section{Depth Stress Test (\texorpdfstring{$L\!=\!6$ to $L\!=\!20$}{L=6 to L=20})}\label{app:depth-stress}
+
+We push all methods to extreme depth on Cora and DBLP (GCN, lr$\!=\!0.01$, 3 seeds).
+
+\begin{table}[H]
+\centering\small
+\caption{Depth stress test ($L\!=\!6$ to $L\!=\!20$, GCN, lr$\!=\!0.01$, 3 seeds). \best{Green} marks the best result in each row.}\label{tab:depth-stress}
+\begin{tabularx}{\textwidth}{cl *{4}{>{\centering\arraybackslash}X}}
+\toprule
+Dataset & $L$ & BP & ResGCN & GRAFT & GRAFT $+$ ResGCN \\
+\midrule
+\multirow{6}{*}{Cora}
+& 6 & $71.4{\scriptstyle\pm 1.1}$ & $78.0{\scriptstyle\pm 2.0}$ & $76.4{\scriptstyle\pm 2.1}$ & \best{$78.1{\scriptstyle\pm 0.7}$} \\
+& 8 & $39.7{\scriptstyle\pm 5.3}$ & \best{$78.2{\scriptstyle\pm 2.3}$} & $63.8{\scriptstyle\pm 5.0}$ & $51.7{\scriptstyle\pm 11.0}$ \\
+& 10 & $35.1{\scriptstyle\pm 4.4}$ & \best{$76.9{\scriptstyle\pm 2.2}$} & $54.5{\scriptstyle\pm 4.7}$ & $47.3{\scriptstyle\pm 5.3}$ \\
+& 12 & $32.8{\scriptstyle\pm 1.9}$ & \best{$76.6{\scriptstyle\pm 1.2}$} & $45.7{\scriptstyle\pm 1.8}$ & $42.3{\scriptstyle\pm 1.3}$ \\
+& 16 & $29.3{\scriptstyle\pm 2.2}$ & \best{$73.5{\scriptstyle\pm 2.5}$} & $35.4{\scriptstyle\pm 2.6}$ & $31.6{\scriptstyle\pm 0.5}$ \\
+& 20 & $24.3{\scriptstyle\pm 6.7}$ & \best{$49.2{\scriptstyle\pm 20.9}$} & $38.3{\scriptstyle\pm 5.0}$ & $34.1{\scriptstyle\pm 3.1}$ \\
+\midrule
+\multirow{6}{*}{DBLP}
+& 6 & $79.9{\scriptstyle\pm 0.9}$ & $82.3{\scriptstyle\pm 0.3}$ & $82.6{\scriptstyle\pm 0.5}$ & \best{$83.0{\scriptstyle\pm 0.5}$} \\
+& 8 & $78.8{\scriptstyle\pm 1.0}$ & $81.9{\scriptstyle\pm 0.6}$ & \best{$82.2{\scriptstyle\pm 0.4}$} & $81.6{\scriptstyle\pm 1.1}$ \\
+& 10 & $71.1{\scriptstyle\pm 11.9}$ & \best{$80.4{\scriptstyle\pm 0.7}$} & $78.1{\scriptstyle\pm 1.0}$ & $69.4{\scriptstyle\pm 0.9}$ \\
+& 12 & $66.8{\scriptstyle\pm 6.4}$ & \best{$80.0{\scriptstyle\pm 1.3}$} & $73.4{\scriptstyle\pm 3.2}$ & $64.8{\scriptstyle\pm 8.1}$ \\
+& 16 & $45.4{\scriptstyle\pm 0.7}$ & $63.7{\scriptstyle\pm 13.2}$ & \best{$69.9{\scriptstyle\pm 0.1}$} & $60.3{\scriptstyle\pm 11.3}$ \\
+& 20 & $46.1{\scriptstyle\pm 1.4}$ & $61.3{\scriptstyle\pm 7.4}$ & \best{$61.8{\scriptstyle\pm 11.0}$} & $46.8{\scriptstyle\pm 3.0}$ \\
+\bottomrule
+\end{tabularx}
+\end{table}
+
+\textbf{Three observations.} (1) GRAFT's sweet spot is $L\!=\!5$--$8$, exactly the regime where BP gradient vanishing first becomes severe but architectural fixes are not yet necessary. (2) On Cora at $L\!\geq\!10$, ResGCN dominates---explicit residual connections preserve information more reliably than feedback alignment when over-smoothing is extreme. (3) On DBLP at $L\!=\!16$, GRAFT \emph{overtakes} ResGCN ($69.9$ vs.\ $63.7$): even residual connections degrade at extreme depth, and GRAFT's $O(1)$-depth backward provides a stable feedback signal that BP's sequential backward pass cannot deliver. The GRAFT $+$ ResGCN combination is unstable beyond $L\!=\!8$, suggesting that the additive composition of forward and backward fixes works only in the moderate-depth regime.
+
+\section{Hyperparameter Sensitivity}\label{app:sensitivity}
+
+We vary three hyperparameters around the default ($64$ probes, alignment every $10$ steps, hop cap $K\!=\!3$) on Cora GCN $L\!=\!6$ (3 seeds). Default values are shown in bold.
+
+\begin{table}[H]
+\centering\small
+\caption{Sensitivity to (a) probe count, (b) alignment frequency, (c) hop cap $K$.
Default values in \textbf{bold}.}\label{tab:sensitivity}
+\begin{tabularx}{\textwidth}{l *{5}{>{\centering\arraybackslash}X}}
+\toprule
+\multicolumn{6}{l}{\textbf{(a) Probe count} (alignment every 10 steps, $K\!=\!3$)} \\
+Probes & 16 & 32 & \textbf{64} & 128 & 256 \\
+Acc (\%) & $74.6{\scriptstyle\pm 0.8}$ & $76.1{\scriptstyle\pm 1.1}$ & $\mathbf{77.5}{\scriptstyle\pm 1.6}$ & $77.5{\scriptstyle\pm 1.2}$ & $77.1{\scriptstyle\pm 3.5}$ \\
+\midrule
+\multicolumn{6}{l}{\textbf{(b) Alignment frequency} (64 probes, $K\!=\!3$)} \\
+Every $N$ steps & 1 & 5 & \textbf{10} & 20 & 50 \\
+Acc (\%) & $77.1{\scriptstyle\pm 1.8}$ & $76.9{\scriptstyle\pm 0.3}$ & $\mathbf{78.2}{\scriptstyle\pm 0.9}$ & $71.9{\scriptstyle\pm 4.9}$ & $73.0{\scriptstyle\pm 3.2}$ \\
+\midrule
+\multicolumn{6}{l}{\textbf{(c) Hop cap $K$} (64 probes, alignment every 10 steps)} \\
+$K$ & 1 & 2 & \textbf{3} & 5 & --- \\
+Acc (\%) & $77.1{\scriptstyle\pm 1.9}$ & $76.0{\scriptstyle\pm 0.9}$ & $\mathbf{78.3}{\scriptstyle\pm 0.6}$ & $78.2{\scriptstyle\pm 0.7}$ & --- \\
+\bottomrule
+\end{tabularx}
+\end{table}
+
+GRAFT is robust to these choices: accuracy stays within $3\%$ of the default across the probe-count and hop-cap ranges, and only overly infrequent alignment (every $20$--$50$ steps) costs more, at $5$--$6\%$ below the default. Probe count plateaus at $64$ (no benefit from $128$ or $256$); aligning every $10$ steps is both cheaper and slightly more accurate than aligning every step; hop cap $K\!=\!3$ matches $K\!=\!5$ and beats $K\!\leq\!2$.
+
+\section{Negative Results: Heterophily and Large Graphs}\label{app:negative}
+
+\paragraph{Heterophilous benchmarks.} We tested GRAFT on five heterophilous datasets with edge homophily $h\!<\!0.3$ (3 seeds, GCN $L\!=\!6$); results are reported in Table~\ref{tab:hetero}. GRAFT is neutral or worse than BP on all five, confirming that GRAFT relies on the homophily assumption: when neighboring nodes have different labels, propagating supervision along edges (which is exactly what $P_\ell(\hat{A})$ does in the backward pass) actively hurts.
+
+\begin{table}[H]
+\centering\small
+\caption{Heterophily pilot (3 seeds, GCN $L\!=\!6$). GRAFT does not help when $h\!<\!0.3$.}\label{tab:hetero}
+\begin{tabularx}{\textwidth}{l *{5}{>{\centering\arraybackslash}X}}
+\toprule
+Dataset & $N$ & Avg.\ degree & $h$ & BP & GRAFT \\
+\midrule
+Texas & 183 & 1.8 & 0.108 & $47.4$ & $47.4$ \\
+Cornell & 183 & 1.6 & 0.131 & $39.5$ & $37.7$ \\
+Chameleon & 2{,}277 & 15.9 & 0.235 & $52.3{\scriptstyle\pm 1.2}$ & $26.7{\scriptstyle\pm 5.0}$ \\
+Squirrel & 5{,}201 & 41.7 & 0.224 & $28.1{\scriptstyle\pm 3.5}$ & $21.2{\scriptstyle\pm 0.3}$ \\
+Actor & 7{,}600 & 3.9 & 0.219 & $26.8{\scriptstyle\pm 1.1}$ & $26.4{\scriptstyle\pm 0.8}$ \\
+\bottomrule
+\end{tabularx}
+\end{table}
+
+\paragraph{ogbn-arxiv (large graph, $N\!=\!169$K).} GRAFT underperforms BP on ogbn-arxiv even after reducing the original 40 classes to a smaller label set. We tested 6, 9, and the original 40 classes; GRAFT trails BP by 25--35 points in all cases. The bottleneck is twofold: $\hat{A}^k$ is spectrally far more diverse at this scale, and the multi-probe estimate has high variance at $C\!=\!40$. As a preliminary fix, we replaced $P_\ell(\hat{A})$ with an identity-augmented kernel $(1{-}\beta)\hat{A}^k + \beta I$. At $\beta\!=\!0.5$, GRAFT on the 6-class ogbn-arxiv reduction improves from $48.6\%$ to $53.7\%$ ($+5.1\%$); a multi-kernel mixture $\sum_k \alpha_k \hat{A}^k$ with identity-heavy weights reaches $53.3\%$.
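+
+In code, the identity-augmented kernel is a one-line change to the backward hop operator. The sketch below is ours (PyTorch style; \texttt{A\_hat} a sparse $N\!\times\!N$ tensor, \texttt{E} the dense $N\!\times\!C$ error block), not the released implementation.
+
+\begin{verbatim}
+import torch
+
+def identity_augmented_transport(A_hat, E, k, beta):
+    # (1 - beta) * A_hat^k E + beta * E, computed as k successive
+    # sparse-dense products; the dense N x N power is never formed.
+    out = E
+    for _ in range(k):
+        out = torch.sparse.mm(A_hat, out)
+    return (1.0 - beta) * out + beta * E
+\end{verbatim}
+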
The gap to BP ($73.6\%$) remains substantial, but the trend suggests that scale-aware operator mixing is a promising direction; we leave a rigorous treatment to future work.
+
+\section{Benjamini--Hochberg Correction (144 Tests)}\label{app:bh}
+
+We apply Benjamini--Hochberg multiple-comparisons correction at $q\!=\!0.05$ across all $144$ paired tests in this paper, comprising the following families: the full LR sweep ($96$ BP-vs-GRAFT tests), ablation contrasts ($12$ adjacent-pair tests on DFA $\to$ DFA-GNN $\to$ VanillaGrAPE $\to$ GRAFT across 3 datasets), wrong-topology controls ($12$ tests in Table~\ref{tab:wrong-topo}), stackability combinations ($12$ tests in Table~\ref{tab:stackability}), and the $12$ depth-stress contrasts at $L\!=\!8$ and $L\!=\!10$. After BH correction, $117$ of the $144$ tests are significant at $q\!=\!0.05$. Crucially, \textbf{every test that is significant at the unadjusted $p\!<\!0.05$ threshold also survives BH correction}: the correction does not change which results we claim. The $27$ non-significant tests are concentrated in the GIN backbone (where GRAFT is statistically tied with BP), the PubMed-SAGE configurations (where GRAFT's gain is small), and the high-perturbation feature-masking conditions, where both methods degrade.
+
+\section{Per-Layer Cosine Alignment with the True BP Gradient}\label{app:alignment}
+
+Section~\ref{sec:gradient-reach} reports that GRAFT's per-layer feedback maintains positive cosine alignment with the true BP gradient throughout training. We expand the per-layer values here.
+
+\begin{table}[H]
+\centering\small
+\caption{Final per-layer cosine alignment $\cos(\delta_\ell^{\text{GRAFT}}, \nabla_{Z_\ell}\mathcal{L}^{\text{BP}})$ on Cora GCN $L\!=\!6$ after $200$ epochs (20 seeds, mean $\pm$ 95\% CI).}\label{tab:per-layer-cos}
+\begin{tabularx}{\textwidth}{l *{5}{>{\centering\arraybackslash}X}}
+\toprule
+Layer (input $\to$ output) & $\ell\!=\!0$ & $\ell\!=\!1$ & $\ell\!=\!2$ & $\ell\!=\!3$ & $\ell\!=\!4$ \\
+\midrule
+$\cos(\delta^{\text{GRAFT}}, \nabla^{\text{BP}})$
+& $0.33{\scriptstyle\pm 0.12}$
+& $0.36{\scriptstyle\pm 0.15}$
+& $0.39{\scriptstyle\pm 0.16}$
+& $0.42{\scriptstyle\pm 0.16}$
+& $0.59{\scriptstyle\pm 0.19}$ \\
+\bottomrule
+\end{tabularx}
+\end{table}
+
+All five hidden layers show strictly positive alignment, with 95\% CIs above zero. Alignment is highest at the last hidden layer ($\ell\!=\!4$, closest to the loss) and decreases smoothly toward the input, exactly as one would expect: the multi-probe Jacobian estimate has lower variance when fewer weight matrices are chained. That all layers remain positive (rather than the alignment decaying to zero or flipping sign at depth) is what makes GRAFT a meaningful BP approximation rather than a random feedback signal in disguise.
+
+\section{Reference vs.\ Optimized Implementation: Accuracy Equivalence}\label{app:ref-vs-opt}
+
+Section~\ref{sec:efficient-results} reports wall-clock timing for the optimized implementation (GRAFT-Opt). All accuracy results in the main text use the reference implementation (GRAFT-Ref) because it most directly corresponds to the equations of Section~\ref{sec:method}. We verified that the optimized implementation produces statistically equivalent accuracy on $9$ representative configurations ($3$ datasets $\times\,3$ backbones at $L\!=\!6$, $5$ seeds): the difference $|\text{Ref} - \text{Opt}|$ is below $1\%$ in $8$ of $9$ settings, and no setting differs significantly at $p\!<\!0.05$.
+
+\begin{table}[H]
+\centering\small
+\caption{GRAFT-Opt accuracy on $9$ representative settings ($5$ seeds). All within $\pm 2\%$ of the reference implementation; no statistically significant differences.}\label{tab:ref-vs-opt}
+\begin{tabularx}{\textwidth}{l *{3}{>{\centering\arraybackslash}X}}
+\toprule
+Backbone ($L\!=\!6$) & Cora & CiteSeer & DBLP \\
+\midrule
+GCN & $76.9{\scriptstyle\pm 2.2}$ & $61.6{\scriptstyle\pm 2.7}$ & $82.5{\scriptstyle\pm 0.3}$ \\
+SAGE & $75.6{\scriptstyle\pm 1.1}$ & $61.5{\scriptstyle\pm 2.1}$ & $82.2{\scriptstyle\pm 0.4}$ \\
+APPNP & $76.1{\scriptstyle\pm 1.7}$ & $59.4{\scriptstyle\pm 1.7}$ & $82.8{\scriptstyle\pm 0.3}$ \\
+\bottomrule
+\end{tabularx}
+\end{table}
+
+The optimized implementation is therefore a drop-in replacement for the reference; we report accuracy with the reference implementation for didactic clarity and timing with the optimized one for practical relevance.
+
+\section{Further Structural Benefits of the Factored Feedback}\label{app:factorization-benefits}
+
+Section~\ref{sec:wrong-topology} lists four structural advantages of GRAFT's factorization $R_\ell^\top \otimes P_\ell(\hat{A})$. We expand on three additional benefits here, then sketch a longer list.
+
+\paragraph{Modular extension to richer graph operators.}
+Once the graph-side transport is explicitly factored out into $P_\ell(\hat{A})$, one can replace it with a signed, directed, edge-typed, or attention-weighted operator while leaving the multi-probe alignment of $R_\ell$ unchanged. This creates a clean extension path to heterophilous graphs (signed/directional operators), knowledge graphs (relation-typed operators), and attention-based GNNs (data-dependent transport). An unfactored graph-agnostic feedback rule has no comparable plug-in slot: reintroducing graph structure into hidden-layer transport would require redesigning the entire backward rule. Concretely, the identity-augmented kernel $(1{-}\beta)\hat{A}^k + \beta I$ that we use for the ogbn-arxiv pilot in Section~\ref{app:negative} is exactly this kind of plug-in replacement; signed or directional generalizations are equally straightforward.
+
+\paragraph{Structured low-rank optimization geometry.}
+The set of admissible feedback operators $\{R_\ell^\top \otimes P_\ell(\hat{A}) : R_\ell \in \mathbb{R}^{C\times d}\}$ is a much smaller and more structured manifold than arbitrary $Nd \times NC$ dense transport maps; for fixed $P_\ell$, the manifold has only $Cd$ effective parameters per layer rather than $N^2 Cd$. The alignment objective is therefore better conditioned: the optimizer cannot trade off feature-side alignment against arbitrary nodewise transport effects, and gradient updates to $R_\ell$ correspond one-to-one with movement on the structured manifold. We conjecture this is why GRAFT exhibits lower seed variance in $R_\ell$ alignment quality (Section~\ref{app:alignment}; 95\% CIs of $\pm 0.12$--$0.19$ across layers) than one would expect from an unfactored learned-feedback rule of comparable expressivity.
+
+\paragraph{Connection to Kronecker-factored approximation.}
+The form $R_\ell^\top \otimes P_\ell(\hat{A})$ places GRAFT in the broader family of Kronecker-factored approximations to linear operators---the same family that includes K-FAC~\cite{martens2015kfac} for second-order optimization, Kronecker-factored Laplace approximations~\cite{ritter2018kflap}, and structured-Hessian methods.
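+Concretely, the factored feedback acts on the vectorized output error through the standard identity $\operatorname{vec}(AXB) = (B^\top \otimes A)\operatorname{vec}(X)$: applied to GRAFT's pre-nonlinearity feedback, it gives
+\[
+\operatorname{vec}\!\bigl(P_\ell(\hat{A})\,\bar{E}\,R_\ell\bigr) \;=\; \bigl(R_\ell^\top \otimes P_\ell(\hat{A})\bigr)\operatorname{vec}(\bar{E}),
+\]
+so the feedback is exactly a Kronecker-factored linear map on $\operatorname{vec}(\bar{E})$, the transport-side analogue of K-FAC's Kronecker-factored curvature blocks.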
Methods in this family all trade off representational flexibility for tractable computation and sharper analysis under a Kronecker assumption. GRAFT inherits the same trade-off in the feedback-alignment setting: the factored form admits Theorem~\ref{thm:jacobian} and the depth-attenuation corollary, and could in principle be analyzed using K-FAC-style spectral arguments. An unfactored learned feedback rule has no such mathematical lineage and is harder to connect to the broader literature on structured approximations.
+
+\paragraph{Additional benefits.}
+Several further structural properties follow from the factorization but are not central to the empirical claims in this paper. We list them briefly for completeness (a code sketch of the sparse-kernel point follows the list); a longer discussion is in our supplementary notes.
+\begin{itemize}[nosep]
+\item \textbf{Permutation-equivariant backward transport}: $P_\ell(\hat{A})$ is permutation-equivariant by construction, mirroring the forward GNN's symmetry; an unfactored rule has only the trivial equivariance through $\bar{E}$.
+\item \textbf{Sparse-kernel compute}: $P_\ell(\hat{A})$ is a sparse matrix polynomial, so backward transport reuses the sparse matrix--matrix (SpMM) kernels and topology stack of the forward diffusion step.
+\item \textbf{Graph-frequency interpretation}: $P_\ell(\hat{A})$ is a polynomial graph filter and admits a spectral interpretation (low-pass at small $K$, broadband at larger $K$); unfactored hidden-layer feedback has no such spectral handle.
+\item \textbf{Controlled hop bandwidth}: the hop cap $K$ is an interpretable, dimension-free knob tied to graph geometry; an unfactored rule has no equivalent.
+\item \textbf{Multi-graph sharing}: because $R_\ell$ is graph-independent, it can be shared or warm-started across multiple related graphs while only $P_\ell(\hat{A})$ changes per graph.
+\item \textbf{Subgraph/minibatch compatibility}: a $K$-hop polynomial localizes backward transport to $K$ neighborhood hops, aligning with neighborhood-sampling pipelines (GraphSAGE-style).
+\item \textbf{Failure-mode debuggability}: factorization gives two independent levers ($R_\ell$ and $P_\ell(\hat{A})$) that can be intervened on separately, e.g.\ via the wrong-topology control of Section~\ref{sec:wrong-topology}.
+\item \textbf{Layerwise structural regularization}: the factored form constrains the hypothesis class, acting as implicit regularization that should matter most in low-label or few-shot graph regimes.
+\item \textbf{Identifiable separation of roles}: graph transport ($P_\ell$) and feature mixing ($R_\ell$) cannot trade off arbitrarily, so the operator is more identifiable, both conceptually and statistically.
+\item \textbf{Reduced parameter burden}: the learnable part of the feedback is independent of the number of nodes $N$, which is essential for graph scalability and cross-graph comparison.
+\end{itemize}
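+
+To make the sparse-kernel point concrete, here is a minimal sketch of polynomial backward transport (ours, in PyTorch style; \texttt{A\_hat} a sparse $N\!\times\!N$ tensor, \texttt{E} the dense error block, \texttt{alphas} the polynomial coefficients):
+
+\begin{verbatim}
+import torch
+
+def polynomial_transport(A_hat, E, alphas):
+    # sum_k alphas[k] * A_hat^k E with one running power: a degree-K
+    # polynomial costs K sparse-dense products and never materializes
+    # a dense N x N matrix power.
+    out = alphas[0] * E
+    power = E
+    for a in alphas[1:]:
+        power = torch.sparse.mm(A_hat, power)
+        out = out + a * power
+    return out
+\end{verbatim}
+
+\end{document}
-- cgit v1.2.3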