Skip to content

net: macb: v2 follow-ups to silent TX stall fix series (PR #7340)#7369

Open
lukaszraczylo wants to merge 3 commits into
raspberrypi:rpi-6.18.yfrom
lukaszraczylo:macb-v2-followup
Open

net: macb: v2 follow-ups to silent TX stall fix series (PR #7340)#7369
lukaszraczylo wants to merge 3 commits into
raspberrypi:rpi-6.18.yfrom
lukaszraczylo:macb-v2-followup

Conversation

@lukaszraczylo
Copy link
Copy Markdown
Contributor

Three follow-up commits to PR #7340 (merged 2026-05-08) addressing one regression and two operational polish items I found while preparing the netdev v2 submission.

Background

Mainline RFC: https://lore.kernel.org/netdev/cover.1777064117.git.lukasz@raczylo.com/T/
Mainline v2 sent today (2026-05-14): https://lore.kernel.org/netdev/ (cover Message-ID <20260514215459.36109-1-lukasz@raczylo.com>)

These rpi-6.18.y-anchored commits track the mainline v2 changes one-for-one but apply on top of the already-merged form in this tree.

Commits

1. net: macb: gate PCIe posted-write flush behind MACB_CAPS_PCIE_POSTED_WRITES

@a9e13da2dc4c added the NCR readback unconditionally to every macb variant. SoC-integrated parts (Atmel, Microchip, SiFive, Xilinx, etc.) write NCR over on-chip MMIO with no fabric posted-write concern, so they pay the non-posted-read latency for nothing.

Introduce MACB_CAPS_PCIE_POSTED_WRITES (BIT(14) — first unused) and gate both readbacks behind it. Set on raspberrypi_rp1_config only.

2. net: macb: drop destructive ISR read, use IMR barrier in macb_tx_poll ← the real regression fix

@ff6914e97386 added if ((queue_readl(queue, ISR) & MACB_BIT(TCOMP)) || macb_tx_complete_pending(queue)). On read-clear ISR silicon — which is what raspberrypi_rp1_config defaults to (MACB_CAPS_ISR_CLEAR_ON_WRITE not set) — the queue_readl(queue, ISR) consumes every bit set in the register, but the use-site masks down to TCOMP and discards the rest. Any RCOMP / ROVR / TXUBR bit set at the moment of the read is silently consumed; the IRQ handler that would have processed it sees ISR == 0 and returns IRQ_NONE. On a level-triggered IRQ the consumed bit drops the line before GIC delivery, so the IRQ vanishes entirely.

Race window is ~200-500 ns per macb_tx_poll completion; small but non-zero, and at moderate-to-heavy RX load the math works out to non-negligible RX-completion loss.

Replace with (void)queue_readl(queue, IMR). IMR is the read-only mask mirror, no side effects on either read-clear or W1C ISR silicon. Same PCIe-barrier effect, so the subsequent macb_tx_complete_pending() still sees flushed peripheral DMA writes. Loses the "directly sample latched TCOMP" half of the original change; keeps the half that addresses the documented rmb() race.

3. net: macb: gate TX stall watchdog on netif_carrier_ok, use warn_ratelimited

@79dc190b12f9 's watchdog fires a false positive at boot when tx_head advances from kernel-queued packets between macb_open() and link autoneg completion while tx_tail stays at 0 (no TCOMPs yet). Observed at ~25% rate during a 24-node fleet rolling reboot (6 of 24 nodes fired with the canonical tail=0 head=5-7 signature).

The re-kick itself is harmless (macb_tx_restart() checks TBQP vs head before re-asserting TSTART) but the warning is misleading.

Two changes:

  • Add a netif_carrier_ok() gate at the top of the watchdog tick. No carrier means no completion possible; skip the check but keep ticking.
  • Swap netdev_warn_oncenetdev_warn_ratelimited so operators can count real events over time without floods.

Testing

  • All three commits running in production on a 24-node Raspberry Pi 5 fleet on rpi-6.18.y + macb-v2 since 2026-05-14 ~14:00 UTC. ~120 cumulative node-hours so far. Zero mid-runtime TX stalls; zero user-space-watchdog RECOVER events; the five TX stall detected lines in dmesg are all the boot-time false positive the carrier-gate commit eliminates.

  • Pre-patch fleet rate was multiple stalls per hour at fleet level ([JFYI] Raspberry Pi 5 + kernel 6.17: Silent network death cilium/cilium#43198, Ubuntu launchpad #2133877). Pre-patch baseline over the same 120 node-hours would have predicted ~60 stalls; observed: 0.

  • checkpatch clean on all three commits.

Happy to split / squash / rebase on feedback. CC @pelwell who reviewed the original series.

…WRITES

After commits a9e13da ("net: macb: flush PCIe posted write
after TSTART doorbell") merged for the silent TX stall on
BCM2712/RP1, the readback in macb_start_xmit() and macb_tx_restart()
runs on every macb variant.

SoC-integrated parts (Atmel, Microchip, SiFive, Xilinx, etc.) write
NCR over on-chip MMIO with no fabric posted-write concern, and pay
the non-posted PCIe-read latency for no benefit.  Only the
PCIe-attached macb (currently BCM2712 + RP1) needs the flush.

Introduce MACB_CAPS_PCIE_POSTED_WRITES and gate both readbacks
behind it; set the cap on raspberrypi_rp1_config only.

This is the rpi-6.18.y portion of the v2 follow-up to the netdev
RFC submission; the mainline v2 is on lore:

  https://lore.kernel.org/netdev/cover.1777064117.git.lukasz@raczylo.com/T/

Signed-off-by: Lukasz Raczylo <lukasz@raczylo.com>
Commit ff6914e ("net: macb: re-check ISR after IER re-enable
in macb_tx_poll") added a re-check of ISR with a TCOMP mask after
unmasking the interrupt, intended to catch completions that fired
while masked plus serve as a PCIe read barrier so the descriptor
check sees up-to-date TX_USED.

That ISR read is destructive on silicon where MACB_CAPS_ISR_CLEAR_ON_WRITE
is not set -- raspberrypi_rp1_config is one such case.  On read-clear
ISR the queue_readl(queue, ISR) call consumes every set bit, but the
macb_tx_poll() use-site masks the value down to MACB_BIT(TCOMP) and
discards the rest.  Any RCOMP / ROVR / TXUBR bit set in ISR at the
moment of the read is silently consumed and never processed by the
primary macb_interrupt() handler.  The race window is small (~200-500 ns
per macb_tx_poll completion) but real, and on a level-triggered IRQ the
consumed bit drops the line before GIC delivery, so the IRQ for that
event is never raised at all.

Replace the destructive read with a read of IMR, the read-only
interrupt-mask mirror.  IMR has no side effects on either read-clear
or W1C ISR silicon, and the MMIO read still serves as an architected
PCIe read barrier that retires prior peripheral DMA writes before the
following macb_tx_complete_pending() inspects the descriptor.  The
"directly sample a latched TCOMP" half of the original change is lost;
the PCIe-barrier half -- which is what the documented race in the
existing rmb() comment requires -- is preserved.

This is the rpi-6.18.y portion of the v2 follow-up to the netdev RFC
submission; the mainline v2 is on lore:

  https://lore.kernel.org/netdev/cover.1777064117.git.lukasz@raczylo.com/T/

Fixes: ff6914e ("net: macb: re-check ISR after IER re-enable in macb_tx_poll")
Signed-off-by: Lukasz Raczylo <lukasz@raczylo.com>
…imited

Commit 79dc190 ("net: macb: add TX stall watchdog as
defence-in-depth safety net") added a per-queue 1 Hz delayed_work
that fires macb_tx_restart() if tx_tail has not advanced.  Two
operational follow-ups:

1. Boot-time false positive.  Between macb_open() and link autoneg
   completion, queue->tx_head can advance from kernel-queued packets
   while tx_tail stays at 0 because no TCOMPs have arrived yet.  The
   first watchdog tick at +1000 ms then trips its head!=tail check
   and fires a spurious TX-stall warning + re-kick.  Observed at
   ~25% rate during fleet rolling reboots (6 of 24 nodes, all
   tail=0 head=5-7).  The re-kick itself is harmless on a healthy
   ring (macb_tx_restart() verifies TBQP vs head before re-asserting
   TSTART), but the warning is noise and confuses operators who
   read it as a real silent-stall event.

   Gate the stall check on netif_carrier_ok().  No carrier means no
   completion is possible, so skipping the check is safe; the
   watchdog still ticks so it picks up immediately once carrier
   comes up.

2. netdev_warn_once() limits visibility to one event per netdev
   lifetime, which is fine for a warn-once defensive log but defeats
   any operator accounting of real stall events on long-running
   nodes.  netdev_warn_ratelimited() keeps bounded log noise while
   making the events observable in the journal.

This is the rpi-6.18.y portion of the v2 follow-up to the netdev RFC
submission; the mainline v2 is on lore:

  https://lore.kernel.org/netdev/cover.1777064117.git.lukasz@raczylo.com/T/

Signed-off-by: Lukasz Raczylo <lukasz@raczylo.com>
lukaszraczylo added a commit to lukaszraczylo/pkgs that referenced this pull request May 14, 2026
Drop the [PATCH net-next v2 0/3] series 'net: macb: candidate fixes
for silent TX stall on BCM2712/RP1' into kernel/build/patches/,
replacing the v1 form merged in siderolabs#1526.

netdev v2 thread:
  https://lore.kernel.org/netdev/20260514215459.36109-1-lukasz@raczylo.com/T/

v2 changes from v1 (already merged in siderolabs#1526):

0001 (PCIe posted-write flush after TSTART doorbell) - now gated
  behind a new MACB_CAPS_PCIE_POSTED_WRITES capability, set only on
  raspberrypi_rp1_config.  v1 applied the readback to every macb
  variant; SoC-integrated parts (Atmel, Microchip, SiFive, Xilinx)
  have no fabric posted-write concern and were paying the
  non-posted-read latency for nothing.

0002 (PCIe read barrier before TX completion descriptor check) -
  replaces the v1 form, which was a regression on read-clear ISR
  silicon.  v1 read ISR with a TCOMP mask in macb_tx_poll(); on
  raspberrypi_rp1_config (where MACB_CAPS_ISR_CLEAR_ON_WRITE is
  not set) that read consumes every bit set in ISR, but the
  use-site masks down to TCOMP and discards the rest -- any
  RCOMP / ROVR / TXUBR bit at that instant is silently consumed
  and the IRQ handler that would have processed it sees ISR=0.
  On level-triggered IRQ the consumed bit drops the line before
  GIC delivery, vanishing the IRQ entirely.  Caught by self-audit
  while preparing v2; disclosed to netdev in the reply thread.
  v2 replaces the destructive ISR read with (void)queue_readl(
  queue, IMR), the read-only mask mirror -- non-destructive, same
  PCIe-barrier effect on prior peripheral DMA writes.

0003 (TX stall watchdog) - tracks tail movement via a bool flag
  set by macb_tx_complete() instead of a tx_tail snapshot
  (form suggested by Phil Elwell on raspberrypi/linux#7340).
  Adds a netif_carrier_ok() gate at the top of the watchdog tick
  so the boot-time false positive seen on autoneg-still-pending
  boots no longer fires (observed at ~25% rate on a 24-node fleet
  rolling reboot under v1).  Swaps netdev_warn_once to
  netdev_warn_ratelimited so operators can count real events
  across the netdev lifetime.

The v2 patch 2 IMR-barrier form has been running in production on a
24-node Raspberry Pi 5 fleet since 2026-05-14; ~190 cumulative
node-hours so far, zero mid-runtime TX stalls, zero user-space
watchdog RECOVER events.  Pre-patch baseline (~0.5 stall/node-hour
at fleet level) would have predicted ~95 mid-runtime stalls in that
window; observed is 0.

Related:
  * netdev v1 RFC thread:    https://lore.kernel.org/netdev/cover.1777064117.git.lukasz@raczylo.com/T/
  * netdev v2 series:        https://lore.kernel.org/netdev/20260514215459.36109-1-lukasz@raczylo.com/T/
  * raspberrypi/linux merge: raspberrypi/linux#7340
  * raspberrypi/linux v2 PR: raspberrypi/linux#7369

Signed-off-by: Lukasz Raczylo <lukasz@raczylo.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant