net: macb: v2 follow-ups to silent TX stall fix series (PR #7340)#7369
Open
lukaszraczylo wants to merge 3 commits into
Open
net: macb: v2 follow-ups to silent TX stall fix series (PR #7340)#7369lukaszraczylo wants to merge 3 commits into
lukaszraczylo wants to merge 3 commits into
Conversation
…WRITES After commits a9e13da ("net: macb: flush PCIe posted write after TSTART doorbell") merged for the silent TX stall on BCM2712/RP1, the readback in macb_start_xmit() and macb_tx_restart() runs on every macb variant. SoC-integrated parts (Atmel, Microchip, SiFive, Xilinx, etc.) write NCR over on-chip MMIO with no fabric posted-write concern, and pay the non-posted PCIe-read latency for no benefit. Only the PCIe-attached macb (currently BCM2712 + RP1) needs the flush. Introduce MACB_CAPS_PCIE_POSTED_WRITES and gate both readbacks behind it; set the cap on raspberrypi_rp1_config only. This is the rpi-6.18.y portion of the v2 follow-up to the netdev RFC submission; the mainline v2 is on lore: https://lore.kernel.org/netdev/cover.1777064117.git.lukasz@raczylo.com/T/ Signed-off-by: Lukasz Raczylo <lukasz@raczylo.com>
Commit ff6914e ("net: macb: re-check ISR after IER re-enable in macb_tx_poll") added a re-check of ISR with a TCOMP mask after unmasking the interrupt, intended to catch completions that fired while masked plus serve as a PCIe read barrier so the descriptor check sees up-to-date TX_USED. That ISR read is destructive on silicon where MACB_CAPS_ISR_CLEAR_ON_WRITE is not set -- raspberrypi_rp1_config is one such case. On read-clear ISR the queue_readl(queue, ISR) call consumes every set bit, but the macb_tx_poll() use-site masks the value down to MACB_BIT(TCOMP) and discards the rest. Any RCOMP / ROVR / TXUBR bit set in ISR at the moment of the read is silently consumed and never processed by the primary macb_interrupt() handler. The race window is small (~200-500 ns per macb_tx_poll completion) but real, and on a level-triggered IRQ the consumed bit drops the line before GIC delivery, so the IRQ for that event is never raised at all. Replace the destructive read with a read of IMR, the read-only interrupt-mask mirror. IMR has no side effects on either read-clear or W1C ISR silicon, and the MMIO read still serves as an architected PCIe read barrier that retires prior peripheral DMA writes before the following macb_tx_complete_pending() inspects the descriptor. The "directly sample a latched TCOMP" half of the original change is lost; the PCIe-barrier half -- which is what the documented race in the existing rmb() comment requires -- is preserved. This is the rpi-6.18.y portion of the v2 follow-up to the netdev RFC submission; the mainline v2 is on lore: https://lore.kernel.org/netdev/cover.1777064117.git.lukasz@raczylo.com/T/ Fixes: ff6914e ("net: macb: re-check ISR after IER re-enable in macb_tx_poll") Signed-off-by: Lukasz Raczylo <lukasz@raczylo.com>
…imited Commit 79dc190 ("net: macb: add TX stall watchdog as defence-in-depth safety net") added a per-queue 1 Hz delayed_work that fires macb_tx_restart() if tx_tail has not advanced. Two operational follow-ups: 1. Boot-time false positive. Between macb_open() and link autoneg completion, queue->tx_head can advance from kernel-queued packets while tx_tail stays at 0 because no TCOMPs have arrived yet. The first watchdog tick at +1000 ms then trips its head!=tail check and fires a spurious TX-stall warning + re-kick. Observed at ~25% rate during fleet rolling reboots (6 of 24 nodes, all tail=0 head=5-7). The re-kick itself is harmless on a healthy ring (macb_tx_restart() verifies TBQP vs head before re-asserting TSTART), but the warning is noise and confuses operators who read it as a real silent-stall event. Gate the stall check on netif_carrier_ok(). No carrier means no completion is possible, so skipping the check is safe; the watchdog still ticks so it picks up immediately once carrier comes up. 2. netdev_warn_once() limits visibility to one event per netdev lifetime, which is fine for a warn-once defensive log but defeats any operator accounting of real stall events on long-running nodes. netdev_warn_ratelimited() keeps bounded log noise while making the events observable in the journal. This is the rpi-6.18.y portion of the v2 follow-up to the netdev RFC submission; the mainline v2 is on lore: https://lore.kernel.org/netdev/cover.1777064117.git.lukasz@raczylo.com/T/ Signed-off-by: Lukasz Raczylo <lukasz@raczylo.com>
lukaszraczylo
added a commit
to lukaszraczylo/pkgs
that referenced
this pull request
May 14, 2026
Drop the [PATCH net-next v2 0/3] series 'net: macb: candidate fixes for silent TX stall on BCM2712/RP1' into kernel/build/patches/, replacing the v1 form merged in siderolabs#1526. netdev v2 thread: https://lore.kernel.org/netdev/20260514215459.36109-1-lukasz@raczylo.com/T/ v2 changes from v1 (already merged in siderolabs#1526): 0001 (PCIe posted-write flush after TSTART doorbell) - now gated behind a new MACB_CAPS_PCIE_POSTED_WRITES capability, set only on raspberrypi_rp1_config. v1 applied the readback to every macb variant; SoC-integrated parts (Atmel, Microchip, SiFive, Xilinx) have no fabric posted-write concern and were paying the non-posted-read latency for nothing. 0002 (PCIe read barrier before TX completion descriptor check) - replaces the v1 form, which was a regression on read-clear ISR silicon. v1 read ISR with a TCOMP mask in macb_tx_poll(); on raspberrypi_rp1_config (where MACB_CAPS_ISR_CLEAR_ON_WRITE is not set) that read consumes every bit set in ISR, but the use-site masks down to TCOMP and discards the rest -- any RCOMP / ROVR / TXUBR bit at that instant is silently consumed and the IRQ handler that would have processed it sees ISR=0. On level-triggered IRQ the consumed bit drops the line before GIC delivery, vanishing the IRQ entirely. Caught by self-audit while preparing v2; disclosed to netdev in the reply thread. v2 replaces the destructive ISR read with (void)queue_readl( queue, IMR), the read-only mask mirror -- non-destructive, same PCIe-barrier effect on prior peripheral DMA writes. 0003 (TX stall watchdog) - tracks tail movement via a bool flag set by macb_tx_complete() instead of a tx_tail snapshot (form suggested by Phil Elwell on raspberrypi/linux#7340). Adds a netif_carrier_ok() gate at the top of the watchdog tick so the boot-time false positive seen on autoneg-still-pending boots no longer fires (observed at ~25% rate on a 24-node fleet rolling reboot under v1). Swaps netdev_warn_once to netdev_warn_ratelimited so operators can count real events across the netdev lifetime. The v2 patch 2 IMR-barrier form has been running in production on a 24-node Raspberry Pi 5 fleet since 2026-05-14; ~190 cumulative node-hours so far, zero mid-runtime TX stalls, zero user-space watchdog RECOVER events. Pre-patch baseline (~0.5 stall/node-hour at fleet level) would have predicted ~95 mid-runtime stalls in that window; observed is 0. Related: * netdev v1 RFC thread: https://lore.kernel.org/netdev/cover.1777064117.git.lukasz@raczylo.com/T/ * netdev v2 series: https://lore.kernel.org/netdev/20260514215459.36109-1-lukasz@raczylo.com/T/ * raspberrypi/linux merge: raspberrypi/linux#7340 * raspberrypi/linux v2 PR: raspberrypi/linux#7369 Signed-off-by: Lukasz Raczylo <lukasz@raczylo.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Three follow-up commits to PR #7340 (merged 2026-05-08) addressing one regression and two operational polish items I found while preparing the netdev v2 submission.
Background
Mainline RFC: https://lore.kernel.org/netdev/cover.1777064117.git.lukasz@raczylo.com/T/
Mainline v2 sent today (2026-05-14): https://lore.kernel.org/netdev/ (cover Message-ID
<20260514215459.36109-1-lukasz@raczylo.com>)These rpi-6.18.y-anchored commits track the mainline v2 changes one-for-one but apply on top of the already-merged form in this tree.
Commits
1.
net: macb: gate PCIe posted-write flush behind MACB_CAPS_PCIE_POSTED_WRITES@a9e13da2dc4c added the NCR readback unconditionally to every macb variant. SoC-integrated parts (Atmel, Microchip, SiFive, Xilinx, etc.) write NCR over on-chip MMIO with no fabric posted-write concern, so they pay the non-posted-read latency for nothing.
Introduce
MACB_CAPS_PCIE_POSTED_WRITES(BIT(14) — first unused) and gate both readbacks behind it. Set onraspberrypi_rp1_configonly.2.
net: macb: drop destructive ISR read, use IMR barrier in macb_tx_poll← the real regression fix@ff6914e97386 added
if ((queue_readl(queue, ISR) & MACB_BIT(TCOMP)) || macb_tx_complete_pending(queue)). On read-clear ISR silicon — which is whatraspberrypi_rp1_configdefaults to (MACB_CAPS_ISR_CLEAR_ON_WRITEnot set) — thequeue_readl(queue, ISR)consumes every bit set in the register, but the use-site masks down to TCOMP and discards the rest. Any RCOMP / ROVR / TXUBR bit set at the moment of the read is silently consumed; the IRQ handler that would have processed it seesISR == 0and returns IRQ_NONE. On a level-triggered IRQ the consumed bit drops the line before GIC delivery, so the IRQ vanishes entirely.Race window is ~200-500 ns per
macb_tx_pollcompletion; small but non-zero, and at moderate-to-heavy RX load the math works out to non-negligible RX-completion loss.Replace with
(void)queue_readl(queue, IMR). IMR is the read-only mask mirror, no side effects on either read-clear or W1C ISR silicon. Same PCIe-barrier effect, so the subsequentmacb_tx_complete_pending()still sees flushed peripheral DMA writes. Loses the "directly sample latched TCOMP" half of the original change; keeps the half that addresses the documentedrmb()race.3.
net: macb: gate TX stall watchdog on netif_carrier_ok, use warn_ratelimited@79dc190b12f9 's watchdog fires a false positive at boot when
tx_headadvances from kernel-queued packets betweenmacb_open()and link autoneg completion whiletx_tailstays at 0 (no TCOMPs yet). Observed at ~25% rate during a 24-node fleet rolling reboot (6 of 24 nodes fired with the canonicaltail=0 head=5-7signature).The re-kick itself is harmless (macb_tx_restart() checks TBQP vs head before re-asserting TSTART) but the warning is misleading.
Two changes:
netif_carrier_ok()gate at the top of the watchdog tick. No carrier means no completion possible; skip the check but keep ticking.netdev_warn_once→netdev_warn_ratelimitedso operators can count real events over time without floods.Testing
All three commits running in production on a 24-node Raspberry Pi 5 fleet on
rpi-6.18.y + macb-v2since 2026-05-14 ~14:00 UTC. ~120 cumulative node-hours so far. Zero mid-runtime TX stalls; zero user-space-watchdog RECOVER events; the fiveTX stall detectedlines in dmesg are all the boot-time false positive the carrier-gate commit eliminates.Pre-patch fleet rate was multiple stalls per hour at fleet level ([JFYI] Raspberry Pi 5 + kernel 6.17: Silent network death cilium/cilium#43198, Ubuntu launchpad #2133877). Pre-patch baseline over the same 120 node-hours would have predicted ~60 stalls; observed: 0.
checkpatch clean on all three commits.
Happy to split / squash / rebase on feedback. CC @pelwell who reviewed the original series.