diff --git a/Documentation/components/net/tcp_network_perf.rst b/Documentation/components/net/tcp_network_perf.rst
index 33e89d8efd265..c55117a912a73 100644
--- a/Documentation/components/net/tcp_network_perf.rst
+++ b/Documentation/components/net/tcp_network_perf.rst
@@ -2,8 +2,8 @@
 TCP Network Performance
 =======================
 
-.. warning:: 
-    Migrated from: 
+.. warning::
+    Migrated from:
     https://cwiki.apache.org/confluence/display/NUTTX/TCP+Network+Performance
 
 
@@ -22,172 +22,194 @@ First let's talk about TCP send performance.
 Source of Performance Bottlenecks
 ---------------------------------
 
-General TCP send performance is not determined by the TCP stack as much 
-as it is by the network device driver. Bad network performance is due 
-to time lost `BETWEEN` packet transfers. The packet transfers themselves 
-go at the wire speed*. So if you want to improve performance on a 
-given network, you have to reduce time lost between transfers. 
+General TCP send performance is not determined by the TCP stack as much
+as it is by the network device driver. Bad network performance is due
+to time lost `BETWEEN` packet transfers. The packet transfers themselves
+go at the wire speed*. So if you want to improve performance on a
+given network, you have to reduce time lost between transfers.
 There is no other way.
 
-Ignoring Ethernet issues like collisions, back-off delays, 
+Ignoring Ethernet issues like collisions, back-off delays,
 inter-packet gaps (IPG), etc.
 
-The time between packets is limited primarily by the buffering 
-design of the network driver. If you want to improve performance, 
-then you must improve the buffering at the network driver. 
-You need to support many full size (1500 byte) packet buffers. 
-You must be able to query the network for new data to transfer, 
-and queue those transfers in packet buffers. In order to reach 
-peak performance, the network driver must have the next transfer 
-buffered and ready-to-go before the previous transfer is finished 
+The time between packets is limited primarily by the buffering
+design of the network driver. If you want to improve performance,
+then you must improve the buffering at the network driver.
+You need to support many full size (1500 byte) packet buffers.
+You must be able to query the network for new data to transfer,
+and queue those transfers in packet buffers. In order to reach
+peak performance, the network driver must have the next transfer
+buffered and ready-to-go before the previous transfer is finished
 to minimize the GAP between packet transfers.
 
-Different network devices also support more or less efficient 
-interfaces: The worst performing support interfaces that can 
-handle only one packet at a time, the best performing are able 
-to retain linked lists of packet buffers in memory and perform 
+Different network devices also support more or less efficient
+interfaces: The worst performing support interfaces that can
+handle only one packet at a time, the best performing are able
+to retain linked lists of packet buffers in memory and perform
 scatter-gather DMA for a sequence of packets.
 
-In the NuttX TCP stack, you can also improve performance by 
+In the NuttX TCP stack, you can also improve performance by
 enabling TCP write buffering. But the driver is the real key.
 
-It would be good to have a real in-depth analysis of the 
-network stack performance to identify bottlenecks and 
-generate ideas for performance improvement. No one has 
-ever done that. If I were aware of any stack related 
+It would be good to have a real in-depth analysis of the
+network stack performance to identify bottlenecks and
+generate ideas for performance improvement. No one has
+ever done that. If I were aware of any stack related
 performance issue, I would certainly address it.
 
 RFC 1122
 --------
 
-There is one important feature missing the NuttX TCP that 
-can help when there is no write buffering: Without write 
-buffering send() will not return until the transfer has 
-been ACKed by the recipient. But under RFC 1122, the host 
-need not ACK each packet immediately; the host may wait 
-for 500 MS before ACKing. This combination can cause very 
-slow performance when small, non-buffered transfers are 
-made to an RFC 1122 client. However, the RFC 1122 must 
-ACK at least every second (odd) packet so sequences of 
-packets with write buffering enabled do not suffer from 
+There is one important feature missing the NuttX TCP that
+can help when there is no write buffering: Without write
+buffering send() will not return until the transfer has
+been ACKed by the recipient. But under RFC 1122, the host
+need not ACK each packet immediately; the host may wait
+for 500 MS before ACKing. This combination can cause very
+slow performance when small, non-buffered transfers are
+made to an RFC 1122 client. However, the RFC 1122 must
+ACK at least every second (odd) packet so sequences of
+packets with write buffering enabled do not suffer from
 this problem.
 
-`Update:  RFC 1122 support was added to the NuttX TCP 
-stack with commit 66ef6d143a627738ad7f3ce1c065f9b1f3f303b0 
-in December of 2019.  That, however, that affects only 
-received packet ACK behavior and has no impact on transmitted 
+`Update:  RFC 1122 support was added to the NuttX TCP
+stack with commit 66ef6d143a627738ad7f3ce1c065f9b1f3f303b0
+in December of 2019.  That, however, that affects only
+received packet ACK behavior and has no impact on transmitted
 packet performance; write buffering is still recommended.`
 
 TCPBlaster
 ----------
 
-I created a new test application at ``apps/examples/tcpblaster`` to 
-measure TCP performance and collected some data for the 
-configuration that happens to be on my desk. The `tcpblaster` 
-test gives you the read and write transfer rates in ``Kb/sec`` 
-(I won't mention the numbers because I don't believe they 
-would translate any other setup and, hence, would be 
+I created a new test application at ``apps/examples/tcpblaster`` to
+measure TCP performance and collected some data for the
+configuration that happens to be on my desk. The `tcpblaster`
+test gives you the read and write transfer rates in ``Kb/sec``
+(I won't mention the numbers because I don't believe they
+would translate any other setup and, hence, would be
 misleading).
 
-There is a nifty `TCP Throughput Tool <https://www.switch.ch/network/tools/tcp_throughput/>`_ 
-that gives some theoretical upper limits on performance. 
-The tool needs to know the ``MSS`` (which is the Ethernet 
-packet size that you configured minus the size of the 
-Ethernet header, 14), the round-trip time (``RTT``)in 
-milliseconds (which you can 
-get from the Linux host ping), and a loss constant (which 
-I left at the default). With these values, I can determine 
-that the throughput for the NuttX TCP stack is approximately 
-at the theoretical limits. You should not be able to do 
-better any better than that (actually, it performs above 
-the theoretical limit, but I suppose that is why it is 
+There is a nifty `TCP Throughput Tool <https://www.switch.ch/network/tools/tcp_throughput/>`_
+that gives some theoretical upper limits on performance.
+The tool needs to know the ``MSS`` (which is the Ethernet
+packet size that you configured minus the size of the
+Ethernet header, 14), the round-trip time (``RTT``)in
+milliseconds (which you can
+get from the Linux host ping), and a loss constant (which
+I left at the default). With these values, I can determine
+that the throughput for the NuttX TCP stack is approximately
+at the theoretical limits. You should not be able to do
+better any better than that (actually, it performs above
+the theoretical limit, but I suppose that is why it is
 "theoretical").
 
-So, If you are unhappy with your network performance, the I 
-suggest you run the `tcpblaster` test, use that data 
-(along with the ``RTT`` from ping) with the 
-`TCP Throughput Tool <https://www.switch.ch/network/tools/tcp_throughput/>`_. 
-If you are still unhappy with the performance, don't go 
-immediately pointing fingers at the stack (which everyone does). 
-Instead, you should focus on optimizing your network 
-configuration settings and reviewing the buffer handling 
+So, If you are unhappy with your network performance, the I
+suggest you run the `tcpblaster` test, use that data
+(along with the ``RTT`` from ping) with the
+`TCP Throughput Tool <https://www.switch.ch/network/tools/tcp_throughput/>`_.
+If you are still unhappy with the performance, don't go
+immediately pointing fingers at the stack (which everyone does).
+Instead, you should focus on optimizing your network
+configuration settings and reviewing the buffer handling
 of the Ethernet driver in you MCU.
 
-If you do discover any significant performance issues 
-with the stack I will of course gladly help you resolve 
-them. Or if you have ideas for improved performance, 
+If you do discover any significant performance issues
+with the stack I will of course gladly help you resolve
+them. Or if you have ideas for improved performance,
 I would also be happy to hear those.
 
 What about Receive Performance?
 -------------------------------
 
-All of the above discussion concerns `transmit performance`, 
-i.e., "How fast can we send data over the network?" The other 
-side is receive performance. Receive performance is very 
-different thing. In this case it is the remote peer who is 
-in complete control of the rate at which packets appear on 
-the network and, hence, responsible for all of the raw bit 
+All of the above discussion concerns `transmit performance`,
+i.e., "How fast can we send data over the network?" The other
+side is receive performance. Receive performance is very
+different thing. In this case it is the remote peer who is
+in complete control of the rate at which packets appear on
+the network and, hence, responsible for all of the raw bit
 transfer rates.
 
-However, we might also redefine performance as the number of 
-bytes that were `successfully` transferred. In order for the 
-bytes to be successfully transferred they must be successfully 
-received and processed on the NuttX target. If we fail in 
-this if the packet is `lost` or `dropped`. A packet is lost if 
-the network driver is not prepared to receive the packet when 
-it was sent. A packet is dropped by the network if it is 
-received but could not be processed either because there 
-is some logical issue with the packet (not the case here) 
+However, we might also redefine performance as the number of
+bytes that were `successfully` transferred. In order for the
+bytes to be successfully transferred they must be successfully
+received and processed on the NuttX target. If we fail in
+this if the packet is `lost` or `dropped`. A packet is lost if
+the network driver is not prepared to receive the packet when
+it was sent. A packet is dropped by the network if it is
+received but could not be processed either because there
+is some logical issue with the packet (not the case here)
 or if we have no space to buffer the newly received packet.
 
-If a TCP packet is lost or dropped, then the penalty is big: 
-The packet will not be ACKed, the remote peer may send a 
-few more out-of-sequence packets which will also be dropped. 
-Eventually, the remote peer will time out and retransmit 
+If a TCP packet is lost or dropped, then the penalty is big:
+The packet will not be ACKed, the remote peer may send a
+few more out-of-sequence packets which will also be dropped.
+Eventually, the remote peer will time out and retransmit
 the data from the point of the lost packet.
 
-There is logic in the TCP protocol to help manage these data 
-overruns. The TCP header includes a TCP `receive window` which 
-tells the remote peer how much data the receiver is able to 
-buffer. This value is sent in the ACK to each received 
-packet. If well tuned, this receive window could possibly 
-prevent packets from being lost due to the lack of 
-read-ahead storage. This is a little better. The remote 
-peer will hold off sending data instead of timing out and 
-re-transmitting. But this is still a loss of performance; 
-the gap between the transfer of packets caused by the hold-off 
+There is logic in the TCP protocol to help manage these data
+overruns. The TCP header includes a TCP `receive window` which
+tells the remote peer how much data the receiver is able to
+buffer. This value is sent in the ACK to each received
+packet. If well tuned, this receive window could possibly
+prevent packets from being lost due to the lack of
+read-ahead storage. This is a little better. The remote
+peer will hold off sending data instead of timing out and
+re-transmitting. But this is still a loss of performance;
+the gap between the transfer of packets caused by the hold-off
 will result in a reduced transfer rate.
 
-So the issues for good reception are buffering and processing 
-time. Buffering again applies to handling within the driver 
-but unlike the transmit performance, this is not typically 
-the bottleneck. And there is also a NuttX configuration 
-option that controls `read-ahead` buffering of TCP packets. 
-The buffering in the driver must be optimized to avoid lost 
-packets; the ` buffering can be tuned to minimize 
+So the issues for good reception are buffering and processing
+time. Buffering again applies to handling within the driver
+but unlike the transmit performance, this is not typically
+the bottleneck. And there is also a NuttX configuration
+option that controls `read-ahead` buffering of TCP packets.
+The buffering in the driver must be optimized to avoid lost
+packets; the ` buffering can be tuned to minimize
 the number packets dropped because we have no space to buffer them.
 
-But the key to receive perform is management of processing 
-delays. Small processing delays can occur in the network 
-driver or in the TCP stack. But the major source of 
-processing delay is the application which is the ultimate 
-consumer of the incoming data. Imagine, for example, 
-and FTP application that is receiving a file over a 
-TCP and writing the file into FLASH memory. The primary 
-bottleneck here will be the write to FLASH memory which 
+But the key to receive perform is management of processing
+delays. Small processing delays can occur in the network
+driver or in the TCP stack. But the major source of
+processing delay is the application which is the ultimate
+consumer of the incoming data. Imagine, for example,
+and FTP application that is receiving a file over a
+TCP and writing the file into FLASH memory. The primary
+bottleneck here will be the write to FLASH memory which
 is out of the control of software.
 
-We obtain optimal receive performance when the processing 
-delays keep up with the rate of the incoming packets. 
-If the processing data rate is even slightly slower 
-then the receive data rate, then there will be a 
-growing `backlog` of buffered, incoming data to be 
-processed. If this backlog continues to grow then 
-eventually our ability to buffer data will be exhausted, 
-packets will be held off or dropped, and performance 
-will deteriorate. In an environment where a high-end, 
-remote peer is interacting with the low-end, embedded 
-system, that remote peer can easily overrun the 
-embedded system due to the embedded system's limited 
-buffering space, its much lower processing capability, 
-and its slower storage peripherals.
\ No newline at end of file
+We obtain optimal receive performance when the processing
+delays keep up with the rate of the incoming packets.
+If the processing data rate is even slightly slower
+then the receive data rate, then there will be a
+growing `backlog` of buffered, incoming data to be
+processed. If this backlog continues to grow then
+eventually our ability to buffer data will be exhausted,
+packets will be held off or dropped, and performance
+will deteriorate. In an environment where a high-end,
+remote peer is interacting with the low-end, embedded
+system, that remote peer can easily overrun the
+embedded system due to the embedded system's limited
+buffering space, its much lower processing capability,
+and its slower storage peripherals.
+
+Performance tips
+-------------------------------
+
+If you are chasing throughput on ARM boards (for example STM32H5),
+double-check which libc string implementation is active. In some
+configurations, ``memcpy()``/``memset()`` may fall back to the generic
+implementation, while the architecture-optimized implementation is gated by
+``CONFIG_ARMV8M_STRING_FUNCTION`` or similar.
+
+Practical tuning checklist:
+
+- Enable ``CONFIG_ARMV8M_STRING_FUNCTION`` or similar when using an ARM target
+  and GNU toolchain.
+- Rebuild and re-run throughput tests (for example with ``iperf``) for both
+  TCP and UDP after changing this option.
+- Validate CPU headroom during the test (for example with ``top`` or idle
+  metrics), not only link throughput.
+
+In one STM32H5 test setup, enabling ``CONFIG_ARMV8M_STRING_FUNCTION`` improved
+performance from roughly ~50 Mbit/s to about 95 Mbit/s (UDP) and 84 Mbit/s
+(TCP), with significant idle time still available during ``iperf``.