Emerging byte-addressable Non-Volatile Memory (NVM) technology, although promising superior memory density and ultra-low energy consumption, poses unique challenges to achieving persistent data privacy and computing security, both of which are critically important to the embedded and IoT applications. Specifically, to successfully restore NVMs to their working states after unexpected system crashes or power failure, maintaining and recovering all the necessary security-related metadata can severely increase memory traffic, degrade runtime performance, exacerbate write endurance problem, and demand costly hardware changes to off-the-shelf processors. In this article, we designed and implemented ARES, a new FPGA-assisted processor-transparent security mechanism that aims at efficiently and effectively achieving all three aspects of a security triad—confidentiality, integrity, and recoverability—in modern embedded computing. Given the growing prominence of CPU-FPGA heterogeneous computing architectures, ARES leverages FPGA’s hardware reconfigurability to offload performance-critical and security-related functions to the programmable hardware without microprocessors’ involvement. In particular, recognizing that the traditional Merkle tree caching scheme cannot fully exploit FPGA’s parallelism due to its sequential and recursive function calls, we (1) proposed a Merkle tree cache architecture that partitions a unified cache into multiple levels with parallel accesses and (2) further designed a novel Merkle tree scheme that flattened and reorganized the computation in the traditional Merkle tree verification and update processes to fully exploit the parallel cache ports and to fully pipeline time-consuming hashing operations. Beyond that, to accelerate the metadata recovery process, multiple parallel recovery units are instantiated to recover counter metadata and multiple Merkle sub-trees. Our hardware prototype of the ARES system on a Xilinx U200 platform shows that ARES achieved up to 1.4× lower latency and 2.6× higher throughput against the baseline implementation, while metadata recovery time was shortened by 1.8 times. When integrated with an embedded processor, neither hardware changes nor software changes are required. We also developed a theoretical framework to analytically model and explain experimental results.