Global financial markets process trillions of dollars daily, and in high-frequency trading (HFT) profitability depends on executing orders at microsecond-scale latency. CPU-based trading engines are often bottlenecked by sequential execution and memory access latency. This project targets ultra-low-latency, high-throughput execution of large trade volumes in real time. By moving past the limits of software-only engines, it offers a scalable design for financial institutions and traders that depend on speed and reliability. The system uses a hybrid hardware-software architecture. On the FPGA, a Tensor Processing Unit (TPU) runs a lightweight neural network for intelligent order placement, and a novel MultiQueue BRAM architecture performs deterministic, parallel order matching. On the host, more than 100,000 orders are held in lock-free Red-Black Trees for cache-efficient, large-scale storage. A reinforcement learning agent trained with Proximal Policy Optimization (PPO) dynamically ranks trades to decide which ones are cached in the FPGA's local memory.
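
As a rough illustration of the caching step only, the sketch below assumes the ranking agent emits one scalar score per order and that the host keeps the highest-scoring orders, up to the capacity of the FPGA's local memory. The `Order` fields, the `select_for_cache` helper, and the example scores are hypothetical placeholders; the PPO policy, the Red-Black Tree storage, and the host-to-FPGA transfer are not modeled here.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Order:
    # Hypothetical order record; the real system's fields are not specified.
    order_id: int
    side: str       # "buy" or "sell"
    price: float
    quantity: int


def select_for_cache(orders, scores, capacity):
    """Return the `capacity` highest-scoring orders, best first.

    `scores` stands in for the output of the ranking agent (assumed to be
    one float per order); how those scores are produced is out of scope.
    """
    if len(orders) != len(scores):
        raise ValueError("orders and scores must have the same length")
    # nlargest runs in O(n log k), which suits a small cache (k) drawn
    # from a large host-side order set (n).
    ranked = heapq.nlargest(capacity, zip(scores, orders), key=lambda pair: pair[0])
    return [order for _, order in ranked]


orders = [
    Order(1, "buy", 100.0, 10),
    Order(2, "sell", 101.0, 5),
    Order(3, "buy", 99.5, 20),
    Order(4, "sell", 100.5, 8),
]
scores = [0.2, 0.9, 0.4, 0.7]  # made-up scores for illustration

cached = select_for_cache(orders, scores, capacity=2)
print([order.order_id for order in cached])  # [2, 4]
```

Selecting with a bounded heap rather than fully sorting the order set keeps the host-side cost low when the cache is much smaller than the 100,000-plus orders held on the host, though the actual selection mechanism in the system may differ.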
