2015 journal article

Control-Flow Decoupling: An Approach for Timely, Non-Speculative Branching


By: R. Sheikh*, J. Tuck & E. Rotenberg

author keywords: Microarchitecture; software/hardware codesign; branch prediction; predication; pre-execution; separable branches; isa extensions; instruction level parallelism
Source: Web Of Science
Added: August 6, 2018

Mobile and PC/server class processor companies continue to roll out flagship core microarchitectures that are faster than their predecessors. Meanwhile placing more cores on a chip coupled with constant supply voltage puts per-core energy consumption at a premium. Hence, the challenge is to find future microarchitecture optimizations that not only increase performance but also conserve energy. Eliminating branch mispredictions—which waste both time and energy—is valuable in this respect. In this paper, we explore the control-flow landscape by characterizing mispredictions in four benchmark suites. We find that a third of mispredictions-per-1K-instructions (MPKI) come from what we call separable branches: branches with large control-dependent regions (not suitable for if-conversion), whose backward slices do not depend on their control-dependent instructions or have only a short dependence. We propose control-flow decoupling (CFD) to eradicate mispredictions of separable branches. The idea is to separate the loop containing the branch into two loops: the first contains only the branch’s predicate computation and the second contains the branch and its control-dependent instructions. The first loop communicates branch outcomes to the second loop through an architectural queue. Microarchitecturally, the queue resides in the fetch unit to drive timely, non-speculative branching. On a microarchitecture configured similar to Intel’s Sandy Bridge core, CFD increases performance by up to 55 percent, and reduces energy consumption by up to 49 percent (for CFD regions). Moreover, for some applications, CFD is a necessary catalyst for future complexity-effective large-window architectures to tolerate memory latency.