How to make your C++ code faster with extended stack space

Suppose we write a function like this

void f() {
  std::array<double, 100> my_array;
  // operate on my_array
}

Where will my_array be located? The compiler will generate code to set aside a region of memory on the call stack for my_array. This is much more efficient than corresponding code where memory would be taken from the heap

void f_p() {
  std::unique_ptr<double[]> my_array_p{new double[100]};
  // operate on my_array_p
}

Why?

  • Allocating and deallocations of the stack can be done by incrementing and decrementing a counter tracking the top of the stack. By contrast, heap memory is managed via data structures tracking free blocks of memory. Allocating requires traversing the data structures to locate a suitable block and then updating the data structures to indicate that the region of memory used is no longer available.

  • Successive regions of memory allocated from the stack will be co-located. By locating the regions next to each other, we will get fewer cache line misses when we access or modify the memory.

But what happens if we don’t know how big the array needs to be at compile-time?

void f(int n) {
  // we need an array with space for at least n elements
  // ...
}

Can we still allocate the memory from the stack?

The very special function alloca gives us a way. We can write

void f(int n) {
  auto my_array = static_cast<double*>(alloca(sizeof(double)*n));
  // ...
}

But we need to be careful. If n is too large, we could exceed the available stack space and crash our program.

Furthermore, alloca is inconvenient to work with. alloca only operates within the current function, making it impossible to augment it with helper functions.

Fortunately, there’s a solution that provides all the benefits of alloca while being much safer and usable. Here’s how it works:

First we write a simple class to manage a stack of bytes

template <size_t N>
class stack {
 public:
  void* top() noexcept { return static_cast<void*>(data_.data() + size_); }

  void* allocate(size_t num_bytes) noexcept {
    void* ptr = this->top();
    size_ += num_bytes;
    return ptr;
  }

  void deallocate(size_t num_bytes) noexcept {
    size_ -= num_bytes;
  }
 private:
   std::array<std::byte, N> data_{};
   size_t size_{0};
};

We then use the class to define a thread local global variable that acts as extended stack space

constexpr size_t stack_extension_size_v = 1ull << 21;
thread_local constinit stack<stack_extension_size_v> stack_extension_v{};

Now we define a custom memory resource inheriting from std::pmr::memory_resource. The custom resource allocates from the extended stack space and deallocates everything in its destructor. If the stack is full, it falls back to the provided upstream resource.

class stackext_resource final : public std::pmr::memory_resource {
 public:
  explicit stackext_resource(std::pmr::memory_resource* upstream =
                                 std::pmr::get_default_resource()) noexcept;

  ~stackext_resource() noexcept override {
    // deallocate memory taken from stack_extension_v
  }

 private:
  std::pmr::monotonic_buffer_resource upstream_;
  size_t size_{0};

  // std::pmr::memory_resource
  void* do_allocate(size_t num_bytes, size_t alignment) noexcept override {
    // take memory from stack_extension_v if possible; otherwise, from upstream
  }

  void do_deallocate(void* /*ptr*/, size_t /*num_bytes*/,
                     size_t /*alignment*/) noexcept override {
    // do nothing
  }

  bool do_is_equal(
      const std::pmr::memory_resource& other) const noexcept override {
    return this == &other;
  }
};

With the custom resource, we can easily configure standard containers to allocate from our extended stack

void f() {
  stackext_resource resource;
  std::pmr::map<std::pmr::string, my_struct> m{&resource};
  // when we fill m, it will take memory from the extended stack space
}

Like alloca, stackext_resource only requires simple arithmetic operations to manage allocations and provides excellent locality. But unlike alloca, we can easily use it with standard data structures and there’s no danger of exceeding available stack space and crashing the program.

For full source code, check out the project bbai-mem.

Stay up to date