This paper focuses on how to design a store buffer (STB) well suited to first-level multibanked data caches. The goal is to forward data from in-flight stores to dependent loads within the latency of a cache bank. Taking into account the store lifetime in the processor pipeline and the data forwarding behavior, we propose a two-level STB design in which forwarding is done speculatively from a distributed first-level STB made of extremely small banks, whereas a centralized second-level STB enforces correct store-load ordering. In addition, the two-level STB admits two simplifications that leave performance almost unchanged. For the second-level STB, we suggest removing its data forwarding capability; for the first-level STB, it is possible to 1) remove the instruction age checking and 2) compare only the least significant address bits. Our experiments cover both integer and floating-point codes executing on dynamically scheduled processors. Following our guidelines and running SPEC-2K on an 8-way processor, a two-level STB with four 8-entry banks in the first level performs similarly to an ideal single-level STB with 128-entry banks working at the first-level cache latency. We also show that the proposed two-level design is suitable for a memory-latency-tolerant processor.
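The division of work between the two levels can be illustrated with a small behavioral model. The following Python sketch is only an illustration under stated assumptions, not the design evaluated in the paper: the class names (`Store`, `Stb1Bank`, `Stb2`), the 12-bit partial address compare, the eviction of the oldest entry when a bank is full, and the check based on comparing store ages are all hypothetical choices made here for clarity. It shows a first-level bank that forwards speculatively without age checking and with a partial address compare, and a second-level STB that holds no forwarding path and only decides whether the speculation was correct.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class Store:
    age: int   # program-order sequence number (hypothetical encoding)
    addr: int  # full address
    data: int


class Stb1Bank:
    """Small first-level STB bank: speculative forwarding only."""

    def __init__(self, entries: int = 8, low_bits: int = 12):
        # Assumption: a full bank silently drops its oldest entry.
        self.entries = deque(maxlen=entries)
        self.mask = (1 << low_bits) - 1

    def insert(self, store: Store) -> None:
        self.entries.append(store)

    def forward(self, load_addr: int) -> Optional[Store]:
        # No age check: the most recently inserted match wins.
        # Partial compare: only the low-order address bits are used.
        for st in reversed(self.entries):
            if (st.addr & self.mask) == (load_addr & self.mask):
                return st
        return None


class Stb2:
    """Centralized second-level STB: ordering check, no data forwarding."""

    def __init__(self):
        self.stores = []

    def insert(self, store: Store) -> None:
        self.stores.append(store)

    def verify(self, load_age: int, load_addr: int,
               forwarded: Optional[Store]) -> bool:
        # Find the youngest store older than the load, using full addresses.
        older = [s for s in self.stores
                 if s.addr == load_addr and s.age < load_age]
        expected = max(older, key=lambda s: s.age, default=None)
        expected_age = expected.age if expected is not None else None
        forwarded_age = forwarded.age if forwarded is not None else None
        # A mismatch means the speculative forwarding was wrong and the
        # load would need some recovery action (not modeled here).
        return expected_age == forwarded_age
```

In this model a load first probes its `Stb1Bank` and uses whatever `forward` returns, then `Stb2.verify` is consulted later. Two cases produce a failed check: an address alias in the low-order bits, and a match against a store that is younger than the load. Both are consequences of the simplifications applied to the first level.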