Achieving high-performance while reducing power consumption is a key concern as technology scaling is reaching its limits. It is well-accepted that application-specific custom hardware can achieve orders of magnitude improvements in efficiency. The question is whether such efficiency can be maintained while providing enough flexibility to implement a broad class of operations. In this paper, we aim to answer this question for the domain of matrix computations. We propose a design of a novel linear algebra core and demonstrate that it can achieve orders of magnitude improvements in efficiency for matrix-matrix multiplication, an operation that is indicative for a broad class of matrix computations. A feasibility study shows that 47 double- and 104 single-precision GFLOPS/W can be achieved in 19.5 and 15.6 GFLOPS/mm2, respectively with current components and standard 45nm technology.