Radio telescopes typically consist of multiple receivers whose signals are cross-correlated to filter out noise. A recent trend is to correlate in software instead of in custom-built hardware, taking advantage of the flexibility that software solutions offer. However, the data rates are usually high and the processing requirements challenging. Many-core processors are promising devices to provide the required processing power. In this article, we explain how to implement and optimize signal-processing applications on multicore CPUs and many-core architectures, such as the Intel Core i7, NVIDIA and ATI graphics processing units (GPUs), and the Cell/BE. We use correlation as a running example. The correlator is a streaming, possibly real-time application, and is much more input/output (I/O) intensive than applications that are typically implemented on many-core hardware today. We compare these implementations with the LOFAR production correlator on an IBM Blue Gene/P (BG/P) supercomputer. We discuss several important architectural problems that cause architectures to perform suboptimally, and we also address programmability.