<?xml version="1.0" ?>
<rss version="2.0">
	<channel>
		<title><![CDATA[ Computer Architecture Letters - new TOC ]]></title>
		<link>http://ieeexplore.ieee.org</link>
		<description>TOC Alert for Publication# 10208 </description>
		<year>2008</year>
		<month>July</month>
		<day>30</day>
		<item>
			<title><![CDATA[Physical Register Reference Counting]]></title>
			<link><![CDATA[http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=4531780&arnumber=4384571]]></link>
			<description><![CDATA[Several recently proposed techniques including CPR (Checkpoint Processing and Recovery) and NoSQ (No Store Queue) rely on reference counting to manage physical registers. However, the register reference counting mechanism itself has received surprisingly little attention. This paper fills this gap by describing potential register reference counting schemes for NoSQ, CPR, and a hypothetical NoSQ/CPR hybrid. Although previously described in terms of binary counters, we find that reference counts are actually more naturally represented as matrices. Binary representations can be used as an optimization in specific situations.]]></description>
			<pubDate><![CDATA[Jan.  2008]]></pubDate>
			<guid><![CDATA[http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=4531780&arnumber=4384571]]></guid>
			<volume>7</volume>
			<issue>1</issue>
			<startPage>9</startPage>
			<endPage>12</endPage>
			<fileSize>88</fileSize>
			<authors><![CDATA[Roth, Amir;]]></authors>
		</item>
		<item>
			<title><![CDATA[Logic-Based Distributed Routing for NoCs]]></title>
			<link><![CDATA[http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=4531780&arnumber=4407676]]></link>
			<description><![CDATA[The design of scalable and reliable interconnection networks for multicore chips (NoCs) introduces new design constraints like power consumption, area, and ultra low latencies. Although 2D meshes are usually proposed for NoCs, heterogeneous cores, manufacturing defects, hard failures, and chip virtualization may lead to irregular topologies. In this context, efficient routing becomes a challenge. Although switches can be easily configured to support most routing algorithms and topologies by using routing tables, this solution does not scale in terms of latency and area. We propose a new circuit that removes the need for using routing tables. The new mechanism, referred to as Logic-Based Distributed Routing (LBDR), enables the implementation in NoCs of many routing algorithms for most of the practical topologies we might find in the near future in a multicore chip. From an initial topology and routing algorithm, a set of three bits per switch output port is computed. By using a small logic block, LBDR mimics (demonstrated by evaluation) the behavior of routing algorithms implemented with routing tables. This result is achieved both in regular and irregular topologies. Therefore, LBDR removes the need for using routing tables for distributed routing, thus enabling flexible, fast and power-efficient routing in NoCs.]]></description>
			<pubDate><![CDATA[Jan.  2008]]></pubDate>
			<guid><![CDATA[http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=4531780&arnumber=4407676]]></guid>
			<volume>7</volume>
			<issue>1</issue>
			<startPage>13</startPage>
			<endPage>16</endPage>
			<fileSize>115</fileSize>
			<authors><![CDATA[Flich, Jose;Duato, Jose;]]></authors>
		</item>
		<item>
			<title><![CDATA[Corollaries to Amdahl's Law for Energy]]></title>
			<link><![CDATA[http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=4531780&arnumber=4407677]]></link>
			<description><![CDATA[This paper studies the important interaction between parallelization and energy consumption in a parallelizable application. Given the ratio of the serial and parallel portions of an application and the number of processors, we first derive the optimal frequencies allocated to the serial and parallel regions of the application to minimize the total energy consumption while the execution time is preserved (i.e., speedup = 1). We show that the dynamic energy improvement due to parallelization rises faster with an increasing number of processors than the speed improvement given by the well-known Amdahl's Law. Furthermore, we determine the conditions under which one can obtain both energy and speed improvement, as well as the amount of improvement. The formulas we obtain capture the fundamental relationship between parallelization, speedup, and energy consumption and can be directly utilized in energy-aware processor resource management. Our results form a basis for several interesting research directions in the area of power- and energy-aware parallel processing.]]></description>
			<pubDate><![CDATA[Jan.  2008]]></pubDate>
			<guid><![CDATA[http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=4531780&arnumber=4407677]]></guid>
			<volume>7</volume>
			<issue>1</issue>
			<startPage>25</startPage>
			<endPage>28</endPage>
			<fileSize>185</fileSize>
			<authors><![CDATA[Cho, Sangyeun;Melhem, Rami;]]></authors>
		</item>
		<item>
			<title><![CDATA[Computing Accurate AVFs using ACE Analysis on Performance Models: A Rebuttal]]></title>
			<link><![CDATA[http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=4531780&arnumber=4408574]]></link>
			<description><![CDATA[ACE (Architecturally Correct Execution) analysis computes AVFs (architectural vulnerability factors) of hardware structures. AVF expresses the fraction of radiation-induced transient faults that result in a user-visible error. Architects usually perform this analysis on a high-level performance model to quickly compute per-structure AVFs. If, however, low-level details of a microarchitecture are not modeled appropriately in a performance model, then their effects may not be reflected in the per-structure AVFs. In this paper we refute Wang et al.'s claim that this detail is difficult to model and imposes a practical threshold on ACE analysis that forces its estimates to have a high error margin. We show that carefully choosing a small amount of additional detail can result in a much tighter AVF bound than Wang et al. were able to achieve in their refined ACE analysis. Even the inclusion of small details, such as read/write pointers and appropriate inter-structure dependencies, can increase the accuracy of the AVF computation by 40% or more. We argue that this is no different than modeling the IPC (instructions per cycle) of a microprocessor pipeline. A less detailed performance model will provide less accurate IPCs. AVFs are no different.]]></description>
			<pubDate><![CDATA[Jan.  2008]]></pubDate>
			<guid><![CDATA[http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=4531780&arnumber=4408574]]></guid>
			<volume>7</volume>
			<issue>1</issue>
			<startPage>21</startPage>
			<endPage>24</endPage>
			<fileSize>108</fileSize>
			<authors><![CDATA[Biswas, Arijit;Racunas, Paul;Emer, Joel;Mukherjee, Shubhendu;]]></authors>
		</item>
		<item>
			<title><![CDATA[Chameleon: A High Performance Flash/FRAM Hybrid Solid State Disk Architecture]]></title>
			<link><![CDATA[http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=4531780&arnumber=4450603]]></link>
			<description><![CDATA[Flash memory solid state disk (SSD) is gaining popularity and replacing hard disk drive (HDD) in mobile computing systems such as ultra mobile PCs (UMPCs) and notebook PCs because of lower power consumption, faster random access, and higher shock resistance. One of the key challenges in designing a high-performance flash memory SSD is the efficient handling of small random writes to non-volatile data, whose performance suffers from the inherent limitation of flash memory that prohibits in-place update. In this paper, we propose a high performance Flash/FRAM hybrid SSD architecture called Chameleon. In Chameleon, metadata used by the flash translation layer (FTL), a software layer in the flash memory SSD, is maintained in a small FRAM since this metadata is a target of intensive small random writes, whereas the bulk data is kept in the flash memory. Performance evaluation based on an FPGA implementation of the Chameleon architecture shows that the use of FRAM in Chameleon improves the performance by 21.3%. The results also show that even for bulk data that cannot be maintained in FRAM because of the size limitation, the use of fine-grained write buffering is critically important because of the inability of flash memory to perform in-place update.]]></description>
			<pubDate><![CDATA[Jan.  2008]]></pubDate>
			<guid><![CDATA[http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=4531780&arnumber=4450603]]></guid>
			<volume>7</volume>
			<issue>1</issue>
			<startPage>17</startPage>
			<endPage>20</endPage>
			<fileSize>173</fileSize>
			<authors><![CDATA[Yoon, Jin Hyuk;Nam, Eyee Hyun;Seong, Yoon Jae;Kim, Hongseok;Kim, Bryan;Min, Sang Lyul;Cho, Yookun;]]></authors>
		</item>
		<item>
			<title><![CDATA[An Energy-Efficient Processor Architecture for Embedded Systems]]></title>
			<link><![CDATA[http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=4531780&arnumber=4484578]]></link>
			<description><![CDATA[We present an efficient programmable architecture for compute-intensive embedded applications. The processor architecture uses instruction registers to reduce the cost of delivering instructions, and a hierarchical and distributed data register organization to deliver data. Instruction registers capture instruction reuse and locality in inexpensive storage structures that are located near the functional units. The data register organization captures reuse and locality in different levels of the hierarchy to reduce the cost of delivering data. Exposed communication resources eliminate pipeline registers and control logic, and allow the compiler to schedule efficient instruction and data movement. The architecture keeps a significant fraction of instruction and data bandwidth local to the functional units, which reduces the cost of supplying instructions and data to large numbers of functional units. This architecture achieves an energy efficiency that is 23× greater than an embedded RISC processor.]]></description>
			<pubDate><![CDATA[Jan.  2008]]></pubDate>
			<guid><![CDATA[http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=4531780&arnumber=4484578]]></guid>
			<volume>7</volume>
			<issue>1</issue>
			<startPage>29</startPage>
			<endPage>32</endPage>
			<fileSize>157</fileSize>
			<authors><![CDATA[Balfour, James;Dally, William;Black-Schaffer, David;Parikh, Vishal;Park, JongSoo;]]></authors>
		</item>
		<item>
			<title><![CDATA[Dynamic Predication of Indirect Jumps]]></title>
			<link><![CDATA[http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=4531780&arnumber=4531781]]></link>
			<description><![CDATA[Indirect jumps are used to implement increasingly common programming language constructs such as virtual function calls, switch-case statements, jump tables, and interface calls. Unfortunately, the prediction accuracy of indirect jumps has remained low because many indirect jumps have multiple targets that are difficult to predict even with specialized hardware. This paper proposes a new way of handling hard-to-predict indirect jumps: dynamically predicating them. The compiler identifies indirect jumps that are suitable for predication along with their control-flow merge (CFM) points. The microarchitecture predicates the instructions between different targets of the jump and its CFM point if the jump turns out to be hard-to-predict at run time. We describe the new indirect jump predication architecture, provide code examples showing why it could reduce the performance impact of jumps, derive an analytical cost-benefit model for deciding which jumps and targets to predicate, and present preliminary evaluation results.]]></description>
			<pubDate><![CDATA[Jan.  2008]]></pubDate>
			<guid><![CDATA[http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=4531780&arnumber=4531781]]></guid>
			<volume>7</volume>
			<issue>1</issue>
			<startPage>1</startPage>
			<endPage>4</endPage>
			<fileSize>136</fileSize>
			<authors><![CDATA[Joao, José A.;Mutlu, Onur;Kim, Hyesoon;Patt, Yale N.;]]></authors>
		</item>
		<item>
			<title><![CDATA[Microarchitectures for Managing Chip Revenues under Process Variations]]></title>
			<link><![CDATA[http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=4531780&arnumber=4531782]]></link>
			<description><![CDATA[As transistor feature sizes continue to shrink into the sub-90nm range and beyond, the effects of process variations on critical path delay and chip yields have amplified. A common concept to remedy the effects of variation is speed-binning, by which chips from a single batch are rated by a discrete range of frequencies and sold at different prices. In this paper, we discuss strategies to modify the number of chips in different bins and hence enhance the profits obtained from them. Particularly, we propose a scheme that introduces a small Substitute Cache associated with each cache way to replicate the data elements that will be stored in the high latency lines. Assuming a fixed pricing model, this method increases the revenue by as much as 13.8% without any impact on the performance of the chips.]]></description>
			<pubDate><![CDATA[Jan.  2008]]></pubDate>
			<guid><![CDATA[http://ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=4531780&arnumber=4531782]]></guid>
			<volume>7</volume>
			<issue>1</issue>
			<startPage>5</startPage>
			<endPage>8</endPage>
			<fileSize>118</fileSize>
			<authors><![CDATA[Das, Abhishek;Ozdemir, Serkan;Memik, Gokhan;Zambreno, Joseph;Choudhary, Alok;]]></authors>
		</item>
	</channel>
</rss>