7. CONCLUSIONS
In this work we address barriers to efficient DRAM operation, a key requirement of many-core architectures. We propose a novel approach that coordinates last-level cache and DRAM policies, significantly improving both system performance and energy characteristics. We modify existing structures and mechanisms to allow greater sharing of system state across units, thereby enabling better scheduling decisions.
Specifically, we expand the memory controller's visibility into the last-level cache, greatly increasing write scheduling opportunities. This enables several improvements in system behavior. We are able to increase page-mode writes to DRAM, which both decreases power consumption and increases memory bus efficiency. The longer write bursts achieved through Scheduled Writebacks amortize bus turnaround penalties, increasing bus utilization. The larger effective write queue improves read/write prioritization in the DRAM scheduler, allowing read bursts to proceed uninhibited by write conflicts for longer periods.
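The scheduling behavior recapped above can be sketched as a simple two-watermark drain policy: reads always win until enough writebacks have pooled to justify a burst, and the drain prefers writes to the currently open DRAM row so that the burst stays in page mode. This is a minimal illustrative sketch, not the paper's implementation; the class name, watermark values, and command tuples are assumptions chosen for clarity.

```python
# Illustrative sketch of write-drain scheduling with page-mode
# preference. Thresholds and names are assumptions, not the
# Virtual Write Queue implementation itself.
from collections import deque


class WriteDrainScheduler:
    HIGH_WATERMARK = 4  # queue depth that triggers a write drain (illustrative)
    LOW_WATERMARK = 1   # depth at which the drain stops (illustrative)

    def __init__(self):
        self.write_queue = []   # pending writebacks as (addr, dram_row)
        self.last_row = None    # row of the most recently issued write
        self.draining = False

    def add_write(self, addr, row):
        self.write_queue.append((addr, row))

    def _pop_write(self):
        # Prefer a write to the last-written row: it stays in page mode,
        # avoiding an activate/precharge between consecutive writes.
        for i, (addr, r) in enumerate(self.write_queue):
            if r == self.last_row:
                return self.write_queue.pop(i)
        addr, r = self.write_queue.pop(0)
        self.last_row = r
        return (addr, r)

    def next_command(self, pending_reads):
        """Reads have priority unless enough writes pooled to justify a burst."""
        if len(self.write_queue) >= self.HIGH_WATERMARK:
            self.draining = True
        elif self.draining and len(self.write_queue) <= self.LOW_WATERMARK:
            self.draining = False
        if self.draining and self.write_queue:
            return ("WRITE", self._pop_write())
        if pending_reads:
            return ("READ", pending_reads.popleft())
        if self.write_queue:
            # Opportunistic write when the read queue is empty.
            return ("WRITE", self._pop_write())
        return ("IDLE", None)


# Usage: four queued writes trip the high watermark, so the scheduler
# drains same-row writes back-to-back before servicing reads again.
sched = WriteDrainScheduler()
reads = deque(["r1", "r2"])
for addr, row in [("w1", 0), ("w2", 1), ("w3", 0), ("w4", 0)]:
    sched.add_write(addr, row)
print(sched.next_command(reads))  # drains w1, then same-row w3, w4
```

Batching the drain this way is what amortizes the read-to-write bus turnaround: the penalty is paid once per burst rather than once per write.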
We demonstrate through cycle-accurate simulation that the proposed Virtual Write Queue scheme achieves significant raw system throughput improvements (10.9%) and power consumption reductions (8.7%) with very low hardware overhead (≈0.3%). Overall, the Virtual Write Queue demonstrates that co-optimization of multiple system components enables low-cost, high-yield improvements over traditional approaches.
Acknowledgements
The authors would like to thank the anonymous reviewers for their suggestions, which helped improve the quality of this paper. The authors acknowledge the use of the Archer infrastructure for their simulations, and Kyu-Hyoun Kim for assistance with DRAM bus utilization calculations. This work is sponsored in part by the National Science Foundation under award 0702694 and CRI collaborative awards 0751112, 0750847, 0750851, 0750852, 0750860, 0750868, 0750884, and 0751091, and by IBM. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation or IBM.