14 November, 2010
Since my initial blog post about the Memristor in May, a couple of new developments have surfaced.
Hewlett-Packard has teamed up with Hynix Semiconductor to begin manufacturing memristor memory chips. They remain on schedule to deliver the chips to market by 2013, where they will be marketed as RRAM, short for Resistive Random Access Memory.
The memristor technology will allow much faster write times compared to flash technology. General computer users will be able to sync their portable devices in 1/1000th of the time it takes them today. Not only will devices sync faster, but they will store more data than could ever have been conceived of before. HP expects to reach a storage density of about 20 gigabytes per square centimeter by 2013. As a comparison, the surface area of an Apple iPod Nano is about 64 cm², which could translate to over 1.25 terabytes of data on the surface of the device alone.
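That arithmetic is easy to check. A quick sketch, using only the 20 GB/cm² density and 64 cm² surface-area figures from the text:

```python
# Back-of-the-envelope check of the storage-density claim.
density_gb_per_cm2 = 20   # HP's projected density (from the text)
surface_area_cm2 = 64     # approximate iPod Nano surface area (from the text)

total_gb = density_gb_per_cm2 * surface_area_cm2   # 1280 GB
total_tb = total_gb / 1024                         # binary terabytes

print(f"{total_gb} GB = {total_tb} TB")  # 1280 GB = 1.25 TB
```

20 × 64 = 1280 GB, which is exactly 1.25 binary terabytes, matching the figure above.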
General users can expect to see devices that use this extra storage space to continuously collect data. The vast amounts of data that can be collected are in line with Intel’s “Era of Tera” forecast. Devices will be able to use the data for facial recognition of past acquaintances, weather forecasting through crowd-sourced data collection, and even more interesting applications that have yet to be thought of.
Hynix is working under a nonexclusive license of the technology and has not pushed all of its resources behind it. They will continue to work on competing technologies, “such as spin torque transfer RAM (which operates using electron spin measurement), phase-change memory (which stores data by alternating two material states), and Z-RAM, a capacitorless version of DRAM that the company licensed from Innovative Silicon in 2007.”
 Adee, S. Memristor Inside. IEEE Spectrum. September 2010. http://spectrum.ieee.org/semiconductors/devices/memristor-inside
20 May, 2010
As computer architectures have grown and developed over time, many solutions have been proposed. This paper discusses the competition between the IBM System/360 and the Burroughs B5000, as well as their legacy in the Reduced Instruction Set Computer (RISC) versus Complex Instruction Set Computer (CISC) debate. The end of this paper comments on the current state of computer architecture and where things are headed.
IBM System/360 vs Burroughs B5000
The System/360 and the B5000 differed in many fundamental ways. Figure 1 shows a table of just some of the differences:
| | IBM System/360 | Burroughs B5000 |
|---|---|---|
| Address size | 24 bits | Program Reference Table with 1024 entries |
| Character size | 4 bit for binary coded decimal | 6 bit (8 characters per 48-bit word) |
| FP size | 32/64 bit | 48 bit |
| Instruction size | Variable: 16/32/64 bit | 12 bit |
| Integer size | 32 bit | 48 bit |
| Register style | General purpose registers | Stack based |
The IBM System/360 made the significant decision to be binary compatible across its machines and to have them all work off a common instruction set. To this day, binary compatibility survives strongly, with the x86 architecture on desktop PCs allowing programs written ten years ago to run on machines built yesterday.
Another significant introduction from the IBM System/360 was byte addressability, whereas the B5000 was word addressed. Byte addressability allowed an address size bigger than the word size, and the ability to use characters located at any position.
The B5000 used a stack-based register style, which is still in use today by the Java Virtual Machine. This means that there are no registers available to the assembly programmer, and anything that needs to be referenced must be pushed onto the stack. Another important side effect of the stack-based architecture was that the B5000 descriptor system used a base and limit. This, along with the separation between instructions and data, removed the possibility of stack overflow attacks. It allows modern virtual machines, like the JVM, to have a more secure sandbox than binary-compatible systems that lack the stack architecture.
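To make the stack-based style concrete, here is a minimal sketch of stack evaluation with no programmer-visible registers. The instruction names are invented for illustration; they are not actual B5000 or JVM opcodes:

```python
# Minimal stack-machine sketch: every operand is pushed onto a
# stack, and operators pop their arguments from it -- there are
# no programmer-visible registers.
def run(program):
    stack = []
    for op, *arg in program:
        if op == "PUSH":
            stack.append(arg[0])
        elif op == "ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "MUL":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
    return stack.pop()

# (2 + 3) * 4 compiles to pure stack traffic.
result = run([("PUSH", 2), ("PUSH", 3), ("ADD",), ("PUSH", 4), ("MUL",)])
print(result)  # 20
```

Operands live only on the stack, which is why a compiler for such a machine never needs to perform register allocation.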
The B5000’s descriptor-based memory system, with its base-and-limit scheme, is an early form of what is today called segmented virtual memory and is used by many machines. Burroughs was also the first to start talking about how multiple processors would work in the system. It wasn’t long after that the IBM 360 series had support for multiple processors.
RISC vs. CISC
In the early 1980s, David Patterson and David Ditzel proposed a reversal of direction for instruction set architectures, away from the Complex Instruction Set Computer (CISC) and toward their Reduced Instruction Set Computer (RISC).
The processor field was then dominated by the VAX-11/78x systems, which used a CISC architecture and provided the best performance available. Patterson and Ditzel believed they could top CISC with a reduced instruction set: taking advantage of the memory hierarchy and its increasing performance per dollar, shortening chip design cycles, and optimizing what is used most instead of creating special instructions for edge cases.
RISC took notice of Moore’s Law and the exponential increases in memory performance, and tried to reduce the number of instructions available while increasing the use of the instructions that remained. RISC was thought to let chip designers implement an instruction set in far less time than CISC; in the time a new CISC design took, top-of-the-line performance would have doubled and the gains introduced by the newly designed CISC would be worthless. But these arguments had their detractors. The VAX architects Clark and Strecker argued that the gap between instruction performance and main-memory performance was large, and that this gap continued to drive requests for instruction sets that favored high-level languages, similar to the design of the B5000.
Patterson and Ditzel cited Amdahl’s Law: optimize the most heavily used parts of a system to get the largest effect. In doing so, they pulled work away from the processor and gave it to the compiler. This allowed the CPU to get smaller and made room for other components on the chip. Time has agreed with this argument, as the dominant CPU for mobile devices today is the Advanced RISC Machine (ARM) chip.
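Amdahl’s Law makes the optimize-the-common-case argument quantitative. A small sketch; the fractions below are illustrative, not taken from Patterson and Ditzel’s paper:

```python
# Amdahl's Law: overall speedup when a fraction f of execution
# time is accelerated by a factor s.
def amdahl_speedup(f, s):
    return 1.0 / ((1.0 - f) + f / s)

# Doubling the speed of the most-used 80% of execution time beats
# a 10x speedup on a rarely-used 10% -- the argument for optimizing
# common instructions rather than exotic ones.
print(amdahl_speedup(0.80, 2))   # ~1.67x overall
print(amdahl_speedup(0.10, 10))  # ~1.10x overall
```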
The IBM System/360 was designed to have an ISA that was good for microcoded processors, paving the way for the CISC, while the Burroughs B5000 was about pipelining and VLSI, leading to RISC.
The key difference between RISC and CISC is the separation of hardware and software responsibilities. RISC’s goal was to move complexity away from the hardware and into software, where bugs can be fixed more cheaply and quickly. This also gave compilers more of the work of selecting efficient instructions. CISC’s goal was to make the hardware more intelligent and the compiler writer’s job easier.
Today, RISC is used in embedded devices and high-end servers, whereas CISC dominates the desktop market and the lower-end server market.
- Amdahl, Blaauw, and Brooks. Architecture of the IBM System/360. IBM Journal of Research and Development, 8(2):87-101, April 1964.
- Clark, Douglas W. and Strecker, William D. Comments on “The Case for the Reduced Instruction Set Computer,” by Patterson and Ditzel.
- Hennessy, John L. and Patterson, David A. Computer Architecture: A Quantitative Approach. 4th Edition. Page 3. 2006.
- Lonergan and King. Design of the B5000 system. Datamation, vol. 7, no. 5, pp. 28-32, May, 1961.
- Patterson, David A. and Ditzel, David R. The Case for the Reduced Instruction Set Computer.
18 May, 2010
This blog post was co-authored by Brendan Grebur and Jared Wein.
As a follow-up to the previous post on TLB misses, I’d like to cover a special case of TLB misses.
At boot time there is a “chicken-and-egg” dilemma where the TLB is empty yet TLB values are needed immediately.
How does a computer handle this? First, some assumptions:
- Linux Kernel
- Hardware-managed TLB
The x86 chips boot up in Real Mode with a very limited memory space and the MMU disabled. The Linux kernel is uncompressed and loaded into low memory by the boot loader. Assembly code initializes a page directory for the initial kernel process, sets the CR3 register, then enables the PG bit in CR0 to turn on the MMU and begin addressing in Protected Mode. Since this area is kernel memory, the virtual address is identical to the physical address, as kernel memory is never swapped out. The init process then begins running C code and making memory references to initialize the rest of the kernel. TLB misses occur, but they resolve themselves as the MMU walks the page directory previously set up for the kernel process.
17 May, 2010
This blog post was co-authored by Jared Wein and Brendan Grebur.
Assumptions:
- A hardware-controlled TLB
- An x86 (32-bit) machine using 4 kB memory pages
The CR3 register on the x86 chip will be loaded with the General Page Directory physical address for the current running process.
A General Page Directory (GPD) contains 1024 four-byte entries holding the physical addresses of Internal Page Tables (IPTs). The Internal Page Tables themselves consist of 1024 four-byte entries, each containing a physical page number.
This two-level page table scheme translates:
- The upper 10 bits (31-22) of a Virtual Address (VA) as an offset into the General Page Directory.
- The next 10 bits (21-12) as an offset into the Internal Page Table pointed at by the General Page Directory’s entry.
- The entry in the Internal Page Table contains the physical page number the VA refers to, and the lower 12 bits (11-0) serve as the byte offset into this physical page.
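The translation steps above amount to slicing the 32-bit address into three bit fields. A short sketch, with an arbitrary example address chosen only for illustration:

```python
# Split a 32-bit virtual address the way the x86 two-level walk
# does with 4 kB pages: 10 bits of directory index, 10 bits of
# table index, 12 bits of byte offset.
def split_va(va):
    pd_index = (va >> 22) & 0x3FF  # bits 31-22: General Page Directory entry
    pt_index = (va >> 12) & 0x3FF  # bits 21-12: Internal Page Table entry
    offset = va & 0xFFF            # bits 11-0: byte offset within the page
    return pd_index, pt_index, offset

pd, pt, off = split_va(0x08048123)
print(pd, pt, off)  # 32 72 291
```

Note that each table holds 1024 entries of 4 bytes, i.e. exactly 4 kB, so every level of the structure itself fits in one page.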
Once the correct PTE has been loaded into the TLB, the faulting instruction is restarted, this time resulting in a TLB hit.
12 May, 2010
This article was written by Brendan Grebur and Jared Wein.
Adapting to the next step in processor technology requires a reevaluation of current computational techniques. Many-core architecture presents a fundamentally different perspective on both software and hardware systems. With no clearly effective model to follow, we are left to our own devices for advancement. The dwarfs approach provides guidance, but does not guarantee an effective grasp of the underlying issues. However, it provides a starting point to explore the interactions of data at extreme levels of parallelism. Of course, as with any method, pitfalls may arise.
This paper is a critique of the research presented by [Asanovic 06]. We present the motivation for their work, a comparison between their proposal and present-day solutions, as well as a look at advantages and disadvantages of their proposal.
The 16 years from 1986 to 2002 showed tremendous growth in processor performance. Around 2002, the industry hit a “brick wall” when it ran into limits in memory, power, and instruction-level parallelism [Hennessy and Patterson 07]. While performance gains slowed, performance requirements continued to grow at a higher pace than before.
As an example, Intel has been forecasting an “Era of Tera.” Intel claims the future holds a demand for teraflops of computing power, terabits per second of communication bandwidth, and terabytes of data storage [Dubey 05]. To prepare for this era, Intel has declared a “sea change” in computing and put its resources into multi- and many-core processor architectures [Fried 04].
Previously, multi- and many-core computer architectures have not enjoyed lasting success in the mainstream [Asanovic 06]. Some reason that one cause of the poor adoption is the increased complexity of parallel computing. [Asanovic 06] developed thirteen independent computational paradigms, called dwarfs, that represent the active areas of parallel computing. After compiling their list, they compared it with Intel’s list and noticed heavy overlap between the two. This reassured [Asanovic 06] that they were on the right path.
Advantages of the Dwarfs
The first advantage of the proposed dwarfs is that there are relatively few of them. Psychology research has shown that humans can easily cope with understanding about seven items, plus or minus two [Miller 56]. The number of dwarfs, at 13, is not too far off from this desirable number, and it keeps the test space conceivably small. Had the number of dwarfs been in the hundreds, it might be too costly to determine whether a system can handle them all.
Another advantage concerns benchmarking software. Some of the SPEC benchmark programs work on computational problems that are no longer active research areas and have less relevance when determining system performance. The dwarf approach aims to focus on active areas within computer science, and to adapt its focus areas over time to stay relevant.
The years since assembly programming have shown that higher-level languages can provide software developers with increased productivity while maintaining software quality. The goal of the parallel programming dwarfs is to extend these common problems into programming frameworks that help software developers program at an even higher level while maintaining efficiency and quality. The dwarfs can be viewed in the same light as object-oriented design patterns. One study was able to use the dwarfs to easily find places within existing sequentially-programmed applications where refactoring could introduce parallel programming solutions [Pankratius 07]. The researchers in that study achieved performance improvements of up to 8x using autotuning and the parallel programming patterns described by the dwarfs.
Disadvantages of the Dwarfs
There are a number of questions that haven’t been answered by the Berkeley researchers. We present a couple of the issues below that we are either unsure of or believe are disadvantages of the dwarf paradigm.
First, it is not well explained how a combination of dwarfs is supported on a given system. When a system claims to support two dwarfs independently, there does not appear to be a deterministic answer as to whether the system will support the combination of the two. Further, the interactions between dwarfs do not appear to be well documented.
Next, [Asanovic 06] states the power of autotuning for sequential programming, but admits that there have been no successful implementations of an autotuner for parallel programs. The problem space for parallel programs is much larger than that of sequential programs, and various factors will have to be pruned to make development of such an autotuner reasonable. Further, it is unclear whether the parameters found by autotuning will remain optimal when the actual software runs with datasets vastly different from the autotuned dataset.
Last, the area of parallel programming presents a virtual graveyard of companies and ideas that failed to reach the market successfully. Multi- and many-core architectures have so far failed to deliver the programmer productivity and quality that the uniprocessor delivered. Only time will tell how the field of software development reacts and whether it can adopt the shift in computing paradigms.
Computing performance improvements have recently hit a “brick wall,” and major chip makers have pushed a computing shift toward multicore systems. There are many complexities with multi- and many-core systems that can keep expected performance gains from materializing. Some of the problems reside in the complexity of implementing common parallel computing patterns.
The researchers in the Par Lab at University of California at Berkeley have been attacking the multi- and many-core problems since 2006. From their research, they have created a list of 13 dwarfs that represent active areas in parallel computing. These dwarfs provide common patterns that can be reproduced and used for benchmarking and the creation of programming frameworks that offer an abstraction of complex problems.
Asanovic et al. The Landscape of Parallel Computing Research: A View from Berkeley. Dec 2006.
Fried, I. For Intel, the future has two cores. ZDNet. 2004. http://news.zdnet.com/2100-9584_22-138241.html
Dubey, P. Recognition, Mining and Synthesis Moves Computers to the Era of Tera. Technology@Intel Magazine. Feb 2005.
Hennessy, J. and Patterson, D. Computer Architecture: A Quantitative Approach, 4th edition, Morgan Kaufmann, San Francisco, 2007.
Miller, G. A. “The magical number seven, plus or minus two: Some limits on our capacity for processing information”. Psychological Review. Vol 63, Iss 2, pp 81–97. 1956.
Pankratius et al. Software Engineering for Multicore Systems – An Experience Report. Institute for Program Structures and Data Organization. Dec 2007.
11 May, 2010
Hewlett Packard recently announced that it has fabricated memristor chips and plans to bring them to market by 2013. This paper is an evaluation of the memristor and focuses on the potential impact of its introduction.
The memristor is an electrical circuit element, similar to a resistor, that can maintain its state between power cycles. The memristor was first described by Leon Chua in 1971 as the fourth fundamental circuit element, based on a relationship between charge and flux. These memristors are about half the size of the transistors found in current flash storage devices such as iPods and portable USB drives, allowing the capacity of such devices to double. Further, Hewlett Packard has claimed the ability to improve the data-write cycle limit 10-fold, from flash technology’s 100,000 to the memristor’s 1,000,000 data-write cycles.
A Computer Architecture Perspective
The memristor aims to replace the memory hierarchy seen in almost all computers today. Because the memristor’s state is nonvolatile and its width is planned to reach about four nanometers, large amounts of memory can be stored directly on the processing chip. The memory hierarchy in today’s computers may have three or four levels of volatile cache, another level of volatile DRAM, and finally a level of nonvolatile memory. Putting the nonvolatile memory close to the logic unit on the chip has the potential to allow memory accesses 1,000 times faster.
Memristor caches may be able to retain their state even as power to the device comes and goes. In current computer start-up routines, the translation lookaside buffer (TLB) must be repopulated at each boot. With memristor technology, the TLB could retain its state between power cycles, allowing a much quicker start-up process. As the technology matures, this same thinking could be extended all the way to resuming running processes when the machine starts back up.
In its initial implementation, Hewlett Packard’s goal is to place memristors between DRAM and disk technology, eventually spreading in both directions to replace both disk and DRAM. Due to the memristor’s speed and data-integrity guarantees, this goal is likely to bring hard drive-like reliability with speeds faster than DRAM. Consumers are likely to see these speed increases when writing to their iPods and USB drives, since higher write-cycle limits allow the same sector to be rewritten repeatedly, whereas current flash implementations must spread writes out to avoid wearing any one sector.
Another use for memristors is computing logic. Using material implication, a group of three memristors can compute any boolean logic function. The motivation for using memristors for logic comes from the ability to run computations at the nanoscale. The use of material implication complicates the boolean logic, but not enough to offset the gains in size reduction: one memristor can replace 15 transistors.
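The material-implication trick can be sketched in plain boolean terms: IMP(a, b) = (NOT a) OR b, and two IMP steps against a cleared third memristor yield NAND, which is functionally complete. This is a logical sketch of the scheme Borghetti et al. describe, not a model of the circuit itself:

```python
# Material implication: a IMP b = (not a) or b.
def imp(a, b):
    return (not a) or b

# NAND from three "memristors": p and q hold the inputs, s is
# cleared to False and then receives two IMP operations.
def nand(p, q):
    s = False      # third memristor, cleared
    s = imp(p, s)  # s becomes NOT p
    s = imp(q, s)  # s becomes (NOT q) OR (NOT p) = NAND(p, q)
    return s

# NAND is functionally complete, so this suffices for any boolean function.
for p in (False, True):
    for q in (False, True):
        assert nand(p, q) == (not (p and q))
print("NAND via two IMP steps verified")
```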
Combining the use of memristors for storage and logic may also present a new application for the device. Hypothetically, a crossbar of memristors could re-purpose groups of memristors for data-intensive or compute-intensive operations. Memristors are also very power-efficient. Embedding memristors within sensor networks could allow those networks to collect more data and have much longer lifetimes.
The power wall and the memory wall can effectively be written off if the memristor lives up to its claims.
Impact for the General User
As previously mentioned, the memristor technology will allow much faster write times compared to flash technology. General computer users will be able to sync their portable devices in 1/1000th of the time it takes them today. Not only will devices sync faster, but they will store more data than could ever have been conceived of before. HP expects to reach a storage density of about 20 gigabytes per square centimeter by 2013. As a comparison, the surface area of an Apple iPod Nano is about 64 cm², which could translate to over 1.25 terabytes of data on the surface of the device alone.
General users can expect to see devices that use this extra storage space to continuously collect data. The vast amounts of data that can be collected are in line with Intel’s “Era of Tera” forecast. Devices will be able to use the data for facial recognition of past acquaintances, weather forecasting through crowd-sourced data collection, and even more interesting applications that have yet to be thought of.
Critiques of the Memristor
To implement the memristor, titanium dioxide is used as the semiconductor instead of the traditional silicon. There is still much to be understood about titanium dioxide, as a team from the National Institute of Standards and Technology (NIST) noted. Silicon has had many years to mature as a semiconductor in electrical circuits, whereas researchers are just beginning to understand how to use titanium dioxide. “The fundamentals of why these metal oxides switch the way they do are not well understood,” said NIST researcher Curt Richter. Further time and research may resolve the unknowns, but at this point there may not be enough known to be sure of the memristor’s promise.
Hewlett Packard and Intel have a long history of working together to introduce new technology, such as their work on the EPIC Itanium architecture. Intel was approached to work on the memristor technology with Hewlett Packard but declined the offer. Instead, Intel is focusing its energy on phase-change memory. Intel already shipped phase-change memory samples in 2008 and plans to ship mass quantities in 2010, as opposed to Hewlett Packard’s expected 2013 release date for the memristor.
The memristor presents an amazing opportunity for change in the computer memory hierarchy, storage capacity, and nonvolatile state. Lab results suggest that we are just at the cusp of the memristor’s capabilities. That said, there are still unknowns about the technology that will challenge its adoption. The material used as the semiconductor is relatively new to electronic circuits, and competing research appears to be closer to market than the memristor.
Much of the published work on the memristor comes from its main researcher, R. Stanley Williams of Hewlett Packard. Many of the claims have not been validated by a third party and could very easily be exaggerated. If given the opportunity to invest in memristor technology, I would advise taking a deeper look into phase-change memory and the other competitors that look to be closer to reaching the market. Computing technology changes very quickly, and there is a low probability that the computing environment three years from now will be unchanged from today.
 Adee, S. The Mysterious Memristor. IEEE Spectrum. May 2008. http://spectrum.ieee.org/semiconductors/design/the-mysterious-memristor
 Borghetti et al. ‘Memristive’ switches enable ‘stateful’ logic operations via material implication. Nature. Vol 464. Apr 8, 2010. http://www.nature.com/nature/journal/v464/n7290/pdf/nature08940.pdf
 Bourzac, K. Memristor Memory Readied for Production. Technology Review. MIT Press. Apr 2010. http://www.technologyreview.com/computing/25018/?a=f
 Foremski, T. The amazing memristor – beyond Moore’s law and beyond digital computing. ZDNet: IMHO. Apr 19, 2010. http://blogs.zdnet.com/Foremski/?p=1309
 Garver, S. and Crepps, B. The New Era of Tera-scale Computing. Intel Software Network. Jan 15, 2009. http://software.intel.com/en-us/articles/the-new-era-of-tera-scale-computing/
 Jones, W. A New Twist on Memristance. IEEE Spectrum. Jun 2009. http://spectrum.ieee.org/semiconductors/memory/a-new-twist-on-memristance/0
 Miller, M. Memristors: A Flash Competitor that Works Like Brain Synapses. PCMag: Forward Thinking. Apr 14, 2010. http://blogs.pcmag.com/miller/2010/04/memristors_a_flash_memory_comp.php
 Null, C. Memristor technology gets real; commercial release planned for 2013. Yahoo! News: Today in Tech: The Working Guy. Apr 2010. http://news.yahoo.com/s/ytech_wguy/20100412/tc_ytech_wguy/ytech_wguy_tc1570
 Williams, R. S. How We Found the Missing Memristor. IEEE Spectrum. Dec 2008. http://spectrum.ieee.org/semiconductors/processors/how-we-found-the-missing-memristor/0
5 March, 2010
This semester I’m taking two courses at MSU on my way to a Masters in Computer Science. As an aid to my studying, I’ve taken up watching videos of other schools’ lectures online, specifically lectures from the authors of our textbooks.
For my CSE 891 course: Language and Interaction, a course that goes more in-depth on specific aspects of Natural Language Processing, I’ve found lectures online from James Martin at the University of Colorado at Boulder. Martin co-authored Speech and Language Processing with Dan Jurafsky from Stanford.
- CSCI 5832 Lecture 3 (2010-01-19)
- CSCI 5832 Lecture 4 (2010-01-21)
- CSCI 5832 Lecture 5 (2010-01-26)
- CSCI 5832 Lecture 6 (2010-01-28)
- CSCI 5832 Lecture 7 (2010-02-02)
- CSCI 5832 Lecture 8 (2010-02-04)
- CSCI 5832 Lecture 9 (2010-02-09)
- CSCI 5832 Lecture 10 (2010-02-16)
- CSCI 5832 Lecture 11 (2010-02-18)
- CSCI 5832 Lecture 12 (2010-02-23)
- CSCI 5832 Lecture 13 (2010-02-25)
- CSCI 5832 Lecture 14 (2010-03-02)
- CSCI 5832 Lecture 15 (2010-03-04)
- More lectures can be seen at the RSS feed for the course
For my CSE 820 course: Advanced Computer Architecture, I’ve found lectures online from David Patterson at the University of California at Berkeley. Patterson co-authored Computer Architecture: A Quantitative Approach with John L. Hennessy, who is currently the President of Stanford University.