An update on the Memristor

14 November, 2010 § 1 Comment

Since my initial blog post about the Memristor in May, a couple new developments have surfaced.

Hewlett-Packard has teamed up with Hynix Semiconductor to begin manufacturing memristor memory chips. They are still on schedule to deliver them to market by 2013 and will be marketed as RRAM, short for Resistive Random Access Memory.

The memristor technology will allow much faster write-times compared to flash technology. General computer users will be able to sync up their portable devices in 1/1000th of the time that it takes them today. Not only will devices be able to sync faster, but they will be able to store more data than could have ever been concieved of before. HP expects to reach a storage density of about 20 gigabytes per square centimeter by 2013. As a comparison, the surface area of an Apple iPod Nano is about 64 cm2 which could translate to over 1.25 terabytes of data on the surface of the device alone.

General users can expect to see devices that use this extra storage space to continuously collect data. The vast amounts of data that can be collected are in line with Intel’s “Era of Tera” forecast. Devices will be able to use the data for facial recognition of past acquaintances, weather forecasting through crowd-sourced data collection, and even more interesting applications that have yet to be thought of.

Hynix is going using an nonexclusive license of the technology and has not pushed all resources behind it. They will continue to work on competing technologies, “such as spin torque transfer RAM (which operates using electron spin measurement), phase-change memory (which stores data by alternating two material states), and Z-RAM, a capacitorless version of DRAM that the company licensed from Innovative Silicon in 2007.” [1]

[1] Adee, S. Memristor Inside. IEEE Spectrum. September 2010. http://spectrum.ieee.org/semiconductors/devices/memristor-inside

Comments on the “Great ISA Debate”

20 May, 2010 § 2 Comments

As computer architectures continued to grow and develop over time, many solutions have been proposed. This paper will discuss the competition between the IBM System/360 and Burroughs B5000 as well as its legacy in terms of the Reduced Instruction Set Computer (RISC) versus Complex Instruction Set Computer (CISC). The end of this paper will comment on the current state of computer architecture and where things are headed.

IBM System/360 vs Burroughs B5000

The System/360 and the B5000 differed in many fundamental ways. Figure 1 shows a table of just some of the differences:

Figure 1. Key decisions made by the IBM System/360 and the Burroughs B5000

IBM System/360 [4] Burroughs B5000 [1]
Address size 24 bits Program Reference Table with 1024 entries
Character size 4 bit for binary coded decimal 6 bit (8 characters for 48 bit word)
FP size 32/64 bit 48 bit
Instruction size Variable: 16/32/64 bit 12 bit
Integer size 32 bit 48 bit
Register style General purpose registers Stack based

The IBM System/360 made a significant decision to be binary compatible with their machines and make all of the machines work off a common instruction set [1]. To this day, binary compatibility still survives greatly, with the x86 architecture on desktop PC’s allowing programs written ten years ago to run on machines built yesterday.

Another significant introduction from the IBM System/360 was byte address-ability, whereas the B5000 was word addressed [1]. Byte address-ability allowed a bigger address size than word size, and the ability to use characters located at any location.

The B5000 used a stack based register style [4], which is still in use today by the Java Virtual Machine. This means that there are no registers available to the assembly programmer, and anything that needs to get referenced must be pushed on to the stack. Another important side-effect of the stack based architecture was that the B5000 descriptor system used a base and limit. This, along with the separation between instructions and data, removed the possibility for stack overflow attacks. This allows modern day virtual machines, like the JVM, to have a more secure sandbox than binary compatible systems that lack the stack architecture.

The B5000’s stack based architecture is today called Segmented Virtual Memory and is used by many machines. Burrough was also the first to start talking about how multiple processors would work in the system [4]. It wasn’t too long after that the IBM 360 series had support for multiple processors.

RISC vs. CISC

In the early 1980s David Patterson and David Ditzel introduced a reversal of directions for instruction set architectures from the Complex Instruction Set Computer to their Reduced Instruction Set Computer (RISC).

The processor field was then dominated by the VAX-11/78x systems. These systems used a CISC architecture and provided the best performance [3]. Patterson and Ditzel believed they could top CISC with a reduced instruction set, including taking advantage of the memory hierarchy and its increasing performance per dollar in the memory market, increased speed of chip designers, and optimizing what is being used the most instead of creating special instructions for edge cases.

RISC took notice of Moore’s Law and the exponential increases in memory performance and tried to reduce the number of instructions available and increase the number of instructions used [5]. RISC was thought to allow chip designers to implement the instruction set in far less time than CISC and in that amount of time, the top of the line performance would have doubled and the gains introduced by the newly designed CISC would be worthless [5]. But these arguments had their detractors. VAX Systems Architecture argued that the gap between performance in instructions versus performance in main memory was large and that this gap continued to lead to requests for an instruction set that favored high level languages, similar to the design of the B5000.

Patterson and Ditzel cited Amdahl’s Law in trying to optimize the most well-used parts of a system to have the largest effect, and in doing so, pulled work away from the processor and gave it to the compiler. This allowed the CPU to get smaller and make room for other components on the chip. Time has agreed with this argument, as the dominant CPU for mobile devices today is the Advanced RISC Machine (ARM) chip.

One of the main arguments for RISC was that its reduced instruction set would allow implementers to innovate at a much higher pace and surpass the developments of CISC implementers. This claim has not held its ground, as Intel has continued to use the x86 architecture (CISC) in its desktop CPUs and has been able to keep up with the pace of innovation seen in other ISAs.
Both papers were not clear as to the definition of either RISC or CISC. As time has gone on, RISC has remained a load/store architecture but the complexity of the two has increased.

In Conclusion

The IBM System/360 was designed to have an ISA that was good for microcoded processors, paving the way for the CISC, while the Burroughs B5000 was about pipelining and VLSI, leading to RISC.

The key difference in RISC compared to CISC is the separation of hardware and software responsibilities. RISC’s goal was to move the complexity away from the hardware to software where bugs can be fixed cheaper and faster. This also gave more work to compilers to use more efficient instructions. CISC’s goal was to make the hardware more intelligent and make the job of the compiler writer easier.

Today, RISC is used in embedded devices and high-end servers, whereas CISC dominates the desktop market and the lower-end server market.

References

  1. Amdahl, Blaauw, and Brooks. Architecture of the IBM System/360. IBM Journal of Research and Development, 8(2):87-101, April 1964.
  2. Clark, Douglas W. and Strecker, William D. Comments on “The Case for the Reduced Instruction Set Computer,” by Patterson and Ditzel.
  3. Hennessy, John L. and Patterson, David A. Computer Architecture: A Quantitative Approach. 4th Edition. Page 3. 2006.
  4. Lonergan and King. Design of the B5000 system. Datamation, vol. 7, no. 5, pp. 28-32, May, 1961.
  5. Patterson, David A. and Ditzel, David R. The Case for the Reduced Instruction Set Computer.

How TLB misses are handled (part 2)

18 May, 2010 § 1 Comment

This blog post was co-authored by Brendan Grebur and Jared Wein.

As a follow up from the previous post on TLB misses, I’d like to cover a special case of TLB misses.

At boot time there is a “chicken-and-egg” dilemma where the TLB is empty yet TLB values are needed immediately.

How does a computer handle this? First to make some assumptions:

  • Linux Kernel
  • Hardware-managed TLB

The x86 chips boot up in Real Mode with a very limited memory space and the MMU disabled. The Linux kernel is uncompressed and loaded into low-memory by the boot loader. Assembly code initializes a page directory for the initial kernel process, sets the CR3 register, then enables the PG bit in CR0 to effectively enable the MMU and begin addressing in Protected Mode. Since this area is kernel memory, the virtual address will be identical to the physical address, as kernel memory is never swapped out. The init process begins running C code and making memory references to initialize the rest of the kernel. TLB misses occur, but resolve themselves as the MMU walks the page directory previously set up for the kernel process.

Sources:

  1. http://lkml.indiana.edu/hypermail/linux/kernel/0811.2/01602.html
  2. http://beaversource.oregonstate.edu/projects/cspfl/browser/trunk/software/u-boot-omap3/board/sbc8560/tlb.c?rev=376
  3. http://tldp.org/LDP/tlk/tlk.html
  4. http://lxr.linux.no/linux-old+v2.0.33/arch/i386/kernel/head.S#L286

How TLB misses are handled

17 May, 2010 § Leave a comment

This blog post was co-authored by Jared Wein and Brendan Grebur.

Have you ever wondered how a TLB (Translation Lookaside Buffer) miss is handled? Probably not, but in case you have, or are curious just what I’m talking about, keep on reading.

First, let’s assume that we have a physically-addressed Level 1 cache.

Assume that the desired portion of the page table is resident in memory, but it is not in the cache. For simplicity assume that there is only a L1 cache, no L2.

The dilemma is that to get a new TLB entry one needs correct values in the TLB, but a TLB fault appears to mean that they are not there.

Further assumptions:

  • A Hardware-controlled TLB.

  • A x86(32-bit) machine using 4kB memory pages.

  • The CR3 register on the x86 chip will be loaded with the General Page Directory physical address for the current running process.

  • A General Page Directory (GPD) contains 1024, 4-byte entries of physical addresses to Internal Page Tables (IPT). The Internal Page Tables themselves consist of 1024, 4-byte entries, which contain the physical page number.

  • This two-level Page Table scheme translates:

  • The upper 10 bits (31-22) in a Virtual Address (VA) as an offset into the General Page Directory.

  • The next 10 bits (21-12) are translated as an offset into the Internal Page Table pointed at by the General Page Directory’s entry.

  • The entry in the Internal Page Table contains the physical page number the VA refers to and the lower 12 bits (11-0) serve as the byte offset into this physical page.

When a virtual memory address is referenced by an instruction, for whom a valid page table entry does not exist in the TLB, a fault occurs. The Memory Management Unit (MMU) now bypasses the TLB and attempts to read the address contained in the CR3 register, with the offset contained in the first 10 bits of the faulting VA. The faulting instruction now becomes stalled as the L1 cache is checked for the particular GPD entry. If the location is not found, the L1 cache loads the address from DRAM. Once the GPD entry is read, the MMU checks the valid bit on the entry. Since we can assume these tables are resident in memory, the address contained within the entry is now requested by the MMU, bypassing the TLB again, after the offset from bits 21-12 of the VA have been applied. Again, if this data is not already in the L1 cache, the data must be loaded from DRAM causing further delay. Contained within the IPT entry is the physical page number the VA refers to. Since a new PTE has been found, the TLB must be updated with the value. Assuming the TLB is full, the MMU uses an NRU (Not Recently Used) algorithm to replace the victim PTE with the new PTE contained in the IPT entry. However, before the victim PTE is discarded it must be checked for a set dirty bit. If set, the corresponding PTE must be copied back to main memory.

As the correct PTE has been loaded into the TLB, the faulting instruction is restarted, resulting in a TLB hit.

Sources:

The Thirteen Dwarfs

12 May, 2010 § Leave a comment

This article was written by Brendan Grebur and Jared Wein.

Introduction

Adapting to the next step in processor technology requires a reevaluation of current computational techniques. Many-core architecture presents a fundamentally different perspective on both software and hardware systems. With no clearly effective model to follow, we are left to our own devices for advancement. A dwarf’s approach provides guidance, but does not guarantee an effective realization of the underlying issues. However, it provides a starting point to explore the interactions of data at extreme levels of parallelism. Of course, as with any method, pitfalls may arise.

This paper is a critique of the research presented by [Asanovic 06]. We present the motivation for their work, a comparison between their proposal and present-day solutions, as well as a look at advantages and disadvantages of their proposal.

Motivation

The past 16 years, from 1986 to 2002, have showed tremendous growth in processor performance. Around 2002, the industry hit a “brick wall” when it ran in to limits in memory, power, and instruction level parallelism [Hennessy and Paterson 07]. While performance gains slowed, performance requirements continued to grow at a higher pace than before.

As an example, Intel has been forecasting an “Era of Tera”. Intel claims that the future holds a time where there is demand for teraflops of computing power, terabits per second of communication bandwidth, and terabytes of data storage [Dubey 05]. To prepare for this era, Intel has claimed a “sea change” in computing and put their resources on multi- and many-core processor architectures [Fried 04].

Previously these multi- and many-core computer architectures have not been successful stays in the mainstream [Asanovic 06]. Some reason that one of the causes for poor adoption is the increased complexity when working with parallel computing. [Asanovic 06] developed thirteen independent paradigms that represent the active areas of parallel computing. After compiling their list, they compared theirs with Intel’s list and noticed heavy overlaps between the two. This reassured [Asanovic 06] that they were on the right path.

Advantages of the Dwarfs

The first advantage to the dwarfs that have been proposed is that there exists only a relatively small number of dwarfs to be concerned with. Psychology research has shown that humans can cope with understanding about seven, plus or minus two, items very easily [Miller 56]. The number of dwarfs, at 13, is not too far off from this desirable number, and allows the test space to be conceivably small. Had the number of dwarfs proposed been in the hundreds, it may become too costly to determine if a system can handle all the dwarfs.

Another advantage focuses on benchmarking software. Some of the programs for SPEC benchmarks work on computational problems that are no longer current research areas and have less of an application when determining the system performance. The dwarf approach has a goal to focus on active areas within computer science, and to adapt the focus areas over time to stay relevant.

The years since assembly programming have shown that higher level languages could be created and provide software developers with increased productivity while maintaining software quality. The goal of the parallel programming dwarfs is to extend these common problems to programming frameworks that can help software developers program at an even higher level to maintain efficiency and quality. These dwarfs provide themselves in a way that they can be looked at in the same regard as object-oriented design patterns. One such study was able to use the dwarfs to easily find places within existing sequentially-programmed applications where refactoring could make use of parallel programming solutions [Pankratius 07]. The researchers in this study were able to achieve performance improvements of up to 8x using autotuning and parallel programming patterns described by the dwarfs.

Disadvantages of the Dwarfs

There are a number of questions that haven’t been answered by the Berkeley researchers. We present a couple of the issues below that we are either unsure of or believe are disadvantages of the dwarf paradigm.

First, it is not well-explained how the combination of dwarfs is supported on a given system. When a system claims to support two dwarfs independently, there does not appear to be a deterministic answer as to if the system will support the combination of the dwarfs. Further, the interaction between these dwarfs does not appear to be well documented.

Next, [Asanovic 06] states the power of autotuning when working with sequential programming, but admits that there have not been successful implementations of an autotuner for parallel programs. The problem space for parallel programs is much larger than that of sequential programs, and various factors will have to be elided to make development of an autotuner reasonable. Further, it is unclear if the parameters used to perform the autotuning will provide the optional parameters for use when running the actual software with vastly different datasets than that of the autotuned dataset.

Last, the area of parallel programs presents a virtual graveyard of companies and ideas that have failed to reach the market successfully. The multi- and many-core architectures have failed at producing the same programmer productivity and quality that the uniprocessor delivered. Only time will tell how the field of software development reacts and if it can adopt the shift in computing paradigms.

Conclusion

Computing performance improvements have recently hit a “brick wall” and there has been a computing shift pushed by major chip makers towards multicore systems. There are many complexities with multi- and many-core systems that can lead to unappreciated performance gains. Some of the problems may reside in the complexity of implementing common parallel computing patterns.

The researchers in the Par Lab at University of California at Berkeley have been attacking the multi- and many-core problems since 2006. From their research, they have created a list of 13 dwarfs that represent active areas in parallel computing. These dwarfs provide common patterns that can be reproduced and used for benchmarking and the creation of programming frameworks that offer an abstraction of complex problems.

Citations

Asanovic et al. The Landscape of Parallel Computing Research: A View from Berkeley. Dec 2006.

Fried, I. For Intel, the future has two cores. ZDNet. 2004. http://news.zdnet.com/2100-9584_22-138241.html

Dubey, P. Recognition, Mining and Synthesis Moves Computers to the Era of Tera. Technology@Intel Magazine. Feb 2005.

Hennessy, J. and Patterson, D. Computer Architecture: A Quantitative Approach, 4th edition, Morgan Kauffman, San Francisco, 2007.

Miller, G. A. “The magical number seven, plus or minus two: Some limits on our capacity for processing information”. Psychological Review. Vol 63, Iss 2, pp 81–97. 1956.

Pankratius et al. Software Engineering for Multicore Systems – An Experience Report. Institute for Program Structures and Data Organization. Dec 2007.

Where Am I?

You are currently browsing entries tagged with computer-architecture at JAWS.

Follow

Get every new post delivered to your Inbox.

Join 982 other followers