Lecture 06 Machine Representation

Joseph Haugh

University of New Mexico

Learning Objectives

  • After this module students should be able to:
  • Describe the basic hardware model
  • Describe data representation in this HW model
  • Describe, in general, the conversions between C, machine code, and assembly code
  • Describe the assembly basics: registers, operands, and move instructions

Understanding How Software is Mapped to Hardware

  • Things every good programmer needs:
    • A basic model of the underlying hardware system
    • The steps through which software programs are mapped to this hardware system
    • What a program looks like at each step and a basic understanding of how each mapping happens

Basic Hardware Model

  • Stored program on byte-addressable memory:
    • Single memory that stores both program and data
    • Each numbered successive memory location stores a single byte (8-bit) of data
    • Memory contents are untyped
    • A sequence of bytes can be interpreted as characters, signed or unsigned integers, floats, strings, pointers, etc.
    • Program code is just another way to interpret the bytes

Byte-Addressable Memory Organization

  • Programs refer to data by address
    • Conceptually, envision it as a very large array of bytes (virtual memory)
      • In reality, it’s not, but you can think of it that way (as do machine-level programs)
    • An address is like an index into that array
      • A pointer variable stores the (virtual) address of the 1st byte of some storage block

Byte-Addressable Memory Organization

  • Note: system provides private address spaces to each “process”
    • Think of a process as a program being executed (more on this concept later)
    • So, a program can hammer on its own data, but not that of others

Machine Words (sec 2.1.2)

  • Every computer has a word size
  • Nominal size of integer-valued data (including addresses)
  • Modern computers are mostly 64-bit
  • Virtual address space = set of all possible addresses
  • Machines still support multiple data formats (char, short, long, etc.)
  • Fixed-size words can have complicated implications!

Example: Data Representations (extension of fig 2.3)

Sizes in bytes:

C Data Type   Typical 32-bit   Typical 64-bit   x86-64
char          1                1                1
short         2                2                2
int           4                4                4
long          4                8                8
float         4                4                4
double        8                8                8
long double   -                -                10/16
pointer       4                8                8

Standard ISO C99

  • Introduced a “class of data types where the data sizes are fixed regardless of compiler and machine settings.”
    • E.g. int32_t (4 bytes), int64_t (8 bytes)
  • With fixed-size integer types, programmers have more control over data representations
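  • Whether plain char is signed or unsigned is implementation-defined (it is signed with most x86-64 compilers); write signed char or unsigned char when it matters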
  • Advice: programs should be portable across machines and compilers
    • Try to make programs insensitive to the exact sizes of different data types

Word-Oriented Memory Organization (sec 2.1.3)

  • Addresses Specify Byte Locations
  • Address of first byte in word
  • Addresses of successive words differ by 4 (32-bit) or 8 (64-bit)
  • E.g., variable y of type int (32-bit) with &y = 0x100: y occupies 4 bytes, so the locations used for y are 0x100, 0x101, 0x102, and 0x103

Byte Ordering

  • How are the bytes within a multi-byte word ordered in memory?
  • Conventions:
    • Big Endian: Sun, PPC Mac, Internet
      • Least significant byte has highest address (0x103 in our e.g.)
    • Little Endian: x86, ARM microprocessors running Android, iOS, and Windows
      • Least significant byte has lowest address (0x100 in our e.g.)
  • Once a value is loaded into the processor, arithmetic on it works the same way under either convention

Byte Ordering Example

  • Variable x has 4-byte value of 0x01234567
  • Address given by &x (address of x) is 0x100

Representing Integers

Decimal: 15213
Binary:  0011 1011 0110 1101
Hex:      3    B    6    D

Representing Characters

  • ASCII 7-bit character encoding most common
    • Only 128 characters total, 256 for expanded ASCII
    • Western character sets covered
    • C converts single-quoted character constants to their ASCII codes
  • 16-bit Unicode encoding
    • Used for more complex character sets
    • Some newer languages support it systematically
    • p. 50 (Aside)

Representing Strings (sec 2.1.4)

  • Strings in C:
    • Represented by array of characters
    • Each character encoded in ASCII format
    • String should be null-terminated
      • Final character = 0 (not ‘0’, but the null character)
  • Lots of C string errors
    • Not allocating enough space for the string
    • Forgetting to copy or set the null character
    • Forgetting to allocate space for the null

Representing Strings (sec 2.1.4)

  • Byte ordering not an issue for strings!
    • Array entries always stored in increasing order (of addresses)
    • So one byte characters are in increasing order (of addresses)
  • Using ASCII for encoding makes them independent of byte ordering and word size conventions. Therefore, text data are more platform independent than binary data.

Representing Pointers

  • Different compilers & machines assign different locations to objects
  • Can even get different results each time the program runs
    • (Discuss pointers and arrays in detail in Sec. 3.8)
int b  = -15213;
int *p = &b;

Examining Data Representations

  • Code to Print Byte Representation of Data
    • Casting pointer to unsigned char * allows treatment as a byte array
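    #include <stdio.h>

    typedef unsigned char *pointer;

    /* Print the address and value of each of the len bytes beginning at start */
    void show_bytes(pointer start, size_t len) {
      size_t i;
      for (i = 0; i < len; i++)
        printf("%p\t0x%.2x\n", (void *) (start + i), start[i]);  /* %p expects void * */
      printf("\n");
    }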

Using show_bytes

int a = 15213;
printf("int a = 15213;\n");
show_bytes((pointer) &a, sizeof(int));
  • Results on Linux x86-64:
int a = 15213;
0x7ffee15d7e8c  0x6d
0x7ffee15d7e8d  0x3b
0x7ffee15d7e8e  0x00
0x7ffee15d7e8f  0x00

Moving Onto Chapter 3: Machine-Level Representations of Programs

Running Programs is just a Processor Interpreting Memory

  • Processor has a relatively simple model
    1. Read some bytes from a location in memory (the address in the program counter)
    2. Interpret the data in that location as a machine instruction
    3. Do what that instruction says, modifying processor and memory state appropriately
    4. Goto 1
  • The whole system software stack (OS, compilers, linkers, etc) relies on conventions about laying out programs in memory

Assembly/Machine Code View

  • Programmer-Visible State
    • PC: Address of next instruction
    • Register file: Temporary storage
    • Condition Codes: Special state for comparisons
    • Memory
      • Byte addressable array
      • Code and user data
      • Stack to support procedures

C vs Machine Code vs Assembly



/* Note: despite its name, fib computes n! (the factorial of n) */
int fib(int n)
{
    int r = 1;
    for (; n > 0; n--) {    /* Loop end chk. (n > 0), loop decr. (n--) */
        r = r * n;
    }
    return r;
}
Disassembly of section .text:
Address  Insn Bytes            Assembly Code
40057d:  55                    push   %rbp
40057e:  48 89 e5              mov    %rsp,%rbp
400581:  89 7d ec              mov    %edi,-0x14(%rbp)
400584:  c7 45 fc 01 00 00 00  movl   $0x1,-0x4(%rbp)
40058b:  eb 0e                 jmp    40059b <fib+0x1e>
40058d:  8b 45 fc              mov    -0x4(%rbp),%eax
400590:  0f af 45 ec           imul   -0x14(%rbp),%eax
400594:  89 45 fc              mov    %eax,-0x4(%rbp)
400597:  83 6d ec 01           subl   $0x1,-0x14(%rbp)
40059b:  83 7d ec 00           cmpl   $0x0,-0x14(%rbp)
40059f:  7f ec                 jg     40058d <fib+0x10>
4005a1:  8b 45 fc              mov    -0x4(%rbp),%eax
4005a4:  5d                    pop    %rbp
4005a5:  c3                    retq

Toolchain for Compiled Programs

Why Learn Assembly?

  • Understand optimization capabilities of the compiler and analyze inefficiencies of the code
  • Maximize performance of critical sections of the code
  • Many programs are attacked through weaknesses that are visible at the machine-level representation of the programs
  • In the earlier days we needed to learn assembly to write the programs directly in that language
  • Today we need to be able to read and understand the code generated by compilers

Critical Piece of Advice

  • The textbook provides many examples and exercises that illustrate different aspects of assembly language and compilers.
  • From the authors of the textbook:
  • “This is a subject where mastering the details is a prerequisite to understanding the deeper and more fundamental concepts.”
  • “Those who say ‘I understand the general principles, I don’t want to bother learning the details’ are deluding themselves.”
  • “It is critical that you spend time studying the examples, working through exercises, and checking your solutions with those provided.”

History of Intel Processors

  • Read on your own: sec 3.1

Our Coverage

  • This class focuses on x86-64
  • Relevant textbook errata: here

Learning Objectives

  • Describe how to disassemble a compiled C program
  • List and describe all the x86-64 registers
  • List and describe the operand specifiers
  • Describe the 9 addressing modes and practice address calculations in an instruction

Definitions

  • Architecture: (also ISA: instruction set architecture) The parts of a processor design that one needs to be able to write and understand assembly/machine code
    • Examples: instruction set specification, registers (processor state)
  • Microarchitecture: Implementation of the architecture
    • Examples: cache sizes and core frequency
  • Code Forms:
    • Machine Code: The byte-level programs that a processor executes
    • Assembly Code: A text representation of machine code
  • Example ISAs:
    • Intel: x86, IA32, Itanium, x86-64
    • ARM: Used in almost all mobile phones and increasingly in other places

Assembly/Machine Code View Revisited

  • PC: Program counter
    • Address of next instruction
    • Called “RIP” (instruction pointer register) in x86-64
  • Register file
    • Contains 16 named locations
    • Heavily used program data
  • Condition codes
    • Store status information about most recent arithmetic or logical operation
    • Used for conditional branching

  • Memory
    • Byte addressable array
    • Code and user data
    • Stack to support procedures