Lecture 06 Machine Representation

Joseph Haugh

University of New Mexico

Learning Objectives

  • After this module students should be able to:
  • Describe the basic hardware model
  • Describe data representation in this HW model
  • Describe, in general, the conversions between C, machine code, and assembly code
  • Describe the assembly basics: registers, operands, and move instructions

Understanding How Software is Mapped to Hardware

  • Things every good programmer needs:
    • A basic model of the underlying hardware system
    • The steps through which software programs are mapped to this hardware system
    • What a program looks like at each step and a basic understanding of how each mapping happens

Basic Hardware Model

  • Stored program on byte-addressable memory:
    • Single memory that stores both program and data
    • Each numbered successive memory location stores a single byte (8-bit) of data
    • Memory contents are untyped
    • A sequence of bytes can be interpreted as characters, signed or unsigned integers, floats, strings, pointers, etc.
    • Program code is just another way to interpret the bytes
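
As a minimal sketch of "memory contents are untyped", the helpers below reinterpret the same four bytes as a string, an unsigned integer, and a float. The names bytes_as_u32, bytes_as_float, and show_interpretations are illustrative and not from the lecture, and the sketch assumes a 4-byte float. The numeric values printed depend on the machine's byte order.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Copy 4 raw bytes into a uint32_t. memcpy avoids the undefined
   behavior of casting between unrelated pointer types. */
uint32_t bytes_as_u32(const unsigned char b[4]) {
    uint32_t u;
    memcpy(&u, b, sizeof u);
    return u;
}

/* Copy the same 4 raw bytes into a float (assumes sizeof(float) == 4). */
float bytes_as_float(const unsigned char b[4]) {
    float f;
    memcpy(&f, b, sizeof f);
    return f;
}

/* Print three interpretations of one 4-byte sequence. */
void show_interpretations(void) {
    const unsigned char b[4] = { 'H', 'i', '!', 0 };
    printf("as string: %s\n", (const char *)b);
    printf("as uint32: %u\n", (unsigned)bytes_as_u32(b));
    printf("as float:  %g\n", bytes_as_float(b));
}
```

Calling show_interpretations() from main prints the three views. Only the string view is the same on every machine.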

Byte-Addressable Memory Organization

  • Programs refer to data by address
    • Conceptually, envision it as a very large array of bytes (virtual memory)
      • In reality, it’s not, but you can think of it that way (as do machine-level programs)
    • An address is like an index into that array
      • A pointer variable stores the (virtual) address of the 1st byte of some storage block

Byte-Addressable Memory Organization

  • Note: system provides private address spaces to each “process”
    • Think of a process as a program being executed (more on this concept later)
    • So, a program can hammer on its own data, but not that of others

Machine Words (sec 2.1.2)

  • Every computer has a word size
  • Nominal size of integer value data (including addresses)
  • Modern computers are mostly 64 bit
  • Virtual address space = set of all possible addresses
  • Machines still support multiple data formats (char, short, long, etc.)
  • Fixed-size words can have complicated implications!

Example: Data Representations (extension of fig 2.3)

Sizes in bytes:

C Data Type    Typical 32-bit   Typical 64-bit   x86-64
char           1                1                1
short          2                2                2
int            4                4                4
long           4                8                8
float          4                4                4
double         8                8                8
long double    -                -                10/16
pointer        4                8                8

Standard ISO C99

  • Introduced a “class of data types where the data sizes are fixed regardless of compiler and machine settings.”
    • E.g. int32_t (4 bytes), int64_t (8 bytes)
  • With fixed-size integer types, programmers have more control over data representations
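  • Whether plain char is signed or unsigned is implementation-defined (GCC on x86-64 treats it as signed); write signed char or unsigned char when it matters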
  • Advice: programs should be portable across machines and compilers
    • Try to make programs insensitive to the exact sizes of different data types
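
The difference between a platform-dependent type and a fixed-size C99 type can be sketched as follows. The function names are hypothetical. The sketch only relies on <stdint.h> and <inttypes.h>, which C99 provides.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Returns 1 when the fixed-width types have the sizes their names promise. */
int fixed_sizes_ok(void) {
    return sizeof(int32_t) == 4 && sizeof(int64_t) == 8;
}

/* Contrast long (size varies by platform) with int64_t (always 8 bytes). */
void print_long_vs_int64(void) {
    int64_t big = INT64_C(1) << 40;   /* does not fit in 32 bits */
    printf("sizeof(long)    = %zu\n", sizeof(long));
    printf("sizeof(int64_t) = %zu\n", sizeof(int64_t));
    /* PRId64 expands to the correct printf conversion for int64_t */
    printf("big             = %" PRId64 "\n", big);
}
```

Using PRId64 instead of a hard-coded %ld or %lld is one way to keep the printf call itself insensitive to the size of long.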

Word-Oriented Memory Organization (sec 2.1.3)

  • Addresses Specify Byte Locations
  • Address of first byte in word
  • Addresses of successive words differ by 4 (32-bit) or 8 (64-bit)
  • E.g. variable y type int (32-bit), &y = 0x100, y occupies 4 bytes, so the locations used for y are: 0x100, 0x101, 0x102 and 0x103

Byte Ordering

  • How are the bytes within a multi-byte word ordered in memory?
  • Conventions:
    • Big Endian: Sun, PPC Mac, Internet
      • Least significant byte has highest address (0x103 in our e.g.)
    • Little Endian: x86, ARM microprocessors running Android, iOS, and Windows
      • Least significant byte has lowest address (0x100 in our e.g.)
  • When pulled into the processor for math, everything still happens the same!

Byte Ordering Example

  • Variable x has 4-byte value of 0x01234567
  • Address given by &x (address of x) is 0x100
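
A short sketch of how a program could observe the byte order of x = 0x01234567. The helper names are illustrative, and the code assumes unsigned int is 4 bytes; the addresses printed are offsets from &x rather than the 0x100 used on the slide.

```c
#include <stdio.h>
#include <string.h>

/* Returns 1 if the byte at the lowest address of x is the least
   significant one (little endian), 0 otherwise. */
int is_little_endian(void) {
    unsigned int x = 0x01234567;
    unsigned char first;
    memcpy(&first, &x, 1);       /* byte stored at address &x */
    return first == 0x67;
}

/* Print each byte of x from lowest to highest address. */
void show_x_bytes(void) {
    unsigned int x = 0x01234567;
    const unsigned char *p = (const unsigned char *)&x;
    for (size_t i = 0; i < sizeof x; i++)
        printf("&x + %zu: 0x%.2x\n", i, (unsigned)p[i]);
}
```

On a little-endian machine such as x86-64 the bytes appear in the order 67 45 23 01; on a big-endian machine the order is 01 23 45 67.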

Representing Integers

Decimal: 15213
Binary:  0011 1011 0110 1101
Hex:      3    B    6    D
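
The conversion above can be reproduced with a small sketch. to_binary16 and show_15213 are hypothetical helper names; printf already handles the hex form through %X.

```c
#include <stdio.h>

/* Write the 16 low bits of v as four space-separated nibbles.
   out needs room for 16 digits, 3 spaces, and the terminating null. */
void to_binary16(unsigned v, char out[20]) {
    int k = 0;
    for (int bit = 15; bit >= 0; bit--) {
        out[k++] = ((v >> bit) & 1u) ? '1' : '0';
        if (bit % 4 == 0 && bit != 0)
            out[k++] = ' ';          /* separator after each nibble */
    }
    out[k] = '\0';
}

/* Print 15213 in decimal, binary, and hex. */
void show_15213(void) {
    char bits[20];
    to_binary16(15213u, bits);
    printf("Decimal: %u\nBinary:  %s\nHex:     %X\n", 15213u, bits, 15213u);
}
```

Each group of four binary digits maps to exactly one hex digit, which is why hex is the usual shorthand for bit patterns.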

Representing Characters

  • ASCII 7-bit character encoding most common
    • Only 128 characters total, 256 for extended ASCII
    • Western character sets covered
    • C converts single-quoted character constants to their ASCII codes
  • 16-bit Unicode encoding
    • Used for more complex character sets
    • Some newer languages support it systematically
    • p. 50 (Aside)

Representing Strings (sec 2.1.4)

  • Strings in C:
    • Represented by array of characters
    • Each character encoded in ASCII format
    • String should be null-terminated
      • Final character = 0 (not ‘0’, but the null character)
  • Lots of C string errors
    • Not allocating enough space for the string
    • Forgetting to copy or set the null character
    • Forgetting to allocate space for the null
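
The three errors listed above can be avoided as in the following sketch. copy_string is a hypothetical helper written for illustration, not a standard library function.

```c
#include <stdlib.h>
#include <string.h>

/* Return a heap-allocated copy of s, or NULL if allocation fails.
   The caller is responsible for calling free() on the result. */
char *copy_string(const char *s) {
    size_t len = strlen(s);          /* does not count the terminating null */
    char *copy = malloc(len + 1);    /* +1 leaves room for the null character */
    if (copy == NULL)
        return NULL;
    memcpy(copy, s, len);            /* copy the characters themselves */
    copy[len] = '\0';                /* set the terminator explicitly */
    return copy;
}
```

For the string "12345", strlen returns 5 but the copy occupies 6 bytes, the last of which is the null character.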

Representing Strings (sec 2.1.4)

  • Byte ordering not an issue for strings!
    • Array entries always stored in increasing order (of addresses)
    • So one byte characters are in increasing order (of addresses)
  • Using ASCII for encoding makes them independent of byte ordering and word size conventions. Therefore, text data are more platform independent than binary data.
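
To illustrate that a string's bytes are laid out the same way regardless of byte order, this sketch formats the bytes of a string in address order. string_bytes_hex is a hypothetical name, and the expected values assume an ASCII execution character set.

```c
#include <stdio.h>
#include <string.h>

/* Format the bytes of s, including the terminating null, as hex pairs
   separated by spaces. out must hold at least 3 * (strlen(s) + 1) + 1 bytes. */
void string_bytes_hex(const char *s, char *out) {
    size_t n = strlen(s) + 1;        /* include the null character */
    for (size_t i = 0; i < n; i++)
        sprintf(out + 3 * i, "%.2x ", (unsigned)(unsigned char)s[i]);
    out[3 * n - 1] = '\0';           /* drop the trailing space */
}
```

For "12345" the result is "31 32 33 34 35 00" on both little-endian and big-endian machines, since each character is a single byte stored at the next higher address.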

Representing Pointers

  • Different compilers & machines assign different locations to objects
  • Even get different results each time program runs
    • (Discuss pointers and arrays in detail in Sec. 3.8)
int b  = -15213;
int *p = &b;

Examining Data Representations

  • Code to Print Byte Representation of Data
    • Casting pointer to unsigned char * allows treatment as a byte array
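    #include <stdio.h>

    typedef unsigned char *pointer;

    /* Print the address and value of each of the len bytes starting at start. */
    void show_bytes(pointer start, size_t len) {
      size_t i;
      for (i = 0; i < len; i++)
        printf("%p\t0x%.2x\n", (void *)(start + i), start[i]);  /* %p expects void * */
      printf("\n");
    }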

Using show_bytes

int a = 15213;
printf("int a = 15213;\n");
show_bytes((pointer) &a, sizeof(int));
  • Results on Linux x86-64:
int a = 15213;
0x7ffee15d7e8c  0x6d
0x7ffee15d7e8d  0x3b
0x7ffee15d7e8e  0x00
0x7ffee15d7e8f  0x00

Moving Onto Chapter 3: Machine-Level Representations of Programs

Running Programs is just a Processor Interpreting Memory

  • Processor has a relatively simple model
    1. Read some bytes from a location in memory (the address in the program counter)
    2. Interpret the data in that location as a machine instruction
    3. Do what that instruction says, modifying processor and memory state appropriately
    4. Goto 1
  • The whole system software stack (OS, compilers, linkers, etc) relies on conventions about laying out programs in memory
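
The four-step loop above can be sketched as a tiny interpreter. The instruction set here (OP_HALT, OP_LOADI, OP_ADDI, OP_JNZ, each followed by a one-byte argument) is invented for illustration and has nothing to do with x86-64 encodings.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical opcodes for a toy machine; every instruction is 2 bytes. */
enum { OP_HALT = 0, OP_LOADI = 1, OP_ADDI = 2, OP_JNZ = 3 };

/* Run the program stored in mem starting at address 0.
   Returns the accumulator when the machine halts. */
int32_t run(const uint8_t *mem, size_t size) {
    size_t pc = 0;                   /* program counter */
    int32_t acc = 0;                 /* single accumulator register */
    while (pc + 1 < size) {
        uint8_t op = mem[pc];        /* 1. read bytes at the PC */
        uint8_t arg = mem[pc + 1];
        pc += 2;
        switch (op) {                /* 2. interpret them as an instruction */
        case OP_LOADI: acc = arg; break;                  /* 3. execute it */
        case OP_ADDI:  acc += (int8_t)arg; break;         /* signed 8-bit add */
        case OP_JNZ:   if (acc != 0) pc = arg; break;     /* jump if acc != 0 */
        default:       return acc;   /* OP_HALT or unknown opcode */
        }
    }                                /* 4. go back to step 1 */
    return acc;
}
```

The same array holds both the opcodes and their data, which mirrors the stored-program idea: the bytes only become "instructions" because the loop chooses to interpret them that way.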

Assembly/Machine Code View

  • Programmer-Visible State
    • PC: Address of next instruction
    • Register file: Temporary storage
    • Condition Codes: Special state for comparisons
    • Memory
      • Byte addressable array
      • Code and user data
      • Stack to support procedures

C vs Machine Code vs Assembly



/* Note: despite its name, this function computes n! (factorial). */
int fib(int n)
{
    int r = 1;
    /* Loop end check: n > 0. Loop decrement: n--. */
    for (; n > 0; n--) {
        r = r * n;
    }
    return r;
}
Disassembly of section .text:
Address  Insn Bytes            Assembly Code
40057d:  55                    push   %rbp
40057e:  48 89 e5              mov    %rsp,%rbp
400581:  89 7d ec              mov    %edi,-0x14(%rbp)
400584:  c7 45 fc 01 00 00 00  movl   $0x1,-0x4(%rbp)
40058b:  eb 0e                 jmp    40059b <fib+0x1e>
40058d:  8b 45 fc              mov    -0x4(%rbp),%eax
400590:  0f af 45 ec           imul   -0x14(%rbp),%eax
400594:  89 45 fc              mov    %eax,-0x4(%rbp)
400597:  83 6d ec 01           subl   $0x1,-0x14(%rbp)
40059b:  83 7d ec 00           cmpl   $0x0,-0x14(%rbp)
40059f:  7f ec                 jg     40058d <fib+0x10>
4005a1:  8b 45 fc              mov    -0x4(%rbp),%eax
4005a4:  5d                    pop    %rbp
4005a5:  c3                    retq

Toolchain for Compiled Programs

Why Learn Assembly?

  • Understand optimization capabilities of the compiler and analyze inefficiencies of the code
  • Maximize performance of critical sections of the code
  • Many programs are attacked through weaknesses that are visible at the machine-level representation of the programs
  • In the earlier days we needed to learn assembly to write the programs directly in that language
  • Today we need to be able to read and understand the code generated by compilers

Critical Piece of Advice

  • The textbook provides many examples and exercises that illustrate different aspects of assembly language and compilers.
  • From the authors of the textbook:
  • “This is a subject where mastering the details is a prerequisite to understanding the deeper and more fundamental concepts.”
  • “Those who say ‘I understand the general principles, I don’t want to bother learning the details’ are deluding themselves.”
  • “It is critical that you spend time studying the examples, working through exercises, and checking your solutions with those provided.”

History of Intel Processors

  • Read on your own: sec 3.1

Our Coverage

  • This class focuses on x86-64
  • Relevant textbook errata: here

Learning Objectives

  • Describe how to disassemble a compiled C program
  • List and describe all the x86-64 registers
  • List and describe the operand specifiers
  • Describe the 9 addressing modes and practice address calculations in an instruction

Machine Programming I: Basics

  • C, assembly, machine code
  • Assembly Basics: Registers, operands, move
  • Arithmetic & logical operations

Definitions

  • Architecture (also ISA: instruction set architecture): The parts of a processor design that one needs to understand in order to write or read assembly/machine code
    • Examples: instruction set specification, registers (processor state)
  • Microarchitecture: Implementation of the architecture
    • Examples: cache sizes and core frequency
  • Code Forms:
    • Machine Code: The byte-level programs that a processor executes
    • Assembly Code: A text representation of machine code
  • Example ISAs:
    • Intel: x86, IA32, Itanium, x86-64
    • ARM: Used in almost all mobile phones and increasingly in other places

Assembly/Machine Code View Revisited

  • PC: Program counter
    • Address of next instruction
    • Called %rip (the instruction pointer register) in x86-64
  • Register file
    • Contains 16 named locations
    • Heavily used program data
  • Condition codes
    • Store status information about most recent arithmetic or logical operation
    • Used for conditional branching

  • Memory
    • Byte addressable array
    • Code and user data
    • Stack to support procedures

Turning C into Object Code

  • Code in files: p1.c p2.c
  • Compile with command: gcc -Og p1.c p2.c -o p

Compiling into Assembly

long plus(long x, long y); 

void sumstore(long x, long y, 
              long *dest) 
{
    long t = plus(x, y);
    *dest = t;
}
sumstore:
    pushq   %rbx
    movq    %rdx, %rbx
    call    plus
    movq    %rax, (%rbx)
    popq    %rbx
    ret
  • To obtain yourself:
    • gcc -Og -S sum.c (-S means only preprocess and compile; stop before assembling)
    • Produces file sum.s
  • Potential for different results on different machines!

Assembly Characteristics: Data Types

  • “Integer” data of 1, 2, 4, or 8 bytes
    • Data values
    • Addresses (untyped pointers)
  • Floating point data of 4, 8, or 10 bytes
  • Code: Byte sequences encoding series of instructions
  • No aggregate types such as arrays or structures
    • Just contiguously allocated bytes in memory

Assembly Characteristics: Operations

  • Perform arithmetic function on register or memory data
  • Transfer data between memory and register
    • Load data from memory into register
    • Store register data into memory
  • Transfer control
    • Unconditional jumps to/from procedures
    • Conditional branches

Object Code

sumstore bytes:

0x0400595: 
    0x53
    0x48
    0x89
    0xd3
    0xe8
    0xf2
    0xff
    0xff
    0xff
    0x48
    0x89
    0x03
    0x5b
    0xc3
  • Assembler:
    • Translates .s into .o
    • Binary encoding of each instruction
    • Nearly-complete image of executable code
    • Missing linkages between code in different files
  • Linker:
    • Resolves references between files
    • Combines with static run-time libraries
    • Some libraries are dynamically linked
      • Linking occurs when program runs

From C to Machine Code

*dest = t;
  • C Code
    • Store value t where designated by dest
movq %rax, (%rbx)
  • Assembly
    • Move 8-byte value to memory
      • Quad words in x86-64 parlance
    • Operands:
      • t: Register %rax
      • dest: Register %rbx
      • *dest: Memory M[%rbx]
0x40059e:  48 89 03
  • Object Code
    • 3-byte instruction
    • Stored at address 0x40059e

Disassembling Object Code

0000000000400595 <sumstore>:
  400595:  53               push   %rbx
  400596:  48 89 d3         mov    %rdx,%rbx
  400599:  e8 f2 ff ff ff   callq  400590 <plus>
  40059e:  48 89 03         mov    %rax,(%rbx)
  4005a1:  5b               pop    %rbx
  4005a2:  c3               retq
  • Disassembler (inspects contents of machine-code files)
    • objdump -d sum
    • Useful tool for examining object code
    • Analyzes bit pattern of series of instructions
    • Produces approximate rendition of assembly code
    • Can be run on either a.out (complete executable) or .o file

GDB Disassembly

Object:

0x0400595: 
    0x53
    0x48
    0x89
    0xd3
    0xe8
    0xf2
    0xff
    0xff
    0xff
    0x48
    0x89
    0x03
    0x5b
    0xc3

Disassembled:

Dump of assembler code for function sumstore:
  0x0000000000400595 <+0>:  push   %rbx
  0x0000000000400596 <+1>:  mov    %rdx,%rbx
  0x0000000000400599 <+4>:  callq  0x400590 <plus>
  0x000000000040059e <+9>:  mov    %rax,(%rbx)
  0x00000000004005a1 <+12>: pop    %rbx
  0x00000000004005a2 <+13>: retq
  • Within gdb Debugger:
    • gdb sum
    • disassemble sumstore
      • Disassemble procedure
    • x/14xb sumstore
      • Examine the 14 bytes starting at sumstore

What Can Be Disassembled?

% objdump -d WINWORD.EXE

WINWORD.EXE:   file format pei-i386

No symbols in "WINWORD.EXE".
Disassembly of section .text:

30001000 <.text>:
30001000: FORBIDDEN!
30001001: FORBIDDEN! 
30001003: FORBIDDEN! 
30001005: FORBIDDEN! 
3000100a: FORBIDDEN!
  • Anything that can be interpreted as executable code
  • Disassembler examines bytes and reconstructs assembly source

x86-64 Integer Registers (figure 3.2)

x86-64 Integer Registers

  • 16 registers in total
  • Used to store integer data and pointers
  • Study figure 3.2 and learn the names of the registers
  • Multiple naming conventions have accumulated during HW evolution
  • Instructions can operate on data of different sizes

History: IA32 Registers

Using Registers

  • Unique register: %rsp indicates end position in the run-time stack
  • Some instructions specifically read and write this register
  • Other 15 registers are used more flexibly
  • Some instructions use specific registers
  • Standard programming conventions set the use of the registers to manage the stack (covered in Section 3.7 with the procedures)
  • Have fig. 3.2 handy (for the rest of the semester, may bring to tests)

x86-64 Size Terminology (fig. 3.1)

Desc. Letter Bytes Bits
byte b 1 8
word w 2 16
double word l 4 32
quad word q 8 64
  • Letters used as suffix in operations
  • E.g. data movement instruction variants given size:
    • movb (move byte)
    • movw (move word)
    • movl (move double word)
    • movq (move quad word)

Floating Point Sizes (fig. 3.1)

Desc. Letter Bytes Bits
Single Precision s 4 32
Double Precision l 8 64
  • No ambiguity or problems since the set of instructions and registers for floating point are separate and different from the other instructions

Operand Specifiers (fig. 3.3)

  • Instructions have 0, 1, or more operands
    • operands are used to specify:
      • source values: constants, value in register, or value in memory
      • destination location for result values: registers or memory
  • Operand types (ATT format):
    • immediate: constants have '$' prefix
    • register: array R indexed by register id, R[ra]
    • memory reference: access memory location using effective address.
      • array M of bytes, M[Addr], starting at address Addr

Modes For Memory References (fig. 3.3)

  • There are nine (9) forms of memory references
  • These modes are used to specify the index to M
  • Most general mode is Imm(rb, ri, s) which gives an effective address (index to M) computed as:
    • Imm + R[rb] + R[ri] * s
      • rb is the base register, ri is the index register, and Imm is a constant offset (written without the '$' prefix inside a memory reference)
      • s (scaling factor) can only be: 1, 2, 4, or 8
      • some components might be missing which gives diff. modes
  • More complex modes used for arrays and structure elements
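
To make the array use case concrete, here is a minimal sketch in which the address of a[i] is computed the way the scaled indexed form (rb, ri, 8) would compute it. element_address is a hypothetical helper, and the element type int64_t is chosen so that the scale factor is 8.

```c
#include <stdint.h>

/* Address of a[i] as R[rb] + R[ri] * s, where rb holds the array base,
   ri holds the index, and s = sizeof(int64_t) = 8. */
uintptr_t element_address(const int64_t *a, uintptr_t i) {
    return (uintptr_t)a + i * sizeof(int64_t);
}
```

The result matches (uintptr_t)&a[i] on the usual flat-address platforms, which is why a compiler can often fetch an array element with a single memory operand. The allowed scale factors 1, 2, 4, and 8 correspond to the sizes of the basic integer types.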

Example: Operand Specification

Operand             Effective address
0x100(%rax)         0x100 + content of %rax
0x100(%rax, %rbx)   0x100 + content of %rax + content of %rbx
(%rax, %rbx, 2)     content of %rax + content of %rbx * 2
  • Note: each of these computes a memory address; the operand's value is the contents of memory at that address

Practice: Problem 3.1

Address  Value      Register  Value
0x100    0xFF       %rax      0x100
0x104    0xAB       %rcx      0x1
0x108    0x13       %rdx      0x3
0x10C    0x11

Operand            Value
%rax               ________
0x104              ________
$0x108             ________
(%rax)             ________
4(%rax)            ________
9(%rax, %rdx)      ________
260(%rcx, %rdx)    ________
0xFC(, %rcx, 4)    ________
(%rax, %rdx, 4)    ________

Practice: Problem 3.1 Solution

Operand            Value    Comment
%rax               0x100    Register
0x104              0xAB     Absolute address
$0x108             0x108    Immediate
(%rax)             0xFF     Address 0x100
4(%rax)            0xAB     Address 0x104
9(%rax, %rdx)      0x11     Address 0x10C
260(%rcx, %rdx)    0x13     Address 0x108
0xFC(, %rcx, 4)    0xFF     Address 0x100
(%rax, %rdx, 4)    0x11     Address 0x10C
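
The memory-operand rows of the solution can be checked mechanically. The sketch below models the problem's memory as a lookup function and evaluates an operand of the form imm(base, index, scale). mem_read and mem_operand are hypothetical names, and the caller supplies the register contents from the problem (%rax = 0x100, %rcx = 0x1, %rdx = 0x3).

```c
#include <stdint.h>

/* Memory contents given in the problem; other addresses are unspecified. */
uint64_t mem_read(uint64_t addr) {
    switch (addr) {
    case 0x100: return 0xFF;
    case 0x104: return 0xAB;
    case 0x108: return 0x13;
    case 0x10C: return 0x11;
    default:    return 0;    /* placeholder: not defined by the problem */
    }
}

/* Value of the memory operand imm(base, index, scale):
   the contents of M[imm + base + index * scale]. */
uint64_t mem_operand(uint64_t imm, uint64_t base, uint64_t index, uint64_t scale) {
    return mem_read(imm + base + index * scale);
}
```

For example, 260(%rcx, %rdx) becomes mem_operand(260, 0x1, 0x3, 1). Since 260 is 0x104, the address is 0x108 and the value is 0x13. A missing base or index register is modeled by passing 0.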

Data Movement Instructions (fig. 3.4)

  • MOV S,D (from source (S) to destination (D))
Operation Desc.
movb move byte
movw move word
movl move double word (long word)
movq move quad word
movabsq move absolute quad word
  • General rule: the size of the operand determines which portion of the destination register is updated. Exception: movl with a register destination also sets the high-order 4 bytes of the register to 0.
  • movq, when given an immediate value, treats it as a 32-bit two’s-complement value, which it sign-extends to 64 bits
  • movabsq when given an immediate value, treats it as 64 bit
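
The register-update rules above can be modeled in C as a sketch. The sim_ functions are hypothetical and simulate only the effect on a 64-bit destination register; they are not how the hardware is implemented.

```c
#include <stdint.h>

/* movb into a register: only the low byte changes. */
uint64_t sim_movb(uint64_t reg, uint8_t src) {
    return (reg & ~UINT64_C(0xFF)) | src;
}

/* movw into a register: only the low 2 bytes change. */
uint64_t sim_movw(uint64_t reg, uint16_t src) {
    return (reg & ~UINT64_C(0xFFFF)) | src;
}

/* movl into a register: the low 4 bytes are replaced and the
   high 4 bytes are set to 0, so nothing of the old value survives. */
uint64_t sim_movl(uint64_t reg, uint32_t src) {
    (void)reg;
    return (uint64_t)src;
}

/* movq with an immediate: the 32-bit two's-complement immediate
   is sign-extended to 64 bits. */
uint64_t sim_movq_imm(int32_t imm) {
    return (uint64_t)(int64_t)imm;
}
```

Starting from a register holding 0x0011223344556677, sim_movb with 0xAA gives 0x00112233445566AA, while sim_movl with 0xCCCCCCCC gives 0x00000000CCCCCCCC. A 64-bit constant that does not fit in a sign-extended 32-bit immediate is the case movabsq exists for.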

Moving Data With movq

  • movq Source, Dest
  • Operand Types
    • Immediate: Constant integer data
      • Example: $0x400, $-533
      • Like C constant, but prefixed with '$'
      • Encoded with 1, 2, or 4 bytes
    • Register: One of 16 integer registers
      • Example: %rax, %r13
      • But %rsp reserved for special use
      • Others have special uses for particular instructions
    • Memory: 8 consecutive bytes of memory at address given by register
      • Simplest example: (%rax)
      • Various other “address modes”

movq Operand Combinations