Re: Introduction To x86 Assembly (Review)
Just a bit of an update:
Quote:
1. Preamble
Welcome,
As the title implies, this article has been written to give you a comprehensive introduction to the Assembly programming language. You need not have any prior experience in computer programming, although it would be beneficial to have an intermediate to advanced general ability working with computers, namely Windows.
From this article you will acquire the skills to develop your own software, learn some of the lean and mean aspects of how computers work, and most importantly gain a solid foundation for learning higher-level languages (such as C).
For those of you who are technically inclined, please keep in mind that I've intentionally not covered a lot of the more advanced concepts of systems architecture (e.g., ALU, cache, gates, etc.), and have kept things quite abstract.
2. Preparation
Luckily, there are very few requirements for programming in assembly:
* An x86 (or x86-64) Intel (or AMD) processor.
* Any Windows operating system.
* A plain text editor (such as Notepad).
* A copy of Microsoft's freeware assembler, MASM (see the steps below).
Step 1: Download MASM here: http://website.assemblercode.com/masm32/m32v9r.zip
Step 2: Run the installer. The steps required here are self-explanatory (just continue with everything).
Step 3: Create a folder for your work, anywhere you like. From now on this folder will be referred to as your "project folder".
Step 4: Create a copy of the Windows shell (cmd.exe), which can be found in your \WINDOWS\System32 folder, and place it in your project folder.
Step 5: Last of all, add the path of MASM's "bin" folder to your system PATH variable. You can do this via Control Panel -> System -> Advanced -> Environment Variables (remember to append a semi-colon to the end of the current PATH value, followed by the path of the bin directory). For example, your PATH variable may now look like: %SystemRoot%\system32;%SystemRoot%;%SystemRoot%\System32\Wbem;C:\masm32\bin
Note: You may have to restart for this change to take effect.
3. Mr. Von Neumann
With the exception of some embedded systems, computers follow one very basic model, the Von Neumann architecture, which lays out the general structure of a computer as seen below: [TODO]
All the components of a computer communicate via the system bus. Why the bus metaphor? One way to picture it is that the system bus runs a route ("bus route"), stopping at each component ("bus stop"), picking up information ("passenger pickup") and dropping it off at another component ("passenger drop-off"). This means that the frequency of the system bus (typically measured in MHz) limits how fast devices can be accessed, which is worth keeping in mind when purchasing memory: memory can only be accessed as fast as the system bus allows. For example, if you have a 600MHz bus and a memory module that operates at 1600MHz, there is going to be a bottleneck between the two, reducing practical memory bandwidth.
Memory consists of a linear array of 8-bit groupings called bytes, which we use to store data. Each byte is referenced using a unique numeric identifier (its address), which is where the term 'byte-addressable memory' comes from in relation to the majority of modern processors. Although the most common kind of 'data' is numeric values, it is truly up to the programmer to define what their 'data' is, and subsequently how to interpret it. For example, ASCII characters are encoded as bytes. A better example would be bit flags, where each bit is used to indicate a certain state. For example, you may have a byte which indicates various options of a user's account like so:
[Bit 0] - Account enabled?
[Bit 1] - Account banned?
[Bit 2] - Account muted?
[Bit 3] - Account pending deletion?
[Bit 4-7] - Unused.
We manipulate these options by setting the individual bits to either 1 or 0 (on or off, whatever). Note that in the above example we didn't use a 'whole' byte (we only used 4 bits, which is called a 'nibble'), which is perfectly legal. Also, when we refer to bits within a byte, we number them from least-significant to most-significant, starting at 0. Hence, in the above example the "Account enabled" bit would be the least-significant digit if we were to interpret the byte as an integer.
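To make the bit flags a little more concrete, here is a rough sketch of how such a byte could be manipulated in MASM. The variable name account is made up for this example, the instructions used (OR, AND, XOR, TEST) have not been introduced yet, and the surrounding directive lines are the skeleton explained in Chapter 4. The final ret is an assumption of this sketch, there only so the program hands control back to Windows.

```asm
.486
.model flat
.data
; Hypothetical account-options byte, all bits clear to begin with.
account db 00000000b
.code
_start:
    or   account, 00000001b   ; set bit 0: account enabled
    or   account, 00000100b   ; set bit 2: account muted
    and  account, 11111011b   ; clear bit 2: no longer muted
    xor  account, 00000010b   ; toggle bit 1: banned flips from 0 to 1
    test account, 00000010b   ; check bit 1 (zero flag is clear, as the bit is set)
    mov  al, account          ; AL = 00000011b, i.e. 3 when read as an integer
    ret                       ; assumption: return control to Windows
end _start
```

Each mask has a 1 only in the bit position being set, toggled or tested, and a 0 only in the position being cleared, so the other options in the byte are left alone.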
So far I've only spoken about bytes. Although you can use as much memory (within availability limits) as you like to store your data, the number of bytes that can be manipulated (or accessed) by a single instruction (which you'll learn about soon) depends primarily on the processor's architecture. For example, a 32-bit x86 processor can manipulate a 4-byte chunk of memory using a single instruction, while a 64-bit (x86-64) processor can manipulate 8 bytes. This is where the terms "32-bit" and "64-bit" processor come from. Of course, both can manipulate values smaller than 32 (or 64) bits, but the size must be a whole number of bytes and a power of two (so really 1, 2, 4 or 8 bytes).
Many common naming conventions exist for naming these byte groupings of different sizes, the following are what we will use:
Byte - 1 byte (8 bits.)
Word - 2 bytes (16 bits.)
Dword - 4 bytes (32 bits.)
One important thing to note is that when you want to refer to a chunk of data larger than a byte, whether it be a word or a dword, you use the address of its least-significant byte (the one with the lowest address). For example, if you are using memory addresses 10 to 13 as a dword, then you would specify 10 as the address. How much data to access depends on the instruction, but it is always taken from lower to higher memory addresses.
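As a small sketch of this rule (the variable name number and its value are made up for illustration, and the MOV instruction is covered in Chapter 4), the same starting address can be read as a byte, a word or a dword. The final ret is an assumption of this sketch so that the program exits cleanly.

```asm
.486
.model flat
.data
number dd 11223344h               ; hypothetical dword occupying 4 consecutive bytes
.code
_start:
    mov eax, number               ; whole dword:           EAX = 11223344h
    mov cl, byte ptr number       ; byte at lowest address: CL = 44h
    mov dx, word ptr number       ; word at lowest address: DX = 3344h
    mov ch, byte ptr [number+3]   ; byte at highest address: CH = 11h
    ret                           ; assumption: return control to Windows
end _start
```

The least-significant byte (44h) sits at the lowest address, which is why all three accesses that start at number share it.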
The processor is no doubt the most well-known component of a system. I've often found that people are daunted by computers due to their seemingly magical ways, but in reality, everything they accomplish involves a series of very primitive steps called 'instructions'.
As you've most likely guessed, an instruction is what you use to order the processor to do some task, and like all tangible data it resides in memory. Although the format of instructions varies from one to another, what is consistent is that all instructions have a unique identifier (which is a number as far as we're concerned) that tells the processor what we want it to do. The rest of the instruction's contents largely depend on the instruction itself. For example, an instruction that tells the processor to set the value (integer) of a chunk of memory would contain the address of the memory to be set and the actual value to set it to, along with a few other tidbits (such as an indicator of what size the chunk is: byte, word or dword). So how do we get the processor to execute our instructions? On board the processor there are a handful of dedicated data containers called 'registers'. One of these registers is named EIP, the Instruction Pointer, which contains the memory address of the instruction that is currently being executed. After the processor has finished executing that instruction, it increments EIP by the size (in bytes) of that instruction and executes the next, so on and so forth, forever.
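As an illustration of "identifier plus contents", the sketch below writes the same instruction twice: once as a mnemonic and once as the raw bytes it is made of. The instruction (MOV, covered in Chapter 4) and the value 500 are just an example, and the final ret is an assumption of this sketch so that the program exits cleanly.

```asm
.486
.model flat
.code
_start:
    mov eax, 500                    ; assembled as 5 bytes: B8 F4 01 00 00
    db 0B8h, 0F4h, 01h, 00h, 00h    ; the same instruction, written as raw bytes
    ret                             ; assumption: return control to Windows
end _start
```

Here B8 is the identifier for "load EAX with the 32-bit value that follows", and F4 01 00 00 is 500 (1F4h) stored least-significant byte first. Since the instruction is 5 bytes long, EIP moves forward by 5 once it has executed. If you want to check the bytes yourself, adding ML's /Fl option when assembling should produce a listing file that shows them.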
The registers mentioned above are an important topic. Essentially they are like main memory, except that they are much smaller, independent, and physically contained on the processor chip (making for much faster access). There are 8 general-purpose registers at your disposal, which are all 32 bits in size:
[ EAX [ AX [ AH, AL ] ] ]
[ EDX [ DX [ DH, DL ] ] ]
[ ECX [ CX [ CH, CL ] ] ]
[ EBX [ BX [ BH, BL ] ] ]
[ ESI [SI] ]
[ EDI [DI] ]
[ EBP [BP] ]
[ ESP [SP] ]
The above may seem a little confusing, but don't worry, I'll explain. Since registers are technically not the same as main memory, they do not have a numerical address; therefore you refer to them by their actual name (like EAX). There are of course some instances where you do not wish to use a whole 32-bit register (since they are quite the commodity) and want to use only a portion. To accommodate this, the registers are broken down into multiple parts that reference a specific 'area' of the register. Take EAX for example: it is broken down into 4 separately addressable (although overlapping) registers. AX is the low-order word of EAX, AH is the high-order byte of AX, and AL is the low-order byte of AX (there is no way to directly reference the high-order word of EAX). Don't forget, although they are referenced separately, EAX, AX, AH and AL all overlap, and thus if you were to modify AX, then EAX would be modified too (although only its low 16 bits). The same rule applies to the other registers, except for ESI/EDI/EBP/ESP, of which you can only access either the whole 32 bits or the low 16 bits.
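The overlap is easiest to see by writing to the parts of EAX one at a time. The values below are arbitrary, the MOV instruction is covered in Chapter 4, and the final ret is an assumption of this sketch so that the program exits cleanly.

```asm
.486
.model flat
.code
_start:
    mov eax, 12345678h    ; EAX = 12345678h
    mov ax, 0ABCDh        ; only the low word changes:      EAX = 1234ABCDh
    mov al, 00h           ; only the lowest byte changes:   EAX = 1234AB00h
    mov ah, 0FFh          ; only the second byte changes:   EAX = 1234FF00h
    ret                   ; assumption: return control to Windows
end _start
```

Notice that the high-order word (1234h) survives every one of these writes, since none of AX, AH or AL reaches it.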
4. Assembly, The Language
Now that the theory is over, it's time for the practical.
Unlike other languages, Assembly is not one specific language; rather, it's the general name given to the mnemonic-oriented languages used to program physical devices at the lowest possible level, which for this text we'll assume are all processors. So what is a 'mnemonic language'? In the simplest case it's the assignment of textual tokens which directly represent the components and various facets of a processor (e.g., instructions), enabling the programmer to work with more human-friendly names rather than dragging out a hex editor and writing code directly (which is still an abstraction, and is still supported by MASM, but is well beyond the scope of this text). In fact, you've already seen some of these tokens, like the registers covered in the previous section ("EAX", "EBX", etc.). Of course, you cannot simply feed this text-based code to the processor, which is why we have an assembler (ML.exe), which performs the relatively simple process of converting our source code to actual machine code, and then the linker (link.exe), which arranges the machine code into the desired file format (which will be an .EXE throughout this tutorial).
Before we begin, you should be aware that although the bulk of the language syntax is directly related to the processor (and the act of programming it), there are other language concepts that exist both to slightly simplify the structure of our code and to actually enable us to generate an executable. This is because we are not alone in the computer; we must be aware of the operating system and be able to interface with the protocols it defines (such as the format of executables). You will also find that other assemblers (as opposed to MASM) have their own conventions, but the differences are only slight.
First, open your text-editor and type (or copy) the following:
.486
.model flat
.code
_start:
end _start
Save the file as test.asm in your project folder (The file must have the .ASM extension). Next, run your copy of the command-line (cmd.exe) and run the assembler:
ml /c /coff test.asm
You should now have a file called test.obj. Now invoke the linker like so:
link /SUBSYSTEM:CONSOLE /MACHINE:IX86 test.obj
You should now have an executable called test.exe. Go ahead and run it, and you'll find that an error occurs. This is expected and will be explained soon. First we'll cover the code we just wrote:
.486
This line indicates which specific processor we are targeting. In our case it's the 80486, the CPU that preceded the very first Pentium. Keep in mind that the x86 family of processors is for the most part backwards compatible (which is why we can specify such an old architecture).
.model flat
The processor is capable of using a variety of memory models, which we won't get into here. All you need to know is that this line specifies that the 'flat' model be used (which is what Windows has used ever since NT).
.code
In one way or another, a program is separated into different segments for various purposes, but mainly for maintenance and debugging. The above line declares the start of the "code" segment (sometimes called the "text" segment), which, as you might've guessed, is where we place our code.
_start:
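The operating system needs to know where a program should start executing. This line is a label that marks that spot (it is the last line of the file that nominates this label as the entry point). Therefore, after this line is where we write our instructions (code).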
end _start
This line is the complement to _start: it marks the end of the source file and names _start as the place where execution begins.
The above example provides a basic skeleton program, which for the most part will remain the same throughout the rest of this tutorial and will be built upon appropriately. Of course, it is only a skeleton and has no meat (so to speak), and thus does not actually do anything, because we did not write any meaningful code (i.e., instructions); this is the reason for the error you received.
Now that it's time to actually write some instructions, I'll explain the format MASM expects. An instruction is comprised of two things. The first is the instruction's name (which is a mnemonic for the instruction's identifier, as was explained in Chapter 3). Secondly, an instruction can take one, two or three 'operands' (in fact, some instructions don't require any). An operand is simply extra information that the instruction needs in order to do its job. For example, the integer subtraction instruction (mnemonic "SUB") requires two operands: the first is the value to subtract from, and the second is the value to subtract from it. These values are not magical and therefore must reside somewhere. That said, there are four different types of operand:
Register - The operand refers to one of the CPU registers (e.g., ECX).
Immediate - The operand is encoded into the instruction itself, in memory.
Memory - The operand is located somewhere in memory; what you write is the address of that memory.
Implicit - The operand is implied by the instruction, that is, one or more of the instruction's operands are predefined.
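A rough sketch of the operand types in use follows. The variable name total and all of the values are made up for illustration, and the final ret is an assumption of this sketch so that the program exits cleanly.

```asm
.486
.model flat
.data
total dd 12               ; hypothetical value held in memory
.code
_start:
    mov eax, 7
    mov ecx, 50
    mov edx, 8
    sub ecx, edx          ; register operand:  ECX = 50 - 8  = 42
    sub ecx, 10           ; immediate operand: ECX = 42 - 10 = 32
    sub ecx, total        ; memory operand:    ECX = 32 - 12 = 20
    mul edx               ; implicit operands: EAX is multiplied by EDX,
                          ; the result goes to EDX:EAX (EAX = 56, EDX = 0)
    ret                   ; assumption: return control to Windows
end _start
```

MUL is the interesting one: only EDX is written in the source, yet the instruction also reads EAX and writes both EAX and EDX, because those operands are predefined.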
The format for writing an instruction in MASM is as follows:
InstructionName [ Operand [, Operand ... ] ]
In English: first the instruction mnemonic is written, then its operands (if it has any), each separated by a comma (Note: at least one white-space must separate the instruction name and its operand list). For example, the MOV instruction, which moves (actually, it copies) a piece of data from one location to another, could be used like so:
MOV eax, 500
The above instruction sets the value of the EAX register to 500; the first operand is a register and the second operand is an immediate. MOV is very common, and by my estimate will constitute anywhere from 40-60% of your code. Be aware that an instruction cannot always take any combination of operand types. Take MOV, whose format is:
MOV dst, src
Where 'dst' (destination) is where you are 'moving' data to, and 'src' (source) is where you are moving data from. In simpler terms, the first operand is given the value of the second. The following are the valid combinations of operand types for MOV:
MOV register, immediate (As in the above example)
MOV register, register
MOV register, memory
MOV memory, immediate
MOV memory, register
You cannot MOV from memory to memory; you must first MOV to a register, then MOV from that register to the other memory location.
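The sketch below runs through these combinations, finishing with the two-step copy from one memory location to another. The variable names first and second are made up for illustration, and the final ret is an assumption of this sketch so that the program exits cleanly.

```asm
.486
.model flat
.data
first  dd 1234            ; hypothetical source value
second dd 0               ; hypothetical destination
.code
_start:
    mov eax, 500          ; register, immediate
    mov ecx, eax          ; register, register (also valid): ECX = 500
    mov second, 7         ; memory, immediate
    ; Memory to memory has to go through a register:
    mov eax, first        ; register, memory:  EAX = 1234
    mov second, eax       ; memory, register:  second = 1234
    ret                   ; assumption: return control to Windows
end _start
```

Writing "mov second, first" instead of the last two MOVs would be rejected by the assembler, since no single MOV takes two memory operands.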
Unfortunately (or fortunately, depending on how you look at it), access to devices such as video or network is not available to user-mode applications such as the ones we'll be writing. Rather, you will need to call upon the operating system to perform these tasks, which leads us to one of the oldest programming notions: functions (or "procedures", "routines").
Ever since programmers started writing code, they realized that much of it could be reused to a large extent with very little modification (or even none at all). This was the case with many general-purpose tasks, such as raising one value to the power of another, and thus the convention of "functions" was invented.
So what is a function exactly? Essentially, it's a sequence of instructions that performs some specific task and can be called (executed) at any point in time (it is re-entrant) without causing a problem. I feel the best way to explain this is to start where they (the original programmers) started, so without further ado, here's an example of a sequence of instructions that raises a value to a power:
; Initialize (MUL places the high half of each product in EDX).
xor edx, edx
; ECX will store the value we're raising.
mov ecx, 10
; EBX will store the power we're raising it to (must be at least 1).
mov ebx, 6
; EAX will store the result (which we'll initialize to 1, so that
; multiplying by the base EBX times gives the base raised to EBX).
mov eax, 1
; Now we calculate.
RAISE_POWER:
; Multiply the running result by the base.
mul ecx
; Check if we're done yet.
sub ebx, 1
jnz RAISE_POWER
For the moment, I do not expect you to understand the above code (it will be explained soon). What you do need to know about it is that it uses the ECX register to store the value to be raised, and EBX to store the power to raise it to. So, what happens if you need to do the exact same operation again, but with values other than 10 and 6 (10^6)? Well, you could write the code above wherever you needed it, which would result in much larger and more complex code (and a larger executable image). Or, instead, what if you revisited that one sequence of code after setting ECX and EBX appropriately? This is exactly what a function is, and ECX and EBX would be called the function's "parameters". There would be a problem with this design, though: after the function (the code sequence above) finishes executing, it would just continue on, no doubt causing a hell of a mess and most probably a GPF (General Protection Fault). To solve this problem, there needs to be some way for the function to know where to return execution to after it's finished (that is, return to the sequence of code that invoked it). There are a few ways to do this; we'll first explore the register flavour, which would go something like this:
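PROC_RAISE_POWER:
; Initialize (MUL places the high half of each product in EDX).
xor edx, edx
; EAX will store the result (which we'll initialize to 1, so that
; multiplying by the base EBX times gives the base raised to EBX).
mov eax, 1
; Now we calculate.
RAISE_POWER:
; Multiply the running result by the base.
mul ecx
; Check if we're done yet.
sub ebx, 1
jnz RAISE_POWER
; Go back to the calling code (ESI holds the address to return to).
jmp esi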
; Lets call (execute) our raise power function.
mov ecx, 2
mov ebx, 5
mov esi, $
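The calling code above breaks off partway through. As a hedged sketch of how the register flavour could be completed (this is one possible way, not necessarily what the rest of the article does), the caller can place the address of a label into ESI and then jump to the function. The label name BACK_HERE is made up for this example, and the final ret is an assumption of this sketch so that the program exits cleanly.

```asm
.486
.model flat
.code
_start:
    ; Let's call (execute) our raise power function: 2^5.
    mov ecx, 2                   ; parameter: the value to raise
    mov ebx, 5                   ; parameter: the power (must be at least 1)
    mov esi, offset BACK_HERE    ; where the function should return to
    jmp PROC_RAISE_POWER
BACK_HERE:
    ; Execution resumes here with EAX = 32.
    ret                          ; assumption: return control to Windows

PROC_RAISE_POWER:
    mov eax, 1                   ; running result
RAISE_POWER:
    mul ecx                      ; result = result * base
    sub ebx, 1                   ; one fewer multiplication to go
    jnz RAISE_POWER
    jmp esi                      ; go back to the calling code
end _start
```

The function itself never needs to know who called it; it simply jumps to whatever address the caller left in ESI, so the same code can be reused from any number of places.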