A C program can consist of one or more files; the text of the program is kept in units called source files. The phases of translation are a series of steps a translator, or compiler, must go through to convert a source file into an executable program. During these phases, the source file gets converted into a preprocessing translation unit, then into a translation unit, and finally into an executable program. It is also possible to translate individual units separately and then later link them to produce an executable program.[1]
Translation phases
The C11 standard (ISO/IEC 9899:2011) specifies eight translation phases:
- Character mapping
- Line splicing
- Tokenization
- Preprocessing
- Character-set mapping
- String concatenation
- Translation
- Linkage
Character mapping
During the first phase of translation, the physical source is mapped to the source character set in an implementation-defined manner. For example, the compiler may choose to interpret the source as UTF-8 or simply as ASCII and convert it to the implementation's internal source representation if necessary.[2]
In addition to the character mapping, trigraph sequences are replaced by the corresponding single-character internal representations. For example:
#include <stdio.h>
int main()??<
char hello??(??) = "Hello World!";
puts(hello);
return 0;
??>
becomes:
#include <stdio.h>
int main(){
char hello[] = "Hello World!";
puts(hello);
return 0;
}
Line splicing
During the second phase of translation, each backslash character (\) immediately followed by a new-line character is deleted together with that new-line character, splicing the physical source lines to form logical source lines.[3] For example:
#include <stdio.h>
int main() {
p\
u\
t\
s("Hello World");
return 0;
}
becomes:
#include <stdio.h>
int main() {
puts("Hello World");
return 0;
}
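Line splicing can be sketched in the same style. This is only an illustration, assuming the source text sits in a NUL-terminated buffer with new-lines represented as '\n'; splice_lines is a hypothetical name:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Deletes every backslash that is immediately followed by a new-line,
   together with that new-line, rewriting src in place.
   Returns the new length. */
size_t splice_lines(char *src)
{
    assert(src != NULL);
    size_t in = 0, out = 0;
    while (src[in] != '\0') {
        if (src[in] == '\\' && src[in + 1] == '\n') {
            in += 2;                  /* drop the backslash/new-line pair */
        } else {
            src[out++] = src[in++];   /* keep everything else unchanged */
        }
    }
    src[out] = '\0';
    return out;
}
```

A backslash followed by any other character, including a space, is left alone, which is why trailing white space after a backslash prevents the splice.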
Tokenization
In the third phase of translation, the source file is decomposed into preprocessing tokens and sequences of white-space characters. Each comment is replaced by one space character, and new-line characters are retained.[4]
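The effect of treating a comment as a space can be observed from within a program. The sketch below uses two hypothetical functions and a hypothetical STR macro: in the first function the comment keeps the two + characters apart, and in the second the stringified argument shows the single space that stands in for the comment:

```c
#include <assert.h>
#include <string.h>

#define STR(x) #x   /* turns the argument's preprocessing tokens into a string */

/* The comment separates the two plus signs, so the expression is
   40 + (+2) and not an increment operator applied to 2. */
int sum_with_comment(void)
{
    return 40 +/* separator */+2;
}

/* The comment between a and b counts as white space, which
   stringification reduces to a single space character. */
const char *stringified(void)
{
    return STR(a/* comment */b);
}

void tokenization_demo(void)
{
    assert(sum_with_comment() == 42);
    assert(strcmp(stringified(), "a b") == 0);
}
```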
Preprocessing
During this phase, all preprocessing directives are executed, macro invocations are expanded, and _Pragma unary operator expressions are executed. Each file named in an #include directive is processed from phase 1 through phase 4, recursively. At the conclusion of this phase, all preprocessing directives are deleted.[5]
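A short sketch of directives and macro expansion at work follows; the macro and function names are hypothetical and chosen only for illustration. After phase 4 none of the lines beginning with # remain, and the function bodies contain only the expanded text:

```c
#include <assert.h>

#define BUFFER_SIZE 16            /* object-like macro */
#define SQUARE(x) ((x) * (x))     /* function-like macro */

/* Conditional inclusion: only one of the two definitions survives. */
#if BUFFER_SIZE > 8
#define SIZE_CLASS 2
#else
#define SIZE_CLASS 1
#endif

/* Expands to ((16 / 4) * (16 / 4)). */
int macro_area(void)
{
    return SQUARE(BUFFER_SIZE / 4);
}

/* Expands to the constant 2, because BUFFER_SIZE > 8 held in phase 4. */
int size_class(void)
{
    return SIZE_CLASS;
}

void preprocessing_demo(void)
{
    assert(macro_area() == 16);
    assert(size_class() == 2);
}
```

Most compilers can stop after this phase and print the result, which is a convenient way to inspect what the later phases actually see; the option for doing so is implementation-specific.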
Character-set mapping
In the fifth phase of translation, each source character set member and escape sequence in character constants and string literals is converted to the corresponding member of the execution character set. If there is no corresponding member, it is converted to an implementation-defined member other than the null character.[6]
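Escape sequences make this conversion visible, since each one occupies several source characters but yields a single execution character. The function names in the sketch below are hypothetical, and the last function only reports whether the implementation uses an ASCII-compatible execution character set, which the standard does not require:

```c
#include <assert.h>
#include <string.h>

/* Four execution characters: 'a', horizontal tab, 'b', new-line. */
size_t escaped_length(void)
{
    return strlen("a\tb\n");
}

/* The hexadecimal escape \x41 and the octal escape \101 both denote
   the value 65. */
int hex_matches_octal(void)
{
    return '\x41' == '\101';
}

/* Assumption: only on an ASCII-compatible execution character set is
   the value 65 also the letter A. */
int is_ascii_compatible(void)
{
    return 'A' == 65;
}

void charset_demo(void)
{
    assert(escaped_length() == 4);
    assert(hex_matches_octal());
}
```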
String concatenation
In this phase, all adjacent string literal tokens are concatenated. For example, "A" "B" "C" becomes "ABC", and "A" u"B" "C" becomes u"ABC".[7]
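Concatenation is commonly used to break a long literal across several source lines. A minimal sketch with hypothetical function names, limited to unprefixed literals:

```c
#include <assert.h>
#include <string.h>

/* Three adjacent literals form one string with one terminating null. */
const char *joined(void)
{
    return "A" "B" "C";
}

/* The same mechanism lets a long message span several source lines. */
const char *long_message(void)
{
    return "Hello, "
           "World"
           "!";
}

void concatenation_demo(void)
{
    assert(strcmp(joined(), "ABC") == 0);
    assert(sizeof("A" "B" "C") == 4);   /* three characters plus the null */
    assert(strcmp(long_message(), "Hello, World!") == 0);
}
```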
Translation
In the seventh phase, white-space characters separating tokens are no longer significant. Each preprocessing token is converted into a token. The resulting tokens are syntactically and semantically analyzed and translated as a translation unit.[8]
Linkage
In the final phase of translation, all external object and function references are resolved. Library components are linked to satisfy external references to functions and objects not defined in the current translation. All translator output is collected into a single program image, which contains the information needed for execution in its execution environment.[9]
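An external reference that is left for this phase can be sketched as follows. The declaration of puts is written by hand here instead of being taken from <stdio.h>, which the standard permits for library functions whose declarations need no header-defined types; greet is a hypothetical name. The translation unit contains no definition of puts, so the call is resolved only when the C library is linked in:

```c
#include <assert.h>

/* Declared here, defined elsewhere: the translator records an external
   reference to puts, and phase 8 resolves it against the C library. */
int puts(const char *s);

/* Returns a nonnegative value if the output succeeded. */
int greet(void)
{
    return puts("Hello World");
}
```

If no linked library or translation unit defined puts, translation through phase 7 would still succeed, and the failure would be reported only at this final phase as an unresolved reference.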
References
- [1] ISO/IEC 9899:2011 §5.1.1.1 p1
- [2] ISO/IEC 9899:2011 §5.1.1.2 p1.1
- [3] ISO/IEC 9899:2011 §5.1.1.2 p1.2
- [4] ISO/IEC 9899:2011 §5.1.1.2 p1.3
- [5] ISO/IEC 9899:2011 §5.1.1.2 p1.4
- [6] ISO/IEC 9899:2011 §5.1.1.2 p1.5
- [7] ISO/IEC 9899:2011 §5.1.1.2 p1.6
- [8] ISO/IEC 9899:2011 §5.1.1.2 p1.7
- [9] ISO/IEC 9899:2011 §5.1.1.2 p1.8