Latest revision as of 12:43, 5 February 2021
A C program can consist of one or more files; the text of the program is kept in units called source files. The phases of translation are a series of steps a translator, or compiler, must go through to convert a source file into an executable program. During these phases, the source file gets converted into a preprocessing translation unit, then into a translation unit, and finally into an executable program. It is also possible to translate individual units separately and then later link them to produce an executable program.[1]
Translation phases
The C11 standard specifies eight translation phases:
- Character mapping
- Line splicing
- Tokenization
- Preprocessing
- Character-set mapping
- String concatenation
- Translation
- Linkage
Character mapping
During the first phase of translation, the physical source is mapped to the source character set in an implementation-defined manner. For example, the compiler may choose to interpret the source as UTF-8 or simply as ASCII and convert it to the implementation's internal source representation if necessary.[2]
In addition to the character mapping, trigraph sequences are replaced by their corresponding single-character internal representations. For example:
#include <stdio.h>
int main()??<
    char hello??(??) = "Hello World!";
    puts(hello);
    return 0;
??>
becomes:
#include <stdio.h>
int main(){
    char hello[] = "Hello World!";
    puts(hello);
    return 0;
}
Line splicing
During the second phase of translation, each backslash character (\) immediately followed by a new-line character is deleted, splicing physical source lines to form logical source lines.[3] For example:
#include <stdio.h>
int main() {
    p\
u\
t\
s("Hello World");
    return 0;
}
becomes:
#include <stdio.h>
int main() {
    puts("Hello World");
    return 0;
}
Tokenization
In the third phase of translation, the preprocessor decomposes the source file into preprocessing tokens and sequences of whitespace characters. Each comment is replaced by a single space character, and new-line characters are retained.[4]
Preprocessing
During this phase, all preprocessing directives are executed, macro invocations are expanded, and _Pragma unary operator expressions are executed. Any source file named in an #include directive is itself processed from phase 1 through phase 4, recursively. At the end of this phase, all preprocessing directives are deleted.[5]
Character-set mapping
In the fifth phase of translation, each source character set member and escape sequence in character constants and string literals is converted to the corresponding member of the execution character set. If there is no corresponding member, the character is converted in an implementation-defined manner to some member other than the null character.[6]
String concatenation
In the sixth phase of translation, all adjacent string literal tokens are concatenated. For example, "A" "B" "C" becomes "ABC", and "A" u"B" "C" becomes u"ABC".[7]
Translation
In the seventh phase, whitespace characters separating tokens are no longer significant. Each preprocessing token is converted into a token. The resulting tokens are syntactically and semantically analyzed and translated as a translation unit.[8]
Linkage
In the final phase of translation, all external object and function references are resolved. Library components are linked to satisfy external references to functions and objects that are not defined in the current translation. All translation units are collected into a single program image that contains the information needed for execution in its execution environment.[9]
References
[1] ISO/IEC 9899:2011 §5.1.1.1 p1
[2] ISO/IEC 9899:2011 §5.1.1.2 p1.1
[3] ISO/IEC 9899:2011 §5.1.1.2 p1.2
[4] ISO/IEC 9899:2011 §5.1.1.2 p1.3
[5] ISO/IEC 9899:2011 §5.1.1.2 p1.4
[6] ISO/IEC 9899:2011 §5.1.1.2 p1.5
[7] ISO/IEC 9899:2011 §5.1.1.2 p1.6
[8] ISO/IEC 9899:2011 §5.1.1.2 p1.7
[9] ISO/IEC 9899:2011 §5.1.1.2 p1.8