From WikiChip
Difference between revisions of "c/phases of translation"
Revision as of 09:05, 4 January 2015

A C program can consist of one or more files; the text of the program is kept in units called source files. The phases of translation are a series of steps a translator, or compiler, must go through to convert a source file into an executable program. During these phases, the source file gets converted into a preprocessing translation unit, then into a translation unit, and finally into an executable program. It is also possible to translate individual units separately and then later link them to produce an executable program.[1]

Translation phases

The C11 standard (ISO/IEC 9899:2011) specifies eight translation phases:

  1. Character mapping
  2. Line splicing
  3. Tokenization
  4. Preprocessing
  5. Character-set mapping
  6. String concatenation
  7. Translation
  8. Linkage

Character mapping

During the first phase of translation, the physical source file is mapped to the source character set in an implementation-defined manner. For example, the compiler may choose to interpret the source as UTF-8 or simply as ASCII and convert it to the implementation's internal source representation if necessary.[2]

In addition to the character mapping, trigraph sequences are replaced by their corresponding single-character internal representations. For example:

#include <stdio.h>
int main()??<
    char hello??(??) = "Hello World!";
    puts(hello);
    return 0;
??>

becomes:

#include <stdio.h>
int main(){
    char hello[] = "Hello World!";
    puts(hello);
    return 0;
}

Line splicing

During the second phase of translation, each instance of a backslash character (\) immediately followed by a new-line character is deleted, splicing physical source lines to form logical source lines.[3] For example:

#include <stdio.h>
int main() {
    p\
u\
t\
s("Hello World");
    return 0;
}

becomes:

#include <stdio.h>
int main() {
    puts("Hello World");
    return 0;
}
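
A more common use of line splicing is a macro definition written across several physical lines. The sketch below is only an illustration: SUM_OF_SQUARES and sum_of_squares are hypothetical names, not taken from the standard. Because phase 2 runs before directives are processed in phase 4, the preprocessor sees a single logical #define line.

```c
#include <assert.h>

/* Phase 2 deletes each backslash/new-line pair, so these three physical
   lines reach the preprocessor as one logical #define directive. */
#define SUM_OF_SQUARES(a, b) \
    (((a) * (a)) +           \
     ((b) * (b)))

/* The macro is an integer constant expression, so it can be checked
   at compile time. */
static_assert(SUM_OF_SQUARES(3, 4) == 25, "spliced macro expands as one line");

/* Hypothetical helper that uses the spliced macro at run time. */
int sum_of_squares(int a, int b)
{
    return SUM_OF_SQUARES(a, b);
}
```

Note that the backslash must be the very last character on the line; anything between it and the new-line, even a space, prevents the splice as the standard defines it.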

Tokenization

In the third phase of translation, the preprocessor decomposes the source file into preprocessing tokens and sequences of whitespace characters. Each comment is replaced by a single space character, and new-line characters are retained.[4]
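
One observable consequence is that a comment separates the tokens on either side of it rather than vanishing. The following is a minimal sketch with a hypothetical function name (comment_demo). If the comment were simply deleted, the two + characters would merge into a single ++ token and the expressions would not compile.

```c
#include <assert.h>

/* The comment becomes one space, so this is "1 + + 2", that is 1 + (+2). */
static_assert(1 +/* becomes one space */+ 2 == 3,
              "the comment separates the two + tokens");

/* Hypothetical helper for illustration only. */
int comment_demo(void)
{
    int/* separates 'int' from 'value' */value = 40;
    int extra = 2;

    /* Tokenized as: return value + + extra ; */
    return value +/**/+ extra;
}
```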

Preprocessing

During the fourth phase, all preprocessing directives are executed, macro invocations are expanded, and _Pragma unary operator expressions are executed. Any file named by an #include directive is processed from phase 1 through phase 4, recursively. By the conclusion of this phase, all preprocessing directives have been deleted.[5]
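
As a rough illustration of directive execution and macro expansion, the sketch below uses made-up names (BUFFER_SIZE, SQUARE, LARGE_BUFFER, square_of_next). The #if is evaluated during this phase, so only one of the two LARGE_BUFFER definitions survives, and no directive remains when the later phases run.

```c
#include <assert.h>

#define BUFFER_SIZE 16
#define SQUARE(x) ((x) * (x))

/* The conditional is evaluated by the preprocessor; the branch that is
   not taken is discarded entirely. */
#if BUFFER_SIZE > 8
#define LARGE_BUFFER 1
#else
#define LARGE_BUFFER 0
#endif

static_assert(LARGE_BUFFER == 1, "the #if branch was selected");
static_assert(SQUARE(BUFFER_SIZE) == 256, "nested macro expansion");

/* Hypothetical helper: SQUARE(n + 1) expands to ((n + 1) * (n + 1)). */
int square_of_next(int n)
{
    return SQUARE(n + 1);
}
```

The parentheses around x in SQUARE matter because expansion is textual: without them, SQUARE(n + 1) would expand to n + 1 * n + 1.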

Character-set mapping

In the fifth phase of translation, each source character set member and escape sequence in character constants and string literals is converted to the corresponding member of the execution character set. If there is no corresponding member, it is converted in an implementation-defined manner to some member other than the null character.[6]
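
The effect on escape sequences can be checked at compile time. This is a small sketch; escaped_length and is_ascii_execution_charset are hypothetical helpers. Each escape sequence counts as a single character of the resulting string, and whether 'A' equals 0x41 depends on the execution character set, so that check is written as a run-time query rather than an assertion.

```c
#include <assert.h>
#include <stddef.h>

/* "\x41" is one escape sequence, so the array holds it plus the
   terminating null character. */
static_assert(sizeof("\x41") == 2, "one character plus the terminator");

/* Octal 101 and hexadecimal 41 name the same value. */
static_assert('\101' == '\x41', "octal and hex escapes agree");

/* Hypothetical helper: 'a', tab, 'b' and new-line are four characters. */
size_t escaped_length(void)
{
    return sizeof("a\tb\n") - 1;
}

/* Assumption: returns 1 only on an ASCII-compatible execution character
   set; the standard does not guarantee this. */
int is_ascii_execution_charset(void)
{
    return 'A' == 0x41 && '0' == 0x30;
}
```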

String concatenation

In the sixth phase, all adjacent string literal tokens are concatenated. For example, "A" "B" "C" becomes "ABC" and "A" u"B" "C" becomes u"ABC".[7]
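
The concatenation rule can be verified with sizeof, as in the sketch below (long_message is a hypothetical helper). The third assertion shows why the phase order matters: escape sequences are converted in phase 5, before concatenation, so "\x12" "3" is two characters rather than the single escape \x123.

```c
#include <assert.h>
#include <string.h>

/* Adjacent literals form one string: 'A', 'B', 'C' and the terminator. */
static_assert(sizeof("A" "B" "C") == sizeof("ABC"), "plain concatenation");

/* One prefixed literal makes the whole result a u"" string, so each of
   the four elements has the size of a u"" element. */
static_assert(sizeof("A" u"B" "C") == 4 * sizeof(u""[0]),
              "the u prefix applies to the whole result");

/* '\x12', '3' and the terminator: the escape was converted before the
   two literals were joined. */
static_assert(sizeof("\x12" "3") == 3, "escapes are converted first");

/* Hypothetical helper: a long literal split across source lines. */
const char *long_message(void)
{
    return "Phases of "
           "translation";
}
```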

Translation

In the seventh phase, whitespace characters separating tokens become insignificant. Each preprocessing token is converted into a token. The resulting tokens are syntactically and semantically analyzed and translated as a translation unit.[8]

Linkage

In the final phase of translation, all external object and function references are resolved. Library components are linked to satisfy external references to functions and objects that are not defined in the current translation. All translator output is then collected into a single program image, which contains the information needed for execution in its execution environment.[9]


References

  1. ISO/IEC 9899:2011 §5.1.1.1 p1
  2. ISO/IEC 9899:2011 §5.1.1.2 p1.1
  3. ISO/IEC 9899:2011 §5.1.1.2 p1.2
  4. ISO/IEC 9899:2011 §5.1.1.2 p1.3
  5. ISO/IEC 9899:2011 §5.1.1.2 p1.4
  6. ISO/IEC 9899:2011 §5.1.1.2 p1.5
  7. ISO/IEC 9899:2011 §5.1.1.2 p1.6
  8. ISO/IEC 9899:2011 §5.1.1.2 p1.7
  9. ISO/IEC 9899:2011 §5.1.1.2 p1.8