Difference between revisions of "mirc/regex"

Revision as of 20:22, 22 September 2014

Regular expressions can be used to perform complicated pattern-matching operations. This page assumes you already know how to write regular expressions; it does not teach them.

Information

mIRC uses the PCRE library to implement regex with the following options enabled:

--enable-utf8
--enable-unicode-properties
--with-match-limit - around 1,000,000
--with-match-limit-recursion - 999

mIRC has two custom modifiers:

S - strips control codes from the input before matching (not supported by $hfind).
g - performs a global match: after one match has been found, mIRC tries to match again from the current position.
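The two modifiers can be compared with a small alias. This is a minimal sketch: the alias name modifier_demo is hypothetical, and it assumes the bold control code ($chr(2)) is among the codes that S strips.

```mirc
alias modifier_demo {
  ; Build "hello" wrapped in bold control codes ($chr(2)).
  var %text = $chr(2) $+ hello $+ $chr(2)
  ; Without S the control codes are part of the input, so the anchored pattern should not match.
  echo -a without S: $regex(%text,/^hello$/)
  ; With S the control codes are stripped before matching, so the anchored pattern should match.
  echo -a with S: $regex(%text,/^hello$/S)
  ; With g every match is counted: "banana" contains three "a" characters.
  echo -a with g: $regex(banana,/a/g)
}
```

Run it with /modifier_demo and compare the three return values.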

mIRC remembers up to 50 regex matches. After 50 matches, the first match is overwritten.

$regex, $regsub and $regsubex can take an optional name as a parameter, which lets you reference that call later. If you do not specify a name, mIRC uses a default one.

$regex([name],<input>,<regex>)

Performs a regular expression match and returns the number of matches found. A negative return value indicates an error: -8 if the maximum number of matches allowed is reached, or -21 if the maximum recursion depth allowed is reached.
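Because errors are reported as negative numbers, a caller can check the sign of the return value before trusting the match count. The alias below is only a sketch; checked_regex and the name "checked" are hypothetical.

```mirc
alias checked_regex {
  ; $1 = input text, $2 = pattern (hypothetical parameters of this helper).
  var %n = $regex(checked,$1,$2)
  if (%n < 0) {
    ; -8 and -21 are the limit errors described above.
    echo -a regex failed with error %n
    return 0
  }
  return %n
}
```

For example, //echo -a $checked_regex(test,/[es]/g) would report the number of matches, or 0 after printing the error code.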

mIRC remembers up to 32 captured texts (backreferences). Use $regml([name],N) to return the Nth backreference, or N = 0 to return the total number of backreferences.

$regml() also has a .pos property, which returns the position in the input where this was captured.

//noop $regex(name,test,/[es]/g) | echo -a $regml(name,0) : $regml(name,1) -- $regml(name,2)
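The .pos property can be read the same way as the captured text. A minimal sketch, where the alias name regml_pos_demo and the regex name "pd" are hypothetical:

```mirc
alias regml_pos_demo {
  ; One capture group, so $regml(pd,1) holds the captured word.
  noop $regex(pd,hello world,/(world)/)
  ; .pos reports where in the input that capture was found.
  echo -a captured: $regml(pd,1) position: $regml(pd,1).pos
}
```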

$regsub([name],<input>,<regex>,<subtext>,<%varname>)

Performs a regular expression match, like $regex(), and then performs a substitution using <subtext>.

Returns N, the number of substitutions made, and assigns the result to <%varname>.

//noop $regsub(name,test,/([es])/g,t,%result) | echo -a %result : $regml(name,0) : $regml(name,1) -- $regml(name,2)
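Inside a script, both the return value and the output variable are usually of interest. The following sketch uses hypothetical names (regsub_demo, "rs", %out, %n) to show the two side by side:

```mirc
alias regsub_demo {
  ; %out receives the substituted text; %n receives the number of substitutions.
  var %out
  var %n = $regsub(rs,test,/([es])/g,X,%out)
  ; "e" and "s" are each replaced by "X".
  echo -a %n substitutions: %out
}
```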

$regsubex([name],<input>,<regex>,<subtext>)

$regsubex is a more modern version of $regsub: it performs the match and the substitution, then returns the result of the substitution.

This time, <subtext> is evaluated during substitution and can be an identifier.

<subtext> can also contain special markers:

\0 - returns the number of matches
\n - returns the current match number
\t - returns the current match text (same as $regml(\n))
\a - returns all match items
\A - returns a non-spaced version of \a.
\1 \2 \N ... - returns the Nth backreference for the current match
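A short sketch of the markers in use (the alias name marker_demo is hypothetical). The first call should uppercase each word through \1; the second combines \n and \0 so that each matched letter is replaced by its match number and the total.

```mirc
alias marker_demo {
  ; \1 is the first backreference of the current match, passed to $upper().
  echo -a $regsubex(hello world,/(\w+)/g,$upper(\1))
  ; \n is the current match number, \0 the total number of matches.
  echo -a $regsubex(a-b-c,/([a-z])/g,\n of \0)
}
```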

Notes on $regsubex:

The main steps when mIRC evaluates an identifier are:

Processes [ ] (evaluating any variables/identifiers inside them once) and [[ ]] (turning them into [ ])
Separates the identifier's parameters and evaluates each parameter (in left-to-right order).
Passes the parameters to the identifier

$regsubex works a bit differently because it has its own parsing routine: it must not evaluate the 'subtext' parameter before making the regex match. The steps are:

Processes [ ] and [[ ]]
Separates the parameters and evaluates the 'input' and 'regex' parameters
Performs the regex match
* Tokenizes $1- according to the number of markers used in the 'subtext' parameter
Replaces any markers used in the subtext by their corresponding $N identifiers
Evaluates the 'subtext' parameter (one or more times, if /g is used)
Performs the substitutions and returns the result.

* mIRC internally uses $1- to store the values of the markers, which means you cannot use the previous tokenization of $1- in the subtext.

The way mIRC does this is rather inelegant: it counts the markers you used and creates a list of tokens ($1-).

Each token is assigned a value and mIRC then replaces the marker with the corresponding $N value.

Let's say your subtext is "\t \t \1 \n": mIRC assigns the matched text to $1 and to $2, the first backreference in the pattern to $3, and the iteration number to $4.

If you use the form \N, where N is a number greater than or equal to 1 (like \1), and the pattern has no such backreference number, mIRC fills that value (internally, using $1-) with the value of $regml(\n + N - 1):

$regsubex(abcdefgij,/([a-z])/g,<\6>)

\6 doesn't refer to anything here: the pattern does not create six backreferences.

When 'a' is matched, \n is 1; only one marker is used, so $1 is filled with $regml(1 + 6 - 1) = $regml(6), which is 'f'.
When 'b' is matched, \n is 2, so $1 is filled with $regml(2 + 6 - 1) = $regml(7), which is 'g'.
And so on until \n + N - 1 is greater than the number of backreferences; from that point on, the characters are replaced with $null.

Because of this, you cannot use the previous $N- value in the subtext.
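A common way around this limitation is to copy the caller's $1- into local variables before calling $regsubex. The sketch below assumes a hypothetical alias wrap_words that takes a wrapper string and some input text:

```mirc
alias wrap_words {
  ; $1 = wrapper text, $2- = input (hypothetical parameters of this helper).
  ; Copy them to local variables first: inside the subtext, $1- holds marker values.
  var %wrap = $1, %input = $2-
  ; %wrap is evaluated for every match; \1 is the word captured by the pattern.
  return $regsubex(%input,/(\w+)/g,%wrap $+ \1 $+ %wrap)
}
```

Calling //echo -a $wrap_words(*,hello world) should surround each word with the wrapper, whereas using $1 directly in the subtext would pick up a marker value instead.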

Nested $regsubex calls are possible but let's remember the main steps:

Processes [ ] and [[ ]]
Separates the parameters and evaluates the 'input' and 'regex' parameters
Performs the regex match
Tokenizes $1- according to the number of markers used in the 'subtext' parameter
Replaces any markers used in the subtext by their corresponding $N identifiers
Evaluates the 'subtext' parameter (one or more times, if /g is used)
Performs the substitutions and returns the result.

When mIRC replaces the markers, it will do so on the whole subtext parameter:

$regsubex(abcdefcdab,/(cd)/g,\t : $regsubex(\t,/(.)/g,$upper(\t)) : \t)

The outer $regsubex makes the regex match, then replaces \t everywhere in its subtext, including inside the nested call:

$regsubex(\t,/(.)/g,$upper(\t))

Here every \t gets the value of the matched text of the outer $regsubex, even the one inside $upper(), so it won't work as expected: you want the \t inside $upper() to be the matched text of the inner $regsubex, not the outer one.

What we want to do is to get mIRC to see something different than "\t" when looking at the markers inside $upper in the subtext of the outer $regsubex.

If we used $regsubex(\t,/(.)/g,$upper( \ $+ t )), we would just end up calling $upper(\t) with the plain text \t, because that $+ is only evaluated when $upper is evaluated. We need to intervene after the outer $regsubex has finished replacing markers but before $upper() is called.

The solution is to use the [[ \ $+ t ]] construct:

$regsubex(abcdefcdab,/(cd)/g,\t : $regsubex(\t,/(.)/g,$upper( [[ \ $+ t ]] )))

As we know, $regsubex doesn't evaluate the subtext parameter up front, but the processing of [ ] and [[ ]] is done for the whole line. So mIRC first changes this line into:

$regsubex(abcdefcdab,/(cd)/g,\t : $regsubex(\t,/(.)/g,$upper( [ \ $+ t ] )))

Notice how only the [[ ]] changed: $+ was not evaluated because the subtext parameter is not evaluated at this stage, and the [ ] processing happens before that.

Now the outer $regsubex gets its parameters (mIRC does not see \t there; it sees \ $+ t, which is what we wanted), performs the regex match and evaluates the subtext:

$regsubex(<value of \t in the outer $regsubex>,/(.)/g,$upper( [ \ $+ t ] ))

As usual, [ ] is processed first, and \ $+ t gives \t before the inner $regsubex starts to replace its own markers, which is exactly what we wanted.

Note also that you cannot use a marker in the inner $regsubex subtext itself to get the value of that marker in the outer $regsubex context:

$regsubex(abcdefcdab,/(cd)/g,@\t : $regsubex(\t,/(.)/g, <why is \ $+ t not cd : \t>) $+ @)

This is because mIRC uses the intermediate $1- value. When mIRC replaces the markers of the outer $regsubex:

4 markers used in the subtext of the outer $regsubex
$1 = $2 = $3 = $4 = \t = the matched text

The code becomes:

$regsubex($1,/(.)/g, <why is \ $+ t not cd : $1 $+ >) $+ @)

mIRC adds the $+ if the marker has text surrounding it.

Now the inner $regsubex is evaluated. At this point, $1- is still what the outer $regsubex's tokenization produced, so before the markers are replaced, you have:

$regsubex(<value of $1>,/(.)/g, <why is \ $+ t not cd : $1 $+ >) $+ @)

As we saw, the subtext is not evaluated at this stage, so the $1 in the subtext is not evaluated either. Then the markers are replaced:

0 markers used
$1 = $null

Since $1 is now $null, so is the $1 in the inner $regsubex's subtext parameter.

/filter

/filter supports the -g switch to use a regular expression. You cannot get the backreference values using $regml() if you use a custom alias as the output (-k); you need to make a $regex call on that line inside the alias.
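The pattern of re-running $regex inside the output alias can be sketched as follows. The file name input.txt, the alias names filter_numbers and show_number, and the regex name "fn" are all hypothetical.

```mirc
alias filter_numbers {
  ; -f: the input is a file, -k: send each matching line to an alias, -g: regex match.
  filter -fkg input.txt show_number /\d+/
}
alias show_number {
  ; /filter passes the matching line in $1-; run $regex again to fill $regml().
  if ($regex(fn,$1-,/(\d+)/)) echo -a first number: $regml(fn,1)
}
```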

$hfind

$hfind can be used with regex, but it doesn't support the custom S modifier.
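As an illustration, the sketch below builds a throwaway hash table (the table name "demo" and the alias name hfind_demo are hypothetical) and counts the item names matching a regex with the r switch of $hfind:

```mirc
alias hfind_demo {
  ; Start from a clean table so the example is repeatable.
  if ($hget(demo)) hfree demo
  hmake demo
  hadd demo nick1 first
  hadd demo nick22 second
  hadd demo other third
  ; The r switch treats the second parameter as a regex matched against item names;
  ; N = 0 returns the number of matching items (nick1 and nick22 here).
  echo -a matching items: $hfind(demo,/^nick\d+$/,0,r)
  hfree demo
}
```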

/write, $read, $fline etc

These are some of the other places in which regex can be used.
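For instance, $read accepts an r switch to search a file with a regular expression. A minimal sketch, assuming a hypothetical file notes.txt and alias name read_demo:

```mirc
alias read_demo {
  ; n: do not evaluate the line that is read, r: treat the match text as a regex.
  var %line = $read(notes.txt,nr,/^\d+-\d+-\d+/)
  ; $readn holds the number of the line that matched.
  if (%line != $null) echo -a line $readn : %line
}
```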

