From WikiChip
Difference between revisions of "mirc/hash tables"
< mirc

Line 210: Line 210:
 
mIRC's hash tables are implemented as follows.
 
mIRC's hash tables are implemented as follows.
  
# When you create a hash table, it is created with a defined number of "slots".  
+
# When you create a hash table, it is created with a defined number of "slots" (or "buckets").  
# When you add an item to the table using {{mIRC|/hadd}}, a hash algorithm calculates a slot number based on the item name. Each slot holds a linked-list of items whose names map to that slot number, with each new item being inserted at the beginning of the list.
+
# When you add an item to the table using {{mIRC|/hadd}}, a hash algorithm calculates a slot number based on the item name. The formula for the hashing algorithm is assumed to be "$hash(item,32) % no-of-slots". Each slot holds a linked-list of items whose names map to that slot number, with each new item being inserted at the beginning of the list.
 
# When you look up an item by name using {{mIRC|$hget}}, then the same hash algorithm is used to locate the slot it will be stored under and then the linked-list in that slot is searched sequentially for the item. The purpose of hashing is to support this kind of lookup on potentially large tables with fast performance.
 
# When you look up an item by name using {{mIRC|$hget}}, then the same hash algorithm is used to locate the slot it will be stored under and then the linked-list in that slot is searched sequentially for the item. The purpose of hashing is to support this kind of lookup on potentially large tables with fast performance.
 
# When you get an item by position using {{mIRC|$hget|$hget(table,position)}}, or use $hfind to search data or search using wildcards or regular expressions, then the has algorithm cannot be used to identify the correct slot and instead mIRC has to iterate across part or all of the hash table to count or to find the record you want.  
 
# When you get an item by position using {{mIRC|$hget|$hget(table,position)}}, or use $hfind to search data or search using wildcards or regular expressions, then the has algorithm cannot be used to identify the correct slot and instead mIRC has to iterate across part or all of the hash table to count or to find the record you want.  
Line 220: Line 220:
 
* If you try to find a non-existent item, mIRC will need to iterate over the whole list before determining that the item doesn't exist.
 
* If you try to find a non-existent item, mIRC will need to iterate over the whole list before determining that the item doesn't exist.
  
As you might imagine, if mIRC has to iterate over a large number of hash table entries to find what you need, the it might spend quite a long time (in computer terms) doing this and mIRC may start to feel unresponsive.
+
As you might imagine, mIRC iterating over a large number of hash table entries to find the item needed is CPU intensive and mIRC might start to feel unresponsive.
  
So for look-ups by item name, the best performance will be achieved when mIRC's hashing algorithm points to a slot with a single table entry (or failing that a small number of entries). Worst case scenario is if your hash table has only a single slot, then all entries are stored in a single linked-list. On the other hand, if you have a large number of slots, then the likelihood is that every item will be stored in its own slot, so the hash function will take you to a slot with a single entry, and no iteration will be needed to find the item.
+
So for look-ups by item name, the best performance will be achieved when mIRC's hashing algorithm points to a slot with a single table entry (or failing that a small number of entries). Worst case scenario is if your hash table has only a single slot, then all entries are stored in a single linked-list and every look-up needs to be iterated. On the other hand, if you have a large number of slots (much greater than the number of items in the hash table), then the likelihood is that every item will be stored in its own slot, so the hash function will take you to a slot with a single entry, and no iteration will be needed to find the item.
  
 
All that said, even with a large number of slots, you cannot guarantee that every item in the table will have a unique hash / slot number. As an analogy, consider a class of 30 students. What is the probability that all students have birthdays on different days of the year? This is equivalent to asking whether a hash table with 365 slots and 30 entries will have every entry using a different slot. It turns out that in a class of 30 students there is significantly more than 50% probability that at least two students will share a birthday - indeed it only takes 23 students for the probability to be more than 50%. This seems weird - but for the mathematically inclined, the probability can be calculated by determining what the probability is that M students have all different birthdays:
 
All that said, even with a large number of slots, you cannot guarantee that every item in the table will have a unique hash / slot number. As an analogy, consider a class of 30 students. What is the probability that all students have birthdays on different days of the year? This is equivalent to asking whether a hash table with 365 slots and 30 entries will have every entry using a different slot. It turns out that in a class of 30 students there is significantly more than 50% probability that at least two students will share a birthday - indeed it only takes 23 students for the probability to be more than 50%. This seems weird - but for the mathematically inclined, the probability can be calculated by determining what the probability is that M students have all different birthdays:
Line 248: Line 248:
 
</math>
 
</math>
  
Using the student birthday example for simplicity and relating it to hash tables and slots, if we turn it on its head then we can say that if we have a hash table holding 23 entries, and we want to have a probability that each entry has its own slot > 50%, then we need to have more than 365 slots. '''I bet you weren't expecting that!!''' Fortunately the performance overhead of iterating over a relatively short linked-list is also small - but unfortunately the calculation of probability of slot lists being (say) a maximum of 2 or 3 entries are much more complicated - but equally fortunately, a slot only uses 4-bytes (which is very small indeed compared to the size of a table entry, which consists of the item name and the data and the overhead of storing these and linking them into a list). Indeed mIRC's maximum slot size is 10,000, requiring c. 40KB of memory - which in today's PCs with several GB of memory is relatively small.
+
Using the student birthday example for simplicity and relating it to hash tables and slots, if we turn it on its head then we can say that if we have a hash table holding 23 entries, and we want to have a probability that each entry has its own slot > 50%, then we need to have more than 365 slots. '''I bet you weren't expecting that!!''' Fortunately the performance overhead of iterating over a relatively short linked-list is also small, and equally fortunately a slot only uses 4-bytes (which is very small indeed compared to the size of a table entry, which consists of the item name and the data and the overhead of storing these and linking them into a list). Indeed mIRC's maximum slot size is 10,000, requiring c. 40KB of memory - which in today's PCs with several GB of memory is relatively small.
  
'''Summary:''' If you are doing lookup by item name on a frequent basis on a large table, then you should use the largest possible slot size to avoid mIRC iterating over long linked-lists when doing these lookups.
+
'''Summary:''' If you are doing any lookups by item name on a frequent basis on a large table, then you should use the largest sensible slot size to avoid mIRC iterating over long linked-lists when doing these lookups.
  
 
'''HOWEVER...'''
 
'''HOWEVER...'''
  
 
Not all hash table look-ups are able to use the hash to calculate the correct slot - only look-ups by item name. If you want to access hash table entries by position using {{mIRC|$hget|$hget(table,n)}}, or if you want to use {{mIRC|$hfind}}, then mIRC is going to have to iterate over a significant proportion (or all) of a hash table regardless of the number of slots that you define. Indeed, if you are never going to look-up by item-name, you might as well save memory and use a single slot.  
 
Not all hash table look-ups are able to use the hash to calculate the correct slot - only look-ups by item name. If you want to access hash table entries by position using {{mIRC|$hget|$hget(table,n)}}, or if you want to use {{mIRC|$hfind}}, then mIRC is going to have to iterate over a significant proportion (or all) of a hash table regardless of the number of slots that you define. Indeed, if you are never going to look-up by item-name, you might as well save memory and use a single slot.  
 +
 +
'''Summary:''' If you are doing only lookup by position or using $hfind, then you should use a slot size of 1 to save memory and avoid the small overhead of iterating over empty slots.
  
 
Finally, if you have a large hash table (perhaps several thousand records) that you want to search flexibly, then you might wish to consider whether something like mIRC SQLlite might suit your needs better.
 
Finally, if you have a large hash table (perhaps several thousand records) that you want to search flexibly, then you might wish to consider whether something like mIRC SQLlite might suit your needs better.

Revision as of 07:45, 14 October 2018

A hash table is an associative array with item-data pairing. That is, data stored in the table is associated with a specific item. Logically speaking, a basic table would like something like this:

Item Data
Item1 Data1
Item2 Data2
Item3 Data3

mIRC provides facilities for manipulating the table and the values in variety of ways.

General Details

Hash tables, unlike INI files, are stored completely in memory and are never written to disk (unless the /hsave command is used), making them much faster when it comes to storing and retrieving information. The performance gain is much more obvious with a large amount of item/data pairs.

Note: Because hash tables are only in memory, it must be saved to a file if mIRC has to exit for any reason. You can reload the table from a file at a later period.

Creating a table

A hash table must be created before you can work with it. This also applies to loading a hash table from a file. To create a table you need to use the /hmake command. The syntax is:

hmake <table_name>
hmake <table_name> <buckets>
;hmake also have an -s switch which prints debug info

If you don't specify the number of buckets, the default is used which is 100. Assuming that you are going to look up a specific item by name using $hget, then generally speaking the number of buckets should be decided based on the following equation:

Equation buckets equals StartFraction number of items that will be used Over 0.78 EndFraction

For example: a table with 100 buckets is optimal for 78 items. For 1000 items, 1300 buckets is best. Note: The maximum valid number of buckets is 10000.

Or to put this another way, the optimum number of slots is 1.282x the number of items you are going to store in the hash table.

Note: It is unclear where the value of 0.78 is derived from, but a likely reason is that this is the optimum number to try to ensure that each slot only has a single entry assigned to it. See below for a technical explanation of how mIRCs hash tables work and why you might want to have only a single item per slot.

Adding Items

The /hadd command is used to add an item/data pair to the table. The syntax is:

hadd <table_name> <item> <data>
hadd -b <table_name> <item> <&bvar>
An item can be added to the table with null data:
hadd <table_name> <item>

Let's consider a table of favorite colors:

/hadd -ms colors John Blue
/hadd -s colors Mary Green
/hadd -s colors Gary Orange

The code above will produce the following result:

* Made hash table 'colors' (100)
* Added item 'John' to hash table 'colors'
* Added item 'Mary' to hash table 'colors'
* Added item 'Gary' to hash table 'colors'
"Colors" Hash Table
Item Data
John Blue
Mary Green
Gary Orange

Deleting Items

To delete pairs from the table, you need to use the /hdel command. Its syntax is:

hdel <table_name> <item>
hdel -w <table_name> <wild_item>
;hdel has a -s switch which is the same as /hmake's

A wildcard pattern for the item can be specified to delete multiple items at once. If we go back to our example:

/hdel colors Mary

Will leave our table looking like this:

"Colors" Hash Table
Item Data
John Blue
Gary Orange

Value retrieval

To get a data value associated with a given item we will use the $hget identifier which has the following syntax:

$hget(<table_name>, <item>)

For example, if we were to check what is John's favorite color from our table; we will use the following piece of code:

//echo -a $hget(colors, John)
;Blue

The $hget identifier can also be used to check if a table exists using the following syntax:

$hget(<table_name>)
; returns $null if the table does not exist

Saving/Loading Hash Table to/from file

Because a hash table is stored exclusively in memory, it is important to save it to a file if one wishes to keep its content after a reboot or shut down. If a hash table is not stored in a file before mIRC closes, it will be gone for good.

mIRC offers the /hsave and /hload commands to handle the saving and loading of hash tables from your hard disk.

The syntax for the /hsave command is:

/hsave <table_name> <filename>
; The -s switch displays debug information
; The -a switch will append to an existing file, instead of the default overwriting
; The -i switch will create an ini file
; The -n switch saves the file containing only the data and not the item names.
; The -u switch avoids skipping temporary items created with /hadd's -uN switch.
; The -b switch will treat the file as a binary file, making it possible to save things like carriage returns and line feeds.

If we wanted to save our little colors table to an INI file, we could use the following piece of code.:

/hsave -i Colors colors.ini
;colors.ini will have:
;  [hashtable]
;  Gary=Orange
;  John=Blue

To load a hash table we use the following syntax:

; NOTE: The table must exists. I.e. you must have called /hmake first
/hload <table_name> <filename>
; The -s switch displays debug information
; The -i switch will read from an ini file containing lines of item=data
; The -n switch interprets the file as if it were /hsave'ed with the -n switch to contain only data, assigning item names as sequential integers beginning with 1.
; The -m[N] switch creates the hashtable if it does not already exist, optionally giving it N buckets different than the default 100 buckets.
; The -b switch will treat the file as a binary file, making it possible to save things like carriage returns and line feeds.

To load the table we've just saved we would use the following code:

/hload -i Colors colors.ini

Deleting a Table

To complete destroy a table and all its values, you can use the hfree command:

/hfree <table_name>
/hfree -w <*wild*table*>
;hfree has a -s switch which is the same as /hmake's

With the -w switch you can specify a wildcard pattern. All matching tables will be freed. If you already deleted a table and try to delete it again, "hfree tablename" halts your script. You must either use $hget(tablename) to verify the table's existence, or use -w without a wildcard. "hfree -w tablename_without_wildcards"

Iterating Over a Hash Table

The $hget identifier can be used to iterate over the hash table. The syntax is:

; Total Number of items in the table:
$hget(<table_name>, 0).item
; Get the Nth Item
$hget(<table_name>, <Nth>).item
; Get the value associated with the Nth Item
$hget(<table_name>, <Nth>).data

Note: Iterating over a hash table like this is an inefficient way to retrieve values and items. It's best to get a value using its item name.

An example of looping over every value in our Colors table will look like this:

Alias print_fav_colors {
  var %i = 1
  echo Colors Table:
  ; iterate over each item
  while ($hget(Colors, %i).item) {
    ; print the item/value pair
    echo -a %i $+ ) $v1 => $hget(Colors, $v1)
    inc %i
  }
}

The execution of the alias (/print_fav_colors) will produce the following result:

Colors Table:
1) Gary => Orange
2) John => Blue

Searching for a Item and value pair

The $hfind identifier can be used to search the table for a particular pair.

; The Nth Item name that matches the wildcard pattern
$hfind(<table_name>, <pattern>, <Nth>, w)
; The Nth Item that matches the RegExp pattern
$hfind(<table_name>, <pattern>, <Nth>, r)
; The Nth Item that wildcard matches the text
$hfind(<table_name>, <text>, <Nth>, W)
; The Nth Item that RegExp matches the text
$hfind(<table_name>, <text>, <Nth>, R)
; $hfind(...).data will search the data instead of the item name.

If you specify 0 for Nth Item, the total number of matches will be returned instead. An example from our Colors table would be:

//echo -a $hfind(Colors, *ary*, 1, w)

Which will return "Gary".

Technical Explanation

mIRC's hash tables are implemented as follows.

  1. When you create a hash table, it is created with a defined number of "slots" (or "buckets").
  2. When you add an item to the table using /hadd, a hash algorithm calculates a slot number based on the item name. The formula for the hashing algorithm is assumed to be "$hash(item,32) % no-of-slots". Each slot holds a linked-list of items whose names map to that slot number, with each new item being inserted at the beginning of the list.
  3. When you look up an item by name using $hget, then the same hash algorithm is used to locate the slot it will be stored under and then the linked-list in that slot is searched sequentially for the item. The purpose of hashing is to support this kind of lookup on potentially large tables with fast performance.
  4. When you get an item by position using $hget(table,position), or use $hfind to search data or search using wildcards or regular expressions, then the has algorithm cannot be used to identify the correct slot and instead mIRC has to iterate across part or all of the hash table to count or to find the record you want.

If your hash table has a small number of slots compared to the number of records, then each slot will have a large number of records:

  • For a lookup of an existing item, on average mIRC will have to iterate over 50% of the entries before locating the one you want
  • If you try to find a non-existent item, mIRC will need to iterate over the whole list before determining that the item doesn't exist.

As you might imagine, mIRC iterating over a large number of hash table entries to find the item needed is CPU intensive and mIRC might start to feel unresponsive.

So for look-ups by item name, the best performance will be achieved when mIRC's hashing algorithm points to a slot with a single table entry (or failing that a small number of entries). Worst case scenario is if your hash table has only a single slot, then all entries are stored in a single linked-list and every look-up needs to be iterated. On the other hand, if you have a large number of slots (much greater than the number of items in the hash table), then the likelihood is that every item will be stored in its own slot, so the hash function will take you to a slot with a single entry, and no iteration will be needed to find the item.

All that said, even with a large number of slots, you cannot guarantee that every item in the table will have a unique hash / slot number. As an analogy, consider a class of 30 students. What is the probability that all students have birthdays on different days of the year? This is equivalent to asking whether a hash table with 365 slots and 30 entries will have every entry using a different slot. It turns out that in a class of 30 students there is significantly more than 50% probability that at least two students will share a birthday - indeed it only takes 23 students for the probability to be more than 50%. This seems weird - but for the mathematically inclined, the probability can be calculated by determining what the probability is that M students have all different birthdays:

The first student can have any birthday. The second student can have 364 of 365 days and still be different. The third student can have 363 or 365 days and still be different.

So the probability that M students all have different birthdays is therefore:

Equation StartFraction 364 Over 365 EndFraction asterisk StartFraction 363 Over 365 EndFraction asterisk StartFraction 362 Over 365 EndFraction period period period StartFraction 365 hyphen upper M plus 1 Over 365 EndFraction equals StartFraction left-parenthesis 365 minus 1 right-parenthesis left-parenthesis 365 minus 2 right-parenthesis period period period left-parenthesis 365 minus upper M plus 1 right-parenthesis Over 365 Superscript left-parenthesis upper M minus 1 right-parenthesis Baseline EndFraction equals StartFraction 365 factorial Over left-parenthesis 365 minus upper M right-parenthesis factorial asterisk 365 Superscript upper M Baseline EndFraction

Returning to hash tables and slots, the equivalent formula for a table with M entries and N slots is:

Equation StartFraction left-parenthesis upper N minus 1 right-parenthesis left-parenthesis upper N minus 2 right-parenthesis period period period left-parenthesis upper N minus upper M plus 1 right-parenthesis Over upper N Superscript left-parenthesis upper M minus 1 right-parenthesis Baseline EndFraction equals StartFraction upper N factorial Over left-parenthesis upper N minus upper M right-parenthesis factorial asterisk upper N Superscript upper M Baseline EndFraction

Using the student birthday example for simplicity and relating it to hash tables and slots, if we turn it on its head then we can say that if we have a hash table holding 23 entries, and we want to have a probability that each entry has its own slot > 50%, then we need to have more than 365 slots. I bet you weren't expecting that!! Fortunately the performance overhead of iterating over a relatively short linked-list is also small, and equally fortunately a slot only uses 4-bytes (which is very small indeed compared to the size of a table entry, which consists of the item name and the data and the overhead of storing these and linking them into a list). Indeed mIRC's maximum slot size is 10,000, requiring c. 40KB of memory - which in today's PCs with several GB of memory is relatively small.

Summary: If you are doing any lookups by item name on a frequent basis on a large table, then you should use the largest sensible slot size to avoid mIRC iterating over long linked-lists when doing these lookups.

HOWEVER...

Not all hash table look-ups are able to use the hash to calculate the correct slot - only look-ups by item name. If you want to access hash table entries by position using $hget(table,n), or if you want to use $hfind, then mIRC is going to have to iterate over a significant proportion (or all) of a hash table regardless of the number of slots that you define. Indeed, if you are never going to look-up by item-name, you might as well save memory and use a single slot.

Summary: If you are doing only lookup by position or using $hfind, then you should use a slot size of 1 to save memory and avoid the small overhead of iterating over empty slots.

Finally, if you have a large hash table (perhaps several thousand records) that you want to search flexibly, then you might wish to consider whether something like mIRC SQLlite might suit your needs better.

See Also

More detailed information can be found at the pages for each hashtable-related command and identifier.