Tuesday, September 15, 2015

Hekaton Part 7 - Hash Collisions & Hash Buckets

As explained in the earlier post, a hash index places pointers to the rows in hash buckets, depending upon the value returned by hash function. Hash bucket count for a index is fixed at the time of creation and doesn't grow as data grows. So, what happens when the index column contains duplicates and they hash to the same hash bucket? Sometimes, two values, even though they are different, they may hash to the same hash bucket. Such a scenario is termed as hash collision.

Hash Indexes in "Hekaton in memory tables" handles the hash collision in the following way
1) When two values hash to the same hash bucket(h1), the pointer to the latest row (R2) inserted is stored in the hash bucket
2) The latest inserted row(R2) would contain a pointer to the row(R1) that was present previously  in hash bucket(h1).

Refer to the pic below

Hash Index is created on column c1. Row with value "Alice" for Column c1 resides at address 2000. Hash("Alice") maps to hash bucket 1 and hence hash bucket 1 contains the address 2000.

Refer to the pic below, to see what happens when another row with value "Alice" for c1 is inserted.

Lets say the new row inserted is at address 3000. Hash bucket "h1" now contains the address 3000. The new row's meta data ( index pointer)  at address 3000, contains a pointer to the previous row at Address 2000.
Please note 2 important points here.
  1. In "in memory" tables, rows contain index pointers unlike disk tables where indexes contain pointers / row locators to table.
  2. These index pointers actually link the table together. This is the reason for atleast one index rule in any in memory table
In the next post, we will see a few demos to take a closer look at hash collision and index chains

Tuesday, September 8, 2015

Hekaton Part 6:- Hash Indexes - Intro

With SQL Server 2014, new type of Index called the hash index was introduced. Little introduction to hash function would help understanding hash indexes better.

When a "key value" is passed to hash function, depending upon the result hash function provides, the key value will be placed in corresponding hash buckets.

For example, let us assume modulo ten ( % 10 ) is the hash function. if key value 1525 is passed to hash function, then it will be placed in hash bucket 5 as 1525 % 10 = 5. Similarly 537 would map to bucket 7 and 2982 would map to hash bucket 2 and so on..

So, similarly, in hash indexes, the indexed column value is passed to hash function and depending upon the result, a pointer to the actual row is stored in the table are stored in hash buckets. So, to find a row or to process a where clause, all that sql engine needs to 
do is

  1. Apply the hash function on the search parameter
  2. Go to the hash bucket where the row pointer is stored
  3. fetch the result.
Simplicity and no tree traversals makes it one of the fastest methods of access.

Sample hash function usage or syntax provided below.

( [ID] Int identity(1,1) Not null PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 100000),
[Data] char(32) COLLATE Latin1_General_100_BIN2 null,
[dt] datetime not null,

As Indexes can't be added after table creation in "in memory tables" in SQL Server 2014, one has to state the index definition inline. SQL Server 2016 supports adding indexes after table creation, but it is still a offline operation.

Bucket count specifies number of buckets to be used for the index. Bucket count is fixed and is always rounded off to next power of 2 ( 1024, 2048, 4096 etc..)

In subsequent posts, we get into more details on hash indexes and its pros and cons, hash bucket count and many more.


Friday, September 4, 2015

Hekaton Part 5: Introduction to Indexes in "In Memory Tables"

After a short little break on Hekaton series, we are now back with the 5th post.

Two new index types have been introduced. They are

1) Hash Index -  Based on Hash algorithm - useful for equality searches
2) Range Index - sometimes referred as Non clustered Index in Memory tables - Similar to traditional Non clustered indexes and useful for range query searches

Indexes on "In Memory" tables are never physically stored in the disk. They are always in purely memory structures. Indexes are re created and loaded into the memory every time database is restarted or when the database restored.

Besides providing the traditional performance improvements, Indexes in " in memory table" another major role. Indexes are actually the chain or the link which holds the entire table together. In other words, each row in the table is linked to the next row using indexes. So, Indexes are a must for in Memory tables. Just to add on

1) On a Schema and Data in memory table ( table which persists on disk ) , primary key is a must. Primary key can be enforced by hash or range index

2) On a schema table ( the table doesn't retain the data after restarts. Only structure is persistent ), there should be atleast one index

 Please note that existing concept of cluster index is not present in "In Memory tables". Also, as they are purely stored in Memory, impact of fragmentation, page splits are bare minimum or non existent in "In memory" tables.

Intro to indexing done. Next post will deal with hash indexes in detail.