Hash Tables & Hash Sets
Should you use a hash table?
- Does data need to be compared without knowledge of order? If so, a hash table is a good bet as it allows
O(1)
lookups.
Fundamentals
- The fundamental idea of a hash table is that keys and (optionally) associated values are stored in an array by which their location slot is determined by a hash function that produces a unqiue hash code. The hash code is an integer computed via the hash function.
- If two keys map to the same address then there is said to be a collision. Collisions are resolved by holding a linked list at each entry such that multiple items can be held at the same location. Iteration is used to find the correct one on lookup.
- The load is represented by $n/m$ where $n$ is the number of elements and $m$ is the size of the array in which the hash table is stored. If the load grows large, then re-hashing should be computed. That is, ideally a hash tableis sparse.
- Without collisions or re-hashing, insertions, deletions, and lookups have constant O(1) time complexity.
- Hard requirement: equal keys should have equal hashes
- Soft requirement: the hash function is efficient and spreads keys uniformly across the underlying array
- It is important to understand the relationship of logical equality and its relation to their hash code. These must match or else you will get false lookups
# Function that computes a good hash for string-type keys
def string_hash(string, modulus):
MULT = 993
return functools.reduce(lambda acc, c: (acc*MULT + ord(c)) % modulus, s, 0)
# Function that computes a good hash for string-type keys
def string_hash(string, modulus):
MULT = 993
return functools.reduce(lambda acc, c: (acc*MULT + ord(c)) % modulus, s, 0)
A simple hashable class implements a __hash__
and __eq__
method as defined below (note below the __hash__
and __eq__
methods are inefficient, ideally their outputs should be cached).
class Hashable:
def __init__(self, names: list[str]):
self.names = names
def __hash__(self):
return hash(frozenset(self.names))
def __eq__(self, other: Hashable):
return set(self.names) == set(other.names)
class Hashable:
def __init__(self, names: list[str]):
self.names = names
def __hash__(self):
return hash(frozenset(self.names))
def __eq__(self, other: Hashable):
return set(self.names) == set(other.names)
Working with hash tables
- Avoid using mutable objects as keys; good options are typically strings, ints.
- You cannot alter immutable objects, so don't use them as hash table values if you intend on updating them, e.g., tuples.
- A hash table is an excellent choice when making comparisons where order isn't, or need not be preserved
- A hash table is an excellent choice anytime you need to store a set of strings
- A set is simply a hash table that only stores keys. Use hash tables as lookup tables instead of large if-else code for mapping characters to values or similar.
Python
- The main hash table data types in Python are
set
: stores only keys, no valuesdict
: throws key error exception when access non-existent keyscollections.defaultdict(type)
: returns the dict with an empty defaulttype
when accessing non-existent keyscollections.Counter
: Counts the number of occurrences of a key with support for+, -, &, |
operators, otherwise operates like adict
- For sorted set/hashmaps use the
sortedcontainers
module:SortedDict
:pop(key)
,update(pairs)
,setdefault(key, default)
,get(key)
,keys()
,items()
,values()
SortedList
:add(val)
,update(iterable)
,discard(value)
C++
Use map
for an ordered hash table. Use unordered_map
for an unordered hash table.
Example: Increment a counter.
std::map<int, int> map{};
for (auto item : nums)
map[item] += 1;
std::map<int, int> map{};
for (auto item : nums)
map[item] += 1;
Rust
Use HashMap
for an unordered hash table. Use BTreeMap
for an ordered hash table (slower operations since it uses a binary tree to manage sortedness). Use HashSet<T>
for an unordered set Use BTreeSet<T>
for an ordered set
Recall that, fundamentally, a set is a hash table with only a key, however, it's sometimes useful to think of it as an array of unique entries.
Essential Methods
Increment a counter.
let mut map = HashMap::new();
let num = 1;
match map.get(num) {
None => map.insert(num, 1),
Some(curr) => map.insert(num, curr + 1)
}
// Option 2: `.entry()` API
map.entry(num).and_modify(|x| *x += 1).or_insert(0);
let mut map = HashMap::new();
let num = 1;
match map.get(num) {
None => map.insert(num, 1),
Some(curr) => map.insert(num, curr + 1)
}
// Option 2: `.entry()` API
map.entry(num).and_modify(|x| *x += 1).or_insert(0);
Problems
Relevant problems
- Cows & Bulls: Given 4 ints and a guess (4 more ints), return the number of matches in correct location, in correct location, and wrong guesses. (REVIEW THIS PROBLEM)
- Stock Price Fluctutation: Implement a class that updates, and reports the max/min of stock data streamed out of order.
- Use a hashmap to store data with gaps (instead of in-filling an array)
- Use a sorted map to keep track of max and min values
- Longest Consecutive Sequence: Find the longest consequtive sequence of values in an unsorted array
- Use a hash set to turn the initial array into a constant time lookup table
- Loop and check if the next value is in the LUT
- Longest Repeating Character Replacement: Find the longest sequence of the same characters given k replacements.
- Use a hash map to encode the frequency within a sliding window
- Move the front and back of the sliding window to accomodate valid substrings
- Key operation:
if window_size - frequency_max > k:
12.1 test for palindromic permutations
My solution was correct, though a bit verbose compared to the author, but that's okay. Good work.
Lessons Learned
- Turn an array or string into a frequency-based hash table by passing it to the constructor of
Counter
, .e.g.,map = Counter(s)
for this question, instead of creating an empty hash table and filling it. - This is a solution where you have to find the underlying rules of the problem to find the solution.
12.2 Is an anonumous letter constructible?
Good solution, I managed to get this one, test it, and after evaluating agains the authors my solution was almost identical. The only difference was that for each hash table entry I held 2 counters one for "count" and one for "found". When found == count I deleted the entry (as did the authors!). Instead the authors just subtracted the count that was established on the first pass. This is the better solution, cleaner implementation and less overall space, required, but still happy with my solution.
Lessons Learned:
- Instead of characters, which the question asked, I used words in the letter/magazine. This is just an attention to detail issue. Make sure you focus on solving and clarifying the specific problem when you're doing them in an interview.
- Tuples are immutable! Using a named tuple as the entry of the dictionary didn't work because tuples are immutable (duh!), so that wasted some time. Eventually I ended up using a nested dict
{'count': 0, 'found': 0}
12.3 Implement an ISBN cache
Donezo! implemented learn recently used eviction with a queue. Time complexity isn't great though because I have to remove items from the queue: conceptually I could actually track their placement in the queue depending on how many had been added since the last update with an extra field in the hash table entty, e.g., a counter. Reading solution... Via the Authors, this was the perfect solution! The only thing they did better was use a better (linked list) implementation of the Dict collections.OrderedDict
.
Lessons Learned
- Cache eviction: when a cache meets full capacity, new elements placed into the cache must replace other elements. Ideally the elements that are removed are said to be evicted. Ideally Eviction occurs on elements that have low relevance, but this isn't always easy to achieve.
- OrderedDicts (presumably same with C++ std::map type?) retain the order in which items are added, gain some knowledge about how these work exactly and the methods on them.
12.5 Nearest repeated entries
Redo this one... solved the wrong problem.
LC 1. Two Sum
My C++ solution
vector<int> twoSum(vector<int>& nums, int target) {
std::unordered_map<int, int> hashmap;
int difference = 0;
for (int i=0; i<nums.size(); i++) {
int difference = target - nums[i];
if (hashmap.contains(difference))
return vector{ i, hashmap[difference] };
hashmap.insert(std::pair(nums[i], i));
}
throw;
}
vector<int> twoSum(vector<int>& nums, int target) {
std::unordered_map<int, int> hashmap;
int difference = 0;
for (int i=0; i<nums.size(); i++) {
int difference = target - nums[i];
if (hashmap.contains(difference))
return vector{ i, hashmap[difference] };
hashmap.insert(std::pair(nums[i], i));
}
throw;
}
My Rust solution
pub fn two_sum(nums: Vec<i32>, target: i32) -> Vec<i32> {
let mut map = std::collections::HashMap::<i32, usize>::new();
let mut difference = 0;
for i in 0..nums.len() {
difference = target - nums[i];
if map.contains_key(&difference) {
return vec![i as i32, map[&difference] as i32];
}
map.insert(nums[i], i);
}
panic!("We should never see this!");
}
pub fn two_sum(nums: Vec<i32>, target: i32) -> Vec<i32> {
let mut map = std::collections::HashMap::<i32, usize>::new();
let mut difference = 0;
for i in 0..nums.len() {
difference = target - nums[i];
if map.contains_key(&difference) {
return vec![i as i32, map[&difference] as i32];
}
map.insert(nums[i], i);
}
panic!("We should never see this!");
}
LC 383 Ransom Note
Solution: This is a simple hash table problem. Build up a hash table from the entries in magazine, then as you loop over the characters in the ransom note, decrement the entries in the hash table. If at any time you reach a 0 entry, or the entry cannot be found, then return False
. An optimization can be made where you use an array instead of a hash table to store the elements of the magazine. For this, you'll have to subtract by ord('a')
to normalize the character to a value that can be indexed.