three ways to check for anagrams
problem
Given 2 strings, check whether they are anagrams of each other.
solutions
worst
Perhaps the most intuitive approach: look at each character in the first
string and check that it’s in the second string by iterating through the
second string’s characters. If a character in the first string isn’t present
in the second, return False. Otherwise, remove the matched character from the
second string so it isn’t matched again going forward.
def are_anagrams(s1: str, s2: str) -> bool:
# no point proceeding if the strings aren't even of the same length
if len(s1) != len(s2):
return False
for ch1 in s1:
in_s2 = False
for i, ch2 in enumerate(s2):
if ch1 == ch2:
in_s2 = True
break
if in_s2:
# remove matched char from s2
s2 = s2[:i] + s2[i+1:]
else:
return False
return True
x = 'fried'
y = 'fired'
z = 'fixed'
print(are_anagrams(x,y))
print(are_anagrams(x,z))
True
False
While straightforward, it’s the worst approach because it’s \(O(n^{2})\).
Consider checking whether 2 strings of \(n\) characters are anagrams of one another, where the second string is just the reverse of the first. For each character in the first string we would have to iterate until the end of the second to find its match.
In practice, this means that matching the first character of the first string requires iterating through all \(n\) characters of the second string. Matching the first string’s second character requires iterating through the second string’s remaining \(n-1\) characters. Matching the third character requires iterating through the remaining \(n-2\), and so on…
In the end, we look at \(n + (n-1) + (n-2) + \cdots + 1\) characters, which is just the sum of the first \(n\) natural numbers, i.e. \(\frac{n(n+1)}{2}\). Hence, \(O(n^{2})\).
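To see this concretely, here’s a small sketch (not part of the original code) that instruments the quadratic approach with a comparison counter and runs it on strings of distinct letters paired with their reverses:
def count_comparisons(s1: str, s2: str) -> int:
    # same nested-loop matching as above, but counting character comparisons
    comparisons = 0
    for ch1 in s1:
        for i, ch2 in enumerate(s2):
            comparisons += 1
            if ch1 == ch2:
                # remove matched char from s2
                s2 = s2[:i] + s2[i+1:]
                break
    return comparisons
for n in (5, 10, 20):
    s = 'abcdefghijklmnopqrstuvwxyz'[:n]
    print(n, count_comparisons(s, s[::-1]), n * (n + 1) // 2)
5 15 15
10 55 55
20 210 210
The counted comparisons match \(\frac{n(n+1)}{2}\) exactly.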
better
Probably the simplest approach. We can just sort the characters in both strings and check that the results are equal.
def are_anagrams(s1: str, s2: str) -> bool:
return sorted(s1) == sorted(s2)
x = 'fried'
y = 'fired'
z = 'fixed'
print(are_anagrams(x,y))
print(are_anagrams(x,z))
True
False
This is better because it ends up being only as costly as the sorting algorithm: the final comparison of the two sorted character lists is \(O(n)\), so the sort dominates. If we sort in \(O(n\log n)\), the whole approach is \(O(n\log n)\), and \(n^{2} \gg n \log n\).
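As a small aside (not from the original post), sorted() turns each string into a list of its characters in order, which is why comparing the two sorted results works:
print(sorted('fried'))
print(sorted('fired'))
['d', 'e', 'f', 'i', 'r']
['d', 'e', 'f', 'i', 'r']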
best?
An even better approach is to go through each character in the first
string and keep track of the number of occurrences of each character in a hash
map/dictionary. We can then check whether the second string would produce the same
hash map/dictionary by iterating through the characters of the second string and
checking whether each one is a key in the first string’s character hash map/dictionary.
If it is, we decrement its count to note that we’ve matched one of its occurrences,
removing the key altogether if its count is only 1. But if a
character is not a key in the hash map/dictionary, we simply return False.
from collections import defaultdict
def are_anagrams(s1: str, s2: str) -> bool:
# no point proceeding if strings aren't even of same length
if len(s1) != len(s2):
return False
counter = defaultdict(int)
# create the dictionary of characters in s1
for char in s1:
counter[char] += 1
for char in s2:
if char in counter:
if counter[char] == 1:
del counter[char]
else:
counter[char] -= 1
else:
return False
return True
x = 'gainly'
y = 'laying'
z = 'lyingi'
print(are_anagrams(x,y))
print(are_anagrams(x,z))
True
False
This beats the rest because hashing is \(O(1)\): it takes only \(n\) operations to build the first string’s character-counting hash map/dictionary, and then only one pass through the second string’s \(n\) characters, performing \(O(1)\) key checks, to establish whether the strings are anagrams of one another.
Overall, \(n+n=2n\) steps are required, which is just \(O(n)\).
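As a variant (not the original’s code), the same counting idea can be written with collections.Counter, which builds both character-count dictionaries in \(O(n)\) and compares them directly:
from collections import Counter
def are_anagrams(s1: str, s2: str) -> bool:
    # two strings are anagrams iff their character counts are identical
    return Counter(s1) == Counter(s2)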
bonus!
Another (more interesting) \(O(n)\) approach.
What if we created a kind of direct hash of a string: some number that uniquely represents its characters, but is indifferent to the order they appear in?
If we know the strings are only composed of lowercase letters, for example, we could just assign a prime number to each letter: say each letter ‘a’ through ‘z’ is assigned the \(n\)th prime. So ‘a’ gets 2, ‘b’ gets 3, ‘c’ gets 5, and so on…
We could then “hash” a string by just taking the product of its letters’ assigned primes. By unique prime factorisation, the resulting number can only be produced by another string with the same letters, each occurring the same number of times.
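For example, with the mapping defined below, ‘bad’ hashes to \(3 \times 2 \times 7 = 42\), and so do its anagrams ‘dab’ and ‘abd’.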
letter_to_prime = {'a': 2, 'b': 3, 'c': 5, 'd': 7, 'e': 11, 'f': 13, 'g': 17,'h': 19,
'i': 23, 'j': 29, 'k': 31, 'l': 37, 'm': 41, 'n': 43, 'o': 47, 'p': 53,
'q': 59, 'r': 61, 's': 67, 't': 71, 'u': 73, 'v': 79, 'w': 83, 'x': 89,
'y': 97, 'z': 101}
def hash_str(s: str) -> int:
hash = 1
for char in s:
hash *= letter_to_prime[char]
return hash
def are_anagrams(s1: str, s2: str) -> bool:
return hash_str(s1) == hash_str(s2)
x = 'gainly'
y = 'laying'
z = 'lyingi'
print(are_anagrams(x,y))
print(are_anagrams(x,z))
True
False
As before, we end up iterating through each string’s characters, so it’s also \(O(n)\).
The only trouble with this approach is that the hashes of certain strings get big really fast. The hash of a string of \(n\) ‘z’s, for example, grows exponentially as \(101^{n}\).
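To illustrate how quickly these products blow up (a small aside, assuming the hash_str and letter_to_prime definitions above are still in scope):
print(hash_str('z' * 10))
print(len(str(101 ** 26)))
110462212541120451001
53
Ten ‘z’s already need a 21-digit hash, and a string of 26 ‘z’s needs a 53-digit one.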
But if we assume that we are only looking to check whether 2 English words are anagrams of each other, we can assign our primes to the letters of the alphabet based on the frequency distribution of letters in the language. This way, the smallest primes are assigned to the most frequently occurring letters in English words.¹
from collections import Counter
import string
alphabet = {letter for letter in string.ascii_lowercase}
primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101]
with open('./english_words.txt','r') as f:
letters = [char for char in f.read().lower() if char in alphabet]
freqs = Counter(letters)
sorted_alphabet = sorted(freqs,key=lambda x: freqs[x],reverse=True)
freq_hash = {letter:prime for letter, prime in zip(sorted_alphabet, primes)}
print(freq_hash)
{'e': 2, 's': 3, 'i': 5, 'a': 7, 'r': 11, 'n': 13, 't': 17, 'o': 19, 'l': 23, 'c': 29, 'd': 31, 'u': 37, 'g': 41, 'p': 43, 'm': 47, 'h': 53, 'b': 59, 'y': 61, 'f': 67, 'v': 71, 'k': 73, 'w': 79, 'z': 83, 'x': 89, 'q': 97, 'j': 101}
Is this approach any better though?
old_hash = {'a': 2, 'b': 3, 'c': 5, 'd': 7, 'e': 11, 'f': 13, 'g': 17,'h': 19,
'i': 23, 'j': 29, 'k': 31, 'l': 37, 'm': 41, 'n': 43, 'o': 47, 'p': 53,
'q': 59, 'r': 61, 's': 67, 't': 71, 'u': 73, 'v': 79, 'w': 83, 'x': 89,
'y': 97, 'z': 101}
freq_hash = {'e': 2, 's': 3, 'i': 5, 'a': 7, 'r': 11, 'n': 13, 't': 17, 'o': 19,
'l': 23, 'c': 29, 'd': 31, 'u': 37, 'g': 41, 'p': 43, 'm': 47, 'h': 53,
'b': 59, 'y': 61, 'f': 67, 'v': 71, 'k': 73, 'w': 79, 'z': 83, 'x': 89,
'q': 97, 'j': 101}
def hash_str(s: str, hash_dict: dict) -> int:
hash = 1
for char in s:
hash *= hash_dict[char]
return hash
print(hash_str('fried', old_hash))
print(hash_str('fried', freq_hash))
1404403
228470
On the surface it seems to be.
But there’s a better way to test this.
What’s each hashing approach’s average hash across the entire dataset of English words:
import string
old_hash = {'a': 2, 'b': 3, 'c': 5, 'd': 7, 'e': 11, 'f': 13, 'g': 17,'h': 19,
'i': 23, 'j': 29, 'k': 31, 'l': 37, 'm': 41, 'n': 43, 'o': 47, 'p': 53,
'q': 59, 'r': 61, 's': 67, 't': 71, 'u': 73, 'v': 79, 'w': 83, 'x': 89,
'y': 97, 'z': 101}
freq_hash = {'e': 2, 's': 3, 'i': 5, 'a': 7, 'r': 11, 'n': 13, 't': 17, 'o': 19,
'l': 23, 'c': 29, 'd': 31, 'u': 37, 'g': 41, 'p': 43, 'm': 47, 'h': 53,
'b': 59, 'y': 61, 'f': 67, 'v': 71, 'k': 73, 'w': 79, 'z': 83, 'x': 89,
'q': 97, 'j': 101}
def hash_str(s: str, hash_map: dict) -> int:
hash = 1
for char in s:
hash *= hash_map[char]
return hash
def only_alphabet_chars(s):
for char in s:
if char not in string.ascii_lowercase:
return False
return True
with open('./english_words.txt','r') as f:
words = [word for word in f.read().splitlines() if only_alphabet_chars(word)]
old_avg = sum(hash_str(word,old_hash) for word in words)/len(words)
new_avg = sum(hash_str(word,freq_hash) for word in words)/len(words)
print(old_avg)
print(new_avg)
print(old_avg/new_avg)
2.53296602423254e+28
1.1692706272592385e+20
216627867.42277047
That’s quite an improvement!
The only question that remains is whether this is the optimal assignment of primes to letters for minimising the average hash value across this dataset.
I assume there’s a way of answering this.
Maybe some kind of linear programming problem?
Definitely an optimisation one…
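One way to at least probe the question empirically (a rough sketch, not a definitive answer; it assumes words, primes, sorted_alphabet and freq_hash from the snippets above are still in scope, and it compares average log-hashes to keep the numbers manageable):
import math
import random
def avg_log_hash(words, assignment):
    # average of log(hash) over the word list, i.e. the log of the geometric-mean hash
    return sum(sum(math.log(assignment[ch]) for ch in word) for word in words) / len(words)
print('frequency-based:', avg_log_hash(words, freq_hash))
for _ in range(3):
    shuffled = random.sample(primes, len(primes))
    random_assignment = dict(zip(sorted_alphabet, shuffled))
    print('random:         ', avg_log_hash(words, random_assignment))
I’d expect the frequency-based assignment to come out lowest in such runs, but that still wouldn’t prove it’s optimal.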
1. We are using English words (American and British) | Kaggle here. ↩︎