recreating nyt games' spelling bee

I’ve been spending enough time playing NYT’s Spelling Bee puzzle game to be compelled to reproduce it in Python; especially seeing how straightforward it seems to be.

All I need is a list of English words and the alphabet, no?

Without any clever workarounds the creation of a game just requires the following:

randomly select 7 letters from the alphabet, without replacement.
take the first letter as the center letter in the bee hive (hexagon grid).
get the set of English words that can be formed from our letters (most finicky part)
if that set includes a word containing all 7 letters (i.e. a pangram), then we have a valid spelling bee game!

random letters

Easy! Python has the random package:

import string
import random
alphabet = [letter for letter in string.ascii_uppercase]
print(random.sample(alphabet,k=7))

['C', 'P', 'J', 'S', 'F', 'T', 'M']

In reality, we probably want to weight this so that unpopular letters like ‘Z’ don’t appear so often.

One way to do this is to just copy English Scrabble’s weights and sample according to them. (By scrabble weights, we mean the number of tiles included.)

The only trouble is that random.sample doesn’t take a weights argument, neither does random.shuffle. The only one that does is random.choices but it performs a sample with replacement.

So, let’s turn our attention to numpy’s random.choice:

import numpy as np
import string
alphabet = [letter for letter in string.ascii_uppercase]
scrabble_tile_counts = [9,2,2,4,12,2,3,2,9,1,1,4,2,6,8,2,1,6,4,6,4,2,2,1,2,1]
letter_probs = [tc / sum(scrabble_tile_counts) for tc in scrabble_tile_counts]
print(np.random.choice(alphabet,replace=False,p=letter_probs,size=7))

['H' 'E' 'S' 'I' 'M' 'W' 'C']

And while those are better samples. An even better letter weighting could be calculated from the letter frequency of the dictionary of English words we are going to use for our spelling bee.

Let’s just use English words(American and British) | Kaggle for our words, and remove all the words with apostrophes with a Regex expression in vim: %s/.+'.\n//g. And we’ll just ignore the words with non-alphabetic characters for now.

from collections import Counter
import string
import numpy as np
alphabet = {letter for letter in string.ascii_uppercase}
with open('./english_words.txt','r') as f:
    letters = [char for char in f.read().upper() if char in alphabet]
    freqs = Counter(letters)
    letter_probs = [freqs[letter]/freqs.total() for letter in sorted(alphabet)]
print(np.random.choice(sorted(alphabet),replace=False,p=letter_probs,size=7))

['A' 'M' 'R' 'O' 'N' 'U' 'F']

solution words for a given set of letters

This is a bit more interesting, the trick here is going to be to create a dictionary that maps each word to its set of letters, and write a function that, given a word, check the following:

is the word’s set of letters a subset of the game’s available letters?
is the center letter an element of the word’s set of letters?
is it a pangram?

english_words_path = './english_words.txt'
def solve_spelling_bee(words_file:str, letters:set, center_letter:str):
    with open(words_file,'r') as f:
        return [word for word in f.read().split('\n') if len(word) > 3 and set(word.upper()) <= letters and center_letter in word.upper()]
print(solve_spelling_bee(english_words_path,{'O','R','N','P','C','E','S'},center_letter='N'))

We can go beyond this by removing all proper nouns from our word list, and all words that are shorter than 4 letters long. We could also go as far as to load the dictionary of word to sets of letters, instead of generating it on the fly. We also need to finish off by checking for a pangram.

We can write a quick pangram checker:

def has_pangram(words:list):
    for word in words:
        if len(set(word)) == 7:
            return True
    return False

bringing it all together

We can remove all the proper nouns with the regex: %s/^[A-Z].*\n//g. We can also remove all words with fewer than 4 letters with: %s/^\a{,3}\n//g.

We can then write a gen_spelling_bee function that will generate a set of 7 letters, a center letter, and the word solutions:

from collections import Counter
import string
import numpy as np

english_words = './english_words.txt'

def sample_seven(word_list_path:str):
    '''
    Sample seven letters, weighted by frequency, from a word list.
    '''
    alphabet = {letter for letter in string.ascii_uppercase}
    with open(word_list_path,'r') as f:
        letters = [char for char in f.read().upper() if char in alphabet]
    freqs = Counter(letters)
    letter_probs = [freqs[letter]/freqs.total() for letter in sorted(alphabet)]
    return np.random.choice(sorted(alphabet),replace=False,p=letter_probs,size=7)

def solve_spelling_bee(words_file:str,letters:set,center_letter:str):
    '''
    Returns the list of solution words based on a set of available letters, and
    a word list.
    '''
    with open(words_file,'r') as f:
        return [word for word in f.read().split('\n') if len(word) > 3 and set(word.upper()) <= letters and center_letter in word.upper()]

def has_pangram(words:list):
    '''
    Tests if a list of words contain a pangram.
    '''
    for word in words:
        if len(set(word)) == 7:
            return True
    return False

def gen_spelling_bee(word_list_path:str):
    '''
    Generates a valid spelling bee game from a word list.
    '''
    while True:
        seven_letters = sample_seven(word_list_path)
        center_letter = seven_letters[0]
        solutions = solve_spelling_bee(word_list_path,set(seven_letters),center_letter)
        if has_pangram(solutions):
            return seven_letters, center_letter, solutions

letters, centre, solutions = gen_spelling_bee(english_words)
print(f'Letters: {letters}')
print(f'Center letter: {centre}')
print(f'Solutions: {solutions}')

Having done all this, I feel like you can solve a Spelling Bee with a single regular expression; but doing so would require actually learning regular expressions…

The fun in all this is in realising that this solutions generalises to any word list, not just English words. Given a less hasty, and more considered implementation, you could create Spelling Bee variants for any language or sufficiently large set of words, like anatomical words, for example.

what about a quick scoring function?

Really straightforward this one:

def score_word(word:str):
    if len(word) == 4:
        return 1
    score = len(word)
    if len(set(word)) == 7:
        score += 7
    return score
print(score_word('mistrials'))

*edit

Another way to go about this is to precompute all the pangrams in your word list and randomly select one to generate your Spelling Bee game. (It also avoids the representative letter sampling problem entirely…):

import string
import random

english_words = './english_words.txt'
english_alphabet = {letter for letter in string.ascii_uppercase}

def get_pangrams(word_list_path: str, alphabet: set = english_alphabet) -> list:
    '''
    Gets the list of pangrams from a word list.
    '''
    with open(word_list_path, 'r') as f:
        pangrams = [word for word in f.read().splitlines() if len(set(word)) == 7]
    return pangrams

def solve_spelling_bee(word_list_path: str, letters: set, centre_letter: str) -> list:
    '''
    Solves a particular Spelling Bee based on the 7 letters and centre letter.
    '''
    with open(word_list_path, 'r') as f:
        return [word for word in f.read().splitlines()
                if len(word) > 3 and
                set(word.upper()) <= letters and
                centre_letter in word.upper()]

def gen_spelling_bee(word_list_path: str) -> tuple:
    '''
    Generates a Spelling Bee game based off a word list. This includes
    the 7 letters, the centre letter, and the solutions to the game.
    '''
    pangram = random.sample(get_pangrams(word_list_path),k=1)[0]
    letters = set(pangram.upper())
    centre_letter = random.sample(letters,k=1)[0]
    solutions = solve_spelling_bee(word_list_path, letters, centre_letter)
    return letters, centre_letter, solutions

letters, centre, solutions = gen_spelling_bee(english_words)
print(f'Letters: {letters}')
print(f'Center letter: {centre}')
print(f'Solutions: {solutions}')

Letters: {'R', 'I', 'U', 'M', 'C', 'O', 'F'}
Center letter: I
Solutions: ['cocci', 'coif', 'comic', 'croci', 'cruciform', 'curio', 'firm', 'foci', 'miff', 'mimic', 'mirror', 'riff', 'uric']

This is arguably a better implementation.