CS50 Week 4 – Learn | Think

Speller problem set explained

I’m going to do my best to explain the solution in small, manageable steps so you can follow along and not go through the pain and suffering I did!

Maybe running a gauntlet is harsh, but some days were brutal, and at one point I even deleted tons of code after being completely fed up.

If this happens just take a break. Work 30min to 45 min at a time and go for walks or just stand up for 5 to 10 min in between. Make small pieces of programs that prove a belief about what you think the code is doing.

The last C problem set is called speller and is simple when you explain it at a super high level.

Main Goals:

load() the dictionary word by word into a data structure
Keep track of each word added and return the total number of words in our dictionary by implementing a size() function
We are given a single word from the passage in a loop we do not have to implement ourselves. Looping through each word in the passage and pulling it has been done for us already! We only have to take a single word and compare it against our data structure within the check() function
Finally we have to loop through each index of our data structure (it’s just an array folks) and free all the dynamically allocated memory (this is not as hard as it sounds). This happens within our unload() function

It processes a piece of text, looks at each word against a dictionary of words and returns any words not found in the dictionary as “misspelled.”

Yeah…simple right?

Implementations

So before I show my implementations, let’s go through each function. Remember a lot of this program is already done for us! Also the program puts each function in alphabetical order instead of the order of when they are called within the program. I will explain each in the order they are called from speller.c.

We need to know what our functions are expected to do, and the only way to know is to check what is being passed to us by the main program!

Use GDB! Printf is great for little things but when you’re really lost in a bigger program gdb is printf on steroids! (click on the text to go back to the article where I walk you through GDB)

bool load(const char* dictionary) part 1

The argument passed is a char* (a “string”). The const part means the value can not be changed. If we tried to change the string we are passed we will get an error. We should return the value true o

Let’s use GDB to see what is going on in the main program speller. We’ll create a break point right on the line load() is called, and step into the function so see what is being passed. In my speller.c file the call to load() happens on line 45.

first we cd into the pset5 directory, then run gdb on speller

Remember we start gdb by typing gdb and the filename without the .c part. Now GDB will startup and give us a prompt “(gdb).”

So this is all just preparation before we run the program. First we have to tell gdb where to pause in our program. I’ve typed in b 45, which means when I hit run go through the program and pause, or in other words break at line 45.

Now we just have to type the letter r to run the program:

Usage: speller [dictionary] text gives us the hint we need. When we typed in r gdb ran the program as if we typed ./speller but the program has 2 arguments we need to pass in the terminal.

The first [dictionary] is optional, which is why it’s surrounded by square brackets. The second argument text is mandatory, which is why we couldn’t run it.

Check out the folders that are within pset5 using the command ls once you cd into pset5 from the ~/workspace/ directory.

You will see texts/ and dictionaries/

[dictionary] is optional because our main program will automatically choose the large file within that folder.

text has to be choosen. I did some exploring in the text folder and found a small piece of text called quote.txt:

It is known that there are an infinite number of worlds, simply because there is an infinite amount of space for them to be in. However, not every one of them is inhabited. Therefore, there must be a finite number of inhabited worlds. Any finite number divided by infinity is as near to nothing as makes no odds, so the average population of all planets in the universe can be said to be zero. From this it follows that the population of the whole universe is zero, and that all people that you may meet from time to time are merely the products of a deranged imagination.

-Douglas Adams – The Restaurant at the End of the Universe

So what we really need to type is r texts/quote.txt. Let’s try again:

So we’re in speller.c on line 45 (speller.c:45) and the program is loading in something called dictionary. So the question is what is dictionary? Before we step into the function let’s just check using the p command which is short for print.

So instead of typing printf and trying to guess the value we can just use the command p and any variable of any type a

So back to our output p dictionary:

So it passes the string “dictionaries/large.”

From the lecture and shorts videos, we know that loading the dictionary means creating a data structure. This is a super fancy word. Don’t be fooled into thinking a data structure is some super complex thing only a genius can create. It’s just an array, that is all…

An array of structs…

connected to other structs known as a linked list…

which we insert using a hash function…

oh… and the array is a global variable so we can access it within other functions…

OKAY MAYBE IT’S A LITTLE COMPLICATED.

Let’s get out of GDB for a moment, open dictionary.c and start coding within the load() function.

(to quit out of GDB just type q for quit. It will ask if you are sure. Just type y for yes.)

First we need to use the filename we’re passed (const char* dictionary) and use it to open the file.

Now paste this code into the load() function and run the make command:

    // create a FILE pointer and fopen() to start the stream
    FILE * dictionary_stream = fopen(dictionary, "r");

    // print out the first character
    printf("first character of the file is: %c\n", fgetc(dictionary_stream));

    // use GDB to inspect the test_string variable after it is used each time.
    char test_string[LENGTH+1];

    // fgets will give us everything on the current line 
    // and put into test_string
    fgets(test_string, (LENGTH+1), dictionary_stream);

    // fscanf gives us whatever we say in the format argument (like %s)
    // and put into test_string
    fscanf(dictionary_stream, "%s", test_string);

    printf("end of example\n");

run GDB again using the same commands we did before. Here is a quick list:

gdb speller
b 45
r texts/quote.txt

At this point when we run quote.txt we get:

45          bool loaded = load(dictionary);

load(dictionary) is about to run.

It has not yet run, but if I type n and go to line 46 (the next line), load() will run.

Everything that load() needs to do will be over and passed to the variable bool loaded.

I don’t want that.

What I want is to step

For this we have to type s for step-into.

now we’re inside the function and get this as our output in GDB:

FILE * dictionary_stream = fopen(dictionary, "r");

We can now type the n (stands for “next”) command to see what our variables are doing in the next few lines.

type n enter

then type n enter one more time.

Your output should be something like this:

first character of the file is: a
158         fgets(test_string, (LENGTH+1), dictionary_stream);

Ignore the fgets part for now…it won’t run that line until we hit the n key.

Here we can see our first output, which is the printf we added to our function. It reads “first character of the file is: a.”

So our dictionary_stream was at the beginning of the file, (think of a play head on a youtube video) we used fgetc() which collected the first character, which then in turn moved the FILE pointer dictionary_stream to the next character.

You don’t have to open dictionaries/large, I will tell you the first few lines are as follows:

a
aaa
aaas
aachen
aalborg

So fget(dictionary_stream) grabbed the first a a

But wait a minute…

Did it really move the pointer to the next newline?

Let’s move ahead and see…

hit the n key again and press enter.

fgets(test_string, (LENGTH+1), dictionary_stream);
(gdb) n
161         fscanf(dictionary_stream, "%s", test_string);

So the first line above is the same one I told you to ignore a moment ago. Once we hit n and the enter key it brought us to fscanf. Which has not yet run so we can’t look at it’s value yet.

So let’s see what test_string is holding after running the fgets function. If you did not know fgets() takes everything from a single line and puts it in the char* variable we specify, in this case test_string.

run p test_string

$1 = "\n$1 = "\n\000@", '\000' , "\220\336\377\377\377\177\000\000\360\334\377\377\377\177\000\000\000\000\000\000\000\000\000\000\200N\335\367\377\177"
0@", '$1 = "\n\000@", '\000' , "\220\336\377\377\377\177\000\000\360\334\377\377\377\177\000\000\000\000\000\000\000\000\000\000\200N\335\367\377\177"
0' , "067777$1 = "\n\000@", '\000' , "\220\336\377\377\377\177\000\000\360\334\377\377\377\177\000\000\000\000\000\000\000\000\000\000\200N\335\367\377\177"
0$1 = "\n\000@", '\000' , "\220\336\377\377\377\177\000\000\360\334\377\377\377\177\000\000\000\000\000\000\000\000\000\000\200N\335\367\377\177"
0047777$1 = "\n\000@", '\000' , "\220\336\377\377\377\177\000\000\360\334\377\377\377\177\000\000\000\000\000\000\000\000\000\000\200N\335\367\377\177"
0$1 = "\n\000@", '\000' , "\220\336\377\377\377\177\000\000\360\334\377\377\377\177\000\000\000\000\000\000\000\000\000\000\200N\335\367\377\177"
0$1 = "\n\000@", '\000' , "\220\336\377\377\377\177\000\000\360\334\377\377\377\177\000\000\000\000\000\000\000\000\000\000\200N\335\367\377\177"
0$1 = "\n\000@", '\000' , "\220\336\377\377\377\177\000\000\360\334\377\377\377\177\000\000\000\000\000\000\000\000\000\000\200N\335\367\377\177"
0$1 = "\n\000@", '\000' , "\220\336\377\377\377\177\000\000\360\334\377\377\377\177\000\000\000\000\000\000\000\000\000\000\200N\335\367\377\177"
0$1 = "\n\000@", '\000' , "\220\336\377\377\377\177\000\000\360\334\377\377\377\177\000\000\000\000\000\000\000\000\000\000\200N\335\367\377\177"
0$1 = "\n\000@", '\000' , "\220\336\377\377\377\177\000\000\360\334\377\377\377\177\000\000\000\000\000\000\000\000\000\000\200N\335\367\377\177"
0$1 = "\n\000@", '\000' , "\220\336\377\377\377\177\000\000\360\334\377\377\377\177\000\000\000\000\000\000\000\000\000\000\200N\335\367\377\177"
0$1 = "\n\000@", '\000' , "\220\336\377\377\377\177\000\000\360\334\377\377\377\177\000\000\000\000\000\000\000\000\000\000\200N\335\367\377\177"
0$1 = "\n\000@", '\000' , "\220\336\377\377\377\177\000\000\360\334\377\377\377\177\000\000\000\000\000\000\000\000\000\000\200N\335\367\377\177"
00N5777"

Your garbage values may vary but the important part is near the beginning of the output. Disregard the $1 = “ at the start.

Why did our function grab \n first?

Look at the text at the beginning of our dictionaries/large again and see if you get why:

a\n <- this \n is on every line but is invisible in the text editor
aaa
aaas
aachen
aalborg

It’s because our first function fgetc() only grabbed the letter, not the invisible character \n that represents a line-break.

If we never ran the first function fgetc() and only ran fgets() we would have gotten this output instead:

$1 = "a\n$1 = "a\n\000@", '\000' , "\220\336\377\377\377\177\000\000\360\334\377\377\377\177\000\000\000\000\000\000\000\000\000\000\200N\335\367\377\177"
0@", '$1 = "a\n\000@", '\000' , "\220\336\377\377\377\177\000\000\360\334\377\377\377\177\000\000\000\000\000\000\000\000\000\000\200N\335\367\377\177"
0' , "067777$1 = "a\n\000@", '\000' , "\220\336\377\377\377\177\000\000\360\334\377\377\377\177\000\000\000\000\000\000\000\000\000\000\200N\335\367\377\177"
0$1 = "a\n\000@", '\000' , "\220\336\377\377\377\177\000\000\360\334\377\377\377\177\000\000\000\000\000\000\000\000\000\000\200N\335\367\377\177"
0047777$1 = "a\n\000@", '\000' , "\220\336\377\377\377\177\000\000\360\334\377\377\377\177\000\000\000\000\000\000\000\000\000\000\200N\335\367\377\177"
0$1 = "a\n\000@", '\000' , "\220\336\377\377\377\177\000\000\360\334\377\377\377\177\000\000\000\000\000\000\000\000\000\000\200N\335\367\377\177"
0$1 = "a\n\000@", '\000' , "\220\336\377\377\377\177\000\000\360\334\377\377\377\177\000\000\000\000\000\000\000\000\000\000\200N\335\367\377\177"
0$1 = "a\n\000@", '\000' , "\220\336\377\377\377\177\000\000\360\334\377\377\377\177\000\000\000\000\000\000\000\000\000\000\200N\335\367\377\177"
0$1 = "a\n\000@", '\000' , "\220\336\377\377\377\177\000\000\360\334\377\377\377\177\000\000\000\000\000\000\000\000\000\000\200N\335\367\377\177"
0$1 = "a\n\000@", '\000' , "\220\336\377\377\377\177\000\000\360\334\377\377\377\177\000\000\000\000\000\000\000\000\000\000\200N\335\367\377\177"
0$1 = "a\n\000@", '\000' , "\220\336\377\377\377\177\000\000\360\334\377\377\377\177\000\000\000\000\000\000\000\000\000\000\200N\335\367\377\177"
0$1 = "a\n\000@", '\000' , "\220\336\377\377\377\177\000\000\360\334\377\377\377\177\000\000\000\000\000\000\000\000\000\000\200N\335\367\377\177"
0$1 = "a\n\000@", '\000' , "\220\336\377\377\377\177\000\000\360\334\377\377\377\177\000\000\000\000\000\000\000\000\000\000\200N\335\367\377\177"
0$1 = "a\n\000@", '\000' , "\220\336\377\377\377\177\000\000\360\334\377\377\377\177\000\000\000\000\000\000\000\000\000\000\200N\335\367\377\177"
00N5777"

So now we know running fgets() will grab everything from the current file pointer position up to and including the \n (line-break special character).

Continue on by typing n a

161         fscanf(dictionary_stream, "%s", test_string);
(gdb) n
164         fscanf(dictionary_stream, "%s", test_string);

So again the first line we already saw, but since we’ve move one more line ahead we can now see what test_string is holding after running fscanf.

type p test_string. The output should be something like this:

$1 = "aaa", '$1 = "aaa", '\000' , "\220\336\377\377\377\177\000\000\360\334\377\377\377\177\000\000\000\000\000\000\000\000\000\000\200N\335\367\377\177"
0' , "067777$1 = "aaa", '\000' , "\220\336\377\377\377\177\000\000\360\334\377\377\377\177\000\000\000\000\000\000\000\000\000\000\200N\335\367\377\177"
0$1 = "aaa", '\000' , "\220\336\377\377\377\177\000\000\360\334\377\377\377\177\000\000\000\000\000\000\000\000\000\000\200N\335\367\377\177"
0047777$1 = "aaa", '\000' , "\220\336\377\377\377\177\000\000\360\334\377\377\377\177\000\000\000\000\000\000\000\000\000\000\200N\335\367\377\177"
0$1 = "aaa", '\000' , "\220\336\377\377\377\177\000\000\360\334\377\377\377\177\000\000\000\000\000\000\000\000\000\000\200N\335\367\377\177"
0$1 = "aaa", '\000' , "\220\336\377\377\377\177\000\000\360\334\377\377\377\177\000\000\000\000\000\000\000\000\000\000\200N\335\367\377\177"
0$1 = "aaa", '\000' , "\220\336\377\377\377\177\000\000\360\334\377\377\377\177\000\000\000\000\000\000\000\000\000\000\200N\335\367\377\177"
0$1 = "aaa", '\000' , "\220\336\377\377\377\177\000\000\360\334\377\377\377\177\000\000\000\000\000\000\000\000\000\000\200N\335\367\377\177"
0$1 = "aaa", '\000' , "\220\336\377\377\377\177\000\000\360\334\377\377\377\177\000\000\000\000\000\000\000\000\000\000\200N\335\367\377\177"
0$1 = "aaa", '\000' , "\220\336\377\377\377\177\000\000\360\334\377\377\377\177\000\000\000\000\000\000\000\000\000\000\200N\335\367\377\177"
0$1 = "aaa", '\000' , "\220\336\377\377\377\177\000\000\360\334\377\377\377\177\000\000\000\000\000\000\000\000\000\000\200N\335\367\377\177"
0$1 = "aaa", '\000' , "\220\336\377\377\377\177\000\000\360\334\377\377\377\177\000\000\000\000\000\000\000\000\000\000\200N\335\367\377\177"
0$1 = "aaa", '\000' , "\220\336\377\377\377\177\000\000\360\334\377\377\377\177\000\000\000\000\000\000\000\000\000\000\200N\335\367\377\177"
00N5777"

Again the garbage values are not important and will most likely be different for you. But the first part is the most interesting:

"aaa", '"aaa", '\000'
0'

Since the previous function grabbed the line-break. The pointer moved to the start of the next line. If you scroll up and check the second line of dictionaries/large was in fact the string “aaa.”

Let’s run the next fscanf()

type n

type p test_string

now the output should be something like this:

166         printf("end of example\n");
(gdb) p test_string 
$2 = "aaas", '166         printf("end of example\n");
(gdb) p test_string 
$2 = "aaas", '\000' , "\220\336\377\377\377\177\000\000\360\334\377\377\377\177\000\000\000\000\000\000\000\000\000\000\200N\335\367\377\177"
0' , "067777166         printf("end of example\n");
(gdb) p test_string 
$2 = "aaas", '\000' , "\220\336\377\377\377\177\000\000\360\334\377\377\377\177\000\000\000\000\000\000\000\000\000\000\200N\335\367\377\177"
0166         printf("end of example\n");
(gdb) p test_string 
$2 = "aaas", '\000' , "\220\336\377\377\377\177\000\000\360\334\377\377\377\177\000\000\000\000\000\000\000\000\000\000\200N\335\367\377\177"
0047777166         printf("end of example\n");
(gdb) p test_string 
$2 = "aaas", '\000' , "\220\336\377\377\377\177\000\000\360\334\377\377\377\177\000\000\000\000\000\000\000\000\000\000\200N\335\367\377\177"
0166         printf("end of example\n");
(gdb) p test_string 
$2 = "aaas", '\000' , "\220\336\377\377\377\177\000\000\360\334\377\377\377\177\000\000\000\000\000\000\000\000\000\000\200N\335\367\377\177"
0166         printf("end of example\n");
(gdb) p test_string 
$2 = "aaas", '\000' , "\220\336\377\377\377\177\000\000\360\334\377\377\377\177\000\000\000\000\000\000\000\000\000\000\200N\335\367\377\177"
0166         printf("end of example\n");
(gdb) p test_string 
$2 = "aaas", '\000' , "\220\336\377\377\377\177\000\000\360\334\377\377\377\177\000\000\000\000\000\000\000\000\000\000\200N\335\367\377\177"
0166         printf("end of example\n");
(gdb) p test_string 
$2 = "aaas", '\000' , "\220\336\377\377\377\177\000\000\360\334\377\377\377\177\000\000\000\000\000\000\000\000\000\000\200N\335\367\377\177"
0166         printf("end of example\n");
(gdb) p test_string 
$2 = "aaas", '\000' , "\220\336\377\377\377\177\000\000\360\334\377\377\377\177\000\000\000\000\000\000\000\000\000\000\200N\335\367\377\177"
0166         printf("end of example\n");
(gdb) p test_string 
$2 = "aaas", '\000' , "\220\336\377\377\377\177\000\000\360\334\377\377\377\177\000\000\000\000\000\000\000\000\000\000\200N\335\367\377\177"
0166         printf("end of example\n");
(gdb) p test_string 
$2 = "aaas", '\000' , "\220\336\377\377\377\177\000\000\360\334\377\377\377\177\000\000\000\000\000\000\000\000\000\000\200N\335\367\377\177"
0166         printf("end of example\n");
(gdb) p test_string 
$2 = "aaas", '\000' , "\220\336\377\377\377\177\000\000\360\334\377\377\377\177\000\000\000\000\000\000\000\000\000\000\200N\335\367\377\177"
0166         printf("end of example\n");
(gdb) p test_string 
$2 = "aaas", '\000' , "\220\336\377\377\377\177\000\000\360\334\377\377\377\177\000\000\000\000\000\000\000\000\000\000\200N\335\367\377\177"
00N5777"

BINGO. Ignoring all the garbage values and previous printf statement we can clearly see “aaas” followed by the null terminator and some more zeros.

!IMPORTANT

See all those weird garbage values? That’s because when we declared our variable test_string[LENGTH+1], We didn’t initialize it to anything. Here is a trick I learned after I couldn’t get my code to work and went on a google binge for a few hours:

Go to the line char test_string[LENGTH+1];

and change it to char test_string[LENGTH+1] = {‘\0’};

Quit out of gdb, run make and then go back into GDB and step into the function and p the values by moving through our load()

Now you’ll get output similar to this:

$1 = "aaa", '$1 = "aaa", '\000' <repeats 42 times>
0' <repeats 42 times>

When you initialize the first index, it automatically makes all other indicies a null terminator.

This might not matter now, but when comparing strings we will have to worry about every index of our array (in my implementation anyways).

bool load(const char* dictionary) part 2

Now erase all the test code we’ve written in our load() function.

We know fscanf is the way to go for reading in each word, so let’s talk about it’s technical definition and how we can use it in a loop to get all our words and start creating our data structure.

From tutorials point:

Description

The C library function int fscanf(FILE *stream, const char *format, …) reads formatted input from a stream.

Return Value

This function returns the number of input items successfully matched and assigned, which can be fewer than provided for, or even zero in the event of an early matching failure.

So we’re given the filename from the main function. We still need that dictionary_stream pointer to the file so let’s put that back into the function:

// create a FILE pointer and fopen() to start the stream
    FILE * dictionary_stream = fopen(dictionary, "r");

We know there is a #define in the main function where LENGTH is a constant for the longest word possible. Let’s also create a place to hold the current word in the loop:

char temp[LENGTH+1] = {'char temp[LENGTH+1] = {'\0'};
'};

Remember we need that +1 for the ‘\0’ null terminator if we ever come across a word that is the max length.

So now fscanf() will return the number of things we tell it to pull on success. So what are we asking it for? The second argument we pass it is the format specifier.

returns an int: fscanf(the file pointer, format you want, a place to store the thing you want)

So if we say hey fscanf here is the dictionary_stream I want you to pull a %s which is a string, and store it in temp,

Because the file pointer dictionary_stream moves automatically everytime it’s called we just have to keep running fscanf() until it no longer returns the number we are expecting.

So each time we call fscanf it pulls in a single word. We know this because we specify “%s” which is a one string. So by that logic we can say while fscanf returns the number 1 keep calling the function.

Let’s create a small test to see if this is working. Here is the entire function as it stands to avoid confusion:

If we run ./speller texts/quote.txt while in the pset5 directory we get this output:

143091
Could not load dictionaries/large.

Since we’re returning false the main program returns early with the response “Could not load dictionaries/large. So hopefully you understand what the number 143091 represents.

We’ve successfully pulled in each word in our loop! We aren’t currently doing anything with that variable temp which holds the current word in that loop, but we now have the total number of words available within the counter variable.

A Quick Detour

Before we continue loading each temp variable in the while loop into a data structure, let’s score any easy win by implementing another function using only a few lines of code. We have that counter v

Within dictionary.c we have another function just below the load() function called size().

Above the function the comment reads: “Returns number of words in dictionary if loaded else 0 if not yet loaded.”

Now if you didn’t already know, variables declared within a function only exist in that function.

Go to https://repl.it and spin up a quick C programming environment to run this code for a second:

#include <stdio.h>

char *global_animal = "bears";

char *lions(void){
  printf("You can not escape the %s!\n", global_animal);
  char *animal = "lions";
  return animal;
}

char *tigers(void){
  // the variable animal is created
  // within lions but this function
  // has no idea.

  // try to run the code with the
  // line below uncommented and you
  // will get an error
  //printf("NOPE CITY %s\n", animal);


  printf("%s are everywhere!\n", global_animal);
  char *animal = "tigers";
  return animal;
}

int main(void){
  printf("%s, %s, %s, oh my!\n", lions(), tigers(), global_animal);
}

So given this information we know we have a counter variable in the load() function where it is building up the final count of words. But we need to pass it along to the size() function.

Let’s declare the variable outside of the load() function at the top near our #include directives.

#include <stdio.h>
#include <stdbool.h>
#include <stdlib.h>

// global variable
int dictionary_word_counter = 0;

I’ve changed the name from counter to dictionary_word_counter. Go back to the function load() and remember to change the name there too!

Now our load() f

All we have to do for size() n

/**
 * Returns number of words in dictionary if loaded else 0 if not yet loaded.
 */
unsigned int size(void)
{
    return dictionary_word_counter;
}

Some cats…

bool load(const char* dictionary) part 3: The Hashtable explained

Getting values from an array is easy when you know the index. If I have an array called numbers[42, 18, 9001] and I want 42, it’s easy if I know to look in array location [0] right? But if I don’t know where my number is I need to run a loop and check each array location until I get a matching number and then return that index.

Running that loop on 3 items isn’t a big deal.

Running that loop on 143,000+ words is a big deal.

It will either take a very long time, or crash. So we have to find a way to place a word in the array quickly and get it back quickly and it has to work exactly the same.

The good news is some smarty pants person already found a solution for this, we just have to implement the process.

This process of fast insertion and retrieval is called a hashtable and it involves taking

a regular array

our regular char* text

a funky function we’ll called generate_hash

generate_hash() takes our text and calculates our index based on some math it will perform on the characters. Remember characters have ascii codes that are integers so we can do some clever mathematics like adding up each word and using the output to determine the index we will place it. DO NOT IMPLEMENT YOUR OWN HASH FUNCTION. It’s a big to do that this course doesn’t cover. Here is a link:

https://stackoverflow.com/questions/628790/have-a-good-hash-function-for-a-c-hash-table

Confused? Here is a drawing:

So now that you understand how we can place words in our data structure we need to take things one step further, using something called a singly linked list or linked list for short. We can not hold 143,000 array indexes because our resources on the stack are limited. We have to hold a pointer to a linked list and search for our word. I’ll do my best to explain all this.

The Stack and The Heap

We put everything we declare in our program on the stack. Anything dynamically allocated at runtime using malloc() we put on the heap. Unfortunately for us the stack won’t hold 143,000 array indexes. We need to lower the number of indexes and somehow use the heap.

Imagine the stack as this nice orderly piece of data like the array in the picture above, while the heap is a mess of chaos. We have to get clever to leverage the space the heap gives us without losing the stuff we throw in there. Try to use this image as a reference we’ll talk about later:

The heap is massive in comparison to the stack but it’s complete chaos. Using malloc() we can utilize a portion of heap memory. When we run the memory-allocation function it returns the address to the first block of memory since we can allocate multiple blocks. We can allocate 4 blocks of size(char) for example.

The biggest challenge when dealing with heap memory is that we only get an small address. Confused what that means?

The address to your house is a few words and numbers. That can lead a person to your actual house where you have your computer, toaster, clothes and other house hold items. If someone is 50 miles away and they have no idea where you live, and no address they won’t find you.

The address to a soccer stadium is still just a few words and numbers but can hold thousands of people. Lose this address, you will miss the soccer game.

Because we are given a pointer to the memory address, the stack can handle that tiny piece of information. We can leave the heap to deal with the actual values. We always have to keep track of the addresses. Keep this all in mind as we talk about building a linked list.

Singly Linked Lists

Run this little c program:

#include <stdio.h>
#include <stdlib.h>

int main(void) {
  int *one_char = malloc(sizeof(char));
  int *two_char = malloc(sizeof(char)*2);

  *(one_char + 0) = 'a'; // same as *one_char = 'a';

  *(two_char + 0) = 'b'; // same as two_char = 'b';

  *(two_char+1) = 'c';

  printf("%c, %c, %c\n", *one_char, *two_char, *(two_char+1));

  return 0;
}

Here is a diagram of what is happening in our little program:

First we malloc enough space for one character. This creates the first red circle.

Then we create the second bigger red circle to hold 2 blocks, each the size of 1 character.

Then we assign a value to that block of memory. We can use pointer arithmetic to access different blocks of memory. I’ve put a number below each character to show the corresponding number you have to add to the dereferenced pointer.

Notice how our stack only has to hold a small memory address that points to the chunk of memory. Is it possible then to string along multiple chunks in one array index on the stack? Let’s make a struct in C and call it a node, then let’s make a tiny linked list:

Check this diagram to visualize the code above and try to figure out what is happening:

bool load(const char* dictionary) part 4: The finale

So here are all the things our load() function should do:

Access a global FILE * dictionary_stream and fopen() the filename passed in from the main speller program
Declare and initialize a char array of size length + 1 for the NULL terminator
Create a while loop while fscanf() returns the number 1
1. within the loop create space on the heap for our struct node (do not get confused with node_ptr we need room for a node but we are passed a node_ptr to store on the stack which points to our newly acquired space on the heap.)
2. save the current word in the loop to the node’s word property
3. generate the correct index (generate a hash) to place the node_ptr
4. when inserting into the array make sure to check if there is an item at that array location, which means we have to do some singly list rearrangement techniques, for example placing the new item to the “head” of the list. Just keep in mind we need to create a temp node_ptr that knows the address of the old head of the list (the original array value on the stack). We update the array index value to the newest address to the newest node, Then we take the newest node’s next property and assign it to the temp node_ptr. Which means we’ve weaved another item to that array index. The diagram above is an example of what it should look like at the end.

Hopefully if you piece together all the ideas in these little programs i’ve shared, plus some creative programing you can come up with your own solution. Here is what I came up with:

Do you want more explanations?

unload() loops through each array location and free’s each node from the linked list. I created a function free_linked_list() to make it look cleaner.

check() takes in a word from the passage text and after “cleaning” (all lower case) the data, generates a hash number (an index in the array to search), and loops through however many nodes exist at that location, comparing each dictionary word to the given passage word. We just have to return true or false.

You will see at the top of my solution I created functions for working with nodes and generating the hash. If you want me to go more in depth on any of those other functions please leave a comment and i’ll do my best to share some more diagrams and clarify what is happening. My goal in this article was to try and simplify the idea of data structures and sharing the solutions to some of the most frustrating moments I had while trying to write up a working program.

As always,

rest, but never quit.

0 0 vote

Article Rating

15 Comments

oldest

newest most voted

Inline Feedbacks

View all comments

Isaac Owens

6 years ago

Hello! This has completely saved my life and answered to major questions I had concerning implementing hashtables. I do have a couple questions. 1. Your syntax for defining your node structure is different than what they teach in the course. Can you explain why you chose to use this syntax? 2. Why do you not initialize your hashtable with the code that was given in the original dictionary.c? I’m guess I may have different code since I’m doing this assignment in a different year. I really dig your clarity and attention to detail! Hope to get a convo started with… Read more »

Author

Johnny

Reply to Isaac Owens

Hi Isaac! I’m so happy this has helped you. I think the differences you see are a combination of my own research outside of the course notes (hackerearth.com, geeksforgeeks.org) and the changes made from the new course year. Best of luck and stop back for more articles coming soon =)!

Isaac-Joshua Owens

Reply to Johnny

I see! That explains a few other things as well. For instance I wasn’t able to try the size_t in your hash function even when #including the proper header. I also tried strcasecmp in the check section and that also did not work with the proper header. May be because of the IDE that we are using now.

I’m currently debugging my check implementation because the tolower function does not seem to be working. Every word is being returned as misspelled,

Reply to Isaac-Joshua Owens

I would definitely recommend debugging. Step through the program, step into and out of functions and see the variable values as things loop.

Vineet Suhas Soni

Hello!!! Thanks a lot for being kind enough to share your code and help others. It has helped me a lot in understanding most of the errors that I was facing in my code. But there still is some thing wrong with my code that I am not able to figure out. The number of words in the dictionary turn out to be 71456 which is incorrect and hence the number of misspelled words also reflects an incorrect value. I am sharing my code and I hope you will be able to find some time to go through my to… Read more »

Reply to Vineet Suhas Soni

Thanks so much for your kind words! The best advice I can give with these complex bugs is using the debugger, because you can freeze the program in time, inside loops or just before if statements and check the values of any variable available. Here is the link that takes you directly to a section where I write about the debugger. It is VERY clunky at first. But stick with it and you will discover why it’s so powerful when you get stuck after writing a lot of code.

https://johnyzaguirre.com/2018/07/31/cs50-week-2-shorts-part-2/?preview_id=607&preview_nonce=0f10722c45&preview=true&_thumbnail_id=1100#gdb

Best of luck and don’t give up!

Adrian

THANK YOU!!!!!!!!!!!!!
You save my life, I have no idea how to solve the problem, it works 100%

Reply to Adrian

I’m so glad this helped you! Welcome to the other side 😁

Tin

I have tried out your codes. But I have got some errors.
https://submit.cs50.io/check50/65c506c4d07e9bed3abbc79f1e09fa1c20bb6a53
Seems like there is a lot of memory leaking. Any idea to fix? Thank you

Reply to Tin

Unfortunately you may have to use Valgrind, which I can honestly say is not my specialty. What I can suggest with confidence is to use GDB. If you are familiar with GDB, step into and out of the functions and checking the state of all your variables in loops step by step. In some of my past articles I talk about GDB and how I use it. I hope that helps and best of luck. This is definitely the hardest part of CS50 but you are almost there don’t give up!

Selim

Thank you very much for the through explanation. Your post helped me understand the error with my code. I was stuck for hours until I found this!

Reply to Selim

I’m glad this helped you!

Federico

First of all, thank you so much for writing this post. You can’t imagine how helpful it is for someone going through CS50 with a full time job, a family and very little time. Regarding what you wrote 🙂 “You will see at the top of my solution I created functions for working with nodes and generating the hash. If you want me to go more in depth on any of those other functions please leave a comment and i’ll do my best to share some more diagrams and clarify what is happening” If you have the time to go… Read more »

Reply to Federico

Happy to help! What specifically would you like clarified? Great job getting this far, This was the toughest part for me!

Hi! Lines 55 from 75. To be more precise, I get a little confused on the use of “node_ptr” and “node”.