Lesson 7: Working with text

Concepts discussed:	Strings, dictionaries, sorting, regular expressions.
Work method:	If you are having trouble understanding parts of the material or have other questions, feel free to grab a teaching assistant and ask them. Try to do the exercises before you look at the answer.
Estimated working time:	6 hours.
Examination:	Mandatory examination of the assignments in accordance with the course page.

Text strings

Thus far we've only used strings (str) as variables and in printing. In this lesson we'll look further into the details of strings and what can be done with them.

Most of the actions we've performed on lists can also be done with strings. However, strings, just like tuples, are immutable and thus we can't modify them.
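
As a small sketch of what immutability means in practice (the variable names are only illustrative), the code below reads a string by index, shows that assigning to an index fails, and builds a new string instead:

```python
word = 'dansa'

# Reading by index and slicing works just as for lists
first = word[0]        # 'd'
tail = word[1:]        # 'ansa'

# Assigning to an index does not work, since strings are immutable
try:
    word[0] = 'D'
    assignment_worked = True
except TypeError:
    assignment_worked = False

# Instead, we build a new string from the parts we want
capitalized = 'D' + tail
print(assignment_worked, capitalized)
```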

Read the mini lesson on strings to see some frequently used operations on strings.

Here, we define a string to be used in the following examples:

text = """Kvällens gullmoln fästet kransa.
Älvorna på ängen dansa,
och den bladbekrönta näcken
gigan rör i silverbäcken.
"""

The string contains letters, delimiters, spaces and end-of-line (EOL) characters. Normally, EOL characters are not seen but they ensure that the text will be correctly printed:

>>> print(text)
Kvällens gullmoln fästet kransa.
Älvorna på ängen dansa,
och den bladbekrönta näcken
gigan rör i silverbäcken.

>>> text
'Kvällens gullmoln fästet kransa.\nÄlvorna på ängen dansa,\noch den bladbekrönta näcken\ngigan rör i silverbäcken.\n'

Count characters and strings

Code	Value	Comment
len(text)	111	The number of characters.
text.count(' ')	12	The number of spaces.
text.count('n ')	4	Note that an 'n' at the end of a line is not followed by a space and is therefore not counted.
text.count('\n')	4	The number of rows in the text (more exactly, the number of EOL characters, since the last row might not end with an EOL character).
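
The difference between counting EOL characters and counting rows can be illustrated with a small sketch. The two sample strings are made up for this example; the string method splitlines returns a list of the rows in a string:

```python
with_eol = 'first row\nsecond row\n'     # the last row ends with an EOL character
without_eol = 'first row\nsecond row'    # the last row does not

for s in (with_eol, without_eol):
    eol_count = s.count('\n')            # number of EOL characters
    row_count = len(s.splitlines())      # number of rows
    print(f'EOL characters: {eol_count}, rows: {row_count}')
```

Both strings contain two rows, but only the first one contains two EOL characters.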

Example: Count the number of letters in the string text.

# Count letters
n = 0
for c in text: # c will assume the value of every character in the string
    if c.isalpha():   # Add 1 if c is a letter
        n += 1
print(f'Number of letters: {n}')

The string method isalpha returns True if the string is non-empty and every character in it is a letter, otherwise False.
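
A few made-up test strings give a feeling for how isalpha behaves on strings of different lengths (a small sketch, not part of the original example):

```python
examples = ['a', 'ä', 'abc', 'ab3', 'a b', '']

# Store the result for each example string in a dictionary
results = {s: s.isalpha() for s in examples}

for s, is_letters in results.items():
    print(repr(s), is_letters)
```

Only 'a', 'ä' and 'abc' give True. A digit, a space or an empty string makes the result False.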

An alternative solution is to use list comprehension:

print(len([c for c in text if c.isalpha()]))

Example: Count the number of Swedish national letters.

As seen above, the method isalpha counts the Swedish letters as letters, and it also counts a series of other national characters, e.g. ü, é, á, è, î, Ç, ô, Á and more.

There is no method "isswedish" corresponding to isalpha, but we can write code that counts the number of occurrences of the characters å, ä and ö.

# Count 'Swedish' letters
n = 0
for c in text:
    if c.lower() in 'åäö': # or if c in 'åäöÅÄÖ':
        n += 1
print(f'Number of Swedish letters: {n}')
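
The same loop can be wrapped in a function so that it can be reused for other texts and other sets of letters. This is only a sketch; the function name count_letters and its parameters are assumptions made for this example:

```python
def count_letters(s, letters='åäö'):
    """Return how many characters of s occur in letters, ignoring case."""
    n = 0
    for c in s:
        if c.lower() in letters:
            n += 1
    return n


print(count_letters('Älvorna på ängen dansa,'))   # Ä, å and ä give 3
```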

Exercise

Write a script that counts and then prints the number of national letters (å, ä, ö, ü, é, á, è, î, Ç, ô, Á, etc.) in a text.
Tip: It's easier to list the international letters than the national ones!
Answer
```
test_text = 'aVcdåäöÅÄÖüéáèîçÇôÜÁxyZX'
iletters = 0
letters = 0
for c in test_text:
    if c.isalpha():
        letters += 1
    if c.lower() in 'abcdefghijklmnopqrstuvwxyz':
        iletters += 1
print(f'Number of national letters: {letters - iletters}')
```
An alternative solution uses a list comprehension with an if clause. The condition in the if clause is True only if two conditions are satisfied simultaneously:
1. c.isalpha(), i.e. the character c is a letter, and
2. ord(c.lower()) > 122, i.e. the Unicode code point of the lower-case form of c is above 122. The first 128 Unicode code points represent the ASCII characters, and the lower-case letter "z", the last ASCII letter, has code point 122. Therefore, any national letter has a code point above 122.
Answer
```
national_letters = [
    c for c in test_text
    if c.isalpha() and ord(c.lower()) > 122]

print(f'Number of national letters: {len(national_letters)}')
```
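
To see why both conditions are needed, we can print the code point and the result of isalpha for a few characters. This is a small illustrative sketch; the character '{' is included because it has a code point above 122 without being a letter:

```python
# Each row holds a character, its code point and whether it is a letter
rows = [(c, ord(c), c.isalpha()) for c in 'z{åäö']

for c, code_point, is_letter in rows:
    print(repr(c), code_point, is_letter)
```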

Replacing characters

Assume that we want to "internationalize" a text by replacing å with aa, ä with ae, and ö with oe.

Method 1: Method `replace`

txt = text.replace('å', 'aa')
txt = txt.replace('ä', 'ae')
txt = txt.replace('ö', 'oe')
txt = txt.replace('Å', 'Aa')
txt = txt.replace('Ä', 'Ae')
txt = txt.replace('Ö', 'Oe')
print(txt)

Remember that strings are immutable: the method replace does not modify the string it is called on but returns a new string, so we must assign the returned value to a variable.
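
The following sketch, using a short made-up string, shows that the original string is left unchanged. Since replace returns a string, the calls can also be chained:

```python
original = 'näcken rör'

# Each call to replace returns a new string, so the calls can be chained
result = original.replace('ä', 'ae').replace('ö', 'oe')

print(original)   # näcken rör (unchanged)
print(result)     # naecken roer
```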

Method 2: Use a dictionary

We can create a translation table using a dictionary (see the mini lesson on dictionaries). We add the Swedish letters as keys, with the corresponding combinations of international letters as their values:

transtab = {'å': 'aa', 'ä': 'ae', 'ö': 'oe',
            'Å': 'Aa', 'Ä': 'Ae', 'Ö': 'Oe'}
txt = ''
for c in text:              # Iterate over all characters in the text
    if c in transtab:       # If the character c exists as a key in the dictionary
        txt += transtab[c]  # Add the value corresponding to c in the dictionary to txt
    else:                   # or, if c is not in the dictionary
        txt += c            # Add the character c to txt
print(txt)
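
As a possible alternative (not the method used in this lesson), the built-in string methods str.maketrans and translate can do the same job with the same dictionary. The sketch below also shows a compact variant that uses the dictionary method get together with join. The variable names sample, table, converted and joined are made up for this example:

```python
transtab = {'å': 'aa', 'ä': 'ae', 'ö': 'oe',
            'Å': 'Aa', 'Ä': 'Ae', 'Ö': 'Oe'}
sample = 'Älvorna på ängen dansa,'

# str.maketrans converts the dictionary into a table that translate can use
table = str.maketrans(transtab)
converted = sample.translate(table)

# get(c, c) returns the translation of c, or c itself if c is not a key
joined = ''.join(transtab.get(c, c) for c in sample)

print(converted)   # Aelvorna paa aengen dansa,
print(joined == converted)
```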

Calculate the frequency of characters

A dictionary is well suited to calculating the frequency of characters in a text. We use the characters as keys and the number of occurrences as values.

freq = {}                  # Create an empty dictionary
for c in text:             # Iterate over the characters in the text
    if c.isalpha():        # We count only letters
        c = c.lower()      # We consider 'a' and 'A' to be the same letter
        if c in freq:      # If c already is in the dictionary
            freq[c] += 1   # Add 1 to the value stored in freq[c]
        else:              # or, if c is not in freq
            freq[c] = 1    # Add c to freq and set its value to 1
print(freq)

Result:

{'k': 5, 'v': 3, 'ä': 6, 'l': 8, 'e': 8, 'n': 13, 's': 5, 'g': 4, 'u': 1, 'm': 1, 'o': 3, 'f': 1, 't': 3, 'r': 6, 'a': 8, 'p': 1, 'å': 1, 'd': 3, 'c': 3, 'h': 1, 'b': 3, 'ö': 2, 'i': 3}
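
The if/else in the loop above can be shortened with the dictionary method get, and the standard library class collections.Counter does the counting for us. The sketch below uses a single line of the poem to keep it short. It illustrates alternatives and is not meant to replace the loop above:

```python
from collections import Counter

sample = 'Älvorna på ängen dansa,'

# Variant 1: get(c, 0) returns 0 when c is not yet a key in the dictionary
freq = {}
for c in sample.lower():
    if c.isalpha():
        freq[c] = freq.get(c, 0) + 1

# Variant 2: Counter counts the elements it is given
counter = Counter(c for c in sample.lower() if c.isalpha())

print(freq == dict(counter))    # both variants give the same counts
print(counter.most_common(1))   # the most frequent letter and its count
```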

The result is presented in the order in which the letters were added to the dictionary, which makes it hard to read. It should be presented either in alphabetical order or ordered by frequency.

In the mini-lesson on dictionaries, multiple methods of iterating over a dictionary are presented. Here we will create a list of tuples of the key-value pairs in the dictionary and then sort the list using the method sort:

lista = list(freq.items())
print('List  :', lista)
lista.sort()
print('Sorted:', lista)

We now get this output:

List  : [('k', 5), ('v', 3), ('ä', 6), ('l', 8), ('e', 8), ('n', 13), ('s', 5), ('g', 4), ('u', 1), ('m', 1), ('o', 3), ('f', 1), ('t', 3), ('r', 6), ('a', 8), ('p', 1), ('å', 1), ('d', 3), ('c', 3), ('h', 1), ('b', 3), ('ö', 2), ('i', 3)]
Sorted: [('a', 8), ('b', 3), ('c', 3), ('d', 3), ('e', 8), ('f', 1), ('g', 4), ('h', 1), ('i', 3), ('k', 5), ('l', 8), ('m', 1), ('n', 13), ('o', 3), ('p', 1), ('r', 6), ('s', 5), ('t', 3), ('u', 1), ('v', 3), ('ä', 6), ('å', 1), ('ö', 2)]
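
Note that ä comes before å in the sorted list. Strings are compared by Unicode code point, whereas the Swedish alphabet ends with å, ä, ö. One possible way of getting the Swedish order is sketched below. It uses the key parameter, which is described under "Ordering output by frequency" further down. The names alphabet and swedish_key are made up for this example, and the function assumes that every letter occurs in alphabet:

```python
alphabet = 'abcdefghijklmnopqrstuvwxyzåäö'


def swedish_key(pair):
    """Return the position of the letter in the Swedish alphabet."""
    return alphabet.index(pair[0])


pairs = [('ä', 6), ('å', 1), ('ö', 2), ('a', 8)]

default_order = sorted(pairs)                     # ordered by code point
swedish_order = sorted(pairs, key=swedish_key)    # ordered by the alphabet above

print(default_order)   # [('a', 8), ('ä', 6), ('å', 1), ('ö', 2)]
print(swedish_order)   # [('a', 8), ('å', 1), ('ä', 6), ('ö', 2)]
```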

In the mini-lesson on lists we described how to make the output of a list span multiple rows:

for index, e in enumerate(lista, start=1):
    print(e, end=' ')
    if index % 8 == 0:
        print()
print('\n')

This code will output:

('a', 8) ('b', 3) ('c', 3) ('d', 3) ('e', 8) ('f', 1) ('g', 4) ('h', 1)
('i', 3) ('k', 5) ('l', 8) ('m', 1) ('n', 13) ('o', 3) ('p', 1) ('r', 6)
('s', 5) ('t', 3) ('u', 1) ('v', 3) ('ä', 6) ('å', 1) ('ö', 2)

It will look even better, though, if we extract the values from the tuples:

for index, e in enumerate(lista, start=1):
    print(f'{e[0]}:{e[1]:2d}', end='  ')
    if index % 8 == 0:
        print()
print('\n\n')

Which outputs:

a: 8  b: 3  c: 3  d: 3  e: 8  f: 1  g: 4  h: 1
i: 3  k: 5  l: 8  m: 1  n:13  o: 3  p: 1  r: 6
s: 5  t: 3  u: 1  v: 3  ä: 6  å: 1  ö: 2

Ordering output by frequency

By default, a list of tuples is sorted by the first element of each tuple (with ties broken by the following elements). If we want to sort by frequency, we need a way of changing the value that the sorting function compares.

We can specify how the function sorted and the method list.sort compare elements. This is done by passing a function, through the parameter key, that returns the value to be used when comparing each element. In our case, we can use the following function:

def part2(e):
    return e[1]

The above function returns the second element of a tuple. To sort by frequency, we can write the following:

freq_order = sorted(lista, key=part2, reverse=True)
for i, e in enumerate(freq_order, start=1):
    print(f'{e[1]:2d} {e[0]}', end='\t')
    if i % 6 == 0:
        print()
print()

which outputs:

13 n   8 a   8 e   8 l   6 r   6 ä
 5 k   5 s   4 g   3 b   3 c   3 d
 3 i   3 o   3 t   3 v   2 ö   1 f
 1 h   1 m   1 p   1 u   1 å

Note:

When passing the function name as the value of the key argument, we tell the sorting function that this function gives the value to be used when sorting the elements.
The parameter reverse=True makes it so that the result is presented in descending order, i.e. the largest value first.
We used the tab character as the value of the end parameter to the print function.
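
For short key functions like part2, a lambda expression is a common alternative to a named function. The sketch below, with a small made-up list, also shows how a tuple can be used as the key to sort by descending frequency and, for equal frequencies, by letter:

```python
pairs = [('e', 8), ('a', 8), ('n', 13), ('b', 3)]

# Same idea as key=part2, but written as a lambda expression.
# Elements with equal frequency keep their original relative order.
by_freq = sorted(pairs, key=lambda e: e[1], reverse=True)

# The key is a tuple: first the negated frequency (largest first),
# then the letter (alphabetical order among equal frequencies)
by_freq_then_letter = sorted(pairs, key=lambda e: (-e[1], e[0]))

print(by_freq)               # [('n', 13), ('e', 8), ('a', 8), ('b', 3)]
print(by_freq_then_letter)   # [('n', 13), ('a', 8), ('e', 8), ('b', 3)]
```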

Exercises

Set l = ['alpha', 'bravo', 'charlie', 'delta', 'echo', 'foxtrot']
What does l contain after the call l.sort(key=part2)?
Answer
The list sorted by the second character of each word:
```
  ['echo', 'delta', 'charlie', 'alpha', 'foxtrot', 'bravo']
```
Another method of sorting by frequency is to first swap the first and second elements of each tuple in lista, store the result in a new list, and then sort the new list. Write code that does this!
Answer
It can be done elegantly using a list comprehension:
```
swapped = sorted([(x[1], x[0]) for x in lista], reverse=True)
```
It can also be done using a for-loop:
```
swapped = []
for x in lista:
    swapped.append((x[1], x[0]))
swapped.sort(reverse=True)
```

Word analysis

Thus far we've only operated on single letters. In this section we will describe how similar analyses can be done on entire words in a text.

Finding words in a text - regular expressions

The first problem is to find all words in a string. In the mini-lesson on strings, the method split is described. The example given in the mini-lesson:

'Take it easy'.split(' ')

returns a list containing the words of the string: ['Take', 'it', 'easy'].

If we use that method on the Stagnelius text from the start of the lesson, i.e. print(text.split(' ')), we get the following output:

['Kvällens', 'gullmoln', 'fästet', 'kransa.\nÄlvorna', 'på', 'ängen',
'dansa,\noch', 'den', 'bladbekrönta', 'näcken\ngigan', 'rör', 'i',
'silverbäcken.\n']

From the output, we can see that the splitting only occurs at spaces and not at any other characters. Thus we only get 13 words instead of the 16 anticipated.
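
Calling split without an argument splits on any whitespace, including EOL characters, but the punctuation marks still remain attached to the words. A small sketch with a short made-up excerpt:

```python
sample = 'fästet kransa.\nÄlvorna på'

on_spaces = sample.split(' ')    # splits only at spaces
on_whitespace = sample.split()   # splits at any whitespace, including '\n'

print(on_spaces)       # ['fästet', 'kransa.\nÄlvorna', 'på']
print(on_whitespace)   # ['fästet', 'kransa.', 'Älvorna', 'på']
```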

We need a way of specifying what a word is. Here we define a word as a sequence consisting only of letters. (If we, for example, are analyzing Python programs we would probably also include underscores and, after the first character, also digits.)

There is a very powerful tool called regular expressions that can help us with this. It can be used to define patterns that can be searched for in a text.

Regular expressions are not limited to Python but exist in many other programming languages and editors. In this course we will only show some simple examples, but we advise you to read up further on how you can use regular expressions.

To extract the words from the sentence, use the following code:

import re

wordlist = re.findall(r'[a-zA-ZåäöÅÄÖ]+', text)
print(wordlist)

which gives the following output:

['Kvällens', 'gullmoln', 'fästet', 'kransa',
'Älvorna', 'på', 'ängen', 'dansa',
'och', 'den', 'bladbekrönta', 'näcken',
'gigan', 'rör', 'i', 'silverbäcken']

Comments:

The regular expression functions are found in the module re.
The first argument to re.findall is a "raw string", i.e. it starts with r' (or r") instead of only ' (or "). In a raw string, Python does not interpret backslashes as escape characters, so they are passed unchanged to the regular expression, where they may have a meaning of their own.
The expression [a-zA-ZåäöÅÄÖ] matches any single character between a and z or between A and Z, as well as the characters å, ä, ö, Å, Ä and Ö.
The character + after the [] group signifies that the group may be repeated one or more times.
That is, the regular expression matches a sequence of one or more letters.
The function findall returns a list of all sequences that match the pattern.
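
The pattern above only knows about the letters listed inside the brackets. The sketch below compares it with two other patterns on made-up sample strings. The pattern [^\W\d_]+ is one common way of matching (roughly) any Unicode letters. The last pattern is a simple assumption of what a word could look like when analyzing Python code: a letter or underscore followed by letters, digits or underscores.

```python
import re

sample = 'Älvorna på ängen, naïve café 42'

# Only the letters listed in the brackets: words with ï and é are cut off
listed = re.findall(r'[a-zA-ZåäöÅÄÖ]+', sample)

# Word characters (\w) that are neither digits (\d) nor underscores
any_letters = re.findall(r'[^\W\d_]+', sample)

# Names in a line of Python code
names = re.findall(r'[A-Za-z_][A-Za-z0-9_]*', 'freq[c] += 1; x2 = _tmp')

print(listed)        # ['Älvorna', 'på', 'ängen', 'na', 've', 'caf']
print(any_letters)   # ['Älvorna', 'på', 'ängen', 'naïve', 'café']
print(names)         # ['freq', 'c', 'x2', '_tmp']
```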

Mandatory assignments

Write a script that analyses a text file and presents some statistics. The script should print the total number of words and the number of unique words. The script should also print a list of the n most common words. The value of n should be given by the user using the input function. The mini-lesson on reading and writing a file describes how a file can be read into Python.
Write a script that reads a Python file and produces a reference list for the variables and functions used in the program. The output should first contain a listing of the source with line numbers and then the reference list. The reference list specifies on which lines each word has appeared.

Python keywords such as for, if, def, etc., should not be in the list, nor should any words in the comments.

Example: If the script that counts the frequency of characters given above is used as input, the following output is expected:
```
 1   
 2   freq = {}                # Create an empty dictionary
 3   for c in text:           # Iterate over the characters in the text
 4       if c.isalpha():      # We count only letters
 5           c = c.lower()    # We consider 'a' and 'A' to be the same letter
 6           if c in freq:    # If c already is in the dictionary
 7               freq[c] += 1 # Add 1 to the value stored in freq[c]
 8           else:            # or, if c is not in freq
 9               freq[c] = 1  # Add c to freq and set its value to 1
10   print(freq)



  Reference list:
    c               [3, 4, 5, 5, 6, 7, 9]
    freq            [2, 6, 7, 9, 10]
    isalpha         [4]
    lower           [5]
    print           [10]
    text            [3]	
```
Tip: If the variable line contains a row of Python code, the following statement will remove any comment at the end of the row:

line = re.sub(r'#.*$', '', line)

The function re.sub(p, q, s) returns a new string where all occurrences of the pattern p in the string s are substituted by the string q.

The pattern r'#.*$' is interpreted as follows:
- # matches the hash character.
- The dot . matches any character that is not an EOL character.
- * is a "meta character". It specifies that the preceding element (here the dot) may be repeated zero or more times.
- $ is a "meta character". It matches at the end of the string, or just before an EOL character that ends the string.
Thus, the expression re.sub(r'#.*$', '', line) returns a string that is identical to line except that everything from the first hash character to the end of the row is removed.

(Note: This is not completely foolproof since the hash character could be escaped or reside inside a string. Thus, the script will not work on itself...)
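
The following sketch applies the substitution to three made-up lines of code. The last line contains a hash character inside a string and shows the limitation mentioned in the note:

```python
import re

lines = [
    "freq = {}   # Create an empty dictionary\n",
    "print(freq)\n",
    "tag = '#1'  # a hash inside a string\n",
]

# Remove everything from the first hash character to the end of each row
stripped = [re.sub(r'#.*$', '', line) for line in lines]

for line in stripped:
    print(repr(line))
```

The EOL character at the end of each row is kept. In the last row, everything after the first quotation mark is removed since the pattern cannot tell that the hash character is part of a string.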

Question

How many hours have you spent working on this lesson?
