Description
Introduction
We’ve just completed our review of the most common Python datatypes, and you’ve been exposed to some simple operations, functions and methods for manipulating these datatypes. In this assignment, we’re going to develop some code that relies greatly on the string datatype as well as all sorts of iteration. First, a few general points.
- This is a challenging project, and you have been given two weeks to work on it. If you wait to begin, you will almost surely fail to complete it. The best strategy for success is to work on the project a little bit every day. To help incentivize you to do so, we will provide preliminary feedback to a partial draft you will upload by the draft due date shown at the top of this page (more on this below).
- The work you hand in should be only your own; you are not to work with or discuss your work with any other student. Sharing your code or referring to code produced by others is a violation of the student honor code and will be dealt with accordingly.
- Help is always available from the TAs or the instructor during their posted office hours. You may also post general questions on the discussion board (although you should never post your Python code). I have opened a discussion board topic specifically for HW1.
Background
In this assignment we will be processing text. With this handout, you will find a file containing the entire text of The Wind in the Willows, a children’s novel published in 1908. At some point during the course of this assignment, I will provide you additional texts for you to test your code on; updated versions of this handout may also be distributed as needed. You should think of this project as building tools to read in, manipulate, and analyze these texts.
The rest of these instructions outline the functions that you should implement, describing their input/output behaviors. As usual, you should start by completing the hawkid () function so that we may properly credit you for your work. Test hawkid () to ensure it in fact returns your own hawkid as the only element in a single element tuple. As you work on each function, test your work on the document provided to make sure your code functions as expected. Feel free to upload versions of your code as you go; we only grade the last version uploaded (although we do provide preliminary feedback on a draft version; see below), so this practice allows you to “lock in” working partial solutions prior to the deadline. Finally, some general guidance.
- You will be graded on both the correctness and the quality of your code, including the quality of your comments!
- As usual, respect the function signatures provided.
- Be careful with iteration; always choose the most appropriate form of iteration (comprehension, while, or for) as the function mandates. Poorly selected iterative forms may be graded down, even if they work!
- Finally, to incentivize getting an early start, you should upload an initial version of your homework by midnight Friday, September 22 (that’s one week from the start of the assignment). We will use the autograder to provide feedback on the first two functions, getBook ()
1
and cleanup(), only. We reserve the right to deduct points from the final homework grade for students who do not meet this preliminary milestone.
def getBook(file):
This function should open the file named file, and return the contents of the file formatted as a single string. During processing, you should (1) remove any blank lines and, (2) remove any lines consisting entirely of CAPITALIZED WORDS. To understand why this is the case, inspect the wind . txt sample file provided. Notice that the frontspiece (title, index and so on) consists of ALL CAPS, and each CHAPTER TITLE also appears on a line in ALL CAPS.
def cleanup(text):
This function should take as input a string such as might be returned by getBook () and return a new string with the following modifications to the input:
Remove possessives, i.e., “’s” at the end of a word;
Remove parenthesis, commas, colons, semicolons, hyphens and quotes (both single and double); and Replace ’!’ and ’?’ with ’.’
A condition of this function is that it should be easy to change or extend the substitutions made. In other words, a function that steps through each of these substitutions in an open-coded fashion will not get full credit; write your function so that the substitutions can be modified or extended without having to significantly alter the code. Here’s a hint: if your code for this function is more than a few lines long, you’re probably not doing it right.
def extractWords(text):
This function should take as input a string such as might be returned by cleanup() and return an ordered list of words from the input string. The words returned should all be lowercase, and should contain only characters, no punctuation.
def extractSentences(text):
This function returns a list of sentences, where each sentence consists of a string terminated by a ’.’.
def countSyllables(word):
This function takes as input a string representing a word (such as one of the words in the output from extractWords(), and returns an integer representing the number of syllables in that word. One problem is that the definition of syllable is unclear. As it turns out, syllables are amazingly difficult to define in English!
For the purpose of this assignment, we will define a syllable as follows. First, we strip any trailing ’s’ or ’e’ from the word (the final ’e’ in English is often, but not always, silent). Next, we scan the word from beginning to end, counting each transition between a consonant and a vowel, where vowels are defined as the letters ’a’, ’e’, ’i’, ’o’ and ’u’. So, for example, if the word is “creeps,” we strip the trailing ’s’ to get “creep” and count one leading vowel (the ’e’ following the ’r’), or a single syllable. Thus:
- c o u n t Sy l l a b l e s ( ’ c r e e p s ’ )
1
- c o u n t Sy l l a b l e s ( ’ d e v o t i o n ’ )
3
- c o u n t Sy l l a b l e s ( ’ c r y ’ )
1
2
The last example hints at the special status of the letter ’y’, which is considered a vowel when it follows a non-vowel, but considered a non-vowel when it follows a vowel. So, for example:
- c o u n t Sy l l a b l e s ( ’ c o y o t e ’ )
2
Here, the ’y is a non-vowel so the two ’o’s correspond to 2 transitions, or 2 syllables (don’t forget we stripped the trailing ’e’). And while that’s not really right (’coyote’ has 3 syllables, because the final ’e’ is not silent here), it does properly recognize that the ’y’ is acting as a consonant.
You will find this definition of syllable works pretty well for simple words, but fails for more complex words; English is a complex language with many orthographic bloodlines, so it may be unreasonable to expect a simple definition of syllable! Consider, for example:
- c o u n t Sy l l a b l e s ( ’ c o n s ume s ’ )
3
- c o u n t Sy l l a b l e s ( ’ s p l a s h e s ’ )
2
Here, it is tempting to treat the trailing -es as something else to strip, but that would cause ’splashes’ to have only a single syllable. Clearly, our solution fails under some conditions; but I would argue it is close enough for our intended use.
def ars(text):
Next, we turn our attention to computing a variety of readability indexes. Readability indexes have been used since the early 1900’s to determine if the language used in a book or manual is too hard for a particular audience. At that time, of course, most of the population didn’t have a high school degree, so employers and the military were concerned that their instructions or manuals might be too difficult to read. Today, these indexes are largely used to rate books by difficulty for younger readers.
The Automated Readability Score, or ARS, like all the indexes here, is based on a sample of the text (we’ll be using the text in its entirety).
h t t p : / / www . r e a d a b i l i t y f o rmu l a s . c om/ a u t oma t e d – r e a d a b i l i t y – i n d e x . p h p
The ARS is based on two computed paramters; the average number of characters per word (cpw) and the average number of words per sentence (wps). The formula is:
ARS = 4. 71 * cpw + 0. 5 * wps − 21. 43
were the weights are fixed as shown. Texts with longer words or sentences have a greater ARS; the value of the ARS is supposed to approximate the US grade level. Thus a text with an ARS of 12 corresponds roughly to high school senior reading level.
def fki(text):
The Flesch-Kincaid Index, or FKI, is also based on the average number of words per sentence (wps), but instead of characters per word (cpw) like the ARS, it uses syllables per word (spw).
h t t p : / / www . r e a d a b i l i t y f o rmu l a s . c om/ fl e s c h – g r a d e – l e v e l – r e a d a b i l i t y – f o rmu l a . p h p
The formula is:
FKI = 0. 39 * wps + 11. 8 * spw − 15. 59
As with the ARS, a greater value indicates a harder text. This is the scale used by the US military; like with the ARS, the value should approximate the intended US grade level. Of course, as the FKI was
3
developed in the 1940’s, it was intended to be calculated by people who had no trouble counting syllables without relying on an algorithm to do so.
def cli(text):
The Coleman-Liau Index, or CLI, also approximates the US grade level, but it is a more recent index, developed to take advantage of computers.
h t t p : / / www . r e a d a b i l i t y f o rmu l a s . c om/ c o l ema n – l i a u – r e a d a b i l i t y – f o rmu l a . p h p
The CLI thus uses average number of characters per 100 words (cphw) and average number of sentences per 100 words (sphw), and thus avoids the difficulties encountered with counting syllables by computer.
CLI = 0. 0588 * cphw − 0. 296 * sphw − 15. 8
Testing Your Code
I have provided a function, evalBook (), that you can use to manage the process of evaluating a book. Feel free to comment out readability indexes you haven’t yet tried to use.
I’ve also provided three texts for you to play with. The first, ’test.txt’, is a simple passage taken from the readbility formulas website listed above. The output my solution produces is:
>>> | e v a l Book ( ’ t e s t . t x t ’ ) | |
Eva l u a t i n g TEST . TXT : | ||
1 0 | . 5 9 | Au t oma t e d Re a d a b i l i t y S c o r e |
1 0 . 1 7 | F l e s c h – K i n c a i d I n d e x | |
7 | . 2 8 | Co l ema n – L i a u I n d e x |
The second, ’wind.txt’, is the complete text to The Wind in the Willows by Kenneth Grahame. My output:
- e v a l Book ( ’w i n d . t x t ’ ) Eva l u a t i n g WIND . TXT :
7 . 4 7 | Au t oma t e d Re a d a b i l i t y S c o r e |
7 . 6 3 | F l e s c h – K i n c a i d I n d e x |
7 . 2 3 | Co l ema n – L i a u I n d e x |
as befits a book intended for young adults. Finally, ’iliad.txt’, is an English translation of Homer’s Iliad.
My output:
>>> | e v a l Book ( ’ i l i a d . t x t ’ ) | |
Eva l u a t i n g I L IAD . TXT : | ||
1 2 | . 3 6 | Au t oma t e d Re a d a b i l i t y S c o r e |
1 0 . 5 0 | F l e s c h – K i n c a i d I n d e x | |
9 | . 4 6 | Co l ema n – L i a u I n d e x |
which I think, correctly, establishes the relative complexity of the language used.
4
©2012-2013 – Laurent Pointal Mémento v1.2.2 Licence Creative Commons Attribution 2
Python 3 Cheat Sheet
Official Python documentation on http://docs.python.org/py3k
integer, float, boolean, string | Base Types | ordered sequence, fast index access, repeatable values | Container Types | |||||||||||||||||||||||||||||||||||||
int | 783 | 0 | -192 | |||||||||||||||||||||||||||||||||||||
list [1,5,9] | [“x”,11,8.9] | [“word”] | [] | |||||||||||||||||||||||||||||||||||||
float | 9.23 | 0.0 | -1.7e-6 | tuple (1,5,9) | 11,”y”,7.4 | (“word”,) | () | |||||||||||||||||||||||||||||||||
bool | True | False | 10-6 | immutable | expression with just comas | |||||||||||||||||||||||||||||||||||
str | “One\nTwo” | ‘I\’m’ | str as an ordered sequence of chars | |||||||||||||||||||||||||||||||||||||
no a priori order, unique key, fast key access ; keys = base types or tuples | ||||||||||||||||||||||||||||||||||||||||
new line | ‘ escaped | dict {“key”:”value”} | {} | |||||||||||||||||||||||||||||||||||||
multiline | “””X\tY\tZ | dictionary | {1:”one”,3:”three”,2:”two”,3.14:”π”} | |||||||||||||||||||||||||||||||||||||
key/value associations | ||||||||||||||||||||||||||||||||||||||||
immutable, | 1\t2\t3″”” | set {“key1″,”key2”} | {1,9,3,0} | set() | ||||||||||||||||||||||||||||||||||||
ordered sequence of chars | tab char | |||||||||||||||||||||||||||||||||||||||
for variables, functions, | Identifiers | type(expression) | Conversions | |||||||||||||||||||||||||||||||||||||
modules, classes… names | int(“15”) can specify integer number base in 2nd parameter | |||||||||||||||||||||||||||||||||||||||
a‥zA‥Z_ followed by a‥zA‥Z_0‥9 | ||||||||||||||||||||||||||||||||||||||||
int(15.56) truncate decimal part (round(15.56) for rounded integer) | ||||||||||||||||||||||||||||||||||||||||
diacritics allowed but should be avoided | ||||||||||||||||||||||||||||||||||||||||
language keywords forbidden | float(“-11.24e8”) | |||||||||||||||||||||||||||||||||||||||
lower/UPPER case discrimination | str(78.3) and for litteral representation | repr(“Text”) | ||||||||||||||||||||||||||||||||||||||
☺ a toto x7 y_max BigOne | ||||||||||||||||||||||||||||||||||||||||
see other side for string formating allowing finer control | ||||||||||||||||||||||||||||||||||||||||
☹ | 8y | and | bool | use comparators | (with ==, !=, <, >, …), logical boolean result | |||||||||||||||||||||||||||||||||||
Variables assignment | list(“abc”) | use each element | [‘a’,’b’,’c’] | |||||||||||||||||||||||||||||||||||||
from sequence | ||||||||||||||||||||||||||||||||||||||||
x = 1.2+8+sin(0) | ||||||||||||||||||||||||||||||||||||||||
dict([(3,”three”),(1,”one”)]) | {1:’one’,3:’three’} | |||||||||||||||||||||||||||||||||||||||
value or computed expression | ||||||||||||||||||||||||||||||||||||||||
set([“one”,”two”]) | use each element | {‘one’,’two’} | ||||||||||||||||||||||||||||||||||||||
variable name (identifier) | from sequence | |||||||||||||||||||||||||||||||||||||||
y,z,r = 9.2,-7.6,“bad” | “:”.join([‘toto’,’12’,’pswd’]) | ‘toto:12:pswd’ | ||||||||||||||||||||||||||||||||||||||
variables | container with several | |||||||||||||||||||||||||||||||||||||||
joining string | sequence of strings | |||||||||||||||||||||||||||||||||||||||
names | values (here a tuple) | |||||||||||||||||||||||||||||||||||||||
“words with | spaces”.split() | [‘words’,’with’,’spaces’] | ||||||||||||||||||||||||||||||||||||||
x+=3 | increment | x-=2 | ||||||||||||||||||||||||||||||||||||||
decrement | “1,4,8,2”.split(“,”) | [‘1′,’4′,’8′,’2’] | ||||||||||||||||||||||||||||||||||||||
x=None « undefined » constant value | splitting string | |||||||||||||||||||||||||||||||||||||||
for lists, tuples, strings, … | Sequences indexing | |||||||||||||||||||||||||||||||||||||||
negative index | -6 | -5 | -4 | -3 | -2 | -1 | len(lst) | 6 | ||||||||||||||||||||||||||||||||
positive index | 0 | 1 | 2 | 3 | 4 | 5 | individual access to items via [index] | |||||||||||||||||||||||||||||||||
lst=[11, 67, “abc”, 3.14, 42, 1968] | lst[1]→67 | lst[0]→11 first one | ||||||||||||||||||||||||||||||||||||||
positive slice | 0 | 1 | 2 | 3 | 4 | 5 | 6 | lst[-2]→42 | lst[-1]→1968 last one | |||||||||||||||||||||||||||||||
negative slice | -6 | -5 | -4 | -3 | -2 | -1 | access to sub-sequences via | start slice | end slice step] | |||||||||||||||||||||||||||||||
[ | : | : |
lst[:-1]→[11,67,”abc”,3.14,42] lst[1:3]→[67,”abc”]
lst[1:-1]→[67,”abc”,3.14,42] lst[-3:-1]→[3.14,42]
lst[::2]→[11,”abc”,42] lst[:3]→[11,67,”abc”]
lst[:]→[11,67,”abc”,3.14,42,1968] lst[4:]→[42,1968]
Missing slice indication → from start / up to end.
On mutable sequences, usable to remove del lst[3:5] and to modify with assignment lst[1:4]=[‘hop’,9]
Boolean Logic | Statements Blocks | statements block executed | Conditional Statement | |||||||||||
Comparators: < > <= >= == != | parent statement: | only if a condition is true | ||||||||||||
≤ ≥ = ≠ | if logical expression: | |||||||||||||
statements block 1… | ||||||||||||||
a and b logical and | ! | |||||||||||||
indentation | statements block | |||||||||||||
both simultaneously | ||||||||||||||
not a | statements block 2… | |||||||||||||
logical not | example : | |||||||||||||
a or b | logical or | parent statement: | can go with several elif, elif… and only one final else, | |||||||||||
True | one or other or both | |||||||||||||
true constant value | if x==42: | |||||||||||||
False false constant value | next statement after | block 1 | # block if logical expression x==42 is true | |||||||||||
print(“real truth”) | ||||||||||||||
Maths | elif x>0: | |||||||||||||
☝ floating point numbers… approximated values! | angles in radians | # else block if logical expression x>0 is true | ||||||||||||
Operators: + – * / // % **
× ÷ | ab | |||
integer | ÷ | ÷ | remainder |
(1+5.3)*2→12.6 abs(-3.2)→3.2
round(3.57,1)→3.6
from math import sin,pi…
sin(pi/4)→0.707… cos(2*pi/3)→-0.4999… acos(0.5)→1.0471…
sqrt(81)→9.0 log(e**2)→2.0
print(“be positive”)
elif bFinished:
# else block if boolean variable bFinished is true
print(“how, finished”)
else:
# else block for other cases
print(“when it’s not”)
statements block executed as long Conditional loop statement | statements block executed for each | Iterative loop statement | ||||||||||||||||
as condition is true | item of a container or iterator | |||||||||||||||||
while logical expression: | for | variable | in | : | ||||||||||||||
statements block | Loop control | sequence | ||||||||||||||||
s = 0 | break | statements block | ||||||||||||||||
i = 1 | initializations before the loop | immediate exit | Go over sequence’s values | |||||||||||||||
condition with at least one variable value (here i) | continue | s = “Some text” | initializations before the loop | |||||||||||||||
while i <= 100: | next iteration | cnt = 0 | ||||||||||||||||
loop variable, value managed by for statement | ||||||||||||||||||
# statement executed as long as i ≤ 100 | i=100 | |||||||||||||||||
for c in s: | ||||||||||||||||||
s = s + i**2 | s= ∑ | i | 2 | if c == “e”: | Count number of | |||||||||||||
i = i + 1 ☝ make condition variable change | cnt = cnt + 1 | e in the string | ||||||||||||||||
i=1 | ||||||||||||||||||
print(“sum:”,s) computed result after the loop | print(“found”,cnt,“‘e'”) | |||||||||||||||||
loop on dict/set = loop on sequence of keys | ||||||||||||||||||
☝ be careful of inifinite loops ! | ||||||||||||||||||
use slices to go over a subset of the sequence | ||||||||||||||||||
Display / Input | Go over sequence’s index | |||||||||||||||||
modify item at index | ||||||||||||||||||
print(“v=”,3,“cm :”,x,“,”,y+4) | access items around index (before/after) | |||||||||||||||||
lst = [11,18,9,12,23,4,17] | ||||||||||||||||||
lost = [] | ||||||||||||||||||
items to display: litteral values, variables, expressions | ||||||||||||||||||
for idx in range(len(lst)): | ||||||||||||||||||
print options: | ||||||||||||||||||
val = lst[idx] | Limit values greater | |||||||||||||||||
sep=” “ (items separator, default space) | ||||||||||||||||||
if val > 15: | than 15, memorization | |||||||||||||||||
end=“\n” (end of print, default new line) | lost.append(val) | of lost values. | ||||||||||||||||
file=f (print to file, default standard output) | lst[idx] = 15 | |||||||||||||||||
s = input(“Instructions:”) | print(“modif:”,lst,“-lost:”,lost) |
☝ input always returns a string, convert it to required type (cf boxed Conversions on on ther side).
Go simultaneously over sequence’s index and values:
for idx,val in enumerate(lst):
len(c)→ items count | Operations on containers | frequently used in | Generator of int sequences | ||||||||||||||||
min(c) max(c) sum(c) | Note: For dictionaries and set, these | for iterative loops | default 0 | not included | |||||||||||||||
sorted(c) → sorted copy | operations use keys. | range([start,]stop [,step]) | |||||||||||||||||
val in c → boolean, membersihp operator in (absence not in) | range(5) | 0 1 2 3 4 | |||||||||||||||||
enumerate(c)→ iterator on (index,value) | range(3,8) | 3 4 5 6 7 | |||||||||||||||||
Special for sequence containeurs (lists, tuples, strings) : | |||||||||||||||||||
range(2,12,3) | 2 5 8 11 | ||||||||||||||||||
reversed(c)→ reverse iterator | c*5 → duplicate c+c2 → concatenate | ||||||||||||||||||
c.index(val)→ position | c.count(val)→ events count | range returns a « generator », converts it to list to see | |||||||||||||||||
☝ modify original list | Operations on lists | the values, example: | |||||||||||||||||
print(list(range(4))) | |||||||||||||||||||
lst.append(item) | add item at end | ||||||||||||||||||
lst.extend(seq) | add sequence of items at end | function name (identifier) | Function definition | ||||||||||||||||
lst.insert(idx,val) | insert item at index | ||||||||||||||||||
named parameters | |||||||||||||||||||
lst.remove(val) | remove first item with value | ||||||||||||||||||
lst.pop(idx) | remove item at index and return its value | def fctname(p_x,p_y,p_z): | |||||||||||||||||
lst.sort() | lst.reverse() | sort / reverse list in place | “””documentation””” | ||||||||||||||||
Operations on dictionaries | Operations on sets | # statements block, res computation, etc. | |||||||||||||||||
return res | result value of the call. | ||||||||||||||||||
d[key]=value | d.clear() | Operators: | |||||||||||||||||
d[key]→value | del d[clé] | | → union (vertical bar char) | ☝ parameters and all of this bloc | if no computed result to | |||||||||||||||
d.update(d2) update/add | & → intersection | only exist in the block and during | return: return None | ||||||||||||||||
– ^ → difference/symetric diff | |||||||||||||||||||
d.keys() | associations | < <= > >= → inclusion relations | the function call (“black box”) | Function call | |||||||||||||||
d.values() views on keys, values | s.update(s2) | r = fctname(3,i+2,2*i) | |||||||||||||||||
d.items() | associations | s.add(key) s.remove(key) | |||||||||||||||||
d.pop(clé) | s.discard(key) | one argument per parameter | |||||||||||||||||
storing data on disk, and reading it back | Files | retrieve returned result (if necessary) | |||||||||||||||||
Strings formating | |||||||||||||||||||
f = open(“fil.txt”,“w”,encoding=“utf8”) | formating directives | ||||||||||||||||||
values to format |
file variable | name of file | opening mode | encoding of | |||
for operations | on disk | ‘r’ read | chars for text | |||
(+path…) | ‘w’ write | files: | ||||
‘a’ append… | utf8 | ascii | ||||
cf functions in modules os and | ||||||
os.path | latin1 | … | ||||
writing | empty string if end of file | reading |
f.write(“hello”) | s = f.read(4)if char count not | |
☝ text file → read /write only | read next | specified, read |
strings, convert from/to required | line | whole file |
type. | s = f.readline() | |
f.close() ☝ don’t forget to close file after use |
Pythonic automatic close : with open(…) as f:
very common: iterative loop reading lines of a text file
for line in f :
# line processing block
“model {} {} {}”.format(x,y,r) | str | |||||
“{selection:formating!conversion}” | ||||||
Selection : | Examples | “{:+2.3f}”.format(45.7273) | ||||
2 | →‘+45.727’ | |||||
x | “{1:>10s}”.format(8,“toto”) | |||||
0.nom | ||||||
→‘ | toto’ | |||||
4[key] | ||||||
0[2] | “{!r}”.format(“I’m”) | |||||
Formating : | →‘”I\’m”‘ | |||||
fillchar alignment sign minwidth.precision~maxwidth type
< > ^ = + – space 0 at start for filling with 0 integer: b binary, c char, d decimal (default), o octal, x or X hexa… float: e or E exponential, f or F fixed point, g or G appropriate (default),
% percent
string : s …
Conversion : s (readable text) or r (litteral representation)