IT 117: Intermediate Scripting

Class 12

Microphone

Questions

Are there any questions before I begin?

Homework 6

I have posted homework 6 here.

It is due this coming Sunday at 11:59 PM.

Quiz 4

Let's look at the answers to Quiz 4

Midterm

The Midterm exam for this course will be held on Tuesday, October 21st.

The exam will be given in this room.

It will consist of questions like those on the quizzes along with questions asking you to write short segments of Python code.

60% of the points on this exam will consist of questions from the Weekly Graded Quizzes.

There is a link to the answers to the graded quizzes on the class web page.

There will be 15 of these questions worth 4 points each.

The other 40% of points will come from four questions that ask you to write a short segment of code.

Each of the code questions is worth 10 points.

To study for the code questions you should know

Dictionaries
Sets
How to use the os and sys modules
How to write a regular expression

A good way to study for the code questions is to review the Class Exercises and homework solutions.

The last class before the exam, Thursday, October 16th, will be a review session.

You will only be responsible for the material in the Class Notes for that class on the exam.

You will find the Midterm review Class Notes here.

If for some reason you cannot take the exam on the date mentioned above you must contact me to make alternate arrangements.

The Midterm is given on paper.

I scan each exam paper and upload the scans to Gradescope.

I score the exam on Gradescope.

You will get an email from Gradescope with your score when I am done.

The Midterm is a closed book exam.

You are not allowed to use any resource, other than what is in your head, while taking the exam.

Cheating on the exam will result in a score of 0 and will be reported to the Administration.

Remember your Oath of Honesty.

To prevent cheating, certain rules will be enforced during the exam.

Tips and Examples

sys.argv

sys.argv is a variable inside the sys module
It contains a list of all the things on the command line ...
including the name or pathname that runs the script ...
and any command line arguments

If I run the following script

import sys

print("sys.argv:", sys.argv)

With this command line
```
$ ./argv.py foo bar bletch
```

I get

sys.argv: ['./argv.py', 'foo', 'bar', 'bletch']

The first entry in the list is the first thing on the command line
It is always the pathname used to run the script
The second entry in the list is the first argument to the script
And so on

You can see this more clearly in the following script

import sys

for index in range(len(sys.argv)):
    print(str(index) + ":", sys.argv[index])

When I call this script as follows
```
$ ./args.py arg1 arg2 arg3
```
I get
```
0: ./args.py
1: arg1
2: arg2
3: arg3
```
sys.argv allows a Python script to use command line arguments

Why Check Command Line Arguments?

Let's say I create the following script to count the number of files in a directory

import os
import sys

dir = sys.argv[1]
file_count = 0
for entry in os.listdir(dir):
    # listdir returns bare names, so join each one to dir
    if os.path.isfile(os.path.join(dir, entry)):
        file_count += 1
print("There are ", file_count, "files in", dir)

I can run this script on my current directory
```
$ ./count_files_1.py .
```
And get a result
```
There are  3 files in .
```
But if I forget to supply an argument to the script
```
$ ./count_files_1.py
```

I get an error

Traceback (most recent call last):
  File "./count_files_1.py", line 7, in <module>
    dir = sys.argv[1]
IndexError: list index out of range

A script must always check that it gets the arguments it needs
We do this by looking at the length of sys.argv
sys.argv will always have a length of at least one ...
because it always contains the pathname of the script
How many entries should there be in sys.argv?
The number of required command line arguments plus 1
The 1 accounts for the pathname of the script
If the script does not get the right number it should print a usage message ...
and call sys.exit() to quit

Telling the User What Went Wrong

Checking for the right number of arguments is good ...
but it is not enough
You need to tell the user why the program quit
But we can do better than that
We can tell the user what arguments the script needs to do its job
We do this with a usage message
Here is a revised version of the count_files script
This one checks for the right number of command line arguments ...

and prints a usage message if it doesn't get them

import os
import sys

if len(sys.argv) < 2:
    print("Usage:", sys.argv[0], "DIRECTORY")
    sys.exit()

dir = sys.argv[1]
file_count = 0
for entry in os.listdir(dir):
    # listdir returns bare names, so join each one to dir
    if os.path.isfile(os.path.join(dir, entry)):
        file_count += 1
print("There are ", file_count, "files in", dir)

When we miss an argument the result is much better than before
```
$ ./count_files_2.py 
Usage: ./count_files_2.py DIRECTORY
```

Improving the Usage Message

The script is much better with a usage message
But there is still room for improvement
What if I ran the script from several directories above?

In that case I would get this

$ example_code_it117/os_sys_example_code/count_files_2.py 
Usage: example_code_it117/os_sys_example_code/count_files_2.py DIRECTORY

The message gives me the full pathname I used to run the script
This makes it hard to see the name of the script
We can improve the usage message with the basename function ...
contained in the os.path module
It returns the last part of the pathname
The filename after the last /
It works the same way as the Unix basename command
```
$ basename /courses/it117/f21/ghoffman
ghoffman
```

The new version of the script has the following code

if len(sys.argv) < 2:
    print("Usage:", os.path.basename(sys.argv[0]), "DIRECTORY")
    sys.exit()

This script gives a much better usage message

$ example_code_it117/os_sys_example_code/count_files_3.py 
Usage: count_files_3.py DIRECTORY

Experimenting with Regular Expressions

There are a number of web sites that let you enter a regular expression ...
and test it against a string
One example is Regular Expression Tester
Another is regexr.com

Problems Running hw6.py on pe15.cs.umb.edu

If you get an error including "bad interpreter" you have a hashbang problem
Windows uses different characters to indicate the end of the line than Linux
This can be fixed by running dos2unix on hw6.py
See Hashbang Problem with Windows Text Files

Backslash, \, in Regular Expressions

The backslash character, \, has two uses in regular expressions
- As a metacharacter
- As the first character in defined character classes
But it is also used as the first character of an escape sequence
Escape sequences are used to include certain characters in string literals
Characters like tab, \t and newline, \n
So there is a potential conflict between the two uses of \
- In escape sequences
- In regular expressions
But it was not a problem I encountered
when first using regular expressions in Python
And I wrote Python statements like the one below without any problems
```
>>> import re
>>> pattern = re.compile("\d")
>>> 
```
But that changed with Python 3.12

Now the interpreter gives a warning when it encounters that statement

>>> import re
>>> pattern = re.compile("\d")
<stdin>:1: SyntaxWarning: invalid escape sequence '\d'

Now I could ignore the warning ...
but it makes the output of the script ugly
There is a simple fix for this problem
It involves something Python calls raw strings
If I write the string literal "\t" ...
the interpreter changes it to the Unicode tab character
But if I put an r before the quotes, like this r"\t" ...
the interpreter will not read it as an escape sequence

So let's use a raw string in the statement above

>>> import re
>>> pattern = re.compile(r"\d")
>>>
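
Here is a short sketch of the difference between a normal string and a raw string
The variable names are only for illustration
It compares the length of "\t" with the length of r"\t" ...
and shows that r"\d" is the same string as "\\d"

```python
import re

# A normal string literal turns \t into one tab character
normal = "\t"
# A raw string literal keeps the backslash and the t as two characters
raw = r"\t"

print(len(normal))   # 1
print(len(raw))      # 2

# For a sequence like \d the raw string is the same as doubling the backslash
print(r"\d" == "\\d")   # True

# The raw string reaches the re module with its backslash intact
pattern = re.compile(r"\d")
print(pattern.search("abc 7 def").group())   # 7
```

Using raw strings for every regular expression avoids having to remember which sequences Python treats as escapes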

Review

The Characters in Regular Expressions

Regular expressions are a language for describing a pattern
You use this language to specify a pattern for a text search
This pattern is compared against a string
If some characters in the string fit the pattern, we have a match
A regular expression pattern is a string composed of
- Ordinary characters
- Meta-characters
- Character classes

Ordinary Characters in Regular Expressions

Ordinary characters are characters which are not meta-characters
An ordinary character will match itself
So the regular expression "a" will match a string like "abc" ...
and "bcd" will match "abcde" ...
and so on
Regular expressions are case sensitive
Upper case characters only match upper case characters
And the same for lower case
Digits are ordinary characters
So the regular expression "5" matches the string "256"

Using Regular Expressions to Find a Match

There is more than one way to use regular expressions ...
but the simplest way is to use them much like grep in Unix
You run grep with two arguments
- A string you are trying to match
- A list of files to search for matches
The search function in the re module can do something similar

If the string matches the regular expression, a Match object is created

>>> match = re.search("man", "A man, a plan, a canal. Panama")
>>> print(match)
<_sre.SRE_Match object; span=(2, 5), match='man'>
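
What happens when there is no match?
The sketch below, with variable names chosen only for illustration, shows that search returns None in that case
This is why a Match object can be tested in an if statement

```python
import re

line = "A man, a plan, a canal. Panama"

found = re.search("man", line)
missing = re.search("xyz", line)

print(found.span())    # (2, 5)
print(found.group())   # man
print(missing)         # None

# None counts as False, so the result can be tested directly
if missing is None:
    print("No match for xyz")
```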

A Test Function for Regular Expressions

To experiment with regular expressions we need a test function
This function will take two parameters
- A pattern string
- A line to be matched
The pattern string will be turned into a Pattern object
We will then use the search method on the pattern object to find a match

Here is the code

import re

def regex_test(regular_expression, line):
    pattern_object = re.compile(regular_expression)
    match_object = pattern_object.search(line)
    if match_object:
        print("Regular expression:", regular_expression)
        print("Matches:", line)
    else:
        print("Regular expression:", regular_expression)
        print("Does NOT match", line)

Here it is in operation

>>> regex_test("man", "A man, a plan, a canal, Panama")
Regular expression: man
Matches: A man, a plan, a canal, Panama
>>> regex_test("xxx", "A man, a plan, a canal, Panama")
Regular expression: xxx
Does NOT match A man, a plan, a canal, Panama

Meta-characters in Regular Expressions

Meta-characters are characters with special meaning inside a regular expression

The meta-characters are

    .  ^  $  *  +  ?  {  }  [  ]  \  |  (  )

Every character that is not a meta-character is an ordinary character

The . Meta-character

. matches any single character ...
except newline
It works the same way as the ? meta-character on the Unix command line

Here is an example

>>> regex_test("th.n", "And then I went home")
Regular expression: th.n
Matches: And then I went home
>>> regex_test("th.n", "I am better than you")
Regular expression: th.n
Matches: I am better than you
>>> regex_test("th.n", "I wish I were thinner")
Regular expression: th.n
Matches: I wish I were thinner

. only matches a single character

So you must use one . for every character you are trying to match

>>> regex_test("t..n", "And then I went home")
Regular expression: t..n
Matches: And then I went home
>>> regex_test("t..n", "Is there a taint of scandal?")
Regular expression: t..n
Matches: Is there a taint of scandal?

The * Meta-character

* matches zero or more occurrences of the previous character
* in regular expressions is similar to * in Unix
But there is an important difference
In Unix, * matches 0 or more occurrences of any character
In regular expressions, * matches 0 or more occurrences ...
of the character that comes before it

It will match many occurrences of the character that comes before it

>>> regex_test("t*n", "1234 tttttn abcd")
Regular expression: t*n
Matches: 1234 tttttn abcd

But it will also match no occurrences of the character that comes before it

>>> regex_test("t*n", "1234 n abcd")
Regular expression: t*n
Matches: 1234 n abcd

You can get the same effect as * in Unix ...

if you use .* in regular expressions

>>> regex_test("t.*n", "abcd tan efg")
Regular expression: t.*n
Matches: abcd tan efg
>>> regex_test("t.*n", "xx the zzn")
Regular expression: t.*n
Matches: xx the zzn

So * means one thing in Unix ...
and another thing in regular expressions
This is one of the reasons it takes time to get used to regular expressions

The + Meta-character

The + meta-character is like *
Because it is used to indicate repetition of the previous character
* means zero or more occurrences

But + means one or more occurrences

>>> regex_test("ab+c", "xxx  abccccc  yyy")
Regular expression: ab+c
Matches: xxx  abccccc  yyy
>>> regex_test("ab+c", "xxx abbbbbccccc zzz")
Regular expression: ab+c
Matches: xxx abbbbbccccc zzz

It will not match zero occurrences of the character it follows

>>> regex_test("ab+c", "xxx  accccc zzz")
Regular expression: ab+c
Does NOT match xxx  accccc zzz

Unlike the * meta-character

>>> regex_test("ab*c", "xxx accccc zzz")
Regular expression: ab*c
Matches: xxx accccc zzz

The ? Meta-character

? is also a repetition meta-character
It means zero or one occurrences of the previous character

In other words, it means the previous character is optional

>>> regex_test("ab?c", "qqq abc jjj")
Regular expression: ab?c
Matches: qqq abc jjj

>>> regex_test("ab?c", "123 ac 456")
Regular expression: ab?c
Matches: 123 ac 456

>>> regex_test("ab?c", "786 abbc vvv")
Regular expression: ab?c
Does NOT match 786 abbc vvv

The \ Meta-character

The backslash, \ , is a meta-character
It turns off the special meaning of the character that immediately follows it
It performs the same function as the backslash in Unix

To search for a meta-character, put a \ in front of it

>>> regex_test("a\+b", "345 a+bcde")
Regular expression: a\+b
Matches: 345 a+bcde

If you don't turn off the meta-character you won't get a match

>>> regex_test("a+b", "906 a+bcde")
Regular expression: a+b
Does NOT match 906 a+bcde

To match more than one meta-character ...

put \ in front of each

>>> regex_test( "a\+\+\+b", "567  a+++bcde")
Regular expression: a\+\+\+b
Matches: 567  a+++bcde

The \ is also used in character classes

Character Classes

A character class is a set of characters
Character classes match a single occurrence ...
of any character within the set
There are 6 character classes built into regular expressions
Their names all have the same format
A \ in front of a single letter
If the letter following \ is lower case ...
it will match a single character in the set
But if it is upper case ...
it matches a single character not in the set

\d and \D Character Classes

\d matches a single digit

>>> regex_test("\d", "1234")
Regular expression: \d
Matches: 1234

\d can be used with a repetition meta-character ...

to match many occurrences of a digit

>>> regex_test("\d*a", "1234abc")
Regular expression: \d*a
Matches: 1234abc

\D matches any single character that is not a digit

>>> regex_test("\D", "1a234")
Regular expression: \D
Matches: 1a234

The \w and \W Character Classes

\w matches any single alphanumeric character ...
and the underscore, _

The alphanumeric characters are the letters and the digits

>>> regex_test("\w","---a------------")
Regular expression: \w
Matches: ---a------------

>>> regex_test("\w+","---1234abc------")
Regular expression: \w+
Matches: ---1234abc------

\W matches any single character that is not a letter ...
a digit or an underscore, _

>>> regex_test("\W+","###" )
Regular expression: \W+
Matches: ###

The \s and \S Character Classes

\s matches any whitespace character

>>> regex_test("a\sb", "----a b----")
Regular expression: a\sb
Matches: ----a b----

\S matches any character that is not whitespace

>>> regex_test("\S+", "abcd")
Regular expression: \S+
Matches: abcd

Attendance

New Material

Real World Regular Expressions

How might a system administrator use regular expressions?
Let's say that there is a log file with many lines
And some of the lines contain IPv4 addresses
IPv4 addresses consist of 4 numbers ...
separated by .
The numbers will consist of 1, 2 or 3 digits and look like this
```
205.236.184.72
```
Let's write a regular expression to match an IPv4 address
Long experience with regular expressions has taught me to work slowly
To build up a regular expression bit by bit ...
just as I suggest you do with homework scripts

To match the first digits use \d+

>>> regex_test("\d+", "205.236.184.72")
Regular expression: \d+
Matches: 205.236.184.72

Now we need to match a .
But . is a meta-character so we have to turn off its special meaning

We do this by putting a backslash, \ in front

>>> regex_test("\d+\.", "205.236.184.72")
Regular expression: \d+\.
Matches: 205.236.184.72

We can now repeat the above pattern to match the next group of digits

>>> regex_test("\d+\.\d+\.", "205.236.184.72")
Regular expression: \d+\.\d+\.
Matches: 205.236.184.72

Repeating this pattern, but removing the final \. we get

>>> regex_test("\d+\.\d+\.\d+\.\d+", "205.236.184.72")
Regular expression: \d+\.\d+\.\d+\.\d+
Matches: 205.236.184.72
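
The steps above can be run together in a short sketch
The list name steps is just for illustration
Calling group with no argument returns all the text that matched ...
so you can see how much of the address each pattern covers

```python
import re

address = "205.236.184.72"

# Each pattern adds one more piece to the one before it
steps = [
    r"\d+",
    r"\d+\.",
    r"\d+\.\d+\.",
    r"\d+\.\d+\.\d+\.\d+",
]

for step in steps:
    match = re.search(step, address)
    print(step, "->", match.group())
```

Checking the matched text after each step makes it easier to spot the point where a pattern goes wrong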

Getting Strings from a Match

So far we have only used regular expressions to find a match
But that is not the most powerful thing regular expressions can do
Regular expressions can be used to return parts of a matching string
We can use this feature to extract data from the lines of a text file
For example, the Apache web server creates a log file that records every connection

A line in this log looks like this

205.236.184.72 - - [09/Mar/2014:00:03:21 +0000] "GET /wzbc-2014-03-05-14-00.mp3 HTTP/1.1" 200 56810323

This line contains several useful pieces of information
A system administrator might want to collect them
You could write a Python script to collect all IP addresses
This would tell a sysadmin who is connecting to the site
The script could also grab the date and time of each connection
To extract part of a string from a match we need two things
- The ( ) meta-characters
- The group method of a match object

The ( ) Meta-characters

To extract some part of the matched text ...
we place the ( ) ...
around parts of the regular expression ...
that match the values we are trying to extract
We then use the group method of the Match object ...
to extract these values

If we wanted to extract the IP address from the following line

205.236.184.101 - - [09/Mar/2014:00:03:21 +0000] "GET /wzbc-2014-03-05-14-00.mp3 HTTP/1.1" 200 56810323

We need to do two things
- Write a regular expression that matches only the line we want
- Put ( ) meta-characters around the part we want to extract
Let's say that the following regular expression matches this line
```
\d+\.\d+\.\d+\.\d+.*GET
```
The string we want to extract from the matching line is the IP address
We can do this by putting ( ) around the part of the pattern ...
that matches the IP address
```
(\d+\.\d+\.\d+\.\d+).*GET
```

We compile this into a Pattern object

>>> pattern_object = re.compile(r"(\d+\.\d+\.\d+\.\d+).*GET")

When we use this Pattern object to attempt to match the line

>>> match_object = pattern_object.search('205.236.184.101 - - [09/Mar/2014:00:03:21 +0000] "GET /wzbc-2014-03-05-14-00.mp3 HTTP/1.1" 200 56810323')

a Match object is created

>>> match_object
<_sre.SRE_Match object; span=(0, 53), match="205.236.184.101 - - [09/Mar/2014:00:03:21 +0000]

We can now use the group method to extract this part of the matching text
To do that we have to specify which ( ) group we are talking about ...
because a regular expression can have many ( ) groups
So group needs a number to specify a particular ( )
We only have one set of ( ) in the example above
So we use 1 as the argument
```
>>> match_object.group(1)
'205.236.184.101'
```
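
Here is a sketch of a pattern with two sets of ( )
The second group is my own addition for illustration, it is not part of the pattern above
It captures the date and time between the square brackets
It assumes the line contains only one [ and one ]

```python
import re

line = ('205.236.184.101 - - [09/Mar/2014:00:03:21 +0000] '
        '"GET /wzbc-2014-03-05-14-00.mp3 HTTP/1.1" 200 56810323')

# Group 1 is the IP address, group 2 is the text between [ and ]
pattern_object = re.compile(r"(\d+\.\d+\.\d+\.\d+).*\[(.*)\].*GET")
match_object = pattern_object.search(line)

if match_object:
    print(match_object.group(1))   # 205.236.184.101
    print(match_object.group(2))   # 09/Mar/2014:00:03:21 +0000
```

The groups are numbered from left to right, starting at 1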

Extracting Text in a Loop

Regular expressions are often used to find matches in large text files
The Apache web server keeps a log of all connections to the server

The log looks like this

199.21.99.114 - - [23/Mar/2014:00:05:14 +0000] "GET /wzbc-2014-03-20-00-00.m3u HTTP/1.1" 200 102 "-" "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)"
108.121.132.248 - - [23/Mar/2014:00:07:46 +0000] "GET / HTTP/1.1" 200 76437 "http://www.bc.edu/bc_org/svp/st_org/wzbc/" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)"
108.121.132.248 - - [23/Mar/2014:00:07:47 +0000] "GET /favicon.ico HTTP/1.1" 200 1150 "-" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)"
151.203.239.216 - - [23/Mar/2014:00:03:03 +0000] "GET /wzbc-2014-03-16-13-00.mp3 HTTP/1.1" 206 20035340 "-" "NSPlayer/12.00.7601.17514 WMFSDK/12.00.7601.17514"
82.193.99.33 - - [23/Mar/2014:00:07:49 +0000] "GET / HTTP/1.1" 200 76437 "http://carsdined.org" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; XMPP Tiscali Communicator v.10.0.2; .NET CLR 2.0.50727)"
82.193.99.33 - - [23/Mar/2014:00:07:50 +0000] "GET / HTTP/1.1" 200 76437 "http://carsdined.org" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; XMPP Tiscali Communicator v.10.0.2; .NET CLR 2.0.50727)"
82.193.99.33 - - [23/Mar/2014:00:07:51 +0000] "GET / HTTP/1.1" 200 76437 "http://carsdined.org" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; XMPP Tiscali Communicator v.10.0.2; .NET CLR 2.0.50727)"
108.121.132.248 - - [23/Mar/2014:00:08:07 +0000] "GET /wzbc-2014-03-22-11-00.m3u HTTP/1.1" 200 102 "http://zbconline.com/" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)"
183.60.213.106 - - [23/Mar/2014:00:11:35 +0000] "GET /wzbc-2014-02-15-19-00.m3u HTTP/1.1" 404 301 "-" "Mozilla/5.0 (compatible; EasouSpider; +http://www.easou.com/search/spider.html)"
199.21.99.114 - - [23/Mar/2014:00:12:55 +0000] "GET /wzbc-2014-03-06-22-00.m3u HTTP/1.1" 404 305 "-" "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)"

Here is how we can get the IP address from each line
Of course we first need to create a file object
```
>>> file = open("apache_log.txt", "r")
```
Now we compile the Pattern object using the regular expression created above
```
>>> pattern_object = re.compile(r"(\d+\.\d+\.\d+\.\d+).*GET")
```
Then we loop through the file and try to match each line
If we find a match we call the group method on the Match object

And print the results

>>> for line in file:
...     match_object = pattern_object.search(line)
...     if match_object:
...        ip_address = match_object.group(1)
...        print(ip_address)
... 
199.21.99.114
108.121.132.248
108.121.132.248
151.203.239.216
82.193.99.33
82.193.99.33
82.193.99.33
108.121.132.248
183.60.213.106
199.21.99.114
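
The same loop can feed a dictionary to count the connections from each address
This is only a sketch
The function name count_ip_addresses is hypothetical ...
and sample_lines holds shortened copies of log lines like the ones above
The function takes any sequence of lines, so a file object would work too

```python
import re

def count_ip_addresses(lines):
    """Returns a dictionary of IP addresses and the number of lines for each."""
    pattern_object = re.compile(r"(\d+\.\d+\.\d+\.\d+).*GET")
    counts = {}
    for line in lines:
        match_object = pattern_object.search(line)
        if match_object:
            ip_address = match_object.group(1)
            # get returns 0 the first time an address is seen
            counts[ip_address] = counts.get(ip_address, 0) + 1
    return counts

sample_lines = [
    '199.21.99.114 - - [23/Mar/2014:00:05:14 +0000] "GET /wzbc-2014-03-20-00-00.m3u HTTP/1.1" 200 102',
    '108.121.132.248 - - [23/Mar/2014:00:07:46 +0000] "GET / HTTP/1.1" 200 76437',
    '108.121.132.248 - - [23/Mar/2014:00:07:47 +0000] "GET /favicon.ico HTTP/1.1" 200 1150',
    'this line has no address',
]

for ip_address, count in count_ip_addresses(sample_lines).items():
    print(ip_address, count)
```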

A New Regular Expression Testing Script

We need a new regular expression testing script
This one will look for a match ...
but it will also extract some text if it finds a match

We will use regex_test_with_group.py for this purpose

#! /usr/bin/python3

# tests regular expressions and returns
# the first group if it can

import re
import os
import sys

def regex_match_with_group(regular_expression, line):
    pattern_object = re.compile(regular_expression)
    match_object   = pattern_object.search(line)
    if match_object :
        try :
            return_string = match_object.group(1)
            print("regular expression:", regular_expression)
            print("matches:", line)
            print("returns:",  return_string)
        except IndexError:
            print("Match found but no substring returned")
    else:
        print("No match")

if len(sys.argv) < 3 :
    print("Usage: ", os.path.basename(sys.argv[0]), " REGULAR_EXPRESSION  STRING_TO_MATCH")
    sys.exit()

regex = sys.argv[1]
line  = sys.argv[2]

regex_match_with_group(regex, line)

This script does more than try to match a line of text
It also extracts data from the first grouping
Why did I put the call to group inside a try/except statement?
I need this in case the regular expression does not contain parentheses, ( )

Repetition in Regular Expressions

To match a certain number of digits we can use many instances of \d

$ ./regex_test_with_group.py "(\d\d\d)" 123456789 
regular expression: (\d\d\d)
matches: 123456789
returns: 123

But there is another way
We can follow \d with curly braces, { } ...

and put the number of repetitions we want inside the braces

$ ./regex_test_with_group.py  "(\d{5})"  123456789  
regular expression: (\d{5})
matches: 123456789
returns: 12345

Of course, this also works with \w

$ ./regex_test_with_group.py  "(\w{6})"  abcdefghijk  
regular expression: (\w{6})
matches: abcdefghijk
returns: abcdef

And with \s

$ ./regex_test_with_group.py  "\d+(\d\s{4}\w{2})"  "12345    abcdefghijk"  
regular expression: \d+(\d\s{4}\w{2})
matches: 12345    abcdefghijk
returns: 5    ab

The curly braces can also be used with a specific character

$ ./regex_test_with_group.py  "(b{3})"  "---bbbbbbbbb---" 
regular expression: (b{3})
matches: ---bbbbbbbbb---
returns: bbb

Specifying a Range of Repeating Characters

We can also use curly braces to specify a range of repetitions
When we do this, the curly braces contain two integers ...
separated by a comma
The first integer is the minimum number of repetitions ...

and the second is the maximum

$ ./regex_test_with_group.py  "(\d{2,5})"  "---12---------"
regular expression: (\d{2,5})
matches: ---12---------
returns: 12

$ ./regex_test_with_group.py  "(\d{2,5})"  "---12345---------"   
regular expression: (\d{2,5})
matches: ---12345---------
returns: 12345

Specifying Minimum and Maximum of Repeating Characters

We can also use { } to specify only a minimum or only a maximum number of repetitions
To specify a minimum, put a number after the opening { ...

followed by a comma and a closing }

$ ./regex_test_with_group.py  "(\d{2,})"  "---12---------"
regular expression: (\d{2,})
matches: ---12---------
returns: 12

$ ./regex_test_with_group.py  "(\d{2,})"  "---123456---------"
regular expression: (\d{2,})
matches: ---123456---------
returns: 123456

Similarly, to specify a maximum put a comma after the opening { ...

followed by a number and a closing }
When the minimum is left out it is zero ...
so search will only find the digits if they come at the start of the string

$ ./regex_test_with_group.py  "(\d{,3})"  "123456---------"
regular expression: (\d{,3})
matches: 123456---------
returns: 123

$ ./regex_test_with_group.py  "(\d{,3})"  "12---------"
regular expression: (\d{,3})
matches: 12---------
returns: 12
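
The four forms of { } can be compared side by side in a short sketch
It uses re.search directly instead of the test script
The last line shows something to watch out for
A pattern with no minimum can match zero characters ...
so search is satisfied with an empty match at the very start of the string

```python
import re

digits = "123456"

print(re.search(r"\d{3}", digits).group())     # 123
print(re.search(r"\d{2,5}", digits).group())   # 12345
print(re.search(r"\d{2,}", digits).group())    # 123456
print(re.search(r"\d{,3}", digits).group())    # 123

# {,3} allows zero digits, so the match is empty
# when the string starts with something that is not a digit
print(repr(re.search(r"\d{,3}", "---123").group()))   # ''
```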

Creating Custom Character Classes

Character classes are sets of characters
Each character class matches a single character in its set
Python provides 6 predefined character classes
- \d matches any digit
- \D matches any character not a digit
- \w matches any alphanumeric character and _
- \W matches any character not an alphanumeric or _
- \s matches any whitespace character
- \S matches any character not a whitespace
But Python lets you define your own character classes
You do this using the [ ] meta-characters

Between the brackets you put the characters you want in the class

$ ./regex_test_with_group.py  "([abc])"  bdewrosdf 
regular expression: ([abc])
matches: bdewrosdf 
returns: b

To match more than one character ...
we need to use the repetition meta-characters
- ? matches 0 or 1 instance of the character that comes before it
- * matches 0 or more instances of the character that comes before it
- + matches 1 or more instances of the character that comes before it

Like this

$ ./regex_test_with_group.py  "([abc]+)"  bcaewrosdf 
regular expression: ([abc]+)
matches: bcaewrosdf
returns: bca

Or this

$ ./regex_test_with_group.py  "([abc]{2})"  bcaewrosdf
regular expression: ([abc]{2})
matches: bcaewrosdf
returns: bc

The + or { } must appear immediately after the closing bracket

Ranges of Characters in a Character Class

What if you wanted to match all lowercase letters?
You can't use \w
\w contains uppercase characters, the underscore, _, and digits
I can create a custom character class listing all the lowercase characters like this
```
[abcdefghijklmnopqrstuvwxyz]
```
But there is a better way
I can use a range of characters inside the [ ]
You do this by placing a - between two characters ...

at the beginning and end of the range

$ ./regex_test_with_group.py    "([a-d]+)"  ---------bacdnmonpn--------
regular expression: ([a-d]+)
matches: ---------bacdnmonpn--------
returns: bacd

I can also use more than one range between the [ ]

$ ./regex_test_with_group.py    "([a-dm-p]+)"  ---------bacdnmonpn--------
regular expression: ([a-dm-p]+)
matches: ---------bacdnmonpn--------
returns: bacdnmonpn

The characters must be in the correct order

$ ./regex_test_with_group.py  "\W*([e-a]+)"  ---------bacdnmonpn--------
Traceback (most recent call last):
  File "./regex_test_with_group.py", line 17, in <module>
    pattern_object = re.compile( regex )
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/re.py", line 224, in compile
    return _compile(pattern, flags)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/re.py", line 293, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/sre_compile.py", line 536, in compile
    p = sre_parse.parse(p, flags)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/sre_parse.py", line 829, in parse
    p = _parse_sub(source, pattern, 0)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/sre_parse.py", line 437, in _parse_sub
    itemsappend(_parse(source, state))
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/sre_parse.py", line 778, in _parse
    p = _parse_sub(source, state)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/sre_parse.py", line 437, in _parse_sub
    itemsappend(_parse(source, state))
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/sre_parse.py", line 575, in _parse
    raise source.error(msg, len(this) + 1 + len(that))
sre_constants.error: bad character range e-a at position 5
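
A script can catch this error instead of showing the user a traceback
The sketch below assumes a helper function named compile_or_none, which is hypothetical
It uses re.error, the exception the re module raises for a bad pattern

```python
import re

def compile_or_none(regular_expression):
    """Returns a Pattern object, or None if the regular expression is not valid."""
    try:
        return re.compile(regular_expression)
    except re.error as error:
        print("Bad regular expression:", error)
        return None

good = compile_or_none(r"[a-e]+")
bad = compile_or_none(r"[e-a]+")

print(good is not None)   # True
print(bad is None)        # True
```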

Anchors in Regular Expressions

Two meta-characters are used to indicate special places in the string
They are called anchors
The two special places are the start of a string ...
and its end

^ means the start of a string

>>> pattern_object = re.compile(r"^(\d)")
>>> match_object = pattern_object.search("123456789")
>>> if match_object :
...     print(match_object.group(1))
... 
1

$ means the end of a string

>>> pattern_object = re.compile(r"(\d)$")
>>> match_object = pattern_object.search("123456789")
>>> if match_object :
...     print(match_object.group(1))
... 
9

The ^ Meta-Character

The ^ meta-character has two meanings
When used outside the square brackets, [ ] it is an anchor
It marks the start of the string
But it has a different meaning when it is the first character inside [ ]
In that place, it reverses the sense of the character class
It means any character not between [ and ]

So while [5-9] matches the digits between 5 and 9

$ ./regex_test_with_group.py  "([5-9]+)"  987654321
regular expression: ([5-9]+)
matches: 987654321
returns: 98765

[^5-9] matches any character not a digit between 5 and 9

$ ./regex_test_with_group.py  "([^5-9]+)"  asdfasd123456789
regular expression: ([^5-9]+)
matches: asdfasd123456789
returns: asdfasd1234
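
The two meanings of ^ can be seen together in one sketch
The string "123abc" is just an example I made up
The last pattern uses ^ both ways at once

```python
import re

text = "123abc"

# Outside [ ], ^ anchors the match to the start of the string
print(re.search(r"^\d+", text).group())       # 123
print(re.search(r"^[a-z]+", text))            # None

# As the first character inside [ ], ^ means "not these characters"
print(re.search(r"[^\d]+", text).group())     # abc

# Both meanings in one pattern: start of string, then characters not in a-z
print(re.search(r"^[^a-z]+", text).group())   # 123
```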

The | Meta-Character

The | meta-character is a logical OR

This meta-character gives you a choice when matching

$ ./regex_test_with_group.py  "(Red|Blue)"  "Red Sox"
regular expression: (Red|Blue)
matches: Red Sox
returns: Red

$ ./regex_test_with_group.py "(Sox|Ducks)"  "Red Sox"
regular expression: (Sox|Ducks)
matches: Red Sox
returns: Sox

Greedy versus Non-greedy Matching

There are two repetition meta-characters that can match many characters
- *
- +
By default, any search using these meta-characters will always be "greedy"
That means that the match will always be as long as possible
Most of the time this is what you want
But sometimes it isn't
Let's say you were trying to find the contents of the first entry in an HTML table cell
The line we would be searching would look something like this
```
<td>Class 4</td> <td>February 6th</td>
```
If we simply search for everything between <td> and </td> ...

this is what we'll get

$ ./regex_test_with_group.py  "<td>(.*)</td>" "<td>Class 4</td> <td>February 6th</td>"
regular expression: <td>(.*)</td> 
matches: <td>Class 4</td> <td>February 6th</td>
returns: Class 4</td> <td>February 6th

The match starts with the contents of the first table cell
But it keeps going
It's greedy, remember
The match text ends with the contents of the last cell
So it includes all the tags in between
We can turn off this greedy matching
To do this, put the meta-character ? ...
after the repetition meta-character *

To get just the first cell we use <td>(.*?)</td>

$ ./regex_test_with_group.py  "<td>(.*?)</td>"  "<td>Class 4</td> <td>February 6th</td>" 
regular expression: <td>(.*?)</td> 
matches: <td>Class 4</td> <td>February 6th</td>
returns: Class 4
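
Here is the same comparison as a Python sketch
It also uses findall, a function in the re module that these notes have not covered
With one group in the pattern, findall returns a list of the group text from every match

```python
import re

line = "<td>Class 4</td> <td>February 6th</td>"

greedy = re.search(r"<td>(.*)</td>", line)
non_greedy = re.search(r"<td>(.*?)</td>", line)

print(greedy.group(1))       # Class 4</td> <td>February 6th
print(non_greedy.group(1))   # Class 4

# The non-greedy pattern stops at the first </td> each time
print(re.findall(r"<td>(.*?)</td>", line))   # ['Class 4', 'February 6th']
```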

Other Methods to Find a Match

In the Python code above I created a Pattern object ...
and then used the search method on this object ...
to find a match
It turns out that a Pattern object has three methods that can do this
- match()
- search()
- fullmatch()
Each method applies different standards ...
when looking for a match
match() checks for a match only at the beginning of the string
search() checks for a match anywhere in the string
fullmatch() checks that the entire string is a match
Each requires slightly different regular expressions ...
to match the same line
search() is the most forgiving
It will skip over the start of the line ...
to find a match
So your regular expression must only match the part of the line ...
that you are really interested in
match() is fussier
Your regular expression must match the characters from the start of the string ...
up to the part of the string that matters
fullmatch() is the most stringent
The regular expression must match the entire line
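
A short sketch shows the three methods applied with the same pattern
The strings are made up for illustration
Each method returns None when its standard is not met

```python
import re

pattern_object = re.compile(r"\d+")
line = "abc 123 def"

# search looks anywhere in the string
print(pattern_object.search(line).group())       # 123
# match only looks at the start, and this line starts with letters
print(pattern_object.match(line))                # None
# fullmatch needs the whole string to fit the pattern
print(pattern_object.fullmatch(line))            # None

print(pattern_object.match("123 def").group())   # 123
print(pattern_object.fullmatch("123 def"))       # None
print(pattern_object.fullmatch("123").group())   # 123
```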

Studying

Flash Cards

Class Exercise

Class Quiz