Lab Exercise 2: Delving into grep and split
The goal of this week's project is to start converting data into knowledge. We'll answer questions like, how many sunny days were there in June and July at Great Pond? To do that, we have to start looking at the values of the data on each day and have the computer make decisions.
The purpose of this lab time is to give you some practice with the Unix tool grep as well as to examine how we can split a string into pieces using the Python string function split. These two capabilities will be necessary for the project.
Mount the Personal fileserver. Command-K in the Finder is the shortcut to the fileserver connection window.
Once it is mounted, create a folder called project2 in your directory. Then open Terminal and change your working directory to the project2 folder. You change directories by typing cd and then the path to the directory. You can either type the path to your project2 directory in Terminal, or type cd and a space and then drag the project2 folder from the Finder into Terminal to fill in the path.
Grep is a very useful tool for searching for patterns in data. So far, we have used it to search for a specific string. However, what if we want to search for more than one string, or strings that follow particular rules? In general, grep lets us search for basic regular expressions or extended regular expressions using the -E flag. To learn more about the format of basic regular expressions, type man re_format in the Terminal. To learn more about grep, type man grep in the Terminal.
Regular expressions are a powerful method of describing patterns. We're going to look at two specific capabilities. Consider the following file.
date,month,days
1/13/2014,cold,31
2/10/2014,colder,28
3/10/2014,cold,31
4/14/2014,wet,30
5/12/2014,muddy,31
6/9/2014,wet,30
7/14/2014,hot,31
8/11/2014,hotter,31
9/8/2014,cool,30
10/13/2014,cooler,31
11/10/2014,chilly,30
12/8/2014,cold,31
1/13/2015,colder,31
2/10/2015,colder,28
3/10/2105,cold,31
4/14/2015,chilly,30
5/12/2015,wet,31
6/9/2015,warm,30
7/14/2015,hot,31
8/11/2015,warm,31
9/8/2015,warm,30
10/13/2015,cool,31
11/10/2015,cool,30
12/8/2015,cold,31
Download the dates.csv file so you can test out the grep command. (Right click on the link and use Save Link As... or Download Linked File As...)
What if we wanted to find the lines corresponding to August and September of 2014? The pattern we want is something like '#/#/2014', where the first number is either an 8 or a 9, and the second number is any one or two digit value. A regular expression allows us to specify single characters from a set of choices by using brackets. The expression [89] means that grep can match an 8 or a 9. Try the following pattern.
grep '[89]' dates.csv
You should get every line in the file that contains an 8 or 9. If you add a slash after the [89], then it will find lines that have an 8 or a 9 followed by a slash.
grep '[89]/' dates.csv
The second number consists of one or two digits, and they could be any digit. Rather than having to enumerate all of the digits, we can use [:digit:] to represent the set of all digits. Try the following pattern.
grep '[89]/[:digit:]' dates.csv
This pattern fails to find anything, because [:digit:] (which has to be interpreted as a single special character) works only if it is inside another pair of brackets, just like we put the 89 in brackets. Try the following.
grep '[89]/[[:digit:]]' dates.csv
This pattern still doesn't do what we want, as it will get dates where there is an 8 or 9 in the second field. Adding a second slash to the pattern should eliminate some of the lines we do not want. Try the following.
grep '[89]/[[:digit:]]/' dates.csv
This time, the problem is that the pattern is too strict. It grabs only the lines where there is a single digit between the slashes. However, there can be either one or two digits in the middle field. We can specify that there are one or more digits by using the special combination \+ after a symbol or bracket expression. So the expression '[[:digit:]]\+' specifies one or more digits. Therefore, we can extend our overall pattern to the following. Try it out.
grep '[89]/[[:digit:]]\+/' dates.csv
Note: this is the first attempt where you actually have to put quote marks around the expression. The reason is that this expression contains the backslash character \. The backslash character has a special meaning to the Terminal, so it does not pass the expression to grep unchanged. By putting quotes around the expression, we tell the Terminal to pass the expression to grep unchanged.
The final touch is to stick 2014 on the end. This expression should give us August and September of 2014.
grep '[89]/[[:digit:]]\+/2014' dates.csv
Verify that your output is:
8/11/2014,hotter,31
9/8/2014,cool,30
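To double-check which lines each stage of the pattern should select, the same progression can be sketched with Python's re module. This is an illustration, not part of the lab: Python's re does not support POSIX classes such as [[:digit:]], so [0-9] stands in for it here, and + is written without a backslash, as in extended regular expressions. The rows list holds a few lines copied from dates.csv, and the helper name matching is made up for this example.

```python
import re

# A few rows copied from dates.csv.
rows = [
    "7/14/2014,hot,31",
    "8/11/2014,hotter,31",
    "9/8/2014,cool,30",
    "12/8/2014,cold,31",
    "8/11/2015,warm,31",
]

def matching(pattern, lines):
    """Return the lines that contain a match for pattern, as grep would."""
    return [line for line in lines if re.search(pattern, line)]

# 8 or 9, a slash, one digit: also catches the 8 in the day field of 12/8/2014.
loose = matching(r"[89]/[0-9]", rows)

# Exactly one digit between the slashes: too strict, misses 8/11/2014.
strict = matching(r"[89]/[0-9]/", rows)

# One or more digits between the slashes, followed by the year.
final = matching(r"[89]/[0-9]+/2014", rows)

for name, result in [("loose", loose), ("strict", strict), ("final", final)]:
    print("%s: %s" % (name, result))
```

Only the last pattern keeps both the August and the September 2014 rows while rejecting everything else.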
Extended Regular Expressions
Extended regular expressions offer similar pattern matching with a moderately different syntax. The first few calls to grep above will work for either basic or extended regular expressions. For example, the same output is produced by calling grep in basic mode or in extended mode (with the -E flag):
- grep '[89]/[[:digit:]]/' dates.csv
- grep -E '[89]/[[:digit:]]/' dates.csv
Basic and extended regular expressions differ in how some of the metacharacters are written. For the date-grepping we did in lab today, the only difference between them is in the syntax that indicates how many times a pattern may repeat. For basic regular expressions, you use \+ to indicate 1 or more times. For extended regular expressions, it is simply +.
The final expression in basic and extended forms should be
- grep '[89]/[[:digit:]]\+/2014' dates.csv
- grep -E '[89]/[[:digit:]]+/2014' dates.csv
- paste
Imagine you have two data files that each have the same number of rows, but different data. You want to merge these two data files together, combining the first row from file 1 with the first row from file 2, and so on. The Unix command paste is the tool you want.
Download the file temps.csv. It contains the same number of lines as the dates.csv file, but shows the high and low temperature (F) for the contiguous United States on the corresponding dates.
Try using paste on the two files with no argument and see what it does. Note that, by default, it puts a tab in between the lines from each file. If we want it to insert a comma instead, we need to tell it to do so using the -d flag (see man paste for details).
paste -d ',' dates.csv temps.csv
One other useful thing we can do with the Unix shell is redirect output from the terminal to a file. The > symbol tells the terminal to send anything going to stdout to the specific file. For example, the following command sends the output of paste to the file blend.csv.
paste -d ',' dates.csv temps.csv > blend.csv
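For intuition about what paste -d ',' does to each pair of lines, here is a minimal Python sketch of the same idea. It is not how paste is implemented; the function name paste and its parameters are chosen for this example, and the temperature header and values are invented because the contents of temps.csv are not shown here.

```python
import io

def paste(file_a, file_b, delimiter="\t"):
    """Join corresponding lines of two open files with delimiter."""
    merged = []
    for line_a, line_b in zip(file_a, file_b):
        # Drop each line's newline before gluing the two halves together.
        merged.append(line_a.rstrip("\n") + delimiter + line_b.rstrip("\n"))
    return merged

# Stand-ins for the two files. The temperature header and values are
# invented for this example; they are not the contents of temps.csv.
dates = io.StringIO(u"date,month,days\n1/13/2014,cold,31\n")
temps = io.StringIO(u"high,low\n35.0,20.0\n")

for line in paste(dates, temps, delimiter=","):
    print(line)
```

Because each dates.csv row has three fields, the two temperature values land in fields 4 and 5 of the merged row. One difference from the real tool: zip stops at the end of the shorter file, whereas paste keeps going, which does not matter when both files have the same number of lines.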
Verify that the contents of blend.csv make sense by opening the file in TextWrangler.
Create a new file in TextWrangler. Save it as temps.py. Copy the following template into your file. The template puts the body of your code inside a function main, and then calls the main function at the end after checking to see if the file was executed and not imported.
# Your Name
# Spring 2017
# CS 152 Project 2
#
# Command to run the program
#
# grep /2014 blend.csv | cut -f 4,5 -d ',' | python temps.py
#
#

import sys

def main(stdin):
    # main code here
    pass  # placeholder so the empty template runs; remove it once you add code

if __name__ == "__main__":
    main(sys.stdin)
The last two lines will be new this week. In all of our future coding, we will encapsulate all of our code in functions. The top-level or master function is often called main (but it does not have to be). By encapsulating all code in functions, it makes it easier to import existing code files into other files to re-use the functionality. However, if we want to run a file, we want the main function in that file to execute. The if-statement in the above code differentiates between whether a file was executed on the Terminal (command-line) or imported into another Python file. If it was imported, then we do not want the main function to automatically execute. The if-statement is true only when the file is executed directly, so if it is imported, the main function does not execute.
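The check works because Python sets the variable __name__ to the string "__main__" in a file that is executed directly, and to the module's name in a file that is imported. The sketch below makes that visible; the function describe_run and the module name "temps" passed to it are made up for this illustration.

```python
def describe_run(module_name):
    """Say how a file whose __name__ has this value was started."""
    if module_name == "__main__":
        return "executed directly"
    return "imported as module '%s'" % module_name

# What Python reports for this file right now.
print(describe_run(__name__))

# What the same check would see inside temps.py after "import temps".
print(describe_run("temps"))
```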
The goal of this task is to find the average high temp and average low temp for 2014. We can simplify our task by first using grep to find all lines with the string /2014, then use cut to extract fields 4 and 5. However, that means each line still contains two numbers. Test the first two components of the command above on the blend.csv file and see if it gives you a stream of numbers in two columns.
In order to separate the two numbers in the stream, we need a way to split a string into pieces inside Python.
Start by creating the overall loop that reads a line from the stream until it receives an empty line. The following code should go inside the main function.
# assign to buf the result of calling stdin.readline()
# while buf.strip() != '':
    # Your other code will go here
    # assign to buf the result of calling stdin.readline()
We do not have to use sys.stdin.readline because sys.stdin is passed in as the argument to the main function. We do this so that another Python function could also call the main function with its own data to process.
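As a rough sketch of that idea, the loop below is wrapped in a function that receives its stream as an argument, and the caller hands it an io.StringIO object in place of sys.stdin. The function name echo_lines and the sample numbers are invented for this example. It uses print with parentheses so that it also runs under Python 3.

```python
import io

def echo_lines(stdin):
    """Read lines until an empty one arrives; return them without newlines."""
    lines = []
    buf = stdin.readline()
    while buf.strip() != '':
        lines.append(buf.strip())
        buf = stdin.readline()   # readline() returns '' at the end of the stream
    return lines

# A fake stream standing in for the output of grep and cut (invented values).
fake_stdin = io.StringIO(u"71.2,50.1\n68.4,47.9\n")
for line in echo_lines(fake_stdin):
    print(line)
```

Note that the loop also stops at the first blank line in the middle of the stream, not only at the end of the input.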
Put a print statement as the first thing in the while loop and print out buf. Then run your program. This shows you what is in the variable buf.
As the second thing in the while loop, assign to words the result of calling buf.split(','). Calling the split function of buf with a comma as an argument divides the string into pieces, splitting it on the commas. After assigning the split result to words, have your program print words on the next line. Test it and see what it prints out.
From the prior step, the variable words is what we call a list. Visually and syntactically, Python represents a list as square brackets with comma-separated elements. To access the elements of a list, we use what is called bracket notation. The first element of the list contained in words is words[0]. The second element of the list is words[1], and so on. Note that Python uses what is called zero-indexing, which means that the first element of a list has the index 0.
In your loop, after the assignment to words, assign to hitemp the result of casting words[0] to a float. Then assign to lotemp the result of casting words[1] to a float. Remove the other print statements in the loop and add a print statement that shows hitemp and lotemp. Test your code and make sure it prints out two columns of floating point numbers.
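The split, index, and cast steps can be tried on a single hard-coded line before wiring them into the loop. The sketch below assumes a line shaped like the output of the grep and cut commands; the two numbers are invented.

```python
buf = "71.2,50.1\n"          # one line as it might arrive from the stream

words = buf.split(',')       # a list of two strings; the newline stays on the last one
hitemp = float(words[0])     # index 0 is the first element
lotemp = float(words[1])     # float() ignores the trailing newline

print(words)
print("%s %s" % (hitemp, lotemp))
```

The elements of words are strings, which is why the cast to float is needed before doing any arithmetic with them.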
Now we're going to calculate the average high temperature and average low temperature. Prior to the start of your loop, initialize three variables, count, hisum, and losum, to zero. Inside the loop, increment count by 1, hisum by hitemp, and losum by lotemp. Remember, you can increment a variable by using the += notation. The following statement is the same as a = a + b:
a += b
After the loop, but still inside the main function, print out the average high temperature value (hisum/count) and the average low temperature value (losum/count).
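The accumulate-then-divide pattern is sketched below on a small hard-coded list, so the arithmetic is easy to check by hand. The readings are invented, and a for loop over a list stands in for the while loop over the stream that your program uses.

```python
# Invented (high, low) pairs standing in for the values read from the stream.
readings = [(70.0, 50.0), (80.0, 60.0), (90.0, 40.0)]

count = 0
hisum = 0
losum = 0
for hitemp, lotemp in readings:
    count += 1          # same as count = count + 1
    hisum += hitemp
    losum += lotemp

# Guard against dividing by zero when no lines were read.
if count > 0:
    print(hisum / count)
    print(losum / count)
```

Here the sums are 240.0 and 150.0 over three readings, so the averages are 80.0 and 50.0. The guard matters in practice: if grep matches no lines, count stays at zero and the division would raise an error.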
- Formatted printing in Python
Note that when you print out floating point numbers, the number of decimal places Python uses varies. Sometimes it prints out a lot, sometimes just a few. Python doesn't care about significant figures and doesn't worry about making things look nice. That's your job.
Fortunately, Python gives us an easy way to control how numbers are formatted when you print them to the Terminal or to a file. This is called formatted printing. The concept is to write out the string you want to print with placeholders for variables. The placeholders specify how the value is to be formatted.
Try the following example in your code when you print out the average high temperature.
print "Average Hi Temp: %f" % (hisum/count)
The % sign indicates that this is a placeholder for a variable. The f character indicates that the value to be printed is a floating point value. Test out your code.
Note that we still get lots of decimal places, perhaps more than are useful, when Python prints the floating point number. Fortunately, we can specify how many decimal places to use in our format string. The following tells Python to use three decimal places.
print "Average Hi Temp: %.3f" % (hisum/count)
We can also tell Python to use a minimum number of characters for the whole field by putting a number in front of the decimal point in our format string. This allows us to line up the decimal points on a column of numbers.
print "Average Hi Temp: %7.3f" % (hisum/count)
print "Average Lo Temp: %7.3f" % (losum/count)
Try out the above statements and test out what varying the two numbers does to the format of the output.
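To see exactly what each format string produces, the sketch below builds the formatted strings first and prints them between square brackets so that any padding spaces are visible. The values are invented, and print is written with parentheses so the example also runs under Python 3.

```python
value = 71.25   # invented average

plain = "%f" % value        # six decimal places by default
three = "%.3f" % value      # three decimal places
wide = "%7.3f" % value      # three decimal places, padded to 7 characters
narrow = "%7.3f" % 8.5      # a shorter number gets more padding

for text in [plain, three, wide, narrow]:
    print("[" + text + "]")
```

The last two lines of output have the same width, which is what lets the decimal points line up in a column.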
When you are done with the lab exercises, you may begin the project.