Archive for the ‘Programming’ Category
Debugging a script that parses /proc/net/dev
An Intermittent Problem:
I wrote a Perl script for Nagios that figures out the bandwidth of an interface by parsing the TX (transmit) and RX (receive) byte counters from /proc/net/dev. The proc file system is a virtual file system that lets you view various kernel statistics as well as modify some kernel parameters. My script parses the file twice, a specified interval apart, and then subtracts the old value from the new value to return bytes per second. I realized that this wasn’t the most accurate method, but it was good enough for my purposes and I didn’t have to install SNMP. Also, the larger the interval, the smaller the error would generally be, assuming light load.
The problem was that this script would fail every so often with ‘Not numeric subtraction’. So I started saving snapshots of /proc/net/dev and noticed that the script would fail once the values had reached 4 billion and something. This I knew to be in the neighborhood of 2^32 (an unsigned 32-bit integer tops out at 2^32 - 1). To confirm my suspicion that this was the max value for this counter, I decided to have a poke around the kernel source code.
Into the Kernel:
I didn’t know where to look in the source for this, but /proc/net/dev has the string ‘Inter-|’ which I figured would be a unique enough string to give me a place to start. Sure enough, a recursive grep for this string returned only 3 lines of code. The function I wanted was dev_seq_printf_stats in net/core/dev.c:
static void dev_seq_printf_stats(struct seq_file *seq, struct net_device *dev)
{
    struct net_device_stats *stats = dev->get_stats(dev);
    seq_printf(seq, "%6s:%8lu %7lu %4lu %4lu %4lu %5lu %10lu %9lu "
               "%8lu %7lu %4lu %4lu %4lu %5lu %7lu %10lu\n",
               dev->name, stats->rx_bytes, stats->rx_packets,
               stats->rx_errors,
               ///.....
Looking at the printf specifiers, they were all %lu (unsigned long integer), which on my system did indeed have a max of 4294967295 (2^32 - 1). I wanted to be extra sure, so I traced the net_device_stats struct to include/linux/netdevice.h and confirmed that the net_device_stats->rx_bytes member was in fact an unsigned long integer. So now I knew the error happened when the counter maxed out and then reset to zero, but why a non-numeric subtraction error?
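A wrapped counter can still be turned into a sensible rate. Here is a minimal sketch in Python (not the Perl from my script) that assumes the counter is 32 bits wide and wraps at most once between the two samples; the function names are made up for this example. On a 64-bit system unsigned long is 64 bits, so the bits argument would need to change.

```python
def counter_delta(old, new, bits=32):
    """Bytes counted between two samples, allowing for one wrap of a bits-wide counter."""
    # Modular subtraction: if the counter wrapped, new < old and the
    # negative difference is folded back into the range 0 .. 2**bits - 1.
    return (new - old) % (2 ** bits)


def bytes_per_second(old, new, interval_seconds, bits=32):
    """Average rate over the sampling interval (assumes at most one wrap)."""
    return counter_delta(old, new, bits) / interval_seconds


# First sample just below the 32-bit limit, second sample just after the wrap.
print(counter_delta(4294967000, 200))        # 496
print(bytes_per_second(4294967000, 200, 5))
```

If the counter can wrap more than once per interval, this undercounts, which is another reason to keep the interval short on a busy link.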
Problem Found:
%8lu as an ANSI C standard library printf specifier sets a minimum field width of 8 characters, and right-justifies the value since there is no hyphen flag. To find out if the kernel did the same I traced seq_printf to lib/vsprintf.c and saw that the Linux kernel version formats this in the same way. So when the bytes value was less than 8 characters long, as it is right after the counter wraps around to a small number, there was leading whitespace that threw off my parser: splitting on /\s+/ then yields an empty first field instead of the byte count. All I needed was to add the extra line at line 9 to eliminate any leading whitespace:
sub parseBandwidth {
    my $interface = shift;
    my @ifconfigOutput = @_;
    foreach my $line (@ifconfigOutput) {
        if ( $line =~ /:/ ) {
            my @interfaceLine = split( /:/, $line);
            if ($interfaceLine[0] =~ /$interface/) {
                # Next line is to sanitize leading whitespace
                $interfaceLine[1] =~ s/^\s+//;
                my @interfaceStats = split( /\s+/, $interfaceLine[1] );
                print( LOG "DEBUG I have parsed out: @interfaceStats\n") if $debug;
                return @interfaceStats;
            }
        }
    }
}
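For comparison, here is a rough Python sketch of the same parse. It is not a port of my script: the function name is made up, and the sample text is a shortened, invented imitation of /proc/net/dev with fewer columns than the real file. The point is that str.split() with no arguments skips leading whitespace, so the padded and unpadded cases parse the same way.

```python
# Shortened, made-up imitation of /proc/net/dev (the real file has more columns).
SAMPLE = """\
Inter-|   Receive                        |  Transmit
 face |bytes    packets errs drop        |bytes    packets errs drop
    lo:    1234      10    0    0             1234      10    0    0
  eth0:4294967000  999999    0    0         77777     555    0    0
"""


def parse_interface_stats(text, interface):
    """Return the counters for one interface as a list of ints, or None if absent."""
    for line in text.splitlines():
        # partition() leaves sep empty on the header lines, which have no colon.
        name, sep, stats = line.partition(":")
        if sep and name.strip() == interface:
            # split() with no arguments ignores leading and repeated whitespace.
            return [int(field) for field in stats.split()]
    return None


print(parse_interface_stats(SAMPLE, "lo"))
print(parse_interface_stats(SAMPLE, "eth0"))
```

Matching the interface name exactly, rather than with a regular expression as in the Perl above, also avoids ‘eth0’ matching a line for ‘eth01’.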
A Perl API for TVRage – WebService::TVRage
The Module
This new module I have written provides an object-oriented interface to TVRage’s XML service, which allows you to get episode and other information for television shows. It is written very similarly to my previous module, WebService::UPS, and also uses XML::Simple and Mouse. You can get the module from CPAN here. You can also install it with ‘sudo cpan -i WebService::TVRage’.
Example:
use WebService::TVRage::EpisodeListRequest;
use WebService::TVRage::ShowSearchRequest;
my $searchReq = WebService::TVRage::ShowSearchRequest->new();
my $searchResults = $searchReq->search('Heroes');
my $heroFromSearch = $searchResults->getShow('Heroes');
print $heroFromSearch->getLink(), "\n";
print $heroFromSearch->getCountry(), "\n";
print $heroFromSearch->getStatus(), "\n";
my $heroes = WebService::TVRage::EpisodeListRequest->new( 'episodeID' => $heroFromSearch->getShowID() );
my $episodeList = $heroes->getEpisodeList();
print $episodeList->getNumSeasons(), "\n";
my $episode = $episodeList->getEpisode(1,3);
print $episode->getTitle(), "\n";
print $episode->getAirDate(), "\n";
foreach my $showtitle ($searchResults->getTitleList()) {
    my $show = $searchResults->getShow($showtitle);
    print $show->getLink();
}
A Line By Line Explanation of Selection Sort from Mastering Algorithms with Perl
Introduction:
O’Reilly’s Mastering Algorithms with Perl is written for programmers who are already quite familiar with Perl. I thought it might help me, and maybe others, to walk through the code for selection sort on page 120, because the code isn’t the clearest Perl. My analysis is not meant to explain selection sort, because the book does that. Rather, it is to explain the Perl code.
The Code:
#!/usr/bin/perl
use strict;
use warnings;
sub selection_sort {
    my $array = shift;
    my $i; # The starting index of a minimum-finding scan.
    my $j; # The running index of a minimum-finding scan.
    for ( $i = 0; $i < $#$array ; $i++ ) {
        my $m = $i; # The index of the minimum element.
        my $x = $array->[ $m ]; # The minimum value.
        for ( $j = $i + 1; $j < @$array; $j++ ) {
            ( $m, $x ) = ( $j, $array->[ $j ] ) # Update minimum.
                if $array->[ $j ] < $x;
        }
        # Swap if needed.
        @$array[ $m, $i ] = @$array[ $i, $m ] unless $m == $i;
    }
}
my @array = (1, 7, 2, 8, 2, 5, 20);
selection_sort(\@array);
print "@array\n";
The only change I made to the code from the example is to change lt to < on line 13, so the comparison is numeric rather than a string comparison by ASCII value.
Analysis:
Line 20 passes a reference to the @array array to the selection_sort subroutine. This way the array is changed in place and no copy of the array has to be made.
Line 5 makes the variable $array, which is lexically scoped to the selection_sort subroutine, a reference to the array declared on line 19. ‘shift’ removes the first argument to the function from the @_ array. @_ is the default argument to shift within a subroutine, so it can be omitted. The line could have been written as my $array = shift(@_); .
Line 8 uses a C-style for loop. In Perl, for and foreach are synonyms; which kind of loop you get depends on what follows the keyword. By convention, Perl programmers write for when they are writing a C-style loop. The C-style loop is used for array indexing here. The first expression, $i = 0, initializes the index variable. The second expression, $i < $#$array, is the loop condition, as in a while loop: the loop keeps running while it is true. The third expression, $i++, increments the index by one on each pass. In the second expression, $#$array means ‘the index of the last element of the array that the reference $array points to’. Since the test is less than rather than less than or equal to, and it is against the last index of the array, not the number of items in the array, the outer loop stops before the last element of the array. By then the last element is already in its final position.
Line 10 is again using the reference: $array->[ $m ] returns the element at index $m of the array that $array points to.
Line 11, like line 8, uses the C-style loop. This time the loop condition (the second expression in the loop header) is $j < @$array . This dereferences the array that $array points to and evaluates it in scalar context, which returns the number of items in the array. So this also could have been written as $j <= $#$array .
Lines 12 and 13 are like a standard if statement, but written backwards. This is known as postfix syntax, a trailing conditional, or a statement-modifying if. Note that the two lines form a single statement: there is no semicolon at the end of line 12.
Line 16 also uses a trailing conditional, this time ‘unless’. This line uses array slices to swap the items. The slice [ $m, $i ] selects two items, the item at index $m and the item at index $i . It does not select a range like the Python array[0:2] . In Perl you use the range operator .. instead of : . And again @$array dereferences the array; you will often see this written as @{$array} .
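Since I mentioned Python slices, here is a rough Python rendering of the same routine for anyone who reads Python more easily than Perl. It is my own translation, not code from the book. Tuple assignment plays the role of the Perl array-slice swap, and a list is passed by reference automatically, so no explicit reference is needed.

```python
def selection_sort(items):
    """Sort a list of numbers in place, mirroring the structure of the Perl version."""
    last = len(items) - 1                    # Same role as $#$array.
    for i in range(last):                    # Stops before the last index, like $i < $#$array.
        m = i                                # Index of the minimum found so far.
        x = items[m]                         # Value of the minimum found so far.
        for j in range(i + 1, len(items)):   # Scans to the end, like $j < @$array.
            if items[j] < x:
                m, x = j, items[j]           # Update minimum.
        if m != i:
            # Swap if needed; tuple assignment instead of an array slice.
            items[m], items[i] = items[i], items[m]
    return items


numbers = [1, 7, 2, 8, 2, 5, 20]
selection_sort(numbers)
print(numbers)   # [1, 2, 2, 5, 7, 8, 20]
```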
Conclusion
I hope this helps someone else who is reading this book. Thanks to everyone in #perl on irc.freenode.com for the help.
Track UPS Packages with Perl – WebService::UPS
The Module:
I have made an object-oriented Perl module for tracking UPS shipments. To use this module you will need to get a developer key for the UPS online tools here. This module makes an XML request to the online tools, and then parses the response using XML::Simple. The module has methods to get specific information such as recent activity. You can read the full module documentation as well as download the module at CPAN’s site here.
Example:
use WebService::UPS::TrackRequest;
my $Package = WebService::UPS::TrackRequest->new;
$Package->Username('kbrandt');
$Package->Password('topsecrent');
$Package->License('8C3D7EE8FZZZZZ4');
$Package->TrackingNumber('1ZA45Y5111111111');
print $Package->Username();
my $trackedPackage = $Package->requestTrack();
print $trackedPackage->getActivityList();
Installation:
You can install this module with cpan. On Linux the command is ‘cpan -i WebService::UPS::TrackRequest’. The required prerequisite modules are Mouse, LWP::UserAgent, HTTP::Request::Common, XML::Simple, and Data::Dumper.
Parsing the American Recovery and Reinvestment Act with Perl
Introduction:
I think of the American government as a democratic republic. The government is run by a small group of people, a republic, that is elected by the public to represent them, a democracy. Congress, and the bills they pass, should have oversight from the people. Although the bills are made available to the public, the size makes them somewhat inaccessible, and in my opinion, the media fails at providing enough detailed information on the content of these bills.
My goal was to take the 2009 stimulus bill and try to parse out some information about where the money is going in this massive bill. Although my parser is incomplete, it was still able to parse out information that I could not find by searching for it on the Recovery Website. So I consider it useful.
Parsing the Bill:
My first step was to get the bill into something I could parse. When I started this, I wasn’t aware of THOMAS, so I got the PDF from the recovery.org website and then converted it. I did this using pdftotext, and I then converted the result to ASCII to make it easier to work with the funny start and end quotes (also called left-handed and right-handed quotes):
pdftotext -enc UTF-8 -eol unix recoveryAct.pdf recoveryAct2.txt
cat recoveryAct2.txt | uni2ascii -e > recoveryAct3.txt
The parsing is done with regular expressions. After looking over the file, I decided the best way in was to parse the file line by line. I did however want to look ahead at certain points, so I read the file into memory to make this simple. This is known as slurping. The first two regular expressions capture the page number and the section of the document. These are important because all this script really does is make a sort of index, so I can find things that might be of interest.
The group of four regular expressions in the condition of the first if statement do the bulk of the work. These parse the dollar amount after variations of a frequent phrase in the bill. The phrase goes something like ‘For an additional amount for … $50,000,000’. I put the regular expressions in order, with the one I think will give the most accurate match first, since ‘or’ is a short-circuit operator. I also do this on a larger scale with the if and elsif blocks.
As an example, I will explain the first of these four regular expressions. (I am going to explain the metacharacters, not all the backtracking and how the Perl regular expression engine handles this; for that I would recommend “Pro Perl Parsing” or searching The Perl Journal.) ‘For’ simply matches the word ‘For’ in any case variation (for example for or fOr), because the /i modifier at the end makes the whole expression case insensitive. In ‘.*?’, the period means ‘any character’ and the asterisk means match that ‘any character’ zero or more times. The question mark makes the asterisk non-greedy. Since zero or more of any character would also ‘consume’ the word ‘additional’, making the asterisk non-greedy means it stops consuming characters as soon as the word ‘additional’ can match. With \`\`(.*?)\'\' I am matching the ‘funny quotes’ I mentioned previously; two backticks and two apostrophes are their ASCII representation. The parentheses around (.*?) capture what is between the funny quotes, which, as far as my parser is concerned, is what the money that follows is for. (\$[0-9,]*) captures any sort of dollar amount by looking for any combination of digits and commas after a dollar sign. Lastly, the /g makes it so this will work if the pattern occurs multiple times on the same line. I then print out the captured information with the page number.
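To see the two captures in action, here is a small sketch using Python’s re module rather than Perl. The pattern is the same one described above; the input line is made up in the style of the bill and is not a quotation from it, and parse_amounts is a name invented for this example.

```python
import re

# Same pattern as the first Perl expression; re.IGNORECASE stands in for /i
# and findall() stands in for /g.
PATTERN = re.compile(r"For.*?additional.*?``(.*?)''.*?(\$[0-9,]*)", re.IGNORECASE)

# Made-up line in the style of the bill, not a quotation from it.
line = "For an additional amount for ``Example Program'', $50,000,000, to remain available."


def parse_amounts(text):
    """Return (what it is for, amount as int) for every match in the text."""
    results = []
    for what_for, amount in PATTERN.findall(text):
        # Same job as tr/,$//d in the Perl script: drop '$' and ','.
        digits = amount.replace("$", "").replace(",", "")
        results.append((what_for, int(digits) if digits else 0))
    return results


print(PATTERN.findall(line))
print(parse_amounts(line))
```

Note that the character class [0-9,] also picks up the comma that follows the amount in the sentence. That is harmless here because all commas are stripped before the number is used.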
In the first elsif block, I made it so the script can read ahead a few lines in case the text my regular expression is looking for is interrupted by a newline. I did this by keeping track of which line I am at (which is the same as the index into the array), and then having a nested loop which reads ahead a few lines without interrupting the flow of the main parsing loop.
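The read-ahead idea can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function name, the window of 6 lines as a parameter, and the sample lines are all invented for the example. A slice takes care of the end of the file, so no explicit bounds check is needed.

```python
import re

AMOUNT = re.compile(r"\$[0-9,]+")


def amounts_ahead(lines, index, window=6):
    """Return the dollar amounts found on the `window` lines after lines[index]."""
    found = []
    # The slice simply comes back shorter near the end of the list.
    for line in lines[index + 1 : index + 1 + window]:
        for match in AMOUNT.findall(line):
            digits = match.replace("$", "").replace(",", "")
            if digits:                      # Skip a stray "$," with no digits.
                found.append(int(digits))
    return found


# Made-up lines where the amount is split from the phrase by line breaks.
lines = [
    "For an additional amount for ``Example Program'',",
    "to remain available until expended,",
    "$50,000,000, of which not less",
    "than $5,000,000 shall be used",
]
print(amounts_ahead(lines, 0))   # [50000000, 5000000]
```

The main loop keeps its own index and never moves it, which is what lets the look-ahead run without disturbing the line-by-line parse.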
I used the last two elsif blocks to pick up more basic matches and look at them, so I could see what dollar amounts were not captured and use that information to improve my parser.
You can get the output of the script here. It has lines to show the start of each section of the bill, and then lines for the amount of money, what the parser thinks it is for, and the page of the bill that the line refers to. The page is important because the script doesn’t understand that an amount might be only a portion of a larger total, and it might get confused elsewhere as well.
My Next Steps:
The next steps I would like to take are to start looking at the Lingua modules and to look into incorporating natural language processing. It also might be helpful to capture the HTML versions from THOMAS, as this would give me the sections in a form that is already parsable with HTML::TreeBuilder.
Conclusion:
I think Congress should develop, and actually start using, XML markup for its bills. This would allow people to develop proper parsers that could retrieve the information and display it in visual formats, so people could have a better handle on where the money is going. Our country now has a federal CIO, Vivek Kundra, and I think he should lead the government to provide more open and accessible information.
#!/usr/bin/perl
#===============================================================================
#
# FILE: parseBill.pl
#
# USAGE: ./parseBill.pl
# AUTHOR: Kyle Brandt (kb), www.kbrandt.com
# COMPANY: Boston, MA
# VERSION: 1.0
# CREATED: 03/19/2009 10:16:54 AM
#===============================================================================
use strict;
use warnings;
use Roman;
#Globals
my $printit = 1;
my $delim = "\t";
my $page = 1;
my $titleSection;
my $resolution = 1;
my $total = 0;
my %causeMoney;
my %notParsed;
my $romanRegex = '';
foreach my $number (1..20) {
    $romanRegex .= Roman($number);
    unless ($number == 20) {
        $romanRegex .= '|';
    }
}
my @Bill = <>;
my $index = 0;
foreach (@Bill) {
    #Get Page Number
    if (/H\. R\. 1.*?([0-9]{1,3})/) {
        #print $1, "\n";
        $page = $1;
    }
    if ( m/TITLE ($romanRegex)-[A-Z ]*/) {
        $titleSection = $&;
        print $titleSection, $delim, $page, "\n" if $printit;
    }
    #For additional is a common phrase, this gets the dollar amount after it and what it is for
    my @amounts;
    if (
        ( @amounts = /For.*?additional.*?\`\`(.*?)\'\'.*?(\$[0-9,]*)/gi) or
        ( @amounts = /For.*?additional.*?for(.*?)(\$[0-9,]*)/gi) or
        ( @amounts = /For an amount for \`\`(.*?)\'\'.*?(\$[0-9,]*)/gi) or
        ( @amounts = /For necessary expenses for(.*?)(\$[0-9,]*)/gi)
    ) {
        my $whatfor;
        my $amount;
        while (@amounts) {
            $whatfor = shift @amounts;
            $amount = shift @amounts;
            $amount =~ tr/,$//d;
            print $amount, $delim, $whatfor, $delim, $page, "\n" if $printit;
            $causeMoney{$whatfor . ':' . $page} = $amount;
            $total += $amount;
        }
    }
    #Maybe if we read ahead a few lines, we will find what we are looking for
    elsif ( @amounts = /For.*?additional.*?\`\`(.*?)\'\'/) {
        AMOUNT:
        while (@amounts) {
            my $whatfor = shift @amounts;
            if (length($whatfor) > 40) {
                next AMOUNT;
            }
            if ( $index < ($#Bill - 6 )) {
                for my $line (($index + 1) ... ($index + 6)) {
                    if (my ($amount) = $Bill[$line] =~ /(\$[0-9,]*)/) { # list assignment captures the amount, not the match count
                        $amount =~ tr/,$//d;
                        print $amount, $delim, $whatfor, $delim, $page, "\n" if $printit;
                        $causeMoney{$whatfor . ':' . $page} = $amount;
                    }
                }
            }
        }
    }
    #Like above, but can't figureout what it is for
    elsif ( my @unknownAmounts = /For.*?additional.*?(\$[0-9,]*)/gi) {
        for my $unknownAmount (@unknownAmounts) {
            $unknownAmount =~ tr/,$//d;
            $causeMoney{'UnknownAtPage' . $page} = $unknownAmount;
            $total += $unknownAmount;
        }
    }
    #All money, that doesn't fit into the above, could be a portion of what is above.
    elsif ( my @dontKnow = /\$[0-9,]*/gi ) {
        for my $money ( @dontKnow ) {
            $money =~ tr/,$//d;
            if ( $money >= 1000000 ) {
                $notParsed{$money} = $page;
            }
        }
    }
    $index += 1;
}
pyGnomeFind: A GUI frontend to GNU Find
I have written a graphical front end to the GNU find utility called pyGnomeFind. It does not include all of the features of the actual command line utility, but it does cover most of the essentials. The current version is a testing/preview version. It was written using Python, PyGTK, and Glade. You can get a copy here, and at the bottom is the obligatory screen shot. Right now the code has a haphazard structure, so I need to refactor it. Please do let me know if you see anything wrong with a generated find command. Lastly, on a Debian/Ubuntu system you might need to run ‘apt-get install python-gtk2 python-glade2’ to get it to work.
Update 1: Version 0.2 includes the ability to execute the command and display the results in a window, and also a reworking of the interface so that sections (e.g. time and size) are collapsible.
Update 2: In version 0.3, the user is now able to take parts of the command and group them for use with and/or logic. Next I will be looking into some possible interface reworking, the ability to have multiples of some of the options that currently lack it, and threading, so that a long-running find command does not lock up the GUI.
Update 3: Version 0.3.5 adds multiples of many of the tests where there was only one. I also started to use the Glade 3 interface designer (I was using 2).
Quick Tip: Thinking about Bash Redirection and File Descriptors
A common question that comes up from people new to bash scripting is: “How do I redirect standard error to standard out?” There are a few ways to write this, but the clearest way in my opinion is “command 2>&1”. File descriptor 2 is standard error, and 1 is standard out. So “2>&1” reads in the form of the question: “File descriptor 2 is being redirected to file descriptor 1.”
However, I think that question itself causes confusion. I don’t think the phrase should be “redirecting standard error to standard out.” Rather, you are redirecting standard error to where standard out currently points. You can also think “the file descriptors describe the files they point to.” To see this behavior, you can run ‘xclock 1> ~/scrap/foo 2>&1’. Bash processes redirections from left to right, so this first redirects standard output to ‘~/scrap/foo’, and then redirects standard error to wherever standard out points at that moment, which is now the same file. If you run ‘ls -l /proc/pid_of_xclock/fd’, you will see descriptors 1 and 2 both pointing at ‘~/scrap/foo’.
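The “points to” way of thinking can be tried out with os.dup2, which is the system call a shell uses for this kind of redirection. The sketch below is in Python and uses two stand-in descriptors and temporary files instead of the real descriptors 1 and 2, so it can be run without disturbing the terminal; the function name and file names are made up for the example, and it assumes a POSIX system.

```python
import os
import tempfile


def run_redirections(order):
    """Apply the redirections in `order` to stand-in descriptors and report where writes land."""
    with tempfile.TemporaryDirectory() as tmp:
        term_path = os.path.join(tmp, "terminal")   # Stand-in for the terminal.
        foo_path = os.path.join(tmp, "foo")         # Stand-in for ~/scrap/foo.
        flags = os.O_WRONLY | os.O_CREAT | os.O_APPEND
        out_fd = os.open(term_path, flags)          # Plays the role of descriptor 1.
        err_fd = os.open(term_path, flags)          # Plays the role of descriptor 2.
        foo_fd = os.open(foo_path, flags)
        for step in order:
            if step == "1>foo":
                os.dup2(foo_fd, out_fd)             # "out" now points at foo.
            elif step == "2>&1":
                os.dup2(out_fd, err_fd)             # "err" now points where "out" points right now.
        os.write(out_fd, b"out\n")
        os.write(err_fd, b"err\n")
        for fd in (out_fd, err_fd, foo_fd):
            os.close(fd)
        with open(term_path) as term, open(foo_path) as foo:
            return {"terminal": term.read(), "foo": foo.read()}


print(run_redirections(["1>foo", "2>&1"]))   # Both writes land in foo.
print(run_redirections(["2>&1", "1>foo"]))   # "err" still goes to the terminal.
```

The second call is the classic mistake of writing ‘command 2>&1 > file’: standard error is copied from standard out before standard out has been moved.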
Learning while Reviewing: Python and Subnetting
I am currently learning Python, but I also needed to review subnetting. I have found that one of the best ways to stretch my synapses is to combine learning something new with reviewing something that I learned a while back.
To accomplish my review while still learning a new skill I decided to write a simple program that contains functions for the tasks involved in subnetting. Basically, I aimed to write a mediocre ipcalc. I also decided to write these functions in the way that I would think of them, not necessarily the most efficient way to program them. For example, see this solution vs. mine for converting decimal to binary. My functions include decimal to binary conversion, getting relevant information from a subnet mask, and getting the network id of an IP using a subnet mask. You can get the full code of the following examples here.
The first function, ip_to_binary, is pretty straightforward. It works the same way a person is usually taught to convert IP octets into binary (a.k.a. base 2). It generates a list of the powers of two, from 2^7 down to 2^0, as a reference. It then iterates over each octet of the IP address. Within that iteration it iterates over the list of powers. If the value in the list can be subtracted from what is left of the octet, it subtracts it and records a 1; the program then proceeds with the next power. If it can’t subtract the value then it just records a 0.
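A minimal sketch of that procedure looks something like the following. It is written from the description above rather than copied from my file, so the details (such as returning a dotted string) are assumptions.

```python
def ip_to_binary(ip):
    """Convert a dotted-decimal IP to dotted binary by repeated subtraction."""
    powers = [2 ** n for n in range(7, -1, -1)]   # 128, 64, 32, 16, 8, 4, 2, 1
    binary_octets = []
    for octet in ip.split("."):
        remaining = int(octet)
        bits = ""
        for power in powers:
            if remaining >= power:
                # The power fits: record a 1 and subtract it.
                bits += "1"
                remaining -= power
            else:
                bits += "0"
        binary_octets.append(bits)
    return ".".join(binary_octets)


print(ip_to_binary("192.168.1.10"))   # 11000000.10101000.00000001.00001010
```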
The second function, get_subnet_info, uses the previous function to convert a subnet mask to binary. It then counts the number of masked bits (1) and unmasked bits (0). Finally it uses these counts to find out information such as how many subnets there are, how many hosts are in each subnet, what the CIDR notation is, and what the block size is in decimal for the octet of interest (the one that isn’t 0 or 255). See the Sybex CCNA study guide, sixth edition, page 119 for more information on these operations.
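Here is a rough sketch of those calculations. To keep it self-contained it uses format() for the binary conversion instead of ip_to_binary, and the dictionary keys are names I made up for the example. The subnet count is an assumption too: it counts the masked bits in the octet of interest only, which matches the classful calculation when that octet is the first one past the default mask.

```python
def get_subnet_info(mask):
    """Derive basic subnetting facts from a dotted-decimal subnet mask."""
    octets = [int(octet) for octet in mask.split(".")]
    bits = "".join(format(octet, "08b") for octet in octets)
    masked = bits.count("1")
    unmasked = bits.count("0")
    # Octet of interest: the one that is neither 0 nor 255 (None if there is none).
    interesting = next((octet for octet in octets if octet not in (0, 255)), None)
    return {
        "cidr": "/%d" % masked,
        # Two addresses per subnet are reserved (network and broadcast).
        "hosts_per_subnet": max(2 ** unmasked - 2, 0),
        "block_size": 256 - interesting if interesting is not None else None,
        "subnets": 2 ** format(interesting, "08b").count("1") if interesting is not None else None,
    }


print(get_subnet_info("255.255.255.192"))
```

For 255.255.255.192 this gives /26, 62 hosts per subnet, a block size of 64, and 4 subnets.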
The third function just prints the information returned from get_subnet_info in a helpful format.
The last function, get_network_id, takes an IP and subnet mask to get the network ID. This is done by iterating over each octet in the IP and the mask in parallel and performing a bitwise AND between the two octets in each iteration. See the “The ‘mask’ in subnet mask” section in IP subnetting made easy to find out how this works.
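The bitwise AND step is short enough to show in full. This is a sketch from the description above, assuming both arguments are dotted-decimal strings.

```python
def get_network_id(ip, mask):
    """AND each IP octet with the matching mask octet to get the network ID."""
    ip_octets = [int(octet) for octet in ip.split(".")]
    mask_octets = [int(octet) for octet in mask.split(".")]
    # zip() walks the two lists in parallel, one octet pair at a time.
    return ".".join(str(i & m) for i, m in zip(ip_octets, mask_octets))


print(get_network_id("192.168.1.130", "255.255.255.192"))   # 192.168.1.128
```

In the last octet, 130 is 10000010 and 192 is 11000000, so only the top bit survives the AND, which leaves 128.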
Learning something new at the same time as reviewing something old creates an energy and agility in my thought process. This combination leads me to make connections between different areas of study. It also makes me learn new information while accessing my memory. Lastly, writing about it clarifies my thoughts and acts as a short term review to wrap it all up.
Bash: Getting Command Line Columns to Line up
Update: David Harding pointed out in his comment to this post that the column utility does exactly this. Therefore, the following is really just an academic exercise.
In my last post I showed how to get columns output on the command line to line up using Python. In this post I am going to show you how to do it with Bash scripts (I think you could also use this same method with Python using calls to the shell). Instead of padding the columns with spaces as I did in my previous post, this time we use a tab character for the delimiter and manually set the tab stops in the terminal itself. Since this is the terminal, not the shell, this will work with other shells as well (such as my favorite interactive shell, Zsh).
There is example code at the bottom. Instead of creating functions as I did with my previous post I have kept this example pretty tedious (repetitive code etc) to lessen the levels of abstraction and make the example a little clearer.
The first part finds the max width of each column of a text file. This example has 4 columns and a while loop that splits them on the tab character by setting the IFS (Internal Field Separator) variable to tab for the loop only. Each iteration of the while loop looks at the value of each column; it updates the corresponding $max# variable if the length is larger than the largest seen in the previous iterations (the variable substitution ${#variable} returns the length of the variable).
The second part, after the while loop, finds where the tab stops should be placed. The setterm command with the -tabs switch sets tab stops at absolute positions up to 160 (each argument specifies where the tab stop is relative to the start of the line, not relative to the previous tab stop). So for this example, the second tab stop position is found by adding the width of the first column to the width of the second column; this gives us the position relative to the start of the line. Lastly, after setting the tab stops the file is displayed on the terminal with cat.
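The arithmetic behind the two parts can be sketched in Python (this is not the Bash example itself). The function names and the gap parameter are assumptions made for illustration; whether setterm expects exactly these numbers or needs an offset of one is something to check on your own terminal.

```python
from itertools import accumulate


def column_widths(lines):
    """Max width of each tab-delimited column across all lines."""
    widths = []
    for line in lines:
        for index, field in enumerate(line.rstrip("\n").split("\t")):
            if index == len(widths):
                widths.append(0)            # First time this column is seen.
            widths[index] = max(widths[index], len(field))
    return widths


def tab_stops(widths, gap=2):
    """Absolute tab stop positions: a running total of the column widths plus a gap."""
    # The last column needs no tab stop after it.
    return list(accumulate(width + gap for width in widths[:-1]))


rows = ["a\tbbb\tcc", "aaaa\tb\tc"]
widths = column_widths(rows)
print(widths)              # [4, 3, 2]
print(tab_stops(widths))   # [6, 11]
```

The running total is what makes each stop relative to the start of the line rather than to the previous stop.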
A caveat is that it is hard to find out what to set $TERM to. On my machine, when I am in a screen session $TERM is equal to ‘screen’, but this doesn’t work with setterm; I have to set TERM to ‘linux’.
I hope this helps someone when creating their next command line utility that uses columns.
Getting Command Line Columns to Line up with Python
I created a solution for a program I am writing that makes columns line up when outputted to the command line. I am new to Python, and am hoping I might get some input on this topic.
In the text file that the program reads, the fields (columns) are delimited by ‘$’ and the records (lines) are delimited by newlines ‘\n’. So one line looks like: 1$foo$bar
I broke this task into two separate functions. The first function, opt_output, finds the max length of each column and returns a list such as [1, 23, 14], where 1 is the max width of the first column, 23 is the max width of the second column, etc. This function takes a list object as an argument; that list is the file described above, with each record as one item in the list. It then iterates over the list, splitting each record into a list on its delimiter. Then the function iterates over an individual record, saving the length of each item if it is larger than the largest length seen for that column in the previous records. This is of course clearer in the code itself:
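The listing is not reproduced on this page, so here is a minimal sketch of opt_output reconstructed from the description above. The delimiter parameter and the exact handling of newlines are assumptions.

```python
def opt_output(records, delimiter="$"):
    """Return the max width of each column, e.g. [2, 12, 3]."""
    widths = []
    for record in records:
        items = record.rstrip("\n").split(delimiter)
        for index, item in enumerate(items):
            if index >= len(widths):
                widths.append(len(item))        # First record to have this column.
            elif len(item) > widths[index]:
                widths[index] = len(item)       # Wider than anything seen so far.
    return widths


print(opt_output(["1$foo$bar\n", "22$longer field$x\n"]))   # [2, 12, 3]
```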
The second function, print_line, takes one record as a list object (already split into items) and the value returned from the previous function. It uses the widths returned from opt_output to pad each item in the list with spaces. The number of spaces to pad with is figured out by subtracting the length of the item from the max length of the column that was found by the previous function. Finally, it rejoins the list with the delimiter ‘$’ and then splits the list again using the padded spaces as the delimiter for each column:
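Again the listing is missing here, so this is a sketch of the padding step rather than my original function. It takes a shortcut: instead of rejoining with ‘$’ and splitting again, it joins the padded items directly. The format_line helper and the gap parameter are made up for the example.

```python
def format_line(items, widths, gap=2):
    """Pad each item to its column width and join the columns with `gap` spaces."""
    padded = []
    for item, width in zip(items, widths):
        # Number of pad spaces = max column width minus this item's length.
        padded.append(item + " " * (width - len(item)))
    return (" " * gap).join(padded).rstrip()


def print_line(items, widths):
    """Print one record with its columns lined up."""
    print(format_line(items, widths))


widths = [2, 12, 3]   # As returned by a width-finding pass over the whole file.
print_line("1$foo$bar".split("$"), widths)
print_line("22$longer field$x".split("$"), widths)
```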
When calling these functions I record the value of the first function in a variable so it does not have to iterate over the list over and over again. The print_line function is called for each line in the file when outputting the file to the screen. As long as there is a fixed width font everything lines up nicely in the terminal.






