Archive for the ‘Perl’ Category
Debugging a script that parses /proc/net/dev
An Intermittent Problem:
I wrote a Perl script for Nagios that estimates the bandwidth of an interface by parsing the TX (transmit) and RX (receive) byte counters from /proc/net/dev. The proc file system is a virtual file system that exposes various kernel statistics and lets you modify some kernel parameters. My script parses the file twice, a specified interval apart, and then subtracts the old value from the new value to work out bytes per second. I realized that this wasn’t the most accurate method, but it was good enough for my purposes and I didn’t have to install SNMP. Also, the larger the interval, the smaller the error would generally be, assuming light load.
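The calculation described above can be sketched roughly as follows. This is only an illustration, not the original plugin: the helper name bytes_per_second and the sample readings are made up, and it assumes the difference between the two readings is divided by the sampling interval.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the original script): average bytes per
# second between two readings of the same counter.
sub bytes_per_second {
    my ( $old, $new, $interval ) = @_;
    die "interval must be positive\n" unless $interval > 0;
    return ( $new - $old ) / $interval;
}

# Two made-up RX byte readings taken 5 seconds apart.
printf "%.1f bytes/s\n", bytes_per_second( 1_000_000, 1_250_000, 5 );
```

With these sample numbers the sketch prints 50000.0 bytes/s.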
The problem was that this script would fail every so often with ‘Not numeric subtraction’. So I started saving snapshots of /proc/net/dev and noticed that the script would fail when the values were around 4 billion something. This I knew to be in the neighborhood of 2^32 (the maximum value of an unsigned 32-bit integer is 2^32 - 1). To confirm my suspicion that this was the maximum value for this counter, I decided to have a poke around the kernel source code.
Into the Kernel:
I didn’t know where to look in the source for this, but /proc/net/dev has the string ‘Inter-|’, which I figured would be unique enough to give me a place to start. Sure enough, a recursive grep for this string returned only 3 lines of code. The function I wanted was dev_seq_printf_stats in net/core/dev.c:
static void dev_seq_printf_stats(struct seq_file *seq, struct net_device *dev)
{
    struct net_device_stats *stats = dev->get_stats(dev);
    seq_printf(seq, "%6s:%8lu %7lu %4lu %4lu %4lu %5lu %10lu %9lu "
               "%8lu %7lu %4lu %4lu %4lu %5lu %7lu %10lu\n",
               dev->name, stats->rx_bytes, stats->rx_packets,
               stats->rx_errors,
               ///.....
Looking at the printf specifiers, they were %lu, an unsigned long integer, which on my system did indeed have a maximum of 4294967295 (2^32 - 1). I wanted to be extra sure, so I traced the net_device_stats struct to include/linux/netdevice.h and confirmed that the net_device_stats->rx_bytes member was in fact an unsigned long integer. So now I knew the error happened when the counter maxed out and then reset to zero, but why a non-numeric subtraction error?
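As an aside, the wraparound itself is a separate issue from the parsing error and can be compensated for when computing the difference. The following is a minimal sketch, not part of the original script: the helper name counter_delta is hypothetical, and it assumes a 32-bit counter that wraps at most once between the two samples.

```perl
use strict;
use warnings;

# Assumption: the counter is an unsigned 32-bit value that wraps back to 0
# after 2**32 - 1, and wraps at most once between two samples.
my $COUNTER_MODULUS = 2**32;

# Hypothetical helper: difference between two readings, allowing for one wrap.
sub counter_delta {
    my ( $old, $new ) = @_;
    return $new >= $old
        ? $new - $old
        : $new + $COUNTER_MODULUS - $old;
}

# Made-up readings taken just before and just after a wrap.
print counter_delta( 4_294_967_000, 704 ), "\n";
```

Here 296 bytes were counted before the wrap and 704 after it, so the sketch prints 1000.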
Problem Found:
%8lu, as an ANSI C standard library printf specifier, sets a minimum field width of 8 characters and right-justifies the value, since there is no hyphen flag. To find out if the kernel did the same, I traced seq_printf to lib/vsprintf.c and saw that the Linux kernel formats this in the same way. When the bytes value was fewer than 8 characters long, which is what happens right after the counter resets to zero, it was padded with leading whitespace that threw off my parser: splitting the line on whitespace then produced an empty first field instead of the byte count. All I needed was to add the extra line at line 9 to eliminate any leading whitespace:
sub parseBandwidth {
    my $interface = shift;
    my @ifconfigOutput = @_;
    foreach my $line (@ifconfigOutput) {
        if ( $line =~ /:/ ) {
            my @interfaceLine = split( /:/, $line);
            if ($interfaceLine[0] =~ /$interface/) {
                # Next line is to sanitize leading whitespace
                $interfaceLine[1] =~ s/^\s+//;
                my @interfaceStats = split( /\s+/, $interfaceLine[1] );
                print( LOG "DEBUG I have parsed out: @interfaceStats\n") if $debug;
                return @interfaceStats;
            }
        }
    }
}
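To see why the leading whitespace matters, here is a small standalone sketch. The sample string is made up; it only imitates what follows the colon in /proc/net/dev when the counter is small.

```perl
use strict;
use warnings;

# Made-up sample of the text after the colon when the byte counter is small.
my $stats = "    1234     10    0    0";

# Without trimming, the leading whitespace yields an empty first field.
my @raw = split /\s+/, $stats;

# After trimming, the first field is the byte counter.
( my $trimmed = $stats ) =~ s/^\s+//;
my @clean = split /\s+/, $trimmed;

print "raw first field:   '$raw[0]'\n";
print "clean first field: '$clean[0]'\n";
```

The first line printed shows an empty string and the second shows 1234. Another option would be split ' ', $stats, since Perl's special single-space pattern skips leading whitespace on its own.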
My Not-So-Shabby Screen and Gnome-Terminal Setup
Introduction
For a system administrator it is important to have an efficient and comfortable interface to all your servers. GNU Screen is an excellent utility for keeping a single terminal connected to multiple servers, with sessions that won’t disappear when you close the window. I have a setup that allows me to spawn gnome-terminal with a separate tab for each location I administer, each tab running its own screen session. Each screen session then has a named ‘tab’ for every server at that location, and each of those logs into its server automatically. It ends up looking like this:

My main two recommendations for screen are to set up the meta character as back-tick ( ` ) and to give screen a ‘tab bar’. You can read how to do these two things here.
Setting up Screen
Once you have screen set up the way you like, you can then specify an additional screenrc file with the -c switch, and your settings in ~/.screenrc will still be used. The secondary screenrc is where you can list different server groups. This file will make it so there are named ‘tabs’ for each server, and each tab will log into the server you specify. Each line in the file should be something like ‘screen -t myServer ssh myServer’; the first myServer is the name of the tab, and ssh myServer is the command that will be run. To simplify doing this in the future, I made a little Perl script that reads a file that has one server name per line and prints the rc file to standard out.
#!/usr/bin/perl
#===============================================================================
# FILE: makeScreenRc.pl
# USAGE: ./makeScreenRc.pl
# AUTHOR: Kyle Brandt (kb), kyle@kbrandt.com
#===============================================================================
use strict;
use warnings;
print "zombie qr\n";
while (<>) {
    chomp;
    my $server = $_;
    print "screen -t $server ssh $server", "\n";
}
So if you called the above script with something like ‘perl makeScreenRc.pl myDmzList > myScreenDmzRc’ you can then use the created file with ‘screen -R DMZ -c myScreenDmzRc’. The capital R switch looks for an existing detached session and will attempt to reattach it before creating a new one. This will be useful with gnome-terminal in case gnome-terminal crashes.
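If the server list might contain blank lines or comments, the generator could be written as a small function instead. This is a hypothetical variant, not the script above: the name screenrc_lines and the rule of skipping lines that start with ‘#’ are my own assumptions.

```perl
use strict;
use warnings;

# Hypothetical variant of the generator: takes a list of host names and
# returns the screenrc lines, skipping blank entries and '#' comments.
sub screenrc_lines {
    my @servers = @_;
    my @lines   = ('zombie qr');
    for my $server (@servers) {
        $server =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
        next if $server eq '' or $server =~ /^#/;
        push @lines, "screen -t $server ssh $server";
    }
    return @lines;
}

# Made-up server names for illustration.
print "$_\n" for screenrc_lines( 'web1', '', '# staging', 'db1' );
```

For this made-up list the sketch prints the ‘zombie qr’ line followed by one screen line each for web1 and db1.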
Setting Up Gnome-Terminal
The next step is to create a profile for each of the screen sessions. You can do this by going to File > New Profile and creating a profile with a relevant name for the screen session, e.g. ‘DMZ’. After that, go to the Title and Command tab, check ‘Run a custom command instead of my shell’ and edit the command to be something like ‘screen -R DMZ -c myScreenDmzRc’. Repeat this for each of the screen sessions you have set up. Then you can run something like ‘gnome-terminal --tab-with-profile=DMZ --tab-with-profile=MyOffice’, where DMZ and MyOffice are the names of the gnome-terminal profiles you created. This automatically detaches itself from the controlling terminal, so if you close the terminal you launched it from, the new terminal will not close. Lastly, you can set up a shell alias to run the above command, so all you have to do to open up your command central is type something like ‘myservers’.
A Perl API for TVRage – WebService::TVRage
The Module
This new module I have written provides an object-oriented interface to TVRage’s XML service, which allows you to get episode and other information for television shows. It is written very similarly to my previous module, WebService::UPS, and also uses XML::Simple and Mouse. You can get the module from CPAN here. You can also install it with ‘sudo cpan -i WebService::TVRage’.
Example:
use WebService::TVRage::EpisodeListRequest;
use WebService::TVRage::ShowSearchRequest;
my $searchReq = WebService::TVRage::ShowSearchRequest->new();
my $searchResults = $searchReq->search('Heroes');
my $heroFromSearch = $searchResults->getShow('Heroes');
print $heroFromSearch->getLink(), "\n";
print $heroFromSearch->getCountry(), "\n";
print $heroFromSearch->getStatus(), "\n";
my $heroes = WebService::TVRage::EpisodeListRequest->new( 'episodeID' => $heroFromSearch->getShowID() );
my $episodeList = $heroes->getEpisodeList();
print $episodeList->getNumSeasons(), "\n";
my $episode = $episodeList->getEpisode(1,3);
print $episode->getTitle(), "\n";
print $episode->getAirDate(), "\n";
foreach my $showtitle ($searchResults->getTitleList()) {
    my $show = $searchResults->getShow($showtitle);
    print $show->getLink();
}
A Line By Line Explanation of Selection Sort from Mastering Algorithms with Perl
Introduction:
O’Reilly’s Mastering Algorithms with Perl is written for programmers who are already quite familiar with Perl. I thought it might help me, and maybe others, to walk through the code for selection sort on page 120, because the code isn’t the clearest Perl. My analysis is not meant to explain selection sort, because the book does that. Rather, it is to explain the Perl code.
The Code:
#!/usr/bin/perl

use strict;
use warnings;

sub selection_sort {
    my $array = shift;

    my $i; # The starting index of a minimum-finding scan.
    my $j; # The running index of a minimum-finding scan.

    for ( $i = 0; $i < $#$array ; $i++ ) {
        my $m = $i; # The index of the minimum element.
        my $x = $array->[ $m ]; # The minimum value.

        for ( $j = $i + 1; $j < @$array; $j++ ) {
            ( $m, $x ) = ( $j, $array->[ $j ] ) # Update minimum.
                if $array->[ $j ] < $x;
        }

        # Swap if needed.
        @$array[ $m, $i ] = @$array[ $i, $m ] unless $m == $i;
    }
}

my @array = (1, 7, 2, 8, 2, 5, 20);
selection_sort(\@array);
print "@array\n";
The only change I made to the code from the example is to change lt to < on line 18 so that the comparison is numeric rather than a string comparison by ASCII value.
Analysis:
Line 27 passes a reference to the @array array to the selection_sort subroutine. This makes it so the array is changed in place and a copy of the array does not have to be made.
Line 7 makes the variable $array, which is local to the function selection_sort, a reference to the array in line 26. ‘shift’ removes the first argument to the function from the @_ array. Inside a subroutine, @_ is the default argument for shift, so it can be omitted. The line could have been written as my $array = shift(@_); .
Line 12 uses a C-style for loop. In Perl, for and foreach are synonyms. However, what a for or foreach loop does depends on what comes after the keyword: a three-part header gives a C-style loop, while a list gives a loop over that list. By convention, Perl programmers use for when they are writing a C-style loop. The C-style loop is used for array indexing here. The first statement, $i = 0; initializes the index variable. The second statement, $i < $#$array; is the loop condition, as in a while loop: the loop exits when it becomes false. The third statement, $i++, increments the index by one on each pass. In the second statement, $#$array means ‘the index of the last element of the array that the reference $array points to’. Since the comparison is less than rather than less than or equal to, and $#$array is the last index of the array, not the number of items in it, the loop will stop before the last element of the array.
Line 14 is again using references, so $array->[ $m ] returns the value at index $m in the array that $array points to.
Line 16, like 12, uses the c-style loop. This time the exit condition (the second statement in the loop header) is $j < @$array . This dereferences the array that $array points to, and evaluates in scalar context which returns the number of items in the array. So this also could have been written as $j <= $#$array .
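The difference between $#$array and @$array in scalar context can be checked with a short standalone sketch. The array contents here are arbitrary sample data.

```perl
use strict;
use warnings;

my $array = [ 1, 7, 2, 8 ];    # arbitrary sample data

my $last_index = $#$array;          # index of the last element
my $count      = scalar @$array;    # number of elements

print "last index: $last_index, count: $count\n";

# Both loop headers visit the same indexes.
my ( @with_count, @with_last_index );
for ( my $j = 0; $j < @$array; $j++ )   { push @with_count,      $j }
for ( my $j = 0; $j <= $#$array; $j++ ) { push @with_last_index, $j }

print "@with_count\n";
print "@with_last_index\n";
```

For this four-element array the last index is 3 and the count is 4, and both loops print 0 1 2 3.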
Lines 17 and 18 are like a standard if statement written backwards. This is known as postfix syntax, a trailing conditional, or a statement-modifying if. Note that the two lines form a single statement: there is no semicolon at the end of line 17.
Line 22 also uses a trailing conditional, this time ‘unless’. This line uses array slices to swap the items. The slice [ $m, $i ] selects two items, the items at index $m and index $i. It does not select a range like Python’s array[0:2]. In Perl you use the range operator .. instead of : for that. And again @$array dereferences the array; you will often see this written as @{$array}.
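A short sketch of the slice behaviour, using arbitrary sample data rather than anything from the book:

```perl
use strict;
use warnings;

my $array = [ 'a', 'b', 'c', 'd' ];    # arbitrary sample data

# Slice with a list of indexes: picks elements 3 and 0, in that order.
my @picked = @$array[ 3, 0 ];

# Slice with a range: picks elements 0, 1 and 2.
my @range = @{$array}[ 0 .. 2 ];

# Swap elements 0 and 3 in place, the same way the sort swaps $m and $i.
@$array[ 0, 3 ] = @$array[ 3, 0 ];

print "@picked | @range | @$array\n";
```

This prints ‘d a | a b c | d b c a’. The right-hand slice is evaluated in full before the assignment, which is why the swap needs no temporary variable.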
Conclusion
I hope this helps someone else who is reading this book, thanks to everyone in #perl on irc.freenode.com for the help.
Track UPS Packages with Perl – WebService::UPS
The Module:
I have made an object-oriented Perl module for tracking UPS shipments. To use this module you will need to get a developer key for the UPS online tools here. This module makes an XML request to the online tools, and then parses the response using XML::Simple. The module has methods to get specific information such as recent activity. You can read the full module documentation as well as download the module at CPAN’s site here.
Example:
use WebService::UPS::TrackRequest;
my $Package = WebService::UPS::TrackRequest->new;
$Package->Username('kbrandt');
$Package->Password('topsecrent');
$Package->License('8C3D7EE8FZZZZZ4');
$Package->TrackingNumber('1ZA45Y5111111111');
print $Package->Username();
my $trackedPackage = $Package->requestTrack();
print $trackedPackage->getActivityList();
Installation:
You can install this module with cpan. On Linux the command is ‘cpan -i WebService::UPS::TrackRequest’. The required prerequisite modules are: Mouse, LWP::UserAgent, HTTP::Request::Common, XML::Simple, and Data::Dumper.
Parsing the American Recovery and Reinvestment Act with Perl
Introduction:
I think of the American government as a democratic republic. The government is run by a small group of people, a republic, that is elected by the public to represent them, a democracy. Congress, and the bills they pass, should have oversight from the people. Although the bills are made available to the public, the size makes them somewhat inaccessible, and in my opinion, the media fails at providing enough detailed information on the content of these bills.
My goal was to take the 2009 stimulus bill and try to parse out some information about where the money is going in this massive bill. Although my parser is incomplete, it was still able to parse out information that I could not find by searching for it on the Recovery Website. So I consider it useful.
Parsing the Bill:
My first step was to get the bill into something I could parse. When I started this, I wasn’t aware of THOMAS, so I got the pdf from the recovery.org website and then converted it. I did this using pdftotext, and I then converted it to ascii to make it easier to work with the funny start and end quotes (also called left handed and right handed quotes):
pdftotext -enc UTF-8 -eol unix recoveryAct.pdf recoveryAct2.txt
cat recoveryAct2.txt | uni2ascii -e > recoveryAct3.txt
The parsing is done with regular expressions. After looking over the file, I decided the best way in was to parse the file line by line. I did however want to look ahead at certain points, so I read the file into memory to make this simple. This is known as slurping. The first two regular expressions capture the page number and the section of the document. These are important because all this script really does is make a sort of index, so I can find things that might be of interest.
The group of four regular expressions in the first large if condition (right after the ‘#For additional’ comment) does the bulk of the work. These parse the dollar amount after variations of a frequent phrase in the bill. The phrase goes something like ‘For an additional amount for … $50,000,000’. I put the regular expressions in the order of what I think will provide the most accurate match first, since ‘or’ is a short-circuit operator: once one pattern matches, the later ones are not tried. I also do this on a larger scale with the if and elsif blocks.
As an example, I will explain the first of those regular expressions. (I am going to explain the meta characters, not all the backtracking and how the Perl regular expression engine handles it. For that, I would recommend “Pro Perl Parsing” or a search of the Perl Journal.) ‘For’ simply matches the word ‘For’ with any case variation (for example for or fOr), because the /i modifier at the end makes the whole expression case insensitive. In ‘.*?’, the period means ‘any character’ and the asterisk means ‘zero or more times’. The question mark makes the asterisk non-greedy. Since zero or more of any character would also ‘consume’ the word ‘additional’, making the asterisk non-greedy means it will stop consuming characters as soon as it finds the word ‘additional’.
With \`\`(.*?)\'\' I am capturing what appears between the ‘funny quotes’ I mentioned previously; the doubled backticks and doubled apostrophes are the ASCII rendering of those quotes. The parentheses around (.*?) capture what is between the funny quotes, which is what the money that follows is for, as far as my parser is concerned. (\$[0-9,]*) captures any sort of dollar amount by looking for any combination of digits and commas after a dollar sign. Lastly, the /g makes it so this will work if the pattern occurs multiple times on the same line. I then print out the captured information with the page number.
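The first pattern can be tried out on its own with a short sketch. The sample line below is made up in the style of the converted bill and is not quoted from the actual text.

```perl
use strict;
use warnings;

# Made-up line in the style of the converted bill, not a real quotation.
my $line = q{For an additional amount for ``Salaries and Expenses'', $50,000,000 to remain available until expended.};

# The first pattern from the script: what the money is for, then the amount.
my @amounts = $line =~ /For.*?additional.*?\`\`(.*?)\'\'.*?(\$[0-9,]*)/gi;

my ( $what_for, $amount ) = @amounts;

# Strip the dollar sign and commas, as the script does before summing.
( my $digits = $amount ) =~ tr/,$//d;

print "what for: $what_for\n";
print "amount:   $amount ($digits)\n";
```

For this sample the captures are ‘Salaries and Expenses’ and ‘$50,000,000’, which becomes 50000000 after the tr step. Because [0-9,]* also accepts commas, an amount followed directly by a comma would capture that trailing comma too; the tr step removes it along with the others.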
In the first elsif block I made it so the script can read ahead a few lines in case my regular expression is interrupted by a new line. I did this by keeping track of which line I am at (which is the same as the index of the array), and then having a nested loop which reads ahead a few lines without interrupting the flow of the main parsing loop.
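The read-ahead idea can be sketched in isolation like this. It is a simplified illustration rather than the code from the script: the helper name find_amount_ahead and the sample lines are hypothetical, and unlike the script it stops at the first amount it finds.

```perl
use strict;
use warnings;

# Hypothetical helper: starting after line $index, scan up to $max_ahead
# lines and return the first dollar amount found, or nothing if there is none.
sub find_amount_ahead {
    my ( $lines, $index, $max_ahead ) = @_;
    my $last = $index + $max_ahead;
    $last = $#$lines if $last > $#$lines;    # do not run past the end
    for my $i ( $index + 1 .. $last ) {
        return $1 if $lines->[$i] =~ /(\$[0-9][0-9,]*)/;
    }
    return;
}

# Made-up lines where the amount is two lines below the phrase.
my @bill = (
    "For an additional amount for ``Example Program'',",
    "to remain available until expended,",
    "\$1,500,000 for grants.",
);

my $found = find_amount_ahead( \@bill, 0, 6 );
print defined $found ? "$found\n" : "none\n";
```

With these sample lines the sketch prints $1,500,000. Passing the array by reference and the current index keeps the outer loop's position untouched, which is the same reason the script tracks $index separately.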
I used the last two elsif blocks to collect more basic matches and look at them, so I could see what dollar amounts were not captured and use that information to improve my parser.
You can get the output of the script here. It has lines to show the start of each section of the bill, and then lines for the amount of money, what the parser thinks it is for, and the page of the bill that the line refers to. The page is important, because the script doesn’t understand that the following amount might not be the total amount of money, and might get confused elsewhere as well.
My Next Steps:
The next steps I would like to take are to start looking at the Lingua modules and look into incorporating natural language processing. It also might be helpful if I capture the html versions from THOMAS, as this will allow me to already have the sections parsable with HTML::TreeBuilder.
Conclusion:
I think Congress should develop, and actually start using, XML markup for their bills. This will allow people to develop proper parsers that could retrieve the information and display it in visual formats, so people could have a better handle on where the money is going. Our country now has a CIO, Vivek Kundra, and I think he should lead the government to provide more open and accessible information.
#!/usr/bin/perl
#===============================================================================
#
# FILE: parseBill.pl
#
# USAGE: ./parseBill.pl
# AUTHOR: Kyle Brandt (kb), www.kbrandt.com
# COMPANY: Boston, MA
# VERSION: 1.0
# CREATED: 03/19/2009 10:16:54 AM
#===============================================================================
use strict;
use warnings;
use Roman;
#Globals
my $printit = 1;
my $delim = "\t";
my $page = 1;
my $titleSection;
my $resolution = 1;
my $total = 0;
my %causeMoney;
my %notParsed;
#Build an alternation of the Roman numerals I through XX for the TITLE headings
my $romanRegex = '';
foreach my $number (1..20) {
    $romanRegex .= Roman($number);
    unless ($number == 20) {
        $romanRegex .= '|';
    }
}
#Slurp the whole bill so we can look ahead by index
my @Bill = <>;
my $index = 0;
foreach (@Bill) {
    #Get Page Number
    if (/H\. R\. 1.*?([0-9]{1,3})/) {
        #print $1, "\n";
        $page = $1;
    }
    if ( m/TITLE ($romanRegex)-[A-Z ]*/) {
        $titleSection = $&;
        print $titleSection, $delim, $page, "\n" if $printit;
    }
    #For additional is a common phrase, this gets the dollar amount after it and what it is for
    my @amounts;
    if (
        ( @amounts = /For.*?additional.*?\`\`(.*?)\'\'.*?(\$[0-9,]*)/gi) or
        ( @amounts = /For.*?additional.*?for(.*?)(\$[0-9,]*)/gi) or
        ( @amounts = /For an amount for \`\`(.*?)\'\'.*?(\$[0-9,]*)/gi) or
        ( @amounts = /For necessary expenses for(.*?)(\$[0-9,]*)/gi)
    ) {
        my $whatfor;
        my $amount;
        #Captures come back in pairs: what it is for, then the amount
        while (@amounts) {
            $whatfor = shift @amounts;
            $amount = shift @amounts;
            $amount =~ tr/,$//d;
            print $amount, $delim, $whatfor, $delim, $page, "\n" if $printit;
            $causeMoney{$whatfor . ':' . $page} = $amount;
            $total += $amount;
        }
    }
    #Maybe if we read ahead a few lines, we will find what we are looking for
    elsif ( @amounts = /For.*?additional.*?\`\`(.*?)\'\'/) {
        AMOUNT:
        while (@amounts) {
            my $whatfor = shift @amounts;
            if (length($whatfor) > 40) {
                next AMOUNT;
            }
            if ( $index < ($#Bill - 6 )) {
                for my $line (($index + 1) .. ($index + 6)) {
                    #Capture the amount itself; a plain scalar assignment would only store the match result (1)
                    if (my ($amount) = $Bill[$line] =~ /(\$[0-9,]*)/) {
                        $amount =~ tr/,$//d;
                        print $amount, $delim, $whatfor, $delim, $page, "\n" if $printit;
                        #Store the cleaned amount, as the first branch does
                        $causeMoney{$whatfor . ':' . $page} = $amount;
                    }
                }
            }
        }
    }
    #Like above, but can't figure out what it is for
    elsif ( my @unknownAmounts = /For.*?additional.*?(\$[0-9,]*)/gi) {
        for my $unknownAmount (@unknownAmounts) {
            $unknownAmount =~ tr/,$//d;
            $causeMoney{'UnknownAtPage' . $page} = $unknownAmount;
            $total += $unknownAmount;
        }
    }
    #All money, that doesn't fit into the above, could be a portion of what is above.
    elsif ( my @dontKnow = /\$[0-9,]*/gi ) {
        for my $money ( @dontKnow ) {
            $money =~ tr/,$//d;
            if ( $money >= 1000000 ) {
                $notParsed{$money} = $page;
            }
        }
    }
    $index += 1;
}