Original computing articles by a systems administrator

Parsing the American Recovery and Reinvestment Act with Perl

Introduction:
I think of the American government as a democratic republic: it is run by a small group of people (the republic part) who are elected by the public to represent them (the democratic part). Congress, and the bills it passes, should have oversight from the people. Although the bills are made available to the public, their size makes them somewhat inaccessible, and in my opinion the media fails to provide enough detailed information about their content.

My goal was to take the 2009 stimulus bill and parse out some information about where the money in this massive bill is going. Although my parser is incomplete, it was still able to extract information that I could not find by searching the Recovery website, so I consider it useful.

Parsing the Bill:
My first step was to get the bill into a form I could parse. When I started this, I wasn’t aware of THOMAS, so I got the PDF from the recovery.org website and converted it to text with pdftotext. I then converted the text to ASCII to make it easier to work with the curly opening and closing quotes (also called left-handed and right-handed quotes):

pdftotext -enc UTF-8 -eol unix recoveryAct.pdf recoveryAct2.txt
cat recoveryAct2.txt | uni2ascii -e > recoveryAct3.txt

The parsing is done with regular expressions. After looking over the file, I decided the best way in was to parse the file line by line. I did however want to look ahead at certain points, so I read the file into memory to make this simple. This is known as slurping. The first two regular expressions capture the page number and the section of the document. These are important because all this script really does is make a sort of index, so I can find things that might be of interest.
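A minimal sketch of the slurp-and-index idea, not the script itself. The sample lines and the sub name `index_bill` are made up for illustration, and the title pattern uses a plain character class for the Roman numerals instead of the Roman module. In the real script the array comes from `my @Bill = <>;`, and having every line in memory is what makes it possible to peek at `$lines[$i + 1]` later.

```perl
use strict;
use warnings;

# Hypothetical sample standing in for the converted bill text.
my @sample = (
    "H. R. 1-3\n",
    "TITLE I-AGRICULTURE AND RURAL DEVELOPMENT\n",
    "some body text\n",
    "H. R. 1-4\n",
    "TITLE II-COMMERCE AND JUSTICE\n",
);

# Walk an already-slurped array of lines and return [title, page] pairs.
sub index_bill {
    my @lines = @_;
    my $page  = 1;
    my @found;
    for my $i ( 0 .. $#lines ) {
        my $line = $lines[$i];

        # Page header: remember the most recent page number seen.
        $page = $1 if $line =~ /H\. R\. 1.*?([0-9]{1,3})/;

        # Title heading: record it together with the current page.
        push @found, [ $1, $page ] if $line =~ /(TITLE [IVXL]+-[A-Z ]*)/;
    }
    return @found;
}

# One tab-separated line per title: heading, then page.
print join( "\t", @$_ ), "\n" for index_bill(@sample);
```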

The group of four regular expressions starting at line 49 of the script does the bulk of the work. They parse the dollar amount that follows variations of a frequent phrase in the bill, which goes something like ‘For an additional amount for … $50,000,000’. I put the regular expressions in order from what I think will be the most accurate match to the least, since ‘or’ is a short-circuit operator and stops at the first alternative that succeeds. I also do this on a larger scale with the if and elsif blocks.
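To illustrate why the order matters, here is a small sketch with two simplified stand-ins for the patterns in the script. The sub name `parse_line` and the sample lines are hypothetical. The looser second pattern would also match the first sample line, but it would capture the quote marks along with the account name, so the stricter pattern is tried first.

```perl
use strict;
use warnings;

# Return (purpose, amount) for a line, trying the stricter pattern first.
# `or` short-circuits: the second match only runs if the first one fails.
sub parse_line {
    my ($line) = @_;
    my @hit;
    if (   ( @hit = $line =~ /For.*?additional.*?``(.*?)''.*?(\$[0-9,]+)/i )
        or ( @hit = $line =~ /For.*?additional.*?for\s+(.*?),?\s*(\$[0-9,]+)/i ) )
    {
        my ( $whatfor, $amount ) = @hit;
        $amount =~ tr/,$//d;    # strip "$" and "," so only digits remain
        return ( $whatfor, $amount );
    }
    return;    # neither pattern matched
}

for my $line (
    q{For an additional amount for ``Rural Water'', $50,000,000, to remain available.},
    q{For an additional amount for broadband grants, $2,500,000.},
    q{Nothing to see here.},
  )
{
    my ( $whatfor, $amount ) = parse_line($line);
    print defined $whatfor ? "$amount\t$whatfor\n" : "no match\n";
}
```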

As an example, I will explain the first regular expression, at line 49. (I am only going to explain the metacharacters, not backtracking or how the Perl regular expression engine handles it; for that I would recommend “Pro Perl Parsing” or a search of The Perl Journal.) ‘For’ simply matches the word ‘For’; the /i modifier at the end makes the whole expression case insensitive, so ‘for’ or ‘fOr’ match as well. In ‘.*?’, the period means ‘any character’ (except a newline) and the asterisk means ‘match the preceding item zero or more times’. The question mark makes the asterisk non-greedy. Since zero or more of any character could also ‘consume’ the word ‘additional’, making the asterisk non-greedy means it stops consuming characters as soon as the word ‘additional’ can match. With \`\`(.*?)\'\' I am matching the ASCII rendering of the ‘funny quotes’ I mentioned previously, and the parentheses around .*? capture what is between them, which, as far as my parser is concerned, is what the money that follows is for. (\$[0-9,]*) captures a dollar amount by looking for any combination of digits and commas after a dollar sign. Lastly, /g makes this work when the pattern occurs multiple times on the same line. I then print the captured information along with the page number.
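The same pattern can be written with the /x modifier so that each piece carries its own comment. This is only a sketch: the sample line is invented, and the variable names are not from the script.

```perl
use strict;
use warnings;

# The first pattern from the script, spread out with /x.
# Whitespace and comments inside the pattern are ignored under /x.
my $pattern = qr{
    For             # the literal word, in any case because of the i flag
    .*?             # as few characters as possible ...
    additional      # ... up to the word "additional"
    .*?
    ``(.*?)''       # capture 1: the text between the ASCII funny quotes
    .*?
    (\$[0-9,]*)     # capture 2: a dollar sign, then digits and commas
}xi;

# Hypothetical line containing the phrase twice.
my $line = q{For an additional amount for ``Salaries'', $1,000; }
         . q{for an additional amount for ``Grants'', $2,500,000.};

# With /g in list context, every match contributes its two captures.
my @captures = $line =~ /$pattern/g;

my @items;
while ( my ( $whatfor, $amount ) = splice @captures, 0, 2 ) {
    $amount =~ tr/,$//d;    # keep only the digits
    push @items, [ $whatfor, $amount ];
    print "$amount\t$whatfor\n";
}
```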

Starting at line 65, the script reads ahead up to six lines in case the text my regular expression is looking for is interrupted by a line break. I did this by keeping track of which line I am at (which is the same as the index into the array) and using a nested loop that reads ahead a few lines without interrupting the flow of the main parsing loop.
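A rough sketch of the read-ahead idea, pulled out into a helper. The sub name `amount_ahead`, its parameters, and the sample lines are assumptions made for this example. Unlike the script, the helper stops at the first amount it finds and clamps the window at the end of the array instead of skipping the last few lines.

```perl
use strict;
use warnings;

# Look at up to $window lines after $index and return the first dollar
# amount found, with "$" and "," stripped, or undef if there is none.
sub amount_ahead {
    my ( $lines, $index, $window ) = @_;
    my $last = $index + $window;
    $last = $#{$lines} if $last > $#{$lines};    # do not run off the end
    for my $i ( $index + 1 .. $last ) {
        # List assignment puts the captured text, not the match count, in $amount.
        if ( my ($amount) = $lines->[$i] =~ /(\$[0-9,]+)/ ) {
            $amount =~ tr/,$//d;
            return $amount;
        }
    }
    return;
}

# Hypothetical sample: the phrase and its amount are split across lines.
my @sample = (
    q{For an additional amount for ``State Grants'',},
    q{to remain available until expended,},
    q{$300,000,000: Provided, That the funds},
);

# The main loop keeps its own position; the helper only peeks ahead.
for my $index ( 0 .. $#sample ) {
    next unless $sample[$index] =~ /For.*?additional.*?``(.*?)''/;
    my $whatfor = $1;
    my $amount  = amount_ahead( \@sample, $index, 6 );
    print "$amount\t$whatfor\n" if defined $amount;
}
```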

The last sections, starting at line 84, pick up more basic matches. I used them to print out and look at the dollar amounts that were not captured, and used that information to improve my parser.
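One way such a report of leftovers could look is sketched below. The hash has the same shape as %notParsed in the script (amount as key, page as value), but the sub name `note_unparsed`, the threshold parameter, and the sample lines are assumptions for illustration.

```perl
use strict;
use warnings;

# Record every dollar amount of at least $min found on a line,
# keyed by amount, with the page it was seen on as the value.
sub note_unparsed {
    my ( $seen, $line, $page, $min ) = @_;
    for my $money ( $line =~ /\$[0-9,]+/g ) {
        ( my $digits = $money ) =~ tr/,$//d;    # copy, then strip "$" and ","
        next if $digits eq '' or $digits < $min;
        $seen->{$digits} = $page;
    }
    return;
}

my %not_parsed;
note_unparsed( \%not_parsed,
    q{of which $5,000,000 shall be for audits and $250,000 for travel},
    12, 1_000_000 );
note_unparsed( \%not_parsed,
    q{not less than $40,000,000 shall be transferred},
    7, 1_000_000 );

# Print the leftovers ordered by page so they can be checked against the PDF.
for my $amount ( sort { $not_parsed{$a} <=> $not_parsed{$b} } keys %not_parsed ) {
    print "$not_parsed{$amount}\t$amount\n";
}
```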

You can get the output of the script here. It has lines to show the start of each section of the bill, followed by lines giving the amount of money, what the parser thinks it is for, and the page of the bill that the line refers to. The page is important because the script doesn’t understand that a given amount might not be the total amount of money, and it might get confused elsewhere as well.

My Next Steps:
The next steps I would like to take are to look at the Lingua modules and into incorporating natural language processing. It might also help to capture the HTML versions from THOMAS, since the sections could then be parsed directly with HTML::TreeBuilder.

Conclusion:
I think Congress should develop, and actually start using, XML markup for its bills. This would allow people to develop proper parsers that could retrieve the information and display it in visual formats, so people could get a better handle on where the money is going. Our country now has a CIO, Vivek Kundra, and I think he should lead the government toward providing more open and accessible information.

#!/usr/bin/perl 
#===============================================================================
#
#         FILE:  parseBill.pl
#
#        USAGE:  ./parseBill.pl  
#       AUTHOR:  Kyle Brandt (kb), www.kbrandt.com
#      COMPANY:  Boston, MA
#      VERSION:  1.0
#      CREATED:  03/19/2009 10:16:54 AM
#===============================================================================

use strict;
use warnings;
use Roman;

#Globals
my $printit = 1;
my $delim = "\t";
my $page = 1;
my $titleSection;
my $resolution = 1;
my $total = 0;
my %causeMoney;
my %notParsed;
my $romanRegex = '';
foreach my $number (1..20) {
	$romanRegex .= Roman($number);
	unless ($number == 20) {
		$romanRegex .= '|';
	}
}

my @Bill = <>;
my $index = 0;
foreach (@Bill) {
	#Get Page Number
	if (/H\. R\. 1.*?([0-9]{1,3})/) {
		#print $1, "\n";
		$page = $1;
	}
	if ( m/TITLE ($romanRegex)-[A-Z ]*/) {
		$titleSection = $&;
		print $titleSection, $delim, $page, "\n" if $printit;
	}
	#For additional is a common phrase, this gets the dollar amount after it and what it is for
	my @amounts;
	if ( 
	   ( @amounts = /For.*?additional.*?\`\`(.*?)\'\'.*?(\$[0-9,]*)/gi) or 
	   ( @amounts = /For.*?additional.*?for(.*?)(\$[0-9,]*)/gi) or 
	   ( @amounts = /For an amount for \`\`(.*?)\'\'.*?(\$[0-9,]*)/gi) or 
	   ( @amounts = /For necessary expenses for(.*?)(\$[0-9,]*)/gi)  
	   ) {
		my $whatfor;	
		my $amount;
		while (@amounts) {
			$whatfor = shift @amounts;
			$amount = shift @amounts;
			$amount =~ tr/,$//d;
			print $amount, $delim, $whatfor, $delim, $page, "\n" if $printit;
			$causeMoney{$whatfor . ':' . $page} = $amount;
			$total += $amount;
		}
	} 
	#Maybe if we read ahead a few lines, we will find what we are looking for
	elsif ( @amounts = /For.*?additional.*?\`\`(.*?)\'\'/) {
		AMOUNT:
		while (@amounts) {
			my $whatfor = shift @amounts;
			if (length($whatfor) > 40) {
				next AMOUNT;
			}
			if ( $index < ($#Bill - 6 )) {
				for my $line (($index + 1) .. ($index + 6)) {
					if (my ($amount) = $Bill[$line] =~ /(\$[0-9,]*)/) { #capture the text, not the match count
						$amount =~ tr/,$//d;
						print $amount, $delim, $whatfor, $delim, $page, "\n" if $printit;
						$causeMoney{$whatfor . ':' . $page} = $amount;
					}
				}
			}
		}
	}
	#Like above, but can't figure out what it is for
	elsif ( my @unknownAmounts = /For.*?additional.*?(\$[0-9,]*)/gi) {
		for my $unknownAmount (@unknownAmounts) {
			$unknownAmount =~ tr/,$//d;
			$causeMoney{'UnknownAtPage' . $page} = $unknownAmount; 
			$total += $unknownAmount;
		}
	}
	#All money, that doesn't fit into the above, could be a portion of what is above.
	elsif ( my @dontKnow = /\$[0-9,]*/gi ) {
		for my $money ( @dontKnow  ) {
			$money =~ tr/,$//d;
			if ( $money >= 1000000 ) { 
			$notParsed{$money} = $page;
			}
		}
	}
	$index += 1;
}
