Debugging a script that parses /proc/net/dev
An Intermittent Problem:
I wrote a Perl script for Nagios that would figure out the bandwidth of an interface by parsing TX (transmit) and RX (receive) bytes from /proc/net/dev. The proc file system is a virtual file system that lets you view various kernel statistics as well as modify some kernel parameters. My script parses the file twice at a specified interval, and then subtracts the old value from the new value to return bytes per second. I realized that this wasn’t the most accurate method, but it was good enough for my purposes and I didn’t have to install SNMP. Also, the larger the interval, the smaller the error would generally be, assuming light load.
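The sampling approach can be sketched as follows. This is a minimal Python illustration, not the Perl script itself. The function names `read_counters` and `bytes_per_second` and the two sample snapshots are hypothetical, and the sketch assumes the usual /proc/net/dev layout, in which each interface line carries eight receive fields followed by eight transmit fields after the colon.

```python
def read_counters(text, interface):
    """Return (rx_bytes, tx_bytes) for one interface from /proc/net/dev text."""
    for line in text.splitlines():
        if ":" not in line:
            continue  # the two header lines contain no colon
        name, _, stats = line.partition(":")
        if name.strip() == interface:
            # str.split() with no argument ignores leading whitespace
            fields = stats.split()
            return int(fields[0]), int(fields[8])
    raise KeyError(interface)


def bytes_per_second(old, new, interval):
    """Per-second RX and TX rates between two samples taken interval seconds apart."""
    return tuple((n - o) / interval for o, n in zip(old, new))


# Two hypothetical snapshots taken 5 seconds apart.
first = "  eth0: 1000000 10 0 0 0 0 0 0 2000000 20 0 0 0 0 0 0\n"
second = "  eth0: 1500000 15 0 0 0 0 0 0 2250000 25 0 0 0 0 0 0\n"

rx_rate, tx_rate = bytes_per_second(
    read_counters(first, "eth0"), read_counters(second, "eth0"), 5
)
print(rx_rate, tx_rate)
```

In a real check the two snapshots would come from reading /proc/net/dev, sleeping for the interval, and reading it again.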
The problem was that this script would fail every so often with ‘Not numeric subtraction’. So I started saving snapshots of /proc/net/dev and noticed that the script would fail when the values were around 4 billion. I knew this to be in the neighborhood of 2^32 (an unsigned 32-bit integer can hold values up to 2^32 - 1). To confirm my suspicion that this was the maximum value for the counter, I decided to have a poke around the kernel source code.
Into the Kernel:
I didn’t know where to look in the source for this, but /proc/net/dev contains the string ‘Inter-|’, which I figured would be unique enough to give me a place to start. Sure enough, a recursive grep for this string returned only 3 lines of code. The function I wanted was dev_seq_printf_stats in net/core/dev.c:
static void dev_seq_printf_stats(struct seq_file *seq, struct net_device *dev)
{
    struct net_device_stats *stats = dev->get_stats(dev);
    seq_printf(seq, "%6s:%8lu %7lu %4lu %4lu %4lu %5lu %10lu %9lu "
               "%8lu %7lu %4lu %4lu %4lu %5lu %7lu %10lu\n",
               dev->name, stats->rx_bytes, stats->rx_packets,
               stats->rx_errors,
               ///.....
The numeric printf specifiers were all %lu (unsigned long integer), which on my system did indeed have a maximum of 4294967295 (2^32 - 1). I wanted to be extra sure, so I traced the net_device_stats struct to include/linux/netdevice.h and confirmed that the rx_bytes member of net_device_stats was in fact an unsigned long integer. So now I knew the error happened when the counter maxed out and then reset to zero, but why a non-numeric subtraction error?
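As an aside, the wraparound itself also affects the bandwidth arithmetic: a plain subtraction across a reset gives a large negative number. One possible way to handle that, sketched here in Python, is to take the difference modulo the counter size. This is not something the original script is described as doing. The name `counter_delta` is hypothetical, and the sketch assumes a 32-bit counter that wraps at most once between two samples.

```python
COUNTER_BITS = 32  # assumption: unsigned long is 32 bits wide, as on the system described
COUNTER_MODULUS = 2 ** COUNTER_BITS


def counter_delta(old, new, modulus=COUNTER_MODULUS):
    """Bytes counted between two samples, allowing for a single wrap past the maximum."""
    return (new - old) % modulus


print(COUNTER_MODULUS - 1)                     # 4294967295, the largest value the counter holds
print(counter_delta(4294967000, 4294967200))   # 200, no wrap between samples
print(counter_delta(4294967000, 704))          # 1000, counter wrapped once
```

If the interface can move more than 2^32 bytes within one sampling interval, the counter may wrap more than once and this correction no longer holds.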
Problem Found:
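In the ANSI C standard library, the printf specifier %8lu sets a minimum field width of 8 characters and right-justifies the value, padding with spaces on the left, since there is no hyphen flag. To find out if the kernel did the same, I traced seq_printf to lib/vsprintf.c and saw that the Linux kernel formats the value in the same way. So whenever the bytes value was fewer than 8 digits long, which is exactly what happens right after the counter resets to zero, there was leading whitespace after the colon. Splitting that string on whitespace produced an empty first field, so my parser picked up an empty string where it expected the byte count, and the subtraction failed. All I needed was to add the extra line at line 9 to eliminate any leading whitespace: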
sub parseBandwidth {
    my $interface = shift;
    my @ifconfigOutput = @_;
    foreach my $line (@ifconfigOutput) {
        if ( $line =~ /:/ ) {
            my @interfaceLine = split( /:/, $line );
            if ( $interfaceLine[0] =~ /^\s*\Q$interface\E\s*$/ ) {    # exact match, so eth1 does not match eth10
                # Next line is to sanitize leading whitespace
                $interfaceLine[1] =~ s/^\s+//;
                my @interfaceStats = split( /\s+/, $interfaceLine[1] );
                print( LOG "DEBUG I have parsed out: @interfaceStats\n" ) if $debug;
                return @interfaceStats;
            }
        }
    }
}
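The padding and its effect on the split can be reproduced outside the kernel. The sketch below uses Python rather than Perl. Python's % operator accepts the same %8lu specifier, and its `re.split` keeps an empty leading field just as Perl's split does with a /\s+/ pattern. The helper name `stats_fields` and the counter values are hypothetical, and only the first two numeric fields of the kernel format string are used.

```python
import re

# Same conversion specifiers as the start of the kernel format string.
# Python's % operator accepts the C-style "l" length modifier and "u" type.
FORMAT = "%6s:%8lu %7lu"

before_wrap = FORMAT % ("eth0", 4294960000, 3000000)  # 10-digit byte count
after_wrap = FORMAT % ("eth0", 1234, 3000007)         # counter restarted near zero


def stats_fields(line, strip_leading=False):
    """Split the part after the colon on runs of whitespace."""
    stats = line.split(":")[1]
    if strip_leading:
        stats = re.sub(r"^\s+", "", stats)  # counterpart of the s/^\s+// fix
    return re.split(r"\s+", stats)


print(repr(before_wrap))                              # '  eth0:4294960000 3000000'
print(repr(after_wrap))                               # '  eth0:    1234 3000007'
print(stats_fields(before_wrap))                      # ['4294960000', '3000000']
print(stats_fields(after_wrap))                       # ['', '1234', '3000007']
print(stats_fields(after_wrap, strip_leading=True))   # ['1234', '3000007']
```

Without the fix, every field shifts one position to the right after the wrap, and the first element is an empty string rather than a number.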