
14 November 2003

Data Acquisition with Perl - #1

by Joseph DiVerdi

What's Up

In this installment we will turn to the problem of acquiring experimental data using a Perl program. Actually, the problem isn't only acquiring the data but what to do with it once it has been acquired. In this case, we will be examining a special type of experimental data, that is, event capture.

An event can be one of many observables including, but not limited to,

  • a lever actuated by an experimental subject,
  • the detection of a radioactive decay or cosmic ray by a Geiger-Mueller tube,
  • the detection of a lightning strike by a photo-detector or radio-detector,
  • the passing of an automobile on a road as detected by a pneumatic tube,
  • one of my cats passing through its door by breaking a light path,
  • someone in a household turning on the bathroom light or opening the refrigerator door,

and so on. The range of possibilities is limited only by our experimental imagination. However, the task is the same in this type of acquisition problem: record when the actual event occurs. The program described here will perform this task in a convenient way.

Equally important is the means by which the occurrence of an event is declared to the computer. In the method I am describing here I have chosen a scheme which makes use of the serial port as the interface between the computer and the outside world. A circuit was developed, using junk-box parts, which accepts a logic signal and emits a single character in ASCII format using the RS-232 protocol. The logic signal is the falling edge of a CMOS electrical signal but could just as easily be any of a number of other electrical signals. The character emitted is fixed in the circuit and has no significance - the important quantity is when the character occurs because that time signals that the chosen event has taken place. The software program waits and waits and waits until it is notified that the character has been received at the serial port and then it snaps into action by recording the time that it received the notification.

A Little Hardware

The schematic of the logic translator circuit is shown in the first figure. It consists of three ICs and runs on a single 5VDC power supply. It draws a little over 10mA so it can draw its power easily from the computer supply.

Logic Convertor Schematic


The LM555 IC generates a square wave which serves as the baud rate generator; it runs at sixteen times the desired baud rate. The IM6402 is a rather old CMOS UART (Universal Asynchronous Receiver Transmitter) IC of which I happen to have a handful. They are probably hard to obtain in this day and age so if you're interested in building something like this be prepared to improvise. Perhaps a more timely strategy is to use a PIC or STAMP micro-controller to perform this function - they are inexpensive, plentiful, and easy to program. In any event (no pun intended) the UART generates a bit stream corresponding to a single character whenever pin 23 is brought to ground. The character is selected by the logic level of eight input pins. I used a small 8-pole DIP switch to permit easy changes to the selected character but have never used any value other than 00001010 (binary), or 012 (octal), or 10 (decimal), or 0A (hexadecimal), or <LF>, or "\n" (depending upon your number system or language) which is a "new line" character in UNIX parlance. The MAX236 IC is a CMOS to RS-232 convertor which requires only a single +5VDC supply. It has its own internal positive and negative DC power convertors which permit it to deliver true RS-232 drive levels. An image of the convertor follows.

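
The several spellings of the DIP switch setting given above all name the same byte. A minimal Perl sketch (not part of the original circuit or program, and assuming an ASCII system such as UNIX) confirms it using Perl's binary, octal, decimal, and hexadecimal literal forms:

```perl
use strict;
use warnings;

# Four ways of writing the value set on the DIP switch.
my @spellings = (0b00001010, 012, 10, 0x0A);

# Each one should equal the code of the line feed character, "\n" on UNIX.
my $all_same = 1;
for my $value (@spellings) {
    $all_same = 0 unless $value == ord("\n");
}

# Show the same number in each of the four bases.
my $summary = sprintf "binary %08b, octal %03o, decimal %d, hex %02X", (10) x 4;
print "$summary\n";
print $all_same ? "all four are <LF>\n" : "mismatch\n";
```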

Looking at the Disk File Format

This is a good time to describe exactly how the data will be organized on the disk so that some of the programming logic will be more clear. There are a few basic rules:

  • All the data files will be contained in a single disk directory.
  • Each data file will correspond to one calendar day's worth of events, irrespective of how many events occur in that day.
  • Each data file's name will be the calendar date to which it corresponds. The file name will have the format: ccyy.mm.dd_UT.
    • ccyy - two decimal digits of century & two decimal digits of year (no Y3K problem here).
    • mm - two decimal digits of month with leading zero padding.
    • dd - two decimal digits of day with leading zero padding.
    • The string "UT" indicating that the date is in Coordinated Universal Time, basically the successor to Greenwich Mean Time.
    • The date fields are separated by the period . character and the "UT" string is set off by the underscore _ character.
  • Each data file's contents consist of one or more "comment" lines containing meta-information and one or more event records.
  • Each event record exists in the appropriate file as a single line containing the time of occurrence of the event in the format: ccyy.mm.dd hh:mm:ss UT.
    • The first three fields are in the same format as the file name.
    • hh - two decimal digits of the hour, in twenty-four hour format with leading zero padding.
    • mm - two decimal digits of minute with leading zero padding.
    • ss - two decimal digits of second with leading zero padding.
    • The string "UT" indicating that the time is in Coordinated Universal Time, basically the successor to Greenwich Mean Time.
    • Note that these fields are separated by different characters.

These formats may seem odd, convoluted, and generally perverse but they do serve to organize the data in a form which is convenient, human-readable, compact (while being human-readable), unique, and searchable. Since a new file is written every day, a particular data file isn't open forever and separate data mining or data reaping programs can go after a particular day's data once without having to (re-)read other days' data. Since the data are written as text, mere mortals can inspect the data without special utilities or x-ray vision. Since the data have not been binned (beyond the one second level) they can be analyzed and re-analyzed at any time over various time windows. Here is a sample of the contents of a particular data file. It was written by an earlier version of the script, which is why its time tag reads MT rather than UT.


			# Created with script version: 20011222
			# file name: 2002.01.06_MT
			2002.01.06 00:09:10 MT
			2002.01.06 09:02:40 MT
			2002.01.06 09:50:12 MT
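
Because each record is plain text in a fixed layout, reading a day's file back and binning it after the fact is straightforward. The following is a minimal sketch and not part of the original program: the subroutine names parse_record and count_by_hour, the hourly bin size, and the sample lines are all my own illustrative choices. It skips comment lines, converts each record to UNIX time with the core Time::Local module, and counts events per hour. The pattern accepts any two-letter time tag, but the conversion is only meaningful for records written in UT.

```perl
use strict;
use warnings;
use Time::Local qw(timegm);

# Convert one event record into UNIX time. Returns undef for comment
# lines and for anything else that does not look like a record.
sub parse_record {
    my ($line) = @_;
    return if $line =~ /^\s*#/;
    my ($year, $mon, $day, $hour, $min, $sec) =
        $line =~ /^\s*(\d{4})\.(\d{2})\.(\d{2}) (\d{2}):(\d{2}):(\d{2}) [A-Z]{2}\s*$/
        or return;
    # timegm expects a zero-based month; a four-digit year is taken as-is
    return timegm($sec, $min, $hour, $day, $mon - 1, $year);
}

# Count events per hour of the day (UT) from a list of lines.
sub count_by_hour {
    my @counts = (0) x 24;
    for my $line (@_) {
        my $epoch = parse_record($line);
        next unless defined $epoch;
        $counts[ (gmtime($epoch))[2] ]++;
    }
    return @counts;
}

# Made-up lines in the file format described above.
my @sample = (
    "# Created with script version: 20031111\n",
    "# file name: 2003.11.02_UT\n",
    "2003.11.02 00:09:10 UT\n",
    "2003.11.02 09:02:40 UT\n",
    "2003.11.02 09:50:12 UT\n",
);
my @counts = count_by_hour(@sample);
print "hour 09: $counts[9] events\n";
```

Changing the bin from one hour to, say, ten minutes only requires changing how the index into @counts is computed, which is the benefit of storing unbinned event times.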

A Little Software

Now let's take a look at the Perl code which will suck in those characters emitted by the hardware and write disk files containing the timing of the events.


			#! /usr/bin/perl
			# ----------------------------------------------------------------------------------
			# particle_log, by Joseph A. DiVerdi
			# Copyright 2001 by La Famiglia DiVerdi
			# Copyright 2002, 2003 by XTR Systems, LLC
			#
			# Program to slurp in data from the particle counter, process it into a
			#  standardized format, and archive it to disk.
			#
			# Revision History:
			# created   1 Dec 2001 JAD
			# ----------------------------------------------------------------------------------
			# Includes and other external modules
				use warnings;
				use strict;
				use Carp;
			# ----------------------------------------------------------------------------------
			# main execution module
				my $version = "20031111";
				
				# directory_name must end with a slash
				my $directory_name = "/home/diverdi/html/event_data/";
				my $port_name = "/dev/ttyS1";
				
				# open the serial port for read
				open INPUT, "<", $port_name or
					die "Can't open serial port '$port_name': $!\n";
				
				# define this variable to prevent "strict" complaints but leave it undefined
				my $file_name;
				
				# look for a line of serial data terminated with a <NL>
				while (<INPUT>) {
					# save a copy of the current time which is when the event is received
					my $current_time = time;
					
					# check if the current file name is defined or if it is defined but it doesn't correspond to the current day
					if (!defined $file_name or $file_name ne format_file_name($current_time)) {	
						# set up the now current log file name and full path
						$file_name = format_file_name($current_time);
						my $file_path = $directory_name . $file_name;
						
						# issuing an open on DATA will automatically close an existing open data file
						open DATA, ">>" . $file_path or
							die "Can't open log file '$file_path': $!\n";
						# change the permissions of the file to owner: read, write; group: read; world: none
						chmod 0640, $file_path;
						
						# set output flushing for the DATA file handle
						# that is do NOT buffer disk data, write it to disk immediately
						select DATA;
						$| = 1;
						
						# put the header in this data file if the file is new (still empty)
						print DATA "# Created with script version: $version\n# file name: $file_name\n" 
							unless -s $file_path;
					}
					
					# put the time the event occurred in the data file in a human readable format
					my @times = gmtime $current_time;
					printf DATA "%04d.%02d.%02d %02d:%02d:%02d UT\n", 
						$times[5] + 1900, $times[4] + 1, $times[3], $times[2], $times[1], $times[0];
				}
			# ----------------------------------------------------------------------------------
			sub format_file_name {
				
				# convert the supplied argument, in Unix time, into a nicely formated string
				# such as: 2003.11.02_UT
				my @times = gmtime shift;
				return sprintf "%04d.%02d.%02d_UT", $times[5] + 1900, $times[4] + 1, $times[3];
			}
			# ----------------------------------------------------------------------------------

As in the previous code examples of this series, the beginning of the program conforms to some requirements and to some good programming standards. The first line tells the operating system that this is a Perl program and to execute it as such. There's an abbreviated comment section describing the function of the program and its heritage - it has been abbreviated for this publication and should contain more explanatory detail in general. There are a few "includes" which provide a standardized and somewhat rigorous programming environment (I need all the help I can get to make me write better code). The "Carp" module is a new one for us. It provides a more detailed set of error messages which you'll appreciate when (not if) something goes wrong.

The first executable statements are variable assignment statements. The version number is identified using a calendar date format and is always written to the disk file contents so that the file's format can be traced to a particular program version. The directory name specifies the directory which will contain the various data files. The port name contains the name of the UNIX serial device where the characters will be received. You'll note that we open, close, read, and write to a device in exactly the same fashion as we operate on disk files which is fundamental to the UNIX philosophy.

The first open statement connects the program to the serial port; the "<" argument signifies that this connection will be for reading (as opposed to writing). The rest of this pair of statements, "open ... or die ...", is a very popular Perl idiom which deserves a little attention because of that popularity. It is known in the programming world as "short-circuiting" the "or" operator. You see, the open statement returns a value of true or false depending upon whether it was able to open the device or not. This return value is the first argument of the boolean or operator. Since the result of an or is true when either or both of its arguments is true, if the first argument is true then there is no need to evaluate the second argument and it is never executed. If the first argument is false (because the device couldn't be opened) then the second argument must be evaluated to determine the result. The second argument, however, is a die statement which is actually the combination of a print and an exit statement. If it is executed then a message is reported and the program terminates immediately.
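
The short-circuit behavior is easy to see in isolation. In this small sketch (the subroutine names succeed and fail and the device path are invented for the illustration) the right-hand side of or runs only when the left-hand side returns false:

```perl
use strict;
use warnings;

my @log;    # records which right-hand sides were actually evaluated

sub succeed { return 1 }
sub fail    { return 0 }

succeed() or push @log, "after succeed";    # skipped: left side is true
fail()    or push @log, "after fail";       # runs: left side is false

print "evaluated: @log\n";

# The same rule drives "open ... or die ...". Wrapping the attempt in
# eval lets us look at the message instead of ending the program.
my $missing = "/no/such/device";    # hypothetical path, assumed not to exist
my $ok = eval {
    open my $fh, "<", $missing or die "Can't open '$missing': $!\n";
    1;
};
my $error = $@;
print "open failed as expected: $error" unless $ok;
```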

The bulk of the program is a while loop. Note that reading from the INPUT file handle is the loop's test. The neat feature of this construction is that execution stalls at this point until a character corresponding to the end of a line (<NL>) is received. So the program just sits there, consuming essentially no processor time, until the character arrives, at which point the loop body runs. After the loop contents are executed, control returns to this point and the program awaits a new character. The reason for this behavior is beyond the scope of these articles but involves neato programming principles such as signals and blocking I/O.

The first task inside the loop is to capture the current time, that is, the time when the event occurred. The time function returns the current time as so-called UNIX time, that is, the integer number of seconds since the beginning of the UNIX epoch which occurred at the start of January 1, 1970 UT. This now very large number is a very convenient way of keeping track of clock time but only because a large number of functions are available to manipulate it. It is also important to note that there are some limitations to the technique used here. First of all, the timing of an event cannot be specified below the one second level. There are high-resolution time functions which provide microsecond resolution but we'll save those for another time. Second, the basic, out-of-the-box, run-of-the-mill UNIX is not real-time UNIX. Since UNIX is a multi-tasking operating system it is possible that some other task will not relinquish computer resources for some finite time, which can distort our timing and make it appear as if a particular event occurred later than it actually did. There are variants known as Real-Time UNIX which can be made strictly deterministic but I avoid these techniques by ensuring that the data acquisition computer is reserved for data acquisition and no one plays any games on it. In other words, this problem is addressed by throwing hardware at it.
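
As a preview of the finer resolution mentioned above, here is a hedged sketch using the core Time::HiRes module, which the program in this article does not use. Importing its time function replaces the built-in with one that returns fractional seconds; the helper name format_stamp and the extra microsecond field in the record are my own assumptions. Note that a finer time stamp does nothing about scheduling delays, it only records more digits.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);    # time() now returns fractional seconds

# Format a (possibly fractional) UNIX time as ccyy.mm.dd hh:mm:ss.uuuuuu UT
sub format_stamp {
    my ($epoch) = @_;
    my $whole  = int $epoch;
    my $micros = int( ($epoch - $whole) * 1_000_000 + 0.5 );
    # rounding can carry into the next whole second
    if ($micros >= 1_000_000) { $whole++; $micros -= 1_000_000 }
    my @t = gmtime $whole;
    return sprintf "%04d.%02d.%02d %02d:%02d:%02d.%06d UT",
        $t[5] + 1900, $t[4] + 1, $t[3], $t[2], $t[1], $t[0], $micros;
}

print format_stamp(0), "\n";       # the start of the UNIX epoch
print format_stamp(time), "\n";    # the present moment
```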

The next series of statements ensures that a disk file corresponding to the day of the current event is open and ready for writing, that the appropriate comment lines have been written to that file, and that the current event information is written to it immediately, without buffering. Buffering is a technique used to make disk access more efficient but it is a nuisance to us in this application.
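
One side effect of the select DATA statement is that every later bare print also goes to the data file. A possible alternative, shown here as a sketch rather than as the program's actual method, uses a lexical file handle and the autoflush method from the core IO::Handle module, which turns off buffering for that one handle only. The subroutine name open_log, the temporary directory, and the sample lines are assumptions made for the demonstration.

```perl
use strict;
use warnings;
use IO::Handle;                 # provides the autoflush method
use File::Temp qw(tempdir);

# Open a log file for appending, unbuffered, writing the header if empty.
sub open_log {
    my ($path, $header) = @_;
    open my $fh, ">>", $path or die "Can't open log file '$path': $!\n";
    $fh->autoflush(1);
    print {$fh} $header unless -s $path;
    return $fh;
}

my $dir  = tempdir(CLEANUP => 1);    # scratch directory, removed at exit
my $path = "$dir/2003.11.02_UT";
my $log  = open_log($path, "# file name: 2003.11.02_UT\n");
print {$log} "2003.11.02 09:02:40 UT\n";

# The handle is still open, yet both lines are already visible in the file.
open my $check, "<", $path or die "Can't read '$path': $!\n";
my @lines = <$check>;
close $check;
print "lines in file: ", scalar(@lines), "\n";
```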

Actual Experimental Data

I have used this setup, along with a pair of Geiger-Mueller tubes and a bunch more electronics, to capture radioactive events and cosmic ray events. Since this installment is already longer than I would like, those experimental data will appear in a separate article.

SAS Member Unix Accounts For Learning Perl

If you're an SAS member in good standing and are interested in trying out some Perl programming but don't have access to it, I'll be happy to spare a few CPU cycles on one of my servers and provide an account. From this account you can edit and run Perl programs but must not do evil network things. The whole story is spelled out on the application page. Read the rules of engagement, fill in the form, and I'll do the rest (including checking with SAS HQ to see if you've been naughty or nice). You'll receive your login information via email shortly thereafter.


Joseph DiVerdi is keeping an eye on those pesky cats using technology. Contact him at diverdi@xtrsystems.com.