LibraryThing and Sirsi/Dynix @ ACPL Part II

OK, so it is time to clean up the file. Below is the Perl code that I wrote to solve this problem. I have tried to comment it well enough to be easy to understand, partly out of self-interest, as I might need to look at it again in the future.

#!/usr/bin/perl
###############################################################################
# Program : lt_parser.pl
# Programmer : Sean Robinson
# Date : 4.19.2008
# Description : This program takes the output from the Sirsi command
# selitem -oC | selcatalog -iC -oe -e020,245,100 > ISBN_extract.out
# and parses it so that the data is clean for importing into LibraryThing
#
# Version : 0.01
#
#
# Note : I recommend testing this code by taking the ISBN_extract.out
# file and extracting 100 lines and then running the code against that reduced
# data file as a test. The command I used is
# >tail -100 ISBN_extract.out > testdata.out
#
#
###############################################################################

###
# I have explicitly defined the path where my data is; you would need to change
# this location.
###
open (FH, "</home/srobinson/dev/librarything/ISBN_extract.out") || die "Cannot open ISBN_extract.out: $!";

###
# defining variables
###

my $isbn;
my $title;
my $author;

###
# Looping through the file you have just opened line by line
###
while (<FH>) {

# Assigns the input line $_ to the variable $input
$input = $_;

# Splitting on the pipe symbol. This is a rough way to get the three pieces of data.
# The first element is isbn the second is title and the third is author
@values = split(/\|/, $input);

# isbn
$isbn = $values[0];

# title
$title = $values[1];

# author
$author = $values[2];

# set of rules to get a clean isbn from the data

# split on blank space after isbn number
@isbnsplit1 = split(/ /, $isbn);
$f_isbn = $isbnsplit1[0];

# split on ( after isbn
@isbnsplit2 = split(/\(/, $f_isbn);
$f_isbn = $isbnsplit2[0];

# checking that the string length is either 10 or 13 characters
$str_length = length($f_isbn);

if (($str_length == 10) || ($str_length == 13)) {
# print "String length is $str_length ";
}

# if shorter than 10 or longer than 13 characters, assign the value BLANK
if (($str_length < 10) || ($str_length > 13)) {
# print "String length is $str_length ";
$f_isbn = "BLANK";
# print "ISBN is $f_isbn \n";
}

# checking for colon at end of isbn number and removing it
# (chop returns the character it removed, so its result must not be assigned back)
if ($str_length == 11) {
chop($f_isbn);
}

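# skip records whose isbn still contains a hyphen, whitespace, or a colon, or is BLANK;
# print everything else as tab-separated isbn, title, author

if (($f_isbn=~/\-/) || ($f_isbn=~/\s+/) || ($f_isbn=~/\:/) || ($f_isbn=~/BLANK/) ) {
} else {
$isbn = $f_isbn;
print "$isbn\t$title\t$author\n";
}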

}
# end while
# end of code
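
As an aside, the ISBN rules in the script (keep the text before the first space or parenthesis, drop a trailing colon, check the length) could probably be condensed into one small function. The following is only a sketch and not part of lt_parser.pl: the name clean_isbn and the sample values are made up for illustration, and it is deliberately stricter than the script, since it assumes that only 10 characters (digits, with X allowed in the last position) or 13 digits should count as a usable ISBN.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of lt_parser.pl): returns a cleaned ISBN,
# or undef when the raw field does not yield a usable one.
sub clean_isbn {
    my ($raw) = @_;
    return unless defined $raw;

    # Keep only the text before the first space or opening parenthesis,
    # e.g. "0123456789 (pbk.)" becomes "0123456789".
    my ($isbn) = split /[ (]/, $raw;
    return unless defined $isbn;

    # Drop a single trailing colon, e.g. "0123456789:" becomes "0123456789".
    $isbn =~ s/:\z//;

    # Assumption: accept only 10 characters (last may be X) or 13 digits.
    return $isbn if $isbn =~ /\A(?:\d{9}[\dXx]|\d{13})\z/;
    return;
}

# Illustrative sample values only, not real catalog data.
for my $raw ('0123456789 (pbk.)', '0123456789:', '9780123456786', '0-12-345678-9') {
    my $isbn = clean_isbn($raw);
    print defined $isbn ? "$isbn\n" : "skipped: $raw\n";
}
```

With a helper like this, the body of the while loop would only need to call it once and skip the record when nothing comes back.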

At the command line I just run

>./lt_parser.pl > hold.out

This code just parses the file ISBN_extract.out and redirects the output to hold.out, which is the file that you will upload to LibraryThing. I have uploaded the file and am just waiting for it to index. I emailed them as it seemed to be taking a while, and they told me that this is normal and that it would probably be done overnight. The only question I have is that I added a \n (newline) at the end of each record, and I do not know if this will cause a problem. We will just have to wait and see. I have to say that working with LibraryThing has been a joy. Sonya has been great, and I am excited about this project.
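
On the newline question: a line read from a filehandle in Perl keeps its own trailing newline unless it is removed with chomp. So if the author is the last thing on the input line, the author field would already end in a newline and the printed record would end in two. The sketch below shows one way to guarantee exactly one newline per record. The function name format_record and the sample line are hypothetical, and the exact layout of the ISBN_extract.out lines is an assumption here.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: turn one "isbn|title|author" line into a single
# tab-separated record that ends in exactly one newline.
sub format_record {
    my ($line) = @_;
    chomp $line;    # drop the newline that came in with the input line
    my ($isbn, $title, $author) = split /\|/, $line;
    # Treat missing fields as empty strings so the join never warns.
    return join("\t", map { defined $_ ? $_ : '' } ($isbn, $title, $author)) . "\n";
}

# Illustrative sample line only, not real catalog data.
print format_record("0123456789|A title /|An author\n");
```

The same function also copes with a line that ends in an extra pipe, because any field after the third is simply ignored.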
