LibraryThing and Sirsi/Dynix @ ACPL Part II
- November 11th, 2008
- By srobinson
OK, so it is time to clean up the file. Below is the Perl code that I wrote to solve this problem. I have tried to comment it well enough that it is easy to understand, mostly out of self-interest, as I might need to look at it again in the future.
#!/usr/bin/perl
###############################################################################
# Program : lt_parser.pl
# Programmer : Sean Robinson
# Date : 4.19.2008
# Description : This program takes the output from the sirsi command
# selitem -oC | selcatalog -iC -oe -e020,245,100 > ISBN_extract.out
# and parses it so that the data is clean for importing into LibraryThing
#
# Version : 0.01
#
#
# Note : I recommend testing this code by taking the ISBN_extract.out
# file and extracting 100 lines and then running the code against that reduced
# data file as a test. The command I used is
# >tail -100 ISBN_extract.out > testdata.out
#
#
###############################################################################
###
# The path to my data file is hard-coded here; you will need to change
# this location to match your own system.
###
open(FH, '<', '/home/srobinson/dev/librarything/ISBN_extract.out') || die "Cannot open ISBN_extract.out: $!";
###
# defining variables
###
my $isbn;
my $title;
my $author;
###
# Looping through the file you have just opened line by line
###
while (<FH>) {
    # Assign the input line $_ to the variable $input
    my $input = $_;
    # Remove the trailing newline so it cannot end up inside the last field
    chomp($input);
    # Split on the pipe symbol. This is a rough way to get the three pieces of data.
    # The first element is the isbn, the second is the title and the third is the author
    my @values = split(/\|/, $input);
    # isbn
    $isbn = $values[0];
    # title
    $title = $values[1];
    # author
    $author = $values[2];
    # set of rules to get a clean isbn from the data
    # split on blank space after isbn number
    my @isbnsplit1 = split(/ /, $isbn);
    my $f_isbn = $isbnsplit1[0];
    # split on ( after isbn
    my @isbnsplit2 = split(/\(/, $f_isbn);
    $f_isbn = $isbnsplit2[0];
    # check whether the string length is either 10 or 13 characters
    my $str_length = length($f_isbn);
    if (($str_length == 10) || ($str_length == 13)) {
        # print "String length is $str_length ";
    }
    # if shorter than 10 or longer than 13 characters, assign the value BLANK
    if (($str_length < 10) || ($str_length > 13)) {
        # print "String length is $str_length ";
        $f_isbn = "BLANK";
        # print "ISBN is $f_isbn \n";
    }
    # an 11-character value is assumed to be a 10-character isbn plus a
    # trailing colon, so remove the last character. chop returns the
    # character it removed, so its result must not be assigned to $f_isbn.
    if ($str_length == 11) {
        chop($f_isbn);
    }
    # skip any record whose isbn still contains a dash, whitespace or a
    # colon, or was marked BLANK; print everything else tab-delimited
    if (($f_isbn=~/\-/) || ($f_isbn=~/\s+/) || ($f_isbn=~/\:/) || ($f_isbn=~/BLANK/) ) {
    } else {
        $isbn = $f_isbn;
        print "$isbn\t$title\t$author\n";
    }
}
# end while
# end of code
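The cleanup rules in the script can also be collapsed into one small function. What follows is a minimal sketch, not the script above: the name clean_isbn is hypothetical, and it assumes the ISBN field looks something like "0123456789 (pbk.)" or "9780123456786:", that is, the number first, followed by optional qualifiers. It is also stricter than the script, since it insists on digits (with an optional X as the last character of a 10-character ISBN).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of lt_parser.pl): returns a clean ISBN, or
# undef when the field does not start with a plausible 10- or 13-character ISBN.
sub clean_isbn {
    my ($field) = @_;
    return unless defined $field;
    # Keep everything before the first whitespace, "(" or ":"
    my ($candidate) = $field =~ /^\s*([^\s(:]+)/;
    return unless defined $candidate;
    # Accept 10 characters (9 digits plus a digit or X) or 13 digits
    return $candidate if $candidate =~ /^(?:\d{9}[\dXx]|\d{13})$/;
    return;
}

# Small demonstration with made-up field values
for my $field ('0123456789 (pbk.)', '9780123456786:', '0-12-345678-9', '12345') {
    my $isbn = clean_isbn($field);
    print defined $isbn ? "$isbn\n" : "BLANK\n";
}
```

Returning undef instead of the string BLANK keeps the "no usable ISBN" case out of the data itself, so the caller decides whether to skip or flag the record.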
At the command line I just run
>./lt_parse.pl > hold.out
This just parses the file ISBN_extract.out and redirects the output to hold.out, which is the file that you will upload to LibraryThing. I have uploaded the file and am waiting for it to index. It seemed to be taking a while, so I emailed them; they told me that this is normal and that it would probably be done overnight. The only question I have is whether the \n (newline) I added at the end of each record will cause a problem. We will just have to wait and see. I have to say that working with LibraryThing has been a joy. Sonya has been great, and I am excited about this project.
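One way to look into the newline question before uploading is to scan hold.out for blank lines and for records that do not have exactly three tab-separated fields. This is only a sketch under assumptions: the helper name record_problem and the script name check_hold.pl are made up for illustration, and it assumes that titles and authors never contain a tab themselves. It says nothing about what LibraryThing itself accepts.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns a short description of what is wrong with
# one output line, or an empty string when the line looks fine.
sub record_problem {
    my ($line) = @_;
    $line =~ s/\r?\n\z//;                 # drop the line ending
    return 'blank line' if $line eq '';
    my @fields = split(/\t/, $line, -1);  # -1 keeps trailing empty fields
    return 'expected 3 fields, found ' . scalar(@fields) if @fields != 3;
    return 'empty ISBN' if $fields[0] eq '';
    return '';
}

# Usage: ./check_hold.pl hold.out   (script name is illustrative)
if (@ARGV) {
    while (my $line = <>) {
        my $problem = record_problem($line);
        print "line $.: $problem\n" if $problem ne '';
    }
}
```

If this prints nothing, every record is a single line of isbn, title and author, and the one newline per record is doing its job as a record separator.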