Sunday 5 July 2009

KWIC

OK, a bit late, but here is my first post. This is a small Perl script that finds key words in context (KWICs) in a text file; you can specify how many words of context you want on each side (left and right). The output is HTML, so you can easily open it in your browser.
To find word Y in file X with p words of context on the left and q on the right, run this command in your terminal:

$ perl kwic.plx Y X p q


Tomorrow: my bigram collocation program!


#! /usr/bin/perl
use warnings;
use strict;

## Author: Barend Beekhuizen
## Exercise 1.7 of Manning and Schütze's book (Statistical NLP). A KWIC-programme that outputs in html. Arguments of the command are (1) search query (word) as a reg.exp. (2) file name of corpus (3) context left in words (4) context right in words.

# This reads the arguments from the command line and assigns them as values to the four variables
my $searchQuery = shift;
my $corpus = shift;
my $contextLeft = shift;
my $contextRight = shift;

# This initiates three arrays; the first two as the windows of the context, the last one the entire corpus
my @contextLList;
my @contextRList;
my @corpusWords;

# This opens the corpus we declared
open(CORPUS, '<', $corpus) or die "Cannot open '$corpus': $!\n";

# The first string of HTML code: opening the document, describing the query and starting a table
print "<html>\n<body>\n",
"<h4>KWIC for <i>$searchQuery</i> in \"$corpus\" ",
"with contexts of $contextLeft left and $contextRight right</h4>\n",
"<table border=\"0\" cellspacing=\"10\">\n";

# Preprocessing the corpus: put a space before every punctuation mark,
# then split each line into words on whitespace
while (my $line = <CORPUS>)
{
    $line =~ s/([.,;:?!"'])/ $1/g;
    push @corpusWords, split ' ', $line;
}
close CORPUS;

# Going over the corpus: for every word that matches the query, collect up to
# $contextLeft words before it and up to $contextRight words after it
foreach my $i (0..$#corpusWords)
{
    next unless $corpusWords[$i] =~ /\b$searchQuery\b/i;
    # clamp the window to the bounds of the corpus
    my $first = $i - $contextLeft;
    $first = 0 if $first < 0;
    my $last = $i + $contextRight;
    $last = $#corpusWords if $last > $#corpusWords;
    @contextLList = @corpusWords[$first .. $i-1];
    @contextRList = @corpusWords[$i+1 .. $last];
    my $hitLeft = "@contextLList";
    my $hitRight = "@contextRList";
    my $hit = $corpusWords[$i];
    print "<tr>\n<td align=\"right\">$hitLeft</td>\n<td align=\"center\">$hit</td>\n<td align=\"left\">$hitRight</td>\n</tr>\n";
}

# final string of html
print "</table>\n</body>\n</html>\n";
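
The windowing step can also be pulled out of the HTML output, which makes it easier to check on a tiny example. What follows is only a minimal sketch and not part of the script above: the subroutine names `tokenize`, `kwic_hits` and `escape_html` are hypothetical, and the escaping is an addition, on the assumption that the corpus or the query may contain characters such as `<` or `&` that would otherwise break the page.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: put a space before each punctuation mark,
# then split the text into words on whitespace.
sub tokenize {
    my ($text) = @_;
    $text =~ s/([.,;:?!"'])/ $1/g;
    return split ' ', $text;
}

# Hypothetical helper: returns one [left, hit, right] triple per match,
# with the context windows clamped to the bounds of the word list.
sub kwic_hits {
    my ($query, $left, $right, @words) = @_;
    my @hits;
    for my $i (0 .. $#words) {
        next unless $words[$i] =~ /\b$query\b/i;
        my $first = $i - $left;
        $first = 0 if $first < 0;
        my $last = $i + $right;
        $last = $#words if $last > $#words;
        push @hits, [
            join(' ', @words[$first .. $i - 1]),
            $words[$i],
            join(' ', @words[$i + 1 .. $last]),
        ];
    }
    return @hits;
}

# Hypothetical helper: escape the characters that are special in HTML.
sub escape_html {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    $s =~ s/"/&quot;/g;
    return $s;
}

# Small demonstration with two words of context on each side.
my @words = tokenize("The cat sat on the mat. The dog sat too.");
for my $hit (kwic_hits('sat', 2, 2, @words)) {
    print join(' | ', map { escape_html($_) } @$hit), "\n";
}
# prints:
# The cat | sat | on the
# The dog | sat | too .
```

Returning the hits as data, instead of printing them inside the loop, means the same list could be written out as HTML table rows or as plain text.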

2 comments: