linGNUistics

vrijdag 20 november 2009

bin-search for Python

Find function for sorted lists. In a sorted list we can do better than using 'find' which must look at each element in the array in turn. In this function we jump right in the middle of the array. If the middle element is the one we want, we're done. Otherwise, we search in the left or the right half of the vector. This depends on whether the object was less than or greater than the middle element. This example is taken from Paul Graham's 'ANSI Common Lisp', ch 4.

-----------------------

def binSearch(obj, vec):

    l = len(vec)

    if l != 0:

        return finder(obj, vec, 0, l-1)

def finder(obj, vec, start, end):

    r = end - start

    if r == 0:

        if obj == vec[int(start)]:

            return obj

        else:

            return 0

    else:

        mid = start + round(r / 2)

        obj2 = vec[int(mid)]

        if obj < obj2:

            return finder(obj, vec, start, mid-1)

        else:

            if obj > obj2:

                return finder(obj, vec, mid+1, end)

            else:

                return obj

Breadth first python script

Simple implementation of 'breadth-first search'. This implementation is based on the Common Lisp version in Paul Graham's 'ANSI Common Lisp'.

'm' is a network. We can try to find the shortest path from 'a' to 'd' with:

> shortestPath('a', 'd', m)

----------------------------------------------

m = [['a', 'b', 'c'], ['b', 'c'], ['c', 'd']]

def shortestPath(start, end, network):

    return bfs(end, [[start]], network)

def bfs(end, queue, network):

    if queue == []:

        return 0

    else:

        print queue

        path = queue[0]

        print path

        node = path[0]

        print node

        if node == end:

            return path[::-1]

        else:

            return bfs(end, queue[1:] + newPaths(path, node, network), network)

def newPaths(path, node, network):

    return map(lambda n: [n] + path, assoc(node, network)[1:])

def assoc(key, lst):

    pair = lst[0]

    if key == pair[0]:

        return pair

    else: return assoc(key, lst[1:])

zondag 1 november 2009

check dit: http://orgmode.org/

woensdag 16 september 2009

Simpel uniq-script

infile = open(raw_input("Bestand: "), "r")

bestand = infile.readlines()

infile.close()

data = []

for line in bestand:

        if line not in data:

            data.append(line)

output = open(raw_input("Output: "), "w")

for regel in data:

    output.write(regel)

    output.close()

zondag 5 juli 2009

Ok, een beetje laat, maar hierbij dan ook mijn eerste post. Dit is een scriptje (perl) om key words in context (KWICs) te vinden in een tekstbestand, waarbij je zelf kan aangeven hoeveel woorden je er om heen wil (links en rechts). De output is in html, dus die kan je makkelijk in je browser inlezen.
In bestand X vind je dus woord Y met p woorden links en q rechts door in je terminal de opdracht te geven:


$ perl kwic.plx Y X p q

Morgen komt mijn bigram-collocatie-programma!



#! /usr/bin/perl
use warnings;
use strict;

## Author: Barend Beekhuizen
## Exercise 1.7 of Manning and Schütze's book (Statistical NLP). A KWIC-programme that outputs in html. Arguments of the command are (1) search query (word) as a reg.exp. (2) file name of corpus (3) context left in words (4) context right in words.

# This reads the arguments from the command line and assigns them as values to the four variables
my $searchQuery = shift;
my $corpus = shift;
my $contextLeft = shift;
my $contextRight = shift;

# This initiates three arrays; the first two as the windows of the context, the last one the entire corpus
my @contextLList;
my @contextRList;
my @corpusWords;

# This opens the corpus we declared
open CORPUS, "< $corpus" or die $!;

# The first string of html-code: initiating a html-script, describing the query and starting a table
print  "<html>\n<body>\n",
 "<h4>KWIC for <i>$searchQuery</i> in \"$corpus\" ",
 "with contexts of $contextLeft left and $contextRight right</h4>\n",
 "<table><table border=\"0\"\ncellspacing=\"10\">";

# preprocessing the corpus: spacing all punctuation marks 
while (<CORPUS>) 
 { 
 s/\./ ./g; s/\,/ ,/g; s/;/ ;/g; s/\:/ :/g; s/\?/ ?/g; s/\!/ !/g; s/"/ "/g; s/'/ '/g;  
 $/ = " ";
 push @corpusWords, $_
 };

# going over the corpus
foreach my $i (1..@corpusWords-1) 
 {
 $/ = " ";
 push @contextLList, $corpusWords[$i-1];
 if (@contextLList > $contextLeft) {shift @contextLList};
 push @contextRList, $corpusWords[$i+$contextRight];
 if (@contextRList > $contextRight) {shift @contextRList};
 if ($corpusWords[$i] =~ /\b$searchQuery\b/i) 
  {
  my $hitLeft = "@contextLList";
  my $hitRight = "@contextRList";
  my $hit = $corpusWords[$i];
  print "<tr>\n<td align=\"right\">$hitLeft</td>\n<td align = \"center\">$hit</td>\n<td align = \"left\">$hitRight</td>\n</tr>"
  };
 };

# final string of html
print "</table>\n</body>\n</html>\n";

maandag 29 juni 2009

N(atural) L(anguage) T(ool)K(it)

Een aanrader: www.nltk.org.

De Natural Language TookKit is een set python modules voor allerlei taalkundige taken. Bij de modules komt ook een uitstekend boek waarin zowel veel basisvaardigheden voor het programmeren in Python als meer geavanceerde toepassingen aan bod komen.

donderdag 18 juni 2009

Emacs #1

Gnu Emacs is eenvoudigweg de beste teksteditor (OS?) die er is (zo!). Je kunt het zo gek niet verzinnen, of het is mogelijk in Emacs. Ik zit nog middenin de (rather steep) learning curve en wil graag mijn ervaringen met jou, Barend, en de wereld, delen.

Vandaag een aantal kleine handige trucjes.

De mate van productiviteit die je met Emacs kunt bereiken hangt vooral af van hoe snel je de vele honderden key-bindings weet te leren. Deze cheat-sheet kan dat proces wellicht een beetje versnellen.
Het werken met meerdere schermen binnen Emacs is natuurlijk al fantastisch. Maar wat als je een bepaalde schermconfiguratie in z'n geheel kan switchen naar een andere? Kijk hier, installeer (Debian Linux: apt-get install elscreen) en voeg de volgende regels toe aan je .emacs:

(require 'elscreen)
;(setq elscreen-display-tab nil) ; haalt tab-lijst weg
In emacs kun je je met C-x-o verplaaten naar andere schermen. Een intuïtievere manier is je met de pijltjes door de verschillende schermen te bewegen. Ok, dan schrijven we daar toch een stukje code voor! :) Alles wat je moet doen is:

(windmove-default-keybindings 'meta)

toevoegen aan je .emacs. Nu kun je je met Alt + pijltje verplaatsen van scherm naar scherm.

Voor nu tot hier.