You are viewing [info]salmoni's journal

Apr. 8th, 2012

Launched!

It's go!

http://crowdsorters.com - card sorting: crowd sourced.

Nov. 30th, 2011

Roistr into BizSpark

Our application to BizSpark was accepted by MS with a quick turnaround - pretty much the same day, which was faster than I expected. I had a quick browse of the available software but there wasn't much of interest yet, particularly as Roistr is built on an open source stack.

However, it's a nice step. Our main interest in BizSpark is the connections and marketing that MS can bring rather than cheap software.

Oct. 10th, 2011

Released

So a public announcement here - I've launched Roistr, the online semantic relevance engine!

Apr. 9th, 2011

But...

The problem is how to work out which items are not grouped correctly. And if I knew that, then that would be the answer to the question. Looking at the associations, it is not the strengths that define membership of items in the 'mixed' group. Using just association strength would split up the groups that have been clustered accurately, which is not what I want.

What this means is that confidence has to be tested by some means other than pure association strength.

Perhaps this relates somewhat to Anderson's 'spreading activation', where the nodes associated with an item are themselves tested for association (like testing the macro-properties of clusters). So you would work out the nodes most closely associated with an incoming document, and then in turn analyse the association between those nodes themselves. If they are disparate, then you can infer that the original incoming node cannot be clustered with certainty. If on the other hand they are closely clustered together, then you can infer that certainty is strong.

So it's clustering the associated nodes to ascertain certainty.

However, this is computationally expensive (though probably possible). Obviously it would be best to work it out only for nodes with (semantically) dispersed secondary nodes; but we cannot tell this beforehand.
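A minimal sketch of that two-step check, assuming every node and the incoming document are already available as numeric vectors (for example, query vectors from the model). The names neighbour_coherence, cosine_matrix and the parameter k are made up for illustration, and cosine similarity stands in for whatever association measure is actually used.

```python
import numpy as np

def cosine_matrix(vectors):
    """Pairwise cosine similarities between the rows of a 2-D array."""
    vectors = np.asarray(vectors, dtype=float)
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return unit @ unit.T

def neighbour_coherence(doc_vec, node_vecs, k=5):
    """Mean pairwise cosine similarity among the k nodes nearest to doc_vec.

    A value near 1 means the nearest nodes sit close together (stronger
    certainty); a low value means they are semantically dispersed.
    """
    node_vecs = np.asarray(node_vecs, dtype=float)
    doc_vec = np.asarray(doc_vec, dtype=float)
    k = min(k, len(node_vecs))
    if k < 2:
        raise ValueError("need at least two neighbours to compare")
    # First dimension: association between the document and every node.
    sims = (node_vecs @ doc_vec) / (
        np.linalg.norm(node_vecs, axis=1) * np.linalg.norm(doc_vec))
    nearest = np.argsort(sims)[::-1][:k]
    # Second dimension: association of those neighbours with each other.
    pairwise = cosine_matrix(node_vecs[nearest])
    upper = pairwise[np.triu_indices(k, k=1)]
    return float(upper.mean())
```

The cost concern still applies: each incoming document needs one pass over all nodes plus k*(k-1)/2 pairwise comparisons, so a small k keeps the second step cheap.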

So perhaps relevance is proving to be a multi-dimensional thing?

One dimension of association with nodes; another dimension of the association of those associated nodes?

I can see myself going around in circles here.

The other alternative is to split the document into its component tokens and analyse the associations between them. This can indicate whether a document has dispersed semantics within itself or not.
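One hedged way to put a number on "dispersed semantics", assuming each token of the document has already been mapped to a non-zero vector. The function name document_dispersion is hypothetical, and measuring spread around the centroid is just one possible choice, not something taken from Infomap.

```python
import numpy as np

def document_dispersion(token_vecs):
    """Return 1 minus the mean cosine between each token vector and the centroid.

    Values near 0 mean the tokens point the same way; larger values mean
    the document's tokens are semantically spread out.
    """
    vecs = np.asarray(token_vecs, dtype=float)
    # Normalise so that long vectors do not dominate the centroid.
    unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    centroid = unit.mean(axis=0)
    centroid_norm = np.linalg.norm(centroid)
    if centroid_norm == 0:
        # Token vectors cancel out completely: treat as maximally dispersed.
        return 1.0
    cosines = unit @ (centroid / centroid_norm)
    return float(1.0 - cosines.mean())
```

This needs only one pass over the tokens, so it is cheaper than comparing every token with every other token.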

Testing will show.

Dec. 12th, 2010

For the record - Infomap and how to work it

This is a list of things that have to be done to get Infomap working on a modern Linux distribution.

* BLOCKSIZE in preprocessing/preprocessing_env.h : needs to be set to the highest number of words a document has in the corpus. If a document has more words than BLOCKSIZE, the building of the model will hang.

* Install libgdbm-dev with Synaptic or apt-get. Infomap needs a header file and without it, Infomap will not compile (not pass ./configure).

* Not finding ndbm.h : this is fixed in /usr/include, either with a symlink

ln -s gdbm-ndbm.h ndbm.h

or by just copying gdbm-ndbm.h to /usr/include/ndbm.h. Infomap will not compile (not pass ./configure) without this.

After that, configure, make, and make install should all go through.

This is the code for CompareTerms:

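import subprocess
import numpy

def compare_terms(term1, term2):
    # term1, term2 - terms to be compared
    # Assumes "associate -q" prints the query vector as whitespace-separated numbers
    out1 = subprocess.check_output("associate -q " + term1, shell=True)
    out2 = subprocess.check_output("associate -q " + term2, shell=True)
    vec1 = numpy.array(out1.decode().split(), dtype=float)
    vec2 = numpy.array(out2.decode().split(), dtype=float)
    # Multiply element by element, then sum
    product = numpy.sum(vec1 * vec2)
    return product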

This produces an association between 2 terms.

When calling this, the 'args' string that calls associate must be built as one complete string rather than left to Popen to assemble. This matters when sending more than one term: otherwise, associate treats the terms as a quoted phrase search rather than an AND search.
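A small sketch of building that single string, on the assumption that the phrase-search behaviour comes from several terms reaching associate as one argument. The helper names associate_command and run_associate are made up here. The "-q" option is the one used above, and the terms are passed through shlex.quote so that odd characters do not upset the shell.

```python
import shlex
import subprocess

def associate_command(terms, options="-q"):
    """Build the whole 'associate' call as one string, one shell word per term."""
    quoted = " ".join(shlex.quote(term) for term in terms)
    return "associate {} {}".format(options, quoted)

def run_associate(terms, options="-q"):
    """Run the command through the shell and return its raw text output."""
    return subprocess.check_output(
        associate_command(terms, options), shell=True, universal_newlines=True)
```

For example, associate_command(["cat", "dog"]) gives the string "associate -q cat dog", which the shell splits into two separate terms.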

Dec. 6th, 2010

Associating terms?

Infomap doesn't have any option to allow terms to be associated with each other. This is a problem as it's the basis of what I need to do.

However, in CVS, there is a small Perl script that does the job. I don't know Perl but can just about make sense of it. The process is to take the terms and retrieve the query vector for each ("-q" option). Each query returns a vector of about 100 values. The two vectors are multiplied element by element and the products are then summed.

The equivalent Python code is:

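import subprocess
import numpy

def compare_terms(term1, term2):
    # term1, term2 - terms to be compared
    # Assumes "associate -q" prints the query vector as whitespace-separated numbers
    out1 = subprocess.check_output("associate -q " + term1, shell=True)
    out2 = subprocess.check_output("associate -q " + term2, shell=True)
    vec1 = numpy.array(out1.decode().split(), dtype=float)
    vec2 = numpy.array(out2.decode().split(), dtype=float)
    # Multiply element by element, then sum
    product = numpy.sum(vec1 * vec2)
    return product

Which seems to be all that is needed. Of course, some experimentation is needed. I may be better off making a Python wrapper for the 'associate' command just to make it a lot easier to use.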

Infomap success!

After a long wait, I managed to get Infomap to analyse a larger corpus. Not my ideal corpus but large enough to begin preliminary experiments in semantic retrieval.

And why so long? Well I was running the study on my Asus 701 netbook - not the most powerful computer in the world by any means. However, after a week of running (while very busy with work), it seemed clear to me that something was wrong.

So I checked the mailing list and found that I needed to re-define a constant in preprocessing/preprocessing_env.h to a figure higher than the largest number of words found in any one document within the corpus. The figure (BLOCKSIZE) was set at 1000000 and I had 1.25 million words in one document. So I re-defined it as 1300000 and the analysis went through fairly quickly (less than 2 hours). I had to automate the word count (wc) program with Python to get the figure I needed, but that's done and it's working well.

And the results - for a few queries, pretty much along the lines I wanted. However, much more work is needed to really get a grip on what is happening and to see if my ideas hold up. I think I will cross post this to Advogato as it involves open source software.
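A rough sketch of one way to get that figure without calling wc at all, assuming the corpus is a directory tree of plain-text files. The function name max_word_count is made up for illustration, and words are counted as whitespace-separated chunks, much as wc -w does.

```python
import os

def max_word_count(corpus_dir):
    """Return (count, path) for the file with the most whitespace-separated words."""
    largest = (0, None)
    for root, _dirs, files in os.walk(corpus_dir):
        for name in files:
            path = os.path.join(root, name)
            count = 0
            # Read as bytes so odd encodings in old texts cannot raise errors.
            with open(path, "rb") as handle:
                for line in handle:
                    count += len(line.split())
            if count > largest[0]:
                largest = (count, path)
    return largest
```

BLOCKSIZE then needs to be set above the returned count before the model is rebuilt.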

Nov. 28th, 2010

Another block

Infomap didn't run well on OSX. btw, my machine is a 2.4 GHz MacBook running 10.6.5.

The issue was that I kept on getting a segmentation fault when trying to use the SVD (singular value decomposition) routines to decompose the word-document matrix. Error 139 if I remember correctly.

Luckily, someone had written a post on the mailing list archives (if you list 2,000 messages, you can skip past the spam - the interface for that archive is truly appalling). The problem seems to be with 64 bit machines and memory allocation. Not being familiar with the code or too groovy with C, I'm the wrong person to sort it out. There was a patch but trying that and recompiling didn't do much either - just a different error and earlier in the analysis process.

So I've abandoned Infomap on my Mac and gone to my Asus 701 netbook! It's a 32-bit machine so shouldn't come across the same problem. Right now, it's running on a limited corpus of 1,659 documents which is a lot but not as many as I'd hoped. Perhaps when I can get my family and myself more settled, I can get a desktop machine that can handle as much as I want. Besides, I can leave the Asus running quite happily anywhere in a corner doing experiments while I need the Mac for work and life.

I'll report back with more findings.

btw, getting Infomap running on the Asus needed the same symlink to be put in as on OSX (see previous post).

I'm quite excited to be working on this again. I have a whole raft of tests ready to apply - I just need the associate program to be running at a reasonable speed. From memory, I recall that Infomap worked the term comparison far more quickly than Semantic Vectors, which was too slow for production use. Not being a Java head, I am the wrong person to optimise that code in my favour, and I don't share the same aims as the developers, which means my needs are a low priority.

Anyway, right now, the Asus has read in the text files in 20 minutes and is doing its maths. It's just a case of waiting now to see if it segfaults here too.

Nov. 27th, 2010

Infomap on OSX

Infomap works on OSX well enough. There are a couple of gotchas to watch out for (see previous post) such as making a symlink to the proper database files.

The biggest problem was preparing the corpus. In total, I have almost 8,000 books from Project Gutenberg. These were the biggest problem because they came in a variety of formats.

Step 1 was to get the DVD archive with many thousands of works. Then I had to remove works not in the English language. The remaining files had to be unzipped (not all are), then non-text formats removed. Infomap only works with plain text, so all PDF, Word, HTML etc. files had to go.

After that, the extraneous text (copyrights etc) had to be removed to prevent them from being included in the analysis. Then, all these cleaned texts (I ended up with just 7,808) were put into a separate directory and a list was made of all the files.
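As an illustration of the cleaning step, here is a sketch that keeps only the text between the "*** START OF ... PROJECT GUTENBERG EBOOK" and "*** END OF ..." marker lines. That marker format is an assumption: it appears in many Project Gutenberg files, but older ones use different wording, so the function returns None when it cannot find both markers and those files can then be handled separately. The name strip_gutenberg_boilerplate is made up.

```python
import re

# Assumed marker lines; older files may word these differently.
START = re.compile(
    r"^\*\*\* ?START OF (THIS|THE) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE)
END = re.compile(
    r"^\*\*\* ?END OF (THIS|THE) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE)

def strip_gutenberg_boilerplate(text):
    """Return the text between the START and END markers, or None if either is missing."""
    start = START.search(text)
    end = END.search(text)
    if not start or not end or end.start() < start.end():
        return None
    return text[start.end():end.start()].strip()
```

Files that come back as None are worth listing and checking by hand rather than being dropped silently.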

This preparation was the bulk of the work! Because Gutenberg uses a variety of formats, there was a lot of scripting to ensure the texts were suitable.

Right now, the Infomap model is being built.

Nov. 24th, 2010

Infomap

This is a blast from the past. I'm going to document how to get Infomap compiled and installed on Ubuntu.

The problem is that some headers are not found.

Use Synaptic to install libgdbm-dev

Then, configure does not find ndbm.h, so provide a symlink

ln -s gdbm-ndbm.h ndbm.h

Should compile then.

Will try on OSX.
