So, IBM has told us all the secrets to building Watson. How hard could it be?
OK, pretty hard. But the whole system was built out of some fundamental building blocks, and from a high-level perspective the most interesting one to me is Apache UIMA. It's the framework that runs annotators over natural-language documents, pulling out facts. IBM used some pretty heavy-duty annotators to find the things they were looking for, but the default annotators are pretty neat to begin with. A short example page shows off the general idea.
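To give a feel for what that looks like in code, here's a minimal sketch using the standard UIMA Java API: load an analysis engine from its XML descriptor, feed it a bit of text, and print whatever annotations it produces. The "MyAnnotator.xml" path is just a placeholder for whichever annotator descriptor you happen to try, not a file that ships with UIMA.

```java
import org.apache.uima.UIMAFramework;
import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.analysis_engine.AnalysisEngineDescription;
import org.apache.uima.cas.FSIterator;
import org.apache.uima.jcas.JCas;
import org.apache.uima.jcas.tcas.Annotation;
import org.apache.uima.util.XMLInputSource;

public class AnnotatorDemo {
    public static void main(String[] args) throws Exception {
        // Build an analysis engine from its XML descriptor. "MyAnnotator.xml" is a
        // placeholder for whichever annotator descriptor you want to try out.
        AnalysisEngineDescription desc = UIMAFramework.getXMLParser()
                .parseAnalysisEngineDescription(new XMLInputSource("MyAnnotator.xml"));
        AnalysisEngine ae = UIMAFramework.produceAnalysisEngine(desc);

        // Drop some text into a CAS and run the annotator over it.
        JCas jcas = ae.newJCas();
        jcas.setDocumentText("Dr. Watson presented the results at the IEEE conference.");
        ae.process(jcas);

        // Walk the annotation index and print everything the annotator marked up.
        FSIterator<Annotation> it = jcas.getAnnotationIndex().iterator();
        while (it.hasNext()) {
            Annotation a = it.next();
            System.out.println(a.getType().getName() + ": \"" + a.getCoveredText() + "\"");
        }
        ae.destroy();
    }
}
```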
This raises the question: how can I use this? I was trying to think of a novel, game-changing application, but then I remembered I'm not a computer scientist, so I should probably think a little smaller for now. I'd like to know how many unique names show up across every paper ever published in an IEEE journal. That's probably still unreasonable, since I can't just download the entire IEEE corpus, as cool as that would be, and ASU probably wouldn't enjoy me abusing their library access by screen-scraping all the PDFs either. New plan needed. Thinking even smaller: I attended an IEEE conference last summer that happened to include the last 30 years of its proceedings on a CD. That might be a reasonable amount of data to start with.
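A rough sketch of what the counting program might look like, just to make the plan concrete: run whatever name annotator ends up getting plugged in over a directory of plain-text files (already extracted from the PDFs somehow) and keep a set of the unique strings it tags. The descriptor path, the text directory, and the annotation type name are all command-line arguments here because they depend entirely on which annotator I end up using.

```java
import org.apache.uima.UIMAFramework;
import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.analysis_engine.AnalysisEngineDescription;
import org.apache.uima.cas.FSIterator;
import org.apache.uima.cas.Type;
import org.apache.uima.jcas.JCas;
import org.apache.uima.jcas.tcas.Annotation;
import org.apache.uima.util.XMLInputSource;

import java.io.File;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.HashSet;
import java.util.Set;

public class UniqueNameCounter {
    public static void main(String[] args) throws Exception {
        // args[0]: XML descriptor for whatever name annotator gets plugged in
        // args[1]: directory of plain-text files pulled out of the proceedings PDFs
        // args[2]: fully qualified type name that annotator uses for names
        AnalysisEngineDescription desc = UIMAFramework.getXMLParser()
                .parseAnalysisEngineDescription(new XMLInputSource(args[0]));
        AnalysisEngine ae = UIMAFramework.produceAnalysisEngine(desc);
        JCas jcas = ae.newJCas();

        Set<String> uniqueNames = new HashSet<String>();
        for (File paper : new File(args[1]).listFiles()) {
            String text = new String(Files.readAllBytes(paper.toPath()), StandardCharsets.UTF_8);
            jcas.reset();                 // reuse the same CAS for each document
            jcas.setDocumentText(text);
            ae.process(jcas);

            // Collect the text spans tagged with the name type and de-duplicate them.
            Type nameType = jcas.getCas().getTypeSystem().getType(args[2]);
            FSIterator<Annotation> it = jcas.getAnnotationIndex(nameType).iterator();
            while (it.hasNext()) {
                uniqueNames.add(it.next().getCoveredText());
            }
        }
        System.out.println("Unique names found: " + uniqueNames.size());
        ae.destroy();
    }
}
```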
This is another one of those projects that will probably be slow to get going, but it doesn't seem insurmountable. The hard work has already been done; I just have to put other people's work to use. I'll let you know if I make any progress.
How are you going to do this? Are you going to build your own Watson?
Also, do your parents know you have a blog? I feel like your mom would really enjoy this.
Well, I'm not going to make anything as fancy as that. The UIMA code already has built-in annotators for names. I just need to get it running and feed the data into a simple program that uses the name annotator, I think.
I just emailed them about it last night!