The next thing I plan on doing, when I have time, is learning something about MapReduce (http://en.wikipedia.org/wiki/MapReduce). To do this, I'm going to need a lot of data. Luckily, someone has collected a bunch of publicly available datasets. I'm not sure which one sounds coolest. I realize I may be getting ahead of myself, since I have absolutely no experience with this sort of programming, but I can at least ask for suggestions here.
So, anyone seen any fun datasets in there? I feel like the BLS data might be sort of interesting. I could use it to prove all sorts of irresponsible political points. It's hard to argue with a few hundred megs of raw numbers. I like the idea of the Usenet corpus from 2005-2009, but it's 28 gigs. Woof.
I really like the idea of this kind of massively parallel data manipulation. Had I known this kind of stuff existed back in high school, maybe I would have done more with computer science in college. Never too late to teach myself, though.