this semester i took a machine learning class. the class has a project requirement, so i decided to tackle a problem that's quite near to my heart (and likely important to many people in the blogosphere as well): how can we make blog search results better?
if you've ever used technorati or google blog search, you'll probably agree with me when i say that their results are complete junk. in fact, i was using these poor search engines to get the lay of the blogland regarding the recent inflammatory and bigoted comments of congressman virgil goode concerning muslim immigration. needless to say, the flat, linear list of results didn't give a very good lay of the land, though my own digging around did.
so my project aimed to improve this, and i came up with a simple mechanism called LDARank. i've posted the full writeup here (note that it isn't very formal, as the class itself wasn't very formal, but you'll get the idea).
my friend nikhil made the good point that i should make the mechanism available on some website so people can give it a whirl. maybe i'll do that over the next week or two, as i have some time.
the introduction to the writeup is quoted below:
In recent years, blogs have been rapidly growing in popularity. The world of blogs, called the blogosphere, has been gaining many users because of the ease of publishing and the desire of users to have their own personal stage. However, the rise of blogs and blog postings has not seen a commensurate rise in the quality of the ranking of results from blog search engines. It is still very difficult to browse through blogs with a targeted search query.
In this paper we propose a new ranking algorithm for blog posts based on topic modeling, called LDARank. In the first section we describe some of the problems with blog search. In the second and third sections we propose our solution and a technology to implement the solution, LDARank. We then examine the results of using the LDARank algorithm with various blogs and queries. Finally, we assess the practicality of using LDARank in a production setting, and examine its limitations.