Better Google Searching

I was writing a note to the Google QA department via their webpage to ask for better filtering of a specific kind of trash result. Having some experience with custom searches and filtering, it occurred to me that everything I was asking them for I could probably just do myself. As they say, if you want a job done right… OK, I’m going to start a little (?) pet project to create a better search engine using the Google API and freely available resources.

What was the trigger here? There are pages and entire domains that exist not to provide "content" but to lure people to ads or malware. This "man minute myth" search is a perfect example: I was looking for a book (I had the title wrong; I wanted "The Mythical Man-Month") and got a bunch of pages built from the exact same keyword trash.

Rather than simply an option to include omitted results, I was hoping for an option that gives more weight to the omission, so that more of these trash sites are eliminated from the initial results. In other words, we have the ability to include more of the same, but not to exclude more of these apparently duplicate results.

That idea got me thinking that it would be very helpful to have a link next to ‘Cached’ or ‘Note this’ where we can flag a site as trash to help others. Pages can be assigned a weight based on the number of agree/disagree votes they get. To overcome "stuffing the ballot box", a weight can be assigned to each individual’s votes, so that a vote counts for less when that person tends to disagree significantly with the rest of the community, especially with people whose votes agree with our own. We should be able to select our own trigger values to determine which results display based on weight, change this on a per-search basis if we wish, and re-display results with temporary weight values.
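
To make that weighting idea a little more concrete, here’s a minimal sketch of how a page score might fall out of reputation-weighted votes. Every name here (Vote, PageScorer, VoterWeight) is a placeholder of my own, not anything that exists yet.

```csharp
using System.Collections.Generic;

// Hypothetical sketch: score a page from its trash/not-trash votes, with
// each vote scaled by the voter's reputation. All names are placeholders.
public class Vote
{
    public string VoterId;      // anonymous-but-mandatory community identity
    public bool IsTrash;        // true = "this page is keyword trash"
    public double VoterWeight;  // 0.0 to 1.0, lowered for habitual outliers
}

public static class PageScorer
{
    // Returns roughly -1.0 (unanimously trash) to +1.0 (unanimously real
    // content). Each user picks a trigger value; anything below it is hidden.
    public static double Score(IEnumerable<Vote> votes)
    {
        double total = 0.0, weightSum = 0.0;
        foreach (Vote v in votes)
        {
            total += (v.IsTrash ? -1.0 : 1.0) * v.VoterWeight;
            weightSum += v.VoterWeight;
        }
        return weightSum == 0.0 ? 0.0 : total / weightSum;
    }
}
```

A per-search trigger is then just a comparison: hide anything whose score falls below whatever value the user picked for this search.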

I think this sort of democratic process would help to make Google a much more valuable resource. I can’t believe I’m the only person who has come up with this, and I’ll soon head off to look for existing projects. However, that won’t sway me from trying to build a better mousetrap.

Here are some thoughts that I have on this right off the bat – I have no idea which direction I’ll go on any of these:

  1. If it’s not clear already, I see this as being driven entirely from a hosted server, with client activity limited to this one server. The client won’t be making Ajax calls to other sites for info; that would just kill performance and put more of a burden on the developer as the client browser gets heavier with code. Think simple: you go to Google to do searches and this is no different.
  2. I’m more comfortable with C# than PHP so I’ll write this as an ASP.NET app first.
  3. I like the idea of opening this up for public contributions. For example, the algorithms described above will obviously be complex and require tweaking. It would be nice to have plugin weighting algorithms that let individuals pick how they want results weighted – and even vote on which algorithms return the best results (a rough sketch of that interface appears after this list).
  4. It would be nice if those algorithms were fairly generic so that they could be plugged into the code no matter what language I’m using. For example, if I write it in C#, then someone should be able to contribute VB.NET source and have it plug in as a DLL. This is of course very easy with .NET. But I’d like the same freedom to incorporate Java or PHP code too. (Dream on, bucko!) The problem here is that the performance of this is already going to be less than stellar, and executing core algorithms externally is going to make that even more of a pain. Yup, security would be an issue, but I wouldn’t let code execute on my systems without thoroughly checking it first. Heck, even that can be put through a voting system.
  5. Hmm, a voting system… It seems to me this isn’t specific to Google results so the voting mechanisms can be abstracted to any content including blogs and forum postings. (Uh oh – there I go again taking a simple idea and turning it into a framework before I write a single line of code.)
  6. Rather than storing user info, votes, URLs, and other such info locally or on a MySQL server accessible from my website, I was thinking about using the Google Base infrastructure. Sure, initially I might store the data in MySQL or even a local SQL Server, but that’s not a long-term solution, and it doesn’t address an itch I have to make use of more public/modern resources. This could turn into a real Web 2.0 mashup case study.
  7. For you MV / Pick people wondering why I don’t do something like this with MV – this is a constant problem with our technology:
    – We can’t easily host it outside of dedicated environments.
    – Dedicated environments cost money to maintain.
    – There are limited connectivity options outside of dedicated environments. That is, it’s tough to get end-users to use UO.NET, the D3 Class Library, or QM Client to connect in. Compare this to easily accessible libraries for Perl, PHP, or ADO.NET Data Adapters. (This point is arguable, but that’s the point – we can argue about it and never get anywhere. At the end of the day, it’s unlikely you’ll find any company hosting a large-scale internet site with an MV DBMS at a cost that’s anywhere near MySQL or similar offerings.)
  8. Google will get appropriate credit for the resources. This isn’t intended to mask their resources or re-brand the interface, just serve as a new front-end to their offering. This is what the "Google Operating System" is all about.
  9. I really need to see if this sort of thing is already facilitated with Google custom searches. I have another side project, not yet public, that uses that feature.
  10. Whenever I come up with ideas like this (and I have a lot of them, I just don’t post them here) I go to Google Labs to see if they already have something similar in the works. But I never see any of my ideas being implemented. Well, there are a lot of smart people over there and I have a hard time believing my ideas are completely original, so I have to believe that they’re just not updating the Labs data very often. Once in a while I do poke through the various forums to see what other people are talking about, but ironically, finding relevant discussions about whatever I have on my mind is really difficult. What the heck do you search for? "better search results"? Yeah, that’s unique enough to narrow down the result set. The first thing Google needs to do is improve the basic searching algorithms so that the rest of us can do decent searches to see if our "better search algorithm" type queries are really unique or if we’re coming up with decade-old techniques to solve last century’s problems.
  11. Users won’t be able to use this anonymously. I need to identify people who are in it just to muck up the process, and add value to results approved by more reputable contributors. Individuals need to establish a track record so that the algorithms can determine whether they’re malicious, error-prone, or simply disagree in principle with decisions that others would make. The intent here isn’t to ascertain agreement with content, simply to determine whether content even exists at a link returned in a standard result set. Obviously people won’t want others to identify them individually, so this needs to be anonymous in the community sense, but login to the environment needs to be mandatory.
  12. I haven’t thought much yet about how the data will be stored or how sheer volumes of data can be processed quickly. Off-hand I’m thinking some data would initially be keyed by URL (URI) with a simple list of individuals voting Yes or No as to whether the resource has true content. A redundant tally can be maintained to give a simple ratio of Yes to No votes (rough shapes for this follow the list). Those tallies just need to be looked up for each page of results returned to a user. This won’t be trivial. Some number of results needs to be processed and returned to the user. I don’t want to pre-process all results, but I need to process enough to return a full page (20 to 100 results) per user preferences. When the user requests a new page I can’t query Google and re-process everything again; I need to maintain a cache of the original results and of the results I’ve already processed and returned to the user, and then use that subset for paging. Some delays can be expected when the user asks for the Nth or last page because I may need to go through the whole result set to return that many results – maybe even spin off threads to generate pages in the background depending on how many results are returned. Initially, small result sets shouldn’t impact performance that much, but as people start asking for filtering on larger result sets this could be a problem. Because of this I’m tending toward keeping this simple and productive rather than making it more complex and ruining the whole thing. Hey, for my purposes, I just don’t want to see so much crud in my search results.
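
To put rough shapes on point 12, here’s what I’m picturing for the per-URL record and the per-query cache. These are hypothetical in-memory classes of my own; the real storage question (MySQL, SQL Server, Google Base) is still wide open.

```csharp
using System.Collections.Generic;

// Hypothetical in-memory shapes; the real storage layer is still undecided.
public class UrlVoteRecord
{
    public string Url;                                 // key: the URI being voted on
    public int YesVotes;                               // "real content lives here"
    public int NoVotes;                                // "keyword trash / no content"
    public List<string> Voters = new List<string>();  // one vote per user

    // Redundant tally kept as a simple ratio so lookups stay cheap.
    public double YesRatio()
    {
        int total = YesVotes + NoVotes;
        return total == 0 ? 1.0 : (double)YesVotes / total;
    }
}

// Per-query cache so paging doesn't mean re-querying Google and
// re-processing the entire result set from scratch.
public class CachedSearch
{
    public string Query;
    public List<string> RawUrls = new List<string>();       // as returned by Google
    public List<string> FilteredUrls = new List<string>();  // already scored and kept
    public int NextRawIndex;                                 // how far we've processed
}
```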

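And since points 3 and 4 above are really the same idea, here’s what that plugin contract might look like: a small interface that any .NET language can implement, compile into a DLL, and have the site pick up by reflection. This is a hypothetical shape of my own, not a spec, and the loader skips the vetting and sandboxing I’d obviously need.

```csharp
using System;
using System.Collections.Generic;
using System.Reflection;

// Hypothetical plugin contract: any .NET language (C#, VB.NET, ...) could
// compile an implementation into a DLL and drop it into a plugins folder.
public interface IResultWeighter
{
    string Name { get; }

    // Score one result URL from its raw trash/not-trash vote history.
    double Weigh(string url, IList<bool> trashVotes);
}

public static class WeighterLoader
{
    // Load every IResultWeighter implementation found in the given assembly.
    public static List<IResultWeighter> LoadFrom(string dllPath)
    {
        var weighters = new List<IResultWeighter>();
        Assembly asm = Assembly.LoadFrom(dllPath);
        foreach (Type t in asm.GetTypes())
        {
            if (typeof(IResultWeighter).IsAssignableFrom(t) && t.IsClass && !t.IsAbstract)
                weighters.Add((IResultWeighter)Activator.CreateInstance(t));
        }
        return weighters;
    }
}
```

Voting on which algorithm gives the best results then just means keeping the same kind of tally per weighter Name that we keep per URL.
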
Of course there are many other considerations, but I think the best approach is to simply do what I wanted to do in the first place: create a front-end to Google using their provided search API, pre-process the results, and render them. Then I’ll create the voting mechanism and play with the algorithms to remove results that have been voted out.
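
Stripped down, that flow is not much more than the sketch below. FetchResults stands in for whichever Google search API call I end up using, and LookUpScore stands in for the vote store; neither exists yet, so treat this as pseudo-code that happens to compile.

```csharp
using System.Collections.Generic;

// Rough pipeline sketch: pull raw results, drop anything scored below the
// user's trigger value, and stop once a full page survives. FetchResults and
// LookUpScore are placeholders for the Google API call and the vote store.
public class SearchFrontEnd
{
    public delegate IEnumerable<string> FetchResults(string query, int maxResults);
    public delegate double LookUpScore(string url);   // e.g. weighted yes/no ratio

    private readonly FetchResults _fetch;
    private readonly LookUpScore _score;

    public SearchFrontEnd(FetchResults fetch, LookUpScore score)
    {
        _fetch = fetch;
        _score = score;
    }

    public List<string> Search(string query, double trigger, int pageSize)
    {
        var page = new List<string>();
        foreach (string url in _fetch(query, pageSize * 5))  // over-fetch, then filter
        {
            if (_score(url) >= trigger)
                page.Add(url);
            if (page.Count == pageSize)
                break;
        }
        return page;
    }
}
```

The over-fetch factor of five is a pure guess; per point 12, larger result sets will need real caching and paging behind this.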

Sigh. Here I go again. I’ll let you know when a simple prototype is available.