Show HN: My independent search engine focused on user control

Posted by nox21125 | 2 hours ago | 1 comment

n1xis10t 25 minutes ago

Very cool, I subscribed to the newsletter. I’ve experimented with retrieval and ranking across a sample of a million pages from the early days of the Common Crawl (around 2014), and I was surprised by how many of them seemed high quality. The CTO of CC tells me it’s because most of the early URLs were donated by Blekko, an old search engine that he used to work for. I don’t know what the quality of recent CC stuff is like, but I think it would be fun to supplement an index with this older data, especially because you’d get a lot of pages that are 404s now (you could still deliver the extracted text to the user, or link to a temporally nearby snapshot from the Wayback Machine).
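A rough sketch of the "link to a nearby snapshot" idea, in Python. It relies on the Wayback Machine's URL convention, where a timestamped path redirects to the capture closest to that timestamp. The function names and the `is_live` flag are made up for illustration; how you detect dead pages and whether a capture actually exists are left out.

```python
from datetime import datetime


def wayback_url(original_url: str, crawled_at: datetime) -> str:
    """Build a Wayback Machine URL aimed at the capture nearest to crawled_at.

    The archive redirects a timestamped URL to the closest capture it holds,
    so the crawl date of the old index entry is a reasonable target.
    """
    timestamp = crawled_at.strftime("%Y%m%d%H%M%S")
    return f"https://web.archive.org/web/{timestamp}/{original_url}"


def result_link(original_url: str, crawled_at: datetime, is_live: bool) -> str:
    """Link to the live page if it still resolves, else to the archived copy.

    is_live is a hypothetical flag your crawler would have to supply.
    """
    if is_live:
        return original_url
    return wayback_url(original_url, crawled_at)
```

For a page crawled in August 2014 that now returns a 404, `result_link(url, datetime(2014, 8, 1), is_live=False)` gives a link the user can still follow.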

Another fun thing to consider is making a metasearch engine that functions like MetaCrawler used to: it gets all (or a bunch of) the available results from all the source engines, actually fetches and extracts the text from the linked pages, and then matches the query and ranks the pages independently of what the source engines did. If you’d like to do that, I would recommend adapting the source code of 4get.ca (at least for the scrapers), because the guy who writes it is rather talented at coming up with and maintaining workarounds.
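Here is a minimal sketch of that re-ranking step in Python, assuming the scraping and text extraction have already happened. I picked BM25 as the scoring function only because it is a common baseline; I don't know what MetaCrawler actually used. `merge_results` and `bm25_rank` are names I made up, and the inputs are plain lists and dicts rather than anything from 4get.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def merge_results(result_lists):
    """Union the URLs from every source engine, ignoring their ranks."""
    seen = {}
    for results in result_lists:
        for url in results:
            seen.setdefault(url, None)  # dict keeps first-seen order
    return list(seen)


def bm25_rank(query, docs, k1=1.5, b=0.75):
    """Score each page's extracted text against the query with BM25.

    docs maps url -> extracted text. Returns (url, score) pairs, best first.
    """
    tokens = {url: tokenize(text) for url, text in docs.items()}
    n = len(tokens)
    avg_len = sum(len(t) for t in tokens.values()) / n if n else 0.0

    # Document frequency: in how many pages does each term appear?
    df = Counter()
    for toks in tokens.values():
        df.update(set(toks))

    query_terms = set(tokenize(query))
    scores = {}
    for url, toks in tokens.items():
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[url] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

You would call `merge_results` on the per-engine URL lists, fetch and extract each URL into a `docs` dict, and then show the output of `bm25_rank(query, docs)`. The point is that the source engines only nominate candidates; the final order comes entirely from your own scoring.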

If you monetize this, I’d be interested in working for you. I know Python, HTML, and CSS, am familiar with JavaScript, and have a lot of experimental (and successful!) experience with ranking web results.

Also, you might be interested in reading this article (from 2600 magazine) about disappearing search engines: https://archive.org/details/search-timeline In addition to the things in that article, there was a search engine for Discord (“Searchcord”) that went away less than a week after it was announced here (on HN), and there is this recent blog post which lists search engines with independent indexes, a painfully large number of which went away with no announcement: https://seirdy.one/posts/2021/03/10/search-engines-with-own-...

The author of the 2600 article doesn’t really get into theories about why search engines disappear, but it certainly seems like a lot of them do. I’m curious to know whether they disappear for unrelated reasons, or whether it’s just really difficult to make and maintain a search project, or whether there’s some other common cause. If you suddenly feel disinclined to work on this project, could you let me know why (maybe anonymously, with a new email account or something)? Thanks.