How to build your own web search engine

achempion · May 1, 2019, 1:02pm

Hi, I hope this isn’t off topic.

I’ve been thinking for a long time about creating another web search engine. Right now I’m deeply disappointed with my search experience, especially when I search for something unrelated to StackOverflow (like a cooking recipe or some discussions). I’ve reached my boiling point and have started looking into how to build a solution.

My goal is to build a web search that covers only high-quality websites (sites that don’t ask for push notifications, don’t have popups or obtrusive ads, and don’t use dark UI patterns).

I’m considering using Sphinx to power the search, but I also need to store cached pages in order to generate previews or reindex them later.

The hard part now is deciding what to use to store pages with metadata. For now, the best candidates are Cassandra, CouchDB and Riak KV. All of these are schema-free except Cassandra, but I think a strict schema will suit my data structure better.
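As a rough illustration of what a strict schema for cached pages could look like, here is a minimal Python sketch. The field names (`url`, `fetched_at`, `http_status`, `content_type`, `body`, `title`) are assumptions made up for illustration, not something decided in this thread, and the record is independent of whichever store ends up holding it.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
import hashlib


@dataclass(frozen=True)
class CachedPage:
    """One cached page plus metadata (hypothetical field set)."""

    url: str
    fetched_at: datetime
    http_status: int
    content_type: str
    body: bytes
    title: str = ""

    def __post_init__(self) -> None:
        # Reject records that would be awkward to query or reindex later.
        if not self.url.startswith(("http://", "https://")):
            raise ValueError(f"not an absolute http(s) URL: {self.url!r}")
        if self.fetched_at.tzinfo is None:
            raise ValueError("fetched_at must be timezone-aware")
        if not 100 <= self.http_status <= 599:
            raise ValueError(f"invalid HTTP status: {self.http_status}")
        if not isinstance(self.body, bytes):
            raise TypeError("body must be raw bytes")

    @property
    def body_sha256(self) -> str:
        # Content hash, handy for spotting unchanged pages on a re-crawl.
        return hashlib.sha256(self.body).hexdigest()


# Example record with placeholder values.
page = CachedPage(
    url="https://example.com/recipe",
    fetched_at=datetime(2019, 5, 1, 13, 2, tzinfo=timezone.utc),
    http_status=200,
    content_type="text/html",
    body=b"<html><title>Recipe</title></html>",
    title="Recipe",
)
```

Validating at construction time keeps malformed records out of the store regardless of whether the database itself enforces a schema.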

My general requirement is a solution that can be managed with spare resources, whether time or money, because it’s a sort of hobby project.

Cassandra looks like a good fit for storing TBs of cached web pages with Sphinx on top of it, but I don’t think JVM-based tools are a good fit for my requirements.

So I’m looking for any ideas for a tool to store TBs of cached web pages that can be scaled horizontally on demand.

Thank you

subetei · May 1, 2019, 2:10pm

It’s an interesting topic to me as I’ve been thinking of doing the same thing… my impetus was poor results in medical searches.
I had also chosen Sphinx for reasons of speed/scale/ease of uptime. Not much to add for storage… I think I’d go with Couch or Riak over Cassandra just to keep the stack in line, and I don’t think there’s enough potential additional upside in Cass to consider it (if someone knows otherwise I’m interested to hear the argument). Even Riak’s time-series optimization could have interesting uses in this case. I’ve never run the Sphinx indexer off those solutions; are you planning to update the index via xmlpipe2 or do you have something else in mind?
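For anyone unfamiliar with xmlpipe2: the indexer reads an XML stream from a command's standard output, so pages pulled out of any key-value store can be fed to it. Below is a minimal Python sketch of producing such a stream. The input dict keys (`id`, `title`, `content`, `fetched`) and the choice of fields are assumptions for illustration; only the `sphinx:docset`, `sphinx:schema`, `sphinx:field`, `sphinx:attr` and `sphinx:document` elements come from the xmlpipe2 format itself.

```python
from xml.sax.saxutils import escape


def to_xmlpipe2(pages):
    """Render an iterable of page dicts as an xmlpipe2 document stream.

    Each page is assumed (hypothetically) to have the keys
    "id", "title", "content" and "fetched" (a Unix timestamp).
    """
    lines = [
        '<?xml version="1.0" encoding="utf-8"?>',
        "<sphinx:docset>",
        "<sphinx:schema>",
        '<sphinx:field name="title"/>',
        '<sphinx:field name="content"/>',
        '<sphinx:attr name="fetched" type="timestamp"/>',
        "</sphinx:schema>",
    ]
    seen_ids = set()
    for page in pages:
        doc_id = page["id"]
        # Document IDs have to be unique positive integers.
        if not isinstance(doc_id, int) or doc_id <= 0 or doc_id in seen_ids:
            raise ValueError(f"invalid or duplicate document id: {doc_id!r}")
        seen_ids.add(doc_id)
        lines.append(f'<sphinx:document id="{doc_id}">')
        # escape() handles &, < and > so page text cannot break the XML.
        lines.append(f"<title>{escape(page['title'])}</title>")
        lines.append(f"<content>{escape(page['content'])}</content>")
        lines.append(f"<fetched>{int(page['fetched'])}</fetched>")
        lines.append("</sphinx:document>")
    lines.append("</sphinx:docset>")
    return "\n".join(lines) + "\n"
```

A script that prints this output could then be wired in through a source with `type = xmlpipe2` and an `xmlpipe_command` pointing at the script; the script name and how it reads from Couch or Riak are left open here.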

No concrete answer yet but I want to keep up with this convo

dimitarvp · May 1, 2019, 3:03pm

Joining @subetei and OP here: I have been interested in this for years but never had the time or the nerves to properly delve into it.

If I ever got the time or funding to do it, I’d definitely want to engineer and support the storage layer, where I believe Elixir is excellently qualified to orchestrate tasks. Also, writing crawlers has been one of the things that can truly excite me in programming.

I do believe we still need a proper functional-language-oriented database, namely an immutable append-only ledger. I’ve liked the demos of Datomic but have no idea how well they translate to real-world cases.

Seconding @subetei in the suggestion to use Riak so as to have as much of your code in Erlang and Elixir as possible.

egze · May 1, 2019, 4:38pm

There is also Google Custom Search https://developers.google.com/custom-search/
You can include only the sites that you want.

jdumont · May 1, 2019, 6:18pm

I’d love to see something like this also. It’s the way my brain naturally works, which is I think why I’ve struggled with storage in my apps thus far.

Event sourcing is great, but it’s the CQRS that often needs to accompany it that muddies the water again.

achempion · May 2, 2019, 5:37pm

I’ve done some research and apparently Cassandra is the most appropriate of these three options.

Riak unfortunately is badly outdated, running on R16/R17, and the company that founded it looks like it has dissolved.

CouchDB is also an interesting solution, but there’s too much hassle with adding/removing nodes and redistributing the data.

What I do like, though, is ScyllaDB, a C++ rewrite of Cassandra.

A couple more useful links to read about solutions to power search (Solr, Sphinx and so on)

subetei · May 2, 2019, 10:36pm

If you were to consider CouchDB I’d look more at Couchbase, which may have more of those kinds of niceties. I believe their cluster scale-out is fairly automatic at this point. When I played with it years ago it was already pretty straightforward.

The only outdated info about Sphinx I see there is that they now do have a suggestion query built in, based on Levenshtein distance. For proof of scale, Craigslist has a good talk or two about it. The requirement to use a non-negative unique integer key for your index source is potentially annoying, depending on what you do with DB distribution.
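To make both points concrete, here is a small Python sketch. The first part shows how edit distance can drive "did you mean" suggestions; it is a plain dynamic-programming version and not how Sphinx implements its suggestion query. The second part is one possible workaround for the integer key requirement when the store is keyed by URL: derive the ID from a hash. The function names and the 63-bit width are assumptions, and hash collisions are unlikely but not impossible.

```python
import hashlib


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character inserts, deletes or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ca
                    current[j - 1] + 1,      # insert cb
                    previous[j - 1] + cost,  # substitute, or keep on a match
                )
            )
        previous = current
    return previous[-1]


def suggest(term: str, vocabulary, max_distance: int = 2):
    """Vocabulary words within max_distance of term, closest first.

    Exact matches (distance 0) are skipped, since they need no suggestion.
    """
    scored = sorted((levenshtein(term, word), word) for word in vocabulary)
    return [word for distance, word in scored if 0 < distance <= max_distance]


def url_to_doc_id(url: str) -> int:
    """Derive a stable positive integer ID from a URL (hypothetical helper)."""
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    # Keep 63 bits so the value also fits a signed 64-bit column.
    doc_id = int.from_bytes(digest[:8], "big") & ((1 << 63) - 1)
    return doc_id or 1  # never return 0
```

Because the ID is computed from the URL alone, every node in a distributed store produces the same ID without coordinating, at the cost of having to handle the rare collision.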

achempion · May 6, 2019, 11:47am

I’ve looked at Couchbase, which is a great project. I watched a couple of conference talks, and the general use case for Couchbase is “interactive applications”, meaning you’re constantly updating/searching your data. It’s written in Erlang/Go/C++, which is also a good sign. One use case was storing a vast amount of data in RAM for an ad network to serve the data in under 3ms.

I don’t think it’ll be an appropriate solution for storing large amounts of data on HDDs across tens of servers. Its full-text search also doesn’t look promising for a search engine.

I’ll explore the Cassandra + Solr combination as a temporary solution.

Also, I found the http://commoncrawl.org project, which lets you “download the web” without running crawlers.

Here is also an awesome post for getting a general understanding of search engines: https://medium.com/startup-grind/what-every-software-engineer-should-know-about-search-27d1df99f80d

aenglisc · May 6, 2019, 11:59am

tangui · May 6, 2019, 1:30pm

FYI Riak’s search backend is also Solr. (I’ve recently had to hack a few Solr schemas in Riak)

subetei · May 6, 2019, 8:30pm

Interesting! When searching for Riak the results are confusing so I think everyone (including myself) assumed the bet365 takeover died out or something. Didn’t find http://www.riak.info/ straight away.

Does seem like a good fit here… the sacrifices Riak makes for its speed and reliability don’t really impact this scenario.

subetei · May 27, 2019, 11:37pm

I needed a crawler for something else anyway, so I put a quick project together to try all this out. I used the Elixir ecosystem to handle each aspect of search. It’s interesting what design concerns come up, as you can see from my growing wish-list in the readme. That search article you linked, @achempion, has a lot of great food for thought that I may try to address.

Anyway, running a search site has been pretty fun, so I posted a demo site with only a couple of sites partially indexed (elixirforum and infogalactic… the index will grow a little every day). And I actually need Riak for something too, so I may try to swap that in. The readme lists the tech choices I’ve made thus far.

Source Code
Demo site

achempion · May 28, 2019, 11:40am

Have you looked at http://commoncrawl.org? It already has content downloaded for indexing purposes. You can even create a web search that searches only by URL matches.
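A rough Python sketch of what a URL-only lookup against Common Crawl could look like, assuming its public URL index server at index.commoncrawl.org, which is queried per crawl and can return one JSON object per line. The crawl identifier is passed in as a parameter because I’m not assuming which crawls are currently available, and the helper names are made up. The sketch only builds the request URL and parses a response body; it performs no network calls.

```python
import json
from urllib.parse import urlencode

# Assumed host of the Common Crawl URL index server.
INDEX_HOST = "https://index.commoncrawl.org"


def build_index_query(collection: str, url_pattern: str) -> str:
    """Build a URL-index query for one crawl.

    collection:  crawl identifier in the "CC-MAIN-YYYY-WW" style (assumed form)
    url_pattern: URL or wildcard pattern such as "example.com/*"
    """
    query = urlencode({"url": url_pattern, "output": "json"})
    return f"{INDEX_HOST}/{collection}-index?{query}"


def parse_index_response(text: str) -> list:
    """Parse a response body that holds one JSON object per line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]
```

Each parsed record points at where the capture lives inside the crawl archives, so only the matching pages need to be fetched and indexed instead of the whole crawl.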

Great initiative though. Thank you for posting the source code.

subetei · May 28, 2019, 1:10pm

Np. There isn’t much to the source really, but it’s something to tinker with.
I did look at commoncrawl and that one’s very nice. It doesn’t fit the use case for my client though, so I’ll be continuing to improve the crawler so I can use that too.

edit: also to consider, for indexing news sites, YouTube channels etc. you’d still want to roll your own crawler if you want to make recent activity available. The Zam default configuration is to index just once a day max, but every hour or real time wouldn’t be very hard to implement instead.
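To show what per-site re-crawl intervals could look like, here is a minimal Python sketch. This is not Zam’s code; the function names and the shape of the `sites` mapping are assumptions for illustration. The default interval is one day, and a site can override it with something shorter, such as an hour for a news site.

```python
from datetime import datetime, timedelta, timezone

# Default: re-crawl a site at most once a day.
DEFAULT_INTERVAL = timedelta(days=1)


def is_due(last_crawled, now, interval=DEFAULT_INTERVAL):
    """True if the site has never been crawled or its interval has elapsed."""
    return last_crawled is None or now - last_crawled >= interval


def due_sites(sites, now):
    """Return the hosts that should be crawled now, in sorted order.

    sites maps host -> (last_crawled, interval); interval may be None to
    fall back to DEFAULT_INTERVAL. (Hypothetical structure.)
    """
    return sorted(
        host
        for host, (last_crawled, interval) in sites.items()
        if is_due(last_crawled, now, interval or DEFAULT_INTERVAL)
    )


# Example with placeholder hosts and times.
now = datetime(2019, 5, 28, 13, 0, tzinfo=timezone.utc)
sites = {
    "news.example": (now - timedelta(hours=2), timedelta(hours=1)),
    "blog.example": (now - timedelta(hours=2), None),
    "new.example": (None, None),
}
ready = due_sites(sites, now)
```

Here `ready` contains the news site (its hourly interval has elapsed) and the never-crawled site, while the blog waits for its daily slot.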

heywhy · January 3, 2022, 5:01pm

I know this is an old post but if you’re looking for a full-text library to use without the deployment complexities of popular search engines, you should have a look at https://github.com/heywhy/ex_elasticlunr.