I need to index patents from sources available on the web. The current implementation supports querying a database of patents published by the USPTO.
Current implementation: Apache Nutch, Apache Lucene and PostgreSQL.
Is it a project requirement for the full-text index to be embedded, or is it possible to use a search server like Elasticsearch or Solr (both based on Lucene internally)? If so, there are Elixir clients available.
It really wouldn’t be hard to set up your own Lucene server, then query it from Elixir. Full-text indexing and search are very CPU intensive; I’d call it a good example of something the BEAM is not suited for.
As @lucaong suggested, Solr and Elasticsearch are nice services based on Lucene; you'll have to see which one suits you better. I use Elasticsearch, but instead of a client package I just use a small module that provides a minimal API to talk to it.
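To give an idea of how small such a module can be, here is a rough sketch in Python (the same shape carries over to Elixir with any HTTP client). It only uses the standard `_search` endpoint with a `match` query. The URL, the `patents` index name and the `abstract` field are assumptions for illustration, not anything from the original setup.

```python
import json
import urllib.request

# Assumptions for illustration: a local Elasticsearch node and a
# hypothetical "patents" index with an "abstract" text field.
ES_URL = "http://localhost:9200"
INDEX = "patents"


def build_search_request(text, field="abstract", size=10):
    """Build (but do not send) a POST request for a simple match query."""
    body = {"query": {"match": {field: text}}, "size": size}
    return urllib.request.Request(
        url=f"{ES_URL}/{INDEX}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(text, **kwargs):
    """Send the query and return the list of hits from the response."""
    request = build_search_request(text, **kwargs)
    with urllib.request.urlopen(request) as response:
        return json.load(response)["hits"]["hits"]
```

Calling `search("solar panel")` against a running node would return the raw hit documents. Keeping request building separate from sending makes the module easy to test without a server.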
Definitely not a recommendation, but yes, there is at least one:
riak_search is a Lucene-like full-text search engine on the BEAM (Erlang). It does not seem to be under active development anymore, and I'm not sure whether others are still working with it. It looks like it provides full-text search with just the BEAM, and also integration with Solr.
And don’t forget, if you’re using Postgres as the database, its full-text search capabilities are quite good and can often remove the need for an external engine.
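As a rough sketch of what that looks like, the query below uses Postgres's built-in `to_tsvector`, `plainto_tsquery`, the `@@` match operator and `ts_rank`. The `patents` table and its `id`, `title` and `abstract` columns are assumptions for illustration, and the `%s` placeholders assume a driver such as psycopg2.

```python
# Hypothetical schema: a "patents" table with id, title and abstract columns.
FTS_SQL = """
SELECT id, title, ts_rank(to_tsvector('english', abstract), query) AS rank
FROM patents, plainto_tsquery('english', %s) AS query
WHERE to_tsvector('english', abstract) @@ query
ORDER BY rank DESC
LIMIT %s
"""


def fts_query(text, limit=10):
    """Return the SQL and its parameters, ready for cursor.execute(sql, params)."""
    return FTS_SQL, (text, limit)
```

With a database cursor this would be run as `cursor.execute(*fts_query("lithium battery"))`. For a table of any real size you would also want a GIN index on the same `to_tsvector('english', abstract)` expression, otherwise every query recomputes the vectors.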
If you’re looking to integrate with Elasticsearch, there’s https://github.com/tsloughter/erlastic_search. Take a look at the .travis.yml if you’d like to set up a similar integration test (I’m a maintainer on that project).