Comparison of re-Isearch to Popular Engines
Abstract
There are a vast number of engines on the market. On the commerical side there is, for example, MarkLogic; there are RBMSs with fulltext support like SQLlite and the extraordinary PostgreSQL; and on the “open source” side there engines like ElasticSearch, Solr and Neo4j. The last 3, interestingly, are not really engines but are applications based around the Lucene engine. This study thus compares re-Isearch to Lucene.
Apache Solr
Apache Solr is an open-source search server built on top of Apache Lucene that provides all of Lucene’s search capabilities through HTTP requests. It has been around for almost a decade and a half, making it a mature product with a broad user community.
Solr uses request handlers to ingest data from XML files, CSV files, databases, Microsoft Word documents, and PDFs. It provides native support for the Apache Tika library.
A Solr configuration requires at least 512 MB of HEAP memory to allocate to instances.
Solr is licensed with the Apache 2.0 License. It is an official Apache project.
ElasticSearch
Elasticsearch is also an open-source search engine built on top of Apache Lucene. It extends Lucene’s indexing and search functionalities using RESTful APIs, and it archives the distribution of data on multiple servers using the index and shards concept. Elasticsearch is completely based on JSON. It supports data ingestion from multiple sources using the Beats family (lightweight data shippers available in the ELK Stack) and Logstash.
The default Elasticsearch configuration requires 1GB of HEAP memory.
Elastic started off under the Apache 2.0 license but is now dual licensed under both the Elastic License and SSPL. While the former disallows providing the software to third parties as a hosted or managed service, SSPL allows it but requires anyone who wants to offer ElasticSearch as a service to either release all surrounding infrastructure as SSPL or get a commercial license.
MarkLogic
MarkLogic is a commerical search system originally developed for XML. It indexes the content and structure of documents including words, phrases, relationships, and values in over 200 languages with tokenization, collation, and stemming for core languages. It supports search across its data and metadata using a word or phrase and incorporates Boolean logic, stemming, wildcards, case sensitivity, punctuation sensitivity, diacritic sensitivity, and search term weighting.
Rust Based
1) Tantivy: https://github.com/quickwit-oss/tantivy
Tantivy is a full text search engine library written in Rust.
It is closer to Apache Lucene than to Elasticsearch or Apache Solr in the sense it is not an off-the-shelf search engine server, but rather a crate that can be used to build such a search engine.
Tantivy is, in fact, strongly inspired by Lucene's design.
2) Bayard: https://github.com/bayard-search/bayard
Bayard is a full-text search and indexing server written in Rust built on top of Tantivy that implements Raft Consensus Algorithm and gRPC.
3) Toshi: https://github.com/toshi-search/Toshi
Toshi is meant to be a full-text search engine similar to Elasticsearch. Toshi strives to be to Elasticsearch what Tantivy is to Lucene.
4) Sonic: https://github.com/valeriansaliou/sonic
Sonic is a fast, lightweight and schema-less search backend. It ingests search texts and identifier tuples that can then be queried against in a microsecond's time.
Sonic can be used as a simple alternative to super-heavy and full-featured search backends such as Elasticsearch in some use-cases. It is capable of normalizing natural language search queries, auto-completing a search query and providing the most relevant results for a query. Sonic is an identifier index, rather than a document index; when queried, it returns IDs that can then be used to refer to the matched documents in an external database.
5) Vector https://vector.dev/ and https://github.com/vectordotdev/vector
Alternative to Elastic’s Logstash
Angolia
https://en.wikipedia.org/wiki/Algolia
Algolia was founded in 2012 by Nicolas Dessaigne and Julien Lemoine, both originally from Paris, France. It was originally a company focused on offline search on mobile phones. Later it was selected to be part of Y Combinator's Winter 2014 class.
The Algolia model provides search as a service, offering web search across a client's website using an externally hosted search engine
“Subscribe for free, pay based on your usage”
Typesense
https://github.com/typesense/typesense
Core Overview
| Typesense | Algolia | ElasticSearch | Meilisearch |
Source Code | Fully open source | Proprietary closed source | Source-available, licensed under SSPL | Fully open source |
First Commit | 2015 | 2012 | 2010 | 2018 |
Built Using | C++ | C++ | Java | Rust |
Core Search Algorithm | Built from the ground-up | Built from the ground-up | Built on top of Lucene | Built from the ground-up |
Best Suited For | Instant Search-as-you-type Experiences for data sets that can fit in RAM, up to 24 TB (or current commercially available RAM size). | Instant Search-as-you-type Experiences for datasets up to 128 GB in size. | General-purpose search & aggregations over petabyte-scale datasets (eg: log data) | Instant Search-as-you-type Experiences for up to a few hundred thousand records, that don't require a production-grade highly-available setup. |
Primary Index Location | RAM | RAM | Disk, with RAM cache | Disk with Memory Mapped files |
Re-Isearch
Re-Isearch is a reborn open-source search engine built on the basis of the Isearch open-source engine and IB (a proprietary fork). It indexes the content and structure of documents including words, phrases, relationships, and values. It supports search across its data and metadata using a word or phrase and incorporates Boolean logic, wildcards, case sensitivity, punctuation sensitivity, diacritic sensitivity, and search term weighting.
The default configuration requires a minimum heap size 3x the size of the largest document it intends to index. It has run well on machines with virtual memory and as little as 8 MB RAM.
The engine has, in various forms (and using differing algorithms), been available since the 1990s and was originally developed to provide structured field search via the NISO Z39.50 protocol (an international standard client–server, application layer communications protocol for searching and retrieving information from a database that dominates, for example, the library world) . Its well established user community was primarily in the public sector and included many high profile projects such as the U.S. Patent and Trademark Office (USPTO) patent search, the Federal Geographic Data Clearinghouse (FGDC), the NASA Global Change Master Directory, the NASA EOS Guide System, the NASA Catalog Interoperability Project, the astronomical pre-print service based at the Space Telescope Science Institute, The PCT Electronic Gazette at the World Intellectual Property Organization (WIPO), the SAGE Project of the Special Collections Department at Emory University, Eco Companion Australasia (an environmental geospatial resources catalog), Australian National Genomic Information Service (ANGIS), the Open Directory Project and numerous governmental portals.
The engine has a number of extensions that enables search functionality via ISO23950/Z39.50, OASIS SRU/W and OpenSearch REST API.
Re-Isearch is licensed under the Apache 2.0 license. The Z39.50, SRU/W and other servers are licensed under highly permissive MIT style licenses. Basically these means anyone has permission to use, copy, modify, distribute, and sell the software and its documentation, in whole or in part, for any purpose without fee.
re-Isearch | MarkLogic | Elasticsearch | Apache Solr |
NoSQL search engine | Operational and transactional Enterprise NoSQL database | A distributed, RESTful modern search and analytics engine based on Apache Lucene | A widely used distributed, scalable search engine based on Apache Lucene |
NativeXML DBMS, RDF Store, search engine | NativeXML DBMS, RDF Store, search engine | Search engine | Search engine |
Object DBMS including Spatial |
| Document store Spatial DBMS | Spatial DBMS |
1994-2011, reborn 2021 | Since 2001 | Since 2010 | Since 2006 |
C++ | C++ | Java | Java |
Free | Commerical | Partially Free | Free |
Open Source | Proprietary | Open Source | Open Source |
XML support | XML support | JSON Only | XML support |
Foreign keys, Join | No foreign keys | No foreign keys | No foreign keys |
Schema-Free | Schema-Free | Schema-Free | Schema |
Multi-language API, Z39.50, SRU/W. CQL, IB Query Language,... | Multi-language API, Xquery, SPARQL, | Java API RESTful HTTP/JSON API | Java API RESTful HTTP/JSON API |
Search during index | Search during index | No | No |
Own algorithms | Own algorithms | Based on Lucene (inverted index) | Based on Lucene (inverted index) |
Relation DB based
MantiCore Search
https://github.com/manticoresoftware/manticoresearch
Manticore Search is a multi-storage database designed specifically for search, including full-text search.
Craigslist, Socialgist, PubChem and many others use Manticore for efficient searching and stream filtering.
Manticore Search was forked from Sphinx 2.3.2 in 2017.
Sphinx
Sphinx can be used either as a stand-alone server or as a storage engine ("SphinxSE") for the MySQL family of databases. When run as a standalone server Sphinx operates similar to a DBMS and can communicate with MySQL, MariaDB and PostgreSQL through their native protocols or with any ODBC-compliant DBMS via ODBC. MariaDB, a fork of MySQL, is distributed with SphinxSE.
Lucene
Lucene vs. re-IsearchLast updated