Constellio is an open source search engine based on Apache Lucene/Solr. I really like Constellio. The search results are relevant, the admin system works great, and it supports several search filters for interacting with the search results. Unfortunately, I am experiencing a bug that fills up the disk every 3-4 days, making the application crash. This makes me question how mature the system really is.
You can try a demo of Constellio online here.
Search result page
Constellio has a classic-looking search result page. It also supports several filters that you can use to restrict your search based on document language, tags, document type and last modified date.
Each hit has a title or URL at the top, three lines of highlighted text extract, and then the URL at the bottom.
Constellio automatically categorizes documents to help you navigate unstructured content.
Constellio has a good web-based administration interface.
Overview of collections
Setting up a new crawl
Status for crawling
Constellio has a cool status function that shows you the latest crawled and indexed pages in real time.
I installed it on a VM client with 8 GB of disk space. It ran out of space after 12 182 files had been crawled. The web GUI reports “INDEX SIZE ON DISK : 208.77 MB”, but Constellio uses 4.9 GB (1.7 GB in constellio/tomcat/temp and 3.2 GB in constellio/tomcat/webapps). I tried to delete all the indexed files by clicking the delete all button for that collection, but that didn’t do anything except log me out of the admin GUI.
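To compare the index size the GUI reports with what is actually on disk, something like the following Python sketch can total up the two directories mentioned above. The function names are my own, the relative paths assume the script is run from the directory that contains the constellio folder, and the totals are apparent file sizes, so they may differ slightly from what du reports.

```python
import os


def dir_size_bytes(path):
    """Sum the sizes of all regular files under path (symlinks are skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            try:
                if not os.path.islink(full):
                    total += os.path.getsize(full)
            except OSError:
                # The file vanished or is unreadable; ignore it.
                pass
    return total


def human(n):
    """Format a byte count using 1024-based units."""
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if n < 1024 or unit == "TB":
            return f"{n:.1f} {unit}"
        n /= 1024


if __name__ == "__main__":
    # Paths taken from the post; adjust them to your installation.
    for path in ("constellio/tomcat/temp", "constellio/tomcat/webapps"):
        print(path, human(dir_size_bytes(path)))
```

Running it from cron and logging the output would show how fast the temp directory grows between crawls.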
I also experienced several timeouts, and the crawler would stop crawling at about 20 000 documents.
After contacting Constellio support, I found out that Constellio can have some issues with large data sets when one uses the default configuration. I was advised to make the following changes to resolve the problems with timeouts and the crawler stopping:
1) Change the database engine. By default Constellio uses the Derby database, but it can be configured to use MySQL instead.
2) Increase the maximum heap size the Java virtual machine can use from 1024 MB to 2048 MB by setting -Xmx2048m in the start_constellio.sh file.
After these changes I was able to crawl the test sets successfully.
Unfortunately, constellio/tomcat/temp is still being filled up with temporary files. This will eventually fill up the whole disk and make Constellio stop responding, making it necessary to manually remove the temporary files and restart Constellio (sometimes the whole server, since a lot of Linux services don’t handle a full disk well either).
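Until the underlying bug is fixed, the manual cleanup could be scripted. The Python sketch below is a workaround, not a fix, and not something Constellio support suggested. The function name, the one-day age threshold and the idea that old files in the temp directory are safe to delete are all my assumptions, which is why it defaults to a dry run that only lists what it would remove.

```python
import os
import time


def remove_old_files(directory, max_age_days=1.0, dry_run=True):
    """Delete files under directory whose mtime is older than max_age_days.

    Returns the list of matching paths. With dry_run=True nothing is deleted.
    """
    cutoff = time.time() - max_age_days * 86400
    matched = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            full = os.path.join(root, name)
            try:
                if os.path.getmtime(full) < cutoff:
                    if not dry_run:
                        os.remove(full)
                    matched.append(full)
            except OSError:
                # The file disappeared or could not be removed; skip it.
                pass
    return matched


if __name__ == "__main__":
    # Path taken from the post; the threshold is an assumed value.
    for path in remove_old_files("constellio/tomcat/temp", max_age_days=1.0):
        print("would remove", path)
```

Once the dry-run output looks right, passing dry_run=False performs the actual deletion. Whether Constellio copes with temp files disappearing while it is running is something I have not verified, so it may be safest to stop it first.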