Tuesday, 14. August 2007
Lucene Performance

I started a proof of concept to find out whether Lucene can be used for searching in log files. Here is a first result for the indexer (see also the benchmarks on the Lucene site):

    Hardware Environment

  • Dedicated machine for indexing: yes
  • CPU: Intel Pentium M, 1.6 GHz, 1 processor
  • RAM: 2 GB
  • Drive configuration: IDE 2.5" hard disk (in a Dell Latitude D810 notebook)

    Software Environment

  • Lucene Version: 2.2.0
  • Java Version: Java SE 1.6.0_02
  • Java VM: client VM
  • OS Version: WinXP with SP1
  • Location of index: local

    Lucene Indexing Variables

  • Number of source documents: 9
  • Total filesize of source documents: 15 MB
  • Average filesize of source documents: 2 MB
  • Source documents storage location: Filesystem
  • File type of source documents: log files
  • Parser(s) used, if any:
  • Analyzer(s) used: StandardAnalyzer
  • Number of fields per document: 3
  • Type of fields: text
  • Index persistence: FSDirectory
  • Index size: 3 MB

    Figures

  • Time taken (in ms/s as an average of at least 3 indexing runs): 7 s (first try: 37 s -> ignored)
  • Time taken / 1000 docs indexed: 150 s (estimated)
  • Memory consumption: started with -Xmx128m -Xms128m
  • Query speed: not yet measured

    Notes

  • Note: first prototype, no special tuning/strategies
  • Note: maxFieldLength set to 1,000,000 (the default of 10,000 was too small)
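
The "Time taken" figure above is an average over several runs with the first run left out. A minimal sketch of such a measurement in plain Java could look like the following. The class and method names (IndexTiming, timeRuns, averageIgnoringFirst) are made up for illustration, the workload is only a placeholder for the real indexing call, and treating the first run as a warm-up is my assumption about why it was so much slower.

```java
import java.util.ArrayList;
import java.util.List;

public class IndexTiming {

    /** Hypothetical helper: runs the task `runs` times and returns each wall-clock duration in ms. */
    static List<Long> timeRuns(Runnable task, int runs) {
        List<Long> durations = new ArrayList<Long>();
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            durations.add((System.nanoTime() - start) / 1000000L);
        }
        return durations;
    }

    /** Averages all durations except the first one, which is treated as a warm-up run. */
    static double averageIgnoringFirst(List<Long> durationsMs) {
        if (durationsMs.size() < 2) {
            throw new IllegalArgumentException("need at least one run after the warm-up");
        }
        long sum = 0;
        for (int i = 1; i < durationsMs.size(); i++) {
            sum += durationsMs.get(i);
        }
        return (double) sum / (durationsMs.size() - 1);
    }

    public static void main(String[] args) {
        // Placeholder workload; in the real test this would be one complete indexing run.
        Runnable indexingRun = new Runnable() {
            public void run() {
                StringBuilder sb = new StringBuilder();
                for (int i = 0; i < 100000; i++) {
                    sb.append(i);
                }
            }
        };
        List<Long> durations = timeRuns(indexingRun, 4); // 1 warm-up + 3 measured runs
        System.out.println("runs (ms): " + durations);
        System.out.println("average without first run (ms): " + averageIgnoringFirst(durations));
    }
}
```

Dropping the first run keeps one-time costs such as class loading, JIT compilation and a cold disk cache out of the average.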

Update
I reran the tests and corrected the values because the first attempt was not really meaningful. I had started with the default maxFieldLength of 10,000, so at first not all files were actually indexed up to EOF. The maxFieldLength is now increased to 1,000,000, which is probably much too high, but all files are now read completely.

Lucene Performance Tuning
The analyzer has a big influence on indexing performance. If you use the SimpleAnalyzer instead of the StandardAnalyzer, indexing gets faster by a factor of about 6-8. In a project called "LogBrowser", where several GB of log files are indexed, the indexing time dropped from 5 hours to about 50 minutes!
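
A likely reason for the difference is that SimpleAnalyzer does much less work per character: it splits the text at every non-letter and lowercases the result, while StandardAnalyzer runs a grammar-based tokenizer plus additional filters. The following is only a rough sketch of that letter-only tokenization in plain Java. It is not Lucene's code; the class name SimpleTokenSketch and the sample log line are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class SimpleTokenSketch {

    /** Splits text at every non-letter character and lowercases the resulting tokens. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // A non-letter ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Made-up log line, only to show which parts survive the tokenization.
        String line = "2007-08-14 10:15:02 ERROR Connection refused: host=db01";
        System.out.println(tokenize(line)); // prints [error, connection, refused, host, db]
    }
}
```

The output also shows the price of the speedup: digits are not letters, so timestamps, numbers and the "01" in "db01" never reach the index. If the log search has to find error codes or IP addresses, that trade-off should be checked before switching analyzers.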
