Apache Solr: I used it once on a project where synonym searching was important. We could build a thesaurus of preferred and alternative terms, which was really useful for a project where the language used by "real people" isn't the same as the legal terminology. I remember it looked complex, but it really wasn't so difficult to implement.
It can search common document formats, including Word and PDF, so not just HTML. If I remember rightly, you can get a free or paid-for hosted version, and there are a few WordPress plugins that help you integrate it into a website, for example WPSOLR.
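To make the thesaurus idea concrete, here is a minimal sketch in plain Python of how a preferred term and its alternatives can be expanded at query time. This is not Solr's API (Solr handles synonyms through its own configuration), and the terms in `THESAURUS` are made-up examples, not from the project I mentioned.

```python
# Hypothetical thesaurus: preferred (legal) term -> alternative everyday terms.
THESAURUS = {
    "tenant": ["renter", "lodger"],
    "landlord": ["property owner"],
}


def build_lookup(thesaurus):
    """Map every term, preferred or alternative, to its full synonym group."""
    lookup = {}
    for preferred, alternatives in thesaurus.items():
        group = [preferred] + list(alternatives)
        for term in group:
            lookup[term.lower()] = group
    return lookup


def expand_query(query, lookup):
    """Return the synonym group for a query, or the query alone if unknown."""
    cleaned = query.strip()
    return lookup.get(cleaned.lower(), [cleaned])


LOOKUP = build_lookup(THESAURUS)
print(expand_query("renter", LOOKUP))  # ['tenant', 'renter', 'lodger']
```

A search for "renter" would then also look for "tenant" and "lodger", which is the behaviour that made the thesaurus so useful for us.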
Like Amar, I wondered about SearchWP, which is my preferred search plugin and one of three I recommend. That said, it looks like it doesn't support searching documents stored outside the Media library:
Can SearchWP index PDFs & Documents stored outside the Media library?
NO. SearchWP requires that PDFs & documents be uploaded to your WordPress Media library. In order for SearchWP to index and return results, each entry must have its own canonical, WordPress-provided object ID. This ID is assigned when files are uploaded to the Media library, and is essential for SearchWP. If you are using a document management plugin that stores uploads outside of the Media library of your WordPress install, SearchWP will NOT be able to work with these files.
Then I went searching for S3 search and came across Amazon CloudSearch. I didn't spend much time with the docs or thinking about how the integration with WordPress would work, but that does seem like a place to start.
Good luck, Nancy!
We have installed SearchWP and are going to see if it will work with WPEngine's LargeFS, which sends files over to an S3 bucket after they've "lived" in media for 10 days. The support team at WPEngine says it should still work.
From what you describe, I bet SearchWP will work. (For document indexing, it extracts the contents—when possible—and saves that separately to the WordPress database, so the document doesn't need to be on the server once it's been indexed.)
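To illustrate why the file's location stops mattering after indexing, here is a rough Python sketch of the general idea: the extracted text goes into a separate index keyed by document ID, and searches only touch that index. It is not SearchWP's actual code or schema; the `TextIndex` class, the IDs, and the sample text are all invented for illustration.

```python
import re
from collections import defaultdict


class TextIndex:
    """Toy inverted index: word -> set of document IDs."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, extracted_text):
        # Only the extracted words are stored, never the file itself.
        for word in re.findall(r"[a-z0-9]+", extracted_text.lower()):
            self.postings[word].add(doc_id)

    def search(self, word):
        return sorted(self.postings.get(word.lower(), set()))


index = TextIndex()
index.add(101, "Charter schools annual report")        # hypothetical document
index.add(102, "Community colleges funding overview")  # hypothetical document

# The original files could now move elsewhere (for example to an S3 bucket);
# lookups still work because they only read the index.
print(index.search("charter"))  # [101]
```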
One issue I think we will encounter is that our users are accustomed to searching with phrases or multiple words like "charter schools" or "community colleges", and SearchWP does not seem to work with multiple words.
Sounds like something isn't working right, if you're not seeing results for those. (For starters, make sure that your index is fully built.) You may need to play with SearchWP's settings around "keyword stemming" or install one of their "fuzzy matching" add-ons, which are free. For large datasets, it takes a bit of experimentation to get your search results quite right! Again, good luck!
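In case it helps to see why stemming matters for queries like "charter schools", here is a deliberately crude Python sketch. It is not how SearchWP implements stemming; `naive_stem` is a made-up helper that only strips a couple of English plural endings, and the sample documents are invented.

```python
import re


def naive_stem(word):
    """Strip a few common English plural endings. Deliberately crude."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def stems(text):
    """Lowercase the text, split it into words, and stem each one."""
    return {naive_stem(w) for w in re.findall(r"[a-z]+", text.lower())}


def matches(query, document):
    """True when every stemmed query word appears among the document's stems."""
    return stems(query) <= stems(document)


print(matches("charter schools", "Opening a new charter school"))           # True
print(matches("community colleges", "Local community college enrollment"))  # True
print(matches("charter schools", "Community college enrollment"))           # False
```

Without the stemming step, "schools" would not match a document that only says "school", which is one way a multi-word search can come back empty.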