Aleph is a tool for indexing large volumes of documents (PDF, Word, HTML) and structured data (CSV, XLS, SQL) for easy browsing and search. It is built with investigative reporting as its primary use case. Aleph can cross-reference mentions of well-known entities (such as people and companies) against watchlists, e.g. from prior research or public datasets.
Here are some key features:
- Web-based search across large document and data sets.
- Imports many file formats, including popular office formats, spreadsheets, email and zipped archives. Processing includes optical character recognition, language and encoding detection and named entity extraction.
- Load structured entity graph data from databases and CSV files. This allows navigation of complex datasets like company registries, sanctions lists or procurement data. Import tools for OpenSanctions are included.
- Receive notifications for new search matches with a personal watchlist.
- OAuth authorization and access control on a per-source and per-watchlist basis.
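To make the watchlist cross-referencing idea concrete, here is a minimal sketch in Python. It is not Aleph's implementation: the function names `normalize` and `cross_reference`, the sample documents and the watchlist entries are all hypothetical, and simple normalized phrase matching stands in for whatever entity extraction and matching Aleph actually performs.

```python
from __future__ import annotations

import re
import unicodedata


def normalize(name: str) -> str:
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return re.sub(r"[^a-z0-9]+", " ", without_accents.lower()).strip()


def cross_reference(
    documents: dict[str, str], watchlist: list[str]
) -> dict[str, list[str]]:
    """Map each watchlist entry to the ids of documents that mention it.

    Hypothetical helper: matching is whole-phrase on normalized text,
    which is far simpler than real named entity extraction.
    """
    # Pad with spaces so that only whole-word phrases match.
    normalized_docs = {
        doc_id: f" {normalize(text)} " for doc_id, text in documents.items()
    }
    matches: dict[str, list[str]] = {}
    for entity in watchlist:
        key = normalize(entity)
        if not key:
            continue  # skip entries that normalize to nothing
        needle = f" {key} "
        hits = [doc_id for doc_id, text in normalized_docs.items() if needle in text]
        if hits:
            matches[entity] = hits
    return matches


# Illustrative data only; none of these names come from a real dataset.
documents = {
    "memo.pdf": "Payment approved by José Pérez of Acme Holdings Ltd.",
    "registry.csv": "ACME HOLDINGS LTD, registered 2009",
}
watchlist = ["Jose Perez", "Acme Holdings Ltd", "Globex"]
result = cross_reference(documents, watchlist)
```

Normalizing both sides before comparing lets "José Pérez" in a document match "Jose Perez" on a watchlist. A production system would also need to handle aliases, transliteration and fuzzy matches.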
Download & Tutorial
Source: https://github.com/alephdata/