The Data Container is a browser-based platform that lets users search through large volumes of documents, either crawled or scraped from public websites or uploaded manually by the users themselves.

Infrastructure

The Data Container (DC) is built on top of Aleph, a visual tool for exploring large datasets. Under the hood are a few key components:

  • an Elasticsearch cluster that maintains fast indexes and handles all search and filtering operations
  • a set of crawlers and scrapers, some of them built within Aleph, some of them externally triggered
  • celeryd, the Celery worker daemon, a queue runner whose job is to make sure all queued requests are executed properly
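
To illustrate how the search and filtering operations might reach Elasticsearch, here is a minimal sketch of a request body written in the Elasticsearch query DSL. It is not taken from Aleph's code: the function name `build_search_body` and the field name `source_id` are hypothetical, and the real index mapping may differ.

```python
import json


def build_search_body(keywords, source_id=None, size=20):
    """Build an Elasticsearch request body for a keyword search.

    `source_id` is a hypothetical field name used to restrict the
    search to one document source; the real mapping may differ.
    """
    body = {
        "query": {
            "bool": {
                # Full-text part: scored keyword matching.
                "must": [{"query_string": {"query": keywords}}],
            }
        },
        "size": size,
    }
    if source_id is not None:
        # Filter part: exact match, does not affect scoring.
        body["query"]["bool"]["filter"] = [{"term": {"source_id": source_id}}]
    return body


if __name__ == "__main__":
    print(json.dumps(build_search_body("offshore company", source_id=7), indent=2))
```

Keeping the keyword match in `must` and the source restriction in `filter` follows the usual split between scored full-text clauses and unscored exact-match clauses.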

Key features

  • fast search through large sets of documents
  • intuitive search and filtering interface
  • option to add your own document sources and set a privacy level for each: public, or shared only with selected platform members
  • integration with DocumentCloud (Dropbox support is also on the roadmap)
  • OAuth authentication (e.g. Twitter, Google)

Extra features

In addition to Aleph's own features, we are developing a few extras:

  • option to upload your own datasets (PDF only at first; more formats are planned)
  • synchronized split search windows: no more cluttering the UI with filters, keywords, search results and previews! We are designing an interface that puts the search query and the results into separate windows (optional, can be turned off), which makes dual-screen search work easier.
  • batch search by uploading a file containing multiple queries (one per line). For now, all results will be rendered at once in vertical UI tabs, so the same searches do not have to be repeated by hand (e.g. when investigating or monitoring a set of keywords, or when feeding the system with search results from another source).
  • integrations with ownCloud and FTP resources
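
The batch search feature described above could be sketched as follows. This is only an illustration under stated assumptions, not the planned implementation: the function names `parse_batch_queries` and `run_batch` are hypothetical, `search` stands in for whatever function actually queries the index, and skipping blank lines and duplicate queries is an assumption rather than a documented behaviour.

```python
def parse_batch_queries(text):
    """Return the queries from an uploaded file, one per line.

    Blank lines and repeated queries are dropped (an assumption);
    the original order of first appearance is kept.
    """
    seen = set()
    queries = []
    for line in text.splitlines():
        query = line.strip()
        if not query or query in seen:
            continue
        seen.add(query)
        queries.append(query)
    return queries


def run_batch(text, search):
    """Run every query through `search` and key the results by query.

    `search` is any callable taking a query string; each dictionary
    entry would correspond to one results tab in the UI.
    """
    return {query: search(query) for query in parse_batch_queries(text)}


if __name__ == "__main__":
    uploaded = "offshore\n\n  shell company \noffshore\n"
    # Stand-in search function that just echoes the query.
    print(run_batch(uploaded, lambda q: ["result for " + q]))
```

Because the results are keyed by query in insertion order, the tabs can be rendered in the same order as the lines of the uploaded file.
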
kep/data_container.txt · Last modified: 2017/02/21 15:06 by andreeab