Infrastructure


The infrastructure has four components: data collection, data processing, data storage, and data analysis. The first and second components consist of servers and scripts that carry out data collection and processing tasks. Importantly, these tasks can be scheduled (e.g., downloading a Twitter timeline automatically once a day). The third component is a database in which all information is stored and checked automatically, once a day, for integrity and duplicates. The database is distributed over several servers to ensure data permanence: if one server has a problem, the database remains fully operational, including its backup capabilities. The fourth component, data analysis, runs on additional servers that provide GUIs for R and Python analyses, which, as with data collection, can be scheduled. For example, new documents can be classified automatically with existing scripts as they are added to the database.
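
To illustrate the daily integrity and duplicate check, here is a minimal Python sketch. It is not the project's actual implementation: the record fields (`id`, `source`, `text`), the function names, and the use of a SHA-256 content hash to detect duplicates are all assumptions made for illustration. In practice, a script like this would presumably be run once a day by a scheduler and would read from the database rather than from an in-memory list.

```python
import hashlib
import json


def record_fingerprint(record: dict) -> str:
    """Return a stable SHA-256 hash of a record's content (hypothetical schema)."""
    # Sorting keys makes the hash independent of field order.
    canonical = json.dumps(record, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def check_records(records: list[dict], required_fields: tuple[str, ...]) -> dict:
    """Split records into unique, duplicate, and invalid ones."""
    seen = set()
    report = {"unique": [], "duplicates": [], "invalid": []}
    for record in records:
        # Integrity check (assumed rule): every required field is present and non-empty.
        if any(not record.get(field) for field in required_fields):
            report["invalid"].append(record)
            continue
        fingerprint = record_fingerprint(record)
        if fingerprint in seen:
            report["duplicates"].append(record)
        else:
            seen.add(fingerprint)
            report["unique"].append(record)
    return report


if __name__ == "__main__":
    # Hypothetical sample data standing in for rows read from the database.
    sample = [
        {"id": "1", "source": "twitter", "text": "hello"},
        {"text": "hello", "id": "1", "source": "twitter"},  # same content, different field order
        {"id": "2", "source": "twitter", "text": ""},  # empty text fails the integrity check
    ]
    result = check_records(sample, required_fields=("id", "source", "text"))
    print({name: len(items) for name, items in result.items()})
```

Hashing a canonical serialization of each record is only one way to find exact duplicates; a production database could instead rely on unique constraints or on near-duplicate detection, depending on the data.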



Please refer to one of the subpages for more details regarding data collection and data access.