Access to data, and the inferences such access allows, presents great opportunities for solving complex societal problems, but it also entails risks to privacy and freedom. Recognizing this, researchers and governments alike have moved away from the original ideal of “Open Data”, and important efforts have been made to improve data protection and control over analytics, with the EU leading the way through the GDPR and the Digital Services Act.
However, these necessary concerns and the resulting legislation should not create unnecessary bureaucratic burdens that hinder research and innovation. For example, Portugal has been a front-runner in promoting collaboration between academia and public institutions, offering grants to researchers who work with data to support public administration and policy design; yet these researchers often find that access to public data is too constrained or too slow to allow for productive interactions or to contribute to decision-making. By contrast, it is common for private companies working on public projects to have almost unlimited access to citizens’ data, as they help build databases or offer strategic consultancy services. It is therefore important to develop new systems that facilitate access to data while guaranteeing ethical treatment and privacy.
In this post, we present a draft of one such system and open it for discussion. It is inspired by the certification schemes other areas of science have adopted, from biological experimentation to the handling of hazardous chemicals or nuclear materials.
In general terms, access should be granted to certified individuals for the pursuit of specific, bounded goals, subject to examination by ethics committees; once these certifications are issued, they remain valid for a defined period during which access to data is facilitated. In addition, unlike in the examples above, citizens’ rights regarding the ownership of their data and their autonomy are at stake, so citizens should also have a say, either individually or through organizations dedicated to protecting the public’s digital rights, resulting in legislation to regulate the field.
In simplified form, the suggested scheme would be the following:
- Step 1: Data classification. Different types of data would be classified according to risk levels. This is in line with the GDPR, and more fine-grained classifications could be developed by data owners and data controllers, involving citizens where relevant;
- Step 2: Researcher Clearance. Researchers (from both the public and private sectors) wanting to work with any type of human data would be required to take data-handling courses, covering, for example, ethics, GDPR compliance, and data management. Different courses would grant different certifications, which would in turn correspond to different “clearance” levels: for example, there should be separate courses for researchers handling personal or sensitive data (such as health data or non-anonymized information) and for those handling data the GDPR does not consider of concern. As with courses on animal experimentation or the handling of radioactive materials, these courses should be certified at the institutional, national, or European level, and offered free and online, through MOOCs or similar formats;
- Step 3: Project Clearance. Research projects (from both the public and private sectors) would be evaluated independently by ethics and data protection committees. As already happens for scientific research, this could be done by the researchers’ host institutions. This step is typically already required to access public funding and some data, but it should be generalized to cover all projects that involve human data;
- Step 4: Access to Research Infrastructures. Once certification authorities are established, both for projects and for individual scientists, access control systems can be built into the data sources. These would stipulate who has access (certified scientists and projects of level A would only have access to level A data, and so on), when, and for what purpose, including logging and monitoring of said access (a minimal sketch of such a check follows this list). In addition, such a certification scheme could easily be integrated with existing big data research infrastructures (RIs).
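To make the interplay between the four steps concrete, here is a minimal sketch in Python. Everything in it is an illustrative assumption rather than an existing standard or API: the three-level `RiskLevel` granularity, the `Certification` and `ProjectClearance` records, and the `may_access` check are hypothetical names chosen for this post.

```python
from dataclasses import dataclass
from datetime import date
from enum import IntEnum
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("data-access")


class RiskLevel(IntEnum):
    # Step 1: risk classification of a dataset (three levels are an assumption)
    PUBLIC = 0      # data the GDPR does not consider of concern
    PERSONAL = 1    # personal but non-sensitive data
    SENSITIVE = 2   # e.g. health data or non-anonymized records


@dataclass
class Certification:
    # Step 2: a researcher's clearance, valid only for a bounded period
    researcher_id: str
    clearance: RiskLevel
    expires: date


@dataclass
class ProjectClearance:
    # Step 3: ethics/data-protection approval of a specific, bounded project
    project_id: str
    clearance: RiskLevel
    expires: date


def may_access(cert: Certification, project: ProjectClearance,
               dataset_level: RiskLevel, today: date) -> bool:
    # Step 4: grant access only if both the researcher and the project hold a
    # valid, sufficient clearance; every decision is logged for monitoring.
    allowed = (cert.expires >= today and project.expires >= today
               and cert.clearance >= dataset_level
               and project.clearance >= dataset_level)
    log.info("access %s: researcher=%s project=%s level=%s",
             "GRANTED" if allowed else "DENIED",
             cert.researcher_id, project.project_id, dataset_level.name)
    return allowed


if __name__ == "__main__":
    cert = Certification("r-001", RiskLevel.SENSITIVE, date(2026, 12, 31))
    proj = ProjectClearance("p-042", RiskLevel.PERSONAL, date(2026, 6, 30))
    # Denied: the project is cleared only up to PERSONAL, not SENSITIVE data.
    may_access(cert, proj, RiskLevel.SENSITIVE, date(2026, 1, 15))
```

The key design choice is that both certificates carry an expiry date, matching the bounded duration described above, and that every decision, granted or denied, leaves an audit trail.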
This is a basic version of a possible vertical system to facilitate access to relevant research data, and we note that recent efforts in the development of RIs have made ethical and privacy concerns central to their design. For example, recently developed RIs such as SoBigData or the New Zealand Integrated Data Infrastructure put forward privacy- and ethics-sensitive social data mining platforms, where researchers may conduct their analyses in protected virtual environments with only restricted access to the original sensitive data. Access control could easily be integrated into such platforms by making access to the virtual environments conditional on prior clearance, which could, for example, depend on researchers taking mandatory training on ethics and privacy risks in research, as suggested above.
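As an illustration of that integration point, the sketch below builds on the previous one and reuses its types. The `open_virtual_environment` function is hypothetical (to our knowledge, neither SoBigData nor the Integrated Data Infrastructure exposes such an API); the sketch only shows where the clearance check would sit in a platform of this kind.

```python
# Hypothetical integration with an RI's protected virtual environments:
# a workspace is provisioned only after the clearance check above passes.
def open_virtual_environment(cert: Certification, project: ProjectClearance,
                             dataset_level: RiskLevel, today: date) -> str:
    if not may_access(cert, project, dataset_level, today):
        raise PermissionError("valid researcher and project clearance required")
    # In a real RI, this would provision a sandboxed environment in which
    # analyses run against the data without exporting the raw records.
    return f"session:{cert.researcher_id}:{project.project_id}"
```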
Overall, it is our belief that such a vertical system would make data access faster, safer, and more accountable.