Overview
The UK Data Service (UKDS) is a partnership between the Universities of Essex, Manchester and Edinburgh, UCL and Jisc, and is the UK service provider to the Consortium of European Social Science Data Archives (CESSDA).
The focus of this use case is on research ‘studies’ of a sensitive nature deposited at the UK Data Service and on the data products subsequently derived from them. The sensitivity issue is the handling of information that contains directly identifiable personal data, or data that could lead to reidentification, either through related variables within the dataset or through linkage with other data. We need to consider the implications of handling sensitive data for PID management and for the associated kernel metadata of a related identifier/object, e.g. for a repository identifier (“is currently approved to hold sensitive data”) or a researcher identifier (“has current credentials for accessing sensitive data”).
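Both example statements above are time-bound ("currently", "current"), so a kernel metadata attribute of this kind would probably need a validity period. The following is a minimal sketch of that idea only. The class name `Approval` and its fields are hypothetical and are not taken from any PID kernel specification.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Approval:
    """Hypothetical time-bound statement attached to a related identifier."""
    statement: str                      # e.g. "approved to hold sensitive data"
    valid_from: date
    valid_until: Optional[date] = None  # None means no known end date

    def is_current(self, on: date) -> bool:
        """Return True if the statement holds on the given date."""
        if on < self.valid_from:
            return False
        return self.valid_until is None or on <= self.valid_until


# Illustrative values only
repo_approval = Approval(
    statement="approved to hold sensitive data",
    valid_from=date(2023, 1, 1),
    valid_until=date(2023, 12, 31),
)
print(repo_approval.is_current(date(2023, 6, 1)))
```

Recording the period rather than a simple yes/no flag would allow a consumer to check the statement against the date of access, which matters when approvals or credentials lapse.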
We will ensure the inclusiveness of metadata by considering the different requirements of PID management for digital objects beyond datasets. In CoreTrustSeal terms, the focus is Digital Object Management. Version changes to a digital object can trigger many different outcomes and the presence of sensitive data sets additional conditions on the digital object management process. Maintaining the provenance information is important throughout the process.
A sensitive digital object may or may not be assigned a persistent identifier and associated sensitivity metadata during the Conceive, Create and Collect phase. At the point of deposit, this sensitivity-related information may already exist and need to be integrated, or the repository may be entirely responsible for the assignment and handling of the identifier and related metadata. Throughout the subsequent work on curation, quality, compliance, discovery, identification, access and reuse, the sensitivity metadata may need to be updated as digital objects are copied, changed, versioned and linked.
Description
Dealing with sensitive data requires careful consideration in terms of sensitivity assessment and version management. This use case will define and identify all the possible scenarios for ‘change triggers’ that could impact PID generation, PID metadata changes (change logs) and the ‘declaration’ of a new object during the various phases of the research data life cycle. The domain for exploration is the Social Sciences, which comprise varied data types, formats and sources. The data is both qualitative and quantitative in nature (the latter including statistical and survey data), and DDI is the predominant form of metadata. The immediate scope of this use case is controlled data (data that may be disclosive) held in deposited digital objects. However, there is a need to explore and clarify the relationship between Sensitivity, Confidentiality and Disclosure.
Once the change triggers have been defined, there is a call for expansion of views. As we expand and explore these issues, the focus is on defining criteria for when a new persistent identifier is required, when it is sufficient to update metadata without a new PID, and when that metadata should form part of the metadata kernel. This stage is followed by review and alignment with the PID policy and any implementation guidance. There is also a clear call for alignment with the issues around complex data citation and data production workflows in PID management.
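The three outcomes named above (a new PID, a kernel metadata update, or an update to other metadata) could be expressed as a simple lookup from change trigger to outcome. The sketch below is illustrative only: the trigger names and the outcome assigned to each are hypothetical placeholders, since defining the real triggers and criteria is the work this use case sets out to do.

```python
from enum import Enum


class Outcome(Enum):
    NEW_PID = "mint a new PID"
    KERNEL_UPDATE = "update kernel metadata, keep PID"
    METADATA_UPDATE = "update other metadata, keep PID"


# Hypothetical triggers and outcomes, for illustration only
RULES = {
    "content_changed": Outcome.NEW_PID,
    "sensitivity_level_changed": Outcome.KERNEL_UPDATE,
    "description_corrected": Outcome.METADATA_UPDATE,
}

# Outcomes ordered from most to least far-reaching
PRIORITY = [Outcome.NEW_PID, Outcome.KERNEL_UPDATE, Outcome.METADATA_UPDATE]


def classify(triggers):
    """Return the most far-reaching outcome among the given triggers.

    Returns None when no triggers are given. An unknown trigger name
    raises KeyError, so gaps in the rule table are not silently ignored.
    """
    outcomes = {RULES[t] for t in triggers}
    for outcome in PRIORITY:
        if outcome in outcomes:
            return outcome
    return None


print(classify(["description_corrected", "content_changed"]))
```

When several triggers fire in one change event, this sketch assumes the most far-reaching outcome wins. That is one possible policy and would itself need to be confirmed against the PID policy.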
For file-based objects we have to consider the dependencies between files and the potential for granular version changes that can be triggered by changes within a hierarchy, such as a series of questions or variables. We also need to consider whether a change, new version or new identifier for a ‘child’ has implications for its ‘parent’. For semantic data, the network of version and identifier changes could be influenced by third parties who have control over some part of the linked data ‘set’ (e.g. a semantic artefact such as a controlled vocabulary). The level of granularity at which PIDs are provided affects the possible granularity of citation.
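Whether a change to a child should affect its parent is an open question in this use case. As an aid to discussion, the sketch below shows one possible policy, in which a version change to a child increments the version of every ancestor. The class `DigitalObject`, its fields and the function name are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DigitalObject:
    """Hypothetical node in a study > file > variable hierarchy."""
    name: str
    version: int = 1
    parent: Optional["DigitalObject"] = None


def bump_with_ancestors(obj):
    """Increment the version of obj and of every ancestor.

    Returns the names of the objects touched, starting with obj.
    """
    touched = []
    node = obj
    while node is not None:
        node.version += 1
        touched.append(node.name)
        node = node.parent
    return touched


study = DigitalObject("study")
data_file = DigitalObject("file", parent=study)
variable = DigitalObject("variable", parent=data_file)

print(bump_with_ancestors(variable))
```

An alternative policy would stop propagation at a chosen level, for example leaving the study version unchanged when only a variable label is corrected. The choice interacts with the granularity at which PIDs are assigned.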
Later plans include looking in more detail at the implications of different levels of responsibility for the data and/or metadata and for rights management (including machine-actionable rights). We will also identify whether previous identifiers exist, whether these identifiers will be maintained, whether they will be updated by the repository or whether a new identifier (and version/provenance model) will be applied. Lastly, we will define the handling of multiple deposit events and the impact on any identifiers.
Challenges that need to be addressed
Communicating and clarifying how different approaches to copies, changes and versions are handled across different environments and domains is a foundational challenge to any best practices around digital object lifecycle management. We will consider whether some initial general-purpose documentation on this topic is required to support a briefer and more digestible approach to the specific issues surrounding sensitivity. Sensitive data about people faces issues of both perceived and actual risk, which must be addressed for human subjects, researchers, repositories, funders and the wider public. Communicating disclosure risk and mitigation processes can be complicated and technical. Overall transparency of practice, delivered through safe and trustworthy organisations, is critical. For this reason, the metadata about digital objects and the metadata about the research projects, repositories and reuse environments that care for them must be aligned. We will seek to provide a generally applicable set of approaches while also addressing the particular challenges of linked data.
Expected Impact of the Use Case
This use case leverages the expertise of a large international consortium of social science data archives (CESSDA), through which the UK Data Service has access to use case specifications from across service providers. Well-designed guidance on PID usage for sensitive data (access and management) will provide better-aligned and documented practices that support the interoperability of organisations and digital objects across secure environment borders. Furthermore, more efficient use of PIDs for sensitive data will benefit research, and thus have societal and economic impact.
Expected outputs
Best practice documentation.