Archaeological data pipeline

Client
Archeologie Delft
Period
2023
Role
Senior data analyst and model designer
Result
Harmonised dataset, quality guidelines, and repeatable enrichment process

Challenge

Archaeological fieldwork in Delft, and across the Netherlands, generates large volumes of data: find lists, context descriptions, coordinates, classifications. These data are recorded in varying formats, with different conventions per excavation period and per excavator. For research this is manageable; for reuse, policy analysis, and public presentation it is not. Archeologie Delft wanted an approach that would allow existing datasets to be systematically cleansed and enriched, not as a one-off cleanup but as a repeatable process.

What we built

The project delivered a semi-automated pipeline: a repeatable sequence of steps that ingests a raw archaeological dataset, validates it, harmonises it against a shared vocabulary, and enriches it with derived attributes. At the front end sits an information model of archaeological concepts (find, context, period, material, location) with associated validation rules. At the back end sit guidelines for new data recording, so that data collected after the project meets the quality requirements from the start. The pipeline is publicly available at wasstraat.e-space.nl.
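The sequence of steps can be illustrated with a minimal Python sketch. It is not the project's actual implementation: the `Find` fields, the material vocabulary, the validation rules, and the derived `century` attribute are all assumptions chosen to show how validation, harmonisation, and enrichment might chain together.

```python
from dataclasses import dataclass, replace
from typing import Optional

# Hypothetical vocabulary: recorded spelling -> preferred term.
MATERIAL_VOCABULARY = {
    "aardewerk": "ceramic",
    "pottery": "ceramic",
    "ceramic": "ceramic",
    "glas": "glass",
    "glass": "glass",
}


@dataclass(frozen=True)
class Find:
    """Illustrative slice of an information model for a find."""
    find_id: str
    context_id: str
    material: str
    start_year: Optional[int] = None
    end_year: Optional[int] = None
    century: Optional[int] = None  # derived during enrichment


def normalise(term: str) -> str:
    return term.strip().lower()


def validate(find: Find) -> list[str]:
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    if not find.find_id:
        problems.append("missing find_id")
    if not find.context_id:
        problems.append("missing context_id")
    if normalise(find.material) not in MATERIAL_VOCABULARY:
        problems.append(f"unknown material term: {find.material!r}")
    if (find.start_year is not None and find.end_year is not None
            and find.start_year > find.end_year):
        problems.append("start_year after end_year")
    return problems


def harmonise(find: Find) -> Find:
    """Replace the recorded material with the preferred vocabulary term."""
    return replace(find, material=MATERIAL_VOCABULARY[normalise(find.material)])


def enrich(find: Find) -> Find:
    """Add a derived attribute: the century of the start year (CE only)."""
    if find.start_year is None or find.start_year < 1:
        return find
    return replace(find, century=(find.start_year - 1) // 100 + 1)


def run_pipeline(raw_finds: list[Find]):
    """Validate, then harmonise and enrich; rejected records keep their problems."""
    accepted, rejected = [], []
    for find in raw_finds:
        problems = validate(find)
        if problems:
            rejected.append((find, problems))
        else:
            accepted.append(enrich(harmonise(find)))
    return accepted, rejected
```

Keeping rejected records together with their problem list, rather than dropping them, is what makes such a process semi-automated: a person can correct the source data and run the same steps again.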

Key design decisions

  • Process over one-off cleanup. Without a process, a cleansed dataset is contaminated again tomorrow. The pipeline is therefore the real product; the cleansed dataset is an output of it.
  • Vocabulary as anchor point. Harmonisation begins with an explicit vocabulary of archaeological terms and their relationships. This makes mapping between datasets possible without re-debating meaning every time.
  • Guidelines for primary data recording. The best pipeline is one you never need to use. Guidelines for new fieldwork ensure that future data is usable from the outset.
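The vocabulary decision can be made concrete with a small sketch. The terms, the broader-term relationships, and the two dataset mappings below are illustrative assumptions, not the project's vocabulary; the point is that each dataset maps its own labels to a preferred term once, after which records from different excavations can be compared and grouped.

```python
from typing import Optional

# Hypothetical vocabulary: each preferred term points to its broader term.
BROADER: dict[str, Optional[str]] = {
    "tin-glazed earthenware": "earthenware",
    "earthenware": "ceramic",
    "stoneware": "ceramic",
    "ceramic": None,
}

# Hypothetical mappings from each dataset's local labels to preferred terms.
DATASET_MAPPINGS: dict[str, dict[str, str]] = {
    "excavation_a": {"faience": "tin-glazed earthenware", "steengoed": "stoneware"},
    "excavation_b": {"TGE": "tin-glazed earthenware", "aardewerk": "earthenware"},
}


def to_preferred(dataset: str, local_term: str) -> str:
    """Look up the preferred term for a label used in one dataset."""
    return DATASET_MAPPINGS[dataset][local_term]


def ancestors(term: str) -> list[str]:
    """Walk the broader-term chain upwards, nearest ancestor first."""
    chain = []
    parent = BROADER[term]
    while parent is not None:
        chain.append(parent)
        parent = BROADER[parent]
    return chain


def is_a(term: str, broader: str) -> bool:
    """True if `term` equals `broader` or sits below it in the vocabulary."""
    return term == broader or broader in ancestors(term)
```

With these definitions, `to_preferred("excavation_a", "faience")` and `to_preferred("excavation_b", "TGE")` resolve to the same term, and `is_a("stoneware", "ceramic")` holds, so a query for all ceramics covers both datasets without revisiting what each local label meant.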

Results and adoption

The project delivered a harmonised dataset of archaeological data with improved quality and reusability, a repeatable enrichment pipeline, and guidelines for future data recording. The pipeline runs publicly at wasstraat.e-space.nl and is generic enough to be applied, with modifications, in other heritage contexts.

Where to find

data harmonisation, cultural heritage, data quality, archaeology