TEXTCROWD - Collaborative semantic enrichment of text-based datasets


Machine learning technologies have rapidly gained importance, especially in disciplines such as archaeology, where most information is held in free-text documents rather than in relational databases or other structured datasets. The Social Sciences and Humanities research communities also face a fragmented research landscape, which the European Open Science Cloud (EOSC) can help address.

The EOSC would help overcome such fragmentation by building on structuring and integrating initiatives such as the CLARIN, DARIAH and E-RIHS ERICs, and Digital Humanities organizations (e.g. their association, ADHO), to offer advanced text-based services that address common research needs (see the recent survey by PARTHENOS). One example is enabling the semantic enrichment of text sources through cooperative, supervised crowdsourcing based on shared semantics, and then making this work available to others via EOSC. This would benefit many scientists in the long tail, even though delivering such a service presents real challenges around interoperability and multilingualism.

TEXTCROWD is an advanced cloud-based tool, developed within the framework of the EOSCpilot project, for processing textual archaeological reports. The tool has been extended so that it can browse large online knowledge repositories, train itself on demand, and produce semantic metadata ready to be integrated with information from other domains, establishing an advanced machine learning scenario.
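To make the idea of semantic metadata more concrete, the following is a minimal sketch of vocabulary-based annotation of a report sentence. It is not TEXTCROWD's actual pipeline, which is not described here: the vocabulary entries, the concept labels, the Annotation record and the annotate function are all hypothetical names chosen for illustration, and simple whole-word matching stands in for the tool's machine learning components.

```python
import re
from dataclasses import dataclass

# Hypothetical shared vocabulary: surface term -> concept label.
# These entries are illustrative only and are not taken from TEXTCROWD.
VOCABULARY = {
    "amphora": "object:Vessel",
    "trench": "feature:ExcavationUnit",
    "roman": "period:Roman",
}


@dataclass(frozen=True)
class Annotation:
    term: str     # text as it appears in the report
    concept: str  # concept label from the vocabulary
    start: int    # character offset where the match begins
    end: int      # character offset just past the match


def annotate(text: str, vocabulary: dict[str, str]) -> list[Annotation]:
    """Return one annotation per whole-word, case-insensitive vocabulary match."""
    annotations = []
    for term, concept in vocabulary.items():
        pattern = re.compile(rf"\b{re.escape(term)}\b", re.IGNORECASE)
        for match in pattern.finditer(text):
            annotations.append(
                Annotation(match.group(0), concept, match.start(), match.end())
            )
    # Order annotations by their position in the text.
    return sorted(annotations, key=lambda a: a.start)


report = "A Roman amphora was found in trench 4."
for a in annotate(report, VOCABULARY):
    print(a.term, a.concept, a.start, a.end)
```

Each annotation ties a span of the original text to a concept from a shared vocabulary, which is the kind of record that could later be reviewed by contributors and merged with data from other domains.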


Cultural heritage and humanities datasets are largely based on texts:

  • Reports
      • Archaeology: excavations, surveys
      • Conservation: diagnosis, restoration – often mixed with numeric results
  • Grey literature
  • Literary/historical sources
  • Research articles
  • Monographs


Achille Felicetti, Teaching Archaeology to Machines: Extracting Semantic Knowledge from Free Text Excavation Reports, ERCIM News 111, October 2017.


The development of TEXTCROWD is part of the activities of EOSCpilot WP4, led by Hermann Lederer, and is supported by Kathrin Beck and Thomas Zastrow of the Max Planck Computing and Data Facility (MPCDF).

TEXTCROWD on EOSC: https://eoscpilot.eu/science-demos/textcrowd

Main contacts

TEXTCROWD is developed and maintained by VAST-LAB / PIN.

Franco Niccolucci (franco.niccolucci@gmail.com)
Achille Felicetti (achille.felicetti@gmail.com)



Access

Access the TextCrowd VRE with your D4Science credentials.