7 people attended
Talk by Danlu Chen
Danlu Chen presented recent work on machine learning and natural language processing for ancient writing systems, focusing on cuneiform data.
CuneiML
- Paper: Chen et al., JOHD 2023, https://doi.org/10.5334/johd.151, https://github.com/taineleau/CuneiML
- The paper describes the preparation of a curated dataset of high-resolution photographs of Sumerian and Akkadian cuneiform tablets, linked to Unicode transcriptions, transliterations, line art, and metadata.
- The dataset is intended to support machine learning on cuneiform material, for example paleographic dating, sign recognition and OCR, provenance attribution, and cross-collection analysis. It places particular emphasis on consistent preprocessing of image data derived from the CDLI collection (https://cdli.earth).
- Challenges:
- Cuneiform tablet photographs are highly heterogeneous and can introduce substantial noise.
- For image preprocessing, CuneiML uses cutout detection with heuristic and computer vision approaches to identify the major face of a tablet.
- A key decision when preparing the text data is the choice between Latin transliteration and cuneiform Unicode transcription.
- Danlu argued that Unicode-based representations are more robust for machine learning, whereas Latin transliteration already contains scholarly interpretation and potential bias.
- Image source can have a strong influence on model behaviour. This may partly reflect meaningful structure in the data (e.g. certain collections being associated with particular periods), but it also raises the risk that models are “cheating” by learning collection-specific features rather than the intended signal.
- Transliteration practices vary by source institution and influence downstream results.
- Approaches such as input denoising, sometimes used in medical machine learning, do not seem to work well for this kind of material.
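
One way to check whether a model is "cheating" on collection-specific features is to evaluate it only on collections it has never seen during training. The following is a minimal sketch of such a group-aware split, not the method used in CuneiML. The `collection` field name, the function `split_by_collection`, and the toy records are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def split_by_collection(records, test_fraction=0.2, seed=0):
    """Split records so that no collection appears in both train and test.

    `records` is a list of dicts with a "collection" key (hypothetical schema).
    """
    # Group records by their source collection.
    by_collection = defaultdict(list)
    for record in records:
        by_collection[record["collection"]].append(record)

    # Shuffle collection names reproducibly and hold out a fraction of them.
    collections = sorted(by_collection)
    random.Random(seed).shuffle(collections)
    n_test = max(1, round(len(collections) * test_fraction))
    test_collections = set(collections[:n_test])

    train = [r for c in collections if c not in test_collections
             for r in by_collection[c]]
    test = [r for c in collections if c in test_collections
            for r in by_collection[c]]
    return train, test


# Invented toy records, for illustration only.
records = [
    {"id": "t1", "collection": "A", "period": "Ur III"},
    {"id": "t2", "collection": "A", "period": "Ur III"},
    {"id": "t3", "collection": "B", "period": "Old Babylonian"},
    {"id": "t4", "collection": "B", "period": "Ur III"},
    {"id": "t5", "collection": "C", "period": "Old Babylonian"},
]

train, test = split_by_collection(records, test_fraction=0.34)
print("train collections:", sorted({r["collection"] for r in train}))
print("test collections:", sorted({r["collection"] for r in test}))
```

If accuracy drops sharply on held-out collections compared with a random split, that would suggest the model has learned collection-specific features rather than the intended signal.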
LogogramNLP
- Paper: Chen et al., ACL 2024, https://doi.org/10.48550/arXiv.2408.04628, https://github.com/taineleau/logogramNLP
- This paper introduces a benchmark for Natural Language Processing (NLP) on ancient logographic writing systems that includes both visual and transcribed data for several tasks.
- Claim: Direct processing of visual representations can outperform text-based approaches and may help to make large amounts of currently untranscribed cultural heritage data accessible for NLP analysis.
Discussion
- The discussion addressed possible strategies for dealing with confounds in machine learning. Asked whether confounds could be handled in a way similar to confound correction in Bayesian statistics, Danlu explained that the main strategy is usually to improve or rebalance the input data.
- The analysis of embeddings was also discussed as a way to explore and visualize textual datasets.
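
The rebalancing strategy mentioned in the discussion could take several forms. One simple possibility is sketched here: inverse-frequency sample weights, so that each source collection contributes equally to training regardless of its size. This is an illustrative assumption, not a description of what was done in the papers; the function name `balancing_weights` and the toy labels are hypothetical.

```python
from collections import Counter


def balancing_weights(groups):
    """Return one weight per sample so that every group has equal total weight.

    Each sample receives n_samples / (n_groups * group_size).
    """
    counts = Counter(groups)
    n_groups = len(counts)
    n_samples = len(groups)
    return [n_samples / (n_groups * counts[g]) for g in groups]


# Invented example: collection "A" is three times larger than collection "B".
groups = ["A", "A", "A", "B"]
weights = balancing_weights(groups)

# Sum the weights per group to confirm that both groups now carry equal weight.
totals = {}
for g, w in zip(groups, weights):
    totals[g] = totals.get(g, 0.0) + w
print(totals)
```

Such weights could be passed to a weighted loss or used as sampling probabilities. Resampling or collecting additional data for under-represented collections are alternatives in the same spirit.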