Automatic comparison of theater plays using edit distances and alignment – Laboratoire d'Informatique Gaspard-Monge UMR 8049

This thesis, situated in the digital humanities, focuses on the study of drama, and more particularly on the design and evaluation of models that represent play texts and support automatic comparisons between them.

On the literary side, several digital corpora are used, with particular attention paid to classical theater and to 17th-century French and European theater. The main corpora are the Hyperpièces dataset, which comprises more than 500 pairs of similar plays, and files from the Dracor project, a European database of XML-TEI plays whose French collection totals nearly 2000 plays. We also augment the corpus by converting 1350 new plays from the Theatre-Documentation website to XML-TEI, which enriches it with 18th-century productions.

These corpora serve throughout our work as test data for the comparison algorithms we design. Our algorithms leverage both structural and semantic similarities to compare plays.
For structural comparison, we give an overview of the existing models, most notably character networks. We introduce the use of the parameterized words formalism to detect character renamings between plays. We give a complete study of the theoretical complexity of approximate parameterized matching under several edit distance variants. We prove the NP-completeness of a large class of these problems, and give heuristics and approximate algorithms to solve them. We also provide exact algorithms for these problems, in the form of Max-SAT encodings. We introduce a generalization of parameterized matching using sets, and study the associated problems. Finally, we detail the use of parameterized matching for the study of drama on a curated multilingual corpus, and discuss the strengths and limitations of the technique.
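
To make the idea of parameterized matching concrete, here is a minimal sketch of the exact (error-free) case only: two sequences of speakers match if a one-to-one renaming of characters turns one into the other. The function name `find_renaming` and the speaker names are hypothetical and chosen for illustration; this is not the thesis's implementation, and it does not cover the approximate variants or the Max-SAT encodings.

```python
def find_renaming(seq_a, seq_b):
    """Return a one-to-one renaming that maps seq_a onto seq_b, or None.

    Assumption: each play is reduced to its sequence of speaker names,
    and every name is treated as a renamable parameter.
    """
    if len(seq_a) != len(seq_b):
        return None
    forward, backward = {}, {}
    for a, b in zip(seq_a, seq_b):
        # setdefault stores the first pairing seen and returns any earlier one,
        # so a mismatch means the renaming would not be consistent or one-to-one.
        if forward.setdefault(a, b) != b or backward.setdefault(b, a) != a:
            return None
    return forward


# Toy speaker sequences (invented, not taken from any corpus).
speakers_a = ["Alice", "Bob", "Alice", "Carol"]
speakers_b = ["Dora", "Eve", "Dora", "Fay"]
print(find_renaming(speakers_a, speakers_b))
# {'Alice': 'Dora', 'Bob': 'Eve', 'Carol': 'Fay'}
```

The exact case runs in a single linear pass. Allowing insertions, deletions, or substitutions between the two sequences leads to the approximate problems whose complexity the thesis studies.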

To perform semantic comparisons, we use several NLP models to design algorithms that automatically generate alignments between plays. We introduce and analyze a model that uses TF-IDF metrics to compare character networks and retrieve character renamings. We propose methods for distant reading comparisons based on indicators of increasing complexity, built with Word2Vec methods. Using BERT models to compare sentences, we introduce a method to produce alignments between plays. We perform close reading analysis on several of the generated alignments and analyze the performance of the algorithms.
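
As a rough illustration of how TF-IDF can help retrieve character renamings, the sketch below treats everything a character says as one document, weights its words with TF-IDF, and pairs each character of one play with the most similar character of the other by cosine similarity. All names (`tfidf_vectors`, `cosine`, `match_characters`), the smoothed IDF formula, and the toy data are assumptions made for this example; the model in the thesis works on character networks and may differ substantially.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each document key to a {term: weight} TF-IDF vector."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    vectors = {}
    for key, tokens in docs.items():
        tf = Counter(tokens)
        total = len(tokens) or 1  # avoid dividing by zero for silent characters
        vectors[key] = {
            # Smoothed IDF (an assumption of this sketch), always positive.
            term: (count / total) * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in tf.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0


def match_characters(play_a, play_b):
    """Pair each character of play_a with its most similar character in play_b."""
    docs = {("a", name): tokens for name, tokens in play_a.items()}
    docs.update({("b", name): tokens for name, tokens in play_b.items()})
    vecs = tfidf_vectors(docs)
    return {
        name: max(play_b, key=lambda other: cosine(vecs[("a", name)], vecs[("b", other)]))
        for name in play_a
    }


# Toy data (invented): each character is reduced to a bag of spoken words.
play_a = {"Alice": "gold money gold debt".split(), "Bob": "love heart love letter".split()}
play_b = {"Carol": "heart love sigh".split(), "Dave": "money gold chest".split()}
print(match_characters(play_a, play_b))
```

This greedy pairing does not enforce a one-to-one renaming; two characters of the first play can be mapped to the same character of the second.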

Location

Seminar room 4B125 (Copernic building)

5 Boulevard Descartes 77420 Champs-sur-Marne
