Development and Application of Category Systems for Text Research

4th forTEXT expert workshop

In the digital humanities, computational social sciences and related fields, the development and use of category systems (e.g. ontologies, taxonomies or typologies) plays an important role in the systematization and analysis of texts. Categories allow for the linguistic labeling of texts or (textual) phenomena (cf. Reichertz 2011, 288), as well as for their combination or differentiation based on selected relevant features (cf. Suppe 1989, 292). By selecting suitable parameters for grouping, categories usually allow for a systematic reduction of complexity and the ordering of complex textual artifacts and data sets, which in turn considerably facilitates both their analysis and the communication of analysis results. At the same time, category systems offer the possibility of a detailed text systematization or description through the formation of subcategories (cf. Bailey 1994, 12-14). If category systems are created as ontologies or taxonomies, they can additionally provide information about the relationship of relevant (text) phenomena to each other (cf. Munn/Smith 2008, 17). Moreover, since the creation of categories usually requires the specification of explicit definitions of terms or categories, the scholarly exchange of information on subjects in the humanities, cultural studies, or social sciences is greatly facilitated – among other things, through easier understanding and better comparability of statements (cf. Carnap 1959, Pawłowski 1980, Fricke 2007).

While work on and with categories in the traditional humanities remains the exception rather than the rule and is limited to certain sub-disciplines (see Thomé 2007), it is omnipresent in the Digital Humanities due to the influence of standards from the formal sciences. This omnipresence of category-system development in the Digital Humanities is in stark contrast to the lack of systematic reflection in this field of research: Which categories or which types of category systems are appropriate for (certain) objects in the humanities? What determines the validity and fruitfulness of categories in this field? Which – existing or new – procedures can be used to develop a category system?

After theoretical-philosophical and technical facets of category systems in the (digital) humanities were the subject of a previous workshop in the context of the forTEXT project, this expert workshop will in turn focus on the presentation of – or theoretical-methodological reflection on – concrete studies in the field of the humanities, social sciences and related fields, during the course of which category systems for the study of texts are being developed. While the work on and with categories used for text annotation and analysis is the main focus of this workshop, systems and methods for the organization and classification of texts in the context of databases can also be discussed – as well as the connection between these two fields of work, in which category systems play a role. Solutions for organization and visual representation of complex category systems could, for example, be shared. Another possibility would be potential content-related connections between the two fields of application, e.g. by discussing the question of whether categories for text analysis can also be used to categorize texts – or vice versa.

The aim of the workshop is to identify – based on practical experience – the requirements that arise in the context of digital humanities projects for the development, organization, presentation, documentation, application, testing and further development of category systems.

References

Bailey, Kenneth D. (1994): Typologies and Taxonomies. An Introduction to Classification Techniques. Thousand Oaks, California, London and New Delhi: Sage Publications.
Carnap, Rudolf (1959): Induktive Logik und Wahrscheinlichkeit. Wien: Springer.
Fricke, Harald (2007): Terminologie. In: Müller, Jan-Dirk/Braungart, Georg (eds.): Reallexikon der deutschen Literaturwissenschaft, Vol. 3. Berlin: de Gruyter, 587–590.
Munn, Katherine (2008): Introduction. What is Ontology for? In: Munn, Katherine/Smith, Barry (eds.): Applied Ontology. An Introduction. Frankfurt: Ontos, 7–19.
Pawłowski, Tadeusz (1980): Begriffsbildung und Definition. Berlin: De Gruyter.
Reichertz, Jo (2011): Abduktion: Die Logik der Entdeckung der Grounded Theory. In: Mey, Günter/Mruck, Katja (eds.): Grounded Theory Reader, 2nd ed. Wiesbaden: VS Verlag, 279–297.
Suppe, Frederick (1989): Classification. In: Barnouw, Erik (ed.): International encyclopedia of communications, Vol. 1. Oxford: Oxford University Press, 292–296.
Thomé, Horst (2007): Typologie2. In: Müller, Jan-Dirk/Braungart, Georg (eds.): Reallexikon der deutschen Literaturwissenschaft, Vol. 3. Berlin: de Gruyter, 709–712.

Program of the Workshop

Wednesday, 17.02.2021, 1:30–5:30 pm

1:30–2:00 pm: Introduction (Evelyn Gius, Janina Jacke; Technical University of Darmstadt)

2:00–3:30 pm: Session 1 (Chair: Maria Hinzmann, University of Trier)

Matthias Preis, Friedrich-Wilhelm Summann (Bielefeld University):
Medienverbünde digital explorieren. Strategien der Datenmodellierung und -visualisierung (Exploring media networks digitally. Strategies of data modelling and visualisation) [talk in German]
Stefan Heßbrüggen-Walter, Jörg Walter (HSE University, Moscow; Velbert):
Subject Indexing Early Modern Dissertations: Towards a Methodology for ML-based Text Classification Using Metadata

Break

4:00–5:30 pm: Session 2 (Chair: Lina Franken; University of Hamburg)

Johanna Drucker (UCLA – University of California, LA):
Time Frames: Graphic representations of temporality
Audrey Alejandro (LSE – London School of Economics and Political Science):
From social sciences to text research: problematising categories as a reflexive approach to improve analytical work

Thursday, 18.02.2021, 9:00 am–1:00 pm

9:00–10:30 am: Session 3 (Chair: Berenike Herrmann; University of Basel / Free University of Berlin)

Julia Nantke, Nils Reiter (University of Hamburg, University of Cologne):
Computational Approaches to Intertextuality: Possibilities and Pitfalls
Federico Pianzola (University of Milano-Bicocca / Sogang University, South Korea):
Fandom metadata on AO3 and their use for literary research

Break

11:00 am–12:30 pm: Session 4 (Chair: Marcus Müller, Technical University of Darmstadt)

Itay Marienberg-Milikowsky, Ophir Münz-Manor (Ben-Gurion University of the Negev, Open University of Israel):
Visualization of Categorization: How to See the Wood and the Trees
Silke Schwandt, Juliane Schiel (Bielefeld University, Vienna University):
“Slavery, Coercion and Work” – How to overcome a Eurocentric Language of Analysis in the Humanities? Collaborative Text Annotation and Text Mining Analysis for the Development of a New Analytical Framework

12:30–1:00 pm: Wrap up discussion (Evelyn Gius, Janina Jacke; Technical University of Darmstadt)

Registration

If you would like to participate, please register here: https://us02web.zoom.us/webinar/register/WN_8M6wZ5JyTB2A04gOxR-DMw

For questions, please contact: fortext[at]linglit.tu-darmstadt.de.

Abstracts of all Contributions (in alphabetical order)

Audrey Alejandro: From social sciences to text research: problematising categories as a reflexive approach to improve analytical work

Although researchers in social sciences and the humanities have long acknowledged the challenge that the categorisation function of language poses for knowledge production, they have not developed methodological tools demonstrating how to address this challenge in practice. I argue that problematising categories represents a promising route, based on the theoretical and empirical rationale behind this tradition, and I develop guidelines for problematising categories that can be implemented at different stages of the research design. Based on the problematisation of the categorical pair ‘local’ vs ‘international’ in two case studies (the environmental impact of Chinese investment in Senegal and the 2017 political protests in Dominica), I demonstrate how this method enables researchers to acknowledge and address the analytical blinders and ethical issues raised by categories in knowledge production. Overall, this article turns incentives to problematise categories into a research method that expands the practical tools for linguistic reflexivity and matches the methodological imperative to make conscious and informed choices for every dimension of our research.

Audrey Alejandro is an Assistant Professor of Qualitative Text Analysis at the Department of Methodology at the London School of Economics and Political Science (LSE).

Johanna Drucker: Time Frames: Graphic representations of temporality

The talk addresses ways in which graphical representations of time, temporality, and chronologies relate to textual expressions. It offers some possibilities for considering innovation in digital work by considering symbolic, affective, and non-linear representations of temporal relations. The talk also raises the question of what constitutes an “event” within a phenomenological framework of experience and how this might be represented graphically.

Johanna Drucker is the Breslauer Professor and Distinguished Professor in the Department of Information Studies at UCLA.

Stefan Heßbrüggen-Walter, Jörg Walter: Subject Indexing Early Modern Dissertations: Towards a Methodology for ML-based Text Classification Using Metadata

For a digital historian of philosophy, VD17, Germany's national bibliography of prints between 1601 and 1700, is an immensely valuable data source: it provides metadata for thousands of dissertations that were defended at a philosophical faculty. The printing of dissertations was linked to the practice of public disputations involving a praeses, a respondent and one or two opponents. The metadata for these prints allow for a bird's-eye view of teaching and scholarship in academic German philosophy of the 17th century. We will discuss how to classify these texts according to the subdiscipline they belong to, e.g. as metaphysical, ethical, or philological dissertations. Our approach was originally based on the hypothesis that the subject matter of dissertations mirrors the internal structure of a faculty of philosophy, e.g. with a professor for metaphysics and logic, another for poetics and eloquence etc. We also presumed that machine learning algorithms are a good fit for this classification task, since many dissertation titles exhibit similar patterns, containing a 'genre' term (e.g. 'dissertatio'), a disciplinary 'label' (e.g. 'physica'), and a topic indicated through n-grams starting with 'de' (e.g. 'de meteoris'). This working hypothesis was, however, only partially confirmed: the algorithms we used achieved an average precision of approx. 70%, which is not good enough to draw any substantial conclusions about the number of dissertations published in the various disciplines. In our presentation we will show how precision could be significantly increased for those dissertations which in fact exhibited the pattern we were searching for and what factors were responsible for loss of precision in the 'long tail', i.e. titles that exhibited identifying features to a lesser degree.
This finding prompted us to add another classification criterion, namely the reliability of our first-order classification: we now distinguish a 'core' of titles with high precision from 'non-core' titles with significantly lower precision and 'unclassifiable' titles that cannot be identified as belonging to a discipline. Our approach can lead to a more differentiated understanding of machine learning 'features' used as criteria for classification.
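
The title pattern described above (a 'genre' term, a disciplinary 'label', and a topic n-gram starting with 'de') and the three reliability tiers can be illustrated with a minimal Python sketch. It is not the authors' machine-learning pipeline: the tiny term lists, the function names `extract_features` and `reliability_tier`, and the rule that assigns a tier from the matched features are all assumptions made purely for illustration.

```python
import re

# Hypothetical, deliberately tiny vocabularies (assumptions for this sketch);
# a real project would need much larger, curated lists.
GENRE_TERMS = {"dissertatio", "disputatio", "exercitatio"}
DISCIPLINE_LABELS = {"physica", "metaphysica", "ethica", "logica", "philologica"}


def extract_features(title: str) -> dict:
    """Pick out a genre term, a disciplinary label and a 'de ...' bigram."""
    tokens = re.findall(r"[a-z]+", title.lower())
    genre = next((t for t in tokens if t in GENRE_TERMS), None)
    label = next((t for t in tokens if t in DISCIPLINE_LABELS), None)
    topic = None
    if "de" in tokens:
        i = tokens.index("de")
        if i + 1 < len(tokens):
            # Keep 'de' plus the following word as the topic n-gram.
            topic = " ".join(tokens[i:i + 2])
    return {"genre": genre, "label": label, "topic": topic}


def reliability_tier(features: dict) -> str:
    """Assumed rule: the full pattern is 'core', a label alone is 'non-core'."""
    if features["label"] is None:
        return "unclassifiable"
    if features["genre"] is not None and features["topic"] is not None:
        return "core"
    return "non-core"


for title in ["Dissertatio physica de meteoris",
              "Disputatio ethica",
              "Theses miscellaneae"]:
    features = extract_features(title)
    print(title, features, reliability_tier(features))
```

In a learned classifier, the tier would more plausibly be derived from measured precision on held-out titles than from a fixed rule; the rule here only makes the three-way distinction concrete.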

Stefan Heßbrüggen-Walter is Associate Professor at the School of Philosophy and Cultural Studies at HSE University in Moscow.

Jörg Walter is an independent software developer in Velbert, Germany.

Itay Marienberg-Milikowsky, Ophir Münz-Manor: Visualization of Categorization: How to See the Wood and the Trees

Manual annotation in the digital humanities is sometimes described as an area of close reading in a realm of distant reading. This is both true and not true: it is true because, unlike some algorithmic analysis methods, human annotation is based on sensitive attention to every detail in the text; it is not true because, unlike close reading, annotation in DH is usually supposed to be systematic - and therefore lacks another important feature: sampling. As Paul Fleming (2017) puts it, “an essential element of close reading relies not just on the quality of the reading performed, but also on the example chosen. It has to be the right example”. The right example is the bridge between detail-orientedness and generalization, which is the traditional way to see the woods for the trees. However, how can one bridge this gap in, say, computational literary studies, without giving up systematic annotation? This is even more complicated when - as is typical of DH - the wood is not one text, but many, since category systems for text analysis usually help us to see specific parts of the trees rather than the trees in their entirety.

In the workshop we present and theorize a case-study where a corpus of Byzantine Hebrew poetry was manually annotated using CATMA, focusing on figures of speech such as metaphors and similes, and then transferred automatically to ViS-À-ViS - a web-service that enables scholars to “see the wood” via various (aggregative) visualizations that single out inter alia repetitive patterns either at the level of the text or the tags. This is done from a bird’s-eye view at three interdependent layers - the specific poem, a set of poems, and the entire poetic corpus - and in two ways, human and computational. Human detection is made possible using visualizations that highlight potentially interesting patterns, while computational automatic detection uses time-series alignment algorithms. By exploring the case-study we seek to demonstrate that, by using the right methods and tools, one can ultimately see the trees without losing sight of the wood.
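
The abstract does not say which time-series alignment algorithms are used. As a hedged illustration of the general idea, the sketch below implements dynamic time warping (DTW), one well-known alignment algorithm, and applies it to invented per-line counts of figure-of-speech tags. The function name `dtw_distance` and all of the data are assumptions for this example and are not taken from CATMA or ViS-À-ViS.

```python
def dtw_distance(a, b):
    """Dynamic time warping cost between two numeric sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of the first i items of a with the first j of b
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # repeat b[j-1]
                                    cost[i][j - 1],      # repeat a[i-1]
                                    cost[i - 1][j - 1])  # advance both
    return cost[n][m]


# Invented data: number of metaphor/simile tags per line in three poems.
poem_a = [0, 1, 2, 1, 0]
poem_b = [0, 0, 1, 2, 1, 0]   # same shape as poem_a, slightly stretched
poem_c = [2, 2, 0, 0, 2, 2]   # a different pattern

print(dtw_distance(poem_a, poem_b))
print(dtw_distance(poem_a, poem_c))
```

Because DTW tolerates stretching, poems whose tag density rises and falls in the same way are scored as close even when they differ in length, which is one way a repetitive pattern across a set of poems could be detected automatically.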

Itay Marienberg-Milikowsky is a senior lecturer at the Department of Hebrew Literature and head of the Literary Lab at the Ben-Gurion University of the Negev.

Ophir Münz-Manor is an associate professor of Rabbinic Culture and Dean of Academic Studies at the Open University of Israel.

Julia Nantke, Nils Reiter: Computational Approaches to Intertextuality: Possibilities and Pitfalls

The concept of intertextuality has been a central subject in Literary Studies, especially in the 1980s and 90s. Research output can be roughly divided into two directions: 1. efforts to systematically establish a system of categories of intertextual ‘genres’ on the basis of a few relevant examples, and 2. detailed studies of intertextual modes of writing and their functions in individual works. These two perspectives lack mutual connections. The complexity of the phenomenon as well as capacity constraints have prevented a systematic intertwining of the two approaches within classical ‘analog’ Literary Studies.

In Computational Literary Studies, there have also been attempts to use computational methods for capturing categories of intertextuality, which aimed to meet the desiderata of analog intertextuality research with machine-supported procedures. The presentation will first report on two projects that have pursued a systematic approach to the phenomenon of intertextuality, one by means of quantitative text-mining procedures and the other by means of manual annotation and modeling based on it. Both approaches have proven deficient for different reasons, each of which is rooted in the complexity of the phenomenon itself. Based on our previous experiences, we will therefore present a new approach to indexing intertextual writing that combines quantitative-aggregative and qualitative-‘annotative’ steps.

Julia Nantke is Junior Professor for Modern German Literature with a special focus on Digital Humanities and written artifacts at the Institute of German Language and Literature at the University of Hamburg.

Nils Reiter is a visiting professor for linguistic information processing at the Department of Digital Humanities at the University of Cologne.

Federico Pianzola: Fandom metadata on AO3 and their use for literary research

The primary use of metadata is the retrieval of the correct kind of information a user is searching for. In libraries and archives, metadata are mostly used to describe paratextual information – like author, title, and topics derived from cataloguing schemes (e.g. the Dewey system). However, if more detailed information is recorded as metadata, they can potentially be used for the distant reading of literary texts and the study of literary history. The problem is that creating and aggregating metadata is a labour-intensive activity. But in the case of self-archived fandom the burden of metadata generation is distributed among authors, who upload, label, and publish their work in archives such as the Archive of Our Own (AO3) (Organization for Transformative Works, 2009). AO3 implemented an excellent system of tag management (Dalton, 2012; McCulloch, 2019) with which authors can specify tags for characters, relationships, and additional freeform tags for any use they may think of. Moreover, specialized volunteers, called “wranglers,” aggregate synonym tags: e.g. “harrypotter” and “Harry Potter” (AO3 Admin, 2012). The goal of AO3 is to help readers find exactly the kind of stories they are looking for, but researchers can exploit the well-maintained and accurate tag database to draw insights about the history and evolution of a specific genre of literature (fanfiction) and its readership (Pianzola et al., 2020). In particular, freeform tags offer authors the possibility to make explicit in the metadata any relevant aspect of the story, like a psychological trait of a character (e.g. “Morally grey Harry Potter”), a narrative strategy (e.g. “point of view of Draco”), a setting (e.g. “Diagon Alley”), a timeframe (e.g. “post first war with Voldemort”), etc. A distant reading of fanfiction through the lens of tags has benefits that go beyond the understanding of a widespread – and growing – cultural phenomenon.
Data-driven insights from research on AO3 can be used to formulate better hypotheses regarding the evolution of other cultural systems – like literary classics or genre fiction – and to more strategically plan labour-intensive and time-consuming tasks like manual annotation of textual corpora.
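
To make the aggregation of synonym tags such as “harrypotter” and “Harry Potter” concrete, here is a minimal Python sketch. It does not reproduce AO3's actual tag system, in which wranglers merge tags by hand: the normalisation rule (lower-casing and dropping non-alphanumeric characters), the function names `tag_key` and `aggregate_tags`, and the sample works are assumptions for illustration only.

```python
from collections import Counter


def tag_key(tag: str) -> str:
    """Naive normalisation (an assumption): lower-case, letters and digits only."""
    return "".join(ch for ch in tag.lower() if ch.isalnum())


def aggregate_tags(works):
    """Count in how many works each tag occurs, merging spelling variants."""
    canonical = {}       # normalised key -> first spelling encountered
    counts = Counter()
    for tags in works:
        seen = set()     # count each tag at most once per work
        for tag in tags:
            key = tag_key(tag)
            if not key or key in seen:
                continue
            seen.add(key)
            canonical.setdefault(key, tag)
            counts[canonical[key]] += 1
    return counts


# Invented sample: each inner list holds the tags of one work.
works = [
    ["Harry Potter", "Diagon Alley"],
    ["harrypotter", "Morally grey Harry Potter"],
    ["Diagon Alley", "harry potter"],
]
print(aggregate_tags(works).most_common())
```

A purely string-based rule like this only catches spelling variants; real synonyms (for example, a nickname for the same character) still need the kind of human judgement that wranglers provide.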

References

AO3 Admin. (2012). The Past, Present, and Hopeful Future for Tags and Tag Wrangling on the AO3. Archive of Our Own. https://archiveofourown.org/admin_posts/267 (accessed 25 February 2020).
Dalton, K. L. (2012). Searching the Archive of Our Own: The Usefulness of the Tagging Structure. University of Wisconsin-Milwaukee. http://dc.uwm.edu/etd/26/ (accessed 24 June 2013).
McCulloch, G. (2019). Fans Are Better Than Tech at Organizing Information Online. Wired. https://www.wired.com/story/archive-of-our-own-fans-better-than-tech-org... (accessed 25 February 2020).
Organization for Transformative Works. (2009). Archive of Our Own. https://archiveofourown.org/ (accessed 5 May 2020).
Pianzola, F., Acerbi, A. and Rebora, S. (2020). Cultural Accumulation and Improvement in Online Fan Fiction. In CHR 2020: Workshop on Computational Humanities Research, November 18–20, 2020, Amsterdam, The Netherlands. CEUR Workshop Proceedings.

Federico Pianzola is a Marie Skłodowska-Curie Global Fellow for the project “Reading Literature in a Digital Culture” at the University of Milano-Bicocca (Italy) and at Sogang University (South Korea).

Matthias Preis, Friedrich-Wilhelm Summann: Medienverbünde digital explorieren. Strategien der Datenmodellierung und -visualisierung (Exploring media networks digitally. Strategies of data modelling and visualisation)

The talk reports on the interdisciplinary development of an online portal for the DFG-funded project Deutschsprachige Kinder- und Jugendliteratur im Medienverbund 1900 bis 1945 (German-language children's and young adult literature in media networks, 1900 to 1945), which was set up as a cooperation between the Department of German Studies and the University Library in Bielefeld. At the centre of the project was the manual recording of bibliographic metadata on films, radio broadcasts, print literature, promotional articles, etc. that draw on material from children's and young adult literature, in order to profile early historical media networks more closely in terms of their structure, aesthetics and genesis. Many thousands of media offerings were recorded, including the persons and institutions involved in their production, distribution and reception. Throughout the project, the greatest challenge was to model the diverse formats and interactions in the media action system adequately at the data level. Starting from bibliographic standards, an increasingly complex data schema was developed for this purpose, whose design addresses the given variety of links by means of different, nested attribute structures.

The talk first focuses on questions of category formation in dealing with complex metadata, which arise from the multifaceted nature of the media offerings evaluated and of their relations. It then gives an insight into the search engine environment that was developed, with particular attention to the strategic implementation of various forms of visualisation. In this context, the technical infrastructure developed in the course of the project is also outlined, from data capture through the evaluation and visualisation of the data, designed to be as intuitive as possible, to in-depth analyses of the data pool. The constructive tension between the goals of literary and media studies and the concrete implementation and adaptation of the digital system environment forms the backdrop. The argument alternates between the perspectives of literary studies and information technology in order to trace the exploratory and often recursive processes between data modelling and data visualisation. Not least, the talk aims to show how categorisation and visualisation modulate the view of the subject area and thus (potentially) contribute to broadening the horizon of literary studies in the spirit of the Digital Humanities.

Matthias Preis is a lecturer at the Faculty of Linguistics and Literary Studies at Bielefeld University.

Friedrich-Wilhelm Summann is head of LibTec at Bielefeld University Library.

Juliane Schiel, Silke Schwandt: “Slavery, Coercion and Work” - How to overcome a Eurocentric Language of Analysis in the Humanities? Collaborative Text Annotation and Text Mining Analysis for the Development of a New Analytical Framework

Annotating is a widespread scholarly practice (cf. Unsworth 2000) that is used in very different ways in many different disciplines. Often it is not even addressed as such, as we can see in the humanities. Historians, for example, annotate while reading by taking down notes and marking passages of a text or document. These practices are often not reflected upon and seldom find a place in academic publications.

But annotations can well serve as a starting point in collaborative projects. Discussing annotations, concepts, and interpretations furthers the development of a common vocabulary as well as the evolution of a mutual understanding of the same.

We want to present an example of such a joint project, which involves about 250 scholars from all over Europe, coming from different academic backgrounds and scientific cultures, and speaking as well as reading sources in different languages. Using CATMA as a tool for collaborative annotation, the aforementioned practices come to the front of the research process as they structure our discussions and interpretations. We want to use this to arrive at a common vocabulary on the one hand, but also to develop a new analytical language to address questions of labour and coercion without imposing modern concepts on historical times by using seemingly established, Western-dominated terms and concepts.

Juliane Schiel is Associate Professor for Social and Economic History (1300–1800) at the Department of Economic and Social History of Vienna University.

Silke Schwandt is Associate Professor for Digital History at Bielefeld University.
