Workshop report: Development and Application of Category Systems for Text Research

4th Expert Workshop, 18–19 February 2021

by Dominik Gerstorfer

drawing on the minutes and summaries of Mareike Schumacher and Malte Meister.

Evelyn Gius welcomed the speakers and guests to the workshop and introduced the topic of Development and application of category systems for text research.

She began by noting that categories were first introduced in ancient Greek philosophy as conceptual devices to classify the most fundamental features of the universe. But as Franz Kuhn’s famous list of categories (as cited in Jorge Luis Borges’ “John Wilkins’ Analytical Language”) shows, all such category systems suffer from a number of issues, in particular ambiguities, redundancies and deficiencies. These issues are unavoidable since, as Borges puts it: “[…] it is clear that there is no classification of the Universe not being arbitrary and full of conjectures. The reason for this is very simple: we do not know what thing the universe is.”

Gius pointed out that the Digital Humanities are not concerned with the universe itself but only with parts of it, which are much more accessible. This raises the question of what role categories can play in the text-based Digital Humanities. In order to illustrate the use of category systems, she showed the forTEXT project page, which employs categories for methods, tools, and approaches to literary texts. This example of a small category system, aimed not at the world as a whole but at a narrow segment of it, shows how categories can be used in everyday research settings. Categories can be used for the systematisation and analysis of texts, for the systematic reduction of complexity, for ordering complex textual artefacts and data sets, and for facilitating the analysis and communication of results.

In the Digital Humanities, ‘category system’ can be used as a minimal term referring to a way of organising knowledge. Gius went on to expand on two frequently used types of category systems, ontologies and taxonomies. Both are closely linked and are used to capture a subject area in a terminological classification scheme that satisfies different formal criteria. Ontologies define relationships between instances and categories, which makes it possible to draw inferences and to capture and process knowledge, whereas taxonomies represent formalised, hierarchical relationships between categories, which makes it possible to determine the particular class a given entity falls into.

Gius then asked why we bother about categories at all. We do so, she answered, because dealing with category systems entails reflecting on the subject area by discerning types of phenomena in terms of difference and similarity. Employing categorisations makes it possible to systematise statements about a subject area and to put individual statements into a larger context. In general, this enables a more focused scholarly exchange.

After stressing the potential gains, Gius highlighted three problems arising when working with category systems in humanistic research:

Formal definitions often have to be developed and are not ready-made.
It is not clear what kind of relationships there are between certain categories.
Hierarchical structures are often used although in fact there is no strict hierarchy inherent to the category system.

To face these problems, DH research is currently taking different approaches to categories and category systems: the theoretical-methodological discussion of category formation and category systems could be fostered, quality criteria for category systems could be designed, or the formalisation of the subject areas could be advanced further.

Gius concluded her introduction with three guiding questions for the workshop:

Which categories or which types of category systems are appropriate for objects in the humanities?
What determines the validity and fruitfulness of categories in this field?
Which — existing or new — procedures can be used to develop a category system?

Matthias Preis and Friedrich-Wilhelm Summann gave a talk on Exploring media networks digitally. Strategies of data modelling and visualisation in which they showcased their project German-language children’s and youth literature in media networks 1900–1945 (DFG 2015–2019).

Their starting point was a well-known example of a media network: the Harry Potter franchise, which comprises not only the books but also movies, toys, computer games, and merchandise. Together, all those different items constitute a complex network of interrelated media types and products. Media networks, Preis and Summann pointed out, are hardly a new phenomenon. The first large media network can be found as early as 1929 with Erich Kästner’s Emil and the Detectives. Here, the novel, a movie, radio shows and even board games together form a media network.

The goal of their project is to create an online database, as complete as possible, of movies and radio shows based on children’s and youth literature, including the books, their reception, and their further utilisation. The corpus includes almost 17000 radio shows, about 1800 movies, over 600 theatre plays, around 130 records, and many other materials. These were linked to over 11000 primary and about 7000 secondary texts as well as over 23000 persons and institutions.

After the introduction to the project, Preis and Summann focused on the issues concerning the categorisation of the source material. They proceeded in four steps:

The conceptual data model has categories for objects, with subcategories for media types (such as print, record, movie, theatre play), the material, and the actants involved; for attributes (such as production details and publication dates); and for relations, which can be either links within the corpus or references outside the corpus.
The data flow connecting the manual input of metadata with the final visualisation of results was realised in distinct software layers: the data is entered in Drupal forms and stored in an SQL database; from there it is dumped and transferred with Perl scripts to generate a search index with Solr. CSS and, again, Perl are then used to build a search environment, from which the results are passed on to the visualisation layer, implemented in D3.js and CSS.
In order to construct a network of media, the different relations between the individual entries needed to be specified. Here a number of problems arose: how to incorporate rich text information in the metadata, and how to make temporal and spatial information accessible. The solution was found in a perspective view on the data, which singles out one data set and arranges related data around it.
The centrepiece of the data presentation is the so-called navigator. The data sets are organised like drawers around a selected data set. Temporal information can be displayed in a clickable timeline. These views can be augmented with further layers of information, i.e., doughnut charts and chronological lists.
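
The conceptual model and the perspective view described in the steps above can be illustrated with a minimal Python sketch. It is not the project's actual schema or code: the class names, field names, relation kinds and the function perspective_view are hypothetical, and the sample entries only reuse the Emil and the Detectives example mentioned earlier.

```python
from dataclasses import dataclass, field


@dataclass
class MediaObject:
    """One corpus entry (hypothetical structure, not the project's schema)."""
    identifier: str
    title: str
    media_type: str                                   # e.g. "print", "movie"
    attributes: dict = field(default_factory=dict)    # e.g. publication date
    actants: list = field(default_factory=list)       # persons, institutions


@dataclass
class Relation:
    """A typed link, either inside the corpus or to an outside reference."""
    source: str
    target: str
    kind: str
    internal: bool = True


def perspective_view(focus_id, objects, relations):
    """Single out one entry and group related in-corpus entries by media type."""
    drawers = {}
    for rel in relations:
        # Skip external references and relations that do not touch the focus.
        if not rel.internal or focus_id not in (rel.source, rel.target):
            continue
        other_id = rel.target if rel.source == focus_id else rel.source
        other = objects[other_id]
        drawers.setdefault(other.media_type, []).append(other.title)
    return {"focus": objects[focus_id].title, "drawers": drawers}


# Illustrative sample data (assumed, for demonstration only).
objects = {
    "b1": MediaObject("b1", "Emil and the Detectives", "print", {"year": 1929}),
    "m1": MediaObject("m1", "Emil and the Detectives (film)", "movie"),
    "r1": MediaObject("r1", "Emil and the Detectives (radio show)", "radio show"),
}
relations = [
    Relation("m1", "b1", "adaptation_of"),
    Relation("r1", "b1", "adaptation_of"),
    Relation("b1", "https://example.org/review", "reviewed_in", internal=False),
]
view = perspective_view("b1", objects, relations)
```

Grouping the related entries by media type is one possible reading of the "drawers" arranged around a selected data set; other groupings (by year, by actant) would follow the same pattern.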

Data scheme and presentation module were developed in tight iterative integration. Category definitions, search engine and visualisation module are attuned optimally to each other.

Summann and Preis closed their talk with an epilogue on the antinomy of distant and close reading. They pointed out that their project connects the distant perspective on large amounts of searchable data to close reading of original texts. It is possible to travel from the top-level view to the scanned document in just two clicks, so there is no need to play the two perspectives off against each other.

Stefan Heßbrüggen-Walter gave a talk on Subject indexing early modern dissertations, in which he reported on the categorisation of metadata using machine learning. The goal of his project is to investigate what can be learned about teaching and scholarship in the 17th century, and more precisely about the topics of philosophical dissertations in the Union Catalogue of Books Printed in German Speaking Countries in the 17th Century (VD17).

Heßbrüggen-Walter’s approach was to automatically analyse the titles of 20450 dissertations with machine learning algorithms, which is facilitated by the circumstance that the titles of the dissertations often follow a canonical schema.

In order to classify the metadata of the VD17 corpus, Heßbrüggen-Walter developed a category system that can be divided into four classes:

External categories, which are already found in the data.
Internal categories, which are added in the project.
Instrumental categories, which result from technological constraints.
Epistemological categories, which represent the reliability of first-order classifications.

There are two external categories: dissertations in general, which were defended at an institution, with references to a praeses (supervisor) and a respondens (student), and philosophical dissertations, i.e. dissertations that were defended at a faculty of philosophy. But things are more complicated: there are dissertations that were not defended at any faculty, and dissertation collections without sufficient information on the individual dissertations. It must also be considered, said Heßbrüggen-Walter, that the data is not complete: there must have been more dissertations than are available now, and not all dissertations that can be found in libraries are in the VD17 corpus.

Internal categories consist of the subdiscipline, which is often given in the denomination of the professorship; the structure of libraries, which often mirrors the structure of faculties in the university (the so-called Fakultätenordnung); and terms for the subdiscipline that are part of the canonical form of the dissertation title.

One challenge was the creation of appropriate categories, since there is no taxonomy of early modern subdisciplines of philosophy that can be applied to the corpus. Instead, a taxonomy had to be extracted from the corpus in the process. Luckily, many dissertation titles follow a canonical form consisting of a ‘genre’ (e.g. dissertatio), a ‘label’ (e.g. physica), and a topic, which is an n-gram starting with ‘de’ (e.g. de natura). This helped to create the relevant labels.
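
How such a canonical title could be split into its three parts can be sketched with a regular expression. This is only an illustration under stated assumptions, not the procedure used in the project: the list of genre terms beyond dissertatio, the pattern itself and the function name parse_title are hypothetical.

```python
import re

# Assumed pattern: a genre term, one discipline label, then a topic that
# starts with "de". Only "dissertatio" is taken from the report; the other
# genre terms are hypothetical additions for illustration.
CANONICAL_TITLE = re.compile(
    r"^(?P<genre>dissertatio|disputatio|exercitatio)\s+"
    r"(?P<label>\w+)\s+"
    r"(?P<topic>de\s+.+)$",
    re.IGNORECASE,
)


def parse_title(title):
    """Return genre, label and topic in lower case, or None if the title
    does not follow the assumed canonical form."""
    match = CANONICAL_TITLE.match(title.strip())
    if match is None:
        return None
    return {key: value.lower() for key, value in match.groupdict().items()}


parsed = parse_title("Dissertatio physica de natura")
unparsed = parse_title("De natura")
```

Titles for which the function returns None would correspond to the noisy cases in which the relevant information cannot be retrieved automatically.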

Instrumental categories are the result of technological constraints. The capture of titles in VD17 is not unified, so it was necessary to remove noise (5800 titles were excluded because the relevant information was not automatically retrievable). Subdisciplines had to be grouped together to avoid unbalanced classes for the application of ML (e.g. philosophia moralis grouped under ethics). Furthermore, disciplines with an insufficient number of tokens had to be discarded.

Finally, epistemological categories refer to the reliability of the classification. Heßbrüggen-Walter differentiates two main categories: core, with high reliability, and extended core, with low to medium reliability. All titles to which no label was assigned were categorised as unclassified.

Heßbrüggen-Walter summed up the workflow: he started with pre-existing categories in VD17 and discarded noisy titles; then he added categories based on the canonical form of the titles and on the requirements for balanced classification; then he assigned exactly one label to each dissertation; finally, he classified the results according to reliability.

The results of pure ML with only one algorithm were uniformly unsatisfactory. The precision lay between 64.0 (label spread) and 71.6 (naive Bayes).

Considering only the core results, the picture changes: Precision of title labels across algorithms went up to over 90% for 2/3 of the titles and even the results with low reliability (just over 50%) were useful, since the keywords could be re-used to label another 742 dissertations with a precision of over 70%.

About 3/4 of the titles could be classified, which raises the question of why 1/4 of the titles cannot be classified with machine learning algorithms. Heßbrüggen-Walter hypothesised that these titles were too short, that their bigrams were associated with more than one discipline, or that they contained no genre terms or no terms for the discipline.

Heßbrüggen-Walter drew the following methodological conclusions: don’t trust just one algorithm; try to identify a core (that exhibits the desired features); investigate the long tail to see why classification fails outside the core; iterate!
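
The idea of not trusting a single algorithm can be made concrete with a small Python sketch. The report does not state how the core was actually determined, so the rule used here (unanimous agreement of all classifiers means core, partial agreement means extended core) is an assumption, as are the function names and sample labels.

```python
from collections import Counter


def reliability(labels):
    """Map the labels proposed by several classifiers for one title to a
    (label, reliability class) pair. Assumed rule, not taken from the talk:
    all classifiers agree -> core; some label proposed -> extended core."""
    votes = Counter(label for label in labels if label is not None)
    if not votes:
        return None, "unclassified"
    label, count = votes.most_common(1)[0]
    if count == len(labels):
        return label, "core"
    return label, "extended core"


def precision(predicted, gold):
    """Share of assigned labels that match the manually checked label."""
    assigned = [(p, g) for p, g in zip(predicted, gold) if p is not None]
    if not assigned:
        return 0.0
    return sum(p == g for p, g in assigned) / len(assigned)


# Hypothetical outputs of three classifiers for three titles.
unanimous = reliability(["physics", "physics", "physics"])
majority = reliability(["physics", "ethics", "physics"])
no_label = reliability([None, None, None])
```

Computing precision separately for the core and for the extended core would then show whether agreement between algorithms is a useful proxy for reliability, and the unclassified remainder is the long tail to be inspected by hand.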

Johanna Drucker gave a talk on Time Frames: Graphic representations of temporality. Her main concerns were the issues that arise in textual research with a focus on the classification of temporal phenomena.

Her talk had three parts:

The limitations of traditional timelines and chronologies for interpretative work. Temporal modelling using a graphical system to produce interpretations in relationship to narratives and discursive fields.
Research on graphical expressions of temporality, chronologies in particular.
Classification systems for temporal events.

In the first part, Drucker elaborated on the limitations of standard timelines in relation to narratives. Timelines were largely designed in the natural sciences to show certain continuous phenomena over time. Chronologies can be thought of as time frames into which time points are placed. Neither device can properly accommodate events, because both conceive of time as unidirectional, continuous, unbroken, and homogeneous. This approach is inadequate for narration and experience, since vital categorical distinctions in humanities research cannot be mapped adequately. When looking at narrative and discourse, it is necessary to be able to distinguish between the telling and the told as well as to represent non-linear event spaces. In experience, the affective dimensions and metrics of temporality are not uniform, and in discourse fields events are constructed across multiple documents.

This led her to the temporal modelling project, a software system that makes it possible to represent complex temporal relations visually. In developing this system the question arose of what the most fundamental categories, which Drucker calls temporal primitives, are and how they can be identified. In the project, temporal objects representing these primitives were implemented: timelines, points, events, and intervals. These can be augmented with semantic and syntactic attributes that are attached to the temporal objects.

She went on to show further work that allows for non-static representation of temporal structures and differentiation between the planes of story order and narrative order.
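
The temporal primitives and the two planes of order can be pictured as a small data structure. The following Python sketch is only an illustration and not the implementation of the temporal modelling project: the field names, the attribute dictionaries and the sample events are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Point:
    position: float
    attributes: dict = field(default_factory=dict)   # semantic or syntactic


@dataclass
class Interval:
    start: float
    end: float
    attributes: dict = field(default_factory=dict)


@dataclass
class Event:
    name: str
    story_position: int       # order in which the event happens (the told)
    narrative_position: int   # order in which it is narrated (the telling)
    attributes: dict = field(default_factory=dict)


@dataclass
class Timeline:
    label: str
    events: list = field(default_factory=list)
    intervals: list = field(default_factory=list)
    points: list = field(default_factory=list)

    def story_order(self):
        """Event names sorted by the order in which they happen."""
        return [e.name for e in sorted(self.events, key=lambda e: e.story_position)]

    def narrative_order(self):
        """Event names sorted by the order in which they are told."""
        return [e.name for e in sorted(self.events, key=lambda e: e.narrative_position)]


# Hypothetical narrative that opens with the discovery and then looks back.
timeline = Timeline(
    "example narrative",
    events=[
        Event("crime", story_position=1, narrative_position=2),
        Event("discovery", story_position=2, narrative_position=1),
        Event("trial", story_position=3, narrative_position=3),
    ],
    intervals=[Interval(0.0, 5.0, {"certainty": "approximate"})],
)
```

Keeping both positions on the same event object is one simple way to distinguish the telling from the told without assuming a single, uniform timeline.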

In the second part, Drucker emphasised the questions that the experience of temporality poses. The concept of experience of individual temporality entails that each individual has their own perspective on time and looks at different aspects of the unfolding of events. This causes timelines not to stay stable. Experienced temporality is always relational and relative to the subject’s point of view. This led Drucker to distinguish uniform time from standpoint-dependent temporality.

Discourse fields consisting of multiple reports of an event create an elastic timeline that is not exactly linear. Drucker used the example of multiple witnesses to a crime who give diverging reports to illustrate a complex network of points of view.

She then moved on to a second research area called hetero-chronologies, where the interest lies not in modelling but in analysing graphical representations of temporality. As an example of a semantically rich graphic of time, she introduced the Stream of Time, Or Chart of Universal History, From the Original German of Strass, a map from the middle of the 19th century depicting the development of the world from its beginnings in biblical time. The leading question of the hetero-chronologies project was how to analyse complex graphical representations of time like this one. To tackle this question, a new set of temporal primitives was created, consisting of frame, span, duration, interval, point, and annotation.

The graphical representation of time in those depictions relates to historical as well as cultural aspects, and a main challenge is finding correlation points to match them with other temporal systems.

The third part focused on events. Drucker started by noting that temporal events include a discrete temporal sequence and a complex representation within narratives and discourse. The temporality of experienced events is perspectival and depends on points of view. But there is another dimension to events, which Drucker called the event space of interpretation. Drucker explained that interpretation is itself a dynamic, complex temporal system, which has to be reflected in the process of interpretation and modelling.

The topic of the talk by Audrey Alejandro was From social sciences to text research: problematising categories as a reflexive approach to improve analytical work.

She began by asserting that although language and perception can be described using category systems, many aspects of them remain unconscious and have to be made visible first.

When categorising language and its functions, one has to problematise categories first. But what problematising categories entails in practice, according to Alejandro, remains a mystery, which she set out to solve in three sections.

In the first section Alejandro gave an overview of the traditions of problematising categories. She began with Aristotle and Kant, who think of categories as predicates not of objects but of groups of objects. Categories take the linguistic form of mutually exclusive judgements; Kant, for example, holds that something cannot be possible and impossible at the same time. Later, in social science research, categories became a field of empirical research. Categorisations are no longer viewed in isolation but as composed in classificatory systems. Categories need not be mutually exclusive, nor are they considered to be innate. Categories are socially acquired and can vary between groups and change over time. This led to a more diverse field of research in sociology and anthropology. As problematisation becomes more conscious, it becomes clearer that not only the categories of analysis must be problematised but also the systems they belong to, since categories are used in the scientific process itself.

She went on to address what problematisation is and stated that it is a process that aims to identify unthought problems within taken-for-granted knowledge in order to make them explicit. This process can be either the result of an external event or the product of a directed action. The directed action can take the form of a methodology.

In the second section Alejandro presented problematisation as a research method with three steps:

The first step is to notice a ‘critical juncture’. She suggested identifying key moments at which to reflect on categories: whenever choices involving categories are made, when constructing a research design that employs taxonomies or typologies, when the feeling arises that something doesn’t quite fit, or whenever a research process appears to be stuck.

The second step is to identify potential categorical problems by asking a series of questions like: Are the routinely formed relations between categories relevant for the research? What would be the effect of not using these categories? How are categories linked? Are they polar opposites or do they constitute a spectrum?

The third and last step consists in reconstructing an alternative by considering using the same categories but reorganising the relation, by redefining the categories, by substituting the categories or by dropping them altogether.

In the last section Alejandro presented an example: the problematisation of the binary categories of local vs. international in her case study on political speeches in Dominica. The small island republic of Dominica sold passports for 100,000 dollars to high-profile criminals. This was uncovered in an episode of 60 Minutes, which in turn led to protests in the capital. The binary categories of local and international were found to shape the discourse. Using her three-step method, Alejandro came to drop the binary opposition of local vs. international in favour of a description as a continuous local-international space.

Julia Nantke and Nils Reiter started their talk Computational Approaches to Intertextuality: Possibilities and Pitfalls with a quotation by Gerhard Kocher: “Ask not what the government can do for you. Ask why it doesn’t.”

They highlighted that one has to know the original to find it funny, and that is intertextuality. Some things become meaningful or interesting only if the reference can be deciphered. This led Reiter and Nantke to a first, colloquial definition of intertextuality: re-use of a) textual material, b) meaning or c) form, together with recognisability by an audience.

Intertextuality has been extensively researched over the last 70 years in literary and cultural studies by, amongst others, Kristeva, Genette, Clayton, and Keller. Related fields are plagiarism detection, authorship attribution, and quote detection. In the Digital Humanities, projects like eTRAP, Viral Texts, and Vectorian are already working on intertextuality.

In the first part Reiter talked about qualitatively analysing intertextual references. The corpus used was Nietzsche Nachweise published in the Journal for Nietzsche Studies, consisting of 912 pairs of Nietzsche utterances and bibliographic sources. The categories were derived from Plett: quotes (with/without symbol and with/without transformation), paraphrases, and allusions.

The challenges were as follows. First, detecting references that require interpretation, especially conceptual overlap, for example “heiser denken”, which is extraordinary and noticeable even with zero textual overlap. Second, multilinguality, when references are from or to other languages like Latin or French, or translations thereof. Third, the detection of proper negative examples, i.e. a way of finding something that is not an intertextual reference. A further problem is the prominence of the source, since the source must be extraordinary to be recognised as intertextual.

Nantke presented the second part on experiments on modelling intertextual relations in literary texts. She introduced the project FormIT, which had the goal of modelling intertextuality on different levels and to categorise types of intertextual relations.

The corpus was a set of literary texts for which a high intensity of intertextual relations was known in advance. She presented Günter Grass’ Katz und Maus and John Irving’s A Prayer for Owen Meany as an example of intertextuality on the level of plot and text structure: although different in many respects, they show the same constellation between narrator and protagonist (the narrator is a homodiegetic first-level narrator who is a minor character in the story, narrating retrospectively after the death of the protagonist, whose best friend he is).

The commonalities and deviations between the texts were annotated manually and linked, employing narratological categories. Annotations were made in XML, then everything was linked together with XLink and visualised using a JavaScript library. The formal descriptions were then supposed to be transformed into machine-readable RDF/OWL representations, which did not happen.

Nantke went on to the pitfalls of their approach and explained the difficulties they encountered. On the one hand, intertextuality designates a bundle of different structures and strategies on different textual levels that are not easily operationalised; on the other hand, there are obvious patterns of intertextual relations that invite systematic detection and definition of categories. Nantke stated that they learned from their previous approach that systematic exploration can be achieved neither by merely mining text passages, which is too distant from the textual phenomena on higher levels of abstraction and produces too few textual features, nor by capturing every occurrence, which is too close to the texts and produces too many textual features.

Their interim conclusion was to find a middle way, resulting in a new workflow that has a functional approach. They used pre-definitions of concrete analytical units that are considered as highly relevant in the research literature and which can be approached computationally. This leads to more manageable categories or units, putting emphasis on categorisation first and detection after.

The workflow starts with the analytical units, which are processed with NLP methods such as NER and topic modelling; the results are then applied in a systematic comparative way, using paired comparison tests.

In the last part Nantke drew the following conclusions from the perspective of their planned workflow.

On the technological level, a tool is missing that allows for annotating texts in paired comparison, annotating more abstract levels than the actual text string, and grouping annotations. It should also be able to visualise types of annotations in comparison.

On the conceptual level of category development, it turns out that categories in the (analogue and digital) humanities are complicated and hard to operationalise; it is therefore necessary to subdivide phenomena and categories into simpler units that are more manageable. Category development can be strengthened by systematic comparison, which should be established as a heuristic technique to give more structure to the whole process.

Federico Pianzola gave insights into Fandom Metadata on AO3 and their use for literary research. The goal of his project is to investigate the role of metadata for literary studies. He also argued that it would be desirable to have much richer metadata than is presently available in databases like the catalogue of the Library of Congress. The desired kind of data can be found in the fanfiction metadata of the Archive of Our Own (AO3), which contains 6.5 million stories, of which around 90% are in English. AO3 can be used to gain insights into the history and evolution of a specific genre of literature and its readership, which could in turn be generalised to other genres.

The tagging system in AO3 uses freeform tags, which allow the authors to cover any relevant aspect of their stories, e.g. the psychological traits of a character, the narrative strategy of the story, or its setting and timeframe. In total there are 13 million tags, most of them freeform. The tags are curated by volunteers and aggregated, not replaced. To analyse the tags, they have to be transformed into a canonical form. This is done by tag wranglers, who add tags in the canonical form in addition to the user-provided tags.

Pianzola remarked that data retrieval from the AO3 front end is problematic, because the mapping of the user-generated tags to the canonical tags is stored in the backend only and is lost when scraping the site. Therefore, Pianzola had to go through several steps to create a knowledge base:

He created classes for the four main categories (FandomTag, CharacterTag, RelationshipTag, and FreeformTag) and several subclasses.
He then copied all the tags from the main page of the fandom and created OWL objects for each of them, assigning them to the respective classes.
He determined which of them are considered canonical in the AO3 backend.
He copied the synonyms for each canonical tag and linked them to the respective OWL objects.
He linked the character tags to the relationship tags.
He also linked the freeform tags to the relationship and character tags.
Finally, he completed the knowledge base by resolving coreferences and inferring axioms.
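
The structure of such a knowledge base can be sketched in plain Python. This is a simplified stand-in rather than Pianzola's OWL model: only the four class names come from the report, while the KnowledgeBase class, its methods and the placeholder tag names are hypothetical, and no inference of axioms is attempted.

```python
from dataclasses import dataclass, field


@dataclass
class Tag:
    name: str
    canonical: bool = False
    synonyms: set = field(default_factory=set)   # user variants, kept not replaced
    linked: set = field(default_factory=set)     # names of related tags


# The four main categories named in the report, as plain subclasses.
class FandomTag(Tag): pass
class CharacterTag(Tag): pass
class RelationshipTag(Tag): pass
class FreeformTag(Tag): pass


class KnowledgeBase:
    """Hypothetical in-memory store for tags, synonyms and links."""

    def __init__(self):
        self.tags = {}
        self.canonical_of = {}

    def add(self, tag):
        self.tags[tag.name] = tag
        return tag

    def add_synonym(self, canonical_name, variant):
        """Record a user-provided variant alongside its canonical tag."""
        self.tags[canonical_name].synonyms.add(variant)
        self.canonical_of[variant] = canonical_name

    def link(self, first, second):
        """Link two stored tags, e.g. a character and a relationship."""
        self.tags[first].linked.add(second)
        self.tags[second].linked.add(first)

    def resolve(self, name):
        """Return the stored tag name for a tag or one of its variants."""
        if name in self.tags:
            return name
        return self.canonical_of.get(name)


# Placeholder tag names, not real AO3 data.
kb = KnowledgeBase()
kb.add(FandomTag("Example Fandom", canonical=True))
kb.add(CharacterTag("Character A", canonical=True))
kb.add(RelationshipTag("Character A/Character B", canonical=True))
kb.add_synonym("Character A", "char a")
kb.link("Character A", "Character A/Character B")
```

Storing the variants next to the canonical tag mirrors the point that tags are aggregated rather than replaced, so the authors' original wording remains available for analysis.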

Pianzola used the database to analyse the development of fanfiction over time, which revealed an increasing diversification of characters and relationships over time.

The next step Pianzola has planned is to extend the knowledge base to cover the whole of AO3 and to link it to other available databases. This will be possible once he gets access to a dump of the AO3 database.

Itay Marienberg-Milikowsky and Ophir Münz-Manor talked about Visualization of Categorization: How to See the Wood and the Trees. Marienberg-Milikowsky started with the observation by Andrew Piper that generalisations can be boring, but that they are important because they are at the core of literary analysis, and that it is important to proceed systematically, cautiously and slowly. He immediately countered this with Jonathan Culler’s view that only extreme interpretations are interesting.

He then asked whether it is possible to reconcile both views: is it possible to do meticulous scholarship and still produce interesting claims about the subject? The slow and boring but accurate approach can be identified with data-driven research, and the extreme interpretation approach with hypothesis-driven research.

Data-driven research and hypothesis-driven research should be balanced. But how, Marienberg-Milikowsky asked, can this goal be achieved? When students start annotating with CATMA, for example, they tend to want to annotate just about everything, which leads to less precisely differentiated categories of annotation, interpretation and text. The students start like Culler and end like Piper. This issue can be tackled by transitioning from interpretation-oriented text annotation to reduction-oriented text classification. Text annotation is a device designed to decompose, whereas text classification is a device designed to connect the parts in a meaningful way. This brought him to the metaphor of the forest and the trees, where text annotation equals the observation of an individual text and its properties, while the corpus, consisting of many texts, resembles the forest. To investigate the forest, whole trees need to be considered, not single leaves and branches.

He emphasised that this approach differs from Moretti’s distant reading, because the texts are interpreted in the process of manually annotating them. After the painstaking and boring work, a space for speculation opens up: new classifications can be invented, hypotheses can be tested, extreme interpretations can be advanced.

Münz-Manor went on to present ViS-À-ViS, a tool for pattern recognition in annotated texts that is built on CATMA. He showed how it can be used to visualise different kinds of annotations which then can lead to different kinds of hypotheses and thus to interpretation.

He then detailed the principles that underlie ViS-À-ViS. Visualisation is the key, the gateway to distant reading in Moretti’s sense. Categories are artefacts that are formed in an interpretative process, i.e. they are not self-sufficient but human-based. Then there is a reciprocal interplay of annotation and categorisation, which (de)stabilises interpretations. And finally, annotations of text must themselves be considered as text.

Staying with the metaphor of the tree and the forest, Münz-Manor asked what a tree and what a forest is. On one level a text can be at the same time a forest, composed of trees of linguistic features, and a tree, as a part of a corpus. The boundaries are not clear-cut and depend on the context and the intended use.

The features of ViS-À-ViS can help to get different views on the trees and forests, that is, different levels of visualisation. He demonstrated the various views on a collection of 250 poems in the tool, pointing out the pattern detection capabilities they offer.

Marienberg-Milikowsky concluded the presentation by underlining that instead of wanting to automate annotations and interpretation, we can start from generalisations and then narrow down from forest to trees to roots of trees. In this way the systematic investigation of literature and exciting speculation can be combined.

Silke Schwandt and Juliane Schiel gave a talk with the title “Slavery, Coercion and Work” — How to overcome a Eurocentric Language of Analysis in the Humanities? Collaborative Text Annotation and Text Mining Analysis for the Development of a New Analytical Framework, addressing the challenges that arise in interdisciplinary, multilingual and intercultural research.

Schiel started by introducing the project. It is part of the COST Action WORCK network, which connects about 150 scholars through their common interest in labour and power relations, viewed through the prism of coercion. The goal is to link projects across disciplines and also across the time periods they focus on.

The focus of the network is on coercion, but since the participating projects differ in discipline, region, language and time period, several questions arose. On the content level: How can overarching forms and categories of coercion be detected in the source documents? How can the language and perspective in the sources be related to the social context? On the meta level: How can mutual understanding and cross-corpora comparisons be enabled across time periods, languages, regions, corpus genres, social contexts and scientific communities?

The key challenge is that the analytical language of the humanities is a product of the modern West. This inhibits a dialogue at eye level, in which all contexts of coercion can be put side by side without the distorting effects of “false” translations.

Schiel then presented two ways out as answers to these questions. The first way out formulates two imperatives: De-naturalise your own use of category systems! And: Train yourself in an analytical language derived from the taxonomies and ontologies of the source documents! The second way out determines two modes of operation. The first consists of regular online meetings for close reading of heterogeneous source material, the second of producing data stories with CATMA.

Next, Schiel sketched a workflow for studying coercion, which involves the following steps:

Annotation of actors, contextual information, phrases of coercion in CATMA.
Working out a joint tagset across datasets mainly in CATMA.
Producing x-rays of the semantics and situations of coercion through reduction, abstraction, and visualisation.
Comparing x-rays and finding patterns in semantics and situations of coercion.
Remodelling the study of power and labour relations through a grammar of coercion.
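
The middle steps of this workflow (mapping local tags onto a joint tagset, reducing annotations to a compact profile, and comparing profiles across datasets) can be illustrated with a minimal sketch. It is not the project's actual pipeline and does not use the CATMA API; the dataset names, tag names, example phrases and the functions `tag_profile` and `shared_tags` are all hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical annotations as (dataset, local_tag, annotated_phrase).
# In practice these would be exported from an annotation tool.
ANNOTATIONS = [
    ("dataset_a", "actor", "the overseer"),
    ("dataset_a", "coercion_phrase", "was bound to serve"),
    ("dataset_a", "context", "harvest season"),
    ("dataset_b", "person", "the master"),
    ("dataset_b", "force_expression", "compelled to labour"),
    ("dataset_b", "force_expression", "held against his will"),
]

# Hypothetical mapping from each dataset's local tags to a joint tagset.
JOINT_TAGSET = {
    "actor": "actor",
    "person": "actor",
    "coercion_phrase": "coercion",
    "force_expression": "coercion",
    "context": "context",
}


def tag_profile(annotations, mapping):
    """Reduce annotations to per-dataset counts of joint tags."""
    profiles = {}
    for dataset, local_tag, _phrase in annotations:
        joint_tag = mapping[local_tag]
        profiles.setdefault(dataset, Counter())[joint_tag] += 1
    return profiles


def shared_tags(profiles):
    """Return the joint tags that occur in every dataset."""
    tag_sets = [set(counts) for counts in profiles.values()]
    return set.intersection(*tag_sets) if tag_sets else set()


profiles = tag_profile(ANNOTATIONS, JOINT_TAGSET)
for dataset, counts in sorted(profiles.items()):
    print(dataset, dict(counts))
print("shared:", sorted(shared_tags(profiles)))
```

Counting joint tags is only one possible form of reduction and abstraction; the point of the sketch is that datasets become comparable once their local tags are expressed in a shared vocabulary.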

The source-centred, bottom-up approach led to challenges in the COST Action WORCK project, explained Schwandt. The first issue is the multilingualism of the sources: there are regional differences, historical variations of the languages (e.g. medieval Italian differs from the modern Italian the algorithms are trained on), and the open question of how to handle the languages. Should they be translated or should they be transcribed? And which language should be used for the categories?

The central question is what the common analytical language should be. According to Schwandt, modern English is the best choice for a common language despite the “baggage” of inherent biases. Still, the Eurocentrism and the baggage are problematic and need to be carefully mitigated by reflection and review.

Finally, Schwandt introduced nopaque, a platform developed at Bielefeld University that aims to provide an annotation workflow for researchers. It facilitates data processing (from file setup through OCR to NLP) and data analysis, and is intended to enable exchange with CATMA.

Evelyn Gius concluded the workshop by noting that many connections between the varied research projects had been revealed during the sessions. And while both the questions and the fields of research differ greatly, it became clear that everybody is engaged in developing and reflecting on categories, and that many struggle with issues and problems linked with building category systems: How to handle categories that are implied by a system, how to process canonical categories, how to recombine existing categories, and how to obtain new categories, based on machine learning or manual annotation.

Gius went on to emphasise that the levels and types of categories we are working with are often not clear. Once we begin to think about categories and start to work with them, a perpetual process of reflection, problematisation and amelioration is set in motion.

Another point Gius made was that one of the major differences between NLP and DH is that the Digital Humanities are more interested in building models of specific phenomena, models which cannot easily be generalised and adopted in other research settings, whereas NLP seeks general models with higher degrees of abstraction and universality.

Another commonality addressed by many talks was the question of how we recognise the point at which categories are good, or at least good enough.

Finally, Gius called attention to one more thing to be directly taken away from this workshop: the need for more standardisation when talking about categories or tags or taxonomies or types or vocabularies or ontologies or classifications or conceptualisations.
