Tools

CATMA – Computer Aided Textual Markup & Analysis

CATMA is a practical and intuitive tool for literary scholars, students and other parties with an interest in text analysis and literary research. The newest version is implemented as a web application, so analytical results can also be exchanged via the internet, which makes collaborative work easier.

Key Features

  • Freely definable Tagsets, so that analytical categories of the user's own choice can be applied to the text
  • Advanced search in the text, using the natural-language-based Query Builder
  • A set of predefined statistical and non-statistical analytical functions
  • A context-sensitive help function and a user manual for better usability
  • The visualization of the distribution of items of interest (e.g. words, word-groups or Tags) in the text
  • The possibility of analyzing whole corpora of texts in a single step
  • Easy switching between the different modules
  • The easy exchange of documents, Tagsets and Markup information, facilitating cooperative textual analysis

CATMA integrates three functional, interactive modules: the Tagger, the Analyzer and the Visualizer. The Tagger module offers an intuitive graphical interface and a wide range of options for defining Tags suitable for marking up a text. Because it uses Feature Structures, CATMA remains flexible while still conforming to the relevant XML and TEI standards, which enables interoperability with other tools. As a Standoff Markup technique, Feature Structures also permit overlapping Markup. The Analyzer module contains different text-analytical functions as well as a natural-language-based Query Builder, allowing the user to execute complex and powerful Queries without having to learn a complicated Query language. The Visualizer module can generate distribution charts from the results of analyses, which makes those results easier to evaluate.
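
To illustrate why Standoff Markup permits overlapping Markup, here is a minimal Python sketch. It is not CATMA's data model or API: the `Annotation` class, the helper functions and the tag names are hypothetical and chosen only for illustration. The idea is that annotations are stored apart from the text as character offsets, so two ranges may partially overlap without violating any nesting rule, which inline XML tags would not allow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """A standoff annotation: a tag name plus character offsets into the text."""
    tag: str
    start: int  # inclusive character offset
    end: int    # exclusive character offset


def annotate(text: str, tag: str, phrase: str) -> Annotation:
    """Create an annotation for the first occurrence of `phrase` in `text`."""
    start = text.index(phrase)
    return Annotation(tag, start, start + len(phrase))


def covered_text(text: str, ann: Annotation) -> str:
    """Return the stretch of text that an annotation points at."""
    return text[ann.start:ann.end]


def overlaps(a: Annotation, b: Annotation) -> bool:
    """True if the ranges share characters but neither contains the other."""
    shares_text = a.start < b.end and b.start < a.end
    a_in_b = b.start <= a.start and a.end <= b.end
    b_in_a = a.start <= b.start and b.end <= a.end
    return shares_text and not (a_in_b or b_in_a)


# The text itself is never modified; annotations live beside it.
text = "Yesterday she said that she would leave tomorrow."

# Hypothetical tag names, chosen only for this example.
speech = annotate(text, "reported_speech", "said that she would leave tomorrow")
past = annotate(text, "past_reference", "Yesterday she said")

for ann in (speech, past):
    print(ann.tag, (ann.start, ann.end), repr(covered_text(text, ann)))
print("overlapping:", overlaps(speech, past))
```

Both annotations cover the word "said", yet neither range encloses the other. Because each one is only a pair of offsets, they can coexist without any conflict.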

Within the framework of the heureCLÉA project, we aim to implement further (semi-)automated functions using machine learning. The goal is to enable CATMA to generate automated Markup of time-related phenomena in literary texts up to a certain level of complexity, and to point out the cases in which automated Markup is impossible because of high complexity or ambiguity.

CATMA can be used here.

HeidelTime

HeidelTime is a multilingual, cross-domain temporal tagger. It extracts temporal expressions from text documents and normalizes them according to the TIMEX3 annotation standard, which is part of the temporal markup language TimeML. In contrast to most other temporal taggers, HeidelTime is not focused on the news domain, but aims at high-quality extraction and normalization of temporal expressions from multiple domains, e.g., news documents, narrative-style documents (such as Wikipedia articles), and colloquial text (e.g., tweets). Different domains have different characteristics and thus pose different challenges for temporal taggers. Only if these challenges are taken into account can temporal taggers achieve high-quality extraction and normalization results across domains.
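
As a rough illustration of what "extracting and normalizing" means, the following Python sketch finds fully specified English dates and wraps them in TIMEX3 tags with a normalized `value` attribute. It is a toy example and does not use HeidelTime or its rule syntax: the `tag_dates` function and the single regular expression are assumptions made for this sketch, and real temporal taggers handle far more expression types.

```python
import re

# Month names mapped to their numbers, used to normalize the matched date.
MONTHS = {name: number for number, name in enumerate(
    ["January", "February", "March", "April", "May", "June", "July",
     "August", "September", "October", "November", "December"], start=1)}

# Matches fully specified dates such as "June 14, 2013".
DATE_PATTERN = re.compile(
    r"\b(" + "|".join(MONTHS) + r") (\d{1,2}), (\d{4})\b")


def tag_dates(text: str) -> str:
    """Wrap each matched date in a TIMEX3 tag with an ISO-style value."""
    counter = 0

    def to_timex(match: re.Match) -> str:
        nonlocal counter
        counter += 1
        month = MONTHS[match.group(1)]
        day = int(match.group(2))
        year = int(match.group(3))
        value = f"{year:04d}-{month:02d}-{day:02d}"
        return (f'<TIMEX3 tid="t{counter}" type="DATE" value="{value}">'
                f"{match.group(0)}</TIMEX3>")

    return DATE_PATTERN.sub(to_timex, text)


print(tag_dates("The meeting took place on June 14, 2013."))
```

The extraction step is the pattern match. The normalization step is the conversion of the surface form "June 14, 2013" into the value `2013-06-14`.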

Key Features

  • HeidelTime applies domain-sensitive normalization strategies to address domain-dependent challenges.
  • Due to the strict separation between the source code and language-dependent resources, capabilities for additional languages can be integrated easily.
  • Based on HeidelTime’s well-defined rule syntax, existing rules can be modified and new rules can be added if necessary.
  • Currently supported languages are English, German, Dutch, Italian, Spanish, French, Arabic, Russian, Chinese, and Vietnamese.
  • HeidelTime was the best system for the full task of temporal tagging (extracting and normalizing) for English in the TempEval-2 challenge (2010) and for English and Spanish in the TempEval-3 challenge (2013).
  • HeidelTime is publicly available.
HeidelTime is available here. For testing HeidelTime, an online version is also available.
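
The domain-sensitive normalization mentioned in the key features can be pictured with a small Python sketch. It is not HeidelTime's implementation. It assumes, for illustration only, that a relative expression such as "tomorrow" is resolved against the document creation date in news text and against the most recently mentioned date in narrative text. The function names, the domain labels and the tiny offset table are all hypothetical.

```python
from datetime import date, timedelta
from typing import Optional

# Day offsets for a few relative expressions (illustrative, not exhaustive).
OFFSETS = {"yesterday": -1, "today": 0, "tomorrow": 1}


def pick_reference(domain: str,
                   creation_date: Optional[date],
                   last_mentioned: Optional[date]) -> Optional[date]:
    """Choose the reference date according to the (assumed) domain strategy."""
    if domain == "news":
        return creation_date      # anchor on when the document was written
    if domain == "narrative":
        return last_mentioned     # anchor on the date the text last mentioned
    raise ValueError(f"unknown domain: {domain}")


def normalize_relative(expression: str,
                       domain: str,
                       creation_date: Optional[date] = None,
                       last_mentioned: Optional[date] = None) -> Optional[str]:
    """Return an ISO date for a relative expression, or None if unresolved."""
    reference = pick_reference(domain, creation_date, last_mentioned)
    offset = OFFSETS.get(expression.lower())
    if reference is None or offset is None:
        return None
    return (reference + timedelta(days=offset)).isoformat()


created = date(2013, 6, 14)       # hypothetical document creation date
mentioned = date(1969, 7, 20)     # hypothetical date mentioned earlier in the text

print(normalize_relative("tomorrow", "news", created, mentioned))
print(normalize_relative("tomorrow", "narrative", created, mentioned))
```

The same word yields a different normalized value depending on the domain, which is why a single fixed strategy is unlikely to work equally well for news, narrative and colloquial text.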