Customized MT engines for literary translators – A case study

Most translation tools on the market don’t seem to be developed with a literary translator’s needs in mind. So why not train your own translation engine? Researcher Damien Hansen tested this approach together with literary translator Nathalie Serval. We asked him to share his insights with us and then followed up with a few additional questions.

June 07, 2024

Read this article in German translation: Personalisierte KI fürs Literaturübersetzen – eine Fallstudie

Most of us have grown accustomed to the machine translation (MT) hype, for better or worse, and the most recent wave, fuelled by the arrival of large language models (LLMs), is no different – except perhaps that the world at large is now getting to grips with issues that translators have been dealing with for years. Things have reached a point where more and more people are contemplating the use of translation technologies in the literary domain: Google and Tencent are directly involved in the research,[1] some language service providers already specialize in literary post-editing, and it is now fairly obvious that some publishers are pushing for this model, if not downright forcing it.

Our research[2] therefore aimed to take an objective look at what MT is – and is not – capable of when it comes to literary texts, by training a tailor-made MT tool for this specific purpose. And although I would never have imagined these developments occurring so quickly when the project began, it has become all the more important, in my opinion, to anticipate and reflect on these issues.


Why exactly did you feel the need to train your own engine for literary translation?

While it’s important to evaluate the use of existing tools, it’s equally important to keep in mind that MT needs to be trained on relevant data (i.e. taken from the same domain in which it will be used), so this was a necessary step for us to test the possibility of literary MT on a more fundamental level. In this case, we adapted a system not just to literary data, but also to a specific genre and translator (taking a fantasy saga translated into French by Nathalie Serval as a test case). This seemed logical when it came to building our training corpus, but we didn’t expect it to work so well out of the box, so we’re now arguing for the use of customized systems that translators might train themselves.

Fortunately, more and more studies are shedding light on the many facets of literary MT (LMT). As it happens, Dorothy Kenny and her colleague Marion Winters are about to publish an overview of the subject,[3] focusing in particular on issues of personalization and style. In our case, we noticed that the output was closer to Nathalie’s own work in terms of lexical richness and syntactic restructuring, and that the system had ‘learned’ patterns beyond the word level, omission strategies, etc. – even though it was still very far from being on par with a human translation.
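As a side note, the kind of stylistic closeness described here can be roughly quantified with simple corpus metrics. Below is a minimal sketch – not the methodology used in the study – of a type-token ratio comparison, a crude proxy for lexical richness; the sample sentences are invented for illustration:

```python
import re

def type_token_ratio(text: str) -> float:
    """Crude lexical-richness proxy: distinct words / total words."""
    tokens = re.findall(r"\w+", text.lower())
    return len(set(tokens)) / len(tokens) if tokens else 0.0

# Toy comparison between a repetitive and a more varied rendering
repetitive = "the knight rode and the knight fought and the knight slept"
varied = "the knight rode out, battled fiercely, then slept under the stars"

# The varied rendering scores higher, as it repeats fewer words
print(type_token_ratio(repetitive))
print(type_token_ratio(varied))
```

Real stylometric comparisons would of course use far longer texts and richer features (sentence length, syntactic patterns, etc.), but the principle is the same: measure how close the system’s output sits to a given translator’s habits.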


How much data would you need to adequately fine-tune a specific engine for a literary use case; in your experiment, you used 6 books translated by Nathalie Serval, right?

Initially, yes. We used six books from the same series (45,000 segments), on top of the 4 million out-of-domain segments of our generic corpus, to see how well the system would perform on the seventh and final volume. This improved the output by a very large margin. However, it became apparent that we needed much more data altogether – systems are typically trained on tens of millions of segments nowadays. We tried adding a bigger and more varied corpus of literary texts while keeping the focus on Nathalie’s work, which improved the fluency of the output but also made her style less apparent, so we still need to find the right balance. Nevertheless, this was a first prototype, and it was enough to show that customization at the individual level is a possible and promising avenue, especially if professionals can do it without relinquishing the rights over their data.

Now, an improved MT tool does not alleviate all of the issues of LMT, so we recently conducted an experiment with Nathalie for which Prof. Kenny was a tremendous help, as it was inspired by her and Marion’s study on literary post-editing with translator Hans-Christian Oeser.[4] It quickly turned out that while quality was not much of an issue – she is an experienced translator and can work with the material at hand – the process itself made the translation more complicated and time-consuming, while the segmentation and the presence of pre-translated text made it much, much more constraining. In the end, our discussion revealed that there are interesting uses for MT (sparking new ideas, confronting various solutions, etc.), but that we need an alternative, less pervasive way of integrating it. These reflections are critical to the use of translation technologies in creative domains; they have not yet been addressed extensively and could prove useful in other domains as well.


What approach did you and Nathalie Serval take to evaluate the output quality of your engine?

This was the subject of an important discussion before the experiment. While I’m confident that the classical post-editing model is unsuitable for creative texts, due to the priming effects and constraints on creativity that the scientific literature has already established,[5] we both felt it was the easiest way for Nathalie to interact and get acquainted with the tool, which she was very curious to see for herself. We therefore met, and I asked her to post-edit segments of a few chapters from the book that she had partly translated about ten years ago – all in a standard CAT tool, with our MT engine and DeepL plugged in. In the future, we would like to see whether a different approach to and implementation of machine translation could alleviate some of the constraining ergonomic problems that came up.

Another question that often comes up these days, and that seems to have really gained the interest of literary translators, is the use of LLMs. Unsurprisingly, people were quick to test their performance on translation tasks, but it is hard to make sense of all the preliminary and contradictory findings at the moment. What does seem clear is that they might allow us to work at the paragraph level instead of on isolated sentences, and that quality is much lower when English is not the target language – my own use of LLMs with French was quite unsatisfactory, as I found them producing simple translation mistakes that MT stopped making ages ago. Nevertheless, I think they could offer interesting avenues if used differently.


What reasons would we have today to train a specific engine for literary translation, versus simply using existing LLMs such as ChatGPT?

For now, I would say that the advantage of MT is that it’s much easier to control, train and customize, and that it’s trained specifically for translation, making it somewhat more trustworthy. We’ll see, in time, how easily LLMs can be fine-tuned for translation tasks, but research is just starting, as I mentioned. I do think they already offer interesting uses if we treat them not as translation engines, but as generation tools that you can interact with and ask questions, so as to then make the decisions yourself (e.g. asking for examples of a term in context, prompting the model to rephrase in a different style or with a different structure, etc.). Of course, this requires you to refine the prompts and compare the various outputs, which can take up some time. Whatever the use, however, one should remain cautious when using LLMs.
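The interactive uses described here – asking for alternatives rather than accepting a finished translation – are essentially a matter of prompt design. As a minimal, hypothetical sketch (the system and user wording below are invented, not taken from the project), such a request to any chat-style LLM endpoint might be assembled like this:

```python
def rephrase_prompt(sentence: str, style: str) -> list[dict]:
    """Build chat messages asking an LLM to propose alternative
    renderings of a draft, leaving the final decision to the
    translator rather than accepting output wholesale."""
    return [
        {"role": "system",
         "content": "You are an assistant for a literary translator. "
                    "Offer alternatives; do not pick a final version."},
        {"role": "user",
         "content": f"Rephrase this French draft in a {style} register, "
                    f"giving three options:\n{sentence}"},
    ]

# These messages could then be sent to any chat-completion endpoint.
messages = rephrase_prompt("Il faisait sombre dans la salle.", "archaic")
```

The point of the design is that the model is framed as a suggestion generator, not a translator: the human compares the options and decides.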

Essentially, large language models are based on the same technology as machine translation, although they require much, much more data. This is what allows them to handle different tasks, but it also means that they bring new and exacerbated problems to the table. With MT, there were already concerns regarding loss of creativity, reader engagement and translators’ voice, as well as risks regarding the pay and recognition of translators’ work. LLMs add concerns such as transparency, cost and lack of accountability, and they are more prone to omitting information or producing hallucinations (made-up or nonsensical content). Another key element in that respect is the data that comes with training these systems.


What sort of training data is used by online and open-source engines?

For generic MT systems, most of the data can be downloaded from repositories such as OPUS, which contains corpora from many domains in many language pairs (news and scientific articles, patents, EU and government proceedings, subtitles, etc.). This is what we did for our out-of-domain corpus. And seeing as part of the output was very close to DeepL’s, we can assume that they use more or less the same data, to which they likely add some of their own. The advantage of using an open-source framework to train your own engine – we used OpenNMT, but OPUS-MT aims to make this easier – is that you can choose exactly what goes into it, and aim for tailored, quality data.
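To give an idea of what this looks like in practice, here is a hedged sketch of an OpenNMT-py training configuration that mixes a large generic corpus with a small translator-specific one; the file paths and the weighting are hypothetical, not the project’s actual settings:

```yaml
# Hypothetical OpenNMT-py config: a large generic corpus mixed with
# a small in-domain corpus, which is oversampled via its weight.
data:
    generic:
        path_src: corpora/generic.en   # e.g. data gathered from OPUS
        path_tgt: corpora/generic.fr
        weight: 1
    in_domain:
        path_src: corpora/translator.en
        path_tgt: corpora/translator.fr
        weight: 10                     # sample the translator's work more often
save_data: run/example
src_vocab: run/example.vocab.src
tgt_vocab: run/example.vocab.tgt
save_model: run/model
```

Training would then proceed with the standard `onmt_build_vocab -config config.yaml` and `onmt_train -config config.yaml` commands; the `weight` values control how often each corpus is sampled during training, which is one simple way of striking the balance between generic fluency and an individual translator’s style.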

For LLMs, it’s a different story – partly because most of it is kept under wraps, and partly because the need for data is exponentially higher. One thing we do know is that it involves a lot of crawling to gather as much data as possible: public domain works, but also data containing personal information, resources requiring explicit authorization from their authors, copyrighted material, etc. Even open language models released online for the sake of transparency make use of freely shared repositories of copyrighted books, for instance. This isn’t new – Google doesn’t hide the use of its ebook database for its MT engine – but it now happens on a much larger scale, and the companies don’t expect any successful complaint against it, due to the heavy transformation that the data is subjected to, as well as the opacity of the whole process.

Researchers are starting to address these issues of ownership in MT training data, and LLMs will probably follow.[6] In any case, these discussions are becoming increasingly essential for the profession and will be needed for negotiation with clients or translator training, but also to collectively address these developments, ideally in collaboration with representatives of other professions.


[1] See, for instance, the recent WMT campaign.

[2] Hansen, D. & E. Esperança-Rodier. 2022. “Human-Adapted MT for Literary Texts: Reality or Fantasy?”. NeTTT 2022, Rhodes, Greece, pp. 178–190.

[3] Kenny, D. & M. Winters. Forthcoming. “Customization, Personalization and Style in Literary Machine Translation”. Translation, Interpreting and Technological Changes: Innovations in Research, Practice and Training, edited by M. Winters, S. Deane-Cox, and U. Böser, London, Bloomsbury.

[4] Oeser, H.-C. 2020. “Duel with DeepL: Literary Translator Hans-Christian Oeser on Machine Translation and the Translator’s Voice”. Counterpoint 4, pp. 20–24.

[5] Guerberof-Arenas, A. & A. Toral. 2022. “Creativity in Translation: Machine Translation as a Constraint for Literary Texts”. Translation Spaces, 11(2), 184–212.

[6] See, for instance, “Foundation Models and Fair Use” by Henderson et al.


Damien Hansen is a PhD candidate in Translation Studies at the University of Liège (CIRTI) and in Computer Science at the Grenoble Alpes University (LIG/GETALP). His thesis mainly focuses on literary machine translation, but his other and past research interests include computer-assisted literary translation, the evolution and reception of translation technologies, as well as machine translation for game localization and the semiotics of video games. You can find more information about his work on his website, on X (Twitter) at @LiteraryLudeme, or on LinkedIn.


Translation: Andreas G. Förster
Interview: Heide Franck

Picture credits: kokoshka