Copyright & Licensing: Current context and considerations for researchers and libraries using AI in research today

Contributed by: Alex Fenlon, ORCID iD and Maria Rehbinder, ORCID iD

Original published date: 12/06/2024
Last modified: See Github page history

Suggested Citation: Alex Fenlon, Maria Rehbinder, “Copyright & Licensing: Current context and considerations for researchers and libraries using AI in research today”, Digital Scholarship & Data Science Essentials for Library Professionals (2024), [DOI link tbd]

Introduction

This guide aims to provide library professionals at European research institutions, particularly those supporting or undertaking activities that combine AI tools and methods with digital cultural heritage collections and data, with a brief overview of the current copyright and licensing context for such research.

Licensing basics

Licensing agreements are a central type of legal contract concerning copyright works. Use of published content within institutions will be covered by licences. Access to ebooks, e-journals, databases and similar resources is covered by licences. It is important to know and understand the terms under which access to content is provided so that users can use the content without risk of breaching the licence terms.

Many of the permitted uses within licences closely mirror the exceptions in copyright law; however, they can provide a clarity and certainty that the exceptions may not.

In general, licences give institutions, their staff and students, permission to use the licensed content for specific purposes. These purposes are usually limited to education and non-commercial research activities, i.e. the students, researchers and educators can use the content for their purposes but HR or financial teams that are not directly engaged in research or teaching delivery cannot.

The terms of the licences will vary but they frequently allow for saving or printing of parts of the licensed works. Some licences will allow users to include extracts within teaching materials or even within publications. How much can be used will also vary.

It is important that these purposes and licence terms are clearly understood and that the use an institution expects to make of the content is expressly permitted within the licence terms. For example, if an institution has an active research interest in data mining, licences that seek to prevent or restrict this activity will be problematic. Where licence terms are unclear, this should be raised with the provider. Where they conflict with the exceptions in law, this should also be queried.

Institutions may also rely on licences provided by collective licensing societies, official bodies that represent a group of authors, publishers or rights holders. The licences may cover things like photocopying and scanning of printed works, showing broadcast television programmes or playing recorded music. Again, understanding the terms, the uses and the obligations is important.

Relevance to the Library Sector (Case Studies/Use Cases)

Libraries and cultural heritage institutions today do more than support academics in computational research by providing guidance and, increasingly, computational access to digital collections. They also undertake digital research in their own right as part of library work, using existing models and developing new ones to analyse digital collections at scale for metadata improvement and enhancement and a whole host of other applications. For European institutions undertaking this work, it is important to understand the TDM rights granted by the Directive on Copyright in the Digital Single Market (hereafter the DSM Directive, or Directive (EU) 2019/790) and the EU AI Act, and to keep up to date with the local context, which generally aligns with these but may in some cases differ, as in the UK.

This is an evolving area and our aim here is to provide a brief overview of the current copyright and licensing context for European researchers and institutions using AI tools and methods in research today.

Text and Data Mining (TDM)

Text and data mining (TDM) is a common research technique that allows researchers and research organisations to analyse large volumes of data using modern computing power. Under EU law, the Directive on Copyright in the Digital Single Market (hereafter the DSM Directive, or Directive (EU) 2019/790) introduced TDM copyright exceptions. TDM is defined in Article 2(2) as:

‘text and data mining’ means any automated analytical technique aimed at analysing text and data in digital form in order to generate information which includes but is not limited to patterns, trends and correlations;

TDM is a core part of machine learning and artificial intelligence (AI) technologies. It could include the harvesting and scraping of online data sources, or digitising printed items so that they can be read by computers.

DSM Directive Article 3 allows the use of copyright works in TDM activities for the purposes of non-commercial scientific research by research organisations and cultural heritage institutions, provided they have lawful access to the content. This exception is mandatory and right holders cannot override it using agreements or technical measures when EU legislation is applicable. If the agreement is concluded with an organisation outside the EU and is governed by legislation other than EU legislation, researchers should contact their legal department for advice on how the EU exceptions can be applied to contractual obligations (see Regulation (EC) No 593/2008 of the European Parliament and of the Council of 17 June 2008 on the law applicable to contractual obligations (Rome I)). It is advisable to state in agreements with US or other non-EU companies that the mandatory copyright exceptions in national EU member state legislation apply to the agreement, even where it is otherwise governed by US legislation.

DSM Directive Article 4 allows text and data mining for any purpose by any organisation or person, provided they have lawful access. However, right holders are permitted to opt out of this broad exception, for example using a machine-readable opt-out, as set out in Article 4(3):

The exception or limitation provided for in paragraph 1 shall apply on condition that the use of works and other subject matter referred to in that paragraph has not been expressly reserved by their rightholders in an appropriate manner, such as machine-readable means in the case of content made publicly available online.

What counts as opting out in an “appropriate manner, such as machine-readable means” has been discussed in a court case in Germany (see “Machine readable or not? - notes on the hearing in LAION e.V. vs Kneschke”, Kluwer Copyright Blog, kluweriplaw.com). On 27.9.2024 the District Court of Hamburg, Germany, handed down its decision in the first European case to examine reliance on the TDM exception for the purpose of training generative AI models.
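To make the idea of a machine-readable reservation more concrete, the following is a minimal sketch in Python of how a crawler could check one possible signal, a site's robots.txt file, before collecting content. It rests on several assumptions: the robots.txt content, the user-agent name ExampleTDMBot, the URL and the helper name is_reserved are all hypothetical, and whether robots.txt (or any other particular signal) counts as an “appropriate manner” under Article 4(3) is a legal question that this sketch does not answer.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content used for illustration only.
# A real check would read the file published by the website itself.
EXAMPLE_ROBOTS_TXT = """\
User-agent: ExampleTDMBot
Disallow: /

User-agent: *
Allow: /
"""


def is_reserved(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots.txt tells this user agent not to fetch the URL.

    Assumption: a robots.txt disallow rule is treated here as one possible
    machine-readable reservation signal. This is not legal advice.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://news.example.org/articles/2024/01/story.html"
    for agent in ("ExampleTDMBot", "OtherBot"):
        print(agent, "reserved:", is_reserved(EXAMPLE_ROBOTS_TXT, agent, url))
```

A project relying on Article 4 might run a check of this kind for each source and keep the result alongside the corpus, while noting that right holders may also express reservations in other ways, such as website terms and conditions.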

As the DSM is a directive, member states were able to implement certain elements as they saw fit, and this has led to some divergent approaches across the EU. Ireland, for example, provides that an author is entitled to be informed that a copy has been made for text and data mining purposes and to ask for details about the steps taken to ensure the security of the works copied (see Copyright and Related Rights Act, 2000 (as amended), sections 53A and 53B). A similar requirement of transparency regarding the materials used to train generative AI models is a key point in the AI Act; the sections concerning generative AI are applicable from August 2025.

Example 1

The library is a partner on a non-commercial collaborative research project at a university where a research team wants to engage in linguistic analysis, using computational methods, of EU newspaper articles the library has made available publicly online. In order to complete the research they need to extract all of the articles during a certain period to build a corpus of data sourced from different EU newspapers. Once the corpus is complete the researcher wants to use a computer program to perform the analysis. As the data is sourced from different newspaper websites, there is some data cleaning required.

All of the copying of the articles for the purposes of this research is permitted as is any data cleaning under the terms of the TDM exception above. The research team is permitted to carry out any steps necessary to obtain and format the data to enable them to complete the analysis using computational methods. If the researchers want to use printed articles those could be digitised and made machine-readable too.

Example 2

Now that the corpus is complete, the researcher wants to use a free online AI service to complete the analysis.

The service’s terms state that the user declares that they own all of the input data they provide and grants the model provider a licence to retain the input data and use it as part of its training data. Neither the DSM Directive Article 3 exception nor its national implementations allow the researcher to grant a commercial company such a licence over the input data, so the researcher can only use paid services that the university has purchased and whose terms allow the input data to stay on the university VPN.

While the mandatory exception would cover the collection and analysis of the data for scientific research purposes, it does not make the researcher the owner of the data, nor does it give the researcher the authority to grant permissions for non-scientific training data uses. If the researcher were to give the material to the AI system provider, contrary to the scope of the exception, the AI system provider would not have acquired the data lawfully, so the general text and data mining exception would not apply.

Considerations for libraries and research projects using AI

If a research project intends to use copyright protected works as training data for AI models, researchers should consider how the text and data mining exception in EU legislation would allow the intended reproduction of works as training data. The recent decision of the Hamburg District Court treats scientific research as a wide concept. If the research outputs are to be commercialised, consideration must also be given to the new EU AI Act and its requirement in Article 53 to document all copyright protected training data.
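As an illustration of what keeping track of training data might look like in practice, here is a minimal sketch in Python of a per-work log that a project could maintain as it assembles a corpus. The AI Act does not prescribe this structure: the class name SourceRecord, the function to_manifest, every field name and the example values are hypothetical, chosen only to show the kind of information (source, right holder, legal basis relied on, whether an opt-out was checked) that may be useful to record.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class SourceRecord:
    """One entry in a hypothetical training-data log (illustrative fields)."""
    title: str
    source_url: str
    rights_holder: str
    legal_basis: str        # e.g. "DSM Art. 3", "DSM Art. 4" or "licence"
    opt_out_checked: bool   # whether a rights reservation was looked for
    retrieved_on: str       # ISO 8601 date, e.g. "2024-10-01"


def to_manifest(records: list[SourceRecord]) -> str:
    """Serialise the log to JSON so it can be stored alongside the corpus."""
    return json.dumps([asdict(r) for r in records], indent=2, ensure_ascii=False)


if __name__ == "__main__":
    # Example values are invented for illustration.
    records = [
        SourceRecord(
            title="Example newspaper article",
            source_url="https://news.example.org/articles/2024/01/story.html",
            rights_holder="Example Publisher",
            legal_basis="DSM Art. 3",
            opt_out_checked=True,
            retrieved_on="2024-10-01",
        )
    ]
    print(to_manifest(records))
```

Recording the legal basis per work at the point of collection is a design choice that makes it easier to revisit the corpus later, for example if research outputs move from non-commercial research towards commercialisation.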

If researchers are using copyright protected works in existing third party AI tools, use should be balanced against the exceptions or licences covering the content they wish to use. Researchers should consider whether they are required to grant the third party tool permissions to use the content, and whether this is possible within the scope of the licence or exception. As the researcher may not own the content they wish to input, they may not have the authority to grant those permissions. This is especially important where third party tools develop their model using input data.

Increasingly, publishers are seeking to restrict the use of licensed content as training data for AI models or as input to AI systems, while in other instances they are creating licensing agreements that allow large technology companies access to scholarly content (see “Generative AI Licensing Agreement Tracker”, Ithaka S+R, and “An academic publisher has struck an AI data deal with Microsoft – without their authors’ knowledge”). The ICOLC Statement on AI in Licensing offers a useful template for beginning to push back, to ensure that researchers and institutions are able to use licensed content for non-commercial research at least. Of course, institutions are heavily involved in the translation of research into commercial activity, so the ICOLC statement is only of limited use, but it’s a start.

Consideration should also be given to the impact of AI tools on library systems and the ingestion of content, openly licensed or otherwise, by generative AI tools. For instance, the KB, National Library of the Netherlands, restricts access to collections for training commercial AI and in January 2024 issued a Statement on commercial generative AI outlining its position “that commercial parties who crawl digital resources on websites on a large scale for training models, using applications such as ChatGPT, are not complying with the AI principles established by the KB in 2020.”

The harvesting of vast volumes of online content as training data for AI models is under increasing scrutiny, as is the harvesting of personal data. It should be noted that some right holders are actively seeking to prevent their content being analysed, accessed, or processed by any AI tools, regardless of whether they are “in-house”, third party, secure, protected, ring-fenced or otherwise. Different jurisdictions have differing legislation, and court cases will further define AI and copyright legal questions over time.

Taking the next step

If you are employed in an institution that is a member of LIBER, do join or reach out to the LIBER Copyright & Legal Matters Working Group, a group of librarians, lawyers, professors and communications professionals who monitor current European law and react to proposed changes on behalf of libraries, archives, researchers and students.