Working with Data

Contributed by: Niamh Malin, ORCID
Original published date: [DD/MM/YYYY]
Last modified: See GitHub page history

Suggested Citation: Niamh Malin, “Working with Data,” Digital Scholarship & Data Science Essentials for Library Professionals (2024), [DOI link tbd]

Introduction

Data is all around us and is almost impossible to avoid. Libraries and universities are full of data, but staff often think they do not work with data. Have you ever been asked how many loans a book has had? What about how many visitors came in at the weekend? Or whether a certain book is available? These are all data questions, answerable with quantifiable facts.

Data is often thought of as a number, like the number of times a book has been loaned, but what about who borrowed it, or what their review was? These are all different forms of data, and they can provide new insights into the popularity of a book. Suppose one book was borrowed 10 times but received ten 1-star reviews, while another was borrowed 5 times and received five 5-star reviews. Using this data, which one would you recommend?

What is Data?

Data can feel overwhelming (the term ‘Big Data’ floats around a lot!), but that shouldn’t stop you from exploring it, because it can greatly improve your work. To be clear, Big Data will not be discussed here. Big Data is high in both volume and velocity (think about downloading a document every time the hashtag #photooftheday is used), which is not the case for most library data.

So first, let’s establish some definitions (for this guide at least) while looking at one of our most central collections of data:

A new book has arrived, and you create a record on the library management system, filling in the title, ISBN and price:

Term: Definition
Data: A collection of information.
Data Point: A single point of data (e.g. the title of a book). Alone, that information tells us very little about the book, so a ‘record’ containing the relevant data points is created.
Metadata: Data that provides information about other data (e.g. columns of data in a spreadsheet; in a library catalogue, the metadata about a book includes the price, ISBN, author and title).
Dataset: A collection of data from a single source or for a single project (e.g. we might create a dataset of all catalogue records relating to a particular topic of interest to researchers).
Database: A table with structured (organised) data.
Dataframe: A structured representation of data.
Software: A program for a computer.
Hardware: The physical components of technology.
Back-end: The part of a system that is not usually visible or accessible to a user of that system.
Front-end: A software interface designed to enable user-friendly interaction with software.
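
To make these definitions concrete, here is a minimal Python sketch of the new-book example. The titles, ISBNs and prices are invented for illustration, and the variable names are assumptions rather than anything a library management system uses. It treats a record as a dictionary of data points, the field names as a simple form of metadata, and a list of records as a small dataset.

```python
# A single record: each key/value pair is one data point.
# All values are invented for illustration.
record = {"title": "An Example Book", "isbn": "9780000000002", "price": 12.99}

# The field names describe the data, so they act as simple metadata.
field_names = list(record.keys())

# A dataset: a collection of records gathered for one purpose.
dataset = [
    record,
    {"title": "Another Example Book", "isbn": "9780000000019", "price": 8.50},
]

print("Fields:", field_names)
print("Number of records:", len(dataset))
```

On its own, the title tells us little; it is the record as a whole, and the dataset of many records, that lets us start asking questions.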

Where is the data in a library?

Data can be qualitative (such as sentiment) or quantitative (numerical). When we think about visitors to a library, do you know how they felt about their visit (qualitative data) or how many hours they spent at the library (quantitative data)?

Here are some examples to get you thinking about what data points stem from a library:

  • Institution: The Building / Space; Wider Community; Stakeholders; Facilities; Business; Opening Hours
  • Staffing: Library Staff; Institution Departments; Qualifications and Education
  • Acquisitions: Subscriptions (ongoing); Purchasing (one time); Software; Suppliers; Spending / Financial Information
  • Resources: Materials metadata; Collection(s); Care and Maintenance; Discovery Services; Accessibility Services
  • Engagement: Research Data Management; Training; Outreach; Tours; Communication; Quantity of Users
  • Patrons: Helpdesk (in-person); Queries (email, online chat etc.); User Support; Feedback; Interactions

Considering the above, what do you interact with? Do you consider yourself to work with data?

Relevance to the Library Sector (Case Studies/Use Cases)

Data exists, so what?

Data should have a purpose. If data has no purpose, why waste your time managing it? The purpose of data could be any of the following:

  • Provide insights to pertinent questions, and thus make informed decisions.
  • Provide evidence of a decision or conclusion.
  • Explore relationships / trends, to tell a story and provide insights.
  • Explain (and display) complex information more effectively.

Data can provide additional power when making a decision (big or small). Knowing exactly how many users visit at the weekend can show that the library does require the funding and staffing to stay open on the weekend. However, if the data shows that no one is visiting, do not fudge the numbers. Use those numbers to ask why: what else could attract users, what do patrons currently use the library for, and was there a cause for the decline in weekend patrons?
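
The weekend question above can be answered with very little code once visits are recorded with a date. The following is a minimal sketch in Python, assuming a hypothetical visit log with one date per visitor; the dates and the names `visits` and `is_weekend` are invented for illustration.

```python
from datetime import date

# Hypothetical visit log: one entry per visitor, recording the date of the visit.
visits = [
    date(2024, 3, 1),  # Friday
    date(2024, 3, 2),  # Saturday
    date(2024, 3, 2),  # Saturday
    date(2024, 3, 3),  # Sunday
    date(2024, 3, 4),  # Monday
]


def is_weekend(day):
    """Return True for Saturday or Sunday."""
    # date.weekday() returns 0 for Monday through 6 for Sunday.
    return day.weekday() >= 5


weekend_visits = sum(1 for day in visits if is_weekend(day))
weekday_visits = len(visits) - weekend_visits

print(f"Weekend visits: {weekend_visits}, weekday visits: {weekday_visits}")
```

A real gate counter or booking system will export data in its own format, so the first step in practice would be converting that export into dates like these.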

Proof is in the Pudding

Knowing that data exists in a library is the first step, but what can be done with it? This guide has touched on a few questions you can use data to answer; the real-world examples below show libraries using their data to make powerful changes.

Data can inform decisions about:

  • Collection Management: what is being used, when, by whom, and why (if you have a review system).
    • Example: Within our collection, how often do users borrow material published before 2000? (But remember, borrowing is not the only form of usage!).
  • Outreach: who forms our community, and what matters to the community.
    • Example: Does the demographic of people attending our events reflect the demographic of people in our wider community? Which events could make the library more appealing to the wider community?
  • Funding / Expenditure: Use data about patron usage, reader requests or patron feedback as part of an application for additional funding or staffing to prove your standing within the community.
    • Example: How much would it cost to buy an additional copy of a book, if it had more than three reservations at a given time?
  • Reviewing Success: When starting a new project or approach, consider how it will be measured as a success. Knowing how to measure the impact of a project will enable you to actively reflect and produce solid facts about the success of your project.
    • Example: How can we measure patron satisfaction currently, so that at the end of the project a 10% increase can be quantified and thus the project can be considered a success?
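
The funding example above (buying an additional copy once a book has more than three reservations) is simple arithmetic over catalogue data. Here is a hedged sketch in Python, assuming a hypothetical extract of titles with current reservation counts and the price of one copy in pence; the figures and field names are invented for illustration.

```python
# Hypothetical catalogue extract: reservations currently held
# and the price of one copy (in pence, to avoid rounding errors).
titles = [
    {"title": "Book A", "reservations": 5, "price_pence": 1299},
    {"title": "Book B", "reservations": 2, "price_pence": 850},
    {"title": "Book C", "reservations": 4, "price_pence": 2000},
]

# Buy one extra copy when reservations exceed this number.
RESERVATION_THRESHOLD = 3

to_buy = [t for t in titles if t["reservations"] > RESERVATION_THRESHOLD]
total_cost_pence = sum(t["price_pence"] for t in to_buy)

print("Titles needing an extra copy:", [t["title"] for t in to_buy])
print(f"Total cost: {total_cost_pence / 100:.2f}")
```

Changing `RESERVATION_THRESHOLD` shows how sensitive the cost is to the policy, which is useful evidence for a funding application.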

Here are a few real-world examples of large data projects within libraries, but remember, small or internal projects should be informed by data too.

The Watson Library at the Metropolitan Museum of Art (New York) describes how it produces data visualisations of its library activities using Microsoft Power BI. Using the library management system (data which they already had), Google Analytics, digital collections and some manual tallies, they explored six years of library data. This exploration demonstrated the benefit of moving their blog from an external website, created an index of African American artists, and made it easy to share library metrics for the previous six years publicly.

It is well known that publications databases take time to update, but how long? Eric Schares, Collection Analysis Librarian at Iowa State University, wanted to know how long to wait before being able to analyse a calendar year of publications, and set about finding out by comparing data from three different publication services (Dimensions, Web of Science and OpenAlex). The data (which continues to be updated) documented the rise of OpenAlex and demonstrated part of the effectiveness of each service for Iowa State University. He also shares his code, so you can replicate this data question at your own institution.

Niamh Malin, Data Analyst at the University of Cambridge Library, was responsible for supporting the many publisher negotiations. The first output, a paper, addresses the use of Microsoft Excel and Dimensions publications data to document the relationship of the university with Springer Nature. The second, a presentation, documents the transition to Microsoft Power BI to create data dashboards which ensure every publisher negotiation is informed by usage and publications data.

The National Information Standards Organization (NISO) hosted a two-part webinar in 2018 about extracting value from library data. The slides for the six presentations are available, and cover topics such as setting up a data analysis strategy, analysing metadata, utilising data visualisations, actioning data insights for utilising physical spaces, and building confidence in your data analysis skills.

Hands-on activity and other self-guided tutorial(s)

Now that you’ve found some data, and/or you’ve formulated some questions to answer, how do you know the data will be good enough? Let’s take some time to assess the data and ensure it is fit for purpose, because a key adage in data analysis is:

GIGO: Garbage in, garbage out!

The process of utilising data is a cycle. A standard process model that describes a common approach to data is CRISP-DM (Cross-Industry Standard Process for Data Mining). It has six stages, listed below.

This section will enable you to assess your data through a variety of leading questions at each stage of the cycle. How long each phase takes will depend entirely on the project; there is no right or wrong. Because the process is cyclical, you will return to questions as you develop the project. The stages are structured as questions because one guide cannot have all the answers. Instead, a variety of questions should help ensure you do not miss anything, and prompt you to learn much more that is relevant to your project. This approach will enable you to assess whether the data you have is relevant to the question, and whether it has the potential to answer it correctly.

This guide hopes to calm your nerves about using data, and to show that you are capable of answering data questions confidently.

Business Understanding

The Business Understanding phase focuses on understanding the objectives and requirements of the project. Here you are planning your project.

  • What questions are you answering?
  • Who are your stakeholders?
  • What are the criteria and limitations to the project?
  • What is the goal of using this data? Doing this project?
  • What resources are (or are not) available?
  • What is the timeline on the project? And the data input specifically?
  • What stage of the project requires data analysis?
  • What is the expected outcome of the data analysis?

Data Understanding

The Data Understanding phase focuses on identifying, collecting and analysing the relevant datasets.

  • Can you access the data you require? What is the source?
  • What data (and metadata) is missing?
  • How reliable is the data?
  • What relationships are relevant within the data?
  • Who is involved in gathering and preparing the data?
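
One of the questions above, what data is missing, can be checked quickly before any analysis begins. The sketch below is a minimal Python example under stated assumptions: the records are invented, and the helper `count_missing` is a hypothetical name, not part of any library system. It counts empty or absent values for each field.

```python
# Hypothetical extract of catalogue records; None or "" marks a missing value.
records = [
    {"title": "Book A", "isbn": "9780000000002", "author": "Smith"},
    {"title": "Book B", "isbn": "", "author": "Jones"},
    {"title": "Book C", "isbn": None, "author": ""},
]


def count_missing(rows):
    """Return a dict mapping each field name to its number of missing values."""
    missing = {}
    for row in rows:
        for field, value in row.items():
            if value is None or str(value).strip() == "":
                missing[field] = missing.get(field, 0) + 1
            else:
                # Make sure complete fields still appear, with a count of 0.
                missing.setdefault(field, 0)
    return missing


missing_counts = count_missing(records)
print(missing_counts)
```

A field with many gaps is not necessarily unusable, but knowing the count early helps you judge how reliable any conclusion drawn from it will be.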

Data Preparation

A common rule of thumb is that 80% of the project is data preparation.

  • What data is not necessary from the dataset?
  • How are errors or duplications handled?
  • What new attributes or formulas are required?
  • How do datasets interact with one another?
  • Is the data formatted correctly?
  • What acronyms are in use? Are they formatted correctly?
  • Has the data been standardised?
  • Are the column names useful and appropriate?
  • Does every database have a unique identifier column?
  • How do you handle missing metadata?
  • Are you editing the master / only copy?
  • Have you documented the process so that you can provide evidence if needed?
  • Is the data in a clear and useable tabular format (rows and columns)?
  • Have all merged cells been removed / updated?
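
Several of the questions above (standardised formatting, acronyms, duplicates, a unique identifier, and not editing the master copy) can be illustrated together. This is a hedged sketch in Python using invented rows; the `FORMAT_LABELS` mapping is an assumed house style, and `clean_row` and `deduplicate` are hypothetical helper names.

```python
# Hypothetical raw rows with inconsistent formatting and a duplicate record.
raw_rows = [
    {"isbn": "978-0-00-000000-2", "title": "  book a ", "format": "HB"},
    {"isbn": "9780000000002", "title": "Book A", "format": "hardback"},
    {"isbn": "9780000000019", "title": "book b", "format": "PB"},
]

# Assumed house style: expand acronyms to a single agreed label.
FORMAT_LABELS = {
    "hb": "Hardback",
    "hardback": "Hardback",
    "pb": "Paperback",
    "paperback": "Paperback",
}


def clean_row(row):
    """Return a standardised copy of the row, leaving the original untouched."""
    return {
        "isbn": row["isbn"].replace("-", "").strip(),
        # str.title() is a crude capitalisation rule, used here only for illustration.
        "title": row["title"].strip().title(),
        "format": FORMAT_LABELS.get(row["format"].strip().lower(), "Unknown"),
    }


def deduplicate(rows, key="isbn"):
    """Keep the first row seen for each value of the unique identifier."""
    seen = {}
    for row in rows:
        seen.setdefault(row[key], row)
    return list(seen.values())


clean_rows = deduplicate([clean_row(r) for r in raw_rows])
for row in clean_rows:
    print(row)
```

Because `clean_row` builds new dictionaries, `raw_rows` stays exactly as it was received, which keeps a master copy available if a cleaning step later turns out to be wrong.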

Modelling

What is often regarded as the most exciting work is also often the shortest phase. Now is the time to build and assess various models or visualisations.

  • What conclusions can be drawn from the data?
  • What visualisations are appropriate to display the data?
  • What comparisons and relationships should be highlighted to align with your initial goal?
  • Does the data contradict the hypothesis? Why?
  • Do the formulas need to be live? Tip: Live formulas within a spreadsheet can make it slow and large in size.

Evaluation

This phase looks more broadly at the data project and what to do next.

  • Did the data fulfil the goal of this project?
  • What could not be achieved?
  • What is required for the next data project to run more efficiently?
  • Has the project been completed successfully?

Deployment

Depending on the requirements, the deployment phase can be as simple as emailing a graph, or as complex as publishing a live dashboard of intricate data and visuals.

  • Who requires the outcome of this project?
  • Will the outcome need to be presented in multiple formats?
  • Will this be repeated? Has it been documented?
  • Where will the data and outcome be stored?

Storing a datafile / spreadsheet

Once the data exists, it must be accessible. A variety of software can store your data, so it is important to take the time to assess the needs of your data.

  • Is the file formatted correctly (CSV vs Microsoft Excel)?
  • Will it need to be accessed online and/or offline?
  • Does the file require a software licence to access?
  • What backups are required?
  • How long are files required to be stored?

Taking the next step

Ensuring Good Practice

Working with data is very exciting and rewarding, but can quickly become overwhelming. There are fears of being hacked and having data stolen, or even of losing data due to bad organisation or not knowing it needed saving. Sadly, this guide cannot have all of the answers on the practices and policies which should be in place. Instead, it asks a few thought-provoking questions to help you put worries about data loss and theft to rest.

  • Have you established a naming convention for files? Does it account for versions, dates and authors? Tip: Make titles meaningful, put their elements in a relevant order, and keep versions discernible.
  • When will the file(s) be deleted or archived?
  • Is there clear documentation of the processes undertaken? This ensures the task can be repeated and that any conclusions can be justified. Tip: This should include the source of the data, any cleaning steps taken, and all relevant metadata.
  • Where are the files stored, and are the folders easy to access? Is there a naming convention and folder hierarchy in place for the project and the team?
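
A naming convention is easier to follow when it is generated rather than typed. The sketch below shows one possible convention in Python (date, project, description, author initials, version); the ordering and the function name `build_filename` are assumptions for illustration, not a recommended standard.

```python
from datetime import date


def build_filename(project, description, author_initials, version, on_date, extension="csv"):
    """Build a file name from the parts of a hypothetical naming convention."""
    parts = [
        on_date.isoformat(),                     # ISO dates (YYYY-MM-DD) sort chronologically
        project.lower().replace(" ", "-"),
        description.lower().replace(" ", "-"),
        author_initials.upper(),
        f"v{version:02d}",                       # zero-padded so v02 sorts before v10
    ]
    return "_".join(parts) + "." + extension


filename = build_filename("Loans", "weekend visits", "ab", 2, date(2024, 3, 1))
print(filename)
```

Whatever order you choose, the important part is that the whole team uses the same one and that it is written down.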

What skills are you looking for?

If you have read through this guide and still thirst for more, here are some questions to encourage your exploration of library data!

  • What in-house data can you combine (such as the library management system and Google Analytics)?
  • Think of a report that is repeated regularly but is often time-consuming. Is there a software or data task that could make this quicker?
  • What software do you regularly use, and how could you expand those skills? Does the library management system provide specialist training?
  • Is coding, AI or visualisation an area you would like to explore? What else do you want to do with data?
  • Which elements of the data process do you want to focus on? Cleaning, analysing, visualising etc.
  • Can you attend a conference which discusses library data you are interested in?