By Junior Tidal (Web Services & Multimedia Librarian, Associate Professor, New York City College of Technology, CUNY)
Against the Grain Vol. 33 No. 1
As the COVID-19 pandemic has shifted numerous libraries toward virtual services, web analytics has become one of the metrics used to gauge patron engagement. However, because web analytics can reveal data about library users, pulling analytics can be problematic for a number of reasons. Most notably, the collection of user data through web analytics raises concerns about user privacy. This column will define library web analytics, outline popular tools for collecting analytics, describe how they are used at the New York City College of Technology’s Ursula C. Schwerin Library, and discuss potential problems that can arise when pulling and examining user data.
Web analytics can be defined as data collected from users who visit a website. Merek describes it as “a process through which statistics about website use are gathered and compiled electronically” (2011). This includes metrics such as when (date and time), where (geographical location based on IP address), and on what device (desktop vs. mobile) users load the site. The origins of web analytics lie in web server log files, which record the data clients request from a web server. This eventually evolved into more granular data collection, including the specific pages accessed on the server, the length of time users view a page, entry and exit pages, and referring sources.
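To make the server-log origins of web analytics concrete, the sketch below parses one request line in the widely used Apache/NGINX “combined” log format and pulls out the when, where, and what metrics described above. The log line, field names, and crude mobile check are invented for illustration; they are not drawn from any real library's traffic.

```python
import re
from datetime import datetime

# One request line in "combined" log format (an invented example, not real traffic).
LOG_LINE = (
    '203.0.113.7 - - [10/Feb/2021:14:32:01 -0500] '
    '"GET /databases HTTP/1.1" 200 5120 '
    '"https://library.example.edu/" "Mozilla/5.0 (iPhone; CPU iPhone OS 14_0)"'
)

LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\d+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_log_line(line):
    """Extract the when / where / what fields a log-based analytics tool starts from."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    # "when": parse the timestamp, including its UTC offset.
    rec["time"] = datetime.strptime(rec["time"], "%d/%b/%Y:%H:%M:%S %z")
    # "what device": a crude desktop-vs-mobile signal from the user agent string.
    rec["is_mobile"] = "iPhone" in rec["agent"] or "Android" in rec["agent"]
    return rec

rec = parse_log_line(LOG_LINE)
# rec["path"] is the page requested, rec["referrer"] the referring source,
# rec["ip"] the basis for coarse geolocation.
```

Everything a tool like this reports, from entry pages to referring sources, is aggregated from records like `rec`, which is why the same files raise the privacy questions discussed below.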
This information can help library staff better serve their users. My own institution, the Ursula C. Schwerin Library (known locally as the City Tech Library), uses web analytics to make statistical, evidence-based decisions. Knowing more about users, much as search engines, social media platforms, and ecommerce websites do, can enhance the user experience and make information easier to find. For example, we have used web analytics to determine which web pages to weed out based on users’ interactions, discover which browsers are most compatible with our library website, identify trending blog themes, choose social media platforms for marketing campaigns, serve mobile web pages suited to particular devices, find optimal channels for marketing electronic resources, and much more.
As an alternative to Google Analytics (GA), the City Tech Library uses Matomo to collect user data and inform marketing and web-based decisions. Matomo is an open source, freely available web analytics tool. It provides many of the metrics that GA collects; the main difference is that data collected by Matomo is controlled and stored by the library. It is not shared with Google or other third parties. Libraries concerned with patron privacy may find Matomo useful for that reason alone. In a similar vein, libraries may also reconsider using Google-based applications and hardware, since it can be assumed that aspects of the analytics data and metadata may be collected and used by third parties. The downside of Matomo is that administering the software requires advanced technical know-how. However, since it is open source, Matomo has a rich community that supports administrators.
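One privacy feature worth noting is that Matomo can be configured to anonymize visitors’ IP addresses by masking their final bytes before storage, keeping coarse geolocation while discarding the exact host. The sketch below illustrates that general masking technique in Python; it is an illustration of the idea under that assumption, not Matomo’s actual implementation, and the function name and sample address are invented.

```python
import ipaddress

def anonymize_ip(ip: str, masked_bytes: int = 2) -> str:
    """Zero out the last `masked_bytes` bytes of an IP address.
    Illustrative sketch of Matomo-style IP masking, not Matomo's own code."""
    addr = ipaddress.ip_address(ip)
    packed = bytearray(addr.packed)      # e.g. 198.51.100.42 -> [198, 51, 100, 42]
    for i in range(1, masked_bytes + 1):
        packed[-i] = 0                   # overwrite the trailing bytes
    return str(ipaddress.ip_address(bytes(packed)))

print(anonymize_ip("198.51.100.42"))     # masks to 198.51.0.0
```

Masking two bytes still supports country- or region-level reporting, but the stored address can no longer identify an individual patron’s device.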
It can be argued that the reliance on GA by so many library websites stems from a lack of funding to support in-house web analytics. Libraries were greatly underfunded prior to the pandemic and currently suffer from austerity budgets. Web librarians and server administrators rely on third-party tools, be it a proprietary content management system such as LibGuides, an open source platform such as WordPress, or a freely available tool such as GA. Even open source software such as Matomo comes with the price of installing, maintaining, and customizing the software for a particular server setup. This may be problematic for institutions without funding for skilled labor, making GA the preferred platform.
Similar to web analytics, social media platforms such as Facebook, Instagram, Twitter, and many others allow users to sign up for free, and their activities generate statistics. These analytics are equally valuable to libraries that rely on social media to connect with patrons. However, the patterns in users’ data, behaviors, and actions on these sites can be monetized by the platforms for customized advertisements and marketing. This is a massive amount of data that users willingly produce for free. User interactions such as likes, friend networks, and other related metadata are also subject to being collected and sold.
Web analytics data collected from hundreds to thousands of users can be used to build predictive models of user behavior. This is a powerful aspect of web and social platform analytics that benefits search engines and social media services. This breadth of data collection can also uphold structures of social inequality through algorithms built on aggregated user data. The intentional and unintentional biases of developers are inherited by these systems. This is evident in the short life of Tay, a chatbot developed by Microsoft that “learned” to be racist from social media (Neff and Nagy, 2016), as well as in the racial and gender discrimination present in face recognition software (Buolamwini and Gebru, 2018).
How can libraries then resist the collection, classification and selling of user data? One concrete step is to end the use of Google Analytics and switch to open-source software like Matomo, or alternatively, stop collecting user data altogether. Libraries can help protect patron privacy by determining why web analytics information should be collected in the first place, negotiating contracts with library vendors, educating patrons and using institutional and organizational processes to protect user privacy.
When asking why they collect web analytics, libraries should remember that some rely on web interaction counts to justify grants and budgets in annual reports. Web analytics can inform design and interface decisions for a library website; however, it captures only a small facet of users, who are reduced to web statistics. More useful data may come from usability task testing, a process in which librarians, developers, or designers collect direct feedback from users as they accomplish tasks on the library website. Metadata and web analytics paint only a partial picture of web visitors; usability testing provides a more detailed account from the user’s own perspective, rather than data inferred from their interactions. Reducing users to web statistics can skew search algorithms and risks incorporating bias into the analysis of web data. Usability testing can open new avenues, incorporating accessibility, compassion, universality, and diversity into design.
Libraries and library consortia can negotiate vendor contracts to protect their users’ privacy when considering the analytics information collected by online services such as databases and integrated library systems. Stroshane of the North Dakota State Library outlines vendor negotiations and provides a checklist on the American Libraries Choose Privacy Blog (2017). One piece of his advice is to express concern to vendors about potential patron privacy violations arising from third-party web analytics platforms. Librarians should ask vendors whether this data collection can be barred when developing their contracts.
Additionally, librarians can inform their patrons about the practices of web analytics and algorithms. The San Jose Public Library makes its vendor privacy agreements openly available so patrons can see how their data is used (https://www.sjpl.org/vendor-privacy-policies). Providing this information is one step toward educating library patrons. Libraries are inherently educational institutions, and in today’s social information age, strong digital literacy skills are as essential as reading literacy skills. It’s important to note that smartphones provide a wealth of data, including where users connect, what kind of device they are using at what time of day, and, in some cases, even user health through wearable technology. This makes lower-income users particularly vulnerable to data tracking; the Pew Research Center reports that in 2019, 26 percent of adults living in households earning less than $30,000 a year were “smartphone-dependent” internet users (Anderson and Kumar, 2019). This data can be sold to the state, as ICE has used such data against vulnerable library populations such as immigrants (Lamdan, 2019).
If libraries are leery of these practices, they can resist the use of analytics not only in their own practices but within their institutions. The use of digital surveillance by educational institutions has expanded with the recent COVID-19 pandemic and the growth of remote learning and digital resources. There is a reliance on software that records student behavior, which is then processed to predict student success. For example, Virginia Commonwealth University, prior to the pandemic, conducted a pilot with Ram Attend, tracking student attendance when students connected to the institution’s WiFi (King, 2019). Not only is this practice problematic for user privacy, as it tracks students’ geolocation and associated metadata, it also adds another metric to a larger algorithmic framework that may or may not predict student success. One way to mitigate these issues is through institutional governance. The work of CUNY librarian and University Faculty Senate member Roxanne Shirazi and others has highlighted the monetization and use of learning data and, in response, pushed through faculty governance a resolution to protect student data from such practices (2020).
The use of web analytics in libraries is helpful for understanding user patterns and preferences, making virtual services more accessible, but it should not come at the cost of giving away user data freely. When used inappropriately, data can be leveraged for nefarious purposes such as upholding structures of white supremacy, invading user privacy, and dehumanizing patrons into statistics. As a solution, libraries should question why they are collecting analytics data, how they can inform users of the purposes of vendors’ data collection, and how best to protect the privacy of their patrons.
Academic Affairs (2020). Affirming the Privacy of Learning Data at CUNY. CUNY UFS. Retrieved from https://www1.cuny.edu/sites/cunyufs/2020/09/10/affirming-the-privacy-of-learning-data-at-cuny.
Anderson, M., & Kumar, M. (2019, May 7). Digital divide persists even as lower-income Americans make gains in tech adoption. Pew Research Center. Retrieved from https://www.pewresearch.org/fact-tank/2019/05/07/digital-divide-persists-even-as-lower-income-americans-make-gains-in-tech-adoption/
Buolamwini, J., & Gebru, T. (2018). Gender shades: Intersectional accuracy disparities in commercial gender classification. In Conference on Fairness, Accountability and Transparency (pp. 77–91).
King, B. (2019, Nov. 14). VCU to begin pilot program tracking students’ attendance through Wi-Fi. WTVR. Retrieved from https://www.wtvr.com/2019/11/14/vcu-to-begin-pilot-program-tracking-students-attendance-through-wi-fi/.
Lamdan, S. (2019). Librarianship at the crossroads of ICE surveillance. In the Library with the Lead Pipe. Retrieved from http://www.inthelibrarywiththeleadpipe.org/2019/ice-surveillance/.
Merek, K. (2011). Web Analytics Overview. ALA TechSource, 5. Retrieved from https://journals.ala.org/index.php/ltr/article/view/4233/4827.
Neff, G., & Nagy, P. (2016). Automation, algorithms, and politics | Talking to bots: Symbiotic agency and the case of Tay. International Journal of Communication, 10, 17.
Stroshane, E. (2017). Negotiating contracts with vendors for privacy. American Library Association Choose Privacy Everyday. Retrieved from https://chooseprivacyeveryday.org/negotiating-contracts-for-privacy/.