Privacy in the Context of Content Platforms and Discovery Tools

These are the presentation notes for the Privacy in the Context of Content Platforms and Discovery Tools presentation during the NISO Information Freedom, Ethics, and Integrity virtual conference on Wednesday, April 18, 2018.

Presentation Slides

Privacy in the Context of Content Platforms & Discovery Tools from National Information Standards Organization (NISO)

Talk Transcript

First Principles

Whether you work in a library, provide services to libraries as part of an association, or work for a company that offers products to libraries, I think it is important to be grounded in the ethos of what privacy means for libraries. Bill Marden walked through this in his opening keynote, and I wanted to make the thread explicit from the library ethos to how those principles are manifested. In the United States, these grounding principles come from the American Library Association.

The Third Statement of the American Library Association Code of Ethics says “We protect each library user's right to privacy and confidentiality with respect to information sought or received and resources consulted, borrowed, acquired or transmitted.” Building on this code of ethics, ALA offers a Library Bill of Rights. Much like the Bill of Rights to the U.S. Constitution, there isn’t an explicit right of privacy in the Library Bill of Rights.

Rather, it is woven into the interpretation of the enumerated rights. ALA publishes an Interpretation of the Library Bill of Rights with regard to privacy. In the introduction to that document, it says that “Privacy is essential to the exercise of free speech, free thought, and free association.”

In the section labeled “Rights of Library Users”, it says that “the Library Bill of Rights affirms the ethical imperative to provide unrestricted access to information and to guard against impediments to open inquiry.” And “when users recognize or fear that their privacy or confidentiality is compromised, true freedom of inquiry no longer exists.”

It goes on to describe the Responsibility of Libraries: “The library profession has a long-standing commitment to an ethic of facilitating, not monitoring, access to information.” Which is all well and good, but I would be negligent if I didn’t point out that protecting patron privacy is not a new struggle for libraries. After all, we used to use circulation systems that recorded patron activity like this:

Slide 6 from Privacy in the Context of Content Platforms & Discovery Tools

cards in the back of books with names scribbled on them or embossed from raised lettering library cards. Still, we discussed the role of privacy in our services, and expectations changed.

As library systems became more sophisticated, we labeled items with barcodes and used a central digital system to store the relationship between the patron and the item. When we no longer needed to keep track of that association, we would delete it from our database. Not perfect, but still better than letting anyone see who was interested in a book by walking up to a shelf and looking in the book pocket.

We have become a lot more sophisticated. Rather than books and journals on a shelf, we now have networks and servers that distribute information to laptops and mobile phones. Rather than having to come to a physical space to be where the information is, information comes to the patron. I think we can all agree that this is certainly better for information access. After all, the first statement in the ALA Code of Ethics says, in part: “We provide the highest level of service to all library users through appropriate and usefully organized resources.”

But are we really better off? Rather than having the data about what patrons are interested in stored in thousands of circulation systems, we now have consolidated records of patron interest – from across all of our libraries – into a handful of service providers. What do service providers know about our users? We used to share our interest in something “by word of mouth”; but now by-word-of-mouth means a whole host of social media services. Many libraries want to be helpful to users, so they stamp their pages with share and like and tweet and pin buttons. What do these social media companies know about our users? In an age of increasing numerically driven accountability, we rely on outsourced analytics providers to tell us how our services are being used. What does our analytics provider know about our users?

I hope everyone is asking these questions. I really hope everyone has thoughtful answers to these questions. And a thoughtful answer doesn’t mean not using centralized content providers and discovery layers, not using Facebook “Like” buttons and Pinterest “Pin” buttons, not using Google Analytics or Clicky or Chartbeat. In my mind, a thoughtful answer balances the benefits to patrons and to the library with the risks inherent in our more sophisticated world. And to have a thoughtful answer, we need a curious, creative, and open mind about the technology we use and when we choose to use it.

What Do These Services Collect

Out of the necessity of time, I’m going to narrow the scope of what we are talking about with respect to technologies and services. For instance, I’m going to assume that our patrons are connected to the network and using their own devices. I’m also going to make some assumptions about our libraries -- that we can’t host all of the content on our own servers, that we rely on service providers, and that our service providers rely on other service providers.

Who Am I?

When we think of privacy, we usually start with PII: personally identifiable information. What is it that we have or use that links our being with the activity we are conducting? In the example of the back-of-the-book date due slip, that PII was our name scrawled on a line or embossed on a card. Patrons connected to a network and using their own devices have an IP address. You can think of an IP address like you think about the parcel identifier for the land where your home is located. It is likely an obscure set of numbers like 003-11-038. To make it easier for humans to use, we give street addresses such as 1456 Edgewood Drive, Santa Clara, California.

Or: Mark Zuckerberg’s house. What is important, though, is that an IP address pinpoints a computer on a network. A server, after all, has to know how to get the information back to the user. But that IP address is also a form of identity for the user. Now, the more sophisticated members of the audience are thinking that IP addresses are shared -- that on my home router I have my laptop, my spouse’s laptop, my daughter’s smartphone and so forth. They are all sharing that one IP address from my internet service provider.

Well, there are other ways to determine who you are. Your browser is leaking all sorts of information about your machine -- what plugins you have installed, the list of fonts on your system, its screen resolution and color depth -- and about a dozen other attributes. This is the Panopticlick service from the Electronic Frontier Foundation. This service tests your browser configuration against many of the known fingerprinting techniques to let you know how unique your browser is.

Combine the results of the techniques together, and a particular machine and a particular browser has a reliable fingerprint -- no matter what the IP address is. That’s right. Are you thinking your library’s proxy server is protecting your patrons by hiding their IP address? Well, sorry, think again. So unless you are really aggressive in suppressing all of these fingerprinting techniques, it is fairly straightforward to identify you.
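To make the idea of combining attributes concrete, here is a minimal sketch, not how Panopticlick or any real tracker is implemented. The attribute names, their values, and the two IP addresses are all made up for illustration; the point is only that the fingerprint is computed without reference to the IP address, so it stays the same when the address changes.

```python
import hashlib
import json


def browser_fingerprint(attributes: dict) -> str:
    """Hash a set of browser attributes into one stable identifier."""
    # Sort keys so the same attributes always serialize the same way.
    canonical = json.dumps(attributes, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical attribute values for one browser on one machine.
browser = {
    "plugins": ["PDF Viewer", "Flash"],
    "fonts": ["Arial", "Garamond", "Helvetica"],
    "screen": "1440x900",
    "color_depth": 24,
}

# The same browser seen from two different IP addresses
# (for example, from home and then through a library proxy).
visit_home = {"ip": "203.0.113.7", "fingerprint": browser_fingerprint(browser)}
visit_proxy = {"ip": "198.51.100.20", "fingerprint": browser_fingerprint(browser)}

# The IP addresses differ, but the fingerprint ties the visits together.
same_browser = visit_home["fingerprint"] == visit_proxy["fingerprint"]
```

In this sketch `same_browser` is true even though the two visits arrive from different addresses, which is why hiding the IP address alone does not hide the patron.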

Who Knows?

So let’s go the next level deep -- who knows about me? As we’re becoming increasingly aware, it isn’t just the web service of the page you are viewing. It is also the embedded content on that page -- like images, videos, advertisements, and social media buttons. Then it is all of the invisible agents used by those services -- web analytics, advertising bureaus, software performance tools, and so forth. Each one of those touch points is contacted when the page is loaded. I tried some samples last week in preparing for this talk. I’m not going to name names, but I went from one library to one commercial content service provider and one discovery layer provider. The library uses Google Analytics and the “add-this” social media toolkit. The content provider contacted two Google services – Google Analytics and Google Tag Manager – and two other web performance testing tools: visualwebsiteoptimizer.com and foresee.com. The discovery layer provider uses Google and “add-this”. Now at least Google and the “addthis.com” service can triangulate my activity across the three independent sites -- to say nothing of any non-library sites I visit that might also use the same services.
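A rough way to see who can triangulate is to list the third-party hosts each site contacts and look for hosts that show up on more than one site. This is only a sketch: the site labels are placeholders, the exact hostnames for the Google services are my assumption, and I am assuming the discovery layer's "Google" means Google Analytics.

```python
from collections import defaultdict

# Placeholder site labels; third-party hosts follow the services named above.
# Hostnames for the Google services are assumed, not taken from the talk.
observed = {
    "library": {"google-analytics.com", "addthis.com"},
    "content-provider": {
        "google-analytics.com",
        "googletagmanager.com",
        "visualwebsiteoptimizer.com",
        "foresee.com",
    },
    "discovery-layer": {"google-analytics.com", "addthis.com"},
}


def cross_site_observers(observed_hosts):
    """Return third-party hosts contacted from more than one site."""
    sites_by_host = defaultdict(set)
    for site, hosts in observed_hosts.items():
        for host in hosts:
            sites_by_host[host].add(site)
    # A host seen on two or more sites can link activity across them.
    return {host: sites for host, sites in sites_by_host.items() if len(sites) > 1}


observers = cross_site_observers(observed)
```

Grouping by host understates the overlap, since several hosts can belong to one company; a fuller version would group hosts by their owner.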

What Do They Know About Me?

So what do they know about me? Well, whatever is on the web page, for a start. Third-party JavaScript running in the browser has access to all of the information on that page. It can scan the contents of the page to get its title and keywords, then send those back to a third-party server. This is technically known as Cross-Site Script Inclusion, or XSSI for short. So if one of these JavaScript-driven buttons is on your page, the owner of that JavaScript button can know the title of the journal article you were looking at as well as the set of search terms you used to get there.
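The real thing happens in JavaScript inside the browser, but the amount of work involved is small enough to sketch in a few lines. The following stands in for what an embedded script could read; the page markup, the URL, and the `q` query parameter are hypothetical examples, not taken from any actual provider.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qs, urlparse


class PageReader(HTMLParser):
    """Collect the title and keywords meta tag from a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.keywords = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attributes.get("name") == "keywords":
            self.keywords = attributes.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def what_a_script_can_see(page_url: str, page_html: str) -> dict:
    """Gather what an embedded script could send to its own server."""
    reader = PageReader()
    reader.feed(page_html)
    # Assumes the search terms travel in a query parameter named "q".
    query = parse_qs(urlparse(page_url).query)
    return {
        "title": reader.title.strip(),
        "keywords": reader.keywords,
        "search_terms": query.get("q", []),
    }


# Hypothetical article page reached from a search.
url = "https://discovery.example.org/record/123?q=bankruptcy+law"
html = (
    "<html><head><title>Filing for Bankruptcy: A Guide</title>"
    '<meta name="keywords" content="bankruptcy, debt"></head>'
    "<body></body></html>"
)
seen = what_a_script_can_see(url, html)
```

Here `seen` holds the article title, its keywords, and the search terms, which is everything needed to describe what the patron was reading and how they found it.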

So as you think about the scope of the services offered to library patrons, it isn’t just about the service providers. The world wide web is called a web for a reason -- it is very easy to reach out and touch far away points. What is happening on the web pages we offer to library patrons often does reach out and touch parts of the internet that we can’t anticipate or don’t intend. So that is something we should have a thoughtful answer to. Unfortunately, if your thoughtfulness ends with: “well, they say they anonymize the data they collect, so I’m okay” -- you aren’t done thinking yet.

Issues with Anonymization of Data

It is hard to do a good job of anonymizing data. Really hard. One of the first examples I remember of this is this story from 2006. This is the same example that Emily brought up in her talk. AOL published a list of 20 million web searches by their users. They didn’t include the screen name of the user in the data; they just replaced it with a number. Reporters from the New York Times pinpointed one stream of searches down to a single individual, and then called up that person.

In another case, researchers at MIT found that the dates and locations of four purchases were enough to identify 90 percent of people in a data set of 1.1 million users and three months of credit card transactions. I think this is hard for our minds to fully grasp, and so stories like this surprise us. Yet we think our data is different. Indefinitely recording transaction data like what someone searches or what someone buys increases the chances that a single person could be identified.
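A toy version of that effect fits in a few lines. This is a simplification of my own and not the method the researchers used: it asks whether any small set of a person's purchases matches that person and nobody else. The three users and their purchases are invented.

```python
from itertools import combinations


def fraction_identifiable(transactions: dict, points: int) -> float:
    """Share of users singled out by some set of `points` known purchases."""
    identifiable = 0
    for user, purchases in transactions.items():
        for known in combinations(sorted(purchases), points):
            # Everyone whose record contains all of the known purchases.
            matches = [u for u, p in transactions.items() if set(known) <= p]
            if matches == [user]:
                identifiable += 1
                break
    return identifiable / len(transactions)


# Invented (date, place) purchases with no names attached.
transactions = {
    "user-1": {("2018-03-01", "bakery"), ("2018-03-02", "bookstore"), ("2018-03-05", "cafe")},
    "user-2": {("2018-03-01", "bakery"), ("2018-03-03", "pharmacy"), ("2018-03-05", "cafe")},
    "user-3": {("2018-03-01", "bakery"), ("2018-03-02", "bookstore"), ("2018-03-04", "cinema")},
}

with_one_point = fraction_identifiable(transactions, points=1)
with_two_points = fraction_identifiable(transactions, points=2)
```

In this made-up data, knowing one purchase singles out two of the three users, and knowing two purchases singles out all of them. Every extra recorded transaction gives an outside observer another chance to find a combination that belongs to only one person.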

Information in libraries is not different. Emily talked about de-identification of data in the previous talk, including the problems with hashing and salts. Just about one year ago, Becky Yoose wrote an article for the ALA Office of Intellectual Freedom blog on de-identification and re-identification of patron data. She notes that data about patron activity is scattered around the libraries now -- some of it in integrated library systems, some of it coming from service providers, and some in paper-based tallies on a reference desk. If this raw data were brought together, it would form a really powerful tool for analyzing usage trends and service opportunities. As she notes, it is also a high risk for patron privacy violations.
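One commonly described weakness of hashed identifiers can be sketched as follows. This is an illustration under my own assumptions, not a summary of either talk or article: the barcode format (a fixed "2100" prefix plus four digits) is invented, and the hash is deliberately unsalted.

```python
import hashlib
from typing import Optional


def pseudonymize(barcode: str) -> str:
    """Replace a patron barcode with its unsalted SHA-256 digest."""
    return hashlib.sha256(barcode.encode("utf-8")).hexdigest()


def reidentify(digest: str, prefix: str = "2100", digits: int = 4) -> Optional[str]:
    """Recover a barcode by hashing every possible value and comparing."""
    for n in range(10 ** digits):
        candidate = f"{prefix}{n:0{digits}d}"
        if pseudonymize(candidate) == digest:
            return candidate
    return None


# A "de-identified" value as it might appear in a released data set.
released = pseudonymize("21004711")

# Anyone who knows the barcode format can walk the whole space.
recovered = reidentify(released)
```

Because the space of possible barcodes is small, `recovered` comes back as the original barcode after at most ten thousand guesses. A secret salt raises the cost only for as long as the salt itself stays secret, and it does nothing about re-identification from the activity attached to each hashed value.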

Which brings us to GDPR, the European Union’s General Data Protection Regulation. It is a new set of rules that is coming into effect at the end of next month to provide a set of standardized data protection laws across all of the EU member countries. The goal is to make it easier for EU citizens to understand how their data is being used and raise complaints, even if they are not in the country where it is located. The practical effect is, though, that it is becoming a baseline standard for privacy practices in at least the western democracies. Part of the hubbub around Facebook’s response to the Cambridge Analytica data misuse story was whether Facebook was going to provide the consumer capabilities and protections to people outside the European Union.

In late breaking news, yesterday Facebook announced that they were turning on their GDPR privacy tools for everyone worldwide. I imagine they determined that it was going to be too difficult and costly to segregate EU citizens from their streams of global data. So if it was going to be too difficult for Facebook and you think it will be easy for you -- well, then maybe you might want to rethink that. GDPR itself is pretty dense -- 88 pages of definitions and regulations.

The big pieces are these, though: a user needs to give meaningful consent based on a request in an “intelligible and easily accessible form”; they must be able to understand how the data was used and be able to get their data back in an electronic form; they have the right to withdraw consent and have their data removed from further processing; the data collected should be only what is required to offer the service; and there are requirements for notifications for data breaches. The regulations also speak of “privacy by design” -- privacy not as an add-on, but designing systems from the beginning to use a minimum amount of personal data as a protection mechanism. At the end of the day, Skott is going to be talking more about this designing for privacy. But at this point I’ll add that even though GDPR is usually couched in legal or information technology terms, it isn’t a legal or technology problem. It is an organization problem; specifically, how organizations think of their users and manipulate their personal information to provide services. Think back to the Policy-Standards-Guidelines-Procedures pyramid in Bill’s opening keynote presentation.

Now, I am not a lawyer, but I’ve kept up on what is happening with GDPR -- particularly as part of my day job with the FOLIO project. One of the best interpretations that I’ve seen is this one from Thomas Baekdal on what GDPR means to publishers, and I think it applies just as much to libraries and library service providers. I’ll provide a link to Thomas’ article in the presentation notes at the end of this presentation. He goes into an interesting amount of depth about what big companies like Google and Facebook are doing about GDPR as well as the general trends in the internet industry. In his article he says we need to think about four audiences: one-time users, limited interaction users, full interaction users, and cancelled users.

One-time users have given no consent for their personal information to be collected, so Thomas says that you cannot load any third-party tools -- social widgets, automatically embedded media and so forth. It is not enough that those third party tools have privacy policies and limit their own data collection. As the publisher of the data, or what GDPR calls a “Data Controller”, you cannot passively allow other sites to collect data on your users. And for your own statistics gathering purposes, you can only do “non-identifying analytics”. I think we’re going to need to figure out what this means for libraries. For instance, a table that shows the most downloaded articles from the institutional repositories seems okay. A map that shows where anonymous users are downloading articles from your institutional repository? That seems problematic under GDPR.

In the “limited interaction” category, Thomas puts users who have signed up for a newsletter. Here it is important to keep the minimization principle in mind. In our context these might be people who have signed up for an account on our institutional repository in order to save a list of repository objects. Even though the user has permitted their email address to be collected, that doesn’t imply consent to start loading third party web trackers on web pages that these limited interaction users visit.
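As a way of thinking through what this could mean for a web page, here is a small sketch of gating page features on audience and consent. It is my reading of the categories described here, not a legal rule or anything from Thomas' article; the feature names and the `third_party_consent` flag are invented, and the cancelled-user category is left out.

```python
from enum import Enum


class Audience(Enum):
    ONE_TIME = "one-time"
    LIMITED = "limited interaction"
    FULL = "full interaction"


def page_features(audience: Audience, third_party_consent: bool = False) -> set:
    """Decide which (hypothetical) page features may load for a visitor."""
    # Everyone gets counted only in non-identifying statistics.
    features = {"non_identifying_analytics"}
    if audience is Audience.ONE_TIME:
        # No consent has been given, so nothing from third parties loads.
        return features
    # Signed-up users get the first-party service they asked for.
    features.add("account_features")
    # Giving an email address is not consent to third-party trackers;
    # those load only when the user has agreed to them explicitly.
    if third_party_consent:
        features.add("third_party_widgets")
    return features


anonymous_visitor = page_features(Audience.ONE_TIME)
newsletter_subscriber = page_features(Audience.LIMITED)
consenting_subscriber = page_features(Audience.LIMITED, third_party_consent=True)
```

The design choice worth noticing is the default: third-party widgets are off unless consent is recorded, rather than on unless someone objects.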

Then there are the full interaction users. In the context of libraries, I think these are the people we would traditionally call patrons. In academic libraries, these are the students, faculty and staff of our institution. In public libraries, it is the population of our town. We’re going to have more information about these users, and we would likely be able to learn more about them through analysis of their usage patterns. It is here that we have the greatest responsibility for transparency, data control, and all of the other privacy-by-design aspects of GDPR. We need to be able to give a full accounting to the user of the information we have stored about them. And not only through our own services but also those of any service provider we might use.

There is yet one more category of user -- those that want their information deleted. For those of you who are in the European Union or who offer services to library patrons in the EU, you are probably already aware of this requirement and are hopefully well on your way to implementing procedures and technical processes to handle it. For what is probably the vast majority of people on this call today coming from North America, you may not have thought about what it would take to erase the data of EU citizens. Think back up to that case where someone signed up for your institutional repository to save a list of repository objects. Or a visiting scholar who had a full patron account on your system and is now returning to their home country. Are you now in the same boat with Facebook in having to think about the application of GDPR rights and requirements for any user of your services? To the best of my knowledge, these aspects of GDPR haven't been tested in courts so we don’t know the real world effects yet. And this is coming from the perspective of the user of the resource. After the lunch break I’m eager to listen to Virginia and Cynthia talk about the right to have information deleted when it is an actual part of our digital collections.

To bring us back around full-circle, I think there is a nice alignment of the policy goals of GDPR and the ALA Code of Ethics. Both are seeking to make user privacy a lead requirement of our systems and services. And thinking about GDPR -- whether it affects us directly or not -- takes us pretty far down the path of that Third Statement of the American Library Association Code of Ethics: protecting each library user's right to privacy and confidentiality with respect to information sought or received and resources consulted, borrowed, acquired or transmitted.