A Philosophy, of Sorts, for the Collection and Use of Person-based Data

Paul Neto
November 17, 2018

This is the second in a series of posts about building a blockchain protocol for market research. You can read the first post here.

In this post we’re going to lay out a set of views on the collection and use of person-based data that inform our decisions around product design and market strategy.

Before we begin, though, let’s quickly define what we mean by “person-based data”. Person-based data is any information about an individual, or natural person, pertaining to their characteristics, opinions, preferences, or behaviors, whether declared, observed, or inferred. Declared data is typically collected via a survey instrument of some sort. Observed data is usually collected passively by software or physical devices. Inferred data is derived from declared or observed data, for example through statistical modeling. For Measure, in particular, person-based data is only ever provided under direct consent.

In general, society benefits from increased access to accurate data about individuals and populations.

Person-based data is essential for the running of our modern economy and society. People benefit greatly from decisions made by individuals, organizations, and computer systems on the basis of their data. From social, policy, and marketing research to medicine, finance, and insurance — data about individuals and populations drives everything from product development to public policy. In recent years, this has become even more apparent as advances in AI have meant that, in some sense, data itself can become software and directly benefit society.

As such, we should seek to maximize access to accurate person-based data.

However, an individual’s right to privacy and sovereignty over their data is paramount.

We take this as axiomatic. While we want legislators, town planners, and product designers to have access to all of the data they can put to use, it cannot come at the cost of individual privacy or data sovereignty.

Today, companies control most person-based data and the result is socially inefficient — consumers have less privacy than they would like, they are rarely and poorly compensated for the use of their data, and society has less data than is optimal.

It is frequently unclear to consumers when and where data about them is being captured or used. When it is, they are compensated poorly or not at all, and they have to trust the companies using their data to keep it private. Companies, however, only care about privacy to the extent that it has a material effect on their business.

In their paper, Nonrivalry and the Economics of Data, Jones and Tonetti observe that data is nonrival. Unlike most economic goods, it is not depleted through use. Because of this, there are large social gains to sharing data. A single dataset can be used simultaneously by any number of individuals, companies, or AI systems without reducing the amount of data available to anyone else. Companies, however, are generally incentivized to hoard data they own. For many, proprietary data is the basis of their business model and a key competitive advantage. As a consequence, data that could be used productively by many at a low social cost is used only by one and society is made poorer.

Putting consumers in control of their data allows them to appropriately weigh their privacy concerns against potential economic gains and is likely to result in a near-optimal data economy.

Moving custody and control of data from companies to individuals solves an allocation problem. Consumers are inherently incentivized to weigh their privacy concerns regarding particular types of data against the economic gains that come from sharing that data with interested parties. (Note that when we refer here to “economic gains” our focus is primarily on financial rewards, but the reality is much broader. In the context of public policy and social change, for example, it’s not difficult to imagine that consumers would feel sufficiently compensated if it were clear that they are advancing an agenda they care about.) On average, the more sensitive the data, the higher the price demanded by consumers, and the higher the expectations around privacy. (A corollary is that the easier we can make it to share data while maintaining privacy and anonymity, the more likely it is that consumers will share sensitive data.)

Further, unlike companies, consumers have no fear of creative destruction and will willingly sell their data to multiple buyers. Overall, this should greatly increase the availability of data.

Instituting a minimal set of pricing protections will avoid race-to-the-bottom pricing for data and result in a larger and more diverse population of participating consumers.

In most quantitative applications, for person-based data to be useful (and, by definition, accurate) it must be representative of some population. An optimal data economy therefore requires participation from a diverse population of consumers. While putting consumers in control of their data is sufficient to properly balance privacy concerns and increase overall availability, if pricing is left to the open market, then even under ideal conditions (i.e., a transparent marketplace free of rent-seeking middlemen) it is likely to lead to a race to the bottom and a demographically skewed consumer population. This is not a failure of market economics. Instead, it’s a consequence of the fact that the way we describe populations is incomplete.

When we are looking for data representative of the general population of the United States, for example, what we typically get is data from a group of consumers that matches the population across several core demographic attributes, such as age, gender, and income. This is good as far as it goes but, clearly, there is a lot of room for bias to creep in. The bias that we see in any particular dataset is a function of the way in which that data was collected. If the consumers who provided the data were only from California, even if they have age, gender, and income distributions equivalent to the country as a whole, they are likely to diverge in substantive ways when it comes to opinions, habits, and non-core demographics. This is, obviously, a contrived example that most data buyers would avoid. However, in market research at least, an enormous volume of data is collected by buyers who know full well how little respondents are being paid for it.
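To make the point concrete, here is a minimal sketch in Python of how a sample can satisfy quotas on core demographics while remaining badly skewed on an attribute that was never measured. The numbers and attribute names are invented for illustration; this is not a description of any real tooling.

```python
# Hypothetical population and sample shares, invented for illustration only.
population = {
    "age_18_34": 0.30, "age_35_54": 0.34, "age_55_plus": 0.36,  # quota'd attributes
    "region_west": 0.24,                                        # never measured
}
sample = {
    "age_18_34": 0.31, "age_35_54": 0.33, "age_55_plus": 0.36,  # matches closely
    "region_west": 0.68,                                        # heavily skewed
}

QUOTAED_ATTRIBUTES = {"age_18_34", "age_35_54", "age_55_plus"}
TOLERANCE = 0.05  # maximum acceptable deviation from the population share

def passes_quotas(sample, population, attributes, tol=TOLERANCE):
    """True if every quota'd attribute is within tolerance of the population share."""
    return all(abs(sample[a] - population[a]) <= tol for a in attributes)

print(passes_quotas(sample, population, QUOTAED_ATTRIBUTES))              # True: quotas satisfied
print(round(abs(sample["region_west"] - population["region_west"]), 2))   # 0.44: hidden skew
```

By the quota check alone, this sample looks perfectly representative; the skew only becomes visible on the attribute we never thought to describe.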

Certainly, there is no simple linear relationship between respondent compensation and levels or quality of participation. Paying 20% more for data does not mean that there will be 20% more willing participants, nor that the respondent pool will be 20% more diverse or 20% more representative. However, we can make reasonable assumptions at the extremes. We know that today we can incentivize roughly enough consumers to satisfy our basic data collection requirements for the equivalent of about $0.40 for a 15-minute survey. (This is a typical payout from mainstream survey panels as of November 2018.) That works out to an effective rate of $1.60 per hour, well below the US federal minimum wage, and we know intuitively that it is a completely inadequate level of compensation for that amount of effort. However, if we look narrowly through the lens of our population descriptions (age, gender, and income, in the case of the general population), then we can proceed under the delusion that we are, in fact, getting what we need.

This is a market failure, and an inevitable one due to the impracticality of exhaustively describing populations. As such, we think the solution is to find a way to enforce pricing controls. At minimum, this would be some sort of price floor, but even better would be a more comprehensive pricing model that is fair, transparent, and easily understood. If, through pricing controls, we are able to set conditions such that more consumers are willing to participate, then, even with incomplete population descriptions, we are much more likely to end up with representative data.
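As a rough illustration of what a minimal pricing control could look like, here is a sketch of a listing-time check that rejects any task whose payout implies an hourly rate below a floor. The floor value, names, and function are hypothetical; a real pricing model would be more comprehensive than a single rate.

```python
FLOOR_RATE_PER_HOUR = 6.00  # hypothetical floor in USD, chosen only for illustration

def meets_price_floor(payout_usd: float, estimated_minutes: float) -> bool:
    """Reject any task whose payout implies an hourly rate below the floor."""
    implied_hourly_rate = payout_usd / (estimated_minutes / 60.0)
    return implied_hourly_rate >= FLOOR_RATE_PER_HOUR

# A $0.40 payout for a 15-minute survey implies $1.60/hour and would be rejected;
# a $2.00 payout for the same survey implies $8.00/hour and would be accepted.
print(meets_price_floor(0.40, 15))  # False
print(meets_price_floor(2.00, 15))  # True
```

Even a crude rule like this turns the qualitative intuition above into something a marketplace could enforce automatically; the harder design work is in choosing and governing the floor itself.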

As for where a price floor should be set, it’s difficult to know for sure, but we can’t let the perfect be the enemy of the good. We know intuitively that there is a meaningful difference in what most consumers are willing to do for tens of cents worth of proprietary survey points versus what they are willing to do for even a small number of actual dollars. And so we should start there.

As for the question of who pays for this increased respondent compensation, in the current market environment there is no good answer, and the question is largely moot anyway because there is no obvious way to introduce or enforce pricing controls. In a blockchain-based marketplace, however, it is entirely possible to make and enforce collective agreements of this kind. Further, cost savings from automation and a reduction in intermediaries can offset the increased payments to respondents.

In the next post we’ll start to design our protocol. In particular, we’re going to identify who the stakeholders are, figure out how to balance their interests, and then try to derive an objective function for the protocol.

Comments and questions are always welcome. Drop them below or find us on Twitter at @johnm or @measureprotocol.

Republished from the original post on Medium.