How Big Data Is Failing Us
Big Data: The phrase conjures up images of nerdy techno-alchemists finessing valuable information about the universe with complicated and expensive computers. It is that, yes—but it's also a necessary progression in technological evolution, a term for any complex analysis of large databases that would have been impossible using older technology. The potential benefits of its use are enormous, and the uses vary widely as the information collected about us grows exponentially and the ability to process it grows in tandem.
But like any powerful tool, Big Data has a dark side.
While the messages of data evangelists like Oakland A's General Manager Billy Beane or ESPN analyst Nate Silver—or, more specifically, the press coverage of their accomplishments—seem to portray data as an impartial arbiter of "truth," the fact remains that any set of data is biased, and so is any technique for studying it. Beyond statistical bias, the bleak fact is that institutional biases—discrimination, racism, sexism, classism, inequality, and their ilk—are also part of humanity's dataset.
As Big Data becomes more sensitive to picking up these things and better at predicting, it also becomes better at reinforcing outcomes based on factors of inequality. Like most tools in the hands of unscrupulous or uncaring researchers, Big Data can be as destructive as it can be productive. So, the question must be asked: Is Big Data an instrument of prejudice?
Answering may mean taking a long step back through history. As the capacity for analysis outstrips our ability to even apprehend it, let alone ensure its ethical or moral grounding, even institutions like the White House have noted that the use of Big Data "raises crucial questions about whether our legal, ethical, and social norms are sufficient to protect privacy and other values." The ethical structures simply aren't in place yet to guarantee fair use and privacy in data. The challenges facing society and Big Data parallel the Wild West era of medicine and scientific discovery in the early 20th century that led to several terrible human rights abuses. For poor and minority communities in the U.S., one of the biggest blights was the 40-year-long disaster known as the Tuskegee Syphilis Study.
The Tuskegee Study began in the midst of the Great Depression as an extension of studies on the advancement and treatment of syphilis. The study examined the effects of untreated syphilis in 400 poor black farmers for six to eight months under the guise of a last-chance treatment program. Ethical concerns, even in the 1930s medical frontier landscape, went mostly ignored in no small part because of the race and income of the study participants.
By the time the study was finally terminated in 1972, many members of the cohort of black farmers had been denied treatment, misled, or left with permanent or fatal complications. Whistleblowers within the government agencies that directly controlled the study, like Dr. Bill Jenkins, were ignored for years as the explicit research goal had shifted: to follow the men until their deaths. When the study finally ended after 40 long years, the carnage was unthinkable: only 74 members of the cohort still lived, almost half of the deceased had died of syphilis or its complications, at least 40 wives and sexual partners had contracted syphilis from the group, and several of their children had been born with congenital syphilis.
Though the Tuskegee Study was a publicly acknowledged program administered by the federal government and regularly cited in medical texts, the national reaction at its end resulted in sweeping changes in medical ethics, racial dynamics in research, and research conduct in general. The 1978 Belmont Report, the first real code of research ethics in the country, was a direct result of the backlash against the Tuskegee Study. In it, researchers finally realized that "conceptions of justice are relevant to research involving human subjects." Groundbreaking.
But what does the Tuskegee Study have to do with the price of tea? Just like those researchers in 1932, Big Data stands at the shallow end of an ocean of possibility. We now have the power to answer complex questions about things mundane and deep. But like those researchers, curiosity and potential profitability have much more momentum than the ethical considerations needed to keep subjects safe.
To Dr. Bill Jenkins, the epidemiologist who spoke out against the Tuskegee Study long ago, it is clear that ethics aren't often a primary concern for researchers. Jenkins said that "researchers don't even second-guess themselves; they just assume that the process is ethical." As in the Tuskegee Study, the future is bleak for those with the least agency over the data collected from them.
The invasiveness of data collection is easy to overlook because of how ubiquitous it is. Like those spinal taps in the 1930s, the collection of data is often sold under the guise of helpful and essential services. Everyone who uses almost any social media or commerce site has vital information, often including IP addresses (which can be used to identify individuals with extreme precision), collected at many different points during the day. While people with lower incomes and persons of color are probably not more likely to have data collected on them than other populations, they have fewer consumer protections and less advocacy and information guiding them through such complex decisions (and privacy forms). Outside of internet data, disadvantaged communities are readily taken advantage of in surveys, polls, and other data collection methods, often through some form of coercion.
Truly informed consent in this age of research is a lost concept. To Dr. Jenkins, this is intentional. "The fact of the matter is that the data is often owned by whoever collects it," he told me. "For the federal government there are some policies on informed consent but any motivated collector can get around them. People who do research are very smart. They will always figure out a way."
The conclusions and aims of Big Data can also prove harmful to disadvantaged populations. As the research designs of the Tuskegee Study warped along racial and class lines, so have many projects involving Big Data. Attorney General Eric Holder has recently spoken out about the conclusions reached in using data analysis for criminal sentencing. Noting that basing sentencing and bail decisions on socioeconomic factors and generalized traits could result in discrimination and could even worsen existing bias in sentencing, Holder stated that these initiatives "may exacerbate unwarranted and unjust disparities that are already far too common in our criminal justice system and in our society."
Unequal outcomes are popping up in other areas as well. In school admissions, moves to automate the process using Big Data have amplified the racial and gender biases already present in admissions data and standardized tests. Analysts have developed extensive and complicated algorithms to predict which students will fare well at universities, then use that prediction as an "objective" admissions metric. But that approach is equivalent to taking a calculus course to learn arithmetic.
We already know the two main factors that make a person most likely to succeed in college: race and income. Advanced metrics in this case are just more sophisticated ways to obfuscate decisions essentially based on the color of one's skin and how much money one's parents make. Big Data-assisted decisions to extend or deny credit or to offer employment based on complex metadata aren't fundamentally different from classical redlining or job discrimination. In fact, the perceived unbiasedness of Big Data makes these new forms of discrimination much more difficult to combat.
The final element of danger in Big Data is that of power. Who benefits from the data, and are individuals entitled to the gains made with data collected from them?
The Tuskegee Study was commonly blasted on the grounds of "beneficence," the idea that research should in some way benefit the studied population. In an age where profiteers, corporations, politicians and media drive data and discovery as much as scientists do, the social responsibility inherent to scientific uses of human data has been swallowed whole by the newfound responsibility of data analysis to benefit the user. While disadvantaged populations never had the upper hand in control of the data gleaned from them, the Big Data revolution increases the disparity and concentrates the vast predictive power of metadata into already-powerful hands. If the decisions made using Big Data already increase inequality, then the concentration of power it enables does so even more.
Big Data needs a Belmont Report. But given the stakes, communities can't sit idly by this time and document abuses until the disaster has already happened. The Belmont Report and the ethical revolution that followed the Tuskegee Study focused on beneficence, nonmaleficence, autonomy, and justice as four key principles for studying human subjects. While these four don't quite appropriately address the concerns with Big Data of today, they are a starting point.
Researchers Neil M. Richards and Jonathan H. King at Washington University in St. Louis have identified four additional principles—privacy, confidentiality, transparency, and identity—that are useful in creating an ethics structure for Big Data. According to Richards, the ethical responsibility often extends to the algorithms and outcomes themselves. "We can't outsource responsibility to an algorithm," Richards said. "We have to have social confidence in the kinds of outcomes it's going to produce. All decisions to entrust a decision to an algorithm are relying on flawed human decisions and human creations."
Big Data users should bear a burden not to harm the individuals from whom data is gained. They should be held, under threat of criminal charges, to strict standards of privacy, and datasets should be limited to truly non-identifiable data unless the subjects explicitly agree otherwise. Privacy statements and consent forms should be written in 8th-grade language and limited to a few pages. Above all, persons should be respected as persons and not as faceless points of data. According to Richards, all of these solutions "mean that all of us, particularly politicians, policymakers and judges are going to need to get their hands dirty" in the actual meanings and intent of data use.
It's not clear right now how to get to the point where any of those recommendations are followed in any measure. We aren't even in a place yet where most or all scientific and medical researchers abide by the Belmont Report's guidelines. What is clear is that such an effort will be a monumental undertaking among advocates, scientists, ethicists, legal systems and the government.
But the effort will be worthwhile as it will allow us all to move forward as a nation and to actually use Big Data to create a more inclusive and better society. The Big Data world is the new frontier. Its hazards are hostile to us all. But its conquest deserves the best of all mankind.
Vann R. Newkirk II is a data geek, fiction writer, sci-fi lover and professional curmudgeon. You can find him at @fivefifths on Twitter.
[Image by Jim Cooke]