Cities Turn to Software to Predict When Police Will Go Rogue
(Bloomberg) -- There are no easy fixes for the Minneapolis Police Department. State lawmakers tried and failed last month to come up with a reform plan after four officers were charged in the death of George Floyd, an unarmed Black man; the city council is proceeding with a proposal to dismantle the department altogether.
Last month Police Chief Medaria Arradondo outlined his own ideas for change, which included working with an obscure technology firm named Benchmark Analytics. The company takes the idea of predictive policing—which uses algorithms to forecast where crimes will occur or who will commit them—and turns it on its head, using a computer model to predict which officers are most likely to be involved in misconduct.
Benchmark’s system feeds data—citizen complaints, which cops hold second jobs, who’s been called to traumatic situations such as domestic abuse or suicide prevention—through a computer model that ranks each officer by perceived risk. In a recent demo conducted via video, Chief Executive Officer Ron Huberman showed off a series of slick dashboards that let departments compare cops and units within a department, as well as analyze officers on specific attributes, such as how proficient they are at de-escalating confrontations.
It’s not clear Minneapolis will ever use such dashboards; the city’s deal with Benchmark stalled after an initial plan to pay for it fell through. But even if Benchmark doesn’t move ahead in Minneapolis, it’s doing so elsewhere. Huberman says the company has signed up 70 law enforcement agencies over the last three years, all of which will be using the company’s technology within 12 months. Nashville, which began working with Benchmark in 2018, will officially launch its system, called First Sign, in late July.
Huberman, who spent almost a decade at the Chicago Police Department before moving into leadership roles in city government, doesn’t blame systemic policy failures or training for the violent police encounters that have sparked a nationwide reckoning. In his view, police violence is mostly the result of a small number of officers he describes as “malevolent actors” and “bad apples.” The main challenge is identifying them before anything goes wrong. “And I can tell you from the data,” he says, “that this is really a knowable thing.”
But Benchmark’s technology, based on a program by the University of Chicago’s Center for Data Science and Public Policy, faces significant challenges. Its success depends on the quality of data from police departments, which can vary widely. The company’s progress has been halting even in departments that have signed on, and it has struggled to follow through on its promises, according to people familiar with its technology.
Even if the kinks are worked out, there’s no consensus that what American policing needs most is a new technical tool. The last decade has seen waves of innovation, from elaborate surveillance systems to automated decision-making. But cities have begun to ban the police use of technologies like facial recognition and predictive policing over concerns about their efficacy and fairness. Meanwhile, the most prominent use of technology for police accountability—body cameras—has shown mixed results, in part because of uneven practices around when officers turn them on.
Andrew Ferguson, a law professor at American University and author of The Rise of Big Data Policing, says that during moments of controversy there’s always a spike in enthusiasm for new tools, but they don’t usually make much of a difference. “When you don’t trust people, you try to trust data,” he says. “There is a market for companies to sell the hope of police accountability, even if they can’t sell actual police accountability.”
Police departments have been developing methods to anticipate potentially problematic officers since the early 1970s. Most forces now have some form of the personnel-tracking programs usually known as early intervention systems. They’re often as simple as flagging any officer associated with a certain number of complaints or use-of-force reports, and they’re widely regarded as ineffective. Police departments have periodically experimented with more advanced systems as well, such as a 1990s-era effort in Chicago called “Brainmaker.” They haven’t stuck.
Researchers at the University of Chicago built an early intervention system as part of the Obama-era Police Data Initiative, started in the wake of the 2014 police killing of Michael Brown in Ferguson, Missouri. Most departments weren’t producing the kind of data necessary to make the system work, said Rayid Ghani, then-director of the university’s Data Science for Social Good program, who was in charge of the project. Its primary partner was the Charlotte-Mecklenburg Police Department, a non-unionized force of just under 2,000 officers. The CMPD is technologically advanced by police standards, and had its own tech staff to help implement the bespoke system.
The department and the researchers set up a way to continuously monitor and analyze 1,100 different factors for each officer, then rank them against one another based on their perceived risk score. According to a study conducted by the university and the department using historical data, the system produced up to 50% fewer false positives—where it flagged officers who ended up not being involved in adverse incidents—than CMPD’s previous system. It also identified between 10% and 20% more true positives.
The CMPD’s tool went live in late 2017. Every month since then, Mark Santaniello, a police captain, has received a list of the officers whose risk scores fall in the top 5%. He then decides what to do with that information. “I apply some reason to it,” Santaniello says. He often passes the information on to the supervisors of officers who have been flagged.
CMPD takes pains not to describe the program as a way to punish officers, and the specific officers it flags remain private. Because the model simply provides department officials with information to use as they see fit, there’s no way to attribute any specific decisions to the technology itself. The rank and file have had few objections, according to Mark Michalec, head of the local Fraternal Order of Police. “The trust issue is always there,” he wrote in an email. “But for the most part I think officers are in favor of it.”
Benchmark licensed the University of Chicago’s technology in 2018 and set out to make its own changes to the computer model, transforming what was effectively an academic experiment into a commercial product.
One of the company’s first clients was the Nashville Police Department, which had already been working with Ghani’s team. Matt Morley, a civilian employee of the department who handles technology issues, said Nashville had been generating risk scores with the system since at least 2016, but doing so required him to write his own code and run the calculations himself. Nashville agreed in 2018 to pay Benchmark $455,000 for a three-year contract and is on the verge of an official launch, according to Morley. From that point, the system will produce daily reports.
Since 2016, the system that Nashville has run—first with the University of Chicago and then with Benchmark—has flagged officers 367 times. About half of the flagged officers went on to receive a complaint within the next year. Almost 70 were eventually suspended, and four were either fired or resigned.
Three people with knowledge of Benchmark’s operations said Nashville was easily its most advanced client. Many other cities, they said, are in early stages, and it wasn’t clear if many clients would ever be able to run credible risk scores based on the data they collect. The company disputed that, but declined to provide information about any of its other clients.
Some departments that have proceeded—mostly smaller agencies—are currently using the technology only to digitize their internal reporting systems, at best a precursor to the machine-learning product. “We haven’t made it that far yet,” said Evanston Deputy Chief Jody Wright, when asked how his department, which first contracted with Benchmark in 2018, used the risk scores. “We just don’t have the volume. At the same time, that’s the primary reason we got involved.”
Benchmark’s plan has been to sign up as many cities as possible, collect their data, and feed it into a generalized computer model to make predictions, which it could then adapt to each client with minimal adjustments. But two people, who requested anonymity to avoid retaliation, said Benchmark regularly agreed to build custom solutions, then got bogged down in delivering them. Those promises of customization complicated the strategy by threatening to make the data streams incompatible, even as they overtaxed Benchmark’s small technical staff.
Benchmark cut some jobs earlier this year out of concern that Covid would lead to budget reductions and undermine demand. Some of its technical shortcomings were documented in an internal review prepared this spring. Bloomberg News viewed documents related to the review’s examination of Benchmark Management System, the company’s technology for ingesting and managing data from agencies. “Benchmark could not reliably deliver software; they could not meet their functional commitments, could not meet their timeline commitments, and could not release with quality,” wrote Shane Cheary, a consultant whom Benchmark later hired as its chief technology officer, in an email this spring.
Huberman says it’s inaccurate to read Cheary’s critique as an overall statement on the company’s technical capabilities, saying it referred to a specific part of its operation. The company said the technical problems Cheary identified have all been addressed, and that staffing levels are not a concern.
Ghani, now a professor at Carnegie Mellon University, is no longer involved in the daily operations of the Charlotte program. He said the technology he built is an improvement on previous early intervention systems, but the impact it can have is modest at best. His team was never able to establish a direct link between the use of the system and a reduction in civilian complaints or police misconduct. “This is not the way to solve systemic issues in policing,” he says.
Ghani hasn’t been involved in Benchmark’s adaptation of his models, and expressed concern about the company associating itself too closely with his work. “They’re not saying they’re using our software, they’re saying it’s based on our software, which I’m very uncomfortable with,” he said. While the initial code was open source, meaning anyone could evaluate it, Benchmark said it has made changes to the model that it doesn’t make public because they are proprietary. In both cases, the data being fed into the models and the scores they spit out are kept confidential, though Benchmark says it plans to release data showing how effective its product is.
Such opacity is a concern across machine-learning systems. In policing, experts say, there are questions about both the computer models themselves and the data being fed into them. Crime data and information on internal police operations, such as citizen complaints, are vulnerable to selective reporting, and there have been regular reports of police manipulating data in ways that skewed ostensibly neutral technical analyses.
Some experts say a system like Benchmark’s, to the extent it can prove useful, would only inspire confidence if operated from outside the standard police chain of command. As with other debates over police technology, it ultimately comes down to control. “Okay, let’s say a police officer kills somebody. Will anyone outside the department ever know he was in the top 10 predicted problematic officers for three months in 2020?” asks Sarah Brayne, an assistant professor at the University of Texas at Austin who studies police technology. “If it continues to be just the police who have that data, then we’ll continue to say they’re the ones who hold themselves accountable.”
©2020 Bloomberg L.P.