"They're Watching You at Work"—Testing And Disparate Impact
Don Peck writes in The Atlantic:
They're Watching You at Work 
What happens when Big Data meets human resources? The emerging practice of "people analytics" is already transforming how employers hire, fire, and promote. 
... By the end of World War II, however, American corporations were facing severe talent shortages. Their senior executives were growing old, and a dearth of hiring from the Depression through the war had resulted in a shortfall of able, well-trained managers. Finding people who had the potential to rise quickly through the ranks became an overriding preoccupation of American businesses. They began to devise a formal hiring-and-management system based in part on new studies of human behavior, and in part on military techniques developed during both world wars, when huge mobilization efforts and mass casualties created the need to get the right people into the right roles as efficiently as possible. By the 1950s, it was not unusual for companies to spend days with young applicants for professional jobs, conducting a battery of tests, all with an eye toward corner-office potential. “P&G picks its executive crop right out of college,” BusinessWeek noted in 1950, in the unmistakable patter of an age besotted with technocratic possibility. IQ tests, math tests, vocabulary tests, professional-aptitude tests, vocational-interest questionnaires, Rorschach tests, a host of other personality assessments, and even medical exams (who, after all, would want to hire a man who might die before the company’s investment in him was fully realized?)—all were used regularly by large companies in their quest to make the right hire. 

Hilariously elaborate testing suites were fashionable in the immediate postwar era. Robert Heinlein's 1948 sci-fi juvenile Space Cadet begins with the hero undergoing a couple of days of extremely expensive testing to try to get into the Space Academy (rooms turn upside down, ringers try to provoke test-takers into fistfights, etc.). The actual astronaut applicant testing a decade later was even more convoluted than Heinlein had imagined.

The process didn’t end when somebody started work, either. In his classic 1956 cultural critique, The Organization Man, the business journalist William Whyte reported that about a quarter of the country’s corporations were using similar tests to evaluate managers and junior executives, usually to assess whether they were ready for bigger roles. “Should Jones be promoted or put on the shelf?” he wrote. “Once, the man’s superiors would have had to thresh this out among themselves; now they can check with psychologists to see what the tests say.” 
Remarkably, this regime, so widespread in corporate America at mid-century, had almost disappeared by 1990. “I think an HR person from the late 1970s would be stunned to see how casually companies hire now,” Peter Cappelli told me—the days of testing replaced by a handful of ad hoc interviews, with the questions dreamed up on the fly. Many factors explain the change, he said, and then he ticked off a number of them: Increased job-switching has made it less important and less economical for companies to test so thoroughly. A heightened focus on short-term financial results has led to deep cuts in corporate functions that bear fruit only in the long term. The Civil Rights Act of 1964, which exposed companies to legal liability for discriminatory hiring practices, has made HR departments wary of any broadly applied and clearly scored test that might later be shown to be systematically biased. Instead, companies came to favor the more informal qualitative hiring practices that are still largely in place today. 
But companies abandoned their hard-edged practices for another important reason: many of their methods of evaluation turned out not to be very scientific. 
Some were based on untested psychological theories. Others were originally designed to assess mental illness, and revealed nothing more than where subjects fell on a “normal” distribution of responses—which in some cases had been determined by testing a relatively small, unrepresentative group of people, such as college freshmen. When William Whyte administered a battery of tests to a group of corporate presidents, he found that not one of them scored in the “acceptable” range for hiring. Such assessments, he concluded, measured not potential but simply conformity. Some of them were highly intrusive, too, asking questions about personal habits, for instance, or parental affection. 
Unsurprisingly, subjects didn’t like being so impersonally poked and prodded (sometimes literally). 

Tom Wolfe's The Right Stuff has a very funny chapter about how around 1959 researchers went nuts with joy trying out any test they could think of on the initial astronaut applicants. The doctors were used to testing either sick people or average people, but here were hundreds of above-average test pilots and fighter aces willing to put up with anything to go into outer space. A radioactive enema test? Sure!

The federal government's 1960 Project Talent exam, a post-Sputnik study of 440,000 high school students, contained two dozen subtests and took two days to administer.

One discovery from all these massive exercises in social science was that you didn't actually need so many different kinds of tests. Some were just fashionable Freudian quackery, but lots of others came up with reasonable yet highly correlated results. Standard IQ-type tests would carry most of the load.

For example, the military expanded its hiring test from the four-part IQ-like AFQT to the ten-part ASVAB, but it continues to use the AFQT subset to eliminate applicants. The other six parts of the ASVAB superset are then used for placement: e.g., if you score well on the vehicle repair knowledge subtest you might find yourself fixing trucks. But even if you ace the auto repair subtest, you have to make the grade on the IQ-like AFQT core to be allowed to enlist.

I spent a couple of hours on the phone nine years ago with the retired head psychometrician of one of the major wings of the armed forces and he told me that the biggest discovery of his decades on the job was that g dominated practically anything else you could test for.

This finding actually took a lot of the fun out of psychometrics. You'd dream up some seemingly brilliant test to find the perfect fighter jock or cook or file clerk, but once you extracted the general factor of intelligence from the results, you'd find that all the customization you'd done for the job hadn't added much predictive value over the heavily g-loaded AFQT scores. It makes sense to test how much applicants already know about flying planes or fixing engines, because the military can save time on training, and how much they've already learned likely says something about their motivation to learn more. But testing for specific potential hasn't worked out the way Heinlein expected. Testing for g works; other tests for potential haven't proven terribly helpful.
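The incremental-validity point can be illustrated with a toy simulation. All the loadings and coefficients below are made up for illustration, not drawn from real ASVAB data: subtests share a general factor, and we compare how much a job-specific subtest adds to predicting performance once a g-loaded composite is already in the regression.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Invented loadings: every subtest draws heavily on a general factor g.
g = rng.normal(size=n)
verbal = 0.8 * g + 0.6 * rng.normal(size=n)
math = 0.8 * g + 0.6 * rng.normal(size=n)
auto_shop = 0.7 * g + 0.7 * rng.normal(size=n)  # the "job-specific" subtest

# Performance driven mostly by g, a little by specific knowledge, plus noise.
performance = 0.7 * g + 0.15 * auto_shop + 0.5 * rng.normal(size=n)

def r_squared(predictors, y):
    """R^2 of an ordinary least-squares fit with an intercept."""
    X = np.column_stack([np.ones(len(y))] + list(predictors))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()

afqt_like = verbal + math  # g-loaded composite
r2_g = r_squared([afqt_like], performance)
r2_both = r_squared([afqt_like, auto_shop], performance)
print(f"R^2, g-loaded composite alone:  {r2_g:.3f}")
print(f"R^2, adding specific subtest:   {r2_both:.3f}")
```

Because the specific subtest is itself g-loaded, its incremental contribution on top of the composite comes out as only a sliver of extra R^2, which is the pattern the psychometricians kept running into.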

For all these reasons and more, the idea that hiring was a science fell out of favor.

Which mostly shows how fad-driven corporate America is. Serious institutions like the military (AFQT) and Procter & Gamble still use IQ-type tests in hiring. P&G provides a sample of its venerable Reasoning Test online, and it paid a lot of money to validate that the test correlates with on-the-job performance to get the EEOC off its back.

In contrast, the federal government developed a superb test battery in the 1970s for civil service hiring, the Professional and Administrative Career Examination (PACE), but the outgoing Carter Administration junked it in January 1981 to settle the Luevano disparate-impact case. The Carter Administration promised that Real Soon Now it would replace PACE with a test that was equally valid at hiring competent government bureaucrats, but on which blacks and Hispanics didn't score worse. That was 32 years ago.

Similarly, at the moderate-sized marketing research firm where I worked, we initially just gave Dr. Gerry Eskin's Advanced Quantitative Methods in Marketing Research 302 final exam from the U. of Iowa to each MBA who walked in the door looking for a job. It did a pretty good job of picking good people. Eventually the company grew large enough that the EEOC noticed the hiring exam. Instead of ponying up the money to validate Eskin's exam, though, we just junked it and winged it after that, with less satisfactory results.

The turn against the postwar objective P&G-style testing hasn't made America more fair. Peck notes:

Perhaps the most widespread bias in hiring today cannot even be detected with the eye. In a recent survey of some 500 hiring managers, undertaken by the Corporate Executive Board, a research firm, 74 percent reported that their most recent hire had a personality “similar to mine.” Lauren Rivera, a sociologist at Northwestern, spent parts of the three years from 2006 to 2008 interviewing professionals from elite investment banks, consultancies, and law firms about how they recruited, interviewed, and evaluated candidates, and concluded that among the most important factors driving their hiring recommendations were—wait for it—shared leisure interests. “The best way I could describe it,” one attorney told her, “is like if you were going on a date. You kind of know when there’s a match.” Asked to choose the most-promising candidates from a sheaf of fake résumés Rivera had prepared, a manager at one particularly buttoned-down investment bank told her, “I’d have to pick Blake and Sarah. With his lacrosse and her squash, they’d really get along [with the people] on the trading floor.” Lacking “reliable predictors of future performance,” Rivera writes, “assessors purposefully used their own experiences as models of merit.” Former college athletes “typically prized participation in varsity sports above all other types of involvement.” People who’d majored in engineering gave engineers a leg up, believing they were better prepared.

Funny how that works.

It's not a coincidence that when I read up on the history of psychometrics in the U.S. in the mid-20th century, an awful lot of breakthroughs took place at land-grant colleges rather than at Harvard and Yale. People in places like Iowa City thought objective testing would be better for people in Iowa. And they were largely right. Of course, we now know — instinctively! — that these midwestern methodologies were a giant conspiracy by the white male power structure. So today we fight the power by just hiring Harvard and Yale grads.

But now it’s coming back, thanks to new technologies and methods of analysis that are cheaper, faster, and much-wider-ranging than what we had before. For better or worse, a new era of technocratic possibility has begun. 
Consider Knack, a tiny start-up based in Silicon Valley. Knack makes app-based video games, among them Dungeon Scrawl, a quest game requiring the player to navigate a maze and solve puzzles, and Wasabi Waiter, which involves delivering the right sushi to the right customer at an increasingly crowded happy hour. These games aren’t just for play: they’ve been designed by a team of neuroscientists, psychologists, and data scientists to suss out human potential. 
Play one of them for just 20 minutes, says Guy Halfteck, Knack’s founder, and you’ll generate several megabytes of data, exponentially more than what’s collected by the SAT or a personality test.

A lot of what Silicon Valley does these days is wheel re-invention. Nobody remembers the past because so much effort has been invested in distorting memories to validate current power arrangements, so a lot of things that are sold as technological breakthroughs never before possible are really just ways to get around government regulations that were imposed because they seemed like a good idea at the time. 

For example, there are now a lot of Ride Sharing companies that you can hire via your smartphone to come pick you up and drive you somewhere. In other words, they are taxicab companies, but because they are High Tech and all that, they feel entitled to ignore all the expensive rules the government has piled on taxicab firms about how they have to take people in wheelchairs to South-Central. 

Here's a guess: much of what these Silicon Valley startups measure that's actually useful is good old IQ. And it will have the same disparate impact problems as everything else did.
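"Disparate impact" has a standard arithmetic screen: the EEOC's four-fifths rule, under which a selection rate for any group below 80 percent of the highest group's rate is treated as evidence of adverse impact. A minimal sketch, with the pass rates and group names invented purely for illustration:

```python
def violates_four_fifths(selection_rates):
    """Apply the EEOC four-fifths rule: flag any group whose selection
    rate falls below 80% of the highest group's rate."""
    top = max(selection_rates.values())
    return {group: rate / top < 0.8 for group, rate in selection_rates.items()}

# Hypothetical pass rates on a cognitive screening test (made-up numbers).
rates = {"group_a": 0.60, "group_b": 0.45, "group_c": 0.30}
flags = violates_four_fifths(rates)
print(flags)  # -> {'group_a': False, 'group_b': True, 'group_c': True}
```

Here group_b's 0.45/0.60 = 0.75 and group_c's 0.30/0.60 = 0.50 both fall under the 0.8 threshold, so any test producing rates like these invites EEOC scrutiny regardless of how well it predicts job performance.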

... Because the algorithmic assessment of workers’ potential is so new, not much hard data yet exist demonstrating its effectiveness. 

Actually, the military has been measuring job performance against test scores for 60 years. Many of the results are available online, typically in Rand Corp. documents. But who is interested in that?

There are some data that Evolv simply won’t use, out of a concern that the information might lead to systematic bias against whole classes of people. The distance an employee lives from work, for instance, is never factored into the score given each applicant, although it is reported to some clients. That’s because different neighborhoods and towns can have different racial profiles, which means that scoring distance from work could violate equal-employment-opportunity standards. Marital status? Motherhood? Church membership? “Stuff like that,” Meyerle said, “we just don’t touch”—at least not in the U.S., where the legal environment is strict. Meyerle told me that Evolv has looked into these sorts of factors in its work for clients abroad, and that some of them produce “startling results.” Citing client confidentiality, he wouldn’t say more.

That's what my marketing models professor at UCLA B-school said in 1982: on the hiring and insurance sides of the business, it's easy to come up with highly effective models of who you want and who you don't want if you are allowed to use race. But you aren't allowed to, so that's where the challenge is.

A long time ago, Americans thought that one of America's advantages was that we were pretty good at building and maintaining giant organizations like Procter & Gamble that just keep going decade after decade. The Motley Fool says:

Of all the Dow Jones Industrial Average components, Procter & Gamble (NYSE: PG) might stand out as being one of the most boring ...

But now we know that Americans are actually terrible at institutional maintenance and the only thing we are good at is creating tiny Silicon Valley start-ups with whimsical names. Thus, these little job applicant testing companies are the only hope big firms have of ever hiring anybody any good because it's impossible to come up with an effective system like P&G has. (Sarcasm alert.)
