Saturday, September 27, 2008

Data Crusader

Josh Tauberer ’04 is someone a policy wonk could love

By Brett Tomlinson, associate editor at PAW.
July 16, 2008

Since graduating from Princeton in 2004, Josh Tauberer has led a double life. By day, he’s a mild-mannered graduate student in linguistics at the University of Pennsylvania. By night, he commands a legion of computer programs, trolling the Internet for data about congressional bills and republishing the information on GovTrack.us, a popular Web site for bloggers, policy wonks, and concerned voters.

Some 10,000 visitors view GovTrack each day (more when a hot bill is up for debate), and its freely available databases feed a handful of government watchdog sites, including OpenCongress.org, a portal of congressional news, and MAPLight.org, which tracks the votes of members of Congress in parallel with the contributions they receive from special-interest groups. At the center of this web of information is Tauberer, GovTrack’s sole employee, who works from a slightly cluttered desktop in his Philadelphia apartment.

He’s just one citizen, doing his part for democracy.

“You could put it that way,” Tauberer says, stifling a laugh, “but ... I happen to enjoy it. It’s not like I get up in the morning and [say], ‘Oh, I’ve got to save the world by making this site.’”

Indeed, when Tauberer began organizing his site as an undergraduate, few thought that there was any need for it. The Library of Congress had been publishing congressional bills on its THOMAS.loc.gov site since 1995. But Tauberer found THOMAS difficult to navigate and filled with cumbersome quirks. So, with hopes of building a better source for legislative data, Tauberer, a largely self-taught computer programmer, began creating “screen-scraping” programs that look for specific patterns on Web pages, copy the information they find, and store it in a database. Technically, screen-scraping is not very difficult, he says, but it can be a hassle to decipher page formats and sort through data that may be incomplete, inconsistent, or unreliable. And when a source Web site is redesigned, the screen-scrapers need to be retooled as well. (“Fortunately, the government doesn’t change anything — ever,” Tauberer jokes.)
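For readers curious what a screen-scraper looks like in practice, the three steps described above (look for a pattern, copy what matches, store it in a database) can be sketched in a few lines of Python. This is only an illustration, not GovTrack’s actual code: the page fragment, the bill titles, the pattern, and the table layout are all hypothetical, and the real THOMAS markup is not shown in this article.

```python
import re
import sqlite3

# Hypothetical page fragment standing in for a legislative listing page.
# The third entry is deliberately inconsistent (it has no status).
SAMPLE_PAGE = """
<ul>
  <li><b>H.R. 101</b>: Example Clean Water Act (Introduced)</li>
  <li><b>S. 202</b>: Example Broadband Access Act (Passed Senate)</li>
  <li><b>H.R. 303</b>: Malformed entry with no status</li>
</ul>
"""

# Assumed pattern: bill number in bold, then a title, then a status in parentheses.
BILL_PATTERN = re.compile(
    r"<li><b>(?P<number>[^<]+)</b>:\s*(?P<title>.+?)\s*\((?P<status>[^)]+)\)</li>"
)


def scrape_bills(html):
    """Return (number, title, status) for every entry that fits the pattern."""
    return [(m["number"], m["title"], m["status"]) for m in BILL_PATTERN.finditer(html)]


def store_bills(conn, bills):
    """Save scraped rows, replacing any earlier copy of the same bill."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS bills "
        "(number TEXT PRIMARY KEY, title TEXT, status TEXT)"
    )
    conn.executemany("INSERT OR REPLACE INTO bills VALUES (?, ?, ?)", bills)
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
bills = scrape_bills(SAMPLE_PAGE)
store_bills(conn, bills)
rows = conn.execute("SELECT number, title, status FROM bills ORDER BY number").fetchall()
```

The entry that does not fit the pattern is silently skipped, which hints at the hassle Tauberer describes: a scraper is only as good as its assumptions about the page, and a redesign of the source site would mean rewriting the pattern.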

Perseverance paid off for Tauberer when he launched GovTrack in September 2004, more than three years after he first envisioned the site. Users began to take notice later that year after Tauberer was awarded the top prize in a Web development contest run by Technorati.com — the citation called GovTrack “School House Rock on steroids” — and a January 2005 New York Times story about the site provided an additional boost. Today, when Web searchers type a congressional bill number into Google, more often than not the top result is a URL that begins with “www.govtrack.us.” Other GovTrack-supported sites are close behind.

“GovTrack is really the central hub in federal legislative information,” says John Wonderlich, director of the Sunlight Foundation’s Open House Project, which lobbies for better Web access to legislative data. “It’s the clearinghouse for data coming from the Library of Congress, and that’s kind of amazing that [Tauberer] has managed to do that on his own.”

While Tauberer’s hope was to improve government accountability by making it easier to access and digest the details of legislation, he is the first to admit that “information only gets you so far.” Footnotes, references, and amendments to amendments to amendments can make bills nearly indecipherable, even to well-informed readers. So, in addition to publishing the full text, status, and Library of Congress summary for each bill introduced on Capitol Hill, GovTrack provides other useful tools: e-mail alerts linked to specific bills, members of Congress, committees, or topics of interest; detailed maps of congressional districts, created by Tauberer using census data and Google maps; graphs that illustrate votes on a particular bill; and a blog of legislative analysis, written mainly by unpaid contributors.

Each senator and representative also has a GovTrack page that includes the member’s voting history, links to bills he or she has sponsored, and a graphic that shows the member’s standing on GovTrack’s “Ideometer,” an ideological spectrum that Tauberer created using a statistical analysis of bill sponsorship patterns. John McCain, for example, pushes the Ideometer’s needle to the right, about a third of the way toward the Republican end of the spectrum, while Barack Obama is positioned to the left, about two-thirds of the way toward the Democratic end. Both are labeled “rank-and-file,” which means they fall within the middle 50 percent of their respective parties. Sen. Barbara Boxer (D-Calif.) occupies the far left pole, and Sen. Jim DeMint (R-S.C.) stands on the far right.
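The “rank-and-file” label lends itself to a small worked example. The sketch below assumes each member of a party already has a numeric ideology score and simply marks those who fall between the first and third quartiles, that is, the middle 50 percent. The member names and scores are invented, the second label is a placeholder, and the article does not describe how GovTrack actually computes its scores or cutoffs.

```python
import statistics

# Invented scores for eight hypothetical members of one party.
party_scores = {
    "member_a": 0.1,
    "member_b": 0.2,
    "member_c": 0.3,
    "member_d": 0.4,
    "member_e": 0.5,
    "member_f": 0.6,
    "member_g": 0.7,
    "member_h": 0.8,
}


def label_members(scores):
    """Label members inside the interquartile range as rank-and-file."""
    # quantiles(..., n=4) returns the three cut points between quartiles.
    q1, _median, q3 = statistics.quantiles(scores.values(), n=4)
    return {
        name: "rank-and-file" if q1 <= score <= q3 else "outside middle 50%"
        for name, score in scores.items()
    }


labels = label_members(party_scores)
```

With these made-up numbers, four of the eight members land in the middle band and the two at each end fall outside it.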

Tauberer would like to add more analytical features like the Ideometer, but he concedes there are limitations to his skills. While he’s a whiz with databases, he lacks the design expertise needed to generate the slick infographics that newspapers and magazines create. And then there’s the simple arithmetic of time. After four years of graduate school, Tauberer is drafting a proposal for his dissertation in linguistics, which he hopes to complete in the coming year.

Working on a Web site that promotes government transparency has been a significant departure from Tauberer’s academic work, which deals with phonetics and how children acquire language skills. He majored in psychology at Princeton while pursuing a certificate in computer science, and it was his interest in technology — not politics — that drove the development of GovTrack.

In the spring of 2001, Tauberer was a student in computer science professor Andrew Appel ’81’s freshman seminar, “Speech Is a Machine,” which addressed tech-related topics like copyright in the Internet age and whether computer programs qualify as “speech.” Tauberer first encountered THOMAS while studying a recently passed bill that restricted fair-use rights. He began thinking of ways to make the site’s vast wells of information more accessible. One year later, he devised a rough system for GovTrack, and in his senior year, when most of his classmates were immersed in thesis research, Tauberer laid the framework for his site, which he would finish in the summer after graduation.

Classmate David Robinson ’04, now the associate director of Princeton’s Center for Information Technology Policy, says that Tauberer showed the same sort of dedication and ingenuity as an editor for The Daily Princetonian. In 2003, Tauberer designed a survey methodology for conducting student-opinion polls, using a random list of student phone numbers and a customized, secure Web site in which pollsters could enter the data they collected.

Robinson says Tauberer was “sublimely confident” that GovTrack would find an audience. (Tauberer calls it “naïveté.”) Tauberer urges the government to share its data in more user-friendly formats, and he hopes to work as an advocate for better access to government data after completing his Ph.D. He has stayed true to those principles by providing free access to his own databases. Advertising on GovTrack pays for operating expenses like server space and provides a modest profit.

The openness that Tauberer sought to expand with his own site is becoming a major focus among scholars at Princeton and other institutions who are studying technology, says Robinson. “In general, we’re looking at all the ways that digital technology and public life interact with one another, and it’s becoming clear that transparency is one of the main ways that digital technology and public policy interact,” he notes. Robinson, Professor Ed Felten, and graduate students Harlan Yu and William Zeller recently wrote a paper, “Government Data and the Invisible Hand,” outlining a novel strategy for more transparency: Reduce the federal role in presenting data on the government’s own Web sites, such as THOMAS, but step up government provision of reliable, raw data that nonprofit and commercial groups can use on their sites.

Wonderlich, of the Sunlight Foundation, says that most issues in government openness still are unresolved. Some are technical, such as standardizing the format in which data are released. Other issues involve making information easier to obtain. In pre-Internet days, items of public record were acceptably relegated to file drawers and dusty bookshelves. Today, Web users have grown to expect instant access. “People see that the Internet is making it easier to shop and do a lot of other things,” Wonderlich says. “It intuitively makes sense to people that Congress should operate in the same way.”

On the campaign trail, both Barack Obama and Hillary Clinton spoke about technology as a means for openness. Clinton, in a January Meet the Press interview, called for more transparency and Web access to government information, and Obama’s technology plan, outlined on his campaign Web site, vows to “mak[e] government data available online in universally accessible formats to allow citizens to make use of that data to comment, derive value, and take action in their own communities.”

Last year, Tauberer dipped a toe into the political waters, contributing to the Sunlight Foundation’s Open House Project report, which drew endorsements from a handful of representatives on both sides of the aisle, and writing an op-ed piece for The Hill, a daily newspaper that covers Congress, on improving government databases. But Tauberer’s main interest is in civics, not politics, he says, and GovTrack is nonpartisan. The site’s only official position is that the government should publish more data.

“Definitely, the information has to be out there and usable,” Tauberer says. “I can only hope that it makes some sort of real difference.”

See how GovTrack presents the next president of the United States: For Barack Obama Click Here; for John McCain Click Here.