mattwaite.com

Not a blog. Not anything, really.

Inching closer

Posted Saturday, March 22nd, 2008 at 12:37 p.m. by Matthew Waite

I'm still fiddling around with the code for this blog, but I'm getting close to calling it good and bringing in my old content. Thanks for your patience.

*Update: I've rigged up a parallel set of urls for each blog post. Why? For all the people who linked to my old blog posts in social bookmarking sites. Now, when I load up the old content, those old social bookmarks won't 404. For instance, if you click on the data ghettos post below, you'll see no one has bookmarked that in Del.icio.us. Not so. Change /jan/ in the url to /01/ and hit enter. You'll see the del.icio.us badge change to show that 19 people in fact bookmarked that post.

* Update 2: Well... most old blog posts will resolve to their old URL. Someone, I won't name names, apparently liked really long headlines, and those are choking the slug field that makes the urls. I could probably fix this, but I want to get feeds working and old content into the database first.

Thoughts on EveryBlock and context

Posted Sunday, January 27th, 2008 at 11:35 a.m. by Matthew Waite

First, the grains of salt:

  • I don’t live in an EveryBlock city. I think a key part of EveryBlock is the visceral connection you have with your neighborhood. I don’t have that, so view my comments with that in mind.
  • Huge Adrian Holovaty fan. Huge Django fan. Big believer in breaking out of the story centric worldview.

And now let me get this out of the way: EveryBlock is not something newspapers should fear. Here it is. I was wrong. I wrote that last year. Now, I don’t think newspapers need to fear this. Study it? Emulate parts of it? Learn from it? Absolutely. Fear it? No. Even if EveryBlock becomes crazy popular — and I think it will — it points to your content. If anything, EveryBlock will help people get to your content that interests them.

Okay. I feel better now. On to the meat of the post.

Is EveryBlock a data ghetto? This is tough. I can see why some would read my data ghetto post and think yep, EveryBlock is a data ghetto. I’ve honestly held off writing this post until I could answer the question for myself. It’s not completely clear cut, but I don’t think EveryBlock, as it stands now, is a data ghetto.

Why?

I’ve seen several people say that EveryBlock is data without context, but that’s not entirely true. The context comes from the user through geography. The block is the context. The value you put into that context is based entirely on the fact that you live there. And, from that powerful context, the user provides further context by choosing what’s next, what is interesting to them.

What also gets EveryBlock out of the data ghetto, outside of the whole geographic construct that almost no newspaper.coms use, is that in some places EveryBlock provides its own context via graphing and counting instances of a thing. Compare that to data ghettos as I classified them: couple of search boxes and a results page. I’ll add another data ghetto indicator to the pile: single subject searches. On most newspaper.com data ghettos, you can search real estate transactions or crime - you can’t do both at the same time. Few have taken the next step of putting data together into one place.

But, but, not that context, this context: I think there are some legit complaints about context in EveryBlock. But I also think calling it data without context, perspective or meaning is wrong. The question is what context? And whose? That is a deeply complicated issue.

Mark Schaver points to the pothole paradox (short version: your pothole is interesting to you, the one not on your commute could not be any less interesting). So distance is important — the closer, the more you’re interested. But even that’s not that simple. If my neighbor has his lawn mower stolen, I care a lot. Two blocks away? Meh. But if someone in my neighborhood is murdered 5 or even 10 blocks away, I care a lot. It doesn’t have to be all that close. Same with building permits. I kinda care if my neighbors are pulling permits to build a garage or a bathroom. Much beyond that, I don’t. But if Wal-Mart is building a Super Center one or two neighborhoods over, I really care.

So I think the journalist’s complaint about context in EveryBlock — and it’s valid if you’re thinking like a journalist — is that there’s no mechanism to make one data point stand out from another. The example Schaver pointed to was the death of Heath Ledger. In New York’s EveryBlock? Yep. But on the map view, deaths are just another dot. Hollywood stars, homeless people, crack dealers — just another dot.

Not all dots are created equal. And there are legions of factors as to what makes one more interesting than another, and another legion of factors as to why some things are more interesting to different people. Interestingness — or News Value — is an exponentially complicated equation.

To my mind, that EveryBlock doesn’t try to set some subjective one-size-fits-all standard of importance for data is proof that Adrian isn’t lying when he says EveryBlock is a supplement, not a competitor, to traditional news sources. Pointing out Important Stuff and making a Big Deal Out Of It is what newspapers do. EveryBlock doesn’t even try.

That said, I think the next step — for EveryBlock, for newspaper data ghettos, whoever — is personalization. Imagine if EveryBlock took your home address and a list of the things you cared about and displayed data with some kind of distance/importance weighting algorithm. Not everything, like now. Just a good guess at what’s important to you based on the distance from your home and how important you said certain things were to you. And done in such a way that you can tweak your own settings. Just have a kid? Might be time to move the schools slider up. And while you’re there, move the restaurant review sliders down. Trust me. With a new baby, you won’t be going out for a while.
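A rough sketch of what that kind of distance/importance weighting might look like, in Python. Everything in it is an assumption for illustration: the category names, the 0-to-10 slider scale and the scoring formula (slider weight divided by one plus the distance in blocks) are made up, not anything EveryBlock does.

```python
# Hypothetical scoring sketch: none of these names come from EveryBlock.

def score(item, sliders):
    """Return a relevance score for one data point.

    item    -- dict with 'category' and 'blocks_away' (distance from home)
    sliders -- dict mapping a category to a user-chosen weight from 0 to 10
    """
    weight = sliders.get(item["category"], 0)
    # Assumed formula: interest falls off with distance, scaled by the slider.
    return weight / (1.0 + item["blocks_away"])


def personalize(items, sliders, limit=10):
    """Return the highest-scoring items, dropping anything scored zero."""
    scored = [(score(item, sliders), item) for item in items]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in scored[:limit]]


sliders = {"murder": 10, "theft": 3, "restaurant_review": 0}
items = [
    {"category": "theft", "blocks_away": 0, "what": "lawn mower stolen next door"},
    {"category": "theft", "blocks_away": 2, "what": "lawn mower stolen"},
    {"category": "murder", "blocks_away": 4, "what": "homicide"},
    {"category": "restaurant_review", "blocks_away": 1, "what": "new bistro"},
]
ranked = personalize(items, sliders)
for item in ranked:
    print(item["what"])
```

With these made-up sliders, the theft next door ranks first, the homicide four blocks away outranks the theft two blocks away, and the restaurant review disappears because its slider is at zero. Moving a slider is just changing one number in the dictionary.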

Even switching to some kind of straight distance algorithm solves one problem with EveryBlock: Just because it didn’t happen in your neighborhood/block/zip code doesn’t mean it wasn’t near you. What if you live on the edge of a neighborhood — one block over is Whispering Pines but you’re in Whispering Oaks? If someone is murdered one block away, but they’re in Whispering Pines, do you care? Of course you do. It’s one block away. Using a distance algorithm, you aren’t hemmed in by boundaries or forced to check more than one neighborhood — no matter how easy it is — to be sure you’ve seen all you want to see.
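To make the straight-distance idea concrete, here is a minimal sketch that filters events by radius from a home point instead of by neighborhood name. It uses the standard haversine formula for great-circle distance. The coordinates, field names and the half-mile default are assumptions picked for the example.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def miles_between(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, using the haversine formula."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def nearby(home, events, radius_miles=0.5):
    """Return events within radius_miles of home, ignoring neighborhood names."""
    return [
        e for e in events
        if miles_between(home[0], home[1], e["lat"], e["lon"]) <= radius_miles
    ]


# Made-up coordinates for a home on the edge of "Whispering Oaks".
home = (27.7731, -82.6400)
events = [
    {"what": "homicide", "neighborhood": "Whispering Pines",
     "lat": 27.7745, "lon": -82.6400},
    {"what": "burglary", "neighborhood": "Whispering Oaks",
     "lat": 27.8100, "lon": -82.6400},
]
close = nearby(home, events)
```

Here the homicide in the other neighborhood is kept because it is about a tenth of a mile away, while the burglary in the home neighborhood is dropped because it is a couple of miles off. The neighborhood field is never consulted.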

Don’t take anything I’ve written to mean I don’t think EveryBlock is amazing. It is. I’m blown away by it. I’m humbled by it — first thing I thought when I was going through it was “crap, I need to work harder/sleep less/take brain steroids if this is the bar being set.” I just want to be clear that for EveryBlock and for anyone doing news data apps, context is an extremely important factor … and extremely complicated.

The absurdly simple way to add content to Twitter via Django

Posted Monday, January 14th, 2008 at 11:51 a.m. by Matthew Waite

I haven’t found the right project to implement this yet, but I thought I would share it anyway.

Say you’ve got an application in Django. Say you’ve got a Twitter account for that application. Now you’d like your content to go from your Django app to your Twitter account.

There’s lots of ways to do this. Here’s an absurdly simple way: A save method and the Twitter API.

Let’s say you built a blog application in Django. You’d have a model that looked something like this:

from django.db import models

class Posts(models.Model):
    headline = models.CharField(max_length=255)
    # ... other stuff ...

So, since you only have 140 characters in Twitter, you just want your headline and a link in your tweet. The easiest way I can think of to do this is in a save method. Save methods in Django are where you want something to happen between when the user clicks the save button and when it actually goes into the database. In this case, we want to take a couple of pieces of data, pass it to Twitter via its API, and then save it to the database. To do this you have to define a save method. The whole thing -- all four lines of code -- looks like this:

def save(self):  # needs "import urllib" at the top of the file
    params = urllib.urlencode({'status': 'My latest: %s at www.yoururl.com' % self.headline})
    o = urllib.urlopen("http://USERNAME:PASSWORD@twitter.com/statuses/update.xml?", params)
    super(Posts, self).save()

So what’s happening here?

First line just defines the save method.

The second line creates a variable called params and then urlencodes the string starting with ’status’. Since Twitter’s API is mostly a REST interface, you put all your data into the URL and Twitter interprets that. Thus, you have to make your content into a URL. Thus, urlencode. You can style this any way you want. The way I’ve done it, the tweet would look like: My latest: This is a headline at hyperlinked url, probably tinyurl’d. You could make it whatever: change the words, directly link to the blog post using a get_absolute_url method, whatever you want, keeping the 140 character limit. Go wild.

The third line creates a variable called “o” which sends a request to the Twitter API url, with the username and password of the user in the URL so it authenticates, and the parameters that you just URL encoded in the previous line. So, if you were implementing this, you’d replace USERNAME and PASSWORD with your username and password, obviously.

The fourth line then goes ahead and saves it all to the database.

That’s it. Four lines and your Django content is now feeding your Twitter account.
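One thing the four lines don't do is enforce the 140 character limit. A small helper could trim the headline before it goes into the status. This is only a sketch: the function name build_status, the three-dot cut marker and the error for an overlong URL are assumptions, not part of the Twitter API or of Django.

```python
TWEET_LIMIT = 140


def build_status(headline, url, prefix="My latest: "):
    """Return a status no longer than TWEET_LIMIT, trimming the headline if needed."""
    fixed = len(prefix) + len(" at ") + len(url)
    room = TWEET_LIMIT - fixed
    if room < 4:
        # Need space for at least one character plus the three-dot marker.
        raise ValueError("prefix and url leave no room for a headline")
    if len(headline) > room:
        # Cut the headline and mark the cut with three dots.
        headline = headline[:room - 3].rstrip() + "..."
    return "%s%s at %s" % (prefix, headline, url)


short = build_status("This is a headline", "www.yoururl.com")
trimmed = build_status("x" * 300, "www.yoururl.com")
```

Inside the save method, the string passed as 'status' would then be something like build_status(self.headline, 'www.yoururl.com') rather than the hand-formatted one.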

Molten content, data ghettos and why your CMS problems are an excuse, not a reason

Posted Friday, January 11th, 2008 at 11:59 a.m. by Matthew Waite

The other component of the data ghetto that bothers me is that you can’t find that data outside the ghetto. Please, someone point me to a place where there’s dynamic content being fed to the story level pages. I have yet to see where someone’s crime data is being fed into a story about a crime, i.e. a map of murders from the data ghetto’s crime application dynamically generated on a story page about a murder. Or a list of the largest donors to a politician from a campaign finance app on a story about a politician.

And that seems to be a problem we’re creating for ourselves — we’re only thinking about getting the data online, not about what to do next. Or about what else could we do with our data. Or what could someone else do with it if we let them. We’re content with a couple of search boxes, a button and a results page. And we’re content to leave it right where we put it.

Here’s why I’m thinking about this now: I spent all this time building PolitiFact to be a layered, data-driven approach to political and fact check journalism. What have I spent my time doing since? Trying to figure out how to get my content out of that site and into other forms. First was the automatically generated email newsletter (sign-up here). Then it was a widget, which required me turning the Truth-O-Meter into a JSON stream that I could parse with a little javascript. Now? We’re syndicating PolitiFact content to newspapers who sign up at a whole other site. Now, subscribers can go to PolitiFactMedia and get our content for their publications via a password protected site. That site includes a pure REST API for automated import into whatever CMS the customer is using.

Doing all this got me thinking of another concept: molten content.

I’ve always thought of the work I was doing as building something out of raw materials. As a reporter, I did interviews, read documents and analyzed data. All that raw material was worked into a story, a graphic, maybe some photos, and lately some online interactive content. Building news applications, I’m finding, is more like working with metal. The more malleable you make your content, the easier it is to mold your application into all the different places it may need to go.

Most places, the data in the main newspaper.com CMS is cold iron — hard as hell to work with, if not impossible. So most of the time, we’re not going to be bringing our newspaper content to our applications. But what if we brought our application to our newspaper content?

Ask yourself this: Why can’t you find data from a newspaper.com’s data ghetto outside of there? A large reason for a lot of data ghettos is that the CMS is on one set of servers based on one technology and the place for data is on a whole other server setup with another technology. That’s driven by the horror stories most newspaper.com webworkers have about how gawdawful their CMS is and how the CMS won’t serve up this or handle that (or hell, even shovel the previous day’s paper online cleanly).

But, if you designed your application right, your CMS problems are an excuse, not a reason.

As I go forward, I’m adding a hidden requirement to my applications: make the data molten — the stage where the metal is nearly liquid, easy to pour in whatever form I need it to go into.

Here’s a simple example: PolitiFact’s widget. It was my first attempt at molten content. I’m going to breeze over the code for now — if anyone wants me to detail it, post a comment below and I’ll do it in another post.

Basically, the widget is the sum of a few parts. First, a user embeds a little piece of script on their page (you can get it here). That calls a piece of Javascript that looks like this. That Javascript code then makes a couple of calls itself: to a CSS file and to a page that returns a query to the PolitiFact database in JSON format. That page is actually pretty hackish: It’s a pretty vanilla view in Django which then returns a template that fakes the JSON. If I were to do it over again, there’s better means to serialize data in Django. But my hack worked and I haven’t had time to make it work the more elegant way. Anyway, the script goes to that page, parses out that data and then writes it to your browser screen.
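As a sketch of the less hackish route, the JSON can be produced by handing plain dictionaries to Python's json module instead of faking it in a template. The field names, the function name widget_feed and the optional callback wrapper (JSONP-style, so a script tag on another site can consume it) are assumptions for illustration, not the actual PolitiFact feed.

```python
import json


def widget_feed(statements, callback=None):
    """Serialize statements to JSON; wrap in callback(...) if a name is given."""
    payload = json.dumps({"statements": statements}, sort_keys=True)
    if callback:
        # JSONP-style wrapper for consumption through a <script> tag.
        return "%s(%s);" % (callback, payload)
    return payload


# Hypothetical records, standing in for the result of a database query.
statements = [
    {"speaker": "Example Candidate", "ruling": "Half True", "url": "/statements/1/"},
]
body = widget_feed(statements)
wrapped = widget_feed(statements, callback="showTruthOMeter")
```

In a Django view, the returned string would go out in an HttpResponse with the appropriate content type, and the script on the embedding page would parse it the same way it parses the template-built version.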

Here’s what I mean by molten content. PolitiFact resides on a server in my employer’s server room. This blog is hosted somewhere else. That’s Django, this is WordPress (which is PHP based) (* this is a post from my old blog. It’s all Django now). Different systems, different servers, different states, different everything. And, here’s that dynamic content:

So, if I can take PolitiFact and put it on my blog with these tools, why can’t we take it a step further and put any and all data from our news applications into our story pages?

Because, as I just showed you, we can. We can do it better than this. If you’re developing news apps and don’t know anything about web services, you should start learning.
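The receiving end can be just as small. Here is a minimal Python 3 sketch of a story page turning a JSON feed from a data app into an HTML fragment. The feed shape (a list of label/value records), the CSS class and the function name are all hypothetical. In practice the JSON would be fetched over HTTP from the data app; a literal string stands in for that response here.

```python
import json
from html import escape


def render_sidebar(feed_json, title):
    """Turn a JSON list of {label, value} records into an HTML fragment."""
    rows = json.loads(feed_json)
    items = "".join(
        "<li>%s: %s</li>" % (escape(str(r["label"])), escape(str(r["value"])))
        for r in rows
    )
    return '<div class="data-sidebar"><h4>%s</h4><ul>%s</ul></div>' % (
        escape(title), items)


# Stand-in for the response from a campaign finance app (made-up data).
feed_json = (
    '[{"label": "Donor A", "value": "$5,000"},'
    ' {"label": "Donor B & Sons", "value": "$2,500"}]'
)
html_fragment = render_sidebar(feed_json, "Largest donors")
```

The CMS only has to be able to drop that fragment, or a script tag that builds it, into a story template. It never needs to know how the data app works.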

The broader point here, divorced from technologies and implementations, is that we need to start thinking about where our data is going to go and what we’re going to do with it beyond search and results pages at our one URL. More on this in the coming months, when some other projects I’m working on go live.

Data ghettos

Posted Wednesday, January 2nd, 2008 at 10:51 p.m. by Matthew Waite

One resolution for this year: Post more often. Starting now.

I’m not sold on the whole Data Desk/Data Center idea that a lot of newspaper websites are trying out. I hate to say all this because at a lot of places, the people responsible for them are my friends. But for all the love I have for putting data online, there’s something that has bothered me about the way they’re going about it. A friend summed it up for me recently: The Data Ghetto.

The Data Ghetto is that one mishmash page where all of that site’s databases are lumped together.

I won’t take the time to criticize how these pages are constructed; the criticisms are obvious and even people who have made them don’t like them. But take a step into one of the databases and you get to my second problem with them: a couple of search boxes and a button.

Is that really it? Is that the big newspaper.com push into data? Sprawling, barely organized pages to get to a couple of search boxes and a button? This fails on a number of levels:

  • Creativity: Can we offer no more creative way into the data than to make a user put stuff here and hit search? Search is fine in context, but it’s also limiting. What if someone doesn’t know how to spell something, or doesn’t know what they want, or all they want to do is explore the data their own way? You’ve cut that off. In my opinion, browsing is much better. If your data is normalized — i.e. all the cities are spelled the same way, etc. — then you can let people click on the things they’re interested in and get those things. And in the process, they may see other things they’re interested in. To see it in action, look at how PolitiFact is browsable (here, here, here and here). Or better yet, if you don’t believe me, look at chicagocrime.org. There are 10 different ways to browse that data. There’s one search box. Search is a part of it — it has a value in context — but it shouldn’t be your whole app.
  • Repeat customers: A lot has been written about the traffic these database sites get. But I want to see what the traffic is like months after it first goes up. What’s the traffic like after the third or fourth update? The reason I ask is because some of these search apps to me seem like a pure voyeur play. What I mean by that is the user sees a salary database, goes and looks up their neighbor and … what? They’re done. They’ve answered the one question they wanted to ask. How are you bringing people back to your data?
  • Shaky business model: Are we really building a business model, or even a component of a business model, around making public data searchable? Because guess what? Google is too. That’s right. The search giant is dealing directly with government agencies to help them make their own data searchable. Sound familiar? Think your data ghetto can compete with Google? Do you think people are going to remember your newspaper.com url over Google? Really?
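On the browsing point in the first bullet, here is a minimal sketch of why normalized data matters: once every spelling of a city collapses to one value, the records can be grouped into a list of things to click on. The records, field names and the normalization rule are assumptions for the example.

```python
from collections import defaultdict


def normalize_city(name):
    """Collapse spacing and case so 'st. petersburg ' and 'St. Petersburg' match."""
    return " ".join(name.split()).title()


def browse_index(records, field="city"):
    """Group records by a normalized field so each value can become a link."""
    index = defaultdict(list)
    for record in records:
        index[normalize_city(record[field])].append(record)
    return dict(index)


# Made-up records with inconsistent spellings of the same city.
records = [
    {"city": "St. Petersburg", "name": "A"},
    {"city": "st. petersburg ", "name": "B"},
    {"city": "Tampa", "name": "C"},
]
index = browse_index(records)
```

Each key in the index becomes one browse link with its record count next to it. The title-casing rule is deliberately crude and would mangle some names, so real data would need a lookup table of known spellings instead.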

Here’s the to-be-fair portion of the post: I have exactly one data-driven app under my belt (PolitiFact). I have a half dozen more in the works, so I’m thinking about this stuff constantly. But for now, I can only talk. I can’t show, at least not yet.

That said, here’s how we can get out of the data ghetto: add some journalism to it.

Back in November, Will Sullivan tried to coin a term where multimedia and data collided into something he, jokingly, called multimedata journalism. Of note was a New York Times effort where they did a story about people freed from prison by DNA evidence. They interviewed 137 of 200 people released. They then put an app online that allows you to click on each name and see details about each case (data) and hear their story (audio).

I’d argue there doesn’t need to be a new term: This is what it’s supposed to be. Journalists are supposed to add context and value to information. Heaving databases online should be no different. Does each app require the type of effort that the NYT put in? No. But flatly serving up data with no context or analysis or value outside the record itself is hardly journalism. A public service maybe, but not journalism.

Next post: Instead of bringing journalism to your app, why not bring your app to your journalism?

Learning Django

Posted Friday, December 14th, 2007 at 09:39 p.m. by Matthew Waite

I’ve had this post sitting in my queue for months now giving advice on how to learn Django. I’ve had several people email me and ask how I learned. Amazon delivered some inspiration to finish this on my doorstep last night.

My copy of the Django book arrived on my doorstep. I’ve had it on pre-order since August. I’ve read two chapters (Caching and users), skimmed the intro and browsed the whole book. Want to learn Django? Make your Christmas a little more merry for yourself and go buy it. Worth every penny. Clear. Engaging. Takes you from the very beginnings to very advanced topics.

You’ll go a long way with just that book. Other resources I used developing PolitiFact were Adrian Holovaty’s manifesto on newspapers and structured data, the unbelievably well done documentation on the Django Project website, the Django users Google Group and, if you find yourself totally lost on Python, read How to Think Like a Computer Scientist.

Quoting from the acknowledgments on the book “the most gratifying aspect of working on Django is the community.” It’s true, and I’ve been among the beneficiaries of the community’s generosity and willingness to share code and advice. So what are you waiting for? If you think you want to learn how to make data-driven web apps, Django is a fine choice to jump in and learn. And there are smart people in the world doing everything they can to help make it easy.

Knight, round 2

Posted Tuesday, November 27th, 2007 at 09:47 p.m. by Matthew Waite

I finally submitted my answers to the Knight News Challenge’s second round of questions. No link to the application itself this time because the applications aren’t open for public viewing anymore. Here’s my first post on my Knight idea if you missed it. There were more questions this time, some of them pretty mundane, others rather hard to answer. But one question was particularly interesting to me, so I’ll post it and my answer:

8. What specific, unique opportunity do you see that will make this project more successful than others trying to fill that general need? * (2075 characters maximum, approximately 325 words)

Most news sites online now - large and small - have been charitably described as a giant ball of mud: an accumulation of materials loosely held together to take some semblance of shape. My own employer’s site at one point was created by 23 different Javascript programs that added bits and pieces of functionality. Many sites, when adding applications or functionality, far too often take a third-party application and bolt it onto a page. Or they wind up having to go completely outside their production systems to a different system altogether because the systems they chose to put news online won’t adapt or support new applications. Several efforts I’ve seen aimed at socially networking the news take this same tack - develop an application outside the main CMS and bolt it onto the site or create an entirely separate site away from the main URL. Louretta will take the opposite approach - news, social networking and databases will be in a site’s DNA. There won’t be external calls or separate servers or different URLs. A news organization will use one site, one set of servers, one set of technologies. No vendors. No outside support. And it will be designed from the start to run on an absolute minimum of staff members, who only have to learn one system.

Blogging has been non-existent for me lately. Now that Knight is done for a while I’ll finish the three half-written posts I have in the queue that I’ve been letting slide.

We interrupt this career...

Posted Saturday, October 27th, 2007 at 09:52 p.m. by Matthew Waite

My job title changed today. For the first time that I can remember, I’m not a reporter. As of today, I am the News Technologist at the St. Petersburg Times.

What is that? We’re not sure yet, and it’s going to change a lot. It’s a hybrid job, a programmer-journalist job. We even toyed with programmer-journalist as the title. Simply put, I’m moving to the web to develop new products and tools and maybe reinvent a few things we’re doing already. I’ve been given the mandate to take my skills as a journalist and my growing skills as a coder and bash them together really hard to see what comes out. I’m going to take my 12 years of data-driven journalism and try to build things online that people want. I learned a lot making PolitiFact, and I’m burning up with more ideas. Now, instead of working them in around stories, these ideas are what I’m working on. It’s technology. It’s R&D. It’s databases. It’s local, national, mobile. It’s scary looking code I didn’t understand yesterday. And then, really, it’s all journalism. No matter how far from my journalism undergraduate education I get, it all comes back to getting compelling information into the hands of readers.

I’ll never really stop being an investigative reporter. But I’m tired of talking about and worrying about the future of journalism. It’s time to help make the future. Directly. Hands on. Starting today.

Knight News Challenge and me

Posted Thursday, October 11th, 2007 at 10:02 p.m. by Matthew Waite

So I’ve taken the plunge and submitted an idea to the Knight News Challenge. You can read my application here, but the short version is this: I want to make my hometown twice-weekly, and as many papers like it as I can convince, much better online. How? By putting the tools and applications that the big boys are talking about in their hands.

From my application:

I would like to build a world-class online content management system for small town papers - the weeklies and twice-weeklies for whom a high-quality CMS is unaffordable. The system would be built in Django, an open-source Web framework, and the code creating the site would be freely available. In addition to making the code free, this project would host sites for small town papers without cost for five years. I call the project a CMS for lack of anything better to call it. It’s much more than a CMS. I want to build a site that creates online social networks around the very real social networks that exist in small towns. The site will allow people to participate in and contribute to those social networks, helping breathe life into the site. The social networks will then drive a customized, personalized experience for users of the site. Beyond the social networks, I want to build the mechanisms for people in small towns to follow their institutions of government, schools and local sports through databases of public records, newspaper-created information or data entered by the users themselves.

It’s just an idea I had reading my hometown paper online. I’m not asking for quit-my-job kind of money and my bosses know it’s a side project for me. But, if you read the application, you’ll see that I’m taking my current work and melding it with my small town upbringing. Wish me luck.

Advice for journalism students

Posted Sunday, September 30th, 2007 at 10:04 p.m. by Matthew Waite

Paul Bradshaw kicked off a minor meme titled How To Be a Journalism Student (with others posting here, here and here). All fine advice. But I thought I would chime in on something sort of related.

This past week, editors from a mid-sized northeastern daily came to the St. Petersburg Times to see how we do our thing online. With PolitiFact, I’m a stop on the tour now apparently. After a few minutes of talking to these editors about how PolitiFact was built, showing them a few other things I’m working on, and talking about Django, they started asking me where they could go to find journalists who could do data-driven web development. They, like a lot of places, really want to hire some. Where can you find them?

I had no answer.

There’s no journalism program I know of cranking out programmer journalists (Northwestern is making a go of it, and god bless Rich Gordon for doing it, but it’s a Masters in Journalism for people with undergraduate computer-science training, not an undergraduate journalism track. And it just started, so no graduates yet). There’s no Poynter course for this. There’s no NICAR boot camp.

So, without further ado, my advice to journalism students:

Learn how to put data on the web.

Like computers? Like problem solving? Not afraid of a little code? Interested in doing something new? Want to tell stories, but don’t believe a story is the only way to do it? Want more to do with the future of news? Learn how to put data on the web. I guarantee your campus has a Python users group or a Rails users group or a group of computer-science students doing this stuff right now. Find them. Find an idea. Build an app. If you can do it, and do it well, and do it with an eye toward news, editors are dying to hire you right now.