ShortURL Redux with PHP, AWS Elastic Beanstalk and DynamoDB

We certainly had our share of shortURL discussions and implementations a few years back, and the topic was well covered, including some nice work by Dave Winer and Joe Moreno on an Amazon S3-based solution to provide portability.

I’ve been wanting to play with Amazon’s Elastic Beanstalk since it started supporting PHP. And I’ve long wanted a shortURL tool that was a “set it and forget it” solution: infinitely scalable with little to no maintenance. That pretty much rules out a relational database, and in the unlikely event of extremely high usage I don’t even want to worry about hitting the limits of a single server.

Happily, with all the cloud based offerings these days we have some options. Even better, a PHP app on Elastic Beanstalk should scale (in theory) without any of the more complex setup that would be necessary if we just used EC2. There are plenty of other ways to skin this cat but, like I said, I wanted to try this method of app development.

Regarding persistent storage, back when the topic was hot, I had built a shortener based on Amazon’s SimpleDB, but that service has its limitations, and the goal here was theoretically infinite scalability and as little maintenance as possible.

I chose Amazon’s DynamoDB for those reasons.

For the record, I don’t normally choose vendor specific offerings because I hate lock-in as much as the next guy, but this was the best fit for the goals of this exercise and it’s not so complicated of a piece of software that I couldn’t make an easy port to something like MongoDB (which I’ve also been playing with and beginning to love. Go NYC!).

Maybe the coolest thing about Elastic Beanstalk is the ease of use thanks to the Git integration. Here is an excellent post about getting set up to use Git with Elastic Beanstalk. You install a Git extension that lets you configure it with the credentials that AWS needs. This is for one-way pushes only, so you’ll probably want to do development and testing locally or on another server to speed up your round-trips. James Gosling felt similarly. Though PHP doesn’t have all that overhead, it is still a bit painful to develop straight to Elastic Beanstalk. No real need to anyway.

Once you have an app set up and are able to push your commits up to it, you are ready to roll. For me that meant setting up CodeIgniter and the AWS SDK for PHP.

I unzipped CodeIgniter 2.1.0 and added it to my Git repository, committed, and pushed it using the “git aws.push” command. I looked at the app and it was serving a CodeIgniter page right out of the box.

I downloaded the latest AWS SDK for PHP and unzipped it. Making it work with CodeIgniter is pretty easy. I dropped the entire folder into application/libraries and called my folder ‘aws-sdk’.

Also in the libraries folder, we add a file called aws.php with the contents simply being:

<?php

// CodeIgniter library wrapper that simply pulls in the AWS SDK for PHP
class Aws {

    function __construct()
    {
        require_once('aws-sdk/sdk.class.php');
    }
}
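With that wrapper in place (and credentials configured as described next), a CodeIgniter controller can load the library and talk to DynamoDB directly. Here is a rough sketch of what the lookup side of a shortener might look like; the ‘shorturls’ table and ‘url’ attribute are hypothetical, and it assumes the 1.x SDK’s AmazonDynamoDB client and response style:

<?php

class Shorten extends CI_Controller {

    public function resolve($code)
    {
        // Loading the library runs the Aws constructor above, which pulls in the SDK
        $this->load->library('aws');

        // AmazonDynamoDB picks up its credentials from config.inc.php
        $dynamodb = new AmazonDynamoDB();

        // Look up the long URL for this short code
        // ('shorturls' table and 'url' attribute are made-up names)
        $response = $dynamodb->get_item(array(
            'TableName' => 'shorturls',
            'Key' => array(
                'HashKeyElement' => array(AmazonDynamoDB::TYPE_STRING => $code)
            )
        ));

        if ($response->isOK() && isset($response->body->Item)) {
            $url = (string) $response->body->Item->url->{AmazonDynamoDB::TYPE_STRING};
            header('Location: ' . $url, true, 301);
            return;
        }

        show_404();
    }
}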

Before any of that will work, you’ll need to edit the config-sample.inc.php file in the sdk folder as well. We could hard-code the AWS credentials into that file but Elastic Beanstalk provides a better way.

Rename that file to config.inc.php and where you’d put your AWS access key and secret key, instead put:

'key' => get_cfg_var('aws.access_key'),
'secret' => get_cfg_var('aws.secret_key'),

get_cfg_var() is a PHP function that returns values from the php.ini file. Elastic Beanstalk allows you to add some of these config values under the “container” tab when you edit your environment right through the AWS console. There are already two fields specifically for these two values, and you can add others if you need something similar. We will use one for our database.

While in the config file you’ll also need to set the ‘default_cache_config’ param. It’s required for DynamoDB. Set it to ‘apc’.
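Putting those pieces together, the relevant block of config.inc.php ends up looking roughly like this. This is just a sketch based on the 1.x SDK’s sample file; your version of the SDK may name or arrange things slightly differently:

<?php

// config.inc.php (renamed from config-sample.inc.php)
CFCredentials::set(array(
    'development' => array(
        // Pull the keys from php.ini values set on the Elastic Beanstalk "container" tab
        'key' => get_cfg_var('aws.access_key'),
        'secret' => get_cfg_var('aws.secret_key'),

        // APC caching is required for DynamoDB
        'default_cache_config' => 'apc',
        'certificate_authority' => false
    ),

    '@default' => 'development'
));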

If you do find yourself working on a live Beanstalk app, one suggestion I would make is to use a stable file as the healthcheck URL. If for some reason your load balancer can’t get an HTTP 200 from that URL, your app health will go to red status, you won’t be able to contact it, and you’ll waste considerable time connecting directly to the EC2 instance or rolling back to older versions until you aren’t dead in the water. It seems a little flaky in that regard, or maybe that’s just a compromise one needs to make to get the benefits of using an environment like this.
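One simple way to do that is to point the healthcheck at a file that bypasses the framework entirely, so a bug in the app itself can’t take the environment red. A minimal sketch (health.php is a hypothetical name; drop it in the web root and set it as the healthcheck URL in the environment settings):

<?php

// health.php - a deliberately boring healthcheck target that the
// load balancer can always get an HTTP 200 from
header('Content-Type: text/plain');
echo 'OK';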

Fun with GeoPoints

If you are in the news business you are likely thinking about location based services and geo-content in general.

In my role as a developer at Digital First Media, it is an especially important topic as we try to build centralized sites and services that can work across hundreds of local news sites.

It makes sense for the business. Location based marketing is an area we need to be leaders in. We always have been, and the fast changing and still emerging landscape in both mobile and personalization means there is still plenty of room to carve out our share.

This is particularly important to organizations like ours with legacy print products. Zoned inserts are still a big part of the newspaper business, but a scary one too, because the advances in location based targeting online means a significant disruption is ahead. When the money moves online, we need solutions in place to fulfill the needs of our local customers.

Now, it’s not that I suddenly drank some kool-aid being served in the sales department open-house. I don’t get invited to those anymore. I’m mentioning this stuff as a developer because there is a truly awesome part to this all.

This location based personalization is often exactly what our users want as well. It is a rare opportunity in this industry that the stars align so that editorial, technology, sales AND (most importantly) the site users all want the same thing.

I’m so bullish on this, I’d say that if the industry can execute well on location based services online it could revive a lot of companies. But I think we need to be smart about it. No one department can be “product owner” here. We need to build things that users love. The rest will fall in line.

Thanks to Superstorm Sandy, I had some downtime where I could test a few technologies out. During the election, using some reverse geocoding tools available from OpenStreetMap, I was able to aggregate tweets mentioning either “Romney” or “Obama” and put them into buckets based on their locations.

Earlier this year, I had done some prototypes using the Alchemy API, in conjunction with a number of experiments being done for the Citizens’ Agenda project at Jay Rosen’s NYU Studio 20. I grabbed some old code left on the editing room floor and was able to do some sentiment analysis on those aggregated tweets, to try and track the aggregate mood for each term.

The results were surprising to me. The analysis returned more positive results for the term “Romney” leading up to election night. Once it became clear Obama was the winner, the term “Obama” rose a bit, as might be expected, though the overall rating for that term stayed negative the whole time. I can think of lots of cool visualizations and uses for this type of data. Maybe an opportunity for an upcoming hackathon.

While it is clear we aren’t predicting elections yet, I see a lot of potential in this space for all kinds of uses in journalism and marketing.

At Digital First Media, I’m working on some server side geo-location tools, and I also see many of our journalists thinking about maps and location from their end. The tools out there range from expensive and closed to free and open. I personally want two things: automation and APIs. Using Fusion Tables and Maps you can cobble together some cool things. I also want persistent data. This needs to grow over the years and become more valuable as it does so.

I’m a big fan of PostgreSQL, but it was time to learn a little MongoDB, which also has a dead simple way to do 2D location searches with its $near query.

Anyhow, I was able to pretty quickly get a small app running that does one thing very simply. It stores GeoJSON content and then returns that content based upon whether it is within a certain radius of the coordinates you query for.
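The core of that radius query is just the $near operator mentioned above. Here is a rough sketch using the legacy PHP Mongo driver that was current at the time, with made-up database, collection and field names:

<?php

// Connect and grab a collection (names here are hypothetical)
$mongo = new Mongo();
$points = $mongo->geopoints->points;

// A 2d index on the coordinate field enables $near queries
$points->ensureIndex(array('loc' => '2d'));

// Find everything within roughly 5km of a point.
// With a plain 2d index, $maxDistance is in degrees (~111.12 km per degree).
$cursor = $points->find(array(
    'loc' => array(
        '$near' => array(-73.97, 40.77),   // [longitude, latitude]
        '$maxDistance' => 5 / 111.12
    )
));

foreach ($cursor as $doc) {
    echo json_encode($doc) . "\n";
}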

Check it out at geopoints.org or directly on GitHub at https://github.com/mterenzio/geopoints.

New Journalism = Design For Manufacturability #DFM

As my friend Dave Winer often exclaims: Oy!

The recent Chapter 11 filing of Journal Register Company, where I am a web developer, has brought out the “I told you so” crowd. “Look, now do you believe your digital first, free access, crowd sourced, open sourced etc. etc. etc. strategy is just a fantasy?”

Nope.

As did Matt DeRienzo (no relation), I’ll start by saying I can comfortably write this post because my company is great. I can remember being cornered at my desk by a red-faced, vein-bulging vice president over a blog post I wrote in the past, when I hadn’t even mentioned the company I worked for. I merely mentioned that the newspaper industry was in a severe revenue decline and clueless about business. Oh, and this . . .

Now imagine I’m comfortable saying this:

My company currently sucks too, and is clueless . . .

. . . no black helicopters . . .

. . . because we respect that. And we want to change. We want to survive, even if we have to change.

Even. If. We. Have. To. Change.

Enough drama, let’s address the facts, or lack thereof.

1. If “Digital First” is a good strategy, why the bankruptcy?

I have no idea why the move was deemed right. Michael Wolff thinks he knows, and the piece is an interesting analysis of debt, but ultimately has less substance than John Paton’s beard, which he mentions a few times. I can only say that when the last bankruptcy happened, no one claimed the management was tossing away valuable traditions. In fact, I think they were. The most valuable. Their community. They did so by cutting journalism to make more profit. Anyone following DFM knows we aren’t doing that. We adore the community. We let them into our editorial meetings, for crying out loud. But there are efficiencies that need to be achieved, and more will come in the future. We need to be lean and agile. Not something necessary when you write for some of the publications that have criticized us. This is a for-profit business. And by the way, if anyone can let us know what valuable tradition we did jettison, I can guarantee I can convince the company to try to get it back. Or maybe we have a different definition of valuable. Critics might have meant costly, because those operations certainly need to go.

2. Other companies are already moving on from the “free content” model.

Yup. Unfortunately, the rest of the web has not. And, for the record, I’m not against paying for content. It needs to be highly personal, highly unique. Some local newspapers dip their toes in those waters, but for the most part, it isn’t there (at least yet). When it is, I’ll pay. Right now, you got a wish. Again, I’ll pay for a little bit of something for me, not a lot of some stuff made for everybody, most of which I don’t care about.

3. “crack the code that pays for news-gathering that communities need.”

Hell yeah. Tryin’. If you have the answer, say it out loud. But the fact is, the old way can’t support it, or else we wouldn’t be having this conversation. So you criticize a new try? If the old ways can’t do it, we need both new methods and new business models. And I promise you that DFM is working on both. Are we there yet? Obviously not, and no one said we were. In fact, Paton himself has said this is just the beginning. Lots and lots of work ahead. It is hard work, you know?

4. Lack of transparency.

Yeah, I hear ya. It is a private company, so the earnings are private, and that’s America. Transparency in all things isn’t an obligation. We digital folks push it when it can bring greater good, or is morally right, not for its own sake. There is no moral issue in this case, IMHO. Privacy is a good thing too, right? This isn’t Facebook. ; )

But really, it’s the difference between a private company and a public one. Are the critics saying all companies should be public?

A good point has been made that “revenue is way up, but up from what?” I agree, it probably was low to begin with, though pretty good if you think about the percentages that Paton has released and how much print revenue a company the size of Journal Register produces. But it’s frustrating to an onlooker, so I sympathize here.

What I do see internally tells me that it is an all out push to stack those digital dimes. More important is finding innovative ways to create new revenue. We are in the early stages, but I’ve seen more innovative ways of producing revenue through software at this company than anywhere else in my career, and I think we will be impressed when we look back at what was accomplished here.

Again, there is still tons of learning and improvement needed, but I’ve never been as optimistic about my company’s digital strategy as I am now. Expect great things very soon on both the journalism and business model fronts.

5. Journalism is of zero value.

Again that word, value. And, I think, a misunderstanding of the quote. For my take, all old value and markets are now worthless. We have new values, new marketplaces, not based upon distribution but on networks. Not based on reach, but on targeting. Not controlled by companies, but controlled by the users, the people formerly known as the audience.

I could go on and on, but I don’t care to. I’m only just finishing this post many days later because I got distracted with some serious family matters, the good result of which is that this post is much kinder and gentler than it was last week and there is no need to get sore over business.

It is just business, right?

If the CEO of a public company doesn’t act in the interest of the shareholders, she might go to jail. John Paton doesn’t have that legal obligation. He does his best anyway. Nothing to criticize there.

We all need to work together to find more efficient ways to inform each other. That’s all. We aren’t done yet.

New journalism needs to redesign itself, saving the good parts, but making sure the new design can be manufactured at scale. Design For Manufacturability. Maybe that is what DFM stands for.

Or as Dave Winer once told me when I was worrying about an app I was building, it Doesn’t Frigin’ Matter. That’s a paraphrase. Dave didn’t say frigin. ; )

Newspapers are way ahead of their time

If you’ve arrived here, it’s probably because you think the headline sounds insane and you are ready to leave a comment telling the author what a nutbag he is. Well, maybe, but not for that reason.

Lately I’ve been watching a lot of traditional news companies and noticing that their strategies, when reasonable at all, are strategies that belong in a different time period. Often that time is in the past, but in some cases that time has not come yet.

Let me explain.

As an example, most farmers use machinery, not livestock, to plow fields these days. In fact, it’s probably a necessity if one wants to compete. But back in the 1800s there was a time when a farmer wished he could replace his livestock with a machine, but it was impossible, at least for him. The steam engine hadn’t been invented, or if it had, it really wasn’t feasible to create your own tractor. Another industry needed to mature first, and only then could a farmer transform his business.

Eventually it became necessary for the farmer to transform, or die. And in between, lots of farmers that were slow to transform went out of business.

Right now, the internet is at the point where the steam engines are still being invented and perfected, and an industry is blooming around that business, the business of web services and software.

While I understand that produce didn’t quite become the commodity that content has become, the ability to mass produce it eventually pushed a lot of small farmers to the brink.

A better analogy would be that the steam engine allowed individuals to create food and they didn’t need farms, but that’s not my point.

My point is that when I see traditional news companies trying to leverage users to crowdsource content, I see someone putting the cart way before the horse, or ehhh, steam engine.

The businesses that are valuable right now are the ones that enable and empower users, the ones that provide steam engines. What everyone does with those engines remains to be seen. Tons of things, I’m sure. It’s happening as we speak.

Value used to be in production and distribution. Now it’s in providing tools. If we provide our communities tools, they will sow the fields. That of course, is what Twitter, Facebook and Google are doing. They are becoming stewards of the land. They provide the land and tools and take a share in return. That’s a drastic difference from trying to get communities to work for free, which is what a lot of crowdsourcing attempts look like to me.

News companies should worry more about providing their communities the tools to be a community and leave the cultivation of those crops to the second wave of the revolution.

Citizens’ Agenda, Fusion Tables and JSON (Show your work)

There is a lot of positive buzz around “The Citizens’ Agenda” project that I’m involved with. You can see all the great work that Studio20 and The Guardian are doing here. And don’t forget to tweet any #unasked questions you might be thinking about.

This initial set of data being published is a set of 839 categorized questions asked at 20 Republican presidential debates. The results were stored in a Google Fusion Table. While not completely normalized, there are a lot of interesting queries and visualizations to be made with the data.

My initial thought was to get it into JSON format. Then we could build on top of it and query it in more fruitful ways.

I had never even seen Google Fusion Tables. I take that back. I had watched a demo at the Knight-Mozilla OpenNews “hacktoberfest” in Berlin in September by Chris Keller, but had never used them myself.

So I said, “Hey Chris, what do I do?” He provided some examples using jQuery, which were really helpful for seeing how the API works in general.

And he pointed to some existing ways to get JSON back from the API including Fusion Tables to JSON and a very interesting undocumented feature of the Google API itself.

As I mentioned, the table data was not quite normalized. For example, in a column called “Question directed to,” the value could be “Romney” or “Romney, Paul.”

So a query for “Romney” might miss that second example, since matches need to be exact. Also, the column with the date had additional text in that field describing the particular debate.

In other words, getting meaningful queries was going to take some text processing from within the application. So I thought I’d try out the PHP API client that Google provides.

Here is the result. This example returns the entire data set in JSON. I will follow up with some more specific queries. You can see a live example at http://citizensagenda.net/debate-questions

<?php

// include the Fusion Tables API client library
// http://code.google.com/p/fusion-tables-client-php/
require('../classes/fusiontables/clientlogin.php');
require('../classes/fusiontables/sql.php');
require('../classes/fusiontables/file.php');

// PHP's built-in CSV parsing breaks on the embedded new lines, so use the parsecsv library
// http://code.google.com/p/parsecsv-for-php/
require_once('../classes/parsecsv.lib.php');

// authorize (use your creds)
$token = ClientLogin::getAuthToken('my-google-username', 'my-google-password');
$ftclient = new FTClientLogin($token);

// select the data (you must know the fusion table id)
$fusiontableid = 'fusion-table-id';
$csvreturn = $ftclient->query(SQLBuilder::select($fusiontableid));

// instantiate parsecsv object and process the google data
// this populates $csv->data with an array
$csv = new parseCSV();
$csv->auto($csvreturn);

// encode the array into JSON, set the content-type header and print it out
$json = json_encode($csv->data);
header('Content-type: application/json');
echo $json;
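And as a small taste of the follow-up queries, the same script can do the light text processing described above in PHP rather than in the Fusion Tables SQL. For example, swapping the last three lines for something like this returns only the questions directed (even partly) at Romney; the column name is taken from the Fusion Table described earlier:

// Filter for any question directed at Romney, even when the field
// holds a list like "Romney, Paul"
$romney = array();
foreach ($csv->data as $row) {
    if (isset($row['Question directed to']) &&
        stripos($row['Question directed to'], 'Romney') !== false) {
        $romney[] = $row;
    }
}

header('Content-type: application/json');
echo json_encode($romney);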