17

May

Mashape - The Cloud API Hub

create, share(, and monetize) your own APIs

obfuscurity/descartes · GitHub

Graphite dashboarding, though having to store config data in PostgreSQL and OpenID state in Redis seems a bit overkill to me.

Reinout van Rees: Growing open source seeds - Kenneth Reitz

via reinout.vanrees.org

He shows us three kinds of (more or less) open source projects.

Type 1: public source

Once upon a time there was an “open source project” called the facebook SDK. Basically it just stopped working one day and nobody could help, despite offers for help on the issue tracker. Hacker news got wind of it and it was on the front page for a while. Facebook’s reaction? Disabling the issue tracker… (Later on they fixed it).

That’s not open source, that’s public source. Often it is abandoned due to lack of interest, a change of focus, or the like. The motivation for having it as open source simply is not clear.

Type 2: shared investment

A different example: gittip. They aim to be the world’s first open company. There’s a github issue for everything, even the company name. Major decisions are voted for on github. The code is open source, of course. All interviews with journalists are filmed and live-streamed. And all otherwise-often-backdoor-cooperation-agreements are fully open.

Projects like gittip are shared investment projects. Shared ownership, extreme transparency. There is very little questioning of motivations. The motivation is clear and public. There’s a documented process for new contributors. The advantage? It is low risk. There’s a high bus factor.

Type 3: dictatorship project

Kenneth is the author of requests. An open source project, very successful. But all the decisions are made by Kenneth.

That’s really more of a dictatorship project. A totalitarian BDFL that owns everything. The dictator is responsible for all decisions. Requests’ value lies in its extreme opinions. If he involved more people, the value would be diluted. There are drawbacks. A low bus factor. High risk of burnout: Kenneth is the single point of failure.

Lessons learned

  • Be cordial or be on your way. As a user, you need to keep all your interactions with the maintainer as respectful as possible. The maintainer put a lot of work in it and they don’t owe you any of their time.

    As a maintainer, you also must be cordial. Be thankful for all contributions. Feedback is the lifeblood of your project, even the negative. You’ll need to ignore non-constructive comments. Be careful with the words you choose; sometimes contributors take what you say VERY personally. You might have to educate your users. And: a bit of kindness goes a long way.

  • Sustainability is almost the biggest challenge. Don’t burn out. Try to get others to help.

    He quotes Wes Beary: “open source provides a unique opportunity for the trifecta of purpose, mastery and autonomy”. Pay equal attention to all of these three. Learn to do less, focus more on your purpose, for instance.

  • Learn to say no. People ask for crazy features. Or they submit quite sane pull requests that, if you allow them all in, make your project slow and unfocused. Kenneth wants as few lines of code as possible in his project. Negative diffs are the best diffs!

  • Open source makes the world a better place. Don’t make it complicated!

16

May

The Stephen King Universe

via www.coolinfographics.com

The Stephen King Universe infographic poster

Are you a Stephen King fan? Have you yourself made these connections? From TessieGirl, The Stephen King Universe has been updated to include the many connections to the Dark Tower series.

When I was in Grade 5 (guess I was ten), my friend Tarnya Smyth brought her mum’s battered copy of Stephen King’s ‘Carrie’ to school. We broke it into about 4 pieces and passed them around, all taking turns reading each battered section. I told mum about it and she FLIPPED HER WIG and told me to ‘Stop reading that book immediately!!’ So I finished it.

Now, I TOTALLY do not recommend  ten year olds reading Stephen King books (messed me up good), but this was when my life long relationship with Mr King began. My love for his books is based around his characters. They are so full. I love Stephen King dialogue. I love his sense of humour. And I love the links and connections between the books. I am the kind of annoying person who likes to know the ‘In Joke’. So, of course, I MADE A FLOW CHART!!!

This chart is like my fourth child. Be kind to it. It means a lot to me.

I wish they had published a higher-resolution version online. Some of the text is too small to read, but I think I can follow all of the connections. A must have for any Stephen King fan!

Also, it’s available for purchase as a poster from the TessieGirl site for $25 plus shipping from Australia. You can also see the original version.

Thanks to Becky for sending in the link!

14

May

P:R Approved: Kenneth Rocafort’s Cyborg Superman!

via www.tencentticker.com

Note: The New 52 ain’t all bad. Superman artist Kenneth Rocafort has given new life to one of the most memorable (and sometimes derided) DC creations from the 1990s: Cyborg Superman. Moving past the cyborg stereotype perpetrated in the 1990s, Rocafort’s design melds Otomo aesthetics with the segmented armor approach of the New 52, making it actually work. The fact that this also looks like a Bruce Campbell gives this one bonus points. – Chris A.

11

May

Stuff The Internet Says On Scalability For May 10, 2013

via highscalability.com

Hey, it’s HighScalability time:


(In Thailand, they figured out how to solve the age-old queuing problem!)

Don’t miss all that the Internet has to say on Scalability, click below and become eventually consistent with all scalability knowledge…

10

May

Our NYC Data Engineering events are live!

via g33ktalk.com

Because of NYC startups’ interest in big data technologies we’ve recently launched a brand new data engineering meetup. The meetup is for engineers only and features New York’s top startups presenting their learnings on building real-world data processing architectures.

We are having a talk by bit.ly on their recently open sourced data processing technology next Wednesday, come join us!

May 15th: Realtime Distributed Message Processing at Scale with NSQ
(Matt Reiferson from Bit.ly speaking)
http://www.meetup.com/NYC-Data-Engineering/events/113291272/

Seven Tenets of Quantitative Data Presentation

via www.perceptualedge.com

Presenting quantitative information is a specialized form of communication. Like all forms of communication, quantitative data presentation is most effective when we follow a few best practices, such as the following seven tenets.

  1. Know your data. Until you understand the stories that live in your data, you can’t begin to tell them.
  2. Know your audience. Unless you understand what matters to your audience, you won’t know what is of interest and use to them.
  3. Determine your message. Every dataset contains multiple stories. You can’t tell them all at once. Before you present quantitative information, you must determine the specific message or messages that you want to communicate. Start by writing a sentence or two or three to express the message before moving on to determine the ideal means of expression.
  4. Reduce the data to what’s needed to communicate the message. Pare the data down to the essence of what your audience must see to understand the message. What’s essential usually involves more than a simple set of primary values (e.g., monthly sales figures), for without context in the form of comparisons, numbers mean little. For example, monthly sales figures compared to target values or to values for the same months last year are more meaningful than sales figures alone.
  5. Determine the best means of expression. Some quantitative messages are best communicated with words, some with tables of numbers, some with graphs, and some with a combination. Some messages are best displayed in a bar graph, some in a line graph, some in a scatter plot, and so on. Knowing which form of expression works best for the message that you’re trying to present requires a little training into how our eyes and brains process visual information. The principles are easy to learn, but they aren’t intuitive. I wrote the book Show Me the Numbers, in part, to teach these principles.
  6. Design the display to communicate simply, clearly, and accurately. Include nothing that isn’t data unless it’s needed to support the data. Unnecessary color variation and visual effects, or even grid lines in a graph when they aren’t needed, will detract from the message. Non-data elements that are needed should only be visible enough to do their job and never so visible that they call attention to themselves. Non-data elements should sit politely in the background so the information stands out clearly in the foreground. If some information is more important to the message than other information, do something visual to feature it. For example, a brighter color or thicker stroke would make a particular line in a line graph stand out more than the others.
  7. Suggest a way to respond. Whenever possible, make it easy for your audience to respond with appropriate action by suggesting specific steps. Most quantitative messages aren’t presented merely to inform, but also to motivate a useful response.
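
Tenet 4's point about context can be sketched in a few lines of Python. The example below pairs monthly sales with targets and reports the percent difference, which is the kind of comparison that gives the raw numbers meaning. The figures and the `variance_report` name are invented for illustration and do not come from the original article.

```python
def variance_report(sales, targets):
    """Pair each month's sales with its target and the percent difference."""
    report = []
    for month, actual in sales.items():
        target = targets[month]
        pct = (actual - target) / target * 100
        report.append((month, actual, target, round(pct, 1)))
    return report


# Invented figures, for illustration only.
sales = {'Jan': 95000, 'Feb': 110000, 'Mar': 120000}
targets = {'Jan': 100000, 'Feb': 100000, 'Mar': 125000}

for month, actual, target, pct in variance_report(sales, targets):
    print(f'{month}: {actual:>7,} vs target {target:>7,} ({pct:+.1f}%)')
```

A table like this one is often enough on its own; whether a graph would serve the message better is the question tenet 5 takes up.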

Take care,

09

May

Beware the Straw Man

via www.perceptualedge.com

A “straw man” is a flawed form of argument that occurs when one side attacks a position that isn’t actually held by the other side (the “straw man”) and then acts as though the other side’s position has been refuted. People usually construct straw men when they cannot legitimately refute an opponent’s position. As such, a straw man is a dishonest and fallacious form of argument, but one that can be persuasive when the audience is not aware of the facts.

I learned about straw men as an undergraduate majoring in communication studies. I loved the course that I took in argumentation and debate back then because I found the rules of logic elegant, interesting, and easy to understand. I vividly remember, however, that most of my classmates didn’t take so naturally to these principles and frequently struggled to make their case. I’m ashamed to admit that I took far too much pleasure in tying my opponents into logical knots and luring them into logical traps.

Since those bygone days of youth, I have expanded what I learned in college by keeping up with work in the fields of critical thinking and brain science. I am now familiar not only with the rules of rational argument but also with many causes of flawed thinking. I have found, to my great disappointment, that this is not common knowledge, even among scientists and analysts. I am no longer surprised when academics in the field of information visualization—doctoral students and professors—conduct studies that are flawed in obvious ways.

I was prompted to think about straw men recently when I encountered a couple on the Web that were apparently constructed to fault the work of people like me who teach data visualization best practices. The first appeared in a recent series of articles about data visualization on the Harvard Business Review’s (HBR) website. I was invited to contribute an article to this series, but unfortunately didn’t have the time. I wish I could have participated, however, to correct the portrayal of business-related data visualization as skewed toward elaborate infographics rather than the simple uses of quantitative graphics that make up around 99% of the data visualizations created in organizations. The straw man that I noticed was constructed by Amanda Cox of the New York Times. I greatly admire the data graphics of the New York Times, including Amanda’s work in particular. Cox is an articulate spokesperson for journalistic uses of data visualization. For this reason, I was surprised when I read the following interaction in HBR’s interview with Amanda (emphasis mine):

[HBR]: It seems like there’s more focus on trying to get data viz to go viral than to make it “matter.”

[Amanda Cox]: There’s a lot where not much actionable comes out of it. I don’t know if the ratio is different from the ratio of bad writing to good, or bad restaurant openings to good, but I think it’s an important idea to focus on. There’s a strand of the data viz world that argues that everything could be a bar chart. That’s possibly true but also possibly a world without joy.

I appreciated almost everything that Amanda said except the two sentences that I’ve highlighted above, which appear to be a jab at data visualization practitioners who promote the use of simple graphs over some of the elaborate (but often ineffective) infographics that routinely appear on the Web. Amanda’s statement is a straw man. No one “argues that everything could be a bar chart.” Anyone who did would not only be robbing the world of joy but also of meaning. Bar graphs are one effective means of displaying data among several, and they are only appropriate for particular data sets and purposes. I’m not sure why Amanda felt compelled to insert this little goad of a comment in the interview. If she has an actual case to make, she can surely do better than this.

On April 17th, I encountered a similar straw man constructed by Nathan Yau in his blog (emphasis mine):

Data is an abstraction of something that happened in the real world. How people move. How they spend money. How a computer works. The tendency is to approach data and by default, visualization, as rigid facts stripped of joy, humor, conflict, and sadness—because that makes analysis easier. Visualization is easier when you can strip the data down to unwavering fact and then reduce the process to a set of unwavering rules.

The world is complex though. There are exceptions, limitations, and interactions that aren’t expressed explicitly through data. So we make inferences with uncertainty attached. We make an educated guess and then compare to the actual thing or stuff that was measured to see if the data and our findings make sense.

Data isn’t rigid so neither is visualization.

Are there rules? There are, just like there are in statistics. And you should learn them.

However, in statistics, you eventually learn that there’s more to analysis than hypothesis tests and normal distributions, and in visualization you eventually learn that there’s more to the process than efficient graphical perception and avoidance of all things round. Design matters, no doubt, but your understanding of the data matters much more.

I agree with everything that Nathan says here, but not with what he implies in the text that I’ve highlighted. His comment about “efficient graphical perception and avoidance of all things round” appears to be a direct reaction to my position, but one that he’s morphed into a straw man. No one argues that there isn’t more to data visualization than perceptual efficiency and circle avoidance. (I suspect that Yau’s phrase “all things round” refers to an article that I wrote in 2010, “Our Irresistible Fascination with All Things Circular.”) No one who promotes the importance of efficient and accurate graphical perception argues that design matters more than understanding. In fact, it is our concern that people understand data clearly, accurately, and as fully as possible that leads us to teach people how to present data graphically in ways that work for human perception and cognition. There is indeed much more to data visualization than a rigid set of design rules, which is why, when I teach design principles, I do so in a way that enables my students to understand how and why these principles work so they can apply, bend, and sometimes break the rules intelligently.

What’s ironic about Yau’s claim is that he often features infographics as exemplary that are beautiful or otherwise eye-catching, but yield little understanding. Such examples can easily be found in his lists of the best data visualizations of the year. Given his training as a statistician, I’ve always found this puzzling.

Making data visualizations perceptible is not all there is, but it is certainly an essential requirement if we want people to understand what we’re trying to say. I’m sure that Cox and Yau agree, but they seem willing at times to sacrifice perceptual effectiveness for visual allure. When they do, understanding is diminished. There is no reason why perceptual effectiveness and visual allure cannot coexist. Leaders in the field of data visualization don’t always agree, but when we disagree and wish to state our case, we should build it on solid evidence and sound reason. Dismissive remarks and thinly veiled insinuations that aren’t accurate or backed by evidence don’t qualify as useful discourse.

Take care,

08

May

Not Invented Here: A Comical Series on Scalability

via highscalability.com

I read one of these poignantly humorous comics on Not Invented Here a while back and since I wasn’t sure it was OK to repost I emailed asking for permission. Nada. Then I saw Martijn de Vrieze posted a collection of scalability comics from NIH and decided what the heck (click image to read on site):

Thanks to Martijn for curating the collection and NIH for creating them.

And I agree with Martijn, they do capture an ineffable quality about the entire space.

07

May

We're all starting to track ourselves

via petewarden.typepad.com

Mapscreenshot

We’re releasing a massive and growing amount of information about who we are, where we go, and when. There are hundreds of millions of public checkins already out there, and millions more are being created every day. People think of Foursquare as the leading source, but actually Instagram, Facebook, Twitter, Flickr, and Google Plus all produce incredible numbers of geo-located checkins, some of them many, many more than Foursquare.

This is going to cause big changes in our world. We’ve already taught our computers what we buy and read, now we’re telling them where we spend our real-world lives. Just our presence at a location at a particular time becomes powerful data when it’s combined with all the other people doing the same thing. We’re instrumenting our movements at a very detailed level, and sending them out into the ether. Even more amazingly, we’re adding high-resolution photos and detailed comments to the checkins.

It’s hard to overstate how effective this data can be at solving intractable problems. Economists, sociologists, and epidemiologists would kill to have detailed pictures of the lives we lead at this kind of scale. There will be applications we haven’t even thought of too, connecting us with people we should be talking to, introducing us to new experiences, all sorts of feedback that will change how we live.

It’s a scary new world to contemplate too of course, which is why I keep blogging about what I’m up to. Recently I’ve been working with my team at Jetpac analyzing billions of photos from all sorts of social sources, to help both tourists and locals figure out where to go and what to do. I want to share an internal tool we use to explore the data, a map interface to the checkins that people have shared publicly. If you want to get a concrete feel for how our world’s changing, check it out:

https://www.jetpac.com/map

It’s still an experimental tool so apologies for any bugs, but I hope you find this glimpse of the mountain of public data we’re all creating as fascinating as I do! You can find all of the individual photos and other checkins out there on the public web, but seeing them accumulated together in one place still blows my mind.

06

May

It will come! Trusted sources confirm launch of Fuji entry level interchangeable lens camera

via www.fujirumors.com

Now we know: it will come! According to trusted sources, Fuji is about to launch an entry level interchangeable lens camera this summer. It should be smaller than the X-E1, with fewer control buttons or dials… and no viewfinder. I’m working to find out more. Is there someone out there who could make my work easier? I’m all ears ;)

A new source told me that the price (body + lens) should be $550 and that it should launch in July.

That’s all for now
Patrick

P.S.: You’re not interested in entry level cameras? Then check out this new X-PRO1 body for $1,049 at ebayUS here (4 available).

P.P.S.: The X100S is now in stock on AmazonUS (via third party reseller) for normal price here (only 2 left!).

03

May

Creating Map Visualizations in <10 lines of Python

via wrobstory.github.com

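import vincent

#Path to a geoJSON file of country outlines
world_countries = r'world-countries.json'
#Output file for the Vega spec (placeholder name, pick your own)
path = 'world.json'
world = vincent.Map(width=1200, height=1000)
world.geo_data(projection='winkel3', scale=200, world=world_countries)
world.to_json(path)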

World

One of my goals when I started building Vincent was to streamline the creation of maps as much as possible. There are some excellent Python map libraries out there; see Basemap and Kartograph for more fun with maps. I highly recommend both of those tools, as they are both quite good and very powerful. I wanted something a little simpler that relies on the power of Vega and allows for simple syntax: point to geoJSON files, specify a projection and scale/size, output the map.

For instance, layering sets of map data in order to create more complex maps:

vis = vincent.Map(width=1000, height=800)
#Add the US county data and a new line color
vis.geo_data(projection='albersUsa', scale=1000, counties=county_geo)
vis + ('#2B4ECF', 'marks', 0, 'properties', 'enter', 'stroke', 'value')

#Add the state data, remove the fill, write Vega spec output to JSON
vis.geo_data(states=state_geo)
vis - ('fill', 'marks', 1, 'properties', 'enter')
vis.to_json(path)

USA Map

Additionally, choropleth maps were begging for a binding to the Pandas DataFrame, with data columns mapping directly to map features. Assuming a 1:1 mapping from geoJSON features to column data, the syntax is very straightforward:

#'merged' is the Pandas DataFrame
vis = vincent.Map(width=1000, height=800)
vis.tabular_data(merged, columns=['FIPS_Code', 'Unemployment_rate_2011']) 
vis.geo_data(projection='albersUsa', scale=1000, bind_data='data.id', counties=county_geo)
vis + (["#f5f5f5","#000045"], 'scales', 0, 'range')
vis.to_json(path)

Choropleth
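
To make the two-color range in that example more concrete, here is a minimal sketch of how a linear scale could turn a data value into a fill color. It does not use Vincent or Vega. The `hex_to_rgb` and `linear_color` helpers and the sample domain are hypothetical, chosen only for illustration, and Vega's own interpolation may differ in detail.

```python
def hex_to_rgb(color):
    """Convert '#rrggbb' to an (r, g, b) tuple of ints."""
    color = color.lstrip('#')
    return tuple(int(color[i:i + 2], 16) for i in (0, 2, 4))


def linear_color(value, domain, color_range):
    """Linearly interpolate between two hex colors.

    domain is (min, max) of the data; color_range is (low_hex, high_hex).
    Values outside the domain are clamped to the nearest end.
    """
    lo, hi = domain
    t = 0.0 if hi == lo else (value - lo) / (hi - lo)
    t = max(0.0, min(1.0, t))
    start, end = (hex_to_rgb(c) for c in color_range)
    mixed = [round(s + t * (e - s)) for s, e in zip(start, end)]
    return '#{:02x}{:02x}{:02x}'.format(*mixed)


# Same color range as the choropleth example; the domain is made up.
colors = ('#f5f5f5', '#000045')
domain = (2.0, 20.0)
low = linear_color(2.0, domain, colors)
high = linear_color(20.0, domain, colors)
print(low, linear_color(11.0, domain, colors), high)
```

The lowest value in the domain gets the first color in the range, the highest gets the second, and everything in between is blended proportionally.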

This isn’t without a little data wrangling and transformation: the user needs to ensure that there is a 1:1 mapping of keys in the geoJSON to row keys in the Pandas DataFrame. Here is what was required to get a clean DataFrame for mapping in the previous example. Our county data is a CSV with FIPS code, county name, and our economic data (column names withheld):

00000,US,United States,154505871,140674478,13831393,9,50502,100
01000,AL,Alabama,2190519,1993977,196542,9,41427,100
01001,AL,Autauga County,25930,23854,2076,8,48863,117.9
01003,AL,Baldwin County,85407,78491,6916,8.1,50144,121
01005,AL,Barbour County,9761,8651,1110,11.4,30117,72.7

Our county polygons are in a geoJSON file with FIPS codes as the ids (thanks to the folks at Trifacta for this data). The actual polygons have been truncated here for brevity; see the example data for the complete dataset:

{"type":"FeatureCollection","features":[
{"type":"Feature","id":"1001","properties":{"name":"Autauga"}
{"type":"Feature","id":"1003","properties":{"name":"Baldwin"}
{"type":"Feature","id":"1005","properties":{"name":"Barbour"}
{"type":"Feature","id":"1007","properties":{"name":"Bibb"}
{"type":"Feature","id":"1009","properties":{"name":"Blount"}
{"type":"Feature","id":"1011","properties":{"name":"Bullock"}
{"type":"Feature","id":"1013","properties":{"name":"Butler"}
{"type":"Feature","id":"1015","properties":{"name":"Calhoun"}
{"type":"Feature","id":"1017","properties":{"name":"Chambers"}
{"type":"Feature","id":"1019","properties":{"name":"Cherokee"}

We need to match the FIPS codes and ensure that the matches are exact, or Vega won’t zip the data properly:

import json
import pandas as pd
#Map the county codes we have in our geometry to those in the
#county_data file, which contains additional rows we don't need
with open(county_geo, 'r') as f:
    get_id = json.load(f)

#Grab the FIPS codes and load them into a dataframe
county_codes = [x['id'] for x in get_id['features']]
county_df = pd.DataFrame({'FIPS_Code': county_codes}, dtype=str)

#Read into Dataframe, cast to string for consistency
df = pd.read_csv(county_data, na_values=[' '])
df['FIPS_Code'] = df['FIPS_Code'].astype(str)

#Perform an inner join, then forward-fill NAs with the value from the
#preceding row (the previous county in file order)
merged = pd.merge(df, county_df, on='FIPS_Code', how='inner')
merged = merged.ffill()

>>> merged.head()
  FIPS_Code State       Area_name  Civilian_labor_force_2011  Employed_2011  \
0      1001    AL  Autauga County                      25930          23854
1      1003    AL  Baldwin County                      85407          78491
2      1005    AL  Barbour County                       9761           8651
3      1007    AL     Bibb County                       9216           8303
4      1009    AL   Blount County                      26347          24156

   Unemployed_2011  Unemployment_rate_2011  Median_Household_Income_2011  \
0             2076                     8.0                         48863   
1             6916                     8.1                         50144   
2             1110                    11.4                         30117   
3              913                     9.9                         37347   
4             2191                     8.3                         41940

   Med_HH_Income_Percent_of_StateTotal_2011  
0                                     117.9  
1                                     121.0  
2                                      72.7  
3                                      90.2  
4                                     101.2
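
A quick way to confirm the 1:1 mapping before handing anything to Vega is to compare the two key sets directly. The sketch below uses only the standard library. `normalize_fips` and `compare_keys` are hypothetical helpers, not part of Vincent or Pandas, and the inline data are small stand-ins built from the samples shown earlier. Note that the CSV sample carries leading zeros ('01001') while the geoJSON ids do not ('1001'), so the keys are normalized before comparing.

```python
import csv
import io
import json


def normalize_fips(code):
    """Strip leading zeros so '01001' and '1001' compare equal."""
    return str(code).strip().lstrip('0') or '0'


def compare_keys(geo, rows, key='FIPS_Code'):
    """Return (ids only in the geoJSON, keys only in the table)."""
    geo_ids = {normalize_fips(f['id']) for f in geo['features']}
    row_ids = {normalize_fips(r[key]) for r in rows}
    return geo_ids - row_ids, row_ids - geo_ids


# Tiny stand-ins for the county files.
geo = json.loads('{"type": "FeatureCollection", "features": ['
                 '{"type": "Feature", "id": "1001"},'
                 '{"type": "Feature", "id": "1003"},'
                 '{"type": "Feature", "id": "1007"}]}')
table = io.StringIO('FIPS_Code,State,Area_name\n'
                    '01000,AL,Alabama\n'
                    '01001,AL,Autauga County\n'
                    '01003,AL,Baldwin County\n')
rows = list(csv.DictReader(table))

missing_rows, extra_rows = compare_keys(geo, rows)
print('in geoJSON only:', sorted(missing_rows))
print('in table only:', sorted(extra_rows))
```

Keys that appear only in the table (state and national totals, for instance) are what the inner join discards; keys that appear only in the geoJSON are features that would end up with no data bound to them.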

And now we can quickly generate different choropleths:

vis.tabular_data(merged, columns=['FIPS_Code', 'Civilian_labor_force_2011']) 
vis.to_json(path)

Labor Force

That’s not telling us much other than that LA and King counties are both very large and very populous. Let’s look at median household income:

vis.tabular_data(merged, columns=['FIPS_Code', 'Median_Household_Income_2011'])
vis.to_json(path)

Median Income

Certainly some high income areas on the east coast and in other high density areas. I bet this would be more interesting on the city level, but that will have to wait for a future post. Let’s quickly reset the map and look at state unemployment:

#Swap county data for state data, reset map
state_data = pd.read_csv(state_unemployment)
vis.tabular_data(state_data, columns=['State', 'Unemployment'])
vis.geo_data(bind_data='data.id', reset=True, states=state_geo)
vis.update_map(scale=1000, projection='albersUsa')
vis + (['#c9cedb', '#0b0d11'], 'scales', 0, 'range')
vis.to_json(path)

State Unemployment

Maps are a passion of mine. This is one area where I really want to make Vincent very strong, including the ability to easily add points, markers, etc. If you, the reader, have any features you would like to see with regard to mapping, please post an issue on GitHub.

5 Billion Gallons of Sewage Overflowed to N.J. Post-Sandy, Report Says

via ridgewood.patch.com

New Jersey saw approximately 5.1 billion gallons of untreated or partially treated sewage flow into waterways in the weeks and months following Superstorm Sandy, according to new data released by Climate Central.

In total, the eight states hardest-hit by the storm had 11 billion gallons flow into canals, rivers and bays.

“To put that in perspective, 11 billion gallons is equal to New York’s Central Park stacked 41 feet high with sewage, or more than 50 times the BP Deepwater Horizon oil spill. The vast majority of that sewage flowed into the waters of New York City and northern New Jersey in the days and weeks during and after the storm,” the Climate Central report said.
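
The Central Park comparison can be checked with a little arithmetic. The sketch below assumes the park covers about 843 acres, a figure that does not appear in the report, and uses the standard definition of a US gallon as 231 cubic inches. The `stacked_depth_feet` helper is a hypothetical name for illustration.

```python
GALLON_IN_CUBIC_FEET = 231 / 1728      # a US gallon is 231 cubic inches
SQUARE_FEET_PER_ACRE = 43560
CENTRAL_PARK_ACRES = 843               # assumed area; not stated in the report


def stacked_depth_feet(gallons, acres):
    """Depth in feet if `gallons` of liquid covered `acres` evenly."""
    volume = gallons * GALLON_IN_CUBIC_FEET
    area = acres * SQUARE_FEET_PER_ACRE
    return volume / area


depth = stacked_depth_feet(11e9, CENTRAL_PARK_ACRES)
print(f'{depth:.1f} feet')
```

Under these assumptions the depth comes out at roughly 40 feet, close to the report's 41; the small gap presumably reflects rounding in the gallon total or a slightly different figure for the park's area.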

Data included in the report was compiled from state agencies and treatment plant operators.

Contributing to the sewage overflow was actual damage to the treatment plants themselves as a result of the storm. Sewage overflows into the water continued from the day Sandy hit (Oct. 29) until January of this year, when the last known overflow was reported.

New Jersey’s largest spill was reported by the Passaic Valley Sewerage Commission in Newark in which 840 million gallons of untreated sewage flowed into Newark Bay. Within the two weeks that followed, 3 billion gallons more overflowed into the bay while the plant was being restored, according to the report.

In Middlesex County, an additional 1.1 billion gallons reportedly flowed into local waterways. 

Testing interface on Internet Explorer

via devopsreactions.tumblr.com

by Gian