
29 Sep

Vasudev Ram: CommonMark, a pure Python Markdown parser and renderer

via jugad2.blogspot.com


By Vasudev Ram

I got to know about CommonMark.org via this post on the Python Reddit:

CommonMark.py - pure Python Markdown parser and renderer

From what I could gather, CommonMark is, or aims to be, two things:

1. “A standard, unambiguous syntax specification for Markdown, along with a suite of comprehensive tests”.

2. A Python parser and renderer for the CommonMark Markdown spec.

CommonMark on PyPI, the Python Package Index.

Excerpts from the CommonMark.org site:

[ We propose a standard, unambiguous syntax specification for Markdown, along with a suite of comprehensive tests to validate Markdown implementations against this specification. We believe this is necessary, even essential, for the future of Markdown. ]

[ Who are you?
We’re a group of Markdown fans who either work at companies with industrial scale deployments of Markdown, have written Markdown parsers, have extensive experience supporting Markdown with end users – or all of the above.

John MacFarlane
David Greenspan
Vicent Marti
Neil Williams
Benjamin Dumke-von der Ehe
Jeff Atwood ]

So I installed the Python library for it with:
pip install commonmark
Then modified this snippet of example code from the CommonMark PyPI site:
import CommonMark
parser = CommonMark.DocParser()
renderer = CommonMark.HTMLRenderer()
print(renderer.render(parser.parse("Hello *World*")))
on my local machine, to add a few more types of Markdown syntax:
import CommonMark
parser = CommonMark.DocParser()
renderer = CommonMark.HTMLRenderer()
markdown_string = \
"""
Heading
=======

Sub-heading
-----------

# Atx-style H1 heading.
## Atx-style H2 heading.
### Atx-style H3 heading.
#### Atx-style H4 heading.
##### Atx-style H5 heading.
###### Atx-style H6 heading.

Paragraphs are separated
by a blank line.

Leave 2 spaces at the end of a line to do a
line break

Text attributes *italic*, **bold**, `monospace`.

A [link](http://example.com).

Shopping list:

* apples
* oranges
* pears

Numbered list:

1. apples
2. oranges
3. pears

"""
print(renderer.render(parser.parse(markdown_string)))
Here is a screenshot of the output HTML generated by CommonMark, loaded in Google Chrome:


Reddit user bracewel, who seems to be a CommonMark team member, said on the Python Reddit thread:

eventually we’d like to add a few more renderers, PDF/RTF being the first….

So CommonMark looks interesting and worth keeping an eye on, IMO.

- Vasudev Ram - Dancing Bison Enterprises - Python training and consulting

Dancing Bison - Contact Page

28 Sep

Put the Laptop Away

via medium.com

Clay Shirky on Medium:

People often start multi-tasking because they believe it will help them get more done. Those gains never materialize; instead, efficiency is degraded. However, it provides emotional gratification as a side-effect. (Multi-tasking moves the pleasure of procrastination inside the period of work.) This side-effect is enough to keep people committed to multi-tasking despite worsening the very thing they set out to improve.


27 Sep

Mike Driscoll: How to Connect to Twitter with Python

via www.blog.pythonlibrary.org

There are several 3rd party packages that wrap Twitter’s API. We’ll be looking at tweepy and twitter. The tweepy documentation is a bit more extensive than twitter’s, but I felt that the twitter package had more concrete examples. Let’s spend some time going over how to use these packages!


Getting Started

To get started using Twitter’s API, you will need to create a Twitter application. To do that, you’ll have to go to their developer’s site and create a new application. After your application is created, you will need to get your API keys (or generate some). You will also need to generate your access tokens.
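
The snippets below assume your four credential strings are already bound to the names key, secret, token and token_secret. Here is a minimal sketch of that setup, with placeholder values you would swap for the ones from your own application's settings page:

# Placeholder credentials from your Twitter application's settings page.
# Substitute the real values generated for your app, and don't commit
# them to version control.
key = "YOUR_CONSUMER_KEY"                   # API key
secret = "YOUR_CONSUMER_SECRET"             # API secret
token = "YOUR_ACCESS_TOKEN"                 # access token
token_secret = "YOUR_ACCESS_TOKEN_SECRET"   # access token secret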

Since neither of these packages are included with Python, you will need to install them as well. To install tweepy, just do the following:


pip install tweepy

To install twitter, you can do the same kind of thing:


pip install twitter

Now you should be ready to go!


Posting a Status Update

One of the basics that you should be able to do with these packages is post an update to your Twitter account. Let’s see how these two packages work in that regard. We will start with tweepy.

import tweepy
 
auth = tweepy.OAuthHandler(key, secret)
auth.set_access_token(token, token_secret)
client = tweepy.API(auth)
client.update_status("#Python Rocks!")

Well, that was pretty straightforward. We had to create an OAuth handler with our keys and then set the access tokens. Finally we created an object that represents Twitter’s API and updated our status. This method worked great for me. Now let’s see if we can get the twitter package to work.

import twitter
 
auth = twitter.OAuth(token, token_secret, key, secret)
client = twitter.Twitter(auth=auth)
client.statuses.update(status="#Python Rocks!")

This code is pretty simple too. In fact, I think the twitter package’s OAuth implementation is cleaner than tweepy’s.

Note: I sometimes got the following error while using the twitter package: Bad Authentication data, code 215. I’m not entirely sure why; when you look that error up, it’s supposedly caused by using Twitter’s old API. If that were the case, though, it should never work.

Next we’ll look at how to get our timeline.


Getting Timelines

Getting your own Twitter timeline is really easy in both packages. Let’s take a look at tweepy’s implementation:

import tweepy
 
auth = tweepy.OAuthHandler(key, secret)
auth.set_access_token(token, token_secret)
client = tweepy.API(auth)
timeline = client.home_timeline()
for item in timeline:
    text = "%s says '%s'" % (item.user.screen_name, item.text)
    print text

So here we get authenticated and then we call the home_timeline() method. This returns an iterable of objects that we can loop over and extract various bits of data from. In this case, we just extract the screen name and the text of the Tweet. Let’s see how the twitter package does this:

import twitter
 
auth = twitter.OAuth(token, token_secret, key, secret)
client = twitter.Twitter(auth=auth)
timeline = client.statuses.home_timeline()
for item in timeline:
    text = "%s says '%s'" % (item["user"]["screen_name"],
                             item["text"])
    print text

The twitter package is pretty similar. The primary difference is that it returns a list of dictionaries.

What if you wanted to get someone else’s timeline? In tweepy, you’d do something like this:

import tweepy
 
auth = tweepy.OAuthHandler(key, secret)
auth.set_access_token(token, token_secret)
client = tweepy.API(auth)
user = client.get_user(screen_name='pydanny')
timeline = user.timeline()

The twitter package is a little different:

import twitter
 
auth = twitter.OAuth(token, token_secret, key, secret)
client = twitter.Twitter(auth=auth)
user_timeline = client.statuses.user_timeline(screen_name='pydanny')

In this case, I think the twitter package is a bit cleaner although one might argue that tweepy’s implementation is more intuitive.


Getting Your Friends and Followers

Just about everyone has friends (people they follow) and followers on Twitter. In this section we will look at how to access those items. The twitter package doesn’t really have a good example to follow for finding your friends and followers, so we’ll just focus on tweepy here.

import tweepy
 
auth = tweepy.OAuthHandler(key, secret)
auth.set_access_token(token, token_secret)
client = tweepy.API(auth)
friends = client.friends()
for friend in friends:
    print friend.name

If you run the code above, you will notice that the maximum number of friends that it prints out will be 20. If you want to print out ALL your friends, then you need to use a cursor. There are two ways to use the cursor. You can use it to return pages or a specific number of items. In my case, I have 32 people that I follow, so I went with the items way of doing things:

for friend in tweepy.Cursor(client.friends).items(200):
    print friend.name

This piece of code will iterate over up to 200 items. If you have a LOT of friends or you want to iterate over someone else’s friends, but don’t know how many they have, then using the pages method makes more sense. Let’s take a look at how that might work:

for page in tweepy.Cursor(client.friends).pages():
    for friend in page:
        print friend.name

That was pretty easy. Getting a list of your followers is exactly the same:

followers = client.followers()
for follower in followers:
    print follower.name

This too will return just 20 items. I have a lot of followers, so if I wanted to get a list of them, I would have to use one of the cursor methods mentioned above.
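
For example, here is a minimal sketch using the same paging cursor as above, assuming client is the authenticated tweepy.API object from the earlier snippets:

# Page through all followers instead of just the first 20.
for page in tweepy.Cursor(client.followers).pages():
    for follower in page:
        print follower.name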


Wrapping Up

These packages provide a lot more functionality than this article covers. I particularly recommend looking at tweepy, as it’s quite a bit more intuitive and easier to figure out with Python’s introspection tools than the twitter package is. You could easily take tweepy and build a user interface around it to keep you up to date with your friends, if you don’t already have a bunch of applications that do that. Even so, it would still be a good program for a beginner to write.
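
As a quick sketch of what that exploration looks like, assuming client is the authenticated tweepy.API object from the earlier snippets:

# List the public methods tweepy exposes, then read one docstring.
print [name for name in dir(client) if not name.startswith('_')]
help(client.home_timeline)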


Additional Reading

26 Sep

Last commit on Friday 5pm

via devopsreactions.tumblr.com

by g1t

Climbing the Walls of Evidence

via colinstokes.blogspot.com

Sometimes you build a wall of evidence. Then you climb on top of that wall and start screaming.

Dr. Stacy L. Smith and her colleagues at USC have released yet another report, supported by the Geena Davis Institute on Gender and Media. As they have every few months in recent years, they go to the trouble of watching hundreds of movies and taking notes, and then they tell us what we’ve been seeing.

Today I read Soraya Chemaly’s fierce piece in the Huffington Post, "20 Facts Everyone Should Know About Gender Bias in Movies." She reports, and tries to process, Smith’s latest findings, which sample the most popular films from all over the world. The infographics scroll on and on, slicing the stories we take in (and take our children to) from dozens of angles, then holding those slices up to the light.

We see, quantitatively, our bias.

We see our presumption that female voices are about half as relevant as male ones. Men take up about 70% of the dialogue we hear in popular movies worldwide.

We see our presumption that the men who are speaking are professionals. They wear clothes. The women who get a word in don’t work for a living—though they do show off their skinny bodies, prompting appreciation from those male voices.

Chemaly’s list piles up. Every now and then she looks up in shock:
Just three female characters were represented as political leaders with power. One didn’t speak. One was an elephant. The last was Margaret Thatcher.
I first read about Dr. Smith’s analyses a few years ago, and they inspired me to give a talk about how I wanted to block these biases from my children, if I could. Some things have changed: we’ve gone to see movies with female heroes in the center, and we’ve gone again and again. Feminism is back, and the familiar backlash with it. That’s how you know you’re winning, maybe.


But still, the numbers tell us that movies are still, collectively, lying to us. Not just being annoyingly stereotypical. Skewed gender representation isn’t bad because it’s a bummer for little girls who don’t like pink. It’s bad because, as Chemaly knows, it’s teaching another generation how to oppress each other and themselves.
Media is how we train girls and women to have low expectations and train boys to have high ones…These biased portrayals contribute to inhumane, unrealistic stereotypes about masculinity based on control, violence, dominance and the active erasure of empathy as an acceptable emotion. A narrow, frequently violent, power-over-others male heroism comes at a very high price for everyone.
As filmmaker Abigail Disney…asked, “Where are the men who solve problems by thinking?”
Yes, thanks to Dr. Smith’s team and Geena Davis’ team, the wall of evidence has gotten really high. And every day someone like Chemaly, or Emma Watson at the United Nations, or Margot Magowan on her Reel Girl blog, lays the bricks on one another once more, and then stands on top of them and cries out for action.
There is no excuse for not having this information and using it.
Men with influence and the ability to raise these questions and do something about them probably strive, as individuals, to be good parents to their kids and make sure their daughters are healthy, happy, educated and ambitious.
Not doing anything about this problem, from an institutional perspective, undoes all of that effort. The argument that there is some kind of benign “neutral” position is misguided.
Same goes for parents.


Read Soraya Chemaly’s "20 Facts Everyone Should Know About Gender Bias in Movies" at the Huffington Post.

25 Sep

Teamwork - in theory

via devopsreactions.tumblr.com

by uaiHebert

via www.flickr.com

june1777 posted a photo.



18 Sep

No, THIS Is Almost Every Sci-Fi Starship Ever, In One Giant Chart

via feeds.gawker.com


Remember last year, when Dirk Loechel showed us a size comparison chart of pretty much every starship you could think of? Turns out there were some missing, so he’s gone and made what he says is his final update.

Read more…

15 Sep

The deployment pipeline

via devopsreactions.tumblr.com

by Julik and Aaron

13 Sep

A Secret Hideaway

via datanews.tumblr.com




The Data News Team has run off to a secret (and beautiful) hideaway for a crazy sprint to make SchoolBook more amazing. For several days, we’ll be designing, coding and writing. Also maybe napping in the hammock.

When we’re done, our SchoolBook database will be more useful, more mobile, and have fewer calories. Most importantly, it’ll be easier for parents and students to find the perfect NYC public school as they navigate the school selection and admissions process.

Watch our blog for more pics from the place and the process.

— John

11 Sep

PlotDevice: Draw with Python

via flowingdata.com


You’ve been able to visualize data with Python for a while, but Mac application PlotDevice from Christian Swinehart couples code and graphics more tightly. Write code on the right. Watch graphics change on the left.

The application gives you everything you need to start writing programs that draw to a virtual canvas. It features a text editor with syntax highlighting and tab completion plus a zoomable graphics viewer and a variety of export options.

PlotDevice’s simple but com­pre­hen­sive set of graphics commands will be familiar to users of similar graphics tools like NodeBox or Processing. And if you’re new to programming, you’ll find there’s nothing better than being able to see the results of your code as you learn to think like a computer.

Looks promising. Although when I downloaded it and tried to run it, nothing happened. I’m guessing there are still compatibility issues to iron out at version 0.9.4. Hopefully that clears up soon. [via Waxy]


Dinosaurs versus airplane

via flowingdata.com


Scientists found the fossils of a giant dinosaur that they estimate was 26 meters long and weighed 60 tons. How much is that really? BBC News provided a simple chart to put the size into perspective. They compared the dinosaur to a moose, an African elephant, and a Boeing 737-900.

Impressive. Although not as impressive as Mega Shark. [Thanks, Jim]


CloudTunes: your web-based music player for the cloud

via thechangelog.com

Great idea and execution from Jakub Roztočil:

CloudTunes provides a unified interface for music stored in the cloud (YouTube, Dropbox, etc.) and integrates with Last.fm, Facebook, and Musicbrainz for metadata, discovery, and social experience. It is similar to services like Spotify, except instead of local tracks and the fixed Spotify catalog, CloudTunes uses your files stored in Dropbox and music videos on YouTube.




A non-comprehensive list of awesome female data people on Twitter

via simplystatistics.org

I was just talking to a student who mentioned she didn’t know Jenny Bryan was on Twitter. She is, and she is an awesome person to follow. I also realized that I hadn’t seen a good list of women on Twitter who do stats/data, so I thought I’d make one. This list is what I could put together in 15 minutes based on my own feed and will, with 100% certainty, miss some really awesome people. Can you please add them in the comments and I’ll update the list?

04 Sep

A reusable data processing workflow

via blog.apps.npr.org

Correction (September 2, 2014 8:55pm EDT): We originally stated that the script should combine data from multiple American Community Survey population estimates. This methodology is not valid. This post and the accompanying source code have been updated accordingly. Thanks to census expert Ryan Pitts for catching the mistake. This is why we open source our code!

The NPR Visuals team was recently tasked with analysing data from the Pentagon’s program to disperse surplus military gear to law enforcement agencies around the country through the Law Enforcement Support Office (LESO), also known as the “1033” program. The project offers a useful case study in creating data processing pipelines for data analysis and reporting.

The source code for the processing scripts discussed in this post is available on Github. The processed data is available in a folder on Google Drive.

Automate everything

There is one rule for data processing: Automate everything.

Data processing is fraught with peril. Your initial transformations and data analysis will always have errors and never be as sophisticated as your final analysis. Do you want to hand-categorize a dataset, only to get updated data from your source? Do you want to laboriously add calculations to a spreadsheet, only to find out you misunderstood some crucial aspect of the data? Do you want to arrive at a conclusion and forget how you got there?

No you don’t! Don’t do things by hand, don’t do one-off transformations, don’t make it hard to get back to where you started.

Create processing scripts managed under version control that can be refined and repeated. Whatever extra effort it takes to set up and develop processing scripts, you will be rewarded the second or third or fiftieth time you need to run them.

It might be tempting to change the source data in some way, perhaps to add categories or calculations. If you need to add additional data or make calculations, your scripts should do that for you.

The top-level build script from our recent project shows this clearly, even if you don’t write code:

#!/bin/bash

echo 'IMPORT DATA'
echo '-----------'
./import.sh

echo 'CREATE SUMMARY FILES'
echo '--------------------'
./summarize.sh

echo 'EXPORT PROCESSED DATA'
echo '---------------------'
./export.sh

We separate the process into three scripts: one for importing the data, one for creating summarized versions of the data (useful for charting and analysis) and one that exports full versions of the cleaned data.

How we processed the LESO data

The data, provided by the Defense Logistics Agency’s Law Enforcement Support Office, describes every distribution of military equipment to local law enforcement agencies through the “1033” program since 2006. The data does not specify the agency receiving the equipment, only the county the agency operates in. Every row represents a single instance of a single type of equipment going to a law enforcement agency. The fields in the source data are:

  • State
  • County
  • National Supply Number: a standardized categorization system for equipment
  • Quantity
  • Units: A description of the unit to use for the item (e.g. “each” or “square feet”)
  • Acquisition cost: The per-unit cost of the item when purchased by the military
  • Ship date: When the item was shipped to a law enforcement agency

Import

Import script source

The process starts with a single Excel file and builds a relational database around it. The Excel file is cleaned and converted into a CSV file and imported into a PostgreSQL database. Then additional data is loaded to help categorize and contextualize the primary dataset.

Here’s the whole workflow:

  • Convert Excel data to CSV with Python (a minimal cleanup sketch follows this list).
    • Parse the date field, which represents dates in two different formats
    • Strip out extra spaces from any strings (of which there are many)
    • Split the National Supply Number into two additional fields: The first two digits represent the top level category of the equipment (e.g. “WEAPONS”). The first four digits represent the “federal supply class” (e.g. “Guns, through 30 mm”).
  • Import the CSVs generated from the source data into PostgreSQL.
  • Import a “FIPS crosswalk” CSV into PostgreSQL. This file, provided to us by an NPR reporter, lets us map county name and state to the Federal Information Processing Standard identifier used by the Census Bureau to identify counties.
  • Import a CSV file with Federal Supply Codes into PostgreSQL. Because there are repeated values, this data is de-duplicated after import.
  • Import 5 year county population estimates from the US Census Bureau’s American Community Survey using the American FactFinder download tool. The files were added to the repository because there is no direct link or API to get the data.
    • Import 5 year county population estimates (covers all US counties)
    • Import 3 year county population estimates (covers approximately 53% of the most populous US counties)
    • Import 1 year county population (covers approximately 25% of the most populous US counties).
    • Generate a single population estimate table by overwriting 5 year estimates with 3 year estimates or 1 year estimates (if they exist).
  • Create a PostgreSQL view that joins the LESO data with census data through the FIPS crosswalk table for convenience.
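
Here is a minimal sketch of the Excel-to-CSV cleanup in the first step above. It assumes the Excel sheet has already been dumped to a raw CSV (for instance with csvkit’s in2csv) with columns in the order listed earlier; the file names and date formats are illustrative, and the real import script is in the Github repository linked above.

import csv
from datetime import datetime

def parse_ship_date(value):
    # The source data mixes two date formats; the two formats here are
    # placeholders standing in for whatever the real file uses.
    for fmt in ('%m/%d/%Y', '%Y-%m-%d %H:%M:%S'):
        try:
            return datetime.strptime(value.strip(), fmt).strftime('%Y-%m-%d')
        except ValueError:
            continue
    return ''

with open('build/leso_raw.csv') as infile, open('build/leso.csv', 'w') as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    next(reader)  # skip the header row of the raw file
    writer.writerow(['state', 'county', 'nsn', 'quantity', 'ui',
                     'acquisition_cost', 'ship_date',
                     'federal_supply_category', 'federal_supply_class'])
    for row in reader:
        # Strip the stray spaces that litter the source data.
        state, county, nsn, quantity, ui, cost, ship_date = [f.strip() for f in row]
        writer.writerow([state, county, nsn, quantity, ui, cost,
                         parse_ship_date(ship_date),
                         nsn[:2],   # top-level category, e.g. "10" (WEAPONS)
                         nsn[:4]])  # federal supply class, e.g. "1005"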

We also import a list of all agencies using csvkit:

  • Use csvkit’s in2csv command to extract each sheet
  • Use csvkit’s csvstack command to combine the sheets and add a grouping column
  • Use csvkit’s csvcut command to remove a pointless “row number” column
  • Import final output into Postgres database

Summarizing

Summarize script source

Once the data is loaded, we can start playing around with it by running queries. As the queries become well-defined, we add them to a script that exports CSV files summarizing the data. These files are easy to drop into Google spreadsheets or send directly to reporters using Excel.

We won’t go into the gory details of every summary query. Here’s a simple query that demonstrates the basic idea:

echo "Generate category distribution" psql leso -c "COPY ( select c.full_name, c.code as federal_supply_class, sum((d.quantity * d.acquisition_cost)) as total_cost from data as d join codes as c on d.federal_supply_class = c.code group by c.full_name, c.code order by c.full_name ) to '`pwd`/build/category_distribution.csv' WITH CSV HEADER;" 

This builds a table that calculates the total acquisition cost for each federal supply class:

full_name                                                  federal_supply_class  total_cost
Trucks and Truck Tractors, Wheeled                         2320                  $405,592,549.59
Aircraft, Rotary Wing                                      1520                  $281,736,199.00
Combat, Assault, and Tactical Vehicles, Wheeled            2355                  $244,017,665.00
Night Vision Equipment, Emitted and Reflected Radiation    5855                  $124,204,563.34
Aircraft, Fixed Wing                                       1510                  $58,689,263.00
Guns, through 30 mm                                        1005                  $34,445,427.45

Notice how we use SQL joins to pull in additional data (specifically, the full name field) and aggregate functions to handle calculations. By using a little SQL, we can avoid manipulating the underlying data.

The usefulness of our approach was evident early on in our analysis. At first, we calculated the total cost as sum(acquisition_cost), not accounting for the quantity of items. Because we have a processing script managed with version control, it was easy to catch the problem, fix it and regenerate the tables.

Exporting

Export script source

Not everybody uses PostgreSQL (or wants to). So our final step is to export cleaned and processed data for public consumption. This big old query merges useful categorical information, county FIPS codes, and pre-calculates the total cost for each equipment order:

psql leso -c "COPY ( select d.state, d.county, f.fips, d.nsn, d.item_name, d.quantity, d.ui, d.acquisition_cost, d.quantity * d.acquisition_cost as total_cost, d.ship_date, d.federal_supply_category, sc.name as federal_supply_category_name, d.federal_supply_class, c.full_name as federal_supply_class_name from data as d join fips as f on d.state = f.state and d.county = f.county join codes as c on d.federal_supply_class = c.code join codes as sc on d.federal_supply_category = sc.code ) to '`pwd`/export/states/all_states.csv' WITH CSV HEADER;" 

Because we’ve cleanly imported the data, we can re-run this export whenever we need. If we want to revisit the story with a year’s worth of additional data next summer, it won’t be a problem.

A few additional tips and tricks

Make your scripts chatty: Always print to the console at each step of import and processing scripts (e.g. echo "Merging with census data"). This makes it easy to track down problems as they crop up and get a sense of which parts of the script are running slowly.

Use mappings to combine datasets: As demonstrated above, we make extensive use of files that map fields in one table to fields in another. We use SQL joins to combine the datasets. These features can be hard to understand at first. But once you get the hang of it, they are easy to implement and keep your data clean and simple.

Work on a subset of the data: When dealing with huge datasets that could take many hours to process, use a representative sample of the data to test your data processing workflow. For example, use 6 months of data from a multi-year dataset, or pick random samples from the data in a way that ensures the sample data adequately represents the whole.
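
For example, here is a minimal sketch of pulling a fixed-size random sample out of a large CSV using reservoir sampling, so the full file never has to fit in memory; the file names and sample size are hypothetical:

import csv
import random

SAMPLE_SIZE = 10000

with open('data/full_dataset.csv') as infile:
    reader = csv.reader(infile)
    header = next(reader)
    sample = []
    for i, row in enumerate(reader):
        if i < SAMPLE_SIZE:
            sample.append(row)
        else:
            # Replace an existing pick with decreasing probability so
            # every row has an equal chance of ending up in the sample.
            j = random.randint(0, i)
            if j < SAMPLE_SIZE:
                sample[j] = row

with open('data/sample.csv', 'w') as outfile:
    writer = csv.writer(outfile)
    writer.writerow(header)
    writer.writerows(sample)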