Stouthearted

Software, management, life, and and everything in between.
Home  About Me

Getting Started With Pandas

For many this is old news - but I just need to give a shout out to Pandas. What an amazing module for analyzing and manipulating table-like data!

As much as I like SQL and “proper” programming, much of the business world revolves around Excel and CSV. I interact with spreadsheets every day - sometimes multiple times a day. Excel is a workhorse, and it’s not going away anytime soon. But being able to interact with spreadsheets programatically (and without digging into VBA or Excel syntax) has always been a pipe dream of mine.

Enter Pandas. It’s the power tool (or perhaps the superpower) that you need.

I don’t have much any experience with Python… And I was able to get productive with Pandas in just a couple hours. Here are the steps I took.

Installation

  1. Install Anaconda.
  2. That’s it! You’re all set. Start writing python, and then run your scripts through the Anaconda CLI. Pandas and a bunch of other packages are included by default.

Getting The Basics

There are two basic types of data that you need to understand in Pandas - Series, and Dataframes.

Series

A Pandas Series is essentially an indexed array/list. Think of it as one column in a table of data. The data can be any type.

Example:

# Import pandas
import pandas as pd

# Create a series
cities = ["Houston", "Denver", "San Francisco", "Orlando"]
series = pd.Series(cities)

# Print result
print(series)

Output:

0          Houston
1           Denver
2    San Francisco
3          Orlando
dtype: object

Straightforward, right? Let’s move on to Dataframes - that’s where the fun is.

DataFrames

A Pandas DataFrame is a complete table of data - multiple columns, not just one column like a Series. Once again, the data in the columns can be multiple types.

Example:

# Import pandas
import pandas as pd

# Create a dataframe
cities = {
    "Cities": ["Houston", "Denver", "San Franscisco", "Orlando"],
    "Population": [2300000, 2900000, 883000, 285000]
}
df = pd.DataFrame(cities)

# Print result
print(series)

Output:

           Cities  Population
0         Houston     2300000
1          Denver     2900000
2  San Franscisco      883000
3         Orlando      285000

Importing and Exporting Files

The last step to being productive is importing and exporting files.

To import a CSV file, use pandas.read_csv(). To export to external file use Series.to_csv() or DataFrame.to_csv().

Example CSV File - cities.csv:

Cities Population
Houston 2300000
Denver 2900000
San Francisco 883000
Orlando 285000

Example Code:

# Imports
# Numpy is another common python package that works with python and pandas - it comes with Anaconda
import pandas as pd
import numpy as np

# Import CSV as DataFrame
df = pd.read_csv("cities.csv") 

# Add additional column
df["Population_Over_Million"] = np.where(df["Population"] > 1000000, True, False)

# print, then export result
print(df)
df.to_csv("export.csv")

Output:

           Cities  Population  Population_Over_Million
0         Houston     2300000                     True
1          Denver     2900000                     True
2  San Franscisco      883000                    False
3         Orlando      285000                    False

Unleash the Magic

If you’re like me, your imagination is already running wild with all the operations that are possible. We’ve only barley scratched the surface of what Pandas is capable of. You can:

  • Clean data by removing rows that have empty cells, duplicates, errors, etc.
  • Format data in an scriptable manner
  • Find correlations between columns via the .corr() function
  • Create diagrams using .plot()
  • etc.

If you can do it in Excel, chances are Pandas can do it as well - faster, in plain text, and all from the comfort of your CLI or code editor. You can also manipulate massive files that would bring Excel to its knees.

Further Reading

I found the W3Schools tutorial to be simple and helpful. The Pandas documentation is also robust.