
Series & Data Frames With Pandas


Introduction

In this guide, you will learn the basics of Python’s Pandas library. We will learn about Series and Data Frames and how to use them for data manipulation. I kept the examples as simple as possible to eliminate noise and make the material easier to learn. Throughout this guide, I will be using iPython to display the code examples; however, feel free to use whichever editor you are most comfortable with.

Requirements

This guide assumes the following:

- Access to a Linux shell environment with an active internet connection.

- pip (Python’s package installer) is installed on your computer.

- A grasp of programming concepts such as lists, dictionaries, variables, etc.

- Knowledge of how to use an editor to create and run Python scripts.

Basic Usage

We will begin by covering the basics step by step, followed by a project in which you can practice the material covered in this guide.

Setup

Installing Python’s PIP

pip is a package management system used to install packages written in Python. If you do not have pip installed and you are on a Debian-based distribution, you can use the apt package manager to install it:


$ sudo apt-get install python3-pip

Installing Pandas

Assuming you have pip installed on your system, the following command will install pandas on your computer:


$ pip install pandas
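
To confirm that the installation worked, a quick check like the one below can be run from a Python prompt. This is only a minimal sketch; the version string it prints depends on what pip installed on your system.

```python
# Import pandas and report which version is available.
import pandas

installed_version = pandas.__version__
print("pandas version:", installed_version)
```

If the import fails with a ModuleNotFoundError, pandas was most likely installed for a different Python interpreter than the one you are running.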

Series

Syntax

pandas.Series(data=None, index=None, dtype=None, name=None, copy=False, fastpath=False)


The Pandas documentation describes a Series as a one-dimensional labeled array capable of holding any data type. This makes a Series a flexible and powerful tool for manipulating data. Every value in a Series has an index label and can be looked up by that label when needed.

In this section we will be learning how to:

- Create a Series.

- Give Series an index.

- Call on a specific value by using its index.

- Perform operations on Series.

- Perform Boolean tests on your Series.

Creating a Series

We start by importing the Pandas library, which gives us access to the Series class.


In [1]: import pandas

We create a Series by calling pandas.Series(VALUES, index=INDEX), where VALUES is a list of values and INDEX is an optional list of index labels. If no index is provided, pandas will automatically create one.

In this case, we will create a variable named server and give pandas.Series() the list of integers [1, 2, 3, 4].


In [2]: server = pandas.Series([1, 2, 3, 4])


Let’s get a summary of our data by printing server.


In [3]: print(server)
0 1
1 2
2 3
3 4


The summary displays each data value along with its index. In this case, server[0] holds the value 1, server[1] holds the value 2, and so on. Since we did not provide an index, Pandas automatically generated one: 0, 1, 2, 3.
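
The index and the values of a Series can also be inspected separately. The following is a small sketch of that idea; the variable names index_labels and values are chosen here for illustration only.

```python
import pandas

# Build a Series from a plain list; pandas supplies the index 0..3.
server = pandas.Series([1, 2, 3, 4])

# Pull the index labels and the stored values out as ordinary Python lists.
index_labels = list(server.index)
values = server.tolist()

print(index_labels)  # [0, 1, 2, 3]
print(values)        # [1, 2, 3, 4]
print(server[0])     # value stored under index label 0 -> 1
```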

Assigning an index to a Series

Next, we will be using a list of IP addresses as our index, and a list of integers representing packages sent out on our network as our values.


In [1]: import pandas


We are passing pandas.Series() the list of integers [300, 855, 321]. We are also defining our own index with the following three IP addresses.


In [2]: servers = pandas.Series([300, 855, 321], index=["192.168.1.10", "192.168.1.55", "192.168.1.22"])

By printing our servers variable, we can see that each IP address is used as the index for the number of packages sent out on the network.


In [3]: print(servers)
192.168.1.10 300
192.168.1.55 855
192.168.1.22 321
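
A Series with custom labels can also be built from a dictionary, in which case the keys become the index. The sketch below assumes the same three IP addresses and counts used in this section; the dictionary name counts is hypothetical.

```python
import pandas

# Counts keyed by IP address (same numbers as in the example above).
counts = {"192.168.1.10": 300, "192.168.1.55": 855, "192.168.1.22": 321}

# The dictionary keys become the index; the dictionary values become the data.
servers = pandas.Series(counts)

print(servers)
print(list(servers.index))
```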


Calling on a value

You can call on a value by using its index label. Here we call on the value held by the IP “192.168.1.10” and get back 300.


In [1]: print(servers["192.168.1.10"])
300
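
Besides plain square brackets, pandas offers two explicit accessors: .loc looks a value up by its index label and .iloc looks it up by its position. The following sketch shows both on the same data; the names by_label and by_position are illustrative.

```python
import pandas

servers = pandas.Series(
    [300, 855, 321],
    index=["192.168.1.10", "192.168.1.55", "192.168.1.22"],
)

# Look a value up by its index label.
by_label = servers.loc["192.168.1.55"]

# Look a value up by its position (0-based), regardless of the label.
by_position = servers.iloc[0]

print(by_label)     # 855
print(by_position)  # 300
```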


Performing Operations

Let’s find out which servers sent fewer than 500 packages out on the network. This can be done with:


In [1]: print( servers[servers < 500] )
192.168.1.10 300
192.168.1.22 321


As you can see, only servers “192.168.1.10” and “192.168.1.22” are shown, since the values they hold are below 500.
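
It can help to look at the two steps of this filter separately: the comparison first produces a Series of True/False values, and that Series is then used to select rows. The sketch below also shows that arithmetic is applied to every value at once; the names mask, quiet_servers, and doubled are illustrative.

```python
import pandas

servers = pandas.Series(
    [300, 855, 321],
    index=["192.168.1.10", "192.168.1.55", "192.168.1.22"],
)

# The comparison alone produces a Series of True/False values.
mask = servers < 500
print(mask)

# Using the mask as a selector keeps only the rows marked True.
quiet_servers = servers[mask]
print(quiet_servers)

# Arithmetic is applied to every value at once.
doubled = servers * 2
print(doubled)
```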

Boolean Tests

The in operator tests whether a given label is present in the index of a Series:


In [1]: "192.168.1.10" in servers
Out[1]: True


The IP address "192.168.1.10" is one of the index labels of servers, so the test returns True.
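
One detail worth keeping in mind is that the in operator on a Series looks at the index labels, not at the stored values. The following sketch contrasts the two checks; the variable names are illustrative.

```python
import pandas

servers = pandas.Series(
    [300, 855, 321],
    index=["192.168.1.10", "192.168.1.55", "192.168.1.22"],
)

label_present = "192.168.1.10" in servers  # checks the index labels
value_as_label = 300 in servers            # 300 is a value, not a label
value_present = 300 in servers.values      # checks the stored values

print(label_present)   # True
print(value_as_label)  # False
print(value_present)   # True
```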

Data Frames
Syntax

pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)

Creating a Data Frame

We are going to use a practice CSV file. By passing our CSV file to the pandas.read_csv() function, the data in our file will be loaded into a data frame.


In [1]:  import pandas

The syntax for creating a data frame from a CSV file is pandas.read_csv(PATH), where PATH is the path to the file you would like to load into memory.

In line 2, we create a data frame by loading the data from practicedatabase.csv into memory.


In [2]: dataframe = pandas.read_csv("/home/backslash/Desktop/Pandas Guide/practicedatabase.csv")
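
If you do not have the practice file at hand, read_csv can be tried out on CSV text held in memory, since it accepts file-like objects as well as paths. The sketch below is an assumption-laden stand-in: csv_text is a hypothetical three-row sample modeled on the practice data, not the actual file.

```python
import io
import pandas

# Stand-in for a CSV file on disk (hypothetical sample rows).
csv_text = """ID,First,Last,Dept
1,Bob,Smith,IT
2,Jane,Henderson,HR
3,John,sanders,HR
"""

# read_csv accepts a path or any file-like object.
dataframe = pandas.read_csv(io.StringIO(csv_text))

print(dataframe.shape)          # (number of rows, number of columns)
print(list(dataframe.columns))  # column names taken from the header line
```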

We can use .head() to show the first five rows of our data set.


In [3]: dataframe.head()
   ID  First       Last  Dept
0   1    Bob      Smith    IT
1   2   Jane  Henderson    HR
2   3   John    sanders    HR
3   4   Mega        Rao    HR
4   5    Rob      Jones    HR


We can also specify the number of rows shown by passing .head() an integer, as shown below:


In [4]: dataframe.head(3)
   ID  First       Last  Dept
0   1    Bob      Smith    IT
1   2   Jane  Henderson    HR
2   3   John    sanders    HR


We can view our data by printing dataframe.


In [5]: print(dataframe)
   ID  First       Last  Dept
0   1    Bob      Smith    IT
1   2   Jane  Henderson    HR
2   3   John    sanders    HR
3   4   Mega        Rao    HR
4   5    Rob      Jones    HR
5   6  Danny     Medina    IT
6   7   Mark   Borrelly    IT
7   8  Clark     Jaques    HR
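
To follow along without the CSV file, the same eight rows can be typed in as a dictionary of columns. This is a sketch that assumes the file contains exactly the rows printed above; .tail() is the counterpart of .head() and returns rows from the end.

```python
import pandas

# Rows copied from the printed output above.
dataframe = pandas.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6, 7, 8],
    "First": ["Bob", "Jane", "John", "Mega", "Rob", "Danny", "Mark", "Clark"],
    "Last": ["Smith", "Henderson", "sanders", "Rao", "Jones", "Medina", "Borrelly", "Jaques"],
    "Dept": ["IT", "HR", "HR", "HR", "HR", "IT", "IT", "HR"],
})

print(len(dataframe))     # number of rows
print(dataframe.head(3))  # first three rows
print(dataframe.tail(2))  # last two rows
```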


Columns

There are four columns in this data frame: ID, First, Last, and Dept. Using the .columns attribute, we can view the columns that are available for us to work with.


In [5]: print(dataframe.columns)
Index(['ID', 'First', 'Last', 'Dept'], dtype='object')


We can call a column by using the name of the data frame followed by the column’s name:


In [5]: print(dataframe.Last)
0 Smith
1 Henderson
2 sanders
3 Rao
4 Jones
5 Medina
6 Borrelly
7 Jaques
Name: Last, dtype: object


However, attribute access only works when the column name is a valid Python identifier, such as a single word with no spaces. Bracket notation works for any column name, including names made of more than one word:


In [5]: print(dataframe["ID"])
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8


Here we are printing every value in the ID column.
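
The difference between the two styles shows up as soon as a column name contains a space. In the sketch below, "Start Date" is a hypothetical column that is not part of the practice data; it is added only to illustrate a multi-word name.

```python
import pandas

# "Start Date" is a made-up column used only to show a multi-word name.
dataframe = pandas.DataFrame({
    "ID": [1, 2],
    "Last": ["Smith", "Henderson"],
    "Start Date": ["2015-01-05", "2016-03-14"],
})

last_names = dataframe.Last            # attribute access: name is a valid identifier
same_names = dataframe["Last"]         # bracket access: works for any name
start_dates = dataframe["Start Date"]  # only bracket access works here

print(last_names.equals(same_names))
print(start_dates.tolist())
```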

Modifying a Data frame

Sometimes, you might need to create a data frame based on a selection of the data that is relevant to your research. Let’s create a new data frame and capture only the data in the columns ID, Last, and Dept.


In [5]: customdataframe = pandas.DataFrame(dataframe, columns=["ID", "Last","Dept"])

If we take a look at our customdataframe, we can see that only our selected columns are included:


In [5]: print(customdataframe)
   ID       Last  Dept
0   1      Smith    IT
1   2  Henderson    HR
2   3    sanders    HR
3   4        Rao    HR
4   5      Jones    HR
5   6     Medina    IT
6   7   Borrelly    IT
7   8     Jaques    HR
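
Another common way to get the same subset is to index the data frame with a list of column names. The sketch below uses a shortened, three-row version of the practice data and calls .copy() so the new frame is independent of the original; treat it as one possible approach rather than the required one.

```python
import pandas

dataframe = pandas.DataFrame({
    "ID": [1, 2, 3],
    "First": ["Bob", "Jane", "John"],
    "Last": ["Smith", "Henderson", "sanders"],
    "Dept": ["IT", "HR", "HR"],
})

# Passing a list of column names returns a data frame with only those columns.
customdataframe = dataframe[["ID", "Last", "Dept"]].copy()

print(list(customdataframe.columns))
print(list(dataframe.columns))  # the original frame is unchanged
```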

Indexing

By passing an index label to .loc, we can see all the data stored in that row.


In [5]: print(dataframe.loc[2])
ID 3
First John
Last sanders
Dept HR
Name: 2, dtype: object
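
pandas provides two explicit row accessors: .loc selects by index label and .iloc selects by position. The sketch below, which uses a shortened three-row version of the practice data, shows that they agree when the index is the default 0, 1, 2, ... and that .loc can also read a single cell.

```python
import pandas

dataframe = pandas.DataFrame({
    "ID": [1, 2, 3],
    "First": ["Bob", "Jane", "John"],
    "Last": ["Smith", "Henderson", "sanders"],
    "Dept": ["IT", "HR", "HR"],
})

row_by_label = dataframe.loc[2]      # row whose index label is 2
row_by_position = dataframe.iloc[2]  # third row, counting from 0

# With the default index the two lookups return the same row.
print(row_by_label["First"])
print(row_by_label.equals(row_by_position))

# A single cell can be read by giving the row label and the column name.
print(dataframe.loc[2, "Dept"])
```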

You can add custom columns to your dataframe.


In [5]: dataframe["Status"] = "Active"

In [5]: print(dataframe.head())
   ID  First       Last  Dept  Status
0   1    Bob      Smith    IT  Active
1   2   Jane  Henderson    HR  Active
2   3   John    sanders    HR  Active
3   4   Mega        Rao    HR  Active
4   5    Rob      Jones    HR  Active


The Status column has been created, with every value set to Active.

We can change any value in our data frame. If we wanted to change the Status of Mega from Active to Terminated, we can do so with .loc, giving it the index label of the row followed by the name of the column we would like to change.


In [6]: dataframe.loc[3, "Status"] = "Terminated"

Let’s see if it worked.

In [7]: print(dataframe.head())
   ID  First       Last  Dept      Status
0   1    Bob      Smith    IT      Active
1   2   Jane  Henderson    HR      Active
2   3   John    sanders    HR      Active
3   4   Mega        Rao    HR  Terminated
4   5    Rob      Jones    HR      Active
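
The same kind of update can be applied to many rows at once by selecting them with a condition. The sketch below uses a shortened version of the practice data; the "On Leave" rule for the IT department is a hypothetical example, not part of the original data.

```python
import pandas

dataframe = pandas.DataFrame({
    "ID": [1, 2, 3, 4],
    "First": ["Bob", "Jane", "John", "Mega"],
    "Dept": ["IT", "HR", "HR", "HR"],
})
dataframe["Status"] = "Active"

# Update one cell: row label 3, column "Status".
dataframe.loc[3, "Status"] = "Terminated"

# Update every row that matches a condition (made-up rule for illustration).
dataframe.loc[dataframe["Dept"] == "IT", "Status"] = "On Leave"

print(dataframe)
```

Writing the row selector and the column name inside a single .loc call modifies the data frame directly, which avoids the ambiguity of chaining two separate lookups.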


Appendix

Anaconda is a high-performance distribution of Python that includes over 100 packages used for data science and data analysis, Pandas among them. You can download Anaconda from its website, https://www.continuum.io/downloads. Once there, you have the choice of downloading the Python 3.5 or 2.7 version, each in either 64-bit or 32-bit form.

In this guide, we will be downloading the 64-bit Python 3.5 version of Anaconda. After the download is complete, let’s continue by listing the files in the directory to which we downloaded the file.


[user@mainrig ~]$   ls Downloads
Anaconda3-4.2.0-Linux-x86_64.sh


As we can see, we have the file Anaconda3-4.2.0-Linux-x86_64.sh in our downloads directory. We can proceed to install the file by issuing the command:

TIP: Make sure you are in the same directory as your file when you run the command.


[user@mainrig ~]$ bash Anaconda3-4.2.0-Linux-x86_64.sh
Welcome to Anaconda3 4.2.0 (by Continuum Analytics, Inc.)

In order to continue the installation process, please review the license agreement.
Please, press ENTER to continue
>>>


Press enter and scroll down through Anaconda’s license agreement. You will eventually get to the prompt asking you:


 Do you approve the license terms? [yes|no]
>>>


After you accept the license terms, Anaconda will show the path in which it is going to be installed. You can press ENTER to accept the default path or enter your own custom one at this point. The installation will then proceed. After all the operations have been completed, you will be prompted to add Anaconda to your PATH (highly recommended).

TIP: You might have to close your current terminal and reopen it for the changes to your .bashrc file to take effect.

Now that we have successfully installed Anaconda on our system, issue the command:


[user@mainrig ~]$ conda

If Anaconda was installed correctly on your system, you should see the following, among other available options.

usage: conda [-h] [-V] command ...

conda is a tool for managing and deploying applications, environments and packages……

The first thing we are going to do is install iPython-Notebook, as it allows us to showcase our work and code in a more interactive way.


[user@mainrig ~]$ conda install ipython-notebook

Type y and press ENTER to install the package.

In your terminal, issue the command:


[user@mainrig ~]$  ipython

The output will be a live prompt. You now have all the tools necessary to get started on learning how to use pandas and related data science packages.

