Learn Python Series (#30) - Data Science Part 1 - Pandas
Repository
What will I learn?
- You will learn what kind of toolset the pandas Python package provides, how to install it (if it isn't already installed in your current Python distribution), and how to import it into your projects;
- how to convert data (either passed in directly or read from another source) to a pandas DataFrame;
- how to save data from a pandas DataFrame to an external file, such as CSV;
- how to do some basic pandas data wrangling operations.
Requirements
- A working modern computer running macOS, Windows or Ubuntu;
- An installed Python 3(.7) distribution, such as (for example) the Anaconda Distribution;
- The ambition to learn Python programming.
Difficulty
- Beginner
Curriculum (of the Learn Python Series):
- Learn Python Series - Intro
- Learn Python Series (#2) - Handling Strings Part 1
- Learn Python Series (#3) - Handling Strings Part 2
- Learn Python Series (#4) - Round-Up #1
- Learn Python Series (#5) - Handling Lists Part 1
- Learn Python Series (#6) - Handling Lists Part 2
- Learn Python Series (#7) - Handling Dictionaries
- Learn Python Series (#8) - Handling Tuples
- Learn Python Series (#9) - Using Import
- Learn Python Series (#10) - Matplotlib Part 1
- Learn Python Series (#11) - NumPy Part 1
- Learn Python Series (#12) - Handling Files
- Learn Python Series (#13) - Mini Project - Developing a Web Crawler Part 1
- Learn Python Series (#14) - Mini Project - Developing a Web Crawler Part 2
- Learn Python Series (#15) - Handling JSON
- Learn Python Series (#16) - Mini Project - Developing a Web Crawler Part 3
- Learn Python Series (#17) - Roundup #2 - Combining and analyzing any-to-any multi-currency historical data
- Learn Python Series (#18) - PyMongo Part 1
- Learn Python Series (#19) - PyMongo Part 2
- Learn Python Series (#20) - PyMongo Part 3
- Learn Python Series (#21) - Handling Dates and Time Part 1
- Learn Python Series (#22) - Handling Dates and Time Part 2
- Learn Python Series (#23) - Handling Regular Expressions Part 1
- Learn Python Series (#24) - Handling Regular Expressions Part 2
- Learn Python Series (#25) - Handling Regular Expressions Part 3
- Learn Python Series (#26) - pipenv & Visual Studio Code
- Learn Python Series (#27) - Handling Strings Part 3 (F-Strings)
- Learn Python Series (#28) - Using Pickle and Shelve
- Learn Python Series (#29) - Handling CSV
Additional sample code files
The full - and working! - iPython tutorial sample code file is included for you to download and run for yourself right here:
https://github.com/realScipio/learn-python-series/blob/master/lps-030/learn-python-series-030-data-science-pt1-pandas.ipynb
GitHub Account
Welcome to episode #30 of the Learn Python Series already! It's been a while since I published my last tutorial episode (#29) on Python, after which I was busy with a number of projects, including co-developing and running UA and @steem-ua together with @holger80.
Not everybody realises that (although I can code) I'm not originally academically educated in Computer Science, and that I'm therefore writing the Learn Python Series partially as a documentation project on my own Python research, study and development aspirations. By carefully writing these tutorials in a very structured format, almost (or even exactly) "book-like", I'm "cementing" my own Python knowledge and skills. Over the past months I've gained an interest in learning more about Data Science using Python, and I recently came to the conclusion that my own "research notes" were beginning to pile up, so I felt the need to better document my progress. How better to do that than by resuming the Learn Python Series? So there you go...! ;-)
About Data Science
Data Science is about gaining insights from (huge) amounts of (structured) data by analysing that data, and about solving complex problems analytically and algorithmically; both the insights and the algorithmic solutions have the potential to generate much value. When you dig into (large / big) data sets, you might be able to discover new insights that were previously hidden. The process of first exploring data, then investigating it to discover its characteristics and patterns, and enriching it with other data, often requires a combination of analytical skills and mathematical / business / tech creativity. I suppose data science is positioned at the intersection of those fields, which aligns with my own interests as well; that is why I personally find Data Science fascinating to learn more about.
About the Python package pandas
pandas is a well-known and actively developed Python package which can be summarised as a "data analysis, wrangling and management toolkit"; I suppose you could call it "Excel for Python" in a way. pandas provides powerful and flexible methods and data formats to aid data science tasks using Python, and it's built on top of numpy ("Numerical Python", which we've already briefly talked about in episode #11 of the Learn Python Series).
pandas is positioned (as opposed to NumPy itself) as a more "high level" data analysis / wrangling toolkit, and - like Excel or OpenOffice "Calc" - it works really well with "tabular data". Unlike Excel / Calc, pandas is able to handle really large data sets, with file sizes ranging from hundreds of megabytes to even gigabytes (or more!); try working with (or even opening!) those in a regular Excel / Calc application running on a regular personal computer!
pandas can therefore be used to -1- clean / munge / wrangle data sets, -2- analyse and (re-) model the data set, and -3- organise the data analysis (to plot, display in tabular form, and/or further process).
In short, pandas is really powerful and cool, so let's dive right in!
Installing and importing pandas
If you're working with the Anaconda Python distribution, the pandas package is already installed by default, so you only need to import it in your project. If you haven't already installed pandas, that's as simple as:
pip install pandas
Then, create a new Python file, give it a relevant name (for example pandas_tut_1.py) and then simply begin with:
import pandas as pd
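To quickly check that the installation and import worked, you could print the installed version number. This is just a minimal sanity check; the version printed on your machine will most likely differ from the one I used while writing this episode.

```python
import pandas as pd

# If this prints a version string without raising an ImportError,
# pandas is installed and importable in the current environment.
print(pd.__version__)
```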
pandas Data Frame Basics
A DataFrame is a pandas data structure to represent tabular data (like a CSV file or an Excel spreadsheet with named columns and rows). Shortly hereafter, we'll cover how to read in an existing CSV file and convert it to a DataFrame object, but let's begin with creating a simple example DataFrame from scratch.
the .DataFrame() constructor
If we begin with a regular Python data object such as a dictionary, or a list of lists or tuples, pandas provides the .DataFrame() constructor to convert such data objects into a pandas DataFrame, for example like so:
import pandas as pd
weather_dict = {
'day': ['1/1/2019', '1/2/2019', '1/3/2019', '1/4/2019', '1/5/2019'],
'temp_celsius': [3, 2, -1, 0, 4]
}
df1 = pd.DataFrame(data=weather_dict)
df1
| | day | temp_celsius |
---|---|---|
0 | 1/1/2019 | 3 |
1 | 1/2/2019 | 2 |
2 | 1/3/2019 | -1 |
3 | 1/4/2019 | 0 |
4 | 1/5/2019 | 4 |
Explanation: after importing pandas as pd, and declaring a dictionary object with two keys (day and temp_celsius), each containing one list with 5 values, we then converted the weather_dict dictionary object into a DataFrame object (called df1).
Nota bene: as always, I'm writing this tutorial itself using Jupyter Notebook, which combines a Python interpreter, the markdown content, and a number of built-in Jupyter Notebook-specific methods and mechanisms. Running the above code inside a Jupyter Notebook outputs the df1 DataFrame simply by putting the variable df1 on the last line of a cell. In case you want to print the df1 DataFrame contents from the command line after having coded the above in an external code editor (e.g. Microsoft Visual Studio Code), then you need to do:
print(df1)
day temp_celsius
0 1/1/2019 3
1 1/2/2019 2
2 1/3/2019 -1
3 1/4/2019 0
4 1/5/2019 4
(From here on I'm assuming you're following along on a Jupyter Notebook as well, hence I won't be explicitly printing the DataFrame objects every time in the remainder of this and following tutorial(s).)
Nota bene: in this particular (dictionary) example, I've been using a "top down" (column-by-column) approach, in which data is converted into a DataFrame object by dictionary keys. However, a more "logical" approach would be to insert that data "row-by-row", as the temperature value of "3 degrees Celsius" belongs to the associated date value "1/1/2019".
Another way to construct the same DataFrame is via a "list of lists", which are then given column names as an additional constructor argument, like so:
import pandas as pd
weather_list = [
['1/1/2019', 3],
['1/2/2019', 2],
['1/3/2019', -1],
['1/4/2019', 0],
['1/5/2019', 4]
]
df2 = pd.DataFrame(data=weather_list, columns=['day', 'temp_celsius'])
df2
| | day | temp_celsius |
---|---|---|
0 | 1/1/2019 | 3 |
1 | 1/2/2019 | 2 |
2 | 1/3/2019 | -1 |
3 | 1/4/2019 | 0 |
4 | 1/5/2019 | 4 |
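A third row-by-row variation, sketched here purely as an illustration, is a list of dictionaries, where every dictionary represents one row and the keys double as column names. The variable names weather_records and df_records are just ones I made up for this example.

```python
import pandas as pd

# Each dictionary is one row; the keys become the column names,
# so no separate columns= argument is needed.
weather_records = [
    {'day': '1/1/2019', 'temp_celsius': 3},
    {'day': '1/2/2019', 'temp_celsius': 2},
    {'day': '1/3/2019', 'temp_celsius': -1},
    {'day': '1/4/2019', 'temp_celsius': 0},
    {'day': '1/5/2019', 'temp_celsius': 4},
]

df_records = pd.DataFrame(data=weather_records)
print(df_records)
```

This keeps each date visibly paired with its temperature, at the cost of repeating the key names on every row.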
the read_csv() method
As we've just learned, the DataFrame() constructor needs to be passed a data= argument, which is the Python object holding the (example) data. But of course when dealing with large data sets you're not going to declare all those values manually. Instead, you might have saved them already on disk and you'd like to read the data from disk and then convert it to a DataFrame object.
For exactly that purpose, pandas has the built-in function read_csv() (as well as a number of similar functions for other file types). Suppose the CSV file weather.csv exists in your current working directory; then you can construct the exact same DataFrame object like so:
import pandas as pd
df3 = pd.read_csv('weather.csv')
df3
| | day | temp_celsius |
---|---|---|
0 | 1/1/2019 | 3 |
1 | 1/2/2019 | 2 |
2 | 1/3/2019 | -1 |
3 | 1/4/2019 | 0 |
4 | 1/5/2019 | 4 |
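If you don't have a weather.csv file on disk yet, the sketch below feeds the same CSV content to read_csv() from an in-memory text buffer instead, so it runs on its own. As an extra (not needed for the rest of this episode) it also passes the parse_dates= parameter, which asks pandas to convert the day column to real dates; the names csv_text and df_csv are made up for this example.

```python
import io
import pandas as pd

# The same content weather.csv is assumed to hold, as a plain string.
csv_text = """day,temp_celsius
1/1/2019,3
1/2/2019,2
1/3/2019,-1
1/4/2019,0
1/5/2019,4
"""

# read_csv() accepts a file path or any file-like object;
# io.StringIO wraps the string so it behaves like an opened file.
df_csv = pd.read_csv(io.StringIO(csv_text), parse_dates=['day'])

print(df_csv)
print(df_csv.dtypes)  # 'day' is now a datetime column instead of plain text
```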
the to_csv() method
pandas also allows you to go the opposite route: to export DataFrame objects and save them to disk as .csv files. The to_csv() method is used for that.
Nota bene: in order to save the weather.csv example file (that we just read via read_csv()) from the df2 DataFrame object we constructed before, it's convenient to not save the 0,1,2,3,4 index values to the CSV file (those index values are specific to pandas, and don't directly belong to the original data set). By default (at least in pandas version 0.24.0, the current version at the time of writing) those index values would be exported to CSV, and so would the column names / headers (the first row of the CSV file). Since we do want those column names included in the CSV file, but not the pandas default index values, we set the index= parameter to False (and leave the header= parameter at its default value: True). As the first argument we pass the file name (optionally including a file path, in case you want to save it in a directory other than your current working directory):
df2.to_csv('weather.csv', index=False)
After running the above to_csv() code line, your file 'weather.csv' should be saved as a valid CSV file, located in your current working directory.
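To see what difference the index= parameter makes, the following sketch writes a small DataFrame twice to a temporary directory (so it doesn't touch your own files) and then compares the first line of both files. The file names and the df_demo variable are only made up for this illustration.

```python
import os
import tempfile
import pandas as pd

df_demo = pd.DataFrame({
    'day': ['1/1/2019', '1/2/2019'],
    'temp_celsius': [3, 2],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path_with_index = os.path.join(tmp_dir, 'with_index.csv')
    path_without_index = os.path.join(tmp_dir, 'without_index.csv')

    df_demo.to_csv(path_with_index)                  # default: the index is written too
    df_demo.to_csv(path_without_index, index=False)  # the index is left out

    # Read back only the header line of each file.
    with open(path_with_index) as f:
        header_with_index = f.readline().strip()
    with open(path_without_index) as f:
        header_without_index = f.readline().strip()

print(header_with_index)     # ,day,temp_celsius
print(header_without_index)  # day,temp_celsius
```

The leading comma in the first header is the (unnamed) index column, which is exactly the column we didn't want in weather.csv.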
the .head() and .tail() methods
When working with large data sets, it's often convenient to quickly inspect the data you're working with, without having to "eyeball" large amounts of data. To only display the top 5 rows of your DataFrame (including column names and index numbers) you can use the .head() method, and to only display the bottom 5 rows of your DataFrame you can use .tail().
Nota bene: please note that our (very simple) example weather.csv data set only contains 5 rows in total, for simplicity / explanation purposes; ergo, in this specific example case you wouldn't notice a difference when running either df3, or df3.head(), or df3.tail().
However, you can also pass an integer N to both .head() and .tail(), to only show N rows either at the top or bottom of your DataFrame, for example:
df3.head(2)
| | day | temp_celsius |
---|---|---|
0 | 1/1/2019 | 3 |
1 | 1/2/2019 | 2 |
df3.tail(2)
| | day | temp_celsius |
---|---|---|
3 | 1/4/2019 | 0 |
4 | 1/5/2019 | 4 |
In these specific examples, by passing the integer value of 2 to both .head(2) and .tail(2) we only show the top and bottom 2 rows of the DataFrame, respectively.
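Because the weather data is too small to show the default behaviour, here is a small sketch on a made-up DataFrame of 100 rows (df_big, with the hypothetical columns n and n_squared), where .head() and .tail() do make a visible difference.

```python
import pandas as pd

# A made-up data set of 100 rows, just to have "more than 5".
df_big = pd.DataFrame({
    'n': range(100),
    'n_squared': [i * i for i in range(100)],
})

print(df_big.head())   # no argument: the first 5 rows (index 0 to 4)
print(df_big.tail(3))  # the last 3 rows (index 97 to 99)
```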
Index slicing
If you're interested in using only a specific set of DataFrame rows, you can use index slices, just like we've already learned about for regular Python lists.
For example, if we only want to work with rows 1 and 2:
df3[1:3]
| | day | temp_celsius |
---|---|---|
1 | 1/2/2019 | 2 |
2 | 1/3/2019 | -1 |
Nota bene: the stop parameter is non-inclusive, ergo df3[1:3] means "begin with row 1 and stop at row number 3", hence, it shows rows 1 and 2.
In case you want to work with the entire DataFrame beginning with row number 2, then use:
df3[2:]
| | day | temp_celsius |
---|---|---|
2 | 1/3/2019 | -1 |
3 | 1/4/2019 | 0 |
4 | 1/5/2019 | 4 |
And in case you want to work with the DataFrame from the beginning up to (but not including) row number 3, then use:
df3[:3]
| | day | temp_celsius |
---|---|---|
0 | 1/1/2019 | 3 |
1 | 1/2/2019 | 2 |
2 | 1/3/2019 | -1 |
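As a side note, the same position-based row slices can also be written with the .iloc indexer, which makes it explicit that you're slicing by position. The sketch below rebuilds the weather data (as df_w, a name made up for this example) so it runs on its own, and also shows that negative slice values work like they do on lists.

```python
import pandas as pd

df_w = pd.DataFrame({
    'day': ['1/1/2019', '1/2/2019', '1/3/2019', '1/4/2019', '1/5/2019'],
    'temp_celsius': [3, 2, -1, 0, 4],
})

rows_plain = df_w[1:3]      # plain slice on the DataFrame
rows_iloc = df_w.iloc[1:3]  # the same slice via the position-based indexer

print(rows_plain.equals(rows_iloc))  # True: both hold rows 1 and 2

last_two = df_w[-2:]        # negative values count from the end
print(last_two)
```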
the columns attribute / property
If you want to assign, return, or print all column names your DataFrame holds, access the columns attribute / property, like so:
df3.columns
Index(['day', 'temp_celsius'], dtype='object')
the shape attribute / property
Accessing the shape property returns a tuple describing the DataFrame's "size" or "shape" in the form of (num_rows, num_columns), like so:
df3.shape
(5, 2)
Nota bene: if you only need the total number of rows, len(df3) works as well and returns the same value as the first element of the shape property:
num_rows_shape, num_cols_shape = df3.shape
print(f"Number of rows via .shape: {num_rows_shape}")
num_rows_len = len(df3)
print(f"Number of rows via len(): {num_rows_len}")
Number of rows via .shape: 5
Number of rows via len(): 5
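Along the same lines, the number of columns and the plain list of column names can be derived from shape and columns. This is only a small sketch on a rebuilt copy of the weather data; df_w, n_cols and column_names are names I made up for it.

```python
import pandas as pd

df_w = pd.DataFrame({
    'day': ['1/1/2019', '1/2/2019', '1/3/2019', '1/4/2019', '1/5/2019'],
    'temp_celsius': [3, 2, -1, 0, 4],
})

n_rows, n_cols = df_w.shape        # unpack the (num_rows, num_columns) tuple
column_names = list(df_w.columns)  # convert the Index object to a regular list

print(f"{n_rows} rows, {n_cols} columns: {column_names}")
```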
Two syntaxes for selecting columns
pandas allows for two syntaxes for selecting individual columns: the .column_name dot-property notation, and the ['column_name'] square bracket notation. E.g.:
df3.temp_celsius
0 3
1 2
2 -1
3 0
4 4
Name: temp_celsius, dtype: int64
and also:
df3['temp_celsius']
0 3
1 2
2 -1
3 0
4 4
Name: temp_celsius, dtype: int64
Nota bene: I strongly recommend only using the square bracket notation, as your DataFrame column names could be identical to pandas built-in attribute / property names. Suppose for example you'd have a column named columns (for some reason or another), then df3.columns returns the columns property values, not the content of the df3['columns'] DataFrame column!
Also, in case your column names contain one or more spaces or other non-alphanumerical characters, the dot-property syntax doesn't work.
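Both pitfalls can be reproduced with a tiny made-up DataFrame (df_pitfall) that has one column named shape, clashing with the built-in property, and one column name containing a space:

```python
import pandas as pd

# Hypothetical column names chosen to trigger both problems.
df_pitfall = pd.DataFrame({
    'shape': ['round', 'square'],
    'temp celsius': [3, 2],
})

print(df_pitfall.shape)                     # (2, 2): the built-in property wins
print(df_pitfall['shape'].tolist())         # ['round', 'square']: the actual column
print(df_pitfall['temp celsius'].tolist())  # [3, 2]: not reachable via dot notation
```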
Selecting multiple DataFrame columns
In case you want to select multiple DataFrame columns, but not all of them, then pass a list of column names instead. Again, in our very simple example file weather.csv we only have 2 columns containing data, so the example I'll give right now will return the same results (in this specific case) as the entire DataFrame object. If the DataFrame held another column, then of course this technique would only select the columns whose names are passed in the list, like so:
df3[['day', 'temp_celsius']]
| | day | temp_celsius |
---|---|---|
0 | 1/1/2019 | 3 |
1 | 1/2/2019 | 2 |
2 | 1/3/2019 | -1 |
3 | 1/4/2019 | 0 |
4 | 1/5/2019 | 4 |
Selecting specific rows and specific columns
In case you want to select both (one or more) columns and a slice of DataFrame rows, then combine the techniques just explained, beginning with the columns, then the row slice, for example:
df3['temp_celsius'][1:3]
1 2
2 -1
Name: temp_celsius, dtype: int64
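pandas also offers the .loc (label-based) and .iloc (position-based) indexers, which select rows and columns in one step. I'm only sketching them here on a rebuilt copy of the weather data (df_w, a made-up name); note that a .loc slice includes its stop label, whereas an .iloc slice does not include its stop position.

```python
import pandas as pd

df_w = pd.DataFrame({
    'day': ['1/1/2019', '1/2/2019', '1/3/2019', '1/4/2019', '1/5/2019'],
    'temp_celsius': [3, 2, -1, 0, 4],
})

subset_chained = df_w['temp_celsius'][1:3]   # column first, then row slice
subset_loc = df_w.loc[1:2, 'temp_celsius']   # labels 1 up to AND including 2
subset_iloc = df_w.iloc[1:3, 1]              # positions 1 and 2, column position 1

print(subset_loc)
print(subset_chained.equals(subset_loc))   # True
print(subset_chained.equals(subset_iloc))  # True
```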
Appending a new column using vectorised column operations
pandas is built on top of numpy, which very efficiently uses vectorised array operations, and pandas itself does so as well. Vectorisation means executing an operation on an entire column / array of data at once, instead of looping over the individual values.
Let's say, just to easily explain such a column-wise vectorisation process, we'd want to add another column to the DataFrame - called "temp_plus_one" - in which we want to store all the temperature values incremented by 1 degree Celsius. To do that, first we simply name / assign that new / extra column, then reference (in this case) the 'temp_celsius' column and add the value of 1 to it, like so:
df3['temp_plus_one'] = df3['temp_celsius'] + 1
df3
| | day | temp_celsius | temp_plus_one |
---|---|---|---|
0 | 1/1/2019 | 3 | 4 |
1 | 1/2/2019 | 2 | 3 |
2 | 1/3/2019 | -1 | 0 |
3 | 1/4/2019 | 0 | 1 |
4 | 1/5/2019 | 4 | 5 |
The DataFrame column "temp_plus_one" is now added to the original df3 DataFrame.
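The same vectorised style works for other arithmetic and for comparisons. As a slightly more realistic sketch, the code below derives a Fahrenheit column and a True/False column from the Celsius values; the column names temp_fahrenheit and below_freezing (and df_w itself) are made up for this example.

```python
import pandas as pd

df_w = pd.DataFrame({
    'day': ['1/1/2019', '1/2/2019', '1/3/2019', '1/4/2019', '1/5/2019'],
    'temp_celsius': [3, 2, -1, 0, 4],
})

# The whole column is converted in one expression, without a for-loop.
df_w['temp_fahrenheit'] = df_w['temp_celsius'] * 9 / 5 + 32

# Comparisons are vectorised too and yield a column of booleans.
df_w['below_freezing'] = df_w['temp_celsius'] < 0

print(df_w)
```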
Removing an existing column from a DataFrame, using .drop()
If we'd like to remove the "temp_plus_one" column from the DataFrame again, we can use the .drop() method. By default the .drop() method has its axis argument set to axis=0, which implies removing one or more rows from the data set. If we want to drop a column, we either pass axis=1 or axis='columns' as argument, like so:
df3 = df3.drop('temp_plus_one', axis='columns')
df3
| | day | temp_celsius |
---|---|---|
0 | 1/1/2019 | 3 |
1 | 1/2/2019 | 2 |
2 | 1/3/2019 | -1 |
3 | 1/4/2019 | 0 |
4 | 1/5/2019 | 4 |
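Two details are worth sketching here: .drop() returns a new DataFrame and leaves the original untouched (which is why the result was assigned back to df3 above), and with the default axis=0 it removes rows by their index labels. The example also uses the columns= keyword, an alternative way of saying axis='columns'; df_full and the other variable names are made up for this illustration.

```python
import pandas as pd

df_full = pd.DataFrame({
    'day': ['1/1/2019', '1/2/2019', '1/3/2019'],
    'temp_celsius': [3, 2, -1],
    'temp_plus_one': [4, 3, 0],
})

# Drop a column; df_full itself is not modified.
df_no_extra = df_full.drop(columns=['temp_plus_one'])

# Drop rows by index label (default axis=0).
df_fewer_rows = df_full.drop([0, 1])

print(list(df_full.columns))         # still contains 'temp_plus_one'
print(list(df_no_extra.columns))     # ['day', 'temp_celsius']
print(df_fewer_rows.index.tolist())  # [2]
```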
What did we learn, hopefully?
Data Science is (to me it is at least!) an extremely interesting topic, and the pandas Python package provides many powerful, relatively straightforward, and efficient tools for data analysis and manipulation. At the core of pandas is the DataFrame object / data format, which you can create from, and convert back to, regular Python data types (e.g. dictionaries, lists, tuples) and multiple file types (such as CSV, JSON and Excel).
In this episode we covered some pandas basics (of course with every newly covered topic we need to start at the beginning!), and in the following episodes we'll gradually expand on the possibilities pandas has to offer and move on to intermediate and advanced techniques.