Pandas Cheat Sheet


Author: Huzaifa Tahir
Date: January 31, 2023

Pandas¶

What is Pandas?¶

Pandas is a Python library that provides fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis/manipulation tool available in any language. It is already well on its way towards this goal.

Why Pandas?¶

Pandas is a great tool for data analysis and manipulation. It is a very powerful tool that can be used for a wide variety of tasks. It is also very easy to use and learn. It is a great tool for data analysis and manipulation. It is a very powerful tool that can be used for a wide variety of tasks. It is also very easy to use and learn.

How to install Pandas?¶

Pandas can be installed using pip or conda. If you are using pip, you can install it using the following command:

In [ ]:
! pip install pandas 
Requirement already satisfied: pandas in c:\python39\lib\site-packages (1.5.3)
Requirement already satisfied: python-dateutil>=2.8.1 in c:\users\huzai\appdata\roaming\python\python39\site-packages (from pandas) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in c:\python39\lib\site-packages (from pandas) (2022.7.1)
Requirement already satisfied: numpy>=1.20.3 in c:\python39\lib\site-packages (from pandas) (1.24.1)
Requirement already satisfied: six>=1.5 in c:\users\huzai\appdata\roaming\python\python39\site-packages (from python-dateutil>=2.8.1->pandas) (1.16.0)

If you are using conda, you can install it using the following command:

In [ ]:
! conda install pandas
'conda' is not recognized as an internal or external command,
operable program or batch file.

How to import Pandas?¶

Pandas can be imported using the following command:

In [ ]:
import pandas as pd

What is DataFrame?¶

A DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object.

What is Series?¶

A Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. Pandas Series is nothing but a column in an excel sheet.

How to create a DataFrame?¶

There are multiple ways to create a DataFrame. The simplest way is to create a DataFrame from a list of lists.

In [ ]:
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns = ['a', 'b'])

df
Out[ ]:
a b
0 1 2
1 3 4

Use Seaborn to load the dataset¶

In [ ]:
import pandas as pd
import seaborn as sns

Load the dataset¶

In [ ]:
df = sns.load_dataset('iris')

df
Out[ ]:
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
... ... ... ... ... ...
145 6.7 3.0 5.2 2.3 virginica
146 6.3 2.5 5.0 1.9 virginica
147 6.5 3.0 5.2 2.0 virginica
148 6.2 3.4 5.4 2.3 virginica
149 5.9 3.0 5.1 1.8 virginica

150 rows × 5 columns

Important function and methods¶

1- head()¶

  • head() to display first 5 rows in dataframe
  • we can also spacify value of (n) in head()
In [ ]:
df.head()
Out[ ]:
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
In [ ]:
df.head(10)
Out[ ]:
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
5 5.4 3.9 1.7 0.4 setosa
6 4.6 3.4 1.4 0.3 setosa
7 5.0 3.4 1.5 0.2 setosa
8 4.4 2.9 1.4 0.2 setosa
9 4.9 3.1 1.5 0.1 setosa
  • It Shows first 10 rows data

2- tail()¶

  • it show last n number rows of DataFrame
  • we can also spacify the number of rows
In [ ]:
df.tail()
Out[ ]:
sepal_length sepal_width petal_length petal_width species
145 6.7 3.0 5.2 2.3 virginica
146 6.3 2.5 5.0 1.9 virginica
147 6.5 3.0 5.2 2.0 virginica
148 6.2 3.4 5.4 2.3 virginica
149 5.9 3.0 5.1 1.8 virginica
In [ ]:
df.tail(10)
Out[ ]:
sepal_length sepal_width petal_length petal_width species
140 6.7 3.1 5.6 2.4 virginica
141 6.9 3.1 5.1 2.3 virginica
142 5.8 2.7 5.1 1.9 virginica
143 6.8 3.2 5.9 2.3 virginica
144 6.7 3.3 5.7 2.5 virginica
145 6.7 3.0 5.2 2.3 virginica
146 6.3 2.5 5.0 1.9 virginica
147 6.5 3.0 5.2 2.0 virginica
148 6.2 3.4 5.4 2.3 virginica
149 5.9 3.0 5.1 1.8 virginica

3- info()¶

  • It display information about Dataframe, including data types and memory usage
In [ ]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB

4- describe()¶

  • generates descriptive statistics of a DataFrame.

  • count - the number of non-NA/null values in the column.

  • mean - the mean of the column.

  • std - the standard deviation of the column.

  • min - the minimum value of the column.

  • 25% - the first quartile (25th percentile) of the column.

  • 50% - the second quartile (50th percentile or median) of the column.

  • 75% - the third quartile (75th percentile) of the column.

  • max - the maximum value of the column.

In [ ]:
df.describe()
Out[ ]:
sepal_length sepal_width petal_length petal_width
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.057333 3.758000 1.199333
std 0.828066 0.435866 1.765298 0.762238
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000

5- shape¶

  • Shape attribute shows number of rows and column in DataFrame
In [ ]:
df.shape
Out[ ]:
(150, 5)
  • (150,5) show rows is 150 and column is 5

  • Represent Rows ( -------------------------> )

  • Represent Column

          |
          |
          |
          |
          |
          |
    

Reduce the DataFrame size¶

In [ ]:
df.sample(frac=0.1).shape
Out[ ]:
(15, 5)

6- column¶

  • Access Column label Name
In [ ]:
df.columns
Out[ ]:
Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width',
       'species'],
      dtype='object')

7- dtype¶

  • this attribute show data type of column in dataframe
In [ ]:
df.dtypes
Out[ ]:
sepal_length    float64
sepal_width     float64
petal_length    float64
petal_width     float64
species          object
dtype: object
  • only select those have numeric types and floats
In [ ]:
df.select_dtypes(include=['number']).head()
Out[ ]:
sepal_length sepal_width petal_length petal_width
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2
  • only select those have object types
In [ ]:
df.select_dtypes(include=['object']).head()
Out[ ]:
species
0 setosa
1 setosa
2 setosa
3 setosa
4 setosa
  • only select those have multiple types
In [ ]:
df.select_dtypes(include=['object','number']).head()
Out[ ]:
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
  • exclude
In [ ]:
df.select_dtypes(exclude=['object']).head()
Out[ ]:
sepal_length sepal_width petal_length petal_width
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2

8- index¶

  • shows the rows label of a Datafraem
In [ ]:
df.index
Out[ ]:
RangeIndex(start=0, stop=150, step=1)
  • In This RangeIndex(start=0, stop=150, step=1), the row labels are integers ranging from 0 to 149, inclusive, and with a step of 1.

  • This means that the DataFrame has 150 rows, and the row labels are 0, 1, 2, ..., 148, 149.

9- T (Transpose)¶

  • transposing a DataFrame involves flipping its rows and columns, such that the rows become columns and vice versa.
In [ ]:
df.head().T
Out[ ]:
0 1 2 3 4
sepal_length 5.1 4.9 4.7 4.6 5.0
sepal_width 3.5 3.0 3.2 3.1 3.6
petal_length 1.4 1.4 1.3 1.5 1.4
petal_width 0.2 0.2 0.2 0.2 0.2
species setosa setosa setosa setosa setosa

10- rename()¶

  • Rename the column of Index of a DataFrame
In [ ]:
df.head()
Out[ ]:
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
In [ ]:
df.rename(columns={'sepal_length':'sepal_length_A'})
Out[ ]:
sepal_length_A sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
... ... ... ... ... ...
145 6.7 3.0 5.2 2.3 virginica
146 6.3 2.5 5.0 1.9 virginica
147 6.5 3.0 5.2 2.0 virginica
148 6.2 3.4 5.4 2.3 virginica
149 5.9 3.0 5.1 1.8 virginica

150 rows × 5 columns

In [ ]:
df.columns=df.columns.str.replace('_',' ')
df.head()
Out[ ]:
sepal length sepal width petal length petal width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
In [ ]:
# Adding Prefix
df=df.add_prefix('abu_')
df.head()
Out[ ]:
abu_sepal length abu_sepal width abu_petal length abu_petal width abu_species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
In [ ]:
# Adding suffix
df=df.add_suffix('_420')
df.head()
Out[ ]:
abu_sepal length_420 abu_sepal width_420 abu_petal length_420 abu_petal width_420 abu_species_420
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
In [ ]:
df= sns.load_dataset('iris')

11- astype()¶

Convert String to number¶
  • Change the data type of a column in Dataframe
In [ ]:
df.dtypes['sepal_length']
Out[ ]:
dtype('float64')
In [ ]:
df['sepal_length'].astype(int)
df['sepal_length']
Out[ ]:
0      5.1
1      4.9
2      4.7
3      4.6
4      5.0
      ... 
145    6.7
146    6.3
147    6.5
148    6.2
149    5.9
Name: sepal_length, Length: 150, dtype: float64
In [ ]:
df1=pd.DataFrame({'col_A':['1','2','3','4','5','6','7'],'col_B':['11','12','13','14','15','16','17']})
df1.dtypes
Out[ ]:
col_A    object
col_B    object
dtype: object
  • Best Way to write
In [ ]:
df1.astype({'col_A':'int64', 'col_B':'int64'}) 
Out[ ]:
col_A col_B
0 1 11
1 2 12
2 3 13
3 4 14
4 5 15
5 6 16
6 7 17

12- drop()¶

  • remove a row or column from dataframe
In [ ]:
df.drop(columns='sepal_length')
Out[ ]:
sepal_width petal_length petal_width species
0 3.5 1.4 0.2 setosa
1 3.0 1.4 0.2 setosa
2 3.2 1.3 0.2 setosa
3 3.1 1.5 0.2 setosa
4 3.6 1.4 0.2 setosa
... ... ... ... ...
145 3.0 5.2 2.3 virginica
146 2.5 5.0 1.9 virginica
147 3.0 5.2 2.0 virginica
148 3.4 5.4 2.3 virginica
149 3.0 5.1 1.8 virginica

150 rows × 4 columns

13- sort_values()¶

  • Sort The Dataframe by one or more column
In [ ]:
df.sort_values(by='sepal_width', ascending=False).head()
Out[ ]:
sepal_length sepal_width petal_length petal_width species
15 5 4.4 1.5 0.4 setosa
33 5 4.2 1.4 0.2 setosa
32 5 4.1 1.5 0.1 setosa
14 5 4.0 1.2 0.2 setosa
16 5 3.9 1.3 0.4 setosa

14- groupby()¶

  • Group the datain a dataframe one or more columns
In [ ]:
df.groupby(by=['sepal_length','species']).count()
Out[ ]:
sepal_width petal_length petal_width
sepal_length species
4 setosa 20 20 20
versicolor 1 1 1
virginica 1 1 1
5 setosa 30 30 30
versicolor 25 25 25
virginica 6 6 6
6 versicolor 23 23 23
virginica 31 31 31
7 versicolor 1 1 1
virginica 12 12 12
In [ ]:
df.groupby('who').count()
Out[ ]:
survived pclass sex age sibsp parch fare embarked class adult_male deck embark_town alive alone
who
child 83 83 83 83 83 83 83 83 83 83 13 83 83 83
man 537 537 537 413 537 537 537 537 537 537 99 537 537 537
woman 271 271 271 218 271 271 271 269 271 271 91 269 271 271
In [ ]:
df.groupby('who').size()
Out[ ]:
who
child     83
man      537
woman    271
dtype: int64
In [ ]:
df.groupby('who').size().reset_index(name='counts')
Out[ ]:
who counts
0 child 83
1 man 537
2 woman 271
In [ ]:
df.groupby('who').sum()
C:\Users\huzai\AppData\Local\Temp\ipykernel_12200\2979829937.py:1: FutureWarning: The default value of numeric_only in DataFrameGroupBy.sum is deprecated. In a future version, numeric_only will default to False. Either specify numeric_only or select only columns which should be valid for the function.
  df.groupby('who').sum()
Out[ ]:
survived pclass age sibsp parch fare adult_male alone
who
child 49 218 528.67 144 105 2721.2210 0 6
man 88 1274 13700.50 159 82 13352.0656 537 410
woman 205 565 6976.00 163 153 12620.6627 0 121
In [ ]:
df.groupby(['sex','pclass','who']).count()
Out[ ]:
survived age sibsp parch fare embarked class adult_male deck embark_town alive alone
sex pclass who
female 1 child 3 3 3 3 3 3 3 3 3 3 3 3
woman 91 82 91 91 91 89 91 91 78 89 91 91
2 child 10 10 10 10 10 10 10 10 1 10 10 10
woman 66 64 66 66 66 66 66 66 9 66 66 66
3 child 30 30 30 30 30 30 30 30 2 30 30 30
woman 114 72 114 114 114 114 114 114 4 114 114 114
male 1 child 3 3 3 3 3 3 3 3 3 3 3 3
man 119 98 119 119 119 119 119 119 91 119 119 119
2 child 9 9 9 9 9 9 9 9 3 9 9 9
man 99 90 99 99 99 99 99 99 3 99 99 99
3 child 28 28 28 28 28 28 28 28 1 28 28 28
man 319 225 319 319 319 319 319 319 5 319 319 319
In [ ]:
df.describe().loc['min':'max',['survived','age','fare']]
Out[ ]:
survived age fare
min 0.0 0.420 0.0000
25% 0.0 20.125 7.9104
50% 0.0 28.000 14.4542
75% 1.0 38.000 31.0000
max 1.0 80.000 512.3292

15- merge()¶

  • Combine two or more Dataframe into one
In [ ]:
df1 = pd.DataFrame({'key': [1, 2, 3], 'A': [4, 5, 6]})
df2 = pd.DataFrame({'key': [2, 3, 4], 'B': [7, 8, 9]})
pd.merge(df1, df2)
Out[ ]:
key A B
0 2 5 7
1 3 6 8

16- concat()¶

  • concatenates two or more DataFrames along a specific axis
In [ ]:
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']})

df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                    'B': ['B4', 'B5', 'B6', 'B7'],
                    'C': ['C4', 'C5', 'C6', 'C7'],
                    'D': ['D4', 'D5', 'D6', 'D7']})

pd.concat([df1, df2])
Out[ ]:
A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
0 A4 B4 C4 D4
1 A5 B5 C5 D5
2 A6 B6 C6 D6
3 A7 B7 C7 D7

17- Save dataSet to excel file¶

In [ ]:
df.to_excel('phool.xlsx')

18- Reverse Row Order¶

In [ ]:
df.loc[::-1].head()
Out[ ]:
sepal_length sepal_width petal_length petal_width species
149 5.9 3.0 5.1 1.8 virginica
148 6.2 3.4 5.4 2.3 virginica
147 6.5 3.0 5.2 2.0 virginica
146 6.3 2.5 5.0 1.9 virginica
145 6.7 3.0 5.2 2.3 virginica

19- Reverse column order¶

In [ ]:
df.loc[:, ::-1].head()
Out[ ]:
species petal_width petal_length sepal_width sepal_length
0 setosa 0.2 1.4 3.5 5.1
1 setosa 0.2 1.4 3.0 4.9
2 setosa 0.2 1.3 3.2 4.7
3 setosa 0.2 1.5 3.1 4.6
4 setosa 0.2 1.4 3.6 5.0

20- Copy data from clipboard¶

In [ ]:
df=pd.read_clipboard()
df
Out[ ]:
# 20- Copy data from clipboard

21- Split dataframe into two subsets¶

In [ ]:
import pandas as pd
import seaborn as sns

df2= sns.load_dataset('titanic')
In [ ]:
from  random import random
kashti_1=df.sample(frac=0.50,random_state=1)
kashti_1.shape
Out[ ]:
(446, 15)
In [ ]:
kashti_2=df.drop(kashti_1.index)
kashti_2.shape
Out[ ]:
(445, 15)
In [ ]:
kashti_1.head()
Out[ ]:
survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone
862 1 1 female 48.0 0 0 25.9292 S First woman False D Southampton yes True
223 0 3 male NaN 0 0 7.8958 S Third man True NaN Southampton no True
84 1 2 female 17.0 0 0 10.5000 S Second woman False NaN Southampton yes True
680 0 3 female NaN 0 0 8.1375 Q Third woman False NaN Queenstown no True
535 1 2 female 7.0 0 2 26.2500 S Second child False NaN Southampton yes False
In [ ]:
kashti_2.head()
Out[ ]:
survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone
1 1 1 female 38.0 1 0 71.2833 C First woman False C Cherbourg yes False
7 0 3 male 2.0 3 1 21.0750 S Third child False NaN Southampton no False
10 1 3 female 4.0 1 1 16.7000 S Third child False G Southampton yes False
15 1 2 female 55.0 0 0 16.0000 S Second woman False NaN Southampton yes True
18 0 3 female 31.0 1 0 18.0000 S Third woman False NaN Southampton no False

22- Join Two Datasets¶

In [ ]:
df4=kashti_1.append(kashti_2)
df4.shape
C:\Users\huzai\AppData\Local\Temp\ipykernel_12200\1148480633.py:1: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
  df4=kashti_1.append(kashti_2)
Out[ ]:
(891, 15)

23- Filltering a Dataset¶

In [ ]:
df.head()
Out[ ]:
survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone
0 0 3 male 22.0 1 0 7.2500 S Third man True NaN Southampton no False
1 1 1 female 38.0 1 0 71.2833 C First woman False C Cherbourg yes False
2 1 3 female 26.0 0 0 7.9250 S Third woman False NaN Southampton yes True
3 1 1 female 35.0 1 0 53.1000 S First woman False C Southampton yes False
4 0 3 male 35.0 0 0 8.0500 S Third man True NaN Southampton no True
  • The unique() method in Pandas is used to return the unique values of a particular column in a pandas DataFrame or Series.
In [ ]:
df.sex.unique()
Out[ ]:
array(['male', 'female'], dtype=object)
In [ ]:
df[(df.sex=='female')].head()
Out[ ]:
survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone
1 1 1 female 38.0 1 0 71.2833 C First woman False C Cherbourg yes False
2 1 3 female 26.0 0 0 7.9250 S Third woman False NaN Southampton yes True
3 1 1 female 35.0 1 0 53.1000 S First woman False C Southampton yes False
8 1 3 female 27.0 0 2 11.1333 S Third woman False NaN Southampton yes False
9 1 2 female 14.0 1 0 30.0708 C Second child False NaN Cherbourg yes False

Boolean indexing¶

In [ ]:
df[((df.embark_town=='Southampton') |
    (df.embark_town=='Queenstown')) & 
    (df.sex=='female')
    ].count()
Out[ ]:
survived       239
pclass         239
sex            239
age            198
sibsp          239
parch          239
fare           239
embarked       239
class          239
who            239
adult_male     239
deck            58
embark_town    239
alive          239
alone          239
dtype: int64
  • The isin() method in Pandas is used to filter a DataFrame or Series based on values that match a given list. The method returns a Boolean mask indicating which values match the given list
In [ ]:
df[df.embark_town.isin(['Queenstown'])].head()
Out[ ]:
survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone
5 0 3 male NaN 0 0 8.4583 Q Third man True NaN Queenstown no True
16 0 3 male 2.0 4 1 29.1250 Q Third child False NaN Queenstown no False
22 1 3 female 15.0 0 0 8.0292 Q Third child False NaN Queenstown yes True
28 1 3 female NaN 0 0 7.8792 Q Third woman False NaN Queenstown yes True
32 1 3 female NaN 0 0 7.7500 Q Third woman False NaN Queenstown yes True
In [ ]:
df[(df.age > 30)].shape
Out[ ]:
(305, 15)
In [ ]:
df[(df.who == 'child')].shape
Out[ ]:
(83, 15)
In [ ]:
df[(df.alive == 'no') & (df.who == 'child')].shape
Out[ ]:
(34, 15)
In [ ]:
df[(df.alive == 'yes') & (df.who == 'child')].head(10)
Out[ ]:
survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone
9 1 2 female 14.00 1 0 30.0708 C Second child False NaN Cherbourg yes False
10 1 3 female 4.00 1 1 16.7000 S Third child False G Southampton yes False
22 1 3 female 15.00 0 0 8.0292 Q Third child False NaN Queenstown yes True
39 1 3 female 14.00 1 0 11.2417 C Third child False NaN Cherbourg yes False
43 1 2 female 3.00 1 2 41.5792 C Second child False NaN Cherbourg yes False
58 1 2 female 5.00 1 2 27.7500 S Second child False NaN Southampton yes False
78 1 2 male 0.83 0 2 29.0000 S Second child False NaN Southampton yes False
125 1 3 male 12.00 1 0 11.2417 C Third child False NaN Cherbourg yes False
165 1 3 male 9.00 0 2 20.5250 S Third child False NaN Southampton yes False
172 1 3 female 1.00 1 1 11.1333 S Third child False NaN Southampton yes False

24- Filtering by large categories¶

value_counts()¶

  • return a series containing counts of unique values
In [ ]:
df.embark_town.value_counts()
Out[ ]:
Southampton    644
Cherbourg      168
Queenstown      77
Name: embark_town, dtype: int64
  • n largest values in dataFrame
In [ ]:
df.age.value_counts().nlargest(5)
Out[ ]:
24.0    30
22.0    27
18.0    26
19.0    25
28.0    25
Name: age, dtype: int64
In [ ]:
counts=df.who.value_counts()
counts.nlargest(5).index
Out[ ]:
Index(['man', 'woman', 'child'], dtype='object')
In [ ]:
counts=df.who.value_counts()
counts.nlargest(5)
Out[ ]:
man      537
woman    271
child     83
Name: who, dtype: int64

25- Splitting a string into multiple columns¶

In [ ]:
df5=pd.DataFrame({'name':['Hunzala Tahir','Huraira Tariq','Huzaifa Tahir','Ibraheem Tahir'],
                'location':['Fasilabad , Pakistan','Islamabad , Pakistan','Lahore , Pakistan','Gojra , Pakistan'],
                    })
df5
Out[ ]:
name location
0 Hunzala Tahir Fasilabad , Pakistan
1 Huraira Tariq Islamabad , Pakistan
2 Huzaifa Tahir Lahore , Pakistan
3 Ibraheem Tahir Gojra , Pakistan
In [ ]:
df5[['f_name','l_name']]=df5.name.str.split(' ',expand=True)
df5
Out[ ]:
name location f_name l_name
0 Hunzala Tahir Fasilabad , Pakistan Hunzala Tahir
1 Huraira Tariq Islamabad , Pakistan Huraira Tariq
2 Huzaifa Tahir Lahore , Pakistan Huzaifa Tahir
3 Ibraheem Tahir Gojra , Pakistan Ibraheem Tahir
In [ ]:
df5[['city','country']]=df5.location.str.split(',',expand=True)
df5
Out[ ]:
name location f_name l_name city country
0 Hunzala Tahir Fasilabad , Pakistan Hunzala Tahir Fasilabad Pakistan
1 Huraira Tariq Islamabad , Pakistan Huraira Tariq Islamabad Pakistan
2 Huzaifa Tahir Lahore , Pakistan Huzaifa Tahir Lahore Pakistan
3 Ibraheem Tahir Gojra , Pakistan Ibraheem Tahir Gojra Pakistan
In [ ]:
df5=df5[['f_name','l_name','city','country']]
df5
Out[ ]:
f_name l_name city country
0 Hunzala Tahir Fasilabad Pakistan
1 Huraira Tariq Islamabad Pakistan
2 Huzaifa Tahir Lahore Pakistan
3 Ibraheem Tahir Gojra Pakistan

26- convert one set of values into another one¶

In [ ]:
df['sex_num']=df.sex.map({'male':0,'female':1})
df.head()
Out[ ]:
survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone sex_num
0 0 3 male 22.0 1 0 7.2500 S Third man True NaN Southampton no False 0
1 1 1 female 38.0 1 0 71.2833 C First woman False C Cherbourg yes False 1
2 1 3 female 26.0 0 0 7.9250 S Third woman False NaN Southampton yes True 1
3 1 1 female 35.0 1 0 53.1000 S First woman False C Southampton yes False 1
4 0 3 male 35.0 0 0 8.0500 S Third man True NaN Southampton no True 0
In [ ]:
df['embarked_num']=df.embarked.factorize()[0]
df.head(15)
Out[ ]:
survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone sex_num embarked_num
0 0 3 male 22.0 1 0 7.2500 S Third man True NaN Southampton no False 0 0
1 1 1 female 38.0 1 0 71.2833 C First woman False C Cherbourg yes False 1 1
2 1 3 female 26.0 0 0 7.9250 S Third woman False NaN Southampton yes True 1 0
3 1 1 female 35.0 1 0 53.1000 S First woman False C Southampton yes False 1 0
4 0 3 male 35.0 0 0 8.0500 S Third man True NaN Southampton no True 0 0
5 0 3 male NaN 0 0 8.4583 Q Third man True NaN Queenstown no True 0 2
6 0 1 male 54.0 0 0 51.8625 S First man True E Southampton no True 0 0
7 0 3 male 2.0 3 1 21.0750 S Third child False NaN Southampton no False 0 0
8 1 3 female 27.0 0 2 11.1333 S Third woman False NaN Southampton yes False 1 0
9 1 2 female 14.0 1 0 30.0708 C Second child False NaN Cherbourg yes False 1 1
10 1 3 female 4.0 1 1 16.7000 S Third child False G Southampton yes False 1 0
11 1 1 female 58.0 0 0 26.5500 S First woman False C Southampton yes True 1 0
12 0 3 male 20.0 0 0 8.0500 S Third man True NaN Southampton no True 0 0
13 0 3 male 39.0 1 5 31.2750 S Third man True NaN Southampton no False 0 0
14 0 3 female 14.0 0 0 7.8542 S Third child False NaN Southampton no True 1 0

27- Reshaping a dataframe¶

In [ ]:
fasla=pd.DataFrame([['12345',100,200,300],['34567',400,500,600],['56789',700,800,900]],columns=['zip','factroy','warehouse','retail'])
fasla.head()
Out[ ]:
zip factroy warehouse retail
0 12345 100 200 300
1 34567 400 500 600
2 56789 700 800 900
  • melt() it takes columns of a DataFrame and "melts" them into a single column.
In [ ]:
fasla_long=fasla.melt(id_vars='zip',var_name='location',value_name='distance')
fasla_long.head()
Out[ ]:
zip location distance
0 12345 factroy 100
1 34567 factroy 400
2 56789 factroy 700
3 12345 warehouse 200
4 34567 warehouse 500

28- loc[]¶

  • The loc[] attribute in Pandas is used to index and select data in a pandas DataFrame based on label(s). It can be used to select one or more rows, columns, or both, based on the labels.
In [ ]:
df4 = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                   'B': [10, 20, 30, 40, 50]})

df4.loc[0:2]
Out[ ]:
A B
0 1 10
1 2 20
2 3 30
In [ ]:
df4.loc[:, 'A']
Out[ ]:
0    1
1    2
2    3
3    4
4    5
Name: A, dtype: int64
In [ ]:
df4.loc[0:2, 'A']
Out[ ]:
0    1
1    2
2    3
Name: A, dtype: int64

29- iloc[]¶

  • The iloc[] attribute in Pandas is used to index and select data in a Pandas DataFrame based on integer position(s). It is an integer-based indexing and slicing attribute, and works similarly to loc[] but with integer-based indexing instead of label-based indexing.
In [ ]:
df4.iloc[2]
Out[ ]:
A     3
B    30
Name: 2, dtype: int64
In [ ]:
df4.iloc[2:5, 0:2]
Out[ ]:
A B
2 3 30
3 4 40
4 5 50

30- query()¶

  • The query() method in Pandas is used to filter a DataFrame based on conditions specified as a string. The conditions in the string passed to query() are evaluated using the DataFrame as the context, so you can use column names directly in the query string
In [ ]:
df11 = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                   'B': [10, 20, 30, 40, 50]})
In [ ]:
df11.query("A > 2 and B < 40")
Out[ ]:
A B
2 3 30
  • In this example, the query() method is used to filter the DataFrame df based on the conditions "A > 2 and B < 40". The result is a new DataFrame containing only the rows that meet both conditions. Note that the conditions in the query string are evaluated using the DataFrame, so you can refer to columns directly by name.

-------------
Jazak-Allah
-------------