Connect to Google Drive
from google.colab import drive
drive.mount('/content/drive')
Set the path to the CSV file on the mounted drive
file_path = '/content/drive/MyDrive/Importent Document/healthcare-dataset-stroke-data.csv'
import pandas as pd loads the pandas data analysis library into the current Python environment and gives it the alias pd. With the alias in place, pandas functions can be called as pd.function_name instead of pandas.function_name.
import pandas as pd
import numpy as np loads the NumPy library into Python under the alias np. NumPy provides support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions that operate on these arrays. It is the fundamental package for scientific computing with Python and is widely used in many fields, including data science, machine learning, and engineering. With the alias in place, NumPy functions can be called as np.function_name instead of numpy.function_name.
import numpy as np
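As a small illustration of what "operating on arrays" means, here is a minimal sketch. The variable names are made up for this example, and the four ages are simply the first four age values shown in the dataset preview.

```python
import numpy as np

# Illustrative values only: the first four ages from the dataset preview.
ages = np.array([67.0, 61.0, 80.0, 49.0])

# Arithmetic is applied element by element, with no explicit loop.
ages_in_months = ages * 12

# Reductions such as mean() summarise the whole array in one call.
mean_age = ages.mean()

# A 2-D array (matrix) with 2 rows and 3 columns.
matrix = np.arange(6).reshape(2, 3)

print(ages_in_months)
print(mean_age)
print(matrix.shape)
```

The same np.function_name pattern applies to the rest of the library.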
df = pd.read_csv(file_path) is a common way to read a CSV file into a pandas DataFrame. Here, df is the variable that holds the DataFrame, pd is the alias for the pandas library, read_csv is the pandas function that reads a CSV file and returns a DataFrame, and file_path is the path to the CSV file you want to load.
df = pd.read_csv(file_path)
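To try read_csv without access to the Drive file, the sketch below reads a tiny in-memory CSV instead. The text in csv_text is invented for the example and is not the real dataset; io.StringIO just makes a string behave like a file.

```python
import io
import pandas as pd

# A tiny in-memory CSV standing in for the real file (values are made up).
csv_text = "id,gender,age\n1,Male,67.0\n2,Female,61.0\n"

# read_csv accepts a path or a file-like object and returns a DataFrame.
sample_df = pd.read_csv(io.StringIO(csv_text))

print(type(sample_df).__name__)
print(sample_df.shape)
print(list(sample_df.columns))
```

The first line of the CSV becomes the column names, and each following line becomes one row.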
Display the DataFrame. In a notebook, we use just the variable name.
df

| | id | gender | age | hypertension | heart_disease | ever_married | work_type | ... |
|---|---|---|---|---|---|---|---|---|
| 0 | 9046 | Male | 67.0 | 0 | 1 | Yes | Private | ... |
| 1 | 51676 | Female | 61.0 | 0 | 0 | Yes | Self-employed | ... |
| 2 | 31112 | Male | 80.0 | 0 | 1 | Yes | Private | ... |
| 3 | 60182 | Female | 49.0 | 0 | 0 | Yes | Private | ... |
| 4 | 1665 | Female | 79.0 | 1 | 0 | Yes | Self-employed | ... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5105 | 18234 | Female | 80.0 | 1 | 0 | Yes | Private | ... |
| 5106 | 44873 | Female | 81.0 | 0 | 0 | Yes | Self-employed | ... |
| 5107 | 19723 | Female | 35.0 | 0 | 0 | Yes | Self-employed | ... |
| 5108 | 37544 | Male | 51.0 | 0 | 0 | Yes | Private | ... |
The df.isnull() method detects missing values in a DataFrame. It returns a DataFrame of the same shape as the original, where each element is a Boolean value indicating whether the corresponding element in the original DataFrame is missing.
df.isnull()

| | id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False | False | False | False | False | False |
| 1 | False | False | False | False | False | False | False | False | False | True | False |
| 2 | False | False | False | False | False | False | False | False | False | False | False |
| 3 | False | False | False | False | False | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False | False | False | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5105 | False | False | False | False | False | False | False | False | False | True | False |
| 5106 | False | False | False | False | False | False | False | False | False | False | False |
| 5107 | False | False | False | False | False | False | False | False | False | False | False |
| 5108 | False | False | False | False | False | False | False | False | False | False | False |
| 5109 | False | False | False | False | False | False | False | False | False | False | False |

df.isnull().sum()
id 0
gender 0
age 0
hypertension 0
heart_disease 0
ever_married 0
work_type 0
Residence_type 0
avg_glucose_level 0
bmi 201
smoking_status 0
stroke 0
dtype: int64
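The Boolean mask and the per-column count can be easier to follow on a very small frame. The sketch below uses an invented three-row DataFrame named toy (not the stroke dataset) with one missing bmi value.

```python
import numpy as np
import pandas as pd

# Invented toy data with a single missing bmi value in row 1.
toy = pd.DataFrame({
    "age": [67.0, 61.0, 80.0],
    "bmi": [36.6, np.nan, 32.5],
})

# isnull() returns a Boolean frame with the same shape as toy.
mask = toy.isnull()

# Summing the mask counts the True values in each column.
missing_per_column = mask.sum()

# any(axis=1) flags rows that contain at least one missing value.
rows_with_missing = toy[mask.any(axis=1)]

print(mask)
print(missing_per_column)
print(rows_with_missing)
```

Summing works because True counts as 1 and False as 0, which is why df.isnull().sum() gives the number of missing values per column.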
(df.isnull().sum())/len(df)*100 calculates the percentage of missing values in each column of a DataFrame. It returns a pandas Series whose index is the column names and whose values are the percentage of missing values in each column.
(df.isnull().sum())/len(df)*100
id 0.000000
gender 0.000000
age 0.000000
hypertension 0.000000
heart_disease 0.000000
ever_married 0.000000
work_type 0.000000
Residence_type 0.000000
avg_glucose_level 0.000000
bmi 3.933464
smoking_status 0.000000
stroke 0.000000
dtype: float64
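The bmi percentage can be checked by hand from the counts shown earlier: 201 missing values out of 5110 rows. The sketch below repeats that arithmetic and then applies the same formula to an invented four-row frame named toy. The mean-based variant is an equivalent shortcut, not something the original notebook uses.

```python
import numpy as np
import pandas as pd

# Hand check using the counts from the outputs: 201 missing out of 5110 rows.
bmi_pct = round(201 / 5110 * 100, 6)
print(bmi_pct)

# Invented toy data: one of four bmi values is missing.
toy = pd.DataFrame({
    "age": [67.0, 61.0, 80.0, 49.0],
    "bmi": [36.6, np.nan, 32.5, 34.4],
})

# Same formula as in the text: count of missing values / number of rows * 100.
pct_missing = toy.isnull().sum() / len(toy) * 100

# Equivalent shortcut: the mean of a Boolean column is the fraction of True values.
pct_missing_alt = toy.isnull().mean() * 100

print(pct_missing)
```

A column with under 4% missing values, as bmi has here, is usually a candidate for either dropping the affected rows or filling them in.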
df_1 = df.copy() creates a copy of the DataFrame df and assigns it to the variable df_1. copy() is a DataFrame method that returns a new object. By default the copy is deep, so later changes to df_1 do not affect df.
df_1 = df.copy()
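The difference between copy() and a plain assignment is easy to miss, so here is a minimal sketch on invented data. The names original, independent, and alias are hypothetical and exist only for this example.

```python
import pandas as pd

original = pd.DataFrame({"bmi": [36.6, 32.5]})

# copy() returns a new, independent DataFrame (deep copy by default).
independent = original.copy()
independent.loc[0, "bmi"] = 0.0

# Plain assignment does not copy: alias is just another name for original.
alias = original
alias.loc[1, "bmi"] = 99.0

print(original)     # row 0 is unchanged, row 1 was changed through alias
print(independent)  # row 0 was changed, row 1 still holds the old value
```

This is why the notebook works on df_1 and df_2 rather than on df itself: the raw data stays available for comparison.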
df_1 = df_1.dropna() removes the rows with missing values from the DataFrame df_1. dropna() is a DataFrame method that removes rows (the default) or columns that contain missing values.
df_1 = df_1.dropna()
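The row and column behaviour of dropna() can be compared on a small invented frame. The sketch below uses a hypothetical toy DataFrame; the axis and subset arguments are standard dropna() parameters, shown here only as options the notebook itself does not use.

```python
import numpy as np
import pandas as pd

# Invented toy data with one missing bmi value in row 1.
toy = pd.DataFrame({
    "age": [67.0, 61.0, 80.0],
    "bmi": [36.6, np.nan, 32.5],
})

# Default (axis=0): drop every row that contains a missing value.
rows_dropped = toy.dropna()

# axis=1: drop every column that contains a missing value.
cols_dropped = toy.dropna(axis=1)

# subset: only look at the listed columns when deciding which rows to drop.
subset_dropped = toy.dropna(subset=["age"])

print(rows_dropped)
print(cols_dropped)
print(subset_dropped)
```

Given the counts above (201 missing bmi values in 5110 rows, and no missing values in any other column), df_1 should be left with 5110 - 201 = 4909 rows after dropna().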
Copy df into df_2 and replace the missing bmi values with the column mean
df_2 = df.copy()
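# Assign the result back; inplace=True on a selected column triggers a warning in recent pandas versions.
df_2['bmi'] = df_2['bmi'].fillna(df_2['bmi'].mean())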
df_2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 12 columns):

| # | Column | Non-Null Count | Dtype |
|---|---|---|---|
| 0 | id | 5110 non-null | int64 |
| 1 | gender | 5110 non-null | object |
| 2 | age | 5110 non-null | float64 |
| 3 | hypertension | 5110 non-null | int64 |
| 4 | heart_disease | 5110 non-null | int64 |
| 5 | ever_married | 5110 non-null | object |
| 6 | work_type | 5110 non-null | object |
| 7 | Residence_type | 5110 non-null | object |
| 8 | avg_glucose_level | 5110 non-null | float64 |
| 9 | bmi | 5110 non-null | float64 |
| 10 | smoking_status | 5110 non-null | object |
| 11 | stroke | 5110 non-null | int64 |

dtypes: float64(3), int64(4), object(5)
memory usage: 479.2+ KB
df_2.isnull().sum()
id 0
gender 0
age 0
hypertension 0
heart_disease 0
ever_married 0
work_type 0
Residence_type 0
avg_glucose_level 0
bmi 0
smoking_status 0
stroke 0
dtype: int64
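The mean imputation step can be checked on a small invented column. In the sketch below, toy and bmi_mean are hypothetical names; the point is that mean() skips missing values, so the fill value is the average of the observed entries only.

```python
import numpy as np
import pandas as pd

# Invented toy data: two of the four bmi values are missing.
toy = pd.DataFrame({"bmi": [36.6, np.nan, 32.5, np.nan]})

# mean() ignores NaN by default, so this is (36.6 + 32.5) / 2.
bmi_mean = toy["bmi"].mean()

# Fill the gaps and assign the result back to the column.
toy["bmi"] = toy["bmi"].fillna(bmi_mean)

print(bmi_mean)
print(toy)
print(toy["bmi"].isnull().sum())
```

In df_2 the same thing happens at full scale: every row that was missing bmi receives the same column mean, which is why an identical value (28.893237) shows up in rows 1 and 5105 of the df_2 output.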
df_2.isnull()

| | id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False | False | False | False | False | False |
| 1 | False | False | False | False | False | False | False | False | False | False | False |
| 2 | False | False | False | False | False | False | False | False | False | False | False |
| 3 | False | False | False | False | False | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False | False | False | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5105 | False | False | False | False | False | False | False | False | False | False | False |
| 5106 | False | False | False | False | False | False | False | False | False | False | False |
| 5107 | False | False | False | False | False | False | False | False | False | False | False |
| 5108 | False | False | False | False | False | False | False | False | False | False | False |
| 5109 | False | False | False | False | False | False | False | False | False | False | False |

5110 rows × 12 columns
df_2

| | id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_st |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9046 | Male | 67.0 | 0 | 1 | Yes | Private | Urban | 228.69 | 36.600000 | formerly sm |
| 1 | 51676 | Female | 61.0 | 0 | 0 | Yes | Self-employed | Rural | 202.21 | 28.893237 | never sm |
| 2 | 31112 | Male | 80.0 | 0 | 1 | Yes | Private | Rural | 105.92 | 32.500000 | never sm |
| 3 | 60182 | Female | 49.0 | 0 | 0 | Yes | Private | Urban | 171.23 | 34.400000 | sm |
| 4 | 1665 | Female | 79.0 | 1 | 0 | Yes | Self-employed | Rural | 174.12 | 24.000000 | never sm |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5105 | 18234 | Female | 80.0 | 1 | 0 | Yes | Private | Urban | 83.75 | 28.893237 | never sm |
| 5106 | 44873 | Female | 81.0 | 0 | 0 | Yes | Self-employed | Urban | 125.20 | 40.000000 | never sm |
| 5107 | 19723 | Female | 35.0 | 0 | 0 | Yes | Self-employed | Rural | 82.99 | 30.600000 | never sm |