Connect to Google Drive

from google.colab import drive
drive.mount('/content/drive')

Set the path to the file

file_path = '/content/drive/MyDrive/Importent Document/healthcare-dataset-stroke-data.csv'

The statement import pandas as pd loads the pandas data analysis library into the current Python environment and binds it to the alias pd. With the alias in place, pandas functions can be called as pd.function_name instead of pandas.function_name.

import pandas as pd

The statement import numpy as np loads the NumPy library into Python. NumPy provides support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions that operate on these arrays. It is the fundamental package for scientific computing with Python and is widely used in data science, machine learning, and engineering. The np alias lets us call NumPy functions as np.function_name instead of numpy.function_name.

import numpy as np
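As a small illustration of the arrays and mathematical functions described above, the sketch below builds a one-dimensional and a two-dimensional array. The values and variable names are hypothetical and are not taken from the dataset.

```python
import numpy as np

# Hypothetical sample values, not taken from the dataset.
ages = np.array([67.0, 61.0, 80.0, 49.0])

# Vectorized operations apply to every element without an explicit loop.
mean_age = np.mean(ages)
ages_in_months = ages * 12

# A two-dimensional array (matrix) with 2 rows and 3 columns.
matrix = np.array([[1, 2, 3], [4, 5, 6]])

print(mean_age)          # 64.25
print(ages_in_months)
print(matrix.shape)      # (2, 3)
```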

df = pd.read_csv(file_path) is a common way to read a CSV file into a Pandas DataFrame. Here, df is the variable name for the DataFrame object, pd is the alias for the Pandas library, read_csv is a function in the Pandas library that reads a CSV file and returns a DataFrame object, and file_path is the path to the CSV file you want to load.

df = pd.read_csv(file_path)
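To see what read_csv does without needing the file on Drive, the sketch below reads a few CSV lines from an in-memory string. The text in csv_text is a hypothetical stand-in: the column names mirror the dataset, but the rows are illustrative only.

```python
import io
import pandas as pd

# Hypothetical stand-in for a CSV file on disk; the rows are illustrative only.
csv_text = """id,gender,age,bmi
1,Male,67.0,36.6
2,Female,61.0,
3,Male,80.0,32.5
"""

# read_csv accepts a path or a file-like object; empty fields are read as NaN.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)    # (number of rows, number of columns)
print(sample.dtypes)   # column types inferred from the data
```

The empty bmi field in the second row is read as a missing value, which is the same kind of gap the real dataset has in its bmi column.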

Display the DataFrame. In a notebook, evaluating the variable name on its own shows its contents.

df

	id	gender	age	hypertension	heart_disease	ever_married	work_type	Re
0	9046	Male	67.0	0	1	Yes	Private
1	51676	Female	61.0	0	0	Yes	Self-employed
2	31112	Male	80.0	0	1	Yes	Private
3	60182	Female	49.0	0	0	Yes	Private
4	1665	Female	79.0	1	0	Yes	Self-employed
...	...	...	...	...	...	...	...
5105	18234	Female	80.0	1	0	Yes	Private
5106	44873	Female	81.0	0	0	Yes	Self-employed
5107	19723	Female	35.0	0	0	Yes	Self-employed
5108	37544	Male	51.0	0	0	Yes	Private

The df.isnull() method is used to detect missing values in a DataFrame. It returns a DataFrame of the same size as the original, where each element is a boolean value indicating whether the corresponding element in the original DataFrame is missing or not.

df.isnull()

	id	gender	age	hypertension	heart_disease	ever_married	work_type	Residence_type	avg_glucose_level	bmi	smoking_status
0	False	False	False	False	False	False	False	False	False	False	False
1	False	False	False	False	False	False	False	False	False	True	False
2	False	False	False	False	False	False	False	False	False	False	False
3	False	False	False	False	False	False	False	False	False	False	False
4	False	False	False	False	False	False	False	False	False	False	False
...	...	...	...	...	...	...	...	...	...	...	...
5105	False	False	False	False	False	False	False	False	False	True	False
5106	False	False	False	False	False	False	False	False	False	False	False
5107	False	False	False	False	False	False	False	False	False	False	False
5108	False	False	False	False	False	False	False	False	False	False	False
5109	False	False	False	False	False	False	False	False	False	False	False

5110 rows × 12 columns
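The boolean result of isnull() is easier to inspect on a tiny frame. The sketch below uses a hypothetical three-row DataFrame (only the column names mirror the dataset) and also shows one common use of the mask: selecting the rows where a value is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the column names mirror the dataset.
toy = pd.DataFrame({
    "age": [67.0, 61.0, 80.0],
    "bmi": [36.6, np.nan, 32.5],
})

# Same shape as toy, True wherever a value is missing.
mask = toy.isnull()

# A boolean Series can be used to keep only the rows with a missing bmi.
rows_missing_bmi = toy[toy["bmi"].isnull()]

print(mask)
print(rows_missing_bmi)
```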

df.isnull().sum()

id	0
gender	0
age	0
hypertension	0
heart_disease	0
ever_married	0
work_type	0
Residence_type	0
avg_glucose_level	0
bmi	201
smoking_status	0
stroke	0
dtype: int64

(df.isnull().sum())/len(df)*100 calculates the percentage of missing values in each column of a DataFrame. It returns a Pandas Series object where the index is the column names and the values are the percentage of missing values in each column.

(df.isnull().sum())/len(df)*100

id	0.000000
gender	0.000000
age	0.000000
hypertension	0.000000
heart_disease	0.000000
ever_married	0.000000
work_type	0.000000
Residence_type	0.000000
avg_glucose_level	0.000000
bmi	3.933464
smoking_status	0.000000
stroke	0.000000
dtype: float64
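For bmi the arithmetic is 201 / 5110 * 100, which is about 3.93. The sketch below repeats the same calculation on a hypothetical four-row frame and shows an equivalent shortcut: because True counts as 1 and False as 0, the mean of the boolean mask is already the fraction of missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two of the four bmi values are missing.
toy = pd.DataFrame({
    "age": [67.0, 61.0, 80.0, 49.0],
    "bmi": [36.6, np.nan, 32.5, np.nan],
})

# Count of missing values per column, divided by the row count, as a percentage.
pct_from_sum = toy.isnull().sum() / len(toy) * 100

# Equivalent form: the mean of a boolean column is its fraction of True values.
pct_from_mean = toy.isnull().mean() * 100

print(pct_from_sum)
print(pct_from_mean)
```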

The statement df_1 = df.copy() creates a copy of the DataFrame df and assigns it to the variable df_1. By default copy() makes a deep copy, so later changes to df_1 do not affect df.

df_1 = df.copy()
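The difference between copy() and plain assignment can be checked directly. In the sketch below, all names and values are hypothetical: independent is a real copy, while alias is only a second name for the same object.

```python
import pandas as pd

# Hypothetical toy data.
original = pd.DataFrame({"bmi": [36.6, 32.5]})

independent = original.copy()   # deep copy by default: separate data
alias = original                # no copy: a second name for the same object

# Changing the copy leaves the original untouched.
independent.loc[0, "bmi"] = 0.0

# Changing the alias changes the original, because they are the same object.
alias.loc[1, "bmi"] = -1.0

print(original)
```

This is why the notebook works on copies: each cleaning strategy can be tried without altering df.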

The statement df_1 = df_1.dropna() removes the rows with missing values from df_1. The dropna() method can remove rows or columns; by default it drops every row that contains at least one missing value. Here that removes the 201 rows with a missing bmi, leaving 4909 of the 5110 rows.

df_1 = df_1.dropna()
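The row and column behaviour of dropna() can be compared on a small frame. The sketch below uses hypothetical data; axis and subset are standard dropna() parameters, but the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing bmi value.
toy = pd.DataFrame({
    "age": [67.0, 61.0, 80.0],
    "bmi": [36.6, np.nan, 32.5],
})

# Default: drop every row that contains at least one missing value.
rows_dropped = toy.dropna()

# axis=1: drop every column that contains at least one missing value.
cols_dropped = toy.dropna(axis=1)

# subset: only look at the listed columns when deciding which rows to drop.
bmi_rows_dropped = toy.dropna(subset=["bmi"])

print(len(toy), len(rows_dropped), list(cols_dropped.columns))
```

dropna() returns a new DataFrame and leaves toy unchanged, which is why the notebook assigns the result back to df_1.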

Copy df into df_2

df_2 = df.copy()

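Fill the missing bmi values in df_2 with the mean of the bmi column. The result is assigned back to the column, because calling fillna with inplace=True on a selected column may not update the DataFrame in recent pandas versions.

df_2['bmi'] = df_2['bmi'].fillna(df_2['bmi'].mean())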
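Mean imputation can be traced by hand on a small example. The sketch below uses hypothetical values and assigns the filled column back rather than relying on inplace=True, which is one common way to write it.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing bmi value.
toy = pd.DataFrame({"bmi": [30.0, np.nan, 20.0]})

# mean() skips missing values by default: (30.0 + 20.0) / 2
bmi_mean = toy["bmi"].mean()

# Replace each missing value with the column mean and store the result.
toy["bmi"] = toy["bmi"].fillna(bmi_mean)

print(bmi_mean)   # 25.0
print(toy)
```

In the real data the same step puts the column mean into every row that had no bmi, so those rows all end up with an identical value.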

df_2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 12 columns):
#	Column	Non-Null Count	Dtype
0	id	5110 non-null	int64
1	gender	5110 non-null	object
2	age	5110 non-null	float64
3	hypertension	5110 non-null	int64
4	heart_disease	5110 non-null	int64
5	ever_married	5110 non-null	object
6	work_type	5110 non-null	object
7	Residence_type	5110 non-null	object
8	avg_glucose_level	5110 non-null	float64
9	bmi	5110 non-null	float64
10	smoking_status	5110 non-null	object
11	stroke	5110 non-null	int64
dtypes: float64(3), int64(4), object(5)
memory usage: 479.2+ KB

df_2.isnull().sum()

id	0
gender	0
age	0
hypertension	0
heart_disease	0
ever_married	0
work_type	0
Residence_type	0
avg_glucose_level	0
bmi	0
smoking_status	0
stroke	0
dtype: int64

df_2.isnull()

	id	gender	age	hypertension	heart_disease	ever_married	work_type	Residence_type	avg_glucose_level	bmi	smoking_status
0	False	False	False	False	False	False	False	False	False	False	False
1	False	False	False	False	False	False	False	False	False	False	False
2	False	False	False	False	False	False	False	False	False	False	False
3	False	False	False	False	False	False	False	False	False	False	False
4	False	False	False	False	False	False	False	False	False	False	False
...	...	...	...	...	...	...	...	...	...	...	...
5105	False	False	False	False	False	False	False	False	False	False	False
5106	False	False	False	False	False	False	False	False	False	False	False
5107	False	False	False	False	False	False	False	False	False	False	False
5108	False	False	False	False	False	False	False	False	False	False	False
5109	False	False	False	False	False	False	False	False	False	False	False

5110 rows × 12 columns

df_2

	id	gender	age	hypertension	heart_disease	ever_married	work_type	Residence_type	avg_glucose_level	bmi	smoking_st
0	9046	Male	67.0	0	1	Yes	Private	Urban	228.69	36.600000	formerly sm
1	51676	Female	61.0	0	0	Yes	Self-employed	Rural	202.21	28.893237	never sm
2	31112	Male	80.0	0	1	Yes	Private	Rural	105.92	32.500000	never sm
3	60182	Female	49.0	0	0	Yes	Private	Urban	171.23	34.400000	sm
4	1665	Female	79.0	1	0	Yes	Self-employed	Rural	174.12	24.000000	never sm
...	...	...	...	...	...	...	...	...	...	...
5105	18234	Female	80.0	1	0	Yes	Private	Urban	83.75	28.893237	never sm