Data Mining and Machine Learning Lab Task 2


  Google Co-Lab Link

Connect to Google Drive

from google.colab import drive
drive.mount('/content/drive')

Set the path to the dataset file

file_path = '/content/drive/MyDrive/Importent Document/healthcare-dataset-stroke-data.csv'


import pandas as pd loads the pandas data analysis library into the current Python environment and gives it the alias pd. With the alias in place, pandas functions can be called as pd.function_name instead of pandas.function_name.


import pandas as pd


import numpy as np loads the NumPy library under the alias np. NumPy provides support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions that operate on these arrays. It is the fundamental package for scientific computing with Python and is widely used in data science, machine learning, and engineering. With the alias in place, NumPy functions can be called as np.function_name instead of numpy.function_name.

 

import numpy as np
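As a small illustration of the array support described above, the sketch below builds a two-dimensional array and applies a few NumPy functions to it. The values are made up for the example and are not taken from the stroke dataset.

```python
import numpy as np

# A 2 x 3 array of made-up values (not from the stroke dataset)
values = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])

# Shape of the array as (rows, columns)
shape = values.shape

# Mean of every element, and mean of each column
overall_mean = values.mean()
column_means = values.mean(axis=0)

# Element-wise arithmetic is applied to the whole array at once
doubled = values * 2

print(shape)
print(overall_mean)
print(column_means)
print(doubled)
```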

 

 

df = pd.read_csv(file_path) is a common way to read a CSV file into a Pandas DataFrame. Here, df is the variable name for the DataFrame object, pd is the alias for the Pandas library, read_csv is a function in the Pandas library that reads a CSV file and returns a DataFrame object, and file_path is the path to the CSV file you want to load.

 

df = pd.read_csv(file_path)
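To try read_csv without access to Google Drive, the sketch below reads CSV text from an in-memory buffer instead of a file path. The three rows are hypothetical sample data that only reuse a few of the dataset's column names.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for the real file
csv_text = """id,gender,age,bmi
1,Male,67.0,36.6
2,Female,61.0,
3,Male,80.0,32.5
"""

# read_csv accepts a file path or any file-like object
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)   # (rows, columns)
print(sample_df.dtypes)  # column types inferred by pandas
print(sample_df.head())  # first rows of the DataFrame
```

The empty bmi field in the second row is read as a missing value (NaN), which is the kind of gap the later steps look for.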

 

 

Display the DataFrame. In a notebook, typing just the variable name on the last line of a cell shows its contents.

 

 

df

 

 

         id  gender   age  hypertension  heart_disease  ever_married      work_type  Re
0      9046    Male  67.0             0              1           Yes        Private
1     51676  Female  61.0             0              0           Yes  Self-employed
2     31112    Male  80.0             0              1           Yes        Private
3     60182  Female  49.0             0              0           Yes        Private
4      1665  Female  79.0             1              0           Yes  Self-employed
...     ...     ...   ...           ...            ...           ...            ...
5105  18234  Female  80.0             1              0           Yes        Private
5106  44873  Female  81.0             0              0           Yes  Self-employed
5107  19723  Female  35.0             0              0           Yes  Self-employed
5108  37544    Male  51.0             0              0           Yes        Private

 

The df.isnull() method is used to detect missing values in a DataFrame. It returns a DataFrame of the same size as the original, where each element is a boolean value indicating whether the corresponding element in the original DataFrame is missing or not.

 

df.isnull()


 

         id  gender    age  hypertension  heart_disease  ever_married  work_type  Residence_type  avg_glucose_level    bmi  smoking_status
0     False   False  False         False          False         False      False           False              False  False           False
1     False   False  False         False          False         False      False           False              False   True           False
2     False   False  False         False          False         False      False           False              False  False           False
3     False   False  False         False          False         False      False           False              False  False           False
4     False   False  False         False          False         False      False           False              False  False           False
...     ...     ...    ...           ...            ...           ...        ...             ...                ...    ...             ...
5105  False   False  False         False          False         False      False           False              False   True           False
5106  False   False  False         False          False         False      False           False              False  False           False
5107  False   False  False         False          False         False      False           False              False  False           False
5108  False   False  False         False          False         False      False           False              False  False           False
5109  False   False  False         False          False         False      False           False              False  False           False

 

 

 

5110 rows × 12 columns
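The boolean mask returned by isnull() can be tried on a small example. The sketch below uses a hypothetical three-row DataFrame with one missing bmi value, and also shows one possible way to use the mask to pull out the rows that contain a missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing bmi value
sample_df = pd.DataFrame({
    "age": [67.0, 61.0, 80.0],
    "bmi": [36.6, np.nan, 32.5],
})

# Boolean mask with the same shape as sample_df
mask = sample_df.isnull()
print(mask)

# Rows that contain at least one missing value
rows_with_missing = sample_df[mask.any(axis=1)]
print(rows_with_missing)
```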

 

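df.isnull().sum() adds up the boolean values in each column, which gives the number of missing values per column.

df.isnull().sum()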

 

id                     0
gender                 0
age                    0
hypertension           0
heart_disease          0
ever_married           0
work_type              0
Residence_type         0
avg_glucose_level      0
bmi                  201
smoking_status         0
stroke                 0
dtype: int64

 

 

(df.isnull().sum())/len(df)*100 calculates the percentage of missing values in each column of a DataFrame. It returns a Pandas Series object where the index is the column names and the values are the percentage of missing values in each column.

 

(df.isnull().sum())/len(df)*100

 

id                   0.000000
gender               0.000000
age                  0.000000
hypertension         0.000000
heart_disease        0.000000
ever_married         0.000000
work_type            0.000000
Residence_type       0.000000
avg_glucose_level    0.000000
bmi                  3.933464
smoking_status       0.000000
stroke               0.000000
dtype: float64
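The percentage calculation can be checked by hand. The sketch below repeats it on a hypothetical four-row DataFrame and then reproduces the bmi figure from the counts reported above (201 missing values out of 5110 rows). The isnull().mean() variant is an equivalent shortcut and is not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: two of the four bmi values are missing
sample_df = pd.DataFrame({
    "age": [67.0, 61.0, 80.0, 49.0],
    "bmi": [36.6, np.nan, 32.5, np.nan],
})

# Missing values per column, then as a percentage of all rows
missing_count = sample_df.isnull().sum()
missing_percent = missing_count / len(sample_df) * 100

# Equivalent shortcut: the mean of a boolean column is the fraction of True values
missing_percent_alt = sample_df.isnull().mean() * 100

# The bmi percentage from the counts in the output above
bmi_percent = round(201 / 5110 * 100, 6)

print(missing_percent)
print(bmi_percent)
```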

 

 

The df_1 = df.copy() statement creates a copy of the DataFrame df and assigns it to the variable df_1. copy() is a DataFrame method that returns a new, independent object by default, so later changes to df_1 do not affect df.

 

df_1 = df.copy()
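The reason for working on a copy can be seen in a short sketch. The DataFrame below is hypothetical; the point is only that a change made to the copy leaves the original untouched.

```python
import pandas as pd

# Hypothetical data standing in for the real DataFrame
original = pd.DataFrame({"bmi": [36.6, 32.5, 34.4]})

# copy() returns a new object with its own data (deep copy by default)
working = original.copy()
working.loc[0, "bmi"] = 0.0

print(original.loc[0, "bmi"])  # still 36.6
print(working.loc[0, "bmi"])   # 0.0
```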

 

 

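The df_1 = df_1.dropna() statement removes every row of df_1 that contains at least one missing value. dropna() is a DataFrame method that drops rows by default and can drop columns instead when called with axis=1. Here only bmi has missing values, so the 201 affected rows are removed and 4909 of the 5110 rows remain.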

 

df_1 = df_1.dropna()
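A minimal sketch of how dropna() behaves, using a hypothetical four-row DataFrame. The subset and axis arguments are standard dropna() parameters but are not used in the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: rows 1 and 3 have no bmi value
sample_df = pd.DataFrame({
    "age": [67.0, 61.0, 80.0, 49.0],
    "bmi": [36.6, np.nan, 32.5, np.nan],
})

# Default: drop every row that has at least one missing value
rows_dropped = sample_df.dropna()

# Only look at the bmi column when deciding which rows to drop
rows_dropped_bmi = sample_df.dropna(subset=["bmi"])

# axis=1 drops columns that contain missing values instead of rows
columns_dropped = sample_df.dropna(axis=1)

print(rows_dropped)
print(columns_dropped)
```

dropna() returns a new DataFrame, which is why the notebook assigns the result back to df_1.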

 

 

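Copy df into df_2 so that a second approach, filling the missing values instead of dropping rows, can start from the original data.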


df_2 = df.copy()

 

 

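Fill the missing bmi values with the mean of the column. The result is assigned back to the column, because calling fillna(..., inplace=True) on a selected column triggers a warning in recent pandas versions and may leave df_2 unchanged.

df_2['bmi'] = df_2['bmi'].fillna(df_2['bmi'].mean())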
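A minimal sketch of mean imputation on a made-up column, assuming the same idea as the fillna call above: each missing entry is replaced with the mean of the observed values. The sketch assigns the filled column back rather than using inplace=True on a selected column.

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing values
sample_df = pd.DataFrame({"bmi": [30.0, np.nan, 20.0, np.nan]})

# mean() skips missing values by default: (30.0 + 20.0) / 2
bmi_mean = sample_df["bmi"].mean()

# Replace the missing entries and assign the filled column back
sample_df["bmi"] = sample_df["bmi"].fillna(bmi_mean)

print(bmi_mean)
print(sample_df)
```

Every imputed row ends up with the same value, which matches the repeated 28.893237 that appears in the bmi column of df_2.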

 

 

df_2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype
 0   id                 5110 non-null   int64
 1   gender             5110 non-null   object
 2   age                5110 non-null   float64
 3   hypertension       5110 non-null   int64
 4   heart_disease      5110 non-null   int64
 5   ever_married       5110 non-null   object
 6   work_type          5110 non-null   object
 7   Residence_type     5110 non-null   object
 8   avg_glucose_level  5110 non-null   float64
 9   bmi                5110 non-null   float64
 10  smoking_status     5110 non-null   object
 11  stroke             5110 non-null   int64
dtypes: float64(3), int64(4), object(5)
memory usage: 479.2+ KB

 

 

df_2.isnull().sum()

 

id                   0
gender               0
age                  0
hypertension         0
heart_disease        0
ever_married         0
work_type            0
Residence_type       0
avg_glucose_level    0
bmi                  0
smoking_status       0
stroke               0
dtype: int64

 

df_2.isnull()

 

 

         id  gender    age  hypertension  heart_disease  ever_married  work_type  Residence_type  avg_glucose_level    bmi  smoking_status
0     False   False  False         False          False         False      False           False              False  False           False
1     False   False  False         False          False         False      False           False              False  False           False
2     False   False  False         False          False         False      False           False              False  False           False
3     False   False  False         False          False         False      False           False              False  False           False
4     False   False  False         False          False         False      False           False              False  False           False
...     ...     ...    ...           ...            ...           ...        ...             ...                ...    ...             ...
5105  False   False  False         False          False         False      False           False              False  False           False
5106  False   False  False         False          False         False      False           False              False  False           False
5107  False   False  False         False          False         False      False           False              False  False           False
5108  False   False  False         False          False         False      False           False              False  False           False
5109  False   False  False         False          False         False      False           False              False  False           False

 

5110 rows × 12 columns





df_2


 

         id  gender   age  hypertension  heart_disease  ever_married      work_type  Residence_type  avg_glucose_level        bmi   smoking_st
0      9046    Male  67.0             0              1           Yes        Private           Urban             228.69  36.600000  formerly sm
1     51676  Female  61.0             0              0           Yes  Self-employed           Rural             202.21  28.893237     never sm
2     31112    Male  80.0             0              1           Yes        Private           Rural             105.92  32.500000     never sm
3     60182  Female  49.0             0              0           Yes        Private           Urban             171.23  34.400000           sm
4      1665  Female  79.0             1              0           Yes  Self-employed           Rural             174.12  24.000000     never sm
...     ...     ...   ...           ...            ...           ...            ...             ...                ...        ...          ...
5105  18234  Female  80.0             1              0           Yes        Private           Urban              83.75  28.893237     never sm
5106  44873  Female  81.0             0              0           Yes  Self-employed           Urban             125.20  40.000000     never sm
5107  19723  Female  35.0             0              0           Yes  Self-employed           Rural              82.99  30.600000     never sm


 


 

 

 


