Badawi Aminu Muhammed

[B.Sc, M.Sc.]

Data Scientist • Business Intelligence Expert • Research Analyst

Heart Disease Prediction

Introduction to Heart Disease Prediction

Predicting and diagnosing heart disease is one of the major challenges in the medical industry, and it relies on factors such as the patient's physical examination, symptoms, and signs.

This project walks through how to train a model for heart disease prediction using machine learning. I will use the Logistic Regression algorithm to train a model that predicts heart disease.

Common factors that influence heart disease are body cholesterol levels, smoking habits, obesity, family history of illness, blood pressure, and work environment. Machine learning algorithms play an essential role in predicting heart disease from these factors.

Advances in technology allow machine learning to combine with Big Data tools to manage unstructured and exponentially growing data. Heart disease is among the world's deadliest diseases: the heart becomes unable to push the required amount of blood to the remaining organs of the body to perform their regular functions.

Heart disease can be predicted from attributes such as age, gender, heart rate, etc., and early prediction helps reduce the death rate among heart patients.

Due to the increasing use of technology and data collection, we can now predict heart disease using machine learning algorithms. Now let’s go further with the task of heart disease prediction using machine learning with Python.

Heart Disease Prediction Using Machine Learning

In this section, we go through the task of heart disease prediction using the Logistic Regression algorithm.

Since we are using the Python programming language for this task, let's start by importing the necessary libraries:

In [15]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

%matplotlib inline

sns.set_style("whitegrid")
plt.style.use("fivethirtyeight")

Then, we read the data into the environment using the pandas read_csv function:

In [17]:
df = pd.read_csv("C:/Users/hp/Documents/heart.csv")
df.head()
Out[17]:
age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal target
0 63 1 3 145 233 1 0 150 0 2.3 0 0 1 1
1 37 1 2 130 250 0 1 187 0 3.5 0 0 2 1
2 41 0 1 130 204 0 0 172 0 1.4 2 0 2 1
3 56 1 1 120 236 0 1 178 0 0.8 2 0 2 1
4 57 0 0 120 354 0 1 163 1 0.6 2 0 2 1

Exploratory Data Analysis

EDA helps us find answers to some important questions such as:

- What question(s) are we trying to solve?
- What kind of data do we have, and how do we handle the different types?
- What is missing from the data, and how do we deal with it?
- Where are the outliers, and why should we care?
- How can we add, change, or remove features to get the most out of the data?

Now let’s start with exploratory data analysis:

In [18]:
pd.set_option("display.float", "{:.2f}".format)
df.describe()
Out[18]:
age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal target
count 303.00 303.00 303.00 303.00 303.00 303.00 303.00 303.00 303.00 303.00 303.00 303.00 303.00 303.00
mean 54.37 0.68 0.97 131.62 246.26 0.15 0.53 149.65 0.33 1.04 1.40 0.73 2.31 0.54
std 9.08 0.47 1.03 17.54 51.83 0.36 0.53 22.91 0.47 1.16 0.62 1.02 0.61 0.50
min 29.00 0.00 0.00 94.00 126.00 0.00 0.00 71.00 0.00 0.00 0.00 0.00 0.00 0.00
25% 47.50 0.00 0.00 120.00 211.00 0.00 0.00 133.50 0.00 0.00 1.00 0.00 2.00 0.00
50% 55.00 1.00 1.00 130.00 240.00 0.00 1.00 153.00 0.00 0.80 1.00 0.00 2.00 1.00
75% 61.00 1.00 2.00 140.00 274.50 0.00 1.00 166.00 1.00 1.60 2.00 1.00 3.00 1.00
max 77.00 1.00 3.00 200.00 564.00 1.00 2.00 202.00 1.00 6.20 2.00 4.00 3.00 1.00

We then visualize the frequency of individuals with heart disease and those without:

In [42]:
df.target.value_counts().plot(kind="bar", color=["red", "orange"])
Out[42]:
<Axes: xlabel='target'>
[Figure: bar chart of target counts — 1 (disease) vs 0 (no disease)]

According to the data, we have 165 people with heart disease and 138 people without, so the classes are reasonably balanced.
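The balance check above can also be made explicit in code. A minimal sketch, using hypothetical counts matching those reported (not the real column):

```python
import pandas as pd

# Hypothetical stand-in for the target column: 1 = disease, 0 = no disease
target = pd.Series([1] * 165 + [0] * 138, name="target")

counts = target.value_counts()
ratio = counts.min() / counts.max()

print(counts.to_dict())                       # {1: 165, 0: 138}
print(f"minority/majority ratio: {ratio:.2f}")
```

A ratio close to 1 means plain accuracy is a reasonable headline metric; for heavily imbalanced classes we would prefer precision/recall.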

We will also check whether there are any missing (NA) values.

In [ ]:
# Checking for missing values
df.isna().sum()
Out[ ]:
age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          0
thal        0
target      0
dtype: int64

The data is ready to use, as there are no null values.

Next, we examine the number of unique values per column to separate the categorical columns from the continuous ones.

In [21]:
categorical_val = []
continous_val = []
for column in df.columns:
    print('==============================')
    print(f"{column} : {df[column].unique()}")
    if len(df[column].unique()) <= 10:
        categorical_val.append(column)
    else:
        continous_val.append(column)
==============================
age : [63 37 41 56 57 44 52 54 48 49 64 58 50 66 43 69 59 42 61 40 71 51 65 53
 46 45 39 47 62 34 35 29 55 60 67 68 74 76 70 38 77]
==============================
sex : [1 0]
==============================
cp : [3 2 1 0]
==============================
trestbps : [145 130 120 140 172 150 110 135 160 105 125 142 155 104 138 128 108 134
 122 115 118 100 124  94 112 102 152 101 132 148 178 129 180 136 126 106
 156 170 146 117 200 165 174 192 144 123 154 114 164]
==============================
chol : [233 250 204 236 354 192 294 263 199 168 239 275 266 211 283 219 340 226
 247 234 243 302 212 175 417 197 198 177 273 213 304 232 269 360 308 245
 208 264 321 325 235 257 216 256 231 141 252 201 222 260 182 303 265 309
 186 203 183 220 209 258 227 261 221 205 240 318 298 564 277 214 248 255
 207 223 288 160 394 315 246 244 270 195 196 254 126 313 262 215 193 271
 268 267 210 295 306 178 242 180 228 149 278 253 342 157 286 229 284 224
 206 167 230 335 276 353 225 330 290 172 305 188 282 185 326 274 164 307
 249 341 407 217 174 281 289 322 299 300 293 184 409 259 200 327 237 218
 319 166 311 169 187 176 241 131]
==============================
fbs : [1 0]
==============================
restecg : [0 1 2]
==============================
thalach : [150 187 172 178 163 148 153 173 162 174 160 139 171 144 158 114 151 161
 179 137 157 123 152 168 140 188 125 170 165 142 180 143 182 156 115 149
 146 175 186 185 159 130 190 132 147 154 202 166 164 184 122 169 138 111
 145 194 131 133 155 167 192 121  96 126 105 181 116 108 129 120 112 128
 109 113  99 177 141 136  97 127 103 124  88 195 106  95 117  71 118 134
  90]
==============================
exang : [0 1]
==============================
oldpeak : [2.3 3.5 1.4 0.8 0.6 0.4 1.3 0.  0.5 1.6 1.2 0.2 1.8 1.  2.6 1.5 3.  2.4
 0.1 1.9 4.2 1.1 2.  0.7 0.3 0.9 3.6 3.1 3.2 2.5 2.2 2.8 3.4 6.2 4.  5.6
 2.9 2.1 3.8 4.4]
==============================
slope : [0 2 1]
==============================
ca : [0 2 1 3 4]
==============================
thal : [1 2 3 0]
==============================
target : [1 0]
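The same categorical/continuous split can be written more compactly with pandas' `nunique`. A minimal sketch on a toy frame (the columns are illustrative, not the real dataset):

```python
import pandas as pd

# Toy frame mimicking the mix of low- and high-cardinality columns
df = pd.DataFrame({
    "sex": [i % 2 for i in range(12)],   # 2 unique values  -> categorical
    "age": list(range(30, 42)),          # 12 unique values -> continuous
})

n_unique = df.nunique()
categorical_val = n_unique[n_unique <= 10].index.tolist()
continous_val = n_unique[n_unique > 10].index.tolist()

print(categorical_val, continous_val)    # ['sex'] ['age']
```

The 10-unique-values threshold is the same heuristic used in the loop above.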

Next, we visualize the distribution of the above categories:

In [43]:
plt.figure(figsize=(15, 15))

for i, column in enumerate(categorical_val, 1):
    plt.subplot(3, 3, i)
    df[df["target"] == 0][column].hist(bins=20, color='blue', label='Have Heart Disease = NO', alpha=0.6)
    df[df["target"] == 1][column].hist(bins=20, color='red', label='Have Heart Disease = YES', alpha=0.6)
    plt.legend()
    plt.xlabel(column)
[Figure: histograms of each categorical feature, split by target]

Observations from the above plot:

1. cp {Chest pain}: People with cp 1, 2, 3 are more likely to have heart disease than people with cp 0.

2. restecg {resting EKG results}: People with a value of 1 (reporting an abnormal heart rhythm, which can range from mild symptoms to severe problems) are more likely to have heart disease.

3. exang {exercise-induced angina}: People with a value of 0 (no exercise-induced angina) are more likely to have heart disease than people with a value of 1 (exercise-induced angina).

4. slope {the slope of the peak exercise ST segment}: People with a slope value of 2 (downsloping: a sign of an unhealthy heart) are more likely to have heart disease than people with a slope value of 0 (upsloping: best heart rate with exercise) or 1 (flat: minimal change, typical of a healthy heart).

5. ca {number of major vessels (0-3) colored by fluoroscopy}: The more blood movement the better, so people with ca equal to 0 are more likely to have heart disease.

6. thal {thallium stress test result}: People with a thal value of 2 (fixed defect: once was a defect but okay now) are more likely to have heart disease.
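Observations like these can be made quantitative with a row-normalised crosstab, which gives the disease rate within each category level. A minimal sketch on a small hypothetical sample (not the real data):

```python
import pandas as pd

# Hypothetical sample: chest-pain type (cp) against the target label
df = pd.DataFrame({
    "cp":     [0, 0, 0, 0, 1, 1, 2, 2, 3, 3],
    "target": [0, 0, 0, 1, 1, 1, 1, 0, 1, 1],
})

# Row-normalised crosstab: share of each class within every cp level
rates = pd.crosstab(df["cp"], df["target"], normalize="index")
print(rates[1])   # disease rate per cp level
```

On the real dataset, replacing the toy frame with `df` loaded earlier would turn each histogram observation into an exact proportion.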

In [23]:
plt.figure(figsize=(15, 15))

for i, column in enumerate(continous_val, 1):
    plt.subplot(3, 2, i)
    df[df["target"] == 0][column].hist(bins=35, color='blue', label='Have Heart Disease = NO', alpha=0.6)
    df[df["target"] == 1][column].hist(bins=35, color='red', label='Have Heart Disease = YES', alpha=0.6)
    plt.legend()
    plt.xlabel(column)
[Figure: histograms of each continuous feature, split by target]

Observations from the above plot:

  1. trestbps: resting blood pressure; anything above 130-140 is generally a concern.

  2. chol: serum cholesterol greater than 200 is a concern.

  3. thalach: People with a maximum heart rate over 140 are more likely to have heart disease.

  4. oldpeak: exercise-induced ST depression relative to rest measures heart stress during exercise; an unhealthy heart will stress more.
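A quick numerical companion to the histograms is the per-class mean of each continuous feature. A minimal sketch on hypothetical values (not the real data):

```python
import pandas as pd

# Hypothetical sample of two continuous features and the label
df = pd.DataFrame({
    "chol":    [233, 250, 204, 354, 192, 294],
    "thalach": [150, 187, 172, 163, 148, 153],
    "target":  [1, 1, 1, 0, 0, 0],
})

# Mean of each continuous feature per class
means = df.groupby("target")[["chol", "thalach"]].mean()
print(means)
```

On the real dataset, a gap between the two class means is a first hint that a feature separates the classes.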

Next, let's create a scatter plot of age against maximum heart rate, colored by disease status:

In [24]:
# Create another figure
plt.figure(figsize=(10, 8))

# Scatter with positive examples
plt.scatter(df.age[df.target==1],
            df.thalach[df.target==1],
            c="salmon")

# Scatter with negative examples
plt.scatter(df.age[df.target==0],
            df.thalach[df.target==0],
            c="lightblue")

# Add some helpful info
plt.title("Heart Disease in function of Age and Max Heart Rate")
plt.xlabel("Age")
plt.ylabel("Max Heart Rate")
plt.legend(["Disease", "No Disease"]);
[Figure: scatter plot of age vs. max heart rate, colored by disease status]

Correlation Matrix

We now create a correlation matrix:

In [ ]:
corr_matrix = df.corr()
fig, ax = plt.subplots(figsize=(15, 15))
ax = sns.heatmap(corr_matrix,
                 annot=True,
                 linewidths=0.5,
                 fmt=".2f",
                 cmap="YlGnBu");
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
Out[ ]:
(14.5, -0.5)
[Figure: heatmap of the feature correlation matrix]
In [26]:
df.drop('target', axis=1).corrwith(df.target).plot(kind='bar', grid=True, figsize=(12, 8), 
  title="Correlation with target")
Out[26]:
<Axes: title={'center': 'Correlation with target'}>
[Figure: bar chart of each feature's correlation with target]

Observations from correlation:

  1. fbs and chol are the least correlated with the target variable.
  2. All other variables show a noticeable correlation with the target variable.
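Ranking features by the absolute value of their correlation with the target makes this ordering explicit. A minimal sketch on a synthetic frame (one feature tied to the label, one pure noise; column names are illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
target = rng.integers(0, 2, n)

# Hypothetical frame: "informative" tracks the label, "noise" does not
df = pd.DataFrame({
    "informative": target + rng.normal(0, 0.5, n),
    "noise": rng.normal(0, 1, n),
    "target": target,
})

corr = (df.drop("target", axis=1)
          .corrwith(df["target"])
          .abs()
          .sort_values(ascending=False))
print(corr.index.tolist())   # 'informative' ranks first
```

Applied to the heart dataset, the same one-liner would confirm that fbs and chol sit at the bottom of the ranking.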

Data Preparation for Modelling

After exploring the dataset, we can observe that we need to convert some categorical variables to dummy variables and scale all values before training the machine learning models.

So, for this task, we'll use the get_dummies method to create dummy columns for categorical variables:

In [51]:
categorical_val.remove('target')
dataset = pd.get_dummies(df, columns=categorical_val)

from sklearn.preprocessing import StandardScaler

s_sc = StandardScaler()
col_to_scale = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']
dataset[col_to_scale] = s_sc.fit_transform(dataset[col_to_scale])
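The effect of `get_dummies` plus `StandardScaler` can be verified on a toy frame: the categorical column expands into one indicator column per level, and the scaled column ends up with mean 0 and unit variance. A minimal sketch with hypothetical values:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame: one categorical and one continuous column
df = pd.DataFrame({"cp": [0, 1, 2, 0], "age": [63.0, 37.0, 41.0, 56.0]})

dataset = pd.get_dummies(df, columns=["cp"])               # cp -> cp_0, cp_1, cp_2
dataset[["age"]] = StandardScaler().fit_transform(dataset[["age"]])

print(sorted(dataset.columns))                             # age plus dummy columns
print(round(dataset["age"].mean(), 6))                     # standardised to ~0
```

Scaling matters for logistic regression because regularisation penalises all coefficients equally, so features on large raw scales would otherwise dominate.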

Applying Logistic Regression

Now we develop a machine learning model for heart disease prediction using the logistic regression algorithm.

But before training the model, we first define a helper function that prints the classification report for the model's performance:

In [48]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

def print_score(clf, X_train, y_train, X_test, y_test, train=True):
    if train:
        pred = clf.predict(X_train)
        clf_report = pd.DataFrame(classification_report(y_train, pred, output_dict=True))
        print("Train Result:\n================================================")
        print(f"Accuracy Score: {accuracy_score(y_train, pred) * 100:.2f}%")
        print("_______________________________________________")
        print(f"CLASSIFICATION REPORT:\n{clf_report}")
        print("_______________________________________________")
        print(f"Confusion Matrix: \n {confusion_matrix(y_train, pred)}\n")
        
    else:
        pred = clf.predict(X_test)
        clf_report = pd.DataFrame(classification_report(y_test, pred, output_dict=True))
        print("Test Result:\n================================================")        
        print(f"Accuracy Score: {accuracy_score(y_test, pred) * 100:.2f}%")
        print("_______________________________________________")
        print(f"CLASSIFICATION REPORT:\n{clf_report}")
        print("_______________________________________________")
        print(f"Confusion Matrix: \n {confusion_matrix(y_test, pred)}\n")

Now let's split the data into training and test sets; we use 70% for training and 30% for testing:

In [49]:
from sklearn.model_selection import train_test_split

X = dataset.drop('target', axis=1)
y = dataset.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
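With a small dataset like this one, it can also help to stratify the split so the train and test sets keep the same class proportions. A minimal sketch on hypothetical labels (the `stratify` argument is the only change from the split above):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 60 positives, 40 negatives
X = np.arange(100).reshape(-1, 1)
y = np.array([1] * 60 + [0] * 40)

# stratify=y preserves the 60/40 class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

print(y_tr.mean(), y_te.mean())   # 0.6 in both splits
```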

Now let’s train the model and print the classification report of our logistic regression model:

In [50]:
from sklearn.linear_model import LogisticRegression

lr_clf = LogisticRegression(solver='liblinear')
lr_clf.fit(X_train, y_train)

print_score(lr_clf, X_train, y_train, X_test, y_test, train=True)
print_score(lr_clf, X_train, y_train, X_test, y_test, train=False)
Train Result:
================================================
Accuracy Score: 86.79%
_______________________________________________
CLASSIFICATION REPORT:
              0      1  accuracy  macro avg  weighted avg
precision  0.88   0.86      0.87       0.87          0.87
recall     0.82   0.90      0.87       0.86          0.87
f1-score   0.85   0.88      0.87       0.87          0.87
support   97.00 115.00      0.87     212.00        212.00
_______________________________________________
Confusion Matrix: 
 [[ 80  17]
 [ 11 104]]

Test Result:
================================================
Accuracy Score: 86.81%
_______________________________________________
CLASSIFICATION REPORT:
              0     1  accuracy  macro avg  weighted avg
precision  0.87  0.87      0.87       0.87          0.87
recall     0.83  0.90      0.87       0.86          0.87
f1-score   0.85  0.88      0.87       0.87          0.87
support   41.00 50.00      0.87      91.00         91.00
_______________________________________________
Confusion Matrix: 
 [[34  7]
 [ 5 45]]

Now we can check the performance of our model:

In [37]:
test_score = accuracy_score(y_test, lr_clf.predict(X_test)) * 100
train_score = accuracy_score(y_train, lr_clf.predict(X_train)) * 100

results_df = pd.DataFrame(data=[["Logistic Regression", train_score, test_score]], 
                          columns=['Model', 'Training Accuracy %', 'Testing Accuracy %'])
results_df
Out[37]:
Model Training Accuracy % Testing Accuracy %
0 Logistic Regression 86.79 86.81

As you can see, the model performs very well on the test set, giving almost the same accuracy on the test set as on the training set.

Hence, the model can be expected to perform well in predicting heart disease cases given similar factors.
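A single train/test split can be optimistic, so a natural next step is k-fold cross-validation, which averages accuracy over several splits. A minimal sketch on a synthetic stand-in with the same shape as the heart data (303 rows, 13 features; not the real dataset):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for illustration only; swap in X and y from above
X, y = make_classification(n_samples=303, n_features=13, random_state=42)

scores = cross_val_score(LogisticRegression(solver="liblinear"), X, y, cv=5)
print(f"CV accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")
```

If the five fold scores sit close together, as the train/test agreement above suggests they would, that is further evidence the model is not overfitting.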

© 2026 Badawi Amin Muhammed. All rights reserved. Zaria-Kaduna, Nigeria (WAT)