Multiple Linear Regression for a Heart Disease Risk Prediction System

Step 1: Import Required Libraries

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt
import seaborn as sns

Step 2: Load and Prepare the Dataset

For this example, I'll create a synthetic dataset. In a real scenario, you would load your dataset from a file.

# Creating a synthetic dataset
np.random.seed(42)
data_size = 200

age = np.random.randint(30, 70, data_size)
cholesterol = np.random.randint(150, 300, data_size)
blood_pressure = np.random.randint(80, 180, data_size)
smoking = np.random.randint(0, 2, data_size)  # 0 for non-smoker, 1 for smoker
diabetes = np.random.randint(0, 2, data_size)  # 0 for no diabetes, 1 for diabetes

# Risk score (synthetic target variable)
risk_score = (
    0.3 * age
    + 0.2 * cholesterol
    + 0.3 * blood_pressure
    + 10 * smoking
    + 8 * diabetes
    + np.random.normal(0, 10, data_size)
)

# Creating a DataFrame
df = pd.DataFrame({
    'Age': age,
    'Cholesterol': cholesterol,
    'Blood Pressure': blood_pressure,
    'Smoking': smoking,
    'Diabetes': diabetes,
    'Risk Score': risk_score
})

# Display the first few rows of the dataset
print(df.head())

Step 3: Exploratory Data Analysis (EDA)

# Pairplot to visualize relationships between features and target
sns.pairplot(df)
plt.show()

# Correlation matrix to check relationships between features
corr_matrix = df.corr()
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm")
plt.show()

Step 4: Split the Dataset into Training and Testing Sets


# Features and target variable
X = df[['Age', 'Cholesterol', 'Blood Pressure', 'Smoking', 'Diabetes']]
y = df['Risk Score']

# Splitting the dataset into training and testing sets
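X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42  # assumed 80/20 split and seed; adjust as needed
)

Step 5: Train the Linear Regression Model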

# Creating and training the model
model = LinearRegression()
model.fit(X_train, y_train)

# Model coefficients
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)

Step 6: Make Predictions and Evaluate the Model

# Making predictions on the test set
y_pred = model.predict(X_test)

# Evaluating the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")
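
To make the two metrics concrete, the sketch below recomputes them directly from their definitions with NumPy and checks the result against scikit-learn. The helper name `regression_metrics` and the four sample values are illustrative assumptions, not part of the walkthrough above. It also returns the root mean squared error (RMSE), which is in the same units as the risk score and is often easier to read than MSE.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score


def regression_metrics(y_true, y_pred):
    """Return (MSE, RMSE, R-squared) computed directly from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    ss_res = np.sum(residuals ** 2)                  # variation the model leaves unexplained
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    mse = ss_res / y_true.size
    return mse, np.sqrt(mse), 1.0 - ss_res / ss_tot


# Illustrative values only (not taken from the synthetic dataset)
y_true_demo = [3.0, 5.0, 7.0, 9.0]
y_pred_demo = [2.5, 5.5, 7.0, 9.5]

mse_demo, rmse_demo, r2_demo = regression_metrics(y_true_demo, y_pred_demo)
print(f"MSE: {mse_demo}, RMSE: {rmse_demo:.4f}, R-squared: {r2_demo}")

# The hand-computed values agree with scikit-learn's implementations
assert np.isclose(mse_demo, mean_squared_error(y_true_demo, y_pred_demo))
assert np.isclose(r2_demo, r2_score(y_true_demo, y_pred_demo))
```

With the variables from Step 6, the same helper can be called as regression_metrics(y_test, y_pred).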

Step 7: Visualize the Results
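
One possible way to draw the two plots this step refers to (actual vs. predicted values, and residuals) is sketched below. The helper name `plot_fit`, the figure size, and the axis labels are assumptions chosen for illustration. Points close to the dashed diagonal in the first panel indicate accurate predictions; residuals scattered evenly around zero in the second panel suggest that a linear model is a reasonable fit.

```python
import numpy as np
import matplotlib.pyplot as plt


def plot_fit(y_true, y_pred):
    """Plot actual vs. predicted values and residuals; return (figure, residuals)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred

    fig, (ax_fit, ax_res) = plt.subplots(1, 2, figsize=(12, 5))

    # Left panel: predictions against actual values, with the ideal y = x line
    ax_fit.scatter(y_true, y_pred, alpha=0.7)
    low = min(y_true.min(), y_pred.min())
    high = max(y_true.max(), y_pred.max())
    ax_fit.plot([low, high], [low, high], "r--")
    ax_fit.set_xlabel("Actual Risk Score")
    ax_fit.set_ylabel("Predicted Risk Score")
    ax_fit.set_title("Actual vs. Predicted")

    # Right panel: residuals against predictions, with a zero reference line
    ax_res.scatter(y_pred, residuals, alpha=0.7)
    ax_res.axhline(0, color="red", linestyle="--")
    ax_res.set_xlabel("Predicted Risk Score")
    ax_res.set_ylabel("Residual (actual - predicted)")
    ax_res.set_title("Residuals")

    fig.tight_layout()
    return fig, residuals


# With the variables from Step 6:
# fig, residuals = plot_fit(y_test, y_pred)
# plt.show()
```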

Explanation:

Data Generation: A synthetic dataset is created with features like Age, Cholesterol, Blood Pressure, Smoking, and Diabetes to predict a synthetic Risk Score.
EDA: Exploratory Data Analysis helps understand the relationships between the features and the target variable.
Model Training: The multiple linear regression model is trained on the training set. Each coefficient estimates how much the predicted risk score changes for a one-unit increase in that feature, with the other features held constant.
Evaluation: The model's performance is evaluated using Mean Squared Error (MSE) and R-squared values.
Visualization: Visualizing actual vs. predicted values and residuals helps in assessing the model's fit.
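
Because the Risk Score in Step 2 is generated from known weights (0.3, 0.2, 0.3, 10, and 8), the fitted coefficients can be compared against them. The sketch below is a minimal illustration under a few assumptions: the helper `fit_synthetic` is hypothetical, it fits on all rows rather than a training split, and it uses `np.random.default_rng` instead of `np.random.seed`, so its numbers will not match the output of the steps above. Without noise, ordinary least squares recovers the weights up to floating-point error; with noise, the estimates only approximate them.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

FEATURES = ["Age", "Cholesterol", "Blood Pressure", "Smoking", "Diabetes"]
TRUE_WEIGHTS = np.array([0.3, 0.2, 0.3, 10.0, 8.0])  # weights from the Step 2 formula


def fit_synthetic(noise_sd, n=200, seed=42):
    """Generate data like Step 2 and return the fitted coefficients."""
    rng = np.random.default_rng(seed)
    X = np.column_stack([
        rng.integers(30, 70, n),    # age
        rng.integers(150, 300, n),  # cholesterol
        rng.integers(80, 180, n),   # blood pressure
        rng.integers(0, 2, n),      # smoking (0 or 1)
        rng.integers(0, 2, n),      # diabetes (0 or 1)
    ]).astype(float)
    y = X @ TRUE_WEIGHTS + rng.normal(0, noise_sd, n)
    return LinearRegression().fit(X, y).coef_


for label, noise_sd in [("no noise", 0.0), ("noise sd 10", 10.0)]:
    estimated = fit_synthetic(noise_sd)
    print(label)
    for name, true_w, est_w in zip(FEATURES, TRUE_WEIGHTS, estimated):
        print(f"  {name:<15} true={true_w:5.2f}  estimated={est_w:7.3f}")
```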

Real Dataset Consideration:

Replace the synthetic data generation part with your actual dataset, ensuring that your data is clean and well-preprocessed. You might need to handle missing values, normalize/standardize features, and encode categorical variables depending on your dataset's characteristics.
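
The preprocessing steps mentioned above can be combined with the model in a single scikit-learn pipeline, so that the same transformations are applied at training and prediction time. The sketch below is one possible arrangement, not a prescription: the "Smoking Status" column, the choice of median imputation, and the six-row toy table are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["Age", "Cholesterol", "Blood Pressure"]
categorical_cols = ["Smoking Status"]  # hypothetical text-valued column

# Numeric columns: fill missing values with the median, then standardize
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

preprocess = ColumnTransformer([
    ("num", numeric_steps, numeric_cols),
    # Categorical columns: one-hot encode, ignoring categories unseen in training
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

pipeline = Pipeline([
    ("preprocess", preprocess),
    ("regress", LinearRegression()),
])

# Toy data with made-up values, only to show that the pipeline runs end to end
toy = pd.DataFrame({
    "Age": [45, 61, np.nan, 38, 57, 66],
    "Cholesterol": [210, 260, 190, np.nan, 240, 280],
    "Blood Pressure": [120, 150, 110, 118, np.nan, 160],
    "Smoking Status": ["never", "current", "former", "never", "current", "former"],
    "Risk Score": [52.0, 88.0, 47.0, 44.0, 79.0, 90.0],
})

X_toy = toy[numeric_cols + categorical_cols]
pipeline.fit(X_toy, toy["Risk Score"])
predictions = pipeline.predict(X_toy)
print(predictions)
```

With a real dataset, the pipeline would be fitted on the training split only and then used to predict on the test split, which keeps the imputation and scaling statistics from leaking test information into training.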

This code provides a foundation for building a heart disease risk prediction system using multiple linear regression.
