Multiple linear regression for a heart disease risk prediction system
Step 1: Import Required Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt
import seaborn as sns
Step 2: Load and Prepare the Dataset
For this example, I'll create a synthetic dataset. In a real scenario, you would load your dataset from a file.
# Creating a synthetic dataset
np.random.seed(42)
data_size = 200
age = np.random.randint(30, 70, data_size)
cholesterol = np.random.randint(150, 300, data_size)
blood_pressure = np.random.randint(80, 180, data_size)
smoking = np.random.randint(0, 2, data_size) # 0 for non-smoker, 1 for smoker
diabetes = np.random.randint(0, 2, data_size) # 0 for no diabetes, 1 for diabetes
# Risk score (synthetic target variable)
risk_score = (
    0.3 * age
    + 0.2 * cholesterol
    + 0.3 * blood_pressure
    + 10 * smoking
    + 8 * diabetes
    + np.random.normal(0, 10, data_size)
)
# Creating a DataFrame
df = pd.DataFrame({
    'Age': age,
    'Cholesterol': cholesterol,
    'Blood Pressure': blood_pressure,
    'Smoking': smoking,
    'Diabetes': diabetes,
    'Risk Score': risk_score
})
# Display the first few rows of the dataset
print(df.head())
Step 3: Exploratory Data Analysis (EDA)
# Pairplot to visualize relationships between features and target
sns.pairplot(df)
plt.show()
# Correlation matrix to check relationships between features
corr_matrix = df.corr()
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm")
plt.show()
Step 4: Split the Dataset into Training and Testing Sets
# Features and target variable
X = df[['Age', 'Cholesterol', 'Blood Pressure', 'Smoking', 'Diabetes']]
y = df['Risk Score']
# Splitting the dataset into training and testing sets
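# With no test_size given, scikit-learn holds out 25% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y)
Step 5: Train the Multiple Linear Regression Model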
# Creating and training the model
model = LinearRegression()
model.fit(X_train, y_train)
# Model coefficients
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)
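Because the synthetic target in Step 2 was built from known weights, the fitted coefficients can be checked against them. The sketch below is one way to do that. It regenerates the Step 2 data so that it runs on its own, fits on all 200 rows rather than on a training split, and introduces the names `features` and `true_weights` purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Regenerate the Step 2 data (same seed, same order of random draws)
np.random.seed(42)
data_size = 200
age = np.random.randint(30, 70, data_size)
cholesterol = np.random.randint(150, 300, data_size)
blood_pressure = np.random.randint(80, 180, data_size)
smoking = np.random.randint(0, 2, data_size)
diabetes = np.random.randint(0, 2, data_size)

# Illustrative names: the weights used to build the synthetic risk score
features = ["Age", "Cholesterol", "Blood Pressure", "Smoking", "Diabetes"]
true_weights = np.array([0.3, 0.2, 0.3, 10.0, 8.0])

X = np.column_stack([age, cholesterol, blood_pressure, smoking, diabetes])
y = X @ true_weights + np.random.normal(0, 10, data_size)

# Fit on the full dataset and compare fitted vs. generating weights
model = LinearRegression().fit(X, y)
for name, true_w, fitted_w in zip(features, true_weights, model.coef_):
    print(f"{name:15s} true={true_w:6.2f}  fitted={fitted_w:6.2f}")
```

The fitted values should land near the generating weights but not exactly on them, since the target includes Gaussian noise with a standard deviation of 10.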
Step 6: Make Predictions and Evaluate the Model
# Making predictions on the test set
y_pred = model.predict(X_test)
# Evaluating the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")
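To make the two metrics concrete, here is a small sketch that computes them directly from their definitions with NumPy: MSE is the mean of the squared residuals, and R-squared is one minus the ratio of the residual sum of squares to the total sum of squares. The helper name `mse_and_r2` and the four-point example are made up for illustration and are not part of the dataset above.

```python
import numpy as np

def mse_and_r2(y_true, y_pred):
    """Compute MSE and R-squared from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mse = np.mean(residuals ** 2)                    # average squared error
    ss_res = np.sum(residuals ** 2)                  # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return mse, r2

# Tiny illustrative example
mse, r2 = mse_and_r2([3, 5, 7, 9], [2.5, 5.5, 7, 9.5])
print(f"MSE: {mse:.4f}, R-squared: {r2:.4f}")  # MSE: 0.1875, R-squared: 0.9625
```

An R-squared near 1 means the predictions explain most of the variance in the target. MSE is in squared units of the target, so its square root is often easier to read.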
Step 7: Visualize the Results
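One common way to visualize a regression fit is an actual-vs-predicted scatter plot next to a residual plot. The following is a sketch under that assumption, not necessarily the plots originally intended for this step. `plot_fit` is a hypothetical helper, and the stand-in arrays `y_demo` and `y_demo_pred` take the place of `y_test` and `y_pred` so the block runs on its own.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_fit(y_true, y_pred):
    """Draw actual-vs-predicted and residual plots side by side."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred

    fig, (ax_fit, ax_res) = plt.subplots(1, 2, figsize=(12, 5))

    # Left panel: points on the dashed diagonal are perfect predictions
    ax_fit.scatter(y_true, y_pred, alpha=0.7)
    lo = min(y_true.min(), y_pred.min())
    hi = max(y_true.max(), y_pred.max())
    ax_fit.plot([lo, hi], [lo, hi], color="red", linestyle="--")
    ax_fit.set_xlabel("Actual Risk Score")
    ax_fit.set_ylabel("Predicted Risk Score")
    ax_fit.set_title("Actual vs. Predicted")

    # Right panel: residuals should scatter evenly around zero
    ax_res.scatter(y_pred, residuals, alpha=0.7)
    ax_res.axhline(0, color="red", linestyle="--")
    ax_res.set_xlabel("Predicted Risk Score")
    ax_res.set_ylabel("Residual (actual - predicted)")
    ax_res.set_title("Residuals")

    fig.tight_layout()
    return fig

# Stand-in data for illustration only
rng = np.random.default_rng(0)
y_demo = rng.uniform(70, 140, 50)
y_demo_pred = y_demo + rng.normal(0, 10, 50)
fig = plot_fit(y_demo, y_demo_pred)
plt.show()
```

With the model from the earlier steps, the call would be `plot_fit(y_test, y_pred)`. A funnel shape or a curve in the residual panel suggests that a straight-line model is missing something.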
Explanation:
Data Generation: A synthetic dataset is created with the features Age, Cholesterol, Blood Pressure, Smoking, and Diabetes to predict a synthetic Risk Score.
EDA: Exploratory Data Analysis helps understand the relationships between the features and the target variable.
Model Training: The multiple linear regression model is trained on the dataset. Each coefficient estimates how much the predicted risk score changes for a one-unit increase in that feature, with the other features held fixed.
Evaluation: The model's performance is evaluated using Mean Squared Error (MSE) and R-squared values.
Visualization: Visualizing actual vs. predicted values and residuals helps in assessing the model's fit.
Real Dataset Consideration:
Replace the synthetic data generation part with your actual dataset, ensuring that your data is clean and well-preprocessed. You might need to handle missing values, normalize/standardize features, and encode categorical variables depending on your dataset's characteristics.
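As a rough sketch of what that preprocessing might look like, the block below chains imputation, standardization, and one-hot encoding in front of the regression using scikit-learn's `Pipeline` and `ColumnTransformer`. The column `Chest Pain Type` and all of the rows are hypothetical stand-ins chosen only to show the mechanics. A real dataset will have its own columns and may call for different imputation or encoding choices.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical stand-in rows with missing values and a categorical column
raw = pd.DataFrame({
    "Age": [45, 52, np.nan, 61, 38, 57],
    "Cholesterol": [210, np.nan, 185, 260, 195, 240],
    "Chest Pain Type": ["typical", "atypical", "none",
                        "typical", "none", "atypical"],
    "Risk Score": [88.0, 95.5, 79.0, 120.3, 70.2, 110.8],
})
numeric_cols = ["Age", "Cholesterol"]
categorical_cols = ["Chest Pain Type"]

# Numeric columns: fill gaps with the median, then standardize.
# Categorical columns: one-hot encode, ignoring unseen categories later.
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

pipeline = Pipeline([
    ("preprocess", preprocess),
    ("regress", LinearRegression()),
])

X_raw = raw[numeric_cols + categorical_cols]
y_raw = raw["Risk Score"]
pipeline.fit(X_raw, y_raw)
predictions = pipeline.predict(X_raw)
print(predictions.shape)
```

Keeping the preprocessing inside the pipeline means the imputation medians and scaling statistics are learned from the training data only when the pipeline is fitted on `X_train`, which avoids leaking information from the test set.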
This code provides a foundation for building a heart disease risk prediction system using multiple linear regression. Let me know if you need further assistance with your specific dataset or model improvements!