So, here’s the story: I recently worked on a school assignment from Professor Zhuang involving a pretty cool algorithm called the Incremental Association Markov Blanket (IAMB). I don’t have a background in data science or statistics, so this is new territory for me, but I love learning new things. The goal? Use IAMB to select features in a dataset and see how that affects the performance of a machine-learning model.
We’ll go over the basics of the IAMB algorithm and apply it to the Pima Indians Diabetes Dataset, hosted in Jason Brownlee's Datasets repository on GitHub. This dataset records health measurements for a group of women, along with whether each one has diabetes. We’ll use IAMB to figure out which features (like BMI or glucose levels) matter most for predicting diabetes.
The IAMB algorithm is like a friend who helps you trim a list of suspects in a mystery. It’s a feature selection method designed to pick out only the variables that truly matter for predicting your target. More precisely, it searches for the target’s Markov blanket: the smallest set of variables that, once known, makes every other variable in the dataset irrelevant for predicting the target. In this case, the target is whether someone has diabetes.
In simpler terms, IAMB helps us avoid clutter in our dataset by selecting only the most relevant features. This is especially handy when you want to keep things simple, boost model performance, and speed up training.
Source: Algorithms for Large-Scale Markov Blanket Discovery
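To make the two phases a bit more concrete, here is a minimal sketch of how the grow and shrink steps could look. This is my own toy version, not the paper's pseudocode: it assumes roughly linear relationships and uses a partial correlation with a Fisher z-test as the independence check. The helper names (`partial_corr_pvalue`, `iamb_sketch`) and the synthetic `toy` data are made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats


def partial_corr_pvalue(df, x, y, cond):
    """Return (|r|, p) for the partial correlation of x and y given cond.

    Assumption: linear relationships, tested with a Fisher z-test.
    """
    n = len(df)
    # Design matrix: an intercept plus the conditioning variables.
    Z = np.column_stack([np.ones(n)] + [df[c].to_numpy(float) for c in cond])
    xv = df[x].to_numpy(float)
    yv = df[y].to_numpy(float)
    # Residuals of x and y after regressing each on Z.
    rx = xv - Z @ np.linalg.lstsq(Z, xv, rcond=None)[0]
    ry = yv - Z @ np.linalg.lstsq(Z, yv, rcond=None)[0]
    r = np.clip(np.corrcoef(rx, ry)[0, 1], -0.999999, 0.999999)
    z = np.arctanh(r) * np.sqrt(n - len(cond) - 3)
    p = 2 * stats.norm.sf(abs(z))
    return abs(r), p


def iamb_sketch(df, target, alpha=0.05):
    mb = []
    candidates = [c for c in df.columns if c != target]

    # Grow phase: keep adding the most strongly associated candidate
    # (given the current blanket) while it is still significant.
    while True:
        remaining = [c for c in candidates if c not in mb]
        if not remaining:
            break
        scored = [(partial_corr_pvalue(df, c, target, mb), c) for c in remaining]
        (assoc, p), best = max(scored, key=lambda item: item[0][0])
        if p >= alpha:
            break
        mb.append(best)

    # Shrink phase: drop anything that is no longer significant
    # once the rest of the blanket is accounted for.
    for c in list(mb):
        rest = [m for m in mb if m != c]
        _, p = partial_corr_pvalue(df, c, target, rest)
        if p >= alpha:
            mb.remove(c)
    return mb


# Synthetic data: T depends on a and b; d is tied to T only through a.
rng = np.random.default_rng(0)
n = 2000
a = rng.normal(size=n)
b = rng.normal(size=n)
toy = pd.DataFrame({
    "a": a,
    "b": b,
    "c": rng.normal(size=n),            # pure noise
    "d": a + 0.5 * rng.normal(size=n),  # related to T only via a
    "T": a + b + 0.5 * rng.normal(size=n),
})
print(iamb_sketch(toy, "T"))
```

The interesting column here is `d`: on its own it looks related to `T`, but once `a` is in the blanket it has nothing new to offer, which is exactly the kind of feature a Markov blanket search is meant to leave out.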
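Here’s where alpha comes in. In statistics, alpha (α) is the threshold we set to decide what counts as "statistically significant." As part of the instructions given by the professor, I used an alpha of 0.05. A p-value is the probability of seeing an association at least as strong as the one in our data if the feature and the target were actually unrelated, so a p-value below 0.05 means an association like this would show up by chance less than 5% of the time. So, if a feature’s p-value is less than 0.05, we treat its association with the target as statistically significant and keep it. (Significant doesn't necessarily mean strong; it just means the association is unlikely to be a fluke.)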
By using this alpha threshold, we’re focusing only on the most meaningful variables, ignoring any that don’t pass our “significance” test. It’s like a filter that keeps the most relevant features and tosses out the noise.
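Here is a small sketch of that filter in action. It assumes a plain Pearson correlation test from SciPy and uses made-up synthetic data (the `related` and `unrelated` arrays are hypothetical), so treat it as an illustration of the threshold rather than part of the assignment.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 500
target = rng.normal(size=n)

features = {
    "related": target + rng.normal(scale=0.5, size=n),  # built from the target
    "unrelated": rng.normal(size=n),                     # independent noise
}

alpha = 0.05
p_values = {}
for name, values in features.items():
    # pearsonr returns the correlation and its two-sided p-value
    r, p = stats.pearsonr(values, target)
    p_values[name] = p
    verdict = "keep" if p < alpha else "drop"
    print(f"{name}: r={r:.3f}, p={p:.3g} -> {verdict}")
```

Keep in mind that a feature with no real link to the target will still slip under alpha about 5% of the time, which is the price of setting the threshold at 0.05.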
Here's the setup: the Pima Indians Diabetes Dataset has health features (blood pressure, age, insulin levels, etc.) and our target, Outcome (whether someone has diabetes).
First, we load the data and check it out:
import pandas as pd

# Load and preview the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv'
column_names = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
                'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']
data = pd.read_csv(url, names=column_names)
print(data.head())
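Here’s our updated version of the IAMB algorithm. We’re using p-values to decide which features to keep, so only those with p-values less than our alpha (0.05) are selected. The p-values come from partial correlation tests (via the pingouin library), which measure how a feature relates to the target after accounting for the other features in the blanket.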
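import pingouin as pg

def p_value_given(data, feature, target, covar):
    # P-value for feature vs. target, controlling for the features in covar
    if covar:
        result = pg.partial_corr(data=data, x=feature, y=target, covar=list(covar))
    else:
        # Nothing to control for yet, so use a plain correlation test
        result = pg.corr(data[feature], data[target])
    # Read the first row by position rather than by index label
    return result['p-val'].iloc[0]

def iamb(target, data, alpha=0.05):
    markov_blanket = set()
    # Forward Phase: Add features with a p-value < alpha
    # Assumption: the body of this loop is reconstructed from the backward phase below
    for feature in data.columns:
        if feature != target and p_value_given(data, feature, target, markov_blanket) < alpha:
            markov_blanket.add(feature)
    # Backward Phase: Remove features with a p-value > alpha
    for feature in list(markov_blanket):
        reduced_mb = markov_blanket - {feature}
        if p_value_given(data, feature, target, reduced_mb) > alpha:
            markov_blanket.remove(feature)
    return list(markov_blanket)

# Apply the updated IAMB function on the Pima dataset
selected_features = iamb('Outcome', data, alpha=0.05)
print("Selected Features:", selected_features)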
When I ran this, it gave me a refined list of features that IAMB thought were most closely related to diabetes outcomes. This list helps narrow down the variables we need for building our model.
Selected Features: ['BMI', 'DiabetesPedigreeFunction', 'Pregnancies', 'Glucose']
Once we have our selected features, the real test compares model performance with all features versus IAMB-selected features. For this, I went with a simple Gaussian Naive Bayes model because it’s straightforward and does well with probabilities (which ties in with the whole Bayesian vibe).
Here’s the code to train and test the model:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Split data
X = data.drop('Outcome', axis=1)
y = data['Outcome']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Model with All Features
model_all = GaussianNB()
model_all.fit(X_train, y_train)
y_pred_all = model_all.predict(X_test)

# Model with IAMB-Selected Features
X_train_selected = X_train[selected_features]
X_test_selected = X_test[selected_features]
model_iamb = GaussianNB()
model_iamb.fit(X_train_selected, y_train)
y_pred_iamb = model_iamb.predict(X_test_selected)

# Evaluate models
# Note: AUC-ROC is computed from hard 0/1 predictions here; passing
# predict_proba(...)[:, 1] instead would give the usual ranking-based AUC.
results = {
    'Model': ['All Features', 'IAMB-Selected Features'],
    'Accuracy': [accuracy_score(y_test, y_pred_all),
                 accuracy_score(y_test, y_pred_iamb)],
    'F1 Score': [f1_score(y_test, y_pred_all, average='weighted'),
                 f1_score(y_test, y_pred_iamb, average='weighted')],
    'AUC-ROC': [roc_auc_score(y_test, y_pred_all),
                roc_auc_score(y_test, y_pred_iamb)]
}
results_df = pd.DataFrame(results)
print(results_df)  # use display(results_df) inside a Jupyter notebook
Here’s what the comparison looks like:
Using only the IAMB-selected features gave a slight boost in accuracy and other metrics. It’s not a huge jump, but the fact that we’re getting better performance with fewer features is promising. Plus, it suggests the full model was picking up some “noise” from the less relevant features.
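Since a small gap on a single 70/30 split can come down to luck, one way to double-check a result like this is cross-validation. Below is a rough sketch of that idea. It runs on synthetic data from scikit-learn's `make_classification` rather than the Pima dataset, and the "first four columns" subset is only a stand-in for a selected feature list, so the numbers it prints say nothing about the diabetes results.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in: 8 features, of which the first 4 carry signal.
# shuffle=False keeps the informative columns first.
X, y = make_classification(
    n_samples=768, n_features=8, n_informative=4,
    n_redundant=0, shuffle=False, random_state=42,
)

# Same folds for both models so the comparison is like-for-like.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

acc_all = cross_val_score(GaussianNB(), X, y, cv=cv, scoring="accuracy")
acc_subset = cross_val_score(GaussianNB(), X[:, :4], y, cv=cv, scoring="accuracy")

print(f"All features:    {acc_all.mean():.3f} +/- {acc_all.std():.3f}")
print(f"Feature subset:  {acc_subset.mean():.3f} +/- {acc_subset.std():.3f}")
```

If the difference between the two means is smaller than the spread across folds, the "boost" is probably within the noise.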
I hope this gives a friendly intro to IAMB! If you’re curious, give it a shot—it’s a handy tool in the machine learning toolbox, and you might just see some cool improvements in your own projects.