A Machine Learning model made with AWS SageMaker and its Python SDK for Classification of HDFS Logs using Terraform for automation of infrastructure setup.
Language: HCL (terraform), Python
# Look up the XGBoost container image URI for the current region. Set repo_version to the container version you prefer.
container = get_image_uri(boto3.Session().region_name, 'xgboost', repo_version='1.0-1')
hyperparameters = {
    "max_depth": "5",                 # Maximum depth of a tree. Higher values allow more complex models but risk overfitting.
    "eta": "0.2",                     # Learning rate. Lower values make learning slower but more precise.
    "gamma": "4",                     # Minimum loss reduction required to make a further partition on a leaf node. Controls the model's complexity.
    "min_child_weight": "6",          # Minimum sum of instance weight (hessian) needed in a child. Higher values help prevent overfitting.
    "subsample": "0.7",               # Fraction of training data used. Reduces overfitting by sampling part of the data.
    "objective": "binary:logistic",   # Learning task and objective. binary:logistic is for binary classification.
    "num_round": 50                   # Number of boosting rounds.
}

# A SageMaker estimator that calls the XGBoost container
estimator = sagemaker.estimator.Estimator(
    # The XGBoost container set up earlier. Tells SageMaker which algorithm container to use.
    image_uri=container,
    # The hyperparameters defined above, which guide the training process.
    hyperparameters=hyperparameters,
    # IAM role that SageMaker assumes during the training job. It grants access to AWS resources such as S3.
    role=sagemaker.get_execution_role(),
    # Number of training instances. Here, a single instance.
    train_instance_count=1,
    # Instance type used for training. ml.m5.large is a general-purpose instance balancing compute, memory, and network resources.
    train_instance_type='ml.m5.large',
    # Size of the storage volume attached to the training instance, in GB. Here, 5 GB.
    train_volume_size=5,
    # S3 location where the model artifacts and training output are saved.
    output_path=output_path,
    # Use spot instances (spare EC2 capacity offered at a lower price), which can be significantly cheaper than on-demand instances.
    train_use_spot_instances=True,
    # Maximum runtime of the training job in seconds. Here, 300 seconds (5 minutes).
    train_max_run=300,
    # Maximum time to wait for the job to complete, including time spent waiting for spot capacity. Here, 600 seconds (10 minutes).
    train_max_wait=600)
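With managed spot training, the maximum wait time has to cover the maximum run time, so train_max_wait must be at least train_max_run. The snippet below is a minimal sketch of a local check that could be run before building the estimator; the helper name check_spot_timing is hypothetical and not part of the SageMaker SDK.

```python
def check_spot_timing(max_run_seconds, max_wait_seconds):
    """Return the extra seconds available for waiting on spot capacity.

    Hypothetical helper: raises ValueError when the wait limit is shorter
    than the run limit, which a spot training job would not accept.
    """
    if max_wait_seconds < max_run_seconds:
        raise ValueError(
            f"max_wait ({max_wait_seconds}s) must be >= max_run ({max_run_seconds}s)"
        )
    return max_wait_seconds - max_run_seconds


# Values taken from the estimator configuration above
spot_wait_budget = check_spot_timing(max_run_seconds=300, max_wait_seconds=600)
print(f"Seconds available for waiting on spot capacity: {spot_wait_budget}")
```

With the values used here, the job may spend up to 300 seconds waiting for spot capacity in addition to its 300 seconds of runtime.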
estimator.fit({'train': s3_input_train,'validation': s3_input_test})
xgb_predictor = estimator.deploy(initial_instance_count=1,instance_type='ml.m5.large')
from sagemaker.serializers import CSVSerializer
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# Drop the label column from the test data
test_data_features = test_data_final.drop(columns=['Label']).values

# Set the content type and serializer
xgb_predictor.serializer = CSVSerializer()
xgb_predictor.content_type = 'text/csv'

# Perform prediction
predictions = xgb_predictor.predict(test_data_features).decode('utf-8')
y_test = test_data_final['Label'].values

# Convert the predictions into an array
predictions_array = np.fromstring(predictions, sep=',')
print(predictions_array.shape)

# Convert the predictions to binary (0 or 1)
threshold = 0.5
binary_predictions = (predictions_array >= threshold).astype(int)

# Accuracy
accuracy = accuracy_score(y_test, binary_predictions)
# Precision
precision = precision_score(y_test, binary_predictions)
# Recall
recall = recall_score(y_test, binary_predictions)
# F1 Score
f1 = f1_score(y_test, binary_predictions)
# Confusion Matrix
cm = confusion_matrix(y_test, binary_predictions)

# False Positive Rate (FPR) using the confusion matrix
tn, fp, fn, tp = cm.ravel()
false_positive_rate = fp / (fp + tn)

# Print the metrics
print(f"Accuracy: {accuracy:.8f}")
print(f"Precision: {precision:.8f}")
print(f"Recall: {recall:.8f}")
print(f"F1 Score: {f1:.8f}")
print(f"False Positive Rate: {false_positive_rate:.8f}")
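Each of these metrics can also be derived by hand from the four confusion-matrix counts, which helps when checking the scikit-learn results. The following is a minimal sketch in plain Python, assuming binary labels and predicted probabilities; the function name binary_metrics and the sample values are hypothetical and not taken from the project.

```python
def binary_metrics(y_true, probabilities, threshold=0.5):
    """Threshold probabilities, then compute metrics from confusion counts."""
    y_pred = [1 if p >= threshold else 0 for p in probabilities]
    pairs = list(zip(y_true, y_pred))

    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    def safe_div(numerator, denominator):
        # Avoid ZeroDivisionError when a class is missing from the data
        return numerator / denominator if denominator else 0.0

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "accuracy": safe_div(tp + tn, len(pairs)),
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
        "false_positive_rate": safe_div(fp, fp + tn),
    }


# Hypothetical sample: six labels and their predicted probabilities
labels = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.7, 0.8, 0.4, 0.2, 0.9]
metrics = binary_metrics(labels, scores)
for name, value in metrics.items():
    print(f"{name}: {value:.8f}")
```

The false positive rate is the share of true negatives-plus-false-positives that were wrongly flagged, fp / (fp + tn), which is the same expression used with cm.ravel() above.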
# terraform.tfvars
access_key = ""
secret_key = " "
aws_account_id = " "
In the terminal type/paste terraform init to initialize the backend.
Then type/paste terraform plan to view the plan, or simply terraform validate to check that there are no errors.
Finally in the terminal type/paste terraform apply --auto-approve
This shows two outputs: bucket_name and pretrained_ml_instance_name (the third resource is the variable name given to the bucket, since S3 bucket names are global).
output = subprocess.check_output('terraform output -json', shell=True, cwd = r'' #C:\Users\Saahen\Desktop\ClassiSage
and change it to the path of the project directory, then save the file.
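The terraform output -json command prints a JSON object in which each output name maps to an entry with sensitive, type, and value fields. The snippet below is a minimal sketch of how that text could be reduced to a plain name-to-value mapping; the function name parse_terraform_outputs and the sample values are hypothetical, and the real values come from your own Terraform state.

```python
import json


def parse_terraform_outputs(raw_json):
    """Map each Terraform output name to its plain value."""
    outputs = json.loads(raw_json)
    return {name: entry["value"] for name, entry in outputs.items()}


# Hypothetical sample in the shape that `terraform output -json` prints
sample = """
{
  "bucket_name": {
    "sensitive": false,
    "type": "string",
    "value": "data-bucket-example"
  },
  "pretrained_ml_instance_name": {
    "sensitive": false,
    "type": "string",
    "value": "example-instance"
  }
}
"""

values = parse_terraform_outputs(sample)
print(values["bucket_name"])
print(values["pretrained_ml_instance_name"])
```

In the project, the bytes returned by subprocess.check_output would be passed to this function in place of the sample string.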
# Try to upload the local CSV file to the S3 bucket
try:
    print("try block executing")
    s3.upload_file(
        Filename=local_file_path,
        Bucket=bucket_name,
        Key=file_key  # S3 file key (filename in the bucket)
    )
    print(f"Successfully uploaded {file_key} to {bucket_name}")
    # Delete the local file after uploading to S3
    os.remove(local_file_path)
    print(f"Local file {local_file_path} deleted after upload.")
except Exception as e:
    print(f"Failed to upload file: {e}")
    os.remove(local_file_path)
to upload dataset to S3 Bucket.
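The upload code removes the local file in both the success and the failure branch. One way to express that once is a try/finally block. The snippet below is a minimal sketch of that pattern, not the project's implementation: the helper upload_then_delete and the failing_upload stand-in are hypothetical, and the upload step is passed in as a function so the example runs without AWS credentials.

```python
import os
import tempfile


def upload_then_delete(upload, local_file_path):
    """Call upload(local_file_path), then delete the file in every case.

    Returns True when the upload succeeded and False when it raised.
    """
    try:
        upload(local_file_path)
        return True
    except Exception as error:
        print(f"Failed to upload file: {error}")
        return False
    finally:
        # Runs after success and after failure alike
        if os.path.exists(local_file_path):
            os.remove(local_file_path)


def failing_upload(path):
    # Hypothetical stand-in for an upload that fails
    raise RuntimeError("simulated upload failure")


# Create a throwaway file to demonstrate the cleanup
handle, demo_path = tempfile.mkstemp(suffix=".csv")
os.close(handle)

succeeded = upload_then_delete(failing_upload, demo_path)
file_still_exists = os.path.exists(demo_path)
print(succeeded, file_still_exists)
```

With boto3, the upload argument could be a small function that calls s3.upload_file(Filename=path, Bucket=bucket_name, Key=file_key), using the same names as the code above.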
S3 bucket named 'data-bucket-' with 2 objects uploaded: a dataset and the pretrained_sm.ipynb file containing the model code.
aws s3 cp s3:///pretrained_sm.ipynb /home/ec2-user/SageMaker/
Terminal command to copy pretrained_sm.ipynb from S3 into the notebook's Jupyter environment
# S3 bucket, region, session
bucket_name = 'data-bucket-axhq3rp8'
my_region = boto3.session.Session().region_name
sess = boto3.session.Session()
print("Region is " + my_region + " and bucket is " + bucket_name)
Output of the code cell execution
# Print the metrics
print(f"Accuracy: {accuracy:.8f}")
print(f"Precision: {precision:.8f}")
print(f"Recall: {recall:.8f}")
print(f"F1 Score: {f1:.8f}")
print(f"False Positive Rate: {false_positive_rate:.8f}")
Execution of 8th cell
# Set an output path where the trained model will be saved
prefix = 'pretrained-algo'
output_path = 's3://{}/{}/output'.format(bucket_name, prefix)
print(output_path)
Execution of 23rd cell
estimator.fit({'train': s3_input_train,'validation': s3_input_test})
Execution of 24th code cell
xgb_predictor = estimator.deploy(initial_instance_count=1,instance_type='ml.m5.large')
Additional Console Observation:
bucket_to_delete = boto3.resource('s3').Bucket(bucket_name)
bucket_to_delete.objects.all().delete()
