A machine learning model built with AWS SageMaker and its Python SDK to classify HDFS logs, with Terraform automating the infrastructure setup.
Link: GitHub
Language: HCL (terraform), Python
# Looks for the XGBoost image URI and builds an XGBoost container.
# Specify the repo_version depending on preference.
container = get_image_uri(boto3.Session().region_name, 'xgboost', repo_version='1.0-1')
hyperparameters = {
    # Maximum depth of a tree. Higher means more complex models but risk of overfitting.
    "max_depth": "5",
    # Learning rate. Lower values make the learning process slower but more precise.
    "eta": "0.2",
    # Minimum loss reduction required to make a further partition on a leaf node. Controls the model's complexity.
    "gamma": "4",
    # Minimum sum of instance weight (hessian) needed in a child. Higher values prevent overfitting.
    "min_child_weight": "6",
    # Fraction of training data used. Reduces overfitting by sampling part of the data.
    "subsample": "0.7",
    # Specifies the learning task and corresponding objective. binary:logistic is for binary classification.
    "objective": "binary:logistic",
    # Number of boosting rounds, essentially how many times the model is trained.
    "num_round": 50
}

# A SageMaker estimator that calls the xgboost-container
estimator = sagemaker.estimator.Estimator(
    # Points to the XGBoost container we previously set up. This tells SageMaker which algorithm container to use.
    image_uri=container,
    # Passes the defined hyperparameters to the estimator. These are the settings that guide the training process.
    hyperparameters=hyperparameters,
    # Specifies the IAM role that SageMaker assumes during the training job. This role allows access to AWS resources like S3.
    role=sagemaker.get_execution_role(),
    # Sets the number of training instances. Here, it's using a single instance.
    train_instance_count=1,
    # Specifies the type of instance to use for training. ml.m5.large is a general-purpose instance
    # with a balance of compute, memory, and network resources.
    train_instance_type='ml.m5.large',
    # Sets the size of the storage volume attached to the training instance, in GB. Here, it's 5 GB.
    train_volume_size=5,
    # Defines where the model artifacts and output of the training job will be saved in S3.
    output_path=output_path,
    # Utilizes spot instances for training, which can be significantly cheaper than on-demand instances.
    # Spot instances are spare EC2 capacity offered at a lower price.
    train_use_spot_instances=True,
    # Specifies the maximum runtime for the training job in seconds. Here, it's 300 seconds (5 minutes).
    train_max_run=300,
    # Sets the maximum time to wait for the job to complete, including the time waiting for spot instances,
    # in seconds. Here, it's 600 seconds (10 minutes).
    train_max_wait=600)
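For managed spot training, SageMaker expects the maximum wait time to be at least as long as the maximum run time, which is why the estimator above pairs 300 seconds of run time with 600 seconds of wait time. The snippet below is a minimal sketch of a pre-flight check for that constraint. `check_spot_timing` is a hypothetical helper written for illustration; it is not part of the SageMaker SDK.

```python
def check_spot_timing(max_run: int, max_wait: int) -> int:
    """Validate spot-training time limits (both in seconds).

    Hypothetical helper, not part of the SageMaker SDK.
    Returns the slack left for waiting on spot capacity if training
    were to use its full max_run budget.
    """
    if max_run <= 0:
        raise ValueError("max_run must be positive")
    if max_wait < max_run:
        raise ValueError("max_wait must be >= max_run for spot training")
    return max_wait - max_run


# Values from the estimator above: 300 s of training, 600 s overall.
slack = check_spot_timing(max_run=300, max_wait=600)
print(f"Up to {slack} s can be spent waiting for spot capacity")
```

Running a check like this before calling `fit` catches a swapped pair of values locally instead of after the training job request is rejected.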
estimator.fit({'train': s3_input_train,'validation': s3_input_test})
xgb_predictor = estimator.deploy(initial_instance_count=1,instance_type='ml.m5.large')
from sagemaker.serializers import CSVSerializer
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# Drop the label column from the test data
test_data_features = test_data_final.drop(columns=['Label']).values

# Set the content type and serializer
xgb_predictor.serializer = CSVSerializer()
xgb_predictor.content_type = 'text/csv'

# Perform prediction
predictions = xgb_predictor.predict(test_data_features).decode('utf-8')
y_test = test_data_final['Label'].values

# Convert the predictions into an array
predictions_array = np.fromstring(predictions, sep=',')
print(predictions_array.shape)

# Convert the predictions to binary (0 or 1)
threshold = 0.5
binary_predictions = (predictions_array >= threshold).astype(int)

# Accuracy
accuracy = accuracy_score(y_test, binary_predictions)
# Precision
precision = precision_score(y_test, binary_predictions)
# Recall
recall = recall_score(y_test, binary_predictions)
# F1 Score
f1 = f1_score(y_test, binary_predictions)
# Confusion Matrix
cm = confusion_matrix(y_test, binary_predictions)

# False Positive Rate (FPR) using the confusion matrix
tn, fp, fn, tp = cm.ravel()
false_positive_rate = fp / (fp + tn)

# Print the metrics
print(f"Accuracy: {accuracy:.8f}")
print(f"Precision: {precision:.8f}")
print(f"Recall: {recall:.8f}")
print(f"F1 Score: {f1:.8f}")
print(f"False Positive Rate: {false_positive_rate:.8f}")
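To make the metric definitions concrete, here is a small dependency-free sketch that thresholds scores at 0.5 and derives the same quantities from the four confusion-matrix counts. The labels and scores are made up for illustration and are not results from the HDFS dataset. Returning 0.0 when a denominator is zero is an assumption of this sketch.

```python
def binary_metrics(y_true, scores, threshold=0.5):
    """Compute confusion-matrix counts and derived metrics for 0/1 labels."""
    preds = [1 if s >= threshold else 0 for s in scores]
    pairs = list(zip(y_true, preds))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)

    def ratio(num, den):
        # Assumption of this sketch: an undefined ratio is reported as 0.0.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
        "false_positive_rate": ratio(fp, fp + tn),
    }


# Made-up labels and scores, for illustration only.
y_true = [0, 0, 0, 1, 1, 0]
scores = [0.1, 0.7, 0.2, 0.9, 0.4, 0.3]
metrics = binary_metrics(y_true, scores)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

In this toy input one of the four negatives is scored above the threshold, so the false positive rate is 1 / (1 + 3) = 0.25.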
# terraform.tfvars
access_key = ""
secret_key = " "
aws_account_id = " "
In the terminal type/paste terraform init to initialize the backend.
Then type/paste terraform plan to view the plan, or simply terraform validate to ensure that there are no errors.
Finally in the terminal type/paste terraform apply --auto-approve
This will show two outputs: bucket_name and pretrained_ml_instance_name (the third resource is the variable name given to the bucket, since S3 bucket names are global).
output = subprocess.check_output('terraform output -json', shell=True, cwd=r'')  # C:\Users\Saahen\Desktop\ClassiSage
Change the cwd value to the path where the project directory is located, then save the file.
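`terraform output -json` prints a JSON object in which each output name maps to an entry with `sensitive`, `type`, and `value` fields. The sketch below shows one way to turn the bytes returned by `subprocess.check_output` into a plain name-to-value dictionary. `parse_terraform_outputs` is a hypothetical helper, and the sample payload uses placeholder values rather than real resource names.

```python
import json


def parse_terraform_outputs(raw):
    """Map each Terraform output name to its value.

    Accepts the bytes (or str) produced by `terraform output -json`,
    whose entries look like {"name": {"sensitive": ..., "type": ..., "value": ...}}.
    """
    if isinstance(raw, bytes):
        raw = raw.decode("utf-8")
    return {name: entry["value"] for name, entry in json.loads(raw).items()}


# Illustrative payload; the values are placeholders, not real resources.
sample = (
    b'{"bucket_name": {"sensitive": false, "type": "string", '
    b'"value": "data-bucket-example"}, '
    b'"pretrained_ml_instance_name": {"sensitive": false, "type": "string", '
    b'"value": "example-notebook"}}'
)
outputs = parse_terraform_outputs(sample)
print(outputs["bucket_name"])
print(outputs["pretrained_ml_instance_name"])
```

In the project script, the `output` variable from the `subprocess.check_output` call would take the place of `sample`.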
# Try to upload the local CSV file to the S3 bucket
try:
    print(f"try block executing")
    s3.upload_file(
        Filename=local_file_path,
        Bucket=bucket_name,
        Key=file_key  # S3 file key (filename in the bucket)
    )
    print(f"Successfully uploaded {file_key} to {bucket_name}")

    # Delete the local file after uploading to S3
    os.remove(local_file_path)
    print(f"Local file {local_file_path} deleted after upload.")
except Exception as e:
    print(f"Failed to upload file: {e}")
    os.remove(local_file_path)
This uploads the dataset to the S3 bucket.
S3 bucket named 'data-bucket-' with 2 objects uploaded: a dataset and the pretrained_sm.ipynb file containing the model code.
aws s3 cp s3:///pretrained_sm.ipynb /home/ec2-user/SageMaker/
Terminal command to copy pretrained_sm.ipynb from S3 to the notebook's Jupyter environment
# S3 bucket, region, session
bucket_name = 'data-bucket-axhq3rp8'
my_region = boto3.session.Session().region_name
sess = boto3.session.Session()
print("Region is " + my_region + " and bucket is " + bucket_name)
Output of the code cell execution
# Print the metrics
print(f"Accuracy: {accuracy:.8f}")
print(f"Precision: {precision:.8f}")
print(f"Recall: {recall:.8f}")
print(f"F1 Score: {f1:.8f}")
print(f"False Positive Rate: {false_positive_rate:.8f}")
Execution of 8th cell
# Set an output path where the trained model will be saved
prefix = 'pretrained-algo'
output_path = 's3://{}/{}/output'.format(bucket_name, prefix)
print(output_path)
Execution of 23rd cell
estimator.fit({'train': s3_input_train,'validation': s3_input_test})
Execution of 24th code cell
xgb_predictor = estimator.deploy(initial_instance_count=1,instance_type='ml.m5.large')
Additional Console Observation:
sagemaker.Session().delete_endpoint(xgb_predictor.endpoint)
bucket_to_delete = boto3.resource('s3').Bucket(bucket_name)
bucket_to_delete.objects.all().delete()
ClassiSage/downloaded_bucket_content
ClassiSage/.terraform
ClassiSage/ml_ops/__pycache__
ClassiSage/.terraform.lock.hcl
ClassiSage/terraform.tfstate
ClassiSage/terraform.tfstate.backup
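If the entries listed above are the locally generated files and folders to clear out once the cloud resources are gone (an assumption, since the surrounding instructions are not shown here), the cleanup could be scripted along these lines. `remove_artifacts` is a hypothetical helper, and the cache folder is assumed to be Python's standard `__pycache__` directory.

```python
import shutil
from pathlib import Path

# Paths relative to the project root, taken from the list above.
# Assumption: the cache folder is Python's standard __pycache__ directory.
ARTIFACTS = [
    "downloaded_bucket_content",
    ".terraform",
    "ml_ops/__pycache__",
    ".terraform.lock.hcl",
    "terraform.tfstate",
    "terraform.tfstate.backup",
]


def remove_artifacts(project_root):
    """Delete the listed files and directories if present.

    Hypothetical helper; returns the relative paths that were removed.
    """
    root = Path(project_root)
    removed = []
    for rel in ARTIFACTS:
        target = root / rel
        if target.is_dir():
            shutil.rmtree(target)   # directories, including their contents
        elif target.exists():
            target.unlink()         # plain files
        else:
            continue                # nothing to do for missing entries
        removed.append(rel)
    return removed
```

Calling `remove_artifacts("ClassiSage")` from the directory that contains the project would remove whichever of these entries exist and leave everything else untouched. Note that deleting terraform.tfstate while resources are still deployed leaves Terraform unable to track them, so this should only run after the infrastructure has been destroyed.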
NOTE:
If you liked the idea and implementation of this machine learning project, which uses AWS S3 and SageMaker for HDFS log classification and Terraform for IaC (infrastructure setup automation), kindly consider liking this post and starring the project repository on GitHub after checking it out.