
ClassiSage:基於 Terraform IaC 自動化 AWS SageMaker HDFS 日誌分類模型

Published on 2024-11-07

ClassiSage

A machine learning model built with AWS SageMaker and its Python SDK for classification of HDFS logs, using Terraform to automate the infrastructure setup.

Link: GitHub
Languages: HCL (Terraform), Python

Content

  • Overview: Project overview.
  • System Architecture: System architecture diagram.
  • ML Model: Model overview.
  • Getting Started: How to run the project.
  • Console Observations: Changes in instances and infrastructure that can be observed while running the project.
  • Ending and Cleanup: Ensuring no additional charges.
  • Auto Created Objects: Files and folders created during the execution process.

  • Firstly, follow the directory structure for a better project setup.
  • For a better understanding, take the ClassiSage project repository on GitHub as the main reference.

Overview

  • The model is built with AWS SageMaker for classification of HDFS logs, with S3 used for storing the dataset, the notebook file (containing the code for the SageMaker instance), and the model output.
  • The infrastructure setup is automated using Terraform, an infrastructure-as-code tool created by HashiCorp.
  • The dataset used is HDFS_v1.
  • The project implements the SageMaker Python SDK with the XGBoost model, version 1.2.

System Architecture

[System architecture diagram]

ML Model

  • Image URI
  # Looks for the XGBoost image URI and builds an XGBoost container. Specify the repo_version depending on preference.
  from sagemaker.amazon.amazon_estimator import get_image_uri
  import boto3

  container = get_image_uri(boto3.Session().region_name,
                            'xgboost',
                            repo_version='1.0-1')
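Note that get_image_uri belongs to SageMaker Python SDK v1; in SDK v2 it has been replaced by sagemaker.image_uris.retrieve. A minimal v2 equivalent, assuming the same region and container version as above:

  import boto3
  import sagemaker

  # SDK v2 replacement for get_image_uri: resolves the registry path of the
  # managed XGBoost container for the given region and version.
  container = sagemaker.image_uris.retrieve(
      framework='xgboost',
      region=boto3.Session().region_name,
      version='1.0-1')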


  • Initializing hyperparameters and the estimator call to the container
  hyperparameters = {
        "max_depth":"5",                ## Maximum depth of a tree. Higher means more complex models but risk of overfitting.
        "eta":"0.2",                    ## Learning rate. Lower values make the learning process slower but more precise.
        "gamma":"4",                    ## Minimum loss reduction required to make a further partition on a leaf node. Controls the model's complexity.
        "min_child_weight":"6",         ## Minimum sum of instance weight (hessian) needed in a child. Higher values prevent overfitting.
        "subsample":"0.7",              ## Fraction of training data used. Reduces overfitting by sampling part of the data.
        "objective":"binary:logistic",  ## Specifies the learning task and corresponding objective. binary:logistic is for binary classification.
        "num_round":50                  ## Number of boosting rounds, essentially how many times the model is trained.
        }
  # A SageMaker estimator that calls the xgboost-container
  estimator = sagemaker.estimator.Estimator(image_uri=container,                  # Points to the XGBoost container we previously set up. This tells SageMaker which algorithm container to use.
                                          hyperparameters=hyperparameters,      # Passes the defined hyperparameters to the estimator. These are the settings that guide the training process.
                                          role=sagemaker.get_execution_role(),  # Specifies the IAM role that SageMaker assumes during the training job. This role allows access to AWS resources like S3.
                                          train_instance_count=1,               # Sets the number of training instances. Here, it's using a single instance.
                                          train_instance_type='ml.m5.large',    # Specifies the type of instance to use for training. ml.m5.large is a general-purpose instance with a balance of compute, memory, and network resources.
                                          train_volume_size=5,                  # Sets the size of the storage volume attached to the training instance, in GB. Here, it's 5 GB.
                                          output_path=output_path,              # Defines where the model artifacts and output of the training job will be saved in S3.
                                          train_use_spot_instances=True,        # Utilizes spot instances for training, which can be significantly cheaper than on-demand instances. Spot instances are spare EC2 capacity offered at a lower price.
                                          train_max_run=300,                    # Specifies the maximum runtime for the training job in seconds. Here, it's 300 seconds (5 minutes).
                                          train_max_wait=600)                   # Sets the maximum time to wait for the job to complete, including the time waiting for spot instances, in seconds. Here, it's 600 seconds (10 minutes).
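The train_* keyword arguments above follow SageMaker Python SDK v1 naming; if you run the notebook against SDK v2, the same estimator would be declared with renamed parameters. A sketch under that assumption, reusing container, hyperparameters, and output_path from above:

  import sagemaker

  # SDK v2 renames: train_instance_count -> instance_count,
  # train_instance_type -> instance_type, train_volume_size -> volume_size,
  # train_use_spot_instances -> use_spot_instances,
  # train_max_run -> max_run, train_max_wait -> max_wait.
  estimator = sagemaker.estimator.Estimator(
      image_uri=container,
      hyperparameters=hyperparameters,
      role=sagemaker.get_execution_role(),
      instance_count=1,
      instance_type='ml.m5.large',
      volume_size=5,            # GB
      output_path=output_path,
      use_spot_instances=True,
      max_run=300,              # seconds
      max_wait=600)             # seconds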


  • Training Job
  estimator.fit({'train': s3_input_train,'validation': s3_input_test})
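The s3_input_train and s3_input_test channels are defined earlier in the notebook. For reference, such channels are typically built like this (a sketch with hypothetical S3 prefixes derived from the notebook's bucket_name and prefix; the class is sagemaker.inputs.TrainingInput in SDK v2, formerly s3_input in v1):

  from sagemaker.inputs import TrainingInput

  # Hypothetical train/test prefixes; the notebook derives the actual
  # locations from bucket_name and prefix.
  s3_input_train = TrainingInput(
      s3_data='s3://{}/{}/train'.format(bucket_name, prefix),
      content_type='csv')
  s3_input_test = TrainingInput(
      s3_data='s3://{}/{}/test'.format(bucket_name, prefix),
      content_type='csv')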


  • Deployment
  xgb_predictor = estimator.deploy(initial_instance_count=1,instance_type='ml.m5.large')
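With SDK v2 the serializer can also be attached at deployment time, instead of being set on the predictor afterwards as the validation cell below does. A sketch under that assumption:

  from sagemaker.serializers import CSVSerializer

  # Passing the serializer here makes every predict() call send text/csv
  # payloads without further configuration.
  xgb_predictor = estimator.deploy(
      initial_instance_count=1,
      instance_type='ml.m5.large',
      serializer=CSVSerializer())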


  • Validation
  from sagemaker.serializers import CSVSerializer
  import numpy as np
  from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

  # Drop the label column from the test data
  test_data_features = test_data_final.drop(columns=['Label']).values

  # Set the content type and serializer
  xgb_predictor.serializer = CSVSerializer()
  xgb_predictor.content_type = 'text/csv'

  # Perform prediction
  predictions = xgb_predictor.predict(test_data_features).decode('utf-8')

  y_test = test_data_final['Label'].values

  # Convert the predictions into an array
  predictions_array = np.fromstring(predictions, sep=',')
  print(predictions_array.shape)

  # Convert the predictions to binary (0 or 1)
  threshold = 0.5
  binary_predictions = (predictions_array >= threshold).astype(int)

  # Accuracy
  accuracy = accuracy_score(y_test, binary_predictions)

  # Precision
  precision = precision_score(y_test, binary_predictions)

  # Recall
  recall = recall_score(y_test, binary_predictions)

  # F1 Score
  f1 = f1_score(y_test, binary_predictions)

  # Confusion Matrix
  cm = confusion_matrix(y_test, binary_predictions)

  # False Positive Rate (FPR) using the confusion matrix
  tn, fp, fn, tp = cm.ravel()
  false_positive_rate = fp / (fp + tn)

  # Print the metrics
  print(f"Accuracy: {accuracy:.8f}")
  print(f"Precision: {precision:.8f}")
  print(f"Recall: {recall:.8f}")
  print(f"F1 Score: {f1:.8f}")
  print(f"False Positive Rate: {false_positive_rate:.8f}")


Getting Started

  • Clone the repository using Git Bash, download it as a .zip file, or fork the repository.
  • Go to your AWS Management Console, click on your account profile in the top-right corner, and select My Security Credentials from the dropdown.
  • Create an access key: in the Access keys section, click Create New Access Key; a dialog will appear with your Access Key ID and Secret Access Key.
  • Download or copy the keys (IMPORTANT): download the .csv file or copy the keys to a secure location. This is the only time you can view the secret access key.
  • Open the cloned repo in your VS Code.
  • Create a file under ClassiSage named terraform.tfvars with the following content:
  # terraform.tfvars
  access_key = ""
  secret_key = ""
  aws_account_id = ""
  • Download and install all the dependencies for using Terraform and Python.
  • In the terminal, type/paste terraform init to initialize the backend.

  • Then type/paste terraform plan to view the plan, or simply terraform validate to ensure there are no errors.

  • Finally, in the terminal type/paste terraform apply --auto-approve.

  • This will show two outputs, one as bucket_name and the other as pretrained_ml_instance_name. (The third resource is the variable name given to the bucket, since S3 buckets are global resources.)


  • After the command completes in the terminal, navigate to ClassiSage/ml_ops/function.py; the 11th line of the file contains the code
  output = subprocess.check_output('terraform output -json', shell=True, cwd=r'')  # e.g. r'C:\Users\Saahen\Desktop\ClassiSage'

Change the cwd value to the path where the project directory is present and save the file.
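For reference, a minimal sketch of how the JSON emitted by terraform output -json can be consumed in Python (the cwd path is an example; the output names bucket_name and pretrained_ml_instance_name follow the Terraform outputs mentioned above, and the exact keys used in function.py may differ):

  import json
  import subprocess

  # Run `terraform output -json` in the project directory.
  raw = subprocess.check_output('terraform output -json', shell=True,
                                cwd=r'C:\path\to\ClassiSage')

  # Terraform returns a JSON object keyed by output name; the actual data
  # for each output sits under its "value" field.
  outputs = json.loads(raw)
  bucket_name = outputs['bucket_name']['value']
  instance_name = outputs['pretrained_ml_instance_name']['value']
  print(bucket_name, instance_name)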

  • Then, in ClassiSage\ml_ops\data_upload.ipynb, run all code cells up to cell number 25, which contains the code
  # Try to upload the local CSV file to the S3 bucket
  try:
    print(f"try block executing")
    s3.upload_file(
        Filename=local_file_path,  # Local path of the dataset file
        Bucket=bucket_name,        # Target S3 bucket name
        Key=file_key               # S3 file key (filename in the bucket)
    )
    print(f"Successfully uploaded {file_key} to {bucket_name}")

    # Delete the local file after uploading to S3
    os.remove(local_file_path)
    print(f"Local file {local_file_path} deleted after upload.")

  except Exception as e:
    print(f"Failed to upload file: {e}")
    os.remove(local_file_path)  # Clean up the local copy even if the upload fails

to upload the dataset to the S3 bucket.
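If you want to confirm the upload from code rather than the console, a quick check with boto3 (assuming the same s3 client and bucket_name already defined in the notebook):

  # List the objects now present in the bucket to confirm the upload.
  resp = s3.list_objects_v2(Bucket=bucket_name)
  for obj in resp.get('Contents', []):
      print(obj['Key'], obj['Size'])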

  • Output of the code cell execution


  • After the execution of the notebook, re-open your AWS Management Console.
  • Search for the S3 and SageMaker services; you will see an instance of each service initiated (an S3 bucket and a SageMaker notebook).

S3 bucket named 'data-bucket-' with 2 objects uploaded: the dataset and the pretrained_sm.ipynb file containing the model code.



  • Go to the notebook instances in AWS SageMaker, click on the created instance, and click Open Jupyter.
  • After that, click New on the top right side of the window and select Terminal.
  • This will create a new terminal.

  • In the terminal, paste the following (replacing <bucket_name> with the bucket_name output shown in the VS Code terminal):
  aws s3 cp s3://<bucket_name>/pretrained_sm.ipynb /home/ec2-user/SageMaker/

Terminal command to copy pretrained_sm.ipynb from S3 into the notebook's Jupyter environment.



  • Go back to the open Jupyter instance, click on the pretrained_sm.ipynb file to open it, and assign it the conda_python3 kernel.
  • Scroll down to the 4th cell and replace the value of the variable bucket_name with the bucket_name shown in the VS Code terminal output:
  # S3 bucket, region, session
  bucket_name = 'data-bucket-axhq3rp8'
  my_region = boto3.session.Session().region_name
  sess = boto3.session.Session()
  print("Region is " + my_region + " and bucket is " + bucket_name)

Output of the code cell execution



  • At the top of the file, restart the kernel from the Kernel tab.
  • Execute the notebook up to code cell number 27, which contains the code
# Print the metrics
print(f"Accuracy: {accuracy:.8f}")
print(f"Precision: {precision:.8f}")
print(f"Recall: {recall:.8f}")
print(f"F1 Score: {f1:.8f}")
print(f"False Positive Rate: {false_positive_rate:.8f}")
  • You will get the intended result: the data is fetched, adjusted for labels and features, and split into train and test sets with a defined output path; a model is then trained using SageMaker's Python SDK, deployed as an endpoint, and validated to produce the various metrics.

Console Observation Notes

Execution of 8th cell

# Set an output path where the trained model will be saved
prefix = 'pretrained-algo'
output_path ='s3://{}/{}/output'.format(bucket_name, prefix)
print(output_path)
  • An output path will be set up in S3 to store the model data.


Execution of 23rd cell

estimator.fit({'train': s3_input_train,'validation': s3_input_test})
  • A training job will start; you can check it under the Training tab.


  • After some time (est. 3 minutes), it will complete, and the console will show it as completed.


Execution of 24th code cell

xgb_predictor = estimator.deploy(initial_instance_count=1,instance_type='ml.m5.large')
  • An endpoint will be deployed under the Inference tab.
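If you prefer to confirm the endpoint status from code rather than the console, a minimal sketch using boto3 (assuming xgb_predictor from the previous cell; the attribute is endpoint_name in SDK v2, plain endpoint in v1):

  import boto3

  sm_client = boto3.client('sagemaker')

  # Status progresses from 'Creating' to 'InService' once deployment finishes.
  response = sm_client.describe_endpoint(EndpointName=xgb_predictor.endpoint_name)
  print(response['EndpointStatus'])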


Additional console observations:

  • Creation of an endpoint configuration under the Inference tab.


  • Creation of a model, also under the Inference tab.



Ending and Cleanup

  • In VS Code, come back to data_upload.ipynb and execute the last 2 code cells to download the S3 bucket's data into the local system.
  • The folder will be named downloaded_bucket_content. Check the directory structure of the downloaded folder.


  • You will get a log of downloaded files in the output cell. It will contain the raw pretrained_sm.ipynb, final_dataset.csv, and a model output folder named 'pretrained-algo' with the execution data of the SageMaker code file.
  • Finally, go into pretrained_sm.ipynb inside the SageMaker instance and execute the final 2 code cells. The endpoint and the resources within the S3 bucket will be deleted to ensure no additional charges.
  • Deleting the endpoint:
  sagemaker.Session().delete_endpoint(xgb_predictor.endpoint)
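Note that sagemaker.Session().delete_endpoint(xgb_predictor.endpoint) reflects SDK v1; with SDK v2 the same cleanup is usually done directly on the predictor. A sketch under that assumption:

  # SDK v2: delete the endpoint (and optionally its endpoint configuration)
  # straight from the predictor object.
  xgb_predictor.delete_endpoint(delete_endpoint_config=True)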


  • Clearing S3 (needed before the instance can be destroyed):
  bucket_to_delete = boto3.resource('s3').Bucket(bucket_name)
  bucket_to_delete.objects.all().delete()
  • Come back to the VS Code terminal for the project directory, then type/paste terraform destroy --auto-approve.
  • All the created resource instances will be deleted.

Auto Created Objects

ClassiSage/downloaded_bucket_content
ClassiSage/.terraform
ClassiSage/ml_ops/__pycache__
ClassiSage/.terraform.lock.hcl
ClassiSage/terraform.tfstate
ClassiSage/terraform.tfstate.backup

NOTE:
If you liked the idea and implementation of this machine learning project, which uses AWS S3 and SageMaker for HDFS log classification and Terraform for IaC (infrastructure setup automation), kindly consider liking this post and starring the project repository after checking it out on GitHub.
