
ClassiSage:基于 Terraform IaC 自动化 AWS SageMaker HDFS 日志分类模型

Published 2024-11-07

ClassiSage

A machine learning model built with AWS SageMaker and its Python SDK to classify HDFS logs, with Terraform automating the infrastructure setup.

Link: GitHub
Language: HCL (terraform), Python

Contents

  • Overview: Project Overview.
  • System Architecture: System Architecture Diagram
  • ML Model: Model Overview.
  • Getting Started: How to run the project.
  • Console Observations: Changes in instances and infrastructure that can be observed while running the project.
  • Ending and Cleanup: Ensuring no additional charges.
  • Auto Created Objects: Files and folders created during the execution process.

  • First, follow the Directory Structure for a better project setup.
  • For a fuller understanding, take major reference from ClassiSage's project repository uploaded on GitHub.

Overview

  • The model is built with AWS SageMaker to classify HDFS logs, with S3 used to store the dataset, the notebook file (containing the code for the SageMaker instance), and the model output.
  • The infrastructure setup is automated using Terraform, an infrastructure-as-code tool created by HashiCorp.
  • The dataset used is HDFS_v1.
  • The project uses the SageMaker Python SDK with XGBoost version 1.2.

System Architecture

(System architecture diagram)

ML Model

  • Image URI
  # Looks up the XGBoost image URI and builds an XGBoost container.
  # Specify repo_version depending on preference.
  # (get_image_uri comes from sagemaker.amazon.amazon_estimator in SDK v1;
  # SDK v2 replaces it with sagemaker.image_uris.retrieve.)
  container = get_image_uri(boto3.Session().region_name,
                            'xgboost',
                            repo_version='1.0-1')


  • Initializing hyperparameters and the Estimator call to the container
  hyperparameters = {
        "max_depth":"5",                ## Maximum depth of a tree. Higher means more complex models but risk of overfitting.
        "eta":"0.2",                    ## Learning rate. Lower values make the learning process slower but more precise.
        "gamma":"4",                    ## Minimum loss reduction required to make a further partition on a leaf node. Controls the model’s complexity.
        "min_child_weight":"6",         ## Minimum sum of instance weight (hessian) needed in a child. Higher values prevent overfitting.
        "subsample":"0.7",              ## Fraction of training data used. Reduces overfitting by sampling part of the data. 
        "objective":"binary:logistic",  ## Specifies the learning task and corresponding objective. binary:logistic is for binary classification.
        "num_round":50                  ## Number of boosting rounds, essentially how many times the model is trained.
        }
  # A SageMaker estimator that calls the xgboost-container
  estimator = sagemaker.estimator.Estimator(image_uri=container,                  # Points to the XGBoost container we previously set up. This tells SageMaker which algorithm container to use.
                                          hyperparameters=hyperparameters,      # Passes the defined hyperparameters to the estimator. These are the settings that guide the training process.
                                          role=sagemaker.get_execution_role(),  # Specifies the IAM role that SageMaker assumes during the training job. This role allows access to AWS resources like S3.
                                          train_instance_count=1,               # Sets the number of training instances. Here, it’s using a single instance.
                                          train_instance_type='ml.m5.large',    # Specifies the instance type for training. ml.m5.large is a general-purpose instance with a balance of compute, memory, and network resources.
                                          train_volume_size=5, # 5GB            # Sets the size of the storage volume attached to the training instance, in GB. Here, it’s 5 GB.
                                          output_path=output_path,              # Defines where the model artifacts and output of the training job will be saved in S3.
                                          train_use_spot_instances=True,        # Utilizes spot instances for training, which can be significantly cheaper than on-demand instances. Spot instances are spare EC2 capacity offered at a lower price.
                                          train_max_run=300,                    # Specifies the maximum runtime for the training job in seconds. Here, it's 300 seconds (5 minutes).
                                          train_max_wait=600)                   # Sets the maximum time to wait for the job to complete, including the time waiting for spot instances, in seconds. Here, it's 600 seconds (10 minutes).
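A detail worth noting in the spot settings above: for spot training jobs, SageMaker requires the maximum wait time to be at least the maximum run time, since the wait budget covers the run itself plus time spent acquiring spot capacity. A tiny sanity-check sketch (the helper name is hypothetical):

```python
def check_spot_timing(max_run_s: int, max_wait_s: int) -> None:
    """Spot training requires max_wait >= max_run; raise otherwise."""
    if max_wait_s < max_run_s:
        raise ValueError(
            f"train_max_wait ({max_wait_s}s) must be >= train_max_run ({max_run_s}s)"
        )

# Values from the estimator above: 300s run cap, 600s wait cap
check_spot_timing(300, 600)  # passes silently
```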


  • Training Job
  estimator.fit({'train': s3_input_train,'validation': s3_input_test})


  • Deployment
  xgb_predictor = estimator.deploy(initial_instance_count=1,instance_type='ml.m5.large')


  • Validation
  from sagemaker.serializers import CSVSerializer
  import numpy as np
  from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

  # Drop the label column from the test data
  test_data_features = test_data_final.drop(columns=['Label']).values

  # Set the content type and serializer
  xgb_predictor.serializer = CSVSerializer()
  xgb_predictor.content_type = 'text/csv'

  # Perform prediction
  predictions = xgb_predictor.predict(test_data_features).decode('utf-8')

  y_test = test_data_final['Label'].values

  # Convert the predictions into an array
  # (np.fromstring is deprecated; split the CSV string instead)
  predictions_array = np.array(predictions.split(','), dtype=float)
  print(predictions_array.shape)

  # Convert predictions to binary (0 or 1)
  threshold = 0.5
  binary_predictions = (predictions_array >= threshold).astype(int)

  # Accuracy
  accuracy = accuracy_score(y_test, binary_predictions)

  # Precision
  precision = precision_score(y_test, binary_predictions)

  # Recall
  recall = recall_score(y_test, binary_predictions)

  # F1 Score
  f1 = f1_score(y_test, binary_predictions)

  # Confusion Matrix
  cm = confusion_matrix(y_test, binary_predictions)

  # False Positive Rate (FPR) using the confusion matrix
  tn, fp, fn, tp = cm.ravel()
  false_positive_rate = fp / (fp + tn)

  # Print the metrics
  print(f"Accuracy: {accuracy:.8f}")
  print(f"Precision: {precision:.8f}")
  print(f"Recall: {recall:.8f}")
  print(f"F1 Score: {f1:.8f}")
  print(f"False Positive Rate: {false_positive_rate:.8f}")
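The sklearn metrics above can be cross-checked from the confusion-matrix counts alone; a minimal pure-Python sketch with hypothetical counts:

```python
# Hypothetical confusion-matrix counts (tn, fp, fn, tp), as from cm.ravel()
tn, fp, fn, tp = 50, 5, 10, 35

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
false_positive_rate = fp / (fp + tn)

print(f"Accuracy: {accuracy:.8f}")  # 0.85000000
print(f"False Positive Rate: {false_positive_rate:.8f}")
```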


Getting Started

  • Clone the repository using Git Bash / download a .zip file / fork the repository.
  • Go to your AWS Management Console, click on your account profile on the Top-Right corner and select My Security Credentials from the dropdown.
  • Create Access Key: In the Access keys section, click on Create New Access Key, a dialog will appear with your Access Key ID and Secret Access Key.
  • Download or Copy Keys: (IMPORTANT) Download the .csv file or copy the keys to a secure location. This is the only time you can view the secret access key.
  • Open the cloned Repo. in your VS Code
  • Create a file under ClassiSage as terraform.tfvars with its content as
  # terraform.tfvars
  access_key = ""
  secret_key = ""
  aws_account_id = ""
  • Download and install all the dependencies for using Terraform and Python.
  • In the terminal type/paste terraform init to initialize the backend.

  • Then type/paste terraform plan to view the plan, or simply terraform validate to ensure that there is no error.

  • Finally in the terminal type/paste terraform apply --auto-approve

  • This will show two outputs, one as bucket_name and the other as pretrained_ml_instance_name. (The third resource is the variable name given to the bucket, since S3 buckets are global resources.)


  • After the command completes in the terminal, navigate to ClassiSage/ml_ops/function.py; on line 11 of the file, with the code
  output = subprocess.check_output('terraform output -json', shell=True, cwd = r'' #C:\Users\Saahen\Desktop\ClassiSage

and change it to the path where the project directory is present, then save the file.
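function.py then reads that command's output; `terraform output -json` prints a JSON map from each output name to an object whose `value` field holds the actual value, so the parsing step can be sketched like this (the sample JSON is hypothetical):

```python
import json

# Sample of what `terraform output -json` prints (values hypothetical)
raw = """{
  "bucket_name": {"sensitive": false, "type": "string", "value": "data-bucket-axhq3rp8"},
  "pretrained_ml_instance_name": {"sensitive": false, "type": "string", "value": "pretrained-ml-instance"}
}"""

outputs = {name: spec["value"] for name, spec in json.loads(raw).items()}
print(outputs["bucket_name"])  # prints: data-bucket-axhq3rp8
```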

  • Then, in ClassiSage\ml_ops\data_upload.ipynb, run all code cells up to cell number 25, which contains the code
  # Try to upload the local CSV file to the S3 bucket
  try:
    print(f"try block executing")
    s3.upload_file(
        Filename=local_file_path, 
        Bucket=bucket_name,       
        Key=file_key               # S3 file key (filename in the bucket)
    )
    print(f"Successfully uploaded {file_key} to {bucket_name}")

    # Delete the local file after uploading to S3
    os.remove(local_file_path)
    print(f"Local file {local_file_path} deleted after upload.")

  except Exception as e:
    print(f"Failed to upload file: {e}")
    # Keep the local file so the upload can be retried

to upload the dataset to the S3 bucket.

  • Output of the code cell execution


  • After executing the notebook, re-open your AWS Management Console.
  • Search for the S3 and SageMaker services; you will see an instance of each service initiated (an S3 bucket and a SageMaker notebook).

S3 bucket named 'data-bucket-' with 2 objects uploaded: the dataset and the pretrained_sm.ipynb file containing the model code.



  • Go to the notebook instance in AWS SageMaker, click on the created instance, and click Open Jupyter.
  • After that, click New on the top right side of the window and select Terminal.
  • This will open a new terminal.

  • On the terminal paste the following (replacing <bucket_name> with the bucket_name output shown in VS Code's terminal output):
  aws s3 cp s3://<bucket_name>/pretrained_sm.ipynb /home/ec2-user/SageMaker/

Terminal command to copy pretrained_sm.ipynb from S3 to the notebook's Jupyter environment



  • Go back to the opened Jupyter instance and click on the pretrained_sm.ipynb file to open it, assigning it the conda_python3 kernel.
  • Scroll down to the 4th cell and replace the variable bucket_name's value with the bucket_name output from VS Code's terminal:
  # S3 bucket, region, session
bucket_name = 'data-bucket-axhq3rp8'
my_region = boto3.session.Session().region_name
sess = boto3.session.Session()
print("Region is " + my_region + " and bucket is " + bucket_name)

Output of the code cell execution



  • At the top of the file, do a restart via the Kernel tab.
  • Execute the notebook up to code cell number 27, which contains the code
# Print the metrics
print(f"Accuracy: {accuracy:.8f}")
print(f"Precision: {precision:.8f}")
print(f"Recall: {recall:.8f}")
print(f"F1 Score: {f1:.8f}")
print(f"False Positive Rate: {false_positive_rate:.8f}")
  • You will get the intended result: the data is fetched, adjusted into labels and features, and split into train and test sets with a defined output path; a model is then trained with SageMaker's Python SDK, deployed as an endpoint, and validated to produce the different metrics.
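One step the walkthrough relies on: SageMaker's built-in XGBoost expects training CSVs with the label in the first column and no header row. The split-and-reorder step can be sketched like this (data and names hypothetical):

```python
import random

# Hypothetical parsed rows: (feature list, binary label)
rows = [([i * 0.1, i * 0.2], i % 2) for i in range(10)]

random.seed(0)
random.shuffle(rows)
split = int(0.7 * len(rows))
train, test = rows[:split], rows[split:]

def to_sagemaker_csv(rows):
    # Label first, then features; no header row
    return "\n".join(
        ",".join(str(v) for v in [label] + feats) for feats, label in rows
    )

train_csv = to_sagemaker_csv(train)
print(len(train), len(test))  # 7 3
```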

Console Observation Notes

Execution of 8th cell

# Set an output path where the trained model will be saved
prefix = 'pretrained-algo'
output_path ='s3://{}/{}/output'.format(bucket_name, prefix)
print(output_path)
  • An output path will be set up in S3 to store the model data.


Execution of 23rd cell

estimator.fit({'train': s3_input_train,'validation': s3_input_test})
  • A training job will start; you can check it under the Training tab.


  • After some time (approx. 3 minutes) it will be completed, and the console will show it as such.


Execution of 24th code cell

xgb_predictor = estimator.deploy(initial_instance_count=1,instance_type='ml.m5.large')
  • An endpoint will be deployed under the Inference tab.


Additional Console Observation:

  • Creation of an Endpoint Configuration under Inference tab.


  • Creation of a model, also under the Inference tab.



Ending and Cleanup

  • In VS Code, come back to data_upload.ipynb and execute the last 2 code cells to download the S3 bucket's data to the local system.
  • The downloaded folder will be named downloaded_bucket_content.


  • You will get a log of downloaded files in the output cell. It will contain a raw pretrained_sm.ipynb, final_dataset.csv and a model output folder named 'pretrained-algo' with the execution data of the sagemaker code file.
  • Finally, open pretrained_sm.ipynb inside the SageMaker instance and execute the final 2 code cells. The endpoint and the resources within the S3 bucket will be deleted, ensuring no additional charges.
  • Deleting The EndPoint
  sagemaker.Session().delete_endpoint(xgb_predictor.endpoint)


  • Clearing S3: (Needed to destroy the instance)
  bucket_to_delete = boto3.resource('s3').Bucket(bucket_name)
  bucket_to_delete.objects.all().delete()
  • Come back to the VS Code terminal in the project folder and then type/paste terraform destroy --auto-approve
  • All the created resource instances will be deleted.

Auto Created Objects

ClassiSage/downloaded_bucket_content
ClassiSage/.terraform
ClassiSage/ml_ops/__pycache__
ClassiSage/.terraform.lock.hcl
ClassiSage/terraform.tfstate
ClassiSage/terraform.tfstate.backup

NOTE:
If you liked the idea and implementation of this machine-learning project, which uses AWS S3 and SageMaker for HDFS log classification and Terraform for IaC (infrastructure setup automation), kindly consider liking this post and starring the project repository on GitHub.

Reposted from: https://dev.to/saahen_sriyan_mishra/classisage-terraform-iac-automated-aws-sagemaker-based-hdfs-log-classification-model-4pk4