Implementing a Lambda with GitLab CI/CD and Terraform for SFTP, S3, and Databricks Integration in Go


Published on 2024-11-09

Reducing Costs with Process Automation in Databricks

At one client, I needed to reduce the cost of processes running on Databricks. One of the tasks Databricks handled was collecting files from several SFTP servers, decompressing them, and placing them in the Data Lake.

Automating data workflows is a crucial component in modern data engineering. In this article, we will explore how to create an AWS Lambda function using GitLab CI/CD and Terraform that allows a Go application to connect to an SFTP server, collect files, store them in Amazon S3, and finally trigger a job on Databricks. This end-to-end process is essential for systems that rely on efficient data integration and automation.
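The flow described above can be pictured as an ordered list of stages in which a failure stops everything after it, so the Databricks job is never triggered for a file that did not reach S3. The following is only a sketch of that control flow: the `stage` type, `runPipeline`, and `errSimulated` are hypothetical names, and the stage bodies are stubs rather than real SFTP, S3, or Databricks calls.

```go
package main

import (
	"errors"
	"fmt"
)

// stage is a hypothetical type: one named step of the SFTP -> S3 -> Databricks flow.
type stage struct {
	name string
	run  func() error
}

// errSimulated stands in for a real failure in this illustration.
var errSimulated = errors.New("simulated upload failure")

// runPipeline executes the stages in order and stops at the first error.
// It returns the names of the stages that completed.
func runPipeline(stages []stage) ([]string, error) {
	var done []string
	for _, s := range stages {
		if err := s.run(); err != nil {
			return done, fmt.Errorf("stage %q failed: %w", s.name, err)
		}
		done = append(done, s.name)
	}
	return done, nil
}

func main() {
	ok := func() error { return nil }
	stages := []stage{
		{"download from SFTP", ok},
		{"upload to S3", func() error { return errSimulated }},
		{"trigger Databricks job", ok},
	}
	done, err := runPipeline(stages)
	fmt.Println("completed:", done)
	fmt.Println("error:", err)
}
```

Wrapping the error with `%w` keeps the original cause available to `errors.Is`, which can help when deciding whether a failed run is worth retrying.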

What You Will Need for This Article

GitLab account with a repository for the project.
AWS account with permissions to create Lambda, S3, and IAM resources.
Databricks account with permissions to create and run jobs.
Basic knowledge of Go, Terraform and GitLab CI/CD.

Step 1: Preparing the Go Application

Start by creating a Go application that will connect to the SFTP server to collect files. Use packages like github.com/pkg/sftp to establish the SFTP connection and github.com/aws/aws-sdk-go to interact with the AWS S3 service.

package main

import (
 "fmt"
 "log"
 "os"
 "path/filepath"

 "github.com/pkg/sftp"
 "golang.org/x/crypto/ssh"
 "github.com/aws/aws-sdk-go/aws"
 "github.com/aws/aws-sdk-go/aws/session"
 "github.com/aws/aws-sdk-go/service/s3/s3manager"
)

func main() {
 // SFTP client configuration
 user := "seu_usuario_sftp"
 pass := "sua_senha_sftp"
 host := "endereco_sftp:22"
 config := &ssh.ClientConfig{
  User: user,
  Auth: []ssh.AuthMethod{
   ssh.Password(pass),
  },
  // Accepts any host key; in production, verify it (for example with ssh.FixedHostKey)
  HostKeyCallback: ssh.InsecureIgnoreHostKey(),
 }

 // Connect to the SFTP server
 conn, err := ssh.Dial("tcp", host, config)
 if err != nil {
  log.Fatal(err)
 }
 defer conn.Close()
 client, err := sftp.NewClient(conn)
 if err != nil {
  log.Fatal(err)
 }
 defer client.Close()

 // Download a file from the SFTP server
 remoteFilePath := "/path/to/remote/file"
 localDir := "/path/to/local/dir" // inside Lambda, only /tmp is writable
 localFilePath := filepath.Join(localDir, filepath.Base(remoteFilePath))
 dstFile, err := os.Create(localFilePath)
 if err != nil {
  log.Fatal(err)
 }
 defer dstFile.Close()

 srcFile, err := client.Open(remoteFilePath)
 if err != nil {
  log.Fatal(err)
 }
 defer srcFile.Close()

 if _, err := srcFile.WriteTo(dstFile); err != nil {
  log.Fatal(err)
 }

 fmt.Println("File downloaded successfully:", localFilePath)

 // S3 client configuration; the region must match the bucket's region (us-east-1 in main.tf)
 sess := session.Must(session.NewSession(&aws.Config{
  Region: aws.String("us-east-1"),
 }))
 uploader := s3manager.NewUploader(sess)

 // Upload the file to S3
 file, err := os.Open(localFilePath)
 if err != nil {
  log.Fatal(err)
 }
 defer file.Close()

 _, err = uploader.Upload(&s3manager.UploadInput{
  Bucket: aws.String("seu-bucket-s3"),
  Key:    aws.String(filepath.Base(localFilePath)),
  Body:   file,
 })
 if err != nil {
  log.Fatal("Failed to upload file to S3: ", err)
 }

 fmt.Println("File uploaded successfully to S3")
}
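The example above hard-codes the SFTP credentials and bucket name, while the Terraform configuration in Step 2 passes `SFTP_HOST`, `SFTP_USER`, `SFTP_PASSWORD`, and `S3_BUCKET` to the function as environment variables. A minimal sketch of reading them could look like the following. The `sftpConfig` type and `loadConfig` function are hypothetical names, and appending port 22 when none is given is an assumption based on the `endereco_sftp:22` address used above.

```go
package main

import (
	"fmt"
	"net"
	"os"
	"strings"
)

// sftpConfig holds the settings the Lambda needs. The type is illustrative.
type sftpConfig struct {
	Host     string
	User     string
	Password string
	Bucket   string
}

// loadConfig reads the four variables through getenv (normally os.Getenv)
// and reports every missing one in a single error.
func loadConfig(getenv func(string) string) (sftpConfig, error) {
	cfg := sftpConfig{
		Host:     getenv("SFTP_HOST"),
		User:     getenv("SFTP_USER"),
		Password: getenv("SFTP_PASSWORD"),
		Bucket:   getenv("S3_BUCKET"),
	}
	required := []struct{ name, value string }{
		{"SFTP_HOST", cfg.Host},
		{"SFTP_USER", cfg.User},
		{"SFTP_PASSWORD", cfg.Password},
		{"S3_BUCKET", cfg.Bucket},
	}
	var missing []string
	for _, r := range required {
		if r.value == "" {
			missing = append(missing, r.name)
		}
	}
	if len(missing) > 0 {
		return sftpConfig{}, fmt.Errorf("missing environment variables: %s", strings.Join(missing, ", "))
	}
	// Assumption: if the host has no port, use the default SSH port 22.
	if _, _, err := net.SplitHostPort(cfg.Host); err != nil {
		cfg.Host = net.JoinHostPort(cfg.Host, "22")
	}
	return cfg, nil
}

func main() {
	cfg, err := loadConfig(os.Getenv)
	if err != nil {
		fmt.Println("configuration error:", err)
		return
	}
	fmt.Println("connecting to", cfg.Host, "as", cfg.User)
}
```

Passing `getenv` as a parameter keeps the function testable without touching the real process environment.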

Step 2: Configuring Terraform

Terraform will be used to provision the Lambda function and required resources on AWS. Create a main.tf file with the configuration required to create the Lambda function, IAM policies, and S3 buckets.

provider "aws" {
  region = "us-east-1"
}

resource "aws_iam_role" "lambda_execution_role" {
  name = "lambda_execution_role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17",
    Statement = [
      {
        Action = "sts:AssumeRole",
        Effect = "Allow",
        Principal = {
          Service = "lambda.amazonaws.com"
        },
      },
    ]
  })
}

resource "aws_iam_policy" "lambda_policy" {
  name        = "lambda_policy"
  description = "A policy that allows a lambda function to access S3 and SFTP resources"

  policy = jsonencode({
    Version = "2012-10-17",
    Statement = [
      {
        Action = [
          "s3:ListBucket",
          "s3:GetObject",
          "s3:PutObject",
        ],
        Effect = "Allow",
        Resource = [
          "arn:aws:s3:::seu-bucket-s3",
          "arn:aws:s3:::seu-bucket-s3/*",
        ],
      },
    ]
  })
}

resource "aws_iam_role_policy_attachment" "lambda_policy_attachment" {
  role       = aws_iam_role.lambda_execution_role.name
  policy_arn = aws_iam_policy.lambda_policy.arn
}

resource "aws_lambda_function" "sftp_lambda" {
  function_name = "sftp_lambda_function"

  s3_bucket = "seu-bucket-s3-com-codigo-lambda"
  s3_key    = "sftp-lambda.zip"

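  # The binary inside the zip must be named "main" to match this handler.
  # AWS has deprecated the go1.x runtime; newer deployments use provided.al2023
  # with a binary named "bootstrap".
  handler = "main"
  runtime = "go1.x"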

  role = aws_iam_role.lambda_execution_role.arn

  environment {
    variables = {
      SFTP_HOST     = "endereco_sftp",
      SFTP_USER     = "seu_usuario_sftp",
      SFTP_PASSWORD = "sua_senha_sftp",
      S3_BUCKET     = "seu-bucket-s3",
    }
  }
}

resource "aws_s3_bucket" "s3_bucket" {
  bucket = "seu-bucket-s3"
  acl    = "private"
}

Step 3: Configuring GitLab CI/CD

In GitLab, define the CI/CD pipeline in the .gitlab-ci.yml file. This pipeline includes stages to test the Go application, build the deployment package, and run Terraform to provision the infrastructure.

stages:
  - test
  - build
  - deploy

variables:
  S3_BUCKET: "seu-bucket-s3"
  AWS_DEFAULT_REGION: "us-east-1"
  TF_VERSION: "1.0.0"

before_script:
  - 'which ssh-agent || ( apt-get update -y && apt-get install openssh-client -y )'
  - eval $(ssh-agent -s)
  - echo "$PRIVATE_KEY" | tr -d '\r' | ssh-add -
  - mkdir -p ~/.ssh
  - chmod 700 ~/.ssh
  - ssh-keyscan -H 'endereco_sftp' >> ~/.ssh/known_hosts

test:
  stage: test
  image: golang:1.18
  script:
    - go test -v ./...

build:
  stage: build
  image: golang:1.18
  script:
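    # The golang image may not ship zip; install it before packaging
    - apt-get update -y && apt-get install -y zip
    # Build for the Lambda environment; the binary name must match the handler ("main")
    - GOOS=linux GOARCH=amd64 CGO_ENABLED=0 go build -o main
    - zip sftp-lambda.zip main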
  artifacts:
    paths:
      - sftp-lambda.zip
  only:
    - master

deploy:
  stage: deploy
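  image:
    name: hashicorp/terraform:$TF_VERSION
    # The image's default entrypoint is the terraform binary; clear it so GitLab can run the script
    entrypoint: [""]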
  script:
    - terraform init
    - terraform apply -auto-approve
  only:
    - master
  environment:
    name: production

Step 4: Integrating with Databricks

After uploading the files to S3, the Lambda function must trigger a job in Databricks. This can be done using the Databricks API to launch existing jobs.

package main

import (
 "bytes"
 "encoding/json"
 "fmt"
 "log"
 "net/http"
)

// Request body for starting a job on Databricks
type DatabricksJobRequest struct {
 JobID int `json:"job_id"`
}

// triggerDatabricksJob starts an existing Databricks job through the run-now endpoint
func triggerDatabricksJob(databricksInstance string, token string, jobID int) error {
 url := fmt.Sprintf("https://%s/api/2.0/jobs/run-now", databricksInstance)
 requestBody, err := json.Marshal(DatabricksJobRequest{JobID: jobID})
 if err != nil {
  return err
 }
 req, err := http.NewRequest("POST", url, bytes.NewBuffer(requestBody))
 if err != nil {
  return err
 }

 req.Header.Set("Content-Type", "application/json")
 req.Header.Set("Authorization", fmt.Sprintf("Bearer %s", token))

 client := &http.Client{}
 resp, err := client.Do(req)
 if err != nil {
  return err
 }
 defer resp.Body.Close()

 if resp.StatusCode != http.StatusOK {
  return fmt.Errorf("Failed to trigger Databricks job, status code: %d", resp.StatusCode)
 }

 return nil
}

func main() {
 // ... (existing code to connect to SFTP and upload to S3)

 // Replace with your real values
 databricksInstance := "your-databricks-instance"
 databricksToken := "your-databricks-token"
 databricksJobID := 123 // ID of the job you want to trigger

 // Trigger the Databricks job after the upload to S3
 err := triggerDatabricksJob(databricksInstance, databricksToken, databricksJobID)
 if err != nil {
  log.Fatal("Error triggering the Databricks job: ", err)
 }

 fmt.Println("Databricks job triggered successfully")
}
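A successful `run-now` call returns a JSON body that includes the `run_id` of the run it started, which the function above discards. Decoding it lets the Lambda log which run it launched. The following is a minimal sketch: `runNowResponse` and `parseRunID` are hypothetical names, the sample body in `main` is illustrative, and only the `run_id` field is taken from the Jobs API.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// runNowResponse models only the field this sketch needs from the response.
type runNowResponse struct {
	RunID int64 `json:"run_id"`
}

// parseRunID extracts run_id from a run-now response body and
// rejects bodies that are not valid JSON or carry no run_id.
func parseRunID(body []byte) (int64, error) {
	var r runNowResponse
	if err := json.Unmarshal(body, &r); err != nil {
		return 0, fmt.Errorf("decoding run-now response: %w", err)
	}
	if r.RunID == 0 {
		return 0, errors.New("run-now response did not contain a run_id")
	}
	return r.RunID, nil
}

func main() {
	// Illustrative body; a real one would be read from resp.Body.
	body := []byte(`{"run_id": 455644833, "number_in_job": 1}`)
	id, err := parseRunID(body)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("started Databricks run", id)
}
```

In `triggerDatabricksJob`, the body could be read with `io.ReadAll(resp.Body)` after the status check and passed to `parseRunID`.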

Step 5: Running the Pipeline

Push the code to the GitLab repository for the pipeline to run. Verify that all steps are completed successfully and that the Lambda function is operational and interacting correctly with S3 and Databricks.

Once you have the complete code and the .gitlab-ci.yml file configured, you can run the pipeline by following these steps:

Push your code to the GitLab repository:

  git add .
  git commit -m "Add Lambda function for SFTP, S3, and Databricks integration"
  git push origin master


GitLab CI/CD will detect the new commit and start the pipeline automatically.
Track the execution of the pipeline in GitLab by accessing the CI/CD section of your repository.
If all stages are successful, your Lambda function will be deployed and ready to use.

Remember that you will need to configure environment variables in GitLab CI/CD to store sensitive information such as access tokens and private keys. This can be done in the ‘Settings’ > ‘CI/CD’ > ‘Variables’ section of your GitLab project.

Also, ensure that the Databricks token has the necessary permissions to trigger jobs and that the job exists with the provided ID.

Conclusion

Automation of data engineering tasks can be significantly simplified using tools such as GitLab CI/CD, Terraform, and AWS Lambda. By following the steps outlined in this article, you can create a robust system that automates data collection and integration between SFTP, S3, and Databricks, all with the efficiency and simplicity of Go. With this approach, you will be well equipped to address the challenges of data integration at scale.

My contacts:

LinkedIn - Airton Lira Junior

iMasters - Airton Lira Junior


Release Statement: This article is reproduced from https://dev.to/airton_lirajunior_2ddebd/implementando-uma-lambda-com-gitlab-cicd-e-terraform-para-integracao-sftp-s3-e-databricks-em-go-5hc0?1 If there is any infringement, please contact [email protected] to delete it.
