Implementing a Lambda with GitLab CI/CD and Terraform for SFTP, S3, and Databricks Integration in Go

Published on 2024-11-09

Reducing Costs with Process Automation in Databricks

At one client, I needed to reduce the cost of processes running on Databricks. One of the tasks Databricks handled was collecting files from several SFTP servers, decompressing them, and placing them in the Data Lake.

Automating data workflows is a crucial component in modern data engineering. In this article, we will explore how to create an AWS Lambda function using GitLab CI/CD and Terraform that allows a Go application to connect to an SFTP server, collect files, store them in Amazon S3, and finally trigger a job on Databricks. This end-to-end process is essential for systems that rely on efficient data integration and automation.

What You Will Need for This Article

GitLab account with a repository for the project.
AWS account with permissions to create Lambda, S3, and IAM resources.
Databricks account with permissions to create and run jobs.
Basic knowledge of Go, Terraform and GitLab CI/CD.

Step 1: Preparing the Go Application

Start by creating a Go application that will connect to the SFTP server to collect files. Use packages like github.com/pkg/sftp to establish the SFTP connection and github.com/aws/aws-sdk-go to interact with the AWS S3 service.

package main

import (
 "fmt"
 "log"
 "os"
 "path/filepath"

 "github.com/pkg/sftp"
 "golang.org/x/crypto/ssh"
 "github.com/aws/aws-sdk-go/aws"
 "github.com/aws/aws-sdk-go/aws/session"
 "github.com/aws/aws-sdk-go/service/s3/s3manager"
)

func main() {
 // SFTP client configuration (placeholder values; replace with your own)
 user := "seu_usuario_sftp"
 pass := "sua_senha_sftp"
 host := "endereco_sftp:22"
 config := &ssh.ClientConfig{
  User: user,
  Auth: []ssh.AuthMethod{
   ssh.Password(pass),
  },
  // Skips host key verification: fine for a demo, unsafe in production.
  HostKeyCallback: ssh.InsecureIgnoreHostKey(),
 }

 // Connect to the SFTP server
 conn, err := ssh.Dial("tcp", host, config)
 if err != nil {
  log.Fatal(err)
 }
 defer conn.Close()

 client, err := sftp.NewClient(conn)
 if err != nil {
  log.Fatal(err)
 }
 defer client.Close()

 // Download a file from the SFTP server.
 // On AWS Lambda, /tmp is the only writable directory.
 remoteFilePath := "/path/to/remote/file"
 localDir := "/tmp"
 localFilePath := filepath.Join(localDir, filepath.Base(remoteFilePath))
 dstFile, err := os.Create(localFilePath)
 if err != nil {
  log.Fatal(err)
 }
 defer dstFile.Close()

 srcFile, err := client.Open(remoteFilePath)
 if err != nil {
  log.Fatal(err)
 }
 defer srcFile.Close()

 if _, err := srcFile.WriteTo(dstFile); err != nil {
  log.Fatal(err)
 }

 fmt.Println("File downloaded successfully:", localFilePath)

 // S3 client configuration. The region must match the bucket's region
 // (us-east-1 in the Terraform configuration in Step 2).
 sess := session.Must(session.NewSession(&aws.Config{
  Region: aws.String("us-east-1"),
 }))
 uploader := s3manager.NewUploader(sess)

 // Upload the file to S3
 file, err := os.Open(localFilePath)
 if err != nil {
  log.Fatal(err)
 }
 defer file.Close()

 _, err = uploader.Upload(&s3manager.UploadInput{
  Bucket: aws.String("seu-bucket-s3"),
  Key:    aws.String(filepath.Base(localFilePath)),
  Body:   file,
 })
 if err != nil {
  log.Fatalf("failed to upload file to S3: %v", err)
 }

 fmt.Println("File uploaded to S3 successfully")
}

Step 2: Configuring Terraform

Terraform will be used to provision the Lambda function and required resources on AWS. Create a main.tf file with the configuration required to create the Lambda function, IAM policies, and S3 buckets.

provider "aws" {
  region = "us-east-1"
}

resource "aws_iam_role" "lambda_execution_role" {
  name = "lambda_execution_role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17",
    Statement = [
      {
        Action = "sts:AssumeRole",
        Effect = "Allow",
        Principal = {
          Service = "lambda.amazonaws.com"
        },
      },
    ]
  })
}

resource "aws_iam_policy" "lambda_policy" {
  name        = "lambda_policy"
  description = "A policy that allows a lambda function to access S3 and SFTP resources"

  policy = jsonencode({
    Version = "2012-10-17",
    Statement = [
      {
        Action = [
          "s3:ListBucket",
          "s3:GetObject",
          "s3:PutObject",
        ],
        Effect = "Allow",
        Resource = [
          "arn:aws:s3:::seu-bucket-s3",
          "arn:aws:s3:::seu-bucket-s3/*",
        ],
      },
    ]
  })
}

resource "aws_iam_role_policy_attachment" "lambda_policy_attachment" {
  role       = aws_iam_role.lambda_execution_role.name
  policy_arn = aws_iam_policy.lambda_policy.arn
}

resource "aws_lambda_function" "sftp_lambda" {
  function_name = "sftp_lambda_function"

  s3_bucket = "seu-bucket-s3-com-codigo-lambda"
  s3_key    = "sftp-lambda.zip"

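  # NOTE: AWS has deprecated the go1.x runtime. For new functions, use
  # runtime = "provided.al2023" and name the compiled binary "bootstrap".
  handler = "main"
  runtime = "go1.x"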

  role = aws_iam_role.lambda_execution_role.arn

  environment {
    variables = {
      SFTP_HOST     = "endereco_sftp",
      SFTP_USER     = "seu_usuario_sftp",
      SFTP_PASSWORD = "sua_senha_sftp",
      S3_BUCKET     = "seu-bucket-s3",
    }
  }
}

resource "aws_s3_bucket" "s3_bucket" {
  bucket = "seu-bucket-s3"
  acl    = "private"
}

Step 3: Configuring GitLab CI/CD

In GitLab, define the CI/CD pipeline in the .gitlab-ci.yml file. This pipeline includes stages to test the Go application, build the deployment package, and run Terraform to provision the infrastructure.

stages:
  - test
  - build
  - deploy

variables:
  S3_BUCKET: "seu-bucket-s3"
  AWS_DEFAULT_REGION: "us-east-1"
  TF_VERSION: "1.0.0"

before_script:
  - 'which ssh-agent || ( apt-get update -y && apt-get install openssh-client -y )'
  - eval $(ssh-agent -s)
  - echo "$PRIVATE_KEY" | tr -d '\r' | ssh-add -
  - mkdir -p ~/.ssh
  - chmod 700 ~/.ssh
  - ssh-keyscan -H 'endereco_sftp' >> ~/.ssh/known_hosts

test:
  stage: test
  image: golang:1.18
  script:
    - go test -v ./...

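build:
  stage: build
  image: golang:1.18
  script:
    # Cross-compile a static Linux binary; its name must match the Lambda handler ("main")
    - GOOS=linux GOARCH=amd64 CGO_ENABLED=0 go build -o main
    - apt-get update -y && apt-get install -y zip
    - zip sftp-lambda.zip main
    # Terraform reads the package from S3 (s3_bucket/s3_key), so upload sftp-lambda.zip there before the deploy stage
  artifacts:
    paths:
      - sftp-lambda.zip
  only:
    - master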

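deploy:
  stage: deploy
  image:
    name: hashicorp/terraform:$TF_VERSION
    # The image's default entrypoint is the terraform binary; reset it so GitLab can run the script
    entrypoint: [""]
  script:
    - terraform init
    - terraform apply -auto-approve
  only:
    - master
  environment:
    name: production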

Step 4: Integrating with Databricks

After uploading the files to S3, the Lambda function must trigger a job in Databricks. This can be done using the Databricks API to launch existing jobs.

package main

import (
 "bytes"
 "encoding/json"
 "fmt"
 "log"
 "net/http"
)

// DatabricksJobRequest is the request body for starting a Databricks job.
type DatabricksJobRequest struct {
 JobID int `json:"job_id"`
}

// triggerDatabricksJob starts an existing Databricks job through the run-now endpoint.
func triggerDatabricksJob(databricksInstance string, token string, jobID int) error {
 url := fmt.Sprintf("https://%s/api/2.0/jobs/run-now", databricksInstance)
 requestBody, err := json.Marshal(DatabricksJobRequest{JobID: jobID})
 if err != nil {
  return err
 }
 req, err := http.NewRequest("POST", url, bytes.NewBuffer(requestBody))
 if err != nil {
  return err
 }

 req.Header.Set("Content-Type", "application/json")
 req.Header.Set("Authorization", fmt.Sprintf("Bearer %s", token))

 client := &http.Client{}
 resp, err := client.Do(req)
 if err != nil {
  return err
 }
 defer resp.Body.Close()

 if resp.StatusCode != http.StatusOK {
  return fmt.Errorf("failed to trigger Databricks job, status code: %d", resp.StatusCode)
 }

 return nil
}

func main() {
 // ... (existing code that connects to SFTP and uploads to S3)

 // Replace with your actual values
 databricksInstance := "your-databricks-instance"
 databricksToken := "your-databricks-token"
 databricksJobID := 123 // ID of the job you want to trigger

 // Trigger the Databricks job after the upload to S3
 err := triggerDatabricksJob(databricksInstance, databricksToken, databricksJobID)
 if err != nil {
  log.Fatalf("error triggering the Databricks job: %v", err)
 }

 fmt.Println("Databricks job triggered successfully")
}
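The function above only checks the HTTP status code. The run-now response also carries a `run_id` field identifying the run that was started, which is useful to log for later tracing. The sketch below shows one way to extract it; the `parseRunID` helper and the sample body are illustrative assumptions, and in the real function the bytes would come from `resp.Body`.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// runNowResponse models only the part of the run-now response used here.
type runNowResponse struct {
	RunID int64 `json:"run_id"`
}

// parseRunID extracts run_id from a run-now response body.
// It returns an error if the body is not valid JSON or has no run_id.
func parseRunID(body []byte) (int64, error) {
	var r runNowResponse
	if err := json.Unmarshal(body, &r); err != nil {
		return 0, fmt.Errorf("decode run-now response: %w", err)
	}
	if r.RunID == 0 {
		return 0, errors.New("run-now response has no run_id")
	}
	return r.RunID, nil
}

func main() {
	// Sample body for illustration only; a real one is read from resp.Body.
	sample := []byte(`{"run_id": 455644833, "number_in_job": 1}`)
	runID, err := parseRunID(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("started run", runID)
}
```

Logging the run ID from the Lambda makes it easier to match an SFTP collection with the corresponding Databricks run when something fails downstream.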

Step 5: Running the Pipeline

Once the complete code and the .gitlab-ci.yml file are in place, run the pipeline with the steps below. Then verify that every stage completes successfully and that the Lambda function is operational and interacting correctly with S3 and Databricks.

Push your code to the GitLab repository:

  git add .
  git commit -m "Add Lambda function for SFTP, S3, and Databricks integration"
  git push origin master

GitLab CI/CD will detect the new commit and start the pipeline automatically.
Track the execution of the pipeline in GitLab by accessing the CI/CD section of your repository.
If all stages are successful, your Lambda function will be deployed and ready to use.

Remember that you will need to configure environment variables in GitLab CI/CD to store sensitive information such as access tokens and private keys. This can be done in the ‘Settings’ > ‘CI/CD’ > ‘Variables’ section of your GitLab project.

Also, ensure that the Databricks token has the necessary permissions to trigger jobs and that the job exists with the provided ID.

Conclusion

Automation of data engineering tasks can be significantly simplified using tools such as GitLab CI/CD, Terraform, and AWS Lambda. By following the steps outlined in this article, you can create a robust system that automates data collection and integration between SFTP, S3, and Databricks, all with the efficiency and simplicity of Go. With this approach, you will be well equipped to address the challenges of data integration at scale.

My contacts:

LinkedIn - Airton Lira Junior

iMasters - Airton Lira Junior

#aws #lambda #terraform #gitlab #ci_cd #go #databricks #dataengineering #automation

Release Statement: This article is reproduced from https://dev.to/airton_lirajunior_2ddebd/implementando-uma-lambda-com-gitlab-cicd-e-terraform-para-integracao-sftp-s3-e-databricks-em-go-5hc0?1
