
Published on 2024-09-27

Exploring Job Market for Software Engineers

Introduction

In this article, we dive into the process of extracting and analyzing job data from LinkedIn, leveraging a combination of Python, Nu shell, and ChatGPT to streamline and enhance our workflow.

I’ll walk you through the steps I took to carry out my research, showing how you can use these techniques to explore job markets in different countries or even in other fields. By combining these tools and methods, you can gather and analyze data to gain valuable insights into any job market you're interested in.

Technologies overview

Python

Python was chosen for its versatile libraries, particularly linkedin_jobs_scraper and openai. These packages streamlined the scraping and processing of job data.

Nu Shell

I experimented with Nu shell to compare its functionality against a traditional bash stack and to explore its potential benefits for handling and manipulating data.

ChatGPT

ChatGPT was employed to assist in the extraction of specific job features from the collected data, such as years of experience, degree requirements, tech stack, position levels, and core responsibilities.

Data extraction

To start, some data is required. LinkedIn was the first website that came to mind, and there was a ready-to-use Python package for it. I copied the example code, modified it a little, and ended up with a script that produces a JSON file containing a list of job descriptions. Here is its source:

import json
import logging
import os
from threading import Lock

from dotenv import load_dotenv

# linkedin_jobs_scraper loads env statically
# So dotenv should be loaded before imports
load_dotenv()

from linkedin_jobs_scraper import LinkedinScraper
from linkedin_jobs_scraper.events import EventData, Events
from linkedin_jobs_scraper.filters import ExperienceLevelFilters, TypeFilters
from linkedin_jobs_scraper.query import Query, QueryFilters, QueryOptions

CHROMEDRIVER_PATH = os.environ["CHROMEDRIVER_PATH"]

RESULT_FILE_PATH = "result.json"
KEYWORDS = ("Python", "PHP", "Java", "Rust")
LOCATIONS = ("South Korea",)
TYPE_FILTERS = (TypeFilters.FULL_TIME,)
EXPERIENCE = (ExperienceLevelFilters.MID_SENIOR,)
LIMIT = 500

logging.basicConfig(level=logging.INFO)
log = logging.getLogger(__name__)


def main():
    result_lock = Lock()
    result = []

    def on_data(data: EventData):
        with result_lock:
            result.append(data._asdict())

        log.info(
            "[JOB] %s %s %d",
            data.title,
            data.company,
            len(data.description),
        )

    def on_error(error):
        log.error("[ERROR] %s", error)

    def on_end():
        log.info("Scraping finished")

        if not result:
            return

        with open(RESULT_FILE_PATH, "w") as f:
            json.dump(result, f)

    queries = [
        Query(
            query=keyword,
            options=QueryOptions(
                limit=LIMIT,
                locations=[*LOCATIONS],
                filters=QueryFilters(
                    type=[*TYPE_FILTERS],
                    experience=[*EXPERIENCE],
                ),
            ),
        )
        for keyword in KEYWORDS
    ]

    scraper = LinkedinScraper(
        chrome_executable_path=CHROMEDRIVER_PATH,
        headless=True,
        max_workers=len(queries),
        slow_mo=0.5,
        page_load_timeout=40,
    )

    scraper.on(Events.DATA, on_data)
    scraper.on(Events.ERROR, on_error)
    scraper.on(Events.END, on_end)

    scraper.run(queries)


if __name__ == "__main__":
    main()

To download the Chrome driver, I wrote the following bash script:

#!/usr/bin/env bash
stable_version=$(curl 'https://googlechromelabs.github.io/chrome-for-testing/LATEST_RELEASE_STABLE')
driver_url=$(curl 'https://googlechromelabs.github.io/chrome-for-testing/known-good-versions-with-downloads.json' \
    | jq -r ".versions[] | select(.version == \"${stable_version}\") | .downloads.chromedriver[0] | select(.platform == \"linux64\") | .url")
wget "$driver_url"
driver_zip_name=$(echo "$driver_url" | awk -F'/' '{print $NF}')
unzip "$driver_zip_name"
rm "$driver_zip_name"

And my .env file looks like this:

CHROMEDRIVER_PATH="chromedriver-linux64/chromedriver"
LI_AT_COOKIE=

linkedin_jobs_scraper serializes jobs to the following DTO:

class EventData(NamedTuple):
    query: str = ''
    location: str = ''
    job_id: str = ''
    job_index: int = -1  # Only for debug
    link: str = ''
    apply_link: str = ''
    title: str = ''
    company: str = ''
    company_link: str = ''
    company_img_link: str = ''
    place: str = ''
    description: str = ''
    description_html: str = ''
    date: str = ''
    insights: List[str] = []
    skills: List[str] = []

A sample record (the description fields were replaced with ... for readability):

| query | location | job_id | job_index | link | apply_link | title | company | company_link | company_img_link | place | description | description_html | date | insights | skills |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Python | South Korea | 3959499221 | 0 | https://www.linkedin.com/jobs/view/3959499221/?trk=flagship3_search_srp_jobs |  | Senior Python Software Engineer | Canonical |  | https://media.licdn.com/dms/image/v2/C560BAQEbIYAkAURcYw/company-logo_100_100/company-logo_100_100/0/1650566107463/canonical_logo?e=1734566400&v=beta&t=emb8cxAFwBnOGwJ8nTftd8ODTFDkC_5SQNz-Jcd8zRU | Seoul, Seoul, South Korea (Remote) | ... | ... |  | [Remote Full-time Mid-Senior level, Skills: Python (Programming Language), Computer Science, 8 more, See how you compare to 18 applicants. Try Premium for RSD0, , Am I a good fit for this job?, How can I best position myself for this job?, Tell me more about Canonical] | [Back-End Web Development, Computer Science, Engineering Documentation, Kubernetes, Linux, MLOps, OpenStack, Python (Programming Language), Technical Documentation, Web Services] |

It was generated with the following Nu shell command:

# Replaces the description fields of a job with an ellipsis
def hide-description [] {
    update description { |row| '...' } 
    | update description_html { |row| '...' } 
}

cat result.json 
| from  json 
| first 
| hide-description
| to md --pretty 

Last steps before analysis

We already have several ready-to-use features (title and skills), but I want more:

  • Years of experience
  • Degree
  • Tech stack
  • Position
  • Responsibilities

So let's add them with the help of ChatGPT!

import json
import logging
import os

from dotenv import load_dotenv
from linkedin_jobs_scraper.events import EventData
from openai import OpenAI
from tqdm import tqdm

load_dotenv()

client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
)

with open("result.json", "rb") as f:
    jobs = json.load(f)

parsed_descriptions = []

for job in tqdm(jobs):
    job = EventData(**job)
    chat_completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "user",
                "content": """
                    Process the given IT job description.
                    Output only raw JSON with the following fields:
                        - Experience (amount of years or null)
                        - Degree requirement (str if found else null)
                        - Tech stack (array of strings)
                        - Position (middle, senior, lead, manager, other (describe it))
                        - Core responsibilities (array of strings)

                    The output will be passed directly to
                    Python's `json.loads` function, so DO NOT APPLY MARKDOWN FORMATTING.
                    Example:
                    ```


                    {
                        "experience": 5, 
                        "degree": "bachelor", 
                        "stack": ["Python", "FastAPI", "Docker"], 
                        "position": "middle",
                        "responsibilities": ["Deliver features", "break production"]
                    }


                    ```

                    Here is a job description:
                """
                + "\n\n"
                + job.description_html,
            }
        ],
    )

    content = chat_completion.choices[0].message.content
    try:
        if not content:
            print("Empty result from ChatGPT")
            continue
        result = json.loads(content)
    except json.decoder.JSONDecodeError as e:
        logging.error(e, chat_completion)
        continue

    result["job_id"] = job.job_id
    parsed_descriptions.append(result)

with open("job_descriptions_analysis.json", "w") as f:
    json.dump(parsed_descriptions, f)

Do not forget to add OPENAI_API_KEY to the .env file.
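Even with the prompt's warning, models occasionally wrap their JSON in markdown fences, which is exactly what triggers the JSONDecodeError branch above. A small defensive parser can strip the fences before calling json.loads; this is a sketch, and parse_model_json is a hypothetical helper rather than part of the script above:

```python
import json


def parse_model_json(content: str):
    """Parse model output as JSON, stripping optional markdown code fences."""
    text = content.strip()
    if text.startswith("```"):
        lines = text.splitlines()
        # Drop the opening fence (which may carry a language tag like ```json)
        # and the closing fence if present
        if lines and lines[-1].strip() == "```":
            lines = lines[1:-1]
        else:
            lines = lines[1:]
        text = "\n".join(lines)
    return json.loads(text)
```

Calling it in place of the bare `json.loads(content)` makes the pipeline tolerant of both fenced and raw responses.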

Now we can merge the ChatGPT results with the data from LinkedIn by job_id:

cat job_descriptions_analysis.json 
| from json 
| merge (cat result.json | from json)
| to json
| save full.json
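Note that Nu's merge pairs rows positionally, so the command above relies on both files listing jobs in the same order. A merge keyed on job_id avoids that assumption; here is a minimal Python sketch (merge_by_job_id is a hypothetical helper, not part of the article's scripts):

```python
def merge_by_job_id(analysis, jobs):
    """Join ChatGPT analysis records with scraped LinkedIn jobs on job_id."""
    jobs_by_id = {job["job_id"]: job for job in jobs}
    return [
        # Analysis fields overwrite job fields on key collision
        {**jobs_by_id[record["job_id"]], **record}
        for record in analysis
        if record["job_id"] in jobs_by_id
    ]
```

Records whose job_id has no match in the scraped data are silently dropped, which mirrors an inner join.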

Our data is ready to analyze!

cat full.json | from json | columns
╭────┬──────────────────╮
│  0 │ experience       │
│  1 │ degree           │
│  2 │ stack            │
│  3 │ position         │
│  4 │ responsibilities │
│  5 │ job_id           │
│  6 │ query            │
│  7 │ location         │
│  8 │ job_index        │
│  9 │ link             │
│ 10 │ apply_link       │
│ 11 │ title            │
│ 12 │ company          │
│ 13 │ company_link     │
│ 14 │ company_img_link │
│ 15 │ place            │
│ 16 │ description      │
│ 17 │ description_html │
│ 18 │ date             │
│ 19 │ insights         │
│ 20 │ skills           │
╰────┴──────────────────╯

Analysis

To start, load the dataset into a variable:

let df = cat full.json | from json

Now we can see how frequently each technology appears:

$df
| get 'stack' 
| flatten 
| uniq --count 
| sort-by count --reverse 
| first 20 
| to md --pretty
| value | count |
| --- | --- |
| Python | 185 |
| Java | 70 |
| AWS | 65 |
| Kubernetes | 61 |
| SQL | 54 |
| C | 46 |
| Docker | 42 |
| Linux | 41 |
| React | 37 |
| Kotlin | 34 |
| JavaScript | 30 |
| C | 30 |
| Kafka | 28 |
| TypeScript | 26 |
| GCP | 25 |
| Azure | 24 |
| Tableau | 22 |
| Hadoop | 21 |
| Spark | 21 |
| R | 20 |
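The same frequency count is easy to reproduce outside Nu. A Python sketch using collections.Counter (top_stack is a hypothetical helper, assuming the merged records from full.json):

```python
from collections import Counter


def top_stack(jobs, n=20):
    """Return the n most common technologies across all jobs' `stack` lists."""
    counter = Counter(tech for job in jobs for tech in job.get("stack", []))
    return counter.most_common(n)
```

This flattens every job's stack list into one stream and counts occurrences, matching the `flatten | uniq --count | sort-by count --reverse` pipeline.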

With Python:

$df
| filter-by-intersection 'stack' ['python']
| get 'stack' 
| flatten 
| where $it != 'Python' # Exclude python itself
| uniq --count 
| sort-by count --reverse 
| first 10
| to md --pretty
| value | count |
| --- | --- |
| Java | 44 |
| AWS | 43 |
| SQL | 40 |
| Kubernetes | 36 |
| Docker | 27 |
| C | 26 |
| Linux | 24 |
| R | 20 |
| GCP | 20 |
| C | 18 |

Without Python:

$df
| filter-by-intersection 'stack' ['python'] --invert
| get 'stack' 
| flatten 
| uniq --count 
| sort-by count --reverse 
| first 10
| to md --pretty
| value | count |
| --- | --- |
| React | 31 |
| Java | 26 |
| Kubernetes | 25 |
| TypeScript | 23 |
| AWS | 22 |
| Kotlin | 21 |
| C | 20 |
| Linux | 17 |
| Docker | 15 |
| Next.js | 15 |

Most of the jobs require Python, but there are also some front-end, Java, and C jobs.

The magic filter-by-intersection function is a custom one: it keeps only the rows whose list-valued column contains a given set of elements:

# Filters rows by intersecting given `column` with `requirements`
# Case insensitive and works only if ALL requirements exist in a `column` value
# If `--invert` then works as symmetric difference
# Filters rows by intersecting given `column` with `requirements`
# Case-insensitive; keeps a row only if ALL requirements exist in its `column` value
# If `--invert` then keeps only rows that contain NONE of the requirements
def filter-by-intersection [
    column: string
    requirements: list
    --invert (-i)
] {
    let required_stack = $requirements | par-each { |el| str downcase }
    let required_len = if $invert { 0 } else { $requirements | length }
    $in
    | filter { |row|
        $required_len == (
            $row
            | get $column
            | par-each { |el| str downcase }
            | where ($it in $required_stack)
            | length
        )
    }
}
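For readers more at home in Python, the same idea, keeping a row only when every requirement appears in its stack (case-insensitively), can be sketched as follows; filter_by_intersection here is a hypothetical re-implementation, not the Nu function itself:

```python
def filter_by_intersection(rows, column, requirements, invert=False):
    """Keep rows whose `column` list contains ALL requirements (case-insensitive).

    With invert=True, keep only rows that contain NONE of the requirements.
    """
    required = {req.lower() for req in requirements}

    def matches(row):
        values = {value.lower() for value in row.get(column, [])}
        hits = len(required & values)
        # Inverted mode demands zero matches; normal mode demands all of them
        return hits == 0 if invert else hits == len(required)

    return [row for row in rows if matches(row)]
```

Using sets makes the membership test O(1) per element, and lowercasing both sides mirrors the Nu function's case-insensitive comparison.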

What about experience and degree requirement for each position in Python?

$df
| filter-by-intersection 'stack' ['python'] 
| group-by 'position' --to-table
| insert 'group_size' { |group| $group.items | length } 
| where 'group_size' >= 10
| insert 'experience' { |group| 
    $group.items 
    | get 'experience'
    | uniq --count  
    | sort-by 'count' --reverse 
    | update 'value' { |row| if $row.value == null { 0 } else { $row.value }}
    | rename --column { 'value': 'years' }
    | first 3 
} 
| insert 'degree_requirement' { |group| 
    $group.items 
    | each { |row| $row.degree != null } 
    | uniq --count 
    | sort-by 'value'
    | rename --column { 'value': 'required' }
}
| sort-by 'group_size' --reverse 
| select 'group' 'group_size' 'experience' 'degree_requirement'

Output:

╭───┬────────┬────────────┬───────────────────────┬──────────────────────────╮
│ # │ group  │ group_size │      experience       │    degree_requirement    │
├───┼────────┼────────────┼───────────────────────┼──────────────────────────┤
│ 0 │ senior │         83 │ ╭───┬───────┬───────╮ │ ╭───┬──────────┬───────╮ │
│   │        │            │ │ # │ years │ count │ │ │ # │ required │ count │ │
│   │        │            │ ├───┼───────┼───────┤ │ ├───┼──────────┼───────┤ │
│   │        │            │ │ 0 │     5 │    30 │ │ │ 0 │ false    │    26 │ │
│   │        │            │ │ 1 │     0 │    11 │ │ │ 1 │ true     │    57 │ │
│   │        │            │ │ 2 │     7 │    11 │ │ ╰───┴──────────┴───────╯ │
│   │        │            │ ╰───┴───────┴───────╯ │                          │
│ 1 │ other  │         14 │ ╭───┬───────┬───────╮ │ ╭───┬──────────┬───────╮ │
│   │        │            │ │ # │ years │ count │ │ │ # │ required │ count │ │
│   │        │            │ ├───┼───────┼───────┤ │ ├───┼──────────┼───────┤ │
│   │        │            │ │ 0 │     0 │     8 │ │ │ 0 │ false    │    12 │ │
│   │        │            │ │ 1 │     5 │     1 │ │ │ 1 │ true     │     2 │ │
│   │        │            │ │ 2 │     3 │     1 │ │ ╰───┴──────────┴───────╯ │
│   │        │            │ ╰───┴───────┴───────╯ │                          │
│ 2 │ lead   │         12 │ ╭───┬───────┬───────╮ │ ╭───┬──────────┬───────╮ │
│   │        │            │ │ # │ years │ count │ │ │ # │ required │ count │ │
│   │        │            │ ├───┼───────┼───────┤ │ ├───┼──────────┼───────┤ │
│   │        │            │ │ 0 │     0 │     5 │ │ │ 0 │ false    │     6 │ │
│   │        │            │ │ 1 │    10 │     4 │ │ │ 1 │ true     │     6 │ │
│   │        │            │ │ 2 │     5 │     1 │ │ ╰───┴──────────┴───────╯ │
│   │        │            │ ╰───┴───────┴───────╯ │                          │
│ 3 │ middle │         10 │ ╭───┬───────┬───────╮ │ ╭───┬──────────┬───────╮ │
│   │        │            │ │ # │ years │ count │ │ │ # │ required │ count │ │
│   │        │            │ ├───┼───────┼───────┤ │ ├───┼──────────┼───────┤ │
│   │        │            │ │ 0 │     3 │     4 │ │ │ 0 │ false    │     4 │ │
│   │        │            │ │ 1 │     5 │     3 │ │ │ 1 │ true     │     6 │ │
│   │        │            │ │ 2 │     2 │     2 │ │ ╰───┴──────────┴───────╯ │
│   │        │            │ ╰───┴───────┴───────╯ │                          │
╰───┴────────┴────────────┴───────────────────────┴──────────────────────────╯
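The same grouping can be reproduced in plain Python. This sketch (position_summary is a hypothetical helper) computes the group size, the three most common experience values, and the degree-requirement split per position, mirroring the Nu pipeline above:

```python
from collections import Counter, defaultdict


def position_summary(jobs, min_group=10):
    """Summarize experience and degree requirements per position group."""
    groups = defaultdict(list)
    for job in jobs:
        groups[job.get("position")].append(job)

    summary = {}
    for position, items in groups.items():
        if len(items) < min_group:
            continue  # skip positions with too few postings
        # Treat missing experience as 0 years, as the Nu pipeline does
        years = Counter((job.get("experience") or 0) for job in items)
        degree = Counter(job.get("degree") is not None for job in items)
        summary[position] = {
            "group_size": len(items),
            "top_experience": years.most_common(3),
            "degree_required": dict(degree),
        }
    return summary
```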

Extracting the most common requirements wasn't as easy as the previous steps: it turned into a classification problem, and I'm going to describe my solution in the next chapter of this article.

Conclusion

We successfully extracted and analyzed job data from LinkedIn using the linkedin_jobs_scraper package. The responsibilities in the resulting dataset are still too sparse and need better processing before they can be grouped into useful classes for CV creation. But these steps already help me a lot with monitoring jobs and applying to them in a semi-automated way.
