Published on 2024-11-08
Exploring Job Market for Software Engineers

Introduction

In this article, we dive into the process of extracting and analyzing job data from LinkedIn, leveraging a combination of Python, Nu shell, and ChatGPT to streamline and enhance our workflow.

I’ll walk you through the steps I took to carry out my research, showing how you can use these techniques to explore job markets in different countries or even in other fields. By combining these tools and methods, you can gather and analyze data to gain valuable insights into any job market you're interested in.

Technologies overview

Python

Python was chosen for its versatile libraries, particularly linkedin_jobs_scraper and openai. These packages streamlined the scraping and processing of job data.

Nu Shell

I tried Nu shell to compare it with the traditional bash stack and to see whether it offers any benefits for handling and manipulating data.

ChatGPT

ChatGPT was employed to assist in the extraction of specific job features from the collected data, such as years of experience, degree requirements, tech stack, position levels, and core responsibilities.

Data extraction

To start, some data is required. LinkedIn was the first website that came to mind, and there was a ready-to-use Python package for it. I copied the example code, modified it a little, and got a ready-to-use script that produces a JSON file with a list of job descriptions. Here is its source:

import json
import logging
import os
from threading import Lock

from dotenv import load_dotenv

# linkedin_jobs_scraper loads env statically
# So dotenv should be loaded before imports
load_dotenv()

from linkedin_jobs_scraper import LinkedinScraper
from linkedin_jobs_scraper.events import EventData, Events
from linkedin_jobs_scraper.filters import ExperienceLevelFilters, TypeFilters
from linkedin_jobs_scraper.query import Query, QueryFilters, QueryOptions

CHROMEDRIVER_PATH = os.environ["CHROMEDRIVER_PATH"]

RESULT_FILE_PATH = "result.json"
KEYWORDS = ("Python", "PHP", "Java", "Rust")
LOCATIONS = ("South Korea",)
TYPE_FILTERS = (TypeFilters.FULL_TIME,)
EXPERIENCE = (ExperienceLevelFilters.MID_SENIOR,)
LIMIT = 500

logging.basicConfig(level=logging.INFO)
log = logging.getLogger(__name__)


def main():
    result_lock = Lock()
    result = []

    def on_data(data: EventData):
        with result_lock:
            result.append(data._asdict())

        # logging expects a format string followed by its arguments
        log.info(
            "[JOB] %s | %s | %d",
            data.title,
            data.company,
            len(data.description),
        )

    def on_error(error):
        log.error("[ERROR] %s", error)

    def on_end():
        log.info("Scraping finished")

        if not result:
            return

        with open(RESULT_FILE_PATH, "w") as f:
            json.dump(result, f)

    queries = [
        Query(
            query=keyword,
            options=QueryOptions(
                limit=LIMIT,
                locations=[*LOCATIONS],
                filters=QueryFilters(
                    type=[*TYPE_FILTERS],
                    experience=[*EXPERIENCE],
                ),
            ),
        )
        for keyword in KEYWORDS
    ]

    scraper = LinkedinScraper(
        chrome_executable_path=CHROMEDRIVER_PATH,
        headless=True,
        max_workers=len(queries),
        slow_mo=0.5,
        page_load_timeout=40,
    )

    scraper.on(Events.DATA, on_data)
    scraper.on(Events.ERROR, on_error)
    scraper.on(Events.END, on_end)

    scraper.run(queries)


if __name__ == "__main__":
    main()
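
Because the script runs one query per keyword, the same posting may plausibly show up under more than one keyword (for example under both "Python" and "Java"). This is an assumption about the data, not something the scraper documents. A minimal sketch of removing such duplicates from result.json by job_id, using a hypothetical helper named dedupe_jobs:

```python
import json


def dedupe_jobs(jobs: list[dict]) -> list[dict]:
    """Keep the first record seen for each job_id, preserving order."""
    seen: set[str] = set()
    unique = []
    for job in jobs:
        job_id = job["job_id"]
        if job_id in seen:
            continue
        seen.add(job_id)
        unique.append(job)
    return unique


if __name__ == "__main__":
    # Hypothetical records shaped like the scraper output (most fields omitted).
    sample = [
        {"job_id": "1", "query": "Python", "title": "Backend Engineer"},
        {"job_id": "2", "query": "Python", "title": "Data Engineer"},
        {"job_id": "1", "query": "Java", "title": "Backend Engineer"},
    ]
    print(json.dumps(dedupe_jobs(sample), indent=2))
```

Keeping the first occurrence means the query field reflects whichever keyword found the job first.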

To download the Chrome driver, I wrote the following bash script:

#!/usr/bin/env bash
# Look up the latest stable Chrome version
stable_version=$(curl -s 'https://googlechromelabs.github.io/chrome-for-testing/LATEST_RELEASE_STABLE')
# Find the chromedriver download for that version; iterate over all platforms
# instead of assuming linux64 is the first entry
driver_url=$(curl -s 'https://googlechromelabs.github.io/chrome-for-testing/known-good-versions-with-downloads.json' \
    | jq -r ".versions[] | select(.version == \"${stable_version}\") | .downloads.chromedriver[] | select(.platform == \"linux64\") | .url")
wget "$driver_url"
# Unpack the archive and remove it afterwards
driver_zip_name=$(echo "$driver_url" | awk -F'/' '{print $NF}')
unzip "$driver_zip_name"
rm "$driver_zip_name"
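
The selection step of the script can also be sketched in Python. The function below only mirrors the fields the jq filter reads (versions, version, downloads.chromedriver, platform, url). The function name find_driver_url and the sample catalog with its example.com URLs are hypothetical, and fetching the JSON is left out on purpose:

```python
from typing import Optional


def find_driver_url(catalog: dict, version: str, platform: str = "linux64") -> Optional[str]:
    """Return the chromedriver URL for `version` and `platform`, or None if absent."""
    for entry in catalog.get("versions", []):
        if entry.get("version") != version:
            continue
        for download in entry.get("downloads", {}).get("chromedriver", []):
            if download.get("platform") == platform:
                return download.get("url")
    return None


if __name__ == "__main__":
    # Hypothetical catalog with the same shape the jq filter expects.
    sample_catalog = {
        "versions": [
            {
                "version": "1.2.3.4",
                "downloads": {
                    "chromedriver": [
                        {"platform": "mac-x64", "url": "https://example.com/mac.zip"},
                        {"platform": "linux64", "url": "https://example.com/linux.zip"},
                    ]
                },
            }
        ]
    }
    print(find_driver_url(sample_catalog, "1.2.3.4"))
```

Searching every entry for the platform, rather than taking the first one, keeps the lookup independent of the order in which platforms are listed.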

And my .env file looks like this:

CHROMEDRIVER_PATH="chromedriver-linux64/chromedriver"
LI_AT_COOKIE=

linkedin_jobs_scraper serializes jobs to the following DTO:

class EventData(NamedTuple):
    query: str = ''
    location: str = ''
    job_id: str = ''
    job_index: int = -1  # Only for debug
    link: str = ''
    apply_link: str = ''
    title: str = ''
    company: str = ''
    company_link: str = ''
    company_img_link: str = ''
    place: str = ''
    description: str = ''
    description_html: str = ''
    date: str = ''
    insights: List[str] = []
    skills: List[str] = []

A sample record (the description was replaced with ... for better readability; apply_link, company_link, and date are empty):

| field            | value |
| ---------------- | ----- |
| query            | Python |
| location         | South Korea |
| job_id           | 3959499221 |
| job_index        | 0 |
| link             | https://www.linkedin.com/jobs/view/3959499221/?trk=flagship3_search_srp_jobs |
| apply_link       | |
| title            | Senior Python Software Engineer |
| company          | Canonical |
| company_link     | |
| company_img_link | https://media.licdn.com/dms/image/v2/C560BAQEbIYAkAURcYw/company-logo_100_100/company-logo_100_100/0/1650566107463/canonical_logo?e=1734566400&v=beta&t=emb8cxAFwBnOGwJ8nTftd8ODTFDkC_5SQNz-Jcd8zRU |
| place            | Seoul, Seoul, South Korea (Remote) |
| description      | ... |
| description_html | ... |
| date             | |
| insights         | [Remote Full-time Mid-Senior level, Skills: Python (Programming Language), Computer Science, 8 more, See how you compare to 18 applicants. Try Premium for RSD0, , Am I a good fit for this job?, How can I best position myself for this job?, Tell me more about Canonical] |
| skills           | [Back-End Web Development, Computer Science, Engineering Documentation, Kubernetes, Linux, MLOps, OpenStack, Python (Programming Language), Technical Documentation, Web Services] |

It was generated with the following Nu shell command:

# Replaces the description of a job with an ellipsis
def hide-description [] {
    update description { |row| '...' } 
    | update description_html { |row| '...' } 
}

cat result.json 
| from  json 
| first 
| hide-description
| to md --pretty 

Last steps before analysis

We already have several ready-to-use features (title and skills), but I want more:

  • Years of experience
  • Degree
  • Tech stack
  • Position
  • Responsibilities

So let's add them with the help of ChatGPT!

import json
import logging
import os

from dotenv import load_dotenv
from linkedin_jobs_scraper.events import EventData
from openai import OpenAI
from tqdm import tqdm

load_dotenv()

client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
)

with open("result.json", "rb") as f:
    jobs = json.load(f)

parsed_descriptions = []

for job in tqdm(jobs):
    job = EventData(**job)
    chat_completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "user",
                "content": """
                    Process given IT job description. 
                    Output only raw JSON with the following fields:
                        - Experience (amount of years or null)
                        - Degree requirement (str if found else null)
                        - Tech stack (array of strings)
                        - Position (middle, senior, lead, manager, other (describe it))
                        - Core responsibilities (array of strings)

                    Output will be passed directly to the
                    Python's `json.loads` function. So DO NOT APPLY MARKDOWN FORMATTING
                    Example:
                    ```


                    {
                        "experience": 5, 
                        "degree": "bachelor", 
                        "stack": ["Python", "FastAPI", "Docker"], 
                        "position": "middle",
                        "responsibilities": ["Deliver features", "break production"]
                    }


                    ```

                    Here is a job description:
                """
                + "\n\n"
                + job.description_html,
            }
        ],
    )

    content = chat_completion.choices[0].message.content
    try:
        if not content:
            print("Empty result from ChatGPT")
            continue
        result = json.loads(content)
    except json.decoder.JSONDecodeError as e:
        logging.error("%s: %s", e, chat_completion)
        continue

    result["job_id"] = job.job_id
    parsed_descriptions.append(result)

with open("job_descriptions_analysis.json", "w") as f:
    json.dump(parsed_descriptions, f)

Do not forget to add OPENAI_API_KEY to the .env file.
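
The prompt asks the model not to apply Markdown formatting, and the script skips any reply that json.loads cannot parse. One way to lose fewer replies, sketched here as an assumption rather than as part of the original script, is to strip a surrounding code fence before parsing. The helper name parse_model_json is hypothetical:

```python
import json
import re

# Build the fence marker programmatically to keep this example easy to embed.
FENCE = "`" * 3
_FENCED_RE = re.compile(
    r"^" + FENCE + r"[a-zA-Z]*\s*\n(.*?)\n?" + FENCE + r"\s*$",
    re.DOTALL,
)


def parse_model_json(content: str) -> dict:
    """Parse a model reply as JSON, tolerating one surrounding code fence."""
    text = content.strip()
    match = _FENCED_RE.match(text)
    if match:
        # Keep only what is between the opening and closing fence.
        text = match.group(1)
    return json.loads(text)


if __name__ == "__main__":
    fenced_reply = FENCE + 'json\n{"experience": 5, "degree": null}\n' + FENCE
    print(parse_model_json(fenced_reply))
    print(parse_model_json('{"experience": 3, "degree": "bachelor"}'))
```

Replies that are still not valid JSON raise json.JSONDecodeError, so the existing try/except in the script would keep working unchanged.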

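Now we can combine the results with the data from LinkedIn by job_id. I use join here rather than merge, because merge pairs rows by position and the ChatGPT step may have skipped some jobs:

cat job_descriptions_analysis.json 
| from json 
| join (cat result.json | from json) job_id
| to json
| save full.json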
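
The same combination by job_id can be sketched in Python. Matching on the key matters because the ChatGPT script skips jobs whose reply could not be parsed, so the two files are not guaranteed to have the same length or order. The function name join_by_job_id is hypothetical:

```python
def join_by_job_id(analysis: list[dict], jobs: list[dict]) -> list[dict]:
    """Inner-join parsed descriptions with scraped jobs on job_id."""
    jobs_by_id = {job["job_id"]: job for job in jobs}
    merged = []
    for parsed in analysis:
        job = jobs_by_id.get(parsed["job_id"])
        if job is None:
            continue  # no scraped record with this job_id
        # Extracted features first, then the original LinkedIn fields.
        merged.append({**parsed, **job})
    return merged


if __name__ == "__main__":
    # Hypothetical records; job "2" has no parsed description.
    analysis = [{"job_id": "1", "experience": 5, "stack": ["Python"]}]
    jobs = [
        {"job_id": "1", "title": "Backend Engineer"},
        {"job_id": "2", "title": "Data Engineer"},
    ]
    print(join_by_job_id(analysis, jobs))
```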

Our data is ready to analyze!

cat full.json | from json | columns
╭────┬──────────────────╮
│  0 │ experience       │
│  1 │ degree           │
│  2 │ stack            │
│  3 │ position         │
│  4 │ responsibilities │
│  5 │ job_id           │
│  6 │ query            │
│  7 │ location         │
│  8 │ job_index        │
│  9 │ link             │
│ 10 │ apply_link       │
│ 11 │ title            │
│ 12 │ company          │
│ 13 │ company_link     │
│ 14 │ company_img_link │
│ 15 │ place            │
│ 16 │ description      │
│ 17 │ description_html │
│ 18 │ date             │
│ 19 │ insights         │
│ 20 │ skills           │
╰────┴──────────────────╯

Analysis

To start, load the data into a variable:

let df = cat full.json | from json

Now we can see technologies frequency:

$df
| get 'stack' 
| flatten 
| uniq --count 
| sort-by count --reverse 
| first 20 
| to md --pretty
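
| value      | count |
| ---------- | ----- |
| Python     | 185   |
| Java       | 70    |
| AWS        | 65    |
| Kubernetes | 61    |
| SQL        | 54    |
| C          | 46    |
| Docker     | 42    |
| Linux      | 41    |
| React      | 37    |
| Kotlin     | 34    |
| JavaScript | 30    |
| C          | 30    |
| Kafka      | 28    |
| TypeScript | 26    |
| GCP        | 25    |
| Azure      | 24    |
| Tableau    | 22    |
| Hadoop     | 21    |
| Spark      | 21    |
| R          | 20    |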
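
For readers who prefer Python over Nu shell, the same frequency count can be sketched with collections.Counter. The function name stack_frequency is hypothetical, and rows with a missing or null stack are treated as empty:

```python
from collections import Counter


def stack_frequency(jobs: list[dict], top: int = 20) -> list[tuple[str, int]]:
    """Count how often each technology appears across all `stack` lists."""
    counter = Counter(
        tech
        for job in jobs
        for tech in (job.get("stack") or [])
    )
    return counter.most_common(top)


if __name__ == "__main__":
    # Hypothetical rows shaped like the entries of full.json.
    sample = [
        {"stack": ["Python", "AWS"]},
        {"stack": ["Python", "Docker"]},
        {"stack": None},
    ]
    for value, count in stack_frequency(sample):
        print(value, count)
```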

With Python:

$df
| filter-by-intersection 'stack' ['python']
| get 'stack' 
| flatten 
| where $it != 'Python' # Exclude python itself
| uniq --count 
| sort-by count --reverse 
| first 10
| to md --pretty

| value      | count |
| ---------- | ----- |
| Java       | 44    |
| AWS        | 43    |
| SQL        | 40    |
| Kubernetes | 36    |
| Docker     | 27    |
| C          | 26    |
| Linux      | 24    |
| R          | 20    |
| GCP        | 20    |
| C          | 18    |

Without Python:

$df
| filter-by-intersection 'stack' ['python'] --invert
| get 'stack' 
| flatten 
| uniq --count 
| sort-by count --reverse 
| first 10
| to md --pretty

| value      | count |
| ---------- | ----- |
| React      | 31    |
| Java       | 26    |
| Kubernetes | 25    |
| TypeScript | 23    |
| AWS        | 22    |
| Kotlin     | 21    |
| C          | 20    |
| Linux      | 17    |
| Docker     | 15    |
| Next.js    | 15    |

Most of the jobs require Python, but there are also some front-end, Java, and C jobs.

The filter-by-intersection function used above is a custom one. It keeps the rows whose list column includes a given set of elements:

# Filters rows by intersecting the given `column` with `requirements`
# Case insensitive; a row matches only if ALL requirements exist in its `column` value
# With `--invert`, keeps only the rows that contain NONE of the requirements
def filter-by-intersection [
    column: string
    requirements: list
   --invert (-i)
] {
    # Lowercase the requirements once so the comparison is case insensitive
    let required_stack = $requirements | par-each { |el| $el | str downcase }
    let required_len = if $invert { 0 } else { ($required_stack | length )}
    $in
    | filter { |row| 
        $required_len == (
            $row 
            | get $column 
            | par-each { |el| $el | str downcase } 
            | where ($it in $required_stack) 
            | length
        )
    }
}
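
The same filter can be sketched in Python with set operations, which makes the two modes explicit: by default a row must contain all requirements, and with invert=True it must contain none of them. This is an illustrative equivalent under those assumptions, not a translation the Nu version depends on:

```python
def filter_by_intersection(
    rows: list[dict],
    column: str,
    requirements: list[str],
    invert: bool = False,
) -> list[dict]:
    """Keep rows whose list `column` contains all (or, inverted, none) of `requirements`."""
    required = {item.lower() for item in requirements}
    result = []
    for row in rows:
        present = {str(value).lower() for value in (row.get(column) or [])}
        matched = required & present
        keep = (not matched) if invert else (matched == required)
        if keep:
            result.append(row)
    return result


if __name__ == "__main__":
    # Hypothetical rows shaped like the entries of full.json.
    sample = [
        {"job_id": "1", "stack": ["Python", "AWS"]},
        {"job_id": "2", "stack": ["React", "TypeScript"]},
    ]
    print(filter_by_intersection(sample, "stack", ["python"]))
    print(filter_by_intersection(sample, "stack", ["python"], invert=True))
```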

What about the experience and degree requirements for each position among the Python jobs?

$df
| filter-by-intersection 'stack' ['python'] 
| group-by 'position' --to-table
| insert 'group_size' { |group| $group.items | length } 
| where 'group_size' >= 10
| insert 'experience' { |group| 
    $group.items 
    | get 'experience'
    | uniq --count  
    | sort-by 'count' --reverse 
    | update 'value' { |row| if $row.value == null { 0 } else { $row.value }}
    | rename --column { 'value': 'years' }
    | first 3 
} 
| insert 'degree_requirement' { |group| 
    $group.items 
    | each { |row| $row.degree != null } 
    | uniq --count 
    | sort-by 'value'
    | rename --column { 'value': 'required' }
}
| sort-by 'group_size' --reverse 
| select 'group' 'group_size' 'experience' 'degree_requirement'

Output:

╭───┬────────┬────────────┬───────────────────────┬──────────────────────────╮
│ # │ group  │ group_size │      experience       │    degree_requirement    │
├───┼────────┼────────────┼───────────────────────┼──────────────────────────┤
│ 0 │ senior │         83 │ ╭───┬───────┬───────╮ │ ╭───┬──────────┬───────╮ │
│   │        │            │ │ # │ years │ count │ │ │ # │ required │ count │ │
│   │        │            │ ├───┼───────┼───────┤ │ ├───┼──────────┼───────┤ │
│   │        │            │ │ 0 │     5 │    30 │ │ │ 0 │ false    │    26 │ │
│   │        │            │ │ 1 │     0 │    11 │ │ │ 1 │ true     │    57 │ │
│   │        │            │ │ 2 │     7 │    11 │ │ ╰───┴──────────┴───────╯ │
│   │        │            │ ╰───┴───────┴───────╯ │                          │
│ 1 │ other  │         14 │ ╭───┬───────┬───────╮ │ ╭───┬──────────┬───────╮ │
│   │        │            │ │ # │ years │ count │ │ │ # │ required │ count │ │
│   │        │            │ ├───┼───────┼───────┤ │ ├───┼──────────┼───────┤ │
│   │        │            │ │ 0 │     0 │     8 │ │ │ 0 │ false    │    12 │ │
│   │        │            │ │ 1 │     5 │     1 │ │ │ 1 │ true     │     2 │ │
│   │        │            │ │ 2 │     3 │     1 │ │ ╰───┴──────────┴───────╯ │
│   │        │            │ ╰───┴───────┴───────╯ │                          │
│ 2 │ lead   │         12 │ ╭───┬───────┬───────╮ │ ╭───┬──────────┬───────╮ │
│   │        │            │ │ # │ years │ count │ │ │ # │ required │ count │ │
│   │        │            │ ├───┼───────┼───────┤ │ ├───┼──────────┼───────┤ │
│   │        │            │ │ 0 │     0 │     5 │ │ │ 0 │ false    │     6 │ │
│   │        │            │ │ 1 │    10 │     4 │ │ │ 1 │ true     │     6 │ │
│   │        │            │ │ 2 │     5 │     1 │ │ ╰───┴──────────┴───────╯ │
│   │        │            │ ╰───┴───────┴───────╯ │                          │
│ 3 │ middle │         10 │ ╭───┬───────┬───────╮ │ ╭───┬──────────┬───────╮ │
│   │        │            │ │ # │ years │ count │ │ │ # │ required │ count │ │
│   │        │            │ ├───┼───────┼───────┤ │ ├───┼──────────┼───────┤ │
│   │        │            │ │ 0 │     3 │     4 │ │ │ 0 │ false    │     4 │ │
│   │        │            │ │ 1 │     5 │     3 │ │ │ 1 │ true     │     6 │ │
│   │        │            │ │ 2 │     2 │     2 │ │ ╰───┴──────────┴───────╯ │
│   │        │            │ ╰───┴───────┴───────╯ │                          │
╰───┴────────┴────────────┴───────────────────────┴──────────────────────────╯
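
The grouping above can also be sketched in Python. As in the Nu pipeline, a null experience is counted as 0 years, groups smaller than a threshold are dropped, and the degree requirement is reduced to whether the degree field is set. The function name summarize_by_position and the output layout are hypothetical:

```python
from collections import Counter, defaultdict


def summarize_by_position(
    jobs: list[dict],
    min_group_size: int = 10,
    top_years: int = 3,
) -> list[dict]:
    """Summarize experience and degree requirements per position."""
    groups = defaultdict(list)
    for job in jobs:
        groups[job.get("position")].append(job)

    summary = []
    for position, items in groups.items():
        if len(items) < min_group_size:
            continue
        # Treat a missing experience value as 0 years.
        years = Counter(item.get("experience") or 0 for item in items)
        degree = Counter(item.get("degree") is not None for item in items)
        summary.append(
            {
                "group": position,
                "group_size": len(items),
                "experience": years.most_common(top_years),
                "degree_requirement": {
                    "required": degree[True],
                    "not_required": degree[False],
                },
            }
        )
    summary.sort(key=lambda group: group["group_size"], reverse=True)
    return summary


if __name__ == "__main__":
    # Hypothetical rows; a real run would load them from full.json.
    sample = [
        {"position": "senior", "experience": 5, "degree": "bachelor"},
        {"position": "senior", "experience": None, "degree": None},
        {"position": "lead", "experience": 10, "degree": None},
    ]
    for group in summarize_by_position(sample, min_group_size=2):
        print(group)
```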

Extracting the most common requirements wasn't as easy as the previous steps. It turned out to be a classification problem, and I'm going to describe my solution in the next chapter of this article.

Conclusion

We successfully extracted and analyzed job data from LinkedIn using the linkedin_jobs_scraper package. The responsibilities in the actual dataset are too sparse and need further processing to form functional classes that would help with CV creation. Even so, the steps described here already help me a lot with monitoring and applying for jobs in a semi-automatic mode.

Copyright notice: this article is reproduced from https://dev.to/suzumenobu/exploring-job-market-for-software-engineers-3li8?1. If it infringes your rights, please contact [email protected] to have it removed.