JIRA Analytics with Pandas

Published on 2024-08-26

Problem

It's hard to argue that Atlassian JIRA isn't one of the most popular issue trackers and project management solutions. You can love it or hate it, but if you are hired as a software engineer, there is a high probability you will encounter JIRA.

If the project you are working on is very active, there can be thousands of JIRA issues of various types. If you lead a team of engineers, you may be interested in analytical tools that help you understand what is going on in the project based on the data stored in JIRA. JIRA has some built-in reporting facilities, as well as third-party plugins, but most of them are pretty basic. For example, it's hard to find flexible "forecasting" tools.

The bigger the project, the less satisfied you are with the built-in reporting tools. At some point, you will end up using the API to extract, manipulate, and visualize the data. Over the last 15 years of using JIRA, I have seen dozens of such scripts and services, written in various programming languages.

Many day-to-day tasks may require one-time data analysis, so writing services every time doesn't pay off. You can treat JIRA as a data source and use a typical data analytics tool belt. For example, you may take Jupyter, fetch the list of recent bugs in the project, prepare a list of "features" (attributes valuable for analysis), utilize pandas to calculate the statistics, and try to forecast trends using scikit-learn. In this article, I would like to explain how to do it.

Preparation

JIRA API Access

Here, we will talk about the cloud version of JIRA. But if you are using a self-hosted version, the main concepts are almost the same.

First of all, we need to create an API token to access JIRA via the REST API. To do so, go to profile management at https://id.atlassian.com/manage-profile/profile-and-visibility. On the "Security" tab, you will find the "Create and manage API tokens" link.

Create a new API token here and store it securely. We will use this token later.


Jupyter Notebooks

One of the most convenient ways to play with datasets is to utilize Jupyter. If you are not familiar with this tool, do not worry. I will show how to use it to solve our problem. For local experiments, I like to use DataSpell by JetBrains, but there are services available online and for free. One of the most well-known services among data scientists is Kaggle. However, their notebooks don't allow you to make external connections to access JIRA via API. Another very popular service is Colab by Google. It allows you to make remote connections and install additional Python modules.

JIRA has a pretty easy-to-use REST API. You can make API calls using your favorite way of doing HTTP requests and parse the response manually. However, we will utilize an excellent and very popular jira module for that purpose.
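For comparison, a manual call without the jira module could be sketched as follows. This is a minimal illustration using only the standard library. The helper name build_myself_request and the placeholder credentials are hypothetical, and the sketch assumes JIRA Cloud's /rest/api/3/myself endpoint with basic authentication (account email plus API token). The request is only constructed here, not sent.

```python
import base64
import urllib.request


def build_myself_request(base_url: str, user: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a GET request for the current user's profile."""
    # Basic auth header value: base64("email:api_token")
    credentials = base64.b64encode(f"{user}:{token}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/rest/api/3/myself",
        headers={
            "Authorization": f"Basic {credentials}",
            "Accept": "application/json",
        },
        method="GET",
    )


# Placeholder values for illustration only
request = build_myself_request("https://example.atlassian.net", "me@example.com", "api-token")
print(request.full_url)
```

Sending the request with urllib.request.urlopen(request) and decoding the JSON body is then up to you, which is exactly the manual work the jira module takes care of.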

Tools in Action

Data Analysis

Let's combine all the parts to come up with the solution.

Go to the Google Colab interface and create a new notebook. Next, we need to store the previously obtained JIRA credentials as "secrets." Click the "Key" icon in the left toolbar to open the appropriate dialog and add two secrets named JIRA_USER (your Atlassian account email) and JIRA_PASSWORD (the API token created earlier). The bottom of the dialog shows how to access these secrets from code.

The next thing is to install an additional Python module for JIRA integration. We can do that by running a shell command in a notebook cell:

!pip install jira

The output should look something like the following:

Collecting jira
  Downloading jira-3.8.0-py3-none-any.whl (77 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 77.5/77.5 kB 1.3 MB/s eta 0:00:00
Requirement already satisfied: defusedxml in /usr/local/lib/python3.10/dist-packages (from jira) (0.7.1)
...
Installing collected packages: requests-toolbelt, jira
Successfully installed jira-3.8.0 requests-toolbelt-1.0.0

We need to fetch the "secrets"/credentials:

from google.colab import userdata

JIRA_URL = 'https://******.atlassian.net'
JIRA_USER = userdata.get('JIRA_USER')
JIRA_PASSWORD = userdata.get('JIRA_PASSWORD')

And validate the connection to the JIRA Cloud:

from jira import JIRA

jira = JIRA(JIRA_URL, basic_auth=(JIRA_USER, JIRA_PASSWORD))
projects = jira.projects()
projects

If the connection is ok and the credentials are valid, you should see a non-empty list of your projects:

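[<JIRA Project: key='...', name='...', id='...'>,
 <JIRA Project: key='...', name='...', id='...'>,
 <JIRA Project: key='...', name='...', id='...'>,
...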

So we can connect and fetch data from JIRA. The next step is to fetch some data for analysis with pandas. Let's fetch the issues resolved during the last several weeks for some project, using a saved JIRA filter:

JIRA_FILTER = 19762  # ID of a saved JIRA filter

issues = jira.search_issues(
    f'filter={JIRA_FILTER}',
    maxResults=False,  # fetch all matching issues in batches, not just the first page
    fields='summary,issuetype,assignee,reporter,aggregatetimespent',
)

We need to transform the dataset into a pandas data frame:

import pandas as pd

df = pd.DataFrame([{
    'key': issue.key,
    # Fall back to the reporter when the issue has no assignee
    'assignee': (issue.fields.assignee or issue.fields.reporter).displayName,
    'time': issue.fields.aggregatetimespent,  # logged time in seconds, None if nothing was logged
    'summary': issue.fields.summary,
} for issue in issues])

df.set_index('key', inplace=True)

df

The result is a data frame indexed by issue key, with "assignee", "time", and "summary" columns.
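To try the conversion without JIRA access, the issue objects can be imitated with simple stand-ins. The sketch below is only an illustration: make_issue is a hypothetical helper built on types.SimpleNamespace, and the keys, names, and values are made up. It shows the reporter fallback for an unassigned issue and the missing value for an issue with no logged work.

```python
from types import SimpleNamespace

import pandas as pd


def make_issue(key, assignee, reporter, seconds, summary):
    """Hypothetical stand-in that mimics the attribute layout of a jira issue."""
    def person(name):
        return SimpleNamespace(displayName=name) if name else None

    return SimpleNamespace(
        key=key,
        fields=SimpleNamespace(
            assignee=person(assignee),
            reporter=person(reporter),
            aggregatetimespent=seconds,
            summary=summary,
        ),
    )


issues = [
    make_issue("PPS-1", "Alice", "Carol", 3600, "Fix login"),
    make_issue("PPS-2", None, "Bob", None, "Update docs"),  # unassigned, no work logged
]

df = pd.DataFrame([{
    "key": issue.key,
    "assignee": (issue.fields.assignee or issue.fields.reporter).displayName,
    "time": issue.fields.aggregatetimespent,
    "summary": issue.fields.summary,
} for issue in issues]).set_index("key")

print(df)
```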

We would like to analyze how much time it usually takes to solve an issue. People are not perfect, so they sometimes forget to log their work. This is a headache if you analyze such data with JIRA's built-in tools, but it's easy to adjust with pandas. For example, we can replace the absent values in the "time" field with the median value and convert the field from seconds into hours (beware, dropna can be more suitable if there are a lot of gaps):

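# Assign the result back instead of using inplace=True on a column selection,
# which newer pandas versions warn about and may silently ignore
df['time'] = df['time'].fillna(df['time'].median())
df['time'] = df['time'] / 3600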
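The difference between the two strategies is easy to see on a tiny, made-up series. The following sketch uses invented issue keys and values purely for illustration.

```python
import pandas as pd

# Made-up logged time in seconds; None marks issues with no work logged
time_spent = pd.Series(
    [3600, None, 7200, 10800, None],
    index=["PPS-1", "PPS-2", "PPS-3", "PPS-4", "PPS-5"],
    dtype="float64",
)

median_seconds = time_spent.median()  # missing values are skipped
filled_hours = time_spent.fillna(median_seconds) / 3600
dropped_hours = time_spent.dropna() / 3600

print(filled_hours.tolist())   # [1.0, 2.0, 2.0, 3.0, 2.0]
print(dropped_hours.tolist())  # [1.0, 2.0, 3.0]
```

Filling keeps every ticket in the dataset but pulls the distribution toward the median, while dropping keeps only the tickets with real measurements.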

We can easily visualize the distribution to spot anomalies:

df['time'].plot.bar(xlabel='', xticks=[])

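Besides eyeballing the chart, anomalies can also be flagged numerically. The sketch below uses the common 1.5 × IQR rule, which is not part of the original analysis and is only one possible choice. The data is made up.

```python
import pandas as pd

# Made-up resolution times in hours
hours = pd.Series(
    [1.0, 2.0, 2.0, 3.0, 2.0, 40.0],
    index=["PPS-1", "PPS-2", "PPS-3", "PPS-4", "PPS-5", "PPS-6"],
)

# Interquartile range and the usual 1.5 * IQR fences
q1, q3 = hours.quantile(0.25), hours.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only the values outside the fences
outliers = hours[(hours < lower) | (hours > upper)]
print(outliers)
```

Here only the 40-hour ticket falls outside the fences, so it is the one worth a closer look.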

It is also interesting to see the distribution of solved problems by assignee:

top_solvers = df.groupby('assignee').count()[['time']]
top_solvers.rename(columns={'time': 'tickets'}, inplace=True)
top_solvers.sort_values('tickets', ascending=False, inplace=True)

top_solvers.plot.barh().invert_yaxis()

The result is a horizontal bar chart with the most active solvers at the top.
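If the total logged time per person is also of interest, both numbers can be computed in one pass with named aggregation. This is a small sketch on made-up data. The column names "tickets" and "hours" are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "assignee": ["Alice", "Bob", "Alice", "Carol", "Alice", "Bob"],
        "time": [1.0, 2.0, 2.0, 3.0, 2.0, 4.0],  # hours
    },
    index=["PPS-1", "PPS-2", "PPS-3", "PPS-4", "PPS-5", "PPS-6"],
)

# One row per assignee: number of tickets and total hours
workload = (
    df.groupby("assignee")["time"]
    .agg(tickets="count", hours="sum")
    .sort_values("tickets", ascending=False)
)
print(workload)
```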

Predictions

Let's try to predict the amount of time required to finish all open issues. Of course, we can do it without machine learning, using a simple approximation: the number of open issues multiplied by the typical time to resolve one. For example, if the median time to solve one issue is 2 hours and we have 9 open issues, the time required to solve them all is roughly 18 hours. It's a good enough forecast, but we know that the speed of solving depends on the product, team, and other attributes of the issue. If we want to improve the prediction, we can use machine learning.
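The rough estimate described above fits in a few lines. This is a minimal sketch. The function name naive_estimate and the sample values are made up to match the numbers in the example.

```python
import statistics


def naive_estimate(resolved_hours, open_count):
    """Estimate total remaining hours as median resolution time times open issue count."""
    return statistics.median(resolved_hours) * open_count


resolved_hours = [1.0, 2.0, 2.0, 3.0, 2.0]  # median is 2 hours
total_hours = naive_estimate(resolved_hours, open_count=9)
print(total_hours)  # 18.0
```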

The high-level approach looks like this:

  • Obtain the dataset for “learning”
  • Clean up the data
  • Prepare the "features" aka "feature engineering"
  • Train the model
  • Use the model to predict some value of the target dataset

For the first step, we will use a dataset of tickets for the last 30 weeks. Some parts here are simplified for illustrative purposes. In real life, the amount of data for learning should be big enough to make a useful model (e.g., in our case, we need thousands of issues to be analyzed).

issues = jira.search_issues(
    'project = PPS AND status IN (Resolved) AND created >= -30w',
    maxResults=False,
    fields='summary,issuetype,customfield_10718,customfield_10674,aggregatetimespent',
)

closed_tickets = pd.DataFrame([{
    'key': issue.key,
    'team': issue.fields.customfield_10718,
    'product': issue.fields.customfield_10674,
    'time': issue.fields.aggregatetimespent,
} for issue in issues])

closed_tickets.set_index('key', inplace=True)
closed_tickets['time'] = closed_tickets['time'].fillna(closed_tickets['time'].median())

closed_tickets

In my case, it's something around 800 tickets and only two fields for "learning": "team" and "product."

The next step is to obtain our target dataset. Why do I do it so early? I want to clean up and do "feature engineering" in one shot for both datasets. Otherwise, the mismatch between the structures can cause problems.

issues = jira.search_issues(
    'project = PPS AND status IN (Open, Reopened)',
    maxResults=False,
    fields='summary,issuetype,customfield_10718,customfield_10674',
)

open_tickets = pd.DataFrame([{
    'key': issue.key,
    'team': issue.fields.customfield_10718,
    'product': issue.fields.customfield_10674,
} for issue in issues])

open_tickets.set_index('key', inplace=True)

open_tickets

Please notice we have no "time" column here because we want to predict it. Let's add it with a placeholder value of 0 and combine both datasets to prepare the "features."

open_tickets['time'] = 0
tickets = pd.concat([closed_tickets, open_tickets])

tickets

Columns "team" and "product" contain string values. One of the ways of dealing with that is to transform each value into separate fields with boolean flags.

products = pd.get_dummies(tickets['product'], prefix='product')
tickets = pd.concat([tickets, products], axis=1)
tickets.drop('product', axis=1, inplace=True)

teams = pd.get_dummies(tickets['team'], prefix='team')
tickets = pd.concat([tickets, teams], axis=1)
tickets.drop('team', axis=1, inplace=True)

tickets

The result contains the "time" column plus one boolean column per product (prefixed with "product_") and one per team (prefixed with "team_").
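The reason for encoding both datasets together becomes visible on a toy example. In the sketch below, the team names and issue keys are made up. Encoding each part separately produces different column sets, while encoding the combined frame gives both parts the same columns.

```python
import pandas as pd

closed = pd.DataFrame({"team": ["Core", "Web", "Core"]}, index=["PPS-1", "PPS-2", "PPS-3"])
open_ = pd.DataFrame({"team": ["Web", "Mobile"]}, index=["PPS-4", "PPS-5"])

# Encoding separately: each part only knows the values it contains
closed_sep = pd.get_dummies(closed["team"], prefix="team")
open_sep = pd.get_dummies(open_["team"], prefix="team")
print(list(closed_sep.columns))  # ['team_Core', 'team_Web']
print(list(open_sep.columns))    # ['team_Mobile', 'team_Web']

# Encoding the combined frame: both parts share one set of columns
combined = pd.get_dummies(pd.concat([closed, open_])["team"], prefix="team")
closed_enc = combined.iloc[:len(closed)]
open_enc = combined.iloc[len(closed):]
print(list(open_enc.columns))    # ['team_Core', 'team_Mobile', 'team_Web']
```

A model trained on the separately encoded closed tickets would not accept the open tickets as input, because the feature columns differ.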

After the combined dataset preparation, we can split it back into two parts:

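n_closed = len(closed_tickets)  # remember the boundary before reassigning
closed_tickets = tickets[:n_closed].copy()
open_tickets = tickets[n_closed:].copy()  # copy so the "time" column can be assigned later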

Now it's time to train our model:

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

features = closed_tickets.drop(['time'], axis=1)
labels = closed_tickets['time']

features_train, features_val, labels_train, labels_val = train_test_split(features, labels, test_size=0.2)

model = DecisionTreeRegressor()
model.fit(features_train, labels_train)
model.score(features_val, labels_val)  # R^2 on the validation split (1.0 is a perfect fit)
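A score on its own is hard to interpret, so it can help to compare the model with a trivial baseline. The sketch below is an illustration on synthetic data, not the author's dataset. It assumes a median-predicting DummyRegressor as the baseline and mean absolute error as the metric, and fixes random_state so the split is reproducible.

```python
import pandas as pd
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: team "Core" needs about 1 hour, team "Web" about 3 hours (in seconds)
rows = [
    {
        "team": "Core" if i % 2 == 0 else "Web",
        "time": (3600 if i % 2 == 0 else 10800) + (i % 5) * 60,
    }
    for i in range(40)
]
data = pd.DataFrame(rows)

features = pd.get_dummies(data["team"], prefix="team", dtype=int)
labels = data["time"]

features_train, features_val, labels_train, labels_val = train_test_split(
    features, labels, test_size=0.2, random_state=0
)

# Baseline always predicts the median of the training labels
baseline = DummyRegressor(strategy="median").fit(features_train, labels_train)
model = DecisionTreeRegressor(random_state=0).fit(features_train, labels_train)

baseline_mae = mean_absolute_error(labels_val, baseline.predict(features_val))
model_mae = mean_absolute_error(labels_val, model.predict(features_val))
print(f"baseline MAE: {baseline_mae:.0f} s, tree MAE: {model_mae:.0f} s")
```

If the tree does not clearly beat the baseline on real data, the chosen features probably carry little information about resolution time.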

And the final step is to use our model to make a prediction:

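# The model was trained on seconds, so the predictions are in seconds too
open_tickets['time'] = model.predict(open_tickets.drop('time', axis=1, errors='ignore'))
open_tickets['time'].sum() / 3600  # total predicted time in hours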

The final output, in my case, is 25 hours, which is higher than our initial rough estimate. This was a basic example. However, by using ML tools, you can significantly expand your abilities to analyze JIRA data.

Conclusion

Sometimes, JIRA's built-in tools and plugins are not sufficient for effective analysis. Moreover, many third-party plugins are rather expensive, costing thousands of dollars per year, and you will still struggle to make them work the way you want. However, you can go beyond these limitations by fetching the necessary information via the JIRA API and applying well-known data analysis tools. I have spent many hours playing with various JIRA plugins in attempts to create good reports for projects, but they often missed some important parts. Building a tool or a full-featured service on top of the JIRA API also often looks like overkill. That's why typical data analysis and ML tools like Jupyter, pandas, matplotlib, scikit-learn, and others may work better here.


Source: this article is reproduced from https://dev.to/sibprogrammer/jira-analytics-with-pandas-agl?1