」工欲善其事,必先利其器。「—孔子《論語.錄靈公》
首頁 > 程式設計 > 使用 Pandas 進行 JIRA 分析

使用 Pandas 進行 JIRA 分析

發佈於2024-08-26
瀏覽:622

Problem

It's hard to argue Atlassian JIRA is one of the most popular issue trackers and project management solutions. You can love it, you can hate it, but if you were hired as a software engineer for some company, there is a high probability of meeting JIRA.

If the project you are working on is very active, there can be thousands of JIRA issues of various types. If you are leading a team of engineers, you can be interested in analytical tools that can help you understand what is going on in the project based on data stored in JIRA. JIRA has some reporting facilities integrated, as well as 3rd party plugins. But most of them are pretty basic. For example, it's hard to find rather flexible "forecasting" tools.

The bigger the project, the less satisfied you are with integrated reporting tools. At some point, you will end up using an API to extract, manipulate, and visualize the data. During the last 15 years of JIRA usage, I saw dozens of such scripts and services in various programming languages around this domain.

Many day-to-day tasks may require one-time data analysis, so writing services every time doesn't pay off. You can treat JIRA as a data source and use a typical data analytics tool belt. For example, you may take Jupyter, fetch the list of recent bugs in the project, prepare a list of "features" (attributes valuable for analysis), utilize pandas to calculate the statistics, and try to forecast trends using scikit-learn. In this article, I would like to explain how to do it.

Preparation

JIRA API Access

Here, we will talk about the cloud version of JIRA. But if you are using a self-hosted version, the main concepts are almost the same.

First of all, we need to create a secret key to access JIRA via REST API. To do so, go to profile management - https://id.atlassian.com/manage-profile/profile-and-visibility If you select the "Security" tab, you will find the "Create and manage API tokens" link:

JIRA Analytics with Pandas

Create a new API token here and store it securely. We will use this token later.

JIRA Analytics with Pandas

Jupyter Notebooks

One of the most convenient ways to play with datasets is to utilize Jupyter. If you are not familiar with this tool, do not worry. I will show how to use it to solve our problem. For local experiments, I like to use DataSpell by JetBrains, but there are services available online and for free. One of the most well-known services among data scientists is Kaggle. However, their notebooks don't allow you to make external connections to access JIRA via API. Another very popular service is Colab by Google. It allows you to make remote connections and install additional Python modules.

JIRA has a pretty easy-to-use REST API. You can make API calls using your favorite way of doing HTTP requests and parse the response manually. However, we will utilize an excellent and very popular jira module for that purpose.

Tools in Action

Data Analysis

Let's combine all the parts to come up with the solution.

Go to the Google Colab interface and create a new notebook. After the notebook creation, we need to store previously obtained JIRA credentials as "secrets." Click the "Key" icon in the left toolbar to open the appropriate dialog and add two "secrets" with the following names: JIRA_USER and JIRA_PASSWORD. At the bottom of the screen, you can see the way how to access these "secrets":

JIRA Analytics with Pandas

The next thing is to install an additional Python module for JIRA integration. We can do it by executing the shell command in the scope of the notebook cell:

!pip install jira

The output should look something like the following:

Collecting jira
  Downloading jira-3.8.0-py3-none-any.whl (77 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 77.5/77.5 kB 1.3 MB/s eta 0:00:00
Requirement already satisfied: defusedxml in /usr/local/lib/python3.10/dist-packages (from jira) (0.7.1)
...
Installing collected packages: requests-toolbelt, jira
Successfully installed jira-3.8.0 requests-toolbelt-1.0.0

We need to fetch the "secrets"/credentials:

from google.colab import userdata

JIRA_URL = 'https://******.atlassian.net'
JIRA_USER = userdata.get('JIRA_USER')
JIRA_PASSWORD = userdata.get('JIRA_PASSWORD')

And validate the connection to the JIRA Cloud:

from jira import JIRA

jira = JIRA(JIRA_URL, basic_auth=(JIRA_USER, JIRA_PASSWORD))
projects = jira.projects()
projects

If the connection is ok and the credentials are valid, you should see a non-empty list of your projects:

[,
 ,
 ,
...

So we can connect and fetch data from JIRA. The next step is to fetch some data for analysis with pandas. Let’s try to fetch the list of solved problems during the last several weeks for some project:

JIRA_FILTER = 19762

issues = jira.search_issues(
    f'filter={JIRA_FILTER}',
    maxResults=False,
    fields='summary,issuetype,assignee,reporter,aggregatetimespent',
)

We need to transform the dataset into the pandas data frame:

import pandas as pd

df = pd.DataFrame([{
    'key': issue.key,
    'assignee': issue.fields.assignee and issue.fields.assignee.displayName or issue.fields.reporter.displayName,
    'time': issue.fields.aggregatetimespent,
    'summary': issue.fields.summary,
} for issue in issues])

df.set_index('key', inplace=True)

df

The output may look like the following:

JIRA Analytics with Pandas

We would like to analyze how much time it usually takes to solve the issue. People are not ideal, so sometimes they forget to log the work. It brings a headache if you try to analyze such data using JIRA built-in tools. But it's not a problem for us to make some adjustments using pandas. For example, we can transform the "time" field from seconds into hours and replace the absent values with the median value (beware, dropna can be more suitable if there are a lot of gaps):

df['time'].fillna(df['time'].median(), inplace=True)
df['time'] = df['time'] / 3600

We can easily visualize the distribution to find out anomalies:

df['time'].plot.bar(xlabel='', xticks=[])

JIRA Analytics with Pandas

It is also interesting to see the distribution of solved problems by the assignee:

top_solvers = df.groupby('assignee').count()[['time']]
top_solvers.rename(columns={'time': 'tickets'}, inplace=True)
top_solvers.sort_values('tickets', ascending=False, inplace=True)

top_solvers.plot.barh().invert_yaxis()

It may look like the following:

JIRA Analytics with Pandas

Predictions

Let's try to predict the amount of time required to finish all open issues. Of course, we can do it without machine learning by using simple approximation and the average time to resolve the issue. So the predicted amount of required time is the number of open issues multiplied by the average time to resolve one. For example, the median time to solve one issue is 2 hours, and we have 9 open issues, so the time required to solve them all is 18 hours (approximation). It's a good enough forecast, but we might know the speed of solving depends on the product, team, and other attributes of the issue. If we want to improve the prediction, we can utilize machine learning to solve this task.

The high-level approach looks the following:

  • Obtain the dataset for “learning”
  • Clean up the data
  • Prepare the "features" aka "feature engineering"
  • Train the model
  • Use the model to predict some value of the target dataset

For the first step, we will use a dataset of tickets for the last 30 weeks. Some parts here are simplified for illustrative purposes. In real life, the amount of data for learning should be big enough to make a useful model (e.g., in our case, we need thousands of issues to be analyzed).

issues = jira.search_issues(
    f'project = PPS AND status IN (Resolved) AND created >= -30w',
    maxResults=False,
    fields='summary,issuetype,customfield_10718,customfield_10674,aggregatetimespent',
)

closed_tickets = pd.DataFrame([{
    'key': issue.key,
    'team': issue.fields.customfield_10718,
    'product': issue.fields.customfield_10674,
    'time': issue.fields.aggregatetimespent,
} for issue in issues])

closed_tickets.set_index('key', inplace=True)
closed_tickets['time'].fillna(closed_tickets['time'].median(), inplace=True)

closed_tickets

In my case, it's something around 800 tickets and only two fields for "learning": "team" and "product."

The next step is to obtain our target dataset. Why do I do it so early? I want to clean up and do "feature engineering" in one shot for both datasets. Otherwise, the mismatch between the structures can cause problems.

issues = jira.search_issues(
    f'project = PPS AND status IN (Open, Reopened)',
    maxResults=False,
    fields='summary,issuetype,customfield_10718,customfield_10674',
)

open_tickets = pd.DataFrame([{
    'key': issue.key,
    'team': issue.fields.customfield_10718,
    'product': issue.fields.customfield_10674,
} for issue in issues])

open_tickets.set_index('key', inplace=True)

open_tickets

Please notice we have no "time" column here because we want to predict it. Let's nullify it and combine both datasets to prepare the "features."

open_tickets['time'] = 0
tickets = pd.concat([closed_tickets, open_tickets])

tickets

Columns "team" and "product" contain string values. One of the ways of dealing with that is to transform each value into separate fields with boolean flags.

products = pd.get_dummies(tickets['product'], prefix='product')
tickets = pd.concat([tickets, products], axis=1)
tickets.drop('product', axis=1, inplace=True)

teams = pd.get_dummies(tickets['team'], prefix='team')
tickets = pd.concat([tickets, teams], axis=1)
tickets.drop('team', axis=1, inplace=True)

tickets

The result may look like the following:

JIRA Analytics with Pandas

After the combined dataset preparation, we can split it back into two parts:

closed_tickets = tickets[:len(closed_tickets)]
open_tickets = tickets[len(closed_tickets):][:]

Now it's time to train our model:

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

features = closed_tickets.drop(['time'], axis=1)
labels = closed_tickets['time']

features_train, features_val, labels_train, labels_val = train_test_split(features, labels, test_size=0.2)

model = DecisionTreeRegressor()
model.fit(features_train, labels_train)
model.score(features_val, labels_val)

And the final step is to use our model to make a prediction:

open_tickets['time'] = model.predict(open_tickets.drop('time', axis=1, errors='ignore'))
open_tickets['time'].sum() / 3600

The final output, in my case, is 25 hours, which is higher than our initial rough estimation. This was a basic example. However, by using ML tools, you can significantly expand your abilities to analyze JIRA data.

Conclusion

Sometimes, JIRA built-in tools and plugins are not sufficient for effective analysis. Moreover, many 3rd party plugins are rather expensive, costing thousands of dollars per year, and you will still struggle to make them work the way you want. However, you can easily utilize well-known data analysis tools by fetching necessary information via JIRA API and go beyond these limitations. I spent so many hours playing with various JIRA plugins in attempts to create good reports for projects, but they often missed some important parts. Building a tool or a full-featured service on top of JIRA API also often looks like overkill. That's why typical data analysis and ML tools like Jupiter, pandas, matplotlib, scikit-learn, and others may work better here.

JIRA Analytics with Pandas

版本聲明 本文轉載於:https://dev.to/sibprogrammer/jira-analytics-with-pandas-agl?1如有侵犯,請聯絡[email protected]刪除
最新教學 更多>
  • 哪種方法更有效地用於點 - 填點檢測:射線跟踪或matplotlib \的路徑contains_points?
    哪種方法更有效地用於點 - 填點檢測:射線跟踪或matplotlib \的路徑contains_points?
    在Python 射線tracing方法 matplotlib路徑對象表示多邊形。它檢查給定點是否位於定義路徑內。 This function is often faster than the ray tracing approach, as seen in the code snippet pr...
    程式設計 發佈於2025-02-19
  • 如何在JavaScript對像中動態設置鍵?
    如何在JavaScript對像中動態設置鍵?
    如何為JavaScript對像變量創建動態鍵,嘗試為JavaScript對象創建動態鍵,使用此Syntax jsObj['key' i] = 'example' 1;將不起作用。正確的方法採用方括號:他們維持一個長度屬性,該屬性反映了數字屬性(索引)和一個數字屬性的數量。標準對像沒有模仿這...
    程式設計 發佈於2025-02-19
  • 如何可靠地檢查MySQL表中的列存在?
    如何可靠地檢查MySQL表中的列存在?
    在mySQL中確定列中的列存在,驗證表中的列存在與與之相比有點困惑其他數據庫系統。常用的方法:如果存在(從信息_schema.columns select * * where table_name ='prefix_topic'和column_name =&...
    程式設計 發佈於2025-02-19
  • 如何為PostgreSQL中的每個唯一標識符有效地檢索最後一行?
    如何為PostgreSQL中的每個唯一標識符有效地檢索最後一行?
    [2最後一行與數據集中的每個不同標識符關聯。考慮以下數據: 1 2014-02-01 kjkj 1 2014-03-11 ajskj 3 2014-02-01 sfdg 3 2014-06-12 fdsa 為了檢索數據集中每個唯一ID的最後一行信息,您可以在操作員上使用Postgres的有效效...
    程式設計 發佈於2025-02-19
  • 如何使用char_length()在mySQL中按字符串長度對數據進行排序?
    如何使用char_length()在mySQL中按字符串長度對數據進行排序?
    [2使用內置的char_length()函數。 :返回字符串中的字符數,考慮多BYTE字符encoding(例如UTF-8)。 ] :返回字符串佔用的字節數,該字符的數量可能無法準確反映字符計數多字節編碼。 [&& && && && && && &&華從指定的表格中的所有行,並根據指定列的字符長度...
    程式設計 發佈於2025-02-19
  • 如何檢查對像是否具有Python中的特定屬性?
    如何檢查對像是否具有Python中的特定屬性?
    方法來確定對象屬性存在尋求一種方法來驗證對像中特定屬性的存在。考慮以下示例,其中嘗試訪問不確定屬性會引起錯誤: >>> a = someClass() >>> A.property Trackback(最近的最新電話): 文件“ ”,第1行, AttributeError:SomeClass實...
    程式設計 發佈於2025-02-19
  • 在映射到MySQL枚舉列時,如何確保冬眠保留值?
    在映射到MySQL枚舉列時,如何確保冬眠保留值?
    在hibernate中保存枚舉值:故障排除錯誤的列type ,他們各自的映射至關重要。在Java中使用枚舉類型時,至關重要的是,建立冬眠的方式如何映射到基礎數據庫。 在您的情況下,您已將MySQL列定義為枚舉,並在Java中創建了相應的枚舉代碼。但是,您遇到以下錯誤:“ MyApp中的錯誤列類型...
    程式設計 發佈於2025-02-19
  • 如何使用替換指令在GO MOD中解析模塊路徑差異?
    如何使用替換指令在GO MOD中解析模塊路徑差異?
    克服go mod中的模塊路徑差異 github.com/coreos/etcd/integration imports :解析GO.mod:模塊將其路徑聲明為: go.etcd.io/bbolt [&&&&&&&&&&&&&&&&&&&&&&&&&&&& github.com/coreos/b...
    程式設計 發佈於2025-02-19
  • 如何使用PHP從XML文件中有效地檢索屬性值?
    如何使用PHP從XML文件中有效地檢索屬性值?
    從php 您的目標可能是檢索“ varnum”屬性值,其中提取數據的傳統方法可能會使您留下PHP陷入困境。 使用simplexmlelement :: attributes()函數提供了簡單的解決方案。此函數可訪問對XML元素作為關聯數組的屬性: - > attributes()為$ att...
    程式設計 發佈於2025-02-19
  • Java是否允許多種返回類型:仔細研究通用方法?
    Java是否允許多種返回類型:仔細研究通用方法?
    在java中的多個返回類型:一個誤解介紹,其中foo是自定義類。該方法聲明似乎擁有兩種返回類型:列表和E。但是,情況確實如此嗎? 通用方法:拆開神秘 [方法僅具有單一的返回類型。相反,它採用機制,如鑽石符號“ ”。 分解方法簽名: :本節定義了一個通用類型參數,E。它表示該方法接受擴展FOO類的...
    程式設計 發佈於2025-02-19
  • 大批
    大批
    [2 數組是對象,因此它們在JS中也具有方法。 切片(開始):在新數組中提取部分數組,而無需突變原始數組。 令ARR = ['a','b','c','d','e']; // USECASE:提取直到索引作...
    程式設計 發佈於2025-02-19
  • 在沒有密碼提示的情況下,如何在Ubuntu上安裝MySQL?
    在沒有密碼提示的情況下,如何在Ubuntu上安裝MySQL?
    在ubuntu 使用debconf-set-selections 在安裝過程中避免密碼提示mysql root用戶。這需要以下步驟: sudo debconf-set-selections
    程式設計 發佈於2025-02-19
  • 如何以不同的頻率控制Android設備振動?
    如何以不同的頻率控制Android設備振動?
    控制使用頻率變化的Android設備振動是否想為您的Android應用程序添加觸覺元素?了解如何觸發設備的振動器至關重要。您可以做到這一點:生成基本振動以生成簡單的振動,使用振動器對象:這將導致設備在指定的持續時間內振動。 許可要求通過上述技術,您可以創建在您的Android應用程序中自定義振動,以...
    程式設計 發佈於2025-02-19
  • 如何克服PHP的功能重新定義限制?
    如何克服PHP的功能重新定義限制?
    克服PHP的函數重新定義限制在PHP中,多次定義一個相同名稱的函數是一個no-no。嘗試這樣做,如提供的代碼段所示,將導致可怕的“不能重新列出”錯誤。 //錯誤:“ cance redeclare foo()” 但是,PHP工具腰帶中有一個隱藏的寶石:runkit擴展。它使您能夠靈活地重新定...
    程式設計 發佈於2025-02-19

免責聲明: 提供的所有資源部分來自互聯網,如果有侵犯您的版權或其他權益,請說明詳細緣由並提供版權或權益證明然後發到郵箱:[email protected] 我們會在第一時間內為您處理。

Copyright© 2022 湘ICP备2022001581号-3