
JIRA Analytics with Pandas

Published on 2024-08-26

Problem

It's hard to deny that Atlassian JIRA is one of the most popular issue trackers and project management solutions. You can love it, you can hate it, but if you are hired as a software engineer, there is a high probability that you will meet JIRA.

If the project you are working on is very active, there can be thousands of JIRA issues of various types. If you are leading a team of engineers, you may be interested in analytical tools that help you understand what is going on in the project based on the data stored in JIRA. JIRA has some built-in reporting facilities, as well as 3rd party plugins, but most of them are pretty basic. For example, it's hard to find reasonably flexible "forecasting" tools.

The bigger the project, the less satisfied you are with integrated reporting tools. At some point, you will end up using an API to extract, manipulate, and visualize the data. During the last 15 years of JIRA usage, I have seen dozens of such scripts and services, written in various programming languages.

Many day-to-day tasks may require one-time data analysis, so writing services every time doesn't pay off. You can treat JIRA as a data source and use a typical data analytics tool belt. For example, you may take Jupyter, fetch the list of recent bugs in the project, prepare a list of "features" (attributes valuable for analysis), utilize pandas to calculate the statistics, and try to forecast trends using scikit-learn. In this article, I would like to explain how to do it.

Preparation

JIRA API Access

Here, we will talk about the cloud version of JIRA. But if you are using a self-hosted version, the main concepts are almost the same.

First of all, we need to create a secret key to access JIRA via the REST API. To do so, go to profile management: https://id.atlassian.com/manage-profile/profile-and-visibility. If you select the "Security" tab, you will find the "Create and manage API tokens" link.


Create a new API token here and store it securely. We will use this token later.


Jupyter Notebooks

One of the most convenient ways to play with datasets is to utilize Jupyter. If you are not familiar with this tool, do not worry. I will show how to use it to solve our problem. For local experiments, I like to use DataSpell by JetBrains, but there are services available online and for free. One of the most well-known services among data scientists is Kaggle. However, their notebooks don't allow you to make external connections to access JIRA via API. Another very popular service is Colab by Google. It allows you to make remote connections and install additional Python modules.

JIRA has a pretty easy-to-use REST API. You can make API calls using your favorite way of doing HTTP requests and parse the response manually. However, we will utilize an excellent and very popular jira module for that purpose.
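For comparison, here is a minimal sketch of the manual route using only the Python standard library. It builds (but does not send) an authenticated request. The site URL, email, and token are placeholders, the helper name build_jira_request is made up for this illustration, and the /rest/api/2/myself endpoint is used here only as a simple "who am I" connectivity check.

```python
import base64
import urllib.request


def build_jira_request(base_url, user, api_token, path="/rest/api/2/myself"):
    """Build a GET request carrying an HTTP Basic auth header (not sent yet)."""
    credentials = f"{user}:{api_token}".encode("utf-8")
    auth_header = "Basic " + base64.b64encode(credentials).decode("ascii")
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        headers={"Authorization": auth_header, "Accept": "application/json"},
    )


# Placeholder values for illustration only.
request = build_jira_request("https://example.atlassian.net", "me@example.com", "token")
print(request.full_url)

# Sending it would look like this (requires real credentials and network access):
#   import json
#   with urllib.request.urlopen(request) as response:
#       profile = json.load(response)
```

Every call would need this kind of plumbing plus pagination and error handling, which is the main reason to prefer a dedicated client module.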

Tools in Action

Data Analysis

Let's combine all the parts to come up with the solution.

Go to the Google Colab interface and create a new notebook. After creating the notebook, we need to store the previously obtained JIRA credentials as "secrets." Click the "Key" icon in the left toolbar to open the appropriate dialog and add two "secrets" with the following names: JIRA_USER (the email address of your Atlassian account) and JIRA_PASSWORD (the API token created earlier). At the bottom of the screen, you can see how to access these "secrets."


The next thing is to install an additional Python module for JIRA integration. We can do it by executing the shell command in the scope of the notebook cell:

!pip install jira

The output should look something like the following:

Collecting jira
  Downloading jira-3.8.0-py3-none-any.whl (77 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 77.5/77.5 kB 1.3 MB/s eta 0:00:00
Requirement already satisfied: defusedxml in /usr/local/lib/python3.10/dist-packages (from jira) (0.7.1)
...
Installing collected packages: requests-toolbelt, jira
Successfully installed jira-3.8.0 requests-toolbelt-1.0.0

We need to fetch the "secrets"/credentials:

from google.colab import userdata

JIRA_URL = 'https://******.atlassian.net'
JIRA_USER = userdata.get('JIRA_USER')
JIRA_PASSWORD = userdata.get('JIRA_PASSWORD')

And validate the connection to the JIRA Cloud:

from jira import JIRA

jira = JIRA(JIRA_URL, basic_auth=(JIRA_USER, JIRA_PASSWORD))
projects = jira.projects()
projects

If the connection is ok and the credentials are valid, you should see a non-empty list of your projects:

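[<JIRA Project: key='...', name='...', id='...'>,
 <JIRA Project: key='...', name='...', id='...'>,
 <JIRA Project: key='...', name='...', id='...'>,
...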

So we can connect and fetch data from JIRA. The next step is to fetch some data for analysis with pandas. Let’s try to fetch the list of solved problems during the last several weeks for some project:

JIRA_FILTER = 19762

issues = jira.search_issues(
    f'filter={JIRA_FILTER}',
    maxResults=False,
    fields='summary,issuetype,assignee,reporter,aggregatetimespent',
)
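A saved filter is convenient, but the same kind of query can also be written directly in JQL. The helper below is a hypothetical convenience function (it is not part of the jira module) that builds such a query string. The project key PPS, the status name Resolved, and the four-week window are example values that depend on your own project and workflow.

```python
def resolved_issues_jql(project, weeks):
    """Return a JQL query for issues of a project resolved in the last `weeks` weeks."""
    if weeks <= 0:
        raise ValueError("weeks must be positive")
    return f'project = "{project}" AND status = Resolved AND resolved >= -{weeks}w'


jql = resolved_issues_jql("PPS", 4)
print(jql)

# The resulting string can be passed as the first argument of
# jira.search_issues(...) instead of f'filter={JIRA_FILTER}'.
```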

We need to transform the dataset into a pandas DataFrame:

import pandas as pd

df = pd.DataFrame([{
    'key': issue.key,
    # Fall back to the reporter when the issue has no assignee
    'assignee': (issue.fields.assignee or issue.fields.reporter).displayName,
    'time': issue.fields.aggregatetimespent,  # logged time in seconds (may be None)
    'summary': issue.fields.summary,
} for issue in issues])

df.set_index('key', inplace=True)

df

The output may look like the following:


We would like to analyze how much time it usually takes to resolve an issue. People are not perfect, so sometimes they forget to log their work. This is a headache if you try to analyze such data with JIRA's built-in tools, but with pandas it is easy to make adjustments. For example, we can replace the missing values in the "time" field with the median value and convert the field from seconds into hours (beware: dropna can be more suitable if there are a lot of gaps):

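# Assign the result back: fillna(inplace=True) on a selected column is unreliable in recent pandas versions
df['time'] = df['time'].fillna(df['time'].median())
df['time'] = df['time'] / 3600  # seconds -> hours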
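To make the trade-off between the two options concrete, here is a small sketch on synthetic data (the ticket keys and durations are invented for illustration). Filling with the median keeps every ticket in the dataset, while dropping keeps only the tickets with logged work.

```python
import pandas as pd

# Synthetic time-spent values in seconds; None marks tickets with no logged work.
sample = pd.DataFrame(
    {"time": [3600, None, 7200, None, 10800]},
    index=["T-1", "T-2", "T-3", "T-4", "T-5"],
)

# Option 1: fill the gaps with the median of the logged values, then convert to hours.
filled = sample["time"].fillna(sample["time"].median()) / 3600

# Option 2: drop the tickets without logged work, then convert to hours.
dropped = sample["time"].dropna() / 3600

print(filled.tolist())
print(dropped.tolist())
```

With many gaps, the filled column becomes dominated by one repeated value, which flattens the distribution. That is the situation where dropna is the safer choice.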

We can easily visualize the distribution to find out anomalies:

df['time'].plot.bar(xlabel='', xticks=[])

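A chart is good for spotting anomalies by eye. As a numerical complement, the sketch below flags outliers with the common 1.5 × IQR rule of thumb. This rule is a swapped-in illustration rather than something the original analysis relied on, and the data is synthetic.

```python
import pandas as pd

# Synthetic hours per ticket; the last value is deliberately extreme.
hours = pd.Series(
    [1.0, 2.0, 2.0, 3.0, 2.5, 1.5, 40.0],
    index=[f"T-{i}" for i in range(1, 8)],
)

q1, q3 = hours.quantile(0.25), hours.quantile(0.75)
iqr = q3 - q1

# Values further than 1.5 * IQR from the quartiles are treated as outliers.
outliers = hours[(hours < q1 - 1.5 * iqr) | (hours > q3 + 1.5 * iqr)]
print(outliers)
```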

It is also interesting to see the distribution of solved problems by the assignee:

top_solvers = df.groupby('assignee').count()[['time']]
top_solvers.rename(columns={'time': 'tickets'}, inplace=True)
top_solvers.sort_values('tickets', ascending=False, inplace=True)

top_solvers.plot.barh().invert_yaxis()

It may look like the following:

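When the "time" column has no gaps left, the same per-assignee counts can be obtained more directly with value_counts(). The sketch below uses invented names and ticket keys.

```python
import pandas as pd

# Synthetic data: one row per resolved ticket.
sample = pd.DataFrame(
    {"assignee": ["Ann", "Bob", "Ann", "Cid", "Ann", "Bob"]},
    index=[f"T-{i}" for i in range(1, 7)],
)

# value_counts() counts rows per assignee and sorts them in descending order.
tickets_per_assignee = sample["assignee"].value_counts()
print(tickets_per_assignee.to_dict())
```

The resulting Series can be plotted in the same way, for example with plot.barh().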

Predictions

Let's try to predict the amount of time required to finish all open issues. Of course, we can do it without machine learning by using a simple approximation: the number of open issues multiplied by the typical (for example, median) time to resolve one. For example, if the median time to solve one issue is 2 hours and we have 9 open issues, the time required to solve them all is approximately 9 × 2 = 18 hours. It's a good enough forecast, but we know that the speed of solving depends on the product, team, and other attributes of the issue. If we want to improve the prediction, we can utilize machine learning.
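The rough estimate is simple enough to express in a few lines of standard-library Python. The durations below are hypothetical values chosen so that the median matches the 2 hours used in the example.

```python
from statistics import median

# Hypothetical hours spent on recently resolved issues.
resolved_hours = [1.0, 2.0, 2.0, 3.5, 1.5]
open_issue_count = 9

median_hours = median(resolved_hours)  # middle value of the sorted list
estimate_hours = open_issue_count * median_hours
print(median_hours, estimate_hours)
```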

The high-level approach looks like the following:

  • Obtain the dataset for “learning”
  • Clean up the data
  • Prepare the "features" aka "feature engineering"
  • Train the model
  • Use the model to predict some value of the target dataset

For the first step, we will use a dataset of tickets for the last 30 weeks. Some parts here are simplified for illustrative purposes. In real life, the amount of data for learning should be big enough to make a useful model (e.g., in our case, we need thousands of issues to be analyzed).

issues = jira.search_issues(
    'project = PPS AND status IN (Resolved) AND created >= -30w',
    maxResults=False,
    fields='summary,issuetype,customfield_10718,customfield_10674,aggregatetimespent',
)

closed_tickets = pd.DataFrame([{
    'key': issue.key,
    'team': issue.fields.customfield_10718,
    'product': issue.fields.customfield_10674,
    'time': issue.fields.aggregatetimespent,
} for issue in issues])

closed_tickets.set_index('key', inplace=True)
closed_tickets['time'] = closed_tickets['time'].fillna(closed_tickets['time'].median())

closed_tickets

In my case, it's something around 800 tickets and only two fields for "learning": "team" and "product."

The next step is to obtain our target dataset. Why do I do it so early? I want to clean up and do "feature engineering" in one shot for both datasets. Otherwise, the mismatch between the structures can cause problems.

issues = jira.search_issues(
    'project = PPS AND status IN (Open, Reopened)',
    maxResults=False,
    fields='summary,issuetype,customfield_10718,customfield_10674',
)

open_tickets = pd.DataFrame([{
    'key': issue.key,
    'team': issue.fields.customfield_10718,
    'product': issue.fields.customfield_10674,
} for issue in issues])

open_tickets.set_index('key', inplace=True)

open_tickets

Please notice that we have no "time" column here because it is the value we want to predict. Let's add it as a placeholder filled with zeros and combine both datasets to prepare the "features."

open_tickets['time'] = 0
tickets = pd.concat([closed_tickets, open_tickets])

tickets

Columns "team" and "product" contain string values. One of the ways of dealing with that is to transform each value into separate fields with boolean flags.

products = pd.get_dummies(tickets['product'], prefix='product')
tickets = pd.concat([tickets, products], axis=1)
tickets.drop('product', axis=1, inplace=True)

teams = pd.get_dummies(tickets['team'], prefix='team')
tickets = pd.concat([tickets, teams], axis=1)
tickets.drop('team', axis=1, inplace=True)

tickets

The result may look like the following:

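The reason for encoding the combined dataset rather than each part separately can be shown on a tiny synthetic example (the product names are invented). When the open tickets contain a value that never appears among the closed ones, separate encoding produces different columns, and a model trained on one set cannot be applied to the other.

```python
import pandas as pd

# Synthetic data: the open tickets contain a product unseen in the closed ones.
closed = pd.DataFrame({"product": ["Alpha", "Beta"]}, index=["T-1", "T-2"])
opened = pd.DataFrame({"product": ["Beta", "Gamma"]}, index=["T-3", "T-4"])

# Encoding each part separately yields two different column sets.
closed_cols = list(pd.get_dummies(closed["product"], prefix="product").columns)
open_cols = list(pd.get_dummies(opened["product"], prefix="product").columns)

# Encoding the combined frame yields one shared column set.
combined = pd.get_dummies(pd.concat([closed, opened])["product"], prefix="product")

print(closed_cols)
print(open_cols)
print(list(combined.columns))
```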

After the combined dataset preparation, we can split it back into two parts:

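n_closed = len(closed_tickets)
# .copy() gives independent frames, so assigning a column later does not trigger chained-assignment warnings
closed_tickets = tickets.iloc[:n_closed].copy()
open_tickets = tickets.iloc[n_closed:].copy()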

Now it's time to train our model:

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

features = closed_tickets.drop(['time'], axis=1)
labels = closed_tickets['time']

features_train, features_val, labels_train, labels_val = train_test_split(features, labels, test_size=0.2)

model = DecisionTreeRegressor()
model.fit(features_train, labels_train)
model.score(features_val, labels_val)
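For a regressor, score() returns the coefficient of determination (R²), which is hard to read in isolation. One way to put it in context is to compare the model against a naive baseline that always predicts the median. The sketch below does this on synthetic data; the single boolean feature, the noise level, and the choice of mean absolute error as the metric are all assumptions made for illustration.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: one boolean feature; flagged tickets take about twice as long.
rng = np.random.default_rng(0)
flag = rng.integers(0, 2, size=200)
X = flag.reshape(-1, 1)
y = 3600 * (1 + flag) + rng.normal(0, 300, size=200)  # seconds

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# Baseline: always predict the median of the training labels.
baseline = DummyRegressor(strategy="median").fit(X_train, y_train)
tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

baseline_mae = mean_absolute_error(y_val, baseline.predict(X_val))
tree_mae = mean_absolute_error(y_val, tree.predict(X_val))
print(f"baseline MAE: {baseline_mae:.0f} s, tree MAE: {tree_mae:.0f} s")
```

If the model does not clearly beat the median baseline on validation data, the simple "count × median" estimate is probably just as trustworthy.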

And the final step is to use our model to make a prediction:

open_tickets['time'] = model.predict(open_tickets.drop('time', axis=1, errors='ignore'))
open_tickets['time'].sum() / 3600

The final output, in my case, is 25 hours, which is higher than our initial rough estimate. This was a basic example. However, by using ML tools, you can significantly expand your abilities to analyze JIRA data.

Conclusion

Sometimes, JIRA's built-in tools and plugins are not sufficient for effective analysis. Moreover, many 3rd party plugins are rather expensive, costing thousands of dollars per year, and you will still struggle to make them work the way you want. However, you can go beyond these limitations by fetching the necessary information via the JIRA API and applying well-known data analysis tools. I have spent many hours playing with various JIRA plugins in attempts to create good reports for projects, but they often missed some important parts. Building a tool or a full-featured service on top of the JIRA API also often looks like overkill. That's why typical data analysis and ML tools like Jupyter, pandas, matplotlib, scikit-learn, and others may work better here.


Reprint notice: this article is reproduced from https://dev.to/sibprogrammer/jira-analytics-with-pandas-agl?1