Mage is a powerful tool for ETL tasks, offering data exploration and mining features, quick visualizations through chart templates, and several other capabilities that make working with data feel magical.
During an ETL process it is common to find missing data that can cause problems later on. Depending on what we intend to do with the dataset, null values can be quite disruptive.
To identify missing data in our dataset, we can use Python and the pandas library to inspect which values are null, and we can create graphs that show even more clearly the impact of those nulls on the dataset.
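As a quick illustration outside of Mage, a minimal pandas sketch like the one below can surface the null counts; the file path is a hypothetical local copy of the training CSV:

import pandas as pd

# Hypothetical path to a local copy of the Kaggle training file.
df = pd.read_csv('train.csv')

# Null count per column.
print(df.isna().sum())

# Columns that contain at least one null value.
print(df.columns[df.isna().any()].tolist())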
Our pipeline consists of four steps: loading the data, two transformation steps, and exporting the data.
In this article we will use the Binary Prediction of Poisonous Mushrooms dataset, which is available on Kaggle as part of a competition; we will work with the training set provided on the site.
Let's create a Data Loader block using Python to load the data we are going to use. Before this step, I created a table in the Postgres database that I run locally on my machine and loaded the dataset into it. Since the data lives in Postgres, we will use Mage's predefined Postgres loading template.
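For reference, one possible way to seed such a table is with pandas and SQLAlchemy. This is just a sketch: the connection string, credentials, and file path are assumptions to adapt to your own setup.

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical local connection string; adjust user, password and database name.
engine = create_engine('postgresql://postgres:postgres@localhost:5432/postgres')

# Load the Kaggle training file and write it to a 'mushroom' table.
df = pd.read_csv('train.csv')
df.to_sql('mushroom', engine, schema='public', index=False, if_exists='replace')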
from mage_ai.settings.repo import get_repo_path
from mage_ai.io.config import ConfigFileLoader
from mage_ai.io.postgres import Postgres
from os import path

if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader
if 'test' not in globals():
    from mage_ai.data_preparation.decorators import test


@data_loader
def load_data_from_postgres(*args, **kwargs):
    """
    Template for loading data from a PostgreSQL database.
    Specify your configuration settings in 'io_config.yaml'.

    Docs: https://docs.mage.ai/design/data-loading#postgresql
    """
    query = 'SELECT * FROM mushroom'  # Specify your SQL query here
    config_path = path.join(get_repo_path(), 'io_config.yaml')
    config_profile = 'default'

    with Postgres.with_config(ConfigFileLoader(config_path, config_profile)) as loader:
        return loader.load(query)


@test
def test_output(output, *args) -> None:
    """
    Template code for testing the output of the block.
    """
    assert output is not None, 'The output is undefined'
Within the load_data_from_postgres() function, we define the query that loads the table from the database. In my case, I configured the database connection details in the io_config.yaml file, where they are stored under the default profile, so we only need to pass the name 'default' to the config_profile variable.
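For context, the Postgres section of io_config.yaml under the default profile looks roughly like the sketch below; every value here is a placeholder for your own connection details:

default:
  POSTGRES_DBNAME: postgres
  POSTGRES_SCHEMA: public
  POSTGRES_USER: postgres
  POSTGRES_PASSWORD: postgres
  POSTGRES_HOST: localhost
  POSTGRES_PORT: 5432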
After executing the block, we will use the Add chart feature, which provides information about our data through predefined templates. Just click on the icon next to the play button, marked in the image with a yellow line.
We will select two options to explore the dataset further: summary_overview and feature_profiles. Through summary_overview we obtain the number of columns and rows in the dataset, as well as a breakdown of columns by type, for example the total number of categorical, numeric, and boolean columns. feature_profiles, on the other hand, presents more descriptive information about the data, such as type, minimum value, and maximum value, and even shows the missing values, which are the focus of our treatment.
To focus on the missing data, let's use the % of missing values template: a bar chart showing the percentage of missing data in each column.
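A rough equivalent of that chart can be reproduced with plain pandas and matplotlib, continuing from the earlier sketch where the data was loaded into df (this is not the Mage template itself, just an approximation):

import matplotlib.pyplot as plt

# Percentage of missing values per column, highest first.
missing_pct = df.isna().mean().sort_values(ascending=False) * 100

missing_pct.plot(kind='bar', figsize=(10, 4), ylabel='% missing')
plt.tight_layout()
plt.show()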
The graph shows four columns in which missing values account for more than 80% of the content, along with other columns that have smaller amounts of missing data. This information now lets us choose different strategies for dealing with the null values.
For the columns with more than 80% null values, our strategy will be to drop them from the dataframe, selecting the columns to exclude. Using a TRANSFORMER block in Python, we will select the Column removal option.
from mage_ai.data_cleaner.transformer_actions.base import BaseAction
from mage_ai.data_cleaner.transformer_actions.constants import ActionType, Axis
from mage_ai.data_cleaner.transformer_actions.utils import build_transformer_action
from pandas import DataFrame

if 'transformer' not in globals():
    from mage_ai.data_preparation.decorators import transformer
if 'test' not in globals():
    from mage_ai.data_preparation.decorators import test


@transformer
def execute_transformer_action(df: DataFrame, *args, **kwargs) -> DataFrame:
    """
    Execute Transformer Action: ActionType.REMOVE

    Docs: https://docs.mage.ai/guides/transformer-blocks#remove-columns
    """
    action = build_transformer_action(
        df,
        action_type=ActionType.REMOVE,
        arguments=['veil_type', 'spore_print_color', 'stem_root', 'veil_color'],
        axis=Axis.COLUMN,
    )

    return BaseAction(action).execute(df)


@test
def test_output(output, *args) -> None:
    """
    Template code for testing the output of the block.
    """
    assert output is not None, 'The output is undefined'
Within the execute_transformer_action() function, we insert into the arguments variable a list with the names of the columns we want to exclude from the dataset. After this step, just execute the block.
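Outside of Mage, the equivalent plain-pandas step would be a single drop call (a sketch, assuming the same column names):

# Plain-pandas equivalent of the REMOVE transformer action.
df = df.drop(columns=['veil_type', 'spore_print_color', 'stem_root', 'veil_color'])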
Now, for the columns with less than 80% null values, we will use the Fill in missing values strategy: in some cases, despite the missing data, replacing it with values such as the mean or the mode can meet the dataset's needs without causing major changes, depending on your final objective.
In some tasks, such as classification, replacing missing data with a value that is representative of the dataset (mode, mean, or median) can help the classification algorithm, which might reach different conclusions if the data were simply deleted, as in the other strategy we used.
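As a small illustration of the idea, filling a categorical column with its mode in plain pandas looks like this (a sketch; gill_spacing is one of the columns treated later):

# Most frequent value (mode) of the column.
mode_value = df['gill_spacing'].mode()[0]

# Replace nulls with the mode.
df['gill_spacing'] = df['gill_spacing'].fillna(mode_value)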
To decide which measure to use, we will turn to Mage's Add chart functionality again. Using the Most frequent values template, we can see the mode of each column and how often that value appears.
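A quick pandas equivalent of that inspection, for one of the candidate columns (a sketch, assuming df from the earlier steps):

# Frequency of each value; the first row is the mode and its count.
print(df['cap_surface'].value_counts().head())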
Following steps similar to the previous ones, we will use the Fill in missing values transformer to fill the missing data with the mode of each of the columns: stem_surface, gill_spacing, cap_surface, gill_attachment, and ring_type.
from mage_ai.data_cleaner.transformer_actions.constants import ImputationStrategy
from mage_ai.data_cleaner.transformer_actions.base import BaseAction
from mage_ai.data_cleaner.transformer_actions.constants import ActionType, Axis
from mage_ai.data_cleaner.transformer_actions.utils import build_transformer_action
from pandas import DataFrame

if 'transformer' not in globals():
    from mage_ai.data_preparation.decorators import transformer
if 'test' not in globals():
    from mage_ai.data_preparation.decorators import test


@transformer
def execute_transformer_action(df: DataFrame, *args, **kwargs) -> DataFrame:
    """
    Execute Transformer Action: ActionType.IMPUTE

    Docs: https://docs.mage.ai/guides/transformer-blocks#fill-in-missing-values
    """
    action = build_transformer_action(
        df,
        action_type=ActionType.IMPUTE,
        arguments=df.columns,  # Specify columns to impute
        axis=Axis.COLUMN,
        options={'strategy': ImputationStrategy.MODE},  # Specify imputation strategy
    )

    return BaseAction(action).execute(df)


@test
def test_output(output, *args) -> None:
    """
    Template code for testing the output of the block.
    """
    assert output is not None, 'The output is undefined'
In the execute_transformer_action() function, we define the replacement strategy in a Python dictionary passed to the options parameter. For more replacement options, see the transformer documentation: https://docs.mage.ai/guides/transformer-blocks#fill-in-missing-values.
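After the block runs, a simple sanity check (a sketch, using the column names from this pipeline) confirms the treated columns no longer contain nulls:

# Columns imputed with the mode; none should have nulls left.
cols = ['stem_surface', 'gill_spacing', 'cap_surface', 'gill_attachment', 'ring_type']
assert df[cols].isna().sum().sum() == 0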
After carrying out all the transformations, we will save the now-treated dataset to the same Postgres database, but under a different name so we can tell the two apart. Using the Data Exporter block and selecting Postgres, we define the schema and the table where we want to save the data, remembering that the database configuration was previously saved in the io_config.yaml file.
from mage_ai.settings.repo import get_repo_path
from mage_ai.io.config import ConfigFileLoader
from mage_ai.io.postgres import Postgres
from pandas import DataFrame
from os import path

if 'data_exporter' not in globals():
    from mage_ai.data_preparation.decorators import data_exporter


@data_exporter
def export_data_to_postgres(df: DataFrame, **kwargs) -> None:
    """
    Template for exporting data to a PostgreSQL database.
    Specify your configuration settings in 'io_config.yaml'.

    Docs: https://docs.mage.ai/design/data-loading#postgresql
    """
    schema_name = 'public'  # Specify the name of the schema to export data to
    table_name = 'mushroom_clean'  # Specify the name of the table to export data to
    config_path = path.join(get_repo_path(), 'io_config.yaml')
    config_profile = 'default'

    with Postgres.with_config(ConfigFileLoader(config_path, config_profile)) as loader:
        loader.export(
            df,
            schema_name,
            table_name,
            index=False,  # Specifies whether to include index in exported table
            if_exists='replace',  # Specify resolution policy if table name already exists
        )
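To confirm the export, you can query the new table back; the connection string below is a hypothetical local one, matching the earlier sketch:

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical local connection string.
engine = create_engine('postgresql://postgres:postgres@localhost:5432/postgres')

# Row count of the cleaned table written by the Data Exporter block.
print(pd.read_sql('SELECT COUNT(*) FROM public.mushroom_clean', engine))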
Repository: https://github.com/DeadPunnk/Mushrooms/tree/main