Data Visualization Basics

Published on 2024-11-08

Why use data visualization

When you work with a new data source that contains a huge amount of data, visualizing it helps you understand the data better.
The data analysis process is usually done in 5 steps:

Extract - Obtain the data from a spreadsheet, SQL, the web, etc. 
Clean - Here we could use exploratory visuals. 
Explore - Here we use exploratory visuals. 
Analyze - Here we might use either exploratory or explanatory visuals. 
Share - Here is where explanatory visuals live.

Types of data

To be able to choose an appropriate plot for a given measure, it is important to know what data you are dealing with.

Qualitative aka categorical types

Nominal qualitative data

Labels with no order or rank associated with the items themselves.
Examples: Gender, marital status, menu items

Ordinal qualitative data

Labels that have an order or ranking.
Examples: letter grades, ratings

Quantitative aka numeric types

Discrete quantitative values

Numbers that cannot be split into smaller units.
Examples: Pages in a Book, number of trees in a park

Continuous quantitative values

Numbers that can be split into smaller units.
Examples: height, age, income, working hours

Summary Statistics

Numerical Data

Mean: The average value.
Median: The middle value when the data is sorted.
Mode: The most frequently occurring value.
Variance/Standard Deviation: Measures of spread or dispersion.
Range: Difference between the maximum and minimum values.

Categorical Data

Frequency: The count of occurrences of each category.
Mode: The most frequent category.
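
As a minimal sketch, the statistics listed above can be computed with pandas. The sample values and variable names are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only.
heights = pd.Series([150, 160, 160, 170, 180])
colors = pd.Series(["red", "blue", "red", "green", "red"])

numeric_summary = {
    "mean": heights.mean(),
    "median": heights.median(),
    "mode": heights.mode().tolist(),        # mode() may return several values
    "variance": heights.var(),              # sample variance (ddof=1)
    "std": heights.std(),                   # sample standard deviation (ddof=1)
    "range": heights.max() - heights.min(),
}

category_counts = colors.value_counts()     # frequency of each category
category_mode = colors.mode().tolist()      # most frequent category (or categories)

print(numeric_summary)
print(category_counts)
print(category_mode)
```

Note that pandas uses the sample variance by default; pass ddof=0 if you want the population variance.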

Visualizations

Visualizations let you gain insights into a new data source very quickly and make connections between different variables easier to see.
If you only use standard statistics to summarize your data, you get the min, max, mean, median and mode, but these can be misleading. Anscombe's Quartet shows this: its four datasets have nearly identical means, standard deviations and correlations, but their distributions look completely different.
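
The following sketch checks this claim for Anscombe's Quartet. The values are typed in from the commonly published table of the quartet; the variable names are my own. It prints the summary statistics per dataset and draws one scatterplot each.

```python
import numpy as np
import matplotlib.pyplot as plt

# Anscombe's Quartet: datasets I-III share the same x values.
x_shared = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x_shared, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x_shared, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x_shared, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# Summary statistics are almost the same for all four datasets.
summary = {}
for name, (x, y) in quartet.items():
    summary[name] = {
        "mean_y": float(np.mean(y)),
        "std_y": float(np.std(y, ddof=1)),
        "corr_xy": float(np.corrcoef(x, y)[0, 1]),
    }
    print(name, {key: round(value, 2) for key, value in summary[name].items()})

# The scatterplots reveal how different the datasets really are.
fig, axes = plt.subplots(2, 2, sharex=True, sharey=True)
for ax, (name, (x, y)) in zip(axes.flat, quartet.items()):
    ax.scatter(x, y)
    ax.set_title(f"Dataset {name}")
# Call plt.show() in an interactive session to display the figure.
```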

In data visualization, we have two types:

Exploratory data visualization: We use this to get insights about the data. It does not need to be visually appealing.
Explanatory data visualization: These visualizations need to be accurate, insightful and visually appealing because they are presented to the users.

Chart Junk, Data Ink Ratio and Design Integrity

Chart Junk

To read the information in a plot without distraction, it is important to avoid chart junk, such as:

Heavy grid lines
Pictures in the visuals
Shading
3D components
Ornaments
Superfluous texts

Data Ink Ratio

The less chart junk a visual contains, the higher its data-ink ratio is. This simply means that the more of the "ink" in a visual is used to convey the message of the data, the better.

Design Integrity

The Lie Factor is calculated as:

$$
\text{Lie Factor} = \frac{\text{Size of effect shown in graphic}}{\text{Size of effect in data}}
$$

Here the size of an effect is the relative change: the difference (delta) between the second and the first value, divided by the first value. The lie factor is therefore the relative change shown in the graphic divided by the actual relative change in the data. Ideally it should be 1. If it is not, there is a mismatch between the way the data is presented and the actual change.

In the example taken from the wiki, the lie factor is 3 when comparing the pixels of each doctor icon, which represent the number of doctors in California.
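
A small sketch of the calculation, treating the size of an effect as the relative change between two values. The function names and the numbers are invented for illustration; they are not the values from the doctors example.

```python
def relative_change(first, second):
    """Relative change from first to second: (second - first) / first."""
    return (second - first) / first


def lie_factor(graphic_first, graphic_second, data_first, data_second):
    """Relative change shown in the graphic divided by the relative change in the data."""
    return relative_change(graphic_first, graphic_second) / relative_change(data_first, data_second)


# Hypothetical case: the data grows from 100 to 150 (+50%),
# but the drawn element grows from 10 to 25 pixels (+150%).
example = lie_factor(graphic_first=10, graphic_second=25, data_first=100, data_second=150)
print(example)
```

A result above 1 means the graphic exaggerates the change, a result below 1 means it understates it.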


Tidy data

Make sure your data is cleaned properly and ready to use:

each variable is a column
each observation is a row
each type of observational unit is a table
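
As an illustration of the rules above, here is a sketch that reshapes a made-up "wide" table into tidy form with pandas. The table and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical wide table: the variable "year" is hidden in the column names,
# so each row holds two observations instead of one.
wide = pd.DataFrame({
    "country": ["A", "B"],
    "2023": [10, 20],
    "2024": [12, 18],
})

# melt() turns the year columns into rows: one variable per column,
# one observation (country, year) per row.
tidy = wide.melt(id_vars="country", var_name="year", value_name="sales")
tidy["year"] = tidy["year"].astype(int)

print(tidy)
```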

Univariate Exploration of Data

This refers to the analysis of a single variable (or feature) in a dataset.

Bar Chart

always start the value axis at 0, so that the bar lengths are truly comparable.
sort nominal data by frequency
don't sort ordinal data by frequency; keep its inherent order, because here it is more important to see how often the most important category appears than which category is the most frequent
if you have a lot of categories, use a horizontal bar chart with the categories on the y-axis, which makes the labels easier to read.
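
A minimal matplotlib sketch of these guidelines for nominal data. The menu items are invented sample data.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical nominal data.
menu_items = pd.Series(["pizza", "salad", "pizza", "soup", "pizza", "salad"])

# Count each category and sort by frequency.
# Ascending order puts the most frequent category at the top of a horizontal chart.
counts = menu_items.value_counts().sort_values()

fig, ax = plt.subplots()
ax.barh(list(counts.index), counts.values)  # categories on the y-axis
ax.set_xlim(left=0)                         # value axis starts at 0
ax.set_xlabel("count")
# Call plt.show() in an interactive session to display the figure.
```

For ordinal data you would skip the sort and reorder the counts by the natural order of the categories instead.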


Histogram

quantitative version of a bar chart, used to plot numeric values.
values are grouped into contiguous bins, and one bar is plotted for each bin
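
A short sketch of a histogram with explicit bin edges. The data is randomly generated and the bin width of 5 is an arbitrary choice for this example.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=0)
values = rng.normal(loc=170, scale=10, size=500)  # hypothetical heights in cm

# Adjacent bins of width 5 that cover the whole data range.
bin_width = 5
first_edge = np.floor(values.min() / bin_width) * bin_width
bin_edges = np.arange(first_edge, values.max() + bin_width, bin_width)

fig, ax = plt.subplots()
counts, edges, _ = ax.hist(values, bins=bin_edges)  # one bar per bin
ax.set_xlabel("height (cm)")
ax.set_ylabel("count")
# Call plt.show() in an interactive session to display the figure.
```

Trying a few different bin widths is usually worthwhile, because the apparent shape of the distribution depends on them.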

KDE - Kernel Density Estimation

places a smooth kernel, often a Gaussian (normal) curve, on every data point and sums these kernels to estimate the density at each point.
KDE plots can reveal trends and the shape of the distribution more clearly, especially for data that is not uniformly distributed.
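
To make the idea concrete, here is a hand-written sketch of a one-dimensional Gaussian KDE using only NumPy. The function name, the sample data and the bandwidth of 0.5 are my own choices; plotting libraries pick the bandwidth automatically and may give slightly different curves.

```python
import numpy as np


def gaussian_kde_1d(data, grid, bandwidth):
    """Average of Gaussian kernels centered on each data point, evaluated on grid."""
    data = np.asarray(data, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Scaled distance of every grid point to every data point.
    z = (grid[:, None] - data[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernels.sum(axis=1) / (len(data) * bandwidth)


rng = np.random.default_rng(seed=0)
# Hypothetical bimodal sample: two groups centered at 0 and 5.
data = np.concatenate([rng.normal(0, 1, 200), rng.normal(5, 1, 200)])

grid = np.linspace(-5, 10, 301)
density = gaussian_kde_1d(data, grid, bandwidth=0.5)

# A density should integrate to roughly 1 (simple Riemann sum over the grid).
area = float(np.sum(density) * (grid[1] - grid[0]))
print(round(area, 3))
```

Plotting density against grid as a line gives the KDE curve; a larger bandwidth produces a smoother curve, a smaller one follows the data more closely.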

Pie Chart and Donut Plot

data needs to be in relative frequencies
pie charts work best with at most 3 slices. With more wedges the chart becomes hard to read and the amounts are hard to compare; in that case prefer a bar chart.
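
A sketch of a donut plot based on relative frequencies. The marital status values are invented sample data.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical categorical data with three levels.
status = pd.Series(["single", "married", "married", "single", "married", "divorced"])

# normalize=True returns relative frequencies that sum to 1.
shares = status.value_counts(normalize=True)

fig, ax = plt.subplots()
ax.pie(
    shares.values,
    labels=list(shares.index),
    autopct="%1.0f%%",           # print the percentage on each wedge
    startangle=90,
    counterclock=False,          # start at the top and go clockwise
    wedgeprops={"width": 0.4},   # a wedge width below 1 turns the pie into a donut
)
ax.axis("equal")                 # keep the circle round
# Call plt.show() in an interactive session to display the figure.
```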

Bivariate Exploration of Data

Analyzes the relationship between two variables in a dataset.

Clustered Bar Charts

displays the relationship between two categorical variables. The bars are organized in clusters based on the level of the first variable.
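
One possible way to build a clustered bar chart is a cross-tabulation in pandas, sketched below. The data frame and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical data with two categorical variables.
df = pd.DataFrame({
    "vehicle_class": ["compact", "compact", "suv", "suv", "suv", "compact"],
    "transmission": ["manual", "automatic", "automatic", "automatic", "manual", "manual"],
})

# Rows: levels of the first variable (one cluster each).
# Columns: levels of the second variable (one bar per cluster each).
counts = pd.crosstab(df["vehicle_class"], df["transmission"])
print(counts)

ax = counts.plot(kind="bar")  # pandas draws the bars with matplotlib
ax.set_ylabel("count")
# Call matplotlib.pyplot.show() in an interactive session to display the figure.
```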

Scatterplots

each data point is plotted individually as a point, its x-position corresponding to one feature value and its y-position corresponding to the second.
if the plot suffers from overplotting (too many data points overlap), you can use transparency and jitter (every point is moved slightly from its true value)
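
A sketch of both remedies on generated data. The jitter width of 0.2 and the alpha value of 0.2 are arbitrary choices for this example.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=0)
# Hypothetical discrete data: many points share exactly the same coordinates.
x = rng.integers(1, 6, size=1000)            # values 1 to 5
y = 2 * x + rng.integers(-2, 3, size=1000)   # integer noise from -2 to 2

# Jitter: shift every point by a small random amount (for plotting only).
x_jittered = x + rng.uniform(-0.2, 0.2, size=x.size)
y_jittered = y + rng.uniform(-0.2, 0.2, size=y.size)

fig, ax = plt.subplots()
# Transparency: dense regions appear darker because points stack up.
ax.scatter(x_jittered, y_jittered, alpha=0.2)
ax.set_xlabel("x")
ax.set_ylabel("y")
# Call plt.show() in an interactive session to display the figure.
```

The jittered values should only be used for drawing; any statistics are still computed on the original x and y.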

Heatmaps

2D version of a histogram
data points are placed with their x-position corresponding to one feature value and their y-position corresponding to the second.
the plotting area is divided into a grid, the points that fall into each cell are counted, and the counts are indicated by color
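
The grid counting described above can be sketched with matplotlib's hist2d. The data is generated and the choice of 20 bins per axis is arbitrary.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=0)
# Hypothetical pair of related numeric variables.
x = rng.normal(0, 1, 2000)
y = 0.5 * x + rng.normal(0, 1, 2000)

fig, ax = plt.subplots()
# hist2d splits the plotting area into a 20 x 20 grid and counts the points per cell.
counts, x_edges, y_edges, image = ax.hist2d(x, y, bins=20, cmap="viridis")
fig.colorbar(image, ax=ax, label="count")  # the color encodes the count
ax.set_xlabel("x")
ax.set_ylabel("y")
# Call plt.show() in an interactive session to display the figure.
```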

Violin plots

show the relationship between quantitative (numerical) and qualitative (categorical) variables at a lower level of abstraction.
the distribution is plotted like a kernel density estimate, so we get a clear picture of its shape
to display the key statistics at the same time, you can embed a box plot in a violin plot.

Box plots

it also plots the relationship between quantitative (numerical) and qualitative (categorical) variables, but at a higher level of abstraction than the violin plot.
compared to the violin plot, the box plot leans more on the summarization of the data, primarily just reporting a set of descriptive statistics for the numeric values on each categorical level.
it visualizes the five-number summary of the data: minimum, first quartile (Q1), median (Q2), third quartile (Q3), and maximum.

Key elements of a boxplot:
Box: The central part of the plot represents the interquartile range (IQR), which is the range between the first quartile (Q1, 25th percentile) and the third quartile (Q3, 75th percentile). This contains the middle 50% of the data.

Median Line: Inside the box, a line represents the median (Q2, 50th percentile) of the dataset.

Whiskers: Lines extending from the box, known as "whiskers," show the range of the data that lies within 1.5 times the IQR from Q1 and Q3. They typically extend to the smallest and largest values within this range.

Outliers: Any data points that lie more than 1.5 times the IQR below Q1 or above Q3 are considered outliers and are often represented by individual dots or marks beyond the whiskers.
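
The key elements above can be computed by hand, as in this NumPy sketch. The sample values are invented, and np.percentile uses linear interpolation by default, so other tools may report slightly different quartiles.

```python
import numpy as np

# Hypothetical sample with one extreme value.
data = np.array([2, 4, 4, 5, 6, 7, 8, 9, 10, 30])

q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1  # height of the box

# Fences: 1.5 times the IQR below Q1 and above Q3.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points inside the fences.
inside = data[(data >= lower_fence) & (data <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Everything outside the fences is drawn as an individual outlier.
outliers = data[(data < lower_fence) | (data > upper_fence)]

print("Q1, median, Q3:", q1, median, q3)
print("whiskers:", whisker_low, whisker_high)
print("outliers:", outliers)
```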

Combined Violin and Box Plot

The violin plot shows the density across different categories, and the boxplot provides the summary statistics
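
One way to combine both plots is to draw them on the same matplotlib axes, as sketched here. The three groups are generated data and the box width of 0.15 is an arbitrary choice.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=0)
# Hypothetical numeric values for three categories.
groups = {
    "A": rng.normal(50, 5, 200),
    "B": rng.normal(60, 10, 200),
    "C": rng.normal(55, 3, 200),
}
samples = list(groups.values())
positions = np.arange(1, len(groups) + 1)

fig, ax = plt.subplots()
# Violin: density of each category (extrema lines are left to the box plot).
violin_parts = ax.violinplot(samples, positions=positions, showextrema=False)
# Narrow box plot drawn inside each violin: median, quartiles, whiskers, outliers.
box_parts = ax.boxplot(samples, positions=positions, widths=0.15)

ax.set_xticks(positions)
ax.set_xticklabels(list(groups.keys()))
ax.set_ylabel("value")
# Call plt.show() in an interactive session to display the figure.
```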

Faceting

the data is divided into disjoint subsets, most often by the levels of a categorical variable. For each of these subsets, the same plot type is rendered for the other variables, e.g. several histograms next to each other, one per category level.
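
A sketch of faceting done by hand with matplotlib subplots: one histogram per level of a categorical variable. The data frame and its column names are hypothetical.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=0)
# Hypothetical data: one categorical and one numeric variable.
df = pd.DataFrame({
    "species": np.repeat(["a", "b", "c"], 100),
    "length": np.concatenate([
        rng.normal(10, 1.0, 100),
        rng.normal(12, 1.5, 100),
        rng.normal(15, 2.0, 100),
    ]),
})

levels = sorted(df["species"].unique())
# Shared bin edges and shared axes keep the facets comparable.
bin_edges = np.linspace(df["length"].min(), df["length"].max(), 21)

fig, axes = plt.subplots(1, len(levels), sharex=True, sharey=True, figsize=(9, 3))
for ax, level in zip(axes, levels):
    subset = df.loc[df["species"] == level, "length"]
    ax.hist(subset, bins=bin_edges)
    ax.set_title(f"species = {level}")
# Call plt.show() in an interactive session to display the figure.
```

Seaborn offers FacetGrid for the same purpose, which saves the manual loop.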

Line plot

used to plot the trend of one numeric variable against a second variable.

Quantile-Quantile (Q-Q) plot

is a type of plot used to compare the distribution of a dataset with a theoretical distribution (like a normal distribution) or to compare two datasets to check if they follow the same distribution.
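
A sketch of a Q-Q plot against the normal distribution, built from NumPy and the standard library. The sample is generated, and the plotting positions (i - 0.5) / n are one common convention among several.

```python
from statistics import NormalDist

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=0)
sample = rng.normal(100, 15, 300)  # hypothetical measurements

n = sample.size
# Probabilities assigned to the sorted observations.
probabilities = (np.arange(1, n + 1) - 0.5) / n

# Theoretical quantiles of the standard normal distribution.
standard_normal = NormalDist()
theoretical = np.array([standard_normal.inv_cdf(p) for p in probabilities])

# Observed quantiles: the sorted sample, standardized to mean 0 and std 1.
observed = (np.sort(sample) - sample.mean()) / sample.std(ddof=1)

# If the sample is close to normal, the points lie near the line y = x.
correlation = float(np.corrcoef(theoretical, observed)[0, 1])

fig, ax = plt.subplots()
ax.scatter(theoretical, observed, s=10)
ax.plot(theoretical, theoretical, color="gray")  # reference line y = x
ax.set_xlabel("theoretical quantiles")
ax.set_ylabel("observed quantiles")
# Call plt.show() in an interactive session to display the figure.
```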

Swarm plot

Similar to a scatterplot, each data point is plotted with its position according to its value on the two variables being plotted. Instead of randomly jittering points as in a normal scatterplot, points are placed as close to their actual value as possible without allowing any overlap.

Spider plot

compares multiple variables across different categories on a radial grid. Also known as a radar chart.
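
A sketch of a spider plot on matplotlib's polar axes. The variables, categories and scores are invented for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical scores of two categories on five variables.
variables = ["speed", "range", "comfort", "price", "safety"]
scores = {
    "model X": [4, 3, 5, 2, 4],
    "model Y": [3, 5, 2, 4, 3],
}

# One angle per variable, evenly spread around the circle.
angles = np.linspace(0, 2 * np.pi, len(variables), endpoint=False)
# Repeat the first angle at the end so that each outline closes.
closed_angles = np.append(angles, angles[0])

fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
for name, values in scores.items():
    closed_values = values + values[:1]
    ax.plot(closed_angles, closed_values, label=name)
    ax.fill(closed_angles, closed_values, alpha=0.2)

ax.set_xticks(angles)
ax.set_xticklabels(variables)
ax.legend()
# Call plt.show() in an interactive session to display the figure.
```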

Useful links

My sample notebook

Sample Code

Libs used for the sample plots:

Matplotlib: a versatile library for visualizations, but it can take some code effort to put together common visualizations.
Seaborn: built on top of matplotlib, adds a number of functions to make common statistical visualizations easier to generate.
pandas: while this library includes some convenient methods for visualizing data that hook into matplotlib, we'll mainly be using it for its main purpose as a general tool for working with data (https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf).