9 Months of Machine Learning and Beyond: Machine Learning A-Z

Introduction

Before I even started properly studying machine learning last summer, I had already purchased several machine learning courses on Udemy. The most basic among those courses was Machine Learning A-Z: AI, Python & R, so it became my starting point. This course served as a perfect introduction to the field, covering a wide range of classical machine learning techniques and some deep learning.

Course Impression

Typically, as programmers, we work with structured data. However, the world is inherently messy. Machine learning proves to be an invaluable tool for dealing with unstructured information. I was very impressed by the course because it introduced a whole new world of approaches that felt like gaining a superpower.

Course Content

The course explains the machine learning process step by step. The initial, crucial stage of the process is data preprocessing, which happens even before any algorithms can be applied.

Data Preprocessing

Preprocessing begins with data splitting. It is common to divide the dataset into three parts: training, validation, and test sets. The training set is used to train the model, the validation set helps assess overfitting during training, and the test set is used to evaluate the model's performance after training.
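Since the course works in Python, here is a minimal sketch of such a three-way split with scikit-learn's train_test_split; the toy data and the 60/20/20 ratio are illustrative assumptions, not the course's exact code:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples, 3 features, one continuous target.
X = np.random.rand(100, 3)
y = np.random.rand(100)

# First carve out 20% as the test set, then split the rest 75/25
# into training and validation (60/20/20 overall).
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```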

Handling missing data is another critical aspect. Depending on the situation and the amount of missing data, there are two primary options (a short sketch follows the list):

  • Imputing missing values based on other data points
  • Removing rows with missing data entirely
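Both options can be expressed in a few lines; here is a rough sketch using pandas and scikit-learn's SimpleImputer on a made-up DataFrame:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"age": [25, np.nan, 40, 31],
                   "salary": [50000, 64000, np.nan, 58000]})

# Option 1: impute missing values, here with the column mean.
imputer = SimpleImputer(strategy="mean")
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

# Option 2: drop rows that contain any missing value.
df_dropped = df.dropna()

print(df_imputed)
print(df_dropped)
```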

Moreover, it is often important to perform feature scaling, because some machine learning algorithms are sensitive to the scale of the input data. For instance, algorithms that compute distances between data points, like K-Nearest Neighbors (K-NN), will be biased towards variables with a larger scale if the data is not adjusted to compensate for this. Feature scaling helps ensure that the independent variables contribute equally to the analysis. This can be done through methods like normalization or standardization. Normalization rescales features to a fixed range, usually from 0 to 1. Standardization adjusts each feature to have a mean of 0 and a standard deviation of 1.
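Both methods are available in scikit-learn; a minimal sketch on made-up data:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

# Normalization: rescale each feature to the [0, 1] range.
X_norm = MinMaxScaler().fit_transform(X)

# Standardization: zero mean and unit standard deviation per feature.
X_std = StandardScaler().fit_transform(X)

print(X_norm)
print(X_std)
```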

These preprocessing steps are necessary to create robust machine learning models that perform well in real-world scenarios.

Classic Machine Learning Models

Regression

Regression models are a type of statistical tool used for predicting a continuous outcome based on one or more input variables. They are fundamental for forecasting and determining the strength of relationships between variables. These models work by creating an equation that best fits the observed data. I already had some experience with regression models, especially Linear Regression, from the statistics courses I took years ago.

Polynomial Regression extends linear regression by adding terms with powers greater than one. This allows the model to fit a wider range of data shapes, capturing more complex relationships between variables. However, higher-degree polynomials can lead to overfitting, where the model fits the training data too closely and performs poorly on unseen data. This occurs because the model learns noise from the training data, mistaking it for actual relationships.
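In scikit-learn, polynomial regression is usually built by expanding the features and then fitting an ordinary linear model; here is a small sketch on synthetic quadratic data (the degree and data are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Quadratic ground truth with some noise.
rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 50).reshape(-1, 1)
y = 0.5 * X.ravel() ** 2 + rng.normal(scale=0.5, size=50)

# Degree-2 polynomial regression: expand features, then fit a linear model.
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)
print(model.predict([[1.5]]))
```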

Next, the course introduces Support Vector Regression (SVR), a powerful model that can encapsulate non-linear relationships with a lower risk of overfitting and can model exponential relationships. The main goal of SVR is to create a prediction line that fits most of the data points as closely as possible while also trying to keep the line as smooth and flat as possible. In other words, SVR tries to strike a balance between closely following the training data and avoiding overly complex models that might not work well on new, unseen data. It does this by allowing for a small margin of error, within which deviations are acceptable. This makes SVR a robust choice for predicting continuous values, especially when the data is complex or has a lot of variability.
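As a rough sketch of the idea, scikit-learn's SVR exposes the tolerated margin of error as epsilon and the trade-off between flatness and fit as C; the kernel, data, and hyperparameters below are illustrative assumptions:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, 60)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=60)

# RBF-kernel SVR; epsilon defines the margin within which deviations are tolerated,
# C trades off flatness of the curve against errors larger than epsilon.
model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=0.1))
model.fit(X, y)
print(model.predict([[2.5]]))
```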

After that, Decision Trees and Random Forests are introduced. Typically known for classification, these techniques are also applicable in regression settings. The course explains how these models can predict an output based on decision rules inferred from the data features. Decision Trees and Random Forests create models based on a series of binary decisions from the features within the dataset. This approach can lead to models that fit well on training data but fail to generalize to new data, because the decision-making process is arbitrary and doesn't necessarily capture the underlying mathematical relationships between variables.

On the other hand, methods like SVR and Polynomial Regression aim to identify the mathematical relationships inherent in the data. For example, SVR tries to fit the best possible curve within a certain margin of error, and polynomial regression can model relationships that follow a polynomial equation. If the true relationship between the variables is mathematical, these methods are likely to perform better with less risk of overfitting. This ability to uncover and leverage mathematical relationships makes SVR, Linear, and Polynomial Regression more robust for predicting outcomes where the underlying data relationships are strong and clear.

Model Selection in Regression

The section on regression wraps up with strategies for choosing the best model. Experimenting with different approaches and evaluating their performance on test data is still recommended, since experimentation remains the only reliable way to select a truly optimal model.

Classification

Classification involves predicting a categorical response based on input variables.

Logistic Regression, despite its name, is a basic classification technique, ideal for binary classification problems. It is used to predict outcomes that have two possible states, e.g., yes/no or true/false. It works by modelling the probability of the default class, usually labeled 1, as a function of the input features. Logistic regression applies the logistic function to the output of a linear equation, producing a probability score between 0 and 1. This model is robust, straightforward, and efficient for binary classification problems.
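A minimal sketch with scikit-learn on a made-up one-feature problem, showing how the model produces probabilities before committing to a class:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy binary problem: one feature, label is 1 when the feature is "large".
X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression()
clf.fit(X, y)

# predict_proba returns P(class 0) and P(class 1) via the logistic function.
print(clf.predict_proba([[2.0]]))
print(clf.predict([[2.0]]))
```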

The next model in the course is K-Nearest Neighbors (K-NN). It classifies a data point based on how its neighbors are classified, capable of handling multi-class problems and more complex decision boundaries.

The course also covers Support Vector Machines (SVM) for classification, explaining the use of different kernels to handle linear and non-linear classification. Support Vector Machine constructs a hyperplane in a multidimensional space to separate different classes. SVM performs well in high-dimensional spaces. It is versatile due to its ability to use different kernel functions to make the hyperplane more adaptable to the data. For example, linear kernels are great for linearly separable data, while radial basis function (RBF) kernels can map non-linear relationships.
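A small illustration of the kernel choice, assuming scikit-learn's SVC and the classic two-moons toy dataset, which is not linearly separable:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two interleaving half-moons: a non-linear decision boundary is needed.
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel).fit(X_train, y_train)
    print(kernel, clf.score(X_test, y_test))  # the RBF kernel should score higher here
```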

Clustering

Classification and clustering are both methods of organizing data but serve different purposes. Classification is a supervised learning approach where the model is trained on labeled data. This means the model learns from examples that already have an assigned category or class. Its task is to predict the category for new data based on what it has learned. For example, a classification model might determine whether emails are spam or not spam based on training with a dataset of emails labeled accordingly.

Clustering, on the other hand, is an unsupervised learning technique that involves grouping a set of objects in such a way that objects in the same group are more similar to each other than to those in other groups. It’s used when we don’t have predefined labels for data. The model itself discovers the inherent groupings in the data. An example of clustering might be segmenting customers into groups based on purchasing behavior without prior knowledge of the different customer types.

Both methods are fundamental in data analysis:

  • Classification uses labeled data for predictive modeling.
  • Clustering helps to discover hidden patterns in data.

Clustering Techniques

K-Means is a popular clustering technique that partitions data into K distinct, non-overlapping clusters based on their features. The process involves randomly initializing K points as cluster centers and assigning each data point to the nearest cluster based on Euclidean distance. The cluster centers are then recalculated as the mean of the assigned points, and this process repeats until the centroids stabilize and no longer move significantly. This method is particularly effective for large datasets and is widely used due to its simplicity and efficiency. K-Means works best with data where the clusters are spherical and evenly sized, making it less effective with complex cluster shapes.
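A minimal K-Means sketch with scikit-learn on three synthetic blobs (the data and the choice of K are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

# Three roughly spherical blobs of points.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
               for c in ([0, 0], [3, 3], [0, 3])])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)   # learned centroids
print(kmeans.labels_[:10])       # cluster assignments of the first points
```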

Hierarchical Clustering, unlike K-Means, does not require the number of clusters to be specified in advance. It builds a hierarchy of clusters either by a divisive method or an agglomerative method.

In the agglomerative approach, each data point starts as its own cluster, and pairs of clusters are merged as one moves up the hierarchy. The process continues until all points are merged into a single cluster at the top of the hierarchy. This method is beneficial for identifying the level of similarity between data points and is visually represented using a dendrogram, which can help determine the number of clusters by cutting the dendrogram at a suitable level.
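As a rough sketch, SciPy's hierarchy module can build the agglomerative linkage and then "cut" it into a chosen number of clusters; the data here is made up:

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Two made-up groups of points.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (10, 2)), rng.normal(3, 0.3, (10, 2))])

# Agglomerative clustering with Ward linkage; Z encodes the full merge hierarchy.
Z = linkage(X, method="ward")

# "Cut" the dendrogram into two clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)

# dendrogram(Z)  # draws the merge tree for visual inspection (needs matplotlib)
```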

The divisive method of hierarchical clustering, also known as top-down clustering, starts with all observations in a single cluster and progressively splits the cluster into smaller ones. This approach begins at the top of the hierarchy and works its way down, making it conceptually straightforward: every split is designed to create the most distinct and coherent clusters possible at each level of division.

In practice, the divisive method involves examining the cluster at each step and choosing the best point to split it. This involves measuring the distance between observations within the cluster and identifying the largest distance as the point to divide. The process continues recursively, splitting each subsequent cluster until each observation is its own cluster or until a specified number of clusters is reached. It is generally more computationally intensive than the agglomerative approach, as it requires a global view of the data at each split, making it less commonly used in very large datasets.

Hierarchical clustering is particularly useful for smaller datasets or when the relationships between data points need to be closely examined, such as in biological sciences or when clustering historical data.

Deep Learning Models

Deep learning is a subset of machine learning that employs neural networks with many layers. It is significantly different from classical machine learning techniques. While classical machine learning focuses on features that are often manually selected and engineered, deep learning aims to train neural networks to learn features. The models automate feature extraction by building complex patterns from simpler ones. This makes deep learning exceptionally powerful for tasks such as image and speech recognition, where the input data is high-dimensional and the relationships within the data are complex. However, training deep learning models requires vast amounts of data.

Artificial Neural Network

A fundamental element of deep learning is the feedforward, densely connected neural network, or Artificial Neural Network (ANN). In these networks, neurons are arranged in layers, with the first layer taking the input data and the last layer producing the output. Each neuron in one layer connects to every neuron in the next layer, making the network "densely connected." These neurons have weights and biases that are adjusted as the network learns from data during the training process. The output of each neuron is calculated by a nonlinear activation function, which introduces the ability to capture nonlinear relationships in the data.

In ANNs, a layer of neurons can be represented by a matrix of weights and a vector of biases. Data is propagated forward through these layers using matrix multiplication: the output of each layer is calculated by multiplying the input data by the weight matrix and adding the bias term. This output then passes through an activation function before it is sent to the next layer.
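A bare-bones sketch of this forward pass in NumPy, with random weights standing in for a trained network:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

# One hidden layer: 3 inputs -> 4 hidden units -> 1 output (weights are random here).
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)

x = np.array([0.2, -1.0, 0.5])   # a single input vector
h = relu(x @ W1 + b1)            # hidden layer: matrix multiply, add bias, apply activation
y_hat = h @ W2 + b2              # output layer (linear, e.g. for regression)
print(y_hat)
```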

The activation function is crucial because it introduces non-linearity into the model, allowing the network to learn and model complex, non-linear relationships in the data. Without non-linear activation functions, the network, regardless of its depth, would still behave just like a single-layer perceptron, which can only learn linear boundaries.

Convolutional Neural Network

An alternative to basic ANNs is the Convolutional Neural Network (CNN). Unlike densely connected networks where every input is connected to each neuron, CNNs operate over volumes of pixels and use filters to create feature maps that summarize the presence of detected features in the input, such as edges in images. This makes CNNs highly efficient for tasks that involve spatial hierarchies, as they reduce the number of parameters needed, reducing the computational burden.

Convolutional Neural Networks are specialized kinds of neural networks for processing data that has a grid-like topology, such as images. CNNs use filters that perform convolution operations as the filter slides over the input to create a feature map that summarizes the presence of detected features in the input. This makes them exceptionally efficient for image-related tasks.

CNNs leverage the mathematical operation of convolution, a fundamental technique in digital signal processing. In the context of DSP, convolution is used to modify a signal with a filter, extracting important features. Similarly, in CNNs, convolution involves applying a filter over an image to produce a feature map. This process effectively allows the network to detect similarities or specific features in the image that correspond to the filter. For example, a filter might learn to detect edges or specific shapes.

As the input image is processed through successive convolutional layers, the CNN uses multiple filters at each layer to search for increasingly complex patterns. The first layer may detect simple edges or textures, while deeper layers can recognize more complex features like parts of objects or entire objects.
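As an illustration of such a stack of convolutional layers, here is a minimal Keras sketch; the layer sizes and the binary-classification head are assumptions made for the example, not the course's exact architecture:

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(64, 64, 3)),               # 64x64 RGB images
    layers.Conv2D(32, (3, 3), activation="relu"),  # first filters: edges, textures
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),  # deeper filters: more complex patterns
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),         # binary classification head
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```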

Gradient Descent and Training Neural Networks

Gradient descent is a fundamental optimization algorithm used in training neural networks and other machine learning models. It works by iteratively adjusting the model's parameters to minimize the loss function, which measures how well the model's predictions match the actual data. In each step, the algorithm computes the gradient of the loss function with respect to the model parameters, and moves the parameters in the direction that reduces the loss.

Backpropagation is the technique used to compute these gradients efficiently in neural networks. It involves two phases:

  • A forward pass, where input data is passed through the network to generate predictions.
  • A backward pass, where the gradient of the loss function is computed based on the prediction. It is later propagated back through the network to update the weights.

This process leverages the chain rule of calculus to estimate gradients, ensuring that each weight is adjusted in proportion to its contribution to the overall error. Together, Gradient Descent and Backpropagation enable neural networks to learn from data by iteratively improving their accuracy.
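To make the idea concrete, here is plain gradient descent in NumPy on the simplest possible model, a one-parameter line fitted by minimizing mean squared error; backpropagation generalizes exactly this gradient computation to many layers via the chain rule:

```python
import numpy as np

# Gradient descent on a one-parameter linear model y = w * x, minimizing MSE.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100)
y = 3.0 * x + rng.normal(scale=0.1, size=100)   # true slope is 3

w, lr = 0.0, 0.1
for step in range(200):
    y_hat = w * x
    grad = np.mean(2 * (y_hat - y) * x)   # d(MSE)/dw via the chain rule
    w -= lr * grad                        # move against the gradient
print(w)  # should approach 3
```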

The Loss Functions

The loss function, also known as a cost function or error function, plays a critical role in guiding the training process. It quantifies the difference between the predicted outputs of the network and the actual target values. This metric provides a concrete measure of how well the network is performing. The goal of training is to minimize this loss, thereby optimizing the model's parameters.

Commonly used loss functions in ANNs vary depending on the specific type of task (a small numeric sketch follows the list):

  • For regression tasks, where the goal is to predict continuous values, the Mean Squared Error (MSE) loss is frequently used. MSE calculates the average of the squares of the differences between the predicted and actual values, penalizing larger errors more severely.
  • For classification tasks, where the output is a class label, Cross-Entropy Loss is commonly employed. This loss function measures the dissimilarity between the true label distribution and the predictions provided by the model.
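Both losses are short formulas; here they are computed by hand in NumPy on made-up predictions:

```python
import numpy as np

# Mean Squared Error for a regression task.
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 1.9, 3.5])
mse = np.mean((y_true - y_pred) ** 2)

# Binary cross-entropy for a classification task
# (labels are 0/1, predictions are probabilities).
labels = np.array([1, 0, 1])
probs = np.array([0.9, 0.2, 0.6])
bce = -np.mean(labels * np.log(probs) + (1 - labels) * np.log(1 - probs))

print(mse, bce)
```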

The Vanishing Gradient Problem and ReLU

One significant challenge when building deep neural networks is the vanishing gradient problem. The gradients used in the training process can become too small, which prevents the weights from changing their values and stops the network from updating its parameters sufficiently.

This issue is particularly prominent with sigmoid or tanh activation functions. To mitigate this, deep learning has adopted the Rectified Linear Unit (ReLU) activation function, defined as ReLU(x) = max(0, x), where x represents the input to a neuron. This function helps maintain a stronger gradient during training, allowing deeper networks to learn effectively without the gradients vanishing. Its simplicity and efficiency in introducing non-linearity without shrinking the gradient make ReLU a popular choice in deep learning architectures.

Specialized Machine Learning Techniques

The course then progresses to a variety of more specialized machine learning techniques, each tailored to specific applications and domains.

Natural Language Processing

Natural Language Processing (NLP) involves the application of computational techniques to the analysis and synthesis of natural language and speech. One of the main challenges in using machine learning for NLP is that text data is inherently unstructured and high-dimensional. Text must be converted into a numerical format that machine learning algorithms can process, a task complicated by the nuances of language such as syntax, semantics, and context.

The Bag of Words

The Bag of Words (BoW) model addresses this by transforming text into fixed-length vectors, counting how frequently each word appears in a document while ignoring the order and context of words. This method simplifies text data, making it manageable for basic machine learning models and serving as a foundational technique for text classification tasks, such as spam detection or sentiment analysis. However, the simplicity of the BoW model, namely its disregard for word order and semantic context, limits its effectiveness for more complex language tasks.
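A minimal BoW sketch using scikit-learn's CountVectorizer on two made-up sentences:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)        # sparse matrix of word counts

print(vectorizer.get_feature_names_out()) # the vocabulary
print(X.toarray())                        # one fixed-length count vector per document
```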

Reinforcement Learning with UCB and Thompson Sampling

Reinforcement Learning (RL) is a type of machine learning where an agent learns to make decisions by interacting with an environment. It differs from supervised learning in that correct input/output pairs are never presented and sub-optimal actions are never explicitly corrected. The strategy evolves by balancing exploration (trying new things) and exploitation (using known information) in the decision-making process.

The agent takes actions based on a policy, receives feedback through rewards or punishments, and updates its policy to maximize long-term rewards. Two notable strategies in RL that help manage the exploration-exploitation dilemma are the Upper Confidence Bound (UCB) and Thompson Sampling.

UCB is an algorithm that prioritizes exploration by selecting actions that have either high rewards or have not been tried often. The idea is to balance the known rewards with the potential of finding higher rewards in lesser-tried actions. UCB does this by constructing confidence bounds around the estimates of action rewards and choosing the action with the highest upper confidence bound. This approach systematically reduces uncertainty and improves decision-making over time.
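A bare-bones UCB1 sketch for a three-armed bandit in plain Python; the hidden success probabilities and the exploration bonus sqrt(2 ln t / n) are standard textbook choices used here purely for illustration:

```python
import math
import random

true_probs = [0.2, 0.5, 0.7]   # hidden reward probability of each arm
counts = [0, 0, 0]             # how many times each arm was pulled
rewards = [0.0, 0.0, 0.0]      # total reward collected per arm

for t in range(1, 1001):
    ucb_values = []
    for arm in range(3):
        if counts[arm] == 0:
            ucb_values.append(float("inf"))  # try every arm at least once
        else:
            mean = rewards[arm] / counts[arm]
            bonus = math.sqrt(2 * math.log(t) / counts[arm])  # exploration bonus
            ucb_values.append(mean + bonus)
    arm = ucb_values.index(max(ucb_values))   # pick the highest upper confidence bound
    reward = 1 if random.random() < true_probs[arm] else 0
    counts[arm] += 1
    rewards[arm] += reward

print(counts)  # the best arm (index 2) should dominate the pulls
```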

Thompson Sampling takes a Bayesian approach to the exploration-exploitation problem. It involves sampling from the posterior distributions of the rewards for each action and selecting the action with the highest sample. This method allows for a more probabilistic exploration based on the known performance of actions, dynamically balancing between exploring new actions and exploiting the known ones based on their reward probability distributions.

Both UCB and Thompson Sampling are powerful techniques in situations where the learning environment is initially unknown to the agent, allowing for systematic exploration and optimized learning based on the feedback received from the environment. These methods are particularly useful in real-time decision-making scenarios like A/B testing or network routing.

Dimensionality Reduction Techniques

Principal Component Analysis (PCA) is a statistical technique used for dimensionality reduction while preserving as much variance as possible. It works by identifying the so-called principal components: the directions along which the variance of the data is maximized. It reduces the dimension of the data by transforming the original variables into a new set of orthogonal variables. Orthogonality makes these new variables as uncorrelated as possible while accounting for the maximum variance in the data. This is particularly useful for reducing the number of variables in data while maintaining the relationships that contribute most to its variance. By transforming the data into a new set of dimensions with reduced complexity, PCA helps in visualizing high-dimensional data, speeding up learning algorithms, and removing noise.
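A minimal PCA sketch with scikit-learn, projecting made-up five-dimensional data onto its two leading components:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                      # 100 samples, 5 features
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=100)     # make one feature nearly redundant

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)                   # project onto the top 2 components

print(pca.explained_variance_ratio_)               # share of variance kept by each component
print(X_reduced.shape)                             # (100, 2)
```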

Linear Discriminant Analysis (LDA), on the other hand, is also a dimensionality reduction technique but focuses on maximizing the separability among known categories. It tries to model the differences between the classes in the data. LDA achieves this by finding a linear combination of features that separates the classes. The resulting combination can be used as a linear classifier or for dimensionality reduction before a later classification step.

Both PCA and LDA serve slightly different purposes:

  • PCA is unsupervised, focusing on variance in the data.
  • LDA is supervised, focusing on maximizing class separability.

Modern Model Selection and Boosting Techniques

The latter part of the course explores advanced model selection strategies and introduces boosting. Boosting works by combining multiple weak learners into a stronger model in a sequential manner. Each learner in the sequence focuses on the errors made by the previous one, gradually improving the model's accuracy. The learners are usually simple models like decision trees, and each one contributes incrementally to the final decision, making the ensemble stronger than any individual model alone.

Extreme Gradient Boosting

One of the most popular implementations of this technique is Extreme Gradient Boosting (XGBoost), which stands out due to its efficiency and effectiveness across a wide range of predictive modeling tasks.
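A minimal sketch using the xgboost package's scikit-learn wrapper; the dataset and hyperparameters are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Synthetic binary classification problem.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Gradient-boosted trees: many shallow trees, each correcting the previous ones.
model = XGBClassifier(n_estimators=200, learning_rate=0.1, max_depth=3)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```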

Conclusion

The "Machine Learning A-Z: AI, Python & R" course is a great starting point for anyone interested in machine learning. It covers a lot of important topics and gives a broad overview, but it’s just the beginning.

Finishing this course won’t make you an expert ready for a specialized machine learning job right away. Instead, think of it as a first step. It helps you understand the basics and shows you what parts of machine learning might be most interesting to you.
