According to the Entropix README, Entropix uses an entropy-based sampling method. This article explains the specific sampling techniques based on entropy and varentropy.
Let's start by explaining entropy and varentropy, as these are key factors in determining the sampling strategy.
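In information theory, entropy is a measure of the uncertainty of a random variable. The entropy of a random variable X is defined by the following equation, where p(x_i) is the probability that X takes the value x_i:
$$H(X) = -\sum_{i} p(x_i) \log p(x_i)$$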
Entropy is maximized when the probability distribution is uniform. Conversely, when a specific state is much more likely than others, entropy decreases.
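As a quick check of this behavior, here is a minimal sketch that computes entropy as the negative sum of p(x_i) log p(x_i) for a uniform and a peaked distribution. The helper name `entropy` and the choice of base-2 logarithms (bits) are assumptions made for illustration; the definition itself leaves the base open.

```python
import math

def entropy(probs):
    """Shannon entropy, in bits, of a discrete distribution given as probabilities."""
    # Outcomes with p == 0 contribute nothing, so they are skipped.
    return -sum(p * math.log2(p) for p in probs if p > 0.0)

uniform = [0.25, 0.25, 0.25, 0.25]   # every outcome equally likely
peaked = [0.97, 0.01, 0.01, 0.01]    # one outcome dominates

print(entropy(uniform))  # log2(4) = 2 bits, the maximum for four outcomes
print(entropy(peaked))   # far below the uniform case
```

For a vocabulary of n tokens the maximum is log2(n) bits, reached only by the uniform distribution.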
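Varentropy, closely related to entropy, measures how much the information content varies across outcomes. Writing the information content as I(X) = -log p(X), so that the entropy is H(X) = E[I(X)], the varentropy VE(X) of a random variable X is the variance of I(X):
$$VE(X) = \mathrm{Var}(I(X)) = \sum_{i} p(x_i) \left(-\log p(x_i) - H(X)\right)^2$$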
Varentropy becomes large when the probabilities p(x_i) of the outcomes that can actually occur vary greatly. It is zero when every outcome with nonzero probability is equally likely: either the distribution is uniform (maximum entropy), or one value has a probability of 1 and all others have a probability of 0.
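The two zero-varentropy cases can be verified numerically. The sketch below treats varentropy as the variance of the information content -log p(x_i) around the entropy; the function name `entropy_varentropy` and the use of base-2 logarithms are assumptions for illustration, not taken from the Entropix code.

```python
import math

def entropy_varentropy(probs):
    """Return (entropy, varentropy) of a discrete distribution, using log base 2."""
    support = [p for p in probs if p > 0.0]          # zero-probability outcomes never occur
    info = [-math.log2(p) for p in support]          # information content I(x_i)
    h = sum(p * i for p, i in zip(support, info))    # entropy = expected information
    ve = sum(p * (i - h) ** 2 for p, i in zip(support, info))  # variance of information
    return h, ve

print(entropy_varentropy([0.25, 0.25, 0.25, 0.25]))  # uniform: varentropy is 0
print(entropy_varentropy([1.0, 0.0, 0.0, 0.0]))      # one-hot: entropy and varentropy are 0
print(entropy_varentropy([0.7, 0.1, 0.1, 0.1]))      # skewed: varentropy is positive
```

Entropy alone cannot separate "one clear winner plus a long tail" from "a few equally good candidates"; varentropy is the extra signal that distinguishes them.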
Next, let's explore how sampling strategies change based on entropy and varentropy values.
1. Low Entropy, Low Varentropy → Argmax
In this scenario, one token has a much higher predicted probability than all the others. Since the next token is almost certain, argmax is used: the most likely token is selected directly.
    if ent ...
2. Low Entropy, High Varentropy → Branch
This occurs when there is some confidence, but multiple viable options exist. In this case, the Branch strategy is used to sample from multiple choices and select the best outcome.
    elif ent < ... and vent > 5.0:
        temp_adj = 1.2 + 0.3 * interaction_strength
        top_k_adj = max(5, int(top_k * (1 + 0.5 * (1 - agreement))))
        return _sample(logits, temperature=min(1.5, temperature * temp_adj), top_p=top_p, top_k=top_k_adj, min_p=min_p, generator=generator)
Although this strategy is called "Branch," the current code appears to adjust the sampling range and select a single path. (If anyone has more insight, further clarification would be appreciated.)
3. High Entropy, Low Varentropy → CoT or Insert Pause Token
When the prediction probabilities of the next token are fairly uniform, indicating that the next context is not certain, a clarification token is inserted to resolve the ambiguity.
    elif ent > 3.0 and vent ...
4. High Entropy, High Varentropy → Resample
In this case, there are multiple contexts, and the prediction probabilities of the next token are low. A resampling strategy is used with a higher temperature setting and a lower top-p.
    elif ent > 5.0 and vent > 5.0:
        temp_adj = 2.0 + 0.5 * attn_vent
        top_p_adj = max(0.5, top_p - 0.2 * attn_ent)
        return _sample(logits, temperature=max(2.0, temperature * temp_adj), top_p=top_p_adj, top_k=top_k, min_p=min_p, generator=generator)
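To make the effect of these two knobs concrete, here is a minimal pure-Python sketch of temperature scaling followed by top-p (nucleus) filtering. It is not Entropix's `_sample`: the function name `temperature_top_p` is hypothetical, and top-k and min-p are left out to keep the example short.

```python
import math

def temperature_top_p(logits, temperature=1.0, top_p=1.0):
    """Return {token_index: probability} after temperature scaling and top-p filtering."""
    # Dividing by a temperature above 1 flattens the distribution.
    scaled = [l / temperature for l in logits]
    m = max(scaled)                                   # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most likely tokens until their cumulative probability reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize over the surviving tokens.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

logits = [4.0, 2.0, 1.0, 0.0]
base = temperature_top_p(logits, temperature=1.0, top_p=0.9)
hot = temperature_top_p(logits, temperature=2.0, top_p=0.7)
print(base)  # the top token holds most of the probability
print(hot)   # the top token's share is smaller at the higher temperature
```

Raising the temperature spreads probability toward less likely tokens, while lowering top-p cuts off the tail, so the two adjustments together explore more among the leading candidates without admitting very unlikely ones.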
Intermediate Cases
If none of the above conditions are met, adaptive sampling is performed. Multiple samples are drawn, each one is scored based on entropy, varentropy, and attention information, and the sample with the best score is selected.
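The draw-several-and-keep-the-best idea can be sketched in a few lines. This is only an illustration: `best_of_n` and its `score_fn` argument are hypothetical names, and the scoring function is supplied by the caller rather than reproducing the entropy, varentropy, and attention score used by Entropix.

```python
import math
import random

def best_of_n(logits, score_fn, n_samples=5, temperature=1.0, rng=None):
    """Draw n_samples token indices from softmax(logits / temperature)
    and return the candidate with the highest score_fn(index)."""
    if rng is None:
        rng = random.Random()
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    # Unnormalized softmax weights; random.choices normalizes them internally.
    weights = [math.exp(s - m) for s in scaled]
    candidates = rng.choices(range(len(logits)), weights=weights, k=n_samples)
    return max(candidates, key=score_fn)

logits = [2.0, 1.5, 0.5, -1.0]
# Placeholder score for this sketch: prefer the candidate with the highest logit.
token = best_of_n(logits, score_fn=lambda i: logits[i], n_samples=5, rng=random.Random(0))
print(token)
```

Passing a seeded `random.Random` makes the draw reproducible, which is convenient when comparing scoring functions.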
    else:
        return adaptive_sample(
            logits,
            metrics,
            gen_tokens,
            n_samples=5,
            base_temp=temperature,
            base_top_p=top_p,
            base_top_k=top_k,
            generator=generator
        )
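Putting the cases together, the overall decision can be pictured as a lookup over the four quadrants with a fallback. The sketch below is a simplification under stated assumptions: the function name `choose_strategy`, the strategy labels, and the two thresholds `low` and `high` are placeholders chosen for illustration, not the values or structure used in Entropix.

```python
def choose_strategy(ent, vent, low=1.0, high=5.0):
    """Map (entropy, varentropy) to the name of a sampling strategy.

    `low` and `high` are hypothetical thresholds for this sketch only.
    """
    if ent < low and vent < low:
        return "argmax"            # 1. confident: take the most likely token
    if ent < low and vent > high:
        return "branch"            # 2. some confidence, several viable options
    if ent > high and vent < low:
        return "pause_or_cot"      # 3. uniformly uncertain: insert a clarification token
    if ent > high and vent > high:
        return "resample"          # 4. very uncertain: higher temperature, lower top-p
    return "adaptive"              # intermediate cases: best of several samples

print(choose_strategy(0.05, 0.05))  # argmax
print(choose_strategy(0.5, 6.0))    # branch
print(choose_strategy(6.0, 0.5))    # pause_or_cot
print(choose_strategy(6.0, 6.0))    # resample
print(choose_strategy(3.0, 3.0))    # adaptive
```

Because the checks run in order and each quadrant leaves a gap between `low` and `high`, any pair of values that is not clearly in one quadrant falls through to the adaptive case.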