[{"content":" OpenAI has modified their API to return the log probabilities before any logit bias is applied. Hence the methods described in this article are no longer applicable.\nThe aim of this article is to provide an alternative derivation of the results in Matthew Finlayson\u0026rsquo;s article Obtaining logprobs from an LLM API.\nMost LLM APIs return the logprobs of only the top-$ k $ predictions. The value of $ k $ is often small, on the order of 10. The goal is to extract the logprobs of all tokens in the vocabulary. Fortunately, some APIs allow adding a logit bias to tokens. The idea is to use the logit bias to control which tokens are returned by the API.\nThe contribution of this article is to unify the derivations in the original article under a single lemma, providing a clearer and more comprehensible derivation.\nPreliminaries Most language models are trained to predict the next token. They do so by computing logits $ l_i $ for each token $ i $ in the vocabulary. The logits $(l_i)_{i=1}^v$ are then converted to probabilities $ (p_i)_{i=1}^v$ by the softmax function: $$ p_i = \\frac{e^{l_i}}{\\sum_{j=1}^v e^{l_j}}, $$ where $ v $ is the size of the vocabulary. The normalizing constant $ Z = \\sum_{j=1}^v e^{l_j} $ ensures that the probabilities sum to 1.\nDefinition (Change in logits). Given the original logits $ ( l_1, l_2, \\ldots, l_v) $ and the modified logits $ ( l_1', l_2', \\ldots, l_v') $, we define the change in logits as $$ ( \\Delta l_1, \\Delta l_2, \\ldots, \\Delta l_v) = ( l_1' - l_1, l_2' - l_2, \\ldots, l_v' - l_v) . $$ We prove the following lemma, which relates the probability distributions before and after a change in logits.\nLemma\nFor any $ i \\in \\{1, 2, \\dots, v\\} $, we have $$ p_i = p_i' \\left(\\frac{e^{-\\Delta l_i}}{\\sum_{j=1}^v p_j' e^{-\\Delta l_j}}\\right). 
\\tag{1} $$ The significance of the equation is that the original probability $ p_i $ is written in terms of the modified probabilities $(p_j')_{j=1}^v$ and the change in logits $ (\\Delta l_j)_{j=1}^v $. Similarly, we have $$ p_i' = p_i \\left(\\frac{e^{\\Delta l_i}}{\\sum_{j=1}^v p_j e^{\\Delta l_j}}\\right). \\tag{2} $$ proof. To begin, we write down the probabilities in terms of the logits: $$ p_i = \\frac{e^{l_i}}{\\sum_{j=1}^v e^{l_j}} = \\frac{e^{l_i}}{Z} \\quad \\text{and} \\quad p_i' = \\frac{e^{l_i'}}{\\sum_{j=1}^v e^{l_j'}} = \\frac{e^{l_i'}}{Z'}, $$ where $Z$ and $Z'$ are the normalizing constants for the original and modified logits, respectively.\nSince $ l_i' - l_i = \\Delta l_i $, we have $$ e^{l_i} = e^{l_i'} e^{-\\Delta l_i}. $$ Thus, we can write the original probability in terms of the modified probabilities and the logit differences: $$ p_i = \\frac{e^{l_i}}{\\sum_{j=1}^v e^{l_j}} = \\frac{e^{l_i'} e^{-\\Delta l_i}}{\\sum_{j=1}^v e^{l_j'} e^{-\\Delta l_j}}. $$ Dividing the numerator and the denominator by the normalizing constant $ Z' $, we eliminate the dependence on the logits: $$ p_i = \\frac{(e^{l_i'}/Z') e^{-\\Delta l_i}}{\\sum_{j=1}^v (e^{l_j'}/Z') e^{-\\Delta l_j}} = \\frac{p_i' e^{-\\Delta l_i}}{\\sum_{j=1}^v p_j' e^{-\\Delta l_j}}. $$ The symmetric result of Equation (2) can be obtained similarly. $ \\square $\nExtracting probabilities of any $k$ tokens in a single API call The following result is used to extract the probabilities of an arbitrary set of $ k $ tokens in a single API call by exploiting the logit bias option of the API. We assume that the API returns the probabilities of the top-$ k $ tokens, and allows adding logit biases to the tokens.\nWe denote the set of indices of the desired $k$ tokens as $ B $. To expose the probabilities of the tokens in $ B $, we add a sufficiently large logit bias $ b $ to the tokens in $ B $ so that they appear in the top-$ k $ predictions. 
(If not all desired tokens appear in the top-$k$ predictions from the API call, we can try again with a larger bias.)\nAfter getting the results from the API, we know the probabilities $ p_i' $ for $ i \\in B $ but not for $ i \\notin B $. The aim is to find the original probabilities $ p_i $ for $ i \\in B $.\nIn this case, the changes in logits are $$ \\Delta l_i = l_i' - l_i = \\begin{cases} b \u0026 \\text{if } i \\in B, \\\\ 0 \u0026 \\text{if } i \\notin B. \\end{cases} $$ By Equation (1) in the lemma, we have for any $i \\in B$: $$ p_i = \\frac{p_i' \\exp(-\\Delta l_i)}{\\sum_{j=1}^v p_j' \\exp(-\\Delta l_j)}. $$ We can split the summation in the denominator into two parts: one for $ j \\in B $ and the other for $ j \\notin B $: $$ p_i = \\frac{p_i' \\exp(-b)}{\\sum_{j \\in B} p_j' \\exp(-b) + \\sum_{j \\notin B} p_j'}. $$ To eliminate the dependence on unknown probabilities, we use the fact that the modified probabilities sum to 1: $$ \\sum_{i \\in B} p_i' + \\sum_{i \\notin B} p_i' = 1, $$ which gives $$ p_i = \\frac{p_i' \\exp(-b)}{\\sum_{j \\in B} p_j' \\exp(-b) + 1 - \\sum_{j \\in B} p_j'}. $$ Here we have obtained a formula for the original probabilities $ p_i $ for $ i \\in B $ in terms of only observed probabilities and the bias $ b $, which is known to us. Hence we can extract the probabilities of the $k$ tokens in $ B $ in a single API call.\nFull probability distribution in $ \\lceil v/k \\rceil $ API calls To extract the full probability distribution, we can repeat the above process $ \\lceil v/k \\rceil $ times, each time exposing the probabilities of another $ k $ tokens.\nExtracting the top-$n$ probabilities in $ \\lceil n/k \\rceil $ API calls In most cases, the probability mass is concentrated on a small number of tokens. It is more cost-effective to extract just the top-$n$ probabilities instead of the full distribution. 
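The single-call recovery formula derived above can be sanity-checked numerically. The sketch below simulates the biased API call on a made-up five-token vocabulary (the function name, logits, and bias value are all invented for illustration; no real API is involved):

```python
import math

def recover_probs(biased_probs, b):
    """Recover the original probabilities of the biased tokens.

    biased_probs maps token id -> probability p'_i reported by the API
    after a logit bias b was added to each of these tokens.
    """
    s = sum(biased_probs.values())           # sum of p'_j over j in B
    denom = s * math.exp(-b) + 1.0 - s       # denominator of the formula
    return {t: p * math.exp(-b) / denom for t, p in biased_probs.items()}

# Made-up 5-token vocabulary and its true softmax probabilities
logits = [2.0, 1.0, 0.5, 0.0, -1.0]
Z = sum(math.exp(l) for l in logits)
true_probs = [math.exp(l) / Z for l in logits]

# Simulate the API call with bias b added to the tokens in B
B, b = {0, 2}, 10.0
biased = [l + (b if i in B else 0.0) for i, l in enumerate(logits)]
Zb = sum(math.exp(l) for l in biased)
observed = {i: math.exp(biased[i]) / Zb for i in B}  # what the API reports

recovered = recover_probs(observed, b)
assert all(abs(recovered[i] - true_probs[i]) < 1e-9 for i in B)
```

Note that the formula is exact for any $ b $; in practice $ b $ only needs to be large enough to push the desired tokens into the top-$ k $.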
We can assume the rest of the probabilities to be negligible.\nTo this end, we apply a sufficiently large negative bias $ -b $ to the tokens with already known probabilities, exposing the next $k$ tokens. Repeat the process until the top-$ n $ logprobs are obtained. The required number of API calls is $ \\lceil n/k \\rceil $.\nDenote the set of indices of the known tokens as $ C $. In this case the changes in logits are $$ \\Delta l_i = \\begin{cases} -b \u0026 \\text{if } i \\in C, \\\\ 0 \u0026 \\text{otherwise}. \\end{cases} $$ Using Equation (2) of the lemma, we have for any $ i \\notin C $: $$ p_i' = p_i \\left(\\frac{e^{\\Delta l_i}}{\\sum_{j=1}^v p_j e^{\\Delta l_j}}\\right). $$ Since $ i \\notin C $, we have $ \\Delta l_i = 0 $, so the numerator equals 1. Splitting the summation in the denominator into two parts: one for $ j \\in C $ and the other for $ j \\notin C $, we get $$ p_i' = p_i \\left(\\frac{1}{\\sum_{j \\in C} p_j e^{-b} + \\sum_{j \\notin C} p_j}\\right). $$ Recall that $ C $ is the set of known indices; the probabilities $ p_j $ for $ j \\notin C $ are unknown to us. To eliminate the dependence on unknown probabilities, we use the fact that the probabilities sum to 1: $$ \\sum_{i \\in C} p_i + \\sum_{i \\notin C} p_i = 1, $$ which gives $$ p_i' = p_i \\left( \\frac{1}{\\sum_{j \\in C} p_j e^{-b} + 1 - \\sum_{j \\in C} p_j} \\right). $$ Rearranging the terms, we get $$ p_i = p_i' \\left( \\sum_{j \\in C} p_j e^{-b} + 1 - \\sum_{j \\in C} p_j \\right). 
$$ This gives a formula for the original probabilities $ p_i $ for $ i \\notin C $ in terms of known quantities, provided that $ p_i' $ is observed, i.e. token $ i $ is one of the next $ k $ tokens exposed by the API.\nTo extract the top-$ n $ probabilities, we can repeat the above process $ \\lceil n/k \\rceil $ times, each time exposing the probabilities of another $ k $ tokens.\n","permalink":"https://kinianlo.github.io/posts/llm-api-probs/","summary":"OpenAI has modified their API to return the log probabilities before any logit bias is applied. Hence the methods described in this article are no longer applicable.\nThe aim of this article is to provide an alternative derivation of the results in Matthew Finlayson\u0026rsquo;s article Obtaining logprobs from an LLM API.\nMost LLM APIs return the logprobs of only the top-$ k $ predictions. The value of $ k $ is often small, on the order of 10.","title":"Another derivation for \"Obtaining logprobs from an LLM API\""},{"content":"Hypothesis Testing In scientific research, we often want to prove that a certain statistical phenomenon is true. For example, we might want to prove (or rather show that it is likely) that a coin is unfair (that it lands on heads and tails with unequal probabilities). We do this by flipping the coin $N$ times and recording the number of heads $N_h$ and the number of tails $N_t$. For example, we may get $N_h = 40$ and $N_t = 60$, which sum to $N = 100$. How do we proceed to prove that the coin is unfair? We can do this by testing a hypothesis. A hypothesis is a statistical model of the system we are studying. In our example, we might hypothesize that the coin is fair (that is, that it lands on heads and tails with equal probabilities). The idea is that we can calculate how likely it is that we would have gotten the observed data given that the hypothesis is true, using the hypothesised model. 
If it turns out to be very unlikely for us to observe the data we have observed, then we can reject the hypothesis.\nNow let us make the hypothesis that the coin is fair. To calculate the probability of the observed data in the coin flip example, we can use the binomial distribution. The probability of observing $N_h$ heads and $N_t$ tails in $N$ coin flips is given by:\n$$ P(N_h, N_t) = \\frac{N!}{N_h! N_t!} p^{N_h} (1-p)^{N_t} $$ where $p$ is the probability of observing heads in a single coin flip. Since we are hypothesising that the coin is fair, we have $p = 0.5$.\nPlugging in the observed data, we get:\n$$ P(40, 60) = \\frac{100!}{40! 60!} 0.5^{40} 0.5^{60} \\approx 0.0108 $$ This is a fairly small probability. That means if we were to repeat the experiment many times, we would expect to observe the data we have observed only about 1% of the time. In hypothesis testing, we are not interested in the exact probability of the observed data, but rather in the probability of observing data that are as unlikely as or even more unlikely than the observed data. Why? you might ask. This is a very good question that I do not have a good answer to at the moment. I will try to find out and update this post.\nThe probability of observing data that are as unlikely as or even more unlikely than the observed data is defined as the p-value of the observed data under the hypothesis. 
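This definition can be computed directly from the binomial model. Below is a minimal Python sketch using only the standard library (the helper function name is my own):

```python
from math import comb

def binom_pmf(k, n, p):
    # Probability of exactly k heads in n flips with heads-probability p
    return comb(n, k) * p**k * (1 - p)**(n - k)

N, N_h, p = 100, 40, 0.5  # the observed experiment, fair-coin hypothesis

prob_observed = binom_pmf(N_h, N, p)  # ≈ 0.0108

# Sum the probabilities of all outcomes that are as unlikely as, or more
# unlikely than, the observed one
p_value = sum(binom_pmf(k, N, p) for k in range(N + 1)
              if binom_pmf(k, N, p) <= prob_observed)  # ≈ 0.0569
```

For the fair coin the outcomes satisfying the condition are exactly $N_h \le 40$ and $N_h \ge 60$, so this matches the two-tailed sum below.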
The p-value of the example is:\n$$ \\begin{align*} \\text{p-value} \u0026= \\sum_{N_h \\leq 40} P(N_h, N_t = N - N_h) + \\sum_{60 \\leq N_h} P(N_h, N_t = N - N_h) \\\\ \u0026\\approx 0.02844 + 0.02844 = 0.05688 \\end{align*} $$ This p-value of 0.05688 is the probability of observing data that are as unlikely as or even more unlikely than the observed data ($N_h = 40$, $N_t = 60$) under the hypothesis that the coin is fair.\nUsually, an observation with a p-value of less than 0.05 under a hypothesis is considered to be statistically significant, which means that we can reject the hypothesis.\nComposite hypothesis In the above example, we had only one hypothesis, namely that the coin is fair. What if we want to show that the coin has a greater probability of landing on heads than tails? We can do this by testing a composite hypothesis, which can be collectively written as a set of hypotheses:\n$$ \\{H_p \\mid p \\leq 0.5\\}. $$ Now we calculate the p-value $\\text{p-value}(H_p)$ of the observed data under each of these hypotheses $H_p$.\nBecause the goal of doing all this is to rule out the possibility that the coin lands on tails with a greater probability, we need every hypothesis $H_p$ to have a p-value less than the predetermined threshold, e.g. 0.05. Thus we can define the p-value of the composite hypothesis as the maximum of the p-values of the individual hypotheses:\n$$ \\text{p-value} = \\max_{p \\leq 0.5} \\text{p-value}(H_p). $$ For some spaces of hypotheses, there might not be a maximum p-value. In these cases, we can define the p-value as the supremum of the p-values of the individual hypotheses:\n$$ \\text{p-value} = \\sup_{p \\leq 0.5} \\text{p-value}(H_p). $$ Beware that the maximum or supremum might not be easy to find.\nBootstrapping In the above example, we have a complete analytical formula for the probability distribution of the observations. Well, what if we don\u0026rsquo;t? 
What if we have a complicated system that we cannot model analytically? Or worse, what if we have a gigantic composite null hypothesis which makes the maximum or supremum of the p-values of the individual hypotheses difficult to find?\nOne way would be to apply Monte Carlo methods to the problem. We can sample a large number of hypotheses from the space of composite hypotheses and calculate the p-value of each of the sampled hypotheses. However, sometimes it is not even easy to calculate the p-value of a single hypothesis.\nIf this is the case, we can forget about p-values and turn to confidence intervals. Usually, we calculate a number summarising the data, such as the mean or the standard deviation. Such a number is called a sample statistic. This number tells you something interesting about the system you are studying. For example, the sample mean of coin flips reveals how biased the coin is. Suppose the sample mean is $0.4$. Does it mean that the coin is biased? Well, we can use the p-value to answer this question. But there is another way to answer this question. Imagine we have the luxury of repeating the coin flip experiment many times. That way we can calculate a sample mean for each of the experiments and plot a histogram of the sample means. From the histogram, we can see how spread out the sample means are.\nHowever, we do not usually have the luxury of repeating the experiment many times. Instead, we can sample from the data we already have to form a bootstrap sample. The bootstrap sample is a sample of the data that is the same size as the original data. Essentially, we are selecting data points from the original data with replacement. 
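Concretely, with the coin-flip data as the running example, drawing bootstrap samples and collecting their means takes only a few lines (the function name, seed, and number of resamples are arbitrary choices for illustration):

```python
import random

def bootstrap_means(data, n_resamples=2000, seed=0):
    """Resample the data with replacement (same size as the original)
    and return the sample mean of each bootstrap sample."""
    rng = random.Random(seed)
    n = len(data)
    return [sum(rng.choices(data, k=n)) / n for _ in range(n_resamples)]

# 100 flips with 40 heads (1 = heads, 0 = tails); sample mean 0.4
flips = [1] * 40 + [0] * 60
means = sorted(bootstrap_means(flips))

# The spread of these means stands in for the histogram we would get from
# repeating the whole experiment; e.g. a rough 95% interval:
lo, hi = means[int(0.025 * len(means))], means[int(0.975 * len(means))]
```

One would then check, for instance, whether 0.5 lies inside the interval.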
This way we can create out of thin air new data sets, with which we can calculate the sample statistic and plot a histogram of the sample statistic.\n","permalink":"https://kinianlo.github.io/posts/2023-02-12-stats/","summary":"Hypothesis Testing In scientific research, we often want to prove that a certain statistical phenomenon is true. For example, we might want to prove (or rather show that it is likely) that a coin is unfair (that it lands on heads and tails with unequal probabilities). We do this by flipping the coin $N$ times and recording the number of heads $N_h$ and the number of tails $N_t$. For example, we may get $N_h = 40$ and $N_t = 60$, which sum to $N = 100$.","title":"Notes on Statistics"},{"content":"Since I started blogging, I have been using markdown to write my posts. Due to the nature of the topics I write about, I often need to write maths using LaTeX.\nInitially, I had been using kramdown as the markdown parser since it has a built-in math syntax. Essentially, kramdown recognises anything in between $$ and $$ as LaTeX code and not markdown. That means the underscores in $$(x_1, y_1)$$ are not interpreted as italics or bold. If you put the maths within a paragraph, then the maths is considered inline. If you put the maths in its own paragraph, then the maths is considered display.\nIt was all fine until I decided to use the built-in markdown previewer in vscode, which uses CommonMark as its markdown parser (or rather its markdown standard). Yes, there are (too) many markdown standards out there. The problem is that in CommonMark, anything in between two $$\u0026rsquo;s is interpreted as a display math block, regardless of whether it is in its own paragraph or not. If you need an inline math block, you need to use a single $ instead. kramdown refuses to use a single $ for inline maths because it is not unusual to have a pair of dollar signs in a sentence, e.g. 
\u0026ldquo;The price is $10 with a discount of $2.\u0026rdquo;.\nLuckily, there is a vscode extension called Markdown + Math that allows you to use kramdown\u0026rsquo;s math syntax in the vscode markdown previewer.\nHow does LaTeX maths usually work in markdown? First of all, maths support is almost always considered an extension to basically every major markdown standard. For example, there is not a single mention of the word \u0026ldquo;math\u0026rdquo; in the CommonMark Spec (version 0.30) or the GitHub Flavored Markdown Spec (version 0.29-gfm). As a result, markdown parsers that follow these standards do not recognise any maths syntax by default. So the first step to getting maths working in markdown is to make your markdown parser recognise a certain maths syntax, usually in the form of delimiters, so that the parser does not interpret the LaTeX code as markdown and the maths can be rendered.\nList of markdown libraries kramdown is the markdown parser used by Jekyll by default. The double dollar signs syntax is used for both inline and display maths. There is no obvious way to change the maths delimiters. That said, kramdown does not do any LaTeX parsing by itself. By default, it just changes the maths delimiters to those recognised by MathJax, which is a strictly client-side parser. However, you can configure kramdown to use other maths engines. For example, you can use KaTeX which allows rendering LaTeX to MathML both on the server side and the client side. MathJax is a javascript library that renders LaTeX code in the browser. It is solely a client-side parser. In MathJax version 3, the delimiters recognised are \\[ and \\] for display maths and \\( and \\) for inline maths. 
In MathJax version 2, the delimiters are HTML tags: \u0026lt;script type=\u0026quot;math/tex; mode=display\u0026quot;\u0026gt; for display maths and \u0026lt;script type=\u0026quot;math/tex\u0026quot;\u0026gt; for inline maths. Note: I have been encountering rendering issues with MathJax. Sometimes I would get a rectangular block of white covering part of the rendered maths. KaTeX is a fast math typesetting library for the web. The core of KaTeX is used to render LaTeX code with web technologies, i.e. javascript and css. Markdown is not involved at all. However, there is an extension called Auto-render of KaTeX that searches for maths in any text and renders it. You can choose what delimiters to use for maths. markdown-it is a markdown parser following the CommonMark Spec. There is no maths support by itself, which is a good thing, believe it or not. There is a plugin called markdown-it-texmath that allows configuring the maths delimiters. The default delimiters are $$ for display maths and $ for inline maths. Importantly, it supports the use of kramdown-style maths delimiters. Test area $$ $$ $$a + b$$ $$a _ b$$ . $$ \\lambda * 5 $$ $ \\lambda * 5 $ $ a+b $ This $$1 $$ $$ 2$$ .\n$$ 1 + 2 $$ This is a $ a + b $ c $ $$ a + b $$\n","permalink":"https://kinianlo.github.io/posts/2023-02-09-maths-in-markdown/","summary":"Since I started blogging, I have been using markdown to write my posts. Due to the nature of the topics I write about, I often need to write maths using LaTeX.\nInitially, I had been using kramdown as the markdown parser since it has a built-in math syntax. Essentially, kramdown recognises anything in between $$ and $$ as LaTeX code and not markdown. That means the underscores in $$(x_1, y_1)$$ are not interpreted as italics or bold.","title":"Writing LaTeX maths in markdown"},{"content":"Binary search is an essential algorithm used in coding interviews. 
It is used to search for a target element x in a sorted list a: list with a time complexity O(log(n)). The naive way to search for x in a is to iterate through the list and check if a[i] == x for every i in range(len(a)). The time complexity of such a naive approach is O(n). However, if the list is not sorted, we can first sort it with a time complexity of O(nlog(n)) and then do a binary search. Thus if you are just going to do one search, it is better to just go with the naive approach. However, if you are going to search more than O(log(n)) times, then it might be better to first sort the list and then do binary search.\nThe general idea of binary search is to rule out half of the list at every iteration. Since the given list is sorted, we can tell which half to rule out by comparing the target element x with the element in the middle of the list. Since the length of the list is halved at every iteration, you will end up with a list of length 1 at roughly the log(n)-th iteration. In each iteration, you only do one comparison. Thus the time complexity of the algorithm is O(log(n)).\nBisection without worrying about edge cases The idea of binary search is simple. However, there are some edge cases that need to be handled. For example, what if the target element x does not exist in the list a? Or what if there are multiple x in a? These are things that we do not want to worry about when implementing binary search, especially when we are doing it in an interview. Most certainly, we will introduce bugs when we are under pressure.\nA stress-free way to implement binary search is to alter the problem slightly. 
Instead of finding the target element x in a, the algorithm will slice the list a into three parts (p_less = a[:l], p_equal = a[l:r], p_greater = a[r:]) which satisfy the following assertions:\nassert all(c \u0026lt; x for c in p_less ) == True\nassert all(c == x for c in p_equal ) == True\nassert all(c \u0026gt; x for c in p_greater) == True\nAll edge cases are automatically handled because p_equal could be empty or contain multiple x. The output of the algorithm is now the two indices l and r that divide a into three parts.\nThis is exactly how the built-in bisect library works in Python. The pointer l is calculated by bisect.bisect_left and the pointer r is calculated by bisect.bisect_right. It is very important to understand how the bisect library works because it is a very useful tool in Python for coding interviews.\nImplementation To find l, we start by initializing the low and high pointers:\nlo = 0\nhi = len(a)\nIn doing so, we have divided the list into three parts:\na[:lo]; a[lo:hi]; a[hi:]. We maintain the following assertions throughout the algorithm:\nall(c \u0026lt; x for c in a[:lo])\nall(c \u0026gt;= x for c in a[hi:])\nBoth hold trivially at the start because a[:lo] and a[hi:] are empty. The idea is to shorten a[lo:hi] repeatedly while maintaining the above assertions until it becomes an empty list. Here is where bisection comes in. At the start of every iteration, we take a look at the middle element a[m] of a[lo:hi], where m = (lo+hi)//2. There are two cases:\na[m] \u0026lt; x: in this case, since the list is sorted, we can conclude that every element in a[:m+1] is less than x. Thus, we can safely increase the pointer lo to m+1 without violating the first assertion.\na[m] \u0026gt;= x: every element in a[m:] is greater or equal to x. Thus, we can safely decrease hi to m without violating the second assertion.\nThe above procedure is repeated until a[lo:hi] is empty, i.e. lo == hi. Now we have all(c \u0026lt; x for c in a[:lo]) == True and all(c \u0026gt;= x for c in a[lo:]) == True. 
It follows that l == lo == hi. Note that it is not essential to let m = (lo+hi)//2. In fact, m can be taken to point to any element in a[lo:hi], that is, m could be any integer in range(lo, hi). It is just that letting m be the middle of a[lo:hi] ensures that the length of a[lo:hi] is roughly halved at every iteration.\nThe following implements the above procedure:\ndef bisect_left(a, x):\n    lo, hi = 0, len(a)\n    while lo \u0026lt; hi:\n        m = (lo + hi) // 2\n        if a[m] \u0026lt; x:\n            lo = m + 1\n        else:\n            hi = m\n    return lo\nSimilarly, the following gives the r pointer:\ndef bisect_right(a, x):\n    lo, hi = 0, len(a)\n    while lo \u0026lt; hi:\n        m = (lo + hi) // 2\n        if a[m] \u0026lt;= x:\n            lo = m + 1\n        else:\n            hi = m\n    return lo\nConclusion To conclude, we have defined a version of binary search (or bisection) that removes the worry of edge cases. That is achieved by forgoing the aim of finding the target element and focusing on dividing the list into three parts, each containing elements that are \u0026lt;, == and \u0026gt; the target element. Edge cases are automatically handled because the middle partition could be of any length. If the middle partition is empty, that means the target element does not exist in the list.\n","permalink":"https://kinianlo.github.io/posts/2022-10-17-bisect/","summary":"Binary search is an essential algorithm used in coding interviews. It is used to search for a target element x in a sorted list a: list with a time complexity O(log(n)). The naive way to search for x in a is to iterate through the list and check if a[i] == x for every i in range(len(a)). The time complexity of such a naive approach is O(n). However, if the list is not sorted, we can first sort it with a time complexity of O(nlog(n)) and then do a binary search.","title":"Notes on binary search"},{"content":"","permalink":"https://kinianlo.github.io/cv/","summary":"","title":"CV"}]