=================
== CJ Virtucio ==
=================

Quantiles with Rscript

Working with Rscript without a strong statistics background can be a little daunting. Some of its functions seem to have pretty mysterious properties, and the documentation can be a challenge to grok without prior knowledge. One such function I’ve recently encountered is quantile.

The Wikipedia article defines quantiles as:

cut points dividing the range of a probability distribution into continuous intervals with equal probabilities, or dividing the observations in a sample in the same way.

If you think of a dataset as a stream, a quantile is a point that divides the stream into segments. There are several kinds, but perhaps the most common is the quartile. Quartiles are the three cut points that divide a dataset into four equal-sized parts. So, if you have a dataset of [1, 2, 3, 4, 5, 6, 7, 8], the first, second, and third quartiles under R's default method are 2.75, 4.5, and 6.25, respectively.
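As a quick check, the quartiles of the example dataset can be computed directly. This is a minimal sketch that assumes R's default interpolation method (type = 7); other values of the type argument can give slightly different cut points.

```r
# Quartiles of the example dataset, using quantile's default method (type = 7).
data <- c(1, 2, 3, 4, 5, 6, 7, 8)

# probs takes a vector, so all three cut points can be requested at once.
quartiles <- quantile(data, probs = c(0.25, 0.5, 0.75))
print(quartiles)
#  25%  50%  75%
# 2.75 4.50 6.25
```

Each cut point falls between two observations, so the returned values do not have to be members of the dataset.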

I ran into an issue where the code would process and filter out records from data that I was trying to seed for testing. The codebase was very opaque about the reason; all it would tell me was that it was trying to perform processing on an empty dataset. After some digging, I ran into code that looked something like this:

doFoo(quantile(data, 0.98))

Playing with the REPL didn’t make things much clearer at first. I’d create a dataset of one element, and quantile would return that same element:

data = c(1)
print(quantile(data, 0.98))
# 1

I then figured that, if a quantile’s a cutpoint in the data, it only makes sense that the 0.98 quantile (the 98th percentile) of a one-element dataset would be that sole element. So I added more datapoints; true enough, calling the quantile function led to a different result:

data = c(1, 2, 3, 4, 5, 6, 7, 8)
print(quantile(data, 0.98))
# 7.86
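To see where 7.86 comes from, the calculation can be reproduced by hand. The sketch below assumes quantile's default method (type = 7), which interpolates linearly between the two nearest sorted values. The helper name manual_quantile is made up for illustration and is not part of R.

```r
# Hypothetical helper: reproduces quantile(x, p) for the default type = 7 rule.
manual_quantile <- function(x, p) {
  x <- sort(x)
  n <- length(x)
  h <- (n - 1) * p + 1   # fractional position in the sorted data
  lower <- floor(h)      # index of the value just below the position
  upper <- ceiling(h)    # index of the value just above the position
  # Move from the lower value toward the upper one by the fractional part of h.
  x[lower] + (h - lower) * (x[upper] - x[lower])
}

data <- c(1, 2, 3, 4, 5, 6, 7, 8)
manual <- manual_quantile(data, 0.98)     # position 7.86: between 7 and 8
builtin <- unname(quantile(data, 0.98))   # drop the "98%" name for comparison
print(c(manual = manual, builtin = builtin))

# With a single element the position is always 1, so any p returns that element.
single <- manual_quantile(c(1), 0.98)
print(single)
```

With eight values, the position (8 - 1) * 0.98 + 1 = 7.86 lands between the seventh and eighth elements, which is why the result sits 86% of the way from 7 to 8.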

If a quantile is a cutpoint in the data, then a dataset of size 1 has only one possible cutpoint, so every quantile is that single value. With more datapoints, quantile has neighboring values to interpolate between (linearly, by default), and the result becomes something more distinct. So it turns out that all I had to do was seed more data.
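The codebase's actual filtering logic isn't shown here, so the following is only a hypothetical sketch of how a quantile threshold could leave an empty dataset. It assumes the downstream code keeps records strictly above the 0.98 quantile; the function name keep_above_threshold is invented for illustration.

```r
# Hypothetical filter: keep only the values strictly above the p-th quantile.
# This is an assumption about what such code might do, not the real doFoo.
keep_above_threshold <- function(data, p = 0.98) {
  threshold <- quantile(data, p)
  data[data > threshold]
}

# One seeded record: the threshold equals the record, so nothing survives.
small <- keep_above_threshold(c(1))
print(length(small))
# 0

# Eight seeded records: the threshold is 7.86, so the largest value survives.
large <- keep_above_threshold(c(1, 2, 3, 4, 5, 6, 7, 8))
print(large)
# 8
```

Under that assumption, a single-record seed always yields an empty result, and seeding more records gives the filter something to keep.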