Essential Maths for Data Science in JavaScript

Descriptive Statistics for Data Science

Data science is not about programming or any specific language (R, Python, or JavaScript). It is the art of applying mathematical techniques to data.

Neeraj Dana
10 min read · Nov 12, 2022


In data science, a fundamental understanding of statistics and modeling is required.

What will we cover

  1. Descriptive and inferential statistics
  2. Population and sample
  3. Probability and non-probability sampling
  4. Mean
  5. Median
  6. Inter-Quartile range of a data set
  7. Variance of a data set
  8. Standard deviation of a data set

Descriptive and Inferential Statistics

Two Sets of Statistical Tools

Here we are alluding to descriptive and inferential statistics. Descriptive statistics are all about summarizing important properties or attributes of a dataset.

Inferential statistics go a step further. They attempt to actually explain the elements of those datasets in terms of relationships.

In short, descriptive statistics are all about how the data looks; inferential statistics are all about why the data looks that way.

Measures such as the mean or the median, or a table describing various statistics within the data, are all examples of descriptive statistics.

Inferential statistics, on the other hand, usually require a model and they usually introduce the notion of probabilities. That was a high level conceptual overview of these two important classes of statistical techniques. Let’s now dig just a little deeper.

Descriptive Statistics

As we’ve already discussed, their primary function is to summarize data in a meaningful way. And indeed, we commonly use descriptive statistics, such as the average or the most commonly occurring value, these are ultimately just summaries of data. Likewise for tables or histograms or charts, these just describe the data in a manner intended to find patterns. That’s the whole point of most charts. But this in turn leads to the crucial limitation of descriptive statistics. They do not lend themselves to any general conclusions. Consider a statistic like a batting average. This represents nothing more than the data for one particular individual or a period in time.

This cannot be generalized either to other individuals in the same sport, or even used to predict how that same sportsman is going to do in the future. So descriptive statistics, as their name would suggest, simply describe the existing data, and they can serve as valuable aids to other analysis. That’s because they reduce our cognitive load. The raw data is just too noisy and too large for us to mentally process. Good luck trying to evaluate different sportsmen on the basis of their individual scores. You wouldn’t even be able to keep track of all of those scores for any length of time. This is why summary statistics, like batting averages, are so popular.

And finally, descriptive statistics may use a combination of different techniques, such as tables, graphs, and even comments. Let’s very quickly discuss the two most common examples of descriptive statistics: measures of central tendency and measures of dispersion. If you stop to think about it, both of these intuitively make sense. If we continue with our sporting analogy, given any individual sportsperson, there are really two key questions that we’d like to answer. The first question is: on average, how good is she? And the second is: how consistent is she? How likely is she to bring her A game on a given day? Measures of central tendency answer the first question; measures of dispersion answer the second.

Sample and Population

In statistics, the term population, or more precisely, statistical population, refers to the set of all similar items or events of interest. This set is very likely to be infinite, and usually the idea of the statistical population comes from experience. Consider, for example, the idea of all possible voters in an American presidential election. This is a generalization drawn from past experience. And from this generalization, we can create an infinite set. Now you might very rightly question how we could assess the views of all possible voters out there, and the answer is that we cannot. And this is where the idea of a sample, or a statistical sample, really comes in handy.

A statistical sample is a subset of the population which has been chosen in order to represent the population in some statistical analysis. If the sample is chosen correctly, then characteristics of the entire population that the sample is drawn from can be estimated from the corresponding sample. And the sample can be very, very small in relation to the entire population. But here, it’s really important that the sample be drawn correctly, and in particular, that it indeed be representative of the population. We will have a lot more to say on the topic of biased samples and sampling methods in the sections ahead. Now that we have an overview of the difference between population and sample, let’s drill a little deeper, starting with the idea of the population.

Population

As we’ve already discussed, this includes all of the data that we are interested in. And again, this is intentionally a fairly loose term. The statistical population is very likely to consist of an infinite or immeasurable or even hypothetical set. The population could be small or large. The size does not matter. Note that even if we do define our population in a very narrow way and even if we go out and measure every element of that population currently in existence, there is always the possibility that new elements might come into existence in the future. And that’s why even for a very small population, the difference between the population and the sample still remains. Now the descriptive statistics, which we spoke about a little earlier, these descriptive statistics apply to populations.

So ideally, when we speak of values of the mean, the median, the standard deviation, and so on, we are referring to their values for the population as a whole. And now here is a really important point which we should remember. The values of descriptive statistics for the population are called parameters. The word parameter has a very specific meaning in statistics. Again, it refers to a property of the population as a whole. Parameters are usually going to be unknowable, and the best that we can hope for is to estimate them. So we will find that we will often be estimating population parameters using a sample, a subset of the population. Again, parameters are population properties which need to be estimated. As we shall see in a moment, the corresponding descriptive statistics which apply to a sample, well, they are just known as statistics and not as parameters.

Descriptive Statistics on Populations

So this distinction between a parameter and a statistic is important. Parameters are population-level properties which we can never really know or measure. Statistics are sample-level properties which we can indeed measure. And what’s more, we will usually use the sample statistic in order to estimate the population parameter. Here’s a little memorization aid which might help you keep this distinction in mind: P for population, P for parameter; S for sample, S for statistic. And remember, we estimate parameters using statistics. Let’s now move on from discussing the population, which is quite mysterious and hard to measure, to measuring and discussing something a lot more tangible, which is a sample. A sample is a subset of our population, and again, it’s a matter of real skill to construct a sample which is representative of the population.

Sample

In the real world, it’s going to be impossible in virtually every case to access all of the population data. Then it comes down to working with a smaller sample. But the manner in which that sample is chosen can make all the difference.

Descriptive Statistics on Samples

The population is mysterious and intangible, and its parameters can only be estimated. The best way to estimate them is by constructing samples and then measuring their descriptive statistics. Those descriptive statistics can then be used to estimate the population properties. P for population, P for parameter; S for sample, S for statistic. And we estimate the parameters using the statistics.

Mean

The mean is by far the most commonly used single value to describe data. Remember that the mean is simply the average of the numbers in a dataset. If those numbers constitute a statistical sample, then that is known as a sample mean. If those numbers constitute the entire population, then that mean is known as the population mean. And one of the beautiful properties of the mean is that the sample mean is an unbiased estimator of the population mean. Hold that thought, we will come back to it. Another great attraction of the mean is that it can be used with both discrete and continuous data. Please note that both discrete and continuous data here refer to data with ordering. If we have data which is fundamentally un-ordered, such as, let’s say, days of the week or genders, then the mean does not make sense.

And the reason that orderability of our data is important is because of the formula which is used to calculate the mean. The mean is simply the arithmetic average: we take all of the values in our sample or our population, sum them, and then divide that sum by the count. One implication of this formula is that the mean need not actually occur in our data. This makes the mean different from, say, the mode; the mode of a set of numbers is always present within that set, which need not be true for the mean. However, the mean does score over the mode in one important way: it takes into account every value in the dataset. In fact, if we took every value in our dataset and calculated its difference from the mean, the sum of all of those deviations would be zero. This is a really important property of the mean. So let’s formalize the calculation. The mean of a set of numbers is obtained by summing all of those numbers and dividing by their count.

X̄ = (x₁ + x₂ + … + xₙ) / n
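The formula above can be sketched in a few lines of plain JavaScript. This is a minimal illustration; the `mean` function name and the sample data are assumptions for the example, not from the original text.

```javascript
// Mean: sum of all observations divided by their count.
function mean(data) {
  const sum = data.reduce((acc, x) => acc + x, 0);
  return sum / data.length;
}

const sample = [4, 8, 15, 16, 23, 42];
console.log(mean(sample)); // 18

// Property mentioned above: the deviations from the mean sum to zero.
const m = mean(sample);
const deviationSum = sample.reduce((acc, x) => acc + (x - m), 0);
console.log(deviationSum); // 0 (up to floating-point rounding)
```

Note also that 18 does not occur in the sample itself, illustrating the point that the mean need not be a member of the dataset.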

Median

We’ve discussed the mean, also known as the arithmetic average. Let’s now talk about the median. Before we get into the actual calculation, let’s talk about what it is. Intuitively, the median is the value in the dataset which separates the lower half from the higher half. In other words, exactly 50% of the values in our dataset are on or below the median, and likewise, 50% are on or above it. Note that in a skewed dataset the median need not lie at the exact center of the range of values. But why do we even need the median, when the mean is so simple to calculate and so simple to understand? Because the median, being based on rank rather than magnitude, is robust to outliers: a single extreme value can drag the mean far away while leaving the median unchanged.
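The median can be sketched in the same style. The helper name and example data are illustrative assumptions; the convention used for an even count (averaging the two middle elements) is the most common one.

```javascript
// Median: the value separating the lower half of the data from the upper half.
function median(data) {
  const sorted = [...data].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 !== 0
    ? sorted[mid]                          // odd count: the middle element
    : (sorted[mid - 1] + sorted[mid]) / 2; // even count: average of the two middle elements
}

console.log(median([5, 1, 9, 3, 7])); // 5
console.log(median([1, 2, 3, 4]));    // 2.5

// Robustness to outliers: an extreme value barely moves the median.
console.log(median([5, 1, 9, 3, 1000])); // 5
```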

Inter-Quartile range of a data set

Let’s now turn to measures of dispersion. The first and simplest measure of dispersion, or spread, is simply known as the range. This is calculated as the difference between the largest and the smallest values in a dataset. Let’s see whether the range satisfies our criteria for statistical dispersion. If all of the values in the dataset are equal, the largest and the smallest values are going to be equal as well, and the range is going to be zero.

As the numbers become more diverse, the range is going to increase in value. What’s more, the range is always zero or greater, so it’s non-negative, and of course by construction it is a real number. The range works very well for small datasets, but as our datasets get larger, the range can end up becoming too large. In other words, it’s not a tight enough bound for the dispersion we’re seeking to measure. And this is where three related properties come into play. In order, these are the quartile 1, the quartile 3, and the interquartile range. Let’s break these down. The quartile 1 can be thought of as simply the median of the smaller 50% of our data points.

The quartile 3 can be thought of as the median of the larger 50% of our data points. In simple terms, the quartile 1 is the middle number between the minimum and the median, and the quartile 3 is the middle number between the median and the maximum. Because these numbers, Q1 and Q3 as they are commonly known, are calculated as medians, they are both robust to outliers: Q1 is robust to outliers on the small side, Q3 to outliers on the large side. And we can now take the difference between these two numbers, which in turn is called the interquartile range.

The interquartile range, or IQR, is a more robust measure of dispersion than the range, because it is constructed as the difference of two medians, and both of those medians are robust to outliers. All of these refinements help, but it turns out that by far the most popular and most useful measures of dispersion are the variance and the standard deviation. These are slightly more involved, so let’s take them one at a time.
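The range, quartiles, and IQR described above can be sketched as follows. The function names and sample data are assumptions; note too that several conventions exist for computing quartiles, and this sketch uses one common one (for an odd count, the overall median is excluded from both halves).

```javascript
// Range: difference between the largest and smallest values.
function range(data) {
  return Math.max(...data) - Math.min(...data);
}

// Helper: median of a sorted or unsorted array.
function median(data) {
  const sorted = [...data].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 !== 0
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// IQR: Q3 minus Q1, where Q1 is the median of the lower half
// and Q3 is the median of the upper half.
function interquartileRange(data) {
  const sorted = [...data].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  const lower = sorted.slice(0, mid);
  const upper = sorted.length % 2 !== 0 ? sorted.slice(mid + 1) : sorted.slice(mid);
  return median(upper) - median(lower);
}

const data = [1, 2, 3, 4, 5, 6, 7, 8];
console.log(range(data));              // 7
console.log(interquartileRange(data)); // 6.5 - 2.5 = 4
```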

Variance

Given any one point, the deviation of that point from the mean, also known as the mean deviation, is simply the distance or the difference between that point and the mean. Once you know the mean deviation of a single point in our dataset, the next step is to generalize it: in other words, to find some kind of average mean deviation across all the points in our dataset. Here a little bit of arithmetic will prove that we can’t take the average mean deviation as is. If we did so, the positive and negative deviations would cancel out, and the average mean deviation would simply be zero. What we can do, however, is square each of those mean deviations and then take the average of those squares.

σ² = Σ(Xᵢ − X̄)² / n
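A sketch of this calculation in JavaScript. Since the text describes taking the average of the squared deviations, this is the population variance, dividing by n (the sample variance would divide by n − 1 instead); the function name and data are illustrative assumptions.

```javascript
// Population variance: the average of the squared deviations from the mean.
function variance(data) {
  const mean = data.reduce((acc, x) => acc + x, 0) / data.length;
  const squaredDeviations = data.map((x) => (x - mean) ** 2);
  return squaredDeviations.reduce((acc, x) => acc + x, 0) / data.length;
}

console.log(variance([2, 4, 6, 8])); // 5
console.log(variance([5, 5, 5]));    // 0 (no spread at all)
```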

Standard Deviation

The standard deviation is simply the square root of the variance, which brings the measure back into the same units as the data itself. The best part about the standard deviation is that it measures the spread, or dispersion, of the data around the mean. It is always non-negative; it cannot be negative. The good thing about the standard deviation relative to the interquartile range is that every single point in our dataset is considered in its calculation. For this reason it is more sensitive to outliers than the interquartile range, but even so it is far more robust than the range. And finally, like every other measure of dispersion, if all the values in a dataset are the same, the standard deviation of that dataset is going to be zero.
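The standard deviation can be sketched by taking the square root of the variance computed above (again the population version, dividing by n; names and data are illustrative assumptions).

```javascript
// Standard deviation: the square root of the (population) variance,
// so it is expressed in the same units as the data itself.
function standardDeviation(data) {
  const mean = data.reduce((acc, x) => acc + x, 0) / data.length;
  const variance =
    data.reduce((acc, x) => acc + (x - mean) ** 2, 0) / data.length;
  return Math.sqrt(variance);
}

console.log(standardDeviation([2, 4, 6, 8])); // √5 ≈ 2.236
console.log(standardDeviation([7, 7, 7]));    // 0, since all values are equal
```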
