This article decodes some of the popular jargon used in data science. These concepts are worth understanding well because they come up frequently in data science job interviews. Let’s get into the topics.
Dependent and Independent Variables
A dependent variable (also called the target variable) is the variable whose value is driven by the independent variables in a study. For example, the revenue of a retail store depends on the number of customers walking into the store. Here the store revenue is the dependent variable, and the number of customers walking into the store is an independent variable. The dependent variable is called so because its value depends on the independent variables; independent variables are called so because they do not depend on the other variables that influence the dependent variable. For instance, the amount of rainfall (another independent variable) does not depend on the number of customers walking into the store. Both independent variables together help in making a better prediction of the revenue.
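To make the split concrete, here is a minimal sketch in Python (the column names and numbers are made up for illustration): revenue is the dependent variable, while customer count and rainfall are the independent variables.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical store data: two independent variables and one dependent variable
data = pd.DataFrame({
    "customers": [120, 150, 90, 200, 170],     # independent variable
    "rainfall_mm": [0, 5, 20, 0, 2],           # independent variable
    "revenue": [2400, 3100, 1700, 4200, 3500], # dependent (target) variable
})

X = data[["customers", "rainfall_mm"]]  # independent variables
y = data["revenue"]                     # dependent variable

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)
```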
While working on a predictive data science problem, there is generally one dependent variable and multiple independent variables. Below is a very good resource to better understand dependent and independent variables.
Outliers
An outlier is a value that falls well outside the normal range of a variable. For example, the average life expectancy is around 70 years, so an individual aged 119 is considered an outlier because their age is far outside the normal range. While working on a data science problem, it is general practice to check the dataset for outliers, because outliers can influence the choice of algorithm in a predictive problem.
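One common (though not the only) way to flag outliers is the interquartile-range rule. Here is a minimal sketch in pandas, using made-up ages:

```python
import pandas as pd

# Hypothetical ages, including one extreme value
ages = pd.Series([62, 70, 68, 74, 71, 66, 119])

# Rule of thumb: flag values more than 1.5 * IQR beyond the quartiles
q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = ages[(ages < lower) | (ages > upper)]
print(outliers)  # 119 is flagged as an outlier
```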
Here is a detailed article that talks about techniques commonly used in the detection of outliers.
Ordinal Data
When categorical data has an inherent order, it is called ordinal data. For example, the class of a flight ticket is ordinal data: there is a natural sequence, such as first class and second class.
When we have ordinal categorical data, it is usually best to use integer encoding: convert the categories into an integer representation that follows the inherent order. This way, the algorithm can look for patterns such as how the outcome changes as the variable’s value increases or decreases.
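Here is a minimal sketch of integer encoding with pandas; the ‘ticket_class’ column and the category order are assumptions made for the flight-ticket example:

```python
import pandas as pd

# Hypothetical ticket data: flight class has an inherent order
df = pd.DataFrame({"ticket_class": ["economy", "business", "first", "economy"]})

# Map each category to an integer that respects the order
class_order = {"economy": 0, "business": 1, "first": 2}
df["ticket_class_encoded"] = df["ticket_class"].map(class_order)
print(df)
```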
Below is a very good article to learn more about the best ways to encode categorical data.
One-Hot Encoding
One-hot encoding is a data transformation technique that converts a categorical attribute into a numerical representation. Its main advantage is that it avoids misleading the ML model with an artificial order.
To explain it in simpler terms, attributes such as gender, city, and country are non-ordinal, meaning there is no order within their categories. When we convert such a non-ordinal attribute into integers, many algorithms assume that higher values are more (or less) important, even though no such relationship exists. One-hot encoding solves this problem by converting non-ordinal attributes into a binary representation.
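Here is a minimal sketch using pandas’ get_dummies; the ‘city’ column and its values are made up for illustration:

```python
import pandas as pd

# Hypothetical non-ordinal attribute: city has no inherent order
df = pd.DataFrame({"city": ["London", "Paris", "Tokyo", "Paris"]})

# One-hot encoding creates one binary column per category
encoded = pd.get_dummies(df, columns=["city"])
print(encoded)
```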
To learn more about implementing One-Hot encoding, read below.
Skewness and Kurtosis
Skewness is a measure that describes the shape of a data distribution. When data has a skewness of 0, the distribution is symmetrical: the left-hand side of the distribution mirrors the right-hand side. When data is negatively skewed, the distribution has a long tail toward lower values and the mean is typically less than the median. When data is positively skewed, the long tail is toward higher values and the mean is typically greater than the median.
Kurtosis is another measure for understanding a data distribution. When data has positive (excess) kurtosis, the distribution has heavier tails and a sharper peak than a normal distribution, which means there are likely to be more outliers.
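Here is a small sketch that computes both measures with pandas on simulated data (the distributions are chosen purely for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# A right-skewed sample (exponential) vs. a roughly symmetric one (normal)
right_skewed = pd.Series(rng.exponential(scale=2.0, size=10_000))
symmetric = pd.Series(rng.normal(loc=0.0, scale=1.0, size=10_000))

# pandas reports skewness and excess kurtosis (a normal distribution is ~0 for both)
print(right_skewed.skew(), right_skewed.kurtosis())  # positive skew, positive kurtosis
print(symmetric.skew(), symmetric.kurtosis())        # both close to 0
```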
Below is a very good article with visual representation to better understand skewness and kurtosis.
Imbalanced Datasets
Imbalanced datasets are those where the target attribute (the attribute to be predicted) is unevenly distributed, which is not an uncommon scenario in data science problems. Predicting fraudulent credit card transactions is an excellent example of an imbalanced dataset: most credit card transactions are genuine, yet a small fraction are fraudulent.
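A quick way to see the imbalance is to check the class proportions of the target; the counts below are made up to mimic the fraud example:

```python
import pandas as pd

# Hypothetical transaction labels: 0 = genuine, 1 = fraudulent
labels = pd.Series([0] * 990 + [1] * 10)

# Check the class distribution before choosing a modelling approach
print(labels.value_counts(normalize=True))
# 0    0.99
# 1    0.01
```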
Imbalanced datasets need special attention because the usual approaches to building models and evaluating performance do not work well on them. Here is an article that talks in detail about imbalanced datasets and the best approaches to handle them.
Feature Scaling
Feature scaling is a technique used to bring all the features of a dataset (the independent variables) to a consistent scale. To explain the concept with an example, take a problem with features such as age and salary, where age ranges from 20 to 75 and salary ranges from 50K to 500K. When we use algorithms based on gradient descent, or any distance-based algorithm, it is important to scale the features to a consistent range before passing them to the algorithm. If the features are not scaled, the features on the larger scale will dominate the prediction.
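Here is a minimal sketch with scikit-learn’s StandardScaler, one of several possible scaling choices; the age and salary values are made up to match the example:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales
df = pd.DataFrame({
    "age": [22, 35, 47, 60, 75],
    "salary": [50_000, 120_000, 250_000, 400_000, 500_000],
})

# Standardise each feature to mean 0 and standard deviation 1
scaled = StandardScaler().fit_transform(df)
print(pd.DataFrame(scaled, columns=df.columns))
```

MinMaxScaler is a common alternative when you want every feature squeezed into a fixed 0 to 1 range.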
To know more about what scaling is and why it is important, read the article below.
Correlation
Correlation is a statistical measure that describes the relationship between two features. Say we have two features, A and B. If A and B are positively correlated, then as A increases, B tends to increase as well. If A and B are negatively correlated, then as one of them increases, the other tends to decrease.
Correlation is commonly used for feature selection while building a model. When two features are highly correlated with each other, they carry largely the same information and are not truly independent, so one of them is generally removed from the feature list before building the model.
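Here is a minimal sketch with pandas, where feature B is deliberately constructed to be almost a copy of feature A:

```python
import pandas as pd

# Hypothetical features: B is almost a copy of A, C is unrelated
df = pd.DataFrame({
    "A": [1, 2, 3, 4, 5],
    "B": [2, 4, 6, 8, 11],
    "C": [5, 3, 8, 1, 7],
})

# Pairwise Pearson correlation between features
print(df.corr())
# A and B show a correlation close to 1, so one of them
# would typically be dropped before modelling.
```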
To know more about correlation and how it is used in feature selection, with a working example, read the article below.
Confidence Interval and Confidence Level
Confidence interval and confidence level are easy to confuse, especially for beginners, but once you understand the concepts they are hard to mix up.
Let us consider a simple real-world example. An e-commerce company wants to know the average number of items a user views before making a final purchase. It is not practical to track the clickstream data of every single user, so the best approach is to calculate the mean for a sample and come up with an estimate. When we analyze the sample user data, we produce an estimated range, for example: the average user views between 4 and 9 items before making a final purchase. This range is the confidence interval. How certain we can be that such a range captures the true average, say 95 times out of 100 samples, is the confidence level.
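Here is a minimal sketch of computing a 95% confidence interval for the sample mean with SciPy; the ‘items viewed’ data is simulated, not real clickstream data:

```python
import numpy as np
from scipy import stats

# Hypothetical sample: items viewed by 50 users before purchasing
rng = np.random.default_rng(42)
items_viewed = rng.poisson(lam=6, size=50)

mean = items_viewed.mean()
sem = stats.sem(items_viewed)  # standard error of the mean

# 95% confidence interval for the true average, using the t-distribution
low, high = stats.t.interval(0.95, len(items_viewed) - 1, loc=mean, scale=sem)
print(f"95% CI for average items viewed: {low:.1f} to {high:.1f}")
```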
To know more about the calculation behind the confidence interval and confidence level, check the article below.
Homoscedasticity and Heteroscedasticity
Homoscedasticity is an important assumption in linear regression and a common question in job interviews. Homoscedasticity means that the variance of the residuals (the errors between the predicted and actual values of the dependent variable) stays roughly the same across different values of the independent variable.
Let us take a simple example: the independent variable is ‘property size’ and the dependent variable is ‘property value’. We use ‘property size’ to predict ‘property value’, and the error in that prediction is the residual. If the spread of the errors does not change across different values of ‘property size’, the data meets the homoscedasticity assumption. If the residuals are larger for bigger properties than for smaller ones, that is heteroscedasticity.
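One common informal check is to plot the residuals against the independent variable; if their spread fans out as the variable grows, the homoscedasticity assumption is likely violated. Here is a minimal sketch on simulated data:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Simulated data: error variance grows with property size (heteroscedastic)
rng = np.random.default_rng(1)
size = rng.uniform(500, 5000, 200)
value = 100 * size + rng.normal(0, size * 5)  # noise grows with size

X = size.reshape(-1, 1)
model = LinearRegression().fit(X, value)
residuals = value - model.predict(X)

# A fan-shaped pattern in this plot suggests heteroscedasticity
plt.scatter(size, residuals, s=10)
plt.xlabel("property size")
plt.ylabel("residual")
plt.show()
```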
For a better understanding of this concept, and to know why homoscedasticity is an important assumption when solving a regression problem, read the following article.
Preparing for Data Science Interview?
Here is a video from my YouTube channel about the steps involved in preparing for a data science interview. It is not about last-minute preparation the night before the interview but about long-term preparation.