Efrat Vilenski

Roundtables – Exploratory data analysis of multivariate data

Ben Gurion University

Bio

Visual analytics expert with a unique skill set in data science, business intelligence, database scripting and web development. More than 10 years of experience in scalable data environments and data-driven applications.

Currently a Ph.D. candidate in the Dept. of Industrial Engineering at Ben Gurion University. Research focuses on visual analytics, anomaly detection with unsupervised machine learning, multivariate statistics and robust statistics. Works with several IoT data sets.

Abstract

Exploring complex multivariate data is a formidable task, because data with many variables are hard for humans to comprehend. Visual tools are needed to make such exploration effective.

Exploratory data analysis (EDA), a term coined by Tukey, refers to a set of techniques for displaying data so that interesting features become apparent. EDA plays a key role in the data science pipeline and aids a variety of tasks: deciding between data preprocessing options, selecting appropriate algorithms, comparing results from multiple algorithms, etc.

When doing EDA on multivariate data, it is sometimes sufficient to isolate each variable and analyze it separately. This mass-univariate approach is simple and popular, but in many cases it is overly naive. For instance, each variable may carry a weak signal that is undetectable univariately but that aggregates when the variables are analyzed jointly. It may also be the case that the signal is not in the means but in the correlations between processes; in that case, detection using a mass-univariate approach is impossible.
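
The point about signal living in the correlations can be illustrated with a small sketch. The data below are made up for illustration, and the helper names (`z_scores`, `mahalanobis_2d`) are hypothetical. The Mahalanobis distance is used here as one common multivariate score, not as a method this abstract prescribes. Two variables move together almost perfectly, and one observation lies well within the usual range of each variable but breaks their relationship.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def covariance(a, b):
    """Sample covariance (n - 1 denominator) of two equal-length lists."""
    ma, mb = mean(a), mean(b)
    return sum((u - ma) * (v - mb) for u, v in zip(a, b)) / (len(a) - 1)


def z_scores(point, xs, ys):
    """Univariate view: standardize each coordinate on its own."""
    px, py = point
    return ((px - mean(xs)) / sqrt(covariance(xs, xs)),
            (py - mean(ys)) / sqrt(covariance(ys, ys)))


def mahalanobis_2d(point, xs, ys):
    """Multivariate view: distance that accounts for the x-y covariance."""
    dx, dy = point[0] - mean(xs), point[1] - mean(ys)
    sxx, syy, sxy = covariance(xs, xs), covariance(ys, ys), covariance(xs, ys)
    det = sxx * syy - sxy * sxy
    # Quadratic form d' * inverse(S) * d, written out for the 2x2 case.
    return sqrt((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)


# Made-up reference data: y follows x closely, with a small alternating offset.
xs = [i / 5 for i in range(-10, 11)]
ys = [x + (0.1 if i % 2 == 0 else -0.1) for i, x in zip(range(-10, 11), xs)]

typical = (1.0, 0.9)    # follows the x-y relationship
suspect = (1.0, -1.0)   # ordinary on each axis, but breaks the relationship

for name, point in (("typical", typical), ("suspect", suspect)):
    zx, zy = z_scores(point, xs, ys)
    print(name, round(zx, 2), round(zy, 2), round(mahalanobis_2d(point, xs, ys), 2))
```

Both points have coordinate-wise z-scores below 1 in absolute value, so neither stands out when each variable is inspected alone. The Mahalanobis distance of the suspect point, however, is roughly an order of magnitude larger than that of the typical point.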

Several multivariate charting options are available and may give satisfactory results for a limited number of dimensions, for example the radar chart, the stacked radial chart and parallel coordinates. For a higher number of dimensions, dimensionality reduction techniques are appropriate, as are methods to score, scan and filter the data so as to focus only on interesting subsets of variables and observations.
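
One simple way to score, scan and filter is to rank all variable pairs by some score and chart only the top few. The sketch below uses the absolute Pearson correlation as the score, which is an assumption made for illustration (it captures only linear association). The column names and values are invented, and `rank_pairs` is a hypothetical helper.

```python
from itertools import combinations
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two equal-length, non-constant lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / sqrt(va * vb)


def rank_pairs(columns, top_k=3):
    """Score every pair of columns by |correlation| and keep the top_k pairs.

    `columns` maps a variable name to its list of values.
    Returns a list of (score, name_a, name_b), highest score first.
    """
    scored = [
        (abs(pearson(columns[p], columns[q])), p, q)
        for p, q in combinations(sorted(columns), 2)
    ]
    scored.sort(reverse=True)
    return scored[:top_k]


# Invented sensor-style columns, for illustration only.
columns = {
    "temp": [1, 2, 3, 4, 5, 6],
    "power": [2.1, 3.9, 6.2, 7.8, 10.1, 12.0],
    "humidity": [5, 3, 6, 2, 4, 1],
}

ranked = rank_pairs(columns, top_k=2)
for score, a, b in ranked:
    print(f"{a} vs {b}: |r| = {score:.3f}")
```

With many variables the number of pairs grows quadratically, so a ranking like this helps decide which scatter plots or parallel-coordinate axes are worth looking at first. Other scores could be substituted for `pearson` without changing `rank_pairs`.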

Combining visual analytics capabilities with pre-processing algorithms enables intuitive, interactive exploration of complex data.

Discussion Points

  • Do you have a checklist for completing a thorough EDA on multivariate data?
  • What is your process for multivariate EDA? What charts do you typically use?
  • What are the pitfalls when doing EDA on multivariate data?
  • How would you suggest improving the multivariate EDA process? The goal is to complete it in less time while gaining more insights.
  • How can unsupervised learning help with multivariate data EDA?
  • How can dimension reduction be combined with multivariate data EDA?
