Dana Kaner

Labeling Against All Odds

PerimeterX


Bio

Dana Kaner is a data scientist at PerimeterX, a startup company that aims to fend off automated attacks on websites. Dana designs and implements scalable machine learning models for real-time detection of malicious activity.

Dana holds a bachelor’s degree in mathematics and economics and a master’s degree in applied statistics and data science from Tel Aviv University.


Abstract

A common misconception is that data science is only about choosing the right model for the problem at hand, when, in fact, considerable time and effort go into more fundamental challenges. One major bottleneck in machine learning is obtaining reliable labeled data to train the model on.

How do you learn from partially labeled data? How do you deal with a data set that may contain mislabeled observations? How do you retrain a model when there is only partial feedback on its predictions? In this talk, I will review academic approaches to these issues alongside case studies from PerimeterX. I encourage the participants at the roundtable to share their experience with these challenges so we can explore possible solutions across various domains.

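To make the "partially labeled data" question concrete, here is a minimal sketch of one standard academic approach, self-training, using scikit-learn. This is an illustrative toy example, not the method used at PerimeterX: the synthetic data set and the 20% labeling rate are assumptions chosen for the demo.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

# Toy data: 200 samples, but only ~20% of them keep their labels.
X, y = make_classification(n_samples=200, random_state=0)
y_partial = y.copy()
rng = np.random.default_rng(0)
unlabeled = rng.random(len(y)) > 0.2
y_partial[unlabeled] = -1  # scikit-learn's marker for "no label"

# Self-training: fit on the labeled subset, then iteratively
# pseudo-label unlabeled samples the model is confident about
# (predicted probability above the threshold) and refit.
clf = SelfTrainingClassifier(LogisticRegression(), threshold=0.8)
clf.fit(X, y_partial)
print(clf.score(X, y))
```

The appeal of this family of methods is that the unlabeled majority of the data still shapes the final decision boundary, at the cost of potentially reinforcing the model's own early mistakes.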

Discussion Points

  • How to decide whether labeled data is a must
  • Different types of labeling challenges we’ve dealt with as data scientists (partial labels, noisy labels, etc.)
  • Academic approaches that discuss possible solutions to these problems
  • Practical solutions we eventually implemented 
  • Interesting case studies and results

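For the noisy-labels point above, one common starting point is to flag suspect labels using out-of-fold predictions, in the spirit of confident learning. The sketch below is a simplified illustration on synthetic data, with the 10% flip rate and the 0.3 probability threshold chosen arbitrarily for the demo; it is not a description of any specific production system.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Toy data with 10% of the labels flipped to simulate annotation noise.
X, y_true = make_classification(n_samples=300, random_state=1)
rng = np.random.default_rng(1)
y_noisy = y_true.copy()
flip = rng.random(len(y_true)) < 0.10
y_noisy[flip] = 1 - y_noisy[flip]

# Out-of-fold predicted probabilities: each sample is scored by a
# model that never saw its (possibly wrong) label during training.
proba = cross_val_predict(LogisticRegression(), X, y_noisy,
                          cv=5, method="predict_proba")

# Flag samples whose given label receives a low out-of-fold
# probability; these are candidates for relabeling or removal.
suspect = proba[np.arange(len(y_noisy)), y_noisy] < 0.3
print(f"{suspect.sum()} of {len(y_noisy)} labels flagged as suspect")
```

Filtering or down-weighting the flagged samples before retraining is one practical way to keep a noisy training set usable without hand-auditing every label.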