2nd International Workshop on

Data-Efficient Machine Learning (DeMaL)

Knowledge Discovery and Data Mining (KDD) Conference 2021
Contact: demalworkshop@gmail.com

Training, retraining, and deploying web-scale machine learning models requires large amounts of high-quality data. Often, this is achieved via a time-consuming, labor-intensive human annotation process. While web-scale applications offer an abundance of unlabeled, often extremely noisy data, there is a severe lack of high-quality labeled data from which practitioners can train ML models that perform well on customer-facing applications. It is therefore imperative that ML scientists and engineers devise innovative ways to deal with the constrained setting of small amounts of labeled data, and make the best use of the limited (time and monetary) budget available to obtain annotations. In other words, one needs to train data-efficient machine learning models. This need has led to the proliferation of creative techniques such as data augmentation, transfer learning, self-supervised learning, active learning, and multi-task learning, to name a few. While many of these techniques have been shown to work well under specific settings, web data offers additional challenges: it is multi-modal in nature, it carries implicit signals from user interactions, and it often involves multiple agents.

Given the uniqueness, importance, and growing interest in these problems, the workshop on Data-efficient Machine Learning (DeMaL) is a venue to present ideas and solutions to them. The full-day workshop aims to bring together practitioners in both academia and industry working on the collection, annotation, and usage of labeled data for large-scale web applications. Check out the Call for Contributions for topics of interest.


The workshop will take place on Aug 15, 2021, from 1pm to 5pm Pacific Time, according to the schedule below.

Session 1
- Opening Remarks
- Keynote by Aidong Zhang
Session 2
- Weakly Supervised Classification Using Group-Level Labels; Guruprasad Nayak, Rahul Ghosh, Xiaowei Jia and Vipin Kumar.
- Establishing Reliability in Crowdsourced Voice Feedback (CVF) and Evaluating the Reliability in the Absence of Ground-Truth; Aashish Jain and Sudeeksha Murari.
- A Concept Knowledge-Driven Keywords Retrieval Framework for Sponsored Search; Yijiang Lian, Yubo Liu, Zhicong Ye, Liang Yuan, Yanfeng Zhu, Min Zhao, Jianyi Cheng and Xinwei Feng.
- Search based Self-Learning Query Rewrite System in Conversational AI; Xing Fan, Eunah Cho, Xiaojiang Huang and Chenlei Guo.
- Towards NLU Model Robustness to ASR Errors at Scale; Yaroslav Nechaev, Weitong Ruan and Imre Kiss.
Session 3
- Keynote by Ed Chi
Session 4
- Active two-phase learning for classification of large datasets with extreme class-skew; Tarun Gupta and Sedat Gokalp.
- CDCGen: Cross-Domain Conditional Generation via Normalizing Flows and Adversarial Training; Hari Prasanna Das, Ryan Tran, Japjot Singh, Yu-wen Lin and Costas J. Spanos.
- RoBERTaIQ: An Efficient Framework for Automatic Interaction Quality Estimation of Dialogue Systems; Saurabh Gupta, Xing Fan, Derek Liu, Benjamin Yao, Yuan Ling, Kun Zhou, Tuan-Hung Pham and Chenlei Guo.
- Improving Natural Language Understanding Accuracy by Pre-Adapting to Live Traffic; Fu Lisheng and Konstantine Arkoudas.
- Closing Remarks

Important Dates

Paper Submission Deadline: June 4, 2021
Acceptance notification: July 5, 2021
Camera-ready due: June 30, 2021
Publication of Workshop Proceedings: July 2, 2021
Workshop Date: Aug 15, 2021

Accepted Papers

Long Papers

The following papers have been accepted as long papers to the workshop. Each will have a 25-minute slot for presenting the paper.

Short Papers

The following papers have been accepted as short papers to the workshop. Each will have a 25-minute slot for presenting the paper.

Keynote Speakers (alphabetical order)

Title: Transfer Learning and Meta-Learning for Few-Shot Learning Applications

Abstract: Transfer learning has been proposed to re-use trained model parameters in similar machine learning applications. However, transfer learning cannot always be applied effectively from one domain to a substantially different domain. Recently, meta-learning, which utilizes prior knowledge learned from related tasks and generalizes to new tasks with limited supervised experience, has been shown to be an effective approach for few-shot learning problems. In this talk, I will discuss the advantages of meta-learning over traditional classification in prediction problems and use The Cancer Genome Atlas (TCGA) cancer patient data to demonstrate that meta-learning can outperform conventional transfer learning in cancer prediction problems. I will also present a knowledge-guided meta-learning strategy that integrates biological knowledge with meta-learning for improved classification performance on few-shot learning problems.
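To make the few-shot setup concrete for readers new to the area: in an episodic few-shot task, a model must classify "query" examples given only a handful of labeled "support" examples per class. The sketch below is not the speaker's method; it is a minimal, generic nearest-prototype classifier (in the spirit of prototypical networks) on toy 2-D points, with illustrative names like `class_prototypes`.

```python
import numpy as np

def class_prototypes(support_x, support_y):
    """Compute one prototype (mean embedding) per class from the small support set."""
    classes = np.unique(support_y)
    protos = np.stack([support_x[support_y == c].mean(axis=0) for c in classes])
    return classes, protos

def predict(query_x, classes, protos):
    """Assign each query point to the class with the nearest prototype."""
    dists = np.linalg.norm(query_x[:, None, :] - protos[None, :, :], axis=-1)
    return classes[dists.argmin(axis=1)]

# A toy 2-way, 2-shot episode: two well-separated clusters.
support_x = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
support_y = np.array([0, 0, 1, 1])
classes, protos = class_prototypes(support_x, support_y)
preds = predict(np.array([[0.1, 0.1], [5.0, 5.1]]), classes, protos)
print(preds)  # → [0 1]
```

In a real few-shot system the raw inputs would first pass through a learned embedding network trained across many such episodes; the nearest-prototype rule on raw features here only illustrates the episode structure.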

Bio: Dr. Aidong Zhang is a William Wulf Faculty Fellow and Professor of Computer Science in the School of Engineering and Applied Sciences at the University of Virginia (UVA). She also holds joint appointments with the Department of Biomedical Engineering and the Data Science Institute at the University of Virginia. Her research interests include machine learning, data science, bioinformatics, and health informatics. Dr. Zhang is a fellow of ACM, AIMBE, and IEEE.

Title: Efficient Neural Modeling for Large-Scale Real-World Recommendations

Abstract: Fundamental improvements in recommendation and ranking have been much harder to come by than recent progress on other long-standing AI problems such as computer vision, speech recognition, language modeling, and machine translation. Some reasons include: (1) large amounts of data that make training difficult, yet with (2) noisy and sparse labels; (3) changing dynamics of context, such as user preferences and items; (4) low-latency requirements for a recommendation response; and finally, (5) skewed, long-tailed data distributions (i.e., a power law): while there is a good amount of logged data overall, for many users and items in the long tail we suffer from extreme label sparsity. These problems mean we have to make very efficient use of data to provide the personalized recommendation experiences that users expect. In this talk, we will touch upon many recent advances in data-efficient neural modeling techniques for recommendations and their impact in real-world products. We will particularly focus on (a) using multi-task models with gated mixtures of experts to jointly learn tasks with different label sparsity; (b) employing recent self-supervised learning techniques to robustly learn feature representations; and (c) some closing thoughts on the relationship between data efficiency and compressive sensing.
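For readers unfamiliar with the gated mixture-of-experts idea mentioned in (a): several small "expert" networks all process the input, and a learned gating network produces per-input weights that mix their outputs, so different tasks or inputs can lean on different experts. The following is a minimal numpy forward-pass sketch, not the production architecture; the shapes, weights, and the single shared mixture are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    """Numerically stable softmax along the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Two tiny linear "experts" mapping a 4-d input to a 3-d representation,
# plus a gating network producing one logit per expert.
W_experts = rng.normal(size=(2, 4, 3))  # (expert, in_dim, out_dim)
W_gate = rng.normal(size=(4, 2))        # (in_dim, num_experts)

def moe_forward(x):
    """Gate-weighted mixture of expert outputs; task towers would consume this."""
    expert_out = np.einsum('bi,eio->beo', x, W_experts)  # (batch, expert, out)
    gate = softmax(x @ W_gate)                           # (batch, expert), rows sum to 1
    return np.einsum('be,beo->bo', gate, expert_out)     # (batch, out)

x = rng.normal(size=(5, 4))
y = moe_forward(x)
print(y.shape)  # → (5, 3)
```

In multi-task variants such as multi-gate mixtures of experts, each task gets its own gating network over the shared experts, which is what lets tasks with very different label sparsity share parameters selectively.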

Bio: Ed H. Chi is a Distinguished Scientist at Google, leading several machine learning research teams focusing on neural modeling, reinforcement learning, dialog modeling, reliable/robust machine learning, and recommendation systems in the Google Brain team. His team has delivered significant improvements for YouTube, News, Ads, and the Google Play Store, with >420 product improvements since 2013. With 39 patents and >150 research articles, he is also known for research on user behavior in web and social media. Prior to Google, he was the Area Manager and a Principal Scientist at Palo Alto Research Center's Augmented Social Cognition Group, where he led the team in understanding how social systems help groups of people to remember, think, and reason. Ed completed his three degrees (B.S., M.S., and Ph.D.) in 6.5 years at the University of Minnesota. Recognized as an ACM Distinguished Scientist and elected into the CHI Academy, he recently received a 20-year Test of Time award for research in information visualization. He has been featured and quoted in the press, including the Economist, Time Magazine, the LA Times, and the Associated Press. An avid swimmer, photographer, and snowboarder in his spare time, he also has a black belt in Taekwondo.

Organizing Committee

Program Committee

We thank the following program committee for reviewing the workshop papers and providing valuable feedback to the authors.

How to Join

If you have registered for KDD, the easiest way is to join using the KDD conference platform via the following link.

Alternatively, please send an email to demalworkshop at gmail dot com to request joining details. Once you receive the details you can join using the following Zoom link

Call for Contributions

We identify a broad set of techniques that can be used to learn from limited data. Topics of interest include, but are not limited to:
  • Semi-supervised and Self-supervised Learning
  • Active Methods: active learning, bandit techniques
  • Learning from Similar Tasks: transfer learning, multi-task learning, meta-learning, domain adaptation
  • Crowdsourcing: human annotation methods, design of experiments
  • Synthetic data: data augmentation, adversarial data generation
Given the Web-related focus of this conference, we will also consider, though not be limited to, the following application domains:
  • Recommendation models: recommender systems, collaborative filtering, knowledge graphs
  • E-commerce: fraud and abuse mitigation, misinformation, advertising
  • Social media: misbehavior, sentiment analysis, cyberbullying
  • Information retrieval: web search, ranking
  • Time Series Analysis

Paper Submission

Authors are invited to submit papers of 2-8 pages in length. Papers should be submitted electronically in PDF format, using the ACM SIG Proceedings format with a font size no smaller than 9pt. Submit papers through EasyChair. All submissions will be single-blind and peer-reviewed; each submission will be reviewed by at least 3 members of the PC. Papers will be evaluated according to their significance, originality, technical content, style, clarity, and relevance to the workshop. All accepted papers will be presented at the workshop. We encourage both academic and industry submissions of the following types, but not limited to:

  • Novel research papers in full or short length
  • Work-in-progress papers
  • Position papers
  • Survey papers
  • Comparison papers of existing methods and tools
  • Case studies
  • Demo papers
  • Extended abstracts

Past Workshops