1st International Workshop on

Data-Efficient Machine Learning for Web Applications (DeMaL)

World Wide Web Conference 2021 demalworkshop@gmail.com

Training, retraining, and deploying web-scale machine learning models requires large amounts of high-quality data. Often, this data is obtained through a time-consuming, labor-intensive human annotation process. While web-scale applications offer an abundance of unlabeled, often extremely noisy data, there is a severe lack of high-quality labeled data from which practitioners can train ML models that perform well on customer-facing applications. It is therefore imperative that ML scientists and engineers devise innovative ways to deal with the constrained setting of small amounts of labeled data, and make the best use of the limited budget (time and money) available to obtain annotated data. In short, one needs to train data-efficient machine learning models. This need has led to the proliferation of creative techniques such as data augmentation, transfer learning, self-supervised learning, active learning, and multi-task learning, to name a few. While many of these techniques have been shown to work well under specific settings, web data poses additional challenges: it is multi-modal in nature, it carries implicit signals from user interactions, and it often involves multiple agents.

Given the uniqueness and importance of these problems, and the growing interest in them, the workshop on Data-efficient Machine Learning for Web Applications (DeMaL) is a venue to present ideas and solutions to these problems. The full-day workshop aims to bring together practitioners in both academia and industry working on the collection, annotation, and usage of labeled data for large-scale web applications. Check out the Call for Contributions for topics of interest.


- Opening Remarks
Session 1
- Keynote by Xin Luna Dong
- An End-to-End Generative Retrieval Method for Sponsored Search; Yijiang Lian, Zhijie Chen, Jing Jia, Zhenjun You, Chao Tian, Jinlong Hu, Kefeng Zhang, Chunwei Yan, Muchenxuan Tong, Wenying Han, Hanju Guan, Ying Li, Ying Cao, Yang Yu, Zhigang Li, Xiaochun Liu and Yue Wang.
- Multimodal and Contrastive Learning for Click Fraud Detection; Weibin Li, Qiwei Zhong, Qingyang Zhao, Hongchun Zhang and Xiaonan Meng.
- OneStop QAMaker: Extract Question-Answer Pairs from Text in a One-Stop Approach; Shaobo Cui, Xintong Bao, Xinxing Zu, Yangyang Guo, Zhongzhou Zhao, Ji Zhang and Haiqing Chen.
- Break
Session 2
- Keynote by Chris Re
- A Data Augmentation Approach for Retrieving Synonymous Keywords in Sponsored Search; Yijiang Lian, Zhenjun You, Fan Wu, Wenqiang Liu and Jing Jia.
- Text Simplification for Comprehension-based Question-Answering; Kartikey Pant, Tanvi Dadu, Seema Nagar, Ferdous Barbhuiya and Kuntal Dey
- Unsupervised Perturbation based Self-Supervised Adversarial Training; Zhuoyi Wang, Yu Lin, Yifan Li, Feng Mi, Zachary Tian and Latifur Khan
- Break
Session 3
- Keynote by Christos Faloutsos
- One Backward from Ten Forward, Subsampling for Large-Scale Deep Learning; Chaosheng Dong, Xiaojie Jin, Weihao Gao, Yijia Wang, Hongyi Zhang, Xiang Wu, Jianchao Yang and Xiaobing Liu.
- Graph Convolutional Network with Node Addition and Edge Reweighting for Semi-Supervised Learning; Wen-Yu Lee
- Predicting ratings in multi-criteria recommender systems via collective factor model; Ge Fan, Chaoyun Zhang, Junyang Chen and Kaishun Wu
- Closing Remarks

Keynote Speakers (alphabetical order)

Title: Overton and Bootleg: Elements of a Software 2.0 System

Abstract: Overton is a system whose main design goal is to support engineers in building, monitoring, and improving production machine learning systems. Key challenges engineers face are monitoring fine-grained quality, diagnosing errors in sophisticated applications, and handling contradictory or incomplete supervision data. Overton automates the life cycle of model construction, deployment, and monitoring by providing a set of high-level, declarative abstractions. Overton’s vision is to shift developers to these higher-level tasks instead of lower-level machine learning tasks. Using Overton, engineers can build deep-learning-based applications without writing any code in frameworks like TensorFlow. Since 2018, Overton has been used in production to support multiple applications in both near-real-time settings, e.g., question answering, and back-of-house processing, e.g., entity resolution. In that time, Overton-based applications have answered billions of queries in multiple languages and processed trillions of records, reducing errors 1.7−2.9x versus production systems. A second design goal of Overton is to natively support and maintain pretrained embeddings. This talk will describe a recently open-sourced embedding model for named entity disambiguation, called Bootleg, which sets new state-of-the-art quality in named entity disambiguation, outperforms BERT-based baselines by over 50 points on entities unseen during training, and improves production use cases by up to 8% in multiple languages. We will also discuss the challenges of building, monitoring, and improving these weakly self-supervised systems over time. Bootleg is open source at http://hazyresearch.stanford.edu/bootleg/. A great deal of credit goes to the Search, Knowledge, and Platform team in Apple AI.

Bio: Christopher (Chris) Re is an associate professor in the Department of Computer Science at Stanford University. He is in the Stanford AI Lab and is affiliated with the Statistical Machine Learning Group. His recent work is to understand how software and hardware systems will change as a result of machine learning along with a continuing, petulant drive to work on math problems. Research from his group has been incorporated into scientific and humanitarian efforts, such as the fight against human trafficking, along with widely used products from technology and enterprise companies including Google Ads, GMail, YouTube, and Apple. He has cofounded four companies based on his research into machine learning systems, SambaNova and Snorkel, along with two companies that are now part of Apple, Lattice (DeepDive) in 2017 and Inductiv (HoloClean) in 2020. His research contributions have spanned database theory, database systems, and machine learning. His work has won best paper or test-of-time awards at the premier venues in each area. He still can't believe he won the MacArthur Foundation Fellowship.

Title: Anomaly Detection in Large Graphs

Abstract: Given a large graph, like who-calls-whom, or who-likes-whom, what behavior is normal and what should be surprising, possibly due to fraudulent activity? How do graphs evolve over time? We focus on these topics: (a) anomaly detection in large static graphs and (b) patterns and anomalies in large time-evolving graphs. For the first, we present a list of static and temporal laws, including advanced patterns like 'eigenspokes'; we show how to use them to spot suspicious activities in on-line buyer-and-seller settings, in Facebook, and in Twitter-like networks. For the second, we show how to handle time-evolving graphs as tensors, as well as some surprising discoveries in such settings.

Bio: Christos Faloutsos is a Professor at Carnegie Mellon University and an Amazon Scholar. He received the Fredkin Professorship in Artificial Intelligence (2020), the Presidential Young Investigator Award by the National Science Foundation (1989), the Research Contributions Award in ICDM 2006, the SIGKDD Innovations Award (2010), the PAKDD Distinguished Contributions Award (2018), 28 “best paper” awards (including 7 “test of time” awards), and four teaching awards. Eight of his advisees or co-advisees have attracted KDD or SCS dissertation awards. He is an ACM Fellow and has served as a member of the executive committee of SIGKDD; he has published over 400 refereed articles, 17 book chapters, and three monographs. He holds 8 patents (and several more are pending), and he has given over 50 tutorials and over 25 invited distinguished lectures.

Title: Self-driving product understanding for thousands of categories

Abstract: Knowledge graphs have been used to support a wide range of applications and enhance search results for multiple major search engines, such as Google and Bing. At Amazon we are building a Product Graph, an authoritative knowledge graph for all products in the world. The thousands of product verticals we need to model, the vast number of data sources we need to extract knowledge from, the huge volume of new products we need to handle every day, and the various applications in Search, Discovery, Personalization, Voice, that we wish to support, all present big challenges in constructing such a graph. In this talk we describe our efforts for self-driving knowledge collection for products of thousands of types. The system includes a suite of novel techniques for taxonomy construction, product property identification, knowledge extraction, anomaly detection, and synonym discovery. Our system is a) automatic, requiring little human intervention, b) multi-scalable, scalable in multiple dimensions including many domains, products, and attributes, and c) integrative, exploiting rich customer behavior logs. We describe what we learned in building this product graph and applying it to support customer-facing applications.

Bio: Xin Luna Dong is a Senior Principal Scientist at Amazon, leading the efforts of constructing the Amazon Product Knowledge Graph. She was one of the major contributors to the Google Knowledge Vault project, and has led the Knowledge-based Trust project, which was called the “Google Truth Machine” by the Washington Post. She co-authored the book “Big Data Integration”, was named an ACM Distinguished Member, and received the VLDB Early Career Research Contribution Award for “advancing the state of the art of knowledge fusion”. She serves on the VLDB Endowment and the PVLDB advisory committee, and is a PC co-chair for WSDM'2022, VLDB'2021, and the KDD'2020 ADS Invited Talk Series.

Call for Contributions

We identify a broad set of techniques that can be used to learn from limited data. The topics of interest include, but are not limited to:
  • Semi-supervised and Self-supervised Learning
  • Active Methods: active learning, bandit techniques
  • Learning from Similar Tasks: transfer learning, multi-task learning, meta-learning, domain adaptation
  • Crowdsourcing: human annotation methods, design of experiments
  • Synthetic data: data augmentation, adversarial data generation
Given the Web-related focus of this conference, we will also consider, but not limit ourselves to, the following application domains:
  • Recommendation models: recommender systems, collaborative filtering, knowledge graphs
  • E-commerce: fraud and abuse mitigation, misinformation, advertising
  • Social media: misbehavior, sentiment analysis, cyberbullying
  • Information retrieval: web search, ranking
  • Natural language processing: text summarization, machine translation, dialogue systems, cross-lingual learning
  • Computer vision: object detection, object tracking, video monitoring, multi-modal methods

Paper Submission

Authors are invited to submit papers of 2-8 pages in length. Papers should be submitted electronically in PDF format, using the ACM SIG Proceedings format, with a font size no smaller than 9pt. Submit papers through EasyChair. All submissions will be single-blind and peer-reviewed. Each submission will be reviewed by at least 3 members of the PC. Papers will be evaluated according to their significance, originality, technical content, style, clarity, and relevance to the workshop. All accepted papers will be presented at the workshop. We encourage both academic and industry submissions, including, but not limited to, the following types:

  • Novel research papers in full or short length
  • Work-in-progress papers
  • Position papers
  • Survey papers
  • Comparison papers of existing methods and tools
  • Case studies
  • Demo papers
  • Extended abstracts

Important Dates

Paper Submission Deadline: March 7, 2021
Acceptance notification: April 1, 2021
Camera-ready due: April 10, 2021
Publication of Workshop Proceedings: April 14, 2021
Workshop Date: April 16, 2021

Accepted Papers

Long Papers

The following papers have been accepted as long papers to the workshop. Each will have a 25-minute presentation slot.

Short Papers

The following papers have been accepted as short papers to the workshop. Each will have a 25-minute presentation slot.

Organizing Committee

Program Committee