Millions of data enthusiasts from around the world compete on Kaggle, the most famous data science competition platform of them all. Participating in Kaggle competitions is a surefire way to improve your data analysis skills, network with an amazing community of data scientists, and gain valuable experience to help grow your career.
The first book of its kind, The Kaggle Book assembles in one place the techniques and skills you’ll need for success in competitions, data science projects, and beyond. Two Kaggle Grandmasters walk you through modeling strategies you won’t easily find elsewhere, along with the knowledge they’ve accumulated along the way. Alongside Kaggle-specific tips, you’ll learn more general techniques for approaching tasks involving image, tabular, and textual data, as well as reinforcement learning. You’ll design better validation schemes and work more comfortably with different evaluation metrics.
Whether you want to climb the ranks of Kaggle, build some more data science skills, or improve the accuracy of your existing models, this book is for you.
Plus, join our Discord Community to learn along with more than 1,000 members and meet like-minded people!
You can read the e-book in Legimi apps or in any app that supports the following format:
Page count: 721
Year of publication: 2022
The Kaggle Book
Data analysis and machine learning for competitive data science
Konrad Banachewicz
Luca Massaron
BIRMINGHAM—MUMBAI
Packt and this book are not officially connected with Kaggle. This book is an effort from the Kaggle community of experts to help more developers.
The Kaggle Book
Copyright © 2022 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Producer: Tushar Gupta
Acquisition Editor – Peer Reviews: Saby Dsilva
Project Editor: Parvathy Nair
Content Development Editor: Lucy Wan
Copy Editor: Safis Editing
Technical Editor: Karan Sonawane
Proofreader: Safis Editing
Indexer: Sejal Dsilva
Presentation Designer: Pranit Padwal
First published: April 2022
Production reference: 3141022
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.
ISBN 978-1-80181-747-9
www.packt.com
I had a background in econometrics but became interested in machine learning techniques, initially as an alternative approach to solving forecasting problems. As I started discovering my interest, I found the field intimidating to enter: I didn’t know the techniques, the terminology, and didn’t have the credentials that would allow me to break in.
It was always my dream that Kaggle would allow people like me the opportunity to break into this powerful new field. Perhaps the thing I’m proudest of is the extent to which Kaggle has made data science and machine learning more accessible. We’ve had many Kagglers go from newbies to top machine learners, being hired at places like NVIDIA, Google, and OpenAI, and starting companies like DataRobot.
Luca and Konrad’s book helps make Kaggle even more accessible. It offers a guide both to how Kaggle works and to many of the key lessons they have taken from their time on the site. Collectively, they’ve been members of Kaggle for over 20 years, entered 330 competitions, made over 2,000 posts to Kaggle forums, and shared over 100 notebooks and 50 datasets. They are both top-ranked users and well-respected members of the Kaggle community.
Those who complete this book should expect to be able to engage confidently on Kaggle – and engaging confidently on Kaggle has many rewards.
Firstly, it’s a powerful way to stay on top of the most pragmatic developments in machine learning. Machine learning is moving very quickly. In 2019, over 300 peer-reviewed machine learning papers were published per day. This volume of publishing makes it impossible to stay on top of the literature. Kaggle ends up being a very valuable way to filter which developments matter on real-world problems – and Kaggle is useful for more than keeping up with the academic literature. Many of the tools that have become standard in the industry spread via Kaggle first. For example, XGBoost in 2014 and Keras in 2015 both spread through the community before making their way into industry.
Secondly, Kaggle offers users a way to “learn by doing.” I’ve heard active Kagglers talk about competing regularly as “weight training” for machine learning. The variety of use cases and problems they tackle on Kaggle makes them well prepared when they encounter similar problems in industry. And because of competition deadlines, Kaggle trains the muscle of iterating quickly. There’s probably no better way to learn than to attempt a problem and then see how top performers tackled the same problem (it’s typical for winners to share their approaches after the competition).
So, for those of you who are reading this book and are new to Kaggle, I hope it helps make Kaggle less intimidating. And for those who have been on Kaggle for a while and are looking to level up, I hope this book from two of Kaggle’s strongest and most respected members helps you get more out of your time on the site.
Anthony Goldbloom
Kaggle Founder and CEO
Konrad Banachewicz holds a PhD in statistics from Vrije Universiteit Amsterdam. During his period in academia, he focused on problems of extreme dependency modeling in credit risk. In addition to his research activities, Konrad was a tutor and supervised master’s students. Starting from classical statistics, he slowly moved toward data mining and machine learning (this was before the terms “data science” or “big data” became ubiquitous).
In the decade after his PhD, Konrad worked in a variety of financial institutions on a wide array of quantitative data analysis problems. In the process, he became an expert on the entire lifetime of a data product cycle. He has visited different ends of the frequency spectrum in finance (from high-frequency trading to credit risk, and everything in between), predicted potato prices, and analyzed anomalies in the performance of large-scale industrial equipment.
As a person who himself stood on the shoulders of giants, Konrad believes in sharing knowledge with others. In his spare time, he competes on Kaggle (“the home of data science”).
I would like to thank my brother for being a fixed point in a chaotic world and continuing to provide inspiration and motivation. Dzięki, Braciszku.
Luca Massaron is a data scientist with more than a decade of experience in transforming data into smarter artifacts, solving real-world problems, and generating value for businesses and stakeholders. He is the author of bestselling books on AI, machine learning, and algorithms. Luca is also a Kaggle Grandmaster who reached no. 7 in the worldwide user rankings for his performance in data science competitions, and a Google Developer Expert (GDE) in machine learning.
My warmest thanks go to my family, Yukiko and Amelia, for their support and loving patience as I prepared this new book in a long series.
My deepest thanks to Anthony Goldbloom for kindly writing the foreword for this book and to all the Kaggle Masters and Grandmasters who have so enthusiastically contributed to its making with their interviews, suggestions, and help.
Finally, I would like to thank Tushar Gupta, Parvathy Nair, Lucy Wan, Karan Sonawane, and all of the Packt Publishing editorial and production staff for their support on this writing effort.
Dr. Andrey Kostenko is a data science and machine learning professional with extensive experience across a variety of disciplines and industries, including hands-on coding in R and Python to build, train, and serve time series models for forecasting and other applications. He believes that lifelong learning and open-source software are both critical for innovation in advanced analytics and artificial intelligence.
Andrey recently assumed the role of Lead Data Scientist at Hydroinformatics Institute (H2i.sg), a specialized consultancy and solution services provider for all aspects of water management. Prior to joining H2i, Andrey had worked as Senior Data Scientist at IAG InsurTech Innovation Hub for over 3 years. Before moving to Singapore in 2018, he worked as Data Scientist at TrafficGuard.ai, an Australian AdTech start-up developing novel data-driven algorithms for mobile ad fraud detection. In 2013, Andrey received his doctorate degree in Mathematics and Statistics from Monash University, Australia. By then, he already had an MBA degree from the UK and his first university degree from Russia.
In his spare time, Andrey is often found engaged in competitive data science projects, learning new tools across R and Python ecosystems, exploring the latest trends in web development, solving chess puzzles, or reading about the history of science and mathematics.
Dr. Firat Gonen is the Head of Data Science and Analytics at Getir, where he leads the data science and data analysis teams delivering innovative, cutting-edge machine learning projects. Before Getir, Dr. Gonen managed Vodafone Turkey’s AI teams, and prior to that he was the Principal Data Scientist at Dogus Group (one of Turkey’s largest conglomerates). Gonen holds extensive educational qualifications, including a PhD in neuroscience and neural networks from the University of Houston, and is an expert in machine learning, deep learning, visual attention, decision-making, and genetic algorithms, with more than 12 years in the field. He has authored several peer-reviewed journal papers. He is also a Kaggle Triple Grandmaster with more than 10 international data competition medals, and was selected as a 2020 Z by HP Data Science Global Ambassador.
We were fortunate enough to be able to collect interviews from 31 talented Kagglers across the Kaggle community, who we asked to reflect on their time on the platform. You will find their answers scattered across the book. They represent a broad range of perspectives, with many insightful responses that are as similar as they are different. We read each one of their contributions with great interest and hope the same is true for you, the reader. We give thanks to all of them and list them in alphabetical order below.
Abhishek Thakur, who is currently building AutoNLP at Hugging Face.
Alberto Danese, Head of Data Science at Nexi.
Andrada Olteanu, Data Scientist at Endava, Dev Expert at Weights and Biases, and Z by HP Global Data Science Ambassador.
Andrew Maranhão, Senior Data Scientist at Hospital Albert Einstein in São Paulo.
Andrey Lukyanenko, Machine Learning Engineer and TechLead at MTS Group.
Bojan Tunguz, Machine Learning Modeler at NVIDIA.
Chris Deotte, Senior Data Scientist and Researcher at NVIDIA.
Dan Becker, VP Product, Decision Intelligence at DataRobot.
Dmitry Larko, Chief Data Scientist at H2O.ai.
Firat Gonen, Head of Data Science and Analytics at Getir and Z by HP Global Data Science Ambassador.
Gabriel Preda, Principal Data Scientist at Endava.
Gilberto Titericz, Senior Data Scientist at NVIDIA.
Giuliano Janson, Senior Applied Scientist for ML and NLP at Zillow Group.
Jean-François Puget, Distinguished Engineer, RAPIDS at NVIDIA, and the manager of the NVIDIA Kaggle Grandmaster team.
Jeong-Yoon Lee, Senior Research Scientist in the Rankers and Search Algorithm Engineering team at Netflix Research.
Kazuki Onodera, Senior Deep Learning Data Scientist at NVIDIA and member of the NVIDIA KGMON team.
Laura Fink, Head of Data Science at Micromata.
Martin Henze, PhD Astrophysicist and Data Scientist at Edison Software.
Mikel Bober-Irizar, Machine Learning Scientist at ForecomAI and Computer Science student at the University of Cambridge.
Osamu Akiyama, Medical Doctor at Osaka University.
Parul Pandey, Data Scientist at H2O.ai.
Paweł Jankiewicz, Chief Data Scientist & AI Engineer as well as Co-founder of LogicAI.
Rob Mulla, Senior Data Scientist at Biocore LLC.
Rohan Rao, Senior Data Scientist at H2O.ai.
Ruchi Bhatia, Data Scientist at OpenMined, Z by HP Global Data Science Ambassador, and graduate student at Carnegie Mellon University.
Ryan Chesler, Data Scientist at H2O.ai.
Shotaro Ishihara, Data Scientist and Researcher at a Japanese news media company.
Sudalai Rajkumar, an AI/ML advisor for start-up companies.
Xavier Conort, Founder and CEO at Data Mapping and Engineering.
Yifan Xie, Co-founder of Arion Ltd, a data science consultancy firm.
Yirun Zhang, final-year PhD student at King’s College London in applied machine learning.
Join the book’s Discord workspace for a monthly Ask me Anything session with the authors:
https://packt.link/KaggleDiscord
Preface
Part I: Introduction to Competitions
Introducing Kaggle and Other Data Science Competitions
The rise of data science competition platforms
The Kaggle competition platform
A history of Kaggle
Other competition platforms
Introducing Kaggle
Stages of a competition
Types of competitions and examples
Submission and leaderboard dynamics
Explaining the Common Task Framework paradigm
Understanding what can go wrong in a competition
Computational resources
Kaggle Notebooks
Teaming and networking
Performance tiers and rankings
Criticism and opportunities
Summary
Organizing Data with Datasets
Setting up a dataset
Gathering the data
Working with datasets
Using Kaggle Datasets in Google Colab
Legal caveats
Summary
Working and Learning with Kaggle Notebooks
Setting up a Notebook
Running your Notebook
Saving Notebooks to GitHub
Getting the most out of Notebooks
Upgrading to Google Cloud Platform (GCP)
One step beyond
Kaggle Learn courses
Summary
Leveraging Discussion Forums
How forums work
Example discussion approaches
Netiquette
Summary
Part II: Sharpening Your Skills for Competitions
Competition Tasks and Metrics
Evaluation metrics and objective functions
Basic types of tasks
Regression
Classification
Ordinal
The Meta Kaggle dataset
Handling never-before-seen metrics
Metrics for regression (standard and ordinal)
Mean squared error (MSE) and R squared
Root mean squared error (RMSE)
Root mean squared log error (RMSLE)
Mean absolute error (MAE)
Metrics for classification (label prediction and probability)
Accuracy
Precision and recall
The F1 score
Log loss and ROC-AUC
Matthews correlation coefficient (MCC)
Metrics for multi-class classification
Metrics for object detection problems
Intersection over union (IoU)
Dice
Metrics for multi-label classification and recommendation problems
MAP@{K}
Optimizing evaluation metrics
Custom metrics and custom objective functions
Post-processing your predictions
Predicted probability and its adjustment
Summary
Designing Good Validation
Snooping on the leaderboard
The importance of validation in competitions
Bias and variance
Trying different splitting strategies
The basic train-test split
Probabilistic evaluation methods
k-fold cross-validation
Subsampling
The bootstrap
Tuning your model validation system
Using adversarial validation
Example implementation
Handling different distributions of training and test data
Handling leakage
Summary
Modeling for Tabular Competitions
The Tabular Playground Series
Setting a random state for reproducibility
The importance of EDA
Dimensionality reduction with t-SNE and UMAP
Reducing the size of your data
Applying feature engineering
Easily derived features
Meta-features based on rows and columns
Target encoding
Using feature importance to evaluate your work
Pseudo-labeling
Denoising with autoencoders
Neural networks for tabular competitions
Summary
Hyperparameter Optimization
Basic optimization techniques
Grid search
Random search
Halving search
Key parameters and how to use them
Linear models
Support-vector machines
Random forests and extremely randomized trees
Gradient tree boosting
LightGBM
XGBoost
CatBoost
HistGradientBoosting
Bayesian optimization
Using Scikit-optimize
Customizing a Bayesian optimization search
Extending Bayesian optimization to neural architecture search
Creating lighter and faster models with KerasTuner
The TPE approach in Optuna
Summary
Ensembling with Blending and Stacking Solutions
A brief introduction to ensemble algorithms
Averaging models into an ensemble
Majority voting
Averaging of model predictions
Weighted averages
Averaging in your cross-validation strategy
Correcting averaging for ROC-AUC evaluations
Blending models using a meta-model
Best practices for blending
Stacking models together
Stacking variations
Creating complex stacking and blending solutions
Summary
Modeling for Computer Vision
Augmentation strategies
Keras built-in augmentations
ImageDataGenerator approach
Preprocessing layers
albumentations
Classification
Object detection
Semantic segmentation
Summary
Modeling for NLP
Sentiment analysis
Open domain Q&A
Text augmentation strategies
Basic techniques
nlpaug
Summary
Simulation and Optimization Competitions
Connect X
Rock-paper-scissors
Santa competition 2020
The name of the game
Summary
Part III: Leveraging Competitions for Your Career
Creating Your Portfolio of Projects and Ideas
Building your portfolio with Kaggle
Leveraging Notebooks and discussions
Leveraging Datasets
Arranging your online presence beyond Kaggle
Blogs and publications
GitHub
Monitoring competition updates and newsletters
Summary
Finding New Professional Opportunities
Building connections with other competition data scientists
Participating in Kaggle Days and other Kaggle meetups
Getting spotted and other job opportunities
The STAR approach
Summary (and some parting words)
Other Books You May Enjoy
Index
Having competed on Kaggle for over ten years, both of us have experienced highs and lows over many competitions. We often found ourselves refocusing our efforts on different activities relating to Kaggle. Over time, we devoted ourselves not just to competitions but also to creating content and code based on the demands of the data science market and our own professional aspirations. At this point in our journey, we felt that our combined experience and still-burning passion for competitions could really help other participants who have just started, or who would like to get inspired, to get hold of the essential expertise they need, so they can start their own journey in data science competitions.
We then decided to work on this book with a purpose:
To offer, in a single place, the best tips for being competitive and approaching most of the problems you may find when participating on Kaggle and also other data science competitions.
To offer enough suggestions to allow anyone to reach at least the Expert level in any Kaggle discipline: Competitions, Datasets, Notebooks, or Discussions.
To provide tips on how to learn the most from Kaggle and leverage this experience for professional growth in data science.
To gather in a single source the largest number of perspectives on the experience of participating in competitions, by interviewing Kaggle Masters and Grandmasters and listening to their stories.
In short, we have written a book that demonstrates how to participate in competitions successfully and make the most of all the opportunities that Kaggle offers. The book is also intended as a practical reference that saves you time and effort, through its selection of many competition tips and tricks that are hard to learn about and find on the internet or on Kaggle forums. Nevertheless, the book doesn’t limit itself to providing practical help; it also aspires to help you figure out how to boost your career in data science by participating in competitions.
Please be aware: this book doesn’t teach you data science from the basics. We don’t explain in detail how linear regression or random forests or gradient boosting work, but how to use them in the best way and obtain the best results from them in a data problem. We expect solid foundations and at least a basic proficiency in data science topics and Python usage from our readers. If you are still a data science beginner, you need to supplement this book with other books on data science, machine learning, and deep learning, and train up on online courses, such as those offered by Kaggle itself or by MOOCs such as edX or Coursera.
If you want to start learning data science in a practical way, if you want to challenge yourself with tricky and intriguing data problems and simultaneously build a network of great fellow data scientists as passionate about their work in data as you are, this is indeed the book for you. Let’s get started!
At the time of completion of this book, there are 96,190 Kaggle novices (users who have just registered on the website) and 67,666 Kaggle contributors (users who have just filled in their profile) enlisted in Kaggle competitions. This book has been written for all of them and for anyone else wanting to break the ice and start taking part in competitions on Kaggle and learning from them.
Chapter 1, Introducing Kaggle and Other Data Science Competitions, discusses how competitive programming evolved into data science competitions. It explains why the Kaggle platform is the most popular site for these competitions and provides you with an idea about how it works.
Chapter 2, Organizing Data with Datasets, introduces you to Kaggle Datasets, the standard method of data storage on the platform. We discuss setup, gathering data, and utilizing it in your work on Kaggle.
Chapter 3, Working and Learning with Kaggle Notebooks, discusses Kaggle Notebooks, the baseline coding environment. We talk about the basics of Notebook usage, as well as how to leverage the GCP environment, and using them to build up your data science portfolio.
Chapter 4, Leveraging Discussion Forums, allows you to familiarize yourself with discussion forums, the primary manner of communication and idea exchange on Kaggle.
Chapter 5, Competition Tasks and Metrics, details how evaluation metrics for certain kinds of problems strongly influence the way you can operate when building your model solution in a data science competition. The chapter also addresses the large variety of metrics available in Kaggle competitions.
Chapter 6, Designing Good Validation, will introduce you to the importance of validation in data competitions, discussing overfitting, shake-ups, leakage, adversarial validation, different kinds of validation strategies, and strategies for your final submissions.
Chapter 7, Modeling for Tabular Competitions, discusses tabular competitions, mostly focusing on the more recent reality of Kaggle, the Tabular Playground Series. Tabular problems are standard practice for the majority of data scientists around and there is a lot to learn from Kaggle.
Chapter 8, Hyperparameter Optimization, explores how to extend the cross-validation approach to find the best hyperparameters for your models – in other words, those that can generalize in the best way on the private leaderboard – under the pressure and scarcity of time and resources that you experience in Kaggle competitions.
Chapter 9, Ensembling with Blending and Stacking Solutions, explains ensembling techniques for multiple models such as averaging, blending, and stacking. We will provide you with some theory, some practice, and some code examples you can use as templates when building your own solutions on Kaggle.
Chapter 10, Modeling for Computer Vision, discusses problems related to computer vision, one of the most popular topics in AI in general, and on Kaggle specifically. We demonstrate full pipelines for building solutions to challenges in image classification, object detection, and image segmentation.
Chapter 11, Modeling for NLP, focuses on the frequently encountered types of Kaggle challenges related to natural language processing. We demonstrate how to build an end-to-end solution for popular problems like open domain question answering.
Chapter 12, Simulation and Optimization Competitions, provides an overview of simulation competitions, a new class of contests gaining popularity on Kaggle over the last few years.
Chapter 13, Creating Your Portfolio of Projects and Ideas, explores ways you can stand out by showcasing your work on Kaggle itself and other sites in an appropriate way.
Chapter 14, Finding New Professional Opportunities, concludes the overview of how Kaggle can positively affect your career by discussing the best ways to leverage all your Kaggle experience in order to find new professional opportunities.
The Python code in this book has been designed to be run on a Kaggle Notebook, without any installation on a local computer. Therefore, don’t worry about what machine you have available or what version of Python packages you should install.
All you need is a computer with access to the internet and a free Kaggle account. In fact, to run the code on a Kaggle Notebook (you will find instructions about the procedure in Chapter 3), you first need to open an account on Kaggle. If you don’t have one yet, just go to www.kaggle.com and follow the instructions on the website.
We link out to many different resources throughout the book that we think you will find useful. When referred to a link, explore it: you will find code available on public Kaggle Notebooks that you can reuse, or further materials to illustrate concepts and ideas that we have discussed in the book.
The code bundle for the book is hosted on GitHub at https://github.com/PacktPublishing/The-Kaggle-Book. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://static.packt-cdn.com/downloads/9781801817479_ColorImages.pdf.
There are a few text conventions used throughout this book.
CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. For example: “The dataset will be downloaded to the Kaggle folder as a .zip archive – unpack it and you are good to go.”
A block of code is set as follows:

from google.colab import drive
drive.mount('/content/gdrive')

Any command-line input or output is written as follows:

I genuinely have no idea what the output of this sequence of words will be - it will be interesting to find out what nlpaug can do with this!

Bold: Indicates a new term, an important word, or words that you see on the screen, for example, in menus or dialog boxes. For example: “The specific limits at the time of writing are 100 GB per private dataset and a 100 GB total quota.”
Further notes, references, and links to useful places appear like this.
Tips and tricks appear like this.
Feedback from our readers is always welcome.
General feedback: Email [email protected], and mention the book’s title in the subject of your message. If you have questions about any aspect of this book, please email us at [email protected].
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details.
Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit http://authors.packtpub.com.
Once you’ve read The Kaggle Book, we’d love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.
Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.
Thanks for purchasing this book!
Do you like to read on the go but are unable to carry your print books everywhere? Is your eBook purchase not compatible with the device of your choice?
Don’t worry, now with every Packt book you get a DRM-free PDF version of that book at no cost.
Read anywhere, any place, on any device. Search, copy, and paste code from your favorite technical books directly into your application.
The perks don’t stop there: you can get exclusive access to discounts, newsletters, and great free content in your inbox daily.
Follow these simple steps to get the benefits:
Scan the QR code or visit the link below:
https://packt.link/free-ebook/9781801817479
Submit your proof of purchase.
That’s it! We’ll send your free PDF and other benefits to your email directly.

Data science competitions have long been around, and they have experienced growing success over time, starting from a niche community of passionate competitors, drawing more and more attention, and reaching a much larger audience of millions of data scientists. As longtime competitors on the most popular data science competition platform, Kaggle, we have witnessed and directly experienced all these changes through the years.
At the moment, if you look for information about Kaggle and other competition platforms, you can easily find a large number of meetups, discussion panels, podcasts, interviews, and even online courses explaining how to win in such competitions (usually telling you to use a variable mixture of grit, computational resources, and time invested). However, apart from the book that you are reading now, you won’t find any structured guides about how to navigate so many data science competitions and how to get the most out of them – not just in terms of score or ranking, but also professional experience.
In this book, instead of just packaging up a few hints about how to win or score highly on Kaggle and other data science competitions, our intention is to present you with a guide on how to compete better on Kaggle and get back the maximum possible from your competition experiences, particularly from the perspective of your professional life. Also accompanying the contents of the book are interviews with Kaggle Masters and Grandmasters. We hope they will offer you some different perspectives and insights on specific aspects of competing on Kaggle, and inspire the way you will test yourself and learn doing competitive data science.
By the end of this book, you’ll have absorbed the knowledge we drew directly from our own experiences, resources, and learnings from competitions, and everything you need to pave a way for yourself to learn and grow, competition after competition.
As a starting point, in this chapter, we will explore how competitive programming evolved into data science competitions, why the Kaggle platform is the most popular site for such competitions, and how it works.
We will cover the following topics:
The rise of data science competition platforms
The Common Task Framework paradigm
The Kaggle platform and some other alternatives
How a Kaggle competition works: stages, competition types, submission and leaderboard dynamics, computational resources, networking, and more

Competitive programming has a long history, starting in the 1970s with the first iterations of the ICPC, the International Collegiate Programming Contest. In the original ICPC, small teams from universities and companies participated in a competition that required solving a series of problems using a computer program (at the beginning, participants coded in FORTRAN). In order to achieve a good final rank, teams had to display good skills in team working, problem solving, and programming.
The experience of participating in the heat of such a competition, and the opportunity to stand in the spotlight for recruiting companies, provided the students with ample motivation and made the competition popular for many years. Among ICPC finalists, a few have become renowned: Adam D’Angelo, the former CTO of Facebook and founder of Quora; Nikolai Durov, the co-founder of Telegram Messenger; and Matei Zaharia, the creator of Apache Spark. Together with many other professionals, they all share the same experience: having taken part in an ICPC.
After the ICPC, programming competitions flourished, especially after 2000 when remote participation became more feasible, allowing international competitions to run more easily and at a lower cost. The format is similar for most of these competitions: there is a series of problems and you have to code a solution to solve them. The winners are given a prize, but also make themselves known to recruiting companies or simply become famous.
Typically, problems in competitive programming range from combinatorics and number theory to graph theory, algorithmic game theory, computational geometry, string analysis, and data structures. Recently, problems relating to artificial intelligence have successfully emerged, in particular after the launch of the KDD Cup, a contest in knowledge discovery and data mining, held by the Association for Computing Machinery’s (ACM’s) Special Interest Group (SIG) during its annual conference (https://kdd.org/conferences).
The first KDD Cup, held in 1997, involved a problem about direct marketing for lift curve optimization and it started a long series of competitions that continues today. You can find the archives containing datasets, instructions, and winners at https://www.kdd.org/kdd-cup. Here is the latest available at the time of writing: https://ogb.stanford.edu/kddcup2021/. KDD Cups proved quite effective in establishing best practices, with many published papers describing solutions, techniques, and competition dataset sharing, which have been useful for many practitioners for experimentation, education, and benchmarking.
The successful examples of both competitive programming events and the KDD Cup inspired companies (such as Netflix) and entrepreneurs (such as Anthony Goldbloom, the founder of Kaggle) to create the first data science competition platforms, where companies can host data science challenges that are hard to solve and might benefit from crowdsourcing. In fact, given that there is no golden approach that works for all the problems in data science, many problems require a time-consuming approach that can be summed up as try all that you can try.
In fact, in the long run, no algorithm can beat all the others on all problems, as stated by the No Free Lunch theorem of David Wolpert and William Macready. The theorem says that, averaged across all possible problems, every machine learning algorithm performs equally well; an algorithm does well on a specific problem only when its hypothesis space happens to contain a good representation of the solution. Consequently, as you cannot know beforehand whether a machine learning algorithm will best tackle your problem, you have to try it, testing it directly on your problem before being assured that you are doing the right thing. There are no theoretical shortcuts or other holy grails of machine learning – only empirical experimentation can tell you what works.
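To make the point concrete, here is a toy sketch with entirely made-up data and two deliberately naive forecasting “algorithms” (a mean predictor and a last-value predictor). Each one wins on a different series, so only testing both on your data reveals which to use:

```python
# Toy illustration of the No Free Lunch idea: neither "algorithm" wins on
# both problems, so only direct experimentation tells you which one to use.
# Both models and datasets are made up for illustration.

def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_model(history):
    """Predict the mean of all past values."""
    return sum(history) / len(history)

def persistence_model(history):
    """Predict the last observed value."""
    return history[-1]

def evaluate(series, model):
    """One-step-ahead forecasts over the series, scored by MSE."""
    preds = [model(series[:i]) for i in range(1, len(series))]
    return mse(series[1:], preds)

stationary = [5, 4, 6, 5, 5, 4, 6, 5, 5, 6]   # hovers around 5
trending   = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]  # steadily increasing

for name, series in [("stationary", stationary), ("trending", trending)]:
    scores = {"mean": evaluate(series, mean_model),
              "persistence": evaluate(series, persistence_model)}
    best = min(scores, key=scores.get)
    print(f"{name}: best model is {best} ({scores})")
```

On the stationary series the mean predictor scores the lower error, while on the trending series the last-value predictor does; neither dominates both.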
For more details, you can look up the No Free Lunch theorem for a theoretical explanation of this practical truth. Here is a complete article from Analytics India Magazine on the topic: https://analyticsindiamag.com/what-are-the-no-free-lunch-theorems-in-data-science/.
Crowdsourcing proves ideal in conditions where you need to test algorithms and data transformations extensively to find the best possible combinations, but you lack the manpower and computing power to do so. That’s why, for instance, governments and companies resort to competitions in order to advance in certain fields:
On the government side, we can quote DARPA and its many competitions surrounding self-driving cars, robotic operations, machine translation, speaker identification, fingerprint recognition, information retrieval, OCR, automatic target recognition, and many others.
On the business side, we can quote a company such as Netflix, which entrusted a competition with improving its algorithm for predicting user movie selection.

The Netflix competition was based on the idea of improving existing collaborative filtering. The purpose was simply to predict the potential rating a user would give a film, solely based on the ratings that they gave other films, without knowing specifically who the user was or what the films were. Since no user descriptions or movie titles or descriptions were available (all being replaced with identity codes), the competition required entrants to develop smart ways to use the past ratings available. The grand prize of US $1,000,000 was to be awarded only if the solution could improve the existing Netflix algorithm, Cinematch, above a certain threshold.
The competition ran from 2006 to 2009 and saw victory for a team made up of the fusion of many previous competition teams: a team from Commendo Research & Consulting GmbH, Andreas Töscher and Michael Jahrer, also quite renowned in Kaggle competitions; two researchers from AT&T Labs; and two others from Yahoo!. In the end, winning the competition required so much computational power and the ensembling of different solutions that teams were forced to merge in order to keep pace. This situation was also reflected in what Netflix actually did with the solution: the company preferred not to implement it, but simply took the most interesting insights from it in order to improve its existing Cinematch algorithm. You can read more about it in this Wired article: https://www.wired.com/2012/04/netflix-prize-costs/.
At the end of the Netflix competition, what mattered was not the solution per se, which was quickly superseded by the shift in Netflix’s business focus from DVDs to online movies. The real benefit, for both the participants, who gained a huge reputation in collaborative filtering, and the company, which could transfer its improved recommendation knowledge to its new business, was the insight gained from the competition.
Companies other than Netflix have also benefitted from data science competitions. The list is long, but we can quote a few examples where the company running the competition reported a clear benefit from it. For instance:
The insurance company Allstate was able to improve its actuarial models, built by its own experts, thanks to a competition involving hundreds of data scientists (https://www.kaggle.com/c/ClaimPredictionChallenge)
As another well-documented example, General Electric was able to improve by 40% on the industry-standard performance (measured by the root mean squared error metric) for predicting arrival times of airline flights, thanks to a similar competition (https://www.kaggle.com/c/flight)

The Kaggle competition platform has to this day held hundreds of competitions, and these two are just a couple of examples of companies that used them successfully. Let’s take a step back from specific competitions for a moment and talk about the Kaggle company, which is the common thread through this book.
Kaggle took its first steps in February 2010, thanks to Anthony Goldbloom, an Australian economist with a degree in Economics and Econometrics. After working at Australia’s Department of the Treasury and the Research department of the Reserve Bank of Australia, Goldbloom interned in London at The Economist, the international weekly newspaper covering current affairs, international business, politics, and technology. At The Economist, he had occasion to write an article about big data, which inspired his idea to build a competition platform that could crowdsource the best analytical experts to solve interesting machine learning problems (https://www.smh.com.au/technology/from-bondi-to-the-big-bucks-the-28yearold-whos-making-data-science-a-sport-20111104-1myq1.html). Since crowdsourcing dynamics played a relevant part in the business idea for this platform, he derived the name Kaggle, which rhymes with gaggle, the term for a flock of geese; the goose is also the symbol of the platform.
After moving to Silicon Valley in the USA, his Kaggle start-up received $11.25 million in Series A funding from a round led by Khosla Ventures and Index Ventures, two renowned venture capital firms. The first competitions were rolled out, the community grew, and some of the initial competitors came to be quite prominent, such as Jeremy Howard, the Australian data scientist and entrepreneur, who, after winning a couple of competitions on Kaggle, became the President and Chief Scientist of the company.
Jeremy Howard left his position as President in December 2013 and established a new start-up, fast.ai (www.fast.ai), offering machine learning courses and a deep learning library for coders.
At the time, there were some other prominent Kagglers (the name indicating frequent participants of competitions held by Kaggle) such as Jeremy Achin and Thomas de Godoy. After reaching the top 20 global rankings on the platform, they promptly decided to retire and to found their own company, DataRobot. Soon after, they started hiring their employees from among the best participants in the Kaggle competitions in order to instill the best machine learning knowledge and practices into the software they were developing. Today, DataRobot is one of the leading companies in developing AutoML solutions (software for automatic machine learning).
The Kaggle competitions claimed more and more attention from a growing audience. Even Geoffrey Hinton, the “godfather” of deep learning, participated in (and won) a Kaggle competition hosted by Merck in 2012 (https://www.kaggle.com/c/MerckActivity/overview/winners). Kaggle was also the platform where François Chollet launched his deep learning package Keras during the Otto Group Product Classification Challenge (https://www.kaggle.com/c/otto-group-product-classification-challenge/discussion/13632) and Tianqi Chen launched XGBoost, a speedier and more accurate version of gradient boosting machines, in the Higgs Boson Machine Learning Challenge (https://www.kaggle.com/c/higgs-boson/discussion/10335).
Besides Keras, François Chollet has also provided the most useful and insightful perspective on how to win a Kaggle competition in an answer of his on the Quora website: https://www.quora.com/Why-has-Keras-been-so-successful-lately-at-Kaggle-competitions.
Fast iterations of multiple attempts, guided by empirical (more than theoretical) evidence, are actually all that you need. We don’t think that there are many more secrets to winning a Kaggle competition than the ones he pointed out in his answer.
Notably, François Chollet also hosted his own competition on Kaggle (https://www.kaggle.com/c/abstraction-and-reasoning-challenge/), which is widely recognized as being the first general AI competition in the world.
Competition after competition, the community revolving around Kaggle grew to touch one million members in 2017, the same year in which Fei-Fei Li, Chief Scientist at Google, announced during her keynote at Google Next that Google Alphabet was going to acquire Kaggle. Since then, Kaggle has been part of Google.
Today, the Kaggle community is still active and growing. In a tweet of his (https://twitter.com/antgoldbloom/status/1400119591246852096), Anthony Goldbloom reported that most of its users, other than participating in a competition, have downloaded public data (Kaggle has become an important data hub), created a public Notebook in Python or R, or learned something new in one of the courses offered:
Figure 1.1: A bar chart showing how users used Kaggle in 2020, 2019, and 2018
Through the years, Kaggle has offered many of its participants even more opportunities, such as:
Creating their own company
Launching machine learning software and packages
Getting interviews in magazines (https://www.wired.com/story/solve-these-tough-data-problems-and-watch-job-offers-roll-in/)
Writing machine learning books (https://twitter.com/antgoldbloom/status/745662719588589568)
Finding their dream job
And, most importantly, learning more about the skills and technicalities involved in data science.
Though this book focuses on competitions on Kaggle, we cannot forget that many data competitions are held on private platforms or on other competition platforms. In truth, most of the information you will find in this book will also hold for other competitions, since they essentially all operate under similar principles and the benefits for the participants are more or less the same.
Although many other platforms are localized in specific countries or are specialized only for certain kinds of competitions, for completeness we will briefly introduce some of them, at least those we have some experience and knowledge of:
DrivenData (https://www.drivendata.org/competitions/) is a crowdsourcing competition platform devoted to social challenges (see https://www.drivendata.co/blog/intro-to-machine-learning-social-impact/). The company itself is a social enterprise whose aim is to bring data science solutions to organizations tackling the world’s biggest challenges, thanks to data scientists building algorithms for social good. For instance, as you can read in this article, https://www.engadget.com/facebook-ai-hate-speech-covid-19-160037191.html, Facebook chose DrivenData for its competition on building models against hate speech and misinformation.
Numerai (https://numer.ai/) is an AI-powered, crowdsourced hedge fund based in San Francisco. It hosts a weekly tournament in which you can submit your predictions on obfuscated hedge fund data and earn prizes in the company’s cryptocurrency, Numeraire.
CrowdANALYTIX (https://www.crowdanalytix.com/community) is a bit less active now, but this platform used to host quite a few challenging competitions until recently, as you can read in this blog post: https://towardsdatascience.com/how-i-won-top-five-in-a-deep-learning-competition-753c788cade1. The community blog is quite interesting for getting an idea of the challenges you can find on this platform: https://www.crowdanalytix.com/jq/communityBlog/listBlog.html.
Signate (https://signate.jp/competitions) is a Japanese data science competition platform. It is quite rich in contests and offers a ranking system similar to Kaggle’s (https://signate.jp/users/rankings).
Zindi (https://zindi.africa/competitions) is a data science competition platform from Africa. It hosts competitions focused on solving Africa’s most pressing social, economic, and environmental problems.
Alibaba Cloud (https://www.alibabacloud.com/campaign/tianchi-competitions) is a Chinese cloud computing and AI provider that has launched the Tianchi Academic competitions, partnering with academic conferences such as SIGKDD, IJCAI-PRICAI, and CVPR and featuring challenges such as image-based 3D shape retrieval, 3D object reconstruction, and instance segmentation.
Analytics Vidhya (https://datahack.analyticsvidhya.com/) is the largest Indian community for data science, offering a platform for data science hackathons.
CodaLab (https://codalab.lri.fr/) is a French-based data science competition platform, created as a joint venture between Microsoft and Stanford University in 2013. It features a free cloud-based notebook called Worksheets (https://worksheets.codalab.org/) for knowledge sharing and reproducible modeling.

Other minor platforms are CrowdAI (https://www.crowdai.org/) from École Polytechnique Fédérale de Lausanne in Switzerland, InnoCentive (https://www.innocentive.com/), Grand-Challenge (https://grand-challenge.org/) for biomedical imaging, DataFountain (https://www.datafountain.cn/business?lang=en-US), OpenML (https://www.openml.org/), and the list could go on. You can always find a large list of ongoing major competitions at the Russian community Open Data Science (https://ods.ai/competitions) and even discover new competition platforms from time to time.
You can see an overview of running competitions on the mlcontests.com website, along with the current costs for renting GPUs. The website is often updated and it is an easy way to get a glance at what’s going on with data science competitions across different platforms.
Kaggle is always the best platform where you can find the most interesting competitions and obtain the widest recognition for your competition efforts. However, picking up a challenge outside of it makes sense, and we recommend it as a strategy, when you find a competition matching your personal and professional interests. As you can see, there are quite a lot of alternatives and opportunities besides Kaggle, which means that if you consider more competition platforms alongside Kaggle, you can more easily find a competition that might interest you because of its specialization or data.
In addition, you can expect less competitive pressure during these challenges (and consequently a better ranking or even winning something), since they are less known and advertised. Just expect less sharing among participants, since no other competition platform has reached the same richness of sharing and networking opportunities as Kaggle.
At this point, we need to delve more deeply into how Kaggle in particular works. In the following paragraphs, we will discuss the various aspects of the Kaggle platform and its competitions, and you’ll get a flavor of what it means to be in a competition on Kaggle. Afterward, we’ll come back to discuss many of these topics in much more detail, with more suggestions and strategies in the remaining chapters of the book.
A competition on Kaggle is arranged into different steps. By having a look at each of them, you can get a better understanding of how a data science competition works and what to expect from it.
When a competition is launched, there are usually some posts on social media, for instance on the Kaggle Twitter profile, https://twitter.com/kaggle, that announce it, and a new tab will appear in the Active Competitions section of the Competitions page (https://www.kaggle.com/competitions). If you click on a particular competition’s tab, you’ll be taken to its page. At a glance, you can check if the competition will have prizes (and if it awards points and medals, a secondary consequence of participating in a competition), how many teams are currently involved, and how much time is still left for you to work on a solution:
Figure 1.2: A competition’s page on Kaggle
There, you can explore the Overview menu first, which provides information about:
The topic of the competition
Its evaluation metric (which your models will be evaluated against)
The timeline of the competition
The prizes
The legal or competition requirements

The timeline is usually a bit overlooked, but it should be one of the first things you check; it doesn’t simply tell you when the competition starts and ends, but also provides the rule acceptance deadline, which is usually from seven days to two weeks before the competition closes. The rule acceptance deadline marks the last day you can join the competition (by accepting its rules). There is also the team merger deadline: you can arrange to combine your team with another competitor’s at any point before that deadline, but after that it won’t be possible.
The Rules menu is also quite often overlooked (with people just jumping to Data), but it is important to check it because it can tell you about the requirements of the competition. Among the key information you can get from the rules, there is:
Your eligibility for a prize
Whether you can use external data to improve your score
How many submissions (tests of your solution) a day you get
How many final solutions you can choose

Once you have accepted the rules, you can download any data from the Data menu or directly start working on Kaggle Notebooks (online, cloud-based notebooks) from the Code menu, reusing code that others have made available or creating your own code from scratch.
If you decide to download the data, also consider that you have a Kaggle API that can help you to run downloads and submissions in an almost automated way. It is an important tool if you are running your models on your local computer or on your cloud instance. You can find more details about the API at https://www.kaggle.com/docs/api and you can get the code from GitHub at https://github.com/Kaggle/kaggle-api.
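As a minimal sketch of such automation (the competition slug titanic and the file name submission.csv are just placeholders), this is how you might assemble download and submit commands for the Kaggle CLI from Python; in a real workflow you would pass them to subprocess.run after installing the kaggle package and setting up your API token:

```python
# Sketch: assembling Kaggle CLI commands for automated downloads and
# submissions. "titanic" and "submission.csv" are placeholder names.
competition = "titanic"

download_cmd = ["kaggle", "competitions", "download", "-c", competition]
submit_cmd = ["kaggle", "competitions", "submit", "-c", competition,
              "-f", "submission.csv", "-m", "first attempt"]

# In practice: subprocess.run(download_cmd, check=True), once the CLI is
# installed (pip install kaggle) and your token is in ~/.kaggle/kaggle.json
for cmd in (download_cmd, submit_cmd):
    print(" ".join(cmd))
```

Keeping the commands as argument lists makes them easy to reuse in a script that downloads data, trains locally, and submits in one go.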
If you check the Kaggle GitHub repo closely, you can also find all the Docker images they use for their online notebooks, Kaggle Notebooks:
Figure 1.3: A Kaggle Notebook ready to be coded
At this point, as you develop your solution, it is our warm suggestion not to continue in solitude, but to contact other competitors through the Discussion forum, where you can ask and answer questions specific to the competition. Often you will also find useful hints about specific problems with the data or even ideas to help improve your own solution. Many successful Kagglers have reported finding ideas on the forums that have helped them perform better and, more importantly, learn more about modeling in data science.
Once your solution is ready, you can submit it to the Kaggle evaluation engine, in adherence to the specifications of the competition. Some competitions will accept a CSV file as a solution, others will require you to code and produce results in a Kaggle Notebook. You can keep submitting solutions throughout the competition.
Every time you submit a solution, soon after, the leaderboard will provide you with a score and a position among the competitors (the wait time varies depending on the computations necessary for the score evaluation). That position is only roughly indicative, because it reflects the performance of your model on a part of the test set, called the public test set, since your performance on it is made public during the competition for everyone to know.
Before the competition closes, each competitor can choose a number (usually two) of their solutions for the final evaluation.
Figure 1.4: A diagram demonstrating how data turns into scores for the public and private leaderboard
Only when the competition closes, based on the models the contestants have decided to be scored, is their score on another part of the test set, called the private test set, revealed. This new leaderboard, the private leaderboard, constitutes the final, effective scores for the competition, but it is still not official and definitive in its rankings. In fact, the Kaggle team will take some time to check that everything is correct and that all contestants have respected the rules of the competition.
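The public/private split also explains why your final rank can differ from your standing during the competition. A toy sketch with made-up labels (accuracy as a stand-in metric, split indices chosen by hand) shows how a submission can lead the public leaderboard yet lose on the private one:

```python
# Toy sketch of public/private leaderboard scoring with made-up data.
true_labels = [i % 2 for i in range(100)]
public_idx = range(0, 30)     # part scored during the competition
private_idx = range(30, 100)  # part revealed only at the end

def flip(labels, wrong):
    """Build a submission that is wrong exactly on the given indices."""
    return [1 - y if i in wrong else y for i, y in enumerate(labels)]

def accuracy(y_true, y_pred, idx):
    return sum(y_true[i] == y_pred[i] for i in idx) / len(idx)

# Submission A is perfect on the public rows but poor on many private ones;
# submission B is slightly worse in public but stronger overall.
sub_a = flip(true_labels, set(range(30, 60)))
sub_b = flip(true_labels, set(range(0, 6)) | set(range(30, 40)))

for name, sub in [("A", sub_a), ("B", sub_b)]:
    print(name,
          "public:", round(accuracy(true_labels, sub, public_idx), 3),
          "private:", round(accuracy(true_labels, sub, private_idx), 3))
# A scores 1.0 in public but only ~0.571 in private;
# B scores 0.8 in public but ~0.857 in private.
```

Here submission A tops the public leaderboard while B wins the private one, which is why chasing the public score too aggressively is risky.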
After a while (and sometimes after some changes in the rankings due to disqualifications), the private leaderboard will become official and definitive, the winners will be declared, and many participants will unveil their strategies, their solutions, and their code on the competition discussion forum. At this point, it is up to you to check the other solutions and try to improve your own. We strongly recommend that you do so, since this is another important source of learning in Kaggle.
Kaggle competitions are grouped into categories, and each category has different implications in terms of how to compete and what to expect. The type of data, the difficulty of the problem, the awarded prizes, and the competition dynamics are quite diverse across the categories, therefore it is important to understand beforehand what each implies.
Here are the official categories that you can use to filter out the different competitions:
Featured
Masters
Annuals
Research
Recruitment
Getting Started
Playground
Analytics
Community

Featured are the most common type of competitions, involving a business-related problem from a sponsor company and a prize for the top performers. The winners will grant a non-exclusive license of their work to the sponsor company; they will have to prepare a detailed report of their solution and sometimes even participate in meetings with the sponsor company.
There are examples of Featured competitions every time you visit Kaggle. At the moment, many of them are problems relating to the application of deep learning methods to unstructured data like text, images, videos, or sound. In the past, tabular data competitions, that is, competitions based on problems relating to structured data that can be found in a database, were commonly seen. First using random forests, and later gradient boosting methods with clever feature engineering, tabular solutions derived from Kaggle could really improve on an existing solution. Nowadays, these competitions are run much less often, because a crowdsourced solution won’t often be much better than what a good team of data scientists or even AutoML software can do. Given the spread of better software and good practices, the increase in result quality obtainable from competitions is indeed marginal. In the unstructured data world, however, a good deep learning solution could still make a big difference. For instance, pre-trained networks such as BERT brought about double-digit increases over previous standards for many well-known NLP task benchmarks.
Masters are less common now, but they are private, invite-only competitions. The purpose was to create competitions reserved for experts, generally competitors ranked as Masters or Grandmasters based on Kaggle medal rankings.
Annuals are competitions that always appear during a certain period of the year. Among the Annuals, we have the Santa Claus competitions (usually based on an algorithmic optimization problem) and the March Machine Learning Mania competition, run every year since 2014 during the US College Basketball Tournaments.
Research competitions imply a research or science purpose instead of a business one, sometimes for serving the public good. That’s why these competitions do not always offer prizes. In addition, these competitions sometimes require the winning participants to release their solution as open-source.
Google has released a few Research competitions in the past, such as Google Landmark Recognition 2020 (https://www.kaggle.com/c/landmark-recognition-2020), where the goal was to label famous (and not-so-famous) landmarks in images.
Sponsors that want to test the ability of potential job candidates hold Recruitment competitions. These competitions are limited to teams of one and offer the best-placed competitors an interview with the sponsor as a prize. Competitors have to upload their CV at the end of the competition if they want to be considered for being contacted.
Examples of Recruitment competitions have been:
The Facebook Recruiting Competition (https://www.kaggle.com/c/FacebookRecruiting); Facebook have held a few of this kind
The Yelp Recruiting Competition (https://www.kaggle.com/c/yelp-recruiting)

Getting Started competitions do not offer any prizes, but friendly and easy problems for beginners to get accustomed to Kaggle principles and dynamics. They are usually semi-permanent competitions whose leaderboards are refreshed from time to time. If you are looking for a tutorial in machine learning, these competitions are the right places to start, because you can find a highly collaborative environment and there are many Kaggle Notebooks available showing you how to process the data and create different types of machine learning models.
Famous ongoing Getting Started competitions are:
Digit Recognizer (https://www.kaggle.com/c/digit-recognizer)
Titanic — Machine Learning from Disaster (https://www.kaggle.com/c/titanic)
House Prices — Advanced Regression Techniques (https://www.kaggle.com/c/house-prices-advanced-regression-techniques)

Playground competitions are a little more difficult than the Getting Started ones, but they are also meant for competitors to learn and test their abilities without the pressure of a fully fledged Featured competition (though in Playground competitions the heat of the competition may sometimes run quite high). The usual prizes for such competitions are just swag (an acronym for “Stuff We All Get,” such as, for instance, a cup, a t-shirt, or socks branded by Kaggle; see https://www.kaggle.com/general/68961) or a bit of money.
One famous Playground competition is the original Dogs vs. Cats competition (https://www.kaggle.com/c/dogs-vs-cats), where the task is to create an algorithm to distinguish dogs from cats.
Mention should also go to Analytics competitions, where the evaluation is qualitative and participants are required to provide ideas, drafts of solutions, PowerPoint slides, charts, and so on; and to Community (previously known as InClass) competitions, which are held by academic institutions as well as Kagglers. You can read about the launch of the Community competitions at https://www.kaggle.com/product-feedback/294337, and you can get tips about running one of your own at https://www.kaggle.com/c/about/host and https://www.kaggle.com/community-competitions-setup-guide.
Parul Pandey
https://www.kaggle.com/parulpandey
We spoke to Parul Pandey, Kaggle Notebooks Grandmaster, Datasets Master, and data scientist at H2O.ai, about her experience with Analytics competitions and more.
What’s your favorite kind of competition and why? In terms of techniques and solving approaches, what is your specialty on Kaggle?
I really enjoy the Data Analytics competitions, which require you to analyze the data and provide a comprehensive analysis report at the end. These include the Data Science for Good competitions (DS4G), sports analytics competitions (NFL etc.), and the general survey challenges. Unlike the traditional competitions, these competitions don’t have a leaderboard to track your performance compared to others; nor do you get any medals or points.
On the other hand, these competitions demand end-to-end solutions touching on multi-faceted aspects of data science like data cleaning, data mining, visualization, and conveying insights. Such problems provide a way to mimic real-life scenarios and share your insights and viewpoints. There may not be a single best answer to the problem, but it gives you a chance to deliberate on and weigh up potential solutions, and incorporate them into your final one.
How do you approach a Kaggle competition? How different is this approach to what you do in your day-to-day work?
My first step is always to analyze the data as part of EDA (exploratory data analysis). It is something that I also follow as part of my work routine. Typically, I explore the data to look for potential red flags like inconsistencies in data, missing values, outliers, etc., which might pose problems later. The next step is to create a good and reliable cross-validation strategy. Then I read the discussion forums and look at some of the Notebooks shared by people. It generally acts as a good starting point, and then I can incorporate things in this workflow from my past experiences. It is also essential to track the model performance.
For an Analytics competition, however, I like to break down the problem into multiple steps. For instance, the first part could be related to understanding the problem, which may require a few days. After that, I like to explore the data, followed by creating a basic baseline solution. Then I continue enhancing this solution by adding a piece at a time. It might be akin to adding Lego bricks one part at a time to create that final masterpiece.
Tell us about a particularly challenging competition you entered, and what insights you used to tackle the task.
As I mentioned, I mostly like to compete in Analytics competitions, even though occasionally I also try my hand in the regular ones too. I’d like to point out a very intriguing Data Science for Good competition titled Environmental Insights Explorer (https://www.kaggle.com/c/ds4g-environmental-insights-explorer). The task was to use remote sensing techniques to understand environmental emissions instead of calculating emissions factors from current methodologies.
What really struck me was the use case. Our planet is grappling with climate change issues, and this competition touched on this very aspect. While researching for my competition, I was amazed to find the amount of progress being made in this field of satellite imagery and it gave me a chance to understand and dive more deeply into the topic. It gave me a chance to understand how satellites like Landsat, Modis, and Sentinel worked, and how they make the satellite data available. This was a great competition to learn about a field I knew very little about before the competition.
In your experience, what do inexperienced Kagglers often overlook? What do you know now that you wish you’d known when you first started?
I will cite some of the mistakes that I made in my initial years on Kaggle.
