Description

Developing Kaggle Notebooks introduces you to data analysis, with a focus on using Kaggle Notebooks to simultaneously achieve mastery in this field and rise to the top of the Kaggle Notebooks tier. The book is structured as a seven-step data analysis journey, exploring the features available in Kaggle Notebooks alongside various data analysis techniques.
For each topic, we provide one or more notebooks and develop reusable analysis components with Kaggle's Utility Scripts feature. These are introduced progressively: first as part of a notebook, and later extracted for use across future notebooks to improve code reusability on Kaggle. The aim is to make the notebooks' code more structured, maintainable, and readable.
Although the focus of this book is on data analytics, some examples will guide you in preparing a complete machine learning pipeline using Kaggle Notebooks. Starting from initial data ingestion and data quality assessment, you'll move on to preliminary data analysis, advanced data exploration, feature qualification to build a model baseline, and feature engineering. You'll also delve into hyperparameter tuning to iteratively refine your model and prepare for submission in Kaggle competitions. Additionally, the book touches on developing notebooks that leverage the power of generative AI using Kaggle Models.




Developing Kaggle Notebooks

Pave your way to becoming a Kaggle Notebooks Grandmaster

Gabriel Preda

BIRMINGHAM—MUMBAI

Packt and this book are not officially connected with Kaggle. This book is an effort from the Kaggle community of experts to help more developers.

Developing Kaggle Notebooks

Copyright © 2023 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Lead Senior Publishing Product Manager: Tushar Gupta

Acquisition Editor – Peer Reviews: Gaurav Gavas

Project Editor: Amisha Vathare

Content Development Editor: Tanya D’cruz

Copy Editor: Safis Editing

Technical Editor: Aniket Shetty

Proofreader: Safis Editing

Indexer: Rekha Nair

Presentation Designer: Ganesh Bhadwalkar

Developer Relations Marketing Executive: Monika Sangwan

First published: December 2023

Production reference: 1191223

Published by Packt Publishing Ltd.

Grosvenor House

11 St Paul’s Square

Birmingham

B3 1RB, UK.

ISBN 978-1-80512-851-9

www.packt.com

Forewords

When I entered the world of AI and ML over twenty years ago, it was hard to describe to the people in my life what this field was. The ideas of finding patterns in data made it sound like I was hunting around in the attic with a flashlight. Telling family members about creating models that made useful predictions seemed to bring to mind children’s toys, or maybe some sort of fortune-telling. And the suggestion that machines might learn or be made to act with some form of observable intelligence was seen as the sort of thing that serious people left to the realm of science fiction.

Here we are in 2023, and the world has changed dramatically. The world of AI and ML has made stunning advances and has become – at least in my opinion – one of the most important technologies in existence. Predictive models are a tightly integrated part of nearly every computational platform, system, or application and impact business, trade, health, education, transportation, nearly every scientific field, and creative fields from visual art to music to writing. Indeed, AI and ML have become so important that the topics of governance, policy, and regulation are also emerging areas of rapid development themselves, and it seems that there is a new development almost every week.

Much of the most recent focus of attention has been on Generative AI, driven by LLMs and related methods, all of which draw on the last decade of advances in scaling up deep learning methods. For these models, it can feel like bigger is always better, and the scale of resources – computation, data, expertise – needed to contribute to this field makes it inaccessible to anyone outside of a small number of large players in the space. Personally, I reject this viewpoint.

I think that what the world really needs in this moment of massive change and development is for as many people as possible to learn how AI and ML models and systems work. We need as many people as possible to be able to train models, yes, but also to tweak and change them, to evaluate them, to understand their strengths and weaknesses, and to help identify ways that they can be more reliable, more efficient, less biased, more useful, and more accessible to everyone across the globe. Doing this within a broad, worldwide community helps to make sure that the things we learn together are shared broadly but are also stress-tested and re-evaluated by others.

This spirit of sharing is something that I think Gabriel Preda has embodied for many years, as a leading Kaggle Grandmaster. His dedication to our Kaggle community has been amazing, and his willingness to share his expertise serves as an example for all of us. This is one of the reasons why I think that this book itself is so important. Creating and sharing notebooks is the best way to make sure that the things we think are true can be checked, verified, and built upon by others.

So what does the world of AI and ML need right now, in this incredible moment of possibility? It needs you.

Welcome to Kaggle!

D. Sculley

Kaggle CEO, December 2023

My background is in econometrics, and I became interested in machine learning initially as an alternative approach to solving forecasting problems. My personal experience with the new field was not entirely positive at the start: I lacked familiarity with the techniques, terminology, and credentials that typically facilitate entry.

We created Kaggle with the hope that the platform would allow people like me to break into this powerful new field far more easily than I did. Perhaps the thing that makes me most proud is the extent to which Kaggle has made data science and machine learning accessible to a wide audience. Kaggle has seen newcomers evolve into top machine learners, securing positions at renowned companies like NVIDIA, Google, Hugging Face, and OpenAI, and even launching their own ventures, such as DataRobot.

What started as a machine learning competition platform has evolved to host datasets, notebooks, and discussions. Through Kaggle Learn, it offers easy-to-follow learning modules for beginners and advanced users alike. Currently, over 300,000 of the 15 million Kaggle users actively publish notebooks and are ranked in the Notebooks tier. Notebooks are an excellent way to share knowledge by exploring and analyzing datasets, prototyping machine learning models, collecting data for datasets, and preparing training and inference scripts for competitions.

Gabriel’s book will make Kaggle more accessible, especially for those interested in learning how to create detailed data analysis notebooks, refine their presentation skills, and create powerful narratives with data. It also offers examples of using notebooks to iteratively build models to prepare for competition submissions, and introduces users to the newest available features on Kaggle, including a chapter that shows how to leverage the power of Generative AI through Kaggle Models for prototyping applications with large language models to generate code, create chains of tasks, or build retrieval augmented generation systems.

Gabriel is a triple Kaggle Grandmaster with a seven-year tenure on Kaggle. He has ranked 2nd in Datasets and 3rd in Notebooks, and some of his notebooks and datasets are among the most upvoted by the community. With the immense expertise he has packed into these pages, readers who complete this book should be able to confidently create great notebooks with a high impact on the community and, in doing so, share their knowledge and engage with the community.

Machine learning and artificial intelligence are moving extremely fast, especially in recent years. Being active on Kaggle keeps you connected with a community that filters the vast stream of publications, new technologies, libraries, frameworks, and models down to what is useful and applicable to solving real-world problems. Many of the tools that have become standard in the industry spread through Kaggle after being validated by the Kaggle community.

Moreover, Kaggle offers users a way to “learn by doing.” Notebooks in particular attract a lot of feedback from the community, and contributors are encouraged to continuously improve the content they share.

So, for those of you who are reading this book and are new to Kaggle, I hope it helps make Kaggle, and especially writing Kaggle notebooks, less intimidating. And for those who have been on Kaggle for a while and are looking to level up, I hope this book from one of Kaggle’s most respected members helps you get more out of your time on the platform.

Anthony Goldbloom

Founder and former CEO of Kaggle

Contributors

About the author

Gabriel Preda has a PhD in computational electromagnetics and started his career in academic and private research. Twenty-five years ago, he authored his first paper that included the use of an AI technique – Neural Networks – to solve inverse problems in nondestructive testing and evaluation. Soon after, he moved from academia to private research and worked for a few years as a researcher for a high-tech company in Tokyo. After he moved back to Europe, he co-founded two technology start-ups and has worked for several product companies and software service corporations in software development for more than 20 years, holding development and management positions. Currently, Gabriel is a Principal Data Scientist at Endava, working for a range of industries from banking and insurance to telecom, logistics, and healthcare. He is a high-profile contributor in the world of competitive machine learning and currently one of the few triple Kaggle Grandmasters (in Datasets, Notebooks, and Discussions).

My warmest thanks go to my family, Midori and Cristina, for their support and patience while I prepared this book. I am grateful to Anthony Goldbloom, co-founder and former CEO, and to D. Sculley, current CEO of Kaggle, for their forewords to this book. Finally, I would like to thank Tushar Gupta, Amisha Vathare, Tanya D’cruz, Monika Sangwan, Aniket Shetty, and all the Packt Publishing editorial and production staff for their support on this writing effort.

About the reviewers

Konrad Banachewicz is a data science manager with experience stretching longer than he likes to ponder on. He holds a PhD in statistics from Vrije Universiteit Amsterdam, where he focused on problems of extreme dependency modeling in credit risk. He slowly moved from classic statistics toward machine learning and into the business applications world. He worked in a variety of financial institutions on an array of data problems and visited all the stages of a data product cycle: from translating business requirements (“What do they really need?”), through data acquisition (“Spreadsheets and flat files? Really?”), wrangling, modeling, and testing (the actually fun part), all the way to presenting the results to people allergic to mathematical terminology (which is most of the business). He is currently the principal data scientist at IKEA.

As a person who stood on the shoulders of giants, I believe in sharing knowledge with others: it is very important to know how to approach practical problems with data science methods, but also how not to do it.

Marília Prata is a retired dental doctor who worked for 28 years in her private clinic, provided dental audit services for Petrobras (Petróleo Brasileiro S.A.), and served as a public servant in the Civil Police of Rio de Janeiro. She also completed two specializations, in dental prosthesis and occupational dentistry. At the time of publishing, she is a triple Kaggle Grandmaster, ranking second in Notebooks.

I’m very grateful to the Kaggle platform and its users (Kagglers) because it’s a perfect place to start learning programming languages hands-on. Special thanks to Gabriel Preda for trusting my ability to review his invaluable work in this riveting field of data science.

Dr. Firat Gonen, PhD, orchestrates the Data and Analytics division at Allianz, propelling a Fortune 50 company with pioneering machine learning initiatives. His expertise, built on a foundation laid during his academic tenure culminating in a PhD from the University of Houston, now guides Allianz’s data-driven strategies. His role at Allianz was preceded by leadership positions at Getir (the Turkish decacorn app) and Vodafone, where he honed his prowess in managing adept data teams. Dr. Gonen’s extensive educational background and academic diligence are reflected in multiple peer-reviewed publications and complemented by his status as a Kaggle Triple Grandmaster, further adorned with numerous international data competition accolades. As the Z by HP Global Data Science Ambassador, Dr. Gonen advocates for the transformative power of data, underscoring the symbiotic relationship between cutting-edge technology and industry-leading insights. He was recently awarded the title of LinkedIn Top Data Science and Artificial Intelligence Voice. He also reviewed The Kaggle Book.

I would like to thank Deniz for her help, guidance, love, and her constant support along the way.

Join our book’s Discord space

Join our Discord community to meet like-minded people and learn alongside more than 5000 members at:

https://packt.link/kaggle

Contents

Preface

Who this book is for

What this book covers

To get the most out of this book

Get in touch

Introducing Kaggle and Its Basic Functions

The Kaggle platform

Kaggle Competitions

Kaggle Datasets

Kaggle Code

Kaggle Discussions

Kaggle Learn

Kaggle Models

Summary

Getting Ready for Your Kaggle Environment

What is a Kaggle Notebook?

How to create notebooks

Exploring notebook capabilities

Basic capabilities

Advanced capabilities

Setting a notebook as a utility script or adding utility scripts

Adding and using secrets

Using Google Cloud services in Kaggle Notebooks

Upgrading your Kaggle Notebook to Google Cloud AI Notebooks

Using a Notebook to automatically update a Dataset

Using the Kaggle API to create, update, download, and monitor your notebooks

Summary

Starting Our Travel – Surviving the Titanic Disaster

A closer look at the Titanic

Conducting data inspection

Understanding the data

Analyzing the data

Performing univariate analysis

Performing multivariate analysis

Extracting meaningful information from passenger names

Creating a dashboard showing multiple plots

Building a baseline model

Summary

References

Take a Break and Have a Beer or Coffee in London

Pubs in England

Data quality check

Data exploration

Starbucks around the world

Preliminary data analysis

Univariate and bivariate data analysis

Geospatial analysis

Pubs and Starbucks in London

Data preparation

Geospatial analysis

Summary

References

Get Back to Work and Optimize Microloans for Developing Countries

Introducing the Kiva analytics competition

More data, more insights – analyzing the Kiva data competition

Understanding the borrower demographic

Exploring MPI correlation with other factors

Radar visualization of poverty dimensions

Final remarks

Telling a different story from a different dataset

The plot

The actual history

Conclusion

Summary

References

Can You Predict Bee Subspecies?

Data exploration

Data quality checks

Exploring image data

Locations

Date and time

Subspecies

Health

Others

Conclusion

Subspecies classification

Splitting the data

Data augmentation

Building a baseline model

Iteratively refining the model

Summary

References

Text Analysis Is All You Need

What is in the data?

Target feature

Sensitive features

Analyzing the comments text

Topic modeling

Named entity recognition

POS tagging

Preparing the model

Building the vocabulary

Embedding index and embedding matrix

Checking vocabulary coverage

Iteratively improving vocabulary coverage

Transforming to lowercase

Removing contractions

Removing punctuation and special characters

Building a baseline model

Transformer-based solution

Summary

References

Analyzing Acoustic Signals to Predict the Next Simulated Earthquake

Introducing the LANL Earthquake Prediction competition

Formats for signal data

Exploring our competition data

Solution approach

Feature engineering

Trend feature and classic STA/LTA

FFT-derived features

Features derived from aggregate functions

Features derived using the Hilbert transform and Hann window

Features based on moving averages

Building a baseline model

Summary

References

Can You Find Out Which Movie Is a Deepfake?

Introducing the competition

Introducing competition utility scripts

Video data utils

Face and body detection utils

Metadata exploration

Video data exploration

Visualizing sample files

Performing object detection

Summary

References

Unleash the Power of Generative AI with Kaggle Models

Introducing Kaggle Models

Prompting a foundation model

Model evaluation and testing

Model quantization

Building a multi-task application with Langchain

Code generation with Kaggle Models

Creating a RAG system

Summary

References

Closing Our Journey: How to Stay Relevant and on Top

Learn from the best: observe successful Grandmasters

Revisit and refine your work periodically

Recognize others’ contributions, and add your personal touch

Be quick: don’t wait for perfection

Be generous: share your knowledge

Step outside your comfort zone

Be grateful

Summary

References

Other Books You May Enjoy

Index

Landmarks

Cover

Index

Share your thoughts

Once you’ve read Developing Kaggle Notebooks, we’d love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.

Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.

1

Introducing Kaggle and Its Basic Functions

Kaggle is currently the main platform for competitive predictive modeling. Here, those who are passionate about machine learning, both experts and beginners, have a collaborative and competitive environment to learn, win recognition, share knowledge, and give back to the community. The company was launched in 2010, offering only machine learning competitions. Currently, it is a data platform that includes sections titled Competitions, Datasets, Code, Discussions, Learn, and, most recently, Models.

In 2011, Kaggle went through an investment round, valuing the company above $25 million. In 2017, it was acquired by Google (now part of Alphabet Inc.), becoming associated with Google Cloud. The most notable key figures at Kaggle are co-founders Anthony Goldbloom (long-time CEO, until 2022) and Ben Hamner (CTO). Recently, D. Sculley, the legendary Google engineer, became Kaggle’s new CEO, after Anthony Goldbloom stepped down to get involved in the development of a new start-up.

In this first chapter, we’ll explore the main sections that the Kaggle platform offers its members. We will also learn how to create an account, how the platform is organized, and what its main sections are. In short, this chapter will cover the following topics:

The Kaggle platform

Kaggle Competitions

Kaggle Datasets

Kaggle Code

Kaggle Discussions

Kaggle Learn

Kaggle Models

If you are familiar with the Kaggle platform, you probably know about these features already. You can choose to continue reading the following sections to refresh your knowledge about the platform or you can skip them and go directly to the next chapter.

The Kaggle platform

To start using Kaggle, you will have to create an account. You can register with your email and password or authenticate directly with your Google account. Once registered, you can start by creating a profile with your name, picture, role, and current organization. You can then add your location (optional) and a short personal presentation. After you perform an SMS verification and add some minimal content to the platform (run one notebook or script, make one competition submission, make one comment, or give one upvote), you will be promoted from Novice to Contributor. The following figure shows a checklist for how to become a Contributor. As you can see, all items are checked, which means that the user has already been promoted to the Contributor tier.

Figure 1.1: Checklist to become a contributor

With the entire Contributor checklist completed, you are ready to start your Kaggle journey.

The current platform contains multiple features. The most important are:

Competitions: This is where Kagglers can take part in competitions and submit their solutions to be scored.

Datasets: In this section, users can upload datasets.

Code: This is one of the most complex features of Kaggle. Also known as Kernels or Notebooks, it allows users to add code (independently or connected to datasets and competitions), modify it, run it to perform analysis, prepare models, and generate submission files for competitions.

Discussions: In this section, contributors on the platform can add topics and comments to competitions, notebooks, or datasets. Topics can also be added independently and linked to themes such as Getting Started.

Each of these sections allows you to gain medals, according to Kaggle’s progression system. Once you start to contribute to one of these sections, you can also be ranked in the overall Kaggle ranking system for the respective section. There are two main methods to gain medals: by winning top positions in competitions and by getting upvotes for your work in the Datasets, Code, and Discussions sections.

Besides Competitions, Datasets, Code, and Discussions, there are two more sections with content on Kaggle:

Learn: This is one of the coolest features of Kaggle. It contains a series of lectures and tutorials on various topics, from a basic introduction to programming languages to advanced topics like computer vision, model interpretability, and AI ethics. You can use all the other Kaggle resources (Datasets, Competitions, Code, and Discussions) as support materials for the lectures.

Models: This is the newest feature introduced on Kaggle. It allows you to load a model into your code, in the same way that you currently add datasets.

Now that we’ve had a quick overview of the various features of the Kaggle platform, the following sections will give you an in-depth view of Competitions, Datasets, Code, Discussions, Learn, and Models. Let’s get started!

Kaggle Competitions

It all started with Competitions more than 12 years ago. The first competition had just a few participants. With the growing interest in machine learning and the increased community around Kaggle, the complexity of the competitions, the number of participants, and the interest around competitions increased significantly.

To start a competition, the competition host prepares a dataset, typically split between train and test. In the most common form, the train set has labeled data available, while the test set only contains the feature data. The host also adds information about the data and a presentation of the competition objective. This includes a description of the problem to set the background for the competitors. The host also adds information about the metrics used to evaluate the solutions to the competition. The terms and conditions of the competitions are also specified.

Competitors are allowed to submit a limited number of solutions per day and, at the end, the two best solutions (as evaluated on the portion of the test set used to calculate the public score) are selected by default. Competitors also have the option to select two solutions themselves, based on their own judgment. These two selected solutions are then evaluated on the reserved subset of the test data to generate the private score. This is the final score used to rank the competitors.

There are several types of competitions:

Featured competitions: The most important are the Featured competitions. Currently, a Featured competition might bring together several thousand teams, with tens or even hundreds of thousands of solutions submitted. Featured competitions are typically hosted by companies, but sometimes also by research organizations or universities, and are usually aimed at solving a difficult problem related to a company or a research topic. The organizer turns to the large Kaggle community to bring in knowledge and skills, and the competitive aspect of the setup accelerates the development of a solution. Usually, a Featured competition will also have a significant prize, distributed according to the competition rules to the top competitors. Sometimes, the host will not include a prize but will offer a different incentive, such as recruiting the top competitors to work for them (with high-profile companies, this might be more interesting than a prize), vouchers for cloud resources, or acceptance of the top solutions for presentation at high-profile conferences. Besides Featured competitions, there are also Getting Started, Research, Community, Playground, Simulations, and Analytics competitions.

Getting Started competitions: These are aimed mostly at beginners and tackle easily approachable machine learning problems to help build basic skills. These competitions are restarted periodically and the leaderboard is reset. The most notable ones are Titanic – Machine Learning from Disaster, Digit Recognizer, House Prices – Advanced Regression Techniques, and Natural Language Processing with Disaster Tweets.

Research competitions: In Research competitions, the themes relate to solving a difficult scientific problem in domains such as medicine, genetics, cell biology, and astronomy by applying a machine learning approach. Some of the most popular competitions in recent years came from this category and, with the rising use of machine learning in many fields of fundamental and applied research, we can expect this type of competition to become more and more frequent and popular.

Community competitions: These are created by Kagglers and are either open to the public or private, where only those invited can take part. For example, you can host a Community competition as a school or university project, where students are invited to join and compete for the best grades.

Kaggle offers the infrastructure, which makes it very simple for you to define and start a new competition. You have to provide the training and test data, but this can be as simple as two files in CSV format. Additionally, you need to add a sample submission file, which gives the expected format for submissions. Participants in the competition have to replace the predictions in this file with their own, save the file, and then submit it. Then, you have to choose a metric to assess the performance of a machine learning model (no need to define one, as you have a large collection of predefined metrics to pick from). As the host, you will also be required to upload a file with the correct, expected solution to the competition challenge, which will serve as the reference against which all competitors’ submissions are checked. Once this is done, you just need to edit the terms and conditions, choose a start and end date for the competition, write the data description and objectives, and you are good to go. Other options you can choose include whether participants can team up or not, and whether joining the competition is open to everybody or only to people who receive the competition link.
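As a minimal sketch of the two files a host prepares, assuming a hypothetical binary classification challenge (the column names, sizes, and labels below are made up for illustration), the solution file and the sample submission could be generated like this:

import pandas as pd

# Host-only file: ground-truth labels for the hidden test set (hypothetical values).
solution = pd.DataFrame({
    "Id": range(1, 1001),
    "Target": [i % 2 for i in range(1000)],  # placeholder labels
})
solution.to_csv("solution.csv", index=False)

# Public sample submission: same Ids, with dummy predictions for participants to replace.
sample_submission = solution.copy()
sample_submission["Target"] = 0
sample_submission.to_csv("sample_submission.csv", index=False)

Participants download sample_submission.csv, overwrite the Target column with their predictions, and submit; the platform then scores the file against solution.csv using the metric the host selected.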

Playground competitions: Around three years ago, a new section of competitions was launched: Playground competitions. These are generally simple competitions, like the Getting Started ones, but with a shorter lifespan (initially one month, currently one to four weeks). These competitions are of low or medium difficulty and help participants gain new skills. They are highly recommended for beginners, but also for more experienced competitors who want to refine their skills in a certain domain.

Simulation competitions: While the previous types are all supervised machine learning competitions, Simulation competitions are, in general, optimization competitions. The best known are those around Christmas and New Year (the Santa competitions) and the Lux AI Challenge, currently in its third season. Some Simulation competitions are also recurrent and qualify for an additional category: Annual competitions. The Santa competitions are examples that are both Simulation and Annual competitions.

Analytics competitions: These differ in both the objective and the way solutions are scored. The objective is to perform a detailed analysis of the competition dataset to extract insights from the data. The score is based, in general, on the judgment of the organizers and, in some cases, on the popularity of the competing solutions; in that case, the organizers grant parts of the prizes to the most popular notebooks, based on the upvotes of Kagglers. In Chapter 5, we will analyze the data from one of the first Analytics competitions and provide some insights into how to approach this type of competition.

For a long time, competitions required participants to prepare a submission file with the predictions for the test set. No other constraints were imposed on how the submissions were prepared; competitors were expected to use their own computing resources to train models, validate them, and prepare the submission. Initially, there were no resources available on the platform to prepare a submission. After Kaggle started to provide computational resources, where you could prepare your model using Kaggle Kernels (later named Notebooks and now Code), you could submit directly from the platform, but no limitation was imposed on this. Typically, the submission file is evaluated on the fly and the result is displayed almost instantly. The result (i.e., the score according to the competition metric) is calculated only for a percentage of the test set. This percentage is announced at the start of the competition and is fixed. The subset of test data used during the competition to calculate the displayed score (the public score) is also fixed. After the end of the competition, the final score is calculated on the rest of the test data, and this final score (also known as the private score) is the final score for each competitor. The percentage of the test data used during the competition to evaluate the solution and provide the public score can be anything from a few percent to more than 50%; in most competitions, it tends to be less than 50%.

The reason Kaggle uses this approach is to prevent an unwanted phenomenon. Rather than improving their models for better generalization, competitors might be inclined to optimize their solutions to predict the test set as well as possible, without considering the cross-validation score on their training data. In other words, competitors might be inclined to overfit their solutions to the test set. By splitting this data and only providing the score for a part of the test set – the public score – the organizers intend to prevent this.
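To make the public/private mechanics concrete, here is a minimal sketch, with made-up numbers and simple accuracy as a stand-in metric, of how one fixed split of the test set yields two scores from a single submission:

import numpy as np

rng = np.random.default_rng(42)

n_test = 10_000
y_true = rng.integers(0, 2, n_test)          # hidden ground truth for the full test set
y_pred = np.where(rng.random(n_test) < 0.8,  # a submission that is right ~80% of the time
                  y_true, 1 - y_true)

# Fixed split, drawn once by the host: say 30% public, 70% private.
public_mask = np.zeros(n_test, dtype=bool)
public_mask[rng.choice(n_test, size=int(0.3 * n_test), replace=False)] = True

public_score = (y_pred[public_mask] == y_true[public_mask]).mean()     # shown during the competition
private_score = (y_pred[~public_mask] == y_true[~public_mask]).mean()  # revealed at the end

print(f"public: {public_score:.4f}, private: {private_score:.4f}")

A model tuned to maximize public_score alone can still score poorly on private_score, which is exactly the overfitting behavior the split is designed to discourage.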

With more and more complex competitions (sometimes with very large train and test sets), some participants with greater computational resources might gain an advantage, while others with limited resources may struggle to develop advanced models. Especially in Featured competitions, the goal is often to create robust, production-compatible solutions. However, without restrictions on how solutions are obtained, achieving this goal may be difficult, especially if solutions with unrealistic resource use become prevalent. To limit the unwanted consequences of this “arms race” for better and better solutions, a few years ago Kaggle introduced Code competitions. This kind of competition requires that all solutions be submitted from a notebook running on the Kaggle platform. In this way, the infrastructure used to run the solution became fully controllable by Kaggle.

Also, not only are the computing resources limited in such competitions, but there are additional constraints: the duration of the run and internet access (the latter blocked to prevent the use of additional computing power through external APIs or other remote computing resources).

Kagglers discovered quite quickly that this was a limitation only on the inference part of the solution, and an adaptation appeared: competitors started to train, offline, large models that would not fit within the limits of computing power and runtime imposed by Code competitions. They then uploaded the offline-trained models (sometimes produced using very large computational resources) as datasets and loaded these models in inference code that observed the memory and computation time limits of the Code competitions.
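The workflow looks roughly like the following sketch. The dataset name, file name, and model are hypothetical, and the kaggle CLI commands shown in the comments are run locally after offline training finishes:

import torch
import torch.nn as nn

# Step 1 (run locally, outside Kaggle): train the large model offline, save the
# weights, then upload them as a dataset with the Kaggle CLI, e.g.:
#   kaggle datasets init -p ./my_model_dir     # creates dataset-metadata.json to fill in
#   kaggle datasets create -p ./my_model_dir   # uploads the weights as a (private) dataset

class TinyNet(nn.Module):  # hypothetical stand-in for a large offline-trained model
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(16, 1)

    def forward(self, x):
        return self.fc(x)

# Step 2 (inside the competition notebook, with the dataset attached via "Add data"):
MODEL_PATH = "/kaggle/input/my-offline-model/weights.pt"  # hypothetical dataset path

model = TinyNet()
model.load_state_dict(torch.load(MODEL_PATH, map_location="cpu"))
model.eval()  # inference only: the expensive training happened offline, outside the limits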

In some cases, multiple models trained offline were loaded as datasets, and inference combined these models to create more precise solutions. Over time, Code competitions have become more refined. Some of them only expose a few rows from the test set and do not reveal the size of the real test set used for the public or the future private scoring. Therefore, Kagglers have to resort to clever probing techniques to estimate the limitations they might incur while running against the final, private test set, to avoid their code failing by exceeding memory or runtime limits.

Currently, there are also Code competitions that, after the active part of the competition (i.e., the period when competitors can still refine their solutions) ends, do not publish the private score immediately, but instead rerun the code against several new sets of test data, re-evaluating the two selected solutions against these never-before-seen datasets. Some of these competitions concern the stock market, cryptocurrency valuation, or credit performance predictions, and they use real data. The evolution of Code competitions ran in parallel with the evolution of the computational resources available on the platform, to provide users with the required computational power.

Some of the competitions (most notably the Featured competitions and the Research competitions) grant ranking points and medals to the participants. Ranking points are used to calculate the relative position of Kagglers in the general leaderboard of the platform. The formula to calculate the ranking points awarded for a competition hasn’t changed since May 2015:

Figure 1.2: Formula for calculating ranking points
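Since the formula appears in the book as an image, here is a reconstruction in LaTeX, based on Kaggle's publicly documented May 2015 ranking formula (treat the exact constants as indicative):

\[
\text{Points} = \frac{100000}{\sqrt{N_{\text{teammates}}}} \cdot \text{Rank}^{-0.75} \cdot \log_{10}\!\left(1 + \log_{10} N_{\text{teams}}\right) \cdot e^{-t/500}
\]

where Rank is the team's final position, N_teams is the number of teams in the competition, N_teammates is the size of the team, and t is the number of days elapsed since the competition deadline.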

The number of points decreases with the square root of the number of teammates in the current competition team. More points are awarded for competitions with a larger number of teams. The number of points will also decrease over time, to keep the ranking up to date and competitive.

Medals count toward promotion in the Kaggle progression system for competitions. Medals for competitions are awarded based on the position at the top of the competition leaderboard. The actual system is a bit more complicated but, generally, the top 10% get a bronze medal, the top 5% get a silver medal, and the top 1% get a gold medal. The actual number of medals granted grows with the number of participants, but this is the basic principle.
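The simplified rule described above can be expressed as a short helper. Note that this encodes only the basic principle from the text, not Kaggle's exact medal table, which also depends on the absolute number of teams:

from typing import Optional

def simplified_medal(rank: int, n_teams: int) -> Optional[str]:
    """Simplified medal rule (top 1% / 5% / 10%); an approximation, not the official table."""
    percentile = rank / n_teams
    if percentile <= 0.01:
        return "gold"
    if percentile <= 0.05:
        return "silver"
    if percentile <= 0.10:
        return "bronze"
    return None

print(simplified_medal(rank=35, n_teams=4000))  # "gold" (top ~0.9%)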