The Kaggle Workbook - Konrad Banachewicz - E-Book

Description

Move up the Kaggle leaderboards and supercharge your data science and machine learning career by analyzing famous competitions and working through exercises.



Purchase of the print or Kindle book includes a free eBook in PDF format.

Key Features



  • Challenge yourself to start thinking like a Kaggle Grandmaster
  • Fill your portfolio with impressive case studies that will come in handy during interviews
  • Packed with exercises and notes pages for you to enhance your skills and record key findings

Book Description



More than 80,000 Kaggle novices currently participate in Kaggle competitions. To help them navigate the often-overwhelming world of Kaggle, two Grandmasters put their heads together to write The Kaggle Book, which made plenty of waves in the community. Now, they've come back with an even more practical approach based on hands-on exercises that can help you start thinking like an experienced data scientist.



In this book, you'll get up close and personal with four extensive case studies based on past Kaggle competitions. You'll learn how bright minds predicted which drivers would likely avoid filing insurance claims in Brazil and see how expert Kagglers used gradient-boosting methods to model Walmart unit sales time-series data. Get into computer vision by discovering different solutions for identifying the type of disease present on cassava leaves. And see how the Kaggle community created predictive algorithms to solve the natural language processing problem of subjective question-answering.



You can use this workbook as a supplement alongside The Kaggle Book or on its own alongside resources available on the Kaggle website and other online communities. Whatever path you choose, this workbook will help make you a formidable Kaggle competitor.

What you will learn



  • Take your modeling to the next level by analyzing different case studies
  • Boost your data science skillset with a curated selection of exercises
  • Combine different methods to create better solutions
  • Get a deeper insight into NLP and how it can help you solve unlikely challenges
  • Sharpen your knowledge of time-series forecasting
  • Challenge yourself to become a better data scientist

Who this book is for



If you're new to Kaggle and want to sink your teeth into practical exercises, start with The Kaggle Book first. A basic understanding of the Kaggle platform, along with knowledge of machine learning and data science, is a prerequisite.



This book is suitable for anyone starting their Kaggle journey or veterans trying to get better at it. Data analysts/scientists who want to do better in Kaggle competitions and secure jobs with tech giants will find this book helpful.

Page count: 165

Year of publication: 2023




The Kaggle Workbook

Self-learning exercises and valuable insights for Kaggle data science competitions

Konrad Banachewicz

Luca Massaron

BIRMINGHAM—MUMBAI

Packt and this book are not officially connected with Kaggle. This book is an effort from the Kaggle community of experts to help more developers.

The Kaggle Workbook

Copyright © 2023 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Lead Senior Publishing Product Manager: Tushar Gupta

Acquisition Editor – Peer Reviews: Gaurav Gavas

Project Editor: Parvathy Nair

Content Development Editor: Bhavesh Amin

Copy Editor: Safis Editing

Technical Editor: Karan Sonawane

Proofreader: Safis Editing

Indexer: Pratik Shirodkar

Presentation Designer: Rajesh Shirsath

Developer Relations Marketing Executive: Monika Sangwan

First published: February 2023

Production reference: 2200223

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham

B3 2PB, UK.

ISBN 978-1-80461-121-0

www.packt.com

Contributors

About the authors

Konrad Banachewicz is a data science manager with experience that goes back more than he would care to mention. He holds a PhD in statistics from Vrije Universiteit Amsterdam, where he focused on problems of extreme dependency modeling in credit risk. He slowly moved from classic statistics toward machine learning and into the business applications world.

Konrad worked in a variety of financial institutions on an array of data problems and visited all stages of the data product cycle, from translating business requirements (“what do they really need”); through data acquisition (“spreadsheets and flat files? Really?”), wrangling, modeling, and testing (the actual fun part), all the way to presenting the results to people allergic to mathematical terminology (which is the majority of business). He has visited different ends of the frequency spectrum in finance (from high-frequency trading to credit risk, and everything in between), predicted potato prices, analyzed anomalies in industrial equipment, and optimized recommendations. He is a staff data scientist at Adevinta.

As a person who himself stood on the shoulders of giants, Konrad believes in sharing knowledge with others: it is very important to know how to approach practical problems with data science methods, but also how not to do it.

Luca Massaron is a data scientist with more than a decade of experience in transforming data into smarter artifacts, solving real-world problems, and generating value for businesses and stakeholders. He is the author of best-selling books on AI, machine learning, and algorithms. Luca is also a Kaggle Grandmaster, who reached number 7 in the worldwide user rankings for his performance in data science competitions, and a Google Developer Expert (GDE) in machine learning.

My warmest thanks go to my family, Yukiko and Amelia, for their support and loving patience as I prepared this new book in a long series.

About the reviewers

Laura Fink works as a senior data scientist for H2O.ai, and her main interest is in building AI tools for unstructured data. Within the field of machine learning, she is especially interested in deep learning and unsupervised methods.

Before joining H2O.ai, she was the Head of Data Science at the software development company Micromata, building a data science team to open up new business opportunities. Her mission has always been to help customers to make data-driven decisions by using data science and machine learning to solve business problems.

Laura holds a master’s degree in physics from the Ludwig Maximilian University of Munich with a focus on biophysics and nonlinear dynamical systems. She had her first contact with machine learning during her master’s thesis in 2015 and has been fascinated by its potential ever since. After joining Kaggle in the same year, she was immediately hooked by the platform and its awesome community. As a Notebooks Grandmaster, she enjoys sharing her insights and learning experiences with the community by writing tutorials and detailed exploratory analyses.

Gabriel Preda is a principal data scientist at Endava. He worked for more than 20 years in software engineering, holding both development and management positions. He is passionate about data science and machine learning and is constantly contributing to Kaggle, being currently a triple Kaggle Grandmaster.

Pietro Marinelli has consistently been ranked among the top data scientists in the world on the Google AI platform, Kaggle. He has reached 3rd position among Italian data scientists and 141st among 150,000 data scientists around the world. Due to his work on Kaggle, he has been honored to participate as a speaker at Paris Kaggle Day, January 2019.

He has been working with artificial intelligence, text analytics, and many other data science techniques for many years, and has more than 15 years of experience in designing products based on data for different industries. He has produced a variety of algorithms, ranging from predictive modeling to advanced simulation, to support top management’s business decisions for a variety of multinational companies. He is currently collaborating as a reviewer for Packt, reviewing AI books.

Due to his achievements in the AI field, in February 2020 he was honored to participate as a speaker at the Italian Chamber of Deputies to talk about the role of AI in the new global landscape.

Join our book’s Discord space

Join our Discord community to meet like-minded people and learn alongside more than 2000 members at:

https://packt.link/KaggleDiscord

Contents

Preface

Who this book is for

What this book covers

To get the most out of this book

Get in touch

The Most Renowned Tabular Competition – Porto Seguro’s Safe Driver Prediction

Understanding the competition and the data

Understanding the evaluation metric

Examining the top solution ideas from Michael Jahrer

Building a LightGBM submission

Setting up a denoising autoencoder and a DNN

Ensembling the results

Summary

The Makridakis Competitions – M5 on Kaggle for Accuracy and Uncertainty

Understanding the competition and the data

Understanding the evaluation metric

Examining the 4th place solution’s ideas from Monsaraida

Computing predictions for specific dates and time horizons

Assembling public and private predictions

Summary

Vision Competition – Cassava Leaf Disease Competition

Understanding the data and metrics

Building a baseline model

Learning from top solutions

Pretraining

Test time augmentation

Transformers

Ensembling

A complete solution

Summary

NLP Competition – Google Quest Q&A Labeling

The baseline solution

Learning from top solutions

Summary

Other Books You May Enjoy

Index


Preface

When we started planning The Kaggle Book, the Kaggle platform counted 85,117 novices (users who had at least registered) and 57,426 contributors (users who had also filled in their profiles). We wondered how to help them break the ice with data science competitions on Kaggle. We decided to provide them with the best available information about Kaggle and data science competitions and help them start their journey in the best possible way, thanks to hints and suggestions from over 30 Kaggle Masters and Grandmasters.

Only when we had completed our work did we realize that there was little space left in the book for anything else and that, regrettably, we had to exclude some practical demonstrations and examples. However, practice is sometimes as important as theory (we know this very well, since we are applied data scientists!), and theory cannot be considered complete without practice. The Kaggle Workbook is here to supplement The Kaggle Book by providing you with guided exercises that put some of the ideas found in The Kaggle Book into practice.

Strictly speaking, in this workbook, you will find:

  • The exploration of an emblematic selection of competitions (tabular, forecasting, computer vision, and natural language processing), where we demonstrate how a simple and effective solution can be derived for each of them.
  • References to concepts and ideas found in the original Kaggle Book.
  • Some challenges for the reader, as we pose questions (and exercises) to help hone your skills on the same competitions or on analogous ones.

By first reading The Kaggle Book and then practicing with The Kaggle Workbook, you'll have all the skills, both theoretical and hands-on, necessary to compete on Kaggle for glory, fun, or learning, and to gather interesting applied projects to present in a job interview or in your own portfolio!

Let’s not wait; let’s start practicing now!

Who this book is for

This book has been written for all the readers of The Kaggle Book and for all the Kaggle novices and contributors who want practical experience in past competitions to reinforce their learning before delving into competitions on Kaggle.

What this book covers

Chapter 1, The Most Renowned Tabular Competition – Porto Seguro’s Safe Driver Prediction. In this competition, you are asked to solve a common problem in insurance: figuring out who is going to file an auto insurance claim in the next year. We guide you in properly using LightGBM and denoising autoencoders, and in how to blend them effectively.

Chapter 2, The Makridakis Competitions – M5 on Kaggle for Accuracy and Uncertainty. In this competition based on Walmart’s daily sales time series of items hierarchically arranged into departments, categories, and stores spread across three U.S. states, we recreate the 4th-place solution’s ideas from Monsaraida to demonstrate how we can effectively use LightGBM for this time series problem.

Chapter 3, Vision Competition – Cassava Leaf Disease Classification. In this contest, the participants were tasked with classifying crowdsourced photos of cassava plants grown by farmers in Uganda. We use the multiclass problem to demonstrate how to build a complete pipeline for image classification and show how this baseline can be utilized to construct a competitive solution using a vast array of possible extensions.

Chapter 4, NLP Competition – Google Quest Q&A Labeling, discusses a contest focused on predicting human responders’ evaluations of subjective aspects of a question-answer pair, where an understanding of context was crucial. Casting the challenge as a multiclass classification problem, we build a baseline solution exploring the semantic characteristics of a corpus, followed by an examination of more advanced methods that were necessary for leaderboard ascent.

To get the most out of this book

The Python code proposed in this book has been designed to run on a Kaggle Notebook without any installation on a local computer. Therefore, don’t worry about what machine you have available or about which versions of the Python packages you have to install. All you need is a computer with access to the internet and a free Kaggle account (you will find instructions about the procedure in Chapter 3 of The Kaggle Book). If you don’t have a free Kaggle account yet, just go to www.kaggle.com and follow the instructions on the website.

Whenever a link is provided, just explore it: you will find code available in public Kaggle Notebooks that you can reuse, as well as further materials illustrating the concepts and ideas outlined in the book.

Download the example code files

The code bundle for the book is hosted on GitHub at https://github.com/PacktPublishing/The-Kaggle-Workbook. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Download the color images

We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://packt.link/Rgb6B.

Conventions used

There are a few text conventions used throughout this book.

CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. For example, “An important component of our feature extraction pipeline is the TfidfVectorizer.”

A block of code is set as follows:

!pip install transformers
import transformers

Any command-line input or output is written as follows:

LightGBM CV Gini Normalized Score: 0.289 (0.015)

Bold: Indicates a new term, an important word, or words that you see on the screen, for example, in menus or dialog boxes. For example: “We will evaluate the performance of our baseline model using Out-Of-Fold (OOF) cross-validation.”

Link: Indicates a hyperlink to a web page containing additional information on a topic or to a resource on Kaggle.

Exercises are displayed as follows:

Exercise Number

Exercise Notes (write down any notes or workings that will help you):

Warnings or important notes appear like this.

Tips and tricks appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: Email [email protected] and mention the book’s title in the subject of your message. If you have questions about any aspect of this book, please email us at [email protected].

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you could report this to us. Please visit http://www.packtpub.com/submit-errata, select your book, click on the Submit Errata link, and enter the details.

Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in, and you are interested in either writing or contributing to a book, please visit http://authors.packtpub.com.

Share your thoughts

Once you’ve read The Kaggle Workbook, we’d love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.

Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.

Download a free PDF copy of this book

Thanks for purchasing this book!

Do you like to read on the go but are unable to carry your print books everywhere?

Is your eBook purchase not compatible with the device of your choice?

Don’t worry, now with every Packt book you get a DRM-free PDF version of that book at no cost.

Read anywhere, any place, on any device. Search, copy, and paste code from your favorite technical books directly into your application. 

The perks don’t stop there; you can get exclusive access to discounts, newsletters, and great free content in your inbox daily.

Follow these simple steps to get the benefits:

  1. Scan the QR code or visit the link below:

https://packt.link/free-ebook/9781804611210

  2. Submit your proof of purchase.

That’s it! We’ll send your free PDF and other benefits to your email directly.

1

The Most Renowned Tabular Competition – Porto Seguro’s Safe Driver Prediction

Learning how to reach the top on the leaderboard in any Kaggle competition requires patience, diligence, and many attempts to learn the best way to compete and achieve top results. For this reason, we have thought of a workbook that can help you build those skills faster by trying some Kaggle competitions of the past and learning how to reach the top of the leaderboard by reading discussions, reusing notebooks, engineering features, and training various models.

We start with one of the most renowned tabular competitions, Porto Seguro’s Safe Driver Prediction. In this competition, you are asked to solve a common problem in insurance and figure out who is going to have a car insurance claim in the next year. Such information is useful to increase the insurance fee for drivers more likely to have a claim and to lower it for those less likely to.

In illustrating the key insights and technicalities necessary for cracking this competition, we will show you the necessary code and ask you to study topics and answer questions found in The Kaggle Book itself. Therefore, without much more ado, let’s start this new learning path of yours.

In this chapter, you will learn:

  • How to tune and train a LightGBM model
  • How to build a denoising autoencoder and how to use it to feed a neural network
  • How to effectively blend models that are quite different from each other

All the code files for this chapter can be found at https://packt.link/kwbchp1.

Understanding the competition and the data

Porto Seguro is the third largest insurance company in Brazil (it operates in Brazil and Uruguay), offering car insurance coverage as well as many other insurance products, and it has used analytical methods and machine learning for the past 20 years to tailor its prices and make auto insurance coverage more accessible to more drivers. To explore new ways of achieving this, Porto Seguro sponsored a competition (https://www.kaggle.com/competitions/porto-seguro-safe-driver-prediction), expecting Kagglers to come up with new and better methods of solving some of its core analytical problems.

The competition asked Kagglers to build a model that predicts the probability that a driver will initiate an auto insurance claim in the next year, which is quite a common task (the sponsor describes it as a “classical challenge for insurance”). This kind of information about the probability of filing a claim can be quite valuable for an insurance company. Without such a model, an insurer can only charge a flat premium to all customers irrespective of their risk, or, with a poorly performing model, it will charge them a mismatched premium. Inaccuracies in profiling customers’ risk therefore result in charging a higher insurance cost to good drivers and a lower price to bad ones. The impact on the company is two-fold: good drivers will look elsewhere for their insurance, and the company’s portfolio will become overweighted with bad ones (technically, the company would have a bad loss ratio: https://www.investopedia.com/terms/l/loss-ratio.asp). If, instead, the company can correctly estimate the claim likelihood, it can ask a fair price of its customers, thus increasing its market share, keeping customers more satisfied, maintaining a more balanced customer portfolio (a better loss ratio), and better managing its reserves (the money the company sets aside for paying claims).
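To make the loss-ratio argument concrete, here is a back-of-the-envelope sketch. All the numbers (claim probabilities, claim cost, premiums, margin) are invented for illustration and have nothing to do with the competition data.

```python
# Hypothetical numbers showing why mispriced risk hurts the loss ratio
# (incurred losses / earned premiums).
claim_prob = {"good_driver": 0.03, "bad_driver": 0.20}  # invented probabilities
avg_claim_cost = 10_000.0                               # invented severity

# Expected yearly claim cost per customer segment
expected_cost = {k: p * avg_claim_cost for k, p in claim_prob.items()}

# A flat premium charged to everyone vs. a risk-based premium with a 20% margin
flat_premium = 600.0
fair_premium = {k: c * 1.2 for k, c in expected_cost.items()}

for segment, cost in expected_cost.items():
    print(segment,
          "expected cost:", cost,
          "flat-premium loss ratio:", round(cost / flat_premium, 2),
          "risk-based loss ratio:", round(cost / fair_premium[segment], 2))
```

Under flat pricing, the good driver's segment runs a loss ratio of 0.5 (overcharged, likely to churn) while the bad driver's segment runs 3.33 (the insurer pays out far more than it collects); with risk-based pricing, both segments settle at the same sustainable ratio.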

To do so, the sponsor provided training and test datasets, and the competition was ideal for anyone since the dataset was not very large and was very well prepared.

As stated on the page of the competition devoted to presenting the data (https://www.kaggle.com/competitions/porto-seguro-safe-driver-prediction/data):

Features that belong to similar groupings are tagged as such in the feature names (e.g., ind, reg, car, calc).

In addition, feature names include the postfix bin to indicate binary features and cat to indicate categorical features. Features without these designations are either continuous or ordinal. Values of -1 indicate that the feature was missing from the observation. The target column signifies whether or not a claim was filed for that policy holder.

The data preparation for the competition was carefully conducted to avoid any leak of information, and although secrecy has been maintained about the meaning of the features, it is quite clear that the different tags used refer to specific kinds of features commonly found in motor insurance.
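As a sketch of how these naming conventions can be applied in practice, the toy snippet below (column names invented for illustration) groups features by their tag, picks out the bin/cat postfixes, and recodes the -1 sentinel as a proper missing value:

```python
# Toy frame with made-up column names following the competition's
# naming scheme: <prefix>_<group>_<number>[_bin|_cat].
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ps_ind_01": [1, 2, -1],
    "ps_ind_02_cat": [3, -1, 1],
    "ps_car_15_bin": [0, 1, 1],
    "ps_reg_03": [0.5, -1.0, 0.9],
    "target": [0, 1, 0],
})

features = [c for c in df.columns if c != "target"]
groups = {c: c.split("_")[1] for c in features}           # ind / reg / car / calc
categorical = [c for c in features if c.endswith("_cat")]
binary = [c for c in features if c.endswith("_bin")]

# Values of -1 indicate a missing observation, so recode them as NaN
clean = df[features].replace(-1, np.nan)

print(groups)
print("categorical:", categorical, "binary:", binary)
```

Everything else (features without a postfix) would then be treated as continuous or ordinal, per the data page's description.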