Cracking the Data Science Interview E-Book

Leondra R. Gonzalez

Publisher: Packt Publishing
Category: Lifestyle
Language: English

Description

The data science job market is saturated with professionals of all backgrounds, including academics, researchers, bootcampers, and Massive Open Online Course (MOOC) graduates. This poses a challenge for companies seeking the best person to fill their roles. At the heart of this selection process is the data science interview, a crucial juncture that determines the best fit for both the candidate and the company.
Cracking the Data Science Interview provides expert guidance on approaching the interview process with full preparation and confidence. Starting with an introduction to the modern data science landscape, you’ll find tips on job hunting, resume writing, and creating a top-notch portfolio. You’ll then advance to topics such as Python, SQL databases, Git, and productivity with shell scripting and Bash. Building on this foundation, you'll delve into the fundamentals of statistics, laying the groundwork for pre-modeling concepts, machine learning, deep learning, and generative AI. The book concludes by offering insights into how best to prepare for the intensive data science interview.
By the end of this interview guide, you’ll have gained the confidence, business acumen, and technical skills required to distinguish yourself within this competitive landscape and land your next data science job.

Details

Available formats: EPUB, MOBI

Pages: 595

Year of publication: 2024

Sample

Cracking the Data Science Interview

Unlock insider tips from industry experts to master the data science field

Leondra R. Gonzalez

Aaren Stubberfield

Cracking the Data Science Interview

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Group Product Manager: Niranjan Naikwadi

Publishing Product Manager: Nitin Nainani

Senior Editor: Hayden Edwards

Technical Editor: Simran Haresh Udasi

Copy Editor: Safis Editing

Project Coordinator: Aishwarya Mohan

Proofreader: Safis Editing

Indexer: Rekha Nair

Production Designer: Prashant Ghare

Marketing Coordinators: Vinishka Kalra

First published: March 2024

Production reference: 1160224

Published by Packt Publishing Ltd.

Grosvenor House

11 St Paul’s Square

Birmingham

B3 1RB

ISBN 978-1-80512-050-6

www.packtpub.com

Foreword

The data science landscape is ever-evolving and has been that way since its conception. Though it is a rewarding field with many opportunities, navigating it can be a challenge, especially when you’re just getting started.

During my career, I have found that various companies can interpret data science differently depending on their business needs or understanding of data science. When I first began my data science journey in 2015, I was employed as a health data analyst with a start-up. It was there that I was exposed to data science, as my role was not purely data analytics or data science, but a mixture somewhere in between. I wanted to continue learning and advancing, but I did not know where to focus my energy to gain the information needed to thrive in this field. So, I curated a list of lessons I needed to learn in order to be competent enough to enter and advance in the field. I learned Python, data science with Python, R programming, linear algebra, and calculus, and as time went on, it became more and more daunting, the list of lessons becoming even longer than what was required for a graduate degree. Unfortunately, even after all of my hard work, during interviews, I found there were still concepts that I was unaware of. This has been the issue that I, as well as others, have noted with this field – there is so much information, but it can be unclear where to begin and what information is necessary to know.

On top of this, the data science interview is universally dreaded and challenging for various reasons that I have already alluded to. For instance, candidates are usually unsure of what that particular company considers data science. Plus, take-home assignments can take hours to complete – and once that time has been invested in completing the assignment, the company may choose to not offer feedback or, even worse, disappear completely when they’ve decided they aren’t interested. After experiencing this devastating outcome more than once, I became highly selective in what companies I chose to do a take-home assignment for. Many companies had a habit of immediately asking candidates to complete a take-home assignment before an interview, which I have learned rarely works in the candidate’s favor.

This book will address and outline the concepts that are necessary to begin or progress in a data science role. Because this field is ever-evolving, our understanding of its concepts will continue to evolve as well; however, this book can be used as a reference by those who are experienced in the field, or by those who are in data science-adjacent roles and want to keep their knowledge current. This book includes essential information so that candidates can be successful during a data science interview, and it removes some of the guesswork about what companies are expecting.

It is widely expected that data science candidates have an online portfolio to showcase their talent and application of knowledge – for this reason, there is information on how to build a portfolio and create a resume that will get you noticed. Salary and benefits negotiation is also outlined to streamline the process for you; what many of us had to learn completely uninformed in the past is now shared for the benefit of others.

We are certain that you will find this book helpful in your data science journey. Cheers!

Angela Baltes, PhD

Data Scientist, UnitedHealth Group

Contributors

About the authors

Leondra R. Gonzalez is a senior data and applied scientist at Microsoft with a decade of experience in data science, analytics, and corporate strategy. In addition to her work as a data scientist, Leondra has led teams in the entertainment, media, and advertising space to produce advanced e-commerce models for top brands, including NBC Peacock, First Aid Beauty, Procter & Gamble, HBO Max, Toyota, Whirlpool, and Tubi.

Academically, Leondra graduated from Carnegie Mellon University’s Heinz College of Information Systems Management with a master’s in entertainment industry management, with a focus on business analytics; Quantic School of Business and Technology with an MBA, including a specialization in statistics; and Otterbein University with a bachelor’s in music and business. Leondra is currently pursuing a PhD in information technology with a specialization in artificial intelligence at the University of the Cumberlands, and she has researched deep learning architectures as a PhD computer science apprentice at Google.

To my loving husband, Chris, my parents, my sister, and my unborn son who kicked my bump every day while writing this book.

Aaren Stubberfield is a senior data scientist for Microsoft’s digital advertising business and the author of three popular courses on DataCamp. He graduated with an MS in predictive analytics and has over 10 years of experience in various data science and analytical roles, focused on finding insights for business-related questions.

With his experience, he has led numerous teams of data scientists and has been instrumental in the successful completion of many projects. Aaren’s technical skills include the use of AI, like LLMs, Python, and various other tools necessary for the execution of data science projects.

I want to thank the people who have been close to me and supported me, especially my wife, Pam, and my family.

About the reviewer

Vishal Kumar, a seasoned data scientist, has over seven years of experience with a premium credit card company, where he has made indelible contributions to the realms of AI and ML. He has a master’s degree in statistics from Delhi University.

Throughout his career, he has garnered a plethora of accolades, stemming from his adeptness in constructing cutting-edge decision science tools that have steered various organizations’ success. His commitment to continuous learning is evidenced by his embrace of new technologies, such as generative AI, to stay at the forefront of the ever-evolving data science landscape.

Beyond his professional pursuits, his creativity extends into his personal life, as he likes to paint and play ukulele.

Preface

Part 1: Breaking into the Data Science Field

1 Exploring Today’s Modern Data Science Landscape

What is data science?

Exploring the data science process

Data collection

Data exploration

Data modeling

Model evaluation

Model deployment and monitoring

Dissecting the flavors of data science

Data engineer

Dashboarding and visual specialist

ML specialist

Domain expert

Reviewing career paths in data science

The traditionalist

Domain expert

Off-the-beaten path-er

Tackling the experience bottleneck

Academic experience

Work experience

Understanding expected skills and competencies

Hard (technical) skills

Soft (communication) skills

Exploring the evolution of data science

New models

New environments

New computing

New applications

Summary

References

2 Finding a Job in Data Science

Searching for your first data science job

Preparing for the road ahead

Finding job boards

Beginning to build a standout portfolio

Applying for jobs

Constructing the Golden Resume

The perfect resume myth

Understanding automated resume screening

Crafting an effective resume

Formatting and organization

Using the correct terminology

Prepping for landing the interview

Moore’s Law

Research, research, research

Branding

References

Part 2: Manipulating and Managing Data

3 Programming with Python

Using variables, data types, and data structures

Answers

Indexing in Python

Using string operations

Initializing a string

String indexing

Answers

Using Python control statements, loops, and list comprehensions

Conditional statements such as if, elif, and else

Loop statements such as for and while

List comprehension

Using user-defined functions

Breaking down the user-defined function syntax

Doing “stuff” with user-defined functions

Getting familiar with lambda functions

Creating good functions

Answers

Handling files in Python

Opening files with pandas

Answers

Wrangling data with pandas

Handling missing data

Selecting data

Sorting data

Merging data

Aggregation with groupby()

Summary

References

4 Visualizing Data and Data Storytelling

Understanding data visualization

Bar charts

Line charts

Scatter plots

Histograms

Density plots

Quantile-quantile plots (Q-Q plots)

Box plots

Pie charts

Surveying tools of the trade

Power BI

Tableau

Shiny

ggplot2 (R)

Matplotlib (Python)

Seaborn (Python)

Developing dashboards, reports, and KPIs

Developing charts and graphs

Bar chart – Matplotlib

Bar chart – Seaborn

Scatter plot – Matplotlib

Scatter plot – Seaborn

Histogram plot – Matplotlib

Histogram plot – Seaborn

Applying scenario-based storytelling

Summary

5 Querying Databases with SQL

Introducing relational databases

Mastering SQL basics

The SELECT statement

The WHERE clause

The ORDER BY clause

Aggregating data with GROUP BY and HAVING

The GROUP BY statement

The HAVING clause

Creating fields with CASE WHEN

Analyzing subqueries and CTEs

Subqueries in the SELECT clause

Subqueries in the FROM clause

Subqueries in the WHERE clause

Subqueries in the HAVING clause

Distinguishing common table expressions (CTEs) from subqueries

Merging tables with joins

Inner joins

Left and right join

Full outer join

Multi-table joins

Calculating window functions

OVER, ORDER BY, PARTITION, and SET

LAG and LEAD

ROW_NUMBER

RANK and DENSE_RANK

Using date functions

Approaching complex queries

Process and answer

Summary

6 Scripting with Shell and Bash Commands in Linux

Introducing operating systems

Navigating system directories

Introducing basic command-line prompts

Understanding directory types

Filing and directory manipulation

Scripting with Bash

Introducing control statements

Creating functions

Processing data and pipelines

Using pipes

Using cron

Summary

7 Using Git for Version Control

Introducing repositories (repos)

Creating a repo

Cloning an existing remote repository

Creating a local repository from scratch

Linking local and remote repositories

Detailing the Git workflow for data scientists

Using Git tags for data science

Understanding Git tags

Using tagging as a data scientist

Understanding common operations

Summary

Part 3: Exploring Artificial Intelligence

8 Mining Data with Probability and Statistics

Describing data with descriptive statistics

Measuring central tendency

Measuring variability

Introducing populations and samples

Defining populations and samples

Representing samples

Reducing the sampling error

Understanding the Central Limit Theorem (CLT)

The CLT

Demonstrating the assumption of normality

Shaping data with sampling distributions

Probability distributions

Uniform distribution

Normal and Student’s t-distributions

The binomial distribution

The Poisson distribution

Exponential distribution

Geometric distribution

The Weibull distribution

Testing hypotheses

Understanding one-sample t-tests

Understanding two-sample t-tests

Understanding paired sample t-tests

Understanding ANOVA and MANOVA

Chi-squared test

A/B tests

Understanding Type I and Type II errors

Type I error (false positive)

Type II error (false negative)

Striking a balance

Summary

References

9 Understanding Feature Engineering and Preparing Data for Modeling

Understanding feature engineering

Avoiding data leakage

Handling missing data

Scaling data

Applying data transformations

Introducing data transformations

Logarithm transformations

Power transformations

Box-Cox transformations

Exponential transformations

Engineering categorical data and other features

One-hot encoding

Label encoding

Target encoding

Calculated fields

Performing feature selection

Types of feature selection

Recursive feature elimination

L1 regularization

Tree-based feature selection

The variance inflation factor

Working with imbalanced data

Understanding imbalanced data

Treating imbalanced data

Reducing the dimensionality

Principal component analysis

Singular value decomposition

t-SNE

Autoencoders

Summary

10 Mastering Machine Learning Concepts

Introducing the machine learning workflow

Problem statement

Model selection

Model tuning

Model predictions

Getting started with supervised machine learning

Regression versus classification

Linear regression – regression

Logistic regression

k-nearest neighbors (k-NN)

Random forest

Extreme Gradient Boosting (XGBoost)

Getting started with unsupervised machine learning

K-means

Density-based spatial clustering of applications with noise (DBSCAN)

Other clustering algorithms

Evaluating clusters

Summarizing other notable machine learning models

Understanding the bias-variance trade-off

Tuning with hyperparameters

Grid search

Random search

Bayesian optimization

Summary

11 Building Networks with Deep Learning

Introducing neural networks and deep learning

Weighing in on weights and biases

Introduction to weights

Introduction to biases

Activating neurons with activation functions

Common activation functions

Choosing the right activation function

Unraveling backpropagation

Gradient descent

What is backpropagation?

Loss functions

Gradient descent steps

The vanishing gradient problem

Using optimizers

Optimization algorithms

Network tuning

Understanding embeddings

Word embeddings

Training embeddings

Listing common network architectures

Common networks

Tools and packages

Introducing GenAI and LLMs

Unveiling language models

Transformers and self-attention

Transfer Learning

GPT in action

Summary

12 Implementing Machine Learning Solutions with MLOps

Introducing MLOps

A model pipeline overview

Understanding data ingestion

Learning the basics of data storage

Reviewing model development

Packaging for model deployment

Identifying requirements

Virtual environments

Tools and approaches for environment management

Deploying a model with containers

Using Docker

Validating and monitoring the model

Validating the model deployment

Model monitoring

Thinking about governance

Using Azure ML for MLOps

Summary

Part 4: Getting the Job

13 Mastering the Interview Rounds

Mastering early interactions with the recruiter

Mastering the different interview stages

The hiring manager stage

The technical interview

Coding questions, step by step

The panel stage

Summary

References

14 Negotiating Compensation

Understanding the compensation landscape

Negotiating the offer

Negotiation considerations

Responding to the offer

Maximum negotiable compensation and situational value

Summary

Final words

Index

Other Books You May Enjoy

Part 1: Breaking into the Data Science Field

In the first part of this book, you will learn about the data science profession as it exists in the modern day, and how this relates to your endeavors in the field. This will serve as an introduction to various career paths and help to set expectations in terms of the skills and competencies required to be successful.

This part includes the following chapters:

Chapter 1, Exploring Today’s Modern Data Science Landscape
Chapter 2, Finding a Job in Data Science

1 Exploring Today’s Modern Data Science Landscape

If you’ve picked up this book, chances are that you’ve already heard of data science. It’s arguably one of the fastest-growing, most discussed professions within the tech and STEM space, all while maintaining its relative edge and mystique. That is, many people have heard of data scientists, but very few know what they do, how a data scientist produces value, or how to break into the field from scratch.

In this chapter, we will verify the definition of data science with a practical description. Then, we will discuss what most data science jobs entail, while spending some time describing the distinction between different flavors of data science. We’ll then dive into the various paths into data science and what makes it so challenging to land your first job. We’ll finish the chapter with an overview of the non-negotiable competencies expected of data scientists.

By the end of this chapter, you will have a firm understanding of the modern data scientist, the various paths to getting the job, and what to expect in your journey to becoming one.

With this gentle introduction, you’ll have a better understanding of the job of a data scientist, which path to becoming a data scientist best fits your journey, the barriers to expect in your journey, and which skills you should master.

In this chapter, we will cover the following topics:

What is data science?
Exploring the data science process
Dissecting the flavors of data science
Reviewing career paths in data science
Tackling the experience bottleneck
Understanding expected skills and competencies
Exploring the evolution of data science

What is data science?

To begin, let’s offer a definition of data science. According to Wikipedia, data science “is an interdisciplinary academic field that uses statistics, scientific computing, scientific methods, processes, algorithms, and systems to extract or extrapolate knowledge and insights from noisy, structured, and unstructured data”[1]. It encompasses various techniques, procedures, and tools to process, analyze, and visualize data, enabling businesses and organizations to make data-driven decisions and predictions. The primary goal of data science is to identify patterns, relationships, and trends within data to support decision-making and create actionable insights.

You are not alone in your interest in data science – the Harvard Business Review called it one of the sexiest jobs of the 21st century [2], and stories of data scientists earning six-figure salaries are not uncommon. Data scientists are often looked at as oracles within an organization, answering complex business questions such as, “If we increase our offering to this group of customers, can we increase our revenues?” or “What are the common causes of customer churn?”

Within organizations, the demand for the skills of data scientists has continued to grow. In 2022, the U.S. Bureau of Labor Statistics estimated that the number of jobs for data scientists would increase by roughly 36% over the following 10 years [3]. This growth in the demand for data scientists is being fueled by several factors, which are shown here:

Figure 1.1: Reasons for the increased demand for data scientists

The first is the proliferation of data. The exponential growth of data generated by digital devices, social media, and various other sources has made it essential for organizations to harness this data for decision-making and innovation. This data growth is expected to continue in the future, with the International Data Corporation (IDC) expecting that by 2025, we will generate 175 zettabytes of data annually [4]. That is a staggering amount of data!

Organizations want to take advantage of this explosion in data availability to generate insights for decision-making. As the world becomes more interconnected and complex, the need for evidence-based decision-making has grown, leading to an increased demand for skilled data scientists who can transform data into actionable insights. Organizations and businesses increasingly rely on data-driven insights to gain a competitive edge in the market, optimize operations, and improve customer experiences.

Finally, transforming data into insights couldn't be accomplished without advances in computational power and in tools and platforms. Increased computing power and the development of advanced algorithms, especially in machine learning (ML) and deep learning (DL), have made it possible to efficiently process and analyze massive amounts of data. In addition, the development of open source tools, libraries, and platforms has made data science more accessible to a broader audience, fostering the growth of the profession.

Hence, data science is still an evolving field that is only expected to grow in parallel with computational and technological advancements (such as generative AI). Furthermore, as companies continue to embrace the digital age with an increased interest in maximizing their utility of data and capitalizing on its underlying insights for a competitive advantage, the demand for data scientists will also expand.

However, although data science is often regarded and described as a monolithic function, you’ll soon learn that it’s a multi-faceted discipline that often varies by team, department, or even company. Naturally, the data scientist job profile is also an ever-evolving description, but we will cover all our bases for the most common tasks.

Exploring the data science process

Performing data science work is often an iterative process, where the data scientist needs to return to earlier steps if they run into challenges. There are many ways to categorize the data science process, but it often includes:

Data collection
Data exploration
Data modeling
Model evaluation
Model deployment and monitoring

Let’s briefly touch on each step and discuss what’s expected of the data scientist during them.

Data collection

Data collection and preprocessing involves gathering data from various sources (such as databases, APIs, and web scraping), then cleaning and transforming the data to prepare it for analysis. This step involves dealing with missing, inconsistent, or noisy data and converting it into a structured format. Depending on the organization, a team of data engineers supports this step of the data science process; however, it is common for the data scientist to manage this process as well. This requires them to have intimate knowledge of the data sources and the ability to write Structured Query Language (SQL) queries, code that can query databases, or custom tools such as web scrapers to gather the needed data.
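As a rough illustration of this step, the following sketch pulls rows from a database with a SQL query and then applies a simple cleaning pass. The orders table, its columns, and the clean helper are hypothetical and exist only for this example; dropping incomplete rows is just one possible policy, and a real project might impute missing values instead.

```python
import sqlite3

# Hypothetical in-memory database standing in for a real data source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INTEGER, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, "north", 120.0),
        (2, " North ", None),  # missing amount, inconsistent region label
        (3, "south", 80.5),
        (4, None, 42.0),       # missing region
    ],
)

# Collect: pull the raw rows with a SQL query.
raw_rows = conn.execute(
    "SELECT customer_id, region, amount FROM orders ORDER BY customer_id"
).fetchall()
conn.close()


def clean(rows):
    """Drop rows with missing values and normalize the region label."""
    cleaned = []
    for customer_id, region, amount in rows:
        if region is None or amount is None:
            continue  # assumed policy: skip incomplete rows
        cleaned.append(
            {
                "customer_id": customer_id,
                "region": region.strip().lower(),
                "amount": amount,
            }
        )
    return cleaned


clean_rows = clean(raw_rows)
print(clean_rows)
```

The result is a list of dictionaries with one consistent structure, which is the kind of structured format the later steps can work from.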

Data exploration

Data exploration involves conducting exploratory data analysis (EDA) to better understand the data, detect anomalies, and identify relationships between variables. The key to this step is to look for correlations and understand the distribution of the data. This involves using descriptive statistics and visualization techniques to summarize the data and gain insights; therefore, the data scientist should be able to use summary statistics, program descriptive visualizations, or utilize reporting tools such as Power BI or Tableau to create robust charts.
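A minimal sketch of what this can look like in code, using only Python's standard library. The ad_spend and revenue figures are made-up sample data, and summarize and pearson are illustrative helper names rather than functions from any particular package.

```python
import statistics

# Made-up sample data: monthly ad spend and revenue, in thousands.
ad_spend = [10.0, 12.0, 15.0, 18.0, 20.0, 25.0]
revenue = [40.0, 44.0, 52.0, 60.0, 63.0, 80.0]


def summarize(values):
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


print(summarize(ad_spend))
print(summarize(revenue))
print(pearson(ad_spend, revenue))
```

A correlation close to 1 here would suggest a strong linear relationship worth examining in a scatter plot before any modeling begins.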

Data modeling

Using what was learned in the data exploration step, data modeling is the step when the data scientist builds their predictive or descriptive models using ML and statistical techniques that identify patterns and relationships in the data. Here, the data scientist selects the appropriate algorithms, trains the models on historical data, and validates their performance.
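To make the train-then-validate idea concrete, here is a small sketch that fits a one-variable linear regression by ordinary least squares on the earlier part of a dataset and checks it on held-out points. The data and the fit_line, predict, and mean_absolute_error helpers are assumptions made for illustration; in practice, a library such as scikit-learn would typically handle this.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, xs):
    """Apply a fitted (slope, intercept) pair to new inputs."""
    slope, intercept = model
    return [slope * x + intercept for x in xs]


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up historical data that roughly follows y = 3x + 2.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [5.1, 7.9, 11.2, 13.8, 17.1, 20.0, 22.9, 26.1]

# Train on the first six observations, validate on the last two.
train_x, valid_x = xs[:6], xs[6:]
train_y, valid_y = ys[:6], ys[6:]

model = fit_line(train_x, train_y)
valid_mae = mean_absolute_error(valid_y, predict(model, valid_x))
print(model, valid_mae)
```

Keeping the validation points out of the fitting step is what makes the error estimate meaningful; a model scored only on the data it was trained on tends to look better than it really is.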

Model evaluation

Model evaluation and optimization involves assessing the performance of models using metrics such as accuracy, RMSE, precision, recall, AUC, or F1 scores. Based on these evaluations, data scientists may refine the models or try alternative algorithms to improve their performance. Understanding the underlying reasons behind a model’s predictions is crucial for building trust in its results and ensuring that it aligns with the domain knowledge. Therefore, the data scientist must be sure the model solves the organizational/business goal. Here, the data scientist needs to be able to communicate their findings to both technical and non-technical audiences.
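Several of the metrics named above can be computed by hand, which is a useful exercise before relying on a library. The sketch below assumes binary labels encoded as 0 and 1; the label lists and function names are illustrative only.

```python
import math


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def rmse(y_true, y_pred):
    """Root mean squared error for regression predictions."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Illustrative labels: 3 true positives, 3 true negatives,
# 1 false positive, and 1 false negative.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
print(rmse([3.0, 5.0], [2.0, 7.0]))
```

Which metric matters most depends on the business goal: when false negatives are costly, for example, recall usually deserves more weight than accuracy.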

Model deployment and monitoring

Model deployment and monitoring involves implementing the models in real-world applications, monitoring their performance, and maintaining them to ensure their continued accuracy and relevance. For example, the data scientist might work with a data engineering team or use tools such as containers to implement the model. Once deployed, the data scientist may also need to develop dashboards to monitor the model’s performance over time and flag stakeholders if it goes outside the expected performance range.
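The monitoring idea can be reduced to a very small check: compare a tracked metric against an expected range and report the periods that fall outside it. The weekly accuracy values, the thresholds, and the out_of_range_periods helper below are all hypothetical; a production setup would more likely rely on a monitoring platform or dashboard tool.

```python
def out_of_range_periods(metric_by_period, lower, upper):
    """Return the periods whose metric falls outside [lower, upper]."""
    return [
        period
        for period, value in metric_by_period.items()
        if not (lower <= value <= upper)
    ]


# Hypothetical weekly accuracy of a deployed model.
weekly_accuracy = {
    "2024-W01": 0.91,
    "2024-W02": 0.90,
    "2024-W03": 0.84,
    "2024-W04": 0.89,
}

# Assumed expected performance range for this model.
alerts = out_of_range_periods(weekly_accuracy, lower=0.88, upper=1.0)

for period in alerts:
    print(f"Accuracy outside expected range in {period}; notify stakeholders")
```

The same pattern extends to other signals, such as input data drift or prediction volume, each with its own expected range.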

As you can see, data science is a profession that incorporates many data-related tasks – particularly those that involve the acquisition, prepping, and delivery of data in one format or another. While data modeling makes up most of the glitz and glamour associated with the job, it is really everything else that takes up roughly 80% of the gig. This does not include non-data-related tasks, such as interfacing with stakeholders, gathering requirements, debugging software, checking emails, and research. However, those tasks are not necessarily unique to data scientists.

Now that you understand the common tasks associated with the job, let’s explore the different types or flavors of data science.

Dissecting the flavors of data science

Now that we have defined some of the critical aspects of the role of a data scientist, it is clear that the role often covers many different skills. Data scientists are frequently asked to perform a variety of data-related tasks, including designing database tables to collect data, programming ML algorithms, understanding statistics, and creating stunning visuals to help explain interesting findings to others, but it is difficult for any single person to master all of these skill areas.

Therefore, we often see data scientists who are particularly skilled in one or two areas and have basic competencies in the others. Their talents could be considered T-shaped: they are proficient across many areas, like the horizontal line of a T, while they have deep knowledge and expertise in a few areas, like the vertical portion of the letter:

Figure 1.2: Example of the ‘T of Competencies’

While this figure shows someone who is adequate in data engineering and visualization principles but exceptional in ML, you can expect to see every possible combination of skills among data scientists. These competencies are often aligned with a person’s unique experiences or interests. Perhaps they were a statistics major and took a liking to ML, or perhaps they’re a former business intelligence (BI) engineer with considerable experience in data extraction, transformation, and loading (ETL), allowing them to grasp data engineering concepts much faster.

Whatever the reason, it’s natural for someone to grasp some concepts better than others. This is important to remember as you navigate this book. While you are not expected to specialize in every facet of data science, you are expected to master the fundamentals. However, you will almost certainly discover your T of Competencies – a trinity of top skill sets that will solidify your identity in the data science space.

While there are countless combinations of skill proficiencies, let’s review some of the most common that you will encounter:

The data engineer
The dashboarding and visual specialist
The ML specialist
The domain expert

Let’s take a look at these now.

Data engineer

As we discussed earlier, data engineering is a crucial aspect of the data science process that involves data collection, storage, processing, and management. It focuses on designing, developing, and maintaining scalable data infrastructure, ensuring the availability of high-quality data for analysis and modeling. Data engineers are best known for their oversight of the ETL process of data pipelines. On some data science teams, especially within smaller organizations, the data engineering responsibilities sit within the data science team. Therefore, the data scientist specializing in this area can help support team projects with data collection and storage, understanding the needs of the ML process, such as structuring the data so that it can be fed efficiently to a DL algorithm.

Data engineers have a wealth of tools to choose from. It is not expected for any single data engineer to know all of these technologies, especially at the same level of competencies. In fact, the more senior the engineer, the more competent they are in their tools of choice. Furthermore, this is not a comprehensive list. However, you can expect to see the following on data engineer resumes:

Programming languages: Python, SQL, Scala, R, C++
Data storage: Relational databases (for example, MySQL, PostgreSQL, Oracle), NoSQL databases (for example, MongoDB, Cassandra, DynamoDB), data warehouses (for example, Snowflake, Redshift, BigQuery), distributed filesystems (for example, Hadoop Distributed File System (HDFS))
Data processing and analysis: Apache Spark, Apache Flink, Apache Storm, Apache Beam, MapReduce, Hadoop, Hive, Apache Kafka, Amazon Kinesis
Data integration and ETL: Apache NiFi, Talend, Apache Airflow, AWS Glue, Google Cloud Dataflow, dbt
Data version control and collaboration: Git, GitHub, GitLab, Bitbucket, Azure DevOps
Data visualization and BI: Tableau, Power BI, Looker, QlikView, Domo
Cloud platforms and infrastructure: Microsoft Azure, Google Cloud Platform (GCP), Amazon Web Services (AWS)
Containers: Docker, Kubernetes

Dashboarding and visual specialist

Data visualization is the graphical representation of data and information using visual elements such as charts, graphs, and maps. It enables stakeholders to understand complex patterns, trends, and relationships in data, allowing for more informed decision-making. Data visualization helps simplify complex data and present it in an easily digestible format, identify patterns, trends, and correlations in data, support data-driven decision-making, and communicate insights and findings effectively to a broad audience. Combining data visualizations with a compelling narrative can become a powerful motivator to drive organizational actions. Many news organizations hire phenomenal data scientists specializing in data visualization to communicate complex information to their audience.

Dashboarding and visual specialists have different designations depending on the organization, but some of the most common names you’ll hear include BI engineer, data analyst, data visualization expert, and data storyteller. They are commonly individuals with a strong background in descriptive statistics, data storytelling, and developing key performance indicators (KPIs). The most common tools you will see used by dashboarding and visual specialists include:

Programming languages: Python, SQL, R, JavaScript
Data storage: Relational databases (for example, MySQL, PostgreSQL, Oracle), NoSQL databases (for example, MongoDB, Cassandra, DynamoDB), data warehouses (for example, Snowflake, Redshift, BigQuery)
Frameworks: Dask, Plotly, ggplot2, Shiny, Matplotlib, Seaborn, D3.js
Data visualization and BI: Tableau, Power BI, Looker, QlikView, Domo, Funnel, Excel
Cloud platforms and infrastructure: Microsoft Azure, GCP, AWS

ML specialist

When most people think about data scientists, they think about someone who designs and implements ML algorithms. ML specialists and engineers enable computers to learn and improve from experience without explicit programming. They do this by developing algorithms and models that analyze data, identify patterns, and make predictions or decisions based on those patterns. They play a critical role in building intelligent applications and systems. ML specialists have a strong sense of which learning algorithms to use and how to adjust their parameters to achieve the best performance.

As a result, they have a strong propensity toward research to stay current on the latest methods of quantitative problem-solving and are specifically skilled in ML development, deployment, and maintenance tasks. They have a robust toolset as they are highly proficient in software development principles. While it certainly isn’t a rule, many ML specialists tend to have a strong background in statistics, operations research, computer science, and/or information systems. Tools used by ML specialists might include:

Programming languages: Python, SQL, R, Java, C++
Frameworks: TensorFlow, Keras, scikit-learn, PyTorch, H2O, Hugging Face
Data storage: Relational databases (for example, MySQL, PostgreSQL, Oracle), NoSQL databases (for example, MongoDB, Cassandra, DynamoDB), data warehouses (for example, Snowflake, Redshift, BigQuery), distributed filesystems (for example, HDFS)
Data processing and analysis: Apache Spark, Apache Flink, Apache Storm, Apache Beam, MapReduce, Apache Kafka
Data integration and ETL: Apache NiFi, Talend, Apache Airflow, AWS Glue, Google Cloud Dataflow
Data version control and collaboration: Git, GitHub, GitLab, Bitbucket
Cloud platforms and infrastructure: Microsoft Azure, GCP, AWS
Deployment: Docker, Kubernetes, Flask

Domain expert

Domain experts are data scientists with in-depth knowledge and expertise in specific domains within the industry or field; for example, someone who has gained much knowledge and expertise working on computer vision (CV) or natural language (NL) problems. They leverage their domain knowledge to develop custom ML models and data analysis techniques tailored to their domain’s unique challenges and requirements. However, there are also non-technical domain experts who gained a deep familiarity with a particular industry or business problem given their professional history. For example, someone with a background in digital marketing may have an edge for a data science role that requires an understanding of media mix modeling or data-driven attribution, whereas someone with aviation experience may have an advantage in route optimization models.

Because domain experts tend to carry domain-specific expertise, they often are already familiar with the tools of their specific industry. For example, a digital marketing professional is bound to have some experience with a myriad of MarTech platforms, including Google Analytics, Adobe Analytics, HubSpot, and more.

These are just some of the flavors or different areas to specialize in within data science. You will not need to be an expert in all of these areas, but you will need to show some level of competency and willingness to grow in all of these areas. Often when working on data science projects, you will gravitate to one of these areas out of necessity or passion; gaining practical experience will be key here and strengthen your candidacy for a role where the hiring manager is looking for someone with that skill set.

If you haven’t noticed, many of these data science flavors are the consequence of one’s prior experience, either in tech or otherwise. For example, a software engineer may be well suited to transition into ML or data engineering, while a data analyst may find an easier time transitioning to data engineer or BI engineer. As you’ve seen, there is a considerable overlap in skills, tools, and tasks with all flavors of data science.

This brings us to the paths to data science. You may have already envisioned where you fit into the equation given some of the prior descriptions. Let’s take the time to explicitly discuss some common paths to the data science profession.

Reviewing career paths in data science

The field of data science is rapidly evolving, drawing professionals from various backgrounds and disciplines. This dynamic landscape has given rise to a multitude of career paths in data science, each bringing their unique perspectives, skills, and experiences to the table. In this section, we will explore three primary types of data scientists: the traditionalist, the domain expert, and the off-the-beaten path-er. Does one of these career paths best fit you?

The traditionalist

The traditionalist data scientist has followed a more conventional educational path toward data science. They typically possess a strong background in computer science or mathematics, often with a minor in the other. Other common majors include operations research, statistics, physics, and engineering. These individuals often go on to earn an advanced degree in these fields, including a master’s degree or even a Ph.D. Their rigorous academic training equips them with a deep understanding of statistical methodologies, programming languages, and advanced algorithms.

The traditionalist data scientist has a comprehensive understanding of the underlying mathematical and statistical principles that govern the field of data science. They are well-versed in probability theory, linear algebra, calculus, and optimization techniques, which form the basis for many ML algorithms and statistical modeling. This theoretical foundation enables them to grasp the nuances of various methods and research the most appropriate approach for a given problem.

Equipped with a background in computer science, traditionalists are adept at programming languages commonly used in data science, such as Python and R. Their programming skills allow them to manipulate data, implement ML algorithms, and develop custom solutions tailored to specific problems. Furthermore, they are skilled in using specialized libraries and frameworks, such as TensorFlow, PyTorch, and scikit-learn, to expedite the development of data science projects.

In brief, the traditionalist data scientist is characterized by their strong STEM academic background, comprehensive understanding of statistical principles, and proficiency in programming and data manipulation. If your background is traditionalist, we suggest positioning yourself in job interviews as someone with deep expertise in ML. In addition, highlight any research experience you have.

Domain expert

Domain expert data scientists are professionals who initially started their careers in a specific industry, such as marketing, finance, healthcare, or supply chain, before branching out into data science. With a strong understanding of their domain, these individuals have gradually acquired data analysis and programming skills to supplement their expertise (for example, a company controller uses domain expertise and knowledge to develop an ML algorithm that flags fraudulent transactions). Domain experts possess a unique ability to leverage their domain knowledge to uncover relevant insights from data, enabling organizations to make data-driven decisions that drive growth and efficiency.

Domain experts have a comprehensive understanding of the intricacies and nuances of their industry, making them invaluable assets in data-driven projects. Their knowledge of industry-specific challenges, trends, and best practices enables them to identify critical business problems and frame data-driven solutions that are relevant and impactful. Armed with extensive domain knowledge and analytical skills, domain expert data scientists excel at developing solutions tailored to their industry. In addition, they have a keen ability to translate business questions into data-driven hypotheses and use their understanding of the sector’s unique characteristics to guide their analysis. This targeted approach allows them to generate insights that directly address the needs and priorities of their industry.

Additionally, domain experts are well versed in the analytical tools and software commonly used in their respective fields. These specialized tools, which may include industry-specific data platforms, visualization software, or ML frameworks, allow them to efficiently process and analyze data unique to their domain. Their expertise with these tools enables them to deliver insights more quickly and effectively than their counterparts who lack industry-specific knowledge.

Finally, one of the critical strengths of domain expert data scientists is their ability to communicate complex data insights to non-technical stakeholders within their industry. In addition, they understand the context and terminology of their domain, enabling them to present findings in a manner that resonates with their business partners. This skill is critical for driving data-driven decision-making and ensuring that the value of their work is recognized and understood by their organization.

In summary, if you have specialized knowledge of the field you are interviewing for, we suggest positioning yourself as a domain expert data scientist. Highlight your deep understanding of the industry and its challenges, which enables you to deliver targeted and impactful data-driven solutions. Additionally, highlight that you can communicate complex insights effectively using industry terminology. Your combination of domain knowledge and data science techniques will make you a valuable asset to any organization in that field.

Off-the-beaten path-er

The off-the-beaten path-er data scientist is an individual who has ventured into data science from what’s deemed as a non-traditional background. These professionals may come from diverse fields with less focus on quantitative tasks, such as psychology, music, or even journalism. This unconventional background can provide them with unique perspectives and creative problem-solving abilities, enriching the field of data science with their varied experiences.

Off-the-beaten path-ers possess a wide range of educational and professional backgrounds, which equip them with diverse skills and knowledge. They may have initially pursued a career in a different domain before discovering their passion for data science. This varied experience often results in a broader, interdisciplinary approach to problem-solving, allowing them to draw connections and insights that might be overlooked by their more traditionally trained peers. For example, off-the-beaten path-ers might approach the problems within ML and artificial intelligence (AI) ethics (a topic of increasing relevance within AI) differently than the traditionalist or domain expert. They may also regard ML and AI as tools to create a better world by tackling humanitarian issues such as disaster response, public health, food security, and human rights. Furthermore, AI may also appeal to civil engineers with an interest in smart cities or to political science majors interested in detecting implicit biases in the criminal justice system.

With their unconventional backgrounds, off-the-beaten path-ers bring a unique perspective to data science, enabling them to tackle problems from a different angle. Their creativity and innovative thinking can lead to the development of new methods, models, or visualizations that challenge the status quo and push the boundaries of what is possible in data science. This outside-the-box thinking is valuable, especially when addressing complex or novel challenges.

Also, with their unique backgrounds, off-the-beaten path-ers are well equipped to collaborate with professionals from various disciplines, leveraging their distinct perspectives to solve complex problems. Their ability to work effectively with interdisciplinary teams can lead to the development of innovative solutions that combine the strengths of multiple fields, driving growth and success for the organization. To facilitate working with different backgrounds, they often have to communicate complex ideas and insights effectively to diverse audiences. Off-the-beaten path-ers often understand the importance of storytelling in data science, using data visualizations and narratives to convey their findings clearly and compellingly. This skill enables them to bridge the gap between technical experts and non-technical stakeholders, facilitating collaboration.

In conclusion, if you have come to data science as an off-the-beaten path-er, we recommend positioning yourself in job interviews as someone who is adaptive and can bring your unique perspective to facilitate creative problem-solving. Additionally, highlight any abilities to communicate and collaborate.

As the field of data science continues to expand, the diversity of its professionals will only increase. The traditionalist, domain expert, and off-the-beaten path-er each bring unique strengths and perspectives. Of course, these are just generic groupings of data science professionals and you may be a mix of all of these profiles. Embracing your individual strengths will allow you to best position yourself in a data science interview.

Nonetheless, while all of these paths have their benefits, none of them is without barriers. A common misconception in data science is that there is a perfect path, one so comprehensive that it will be free of bottlenecks. While it is true that some paths have advantages over others, they each have gaps to address. While some of these gaps are flavor- or path-specific, they all share one: getting the first data science job.

Tackling the experience bottleneck

So, you want to be a data scientist? Welcome to The Hunger Games: Data Science Edition!

While that may sound like an exaggeration, the increasing demand for data scientists has turned the interview process into a battleground for candidates with various backgrounds and expertise.

But fear not – just as with The Hunger Games, the odds can be in your favor.

The fact that there is competition should not scare you away from entering the field. You’ve already shown your interest and commitment by reading this book, and as you progress through it, you’ll learn how to prepare for data science interviews, regardless of your background. In addition, we will share strategies to fill gaps in your experience to make you a stronger candidate. Remember – you have your own set of strengths and weaknesses. You can come out on top by focusing on your gaps and understanding your unique skills.

Believe it or not, it's incredibly common for candidates to have gaps in their experience. In the next couple of sections, we will review two familiar sources of experience gaps: academic and work experience gaps. In addition to noting these gap areas, we will give you suggestions on how to close them.

Academic experience

One common gap in a job candidate’s experience is their academic background. Employers may favor candidates with formal degrees in data science, computer science, or a related field, making it challenging for those without a traditional academic background to stand out. You may not be an engineer or a programmer by trade, or you may understand math or computers but have yet to get into the details of hypothesis testing. There’s no need to worry. The first step in addressing gaps in your academic background is identifying them. Reflect on your education and experience, and ask yourself the following questions:

In which areas of data science do I feel the least confident?
To which technologies or concepts do I need more exposure?
Which topics or tasks do I struggle with the most during interviews or when working on projects?
What models are commonly needed for the job that I want?

Once you’ve identified your gaps, you can create an action plan to address them effectively. Here are several methods to help you fill the academic experience gap and strengthen your data science candidacy:

Pursue relevant certifications: Obtain certifications in data science, ML, AI, or related fields from reputable organizations or platforms (for example, DataCamp, Codecademy, Sololearn, Alison, Udemy, Udacity, Google certifications, and so on). These certifications can help you gain credibility, showcase your expertise, and demonstrate your commitment to learning.
Attend workshops and boot camps: Participate in workshops, boot camps, or short-term courses that provide hands-on experience in data science techniques and tools. For example, Meetup.com and LinkedIn are useful sites for identifying local or virtual data science groups. This will not only help you enhance your skills but also allow you to connect with other professionals in the field.
Leverage Massive Open Online Courses (MOOCs): Enroll in MOOCs from top universities or platforms to learn data science concepts and techniques. Common websites include Coursera and edX. These courses can help you build a strong foundation in the subject and supplement your non-traditional academic background.
Build a strong portfolio: Create a robust portfolio that showcases your data science projects, coding skills, and problem-solving abilities. Highlight your unique perspective and how your non-traditional background has contributed to your approach to data science.
Network with data science professionals: Connect with professionals in the data science field through networking events, online forums, or social media platforms such as LinkedIn. This can help you gain insights into the industry, learn about job opportunities, and build relationships that can lead to mentorship or job referrals.

Resources, such as books, online courses, and tutorials, help you gain the necessary knowledge. Develop a realistic timeline for completing any of these activities and don't become overwhelmed by the vast availability of online courses. Setting achievable goals and being patient with yourself is important when developing your learning plan. Remember – data science is a vast field, and it takes time to become proficient. Set a dedicated time to work on your learning plan. In addition, engage with the data science community through forums, social media, and networking events to learn from others and stay motivated.

Work experience

Another common experience gap for candidates is related to work experience. Entering the data science field can be challenging, particularly when faced with the work experience bottleneck. Employers often seek candidates with prior experience, creating a catch-22 for aspiring data scientists: you need experience to get a job, but you need a job to gain experience! This section will explore common reasons for gaps in a work background and provide strategies to help you overcome the work experience bottleneck.

There are several reasons why your work background might not perfectly align with what an employer is looking for. You may be making a career transition from a different field, or you may be a recent graduate with limited or no full-time experience. You may have employment gaps due to personal reasons (for example, caregiving, health, or travel), or you may have done freelance or contract work, which may not be perceived as consistent or relevant experience.

Understanding the reasons behind work background gaps is essential for crafting a compelling narrative and demonstrating your value to potential employers. Here are several methods to help you fill the work experience gap and strengthen your data science candidacy:

Personal projects: Develop and showcase personal projects demonstrating your skills, creativity, and problem-solving abilities. Choose projects that align with your career interests or target industries. This will help build your portfolio and show your passion and commitment to the field.
Internships, co-ops, fellowships, and apprenticeships: Seek internships, co-ops, fellowships, or apprenticeships to gain hands-on experience and make valuable connections in the industry. These opportunities can provide a foot in the door, allowing you to learn from experienced professionals and build a network that can lead to future job prospects. There are even some online internships. For example, Forage offers virtual experiences hosted by top companies including JPMorgan Chase, Walmart, KPMG, Lyft, Red Bull, PwC, Accenture, Deloitte, GE, and more. Many tech companies, such as Microsoft, Amazon, and Google, offer apprenticeships for recent graduates and professionals. Some organizations offer online fellowships, such as Correlation One and Insight Fellows.
Freelance and consulting work: Offer freelance or consulting services to businesses and organizations, even if on a pro bono basis. This allows you to gain practical experience, enhance your skills, and build a track record of success. In addition, it demonstrates your ability to work with clients and solve real-world problems. Websites include Upwork, Fiverr, FlexJobs, and so on.
Online competitions and hackathons: Participate in data science competitions and hackathons, such as those hosted on Kaggle or DrivenData. These events allow you to work on challenging problems, collaborate with others, and showcase your skills to potential employers.
Open source contributions: Contribute to open source projects related to data science, ML, or AI. This improves your technical skills and demonstrates your ability to collaborate with others and contribute to the broader data science community.

By employing these strategies, you can overcome the work experience bottleneck and position yourself as a strong candidate in the data science job market. Remember – persistence and adaptability are key to success. Stay focused on your goals, seize opportunities to learn and grow, and, ultimately, you’ll break through the work experience barrier to land your dream data science job.

Now that you’ve had a proper introduction to bottlenecks that you might encounter, as well as methods and resources to address them, let’s gain a better understanding of the skills and competencies that are expected of you. After reviewing both hard skills and underrated soft skills, you will be able to isolate your competency gaps, which will not only help you identify which resources to leverage but will also help you navigate this book in a more pointed and goal-oriented fashion. While it is encouraged to review the book in its entirety, you can prepare for sections that might require more attention.

Understanding expected skills and competencies

Here’s the deal – the interview is a critical component of the data science job application process, where you can showcase your skills, knowledge, and personality to potential employers. The interview process is crucial for several reasons:

Employers can assess your technical skills, problem-solving abilities, and critical thinking
It lets you demonstrate your communication skills, teamwork, and cultural fit
It allows you to ask questions and gather information about the company and role to ensure it aligns with your career goals and values
Preparing for the interview is essential to stand out in the competitive job market and secure your dream role

Preparing for the data science interview is essential to success. In fact, it’s one of the most useful activities that you can do for your career. This is not only true for prospective data scientists looking to land their first job in the field but also for well-seasoned data scientists who wish to stay on top of new techniques and technologies. In later sections of this book, we will help you prepare by reviewing the most common data science interview topics, including technical and case study questions. In addition, we will give you problems to practice your problem-solving skills, coding, and data manipulation techniques. Beyond these activities, you should also prepare by researching the company, its culture, products, and industry trends. Additionally, prepare questions to ask the interviewer to demonstrate your interest and engagement.

For now, know that most data science interviews consist of two primary areas: technical (hard) skills and non-technical (soft) skills. Each area serves a different purpose and requires distinct preparation strategies. The technical portion assesses your knowledge and skills in data science, programming, statistics, and ML. For example, it may include coding exercises or algorithmic questions, data manipulation and cleaning tasks, statistical analysis or hypothesis testing questions, and ML model selection and evaluation problems. Meanwhile, the non-technical portion evaluates your communication skills, problem-solving skills, and ability to work in a team. It may involve questions about your past experiences and accomplishments, situational or problem-solving scenarios, discussion of your strengths, weaknesses, and work style, and exploration of your motivations and career aspirations.

Mastering the data science interview is a crucial skill that can make or break your career. We don’t win them all, and studying for these interviews can feel like preparing for a marathon, especially when you have to prepare for multiple interviews and/or take-home assignments. The key to breaking into the data science field is building strong foundations in expected skills and competencies. By excelling in the interview process, you can leave a lasting impression on potential employers and increase your chances of receiving a job offer. Furthermore, by understanding the interview’s structure, preparing thoroughly for both the technical and non-technical portions, and effectively highlighting your strengths and skills, you’ll be well on your way to success in the data science field.

Let’s take a deeper look into what’s included in the hard and soft skills expected of a prospective data scientist. After the review, you will have a clearer concept of the proficiencies you will learn throughout this book.

Cracking the Data Science Interview

Foreword

Contributors

About the authors

About the reviewer

Table of Contents

Preface

Part 1: Breaking into the Data Science Field

1

Exploring Today’s Modern Data Science Landscape

What is data science?

Exploring the data science process

Data collection

Data exploration

Data modeling

Model evaluation

Model deployment and monitoring

Dissecting the flavors of data science

Data engineer

Dashboarding and visual specialist

ML specialist

Domain expert

Reviewing career paths in data science

The traditionalist

Domain expert

Off-the-beaten path-er

Tackling the experience bottleneck

Academic experience

Work experience

Understanding expected skills and competencies

Hard (technical) skills

Soft (communication) skills

Exploring the evolution of data science

New models

New environments

New computing

New applications

Summary

References

2

Finding a Job in Data Science

Searching for your first data science job

Preparing for the road ahead

Finding job boards

Beginning to build a standout portfolio

Applying for jobs

Constructing the Golden Resume

The perfect resume myth

Understanding automated resume screening

Crafting an effective resume

Formatting and organization

Using the correct terminology

Prepping for landing the interview

Moore’s Law

Research, research, research

Branding

References

Part 2: Manipulating and Managing Data

3

Programming with Python

Using variables, data types, and data structures

Answers

Indexing in Python

Using string operations

Initializing a string

String indexing

Answers

Answers

Using Python control statements, loops, and list comprehensions

Conditional statements such as if, elif, and else

Loop statements such as for and while

List comprehension

Using user-defined functions

Breaking down the user-defined function syntax

Doing “stuff” with user-defined functions

Getting familiar with lambda functions

Creating good functions

Answers