Hands-On Data Science with SQL Server 2017

Marek Chmel


Find, explore, and extract big data to transform it into actionable insights




Key Features



  • Perform end-to-end data analysis—from exploration to visualization


  • Real-world examples, tasks, and interview questions to help you become a proficient data scientist


  • Understand how SQL is used for big data processing using HiveQL and SparkSQL






Book Description



SQL Server is a relational database management system that enables you to cover end-to-end data science processes using various inbuilt services and features.







Hands-On Data Science with SQL Server 2017 starts with an overview of data science with SQL to understand the core tasks in data science. You will learn intermediate-to-advanced level concepts to perform analytical tasks on data using SQL Server. The book has a unique approach, covering best practices, tasks, and challenges to test your abilities at the end of each chapter. You will explore the ins and outs of performing various key tasks such as data collection, cleaning, manipulation, aggregations, and filtering techniques. As you make your way through the chapters, you will turn raw data into actionable insights by wrangling and extracting data from databases using T-SQL. You will get to grips with preparing and presenting data in a meaningful way, using Power BI to reveal hidden patterns. In the concluding chapters, you will work with SQL Server Integration Services to transform data into a useful format and delve into advanced examples covering machine learning concepts such as predictive analytics using real-world examples.







By the end of this book, you will be in a position to handle the growing amounts of data and perform everyday activities that a data science professional performs.




What you will learn



  • Understand what data science is and how SQL Server is used for big data processing


  • Analyze incoming data with SQL queries and visualizations


  • Create, train, and evaluate predictive models


  • Make predictions using trained models and establish regular retraining routines


  • Incorporate data source querying into SQL Server


  • Enhance built-in T-SQL capabilities using SQLCLR


  • Visualize data with Reporting Services, Power View, and Power BI


  • Transform data with R, Python, and Azure





Who this book is for



Hands-On Data Science with SQL Server 2017 is intended for data scientists, data analysts, and big data professionals who want to master their skills in SQL and its applications. This book will be helpful even for beginners who want to build their career as data science professionals using the power of SQL Server 2017. Basic familiarity with the SQL language will aid your understanding of the concepts covered in this book.




Hands-On Data Science with SQL Server 2017

 

Perform end-to-end data analysis to gain efficient data insight

Marek Chmel
Vladimír Mužný

BIRMINGHAM - MUMBAI

Hands-On Data Science with SQL Server 2017

Copyright © 2018 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Commissioning Editor: Sunith Shetty
Acquisition Editor: Tushar Gupta
Content Development Editor: Snehal Kolte
Technical Editor: Dharmendra Yadav
Copy Editor: Safis Editing
Project Coordinator: Manthan Patel
Proofreader: Safis Editing
Indexer: Mariammal Chettiyar
Graphics: Jisha Chirayil
Production Coordinator: Deepika Naik

First published: November 2018

Production reference: 1281118

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.

ISBN 978-1-78899-634-1

www.packtpub.com

 
mapt.io

Mapt is an online digital library that gives you full access to over 5,000 books and videos, as well as industry-leading tools to help you plan your personal development and advance your career. For more information, please visit our website.

Why subscribe?

Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals

Improve your learning with Skill Plans built especially for you

Get a free eBook or video every month

Mapt is fully searchable

Copy and paste, print, and bookmark content

Packt.com

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.packt.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.packt.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks. 

Contributors

About the authors

Marek Chmel is an IT consultant and trainer with more than 10 years' experience. He is a frequent speaker, focusing on Microsoft SQL Server, Azure, and security topics. Marek writes for Microsoft's TechnetCZSK blog and has been an MVP: Data Platform since 2012. He has earned numerous certifications, including MCSE: Data Management and Analytics, EC Council Certified Ethical Hacker, and several eLearnSecurity certifications.

Marek earned his MSc (business and informatics) degree from Nottingham Trent University. He started his career as a trainer for Microsoft server courses. Later, he joined AT&T as a principal database administrator specializing in MSSQL Server, Data Platform, and Machine Learning.

I would like to thank my family, my wife and son, for their understanding and support during the time spent working on this book.

 

 

 

Vladimír Mužný has been a freelance IT consultant, developer, and Microsoft data platform trainer since 2000. He is also a frequent speaker at local events in the Czech Republic and Slovakia. His favorite topics include not only MS SQL Server, but also data integration, data science, and NoSQL. During his career, Vladimír has earned credentials such as MCSE: Data Management and Analytics, MVP: Data Platform, and MCT.

Nowadays, Vladimír is a data science enthusiast and works on data migration/integration projects, some of which also produce output for machine learning models.

 

 

I wish to thank my colleague, Marek, my friend and coauthor; without him, this book would never have been written. I would also like to thank all the team members behind this book for their never-ending passion. Last but not least, I would like to say thank you to my family for their support, not just during the writing of this book, but also throughout my career.

About the reviewers

Sahaj Pathak has more than 4 years of experience in the IT industry, architecting large products. He has been involved in the design workflow for frameworks and applications. He is a strong team player, speedy and versatile, and has developed several financial web applications with cutting-edge technologies.

He has mastered all the stages of software development. He can interpret business and client requirements and has helped teams build scalable technical solutions. His expertise covers both frontend and backend technologies, such as Java, Spring, Hibernate, AngularJS, Node.js, and JavaScript.

He is also good at delivering training on the latest technologies and server-based solutions.

I would like to express my heartfelt thanks to Packt Publishing for giving me this opportunity. I would also like to thank my parents and friends for supporting me unconditionally.

 

Subhash Shah works as Head of Technology at AIMDek Technologies Pvt. Ltd. He is an experienced solutions architect with over 12 years of experience and holds a degree in Information Technology from a reputable university. He is an advocate of open source development and its use to solve critical business problems at reduced cost. His interests include microservices, data analysis, machine learning, artificial intelligence, and databases. He is an admirer of quality code and TDD. His technical skills include translating business requirements into scalable architecture, designing sustainable solutions, and project delivery. He is a co-author of MySQL 8 Administrator's Guide and Hands-On High Performance with Spring 5.

 

 

 

Packt is searching for authors like you

If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.

Table of Contents

Title Page

Copyright and Credits

Hands-On Data Science with SQL Server 2017

About Packt

Why subscribe?

Packt.com

Contributors

About the authors

About the reviewers

Packt is searching for authors like you

Preface

Who this book is for

What this book covers

To get the most out of this book

Download the example code files

Conventions used

Get in touch

Reviews

Data Science Overview

Introducing data science

Data science project life cycle

Business understanding

Getting data

Modelling and analysis

Deployment and visualization 

Final acceptance

Data science domains

Math and statistics

Visualizing the types of data

Statistics 101 

Central tendency

Skewness

Variability

Machine learning

SQL Server and machine learning

Choosing the right algorithm

Big data

SQL Server and big data

Summary

SQL Server 2017 as a Data Science Platform

Technical requirements

SQL Server evolution

What's available in the pack?

History of SQL Server

SQL Server in the cloud

Azure SQL Database

Azure SQL Data Warehouse

SQL Server Services and their use with data science

SQL Server Integration Services

SQL Server Analysis Services

Tabular Mode

Multidimensional mode

PowerPivot Mode

Querying languages

Reporting Services

Development tools for Reporting Services

Power BI Report Server

Machine Learning Services

Summary

Data Sources for Analytics

Technical requirements

Getting data from databases

Importing data from SQL Server

Importing data from other database systems

Importing flat files

Working with XML data

Working with JSON 

Retrieve data as JSON

Processing stored JSON data

External data with PolyBase

Installing and configuring

Summary

Data Transforming and Cleaning with T-SQL

Technical requirements

The need for data transformation

Database architectures for data transformations

Direct source for data analysis

Staging–target scenario

Landing–staging–target scenario

Tools eligible for data movement

Distributed queries

SQL Server Integration Services

Why should we use SSIS?

What is needed to develop an SSIS solution?

Where should SSIS be used?

Is there an alternative to SSIS?

Transforming data

Full data load

Incremental data load

The MERGE statement

CHECKSUM

Temporal tables

Denormalizing data

Relational normalization

First normal form

Second normal form

Third normal form

Need for denormalization

Ways of denormalization

Computed columns

Denormalization using joins

Using views and stored procedures

Database applications

Using views

Using stored procedures

Performance considerations

Writing correct code

Using indexes

B-tree indexes

COLUMNSTORE INDEX

Summary

Questions

Data Exploration and Statistics with T-SQL

Technical requirements

T-SQL aggregate queries

Common properties of aggregate functions

Aggregate functions

COUNT, COUNT(*), and COUNT_BIG

MIN and MAX

SUM

AVG

VAR and VARP

STDEV and STDEVP

Using groups

Using the HAVING clause

Ranking, framing, and windowing

Ranking functions

ROW_NUMBER

RANK

DENSE_RANK

NTILE

Running aggregates

Using aggregate functions in running aggregates

Using aggregate functions

Using the LEAD and LAG functions

Calculating with percentiles

The PERCENT_RANK and CUME_DIST functions

The PERCENTILE_CONT and PERCENTILE_DISC functions

Summary

Questions

Custom Aggregations on SQL Server

Technical requirements

Overview of SQLCLR

Use cases of using SQLCLR

How to work with SQLCLR

Instance and database configurations to use with SQLCLR

Creating CLR aggregations

Example goal and assignment

Skeleton of CLR aggregation

Implementing methods

Implementing custom serialization

Implementing the Init method

Implementing the Accumulate method

Implementing the Merge method

Implementing the terminate method

Deployment and testing

Limitations and performance considerations

Development issues and risks

Maintenance issues and risks

Performance issues and risks

Summary

Questions

Data Visualization

Technical requirements

Data visualization – preparation phase

Power BI Report Server

Starting with Power BI Desktop

Defining the data source

Adding visualizations to the Report

Visual interactions

Publishing reports

SQL Server Reporting Services

Adding charts to Reports

Using SQL Server Data Tools

Summary

Data Transformations with Other Tools

Technical requirements

Categorization, missing values, and normalization

Categorization

Missing values

Normalization

Z-score

Feature-scaling

Using Integration Services for data transformation

Setting up a SSIS project

Categorizing the products

Using R for data transformation

Preparing client R environment

R Syntax first steps

Working example of Z-score computed in R

Using Data Factory for data transformation

Creating Azure Data Factory

Creating simple copy data with ADF

Summary

Questions

Predictive Model Training and Evaluation

Technical requirements

Preparing SQL Server

Setting up and configuring ML services

Preparing to install our own R packages

Creating data structures

The concept of machine learning in databases

Creating physical data structures

Creating common objects

Creating objects using filestreams

Creating objects using temporal tables

Deploying, training, and evaluating a predictive model

Saving our machine learning model to filestreams

Saving a machine learning model to temporal tables

Summary

Questions

Making Predictions

Technical requirements

Reading models from a database

Reading the model from a common table

Reading the model from a temporal table

Submitting values to an external script

Submitting values into the external script

Deserializing a predictive model

Making the prediction

Using the PREDICT keyword

Making the predictive model self-training

Re-calculating a predictive model regularly

Re-calculating a predictive model asynchronously

Creating a message type

Creating a contract

Creating queues and services

Sending a request to train a new model

Consuming requests and sending responses

Testing the asynchronous solution

Summary

Questions

Getting It All Together - A Real-World Example

Technical requirements

Assignment and preparation

SQL Server

Data description

Data exploration

Exploring data using T-SQL

Exploring data using the SSIS Data Profiling Task

Exploring the SourceData.Actions table

Exploring data using R

Data transformation

Training and using predictive models for estimations

Preparing the schema for the model

Training the model

Using the rxLinMod function and finishing the model

Using the model in predictions

Summary

Questions

Next Steps with Data Science and SQL

Data science next steps

Next steps with SQL Server

Big data clusters

Machine learning

Machine learning services on Linux

Machine learning high availability

Data science in the cloud

Summary

Other Books You May Enjoy

Leave a review - let other readers know what you think

Preface

SQL Server is a relational database management system that enables you to cover end-to-end data science processes using various inbuilt services and features.

Hands-On Data Science with SQL Server 2017 starts with an overview of data science with SQL so that you understand the core tasks in data science. You will learn intermediate to advanced level concepts so that you can perform analytical tasks on data using SQL Server. The book has a unique approach, covering best practices, tasks, and challenges to test your abilities at the end of each chapter. You will explore the ins and outs of performing various key tasks, such as data collection, cleaning, manipulation, aggregations, and filtering techniques. As you make your way through the chapters, you will turn raw data into actionable insights by wrangling and extracting data from databases using T-SQL. You will get to grips with preparing and presenting data in a meaningful way using Power BI to reveal hidden patterns. In the concluding chapters, you will work with SQL Server Integration Services to transform data into a useful format, and delve into advanced examples covering machine learning concepts such as predictive analytics using real-world examples.

By the end of this book, you will be ready to handle growing amounts of data and perform everyday activities in the same way as a data science professional.

Who this book is for

Hands-On Data Science with SQL Server 2017 is intended for data scientists, data analysts, and big data professionals who want to master their skills in SQL and its applications. This book will be helpful even for beginners who want to build their career as data science professionals using the power of SQL Server 2017. Basic familiarity with the SQL language will aid your understanding of the concepts covered in this book.

What this book covers

Chapter 1, Data Science Overview, covers what the term data science means, the need for data science, the difference compared with traditional BI/DWH, and the competencies and knowledge required in order to be a data scientist.

Chapter 2, SQL Server 2017 as a Data Science Platform, explains the architecture of SQL Server from a data science perspective: in-memory OLTP for data acquisition; Integration Services as a transformation feature set; Reporting Services for visualization of input as well as output data; and, probably most importantly of all, T-SQL as a language for data exploration and transformation, and Machine Learning Services for building the models themselves.

Chapter 3, Data Sources for Analytics, covers relational databases and NoSQL concepts side-by-side as valuable sources of data with a different approach to use. It also provides an overview of technologies such as HDInsight, Apache Hadoop, and Cosmos DB, and querying against such data sources.

Chapter 4, Data Transforming and Cleaning with T-SQL, demonstrates T-SQL techniques that are useful for making data consumable and complete for further utilization in data science, along with database architectures that are useful for transform/cleansing tasks.

Chapter 5, Data Exploration and Statistics with T-SQL, takes a deep dive into T-SQL capabilities, including common grouping and aggregations, framing/windowing, running aggregates, and (if needed) features such as custom CLR aggregates (with performance considerations).

Chapter 6, Custom Aggregations on SQL Server, explains how to create your own aggregations in order to enhance core T-SQL functionality.

Chapter 7, Data Visualization, explains the importance of visualizing data to reveal hidden patterns therein, along with examples of Reporting Services, Power View, and Power BI. By way of an alternative, an overview of R/Python visualization features is also provided (as these languages will play a vital role later in the book).

Chapter 8, Data Transformations with Other Tools, explains how to use Integration Services, and possibly R or Python, to transform data into a useful format: replacing missing values, detecting mistakes in datasets, normalization and its purpose, categorization, and finally data denormalization for better analytic purposes using views.

Chapter 9, Predictive Model Training and Evaluation, concerns a wide set of predictive models (clustering, N-point Bayes machines, recommenders) and their implementations via Machine Learning Studio, R, or Python.

Chapter 10, Making Predictions, explains how to use models created, evaluated, and scored in previous chapters. We will also learn how to make the model self-learning from the predictions made.

Chapter 11, Getting It All Together – a Real-World Example, demonstrates how to use certain features to grab, transform, and analyze data for a successful data science case.

Chapter 12, Next Steps with Data Science and SQL, summarizes the main points of all the preceding chapters and draws conclusions. The chapter also provides ideas on how to continue working with data science, which trends we can probably expect in the future, and which other technologies will play strong roles in data science.

To get the most out of this book

Prior knowledge of SQL Server 2017 (Evaluation Edition) is required, as is an understanding of SQL Server Management Studio (version 17.7, the latest release at the time of writing, was used).

Download the example code files

You can download the example code files for this book from your account at www.packt.com. If you purchased this book elsewhere, you can visit www.packt.com/support and register to have the files emailed directly to you.

You can download the code files by following these steps:

Log in or register at www.packt.com.

Select the SUPPORT tab.

Click on Code Downloads & Errata.

Enter the name of the book in the Search box and follow the onscreen instructions.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR/7-Zip for Windows

Zipeg/iZip/UnRarX for Mac

7-Zip/PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Hands-On-Data-Science-with-SQL-Server-2017. In case there's an update to the code, it will be updated on the existing GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Conventions used

There are a number of text conventions used throughout this book.

CodeInText: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "The same sample query using the OPENQUERY function."

A block of code is set as follows:

SELECT * FROM REMOTESRV..Enums.Accounts

Any command-line input or output is written as follows:

sudo apt-get update
sudo apt-get install mssql-tools unixodbc-dev

Bold: Indicates a new term, an important word, or words that you see on screen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "Its correct name is added to the Local Login column."

Warnings or important notes appear like this.
Tips and tricks appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at [email protected].

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packt.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.

Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in, and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Reviews

Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!

For more information about Packt, please visit packt.com.

Data Science Overview

SQL Server is far more than a regular database management system. It's a huge ecosystem of different services that work together to handle very complex data platform management tasks. With the numerous features added in SQL Server 2016 and SQL Server 2017, the capabilities of the system have expanded again toward modern ways of working with data, such as big data, machine learning, and data science. You're about to enter a new world with this book, which will allow you to grasp data-related tasks from a different point of view and get more insight into your data.

 

Data science project life cycle

There are different data science life cycles available, which can fit different projects. We'll focus most on the Team Data Science Process (TDSP) life cycle, which is defined by Microsoft and can be applied to data science with Microsoft SQL Server. The TDSP provides a recommended life cycle that you can use to structure your data science projects. The life cycle outlines the steps, from start to finish, that projects usually follow when they are executed. 

Business understanding

When we work with a data science project, this project usually has several phases. Each data science project begins with the business problem, or identifying the question. There are key tasks addressed in this first stage of the project:

Define the goal or objective: Identify the required business variable that needs to be predicted by the analysis

Identify the data sources: Find the required data that you will use for the data science project

When you work with stakeholders to identify the key questions and project goals, you should use sharp questions that will help you identify the required data for your analysis.

If your question is, "What will my stock's sale price be next month?", then you'll need to analyze the data, which includes your stock's price history over months. Also, you will need to analyze the sales of your stock from those months. And, in a similar manner, you need to think about the business problem definition and ask specific questions that include the numbers, names, categories, and so on.

Based on the questions that you'll be trying to answer, you can also foresee the data science methods that you'll use to address such a question. Typical questions for data science projects would include the following:

How much, or how many?

Which category?

Which group?

Is this weird?

Which option should be taken?

During the first phase of the project, you're also usually building the team for the TDSP, which will consist of various members, each specializing in a particular subject that will be essential to the success of the project:

A data scientist: A highly educated and skilled person who can solve complex data problems by employing deep expertise in scientific disciplines (mathematics, statistics, or computer science).

A data professional: A skilled person who creates or maintains data systems, data solutions, or implements predictive modelling. These people will work in professions such as database administration, database development, or BI development.

A software developer: A person who designs and develops programming logic, and who can apply machine learning to integrate predictive functionality into applications.

A project leader: A project leader manages the daily activities of individual data scientists and other project contributors on a specific data science project.

Getting data

The second phase of the project is related to data identification, acquisition, and understanding. Data comes from various data sources that provide data in structured, semi-structured, and unstructured formats. The data we have on input may vary in quality and integrity, based on the data source that is used for storing the information. For the data analysis, we need to ingest the data into the target analytic environment, either on-premises or in the cloud. This can involve numerous services from Microsoft, such as SQL Server (ideally with PolyBase to access external data) or cloud services such as Azure Storage Account, HDInsight, and Azure Data Lake.

Considering that we'll load the data into Microsoft SQL Server, we need a good way to break the dataset down into individual rows and columns. Each row in the table will represent one event, instance, or item for our analysis. Each column in the table will represent an attribute of the row. Different projects will have data collected at different levels of detail, based on the available data sources and our ability to process such data.

When we talk about the initial loading of the data into SQL Server, the target is usually referred to as a staging database. Since the data can be loaded from numerous different databases and repositories, dumping all the data from the sources into a centralized repository is usually the first step before building the analytical storage. The next stage would be the data warehouse. A data warehouse is a common term for an enterprise system used for reporting and data analysis, and it is usually the core of an enterprise business intelligence solution. While a data warehouse is an enterprise-wide repository of data, extracting insights from such a huge repository might be a challenging task. We can segregate the data according to the department or the category that the data belongs to, so that we have much smaller sections of the data to work with and extract information from. These smaller portions of the data warehouse are then referred to as data marts.

Data in the source systems may need a lot of work before and during loading it into a database or other analytical storage, where we can properly analyze the data. In general, one of the many steps in data science projects is data wrangling, the process of acquiring raw data and mapping and transforming it into another format that is suitable for its end use; for us, that means data analysis. Data wrangling basically has three steps:

Getting and reading the data

Cleaning the data

Shaping and structuring the data

Reading the data sounds simple, but in the end it's a complex task in the data science project, where one part of the project is a data flow and a pipeline definition describing how to connect to the data, read the data with the proper tools, and move the data to the analytics store. This can turn into complex integration work as part of the data science project, just to interconnect various data sources and shape the data from those sources so that you can run powerful analytics on the data to get the insights. Microsoft SQL Server includes very important services, such as SQL Server Integration Services, which, together with SQL Server Data Tools, can be used for all three data wrangling steps.

Once the data is loaded into the analytical store, we need to explore and visualize the available data with the toolset available to get an idea of the structure and develop an initial understanding of the data. An initial understanding of the data can be achieved via numerous tools, but if we focus on Microsoft SQL Server, then the choices would include the SQL Server Integration Services Data Profiling Task and SQL Server Management Studio.

When you explore the data, you're looking for basic information such as this:

Is the data organized?

Are there any missing values?

What does each row represent?

What do columns represent?

Is the data stored as a categorical or a numerical feature?

Are there any transformations required?

Modelling and analysis

This part of the project might be the most creative one, since it includes numerous tasks that have to be completed to deliver the final product. The list of tasks can be very long, and may include these:

Data mining

Text analytics

Model building

Feature engineering and extraction

Model testing

Microsoft SQL Server has tools built in that can provide a delivery platform for most of these tasks. When we talk about data mining, there are several different methodologies or frameworks to follow, of which the Cross Industry Standard Process for Data Mining (CRISP-DM) has so far been the most frequently used one, according to several surveys of methodology usage. In 2015, IBM released a new methodology called Analytics Solutions Unified Method for Data Mining/Predictive Analytics, which refined and extended CRISP-DM. CRISP-DM is an open-standard process model that describes common approaches used by data-mining experts, and it's still the most widely used analytics model. CRISP-DM breaks the process of data mining into six major phases. The sequence of the phases is not strict, and projects move back and forth between different phases as required. The arrows in the process diagram indicate the most important and frequent dependencies between phases. The outer circle in the diagram symbolizes the cyclic nature of data mining itself. A data-mining process continues after a solution has been deployed. The lessons learned during the process can trigger new, often more focused business questions, and subsequent data-mining processes will benefit from the experiences of the previous ones.

The purpose of data mining is to put structured and unstructured data in relation to each other so that they can easily be interfaced, and to provide workers in the sector with a system that is easy to use. The experts in each specific area of business will therefore have access to a complex data system that is able to process information at different levels. This has the advantage of bringing to light the relationships among data, predictive analysis, assessments for specific business decisions, and much more.

Data mining can be used for solving many business problems and to prepare the data for a more advanced approach, such as machine learning, which can be used for:

Searching for anomalies

Churn analysis

Customer segmentation

Forecasting

Market basket analysis

Network intrusion detection

Targeted advertisement

Deployment and visualization 

Once you have developed a functioning data science solution or some learning model, you're ready for deployment into production. Most systems have two primary modes of operation for a data science solution: real-time operation or batch operation. In this part, we usually consider the data visualization and the proper toolset to deliver the results to our users. Tools such as Power BI or Tableau will help you bring interesting insights to your data in a visual way, which is usually best for the end users.

Final acceptance

The last step of the project is the final presentation and acceptance from the client or the customer. You'll present the insights and translate the findings into the language appropriate for your audience. In this part of the project, you'll work with the customer or the internal team, who will run and support the project once it gets into production, and you'll verify that the outcome meets the required needs.

Data science domains

Data science is linked to numerous other modern buzzwords such as big data and machine learning, but data science itself is built from numerous domains, where you can get your expertise. These domains include the following:

Statistics

Visualization

Data mining

Machine learning

Pattern recognition

Data platform operations

Artificial intelligence 

Programming

Math and statistics

Statistics and other math skills are essential in several phases of the data science project. Even at the beginning of data exploration, you'll be dividing the features of your data observations into categories:

Categorical

Numeric:

Discrete 

Continuous 

Categorical values describe the item and represent an attribute of the item. Imagine you have a dataset about cars: car brand would be a typical categorical value, and color would be another. 

On the other hand, we have numerical values, which can be split into two different categories: discrete and continuous. Discrete values describe the number of observations, such as how many people purchased a product, and so on. Continuous values have an infinite number of possible values and use real numbers for the representation. In a nutshell, discrete variables are like points plotted on a chart, whereas a continuous variable can be plotted as a line.

Another classification of the data is the measurement-level point of view. We can split data into two primary categories:

Qualitative:

Nominal

Ordinal

Quantitative:

Interval

Ratio

Nominal variables can't be ordered and only describe an attribute. An example would be the color of a product; this describes how the product looks, but you can't put any ordering scheme on the color saying that red is bigger than green, and so on. Ordinal variables describe the feature with a categorical value and provide an ordering system; for example: Education—elementary, high school, university degree, and so on.

With quantitative values, it's a different story. The major difference is that a ratio has a true zero. Imagine the attribute is a length. If the length is 0, you know there's no length. But this does not apply to temperature, since there's an interval of possible values for temperature, where 0°C or 0°F does not mean the beginning of the scale (absolute zero, the true beginning of the scale, is -273.15°C or -459.67°F). With kelvins, it would actually be a ratio type of quantitative value, since the scale really begins at 0 K. So, as you can see, any number can be an interval or a ratio value, but it depends on the context!

Visualizing the types of data

Visualizing and communicating data is incredibly important, especially with young companies that are making data-driven decisions for the first time, or companies where data scientists are viewed as people who help others make data-driven decisions. When it comes to communicating, this means describing your findings, or the way techniques work, to audiences both technical and non-technical. Different types of data have different ways of representation. When we talk about categorical values, the ideal representation visuals would be these:

Bar charts

Pie charts

Pareto diagrams 

Frequency distribution tables

A bar chart would visually represent the values stored in the frequency distribution tables. Each bar would represent one categorical value. A bar chart is also a baseline for a Pareto diagram, which includes the relative and cumulative frequency for the categorical values:

Bar chart representing the relative and cumulative frequency for the categorical values

If we add the cumulative frequency to the bar chart, we will have a Pareto diagram of the same data:

Pareto diagram representing the relative and cumulative frequency for the categorical values

Another very useful type of visualization for categorical data is the pie chart. Pie charts display the percentage of the total for each categorical value. In statistics, this is called the relative frequency. The relative frequency is the percentage of the total frequency of each category. This type of visual is commonly used for market-share representations:

Pie chart representing the market share for Volkswagen
All the values are imaginary and are used just for demonstration purposes; these numbers don't represent a real market share by different brands in Volkswagen around the world, or in any city.
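As a rough T-SQL sketch of such a frequency distribution (the dbo.cars table and its brand column are hypothetical and used only for illustration), the absolute, relative, and cumulative frequencies behind a bar chart, Pareto diagram, or pie chart could be computed like this:

-- Frequency distribution of a categorical value (hypothetical dbo.cars table)
SELECT
    brand,
    COUNT(*) AS absolute_frequency,
    -- relative frequency: each category's share of the total, as in a pie chart
    100.0 * COUNT(*) / SUM(COUNT(*)) OVER () AS relative_frequency_pct,
    -- cumulative frequency: running share of the total, as in the Pareto line
    100.0 * SUM(COUNT(*)) OVER (ORDER BY COUNT(*) DESC)
          / SUM(COUNT(*)) OVER () AS cumulative_frequency_pct
FROM dbo.cars
GROUP BY brand
ORDER BY absolute_frequency DESC;

The relative frequency column corresponds to the pie chart slices, while the cumulative frequency column is the line you would plot on top of a Pareto diagram.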

For numeric data, the ideal start would be a frequency distribution table, which will contain ordered or unordered values. Numeric data is very frequently displayed with histograms or scatter plots. When using intervals, the rule of thumb is to use 5 to 20 intervals, to have a meaningful representation of the data.

Let's create a table with 20 discrete data points, which we'll display visually. To create the table, we can use the following T-SQL script:

CREATE TABLE [dbo].[dataset]( [datapoint] [int] NOT NULL) ON [PRIMARY]

To insert new values into the table, let's use the script:

INSERT [dbo].[dataset] ([datapoint]) VALUES (7)
INSERT [dbo].[dataset] ([datapoint]) VALUES (28)
INSERT [dbo].[dataset] ([datapoint]) VALUES (50)
-- ...and so on, with more values, to have 20 values in total

The table will include numbers in the range of 0 to 300, and the content of the table can be retrieved with this:

SELECT * FROM [dbo].[dataset]
ORDER BY datapoint

To visualize a discrete values dataset, we'll need to build a histogram. The histogram will have six intervals, and the interval length can be calculated as (largest value - smallest value) / number of intervals. When we build the frequency distribution table and the intervals for the histogram, we'll end up with a table of six intervals and their frequencies.
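A minimal sketch of how this frequency distribution could be computed in T-SQL, using the dbo.dataset table created above and assuming six equal-width intervals as described:

-- Assign each value to one of six equal-width intervals and count the frequencies
WITH bounds AS
(
    SELECT MIN(datapoint) AS min_value, MAX(datapoint) AS max_value
    FROM [dbo].[dataset]
)
SELECT
    (d.datapoint - b.min_value) * 6 / (b.max_value - b.min_value + 1) AS interval_number,
    COUNT(*) AS absolute_frequency
FROM [dbo].[dataset] AS d
CROSS JOIN bounds AS b
GROUP BY (d.datapoint - b.min_value) * 6 / (b.max_value - b.min_value + 1)
ORDER BY interval_number;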

A histogram can then be drawn from the absolute frequency of the discrete values in each interval.

Statistics 101 

A good understanding of statistics is vital for a data scientist. You should be familiar with statistical tests, distributions, maximum likelihood estimators, and so on. This will also be the case for machine learning, but one of the more important aspects of your statistics knowledge will be understanding when different techniques are (or aren't) a valid approach. Statistics is important for all types of companies, especially data-driven companies where stakeholders depend on your help to make decisions and design and evaluate experiments.

Central tendency

There are three descriptive measures of central tendency: the mean, the median, and the mode, but SQL Server does not have a way to calculate anything other than the mean directly. The arithmetic mean, or simply the mean (there are more types of mean in the central tendency), is the sum of all measurements divided by the number of observations in the dataset. The median is the middle value that separates the higher half from the lower half of the dataset. The median and the mode are the only measures of central tendency that can be used for ordinal data, in which values are ranked relative to one another but are not measured absolutely. The mode is the most frequent value in the dataset. This is the only central tendency measure that can be used with nominal data, which has purely qualitative category assignments. To get such values with SQL Server, we either need to define our own assembly with a custom aggregate or use more complex T-SQL constructs to work around the missing statistical functions. Another option would be to use Python or R code; the code can run inside SQL Server, and you can pass the result set as an argument to the Python or R code to work on the descriptive statistics.
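As a small illustration of such T-SQL workarounds, the median and the mode can be obtained for the dbo.dataset table created in the previous section with the windowed PERCENTILE_CONT function and a simple grouping query:

-- Mean and median: PERCENTILE_CONT(0.5) returns the middle value as a window function
SELECT DISTINCT
    AVG(datapoint * 1.0) OVER () AS mean_value,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY datapoint) OVER () AS median_value
FROM [dbo].[dataset];

-- Mode: the most frequent value (if several values tie, only one of them is returned)
SELECT TOP (1) datapoint AS mode_value
FROM [dbo].[dataset]
GROUP BY datapoint
ORDER BY COUNT(*) DESC;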

Variability

If we would like to understand the variability of the data, there are three important measures that we can use to get a better understanding of our data set, and those include the following:

Variance

Standard deviation

Coefficient of variation

Variance measures the distance of the data points around their mean value. Variance has a central role in statistics, where some ideas that use it include descriptive statistics, statistical inference, hypothesis testing, goodness of fit, and Monte Carlo sampling. Variance is an important tool in the sciences, where statistical analysis of data is common. The variance is the square of the standard deviation, the second central moment of a distribution, and the covariance of the random variable with itself.

Standard deviation is a measure of how spread out numbers are; the larger the standard deviation, the further apart the values in the set lie. In addition to expressing the variability of a population, the standard deviation is commonly used to measure confidence in statistical conclusions. Unlike variance, the standard deviation is expressed in the same units as the data, which is very useful.

The coefficient of variation is a measure of the dispersion of probability, or a frequency distribution. It's usually a percentage value, which is defined as a ratio of the standard deviation to the mean.

SQL Server has several built-in functions that can be used to calculate variance and standard deviation. To find out the variability measures from our table, we can run the following query:

SELECT AVG(datapoint), VARP(datapoint), STDEVP(datapoint) FROM dataset

Don't get confused by those function names; for variance, there are two functions, VAR() and VARP(), and the same applies to the standard deviation, with STDEV() and STDEVP(). This is very important from a statistics perspective, depending on whether we work with the whole population or just a sample of the population.

Variance for a sample and for a population is not the same; if you check the formulas behind these functions, they have different denominators.
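For reference, VARP() and STDEVP() use the population formulas, while VAR() and STDEV() use the sample formulas, which divide by n − 1 instead of N:

Population variance: σ² = Σ(xᵢ − μ)² / N

Sample variance: s² = Σ(xᵢ − x̄)² / (n − 1)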

The full query to get the variability measures on our table, including both population and sample measures, would look like this:

SELECT
    AVG(datapoint) AS Mean,
    VAR(datapoint) AS 'Sample variation',
    VARP(datapoint) AS 'Population variation',
    STDEV(datapoint) AS 'Sample standard deviation',
    STDEVP(datapoint) AS 'Population standard deviation'
FROM dataset

In such cases, the sample variability measures will be higher than the population measures, due to a lower denominator value. In the case of a sample, where you don't have the whole population data points, the concentrations of extremes and central tendencies can be different based on the whole range of values. When using such functions, you need to know whether you're working with just a sample of the data or the whole population of the data available, so the results are not skewed and you have the proper outcomes of these functions!
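The coefficient of variation mentioned earlier has no built-in aggregate of its own, but it can be derived from the existing functions; a minimal sketch against the same table:

-- Coefficient of variation: population standard deviation as a percentage of the mean
SELECT 100.0 * STDEVP(datapoint) / AVG(datapoint * 1.0) AS coefficient_of_variation_pct
FROM dataset;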

Machine learning

A very important part of data science is machine learning. Machine learning is the science of getting computers to act without being explicitly programmed. In the past decade, machine learning has given us self-driving cars, practical speech recognition, effective web search, and a vastly improved understanding of the human genome. Machine learning is so pervasive today that you probably use it dozens of times a day without knowing it.

Big data

Big data is another modern buzzword that you can find around data management and analytics platforms. The big does not necessarily mean that the data volume is extremely large, although it usually is.

There are different imperatives linked to big data, which describe the concept. These include the following:

Volume: Volume describes the quantity of the data. There's big potential to get value and insights from large-volume datasets. The main challenge is that the datasets are so big and complex that traditional data-processing application software is inadequate to deal with them.

Variety: Data is not strictly relational anymore, and data can be stored in text files, images, and social feeds from a social network.

Velocity: While we want to have the data available in real time, the speed of data generation is challenging for regular DBMS systems and requires specialized forms of deployment and software.

Veracity: With a large number of possible data sources, the quality of the data can vary, which can affect the data analysis and insights gained from the data.

Here are some big data statistics that are interesting:

100 terabytes of data are uploaded to Facebook every day

Every hour, Walmart customers' transactions provide the company with about 2.5 petabytes of data

Twitter generates 12 terabytes of data every day

YouTube users upload eight years worth of new video content every day

SQL Server and big data

Let's face reality: SQL Server is not a big-data system. However, there's a feature in SQL Server that allows us to interact with other big-data systems that are deployed in the enterprise. This is huge!

This allows us to use the traditional relational data on the SQL Server side and combine it with the results from the big-data systems directly, or even run queries against the big-data systems from SQL Server. The answer to this problem is a technology called PolyBase.

PolyBase is a bridge between SQL Server and big-data systems such as Hadoop, which can run in numerous different configurations. You can have your own Hadoop deployment, or utilize Azure services such as HDInsight or Azure Data Lake, which are implementations of Hadoop and the HDFS filesystem from the Hadoop framework. We'll get deeper into PolyBase in Chapter 3, Data Sources for Analytics. If you would like to test-drive Hadoop with SQL Server, there are several appliances ready for testing and evaluation, such as the Hortonworks Data Platform or Cloudera.

You can download prebuilt virtual machines, which you can connect to from SQL Server with the PolyBase feature to evaluate how the big-data integration works. For Hortonworks, you can check out https://hortonworks.com/products/data-platforms/hdp/. For Cloudera QuickStart VMs, you can check out https://www.cloudera.com/downloads/quickstart_vms/5-13.html.
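Once PolyBase is installed and configured, querying Hadoop data from T-SQL looks much like querying a local table. The following is only a rough sketch with illustrative names, addresses, and columns; the actual setup steps are covered in Chapter 3, Data Sources for Analytics:

-- Point SQL Server at a Hadoop cluster (the address is illustrative)
CREATE EXTERNAL DATA SOURCE MyHadoop
WITH (TYPE = HADOOP, LOCATION = 'hdfs://192.168.1.10:8020');

-- Describe how the files stored in HDFS are formatted
CREATE EXTERNAL FILE FORMAT CsvFormat
WITH (FORMAT_TYPE = DELIMITEDTEXT, FORMAT_OPTIONS (FIELD_TERMINATOR = ','));

-- Expose an HDFS directory as an external table
CREATE EXTERNAL TABLE dbo.WebLogs
(
    LogDate DATE,
    Url NVARCHAR(400),
    Hits INT
)
WITH (LOCATION = '/weblogs/', DATA_SOURCE = MyHadoop, FILE_FORMAT = CsvFormat);

-- Query the big-data source with plain T-SQL, ready to be joined to local tables
SELECT TOP (10) Url, SUM(Hits) AS TotalHits
FROM dbo.WebLogs
GROUP BY Url
ORDER BY TotalHits DESC;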

Hadoop itself is external to SQL Server and is described as a collection of software tools for distributed storage and the processing of big data. The base Apache Hadoop framework is composed of the following modules:

Hadoop Common: Contains libraries and utilities needed by other Hadoop modules

Hadoop Distributed File System (HDFS): A distributed filesystem that stores data on commodity machines, providing very high aggregate bandwidth across the cluster

Hadoop YARN: Introduced in 2012 as a platform responsible for managing computing resources in clusters and using them for scheduling users' applications

Hadoop MapReduce: An implementation of the MapReduce programming model for large-scale data processing

Summary

Data science is a very broad term that includes numerous tasks in regard to data management and processing. In this chapter, we have covered how these individual domains can be addressed with SQL Server and how SQL Server can be used as a data science platform in the enterprise. Although SQL Server is not primarily considered to be a data science tool, there are numerous data science tasks for which SQL Server offers very mature services, such as the importing and cleaning of data, integration with big-data systems, and rich visualizations, either with Reporting Services or with Power BI Report Server.