Smart Grid using Big Data Analytics

Robert C. Qiu

Description

This book is aimed at students in communications and signal processing who want to extend their skills into the energy area. It describes power systems and why a communications and signal-processing background is so useful to the smart grid, wireless communications being very different from traditional wireline communications.


Page count: 935

Publication year: 2017




Table of Contents

Cover

Title Page

Preface

Acknowledgments

Some Notation

1 Introduction

1.1 Big Data: Basic Concepts

1.2 Data Mining with Big Data

1.3 A Mathematical Introduction to Big Data

1.4 A Mathematical Theory of Big Data

1.5 Smart Grid

1.6 Big Data and Smart Grid

1.7 Reading Guide

Bibliographical Remarks

Part I: Fundamentals of Big Data

2 The Mathematical Foundations of Big Data Systems

2.1 Big Data Analytics

2.2 Big Data: Sense, Collect, Store, and Analyze

2.3 Intelligent Algorithms

2.4 Signal Processing for Smart Grid

2.5 Monitoring and Optimization for Power Grids

2.6 Distributed Sensing and Measurement for Power Grids

2.7 Real‐time Analysis of Streaming Data

2.8 Salient Features of Big Data

2.9 Big Data for Quantum Systems

2.10 Big Data for Financial Systems

2.11 Big Data for Atmospheric Systems

2.12 Big Data for Sensing Networks

2.13 Big Data for Wireless Networks

2.14 Big Data for Transportation

Bibliographical Remarks

3 Large Random Matrices

3.1 Modeling of Large Dimensional Data as Random Matrices

3.2 A Brief Overview of Random Matrix Theory

3.3 Change of Viewpoint: From Vectors to Measures

3.4 The Stieltjes Transform of Measures

3.5 A Fundamental Result: The Marchenko–Pastur Equation

3.6 Linear Eigenvalue Statistics and Limit Laws

3.7 Central Limit Theorem for Linear Eigenvalue Statistics

3.8 Central Limit Theorem for Random Matrices

3.9 Independence for Random Matrices

3.10 Matrix‐Valued Gaussian Distribution

3.11 Matrix‐Valued Wishart Distribution

3.12 Moment Method

3.13 Stieltjes Transform Method

3.14 Concentration of the Spectral Measure for Large Random Matrices

3.15 Future Directions

Bibliographical Remarks

4 Linear Spectral Statistics of the Sample Covariance Matrix

4.1 Linear Spectral Statistics

4.2 Generalized Marchenko–Pastur Distributions

4.3 Estimation of Spectral Density Functions

4.4 Limiting Spectral Distribution of Time Series

Bibliographical Remarks

5 Large Hermitian Random Matrices and Free Random Variables

5.1 Large Economic/Financial Systems

5.2 Matrix‐Valued Probability

5.3 Wishart–Levy Free Stable Random Matrices

5.4 Basic Concepts for Free Random Variables

5.5 The Analytical Spectrum of the Wishart–Levy Random Matrix

5.6 Basic Properties of the Stieltjes Transform

5.7 Basic Theorems for the Stieltjes Transform

5.8 Free Probability for Hermitian Random Matrices

5.9 Random Vandermonde Matrix

5.10 Non‐Asymptotic Analysis of State Estimation

Bibliographical Remarks

6 Large Non‐Hermitian Random Matrices and Quaternionic Free Probability Theory

6.1 Quaternionic Free Probability Theory

6.2 R‐diagonal Matrices

6.3 The Sum of Non‐Hermitian Random Matrices

6.4 The Product of Non‐Hermitian Random Matrices

6.5 Singular Value Equivalent Models

6.6 The Power of the Non‐Hermitian Random Matrix

6.7 Power Series of Large Non‐Hermitian Random Matrices

6.8 Products of Random Ginibre Matrices

6.9 Products of Rectangular Gaussian Random Matrices

6.10 Product of Complex Wishart Matrices

6.11 Spectral Relations between Products and Powers

6.12 Products of Finite‐Size I.I.D. Gaussian Random Matrices

6.13 Lyapunov Exponents for Products of Complex Gaussian Random Matrices

6.14 Euclidean Random Matrices

6.15 Random Matrices with Independent Entries and the Circular Law

6.16 The Circular Law and Outliers

6.17 Random SVD, Single Ring Law, and Outliers

6.18 The Elliptic Law and Outliers

Bibliographical Remarks

7 The Mathematical Foundations of Data Collection

7.1 Architectures and Applications for Big Data

7.2 Covariance Matrix Estimation

7.3 Spectral Estimators for Large Random Matrices

7.4 Asymptotic Framework for Matrix Reconstruction

7.5 Optimum Shrinkage

7.6 A Shrinkage Approach to Large‐Scale Covariance Matrix Estimation

7.7 Eigenvectors of Large Sample Covariance Matrix Ensembles

7.8 A General Class of Random Matrices

Bibliographical Remarks

8 Matrix Hypothesis Testing using Large Random Matrices

8.1 Motivating Examples

8.2 Hypothesis Test of Two Alternative Random Matrices

8.3 Eigenvalue Bounds for Expectation and Variance

8.4 Concentration of Empirical Distribution Functions

8.5 Random Quadratic Forms

8.6 Log‐Determinant of Random Matrices

8.7 General MANOVA Matrices

8.8 Finite Rank Perturbations of Large Random Matrices

8.9 Hypothesis Tests for High‐Dimensional Datasets

8.10 Roy’s Largest Root Test

8.11 Optimal Tests of Hypotheses for Large Random Matrices

8.13 Hypothesis Testing for Matrix Elliptically Contoured Distributions

Bibliographical Remarks

Part II: Smart Grid

9 Applications and Requirements of Smart Grid

9.1 History

9.2 Concepts and Vision

9.3 Today’s Electric Grid

9.4 Future Smart Electrical Energy System

10 Technical Challenges for Smart Grid

10.1 The Conceptual Foundation of a Self‐Healing Power System

10.3 The Electric Power System as a Complex Adaptive System

10.4 Making the Power System a Self‐Healing Network Using Distributed Computer Agents

10.5 Distribution Grid

10.6 Cyber Security

10.7 Smart Metering Network

10.8 Communication Infrastructure for Smart Grid

10.9 Wireless Sensor Networks

Bibliographical Remarks

11 Big Data for Smart Grid

11.1 Power in Numbers: Big Data and Grid Infrastructure

11.2 Energy’s Internet: The Convergence of Big Data and the Cloud

11.3 Edge Analytics: Consumers, Electric Vehicles, and Distributed Generation

11.4 Crosscutting Themes: Big Data

11.5 Cloud Computing for Smart Grid

11.6 Data Storage, Data Access and Data Analysis

11.7 The State‐of‐the‐Art Processing Techniques of Big Data

11.8 Big Data Meets the Smart Electrical Grid

11.9 4Vs of Big Data: Volume, Variety, Value and Velocity

11.10 Cloud Computing for Big Data

11.11 Big Data for Smart Grid

11.12 Information Platforms for Smart Grid

Bibliographical Remarks

12 Grid Monitoring and State Estimation

12.1 Phasor Measurement Unit

12.2 Optimal PMU Placement

12.3 State Estimation

12.4 Basics of State Estimation

12.5 Evolution of State Estimation

12.6 Static State Estimation

12.7 Forecasting‐Aided State Estimation

12.8 Phasor Measurement Units

12.9 Distributed System State Estimation

12.10 Event‐Triggered Approaches to State Estimation

12.11 Bad Data Detection

12.12 Improved Bad Data Detection

12.13 Cyber‐Attacks

12.14 Line Outage Detection

Bibliographical Remarks

13 False Data Injection Attacks against State Estimation

13.1 State Estimation

13.2 False Data Injection Attacks

13.3 MMSE State Estimation and Generalized Likelihood Ratio Test

13.4 Sparse Recovery from Nonlinear Measurements

13.5 Real‐Time Intrusion Detection

Bibliographical Remarks

14 Demand Response

14.1 Why Engage Demand?

14.2 Optimal Real‐time Pricing Algorithms

14.3 Transportation Electrification and Vehicle‐to‐Grid Applications

14.4 Grid Storage

Bibliographical Remarks

Part III: Communications and Sensing

15 Big Data for Communications

15.1 5G and Big Data

15.2 5G Wireless Communication Networks

15.3 Massive Multiple Input, Multiple Output

15.4 Free Probability for the Capacity of the Massive MIMO Channel

15.5 Spectral Sensing for Cognitive Radio

Bibliographical Remarks

16 Big Data for Sensing

16.1 Distributed Detection and Estimation

16.2 Euclidean Random Matrix

16.3 Decentralized Computing

Appendix A: Some Basic Results on Free Probability

A.1 Non‐Commutative Probability Spaces

A.2 Distributions

A.3 Asymptotic Freeness of Large Random Matrices

A.4 Limit Theorems

A.5 R‐diagonal Random Variables

A.6 Brown Measure of R‐diagonal Random Variables

Appendix B: Matrix‐Valued Random Variables

B.1 Random Vectors and Random Matrices

B.2 Multivariate Normal Distribution

B.3 Wishart Distribution

B.4 Multivariate Linear Model

B.5 General Linear Hypothesis Testing

Bibliographical Remarks

References

Index

End User License Agreement

List of Tables

Chapter 01

Table 1.1 Comparison between classical, free, and quaternionic free probability theories.

Table 1.2 Comparison of different entropy definitions.

Chapter 03

Table 3.1 An analogy between the quantum system and the big data measurement system.

Table 3.2 Generating the Gaussian random matrix Gβ(m, n).

Table 3.3 Hermite and Laguerre ensembles.

Chapter 05

Table 5.1 Common random matrices and their moments (the entries of W are i.i.d. with zero mean and fixed variance; W is square, unless otherwise specified).

Table 5.2 Definitions of commonly encountered random matrices for convergence laws (the entries of W are i.i.d. with zero mean and fixed variance; W is square, unless otherwise specified).

Table 5.3 Table of Stieltjes, R‐ and S‐ transforms (Table 5.2 lists the definitions of the matrix notations used in this table).

Chapter 06

Table 6.1 Comparison between classical, free, and quaternionic free probability theories.

Table 6.2 A normalized Wishart‐like matrix. The random matrix Z defined in (6.89) is constructed out of random unitary matrices distributed according to the Haar measure and/or (independent) random Ginibre matrices of a given size N. The asymptotic distribution P(x) of the density of a rescaled eigenvalue of ρ is characterized by the singularity at 0, its support [a, b], the second moment M2 determining the average purity, and the mean entropy, according to which the table is ordered. Taken from [309].

Chapter 09

Table 9.1 Domains in the smart grid conceptual model.

Table 9.2 The smart grid compared with the existing grid.

Table 9.3 Evolution of the power system from a static to a dynamic infrastructure.

Chapter 10

Table 10.1 A comparison of the protection systems, smart grid, and central control system.

Table 10.2 Smart grid communication technologies.

List of Illustrations

Chapter 01

Figure 1.1 Big data, big impact: new possibilities for international development.

Figure 1.2 A big data processing framework. The research challenges form a three‐tier structure and center around the “big data mining platform” (Tier I), which focuses on low‐level data accessing and computing. Challenges on information sharing and privacy, and Big Data application domains and knowledge form Tier II, which concentrates on high‐level semantics, application‐domain knowledge, and user privacy issues. The outmost circle shows Tier III challenges on actual mining algorithms.

Figure 1.3 The square kilometer array.

Figure 1.4 The eigenvalues of a single matrix drawn from the complex Ginibre ensemble of random matrices. The dashed line is the unit circle. This numerical experiment was performed using the Julia language, http://julialang.org/ (accessed August 17, 2016).

Figure 1.5 Vision of a smart transmission grid.

Figure 1.6 Big data vision.
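The experiment of Figure 1.4, eigenvalues of a complex Ginibre matrix clustering in the unit disk (the circular law), is easy to reproduce. The caption mentions Julia; the sketch below uses Python/NumPy instead, and the matrix size N and seed are illustrative choices, not the figure's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 500  # matrix size; an arbitrary choice for this sketch

# Complex Ginibre ensemble: i.i.d. complex Gaussian entries,
# normalized so that E|G_ij|^2 = 1/N and the limiting spectrum
# fills the unit disk.
G = (rng.standard_normal((N, N)) + 1j * rng.standard_normal((N, N))) / np.sqrt(2 * N)

eigs = np.linalg.eigvals(G)

# Circular law: for large N the eigenvalues spread (almost) uniformly
# over the unit disk, with essentially none falling far outside it.
frac_inside = np.mean(np.abs(eigs) <= 1.05)
print(frac_inside)
```

Plotting `eigs` in the complex plane against the unit circle reproduces the picture described in the caption.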

Chapter 02

Figure 2.1 Smoothed density of the eigenvalues of C, where the correlation matrix C is extracted from assets of the S&P 500 during the years 1991–1996. For comparison we have plotted the density (2.10): this is the theoretical value obtained assuming that the matrix is purely random except for its highest eigenvalue (dotted line). A better fit can be obtained with a smaller parameter value (solid line), corresponding to 74% of the total variance. Inset: same plot, but including the highest eigenvalue corresponding to the “market,” which is found to be 30 times greater than b.

Figure 2.2 Plots of ρ(λ) versus λ for different parameter values.

Figure 2.3 Combined plot of ρ(λ) versus λ, obtained analytically as well as numerically, for independent, identically distributed random data sets.

Figure 2.4 Spectra of the covariance matrix C for the Student distribution (2.29), for parameter values 2, 5, 20, and 100 (thin lines from solid to dotted), calculated using the formula (2.30) and compared to the uncorrelated Wishart (thick line). One sees that the spectra tend to the Wishart distribution.

Figure 2.5 Spectra of the empirical covariance matrix S calculated from (2.30), compared to experimental data (stair lines) obtained by the Monte Carlo generation of finite matrices.

Figure 2.6 Eigenvalue spacing distribution for the monthly mean sea‐level pressure (SLP) correlation matrix. The solid curve is the GOE prediction.

Figure 2.7 Eigenvalue spacing distribution for the monthly mean wind‐stress correlation matrix. The solid curve is the GUE prediction.

Figure 2.8 The ring law for the product of non‐Hermitian random matrices with white noise only. The radii of the inner circle and the outer circle agree with (2.71).

Figure 2.9 The ring law for the product of non‐Hermitian random matrices with signal plus white noise. The radius of the inner circle is less than that of the white‐noise‐only scenario.

Chapter 03

Figure 3.1 The distribution of the eigenvalues of the sample covariance matrix, where X is a random Gaussian matrix. The blue curve is the Marchenko–Pastur law with density function fMP(x).

Figure 3.2 The distribution of the eigenvalues, where X is a random Gaussian matrix. The blue curve is the quarter circle law with its density function.
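The Marchenko–Pastur picture of Figure 3.1 can be checked numerically in a few lines. The dimensions p, n and the seed below are illustrative assumptions; with identity population covariance, the sample eigenvalues concentrate on the Marchenko–Pastur support [(1 − √c)², (1 + √c)²], c = p/n.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 400, 2000          # dimension and sample count (illustrative); c = p/n = 0.2
c = p / n

X = rng.standard_normal((p, n))
S = X @ X.T / n           # sample covariance; the true covariance is the identity

lam = np.linalg.eigvalsh(S)

# Marchenko–Pastur support edges for ratio c
a, b = (1 - np.sqrt(c)) ** 2, (1 + np.sqrt(c)) ** 2
print(lam.min(), lam.max(), a, b)
```

A histogram of `lam` against the Marchenko–Pastur density reproduces the comparison shown in Figure 3.1.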

Chapter 04

Figure 4.1 The curve (solid thin), and the sets B and supp(Fn(x)) (solid thick). In the figure, the plotted variable corresponds to the m in our notation.

Figure 4.2 Spectral density curves for sample covariance matrices. The entries Xij are i.i.d. standard Gaussian with zero mean and variance 1. In MATLAB: X=randn(N,n);

Figure 4.3 The same as Figure 4.2 except for different matrix dimensions.

Figure 4.4 The kernel density estimate is compared with the Marchenko–Pastur law.

Figure 4.5 The kernel density estimate fn(x) deviates from the limiting Marchenko–Pastur law, using the optimal bandwidth. The setting is the same as Figure 4.4 unless otherwise specified.

Figure 4.6 The same as Figure 4.5 except for a different parameter value.

Chapter 05

Figure 5.1 Spectra of the eigenvalue distributions ρ(λ) of the experimental correlation matrix, measured in a series of measurements. The underlying correlation matrix has two eigenvalues with given weights. At the critical value ((5.43)), the spectrum splits. The spectral densities are calculated analytically.

Figure 5.2 Simulated limit distribution for a uniform distribution, averaged over 700 sample matrices.

Figure 5.3 Simulated limit distribution for an unbounded pdf f(x). The distributions are averaged over 700 sample matrices.

Chapter 06

Figure 6.1 Left: complex‐valued operation for a real function in the upper complex plane. Right: quaternion‐valued operation for a complex function in the hypercomplex plane.

Figure 6.2 The sum of L non‐Hermitian random matrices, for one matrix, i.e., L = 1.

Figure 6.3 The same as Figure 6.2 except for a larger number of matrices L.

Figure 6.4 Eigenvalues for a product of L non‐Hermitian random matrices, for one matrix, i.e., L = 1.

Figure 6.5 The same as Figure 6.4 except for a larger number of matrices L.

Figure 6.6 The empirical eigenvalue density function for a product of L non‐Hermitian random matrices, for one matrix, i.e., L = 1.

Figure 6.7 The same as Figure 6.6 except for a larger number of matrices L.

Figure 6.8 The empirical eigenvalue density function for one non‐Hermitian random matrix.

Figure 6.9 The same as Figure 6.8 except for different parameter values.

Figure 6.10 The eigenvalues for one non‐Hermitian random matrix.

Figure 6.11 The same as Figure 6.10 except for different parameter values.

Figure 6.12 The eigenvalues for one non‐Hermitian random matrix, for different parameter values.

Figure 6.13 The same as Figure 6.12.

Figure 6.14 The eigenvalues of (X^L)^(1/M) for one non‐Hermitian random matrix X of given size, with aspect ratio α as specified.

Figure 6.15 The same as Figure 6.14. Four outliers.

Figure 6.16 The eigenvalues of (X^(1/M))^L for one non‐Hermitian random matrix X of given size. All other parameters are the same as Figure 6.14.

Figure 6.17 The same as Figure 6.16. Four outliers.

Figure 6.18 The eigenvalues of a geometric series of K terms, each term of the form X^L/M, for one non‐Hermitian random matrix X of given size.

Figure 6.19 The same as Figure 6.18. Four outliers.

Figure 6.20 The eigenvalues of a geometric series of K terms, each term of the form X^L/M, for one non‐Hermitian random matrix X of given size, for different parameter values.

Figure 6.21 The same as Figure 6.20 except for a different parameter value.

Figure 6.22 Product of square i.i.d. matrices.

Figure 6.23 The same as Figure 6.22.

Figure 6.24 Density plots of the logarithm of the eigenvalue density of the N × N random Green’s matrix (6.19), obtained by numerical diagonalization of 10 realizations of the matrix. The solid lines represent the borderlines of the support of the eigenvalue density following from the theory. The dashed lines show the diffusion approximation (6.22). From (a) to (d), we keep increasing Y.

Figure 6.25 The eigenvalues of a single i.i.d. random matrix with atom distribution X defined by a white Gaussian random variable with zero mean and variance one; the eigenvalues were perturbed by adding a diagonal matrix with ten diagonal entries (among them 3 and 2), corresponding to ten locations on the complex z plane. The small circles are centered at these ten locations on the complex plane, and each has a fixed radius. Five hundred Monte Carlo trials are performed to see how stable these eigenvalue locations are. We can clearly identify every corresponding eigenvalue location on the complex plane.
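The perturbation experiment of Figure 6.25, a low-rank deterministic perturbation of a normalized i.i.d. matrix producing stable outlier eigenvalues at the perturbation locations, can be sketched as follows. The matrix size, the two spike values, and the tolerance below are illustrative assumptions, not the figure's actual ten-spike setting.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
# i.i.d. matrix normalized so its bulk spectrum fills the unit disk
X = rng.standard_normal((n, n)) / np.sqrt(n)

# Rank-2 diagonal perturbation with entries of modulus > 1
# (two illustrative spike locations, not the figure's ten)
D = np.zeros((n, n))
D[0, 0], D[1, 1] = 3.0, -2.0

eigs = np.linalg.eigvals(X + D)

# Outlier eigenvalues of X + D appear near the spike locations,
# while the rest of the spectrum stays in (roughly) the unit disk.
d_plus = np.min(np.abs(eigs - 3.0))
d_minus = np.min(np.abs(eigs + 2.0))
print(d_plus, d_minus)
```

Repeating this over many trials, as the caption does, shows how little the outlier locations fluctuate.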

Figure 6.26 Parameters are the same as Figure 6.25, except for a different parameter value.

Figure 6.27 Parameters are the same as Figure 6.25, except for a different parameter value.

Figure 6.28 Parameters are the same as Figure 6.25, except that only 50 Monte Carlo trials are performed here, rather than 500.

Figure 6.29 The eigenvalues of a single i.i.d. random matrix with atom distribution X defined by a white Gaussian random variable with zero mean and variance one; the eigenvalues were perturbed by adding a deterministic matrix with four eigenvalues clustered around 2 + 2j and separated by δ, the minimum distance between two eigenvalues (their corresponding eigenvectors are random Gaussian vectors). The small circles are centered at these four eigenvalue locations on the complex plane, and each has a fixed radius. Five hundred Monte Carlo trials are performed to see how stable these eigenvalue locations are. We can clearly identify every corresponding eigenvalue location on the complex plane.

Figure 6.30 The eigenvalues of a single i.i.d. random matrix with atom distribution X defined by a white Gaussian random variable with zero mean and variance one; the eigenvalues were perturbed by adding a deterministic matrix with four eigenvalues clustered around a + jb and separated by δ, the minimum distance between two eigenvalues (their corresponding eigenvectors are random Gaussian vectors). The small circles are centered at these four eigenvalue locations on the complex plane, and each has a fixed radius. Two hundred Monte Carlo trials are performed to see how stable these eigenvalue locations are. We can clearly identify every corresponding eigenvalue location on the complex plane.

Figure 6.31 The eigenvalues of a single i.i.d. random matrix with atom distribution X defined by a white Gaussian random variable with zero mean and variance one; the eigenvalues were perturbed by adding a deterministic matrix with four eigenvalues: 2, 2 + 2j, 2j, and −2 − 2j (their corresponding eigenvectors are random Gaussian vectors). The small circles are centered at these four eigenvalue locations on the complex plane, and each has a fixed radius. Twenty Monte Carlo trials are performed to see how stable these eigenvalue locations are. We can clearly identify every corresponding eigenvalue location on the complex plane.

Figure 6.32 Eigenvalues of a perturbed random matrix. The small circles are centered at the perturbing eigenvalues, respectively, and each has a fixed radius. Twenty Monte Carlo trials are performed.

Figure 6.33 Eigenvalues of the matrix. Each entry is an i.i.d. Gaussian random variable.

Figure 6.34 Eigenvalues of the matrix. Each entry is an i.i.d. Bernoulli random variable, taking the values +1 and −1 each with probability 1/2.

Figure 6.35 Eigenvalues of the matrix. Each entry is an i.i.d. Gaussian random variable.

Figure 6.36 Eigenvalues of the matrix. Each entry is an i.i.d. Bernoulli random variable, taking the values +1 and −1 each with probability 1/2.

Figure 6.37 Eigenvalues of the matrix. Each entry is an i.i.d. Gaussian random variable.

Figure 6.38 Eigenvalues of the matrix. Each entry is an i.i.d. Bernoulli random variable, taking the values +1 and −1 each with probability 1/2.

Figure 6.39 The distribution of the eigenvalues, where Xn is a random matrix whose entries are i.i.d. Gaussian random variables. The three circles with radius 1/n^(1/4) are located at the outlier positions. Twenty Monte Carlo trials are performed.

Chapter 07

Figure 7.1 Comparison of the risk estimate using Monte Carlo (solid line) and SURE (cross). SNR = 0.5.

Figure 7.2 The same as Figure 7.1 except SNR = 1.

Figure 7.3 Theorem 7.6.1 interpreted as a projection in Hilbert space.

Figure 7.4 Theorem 7.6.1 interpreted as a tradeoff between bias and variance: shrinkage intensity 0 corresponds to the sample covariance matrix S; shrinkage intensity 1 corresponds to the shrinkage target μI. The optimal shrinkage intensity (represented by •) corresponds to the minimum expected loss combination.

Figure 7.5 Bayesian interpretation. The left sphere has center μI and radius α, and represents prior information. The right sphere has center S and radius β, and represents sample information. The distance between the sphere centers is δ. If all we knew was that the true covariance matrix lies on the left sphere, our best guess would be its center: the shrinkage target μI. If all we knew was that the true covariance matrix lies on the right sphere, our best guess would be its center: the sample covariance matrix S. Putting together both pieces of information, the true covariance matrix must lie on the circle where the two spheres intersect; therefore, our best guess is its center: the optimal linear shrinkage.
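The linear shrinkage pictured in Figures 7.4 and 7.5 is simply a convex combination of the sample covariance matrix S and the target μI. A minimal sketch, with an arbitrary (non-optimized) intensity; the Ledoit–Wolf-style formula for the optimal intensity is what the chapter develops:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 60, 50             # few samples relative to the dimension (illustrative)
X = rng.standard_normal((n, p))

S = np.cov(X, rowvar=False, bias=True)   # sample covariance matrix
mu = np.trace(S) / p                      # scale of the identity target
target = mu * np.eye(p)

rho = 0.3   # shrinkage intensity in [0, 1]; illustrative, not the optimal value
Sigma_hat = (1 - rho) * S + rho * target

# Shrinkage pulls the dispersed sample eigenvalues toward mu,
# reducing the excess spread illustrated in Figure 7.6.
spread_before = np.ptp(np.linalg.eigvalsh(S))
spread_after = np.ptp(np.linalg.eigvalsh(Sigma_hat))
print(spread_before, spread_after)
```

Since the shrunk eigenvalues are (1 − ρ)λi + ρμ, the eigenvalue spread contracts by exactly the factor 1 − ρ.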

Figure 7.6 Sample versus true eigenvalues. The solid line represents the distribution of the eigenvalues of the sample covariance matrix. Eigenvalues are sorted from largest to smallest, then plotted against their rank. In this case, the true covariance matrix is the identity, that is, the true eigenvalues are all equal to one. The distribution of true eigenvalues is plotted as a dashed horizontal line at one. Distributions are obtained in the limit as the number of observations n and the number of variables p both go to infinity with the ratio p/n converging to a finite positive limit. The four plots correspond to different values of this limit.

Figure 7.7 Projection of the first sample eigenvector onto the population eigenvectors (indexed by their associated eigenvalues).

Figure 7.8 Limiting density of sample eigenvalues, in the particular case where all the eigenvalues of the true covariance matrix are equal to one. The graph shows excess dispersion of the sample eigenvalues. The formula for this plot comes from solving the Marchenko–Pastur equation.

Figure 7.9 Comparison of the optimal linear versus nonlinear bias correction formula. The distribution of true eigenvalues H places 20% mass at 1, 40% mass at 3, and 40% mass at 10.

Figure 7.10 Percentage relative improvement in average loss (PRIAL) from applying the optimal nonlinear shrinkage formula to the sample eigenvalues. The solid line shows the PRIAL obtained by dividing the i‐th sample eigenvalue by the correction factor, as a function of sample size. The dotted line shows the PRIAL of the linear shrinkage estimator first proposed in [402]. For each sample size we ran 10 000 Monte Carlo simulations. As in Figure 7.9, we used the distribution of true eigenvalues H placing 20% mass at 1, 40% mass at 3, and 40% mass at 10.

Figure 7.11 A distributed system with a large number of sensors.

Figure 7.12 System with interfering users.

Chapter 08

Figure 8.1 Log‐likelihood functions defined in (8.230) under hypothesis ℋ0 and hypothesis ℋ1 for different Monte Carlo realizations. SNR in dB; N = 100.

Chapter 09

Figure 9.1 The future electric grid. RE: renewable energy.

Figure 9.2 Visual history of industrial revolutions: from energy to services and communication and back again to energy.

Figure 9.3 Conceptual model of the smart grid.

Figure 9.4 Projected power generation additions: 2020.

Figure 9.5 Next‐generation cost comparison.

Figure 9.6 NIST smart grid reference model.

Figure 9.7 Smart grid domains.

Figure 9.8 Smart grid pyramid.

Figure 9.9 A view of the utility information system impacted by smart grid strategies.

Figure 9.10 Using the cloud for smart‐grid applications.

Figure 9.11 Systems required to support the high penetration of distributed resources.

Chapter 10

Figure 10.1 A damage‐adaptive intelligent flight‐control system (IFCS).

Figure 10.2 How energy management systems can help to avoid blackouts.

Figure 10.3 A sample system with processors connected by communication links.

Figure 10.4 Role of demand response in electric system planning and operations.

Figure 10.5 Model of domestic energy streams.

Figure 10.6 Three step control methodology.

Figure 10.7 Smart grid detailed logical model.

Figure 10.8 Household electricity demand profile.

Figure 10.9 Distribution of network smart metering data.

Figure 10.10 Communication infrastructure for the smart grid.

Figure 10.11 Smart grid architecture increases the capacity and flexibility of the network and provides advanced sensing and control through modern communications technologies.

Chapter 11

Figure 11.1 Smart grid framework.

Chapter 12

Figure 12.1 Compensating for signal delay introduced by the antialiasing filter.

Figure 12.2 Electricity ecosystem of the future grid featuring various players and levels of interaction.

Figure 12.3 Relationship between different elements that collectively constitute the EMS/SCADA.

Chapter 14

Figure 14.1 Demand response connectivity and information flow.

Figure 14.2 Interaction of demand response, variable generation, and storage.

Figure 14.3 A simplified illustration of the wholesale electricity market formed by multiple generators and several regional retail companies. Each retailer provides electricity for a number of users. Retailers are connected to the users via local area networks which are used to announce real‐time prices to the users.

Figure 14.4 Daily residential load curve.

Figure 14.5 Subscription options of charging time zones for PEV owners and variable short‐term market energy pricing.

Chapter 15

Figure 15.1 A proposed 5G heterogeneous wireless cellular architecture.



Smart Grid using Big Data Analytics

A Random Matrix Theory Approach

 

Robert C. Qiu and Paul Antonik

This edition first published 2017
© 2017 John Wiley & Sons Ltd

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.

The right of Robert C. Qiu and Paul Antonik to be identified as the authors of this work has been asserted in accordance with law.

Registered Offices
John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA
John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK

Editorial Office
The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK

For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.

Wiley also publishes its books in a variety of electronic formats and by print‐on‐demand. Some content that appears in standard print versions of this book may not be available in other formats.

Limit of Liability/Disclaimer of Warranty
The publisher and the authors make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of fitness for a particular purpose. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for every situation. In view of ongoing research, equipment modifications, changes in governmental regulations, and the constant flow of information relating to the use of experimental reagents, equipment, and devices, the reader is urged to review and evaluate the information provided in the package insert or instructions for each chemical, piece of equipment, reagent, or device for, among other things, any changes in the instructions or indication of usage and for added warnings and precautions. The fact that an organization or website is referred to in this work as a citation and/or potential source of further information does not mean that the author or the publisher endorses the information the organization or website may provide or recommendations it may make. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. No warranty may be created or extended by any promotional statements for this work. Neither the publisher nor the author shall be liable for any damages arising herefrom.

Library of Congress Cataloging‐in‐Publication Data

Names: Qiu, Robert C., 1966– author. | Antonik, Paul, author.
Title: Smart grid using big data analytics / Robert C. Qiu, Paul Antonik.
Description: Chichester, West Sussex, United Kingdom : John Wiley & Sons, Inc., 2017. | Includes bibliographical references and index.
Identifiers: LCCN 2016042795 | ISBN 9781118494059 (cloth) | ISBN 9781118716793 (epub) | ISBN 9781118716809 (Adobe PDF)
Subjects: LCSH: Smart power grids. | Big data.
Classification: LCC TK3105 .Q25 2017 | DDC 621.310285/57–dc23
LC record available at https://lccn.loc.gov/2016042795

Cover design by Wiley
Cover image: johnason/loops7/ziggymaj/gettyimages

 

 

 

 

 

 

To Lily L. Li

Preface

When, in the fall of 2010, the first author wrote the initial draft of this book in the form of lecture notes for a smart grid course, the preface began by justifying the need for such a course. He explained at length why it was important that electrical engineers understand Smart Grid. Now, such a justification seems unnecessary. Rather, he had to justify repeatedly his own decision to cover aspects of big data in a smart grid course, in order to convince the audience and most of the time himself. Although we feel completely comfortable with this “big” decision at this point of writing, we still want to outline some points that led to that decision. The decision was motivated by our passion to pursue research in this direction. The excitement of the problems that lie at the intersection between the two topics convinced us that the time had come to study big data for smart grid, which is the integration of communications and sensing.

For big data, we have two major tasks: (i) big data modeling and (ii) big data analytics. After the book was finished, we realized that more than 90% of the contents were dedicated to these two aspects. The applications of this material are treated very lightly. We emphasize the mathematical foundation of big data, in a similar way to Qiu and Wicks’ Cognitive Networked Sensing and Big Data (Springer, 2014). Qiu, Hu, Li and Wicks’ Cognitive Radio Communication and Networking (John Wiley & Sons Ltd, 2012) complements both books. All three books are unified by matrix‐valued random variables (random matrix theory).

In choosing topics we heeded the warning of the former NYU professor K. O. Friedrichs: “It is easy to write a book if you are willing to put into it everything you know about the subject” (P. Lax, Functional Analysis, Wiley‐Interscience, 2002, p. xvii). The services provided by Google Scholar and online digital libraries completely relieved us of the burden of physically going to the library. Using the “cited by” function provided by Google Scholar, even working remotely from the office, we could put things together without difficulty. We were able to use this function to track the latest results on the subject. This book deals with the fundamentals of big data, addressing principles and applications. We view big data as a new science: a combination of information science and data science. Smart grid, communications and sensing are three applications of special interest to the authors.

This book studies the intersection of big data (Part I) with Smart Grid (Part II) and communications and sensing (Part III). Random matrix theory is treated as the unifying theme. Random matrix models provide a powerful framework for modeling numerous physical phenomena in quantum systems, financial systems, sensor networks, wireless networks, smart grid, and so forth. One goal is to outline how an audience with a signals‐and‐systems background can contribute to big data research and development (R&D). As most mathematical results are synthesized from the literature of mathematics and physics, we have tried to present them in very different ways, usually motivated by the above big data systems. Roughly speaking, a big data system means a large statistical system or “large model.” Although no claim of novel mathematical results is made, the combination of these mathematical models with these particular big data systems seems worth mentioning. Initially, we intended to write a textbook in a traditional way; however, as the project evolved, we could not resist the temptation to include many beautiful mathematical results. These results are relatively new in the statistical literature and completely novel to the engineering community. We aim to bridge the gap between big data modeling/analytics and large random matrices in a systematic manner. The references in this treatment are reasonably comprehensive and up to date (sometimes exhaustive, for example for non‐Hermitian random matrices).

Random matrices are ubiquitous [1]. The reason for this is twofold. First, they have a great degree of universality; that is, the eigenvalue properties of large matrices do not depend on the underlying statistical matrix ensemble. Second, random matrices can be viewed as noncommuting probability theory where the whole matrix is treated as an element of the probability space. Nowadays, data sets are usually organized as large matrices whose first dimension is equal to the number of degrees of freedom and the second to the number of measurements. Typical examples include financial systems, sensing systems and wireless communications systems.
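The data-matrix convention described above (rows as degrees of freedom, columns as measurements) can be illustrated with a short numerical sketch of our own; the dimensions and function name here are illustrative choices, not from the text. For a white-noise data matrix, the eigenvalues of the sample covariance concentrate, as predicted by the Marchenko–Pastur law, on a known interval determined only by the aspect ratio — an instance of the universality mentioned above.

```python
import numpy as np

def sample_covariance_spectrum(p=200, n=1000, seed=0):
    """Eigenvalues of the sample covariance of a p x n random data matrix.

    Rows are degrees of freedom, columns are measurements, following the
    data-matrix convention in the text.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((p, n))   # white-noise data matrix
    S = X @ X.T / n                   # p x p sample covariance
    return np.linalg.eigvalsh(S)

# Marchenko-Pastur law: for c = p/n, the eigenvalues concentrate on
# [(1 - sqrt(c))^2, (1 + sqrt(c))^2] as p, n -> infinity with p/n -> c.
eigs = sample_covariance_spectrum()
c = 200 / 1000
lo, hi = (1 - np.sqrt(c)) ** 2, (1 + np.sqrt(c)) ** 2
inside = np.mean((eigs > lo - 0.1) & (eigs < hi + 0.1))
```

Even at this moderate size (p = 200, n = 1000), essentially all eigenvalues fall inside the predicted support, regardless of which Gaussian seed is used — a small demonstration of the distribution-free behavior of large matrices.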

As pointed out above, random matrix theory is the foundation for many problems in smart grid and big data. We hold the belief that big data is more basic than smart grid; the latter is the applied science of the former. On the other hand, smart grid motivates big data. As a result, the close interaction between the two is a natural topic for study. During the first offering of the first author’s course on smart grid (Fall 2010), he primarily relied on journal papers on power systems. During the second offering of this course (Fall 2013), the materials mainly covered big data aspects, especially the latest results of random matrix theory. The audience were graduate students from EE and CS. He realized that without solid backgrounds in big data, the introduction of smart grid—large power systems that lead to high‐dimensional data—could be very superficial. For example, the challenges of state estimation and bad data detection are due to the high dimensionality of the resultant datasets. This issue belongs to the larger class of standard big data problems. Although, in Fall 2013, he pushed the course to the frontiers of statistics, theoretical physics, and finance, he knew that his class had difficulty in following him. He lost most students when he addressed random matrices. To do that, he had to go back to cover random vectors first. It was a very painful experience for all of them because the students were not comfortable with random vectors, which are the most important prerequisite for reading Part I of this book.

Big data is a new science with numerous applications. After we combine smart grid and big data, we are able to crystallize many standard problems and focus our efforts on the marriage of two subjects. We feel very comfortable that this combination will be extremely fruitful in the near future. In the infancy of this connection, our aim is to spell out our goals and methodologies; at the same time, we outline the mathematical foundations by introducing random matrix theory, in the hope that this mathematical theory is sufficiently general and flexible to provide a definitive machinery for the analysis of big data and smart grid. It is common to hear that big data lacks a theoretical foundation. Maybe there is no theory at all. It is the sense of the mission (to search for such a theory) that has sustained us in this long journey.

Acknowledgments

This book is the result of many years of teaching and research in the field of smart grid. This work is in part funded by the National Science Foundation through three grants (ECCS‐0901420, ECCS‐0821658, and CNS‐1247778), and the Office of Naval Research through two grants (N00010‐10‐1‐0810 and N00014‐11‐1‐0006). We want to thank Dr. Santanu K. Das (ONR) for his support for the work.

We want to thank Dr. Zhen Hu for reading through the whole manuscript. The first author wants to thank the ECE students in his smart grid courses (Fall 2010, Fall 2013) for their patience and useful feedback. The first author was working with China Power Research Institute (CPRI), Beijing, China, when the book was nearing completion. He wants to thank his host Dr. Dongxia Zhang (CPRI) and Dr. Chaoyang Zhu for their hospitality. The first author also worked for two months at Shanghai Jiaotong University. He wants to thank Professors Wenxian Yu, Xiuchen Jiang and Zhijian Jin for their hospitality and useful discussions. He also wants to thank Professors Shaoqian Li and Guangrong Yue at the University of Electronic Science and Technology of China (UESTC) for their hospitality and useful discussions.

Some Notation

⊗ : Kronecker product of two matrices
ℂ^(p×q) : matrix with p rows and q columns
⊞ and ⊠ : free additive and free multiplicative convolution (Voiculescu’s operations)
E[·] : expectation
‖A‖ : operator norm of the matrix A
‖A‖_F : Frobenius norm of the matrix A
→_d : convergence in distribution
ℂ : the set of complex numbers
E[X] : expectation of the random variable X
E[x] : expectation of the random vector x
E[X] : expectation of the random matrix X
1{·} : indicator function

1 Introduction

1.1 Big Data: Basic Concepts

Data is “unreasonably effective” [2]. Nobel laureate Eugene Wigner referred to the unreasonable effectiveness of mathematics in the natural sciences [3]. What is big data? According to [4], its sizes are on the order of terabytes or petabytes; it is often online, and it is not available from a central source. It is diverse and may be loosely structured, with a large percentage of the data missing. It is heterogeneous.

The promise of data‐driven decision‐making is now broadly recognized [5–16]. There is no clear consensus about what big data is. In fact, there have been many controversial statements about big data, such as “Size is the only thing that matters.”

Big data is a big deal [17]. The Big Data Research and Development Initiative has been launched by the US Federal government. “By improving our ability to extract knowledge and insights from large and complex collections of digital data, the initiative promises to help accelerate the pace of discovery in science and engineering, strengthen our national security, and transform teaching and learning” [17]. Universities are beginning to create new courses to prepare the next generation of “data scientists.”

The age of big data has already arrived, with global data doubling every two years. The utility industry is not the only one facing this issue (Wal‑Mart has a million customer transactions a day), but utilities have been slower to respond to the data deluge. Scaling up algorithms to massive datasets is a big challenge.

According to [18]:

A key tenet of big data is that the world and the data that describe it are constantly changing and organizations that can recognize the changes and react quickly and intelligently will have the upper hand … As the volume of data explodes, organizations will need analytic tools that are reliable, robust and capable of being automated. At the same time, the analytics, algorithms, and user interfaces they employ will need to facilitate interactions with the people who work with the tools.

1.1.1 Big Data—Big Picture

Data is a strategic resource, together with natural resources and human resources. Data is king! “Big data” refers to a technology phenomenon that has arisen since the late 1980s [19]. As computers have improved, their growing storage and processing capacities have provided new and powerful ways to gain insight into the world by sifting through enormous quantities of data available. But this insight, discoverable in previously unseen patterns and trends within these phenomenally large data sets, can be hard to detect without new analytic tools that can comb through the information and highlight points of interest.

Sources such as online or mobile financial transactions, social media traffic, and GPS coordinates now generate over 2.5 quintillion bytes of so‐called “big data” every day. The growth of mobile data traffic from subscribers in emerging markets exceeded 100% annually through 2015. There are new possibilities for international development (see Figure 1.1).

Figure 1.1 Big data, big impact: new possibilities for international development.

Source: Reproduced from [6] with permission from the World Economic Forum.

Big data at the societal level provides a powerful microscope, together with social mining: the ability to discover knowledge from these data. Scientific research is being revolutionized by this, and policy making is next in line, because big data and social mining provide novel means for measuring and monitoring wellbeing in our society more realistically than GDP alone: more precisely, continuously, and everywhere [20].

Most scientific disciplines are finding the data deluge to be extremely challenging, and tremendous opportunities can be realized if we can better organize and access the data [16].

Chris Anderson argued that the data deluge makes the scientific method obsolete [21]: with petabytes of data, correlation is enough, and there is no need to find models. Correlation replaces causality. It remains to be seen whether data growth will lead to a fundamental change in scientific methods.

In the computing industry we are now focussing on how to process big data [22].

A fundamental question is “What is the unifying theory for big data?” This book adopts the viewpoint that big data is a new science combining data science and information science. Specialists in different fields deal with big data on their own, while information experts play a secondary role as assistants. In other words, most scientific problems are in the hands of specialists, and only a few problems (those common to all fields) are refined by computing experts. As more and more problems open up, unifying challenges common to all fields will arise. Big data from the Internet may receive more attention first, but big data from physical systems will become more and more important.

Big data will form a unique discipline that requires expertise from mathematics, statistics and computing algorithms.

Following the excellent review in [22], we highlight some challenges for big data:

Processing unstructured and semistructured data.

Presently 85% of the data are unstructured or semistructured. Traditional relational databases cannot handle these massive datasets. High scalability is the most important requirement for big‐data analysis. MapReduce and Hadoop are two nonrelational data analysis technologies.

Novel approaches for data representation.

Current data representations cannot visually express the true essence of the data. If the raw data are labeled, the problem is much easier, but customers do not approve of labeling.

Data fusion.

The true value of big data cannot exhibit itself without data fusion. The data deluge on the Internet has something to do with data formats. One critical challenge is whether we can conveniently fuse the data from individuals, industry and government. It is preferable that data formats be platform free.

Redundancy reduction and high‐efficiency, low‐cost data storage.

Redundancy reduction is important for cost reduction.

Analytical tools and development environments that are suitable for a variety of fields.

Computing algorithm researchers and people from different disciplines are encouraged to work together closely as a team. There are enormous barriers for people from different disciplines to share data. Data collection, especially simultaneous collection for relational data, is still very challenging.

Novel approaches to save energy for data processing, data storage, and communication.
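One of the items above names MapReduce and Hadoop as the scalable, nonrelational analysis technologies. As a toy in-process sketch of that programming model (our own illustration, not Hadoop's actual API), the classic word count splits into a map phase emitting key-value pairs, a shuffle that groups values by key, and a reduce phase that aggregates each group:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield word, 1

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework would do
    across the cluster before the reducers run."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the values for each key."""
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data smart grid", "big data analytics"]
counts = reduce_phase(shuffle(map_phase(docs)))
# counts["big"] == 2 and counts["grid"] == 1
```

The scalability the text emphasizes comes from the fact that map calls are independent per document and reduce calls are independent per key, so both phases parallelize across machines without shared state.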

1.1.2 DARPA’s XDATA Program

The Defense Advanced Research Projects Agency’s (DARPA’s) XDATA program seeks to develop computational techniques and software tools for analyzing large volumes of data, both semistructured (e.g., tabular, relational, categorical, metadata) and unstructured (e.g., text documents, message traffic). Central challenges to be addressed include (i) developing scalable algorithms for processing imperfect data in distributed data stores, and (ii) creating effective human–computer interaction tools to facilitate rapidly customizable visual reasoning for diverse missions.

Data continues to be generated and digitally archived at increasing rates, resulting in vast databases available for search and analysis. Access to these databases has generated new insights through data‐driven methods in the commercial, science, and computing sectors [23]. The defense sector is “swimming in sensors and drowning in data.” Big data arises from the Internet and from the monitoring of industrial equipment; sensor networks and the Internet of Things (IoT) are two further drivers.

There is a trend toward data that can be seen only once, for milliseconds, or that can be stored only briefly before being deleted, especially in some defense applications. This trend is accelerated by the proliferation of digital devices and the Internet. It is important to develop fast, scalable, and efficient methods for processing and visualizing data.

The XDATA program’s technology development is approached through four technical areas (TAs):

TA1: Scalable analytics and data‐processing technology;

TA2: Visual user interface technology;

TA3: Research software integration;

TA4: Evaluation.

It is useful to consider distributed computing via architectures like MapReduce, and its open source implementation, Hadoop. Data collected by the Department of Defense (DoD) are particularly difficult to deal with, including missing data, missing connections between data, incomplete data, corrupted data, data of variable size and type, and so forth [23]. We need to develop analytical principles and implementations scalable to data volume and distributed computer architectures. The challenge for Technical Area 1 is how to enable systematic use of big data in the following list of topic areas:

Methods for leveraging the problem structure to create new algorithms to achieve optimal tradeoffs among time complexity, space complexity, and stream complexity (i.e., how many passes over the data are needed).

Methods for the propagation of uncertainty (i.e., every query should have an answer and an error bar), with performance guarantees for loss of precision due to approximations.

Methods for measuring nonlinear relationships among data.

Sampling and estimation techniques for distributed platforms, including compensating for missing information, corrupted information, and incomplete information.

Methods for distributed dimensionality reduction, matrix factorization, matrix completion (within a distributed data store where data are not all in one place).

Methods for operating on streaming data feeds.

Methods for determining optimal cloud configurations and resource allocation with asymmetric components (e.g., many standard machines, a small number of large‐memory machines, machines with graphical processing units).
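Several items in the list above (stream complexity, operating on streaming feeds) hinge on single-pass algorithms that never revisit the data. As a minimal sketch, here is Welford's one-pass update for mean and variance; the class name and the example stream are our own illustration, not part of the XDATA program:

```python
class StreamingStats:
    """Mean and variance in a single pass with O(1) memory.

    Welford's update is numerically stable and never stores the data,
    so it suits feeds that can be seen only once.
    """
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0   # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        """Sample variance of everything seen so far."""
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

stats = StreamingStats()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(x)
# mean is 5.0; sample variance is 32/7, about 4.571
```

Because each update is O(1), the same accumulator can run independently on each shard of a distributed store and the partial results can be merged, which is exactly the tradeoff between time, space, and stream complexity the solicitation asks for.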

The challenge for Technical Area 2 is how to hook up big data analytics to interfaces, including but not limited to the following topics:

Visualization of data for scientific discovery, activity patterns, and summaries.

Expressive visualization and/or query languages and processing that support domain‐specific interaction, successive query refinement, repeated viewing of data, faceted search, multidimensional queries, and collaborative/interactive search.

Principled design, including menus, query boxes, hover tips, invalid action notifications, layout logic, as well as processes of overview, zoom and filter, and details‐on‐demand.

Support for the study and characterization of users, including extraction of relations and history, usage, hover time, click rate, dwell, etc.

Functions of timeliness, online versus batch processing, metainformation, etc.

Analytical workflows including data cleaning and intermediate processing.

Tools for rapid domain‐specific end‐user customization.

1.1.3 National Science Foundation

The phrase “big data” in the National Science Foundation (NSF) refers to large, diverse, complex, longitudinal, and/or distributed data sets generated from instruments, sensors, Internet transactions, e‐mail, video, click streams, and/or all other digital sources available today and in the future [5].

Today, US government agencies recognize that the scientific, biomedical and engineering research communities are undergoing a profound transformation with the use of large‐scale, diverse, and high‐resolution data sets that allow for data‐intensive decision making, including clinical decision making, at a level never before imagined. New statistical and mathematical algorithms, prediction techniques, and modeling methods, as well as multidisciplinary approaches to data collection, data analysis and new technologies for sharing data and information are enabling a paradigm shift in scientific and biomedical investigation. Advances in machine learning, data mining, and visualization are enabling new ways of extracting useful information in a timely fashion from massive data sets, which complement and extend existing methods of hypothesis testing and statistical inference. As a result, a number of agencies are developing big‐data strategies to align with their missions. The NSF’s solicitation focuses on common interests in big data research across the National Institutes of Health (NIH) and the NSF.

1.1.4 Challenges and Opportunities with Big Data

There are challenges with Big Data. The first step is data acquisition. Some data sources, such as sensor networks, can produce staggering amounts of raw data. A lot of this data is not of interest. It can be filtered out and compressed by orders of magnitude. One challenge is to define these filters in such a way that they do not discard useful information.
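One simple family of such acquisition filters is a dead-band filter, which forwards a sensor reading only when it moves beyond a tolerance of the last transmitted value. The sketch below is our own illustration; the readings and tolerance are hypothetical:

```python
def deadband_filter(readings, tolerance):
    """Keep a reading only if it differs from the last kept value by
    more than `tolerance` -- compressing a slowly varying sensor stream
    by orders of magnitude while preserving every large change."""
    kept = []
    last = None
    for r in readings:
        if last is None or abs(r - last) > tolerance:
            kept.append(r)
            last = r
    return kept

raw = [20.0, 20.1, 20.05, 22.0, 22.1, 25.0, 24.9]
compressed = deadband_filter(raw, tolerance=0.5)
# compressed == [20.0, 22.0, 25.0]
```

The challenge stated above shows up directly in the choice of `tolerance`: too small and little is filtered, too large and genuinely useful transients are discarded.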

The second big challenge is to generate the right metadata automatically, and to describe what data is recorded and how it is recorded and measured. This metadata is likely to be crucial to downstream analysis. Frequently, the information collected will not be in a format ready for analysis. We have to deal with erroneous data: some news reports are inaccurate.

Data analysis is considerably more challenging than simply locating, identifying, understanding, and citing data. For effective large‐scale analysis all of this has to happen in a completely automated manner.

Mining requires integrated, cleaned, trustworthy, and efficiently accessible data, declarative query and mining interfaces, scalable mining algorithms, and big data computing environments. Today’s analysts are impeded by a tedious process of exporting data from the database, performing a non‐SQL process, and bringing the data back.