Applied Modeling Techniques and Data Analysis 2
Description

BIG DATA, ARTIFICIAL INTELLIGENCE AND DATA ANALYSIS SET Coordinated by Jacques Janssen

Data analysis is a scientific field that continues to grow enormously, most notably over the last few decades, following rapid growth within the tech industry, as well as the wide applicability of computational techniques alongside new advances in analytic tools. Modeling enables data analysts to identify relationships, make predictions, and to understand, interpret and visualize the extracted information more strategically.

This book includes the most recent advances on this topic, meeting increasing demand from wide circles of the scientific community. Applied Modeling Techniques and Data Analysis 2 is a collective work by a number of leading scientists, analysts, engineers, mathematicians and statisticians, working at the forefront of data analysis and modeling applications. The chapters cover a cross section of current concerns and research interests in the above scientific areas. The collected material is divided into appropriate sections to provide the reader with both theoretical and applied information on data analysis methods, models and techniques, along with appropriate applications.

Page count: 364

Year of publication: 2021


Table of Contents

Cover

Title Page

Copyright

Preface

PART 1: Financial and Demographic Modeling Techniques

1 Data Mining Application Issues in the Taxpayer Selection Process

1.1. Introduction

1.2. Materials and methods

1.3. Results

1.4. Discussion

1.5. Conclusion

1.6. References

2 Asymptotics of Implied Volatility in the Gatheral Double Stochastic Volatility Model

2.1. Introduction

2.2. The results

2.3. Proofs

2.4. References

3 New Dividend Strategies

3.1. Introduction

3.2. Model 1

3.3. Model 2

3.4. Conclusion and further results

3.5. Acknowledgments

3.6. References

4 Introduction of Reserves in Self-adjusting Steering of Parameters of a Pay-As-You-Go Pension Plan

4.1. Introduction

4.2. The pension system

4.3. Theoretical framework of the Musgrave rule

4.4. Transformation of the retirement fund

4.5. Conclusion

4.6. References

5 Forecasting Stochastic Volatility for Exchange Rates using EWMA

5.1. Introduction

5.2. Data

5.3. Empirical model

5.4. Exchange rate volatility forecasting

5.5. Conclusion

5.6. Acknowledgments

5.7. References

6 An Arbitrage-free Large Market Model for Forward Spread Curves

6.1. Introduction and background

6.2. Construction of a market with infinitely many assets

6.3. Existence, uniqueness and non-negativity

6.4. Conclusion and future works

6.5. References

7 Estimating the Healthy Life Expectancy (HLE) in the Far Past: The Case of Sweden (1751-2016) with Forecasts to 2060

7.1. Life expectancy and healthy life expectancy estimates

7.2. The logistic model

7.3. The HALE estimates and our direct calculations

7.4. Conclusion

7.5. References

8 Vaccination Coverage Against Seasonal Influenza of Workers in the Primary Health Care Units in the Prefecture of Chania

8.1. Introduction

8.2. Material and method

8.3. Results

8.4. Discussion

8.5. References

9 Some Remarks on the Coronavirus Pandemic in Europe

9.1. Introduction

9.2. Background

9.3. Materials and analyses

9.4. The first phase of the pandemic

9.5. Concluding remarks

9.6. References

PART 2: Applied Stochastic and Statistical Models and Methods

10 The Double Flexible Dirichlet: A Structured Mixture Model for Compositional Data

10.1. Introduction

10.2. The double flexible Dirichlet distribution

10.3. Computational and estimation issues

10.4. References

11 Quantization of Transformed Lévy Measures

11.1. Introduction

11.2. Estimation strategy

11.3. Estimation of masses and the atoms

11.4. Simulation results

11.5. Conclusion

11.6. References

12 A Flexible Mixture Regression Model for Bounded Multivariate Responses

12.1. Introduction

12.2. Flexible Dirichlet regression model

12.3. Inferential issues

12.4. Simulation studies

12.5. Discussion

12.6. References

13 On Asymptotic Structure of the Critical Galton-Watson Branching Processes with Infinite Variance and Allowing Immigration

13.1. Introduction

13.2. Invariant measures of GW process

13.3. Invariant measures of GWPI

13.4. Conclusion

13.5. References

14 Properties of the Extreme Points of the Joint Eigenvalue Probability Density Function of the Wishart Matrix

14.1. Introduction

14.2. Background

14.3. Polynomial factorization of the Vandermonde and Wishart matrices

14.4. Matrix norm of the Vandermonde and Wishart matrices

14.5. Condition number of the Vandermonde and Wishart matrices

14.6. Conclusion

14.7. Acknowledgments

14.8. References

15 Forecast Uncertainty of the Weighted TAR Predictor

15.1. Introduction

15.2. SETAR predictors and bootstrap prediction intervals

15.3. Monte Carlo simulation

15.4. References

16 Revisiting Transitions Between Superstatistics

16.1. Introduction

16.2. From superstatistic to transition between superstatistics

16.3. Transition confirmation

16.4. Beck’s transition model

16.5. Conclusion

16.6. Acknowledgments

16.7. References

17 Research on Retrial Queue with Two-Way Communication in a Diffusion Environment

17.1. Introduction

17.2. Mathematical model

17.3. Asymptotic average characteristics

17.4. Deviation of the number of applications in the system

17.5. Probability distribution density of device states

17.6. Conclusion

17.7. References

List of Authors

Index

End User License Agreement

List of Tables

Chapter 1

Table 1.1. Tax claim, interesting and not interesting taxpayers

Table 1.2. Number of coercive procedures per tax claim interval

Table 1.3. Predicted values versus actual coercive procedures

Table 1.4. Predicted coercive procedures versus actual interesting taxpayers

Table 1.5. The most significant results of the models

Chapter 4

Table 4.1. Distribution of the workforce between the categories

Table 4.2. Replacement rates and ratio between benefits and contributions

Table 4.3. Replacement rates and contribution ratio in the new system

Chapter 5

Table 5.1. Descriptive statistics of raw data

Table 5.2. Descriptive statistics of logarithmic returns

Table 5.3. Errors (RMSE and MAPE) for different decay factors λi and out-of-sam...

Chapter 7

Table 7.1. Logistic model parameters and estimates

Table 7.2. HALE and healthy life expectancy direct estimates and logistic fit

Chapter 8

Table 8.1. Age and gender distribution in terms of HC/LHU inside/outside the ci...

Table 8.2. Professional characteristics regarding HC/LHU inside/outside the cit...

Table 8.3. % frequency and 95% CI of vaccinations in total and by gender, age o...

Table 8.4. Frequency of vaccinations and 95% CI between HC/LHU inside/outside t...

Table 8.5. Breakdown by type of staff of impulses and preventions of vaccinatio...

Chapter 9

Table 9.1. COVID-19 suspected case criteria (adapted from the WHO: https://www....

Chapter 10

Table 10.1. Mean Vectors stratified by cluster. μkj refers to the j-th element ...

Table 10.2. Mean of 500 initializations of (α, τ) in different parameter config...

Table 10.3. Parameter configurations for all the DFD simulations

Table 10.4. Results for the simulation study regarding the initialization proce...

Table 10.5. ID4 – Simulation results

Chapter 12

Table 12.1. Posterior means and CIs of unknown parameters together with WAIC ba...

Table 12.2. Posterior means and CIs of unknown parameters together with WAIC ba...

Table 12.3. Posterior means and CIs of unknown parameters together with WAIC ba...

Table 12.4. Posterior means and CIs of unknown parameters together with WAIC ba...

Table 12.5. Simulation study2: posterior means and CIs of unknown parameters to...

Table 12.6. Simulation study 3: posterior means and CIs of unknown parameters t...

Chapter 14

Table 14.1. For different points on a three-dimensional sphere and the square o...

Table 14.2. Comparison of the value of the Vandermonde determinant (∣X∣) and th...

Chapter 15

Table 15.1. Evaluation of the Pi’s of the weighted predictor for the models M1,...

Table 15.2. Skewness-adjusted (Grabowski et al. 2020) PI’s of Li (2011) and Sta...

Guide

Cover

Table of Contents

Title Page

Copyright

Preface

Begin Reading

List of Authors

Index

End User License Agreement


Big Data, Artificial Intelligence and Data Analysis Set

coordinated by

Jacques Janssen

Volume 8

Applied Modeling Techniques and Data Analysis 2

Financial, Demographic, Stochastic and Statistical Models and Methods

Edited by

Yannis Dimotikalis

Alex Karagrigoriou

Christina Parpoula

Christos H. Skiadas

First published 2021 in Great Britain and the United States by ISTE Ltd and John Wiley & Sons, Inc.

Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the case of reprographic reproduction in accordance with the terms and licenses issued by the CLA. Enquiries concerning reproduction outside these terms should be sent to the publishers at the undermentioned address:

ISTE Ltd

27-37 St George’s Road

London SW19 4EU

UK

www.iste.co.uk

John Wiley & Sons, Inc.

111 River Street

Hoboken, NJ 07030

USA

www.wiley.com

© ISTE Ltd 2021

The rights of Yannis Dimotikalis, Alex Karagrigoriou, Christina Parpoula and Christos H. Skiadas to be identified as the authors of this work have been asserted by them in accordance with the Copyright, Designs and Patents Act 1988.

Library of Congress Control Number: 2020951002

British Library Cataloguing-in-Publication Data

A CIP record for this book is available from the British Library

ISBN 978-1-78630-674-6

Preface

Data analysis as an area of importance has grown exponentially, especially during the past couple of decades. This can be attributed to the rapidly growing technology industry and the wide applicability of computational techniques, in conjunction with new advances in analytic tools. Modeling enables analysts to apply various statistical models to the data they are investigating, to identify relationships between variables, to make predictions about future sets of data, and to understand, interpret and visualize the extracted information more strategically. Many new research results have recently been developed and published, and many more are in progress at the present time. The topic is also widely presented at many international scientific conferences and workshops. This being the case, the need for literature that addresses it is self-evident. This book includes the most recent advances on the topic. As a result, on one hand, it unifies in a single volume new theoretical and methodological issues and, on the other, introduces new directions in the field of applied data analysis and modeling, which are expected to further expand the applicability of data analysis methods and modeling techniques.

This book is a collective work by a number of leading scientists, analysts, engineers, mathematicians and statisticians, who have been working at the forefront of data analysis. The chapters included in this collective volume represent a cross-section of current concerns and research interests in the above-mentioned scientific areas. This volume is divided into two parts with a total of 17 chapters, in a form that provides the reader with both theoretical and applied information on data analysis methods, models and techniques, along with appropriate applications.

Part 1 focuses on financial and demographic modeling techniques and includes nine chapters: Chapter 1, “Data Mining Application Issues in the Taxpayer Selection Process”, by Mauro Barone, Stefano Pisani and Andrea Spingola; Chapter 2, “Asymptotics of Implied Volatility in the Gatheral Double Stochastic Volatility Model”, by Mohammed Albuhayri, Anatoliy Malyarenko, Sergei Silvestrov, Ying Ni, Christopher Engström, Finnan Tewolde and Jiahui Zhang; Chapter 3, “New Dividend Strategies”, by Ekaterina Bulinskaya; Chapter 4, “Introduction of Reserves in Self-adjusting Steering of Parameters of a Pay-As-You-Go Pension Plan”, by Keivan Diakite, Abderrahim Oulidi and Pierre Devolder; Chapter 5, “Forecasting Stochastic Volatility for Exchange Rates using EWMA”, by Jean-Paul Murara, Anatoliy Malyarenko, Milica Rancic and Sergei Silvestrov; Chapter 6, “An Arbitrage-free Large Market Model for Forward Spread Curves”, by Hossein Nohrouzian, Ying Ni and Anatoliy Malyarenko; Chapter 7, “Estimating the Healthy Life Expectancy (HLE) in the Far Past: The Case of Sweden (1751-2016) with Forecasts to 2060”, by Christos H. Skiadas and Charilaos Skiadas; Chapter 8, “Vaccination Coverage Against Seasonal Influenza of Workers in the Primary Health Care Units in the Prefecture of Chania”, by Aggeliki Maragkaki and George Matalliotakis; Chapter 9, “Some Remarks on the Coronavirus Pandemic in Europe”, by Konstantinos N. Zafeiris and Marianna Koukli.

Part 2 covers the area of applied stochastic and statistical models and methods and comprises eight chapters: Chapter 10, “The Double Flexible Dirichlet: A Structured Mixture Model for Compositional Data”, by Roberto Ascari, Sonia Migliorati and Andrea Ongaro; Chapter 11, “Quantization of Transformed Lévy Measures”, by Mark Anthony Caruana; Chapter 12, “A Flexible Mixture Regression Model for Bounded Multivariate Responses”, by Agnese M. Di Brisco and Sonia Migliorati; Chapter 13, “On Asymptotic Structure of the Critical Galton-Watson Branching Processes with Infinite Variance and Allowing Immigration”, by Azam A. Imomov and Erkin E. Tukhtaev; Chapter 14, “Properties of the Extreme Points of the Joint Eigenvalue Probability Density Function of the Wishart Matrix”, by Asaph Keikara Muhumuza, Karl Lundengård, Sergei Silvestrov, John Magero Mango and Godwin Kakuba; Chapter 15, “Forecast Uncertainty of the Weighted TAR Predictor”, by Francesco Giordano and Marcella Niglio; Chapter 16, “Revisiting Transitions Between Superstatistics”, by Petr Jizba and Martin Prokš; Chapter 17, “Research on Retrial Queue with Two-Way Communication in a Diffusion Environment”, by Viacheslav Vavilov.

We wish to thank all the authors for their insights and excellent contributions to this book. We would like to acknowledge the assistance of all those involved in the reviewing process of this book, without whose support this could not have been successfully completed. Finally, we wish to express our thanks to the secretariat and, of course, the publishers. It was a great pleasure to work with them in bringing to life this collective volume.

Yannis DIMOTIKALIS

Crete, Greece

Alex KARAGRIGORIOU

Samos, Greece

Christina PARPOULA

Athens, Greece

Christos H. SKIADAS

Athens, Greece

December 2020

PART 1Financial and Demographic Modeling Techniques

1Data Mining Application Issues in the Taxpayer Selection Process

This chapter provides a data analysis framework designed to build an effective learning scheme aimed at improving the Italian Revenue Agency’s ability to identify non-compliant taxpayers, with special regard to self-employed individuals allowed to keep simplified registers. Our procedure involves building two C4.5 decision trees, both trained and validated on a sample of 8,000 audited taxpayers, but predicting two different class values based on two different predictive attribute sets. That is, the first model is built in order to identify the most likely non-compliant taxpayers, while the second identifies the ones that are less likely to pay the additional due tax bill. This twofold selection process target is needed in order to maximize the overall audit effectiveness. Once both models are in place, the taxpayer selection process will be carried out in such a way that businesses will only be audited if they are judged as worthy by both models. This methodology will soon be validated on real cases: that is, a sample of taxpayers will be selected according to the classification criteria developed in this chapter and will subsequently be involved in some audit processes.

1.1. Introduction

Fraud detection systems are designed to automate and help reduce the manual parts of a screening/checking process (Phua et al. 2005). Data mining plays an important role in fraud detection, as it is often applied to extract fraudulent behavior profiles hidden behind large quantities of data and may thus be useful in decision support systems for planning effective audit strategies. Indeed, huge amounts of resources (to put it bluntly, money) may be recovered from well-targeted audits. This explains the increasing interest and investment of both governments and fiscal agencies in intelligent systems for audit planning. The Italian Revenue Agency (hereafter, IRA) itself has been studying data mining application techniques in order to detect tax evasion, focusing, for instance, on the tax credit system supposed to support investments in disadvantaged areas (de Sisti and Pisani 2007), on fraud related to credit mechanisms with regard to value-added tax – a tax levied on the price of a product or service at each stage of production, distribution or sale to the end consumer, except where the end consumer is a business, which can reclaim the input tax (Basta et al. 2009) – and on income indicator audits (Barone et al. 2017).

This chapter contributes to the empirical literature on the development of classification models applied to the tax evasion field, presenting a case study that focuses on a dataset of 8,000 audited taxpayers for the fiscal year 2012, each of them described by a set of features concerning, among others, their tax returns, their properties and their tax notice.

In this context, all the taxpayers are in some way “unfaithful”, since all of them have received a tax notice that somehow rectified the tax return they had filed. Thus, the predictive analysis tool we develop is designed to find patterns in data that may help tax offices recognize only the riskiest taxpayers’ profiles.

Evidence on the data at hand shows that our first model, which is described in detail later, is able to distinguish the taxpayers who are worthy of closer investigation from those who are not.

However, by defining the class value as a function of the higher due taxes, we satisfy the need of focusing on the taxpayers who are more likely to be “significant” tax evaders, but we do not ensure an efficient collection of their tax debt. Indeed, data shows that as the tax bill increases, the number of coercive collection procedures put in place also increases. Unfortunately, these procedures are highly inefficient, as they are able to only collect about 5% of the overall credits claimed against the audited taxpayers (Italian Court of Auditors 2016). As a result, the tax authorities’ ability to collect the due taxes may be jeopardized.

Further analysis is thus devoted to finding a way to discover, among the “significant” evaders, the most solvent ones. We recall that the 2018–2020 Agreement between the IRA and the Ministry of Finance states that audit effectiveness is measured, among other indicators, by one that is simply equal to the sum of the collected due taxes, and which summarizes the effectiveness of the IRA’s efforts to tackle tax evasion (Ministry of Economy and Finance – IRA Agreement for 2018–2020, 2018). This is a reasonable indicator because the ordinary activities undertaken in the fight against tax evasion are crucial from the State budget point of view: public expenditures (i.e. public services) strictly depend on the amount of public revenue. Of course, fraud and other incorrect fiscal behaviors may be tackled even when no tax collection is guaranteed, in order to reach maximum tax compliance. Such extra activities may also be jointly conducted with the Finance Guard or the Public Prosecutor if tax offenses arise.

Therefore, to tackle our second problem, i.e. to guarantee a certain degree of due tax collection, we start from a trivial fact: a taxpayer with no properties will not be willing to pay his dues, whereas if he has something to lose (a home or a car that could be seized), then, provided the IRA’s claim is right, it is more probable that he will reach an agreement with the tax authorities.

Therefore, a second model is built, focusing only on a few features indicating whether the taxpayer owned some kind of assets or not, in order to predict each tax notice’s final status (in this case, we only distinguish between statuses ending with an enforced recovery proceeding and statuses where no such proceeding takes place). Once both models are available, the taxpayer selection process is carried out in such a way that businesses will only be audited if they are judged as worthy by both models.

The key feature of our procedure is the twofold selection process target, needed to maximize the effectiveness of the IRA’s audit processes. The methodology we suggest will soon be validated on real cases, i.e. a sample of taxpayers will be selected according to the classification criteria developed in this chapter and will subsequently be involved in some audit processes.

1.2. Materials and methods

1.2.1. Data

The data at hand refers to a sample of 8,028 audited self-employed individuals for fiscal year 2012, each described by a set of features concerning, among others, their tax returns, their properties and their tax notice.

Just for descriptive purposes, we can depict the statistical distribution of the revenues achieved by the businesses in our sample, grouped in classes (in thousands of euros), in Figure 1.1.

Most of our dataset is made up of small-sized taxpayers: almost 50% show revenues lower than €75,000 per year and only 4% higher than €500,000, with a sample average of €146,348.

Figure 1.1. Revenues distribution

For each taxpayer in the dataset, both his tax notice status and the additional due taxes (i.e. the additional requested tax amount) are known.

Here comes the first problem that needs to be tackled: the additional due tax is a numeric attribute that measures the seriousness of the taxpayer’s tax evasion, whereas our algorithms, as we will show later on, need categorical values in order to predict. Thus, we cannot directly use the additional due taxes; we need to define a class variable and decide both which values it will take and how to map each numeric value of the additional due taxes onto such categorical values.

1.2.2. Interesting taxpayers

We must define a function f(x) which associates, to each element x in the dataset, a categorical value that shows its fraud risk degree and represents the class our first model will try to predict. Of course, a function that labels all the taxpayers in the dataset as tax evaders would be useless. Thus, a distinction needs to be drawn between serious tax evasion cases and those that are less relevant. To this purpose, we somehow follow Basta et al. (2009) and choose to divide the taxpayers into two groups, the interesting ones and the not interesting ones, from the tax administration’s point of view (to a certain extent, interesting stands for “it might be interesting for the tax administration to go and check what’s going on ...”), based on two criteria: profitability (i.e. the ability to identify the most serious cases of tax evasion, independently from all other factors) and fairness (i.e. the ability to identify the most serious cases of tax evasion, with respect to the taxpayer’s turnover).

Honest taxpayers are treated as not interesting taxpayers, even though this label is used to indicate moderate tax evasion cases. We are somehow forced to use this approximation since we only have data on taxpayers who received a tax notice, and not on taxpayers for which an audit process may have been closed without qualifications, or may have not even been started.

Therefore, in order to take the profitability issue into account, we define a new variable, called the tax claim, which represents the higher assessed taxes if the tax notice stage is still open, or the higher settled taxes if the stage status is definitive. Note that the higher assessed tax could be different from the higher settled tax, because the IRA and the taxpayer, while reaching an agreement, can both reconsider their positions. The tax claim distribution grouped in classes (again, in thousands of euros) is shown in Figure 1.2.

Figure 1.2. Tax claim distribution. For a color version of this figure, see www.iste.co.uk/dimotikalis/analysis2.zip

The left vertical axis refers to the tax claim distribution, grouped in the classes shown on the horizontal axis; the right vertical axis, on the contrary, sums up the monetary tax claim amount arising from each group (in thousands of euros). Therefore, as can easily be seen, the 331 most profitable tax notices (12% of the total) account for almost half of the tax revenue arising from our dataset.

The fairness criterion is then introduced to direct the audit process even towards smaller firms (which are usually charged smaller amounts of due income taxes); it is useful as it allows the tax authorities not to discriminate against taxpayers on the basis of their turnover, and it introduces a deterrent effect which improves overall tax compliance.

Therefore, we define another variable, called Z, which takes into account, for each taxpayer, both his turnover and his revenues, and compares them to the tax claim (TC). More formally, both of the ratios TC/turnover and TC/revenues are computed; then, the minimum between these two ratios and 1 is taken as the value of Z:

Z = min(TC/turnover, TC/revenues, 1),

which thus ranges from 0 to 1.

Now, for both the tax claim (TC) and Z, we calculate the 25th percentile (Q1), the median value (Q2) and the 75th percentile (Q3). We then state that a taxpayer may be considered interesting if he satisfies one of three conditions, defined in terms of these quartiles and represented in Figure 1.3.

Figure 1.3. Determining interesting and not interesting taxpayers. For a color version of this figure, see www.iste.co.uk/dimotikalis/analysis2.zip

Once the population of our dataset is entirely divided into interesting and not interesting taxpayers, we can see from Table 1.1 that the interesting ones are far more profitable than the others (tax claim values are in thousands of euros). A machine learning tool able to distinguish these two kinds of taxpayers fairly well would then be very useful.

Our first model’s task will then be to identify, with a certain degree of confidence, the taxpayers who are more likely to have evaded (both in absolute terms and as a percentage of revenues or turnover).

The literature on tax fraud detection, although using different methods and algorithms, is usually only concerned with this issue, i.e. finding the best way to identify the most relevant cases of tax evasion (Bonchi et al. 1999; Wu et al. 2012; Gonzalez and Velasquez 2013; de Roux et al. 2018).

There is another crucial issue that has to be taken into account, i.e. the tax authorities’ effective ability to collect the tax debt arising from the tax notices sent to all of the unfaithful taxpayers.

Table 1.1. Tax claim, interesting and not interesting taxpayers

            | Not interesting                       | Interesting
Tax claim   | Num    Total tax claim  Average       | Num    Total tax claim  Average
[0 - 1]     | 736    322              0.44          | 0      0                0.00
[1 - 2]     | 631    942              1.49          | 0      0                0.00
[2 - 5]     | 1,607  5,409            3.37          | 138    563              4.08
[5 - 10]    | 1,127  7,727            6.86          | 517    4,157            8.04
[10 - 20]   | 446    5,911            13.25         | 902    13,139           14.57
[20 - 50]   | 0      0                0.00          | 1,164  36,056           30.98
[50 - 100]  | 0      0                0.00          | 433    30,055           69.41
[100+]      | 0      0                0.00          | 327    101,987          311.89
Total       | 4,547  20,311           4.47          | 3,481  185,957          53.42

1.2.3. Enforced tax recovery proceedings

What happens if a taxpayer does not spontaneously pay the additional tax amount he is charged? Well, after a while, coercive collection procedures will be deployed by the tax authorities. However, as we have seen above, these procedures are highly ineffective, as they only collect about 5% of the overall credits claimed against the audited taxpayers.

Indeed, the data shows that coercive procedures take place in almost 40% of cases, although their distribution is not uniform: they are more frequent when the tax bill is high, as reported in Table 1.2 (again, tax claim values are in thousands of euros).

Table 1.2. Number of coercive procedures per tax claim interval

            | Coercive procedures |
Tax claim   | No     | Yes     | Total
[0 - 1]     | 578    | 158     | 736
[1 - 2]     | 476    | 155     | 631
[2 - 5]     | 1,268  | 477     | 1,745
[5 - 10]    | 1,072  | 572     | 1,644
[10 - 20]   | 745    | 603     | 1,348
[20 - 50]   | 511    | 653     | 1,164
[50 - 100]  | 159    | 274     | 433
[100+]      | 90     | 237     | 327
Total       | 4,899  | 3,129   | 8,028

Table 1.2 is actually a double frequency (contingency) table, which can be used to investigate the relationship between the two categorical variables, Coercive procedures and Tax claim (both take on values that are labels). Recall that, given two variables X and Y, X is independent of Y if, for all values of Y, the conditional distribution of X does not change. A quick glance at Table 1.2 therefore shows that Coercive procedures depend on the values taken by Tax claim.

In a more formal way, following the Openstax (2013) notation, we could also perform a test of independence for these variables, using the well-known test statistic

χ² = Σ (O − E)² / E, summed over all cells of the table,

where O is the observed value and E is the expected value, calculated as E = (row total × column total) / (total number surveyed).

Given the values in Table 1.2, the test would let us reject the hypothesis of the two variables being independent at a 1% level of significance: therefore, from the data, there is sufficient evidence to conclude that Coercive procedures are dependent on the Tax claim level.
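
As a concrete check, the test can be reproduced from the counts in Table 1.2. The following is a minimal sketch in Python (an illustration on our part, not the chapter’s own code), assuming SciPy is available:

```python
# Minimal sketch: chi-square test of independence on the Table 1.2 counts.
# Assumes SciPy; illustrative only, not the authors' implementation.
from scipy.stats import chi2_contingency

# Rows: tax claim intervals; columns: coercive procedures (No, Yes).
observed = [
    [578, 158],    # [0 - 1]
    [476, 155],    # [1 - 2]
    [1268, 477],   # [2 - 5]
    [1072, 572],   # [5 - 10]
    [745, 603],    # [10 - 20]
    [511, 653],    # [20 - 50]
    [159, 274],    # [50 - 100]
    [90, 237],     # [100+]
]

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p_value:.3g}")
# A p-value below 0.01 rejects independence at the 1% significance level.
```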

From Table 1.2, it is easy to calculate, for each tax claim interval, the share of all coercive procedures, the share of all tax notices, and the rate of coercive procedures within that tax claim interval (all of these ratios are depicted in Figure 1.4).

A close look at Figure 1.4 shows that while the tax claim is “low” (less than €10,000; please note that the intervals are in thousands of euros), the blue line, i.e. the percentage of tax notices, is above the purple one, i.e. the percentage of coercive procedures, while for higher values of the tax claim the blue line is below the purple one. This is quite strong evidence that coercive procedures are not independent of the tax claim.

As a result, the red line shows that the higher the tax claim, the higher the percentage of procedures within the tax claim range itself, up to over 70% in the last and, apparently, most desirable range.

Therefore, with just one model in place, whose task is to recognize interesting taxpayers, the tax authorities would risk facing many cases of coercive procedures; thus, their ability to ensure tax collection might be seriously jeopardized.

We therefore need to find a way to discover, among the most interesting taxpayers, the most solvent ones, the most willing to pay.

Figure 1.4. Coercive procedures and tax claim. For a color version of this figure, see www.iste.co.uk/dimotikalis/analysis2.zip

We can start by observing that a taxpayer with no properties will probably not be willing to pay his dues. Therefore, a second model is built, focusing only on a few features indicating whether the taxpayer owns some kind of assets, in order to predict whether a tax notice will end in an enforced recovery proceeding or not.

Once both models are available, the taxpayer selection process is carried out in such a way that undertakings will only be audited if they are judged worthy by both models.
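
Operationally, this twofold filter is just the intersection of the two models’ positive predictions. A hypothetical sketch (the model objects, feature matrices and labels below are placeholders, not the chapter’s actual objects):

```python
# Hypothetical sketch of the twofold selection: audit only taxpayers that
# model 1 flags as "interesting" AND model 2 predicts will not end in an
# enforced recovery proceeding. All names here are illustrative.
import numpy as np

def select_for_audit(model1, X_risk, model2, X_assets):
    interesting = model1.predict(X_risk) == "interesting"
    collectable = model2.predict(X_assets) == "no_procedure"
    # Indices of taxpayers judged worthy by both models.
    return np.flatnonzero(interesting & collectable)
```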

1.2.4. The models

Our selection strategy needs to take two competing demands into account: on one hand, tax notices must be profitable, i.e. they have to address serious tax fraud or tax evasion phenomena; on the other, tax collectability must be guaranteed in order to justify the tax authorities’ efforts.

To this purpose, we develop two models, both in the form of classification trees: the first one predicts whether a taxpayer is interesting or not, while the second predicts the final stage of a tax notice, distinguishing between those ending with an enforced recovery proceeding and the others, where such enforced recovery proceedings do not take place.

The first one’s attributes are taken from several datasets run by the IRA and are related to the taxpayers’ tax returns and their annexes (such as the sector studies), their properties details, their customers and suppliers lists and their tax notices, whereas the second one only focuses on a set of features concerning taxpayers’ assets.

In the taxpayer selection process, models that are easier to interpret are preferred to more complex ones. Decision trees typically meet these requirements, so both of our models take that form.

In both cases, instead of considering just one decision tree, both practical and theoretical reasons (Breiman 1996) lead us towards a more sophisticated technique known as bagging (bootstrap aggregating), in which many base classifiers (in our case, many trees) are computed.

Moreover, a cost matrix is used while building the models. Indeed, in our context, classifying an actual not interesting taxpayer as interesting is a much more serious error than classifying an actual interesting taxpayer as not interesting, since, generally, tax offices’ human resources are barely sufficient to perform all of the audits they are assigned. Therefore, as long as offices audit interesting taxpayers, everything is fine, even though many interesting taxpayers may not be considered. In the same way, predicting that a tax notice will not end in a coercive procedure when it actually does is a much more serious error than misclassifying a tax notice’s final stage the other way round. Therefore, different weights are given to different misclassification errors.

Finally, Ross Quinlan’s C4.5 decision tree algorithm is used to build the base classifiers within the bagging process.
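
To make the setup concrete, here is a rough scikit-learn analogue (an assumption on our part, not the IRA’s implementation): scikit-learn’s tree is CART with an entropy criterion rather than true C4.5, and the cost matrix is approximated through class weights.

```python
# Rough analogue of the chapter's setup: bagging over cost-sensitive
# decision trees. CART with entropy stands in for C4.5; the 5:1 cost
# ratio, the labels and the 100 trees are illustrative assumptions.
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

base_tree = DecisionTreeClassifier(
    criterion="entropy",                      # information gain, as in C4.5
    class_weight={"interesting": 1.0,         # assumed class labels
                  "not_interesting": 5.0},    # costlier to mislabel as interesting
)
model = BaggingClassifier(estimator=base_tree, n_estimators=100)
# model.fit(X_train, y_train)
# model.predict_proba(X_test) gives the per-taxpayer confidence that is
# later used to rank taxpayers (see Figures 1.7 and 1.8).
```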

Figure 1.5 puts all the pieces of our models together.

Figure 1.5. The two models together

1.3. Results

Our first model predicts, on the basis of the available features, 415 taxpayers to be interesting (i.e. 15.5% of the entire test set), with a precision rate of about 80%, as shown in Figure 1.6.

Figure 1.6. First model statistics and confusion matrix

In terms of tax claim amounts, the model appears to perform quite well, since the selected taxpayers’ average due additional taxes amounts to € 49,094, whereas the average on the entire test set is equal to € 22,339.

So far, we have shown that our model, on average, is able to distinguish serious tax evasion phenomena from less significant ones. But what about the tax collection issue? To deal with this matter, we should investigate what kind of taxpayers we have just selected. For this purpose, Table 1.3 shows that the majority of the taxpayers the model would select would also be subject to coercive procedures (the values in each column sum to 100%).

Table 1.3. Predicted values versus actual coercive procedures

Actual        | Predicted: Interesting | Predicted: Not interesting
Procedure     | 70.12%                 | 32.24%
No procedure  | 29.88%                 | 67.76%

Thus, many of the selected taxpayers have a debt payment issue, which jeopardizes the overall efficiency and effectiveness of the selection process. As pointed out by the Italian Court of Auditors, coercive procedures, on average, are able to collect only about 5% of the overall claimed credits.

To evaluate the extent of the problem, we can replace the actual tax claim value of the problematic taxpayers with the estimated collectable tax, which is equal to the tax claim discounted by 95% (i.e. multiplied by 0.05), and compare the two scenarios, as in Figures 1.7 and 1.8, where we depict both the total tax claim and the average tax claim arising from the taxpayers’ notices in the entire test set.
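
The two scenarios can be compared with a few lines of arithmetic; a small illustrative sketch (the array names are placeholders, and only the 5% recovery rate comes from the text):

```python
# Illustrative sketch of the discounted-claim comparison. `tax_claims` and
# `has_procedure` are placeholders for the selected taxpayers' data; the
# 5% recovery rate for coercive procedures is the only figure from the text.
import numpy as np

def total_and_discounted(tax_claims, has_procedure, recovery_rate=0.05):
    claims = np.asarray(tax_claims, dtype=float)
    flags = np.asarray(has_procedure, dtype=bool)
    # Claims ending in coercive procedures are replaced by the estimated
    # collectable amount; the others are kept at face value.
    discounted = np.where(flags, claims * recovery_rate, claims)
    return claims.sum(), discounted.sum()
```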

Figure 1.7. Total tax claim and discounted tax claim. For a color version of this figure, see www.iste.co.uk/dimotikalis/analysis2.zip

Taxpayers are ordered, from left to right, according to their probability of being interesting, as calculated by our model. Figure 1.7, for instance, depicts the cumulative tax claim charged up to a certain taxpayer: the red line values refer to the additional taxes requested with the tax notices, while the black line is drawn by considering the discounted values. The dashed vertical line indicates the levels corresponding to the last selected taxpayer according to the model (in our case, the 415th). Recall that when associating a class label with a record, the model also provides a probability, which highlights how confident the model is about its own prediction. Therefore, to a certain extent, it sets a ranking among taxpayers, which we can exploit to draw Figures 1.7 and 1.8. As we can easily observe, the overall tax claim charged to the selected taxpayers plummets from € 20 million to € 5 million, and the average tax claim, depicted in Figure 1.8, from € 49,000 to € 12,000. Thus, the selection process, which relied on our data mining model and at first sight seemed to be very efficient, shows some important flaws that we need to face. In fact, tax collectability is not adequately guaranteed.

Figure 1.8. Average total tax claim and discounted tax claim. For a color version of this figure, see www.iste.co.uk/dimotikalis/analysis2.zip

A second model may then help us by predicting which taxpayers would not be subject to coercive procedures, by focusing on a set of features concerning their assets.