BIG DATA, ARTIFICIAL INTELLIGENCE AND DATA ANALYSIS SET Coordinated by Jacques Janssen
Data analysis is a scientific field that continues to grow enormously, most notably over the last few decades, following rapid growth within the tech industry and the wide applicability of computational techniques, alongside new advances in analytic tools. Modeling enables data analysts to identify relationships, make predictions, and understand, interpret and visualize the extracted information more strategically.
This book includes the most recent advances on this topic, meeting increasing demand from wide circles of the scientific community. Applied Modeling Techniques and Data Analysis 2 is a collective work by a number of leading scientists, analysts, engineers, mathematicians and statisticians, working on the front end of data analysis and modeling applications. The chapters cover a cross section of current concerns and research interests in the above scientific areas. The collected material is divided into appropriate sections to provide the reader with both theoretical and applied information on data analysis methods, models and techniques, along with appropriate applications.
Cover
Title Page
Copyright
Preface
PART 1: Financial and Demographic Modeling Techniques
1 Data Mining Application Issues in the Taxpayer Selection Process
1.1. Introduction
1.2. Materials and methods
1.3. Results
1.4. Discussion
1.5. Conclusion
1.6. References
2 Asymptotics of Implied Volatility in the Gatheral Double Stochastic Volatility Model
2.1. Introduction
2.2. The results
2.3. Proofs
2.4. References
3 New Dividend Strategies
3.1. Introduction
3.2. Model 1
3.3. Model 2
3.4. Conclusion and further results
3.5. Acknowledgments
3.6. References
4 Introduction of Reserves in Self-adjusting Steering of Parameters of a Pay-As-You-Go Pension Plan
4.1. Introduction
4.2. The pension system
4.3. Theoretical framework of the Musgrave rule
4.4. Transformation of the retirement fund
4.5. Conclusion
4.6. References
5 Forecasting Stochastic Volatility for Exchange Rates using EWMA
5.1. Introduction
5.2. Data
5.3. Empirical model
5.4. Exchange rate volatility forecasting
5.5. Conclusion
5.6. Acknowledgments
5.7. References
6 An Arbitrage-free Large Market Model for Forward Spread Curves
6.1. Introduction and background
6.2. Construction of a market with infinitely many assets
6.3. Existence, uniqueness and non-negativity
6.4. Conclusion and future works
6.5. References
7 Estimating the Healthy Life Expectancy (HLE) in the Far Past: The Case of Sweden (1751-2016) with Forecasts to 2060
7.1. Life expectancy and healthy life expectancy estimates
7.2. The logistic model
7.3. The HALE estimates and our direct calculations
7.4. Conclusion
7.5. References
8 Vaccination Coverage Against Seasonal Influenza of Workers in the Primary Health Care Units in the Prefecture of Chania
8.1. Introduction
8.2. Material and method
8.3. Results
8.4. Discussion
8.5. References
9 Some Remarks on the Coronavirus Pandemic in Europe
9.1. Introduction
9.2. Background
9.3. Materials and analyses
9.4. The first phase of the pandemic
9.5. Concluding remarks
9.6. References
PART 2: Applied Stochastic and Statistical Models and Methods
10 The Double Flexible Dirichlet: A Structured Mixture Model for Compositional Data
10.1. Introduction
10.2. The double flexible Dirichlet distribution
10.3. Computational and estimation issues
10.4. References
11 Quantization of Transformed Lévy Measures
11.1. Introduction
11.2. Estimation strategy
11.3. Estimation of masses and the atoms
11.4. Simulation results
11.5. Conclusion
11.6. References
12 A Flexible Mixture Regression Model for Bounded Multivariate Responses
12.1. Introduction
12.2. Flexible Dirichlet regression model
12.3. Inferential issues
12.4. Simulation studies
12.5. Discussion
12.6. References
13 On Asymptotic Structure of the Critical Galton-Watson Branching Processes with Infinite Variance and Allowing Immigration
13.1. Introduction
13.2. Invariant measures of GW process
13.3. Invariant measures of GWPI
13.4. Conclusion
13.5. References
14 Properties of the Extreme Points of the Joint Eigenvalue Probability Density Function of the Wishart Matrix
14.1. Introduction
14.2. Background
14.3. Polynomial factorization of the Vandermonde and Wishart matrices
14.4. Matrix norm of the Vandermonde and Wishart matrices
14.5. Condition number of the Vandermonde and Wishart matrices
14.6. Conclusion
14.7. Acknowledgments
14.8. References
15 Forecast Uncertainty of the Weighted TAR Predictor
15.1. Introduction
15.2. SETAR predictors and bootstrap prediction intervals
15.3. Monte Carlo simulation
15.4. References
16 Revisiting Transitions Between Superstatistics
16.1. Introduction
16.2. From superstatistic to transition between superstatistics
16.3. Transition confirmation
16.4. Beck’s transition model
16.5. Conclusion
16.6. Acknowledgments
16.7. References
17 Research on Retrial Queue with Two-Way Communication in a Diffusion Environment
17.1. Introduction
17.2. Mathematical model
17.3. Asymptotic average characteristics
17.4. Deviation of the number of applications in the system
17.5. Probability distribution density of device states
17.6. Conclusion
17.7. References
List of Authors
Index
End User License Agreement
Chapter 1
Table 1.1. Tax claim, interesting and not interesting taxpayers
Table 1.2. Number of coercive procedures per tax claim interval
Table 1.3. Predicted values versus actual coercive procedures
Table 1.4. Predicted coercive procedures versus actual interesting taxpayers
Table 1.5. The most significant results of the models
Chapter 4
Table 4.1. Distribution of the workforce between the categories
Table 4.2. Replacement rates and ratio between benefits and contributions
Table 4.3. Replacement rates and contribution ratio in the new system
Chapter 5
Table 5.1. Descriptive statistics of raw data
Table 5.2. Descriptive statistics of logarithmic returns
Table 5.3. Errors (RMSE and MAPE) for different decay factors λi and out-of-sam...
Chapter 7
Table 7.1. Logistic model parameters and estimates
Table 7.2. HALE and healthy life expectancy direct estimates and logistic fit
Chapter 8
Table 8.1. Age and gender distribution in terms of HC/LHU inside/outside the ci...
Table 8.2. Professional characteristics regarding HC/LHU inside/outside the cit...
Table 8.3. % frequency and 95% CI of vaccinations in total and by gender, age o...
Table 8.4. Frequency of vaccinations and 95% CI between HC/LHU inside/outside t...
Table 8.5. Breakdown by type of staff of impulses and preventions of vaccinatio...
Chapter 9
Table 9.1. COVID-19 suspected case criteria (adapted from the WHO: https://www....
Chapter 10
Table 10.1. Mean Vectors stratified by cluster. μkj refers to the j-th element ...
Table 10.2. Mean of 500 initializations of (α, τ) in different parameter config...
Table 10.3. Parameter configurations for all the DFD simulations
Table 10.4. Results for the simulation study regarding the initialization proce...
Table 10.5. ID4 - Simulation results
Chapter 12
Table 12.1. Posterior means and CIs of unknown parameters together with WAIC ba...
Table 12.2. Posterior means and CIs of unknown parameters together with WAIC ba...
Table 12.3. Posterior means and CIs of unknown parameters together with WAIC ba...
Table 12.4. Posterior means and CIs of unknown parameters together with WAIC ba...
Table 12.5. Simulation study 2: posterior means and CIs of unknown parameters to...
Table 12.6. Simulation study 3: posterior means and CIs of unknown parameters t...
Chapter 14
Table 14.1. For different points on a three-dimensional sphere and the square o...
Table 14.2. Comparison of the value of the Vandermonde determinant (|X|) and th...
Chapter 15
Table 15.1. Evaluation of the Pi’s of the weighted predictor for the models M1,...
Table 15.2. Skewness-adjusted (Grabowski et al. 2020) PI's of Li (2011) and Sta...
Big Data, Artificial Intelligence and Data Analysis Set
coordinated by
Jacques Janssen
Volume 8
Applied Modeling Techniques and Data Analysis 2
Edited by
Yannis Dimotikalis
Alex Karagrigoriou
Christina Parpoula
Christos H. Skiadas
First published 2021 in Great Britain and the United States by ISTE Ltd and John Wiley & Sons, Inc.
Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the case of reprographic reproduction in accordance with the terms and licenses issued by the CLA. Enquiries concerning reproduction outside these terms should be sent to the publishers at the undermentioned address:
ISTE Ltd
27-37 St George’s Road
London SW19 4EU
UK
www.iste.co.uk
John Wiley & Sons, Inc.
111 River Street
Hoboken, NJ 07030
USA
www.wiley.com
© ISTE Ltd 2021
The rights of Yannis Dimotikalis, Alex Karagrigoriou, Christina Parpoula and Christos H. Skiadas to be identified as the authors of this work have been asserted by them in accordance with the Copyright, Designs and Patents Act 1988.
Library of Congress Control Number: 2020951002
British Library Cataloguing-in-Publication Data
A CIP record for this book is available from the British Library
ISBN 978-1-78630-674-6
Preface
Data analysis as an area of importance has grown exponentially, especially during the past couple of decades. This can be attributed to a rapidly growing technology industry and the wide applicability of computational techniques, in conjunction with new advances in analytic tools. Modeling enables analysts to apply various statistical models to the data they are investigating, to identify relationships between variables, to make predictions about future sets of data, and to understand, interpret and visualize the extracted information more strategically. Many new research results have recently been developed and published, and many more are currently in progress. The topic is also widely presented at many international scientific conferences and workshops. This being the case, the need for literature that addresses it is self-evident. This book includes the most recent advances on the topic. As a result, on one hand, it unifies in a single volume all new theoretical and methodological issues and, on the other, introduces new directions in the field of applied data analysis and modeling, which are expected to further extend the applicability of data analysis methods and modeling techniques.
This book is a collective work by a number of leading scientists, analysts, engineers, mathematicians and statisticians, who have been working on the front end of data analysis. The chapters included in this collective volume represent a cross-section of current concerns and research interests in the above-mentioned scientific areas. This volume is divided into two parts with a total of 17 chapters in a form that provides the reader with both theoretical and applied information on data analysis methods, models and techniques, along with appropriate applications.
Part 1 focuses on financial and demographic modeling techniques and includes nine chapters: Chapter 1, “Data Mining Application Issues in the Taxpayer Selection Process”, by Mauro Barone, Stefano Pisani and Andrea Spingola; Chapter 2, “Asymptotics of Implied Volatility in the Gatheral Double Stochastic Volatility Model”, by Mohammed Albuhayri, Anatoliy Malyarenko, Sergei Silvestrov, Ying Ni, Christopher Engström, Finnan Tewolde and Jiahui Zhang; Chapter 3, “New Dividend Strategies”, by Ekaterina Bulinskaya; Chapter 4, “Introduction of Reserves in Self-adjusting Steering of Parameters of a Pay-As-You-Go Pension Plan”, by Keivan Diakite, Abderrahim Oulidi and Pierre Devolder; Chapter 5, “Forecasting Stochastic Volatility for Exchange Rates using EWMA”, by Jean-Paul Murara, Anatoliy Malyarenko, Milica Rancic and Sergei Silvestrov; Chapter 6, “An Arbitrage-free Large Market Model for Forward Spread Curves”, by Hossein Nohrouzian, Ying Ni and Anatoliy Malyarenko; Chapter 7, “Estimating the Healthy Life Expectancy (HLE) in the Far Past: The Case of Sweden (1751-2016) with Forecasts to 2060”, by Christos H. Skiadas and Charilaos Skiadas; Chapter 8, “Vaccination Coverage Against Seasonal Influenza of Workers in the Primary Health Care Units in the Prefecture of Chania”, by Aggeliki Maragkaki and George Matalliotakis; Chapter 9, “Some Remarks on the Coronavirus Pandemic in Europe”, by Konstantinos N. Zafeiris and Marianna Koukli.
Part 2 covers the area of applied stochastic and statistical models and methods and comprises eight chapters: Chapter 10, “The Double Flexible Dirichlet: A Structured Mixture Model for Compositional Data”, by Roberto Ascari, Sonia Migliorati and Andrea Ongaro; Chapter 11, “Quantization of Transformed Lévy Measures”, by Mark Anthony Caruana; Chapter 12, “A Flexible Mixture Regression Model for Bounded Multivariate Responses”, by Agnese M. Di Brisco and Sonia Migliorati; Chapter 13, “On Asymptotic Structure of the Critical Galton-Watson Branching Processes with Infinite Variance and Allowing Immigration”, by Azam A. Imomov and Erkin E. Tukhtaev; Chapter 14, “Properties of the Extreme Points of the Joint Eigenvalue Probability Density Function of the Wishart Matrix”, by Asaph Keikara Muhumuza, Karl Lundengård, Sergei Silvestrov, John Magero Mango and Godwin Kakuba; Chapter 15, “Forecast Uncertainty of the Weighted TAR Predictor”, by Francesco Giordano and Marcella Niglio; Chapter 16, “Revisiting Transitions Between Superstatistics”, by Petr Jizba and Martin Prokš; Chapter 17, “Research on Retrial Queue with Two-Way Communication in a Diffusion Environment”, by Viacheslav Vavilov.
We wish to thank all the authors for their insights and excellent contributions to this book. We would like to acknowledge the assistance of all those involved in the reviewing process of this book, without whose support this could not have been successfully completed. Finally, we wish to express our thanks to the secretariat and, of course, the publishers. It was a great pleasure to work with them in bringing to life this collective volume.
Yannis DIMOTIKALIS
Crete, Greece
Alex KARAGRIGORIOU
Samos, Greece
Christina PARPOULA
Athens, Greece
Christos H. SKIADAS
Athens, Greece
December 2020
1 Data Mining Application Issues in the Taxpayer Selection Process
This chapter provides a data analysis framework designed to build an effective learning scheme aimed at improving the Italian Revenue Agency’s ability to identify non-compliant taxpayers, with special regard to self-employed individuals allowed to keep simplified registers. Our procedure involves building two C4.5 decision trees, both trained and validated on a sample of 8,000 audited taxpayers, but predicting two different class values, based on two different predictive attribute sets. That is, the first model is built in order to identify the most likely non-compliant taxpayers, while the second identifies the ones that are less likely to pay the additional due tax bill. This twofold selection process target is needed in order to maximize the overall audit effectiveness. Once both models are in place, the taxpayer selection process will be held in such a way that businesses will only be audited if they are judged as worthy by both models. This methodology will soon be validated on real cases: that is, a sample of taxpayers will be selected according to the classification criteria developed in this chapter and will subsequently be involved in some audit processes.
Fraud detection systems are designed to automate and help reduce the manual parts of a screening/checking process (Phua et al. 2005). Data mining plays an important role in fraud detection as it is often applied to extract fraudulent behavior profiles hidden behind large quantities of data and, thus, may be useful in decision support systems for planning effective audit strategies. Indeed, huge amounts of resources (to put it bluntly, money) may be recovered from well-targeted audits. This explains the increasing interest and investments of both governments and fiscal agencies in intelligent systems for audit planning. The Italian Revenue Agency (hereafter, IRA) itself has been studying data mining application techniques in order to detect tax evasion, focusing, for instance, on the tax credit system, supposed to support investments in disadvantaged areas (de Sisti and Pisani 2007), on fraud related to credit mechanisms, with regard to value-added tax – a tax that is levied on the price of a product or service at each stage of production, distribution or sale to the end consumer, except where a business is the end consumer, which will reclaim this input value (Basta et al. 2009) and on income indicators audits (Barone et al. 2017).
This chapter contributes to the empirical literature on the development of classification models applied to the tax evasion field, presenting a case study that focuses on a dataset of 8,000 audited taxpayers for the fiscal year 2012, each of them described by a set of features concerning, among others, their tax returns, their properties and their tax notice.1
In this context, all the taxpayers are in some way “unfaithful”, since all of them have received a tax notice that somehow rectified the tax return they had filed. Thus, the predictive analysis tool we develop is designed to find patterns in data that may help tax offices recognize only the riskiest taxpayers’ profiles.
Evidence on data at hand shows that our first model, which is described in detail later, is able to distinguish the taxpayers who are worthy of closer investigation from those who are not.2
However, by defining the class value as a function of the higher due taxes, we satisfy the need to focus on the taxpayers who are more likely to be “significant” tax evaders, but we do not ensure an efficient collection of their tax debt. Indeed, data shows that as the tax bill increases, the number of coercive collection procedures put in place also increases. Unfortunately, these procedures are highly inefficient, as they are able to collect only about 5% of the overall credits claimed against the audited taxpayers (Italian Court of Auditors 2016). As a result, the tax authorities’ ability to collect the due taxes may be jeopardized.
Further analysis is thus devoted to finding a way to discover, among the “significant” evaders, the most solvent ones. We recall that the 2018–2020 Agreement between the IRA and the Ministry of Finance states that audit effectiveness is measured, among other indicators, by one that is simply equal to the sum of the collected due taxes and summarizes the effectiveness of the IRA’s efforts to tackle tax evasion (Ministry of Economy and Finance – IRA Agreement for 2018–2020, 2018). This is a reasonable indicator because the ordinary activities taken in the fight against tax evasion are crucial from the State budget point of view: public expenditures (i.e. public services) strictly depend on the amount of public revenue. Of course, fraud and other incorrect fiscal behaviors may be tackled even when no tax collection is guaranteed, in order to reach maximum tax compliance. Such extra activities may also be jointly conducted with the Finance Guard or the Public Prosecutor if tax offenses arise.
Therefore, to tackle our second problem, i.e. to guarantee a certain degree of due tax collection, we start from the trivial observation that a taxpayer with no properties will not be willing to pay his dues, whereas if he has something to lose (a home or a car that could be seized) and the IRA’s claim is right, it is more probable that he will reach an agreement with the tax authorities.
Therefore, a second model only focusing on a few features indicating whether the taxpayer owned some kind of assets or not is built, in order to predict each tax notice’s final status (in this case, we only distinguish between statuses ending with an enforced recovery proceeding and statuses where such enforced recovery proceedings do not take place). Once both models are available, the taxpayer selection process is held in such a way that businesses will only be audited if they are judged as worthy by both models.
The key feature of our procedure is the twofold selection process target, needed to maximize the IRA’s audit processes’ effectiveness. The methodology we suggest will soon be validated on real cases, i.e. a sample of taxpayers will be selected according to the classification criteria developed in this chapter and will subsequently be involved in some audit processes.
Data on hand refers to a sample of 8,028 audited self-employed individuals for fiscal year 2012, each described by a set of features, concerning, among others, their tax returns, their properties and their tax notice.3
Just for descriptive purposes, we can depict the statistical distribution of the revenues achieved by the businesses in our sample, grouped in classes (in thousands of euros), in Figure 1.1.
Most of our dataset is made up of small-sized taxpayers: almost 50% show revenues lower than € 75,000 per year and only 4% higher than € 500,000, with a sample average of € 146,348.
Figure 1.1. Revenues distribution
For each taxpayer in the dataset, both his tax notice status and the additional due taxes (i.e. the additional requested tax amount) are known.
Here comes the first problem that needs to be tackled: the additional due tax is a numeric attribute which measures the seriousness of the taxpayer’s tax evasion, whereas our algorithms, as we will show later on, need categorical values in order to predict. Thus, we cannot directly use the additional due taxes; we need to define a class variable and decide both which values it will take and how to map each numeric value of the additional due taxes into such categorical values.
We must define a function f(x) which associates, to each element x in the dataset, a categorical value that shows its fraud risk degree and represents the class our first model will try to predict. Of course, a function that labels all the taxpayers in the dataset as tax evaders would be useless. Thus, a distinction needs to be drawn between serious tax evasion cases and those that are less relevant. To this purpose, we somehow follow Basta et al. (2009) and choose to divide the taxpayers into two groups, the interesting ones and the not interesting ones, from the tax administration point of view (to a certain extent, interesting stands for “it might be interesting for the tax administration to go and check what’s going on ...”), based on two criteria: profitability (i.e. the ability to identify the most serious cases of tax evasion, independently from all other factors) and fairness (i.e. the ability to identify the most serious cases of tax evasion, with respect to the taxpayer’s turnover).
Honest taxpayers are treated as not interesting taxpayers, even though this label is used to indicate moderate tax evasion cases. We are somehow forced to use this approximation since we only have data on taxpayers who received a tax notice, and not on taxpayers for which an audit process may have been closed without qualifications, or may have not even been started.
Therefore, in order to take the profitability issue into account, we define a new variable, called the tax claim, which represents the higher assessed taxes if the tax notice stage is still open, or the higher settled taxes if the stage status is definitive. Note that the higher assessed tax could be different from the higher settled tax, because the IRA and the taxpayer, while reaching an agreement, can both reconsider their positions. The tax claim distribution grouped in classes (again, in thousands of euros) is shown in Figure 1.2.
Figure 1.2. Tax claim distribution. For a color version of this figure, see www.iste.co.uk/dimotikalis/analysis2.zip
The left vertical axis is related to the tax claim distribution, grouped in the classes shown on the horizontal axis; the right vertical axis, on the contrary, sums up the monetary tax claim amount that arises from each group (in thousands of euros). Therefore, as can easily be seen, the 331 most profitable tax notices (12% of the total) account for almost half of the tax revenue arising from our dataset.
The fairness criterion is then introduced to direct the audit process towards smaller firms as well (which are usually charged smaller amounts of due income taxes); it is useful as it allows the tax authorities not to discriminate against taxpayers on the basis of their turnover, and it introduces a deterrent effect which improves overall tax compliance.
Therefore, we define another variable, called Z, which takes into account, for each taxpayer, both his turnover and his revenues, and compares them to the tax claim (TC). More formally, both of the ratios TC/turnover and TC/revenues are computed; the minimum between these two ratios and 1 is then taken. That is the value of the variable Z, which thus ranges from 0 to 1.
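As a minimal sketch of this construction (the function and argument names are ours, not the chapter’s, and we assume strictly positive turnover and revenues), Z could be computed as follows:

```python
def z_value(tax_claim: float, turnover: float, revenues: float) -> float:
    """Fairness variable Z: the tax claim compared to the size of the
    business, capped at 1 so that Z always lies in [0, 1].

    Assumes turnover and revenues are strictly positive; all names are
    illustrative, not taken from the chapter.
    """
    # The two ratios TC/turnover and TC/revenues, then the minimum
    # between them and 1, as described in the text.
    return min(tax_claim / turnover, tax_claim / revenues, 1.0)
```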
Now, for both tax claim (TC) and Z, we calculate the 25th percentile (Q1), the median value (Q2) and the 75th percentile (Q3). We then state that a taxpayer may be considered interesting if he satisfies one of the following conditions:
The three above-mentioned rules can be represented as in Figure 1.3.
Figure 1.3. Determining interesting and not interesting taxpayers. For a color version of this figure, see www.iste.co.uk/dimotikalis/analysis2.zip
Once the population of our dataset is entirely divided into interesting and not interesting taxpayers, we can see from Table 1.1 that the interesting ones are far more profitable than the others (tax claim values are in thousands of euros). A machine learning tool able to distinguish these two kinds of taxpayers fairly well would then be very useful.
Our first model task will then be that of identifying, with a certain confidence degree, the taxpayers who are more likely to have evaded (both in absolute terms and as a percentage of revenues or turnover).
The literature on tax fraud detection, although using different methods and algorithms, is usually only concerned with this issue, i.e. finding the best way to identify the most relevant cases of tax evasion (Bonchi et al. 1999; Wu et al. 2012; Gonzalez and Velasquez 2013; de Roux et al. 2018).
There is another crucial issue that has to be taken into account, i.e. the tax authorities’ effective ability to collect the tax debt arising from the tax notices sent to all of the unfaithful taxpayers.
Table 1.1. Tax claim, interesting and not interesting taxpayers

Tax claim  | Not interesting                     | Interesting
           | Num   | Total tax claim | Average   | Num   | Total tax claim | Average
[0 - 1]    | 736   | 322             | 0.44      | 0     | 0               | 0.00
[1 - 2]    | 631   | 942             | 1.49      | 0     | 0               | 0.00
[2 - 5]    | 1,607 | 5,409           | 3.37      | 138   | 563             | 4.08
[5 - 10]   | 1,127 | 7,727           | 6.86      | 517   | 4,157           | 8.04
[10 - 20]  | 446   | 5,911           | 13.25     | 902   | 13,139          | 14.57
[20 - 50]  | 0     | 0               | 0.00      | 1,164 | 36,056          | 30.98
[50 - 100] | 0     | 0               | 0.00      | 433   | 30,055          | 69.41
[100+]     | 0     | 0               | 0.00      | 327   | 101,987         | 311.89
Total      | 4,547 | 20,311          | 4.47      | 3,481 | 185,957         | 53.42
What happens if a taxpayer does not spontaneously pay the additional tax amount he is charged? Well, after a while, coercive collection procedures will be deployed by the tax authorities. However, as we have seen above, these procedures are highly ineffective, as they collect only about 5% of the overall credits claimed against the audited taxpayers.
Indeed, data shows that coercive procedures take place in almost 40% of cases, although their distribution is not uniform: they are more frequent when the tax bill is high, as reported in Table 1.2 (again, tax claim values are in thousands of euros).
Table 1.2. Number of coercive procedures per tax claim interval

Tax claim  | Coercive procedures | Total
           | No    | Yes         |
[0 - 1]    | 578   | 158         | 736
[1 - 2]    | 476   | 155         | 631
[2 - 5]    | 1,268 | 477         | 1,745
[5 - 10]   | 1,072 | 572         | 1,644
[10 - 20]  | 745   | 603         | 1,348
[20 - 50]  | 511   | 653         | 1,164
[50 - 100] | 159   | 274         | 433
[100+]     | 90    | 237         | 327
Total      | 4,899 | 3,129       | 8,028
Table 1.2 is actually a double frequency table, which can be used to investigate the existing relationship between the two categorical variables, Coercive procedures and Tax claim (they both take on values that are labels). Recall that, given two variables X and Y, X is independent of Y if, for all values of Y, the conditional distribution of X does not change. Therefore, a quick glance at Table 1.2 shows that Coercive procedures depend on the values taken by Tax claim.
In a more formal way, following the Openstax (2013) notation, we could also perform a test of independence for these variables, by using the well-known test statistic for a test of independence:

χ² = Σ (O − E)² / E

where O is the observed value and E is the expected value, calculated as (row total × column total) / (total number surveyed).
Given the values in Table 1.2, the test would let us reject the hypothesis of the two variables being independent at a 1% level of significance: therefore, from the data, there is sufficient evidence to conclude that Coercive procedures are dependent on the Tax claim level.
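For reproducibility, the same test can be run directly on the counts of Table 1.2; the following is a small illustrative script (ours, not the chapter’s), using scipy:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts from Table 1.2: one row per tax claim interval,
# columns = (no coercive procedure, coercive procedure).
observed = np.array([
    [578, 158],    # [0 - 1]
    [476, 155],    # [1 - 2]
    [1268, 477],   # [2 - 5]
    [1072, 572],   # [5 - 10]
    [745, 603],    # [10 - 20]
    [511, 653],    # [20 - 50]
    [159, 274],    # [50 - 100]
    [90, 237],     # [100+]
])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p_value:.3g}")
# A p-value far below 0.01 rejects independence at the 1% level,
# in line with the conclusion drawn in the text.
```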
It is easy to calculate from Table 1.2, for each tax claim interval, the overall coercive procedures rate, the tax notices rate and the rate of coercive procedures within that tax claim interval (all of these ratios are depicted in Figure 1.4).
A close look at Figure 1.4 shows that as long as the tax claim is “low” (less than € 10,000; please note that the intervals are in thousands of euros), the blue line, i.e. the percentage of tax notices, lies above the purple one, i.e. the percentage of coercive procedures, while for higher values of the tax claim the blue line lies below the purple one. This is quite strong evidence that coercive procedures are not independent of the tax claim.
As a result, the red line shows that the higher the tax claim, the higher the percentage of procedures within the tax claim range itself, up to over 70% in the last and, apparently, most desirable range.
Therefore, with just one model in place, whose task is to recognize interesting taxpayers, the tax authorities would risk facing many cases of coercive procedures. Thus their ability to ensure tax collection may be seriously jeopardized.
We therefore need to find a way to discover, among the most interesting taxpayers, the most solvent ones, the most willing to pay.
Figure 1.4. Coercive procedures and tax claim. For a color version of this figure, see www.iste.co.uk/dimotikalis/analysis2.zip
We can start by observing that a taxpayer with no properties will probably not be willing to pay his dues. Therefore, a second model only focusing on a few features indicating whether the taxpayer owned some kind of assets or not is built, in order to predict if a tax notice will end in an enforced recovery proceeding or not.
Once both models are available, the taxpayer selection process is held in such a way that undertakings will only be audited if judged worthy by both models.
Our selection strategy needs to take into account two competing demands: on one hand, tax notices must be profitable, i.e. they have to address serious tax fraud or the tax evasion phenomena; on the other, tax collectability must be guaranteed in order to justify all of the tax authorities’ efforts.
To this purpose, we develop two models, both in the form of classification trees: the first one predicts whether a taxpayer is interesting or not, while the second predicts the final stage of a tax notice, distinguishing between those ending with an enforced recovery proceeding and the others, where such enforced recovery proceedings do not take place.
The first one’s attributes are taken from several datasets run by the IRA and are related to the taxpayers’ tax returns and their annexes (such as the sector studies), their properties details, their customers and suppliers lists and their tax notices, whereas the second one only focuses on a set of features concerning taxpayers’ assets.
In the taxpayer selection process, models that are easier to interpret are preferred to more complex models. Typically, decision trees meet the above requested conditions, so both of our models take that form.
In both cases, instead of considering just one decision tree, both practical and theoretical reasons (Breiman 1996) lead us towards a more sophisticated technique, known as bagging, which stands for bootstrap aggregating, with which many base classifiers are computed (in our case, many trees).
Moreover, a cost matrix is used while building the models. Indeed, in our context, classifying an actually not interesting taxpayer as interesting is a much more serious error than classifying an actually interesting taxpayer as not interesting, given that tax offices’ human resources are generally barely sufficient to perform all of the audits they are assigned. Therefore, as long as offices audit interesting taxpayers, everything is fine, even though many interesting taxpayers may not be considered. In the same way, predicting that a tax notice will not end in a coercive procedure when it actually does is a much more serious error than the opposite misclassification of a tax notice’s final stage. Therefore, different weights are given to different misclassification errors.
Finally, Ross Quinlan’s C4.5 decision tree algorithm is used to build the base classifiers within the bagging process.
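As an illustration of this setup (a sketch under stated assumptions, not the authors’ actual pipeline): scikit-learn offers no C4.5 implementation, so an entropy-based CART tree stands in for the base classifier, and per-class weights play the role of the cost matrix; the weight values and label names below are assumptions.

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Cost-sensitive base tree: errors on actually "not interesting"
# taxpayers are penalized more heavily (weights are illustrative).
base_tree = DecisionTreeClassifier(
    criterion="entropy",  # information gain, in the spirit of C4.5
    class_weight={"interesting": 1, "not interesting": 5},
)

# Bootstrap aggregating (bagging) of many such trees.
model = BaggingClassifier(
    estimator=base_tree,   # named base_estimator in scikit-learn < 1.2
    n_estimators=100,
    bootstrap=True,
    random_state=0,
)

# With X_train / y_train holding the predictive attributes and the
# interesting / not interesting labels (hypothetical names):
# model.fit(X_train, y_train)
# model.predict_proba(X_test)  # also yields the confidence ranking used later
```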
Figure 1.5 puts all the pieces of our models together.
Figure 1.5. The two models together
Our first model predicts, on the basis of the available features, 415 taxpayers to be interesting (i.e. 15.5% of the entire test set), with a precision rate of about 80%, as shown in Figure 1.6.
Figure 1.6. First model statistics and confusion matrix
In terms of tax claim amounts, the model appears to perform quite well, since the selected taxpayers’ average due additional taxes amounts to € 49,094, whereas the average on the entire test set is equal to € 22,339.
So far, we have shown that our model, on average, is able to distinguish serious tax evasion phenomena from the less significant ones. But what about the tax collection issue? To deal with this matter, we should investigate what kind of taxpayers we have just selected. For this purpose, Table 1.3 shows that the majority of the taxpayers the model would select would also be subject to coercive procedures (as we can see, the values in each column sum to 100%).
Table 1.3. Predicted values versus actual coercive procedures

Actual       | Predicted: Interesting | Predicted: Not interesting
Procedure    | 70.12%                 | 32.24%
No procedure | 29.88%                 | 67.76%
Thus, many of the selected taxpayers have a debt payment issue. This jeopardizes the overall selection process efficiency and effectiveness. As pointed out by the Italian Court of Auditors, coercive procedures, on average, are able to collect only about 5% of the overall claimed credits.
To evaluate the extent of the problem, we can replace the actual tax claim value corresponding to the problematic taxpayers with the estimated collectable tax, which is equal to the tax claim reduced by a discount factor of 95% (i.e. only 5% is assumed to be collected), and compare the two scenarios, as in Figures 1.7 and 1.8, where we depict both the total tax claim and the average tax claim arising from the taxpayers’ notices in the entire test set.
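A rough sketch of this replacement step (the toy values and array names are ours, not the chapter’s):

```python
import numpy as np

# Toy stand-ins for the test set: additional due taxes per taxpayer
# (euros) and whether a coercive procedure took place.
tax_claim = np.array([49_000.0, 8_000.0, 120_000.0, 3_500.0])
coercive = np.array([True, False, True, False])

# Coercive procedures recover only about 5% of the claimed credits
# (Italian Court of Auditors 2016), so those claims are cut to 5%
# of their face value; the others are kept unchanged.
collectable = np.where(coercive, tax_claim * 0.05, tax_claim)
```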
Figure 1.7. Total tax claim and discounted tax claim. For a color version of this figure, see www.iste.co.uk/dimotikalis/analysis2.zip
Taxpayers are ordered, from left to right, according to their probability of being interesting, as calculated by our model. Figure 1.7, for instance, depicts the cumulative tax claim charged up to a certain taxpayer: the red line values refer to the additional taxes requested with the tax notices, while the black line is drawn by considering the discounted values. The dashed vertical line indicates the levels corresponding to the last selected taxpayer according to the model (in our case, the 415th). Recall that when associating a class label with a record, the model also provides a probability, which highlights how confident the model is about its own prediction. Therefore, to a certain extent, it sets a ranking among taxpayers, which we can exploit to draw Figures 1.7 and 1.8. As we can easily observe, the overall tax claim charged to the selected taxpayers plummets from € 20 million to € 5 million, and the average tax claim, depicted in Figure 1.8, from € 49,000 to € 12,000. Thus, the selection process, which relied on our data mining model and at first sight seemed to be very efficient, shows some important flaws that we need to face. In fact, tax collectability is not adequately guaranteed.
Figure 1.8. Average total tax claim and discounted tax claim. For a color version of this figure, see www.iste.co.uk/dimotikalis/analysis2.zip
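The ranking underlying both figures can be sketched in the same spirit, continuing the toy example above (p_interesting is an assumed name for the model’s predicted probabilities, not taken from the chapter):

```python
# Predicted probability of being "interesting", one value per taxpayer.
p_interesting = np.array([0.91, 0.35, 0.78, 0.12])

# Sort taxpayers from most to least confident, then accumulate claims.
order = np.argsort(-p_interesting)
cumulative_claim = np.cumsum(tax_claim[order])          # undiscounted curve
cumulative_collectable = np.cumsum(collectable[order])  # discounted curve
```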
A second model may then help us by predicting which taxpayers would not be subject to coercive procedures, by focusing on a set of features concerning their assets.
