Tax authorities can audit an entity at any time to check whether it complies with applicable tax laws, and the entity needs to be ready with answers on time. Here’s how a system can be built so that tax audits run smoothly and the responsible team is prepared to respond at all times.

Any large business entity can face tax audits from time to time. Each audit requires gathering information from different systems and running data reconciliations on it to check for anomalies that could put the entity at risk.

In large organisations, data is usually distributed across multiple teams and systems, and each system structures its data differently. For example, a particular buyer could have different identifiers in different systems such as invoicing and the customer database, while governments and jurisdictions identify the same buyer by their own unique identifiers, such as PAN in India. As a result, data retrieval becomes a challenge because records must be associated without a common identifier, and figuring out the right data source for each set of information becomes complex and slow. On top of that, identifying and remediating anomalies, i.e., running all kinds of tax rules across different tax jurisdictions, is difficult at the scale at which a large organisation operates. In addition, organisations have to respond on time, with relevant data, to notices (questions) issued by a tax authority. Two challenges stand out:

1) Ability to gather distributed data at scale: As the volume of data grows, collecting it from multiple systems becomes increasingly challenging.

2) Detecting anomalies proactively: As organisations increasingly work in a distributed manner, the data held in different systems has to be reconciled before it is reported to the tax authorities. This activity is generally complex, as finding anomalies is like finding a needle in a haystack. For instance, during onboarding on the company portal, an employee may provide their name as “foo”, while their PAN lists their name as “foobar”. To comply with local laws and regulations, such records need to be reconciled.
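The name mismatch described above can be illustrated with a minimal sketch. It assumes both systems can be exported as a mapping from a shared employee id to a name; the function names, ids, and sample records are hypothetical and not part of any real system.

```python
import unicodedata


def normalize_name(name):
    """Lower-case, strip accents, and collapse whitespace so that purely
    cosmetic differences do not count as mismatches."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return " ".join(without_accents.lower().split())


def find_name_mismatches(portal_records, pan_records):
    """Return (employee_id, portal_name, pan_name) for every employee whose
    normalised names differ between the two systems.

    Both arguments are hypothetical dicts mapping employee_id -> name.
    Employees missing from the PAN data are reported with pan_name=None.
    """
    mismatches = []
    for employee_id, portal_name in sorted(portal_records.items()):
        pan_name = pan_records.get(employee_id)
        if pan_name is None or normalize_name(portal_name) != normalize_name(pan_name):
            mismatches.append((employee_id, portal_name, pan_name))
    return mismatches


# Illustrative data only: E1 mirrors the "foo" vs. "foobar" case from the text.
portal = {"E1": "foo", "E2": "Jane  Doe", "E3": "Sam Roy"}
pan = {"E1": "foobar", "E2": "jane doe"}

for row in find_name_mismatches(portal, pan):
    print(row)
```

Here E2 is not reported because the two spellings differ only in case and spacing, while E1 and E3 are surfaced for review. In practice the hard part is obtaining the shared id in the first place, which is exactly the missing common identifier discussed earlier.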

The proposed solution to overcome these challenges involves the following components:
1. Data Ingestion (Data Lake): This layer connects different data sources (DBs with varied structures) to a central data repository. It supports batch uploads for historical data as well as uploads triggered by real-time events. Data can be ingested asynchronously in the following ways:
a. Near real-time consumption through integration with Kinesis Firehose: This provides automatic connectors to S3, buffers data before moving it to S3, allows record transformation/enrichment, etc.
b. Pluggable batch upload to consume bulk data into the data lake: This can be used for backfilling historical data as needed.
c. Event listeners for client notifications, which trigger ingestion of the corresponding data into the data lake.
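As a rough sketch of the near real-time path in (a), the snippet below batches events and hands them to a Firehose client through the boto3 `put_record_batch` call. The client is passed in as a parameter, so the code does not import boto3 itself; the stream name and the event fields are assumptions made for illustration.

```python
import json

# PutRecordBatch accepts at most 500 records per call.
MAX_BATCH_SIZE = 500


def to_firehose_records(events):
    """Serialise each event dict as one newline-terminated JSON record."""
    return [
        {"Data": (json.dumps(event, sort_keys=True) + "\n").encode("utf-8")}
        for event in events
    ]


def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def ingest_events(firehose_client, stream_name, events):
    """Send events to Firehose in batches.

    Returns the total number of records that Firehose reported as failed,
    so the caller can decide whether to retry.
    """
    failed = 0
    for batch in chunked(to_firehose_records(events), MAX_BATCH_SIZE):
        response = firehose_client.put_record_batch(
            DeliveryStreamName=stream_name, Records=batch
        )
        failed += response.get("FailedPutCount", 0)
    return failed


# Hypothetical usage (requires boto3 and AWS credentials):
#   import boto3
#   ingest_events(boto3.client("firehose"), "tax-events-stream", events)
```

A production version would also retry the individual records that failed and respect the per-request size limit; both are left out here to keep the sketch short.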

2. Anomaly Detection: This layer runs different anomaly rules on the data in the data lake to ensure the data is always tax compliant. It supports two types of rules:
a. Deterministic Anomaly Rules: These are rules with exact, well-defined conditions, and they are relatively easy to configure. The team can define rules on datasets and apply them to check whether a particular record is valid. A simple example is checking that every active player in a sports tournament is fully paid in the system. In the sample data below, Player 1 is active but their fee is still unpaid, so the rule would flag that record.

| Tournament Id | Player Id | Player Fee | Status |
|---|---|---|---|
| 1 | Player 1 | 100 | Unpaid |

| Player Id | Player Name | Status |
|---|---|---|
| Player 1 | Mr. A | Active |
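The rule on the two tables above can be sketched as a simple join. The field names and the assumption that a settled fee carries the status "Paid" are illustrative; only the sample rows come from the tables.

```python
# Rows mirror the two sample tables; the field names are hypothetical.
fees = [
    {"tournament_id": 1, "player_id": "Player 1", "player_fee": 100, "status": "Unpaid"},
]
players = [
    {"player_id": "Player 1", "player_name": "Mr. A", "status": "Active"},
]


def unpaid_active_players(fee_rows, player_rows):
    """Flag fee rows that belong to an active player but are not marked 'Paid'.

    Assumes 'Paid' is the status value of a settled fee.
    """
    active = {p["player_id"]: p for p in player_rows if p["status"] == "Active"}
    anomalies = []
    for fee in fee_rows:
        player = active.get(fee["player_id"])
        if player is not None and fee["status"] != "Paid":
            anomalies.append(
                {
                    "tournament_id": fee["tournament_id"],
                    "player_id": fee["player_id"],
                    "player_name": player["player_name"],
                    "player_fee": fee["player_fee"],
                }
            )
    return anomalies


for anomaly in unpaid_active_players(fees, players):
    print(anomaly)
```

Because the rule is deterministic, the same input always yields the same flagged records, which makes such checks easy to review and to rerun before a filing.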

b. Non-Deterministic Anomaly Rules: These correspond to anomalies for which exact rules are hard to define manually. Machine learning (ML) algorithms can be used to detect them, and different types of anomalies call for different algorithms. A few guidelines apply:

  • Clean the input data before running any ML algorithm to remove noise, e.g., special symbols, stop words, and punctuation, as these are irrelevant for making predictions.
  • Divide the problem space into sub-problems; some may require ML modelling and some may not. For example, identifying invoices with a zero gross amount does not need ML.
  • Target one kind of anomaly at a time, e.g., anomalies in gross amount and in address might need different ML models/algorithms for better results.

The results can be shown to users on QuickSight dashboards, along with confidence scores, so that they can take corrective action. The anomalies are classified into the following types:

  • “Supervised Anomalies”: These are anomalies for which a labelled training set is available to train an ML model.
    Example 1: Variants of the same address
    State or country abbreviations (21st street Canada vs. 21st street CA)
    Special symbols such as commas (21st street, Canada. vs. 21st street Canada)
    Extra spaces (21st. Street Canada vs. 21st street Canada)
    An ML model can be trained to recognise that the strings above point to the same address.
    Example 2: Random strings (foobar, shjdgdyug, etc.)
    Here an ML model can be trained to check whether a given address string points to a valid address.
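Before any model is trained on address pairs, the cosmetic variations from Example 1 can be reduced with a cleaning step. The following is a minimal preprocessing sketch, not the ML model itself; the abbreviation table is an assumption for illustration (in real data "CA" could equally mean California).

```python
import re

# Illustrative abbreviation table; a real one would be jurisdiction-specific.
ABBREVIATIONS = {"ca": "canada"}


def normalize_address(address):
    """Lower-case the address, replace punctuation with spaces, collapse
    repeated whitespace, and expand known abbreviations."""
    lowered = address.lower()
    no_symbols = re.sub(r"[^a-z0-9\s]", " ", lowered)
    tokens = no_symbols.split()
    return " ".join(ABBREVIATIONS.get(token, token) for token in tokens)


variants = [
    "21st street Canada",
    "21st street CA",
    "21st street, Canada.",
    "21st.  Street   Canada",
]

for variant in variants:
    print(repr(variant), "->", repr(normalize_address(variant)))
```

All four variants collapse to the same string, so a downstream model only has to learn the differences that cleaning cannot resolve.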

  • “Unsupervised Anomalies”: These are anomalies for which no labelled training dataset is available to train ML models. In these cases, outliers in the dataset can be identified and checked as potential anomalies. E.g.:

| Invoice Description | Tax Rate |
|---|---|

Here, the ML algorithm deduces that a 10% tax rate is applied to most AC invoices, so any other rate applied to an AC transaction is likely incorrect.
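The intuition can be sketched with a simple frequency-based heuristic, used here as a stand-in for a real outlier-detection algorithm: find the dominant tax rate per invoice description and flag invoices that deviate from it. The invoice rows, field names, and the `min_share` threshold are all hypothetical.

```python
from collections import Counter, defaultdict


def flag_rate_outliers(invoices, min_share=0.75):
    """Flag invoices whose tax rate differs from the dominant rate for the
    same description.

    A description is only checked when its dominant rate covers at least
    `min_share` of its invoices; that share doubles as a confidence score.
    """
    rates_by_description = defaultdict(list)
    for invoice in invoices:
        rates_by_description[invoice["description"]].append(invoice["tax_rate"])

    # description -> (dominant rate, share of invoices using it)
    dominant = {}
    for description, rates in rates_by_description.items():
        rate, count = Counter(rates).most_common(1)[0]
        dominant[description] = (rate, count / len(rates))

    flagged = []
    for invoice in invoices:
        rate, share = dominant[invoice["description"]]
        if share >= min_share and invoice["tax_rate"] != rate:
            flagged.append(
                {**invoice, "expected_rate": rate, "confidence": round(share, 2)}
            )
    return flagged


# Hypothetical invoices: most AC invoices carry a 10% rate, one carries 18%.
invoices = [
    {"invoice_id": "INV-1", "description": "AC", "tax_rate": 10},
    {"invoice_id": "INV-2", "description": "AC", "tax_rate": 10},
    {"invoice_id": "INV-3", "description": "AC", "tax_rate": 10},
    {"invoice_id": "INV-4", "description": "AC", "tax_rate": 10},
    {"invoice_id": "INV-5", "description": "AC", "tax_rate": 18},
    {"invoice_id": "INV-6", "description": "Laptop", "tax_rate": 12},
]

for outlier in flag_rate_outliers(invoices):
    print(outlier)
```

Only INV-5 is flagged, with the expected rate and a confidence score that could be surfaced on a dashboard. Real invoice descriptions are free text, so they would first need to be grouped into categories before a check like this becomes meaningful.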

Conclusion and Learnings

Using a data lake makes it easier to manage large-scale data and supports solutions that need access to diverse datasets. Without a data lake, reconciling data across multiple systems and keeping it synchronised becomes much harder.

No single machine learning algorithm suits all problems. The right choice depends on factors such as the input dataset, the availability of a training set, and the nature of the anomalies being detected.