I mention this because it is symptomatic of all data integration and data movement projects: data quality work needs to start before you begin your project (so that you can properly budget time and resources), continue right through the course of the project and, where it is not a one-time project like a data migration, be maintained on an ongoing basis in production. Further, in order to maintain quality you need to be able to monitor it (via dashboards and the like) on an ongoing basis as well. This is especially important within the context of data governance.
In other words, data quality follows the lifecycle of your data, and it spans multiple applications and systems as the data is reused and shared. In this article I discuss the need for an integrated data quality platform to support such an environment, with particular reference to the Trillium Software System, whose latest release (version 11) has just been announced.
Version 11 includes some very significant features, not least of which is that this release represents the culmination of the efforts the company has made, over the last few years, to fully integrate its Avellino acquisition. Specifically, this means that there is now a single repository (Metabase) which is shared across the environment (or you can have multiple Metabases), and that there is a single integrated interface across the product. The consequence of this is that data profiling and analysis need not be distinct from data cleansing and matching. In other words, you can now easily swap between one function and the other, as requirements dictate, rather than being forced to use a more waterfall-style approach in which profiling came first and quality came second.
The second major enhancement is the introduction of phrase analysis. This provides the ability to identify unique words and phrases (substrings), and combinations thereof, within a selected attribute. The importance of this is that it provides the ability to parse unstructured and semi-structured data and to then build data quality rules based on these. This is particularly important if you want to apply data quality rules to product data or other descriptive data that comes into the organisation in unstructured formats.
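To make the idea concrete, here is a minimal sketch of phrase analysis in Python. It is not Trillium's implementation: the function name `phrase_frequencies`, the tokenisation rule and the sample product descriptions are all assumptions made for illustration. It simply counts every word and two-word phrase found in the values of one attribute.

```python
import re
from collections import Counter


def phrase_frequencies(values, max_words=2):
    """Count every word n-gram (1..max_words words) across an attribute's values.

    Hypothetical helper for illustration only, not a Trillium API.
    """
    counts = Counter()
    for value in values:
        # Lower-case and split on anything that is not a letter or digit.
        tokens = re.findall(r"[a-z0-9]+", value.lower())
        for n in range(1, max_words + 1):
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
    return counts


# Invented sample values for a free-text "description" attribute.
descriptions = [
    "Blue cotton shirt 2 pack",
    "Red cotton shirt",
    "Blue wool jumper",
]

freq = phrase_frequencies(descriptions)
for phrase, count in freq.most_common(5):
    print(phrase, count)
```

Frequently occurring phrases such as "cotton shirt" are the natural candidates for parsing rules, while rare ones may point to typing errors or to values that belong in a different attribute.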
Shortly, phrase analysis will be extended by the introduction of Universal Data Libraries that will provide standardised taxonomies for things such as colours, units of measure, currencies, sizes and shapes and so on. Although the core libraries will be in English at first, you will be able to customise these libraries so that you can automatically recognise that blau = bleu = blue for example.
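A customised library of this kind can be thought of as a mapping from each variant to a standard term. The sketch below assumes that shape. The names `COLOUR_LIBRARY`, `build_lookup` and `standardise` are hypothetical, and the actual format of the Universal Data Libraries is not described here.

```python
# Hypothetical library: each standard term maps to the variants it covers.
COLOUR_LIBRARY = {
    "blue": {"blue", "blau", "bleu"},
    "red": {"red", "rot", "rouge"},
}


def build_lookup(library):
    """Invert the library into a variant -> standard term lookup."""
    return {
        variant: standard
        for standard, variants in library.items()
        for variant in variants
    }


def standardise(value, lookup):
    """Return the standard term for a value, or None if it is not recognised."""
    return lookup.get(value.strip().lower())


LOOKUP = build_lookup(COLOUR_LIBRARY)
print(standardise("Blau", LOOKUP))
print(standardise(" bleu ", LOOKUP))
print(standardise("vert", LOOKUP))
```

Returning None for an unrecognised value, rather than passing it through unchanged, makes the gaps in the library visible so that they can be reviewed and added.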
In addition, the company has opened up the Metabase through an API, both so that customers and partners can more easily integrate data quality and profiling processes into their applications and to provide a foundation for interactive report building. Through this API, for example, customers can easily incorporate metadata from Trillium, such as profiling results or business rules, into broader metadata repositories or other applications.
Finally, this release sees the introduction of time series analysis. Previously, you could only take snapshots of quality information but, with time series support, trending becomes possible so that you can monitor data quality and profiling statistics over time.
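The difference between a snapshot and a trend is easy to show in a few lines. The following sketch assumes that each snapshot is a date paired with a single quality statistic (here, the percentage of complete records). The figures are invented and the `trend` function is hypothetical rather than anything taken from the product.

```python
from datetime import date

# Invented snapshots: (date taken, percentage of complete records).
snapshots = [
    (date(2024, 3, 1), 91.0),
    (date(2024, 1, 1), 90.0),
    (date(2024, 2, 1), 93.0),
]


def trend(snapshots):
    """Return (date, change since the previous snapshot) in date order."""
    ordered = sorted(snapshots)
    return [
        (later_date, later_value - earlier_value)
        for (_, earlier_value), (later_date, later_value) in zip(ordered, ordered[1:])
    ]


changes = trend(snapshots)
for when, delta in changes:
    print(when.isoformat(), f"{delta:+.1f}")
```

A single snapshot would report 91% and nothing more. The series shows that quality improved and then slipped back, which is the sort of movement a monitoring dashboard needs to expose.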
Anyway, so much for the major new features (there are a number of smaller ones as well) in version 11. Now I want to focus more on strategy.
There are two types of data quality vendor: pure plays (like Trillium) and those integrated with ETL tools.
As far as pure plays in the market are concerned, most of these (apart from Trillium) are smaller companies that specialise in a particular area such as real-time data quality (embedding data quality into call centre applications) or product data quality, for example. In contrast, Trillium is recognised as an enterprise-wide solution, providing the tools and content to support data quality needs across a wide range of business applications, data domains, and implementation types. However, there is more to it than that: by choosing the smaller suppliers of point solutions, whether for price or because they offer deeper capabilities in a specific area, you will end up having multiple data quality tools when it might be better to have a single solution that did everything, even if, in some circumstances, it wasn’t quite the best thing since sliced bread.
Now, this may seem like the traditional integrated solution versus best-of-breed argument. Some people like one, some people like the other. However, in the case of data quality, it is not quite as simple as that, because using a data quality platform such as Trillium's allows you to reuse business rules across both real-time and batch processes, and across different application environments. Further, if we consider the implementation of data governance, then one of the precepts involved would be the adoption of common data quality standards across the organisation, which is best facilitated by using a common platform and reusable rules.
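The point about reuse can be illustrated with a small sketch in which one set of rules serves both a real-time check on a single record and a batch run over many. Everything here is an assumption made for illustration: the rule names, the record layout and the functions `check_record` and `check_batch` are hypothetical and do not represent how Trillium defines or executes rules.

```python
def name_present(record):
    """Rule: the name field must contain something other than whitespace."""
    return bool(record.get("name", "").strip())


def country_is_two_letter_code(record):
    """Rule: the country field must be two upper-case letters."""
    country = record.get("country", "")
    return len(country) == 2 and country.isalpha() and country.isupper()


# One shared rule set, defined once.
RULES = {
    "name_present": name_present,
    "country_is_two_letter_code": country_is_two_letter_code,
}


def check_record(record, rules=RULES):
    """Real-time use: return the names of the rules that one record fails."""
    return [name for name, rule in rules.items() if not rule(record)]


def check_batch(records, rules=RULES):
    """Batch use: map the position of each failing record to its failed rules."""
    report = {}
    for position, record in enumerate(records):
        failures = check_record(record, rules)
        if failures:
            report[position] = failures
    return report


good = {"name": "Ada", "country": "GB"}
bad = {"name": " ", "country": "gbr"}
print(check_record(bad))
print(check_batch([good, bad]))
```

Because the batch function delegates to the same per-record check, a change to a rule takes effect in both modes at once, which is the practical benefit of a common platform that separate point tools cannot easily match.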