At a foundational level, it’s important to begin any “first time” data productization project with a commitment to data normalization. Today, I’m writing about normalization of data for the purposes of analytics, which is necessarily a step toward normalization of data for machine learning and artificial intelligence. The normalization process is designed to deliver greater ongoing insight from the data, and it brings fewer duplicate records, consistent data across the product offerings, and easier access and queryability.
Data normalization can be challenging for many reasons, including handling data redundancy, ensuring query performance, and tackling the complexity of data relationships, to name a few. While removing redundant data is critical to the proper creation of models and analytical reports, at the foundational data lake level it’s important to keep the data raw and add a separate layer that removes duplicate records. The larger the dataset, the more necessary it becomes to optimize queries and tables, along with the caching structure, to ensure acceptable performance and application responsiveness. Maintaining relationships between tables also means ensuring the correct sequence of data updates and handling their cascading effects.
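To make the "keep it raw, de-duplicate in a layer above" idea concrete, here is a minimal Python sketch. The record fields (`customer_id`, `email`, `loaded_at`) and the `deduplicate` helper are hypothetical names chosen for illustration, and treating the first occurrence as the one to keep is an assumption; a real pipeline would pick its own business key and tie-breaking rule.

```python
def deduplicate(raw_records, key_fields):
    """Return a de-duplicated list built from raw_records.

    The raw input is never modified; the first record seen for each
    key wins (an assumption made for this sketch).
    """
    seen = set()
    unique = []
    for record in raw_records:
        # Build a hashable business key from the chosen fields.
        key = tuple(record.get(field) for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Hypothetical raw-layer data: the same customer was loaded twice.
raw_events = [
    {"customer_id": 1, "email": "a@example.com", "loaded_at": "2024-01-01"},
    {"customer_id": 1, "email": "a@example.com", "loaded_at": "2024-01-02"},
    {"customer_id": 2, "email": "b@example.com", "loaded_at": "2024-01-02"},
]

# The "clean" layer is derived; the raw layer keeps all three rows.
clean_events = deduplicate(raw_events, ["customer_id", "email"])
```

Because the clean layer is derived rather than edited in place, the de-duplication rule can be changed later and replayed against the untouched raw data.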
Let’s acknowledge another, more human challenge inherent to a data normalization project. It is, in and of itself, not a very exciting undertaking for strong engineers. So, how does product management foster a healthy environment to get basic work done? I like to set a bigger vision that holds the promise of working on technologically interesting projects, such as machine learning or event-driven architectures. At the same time, it’s important to establish much smaller milestones and celebrate their achievement. This means writing robust success criteria and including those expectations in tickets so that it’s clear when a meaningful bar has been reached. When the inevitable mistakes are made and rectified, celebrate the improvements and acknowledge the innovations.
As a product leader, it’s important to empower the team by allowing enough time for proper up-front design rather than simply barreling forward. In fact, a data normalization project is often needed because the future needs of the business were only briefly considered in the past, under urgent time-to-market pressure. Even a short planning window can result in robust database table design that organizes data by common attributes and reduces the chance of inconsistency. Support the team in establishing a plan for primary keys, which aid query performance, and foreign keys, which keep table joins reliable and efficient. Encourage a culture that empowers the team to design for the use cases laid out by the business (product management), rather than imposing specifics from on high. This creates a culture of personal growth and spurs creativity.
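As a rough illustration of what a primary-key and foreign-key plan can look like, the sketch below uses Python's built-in `sqlite3` module with an in-memory database. The `customers` and `orders` tables and their columns are hypothetical, not a schema taken from any real project, and SQLite is used only because it needs no setup.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

# Each table gets a primary key; orders points back at customers.
conn.execute(
    """
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        email       TEXT NOT NULL UNIQUE
    )
    """
)
conn.execute(
    """
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
        total_cents INTEGER NOT NULL
    )
    """
)

conn.execute("INSERT INTO customers VALUES (1, 'a@example.com')")
conn.execute("INSERT INTO orders VALUES (100, 1, 2500)")

# Join the two tables on the key they share.
rows = conn.execute(
    """
    SELECT o.order_id, c.email, o.total_cents
    FROM orders AS o
    JOIN customers AS c ON c.customer_id = o.customer_id
    """
).fetchall()

# An order for a customer that does not exist is rejected by the
# foreign key, which is how the constraint protects consistency.
try:
    conn.execute("INSERT INTO orders VALUES (101, 999, 500)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
```

The customer's email lives in exactly one place, so a change to it never has to be repeated across order rows, and the foreign key stops orphaned orders from entering the table at all.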