Today’s digital world generates large volumes of data from countless electronic devices, and processing data in terabytes and petabytes is a tough task for conventional data warehouse systems. These workloads can be handled by Netezza, a data warehousing appliance that can process terabytes and petabytes of data very quickly. Netezza is a purpose-built, centralized appliance well suited to satellite data marts and departmental data warehouse solutions, extending the enterprise data warehouse out to downstream applications. The crux of Netezza is to keep things simple, speed up high-performance data analysis, and help clients gain valuable insight into their business.
Brief on Netezza Appliance:
With the aim of providing a foundation for centralized machine data with improved efficiency, the existing file-repository-based archive system must be replaced with a Netezza data warehouse appliance, enabling controlled and fast access to machine data.
The key deliverables of the project to create the “New System” are:
- A Netezza data warehouse appliance filled with machine data fetched from the machines stationed at end-user sites, along with the loaders to ingest the daily inflow of machine data into the Netezza appliance.
- An API (Application Programming Interface) giving client applications efficient access to the stored machine data. Two key points about the API:
- The paradigm shift from the existing strategy (huge amounts of raw machine data transferred to the client system and converted into information on the client PC) to the new strategy (keep the original machine data in the appliance and deliver only the derived information to the client).
- Most of the data in the existing archive system, by file count and volume, will be stored in the new Netezza appliance. The remaining data, which has no business value, will be stored as files on a micro archive.
Performance Issues After “New System” Implementation With Netezza:
IBM InfoSphere DataStage is the tool used to load data into the Netezza appliance. For querying and reporting, the API and OBIEE are used. Performance issues were noticed both while loading data into, and while running queries on, the Netezza data warehouse appliance.
Issues With ETL Loading:
All the customer machines send data to the new system in the form of ADC packages. Each package consists of files related to analysis, monitoring and performance of the machine. These files are packed into a Unix tape archive (tar) and then compressed with gzip, producing a file with the extension .tgz that contains one day of machine data. The new system receives approximately 2,500 packages per day, together comprising about 200 GB of data per day.
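The tar-plus-gzip packaging described above can be handled with Python's standard tarfile module. The sketch below builds a tiny in-memory stand-in for one day's package and then reads it back the way a loader would; the member file names are invented for illustration, not the actual ADC package layout.

```python
import io
import tarfile

# Build a small in-memory stand-in for one day's ADC package
# (the member names here are hypothetical).
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    for name in ("analysis.csv", "monitoring.csv", "performance.csv"):
        data = b"machine_id,value\nM001,42\n"
        info = tarfile.TarInfo(name=name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))

# Reading side: open the .tgz and list its member files,
# as a loader would before staging the data for DataStage.
buf.seek(0)
with tarfile.open(fileobj=buf, mode="r:gz") as tar:
    members = [m.name for m in tar.getmembers()]

print(members)  # the three member files of the package
```

Because "w:gz" applies gzip on top of the tar stream, a single `tarfile.open` call covers both the archiving and the compression step described above.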
IBM InfoSphere DataStage processes these packages and loads the data into five types of tables: parameters, configuration, events, constants and test reports. No issues arose for a year or so, but later InfoSphere was unable to process 2,500 packages in a day. As a result, whenever there were bug fixes or releases, the backlog of packages kept growing, and the target of processing each day's packages as they arrived was not being met.
Solutions For ETL Loading:
Each iteration of InfoSphere DataStage would process 200 packages, and it was taking 3 hours. Before inserting data into a table, InfoSphere does a lookup into the existing tables to check whether the data already exists, and based on the result it either inserts or updates. The biggest fact tables in Netezza hold approximately 50-200 billion records, so a lookup into these big fact tables is expensive.
In the “new system” InfoSphere DataStage has 8 nodes and is designed for parallel processing: each job is split into 8 tasks, and each task runs on its own node in parallel, which normally speeds up the job. However, this backfired when doing a lookup, because all eight nodes were scanning the same table at the same time for a limited amount of data. A single lookup on the big fact tables is expensive by itself; instead of doing one lookup to check for existing records, InfoSphere was doing the lookup 8 times on the same table, which killed Netezza performance.
After we identified the issue, we altered the InfoSphere job to do a single lookup on the fact table to check for existing records. This improved performance and brought the time to process 200 packages down to 1 hour. This was still far from the performance we expected, so we next looked into the Netezza table structures for further optimization. We observed that the ETL jobs look up the tables by machine and date, while the fact tables had a column of timestamp data type. So, during each lookup on the fact table, the timestamp column was being truncated to the date data type.
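The single-lookup pattern above can be sketched in miniature: fetch the existing (machine, date) keys once, then split the incoming batch into inserts and updates against that one result set. In Netezza this would be one SELECT over the fact table instead of eight parallel scans; the machine IDs and dates below are invented sample data.

```python
# Simulated result of ONE key lookup against the fact table
# (in the real job, a single SELECT of machine + date keys).
existing_keys = {("M001", "2014-01-01"), ("M002", "2014-01-01")}

# Incoming batch from one package (hypothetical rows).
incoming = [
    {"machine": "M001", "date": "2014-01-01", "value": 10},
    {"machine": "M003", "date": "2014-01-01", "value": 7},
]

# Split the batch with in-memory membership checks, so the big
# fact table is scanned once rather than once per node.
inserts = [r for r in incoming if (r["machine"], r["date"]) not in existing_keys]
updates = [r for r in incoming if (r["machine"], r["date"]) in existing_keys]

print(len(inserts), len(updates))  # 1 1
```

The point is that the expensive scan happens exactly once; distinguishing insert from update is then a cheap local operation on the lookup result.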
So we proposed two changes to the table structure:
- Add a new column with the date data type.
- Organize the table on machine and date.
Organizing is a Netezza feature that sorts the entire table data on the columns you select. This is extremely beneficial for lookups and for the filter conditions of queries.
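Why sorting helps: Netezza keeps per-extent min/max statistics (zone maps), and when a table is organized on a column, rows with the same value cluster into few extents, so most extents can be skipped. This is a minimal simulation of that effect, not Netezza code; the extent size and row data are invented for illustration.

```python
from datetime import date

EXTENT = 4  # rows per extent (tiny, for illustration only)

# 20 hypothetical fact rows of (machine, date), dates cycling 1..5.
rows = [("M%03d" % (i % 3), date(2014, 1, 1 + i % 5)) for i in range(20)]

def zone_map(table):
    """Min/max date per extent -- what Netezza tracks automatically."""
    zones = []
    for start in range(0, len(table), EXTENT):
        chunk = table[start:start + EXTENT]
        dates = [d for _, d in chunk]
        zones.append((min(dates), max(dates)))
    return zones

target = date(2014, 1, 3)

# Unorganized: every extent's [min, max] range covers the target
# date, so every extent must be scanned.
scanned_unorg = sum(lo <= target <= hi for lo, hi in zone_map(rows))

# Organized on date: sorting clusters each date into few extents,
# so the zone map lets most extents be skipped.
organized = sorted(rows, key=lambda r: r[1])
scanned_org = sum(lo <= target <= hi for lo, hi in zone_map(organized))

print(scanned_unorg, scanned_org)  # 5 1
```

The same mechanism explains why the new date-typed column mattered: zone maps prune on the stored column values, so filtering on a plain date column is far more selective than truncating a timestamp at query time.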
We also observed that a few fact tables were skewed, with data distributed unevenly across the Netezza data slices. We changed the distribution keys of those tables to avoid the skew.
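The skew problem can be illustrated with a toy hash-distribution model. Here a deterministic hash plays the role of Netezza's distribution function, and one dominant machine contributes most of the rows; the slice count, machine IDs and key choices are all hypothetical, not taken from the project.

```python
import zlib
from collections import Counter

SLICES = 8  # data slices in the appliance (hypothetical count)

def dist_slice(key):
    # Deterministic stand-in for Netezza's distribution hash.
    return zlib.crc32(repr(key).encode()) % SLICES

# Hypothetical rows: one dominant machine sends most of the data.
rows = [("M001", day) for day in range(900)]
rows += [("M%03d" % m, 0) for m in range(2, 102)]

def distribute(rows, key):
    counts = Counter(dist_slice(key(r)) for r in rows)
    return [counts.get(s, 0) for s in range(SLICES)]

def skew(counts):
    # Ratio of the busiest slice to the average slice (1.0 = even).
    return max(counts) / (sum(counts) / len(counts))

# Distributing on machine alone piles M001's rows onto one slice...
by_machine = distribute(rows, key=lambda r: r[0])
# ...while a higher-cardinality key spreads them evenly.
by_machine_day = distribute(rows, key=lambda r: (r[0], r[1]))

print(round(skew(by_machine), 2), round(skew(by_machine_day), 2))
```

A skewed table makes the busiest slice the bottleneck for every scan, which is why redistributing on a higher-cardinality key restores parallel performance.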
All the above changes improved ETL performance: loading 200 packages now completed in 12-15 minutes instead of 3 hours. We are now able to process a full day of 2,500 packages within 3 hours, and we can keep up with whatever volume of packages arrives at InfoSphere.
Benefits Acquired From Performance Tuning:
- ETL loads used to take 3 hours to complete one iteration of 200 packages; they now complete in 15 minutes, a performance improvement of over 90%.
- A few tools took 20 minutes to run queries on the Netezza appliance; those queries now complete in less than 5 seconds.