Pentaho Data Integration (PDI) is a powerful tool for ETL processing. When performance matters while processing data, one needs to consider all the alternatives and settle on one or more solutions.
Pentaho Data Integration has multiple components for processing an XML file:
- Get Data From XML
- Streaming XML Input
- XML Input Stream (StAX)
All of these steps can read and process XML data. A smart solution to the performance problem described below is presented at the end of the article.
Generally, when users use the Get Data From XML component, everything works fine as long as the XML files are small and few in number. As the size and number of files increase, performance problems are likely to occur.
2. Next, try increasing the number of copies of the step to improve performance. In practice, neither approach yields much improvement. Finally, one can try the parallel approach provided by Pentaho, but that does not help much either.
After trying all of these approaches, the output matrix is as below:
No. of Files
1. Get Data From XML
2. Increase No. of Copies
3. PDI Parallel Approach
4. Custom Solution (Leveling)
5. Custom with Parallel (SPEC Solution)
One can instead use the XML Input Stream (StAX) step to get data from the files and load it into the data mart.
The XML Input Stream (StAX) step uses a completely different approach to solve use cases with very big and complex data structures and the need for very fast data loads.
This step was chosen because it processes data quickly and its memory use does not depend on file size, even when the files are very large. The step is also very flexible and can read different parts of an XML file in different ways.
This step breaks the file down into the various levels of its XML tags and then loads the file.
Then, using the output of this step, one can regroup the various tags and populate them into the FACT and DIM tables.
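The level-based output described above can be pictured with a small Python sketch. This is not how PDI implements the step; it is an analogy built on the standard library's `xml.etree.ElementTree.iterparse`, and the row fields (`type`, `level`, `path`, `attributes`, `text`), the function name `stream_rows`, and the sample document are all assumptions made for illustration. Levels here start at 0 for the root element, which is a choice of this sketch.

```python
import io
import xml.etree.ElementTree as ET


def stream_rows(source):
    """Yield one row per element start and end, tagged with its nesting level.

    Hypothetical row layout for illustration only.
    """
    path = []  # stack of tag names from the root down to the current element
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            path.append(elem.tag)
            yield {
                "type": "START_ELEMENT",
                "level": len(path) - 1,
                "path": "/" + "/".join(path),
                "attributes": dict(elem.attrib),  # attributes are known at start
                "text": None,                     # text is only complete at end
            }
        else:
            yield {
                "type": "END_ELEMENT",
                "level": len(path) - 1,
                "path": "/" + "/".join(path),
                "attributes": {},
                "text": (elem.text or "").strip() or None,
            }
            path.pop()
            elem.clear()  # drop the element's content once it has been emitted


if __name__ == "__main__":
    sample = b"<orders><order id='1'><item>pen</item></order></orders>"
    for row in stream_rows(io.BytesIO(sample)):
        print(row)
```

Because rows are produced while the file is being read and each element is cleared after use, the sketch never needs the whole document in memory at once, which is the property that makes a streaming reader attractive for very large files.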
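As a rough illustration of the regrouping idea, the Python sketch below reads a stream of XML events and splits them into dimension-style and fact-style records. The document layout (`customers`, `customer`, `order`), the attribute names, and the output structures `dim_customer` and `fact_order` are hypothetical and are not taken from the original project; in PDI the equivalent work would be done with downstream steps rather than Python.

```python
import io
import xml.etree.ElementTree as ET


def regroup(source):
    """Split a customer/order XML stream into dimension and fact records.

    The tag and attribute names are assumptions made for this example.
    """
    dim_customer = []  # one record per <customer> (dimension-style)
    fact_order = []    # one record per <order>, keyed to its customer (fact-style)
    current_customer = None

    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start" and elem.tag == "customer":
            # A parent-level tag sets the context for the rows nested below it.
            current_customer = elem.get("id")
            dim_customer.append(
                {"customer_id": current_customer, "name": elem.get("name")}
            )
        elif event == "start" and elem.tag == "order":
            # A child-level tag becomes a fact row that carries the parent's key.
            fact_order.append(
                {
                    "order_id": elem.get("id"),
                    "customer_id": current_customer,
                    "amount": float(elem.get("amount")),
                }
            )
        elif event == "end":
            elem.clear()  # free the element once its data has been copied out

    return dim_customer, fact_order


if __name__ == "__main__":
    sample = b"""
    <customers>
      <customer id="C1" name="Asha">
        <order id="O1" amount="120.50"/>
        <order id="O2" amount="80.00"/>
      </customer>
      <customer id="C2" name="Ravi">
        <order id="O3" amount="15.25"/>
      </customer>
    </customers>
    """
    dims, facts = regroup(io.BytesIO(sample))
    print(dims)
    print(facts)
```

The key design point is that the parent tag's identifier is remembered while its children stream past, so each fact row can be linked to its dimension row without loading the whole file.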
Smart Solution by SPEC INDIA:
1. SPEC INDIA developed a transformation for loading data from the XML files.
2. This transformation accepts an argument, so that one or more instances of it can run in parallel.
3. SPEC INDIA launched the job from multiple command prompts, passing a different argument to each:
kitchen.bat ……../Phase2/Job” /user:admin /level:basic P1
kitchen.bat ……../Phase2/Job” /user:admin /level:basic P2
With this parallel approach, processing time was reduced to less than 24 hours. Performance can be improved further by increasing the number of parallel processes.
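Opening several command prompts by hand can also be scripted. The Python sketch below is one possible way to do that, not the method SPEC INDIA used: it builds one `kitchen.bat` command per partition argument (`P1`, `P2`, and so on) and starts them concurrently. The `base_args` parameter stands in for the repository and job arguments that are elided in the commands above and must be supplied by the caller; `build_command`, `launch_all`, and the `runner` hook are hypothetical names introduced for this example.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_command(base_args, partition):
    """Return the argument list for one Kitchen run.

    base_args: caller-supplied job/repository arguments (elided in the article).
    partition: the per-instance argument, such as "P1" or "P2".
    """
    return ["kitchen.bat", *base_args, "/user:admin", "/level:basic", partition]


def launch_all(base_args, partitions, runner=subprocess.run):
    """Run one Kitchen process per partition at the same time.

    runner is injectable so the commands can be inspected without executing them.
    Results are returned in the same order as partitions.
    """
    commands = [build_command(base_args, p) for p in partitions]
    # One worker thread per command; each thread just waits on its process.
    with ThreadPoolExecutor(max_workers=max(1, len(commands))) as pool:
        return list(pool.map(runner, commands))


if __name__ == "__main__":
    # Dry run: print the commands instead of starting Kitchen.
    # "<job-args>" is a placeholder, not a real Kitchen option.
    for line in launch_all(["<job-args>"], ["P1", "P2"], runner=" ".join):
        print(line)
```

Adding more entries to the partition list corresponds to increasing the number of parallel processes, provided the job itself uses the argument to select a distinct subset of files.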