
A Guide to Handling XML Files Using Pentaho Kettle (PDI)

Author: SPEC INDIA
Posted: September 20, 2013
Updated: April 27th, 2023

Pentaho Data Integration is a powerful tool for ETL processes. When performance matters while processing data, one needs to consider all the alternatives and arrive at one or more solutions.

Pentaho Data Integration has multiple steps for processing an XML file, such as:

  • Get Data From XML
  • Streaming XML Input
  • XML Input Stream (StAX)

All of these steps can read and process XML data. They differ in how well they perform on large volumes, and a solution to that performance problem is described at the end of this article.

Generally, when users rely on the Get Data From XML step, everything works fine as long as the XML files are small and few in number. As the size and number of files increase, performance problems are likely to occur.

Performance Approach:

1. Users generally start with the Get Data From XML step.

2. Next, increase the number of copies of the step to speed up processing. This yields only a modest gain. Finally, one can try the parallel approach provided by Pentaho, but that does not help much either.

The results of all these approaches are summarized below:

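| Approach                                 | File Size | No. of Files | Processing Time/File | Total Time |
|------------------------------------------|-----------|--------------|----------------------|------------|
| 1. Get Data From XML                     | 1 MB      | 160000       | 13 sec               | 577 hours  |
| 2. Increase No. of Copies                | 1 MB      | 160000       | 9 sec                | 400 hours  |
| 3. PDI parallel approach                 | 1 MB      | 160000       | 7 sec                | 311 hours  |
| 4. Custom Solution (Leveling)            | 1 MB      | 160000       | 1 sec                | 44 hours   |
| 5. Custom with Parallel (SPEC Solution)  | 1 MB      | 160000       | 0.54 sec             | <24 hours  |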
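The total times in the comparison above follow from multiplying the per-file time by the number of files. A minimal Python sketch of that arithmetic is shown here; the dictionary and function names are illustrative, and it assumes every file takes the same time to process.

```python
# Per-file processing times in seconds, taken from the comparison above.
SECONDS_PER_FILE = {
    "Get Data From XML": 13,
    "Increase No. of Copies": 9,
    "PDI parallel approach": 7,
    "Custom Solution (Leveling)": 1,
    "Custom with Parallel (SPEC Solution)": 0.54,
}
NUM_FILES = 160_000


def total_hours(seconds_per_file, num_files=NUM_FILES):
    """Total hours, assuming every file takes the same time."""
    return seconds_per_file * num_files / 3600


for approach, seconds in SECONDS_PER_FILE.items():
    print(f"{approach}: {total_hours(seconds):.1f} hours")
```

The reported totals appear to be these values rounded down to whole hours, with the last approach landing at about 24 hours.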

Custom Approach:

One can instead use the XML Input Stream (StAX) step to read data from the files and load it into the data mart.

The XML Input Stream (StAX) step uses a completely different approach to handle very big and complex data structures and the need for very fast data loads.

This step is chosen because it is fast and its memory use does not depend on file size, which matters when files are very large. The step is also flexible and can read different parts of an XML file in different ways.

This step splits the file into the various levels of its XML tags and then loads it.
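To illustrate the idea of streaming a file and tagging each element with its level, here is a minimal Python sketch. It is an analogue, not PDI's implementation: it uses the standard library's `xml.etree.ElementTree.iterparse`, the function name `stream_rows` and the sample document are hypothetical, and counting the root element as level 1 is a convention chosen for this example.

```python
import io
import xml.etree.ElementTree as ET


def stream_rows(source):
    """Yield (level, path, text) for each element once it has been read."""
    path = []  # tags of the elements that are currently open
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            path.append(elem.tag)
        else:
            # elem.text is only reliable on the "end" event.
            text = (elem.text or "").strip()
            yield len(path), "/" + "/".join(path), text
            path.pop()
            elem.clear()  # drop the element's text and children to keep memory low


# Hypothetical sample document, for illustration only.
SAMPLE = b"<orders><order id='1'><item>pen</item><qty>2</qty></order></orders>"

rows = list(stream_rows(io.BytesIO(SAMPLE)))
for row in rows:
    print(row)
```

Each row carries its nesting level and path, so downstream logic can regroup rows by level without holding the whole document in memory.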

 

Using the output of this step, one can then regroup the various tags and populate them into the FACT and DIM tables.

This makes fast loading feasible, but the need for still more performance led to a more efficient solution:

Smart Solution by SPEC INDIA:

1. SPEC INDIA developed a transformation to load the XML files.
2. The transformation accepts an argument, so one or more instances of it can run in parallel.
3. SPEC INDIA launched the job from multiple command prompts, passing a different argument to each:

e.g.
kitchen.bat ……../Phase2/Job” /user:admin /level:basic P1

kitchen.bat ……../Phase2/Job” /user:admin /level:basic P2
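Launching one instance per argument, as in the commands above, could also be scripted rather than typed into separate command prompts. The following Python sketch shows one way to do that under stated assumptions: `launch_parallel` is a hypothetical helper, and `DEMO_COMMAND` is a stand-in that only prints its argument. In a real run it would be replaced by the full kitchen.bat invocation, which is not reproduced here because the job path is elided in the original.

```python
import subprocess
import sys


def launch_parallel(base_command, partitions):
    """Start one process per partition label, then wait for all of them."""
    procs = [subprocess.Popen(base_command + [label]) for label in partitions]
    return [proc.wait() for proc in procs]


# Stand-in for the real kitchen.bat command line (assumption for this demo):
# a Python one-liner that prints the partition argument it receives.
DEMO_COMMAND = [
    sys.executable,
    "-c",
    "import sys; print('processing', sys.argv[1])",
]

exit_codes = launch_parallel(DEMO_COMMAND, ["P1", "P2"])
print(exit_codes)
```

All processes are started before any is waited on, so they run concurrently; an exit code of 0 for each indicates that every instance finished without error.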

Conclusion:

With this parallel approach, the total processing time is reduced to less than 24 hours. Performance can be improved further by increasing the number of parallel processes.

