A Guide To Handling XML File Using Pentaho Kettle (PDI)

Pentaho Data Integration is a powerful tool for ETL processes. When performance matters while processing data, one needs to consider all the alternatives and settle on one or more solutions.

Pentaho Data Integration has multiple steps for processing an XML file, such as:

  • Get Data From XML
  • Streaming XML Input
  • XML Input Stream (StAX)

All of these steps can read and process XML data.

When users rely on the Get Data From XML step, everything generally works well as long as the XML files are small and few in number. As the size and number of files increase, performance problems are likely to occur. A solution to this problem is described at the end of the article.

Performance Approach:

1. Users generally start with a single Get Data From XML step.

2. Next, try running more copies of the step to improve performance.

3. Finally, one can try the parallel approach provided by Pentaho.

None of these approaches gains much. Even the parallel approach does not help a great deal.

The results for all of the approaches are summarized below:

| Approach | File Size | No. of Files | Processing Time/File | Total Time |
|---|---|---|---|---|
| 1. Get Data From XML | 1 MB | 160000 | 13 sec | 577 hours |
| 2. Increase No. of Copies | 1 MB | 160000 | 9 sec | 400 hours |
| 3. PDI Parallel Approach | 1 MB | 160000 | 7 sec | 311 hours |
| 4. Custom Solution (Leveling) | 1 MB | 160000 | 1 sec | 44 hours |
| 5. Custom with Parallel (SPEC Solution) | 1 MB | 160000 | 0.54 sec | <24 hours |
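The Total Time figures follow from the file count and the per-file time. A minimal sketch of that arithmetic in Python is shown here; the function name is hypothetical, and the per-file times appear to be rounded, so the last row comes out at roughly 24 hours rather than strictly under it.

```python
def total_hours(file_count, seconds_per_file):
    """Total processing time in hours when files are processed one after another."""
    return file_count * seconds_per_file / 3600


# Per-file times taken from the results table (seconds per 1 MB file).
approaches = [
    ("Get Data From XML", 13),
    ("Increase No. of Copies", 9),
    ("PDI Parallel Approach", 7),
    ("Custom Solution (Leveling)", 1),
    ("Custom with Parallel", 0.54),
]

for name, seconds in approaches:
    print(f"{name}: {total_hours(160000, seconds):.1f} hours")
```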

Custom Approach:

One can instead use the XML Input Stream (StAX) step to read data from the files and load it into the data mart.

The XML Input Stream (StAX) step uses a completely different approach, aimed at use cases with very large and complex data structures and a need for very fast data loads.

This step was chosen because it processes files quickly and its memory use does not depend on file size, which matters when files are very large. The step is also flexible and can read different parts of an XML file in different ways.

The step breaks the file down by the nesting level of its XML tags and then loads it.
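To illustrate the idea of reading a file as a stream and tagging each element with its level, here is a minimal sketch in Python. It is not the PDI step itself: it uses the standard library's `iterparse` as a stand-in streaming parser, the function name `stream_rows` and the sample element names are hypothetical, and numbering the root element as level 1 is an assumption of this sketch.

```python
import io
import xml.etree.ElementTree as ET


def stream_rows(source):
    """Yield (level, path, text) for each element once its end tag is read."""
    path = []  # tags of the elements that are currently open
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            path.append(elem.tag)
        else:
            text = (elem.text or "").strip()
            yield len(path), "/".join(path), text
            path.pop()
            elem.clear()  # drop the element's content once its row is emitted


# Hypothetical sample document; a real run would pass a file name instead.
sample = b"<orders><order><id>1</id><amount>9.5</amount></order></orders>"
rows = list(stream_rows(io.BytesIO(sample)))
for row in rows:
    print(row)
```

Because each element is cleared as soon as it has been emitted, the parser does not need to hold the content of the whole document in memory at once.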


Using the output of this step, one can then re-group the various tags and populate the FACT and DIM tables with them.
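The re-grouping can be pictured as collecting the leaf values that belong to one record element and then splitting them between a fact table and a dimension table. The sketch below is only an illustration in Python: the row shape `(level, path, text)`, the element names, and the column names are all assumptions, not the layout used in the original transformation.

```python
def group_records(rows, record_path):
    """Collect the leaf values under each record element into one dict."""
    current = {}
    prefix = record_path + "/"
    for level, path, text in rows:
        if path.startswith(prefix):
            current[path[len(prefix):]] = text
        elif path == record_path:
            # The record's own row arrives after its children: emit and reset.
            yield current
            current = {}


# Hypothetical rows, listed in the order a streaming reader would close tags.
rows = [
    (3, "orders/order/id", "1"),
    (3, "orders/order/customer", "C7"),
    (3, "orders/order/amount", "9.5"),
    (2, "orders/order", ""),
    (3, "orders/order/id", "2"),
    (3, "orders/order/customer", "C9"),
    (3, "orders/order/amount", "20"),
    (2, "orders/order", ""),
    (1, "orders", ""),
]

records = list(group_records(rows, "orders/order"))

# Fact rows carry the measures; the dimension holds the distinct customers.
facts = [
    {"order_id": r["id"], "customer_id": r["customer"], "amount": float(r["amount"])}
    for r in records
]
dim_customers = sorted({r["customer"] for r in records})
```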

This makes it feasible to populate the data quickly, but the need for still better performance led to a more efficient solution:

Smart Solution by SPEC INDIA:

1. SPEC INDIA developed a transformation for loading the XML files.
2. This transformation accepts an argument, so one or more instances of it can run in parallel.
3. SPEC INDIA launched the job from multiple command prompts, passing a different argument to each:

e.g.
kitchen.bat ……../Phase2/Job” /user:admin /level:basic P1

kitchen.bat ……../Phase2/Job” /user:admin /level:basic P2
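Instead of opening several command prompts by hand, the same launch pattern could be scripted. The following is a minimal sketch in Python, under the assumption that each argument (P1, P2, ...) makes the job work on its own share of the files; the source does not say how that split is done. The kitchen command is passed in by the caller because its full path is not given here, and the function names are hypothetical.

```python
import subprocess


def build_commands(base_command, instances):
    """Return one command per instance, each ending in its own P<n> argument."""
    return [list(base_command) + [f"P{i}"] for i in range(1, instances + 1)]


def launch_all(commands):
    """Start every command at once, then wait for all of them to finish."""
    processes = [subprocess.Popen(command) for command in commands]
    return [process.wait() for process in processes]


# Placeholder base command: replace the second item with the real job options.
commands = build_commands(["kitchen.bat", "<job options>"], 2)
for command in commands:
    print(command)
```

Calling `launch_all(commands)` would start all instances together and return their exit codes, so a failed instance can be spotted and rerun on its own.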

Conclusion:

With this parallel approach, total processing time is reduced to less than 24 hours. Performance can be improved further by increasing the number of parallel processes.

Author: SPEC INDIA

