I am working with large xml files (ranging from 100mb to 1GB+), thus it is not really possible to load them into memory.. and the goal is to convert these files to a csv format, but pipe (|) delimited.
The XML file looks like this;
<?xml version="1.0" encoding="UTF-8" ?>
- <merch_item_feed>
- <item_data>
- <item_basic_data>
<item_unique_id>0115526102
</item_uni
que_id>
<item_ean>9780115526107</i
tem_ean>
<item_sku>0115526102</item
_sku>
<item_upc />
<item_mpn />
<item_brand>Stationery Office Books</item_brand>
<item_name>The Official Learning to Drive Pack (Driving Skills)</item_name>
<item_model />
<item_category>Book</item_
category>
<item_short_desc>Paperback
, Stationery Office Books</item_short_desc>
<item_page_url>
http://www.amazon.co.uk/exec/obidos/ASIN/0115526102/AssocID/ref=nosim</
item_page_
url>
<amzn_page_url>
http://www.amazon.co.uk/exec/obidos/ASIN/0115526102/AssocID/ref=nosim</
amzn_page_
url>
<offer_page_url>
http://www.amazon.co.uk/o/redirect?tag=AssocID&link_code=asm&path=tg/detail/offer-listing/-/0115526102/new/ASIN/0115526102&camp=1634&creative=6738</of
fer_page_u
rl>
<offer_used_url>
http://www.amazon.co.uk/o/redirect?tag=AssocID&link_code=asm&path=tg/detail/offer-listing/-/0115526102/used/ASIN/0115526102&camp=1634&creative=6738</o
ffer_used_
url>
<item_image_url>
http://ec1.images-amazon.com/images/I/31DY01WGJ1L.jpg</item_
image_url>
<item_image_url_small>
http://ec1.images-amazon.com/images/I/11FE11DK7YL.jpg<
/item_imag
e_url_smal
l>
<item_salesrank>233499</it
em_salesra
nk>
<item_price>21.23</item_pr
ice>
<item_inventory>Usually dispatched within 1-2 business days</item_inventory>
<item_shipping_charge>Chec
k Site.</item_shipping_charg
e>
<amzn_price>24.99</amzn_pr
ice>
<amzn_inventory>Usually dispatched within 24 hours</amzn_inventory>
<amzn_shipping_charge>Free
!</amzn_sh
ipping_cha
rge>
<fm_price>24.99</fm_price>
<fm_inventory>Usually dispatched within 24 hours</fm_inventory>
<fm_shipping_charge>Free!<
/fm_shippi
ng_charge>
<tp_new_price>21.23</tp_ne
w_price>
<tp_new_inventory>Usually dispatched within 1-2 business days</tp_new_inventory>
<tp_new_shipping_charge>Ch
eck Site.</tp_new_shipping_cha
rge>
<tp_used_price>20.00</tp_u
sed_price>
<tp_used_inventory>In Stock</tp_used_inventory>
<tp_used_shipping_charge>C
heck Site.</tp_used_shipping_ch
arge>
</item_basic_data>
- <prod_specific_data category="book">
<known_attr_val_pair attr="book_author" val="Driving Standards Agency" />
<known_attr_val_pair attr="book_isbn" val="0115526102" />
<known_attr_val_pair attr="book_format" val="Paperback" />
</prod_specific_data>
- <merch_cat_list>
- <merch_cat_item>
<merch_cat_name>277082</me
rch_cat_na
me>
<merch_cat_path>Books/Subj
ects/Refer
ence/Trans
port/Autom
otive/Driv
ing & the Highway Code</merch_cat_path>
</merch_cat_item>
- <merch_cat_item>
<merch_cat_name>278131</me
rch_cat_na
me>
<merch_cat_path>Books/Subj
ects/Scien
ce & Nature/Engineering & Technology/Civil Engineering/Road & Transport</merch_cat_path>
</merch_cat_item>
- <merch_cat_item>
<merch_cat_name>10834521</
merch_cat_
name>
<merch_cat_path>Books/Spec
ial Features/34% off Books over £10/Science & Nature</merch_cat_path>
</merch_cat_item>
- <merch_cat_item>
<merch_cat_name>10834491</
merch_cat_
name>
<merch_cat_path>Books/Spec
ial Features/34% off Books over £10/Reference & Languages</merch_cat_path>
</merch_cat_item>
</merch_cat_list>
</item_data>
- <item_data>
- <item_basic_data>
<item_unique_id>0115528423
</item_uni
que_id>
<item_ean>9780115528422</i
tem_ean>
<item_sku>0115528423</item
_sku>
<item_upc />
<item_mpn />
<item_brand>The Stationary Office (TSO)</item_brand>
<item_name>The Official DSA Theory Test for Motorcyclists CD-ROM</item_name>
<item_model />
<item_category>Software</i
tem_catego
ry>
<item_short_desc>, Platforms: Windows XP</item_short_desc>
<item_page_url>
http://www.amazon.co.uk/exec/obidos/ASIN/0115528423/AssocID/ref=nosim</
item_page_
url>
<amzn_page_url>
http://www.amazon.co.uk/exec/obidos/ASIN/0115528423/AssocID/ref=nosim</
amzn_page_
url>
<offer_page_url>
http://www.amazon.co.uk/o/redirect?tag=AssocID&link_code=asm&path=tg/detail/offer-listing/-/0115528423/new/ASIN/0115528423&camp=1634&creative=6738</of
fer_page_u
rl>
<offer_used_url>
http://www.amazon.co.uk/o/redirect?tag=AssocID&link_code=asm&path=tg/detail/offer-listing/-/0115528423/used/ASIN/0115528423&camp=1634&creative=6738</o
ffer_used_
url>
<item_image_url>
http://ec1.images-amazon.com/images/I/31Au9yM7IZL.jpg</item_
image_url>
<item_image_url_small>
http://ec1.images-amazon.com/images/I/11ldivFAIML.jpg<
/item_imag
e_url_smal
l>
<item_salesrank>1068</item
_salesrank
>
<item_price>16.99</item_pr
ice>
<item_inventory>Not yet released</item_inventory>
<item_shipping_charge>Free
!</item_sh
ipping_cha
rge>
<amzn_price>16.99</amzn_pr
ice>
<amzn_inventory>Not yet released</amzn_inventory>
<amzn_shipping_charge>Free
!</amzn_sh
ipping_cha
rge>
<fm_price>16.99</fm_price>
<fm_inventory>Not yet released</fm_inventory>
<fm_shipping_charge>Free!<
/fm_shippi
ng_charge>
</item_basic_data>
- <prod_specific_data category="software">
<known_attr_val_pair attr="hardware_platform" val="PC" />
<known_attr_val_pair attr="software_os" val="Windows XP" />
<known_attr_val_pair attr="software_format" val="CD-ROM" />
</prod_specific_data>
- <merch_cat_list>
- <merch_cat_item>
<merch_cat_name>277082</me
rch_cat_na
me>
<merch_cat_path>Books/Subj
ects/Refer
ence/Trans
port/Autom
otive/Driv
ing & the Highway Code</merch_cat_path>
</merch_cat_item>
- <merch_cat_item>
<merch_cat_name>278131</me
rch_cat_na
me>
<merch_cat_path>Books/Subj
ects/Scien
ce & Nature/Engineering & Technology/Civil Engineering/Road & Transport</merch_cat_path>
</merch_cat_item>
- <merch_cat_item>
<merch_cat_name>912026</me
rch_cat_na
me>
<merch_cat_path>Software/C
ategories/
Hobbies & Pastimes/Driving Tests</merch_cat_path>
</merch_cat_item>
- <merch_cat_item>
<merch_cat_name>16305411</
merch_cat_
name>
<merch_cat_path>Software/C
ategories/
Hobbies & Pastimes/All Hobbies & Pastimes</merch_cat_path>
</merch_cat_item>
- <merch_cat_item>
<merch_cat_name>317243011<
/merch_cat
_name>
<merch_cat_path>Software/C
ategories/
Digital Imaging/Creativity Software</merch_cat_path>
</merch_cat_item>
- <merch_cat_item>
<merch_cat_name>341610011<
/merch_cat
_name>
<merch_cat_path>uk-shops/E
ducation Resources/Software/Driving
Tests</merch_cat_path>
</merch_cat_item>
</merch_cat_list>
</item_data>
- </merchitemfeed>
Objective is to extract the data from the 'item_basic_data' elements and separate them by pipe character.
Output should look something like (with the field headers);
item_unique_id|item_ean|it
eam_upc
12345678901|12345678|12345
678
12345678901|12345678|12345
678
12345678901|12345678|12345
678
--------------------------
----------
------
Please note that only the information from 'item_basic_data' needs to be extracted - instructions on how to accomplish this is sufficient as an answer. However, if you know your stuff, I would appreciate a solution that could extract the first instance of 'merch_cat_path'. If you notice, each 'item_basic_data' has 4 or 5 duplicate elements of 'merch_cat_path', but we only want the first instance if possible.
I am assuming we will need some xslt file, but I don't know how to write it. I am experimenting with a program that will do the processing of the input xml, transform xslt, and output csv files, but it does not supply the xslt file itself.
Also, if you have any suggestions for similar programs that can handle & process large xml files - preferably freeware, but commercial is ok too.
Start Free Trial