Parsing XML Data with Python

up vote
2
down vote

favorite

actually I am working on a small project and need to parse public available XML data. My goal is to write the data to an mysql database for further processing.

XML Data Link: http://offenedaten.frankfurt.de/dataset/912fe0ab-8976-4837-b591-57dbf163d6e5/resource/48378186-5732-41f3-9823-9d1938f2695e/download/parkdatendyn.xml

XML structure (example):

<parkingAreaStatus>

      <parkingAreaOccupancy>0.2533602</parkingAreaOccupancy>

      <parkingAreaOccupancyTrend>stable</parkingAreaOccupancyTrend>

      <parkingAreaReference targetClass="ParkingArea" id="2[Zeil]" 

       version="1.0"/>

      <parkingAreaStatusTime>2018-02-

       04T01:30:00.000+01:00</parkingAreaStatusTime  

      </parkingAreaStatus>



<parkingAreaStatus>

      <parkingAreaOccupancy>0.34625</parkingAreaOccupancy>

      <parkingAreaOccupancyTrend>stable</parkingAreaOccupancyTrend>

      <parkingAreaReference targetClass="ParkingArea" id="5[Dom / Römer]" 

       version="1.0"/>

      </parkingAreaStatus>

Using the code

import csv

import pymysql

import urllib.request



url = "http://offenedaten.frankfurt.de/dataset/912fe0ab-8976-4837-b591-57dbf163d6e5/resource/48378186-5732-41f3-9823-9d1938f2695e/download/parkdatendyn.xml"



from lxml.objectify import parse

from lxml import etree

from urllib.request import urlopen

locations_root = parse(urlopen(url)).getroot()





locations = list(locations_root.payloadPublication.genericPublicationExtension.parkingFacilityTableStatusPublication.parkingAreaStatus.parkingAreaReference)



print(*locations)

I expected to get a list of all "parkingAreaReference" entries within the XML document. Unfortunately the list is empty.

Playing arround with some code I got the sentiment that only the first block is parsed, I was able to fill the list with the value of "parkingAreaOccupancy" of the "parkingAreaReference" id="2[Zeil]" block by using the code

locations = list(locations_root.payloadPublication.genericPublicationExtension.parkingFacilityTableStatusPublication.parkingAreaStatus.parkingAreaOccupancy)



 print(*locations)

-> 0.2533602

which is not the expected outcome

-> 0.2533602
-> 0.34625

MY question is:

What is the best way to get a matrix i can further work with of all blocks incl. the corresponding values stated in the XML document?

Example output:

A = [[ID:2[Zeil],0.2533602,stable,2018-02-

   04T01:30:00.000+01:00],[id="5[Dom / Römer],0.34625,stable,2018-02-

   04T01:30:00.000+01:00]]

or in general

A = [parkingAreaOccupancy,parkingAreaOccupancyTrend,parkingAreaStatusTime,....],[parkingAreaOccupancy,parkingAreaOccupancyTrend,parkingAreaStatusTime,.....]

After hours of research I hope for some tips from your site

Thank you in advance,

edited Feb 4 at 1:22

asked Feb 4 at 1:01

theamokdog

245

add a comment |

up vote
2
down vote

favorite

actually I am working on a small project and need to parse public available XML data. My goal is to write the data to an mysql database for further processing.

XML Data Link: http://offenedaten.frankfurt.de/dataset/912fe0ab-8976-4837-b591-57dbf163d6e5/resource/48378186-5732-41f3-9823-9d1938f2695e/download/parkdatendyn.xml

XML structure (example):

<parkingAreaStatus>

      <parkingAreaOccupancy>0.2533602</parkingAreaOccupancy>

      <parkingAreaOccupancyTrend>stable</parkingAreaOccupancyTrend>

      <parkingAreaReference targetClass="ParkingArea" id="2[Zeil]" 

       version="1.0"/>

      <parkingAreaStatusTime>2018-02-

       04T01:30:00.000+01:00</parkingAreaStatusTime  

      </parkingAreaStatus>



<parkingAreaStatus>

      <parkingAreaOccupancy>0.34625</parkingAreaOccupancy>

      <parkingAreaOccupancyTrend>stable</parkingAreaOccupancyTrend>

      <parkingAreaReference targetClass="ParkingArea" id="5[Dom / Römer]" 

       version="1.0"/>

      </parkingAreaStatus>

Using the code

import csv

import pymysql

import urllib.request



url = "http://offenedaten.frankfurt.de/dataset/912fe0ab-8976-4837-b591-57dbf163d6e5/resource/48378186-5732-41f3-9823-9d1938f2695e/download/parkdatendyn.xml"



from lxml.objectify import parse

from lxml import etree

from urllib.request import urlopen

locations_root = parse(urlopen(url)).getroot()





locations = list(locations_root.payloadPublication.genericPublicationExtension.parkingFacilityTableStatusPublication.parkingAreaStatus.parkingAreaReference)



print(*locations)

I expected to get a list of all "parkingAreaReference" entries within the XML document. Unfortunately the list is empty.

locations = list(locations_root.payloadPublication.genericPublicationExtension.parkingFacilityTableStatusPublication.parkingAreaStatus.parkingAreaOccupancy)



 print(*locations)

-> 0.2533602

which is not the expected outcome

-> 0.2533602
-> 0.34625

MY question is:

What is the best way to get a matrix i can further work with of all blocks incl. the corresponding values stated in the XML document?

Example output:

A = [[ID:2[Zeil],0.2533602,stable,2018-02-

   04T01:30:00.000+01:00],[id="5[Dom / Römer],0.34625,stable,2018-02-

   04T01:30:00.000+01:00]]

or in general

A = [parkingAreaOccupancy,parkingAreaOccupancyTrend,parkingAreaStatusTime,....],[parkingAreaOccupancy,parkingAreaOccupancyTrend,parkingAreaStatusTime,.....]

After hours of research I hope for some tips from your site

Thank you in advance,

edited Feb 4 at 1:22

asked Feb 4 at 1:01

theamokdog

245

add a comment |

up vote
2
down vote

favorite

actually I am working on a small project and need to parse public available XML data. My goal is to write the data to an mysql database for further processing.

XML Data Link: http://offenedaten.frankfurt.de/dataset/912fe0ab-8976-4837-b591-57dbf163d6e5/resource/48378186-5732-41f3-9823-9d1938f2695e/download/parkdatendyn.xml

XML structure (example):

<parkingAreaStatus>

      <parkingAreaOccupancy>0.2533602</parkingAreaOccupancy>

      <parkingAreaOccupancyTrend>stable</parkingAreaOccupancyTrend>

      <parkingAreaReference targetClass="ParkingArea" id="2[Zeil]" 

       version="1.0"/>

      <parkingAreaStatusTime>2018-02-

       04T01:30:00.000+01:00</parkingAreaStatusTime  

      </parkingAreaStatus>



<parkingAreaStatus>

      <parkingAreaOccupancy>0.34625</parkingAreaOccupancy>

      <parkingAreaOccupancyTrend>stable</parkingAreaOccupancyTrend>

      <parkingAreaReference targetClass="ParkingArea" id="5[Dom / Römer]" 

       version="1.0"/>

      </parkingAreaStatus>

Using the code

import csv

import pymysql

import urllib.request



url = "http://offenedaten.frankfurt.de/dataset/912fe0ab-8976-4837-b591-57dbf163d6e5/resource/48378186-5732-41f3-9823-9d1938f2695e/download/parkdatendyn.xml"



from lxml.objectify import parse

from lxml import etree

from urllib.request import urlopen

locations_root = parse(urlopen(url)).getroot()





locations = list(locations_root.payloadPublication.genericPublicationExtension.parkingFacilityTableStatusPublication.parkingAreaStatus.parkingAreaReference)



print(*locations)

I expected to get a list of all "parkingAreaReference" entries within the XML document. Unfortunately the list is empty.

locations = list(locations_root.payloadPublication.genericPublicationExtension.parkingFacilityTableStatusPublication.parkingAreaStatus.parkingAreaOccupancy)



 print(*locations)

-> 0.2533602

which is not the expected outcome

-> 0.2533602
-> 0.34625

MY question is:

What is the best way to get a matrix i can further work with of all blocks incl. the corresponding values stated in the XML document?

Example output:

A = [[ID:2[Zeil],0.2533602,stable,2018-02-

   04T01:30:00.000+01:00],[id="5[Dom / Römer],0.34625,stable,2018-02-

   04T01:30:00.000+01:00]]

or in general

A = [parkingAreaOccupancy,parkingAreaOccupancyTrend,parkingAreaStatusTime,....],[parkingAreaOccupancy,parkingAreaOccupancyTrend,parkingAreaStatusTime,.....]

After hours of research I hope for some tips from your site

Thank you in advance,

edited Feb 4 at 1:22

asked Feb 4 at 1:01

theamokdog

245

actually I am working on a small project and need to parse public available XML data. My goal is to write the data to an mysql database for further processing.

XML Data Link: http://offenedaten.frankfurt.de/dataset/912fe0ab-8976-4837-b591-57dbf163d6e5/resource/48378186-5732-41f3-9823-9d1938f2695e/download/parkdatendyn.xml

XML structure (example):

<parkingAreaStatus>

      <parkingAreaOccupancy>0.2533602</parkingAreaOccupancy>

      <parkingAreaOccupancyTrend>stable</parkingAreaOccupancyTrend>

      <parkingAreaReference targetClass="ParkingArea" id="2[Zeil]" 

       version="1.0"/>

      <parkingAreaStatusTime>2018-02-

       04T01:30:00.000+01:00</parkingAreaStatusTime  

      </parkingAreaStatus>



<parkingAreaStatus>

      <parkingAreaOccupancy>0.34625</parkingAreaOccupancy>

      <parkingAreaOccupancyTrend>stable</parkingAreaOccupancyTrend>

      <parkingAreaReference targetClass="ParkingArea" id="5[Dom / Römer]" 

       version="1.0"/>

      </parkingAreaStatus>

Using the code

import csv

import pymysql

import urllib.request



url = "http://offenedaten.frankfurt.de/dataset/912fe0ab-8976-4837-b591-57dbf163d6e5/resource/48378186-5732-41f3-9823-9d1938f2695e/download/parkdatendyn.xml"



from lxml.objectify import parse

from lxml import etree

from urllib.request import urlopen

locations_root = parse(urlopen(url)).getroot()





locations = list(locations_root.payloadPublication.genericPublicationExtension.parkingFacilityTableStatusPublication.parkingAreaStatus.parkingAreaReference)



print(*locations)

I expected to get a list of all "parkingAreaReference" entries within the XML document. Unfortunately the list is empty.

locations = list(locations_root.payloadPublication.genericPublicationExtension.parkingFacilityTableStatusPublication.parkingAreaStatus.parkingAreaOccupancy)



 print(*locations)

-> 0.2533602

which is not the expected outcome

-> 0.2533602
-> 0.34625

MY question is:

What is the best way to get a matrix i can further work with of all blocks incl. the corresponding values stated in the XML document?

Example output:

A = [[ID:2[Zeil],0.2533602,stable,2018-02-

   04T01:30:00.000+01:00],[id="5[Dom / Römer],0.34625,stable,2018-02-

   04T01:30:00.000+01:00]]

or in general

A = [parkingAreaOccupancy,parkingAreaOccupancyTrend,parkingAreaStatusTime,....],[parkingAreaOccupancy,parkingAreaOccupancyTrend,parkingAreaStatusTime,.....]

After hours of research I hope for some tips from your site

Thank you in advance,

python xml parsing lxml

edited Feb 4 at 1:22

asked Feb 4 at 1:01

theamokdog

245

edited Feb 4 at 1:22

asked Feb 4 at 1:01

theamokdog

245

edited Feb 4 at 1:22

asked Feb 4 at 1:01

theamokdog

245

asked Feb 4 at 1:01

theamokdog

245

asked Feb 4 at 1:01

theamokdog

245

add a comment |

1 Answer
1

active

oldest

votes

up vote
3
down vote

accepted

You can just use etree directly and find interesting elements using XPath¹ query. One important thing to note is, that your XML has default namespace declared at the root element :

xmlns="http://datex2.eu/schema/2/2_0"

By definition, element where default namespace is declared and all descendant elements without prefix are belong to this default namespace (unless another default namespace found in one of the descendant elements, which is not the case with your XML). This is why we define a prefix d, which references default namespace URI, in the following code, and we use that prefix to find every elements we need to get information from :

root = etree.parse(urlopen(url)).getroot()

ns = { 'd': 'http://datex2.eu/schema/2/2_0' }

parking_area = root.xpath('//d:parkingAreaStatus', namespaces=ns)

for pa in parking_area:

    area_ref = pa.find('d:parkingAreaReference', ns)

    occupancy = pa.find('d:parkingAreaOccupancy', ns)

    trend = pa.find('d:parkingAreaOccupancyTrend', ns)

    status_time = pa.find('d:parkingAreaStatusTime', ns)

    print area_ref.get('id'), occupancy.text, trend.text, status_time.text

Below is the output of the demo code above. Instead of print, you can store these information in whatever data structure you like :

2[Zeil] 0.22177419 stable 2018-02-04T05:16:00.000+01:00

5[Dom / Römer] 0.28625 stable 2018-02-04T05:16:00.000+01:00

1[Anlagenring] 0.257889 stable 2018-02-04T05:16:00.000+01:00

3[Mainzer Landstraße] 0.20594966 stable 2018-02-04T05:16:00.000+01:00

4[Bahnhofsviertel] 0.31513646 stable 2018-02-04T05:16:00.000+01:00

1) some references on XPath :

XPath 1.0 spec: The most trustworthy reference on XPath 1.0

XPath syntax: Gentler introduction to basic XPath expressions

answered Feb 4 at 4:49

har07

74.4k114885

1

Thank you har07 - for the solution and the back ground Infos. People like you make stack overflow to what it is. Happy weekend
– theamokdog
Feb 4 at 14:29

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f48603692%2fparsing-xml-data-with-python%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

1 Answer
1

active

oldest

votes

1 Answer
1

active

oldest

votes

up vote
3
down vote

accepted

You can just use etree directly and find interesting elements using XPath¹ query. One important thing to note is, that your XML has default namespace declared at the root element :

xmlns="http://datex2.eu/schema/2/2_0"

root = etree.parse(urlopen(url)).getroot()

ns = { 'd': 'http://datex2.eu/schema/2/2_0' }

parking_area = root.xpath('//d:parkingAreaStatus', namespaces=ns)

for pa in parking_area:

    area_ref = pa.find('d:parkingAreaReference', ns)

    occupancy = pa.find('d:parkingAreaOccupancy', ns)

    trend = pa.find('d:parkingAreaOccupancyTrend', ns)

    status_time = pa.find('d:parkingAreaStatusTime', ns)

    print area_ref.get('id'), occupancy.text, trend.text, status_time.text

Below is the output of the demo code above. Instead of print, you can store these information in whatever data structure you like :

2[Zeil] 0.22177419 stable 2018-02-04T05:16:00.000+01:00

5[Dom / Römer] 0.28625 stable 2018-02-04T05:16:00.000+01:00

1[Anlagenring] 0.257889 stable 2018-02-04T05:16:00.000+01:00

3[Mainzer Landstraße] 0.20594966 stable 2018-02-04T05:16:00.000+01:00

4[Bahnhofsviertel] 0.31513646 stable 2018-02-04T05:16:00.000+01:00

1) some references on XPath :

XPath 1.0 spec: The most trustworthy reference on XPath 1.0

XPath syntax: Gentler introduction to basic XPath expressions

answered Feb 4 at 4:49

har07

74.4k114885

1

Thank you har07 - for the solution and the back ground Infos. People like you make stack overflow to what it is. Happy weekend
– theamokdog
Feb 4 at 14:29

add a comment |

up vote
3
down vote

accepted

You can just use etree directly and find interesting elements using XPath¹ query. One important thing to note is, that your XML has default namespace declared at the root element :

xmlns="http://datex2.eu/schema/2/2_0"

root = etree.parse(urlopen(url)).getroot()

ns = { 'd': 'http://datex2.eu/schema/2/2_0' }

parking_area = root.xpath('//d:parkingAreaStatus', namespaces=ns)

for pa in parking_area:

    area_ref = pa.find('d:parkingAreaReference', ns)

    occupancy = pa.find('d:parkingAreaOccupancy', ns)

    trend = pa.find('d:parkingAreaOccupancyTrend', ns)

    status_time = pa.find('d:parkingAreaStatusTime', ns)

    print area_ref.get('id'), occupancy.text, trend.text, status_time.text

Below is the output of the demo code above. Instead of print, you can store these information in whatever data structure you like :

2[Zeil] 0.22177419 stable 2018-02-04T05:16:00.000+01:00

5[Dom / Römer] 0.28625 stable 2018-02-04T05:16:00.000+01:00

1[Anlagenring] 0.257889 stable 2018-02-04T05:16:00.000+01:00

3[Mainzer Landstraße] 0.20594966 stable 2018-02-04T05:16:00.000+01:00

4[Bahnhofsviertel] 0.31513646 stable 2018-02-04T05:16:00.000+01:00

1) some references on XPath :

XPath 1.0 spec: The most trustworthy reference on XPath 1.0

XPath syntax: Gentler introduction to basic XPath expressions

answered Feb 4 at 4:49

har07

74.4k114885

1

Thank you har07 - for the solution and the back ground Infos. People like you make stack overflow to what it is. Happy weekend
– theamokdog
Feb 4 at 14:29

add a comment |

up vote
3
down vote

accepted

You can just use etree directly and find interesting elements using XPath¹ query. One important thing to note is, that your XML has default namespace declared at the root element :

xmlns="http://datex2.eu/schema/2/2_0"

root = etree.parse(urlopen(url)).getroot()

ns = { 'd': 'http://datex2.eu/schema/2/2_0' }

parking_area = root.xpath('//d:parkingAreaStatus', namespaces=ns)

for pa in parking_area:

    area_ref = pa.find('d:parkingAreaReference', ns)

    occupancy = pa.find('d:parkingAreaOccupancy', ns)

    trend = pa.find('d:parkingAreaOccupancyTrend', ns)

    status_time = pa.find('d:parkingAreaStatusTime', ns)

    print area_ref.get('id'), occupancy.text, trend.text, status_time.text

Below is the output of the demo code above. Instead of print, you can store these information in whatever data structure you like :

2[Zeil] 0.22177419 stable 2018-02-04T05:16:00.000+01:00

5[Dom / Römer] 0.28625 stable 2018-02-04T05:16:00.000+01:00

1[Anlagenring] 0.257889 stable 2018-02-04T05:16:00.000+01:00

3[Mainzer Landstraße] 0.20594966 stable 2018-02-04T05:16:00.000+01:00

4[Bahnhofsviertel] 0.31513646 stable 2018-02-04T05:16:00.000+01:00

1) some references on XPath :

XPath 1.0 spec: The most trustworthy reference on XPath 1.0

XPath syntax: Gentler introduction to basic XPath expressions

answered Feb 4 at 4:49

har07

74.4k114885

You can just use etree directly and find interesting elements using XPath¹ query. One important thing to note is, that your XML has default namespace declared at the root element :

xmlns="http://datex2.eu/schema/2/2_0"

root = etree.parse(urlopen(url)).getroot()

ns = { 'd': 'http://datex2.eu/schema/2/2_0' }

parking_area = root.xpath('//d:parkingAreaStatus', namespaces=ns)

for pa in parking_area:

    area_ref = pa.find('d:parkingAreaReference', ns)

    occupancy = pa.find('d:parkingAreaOccupancy', ns)

    trend = pa.find('d:parkingAreaOccupancyTrend', ns)

    status_time = pa.find('d:parkingAreaStatusTime', ns)

    print area_ref.get('id'), occupancy.text, trend.text, status_time.text

Below is the output of the demo code above. Instead of print, you can store these information in whatever data structure you like :

2[Zeil] 0.22177419 stable 2018-02-04T05:16:00.000+01:00

5[Dom / Römer] 0.28625 stable 2018-02-04T05:16:00.000+01:00

1[Anlagenring] 0.257889 stable 2018-02-04T05:16:00.000+01:00

3[Mainzer Landstraße] 0.20594966 stable 2018-02-04T05:16:00.000+01:00

4[Bahnhofsviertel] 0.31513646 stable 2018-02-04T05:16:00.000+01:00

1) some references on XPath :

XPath 1.0 spec: The most trustworthy reference on XPath 1.0

XPath syntax: Gentler introduction to basic XPath expressions

answered Feb 4 at 4:49

har07

74.4k114885

answered Feb 4 at 4:49

har07

74.4k114885

answered Feb 4 at 4:49

har07

74.4k114885

answered Feb 4 at 4:49

har07

74.4k114885

1

Thank you har07 - for the solution and the back ground Infos. People like you make stack overflow to what it is. Happy weekend
– theamokdog
Feb 4 at 14:29

add a comment |

1

Thank you har07 - for the solution and the back ground Infos. People like you make stack overflow to what it is. Happy weekend
– theamokdog
Feb 4 at 14:29

Thank you har07 - for the solution and the back ground Infos. People like you make stack overflow to what it is. Happy weekend
– theamokdog
Feb 4 at 14:29

add a comment |

draft saved

draft discarded

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

Xr,wDPYaMIf

搜尋此網誌

Nrthugu