Parsing XML Data with Python
up vote
2
down vote
favorite
actually I am working on a small project and need to parse public available XML data. My goal is to write the data to an mysql database for further processing.
XML Data Link: http://offenedaten.frankfurt.de/dataset/912fe0ab-8976-4837-b591-57dbf163d6e5/resource/48378186-5732-41f3-9823-9d1938f2695e/download/parkdatendyn.xml
XML structure (example):
<parkingAreaStatus>
<parkingAreaOccupancy>0.2533602</parkingAreaOccupancy>
<parkingAreaOccupancyTrend>stable</parkingAreaOccupancyTrend>
<parkingAreaReference targetClass="ParkingArea" id="2[Zeil]"
version="1.0"/>
<parkingAreaStatusTime>2018-02-
04T01:30:00.000+01:00</parkingAreaStatusTime
</parkingAreaStatus>
<parkingAreaStatus>
<parkingAreaOccupancy>0.34625</parkingAreaOccupancy>
<parkingAreaOccupancyTrend>stable</parkingAreaOccupancyTrend>
<parkingAreaReference targetClass="ParkingArea" id="5[Dom / Römer]"
version="1.0"/>
</parkingAreaStatus>
Using the code
import csv
import pymysql
import urllib.request
url = "http://offenedaten.frankfurt.de/dataset/912fe0ab-8976-4837-b591-57dbf163d6e5/resource/48378186-5732-41f3-9823-9d1938f2695e/download/parkdatendyn.xml"
from lxml.objectify import parse
from lxml import etree
from urllib.request import urlopen
locations_root = parse(urlopen(url)).getroot()
locations = list(locations_root.payloadPublication.genericPublicationExtension.parkingFacilityTableStatusPublication.parkingAreaStatus.parkingAreaReference)
print(*locations)
I expected to get a list of all "parkingAreaReference" entries within the XML document. Unfortunately the list is empty.
Playing arround with some code I got the sentiment that only the first block is parsed, I was able to fill the list with the value of "parkingAreaOccupancy" of the "parkingAreaReference" id="2[Zeil]" block by using the code
locations = list(locations_root.payloadPublication.genericPublicationExtension.parkingFacilityTableStatusPublication.parkingAreaStatus.parkingAreaOccupancy)
print(*locations)
-> 0.2533602
which is not the expected outcome
-> 0.2533602
-> 0.34625
MY question is:
What is the best way to get a matrix i can further work with of all blocks incl. the corresponding values stated in the XML document?
Example output:
A = [[ID:2[Zeil],0.2533602,stable,2018-02-
04T01:30:00.000+01:00],[id="5[Dom / Römer],0.34625,stable,2018-02-
04T01:30:00.000+01:00]]
or in general
A = [parkingAreaOccupancy,parkingAreaOccupancyTrend,parkingAreaStatusTime,....],[parkingAreaOccupancy,parkingAreaOccupancyTrend,parkingAreaStatusTime,.....]
After hours of research I hope for some tips from your site
Thank you in advance,
TR
python xml parsing lxml
add a comment |
up vote
2
down vote
favorite
actually I am working on a small project and need to parse public available XML data. My goal is to write the data to an mysql database for further processing.
XML Data Link: http://offenedaten.frankfurt.de/dataset/912fe0ab-8976-4837-b591-57dbf163d6e5/resource/48378186-5732-41f3-9823-9d1938f2695e/download/parkdatendyn.xml
XML structure (example):
<parkingAreaStatus>
<parkingAreaOccupancy>0.2533602</parkingAreaOccupancy>
<parkingAreaOccupancyTrend>stable</parkingAreaOccupancyTrend>
<parkingAreaReference targetClass="ParkingArea" id="2[Zeil]"
version="1.0"/>
<parkingAreaStatusTime>2018-02-
04T01:30:00.000+01:00</parkingAreaStatusTime
</parkingAreaStatus>
<parkingAreaStatus>
<parkingAreaOccupancy>0.34625</parkingAreaOccupancy>
<parkingAreaOccupancyTrend>stable</parkingAreaOccupancyTrend>
<parkingAreaReference targetClass="ParkingArea" id="5[Dom / Römer]"
version="1.0"/>
</parkingAreaStatus>
Using the code
import csv
import pymysql
import urllib.request
url = "http://offenedaten.frankfurt.de/dataset/912fe0ab-8976-4837-b591-57dbf163d6e5/resource/48378186-5732-41f3-9823-9d1938f2695e/download/parkdatendyn.xml"
from lxml.objectify import parse
from lxml import etree
from urllib.request import urlopen
locations_root = parse(urlopen(url)).getroot()
locations = list(locations_root.payloadPublication.genericPublicationExtension.parkingFacilityTableStatusPublication.parkingAreaStatus.parkingAreaReference)
print(*locations)
I expected to get a list of all "parkingAreaReference" entries within the XML document. Unfortunately the list is empty.
Playing arround with some code I got the sentiment that only the first block is parsed, I was able to fill the list with the value of "parkingAreaOccupancy" of the "parkingAreaReference" id="2[Zeil]" block by using the code
locations = list(locations_root.payloadPublication.genericPublicationExtension.parkingFacilityTableStatusPublication.parkingAreaStatus.parkingAreaOccupancy)
print(*locations)
-> 0.2533602
which is not the expected outcome
-> 0.2533602
-> 0.34625
MY question is:
What is the best way to get a matrix i can further work with of all blocks incl. the corresponding values stated in the XML document?
Example output:
A = [[ID:2[Zeil],0.2533602,stable,2018-02-
04T01:30:00.000+01:00],[id="5[Dom / Römer],0.34625,stable,2018-02-
04T01:30:00.000+01:00]]
or in general
A = [parkingAreaOccupancy,parkingAreaOccupancyTrend,parkingAreaStatusTime,....],[parkingAreaOccupancy,parkingAreaOccupancyTrend,parkingAreaStatusTime,.....]
After hours of research I hope for some tips from your site
Thank you in advance,
TR
python xml parsing lxml
add a comment |
up vote
2
down vote
favorite
up vote
2
down vote
favorite
actually I am working on a small project and need to parse public available XML data. My goal is to write the data to an mysql database for further processing.
XML Data Link: http://offenedaten.frankfurt.de/dataset/912fe0ab-8976-4837-b591-57dbf163d6e5/resource/48378186-5732-41f3-9823-9d1938f2695e/download/parkdatendyn.xml
XML structure (example):
<parkingAreaStatus>
<parkingAreaOccupancy>0.2533602</parkingAreaOccupancy>
<parkingAreaOccupancyTrend>stable</parkingAreaOccupancyTrend>
<parkingAreaReference targetClass="ParkingArea" id="2[Zeil]"
version="1.0"/>
<parkingAreaStatusTime>2018-02-
04T01:30:00.000+01:00</parkingAreaStatusTime
</parkingAreaStatus>
<parkingAreaStatus>
<parkingAreaOccupancy>0.34625</parkingAreaOccupancy>
<parkingAreaOccupancyTrend>stable</parkingAreaOccupancyTrend>
<parkingAreaReference targetClass="ParkingArea" id="5[Dom / Römer]"
version="1.0"/>
</parkingAreaStatus>
Using the code
import csv
import pymysql
import urllib.request
url = "http://offenedaten.frankfurt.de/dataset/912fe0ab-8976-4837-b591-57dbf163d6e5/resource/48378186-5732-41f3-9823-9d1938f2695e/download/parkdatendyn.xml"
from lxml.objectify import parse
from lxml import etree
from urllib.request import urlopen
locations_root = parse(urlopen(url)).getroot()
locations = list(locations_root.payloadPublication.genericPublicationExtension.parkingFacilityTableStatusPublication.parkingAreaStatus.parkingAreaReference)
print(*locations)
I expected to get a list of all "parkingAreaReference" entries within the XML document. Unfortunately the list is empty.
Playing arround with some code I got the sentiment that only the first block is parsed, I was able to fill the list with the value of "parkingAreaOccupancy" of the "parkingAreaReference" id="2[Zeil]" block by using the code
locations = list(locations_root.payloadPublication.genericPublicationExtension.parkingFacilityTableStatusPublication.parkingAreaStatus.parkingAreaOccupancy)
print(*locations)
-> 0.2533602
which is not the expected outcome
-> 0.2533602
-> 0.34625
MY question is:
What is the best way to get a matrix i can further work with of all blocks incl. the corresponding values stated in the XML document?
Example output:
A = [[ID:2[Zeil],0.2533602,stable,2018-02-
04T01:30:00.000+01:00],[id="5[Dom / Römer],0.34625,stable,2018-02-
04T01:30:00.000+01:00]]
or in general
A = [parkingAreaOccupancy,parkingAreaOccupancyTrend,parkingAreaStatusTime,....],[parkingAreaOccupancy,parkingAreaOccupancyTrend,parkingAreaStatusTime,.....]
After hours of research I hope for some tips from your site
Thank you in advance,
TR
python xml parsing lxml
actually I am working on a small project and need to parse public available XML data. My goal is to write the data to an mysql database for further processing.
XML Data Link: http://offenedaten.frankfurt.de/dataset/912fe0ab-8976-4837-b591-57dbf163d6e5/resource/48378186-5732-41f3-9823-9d1938f2695e/download/parkdatendyn.xml
XML structure (example):
<parkingAreaStatus>
<parkingAreaOccupancy>0.2533602</parkingAreaOccupancy>
<parkingAreaOccupancyTrend>stable</parkingAreaOccupancyTrend>
<parkingAreaReference targetClass="ParkingArea" id="2[Zeil]"
version="1.0"/>
<parkingAreaStatusTime>2018-02-
04T01:30:00.000+01:00</parkingAreaStatusTime
</parkingAreaStatus>
<parkingAreaStatus>
<parkingAreaOccupancy>0.34625</parkingAreaOccupancy>
<parkingAreaOccupancyTrend>stable</parkingAreaOccupancyTrend>
<parkingAreaReference targetClass="ParkingArea" id="5[Dom / Römer]"
version="1.0"/>
</parkingAreaStatus>
Using the code
import csv
import pymysql
import urllib.request
url = "http://offenedaten.frankfurt.de/dataset/912fe0ab-8976-4837-b591-57dbf163d6e5/resource/48378186-5732-41f3-9823-9d1938f2695e/download/parkdatendyn.xml"
from lxml.objectify import parse
from lxml import etree
from urllib.request import urlopen
locations_root = parse(urlopen(url)).getroot()
locations = list(locations_root.payloadPublication.genericPublicationExtension.parkingFacilityTableStatusPublication.parkingAreaStatus.parkingAreaReference)
print(*locations)
I expected to get a list of all "parkingAreaReference" entries within the XML document. Unfortunately the list is empty.
Playing arround with some code I got the sentiment that only the first block is parsed, I was able to fill the list with the value of "parkingAreaOccupancy" of the "parkingAreaReference" id="2[Zeil]" block by using the code
locations = list(locations_root.payloadPublication.genericPublicationExtension.parkingFacilityTableStatusPublication.parkingAreaStatus.parkingAreaOccupancy)
print(*locations)
-> 0.2533602
which is not the expected outcome
-> 0.2533602
-> 0.34625
MY question is:
What is the best way to get a matrix i can further work with of all blocks incl. the corresponding values stated in the XML document?
Example output:
A = [[ID:2[Zeil],0.2533602,stable,2018-02-
04T01:30:00.000+01:00],[id="5[Dom / Römer],0.34625,stable,2018-02-
04T01:30:00.000+01:00]]
or in general
A = [parkingAreaOccupancy,parkingAreaOccupancyTrend,parkingAreaStatusTime,....],[parkingAreaOccupancy,parkingAreaOccupancyTrend,parkingAreaStatusTime,.....]
After hours of research I hope for some tips from your site
Thank you in advance,
TR
python xml parsing lxml
python xml parsing lxml
edited Feb 4 at 1:22
asked Feb 4 at 1:01
theamokdog
245
245
add a comment |
add a comment |
1 Answer
1
active
oldest
votes
up vote
3
down vote
accepted
You can just use etree
directly and find interesting elements using XPath1 query. One important thing to note is, that your XML has default namespace declared at the root element :
xmlns="http://datex2.eu/schema/2/2_0"
By definition, element where default namespace is declared and all descendant elements without prefix are belong to this default namespace (unless another default namespace found in one of the descendant elements, which is not the case with your XML). This is why we define a prefix d
, which references default namespace URI, in the following code, and we use that prefix to find every elements we need to get information from :
root = etree.parse(urlopen(url)).getroot()
ns = { 'd': 'http://datex2.eu/schema/2/2_0' }
parking_area = root.xpath('//d:parkingAreaStatus', namespaces=ns)
for pa in parking_area:
area_ref = pa.find('d:parkingAreaReference', ns)
occupancy = pa.find('d:parkingAreaOccupancy', ns)
trend = pa.find('d:parkingAreaOccupancyTrend', ns)
status_time = pa.find('d:parkingAreaStatusTime', ns)
print area_ref.get('id'), occupancy.text, trend.text, status_time.text
Below is the output of the demo code above. Instead of print
, you can store these information in whatever data structure you like :
2[Zeil] 0.22177419 stable 2018-02-04T05:16:00.000+01:00
5[Dom / Römer] 0.28625 stable 2018-02-04T05:16:00.000+01:00
1[Anlagenring] 0.257889 stable 2018-02-04T05:16:00.000+01:00
3[Mainzer Landstraße] 0.20594966 stable 2018-02-04T05:16:00.000+01:00
4[Bahnhofsviertel] 0.31513646 stable 2018-02-04T05:16:00.000+01:00
1) some references on XPath :
XPath 1.0 spec: The most trustworthy reference on XPath 1.0
XPath syntax: Gentler introduction to basic XPath expressions
1
Thank you har07 - for the solution and the back ground Infos. People like you make stack overflow to what it is. Happy weekend
– theamokdog
Feb 4 at 14:29
add a comment |
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
up vote
3
down vote
accepted
You can just use etree
directly and find interesting elements using XPath1 query. One important thing to note is, that your XML has default namespace declared at the root element :
xmlns="http://datex2.eu/schema/2/2_0"
By definition, element where default namespace is declared and all descendant elements without prefix are belong to this default namespace (unless another default namespace found in one of the descendant elements, which is not the case with your XML). This is why we define a prefix d
, which references default namespace URI, in the following code, and we use that prefix to find every elements we need to get information from :
root = etree.parse(urlopen(url)).getroot()
ns = { 'd': 'http://datex2.eu/schema/2/2_0' }
parking_area = root.xpath('//d:parkingAreaStatus', namespaces=ns)
for pa in parking_area:
area_ref = pa.find('d:parkingAreaReference', ns)
occupancy = pa.find('d:parkingAreaOccupancy', ns)
trend = pa.find('d:parkingAreaOccupancyTrend', ns)
status_time = pa.find('d:parkingAreaStatusTime', ns)
print area_ref.get('id'), occupancy.text, trend.text, status_time.text
Below is the output of the demo code above. Instead of print
, you can store these information in whatever data structure you like :
2[Zeil] 0.22177419 stable 2018-02-04T05:16:00.000+01:00
5[Dom / Römer] 0.28625 stable 2018-02-04T05:16:00.000+01:00
1[Anlagenring] 0.257889 stable 2018-02-04T05:16:00.000+01:00
3[Mainzer Landstraße] 0.20594966 stable 2018-02-04T05:16:00.000+01:00
4[Bahnhofsviertel] 0.31513646 stable 2018-02-04T05:16:00.000+01:00
1) some references on XPath :
XPath 1.0 spec: The most trustworthy reference on XPath 1.0
XPath syntax: Gentler introduction to basic XPath expressions
1
Thank you har07 - for the solution and the back ground Infos. People like you make stack overflow to what it is. Happy weekend
– theamokdog
Feb 4 at 14:29
add a comment |
up vote
3
down vote
accepted
You can just use etree
directly and find interesting elements using XPath1 query. One important thing to note is, that your XML has default namespace declared at the root element :
xmlns="http://datex2.eu/schema/2/2_0"
By definition, element where default namespace is declared and all descendant elements without prefix are belong to this default namespace (unless another default namespace found in one of the descendant elements, which is not the case with your XML). This is why we define a prefix d
, which references default namespace URI, in the following code, and we use that prefix to find every elements we need to get information from :
root = etree.parse(urlopen(url)).getroot()
ns = { 'd': 'http://datex2.eu/schema/2/2_0' }
parking_area = root.xpath('//d:parkingAreaStatus', namespaces=ns)
for pa in parking_area:
area_ref = pa.find('d:parkingAreaReference', ns)
occupancy = pa.find('d:parkingAreaOccupancy', ns)
trend = pa.find('d:parkingAreaOccupancyTrend', ns)
status_time = pa.find('d:parkingAreaStatusTime', ns)
print area_ref.get('id'), occupancy.text, trend.text, status_time.text
Below is the output of the demo code above. Instead of print
, you can store these information in whatever data structure you like :
2[Zeil] 0.22177419 stable 2018-02-04T05:16:00.000+01:00
5[Dom / Römer] 0.28625 stable 2018-02-04T05:16:00.000+01:00
1[Anlagenring] 0.257889 stable 2018-02-04T05:16:00.000+01:00
3[Mainzer Landstraße] 0.20594966 stable 2018-02-04T05:16:00.000+01:00
4[Bahnhofsviertel] 0.31513646 stable 2018-02-04T05:16:00.000+01:00
1) some references on XPath :
XPath 1.0 spec: The most trustworthy reference on XPath 1.0
XPath syntax: Gentler introduction to basic XPath expressions
1
Thank you har07 - for the solution and the back ground Infos. People like you make stack overflow to what it is. Happy weekend
– theamokdog
Feb 4 at 14:29
add a comment |
up vote
3
down vote
accepted
up vote
3
down vote
accepted
You can just use etree
directly and find interesting elements using XPath1 query. One important thing to note is, that your XML has default namespace declared at the root element :
xmlns="http://datex2.eu/schema/2/2_0"
By definition, element where default namespace is declared and all descendant elements without prefix are belong to this default namespace (unless another default namespace found in one of the descendant elements, which is not the case with your XML). This is why we define a prefix d
, which references default namespace URI, in the following code, and we use that prefix to find every elements we need to get information from :
root = etree.parse(urlopen(url)).getroot()
ns = { 'd': 'http://datex2.eu/schema/2/2_0' }
parking_area = root.xpath('//d:parkingAreaStatus', namespaces=ns)
for pa in parking_area:
area_ref = pa.find('d:parkingAreaReference', ns)
occupancy = pa.find('d:parkingAreaOccupancy', ns)
trend = pa.find('d:parkingAreaOccupancyTrend', ns)
status_time = pa.find('d:parkingAreaStatusTime', ns)
print area_ref.get('id'), occupancy.text, trend.text, status_time.text
Below is the output of the demo code above. Instead of print
, you can store these information in whatever data structure you like :
2[Zeil] 0.22177419 stable 2018-02-04T05:16:00.000+01:00
5[Dom / Römer] 0.28625 stable 2018-02-04T05:16:00.000+01:00
1[Anlagenring] 0.257889 stable 2018-02-04T05:16:00.000+01:00
3[Mainzer Landstraße] 0.20594966 stable 2018-02-04T05:16:00.000+01:00
4[Bahnhofsviertel] 0.31513646 stable 2018-02-04T05:16:00.000+01:00
1) some references on XPath :
XPath 1.0 spec: The most trustworthy reference on XPath 1.0
XPath syntax: Gentler introduction to basic XPath expressions
You can just use etree
directly and find interesting elements using XPath1 query. One important thing to note is, that your XML has default namespace declared at the root element :
xmlns="http://datex2.eu/schema/2/2_0"
By definition, element where default namespace is declared and all descendant elements without prefix are belong to this default namespace (unless another default namespace found in one of the descendant elements, which is not the case with your XML). This is why we define a prefix d
, which references default namespace URI, in the following code, and we use that prefix to find every elements we need to get information from :
root = etree.parse(urlopen(url)).getroot()
ns = { 'd': 'http://datex2.eu/schema/2/2_0' }
parking_area = root.xpath('//d:parkingAreaStatus', namespaces=ns)
for pa in parking_area:
area_ref = pa.find('d:parkingAreaReference', ns)
occupancy = pa.find('d:parkingAreaOccupancy', ns)
trend = pa.find('d:parkingAreaOccupancyTrend', ns)
status_time = pa.find('d:parkingAreaStatusTime', ns)
print area_ref.get('id'), occupancy.text, trend.text, status_time.text
Below is the output of the demo code above. Instead of print
, you can store these information in whatever data structure you like :
2[Zeil] 0.22177419 stable 2018-02-04T05:16:00.000+01:00
5[Dom / Römer] 0.28625 stable 2018-02-04T05:16:00.000+01:00
1[Anlagenring] 0.257889 stable 2018-02-04T05:16:00.000+01:00
3[Mainzer Landstraße] 0.20594966 stable 2018-02-04T05:16:00.000+01:00
4[Bahnhofsviertel] 0.31513646 stable 2018-02-04T05:16:00.000+01:00
1) some references on XPath :
XPath 1.0 spec: The most trustworthy reference on XPath 1.0
XPath syntax: Gentler introduction to basic XPath expressions
answered Feb 4 at 4:49
har07
74.4k114885
74.4k114885
1
Thank you har07 - for the solution and the back ground Infos. People like you make stack overflow to what it is. Happy weekend
– theamokdog
Feb 4 at 14:29
add a comment |
1
Thank you har07 - for the solution and the back ground Infos. People like you make stack overflow to what it is. Happy weekend
– theamokdog
Feb 4 at 14:29
1
1
Thank you har07 - for the solution and the back ground Infos. People like you make stack overflow to what it is. Happy weekend
– theamokdog
Feb 4 at 14:29
Thank you har07 - for the solution and the back ground Infos. People like you make stack overflow to what it is. Happy weekend
– theamokdog
Feb 4 at 14:29
add a comment |
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f48603692%2fparsing-xml-data-with-python%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown