I'm new to XML and want to know how to create a DataFrame in Python from this XML file.
<EXTENDEDPROPERTIES>
<DEBTCONFIGURATION>
<row Key="guid" Value="2018438038"/>
<row Key="status" Value="0"/>
<row Key="forma_pago" Value="DEBITO A CUENTA"/>
<row Key="monto" Value="23699.1"/>
<row Key="monto_abono" Value="360.55"/>
<row Key="entidad" Value="BANCO CAPRICHOSO S.A."/>
<row Key="tipo" Value="PREST. AUTO"/>
<row Key="balance" Value="19617.5"/>
<row Key="KIND_ID" Value="PRINCIPAL"/>
<row Key="TYPE_ID" Value="CEDULA_IDENTIDAD"/>
<row Key="CUSTOMER_ID" Value="777-555-888"/>
<row Key="MEMBER_TYPE" Value="DEUDOR"/>
</DEBTCONFIGURATION>
I have the following code. It creates the DataFrame, but when I try to append the values of the rows, every value comes out as None and I don't know why.
I don't know whether I have to change the call, i.e. attrib.get.
I also tried changing attrib.get to find("value").text, but that raises an error saying the object has no text attribute.
import pandas as pd
import xml.etree.ElementTree as ET
xtree = ET.parse("davi_apc.xml")
xroot = xtree.getroot()
df_cols = ["guid", "status", "forma_pago", "monto", "monto_abono", "entidad", "tipo", "balance","KIND_ID", "TYPE_ID", "CUSTOMER_ID", "MEMBER_TYPE"]
rows = []
for node in xroot:
    s_guid = node.attrib.get("guid")
    s_status = node.attrib.get("status")
    s_formapago = node.attrib.get("forma_pago")
    s_monto = node.attrib.get("monto")
    s_monto_abono = node.attrib.get("monto_abono")
    s_entidad = node.attrib.get("entidad")
    s_tipo = node.attrib.get("tipo")
    s_balance = node.attrib.get("balance")
    s_kind_id = node.attrib.get("KIND_ID")
    s_type_id = node.attrib.get("TYPE_ID")
    s_customer_id = node.attrib.get("CUSTOMER_ID")
    s_mebder_type = node.attrib.get("MEMBER_TYPE")
    rows.append({
        "guid" : s_guid,
        "status" : s_status,
        "forma_pago" : s_formapago,
        "monto" : s_monto,
        "monto_abono" : s_monto_abono,
        "entidad" : s_entidad,
        "tipo" : s_tipo,
        "balance" : s_balance,
        "KIND_ID" : s_kind_id,
        "TYPE_ID" : s_type_id,
        "CUSTOMER_ID" : s_customer_id,
        "MEMBER_TYPE" : s_mebder_type
    })
out_df = pd.DataFrame(rows, columns = df_cols)
this is the printout of print(rows)
[{'guid': None, 'status': None, 'forma_pago': None, 'monto': None, 'monto_abono': None, 'entidad': None, 'tipo': None, 'balance': None, 'KIND_ID': None, 'TYPE_ID': None, 'CUSTOMER_ID': None, 'MEMBER_TYPE': None}]
and this is the printout of the dataframe
guid status forma_pago monto monto_abono entidad tipo balance KIND_ID
0 None None None None None None None None None
TYPE_ID CUSTOMER_ID MEMBER_TYPE
0 None None None
Here is a working solution:
1/ Remove the top line (<EXTENDEDPROPERTIES>) from the XML file. As posted, that tag is never closed, so I am not sure the file is well-formed XML.
<DEBTCONFIGURATION>
<row Key="guid" Value="2018438038"/>
<row Key="status" Value="0"/>
<row Key="forma_pago" Value="DEBITO A CUENTA"/>
<row Key="monto" Value="23699.1"/>
<row Key="monto_abono" Value="360.55"/>
<row Key="entidad" Value="BANCO CAPRICHOSO S.A."/>
<row Key="tipo" Value="PREST. AUTO"/>
<row Key="balance" Value="19617.5"/>
<row Key="KIND_ID" Value="PRINCIPAL"/>
<row Key="TYPE_ID" Value="CEDULA_IDENTIDAD"/>
<row Key="CUSTOMER_ID" Value="777-555-888"/>
<row Key="MEMBER_TYPE" Value="DEUDOR"/>
</DEBTCONFIGURATION>
2/ code:
import pandas as pd
import xml.etree.ElementTree as ET
xtree = ET.parse("davi_apc.xml")
xroot = xtree.getroot()
rows = [{}]
for node in xroot:
    print(node.attrib)
    rows[0].update({node.attrib['Key']:node.attrib['Value']})
out_df = pd.DataFrame(rows)
3/ output for out_df:
out_df.head(10)
guid status ... CUSTOMER_ID MEMBER_TYPE
0 2018438038 0 ... 777-555-888 DEUDOR
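The values in the question came back as None because every row element carries only two attributes, Key and Value; names such as guid are attribute values, not attribute names. A minimal sketch of the same idea that keeps the outer wrapper element, assuming the real file does close it with </EXTENDEDPROPERTIES> (the posted sample does not show this) and using a shortened copy of the data:

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample; the closing </EXTENDEDPROPERTIES> is an assumption.
XML = """<EXTENDEDPROPERTIES>
<DEBTCONFIGURATION>
<row Key="guid" Value="2018438038"/>
<row Key="status" Value="0"/>
<row Key="entidad" Value="BANCO CAPRICHOSO S.A."/>
</DEBTCONFIGURATION>
</EXTENDEDPROPERTIES>"""

root = ET.fromstring(XML)
# One dict per <DEBTCONFIGURATION>; each <row> contributes one Key -> Value pair.
records = [
    {row.get("Key"): row.get("Value") for row in config.iter("row")}
    for config in root.iter("DEBTCONFIGURATION")
]
print(records)
```

Passing records to pd.DataFrame(records) should then give one DataFrame row per DEBTCONFIGURATION block, with the Key names as columns.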
I have an XML file, 'product.xml'. Here is a sample of its contents:
<?xml version="1.0"?>
<Rowset>
<ROW>
<Product_ID>32</Product_ID>
<Company_ID>2</Company_ID>
<User_ID>90</User_ID>
<Product_Type>1</Product_Type>
<Application_ID>BBC#:1010</Application_ID>
</ROW>
<ROW>
<Product_ID>22</Product_ID>
<Company_ID>4</Company_ID>
<User_ID>190</User_ID>
<Product_Type>2</Product_Type>
<Application_ID>NBA#:1111</Application_ID>
</ROW>
<ROW>
<Product_ID>63</Product_ID>
<Company_ID>4</Company_ID>
<User_ID>99</User_ID>
<Product_Type>1</Product_Type>
<Application_ID>BBC#:1212</Application_ID>
</ROW>
<ROW>
<Product_ID>22</Product_ID>
<Company_ID>2</Company_ID>
<User_ID>65</User_ID>
<Product_Type>2</Product_Type>
<Application_ID>NBA#:2210</Application_ID>
</ROW>
This is my code:
import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9
tree = ET.parse('product.xml')
root = tree.getroot()
for rows in root:
    for attr in rows:
        if (attr.tag=='User_ID'):
            print('User_ID: ' + attr.text)
        if (attr.tag=='Application_ID'):
            print('Application_ID: ' + attr.text)
Output for this is:
User_ID: 90
Application_ID: BBC#:1010
User_ID: 190
Application_ID: NBA#:1111
User_ID: 99
Application_ID: BBC#:1212
I am wondering how I can generate a 2D table as a pandas DataFrame, using 'Application_ID' and 'User_ID' as the column headers with their values as the rows, like:
Application_ID User_ID
BBC#:1010 90
NBA#:1111 190
BBC#:1212 99
I would also like to export this table to a CSV file. Thank you.
Something like the following:
import xml.etree.ElementTree as ET
import pandas as pd
xml = '''<?xml version="1.0"?>
<Rowset>
<ROW>
<Product_ID>32</Product_ID>
<Company_ID>2</Company_ID>
<User_ID>90</User_ID>
<Product_Type>1</Product_Type>
<Application_ID>BBC#:1010</Application_ID>
</ROW>
<ROW>
<Product_ID>22</Product_ID>
<Company_ID>4</Company_ID>
<User_ID>190</User_ID>
<Product_Type>2</Product_Type>
<Application_ID>NBA#:1111</Application_ID>
</ROW>
<ROW>
<Product_ID>63</Product_ID>
<Company_ID>4</Company_ID>
<User_ID>99</User_ID>
<Product_Type>1</Product_Type>
<Application_ID>BBC#:1212</Application_ID>
</ROW>
<ROW>
<Product_ID>22</Product_ID>
<Company_ID>2</Company_ID>
<User_ID>65</User_ID>
<Product_Type>2</Product_Type>
<Application_ID>NBA#:2210</Application_ID>
</ROW>
</Rowset>
'''
FIELDS = ['Application_ID','User_ID']
data = []
root = ET.fromstring(xml)
for row in root.findall('.//ROW'):
    data.append([row.find(f).text for f in FIELDS])
df = pd.DataFrame(data,columns=FIELDS)
print(df)
output
Application_ID User_ID
0 BBC#:1010 90
1 NBA#:1111 190
2 BBC#:1212 99
3 NBA#:2210 65
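The question also asks for CSV export, and row.find(f).text would raise AttributeError if a ROW lacked one of the fields. A minimal sketch of both points using only the standard library, on a shortened sample in which the second ROW deliberately omits Application_ID (that omission is an assumption made for illustration):

```python
import csv
import io
import xml.etree.ElementTree as ET

XML = """<Rowset>
<ROW><User_ID>90</User_ID><Application_ID>BBC#:1010</Application_ID></ROW>
<ROW><User_ID>190</User_ID></ROW>
</Rowset>"""  # shortened sample; the second ROW lacks Application_ID on purpose

FIELDS = ["Application_ID", "User_ID"]
root = ET.fromstring(XML)
# findtext returns None for a missing child instead of raising AttributeError.
data = [[row.findtext(f) for f in FIELDS] for row in root.iter("ROW")]

buffer = io.StringIO()  # stands in for open("output.csv", "w", newline="")
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(FIELDS)
writer.writerows(data)  # None is written as an empty field
print(buffer.getvalue())
```

With the DataFrame from the answer above, df.to_csv('output.csv', index=False) would write the same kind of file directly.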
Try:
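import numpy as np
import pandas as pd
def parse_row(row):
    # Start from NaN so a ROW missing either tag still yields both columns.
    ret = {'User_ID':np.nan, 'Application_ID':np.nan}
    for attr in row:
        if attr.tag in ret: ret[attr.tag] = attr.text
    return ret
# root is the parsed tree root from the question's code
out = pd.DataFrame([parse_row(r) for r in root])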
Output:
User_ID Application_ID
0 90 BBC#:1010
1 190 NBA#:1111
2 99 BBC#:1212
3 65 NBA#:2210
pandas can read many file formats directly into a DataFrame, including XML via pd.read_xml (pandas 1.3.0 or later).
import pandas as pd
### This line would get you all of your columns
df = pd.read_xml('product.xml')
### Drop (remove) unwanted columns
df.drop(['Product_ID', 'Company_ID', 'Product_Type'], axis=1, inplace=True)
### Export to csv (index=False leaves out the row-index column)
df.to_csv('outputfile.csv', index=False)
I have an XML file that looks like this:
<?xml version="1.0" encoding="utf-8"?>
<comments>
<row Id="1" PostId="2" Score="0" Text="(...)" CreationDate="2011-08-30T21:15:28.063" UserId="16" />
<row Id="2" PostId="17" Score="1" Text="(...)" CreationDate="2011-08-30T21:24:56.573" UserId="27" />
<row Id="3" PostId="26" Score="0" Text="(...)" UserId="9" />
</comments>
What I'm trying to do is extract the Id, Text and CreationDate columns into a pandas DataFrame. I've tried the following:
import xml.etree.ElementTree as et  # cElementTree was removed in Python 3.9
import pandas as pd
path = '/.../...'
dfcols = ['ID', 'Text', 'CreationDate']
df_xml = pd.DataFrame(columns=dfcols)
root = et.parse(path)
rows = root.findall('.//row')
for row in rows:
    ID = row.find('Id')
    text = row.find('Text')
    date = row.find('CreationDate')
    print(ID, text, date)
    df_xml = df_xml.append(pd.Series([ID, text, date], index=dfcols), ignore_index=True)
print(df_xml)
But the output is:
None None None
How do I fix this?
As advised in this solution by @unutbu, a gold-badge Python/pandas/NumPy expert:
Never call DataFrame.append or pd.concat inside a for-loop. It leads to quadratic copying.
Therefore, consider parsing your XML data into a separate list, then pass that list to the DataFrame constructor in a single call outside of any loop. In fact, you can pass nested lists built with a list comprehension directly into the constructor:
path = 'AttributesXMLPandas.xml'
dfcols = ['ID', 'Text', 'CreationDate']
root = et.parse(path)
rows = root.findall('.//row')
# NESTED LIST
xml_data = [[row.get('Id'), row.get('Text'), row.get('CreationDate')]
            for row in rows]
df_xml = pd.DataFrame(xml_data, columns=dfcols)
print(df_xml)
# ID Text CreationDate
# 0 1 (...) 2011-08-30T21:15:28.063
# 1 2 (...) 2011-08-30T21:24:56.573
# 2 3 (...) None
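Just a minor change in your code: Id, Text and CreationDate are attributes of each row element, not child elements, so read them with get() instead of find():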
ID = row.get('Id')
text = row.get('Text')
date = row.get('CreationDate')
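The difference between the two calls can be seen in a small sketch on a shortened copy of the question's sample: find() looks for a child element with the given tag, while get() reads an attribute of the element itself.

```python
import xml.etree.ElementTree as ET

XML = """<comments>
<row Id="1" Text="(...)" CreationDate="2011-08-30T21:15:28.063" />
<row Id="3" Text="(...)" />
</comments>"""  # shortened sample from the question

rows = ET.fromstring(XML).findall("row")
first, last = rows

# find() searches for a child element named "Id"; there is none, so it returns None.
print(first.find("Id"))
# get() reads the attribute named "Id" from the element itself.
print(first.get("Id"))
# get() returns None (or a supplied default) when the attribute is absent.
print(last.get("CreationDate"), last.get("CreationDate", "n/a"))
```

The optional default passed to get() may be handy if a placeholder is preferable to None in the resulting DataFrame.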
Based on @Parfait's solution, I wrote my own version that takes the columns as a parameter and returns a pandas DataFrame.
test.xml:
<?xml version="1.0" encoding="utf-8"?>
<comments>
<row Id="1" PostId="2" Score="0" Text="(.1.)" CreationDate="2011-08-30T21:15:28.063" UserId="16" />
<row Id="2" PostId="17" Score="1" Text="(.2.)" CreationDate="2011-08-30T21:24:56.573" UserId="27" />
<row Id="3" PostId="26" Score="0" Text="(.3.)" UserId="9" />
</comments>
xml_to_pandas.py:
'''XML to pandas DataFrame converter.'''
import xml.etree.ElementTree as et  # cElementTree was removed in Python 3.9
import pandas as pd
def xml_to_pandas(root, columns, row_name):
    '''Take an xml.etree root and the attribute names to extract; return a pandas DataFrame (None on error).'''
    df = None
    try:
        rows = root.findall('.//{}'.format(row_name))
        xml_data = [[row.get(c) for c in columns] for row in rows] # NESTED LIST
        df = pd.DataFrame(xml_data, columns=columns)
    except Exception as e:
        print('[xml_to_pandas] Exception: {}.'.format(e))
    return df
path = 'test.xml'
row_name = 'row'
columns = ['Id', 'Text', 'CreationDate']  # attribute names are case-sensitive ('ID' would give None)
root = et.parse(path)
df = xml_to_pandas(root, columns, row_name)
print(df)
output:
Since pandas 1.3.0, there's a built-in pandas function pd.read_xml that reads XML documents into a pandas DataFrame.
path = """<?xml version="1.0" encoding="utf-8"?>
<comments>
<row Id="1" PostId="2" Score="0" Text="(...)" CreationDate="2011-08-30T21:15:28.063" UserId="16" />
<row Id="2" PostId="17" Score="1" Text="(...)" CreationDate="2011-08-30T21:24:56.573" UserId="27" />
<row Id="3" PostId="26" Score="0" Text="(...)" UserId="9" />
</comments>"""
# or a path to an XML doc
path = 'test.xml'
pd.read_xml(path)
The XML doc in the OP becomes the following by simply calling read_xml:
I have multiple xml files that look something like this:
<?xml version="1.0" encoding="UTF-8"?>
<parent>
<row AcceptedAnswerId="15" AnswerCount="5" Body="<p>How should
I elicit prior distributions from experts when fitting a Bayesian
model?</p>
" CommentCount="1" CreationDate="2010-07-
19T19:12:12.510" FavoriteCount="17" Id="1" LastActivityDate="2010-09-
15T21:08:26.077" OwnerUserId="8" PostTypeId="1" Score="26"
Tags="<bayesian><prior><elicitation>"
Title="Eliciting priors from experts" ViewCount="1457" />
I would like to be able to use PySpark to count the lines that DO NOT contain the string: <row
My current thought:
def startWithRow(line):
    if line.strip().startswith("<row"):
        return True
    else:
        return False
sc.textFile(localpath("folder_containing_xml.gz_files")) \
    .filter(lambda x: not startWithRow(x)) \
    .count()
I have tried validating this, but even a simple line count gives results that don't make sense (I downloaded the XML file and ran wc on it, and its count did not match the count from PySpark).
Does anything about my approach above stand out as wrong/weird?
I would just use the lxml library combined with Spark to count the row elements or filter something out.
from lxml import etree
def find_number_of_rows(path):
    try:
        tree = etree.fromstring(path)
    except:
        tree = etree.parse(path)
    return len(tree.findall('row'))
rdd = spark.sparkContext.parallelize(paths) # paths is a list to all your paths
rdd.map(lambda x: find_number_of_rows(x)).collect()
For example, if you have a list of XML strings (just a toy example), you can do the following:
text = [
"""
<parent>
<row ViewCount="1457" />
<row ViewCount="1457" />
</parent>
""",
"""
<parent>
<row ViewCount="1457" />
<row ViewCount="1457" />
<row ViewCount="1457" />
</parent>
"""
]
rdd = spark.sparkContext.parallelize(text)
rdd.map(lambda x: find_number_of_rows(x)).collect()
In your case, your function has to take the path to a file instead. Then you can count or filter those rows. I don't have a full file to test on. Let me know if you need extra help!
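import xml.etree.ElementTree as ET
from collections import Counter
def badRowParser(x):
    # True if the line parses as a standalone XML element, False otherwise
    try:
        line = ET.fromstring(x.strip().encode('utf-8'))
        return True
    except ET.ParseError:
        return False
posts = sc.textFile(localpath('folder_containing_xml.gz_files'))
# keep lines containing "<row", then flag each one: True means it failed to parse
rejected = posts.filter(lambda l: "<row" in l).map(lambda x: not badRowParser(x))
ans = rejected.collect()
Counter(ans)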
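To sanity-check the Spark count on a single file without Spark, a local sketch along these lines may help. It mirrors the startswith test from the question; count_non_row_lines is a hypothetical helper name, and the tiny sample file is created only so the sketch runs on its own.

```python
import gzip
import os
import tempfile

def count_non_row_lines(path):
    """Count lines in a gzip-compressed text file that do not start with '<row'."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return sum(1 for line in handle if not line.strip().startswith("<row"))

# Tiny stand-in file so the sketch runs on its own.
sample = '<?xml version="1.0"?>\n<parent>\n  <row Id="1" />\n  <row Id="2" />\n</parent>\n'
tmp_path = os.path.join(tempfile.mkdtemp(), "sample.xml.gz")
with gzip.open(tmp_path, "wt", encoding="utf-8") as handle:
    handle.write(sample)

non_row_count = count_non_row_lines(tmp_path)
print(non_row_count)
```

Note that a row whose attribute values contain line breaks, as in the sample from the question, occupies several physical lines and only the first of them starts with <row, so line-based counts can differ from element counts.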
I'm trying to parse the following xml file:
<?xml version="1.0" encoding="UTF-8"?>
<gexf xmlns="http://www.gexf.net/1.2draft" version="1.2">
<graph mode="static" defaultedgetype="directed">
<nodes>
<node id="0" label="Hello" />
<node id="1" label="Word" />
<node id="2" />
</nodes>
<edges>
<edge id="0" source="0" target="1" />
<edge id="1" source="1" target="2" weight="2.0" />
</edges>
</graph>
</gexf>
As can be seen some edges have weights, some do not.
My code is like the following:
elif name == "edge":
    u = attrs.getValue("source")
    v = attrs.getValue("target")
    w = attrs.getValue("weight")
    if w is not None:
        self.edgeweight = w
Here I expect w to be None for the first edge and 2.0 for the second edge of the XML file. Instead all I get is an error. What's the proper way to handle this?
The get() method did the trick: it returns None for a missing attribute instead of raising an error.
w = attrs.get("weight")
if w is not None:
    self.weighted = True
    self.edgeweight = float(w)
Try the following.
if "weight" in attrs:
    w = attrs.getValue("weight")
    self.edgeweight = w
I used this as reference. Attributes objects implement part of the mapping protocol, including get() and __contains__(), so "weight" in attrs is supported.
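Putting the pieces together, a self-contained sketch of a SAX handler that tolerates the missing weight attribute might look like this. EdgeHandler and its edges list are hypothetical names chosen for illustration, and the document is a shortened copy of the GEXF sample from the question.

```python
import xml.sax

GEXF = b"""<?xml version="1.0" encoding="UTF-8"?>
<gexf xmlns="http://www.gexf.net/1.2draft" version="1.2">
  <graph mode="static" defaultedgetype="directed">
    <edges>
      <edge id="0" source="0" target="1" />
      <edge id="1" source="1" target="2" weight="2.0" />
    </edges>
  </graph>
</gexf>"""

class EdgeHandler(xml.sax.ContentHandler):  # hypothetical handler name
    def __init__(self):
        super().__init__()
        self.edges = []

    def startElement(self, name, attrs):
        if name == "edge":
            # get() returns None when "weight" is absent; getValue() would raise KeyError.
            w = attrs.get("weight")
            weight = float(w) if w is not None else None
            self.edges.append((attrs.getValue("source"), attrs.getValue("target"), weight))

handler = EdgeHandler()
xml.sax.parseString(GEXF, handler)
print(handler.edges)
```

Storing None for unweighted edges keeps the decision about a default weight out of the parser; a default could equally be supplied as the second argument to get().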
I'm trying to parse an XML document in Python so that I can manipulate the data and write out a new file. The full file that I'm working with is here, but here is an excerpt:
<?xml version="1.0" encoding="UTF-8"?>
<FMPXMLRESULT xmlns="http://www.filemaker.com/fmpxmlresult">
<ERRORCODE>0</ERRORCODE>
<PRODUCT BUILD="09-11-2013" NAME="FileMaker" VERSION="ProAdvanced 12.0v5"/>
<DATABASE DATEFORMAT="M/d/yyyy" LAYOUT="" NAME="All gigs 88-07.fmp12" RECORDS="746" TIMEFORMAT="h:mm:ss a"/>
<METADATA>
<FIELD EMPTYOK="YES" MAXREPEAT="1" NAME="Country" TYPE="TEXT"/>
<FIELD EMPTYOK="YES" MAXREPEAT="1" NAME="Year" TYPE="TEXT"/>
<FIELD EMPTYOK="YES" MAXREPEAT="1" NAME="City" TYPE="TEXT"/>
<FIELD EMPTYOK="YES" MAXREPEAT="1" NAME="State" TYPE="TEXT"/>
<FIELD EMPTYOK="YES" MAXREPEAT="1" NAME="Theater" TYPE="TEXT"/>
</METADATA>
<RESULTSET FOUND="746">
<ROW MODID="3" RECORDID="32">
<COL>
<DATA/>
</COL>
<COL>
<DATA>1996</DATA>
</COL>
<COL>
<DATA>Pompano Beach</DATA>
</COL>
<COL>
<DATA>FL</DATA>
</COL>
<COL>
<DATA>First Presbyterian Church</DATA>
</COL>
</ROW>
<ROW MODID="3" RECORDID="33">
<COL>
<DATA/>
</COL>
<COL>
<DATA>1996</DATA>
</COL>
<COL>
<DATA>Hilton Head</DATA>
</COL>
<COL>
<DATA>SC</DATA>
</COL>
<COL>
<DATA>Self Family Arts Center</DATA>
</COL>
</ROW>
<!-- snip many more ROW elements -->
</RESULTSET>
</FMPXMLRESULT>
Eventually, I want to use the information from the METADATA field to parse the columns in the RESULTSET, but for now I’m having trouble just getting a handle on the data. Here is what I’ve tried to get the contents of the METADATA element:
import xml.etree.ElementTree as ET
tree = ET.parse('giglist.xml')
root = tree.getroot()
print(root)
metadata = tree.find("METADATA")
print(metadata)
This prints out:
<Element '{http://www.filemaker.com/fmpxmlresult}FMPXMLRESULT' at 0x10f982cd0>
None
Why is metadata empty? Am I misusing the find() method?
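You need to handle namespaces: the root element declares a default namespace, so every element name is qualified with it and a plain find("METADATA") matches nothing.
Since only a default namespace is declared, you can find the element by spelling out the qualified name: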
import xml.etree.ElementTree as ET
ns = 'http://www.filemaker.com/fmpxmlresult'
tree = ET.parse('giglist.xml')
root = tree.getroot()
metadata = root.find("{%s}METADATA" % ns)
print(metadata)  # prints <Element '{http://www.filemaker.com/fmpxmlresult}METADATA' at 0x103ccbe90>
Here are the relevant threads you may want to see:
Is there a key for the default namespace when creating dictionary for use with xml.etree.ElementTree.findall() in Python?
Parsing XML with namespace in Python via 'ElementTree'
UPD (getting the list of results):
import xml.etree.ElementTree as ET
ns = 'http://www.filemaker.com/fmpxmlresult'
tree = ET.parse('giglist.xml')
root = tree.getroot()
keys = [field.attrib['NAME'] for field in root.findall(".//{%(ns)s}METADATA/{%(ns)s}FIELD" % {'ns': ns})]
results = [dict(zip(keys, [col.text for col in row.findall(".//{%(ns)s}COL/{%(ns)s}DATA" % {'ns': ns})]))
           for row in root.findall(".//{%(ns)s}RESULTSET/{%(ns)s}ROW" % {'ns': ns})]
print(results)
Prints:
[
{'Country': None, 'Year': '1996', 'City': 'Pompano Beach', 'State': 'FL', 'Theater': 'First Presbyterian Church'},
{'Country': None, 'Year': '1996', 'City': 'Hilton Head', 'State': 'SC', 'Theater': 'Self Family Arts Center'}
]
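As an alternative to formatting the namespace URI into every path, find() and findall() also accept a prefix-to-URI mapping as their second argument. A minimal sketch on a shortened excerpt of the file, where the prefix fm is an arbitrary choice made for this example:

```python
import xml.etree.ElementTree as ET

XML = """<FMPXMLRESULT xmlns="http://www.filemaker.com/fmpxmlresult">
<METADATA>
<FIELD NAME="Year"/>
<FIELD NAME="City"/>
</METADATA>
<RESULTSET FOUND="1">
<ROW MODID="3" RECORDID="32">
<COL><DATA>1996</DATA></COL>
<COL><DATA>Pompano Beach</DATA></COL>
</ROW>
</RESULTSET>
</FMPXMLRESULT>"""  # shortened excerpt of the file in the question

# "fm" is an arbitrary prefix chosen for this sketch; only the URI has to match.
NS = {"fm": "http://www.filemaker.com/fmpxmlresult"}

root = ET.fromstring(XML)
keys = [field.get("NAME") for field in root.findall("fm:METADATA/fm:FIELD", NS)]
results = [
    dict(zip(keys, (data.text for data in row.findall("fm:COL/fm:DATA", NS))))
    for row in root.findall("fm:RESULTSET/fm:ROW", NS)
]
print(results)
```

The mapping keeps the paths readable when several levels are involved; the prefix never has to appear in the XML itself.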