PySpark counting rows that contain string

PySpark counting rows that contain string - python

I have multiple xml files that look something like this:
<?xml version="1.0" encoding="UTF-8"?>
<parent>
<row AcceptedAnswerId="15" AnswerCount="5" Body="<p>How should
I elicit prior distributions from experts when fitting a Bayesian
model?</p>
" CommentCount="1" CreationDate="2010-07-
19T19:12:12.510" FavoriteCount="17" Id="1" LastActivityDate="2010-09-
15T21:08:26.077" OwnerUserId="8" PostTypeId="1" Score="26"
Tags="<bayesian><prior><elicitation>"
Title="Eliciting priors from experts" ViewCount="1457" />
I would like to be able to use PySpark to count the lines that DO NOT contain the string: <row
My current thought:
def startWithRow(line):
if line.strip().startswith("<row"):
return True
else:
return False
sc.textFile(localpath("folder_containing_xmg.gz_files")) \
.filter(lambda x: not startWithRow(x)) \
.count()
I have tried validating this, but am getting results from even a simple count lines that don't make sense (I downloaded the xml file and did a wc on it which did not match the word count from PySpark.)
Does anything about my approach above stand out as wrong/weird?

I will just use lxml library combined with Spark to count the line with row or filter something out.
from lxml import etree
def find_number_of_rows(path):
try:
tree = etree.fromstring(path)
except:
tree = etree.parse(path)
return len(tree.findall('row'))
rdd = spark.sparkContext.parallelize(paths) # paths is a list to all your paths
rdd.map(lambda x: find_number_of_rows(x)).collect()
For example, if you have list or XML string (just toy example), you can do the following:
text = [
"""
<parent>
<row ViewCount="1457" />
<row ViewCount="1457" />
</parent>
""",
"""
<parent>
<row ViewCount="1457" />
<row ViewCount="1457" />
<row ViewCount="1457" />
</parent>
"""
]
rdd = spark.sparkContext.parallelize(text)
rdd.map(lambda x: find_number_of_rows(x)).collect()
In your case, your function have to take in path to file instead. Then, you can count or filter those rows. I don't have a full file to test on. Let me know if you need extra help!

def badRowParser(x):
try:
line = ET.fromstring(x.strip().encode('utf-8'))
return True
except:
return False
posts = sc.textFile(localpath('folder_containing_xml.gz_files'))
rejected = posts.filter(lambda l: "<row" in l.encode('utf-
8')).map(lambda x: not badRowParser(x))
ans = rejected.collect()
from collections import Counter
Counter(ans)

Related

Extracting comments from XML file in Python

I would like to extract the comment section of the XML file. The information that I would like to extract is found between the Tag and then within Text tag which is "EXAMPLE".
The structure of the XML file looks below.
<Boxes>
<Box Id="3" ZIndex="13">
<Shape>Rectangle</Shape>
<Brush Id="0" />
<Pen>
<Color>#FF000000</Color>
</Pen>
<Tag><?xml version="1.0"?>
<PFDComment xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<Text>**EXAMPLE** </Text>
</PFDComment></Tag>
</Box>
</Boxes>
I tried it something below but couldn't get the information that I want.
def read_cooments(xml):
tree = lxml.etree.parse(xml)
Comments= {}
for comment in tree.xpath("//Boxes/Box"):
#
get_id = comment.attrib['Id']
Comments[get_id] = []
for group in comment.xpath(".//Tag"):
#
Comments[get_id].append(group.text)
df_name1 = pd.DataFrame(dict([(k,pd.Series(v)) for k,v in Comments.items()]))
Can anyone help to extract comments from XML file shown above? Any help is appreciated!

Use the code given below:
def read_comments(xml):
tree = etree.parse(xml)
rows= []
for box in tree.xpath('Box'):
id = box.attrib['Id']
tagTxt = box.findtext('Tag')
if tagTxt is None:
continue
txtNode = etree.XML(tagTxt).find('Text')
if txtNode is None:
continue
rows.append([id, txtNode.text.strip()])
return pd.DataFrame(rows, columns=['id', 'Comment'])
Note that if you create a DataFrame within a function, it is a local
variable of this function and is not visible from outside.
A better and more readable approach (as I did) is that the function returns
this DataFrame.
This function contains also continue in 2 places, to guard against possible
"error cases", when either Box element does not contain Tag child or
Tag does not contain any Text child element.
I also noticed that there is no need to replace < or > with < or
> with my own code, as lxml performs it on its own.
Edit
My test is as follows: Start form imports:
import pandas as pd
from lxml import etree
I used a file containing:
<Boxes>
<Box Id="3" ZIndex="13">
<Shape>Rectangle</Shape>
<Brush Id="0" />
<Pen>
<Color>#FF000000</Color>
</Pen>
<Tag><?xml version="1.0"?>
<PFDComment xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<Text>**EXAMPLE** </Text>
</PFDComment></Tag>
</Box>
</Boxes>
I called the above function:
df_name1 = read_comments('Boxes.xml')
and when I printed df_name1, I got:
id Comment
0 3 **EXAMPLE**
If something goes wrong, use the "extended" version of the above function,
with test printouts:
def read_comments(xml):
tree = etree.parse(xml)
rows= []
for box in tree.xpath('Box'):
id = box.attrib['Id']
tagTxt = box.findtext('Tag')
if tagTxt is None:
print('No Tag element')
continue
txtNode = etree.XML(tagTxt).find('Text')
if txtNode is None:
print('No Text element')
continue
txt = txtNode.text.strip()
print(f'{id}: {txt}')
rows.append([id, txt])
return pd.DataFrame(rows, columns=['id', 'Comment'])
and take a look at printouts.

XML to table form in Excel

There is this option when opening an xml file using Excel. You get prompted with the option as seen in the picture Here
It basically open that xml file in a table work and based on the analysis that I have done. It seems to do a pretty good job.
This is how it looks after I opened an xml file using excel as a tabel form Here
My Question: I want to convert an Xml into a table from like that feature in Excel does it. Is that possible?
The reason I want this result, is that working with tables inside excel is really easy using libraries like pandas. However, I don’t want to go an open every xml file with excel, show the table and then save it again. It is not very time efficient
This is my XML file
<?xml version="1.0" encoding="utf-8"?>
<ProjectData>
<FINAL>
<START id="ID0001" service_code="0x5196">
<Docs Docs_type="START">
<Rational>225196</Rational>
<Qualify>6251960000A0DE</Qualify>
</Docs>
<Description num="1213f2312">The parameter</Description>
<SetFile dg="" dg_id="">
<SetData value="32" />
</SetFile>
</START>
<START id="DG0003" service_code="0x517B">
<Docs Docs_type="START">
<Rational>23423</Rational>
<Qualify>342342</Qualify>
</Docs>
<Description num="3423423f3423">The third</Description>
<SetFile dg="" dg_id="">
<FileX dg="" axis_pts="2" name="" num="" dg_id="" />
<FileY unit="" axis_pts="20" name="TOOLS" text_id="23423" unit_id="" />
<SetData x="E1" value="21259" />
<SetData x="E2" value="0" />
</SetFile>
</START>
<START id="ID0048" service_code="0x5198">
<RawData rawdata_type="OPDATA">
<Request>225198</Request>
<Response>343243324234234</Response>
</RawData>
<Meaning text_id="434234234">The forth</Meaning>
<ValueDataset unit="m" unit_id="FEDS">
<FileX dg="kg" discrete="false" axis_pts="19" name="weight" text_id="SDF3" unit_id="SDGFDS" />
<SetData xin="sdf" xax="233" value="323" />
<SetData xin="123" xax="213" value="232" />
<SetData xin="2321" xax="232" value="23" />
</ValueDataset>
</START>
</FINAL>
</ProjectData>

So let's say I have the following input.xml file:
<main>
<item name="item1" image="a"></item>
<item name="item2" image="b"></item>
<item name="item3" image="c"></item>
<item name="item4" image="d"></item>
</main>
You can use the following code:
import xml.etree.ElementTree as ET
import pandas as pd
tree = ET.parse('input.xml')
tags = [ e.attrib for e in tree.getroot() ]
df = pd.DataFrame(tags)
# df:
# name image
# 0 item1 a
# 1 item2 b
# 2 item3 c
# 3 item4 d
And this should be independent of the number of attributes in a given file.
To write to a simple CSV file from pandas, you can use the to_csv command. See documentation. If it is necessary to be an excel sheet, you can use to_excel, see here.
# Write to csv without the row names
df.to_csv('file_name.csv', index = False)
# Write to xlsx sheet without the row names
df.to_excel('file_name.xlsx', index=False)
UPDATE:
For your XML file, and based on your clarification in the comments, I suggest the following, where all elements in the first level in the tree will be rows, and every attribute or node text will be column:
def has_children(e):
''' Check if element, e, has children'''
return(len(list(e)) > 0)
def has_attrib(e):
''' Check if element, e, has attributes'''
return(len(e.attrib)>0)
def get_uniqe_key(mydict, key):
''' Generate unique key if already exists in mydict'''
if key in mydict:
while key in mydict:
key = key + '*'
return(key)
tree = ET.parse('input2.xml')
root = tree.getroot()
# Get first level:
lvl_one = list(root)
myList = [];
for e in lvl_one:
mydict = {}
# Iterate over each node in level one element
for node in e.iter():
if (not has_children(node)) & (node.text != None):
uniqe_key = get_uniqe_key(mydict, node.tag)
mydict[uniqe_key] = node.text
if has_attrib(node):
for key in node.attrib:
uniqe_key = get_uniqe_key(mydict, key)
mydict[uniqe_key] = node.attrib[key]
myList.append(mydict)
print(pd.DataFrame(myList))
Notice in this code, I check if the column name exists for each key, and if it exists, I create a new column name by suffixing with '*'.

Parsing XML with namespace into dictionary

I'm having a hard time following the xml.etree.ElementTree documentation with regard to parsing an XML document with a namespace and nested tags.
To begin, the xml tree I am trying to parse looks like:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<ROOT-MAIN xmlns="http://fakeurl.com/page">
<Alarm> <--- I dont care about these types of objects
<Node>
<location>Texas></location>
<name>John</name>
</Node>
</Alarm>
<Alarm> <--- I care about these types of objects
<CreateTime>01/01/2011</CreateTime>
<Story>
<Node>
<Name>Ethan</name
<Address category="residential>
<address>1421 Morning SE</address>
</address>
</Node>
</Story>
<Build>
<Action category="build_value_1">Build was successful</Action>
</Build>
<OtherData type="string" meaning="favoriteTVShow">Purple</OtherData>
<OtherData type="string" meaning="favoriteColor">Seinfeld</OtherData>
</Alarm>
</ROOT-MAIN>
I am trying to build an array of dictionaries that have a similar structure to the second < Alarm > object. When parsing this XML file, I do the following:
import xml.etree.ElementTree as ET
tree = ET.parse('data/'+filename)
root = tree.getroot()
namespace= '{http://fakeurl.com/page}'
for alarm in tree.findall(namespace+'Alarm'):
for elem in alarm.iter():
try:
creation_time = elem.find(namespace+'CreateTime')
for story in elem.findall(namespace+'Story'):
for node in story.findall(namespace+'Node'):
for Address in node.findall(namespace+'Address'):
address = Address.find(namespace+'address').text
for build in elem.findall(namespace+'Build'):
category= build.find(namespace+'Action').attrib
action = build.find(namespace+'Action').text
for otherdata in elem.findall(namespace+'OtherData'):
#not sure how to get the 'meaning' attribute value as well as the text value for these <OtherData> tags
except:
pass
Right I'm just trying to get values for:
< address >
< Action > (attribute value and text value)
< OtherData > (attribute value and text value)
I'm sort of able to do this with for loops within for-loops but I was hoping for a cleaner, xpath solution which I haven't figured out how to do with a namespace.
Any suggestions would be much appreciated.

Here (collecting a subset of the elements you mentioned -- add more code to collect rest of elements)
import xml.etree.ElementTree as ET
import re
xmlstring = '''<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<root xmlns="http://fakeurl.com/page">
<Alarm>
<Node>
<location>Texas></location>
<name>John</name>
</Node>
</Alarm>
<Alarm>
<CreateTime>01/01/2011</CreateTime>
<Story>
<Node>
<Name>Ethan</Name>
<Address category="residential">
<address>1421 Morning SE</address>
</Address>
</Node>
</Story>
<Build>
<Action category="build_value_1">Build was successful</Action>
</Build>
<OtherData type="string" meaning="favoriteTVShow">Purple</OtherData>
<OtherData type="string" meaning="favoriteColor">Seinfeld</OtherData>
</Alarm>
</root>'''
xmlstring = re.sub(' xmlns="[^"]+"', '', xmlstring, count=1)
root = ET.fromstring(xmlstring)
alarms = root.findall('Alarm')
alarms_list = []
for alarm in alarms:
create_time = alarm.find('CreateTime')
if create_time is not None:
entry = {'create_time': create_time.text}
alarms_list.append(entry)
actions = alarm.findall('Build/Action')
if actions:
entry['builds'] = []
for action in actions:
entry['builds'].append({'category': action.attrib['category'], 'status': action.text})
print(alarms_list)

Parsing XML document into a pandas DataFrame

I have an XML file that looks like this:
<?xml version="1.0" encoding="utf-8"?>
<comments>
<row Id="1" PostId="2" Score="0" Text="(...)" CreationDate="2011-08-30T21:15:28.063" UserId="16" />
<row Id="2" PostId="17" Score="1" Text="(...)" CreationDate="2011-08-30T21:24:56.573" UserId="27" />
<row Id="3" PostId="26" Score="0" Text="(...)" UserId="9" />
</comments>
What I'm trying to do is to extract ID, Text and CreationDate colums into pandas DF and I've tried following:
import xml.etree.cElementTree as et
import pandas as pd
path = '/.../...'
dfcols = ['ID', 'Text', 'CreationDate']
df_xml = pd.DataFrame(columns=dfcols)
root = et.parse(path)
rows = root.findall('.//row')
for row in rows:
ID = row.find('Id')
text = row.find('Text')
date = row.find('CreationDate')
print(ID, text, date)
df_xml = df_xml.append(pd.Series([ID, text, date], index=dfcols), ignore_index=True)
print(df_xml)
But the output is:
None None None
How do I fix this?

As advised in this solution by gold member Python/pandas/numpy guru, #unutbu:
Never call DataFrame.append or pd.concat inside a for-loop. It leads to quadratic copying.
Therefore, consider parsing your XML data into a separate list then pass list into the DataFrame constructor in one call outside of any loop. In fact, you can pass nested lists with list comprehension directly into the constructor:
path = 'AttributesXMLPandas.xml'
dfcols = ['ID', 'Text', 'CreationDate']
root = et.parse(path)
rows = root.findall('.//row')
# NESTED LIST
xml_data = [[row.get('Id'), row.get('Text'), row.get('CreationDate')]
for row in rows]
df_xml = pd.DataFrame(xml_data, columns=dfcols)
print(df_xml)
# ID Text CreationDate
# 0 1 (...) 2011-08-30T21:15:28.063
# 1 2 (...) 2011-08-30T21:24:56.573
# 2 3 (...) None

Just a minor change in your code
ID = row.get('Id')
text = row.get('Text')
date = row.get('CreationDate')

Based on #Parfait solution, I wrote my version that gets the columns as a parameter and returns the Pandas DataFrame.
test.xml:
<?xml version="1.0" encoding="utf-8"?>
<comments>
<row Id="1" PostId="2" Score="0" Text="(.1.)" CreationDate="2011-08-30T21:15:28.063" UserId="16" />
<row Id="2" PostId="17" Score="1" Text="(.2.)" CreationDate="2011-08-30T21:24:56.573" UserId="27" />
<row Id="3" PostId="26" Score="0" Text="(.3.)" UserId="9" />
</comments>
xml_to_pandas.py:
'''Xml to Pandas DataFrame Convertor.'''
import xml.etree.cElementTree as et
import pandas as pd
def xml_to_pandas(root, columns, row_name):
'''get xml.etree root, the columns and return Pandas DataFrame'''
df = None
try:
rows = root.findall('.//{}'.format(row_name))
xml_data = [[row.get(c) for c in columns] for row in rows] # NESTED LIST
df = pd.DataFrame(xml_data, columns=columns)
except Exception as e:
print('[xml_to_pandas] Exception: {}.'.format(e))
return df
path = 'test.xml'
row_name = 'row'
columns = ['ID', 'Text', 'CreationDate']
root = et.parse(path)
df = xml_to_pandas(root, columns, row_name)
print(df)
output:

Since pandas 1.3.0, there's a built-in pandas function pd.read_xml that reads XML documents into a pandas DataFrame.
path = """<?xml version="1.0" encoding="utf-8"?>
<comments>
<row Id="1" PostId="2" Score="0" Text="(...)" CreationDate="2011-08-30T21:15:28.063" UserId="16" />
<row Id="2" PostId="17" Score="1" Text="(...)" CreationDate="2011-08-30T21:24:56.573" UserId="27" />
<row Id="3" PostId="26" Score="0" Text="(...)" UserId="9" />
</comments>"""
# or a path to an XML doc
path = 'test.xml'
pd.read_xml(path)
The XML doc in the OP becomes the following by simply calling read_xml:

Finding values in XML document using Python

I have following code that tries to get values from XML document:
from xml.dom import minidom
xml = """<SoccerFeed TimeStamp="20130328T152947+0000">
<SoccerDocument uID="f131897" Type="Result" />
<Competition uID="c87">
<MatchData>
<MatchInfo TimeStamp="20070812T144737+0100" Weather="Windy"Period="FullTime" MatchType="Regular" />
<MatchOfficial uID="o11068"/>
<Stat Type="match_time">91</Stat>
<TeamData TeamRef="t810" Side="Home" Score="4" />
<TeamData TeamRef="t2012" Side="Away" Score="1" />
</MatchData>
<Team uID="t810" />
<Team uID="t2012" />
<Venue uID="v2158" />
</SoccerDocument>
</SoccerFeed>"""
xmldoc = minidom.parseString(xml)
soccerfeed = xmldoc.getElementsByTagName("SoccerFeed")[0]
soccerdocument = soccerfeed.getElementsByTagName("SoccerDocument")[0]
#Match Data
MatchData = soccerdocument.getElementsByTagName("MatchData")[0]
MatchInfo = MatchData.getElementsByTagName("MatchInfo")[0]
Goal = MatchData.getElementsByTagNameNS("Side", "Score")
The Goal is being set to [], but I would like to get the score value, which is 4.

It looks like you are searching for the wrong XML node. Check following line:
Goal = MatchData.getElementsByTagNameNS("Side", "Score")
You probably are looking for following:
Goal = MatchData.getElementsByTagName("TeamData")[0].getAttribute("Score")
NOTE: Document.getElementsByTagName, Document.getElementsByTagNameNS, Element.getElementsByTagName, Element.getElementsByTagNameNS return a list of nodes, not just a scalar value.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

PySpark counting rows that contain string - python

Related

Extracting comments from XML file in Python

XML to table form in Excel

Parsing XML with namespace into dictionary

Parsing XML document into a pandas DataFrame

Finding values in XML document using Python

Categories

Resources