python parse conditional multiple lines in text file - python

I would like to parse a text file, but with multiple conditions:
Input:
something<br />
something <br />
something<br />
Modifications made by xy (xy) on 2019/12/10 10:40:23<br />
location: A --> B<br />
something<br />
something<br />
something<br />
Modifications made by xz (xz) on 2020/01/17 11:11:59<br />
analyzer: C --> D<br />
analyzer: B --> D<br />
analyzer: G --> D<br />
location: E --> F<br />
something<br />
something<br />
something
Task:
I need to find each "location: x --> y" line and the date that precedes it. The text file can contain an unknown number of location changes.
Required Output:
2019/12/10 10:40:23, location: A --> B
2020/01/17 11:11:59, location: E --> F
I tried some code, e.g.:
with open('log.txt', 'r') as searchfile:
    for line in searchfile:
        if 'location' in line:
            print(line)
but this only finds the locations, and I don't know how to find the dates that belong to them.
Thank you in advance.

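Just keep track of the most recent timestamp and print it whenever a location line follows:
with open('log.txt', 'r') as searchfile:
    time = None
    for line in searchfile:
        line = line.strip()  # drop the trailing newline so the output stays on one line
        if line.startswith('Modifications made by'):
            # the timestamp is everything after the last ' on '
            time = line.split(' on ')[-1].strip()
        elif line.startswith('location') and time is not None:
            print(f'{time}, {line}')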
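A slightly stricter variant of the same idea, as a minimal sketch: it matches the timestamp with a regular expression and collects the pairs in a list instead of printing them right away. The names `STAMP` and `location_changes` are made up for this illustration, and stripping a literal `<br />` is an assumption in case the tags really are part of the file.

```python
import re

LOG = """something
Modifications made by xy (xy) on 2019/12/10 10:40:23
location: A --> B
something
Modifications made by xz (xz) on 2020/01/17 11:11:59
analyzer: C --> D
location: E --> F
something"""

# Capture the "YYYY/MM/DD HH:MM:SS" part of a "Modifications made by" line.
STAMP = re.compile(
    r"^Modifications made by .* on (\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})"
)

def location_changes(lines):
    """Return (timestamp, location line) pairs in file order."""
    results = []
    stamp = None
    for raw in lines:
        # Assumption: a trailing "<br />" is markup, not data, so drop it.
        line = raw.replace("<br />", "").strip()
        match = STAMP.match(line)
        if match:
            stamp = match.group(1)
        elif line.startswith("location:") and stamp is not None:
            results.append((stamp, line))
    return results

changes = location_changes(LOG.splitlines())
for stamp, change in changes:
    print(f"{stamp}, {change}")
```

For a real file, `location_changes(open('log.txt'))` should behave the same way, since a file object iterates line by line.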

Related

Parsing XML with Python - Accessing Values

I have recently got a RaspberryPi and have started to learn Python. To begin with I want to parse an XML file and I am doing this via the untangle library.
My XML looks like:
<?xml version="1.0" encoding="utf-8"?>
<weatherdata>
<location>
<name>Katherine</name>
<type>Administrative division</type>
<country>Australia</country>
<timezone id="Australia/Darwin" utcoffsetMinutes="570" />
<location altitude="176" latitude="-14.65012" longitude="132.17414" geobase="geonames" geobaseid="7839404" />
</location>
<sun rise="2019-02-04T06:33:52" set="2019-02-04T19:16:15" />
<forecast>
<tabular>
<time from="2019-02-04T06:30:00" to="2019-02-04T12:30:00" period="1">
<!-- Valid from 2019-02-04T06:30:00 to 2019-02-04T12:30:00 -->
<symbol number="9" numberEx="9" name="Rain" var="09" />
<precipitation value="1.8" />
<!-- Valid at 2019-02-04T06:30:00 -->
<windDirection deg="314.8" code="NW" name="Northwest" />
<windSpeed mps="3.3" name="Light breeze" />
<temperature unit="celsius" value="26" />
<pressure unit="hPa" value="1005.0" />
</time>
<time from="2019-02-04T12:30:00" to="2019-02-04T18:30:00" period="2">
<!-- Valid from 2019-02-04T12:30:00 to 2019-02-04T18:30:00 -->
<symbol number="9" numberEx="9" name="Rain" var="09" />
<precipitation value="2.3" />
<!-- Valid at 2019-02-04T12:30:00 -->
<windDirection deg="253.3" code="WSW" name="West-southwest" />
<windSpeed mps="3.0" name="Light breeze" />
<temperature unit="celsius" value="29" />
<pressure unit="hPa" value="1005.0" />
</time>
</tabular>
</forecast>
</weatherdata>
From this I would like to be able to print out the from and to attributes of the <time> element as well as the value attribute in its child node <temperature>
I can correctly print out the temperature values if I run the Python script below:
for forecast in data.weatherdata.forecast.tabular.time:
    print (forecast.temperature['value'])
but if I run
for forecast in data.weatherdata.forecast.tabular:
    print ("time is " + forecast.time['from'] + "and temperature is " + forecast.time.temperature['value'])
I get an error:
print (forecast.time['from'] + forecast.time.temperature['value'])
TypeError: list indices must be integers, not str
Can anyone advise how I can correctly access these values?
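forecast.time is a list, because <tabular> contains several <time> nodes, one entry for each. Indexing that list with ['from'] is what raises the TypeError.
Did you expect forecast.time['from'] to automatically aggregate that data? Loop over data.weatherdata.forecast.tabular.time as in your first snippet, and read forecast['from'], forecast['to'] and forecast.temperature['value'] from each item.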
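For comparison, here is a minimal sketch of the same extraction with the standard library's xml.etree.ElementTree instead of untangle (a different parser, used here only so the example runs without extra packages). The XML is a trimmed copy of the document in the question, and `rows` is a name chosen for this illustration.

```python
import xml.etree.ElementTree as ET

XML = """<weatherdata>
  <forecast>
    <tabular>
      <time from="2019-02-04T06:30:00" to="2019-02-04T12:30:00" period="1">
        <temperature unit="celsius" value="26" />
      </time>
      <time from="2019-02-04T12:30:00" to="2019-02-04T18:30:00" period="2">
        <temperature unit="celsius" value="29" />
      </time>
    </tabular>
  </forecast>
</weatherdata>"""

root = ET.fromstring(XML)

rows = []
# findall returns every <time> element, so each one is handled individually.
for time_el in root.findall("./forecast/tabular/time"):
    temperature = time_el.find("temperature")
    rows.append((time_el.get("from"), time_el.get("to"), temperature.get("value")))

for start, end, value in rows:
    print("time is " + start + " to " + end + " and temperature is " + value)
```

The point is the same in both libraries: iterate over the individual <time> elements first, then read attributes from each one.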

How to skip <Br /> elements when combining text in XML using Python?

I've been trying to combine all the text in the Content elements of an XML file using Python.
I succeeded in combining all the content text, but I need to exclude content that sits right below a <Br /> element.
A <Br /> element represents a line break (Enter) in Adobe InDesign.
This XML was exported from Adobe InDesign.
Here is an example:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Root>
<Story>
<ParagraphStyleRange>
<CharacterStyleRange>
<Content>AAA</Content>
</CharacterStyleRange>
<CharacterStyleRange>
<Content>BBB</Content>
</CharacterStyleRange>
<CharacterStyleRange>
<Br />
<Content>CCC</Content>
<Br />
<Content>DDD</Content>
</CharacterStyleRange>
<CharacterStyleRange>
<Content>EEE</Content>
</CharacterStyleRange>
<CharacterStyleRange>
<Br />
<Content>FFF</Content>
<Br />
</CharacterStyleRange>
</ParagraphStyleRange>
</Story>
</Root>
and this is what I want:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Root>
<Story>
<ParagraphStyleRange>
<CharacterStyleRange>
<Content>AAA</Content>
</CharacterStyleRange>
<CharacterStyleRange>
<Content>AAABBB</Content>
</CharacterStyleRange>
<CharacterStyleRange>
<Br />
<Content>CCC</Content>
<Br />
<Content>DDD</Content>
</CharacterStyleRange>
<CharacterStyleRange>
<Content>DDDEEE</Content>
</CharacterStyleRange>
<CharacterStyleRange>
<Br />
<Content>FFF</Content>
<Br />
</CharacterStyleRange>
</ParagraphStyleRange>
</Story>
</Root>
As you can see, I don't want to prepend the previous content text when there is a <Br /> element right above the content I would add it to.
In detail, the first Content element's text is AAA and the next one is BBB.
In this case AAA should be attached in front of BBB,
but BBB is not attached in front of CCC because there is a <Br /> element right above the CCC Content.
Could you help me recognize the <Br /> element so that I can skip it?
This is my code so far, but it doesn't work well:
import xml.etree.ElementTree as ET

tree = ET.parse("C:\\Br_test.xml")
root = tree.getroot()
for ParagraphStyleRange in root.findall('.//Story/ParagraphStyleRange'):
    CharacterStyleRange_count = len(ParagraphStyleRange.findall('CharacterStyleRange'))
    #print(CharacterStyleRange_count)
    if int(CharacterStyleRange_count) >= 2 :
        try :
            Content_collect = ''
            for CharacterStyleRange in ParagraphStyleRange.findall('CharacterStyleRange'):
                Br_count = len(CharacterStyleRange.findall('Br'))
                print(Br_count)
                if int(Br_count) == 0 :
                    for Content in CharacterStyleRange.findall('Content'):
                        Content_collect += Content.text
                        Content.text = str(Content_collect)
                        print(Content_collect)
            #---- Code to delete Contents that are attached to next one---
            #for CharacterStyleRange in ParagraphStyleRange.findall('CharacterStyleRange')[:-1]:
            #    for Content in CharacterStyleRange.findall('Content'):
            #        Content_remove = CharacterStyleRange.remove(Content)
        except:
            pass
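One possible approach, sketched under an assumption about the intended rule: walk the children of every CharacterStyleRange in document order, keep a running string of the Content text seen so far, and reset that string whenever a <Br /> is met. The example only shows two-element runs, so whether a longer run should accumulate (AAA, AAABBB, AAABBBZZZ) is a guess based on the accumulating code in the question. `merge_content` is a name made up for this illustration.

```python
import xml.etree.ElementTree as ET

XML = """<Root><Story><ParagraphStyleRange>
<CharacterStyleRange><Content>AAA</Content></CharacterStyleRange>
<CharacterStyleRange><Content>BBB</Content></CharacterStyleRange>
<CharacterStyleRange><Br /><Content>CCC</Content><Br /><Content>DDD</Content></CharacterStyleRange>
<CharacterStyleRange><Content>EEE</Content></CharacterStyleRange>
<CharacterStyleRange><Br /><Content>FFF</Content><Br /></CharacterStyleRange>
</ParagraphStyleRange></Story></Root>"""

def merge_content(paragraph):
    """Prefix each Content with the text collected since the last Br."""
    collected = ''
    for char_range in paragraph.findall('CharacterStyleRange'):
        # Iterating an element yields its direct children in document order.
        for child in char_range:
            if child.tag == 'Br':
                collected = ''  # a line break starts a new run
            elif child.tag == 'Content':
                collected += child.text or ''
                child.text = collected

root = ET.fromstring(XML)
for paragraph in root.findall('.//Story/ParagraphStyleRange'):
    merge_content(paragraph)

texts = [content.text for content in root.iter('Content')]
print(texts)
```

Checking `child.tag` while iterating, rather than counting Br elements per CharacterStyleRange, is what lets DDD carry over into EEE even though DDD shares its range with two <Br /> elements.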

Python ElementTree xml output to csv

I have the following XML file ('registerreads_EE.xml'):
<?xml version="1.0" encoding="us-ascii" standalone="yes"?>
<ReadingDocument xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<ReadingStatusRefTable>
<ReadingStatusRef Ref="1">
<UnencodedStatus SourceValidation="SIMPLE">
<StatusCodes>
<Signal>XX</Signal>
</StatusCodes>
</UnencodedStatus>
</ReadingStatusRef>
</ReadingStatusRefTable>
<Header>
<IEE_System Id="XXXXXXXXXXXXXXX" />
<Creation_Datetime Datetime="2015-10-22T09:05:32Z" />
<Timezone Id="UTC" />
<Path FilePath="X:\XXXXXXXXXXXX.xml" />
<Export_Template Id="XXXXX" />
<CorrelationID Id="" />
</Header>
<ImportExportParameters ResubmitFile="false" CreateGroup="true">
<DataFormat TimestampType="XXXXXX" Type="XXXX" />
</ImportExportParameters>
<Channels>
<Channel StartDate="2015-10-21T00:00:00-05:00" EndDate="2015-10-22T00:00:00-05:00">
<ChannelID ServicePointChannelID="73825603:301" />
<Readings>
<Reading Value="3577.0" ReadingTime="2015-10-21T00:00:00-05:00" StatusRef="1" />
<Reading Value="3601.3" ReadingTime="2015-10-22T00:00:00-05:00" StatusRef="1" />
</Readings>
<ExportRequest RequestID="152" EntityType="ServicePoint" EntityID="73825603" RequestSource="Scheduled" />
</Channel>
<Channel StartDate="2015-10-21T00:00:00-05:00" EndDate="2015-10-22T00:00:00-05:00">
<ChannelID ServicePointChannelID="73825604:301" />
<Readings>
<Reading Value="3462.5" ReadingTime="2015-10-21T00:00:00-05:00" StatusRef="1" />
<Reading Value="3501.5" ReadingTime="2015-10-22T00:00:00-05:00" StatusRef="1" />
</Readings>
<ExportRequest RequestID="152" EntityType="ServicePoint" EntityID="73825604" RequestSource="Scheduled" />
</Channel>
</Channels>
</ReadingDocument>
I want to parse the XML of the channel data into a csv file.
Here is what I have written in Python 2.7.10:
import xml.etree.ElementTree as ET
tree = ET.parse('registerreads_EE.xml')
root = tree.getroot()[3]
for channel in tree.iter('Channel'):
    for exportrequest in channel.iter('ExportRequest'):
        entityid = exportrequest.attrib.get('EntityID')
        for meterread in channel.iter('Reading'):
            read = meterread.attrib.get('Value')
            date = meterread.attrib.get('ReadingTime')
            print read[:-2],",",date[:10],",",entityid
tree.write(open('registerreads_EE.csv','w'))
Here is the screen output when the above is run:
3577 , 2015-10-21 , 73825603
3601 , 2015-10-22 , 73825603
3462 , 2015-10-21 , 73825604
3501 , 2015-10-22 , 73825604
The 'registerreads_EE.csv' output file is like the original XML file, minus the first line.
I would like the printed output above outputted to a csv file with headers of read, date, entityid.
I am having difficulty with this. This is my first python program. Any help is appreciated.
Use the csv module to write the rows to the CSV file instead of tree.write, which serializes the XML tree again. Keep using ElementTree to parse the XML file and extract its content. The open() call below is written for Python 3; on Python 2.7, open the file with mode 'wb' and drop newline=''.
import xml.etree.ElementTree as ET
import csv
tree = ET.parse('registerreads_EE.xml')
root = tree.getroot()[3]
with open('registerreads_EE.csv', 'w', newline='') as r:
    writer = csv.writer(r)
    writer.writerow(['read', 'date', 'entityid']) # WRITING HEADERS
    for channel in tree.iter('Channel'):
        for exportrequest in channel.iter('ExportRequest'):
            entityid = exportrequest.attrib.get('EntityID')
            for meterread in channel.iter('Reading'):
                read = meterread.attrib.get('Value')
                date = meterread.attrib.get('ReadingTime')
                # WRITE EACH ROW ITERATIVELY
                writer.writerow([read[:-2],date[:10],entityid])

PyQuery get text node

I'm using PyQuery to process this HTML:
<div class="container">
<strong>Personality: Strengths</strong>
<br />
Text
<br />
<br />
<strong>Personality: Weaknesses</strong>
<br />
Text
<br />
<br />
</div>
Now that I have a variable e pointing to .container, I'm looping through its children:
for c in e.iterchildren():
    print(c.tag)
but this way I can't get the text nodes (the two Text strings).
How can I loop over an element's children, including text nodes?
You can do it like this:
for c in e.children():
    p = PyQuery(c)
    print(str(p))
    # here re.sub could remove the HTML tags
This prints the raw markup of each child node, including any text that trails it.
If you want to tell whether a child is followed by text:
raw = str(p).strip()
a = raw.rfind(">")
if a + 1 != len(raw):
    print('is text')
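An alternative that avoids string searching, as a hedged sketch: in the ElementTree data model (which lxml, the library underneath PyQuery, follows as well), text that comes after a child element is stored in that child's `.tail` attribute. The example below uses the standard library's xml.etree.ElementTree rather than PyQuery so that it runs on its own, and `iter_nodes` is a name invented for this illustration.

```python
import xml.etree.ElementTree as ET

HTML = """<div class="container">
<strong>Personality: Strengths</strong>
<br />
Text
<br />
<br />
<strong>Personality: Weaknesses</strong>
<br />
Text
<br />
<br />
</div>"""

def iter_nodes(element):
    """Yield ('tag', name) and ('text', value) items in document order."""
    # Text before the first child lives on the parent itself.
    if element.text and element.text.strip():
        yield ('text', element.text.strip())
    for child in element:
        yield ('tag', child.tag)
        # Text after a child lives in that child's tail.
        if child.tail and child.tail.strip():
            yield ('text', child.tail.strip())

container = ET.fromstring(HTML)
nodes = list(iter_nodes(container))
for kind, value in nodes:
    print(kind, value)
```

This only works here because the snippet happens to be well-formed XML; for real-world HTML a lenient parser such as the one PyQuery uses would be needed, with the same `.text` and `.tail` idea.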

Finding values in XML document using Python

I have the following code that tries to get values from an XML document:
from xml.dom import minidom
xml = """<SoccerFeed TimeStamp="20130328T152947+0000">
<SoccerDocument uID="f131897" Type="Result">
<Competition uID="c87">
<MatchData>
<MatchInfo TimeStamp="20070812T144737+0100" Weather="Windy" Period="FullTime" MatchType="Regular" />
<MatchOfficial uID="o11068"/>
<Stat Type="match_time">91</Stat>
<TeamData TeamRef="t810" Side="Home" Score="4" />
<TeamData TeamRef="t2012" Side="Away" Score="1" />
</MatchData>
<Team uID="t810" />
<Team uID="t2012" />
<Venue uID="v2158" />
</SoccerDocument>
</SoccerFeed>"""
xmldoc = minidom.parseString(xml)
soccerfeed = xmldoc.getElementsByTagName("SoccerFeed")[0]
soccerdocument = soccerfeed.getElementsByTagName("SoccerDocument")[0]
#Match Data
MatchData = soccerdocument.getElementsByTagName("MatchData")[0]
MatchInfo = MatchData.getElementsByTagName("MatchInfo")[0]
Goal = MatchData.getElementsByTagNameNS("Side", "Score")
The Goal is being set to [], but I would like to get the score value, which is 4.
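It looks like you are searching for the wrong XML node. Check the following line:
Goal = MatchData.getElementsByTagNameNS("Side", "Score")
getElementsByTagNameNS takes a namespace URI and a local name, so this call looks for elements named Score in the namespace "Side", and there are none. Score is an attribute of the TeamData element, so you are probably looking for the following:
Goal = MatchData.getElementsByTagName("TeamData")[0].getAttribute("Score")
NOTE: Document.getElementsByTagName, Document.getElementsByTagNameNS, Element.getElementsByTagName, Element.getElementsByTagNameNS return a list of nodes, not just a scalar value, and getAttribute returns the attribute value as a string ("4" here).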
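Because there are two TeamData elements, indexing with [0] only ever gives the first one. A minimal sketch that reads both scores keyed by the Side attribute follows; the XML is a trimmed, well-formed copy of the feed in the question, and `scores` is a name chosen for this illustration.

```python
from xml.dom import minidom

XML = """<SoccerFeed TimeStamp="20130328T152947+0000">
<SoccerDocument uID="f131897" Type="Result">
<MatchData>
<TeamData TeamRef="t810" Side="Home" Score="4" />
<TeamData TeamRef="t2012" Side="Away" Score="1" />
</MatchData>
</SoccerDocument>
</SoccerFeed>"""

doc = minidom.parseString(XML)

# getElementsByTagName returns a list of nodes, so iterate over it
# and read the attributes from each TeamData element.
scores = {
    team.getAttribute("Side"): int(team.getAttribute("Score"))
    for team in doc.getElementsByTagName("TeamData")
}
print(scores)
```

Here `scores["Home"]` holds the value the question asks for, converted to an integer.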
