Detecting and replacing xml from string in python

I have a file that contains text as well as some xml content dumped into it. It looks something like this :
The authentication details : <id>70016683</id><password>password#123</password>
The next step is to send the request.
The request : <request><id>90016133</id><password>password#3212</password></request>
Additional info includes <Address><line1>House no. 341</line1><line2>A B Street</line2><City>Sample city</City></Address>
I am using a python program to parse this file. I would like to replace the xml part with a placeholder: xml_obj. The output should look something like this :
The authentication details : xml_obj
The next step is to send the request.
The request : xml_obj
Additional info includes xml_obj
At the same time I would also like to extract the replaced xml text and store it in a list. The list should contain None if the line doesn't have an xml object.
I have tried using regex for this purpose :
import re

xml_tag = re.search(r"<\w*>", line)
if xml_tag:
    start_position = xml_tag.start()
    xml_word = xml_tag.group()[:1] + '/' + xml_tag.group()[1:]
    xml_pattern = r'{}'.format(xml_word)
    stop_position = re.search(xml_pattern, line).end()
But this code retrieves the start and stop positions of only one xml tag and its content on the first line, and of the entire structure on the last line of the input file. I would like to get all the xml content irrespective of its structure and also replace it with 'xml_obj'.
Any advice would be helpful. Thanks in advance.
Edit :
I also want to apply the same logic to files that look like this :
The authentication details : ID <id>70016683</id> Password <password>password#123</password> Authentication details complete
The next step is to send the request.
The request : <request><id>90016133</id><password>password#3212</password></request> Request successful
Additional info includes <Address><line1>House no. 341</line1><line2>A B Street</line2><City>Sample city</City></Address>
The above files may have more than one xml object in a line.
They may also have some plain text after the xml part.

The following is a little convoluted, but assuming that your actual text is correctly represented by the sample in your question, try this:
txt = """[your sample text above]"""
lines = txt.splitlines()
entries = []
new_txt = ''
for line in lines:
    entry = line.replace(' <', ' xxx<', 1).split('xxx')
    if len(entry) == 2:
        entries.append(entry[1])
        entry[1] = "xml_obj"
        line = ''.join(entry)
    else:
        entries.append('none')
    new_txt += line + '\n'

for entry in entries:
    print(entry)
print('---')
print(new_txt)
Output:
<id>70016683</id><password>password#123</password>
none
<request><id>90016133</id><password>password#3212</password></request>
<Address><line1>House no. 341</line1><line2>A B Street</line2><City>Sample city</City></Address>
---
The authentication details : xml_obj
The next step is to send the request.
The request : xml_obj
Additional info includes xml_obj
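For the files in the edit, where a line can hold more than one xml object and plain text after it, splitting on the first ' <' is not enough. A regex-based sketch (my own variant, not part of the answer above, assuming tag names are plain \w+ words with no attributes) could look like this:

import re

# One "xml object" is a run of adjacent <tag>...</tag> fragments; the
# backreference \1 closes each fragment with its own opening tag, which
# also covers simple nesting such as <request>...</request>.
xml_run = re.compile(r'(?:<(\w+)>.*?</\1>)+')

entries = []
new_lines = []
for line in txt.splitlines():
    found = [m.group(0) for m in xml_run.finditer(line)]
    entries.append(found if found else None)        # None when the line has no xml
    new_lines.append(xml_run.sub('xml_obj', line))  # every run becomes the placeholder
new_txt = '\n'.join(new_lines)

Here entries holds the list of xml fragments found on each line (or None), so a line like the first edited example yields two separate xml_obj placeholders with the plain text around them left intact.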

Related

Python xml File with added info in tag

I'm trying to parse this form:
<ReturnData documentCnt="3">
<IRS990PF documentId="IRS990PF">
<Organization501c3ExemptPFInd>X</Organization501c3ExemptPFInd>
<FMVAssetsEOYAmt>9325</FMVAssetsEOYAmt>
<MethodOfAccountingCashInd>X</MethodOfAccountingCashInd>
<AnalysisOfRevenueAndExpenses>
<ContriRcvdRevAndExpnssAmt>8917</ContriRcvdRevAndExpnssAmt>
<ScheduleBNotRequiredInd>X</ScheduleBNotRequiredInd>
<TotalRevAndExpnssAmt>8917</TotalRevAndExpnssAmt>
<TotalNetInvstIncmAmt>0</TotalNetInvstIncmAmt>
<OtherExpensesRevAndExpnssAmt referenceDocumentId="STM103">65</OtherExpensesRevAndExpnssAmt>
That last line has additional text in the tag, referenceDocumentId="STM103". I'd like to capture:
That there is a referenceDocumentId
What the referenceDocumentId is (in this case, STM103)
When I use document.find('OtherExpensesRevAndExpnssAmt').text, I get "65".
What command gets me that STM information? Thanks.
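Assuming xml.etree.ElementTree is what parses the file (consistent with the document.find(...) call above), the attribute is exposed through the element's attrib dictionary and get(); a minimal sketch on a trimmed piece of the snippet:

import xml.etree.ElementTree as ET

xml_text = """<AnalysisOfRevenueAndExpenses>
  <OtherExpensesRevAndExpnssAmt referenceDocumentId="STM103">65</OtherExpensesRevAndExpnssAmt>
</AnalysisOfRevenueAndExpenses>"""

document = ET.fromstring(xml_text)
elem = document.find('OtherExpensesRevAndExpnssAmt')
print(elem.text)                             # "65"
print('referenceDocumentId' in elem.attrib)  # True when the attribute is present
print(elem.get('referenceDocumentId'))       # "STM103", or None if it is missing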

Getting values from XML Url Python

I have an account on https://es.besoccer.com/ and they have an API for getting data as XML.
I have this code in Python to print the values I need from the XML:
from xml.dom import minidom

doc = minidom.parse("datos.xml")
partidos = doc.getElementsByTagName("matches")
for partido in partidos:
    local = partido.getElementsByTagName("local")[0]
    visitante = partido.getElementsByTagName("visitor")[0]
    print("local:%s" % local.firstChild.data)
    print("visitante:%s" % visitante.firstChild.data)
    canales = partido.getElementsByTagName("channels")
    for canal in canales:
        nombre = canal.getElementsByTagName("name")[0]
        print("canal:%s" % nombre.firstChild.data)
The problem is that the XML from this site is served at a URL, so I don't know how to read the XML directly from the URL. Another problem is that the XML contains some tags whose content is a link, and Python throws an error for the tags that contain a URL.
Read the API docs here: https://www.besoccer.com/api/documentacion
After you understand which API call you need to use, prepare the URL and the query arguments and use a library like requests in order to read the data.
Once you have the reply (assuming it is XML based), you can parse it with your existing code.
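A minimal sketch of that flow, keeping the minidom code from the question; the endpoint URL and query parameters below are placeholders to be replaced with the ones from the documentation:

import requests
from xml.dom import minidom

# Placeholder endpoint and query parameters; take the real ones from the API docs.
url = "https://example-besoccer-api.invalid/matches"
params = {"key": "YOUR_API_KEY", "format": "xml"}

resp = requests.get(url, params=params, timeout=10)
resp.raise_for_status()

# Parse the XML reply from the response text instead of a local file.
doc = minidom.parseString(resp.text)
partidos = doc.getElementsByTagName("matches")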

Get the 'Last saved by' (windows file) with python

How can I get the username value from the "Last saved by" property of any Windows file?
e.g. I can see this info by right-clicking on a Word file and opening the Details tab.
Does anybody know how I can get it using Python code?
Following the comment from #user1558604, I searched a bit on Google and reached a solution. I tested it on the extensions .docx, .xlsx, and .pptx.
import zipfile
import xml.dom.minidom

# Open the MS Office file to see the XML structure.
filePath = r"C:\Users\Desktop\Perpetual-Draft-2019.xlsx"
document = zipfile.ZipFile(filePath)

# Open/read the core.xml (contains the last user and modified date).
uglyXML = xml.dom.minidom.parseString(document.read('docProps/core.xml')).toprettyxml(indent=' ')

# Split lines in order to create a list.
asText = uglyXML.splitlines()

# Loop the list in order to get the values you need. In my case lastModifiedBy and the date.
for item in asText:
    if 'lastModifiedBy' in item:
        itemLength = len(item) - 20
        print('Modified by:', item[21:itemLength])
    if 'dcterms:modified' in item:
        itemLength = len(item) - 29
        print('Modified On:', item[46:itemLength])
The result in the console is:
Modified by: adm.UserName
Modified On: 2019-11-08"
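The slicing above depends on the exact pretty-printed layout; a sketch of a variant (my own, not part of the original answer) reads the same two values through the DOM instead, assuming the standard docProps/core.xml element names:

import zipfile
import xml.dom.minidom

filePath = r"C:\Users\Desktop\Perpetual-Draft-2019.xlsx"  # same example path as above
with zipfile.ZipFile(filePath) as document:
    core = xml.dom.minidom.parseString(document.read('docProps/core.xml'))

# Read the element text directly instead of slicing pretty-printed lines.
modified_by = core.getElementsByTagName('cp:lastModifiedBy')[0].firstChild.data
modified_on = core.getElementsByTagName('dcterms:modified')[0].firstChild.data
print('Modified by:', modified_by)
print('Modified On:', modified_on)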

Feedparser only returns number not the entire URL

I am using a simple Python script to retrieve the latest RSS info
# RSS read
d = feedparser.parse("http://rss.kicker.de/news/wm")
### (1) Last RSS Feed
url = d.entries[1].id
It works fine, as above, with e.g.
http://rss.kicker.de/news/f1news
Result: http://www.kicker.de/news/formel1/startseite/727510/artikel_vettel-jetzt-auf-augenhoehe-mit-hamilton.html#omrss
Not working:
https://www.fia.com/rss/news/
Result: 23278 at https://www.fia.com
What am I doing wrong here?
Regards,
ET
For some reason, d.entries[1].id returns this tag (I believe it's because id returns the unique identifier of the entry):
<guid isPermaLink="false">23278 at https://www.fia.com</guid>
If you want to get the URL, you can use:
d.entries[1]['link']
References:
Doc for id
Doc for link
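A short usage sketch of the difference, using the FIA feed from the question:

import feedparser

d = feedparser.parse("https://www.fia.com/rss/news/")
entry = d.entries[1]
print(entry.id)    # the GUID, e.g. "23278 at https://www.fia.com"; not necessarily a URL
print(entry.link)  # the article URL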

urllib2.urlopen not getting all content

I am a beginner in Python trying to pull some data from reddit.com.
More precisely, I am trying to send a request to http://www.reddit.com/r/nba/.json to get the JSON content of the page and then parse it for entries about a specific team or player.
To automate the data gathering, I am requesting the page like this:
import urllib2
FH = urllib2.urlopen("http://www.reddit.com/r/nba/.json")
rnba = FH.readlines()
rnba = str(rnba[0])
FH.close()
I am also pulling the content like this on a copy of the script, just to be sure:
import requests

FH = requests.get("http://www.reddit.com/r/nba/.json", timeout=10)
rnba_json = FH.json()
FH.close()
However, I am not getting the full data that is presented when I manually go to
http://www.reddit.com/r/nba/.json with either method, in particular when I call
print len(rnba_json['data']['children']) # prints 20-something child stories
but when I do the same loading the copy-pasted JSON string like this:
import json
import urllib2
fh = r"""{"kind": "Listing", "data": {"modhash": ..."""# long JSON string
r_nba = json.loads(fh) #loads the json string from the site into json object
print len(r_nba['data']['children']) #prints upwards of 100 stories
I get more story links. I know about the timeout parameter but providing it did not resolve anything.
What am I doing wrong or what can I do to get all the content presented when I pull the page in the browser?
To get the max allowed, you'd use the API like: http://www.reddit.com/r/nba/.json?limit=100
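A sketch of that call with requests; limit=100 comes from the answer above, while the User-Agent header is an assumption on my part (reddit tends to throttle default client strings):

import requests

# limit=100 is the maximum number of items returned per listing request.
resp = requests.get(
    "http://www.reddit.com/r/nba/.json",
    params={"limit": 100},
    headers={"User-Agent": "my-nba-scraper/0.1"},  # placeholder UA string
    timeout=10,
)
rnba_json = resp.json()
print(len(rnba_json['data']['children']))  # up to 100 child stories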
