Detecting and replacing xml from string in python

I have a file that contains text as well as some xml content dumped into it. It looks something like this :
The authentication details : <id>70016683</id><password>password#123</password>
The next step is to send the request.
The request : <request><id>90016133</id><password>password#3212</password></request>
Additional info includes <Address><line1>House no. 341</line1><line2>A B Street</line2><City>Sample city</City></Address>
I am using a python program to parse this file. I would like to replace the xml part with a placeholder: xml_obj. The output should look something like this :
The authentication details : xml_obj
The next step is to send the request.
The request : xml_obj
Additional info includes xml_obj
At the same time I would also like to extract the replaced xml text and store it in a list. The list should contain None if the line doesn't have an xml object.
I have tried using regex for this purpose :
import re

xml_tag = re.search(r"<\w*>", line)
if xml_tag:
    start_position = xml_tag.start()
    xml_word = xml_tag.group()[:1] + '/' + xml_tag.group()[1:]
    xml_pattern = r'{}'.format(xml_word)
    stop_position = re.search(xml_pattern, line).end()
But this code retrieves the start and stop positions of only one xml tag and its content on the first line, and of the entire structure on the last line of the input file. I would like to get all the xml content irrespective of its structure and also replace it with 'xml_obj'.
Any advice would be helpful. Thanks in advance.
Edit :
I also want to apply the same logic to files that look like this :
The authentication details : ID <id>70016683</id> Password <password>password#123</password> Authentication details complete
The next step is to send the request.
The request : <request><id>90016133</id><password>password#3212</password></request> Request successful
Additional info includes <Address><line1>House no. 341</line1><line2>A B Street</line2><City>Sample city</City></Address>
The above files may have more than one xml object in a line.
They may also have some plain text after the xml part.

The following is a little convoluted, but assuming that your actual text is correctly represented by the sample in your question, try this:
txt = """[your sample text above]"""
lines = txt.splitlines()
entries = []
new_txt = ''
for line in lines:
    entry = line.replace(' <', ' xxx<', 1).split('xxx')
    if len(entry) == 2:
        entries.append(entry[1])
        entry[1] = "xml_obj"
        line = ''.join(entry)
    else:
        entries.append('none')
    new_txt += line + '\n'

for entry in entries:
    print(entry)
print('---')
print(new_txt)
Output:
<id>70016683</id><password>password#123</password>
none
<request><id>90016133</id><password>password#3212</password></request>
<Address><line1>House no. 341</line1><line2>A B Street</line2><City>Sample city</City></Address>
---
The authentication details : xml_obj
The next step is to send the request.
The request : xml_obj
Additional info includes xml_obj
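For the files in the edit, where a line can hold more than one xml object and plain text after it, splitting on the first ' <' is not enough. A regex-based sketch (my own variant, not part of the answer above, assuming tag names are plain \w+ words with no attributes) could look like this:

import re

# One "xml object" is a run of adjacent <tag>...</tag> fragments; the
# backreference \1 closes each fragment with its own opening tag, which
# also covers simple nesting such as <request>...</request>.
xml_run = re.compile(r'(?:<(\w+)>.*?</\1>)+')

entries = []
new_lines = []
for line in txt.splitlines():
    found = [m.group(0) for m in xml_run.finditer(line)]
    entries.append(found if found else None)        # None when the line has no xml
    new_lines.append(xml_run.sub('xml_obj', line))  # every run becomes the placeholder
new_txt = '\n'.join(new_lines)

Here entries holds the list of xml fragments found on each line (or None), so a line like the first edited example yields two separate xml_obj placeholders with the plain text around them left intact.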

Related

Python xml File with added info in tag

I'm trying to parse this form:
<ReturnData documentCnt="3">
<IRS990PF documentId="IRS990PF">
<Organization501c3ExemptPFInd>X</Organization501c3ExemptPFInd>
<FMVAssetsEOYAmt>9325</FMVAssetsEOYAmt>
<MethodOfAccountingCashInd>X</MethodOfAccountingCashInd>
<AnalysisOfRevenueAndExpenses>
<ContriRcvdRevAndExpnssAmt>8917</ContriRcvdRevAndExpnssAmt>
<ScheduleBNotRequiredInd>X</ScheduleBNotRequiredInd>
<TotalRevAndExpnssAmt>8917</TotalRevAndExpnssAmt>
<TotalNetInvstIncmAmt>0</TotalNetInvstIncmAmt>
<OtherExpensesRevAndExpnssAmt referenceDocumentId="STM103">65</OtherExpensesRevAndExpnssAmt>
That last line has additional text in the tag, referenceDocumentId="STM103". I'd like to capture:
That there is a referenceDocumentId
What the referenceDocumentId is (in this case, STM103)
When I use document.find('OtherExpensesRevAndExpnssAmt').text, I get "65".
What command gets me that STM information? Thanks.
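Assuming xml.etree.ElementTree is what parses the file (consistent with the document.find(...) call above), the attribute is exposed through the element's attrib dictionary and get(); a minimal sketch on a trimmed piece of the snippet:

import xml.etree.ElementTree as ET

xml_text = """<AnalysisOfRevenueAndExpenses>
  <OtherExpensesRevAndExpnssAmt referenceDocumentId="STM103">65</OtherExpensesRevAndExpnssAmt>
</AnalysisOfRevenueAndExpenses>"""

document = ET.fromstring(xml_text)
elem = document.find('OtherExpensesRevAndExpnssAmt')
print(elem.text)                             # "65"
print('referenceDocumentId' in elem.attrib)  # True when the attribute is present
print(elem.get('referenceDocumentId'))       # "STM103", or None if it is missing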

Getting values from XML Url Python

I have an account on https://es.besoccer.com/ and they have an API for getting data as XML.
I have this code in Python to print the values I need from the XML:
from xml.dom import minidom

doc = minidom.parse("datos.xml")
partidos = doc.getElementsByTagName("matches")
for partido in partidos:
    local = partido.getElementsByTagName("local")[0]
    visitante = partido.getElementsByTagName("visitor")[0]
    print("local:%s" % local.firstChild.data)
    print("visitante:%s" % visitante.firstChild.data)
    canales = partido.getElementsByTagName("channels")
    for canal in canales:
        nombre = canal.getElementsByTagName("name")[0]
        print("canal:%s" % nombre.firstChild.data)
The problem is that the XML from this site is served at a URL, so I don't know how to read the XML directly from the URL. Another problem is that the XML contains some tags whose content is a link, and Python throws an error for the tags that contain a URL.
Read the API docs here: https://www.besoccer.com/api/documentacion
After you understand which API call you need to use, prepare the URL and the query arguments and use a library like requests in order to read the data.
Once you have the reply (assuming it is XML based), you can parse it with your existing code.
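A minimal sketch of that flow, keeping the minidom code from the question; the endpoint URL and query parameters below are placeholders to be replaced with the ones from the documentation:

import requests
from xml.dom import minidom

# Placeholder endpoint and query parameters; take the real ones from the API docs.
url = "https://example-besoccer-api.invalid/matches"
params = {"key": "YOUR_API_KEY", "format": "xml"}

resp = requests.get(url, params=params, timeout=10)
resp.raise_for_status()

# Parse the XML reply from the response text instead of a local file.
doc = minidom.parseString(resp.text)
partidos = doc.getElementsByTagName("matches")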

Get the 'Last saved by' (windows file) with python

How can I get the username value from the "Last saved by" property of any Windows file?
e.g. I can see this info by right-clicking on a Word file and opening the Details tab.
Does anybody know how I can get it using Python code?
Following the comment from #user1558604, I searched a bit on Google and reached a solution. I tested it on the extensions .docx, .xlsx, and .pptx.
import zipfile
import xml.dom.minidom

# Open the MS Office file to see the XML structure.
filePath = r"C:\Users\Desktop\Perpetual-Draft-2019.xlsx"
document = zipfile.ZipFile(filePath)

# Open/read the core.xml (contains the last user and modified date).
uglyXML = xml.dom.minidom.parseString(document.read('docProps/core.xml')).toprettyxml(indent=' ')

# Split lines in order to create a list.
asText = uglyXML.splitlines()

# Loop the list in order to get the values you need. In my case lastModifiedBy and the date.
for item in asText:
    if 'lastModifiedBy' in item:
        itemLength = len(item) - 20
        print('Modified by:', item[21:itemLength])
    if 'dcterms:modified' in item:
        itemLength = len(item) - 29
        print('Modified On:', item[46:itemLength])
The result in the console is:
Modified by: adm.UserName
Modified On: 2019-11-08"
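The slicing above depends on the exact pretty-printed layout; a sketch of a variant (my own, not part of the original answer) reads the same two values through the DOM instead, assuming the standard docProps/core.xml element names:

import zipfile
import xml.dom.minidom

filePath = r"C:\Users\Desktop\Perpetual-Draft-2019.xlsx"  # same example path as above
with zipfile.ZipFile(filePath) as document:
    core = xml.dom.minidom.parseString(document.read('docProps/core.xml'))

# Read the element text directly instead of slicing pretty-printed lines.
modified_by = core.getElementsByTagName('cp:lastModifiedBy')[0].firstChild.data
modified_on = core.getElementsByTagName('dcterms:modified')[0].firstChild.data
print('Modified by:', modified_by)
print('Modified On:', modified_on)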

Feedparser only returns number not the entire URL

I am using a simple Python script to retrieve the latest RSS info
# RSS read
d = feedparser.parse("http://rss.kicker.de/news/wm")
### (1) Last RSS Feed
url = d.entries[1].id
It works fine, as above, with e.g.
http://rss.kicker.de/news/f1news
Result: http://www.kicker.de/news/formel1/startseite/727510/artikel_vettel-jetzt-auf-augenhoehe-mit-hamilton.html#omrss
Not working:
https://www.fia.com/rss/news/
Result: 23278 at https://www.fia.com
What am I doing wrong here?
Regards,
ET
For some reason, d.entries[1].id returns this tag (I believe it's because id returns the unique identifier of the entry):
<guid isPermaLink="false">23278 at https://www.fia.com</guid>
If you want to get the URL, you can use:
d.entries[1]['link']
References:
Doc for id
Doc for link
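A short usage sketch of the difference, using the FIA feed from the question:

import feedparser

d = feedparser.parse("https://www.fia.com/rss/news/")
entry = d.entries[1]
print(entry.id)    # the GUID, e.g. "23278 at https://www.fia.com"; not necessarily a URL
print(entry.link)  # the article URL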

urllib2.urlopen not getting all content

I am a beginner in Python trying to pull some data from reddit.com.
More precisely, I am trying to send a request to http://www.reddit.com/r/nba/.json to get the JSON content of the page and then parse it for entries about a specific team or player.
To automate the data gathering, I am requesting the page like this:
import urllib2
FH = urllib2.urlopen("http://www.reddit.com/r/nba/.json")
rnba = FH.readlines()
rnba = str(rnba[0])
FH.close()
I am also pulling the content like this on a copy of the script, just to be sure:
import requests

FH = requests.get("http://www.reddit.com/r/nba/.json", timeout=10)
rnba_json = FH.json()
FH.close()
However, I am not getting the full data that is presented when I manually go to
http://www.reddit.com/r/nba/.json with either method, in particular when I call
print len(rnba_json['data']['children']) # prints 20-something child stories
but when I do the same loading the copy-pasted JSON string like this:
import json
import urllib2
fh = r"""{"kind": "Listing", "data": {"modhash": ..."""# long JSON string
r_nba = json.loads(fh) #loads the json string from the site into json object
print len(r_nba['data']['children']) #prints upwards of 100 stories
I get more story links. I know about the timeout parameter but providing it did not resolve anything.
What am I doing wrong or what can I do to get all the content presented when I pull the page in the browser?
To get the max allowed, you'd use the API like: http://www.reddit.com/r/nba/.json?limit=100
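A sketch of that call with requests; limit=100 comes from the answer above, while the User-Agent header is an assumption on my part (reddit tends to throttle default client strings):

import requests

# limit=100 is the maximum number of items returned per listing request.
resp = requests.get(
    "http://www.reddit.com/r/nba/.json",
    params={"limit": 100},
    headers={"User-Agent": "my-nba-scraper/0.1"},  # placeholder UA string
    timeout=10,
)
rnba_json = resp.json()
print(len(rnba_json['data']['children']))  # up to 100 child stories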
