I am trying to read all the links in the tag and then trying to create wiki links out of it...basically I want to read each link from the xml file and then create wiki links with the last word(please see below on what I mean by lastword) of the link...for somereason am running into following error,what I am missing,please suggest
http://wiki.build.com/ca_builds/CIT (last word is CIT)
http://wiki.build.com/ca_builds/1.2_Archive(last word is 1.2_Archive)
INPUT XML:-
<returnLink>
http://wiki.build.com/ca_builds/CIT
http://wiki.build.com/ca_builds/1.2_Archive
</returnLink>
PYTHON code
def getReturnLink(xml):
"""Collects the link to return to the PL home page from the config file."""
if xml.find('<returnLink>') == -1:
return None
else:
linkStart=xml.find('<returnLink>')
linkEnd=xml.find('</returnLink>')
link=xml[linkStart+12:linkEnd].strip()
link = link.split('\n')
#if link.find('.com') == -1:
#return None
for line in link:
line = line.strip()
print "LINE"
print line
lastword = line.rfind('/') + 1
line = '['+link+' lastword]<br>'
linklis.append(line)
return linklis
OUTPUT:-
line = '['+link+' lastword]<br>'
TypeError: cannot concatenate 'str' and 'list' objects
EXPECTED OUTPUT:-
CIT (this will point to http://wiki.build.com/ca_builds/CIT
1.2_Archive (this will point to http://wiki.build.com/ca_builds/1.2_Archive 1.2_Archive)
Python standard library has xml parser. You can also support multiple <returnLink> elements and Unicode words in an url:
import posixpath
import urllib
import urlparse
from xml.etree import cElementTree as etree
def get_word(url):
basename = posixpath.basename(urlparse.urlsplit(url).path)
return urllib.unquote(basename).decode("utf-8")
urls = (url.strip()
for links in etree.parse(input_filename_or_file).iter('returnLink')
for url in links.text.splitlines())
wikilinks = [u"[{} {}]".format(url, get_word(url))
for url in urls if url]
print(wikilinks)
Note: work with Unicode internally. Convert the text to bytes only to communicate with outside world e.g., when writing to a file.
Example
[http://wiki.build.com/ca_builds/CIT#some-fragment CIT]
[http://wiki.build.com/ca_builds/Unicode%20%28%E2%99%A5%29 Unicode (♥)]
Intead of parsing XML by hand, use a library like lxml:
>>> s = """<returnLink>
... http://wiki.build.com/ca_builds/CIT
... http://wiki.build.com/ca_builds/1.2_Archive
... </returnLink>"""
>>> from lxml import etree
>>> xml_tree = etree.fromstring(s)
>>> links = xml_tree.text.split()
>>> for i in links:
... print '['+i+']'+i[i.rfind('/')+1:]
...
[http://wiki.build.com/ca_builds/CIT]CIT
[http://wiki.build.com/ca_builds/1.2_Archive]1.2_Archive
I'm not sure what you mean by wikilinks, but the above should give you an idea on how to parse the string.
I'm having some difficulty understanding you question, but it seems like you just want to return the string after the last '/' character in the link? You can do this with reverse find.
return link[link.rfind('/') + 1:]
Related
I can use the findall() API without and issue. Below is the simple case
import nltk
raw = "Management Discussion and Analysis"
raw = raw.lower()
tokens = nltk.word_tokenize(raw)
text = nltk.Text(tokens)
text.findall(r"<.*> <.*> <.*> <analysis>")
Output
management discussion and analysis
Now if I change the raw variable so that findall does not find anything.
import nltk
raw = "Management Discussion and Analysisss"
raw = raw.lower()
tokens = nltk.word_tokenize(raw)
text = nltk.Text(tokens)
text.findall(r"<.*> <.*> <.*> <analysis>")
Output
So the question is how to distinguish at the caller side between success and failure?.
I also checked and debugged the library code and the implementation is to just print the content and return nothing. I found it little strange but do not know why API does not return anything.
hits = self._token_searcher.findall(regexp)
hits = [" ".join(h) for h in hits]
print(tokenwrap(hits, "; "))
Kindly advise.
With further working on this, I was able to think/implement a logic which worked for my requirement. I posted the answer so that somebody else can refer it in future.
def nltk_text_findall_object(nltkText, regexp):
outList = []
finalOutList = []
# now assign stdout handle with some text file so that
# nltk findall() API output/print could be redirected.
tempFileName = "tempFile.txt"
orig_out = sys.stdout
sys.stdout = open(tempFileName, "w")
nltkText.findall(regexp)
# now restore the stdout handle with original value.
sys.stdout.close()
sys.stdout = orig_out
# Now check for the content in the file and return the list.
file = open(tempFileName,"r")
raw = file.read()
file.close()
# nltk findall() API returns the list of strings separated
# by ; as per their current implementation.
outList = raw.split(";")
outList = [str(item).strip() for item in outList]
for item in outList:
if(len(item) > 1):
finalOutList.append(item)
# now we are done with the file, let's us delete it.
os.remove(tempFileName)
return finalOutList
Client Logic To Use Above Method
raw = "Management Discussion and Analysis"
raw = raw.lower()
tokens = nltk.word_tokenize(raw)
regex = r"<.*> <.*> <.*> <analysis>"
outList = nltk_text_findall_object(text, regex)
if(len(outList) == 0):
print("Did Not Found")
else:
print("Found")
Kindly somebody let me know in case there is better way to achieve/implement use case posted in my question.
The following code is designed to write a tuple, each containing a large paragraph of text, and 2 identifiers behind them, to a single line per each entry.
import urllib2
import json
import csv
base_url = "https://www.eventbriteapi.com/v3/events/search/?page={}
writer = csv.writer(open("./data/events.csv", "a"))
writer.writerow(["description", "category_id", "subcategory_id"])
def format_event(event):
return event["description"]["text"].encode("utf-8").rstrip("\n\r"), event["category_id"], event["subcategory_id"]
for x in range(1, 2):
print "fetching page - {}".format(x)
formatted_url = base_url.format(str(x))
resp = urllib2.urlopen(formatted_url)
data = resp.read()
j_data = json.loads(data)
events = map(format_event, j_data["events"])
for event in events:
#print event
writer.writerow(event)
print "wrote out events for page - {}".format(x)
The ideal format would be to have each line contain a single paragraph, followed by the other fields listed above, yet here is a screenshot of how the data comes out.
If instead I this line to the following:
writer.writerow([event])
Here is how the file now looks:
It certainly looks much closer to what I want, but its got parenthesis around each entry which are undesirable.
EDIT
here is a snippet that contains a sample of the data Im working with.
Can you try writing to the CSV file directly without using using the csv module? You can write/append comma-delimited strings to the CSV file just like writing to typical text files. Also, the way you deal with removing \r and \n characters might not be working. You can use regex to find those characters and replace them with an empty string "":
import urllib2
import json
import re
base_url = "https://www.eventbriteapi.com/v3/events/search/?page={}"
def format_event(event):
ws_to_strip = re.compile(r"(\r|\n)")
description = re.sub(ws_to_strip, "", event["description"]["text"].encode("utf-8"))
return [description, event["category_id"], event["subcategory_id"]]
with open("./data/events.csv", "a") as events_file:
events_file.write(",".join(["description", "category_id", "subcategory_id"]))
for x in range(1, 2):
print "fetching page - {}".format(x)
formatted_url = base_url.format(str(x))
resp = urllib2.urlopen(formatted_url)
data = resp.read()
j_data = json.loads(data)
events = map(format_event, j_data["events"])
for event in events:
events_file.write(",".join(event))
print "wrote out events for page - {}".format(x)
Change your csv writer to be DictWriter.
Make a few tweaks:
def format_event(event):
return {"description": event["description"]["text"].encode("utf-8").rstrip("\n\r"),
"category_id": event["category_id"],
"subcategory_id": event["subcategory_id"]}
May be a few other small things you need to do, but using DictWriter and formatting your data appropriately has been the easiest way to work with csv files that I've found.
import sys
import os
import urllib
from xml.etree.ElementTree import ElementTree
from xml.etree.ElementTree import tostring
import flickrapi
api_key = ' '
api_password = ' '
photo_id='2124494179'
flickr= flickrapi.FlickrAPI(api_key, api_password)
#photos= flickr.photos_getinfo(photo_id='15295705890')
#tree=ElementTree(flickr.photos_getinfo(photo_id))
#image_id=open('photoIds.txt','r')
#Image_data=open('imageinformation','w')
#e=image_id.readlines(10)
#f= [s.replace('\r\n', '') for s in e]
#num_of_lines=len(f)
#image_id.close()
#i=0
#while i<269846:
# term=f[i]
#try:
photoinfo=flickr.photos_getinfo(photo_id=photo_id)
photo_tree=ElementTree(photoinfo)
#photo_tree.write('photo_tree')
#i+=1
#photo=photo_tree.getroot()
#photodata=photo.getiterator()
#for elem in owner.getiterator():
#for elem in photo.getiterator():
for elem in photo_tree.getroot():
farm=elem.attrib['farm']
id=elem.attrib['id']
server=elem.attrib['server']
#title=photo_tree.find('title').txt
#for child in elem.findall():
# username=child.attrib['username']
# location=child.attrib['location']
# user=elem.attrib['username']
print (farm)
print(id)
print(server)
#owner=photo_tree.findall('owner')
# print(username)
#filename="%s.txt"%(farm)
#f=open(filename,'w')
#f.write("%s"%farm)
#for elem in photo_tree.getiterator():
#for child in photo_tree.getiterator():
#print (child.attrib)
#owner=child.attrib['username']
I would like to read data from a file and pass it to flickrapi method to get images' information recursively using pythonand save it in a file as a text: image id=.... user name=... location=... tags=... and so on. I could save the attributes of the first element by using .getroot() but I tried to get the attributes of other element but it returns error. I want to save the attributes into txt file and read the image ids from a file so I can use these data in the algorithm I'm working on.
Since I figured out a away to solve the problem(I'm a beginner and know almost nothing about python), what we need to do is to iterator the object(since it's not saved as xml file) using tags name as follows:
photo_tree=ElementTree(photoinfo)
for elem in photo_tree.getroot():
uploaded=elem.attrib['dateuploaded']
uploaded=datetime.datetime.fromtimestamp(float(uploaded)).strftime('%Y-%m-%d %H:%M:%S')
for elem in photo_tree.getiterator(tag='dates'):
taken_date=elem.attrib['taken']
photo_info = open(head + 'filename/' + ('%d.txt') % (id),'a')
photo_info.write(str(id)+'\t'+uploaded+'\t'+taken_date+'\t'+'\n')
may it helps someone who is seeking a solution for same problem. Or may be there is an efficient way to solve this issue!!
Hello I am having trouble with a xml file I am using. Now what happens is whenever i try to get the msg tag i get an error preventing me from accessing the data. Here is the code I am writing so far.
from xml.dom import minidom
import smtplib
from email.mime.text import MIMEText
from datetime import datetime
def xml_data ():
f = open('C:\opidea_2.xml', 'r')
data = f.read()
f.close()
dom = minidom.parseString(data)
ic = (dom.getElementsByTagName('logentry'))
dom = None
content = ''
for num in ic:
xmlDate = num.getElementsByTagName('date')[0].firstChild.nodeValue
content += xmlDate + '\n '
xmlMsg = num.getElementsByTagName('msg')
if xmlMsg !='' and len(xmlMsg) > 0:
xmlMsgc = xmlMsg[0].firstChild.nodeValue
content += " Comments: \n " + str(xmlMsg) + '\n\n'
else:
xmlMsgc = "No comment made."
content += xmlMsgc
print content
if __name__ == "__main__":
xml_data ()
Here is part of the xml if it helps.
<log>
<logentry
revision="33185">
<author>glv</author>
<date>2012-08-06T21:01:52.494219Z</date>
<paths>
<path
kind="file"
action="M">/branches/Patch_4_2_0_Branch/text.xml</path>
<path
kind="dir"
action="M">/branches/Patch_4_2_0_Branch</path>
</paths>
<msg>PATCH_BRANCH:N/A
BUG_NUMBER:N/A
FEATURE_AFFECTED:N/A
OVERVIEW:N/A
Adding the SVN log size requirement to the branch
</msg>
</logentry>
</log>
Now when i use xmlMsg = num.getElementsByTagName('msg')[0].toxml() I can get the code to work, I just have to do a lot of replacing and I rather not have to do that. Also I have date working using xmlDate = num.getElementsByTagName('date')[0].firstChild.nodeValue.
Is there something I am missing or doing wrong? Also here is the traceback.
Traceback (most recent call last):
File "C:\python\src\SVN_Email_copy.py", line 141, in <module>
xml_data ()
File "C:python\src\SVN_Email_copy.py", line 94, in xml_data
xmlMsg = num.getElementsByTagName('msg').firstChild.nodeValue
AttributeError: 'NodeList' object has no attribute 'firstChild'
I suggest a different approach. Below is a program that does what you want (I think...). It uses the ElementTree API instead of minidom. This simplifies things quite a bit.
You have posted several related questions concerning parsing of an XML file using minidom. I really think you should look into ElementTree (and for even more advanced stuff, check out ElementTree's "superset", lxml). Both these APIs are much easier to work with than minidom.
import xml.etree.ElementTree as ET
def xml_data():
root = ET.parse("opidea_2.xml")
logentries = root.findall("logentry")
content = ""
for logentry in logentries:
date = logentry.find("date").text
content += date + '\n '
msg = logentry.find("msg")
if msg is not None:
content += " Comments: \n " + msg.text + '\n\n'
else:
content += "No comment made."
print content
if __name__ == "__main__":
xml_data()
Output when using your XML sample (you may want to work a bit more on the exact layout):
2012-08-06T21:01:52.494219Z
Comments:
PATCH_BRANCH:N/A
BUG_NUMBER:N/A
FEATURE_AFFECTED:N/A
OVERVIEW:N/A
Adding the SVN log size requirement to the branch
I was doing the code wrong it seems. Here is how i was able to solve it.
if len(xmlMsg) > 0 and xmlMsg[0].firstChild != None:
xmlMsgc = xmlMsg[0].firstChild.nodeValue
xmlMsgpbr = xmlMsgc.replace('\n', ' ')
xmlMsgf.append(xmlMsgpbr)
else:
xmlMsgf = "No comments made"
I never checked if first child had any value or not. That's what I was missing. the other answers helped well but this is how i was able to get it to work. Thank you guys.
myNodeList.item( 0)
maybe...
http://docs.python.org/library/xml.dom.html
use this... print "%s" %(num.getElementsByTagName('date')[0].firstChild.data)
I'm writing script to export my links and their titles from chrome to html.
Chrome bookmarks stored as json, in utf encoding
Some titles are on Russian therefore they stored like that:
"name": "\u0425\u0430\u0431\u0440\ ..."
import codecs
f = codecs.open("chrome.json","r", "utf-8")
data = f.readlines()
urls = [] # for links
names = [] # for link titles
ind = 0
for i in data:
if i.find('"url":') != -1:
urls.append(i.split('"')[3])
names.append(data[ind-2].split('"')[3])
ind += 1
fw = codecs.open("chrome.html","w","utf-8")
fw.write("<html><body>\n")
for n in names:
fw.write(n + '<br>')
# print type(n) # this will return <type 'unicode'> for each url!
fw.write("</body></html>")
Now, in chrome.html I got those displayed as \u0425\u0430\u0431...
How I can turn them back to Russian?
using python 2.5
**Edit: Solved!**
s = '\u041f\u0440\u0438\u0432\u0435\u0442 world!'
type(s)
<type 'str'>
print s.decode('raw-unicode-escape').encode('utf-8')
Привет world!
That's what I needed, to convert str of \u041f... into unicode.
f = open("chrome.json", "r")
data = f.readlines()
f.close()
urls = [] # for links
names = [] # for link titles
ind = 0
for i in data:
if i.find('"url":') != -1:
urls.append(i.split('"')[3])
names.append(data[ind-2].split('"')[3])
ind += 1
fw = open("chrome.html","w")
fw.write("<html><body>\n")
for n in names:
fw.write(n.decode('raw-unicode-escape').encode('utf-8') + '<br>')
fw.write("</body></html>")
By the way, it's not just Russian; non-ASCII characters are quite common in page names. Example:
name=u'Python Programming Language \u2013 Official Website'
url=u'http://www.python.org/'
As an alternative to fragile code like
urls.append(i.split('"')[3])
names.append(data[ind-2].split('"')[3])
# (1) relies on name being 2 lines before url
# (2) fails if there is a `"` in the name
# example: "name": "The \"Fubar\" website",
you could process the input file using the json module. For Python 2.5, you can get simplejson.
Here's a script that emulates yours:
try:
import json
except ImportError:
import simplejson as json
import sys
def convert_file(infname, outfname):
def explore(folder_name, folder_info):
for child_dict in folder_info['children']:
ctype = child_dict.get('type')
name = child_dict.get('name')
if ctype == 'url':
url = child_dict.get('url')
# print "name=%r url=%r" % (name, url)
fw.write(name.encode('utf-8') + '<br>\n')
elif ctype == 'folder':
explore(name, child_dict)
else:
print "*** Unexpected ctype=%r ***" % ctype
f = open(infname, 'rb')
bmarks = json.load(f)
f.close()
fw = open(outfname, 'w')
fw.write("<html><body>\n")
for folder_name, folder_info in bmarks['roots'].iteritems():
explore(folder_name, folder_info)
fw.write("</body></html>")
fw.close()
if __name__ == "__main__":
convert_file(sys.argv[1], sys.argv[2])
Tested using Python 2.5.4 on Windows 7 Pro.
It's a JSON file, so read it using a JSON parser. That will give you a Unicode string directly, without you having to unescape it. This is going to be much more reliable (as well as simpler), since JSON strings are not the same format as Python strings.
(They're pretty similar and both use the \u format, but your current code will fall over badly for other escaped characters, not to mention that it relies on the exact attribute order and whitespace settings of a JSON file, which makes it very fragile indeed.)
import json, cgi, codecs
with open('chrome.json') as fp:
bookmarks= json.load(fp)
with codecs.open('chrome.html', 'w', 'utf-8') as fp:
fp.write(u'<html><body>\n')
for root in bookmarks[u'roots'].values():
for child in root['children']:
fp.write(u'%s' % (
cgi.escape(child[u'url']),
cgi.escape(child[u'name'])
))
fp.write(u'</body></html>')
Note also the use of cgi.escape to HTML-encode any < or & characters in the strings.
I'm not sure where you're trying to display the russian text, but in the interpreter you can do the following to see the Russian text:
s = '\u0425\u0430\u0431'
l = s.split('\u')
l.remove('')
for x in l:
print(unichr(int(x, 16))),
This will give the following output:
Х а б
If you're storing it in html, better off to leave it as '\u0425...' until you need to convert it.
Hope this helps.
You could include the utf-8 BOM, so chrome knows to read it as utf-8, not ascii:
fw = codecs.open("chrome.html","w","utf-8")
fw.write(codecs.BOM_UTF8.decode('utf-8'))
fw.write(u'你好')
Oh, but if you open fw in python, remember to use 'utf-8-sig' to strip the BOM.
Maybe you need to encode the unicode into utf-8, but I think codecs does that already, right: