Error with xmltodict - python

EDIT:
I can print rev['contributor'] for a while, but then every attempt to access rev['contributor'] raises the following:
TypeError: string indices must be integers
ORIGINAL POST:
I'm trying to extract data from an XML file using xmltodict with this code:
import xmltodict, json

with open('Sockpuppet_articles.xml', encoding='utf-8') as xml_file:
    dic_xml = xmltodict.parse(xml_file.read(), xml_attribs=False)
print("parsed")

for page in dic_xml['mediawiki']['page']:
    for rev in page['revision']:
        for user in open("Sockpuppet_names.txt", "r", encoding='utf-8'):
            user = user.strip()
            if 'username' in rev['contributor'] and rev['contributor']['username'] == user:
                dosomething()
I get this error on the last line, at the if-statement:
TypeError: string indices must be integers
The weird thing is that it works on another XML file.

I got the same error when the next level has only one element.
...
## Read XML
pastas = [os.path.join(caminho, name) for name in os.listdir(caminho)]
pastas = filter(os.path.isdir, pastas)
for pasta in pastas:
    for arq in glob.glob(os.path.join(pasta, "*.xml")):
        xmlData = codecs.open(arq, 'r', encoding='utf8').read()
        xmlDict = xmltodict.parse(xmlData, xml_attribs=True)["XMLBIBLE"]
        bible_name = xmlDict["@biblename"]
        list_verse = []
        for xml_inBook in xmlDict["BIBLEBOOK"]:
            bnumber = xml_inBook["@bnumber"]
            bname = xml_inBook["@bname"]
            for xml_chapter in xml_inBook["CHAPTER"]:
                cnumber = xml_chapter["@cnumber"]
                for xml_verse in xml_chapter["VERS"]:
                    vnumber = xml_verse["@vnumber"]
                    vtext = xml_verse["#text"]
...
TypeError: string indices must be integers
The error occurs when the book is "Obadiah", which has only one chapter. Inspecting the CHAPTER value, xml_chapter is expected to be an OrderedDict for each chapter. That holds only when the book has more than one chapter: with a single chapter, the loop iterates over the OrderedDict's keys, so it yields the string "@cnumber" instead of an OrderedDict.
I solved that by converting the OrderedDict to a list when there is only one chapter.
...
if len(xml_inBook["CHAPTER"]) == 2:
    xml_chapter = list(xml_inBook["CHAPTER"].items())
    cnumber = xml_chapter[0][1]
    for xml_verse in xml_chapter[1][1]:
        vnumber = xml_verse["@vnumber"]
        vtext = xml_verse["#text"]
...
I am using Python 3.6.
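A cleaner alternative, assuming a reasonably recent xmltodict, is the force_list argument of xmltodict.parse, which makes the named elements parse as lists even when only one is present, so the special-casing above becomes unnecessary. A minimal sketch using the element names from the example (the file name is illustrative):
import xmltodict

with open('bible.xml', encoding='utf-8') as f:  # illustrative file name
    xmlDict = xmltodict.parse(f.read(), xml_attribs=True,
                              force_list=('BIBLEBOOK', 'CHAPTER', 'VERS'))["XMLBIBLE"]
The same idea should help with the mediawiki example in the original question, e.g. force_list=('page', 'revision'), so that rev is always a dict rather than sometimes a string key.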


Use regex to match 3 characters in string

I have a JSON payload and I need to match just the "SDC" in the vdcLocation.
{
    "cmdbID": "d01aacda21b7c181aaaaa16dc4bcbca",
    "serialNumber": "VBlock740-4239340361f4d-0f6d9d6ad46879",
    "vdcLocation": "Data Center-San Diego (SDC)"
}
Here's the code I have so far; what am I missing?
import json

with open('test-payload.json') as json_file:
    data = json.load(json_file)

serialNumber = data["serialNumber"]
dataCenter = data["vdcLocation"]

splittedSerialNumber = serialNumber.split("-")  # returns the split list
firstPart = splittedSerialNumber[0]  # accessing the first part of the split list

splittedDataCenter = dataCenter.split("-")
lastPart = splittedDataCenter[1]

vdcLocationOnly = if (re.match^('[SDC]')$):  # <-- the broken line

print(vdcLocationOnly)
print(serialNumber)
print(splittedSerialNumber)
print(firstPart)
print(splittedDataCenter)
print(lastPart)
One solution would be something like the following:
import json
import re

with open('test-payload.json') as json_file:
    data = json.load(json_file)

serialNumber = data["serialNumber"]
dataCenter = data["vdcLocation"]

splittedSerialNumber = serialNumber.split("-")  # returns the split list
firstPart = splittedSerialNumber[0]  # accessing the first part of the split list

splittedDataCenter = dataCenter.split("-")
lastPart = splittedDataCenter[1]

if "SDC" in dataCenter:
    print("found SDC using in")

if re.search(r'\(SDC\)$', dataCenter):
    print("found SDC using re")

print(serialNumber)
print(splittedSerialNumber)
print(firstPart)
print(splittedDataCenter)
print(lastPart)
The simplest approach would be to use "SDC" in dataCenter. But if your needs are more complicated and you really do need a regular expression, then you probably want re.search (see the docs).
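If the goal is to extract whichever three-letter code appears in the trailing parentheses, rather than just testing for "SDC", a capture group does it. A minimal sketch, assuming the code is always three word characters in parentheses at the end of the string:
import re

match = re.search(r'\((\w{3})\)$', dataCenter)
if match:
    vdcLocationOnly = match.group(1)  # "SDC" for this payload
    print(vdcLocationOnly)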

Python // create XML File

I'm struggling with printing out an XML file.
What I want to do is iterate through my list of test numbers until the list is empty.
But I'm not even getting to the point where anything is printed, because I get a syntax error on the print command (the line marked with ***).
I would be fine with printing to the console, but right now I'm stuck on that syntax error.
Any ideas on how to fix it?
import lxml.etree
import lxml.builder

E = lxml.builder.ElementMaker()
ITEM = E.item
attributeValue = E.attributeValue
part = E.part

a = 300  # test value
l = [10, 20, 30, 44, 76876, 9009809]  # person numbers
i = 0  # index value

while len(l) > 0:
    Document = ITEM(
        attributeValue(str("2019-12-03"), id="1026"),
        attributeValue(str("2019-12-03"), id="1028"),
        attributeValue(str("FATCA Schriftverkehr"), id="1023"),
        attributeValue(str(l[i]), id="1022"),
        attributeValue(str("FATCA"), id="1031"),
        attributeValue(str(a), id="1025", url="Test-URL", contenttype="application/pdf/"),
        part(str(a), type="base"),
        type="1007"
    )
    ***print lxml.etree.tostring(Document, pretty_print=True)
    del l[i]
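In Python 3, print is a function, so the marked line presumably just needs parentheses; and since lxml.etree.tostring returns bytes by default, decoding gives readable console output:
    print(lxml.etree.tostring(Document, pretty_print=True).decode())
With that fixed, the loop itself should terminate: del l[i] with i = 0 removes the first element on every pass, so len(l) eventually reaches 0.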

ValueError: can only parse strings python

I am trying to gather a bunch of links using XPath, which then need to be scraped from the next page. However, I keep getting a ValueError saying it can only parse strings. I checked the type of lk and it was a string after I cast it. What is going wrong?
import unicodedata
import urllib2
from lxml import etree  # imports assumed by this snippet

def unicode_to_string(types):
    try:
        types = unicodedata.normalize("NFKD", types).encode('ascii', 'ignore')
        return types
    except:
        return types

def getData():
    req = "http://analytical360.com/access-points"
    page = urllib2.urlopen(req)
    tree = etree.HTML(page.read())
    i = 0
    for lk in tree.xpath('//a[@class="sabai-file sabai-file-image sabai-file-type-jpg "]//@href'):
        print "Scraping Vendor #" + str(i)
        trees = etree.HTML(urllib2.urlopen(unicode_to_string(lk)))
        for ll in trees.xpath('//table[@id="archived"]//tr//td//a//@href'):
            final = etree.HTML(urllib2.urlopen(unicode_to_string(ll)))
You should pass in strings, not the file-like object returned by urllib2.urlopen.
Perhaps change the code like so:
trees = etree.HTML(urllib2.urlopen(unicode_to_string(lk)).read())
for i, ll in enumerate(trees.xpath('//table[@id="archived"]//tr//td//a//@href')):
    final = etree.HTML(urllib2.urlopen(unicode_to_string(ll)).read())
Also, you don't seem to increment i.
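If i is meant to count vendors, the outer loop can use enumerate as well; a small sketch of that variant:
for i, lk in enumerate(tree.xpath('//a[@class="sabai-file sabai-file-image sabai-file-type-jpg "]//@href')):
    print "Scraping Vendor #" + str(i)
    trees = etree.HTML(urllib2.urlopen(unicode_to_string(lk)).read())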

KeyError and TypeError in my python web scraper

Sorry about the vague and confusing title, but there is really no better way to summarize my problem in one sentence.
I was trying to get the student and grade information from a french website. The link is this (http://www.bankexam.fr/resultat/2014/BACCALAUREAT/AMIENS?filiere=BACS)
My code is as follows:
import time
import urllib2
from bs4 import BeautifulSoup

regions = {'R\xc3\xa9sultats Bac Amiens 2014':'/resultat/2014/BACCALAUREAT/AMIENS'}
base_url = 'http://www.bankexam.fr'
tests = {'es':'?filiere=BACES','s':'?filiere=BACS','l':'?filiere=BACL'}

for i in regions:
    for x in tests:
        # create the output file
        output_file = open('/Users/student project/'+ i + '_' + x + '.txt','a')
        time.sleep(2)  # compassionate scraping
        section_url = base_url + regions[i] + tests[x]  # now goes to the x test page of region i
        request = urllib2.Request(section_url)
        response = urllib2.urlopen(request)
        soup = BeautifulSoup(response,'html.parser')
        content = soup.find('div',id='zone_res')
        for row in content.find_all('tr'):
            if row.td:
                student = row.find_all('td')
                name = student[0].strong.string.encode('utf8').strip()
                try:
                    school = student[1].strong.string.encode('utf8')
                except AttributeError:
                    school = 'NA'
                result = student[2].span.string.encode('utf8')
                output_file.write('%s|%s|%s\n' % (name,school,result))
        # Find the maximum pages to go through
        if soup.find('div','pagination'):
            import re
            page_info = soup.find('div','pagination')
            pages = []
            for i in page_info.find_all('a',re.compile('elt')):
                try:
                    pages.append(int(i.string.encode('utf8')))
                except ValueError:
                    continue
            max_page = max(pages)
        # Now goes through page 2 to max page
        for i in range(1,max_page):
            page_url = '&p='+str(i)+'#anchor'
            section2_url = section_url+page_url
            request = urllib2.Request(section2_url)
            response = urllib2.urlopen(request)
            soup = BeautifulSoup(response,'html.parser')
            content = soup.find('div',id='zone_res')
            for row in content.find_all('tr'):
                if row.td:
                    student = row.find_all('td')
                    name = student[0].strong.string.encode('utf8').strip()
                    try:
                        school = student[1].strong.string.encode('utf8')
                    except AttributeError:
                        school = 'NA'
                    result = student[2].span.string.encode('utf8')
                    output_file.write('%s|%s|%s\n' % (name,school,result))
A little more description of the code:
I created a 'regions' dictionary and a 'tests' dictionary because there are 30 other regions I need to collect; I include just one here for demonstration. I'm only interested in the results of three tests (ES, S, L), hence the 'tests' dictionary.
Two errors keep showing up. One is
KeyError: 2
linked to the line
section_url = base_url + regions[i] + tests[x]
The other is
TypeError: cannot concatenate 'str' and 'int' objects
linked to the output_file = open(...) line.
I know there is a lot of information here and I'm probably not listing the most important parts, but let me know what I can do to fix this!
Thanks
The issue is that you're using the variable i in more than one place.
Near the top of the file, you do:
for i in regions:
So, in some places i is expected to be a key into the regions dictionary.
The trouble comes when you use it again later. You do so in two places:
for i in page_info.find_all('a',re.compile('elt')):
And:
for i in range(1,max_page):
The second of these is what is causing your exceptions, as the integer values that get assigned to i don't appear in the regions dict (nor can an integer be added to a string).
I suggest renaming some or all of those variables. Give them meaningful names, if possible (i is perhaps acceptable for an "index" variable, but I'd avoid using it for anything else unless you're code golfing).
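A minimal sketch of that rename, keeping the rest of the loop body as it is (region_name, test_name, and page_num are illustrative names):
for region_name in regions:
    for test_name in tests:
        output_file = open('/Users/student project/' + region_name + '_' + test_name + '.txt', 'a')
        section_url = base_url + regions[region_name] + tests[test_name]
        # ... first-page scraping and max_page detection unchanged ...
        for page_num in range(2, max_page + 1):
            section2_url = section_url + '&p=' + str(page_num) + '#anchor'
            # ... per-page scraping unchanged ...
Note in passing that the original range(1, max_page) visits pages 1 through max_page - 1; if the comment "goes through page 2 to max page" reflects the intent, range(2, max_page + 1) matches it.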

Python code returns NoneType object has no attribute error sometimes and works perfectly other times

def dcrawl(link):
    # importing the required libraries & modules
    from bs4 import BeautifulSoup
    import urllib

    # fetching the document
    op = urllib.FancyURLopener({})
    f = op.open(link)
    h_doc = f.read()

    # trimming to the base document
    idoc1 = BeautifulSoup(h_doc)
    idoc2 = str(idoc1.find(id="bwStory"))
    bdoc = BeautifulSoup(idoc2)

    # extract the date as a string
    dat = str(bdoc.div.div.string)[0:13]
    date = dst(dat)

    # extract the title as a string
    title = str(bdoc.b.string)

    # extract the full report as a string
    freport = str(bdoc.find_all("p"))

    # extract the place as a string
    plc = bdoc.find(id="bwStoryBody")
    puni = plc.p.string
    # encoding to ascii to eliminate discrepancies
    pasi = puni.encode('ascii', 'ignore')
    com = pasi.find("-")
    place = pasi[:com]
The same conversion that works for bdoc.b.string also works here:
# extract the full report as a string
freport = str(bdoc.find_all("p"))
But in the line:
plc = bdoc.find(id="bwStoryBody")
plc returns some data, and plc.p returns the first <p>...</p>, yet converting it to a string doesn't always work.
Because puni returned a string object earlier, I stumbled on unicode errors and so had to use encode to handle the pasi result.
.find() returns None when the object is not found. Evidently some pages do not have the elements you are looking for.
Test for it explicitly if you want to prevent attribute errors:
plc = bdoc.find(id="bwStoryBody")
if plc is not None:
    puni = plc.p.string
    # encoding to ascii to eliminate discrepancies
    # (by default Python processes text as unicode)
    pasi = puni.encode('ascii', 'ignore')
    com = pasi.find("-")
    place = pasi[:com]
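The same caution arguably applies one level down: plc.p is None when the div contains no <p> tag, and .string is None when a tag has more than one child, so a fully defensive version might look like this sketch:
plc = bdoc.find(id="bwStoryBody")
if plc is not None and plc.p is not None and plc.p.string is not None:
    pasi = plc.p.string.encode('ascii', 'ignore')
    com = pasi.find("-")
    place = pasi[:com]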
