I downloaded a zip file from https://clinicaltrials.gov/AllPublicXML.zip, which contains over 200k XML files (most under 10 kB in size), to a directory (see 'dirpath_zip' in the CODE) I created on Ubuntu 16.04 (a DigitalOcean droplet). I'm trying to load all of these into MongoDB, which is installed on the same machine as the zip file.
I ran the CODE below twice, and it consistently failed while processing the 15,988th file.
I've googled around and tried reading other posts about this particular error, but couldn't find a way to solve the issue. Actually, I'm not really sure what the problem is... any help is much appreciated!
CODE:
import re
import sys  # needed for sys.exit() in timestamper()
import json
import zipfile
import pymongo
import datetime
import xmltodict
from bs4 import BeautifulSoup
from pprint import pprint as ppt

def timestamper(stamp_type="regular"):
    if stamp_type == "regular":
        timestamp = str(datetime.datetime.now())
    elif stamp_type == "filename":
        timestamp = str(datetime.datetime.now()).replace("-", "").replace(":", "").replace(" ", "_")[:15]
    else:
        sys.exit("ERROR [timestamper()]: unexpected 'stamp_type' (parameter) encountered")
    return timestamp

client = pymongo.MongoClient()
db = client['ctgov']
coll_name = "ts_" + timestamper(stamp_type="filename")
coll = db[coll_name]

dirpath_zip = '/glbdat/ctgov/all/alltrials_20180402.zip'
z = zipfile.ZipFile(dirpath_zip, 'r')

i = 0
for xmlfile in z.namelist():
    print(i, 'parsing:', xmlfile)
    if xmlfile == 'Contents.txt':
        print(xmlfile, '==> entering "continue"')
        continue
    else:
        soup = BeautifulSoup(z.read(xmlfile), 'lxml')
        json_study = json.loads(re.sub(r'\s', ' ', json.dumps(xmltodict.parse(str(soup.find('clinical_study'))))).strip())
        coll.insert_one(json_study)
    i += 1
ERROR MESSAGE:
Traceback (most recent call last):
  File "zip_to_mongo_alltrials.py", line 38, in <module>
    soup = BeautifulSoup(z.read(xmlfile), 'lxml')
  File "/usr/local/lib/python3.5/dist-packages/bs4/__init__.py", line 225, in __init__
    markup, from_encoding, exclude_encodings=exclude_encodings)):
  File "/usr/local/lib/python3.5/dist-packages/bs4/builder/_lxml.py", line 118, in prepare_markup
    for encoding in detector.encodings:
  File "/usr/local/lib/python3.5/dist-packages/bs4/dammit.py", line 264, in encodings
    self.chardet_encoding = chardet_dammit(self.markup)
  File "/usr/local/lib/python3.5/dist-packages/bs4/dammit.py", line 34, in chardet_dammit
    return chardet.detect(s)['encoding']
  File "/usr/lib/python3/dist-packages/chardet/__init__.py", line 30, in detect
    u.feed(aBuf)
  File "/usr/lib/python3/dist-packages/chardet/universaldetector.py", line 128, in feed
    if prober.feed(aBuf) == constants.eFoundIt:
  File "/usr/lib/python3/dist-packages/chardet/charsetgroupprober.py", line 64, in feed
    st = prober.feed(aBuf)
  File "/usr/lib/python3/dist-packages/chardet/hebrewprober.py", line 224, in feed
    aBuf = self.filter_high_bit_only(aBuf)
  File "/usr/lib/python3/dist-packages/chardet/charsetprober.py", line 53, in filter_high_bit_only
    aBuf = re.sub(b'([\x00-\x7F])+', b' ', aBuf)
  File "/usr/lib/python3.5/re.py", line 182, in sub
    return _compile(pattern, flags).sub(repl, string, count)
MemoryError
Try pushing the reading from the zip and the insert into the db out into a separate method. Also add gc.collect() for garbage collection.
import gc

def read_xml_insert(xmlfile):
    soup = BeautifulSoup(z.read(xmlfile), 'lxml')
    json_study = json.loads(re.sub(r'\s', ' ', json.dumps(xmltodict.parse(str(soup.find('clinical_study'))))).strip())
    coll.insert_one(json_study)

i = 0
for xmlfile in z.namelist():
    print(i, 'parsing:', xmlfile)
    if xmlfile == 'Contents.txt':
        print(xmlfile, '==> entering "continue"')
        continue
    else:
        read_xml_insert(xmlfile)
    i += 1
    gc.collect()
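For what it's worth, the traceback shows the MemoryError is raised inside chardet, which BeautifulSoup only falls back to when it has to guess the encoding of raw bytes. A minimal sketch of an alternative that sidesteps the detection entirely, assuming the files are UTF-8 (the XML declarations should confirm that):
from bs4 import BeautifulSoup

# Decoding the bytes up front (or passing from_encoding) means BeautifulSoup
# never has to guess the encoding, so chardet is never invoked.
soup = BeautifulSoup(z.read(xmlfile).decode('utf-8'), 'lxml')
# or, equivalently:
# soup = BeautifulSoup(z.read(xmlfile), 'lxml', from_encoding='utf-8')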
I have a csv file with URLs to my images. I want to check which images have a transparent logo. For some rows there is no URL provided, and for some rows the URL doesn't open (that's why I added the requests exceptions).
The code below works for the first 3 rows in my csv file. However, for the fourth row, which has no URL and is blank, it throws the following error:
NoTransparent!
Transparent
NoTransparent!
Traceback (most recent call last):
  File "/path/to/transparency.py", line 13, in <module>
    pic = plt.imread(x)
  File "/usr/local/lib/python3.7/site-packages/matplotlib/pyplot.py", line 2230, in imread
    return matplotlib.image.imread(fname, format)
  File "/usr/local/lib/python3.7/site-packages/matplotlib/image.py", line 1486, in imread
    with img_open(fname) as image:
  File "/usr/local/lib/python3.7/site-packages/PIL/ImageFile.py", line 107, in __init__
    self._open()
  File "/usr/local/lib/python3.7/site-packages/PIL/PngImagePlugin.py", line 636, in _open
    if self.fp.read(8) != _MAGIC:
AttributeError: 'float' object has no attribute 'read'
my code so far:
import pandas as pd
import requests
import matplotlib.pyplot as plt

df = pd.read_csv('/path/to/my/input.csv')
urls = df.T.values.tolist()[3]
msgid = df.T.values.tolist()[0]
imgType = []

for x in urls:
    if x == '':
        continue
    try:
        pic = plt.imread(x)
        if pic.shape[2] == 3:
            imgType.append("Transparent")
            print("Transparent")
        elif pic.shape[2] == 4:
            imgType.append("non-Transparent")
            print("NoTransparent!")
        else:
            imgType.append("unrecognized")
            print("unrecognized!")
    except (requests.exceptions.ConnectionError, requests.exceptions.MissingSchema):
        imgType.append('Down or No Img')
        print('Non URL or Down')
    except (requests.exceptions.Timeout):
        imgType.append('timeout')
        print('timeout')

df["imgType"] = imgType
df.to_csv('/path/to/output.csv')
Could someone help me solve this? It is probably related to the requests exceptions. Thank you in advance!
Can't you just add a simple check?:
for x in urls:
    if x == '':  # or some other option to check for the blank urls
        continue
    try:
        pic = plt.imread(x)
        ... etc. with your code
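Given the traceback ('float' object has no attribute 'read'), the blank cells are most likely coming through from pandas as NaN, which is a float, so the x == '' test never matches. A small sketch of the check using pd.isna, under that assumption:
import pandas as pd

for x in urls:
    # Empty CSV cells are read by pandas as NaN (a float), not '',
    # so test for missing values explicitly before calling imread.
    if pd.isna(x) or x == '':
        imgType.append('Down or No Img')  # keep imgType aligned with the rows
        continue
    pic = plt.imread(x)
    # ... rest of the loop unchanged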
I'm trying to scrape a nutrition website, and the following code works:
import requests
from bs4 import BeautifulSoup
import json
import re

page = requests.get("https://nutritiondata.self.com/facts/nut-and-seed-products/3071/1")
soup = BeautifulSoup(page.content, 'html.parser')
scripts = soup.find_all("script")

for script in scripts:
    if 'foodNutrients = ' in script.text:
        jsonStr = script.text
        jsonStr = jsonStr.split('foodNutrients =')[-1]
        jsonStr = jsonStr.rsplit('fillSpanValues')[0]
        jsonStr = jsonStr.rsplit(';', 1)[0]
        jsonStr = "".join(jsonStr.split())
        valid_json = re.sub(r'([{,:])(\w+)([},:])', r'\1"\2"\3', jsonStr)
        jsonObj = json.loads(valid_json)

        # These are in terms of 100 grams. I also calculated for per serving
        g_per_serv = int(jsonObj['FOODSERVING_WEIGHT_1'].split('(')[-1].split('g')[0])

        for k, v in jsonObj.items():
            if k == 'NUTRIENT_0':
                conv_v = (float(v)*g_per_serv)/100
                print('%s : %s (per 100 grams) | %s (per serving %s' % (k, round(float(v)), round(float(conv_v)), jsonObj['FOODSERVING_WEIGHT_1']))
but when I try to use it on other, almost identical pages on the same domain, it does not. For example, if I use
page = requests.get("https://nutritiondata.self.com/facts/vegetables-and-vegetable-products/2383/2")
I get the error:
Traceback (most recent call last):
  File "scrape_test_2.py", line 20, in <module>
    jsonObj = json.loads(valid_json)
  File "/Users/benjamattesjaroen/anaconda3/lib/python3.7/json/__init__.py", line 348, in loads
    return _default_decoder.decode(s)
  File "/Users/benjamattesjaroen/anaconda3/lib/python3.7/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/Users/benjamattesjaroen/anaconda3/lib/python3.7/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Expecting ',' delimiter: line 1 column 5446 (char 5445)
Looking at the source code for both pages, they seem identical in the sense that they both have
<script type="text/javascript">
<!--
foodNutrients = { NUTRIENT_142: ........
which is the part being scraped.
I've been looking at this all day. Does anyone know how to make this script work for both pages? What is the problem here?
I would switch to using hjson, which allows unquoted keys, and simply extract the entire foodNutrients variable and parse it, rather than manipulating strings over and over.
Your error:
Currently yours is failing because at least one of the source arrays has a different number of elements, so the regex you use to sanitize the string is no longer appropriate. Looking at just the first offending occurrence...
In first url, before you use regex to clean you have:
aifr:"[ -35, -10 ]"
after:
"aifr":"[-35,-10]"
In second you start with a different length array:
aifr:"[ 163, 46, 209, 179, 199, 117, 11, 99, 7, 5, 82 ]"
after regex replace, instead of:
"aifr":"[ 163, 46, 209, 179, 199, 117, 11, 99, 7, 5, 82 ]"
you have:
"aifr":"[163,"46",209,"179",199,"117",11,"99",7,"5",82]"
i.e. invalid json. No more nicely delimited key:value pairs.
Nutshell:
Use hjson, it's easier. Or update the regex appropriately to handle variable-length arrays.
import requests, re, hjson

urls = ['https://nutritiondata.self.com/facts/nut-and-seed-products/3071/1',
        'https://nutritiondata.self.com/facts/vegetables-and-vegetable-products/2383/2']

p = re.compile(r'foodNutrients = (.*?);')

with requests.Session() as s:
    for url in urls:
        r = s.get(url)
        jsonObj = hjson.loads(p.findall(r.text)[0])
        serving_weight = jsonObj['FOODSERVING_WEIGHT_1']
        g_per_serv = int(serving_weight.split('(')[-1].split('g')[0])
        nutrient_0 = jsonObj['NUTRIENT_0']
        conv_v = float(nutrient_0) * g_per_serv / 100
        print('%s : %s (per 100 grams) | %s (per serving %s' % (nutrient_0, round(float(nutrient_0)), round(float(conv_v)), serving_weight))
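As a side note, the reason hjson copes here is simply that it tolerates the unquoted keys (NUTRIENT_142, aifr, and so on) that break json.loads, so no string surgery is needed at all. A tiny illustration with made-up values:
import hjson

# Keys are unquoted, exactly as in the page's JavaScript; hjson parses
# them without any regex preprocessing (the values here are invented).
snippet = '{ NUTRIENT_142: "1.2", aifr: "[ 163, 46, 209 ]" }'
parsed = hjson.loads(snippet)
print(parsed['NUTRIENT_142'], parsed['aifr'])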
I'm trying to dump the RAM of my VirtualBox VM using a Python script, but it gives me an error and I just can't solve it.
This is my code:
import os
from struct import *
import optparse
import random
import string

machine_name = "OSC-2016"

def dump_ram():
    output_file = "test.elf"
    ram_file_name = "ram.bin"
    dump_cmd = "vboxmanage debugvm %s dumpvmcore --filename %s" % (machine_name, output_file)
    os.system(dump_cmd)
    popen_cmd = "readelf --program-headers %s |grep -m1 -A1 LOAD" % (output_file)
    file_info = os.popen(popen_cmd).read()
    #print file_info
    file_info = " ".join(file_info.split())  # remove duplicate spaces
    file_info = file_info.split()  # create a list by splitting on spaces
    ram_start = int(file_info[1], 16)
    ram_size = int(file_info[4], 16)
    print "RAM SIZE is ", ram_size
    ram = open(output_file, "rb").read()[ram_start:ram_start + ram_size]
    ram_file_var = open(ram_file_name, "wb")
    ram_file_var.write(ram)
    ram_file_var.close()

def main():
    dump_ram()

if __name__ == '__main__':
    main()
And the error as shown:
Traceback (most recent call last):
  File "C:/Users/ABC/Desktop/dumpRAM.py", line 42, in <module>
    main()
  File "C:/Users/ABC/Desktop/dumpRAM.py", line 39, in main
    dump_ram()
  File "C:/Users/ABC/Desktop/dumpRAM.py", line 27, in dump_ram
    ram_start = int(file_info[1], 16)
IndexError: list index out of range
What should I do?
It's for my homework in an operating systems course and tomorrow is the final day, so any help would be great. I'm writing this paragraph because Stack Overflow wouldn't publish my question, saying it had too much code and not enough details.
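One thing worth checking: the IndexError at file_info[1] means the readelf | grep pipeline produced no usable output, so the list is empty. On Windows (the traceback shows C:/Users/...), readelf and grep are usually not on PATH, in which case os.popen() just returns an empty string. A small guard, sketched under that assumption (read_load_header is just a name for the sketch), makes the failure obvious instead of cryptic:
import os
import sys

def read_load_header(output_file):
    # Same readelf | grep pipeline as in the question; if either tool is
    # missing or the ELF dump was never created, the output is empty and
    # int(file_info[1], 16) would raise IndexError.
    popen_cmd = "readelf --program-headers %s |grep -m1 -A1 LOAD" % output_file
    file_info = os.popen(popen_cmd).read().split()
    if len(file_info) < 5:
        sys.exit("readelf/grep produced no LOAD header - are they installed, "
                 "and did vboxmanage actually create %s?" % output_file)
    ram_start = int(file_info[1], 16)
    ram_size = int(file_info[4], 16)
    return ram_start, ram_size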
I've got a csv file with URLs, and I need to scrape metadata from those websites. I'm using Python requests for that, with the code below:
from tempfile import NamedTemporaryFile
import shutil
import csv
from bs4 import BeautifulSoup
import requests
import re
import html5lib
import sys
#import logging

filename = 'TestWTF.csv'
#logging.basicConfig(level=logging.DEBUG)

#Get filename (with extension) from terminal
#filename = sys.argv[1]

tempfile = NamedTemporaryFile(delete=False)
read_timeout = 1.0

#Does actual scraping done, returns metaTag data
def getMetadata(url, metaTag):
    r = requests.get("http://" + url, timeout=2)
    data = r.text
    soup = BeautifulSoup(data, 'html5lib')
    metadata = soup.findAll(attrs={"name": metaTag})
    return metadata

#Gets either keyword or description
def addDescription(row):
    scrapedKeywordsData = getMetadata(row, 'keywords')
    if not scrapedKeywordsData:
        print row + ' NO KEYWORDS'
        scrapedKeywordsData = getMetadata(row, 'description')
        if not scrapedKeywordsData:
            return ''
    return scrapedKeywordsData[0]

def prepareString(data):
    output = data
    #Get rid of opening meta content
    if output.startswith('<meta content="'):
        output = data[15:]
    #Get rid of closing meta content (keywords)
    if output.endswith('" name="keywords"/>'):
        output = output[:-19]
    #Get rid of closing meta content (description)
    if output.endswith('" name="description"/>'):
        output = output[:-22]
    return output

def iterator():
    with open(filename, 'rb') as csvFile, tempfile:
        reader = csv.reader(csvFile, delimiter=',', quotechar='"')
        writer = csv.writer(tempfile, delimiter=',', quotechar='"')
        i = 0
        for row in reader:
            try:
                data = str(addDescription(row[1]))
                row[3] = prepareString(data)
            except requests.exceptions.RequestException as e:
                print e
            except requests.exceptions.Timeout as e:
                print e
            except requests.exceptions.ReadTimeout as e:
                print "lol"
            except requests.exceptions.ConnectionError as e:
                print "These aren't the domains we're looking for."
            except requests.exceptions.ConnectTimeout as e:
                print "Too slow Mojo!"
            writer.writerow(row)
            i = i + 1
            print i
    shutil.move(tempfile.name, filename)

def main():
    iterator()

#Defining main function
if __name__ == '__main__':
    main()
It works just fine, but at some URLs (out of 3000, let's say maybe 2-3) it suddenly stops working and does not move on to the next one after the timeout. So I have to kill it using Ctrl+C, which results in the file not being saved.
I know it's a problem with catching exceptions, but I cannot figure out which one, or what to do about it. I'm more than happy to simply skip the URL it gets stuck on.
EDIT:
Added traceback:
^CTraceback (most recent call last):
  File "blacklist.py", line 90, in <module>
    main()
  File "blacklist.py", line 85, in main
    iterator()
  File "blacklist.py", line 62, in iterator
    data = str(addDescription (row[1] ))
  File "blacklist.py", line 30, in addDescription
    scrapedKeywordsData = getMetadata(row, 'keywords')
  File "blacklist.py", line 25, in getMetadata
    metadata = soup.findAll(attrs={"name":metaTag})
  File "/Library/Python/2.7/site-packages/bs4/element.py", line 1259, in find_all
    return self._find_all(name, attrs, text, limit, generator, **kwargs)
  File "/Library/Python/2.7/site-packages/bs4/element.py", line 537, in _find_all
    found = strainer.search(i)
  File "/Library/Python/2.7/site-packages/bs4/element.py", line 1654, in search
    found = self.search_tag(markup)
  File "/Library/Python/2.7/site-packages/bs4/element.py", line 1626, in search_tag
    if not self._matches(attr_value, match_against):
  File "/Library/Python/2.7/site-packages/bs4/element.py", line 1696, in _matches
    if isinstance(markup, Tag):
KeyboardInterrupt
EDIT 2:
Example website for which the script doesn't work: miniusa.com
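Judging by the traceback, the script isn't sitting in requests at all when you interrupt it: the KeyboardInterrupt lands inside BeautifulSoup's findAll, which suggests the time is going into downloading and walking a huge page rather than into a socket wait, so none of the requests exception handlers will ever fire (the per-request timeout only bounds socket waits, not the total work for a row). One way to skip such rows is a hard per-row time limit. A sketch using signal.alarm, which only works on Unix-like systems (matching the /Library/Python/2.7 paths in the traceback) and reuses addDescription/prepareString from the question:
import signal
import requests

class RowTimeout(Exception):
    pass

def _alarm_handler(signum, frame):
    raise RowTimeout()

signal.signal(signal.SIGALRM, _alarm_handler)

# inside iterator()'s for-loop, around the per-row work:
signal.alarm(30)  # hard cap of 30 seconds per row; pick whatever fits
try:
    data = str(addDescription(row[1]))
    row[3] = prepareString(data)
except (RowTimeout, requests.exceptions.RequestException) as e:
    # RequestException is the base class of Timeout, ConnectionError, etc.
    print('skipping %s: %r' % (row[1], e))
finally:
    signal.alarm(0)  # cancel the pending alarm once the row is done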
I keep getting the following error whenever there is a function call to xml(productline), but if I replace the function call with file = open('config\\' + productLine + '.xml','r'), it seems to work. Why?
def xml(productLine):
    with open('config\\' + productLine + '.xml', 'r') as f:
        return f.read()

def getsanityresults(productline):
    xmlfile = xml(productline)  # replace with file = open('config\\' + productLine + '.xml','r')
    dom = minidom.parse(xmlfile)
    data = dom.getElementsByTagName('Sanity_Results')
    #print "DATA"
    #print data
    textnode = data[0].childNodes[0]
    testresults = textnode.data
    #print testresults
    for line in testresults.splitlines():
        #print line
        line = line.strip('\r,\n')
        #print line
        line = re.sub(r'(http://[^\s]+|//[^\s]+|\\\\[^\s]+)', r'\1', line)
        print line
        #print line
        resultslis.append(line)
    print resultslis
    return resultslis
Error:
Traceback (most recent call last):
  File "C:\Dropbox\scripts\announce_build_wcn\wcnbuild_release.py", line 910, in <module>
    main()
  File "C:\Dropbox\scripts\announce_build_wcn\wcnbuild_release.py", line 858, in main
    testresults=getsanityresults(pL)
  File "C:\Dropbox\scripts\announce_build_wcn\wcnbuild_release.py", line 733, in getsanityresults
    dom = minidom.parse(xmlfile)
  File "C:\python2.7.3\lib\xml\dom\minidom.py", line 1920, in parse
    return expatbuilder.parse(file)
  File "C:\python2.7.3\lib\xml\dom\expatbuilder.py", line 922, in parse
    fp = open(file, 'rb')
IOError: [Errno 2] No such file or directory: '<root>\n <PL name = "MSM8930.LA.2.0-PMIC-8917">\n
minidom.parse() expects either a filename or a file object as its parameter, but you are passing it the content of the file. Try this:
import os
from xml.dom import minidom
doc = minidom.parse(os.path.join('config', productline + '.xml'))
Unless you have specific requirements that favor minidom, use xml.etree.cElementTree to work with XML in Python. It is more Pythonic, and lxml, which you might need in more complex cases, supports the same API, so you don't have to learn things twice.
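For reference, a minimal ElementTree version of the same lookup, sketched against the Sanity_Results element from the question and assuming productline holds the same value:
import os
import xml.etree.ElementTree as ET

tree = ET.parse(os.path.join('config', productline + '.xml'))
# './/' searches the whole tree, like getElementsByTagName does
sanity = tree.find('.//Sanity_Results')
for line in sanity.text.splitlines():
    print(line.strip('\r,\n'))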
I replace the functioncall with file = open('config\\' + productLine + '.xml','r'), it seems to work, why?
You've got two variables, with subtly different names:
xmlfile = xml(productline)  # replace with file = open('config\\' + productLine + '.xml','r')
There's productline (lowercase l) and productLine (uppercase L).
If you use the same variable in both cases, you'll likely see more consistent results.
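If you do want to keep the xml() helper that returns the file contents as a string, minidom also has parseString(), which takes the XML text directly rather than a filename; a minimal sketch reusing the question's xml() and productline:
from xml.dom import minidom

# xml(productline) returns the file's text, so hand it to parseString(),
# which parses a string, rather than parse(), which expects a filename
# or a file object.
dom = minidom.parseString(xml(productline))
data = dom.getElementsByTagName('Sanity_Results')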