I'm trying to load a .json file produced by an application so I can feed the text into different machine learning algorithms for classification. The problem is that I can't figure out why NLTK won't load my .json file; even when I try it with NLTK's own .json file, it doesn't work. From what I gather from the book, I should only need to import nltk and use the load function from nltk.data. Can somebody help me see what I am doing wrong?
Below is the code I used to try loading the file with NLTK:
import nltk
nltk.data.load('corpora/twitter_samples/negative_tweets.json')
After trying that out, I got the following error:
C:\Python34\python.exe "C:/Users/JarvinLi/PycharmProjects/ThesisTrial1/Trial Loading.py"
Traceback (most recent call last):
File "C:/Users/JarvinLi/PycharmProjects/ThesisTrial1/Trial Loading.py", line 7, in <module>
nltk.data.load('corpora/twitter_samples/negative_tweets.json')
File "C:\Python34\lib\site-packages\nltk\data.py", line 810, in load
resource_val = json.load(opened_resource)
File "C:\Python34\lib\json\__init__.py", line 268, in load
parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
File "C:\Python34\lib\json\__init__.py", line 312, in loads
s.__class__.__name__))
TypeError: the JSON object must be str, not 'bytes'
Process finished with exit code 1
EDIT #1: I'm using Python 3.4.1 and NLTK 3.
EDIT #2: Below is another attempt, this time using json.load() directly:
import json
json.load('corpora/twitter_samples/negative_tweets.json')
But I encountered a similar error
C:\Python34\python.exe "C:/Users/JarvinLi/PycharmProjects/ThesisTrial1/Trial Loading.py"
Traceback (most recent call last):
File "C:/Users/JarvinLi/PycharmProjects/ThesisTrial1/Trial Loading.py", line 5, in <module>
json.load('corpora/twitter_samples/quotefileNeg.json')
File "C:\Python34\lib\json\__init__.py", line 265, in load
return loads(fp.read(),
AttributeError: 'str' object has no attribute 'read'
Process finished with exit code 1
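Both tracebacks point at the same underlying issue: nltk.data.load in this NLTK version hands json.load a byte stream (which Python 3.4's json module rejects), and json.load itself expects an open file object rather than a path string. As a minimal workaround sketch (my own suggestion, not part of the original question, assuming the twitter_samples corpus has already been downloaded), the dedicated twitter_samples corpus reader can take care of reading the files:

import nltk
from nltk.corpus import twitter_samples

# nltk.download('twitter_samples')  # run once if the corpus is not installed

# Parsed tweets as dictionaries (the sample files hold one JSON object per line).
tweets = twitter_samples.docs('negative_tweets.json')
print(tweets[0]['text'])

# Just the tweet texts, ready to feed into a classifier.
texts = twitter_samples.strings('negative_tweets.json')
print(len(texts))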
If you want to access a new corpus with a specific format, you can extend the NLTK CorpusReader class as follows:
import json
import os

from nltk.corpus.reader.api import CorpusReader
from nltk.corpus.reader.util import StreamBackedCorpusView, concat, ZipFilePathPointer
class StoryCorpusReader(CorpusReader):
corpus_view = StreamBackedCorpusView
def __init__(self, word_tokenizer=StoryTokenizer(), encoding="utf8"):
CorpusReader.__init__(
self, <folder_path>, <file_name>, encoding
)
for path in self.abspaths(self._fileids):
if isinstance(path, ZipFilePathPointer):
pass
elif os.path.getsize(path) == 0:
raise ValueError(f"File {path} is empty")
self._word_tokenizer = word_tokenizer
def docs(self, fileids=None):
return concat(
[
self.corpus_view(path, self._read_stories, encoding=enc)
for (path, enc, fileid) in self.abspaths(fileids, True, True)
]
)
def titles(self):
titles = self.docs()
standards_list = []
for jsono in titles:
text = jsono["title"]
if isinstance(text, bytes):
text = text.decode(self.encoding)
standards_list.append(text)
return standards_list
def _read_stories(self, stream):
stories = []
for i in range(10):
line = stream.readline()
if not line:
return stories
story = json.loads(line)
stories.append(story)
return stories
together with a custom tokenizer (note that StoryTokenizer must be defined before StoryCorpusReader, since it is used there as a default argument):
import re
import string
import typing

from nltk.tokenize.api import TokenizerI
from nltk.tokenize.casual import _replace_html_entities
REGEXPS = (
# HTML tags:
r"""<[^<>]+>""",
# email addresses
r"""[\w.+-]+#[\w-]+\.(?:[\w-]\.?)+[\w-]""")
class StoryTokenizer(TokenizerI):
_WORD_RE = None
def tokenize(self, text: str) -> typing.List[str]:
# Fix HTML character entities:
safe_text = _replace_html_entities(text)
# Tokenize
words = self.WORD_RE.findall(safe_text)
# Remove punctuation
words = [
word
for word in words
if re.match(f"[{re.escape(string.punctuation)}——–’‘“”×]", word.casefold())
== None
]
return words
@property
def WORD_RE(self) -> "re.Pattern":
# Compiles the regex for this and all future instantiations of StoryTokenizer.
if not type(self)._WORD_RE:
type(self)._WORD_RE = re.compile(
f"({'|'.join(REGEXPS)})",
re.VERBOSE | re.I | re.UNICODE,
)
return type(self)._WORD_RE
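A short usage sketch of the reader above (my own illustration; it assumes the placeholders were filled in with a real folder and a file name whose lines are JSON objects containing a "title" field):

# Hypothetical usage, e.g. with <folder_path> = "stories" and <file_name> = "stories.json".
reader = StoryCorpusReader()
print(reader.fileids())       # the file(s) the reader was built over
print(reader.titles()[:5])    # first few story titles, decoded to str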
I have a class that I use in a Flask app. I use this class on multiple pages, which is why I would like to save the created class object with jsonpickle and unpack it when I need it again, but it keeps giving me errors. I have a class that looks similar to this:
class files(name):
def __init__(self, name):
self.name = name
self.settings = Settings()
self.files_directory = self.settings.files_directory
self.files = self.create_list()
def store_files_from_folder(self):
loaded_files = []
files = list_files()
for file in files:
file_path = os.path.join(self.files_directory, file)
print('Loading file: {}'.format(file))
loaded_file = function_reads_in_files_from_folder(file_path, self.name)
loaded_files.append(loaded_file)
print('Loaded {} files'.format(len(loaded_files)))
and I'm trying to encode and decode it with jsonpickle like this:
creates_class = files("Mario")
jsonpickle_test = jsonpickle.encode(creates_class, unpicklable=False)
result = jsonpickle.decode(jsonpickle_test, files)
But I get the following error:
Traceback (most recent call last):
File "C:\Users\lib\site-packages\IPython\core\interactiveshell.py", line 3343, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-8-23e9b5d176ac>", line 1, in <module>
result = jsonpickle.decode(jsonpickle_test, files)
File "C:\Users\lib\site-packages\jsonpickle\unpickler.py", line 41, in decode
data = backend.decode(string)
AttributeError: type object 'files' has no attribute 'decode'
I can't manage to resolve it. Could someone help me?
The problem is the unpicklable=False argument being passed. From the jsonpickle documentation:
unpicklable – If set to False then the output will not contain the information necessary to turn the JSON data back into Python objects, but a simpler JSON stream is produced.
You can either avoid unpicklable=False, or load the produced data with json.loads into a dict and then pass it as keyword arguments when creating the object:
creates_class = files("Mario")
jsonpickle_test = jsonpickle.encode(creates_class, unpicklable=False)
result_dict = json.loads(jsonpickle_test)
create_class = files(**result_dict)
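For completeness, a minimal sketch of the first option (leaving unpicklable at its default of True): the encoded JSON then keeps the type information, so decode() can rebuild the object on its own. Note that decode()'s second positional parameter is the backend, not a target class, which is why passing files made it try to call files.decode in the traceback.

import jsonpickle

creates_class = files("Mario")

# Default behaviour (unpicklable=True): the JSON records the class,
# so decode() can reconstruct a files instance without being given the type.
frozen = jsonpickle.encode(creates_class)
restored = jsonpickle.decode(frozen)
print(type(restored), restored.name)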
I am trying to work with Google Vision and Python. I am using the sample files but I keep getting the same error message:
Traceback (most recent call last):
File "C:\Program Files (x86)\Python37-32\lib\site-packages\google\protobuf\json_format.py", line 416, in Parse
js = json.loads(text, object_pairs_hook=_DuplicateChecker)
File "C:\Program Files (x86)\Python37-32\lib\json\__init__.py", line 361, in loads
return cls(**kw).decode(s)
File "C:\Program Files (x86)\Python37-32\lib\json\decoder.py", line 338, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "C:\Program Files (x86)\Python37-32\lib\json\decoder.py", line 356, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "sample.py", line 72, in <module>
async_detect_document('gs://matr/file_1035.pdf','gs://matr/output/')
File "sample.py", line 59, in async_detect_document
json_string, vision.types.AnnotateFileResponse())
File "C:\Program Files (x86)\Python37-32\lib\site-packages\google\protobuf\jso
n_format.py", line 418, in Parse
raise ParseError('Failed to load JSON: {0}.'.format(str(e)))
google.protobuf.json_format.ParseError: Failed to load JSON: Expecting value: li
ne 1 column 1 (char 0).
I am guessing it has something to do with the resulting JSON file. It does produce a JSON file, but I expected the script to also print the text to the command line. Here are the first few lines of the JSON file:
{
"inputConfig": {
"gcsSource": {
"uri": "gs://python-docs-samples-tests/HodgeConj.pdf"
},
"mimeType": "application/pdf"
},
The resulting file does load into a JSON object using
data = json.load(jsonfile)
I have tried print(json_string), but I only get b'placeholder'.
How can I get this to work? I am using Python 3.7.2
My code is below:
def async_detect_document(gcs_source_uri, gcs_destination_uri):
"""OCR with PDF/TIFF as source files on GCS"""
from google.cloud import vision
from google.cloud import storage
from google.protobuf import json_format
import re
# Supported mime_types are: 'application/pdf' and 'image/tiff'
mime_type = 'application/pdf'
# How many pages should be grouped into each json output file.
batch_size = 2
client = vision.ImageAnnotatorClient()
feature = vision.types.Feature(
type=vision.enums.Feature.Type.DOCUMENT_TEXT_DETECTION)
gcs_source = vision.types.GcsSource(uri=gcs_source_uri)
input_config = vision.types.InputConfig(
gcs_source=gcs_source, mime_type=mime_type)
gcs_destination = vision.types.GcsDestination(uri=gcs_destination_uri)
output_config = vision.types.OutputConfig(
gcs_destination=gcs_destination, batch_size=batch_size)
async_request = vision.types.AsyncAnnotateFileRequest(
features=[feature], input_config=input_config,
output_config=output_config)
operation = client.async_batch_annotate_files(
requests=[async_request])
print('Waiting for the operation to finish.')
operation.result(timeout=180)
# Once the request has completed and the output has been
# written to GCS, we can list all the output files.
storage_client = storage.Client()
match = re.match(r'gs://([^/]+)/(.+)', gcs_destination_uri)
bucket_name = match.group(1)
prefix = match.group(2)
bucket = storage_client.get_bucket(bucket_name=bucket_name)
# List objects with the given prefix.
blob_list = list(bucket.list_blobs(prefix=prefix))
print('Output files:')
for blob in blob_list:
print(blob.name)
# Process the first output file from GCS.
# Since we specified batch_size=2, the first response contains
# the first two pages of the input file.
output = blob_list[0]
json_string = output.download_as_string()
response = json_format.Parse(
json_string, vision.types.AnnotateFileResponse())
# The actual response for the first page of the input file.
first_page_response = response.responses[0]
annotation = first_page_response.full_text_annotation
# Here we print the full text from the first page.
# The response contains more information:
# annotation/pages/blocks/paragraphs/words/symbols
# including confidence scores and bounding boxes
print(u'Full text:\n{}'.format(
annotation.text))
async_detect_document('gs://my_bucket/file_1035.pdf','gs://my_bucket/output/')
I received an answer from a user on a GitHub page:
https://github.com/GoogleCloudPlatform/python-docs-samples/issues/2086#issuecomment-487635159
I had this issue and determined it was caused by the prefix itself being iterated as part of the blob list. I can see that "output/" is listed as a file in your output, and parsing is subsequently attempted on it, which causes the error.
Try hardcoding a prefix, something like prefix = 'output/out', so that the folder itself won't be included in the list.
The demo code should probably be modified to handle this simple case a little better.
import re
def async_detect_document(gcs_source_uri, gcs_destination_uri):
"""OCR with PDF/TIFF as source files on GCS"""
from google.cloud import vision
from google.cloud import storage
from google.protobuf import json_format
# Supported mime_types are: 'application/pdf' and 'image/tiff'
mime_type = 'application/pdf'
# How many pages should be grouped into each json output file.
batch_size = 2
client = vision.ImageAnnotatorClient()
feature = vision.types.Feature(
type=vision.enums.Feature.Type.DOCUMENT_TEXT_DETECTION)
gcs_source = vision.types.GcsSource(uri=gcs_source_uri)
input_config = vision.types.InputConfig(
gcs_source=gcs_source, mime_type=mime_type)
gcs_destination = vision.types.GcsDestination(uri=gcs_destination_uri)
output_config = vision.types.OutputConfig(
gcs_destination=gcs_destination, batch_size=batch_size)
async_request = vision.types.AsyncAnnotateFileRequest(
features=[feature], input_config=input_config,
output_config=output_config)
operation = client.async_batch_annotate_files(
requests=[async_request])
print('Waiting for the operation to finish.')
operation.result(timeout=180)
# Once the request has completed and the output has been
# written to GCS, we can list all the output files.
storage_client = storage.Client()
match = re.match(r'gs://([^/]+)/(.+)', gcs_destination_uri)
bucket_name = match.group(1)
prefix = match.group(2)
bucket = storage_client.get_bucket(bucket_name=bucket_name)
print ('prefix: ' + prefix)
prefix = 'output/out'
print ('prefix new: ' + prefix)
# List objects with the given prefix.
blob_list = list(bucket.list_blobs(prefix=prefix))
print('Output files:')
for blob in blob_list:
print(blob.name)
# Process the first output file from GCS.
# Since we specified batch_size=2, the first response contains
# the first two pages of the input file.
output = blob_list[0]
json_string = output.download_as_string()
response = json_format.Parse(
json_string, vision.types.AnnotateFileResponse())
# The actual response for the first page of the input file.
first_page_response = response.responses[0]
annotation = first_page_response.full_text_annotation
# Here we print the full text from the first page.
# The response contains more information:
# annotation/pages/blocks/paragraphs/words/symbols
# including confidence scores and bounding boxes
print(u'Full text:\n{}'.format(
annotation.text))
async_detect_document('gs://my_bucket/my_file.pdf','gs://my_bucket/output/out')
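A small variation on the same fix (my own suggestion, not from the linked comment): rather than hardcoding the prefix, skip the folder placeholder object when building the list, since the empty gs://.../output/ entry is what produces the "Expecting value: line 1 column 1 (char 0)" parse error:

# Keep only real result objects; the 'output/' placeholder ends with '/' and has no JSON body.
blob_list = [
    blob for blob in bucket.list_blobs(prefix=prefix)
    if not blob.name.endswith('/')
]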
I downloaded a zip file from https://clinicaltrials.gov/AllPublicXML.zip, which contains over 200k XML files (most are < 10 KB in size), to a directory (see 'dirpath_zip' in the CODE) I created on Ubuntu 16.04 (using DigitalOcean). What I'm trying to accomplish is loading all of these into MongoDB (also installed in the same location as the zip file).
I ran the CODE below twice and it consistently failed when processing the 15988th file.
I've googled around and tried reading other posts regarding this particular error, but couldn't find a way to solve it. Actually, I'm not really sure what the problem is... any help is much appreciated!
CODE:
import re
import sys
import json
import zipfile
import pymongo
import datetime
import xmltodict
from bs4 import BeautifulSoup
from pprint import pprint as ppt
def timestamper(stamp_type="regular"):
if stamp_type == "regular":
timestamp = str(datetime.datetime.now())
elif stamp_type == "filename":
timestamp = str(datetime.datetime.now()).replace("-", "").replace(":", "").replace(" ", "_")[:15]
else:
sys.exit("ERROR [timestamper()]: unexpected 'stamp_type' (parameter) encountered")
return timestamp
client = pymongo.MongoClient()
db = client['ctgov']
coll_name = "ts_"+timestamper(stamp_type="filename")
coll = db[coll_name]
dirpath_zip = '/glbdat/ctgov/all/alltrials_20180402.zip'
z = zipfile.ZipFile(dirpath_zip, 'r')
i = 0
for xmlfile in z.namelist():
print(i, 'parsing:', xmlfile)
if xmlfile == 'Contents.txt':
print(xmlfile, '==> entering "continue"')
continue
else:
soup = BeautifulSoup(z.read(xmlfile), 'lxml')
json_study = json.loads(re.sub('\s', ' ', json.dumps(xmltodict.parse(str(soup.find('clinical_study'))))).strip())
coll.insert_one(json_study)
i+=1
ERROR MESSAGE:
Traceback (most recent call last):
File "zip_to_mongo_alltrials.py", line 38, in <module>
soup = BeautifulSoup(z.read(xmlfile), 'lxml')
File "/usr/local/lib/python3.5/dist-packages/bs4/__init__.py", line 225, in __init__
markup, from_encoding, exclude_encodings=exclude_encodings)):
File "/usr/local/lib/python3.5/dist-packages/bs4/builder/_lxml.py", line 118, in prepare_markup
for encoding in detector.encodings:
File "/usr/local/lib/python3.5/dist-packages/bs4/dammit.py", line 264, in encodings
self.chardet_encoding = chardet_dammit(self.markup)
File "/usr/local/lib/python3.5/dist-packages/bs4/dammit.py", line 34, in chardet_dammit
return chardet.detect(s)['encoding']
File "/usr/lib/python3/dist-packages/chardet/__init__.py", line 30, in detect
u.feed(aBuf)
File "/usr/lib/python3/dist-packages/chardet/universaldetector.py", line 128, in feed
if prober.feed(aBuf) == constants.eFoundIt:
File "/usr/lib/python3/dist-packages/chardet/charsetgroupprober.py", line 64, in feed
st = prober.feed(aBuf)
File "/usr/lib/python3/dist-packages/chardet/hebrewprober.py", line 224, in feed
aBuf = self.filter_high_bit_only(aBuf)
File "/usr/lib/python3/dist-packages/chardet/charsetprober.py", line 53, in filter_high_bit_only
aBuf = re.sub(b'([\x00-\x7F])+', b' ', aBuf)
File "/usr/lib/python3.5/re.py", line 182, in sub
return _compile(pattern, flags).sub(repl, string, count)
MemoryError
Try moving the reading from the file and the insertion into the database into a separate function, and add gc.collect() to force garbage collection:
import gc

def read_xml_insert(xmlfile):
    soup = BeautifulSoup(z.read(xmlfile), 'lxml')
    json_study = json.loads(re.sub(r'\s', ' ', json.dumps(xmltodict.parse(str(soup.find('clinical_study'))))).strip())
    coll.insert_one(json_study)

i = 0
for xmlfile in z.namelist():
    print(i, 'parsing:', xmlfile)
    if xmlfile == 'Contents.txt':
        print(xmlfile, '==> entering "continue"')
        continue
    else:
        read_xml_insert(xmlfile)
        i += 1
    gc.collect()
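One further idea, an assumption based on the traceback rather than part of the answer above: the MemoryError is raised inside chardet's encoding detection, which BeautifulSoup only runs when it is given raw bytes. Decoding each file explicitly before parsing lets BeautifulSoup skip that step, roughly like this:

def read_xml_insert(xmlfile):
    # Decode up front so BeautifulSoup never calls chardet on the raw bytes.
    raw = z.read(xmlfile).decode('utf-8', errors='replace')
    soup = BeautifulSoup(raw, 'lxml')
    json_study = json.loads(
        re.sub(r'\s', ' ',
               json.dumps(xmltodict.parse(str(soup.find('clinical_study'))))).strip())
    coll.insert_one(json_study)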
Apologies, I'm pretty new to Python and I'm not 100% sure why this is failing, since all the example code I've seen is really similar.
import io
import json
import argparse
from object_detection.helpers import average_bbox
ap = argparse.ArgumentParser()
ap.add_argument("-o","--output",required=True,help="Output file name.")
ap.add_argument("-c","--class",help="Object class name")
ap.add_argument("-a","--annotations",required=True,help="File path annotations are located in")
args = vars(ap.parse_args())
(avgW,avgH) = average_bbox(args["annotations"])
if args["class"] is None:
name = args["annotations"].split("/")[-1]
else:
name = args["class"]
with io.open(args["output"],'w') as f:
o = {}
o["class"] = name
o["avgWidth"] = avgW
o["avgHeight"] = avgH
f.write(json.dumps(o,f))
name, avgW and avgH are all valid values. avgW and avgH are numbers and name is a string. The output seems like a valid path to create a file.
The error I get is:
Traceback (most recent call last):
File "compute_average_bbox.py", line 19, in <module>
with io.open(argparse["output"],'w') as f:
TypeError: 'module' object has no attribute '__getitem__'
Any help would be appreciated.
I have a fairly large XML file, about 3 GB in size, that I want to parse in streaming mode using the 'xmltodict' utility. The code I have iterates through each item, forms a dictionary item, and appends it to an in-memory collection, eventually to be dumped as JSON to a file.
I have the following working perfectly on a small XML data set:
import xmltodict, json
import io
output = []
def handle(path, item):
#do stuff
return
doc_file = open("affiliate_partner_feeds.xml","r")
doc = doc_file.read()
xmltodict.parse(doc, item_depth=2, item_callback=handle)
f = open('jbtest.json', 'w')
json.dump(output,f)
On a large file, I get the following:
Traceback (most recent call last):
File "jbparser.py", line 125, in <module>
**xmltodict.parse(doc, item_depth=2, item_callback=handle)**
File "/usr/lib/python2.7/site-packages/xmltodict.py", line 248, in parse
parser.Parse(xml_input, True)
OverflowError: size does not fit in an int
The exact location of the exception inside xmltodict.py is:
def parse(xml_input, encoding=None, expat=expat, process_namespaces=False,
namespace_separator=':', **kwargs):
handler = _DictSAXHandler(namespace_separator=namespace_separator,
**kwargs)
if isinstance(xml_input, _unicode):
if not encoding:
encoding = 'utf-8'
xml_input = xml_input.encode(encoding)
if not process_namespaces:
namespace_separator = None
parser = expat.ParserCreate(
encoding,
namespace_separator
)
try:
parser.ordered_attributes = True
except AttributeError:
# Jython's expat does not support ordered_attributes
pass
parser.StartElementHandler = handler.startElement
parser.EndElementHandler = handler.endElement
parser.CharacterDataHandler = handler.characters
parser.buffer_text = True
try:
parser.ParseFile(xml_input)
except (TypeError, AttributeError):
**parser.Parse(xml_input, True)**
return handler.item
Is there any way to get around this? AFAIK, the xmlparser object is not exposed for me to play around with and change 'int' to 'long'. More importantly, what is really going on here?
I would really appreciate any leads on this. Thanks!
Try using marshal.load(file) or marshal.load(sys.stdin) to deserialize the file (or to consume it as a stream) instead of reading the whole file into memory and then parsing it as a whole.
Here is an example:
>>> def handle_artist(_, artist):
... print artist['name']
... return True
>>>
>>> xmltodict.parse(GzipFile('discogs_artists.xml.gz'),
... item_depth=2, item_callback=handle_artist)
A Perfect Circle
Fantômas
King Crimson
Chris Potter
...
STDIN:
import sys, marshal
while True:
_, article = marshal.load(sys.stdin)
print article['title']
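Coming back to the original OverflowError: the xmltodict source quoted in the question tries parser.ParseFile(xml_input) first and only falls back to parser.Parse(...) when it is handed a string, so passing the open file object (as in the GzipFile example above) keeps the 3 GB document out of memory entirely. A minimal sketch, reusing the question's file names and handler (my own adaptation, not part of the answer):

import json
import xmltodict

output = []

def handle(path, item):
    output.append(item)   # or write/insert each item immediately to keep memory flat
    return True           # a truthy return tells xmltodict to keep parsing

# Pass the open file object, not doc_file.read(), so expat's ParseFile() streams it.
with open("affiliate_partner_feeds.xml", "rb") as doc_file:
    xmltodict.parse(doc_file, item_depth=2, item_callback=handle)

with open("jbtest.json", "w") as f:
    json.dump(output, f)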