Loading an ontology from a string in Python

We have a PySpark pair RDD that stores the paths of .owl files as keys and the file contents as values.
I wish to carry out reasoning using Owlready2. To load an ontology from OWL files, the get_ontology() function is used. However, this function expects an IRI (a sort of URL) to the file, whereas I have the file contents as a str in Python.
Is there a way to make this work?
I have tried the following:
Used get_ontology(file_contents).load() --> this obviously does not work as the function expects a file path.
Used get_ontology(file_contents) --> no error, but the ontology does not get loaded, so reasoning does not happen.

Answering my own question.
The load() function in Owlready2 has a couple more arguments that are not mentioned anywhere in the documentation. The function definitions of the package can be seen here.
Quoting from there, def load(self, only_local = False, fileobj = None, reload = False, reload_if_newer = False, **args) is the function signature.
We can see that a fileobj can also be passed, which is None by default. Further, the line fileobj = open(f, "rb") tells us that the file needs to be read in binary mode.
Taking all this into consideration, the following code worked for our situation:
from io import BytesIO # to create a file-like object
my_str = RDDList[1][1] # the pair RDD cell with the string data
my_str_as_bytes = str.encode(my_str) # convert to binary
fobj = BytesIO(my_str_as_bytes)
abox = get_ontology("some-random-path").load(fileobj=fobj) # the path is insignificant, notice the 'fileobj = fobj'.
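For completeness, here is a rough, untested sketch of how the same trick might be mapped over the whole pair RDD; rdd is the hypothetical pair RDD from the question, and running the reasoner on Spark executors also assumes Java is available on the workers:
from io import BytesIO
from owlready2 import get_ontology, sync_reasoner

def load_and_reason(path_and_contents):
    path, contents = path_and_contents              # key: file path, value: OWL contents as str
    onto = get_ontology(path).load(fileobj=BytesIO(contents.encode()))
    with onto:
        sync_reasoner()                             # run the default HermiT reasoner
    return path, [c.name for c in onto.classes()]   # return something picklable, not the ontology itself

results = rdd.map(load_and_reason).collect()        # 'rdd' is the pair RDD from the question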

Related

How to get the path or the image object from a .jpg uploaded to a server using FastAPI's UploadFile?

I'm importing an image using the following code:
import requests

files = {
    'file': open(r'C:/Users/jared/Deblur Project/curl requests/test.jpg', 'rb'),
}
response = requests.post('http://localhost:5000/net/image/evaluate_local', files=files)
print(response)
This sends 'test.jpg' over to the following route:
@app.post("/net/image/evaluate_local")
async def get_net_image_evaluate_local(file: UploadFile = File(...)):
    image_path = file.read()
    threshold = 0.75
    model_path = "model.tflite"
    prediction = analyze_images(model_path, image_path, threshold)
    return prediction
Obviously, image_path = file.read() is not working, but it's showcasing what I'm trying to do. I need to provide an image path to the analyze_images() function, but I'm not exactly sure how to do so.
If I cannot provide it as a path, I am also trying to provide it as a raw bytes array for the model to use.
image_path = file.read()
returns
b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x00\x00\x01\x00\x01\x00\x00\xff\xe2\x02(ICC_PROFILE\x00\x01\x01\x00\x00\x02\x18\x00\x00\x00\x00\x02\x10\x00\x00mntrRGB XYZ \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00acsp\x00\x00\x00\x00\x00\x00...
which I am also unsure how to work with.
Anyone have any advice on how to proceed?
I'm unsure how to make it work as a file path, but I did manage to get it to work. Luckily, the model I'm working with takes in those image bytes (as returned by file.read()).
So if I had image = file.read(), I could feed that image into a function that takes image as a parameter, and then use it as follows:
img = np.uint8(tf.image.resize(tf.io.decode_image(image), [width, height], method=tf.image.ResizeMethod.BILINEAR))
There is no immediate file path, because when you use FastAPI's/Starlette's UploadFile, it uses a "spooled" file:
A file stored in memory up to a maximum size limit, and after passing this limit it will be stored in disk.
The underlying implementation is actually from Python's standard tempfile module for generating temporary files and folders. See the tempfile.SpooledTemporaryFile section:
This class operates exactly as TemporaryFile() does, except that data is spooled in memory until the file size exceeds max_size, or until the file’s fileno() method is called, at which point the contents are written to disk and operation proceeds as with TemporaryFile().
If you can work instead with the raw bytes of the image (as you did using .read() in your other answer), then I think that's the better approach, as most image processing starts with the image bytes anyway. (Make sure to use await appropriately if you are calling the async methods!)
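As a minimal sketch of that bytes-based approach (the placeholder response body here is my own, not part of the original answer):
from fastapi import FastAPI, File, UploadFile

app = FastAPI()

@app.post("/net/image/evaluate_bytes")
async def evaluate_bytes(file: UploadFile = File(...)):
    image_bytes = await file.read()    # raw bytes of the upload, awaited inside the async route
    # hand image_bytes to whatever bytes-based pipeline you have;
    # here we just report the size as a placeholder
    return {"filename": file.filename, "num_bytes": len(image_bytes)}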
Alternatively, if a function expects a "file-like" object, you can pass in the UploadFile.file attribute itself, which is the SpooledTemporaryFile object, which is the
... actual Python file that you can pass directly to other functions or libraries that expect a "file-like" object.
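For example, a library that accepts file-like objects can read UploadFile.file directly; a minimal sketch using Pillow, purely as an illustration and not part of the original answer:
from fastapi import FastAPI, File, UploadFile
from PIL import Image

app = FastAPI()

@app.post("/net/image/size")
def image_size(file: UploadFile = File(...)):
    img = Image.open(file.file)        # file.file is the underlying SpooledTemporaryFile
    return {"width": img.width, "height": img.height}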
If you really need a file on disk and a path to it, you can write the contents to a NamedTemporaryFile, which
... is guaranteed to have a visible name in the file system (on Unix, the directory entry is not unlinked). That name can be retrieved from the name attribute of the returned file-like object.
...
The returned object is always a file-like object whose file attribute is the underlying true file object. This file-like object can be used in a with statement, just like a normal file.
from tempfile import NamedTemporaryFile

@app.post("/net/image/evaluate_local")
async def get_net_image_evaluate_local(file: UploadFile = File(...)):
    file_suffix = "".join(file.filename.partition(".")[1:])
    with NamedTemporaryFile(mode="w+b", suffix=file_suffix) as file_on_disk:
        file_contents = await file.read()
        file_on_disk.write(file_contents)
        image_path = file_on_disk.name
        print(image_path)
        threshold = 0.75
        model_path = "model.tflite"
        prediction = analyze_images(model_path, image_path, threshold)
    return prediction
On my Mac, image_path prints out something like
/var/folders/3h/pdjwtnlx4p13chnw21rvwbtw0000gp/T/tmp1v52fm95.png
and that file would be available as long as the file is not yet closed.

Writing a Python pdfrw PdfReader object to an array of bytes / filestream

I'm currently working on a simple proof of concept for a pdf-editor application. The example is supposed to be a simplified python script showcasing how we could use the pdfrw library to edit PDF files with forms in them.
So, here's the issue. I'm not interested in writing the edited PDF to a file.
The idea is that file opening and closing is most likely going to be handled by external code, and so I want all the edits to my files to be done in memory. I don't want to write the edited filestream to a local file.
Let me specify what I mean by this. I currently have a piece of code like this:
class FormFiller:
    def __fill_pdf__(input_pdf_filestream: bytes, data_dict: dict):
        template_pdf: pdfrw.PdfReader = pdfrw.PdfReader(input_pdf_filestream)
        # <some editing magic here>
        return template_pdf

    def fillForm(self, mapper: FieldMapper):
        value_mapping: dict = mapper.getValues()
        filled_pdf: pdfrw.PdfReader = self.__fill_pdf__(self.filestream, value_mapping)
        # <this point is crucial>

    def __init__(self, filestream: bytes):
        self.filestream: bytes = filestream
So, as you can see, the FormFiller constructor receives an array of bytes; in fact, it's an io.BytesIO object. The template_pdf variable uses a PdfReader object from the pdfrw library. Now, when we get to the #<this point is crucial> marker, I have a filled_pdf variable which is a PdfReader object. I would like to convert it to a filestream (a bytes array, or an io.BytesIO object if you will) and return it in that form. I don't want to write it to a file. However, the writer class provided by pdfrw (pdfrw.PdfWriter) does not allow for such an operation. It only provides a write(<filename>) method, which saves the PdfReader object to a PDF output file.
How should I approach this? Do you recommend a workaround? Or perhaps I should use a completely different library to accomplish this?
Please help :-(
To save your altered PDF to memory in an object that can be passed around (instead of writing to a file), simply create an empty instance of io.BytesIO:
from io import BytesIO
new_bytes_object = BytesIO()
Then, use pdfrw's PdfWriter.write() method to write your data to the empty BytesIO object:
pdfrw.PdfWriter.write(new_bytes_object, filled_pdf)
# I'm not sure about the syntax, I haven't used this lib before
This works because io.BytesIO objects act like a file object, also known as a file-like object. It and related classes like io.StringIO behave like files in memory, such as the object f created with the built-in function open below:
with open("output.txt", "a") as f:
f.write(some_data)
Before you attempt to read from new_bytes_object, don't forget to seek(0) back to the beginning, or rewind it. Otherwise, the object seems empty.
new_bytes_object.seek(0)
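Putting the pieces together, here is a hedged sketch of the whole round trip (I believe PdfWriter.write() accepts an already-open binary file-like object as well as a filename, but treat that as an assumption to verify):
from io import BytesIO
import pdfrw

def pdf_to_bytes(filled_pdf):
    buf = BytesIO()
    pdfrw.PdfWriter().write(buf, filled_pdf)   # write the PdfReader's contents into the in-memory buffer
    buf.seek(0)                                # rewind so callers can read the buffer from the start
    return buf.getvalue()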

How to read a JSON file in python? [duplicate]

I have some JSON files of 500 MB each.
If I use the "trivial" json.load() to load their content all at once, it will consume a lot of memory.
Is there a way to read the file partially? If it were a plain-text, line-delimited file, I would be able to iterate over the lines. I am looking for an analogous approach.
There was a duplicate to this question that had a better answer. See https://stackoverflow.com/a/10382359/1623645, which suggests ijson.
Update:
I tried it out, and ijson is to JSON what SAX is to XML. For instance, you can do this:
import ijson

for prefix, the_type, value in ijson.parse(open(json_file_name)):
    print(prefix, the_type, value)
where prefix is a dot-separated index in the JSON tree (what happens if your key names have dots in them? I guess that would be bad for Javascript, too...), the_type describes a SAX-like event, one of 'null', 'boolean', 'number', 'string', 'map_key', 'start_map', 'end_map', 'start_array', 'end_array', and value is the value of the object or None if the_type is an event like starting/ending a map/array.
The project has some docstrings, but not enough global documentation. I had to dig into ijson/common.py to find what I was looking for.
So the problem is not that each file is too big, but that there are too many of them, and they seem to be adding up in memory. Python's garbage collector should be fine, unless you are keeping around references you don't need. It's hard to tell exactly what's happening without any further information, but some things you can try:
Modularize your code. Do something like:
for json_file in list_of_files:
    process_file(json_file)
If you write process_file() in such a way that it doesn't rely on any global state, and doesn't change any global state, the garbage collector should be able to do its job.
Deal with each file in a separate process. Instead of parsing all the JSON files at once, write a program that parses just one, and pass each one in from a shell script, or from another Python process that calls your script via subprocess.Popen; a sketch of this follows below. This is a little less elegant, but if nothing else works, it will ensure that you're not holding on to stale data from one file to the next.
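A hedged sketch of that one-process-per-file idea (the script name parse_one.py is hypothetical, and subprocess.run is the modern convenience wrapper around Popen):
import subprocess
import sys

list_of_files = sys.argv[1:]

for json_file in list_of_files:
    # parse_one.py is a hypothetical script that loads a single JSON file,
    # does its work, and exits, returning all of its memory to the OS
    subprocess.run([sys.executable, "parse_one.py", json_file], check=True)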
Hope this helps.
Yes.
You can use jsonstreamer, a SAX-like push parser that I have written, which will allow you to parse arbitrarily sized chunks; you can get it here and check out the README for examples. It's fast because it uses the C yajl library.
This can be done using ijson. How ijson works has been explained very well by Jim Pivarski in the answer above. The code below will read a file and print each JSON object from the list. For example, suppose the file content is as follows:
[{"name": "rantidine", "drug": {"type": "tablet", "content_type": "solid"}},
{"name": "nicip", "drug": {"type": "capsule", "content_type": "solid"}}]
You can print every element of the array using the method below:
import ijson

def extract_json(filename):
    with open(filename, 'rb') as input_file:
        jsonobj = ijson.items(input_file, 'item')
        jsons = (o for o in jsonobj)
        for j in jsons:
            print(j)
Note: 'item' is the prefix ijson uses for each element of a top-level array.
If you want to access only specific JSON objects based on a condition, you can do it in the following way:
def extract_tabtype(filename):
    with open(filename, 'rb') as input_file:
        # the prefix must match the sample data above, which uses the key 'drug'
        objects = ijson.items(input_file, 'item.drug')
        tabtype = (o for o in objects if o['type'] == 'tablet')
        for prop in tabtype:
            print(prop)
This will print only those objects whose drug type is tablet.
Since you mention running out of memory, I have to ask whether you're actually managing memory. Are you using the del keyword to remove your old object before trying to read a new one? Python should never silently retain something in memory if you remove it.
Update
See the other answers for advice.
Original answer from 2010, now outdated
Short answer: no.
Properly dividing a json file would take intimate knowledge of the json object graph to get right.
However, if you have this knowledge, then you could implement a file-like object that wraps the json file and spits out proper chunks.
For instance, if you know that your json file is a single array of objects, you could create a generator that wraps the json file and returns chunks of the array.
You would have to do some string-level parsing to get the chunking of the JSON file right; a rough sketch of the idea follows below.
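As a rough illustration of that kind of string-level chunking (my own sketch, assuming the file is a single top-level array of objects; a streaming parser like ijson is the more robust choice):
import json

def iter_array_objects(path, chunk_size=65536):
    # Yield each top-level object from a JSON file shaped like [ {...}, {...}, ... ],
    # tracking brace depth and string state by hand instead of loading the whole file.
    buf = []            # characters of the object currently being collected
    depth = 0           # current brace nesting depth
    in_string = False   # inside a JSON string literal?
    escaped = False     # previous character was a backslash inside a string?
    with open(path, "r", encoding="utf-8") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            for ch in chunk:
                if depth > 0:
                    buf.append(ch)
                if in_string:
                    if escaped:
                        escaped = False
                    elif ch == "\\":
                        escaped = True
                    elif ch == '"':
                        in_string = False
                    continue
                if ch == '"':
                    in_string = True
                elif ch == "{":
                    depth += 1
                    if depth == 1:
                        buf = ["{"]
                elif ch == "}":
                    depth -= 1
                    if depth == 0:
                        yield json.loads("".join(buf))
                        buf = []

# usage: for obj in iter_array_objects("big.json"): ...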
I don't know what generates your JSON content. If possible, I would consider generating a number of manageable files instead of one huge file.
Another idea is to try loading it into a document-store database like MongoDB.
It deals with large blobs of JSON well, although you might run into the same problem loading the JSON; avoid that by loading the files one at a time.
If this path works for you, then you can interact with the JSON data via their client and potentially not have to hold the entire blob in memory.
http://www.mongodb.org/
"the garbage collector should free the memory"
Correct.
Since it doesn't, something else is wrong. Generally, the problem with infinite memory growth is global variables.
Remove all global variables.
Make all module-level code into smaller functions.
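A minimal sketch of what that refactor looks like (file names and the summarize step are placeholders of my own):
import json
import sys

def summarize(data):
    # placeholder: reduce the parsed JSON to whatever small result you actually need
    return len(data)

def process_file(path):
    with open(path, "r") as f:
        data = json.load(f)       # 'data' is local, so it can be collected after the call
    return summarize(data)

def main(paths):
    return [process_file(p) for p in paths]

if __name__ == "__main__":
    print(main(sys.argv[1:]))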
In addition to @codeape's answer:
I would try writing a custom JSON parser to help you figure out the structure of the JSON blob you are dealing with. Print out the key names only, etc. Build a hierarchical tree and decide (yourself) how you can chunk it. This way you can do what @codeape suggests and break the file up into smaller chunks.
You can convert the JSON file to a CSV file and then process it line by line:
import csv
import ijson

def convert_json(file_path):
    did_write_headers = False
    headers = []
    row = []
    iterable_json = ijson.parse(open(file_path, 'r'))
    with open(file_path + '.csv', 'w', newline='') as csv_file:
        csv_writer = csv.writer(csv_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
        for prefix, event, value in iterable_json:
            if event == 'end_map':
                if not did_write_headers:
                    csv_writer.writerow(headers)
                    did_write_headers = True
                csv_writer.writerow(row)
                row = []
            if event == 'map_key' and not did_write_headers:
                headers.append(value)
            if event == 'string':
                row.append(value)
Simply using json.load() on the whole file will take a lot of time and memory. Instead, you can load the JSON data line by line (this assumes a line-delimited file, one JSON object per line) into a dictionary of key-value pairs, append each dictionary to a final dictionary, and convert that to a pandas DataFrame, which will help with further analysis:
import json
import pandas as pd

def get_data():
    # yield the file line by line (one JSON object per line)
    with open('Your_json_file_name', 'r') as f:
        for line in f:
            yield line

data = get_data()
data_dict = {}
for i, line in enumerate(data):
    each = {}
    # k and v are the key and value pair
    for k, v in json.loads(line).items():
        # print(f'{k}: {v}')
        each[f'{k}'] = f'{v}'
    data_dict[i] = each

# data_dict now holds the data keyed by line number; converting it to a DataFrame
# gives a transposed (column-per-line) table, so transpose it back at the end
Data = pd.DataFrame(data_dict)
Data_1 = Data.T

Read files with fileinput.input() from X line

Here is my question; it is probably not very complex, but I am learning Python. I'm trying to read multiple files (all of them with the same format), and at the same time I have to begin reading them from line 32. Somehow I can't find the most efficient way to do so.
Here my code until now:
for file in fileinput.input():
    entries = [f.strip().split("\t") for f in file].readlines()[32:]
which gives the error: AttributeError: 'list' object has no attribute 'readlines'
I know another possibility would be:
sources = open(sys.argv[1], "r").readlines()[32:]
and then just run python3.2 script.py data/*.csv on the command line. But this does not seem to work properly.
I am thankful for any help.
You can use the openhook argument.
According to the module documentation:
You can control how files are opened by providing an opening hook via the openhook parameter to fileinput.input() or FileInput(). The hook must be a function that takes two arguments, filename and mode, and returns an accordingly opened file-like object. Two useful hooks are already provided by this module.
import fileinput

def skip32(filename, mode):
    f = open(filename, mode)
    for i in range(32):
        f.readline()
    return f

entries = [line.strip().split('\t') for line in fileinput.input(openhook=skip32)]
BTW, the last line can be replaced with (using the csv module):
import csv
entries = list(csv.reader(fileinput.input(openhook=skip32), delimiter='\t'))
It's just a small syntax issue:
entries = [f.strip().split("\t") for f in file].readlines()[32:]
should be:
entries = [f.strip().split("\t") for f in file.readlines()][32:]

Upload and parse csv file with google app engine

I'm wondering if anyone with a better understanding of Python and GAE can help me with this. I am uploading a CSV file from a form to the GAE datastore.
class CSVImport(webapp.RequestHandler):
    def post(self):
        csv_file = self.request.get('csv_import')
        fileReader = csv.reader(csv_file)
        for row in fileReader:
            self.response.out.write(row)
I'm running into the same problem that someone else mentions here - http://groups.google.com/group/google-appengine/browse_thread/thread/bb2d0b1a80ca7ac2/861c8241308b9717
That is, csv.reader is iterating over each character and not each line. A Google engineer left this explanation:
The call self.request.get('csv') returns a String. When you iterate over a string, you iterate over the characters, not the lines. You can see the difference here:
class ProcessUpload(webapp.RequestHandler):
    def post(self):
        self.response.out.write(self.request.get('csv'))
        file = open(os.path.join(os.path.dirname(__file__), 'sample.csv'))

        # Iterating over a file
        fileReader = csv.reader(file)
        for row in fileReader:
            self.response.out.write(row)

        # Iterating over a string
        fileReader = csv.reader(self.request.get('csv'))
        for row in fileReader:
            self.response.out.write(row)
I really don't follow the explanation, and was unsuccessful implementing it. Can anyone provide a clearer explanation of this and a proposed fix?
Thanks,
August
Short answer, try this:
fileReader = csv.reader(csv_file.split("\n"))
Long answer, consider the following:
for thing in stuff:
    print thing.strip().split(",")
If stuff is a file pointer, each thing is a line. If stuff is a list, each thing is an item. If stuff is a string, each thing is a character.
Iterating over the object returned by csv.reader is going to give you behavior similar to iterating over the object passed in, only with each item CSV-parsed. If you iterate over a string, you'll get a CSV-parsed version of each character.
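A tiny demonstration of that difference (my own example, not from the original answer):
import csv

# A plain string is iterated character by character:
print(list(csv.reader("a,b")))             # [['a'], ['', ''], ['b']]

# A list of lines (or a file object) is iterated line by line:
print(list(csv.reader(["a,b", "c,d"])))    # [['a', 'b'], ['c', 'd']]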
I can't think of a clearer explanation than what the Google engineer you mentioned said. So let's break it down a bit.
The Python csv module operates on file-like objects, that is, a file or something that behaves like a Python file. Hence, csv.reader() expects to get a file object as its only required parameter.
The webapp.RequestHandler request object provides access to the HTTP parameters that are posted in the form. In HTTP, parameters are posted as key-value pairs, e.g., csv=record_one,record_two. When you invoke self.request.get('csv'), this returns the value associated with the key csv as a Python string. A Python string is not a file-like object. Apparently, the csv module falls back when it does not understand the object and simply iterates it (in Python, strings can be iterated over character by character, e.g., for c in 'Test String': print c will print each character of the string on a separate line).
Fortunately, Python provides a StringIO class that allows a string to be treated as a file-like object. So (assuming GAE supports StringIO, and there's no reason it shouldn't) you should be able to do this:
import StringIO

class ProcessUpload(webapp.RequestHandler):
    def post(self):
        self.response.out.write(self.request.get('csv'))

        # Iterating over a string as a file
        stringReader = csv.reader(StringIO.StringIO(self.request.get('csv')))
        for row in stringReader:
            self.response.out.write(row)
This should work as you expect it to.
Edit: I'm assuming that you are using something like a <textarea/> to collect the csv file. If you're uploading an attachment, different handling may be necessary (I'm not all that familiar with Python GAE or how it handles attachments).
You need to call csv_file = self.request.POST.get("csv_import") and not csv_file = self.request.get("csv_import").
The second one just gives you a string as you mentioned in your original post. But accessing via self.request.POST.get gives you a cgi.FieldStorage object.
This means that you can call csv_file.filename to get the object’s filename and csv_file.type to get the mimetype.
Furthermore, if you access csv_file.file, it’s a StringO object (a read-only object from the StringIO module), not just a string. As ig0774 mentioned in his answer, the StringIO module allows you to treat a string as a file.
Therefore, your code can simply be:
class CSVImport(webapp.RequestHandler):
    def post(self):
        csv_file = self.request.POST.get('csv_import')
        fileReader = csv.reader(csv_file.file)
        for row in fileReader:
            # row is now a list containing all the column data in that row
            self.response.out.write(row)
