Reading from a file using pickle and a for loop in Python

I have a file into which I have dumped a huge number of lists. Now I want to load this file into memory and use the data inside it. I tried to load the file with pickle's load method, but for some reason it only gives me the first item in the file. In fact, I noticed that it only loads my first list into memory, and that if I want to load the whole file (a number of lists) I have to iterate over the file and call pickle.load(file) on each iteration.
The problem is that I don't know how to actually implement that loop (for or while), because I don't know when I have reached the end of the file.
An example would help me a lot.
Thanks

How about this:
import pickle

lists = []
infile = open('yourfilename.pickle', 'rb')  # pickled data must be read in binary mode
while True:
    try:
        lists.append(pickle.load(infile))
    except (EOFError, pickle.UnpicklingError):
        break
infile.close()
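
For context: a file like this is normally produced with one pickle.dump() call per list, and each dump() writes one complete pickled object, which is why reading it back takes one load() per object. A minimal sketch of the writing side (the file name and data are illustrative only):

import pickle

# Hypothetical data; each dump() call appends one complete pickled
# object to the file, so loading requires one load() call per object.
lists_to_save = [[1, 2, 3], ['a', 'b'], [4.5, 6.7]]

with open('yourfilename.pickle', 'wb') as outfile:
    for item in lists_to_save:
        pickle.dump(item, outfile)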

Related

Issues reading in large .gz files

I am reading in a large zipped json file ~4GB. I want to read in the first n lines.
with gzip.open('/path/to/my/data/data.json.gz','rt') as f:
    line_n = f.readlines(1)
    print(ast.literal_eval(line_n[0])['events']) # a dictionary object
This works fine when I want to read a single line. However, if I now try to read in a loop, e.g.
no_of_lines = 1
with gzip.open('/path/to/my/data/data.json.gz','rt') as f:
    for line in range(no_of_lines):
        line_n = f.readlines(line)
        print(ast.literal_eval(line_n[0])['events'])
my code takes forever to execute, even if that loop is of length 1. I'm assuming this behaviour has something to do with how gzip reads files; perhaps when I loop it tries to obtain information about the file length, which causes the long execution time? Can anyone shed some light on this and potentially provide an alternative way of doing this?
An edited first line of my data:
['{"events": {"category": "EVENT", "mac_address": "123456", "co_site": "HSTH"}}\n']
You are using the readlines() method, which (without a meaningful size hint) reads all remaining lines of the file at once. This can cause performance issues when reading huge files, since Python has to load all of those lines into memory.
An alternative is to iterate over the file object itself, which yields one line at a time without loading the whole file into memory:
with gzip.open('/path/to/my/data/data.json.gz','rt') as f:
    for line in f:
        print(ast.literal_eval(line)['events'])
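If you really only need the first n lines, you can also stop the iteration early with itertools.islice. This is just a sketch (the line count is illustrative), but because gzip decompresses as it streams, stopping early means the rest of the file is never read:

import ast
import gzip
from itertools import islice

no_of_lines = 5  # illustrative value

with gzip.open('/path/to/my/data/data.json.gz', 'rt') as f:
    # islice yields at most no_of_lines lines and then stops,
    # so the remainder of the file is never decompressed.
    for line in islice(f, no_of_lines):
        print(ast.literal_eval(line)['events'])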

Only the last saved dataset exists when using h5py

I am trying to save several datasets into an HDF5 file with the h5py module, but it seems only the last one is saved. I think so because when a break statement is added (the commented-out line below), only the first dataset is saved instead.
The problematic code is below. How can I fix it?
set_num = 0
for cur in data["init"]:
    '''
    got result as a list
    '''
    ipt = h5py.File(output_file, "w")
    s = str(set_num)
    ipt[s] = result
    '''
    create an attribute for ipt[s]
    '''
    set_num += 1
    ipt.close()
    #break
I apologize if there's any silly mistake.
You are opening and closing the file on each pass of the for loop, and you have the mode set to "w", meaning that the existing file is overwritten during each pass.
My recommendation is to instead open the file using a with clause and nest the for loop inside of it, which makes the intent clearer and obviates the need to explicitly close the file. This example might help (though it is not tested, so it may need modifications):
with h5py.File(output_file, "w") as ipt:
    for set_num, curr in enumerate(data["init"]):
        s = str(set_num)
        ipt[s] = result
You only get the last dataset because you are opening the file in write mode ('w') inside your loop. The simple solution is to use append mode ('a'). Better still, move the file open outside the loop and use the with...as context manager.
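If you prefer to keep the open/close inside the loop, a rough sketch of the append-mode variant looks like this (untested; data, output_file, and result are the names from the question's code):

import h5py

set_num = 0
for cur in data["init"]:
    # ... compute `result` for this iteration, as in the question ...
    with h5py.File(output_file, "a") as ipt:  # "a" opens for read/write, creating the file if needed
        ipt[str(set_num)] = result
    set_num += 1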

Process is not coming out of the for loop for ijson items

I am using Python 3.9 and trying:
with open(file, 'r') as fl:
    val = ijson.items(fl, '<my_key>.item', use_float=True)
    for i in val:
        print(i)
After some time the print statement stops printing anything to the Jupyter console, but the Jupyter cell still runs for a very long time.
Does ijson scan the complete file from start to end even if I parse only specific elements? If yes, how can I restrict this behaviour (if that is possible)?
Note: instead of printing the content I am actually writing it to a file; I can see that the file contents stop changing after some time, but the process keeps running.
I have tried all sorts of file-closing operations etc. Nothing has worked so far.
Thanks in advance.
The only way to break out of the ijson iteration is to break out of it yourself (i.e., actually break from the for loop). This is because, as you suspect, ijson reads input files fully. This in turn is because your path (<my_key>.item) could appear again in your file after the initial set of results you are seeing (keys are not required to be unique in JSON).
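So if you only need the items that appear before the point where the output stops changing, break out of the loop yourself. A minimal sketch, assuming you know roughly how many items you want (max_items is illustrative; file and the '<my_key>.item' path are taken from the question):

import ijson

max_items = 1000  # illustrative cut-off

with open(file, 'r') as fl:
    for n, item in enumerate(ijson.items(fl, '<my_key>.item', use_float=True)):
        print(item)
        if n + 1 >= max_items:
            break  # leaving the loop stops ijson from scanning the rest of the file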

Python pointers

I was asked to write a program that finds the string "error" in a file and prints the matched lines, in Python.
My approach: first open the file in read mode,
then use fh.readlines() and store the result in a variable.
After this, use a for loop to iterate line by line, check for the string "error", and print those lines if found.
I was asked to use pointers in Python, since assigning the file content to a variable consumes time when the log file contains huge output.
I did some research on Python pointers but found nothing useful.
Could anyone help me write the above code using pointers instead of storing the whole content in a variable?
There are no pointers in Python. Something pointer-like can be implemented, but it is not worth the effort for your case.
As pointed out in the solution at this link,
Read large text files in Python, line by line without loading it in to memory
You can use something like:
with open("log.txt") as infile:
for line in infile:
if "error" in line:
print(line.strip()) .
The context manager will close the file automatically, and the loop only reads one line at a time. When the next line is read, the previous one can be garbage collected unless you have stored a reference to it somewhere else.
You could build an index with a dictionary: dump the log file into a dictionary whose keys are the words and whose values are the line numbers on which they occur. If you then search for the string "error" you get the line numbers where it is present and can print those lines accordingly. Since a dictionary (hash table) lookup is constant time, O(1), the search itself is fast, though building the index takes time up front (and depends on how collisions are handled). A sketch of this idea follows.
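A rough sketch of that index idea (the file name is hypothetical, and note it only matches whole whitespace-separated tokens, not substrings):

from collections import defaultdict

index = defaultdict(list)  # word -> line numbers where it appears

with open('log.txt') as infile:  # hypothetical file name
    for line_no, line in enumerate(infile, start=1):
        for word in line.split():
            index[word].append(line_no)

print(index['error'])  # line numbers of lines containing the token "error"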
I used the code below instead of putting the data in a variable and then looping over it.
for line in open('c182573.log', 'r').readlines():
    if 'Executing' in line:
        print(line)
So there is no way to implement pointers or references in Python.
Thanks all
There are no pointers in Python. Something pointer-like can be implemented, but it is not required for your case.
Try the code below:
with open('test.txt') as f:
    content = f.readlines()

for i in content:
    if "error" in i:
        print(i.strip())
If you want to understand Python variables as pointers, see this link:
http://scottlobdell.me/2013/08/understanding-python-variables-as-pointers/

How to load a big text file efficiently in Python

I have a text file containing 7000 lines of strings. I have to search for a specific string based on a few params.
Some are saying that the code below wouldn't be efficient (in speed and memory usage).
f = open("file.txt")
data = f.read().split() # strings as list
First of all, if I don't even load it as a list, how would I start searching at all?
Is it efficient to load the entire file? If not, how should I do it?
To filter anything we need to search for it, and to search we need to read it, right?
A bit confused.
Iterate over each line of the file without storing it all; this keeps the program memory-efficient.
with open(filename) as f:
    for line in f:
        if "search_term" in line:
            break
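
If you need all matching lines rather than just the first one, a small variation of the same pattern (still a sketch, using the question's file name and a placeholder search term):

matches = []

with open("file.txt") as f:
    for line in f:
        if "search_term" in line:
            matches.append(line.strip())  # collect every match instead of stopping at the first

print(matches)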
