The user may give a bunch of URLs as command line args. All URLs given in the past are serialized with pickle. The script checks each given URL; if it is unique, it is serialized and appended to a file. At least that's what should be happening, but nothing is being appended. However, when I open the file in write mode, the new, unique URL is written. So what gives? Code is:
def get_new_urls():
    if(len(urls.URLs) != 0): # check if empty
        with open(urlFile, 'rb') as f:
            try:
                cereal = pickle.load(f)
                print(cereal)
                toDump = []
                for arg in urls.URLs:
                    if (arg in cereal):
                        print("Duplicate URL {0} given, ignoring it.".format(arg))
                    else:
                        toDump.append(arg)
            except Exception as e:
                print("Holy bleep something went wrong: {0}".format(e))
        return(toDump)
urlsToDump = get_new_urls()
print(urlsToDump)

# TODO: append new URLs
if(urlsToDump):
    with open(urlFile, 'ab') as f:
        pickle.dump(urlsToDump, f)

# TODO check HTML of each page against the serialized copy
with open(urlFile, 'rb') as f:
    try:
        cereal = pickle.load(f)
        print(cereal)
    except EOFError: # your URL file is empty, bruh
        pass
Pickle writes out the data you give it in a special format, e.g. it will write some header/metadata/etc. to the file you give it.
It is not intended to work this way; concatenating two pickle streams doesn't really make sense. To achieve a concatenation of your data, you'd need to first read whatever is in the file into your urlsToDump, then update urlsToDump with any new data, and finally dump it out again (overwriting the whole file, not appending).
After
with open(urlFile, 'rb') as f:
you need a while loop, to repeatedly unpickle (repeatedly read) from the file until hitting EOF.
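Alternatively, for the overwrite approach described above, here is a minimal sketch (merge_and_rewrite and candidate_urls are made-up names for illustration; urlFile is from your code):

import pickle

def merge_and_rewrite(urlFile, candidate_urls):
    try:
        with open(urlFile, 'rb') as f:
            known = pickle.load(f)  # the single list dumped last time
    except (FileNotFoundError, EOFError):
        known = []  # first run, or the file is still empty
    known.extend(url for url in candidate_urls if url not in known)
    with open(urlFile, 'wb') as f:  # overwrite the whole file, don't append
        pickle.dump(known, f)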
Related
I created some data and stored it several times like this:
with open('filename', 'a') as f:
    pickle.dump(data, f)
Every time the size of the file increased, but when I open the file
with open('filename', 'rb') as f:
    x = pickle.load(f)
I can see only data from the last time.
How can I correctly read the file?
Pickle serializes a single object at a time, and reads back a single object -
the pickled data is recorded in sequence in the file.
If you simply do pickle.load you should be reading the first object serialized into the file (not the last one as you've written).
After unserializing the first object, the file pointer is at the beginning
of the next object - if you simply call pickle.load again, it will read that next object - do that until the end of the file.
objects = []
with open("myfile", "rb") as openfile:
    while True:
        try:
            objects.append(pickle.load(openfile))
        except EOFError:
            break
There is a read_pickle function as part of pandas 0.22+
import pandas as pd
obj = pd.read_pickle(r'filepath')
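Note that pd.read_pickle returns a single unpickled object, so for a file that has had several pickles appended to it you would still need a loop like the one above to recover all of them.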
The following is an example of how you might write and read a pickle file. Note that if you keep appending pickle data to the file, you will need to continue reading from the file until you find what you want or an exception is generated by reaching the end of the file. That is what the last function does.
import os
import pickle

PICKLE_FILE = 'pickle.dat'


def main():
    # append data to the pickle file
    add_to_pickle(PICKLE_FILE, 123)
    add_to_pickle(PICKLE_FILE, 'Hello')
    add_to_pickle(PICKLE_FILE, None)
    add_to_pickle(PICKLE_FILE, b'World')
    add_to_pickle(PICKLE_FILE, 456.789)
    # load & show all stored objects
    for item in read_from_pickle(PICKLE_FILE):
        print(repr(item))
    os.remove(PICKLE_FILE)


def add_to_pickle(path, item):
    with open(path, 'ab') as file:
        pickle.dump(item, file, pickle.HIGHEST_PROTOCOL)


def read_from_pickle(path):
    with open(path, 'rb') as file:
        try:
            while True:
                yield pickle.load(file)
        except EOFError:
            pass


if __name__ == '__main__':
    main()
I developed a software tool that opens (most) Pickle files directly in your browser (nothing is transferred so it's 100% private):
It was formerly at https://pickleviewer.com/ and is now hosted here: https://fire-6dcaa-273213.web.app/
Edit: Available here if you want to host it somewhere: https://github.com/ch-hristov/Pickle-viewer
Feel free to host this somewhere.
What I need to do is to write some messages to a .txt file, close it, and send it to a server. This happens in an infinite loop, so the code should look more or less like this:
import time

import requests
from requests_toolbelt.multipart.encoder import MultipartEncoder

num = 0
while True:
    num += 1
    filename = f"example{num}.txt"
    with open(filename, "w") as f:
        f.write("Hello")
        f.close()
    mp_encoder = MultipartEncoder(
        fields={
            'file': ("file", open(filename, 'rb'), 'text/plain')
        }
    )
    r = requests.post("my_url/save_file", data=mp_encoder, headers=my_headers)
    time.sleep(10)
The post works if the file is created manually inside my working directory, but if I try to create it and write on it through code, I receive this response message:
500 - Internal Server Error
System.IO.IOException: Unexpected end of Stream, the content may have already been read by another component.
I don't see the file appearing in the project window of PyCharm... I even used time.sleep(10) because at first I thought it could be a time-related problem, but that didn't solve anything. In fact, the file appears in my working directory only when I stop the code, so it seems the file is held by the program even after I explicitly called f.close(). I know the with statement should take care of closing files, but it didn't look like that, so I added a close() to understand whether that was the problem (spoiler: it was not).
I solved the problem by using another file:

with open(filename, "r") as firstfile, open("new.txt", "a+") as secondfile:
    secondfile.write(firstfile.read())

with open(filename, 'w'):
    pass

r = requests.post("my_url/save_file", data=mp_encoder, headers=my_headers)
if r.status_code == requests.codes.ok:
    os.remove("new.txt")
else:
    print("File not saved")
I make a copy of the file, empty the original file to save space, and send the copy to the server (and then delete the copy). It looks like the problem was that the original file was being held open by the Python logging module.
Firstly, change open(f, 'rb') to open("example.txt", 'rb'). You should pass open a file name, not a closed file object.
Also, you can use os.path.abspath to find out where the file is being written:
import os
os.path.abspath('.')
Third point: when you use the with context manager to open a file, you don't close the file yourself. The context manager is supposed to do it.
with open("example.txt", "w") as f:
f.write("Hello")
I apologize for the vague definition of my problem in the title, but I really can't figure out what sort of problem I'm dealing with. So, here it goes.
I have a python file:
edit-json.py
import os, json

def add_rooms(data):
    if(not os.path.exists('rooms.json')):
        with open('rooms.json', 'w'): pass
    with open('rooms.json', 'r+') as f:
        d = f.read() # take existing data from file
        f.truncate(0) # empty the json file
        if(d == ''): rooms = [] # check if data is empty i.e. the file was just created
        else: rooms = json.loads(d)['rooms']
        rooms.append({'name': data['roomname'], 'active': 1})
        f.write(json.dumps({"rooms": rooms})) # write new data (rooms list) to the json file

add_rooms({'roomname': 'friends'})
This python script basically creates a file rooms.json (if it doesn't exist), grabs the data (array) from the json file, empties the json file, then finally writes the new data into the file. All this is done in the function add_rooms(), which is then called at the end of the script. Pretty simple stuff.
So, here's the problem: I run the file once, nothing weird happens, i.e. the file is created and the data inside it is:
{"rooms": [{"name": "friends"}]}
But the weird stuff happens when I run the script again.
What I should see:
{"rooms": [{"name": "friends"}, {"name": "friends"}]}
What I see instead:
I apologize I had to post the image because for some reason I couldn't copy the text I got.
and I obviously can't run the script again (for the third time) because the json parser gives an error due to those characters.
I obtained this result in an online compiler. On my local Windows system, I get extra whitespace instead of those extra symbols.
I can't figure out what causes it. Maybe I'm doing the file handling incorrectly? Or is it due to the json module? Or am I the only one getting this result?
When you truncate the file, the file pointer is still at the old end of the file, so your write lands at that offset and the gap before it gets filled with NUL bytes; those are the strange characters you are seeing. Use f.seek(0) to move back to the start of the file:
import os, json

def add_rooms(data):
    if(not os.path.exists('rooms.json')):
        with open('rooms.json', 'w'): pass
    with open('rooms.json', 'r+') as f:
        d = f.read() # take existing data from file
        f.truncate(0) # empty the json file
        f.seek(0) # <<<<<<<<< add this line
        if(d == ''): rooms = [] # check if data is empty i.e. the file was just created
        else: rooms = json.loads(d)['rooms']
        rooms.append({'name': data['roomname'], 'active': 1})
        f.write(json.dumps({"rooms": rooms})) # write new data (rooms list) to the json file

add_rooms({'roomname': 'friends'})
I have a group of .jsonl.gz files.
I can read them using the script:
import json
import gzip

with gzip.open(filepath, "r") as read_file: # file path ends with .jsonl.gz
    try:
        # read gzip file which contains a list of json files (json lines)
        # each json file is a dictionary of nested dictionaries
        json_list = list(read_file)
    except:
        print("fail to read the zip")
Then I do some processing and get some .json files and store them in a list.
for num, json_file in enumerate(json_list):
    try:
        j_file = json.loads(json_file)
        # (...some code...)
    except:
        print("fail")
My question is: what is the right way to write them into .jsonl.gz files again?
This is my attempt:
jsonfilename = 'valid_' + str(num) + '.jsonl.gz'
with gzip.open(jsonfilename, 'wb') as f:
    for dict in list_of_nested_dictionaries:
        content.append(json.dumps(dict).encode('utf-8'))
    f.write(content)
But I got this error:
TypeError: memoryview: a bytes-like object is required, not 'list'
Then I tried just to gzip the list of dictionaries as is:
jsonfilename = 'valid_' + str(num) + '.jsonl.gz'
with gzip.open(jsonfilename, 'wb') as f:
    f.write(json.dumps(list_of_nested_dictionaries).encode('utf-8'))
But the problem here is that it gzips the whole list as one block, so when I read it back I get a single element that is the whole stored list, not a list of json lines as I got from the first step.
This is the code that I use for reading:
with gzip.open('valid_3.jsonl.gz', "r") as read_file:
    try:
        json_list = list(read_file) # read zip file
        print(len(json_list)) # I got 1 here
    except:
        print("fail")

json_list[0].decode('utf-8')
f.write(content) takes a byte-string, but you're passing it a list of byte-strings.
f.writelines(content) will iterate over and write each byte-string from the list.
Edit: by the way, gzip is meant for compressing a single file. If you need to compress multiple files into one, I suggest packing them together in a tarball first and then gzipping that.
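A minimal sketch of that tarball suggestion using the standard tarfile module (the archive and member names here are just placeholders):

import tarfile

# 'w:gz' writes a gzip-compressed tar archive in a single step
with tarfile.open('bundle.tar.gz', 'w:gz') as tar:
    tar.add('valid_1.jsonl')
    tar.add('valid_2.jsonl')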
The solution is simply like this:

content = []  # the list must be initialized before the loop
with gzip.open(jsonfilename, 'wb') as f:
    for d in list_of_nested_dictionaries:  # d, rather than shadowing the builtin dict
        content.append((json.dumps(d) + '\n').encode('utf-8'))
    f.writelines(content)
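Because each record now ends with '\n', reading the file back with list(read_file) yields one JSON line per stored dictionary again, just like in the first step.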
Is there a method of creating a text file without opening it in "w" or "a" mode? For instance, if I want to open a file in "r" mode but the file does not exist, then when I catch the IOError I want a new file to be created.
e.g.:
while flag == True:
    try:
        # opening src in a+ mode will allow me to read and append to file
        with open("Class {0} data.txt".format(classNo), "r") as src:
            # list containing all data from file, one line is one item in list
            data = src.readlines()
            for ind, line in enumerate(data):
                if surname.lower() and firstName.lower() in line.lower():
                    # overwrite the relevant item in data with the updated score
                    data[ind] = "{0} {1}\n".format(line.rstrip(), score)
                    rewrite = True
                else:
                    with open("Class {0} data.txt".format(classNo), "a") as src:
                        src.write("{0},{1} : {2}{3} ".format(surname, firstName, score, "\n"))
        if rewrite == True:
            # reopen src in write mode and overwrite all the records with the items in data
            with open("Class {} data.txt".format(classNo), "w") as src:
                src.writelines(data)
        flag = False
    except IOError:
        print("New data file created")
        # Here I want a new file to be created and assigned to the variable src so when the
        # while loop iterates for the second time the file should successfully open
At the beginning just check if the file exists and create it if it doesn't:

import os

filename = "Class {0} data.txt".format(classNo)
if not os.path.isfile(filename):
    open(filename, 'w').close()

From this point on you can assume the file exists; this will greatly simplify your code.
No operating system will allow you to create a file without actually writing to it. You can encapsulate this in a library so that the creation is not visible, but it is impossible to avoid writing to the file system if you really want to modify the file system.
Here is a quick and dirty open replacement which does what you propose.
def open_for_reading_create_if_missing(filename):
    try:
        handle = open(filename, 'r')
    except IOError:
        with open(filename, 'w') as f:
            pass
        handle = open(filename, 'r')
    return handle
Better would be to create the file if it doesn't exist, e.g. something like:
import os

def ensure_file_exists(file_name):
    """ Make sure that a file with the given name exists """
    (the_dir, fname) = os.path.split(file_name)
    if the_dir and not os.path.exists(the_dir):
        os.makedirs(the_dir) # this may raise an exception if the directory cannot be made
    if not os.path.exists(file_name):
        open(file_name, 'w').close()
You could even have a safe_open function that did something similar prior to opening for read and returning the file handle.
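For instance, a minimal sketch of that idea (safe_open is a hypothetical name, not a standard function):

def safe_open(file_name, mode='r'):
    ensure_file_exists(file_name) # create an empty file first if necessary
    return open(file_name, mode)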
The sample code provided in the question is not very clear, especially because it invokes multiple variables that are not defined anywhere. But based on it, here is my suggestion. You can create a function similar to touch + file open, but which is platform agnostic.
def touch_open(filename):
    try:
        connect = open(filename, "r")
    except IOError:
        connect = open(filename, "a")
        connect.close()
        connect = open(filename, "r")
    return connect
This function will open the file for you if it exists. If the file doesn't exist, it will create a blank file with the same name and then open it. An additional bonus over import os; os.system('touch test.txt') is that it does not create a child process in the shell, making it faster.
Since it doesn't use the with open(filename) as src syntax, you should either remember to close the connection at the end with connection = touch_open(filename); connection.close(), or preferably iterate over it directly in a for loop. Example:
file2open = "test.txt"
for i, row in enumerate( touch_open( file2open)):
print i, row, # print the line number and content
This option should be preferred to data = src.readlines() followed by enumerate(data), found in your code, because it avoids looping through the file twice.