How to import a txt file into a pickle file? - python

I am new to Python and working a bit with pickle files.
I already have a pickle file called training.pickle and a txt file called danish.txt.
I would like to import danish.txt into the training.pickle file, but I don't know how to do it.
I have tried something, but I'm sure it's wrong :-)
import pickle
file1=open('danish.txt','r')
file2=open('training.pickle','r')
obj=[file1.read(), file2.read()]
outfile.write("obj,training.pickle")

I don't know much about pickle, but if you're just trying to add the data from "danish.txt" to the pickle file, you should be able to open the .txt, store the data in a variable, and then write that data into the pickle.
To demonstrate my thinking:
f = open("danish.txt", "r+")
data = f.read()
output = data
f.close() #this reads the .txt file
Afterward you'd write "output" into the pickle file via whatever method you use to write a string variable to a pickle file.
P.S. As I said, I don't know much about pickle, but if it works anything like writing to a .txt, you'd have to change the r to a w, because r means opening the file in read mode. If it's just reading, it can't write, or at least that's how it works with .txt files. Also, if there's no particular reason why you're using a pickle to store data, why not just use a .txt?
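For what it's worth, here is a minimal sketch of that idea: load whatever is already in training.pickle, append the text from danish.txt, and re-pickle the combined object. It assumes training.pickle holds a list (or something else you can append to), which may not match your actual data, so adjust accordingly:
import pickle

# Read the text to add
with open('danish.txt', 'r') as txt_file:
    danish_text = txt_file.read()

# Load the existing pickled object (pickle files should be opened in binary mode)
with open('training.pickle', 'rb') as pkl_file:
    training = pickle.load(pkl_file)

# Assuming the pickled object is a list, append the new text
training.append(danish_text)

# Write the updated object back
with open('training.pickle', 'wb') as pkl_file:
    pickle.dump(training, pkl_file)
Note that pickle files are opened in binary mode ('rb'/'wb'), unlike plain .txt files.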

Related

how to write csv to "variable" instead of file?

I'm not sure how to word my question exactly, and I have seen some similar questions asked but not exactly what I'm trying to do. If there already is a solution please direct me to it.
Here is what I'm trying to do:
At my work, we have a few pkgs we've built to handle various data types. One I am working with reads a csv file into a std_io object (std_io is our all-purpose object class that reads in any type of data file).
I am trying to connect this to another pkg I am writing, so I can make an object in the new pkg and convert it to a std_io object.
The problem is, the std_io object is meant to read an actual file, not take in an object. To get around this, I can basically write my data to a temp.csv file and then read it into a std_io object.
I am wondering if there is a way to eliminate this step of writing the temp.csv file.
Here is my code:
x #my object
df = x.to_df() #object class method to convert to a pandas dataframe
df.to_csv('temp.csv') #write data to a csv file
std_io_obj = std_read('temp.csv') #read csv file into a std_io object
Is there a way to basically pass what the output of writing the csv file would be directly into std_read? Does this make sense?
The only reason I want to do this is to avoid having to code additional functionality into either of the pkgs to directly accept an object as input.
Hope this was clear, and thanks to anyone who contributes.
For those interested, or who may have this same kind of issue/objective, here's what I did to solve this problem.
I basically just created a temporary named file, linked a .csv filename to this temp file, then passed it into my std_read function which requires a csv filename as an input.
This basically tricks the function into thinking it's taking the name of a real file as input, and it just opens it as usual and uses csv.reader to parse it.
This is the code:
import tempfile
import os
x  # my object I want to convert to a std_io object
text = x.to_df().to_csv()  # convert to a pandas dataframe, then generate the 'text' of a csv file
filename = 'temp.csv'
with tempfile.NamedTemporaryFile(dir=os.path.dirname('.')) as f:
    f.write(text.encode())
    os.link(f.name, filename)  # give the temp file a .csv name that outlives the temp file itself
stdio_obj = std_read(filename)
os.unlink(filename)
del f
FYI - the std_read function essentially just opens the file the usual way and passes it to csv.reader:
with open(filename, 'r') as f:
    rdr = csv.reader(f)
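As a side note, since csv.reader accepts any file-like object, the temp file could be avoided entirely if changing std_read (or adding a small helper next to it) were acceptable. A minimal sketch under that assumption, where std_read_from_buffer is a hypothetical helper, not part of the real pkg:
import csv
import io

def std_read_from_buffer(csv_text):
    # csv.reader works on any file-like object, so parse straight from memory
    buf = io.StringIO(csv_text)
    return [row for row in csv.reader(buf)]  # stand-in for building the std_io object

text = x.to_df().to_csv()  # x is the asker's object, as above
rows = std_read_from_buffer(text)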

How to parse Wikidata JSON (.bz2) file using Python?

I want to look at entities and relationships using Wikidata. I downloaded the Wikidata JSON dump (from here, a .bz2 file, size ~18 GB).
However, I cannot open the file; it's just too big for my computer.
Is there a way to look into the file without extracting the full .bz2 file, especially using Python? I know that there is a PHP dump reader (here), but I can't use it.
I came up with a strategy that allows using the json module to access the information without extracting the whole file:
import bz2
import json
with bz2.open(filename, "rt") as bzinput:
    lines = []
    for i, line in enumerate(bzinput):
        if i == 10: break
        tweets = json.loads(line)
        lines.append(tweets)
In this way lines will be a list of dictionaries that you can easily manipulate and, for example, reduce in size by removing keys that you don't need.
Note also that (obviously) the condition i == 10 can be changed arbitrarily to fit your needs. For example, you may parse some lines at a time, analyze them, and write to a txt file the indices of the lines you really want from the original file. Then it will be sufficient to read only those lines (using a similar condition on i in the for loop).
You can use the BZ2File interface to manipulate the compressed file, but you can NOT simply use the json module to load all of it; that would take too much space. You will have to index the file, meaning you read the file line by line and save the position and length of each interesting object in a dictionary (hashtable), and then you can extract a given object and load it with the json module.
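A minimal sketch of that indexing idea, assuming the dump has one entity per line (as the Wikidata dumps do) and that you want to look entities up by their "id" field; the file name and the "Q42" lookup are illustrative only:
import bz2
import json

dump = "latest-all.json.bz2"  # illustrative file name

# Pass 1: record the uncompressed byte offset and length of each entity line.
index = {}
offset = 0
with bz2.BZ2File(dump) as f:
    for raw in f:
        text = raw.decode().strip().rstrip(",")
        if text and text not in ("[", "]"):
            entity = json.loads(text)
            index[entity["id"]] = (offset, len(raw))
        offset += len(raw)

# Pass 2: pull a single entity back out. BZ2File.seek() re-decompresses up to
# the target offset, so this is slow but uses very little memory.
with bz2.BZ2File(dump) as f:
    start, length = index["Q42"]  # "Q42" is just an example id
    f.seek(start)
    record = json.loads(f.read(length).decode().strip().rstrip(","))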
You'd have to do line-by-line processing:
import bz2
import json
path = "latest.json.bz2"
with bz2.BZ2File(path) as file:
    for line in file:
        line = line.decode().strip()
        if line in {"[", "]"}:
            continue
        if line.endswith(","):
            line = line[:-1]
        entity = json.loads(line)
        # do your processing here
        print(str(entity)[:50] + "...")
Seeing as WikiData is now 70GB+, you might wish to process it directly from the URL:
import bz2
import json
from urllib.request import urlopen
path = "https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.bz2"
with urlopen(path) as stream:
    with bz2.BZ2File(stream) as file:
        ...

how can I reliably access a single key-value pair from a JSON file that's too large to load into memory?

I am trying to retrieve the names of the people from my file. The file size is 201GB
import json
with open("D:/dns.json", "r") as fh:
    for l in fh:
        d = json.loads(l)
        print(d["name"])
Whenever I try to run this program on Windows, I encounter a MemoryError saying there is insufficient memory.
Is there a reliable way to parse a single key-value pair without loading the whole file? I have reading the file in chunks in mind, but I don't know how to start.
Here is a sample: test.json
Every line is separated by a newline. Hope this helps.
You may want to give ijson a try: https://pypi.python.org/pypi/ijson
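For what it's worth, a minimal sketch of streaming just the name values with ijson, without loading the whole file; it assumes the file is one large JSON array of objects (if each line is instead a standalone document, ijson's multiple_values=True option covers that), and the "item.name" prefix is illustrative:
import ijson

# Stream only the "name" members, one at a time, instead of loading the file.
# "item.name" assumes a top-level array like [{"name": ...}, {"name": ...}];
# adjust the prefix to match the real structure of dns.json.
with open("D:/dns.json", "rb") as fh:
    for name in ijson.items(fh, "item.name"):
        print(name)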
Unfortunately there is no guarantee that each line of a JSON file will make any sense to the parser on its own. I'm afraid JSON was never intended for multi-gigabyte data exchange, precisely because each JSON file contains an integral data structure. In the XML world people have written incremental event-driven (SAX-based) parsers. I'm not aware of such a library for JSON.

Create hash table from the contents of a file

How can I open a text file, read the contents of the file and create a hash table from this content? So far I have tried:
import json
json_data = open(/home/azoi/Downloads/yes/1.txt).read()
data = json.loads(json_data)
pprint(data)
I suggest this solution:
import json
from pprint import pprint
with open("/home/azoi/Downloads/yes/1.txt") as f:
    data = json.load(f)
pprint(data)
The with statement ensures that your file is automatically closed whatever happens, and that your program throws the correct exception if the open fails. The json.load function directly loads data from an open file handle.
Additionally, I strongly suggest reading and understanding the Python tutorial. It's essential reading and won't take too long.
To open a file you have to use the open statement correctly, something like:
json_data = open('/home/azoi/Downloads/yes/1.txt', 'r')
where the first string is the path to the file and the second is the mode: r = read, w = write, a = append
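If the file turns out not to be JSON at all but plain text, a hash table can still be built by hand. A minimal sketch, assuming (purely for illustration) one key: value pair per line in 1.txt; adapt the split to whatever the file actually contains:
# Hypothetical sketch: build a dict (hash table) from "key: value" lines.
table = {}
with open('/home/azoi/Downloads/yes/1.txt') as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        key, _, value = line.partition(':')
        table[key.strip()] = value.strip()
print(table)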

Python: Converting Entire Directory of JSON to Python Dictionaries to send to MongoDB

I'm relatively new to Python, and extremely new to MongoDB (as such, I'll only be concerned with taking the text files and converting them). I'm currently trying to take a bunch of .txt files that contain JSON and move them into MongoDB. So my approach is to open each file in the directory, read each line, convert it from JSON to a dictionary, and then overwrite that line (which was JSON) with the dictionary. Then it'll be in a format to send to MongoDB.
(If there's any flaw in my reasoning, please point it out)
At the moment, I've written this:
"""
Kalil's step by step iteration / write.
JSON dumps takes a python object and serializes it to JSON.
Loads takes a JSON string and turns it into a python dictionary.
So we return json.loads so that we can take that JSON string from the tweet and save it as a dictionary for Pymongo
"""
import os
import json
import pymongo
rootdir='~/Tweets'
def convert(line):
line = file.readline()
d = json.loads(lines)
return d
for subdir, dirs, files in os.walk(rootdir):
for file in files:
f=open(file, 'r')
lines = f.readlines()
f.close()
f=open(file, 'w')
for line in lines:
newline = convert(line)
f.write(newline)
f.close()
But it isn't writing.
Which... as a rule of thumb, if you're not getting the effect you want, you're making a mistake somewhere.
Does anyone have any suggestions?
When you decode a json file you don't need to convert it line by line, as the parser will iterate over the file for you (that is, unless you have one json document per line).
Once you've loaded the json document you'll have a dictionary, which is a data structure and cannot be directly written back to a file without first serializing it into a certain format such as json, yaml or many others (the format mongodb uses is called bson, but your driver will handle the encoding for you).
The overall process to load a json file and dump it into mongo is actually pretty simple and looks something like this:
import json
from glob import glob
from pymongo import Connection

db = Connection().test
for filename in glob('~/Tweets/*.txt'):
    with open(filename) as fp:
        doc = json.load(fp)
        db.tweets.save(doc)
A dictionary in Python is an object that lives within the program; you can't save a dictionary directly to a file unless you pickle it (pickling is a way to save objects in files so you can retrieve them later). Now I think a better approach would be to read the lines from the file, load the json (which converts that json into a dictionary), and save that info into mongodb right away; there is no need to save that info back into a file.
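A minimal sketch of that direct line-by-line approach with a current pymongo (where Connection has been replaced by MongoClient and save by insert_one), assuming one JSON document per line in each .txt file; the directory and collection names are illustrative:
import json
import os
from glob import glob
from pymongo import MongoClient

client = MongoClient()  # defaults to localhost:27017
collection = client.test.tweets

# Assumes one JSON document per line in each file under ~/Tweets
for filename in glob(os.path.expanduser('~/Tweets/*.txt')):
    with open(filename) as fp:
        for line in fp:
            line = line.strip()
            if line:
                collection.insert_one(json.loads(line))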
