Converting a very large JSON file to CSV - python

I have a JSON file that is about 8GB in size. When I try to convert the file using this script:
import csv
import json

infile = open("filename.json", "r")
outfile = open("data.csv", "w")
writer = csv.writer(outfile)
for row in json.loads(infile.read()):
    writer.write(row)
I get this error:
Traceback (most recent call last):
  File "E:/Thesis/DataDownload/PTDataDownload/demo.py", line 9, in <module>
    for row in json.loads(infile.read()):
MemoryError
I'm sure this has to do with the size of the file. Is there a way to ensure the file will convert to a CSV without the error?
This is a sample of my JSON code:
{"id":"tag:search.twitter.com,2005:905943958144118786","objectType":"activity","actor":{"objectType":"person","id":"id:twitter.com:899030045234167808","link":"http://www.twitter.com/NAJajsjs3","displayName":"NAJajsjs","postedTime":"2017-08-19T22:07:20.000Z","image":"https://pbs.twimg.com/profile_images/905943685493391360/2ZavxLrD_normal.jpg","summary":null,"links":[{"href":null,"rel":"me"}],"friendsCount":23,"followersCount":1,"listedCount":0,"statusesCount":283,"twitterTimeZone":null,"verified":false,"utcOffset":null,"preferredUsername":"NAJajsjs3","languages":["tr"],"favoritesCount":106},"verb":"post","postedTime":"2017-09-08T00:00:45.000Z","generator":{"displayName":"Twitter for iPhone","link":"http://twitter.com/download/iphone"},"provider":{"objectType":"service","displayName":"Twitter","link":"http://www.twitter.com"},"link":"http://twitter.com/NAJajsjs3/statuses/905943958144118786","body":"#thugIyfe Beyonce do better","object":{"objectType":"note","id":"object:search.twitter.com,2005:905943958144118786","summary":"#thugIyfe Beyonce do better","link":"http://twitter.com/NAJajsjs3/statuses/905943958144118786","postedTime":"2017-09-08T00:00:45.000Z"},"inReplyTo":{"link":"http://twitter.com/thugIyfe/statuses/905942854710775808"},"favoritesCount":0,"twitter_entities":{"hashtags":[],"user_mentions":[{"screen_name":"thugIyfe","name":"dari.","id":40542633,"id_str":"40542633","indices":[0,9]}],"symbols":[],"urls":[]},"twitter_filter_level":"low","twitter_lang":"en","display_text_range":[10,27],"retweetCount":0,"gnip":{"matching_rules":[{"tag":null,"id":6134817834619900217,"id_str":"6134817834619900217"}]}}
(sorry for the ugly formatting)
An alternative may be that I have about 8000 smaller json files that I combined to make this file. They are each within their own folder with just the single json in the folder. Would it be easier to convert each of these individually and then combine them into one csv?
The reason I am asking this is because I have very basic python knowledge and all the answers to similar questions that I have found are way more complicated than I can understand. Please help this new python user to read this json as a csv!

Would it be easier to convert each of these individually and then combine them into one csv?
Yes, it certainly would
For example, this will put each JSON object/array (whatever is loaded from the file) onto its own line of a single CSV.
import json, csv
from glob import glob

with open('out.csv', 'w') as f:
    for fname in glob("*.json"):  # reads all json files from the current directory
        with open(fname) as j:
            f.write(str(json.load(j)))
            f.write('\n')
Use the glob pattern **/*.json (i.e. glob("**/*.json", recursive=True)) to find all json files in nested folders.
It's not really clear what for row in ... was doing for your data, since you don't have an array. Unless you wanted each JSON key to be a CSV column?
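If the goal is one CSV column per JSON key, here is a minimal sketch of that idea (an assumption about what you want, with pandas as an extra dependency; json_normalize flattens the nested keys into dotted column names):
import json
from glob import glob

import pandas as pd

records = []
for fname in glob("*.json"):  # assumption: one JSON object per file
    with open(fname) as j:
        records.append(json.load(j))

# flatten nested keys into columns, e.g. actor.displayName, object.postedTime
df = pd.json_normalize(records, sep=".")
df.to_csv("out.csv", index=False)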

Yes, it absolutely can be done in a very easy way. I opened a 4GB json file in a few seconds. I didn't need to convert to csv myself, but it can be done just as easily.
Start MongoDB with Docker.
Create a temporary database in MongoDB, e.g. test.
Copy the json file into the Docker container.
Run the mongoimport command:
docker exec -it container_id mongoimport --db test --collection data --file /tmp/data.json --jsonArray
Run the mongoexport command to export to csv:
docker exec -it container_id mongoexport --db test --collection data --csv --out data.csv --fields id,objectType

Related

import csv into postgres with python script

I am trying to import a CSV file of IP addresses into Postgres via a python script. This is where I am at:
(screenshot: Python script)
Since this is for testing, this is what the test csv file looks like:
(screenshot: test CSV file)
Also, this is the error I am getting:
(screenshot: error)
I ran the same python script with a text file, same error.
Also, I tried manually uploading the same file via pgAdmin with no issue, so it's probably something I am missing in my code.
Also, I am able to connect to the DB as in the screenshot above, so it's not a connection issue.
Thanks in advance.
You do not actually open the file anywhere; you are trying to iterate over the file name.
You need to read the file lines and pass those lines into execute/executemany.
Sample code:
import csv
import psycopg2

conn = psycopg2.connect("dbname=test")  # adjust connection parameters for your DB
cur = conn.cursor()

with open("test.csv", "r") as my_file:
    reader = csv.reader(my_file)
    for line in reader:
        cur.execute("INSERT INTO x(y) VALUES (%s)", line)
conn.commit()
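If the file is large, a variant of the same idea (still a sketch, assuming a psycopg2 connection and the same single-column table x(y)) collects the rows and hands them to executemany in one call:
import csv
import psycopg2

conn = psycopg2.connect("dbname=test")  # hypothetical connection string, adjust for your DB
cur = conn.cursor()

with open("test.csv", "r") as my_file:
    rows = list(csv.reader(my_file))

# executemany runs the same INSERT once per row
cur.executemany("INSERT INTO x(y) VALUES (%s)", rows)
conn.commit()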

save JSON in chunks with PyMongo

I have a decent sized collection in MongoDB and I need to export the entire thing to JSON using PyMongo. Right now I'm just doing:
import json

results = db.collection_name.find()
with open('collection-data.json', 'w') as f:
    json.dump(list(results), f)
This ends up crashing the kernel because it eats up all my memory. Is there a way to save the collection in chunks so that I don't retrieve all of the data at one time?
Try this in your shell:
mongoexport --db <database-name> --collection <collection-name> --out output.json
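If you'd rather stay in PyMongo, a minimal sketch (assuming a local MongoDB and a database named test) iterates the cursor and writes one document per line, so the full result set never has to sit in memory:
from pymongo import MongoClient
from bson import json_util  # serializes ObjectId, datetimes and other BSON types

db = MongoClient().test  # assumption: local MongoDB, database named "test"

with open('collection-data.json', 'w') as f:
    for doc in db.collection_name.find():  # the cursor fetches documents in batches
        f.write(json_util.dumps(doc))
        f.write('\n')
Note that this produces JSON Lines (one document per line) rather than a single JSON array, which is also what mongoexport writes unless you pass --jsonArray.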

How to open a .data file extension

I am working on a side project where the data provided is in a .data file. How do I open a .data file to see what the data looks like, and how do I read from a .data file programmatically through python? I am on Mac OS X.
NOTE: The Data I am working with is for one of the KDD cup challenges
Kindly try using Notepad or Gedit to check the delimiters in the file (.data files are text files too). Once you have confirmed this, you can use the read_csv method from the Pandas library in python.
import pandas as pd
file_path = "~/AI/datasets/wine/wine.data"
# above .data file is comma delimited
wine_data = pd.read_csv(file_path, delimiter=",")
It vastly depends on what is in it. It could be a binary file or it could be a text file.
If it is a text file then you can open it in the same way you open any file (f=open(filename,"r"))
If it is a binary file you can just add a "b" to the open command (open(filename,"rb")). There is an example here:
Reading binary file in Python and looping over each byte
Depending on the type of data in there, you might want to try passing it through a csv reader (csv python module) or an xml parsing library (an example of which is lxml)
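As a rough sketch of how to check which case you have (the file name here is hypothetical), peek at the first bytes:
with open("file.data", "rb") as f:
    chunk = f.read(1024)

if b"\x00" in chunk:
    print("probably binary")  # NUL bytes almost never appear in plain text
else:
    print(chunk.decode("utf-8", errors="replace")[:200])  # preview as text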
After further info from above and looking at the page, the format is:
Data Format
The datasets use a format similar to the text export format of relational databases:
One header line with the variable names
One line per instance
Tab separators between the values
There are missing values (consecutive tabs)
Therefore see this answer:
parsing a tab-separated file in Python
I would advise processing one line at a time rather than loading the whole file, but if you have the RAM, why not...
I suspect it doesn't open in Sublime because the file is huge, but that is just a guess.
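As a sketch of that line-at-a-time approach (the file name is hypothetical), the csv module can read tab-separated rows directly:
import csv

with open("dataset.data", newline="") as f:
    reader = csv.reader(f, delimiter="\t")
    header = next(reader)  # first line holds the variable names
    for row in reader:
        # consecutive tabs show up as empty strings, i.e. missing values
        values = [v if v != "" else None for v in row]
        print(dict(zip(header, values)))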
To get a quick overview of what the file may contain, you could do this within a terminal, using strings or cat, for example:
$ strings file.data
or
$ cat -v file.data
In case you forget to pass the -v option to cat and it is a binary file, you could mess up your terminal and need to reset it:
$ reset
I was just dealing with this issue myself, so I thought I would share my answer. I have a .data file and was unable to open it by simply right clicking it. macOS recommended I open it using Xcode, so I tried that, but it did not work.
Next I tried opening it with a program named "Brackets". It is a text editor primarily used for HTML and CSS. Brackets did work.
I also tried PyCharm, as I am a Python programmer. PyCharm worked as well, and I was also able to read from the file using the following lines of code:
inf = open("processed-1.cleveland.data", "r")
lines = inf.readlines()
for line in lines:
    print(line, end="")
It works for me.
import pandas as pd
# define your file path here
your_data = pd.read_csv(file_path, sep=',')
your_data.head()
In other words, just treat it as a csv file if it is separated with ','.
solution from #mustious.

Batch convert json to csv python

Similar to this question batch process text to csv using python
I've got a batch of json files that need to be converted to csv so that they can be imported into Tableau.
The first step was to get json2csv ( https://github.com/evidens/json2csv ) working, which I did. I can successfully convert a single file via the command line.
Now I need an operation that goes through the files in a directory and converts each in a single batch operation using that json2csv script.
TIA
I actually created a jsontocsv python script to run myself. It basically reads the json file in chunks, and then goes through determining the rows and columns of the csv file.
Check out Opening A large JSON file in Python with no newlines for csv conversion Python 2.6.6 for the details of what was done and how it built the .csv from the json. The actual conversion would depend on your actual json format.
A json parse function with a chunk size of 0x800000 was used to read in the json data.
If the data becomes available at specific times, you can set this up using crontab.
I used
from optparse import OptionParser
to get the input and output files as arguments, as well as to set the various options required for the analysis and mapping.
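A minimal sketch of that kind of OptionParser setup (the option names and defaults here are assumptions, not the original script's):
from optparse import OptionParser

parser = OptionParser()
parser.add_option("-i", "--input", dest="input_file", help="json file to convert")
parser.add_option("-o", "--output", dest="output_file", help="csv file to write")
parser.add_option("-c", "--chunk-size", dest="chunk_size", type="int",
                  default=0x800000, help="bytes to read per chunk")
(options, args) = parser.parse_args()

print(options.input_file, options.output_file, options.chunk_size)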
You can also use a batch script in the given directory
for f in *.json; do
    mybase=`basename $f .json`
    json2csv $f -o ${mybase}.csv
done
Alternatively, use find with the -exec {} option.
If you want all the json files to go into a single .csv file you can use
json2csv *.json -o myfile.csv
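If you'd rather drive the batch from Python itself, a minimal sketch (assuming the json2csv command is on your PATH) calls it once per file via subprocess:
import subprocess
from glob import glob
from pathlib import Path

for json_path in glob("*.json"):
    csv_path = Path(json_path).with_suffix(".csv")
    # equivalent of: json2csv file.json -o file.csv
    subprocess.run(["json2csv", json_path, "-o", str(csv_path)], check=True)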

Python: Converting Entire Directory of JSON to Python Dictionaries to send to MongoDB

I'm relatively new to Python, and extremely new to MongoDB (as such, I'll only be concerned with taking the text files and converting them). I'm currently trying to take a bunch of .txt files that contain JSON and move them into MongoDB. So, my approach is to open each file in the directory, read each line, convert it from JSON to a dictionary, and then overwrite that JSON line with the dictionary. Then it'll be in a format to send to MongoDB.
(If there's any flaw in my reasoning, please point it out)
At the moment, I've written this:
"""
Kalil's step by step iteration / write.
JSON dumps takes a python object and serializes it to JSON.
Loads takes a JSON string and turns it into a python dictionary.
So we return json.loads so that we can take that JSON string from the tweet and save it as a dictionary for Pymongo
"""
import os
import json
import pymongo
rootdir='~/Tweets'
def convert(line):
line = file.readline()
d = json.loads(lines)
return d
for subdir, dirs, files in os.walk(rootdir):
for file in files:
f=open(file, 'r')
lines = f.readlines()
f.close()
f=open(file, 'w')
for line in lines:
newline = convert(line)
f.write(newline)
f.close()
But it isn't writing.
Which... As a rule of thumb, if you're not getting the effect that you're wanting, you're making a mistake somewhere.
Does anyone have any suggestions?
When you decode a json file you don't need to convert line by line as the parser will iterate over the file for you (that is unless you have one json document per line).
Once you've loaded the json document you'll have a dictionary, which is a data structure that cannot be written directly back to a file without first serializing it into some format such as json, yaml or many others (the format mongodb uses is called bson, but your driver will handle the encoding for you).
The overall process to load a json file and dump it into mongo is actually pretty simple and looks something like this:
import json
import os
from glob import glob
from pymongo import Connection  # in modern PyMongo this class is called MongoClient

db = Connection().test

# glob does not expand "~", so expand the home directory first
for filename in glob(os.path.expanduser('~/Tweets/*.txt')):
    with open(filename) as fp:
        doc = json.load(fp)
        db.tweets.save(doc)
A dictionary in python is an object that lives within the program; you can't save the dictionary directly to a file unless you pickle it (pickling is a way to save objects in files so you can retrieve them later). I think a better approach would be to read the lines from the file, load the json (which converts each line into a dictionary), and save that info into mongodb right away; there is no need to write it back to a file.
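As a minimal sketch of that approach (assuming one JSON document per line in each .txt file and a local MongoDB; MongoClient and insert_one are the modern PyMongo API rather than the Connection/save calls shown above):
import json
import os

from pymongo import MongoClient

db = MongoClient().test  # assumption: local MongoDB, database named "test"
rootdir = os.path.expanduser('~/Tweets')

for subdir, dirs, files in os.walk(rootdir):
    for name in files:
        if not name.endswith('.txt'):
            continue
        with open(os.path.join(subdir, name)) as f:
            for line in f:
                line = line.strip()
                if line:  # skip blank lines
                    db.tweets.insert_one(json.loads(line))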
