Convert CSV to mongoimport-friendly JSON using Python

I have a 300 MB CSV with 3 million rows of city information from Geonames.org. I am trying to convert this CSV into JSON to import into MongoDB with mongoimport. The reason I want JSON is that it allows me to specify the "loc" field as an array rather than a string, for use with the geospatial index. The CSV is encoded in UTF-8.
A snippet of my CSV looks like this:
"geonameid","name","asciiname","alternatenames","loc","feature_class","feature_code","country_code","cc2","admin1_code","admin2_code","admin3_code","admin4_code"
3,"Zamīn Sūkhteh","Zamin Sukhteh","Zamin Sukhteh,Zamīn Sūkhteh","[48.91667,32.48333]","P","PPL","IR",,"15",,,
5,"Yekāhī","Yekahi","Yekahi,Yekāhī","[48.9,32.5]","P","PPL","IR",,"15",,,
7,"Tarvīḩ ‘Adāī","Tarvih `Adai","Tarvih `Adai,Tarvīḩ ‘Adāī","[48.2,32.1]","P","PPL","IR",,"15",,,
The desired JSON output (except for the charset) that works with mongoimport is below:
{"geonameid":3,"name":"Zamin Sukhteh","asciiname":"Zamin Sukhteh","alternatenames":"Zamin Sukhteh,Zamin Sukhteh","loc":[48.91667,32.48333] ,"feature_class":"P","feature_code":"PPL","country_code":"IR","cc2":null,"admin1_code":15,"admin2_code":null,"admin3_code":null,"admin4_code":null}
{"geonameid":5,"name":"Yekahi","asciiname":"Yekahi","alternatenames":"Yekahi,Yekahi","loc":[48.9,32.5] ,"feature_class":"P","feature_code":"PPL","country_code":"IR","cc2":null,"admin1_code":15,"admin2_code":null,"admin3_code":null,"admin4_code":null}
{"geonameid":7,"name":"Tarvi? ‘Adai","asciiname":"Tarvih `Adai","alternatenames":"Tarvih `Adai,Tarvi? ‘Adai","loc":[48.2,32.1] ,"feature_class":"P","feature_code":"PPL","country_code":"IR","cc2":null,"admin1_code":15,"admin2_code":null,"admin3_code":null,"admin4_code":null}
I have tried all the available online CSV-to-JSON converters and they do not work because of the file size. The closest I got was with Mr Data Converter, which would import into MongoDB after removing the start and end brackets and the commas between documents. Unfortunately that tool doesn't work with a 300 MB file.
The JSON above is set to be encoded in UTF-8 but still has charset problems, most likely due to a conversion error somewhere.
I spent the last three days learning Python, trying to use Python CSVKIT, trying all the CSV-to-JSON scripts on Stack Overflow, importing the CSV into MongoDB and changing the "loc" string to an array (unfortunately this retains the quotation marks), and even trying to manually copy and paste 30,000 records at a time. A lot of reverse engineering, trial and error, and so forth.
Does anyone have a clue how to achieve the JSON above while keeping the encoding proper like in the CSV above? I am at a complete standstill.

The Python standard library (plus simplejson for its decimal encoding support) has all you need:
import csv, simplejson, decimal, codecs

data = open("in.csv")
reader = csv.DictReader(data, delimiter=",", quotechar='"')

with codecs.open("out.json", "w", encoding="utf-8") as out:
    for r in reader:
        for k, v in r.items():
            # make sure nulls are generated
            if not v:
                r[k] = None
            # parse and generate decimal arrays
            elif k == "loc":
                r[k] = [decimal.Decimal(n) for n in v.strip("[]").split(",")]
            # generate a number
            elif k == "geonameid":
                r[k] = int(v)
        out.write(simplejson.dumps(r, ensure_ascii=False, use_decimal=True) + "\n")
Where "in.csv" contains your big csv file. The above code is tested as working on Python 2.6 & 2.7, with about 100MB csv file, producing a properly encoded UTF-8 file. Without surrounding brackets, array quoting nor comma delimiters, as requested.
It is also worth noting that passing both the ensure_ascii and use_decimal parameters is required for the encoding to work properly (in this case).
Finally, since the Python stdlib json package is based on simplejson, it will also gain decimal encoding support sooner or later, so eventually only the stdlib will be needed.
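For what it's worth, here is a rough Python 3 equivalent using only the standard library. This is my own sketch, not part of the tested answer above, and it trades the exact decimal values for floats, which may round the coordinates very slightly:

import csv, json

with open("in.csv", newline="", encoding="utf-8") as data, \
     open("out.json", "w", encoding="utf-8") as out:
    for r in csv.DictReader(data, delimiter=",", quotechar='"'):
        for k, v in r.items():
            # make sure nulls are generated
            if not v:
                r[k] = None
            # parse the loc string into a two-element array of floats
            elif k == "loc":
                r[k] = [float(n) for n in v.strip("[]").split(",")]
            # generate a number
            elif k == "geonameid":
                r[k] = int(v)
        out.write(json.dumps(r, ensure_ascii=False) + "\n")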

Maybe you could try importing the CSV directly into MongoDB using:
mongoimport -d <dB> -c <collection> --type csv --file location.csv --headerline

Related

How to read a newline delimited JSON file from R?

I have a newline-delimited JSON file (i.e., each JSON object is confined to one line in the file):
{"name": "json1"}
{"name": "json2"}
{"name": "json3"}
In Python I can easily read it as follows (I must use encoding='cp850' to read my real data):
import json

objs = []
with open("testfile.json", encoding='cp850') as f:
    for line in f:
        objs.append(json.loads(line))
How can I do a similar trick in R?
At the end I want to get a data.frame:
library("jsonlite")
library("data.table")
d <- fromJSON("testfile.json", flatten=FALSE)
df <- as.data.frame(d)
We can use stream_in from jsonlite
library(jsonlite)
out <- stream_in(file('testfile.json'))
out
# name
#1 json1
#2 json2
#3 json3
str(out)
#'data.frame': 3 obs. of 1 variable:
#$ name: chr "json1" "json2" "json3"
You can read & process the data to a proper format and then parse the JSON
jsonlite::fromJSON(sprintf('[%s]', paste(readLines('text.json', warn = FALSE),
                                         collapse = ',')))
# name
# 1 json1
# 2 json2
# 3 json3
(You can use one of the many alternative JSON packages, e.g.:
jsonlite, a more R-like package, as it mainly works with data frames;
RJSONIO, a more Python-ic package, working mainly with lists;
or yet another one.)
The fastest way to read in newline-delimited JSON is probably stream_in from ndjson. It uses underlying C++ libraries (although I think it is still single-threaded). Still, I find it much faster than the (still very good) jsonlite library. As a bonus, the JSON is flattened by default.
library(ndjson)
out<- ndjson::stream_in(path='./testfile.json')

Malformed CSV quoting

I pass data from SAS to Python using the CSV format. I have a problem with the quoting format SAS uses. Strings like "480 КЖИ" ОАО aren't quoted, but the Python csv module thinks they are.
import csv

dat = ['18CA4,"480 КЖИ" ОАО', '1142F,"""Росдорлизинг"" Российская дор,лизинг,компания"" ОАО"']
for i in csv.reader(dat):
    print(i)

>> ['18CA4', '480 КЖИ ОАО']
>> ['1142F', '"Росдорлизинг" Российская дор,лизинг,компания" ОАО']
The second string is fine, but I need the 480 КЖИ ОАО string to come out as "480 КЖИ" ОАО. I can't find such an option in the csv module. Maybe it's possible to force proc export to quote all " characters?
UPD: Here's a similar problem: Python CSV : field containing quotation mark at the beginning
UPD2: @Quentin asked for details. Here they are: I have SAS 8.2 connected to a 9.1 server. I download custom format data from the server side with proc format cntlout=..; proc download..., so I get a dictionary-like dataset <key>, <value>. Then I pass this dataset in CSV format using proc export via the DDE interface to Python. But proc export quotes only strings which include the delimiter (comma), as I understand it. So I think I need SAS to quote quotation marks too, or Python to unquote only those strings which include commas.
UPDATE: switching from proc export via DDE to reading the dataset directly with a modified SAS7BDAT Python module hugely improved performance, and I got rid of the problem above.
SAS will add extra quotes if the value has quotes in it already.
data _null_;
  file log dsd;
  string='"480 КЖИ" ОАО';
  put string;
run;
Generates this result:
"""480 КЖИ"" ОАО"
Perhaps the quotes are being removed at some other point in the flow from SAS to Python? Try saving the CSV file to a disk and having Python read from the disk file.
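If the doubled quotes do survive the trip, the default csv dialect turns them back into the original value; a minimal check of my own (Python 3, built from the dsd output shown above, not from the original answer):

import csv

# The line produced by the SAS dsd option above: the field is wrapped in
# quotes and the embedded quotes are doubled.
sas_line = '18CA4,"""480 КЖИ"" ОАО"'
row = next(csv.reader([sas_line]))
print(row)  # ['18CA4', '"480 КЖИ" ОАО']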

python limit must be integer

I'm trying to run the following code but for some reason I get the following error: "TypeError: limit must be an integer".
Reading the CSV data file:

import sys
import csv

maxInt = sys.maxsize
decrement = True

while decrement:
    decrement = False
    try:
        csv.field_size_limit(maxInt)
    except OverflowError:
        maxInt = int(maxInt/10)
        decrement = True

with open("Data.csv", 'rb') as textfile:
    text = csv.reader(textfile, delimiter=" ", quotechar='|')
    for line in text:
        print ' '.join(line)
The error occurs at the csv.field_size_limit(maxInt) line. I have only added the extra bit above the csv read statement because the file was too large to read normally. Alternatively, I could change the file from CSV to a plain text file, but I'm not sure whether this would corrupt the data further; I can't actually see any of the data, as the file is >2 GB and hence costly to open.
Any ideas? I'm fairly new to Python but I'd really like to learn a lot more.
I'm not sure whether this qualifies as an answer or not, but here are a few things:
First, the csv reader automatically buffers per line of the CSV, so the file size shouldn't matter too much, 2KB or 2GB, whatever.
What might matter is the number of columns or amount of data inside the fields themselves. If this CSV contains War and Peace in each column, then yeah, you're going to have an issue reading it.
Some ways to potentially debug are to run print sys.maxsize, and to just open up a python interpreter, import sys, csv and then run csv.field_size_limit(sys.maxsize). If you are getting some terribly small number or an exception, you may have a bad install of Python. Otherwise, try to take a simpler version of your file. Maybe the first line, or the first several lines and just 1 column. See if you can reproduce the smallest possible case and remove the variability of your system and the file size.
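For example, that interpreter check might look like this (Python 2 syntax, to match the question; this is just a sketch of the steps described above):

# Quick sanity check in a fresh interpreter, as suggested above.
import sys, csv

print sys.maxsize                  # how big is the platform's limit?
csv.field_size_limit(sys.maxsize)  # does setting it raise an exception?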
On Windows 7 64-bit with Python 2.6, maxInt = sys.maxsize returns 9223372036854775807L, which consequently results in a TypeError: limit must be an integer when calling csv.field_size_limit(maxInt). Interestingly, using maxInt = int(sys.maxsize) does not change this. A crude workaround is to simply use csv.field_size_limit(2147483647), which of course causes issues on other platforms. In my case this was adequate to identify the broken value in the CSV, fix the export options in the other application and remove the need for csv.field_size_limit().
-- originally posted by user roskakori on this related question
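Putting the two observations together, one possible sketch of my own (untested on every platform; catching TypeError alongside OverflowError is an assumption based on the behaviour described above) is to keep shrinking the limit until the call is accepted:

import csv
import sys

max_int = sys.maxsize
while True:
    try:
        csv.field_size_limit(max_int)
        break
    except (OverflowError, TypeError):
        # Some builds reject values wider than a C long with TypeError
        # rather than OverflowError, so shrink until the call succeeds.
        max_int = max_int // 10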

Python codecs module

I am trying to load a file saved as UTF-8 into Python (ver. 2.6.6); the file contains 14 different languages. I am using the Python codecs module to decode the txt file.
import codecs

f = open('C:/temp/list_test.txt', 'r')
for lines in f:
    line = filter_str(lines.decode("utf-8"))
This all works well. I parse the entire file and then want to export 14 different language files. The problem I can't understand is the following.
I use the following code for output:
malangout = codecs.open("C:/temp/'polish.txt", 'w', 'utf-8', 'surrogateescape')
for item in lang_dic['English']:
    temp = lang_dic[lang1][item]
    malangout.write(temp + '\n')
malangout.close()
Example:
Language: Polish
Expected output: Dziennik zakłóceń
Actual output: Dziennik zak‚óceƒ
The string is stored as is:
u'Dziennik zak\u201a\xf3ce\u0192'
I have tried many encodings from the Python docs (7.8 codecs). Any information would help at this point.
The string is stored as is:
u'Dziennik zak\u201a\xf3ce\u0192'
Well, that's a problem since
In [25]: print(u'Dziennik zak\u201a\xf3ce\u0192')
Dziennik zak‚óceƒ
in contrast to
In [26]: print(u'Dziennik zak\u0142\xf3ce\u0144')
Dziennik zakłóceń
So it looks like the unicode you are storing is incorrect. Are you sure it is correct in C:/temp/list_test.txt? That is, does list_test.txt contain
In [28]: u'Dziennik zak\u201a\xf3ce\u0192'.encode('utf-8')
Out[28]: 'Dziennik zak\xe2\x80\x9a\xc3\xb3ce\xc6\x92'
or
In [27]: u'Dziennik zak\u0142\xf3ce\u0144'.encode('utf-8')
Out[27]: 'Dziennik zak\xc5\x82\xc3\xb3ce\xc5\x84'
?
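One way to answer that is to look at the raw bytes before any decoding; a small sketch of my own (Python 2, assuming the Polish entry sits on its own line of list_test.txt):

# Print the undecoded bytes of the relevant line so they can be compared
# with the two candidate UTF-8 encodings shown above.
with open('C:/temp/list_test.txt', 'rb') as f:
    for raw in f:
        if 'Dziennik' in raw:
            print repr(raw)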
PS. You may want to change
temp + '\n'
to
temp + u'\n'
to make it clear you are adding two unicode together to form a unicode.
The two lines above have the same result in Python 2, but in Python 3 adding unicode and bytes together raises a TypeError. Even though in Python 3 '\n' is already unicode, I think the challenge in transitioning to Python 3 will be in changing one's mental attitude toward mixing unicode and str: in Python 2 the coercion is silently attempted for you, in Python 3 it is disallowed.
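A quick illustration of that difference, as a sketch of my own rather than part of the answer:

import sys

text = u'Dziennik zak\u0142\xf3ce\u0144'
line = text + u'\n'          # unicode + unicode: fine on Python 2 and 3

if sys.version_info[0] >= 3:
    try:
        text + b'\n'         # text + bytes: Python 3 refuses to coerce
    except TypeError as exc:
        print('mixing refused:', exc)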

How can I work with Gzip files which contain extra data?

I'm writing a script which will work with data coming from instrumentation as gzip streams. In about 90% of cases, the gzip module works perfectly, but some of the streams cause it to produce IOError: Not a gzipped file. If the gzip header is removed and the deflate stream fed directly to zlib, I instead get Error -3 while decompressing data: incorrect header check. After about half a day of banging my head against the wall, I discovered that the streams which are having problems contain a seemingly-random number of extra bytes (which are not part of the gzip data) appended to the end.
It strikes me as odd that Python cannot work with these files for two reasons:
Both Gzip and 7zip are able to open these "padded" files without issue. (Gzip produces the message decompression OK, trailing garbage ignored, 7zip succeeds silently.)
Both the Gzip and Python docs seem to indicate that this should work: (emphasis mine)
Gzip's format.txt:
It must be possible to detect the end of the compressed data with any compression method, regardless of the actual size of the compressed data. In particular, the decompressor must be able to detect and skip extra data appended to a valid compressed file on a record-oriented file system, or when the compressed data can only be read from a device in multiples of a certain block size.
Python's gzip.GzipFile:
Calling a GzipFile object’s close() method does not close fileobj, since you might wish to append more material after the compressed data. This also allows you to pass a StringIO object opened for writing as fileobj, and retrieve the resulting memory buffer using the StringIO object’s getvalue() method.
Python's zlib.Decompress.unused_data:
A string which contains any bytes past the end of the compressed data. That is, this remains "" until the last byte that contains compression data is available. If the whole string turned out to contain compressed data, this is "", the empty string.
The only way to determine where a string of compressed data ends is by actually decompressing it. This means that when compressed data is contained part of a larger file, you can only find the end of it by reading data and feeding it followed by some non-empty string into a decompression object’s decompress() method until the unused_data attribute is no longer the empty string.
Here are the four approaches I've tried. (These examples are Python 3.1, but I've tested 2.5 and 2.7 and had the same problem.)
# approach 1 - gzip.open
with gzip.open(filename) as datafile:
    data = datafile.read()

# approach 2 - gzip.GzipFile
with open(filename, "rb") as gzipfile:
    with gzip.GzipFile(fileobj=gzipfile) as datafile:
        data = datafile.read()

# approach 3 - zlib.decompress
with open(filename, "rb") as gzipfile:
    data = zlib.decompress(gzipfile.read()[10:])

# approach 4 - zlib.decompressobj
with open(filename, "rb") as gzipfile:
    decompressor = zlib.decompressobj()
    data = decompressor.decompress(gzipfile.read()[10:])
Am I doing something wrong?
UPDATE
Okay, while the problem with gzip seems to be a bug in the module, my zlib problems are self-inflicted. ;-)
While digging into gzip.py I realized what I was doing wrong: by default, zlib.decompress et al. expect zlib-wrapped streams, not bare deflate streams. By passing in a negative value for wbits, you can tell zlib to skip the zlib header and decompress the raw stream. Both of these work:
# approach 5 - zlib.decompress with negative wbits
with open(filename, "rb") as gzipfile:
    data = zlib.decompress(gzipfile.read()[10:], -zlib.MAX_WBITS)

# approach 6 - zlib.decompressobj with negative wbits
with open(filename, "rb") as gzipfile:
    decompressor = zlib.decompressobj(-zlib.MAX_WBITS)
    data = decompressor.decompress(gzipfile.read()[10:])
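Building on that, a further sketch of my own (not from the question or the answers below, and assuming a reasonably recent zlib): passing 16 + MAX_WBITS makes zlib parse the gzip wrapper itself, and the unused_data attribute quoted above then exposes the trailing garbage so it can simply be ignored:

# approach 7 (sketch) - let zlib handle the gzip header itself and expose
# any trailing bytes via unused_data
import zlib

with open(filename, "rb") as gzipfile:
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    data = decompressor.decompress(gzipfile.read())
    trailing_garbage = decompressor.unused_data  # safe to ignore here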
This is a bug. The quality of the gzip module in Python falls far short of the quality that should be required in the Python standard library.
The problem here is that the gzip module assumes that the file is a stream of gzip-format files. At the end of the compressed data, it starts from scratch, expecting a new gzip header; if it doesn't find one, it raises an exception. This is wrong.
Of course, it is valid to concatenate two gzip files, e.g.:
echo testing > test.txt
gzip test.txt
cat test.txt.gz test.txt.gz > test2.txt.gz
zcat test2.txt.gz
# testing
# testing
The gzip module's error is that it should not raise an exception if there's no gzip header the second time around; it should simply end the file. It should only raise an exception if there's no header the first time.
There's no clean workaround without modifying the gzip module directly; if you want to do that, look at the bottom of the _read method. It should set another flag, e.g. reading_second_block, to tell _read_gzip_header to raise EOFError instead of IOError.
There are other bugs in this module. For example, it seeks unnecessarily, causing it to fail on nonseekable streams, such as network sockets. This gives me very little confidence in this module: a developer who doesn't know that gzip needs to function without seeking is badly unqualified to implement it for the Python standard library.
I had a similar problem in the past. I wrote a new module that works better with streams. You can try that out and see if it works for you.
I had exactly this problem, but none of these answers resolved my issue. So here is what I did to solve the problem:
# for gzip files
unzipped = zlib.decompress(gzip_data, zlib.MAX_WBITS | 16)
# for zlib files
unzipped = zlib.decompress(gzip_data, zlib.MAX_WBITS)
# automatic header detection (zlib or gzip):
unzipped = zlib.decompress(gzip_data, zlib.MAX_WBITS | 32)
Depending on your case, it might be necessary to decode your data, like:
unzipped = unzipped.decode()
https://docs.python.org/3/library/zlib.html
I couldn't make it work with the above-mentioned techniques, so I made a workaround using the zipfile package:
import zipfile
from io import BytesIO

mock_file = BytesIO(data)  # data is the compressed string
z = zipfile.ZipFile(file=mock_file)
neat_data = z.read(z.namelist()[0])
Works perfectly.
