I am trying to convert a list of JSONs into a list of pipe-delimited strings. One problem is that some JSON tags can be missing from some lines, and the final strings have to include empty fields (""|""|"") in place of the missing tags. There is also no guarantee that every JSON will have the same tags in the same sequence, yet the number of pipe-delimited fields has to remain the same in every string. I switched from the json module to simplejson and used multiprocessing on a strong 32-core CPU, but the results are still poor. Using pyinstaller doesn't improve anything. I definitely need the community's help.
I tested different JSON parsers: simplejson, ujson, json, yajl, and nothing helped. Then I came across cjson, and it cut the time from 28 minutes to 1.5 minutes. I have no idea why, but it works like a rocket even though the other solutions are also implemented in C.
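For reference, a minimal sketch of the conversion itself, with made-up field names; the json.loads call is the part you would swap out for a faster parser (their APIs differ slightly):

import json  # a faster parser can replace this; only the loads/decode call changes

FIELDS = ['id', 'name', 'city', 'score']  # hypothetical: the full set of tags, in a fixed order

def to_pipe_delimited(json_lines):
    # Missing tags become empty fields, so every output string has the same width.
    rows = []
    for line in json_lines:
        record = json.loads(line)
        rows.append('|'.join(str(record.get(key, '')) for key in FIELDS))
    return rows

lines = ['{"id": 1, "name": "a", "score": 3}', '{"id": 2, "city": "x"}']
print(to_pipe_delimited(lines))  # ['1|a||3', '2||x|']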
I need to write my program's output to a file. However, some of the fields already contain spaces, commas, semicolons, and tabs, so I do not want to use any of those as the field separator. The data come from the web, so there is always the possibility of server admins using the wrong format.
I thought of using some made-up string, like my name, but that could look unprofessional if the output is ever used by other researchers.
Are there any recommendations on this matter? What should I use if I am afraid of using commas, semicolons, tabs, or spaces as a separator?
EDIT:
For those answers suggesting the json or csv module, please note that I want to load the file into a MySQL database, where I can just specify that fields are separated by [some separator]. I also need a simple solution.
Use commas (or tabs), but use a proper serializer that knows how to escape characters on write, and unescape them on read. The csv module knows how to do this and seems to match your likely requirements.
Yes, you could try to find some random character that never appears in your data, but that just means you'll die horribly if that character ever does appear, and it means producing output that no existing parser knows how to handle. CSV is a well-known format (if surprisingly complex, with varied dialects), and can likely be parsed by existing libraries in whatever language needs to consume it.
JSON (handled in Python by the json module) is often useful as well as a language-agnostic format, as is pickle (though it's only readable in Python), but from what you described, CSV is probably the go-to solution to start with.
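A minimal sketch of what that looks like with the csv module (sample values are made up):

import csv

rows = [
    ['plain field', 'has, a comma', 'has\ttabs and ; semicolons'],
    ['quotes "inside"', 'multi word value', 'last'],
]

with open('out.csv', 'w', newline='') as f:
    writer = csv.writer(f)   # default dialect: comma-separated, quoting only where needed
    writer.writerows(rows)

with open('out.csv', newline='') as f:
    assert list(csv.reader(f)) == rows   # round-trips to the original values

And since you mention MySQL: LOAD DATA INFILE can be told about this quoting (FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'), so a properly quoted CSV should still load cleanly.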
Generally, good separators can be any kind of normal, keyboard-typable symbol that isn't used anywhere else in the data. My suggestion would be either '|' or '/'.
CSV files typically use quotes around fields that contain field separator characters, and use a backslash or another quote to escape a literal quote.
CSV is not a well-defined format, however, and there are many variants implemented by different vendors. If you want a better-rounded text format that can store structured data, you should look into one of the better-defined serialization formats such as JSON or YAML instead.
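For example, a quick round trip through json (values made up) shows the same escaping guarantee:

import json

record = {'name': 'Smith, John', 'note': 'has\ttabs; and "quotes"'}
encoded = json.dumps(record)           # commas, tabs and quotes are escaped as needed
assert json.loads(encoded) == record   # round-trips to the original values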
This question already has answers here:
Is there a memory efficient and fast way to load big JSON files?
(11 answers)
Closed last month.
I have some large json encoded files. The smallest is 300MB; the rest are multiple GB, anywhere from around 2GB to 10GB+.
I seem to run out of memory when trying to load the files in Python.
I tried using this code to test performance:
from datetime import datetime
import json

print(datetime.now())

with open('file.json', 'r') as f:
    json.load(f)

print(datetime.now())
Not too surprisingly, this causes a MemoryError. It appears that json.load() calls json.loads(f.read()), which tries to read the entire file into memory first, and that clearly isn't going to work.
How can I solve this cleanly?
I know this is old, but I don't think this is a duplicate. While the answer is the same, the question is different. In the "duplicate", the question is how to read large files efficiently, whereas this question deals with files that won't even fit into memory at all. Efficiency isn't required.
The issue here is that JSON, as a format, is generally parsed in full and then handled in-memory, which for such a large amount of data is clearly problematic.
The solution to this is to work with the data as a stream - reading part of the file, working with it, and then repeating.
The best option appears to be something like ijson - a module that works with JSON as a stream rather than as one big block.
Edit: Also worth a look - kashif's comment about json-streamer and Henrik Heino's comment about bigjson.
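A minimal ijson sketch, assuming the file is a single top-level array of objects; process() is a stand-in for whatever you do with each record:

import ijson  # pip install ijson

# Stream items out of a top-level JSON array without loading the whole file.
# The 'item' prefix assumes the file looks like [ {...}, {...}, ... ];
# adjust the prefix for other layouts.
with open('file.json', 'rb') as f:
    for obj in ijson.items(f, 'item'):
        process(obj)  # placeholder for your own handling of one record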
I am making an API call that returns a JSON response. However, the response is huge and I don't need all of the information, so I parse only the required key:value pairs into a dictionary, which I then use to write a CSV file. Is this good practice? Should I instead create the CSV file directly from the JSON data?
Like all things performance-related, don't bother optimizing until it becomes a problem. What you're doing is the normal, simple approach, so keep doing it until you hit real bottlenecks. A "huge response" is relative: to some, "huge" might mean several kilobytes, while others might consider several megabytes, or hundreds of megabytes, to be huge.
If you ever do hit a bottleneck, the first thing to do is profile your code to see where the performance problems actually occur, and optimize only those parts. Don't guess; for all you know, the CSV writer could turn out to be the poor performer.
Remember, those JSON libraries have been around a long time, have strong test coverage and have been battle tested in the field by many developers. Any custom solution you try to create is going to have none of that.
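If it does come to that, a minimal profiling sketch (main() is a placeholder for whatever entry point your script has):

import cProfile
import pstats

# main() stands in for your own entry point.
cProfile.run('main()', 'stats.out')
pstats.Stats('stats.out').sort_stats('cumulative').print_stats(10)  # ten costliest call paths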
If you want to write only particular key:value pairs into the CSV file, it is better to convert the JSON into a Python dictionary with just the selected key:value pairs and write that to the CSV file.
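Something along these lines is usually all it takes; the key names and the sample response are made up:

import csv
import json

WANTED = ['id', 'name', 'status']  # hypothetical: the keys you actually need

response_text = '[{"id": 1, "name": "a", "status": "ok", "extra": "ignored"}]'  # stand-in for the API response
records = json.loads(response_text)

with open('out.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=WANTED, extrasaction='ignore')
    writer.writeheader()
    for record in records:
        writer.writerow(record)  # keys not in WANTED are silently dropped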
Say I have a lot of JSON lines to process and I only care about specific fields in each line.
{blablabla, 'whatICare': 1, blablabla}
{blablabla, 'whatICare': 2, blablabla}
....
Is there any way to extract whatICare from these JSON lines without loading them? Since the JSON lines are very long, it may be slow to build objects from them.
There is no reliable way without writing your own parsing code.
But check out ujson! It can be 10x faster than Python's built-in json library, which is a bit on the slow side.
No, you will have to load and parse the JSON before you know what’s inside and to be able to filter out the desired elements.
That being said, if you are worried about memory, you could use ijson, which is an iterative parser. Instead of loading all the content at once, it loads only what's necessary for the next iteration. So if your file contains an array of objects, you can load and parse one object at a time, reducing the memory impact (you only need to keep one object in memory, plus the data you actually care about). But it won't get any faster, and it also won't magically skip data you are not interested in.
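For the one-object-per-line layout in the question, a line-by-line parse already keeps memory flat. A minimal sketch, with a hypothetical file name and valid JSON assumed on each line:

import json  # ujson.loads would be a near drop-in replacement if speed matters

def extract_field(path, field='whatICare'):
    # Parse one JSON object per line, keeping only the field of interest,
    # so at most one parsed object is in memory at a time.
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line).get(field)

# Hypothetical usage; 'data.jsonl' stands in for your own file:
# for value in extract_field('data.jsonl'):
#     print(value)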
I'm trying to do something simple: remove all HTML tags from the HTML stored in the rows of a very large CSV file (3 GB). I tried using Beautiful Soup with the following code:
from bs4 import BeautifulSoup

remove_tags = ['p', 'li', 'ul', 'pre', 'h1']

soup = BeautifulSoup(row[1])  # row[1] holds the HTML for the current CSV row
for tag in remove_tags:
    for match in soup.findAll(tag):
        match.replaceWithChildren()  # keep the contents, drop the surrounding tag
However, with such a large file, I quickly run out of memory and a MemoryError occurs (I even have a lot of RAM on my machine, so this must use A LOT of memory). So I was wondering if anyone knows of a less memory-intensive way of doing this. Perhaps a regex could work by just removing everything between < and > (however, I have no idea how to use regex).
Note: I want to remove HTML tags of all kinds. The remove_tags list in the code above was only constructed because those were all the tags I could see in the data, so a method where tag names do not need to be specified would work too.
Using a (very) naive Regex approach:
import re

cleaned = re.sub(r'<[^>]+>', '', row)  # drop anything that looks like an HTML tag
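Applied row by row over the CSV, memory use stays constant. A sketch with placeholder file names; the column index mirrors row[1] from the question:

import csv
import re

TAG_RE = re.compile(r'<[^>]+>')

# Stream the large file one row at a time instead of holding it all in memory.
with open('input.csv', newline='') as src, open('output.csv', 'w', newline='') as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    for row in reader:
        row[1] = TAG_RE.sub('', row[1])  # strip anything that looks like a tag
        writer.writerow(row)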