JSONDecodeError reading a large JSON file - python

I would like to read a large JSON file that I previously created through some web scraping. However, when I try to read the file, I get the following error message:
JSONDecodeError: Expecting ',' delimiter: line 1364567 column 2 (char 1083603504)
However, 1364567 is the very last line, and it looks correct there. I therefore suspect the actual error is somewhere earlier in the file, for example brackets that are opened but never closed. How would you suggest I track down the problem and fix it? I can also provide a link to the file, but it is quite large (1.05 GB).
I use the following code to read the JSON file:
import json
with open("file.json") as f:
    data = json.load(f)
Thank you very much!
Edit: The problem was solved as follows. The end of the JSON file looked normal, i.e. one more line of fields and information followed by a closing bracket ]. json.load complained about a missing comma, i.e. it did not recognize that the last bracket marked the end of the file. Therefore there had to be opening brackets [ earlier in the file that were never closed. Luckily those were caused by some hiccups with the scraping near the beginning of the file, so a manual search in Sublime Text let me simply delete the stray opening brackets and read the file without problems. Anyway, thank you very much for your suggestions; I am sure I will use them the next time I have a problem with JSON!
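For future readers, one way to track down an error like this is to catch the exception and print the text around the reported position. A minimal sketch (the helper name load_with_context is made up for illustration):

```python
import json

def load_with_context(path, window=80):
    # Load a JSON file; on failure, print the text surrounding the
    # reported error offset so the offending bracket is visible.
    with open(path, encoding="utf-8") as f:
        text = f.read()
    try:
        return json.loads(text)
    except json.JSONDecodeError as e:
        start = max(e.pos - window, 0)
        print("Parse error at char %d: %s" % (e.pos, e.msg))
        print("Context: %r" % text[start:e.pos + window])
        raise
```

The JSONDecodeError's pos attribute is the character offset from the error message, so the printed context shows exactly what the parser choked on.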

You can use any powerful IDE such as PyCharm, Atom, or Sublime Text; they each have plugins for JSON formatting.
You can also validate JSON using online tools, but a file this large would be heavy for them to process.
Hope this information helps.

You can use this to check your JSON format before running your code, just to make sure where the problem is and fix it:
https://jsonformatter.curiousconcept.com/

Since the error points at the last line, and you are using Python, one good option is to actually read the last line(s) and print them to see where the problem is.
There is a module that does this efficiently: file_read_backwards.
For details see this SO answer: https://stackoverflow.com/a/41418158/50003
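If pulling in a third-party package is not an option, the same idea can be sketched with just the standard library by seeking backwards from the end of the file (tail is a made-up name, and the block size is arbitrary):

```python
def tail(path, n=5, block=4096):
    # Read blocks from the end of the file until at least n newlines
    # have been collected, then return the last n lines.
    with open(path, "rb") as f:
        f.seek(0, 2)          # jump to the end of the file
        size = f.tell()
        data = b""
        while size > 0 and data.count(b"\n") <= n:
            step = min(block, size)
            size -= step
            f.seek(size)
            data = f.read(step) + data
    return data.decode("utf-8", "replace").splitlines()[-n:]
```

This touches only the last few kilobytes of a 1 GB file, so it returns instantly.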

Related

How to open a text file which has more than 500k lines, without using any iteration?

I am working with text files and looping over them. Python works well with files of 10k to 20k lines, and most of mine are about that length, but a few text files are over 100k lines, where the code just stops or keeps buffering. How can I improve the speed, or open the text file directly? Even if some iteration is necessary, it should be quick. I also want the file contents as a single string, so no readlines.
I'm confused as to your "without iteration" parameter. You're already looping over the files and using a simple open or some other method, so what is it that you're wanting to change? As you didn't post your code at all there's nothing to work from to understand what might be happening or to suggest changes to.
Also, it sounds like you're hitting system limits, not a limit of python itself. For a question like this it would be worthwhile to give your system parameters alongside the code so that someone can get a full picture when responding.
Typically I just do something similar to the good ol' standby that won't destroy your memory:
fhand = open('file.extension')
for line in fhand:
    # do the thing you need to do with each line
You can see a more detailed explanation here or in Dr Chuck's handy free textbook under the "Files" section.
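If you really do want the contents as one string rather than a line-by-line loop, a single read() call does it; this trades memory for simplicity, and is fine as long as the file fits comfortably in RAM (read_all is just an illustrative wrapper):

```python
def read_all(path):
    # One call reads the entire file into a single string.
    # A 100k-line text file is typically only a few MB, so this
    # should be fast; if it isn't, the bottleneck is elsewhere.
    with open(path, encoding="utf-8") as f:
        return f.read()
```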

Python 3+ How to edit a line in a text file

This has been asked and answered before, but the existing answers do not fit my situation or what I'm trying to achieve.
I have hundreds of text files I need to update, and I need to change the same single line in each file. I want to open a text file and modify only that one line. I do not want to rewrite the whole file using write() or writelines(). I also do not want to use fileinput.input(), because that too rewrites the file via print statements.
The files contain thousands of lines of critical data, and I cannot trust that Python will recreate everything in them correctly (I understand this will probably never happen). The lines I am updating belong to footer data, which is non-critical.
How can one update a single line in a text file without recreating the whole file? Updating one line in place must be possible.
Thanks in advance
I don't think this is possible in any programming language at the file level. Files simply don't work that way, especially text files where you are likely replacing data in the middle with data of a different length. You would need raw disk-level access (making this a stupidly difficult problem).
If you really want to pursue it, check the raw-disk question here:
Is it possible to get writing access to raw devices using python with windows?
EVEN THEN: I'm pretty sure that at some level at least an entire block of data will be read from the drive and rewritten (this is physically how drives work, if I recall correctly).
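One narrow exception is worth noting: if the replacement line has exactly the same byte length as the original, you can overwrite it in place with mode 'r+b' and seek(), without rewriting the rest of the file. A sketch under that same-length assumption (patch_line is a made-up helper):

```python
def patch_line(path, lineno, new_text):
    # Overwrite line `lineno` (0-based) in place. This only works when
    # the replacement has exactly the same byte length as the original;
    # a longer or shorter line would corrupt the neighbouring data.
    with open(path, "rb") as f:
        lines = f.readlines()
    old = lines[lineno].rstrip(b"\r\n")
    new = new_text.encode("utf-8")
    if len(new) != len(old):
        raise ValueError("replacement must have the same byte length")
    offset = sum(len(line) for line in lines[:lineno])
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(new)
```

For variable-length footer data this won't help, but fixed-width footers (timestamps, counters) can be patched this way without touching the critical lines above them.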

New line with invisible character

I'm sure this has been answered before but after attempting to search for others who had the problem I didn't have much luck.
I am using csv.reader to parse a CSV file. The file is in the correct format, but on one of the lines of the CSV file I get the notification "list index out of range" indicating that the formatting is wrong. When I look at the line, I don't see anything wrong. However, when I go back to the website where I got the text, I see a square/rectangle symbol where there is a space. This symbol must be leading csv.reader to treat that as a new line symbol.
A few questions: 1) What is this symbol and why can't I see it in my text files? 2) How do I avoid having these treated as new lines? I wonder if the best way is to find and replace them given that I will be processing the file multiple times in different ways.
Here is the symbol:
Update: When I copy and paste the symbol into Google, it searches for Â (a-circumflex). However, when I copy and paste Â into my documents, it shows up correctly. That leads me to believe that the symbol is not actually Â.
This looks like a charset problem. The "Â" is what you see when a UTF-8 non-breaking space (bytes C2 A0) is decoded as Latin-1. Assuming you are running Windows, you are using one of the Latin charsets, while UTF-8 is the default encoding on OSX and Linux-based OSs. The OS locale is used as the default locale in most text editors, so files created with those programs end up encoded as Latin-1. A lot of programmers on OSX have problems with non-breaking spaces because it is very easy to type one by mistake (it is Option+Spacebar) and impossible to see.
In Python >= 3.1, the csv reader supports dialects for solving these kinds of problems. If you know what program was used to create the CSV file, you can manually specify a dialect, like 'excel'. You can also use a csv Sniffer to deduce it automatically by peeking into the file.
Life Management Advice: If you happen to see weird characters anywhere, always assume charset problems. There is an awesome charset problem debug table HERE.
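In practice, the quickest fix is often to decode the file as UTF-8 and replace non-breaking spaces (U+00A0) with ordinary spaces before parsing. A sketch (read_clean_csv is a made-up helper name):

```python
import csv
import io

def read_clean_csv(path):
    # Decode as UTF-8, swap non-breaking spaces for ordinary ones,
    # then hand the cleaned text to csv.reader.
    with open(path, encoding="utf-8") as f:
        text = f.read().replace("\u00a0", " ")
    return list(csv.reader(io.StringIO(text)))
```

Since the OP will be processing the file multiple times, doing this replacement once and saving the cleaned file is also a reasonable option.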

txt file appears blank with .write() python

I am on a windows machine and am trying to write a couple thousand lines to a text file using ipython. To test this I am just trying to get some text to appear in the file.
my code is as follows:
path="\Users\\*****\Desktop"
with open(path+'newheaders.txt','wb') as f:
    f.write('new text')
This question (.write not working in Python) is answered and seems like it should have solved my issue but when I open the text file it is still blank.
I tested the file using the code below and the text appears to be there.
with open(path+'newheaders.txt','r') as f:
    print f.read()
any ideas?
This should work as written. A few things to try (I would put this in a comment, but I lack sufficient reputation):
Delete the file and make sure the program recreates it.
Try writing in text mode ('wt') rather than binary, to see if we can narrow down the problem that way.
Remove all the business with the path and just try writing the file in the current directory.
What text editor are you using? Is it possible it's not refreshing the blank file?
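A quick way to rule out the editor-refresh theory is to write in text mode and then check the file size from Python itself; if the size matches, the bytes are on disk and the editor is the problem (write_and_verify is a made-up helper):

```python
import os

def write_and_verify(path, text):
    # Text mode ("w"), not binary; the context manager flushes and
    # closes the file, so the size check below sees the final contents.
    with open(path, "w") as f:
        f.write(text)
    return os.path.getsize(path)
```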

Can't get Python to read until the end of a file

I have tried this a few different ways, but the result always seems to be the same. I can't get Python to read until the end of the file here. It stops only about halfway through. I've tried Binary and ASCII modes, but both of these have the same result. I've also checked for any special characters in the file where it cuts off and there are none. Additionally, I've tried specifying how much to read and it still cuts off at the same place.
It goes something like this:
f=open("archives/archivelog", "r")
logtext=f.read()
print logtext
It happens whether I call it from bash, or from python, whether I'm a normal user or the root.
HOWEVER, it works fine if the file is in the same directory as I am.
f=open("archivelog", "r")
logtext=f.read()
print logtext
This works like a dream. Any idea why?
The Python reference manual about read() says:
Also note that when in non-blocking mode, less data than was requested
may be returned, even if no size parameter was given.
There is also a draft PEP about that matter, which apparently was not accepted. A PEP is a Python Enhancement Proposal.
So the sad state of affairs is that you cannot rely on read() to give you the full file in one call.
If the file is a text file I suggest you use readlines() instead. It will give you a list containing every line of the file. As far as I can tell readlines() is reliable.
Jumping off from Kelketek's answer:
I can't remember where I read about this, but basically the Python garbage collector runs "occasionally", with no guarantees about when a given object will be collected. flush() behaves similarly: http://docs.python.org/library/stdtypes.html#file.flush. What I've gathered is that flush() hands the data to a buffer for writing, and it's up to your OS to decide when to actually write it out. Probably one or both of these was your problem.
Were you reading in the file soon after writing it? That could cause a race condition (http://en.wikipedia.org/wiki/Race_condition), which is a class of generally weird, possibly random/hard-to-reproduce bugs that you don't normally expect from a high-level language like Python.
The read method returns the file contents in chunks. You have to call it again until it returns an empty string ('').
http://docs.python.org/tutorial/inputoutput.html#methods-of-file-objects
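Following that advice, the read loop looks like this (read_in_chunks is a made-up name, and the chunk size is arbitrary):

```python
def read_in_chunks(path, chunk_size=65536):
    # Keep calling read() until it returns an empty string, as the
    # tutorial describes, then join the pieces into one string.
    parts = []
    with open(path, encoding="utf-8") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            parts.append(chunk)
    return "".join(parts)
```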
Ok, gonna write this in notepad first so I don't press 'enter' too early...
I have solved the problem, but I'm not really sure WHY the solution solves it.
As it turns out, the file that was cut off early was the one just created by the Python script, whereas the one that read fine had been created earlier.
Even though I closed the file, it did not appear to be fully written to disk, OR, when I was grabbing it, I was only getting what was in the buffer. Something like that.
By doing:
del f
And then trying to grab the file, I got the whole file. And yes, I did use f.close after writing the file.
So, the problem is solved, but can anyone give me the reason why I had to garbage collect manually in this instance? I didn't think I'd have to do this in Python.
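For anyone hitting the same thing: the usual fix today is a with block, which guarantees the file is flushed and closed before anything reads it back. A minimal sketch (the filename and contents are illustrative):

```python
log_text = "line 1\nline 2\n"

# The context manager flushes and closes the file when the block ends,
# so a subsequent read sees the complete contents, not a partial buffer.
with open("archivelog.txt", "w") as f:
    f.write(log_text)

with open("archivelog.txt") as f:
    round_trip = f.read()
```

Note that a bare `f.close` without parentheses merely references the method and never actually closes the file, which is an easy way to end up with exactly this half-flushed behaviour.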
