I have this piece of code to process a big file in Python:
import urllib2, json, csv
import requests

def readJson(url):
    """
    Read a json file.
    :param url: url to be read.
    :return: a json file.
    """
    try:
        response = urllib2.urlopen(url)
        return json.loads(response.read(), strict=False)
    except urllib2.HTTPError as e:
        return None
def getRoadsTopology():
    nodes = []
    edges = []
    url = "https://data.cityofnewyork.us/api/geospatial/svwp-sbcd?method=export&format=GeoJSON"
    data = readJson(url)
    print "Done reading road bed"
    print "Processing road bed..."
    v_index = 0
    roads = 0
    for road in data['features']:
        n_index = len(nodes)
        # (long, lat)
        coordinates = road['geometry']['coordinates'][0]
        for i in range(0, len(coordinates)):
            lat_long = coordinates[i]
            nodes.append((lat_long[1], lat_long[0]))
        # connect consecutive nodes of this road; the original upper bound
        # of len(nodes)-1-n_index silently dropped edges for every road
        # after the first
        for i in range(n_index, len(nodes) - 1):
            print i, i + 1
            edges.append((i, i + 1))
    return nodes, edges
Sometimes it works, but a lot of times I get the same error at different lines:
File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 380, in raw_decode
obj, end = self.scan_once(s, idx)
ValueError: Expecting : delimiter: line 7 column 4 (char 74317829)
File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 380, in raw_decode
obj, end = self.scan_once(s, idx)
ValueError: Expecting , delimiter: line 5 column 1 (char 72149996)
I'm wondering what causes these errors, why they show up at different positions, and how I could solve them.
The site that provides this file also has a working presentation of it:
https://data.cityofnewyork.us/City-Government/road/svwp-sbcd
It looks like your JSON input is malformed. The error is thrown from raw_decode, which is part of the JSON library, so it is failing before it ever reaches your processing code. The inconsistency of the results leads me to think the JSON is somehow getting corrupted, or not completely delivered.
My next step would be to pull the JSON from the source, store it in a local file, lint it to make sure it's valid, then test your program by reading from that file directly.
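For example, here is a minimal sketch of that next step (urllib2-based like your code; the local path and retry count are arbitrary choices of mine): download the body once, compare its length to the Content-Length header to catch truncation, save it, and only then parse from disk:

import json
import urllib2

def download_and_load(url, path='roadbed.geojson', retries=3):
    for attempt in range(retries):
        response = urllib2.urlopen(url)
        # Content-Length may be absent on chunked responses; when it is
        # present, a length mismatch means the stream was cut off early.
        expected = int(response.headers.get('Content-Length') or 0)
        body = response.read()
        if expected and len(body) != expected:
            continue  # truncated download, try again
        with open(path, 'wb') as f:
            f.write(body)
        with open(path, 'rb') as f:
            return json.load(f)
    raise IOError('could not fetch a complete copy of ' + url)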
Update:
Curious, I downloaded the file several times. A couple of the downloads came out far too small; the real size is around 121 MB. Once I had a couple of consistent, full-size copies, I ran your program against them, replacing your URL loader with a file loader. It works perfectly, unless the machine has too little RAM, in which case it segfaults.
I had the most success downloading the file on a virtual server on DigitalOcean: it got the whole file every time. When downloading on my local machine, the file was truncated, which leads me to believe the server sending you the JSON is cutting off the stream after some timeout period. The DigitalOcean server has massive throughput, averaging 12 MB/s, and pulled the entire file in 10 seconds. My local machine could only pull less than 1 MB/s and couldn't finish; it stopped at 2 minutes, having pulled only 75 MB. The sending server probably has a 2-minute time limit on requests.
This would explain why their page works while your script struggles to get it all: the map data is processed by another server that can pull the data from the source within the time allowed, and it is then streamed piece by piece, as needed, to the web viewer.
Related
I have a client and a server, where the server needs to send a number of text files to the client.
The send file function receives the socket and the path of the file to send:
import os

CHUNKSIZE = 1_000_000

def send_file(sock, filepath):
    with open(filepath, 'rb') as f:
        # Send the file size as the first line.
        sock.sendall(f'{os.path.getsize(filepath)}'.encode() + b'\r\n')
        # Send the file in chunks so large files can be handled.
        while True:
            data = f.read(CHUNKSIZE)
            if not data:
                break
            sock.send(data)
And the receive file function receives the client socket and the path where to save the incoming file:
CHUNKSIZE = 1_000_000

def receive_file(sock, filepath):
    with sock.makefile('rb') as file_socket:
        length = int(file_socket.readline())
        # Read the data in chunks so it can handle large files.
        with open(filepath, 'wb') as f:
            while length:
                chunk = min(length, CHUNKSIZE)
                data = file_socket.read(chunk)
                if not data:
                    break
                f.write(data)
                length -= len(data)
    if length != 0:
        print('Invalid download.')
    else:
        print('Done.')
It works by sending the file size as the first line, then sending the file data in chunks.
Both are invoked in loops in the client and the server, so that files are sent and saved one by one.
It works fine if I put a breakpoint and invoke these functions slowly. But if I let the program run uninterrupted, it fails when reading the size of the second file:
File "/home/stark/Work/test/networking.py", line 29, in receive_file
length = int(file_socket.readline())
ValueError: invalid literal for int() with base 10: b'00,1851,-34,-58,782,-11.91,13.87,-99.55,1730,-16,-32,545,-12.12,19.70,-99.55,1564,-8,-10,177,-12.53,24.90,-99.55,1564,-8,-5,88,-12.53,25.99,-99.55,1564,-8,-3,43,-12.53,26.54,-99.55,0,60,0\r\n'
Clearly a lot more data is being received by that length = int(file_socket.readline()) line.
My questions: why is that? Shouldn't that line read only the size given that it's always sent with a trailing \n?
How can I fix this so that multiple files can be sent in a row?
Thanks!
It seems like you're reusing the same connection, and because file_socket is buffered, you've actually recv'd more from your socket than your read loop consumed; the extra data is thrown away when the makefile wrapper is discarded at the end of the with block.
I.e. the next time you attempt to readline(), you end up reading the rest of the previous file, up to whatever newline it happens to contain, instead of the next length line.
This also means your reader has effectively skipped ahead in the stream, so the next line read is not the int you expected, hence the observed failure.
You can say:
with sock.makefile('rb', buffering=0) as file_socket:
instead, to force the file-like access to be unbuffered. Or actually handle the receiving, buffering, and parsing of the incoming bytes (understanding where one file ends and the next begins) on your own, instead of relying on the file-like wrapper and readline().
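A third option, sketched below with names of my own (it reuses your CHUNKSIZE constant), is to create the buffered reader once per connection and use it for every file, so nothing it reads ahead is ever thrown away:

def receive_files(sock, filepaths):
    # One buffered wrapper for the whole connection; data it buffers
    # between files is never discarded.
    with sock.makefile('rb') as reader:
        for filepath in filepaths:
            length = int(reader.readline())
            with open(filepath, 'wb') as f:
                while length:
                    data = reader.read(min(length, CHUNKSIZE))
                    if not data:
                        raise EOFError('connection closed mid-file')
                    f.write(data)
                    length -= len(data)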
You have to understand that socket communication is based on TCP/IP; it does not matter whether it's the same machine (you use the loopback interface in such cases) or different machines. So you've got some IP addresses between which the connection is established. Going further, it involves accessing your network adapter, which takes relatively long compared to accessing, for example, RAM. Additionally, the adapter itself manages when to send particular data frames (the lower ISO/OSI layers). Basically, TCP requires ACKs, but a standard PC is not some industrial, real-time Ethernet setup.
So, in your code, you've got a while True loop without any sleep, and you don't check what sock.send returns. Even if something goes wrong with a particular chunk, you ignore it and try to send the next one. At first glance it appears that something was cached and the receiver got what was flushed once the connection was re-established.
So, the first thing you should do is check whether sock.send indeed returned the number of bytes you asked it to send; if not, I believe the remainder should be re-sent. Another thing I strongly recommend in such cases is to design a small custom protocol (this is usually called the application layer in the context of the OSI/ISO stack). For example, you might have four frame types: START, FILESIZE, DATA, and END; assign each a unique ID and start every frame with that identifier. START would be empty, FILESIZE would contain a single uint16, DATA would contain {FILE NUMBER, LINE NUMBER, LINE_LENGTH, LINE}, and END would be empty. Then, once you've got an entire frame on the client, you can safely assemble the information you received.
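A minimal sketch of such framing (the 1-byte frame IDs and the 4-byte big-endian length header are choices I made up for illustration, not something the question's code defines):

import struct

FRAME_START, FRAME_FILESIZE, FRAME_DATA, FRAME_END = range(4)

def send_frame(sock, frame_type, payload=b''):
    # 1-byte frame ID + 4-byte payload length, then the payload itself.
    sock.sendall(struct.pack('>BI', frame_type, len(payload)) + payload)

def recv_exactly(sock, n):
    # recv() may return fewer bytes than requested, so loop until n bytes.
    buf = b''
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise EOFError('connection closed mid-frame')
        buf += chunk
    return buf

def recv_frame(sock):
    frame_type, length = struct.unpack('>BI', recv_exactly(sock, 5))
    return frame_type, recv_exactly(sock, length)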
I'm trying to send my Behave test results to an API endpoint. I set the output file to be a new JSON file, run my test, and then, in Behave's after_all() hook, send the JSON result via the requests package.
I'm running my Behave test like so:
args = ['--outfile=/home/user/nathan/results/behave4.json',
        '--format=json.pretty']
from behave.__main__ import main as behave_main
behave_main(args)
In my environment.py's after_all(), I have:
def after_all(context):
    data = json.load(open('/home/user/myself/results/behave.json', 'r'))  # This line causes the error
    sendoff = {}
    sendoff['results'] = data
    r = requests.post(MyAPIEndpoint, json=sendoff)
I'm getting the following error when running my Behave test:
HOOK-ERROR in after_all: ValueError: Expecting object: line 124 column 1
(char 3796)
ABORTED: By user.
The reported error is here in my JSON file:
[
{
...
} <-- line 124, column 1
]
However, behave.json is written out after the run, and according to JSONLint it is valid JSON. I don't know the exact details of after_all(), but I think the issue is that the JSON file isn't done being written by the time I try to open it in after_all(). If I run json.load() a second time on the behave.json file, after the file is fully written, it runs without error and I am able to view my JSON at the endpoint.
Any better explanation as to why this is happening? Any solution or change in logic to get past this?
Yes, it seems as though the file is still in the process of being written when I try to access it in after_all(). I put a small delay before opening the file in my code, then manually viewed the behave.json file and saw that there was no closing ] after the last }.
That explains it. I will create a new question to find out how to get past this, or whether a change in logic is required.
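In the meantime, a small workaround sketch, assuming the formatter simply hasn't finished writing when after_all() runs (the attempt count and delay here are arbitrary): keep retrying until the file parses.

import json
import time

def load_when_complete(path, attempts=10, delay=0.5):
    # Retry until the file parses as JSON, i.e. until the formatter has
    # written the closing bracket.
    for _ in range(attempts):
        try:
            with open(path) as f:
                return json.load(f)
        except ValueError:
            time.sleep(delay)
    raise RuntimeError('report never became valid JSON: ' + path)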
So I can send strings just fine. What I want to do, though, is send a string representation of a list, more or less. I know quite a few ways to convert a list into something that should be able to be sent as a string and then converted back.
import ast
import pickle

# on sending
l = [1, 2, 3, 4]
l_str = str(l)
# on receiving
l = ast.literal_eval(received_data)

# or pickle
l = pickle.dumps([1, 2, 3, 4])
# then
l = pickle.loads(received_data)
My issue, however, seems to be that something odd is happening between the sending and receiving.
Right now I have this:
msg = pickle.dumps([sys.stdin.readline(), person])
s.send(msg)
where sys.stdin.readline() is the line typed into the console and person is a variable containing someone's name.
I then receive it like so:
d1 = sock.recv(4096)
pickles = False
try:
    d1 = pickle.loads(d1)
    pickles = True
except Exception:
    # not pickled data; carry on and treat it as a plain string
    pass
It doesn't matter whether I turn the list into a string with my first method and then use ast.literal_eval, or use pickle: it never actually converts back to the list I want.
I currently have a try statement in place because I know there will be times when I will actually not be getting back something that was dumped using pickle or the like, so the idea is that it should fail on those, and in the except branch just continue as if the data received was formatted correctly.
The error produced when I try to unpickle, for instance, is:
Traceback (most recent call last):
File "telnet.py", line 75, in <module>
d1 = pickle.loads(d1)
File "/usr/local/lib/python2.7/pickle.py", line 1382, in loads
return Unpickler(file).load()
File "/usr/local/lib/python2.7/pickle.py", line 858, in load
dispatch[key](self)
KeyError: '\r'
The pickle.loads never succeeds, because pickles is never True. Any ideas?
EDIT: I have the overall solution. The issue was actually not in the file you see in the error (telnet.py) but in another file. I didn't realize that the intermediate server was receiving the input and changing it. After some suggestions, I realized that was exactly what was happening.
My issue actually came from another file. At the time I did not realize that this being a chat server/client was important. However, the chat server was actually sending back to the client data that it had reformatted. Honestly, I don't know how that didn't hit me, but that's what happened.
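If the relay in between is text-oriented and rewrites line endings, which would explain the KeyError: '\r', one workaround (my suggestion, not something from the original thread) is to armor the pickled bytes as base64 text before sending them through the chat server:

import base64
import pickle

def encode_for_chat(obj):
    # Base64 output contains no '\r' or '\n', so a line-based relay cannot
    # corrupt it; the trailing newline just marks the end of the message.
    return base64.b64encode(pickle.dumps(obj)) + b'\n'

def decode_from_chat(line):
    return pickle.loads(base64.b64decode(line.strip()))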
I have a data acquisition system that produces ASCII data. The data is acquired over USB with serial communication protocol (virtual serial, as the manufacturer of the box claims). I have a Python program/script that uses PySerial with a PySide GUI that plots the acquired data and saves it to HDF5 files. I'm having a very weird problem and I don't know how to tackle it. I wish you guys could help me and provide advice on how you would debug this problem.
How the problem shows up: if I use software like Eltima Data Logger, the acquired data looks fine. However, if I use my software (with PySerial), some chunks of the data seem to be missing. What's weird is that the missing chunks don't line up with the method of reading: I read line by line, yet what is missing from the data are chunks of around 100 or 64 bytes that sometimes include newlines! I know what's missing because the device buffers the data on an SD card before sending it to the computer. This made me believe for a long time that the hardware had a problem, until I used Eltima, which showed that the data is being acquired fine.
The configuration of Eltima (screenshot not reproduced here).
My configuration:
This whole thing is running in a QThread.
The following are the methods I use in my code (with some minor polishing to make them reusable here):
self.obj = serial.Serial()
self.obj.port = instrumentName
self.obj.baudrate = 115200
self.obj.bytesize = serial.EIGHTBITS
self.obj.parity = serial.PARITY_ODD
self.obj.stopbits = serial.STOPBITS_ONE
self.obj.timeout = 1
self.obj.xonxoff = False
self.obj.rtscts = False
self.obj.dsrdtr = False
self.obj.writeTimeout = 2
self.obj.open()
The algorithm I use for reading is a loop that looks for a specific header line; once that is found, it keeps pushing lines into a buffer until a specific end line is found, and then this batch of data is finally processed. The following is my code:
try:
    # keep reading until a header line is found that indicates the beginning of a batch of data
    while not self.stopped:
        self.line = self.readLine()
        self.writeDumpFileLine(self.line)
        if self.line == DataBatch.d_startString:
            print("Acquiring batch, line by line...")
            self.dataStrQueue.append(self.line)
            break
    # after the header line, keep reading until a specific string is found
    while not self.stopped:
        self.line = self.readLine()
        self.writeDumpFileLine(self.line)
        self.dataStrQueue.append(self.line)
        if self.line == DataBatch.d_endString:
            break
except Exception as e1:
    print("Exception while trying to read. Error: " + str(e1))
The self.writeDumpFileLine() takes the line from the device and dumps it in a file directly before processing for debugging purposes. These dump files have confirmed the problem of missing chunks.
The implementation of self.readLine() is quite simple:
def readLine(self):
    lineData = decodeString(self.obj.readline())
    lineData = lineData.replace(acquisitionEndlineChar, "")
    return lineData
I would like to point out that I also have an implementation that pulls thousands of lines and parses them based on inWaiting(), and this method has the same problem too!
Now I'm starting to wonder: Is it PySerial? What else could be causing this problem?
Thank you so much for any efforts. If you require any additional information, please ask!
UPDATE:
Actually, I have just confirmed that the problem can be reproduced by getting the system to lag a little. I use PyCharm to develop this software, and while the program is running, if I press Ctrl+S to save, the PyCharm GUI freezes a little (and hence so does its terminal). Repeating this many times reproduces the problem reliably!
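Based on that, one direction I am considering (just a sketch with names of my own, using a plain thread, pyserial's inWaiting(), and Python 3's queue module) is to drain the port in a dedicated thread that does nothing but push raw bytes into a queue, so that a GUI or IDE stall can never delay the reads:

import queue
import threading

raw_queue = queue.Queue()
stop_event = threading.Event()

def drain_port(ser, out_queue, stop_event):
    # Read whatever is buffered (at least 1 byte, blocking up to the port
    # timeout) and hand it off immediately; line-splitting and parsing
    # happen on the consumer side of the queue.
    while not stop_event.is_set():
        data = ser.read(ser.inWaiting() or 1)
        if data:
            out_queue.put(data)

# usage, from wherever the serial object lives (self.obj in my code):
# threading.Thread(target=drain_port,
#                  args=(self.obj, raw_queue, stop_event),
#                  daemon=True).start()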
I have the following problem with Python 2.7 and the Plot.ly API, and I am not sure what's going on or where the problem is. Before I write to the authors, I am going to try asking here. I have a script that scans specific websites and their links and analyzes the content (words, counts, etc.). The result is plotted by Plotly as a bar graph. Everything works fine, and the script runs every 30 minutes. But what happens a few times every day is that the method that handles the data upload through the API, like response = py.plot([data]), says "ValueError: No JSON object could be decoded" (data is not empty, and the counting works fine). What I don't understand is that:
1) It was working with the same script code a few minutes earlier.
2) It doesn't matter what data I put inside the variable data (even simple numbers for x and y).
3) After the above-mentioned error, the data are sent and published, but the descriptors (layouts: axis setup, title, size of graph) are not, because they are set separately in the next step and the script is terminated at the point where the response is created (well, I could merge those together, but the error still appears and I'd like to know why).
4) When I create an empty .py file with a basic example like:
import plotly

py = plotly.plotly(username='someUname', key='someApiKey')

x0 = ['a', 'b', 'c']
y0 = [20, 14, 23]
data = {'x': x0, 'y': y0, 'type': 'bar'}

response = py.plot([data])
url = response['url']
filename = response['filename']
then the result is the same: "No JSON object could be decoded", to be exact.
Traceback (most recent call last):
File "<module1>", line 10, in <module>
File "C:\Python27\lib\site-packages\plotly-0.4-py2.7.egg\plotly\plotly.py", line 69, in plot
r = self.__makecall(args, un, key, origin, kwargs)
File "C:\Python27\lib\site-packages\plotly-0.4-py2.7.egg\plotly\plotly.py", line 142, in __makecall
r = json.loads(r.text)
File "C:\Python27\lib\json\__init__.py", line 338, in loads
return _default_decoder.decode(s)
File "C:\Python27\lib\json\decoder.py", line 365, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "C:\Python27\lib\json\decoder.py", line 383, in raw_decode
raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
The data are published, but I am not able to set the layout. At times when the word-counting script works fine, this small piece of example code works as well.
Does anyone have the same experience? Well, I am not a coding pro, but it seems that the problem could be somewhere outside my code. Or maybe I missed something; in any case, I am not able to debug or understand the reason.
Thank you for any tips.
Chris here, from Plotly. Thanks for reporting the issue. You definitely aren't doing anything wrong on your end! This error arises because of a transmission issue from Plotly to your desktop: the API expects a string in JSON format from the Plotly server but received something different. I'll look into it further. Definitely email me if it happens again! --chris[at]plot.ly
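Until that is fixed server-side, one pragmatic client-side workaround (a sketch of mine; the retry count and delay are arbitrary) is to retry the call whenever the response body fails to decode, since the failure appears to be transient:

import time

def plot_with_retry(py, data, attempts=3, delay=5):
    for attempt in range(attempts):
        try:
            return py.plot(data)
        except ValueError:
            # "No JSON object could be decoded": transient server hiccup
            if attempt == attempts - 1:
                raise
            time.sleep(delay)

response = plot_with_retry(py, [data])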