I am pretty new to working with Python and coding in general, so I suspect the answer involves something I don't understand about how Python works.
I have been using Tweepy to collect streams of Twitter data to measure sentiment about different things. That part worked fine. When I ran the program, I wrote the data to a .txt file and then tried to use the data in that file to find things such as common words or locations. But I am running into problems when reading the data back in. I have searched online and found a number of ways that people have read such data, but as I am unfamiliar with JSON files in general, I don't understand why these methods would or wouldn't work.
The main error I seem to be running into is something similar to this:
JSONDecodeError: Expecting value: line 1 column 1 (char 0).
From my understanding, this means the data is not being read in correctly and can't be parsed as JSON. But I have also seen the error read like this:
JSONDecodeError: Expecting value: line 4 column 1 (char 0).
I don't understand why it changes. I have tried reading the file in as the original .txt file and then saving it again as a .json file. The first error came from trying it as a .json file, with the second coming from the .txt file.
I have read a number of threads discussing similar problems, but I keep getting these types of errors. Just as an example, here is what my code looked like for the most recent error:
import json
source = open("../twitterdata24.json")
json_data = json.load(source)
One of my other attempts:
import json
tweets = []
for line in open("fileinfo"):
    tweets.append(json.load(line))
One other point of interest: the data I am working with contains many individual tweets, and from what I have read, I think the problem is that each individual tweet is a new dictionary. So I tried to make the whole data file a list using [], but that just moved the error down a line.
So if there is anything anyone could tell me or point me to that would help me understand what I am supposed to do to read this data, I would really appreciate it.
Thanks
Edit:
Here is a small sample of the data. The whole data file is a little large, so here are the first two tweets in the file.
https://drive.google.com/file/d/1l6uiCzBTYf-SqUpCThQ3WDXmslMcUnPA/view?usp=sharing
Looking at your sample data, I suspect the problem is that it isn't a single valid JSON document. You effectively have data like:
{"a": "b"}
{"c": "d"}
`{"a": "b"}` is valid JSON, and `{"c": "d"}` is valid JSON, but `{"a": "b"}\n{"c": "d"}` is not valid JSON. This explains why `json.load(source)` fails.
You're on the right track with your second attempt: by reading through the file line-by-line, you can extract the valid json data objects individually. But your implementation has two problems:
line is a string, and json.load expects a file-like object, not a string. That's what json.loads is for.
you can't convert an empty line to a json object.
So if you check for empty lines and use loads, you should be able to fill your tweets list without any problems.
import json
tweets = []
with open("sampledata.txt") as source:
    for line in source:
        if line.strip():
            tweets.append(json.loads(line))
print("Successfully loaded {} tweets.".format(len(tweets)))
Result:
Successfully loaded 2 tweets.
Related
I have a huge text file that contains several JSON objects that I want to parse into a CSV file. Because I'm dealing with someone else's data, I cannot change the format it's delivered in.
Since I don't know how many JSON objects there are, I can't just create a set of dictionaries, wrap them in a list, and json.loads() the list.
Also, since all the objects are on a single text line, I can't use a regex to separate each individual JSON object and then put them in a list. (It's a super complicated and sometimes triple-nested JSON at some points.)
Here's my current code:
def json_to_csv(text_file_name, desired_csv_name):
    # Cleans up a bit of the text file
    file = fileinput.FileInput(text_file_name, inplace=True)
    ile = fileinput.FileInput(text_file_name, inplace=True)
    for line in file:
        sys.stdout.write(line.replace(u'\'', u'"'))
    for line in ile:
        sys.stdout.write(re.sub(r'("[\s\w]*)"([\s\w]*")', r"\1\2", line))
    # Try to load the text file into the content variable
    with open(text_file_name, "rb") as fin:
        content = json.load(fin)
    # Rest of the logic uses the JSON data in content
    # to produce the desired CSV format
This code gives a ValueError: Extra data: line 1 column 159816 because there is more than one object there.
I've seen similar questions on Google and Stack Overflow, but none of those solutions work, because it's just one really long line in a text file and I don't know how many objects are in the file.
If you are trying to split apart the highest level braces you could do something like
string = '{"NextToken": {"value": "...'
objects = eval("[" + string + "]")
and then parse each item in the list.
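A caveat on that approach: eval will execute anything in the string, so on data you don't control, a safer route is the stdlib's json.JSONDecoder.raw_decode, which parses one value at a time and reports where it stopped. A minimal sketch (the sample blob here is made up):

```python
import json

# Made-up stand-in for the real one-line blob of concatenated objects
blob = '{"NextToken": {"value": "abc"}} {"Count": 2}'

decoder = json.JSONDecoder()
objects = []
idx = 0
while idx < len(blob):
    # raw_decode parses one JSON value and returns the index where it ended
    obj, end = decoder.raw_decode(blob, idx)
    objects.append(obj)
    idx = end
    # skip any whitespace between concatenated objects
    while idx < len(blob) and blob[idx].isspace():
        idx += 1
```

This avoids both eval and brace counting, and it doesn't care how many objects the line contains.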
I am currently extracting Tweets from Twitter using Twitter IDs. The tool I am using takes the dataset of Twitter IDs that I downloaded online (and will be using for my master's dissertation), extracts the information from each Tweet, and stores each Tweet as a JSON string in a .txt file.
Below is a link to my OneDrive, where I have 2 files:
https://1drv.ms/f/s!At39YLF-U90fhJwCdEuzAc2CGLC_fg
1) Extracted Tweet information, each as a JSON string in a .txt file
2) Extracted Tweet information, each as a JSON string in what I believe is a .json file. I say 'believe' because the tool I am using automatically creates a file with '.json' in the filename but in a .txt format. I have simply renamed the file by removing the '.txt' extension from the end.
Below is code I have written (it is simple but the more I look for alternative code online, the more confused I become):
import pandas as pd
dftest = pd.read_json('test.json', lines=True)
The following error appears when I run the code:
ValueError: Unexpected character found when decoding array value (2)
I have run the first few Tweet objects through a free online JSON parser, and it breaks out the features of each Tweet exactly how I want (to my knowledge, this confirms the Tweet objects are in JSON format).
I would be grateful if people could:
1) Confirm the extracted Tweets are in fact in a JSON string format
2) Confirm whether, if the filename is automatically saved as 'test.json.txt' and I remove '.txt' from the filename, the file becomes a .json file
3) Suggest how to get my very short Python script to work. The ultimate aim is to identify the features in each Tweet that I want (e.g. "created_at", "text", "hashtags", "location" etc.) in a Dataframe, so I can then save it to a .csv file.
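On point 3: if each Tweet really is one JSON string per line, you don't strictly need pandas to reach a .csv. A stdlib-only sketch — the two sample lines and the field names ("created_at", "text", and the nested "user"/"location") are illustrative stand-ins, not taken from the actual file:

```python
import csv
import io
import json

# Hypothetical two-tweet sample standing in for test.json;
# the real file would be opened with open("test.json") instead
sample = io.StringIO(
    '{"created_at": "Mon Jan 01", "text": "hello", "user": {"location": "UK"}}\n'
    '{"created_at": "Tue Jan 02", "text": "world", "user": {"location": "US"}}\n'
)

rows = []
for line in sample:
    if not line.strip():
        continue  # skip blank lines between tweets
    tweet = json.loads(line)
    # pull out just the fields of interest
    rows.append({
        "created_at": tweet.get("created_at"),
        "text": tweet.get("text"),
        "location": tweet.get("user", {}).get("location"),
    })

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["created_at", "text", "location"])
writer.writeheader()
writer.writerows(rows)
```

If you do want pandas in the middle, pd.DataFrame(rows) would give you the DataFrame form before saving to CSV.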
I am trying to retrieve the names of the people from my file. The file size is 201GB
import json
with open("D:/dns.json", "r") as fh:
    for l in fh:
        d = json.loads(l)
        print(d["name"])
Whenever I try to run this program on Windows, I encounter a MemoryError saying there is insufficient memory.
Is there a reliable way to parse a single key-value pair at a time without loading the whole file? I have considered reading the file in chunks, but I don't know how to start.
Here is a sample: test.json
Every line is separated by a newline. Hope this helps.
You may want to give ijson a try: https://pypi.python.org/pypi/ijson
Unfortunately there is no guarantee that each line of a JSON file will make any sense to the parser on its own. I'm afraid JSON was never intended for multi-gigabyte data exchange, precisely because each JSON file contains an integral data structure. In the XML world people have written incremental event-driven (SAX-based) parsers. I'm not aware of such a library for JSON.
I'm trying to load an extremely large JSON file in Python. I've tried:
import json
data = open('file.json').read()
loaded = json.loads(data)
but that gives me a SIGKILL error.
I've tried:
import pandas as pd
df = pd.read_json('file.json')
and I get an out-of-memory error.
I'd like to try to use ijson to stream my data and only pull a subset into memory at a time. However, you need to know what the schema of the JSON file is so that you know what events to look for. I don't actually know what the schema of my JSON file is. So, I have two questions:
Is there a way to load or stream a large json file in Python without knowing the schema? Or a way to convert a JSON file into another format (or into a postgresql server, for example)?
Is there a tool for spitting out what the schema of my JSON file is?
UPDATE:
I used head file.json to get an idea of what my JSON file looks like. From there it's a bit easier.
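If head only gets you part of the way, a rough schema can be sniffed from the first object with a few lines of stdlib code. This is a heuristic sketch, not a real schema tool: it only inspects the first element of each list, and the sample object below is invented:

```python
import json

def key_paths(obj, prefix=""):
    """Recursively collect dotted key paths from nested dicts/lists."""
    paths = []
    if isinstance(obj, dict):
        for key, value in obj.items():
            path = f"{prefix}.{key}" if prefix else key
            paths.append(path)
            paths.extend(key_paths(value, path))
    elif isinstance(obj, list) and obj:
        # treat the first element as representative of the list
        paths.extend(key_paths(obj[0], prefix + "[]"))
    return paths

# Invented sample standing in for the first object of the real file
sample = json.loads('{"user": {"name": "a", "tags": [{"id": 1}]}, "count": 2}')
```

Running key_paths on the first object pulled from the file gives the dotted paths you could then hand to something like ijson.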
I would deal with smaller pieces of the file. Take a look at Lazy Method for Reading Big File in Python?. You can adapt the proposed answer to parse your JSON object by object.
You can read in chunks with a generator, something like this:
def read_in_chunks(f, chunk_size=1024):
    while True:
        data = f.read(chunk_size)
        if not data:
            break
        yield data
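Chunked reads hand you raw text that can end mid-object, so something has to stitch the pieces back into whole JSON values. One way, sketched below with the stdlib's json.JSONDecoder.raw_decode (the sample input is invented; this assumes the top-level values are objects or arrays, since a bare number split across chunks could parse prematurely):

```python
import io
import json

def iter_json_objects(f, chunk_size=1024):
    """Yield complete JSON objects from a file read in fixed-size chunks."""
    decoder = json.JSONDecoder()
    buffer = ""
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        while buffer:
            buffer = buffer.lstrip()
            try:
                obj, end = decoder.raw_decode(buffer)
            except json.JSONDecodeError:
                break  # incomplete object: wait for the next chunk
            yield obj
            buffer = buffer[end:]

# Tiny in-memory stand-in for the real file
sample = io.StringIO('{"a": 1}\n{"b": [2, 3]}\n')
objects = list(iter_json_objects(sample, chunk_size=4))
```

The deliberately tiny chunk_size shows that objects split across chunk boundaries still come out whole.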
Line-by-line option:
import json

data = []
with open('file') as f:
    for line in f:
        data.append(json.loads(line))
Also look at https://www.dataquest.io/blog/python-json-tutorial/ and search for more answers about jsonlines.
(Python 3.5)
I am trying to parse a large user review.json file (1.3 GB) into Python and convert it to a .csv file. I have tried looking for a simple converter tool online, but most accept a maximum file size of 1 MB or are super expensive.
As I am fairly new to Python, I guess I'll ask 2 questions:
Is it even possible/efficient to do so, or should I be looking for another method?
I tried the following code; it only reads and writes the top 342 lines in my .json doc and then returns an error.
File "C:\Anaconda3\lib\json\__init__.py", line 319, in loads
    return _default_decoder.decode(s)
File "C:\Anaconda3\lib\json\decoder.py", line 342, in decode
    raise JSONDecodeError("Extra data", s, end)
JSONDecodeError: Extra data
This is the code I'm using:
import csv
import json
infile = open("myfile.json","r")
outfile = open ("myfile.csv","w")
writer = csv.writer(outfile)
for row in json.loads(infile.read()):
    writer.writerow(row)
my .json example:
Link To small part of Json
My thought is that it's some type of error related to my for loop with json.loads, but I do not know enough about it. Is it possible to create a dictionary and convert just the values "user_id", "stars", and "text"? Or am I dreaming?
Any suggestions or criticism are appreciated.
This is not a JSON file; this is a file containing individual lines of JSON. You should parse each line individually.
for row in infile:
    data = json.loads(row)
    writer.writerow(data)
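One thing to watch: csv.writer.writerow on a dict writes its keys, so if the goal is the "user_id", "stars", and "text" values mentioned above, pick them out explicitly. A sketch with made-up review lines standing in for myfile.json:

```python
import csv
import io
import json

# Invented review lines standing in for the real myfile.json
lines = [
    '{"user_id": "u1", "stars": 5, "text": "great", "useful": 0}',
    '{"user_id": "u2", "stars": 2, "text": "meh", "useful": 3}',
]

fields = ["user_id", "stars", "text"]
out = io.StringIO()  # in place of open("myfile.csv", "w", newline="")
writer = csv.writer(out)
writer.writerow(fields)  # header row
for line in lines:
    review = json.loads(line)
    # write only the selected values, in a fixed column order
    writer.writerow([review.get(f) for f in fields])
```

Fields not listed (like "useful" here) are simply dropped from the CSV.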
Sometimes it's not as easy as having one JSON definition per line of input. A JSON definition can spread out over multiple lines, and it's not necessarily easy to determine which are the start and end braces reading line by line (for example, if there are strings containing braces, or nested structures).
The answer is to use the raw_decode method of json.JSONDecoder to fetch the JSON definitions from the file one at a time. This will work for any set of concatenated valid JSON definitions. It's further described in my answer here: Importing wrongly concatenated JSONs in python
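As a sketch of what that looks like in practice — the input string here is invented, with a brace inside a string value to show why brace-counting line scans fail:

```python
import json

# One object spread over several lines (and containing a "}" in a string),
# immediately followed by a second object
text = '{"a": {\n  "b": "}"\n}}{"c": 3}'

decoder = json.JSONDecoder()
objects = []
pos = 0
while pos < len(text):
    # raw_decode returns the parsed value and the position just past it
    obj, pos = decoder.raw_decode(text, pos)
    objects.append(obj)
    # skip any whitespace before the next definition
    while pos < len(text) and text[pos].isspace():
        pos += 1
```

Each call picks up exactly where the previous definition ended, regardless of line breaks or braces inside strings.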