I'm wondering what the proper way is to open large Twitter files streamed using tweepy on Python 3. I've used the following with smaller files, but now that my data collection is above 30GB I'm getting memory errors:
with open('data.txt') as f:
    tweetStream = f.read().splitlines()

tweet = json.loads(tweetStream[0])
print(tweet['text'])
print(tweet['user']['screen_name'])
I've been unable to find what I need online so far, so any help would be much appreciated.
Don't try to create an object that contains the entire file. Instead, since each line contains one tweet, work on the file one line at a time:
with open('data.txt') as f:
    for line in f:
        tweet = json.loads(line)
        print(tweet['text'])
        print(tweet['user']['screen_name'])
Perhaps store relevant tweets to another file or database, or produce a statistical summary. For example:
total = 0
about_badgers = 0

with open('data.txt') as f:
    for line in f:
        tweet = json.loads(line)
        total += 1
        if "badger" in tweet['text'].lower():
            about_badgers += 1

print("Of " + str(total) + ", " + str(about_badgers) + " were about badgers.")
Catch errors relating to unparseable lines like this:
with open('data.txt') as f:
    for line in f:
        try:
            tweet = json.loads(line)
            print(tweet['text'])
            print(tweet['user']['screen_name'])
        except json.decoder.JSONDecodeError:
            # Do something useful, like write the failing line to an error log
            pass
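For instance, the except branch could append the failing line to an error log instead of just passing (the log filename here is only an example):

with open('data.txt') as f, open('bad_lines.log', 'a') as errlog:
    for line in f:
        try:
            tweet = json.loads(line)
            print(tweet['text'])
        except json.decoder.JSONDecodeError:
            errlog.write(line)  # keep the unparseable line for later inspection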
I am scraping data from Twitter using a hashtag. My code below works perfectly. However, I would like to get 10,000 tweets and save them in the same JSON file (or save them in separate files and then combine them into one). When I run the code and print the length of my data frame, it prints only 100 tweets.
import json

credentials = {}
credentials['CONSUMER_KEY'] = ''
credentials['CONSUMER_SECRET'] = ''
credentials['ACCESS_TOKEN'] = ''
credentials['ACCESS_SECRET'] = ''

# Save the credentials object to file
with open("twitter_credentials.json", "w") as file:
    json.dump(credentials, file)

# Import the Twython class
from twython import Twython
import pandas as pd  # needed below for pd.DataFrame

# Load credentials from json file
with open("twitter_credentials.json", "r") as file:
    creds = json.load(file)

# Instantiate an object
python_tweets = Twython(creds['CONSUMER_KEY'], creds['CONSUMER_SECRET'])

data = python_tweets.search(q='#python', result_type='mixed', count=10000)

with open('tweets_python.json', 'w') as fh:
    json.dump(data, fh)

data1 = pd.DataFrame(data['statuses'])
print("\nSample size:")
print(len(data1))
OUTPUT:
Sample size:
100
I have seen some answers saying I can use max_id. I have tried to write the code, but it is wrong:
max_iters = 50
max_id = ""
for call in range(0, max_iters):
    data = python_tweets.search(q='#python', result_type='mixed', count=10000, 'max_id': max_id)
File "<ipython-input-69-1063cf5889dc>", line 4
data = python_tweets.search(q='#python', result_type='mixed', count=10000, 'max_id': max_id)
^
SyntaxError: invalid syntax
Could you please tell me how I can get 10,000 tweets saved into one JSON file?
As per their docs, you can use a generator and get as many results as are available:
results = python_tweets.cursor(python_tweets.search, q='python', result_type='mixed')

with open('tweets_python.json', 'w') as fh:
    for result in results:
        json.dump(result, fh)
        fh.write('\n')  # one JSON object per line keeps the file parseable
Also, if you want to use the max_id approach, the argument should be passed as follows:
python_tweets.search(q='#python', result_type='mixed', count=10000, max_id=max_id)
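For completeness, here is a minimal sketch of that paging loop (reusing python_tweets and json from above), assuming the Search API caps each request at 100 results and that paging backwards with max_id = oldest id - 1 is what you want:

all_statuses = []
max_id = None

for call in range(100):  # 100 pages x 100 tweets ~ 10,000
    kwargs = dict(q='#python', result_type='mixed', count=100)
    if max_id is not None:
        kwargs['max_id'] = max_id
    data = python_tweets.search(**kwargs)
    statuses = data['statuses']
    if not statuses:
        break  # no more results available
    all_statuses.extend(statuses)
    max_id = min(s['id'] for s in statuses) - 1  # page backwards past the oldest tweet seen

with open('tweets_python.json', 'w') as fh:
    json.dump(all_statuses, fh)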
I am processing JSON files with Python. I want to compare the data from the JSON file with the lines in file.txt and get output according to the result.
What should I replace filex[0] with in the code?
filename = 'paf.json'
with open(filename, 'r') as f:
    for line in f:
        if line.strip():
            tweet = json.loads(line)
            file1 = open("file.txt", "r")
            filex = file1.readlines()
            for linex in filex:
                lines = linex
            for char in tweet:
                if str(tweet['entities']['urls'][0]['expanded_url']) == filex[0]:
                    print(str(tweet['created_at']))
                    break
It's hard to tell exactly what you're asking, but I suspect you want to loop through the lines in both files in parallel.
with open("paf.json", "r") as json_file, open("file.txt", "r") as text_file:
for json_line, text_line in zip(json_file, text_file):
tweet = json.loads(json_line)
if tweet['entities']['urls'][0]['expanded_url'] == text_line:
print(tweet['created_at'])
This will tell you if line N in the text file matches the URL in line N in the JSON file.
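If instead you want to know whether each tweet's URL appears anywhere in file.txt (not just on the matching line), one sketch is to load the text file into a set first:

with open("file.txt", "r") as text_file:
    known_urls = {line.strip() for line in text_file}

with open("paf.json", "r") as json_file:
    for json_line in json_file:
        tweet = json.loads(json_line)
        if tweet['entities']['urls'][0]['expanded_url'] in known_urls:
            print(tweet['created_at'])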
I am trying to write a loop that gets JSON from a URL via requests, then writes the JSON to a .csv file. Then I need it to do that over and over again until my list of names (a .txt file, 89 lines) is finished. I can't get it to go over the list; it just gets this error:
AttributeError: module 'response' has no attribute 'append'
I can't find the issue. If I change 'response' to 'responses' I get another error:
with open('listan-{}.csv'.format(pricelists), 'w') as outf:
OSError: [Errno 22] Invalid argument: "listan-['A..
I can't seem to find a loop fitting my purpose. Since I am a total beginner with Python, I hope I can get some help here and learn more.
My code so far:
# Opens the file with pricelists
pricelists = []
with open('prislistor.txt', 'r') as f:
    for i, line in enumerate(f):
        pricelists.append(line.strip())

# Build responses
responses = []
for pricelist in pricelists:
    response.append(requests.get('https://api.example.com/3/prices/sublist/{}/'.format(pricelist), headers=headers))

# Format each response
fullData = []
for response in responses:
    parsed = json.loads(response.text)
    listan = (json.dumps(parsed, indent=4, sort_keys=True))
    # Converts and creates a .csv file.
    fullData.append(parsed['Prices'])

with open('listan-{}.csv'.format(pricelists), 'w') as outf:
    dw.writeheader()
    for data in fullData:
        dw = csv.DictWriter(outf, data[0].keys())
        for row in data:
            dw.writerow(row)
    print("The file list-{}.csv is created!".format(pricelists))
Can you make the changes below in the place where you are making the API call (and import the json library as well) and see?
import json

responses = []
for pricelist in pricelists:
    response = requests.get('https://api.example.com/3/prices/sublist/{}/'.format(pricelist), headers=headers)
    response_json = json.loads(response.text)
    responses.append(response_json)
The code below should also be in a loop that iterates through the items in pricelists:
for pricelist in pricelists:
    with open('listan-{}.csv'.format(pricelist), 'w') as outf:
        for data in fullData:
            dw = csv.DictWriter(outf, data[0].keys())
            dw.writeheader()
            for row in data:
                dw.writerow(row)
    print("The file listan-{}.csv is created!".format(pricelist))
Finally got it working. I got help from another question I created here on the forum. #waynelpu
The mistake I made was not putting the code into a loop.
Here is the code that worked like a charm.
pricelists = []
with open('prislistor.txt', 'r') as f:
    for i, line in enumerate(f):  # from here on, a looping code block starts with 8 spaces
        pricelists = (line.strip())
        # Keeps the indents
        response = requests.get('https://api.example.se/3/prices/sublist/{}/'.format(pricelists), headers=headers)
        # Formats it
        parsed = json.loads(response.text)
        listan = (json.dumps(parsed, indent=4, sort_keys=True))
        # Converts and creates a .csv file.
        data = parsed['Prices']
        with open('listan-{}.csv'.format(pricelists), 'w') as outf:
            dw = csv.DictWriter(outf, data[0].keys())
            dw.writeheader()
            for row in data:
                dw.writerow(row)
        print("The file listan-{}.csv is created!".format(pricelists))
    # code here is outside the for loop but still INSIDE the 'with' block, so you can still access f here
# code here leaves all blocks
This is my code:
searched_tweets = []
new_tweets = api.search(q=query, count=remaining_tweets, since_id=str(since_id), max_id=str(max_id))
searched_tweets.append(new_tweets)

def write_tweets(tweets, filename):
    with open(filename + ".json", "w") as f:
        for tweet in tweets:
            json.dump(tweet, f)
            f.write('\n')

write_tweets(searched_tweets, "testfile")
And I get this error when I try to write the tweets to a new JSON file:
AttributeError: 'SearchResults' object has no attribute '_json'
I'm using Python 3x.
You're appending the search result as a whole to the searched_tweets list, so that now consists of a single item.
You should remove the searched_tweets stuff altogether and just pass new_tweets to your function.
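A minimal sketch of that fix, assuming the tweets come from tweepy (each Status object carries the raw payload dict in its ._json attribute):

def write_tweets(tweets, filename):
    with open(filename + ".json", "w") as f:
        for tweet in tweets:
            json.dump(tweet._json, f)  # serialize the raw payload dict
            f.write('\n')

write_tweets(new_tweets, "testfile")  # pass the results directly, not wrapped in a list

Iterating over the SearchResults object yields the individual statuses, which is what the function expects.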
I collected a lot of tweets. Then I want to output only the English ones. I can get all of the tweets, including non-English ones. But if I add the code for i in range(0, 1000): if tweet['statuses'][i][u'lang'] == u'en': to get only English tweets, nothing gets collected, and there are no errors.
In [1]: runfile('C:/Users/Desktop/tweets.py', wdir='C:/Users/Desktop')
It just runs, and there ("C:/Users/Desktop/A.txt") is no data. My code is as follows. What should I do?
try:
    import json
except ImportError:
    import simplejson as json

tweets_filename = 'C:/Users/Desktop/tweet.txt'  # Original tweets
tweets_file = open(tweets_filename, "r")

for line in tweets_file:
    try:
        tweet = json.loads(line.strip())
        for i in range(0, 1000):  # This is the part for filtering English tweets
            if tweet['statuses'][i][u'lang'] == u'en':  # Same part
                if 'text' in tweet:
                    print(tweet['created_at'])
                    print(tweet['text'])
                    hashtags = []
                    for hashtag in tweet['entities']['hashtags']:
                        hashtags.append(hashtag['text'])
                    print(hashtags)
                    output = "C:/Users/Desktop/A.txt"  # Only English tweets path
                    out_file = open(output, 'a')
                    out_file.write(tweet['user']['name'] + "," + tweet['text'] + "\n \n")
                    out_file.close()
    except:
        continue
You have to read the lines of tweets_file, like this:
lines = tweets_file.readlines()
for line in lines:
    ...
Also, if you want to see the errors, don't catch them. Some good reading: the Zen of Python.
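For what it's worth, here is a sketch of the filter done one line at a time, assuming each line of tweet.txt holds a single streamed tweet with a top-level 'lang' field rather than a 'statuses' array:

import json

with open('C:/Users/Desktop/tweet.txt', 'r') as tweets_file, \
        open('C:/Users/Desktop/A.txt', 'a') as out_file:
    for line in tweets_file:
        line = line.strip()
        if not line:
            continue
        tweet = json.loads(line)
        if tweet.get('lang') == 'en' and 'text' in tweet:
            out_file.write(tweet['user']['name'] + "," + tweet['text'] + "\n\n")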