I am scraping data from Twitter using a hashtag. My code below works; however, I would like to get 10,000 tweets and save them in a single JSON file (or save them in separate files and then combine them into one). When I run the code and print the length of my data frame, it shows only 100 tweets.
import json
credentials = {}
credentials['CONSUMER_KEY'] = ''
credentials['CONSUMER_SECRET'] = ''
credentials['ACCESS_TOKEN'] = ''
credentials['ACCESS_SECRET'] = ''
# Save the credentials object to file
with open("twitter_credentials.json", "w") as file:
    json.dump(credentials, file)
# Import the Twython class
from twython import Twython
import json
import pandas as pd
# Load credentials from json file
with open("twitter_credentials.json", "r") as file:
    creds = json.load(file)
# Instantiate an object
python_tweets = Twython(creds['CONSUMER_KEY'], creds['CONSUMER_SECRET'])
data = python_tweets.search(q='#python', result_type='mixed', count=10000)
with open('tweets_python.json', 'w') as fh:
    json.dump(data, fh)
data1 = pd.DataFrame(data['statuses'])
print("\nSample size:")
print(len(data1))
OUTPUT:
Sample size:
100
I have seen some answers where I can use max_id. I have tried to write the code, but it is wrong:
max_iters = 50
max_id = ""
for call in range(0, max_iters):
    data = python_tweets.search(q='#python', result_type='mixed', count=10000, 'max_id': max_id)
File "<ipython-input-69-1063cf5889dc>", line 4
data = python_tweets.search(q='#python', result_type='mixed', count=10000, 'max_id': max_id)
^
SyntaxError: invalid syntax
Could you please tell me how I can get 10,000 tweets saved into one JSON file?
As per their docs here, you can use a generator and get as many results as are available.
results = python_tweets.cursor(python_tweets.search, q='python', result_type='mixed')
with open('tweets_python.json', 'w') as fh:
    for result in results:
        json.dump(result, fh)
        fh.write('\n')  # separate the dumped objects
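Writing a newline after each dump gives you one JSON object per line, so the file can be read back later by looping over it and calling json.loads on each line, instead of producing one long run of concatenated objects.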
Also, if you want to use the max_id approach, the argument should be passed as follows:
python_tweets.search(q='#python', result_type='mixed', count=10000, max_id=max_id)
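A minimal sketch of the max_id pagination loop, assuming the Search API's cap of 100 tweets per request and reusing the python_tweets object from above; the stopping condition and error handling are kept simple on purpose:
import json
all_tweets = []
max_id = None
max_iters = 100  # 100 pages x 100 tweets per page is roughly 10,000 tweets
for call in range(max_iters):
    params = {'q': '#python', 'result_type': 'mixed', 'count': 100}
    if max_id is not None:
        params['max_id'] = max_id
    data = python_tweets.search(**params)
    statuses = data['statuses']
    if not statuses:
        break  # no more results available
    all_tweets.extend(statuses)
    # ask for tweets older than the oldest one already collected
    max_id = min(status['id'] for status in statuses) - 1
with open('tweets_python.json', 'w') as fh:
    json.dump(all_tweets, fh)
print("Sample size:", len(all_tweets))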
Related
I'm using an API call that returns data on every iteration, but I'm confused about how to save the data from each iteration into a JSON file.
language : Python
Version : 3.9
import glob
import os
import virustotal_python
from pprint import pprint
folder_path = 'C:/Users/E-TIME/PycharmProjects/FYP script/263 Hascodes in Txt Format'
count = 0
for file in glob.glob(os.path.join(folder_path, '*.txt')):
    with open(file, 'r') as f:
        lines = f.read()
        l = lines.split(" ")
        l = l[0].split('\n')
        for file_id in range(0, 3):
            with virustotal_python.Virustotal(
                    "ab8421085f362f075cc88cb1468534253239be0bc482da052d8785d422aaabd7") as vtotal:
                resp = vtotal.request(f"files/{l[file_id]}/behaviours")
                data = resp.data
                pprint(data)
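A minimal sketch of the saving step, assuming resp.data is JSON-serializable (the function names and file names below are just examples): either append one JSON object per line as each response arrives, or collect everything in a list and dump it once after the loops finish.
import json
def save_line(data, path='behaviours.jsonl'):
    # append one JSON object per line, called once per iteration
    with open(path, 'a') as out:
        out.write(json.dumps(data) + '\n')
def save_all(results, path='behaviours.json'):
    # write the whole collected list once, after all iterations are done
    with open(path, 'w') as out:
        json.dump(results, out, indent=4)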
I'm trying to make a feature for a Discord bot where it counts how many commands have been run and stores the count in a JSON file named data.json. Here is the code:
import json
# Read 'data.json'
with open("data.json", "r") as d:
datar = json.load(d)
com = datar.get("commands-run")
# Functions
def command_increase():
commands_run = com + 1
dataw = open("data.json", "w")
dataw["commands-run"] = commands_run
And here is the JSON file:
{
"commands-run": 0
}
And this is the error I get when I run a command and it tries to increase the value in the JSON file:
TypeError: '_io.TextIOWrapper' object does not support item assignment
On top of that, it also completely wipes the JSON file. By that, I mean that it just clears everything. Even the brackets.
When you call json.load, it loads your JSON data into a dict.
You can increase your command counter in this dict and then write the dict back into your JSON file.
import json
# Read 'data.json'
with open("data.json", "r") as d:
datar = json.load(d)
com = datar.get("commands-run")
# Functions
def command_increase():
commands_run = com + 1
datar["commands-run"] = commands_run
with open("data.json", "w") as dataw:
dataw.write(json.dumps(datar, indent=4))
command_increase()
I'm trying to tokenize all the tweets that I previously saved in a JSON file. I'm following this example: https://marcobonzanini.com/2015/03/09/mining-twitter-data-with-python-part-2/
import re
import json
emoticons_str = r"""
    (?:
        [:=;] # Eyes
        [oO\-]? # Nose (optional)
        [D\)\]\(\]/\\OpP] # Mouth
    )"""
regex_str = [
    emoticons_str,
    r'<[^>]+>', # HTML tags
    r'(?:@[\w_]+)', # @-mentions
    r"(?:\#+[\w_]+[\w\'_\-]*[\w_]+)", # hash-tags
    r'http[s]?://(?:[a-z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))+', # URLs
    r'(?:(?:\d+,?)+(?:\.?\d+)?)', # numbers
    r"(?:[a-z][a-z'\-_]+[a-z])", # words with - and '
    r'(?:[\w_]+)', # other words
    r'(?:\S)' # anything else
]
tokens_re = re.compile(r'('+'|'.join(regex_str)+')', re.VERBOSE | re.IGNORECASE)
emoticon_re = re.compile(r'^'+emoticons_str+'$', re.VERBOSE | re.IGNORECASE)
def tokenize(s):
    return tokens_re.findall(s)
def preprocess(s, lowercase=False):
    tokens = tokenize(s)
    if lowercase:
        tokens = [token if emoticon_re.search(token) else token.lower() for token in tokens]
    return tokens
When I add this at the end everything is working:
tweet = 'RT @marcobonzanini: just an example! :D http://example.com #NLP'
print(preprocess(tweet))
I want to tokenize my tweets that I saved in a JSON file, and the website suggests doing it this way:
with open('mytweets.json', 'r') as f:
    for line in f:
        tweet = json.loads(line)
        tokens = preprocess(tweet['text'])
        do_something_else(tokens)
This is how I'm trying to open my JSON file:
with open('data/digitalhealth.json', 'r') as f:
... for line in f:
... tweet = json.loads(line)
... tokens = preprocess(tweet['text'])
... do_something_else(tokens)
And this is what python returns:
Traceback (most recent call last):
File "<stdin>", line 4, in <module>
TypeError: list indices must be integers, not str
Does anyone know how to sort this out? I'm new to all this and I really don't have any idea what to do.
This is my code for collecting Data from Twitter's API:
import tweepy
import json
API_KEY = 'xxx'
API_SECRET = 'xxx'
TOKEN_KEY = 'xxx'
TOKEN_SECRET = 'xxx'
auth = tweepy.OAuthHandler(API_KEY, API_SECRET)
auth.set_access_token(TOKEN_KEY, TOKEN_SECRET)
api = tweepy.API(auth, wait_on_rate_limit=True)
query = '#digitalhealth'
cursor = tweepy.Cursor(api.search, q=query, lang="en")
for page in cursor.pages():
    tweets = []
    for item in page:
        tweets.append(item._json)
    with open('Twitter_project/digitalhealth.json', 'wb') as outfile:
        json.dump(tweets, outfile)
How do I change it now so I will have only dictionaries?
Thanks all of you for your help! I really appreciate it
For some reason you're storing your JSON dictionaries in lists. You should store them as dictionaries, since that would be much easier for you, but if you want to access them as they are now, simply use tweet[0] to get the dictionary, and from there you can access its data like so: tweet[0]['text']. Still, look into reformatting the JSON properly, as sketched below.
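A minimal sketch of the "reformat the JSON properly" idea, assuming the collection script is changed to write one tweet dictionary per line (JSON Lines) instead of dumping a list per page; the reading loop from the blog post, with the preprocess function above, then works unchanged:
import json
# collection side: one tweet dict per line, reusing the tweepy cursor from above
with open('Twitter_project/digitalhealth.json', 'w') as outfile:
    for page in cursor.pages():
        for item in page:
            outfile.write(json.dumps(item._json) + '\n')
# reading side: every line is now a dict, so tweet['text'] works
with open('Twitter_project/digitalhealth.json', 'r') as f:
    for line in f:
        tweet = json.loads(line)
        tokens = preprocess(tweet['text'])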
I am trying to write a loop that gets JSON from a URL via requests, then writes the JSON to a .csv file. I need it to do this over and over again until my list of names (a .txt file, 89 lines) is finished. I can't get it to go over the list; I just get the error:
AttributeError: module 'response' has no attribute 'append'
I can't find the issue; if I change 'response' to 'responses' I also get an error:
with open('listan-{}.csv'.format(pricelists), 'w') as outf:
OSError: [Errno 22] Invalid argument: "listan-['A..
I can't seem to find a loop that fits my purpose. Since I am a total beginner at Python, I hope I can get some help here and learn more.
My code so far:
#Opens the file with pricelists
pricelists = []
with open('prislistor.txt', 'r') as f:
    for i, line in enumerate(f):
        pricelists.append(line.strip())
# build responses
responses = []
for pricelist in pricelists:
    response.append(requests.get('https://api.example.com/3/prices/sublist/{}/'.format(pricelist), headers=headers))
#Format each response
fullData = []
for response in responses:
    parsed = json.loads(response.text)
    listan = (json.dumps(parsed, indent=4, sort_keys=True))
    #Converts and creates a .csv file.
    fullData.append(parsed['Prices'])
with open('listan-{}.csv'.format(pricelists), 'w') as outf:
    dw.writeheader()
    for data in fullData:
        dw = csv.DictWriter(outf, data[0].keys())
        for row in data:
            dw.writerow(row)
print ("The file list-{}.csv is created!".format(pricelists))
Can you make the changes below in the place where you are making the API call (and import the json library as well) and see?
import json
responses = []
for pricelist in pricelists:
    response = requests.get('https://api.example.com/3/prices/sublist/{}/'.format(pricelist), headers=headers)
    response_json = json.loads(response.text)
    responses.append(response_json)
The code below should also be in a loop that iterates over the items in pricelists:
for pricelist in pricelists:
    with open('listan-{}.csv'.format(pricelist), 'w') as outf:
        for data in fullData:
            dw = csv.DictWriter(outf, data[0].keys())
            dw.writeheader()
            for row in data:
                dw.writerow(row)
    print("The file listan-{}.csv is created!".format(pricelist))
Finally got it working. I got help from another question I created here on the forum. @waynelpu
The mistake I made was not putting the code inside a loop.
Here is the code that worked like a charm.
pricelists = []
with open('prislistor.txt', 'r') as f:
    for i, line in enumerate(f):  # from here on, the looping code block starts with 8 spaces
        pricelists = (line.strip())
        # Keeps the indents
        response = requests.get('https://api.example.se/3/prices/sublist/{}/'.format(pricelists), headers=headers)
        #Formats it
        parsed = json.loads(response.text)
        listan = (json.dumps(parsed, indent=4, sort_keys=True))
        #Converts and creates a .csv file.
        data = parsed['Prices']
        with open('listan-{}.csv'.format(pricelists), 'w') as outf:
            dw = csv.DictWriter(outf, data[0].keys())
            dw.writeheader()
            for row in data:
                dw.writerow(row)
        print("The file list-{}.csv is created!".format(pricelists))
    # code here is outside the for loop but still INSIDE the 'with' block, so you can still access f here
# code here leaves all blocks
I'm wondering what the proper way is to open large Twitter files streamed using tweepy on Python 3. I've used the following with smaller files, but now that my data collection is above 30 GB I'm getting memory errors:
with open('data.txt') as f:
    tweetStream = f.read().splitlines()
tweet = json.loads(tweetStream[0])
print(tweet['text'])
print(tweet['user']['screen_name'])
I've been unable to find what I need online so far so any help would be much appreciated.
Don't try to create an object that contains the entire file. Instead, since each line contains a tweet, work on the file one line at a time:
with open('data.txt') as f:
    for line in f:
        tweet = json.loads(line)
        print(tweet['text'])
        print(tweet['user']['screen_name'])
Perhaps store relevant tweets to another file or database, or produce a statistical summary. For example:
total = 0
about_badgers = 0
with open('data.txt') as f:
    for line in f:
        tweet = json.loads(line)
        total += 1
        if "badger" in tweet['text'].lower():
            about_badgers += 1
print("Of " + str(total) + ", " + str(about_badgers) + " were about badgers.")
Catch errors relating to unparseable lines like this:
with open('data.txt') as f:
    for line in f:
        try:
            tweet = json.loads(line)
            print(tweet['text'])
            print(tweet['user']['screen_name'])
        except json.decoder.JSONDecodeError:
            # Do something useful, like write the failing line to an error log
            pass