Notify About New Tweets - python

I'm trying to make a simple script which tells me when a Twitter account has a new tweet.
import urllib

def CurrentP(array, string):
    count = 0
    for a_ in array:
        if a_ == string:
            return count
        count = count + 1

twitters = ["troyhunt", "codinghorror"]
last = []
site = "http://twitter.com/"

for twitter in twitters:
    source = site + twitter
    for line in urllib.urlopen(source):
        if line.find(twitter + "/status") != -1:
            id = line.split('/')[3]
            if id != last[CurrentP(twitters, twitter)]:
                print "[+] New tweet + " + twitter
                last[CurrentP(twitters, twitter)] = id
But I get this error when I try to run the script:
File "twitter.py", line 16 in ?
for line in urllib.urlopen(source):
TypeError: iteration over non-sequence
What did I do wrong?

Web scraping is not the most economical way of retrieving this data. Twitter provides its own API, which returns data in a nice JSON format that is easy to parse for the relevant information. The nice thing is that there are many Python libraries, like Tweepy, which wrap the API for you and make the data extraction as simple as the example below.
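A minimal sketch with Tweepy, assuming you have registered a Twitter app and filled in your own credentials (the placeholder strings below are hypothetical); user_timeline fetches the most recent tweets for a screen name:

import tweepy

# Hypothetical credentials - replace with the keys from your own Twitter app
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
api = tweepy.API(auth)

last_seen = {}
for screen_name in ["troyhunt", "codinghorror"]:
    # Latest tweet for this account
    latest = api.user_timeline(screen_name=screen_name, count=1)[0]
    if last_seen.get(screen_name) != latest.id:
        print("[+] New tweet from " + screen_name + ": " + latest.text)
        last_seen[screen_name] = latest.id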

Related

I want to search a file of tweets to find the most popular hashtags used

For a python project I have been asked to collect tweets over a certain period of time about a certain topic. I now have a file with hundreds of tweets. How do I search for most popular hashtags in that file to create a word cloud?
Let us suppose that your corpus is stored as a list and all special characters are already removed. I am using TfidfVectorizer from sklearn, together with pandas and the wordcloud package:
from sklearn.feature_extraction.text import TfidfVectorizer
from wordcloud import WordCloud
import pandas as pd

corpus = ['the text of your tweet', 'quote in it']
vectorizer = TfidfVectorizer(stop_words='english')
v = vectorizer.fit_transform(corpus)
names = vectorizer.get_feature_names()
dense = v.todense()
final_list = dense.tolist()
df = pd.DataFrame(final_list, columns=names)
Cloud = WordCloud(background_color="white", max_words=50).generate_from_frequencies(df.T.sum(axis=1))
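To actually render or save the cloud, the usual matplotlib pattern works (assuming matplotlib is installed):

import matplotlib.pyplot as plt

plt.imshow(Cloud, interpolation='bilinear')  # Cloud is the WordCloud built above
plt.axis('off')
plt.show()
# or save it straight to disk: Cloud.to_file('hashtag_cloud.png')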
I will suppose that you have the ID of each tweet. You need to send a GET request to this URL:
https://twitter.com/i/api/graphql/6n-3uwmsFr53-5z_w5FTVw/TweetDetail?variables=%7B%22focalTweetId<YOUR_TWEET_ID> with_rux_injections%22%3Afalse%2C%22includePromotedContent%22%3Atrue%2C%22withCommunity%22%3Atrue%2C%22withQuickPromoteEligibilityTweetFields%22%3Atrue%2C%22withBirdwatchNotes%22%3Afalse%2C%22withSuperFollowsUserFields%22%3Atrue%2C%22withDownvotePerspective%22%3Afalse%2C%22withReactionsMetadata%22%3Afalse%2C%22withReactionsPerspective%22%3Afalse%2C%22withSuperFollowsTweetFields%22%3Atrue%2C%22withVoice%22%3Atrue%2C%22withV2Timeline%22%3Atrue%2C%22__fs_responsive_web_like_by_author_enabled%22%3Afalse%2C%22__fs_dont_mention_me_view_api_enabled%22%3Atrue%2C%22__fs_interactive_text_enabled%22%3Atrue%2C%22__fs_responsive_web_uc_gql_enabled%22%3Afalse%2C%22__fs_responsive_web_edit_tweet_api_enabled%22%3Afalse%7D
"
Note: the url looks not good because of the line breaks but
hopefully you understood
and the very first kind of param is focalTweetId, which is the id of the tweet, this API call will return a data object where you'll find all infos about a tweet
const response = await fetch(url)
console.log(response.data.instructions[0].entries[0].content.itemContent.tweet_results.result.legacy.retweet_count)
I did this in JavaScript, so you can do the same in Python with the requests library:
import requests
response = requests.get(url)
# ...
This will return the retweet_count, and there is a lot of other useful data you can use.
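A rough Python sketch of the same lookup; the JSON path below is copied from the JavaScript example above and is not an officially documented structure, and in practice the endpoint also expects authorization headers (bearer and guest tokens), so treat this purely as an outline:

import requests

url = "<the TweetDetail URL shown above, with your tweet ID filled in>"
response = requests.get(url)  # real calls will usually also need auth headers
payload = response.json()

# Same nesting as in the JavaScript example above (assumed, not documented)
legacy = (payload["data"]["instructions"][0]["entries"][0]
          ["content"]["itemContent"]["tweet_results"]["result"]["legacy"])
print(legacy["retweet_count"])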

how to extract specific string from api response in python?

How do I extract "e2f64fd6-13aa-5c6c-932a-c366a4f56076" from the API response below in Python?
{"message": "Rendition service output e2f64fd6-13aa-5c6c-932a-c366a4f56076/ae8f5aae-4d6a-5a17-9f95-d918634a668c has been created successfully."}
Assuming the message always follows the same format, there are several ways.
First I save the message in a variable d so I can work with it:
d = {"message": "Rendition service output e2f64fd6-13aa-5c6c-932a-c366a4f56076/ae8f5aae-4d6a-5a17-9f95-d918634a668c has been created successfully."}
Solution 1:
d['message'][25:].split('/')[0]
'e2f64fd6-13aa-5c6c-932a-c366a4f56076'
Solution 2 (I like this one more):
d['message'].split(' ')[3].split('/')[0]
'e2f64fd6-13aa-5c6c-932a-c366a4f56076'
If the format of the API response is fixed, you can use regex to extract your data.
import re

message = "Rendition service output e2f64fd6-13aa-5c6c-932a-c366a4f56076/ae8f5aae-4d6a-5a17-9f95-d918634a668c has been created successfully."
m = re.search('output (.+?)/', message)
if m:
    print(m.group(1))
    # prints e2f64fd6-13aa-5c6c-932a-c366a4f56076
If you don't want to use regex you could do:
start = message.find('output ') + len('output ')  # index of the first character after 'output '
end = message.find('/', start)
print(message[start:end])
# prints e2f64fd6-13aa-5c6c-932a-c366a4f56076
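If you'd rather not depend on the surrounding words at all, you could match the UUID shape directly; this sketch assumes the IDs always look like standard lowercase hex UUIDs:

import re

uuid_pattern = r'[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}'
ids = re.findall(uuid_pattern, d['message'])
print(ids[0])  # 'e2f64fd6-13aa-5c6c-932a-c366a4f56076'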

Parsing mixed text file with a lot header info

I'm an amateur astronomer (and retired) and just messing around with an idea. I want to scrape data from a NASA website that is a text file and extract specific values to help me identify when to observe. The text file automatically updates every 60 seconds. The text file has some header information before the actual rows and columns of data that I need to process. The actual data is numerical. Example:
Prepared by NASA
Please send comments and suggestions to xxx.com
Date Time Year
yr mo da hhmm day value1 value2
2019 03 31 1933 234 6.00e-09 1.00e-09
I want to access the string numerical data and convert it into a double
From what I can see the file is space delimited
I want to poll the website every 60 seconds and if Value 1 and Value 2 are above a specific threshold that will trigger a PyAutoGUI to automate a software application to take an image.
After reading the file from the website I tried converting it into a dictionary, thinking that I could then map keys to values, but I can't predict the exact location of the values I need. I thought that once I extracted the values I would write them out and then convert the strings into doubles or floats.
I have tried to use
import re
re.split
to read each line and split out the info, but I get a huge mess because of the header information.
I wanted to use a simple approach to open the file and this works
import urllib
import csv
data = urllib.urlopen("https://services.swpc.noaa.gov/text/goes-xray-flux-primary.txt").read()
print (data)
I found this on Stack Overflow but I don't understand how I would use it:
file = open('abc.txt', 'r')
while 1:
    a = file.readline()
    if a == '': break
    a = a.split()  # This creates a list of the input
    name = a[0]
    value = int(a[1])  # or value=float(a[1]), whatever you want
    # use the name and value howsoever
file.close()
What I want is to extract Value 1 and Value 2 as doubles or floats. Then, in Part II (which I have not even started yet), I will compare Value 1 and Value 2 against a specific threshold, and if they exceed it, trigger PyAutoGUI to interact with my imaging software and take an image.
Here's a simple example of using regular expressions. This assumes you'd read the whole file into memory with a single f.read() rather than bothering to process individual lines, which, with regular expressions, is often the simpler way to go (and I'm lazy and didn't want to have to create a test file):
import re

data = """
blah
blah
yr mo da hhmm day value1 value2
2019 03 31 1933 234 6.00e-09 1.00e-09
blah
"""

pattern = re.compile(r"(\d+) (\d+) (\d+) (\d+) (\d+) ([^\s]+) ([^\s]+)")

def main():
    m = pattern.search(data)
    if m:
        # Do whatever processing you want to do here. You have access to all 7 input
        # fields via m.group(1-7)
        d1 = float(m.group(6))
        d2 = float(m.group(7))
        print(">{}< >{}<".format(d1, d2))
    else:
        print("no match")

main()
Output:
>6e-09< >1e-09<
You'd want to tweak this a bit if I've made a wrong assumption about the input data, but it gives you the general idea. It should handle just about anything else in the input, as long as nothing else looks like the one line you're interested in.
UPDATE:
I can't leave well enough alone. Here's code that pulls the data from the URL you provide and processes all the matching lines:
import re
import urllib

pattern = re.compile(r"(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+([^\s]+)\s+([^\s]+)")

def main():
    data = urllib.urlopen("https://services.swpc.noaa.gov/text/goes-xray-flux-primary.txt").read()
    pos = 0
    while True:
        m = pattern.search(data, pos)
        if not m:
            break
        pos = m.end()
        # Do whatever processing you want to do here. You have access to all 8 input
        # fields via m.group(1-8)
        f1 = float(m.group(7))
        f2 = float(m.group(8))
        print(">{}< >{}<".format(f1, f2))

main()
Result:
>9.22e-09< >1e-09<
>1.06e-08< >1e-09<
...
>8.99e-09< >1e-09<
>1.01e-08< >1e-09<
This was a fun little challenge. I've pulled all the data out of the table for you, mapped it to a class, and converted the values to int and Decimal as appropriate. Once it's populated you can read whatever data you want from it.
To get the data I've used the requests library instead of urllib; that's merely personal preference. You could pip install requests if you want to use it too. It has a method iter_lines that can traverse the rows of data.
This may be overkill for what you need, but as I wrote it anyway I thought I'd post it for you.
import re
from datetime import datetime
from decimal import Decimal

import requests


class SolarXrayFluxData:
    def __init__(
        self,
        year,
        month,
        day,
        time,
        modified_julian_day,
        seconds_of_the_day,
        short,
        long
    ):
        self.date = datetime(
            int(year), int(month), int(day), hour=int(time[:2]), minute=int(time[2:])
        )
        self.modified_julian_day = int(modified_julian_day)
        self.seconds_of_the_day = int(seconds_of_the_day)
        self.short = Decimal(short)
        self.long = Decimal(long)


class GoesXrayFluxPrimary:
    def __init__(self):
        self.created_at = ''
        self.data = []

    def extract_data(self, url):
        data = requests.get(url)
        for i, line in enumerate(data.iter_lines(decode_unicode=True)):
            if line[0] in [':', '#']:
                if i == 1:
                    self.set_created_at(line)
                continue
            row_data = re.findall(r"(\S+)", line)
            self.data.append(SolarXrayFluxData(*row_data))

    def set_created_at(self, line):
        date_str = re.search(r'\d{4}\s\w{3}\s\d{2}\s\d{4}', line).group(0)
        self.created_at = datetime.strptime(date_str, '%Y %b %d %H%M')


if __name__ == '__main__':
    goes_xray_flux_primary = GoesXrayFluxPrimary()
    goes_xray_flux_primary.extract_data('https://services.swpc.noaa.gov/text/goes-xray-flux-primary.txt')
    print("Created At: %s" % goes_xray_flux_primary.created_at)
    for row in goes_xray_flux_primary.data:
        print(row.date)
        print("%.12f, %.12f" % (row.short, row.long))
The intention of the SolarXrayFluxData class is to store each item's data and make sure it is in a nice usable format, while the GoesXrayFluxPrimary class is used to populate a list of SolarXrayFluxData and to store any other data you might want to pull out. For example, I've grabbed the created date and time. You could also get the Location and Source from the header data.
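For Part II, a rough sketch of the 60-second polling loop could sit on top of the class above; the threshold value and the PyAutoGUI action are placeholders you would fill in yourself:

import time

THRESHOLD = 1e-6  # hypothetical flux threshold - pick a value that suits your observing

while True:
    goes = GoesXrayFluxPrimary()
    goes.extract_data('https://services.swpc.noaa.gov/text/goes-xray-flux-primary.txt')
    latest = goes.data[-1]  # most recent row in the table
    if latest.short > THRESHOLD and latest.long > THRESHOLD:
        print("Threshold exceeded at", latest.date)
        # trigger your imaging software here, e.g. via PyAutoGUI
    time.sleep(60)  # the file updates every 60 seconds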

problem while storing tweets into csv file

I am working with Python attempting to store tweets (more precisely only their date, user, bio and text) related to a specific keyword in a csv file.
As I am working on the free-to-use API of Twitter, I am limited to 450 tweets every 15 minutes.
So I have coded something which is supposed to store exactly 450 tweets in 15 minutes.
But something goes wrong when extracting the tweets, so that at a certain point the same tweet is stored again and again.
Any help would be much appreciated !!
Thanks in advance
import time
from twython import Twython, TwythonError, TwythonStreamer

twitter = Twython(CONSUMER_KEY, CONSUMER_SECRET)

sfile = "tweets_" + keyword + todays_date + ".csv"
id_list = [last_id]
count = 0

while count < 3*60*60*2:  # we set the loop to run for 3 hours
    # tweet extract method with the last list item as the max_id
    print("new crawl, max_id:", id_list[-1])
    tweets = twitter.search(q=keyword, count=2, max_id=id_list[-1])["statuses"]
    time.sleep(2)  # 2 seconds rest between api calls (450 allowed within 15min window)
    for status in tweets:
        id_list.append(status["id"])  # append tweet id's
        if status == tweets[0]:
            continue
        if status == tweets[1]:
            date = status["created_at"].encode('utf-8')
            user = status["user"]["screen_name"].encode('utf-8')
            bio = status["user"]["description"].encode('utf-8')
            text = status["text"].encode('utf-8')
            with open(sfile, 'a') as sf:
                sf.write(str(status["id"]) + "|||" + str(date) + "|||" + str(user) + "|||" + str(bio) + "|||" + str(text) + "\n")
    count += 1
    print(count)
    print(date, text)
You should use Python's CSV library to write your CSV files. It takes a list containing all of the items for a row and automatically adds the delimiters for you. If a value contains a comma, it automatically adds quotes for you (which is how CSV files are meant to work). It can even handle newlines inside a value. If you open the resulting file into a spreadsheet application you will see it is correctly read in.
Rather than trying to use time.sleep(), a better approach is to work with absolute times. So the idea is to take your starting time and add three hours to it. You can then keep looping until this finish_time is reached.
The same approach can be made to your API call allocations. Keep a counter holding how many calls you have left and downcount it. If it reaches 0 then stop making calls until the next fifteen minute slot is reached.
timedelta() can be used to add minutes or hours to an existing datetime object. By doing it this way, your times will never slip out of sync.
The following shows a simulation of how you can make things work. You just need to add back your code to get your Tweets:
from datetime import datetime, timedelta
import time
import csv
import random    # just for simulating a random ID

fifteen = timedelta(minutes=15)
finish_time = datetime.now() + timedelta(hours=3)

calls_allowed = 450
calls_remaining = calls_allowed

now = datetime.now()
next_allocation = now + fifteen
todays_date = now.strftime("%d_%m_%Y")

ids_seen = set()

with open(f'tweets_{todays_date}.csv', 'w', newline='') as f_output:
    csv_output = csv.writer(f_output)

    while now < finish_time:
        time.sleep(2)
        now = datetime.now()

        if now >= next_allocation:
            next_allocation += fifteen
            calls_remaining = calls_allowed
            print("New call allocation")

        if calls_remaining:
            calls_remaining -= 1
            print(f"Get tweets - {calls_remaining} calls remaining")

            # Simulate a tweet response
            id = random.choice(["1111", "2222", "3333", "4444"])   # pick a random ID
            date = "01.01.2019"
            user = "Fred"
            bio = "I am Fred"
            text = "Hello, this is a tweet\nusing a comma and a newline."

            if id not in ids_seen:
                csv_output.writerow([id, date, user, bio, text])
                ids_seen.add(id)
As for the problem of writing the same tweets over and over: you could use a set() to hold all of the IDs you have already written, and then test whether a new tweet has been seen before writing it again.
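To read the rows back later, csv.reader undoes the quoting and handles the embedded newlines for you; this small sketch assumes the filename your run produced:

import csv

with open('tweets_01_01_2019.csv', newline='') as f_input:  # hypothetical filename
    for tweet_id, date, user, bio, text in csv.reader(f_input):
        print(tweet_id, user, text)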

Google search from Python program

I'm trying to take an input file, read each line, search Google with that line, and print all the search results from the query, but ONLY IF the result is from a specific website. A simple example to illustrate my point: if I search "dog" I only want results printed from Wikipedia, whether that be one result or ten results from Wikipedia. My problem is I've been getting really weird results. Below is my Python code, which contains a specific URL I want results from.
My program
inputFile = open("small.txt", 'r')    # Makes File object
outputFile = open("results1.txt", "w")
dictionary = {}                       # Our "hash table"
compare = "www.someurl.com/"          # urls will compare against this string

from googlesearch import GoogleSearch

for line in inputFile.read().splitlines():
    lineToRead = line
    dictionary[lineToRead] = []  # initialized to empty list
    gs = GoogleSearch(lineToRead)
    for url in gs.top_urls():
        print url                # check to make sure this is printing URLs
        compare2 = url
        if compare in compare2:  # compare the two URLs, if they match
            dictionary[lineToRead].append(url)  # append EACH url that matches to the query string's dictionary key

inputFile.close()

for i in dictionary:
    print i  # this print is a test that shows what the query was in google (dictionary key)
    outputFile.write(i + "\n")
    for j in dictionary[i]:
        print j  # this print is a test that shows the results from the query, e.g. "www.medicaldepartmentstore.com/..." (dictionary values)
        outputFile.write(j + "\n")  # write results for the query string to the output file.
My output file is incorrect; the way it's supposed to be formatted is:
query string
http://www.
http://www.
http://www.
query string
http://www.
query string
http://www.medical...
http://www.medical...
Can you limit the scope of the results to the specific site (e.g. wikipedia) at the time of the query? For example, using:
gs = GoogleSearch("site:wikipedia.com %s" % query) #as shown in https://pypi.python.org/pypi/googlesearch/0.7.0
This would instruct Google to return only the results from that domain, so you won't need to filter them after seeing the results.
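Applied to your loop, that would look roughly like this; it keeps your existing GoogleSearch/top_urls usage and just prefixes the query with the site: operator (the domain below is an example):

compare = "wikipedia.com"  # the only domain you want results from

for line in inputFile.read().splitlines():
    gs = GoogleSearch("site:%s %s" % (compare, line))
    dictionary[line] = [url for url in gs.top_urls()]  # already filtered by Google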
I think #Cahit has the right idea. The only reason you would be getting lines of just the query string is that the domain you were looking for wasn't in top_urls(). You can verify this by checking whether the list stored in the dictionary for a given key is empty:
for i in dictionary:
    outputFile.write("%s: " % str(i))
    if len(dictionary[i]) == 0:
        outputFile.write("No results in top_urls\n")
    else:
        outputFile.write("%s\n" % ", ".join(dictionary[i]))
