I am working in Python, trying to store tweets (more precisely, only their date, user, bio and text) related to a specific keyword in a CSV file.
As I am working with Twitter's free API, I am limited to 450 tweets every 15 minutes.
So I have written something that is supposed to store exactly 450 tweets in 15 minutes.
BUT something goes wrong when extracting the tweets, so that at a certain point the same tweet is stored again and again.
Any help would be much appreciated!
Thanks in advance
import time
from twython import Twython, TwythonError, TwythonStreamer

twitter = Twython(CONSUMER_KEY, CONSUMER_SECRET)

sfile = "tweets_" + keyword + todays_date + ".csv"
id_list = [last_id]
count = 0

while count < 3*60*60*2:  # we set the loop to run for 3 hours
    # tweet extract method with the last list item as the max_id
    print("new crawl, max_id:", id_list[-1])
    tweets = twitter.search(q=keyword, count=2, max_id=id_list[-1])["statuses"]
    time.sleep(2)  # 2 seconds rest between api calls (450 allowed within 15min window)

    for status in tweets:
        id_list.append(status["id"])  # append tweet id's

        if status == tweets[0]:
            continue
        if status == tweets[1]:
            date = status["created_at"].encode('utf-8')
            user = status["user"]["screen_name"].encode('utf-8')
            bio = status["user"]["description"].encode('utf-8')
            text = status["text"].encode('utf-8')

            with open(sfile, 'a') as sf:
                sf.write(str(status["id"]) + "|||" + str(date) + "|||" + str(user) + "|||" + str(bio) + "|||" + str(text) + "\n")

            count += 1
            print(count)
            print(date, text)
You should use Python's csv library to write your CSV files. It takes a list containing all of the items for a row and adds the delimiters for you. If a value contains a comma, it automatically quotes it (which is how CSV files are meant to work), and it can even handle newlines inside a value. If you open the resulting file in a spreadsheet application you will see it is read in correctly.
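For example, a minimal sketch of writing one row with csv.writer (the filename and field values here are made up for illustration):

import csv

# A value containing a comma and a newline is quoted automatically.
row = ["12345", "Mon Apr 01 10:00:00 +0000 2019", "some_user",
       "Bio with a comma, and\na newline", "Tweet text"]

with open("tweets.csv", "w", newline="") as f:
    csv.writer(f).writerow(row)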
Rather than relying on time.sleep() alone, a better approach is to work with absolute times. The idea is to take your starting time and add three hours to it; you can then keep looping until this finish_time is reached.
The same approach can be applied to your API call allocation. Keep a counter holding how many calls you have left and decrement it; if it reaches 0, stop making calls until the next fifteen-minute slot is reached.
timedelta() can be used to add minutes or hours to an existing datetime object. By doing it this way, your times will never drift out of sync.
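For instance, the two deadlines can be computed once up front (a small illustration; the full simulation below uses the same idea):

from datetime import datetime, timedelta

start = datetime.now()
finish_time = start + timedelta(hours=3)          # absolute end of the run
next_allocation = start + timedelta(minutes=15)   # start of the next 15-minute window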
The following shows a simulation of how you can make things work. You just need to add back your code to get your Tweets:
from datetime import datetime, timedelta
import time
import csv
import random    # just for simulating a random ID

fifteen = timedelta(minutes=15)
finish_time = datetime.now() + timedelta(hours=3)

calls_allowed = 450
calls_remaining = calls_allowed

now = datetime.now()
next_allocation = now + fifteen
todays_date = now.strftime("%d_%m_%Y")
ids_seen = set()

with open(f'tweets_{todays_date}.csv', 'w', newline='') as f_output:
    csv_output = csv.writer(f_output)

    while now < finish_time:
        time.sleep(2)
        now = datetime.now()

        if now >= next_allocation:
            next_allocation += fifteen
            calls_remaining = calls_allowed
            print("New call allocation")

        if calls_remaining:
            calls_remaining -= 1
            print(f"Get tweets - {calls_remaining} calls remaining")

            # Simulate a tweet response
            id = random.choice(["1111", "2222", "3333", "4444"])    # pick a random ID
            date = "01.01.2019"
            user = "Fred"
            bio = "I am Fred"
            text = "Hello, this is a tweet\nusing a comma and a newline."

            if id not in ids_seen:
                csv_output.writerow([id, date, user, bio, text])
                ids_seen.add(id)
As for the problem of writing the same tweets over and over: you could use a set() to hold all of the IDs that you have already written, and then test whether a new tweet's ID has been seen before writing it again, as the ids_seen set does in the code above.
I have several thousand PDFs which I need to rename based on their content. The layouts of the PDFs are inconsistent. To rename them I need to locate the string "MEMBER"; I need the value after "MEMBER" and the values from the two lines above it, which are the time and date values.
So:
STUFF
STUFF
STUFF
DD/MM/YY
HH:MM:SS
MEMBER ######
STUFF
STUFF
STUFF
I have been using regex101.com and have ((.*(\n|\r|\r\n)){2})(MEMBER.\S+), which matches all of the values I need, but it puts them across four groups, with group 3 just showing a carriage return.
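For illustration, here is roughly what those groups capture on a made-up sample (the values are placeholders, not from my real PDFs):

import re

sample = "STUFF\nDD/MM/YY\nHH:MM:SS\nMEMBER 123456\nSTUFF\n"
m = re.search(r'((.*(\n|\r|\r\n)){2})(MEMBER.\S+)', sample)
print(repr(m.group(1)))  # 'DD/MM/YY\nHH:MM:SS\n'  -> the two lines above MEMBER
print(repr(m.group(2)))  # 'HH:MM:SS\n'            -> only the last repetition
print(repr(m.group(3)))  # '\n'                    -> just the line break
print(repr(m.group(4)))  # 'MEMBER 123456'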
What I have so far looks like this:
import fitz
from os import DirEntry, curdir, chdir, getcwd, rename
from glob import glob as glob
import re

failed_pdfs = []
count = 0
pdf_regex = r'((.*(\n|\r|\r\n)){2})(MEMBER.\S+)'
text = ""
get_curr = getcwd()
directory = 'PDF_FILES'
chdir(directory)
pdf_list = glob('*.pdf')

for pdf in pdf_list:
    with fitz.open(pdf) as pdf_obj:
        for page in pdf_obj:
            text += page.get_text()

    new_file_name = re.search(pdf_regex, text).group().strip().replace(":", "").replace("-", "") + '.pdf'
    text = ""
    #clean_new_file_name = new_file_name.translate({(":"): None})
    print(new_file_name)

    # Tries to rename a pdf. If the filename doesn't already exist
    # then rename. If it does exist then throw an error and add to failed list
    try:
        rename(pdf, new_file_name)
    except WindowsError:
        count += 1
        failed_pdfs.append(str(count) + ' - FAILED TO RENAME: [' + pdf + " ----> " + str(new_file_name) + "]")
If I specify a group in the re.search portion, for instance group 4, which contains the MEMBER ##### value, then the file renames successfully with just that value. Similarly, group 2 renames with the TIME value. I think the multiple lines are preventing it from using all of the values I need. When I run it with group(), the printed value shows as
DATE
TIME
MEMBER ######.pdf
And the log count reflects the failures.
I am very new at this, and stumbling around trying to piece together how things work. Is the issue with how I built the regex or with the re.search portion? Or everything?
I have tried re-doing the Regular Expression, but I end up with multiple lines in the results, which seems connected to the rename failure.
The strategy is to read the page's text by words and sort them appropriately.
If we then find "MEMBER", the word following it represents the hashes ####, and the two preceding words must be the time and the date, respectively.
found = False
for page in pdf_obj:
    words = page.get_text("words", sort=True)
    # all space-delimited strings on the page, sorted vertically,
    # then horizontally
    for i, word in enumerate(words):
        if word[4] == "MEMBER":
            hashes = words[i + 1][4]       # this must be the word after MEMBER!
            time_string = words[i - 1][4]  # the time
            date_string = words[i - 2][4]  # the date
            found = True
            break
    if found:  # no need to look at other pages
        break
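For completeness, a sketch of how this could slot back into the original rename loop (the extra replace('/', '') is my assumption, since the slashes in the date are not valid in Windows filenames):

import fitz
from glob import glob
from os import rename

def member_name(pdf_path):
    # Return "DATE TIME MEMBER ####" for the first page that contains MEMBER, else None.
    with fitz.open(pdf_path) as pdf_obj:
        for page in pdf_obj:
            words = page.get_text("words", sort=True)
            for i, word in enumerate(words):
                if word[4] == "MEMBER":
                    return " ".join([words[i - 2][4], words[i - 1][4], "MEMBER", words[i + 1][4]])
    return None

for pdf in glob('*.pdf'):
    name = member_name(pdf)
    if name:
        new_file_name = name.replace(":", "").replace("-", "").replace("/", "") + ".pdf"
        try:
            rename(pdf, new_file_name)
        except OSError:
            print("FAILED TO RENAME:", pdf, "---->", new_file_name)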
I have a companies.json file with 3.6 million records (every record contains an id and a VAT number) and an events.json file with 76,000 records (each with around 20 properties). I wrote a script that does the following steps:
Open both JSON files
Loop through the 76,000 event records (each one is a dict)
Check if the status of the event is new
If the status is new, check if the event has a companyID
If the event has a companyID, loop through the 3.6 million company records to find the matching company ID
Check if the matching company record has a VAT number
Replace the companyID with the VAT number and add a companyIDIsVat boolean
When all the looping is done, write the events to a new JSON file.
The script works, but it takes 6-7 hours to complete. Is there a way to speed it up?
Current script
import json

counter = 0

with open('companies.json', 'r') as companiesFile:
    with open('events.json', 'r') as eventsFile:
        events = json.load(eventsFile)
        companies = json.load(companiesFile)

        for index, event in enumerate(events):
            print('Counter: ' + str(index))

            if 'status' in event:
                if(event['status'] == 'new'):
                    if 'companyID' in event:
                        for company in companies:
                            if(event['companyID'] == company['_id']):
                                if 'vat' in company:
                                    event['companyID'] = company['vat']
                                    event['companyIDIsVat'] = 1
                                    counter = counter + 1
                                    print('Found matches: ' + str(counter))

with open('new_events.json', 'w', encoding='utf-8') as f:
    json.dump(events, f, ensure_ascii=False, indent=4)
So, the problem is that you are repeatedly searching through the entire companies list. Lists are inefficient for searching here because every lookup is a linear scan, i.e. O(N). You can get constant-time lookups if you use a dict instead, assuming company['_id'] is unique. Basically, you want to index on your IDs. For constant-time lookups, use a dictionary, i.e. a map (a hash map in CPython, and probably in every Python implementation):
import json

counter = 0

with open('companies.json', 'r') as companiesFile:
    with open('events.json', 'r') as eventsFile:
        events = json.load(eventsFile)
        companies = {
            c["_id"]: c for c in json.load(companiesFile)
        }

        for index, event in enumerate(events):
            print('Counter: ' + str(index))

            if 'status' in event:
                if (
                    event['status'] == 'new'
                    and 'companyID' in event
                    and event['companyID'] in companies
                ):
                    company = companies[event['companyID']]
                    if 'vat' in company:
                        event['companyID'] = company['vat']
                        event['companyIDIsVat'] = 1
                        counter = counter + 1
                        print('Found matches: ' + str(counter))

with open('new_events.json', 'w', encoding='utf-8') as f:
    json.dump(events, f, ensure_ascii=False, indent=4)
This is a minimal modification to your script.
You should probably just save companies.json in the appropriate structure.
Again, this assumes the companies are unique by ID. If they are not, you could use a dictionary of lists instead, and it should still be significantly faster as long as there aren't many repeated IDs.
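For example, a minimal sketch of that dictionary-of-lists variant, plus saving the re-keyed structure for later runs (the companies_by_id.json filename is just a placeholder):

import json
from collections import defaultdict

# Index companies by _id, tolerating duplicate IDs.
with open('companies.json', 'r') as companies_file:
    companies_by_id = defaultdict(list)
    for company in json.load(companies_file):
        companies_by_id[company['_id']].append(company)

def vat_for(company_id):
    # Return the first VAT number found for this ID, or None.
    for company in companies_by_id.get(company_id, []):
        if 'vat' in company:
            return company['vat']
    return None

# Optionally persist the indexed structure so later runs can skip the rebuild.
with open('companies_by_id.json', 'w', encoding='utf-8') as f:
    json.dump(companies_by_id, f, ensure_ascii=False)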
I'm an amateur astronomer (and retired), just messing around with an idea. I want to scrape data from a NASA website that serves a text file and extract specific values to help me identify when to observe. The text file automatically updates every 60 seconds. It has some header information before the actual rows and columns of data that I need to process, and the actual data is numerical. Example:
Prepared by NASA
Please send comments and suggestions to xxx.com
Date Time Year
yr mo da hhmm day value1 value2
2019 03 31 1933 234 6.00e-09 1.00e-09
I want to access the numerical string data and convert it into a double.
From what I can see, the file is space-delimited.
I want to poll the website every 60 seconds, and if Value 1 and Value 2 are above a specific threshold, that will trigger PyAutoGUI to automate a software application to take an image.
After reading the file from the website I tried converting it into a dictionary, thinking that I could then map keys to values, but I can't predict the exact location of the values I need. I thought that once I extracted the values I would write them to a file and then try to convert the strings into doubles or floats.
I have tried to use
import re
re.split
to read each line and split out the info, but I get a huge mess because of the header information.
I wanted to use a simple approach to open the file, and this works:
import urllib
import csv
data = urllib.urlopen("https://services.swpc.noaa.gov/text/goes-xray-flux-primary.txt").read()
print (data)
I found this on Stack Overflow, but I don't understand how I would use it:
file = open('abc.txt', 'r')
while 1:
    a = file.readline()
    if a == '':
        break
    a = a.split()  # This creates a list of the input
    name = a[0]
    value = int(a[1])  # or value = float(a[1]), whatever you want
    # use the name and value howsoever
file.close()
What I want is to extract Value 1 and Value 2 as a double or float. Then, in Part II (which I have not yet started), I will compare Value 1 and Value 2, and if they are above a specific threshold this will trigger PyAutoGUI to interact with my imaging software and take an image.
Here's a simple example of using regular expressions. This assumes you'd read the whole file into memory with a single f.read() rather than bothering to process individual lines, which, with regular expressions, is often the simpler way to go (and I'm lazy and didn't want to have to create a test file):
import re

data = """
blah
blah
yr mo da hhmm day value1 value2
2019 03 31 1933 234 6.00e-09 1.00e-09
blah
"""

pattern = re.compile(r"(\d+) (\d+) (\d+) (\d+) (\d+) ([^\s]+) ([^\s]+)")

def main():
    m = pattern.search(data)
    if m:
        # Do whatever processing you want to do here. You have access to all 7 input
        # fields via m.group(1-7)
        d1 = float(m.group(6))
        d2 = float(m.group(7))
        print(">{}< >{}<".format(d1, d2))
    else:
        print("no match")

main()
Output:
>6e-09< >1e-09<
You'd want to tweak this a bit if I've made a wrong assumption about the input data, but this gives you the general idea anyway.
This should handle just about anything else that exists in the input, as long as nothing else in it looks like the one line you're interested in.
UPDATE:
I can't leave well enough alone. Here's code that pulls the data from the URL you provided and processes all the matching lines:
import re
import urllib

pattern = re.compile(r"(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+([^\s]+)\s+([^\s]+)")

def main():
    data = urllib.urlopen("https://services.swpc.noaa.gov/text/goes-xray-flux-primary.txt").read()
    pos = 0
    while True:
        m = pattern.search(data, pos)
        if not m:
            break
        pos = m.end()
        # Do whatever processing you want to do here. You have access to all 8 input
        # fields via m.group(1-8)
        f1 = float(m.group(7))
        f2 = float(m.group(8))
        print(">{}< >{}<".format(f1, f2))

main()
Result:
>9.22e-09< >1e-09<
>1.06e-08< >1e-09<
...
>8.99e-09< >1e-09<
>1.01e-08< >1e-09<
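Since the end goal is to poll every 60 seconds and compare against thresholds, here is a minimal sketch of that outer loop (using Python 3's urllib.request; the threshold values and the trigger message are placeholders, not values from the question):

import re
import time
import urllib.request

URL = "https://services.swpc.noaa.gov/text/goes-xray-flux-primary.txt"
THRESHOLD_1 = 1e-6   # placeholder threshold for value 1
THRESHOLD_2 = 1e-7   # placeholder threshold for value 2
pattern = re.compile(r"(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\S+)\s+(\S+)")

def latest_values():
    data = urllib.request.urlopen(URL).read().decode("ascii", errors="replace")
    matches = pattern.findall(data)
    if not matches:
        return None
    # The last matching row in the file is the most recent reading.
    return float(matches[-1][6]), float(matches[-1][7])

while True:
    values = latest_values()
    if values and values[0] > THRESHOLD_1 and values[1] > THRESHOLD_2:
        print("Thresholds exceeded - trigger the imaging software here (e.g. with pyautogui)")
    time.sleep(60)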
This was a fun little challenge. I've pulled all the data out of the table for you, mapped it to a class, and converted the values to int and Decimal as appropriate. Once it's populated you can read whatever data you want from it.
To get the data I've used the requests library instead of urllib; that's merely personal preference, and you could pip install requests if you want to use it too. It has an iter_lines method that can traverse the rows of data.
This may be overkill for what you need, but as I wrote it anyway I thought I'd post it for you.
import re
from datetime import datetime
from decimal import Decimal

import requests


class SolarXrayFluxData:
    def __init__(
        self,
        year,
        month,
        day,
        time,
        modified_julian_day,
        seconds_of_the_day,
        short,
        long
    ):
        self.date = datetime(
            int(year), int(month), int(day), hour=int(time[:2]), minute=int(time[2:])
        )
        self.modified_julian_day = int(modified_julian_day)
        self.seconds_of_the_day = int(seconds_of_the_day)
        self.short = Decimal(short)
        self.long = Decimal(long)


class GoesXrayFluxPrimary:
    def __init__(self):
        self.created_at = ''
        self.data = []

    def extract_data(self, url):
        data = requests.get(url)
        for i, line in enumerate(data.iter_lines(decode_unicode=True)):
            if line[0] in [':', '#']:
                if i == 1:
                    self.set_created_at(line)
                continue
            row_data = re.findall(r"(\S+)", line)
            self.data.append(SolarXrayFluxData(*row_data))

    def set_created_at(self, line):
        date_str = re.search(r'\d{4}\s\w{3}\s\d{2}\s\d{4}', line).group(0)
        self.created_at = datetime.strptime(date_str, '%Y %b %d %H%M')


if __name__ == '__main__':
    goes_xray_flux_primary = GoesXrayFluxPrimary()
    goes_xray_flux_primary.extract_data('https://services.swpc.noaa.gov/text/goes-xray-flux-primary.txt')
    print("Created At: %s" % goes_xray_flux_primary.created_at)
    for row in goes_xray_flux_primary.data:
        print(row.date)
        print("%.12f, %.12f" % (row.short, row.long))
The intention of the SolarXrayFluxData class is to store each item's data and to make sure it is in a nice, usable format, while the GoesXrayFluxPrimary class is used to populate a list of SolarXrayFluxData and to store any other data that you might want to pull out. For example, I've grabbed the created date and time. You could also get the Location and Source from the header data.
I'm trying to make a program that reads in IRC posts and then determines the activity of that channel. Activity would be measured in two ways:
Posts per hour
Change in activity from last time step
The way I want to do it is to create a dictionary of lists. The keys in the dictionary will be timestamps, increasing by exactly one minute. I will then scan over all lines in the IRC log and add each post to the appropriate key given the time and date it was written, allowing me to find out how many posts per minute (and therefore per hour, per day, etc.) were written.
Here is a sample of the log
'[2015-04-22 08:57:36] <Sirisian> ['
And the beginning of my code
# -*- coding: utf-8 -*-
in_file = open('Output.txt')
with in_file as f:
    data = f.readlines()

#create dictionary of one minute timestamps
start = '[2015-04-22 08:57:00]'
min_store = {}

#check time stamp of post and assign to correct key
How can I code something that takes this [2015-04-22 08:57:00] and returns many timestamps, all in 1-minute increments, up to some time many days or months in the future?
Take a look at datetime.timedelta. You can parse your log datestamps into datetime objects, which are much more workable: you can add and subtract datetime.timedelta objects from them to get times in the past or the future. datetime objects are also hashable, so you can use them as dictionary keys directly, and datetime can parse dates and times from strings, so you can load your log that way.
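For example, a short sketch of parsing the sample timestamp, bucketing it to its minute, and stepping it forward:

from datetime import datetime, timedelta

# Parse the bracketed timestamp from a line like '[2015-04-22 08:57:36] <Sirisian> ['
stamp = datetime.strptime('[2015-04-22 08:57:36]', '[%Y-%m-%d %H:%M:%S]')

minute_key = stamp.replace(second=0)             # the dictionary key for that minute
next_minute = minute_key + timedelta(minutes=1)  # one minute later

posts = {}                                       # datetime objects are hashable
posts.setdefault(minute_key, []).append('<Sirisian> [')
print(minute_key, next_minute, posts)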
You should just be adding posts to this dictionary as you see them, rather than guessing at the time the log file ends. This will make your script reusable. You also seem to misunderstand how to open a file. You should also use the datetime module to handle the timestamps. Here is an example:
import re
from datetime import datetime

my_dict = {}
with open('Output.txt') as f:
    for line in f:
        stamp = re.search(r'\[(.+?)\]', line)
        stamp = stamp.group(1)
        d = datetime.strptime(stamp, '%Y-%m-%d %H:%M:%S')
        d = d.replace(second=0)  # datetime objects are immutable, so make a copy with the seconds zeroed
        msg = line.split('] ', 1)[1].rstrip('\n')  # here is where the message goes: everything after the timestamp
        if d in my_dict:
            my_dict[d].append(msg)
        else:
            my_dict[d] = [msg]
Play with that and you should have a reusable script.
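From that dictionary, posts per minute and the change from the previous minute fall out directly; continuing from my_dict above:

# Posts per minute and the change from the previous minute.
previous = 0
for minute in sorted(my_dict):
    current = len(my_dict[minute])
    print(minute, current, current - previous)
    previous = current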
Here is an approach that uses the collections library:
from collections import Counter
import time

cnt = Counter()
date_input_format = '[%Y-%m-%d %H:%M:%S]'
date_output_format = '[%Y-%m-%d %H:%M:00]'
start = time.strptime('[2015-04-22 09:01:00]', date_input_format)

with open('Output.txt') as f:
    for line in f:
        line_date = time.strptime(line[:21], date_input_format)
        if line_date > start:
            cnt[time.strftime(date_output_format, line_date)] += 1

#Sort the keys for presentation
moments = cnt.keys()

#Keep track of how much has changed from minute to minute
previous = 0
for moment in sorted(moments):
    current = cnt[moment]
    print(moment, current, '{0:{1}}'.format((current - previous), '+' if (current - previous) else ''))
    previous = current
With an Output.txt that looks like this:
[2015-04-22 08:57:36] Hi
[2015-04-22 08:57:45] Yo
[2015-04-22 08:58:36] Hey
[2015-04-22 08:58:37] Ho
[2015-04-22 09:01:36] Let's
[2015-04-22 09:57:16] Go
[2015-04-22 09:57:26] Good bye
[2015-04-22 09:58:36] See Ya
[2015-04-22 09:59:36] Peace
You should see the following output:
[2015-04-22 09:01:00] 1 +1
[2015-04-22 09:57:00] 2 +1
[2015-04-22 09:58:00] 1 -1
[2015-04-22 09:59:00] 1 0
I'm trying to make a simple script which tells me when a Twitter account has a new tweet.
import urllib

def CurrentP(array, string):
    count = 0
    for a_ in array:
        if a_ == string:
            return count
        count = count + 1

twitters = ["troyhunt", "codinghorror"]
last = []
site = "http://twitter.com/"

for twitter in twitters:
    source = site + twitter
    for line in urllib.urlopen(source):
        if line.find(twitter + "/status") != -1:
            id = line.split('/')[3]
            if id != last[CurrentP(twitters, twitter)]:
                print "[+] New tweet + " + twitter
                last[CurrentP(twitters, twitter)] = id
But I get this error when I try to run the script:
File "twitter.py", line 16 in ?
for line in urllib.urlopen(source):
TypeError: iteration over non-sequence
What did I do wrong?
Web scraping is not the most economical way of retrieving this data. Twitter provides its own API, which returns data in a nice JSON format that is very easy to parse for the relevant information. Even better, there are many Python libraries that handle the API for you, like Tweepy, making the data extraction quite simple.
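For illustration, a minimal sketch along those lines using Tweepy's classic OAuth flow (the credential strings are placeholders you would fill in from your own Twitter app):

import tweepy

CONSUMER_KEY = "your-consumer-key"              # placeholders, not real values
CONSUMER_SECRET = "your-consumer-secret"
ACCESS_TOKEN = "your-access-token"
ACCESS_TOKEN_SECRET = "your-access-token-secret"

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
api = tweepy.API(auth)

last_seen = {}
for screen_name in ["troyhunt", "codinghorror"]:
    latest = api.user_timeline(screen_name=screen_name, count=1)[0]
    if last_seen.get(screen_name) != latest.id:
        print("[+] New tweet from " + screen_name)
        last_seen[screen_name] = latest.id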