How to send scraped data through reddit bot - python

So I've got this bot that I want to use to reply with the box score of the mets game anytime someone says "mets score" on a specific subreddit. This is my first python project and I plan on using it on a dummy subreddit I created as a learning tool. I'm having trouble sending the scores from the website I scraped through the bot so it can appear in the reply to the "mets score" comments. Any suggestions?
import praw
import time
from lxml import html
import requests
from bs4 import BeautifulSoup
r = praw.Reddit(user_agent = 'my_first_bot')
r.login('user_name', 'password')
def scores():
    soup = BeautifulSoup(requests.get("http://scores.nbcsports.com/mlb/scoreboard.asp?day=20160621&meta=true").content, "lxml")
    table = soup.find("a", class_="teamName", text="NY Mets").find_previous("table")
    a, b = [a.text for a in table.find_all("a", class_="teamName")]
    inn, a_score, b_score = ([td.text for td in row.select("td.shsTotD")] for row in table.find_all("tr"))
    print(" ".join(inn))
    print("{}: {}".format(a, " ".join(a_score)))
    print("{}: {}".format(b, " ".join(b_score)))
words_to_match = ['mets score']
cache = []
def run_bot():
    print("Grabbing subreddit...")
    subreddit = r.get_subreddit("random_subreddit")
    print("Grabbing comments...")
    comments = subreddit.get_comments(limit=40)
    for comment in comments:
        print(comment.id)
        comment_text = comment.body.lower()
        isMatch = any(string in comment_text for string in words_to_match)
        if comment.id not in cache and isMatch:
            print("match found!" + comment.id)
            comment.reply('heres the score to last nights mets game...' scores())
            print("reply successful")
            cache.append(comment.id)
    print("loop finished, goodnight")

while True:
    run_bot()
    time.sleep(120)

I think I'll just put you out of your misery ;). There are multiple issues with your code snippet:
comment.reply('heres the score to last nights mets game...' scores())
The .reply() method requires a string (or an object with a reasonable string representation). Assuming the method scores() returns a string, you should concatenate the two values explicitly, like this:
comment.reply('heres the score to last nights mets game...'+ scores())
It looks like your knowledge of basic Python syntax and constructs is rusty. For a quick refresher see this.
Your method scores() doesn't return anything. It just prints out a bunch of lines (which I assume are for debugging purposes).
def scores():
    soup = BeautifulSoup(requests.get("http://scores.nbcsports.com/mlb/scoreboard.asp?day=20160621&meta=true").content, "lxml")
    .......
    print(" ".join(inn))
    print("{}: {}".format(a, " ".join(a_score)))
    print("{}: {}".format(b, " ".join(b_score)))
Funnily enough, you could use those exact strings as your return value (or something else entirely, as suits your needs), like this:
def scores():
    .......
    inn_string = " ".join(inn)
    a_string = "{}: {}".format(a, " ".join(a_score))
    b_string = "{}: {}".format(b, " ".join(b_score))
    return "\n".join([inn_string, a_string, b_string])
These should get you up and running.
More advice: Have you read the Reddit PRAW docs? You should. You should also probably use praw.helpers.comment_stream(). It's simple and easy to use and will handle retrieving new comments for you. Currently you try to fetch a maximum of 40 comments every 120 seconds. What happens when there are more than that many relevant comments in that 120-second span? You'll end up missing some of the comments you should have replied to. comment_stream() will take care of rate limiting for you, so your bot can reply to each new comment that needs its attention at its own pace. Read more about this here.
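For illustration, here is a rough sketch of run_bot() rewritten around a comment stream. It assumes the same PRAW 3-era API used in the code above (r.login() and praw.helpers.comment_stream()); newer PRAW versions expose subreddit.stream.comments() instead, so treat this as a sketch rather than a drop-in solution:

def run_bot():
    # comment_stream yields new comments as they arrive, so there is no need
    # to poll 40 comments at a time or sleep for 120 seconds between passes
    for comment in praw.helpers.comment_stream(r, "random_subreddit", limit=None):
        comment_text = comment.body.lower()
        if comment.id not in cache and any(s in comment_text for s in words_to_match):
            comment.reply("heres the score to last nights mets game...\n\n" + scores())
            cache.append(comment.id)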

Related

KeyError with Riot API Matchv5 When Trying To Pull Data

I'm trying to pull a list of team and player stats from match IDs. Everything looks fine to me, but when I run my for loops to call the functions that pull the stats I want, they just print the error from my try/except block. I'm still pretty new to Python and this is my first project, so I've tried everything I can think of over the past few days with no luck. I believe the problem is with my actual pull request, but I'm not sure, since I'm also using a GitHub library I found to help me with the Riot API while I change and update it to get the info I want.
def get_match_json(matchid):
    url_pull_match = "https://{}.api.riotgames.com/lol/match/v5/matches/{}/timeline?api_key={}".format(region, matchid, api_key)
    match_data_all = requests.get(url_pull_match).json()
    # Check to make sure match is long enough
    try:
        length_match = match_data_all['frames'][15]
        return match_data_all
    except IndexError:
        return ['Match is too short. Skipping.']
And then this is a shortened version of the stat function:
def get_player_stats(match_data, player):
    # Get player information at the fifteenth minute of the game.
    player_query = match_data['frames'][15]['participantFrames'][player]
    player_team = player_query['teamId']
    player_total_gold = player_query['totalGold']
    player_level = player_query['level']
There are some other functions in the code as well, but I'm not sure whether they are also faulty or whether they are needed to figure out the error. Here is the for loop that calls the request and defines the variable matchid:
for matchid_batch in all_batches:
    match_data = []
    for match_id in matchid_batch:
        time.sleep(1.5)
        if match_id == 'MatchId':
            pass
        else:
            try:
                match_entry = get_match_row(match_id)
                if match_entry[0] == 'Match is too short. Skipping.':
                    print('Match', match_id, "is too short.")
                else:
                    match_entry = get_match_row(match_id).reshape(1, -1)
                    match_data.append(np.array(match_entry))
            except KeyError:
                print('KeyError.')
    match_data = np.array(match_data)
    match_data.shape = -1, 17
    df = pd.DataFrame(match_data, columns=column_titles)
    df.to_csv('Match_data_Diamond.csv', mode='a')
    print('Done Batch!')
Since this is my first project, any help would be appreciated. I can't find any info on this particular subject, so I don't know where to look to work out on my own why it's not working.
I guess your issue is that the 'frames' array is nested inside the 'info' object, so you have to go through 'info' first:
def get_match_json(matchid):
    url_pull_match = "https://{}.api.riotgames.com/lol/match/v5/matches/{}/timeline?api_key={}".format(region, matchid, api_key)
    match_data_all = requests.get(url_pull_match).json()
    try:
        length_match = match_data_all['info']['frames'][15]
        return match_data_all
    except IndexError:
        return ['Match is too short. Skipping.']

def get_player_stats(match_data, player):  # player has to be an int (1-10)
    # Get player information at the fifteenth minute of the game.
    player_query = match_data['info']['frames'][15]['participantFrames'][str(player)]
    # player_team = player_query['teamId'] - it is not possible to get the teamId with this endpoint
    player_total_gold = player_query['totalGold']
    player_level = player_query['level']
    return player_query
This example worked for me. Unfortunately it is not possible to get the teamId through your API endpoint alone. Usually players 1-5 are on team 100 (blue side) and players 6-10 are on team 200 (red side).
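If you need a team label anyway, here is a rough workaround based on that convention (a heuristic added for illustration, not data the endpoint provides):

def guess_team_id(player):
    # player is the participant number 1-10; by the usual convention,
    # 1-5 are blue side (100) and 6-10 are red side (200)
    return 100 if int(player) <= 5 else 200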

How to put a condition that only runs when there is any change in API?

Hello Community Members,
I am very new to the Python language and to programming. Currently I am working with a news API and showing the news from that API. I want this program to check for updates and react whenever there is any change in the API response. Please help me work out what I can do to complete this.
CODE:
import requests

url = 'https://cryptopanic.com/api/v1/posts/?auth_token=<my token>&filter=hot'
html_link = requests.get(url)
datatype = html_link.json()
news_info = datatype['results']
latest_news = news_info[0]['title']
source = news_info[0]['source']['title']
print(latest_news)
I want this latest_news variable, which stores the news title, to print whenever there is a new item in the list. I have tried a comparison approach but haven't found anything that works so far.
Does this fit your criteria? You have to run it every 5 minutes, or whenever you want, and you will get the latest titles.
import requests, json

old_news_info = {"news": []}
try:
    old_news_info = json.load(open("old_news_info.json", "r"))
except:
    pass

url = 'https://cryptopanic.com/api/v1/posts/?auth_token=<token>&filter=hot'
print("waiting for response")
html_link = requests.get(url)
datatype = html_link.json()

if datatype != {'status': 'Incomplete', 'info': 'Token not found'}:
    news_info = datatype['results']
    if not news_info[0] in old_news_info["news"]:
        for news in news_info:
            if news in old_news_info["news"]:
                break
            else:
                old_news_info["news"].append(news)
                print(news["source"]['title'])
    json.dump(old_news_info, open("old_news_info.json", "w"), indent=4)
else:
    print("Token not found")

simple web scraper very slow

I'm fairly new to Python and web scraping in general. The code below works, but it seems awfully slow for the amount of information it's actually going through. Is there any way to easily cut down on execution time? I'm not sure, but it does seem like I have typed out more / made it more complicated than I actually needed to; any help would be appreciated.
Currently the code starts at the sitemap, then iterates through a list of additional sitemaps. Within the new sitemaps it pulls information to construct a URL for the JSON data of a webpage. From the JSON data I pull an XML link that I use to search for a string. If the string is found, the script appends the page to a text file.
import io
import requests
from bs4 import BeautifulSoup

#global variable
start = 'https://www.govinfo.gov/wssearch/getContentDetail?packageId='
dash = '-'
urlSitemap = "https://www.govinfo.gov/sitemap/PLAW_sitemap_index.xml"
old_xml = requests.get(urlSitemap)
print(old_xml)
new_xml = io.BytesIO(old_xml.content).read()
final_xml = BeautifulSoup(new_xml)
linkToBeFound = final_xml.findAll('loc')
for loc in linkToBeFound:
    urlPLmap = loc.text
    old_xmlPLmap = requests.get(urlPLmap)
    print(old_xmlPLmap)
    new_xmlPLmap = io.BytesIO(old_xmlPLmap.content).read()
    final_xmlPLmap = BeautifulSoup(new_xmlPLmap)
    linkToBeFound2 = final_xmlPLmap.findAll('loc')
    for pls in linkToBeFound2:
        argh = pls.text.find('PLAW')
        theWanted = pls.text[argh:]
        thisShallWork = eval(requests.get(start + theWanted).text)
        print(requests.get(start + theWanted))
        dict1 = (thisShallWork['download'])
        finaldict = (dict1['modslink'])[2:]
        print(finaldict)
        url2 = 'https://' + finaldict
        try:
            old_xml4 = requests.get(url2)
            print(old_xml4)
            new_xml4 = io.BytesIO(old_xml4.content).read()
            final_xml4 = BeautifulSoup(new_xml4)
            references = final_xml4.findAll('identifier', {'type': 'Statute citation'})
            for sec in references:
                if sec.text == "106 Stat. 4845":
                    print(dash * 20)
                    print(sec.text)
                    print(dash * 20)
                    sec313 = open('sec313info.txt', 'a')
                    sec313.write("\n")
                    sec313.write(pls.text + '\n')
                    sec313.close()
        except:
            print('error at: ' + url2)
No idea why I spent so long on this, but I did. Your code was really hard to look through, so I started with that: I broke it up into two parts, getting the links from the sitemaps, then the rest. I also broke a few bits out into separate functions.
This is checking about 2 urls per second on my machine which seems about right.
How this is better (you can argue with me about this part):
You don't have to reopen and close the output file after each write.
Removed a fair bit of unneeded code.
Gave your variables better names (this does not improve speed in any way, but please do this, especially if you are asking for help with it).
Really, the main thing: once you break it all up, it becomes fairly clear that what's slowing you down is waiting on the requests, which is pretty standard for web scraping. You can look into multithreading to avoid the wait (there is a rough sketch after the code below). Once you get into multithreading, the benefit of breaking up your code will likely also become much more evident.
# returns sitemap links
def get_links(s):
    old_xml = requests.get(s)
    new_xml = old_xml.text
    final_xml = BeautifulSoup(new_xml, "lxml")
    return final_xml.findAll('loc')

# gets the final url from your middle url and looks through it for the thing you are looking for
def scrapey(link):
    link_id = link[link.find("PLAW"):]
    r = requests.get('https://www.govinfo.gov/wssearch/getContentDetail?packageId={}'.format(link_id))
    print(r.url)
    try:
        r = requests.get("https://{}".format(r.json()["download"]["modslink"][2:]))
        print(r.url)
        soup = BeautifulSoup(r.text, "lxml")
        references = soup.findAll('identifier', {'type': 'Statute citation'})
        for ref in references:
            if ref.text == "106 Stat. 4845":
                return r.url
        else:
            return False
    except:
        print("bah" + r.url)
        return False

sitemap_links_el = get_links("https://www.govinfo.gov/sitemap/PLAW_sitemap_index.xml")
sitemap_links = map(lambda x: x.text, sitemap_links_el)
nlinks_el = map(get_links, sitemap_links)
links = [num.text for elem in nlinks_el for num in elem]

with open("output.txt", "a") as f:
    for link in links:
        url = scrapey(link)
        if url is False:
            print("no find")
        else:
            print("found on: {}".format(url))
            f.write("{}\n".format(url))

Use of SET to ignore pre logged users in a looping script

I am trying to use a set in order to stop users being re-printed in the following code. I managed to get Python to accept the code without producing any bugs, but if I let the code run on a 10-second loop it continues to print the users who should already have been logged. This is my first attempt at using a set, and I am a complete novice at Python (I have built it all so far based on examples I have seen and reverse-engineering them).
Below is an example of the code I am using
import mechanize
import urllib
import json
import re
import random
import datetime
from sched import scheduler
from time import time, sleep

###### Code to loop the script and set up scheduling time
s = scheduler(time, sleep)
random.seed()

def run_periodically(start, end, interval, func):
    event_time = start
    while event_time < end:
        s.enterabs(event_time, 0, func, ())
        event_time += interval + random.randrange(-5, 45)
    s.run()

###### Code to get the data required from the URL desired
def getData():
    post_url = "URL OF INTEREST"
    browser = mechanize.Browser()
    browser.set_handle_robots(False)
    browser.addheaders = [('User-agent', 'Firefox')]
    ###### These are the parameters you've got from checking with the aforementioned tools
    parameters = {'page' : '1',
                  'rp' : '250',
                  'sortname' : 'roi',
                  'sortorder' : 'desc'
                  }
    ##### Encode the parameters
    data = urllib.urlencode(parameters)
    trans_array = browser.open(post_url, data).read().decode('UTF-8')
    xmlload1 = json.loads(trans_array)
    pattern1 = re.compile('> (.*)<')
    pattern2 = re.compile('/control/profile/view/(.*)\' title=')
    pattern3 = re.compile('<span style=\'font-size:12px;\'>(.*)<\/span>')
    ##### Making the code identify each row, removing the need to numerically quantify the number of rows in the xmlfile,
    ##### thus making number of rows dynamic (change as the list grows, required for looping function to work un interupted)
    for row in xmlload1['rows']:
        cell = row["cell"]
        ##### defining the Keys (key is the area from which data is pulled in the XML) for use in the pattern finding/regex
        user_delimiter = cell['username']
        selection_delimiter = cell['race_horse']
        if strikeratecalc2 < 12 : continue;
        ##### REMAINDER OF THE REGEX DELMITATIONS
        username_delimiter_results = re.findall(pattern1, user_delimiter)[0]
        userid_delimiter_results = (re.findall(pattern2, user_delimiter)[0])
        user_selection = re.findall(pattern3, selection_delimiter)[0]
        ##### Code to stop duplicate posts of each user throughout the day
        userset = set([])
        if userid_delimiter_results in userset: continue;
        ##### Printing the results of the code at hand
        print "user id = ", userid_delimiter_results
        print "username = ", username_delimiter_results
        print "user selection = ", user_selection
        print ""
        ##### Code to stop duplicate posts of each user throughout the day part 2 (updating set to add users already printed to the ignore list)
        userset.update(userid_delimiter_results)

getData()
run_periodically(time()+5, time()+1000000, 300, getData)
Any comments will be greatly appreciated; this may seem like common sense to you seasoned coders, but I really am just getting past "Hello, world".
Kind regards, AEA
This:
userset.update(userid_delimiter_results)
Should probably be this:
userset.add(userid_delimiter_results)
To prove it, try printing the contents of userset after each call.
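To see what is going on (a quick illustration, not part of the original answer): strings are iterable, so update() unpacks a user id into individual characters, while add() stores the whole id as a single element.

userset = set()
userset.update("user123")   # update() iterates the string character by character
print(userset)              # individual characters: 'u', 's', 'e', 'r', '1', '2', '3'

userset = set()
userset.add("user123")      # add() stores the whole string as one element
print(userset)              # set(['user123'])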

Chatbot Conversation Objects, your approach?

I'm relatively new to programming, and a recent project I've started working on is a chatbot in Python for an IRC channel I frequent.
One of my goals is to have the bot able to very basically keep track of conversations it is having with users.
I am presently using conversation objects. When a user addresses the bot, it creates a new convo object and stores a log of the conversation, the current topic, etc. in that object.
When the user speaks, if their message matches the topic of the conversation, it chooses a response based on what they've said and the new topic.
For example, if the bot joins, and a user says: "Hello, bot." the conversation will be created and the topic set to "greeting".
The bot will say hello back, and if the user asks "What's up?", the bot will change the topic to "currentevents" and reply with "not much" or similar.
Topics have related topics, and if the bot notices a sudden change to a topic not marked as related (questions are exceptions), it will act slightly confused and taken aback.
My question though: I feel like my method is a bit overly complicated and unnecessary. I'm sure objects aren't the best thing to use here. What would be another approach to keeping track of a conversation and its topic? Be it a better way or a worse one, I'm just looking for ideas and a little brainstorming.
Before you say this isn't the right place, I've tried asking on programmers.stackexchange.com, but I didn't receive a relevant response, just someone who misunderstood me. I'm hoping I can get some more feedback on a more active site. In a way this is code help :)
Here's the code for my current approach. There are still a few bugs and I'm sure the code is far from efficient. Any tips or help on the code is welcomed.
def __init__(self):
    self.dicti_topics = {"None": ["welcomed", "ask", "badbot", "leave"],
                         "welcomed": ["welcomed", "howare", "badbot", "ask", "leave"],
                         "howare": ["greetfinished", "badbot", "leave"]}
    self.dicti_lines = {"hello": "welcomed", "howareyou": "howare", "goaway": "leave", "you'rebad": "badbot", "question": "asked"}
    self.dicti_responce = dicti["Arriving dicti_responce"]
def do_actions(self):
    if len(noi.recv) > 0:
        line = False
        ## set vars
        item = noi.recv.pop(0)
        # update and trim lastrecv list
        noi.lastrecv.append(item)
        if len(noi.lastrecv) > 10: noi.lastrecv = noi.lastrecv[1:10]
        args = item.split()
        channel, user = args[0], args[1].split("!")[0]
        message = " ".join(w for w in args[2:])
        print "channel:", channel
        print "User:", user
        print "Message:", message
        if re.match("noi", message):
            if not user in noi.convos.keys():
                noi.convos[user] = []
            if not noi.convos[user]:
                noi.convos[user] = Conversation(user)
                noi.convos[user].channel = channel
                line = "What?"
                send(channel, line)
        if re.match("hello|yo|hey|ohai|ello|howdy|hi", message) and (noi.jointime - time.time() < 20):
            print "hello convo created"
            if not user in noi.convos.keys():
                noi.convos[user] = []
            if not noi.convos[user]:
                noi.convos[user] = Conversation(user, "welcomed")
                noi.convos[user].channel = channel
        # if user has an active convo
        if user in noi.convos.keys():
            ## set vars
            line = None
            convo = noi.convos[user]
            topic = convo.topic
            # remove punctuation, "noi", and make lowercase
            rmsg = message.lower()
            for c in [".", ",", "?", "!", ";"]:
                rmsg = rmsg.replace(c, "")
            # print rmsg
            rlist = rmsg.split("noi")
            for rmsg in rlist:
                rmsg.strip(" ")
                # categorize message
                if rmsg in ["hello", "yo", "hey", "ohai", "ello", "howdy", "hi"]: rmsg = "hello"
                if rmsg in ["how do you do", "how are you", "sup", "what's up"]: rmsg = "howareyou"
                if rmsg in ["gtfo", "go away", "shooo", "please leave", "leave"]: rmsg = "goaway"
                if rmsg in ["you're bad", "bad bot", "stfu", "stupid bot"]: rmsg = "you'rebad"
                #if rmsg in []: rmsg =
                #if rmsg in []: rmsg =
                # Question handling
                r = r'(when|what|who|where|how) (are|is) (.*)'
                m = re.match(r, rmsg)
                if m:
                    rmsg = "question"
                    responce = "I don't know %s %s %s." % (m.group(1), m.group(3), m.group(2))
                # dicti_lines -> {message: new_topic}
                # if msg has an entry, get the new associated topic
                if rmsg in self.dicti_lines.keys():
                    new_topic = self.dicti_lines[rmsg]
                    # dicti_topics
                    relatedtopics = self.dicti_topics[topic]
                    # if the topic is related, change topic
                    if new_topic in relatedtopics:
                        convo.change_topic(new_topic)
                        noi.convos[user] = convo
                        # and respond
                        if new_topic == "leave": line = random.choice(dicti["Confirm"])
                        if rmsg == "question": line = responce
                        else: line = random.choice(self.dicti_responce[new_topic])
                    # otherwise it's confused
                    else:
                        line = "Huh?"
            if line:
                line = line + ", %s." % user
                send(channel, line)
This is the do_action of a state in a state machine.
Having clear goals is important in programming, even before you set out deciding what the objects are and how they work. Unfortunately, from what I've read above, this isn't really clear.
So first, forget the how of the program. Forget about the objects, the code and what they do.
Now imagine someone else was going to write the program for you, someone who is such a good programmer they don't need you to tell them how to code. Here are some questions they might ask you:
1. What is the purpose of your program, in one sentence?
2. Explain to me as simply as possible the main terms: IRC, conversation.
3. What must it be able to do? Short bullet points.
4. Explain in steps how you would use the program, e.g.:
   I type in...
   it then says...
   depending on whether... it gives me a list of this...
Having done this, forget about your convo object or whatever and think in terms of 1, 2 and 4. With pen and paper, think about the main elements of your problem, i.e. conversations. Don't just create objects... you'll find them.
Now think about the relationships between those elements in terms of how they interact, i.e.
"Bot adds message to Topic, User adds message to Topic, messages from Topic are sent to Log."
This will help you find what the objects are, what they must do and what information they'll need to store.
Having said all that, I would say your biggest problem is that you're taking on more than you can chew. To begin with, getting a computer to recognise words and sort them into topics is quite complicated and involves linguistics and/or statistics. As a new programmer I'd tend to avoid these areas, because they will simply let you down and in the process kill your motivation. Start small... then go BIG. Try messing with GUI programming, then make a simple calculator, and so on...
