Python scraping for a string to trigger an alarm - python

I think I've reached a point where I need help from professionals. I would like to build a scraper for a browser game that sends an alarm to a bot (Telegram or Discord). Connecting the bot is not the problem at first; it is more about getting the right result.
My script runs in a while loop (it also runs without one) and is supposed to look for links in an <a> tag. These links contain an ID, which is incremented by 1 whenever a new player signs up to the game, and that is exactly what I need.
Since I need to compare the information, I figured I need to save it in a .csv file. And there lies the problem: the output in the .csv looks like this:
index.php?section=impressum
I have two problems:
1. I want to limit the output to the first 5 results in the file.
2. I only want to write to the file if something changes, and only the corresponding change.
This is my code so far:
import requests
import time
import csv
from datetime import datetime
from bs4 import BeautifulSoup

def writeCSV(data):
    csv_file = open('ags_scrape.csv', 'w')
    csv_writer = csv.writer(csv_file)
    csv_writer.writerow([data])
    csv_file.close()

sleepTimer = 3
# address of the website
url = "https://www.ag-spiel.de/"
allAGs = []
firstRun = True

while True:
    response = requests.get(url + "index.php?section=live")
    # parse the HTML document from the page source with BeautifulSoup
    html = BeautifulSoup(response.text, 'html.parser')
    # parse the url from the <a href>
    newDetected = False
    newAGs = []
    possible_links = html.find_all('a')
    for link in possible_links:
        if link.has_attr('href'):
            inhalt = str(link.attrs['href'])
            if "aktie=" in inhalt:
                if firstRun is True:
                    allAGs.append(inhalt)
                else:
                    if str(inhalt) not in allAGs:
                        newDetected = True
                        print("ATTENTION!!! New AG! Url is: " + inhalt)
                        allAGs.append(inhalt)
                        # write to file
                        writeCSV(inhalt)
                    else:
                        # print("Debug output " + inhalt + " already in AGlist")
                        continue
    if firstRun is True:
        print("First run successful, current ags: " + str(len(allAGs)))
        for AGurl in allAGs:
            print(AGurl)
    else:
        if newDetected is False:
            print(str(datetime.now().strftime("%H:%M:%S")) + ": Nothing changed")
            writeCSV(inhalt)
        else:
            print("Something changed, current ags: " + str(len(allAGs)))
            for AGurl in allAGs:
                print(AGurl)
    firstRun = False
    time.sleep(sleepTimer)


How to open a file and delete the first item (item at index 0)

At the moment I am working on a plug-in for a chat bot for Twitch.
I have this working so far, so that I am able to add items to a file.
# Variables
f = open("Tank_request_list.txt", "a+")
fr = open("Tank_request_list.txt", "r")
tr = "EBR"  # test input
tank_request = fr.read()
treq = tank_request.split("#")
with open("Tank_request_list.txt") as fr:
    empty = fr.read(1)
if not empty:
    f.write(tr)
    f.close()
else:
    tr = "#" + tr
    f.write(tr)
    f.close()
I now need to work out how to delete an item at index 0.
I also have this piece of code I need to implement as well:
# List Length
list_length = len(treq)
print "There are %d tanks in the queue." % (list_length)

# Next 5 Requests
print "Next 5 Requests are:"

def tank_lst(x):
    for i in range(5):
        print "- " + x[i]

# Run Tank_request
tank_lst(treq)
The following will return the right answer but not write it.
def del_played(tank):
    del tank[0]
    return tank

tanks = treq
print del_played(tanks)
First, remove the content.
Use the truncate function to remove the content from the file, then write the new list into it.

NYT summary extractor python2

I am trying to access the summary of the NYT articles using the NewsWire API and python 2.7. Here is the code:
from urllib2 import urlopen
import urllib2
from json import loads
import codecs
import time
import newspaper
from lxml.etree import XMLSyntaxError  # needed for the except clause below

posts = list()
articles = list()
i = 30
keys = dict()
count = 0
offset = 0
while(offset < 40000):
    if(len(posts) >= 30000): break
    if(700 < offset < 800):
        offset = offset + 100
    #for p in xrange(100):
    try:
        url = "http://api.nytimes.com/svc/news/v3/content/nyt/all.json?offset=" + str(offset) + "&api-key=ACCESSKEY"
        data = loads(urlopen(url).read())
        print str(len(posts)) + " offset=" + str(offset)
        if posts and articles and keys:
            outfile = open("articles_next.tsv", "w")
            for s in articles:
                outfile.write(s.encode("utf-8") + "\n")
            outfile.close()
            outfile = open("summary_next.tsv", "w")
            for s in posts:
                outfile.write(s.encode("utf-8") + "\n")
            outfile.close()
            indexfile = open("ind2_next.tsv", "w")
            for x in keys.keys():
                indexfile.write('\n' + str(x) + " " + str(keys[x]))
            indexfile.close()
        for item in data["results"]:
            if(('url' in item) & ('abstract' in item)):
                url = item["url"]
                abst = item["abstract"]
                if(url not in keys.values()):
                    keys[count] = url
                    article = newspaper.Article(url)
                    article.download()
                    article.parse()
                    try:
                        el_post = article.text.replace('\n\n', ' ').replace("Advertisement Continue reading the main story", '')
                    except XMLSyntaxError, e:
                        continue
                    articles.append(el_post)
                    count = count + 1
                    res = abst  # url + " " + abst
                    # print res.encode("utf-8")
                    posts.append(res)  # Here is the appending statement.
                    if(len(posts) >= 30000):
                        break
    except urllib2.HTTPError, e:
        print e
        time.sleep(1)
        offset = offset + 21
        continue
    except urllib2.URLError, e:
        print e
        time.sleep(1)
        offset = offset + 21
        continue
    offset = offset + 19

print str(len(posts))
print str(len(keys))
I was getting good summaries, but sometimes I came across some weird sentences as part of the summary. Here are two examples:
Here’s what you need to know to start your day.
Corrections appearing in print on Monday, August 28, 2017.
which are considered to be the summary of some articles. Kindly help me extract the proper summary of the article from the NYT news. I thought of using the title if such a case arises, but the title is weird too.
So, I have taken a look through the summary results.
It is possible to remove repeated statements such as "Corrections appearing in print on Monday, August 28, 2017.", where only the date is different.
The simplest way to do this is to check whether the statement is present in the variable itself.
Example,
# declare at the top
# create a list that consists of repetitive statements. I found 'Quotation of the Day' being repeated as well
REMOVE_STATEMENTS = ["Corrections appearing in print on", "Quotation of the Day for"]
And then (note the all(); a bare generator expression is always truthy, so without it every statement would pass the check):
if all(statement not in res for statement in REMOVE_STATEMENTS):
    posts.append(res)
As for the remaining unwanted statements, there is NO way they can be differentiated unless you search for keywords within res that you want to ignore, or they are repetitive. If you find any, simply add them to the list I created.
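Put together, the filter can be tried in isolation; the abstracts below are made-up stand-ins for the res values coming from the API:

```python
REMOVE_STATEMENTS = ["Corrections appearing in print on", "Quotation of the Day for"]

def keep_summary(res):
    # True only when none of the boilerplate statements occur in res
    return all(statement not in res for statement in REMOVE_STATEMENTS)

abstracts = [
    "Corrections appearing in print on Monday, August 28, 2017.",
    "A profile of the new city budget.",
]
posts = [res for res in abstracts if keep_summary(res)]
# posts now holds only the real summary
```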

Python IndexError: list index out of range when using iterations

I've been trying to download screenshots from the App Store and here's my code (I'm a beginner).
The problem I encounter is "list index out of range" at line 60 (screenshotList = data["results"][resultCounter]["screenshotUrls"]).
The thing is that sometimes the search API returns 0 results for the search term used, and therefore the script gets messed up because "resultCount" = 0.
I'm not sure what else it could be, nor how I can fix it. Any help?
# Required libraries
import urllib
import string
import random
import json
import time

""" screenshotCounter is used so that all screenshots have a different name
resultCounter is used to go from result to result in downloaded JSON file
"""
screenshotCounter = 0
resultCounter = 0

""" Create three random letters as search term on App Store
Download JSON results file
Shows used search term
"""
searchTerm = (''.join(random.choice(string.ascii_lowercase) for i in range(3)))
urllib.urlretrieve("https://itunes.apple.com/search?country=us&entity=software&limit=3&term=" + str(searchTerm), "download.txt")
print "Used search term: " + str(searchTerm)

# Function to download screenshots + give it a name + confirmation msg
def download_screenshot(screenshotLink, screenshotName):
    urllib.urlretrieve(screenshotLink, screenshotName)
    print "Downloaded with success:" + str(screenshotName)

# Opens newly downloaded JSON file
with open('download.txt') as data_file:
    data = json.load(data_file)

""" Get the first list of screenshots from stored JSON file,
resultCounter = 0 on first iteration
"""
screenshotList = data["results"][resultCounter]["screenshotUrls"]

# Gives the number of found results and serves as iteration limit
iterationLimit = data["resultCount"]

# Prints the number of found results
print str(iterationLimit) + " results found."

""" Change the number of iterations to the number of results, which will be
different for every request, minus 1 since indexing starts at 0
"""
iterations = [0] * iterationLimit

""" For each iteration (number of results), find each screenshot in the
screenshotList, name it, download it. Then change result to find the next
screenshotList and change screenshotList variable.
"""
for number in iterations:
    for screenshotLink in screenshotList:
        screenshotName = "screenshot" + str(screenshotCounter) + ".jpeg"
        download_screenshot(screenshotLink, screenshotName)
        screenshotCounter = screenshotCounter + 1
    resultCounter = resultCounter + 1
    screenshotList = data["results"][resultCounter]["screenshotUrls"]
    # Sleeping to avoid crash
    time.sleep(1)
I rewrote your code to check for the presence of results before trying anything. If there aren't any, it goes back through the loop with a new search term. If there are, it will stop at the end of that iteration.
# Required libraries
import urllib
import string
import random
import json
import time

# Function to download screenshots + give it a name + confirmation msg
def download_screenshot(screenshotLink, screenshotName):
    urllib.urlretrieve(screenshotLink, screenshotName)
    print "Downloaded with success:" + str(screenshotName)

success = False
while success == False:
    """ Create three random letters as search term on App Store
    Download JSON results file
    Shows used search term
    """
    searchTerm = (''.join(random.choice(string.ascii_lowercase) for i in range(3)))
    urllib.urlretrieve("https://itunes.apple.com/search?country=us&entity=software&limit=3&term=" + str(searchTerm), "download.txt")
    print "Used search term: " + str(searchTerm)

    # Opens newly downloaded JSON file
    with open('download.txt') as data_file:
        data = json.load(data_file)

    resultCount = len(data["results"])
    if resultCount == 0:
        continue  # if no results, skip to the next loop
    success = True
    print str(resultCount) + " results found."

    for j, resultList in enumerate(data["results"]):
        screenshotList = resultList["screenshotUrls"]
        """ For each result, find each screenshot in the screenshotList,
        name it, download it.
        """
        for i, screenshotLink in enumerate(screenshotList):
            screenshotName = "screenshot" + str(i) + '_' + str(j) + ".jpeg"
            download_screenshot(screenshotLink, screenshotName)
            # Sleeping to avoid crash
            time.sleep(1)
Have you tried

try:
    for screenshotLink in screenshotList:
        screenshotName = "screenshot" + str(screenshotCounter) + ".jpeg"
        download_screenshot(screenshotLink, screenshotName)
        screenshotCounter = screenshotCounter + 1
except IndexError:
    pass

Python: Creating a file based on an array of strings

I'm trying to write a program that will go to a website and download all of the songs they have posted. Right now I'm having trouble creating new file names for each of the songs I download. I initially get all of the file names and the locations of the songs (html). However, when I try to create new files for the songs to be put in, I get an error saying:
IOError: [Errno 22] invalid mode ('w') or filename
I have tried using different modes like "w+", "a", and "a+" to see if these would solve the issue, but so far I keep getting the error message. I have also tried "%"-formatting the string, but that has not worked either. My code follows; any help would be appreciated.
import urllib
import urllib2

def earmilk():
    SongList = []
    SongStrings = []
    SongNames = []
    earmilk = urllib.urlopen("http://www.earmilk.com/category/pop")
    reader = earmilk.read()
    # gets the position of the playlist
    PlaylistPos = reader.find("var newPlaylistTracks = ")
    # finds the number of songs in the playlist
    NumberSongs = reader[reader.find("var newPlaylistIds = "): PlaylistPos].count(",") + 1
    initPos = PlaylistPos
    # goes through the playlist and records the html address and name of the song
    for song in range(0, NumberSongs):
        songPos = reader[initPos:].find("http:") + initPos
        namePos = reader[songPos:].find("name") + songPos
        namePos += reader[namePos:].find(">")
        nameEndPos = reader[namePos:].find("<") + namePos
        SongStrings.append(reader[songPos: reader[songPos:].find('"') + songPos])
        SongNames.append(reader[namePos + 1: nameEndPos])
        #initPos += len(SongStrings[song])
        initPos = nameEndPos
    for correction in range(0, NumberSongs):
        SongStrings[correction] = SongStrings[correction].replace('\\/', "/")
    #downloading songs
    #for download in range(0, NumberSongs):
    #print reader.find("So F*")
    #x = SongNames[0]
    songDL = open(SongNames[0].formant(name), "w+")
    songDL.write(urllib.urlretrieve(SongStrings[0], SongNames[0] + ".mp3"))
    songDL.close()
    print SongStrings
    for name in range(0, NumberSongs):
        print SongNames[name] + "\n"
    earmilk.close()
You need to use filename = '%s' % (SongNames[0],) to construct the name, but you also need to make sure that your file name is a valid one - I don't know of any songs called *.*, but I wouldn't like to chance it, so something like:

filename = ''.join([a.isalnum() and a or '_' for a in SongNames[0]])
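The same sanitizing idea in Python 3 spelling (the and/or trick above predates conditional expressions); the song title here is an invented example:

```python
def safe_filename(name):
    # replace every character that is not a letter or digit with '_'
    return ''.join(a if a.isalnum() else '_' for a in name)

print(safe_filename("So F*ck Me (Remix)"))  # So_F_ck_Me__Remix_
```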

Creating a dynamic forum signature generator in python

I have searched and searched but I have only found solutions involving php and not python/django. My goal is to make a website (backend coded in python) that will allow a user to input a string. The backend script would then be run and output a dictionary with some info. What I want is to use the info from the dictionary to sort of draw it onto an image I have on the server and give the new image to the user. How can I do this offline for now? What libraries can I use? Any suggestions on the route I should head on would be lovely.
I am still a novice so please forgive me if my code needs work. So far I have no errors with what I have but like I said I have no clue where to go next to achieve my goal. Any tips would be greatly appreciated.
This is sort of what I want the end goal to be http://combatarmshq.com/dynamic-signatures.html
This is what I have so far (I used Beautiful Soup as a parser from here. If this is too excessive, or if I did it in a not-so-good way, please let me know if there is a better alternative. Thanks):
The url where I'm getting the numbers I want (These are dynamic) is this: http://combatarms.nexon.net/ClansRankings/PlayerProfile.aspx?user=
The name of the player will go after user so an example is http://combatarms.nexon.net/ClansRankings/PlayerProfile.aspx?user=-aonbyte
This is the code with the basic functions to scrape the website:
from urllib import urlopen
from BeautifulSoup import BeautifulSoup

def get_avatar(player_name):
    '''Return the players avatar as a binary string.'''
    player_name = str(player_name)
    url = 'http://combat.nexon.net/Avatar/MyAvatar.srf?'
    url += 'GameName=CombatArms&CharacterID=' + player_name
    sock = urlopen(url)
    data = sock.read()
    sock.close()
    return data

def save_avatar(data, file_name):
    '''Saves the avatar data from get_avatar() in png format.'''
    local_file = open(file_name + '.png', 'w' + 'b')
    local_file.write(data)
    local_file.close()

def get_basic_info(player_name):
    '''Returns basic player statistics as a dictionary'''
    url = 'http://combatarms.nexon.net/ClansRankings'
    url += '/PlayerProfile.aspx?user=' + player_name
    sock = urlopen(url)
    html_raw = sock.read()
    sock.close()
    html_original_parse = BeautifulSoup(''.join(html_raw))
    player_info = html_original_parse.find('div', 'info').find('ul')
    basic_info_list = range(6)
    for i in basic_info_list:
        basic_info_list[i] = str(player_info('li', limit=7)[i + 1].contents[1])
    basic_info = dict(date=basic_info_list[0], rank=basic_info_list[1], kdr=basic_info_list[2], exp=basic_info_list[3], gp_earned=basic_info_list[4], gp_current=basic_info_list[5])
    return basic_info
And here is the code that tests out those functions:
from grabber import get_avatar, save_avatar, get_basic_info
player = raw_input('Player name: ')
print 'Downloading avatar...'
avatar_data = get_avatar(player)
file_name = raw_input('Save as? ')
print 'Saving avatar as ' + file_name + '.png...'
save_avatar(avatar_data, file_name)
print 'Retrieving ' + player + '\'s basic character info...'
player_info = get_basic_info(player)
print ''
print ''
print 'Info for character named ' + player + ':'
print 'Character creation date: ' + player_info['date']
print 'Rank: ' + player_info['rank']
print 'Experience: ' + player_info['exp']
print 'KDR: ' + player_info['kdr']
print 'Current GP: ' + player_info['gp_current']
print ''
raw_input('Press enter to close...')
If I understand you correctly, you want to get an image from one place, get some textual information from another place, draw text on top of the image, and then return the marked-up image. Do I have that right?
If so, get PIL, the Python Imaging Library. Both PIL and BeautifulSoup are capable of reading directly from an opened URL, so you can forget that socket nonsense. Get the player name from the HTTP request, open the image, use BeautifulSoup to get the data, use PIL's text functions to write on the image, save the image back into the HTTP response, and you're done.
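A minimal offline sketch of that flow using Pillow (the maintained fork of PIL); the stats dict and the blank base image are stand-ins for the real avatar and the get_basic_info() output:

```python
from PIL import Image, ImageDraw

def render_signature(base_image, stats):
    """Draw each 'key: value' line from `stats` onto a copy of base_image."""
    img = base_image.copy()
    draw = ImageDraw.Draw(img)
    y = 10
    for key, value in stats.items():
        draw.text((10, y), "%s: %s" % (key, value), fill="white")
        y += 15  # move down one line per stat
    return img

# usage: a blank 400x100 signature with two hypothetical stats drawn on it
base = Image.new("RGB", (400, 100), "black")
sig = render_signature(base, {"rank": "Colonel", "kdr": "1.42"})
# sig.save("signature.png")  # in Django this would go into the HTTP response
```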
