Python - How to test similarity between strings and only print new strings?

Python - How to test similarity between strings and only print new strings? - python

I have developed a webscraper with beautiful soup that scrapes news from a website and then sends them to a telegram bot. Every time the program runs it picks up all the news currently on the news web page, and I want it to just pick the new entries on the news and send only those.
How can I do this? Should I use a sorting algorithm of some sort?
Here is the code:
#Lib requests
import requests
import bs4
fonte = requests.get('https://www.noticiasaominuto.com/')
soup = bs4.BeautifulSoup(fonte.text, 'lxml')
body = soup.body
for paragrafo in body.find_all('p', class_='article-thumb-text'):
print(paragrafo.text)
conteudo = paragrafo.text
id = requests.get('https://api.telegram.org/bot<TOKEN>/getUpdates')
chat_id = id.json()['result'][0]['message']['from']['id']
print(chat_id)
msg = requests.post('https://api.telegram.org/bot<TOKEN>/sendMessage', data = {'chat_id': chat_id ,'text' : conteudo})

You need to keep track of articles that you have seen before, either by using a full database solution or by simply saving the information in a file. The file needs to be read before starting. The website is then scraped and compared against the existing list. Any articles not in the list are added to the list. At the end, the updated list is saved back to the file.
Rather that storing the whole text in the file, a hash of the text can be saved instead. i.e. convert the text into a unique number, in this case a hex digest is used to make it easier to save to a text file. As each hash will be unique, they can be stored in a Python set to speed up the checking:
import hashlib
import requests
import bs4
import os
# Read in hashes of past articles
db = 'past.txt'
if os.path.exists(db):
with open(db) as f_past:
past_articles = set(f_past.read().splitlines())
else:
past_articles = set()
fonte = requests.get('https://www.noticiasaominuto.com/')
soup = bs4.BeautifulSoup(fonte.text, 'lxml')
for paragrafo in soup.body.find_all('p', class_='article-thumb-text'):
m = hashlib.md5(paragrafo.text.encode('utf-8'))
if m.hexdigest() not in past_articles:
print('New {} - {}'.format(m.hexdigest(), paragrafo.text))
past_articles.add(m.hexdigest())
# ...Update telegram here...
# Write updated hashes back to the file
with open(db, 'w') as f_past:
f_past.write('\n'.join(past_articles))
The first time this is run, all articles will be displayed. The next time, no articles will be displayed until the website is updated.

Related

Problem with lxml.xpath not putting elements into a list

So here's my problem. I'm trying to use lxml to web scrape a website and get some information but the elements that the information pertains to aren't being found when using the var.xpath command. It's finding the page but after using the xpath it doesn't find anything.
import requests
from lxml import html
def main():
result = requests.get('https://rocketleague.tracker.network/rocket-league/profile/xbl/ReedyOrange/overview')
# the root of the tracker website
page = html.fromstring(result.content)
print('its getting the element from here', page)
threesRank = page.xpath('//*[#id="app"]/div[2]/div[2]/div/main/div[2]/div[3]/div[1]/div/div/div[1]/div[2]/table/tbody/tr[*]/td[3]/div/div[2]/div[1]/div')
print('the 3s rank is: ', threesRank)
if __name__ == "__main__":
main()
OUTPUT:
"D:\Python projects\venv\Scripts\python.exe" "D:/Python projects/main.py"
its getting the element from here <Element html at 0x20eb01006d0>
the 3s rank is: []
Process finished with exit code 0
The output next to "the 3s rank is:" should look something like this
[<Element html at 0x20eb01006d0>, <Element html at 0x20eb01006d0>, <Element html at 0x20eb01006d0>]

Because the xpath string does not match, no result set is returned by page.xpath(..). It's difficult to say exactly what you are looking for but considering "threesRank" I assume you are looking for all the table values, ie. ranking and so on.
You can get a more accurate and self-explanatory xpath using the Chrome Addon "Xpath helper". Usage: enter the site and activate the extension. Hold down the shift key and hoover on the element you are interested in.
Since the HTML used by tracker.network.com is built dynamically using javascript with BootstrapVue (and Moment/Typeahead/jQuery) there is a big risk the dynamic rendering is producing different results from time to time.
Instead of scraping the rendered html, I suggest you instead use the structured data needed for the rendering, which in this case is stored as json in a JavaScript variable called __INITIAL_STATE__
import requests
import re
import json
from contextlib import suppress
# get page
result = requests.get('https://rocketleague.tracker.network/rocket-league/profile/xbl/ReedyOrange/overview')
# Extract everything needed to render the current page. Data is stored as Json in the
# JavaScript variable: window.__INITIAL_STATE__={"route":{"path":"\u0 ... }};
json_string = re.search(r"window.__INITIAL_STATE__\s?=\s?(\{.*?\});", result.text).group(1)
# convert text string to structured json data
rocketleague = json.loads(json_string)
# Save structured json data to a text file that helps you orient yourself and pick
# the parts you are interested in.
with open('rocketleague_json_data.txt', 'w') as outfile:
outfile.write(json.dumps(rocketleague, indent=4, sort_keys=True))
# Access members using names
print(rocketleague['titles']['currentTitle']['platforms'][0]['name'])
# To avoid 'KeyError' when a key is missing or index is out of range, use "with suppress"
# as in the example below: since there there is no platform no 99, the variable "platform99"
# will be unassigned without throwing a 'keyerror' exception.
from contextlib import suppress
with suppress(KeyError):
platform1 = rocketleague['titles']['currentTitle']['platforms'][0]['name']
platform99 = rocketleague['titles']['currentTitle']['platforms'][99]['name']
# print platforms used by currentTitle
for platform in rocketleague['titles']['currentTitle']['platforms']:
print(platform['name'])
# print all titles with corresponding platforms
for title in rocketleague['titles']['titles']:
print(f"\nTitle: {title['name']}")
for platform in title['platforms']:
print(f"\tPlatform: {platform['name']}")

lxml doesn't support "tbody". change your xpath to
'//*[#id="app"]/div[2]/div[2]/div/main/div[2]/div[3]/div[1]/div/div/div[1]/div[2]/table/tr[*]/td[3]/div/div[2]/div[1]/div'

Search through JSON query from Valve API in Python

I am looking to find various statistics about players in games such as CS:GO from the Steam Web API, but cannot work out how to search through the JSON returned from the query (e.g. here) in Python.
I just need to be able to get a specific part of the list that is provided, e.g. finding total_kills from the link above. If I had a way that could sort through all of the information provided and filters it down to just that specific thing (in this case total_kills) then that would help a load!
The code I have at the moment to turn it into something Python can read is:
url = "http://api.steampowered.com/IPlayerService/GetOwnedGames/v0001/?key=FE3C600EB76959F47F80C707467108F2&steamid=76561198185148697&include_appinfo=1"
data = requests.get(url).text
data = json.loads(data)

If you are looking for a way to search through the stats list then try this:
import requests
import json
def findstat(data, stat_name):
for stat in data['playerstats']['stats']:
if stat['name'] == stat_name:
return stat['value']
url = "http://api.steampowered.com/ISteamUserStats/GetUserStatsForGame/v0002/?appid=730&key=FE3C600EB76959F47F80C707467108F2&steamid=76561198185148697"
data = requests.get(url).text
data = json.loads(data)
total_kills = findstat(data, 'total_kills') # change 'total_kills' to your desired stat name
print(total_kills)

scrape text from webpage using python 2.7

I'm trying to scrape data from this website:
Death Row Information
I'm having trouble to scrape the last statements from all the executed offenders in the list because the last statement is located at another HTML page. The name of the URL is built like this: http://www.tdcj.state.tx.us/death_row/dr_info/[lastname][firstname].html. I can't think of a way of how I can scrape the last statements from these pages and put them in an Sqlite database.
All the other info (expect for "offender information", which I don't need) is already in my datbase.
Anyone who can give me a pointer to get started getting this done in Python?
Thanks
Edit2: I got a little bit further:
import sqlite3
import csv
import re
import urllib2
from urllib2 import Request, urlopen, URLError
from BeautifulSoup import BeautifulSoup
import requests
import string
URLS = []
Lastwords = {}
conn = sqlite3.connect('prison.sqlite')
conn.text_factory = str
cur = conn.cursor()
# Make some fresh tables using executescript()
cur.execute("DROP TABLE IF EXISTS prison")
cur.execute("CREATE TABLE Prison ( link1 text, link2 text,Execution text, LastName text, Firstname text, TDCJNumber text, Age integer, date text, race text, county text)")
conn.commit()
csvfile = open("prisonfile.csv","rb")
creader = csv.reader(csvfile, delimiter = ",")
for t in creader:
cur.execute('INSERT INTO Prison VALUES (?,?,?,?,?,?,?,?,?,?)', t, )
for column in cur.execute("SELECT LastName, Firstname FROM prison"):
lastname = column[0].lower()
firstname = column[1].lower()
name = lastname+firstname
CleanName = name.translate(None, ",.!-#'#$" "")
CleanName2 = CleanName.replace(" ", "")
Url = "http://www.tdcj.state.tx.us/death_row/dr_info/"
Link = Url+CleanName2+"last.html"
URLS.append(Link)
for URL in URLS:
try:
page = urllib2.urlopen(URL)
except URLError, e:
if e.code ==404:
continue
soup = BeautifulSoup(page.read())
statements = soup.findAll ('p',{ "class" : "Last Statement:" })
print statements
csvfile.close()
conn.commit()
conn.close()
The code is messy, I know. Once everything works I will clean it up. One problem though. I'm trying to get all the statements by using soup.findall, but I cannot seem to get the class right. The relevant part of the page source looks like this:
<p class="text_bold">Last Statement:</p>
<p>I don't have anything to say, you can proceed Warden Jones.</p>
However, the output of my program:
[]
[]
[]
...
What could be the problem exactly?

I will not write code that solves the problem, but will give you a simple plan for how to do it yourself:
You know that each last statement is located at the URL:
http://www.tdcj.state.tx.us/death_row/dr_info/[lastname][firstname]last.html
You say you already have all the other information. This presumably includes the list of executed prisoners. So you should generate a list of names in your python code. This will allow you to generate the URL to get to each page you need to get to.
Then make a For loop that iterates over each URL using the format I posted above.
Within the body of this for loop, write code to read the page and get the last statement. The last statement on each page is in the same format on each page, so you can use parsing to capture the part that you want:
<p class="text_bold">Last Statement:</p>
<p>D.J., Laurie, Dr. Wheat, about all I can say is goodbye, and for all the rest of you, although you don’t forgive me for my transgressions, I forgive yours against me. I am ready to begin my journey and that’s all I have to say.</p>
Once you have your list of last statements, you can push them to SQL.
So your code will look like this:
import urllib2
# Make a list of names ('Last1First1','Last2First2','Last3First3',...)
names = #some_call_to_your_database
# Make a list of URLs to each inmate's last words page
# ('URL...Last1First1last.html',URL...Last2First2last.html,...)
URLS = () # made from the 'names' list above
# Create a dictionary to hold all the last words:
LastWords = {}
# Iterate over each individual page
for eachURL in URLS:
response = urllib2.urlopen(eachURL)
html = response.read()
## Some prisoners had no last words, so those URLs will 404.
if ...: # Handle those 404s here
## Code to parse the response, hunting specifically
## for the code block I mentioned above. Once you have the
## last words as a string, save to dictionary:
LastWords['LastFirst'] = "LastFirst's last words."
# Now LastWords is a dictionary with all the last words!
# Write some more code to push the content of LastWords
# to your SQL database.

urllib2.urlopen not getting all content

I am a beginner in python to pull some data from reddit.com
More precisely, I am trying to send a request to http:www.reddit.com/r/nba/.json to get the JSON content of the page and then parse it for entries about a specific team or player.
To automate the data gathering, I am requesting the page like this:
import urllib2
FH = urllib2.urlopen("http://www.reddit.com/r/nba/.json")
rnba = FH.readlines()
rnba = str(rnba[0])
FH.close()
I am also pulling the content like this on a copy of the script, just to be sure:
FH = requests.get("http://www.reddit.com/r/nba/.json",timeout=10)
rnba_json = FH.json()
FH.close()
However, I am not getting the full data that is presented when I manually go to
http://www.reddit.com/r/nba/.json with either method, in particular when I call
print len(rnba_json['data']['children']) # prints 20-something child stories
but when I do the same loading the copy-pasted JSON string like this:
import json
import urllib2
fh = r"""{"kind": "Listing", "data": {"modhash": ..."""# long JSON string
r_nba = json.loads(fh) #loads the json string from the site into json object
print len(r_nba['data']['children']) #prints upwards of 100 stories
I get more story links. I know about the timeout parameter but providing it did not resolve anything.
What am I doing wrong or what can I do to get all the content presented when I pull the page in the browser?

To get the max allowed, you'd use the API like: http://www.reddit.com/r/nba/.json?limit=100

getting data from webpage with select menus with python and beautifulsoup

I am trying to collect data from a webpage which has a bunch of select lists i need to fetch
data from. Here is the page:- http://www.asusparts.eu/partfinder/Asus/All In One/E Series/
And this is what i have so far:
import glob, string
from bs4 import BeautifulSoup
import urllib2, csv
for file in glob.glob("http://www.asusparts.eu/partfinder/*"):
##-page to show all selections for the E-series-##
selected_list = 'http://www.asusparts.eu/partfinder/Asus/All In One/E Series/'
##-
page = urllib2.urlopen(selected_list)
soup = BeautifulSoup(page)
##-page which shows results after selecting one option-##
url = 'http://www.asusparts.eu/partfinder/Asus/All In One/E Series/ET10B'
##-identify the id of select list which contains the E-series-##
select = soup.find('select', id="myselectListModel")
option_tags = select.findAll('option')
##-omit first item in list as isn't part of the option-##
option_tags = option_tags[1:]
for option in option_tags:
open(url + option['value'])
html = urllib2.urlopen("http://www.asusparts.eu/partfinder/")
soup = BeautifulSoup(html)
all = soup.find('div', id="accordion")
I am not sure if i am going about the right way? As all the select menus make it confusing. Basically i need to grab
all the data from the selected results such as images,price,description,etc. They are all contained within
one div tag which contains all the results, which is named 'accordion' so would this still gather all the data?
or would i need to dig deeper to search through the tags inside this div? Also i would have prefered to search by id rather than
class as i could fetch all the data in one go. How would i do this from what i have above? Thanks. Also i am unsure about the glob function too if i am using that correctly or not?
EDIT
Here is my edited code, no errors return however i am not sure if it returns all the models for the e-series?
import string, urllib2, urllib, csv, urlparse from bs4 import
BeautifulSoup
##-page which shows results after selecting one option-##
url = 'http://www.asusparts.eu/partfinder/Asus/All In One/E Series/ET10B'
base_url = 'http://www.asusparts.eu/' + url
print base_url
##-page to show all selections for the E-series-##
selected_list = urllib.quote(base_url + '/Asus/All In One/E Series/ET10B')
print urllib.quote(base_url + '/Asus/All In One/E Series/ET10B')
#selected_list = 'http://www.asusparts.eu/partfinder/Asus/All In One/E Series/ET10B'
##-
page = urllib2.urlopen('http://www.asusparts.eu/partfinder/Asus/All%20In%20One/E%20Series')
soup = BeautifulSoup(page)
print soup
##-identify the id of select list which contains the E-series-##
select = soup.find('select', id="myselectListModel")
option_tags = select.findAll('option')
print option_tags
##-omit first item in list as isn't part of the option-##
option_tags = option_tags[1:]
print option_tags
for option in option_tags:
url + option['redirectvalue']
print " " + url + option['redirectvalue']

First of all, I'd like to point out a couple of problems you have in the code you posted. First, of all the glob module is not typically used for making HTTP requests. It is useful for iterating through a subset of files on a specified path, you can read more about it in its docs.
The second issue is that in the line:
for file in glob.glob("http://www.asusparts.eu/partfinder/*"):
you have an indentation error, because there is no indented code that follows. This will raise an error and prevent the rest of the code from being executed.
Another problem is that you are using some of python's "reserved" names for your variables. You should never use words such as all or file for variable names.
Finally when you are looping through option_tags:
for option in option_tags:
open(url + option['value'])
The open statement will try and open a local file whose path is url + option['value']. This will likely raise an error, as I doubt you'll have a file at that location. In addition, you should be aware that you aren't doing anything with this open file.
Okay, so enough with the critique. I've taken a look at the asus page and I think I have an idea of what you want to accomplish. From what I understand, you want to scrape a list of parts (images, text, price, etc..) for each computer model on the asus page. Each model has its list of parts located at a unique URL (for example: http://www.asusparts.eu/partfinder/Asus/Desktop/B%20Series/BM2220). This means that you need to be able to create this unique URL for each model. To make matters more complicated, each parts category is loaded dynamically, so for example the parts for the "Cooling" section are not loaded until you click on the link for "Cooling". This means we have a two part problem: 1) Get all of the valid (brand, type, family, model) combinations and 2) Figure out how to load all the parts for a given model.
I was kind of bored and decided to write up a simple program that will take care of most of the heavy lifting. It isn't the most elegant thing out there, but it'll get the job done. Step 1) is accomplished in get_model_information(). Step 2) is taken care of in parse_models() but is a little less obvious. Taking a look at the asus website, whenever you click on a parts subsection the JavaScript function getProductsBasedOnCategoryID() is run, which makes an ajax call to a formatted PRODUCT_URL (see below). The response is some JSON information that is used to populate the section you clicked on.
import urllib2
import json
import urlparse
from bs4 import BeautifulSoup
BASE_URL = 'http://www.asusparts.eu/partfinder/'
PRODUCTS_URL = 'http://json.zandparts.com/api/category/GetCategories/'\
'44/EUR/{model}/{family}/{accessory}/{brand}/null/'
ACCESSORIES = ['Cable', 'Cooling', 'Cover', 'HDD', 'Keyboard', 'Memory',
'Miscellaneous', 'Mouse', 'ODD', 'PS', 'Screw']
def get_options(url, select_id):
"""
Gets all the options from a select element.
"""
r = urllib2.urlopen(url)
soup = BeautifulSoup(r)
select = soup.find('select', id=select_id)
try:
options = [option for option in select.strings]
except AttributeError:
print url, select_id, select
raise
return options[1:] # The first option is the menu text
def get_model_information():
"""
Finds all the models for each family, all the families and models for each
type, and all the types, families, and models for each brand.
These are all added as tuples (brand, type, family, model) to the list
models.
"""
model_info = []
print "Getting brands"
brand_options = get_options(BASE_URL, 'mySelectList')
for brand in brand_options:
print "Getting types for {0}".format(brand)
# brand = brand.replace(' ', '%20') # URL encode spaces
brand_url = urlparse.urljoin(BASE_URL, brand.replace(' ', '%20'))
types = get_options(brand_url, 'mySelectListType')
for _type in types:
print "Getting families for {0}->{1}".format(brand, _type)
bt = '{0}/{1}'.format(brand, _type)
type_url = urlparse.urljoin(BASE_URL, bt.replace(' ', '%20'))
families = get_options(type_url, 'myselectListFamily')
for family in families:
print "Getting models for {0}->{1}->{2}".format(brand,
_type, family)
btf = '{0}/{1}'.format(bt, family)
fam_url = urlparse.urljoin(BASE_URL, btf.replace(' ', '%20'))
models = get_options(fam_url, 'myselectListModel')
model_info.extend((brand, _type, family, m) for m in models)
return model_info
def parse_models(model_information):
"""
Get all the information for each accessory type for every
(brand, type, family, model). accessory_info will be the python formatted
json results. You can parse, filter, and save this information or use
it however suits your needs.
"""
for brand, _type, family, model in model_information:
for accessory in ACCESSORIES:
r = urllib2.urlopen(PRODUCTS_URL.format(model=model, family=family,
accessory=accessory,
brand=brand,))
accessory_info = json.load(r)
# Do something with accessory_info
# ...
def main():
models = get_model_information()
parse_models(models)
if __name__ == '__main__':
main()
Finally, one side note. I have dropped urllib2 in favor of the requests library. I personally think provides much more functionality and has better semantics, but you can use whatever you would like.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python - How to test similarity between strings and only print new strings? - python

Related

Problem with lxml.xpath not putting elements into a list

Search through JSON query from Valve API in Python

scrape text from webpage using python 2.7

urllib2.urlopen not getting all content

getting data from webpage with select menus with python and beautifulsoup

Categories

Resources