Scraping a table without an id with Python and BeautifulSoup

I am currently working on creating a Telegram bot.
I now want to add the command /drop Sorties, but I need bs4 to scrape a table from this page.
The bot should answer with something like:
Rifle Riven Mod Rare (6.79%)
Ayatan Anasa Sculpture Uncommon (28.00%)
4000 Endo Uncommon (12.10%)
etc etc etc..
I need to define something in the code that looks ONLY for the user's input on that page and replies with the next table it finds there.
Example HTML from the link provided above:
<h3 id="sortieRewards">Sorties:</h3>
<table><tbody><tr><th colspan="2">Sortie</th></tr><tr><td>Rifle Riven Mod</td><td>Rare (6.79%)</td></tr><tr><td>Ayatan Anasa Sculpture</td><td>Uncommon (28.00%)</td></tr><tr><td>4000 Endo</td><td>Uncommon (12.10%)</td>
The bot should reply with the content of the table even if the user's input is Sortie rather than Sorties.

from bs4 import BeautifulSoup

# `page` holds the HTML of the drop-table page fetched earlier
soup = BeautifulSoup(page, 'lxml')
sorties_header = soup.find('h3', {'id': 'sortieRewards'})
sorties_table = sorties_header.find_next('table')
# First row is the header, so skip it
for sortie in sorties_table.find_all('tr')[1:]:
    data = sortie.find_all('td')
    item = data[0].text
    drop_rate = data[1].text
    print(item, drop_rate)
The output is
Rifle Riven Mod Rare (6.79%)
Ayatan Anasa Sculpture Uncommon (28.00%)
4000 Endo Uncommon (12.10%)
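To handle arbitrary user input for the /drop command (e.g. Sortie instead of Sorties), one option is a case-insensitive substring match against the h3 headings on the page; a minimal sketch, where the find_drop_table helper name is made up for illustration:

from bs4 import BeautifulSoup

def find_drop_table(page, query):
    """Return (item, drop rate) pairs from the table that follows the first
    h3 heading whose text contains `query` (case-insensitive)."""
    soup = BeautifulSoup(page, 'lxml')
    for header in soup.find_all('h3'):
        if query.lower() in header.get_text().lower():
            table = header.find_next('table')
            rows = []
            for tr in table.find_all('tr')[1:]:  # skip the header row
                cells = tr.find_all('td')
                if len(cells) >= 2:
                    rows.append((cells[0].text, cells[1].text))
            return rows
    return []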

Related

Web scrape website with dropdown menu that changes website dynamically (onchange)

So I'm trying to scrape census data from a website that changes dynamically when a county is selected from the drop down menu. It looks like this:
<select id="cat_id_select_GEO" onchange="changeHeaderSelection('GEO');
<option value="0500000US01001" select="selected">Autaga County, Alabama</option>
<select>
a link
From the research I've done, it sounds like I need to make some sort of GET request (Selenium?), but I am completely lost on how to do this. I know how to get the data I want once I've made the county selection, but I've never had to scrape something where the website changes dynamically (i.e. the URL doesn't change).
I understand that some may find this to be a simple question... but I've read numerous other similar questions and would greatly benefit from someone walking me through an example, and/or directing me to a solid guide.
This is what I've been messing around with so far. I can see it kind of works at selecting the values, but it spits out this error: Message: stale element reference: element is not attached to the page document
(Session info: chrome=74.0.3729.169)
from selenium import webdriver
from selenium.webdriver.support.ui import Select

# StateURLs is a DataFrame with one URL per state
for index, row in StateURLs.iterrows():
    url = row['URL']
    state = row['STATE']
    driver = webdriver.Chrome(executable_path=r'C:\chromedriver.exe')
    driver.get(url)
    select_county = Select(driver.find_element_by_id('cat_id_select_GEO'))
    options = select_county.options
    for index in range(0, len(options) - 1):
        select_county.select_by_index(index)
I would also love help on how to then hand these pages to BeautifulSoup so I can scrape each page after the selection is made.
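A minimal sketch of both pieces (an assumption, not the accepted answer: selecting an option fires the onchange handler and reloads the page, which is what invalidates the previously located elements, so the select has to be re-located on every pass; driver.page_source can then be handed to BeautifulSoup):

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import Select

driver = webdriver.Chrome(executable_path=r'C:\chromedriver.exe')
driver.get(url)  # the state URL, as in the question

n_options = len(Select(driver.find_element_by_id('cat_id_select_GEO')).options)
for i in range(n_options):
    # Re-locate the select element each time: the onchange reload is what
    # raises "stale element reference" if the old handle is reused.
    select_county = Select(driver.find_element_by_id('cat_id_select_GEO'))
    select_county.select_by_index(i)
    # An explicit wait for the new content may be needed here.
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    # ...scrape the county-specific data from soup...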
The main landing page makes GET requests with a query string that return a JSON string containing the info first returned when you submit your query, including further URLs that are listed on the results page.
import requests
search_term = 'searchTerm: Autauga County, Alabama'
search_term = search_term.replace(' ','+')
r = requests.get('https://factfinder.census.gov/rest/communityFactsNav/nav?N=0&_t=1558559559868&log=t&searchTerm=term ' + search_term + ',Alabama&src=').json()
Here is an example of that json
I can generate the correct URL to use in the browser, which returns all that data as JSON, but can't seem to configure requests so that it works. Perhaps someone else can pick this up and work it out. I will look again tomorrow.
r = requests.get('https://factfinder.census.gov/rest/communityFactsNav/nav?N=0&_t=1558559559868&log=t&searchTerm=term ' + search_term + ',Alabama&src=', allow_redirects= True).json()
url = 'https://factfinder.census.gov' + r['CFMetaData']['measuresAndLinks']['links']['2017 American Community Survey'][0]['url']
code = url.split('/')[-2]
url = 'https://factfinder.census.gov/tablerestful/tableServices/renderProductData?renderForMap=f&renderForChart=f&pid=ACS_17_5YR_{}&prodToReplace=ACS_16_5YR_{}&log=t&_ts=576607332612'.format(code, code)

How can we get response of next loading web page

I am writing a scraper to get the list of all movies available on hungama.com.
I am requesting the URL "http://www.hungama.com/all/hungama-picks-54/4470/" to get the response.
When you go to this URL, it shows 12 movies on the screen, but as you scroll down more movies are loaded automatically.
I am parsing the HTML source with the code below:
response.css('div.movie-block-artist.boxshadow.clearfix1>div>div>a::text').extract()
but I only get 12 items, whereas there are more movies. How can I get all the movies available? Please help.
While scrolling down that page, if you take a good look at the XHR tab in the Network panel of the dev tools, you can see that it produces URLs with pagination attached to them, like http://www.hungama.com/all/hungama-picks-54/3632/2/. So, by changing the URL as I did below, you can get all the content from that page.
import requests
from scrapy import Selector

page = 1
URL = "http://www.hungama.com/all/hungama-picks-54/3632/"

while True:
    page += 1
    res = requests.get(URL)
    sel = Selector(text=res.text)
    container = sel.css(".leftbox")
    # Stop when a paginated URL no longer returns any items
    if len(container) <= 0:
        break
    for item in container:
        title = item.css("#pajax_a::text").extract_first()
        year = item.css(".subttl::text").extract_first()
        print(title, year)
    # Move on to the next page of results
    next_page = "http://www.hungama.com/all/hungama-picks-54/3632/{}/"
    URL = next_page.format(page)
By the way, the URL you provided above is not working; the one I've supplied is active now. However, I think you get the logic.
There seems to be an AJAX request, used as a lazy-load feature, with the URL http://www.hungama.com/all/hungama-picks-54/4470/2/?ajax_call=1&_country=IN, which fetches the movies.
In the above URL, change the 2 to 3 (http://www.hungama.com/all/hungama-picks-54/4470/3/?ajax_call=1&_country=IN) and so on to get the details of the next movies.
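Combining that endpoint with the selector from the question, a rough sketch (assuming the AJAX URL keeps returning HTML fragments and comes back empty once the pages run out):

import requests
from scrapy import Selector

base = "http://www.hungama.com/all/hungama-picks-54/4470/{}/?ajax_call=1&_country=IN"
titles = []
page = 1
while True:
    res = requests.get(base.format(page))
    sel = Selector(text=res.text)
    # Selector taken from the question
    found = sel.css('div.movie-block-artist.boxshadow.clearfix1>div>div>a::text').extract()
    if not found:  # assumption: an empty fragment means there are no more pages
        break
    titles.extend(found)
    page += 1

print(len(titles), "movies found")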

Python BeautifulSoup inconsistent result

I have been trying to learn a bit of Python, and I tried to create a small program that asks the user for a subreddit and then prints all the front-page headlines and links to the articles. Here is the code:
import requests
from bs4 import BeautifulSoup

subreddit = input('Type the subreddit you want to see : ')
link_visit = f'https://www.reddit.com/r/{subreddit}/'
print(link_visit)

base_url = link_visit
r = requests.get(base_url)
soup = BeautifulSoup(r.text, 'html.parser')

for article in soup.find_all('div', class_='top-matter'):
    headline = article.find('p', class_='title')
    print('HeadLine : ', headline.text)
    a = headline.find('a', href=True)
    link = a['href'].split('/domain')
    print('Link : ', link[0])
My problem is that sometimes it prints the desired result, and other times it does nothing: it only asks the user for the subreddit and prints the link to said subreddit.
Can someone explain why this is happening?
Your request is being rejected by reddit in order to conserve their resources.
When you detect the failing case, print out the HTML. I think you'll see something like this:
<h1>whoa there, pardner!</h1>
<p>we're sorry, but you appear to be a bot and we've seen too many requests
from you lately. we enforce a hard speed limit on requests that appear to come
from bots to prevent abuse.</p>
<p>if you are not a bot but are spoofing one via your browser's user agent
string: please change your user agent string to avoid seeing this message
again.</p>
<p>please wait 3 second(s) and try again.</p>
<p>as a reminder to developers, we recommend that clients make no
more than <a href="http://github.com/reddit/reddit/wiki/API">one
request every two seconds</a> to avoid seeing this message.</p>
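A common workaround, not part of the original answer, is to send a distinctive User-Agent and throttle requests as reddit's message suggests; a minimal sketch (the User-Agent string is made up, and subreddit comes from the question's input() call):

import time
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'my-learning-script/0.1 (learning exercise)'}  # made-up UA for illustration
r = requests.get(f'https://www.reddit.com/r/{subreddit}/', headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')
time.sleep(2)  # respect the "one request every two seconds" guidance quoted above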

Python Beautiful Soup - Getting past Steam's age check

I'm learning web scraping and I've been trying to write a program that extracts information from Steam's website as an exercise.
I want to write a program that visits the page of each top-10 best-selling game and extracts something, but my program just gets redirected to the age check page when it tries to visit M-rated games.
My program looks something like this:
from urllib.request import urlopen
from bs4 import BeautifulSoup

front_page = urlopen('http://store.steampowered.com/').read()
bs = BeautifulSoup(front_page, 'html.parser')
top_sellers = bs.select('#tab_topsellers_content a.tab_item_overlay')
for item in top_sellers:
    game_page = urlopen(item.get('href'))
    bs = BeautifulSoup(game_page.read(), 'html.parser')
    # Now I'm on the age check page :(
I don't know how to get past the age check. I've tried filling it out by sending a POST request like this:
from urllib.parse import urlencode

post_params = urlencode({'ageDay': '1', 'ageMonth': 'January', 'ageYear': '1988', 'snr': '1_agecheck_agecheck__age-gate'}).encode('utf-8')
page = urlopen(agecheckurl, post_params)  # agecheckurl is the URL of the age check page
But it doesn't work; I'm still on the age check page. Can anyone help me out here? How can I get past it?
Okay, it seems Steam uses a cookie to save the age check result; it's called birthtime.
Since I don't know how to set cookies using urllib, here is an example using requests:
import requests
cookies = {'birthtime': '568022401'}
r = requests.get('http://store.steampowered.com/', cookies=cookies)
Now there is no age check.
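Applied to the question's goal, a hedged sketch of fetching an individual game page with that cookie and parsing it (the app URL is just an example of an M-rated store page, and the cookie value is taken from the answer above):

import requests
from bs4 import BeautifulSoup

cookies = {'birthtime': '568022401'}  # value from the answer above
game_url = 'http://store.steampowered.com/app/271590/'  # example M-rated game page
r = requests.get(game_url, cookies=cookies)
soup = BeautifulSoup(r.text, 'html.parser')
print(soup.title.text if soup.title else 'no title found')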
I like to use Selenium Webdriver for form input, since it's an easy solution for clicks and keystrokes. You can look at the docs or check out the examples here, under "Filling Out and Submitting Forms".
https://automatetheboringstuff.com/chapter11/
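A Selenium sketch in that spirit (the element ids are assumptions carried over from the form field names in the question's POST attempt, and the button selector is a guess; inspect the actual age gate markup before relying on any of them):

from selenium import webdriver
from selenium.webdriver.support.ui import Select

driver = webdriver.Chrome()
driver.get('http://store.steampowered.com/app/271590/')  # example M-rated game page

# Field ids assumed from the POST parameters in the question (ageDay/ageMonth/ageYear)
Select(driver.find_element_by_id('ageDay')).select_by_visible_text('1')
Select(driver.find_element_by_id('ageMonth')).select_by_visible_text('January')
Select(driver.find_element_by_id('ageYear')).select_by_visible_text('1988')

# Guessed selector for the "View Page" button; adjust after inspecting the page
driver.find_element_by_css_selector('#view_product_page_btn').click()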

Need to scrape information from a webpage with a "show more" button, any recommendations?

I am currently developing a "crawler" for educational reasons.
Everything is working fine; I can extract URLs and information and save them in a JSON file. Everything is fine and dandy... EXCEPT:
the page has a "load more" button that I NEED to interact with in order for the crawler to continue looking for more URLs.
This is where I could use your help!
Any recommendations on how to do this?
I would like to interact with the "load more" button and re-send the HTML information to my crawler.
I would really appreciate any amount of help!
Website: http://virali.se/photo/gallery/
A bit of example code for finding business names:
import re
import requests
from bs4 import BeautifulSoup

# Method on the crawler class (excerpt)
def base_spider(self, max_pages, max_CIDS):
    url = "http://virali.se/photo/gallery/photog/"  # Input URL
    for pages in range(0, max_pages):
        source_code = requests.get(url)  # gets the source code from the URL
        plain_text = source_code.text  # plain-text form for BeautifulSoup
        soup = BeautifulSoup(plain_text, "html.parser")  # parse plain_text with the HTML parser
        for article in soup.find_all("article"):
            business_name_pattern = re.compile(r"<h1>(.*?)</?h1>")
            business_name_raw = str(re.findall(business_name_pattern, str(article)))
            business_name_clean = re.sub(r"[\[\]\'\"]", "", business_name_raw)
            self.myprint(business_name_clean)  # custom print function for unusual characters
This code is only looking for the business names, but of course it is going to run out of business names to search for if the "show more results" button on the page is not interacted with.
If you look at the site with a developer tool (I used Chrome) then you can see that an XHR post request is fired when you click the "Show more results" button.
In this case you can emulate this request to gather the data:
import requests

with requests.Session() as session:
    response = session.post("http://virali.se/photo/gallery/search", data={'start': 0})
    print(response.content)
The "magic" is in the data parameter of the session.post: it is the required argument to load the images from this offset. In the example above 0 is the first bunch of images you see per default on the site.
And you can parse response.content with BeautifulSoup.
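For example, a rough sketch that ties this together with the business-name extraction from the question (the stop condition and the way the offset is advanced are assumptions; verify the actual paging behaviour in the dev tools):

import requests
from bs4 import BeautifulSoup

names = []
start = 0
with requests.Session() as session:
    while True:
        # Same endpoint and data parameter as above; start is the offset
        response = session.post("http://virali.se/photo/gallery/search", data={'start': start})
        soup = BeautifulSoup(response.content, "html.parser")
        articles = soup.find_all("article")
        if not articles:  # assumption: an empty batch means there is nothing more to load
            break
        for article in articles:
            h1 = article.find("h1")
            if h1:
                names.append(h1.get_text(strip=True))
        start += len(articles)  # advance the offset by however many items came back

print(len(names), "business names collected")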
I hope this helps you get started. The example uses Python 3, but it can be done with Python 2 in the same manner (without using the with construct).
