I am trying to download PDF files from a website where the links sit in a paginated table. I can download the PDFs from the first page, but I cannot fetch them from all 4,000+ pages. When I tried to understand the logic by observing the URL during pagination, it appeared static, with no additional value appended to it, and I couldn't figure out a way to fetch all the PDFs from the table using BeautifulSoup.
Attached is the code I am using to download the PDF files from the table on the website:
# Import libraries
import re
import requests
from bs4 import BeautifulSoup

# URL from which PDFs are to be downloaded
url = "https://loksabha.nic.in/Questions/Qtextsearch.aspx"

# Request the URL and parse the response
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# Read the total number of result pages shown on the page
span = soup.find("span", id="ContentPlaceHolder1_lblfrom")
total_pages = re.findall(r'\d+', span.text)
print(total_pages[0])

# The results table holding the PDF links
# (assumed here to be the first <table> on the page; adjust the selector if needed)
table1 = soup.find("table")

i = 0
# From all links in the table, check for PDF links and download the file if present
for link in table1.find_all('a'):
    href = link.get('href', '')
    if '.pdf' in href and 'CalenderUploading' not in href:
        print(href)
        i += 1
        # Get response object for the link
        response = requests.get(href)
        # Write the content to a PDF file
        with open("pdf" + str(i) + ".pdf", 'wb') as pdf:
            pdf.write(response.content)
        print("File", i, "downloaded")
print("All PDF files downloaded")
Firstly, you need to establish a session the first time you call the site, so that the cookie values are stored:
sess = requests.session()
and then use sess.get (and sess.post) subsequently instead of requests.get.
Secondly, it is not static: subsequent pages are not fetched with a GET request. They are fetched with a POST request that carries ctl00$ContentPlaceHolder1$txtpage="2" for page 2.
So: make a session with requests, then capture the view parameters from the first response using BeautifulSoup. The values of __VIEWSTATE, __VIEWSTATEGENERATOR, __EVENTVALIDATION, etc. are in a <div class="aspNetHidden"> when you request the page for the first time. For subsequent pages you have to pass these parameters along with the page number as POST data, like ctl00$ContentPlaceHolder1$txtpage="2", using "POST" and not "GET". This is what the POST request sends, e.g. for page 4001 on the loksabha site.
Work out the other parts yourself; don't expect a complete solution here :-)
from bs4 import BeautifulSoup as bs
import requests

sess = requests.session()
resp = sess.get('https://loksabha.nic.in/Questions/Qtextsearch.aspx')
soup = bs(resp.content, 'html.parser')
vstat = soup.find('input', {'name': '__VIEWSTATE'})['value']
vstatgen = soup.find('input', {'name': '__VIEWSTATEGENERATOR'})['value']
vstatenc = soup.find('input', {'name': '__VIEWSTATEENCRYPTED'})['value']
eventval = soup.find('input', {'name': '__EVENTVALIDATION'})['value']

for pagenum in range(4000):  # change as per your old code
    postback = {'__EVENTTARGET': 'ctl00$ContentPlaceHolder1$cmdNext',
                '__EVENTARGUMENT': '',
                '__VIEWSTATE': vstat,
                '__VIEWSTATEGENERATOR': vstatgen,
                '__VIEWSTATEENCRYPTED': vstatenc,
                '__EVENTVALIDATION': eventval,
                'ctl00$txtSearchGlobal': '',
                'ctl00$ContentPlaceHolder1$ddlfile': '.pdf',
                'ctl00$ContentPlaceHolder1$TextBox1': '',
                'ctl00$ContentPlaceHolder1$btn': 'allwordbtn',
                'ctl00$ContentPlaceHolder1$btn1': 'titlebtn',
                'ctl00$ContentPlaceHolder1$txtpage': str(pagenum)}
    resp = sess.post('https://loksabha.nic.in/Questions/Qtextsearch.aspx', data=postback)
    soup = bs(resp.content, 'html.parser')
    # refresh the hidden-form values for the next request
    vstat = soup.find('input', {'name': '__VIEWSTATE'})['value']
    vstatgen = soup.find('input', {'name': '__VIEWSTATEGENERATOR'})['value']
    vstatenc = soup.find('input', {'name': '__VIEWSTATEENCRYPTED'})['value']
    eventval = soup.find('input', {'name': '__EVENTVALIDATION'})['value']
    ### process this page... extract the PDF links here
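For illustration only, the extraction step left open above could look roughly like this inside the loop, assuming (as in the question's own code) that the PDF links appear as plain <a href="...pdf"> anchors on each result page:
from urllib.parse import urljoin

# Sketch only: assumes the PDF links are <a href="...pdf"> anchors on each
# result page, as in the question's code; adjust if the markup differs.
for a in soup.find_all('a', href=True):
    href = a['href']
    if href.lower().endswith('.pdf') and 'CalenderUploading' not in href:
        pdf_url = urljoin('https://loksabha.nic.in/', href)  # handle relative links
        with open(pdf_url.rsplit('/', 1)[-1], 'wb') as f:
            f.write(sess.get(pdf_url).content)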
How to get access to this API:
import requests
url = 'https://b2c-api-premiumlabel-production.azurewebsites.net/api/b2c/page/menu?id_loja=2691'
print(requests.get(url))
I'm trying to retrieve data from this site via its API. I found the URL above and I can see its data, but I can't seem to get it right because I keep running into a 403 status code.
This is the website URL:
https://www.nagumo.com.br/osasco-lj46-osasco-ayrosa-rua-avestruz/departamentos
I'm trying to retrieve the item categories; they are visible to me, but I'm unable to fetch them.
Later I'll use these categories to iterate over the products API.
[screenshot: API Category]
Note: please be gentle, it's my first post here =]
To get the data shown in your image, the following headers and endpoint are needed:
import requests
headers = {
'sm-token': '{"IdLoja":2691,"IdRede":884}',
'User-Agent': 'Mozilla/5.0',
'Referer': 'https://www.nagumo.com.br/osasco-lj46-osasco-ayrosa-rua-avestruz/departamentos',
}
params = {
'id_loja': '2691',
}
r = requests.get('https://www.nagumo.com.br/api/b2c/page/menu', params=params, headers=headers)
r.json()
Not sure exactly what your issue is here.
But if you want to see the content of the response, and not just the 200/400 status, you need to add '.content' to your print.
E.g.
import requests

# Create a session
s = requests.Session()

# Example connection variables, probably not required for your use case.
setCookieUrl = 'https://www...'
HeadersJson = {'Accept-Language': 'en-us'}
bodyJson = {"__type": "xxx", "applicationName": "xxx", "userID": "User01", "password": "password2021"}

# GET request (otherUrl, otherBodyJson and otherHeadersJson are placeholders for your own values)
p = s.get(otherUrl, json=otherBodyJson, headers=otherHeadersJson)
print(p)            # Print the response status (200 etc.)
#print(p.headers)
#print(p.content)   # Print the content of the response.
#print(s.cookies)
I'm also new here haha, but besides the requests library, you'll also need another one like Beautiful Soup for what you're trying to do.
bs4 installation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-beautiful-soup
Once you install it and import it, just continue what you were doing to actively get your data:
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
This gets the entire HTML content of the page, so you can extract your data from it based on CSS selectors like this:
site_data = soup.select('selector')
site_data is a list of the elements matching that selector, so a simple for loop and a list to collect your items would suffice (as an example, getting the link for each book on a bookstore site).
For example, if I was trying to get links from a site:
import requests
from bs4 import BeautifulSoup

sites = []
URL = 'https://b2c-api-premiumlabel-production.azurewebsites.net/api/b2c/page/menu?id_loja=2691'
response = requests.get(URL)
soup = BeautifulSoup(response.text, "html.parser")
links = soup.select("a")  # list of all elements with this selector
for link in links:
    sites.append(link.get("href"))
Also, a helpful tip: when you inspect the page (right-click and press 'Inspect' at the bottom), you can see the code for the page. Go to the HTML, find the data you want, right-click it and select Copy -> Copy selector. This makes it really easy to get the data you want on that site.
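For instance, a copied selector can be dropped straight into soup.select; the selector string below is a made-up example of what Chrome might give you, not a real selector from the site above:
# Hypothetical selector: replace it with whatever "Copy selector" produced for your element.
selected = soup.select("#content > div.product-list > a.product-link")
for el in selected:
    print(el.get("href"), el.get_text(strip=True))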
helpful sites:
https://oxylabs.io/blog/python-web-scraping
https://realpython.com/beautiful-soup-web-scraper-python/
I am using the following URL to extract the JSON file for the price history:
https://steamcommunity.com/market/pricehistory/?appid=730&market_hash_name=P90%20|%20Blind%20Spot%20(Field-Tested)
The Python code I am using:
item = requests.get(URL, cookies={'steamLogin': steamid})  # get item data
print(str(currRun), ' out of ', str(len(allItemNames)) + ' code: ' + str(item.status_code))
item = item.content
item = json.loads(item)
I went through almost all the solutions posted in this community, but I am still getting status code 400 and items as [].
When I copy and paste the URL and open it in a browser, I am able to see the JSON file with the required data, but somehow the Jupyter notebook is unable to fetch the content.
I also tried Beautiful Soup to read the content with the following code:
r = requests.get(url)
#below code extracts the whole HTML Code of above URL
soup = BeautifulSoup(r.content, 'html5lib')
table = soup.find_all('pre')
print(table)
Output: []
You are getting [] because you are not authorized, so you receive an empty JSON array. You can check this by opening the link in incognito mode (Ctrl+Shift+N).
To authorize, you need to set the Cookie header on your request, so your code will look like this:
import requests
url = "https://steamcommunity.com/market/pricehistory/?appid=730&market_hash_name=P90%20%7C%20Blind%20Spot%20(Field-Tested)"
headers = {
"Cookie": "Your cookie"
}
json = requests.get(url, headers=headers).text
...
How to find the Cookie header (Chrome):
1. Go to the link with the JSON.
2. Press F12 to open the Chrome Development Toolkit.
3. Open the Network tab.
4. Reload the page.
5. Double-click the first request that was sent.
6. Open the Headers subtab.
7. Scroll to Request Headers.
8. Find the Cookie header.
I am trying to scrape some data from the WebMD message board. Initially I constructed a loop to get the page links for each category and stored them in a dataframe. When I run the loop I do get the proper number of posts for each subcategory, but only for the first page. Any ideas what might be going wrong?
import urllib.request
import pandas as pd
import bs4 as bs

lists2 = []
df1 = pd.DataFrame(columns=['page'], data=page_links)
for j in range(len(df1)):
    pages = (df1.page.iloc[j])
    print(pages)
    req1 = urllib.request.Request(pages, headers=headers)
    resp1 = urllib.request.urlopen(req1)
    soup1 = bs.BeautifulSoup(resp1, 'lxml')
    for body_links in soup1.find_all('div', class_="thread-detail"):
        body = body_links.a.get('href')
        lists2.append(body)
I am getting the proper page URL in the print call, but then it seems to iterate only over the first page and collect the links of its posts. Also, when I copy and paste the link for any page besides the first one, it momentarily loads the first page and then goes to the proper page number. I tried adding time.sleep(1), but that does not work. Another thing I tried was adding the header {'Cookie': 'PHPSESSID=notimportant'}.
Replace this line:
pages = (df1.page.iloc[j])
With this:
pages = df1.iloc[j, 0]
You will now iterate through the values of your DataFrame
If page_links is a list of URLs like
page_links = ["http://...", "http://...", "http://...", ]
then you could use it directly:
for url in page_links:
    req1 = urllib.request.Request(url, headers=headers)
If you need it in a DataFrame, then:
for url in df1['page']:
    req1 = urllib.request.Request(url, headers=headers)
But if your current code displays all the URLs and you only get results for one page, then the problem is not in the DataFrame but in the HTML and find_all.
It seems only the first page has <div class="thread-detail">, so find_all can't find it on the other pages and nothing gets added to the list. You should check this again; for the other pages you may need different arguments in find_all. But without the URLs to those pages we can't check it and we can't help more.
It can also be another common problem: the page may use JavaScript to add these elements, but BeautifulSoup can't run JavaScript, and then you would need [Selenium](https://selenium-python.readthedocs.io/) to control a web browser which can run JavaScript. You could turn off JavaScript in your browser and open the URLs to check whether you can still see the elements on the page and in the HTML in DevTools in Chrome/Firefox.
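If it does turn out to be JavaScript, a minimal Selenium sketch of that fallback might look like this; it assumes the posts still sit in div.thread-detail elements once the page has rendered (that selector is taken from the question's code, not verified), and that page_links holds the page URLs:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # needs a Chrome/chromedriver setup
lists2 = []
for url in page_links:
    driver.get(url)
    # same selector as the question's find_all; adjust if the rendered HTML differs
    for el in driver.find_elements(By.CSS_SELECTOR, "div.thread-detail a"):
        lists2.append(el.get_attribute("href"))
driver.quit()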
As for PHPSESSID: with requests you could use a Session to get fresh cookies (including PHPSESSID) from the server and add them automatically to the other requests.
import requests

s = requests.Session()
# get any page to receive fresh cookies from the server
r = s.get('http://your-domain/main-page.html')
# the cookies are now sent automatically with every request in this session
for url in page_links:
    r = s.get(url)
I'm looking for some library or libraries in Python to:
a) log in to a web site,
b) find all links to some media files (let us say having "download" in their URLs), and
c) download each file efficiently directly to the hard drive (without loading the whole media file into RAM).
Thanks
You can use the widely used requests module (more than 35k stars on GitHub) together with BeautifulSoup. The former handles session cookies, redirections, encodings, compression and more transparently. The latter finds parts of the HTML code and has an easy-to-remember syntax, e.g. [] for properties of HTML tags.
Below is a complete example in Python 3.5.2 for a web site that you can scrape without a JavaScript engine (otherwise you can use Selenium), sequentially downloading the links that have download in their URL.
import shutil
import sys
import requests
from bs4 import BeautifulSoup
""" Requirements: beautifulsoup4, requests """
SCHEMA_DOMAIN = 'https://example.com'
URL = SCHEMA_DOMAIN + '/house.php/' # this is the log-in URL
# here are the name property of the input fields in the log-in form.
KEYS = ['login[_csrf_token]',
'login[login]',
'login[password]']
client = requests.session()
request = client.get(URL)
soup = BeautifulSoup(request.text, features="html.parser")
data = {KEYS[0]: soup.find('input', dict(name=KEYS[0]))['value'],
KEYS[1]: 'my_username',
KEYS[2]: 'my_password'}
# The first argument here is the URL of the action property of the log-in form
request = client.post(SCHEMA_DOMAIN + '/house.php/user/login',
data=data,
headers=dict(Referer=URL))
soup = BeautifulSoup(request.text, features="html.parser")
generator = ((tag['href'], tag.string)
             for tag in soup.find_all('a', href=True)
             if 'download' in tag['href'])
for url, name in generator:
with client.get(SCHEMA_DOMAIN + url, stream=True) as request:
if request.status_code == 200:
with open(name, 'wb') as output:
request.raw.decode_content = True
shutil.copyfileobj(request.raw, output)
else:
print('status code was {} for {}'.format(request.status_code,
name),
file=sys.stderr)
You can use the mechanize module to log into websites like so:
import mechanize
br = mechanize.Browser()
br.set_handle_robots(False)
br.open("http://www.example.com")
br.select_form(nr=0) #Pass parameters to uniquely identify login form if needed
br['username'] = '...'
br['password'] = '...'
result = br.submit().read()
Use bs4 to parse this response and find all the hyperlinks in the page like so:
from bs4 import BeautifulSoup
import re
soup = BeautifulSoup(result, "lxml")
links = []
for link in soup.findAll('a'):
links.append(link.get('href'))
You can use re to further narrow down the links you need from all the links present in the response page, i.e. the media links (.mp3, .mp4, .jpg, etc.) in your case.
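For example, a rough sketch of that filtering step, keeping only the hrefs that end in a few assumed media extensions (adjust the pattern to your files):
import re

# Assumed extensions for illustration; extend the pattern as needed.
media_pattern = re.compile(r'\.(mp3|mp4|jpg)$', re.IGNORECASE)
media_links = [link for link in links if link and media_pattern.search(link)]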
Finally, use the requests module to stream the media files so that they don't take up too much memory, like so:
response = requests.get(url, stream=True) #URL here is the media URL
handle = open(target_path, "wb")
for chunk in response.iter_content(chunk_size=512):
if chunk: # filter out keep-alive new chunks
handle.write(chunk)
handle.close()
When the stream parameter of get() is set to True, the content does not immediately download into RAM; instead the response behaves like an iterable, which you can iterate over in chunks of size chunk_size in the loop right after the get() call. Before moving on to the next chunk, the previous one is written to disk, so the whole file is never held in RAM.
You will have to put this last block of code in a loop if you want to download the media from every link in the links list (see the sketch below).
You will probably end up having to make some changes to this code to get it working, as I haven't tested it for your use case myself, but hopefully this gives you a blueprint to work from.
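A rough sketch of that loop, under the assumption that every entry in media_links (or links, if you skip the regex filtering) is an absolute URL; the filename is derived naively from the URL here:
import os
import requests

for url in media_links:  # or links, if you skip the regex filtering
    target_path = os.path.basename(url)  # naive filename choice; adjust as needed
    response = requests.get(url, stream=True)
    with open(target_path, "wb") as handle:
        for chunk in response.iter_content(chunk_size=512):
            if chunk:  # filter out keep-alive chunks
                handle.write(chunk)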
I've written a script in Python using POST requests to get data from a webpage. The webpage traverses 57 pages via a Next button or a dropdown. What I've written so far can fetch data only from the first page. I've tried hard to find a way to capture the data from its next pages but failed. How can I get the data from all 57 pages? Thanks in advance.
Here is what I've tried so far:
import requests
from lxml import html

with requests.session() as session:
    session.headers = {"User-Agent": "Mozilla/5.0"}
    page = session.post("http://registers.centralbank.ie/(X(1)S(cvjcqdbijraticyy2ssdyqav))/FundSearchResultsPage.aspx?searchEntity=FundServiceProvider&searchType=Name&searchText=&registers=6%2c29%2c44%2c45&AspxAutoDetectCookieSupport=1",
                        data={'ctl00$cphRegistersMasterPage$gvwSearchResults$ctl18$ddlPages': '2'},
                        headers={'Content-Type': 'application/x-www-form-urlencoded'})
    tree = html.fromstring(page.text)
    titles = tree.cssselect("table")[1]
    list_row = [[tab_d.text_content() for tab_d in item.cssselect('td.gvwColumn,td.entityNameColumn,td.entityTradingNameColumn')]
                for item in titles.cssselect('tr')]
    for data in list_row:
        print(' '.join(data))
This is the link to that page.
Btw, I didn't find any pagination links through which I could go on to the next page, except for the "data" in the request parameters, where there is a page-number option which changes when the button is clicked. However, changing that number doesn't bring data from other pages.