I've written a script in Python that uses POST requests to get data from a webpage. The webpage traverses 57 pages with a next button or a dropdown. What I've written so far can fetch data only from the first page. I've tried hard to find a way to capture the data while going through its next pages but failed. How can I get data from all 57 pages? Thanks in advance.
Here is what I've tried so far:
import requests
from lxml import html

with requests.Session() as session:
    session.headers = {"User-Agent": "Mozilla/5.0"}
    page = session.post(
        "http://registers.centralbank.ie/(X(1)S(cvjcqdbijraticyy2ssdyqav))/FundSearchResultsPage.aspx?searchEntity=FundServiceProvider&searchType=Name&searchText=&registers=6%2c29%2c44%2c45&AspxAutoDetectCookieSupport=1",
        data={'ctl00$cphRegistersMasterPage$gvwSearchResults$ctl18$ddlPages': '2'},
        headers={'Content-Type': 'application/x-www-form-urlencoded'})
    tree = html.fromstring(page.text)
    titles = tree.cssselect("table")[1]
    list_row = [[tab_d.text_content() for tab_d in item.cssselect('td.gvwColumn,td.entityNameColumn,td.entityTradingNameColumn')]
                for item in titles.cssselect('tr')]
    for data in list_row:
        print(' '.join(data))
The link to that page is the same URL used in the post request above.
By the way, I couldn't find any pagination links to go on to the next page; the only thing I found is the page-number value in the request's "data" parameter, which changes when the button is clicked. However, changing that number alone doesn't bring back data from the other pages.
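One direction that may work, based on how ASP.NET WebForms pagination usually behaves (a sketch only - the __EVENTTARGET value and the overall flow are assumptions I haven't verified against this site): fetch the page once, read back the hidden __VIEWSTATE / __VIEWSTATEGENERATOR / __EVENTVALIDATION fields, and re-post them together with the dropdown value for each page:

import requests
from lxml import html

URL = ("http://registers.centralbank.ie/(X(1)S(cvjcqdbijraticyy2ssdyqav))/FundSearchResultsPage.aspx"
       "?searchEntity=FundServiceProvider&searchType=Name&searchText=&registers=6%2c29%2c44%2c45"
       "&AspxAutoDetectCookieSupport=1")

with requests.Session() as session:
    session.headers = {"User-Agent": "Mozilla/5.0"}

    # 1. GET the first page so the server issues the hidden WebForms fields
    first = session.get(URL)
    tree = html.fromstring(first.text)

    def hidden(name):
        # pull a hidden input's value out of the current page (empty string if absent)
        values = tree.xpath(f"//input[@name='{name}']/@value")
        return values[0] if values else ""

    for page_no in range(2, 58):          # pages 2..57
        payload = {
            "__VIEWSTATE": hidden("__VIEWSTATE"),
            "__VIEWSTATEGENERATOR": hidden("__VIEWSTATEGENERATOR"),
            "__EVENTVALIDATION": hidden("__EVENTVALIDATION"),
            # assumption: the dropdown itself is the postback target
            "__EVENTTARGET": "ctl00$cphRegistersMasterPage$gvwSearchResults$ctl18$ddlPages",
            "ctl00$cphRegistersMasterPage$gvwSearchResults$ctl18$ddlPages": str(page_no),
        }
        resp = session.post(URL, data=payload)
        tree = html.fromstring(resp.text)   # refresh the hidden fields for the next iteration
        # ... parse the table from `tree` as in the snippet above ...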
I want to scrape data from the tables at this URL, but I couldn't find the table's class.
I tried the first steps in BeautifulSoup but couldn't get any further.
from pandas.io.html import read_html
import requests
from bs4 import BeautifulSoup

url = 'https://www.cbe.org.eg/en/Auctions/Pages/AuctionsEGPTBillsHistorical.aspx'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
# The table has no class attribute, so there is nothing to pass to class_
bills_table = soup.find('table')
I appreciate any help.
Thanks.
When clicking on that link, a page is displayed containing a date range, with the ability to show a table (which I assume is the table you are after) on a click. You have two options here:
1. Simulate the website using a package such as Selenium.
2. Send the POST request yourself directly.
There is documentation available for Selenium that should help with option 1. I will describe option 2 in more detail.
Open the linked website, open the dev panel in whichever browser you are using, and navigate to the 'Network' tab. Enter a date range and press the button that shows the table. A new entry will appear in the Network tab; that is the request we want to reproduce. Click on it and create a requests.post(...) call with the listed request headers and body. You can then edit the body to change the date range.
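For illustration, a skeleton of such a request might look like this (the header and form-field names below are placeholders, not the real ones for cbe.org.eg - copy the actual values from the captured request in the Network tab):

import requests

url = "https://www.cbe.org.eg/en/Auctions/Pages/AuctionsEGPTBillsHistorical.aspx"

# Placeholder values: replace with the headers and form body shown in the captured request
headers = {
    "User-Agent": "Mozilla/5.0",
    "Content-Type": "application/x-www-form-urlencoded",
}
data = {
    "DateFrom": "01/01/2021",   # hypothetical field name for the start of the date range
    "DateTo": "31/12/2021",     # hypothetical field name for the end of the date range
}

r = requests.post(url, headers=headers, data=data)
print(r.status_code)
# Then parse r.text with BeautifulSoup/lxml, or feed it to pandas.read_html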
How do I get access to this API? This is what I've tried:
import requests
url = 'https://b2c-api-premiumlabel-production.azurewebsites.net/api/b2c/page/menu?id_loja=2691'
print(requests.get(url))
I'm trying to retrieve data from this site via its API. I found the URL above and I can see its data, but I can't seem to get it right because I keep running into a 403 status code.
This is the website url:
https://www.nagumo.com.br/osasco-lj46-osasco-ayrosa-rua-avestruz/departamentos
I'm trying to retrieve the item categories; they are visible to me on the site, but I'm unable to pull them programmatically.
Later I'll use these categories to iterate over the products API.
(Screenshot: the category data returned by the API)
PS: please be gentle, it's my first post here =]
To get the data shown in your image, the following headers and endpoint are needed:
import requests

headers = {
    'sm-token': '{"IdLoja":2691,"IdRede":884}',
    'User-Agent': 'Mozilla/5.0',
    'Referer': 'https://www.nagumo.com.br/osasco-lj46-osasco-ayrosa-rua-avestruz/departamentos',
}
params = {
    'id_loja': '2691',
}
r = requests.get('https://www.nagumo.com.br/api/b2c/page/menu', params=params, headers=headers)
r.json()
Not sure exactly what your issue is here.
But if you want to see the content of the response, and not just the 200/400 status, you need to add .content (or .text) to your print.
E.g.:
import requests

# Create a session
s = requests.Session()

# Example connection variables - probably not required for your use case
setCookieUrl = 'https://www...'
HeadersJson = {'Accept-Language': 'en-us'}
bodyJson = {"__type": "xxx", "applicationName": "xxx", "userID": "User01", "password": "password2021"}

# GET request (otherUrl / otherBodyJson / otherHeadersJson are placeholders for your own request details)
p = s.get(otherUrl, json=otherBodyJson, headers=otherHeadersJson)

print(p)            # prints just the response status, e.g. <Response [200]>
#print(p.headers)   # the response headers
#print(p.content)   # the content (body) of the response
#print(s.cookies)   # cookies stored on the session
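In other words, printing the response object only shows its status; the body lives in the response's attributes. A quick self-contained illustration (using httpbin.org as a stand-in endpoint):

import requests

r = requests.get("https://httpbin.org/get")

print(r)              # <Response [200]> - only the status
print(r.status_code)  # 200
print(r.content)      # raw bytes of the response body
print(r.text)         # the body decoded to a string
print(r.json())       # the body parsed as JSON (raises ValueError if it isn't JSON)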
I'm also new here haha, but besides the requests library, you'll also need another one like Beautiful Soup for what you're trying to do.
bs4 installation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-beautiful-soup
Once you install and import it, just continue what you were doing to actively get your data:
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
This gets the entire HTML content of the page, and so you can get your data from the page based on its CSS selectors, like this:
site_data = soup.select('selector')
site_data is a list of elements matching that 'selector', so a simple for loop appending the items you want to a list would suffice (as an example, getting the links for each book on a bookstore site).
For example, if I were trying to get links from a site:
import requests
from bs4 import BeautifulSoup

sites = []
url = 'https://b2c-api-premiumlabel-production.azurewebsites.net/api/b2c/page/menu?id_loja=2691'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

links = soup.select("a")  # list of all elements matching this selector
for link in links:
    sites.append(link)    # appends the whole <a> tag; use link.get('href') for just the URL
Also, a helpful tip: when you inspect the page (right-click and press 'Inspect' at the bottom of the menu), you can see the page's code. Go to the HTML, find the data you want, then right-click it and select Copy -> Copy selector. This makes it really easy to grab the data you want on that site.
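For instance, a selector copied that way (the selector string below is just a made-up example) can be dropped straight into select or select_one on the soup object from above:

# hypothetical selector copied from DevTools via Copy -> Copy selector
book_link = soup.select_one("#content > div.book-list > div:nth-child(1) > a")
if book_link is not None:
    print(book_link.get("href"))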
helpful sites:
https://oxylabs.io/blog/python-web-scraping
https://realpython.com/beautiful-soup-web-scraper-python/
I am trying to crawl the job links from this website.
First, let me explain my code:
I crawl all the links of each job using a for loop.
After getting all the links on the first page, I move to the next page and repeat crawling the job links.
But the program's output shows that, from the 11th link onward, it just repeats the first 10 links.
My assumption is that it doesn't actually go to the next page but keeps crawling the old page; in that case, however many pages there are, the program will just crawl the first page that many times.
Also, there should be more than 9 links on the first page.
I don't really know how to fix this.
How can I solve this problem?
Sincere thanks!
import requests

def main(url):
    params = {
        "x-algolia-agent": "Algolia for JavaScript (3.35.1); Browser",
        "x-algolia-application-id": "JF8Q26WWUD",
        "x-algolia-api-key": "ecef10153e66bbd6d54f08ea005b60fc"
    }
    data = "{\"requests\":[{\"indexName\":\"vnw_job_v2\",\"params\":\"query=&hitsPerPage=1000&attributesToRetrieve=%5B%22*%22%2C%22-jobRequirement%22%2C%22-jobDescription%22%5D&attributesToHighlight=%5B%5D&query=&facetFilters=%5B%5D&filters=&numericFilters=%5B%5D&page=0&restrictSearchableAttributes=%5B%22jobTitle%22%2C%22skills%22%2C%22company%22%5D\"}]}"
    r = requests.post(url, params=params, data=data)
    for item in r.json()['results'][0]['hits']:
        print(item['jobTitle'])

if __name__ == "__main__":
    main('https://jf8q26wwud-dsn.algolia.net/1/indexes/*/queries')
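If you prefer to walk through the pages instead of requesting everything at once, the request body already carries a page=0 parameter; a sketch of iterating over it (untested, and assuming the response includes Algolia's usual nbPages field) could look like this:

import requests

URL = "https://jf8q26wwud-dsn.algolia.net/1/indexes/*/queries"
PARAMS = {
    "x-algolia-agent": "Algolia for JavaScript (3.35.1); Browser",
    "x-algolia-application-id": "JF8Q26WWUD",
    "x-algolia-api-key": "ecef10153e66bbd6d54f08ea005b60fc",
}

def body(page):
    # same index and query as above, but with a smaller page size and a variable page number
    return {
        "requests": [{
            "indexName": "vnw_job_v2",
            "params": f"query=&hitsPerPage=50&page={page}",
        }]
    }

page = 0
while True:
    result = requests.post(URL, params=PARAMS, json=body(page)).json()["results"][0]
    for hit in result["hits"]:
        print(hit["jobTitle"])
    page += 1
    if page >= result.get("nbPages", 0):   # stop when the last page is reached (or nbPages is missing)
        break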
I am working on a web scraping project and want to get a list of products from Dell's website. I found this link (https://www.dell.com/support/home/us/en/04/products/), which pulls up a box with a list of product categories (really just redirect URLs; if it doesn't come up for you, click the button that says "Browse all products"). I tried using Python requests to GET the page and save the text to a file to parse through, but the response doesn't contain any of the categories/redirect URLs. My code is as basic as it gets:
import requests

url = "https://www.dell.com/support/home/us/en/04/products/"
page = requests.get(url)
with open("laptops.txt", "w", encoding="utf-8") as outf:
    outf.write(page.text)
Is there a way to get these redirect urls? I am essentially trying to make my own site map of their products so that I can scrape the details of each one. Thanks
This page uses JavaScript to get and display these links - but requests/urllib and BeautifulSoup/lxml can't run JavaScript.
Using DevTools in Firefox/Chrome (Network tab) I found that it reads them from the URL
https://www.dell.com/support/components/productselector/allproducts?category=all-products/esuprt_&country=pl&language=pl&region=emea&segment=bsd&customerset=plbsd1&openmodal=true&_=1589265310743
so I use that URL to get the links.
You may have to change country=pl&language=pl in the URL to get it in a different language.
import requests
from bs4 import BeautifulSoup as BS

url = "https://www.dell.com/support/components/productselector/allproducts?category=all-products/esuprt_&country=pl&language=pl&region=emea&segment=bsd&customerset=plbsd1&openmodal=true&_=1589265310743"

response = requests.get(url)
soup = BS(response.text, 'html.parser')

all_items = soup.find_all('a')
for item in all_items:
    print(item.text, item['href'])
BTW: another method is to use Selenium to control a real web browser, which can run JavaScript.
Try using the Selenium Chrome driver; it helps with handling dynamic data on a website and also offers features like clicking buttons, handling page refreshes, etc.
Beginner guide to web scraping
I am trying to scrape some data from the WebMD message board. Initially I constructed a loop to get the page numbers for each category and stored them in a DataFrame. When I run the loop I do get the proper number of posts for each subcategory, but only for the first page. Any ideas what might be going wrong?
import urllib.request
import pandas as pd
import bs4 as bs

lists2 = []
df1 = pd.DataFrame(columns=['page'], data=page_links)
for j in range(len(df1)):
    pages = (df1.page.iloc[j])
    print(pages)
    req1 = urllib.request.Request(pages, headers=headers)
    resp1 = urllib.request.urlopen(req1)
    soup1 = bs.BeautifulSoup(resp1, 'lxml')
    for body_links in soup1.find_all('div', class_="thread-detail"):
        body = body_links.a.get('href')
        lists2.append(body)
I am getting the proper page in the print function, but then it seems to iterate only over the first page and collect the links of its posts. Also, when I copy and paste the link of any page besides the first one, it momentarily loads the first page and then goes to the proper page number. I tried adding time.sleep(1) but it does not work. Another thing I tried was adding headers={'Cookie': 'PHPSESSID=notimportant'}.
Replace this line:
pages = (df1.page.iloc[j])
With this:
pages = df1.iloc[j, 0]
You will now iterate through the values of your DataFrame
If page_links is a list with URLs like
page_links = ["http://...", "http://...", "http://...", ]
then you could use it directly:
for url in page_links:
    req1 = urllib.request.Request(url, headers=headers)
If you need it in a DataFrame, then:
for url in df1['page']:
    req1 = urllib.request.Request(url, headers=headers)
But if your current code displays all the URLs and you still get results only for one page, then the problem is not in the DataFrame but in the HTML and find_all.
It may be that only the first page has <div class="thread-detail">, so find_all can't locate it on the other pages and nothing gets added to the list. You should check that again; for the other pages you may need different arguments in find_all. But without the URLs of those pages we can't check it and can't help more.
It can also be another common problem - the page may use JavaScript to add these elements, and BeautifulSoup can't run JavaScript - in which case you would need Selenium (https://selenium-python.readthedocs.io/) to control a web browser that can run JavaScript. You could turn off JavaScript in your browser, open the URLs, and check whether you can still see the elements on the page and in the HTML in DevTools in Chrome/Firefox.
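If JavaScript turns out to be the culprit, a minimal Selenium sketch of the same loop could look like this (assuming chromedriver is installed and page_links is the same list of URLs as in the question):

from selenium import webdriver
import bs4 as bs

driver = webdriver.Chrome()        # needs a matching chromedriver on PATH

lists2 = []
for url in page_links:             # the same list of page URLs as in the question
    driver.get(url)                # the browser executes the page's JavaScript
    soup1 = bs.BeautifulSoup(driver.page_source, 'lxml')
    for body_links in soup1.find_all('div', class_="thread-detail"):
        lists2.append(body_links.a.get('href'))

driver.quit()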
As for PHPSESSID: with requests you could use a Session to get fresh cookies (including PHPSESSID) from the server and automatically send them with the other requests:
import requests

s = requests.Session()

# get any page to receive fresh cookies from the server
r = s.get('http://your-domain/main-page.html')

# the session now sends those cookies automatically
for url in page_links:
    r = s.get(url)