I'm trying to scrape the number of followers from this page http://freelegalconsultancy.blogspot.co.uk/ but can't seem to pull it. I've tried urllib, urllib2, urllib3, Selenium and Beautiful Soup, but have had no luck pulling the follower count. Here's what my code looks like currently:
import urllib2
url = "http://freelegalconsultancy.blogspot.co.uk/"
opener = urllib2.urlopen(url)
for item in opener:
    print item
How would I go about pulling the number of followers?
Try Selenium code like the one below:
from selenium import webdriver
driver = webdriver.Firefox()
driver.get('http://freelegalconsultancy.blogspot.co.uk/')
driver.switch_to_frame(driver.find_element_by_xpath('//div[@id="followers-iframe-container"]/iframe'))
followers_text = driver.find_element_by_xpath('//div[@class="member-title"]').text
followers = int(followers_text.split('(')[1].split(')')[0])
The last line is a bit crude, so feel free to change it if you like.
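On current Selenium 4 releases, the find_element_by_* helpers and switch_to_frame have been removed, so the same approach would look roughly like the sketch below; the XPaths are the ones from the answer above and may need adjusting if Blogger changes its markup:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get('http://freelegalconsultancy.blogspot.co.uk/')

# The follower count lives inside an iframe, so switch into it first.
iframe = driver.find_element(By.XPATH, '//div[@id="followers-iframe-container"]/iframe')
driver.switch_to.frame(iframe)

# e.g. text like "Members (119)" -> 119
followers_text = driver.find_element(By.XPATH, '//div[@class="member-title"]').text
followers = int(followers_text.split('(')[1].split(')')[0])
print(followers)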
Is it possible to send a GET request to a webdriver using Selenium?
I want to scrape a website with an infinite page and collect a substantial number of the objects on it. For this I use Selenium to open the website in a webdriver and scroll down the page until enough objects are visible.
However, I'd like to parse the page with BeautifulSoup, since that is the most effective approach in this case. If the GET request is sent in the normal way (see the code), the response only holds the first objects and not the objects from the scrolled-down page (which makes sense).
But is there any way to send a GET request to an open webdriver?
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
import requests
import time
from bs4 import BeautifulSoup

url = '...'  # the infinite-scroll page (not named in the question)

# Opening the website in the webdriver
driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get(url)

# Loop for scrolling
scroll_start = 0
for i in range(100):
    scroll_end = scroll_start + 1080
    driver.execute_script(f'window.scrollTo({scroll_start}, {scroll_end})')
    time.sleep(2)
    scroll_start = scroll_end

# The get request
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
You should probably find the endpoint the website uses to fetch the data for the infinite scrolling.
Go to the website, open the dev tools, open the Network tab and find the HTTP request that is asking for the content you're seeking; then maybe you can use it too. Just know that there are a lot of variables: are they using some sort of authorization for their API? Does the API return JSON, XML, HTML, ...? Also, I am not sure whether this counts as fair use.
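Once you have spotted the request in the Network tab, a paginated JSON endpoint can often be queried directly. A minimal sketch, where the URL, the page parameter and the response shape are hypothetical placeholders rather than any real site's API:

import requests

# Hypothetical endpoint discovered in the browser's Network tab.
API_URL = 'https://example.com/api/items'

items = []
for page in range(1, 10):
    # Infinite-scroll pages often just page through a JSON API like this.
    r = requests.get(API_URL, params={'page': page})
    r.raise_for_status()
    batch = r.json()
    if not batch:
        break  # no more results
    items.extend(batch)

print(len(items))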
from bs4 import BeautifulSoup
import requests
import time
from selenium import webdriver
driver = webdriver.Chrome(r'C:\chromedriver.exe')
url ='https://www.sambav.com/hyderabad/doctors'
driver.get(url)
soup = BeautifulSoup(driver.page_source,'html.parser')
for links in soup.find_all('div', class_='sambavdoctorname'):
    link = links.find('a')
    print(link['href'])
driver.close()
I am trying to scrape this page; the link stays the same on every page. I am trying to extract the links from all the pages, but it gives no output and shows no error; the program just ends.
If you inspect that website with the developer tools in your browser (Chrome, Firefox, or whatever), you'll see that before the page renders, it fetches data from a few sources. One of those sources is "https://www.sambav.com/api/search/DoctorSearch?searchText=&city=Hyderabad&location=". Your code can be simplified accordingly (and there is no need to use Selenium):
import requests
r = requests.get('https://www.sambav.com/api/search/DoctorSearch?searchText=&city=Hyderabad&location=')
BASE_URL_DOCTOR = 'https://www.sambav.com/hyderabad/doctor/'
for item in r.json():
    print(BASE_URL_DOCTOR + item['uniqueName'])
I'm practicing web scraping with Python at the moment and I've run into a problem: I wanted to scrape a website that lists anime I've watched, but when I try to scrape it (via requests or Selenium) it only gets around 30 of the 110 anime names on the page.
Here is my code with selenium:
from selenium import webdriver
from bs4 import BeautifulSoup
browser = webdriver.Firefox()
browser.get("https://anilist.co/user/Agusmaris/animelist/Completed")
data = BeautifulSoup(browser.page_source, 'lxml')
for title in data.find_all(class_="title"):
    print(title.getText())
When I run it, the page source only goes up to an anime called 'Golden Time', when there are 70 or more titles left on the page.
Thanks
Edit: Code that works now thanks to 'supputuri':
from selenium import webdriver
from bs4 import BeautifulSoup
import time
driver = webdriver.Firefox()
driver.get("https://anilist.co/user/Agusmaris/animelist/Completed")
time.sleep(3)
footer = driver.find_element_by_css_selector("div.footer")
preY = 0
print(str(footer))
while footer.rect['y'] != preY:
    preY = footer.rect['y']
    footer.location_once_scrolled_into_view
    print('loading')

html = driver.page_source
soup = BeautifulSoup(html, 'lxml')
for title in soup.find_all(class_="title"):
    print(title.getText())

driver.close()
driver.quit()
ret = input()
Here is the solution.
Make sure to add import time
driver.get("https://anilist.co/user/Agusmaris/animelist/Completed")
time.sleep(3)
footer = driver.find_element_by_css_selector("div.footer")
preY = 0
while footer.rect['y'] != preY:
    preY = footer.rect['y']
    footer.location_once_scrolled_into_view
    time.sleep(1)
print(str(driver.page_source))
This will keep scrolling until all the anime entries are loaded, then grab the page source. Accessing footer.location_once_scrolled_into_view scrolls the footer into view as a side effect, so the loop repeats until the footer's y-coordinate stops changing, i.e. until nothing new is being appended below it.
Let us know if this was helpful.
So, this is the gist of what I get when I load the page source:
AniListwindow.al_token = 'E1lPa1kzYco5hbdwT3GAMg3OG0rj47Gy5kF0PUmH';Sorry, AniList requires Javascript.Please enable Javascript or http://outdatedbrowser.com>upgrade to a modern web browser.Sorry, AniList requires a modern browser.Please http://outdatedbrowser.com>upgrade to a newer web browser.
Since I know damn well that JavaScript is enabled and my Chrome version is fully up to date, and the URL listed leads to a non-secure site offering to "download" a new browser version, I think this is a spam site. I'm not sure whether you were aware of that when posting, so I won't flag it as such, but I wanted you and others who come across this to be aware.
I would like some advice on how to scrape data from this website.
I started with Selenium, but got stuck right at the beginning because, for example, I have no idea how to set the dates.
My code until now:
from bs4 import BeautifulSoup as soup
from openpyxl import load_workbook
from openpyxl.styles import PatternFill, Font
from selenium import webdriver
from selenium.webdriver.common.by import By
import datetime
import os
import time
import re
# use timedelta so "5 days ago" works across month boundaries
today = datetime.date.today()
from_date = today - datetime.timedelta(days=5)

cookieValue = ('12-c12-cached|from:' + str(from_date.year) + '-' + str(from_date.month) + '-' + str(from_date.day) +
               ',to:' + str(today.year) + '-' + str(today.month) + '-' + str(today.day) +
               ',dateType:1,company:PreussenElektra,fuel:uranium,canceled:0,durationComparator:ge,durationValue:5,durationUnit:day')

browser = webdriver.Chrome(executable_path=r"C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe")

# set the filter cookie, then load the page
my_url = 'https://www.eex-transparency.com/homepage/power/germany/production/availability/non-usability-by-unit'
browser.add_cookie({'name': 'tem', 'value': cookieValue})
browser.get(my_url)

my_url = 'https://www.eex-transparency.com/homepage/power/germany/production/availability/non-usability-by-unit/non-usability-history'
browser.get(my_url)
Obviously I am not asking for code, just some suggestions on how to continue with Selenium (how to set the dates and other filters), or any other idea on how to scrape this website.
Thanks in advance.
EDIT: I am trying the cookie approach; my updated code is above. I read that the cookie needs to be created before the page is loaded, and that is what I do, so any idea why it is not working?
The best approach for you will be changing cookies, because all the filter data is saved in a cookie.
Check the cookies in Chrome (F12 -> Application -> Cookies) and play with the filters. If you change a cookie in the dev tools, you have to refresh the website for it to take effect. :)
Check this post on how to change cookies in Selenium with Python.
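For what it's worth, here is a minimal sketch of setting that cookie in Selenium. Note that Selenium only lets you add a cookie for the domain that is currently loaded, which is likely why adding it before the first get() fails:

from selenium import webdriver

browser = webdriver.Chrome()

# Selenium can only set cookies for the current domain,
# so load any page of the site first.
browser.get('https://www.eex-transparency.com/')

cookieValue = '...'  # the filter string built as in the question
browser.add_cookie({'name': 'tem', 'value': cookieValue})

# Reload the target page so the new filter cookie takes effect.
browser.get('https://www.eex-transparency.com/homepage/power/germany/production/availability/non-usability-by-unit/non-usability-history')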
To read values from the website you can use the classic approach, as you did here, but you will have to use the elements' classes:
radio = browser.find_elements_by_class_name('aaaaaa')
You can always use XPath to locate elements (Chrome will generate XPaths for you: right-click an element in the Elements panel, then Copy -> Copy XPath).
Is there any particular reason why you've decided to use Selenium over other web scraping tools (Scrapy, urllib, etc.)? I personally have not used Selenium, but I have used some of the other tools. Below is an example of a script that just pulls all the HTML from a page.
import urllib2
from bs4 import BeautifulSoup as soup

link = "https://ubuntu.com"
page = urllib2.urlopen(link)
data = soup(page, 'html.parser')
print(data)
This is just a short script to pull all the HTML off a page. I believe there are additional tools for inputting data into fields, but the exact method slips my mind right now; if I can find my notes on it I will edit this post. I remember it being very straightforward, though.
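(To be precise, BeautifulSoup itself only parses HTML; submitting data into form fields is usually done with an HTTP client such as requests. A rough sketch, where the form URL and field name are made-up placeholders -- inspect the real page's <form> tag for the actual action URL and input names:)

import requests
from bs4 import BeautifulSoup

# Hypothetical form endpoint and field name.
resp = requests.post('https://example.com/search', data={'query': 'ubuntu'})
soup = BeautifulSoup(resp.text, 'html.parser')
print(soup.title)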
Best of luck!
Edit: here's a discussion of web scraping tools from reddit a while back that I had saved: https://www.reddit.com/r/Python/comments/1qnbq3/webscraping_selenium_vs_conventional_tools/
I'm trying to loop through Zillow pages and extract data. I know that the URL is being updated with a new page number after each iteration, but the data extracted is as if the URL were still on page 1.
import selenium
from selenium import webdriver
import requests
from bs4 import BeautifulSoup
import pandas as pd
next_page='https://www.zillow.com/romeo-mi-48065/real-estate-agent-reviews/'
num_data1=pd.DataFrame(columns=['name','number'])
browser=webdriver.Chrome()
browser.get('https://www.zillow.com/romeo-mi-48065/real-estate-agent-reviews/')
while True:
    page = requests.get(next_page)
    contents = page.content
    soup = BeautifulSoup(contents, 'html.parser')
    number_p = soup.find_all('p', attrs={'class': 'ldb-phone-number'}, text=True)
    name_p = soup.find_all('p', attrs={'class': 'ldb-contact-name'}, text=True)
    number_p = pd.DataFrame(number_p, columns=['number'])
    name_p = pd.DataFrame(name_p, columns=['name'])
    num_data = number_p['number'].apply(lambda x: x.text.strip())
    nam_data = name_p['name'].apply(lambda x: x.text.strip())
    number_df = pd.DataFrame(num_data, columns=['number'])
    name_df = pd.DataFrame(nam_data, columns=['name'])
    num_data0 = pd.concat([number_df, name_df], axis=1)
    num_data1 = num_data1.append(num_data0)
    try:
        button = browser.find_element_by_css_selector('.zsg-pagination>li.zsg-pagination-next>a').click()
        next_page = str(browser.current_url)
    except IndexError:
        break
Replace page = requests.get(next_page) with page = browser.page_source, and pass that string straight to BeautifulSoup (drop the contents = page.content line, since page_source is already a string).
Basically what's happening is that you're moving to the next page in Chrome, but then trying to load that page's URL with requests, which gets redirected back to page one by Zillow (probably because it doesn't have the cookies or the appropriate request headers).
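A minimal sketch of the corrected loop, keeping the question's selectors and Selenium 3 style API, and swapping except IndexError for Selenium's NoSuchElementException (which is what find_element_by_css_selector actually raises when the next button is gone):

import time
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from bs4 import BeautifulSoup

browser = webdriver.Chrome()
browser.get('https://www.zillow.com/romeo-mi-48065/real-estate-agent-reviews/')

while True:
    # Parse the page the browser is actually on, cookies and all.
    soup = BeautifulSoup(browser.page_source, 'html.parser')
    names = [p.text.strip() for p in soup.find_all('p', attrs={'class': 'ldb-contact-name'})]
    numbers = [p.text.strip() for p in soup.find_all('p', attrs={'class': 'ldb-phone-number'})]
    print(list(zip(names, numbers)))
    try:
        browser.find_element_by_css_selector('.zsg-pagination>li.zsg-pagination-next>a').click()
        time.sleep(2)  # give the next page a moment to load
    except NoSuchElementException:
        break  # no "next" link left, so this was the last page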
Why not make your life easier and use the Zillow API instead of scraping? (Do you even have permission to scrape their site?)