I am trying to scrape data from the e-commerce site below, but I am getting an empty list when trying to append names to the Brand_Name list. Below is my code for this task, using Selenium (imports are omitted):
driver=webdriver.Chrome('chromedriver.exe')
time.sleep(2)
url='https://www.amazon.in/'
driver.get(url)
time.sleep(1)
driver.maximize_window()

#Searching element for search field
search_field=driver.find_element_by_xpath('//div[@class="nav-search-field "]/input')
user_inp=input("Enter your search value: ")
search_field.send_keys(user_inp)
time.sleep(2)
search_btn=driver.find_element_by_xpath('//div[@class="nav-search-submit nav-sprite"]')
search_btn.click()

URLs=[]
for page in range(0,3):
    links=driver.find_elements_by_xpath('//h2[@class="a-size-mini a-spacing-none a-color-base s-line-clamp-2"]//a')
    for i in links:
        URLs.append(i.get_attribute('href'))
    nxt_button=driver.find_element_by_xpath('//*[@class="s-pagination-item s-pagination-next s-pagination-button s-pagination-separator"]')
    nxt_button.click()
    time.sleep(2)

Brand_Name=[]
Product_Name=[]
Price=[]
Expected_Delivery=[]

for i in URLs:
    driver.get(url)
    time.sleep(2)
    try:
        brands=driver.find_element_by_xpath('//a[@id="bylineInfo"]')
        Brand_Name.append(brands.text)
    except NoSuchElementException:
        Brand_Name.append('-')
The middle part of your code is where the bugs lie:
URLs=[]
time.sleep(2)
for page in range(0,3):
    URL = [my_elem.get_attribute("href") for my_elem in driver.find_elements(By.XPATH, "//a[@class='a-link-normal s-underline-text s-underline-link-text s-link-style a-text-normal']")]
    URLs.extend(URL)
    nxt_button=driver.find_element_by_xpath('//*[@class="s-pagination-item s-pagination-next s-pagination-button s-pagination-separator"]')
    nxt_button.click()
    time.sleep(2)

Brand_Name=[]
Product_Name=[]
Price=[]
Expected_Delivery=[]

for i in URLs:
    driver.get(i)
I've fixed the 1st loop so the URLs are fetched properly. Remember to import the relevant libraries, e.g., from selenium.webdriver.common.by import By. The 2nd loop begins with a bug: it calls driver.get(url), which keeps reloading the homepage, instead of driver.get(i) to open each collected product link. The rest of your code seems fine.
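For completeness, here is a minimal sketch of how the rest of the 2nd loop could look once driver.get(i) is in place, using an explicit wait instead of fixed sleeps. It reuses the //a[@id="bylineInfo"] locator from your own code; treat it as a sketch, not a drop-in replacement:

from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

Brand_Name = []
for i in URLs:
    driver.get(i)
    try:
        # wait up to 10 seconds for the brand line instead of sleeping a fixed time
        brand = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.XPATH, '//a[@id="bylineInfo"]'))
        )
        Brand_Name.append(brand.text)
    except TimeoutException:
        Brand_Name.append('-')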
Related
I am trying to scrape a website to extract tweet links (specifically DW in this case), but I am unable to get any data because the tweets do not load immediately, so the request finishes before the page has had time to load. I have tried using the requests timeout as well as time.sleep(), but without luck. After those two options I tried using Selenium to load the webpage locally and give it time to load, but I can't seem to make it work. I believe this can be done with Selenium. Here is what I tried so far:
links = 'https://www.dw.com/en/vaccines-appear-effective-against-india-covid-variant/a-57344037'
driver.get(links)
delay = 30 #seconds
try:
    WebDriverWait(driver, delay).until(EC.visibility_of_all_elements_located((By.ID, "twitter-widget-0")))
except:
    pass
tweetSource = driver.page_source
tweetSoup = BeautifulSoup(tweetSource, features='html.parser')
linkTweets = tweetSoup.find_all('a')

for linkTweet in linkTweets:
    try:
        tweetURL = linkTweet.attrs['href']
    except: # pass on KeyError or any other error
        pass
    if "twitter.com" in tweetURL and "status" in tweetURL:
        # Run getTweetID function
        tweetID = getTweetID(tweetURL)
        newdata = [tweetID, date_tag, "DW", links, title_tag, "News", ""]
        # Write to dataframe
        df.loc[len(df)] = newdata
        print("working on tweetID: " + str(tweetID))
If anyone could get Selenium to find the tweet, that would be great!
It's inside an iframe; first you need to switch to that iframe:
iframe = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "twitter-widget-0"))
)
driver.switch_to.frame(iframe)
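Once you are inside the frame you can pull the hrefs directly with Selenium and then switch back to the main document. A rough sketch, assuming the widget renders ordinary <a> tags (the filtering mirrors the twitter.com/status check from your own loop):

from selenium.webdriver.common.by import By

# collect candidate tweet links from inside the widget
tweet_urls = []
for a in driver.find_elements(By.TAG_NAME, "a"):
    href = a.get_attribute("href")
    if href and "twitter.com" in href and "status" in href:
        tweet_urls.append(href)

# switch back to the main document before interacting with the rest of the page
driver.switch_to.default_content()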
There are two URL formats on the website I am interacting with:
http://15256160037.58food.com/contact/
http://hubeianran.58food.com/contact/
There are also two link formats (with the title in a[1] or a[2]):
<div class="company-left-title">
湖北安然保健品有限公司
<p>[企业黄页]</p>
</div>
OR
<div class="company-left-title">
<a href="javascript:Go('/qy-l-0-4-3595-3595-1.html');">
</a>亳州市九熹堂药业有限公司
</div>
I am trying to get the contact info and website from those pages and put them into a CSV. In the second format, I have to click another button to get the whole info.
I used:
driver.get('http://www.58food.com/qy-l-0-3595.html')
while True:
    try:
        links = [link.get_attribute('href') for link in driver.find_elements_by_xpath('//*[@class="company-left-title"]/a[2]')]
    except:
        links = [link.get_attribute('href') for link in driver.find_elements_by_xpath('//*[@class="company-left-title"]/a[1]')]
    locs = [loc.text for loc in driver.find_elements_by_xpath('//*[@class="company-text"]/p')]
    for link,loc in zip(links,locs):
        time.sleep(2)
        driver.get(link)
        windows = driver.window_handles
        driver.switch_to.window(windows[-1])
        driver.find_element_by_link_text('联系方式').click()
        try:
            company = driver.find_element_by_xpath('//*[@class="rclefttop"]/strong').text
            con_num = driver.find_element_by_xpath('//*[@class="rcleftlist"]/i[1]').text
            driver.back()
            driver.back()
        except:
            company = driver.find_element_by_xpath('//*[@class="px14 lh18"]/table/tbody/tr[1]/td[2]').text
            driver.find_element_by_id('glo_contactway_content').click()
            con_num = driver.find_element_by_xpath('//*[@class="archives dr-archives relative"]/p[1]').text
            driver.find_element_by_id('close').click()
            website = driver.find_element_by_xpath('//*[@class="px14 lh18"]/table/tbody/tr[5]/td[2]/a').text
            driver.back()
            driver.back()
        dataframe = pd.DataFrame({'col1':company,'col2':con_num,'col3':con_num2,'col4':loc,'col5':website},index=[0])
    try:
        next_page = driver.find_element_by_link_text("下一页")
        next_page.click()
    except:
        print('No more pages')
        break
But I get:
ElementNotInteractableException: element not interactable
(Session info: chrome=88.0.4324.104)
Could someone please help with that issue?
Maybe the object you are trying to click was still loading, and that is why it was not clickable. Try an explicit wait until the element you are trying to click is visible and clickable. Refer to the Selenium documentation on waits for more information.
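For example, a minimal sketch of that explicit wait applied to the contact link in the posted code (assuming the standard WebDriverWait/expected_conditions imports):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(driver, 10)
# wait until the contact link is present and clickable before clicking it
contact_link = wait.until(EC.element_to_be_clickable((By.LINK_TEXT, '联系方式')))
contact_link.click()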
I am hoping someone can help me handle a nested loop in Selenium. I am trying to scrape a website using Selenium, and I have to scrape multiple pieces of information from different links.
So I got all the links and looped through each one, but only the first link displayed the items I needed, and then the code breaks.
def get_financial_info(self):
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    chrome_options.add_argument("--window-size=1920x1080")
    driver = webdriver.Chrome(chrome_options=chrome_options, executable_path='/home/miracle/chromedriver')
    driver.get("https://www.financialjuice.com")
    try:
        WebDriverWait(driver, 60).until(EC.visibility_of_element_located((By.XPATH, "//div[@class='trendWrap']")))
    except TimeoutException:
        driver.quit()
    category_url = driver.find_elements_by_xpath("//ul[@class='nav navbar-nav']/li[@class='text-uppercase']/a[@href]")
    for record in category_url:
        driver.get(record.get_attribute("href"))
        news = {}
        title_element = driver.find_elements_by_xpath("//p[@class='headline-title']")
        for news_record in title_element:
            news['title'] = news_record.text
            print(news)
Your category_url elements are valid only on the page where you located them; after the first redirection to another page they become stale...
You need to replace
category_url = driver.find_elements_by_xpath("//ul[@class='nav navbar-nav']/li[@class='text-uppercase']/a[@href]")
with
category_url = [a.get_attribute("href") for a in driver.find_elements_by_xpath("//ul[@class='nav navbar-nav']/li[@class='text-uppercase']/a")]
and then loop through the list of links as
for record in category_url:
    driver.get(record)
This is a continuation of my previous question "Web scraping using selenium and beautifulsoup.. trouble in parsing and selecting button". I could solve the previous problem, but I am now stuck on the following.
I got the links previously and stored them in a list.
Then, I am trying to visit all the links stored in a list named StartupLink.
The information I need to scrape and store is in the div class="content" tag. For some links, that div contains a div hidden_more with JavaScript-enabled click events, so I am handling that exception. The loop runs fine and visits the links, but after the first two links it gives NA output even though the div content tag is present, and it shows no error (which is unacceptable).
The list contains 400 links to visit, each with a similar div content element.
Where am I going wrong here?
Description=[]
driver = webdriver.Chrome()
for link in StartupLink:
    try:
        driver.get(link)
        sleep(5)
        more = driver.find_element_by_xpath('//a[@class="hidden_more"]')
        element = WebDriverWait(driver, 10).until(EC.visibility_of(more))
        sleep(5)
        element.click()
        sleep(5)
        page = driver.find_element_by_xpath('//div[@class="content"]').text
        sleep(5)
    except Exception as e:# NoSuchElementException:
        driver.start_session()
        sleep(5)
        page = driver.find_element_by_xpath('//div[@class="content"]').text
        sleep(5)
        print(str(e))
    if page == '':
        page = "NA"
        Description.append(page)
    else:
        Description.append(page)
        print(page)
I am trying to scrape some data using selenium and python. I have a list with some links and I have to go through every link. What I do now is the following:
for link in links:
    self.page_driver.get(link)
    time.sleep(5)
    #scrape data
It works just fine, the problem is that I have a lot of links and waiting 5 seconds for each one is a waste of time. That's why I decided to try with something like:
self.driver.get(link)
try:
    element_present = EC.presence_of_element_located((By.CLASS_NAME, 'cell-box'))
    WebDriverWait(self.driver, 10).until(element_present)
except TimeoutException:
    logging.info("Timed out waiting for page to load")
The problem is that every link has the exact same structure inside (only the data change), so the element is found even if the page hasn't changed. What I would like to do is save the name of the product in a variable, change page, and wait until the name of the product differs from the one saved, which means the new page has loaded. Any help would be really appreciated.
You can add the staleness_of Expected Condition
wait = WebDriverWait(self.driver, 10)
element = None
for link in links:
    self.page_driver.get(link)
    if (element is not None):
        # wait until the element from the previous page has been detached from the DOM
        wait.until(EC.staleness_of(element))
    try:
        element = wait.until(EC.presence_of_element_located((By.CLASS_NAME, 'cell-box')))
    except TimeoutException:
        logging.info("Timed out waiting for page to load")
    #scrape data
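If you would rather implement exactly what you described, waiting until the product name differs from the one saved, a custom wait condition (a plain callable passed to until) also works. This is only a sketch: it assumes a single driver object and uses a hypothetical By.CLASS_NAME "product-name" locator as a stand-in for whatever element actually holds the product name:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

wait = WebDriverWait(self.page_driver, 10)
previous_name = None
for link in links:
    self.page_driver.get(link)
    if previous_name is not None:
        # wait until the displayed product name no longer matches the previous page's
        wait.until(
            lambda d: d.find_element(By.CLASS_NAME, "product-name").text != previous_name
        )
    previous_name = self.page_driver.find_element(By.CLASS_NAME, "product-name").text
    #scrape data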