Extracting Data from a Table in HTML using Selenium and Python

I have this assignment of extracting some items from each row of a table in HTML. I have figured out how to grab the whole table from the web using Selenium with Python. Following is the code for that:
from selenium import webdriver
import time
import pandas as pd

mydriver = webdriver.Chrome('C:/Program Files/chromedriver.exe')
mydriver.get("https://www.bseindia.com/corporates/ann.aspx?expandable=0")
time.sleep(5)  # wait 5 seconds until the DOM has loaded completely
table = mydriver.find_element_by_xpath('//*[@id="ctl00_ContentPlaceHolder1_lblann"]/table/tbody')
for row in table.find_elements_by_xpath('./tr'):
    print(row.text)
I am unable to work out how to grab specific items from the table itself. These are the items I require:
Company Name
PDF Link (if it does not exist, write "No PDF Link")
Received Time
Disseminated Time
Time Taken
Description
Any help in logic would be helpful.
Thanks in Advance.

for tr in mydriver.find_elements_by_xpath('//*[@id="ctl00_ContentPlaceHolder1_lblann"]/table//tr'):
    tds = tr.find_elements_by_tag_name('td')
    print([td.text for td in tds])
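Building on that, one way to grab the specific items is to inspect each row's cells and fall back to a default when no link is present. A minimal sketch (the cell layout here is an assumption; verify the indices against the live table):

for tr in mydriver.find_elements_by_xpath('//*[@id="ctl00_ContentPlaceHolder1_lblann"]/table//tr'):
    tds = tr.find_elements_by_tag_name('td')
    if not tds:
        continue  # skip spacer/empty rows
    # find_elements returns an empty list instead of raising,
    # so the missing-PDF case can be handled without try/except.
    links = tr.find_elements_by_tag_name('a')
    pdf_link = links[0].get_attribute('href') if links else "No PDF Link"
    cells = [td.text for td in tds]
    print(cells, pdf_link)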

I went through a rough time getting this working, but I think it works just fine now. It's pretty inefficient, though. Following is the code:
from selenium import webdriver
import time
import pandas as pd
from selenium.common.exceptions import NoSuchElementException

mydriver = webdriver.Chrome('C:/Program Files/chromedriver.exe')
mydriver.get("https://www.bseindia.com/corporates/ann.aspx?expandable=0")
time.sleep(5)  # wait 5 seconds until the DOM has loaded completely
trs = mydriver.find_elements_by_xpath('//*[@id="ctl00_ContentPlaceHolder1_lblann"]/table/tbody/tr')
del trs[0]  # drop the header row
names = []
r_time = []
d_time = []
t_taken = []
desc = []
pdfs = []
codes = []
i = 0
while i < len(trs):
    names.append(trs[i].text)
    # pull the numeric company code out of the name row
    l = trs[i].text.split()
    for item in l:
        try:
            code = int(item)
            if code > 100000:
                codes.append(code)
        except:
            pass
    # every fourth cell (starting at index 2) may hold the PDF link
    link = trs[i].find_elements_by_tag_name('td')
    pdf_count = 2
    while pdf_count < len(link):
        try:
            pdf = link[pdf_count].find_element_by_tag_name('a')
            pdfs.append(pdf.get_attribute('href'))
        except NoSuchElementException:
            pdfs.append("No PDF")
        pdf_count = pdf_count + 4
    # the next row holds the received/disseminated times (note: this shadows the time module)
    time = trs[i + 1].text.split()
    if len(time) == 5:
        r_time.append("No Time Given")
        d_time.append(time[3] + " " + time[4])
        t_taken.append("No Time Given")
    else:
        r_time.append(time[3] + " " + time[4])
        d_time.append(time[8] + " " + time[9])
        t_taken.append(time[12])
    desc.append(trs[i + 2].text)
    i = i + 4
df = pd.DataFrame.from_dict({'Name': names, 'Description': desc, 'PDF Link': pdfs, 'Company Code': codes, 'Received Time': r_time, 'Disseminated Time': d_time, 'Time Taken': t_taken})
df.to_excel('corporate.xlsx', header=True, index=False)  # write the data to an Excel sheet
Also, I have added another aspect that was asked for: the company code goes into a separate column as well.
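As a possible refinement (just a sketch, not part of the original script): the fixed time.sleep(5) can be replaced with an explicit wait, so the script proceeds as soon as the announcements table is present.

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 15 seconds for the table to be present instead of sleeping blindly.
wait = WebDriverWait(mydriver, 15)
table = wait.until(EC.presence_of_element_located(
    (By.XPATH, '//*[@id="ctl00_ContentPlaceHolder1_lblann"]/table/tbody')))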

Related

Why Does My Code Scrape The First Record Only?

My code goes into a website and clicks on records, which causes drop-downs to open.
My current code only prints the first drop-down record, and not the others.
For example, when the first record on the webpage is clicked, it drops down one record; that is also the first and only drop-down record that gets printed as my output.
How do I get it to pull all the drop-down titles?
from selenium import webdriver
import time

driver = webdriver.Chrome()
for x in range(1, 2):
    driver.get(f'https://library.iaslc.org/conference-program?product_id=24&author=&category=&date=&session_type=&session=&presentation=&keyword=&available=&cme=&page={x}')
    time.sleep(4)
    productlist_length = len(driver.find_elements_by_xpath("//div[@class='accordin_title']"))
    for i in range(1, productlist_length + 1):
        product = driver.find_element_by_xpath("(//div[@class='accordin_title'])[" + str(i) + "]")
        title = product.find_element_by_xpath('.//h4').text.strip()
        print(title)
        buttonToClick = product.find_element_by_xpath('.//div[@class="sign"]')
        buttonToClick.click()
        time.sleep(5)
        subProduct = driver.find_element_by_xpath(".//li[@class='sub_accordin_presentation']")
        otherTitle = subProduct.find_element_by_xpath('.//h4').text.strip()
        print(otherTitle)
You don't need Selenium at all. I'm not sure exactly what info you are after, but the following shows that the content inside those expand blocks is available in the response from a simple requests.get():
import requests
from bs4 import BeautifulSoup as bs
import re

r = requests.get('https://library.iaslc.org/conference-program?product_id=24&author=&category=&date=&session_type=&session=&presentation=&keyword=&available=&cme=&page=1')
soup = bs(r.text, 'lxml')
sessions = soup.select('#accordin > ul > li')
for session in sessions:
    print(session.select_one('h4').text)
    sub_session = session.select('.sub_accordin_presentation')
    if sub_session:
        print([re.sub(r'[\n\s]+', ' ', i.text) for i in sub_session])
    print()
    print()
Try:
products = driver.find_elements_by_xpath('//*[@class="jscroll-inner"]/ul/li')
for product in products:
    title = product.find_element_by_xpath('(.//*[@class="accordin_title"]/div)[3]/h4').text
    print(title)

Convert multiple strings from Selenium and BeautifulSoup to a CSV file

I have this scraper I am trying to export as a CSV file in Google Colab. I receive the scraped information as string values, but I cannot convert them to a CSV. I want each scraped attribute ("title", "size", etc.) to populate a column in a CSV file. I have run the strings through BeautifulSoup to remove the HTML formatting. Please see my code below.
import pandas as pd
import time
import io
from io import StringIO
import csv
# from google.colab import drive
# drive.mount('drive')
# Use the kora.selenium library to run chromedriver in Colab
from kora.selenium import wd
# Import BeautifulSoup to parse HTML formatting
from bs4 import BeautifulSoup

wd.get("https://www.grailed.com/sold/EP8S3v8V_w")  # Get webpage
ScrollNumber = round(200/40) + 1
for i in range(0, ScrollNumber):
    wd.execute_script("window.scrollTo(0,document.body.scrollHeight)")
    time.sleep(2)
# --------------#
# Each new attribute has to be found using XPath because Grailed's website is rendered with JavaScript (React), not plain HTML.
# Only 39 results will show because the page uses infinite scroll and Selenium must be told to keep scrolling.
follow_loop = range(2, 200)
for x in follow_loop:
    # Title
    title = "//*[@id='shop']/div/div/div[3]/div[2]/div/div["
    title += str(x)
    title += "]/a/div[3]/div[2]/p"
    title = wd.find_elements_by_xpath(title)
    title = str(title)
    # Price
    price = "//*[@id='shop']/div/div/div[3]/div[2]/div/div["
    price += str(x)
    price += "]/div/div/p/span"
    price = wd.find_elements_by_xpath(price)
    price = str(price)
    # Size
    size = "//*[@id='shop']/div/div/div[3]/div[2]/div/div["
    size += str(x)
    size += "]/a/div[3]/div[1]/p[2]"
    size = wd.find_elements_by_xpath(size)
    size = str(size)
    # Sold
    sold = "//*[@id='shop']/div/div/div[3]/div[2]/div/div["
    sold += str(x)
    sold += "]/a/p/span"
    sold = wd.find_elements_by_xpath(sold)
    sold = str(sold)
    # Clean HTML formatting using BeautifulSoup
    cleantitle = BeautifulSoup(title, "lxml").text
    cleanprice = BeautifulSoup(price, "lxml").text
    cleansize = BeautifulSoup(size, "lxml").text
    cleansold = BeautifulSoup(sold, "lxml").text
This was a lot of work lol
from selenium import webdriver
import time
import csv

driver = webdriver.Chrome()
driver.get("https://www.grailed.com/sold/EP8S3v8V_w")

scroll_count = round(200 / 40) + 1
for i in range(scroll_count):
    driver.execute_script("window.scrollTo(0,document.body.scrollHeight)")
    time.sleep(2)
time.sleep(3)

titles = driver.find_elements_by_css_selector("p.listing-designer")
prices = driver.find_elements_by_css_selector("p.sub-title.sold-price")
sizes = driver.find_elements_by_css_selector("p.listing-size.sub-title")
sold = driver.find_elements_by_css_selector("div.-overlay")

data = [titles, prices, sizes, sold]
data = [list(map(lambda element: element.text, arr)) for arr in data]

with open('sold_shoes.csv', 'w') as file:
    writer = csv.writer(file)
    j = 0
    while j < len(titles):
        row = []
        for i in range(len(data)):
            row.append(data[i][j])
        writer.writerow(row)
        j += 1
I'm not sure why it writes a blank line between every row in the file, but I assume it's not a problem. Also, it's a naïve solution in that it assumes every list is the same length; consider grabbing one parent list and building the new lists from each parent's child elements. I just used Selenium without BeautifulSoup because it's easier for me, but you should learn BS too because it's faster for parsing than Selenium. Happy coding.
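About the blank lines: the csv module asks that the file be opened with newline='' so it can control line endings itself, and zip() removes the manual index bookkeeping. A small sketch under those assumptions, reusing the data list from the code above:

with open('sold_shoes.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(['title', 'price', 'size', 'sold'])  # header row
    # zip stops at the shortest list, so unequal list lengths no longer break the loop.
    for row in zip(*data):
        writer.writerow(row)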

I am trying to get the backers and the home city of different projects on Kickstarter

With the following code I try to get the home city and places where the backers are located from kickstarter. However, I keep running into the following error:
File "D:/location", line 60, in < module >
page1 = urllib.request.urlopen(projects[counter])
IndexError: list index out of range
Does someone have a more elegant solution to feed the page to urllib.request.urlopen? (see the lines in ** **)
code:
# coding: utf-8
import urllib.request
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd
import time
from datetime import datetime
from collections import OrderedDict
import re

browser = webdriver.Firefox()
browser.get('https://www.kickstarter.com/discover?ref=nav')
categories = browser.find_elements_by_class_name('category-container')

category_links = []
for category_link in categories:
    # Each item in the list is a tuple of the category's name and its link.
    category_links.append((str(category_link.find_element_by_class_name('f3').text),
                           category_link.find_element_by_class_name('bg-white').get_attribute('href')))

scraped_data = []
now = datetime.now()
counter = 1
for category in category_links:
    browser.get(category[1])
    browser.find_element_by_class_name('sentence-open').click()
    time.sleep(2)
    browser.find_element_by_id('category_filter').click()
    time.sleep(2)
    for i in range(27):
        try:
            time.sleep(2)
            browser.find_element_by_id('category_' + str(i)).click()
            time.sleep(2)
        except:
            pass
    # while True:
    #     try:
    #         browser.find_element_by_class_name('load_more').click()
    #     except:
    #         break
    projects = []
    for project_link in browser.find_elements_by_class_name('clamp-3'):
        projects.append(project_link.find_element_by_tag_name('a').get_attribute('href'))
    for project in projects:
        **page1 = urllib.request.urlopen(projects[counter])**
        soup1 = BeautifulSoup(page1, "lxml")
        **page2 = urllib.request.urlopen(projects[counter].split('?')**[0]+'/community')
        soup2 = BeautifulSoup(page2, "lxml")
        time.sleep(2)
        print(str(counter) + ': ' + project + '\nStatus: Started.')
        project_dict = OrderedDict()
        project_dict['Category'] = category[0]
        browser.get(project)
        project_dict['Name'] = soup1.find(class_='type-24 type-28-sm type-38-md navy-700 medium mb3').text
        project_dict['Home State'] = str(soup1.find(class_='nowrap navy-700 flex items-center medium type-12').text)
        try:
            project_dict['Backer State'] = str(soup2.find(class_='location-list-wrapper js-location-list-wrapper').text)
        except:
            pass
        print('Status: Done.')
        counter += 1
        scraped_data.append(project_dict)

later = datetime.now()
diff = later - now
print('The scraping took ' + str(round(diff.seconds/60.0, 2)) + ' minutes, and scraped ' + str(len(scraped_data)) + ' projects.')
df = pd.DataFrame(scraped_data)
df.to_csv('kickstarter-data.csv')
If you only use counter to print the project status message, you can use range or enumerate instead. Here is an example with enumerate:
for counter, project in enumerate(projects):
    ... code ...
enumerate produces a tuple (index, item), so the rest of your code should work fine as it is.
A few more things:
List indices start at 0, so when you use counter to access items you get an IndexError because you initialized counter with 1.
In the for loop you don't need projects[counter]; just use project.
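Putting those points together, a minimal sketch of the corrected inner loop (it assumes the imports and the projects list from your question):

for counter, project in enumerate(projects):
    # Use the loop variable directly; no separate counter indexing into the list.
    page1 = urllib.request.urlopen(project)
    soup1 = BeautifulSoup(page1, "lxml")
    page2 = urllib.request.urlopen(project.split('?')[0] + '/community')
    soup2 = BeautifulSoup(page2, "lxml")
    print(str(counter) + ': ' + project + '\nStatus: Started.')
    # ... rest of the loop body unchanged ...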

Append scraped data to different columns

while True:
    for rate in soup.find_all('div', {"class": "rating"}):
        if rate.img is not None:
            print(rate.img['alt'])
    try:
        driver.find_element_by_link_text('Next').click()
    except:
        break
driver.quit()

while True:
    for rate in soup.findAll('div', {"class": "listing_title"}):
        print(rate.a.text)
    try:
        driver.find_element_by_link_text('Next').click()
    except:
        break
driver.quit()
This should do what you're looking for. You should grab the parent class of both (I chose .listing), get each attribute from there, insert them into a dict, and then write the dicts to CSV with the Python csv library. Just as a fair warning, I didn't run it until it broke; I just broke out after the second loop to save some computing.
WARNING HAVE NOT TESTED ON FULL SITE
import csv
import time
from bs4 import BeautifulSoup
import requests
from selenium import webdriver

url = 'http://www.tripadvisor.in/Hotels-g186338-London_England-Hotels.html'
driver = webdriver.Firefox()
driver.get(url)
hotels = []
while True:
    html = driver.page_source
    soup = BeautifulSoup(html)
    listings = soup.select('div.listing')
    for l in listings:
        hotel = {}
        hotel['name'] = l.select('a.property_title')[0].text
        hotel['rating'] = float(l.select('img.sprite-ratings')[0]['alt'].split('of')[0])
        hotels.append(hotel)
    next = driver.find_element_by_link_text('Next')
    if not next:
        break
    else:
        next.click()
        time.sleep(0.5)
if len(hotels) > 0:
    with open('ratings.csv', 'w') as f:
        fieldnames = [k for k in hotels[0].keys()]
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        for h in hotels:
            writer.writerow(h)
driver.quit()
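One caveat with the snippet above (untested here as well): find_element_by_link_text raises NoSuchElementException when there is no 'Next' link, so the if not next: check never triggers. A sketch of an alternative ending for the loop:

# find_elements returns an empty list on the last page instead of raising,
# so the loop can end cleanly.
next_links = driver.find_elements_by_link_text('Next')
if not next_links:
    break
next_links[0].click()
time.sleep(0.5)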
You should look at using a list.
I would try something like this:
for rate in soup.findAll('div',{"class":["rating","listing_title"]}):
(could be wrong, this machine doesn't have bs4 for me to check, sorry)
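A minimal sketch of how that combined selector could feed two separate columns (it assumes soup has already been built from driver.page_source; the class names come from the question):

ratings = []
titles = []
for div in soup.find_all('div', {"class": ["rating", "listing_title"]}):
    classes = div.get("class", [])
    if "rating" in classes and div.img is not None:
        ratings.append(div.img['alt'])
    elif "listing_title" in classes and div.a is not None:
        titles.append(div.a.text)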

Accessing non-uniform <dt></dt> <dd></dd> tags

from collections import defaultdict
import requests
from bs4 import BeautifulSoup
import csv
import pandas as pd

r = requests.get("http://www.walmart.com/search/?query=marvel&cat_id=4096_530598")
r.content
soup = BeautifulSoup(r.content)
g_data = soup.find_all("div", {"class": "tile-content"})
data = defaultdict(list)
for tile in g_data:
    # the "tile" value in g_data contains what you are looking for...
    # find the product titles
    try:
        title = tile.find("a", "js-product-title")
        data['Product Title'].append(title.text)
    except:
        data['Product Title'].append("")
    # find the prices
    try:
        price = tile.find('span', 'price price-display').text.strip()
        data['Price'].append(price)
    except:
        data['Price'].append("")
    # find the stars
    try:
        g_star = tile.find("div", {"class": "stars stars-small tile-row"}).find('span', 'visuallyhidden').text.strip()
        data['Stars'].append(g_star)
    except:
        data['Stars'].append("")
    try:
        dd_starring = tile.find('dd', {"class": "media-details-multi-line media-details-artist-dd module"}).text.strip()
        data['Starring'].append(dd_starring)
    except:
        data['Starring'].append("")
    try:
        running_time = tile.find_all('dl', {"class": "media-details dl-horizontal copy-mini"})
        for dd_run in running_time:
            running = dd_run.find_all('dd')[1:2]
            for run in running:
                # print(run.text.strip())
                data['Running Time'].append(run.text.strip())
    except:
        data['Running Time'].append("")
    try:
        dd_format = tile.findAll('dd', {"class": "media-details-multi-line"})[1:2]
        for formatt in dd_format:
            data['Format'].append(formatt.text)
    except:
        data['Format'].append("")
    try:
        div_shipping = tile.find_all('div', {"data-offer-shipping-pass-eligible": "false"})
        data['Shipping'].append("")
    except:
        freeshipping = "Free Shipping"
        data['Shipping'].append(freeshipping)
df = pd.DataFrame(data)
df
I want to access the <dd> tags that have no class name. How do I access them?
For example, row no. 11 has a fifth field (Director), and a few others have a Release Date.
Currently I am accessing them using slices like [1:2] and so on, but that is not flexible and doesn't populate my table correctly.
Is there a function to do this?
Substitute Starring and Running Time with:
try:
    dd_starring = tile.find('dd', {"class": "media-details-artist-dd"}).text.strip()
    data['Starring'].append(dd_starring)
except:
    data['Starring'].append("")
try:
    running = tile.find('dt', {'class': 'media-details-running-time'})
    running_time = running.find_next("dd")
    data['Running Time'].append(running_time.text)
except:
    data['Running Time'].append("")
This should run now. It seems that when you select multiple classes with BeautifulSoup it can get confused, so you can get the actors just by the CSS class media-details-artist-dd. For the running time I employed a simple trick :)
EDIT: Changed the code to find the dt for Running Time and then get the next dd. The previous code had an extra unneeded part.
It should work now.
