How to optimize a Selenium webdriver crawler?

How to optimize a Selenium webdriver crawler? - python

So, I have to crawl a table in each webpage in a website, there are 324 web pages (meaning 324 tables) and each table has 1000 rows and 7 columns, but 1 column is useless and I didn't use that one.
The code is kind of okay but the problem is it's very slow and it takes a lot of time.I was wondering if I could do some changes to the code to make it faster!
Here's the code:
driver = webdriver.Chrome('./chromedriver.exe')
driver.get('https://beheshtezahra.tehran.ir/Default.aspx?tabid=92')
driver.maximize_window()
part_count = 1
li = []
for i in range(0, 324):
start = timeit.default_timer()
firstname = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "input[name='dnn$ctr1877$DeadSearch$txtname']")))
lastname = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "input[name='dnn$ctr1877$DeadSearch$txtFamily']")))
part = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "input[name='dnn$ctr1877$DeadSearch$txtPart']")))
firstname.clear()
firstname.send_keys("%")
lastname.clear()
lastname.send_keys("%")
part.clear()
part.send_keys(str(part_count))
driver.find_element_by_xpath('//*[#id="dnn_ctr1877_DeadSearch_btnSearch"]').click()
print('Saving the information..')
first_name = driver.find_elements_by_xpath('//table/tbody/tr/td[2]')
last_name = driver.find_elements_by_xpath('//table/tbody/tr/td[3]')
fathers_name = driver.find_elements_by_xpath('//table/tbody/tr/td[4]')
birth_date = driver.find_elements_by_xpath('//table/tbody/tr/td[5]')
death_date = driver.find_elements_by_xpath('//table/tbody/tr/td[6]')
grave_info = driver.find_elements_by_xpath('//table/tbody/tr/td[7]')
print('Appending the information..')
for j in range(0, 1000):
li.append(first_name[j].text)
li.append(last_name[j].text)
li.append(fathers_name[j].text)
li.append(birth_date[j].text)
li.append(death_date[j].text)
li.append(grave_info[j].text)
print('Page ' + str(part_count) + ' is crawled!')
stop = timeit.default_timer()
part_count += 1
print('Time: ', stop - start)
And in the end, I wrote the list into a CSV file. Any suggestions would be appreciated!

What you could do after the print('Saving the information..') part:
print('Saving the information..')
page_snapshot = lxml.html.document_fromstring(driver.page_source)
first_name = page_snapshot.xpath('//table/tbody/tr/td[2]')
last_name = page_snapshot.xpath('//table/tbody/tr/td[3]')
fathers_name = page_snapshot.xpath('//table/tbody/tr/td[4]')
birth_date = page_snapshot.xpath('//table/tbody/tr/td[5]')
death_date = page_snapshot.xpath('//table/tbody/tr/td[6]')
grave_info = page_snapshot.xpath('//table/tbody/tr/td[7]')
print('Appending the information..')
for j in range(0, 1000):
li.append(first_name[j].text)
li.append(last_name[j].text)
li.append(fathers_name[j].text)
li.append(birth_date[j].text)
li.append(death_date[j].text)
li.append(grave_info[j].text)
lxml is blazing fast, just import lxml.html (after pip install ofc. :)) and make sure that the page is completely loaded before you grab the snapshot.

Related

Export data to gsheet workbooks / worksheets after looping through a script 10 times

I have the following script I'm running to get data from Google's pagespeed insights tool via API:
from datetime import datetime
from urllib import request
import requests
import pandas as pd
import numpy as np
import re
from os import truncate
import xlsxwriter
import time
import pygsheets
import pickle
domain_strip = 'https://www.example.co.uk'
gc = pygsheets.authorize(service_file='myservicefile.json')
API = "myapikey"
strat = "mobile"
def RunCWV():
with open('example_urls_feb_23.txt') as pagespeedurls:
content = pagespeedurls.readlines()
content = [line.rstrip('\n') for line in content]
#Dataframes
dfCWV2 = pd.DataFrame({'Page':[],'Overall Performance Score':[],'FCP (seconds) CRUX':[],'FCP (seconds) Lab':[],'FID (seconds)':[],'Max Potential FID (seconds)':[],'LCP (seconds) CRUX':[],'LCP (seconds) Lab':[],'LCP Status':[],'CLS Score CRUX':[],'Page CLS Score Lab':[],'CLS Status':[],'Speed Index':[],'Uses Efficient Cache Policy?':[],'Landing Page':[]})
dfCLSPath2 = pd.DataFrame({'Page':[],'Path':[],'Selector':[],'Node Label':[],'Element CLS Score':[],'Landing Page':[],'large_uid':[]})
dfUnsizedImages2 = pd.DataFrame({'Page':[],'Image URL':[],'Landing Page':[],'unsized_uid':[]})
dfNCAnim2 = pd.DataFrame({'Page':[],'Animation':[],'Failure Reason':[],'Landing Page':[]})
dfLCP_Overview = pd.DataFrame({'Page':[],'Preload LCP Savings (seconds)':[],'Resize Images Savings (seconds)':[],'Text Compression Savings (seconds)':[],'Preload Key Requests Savings (seconds)':[],'Preconnect Savings (seconds)':[],'Unused CSS Savings (seconds)':[],'Unused JS Savings (seconds)':[],'Unminified CSS Savings (seconds)':[],'Unminified JS Savings (seconds)':[],'Efficiently Animated Content Savings':[],'Landing Page':[]})
dfLCPOb2 = pd.DataFrame({'Page':[],'LCP Tag':[],'LCP Tag Type':[],'LCP Image Preloaded?':[],'Wasted Seconds':[],'Action':[],'Landing Page':[]})
dfresize_img = pd.DataFrame({'Page':[],'Image URL':[],'Total Bytes':[],'Wasted Bytes':[],'Overall Savings (seconds)':[],'Action':[],'Landing Page':[]})
dfFontDisplay2 = pd.DataFrame({'Page':[],'Resource':[],'Font Display Utilised?':[],'Wasted Seconds':[],'Action':[],'Landing Page':[]})
dfTotalBW2 = pd.DataFrame({'Page':[],'Total Byte Weight of Page':[],'Large Network Payloads?':[],'Resource':[],'Total KB':[],'Landing Page':[]})
dfRelPreload2 = pd.DataFrame({'Page':[],'Resource':[],'Wasted Seconds':[],'Landing Page':[]})
dfRelPreconnect2 = pd.DataFrame({'Page':[],'Resource':[],'Wasted Ms':[],'Passed Audit':[],'Landing Page':[]})
dfTextCompression2 = pd.DataFrame({'Page':[],'Text Compression Optimal?':[],'Action':[],'Savings':[],'Landing Page':[]})
dfUnusedCSS2 = pd.DataFrame({'Page':[],'CSS File':[],'Unused CSS Savings KiB':[],'Unused CSS Savings (seconds)':[],'Wasted %':[],'Landing Page':[]})
dfUnusedJS2 = pd.DataFrame({'Page':[],'JS File':[],'Unused JS Savings (seconds)':[],'Total Bytes':[],'Wasted Bytes':[],'Wasted %':[],'Landing Page':[]})
dfUnminCSS2 = pd.DataFrame({'Page':[],'CSS File':[],'Total Bytes':[],'Wasted Bytes':[],'Wasted %':[],'Landing Page':[]})
dfUnminJS2 = pd.DataFrame({'Page':[],'JS File':[],'Total Bytes':[],'Wasted Bytes':[],'Wasted %':[],'Landing Page':[]})
dfCritRC2 = pd.DataFrame({'Page':[],'Resource':[],'Start Time':[],'End Time':[],'Total Time':[],'Transfer Size':[],'Landing Page':[]})
dfAnimContent2 = pd.DataFrame({'Page':[],'Efficient Animated Content?':[],'Resource':[],'Total Bytes':[],'Wasted Bytes':[],'Landing Page':[]})
dfSRT2 = pd.DataFrame({'Page':[],'Passed Audit?':[],'Server Response Time ms':[],'Server Response Time Savings':[],'Landing Page':[]})
dfRedirects2 = pd.DataFrame({'Page':[],'Redirects':[],'Wasted ms':[],'Landing Page':[]})
dfFID_Summary2 = pd.DataFrame({'Page':[],'FID (seconds)':[],'Total Blocking Time (seconds)':[],'FID Rating':[],'Total Tasks':[],'Total Task Time of Page (seconds)':[],'Tasks over 50ms':[],'Tasks over 100ms':[],'Tasks over 500ms':[],'3rd Party Total Wasted Seconds':[],'Bootup Time (seconds)':[],'Number of Dom Elements':[],'Mainthread work Total Seconds':[],'Duplicate JS Savings (Seconds)':[],'Legacy JS Savings (seconds)':[],'Landing Page':[]})
dflongTasks2 = pd.DataFrame({'Page':[],'Task':[],'Task Duration Seconds':[],'Total Tasks':[],'Total Task Time of Page (seconds)':[],'Tasks over 50ms':[],'Tasks over 100ms':[],'Tasks over 500ms':[],'Landing Page':[]})
dfthirdP2 = pd.DataFrame({'Page':[],'3rd Party Total wasted Seconds':[],'3rd Party Total Blocking Time (seconds)':[],'3rd Party Resource Name':[],'Landing Page':[]})
dfbootup2 = pd.DataFrame({'Page':[],'Page Bootup Time Score':[],'Resource':[],'Time spent Parsing / Compiling Ms':[]})
dfthread2 = pd.DataFrame({'Page':[],'Score':[],'Mainthread work total seconds':[],'Mainthread work Process Type':[],'Duration (Seconds)':[],'Landing Page':[]})
dfDOM2 = pd.DataFrame({'Page':[],'Dom Size Score':[],'DOM Stat':[],'DOM Value':[],'Landing Page':[],})
dfdupJS2 = pd.DataFrame({'Page':[],'Score':[],'Audit Status':[],'Duplicate JS Savings (seconds)':[], 'Landing Page':[]})
dflegacyJS2 = pd.DataFrame({'Page':[],'Audit Status':[],'Legacy JS Savings (seconds)':[],'JS File of Legacy Script':[],'Wasted Bytes':[],'Landing Page':[]})
#Run PSI
for line in content:
x = f'https://www.googleapis.com/pagespeedonline/v5/runPagespeed?url={line}&strategy={strat}&key={API}'
print(f'Running CWV Audit on {strat} from: {line} - Please Wait...')
r = requests.get(x)
data = r.json()
line_stripped = line
if domain_strip in line:
line_stripped = line_stripped.replace(domain_strip, '')
else:
pass
#CWV Overview
try:
op_score = data["lighthouseResult"]["categories"]["performance"]["score"] * 100
fcp_score_CRUX = data["loadingExperience"]["metrics"]["FIRST_CONTENTFUL_PAINT_MS"]["percentile"] / 1000
fcp_score_LAB = data["lighthouseResult"]["audits"]["first-contentful-paint"]["numericValue"] / 1000
fid_score = data["loadingExperience"]["metrics"]["FIRST_INPUT_DELAY_MS"]["percentile"] / 1000
Max_P_FID = data["lighthouseResult"]["audits"]["max-potential-fid"]["numericValue"] / 1000
lcp_score_CRUX_ms = data["loadingExperience"]["metrics"]["LARGEST_CONTENTFUL_PAINT_MS"]["percentile"]
lcp_score_CRUX = data["loadingExperience"]["metrics"]["LARGEST_CONTENTFUL_PAINT_MS"]["percentile"] / 1000
lcp_score_LAB = data["lighthouseResult"]["audits"]["first-contentful-paint"]["numericValue"] / 1000
cls_score_Sitewide = data["loadingExperience"]["metrics"]["CUMULATIVE_LAYOUT_SHIFT_SCORE"]["percentile"] / 100
cls_score_Page_mult = data["lighthouseResult"]["audits"]["cumulative-layout-shift"]["numericValue"] * 1000
cls_score_Page = data["lighthouseResult"]["audits"]["cumulative-layout-shift"]["numericValue"]
speed_index = data["lighthouseResult"]["audits"]["speed-index"]["numericValue"] / 1000
efficient_cache = data["lighthouseResult"]["audits"]["uses-long-cache-ttl"]["score"]
if efficient_cache == 1:
efficient_cache = "Yes"
else:
efficient_cache = "No"
lcp_status = lcp_score_CRUX_ms
if lcp_score_CRUX_ms <=2500:
lcp_status = "Good"
elif lcp_score_CRUX_ms in range (2501, 4000):
lcp_status = "Needs Improvement"
else:
lcp_status = "Poor"
cls_status = cls_score_Page_mult
if cls_score_Page_mult <=100:
cls_status = "Good"
elif cls_score_Page_mult in range (101,150):
cls_status = "Needs Improvement"
else:
cls_status = "Poor"
new_row = pd.DataFrame({'Page':line_stripped,'Overall Performance Score':op_score, 'FCP (seconds) CRUX':round(fcp_score_CRUX,4),'FCP (seconds) Lab':round(fcp_score_LAB,4), 'FID (seconds)':round(fid_score,4),
'Max Potential FID (seconds)':round(Max_P_FID,4), 'LCP (seconds) CRUX':round(lcp_score_CRUX,4),'LCP (seconds) Lab':round(lcp_score_LAB,4), 'LCP Status':lcp_status, 'CLS Score CRUX':round(cls_score_Sitewide,4),
'Page CLS Score Lab':round(cls_score_Page,4),'CLS Status':cls_status,'Speed Index':round(speed_index,4),'Uses Efficient Cache Policy?':efficient_cache, 'Landing Page':line_stripped}, index=[0])
dfCWV2 = pd.concat([dfCWV2, new_row], ignore_index=True) #, ignore_index=True
except KeyError:
print(f'<KeyError> CWV Summary One or more keys not found {line}.')
except TypeError:
print(f'TypeError on {line}.')
print ('CWV Summary')
print (dfCWV2)
#Export to GSheets line by line
sh = gc.open('CWV Overview AWP - example Feb 2023')
worksheet = sh.worksheet_by_title('CWV')
df_worksheet = worksheet.get_as_df()
result = pd.concat([df_worksheet, dfCWV2], ignore_index=True)
result=result.drop_duplicates(keep='last')
worksheet.set_dataframe(result, 'A1')
# #End test
#CLS
#Large Shifts
try:
for x in range (len(data["lighthouseResult"]["audits"]["layout-shift-elements"]["details"]["items"])):
path = data["lighthouseResult"]["audits"]["layout-shift-elements"]["details"]["items"][x]["node"]["path"]
selector = data["lighthouseResult"]["audits"]["layout-shift-elements"]["details"]["items"][x]["node"]["selector"]
nodeLabel = data["lighthouseResult"]["audits"]["layout-shift-elements"]["details"]["items"][x]["node"]["nodeLabel"]
score = data["lighthouseResult"]["audits"]["layout-shift-elements"]["details"]["items"][x]["score"]
i = 1
new_row = pd.DataFrame({'Page':line_stripped, 'Path':path, 'Selector':selector, 'Node Label':nodeLabel,'Element CLS Score':round(score,4), 'Landing Page':line_stripped, 'large_uid':i}, index=[0])
dfCLSPath2 = pd.concat([dfCLSPath2, new_row], ignore_index=True)
except KeyError:
print(f'<KeyError> Layout Shift Elements - One or more keys not found {line}.')
except TypeError:
print(f'TypeError on {line}.')
print ('Large Shifts')
print (dfCLSPath2)
sh = gc.open('CLS Audit AWP - example Feb 2023')
worksheet = sh.worksheet_by_title('Large CLS Elements')
df_worksheet = worksheet.get_as_df()
result = pd.concat([df_worksheet, dfCLSPath2], ignore_index=True)
result=result.drop_duplicates(keep='last')
worksheet.set_dataframe(result, 'A1')
#Unsized Images
try:
for x in range (len(data["lighthouseResult"]["audits"]["unsized-images"]["details"]["items"])):
unsized_url = data["lighthouseResult"]["audits"]["unsized-images"]["details"]["items"][x]["url"]
i = 1
new_row = pd.DataFrame({'Page':line_stripped, 'Image URL':unsized_url, 'Landing Page':line_stripped, 'unsized_uid':i}, index=[0])
dfUnsizedImages2 = pd.concat([dfUnsizedImages2, new_row], ignore_index=True)
except KeyError:
print(f'<KeyError> Unsized Images One or more keys not found {line}.')
except TypeError:
print(f'TypeError on {line}.')
print ('Unsized Images')
print(dfUnsizedImages2)
sh = gc.open('CLS Audit AWP - example Feb 2023')
worksheet = sh.worksheet_by_title('Unsized Images')
df_worksheet = worksheet.get_as_df()
result = pd.concat([df_worksheet, dfUnsizedImages2], ignore_index=True)
result=result.drop_duplicates(keep='last')
worksheet.set_dataframe(result, 'A1')
I've only included the first few TRY blocks as the script is very long. Essentially what I want to do is the same as I have here, but rather than exporting the results from the dataframes after every URL has run, I want to export it, say, every 10 urls (or more). I have around 4000 urls in total and I need to capture the results from the audit for every url
I used to have the script set up to export to gsheets at the end of the whole script with every loop, but I always end up with the script crashing before it loops through every URL I'm auditing which is why I set it up as above to export line by line - it's SUPER slow though, taking over 2 weeks to run through all urls in my text file so I want to speed it up by only exporting every 10 urls worth of data at a time. That way if the script crashes, I've not lost everything, only the last 10 urls.
I tried setting a counter on each of the export blocks:
results = []
results_to_export = []
for i in range(10):
counter = 0
while counter < 5000:
print("Starting loop iteration")
results.append(dfCWV2)
counter += 1
if counter % 10 == 0:
print("Running after 10 loops")
result=pd.concat(results, ignore_index=True)
result=result.drop_duplicates(keep='last')
# add results to export list
results_to_export.append(result)
if results_to_export:
sh = gc.open('CWV Overview AWP - example Feb 2023')
worksheet = sh.worksheet_by_title('CWV')
combined_results = pd.concat(results_to_export, ignore_index=True)
worksheet.set_dataframe(combined_results, 'A1')
results_to_export.clear()
results =[]
But this just kept looping through the while loop and not moving onto the next Try block or throwing up errors (I tried every version of unindenting the if statements too but nothing worked).
Please help!

A shorter program would be more likely to get an expert answer
It may be a long time until you find somebody stumbling on to this page who is willing to read so much text, and who knows how to solve the problem. To improve your chances, it is best to trim your program to the absolute smallest size that allows the problem to manifest.
Is your if counter statement not indented enough?
Currently you have:
results = []
results_to_export = []
for i in range(10):
counter = 0
while counter < 5000:
# your other code here
print("Starting loop iteration")
results.append(dfCWV2)
counter += 1
if counter % 10 == 0:
print("Running after 10 loops")
But the if counter, positioned where it is in the above code, will only be reached after 5000 steps of "while counter".
Did you mean this?
results = []
results_to_export = []
for i in range(10):
counter = 0
while counter < 5000:
# your other code here
print("Starting loop iteration")
results.append(dfCWV2)
counter += 1
if counter % 10 == 0:
print("Running after 10 loops")

How to scrape a table with selenium?

I'm having a weird issue trying to scrape a table with selenium. For reference, the table is the item table here, although ideally I would like to be able to scrape any item table for any hero on this site.
self.item_table_xpath = '//table[descendant::thead[descendant::tr[descendant::th[contains(text(), "Item")]]]]'
def retrieve_hero_stats(self, url):
self.driver.get(url)
try:
win_rate_span = self.driver.find_element(by = By.XPATH, value = '//dd[descendant::*[#class = "won"]]/span')
except:
win_rate_span = self.driver.find_element(by = By.XPATH, value = '//dd[descendant::*[#class = "lost"]]/span')
win_rate = win_rate_span.text
hero_name = url.split('/')[-1]
values = list()
for i in range(1, 13):
values.append({
'Item Name': self.driver.find_element(by = By.XPATH, value = self.item_table_xpath + f'/tbody/tr[{i}]' + '/td[2]').text,
'Matches Played': self.driver.find_element(by = By.XPATH, value = self.item_table_xpath + f'/tbody/tr[{i}]' + '/td[3]').text,
'Matches Won': self.driver.find_element(by = By.XPATH, value = self.item_table_xpath + f'/tbody/tr[{i}]' + '/td[4]').text,
'Win Rate': self.driver.find_element(by = By.XPATH, value = self.item_table_xpath + f'/tbody/tr[{i}]' + '/td[5]').text
})
print(hero_name)
print(values)
The issue is the output of the code is inconsistent; sometimes the fields in the values list are populated, and sometimes they are not. This changes each time I run my code. I don't necessarily need someone to write this code for me, in fact, I'd prefer you didn't, I'm just stumped as to why the output changes every time I run?

Scraping data beach volleyball on multiple pages

I am trying to scrape all the possible data from this webpage Gstaad 2017
Here is my code:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
from selenium.webdriver.support.ui import Select
#Starts the driver and goes to our starting webpage
driver = webdriver.Chrome( "C:/Users/aldi/Downloads/chromedriver.exe")
driver.get('http://www.bvbinfo.com/Tournament.asp?ID=3294&Process=Matches')
#Imports HTML into python
page = requests.get('http://www.bvbinfo.com/Tournament.asp?ID=3294&Process=Matches')
soup = BeautifulSoup(driver.page_source, 'lxml')
stages = soup.find_all('div')
stages = driver.find_elements_by_class_name('clsTournBracketHeader')[-1].text
#TODO the first row (country quota matches) has no p tag and therefore it is not included in the data
rows = []
paragraphs = []
empty_paragraphs = []
for x in soup.find_all('p'):
if len(x.get_text(strip=True)) != 0:
paragraph = x.extract()
paragraphs.append(paragraph)
if len(x.get_text(strip=True)) == 0:
empty_paragraph = x.extract()
empty_paragraphs.append(empty_paragraph)
# players
home_team_player_1 = ''
home_team_player_2 = ''
away_team_player_1 = ''
away_team_player_2 = ''
for i in range(0, len(paragraphs)):
#round and satege of the competition
round_n= paragraphs[i].find('u').text
paragraph_rows = paragraphs[i].text.split('\n')[1:-1]
counter = 0
for j in range(0,len(paragraph_rows)):
#TODO tournament info, these can vary from tournament to tournament
tournament_info = soup.find('td', class_ = 'clsTournHeader').text.strip().split()
tournament_category = [' '.join(tournament_info[0 : 2])][0]
tournament_prize_money = tournament_info[2]
#TODO tournament city can also have two elements, not just one
tournament_city = tournament_info[3]
tournament_year = tournament_info[-1]
tournament_days = tournament_info[-2][:-1].split("-")
tournament_starting_day = tournament_days[0]
tournament_ending_day = tournament_days[-1]
tournament_month = tournament_info[-3]
tournament_stars = [' '.join(tournament_info[5 : 7])][0]
players = paragraphs[i].find_all('a', {'href':re.compile('.*player.*')})
home_team_player_1 = players[counter+0].text
home_team_player_2 = players[counter+1].text
away_team_player_1 = players[counter+2].text
away_team_player_2 = players[counter+3].text
#matches
match= paragraph_rows[j].split(":")[0].split()[-1].strip()
#nationalities
nationalities = ["United", "States"]
if paragraph_rows[j].split("def.")[0].split("/")[1].split("(")[0].split(" ")[3] in nationalities:
home_team_country = "United States"
else:
home_team_country = paragraph_rows[j].split("def.")[0].split("/")[1].split("(")[0].split(" ")[-2]
if paragraph_rows[j].split("def.")[1].split("/")[1].split(" ")[3] in nationalities:
away_team_country = "United States"
else:
away_team_country = paragraph_rows[j].split("def.")[1].split("/")[1].split("(")[0].split(" ")[-2]
parentheses = re.findall(r'\(.*?\)', paragraph_rows[j])
if "," in parentheses[0]:
home_team_ranking = parentheses[0].split(",")[0]
home_team_ranking = home_team_ranking[1:-1]
home_team_qualification_round = parentheses[0].split(",")[1]
home_team_qualification_round = home_team_qualification_round[1:-1]
else:
home_team_ranking = parentheses[0].split(",")[0]
home_team_ranking = home_team_ranking[1:-1]
home_team_qualification_round = None
if "," in parentheses[1]:
away_team_ranking = parentheses[1].split(",")[0]
away_team_ranking = away_team_ranking[1:-1]
away_team_qualification_round = parentheses[1].split(",")[1]
away_team_qualification_round = away_team_qualification_round[1:-1]
else:
away_team_ranking = parentheses[1].split(",")[0]
away_team_ranking = away_team_ranking[1:-1]
match_duration = parentheses[2]
match_duration = match_duration[1:-1]
away_team_qualification_round = None
# sets
sets = re.findall(r'\).*?\(', paragraph_rows[j])
sets = sets[1][1:-1]
if len(sets.split(",")) == 2:
score_set1 = sets.split(",")[0]
score_set2 = sets.split(",")[1]
score_set3 = None
if len(sets.split(",")) == 3:
score_set1 = sets.split(",")[0]
score_set2 = sets.split(",")[1]
score_set3 = sets.split(",")[2]
row = { " home_team_player_1 ": home_team_player_1 ,
" home_team_player_2": home_team_player_2,
"away_team_player_1": away_team_player_1,
"away_team_player_2":away_team_player_1,
"match": match,
"home_team_country":home_team_country,
"away_team_country": away_team_country,
"home_team_ranking": home_team_ranking,
"away_team_ranking": away_team_ranking,
"match_duration": match_duration,
"home_team_qualification_round": home_team_qualification_round,
"away_team_qualification_round": away_team_qualification_round,
"score_set1":score_set1,
"score_set2":score_set2,
"score_set3":score_set3,
"tournament_category": tournament_category,
"tournament_prize_money": tournament_prize_money,
"tournament_city": tournament_city,
"tournament_year": tournament_year,
"tournament_starting_day": tournament_starting_day,
"tournament_ending_day":tournament_ending_day,
"tournament_month":tournament_month,
"tournament_stars":tournament_stars,
"round_n": round_n
}
counter += 4
rows.append(row)
data = pd.DataFrame(rows)
data.to_csv("beachvb.csv", index = False)
I am not really experienced in web scraping. I have just started as a self-taught and find the HTML source code quite messy and poorly structured.
I want to improve my code in two ways:
Include all the missing matches (country quota matches, semifinals, bronze medal, and gold medal) and the respective category for each match (country quota matches, pool, winner's bracket, semifinals, bronze medal, and gold medal)
iterate the code for more years and tournaments from the dropdown menu at the top of the webpage
I have tried to iterate through different years but my code does not work
tournament_years = {"FIVB 2015", "FIVB 2016"}
dfs = []
for year in tournament_years:
# select desired tournament
box_year = Select(driver.find_element_by_xpath("/html/body/table[3]/tbody/tr/td/table[1]/tbody/tr[1]/td[2]/select"))
box_year.select_by_visible_text(year)
box_matches = Select(driver.find_element_by_xpath("/html/body/table[3]/tbody/tr/td/table[1]/tbody/tr[2]/td[2]/select"))
box_matches.select_by_visible_text("Matches")
The main idea was to create a list of dataframes for each year and each tournament by adding a new loop at the beginning of the code.
If someone has a better idea and technique to do so, it is really appreciated!

Making function iterative instead of recursive [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 4 years ago.
Improve this question
This loop is using massive amounts of ram. For a 20kb text file, can anyone help me format it to be iterative instead of recursive? As I keep getting recursion errors when it gets into the 3-4gb of ram usage. I tried using with open to close the stream and make it more pythonic. This method loop can only read data for about 10 minutes before it quits out on me.
def getgameticks():
gameticksurl = 'https://pro.stubhub.com/simweb/sim/services/priceanalysis?eventId=' + variable + '&sectionId=0'
print(gameticksurl)
# options = Options()
# options.add_argument("--headless")
# browser = webdriver.Firefox()#firefox_options=options)
browser.get(gameticksurl)
global wait
wait = WebDriverWait(browser, 30)
sleep(3)
browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
sleep(3)
wait.until(expected_conditions.presence_of_element_located((By.ID, 'listingsPerPage')))
browser.find_element_by_id('listingsPerPage').click
sleep(2)
select = Select(browser.find_element_by_id('listingsPerPage'))
select.select_by_visible_text('150')
gameinfo()
global trip
trip = False
def gameinfo():
wait.until(expected_conditions.presence_of_element_located((By.XPATH, '//*[#id="filterBtn"]')))
browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
html_doc = browser.page_source
soup = BeautifulSoup(html_doc, 'html.parser')
wait.until(expected_conditions.presence_of_element_located((By.XPATH, '//*[#id="listingPageNumber"]')))
try:
select = Select(browser.find_element_by_xpath('//*[#id="listingPageNumber"]'))
current = select.all_selected_options[0].text
last = [option.text for option in select.options][-1]
pronto = False
except:
print('Something broke...Getting around it though...')
gameinfo()
if current == last:
global trip
trip = True
browser.find_element_by_tag_name('body').send_keys(Keys.CONTROL + Keys.HOME)
wait.until(expected_conditions.presence_of_element_located((By.XPATH, '//*[#id="filterBtn"]')))
browser.find_element_by_xpath('//*[#id="filterBtn"]').click()
wait.until(expected_conditions.presence_of_element_located((By.XPATH, '//*[#id="filterBtn"]')))
gameinfo()
else:
wait.until(expected_conditions.presence_of_element_located((By.XPATH, '//*[#id="listingNextBtn"]')))
browser.find_element_by_xpath('//*[#id="listingNextBtn"]').click()
pass
dir_path = os.path.dirname(os.path.realpath(__file__))
file_path = (dir_path+'\Sheets')
try:
os.makedirs(file_path)
except:
pass
#######################
for mytable in soup.find_all('table'):
for trs in mytable.find_all('tr'):
tds = trs.find_all('td')
row1 = [elem.text.strip() for elem in tds]
row = str(row1)
cool = row.replace("[", "")
coolp = cool.replace("]", "")
cool2 = coolp.replace("'", "")
cool3 = cool2.replace(" , ", "")
row = cool3
rowtest = (row.split(','))
if len(rowtest) != 5:
rowtest = ['NULL', 'NULL', 'NULL', 'NULL', 'NULL']
row = (','.join(rowtest))
rowtest0 = rowtest[:4] # LISTING WITHOUT DAYS LISTED
rowtest1 = rowtest[0:1] # SECTION LOCATION
rowtest2 = rowtest[1:2] # TICKET PRICE
rowtest3 = rowtest[2:3] # ROW
rowtest4 = rowtest[3:4] # TICKET QTY
rowtest5 = rowtest[4:5] # DAYS LISTED
###TABLE STUFF#
row0 = (','.join(rowtest0)) #ROW STRING WITHOUT DAYS LISTED
with open(file_path+'\\'+variable+'.txt', "a+") as openit:
pass
#TABLE STUFF
with open(file_path+'\\'+variable+'.txt', "r+") as file:
for line in file:
linez = (line.split(',')) #LINE AS LIST
linezprice = (linez[-3]) #LINE PRICE
if row0+"\n" in line:
break
else:
file.write(row0+"\n")
print(row)
if trip == False:
pass
else:
slack_token1 = 'xoxb-420561995540-420693438947-JAZmP1pdfg6FkqnTTziPdggr'
sc1 = SlackClient(slack_token1)
sc1.api_call(
"chat.postMessage",
channel=channame,
text=row
)
while True:
gameinfo()

It seems like you want to continuously scrape some site -
just remove all the calls to gameinfo besides the endless loop - there's no reason to do this as a recursion

How do I use joblib to parallelize a Selenium scraping task? (A non-working example)

I have a task that requires extracting data from the Indian 2011 Census. I am using Selenium and have a working script (featured below), but I am trying to use the joblib library and Parallel to parallelize the task. I do not receive an error when I run this script, and I do observe my processors active in my task manager (Windows 10), but I do not see any files saved from running this program and it continues to run long after a non-parallel version would have completed. Any help would be much appreciated. Thanks so much. BTW, here is the link to the input dataset for this program.
The first four records of the
import time
import re
import string
import urllib.parse
import pandas
import numpy
import os
import csv
import joblib
from selenium import webdriver
from joblib import Parallel, delayed
from selenium.webdriver.support.ui import Select
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
path = 'C:/Users/d.wm.mclaughlin/Dropbox/research/india'
os.chdir(path)
input_df = pandas.read_excel("file_path/villages_3109_UTTAR PRADESH_12_003.xlsx", "Sheet1")
def downloadFunction(x):
driver = webdriver.PhantomJS('C:/phantomjs/bin/phantomjs.exe')
url = "url"
driver.get(url);
selected_state = str(input_df['state_no'][x])
selected_district = str(input_df['dist_no'][x])
selected_block = str(input_df['block_no'][x]).zfill(3)
selected_pan = str(input_df['pan'][x]).zfill(4)
selected_state_name = input_df['state'][x]
selected_dist_name = input_df['district'][x]
selected_block_name = input_df['block'][x]
selected_pan_name = input_df['village'][x]
select = Select(driver.find_element_by_css_selector("#ddl_state"))
select.select_by_value(selected_state)
distSelect = Select(driver.find_element_by_css_selector("#ddl_dist"))
distSelect.select_by_value(selected_district)
blkSelect = Select(driver.find_element_by_css_selector("#ddl_blk"))
blkSelect.select_by_value(selected_block)
panSelect = Select(driver.find_element_by_css_selector("#ddl_pan"))
panSelect.select_by_value(selected_pan)
button_list = ['#RadioButtonList1_0', '#RadioButtonList1_1', '#RadioButtonList1_2']
button_names = ['auto_inclusion', 'auto_exclusion', 'other']
for b in range(0,1):
selected_button = button_list[b]
selected_button_name = button_names[b]
driver.find_element_by_css_selector(selected_button).click()
driver.find_element_by_css_selector('#Button1').click()
if('No Record Found !!!' in driver.page_source):
print('No Record Found !!!')
else:
ae = driver.find_element_by_css_selector('#form1 > div:nth-child(4) > center:nth-child(2) > table > tbody > tr:nth-child(3) > td:nth-child(1)').text
if(ae == ''): ae = 0
ai = driver.find_element_by_css_selector('#form1 > div:nth-child(4) > center:nth-child(2) > table > tbody > tr:nth-child(3) > td:nth-child(2)').text
if(ai == ''): ai = 0
oth = driver.find_element_by_css_selector('#form1 > div:nth-child(4) > center:nth-child(2) > table > tbody > tr:nth-child(3) > td:nth-child(3)').text
if(oth == ''): oth = 0
dep = driver.find_element_by_css_selector('#form1 > div:nth-child(4) > center:nth-child(2) > table > tbody > tr:nth-child(3) > td:nth-child(4)').text
if(dep == ''): dep = 0
ae = int(ae)
ai = int(ai)
oth = int(oth)
dep = int(dep)
ai_dep = ai + dep
records = [ai_dep, ae, oth]
selected_record = records[b]
table_number = round(selected_record/45)
table_numbers = list(range(1, (1+(table_number)*3), 3))
data = []
for data_tab in table_numbers:
table_address = '#Div1 > table:nth-child(' + str(data_tab) + ')'
#print(table_address)
for tr in driver.find_elements_by_css_selector(table_address):
# CONTINUE FROM HERE!!!
#print(tr == driver.find_element_by_css_selector("#Div1 > table:nth-child(" + str(data_tab) + ") > tbody > tr:nth-child(1)"))
#"#Div1 > table:nth-child(" + str(data_tab) + ") > tbody > tr:nth-child(2)"
#"#Div1 > table:nth-child(" + str(data_tab) + ") > tbody > tr:nth-child(3)"
tds = tr.find_elements_by_tag_name('td')
if tds:
data.append([td.text for td in tds])
#newArray = numpy.array(data)
for listItem in range(0,len(data)):
if(listItem > 0):
data[listItem] = data[listItem][18:len(data[listItem])]
#print(len(data[listItem]))
flat_data = [item for sublist in data for item in sublist]
newArray = numpy.array(flat_data)
dataRows = int(numpy.array(flat_data).size / 9)
rowsTimesColumns = (dataRows * 9)
test = pandas.DataFrame(newArray.reshape(dataRows,9), columns=['no', 'hh_name', 'gender', 'age', 'sc', 'fm_name', 'depriv_count', 'ai_d_code', 'total_mem'])
file_path = 'C:/Users/d.wm.mclaughlin/Dropbox/research/lpg_india/data/secc/secc' + '_' + selected_state + '_' + '_' + selected_district + '_' + '_' + selected_block + '_' + '_' + selected_pan + '_' + '_' + selected_button_name + '.xlsx'
test.to_excel(file_path, 'Sheet1')
return print(x);
tester = Parallel(n_jobs=3)(delayed(downloadFunction)(in_val) for in_val in range(1, 10))

Assuming that you have enough memory to run this without using swap you should take a look at the documentation. From https://pythonhosted.org/joblib/parallel.html. Pay particular attention to the last line.
Warning
Under Windows, it is important to protect the main loop of code to
avoid recursive spawning of subprocesses when using joblib.Parallel.
In other words, you should be writing code like this:
import ....
def function1(...):
...
def function2(...):
...
... if __name__ == '__main__':
# do stuff with imports and functions defined about
...
No code should run outside of the “if name == ‘main’” blocks,
only imports and definitions.
If it is a memory issue please read the rest of the page. You could start with
from joblib.pool import has_shareable_memory
and changing your last line to:
if __name__ == '__main__':
tester = Parallel(n_jobs=3, max_nbytes=1e2)(delayed(downloadFunction, has_shareable_memory)(in_val) for in_val in range(1, 10))
But I'm guessing that not much of your memory consumption can be shared.
You could also add some garbage collection to save memory:
import gc
before your return statement delete all unnecessary variables and add
del driver
del test
del newArray
del data
# and all the rest
_ = gc.collect()
but be aware that this will not garbage collect the underlying executables memory e.g. PhantomJS

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to optimize a Selenium webdriver crawler? - python

Related

Export data to gsheet workbooks / worksheets after looping through a script 10 times

How to scrape a table with selenium?

Scraping data beach volleyball on multiple pages

Making function iterative instead of recursive [closed]

How do I use joblib to parallelize a Selenium scraping task? (A non-working example)

Categories

Resources