I am trying to open several URLs (because they contain data I want to append to a list). I have logic saying "if amount in icl_dollar_amount_l", then run the rest of the code. However, I want the script to run the rest of the code only for the specific amount in the variable "amount".
Example:
Selenium opens X number of links and sees ['144,827.95', '5,199,024.87', '130,710.67'] in icl_dollar_amount_l, but I want it to skip '144,827.95' and '5,199,024.87' and only get the information for '130,710.67', which is already in the 'amount' variable.
Actual results:
It's getting web scraping information for amount '144,827.95' only and not even going to '5,199,024.87' or '130,710.67'. I only want it to get web scraping information for '130,710.67' because my amount variable has this as the only amount.
print(icl_dollar_amount_l)
['144,827.95', '5,199,024.87', '130,710.67']
print(amount)
'130,710.67'
file2.py
import re
import logging

from selenium import webdriver


def scrapeBOAWebsite(url, fcg_subject_l, gp_subject_l):
    from ICL_Awk_Checker import rps_amount_l2

    icl_dollar_amount_l = []
    amount_ack_missing_l = []
    file_total_l = []
    body_l = []
    for link in url:
        print(link)
        # `options` is defined elsewhere in file2.py
        browser = webdriver.Chrome(options=options,
                                   executable_path=r'\\TEST\user$\TEST\Documents\driver\chromedriver.exe')
        # if 'P2 Cust ID 908554 File' in fcg_subject:
        browser.get(link)
        username = browser.find_element_by_name("dialog:username").get_attribute('value')
        submit = browser.find_element_by_xpath("//*[@id='dialog:continueButton']").click()
        body = browser.find_element_by_xpath("//*[contains(text(), 'Total:')]").text
        body_l.append(body)
        icl_dollar_amount = re.findall(r'(?:[\£\$\€]{1}[,\d]+.?\d*)', body)[0].split('$', 1)[1]
        icl_dollar_amount_l.append(icl_dollar_amount)

    # missing_amount is defined elsewhere in file2.py
    if not missing_amount:
        logging.info("List is empty")
        print("List is empty")

    count = 0
    for amount in missing_amount:
        if amount in icl_dollar_amount_l:
            body = body_l[count]
            get_file_total = re.findall(r'(?:[\£\$\€]{1}[,\d]+.?\d*)', body)[0].split('$', 1)[1]
            file_total_l.append(get_file_total)

    return icl_dollar_amount_l, file_date_l, company_id_l, client_id_l, customer_name_l, file_name_l, file_total_l, \
        item_count_l, file_status_l, amount_ack_missing_l
I don't know if I fully understand the problem, but this

if amount in icl_dollar_amount_l:

doesn't tell you at which position '130,710.67' is in icl_dollar_amount_l, so you also need

count = icl_dollar_amount_l.index(amount)
for amount in missing_amount:
    if amount in icl_dollar_amount_l:
        count = icl_dollar_amount_l.index(amount)
        body = body_l[count]
But this works only if you expect a single matching amount in the list icl_dollar_amount_l. For more elements you would have to use a for-loop instead and check every element separately:
for amount in missing_amount:
    for count, item in enumerate(icl_dollar_amount_l):
        if amount == item:
            body = body_l[count]
But frankly, I don't know why you don't check it in the first loop, for link in url:, where you have direct access to icl_dollar_amount and body.
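A minimal sketch of that idea (assuming missing_amount is already populated before the loop, as in the original script; the page handling is kept from the question's code):

# Sketch: filter while scraping, instead of matching amounts in a second pass.
for link in url:
    browser.get(link)
    body = browser.find_element_by_xpath("//*[contains(text(), 'Total:')]").text
    icl_dollar_amount = re.findall(r'(?:[\£\$\€]{1}[,\d]+.?\d*)', body)[0].split('$', 1)[1]
    icl_dollar_amount_l.append(icl_dollar_amount)
    # only keep working with this page if the scraped amount is one we want
    if icl_dollar_amount in missing_amount:
        file_total_l.append(icl_dollar_amount)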
I'm building a web scraper and I'm able to print all the data I need, but I'm struggling with adding the data to my CSV file. I feel like I need to add another for loop or even a function. Currently I'm able to get it to print one row of scraped data values, but it skips the 64 other rows of data values.
So far I've tried to put in another for loop and break each variable into its own function, but it just breaks my code. Here's what I have so far; I feel like I'm just missing something.
#Gets listing box
listingBox = searchGrid.find_elements(By.CLASS_NAME, 'v2-listing-card')

#Loops through each listing box
for listingBoxes in listingBox:
    listingUrl = []
    listingImg = []
    listingTitle = []
    listingPrice = []

    #Gets listing url
    listingUrl = listingBoxes.find_element(By.CSS_SELECTOR, 'a.listing-link')
    print("LISTING URL:", listingUrl.get_attribute('href'))

    #Gets listing image
    listingImg = listingBoxes.find_element(By.CSS_SELECTOR, 'img.wt-position-absolute')
    print("IMAGE:", listingImg.get_attribute('src'))

    #Gets listing title
    listingTitle = listingBoxes.find_element(By.CLASS_NAME, 'wt-text-caption')
    print("TITLE:", listingTitle.text)

    #Gets price
    listingPrice = listingBoxes.find_element(By.CLASS_NAME, 'currency-value')
    print("ITEM PRICE: $", listingPrice.get_attribute("innerHTML"))

    #Gets seller name
    # listingSellerName = listingBoxes.find_element(By.XPATH, '/html/body/main/div/div[1]/div/div[3]/div[8]/div[2]/div[10]/div[1]/div/div/ol/li/div/div/a[1]/div[2]/div[2]/span[3]')
    # print("SELLER NAME:", listingSellerName.get_attribute("innerHTML"))

    print("---------------")

finally:  # the enclosing try block is omitted from this snippet
    driver.quit()

data = {'Listing URL': listingUrl, 'Listing Thumbnail': listingImg,
        'Listing Title': listingTitle, 'Listing Price': listingPrice}
df = pd.DataFrame.from_dict(data, orient='index')
df = df.transpose()
df.to_csv('raw_data.csv')
print('Data has been scrapped and added.')
In your code, each loop iteration resets the lists listingUrl, listingImg, etc. That's why df contains only one row of scraped data, corresponding to the last iteration executed. If you want to add elements to a list you have to define the list BEFORE the loop and then use the .append() method inside the loop.
Then, instead of doing listingUrl.get_attribute('href') you will do listingUrl[-1].get_attribute('href'), where [-1] means that you are taking the last element of the list.
listingUrl = []
listingImg = []
listingTitle = []
listingPrice = []

for listingBoxes in listingBox:
    #Gets listing url
    listingUrl.append(listingBoxes.find_element(By.CSS_SELECTOR, 'a.listing-link'))
    print("LISTING URL:", listingUrl[-1].get_attribute('href'))

    #Gets listing image
    listingImg.append(listingBoxes.find_element(By.CSS_SELECTOR, 'img.wt-position-absolute'))
    print("IMAGE:", listingImg[-1].get_attribute('src'))

    #Gets listing title
    listingTitle.append(listingBoxes.find_element(By.CLASS_NAME, 'wt-text-caption'))
    print("TITLE:", listingTitle[-1].text)

    #Gets price
    listingPrice.append(listingBoxes.find_element(By.CLASS_NAME, 'currency-value'))
    print("ITEM PRICE: $", listingPrice[-1].get_attribute("innerHTML"))
The idea of the code is to add unwatched EPs to an existing playlist in index order (ep 1 of Show X, ep 1 of Show Z), regardless of air date:
from plexapi.server import PlexServer

baseurl = 'http://0.0.0.0:0000/'
token = '0000000000000'
plex = PlexServer(baseurl, token)

episode = 0
first_ep_name = []
for x in plex.library.section('Anime').search(unwatched=True):
    try:
        for y in plex.library.section('Anime').get(x.title).episodes()[episode]:
            if plex.library.section('Anime').get(x.title).episodes()[episode].isWatched:
                episode += 1
                first_ep_name.append(y)
            else:
                episode = 0
                first_ep_name.append(y)
    except:
        continue
plex.playlist('Anime Playlist').addItems(first_ep_name)
But when I run it, it always adds watched EPs; yet if I debug the code in the Thonny IDE it seems to be doing its job, so I am not sure what's wrong with the code.
Any ideas?
I'm thinking that the error might be here:
plex.playlist('Anime Playlist').addItems(first_ep_name)
but according to the documentation addItems should take a list, and my list "first_ep_name" is already appending unwatched episodes in the correct order. In theory addItems should recognize the specific episode and not only the series name, but I am not sure anymore.
In case somebody out there is having the same issue with plexapi: I was able to find a way to get this project working properly:
from plexapi.server import PlexServer

baseurl = 'insert plex url here'
token = 'plex token here'
plex = PlexServer(baseurl, token)

anime_plex = []
scrapped_playlist = []
for x in plex.library.section('Anime').search(unwatched=True):
    anime_plex.append(x)

while len(anime_plex) > 0:
    episode_list = []
    for y in plex.library.section('Anime').get(anime_plex[0].title).episodes():
        episode_list.append(y)
    ep_checker = True
    while ep_checker:
        if episode_list[0].isWatched:
            episode_list.pop(0)
        else:
            scrapped_playlist.append(episode_list[0])
            episode_list.clear()
            ep_checker = False
    anime_plex.pop(0)

# plex.playlist('Anime Playlist').addItems(scrapped_playlist)
plex.playlist('Anime Playlist').delete()
plex.createPlaylist('Anime Playlist', section='Anime', items=scrapped_playlist)
Basically, what I am doing with that code is looping through each anime series I have; if EP #X is watched, it gets popped from the list until the loop finds one whose isWatched is False, and that episode is appended to an initially empty list that I later use for creating/adding to the playlist.
The last lines of the code can be commented out depending on the purpose: creating the playlist or adding items.
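For what it's worth, the pop-until-unwatched loop can be condensed with a generator expression; a sketch of the same idea, assuming the same plex object and the Show.episodes() / isWatched behavior used above:

# Sketch: pick the first unwatched episode of each series directly,
# instead of popping watched ones off the front of the list.
scrapped_playlist = []
for show in plex.library.section('Anime').search(unwatched=True):
    first_unwatched = next((ep for ep in show.episodes() if not ep.isWatched), None)
    if first_unwatched is not None:
        scrapped_playlist.append(first_unwatched)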
I'm working on a Python bot with Selenium, and infinite scrolling in a dialog box isn't working because "arguments[0].scrollHeight" returns None:
dialogBx = driver.find_element_by_xpath("//div[@role='dialog']/div[2]")
print(dialogBx)        # <selenium.webdriver.remote.webelement.WebElement (session="fcec89cc11fa5fa5eaf29a8efa9989f9", element="31bfd470-de78-XXXX-XXXX-ac1ffa6224c4")>
print(type(dialogBx))  # <class 'selenium.webdriver.remote.webelement.WebElement'>
sleep(5)
last_height = driver.execute_script("arguments[0].scrollHeight", dialogBx)
print("Height : ", last_height)  # None
I need the last height to compare against; please suggest a solution.
Ok, to answer your question: since you are inside a dialog, we should focus on it. When you execute last_height=driver.execute_script("arguments[0].scrollHeight",dialogBx); I believe you are executing that in the main page or in the wrong div (not 100% sure). Either way, I took a different approach: we are going to select the last <li> item currently available in the dialog and scroll down to its position, which will force the dialog to update. Here is an extract from the full code you will see below:
last_li_item = driver.find_element_by_xpath('/html/body/div[4]/div/div[2]/ul/div/li[{p}]'.format(p=start_pos))
last_li_item.location_once_scrolled_into_view
We first select the last list item and then access the property location_once_scrolled_into_view. This property will scroll our dialog down to our last item, and the dialog will then load more items. start_pos is just the position in the list of the last <li> element we have available, i.e.: for <div><li></li><li></li><li></li></div>, start_pos=2, which is the last li item starting from 0. I used this variable name because it is inside a for loop which watches the changes of li items inside the div; you will get it once you see the full code.
On the other hand, to execute this, simply change the parameters at the top and run the test function test(). If you are already logged in to Instagram you can just run get_list_of_followers().
Note: this function uses a Follower class that is also in this code. You can remove it if you wish, but you will then need to modify the function.
IMPORTANT:
When you execute this program, the dialog box items will keep increasing until there are no more items to load, so a TODO would be to remove the elements you have already processed; otherwise I believe performance will get slower when you start hitting big numbers!
Let me know if you need any other explanation. Now the code:
import time

from selenium import webdriver
from selenium.webdriver.remote.webelement import WebElement

# instagram url as our base
base_url = "https://www.instagram.com"

# =====================MODIFY THESE TO YOUR NEED=========
# the user we wish to get the followers from
base_user = "/nasa/"
# how much do you wish to sleep to wait for loading (seconds)
sleep_time = 3
# True will attempt login with facebook, False with instagram
login_with_facebook = True
# Credentials here
username = "YOUR_USERNAME"
password = "YOUR_PASSWORD"
# How many users do you wish to retrieve? -1 = all or n>0
get_users = 10
# ==========================================================

# This is the div that contains all the followers info, not the dialog box itself
dialog_box_xpath = '/html/body/div[4]/div/div[2]/ul/div'
total_followers_xpath = '/html/body/div[1]/section/main/div/header/section/ul/li[2]/a/span'
followers_button_xpath = '/html/body/div[1]/section/main/div/header/section/ul/li[2]/a'
insta_username_xpath = '/html/body/div[5]/div/div[2]/div[2]/div/div/div[1]/div/form/div[2]/div/label/input'
insta_pwd_xpath = '/html/body/div[5]/div/div[2]/div[2]/div/div/div[1]/div/form/div[3]/div/label/input'
insta_login_button_xpath = '/html/body/div[5]/div/div[2]/div[2]/div/div/div[1]/div/form/div[4]/button'
insta_fb_login_button_xpath = '/html/body/div[5]/div/div[2]/div[2]/div/div/div[1]/div/form/div[6]/button'
fb_username_xpath = '/html/body/div[1]/div[3]/div[1]/div/div/div[2]/div[1]/form/div/div[1]/input'
fb_pwd_xpath = '/html/body/div[1]/div[3]/div[1]/div/div/div[2]/div[1]/form/div/div[2]/input'
fb_login_button_xpath = '/html/body/div[1]/div[3]/div[1]/div/div/div[2]/div[1]/form/div/div[3]/button'

u_path = fb_username_xpath if login_with_facebook else insta_username_xpath
p_path = fb_pwd_xpath if login_with_facebook else insta_pwd_xpath
lb_path = fb_login_button_xpath if login_with_facebook else insta_login_button_xpath


# Simple class for a follower; you don't actually need this, but for explanation it is ok.
class Follower:
    def __init__(self, user_name, href):
        self.username = user_name
        self.href = href

    @property
    def get_username(self):
        return self.username

    @property
    def get_href(self):
        return self.href

    def __repr__(self):
        return self.username


def test():
    base_user_path = base_url + base_user
    driver = webdriver.Chrome()
    driver.get(base_user_path)
    # click the followers button, which will ask for login
    driver.find_element_by_xpath(followers_button_xpath).click()
    time.sleep(sleep_time)
    # now we decide if we will login with facebook or instagram
    if login_with_facebook:
        driver.find_element_by_xpath(insta_fb_login_button_xpath).click()
        time.sleep(sleep_time)
    username_input = driver.find_element_by_xpath(u_path)
    username_input.send_keys(username)
    password_input = driver.find_element_by_xpath(p_path)
    password_input.send_keys(password)
    driver.find_element_by_xpath(lb_path).click()
    # We need to wait a little longer for the page to load. Feel free to change this to your needs.
    time.sleep(10)
    # click the followers button again
    driver.find_element_by_xpath(followers_button_xpath).click()
    time.sleep(sleep_time)
    # now we get the list of followers from the dialog box. This function returns a list of Follower objects.
    followers: list[Follower] = get_list_of_followers(driver, dialog_box_xpath, get_users)
    # close the driver, we do not need it anymore.
    driver.close()
    for follower in followers:
        print(follower, follower.get_href)


def get_list_of_followers(driver, d_xpath=dialog_box_xpath, get_items=10):
    """
    Get a list of followers from instagram
    :param driver: driver instance
    :param d_xpath: dialog box xpath. By default it gets the global parameter but you can change it
    :param get_items: how many items do you wish to obtain? -1 = try to get all of them; any positive number
        is the number of followers to obtain
    :return: list of Follower objects
    """
    # getting the dialog content element
    dialog_box: WebElement = driver.find_element_by_xpath(d_xpath)
    # getting all the list items (<li></li>) inside the dialog box
    dialog_content: list[WebElement] = dialog_box.find_elements_by_tag_name("li")
    # Get the total number of followers. Since we get a string we need to convert it with int(<str>)
    total_followers = int(driver.find_element_by_xpath(total_followers_xpath)
                          .get_attribute("title").replace(".", ""))
    # how many items do we have without scrolling down?
    li_items = len(dialog_content)
    # We are trying to get n elements (n = get_items). We need to check if there are enough followers to
    # retrieve from; if not, we will get the maximum quantity available. This applies only if n >= 0.
    # If n is -1, then the total amount of followers is n.
    if get_items == -1:
        get_items = total_followers
    elif -1 < get_items <= total_followers:
        # no need to change anything, it is ok to work with get_items
        pass
    else:
        # if it is not -1 and not between 0 and total_followers then we raise an error
        raise IndexError
    # You can start from greater than 0, but that will give you a shorter list of followers than you wish if
    # there are not enough followers available. i.e.: total_followers = 10, get_items = 10, start_from = 1.
    # This will only return 9 followers, not 10, even if get_items is 10.
    return generate_followers(0, get_items, total_followers, dialog_box, driver)


def generate_followers(start_pos, get_items, total_followers, dialog_box_element: WebElement, driver):
    """
    Generate followers based on the parameters
    :param start_pos: index of where to start getting the followers from
    :param get_items: total items to get
    :param total_followers: total number of followers
    :param dialog_box_element: dialog box to get the list items count from
    :param driver: driver object
    :return: followers list
    """
    if -1 < start_pos < total_followers:
        # We want to count items from our current position until the last element available without scrolling.
        # We do it this way so that when we scroll down and the list of items grows, we start generating
        # followers from our last current position, not from the beginning!
        first = dialog_box_element.find_element_by_xpath("./li[{pos}]".format(pos=start_pos + 1))
        li_items = dialog_box_element.find_elements_by_xpath(
            "./li[position()={pos}][last()]/following-sibling::li".format(pos=(start_pos + 1)))
        li_items.insert(0, first)
        print("Generating followers from position: {pos} with {li_count} list items"
              .format(pos=(start_pos + 1), li_count=len(li_items)))
        followers = []
        for i in range(len(li_items)):
            anchors = li_items[i].find_elements_by_tag_name("a")
            anchor = anchors[0] if len(anchors) == 1 else anchors[1]
            follower = Follower(anchor.text, anchor.get_attribute("href"))
            followers.append(follower)
            get_items -= 1
            start_pos += 1
            print("Follower {f} added to the list".format(f=follower))
            # We break the loop if our starting position has reached the total or if get_items has reached 0
            # (meaning if we requested 10 items, we got them all, no need to continue).
            if start_pos >= total_followers or get_items == 0:
                print("finished")
                return followers
        print("finished loop, executing scroll down...")
        last_li_item = driver.find_element_by_xpath(
            '/html/body/div[4]/div/div[2]/ul/div/li[{p}]'.format(p=start_pos))
        last_li_item.location_once_scrolled_into_view
        time.sleep(sleep_time)
        followers.extend(generate_followers(start_pos, get_items, total_followers, dialog_box_element, driver))
        return followers
    else:
        raise IndexError
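To run the whole flow, a minimal entry point could look like this (an assumption about how you wire it up, not part of the original snippet):

# Hypothetical entry point: runs the full login + follower-scraping flow above.
if __name__ == "__main__":
    test()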
I am trying to locate the elements that carry the contact numbers on each site. I was able to create the routine to get the numbers, extract the contact number with the available formats and regex, and write the following code snippet to get the element:
contact_elem = browser.find_elements_by_xpath("//*[contains(text(), '" + phone_num + "')]")
Considering the example of https://www.cssfirm.com/, the contact number appears in 2 locations: the top header and the bottom footer.
The element texts accompanying the contact number are as follows:
<h3>CALL US TODAY AT (855) 910-7824</h3> - Footer
<span>Call Us<br>Today</span> (855) 910-7824 - Header
The extracted phone number matches perfectly while printing it out. For some reason, the element from the header part is not being detected.
I tried searching for the elements and even deleting the footer element from the browser before executing the rest of the code.
What could be the reason for it to go undetected?
P.S.: Below is my amateurish, uncorrected code. Efficiency edits/suggestions are welcome. The same code has been tested with various sites and works fine.
url = 'http://www.cssfirm.com/'
browser.get(url)
parsed = browser.find_element_by_tag_name('html').get_attribute('innerHTML')
s = BeautifulSoup(parsed, 'html.parser')
s = s.decode('utf-8')

phoneNumberRegex = '(\s*(?:\+?(\d{1,4}))?[-. (]*(\d{1,})[-. )]*(\d{3}|[A-Z0-9]+)[-. \/]*(\d{4}|[A-Z0-9]+)[-. \/]?(\d{4}|[A-Z0-9]+)?(?: *x(\d+))?\s*)'

custom_re = ['([0-9]{4,4} )([0-9]{3,3} )([0-9]{4,4})',
             '([0-9]{3,3} )([0-9]{4,4} )([0-9]{4,4})',
             '(\+[0-9]{2,2}-)([0-9]{4,4}-)([0-9]{4,4}-)(0)',
             '(\([0-9]{3,3}\) )([0-9]{3,3}-)([0-9]{4,4})',
             '(\+[0-9]{2,2} )(\(0\)[0-9]{4,4} )([0-9]{4,6})',
             '([0-9]{5,5} )([0-9]{6,6})',
             '(\+[0-9]{2,2}\(0\))([0-9]{4,4} )([0-9]{4,4})',
             '(\+[0-9]{2,2} )([0-9]{3,3} )([0-9]{4,4} )([0-9]{3,3})',
             '([0-9]{3,3}-)([0-9]{3,3}-)([0-9]{4,4})']

phones = []
phones = re.findall(phoneNumberRegex, s)

phone_num_list = ()
phone_num = ''
matched = 0

for phoneHeader in phones:
    #phoneHeader = phoneHeader.decode('utf-8')
    for ph_cnd in phoneHeader:
        for pttrn in custom_re:
            phones = re.findall(pttrn, ph_cnd)
            if phones:
                phone_num_list = phones
                for x in phone_num_list:
                    phone_num = ''.join(x)
                    try:
                        contact_elem = browser.find_element_by_xpath("//*[contains(text(), '" + phone_num + "')]")
                        phone_num_txt = contact_elem.text
                        if phone_num_txt:
                            matched = 1
                            break
                    except NoSuchElementException:
                        pass
            if matched == 1:
                break
        if matched == 1:
            break
    if matched == 1:
        break

print("Phone number :", phone_num)  # <-- Perfect output
contact_elem                        # <-- empty for header or just the footer element
EDIT
Code updated; I forgot an important piece. Moreover, there are sleep times in between to give the page time to load. Considering them trivial, I haven't included them, for a quick read.
I found a temporary solution by searching for the partial link text, as the number also appears in a link:
contact_elem2 = browser.find_element_by_partial_link_text(phone_num)
However, this does not answer the generic question as to why that text was ignored within the element.
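One likely explanation (an assumption from XPath 1.0 semantics, not verified against this exact page): text() selects only an element's direct text-node children, and contains() applied to a node-set looks at just the first of them. In the header the number follows the <span>Call Us<br>Today</span> element, so the first text node of the enclosing element may be whitespace and the match fails. Matching against the element's full string value with '.' avoids this:

# Sketch: '.' is the element's entire string value (all descendant text),
# while text() is only the first direct text node when passed to contains().
# Note this also matches ancestor elements, so take the deepest hit if needed.
contact_elems = browser.find_elements_by_xpath("//*[contains(., '" + phone_num + "')]")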
I am trying to scrape all the different variations of this webpage. For instance, the code that should scrape this webpage, http://www.virginiaequestrian.com/main.cfm?action=greenpages&sub=view&ID=11849,
should be the same as the code I use to scrape this webpage:
http://www.virginiaequestrian.com/main.cfm?action=greenpages&sub=view&ID=11849
import re

import requests
from bs4 import BeautifulSoup, NavigableString, Tag


def extract_contact(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'lxml')
    tbl = soup.findAll('table')[2]
    list = []
    Contact = tbl.findAll('p')[0]
    for br in Contact.findAll('br'):
        next = br.nextSibling
        if not (next and isinstance(next, NavigableString)):
            continue
        next2 = next.nextSibling
        if next2 and isinstance(next2, Tag) and next2.name == 'br':
            text = re.sub(r'[\n\r\t\xa0]', '', next).replace('Phone:', '').strip()
            list.append(text)
    print list
    #Street=list.pop(0)
    #CityStateZip=list.pop(0)
    #Phone=list.pop(0)
    #City,StateZip= CityStateZip.split(',')
    #State,Zip= StateZip.split(' ')
    #ContactName = Contact.findAll('b')[1]
    #ContactEmail = Contact.findAll('a')[1]
    #Body=tbl.findAll('p')[1]
    #Website = Contact.findAll('a')[2]
    #Email = ContactEmail.text.strip()
    #ContactName = ContactName.text.strip()
    #Website = Website.text.strip()
    #Body = Body.text
    #Body = re.sub(r'[\n\r\t\xa0]','',Body).strip()
    #list.extend([Street,City,State,Zip,ContactName,Phone,Email,Website,Body])
    return list
The way I believe I will need to write the code in order for it to work is to set it up so that print list returns the same number of values, ordered identically. Currently, the above script returns these values:
[u'2133 Craigs Store Road', u'Afton,VA 22920', u'434-882-3150']
[u'Alexandria,VA 22305']
Accounting for missing values, in order to be able to parse this page consistently, I need the print list command to return something similar to this:
[u'2133 Craigs Store Road', u'Afton,VA 22920', u'434-882-3150']
['',u'Alexandria,VA 22305','']
This way I will be able to manipulate values by position (as they will be in a consistent order). The problem is that I don't know how to accomplish this, as I am still very new to parsing. If anybody has any insight as to how to solve the problem I would be highly appreciative.
def extract_contact(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'lxml')
    tbl = soup.findAll('table')[2]
    list = []
    Contact = tbl.findAll('p')[0]
    for br in Contact.findAll('br'):
        next = br.nextSibling
        if not (next and isinstance(next, NavigableString)):
            continue
        next2 = next.nextSibling
        if next2 and isinstance(next2, Tag) and next2.name == 'br':
            text = re.sub(r'[\n\r\t\xa0]', '', next).replace('Phone:', '').strip()
            list.append(text)
    Street = [s for s in list if ',' not in s and '-' not in s]
    CityStateZip = [s for s in list if ',' in s]
    Phone = [s for s in list if '-' in s]
    if not Street:
        Street = ''
    else:
        Street = Street[0]
    if not CityStateZip:
        CityStateZip = ''
    else:
        City, StateZip = CityStateZip[0].split(',')
        State, Zip = StateZip.split(' ')
    if not Phone:
        Phone = ''
    else:
        Phone = Phone[0]
    list = []
I figured out an alternative solution using substrings and if statements. Since there are only 3 values max in the list, all with defining characteristics, I realized that I could delegate by looking for special characters rather than by the position of the record.
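To also get the padded, position-consistent list from the original question, here is a sketch building on the same characteristic checks (the '' defaults are an assumption about the desired output format):

# Sketch: always yield [Street, CityStateZip, Phone], padding missing fields with ''.
street = next((s for s in list if ',' not in s and '-' not in s), '')
city_state_zip = next((s for s in list if ',' in s), '')
phone = next((s for s in list if '-' in s), '')
list = [street, city_state_zip, phone]
print list  # e.g. ['', u'Alexandria,VA 22305', '']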