A website changes its content dynamically through two date filters (year / week), without a new GET request (the change is handled asynchronously on the client side). Each filter option produces a different page source with td elements I would like to extract.
Currently, I am using a nested for-loop to iterate through the filter combinations (and so through the different page sources containing different td elements), iterate through the contents of each page source, and append the desired td elements to an empty list.
store = []

def getData():
    year = ['2015', '2014']
    for y in year:
        values = y
        yearid = Select(browser.find_element_by_id('yearid'))
        yearid.select_by_value(values)
        weeks = ['1', '2']
        for w in weeks:
            value = w
            frange = Select(browser.find_element_by_id('frange'))
            frange.select_by_value('WEEKS')
            selectElement = Select(browser.find_element_by_id('fweek'))
            selectElement.select_by_value(value)
            pressFilter = browser.find_element_by_name('submit')
            pressFilter.submit()
            # scrape data from the page source
            html = browser.page_source
            soup = BeautifulSoup(html, "lxml")
            for el in soup.find_all('td'):
                store.append(el.get_text())
So far so good, and the for loop above constructs a single list of all the td elements that I would like.
Instead, I would like to store separate lists, one for each page source (i.e. one per filter combination), in a list of lists. I could do that after the fact, i.e. in a secondary step I could extract the items from the single list according to some criteria.
However, can I do that at the point of the original appending? Something like...
store = [[], [], [], []]
...
counter = 0
for el in soup.find_all('td'):
    store[counter].append(el.get_text())
counter = counter + 1
This isn't quite right, as it only appends to the first object in the store list. If I put the counter increment inside the td for-loop, then it will increase every time a td element is iterated, when in actual fact I only want it to increase when I have finished iterating through a particular page source (which is itself one iteration of a filter combination).
I am stumped: is what I am trying even possible? If so, where should I put the counter? Or should I use some other technique?
Create a new list object per filter combination, so inside the for w in weeks: loop. Append your cell text to that list, and append the per-filter list this produces to store:
def getData():
    store = []
    year = ['2015', '2014']
    for y in year:
        # ... elided for brevity
        weeks = ['1', '2']
        for w in weeks:
            perfilter = []
            store.append(perfilter)
            # ... elided for brevity
            for el in soup.find_all('td'):
                perfilter.append(el.get_text())
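If you would rather keep the counter-based version from the question, the counter just needs to advance once per filter combination, i.e. inside the for w in weeks: loop but after the td loop has finished. A sketch (assuming four filter combinations, as in the question, with the filter-selection code elided as above):

def getData():
    store = [[], [], [], []]  # one inner list per filter combination
    counter = 0
    for y in ['2015', '2014']:
        # ... select the year filter, elided as above
        for w in ['1', '2']:
            # ... select the week filter and submit, elided as above
            soup = BeautifulSoup(browser.page_source, "lxml")
            for el in soup.find_all('td'):
                store[counter].append(el.get_text())
            counter += 1  # advance once per page source, not once per td
    return store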
I'm having a bit of trouble with a function I'm trying to write. What it is supposed to do is 1) go to a particular URL and get a list of financial sectors stored in a particular div; 2) visit each sector's respective page and get 3 particular pieces of information from there; 3) put the gathered collection into a dictionary; and 4) append that dictionary to another dictionary.
The desired output is a dictionary containing a list of dictionaries for all the sectors.
Here is my function:
def fsr():
    fidelity_sector_report = dict()
    url = "https://eresearch.fidelity.com/eresearch/goto/markets_sectors/landing.jhtml"
    import requests
    from bs4 import BeautifulSoup
    # scrape the url page and locate links for each sector
    try:
        response = requests.get(url)
        if not response.status_code == 200:
            return 'Main page error'
        page = BeautifulSoup(response.content, "lxml")
        sectors = page.find_all('a', class_="heading1")
        for sector in sectors:
            link = 'https://eresearch.fidelity.com/' + sector['href']
            name = sector.text
            sect = dict()
            lst = []
            # scrape target pages for required information
            try:
                details = requests.get(link)
                if not details.status_code == 200:
                    return 'Details page error'
                details_soup = BeautifulSoup(details.content, 'lxml')
                fundamentals = details_soup.find('div', class_='sec-fundamentals')
                values = dict()
                # locate required values by addressing <th> text and put them in a dictionary
                values['Enterprise Value'] = fundamentals.select_one('th:contains("Enterprise Value") + td').text.strip()
                values['Return on Equity (TTM)'] = fundamentals.select_one('th:contains("Return on Equity (TTM)") + td').text.strip()
                values['Dividend Yield'] = fundamentals.select_one('th:contains("Dividend Yield") + td').text.strip()
                # add values to the sector dictionary
                sect[name] = values
                # add the dictionary to the list
                lst.append(dict(sect))
                # form a dictionary using the list
                fidelity_sector_report['results'] = lst
            except:
                return 'Something is wrong with details request'
        return fidelity_sector_report
    except:
        return "Something is horribly wrong"
As far as I can tell, it performs the main task wonderfully, and the problem appears at the stage of appending the formed dictionary to the list - instead of adding a new piece, the list gets overwritten completely. I figured that out by putting print(lst) right after the fidelity_sector_report['results'] = lst line.
What should I change so that list (and, correspondingly, dictionary) gets formed as planned?
You should move the lst = [] outside of the sectors loop.
The problem appears because for each sector you reset lst and append the current sector's data to an empty list.
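In other words, the function should be shaped roughly like this (a sketch with the request and parsing details elided, reusing the names from the question):

def fsr():
    fidelity_sector_report = dict()
    lst = []  # created once, before the sectors loop
    # ... request the main page and collect `sectors`, as above
    for sector in sectors:
        name = sector.text
        sect = dict()
        # ... scrape the details page into `values`, as above
        sect[name] = values
        lst.append(dict(sect))  # each sector is now appended to the same list
    fidelity_sector_report['results'] = lst
    return fidelity_sector_report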
The following code causes the value of fidelity_sector_report['results'] to be replaced with lst.
fidelity_sector_report['results'] = lst
I presume you would want to access the respective values using a key; you can add the following line below fidelity_sector_report = dict() to initialize a dictionary:
fidelity_sector_report['results'] = {}
Then, create a key for each sector using the sector name and set the value with your values dictionary by replacing fidelity_sector_report['results'] = lst with:
fidelity_sector_report['results'][name] = dict(values)
You can access the data by using the relevant keys, e.g. fidelity_sector_report['results']['Financials']['Dividend Yield'] for the dividend yield of the Financials sector.
I am writing a simple secret santa script that selects a "GiftReceiver" and a "GiftGiver" from a list. Two lists and an empty dataframe to be populated are produced:
import pandas as pd
import random

santaslist_receivers = ['Rudolf',
                        'Blitzen',
                        'Prancer',
                        'Dasher',
                        'Vixen',
                        'Comet'
                        ]

santaslist_givers = santaslist_receivers

finalDataFrame = pd.DataFrame(columns=['GiftGiver', 'GiftReceiver'])
I then have a while loop that selects random elements from each list to pick a gift giver and receiver, then removes them from the respective lists:
while len(santaslist_receivers) > 0:
    print(len(santaslist_receivers))  # Used for testing.
    gift_receiver = random.choice(santaslist_receivers)
    santaslist_receivers.remove(gift_receiver)
    print(len(santaslist_receivers))  # Used for testing.
    gift_giver = random.choice(santaslist_givers)
    while gift_giver == gift_receiver:  # While loop ensures that gift_giver != gift_receiver
        gift_giver = random.choice(santaslist_givers)
    santaslist_givers.remove(gift_giver)
    dummyDF = pd.DataFrame({'GiftGiver': gift_giver, 'GiftReceiver': gift_receiver}, index=[0])
    finalDataFrame = finalDataFrame.append(dummyDF)
The final dataframe only contains three elements instead of six:
print(finalDataFrame)
returns
  GiftGiver GiftReceiver
0    Dasher      Prancer
0     Comet        Vixen
0    Rudolf      Blitzen
I have inserted two print lines within the while loop to investigate. These print the length of the list santaslist_receivers before and after the removal of an element. I expect to see the original list length on the first print, then one less on the second print, then that same reduced length again on the first print of the next iteration of the while loop, and so on. Specifically I expect:
6,5,5,4,4,3,3... and so on.
What is returned is
6,5,4,3,2,1
Which is consistent with the DataFrame having only 3 rows, but I do not see the cause of this.
What is the error in my code or my approach?
You can solve it by simply changing this line
santaslist_givers = santaslist_receivers
to
santaslist_givers = list(santaslist_receivers)
Python variables are essentially references, so they refer to the same list, i.e. santaslist_givers and santaslist_receivers were accessing the same location in memory in your implementation. To make them different, use the list() function.
For some extra information, you can refer to copy.deepcopy.
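A minimal demonstration of the aliasing, independent of the secret santa code:

a = ['Rudolf', 'Blitzen']
b = a             # b is the very same list object as a
b.remove('Rudolf')
print(a)          # ['Blitzen'] -- a shrank too
c = list(a)       # c is an independent copy
c.remove('Blitzen')
print(a)          # ['Blitzen'] -- a is untouched this time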
You should make an explicit copy of your list here
santaslist_givers = santaslist_receivers
There are multiple options for doing this, as explained in this question.
In this case I would recommend (if you have Python >= 3.3):
santaslist_givers = santaslist_receivers.copy()
If you are on an older version of Python, the typical way to do it is:
santaslist_givers = santaslist_receivers[:]
I wrote out a scraper program using beautifulsoup4 in Python that iterates through multiple pages of cryptocurrency values and returns the opening, highest, and closing values. The scraping part works fine, but I can't get it to save all of the currencies into my lists; only the last one gets added.
Can anyone help me out on how to save all of them? I've done hours of searching and can't seem to find a relevant answer. The code is as follows:
no_space = name_15.str.replace('\s+', '-')

#lists out the pages to scrape
for n in no_space:
    page = 'https://coinmarketcap.com/currencies/' + n + '/historical-data/'
    http = lib.PoolManager()
    response = http.request('GET', page)
    soup = BeautifulSoup(response.data, "lxml")
    main_table = soup.find('tbody')

    date = []
    open_p = []
    high_p = []
    low_p = []
    close_p = []
    table = []

    for row in main_table.find_all('td'):
        table_pull = row.find_all_previous('td')  # other find methods aren't returning what I need, but this works just fine
        table = [p.text.strip() for p in table_pull]

    date = table[208:1:-7]
    open_p = table[207:1:-7]
    high_p = table[206:1:-7]
    low_p = table[205:1:-7]
    close_p = table[204:0:-7]

    df = pd.DataFrame(date, columns=['Date'])
    df['Open'] = list(map(float, open_p))
    df['High'] = list(map(float, high_p))
    df['Low'] = list(map(float, low_p))
    df['Close'] = list(map(float, close_p))
    print(df)
Simply put, it looks like you are accessing all the 'td' elements and then attempting to access the previous elements of that list, which is unnecessary. Also, as @hoefling pointed out, you are continuously overwriting your variables inside your loop, which is the reason why you only get the last element: only the last iteration of the loop sets the values of those variables; all previous ones are overwritten. Apologies, I cannot test this out currently due to firewalls on my machine. Try the following:
no_space = name_15.str.replace('\s+', '-')

#lists out the pages to scrape
for n in no_space:
    page = 'https://coinmarketcap.com/currencies/' + n + '/historical-data/'
    http = lib.PoolManager()
    response = http.request('GET', page)
    soup = BeautifulSoup(response.data, "lxml")
    main_table = soup.find('tbody')

    table = [p.text.strip() for p in main_table.find_all('td')]

    # You will need to re-think these indices here to get the info you want
    date = table[208:1:-7]
    open_p = table[207:1:-7]
    high_p = table[206:1:-7]
    low_p = table[205:1:-7]
    close_p = table[204:0:-7]

    df = pd.DataFrame(date, columns=['Date'])
    df['Open'] = list(map(float, open_p))
    df['High'] = list(map(float, high_p))
    df['Low'] = list(map(float, low_p))
    df['Close'] = list(map(float, close_p))
    print(df)
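If the aim is to keep every currency's data rather than just printing each frame, one option (a sketch reusing the names above, with the per-coin scraping elided) is to collect the per-coin frames in a list and concatenate them once at the end:

frames = []  # one DataFrame per currency

for n in no_space:
    # ... build df for this currency exactly as above
    df['Currency'] = n  # tag the rows so the coins stay distinguishable
    frames.append(df)   # keep the frame instead of overwriting it

all_coins = pd.concat(frames, ignore_index=True)
print(all_coins)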
I have this code
lst = ["Appearence", "Logotype", "Catalog", "Product Groups", "Option Groups", "Manufacturers", "Suppliers",
       "Delivery Statuses", "Sold Out Statuses", "Quantity Units", "CSV Import/Export", "Countries", "Currencies", "Customers"]

for item in lst:
    wd.find_element_by_link_text(item).click()
    assert wd.title != None
I do not want to write the list out by hand.
I want to receive the list lst directly from the browser.
I use
m = wd.find_elements_by_css_selector('li[id=app-]')
print(m[0].text)
Appearence
I don't know how to pass the list to a loop.
Please help me understand how to build the list and pass it to a loop.
In your example the variable m will be a list of WebElements; you can get its length and iterate the CSS pseudo-selector :nth-child() with a range:
m = wd.find_elements_by_css_selector('li#app-')
for elem in range(1, len(m) + 1):
    wd.find_element_by_css_selector('li#app-:nth-child({})'.format(elem)).click()
    assert wd.title is not None
The for loop iterates over a range of integers starting at 1 and ending with the length of the element list (+1 because the end of a range is not inclusive), then we click the nth child of the selector using the iteration number; .format(elem) replaces the {} in the string with the elem variable, in this case the current integer.
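Alternatively, if you specifically want the list of captions (like the hand-written lst), you can collect the text first and then click each entry by its link text; a sketch assuming the same li[id=app-] markup, where re-finding the element on every iteration also avoids stale references when the page re-renders:

# grab the menu captions once, then click each entry by its link text
names = [el.text for el in wd.find_elements_by_css_selector('li[id=app-]')]
for name in names:
    wd.find_element_by_link_text(name).click()
    assert wd.title is not None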
Hi, I have the following code:
for each in driver.find_elements_by_xpath(".//*[@id='ip_market" + market + "']/table/tbody"):
    cell = each.find_elements_by_tag_name('td')[0]
    cell2 = each.find_elements_by_tag_name('td')[1]
    cell3 = each.find_elements_by_tag_name('td')[2]
My problem is that I don't know exactly how many tds there are inside each one (sometimes 3, sometimes 15, etc.).
Is there a possibility to check the number of tds inside the for each in order to make find_elements_by_tag_name('td') dynamic?
Just don't get them by index, use:
cells = each.find_elements_by_tag_name('td')
then cells would be a list of elements which you can call len() on:
print(len(cells))
Or slice:
cells[:3] # gives first 3 elements
cell1, cell2, cell3 = cells[:3] # unpacking into separate variables
Or, you can get the text of every cell:
[cell.text for cell in cells]
find_elements_by_tag_name returns a list. Use len() on that list to get the number of elements by that tag name. Also, rather than saying for each I'd gently suggest using more descriptive assignments.
table = driver.find_element_by_xpath(".//*[@id='ip_market" + market + "']/table/tbody")
table_rows = table.find_elements_by_tag_name("tr")
for row in table_rows:
    row_cells = row.find_elements_by_tag_name("td")
    num_of_cells = len(row_cells)
From reading your comments, it looks like this is what you are trying to do...
for each in driver.find_elements_by_xpath(".//*[@id='ip_market" + market + "']/table/tbody"):
    for i in range(0, len(each.find_elements_by_tag_name('td'))):
        # you can now use i as an index into the loop through the TDs
        print("do something interesting with i: " + str(i))
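Putting the answers together, a sketch (assuming the same ip_market table as above, iterating its rows as in the second answer) that copes with a variable number of tds per row:

rows = driver.find_elements_by_xpath(".//*[@id='ip_market" + market + "']/table/tbody/tr")
for row in rows:
    cells = row.find_elements_by_tag_name('td')
    if len(cells) >= 3:
        cell, cell2, cell3 = cells[:3]  # first three cells, when the row has them
    print(len(cells), [c.text for c in cells])  # or handle however many there are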