How to append dictionaries to a list without overwriting data - Python

I'm having a bit of trouble with a function I'm trying to write. What it is supposed to do is 1) go to a particular URL and get a list of financial sectors stored in a particular div; 2) visit each sector's respective page and get 3 particular pieces of information from there; 3) put the gathered collection into a dictionary; and 4) append that dictionary to another dictionary.
The desired output is a dictionary containing a list of dictionaries for all the sectors.
Here is my function:
def fsr():
    fidelity_sector_report = dict()
    url = "https://eresearch.fidelity.com/eresearch/goto/markets_sectors/landing.jhtml"
    import requests
    from bs4 import BeautifulSoup
    # scrape the url page and locate links for each sector
    try:
        response = requests.get(url)
        if not response.status_code == 200:
            return 'Main page error'
        page = BeautifulSoup(response.content, "lxml")
        sectors = page.find_all('a', class_="heading1")
        for sector in sectors:
            link = 'https://eresearch.fidelity.com/' + sector['href']
            name = sector.text
            sect = dict()
            lst = []
            # scrape target pages for required information
            try:
                details = requests.get(link)
                if not details.status_code == 200:
                    return 'Details page error'
                details_soup = BeautifulSoup(details.content, 'lxml')
                fundamentals = details_soup.find('div', class_='sec-fundamentals')
                values = dict()
                # locate required values by addressing <tr> text and put them in a dictionary
                values['Enterprise Value'] = fundamentals.select_one('th:contains("Enterprise Value") + td').text.strip()
                values['Return on Equity (TTM)'] = fundamentals.select_one('th:contains("Return on Equity (TTM)") + td').text.strip()
                values['Dividend Yield'] = fundamentals.select_one('th:contains("Dividend Yield") + td').text.strip()
                # add values to the sector dictionary
                sect[name] = values
                # add the dictionary to the list
                lst.append(dict(sect))
                # form a dictionary using the list
                fidelity_sector_report['results'] = lst
            except:
                return 'Something is wrong with details request'
        return fidelity_sector_report
    except:
        return "Something is horribly wrong"
As far as I can tell, it performs the main tasks wonderfully, and the problem appears at the stage of appending the formed dictionary to the list - instead of adding a new piece, the list gets overwritten completely. I figured that out by putting print(lst) right after the fidelity_sector_report['results'] = lst line.
What should I change so that the list (and, correspondingly, the dictionary) gets formed as planned?

You should move the lst = [] outside of the sectors loop.
The problem appears because on each iteration you reset lst, so you only ever append the current sector's data to a freshly emptied list.
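A minimal, self-contained illustration of the pattern (toy data, illustrative names only):

items = ['a', 'b', 'c']

lst = []
for item in items:
    lst = []           # bug: wipes everything collected so far
    lst.append(item)
print(lst)             # ['c']

lst = []               # fix: create the list once, before the loop
for item in items:
    lst.append(item)
print(lst)             # ['a', 'b', 'c']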

The following line replaces the value of fidelity_sector_report['results'] with lst on every iteration:
fidelity_sector_report['results'] = lst
I presume you would want to access the respective values using a key; you can add the following line below fidelity_sector_report = dict() to initialize a nested dictionary:
fidelity_sector_report['results'] = {}
Then, create a key for each sector using the sector name and set the value with your values dictionary by replacing fidelity_sector_report['results'] = lst with:
fidelity_sector_report['results'][name] = dict(values)
You can access the data by using the relevant keys, e.g. fidelity_sector_report['results']['Financials']['Dividend Yield'] for the dividend yield of the Financials sector.
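With those two changes in place, and assuming every sector page parsed cleanly (i.e. fsr() returned the report rather than an error string), lookups become plain nested-dict access:

report = fsr()  # with the two changes above applied
# 'Financials' is just an example key; any sector name present in
# report['results'] works the same way
print(report['results']['Financials']['Dividend Yield'])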

Related

I need to fill a List with API call's results with Multithreading

So I've been trying with many different methods but I can't get around it. Basically this happens:
API function call returns a Dict inside of a list.
I have a list of arguments that need to be passed to the function above one by one.
I don't care about order.
The last step is to append those results to a pandas DataFrame, which will remove duplicates, sort, etc.
Examples (btw, the API is Python-Binance):
symbols = ['ADAUSDT', 'ETHUSDT', 'BTCUSDT']
orders = pd.DataFrame()
for s in symbols:
    orders = orders.append(client.get_all_orders(symbol=s))  # This returns the Dict
I tried using Queue() and Thread(), both with Lock(). I tried ThreadPoolExecutor() as well but I cannot make it work. The furthest I got was with the last method, but the number of rows was different after each execution:
orders = pd.DataFrame()
temp = []
with ThreadPoolExecutor() as executor:
    executor.map(get_orders, symbols)
for x in temp:
    orders = orders.append([x])
Any ideas?
Thanks
As a side note, calling orders.append([x]) on its own does nothing: DataFrame.append returns a new DataFrame rather than modifying orders in place.
This can help you:
from binance import Client

client = Client(api_key, api_secret)
symbols = ['ADAUSDT', 'ETHUSDT', 'BTCUSDT']
orders_symbol = []
orders = {}
for s in symbols:
    orders.setdefault(s, {})
    orders_symbol = client.get_all_orders(symbol=s)
    for i in orders_symbol:
        orders[s][i['orderId']] = i
        print(s, i['orderId'], orders[s][i['orderId']])
    print()
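If you still want a threaded version, here is a minimal sketch that avoids the shared temp list entirely: executor.map collects each call's return value for you, so no Lock is needed. It assumes client, api_key and api_secret are set up as above.

from concurrent.futures import ThreadPoolExecutor

import pandas as pd
from binance import Client

client = Client(api_key, api_secret)
symbols = ['ADAUSDT', 'ETHUSDT', 'BTCUSDT']

def get_orders(symbol):
    # each call returns a list of order dicts for one symbol
    return client.get_all_orders(symbol=symbol)

with ThreadPoolExecutor() as executor:
    # map gathers return values in input order, so results is safe
    # to use without any locking
    results = list(executor.map(get_orders, symbols))

# flatten the per-symbol lists and build the DataFrame once at the end
orders = pd.DataFrame([order for batch in results for order in batch])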

IndexError: list index out of range in loop

I am using Python 3 / Tweepy to create a list that contains the user names associated with various Twitter handles.
My code creates an empty list, loops through the handles to get each user name, saves this info in a dictionary, and then appends the dictionary to the list.
I am getting IndexError: list index out of range when I run the code. When I remove the 4th line of the for loop I do not get errors. Any thoughts on how I can resolve the issue? Why is this line of code causing errors? Thanks!
Here is my code:
def analyzer():
    handles = ['#Nasdaq', '#Apple', '#Microsoft', '#amazon', '#Google', '#facebook', '#GileadSciences', '#intel']
    data = []
    # Grab twitter handles and append the name to data
    for handle in handles:
        data_dict = {}
        tweets = api.user_timeline(handle)
        data_dict['Handle'] = handle
        data_dict['Name'] = tweets[0]['user']['name']
        data.append(data_dict)
I guess the main issue is in the code below:
tweets = api.user_timeline(handle)
api.user_timeline() may return an empty list, and you are trying to access the first element of that empty list:
tweets[0]
That's why you are getting the 'index out of range' issue.
You can modify your code to something like this:
for handle in handles:
    data_dict = {}
    tweets = api.user_timeline(handle)
    data_dict['Handle'] = handle
    if tweets:
        data_dict['Name'] = tweets[0]['user']['name']
    data.append(data_dict)
The error is occurring because of the empty list, which you are trying to access at index 0. You can guard against this by checking whether the list is empty:
def analyzer():
    handles = ['#Nasdaq', '#Apple', '#Microsoft', '#amazon', '#Google', '#facebook', '#GileadSciences', '#intel']
    data = []
    # Grab twitter handles and append the name to data
    for handle in handles:
        data_dict = {}
        tweets = []
        tweets = api.user_timeline(handle)
        if tweets:
            data_dict['Handle'] = handle
            data_dict['Name'] = tweets[0]['user']['name']
            data.append(data_dict)
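An equivalent defensive variant, if you prefer try/except over the truthiness check; it keeps the same dict-style access as the question (which assumes a JSON-parsing tweepy setup):

for handle in handles:
    tweets = api.user_timeline(handle)
    try:
        name = tweets[0]['user']['name']
    except IndexError:
        continue  # no tweets for this handle, skip it
    data.append({'Handle': handle, 'Name': name})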

Python when checking if scraped element exists in list

I keep getting an error when I am using an if-else statement in Python. I want my script to check whether an element exists in a list: if it does, run one branch of code, and if not, run another. I get the error ValueError: 'Named Administrator' is not in list.
import requests
from bs4 import BeautifulSoup
url_3 = 'https://www.brightscope.com/form-5500/basic-info/107299/Orthopedic-Institute-Of-Pennsylvania/15801790/Orthopedic-Institute-Of-Pennsylvania-401k-Profit-Sharing-Plan/'
page = requests.get(url_3)
soup = BeautifulSoup(page.text, 'html.parser')
divs = [e.get_text() for e in soup.findAll('span')]
if divs.index('Named Administrator'):
    index = divs.index('Named Administrator')
    contact = divs[index + 1]
else:
    contact = '-'
Rather than calling index, do a membership (__contains__) test:
if 'Named Administrator' in divs:
and move forward only if 'Named Administrator' actually exists in the divs list, so you won't get the ValueError.
Another consideration is that a membership test on a list has O(N) time complexity, so if you are doing this on a large list, you should probably use a set instead:
{e.get_text() for e in soup.findAll('span')}
but as sets are unordered you won't be able to use indexing.
So either think of an approach that works on sets as well, i.e. one with no need to get the next value by indexing.
Or use a set for the membership test and the list for getting the next value. The cost here might be higher or lower depending on your actual context, and you can only find that out by profiling:
divs_list = [e.get_text() for e in soup.findAll('span')]
divs_set = set(divs_list)
if 'Named Administrator' in divs_set:
    index = divs_list.index('Named Administrator')
    contact = divs_list[index + 1]
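Putting it together with the fallback from the question, a sketch of the corrected snippet for the simple (plain-list) case might look like:

divs = [e.get_text() for e in soup.findAll('span')]
if 'Named Administrator' in divs:
    # safe: the membership test guarantees index() will not raise
    index = divs.index('Named Administrator')
    contact = divs[index + 1]
else:
    contact = '-'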

Scraping multiple pages into list with beautifulsoup

I wrote a scraper program using beautifulsoup4 in Python that iterates through multiple pages of cryptocurrency values and returns the opening, highest, and closing values. The scraping part works fine, but I can't get it to save all of the currencies into my lists; only the last one gets added.
Can anyone help me out on how to save all of them? I've done hours of searching and can't seem to find a relevant answer. The code is as follows:
no_space = name_15.str.replace('\s+', '-')

# lists out the pages to scrape
for n in no_space:
    page = 'https://coinmarketcap.com/currencies/' + n + '/historical-data/'
    http = lib.PoolManager()
    response = http.request('GET', page)
    soup = BeautifulSoup(response.data, "lxml")
    main_table = soup.find('tbody')

    date = []
    open_p = []
    high_p = []
    low_p = []
    close_p = []
    table = []
    for row in main_table.find_all('td'):
        table_pull = row.find_all_previous('td')  # other find methods aren't returning what I need, but this works just fine
        table = [p.text.strip() for p in table_pull]
    date = table[208:1:-7]
    open_p = table[207:1:-7]
    high_p = table[206:1:-7]
    low_p = table[205:1:-7]
    close_p = table[204:0:-7]

df = pd.DataFrame(date, columns=['Date'])
df['Open'] = list(map(float, open_p))
df['High'] = list(map(float, high_p))
df['Low'] = list(map(float, low_p))
df['Close'] = list(map(float, close_p))
print(df)
Simply put, it looks like you are accessing all the 'td' elements and then attempting to access the previous elements of that list, which is unnecessary. Also, as #hoefling pointed out, you are continuously overwriting your variables inside the loop, which is why you only end up with the last element (in other words, only the last iteration of the loop sets the values; all previous ones are overwritten). Apologies, I cannot test this out currently due to firewalls on my machine. Try the following:
no_space = name_15.str.replace('\s+', '-')

# lists out the pages to scrape
for n in no_space:
    page = 'https://coinmarketcap.com/currencies/' + n + '/historical-data/'
    http = lib.PoolManager()
    response = http.request('GET', page)
    soup = BeautifulSoup(response.data, "lxml")
    main_table = soup.find('tbody')
    table = [p.text.strip() for p in main_table.find_all('td')]

    # You will need to re-think these indices here to get the info you want
    date = table[208:1:-7]
    open_p = table[207:1:-7]
    high_p = table[206:1:-7]
    low_p = table[205:1:-7]
    close_p = table[204:0:-7]

    df = pd.DataFrame(date, columns=['Date'])
    df['Open'] = list(map(float, open_p))
    df['High'] = list(map(float, high_p))
    df['Low'] = list(map(float, low_p))
    df['Close'] = list(map(float, close_p))
    print(df)
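To keep every currency rather than just the last, one option is to collect one DataFrame per coin and concatenate them after the loop. A sketch, assuming lib is urllib3 (as in the question) and that the index arithmetic above extracts the intended columns:

http = lib.PoolManager()  # created once and reused for every page

frames = []
for n in no_space:
    page = 'https://coinmarketcap.com/currencies/' + n + '/historical-data/'
    response = http.request('GET', page)
    soup = BeautifulSoup(response.data, "lxml")
    table = [p.text.strip() for p in soup.find('tbody').find_all('td')]
    df = pd.DataFrame(table[208:1:-7], columns=['Date'])
    df['Open'] = list(map(float, table[207:1:-7]))
    df['High'] = list(map(float, table[206:1:-7]))
    df['Low'] = list(map(float, table[205:1:-7]))
    df['Close'] = list(map(float, table[204:0:-7]))
    df['Coin'] = n  # remember which currency each row came from
    frames.append(df)

# one combined table, nothing overwritten
all_coins = pd.concat(frames, ignore_index=True)
print(all_coins)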

Stuck with nested for loop issue

A website changes content dynamically, through the use of two date filters (year / week), without the need of a get request (it is handled asynchronously on the client side). Each filter option produces a different page_source with td elements I would like to extract.
Currently, I am using a nested for-loop to iterate through the filter options (and thus through the different page sources containing different td elements), iterate through the contents of each page source, and then append the desired td elements to an empty list.
store = []

def getData():
    year = ['2015', '2014']
    for y in year:
        values = y
        yearid = Select(browser.find_element_by_id('yearid'))
        yearid.select_by_value(values)
        weeks = ['1', '2']
        for w in weeks:
            value = w
            frange = Select(browser.find_element_by_id('frange'))
            frange.select_by_value('WEEKS')
            selectElement = Select(browser.find_element_by_id('fweek'))
            selectElement.select_by_value(value)
            pressFilter = browser.find_element_by_name('submit')
            pressFilter.submit()
            # scrape data from page source
            html = browser.page_source
            soup = BeautifulSoup(html, "lxml")
            for el in soup.find_all('td'):
                store.append(el.get_text())
So far so good, and I have a for loop that constructs a single list of all the td elements that I would like.
Instead, I would like to store separate lists, one for each page source (i.e. one per filter combination), in a list of lists. I could do that after the fact, i.e. in a secondary step I could extract the items from the single list according to some criteria.
However, can I do that at the point of the original appending? Something like...
store = [[], [], [], []]
...
counter = 0
for el in soup.find_all('td'):
    store[counter].append(el.get_text())
counter = counter + 1
This isn't quite right, as it only appends to the first object in the store list. If I put the counter inside the td for-loop, it will increase every time a td element is iterated, when in actual fact I only want it to increase when I have finished iterating through a particular page source (which is itself one filter combination).
I am stumped: is what I am trying even possible? If so, where should I put the counter? Or should I use some other technique?
Create a new list object per filter combination, so inside the for w in weeks: loop. Append your cell text to that list, and append the per-filter list this produces to store:
def getData():
    store = []
    year = ['2015', '2014']
    for y in year:
        # ... elided for brevity
        weeks = ['1', '2']
        for w in weeks:
            perfilter = []
            store.append(perfilter)
            # ... elided for brevity
            for el in soup.find_all('td'):
                perfilter.append(el.get_text())
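If getData is also changed to end with return store (an assumption; the version above only builds it locally), the nesting can then be consumed one filter combination at a time. Loop order is year-major, then week, so store[0] is 2015/week 1 and store[3] is 2014/week 2:

store = getData()  # assumes a `return store` was added at the end
combos = [(y, w) for y in ['2015', '2014'] for w in ['1', '2']]
for (y, w), cells in zip(combos, store):
    # one inner list of td texts per filter combination
    print(y, 'week', w, '->', len(cells), 'cells')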
