I would like to expand on a previously asked question:
Nested For Loop with Unequal Entities
In that question, I requested a method to extract the location's type (Hospital, Urgent Care, etc) in addition to the location's name (WELLSTAR ATLANTA MEDICAL CENTER, WELLSTAR ATLANTA MEDICAL CENTER SOUTH, etc).
The answer suggested utilizing a for loop and dictionary to collect the values and keys. The code snippet appears below:
from pprint import pprint
import requests
from bs4 import BeautifulSoup
url = "https://www.wellstar.org/locations/pages/default.aspx"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
d = {}
for row in soup.select(".WS_Content > .WS_LeftContent > table > tr"):
    title = row.h3.get_text(strip=True)
    d[title] = [item.get_text(strip=True) for item in row.select(".PurpleBackgroundHeading a")]
pprint(d)
I would like to extend the existing solution to include the entity's address matched with the appropriate key-value combination. If the best solution is to utilize something other than a dictionary, I'm open to that suggestion as well.
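One way to extend the accepted loop is to store (name, address) tuples per heading instead of bare names. This is only a sketch: the .WS_AddressCell selector below is a placeholder assumption, so substitute whatever element actually holds the address on the page.
d = {}
for row in soup.select(".WS_Content > .WS_LeftContent > table > tr"):
    title = row.h3.get_text(strip=True)
    names = [a.get_text(strip=True) for a in row.select(".PurpleBackgroundHeading a")]
    # ".WS_AddressCell" is hypothetical -- replace it with the real address selector
    addresses = [cell.get_text(strip=True) for cell in row.select(".WS_AddressCell")]
    # zip pairs each location name with its address; it stops at the shorter list
    # if the two selectors return unequal numbers of elements
    d[title] = list(zip(names, addresses))
pprint(d)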
Let's say you have a dict my_dict and you want to add the value 2 under the key my_key. Simply do:
my_dict['my_key'] = 2
Let's say you have a dict d = {'Name': 'Zara', 'Age': 7} and now you want to add another entry
'Sex': 'female'
You can use the built-in update method:
d.update({'Sex': 'female' })
print "Value : %s" % d
Value : {'Age': 7, 'Name': 'Zara', 'Sex': 'female'}
Reference: https://www.tutorialspoint.com/python/dictionary_update.htm
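For what it's worth, the same idea in Python 3 syntax; update() also accepts several keys at once (the extra 'City' entry below is purely for illustration):
d = {'Name': 'Zara', 'Age': 7}
d['Sex'] = 'female'                      # add a single entry via assignment
d.update({'City': 'Atlanta', 'Age': 8})  # add/overwrite several entries at once
print("Value : %s" % d)
# Value : {'Name': 'Zara', 'Age': 8, 'Sex': 'female', 'City': 'Atlanta'}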
I'm having to make a dictionary from a file that looks like this:
example =
'Computer science', random name, 17
'Computer science', another name, 18
'math', one name, 19
I want the majors to be the keys, but I'm having trouble grouping them. This is what I've tried:
dictionary = {}
for example in example_file:
    dictionary = {example[0]: {example[1]: example[2]}}
The problem is that this does turn the lines into dictionaries, but one by one, instead of grouping the lines with the same key into one dictionary.
This is what it's returning:
{computer science: {random name: 17}}
{computer science: {another name: 18}}
{math: {one name: 19}}
This is how I want it to look:
{computer science: {random name: 17, another name: 18}, math: {one name: 19}}
How do I group these?
You need to update the dictionary elements, not assign the whole dictionary each time through the loop.
You can use defaultdict(dict) to automatically create the nested dictionaries as needed.
from collections import defaultdict
dictionary = defaultdict(dict)
for subject, name, score in example_file:
    # Missing subjects get an empty inner dict created automatically
    dictionary[subject][name] = int(score)
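If you prefer a plain dict once the grouping is done (for printing or serialising, say), you can convert it at the end:
dictionary = dict(dictionary)  # drops the default-factory behaviour
print(dictionary)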
It's a pretty well known problem with an elegant solution, making use of dict's setdefault() method.
dictionary = {}
for example in example_file:
    names = dictionary.setdefault(example[0], {})
    names[example[1]] = example[2]
print(dictionary)
This code prints:
{'Computer science': {'random name': 17, 'another name': 18}, 'math': {'one name': 19}}
An alternative (though #hhimko's solution is almost 50 times faster):
import pandas as pd
df = pd.read_csv("file.csv", header=None).sort_values(0).reset_index(drop=True)
result = dict()
major_holder = None
for index, row in df.iterrows():
    if row.iloc[0] != major_holder:
        # New major encountered: start a fresh inner dict for it
        major_holder = row.iloc[0]
        result[major_holder] = dict()
        result[major_holder][row.iloc[1]] = row.iloc[2]
    else:
        result[major_holder][row.iloc[1]] = row.iloc[2]
print(result)
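For comparison, a more idiomatic pandas take on the same grouping is a dict comprehension over groupby. This is just a sketch and assumes the CSV has no header row and its three columns are major, name, score:
import pandas as pd

df = pd.read_csv("file.csv", header=None, names=["major", "name", "score"])
result = {
    major: dict(zip(group["name"], group["score"]))  # inner dict: name -> score
    for major, group in df.groupby("major")
}
print(result)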
I am scraping all values from here
While I get the desired output, a caveat is that when I inspect the elements in the table tags, on one day I get 145 rows, on another day 143 rows, and on another day around 140 rows. Simply put, I want to make the logic robust so that it runs fine whatever the last row number in the table happens to be, e.g. 145, 150, 132...
Here's a piece of code for the same:
table_row = []
for i in range(1, 142):
    temp = browser.find_element_by_xpath('//*[@id="companies-table-deal-announced"]/tbody/tr[' + str(i) + ']').find_elements_by_tag_name('td')
    table_row.append(list(map(lambda x: x.text, temp)))
print(table_row)
df = pd.DataFrame(table_row,
columns=['SPAC', 'Target', 'Ticker', 'Announced', 'Deadline', 'TEV ($M)', 'TEV/IPO', 'Sector',
'Geography', 'Premium', 'Common', 'Warrant'])
One way I can think of is to use len() in the for loop. Is there another way to do it optimally? Please let me know, thanks!
Try this once:
driver.implicitly_wait(10)
driver.get("https://www.spacresearch.com/symbol?s=live-deal&sector=&geography=")
table = driver.find_elements_by_xpath("//table[@id='companies-table-deal-announced']//tbody//tr")
for i, tab in zip(range(1, len(table) + 1), table):
    datalist = []
    data = tab.find_elements_by_tag_name("td")
    for d in data:
        datalist.append(d.get_attribute("innerText"))
    print("{}: {}".format(i, datalist))
Output:
1: ['Ace Global', 'DDC Enterprise', 'ACBA', '8/25/2021', '4/8/2022', '300']
2: ['NextGen Acquisition II', 'Virgin Orbit', 'NGCA', '8/23/2021', '3/25/2023', '3,218']
3: ['Aldel Financial', 'Hagerty', 'ADF', '8/18/2021', '4/13/2023', '3,134']
...
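If the end goal is still the DataFrame built in the question, you can accumulate the rows instead of printing them. A sketch, reusing the column names from the question and assuming every row really does have twelve cells:
import pandas as pd

table_row = []
for tab in table:
    cells = [td.get_attribute("innerText") for td in tab.find_elements_by_tag_name("td")]
    table_row.append(cells)

df = pd.DataFrame(table_row,
                  columns=['SPAC', 'Target', 'Ticker', 'Announced', 'Deadline', 'TEV ($M)', 'TEV/IPO',
                           'Sector', 'Geography', 'Premium', 'Common', 'Warrant'])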
I guess you should use BeautifulSoup.
It can parse the page, find elements by their id, and iterate over them.
Like:
from bs4 import BeautifulSoup
soup = BeautifulSoup(page, 'html.parser')  # `page` holds the HTML source as a string
content = soup.find(id='companies-table-deal-announced')
for element in content:
    ...
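A slightly fuller sketch of that idea, assuming the rendered HTML is available in page (for example via driver.page_source, since the table may be built by JavaScript):
from bs4 import BeautifulSoup

# page = driver.page_source  # assumption: the HTML has already been fetched/rendered
soup = BeautifulSoup(page, 'html.parser')
table = soup.find('table', id='companies-table-deal-announced')

rows = []
for tr in table.find('tbody').find_all('tr'):
    rows.append([td.get_text(strip=True) for td in tr.find_all('td')])

print(len(rows))  # however many rows the table happens to have that day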
I have a list of dictionaries like the one shown below, and I would like to extract the partID and the corresponding quantity for a specific orderID using Python, but I don't know how to do it.
dataList = [{'orderID': 'D00001', 'customerID': 'C00001', 'partID': 'P00001', 'quantity': 2},
{'orderID': 'D00002', 'customerID': 'C00002', 'partID': 'P00002', 'quantity': 1},
{'orderID': 'D00003', 'customerID': 'C00003', 'partID': 'P00001', 'quantity': 1},
{'orderID': 'D00004', 'customerID': 'C00004', 'partID': 'P00003', 'quantity': 3}]
So for example, when I search my dataList for a specific orderID == 'D00003', I would like to receive both the partID ('P00001') and the corresponding quantity (1) of the specified order. How would you go about this? Any help is much appreciated.
It depends.
If you are not going to do that search a lot of times, you can just iterate over the list of dictionaries until you find the "correct" one:
search_for_order_id = 'D00001'
for d in dataList:
    if d['orderID'] == search_for_order_id:
        print(d['partID'], d['quantity'])
        break  # assuming orderID is unique
Outputs
P00001 2
Since this solution is O(n), if you are going to do this search a lot of times it will add up.
In that case it will be better to transform the data to a dictionary of dictionaries, with orderID being the outer key (again, assuming orderID is unique):
better = {d['orderID']: d for d in dataList}
This is also O(n) but you pay it only once. Any subsequent lookup is an O(1) dictionary lookup:
search_for_order_id = 'D00001'
print(better[search_for_order_id]['partID'], better[search_for_order_id]['quantity'])
Also outputs
P00001 2
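If a lookup might miss (an orderID that does not exist), dict.get avoids a KeyError; 'D99999' below is just a hypothetical missing id:
order = better.get('D99999')
if order is None:
    print('no such order')
else:
    print(order['partID'], order['quantity'])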
I believe you would like to familiarize yourself with the pandas package, which is very useful for data analysis. If these are the kinds of problems you're up against, I advise you to take the time to work through a pandas tutorial. It can do a lot, and it is very popular.
Your dataList is very similar to a DataFrame structure, so what you're looking for would be as simple as:
import pandas as pd
df = pd.DataFrame(dataList)
df[df['orderID']=='D00003']
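To narrow that down to just the two columns you asked about, select them explicitly:
df.loc[df['orderID'] == 'D00003', ['partID', 'quantity']]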
You can use this:
results = [[x['orderID'], x['partID'], x['quantity']] for x in dataList]
for i in results:
    print(i)
Also,
results = [['Order ID: ' + x['orderID'], 'Part ID: ' + x['partID'], 'Quantity: ' + str(x['quantity'])] for x in dataList]
To get the partID you can make use of the filter function.
myData = [{"x": 1, "y": 1}, {"x": 2, "y": 5}]
filtered = filter(lambda item: item["x"] == 1, myData)  # search for an object with x equal to 1
# Get the next item from the filter (the matching item) and get the y property.
print(next(filtered)["y"])
You should be able to apply this to your situation.
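Applied to the dataList from the question it might look like this (note that filter takes the iterable as its second argument):
filtered = filter(lambda item: item['orderID'] == 'D00003', dataList)
match = next(filtered, None)  # None if no order matches
if match:
    print(match['partID'], match['quantity'])  # P00001 1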
My model returns information about PC games in the following format: (game index, game value) pairs. This is my sims_sorted:
[(778, 0.99999994), (1238, 0.9999997), (1409, 0.99999905), (1212, 0.99999815)]
I retrieve the information about the game by indexing the database (df_indieGames):
sims_sorted = sorted(enumerate(sims), key=lambda item: -item[1])
results = {}
for val in sims_sorted[:4]:
    index, value = val[0], val[1]
    results[df_indieGames.game_name.loc[index]] = {
        "Genre": df_indieGames.genre.loc[index],
        "Rating": df_indieGames.score.loc[index],
        "Link": df_indieGames.game_link[index]
    }
However, such a data structure is hard to sort (by Rating). Is there a better way to store the information so that retrieval and sorting are easier? Thanks.
Here's the output of results:
{u'Diehard Dungeon': {'Genre': u'Roguelike',
'Link': u'http://www.indiedb.com/games/diehard-dungeon',
'Rating': 8.4000000000000004},
u'Fork Truck Challenge': {'Genre': u'Realistic Sim',
'Link': u'http://www.indiedb.com/games/fork-truck-challenge',
'Rating': 7.4000000000000004},
u'Miniconomy': {'Genre': u'Realistic Sim',
'Link': u'http://www.indiedb.com/games/miniconomy',
'Rating': 7.2999999999999998},
u'World of Padman': {'Genre': u'First Person Shooter',
'Link': u'http://www.indiedb.com/games/world-of-padman',
'Rating': 9.0}}
UPDATE
The solution to the problem as suggested by ziddarth is the following:
result = sorted(results.iteritems(), key=lambda x: x[1]['Rating'], reverse=True)
You can sort by rating using the code below. The lambda function is called with a tuple whose first element is the dictionary key and whose second element is the dictionary value for that key, so you can use the lambda function to reach any value in the nested dictionary:
sorted(results.iteritems(), key=lambda x: x[1]['Rating'])
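Note that iteritems() only exists on Python 2; on Python 3 the same sort is written with items():
result = sorted(results.items(), key=lambda x: x[1]['Rating'], reverse=True)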
Hi everyone, I am having problems parsing out information from a query I made to the MapQuest API. I am trying to parse data from my geocode_data column and place it into separate columns. Specifically, I am trying to extract the following address components from the geocode data below: street, postalCode, adminArea5 (city), adminArea3 (state), adminArea1 (country), and the displayLatLng lat/lng.
{'providedLocation': {'latLng': {'lat': 52.38330319, 'lng': 4.7959011}},
 'locations': [{'adminArea6Type': 'Neighborhood',
                'street': '25 Philip Vingboonsstraat',
                'adminArea4Type': 'County',
                'adminArea3Type': 'State',
                'displayLatLng': {'lat': 52.383324, 'lng': 4.795784},
                'adminArea3': 'Noord-Holland',
                'adminArea1Type': 'Country',
                'linkId': '0',
                'adminArea4': 'MRA',
                'dragPoint': False,
                'mapUrl': 'http://www.mapquestapi.com/staticmap/v4/getmap?key=Cxk9Ng7G6M8VlrJytSZaAACnZE6pG3xp&type=map&size=225,160&pois=purple-1,52.3833236,4.7957837,0,0,|&center=52.3833236,4.7957837&zoom=15&rand=-152222465',
                'type': 's',
                'postalCode': '1067BG',
                'latLng': {'lat': 52.383324, 'lng': 4.795784},
                'adminArea5': 'Amsterdam',
                'adminArea6': 'Amsterdam',
                'geocodeQuality': 'ADDRESS',
                'unknownInput': '',
                'adminArea5Type': 'City',
                'geocodeQualityCode': 'L1AAA',
                'adminArea1': 'NL',
                'sideOfStreet': 'N'}]}
I have tried building my code but I keep getting KeyErrors. Can anyone fix my code so that I am able to extract the different address components for my study? Thanks! My code is correct until the locations part towards the end; then I get a KeyError.
import pandas as pd
import json
import requests
df = pd.read_csv('/Users/albertgonzalobautista/Desktop/Testing11.csv')
df['geocode_data'] = ''
df['address']=''
df['st_pr_mn']= ' '
def reverseGeocode(latlng):
    result = {}
    url = 'http://www.mapquestapi.com/geocoding/v1/reverse?key={1}&location={0}'
    apikey = 'Cxk9Ng7G6M8VlrJytSZaAACnZE6pG3xp'
    request = url.format(latlng, apikey)
    data = json.loads(requests.get(request).text)
    if len(data['results']) > 0:
        result = data['results'][0]
    return result

for i, row in df.iterrows():
    df['geocode_data'][i] = reverseGeocode(df['lat'][i].astype(str) + ',' + df['lon'][i].astype(str))

for i, row in df.iterrows():
    if 'locations' in row['geocode_data']:
        for component in row['locations']:
            print(row['locations'])
            df['st_pr_mn'][i] = row['adminArea3']
First of all, according to your if condition, locations is a key in row['geocode_data'], so you should try row['geocode_data']['locations'], not row['locations']. This is most probably the reason you are getting the KeyError.
Then, according to the JSON you have given in the OP, it seems the locations key stores a list, so iterate over each element (as you are doing now) and get the required value from component, not row. Example:
for i, row in df.iterrows():
    if 'locations' in row['geocode_data']:
        for component in row['geocode_data']['locations']:
            print(row['geocode_data']['locations'])
            df['st_pr_mn'][i] = component['adminArea3']
Though this would overwrite df['st_pr_mn'][i] with a new value of component['adminArea3'] for every dictionary in the list row['geocode_data']['locations']. If there is only one element in the list then it's fine; otherwise you would have to decide how to store the multiple values, maybe using a list for that.
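A sketch of that last suggestion: collect every adminArea3 from the list and store them joined together (or keep the list itself), rather than letting the last one win:
for i, row in df.iterrows():
    if 'locations' in row['geocode_data']:
        admin_areas = [component.get('adminArea3') for component in row['geocode_data']['locations']]
        # keep all values instead of overwriting; join them, or store the list itself
        df['st_pr_mn'][i] = ', '.join(a for a in admin_areas if a)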