My issue is that I cannot use bs4 to scrape the sub-ratings in Glassdoor reviews.
So far, I have discovered where these stars are, but their markup is identical regardless of the color (i.e., green or grey). I need to identify the color to determine the ratings, not just scrape the stars. Below is my code:
import requests
from bs4 import BeautifulSoup

url = 'https://www.glassdoor.com/Reviews/Walmart-Reviews-E715_P2.htm?filter.iso3Language=eng'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
com = soup.find(class_="ratingNumber mr-xsm")
com1 = soup.find(class_="gdReview")
com1_1 = com1.find(class_="content")
For getting the star-rating breakdown (which seems to have no numeric display or meta value), I don't think there's a simple, straightforward method, since the coloring is done via CSS in a style tag that is linked to the container element by a class.
You could use something like soup.select('style:-soup-contains(".css-1nuumx7")') [the css-1nuumx7 part is specific to the rating mentioned above], but :-soup-contains needs the html5lib parser and can be a bit slow, so it's better to figure out the data-emotion-css attribute of the style tag instead.
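For context, here is a minimal sketch of how such an emotion-generated style rule encodes the rating (the class name is the one mentioned above; the grey stop color is illustrative, but the green #0caa41 stop is what the parsing below keys on):

# roughly what the matching <style data-emotion-css="1nuumx7"> tag contains
# (the grey #dadbdd stop is a guess; only the green stop matters for parsing)
css = '.css-1nuumx7{background:linear-gradient(90deg,#0caa41 80%,#dadbdd 80%)}'
# the green stop percentage encodes the rating: 80% of 5 stars -> 4.0
pct = float(css.split('90deg,#0caa41 ', 1)[1].split('%')[0])
print((pct / 100) * 5)  # 4.0

The helper below finds the css- class on the stars container, looks up the matching style tag, and extracts that percentage: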
def getDECstars(starCont, mSoup, outOf=5, isv=False):
    # find the "css-" class on the stars container
    classList = starCont.get('class', [])
    if not isinstance(classList, list): classList = [classList]
    classList = [str(c) for c in classList if str(c).startswith('css-')]
    if not classList:
        if isv: print('Stars container has no "css-" class')
        return None
    # look up the matching <style data-emotion-css="..."> tag
    demc = classList[0].replace('css-', '', 1)
    demc_sel = f'style[data-emotion-css="{demc}"]'
    cssStyle = mSoup.select_one(demc_sel)
    if not cssStyle:
        if isv: print(f'Nothing found with selector {demc_sel}')
        return None
    cssStyle = cssStyle.get_text()
    errMsg = ''
    if '90deg,#0caa41 ' not in cssStyle: errMsg += 'No #0caa41'
    if '%' not in cssStyle.split('90deg,#0caa41 ', 1)[-1][:20]:
        errMsg += ' No %'
    if not errMsg:
        # extract the green stop percentage and scale it to outOf stars
        rPerc = cssStyle.split('90deg,#0caa41 ', 1)[-1]
        rPerc = rPerc.split('%')[0]
        try:
            rPerc = float(rPerc)
            if 0 <= rPerc <= 100:
                if isinstance(outOf, int) and outOf > 0: rPerc = (rPerc/100)*outOf
                return float(f'{float(rPerc):.3}')
            errMsg = f'{demc_sel} --> "{rPerc}" is out of range'
        except ValueError:
            errMsg = f'{demc_sel} --> cannot convert to float "{rPerc}"'
    if isv: print(f'{demc_sel} --> unexpected format {errMsg}')
    return None
OR, if you don't care so much about why a rating might be missing:
def getDECstars(starCont, mSoup, outOf=5, isv=False):
    try:
        demc = [c for c in starCont.get('class', []) if c[:4] == 'css-'][0].replace('css-', '', 1)
        demc_sel = f'style[data-emotion-css="{demc}"]'
        rPerc = float(mSoup.select_one(demc_sel).get_text().split('90deg,#0caa41 ', 1)[1].split('%')[0])
        return float(f'{(rPerc/100)*outOf if isinstance(outOf, int) and outOf > 0 else rPerc:.3}')
    except Exception:
        return None
Here's an example of how you might use it:
pcCon = 'div.px-std:has(h2 > a.reviewLink) + div.px-std'
pcDiv = f'{pcCon} div.v2__EIReviewDetailsV2__fullWidth'
refDict = {
    'rating_num': 'span.ratingNumber',
    'emp_status': 'div:has(> div > span.ratingNumber) + span',
    'header': 'h2 > a.reviewLink',
    'subheader': 'h2:has(> a.reviewLink) + span',
    'pros': f'{pcDiv}:first-of-type > p.pb',
    'cons': f'{pcDiv}:nth-of-type(2) > p.pb'
}
subRatSel = 'div:has(> .ratingNumber) ~ aside ul > li:has(div ~ div)'

empRevs = []
for r in soup.select('li[id^="empReview_"]'):
    rDet = {'reviewId': r.get('id')}
    for sr in r.select(subRatSel):
        k = sr.select_one('div:first-of-type').get_text(' ').strip()
        sval = getDECstars(sr.select_one('div:nth-of-type(2)'), soup)
        rDet[f'[rating] {k}'] = sval
    for k, sel in refDict.items():
        sval = r.select_one(sel)
        if sval: sval = sval.get_text(' ').strip()
        rDet[k] = sval
    empRevs.append(rDet)
If empRevs is viewed as a table:
| reviewId | [rating] Work/Life Balance | [rating] Culture & Values | [rating] Diversity & Inclusion | [rating] Career Opportunities | [rating] Compensation and Benefits | [rating] Senior Management | rating_num | emp_status | header | subheader | pros | cons |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| empReview_71400593 | 5 | 4 | 4 | 4 | 5 | 3 | 3 |  | great pay but bit of obnoxious enviornment | Nov 26, 2022 - Sales Associate/Cashier in Bensalem, PA | -Walmart's fair pay policy is ... | -some locations wont build emp... |
| empReview_70963705 | 3 | 3 | 2 | 2 | 2 | 2 | 2 | Former Employee | Walmart Employees Trained Thrown to the Wolves | Nov 10, 2022 - Data Entry | Getting a snack at break was e... | I worked at Walmart for a very... |
| empReview_71415031 | 4 | 4 | 4 | 4 | 4 | 4 | 5 | Current Employee, more than 1 year | Work | Nov 27, 2022 - Warehouse Associate in Springfield, GA | The money there is good during... | It can get stressful at times ... |
| empReview_69136451 | nan | nan | nan | nan | nan | nan | 4 | Current Employee | Walmart | Sep 16, 2022 - Sales Associate/Cashier | I'm a EXPERIENCED WORKER. I ✨... | In my opinion I believe that W... |
| empReview_71398525 | 4 | 3 | 4 | 3 | 4 | 3 | 4 | Current Employee | Depends heavily on your team | Nov 26, 2022 - Personal Digital Shopper | I have a generally excellent t... | Generally, departments are sho... |
| empReview_71227029 | 1 | 1 | 1 | 1 | 3 | 1 | 1 | Former Employee, less than 1 year | Managers are treated like a slave. | Nov 19, 2022 - Auto Care Center Manager (ACCM) in Cottonwood, AZ | Great if you like working with... | you only get to work in your a... |
| empReview_71329467 | 1 | 3 | 3 | 3 | 4 | 1 | 1 | Current Employee, more than 3 years | No more values | Nov 23, 2022 - GM Coach in Houston, TX | Pay compare to other retails a... | Walmart is not a bad company t... |
| empReview_71512609 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | Former Employee | Walmart midnight stocker | Nov 30, 2022 - Midnight Stocker in Taylor, MI | 2 paid 15 min breaks and 1 hou... | Honestly nothing that I can th... |
| empReview_70585957 | 3 | 4 | 4 | 4 | 4 | 4 | 4 | Former Employee | Lots of Opportunity | Oct 28, 2022 - Human Resources People Lead | Plenty of opportunities if one... | As with any job, management is... |
| empReview_71519435 | 3 | 4 | 4 | 5 | 4 | 4 | 5 | Current Employee, more than 3 years | Lot of work but worth it | Nov 30, 2022 - People Lead | I enjoy making associates live... | Sometimes an overwhelming amou... |
Markdown for the table above was printed with pandas:
import pandas

erdf = pandas.DataFrame(empRevs).set_index('reviewId')
erdf['pros'] = [p[:30] + '...' if len(p) > 33 else p for p in erdf['pros']]
erdf['cons'] = [p[:30] + '...' if len(p) > 33 else p for p in erdf['cons']]
print(erdf.to_markdown())
Related
I have a script that gets data from a dataframe, uses that data to make requests to a website, finds the exact href with the fuzzywuzzy module, and then runs a function to scrape odds. Is it possible to speed up this script with the multiprocessing module?
Date HomeTeam AwayTeam
0 Monday 6 December 2021 20:00 Everton Arsenal
1 Monday 6 December 2021 17:30 Empoli Udinese
2 Monday 6 December 2021 19:45 Cagliari Torino
3 Monday 6 December 2021 20:00 Getafe Athletic Bilbao
4 Monday 6 December 2021 15:00 Real Zaragoza Eibar
5 Monday 6 December 2021 17:15 Cartagena Tenerife
6 Monday 6 December 2021 20:00 Girona Leganes
7 Monday 6 December 2021 19:45 Niort Toulouse
8 Monday 6 December 2021 19:00 Jong Ajax FC Emmen
9 Monday 6 December 2021 19:00 Jong AZ Excelsior
Script
df = pd.read_excel(path)
dates = df.Date
hometeams = df.HomeTeam
awayteams = df.AwayTeam
matches_odds = list()
for i, (a, b, c) in enumerate(zip(dates, hometeams, awayteams)):
    try:
        r = requests.get(f'https://www.betexplorer.com/results/soccer/?year={a.split(" ")[3]}&month={monthToNum(a.split(" ")[2])}&day={a.split(" ")[1]}')
    except requests.exceptions.ConnectionError:
        sleep(10)
        r = requests.get(f'https://www.betexplorer.com/results/soccer/?year={a.split(" ")[3]}&month={monthToNum(a.split(" ")[2])}&day={a.split(" ")[1]}')
    soup = BeautifulSoup(r.text, 'html.parser')
    f = soup.find_all('td', class_="table-main__tt")
    for tag in f:
        match = fuzz.ratio(f'{b} - {c}', tag.find('a').text)
        hour = a.split(" ")[4]
        if hour.split(':')[0] == '23':
            act_hour = '00' + ':' + hour.split(':')[1]
        else:
            act_hour = str(int(hour.split(':')[0]) + 1) + ':' + hour.split(':')[1]
        if match > 70 and act_hour == tag.find('span').text:
            href_id = tag.find('a')['href']
            table = get_odds(href_id)
            matches_odds.append(table)
    print(i, ' of ', len(dates))
PS: The monthToNum function just replaces the month name with its number.
First, turn your loop body into a function taking i, a, b and c as inputs. Then create a multiprocessing.Pool and map this function over the argument tuples. Note that appends made inside worker processes do not propagate back to the parent, so have the function return its results and collect them from pool.map:
import multiprocessing
import requests
from time import sleep
import pandas as pd
from bs4 import BeautifulSoup
from fuzzywuzzy import fuzz

df = pd.read_excel(path)
dates = df.Date
hometeams = df.HomeTeam
awayteams = df.AwayTeam

def fetch(data):
    i, (a, b, c) = data
    try:
        r = requests.get(f'https://www.betexplorer.com/results/soccer/?year={a.split(" ")[3]}&month={monthToNum(a.split(" ")[2])}&day={a.split(" ")[1]}')
    except requests.exceptions.ConnectionError:
        sleep(10)
        r = requests.get(f'https://www.betexplorer.com/results/soccer/?year={a.split(" ")[3]}&month={monthToNum(a.split(" ")[2])}&day={a.split(" ")[1]}')
    soup = BeautifulSoup(r.text, 'html.parser')
    f = soup.find_all('td', class_="table-main__tt")
    found = []  # collected locally; appending to a global here would be lost across processes
    for tag in f:
        match = fuzz.ratio(f'{b} - {c}', tag.find('a').text)
        hour = a.split(" ")[4]
        if hour.split(':')[0] == '23':
            act_hour = '00' + ':' + hour.split(':')[1]
        else:
            act_hour = str(int(hour.split(':')[0]) + 1) + ':' + hour.split(':')[1]
        if match > 70 and act_hour == tag.find('span').text:
            href_id = tag.find('a')['href']
            table = get_odds(href_id)
            found.append(table)
    print(i, ' of ', len(dates))
    return found

if __name__ == '__main__':
    num_processes = 20
    with multiprocessing.Pool(num_processes) as pool:
        results = pool.map(fetch, enumerate(zip(dates, hometeams, awayteams)))
    # flatten the per-match lists returned by the workers
    matches_odds = [t for sub in results for t in sub]
Besides, multiprocessing is not the only way to improve the speed. Asynchronous programming can be used as well and is probably a better fit for this I/O-bound scenario, although multiprocessing does the job too. If you read the Python multiprocessing documentation carefully, the pattern above will be clear.
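For comparison, here is a minimal asynchronous sketch of the same fan-out, assuming the aiohttp package is available and with parse_page standing in (hypothetically) for the BeautifulSoup/fuzzywuzzy matching above:

import asyncio
import aiohttp

async def fetch_day(session, url):
    # one GET per results page; return None instead of raising so a single
    # bad request does not cancel the whole batch
    try:
        async with session.get(url) as resp:
            return await resp.text()
    except aiohttp.ClientError:
        return None

async def main(urls):
    async with aiohttp.ClientSession() as session:
        pages = await asyncio.gather(*(fetch_day(session, u) for u in urls))
    # parse_page is a placeholder for the matching/odds-scraping logic
    return [parse_page(p) for p in pages if p is not None]

# matches_odds = asyncio.run(main(urls))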
I have the following code:
import requests, pandas as pd
from bs4 import BeautifulSoup
s = requests.session()
url2 = r'https://www.har.com/homedetail/6408-burgoyne-rd-157-houston-tx-77057/3380601'
r = s.get(url2)
soup = BeautifulSoup(r.text, 'html.parser')
z2 = soup.find_all("div", {"class": 'dc_blocks_2c'})
z2 returns a long list. How do I get all the labels and values into a dataframe, i.e. gather the dc_label and dc_value pairs?
When reading tables, it's sometimes easier to just use pandas' read_html() method. If it doesn't capture everything you want, you can code for the other stuff. It just depends on what you need from the page.
import pandas as pd

url = 'https://www.har.com/homedetail/6408-burgoyne-rd-157-houston-tx-77057/3380601'
list_of_dataframes = pd.read_html(url)
for df in list_of_dataframes:
    print(df)
Or get a dataframe by its position in the list. For example,
df = list_of_dataframes[2]
All dataframes captured:
0 1
0 Original List Price: $249,890
1 Price Reduced: -$1,000
2 Current List Price: $248,890
3 Last Reduction on: 05/14/2021
0 1
0 Original List Price: $249,890
1 Price Reduced: -$1,000
2 Current List Price: $248,890
3 Last Reduction on: 05/14/2021
Tax Year Cost/sqft Market Value Change Tax Assessment Change.1
0 2020 $114.36 $187,555 -4.88% $187,555 -4.88%
1 2019 $120.22 $197,168 -9.04% $197,168 -9.04%
2 2018 $132.18 $216,768 0.00% $216,768 0.00%
3 2017 $132.18 $216,768 5.74% $216,768 9.48%
4 2016 $125.00 $205,000 2.19% $198,000 6.90%
5 2015 $122.32 $200,612 18.71% $185,219 10.00%
6 2014 $103.05 $169,000 10.40% $168,381 10.00%
7 2013 $93.34 $153,074 0.00% $153,074 0.00%
8 2012 $93.34 $153,074 NaN $153,074 NaN
0 1
0 Market Land Value: $39,852
1 Market Improvement Value: $147,703
2 Total Market Value: $187,555
0 1
0 HOUSTON ISD: 1.1367 %
1 HARRIS COUNTY: 0.4071 %
2 HC FLOOD CONTROL DIST: 0.0279 %
3 PORT OF HOUSTON AUTHORITY: 0.0107 %
4 HC HOSPITAL DIST: 0.1659 %
5 HC DEPARTMENT OF EDUCATION: 0.0050 %
6 HOUSTON COMMUNITY COLLEGE: 0.1003 %
7 HOUSTON CITY OF: 0.5679 %
8 Total Tax Rate: 2.4216 %
0 1
0 Estimated Monthly Principal & Interest (Based on the calculation below) $ 951
1 Estimated Monthly Property Tax (Based on Tax Assessment 2020) $ 378
2 Home Owners Insurance Get a Quote
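If you only need one specific table, read_html() also accepts a match argument (a string or regex that the table's text must contain), which avoids counting positions in the list; for example:

# capture only the table(s) whose text contains "Tax Year"
tax_df = pd.read_html(url, match='Tax Year')[0]
print(tax_df)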
To gather the dc_label and dc_value pairs directly from z2:
pd.DataFrame([el.find_all('div', {'dc_label','dc_value'}) for el in z2])
0 1
0 [MLS#:] [30509690 (HAR) ]
1 [Listing Price:] [$ 248,890 ($151.76/sqft.) , [], [$Convert ], ...
2 [Listing Status:] [[\n, [\n, <span class="status_icon_1" style="...
3 [Address:] [6408 Burgoyne Road #157]
4 [Unit No.:] [157]
5 [City:] [[Houston]]
6 [State:] [TX]
7 [Zip Code:] [[77057]]
8 [County:] [[Harris County]]
9 [Subdivision:] [ , [Briarwest T/H Condo (View subdivision pri...
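The cells above are still raw tags; here is a small sketch that flattens each pair to plain text, assuming every dc_blocks_2c row holds one dc_label div and one dc_value div (it reuses z2 and pd from the question):

rows = []
for el in z2:
    label = el.find('div', class_='dc_label')
    value = el.find('div', class_='dc_value')
    if label and value:  # skip rows missing either half
        rows.append({'label': label.get_text(' ', strip=True),
                     'value': value.get_text(' ', strip=True)})
details = pd.DataFrame(rows)
print(details)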
It took me the whole day trying to fix this problem, but I didn't find a solution, so I hope you can help me. I already tried to extract the data from the website, but the problem is that I don't know how to split the list so that 500g becomes 500,g. The quantity on the website is sometimes 1 and sometimes something like 1 1/2 kg. I need to convert this into a CSV file and then into a MySQL database. What I want at the end is a CSV file with the columns: ingredient ID, ingredient, quantity, and the unit of the quantity, for example:
0, meat, 500, g. This is the code I already have to extract the data from this website:
import re
from bs4 import BeautifulSoup
import requests
import csv

urls_recipes = ['https://www.chefkoch.de/rezepte/3039711456645264/Ossobuco-a-la-Milanese.html']
mainurl = "https://www.chefkoch.de/rs/s0e1n1z1b0i1d1,2,3/Rezepte.html"
urls_urls = []
ingredients = []
menge = []

def read_recipes():
    for url, id2 in zip(urls_recipes, range(len(urls_recipes))):
        soup2 = BeautifulSoup(requests.get(url).content, "lxml")
        for ingredient in soup2.select('.td-left'):
            menge.append([*[re.sub(r'\s{2,}', ' ', ingredient.get_text(strip=True))]])
        for ingredient in soup2.select('.recipe-ingredients h3, .td-right'):
            if ingredient.name == 'h3':
                ingredients.append([id2, *[ingredient.get_text(strip=True)]])
            else:
                ingredients.append([id2, *[re.sub(r'\s{2,}', ' ', ingredient.get_text(strip=True))]])

read_recipes()
I hope you can help me. Thank you!
It appears that the strings containing fractions use the Unicode symbols for 1/2 etc., so I think a good way of starting is replacing those by looking up the specific code point and passing it to str.replace(). Splitting the amount from the unit was easy for this example, since they are separated by a space, but it might be necessary to generalize this if you encounter other combinations.
The following code works for this specific example:
import re
from bs4 import BeautifulSoup
import requests
import csv
import pandas as pd

urls_recipes = ['https://www.chefkoch.de/rezepte/3039711456645264/Ossobuco-a-la-Milanese.html']
mainurl = "https://www.chefkoch.de/rs/s0e1n1z1b0i1d1,2,3/Rezepte.html"
urls_urls = []
ingredients = []
menge = []
einheit = []

for url, id2 in zip(urls_recipes, range(len(urls_recipes))):
    soup2 = BeautifulSoup(requests.get(url).content, "lxml")
    for ingredient in soup2.select('.td-left'):
        # get rid of multiple spaces and replace the 1/2 unicode character
        raw_string = re.sub(r'\s{2,}', ' ', ingredient.get_text(strip=True)).replace(u'\u00BD', "0.5")
        # split into amount and unit
        splitlist = raw_string.split(" ")
        menge.append(splitlist[0])
        if len(splitlist) == 2:
            einheit.append(splitlist[1])
        else:
            einheit.append('')
    for ingredient in soup2.select('.recipe-ingredients h3, .td-right'):
        if ingredient.name == 'h3':
            continue
        else:
            ingredients.append([id2, re.sub(r'\s{2,}', ' ', ingredient.get_text(strip=True))])

result = pd.DataFrame(ingredients, columns=["ID", "Ingredients"])
result.loc[:, "unit"] = einheit
result.loc[:, "amount"] = menge
Output:
>>> result
ID Ingredients unit amount
0 0 Beinscheibe(n), vom Rind, ca. 4 cm dick geschn... 4
1 0 Mehl etwas
2 0 Zwiebel(n) 1
3 0 Knoblauchzehe(n) 2
4 0 Karotte(n) 1
5 0 Lauchstange(n) 1
6 0 Staudensellerie 0.5
7 0 Tomate(n), geschält Dose 1
8 0 Tomatenmark EL 1
9 0 Rotwein zum Ablöschen
10 0 Rinderfond oder Fleischbrühe Liter 0.5
11 0 Olivenöl zum Braten
12 0 Gewürznelke(n) 2
13 0 Pimentkörner 10
14 0 Wacholderbeere(n) 5
15 0 Pfefferkörner
16 0 Salz
17 0 Pfeffer, schwarz, aus der Mühle
18 0 Thymian
19 0 Rosmarin
20 0 Zitrone(n), unbehandelt 1
21 0 Knoblauchzehe(n) 2
22 0 Blattpetersilie Bund 1
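If you run into other Unicode fractions besides ½, a more general approach (a sketch using only the standard library) is to let unicodedata supply the numeric value of any vulgar-fraction character:

import unicodedata

def normalize_fractions(s):
    # replace each Unicode vulgar fraction (e.g. ½, ¼, ¾) with its decimal
    # value, leaving ordinary digits untouched
    out = []
    for ch in s:
        if not ch.isdigit():
            try:
                out.append(str(unicodedata.numeric(ch)))
                continue
            except (TypeError, ValueError):
                pass
        out.append(ch)
    return ''.join(out)

print(normalize_fractions('1 ½ kg'))  # -> '1 0.5 kg'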
I have this dataframe, df_pm:
Player GameWeek Minutes \
PlayerMatchesDetailID
1 Alisson 1 90
2 Virgil van Dijk 1 90
3 Joseph Gomez 1 90
ForTeam AgainstTeam \
1 Liverpool Norwich City
2 Liverpool Norwich City
3 Liverpool Norwich City
Goals ShotsOnTarget ShotsInBox CloseShots \
1 0 0 0 0
2 1 1 1 1
3 0 0 0 0
TotalShots Headers GoalAssists ShotOnTargetCreated \
1 0 0 0 0
2 1 1 0 0
3 0 0 0 0
ShotInBoxCreated CloseShotCreated TotalShotCreated \
1 0 0 0
2 0 0 0
3 0 0 1
HeadersCreated
1 0
2 0
3 0
this second dataframe, df_melt:
MatchID GameWeek Date Team Home \
0 46605 1 2019-08-09 Liverpool Home
1 46605 1 2019-08-09 Norwich City Away
2 46606 1 2019-08-10 AFC Bournemouth Home
AgainstTeam
0 Norwich City
1 Liverpool
2 Sheffield United
3 AFC Bournemouth
...
575 Sheffield United
576 Newcastle United
577 Southampton
and this snippet, which uses both:
match_ids = []
home_away = []
dates = []
#For each row in the player matches dataframe...
for row in df_pm.itertuples():
#Look up the match id from the team matches dataframe
team = row.ForTeam
againstteam = row.AgainstTeam
gameweek = row.GameWeek
print (team,againstteam,gameweek)
match_id = df_melt.loc[(df_melt['GameWeek']==gameweek)
&(df_melt['Team']==team)
&(df_melt['AgainstTeam']==againstteam),
'MatchID'].item()
date = df_melt.loc[(df_melt['GameWeek']==gameweek)
&(df_melt['Team']==team)
&(df_melt['AgainstTeam']==againstteam),
'Date'].item()
home = df_melt.loc[(df_melt['GameWeek']==gameweek)
&(df_melt['Team']==team)
&(df_melt['AgainstTeam']==againstteam),
'Home'].item()
match_ids.append(match_id)
home_away.append(home)
dates.append(date)
On the first iteration, the print statement outputs:
Liverpool Norwich City 1
But I'm getting the error:
Traceback (most recent call last):
File "tableau_data_generation.py", line 166, in <module>
'MatchID'].item()
File "/Users/me/anaconda2/envs/data_science/lib/python3.7/site-packages/pandas/core/base.py", line 652, in item
return self.values.item()
ValueError: can only convert an array of size 1 to a Python scalar
Printing the whole df_melt dataframe, I see that these four datetime values are flawed:
540 46875 28 TBC Aston Villa Home
541 46875 28 TBC Sheffield United Away
...
548 46879 28 TBC Manchester City Home
549 46879 28 TBC Arsenal Away
How do I fix this?
When you use item() on a Series, you should actually have received:
FutureWarning: `item` has been deprecated and will be removed in a future version
Since item() was deprecated in version 0.25.0, it looks like you are using an outdated version of pandas, and you should probably start by upgrading it. Even in newer versions of pandas you can still use item(), but on a NumPy array (at least for now it is not deprecated there).
So change your code to:
df_melt.loc[...].values.item()
Another option is to use iloc[0], so you can also change your code to:
df_melt.loc[...].iloc[0]
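A quick illustration of both options on a toy Series:

import pandas as pd

s = pd.Series([46605])
print(s.values.item())  # item() on the underlying NumPy array -> 46605
print(s.iloc[0])        # first element by position -> 46605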
Edit
The above solution can still raise an exception (IndexError) if df_melt contains no row meeting the given criteria.
To make your code resistant to such cases (and return some default value instead), you can add a function that gets the given attribute (attr, actually a column) from the first row meeting the given criteria (gameweek, team, and againstteam):
def getAttr(gameweek, team, againstteam, attr, default=None):
    xx = df_melt.loc[(df_melt['GameWeek'] == gameweek)
                     & (df_melt['Team'] == team)
                     & (df_melt['AgainstTeam'] == againstteam)]
    return default if xx.empty else xx.iloc[0].loc[attr]
Then, instead of the three ... = df_melt.loc[...].item() instructions, run:
match_id = getAttr(gameweek, team, againstteam, 'MatchID', default=-1)
date = getAttr(gameweek, team, againstteam, 'Date')
home = getAttr(gameweek, team, againstteam, 'Home', default='????')
I am learning to scrape data from websites with Python, extracting weather information about San Francisco from this page. I get stuck when combining the data into a pandas DataFrame. Is it possible to create a dataframe whose columns have different lengths?
I have already tried two approaches based on answers here, but they are not exactly what I am looking for: both of them shift the values of the temps column up.
1st way: https://stackoverflow.com/a/40442094/10179259
2nd way: https://stackoverflow.com/a/19736406/10179259
import requests
from bs4 import BeautifulSoup
import pandas as pd

page = requests.get("http://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168")
soup = BeautifulSoup(page.content, 'html.parser')
seven_day = soup.find(id="seven-day-forecast")
forecast_items = seven_day.find_all(class_="tombstone-container")

periods = [pt.get_text() for pt in seven_day.select('.tombstone-container .period-name')]
short_descs = [sd.get_text() for sd in seven_day.select('.tombstone-container .short-desc')]
temps = [t.get_text() for t in seven_day.select('.tombstone-container .temp')]
descs = [d['alt'] for d in seven_day.select('.tombstone-container img')]
#print(len(periods), len(short_descs), len(temps), len(descs))

weather = pd.DataFrame({
    "period": periods,          # length is 9
    "short_desc": short_descs,  # length is 9
    "temp": temps,              # problem here: length is 8
    #"desc": descs              # length is 9
})
print(weather)
I expect the first row of the temp column to be NaN. Thank you.
You can loop over each value of forecast_items and use next with iter to select the first matching value; if one does not exist, NaN is assigned to the dictionary:
import requests
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup

page = requests.get("http://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168")
soup = BeautifulSoup(page.content, 'html.parser')
seven_day = soup.find(id="seven-day-forecast")
forecast_items = seven_day.find_all(class_="tombstone-container")

out = []
for x in forecast_items:
    periods = next(iter([t.get_text() for t in x.select('.period-name')]), np.nan)
    short_descs = next(iter([t.get_text() for t in x.select('.short-desc')]), np.nan)
    temps = next(iter([t.get_text() for t in x.select('.temp')]), np.nan)
    descs = next(iter([d['alt'] for d in x.select('img')]), np.nan)
    out.append({'period': periods, 'short_desc': short_descs, 'temp': temps, 'descs': descs})

weather = pd.DataFrame(out)
print(weather)
descs period \
0 NOW until4:00pm Sat
1 Today: Showers, with thunderstorms also possib... Today
2 Tonight: Showers likely and possibly a thunder... Tonight
3 Sunday: A chance of showers before 11am, then ... Sunday
4 Sunday Night: Rain before 11pm, then a chance ... SundayNight
5 Monday: A 40 percent chance of showers. Cloud... Monday
6 Monday Night: A 30 percent chance of showers. ... MondayNight
7 Tuesday: A 50 percent chance of rain. Cloudy,... Tuesday
8 Tuesday Night: Rain. Cloudy, with a low aroun... TuesdayNight
short_desc temp
0 Wind Advisory NaN
1 Showers andBreezy High: 56 °F
2 ShowersLikely Low: 49 °F
3 Heavy Rainand Windy High: 56 °F
4 Heavy Rainand Breezythen ChanceShowers Low: 52 °F
5 ChanceShowers High: 58 °F
6 ChanceShowers Low: 53 °F
7 Chance Rain High: 59 °F
8 Rain Low: 53 °F