I have the following code:
import requests, pandas as pd
from bs4 import BeautifulSoup
s = requests.session()
url2 = r'https://www.har.com/homedetail/6408-burgoyne-rd-157-houston-tx-77057/3380601'
r = s.get(url2)
soup = BeautifulSoup(r.text, 'html.parser')
z2 = soup.find_all("div", {"class": 'dc_blocks_2c'})
z2 returns a long list. How do I get all the variables and values into a dataframe, i.e., gather the dc_label and dc_value pairs?
When reading tables, it's sometimes easier to just use the read_html() method. If it doesn't capture everything you want, you can code for the other stuff. It just depends on what you need from the page.
url = 'https://www.har.com/homedetail/6408-burgoyne-rd-157-houston-tx-77057/3380601'
list_of_dataframes = pd.read_html(url)
for df in list_of_dataframes:
print(df)
or get a df by its position in the list, for example:
df = list_of_dataframes[2]
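If a site rejects pandas' own request, you can fetch the page yourself and hand the HTML string to read_html instead of the URL (a small sketch reusing the requests session from the question; read_html accepts raw HTML):
r = s.get(url)
list_of_dataframes = pd.read_html(r.text)  # parse all tables from the fetched HTML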
All dataframes captured:
0 1
0 Original List Price: $249,890
1 Price Reduced: -$1,000
2 Current List Price: $248,890
3 Last Reduction on: 05/14/2021
0 1
0 Original List Price: $249,890
1 Price Reduced: -$1,000
2 Current List Price: $248,890
3 Last Reduction on: 05/14/2021
Tax Year Cost/sqft Market Value Change Tax Assessment Change.1
0 2020 $114.36 $187,555 -4.88% $187,555 -4.88%
1 2019 $120.22 $197,168 -9.04% $197,168 -9.04%
2 2018 $132.18 $216,768 0.00% $216,768 0.00%
3 2017 $132.18 $216,768 5.74% $216,768 9.48%
4 2016 $125.00 $205,000 2.19% $198,000 6.90%
5 2015 $122.32 $200,612 18.71% $185,219 10.00%
6 2014 $103.05 $169,000 10.40% $168,381 10.00%
7 2013 $93.34 $153,074 0.00% $153,074 0.00%
8 2012 $93.34 $153,074 NaN $153,074 NaN
0 1
0 Market Land Value: $39,852
1 Market Improvement Value: $147,703
2 Total Market Value: $187,555
0 1
0 HOUSTON ISD: 1.1367 %
1 HARRIS COUNTY: 0.4071 %
2 HC FLOOD CONTROL DIST: 0.0279 %
3 PORT OF HOUSTON AUTHORITY: 0.0107 %
4 HC HOSPITAL DIST: 0.1659 %
5 HC DEPARTMENT OF EDUCATION: 0.0050 %
6 HOUSTON COMMUNITY COLLEGE: 0.1003 %
7 HOUSTON CITY OF: 0.5679 %
8 Total Tax Rate: 2.4216 %
0 1
0 Estimated Monthly Principal & Interest (Based on the calculation below) $ 951
1 Estimated Monthly Property Tax (Based on Tax Assessment 2020) $ 378
2 Home Owners Insurance Get a Quote
Alternatively, you can pull the label and value divs straight out of z2:
pd.DataFrame([el.find_all('div', {'dc_label','dc_value'}) for el in z2])
0 1
0 [MLS#:] [30509690 (HAR) ]
1 [Listing Price:] [$ 248,890 ($151.76/sqft.) , [], [$Convert ], ...
2 [Listing Status:] [[\n, [\n, <span class="status_icon_1" style="...
3 [Address:] [6408 Burgoyne Road #157]
4 [Unit No.:] [157]
5 [City:] [[Houston]]
6 [State:] [TX]
7 [Zip Code:] [[77057]]
8 [County:] [[Harris County]]
9 [Subdivision:] [ , [Briarwest T/H Condo (View subdivision pri...
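The cells above still hold lists of tags. To reduce them to clean text pairs (a sketch, assuming each dc_blocks_2c block contains one dc_label div and one dc_value div, as the output suggests):
rows = []
for el in z2:
    label = el.find('div', class_='dc_label')
    value = el.find('div', class_='dc_value')
    if label and value:  # skip blocks missing either part
        rows.append({'label': label.get_text(' ', strip=True),
                     'value': value.get_text(' ', strip=True)})
df = pd.DataFrame(rows)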
My issue is that I cannot use bs4 to scrape the sub-ratings in Glassdoor reviews. So far, I have discovered where these stars are, but their markup is the same regardless of the color (i.e., green or grey). I need to be able to identify the color to identify the ratings, not just scrape the stars. Below is my code:
import requests
from bs4 import BeautifulSoup

url = 'https://www.glassdoor.com/Reviews/Walmart-Reviews-E715_P2.htm?filter.iso3Language=eng'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
com = soup.find(class_="ratingNumber mr-xsm")
com1 = soup.find(class_="gdReview")
com1_1 = com1.find(class_="content")
For getting the star-rating breakdown (which seems to have no numeric display or meta value), I don't think there's any simple and straightforward method, since the fill is done by CSS in a style tag that is connected to the container element by a class name.
You could use something like soup.select('style:-soup-contains(".css-1nuumx7")') [the css-1nuumx7 part is specific to the rating mentioned above], but :-soup-contains needs the html5lib parser and can be a bit slow, so it's better to look up the style tag by its data-emotion-css attribute instead:
def getDECstars(starCont, mSoup, outOf=5, isv=False):
    # find the container's "css-..." class; it links to a <style data-emotion-css="..."> tag
    classList = starCont.get('class', [])
    if type(classList) != list: classList = [classList]
    classList = [str(c) for c in classList if str(c).startswith('css-')]
    if not classList:
        if isv: print('Stars container has no "css-" class')
        return None
    demc = classList[0].replace('css-', '', 1)
    demc_sel = f'style[data-emotion-css="{demc}"]'
    cssStyle = mSoup.select_one(demc_sel)
    if not cssStyle:
        if isv: print(f'Nothing found with selector {demc_sel}')
        return None
    cssStyle = cssStyle.get_text()
    # the green gradient stop ("90deg,#0caa41 NN%") encodes the star-fill percentage
    errMsg = ''
    if '90deg,#0caa41 ' not in cssStyle: errMsg += 'No #0caa41'
    if '%' not in cssStyle.split('90deg,#0caa41 ', 1)[-1][:20]:
        errMsg += ' No %'
    if not errMsg:
        rPerc = cssStyle.split('90deg,#0caa41 ', 1)[-1]
        rPerc = rPerc.split('%')[0]
        try:
            rPerc = float(rPerc)
            if 0 <= rPerc <= 100:
                # scale the percentage to a rating out of `outOf` stars
                if type(outOf) == int and outOf > 0: rPerc = (rPerc/100)*outOf
                return float(f'{float(rPerc):.3}')
            errMsg = f'{demc_sel} --> "{rPerc}" is out of range'
        except: errMsg = f'{demc_sel} --> cannot convert to float "{rPerc}"'
    if isv: print(f'{demc_sel} --> unexpected format {errMsg}')
    return None
OR, if you don't care so much about why there's a missing rating:
def getDECstars(starCont, mSoup, outOf=5, isv=False):
    try:
        demc = [c for c in starCont.get('class', []) if c[:4]=='css-'][0].replace('css-', '', 1)
        demc_sel = f'style[data-emotion-css="{demc}"]'
        rPerc = float(mSoup.select_one(demc_sel).get_text().split('90deg,#0caa41 ', 1)[1].split('%')[0])
        return float(f'{(rPerc/100)*outOf if type(outOf) == int and outOf > 0 else rPerc:.3}')
    except: return None
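To sanity-check either version, here is a made-up fragment shaped like the markup described above (hypothetical HTML, not copied from Glassdoor; the green gradient stop is what encodes the fill):
html = '''
<div class="css-1nuumx7"></div>
<style data-emotion-css="1nuumx7">
  .css-1nuumx7 { background: linear-gradient(90deg,#0caa41 70%,#dee0e3 70%) }
</style>'''
msoup = BeautifulSoup(html, 'html.parser')
print(getDECstars(msoup.div, msoup))  # 3.5, i.e. 70% of 5 stars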
Here's an example of how you might use it:
pcCon = 'div.px-std:has(h2 > a.reviewLink) + div.px-std'
pcDiv = f'{pcCon} div.v2__EIReviewDetailsV2__fullWidth'
refDict = {
    'rating_num': 'span.ratingNumber',
    'emp_status': 'div:has(> div > span.ratingNumber) + span',
    'header': 'h2 > a.reviewLink',
    'subheader': 'h2:has(> a.reviewLink) + span',
    'pros': f'{pcDiv}:first-of-type > p.pb',
    'cons': f'{pcDiv}:nth-of-type(2) > p.pb'
}
subRatSel = 'div:has(> .ratingNumber) ~ aside ul > li:has(div ~ div)'

empRevs = []
for r in soup.select('li[id^="empReview_"]'):
    rDet = {'reviewId': r.get('id')}
    for sr in r.select(subRatSel):
        k = sr.select_one('div:first-of-type').get_text(' ').strip()
        sval = getDECstars(sr.select_one('div:nth-of-type(2)'), soup)
        rDet[f'[rating] {k}'] = sval
    for k, sel in refDict.items():
        sval = r.select_one(sel)
        if sval: sval = sval.get_text(' ').strip()
        rDet[k] = sval
    empRevs.append(rDet)
If empRevs is viewed as a table:
| reviewId | [rating] Work/Life Balance | [rating] Culture & Values | [rating] Diversity & Inclusion | [rating] Career Opportunities | [rating] Compensation and Benefits | [rating] Senior Management | rating_num | emp_status | header | subheader | pros | cons |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| empReview_71400593 | 5 | 4 | 4 | 4 | 5 | 3 | 3 |  | great pay but bit of obnoxious enviornment | Nov 26, 2022 - Sales Associate/Cashier in Bensalem, PA | -Walmart's fair pay policy is ... | -some locations wont build emp... |
| empReview_70963705 | 3 | 3 | 2 | 2 | 2 | 2 | 2 | Former Employee | Walmart Employees Trained Thrown to the Wolves | Nov 10, 2022 - Data Entry | Getting a snack at break was e... | I worked at Walmart for a very... |
| empReview_71415031 | 4 | 4 | 4 | 4 | 4 | 4 | 5 | Current Employee, more than 1 year | Work | Nov 27, 2022 - Warehouse Associate in Springfield, GA | The money there is good during... | It can get stressful at times ... |
| empReview_69136451 | nan | nan | nan | nan | nan | nan | 4 | Current Employee | Walmart | Sep 16, 2022 - Sales Associate/Cashier | I'm a EXPERIENCED WORKER. I ✨... | In my opinion I believe that W... |
| empReview_71398525 | 4 | 3 | 4 | 3 | 4 | 3 | 4 | Current Employee | Depends heavily on your team | Nov 26, 2022 - Personal Digital Shopper | I have a generally excellent t... | Generally, departments are sho... |
| empReview_71227029 | 1 | 1 | 1 | 1 | 3 | 1 | 1 | Former Employee, less than 1 year | Managers are treated like a slave. | Nov 19, 2022 - Auto Care Center Manager (ACCM) in Cottonwood, AZ | Great if you like working with... | you only get to work in your a... |
| empReview_71329467 | 1 | 3 | 3 | 3 | 4 | 1 | 1 | Current Employee, more than 3 years | No more values | Nov 23, 2022 - GM Coach in Houston, TX | Pay compare to other retails a... | Walmart is not a bad company t... |
| empReview_71512609 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | Former Employee | Walmart midnight stocker | Nov 30, 2022 - Midnight Stocker in Taylor, MI | 2 paid 15 min breaks and 1 hou... | Honestly nothing that I can th... |
| empReview_70585957 | 3 | 4 | 4 | 4 | 4 | 4 | 4 | Former Employee | Lots of Opportunity | Oct 28, 2022 - Human Resources People Lead | Plenty of opportunities if one... | As with any job, management is... |
| empReview_71519435 | 3 | 4 | 4 | 5 | 4 | 4 | 5 | Current Employee, more than 3 years | Lot of work but worth it | Nov 30, 2022 - People Lead | I enjoy making associates live... | Sometimes an overwhelming amou... |
Markdown for the table above was printed with pandas:
erdf = pandas.DataFrame(empRevs).set_index('reviewId')
erdf['pros'] = [p[:30] + '...' if len(p) > 33 else p for p in erdf['pros']]
erdf['cons'] = [p[:30] + '...' if len(p) > 33 else p for p in erdf['cons']]
print(erdf.to_markdown())
I found this generic code online.
import pandas as pd
import holoviews as hv
from holoviews import opts, dim
from bokeh.sampledata.les_mis import data
hv.extension('bokeh')
hv.output(size=200)
links = pd.DataFrame(data['links'])
print(links.head(3))
hv.Chord(links)
nodes = hv.Dataset(pd.DataFrame(data['nodes']), 'index')
nodes.data.head()
chord = hv.Chord((links, nodes)).select(value=(5, None))
chord.opts(
    opts.Chord(cmap='Category20', edge_cmap='Category20', edge_color=dim('source').str(),
               labels='name', node_color=dim('index').str()))
That renders a chord diagram, which looks nice.
The sample data is sourced from here: https://holoviews.org/reference/elements/bokeh/Chord.html
Apparently, 'links' is a pandas dataframe and 'nodes' is a holoviews dataset, and the types are like this:
<class 'pandas.core.frame.DataFrame'>
<class 'holoviews.core.data.Dataset'>
So, my question is this: how can I feed a dataframe into a Chord diagram? Here is my sample dataframe. Also, I don't know how to incorporate the <class 'holoviews.core.data.Dataset'> into the mix.
I think your data does not match the requirements of this function. Let me explain why I think so.
The Chord function expects at least one dataset (this can be a pandas DataFrame) with three columns, where all elements are numbers.
source target value
0 1 0 1
1 2 0 8
2 3 0 10
A second dataset is optional. This one can take strings in its second column, to add labels for example.
index name group
0 0 a 0
1 1 b 0
2 2 c 0
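Put together, a minimal self-contained sketch of those two tables (node 3 is added here so that every link endpoint exists):
import pandas as pd
import holoviews as hv
from holoviews import opts
hv.extension('bokeh')

links = pd.DataFrame({'source': [1, 2, 3], 'target': [0, 0, 0], 'value': [1, 8, 10]})
nodes = hv.Dataset(pd.DataFrame({'index': [0, 1, 2, 3],
                                 'name': ['a', 'b', 'c', 'd'],
                                 'group': [0, 0, 0, 0]}), 'index')
hv.Chord((links, nodes)).opts(opts.Chord(labels='name'))  # label nodes by name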
Basic Example
Your given data looks like this.
Measure Country Value
0 Arrivals Greece 1590
1 Arrivals Spain 1455
2 Arrivals France 1345
3 Arrivals Iceland 1100
4 Arrivals Iceland 1850
5 Departures America 2100
6 Departures Ireland 1000
7 Departures America 950
8 Departures Ireland 1200
9 Departures Japan 1050
You can bring your data into the basic form if you replace the strings in your DataFrame df with numbers, like this:
_df = df.copy()
values = list(_df.Measure.unique()) + list(_df.Country.unique())
d = {value: i for i, value in enumerate(values)}

def str2num(s):
    return d[s]

_df.Measure = _df.Measure.apply(str2num)
_df.Country = _df.Country.apply(str2num)
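Equivalently, the two apply(str2num) calls can be written with map, reusing the same lookup dict d:
_df.Measure = _df.Measure.map(d)
_df.Country = _df.Country.map(d)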
>>> _df
Measure Country Value
0 0 2 1590
1 0 3 1455
2 0 4 1345
3 0 5 1100
4 0 5 1850
5 1 6 2100
6 1 7 1000
7 1 6 950
8 1 7 1200
9 1 8 1050
Now your data matches the basic conditions and you can create a Chord diagram.
chord = hv.Chord(_df).select(value=(5, None))
chord.opts(
    opts.Chord(cmap='Category20', edge_cmap='Category20',
               edge_color=dim('Measure').str(),
               labels='Country',
               node_color=dim('index').str()))
As you can see, all the connection lines have only one of two colors. This is because the Measure column contains only two distinct elements. Therefore I think this is not what you want.
Modified Example
Let's modify your data a tiny bit:
_list = list(df.Country.values)
new_df = pd.DataFrame({'From':_list, 'To':_list[3:]+_list[:3], 'Value':df.Value})
>>> new_df
From To Value
0 Greece Iceland 1590
1 Spain Iceland 1455
2 France America 1345
3 Iceland Ireland 1100
4 Iceland America 1850
5 America Ireland 2100
6 Ireland Japan 1000
7 America Greece 950
8 Ireland Spain 1200
9 Japan France 1050
and:
node = pd.DataFrame()
for i, value in enumerate(df.Measure.unique()):
    _list = list(df[df['Measure']==value].Country.unique())
    node = pd.concat([node, pd.DataFrame({'Name':_list, 'Group':i})], ignore_index=True)
>>> node
Name Group
0 Greece 0
1 Spain 0
2 France 0
3 Iceland 0
4 America 1
5 Ireland 1
6 Japan 1
Now we have to replace the strings in new_df again, and then we can call the Chord function once more.
values = list(df.Country.unique())
d = {value: i for i, value in enumerate(values)}

def str2num(s):
    return d[s]

new_df.From = new_df.From.apply(str2num)
new_df.To = new_df.To.apply(str2num)

hv.Chord(new_df)
nodes = hv.Dataset(pd.DataFrame(node), 'index')
chord = hv.Chord((new_df, nodes)).select(value=(5, None))
chord.opts(
    opts.Chord(cmap='Category20', edge_cmap='Category20', edge_color=dim('From').str(),
               labels='Name', node_color=dim('index').str()))
There are now two groups added to the HoverTool.
I have this dataframe, df_pm:
Player GameWeek Minutes \
PlayerMatchesDetailID
1 Alisson 1 90
2 Virgil van Dijk 1 90
3 Joseph Gomez 1 90
ForTeam AgainstTeam \
1 Liverpool Norwich City
2 Liverpool Norwich City
3 Liverpool Norwich City
Goals ShotsOnTarget ShotsInBox CloseShots \
1 0 0 0 0
2 1 1 1 1
3 0 0 0 0
TotalShots Headers GoalAssists ShotOnTargetCreated \
1 0 0 0 0
2 1 1 0 0
3 0 0 0 0
ShotInBoxCreated CloseShotCreated TotalShotCreated \
1 0 0 0
2 0 0 0
3 0 0 1
HeadersCreated
1 0
2 0
3 0
this second dataframe, df_melt:
MatchID GameWeek Date Team Home \
0 46605 1 2019-08-09 Liverpool Home
1 46605 1 2019-08-09 Norwich City Away
2 46606 1 2019-08-10 AFC Bournemouth Home
AgainstTeam
0 Norwich City
1 Liverpool
2 Sheffield United
3 AFC Bournemouth
...
575 Sheffield United
576 Newcastle United
577 Southampton
and this snippet, which uses both:
match_ids = []
home_away = []
dates = []
# For each row in the player matches dataframe...
for row in df_pm.itertuples():
    # Look up the match id from the team matches dataframe
    team = row.ForTeam
    againstteam = row.AgainstTeam
    gameweek = row.GameWeek
    print(team, againstteam, gameweek)
    match_id = df_melt.loc[(df_melt['GameWeek']==gameweek)
                           &(df_melt['Team']==team)
                           &(df_melt['AgainstTeam']==againstteam),
                           'MatchID'].item()
    date = df_melt.loc[(df_melt['GameWeek']==gameweek)
                       &(df_melt['Team']==team)
                       &(df_melt['AgainstTeam']==againstteam),
                       'Date'].item()
    home = df_melt.loc[(df_melt['GameWeek']==gameweek)
                       &(df_melt['Team']==team)
                       &(df_melt['AgainstTeam']==againstteam),
                       'Home'].item()
    match_ids.append(match_id)
    home_away.append(home)
    dates.append(date)
At the first iteration, the print outputs:
Liverpool Norwich City 1
But I'm getting the error:
Traceback (most recent call last):
File "tableau_data_generation.py", line 166, in <module>
'MatchID'].item()
File "/Users/me/anaconda2/envs/data_science/lib/python3.7/site-packages/pandas/core/base.py", line 652, in item
return self.values.item()
ValueError: can only convert an array of size 1 to a Python scalar
Printing the whole df_melt dataframe, I see that these four datetime values are flawed:
540 46875 28 TBC Aston Villa Home
541 46875 28 TBC Sheffield United Away
...
548 46879 28 TBC Manchester City Home
549 46879 28 TBC Arsenal Away
How do I fix this?
When you use item() on a Series, you should actually have received:
FutureWarning: `item` has been deprecated and will be removed in a future version
Since item() was deprecated in version 0.25.0, it looks like you are using an outdated version of pandas, and you should probably start by upgrading it.
Even in newer versions of pandas you can still use item(), but on a NumPy array (at least for now, it is not deprecated there).
So change your code to:
df_melt.loc[...].values.item()
Another option is to use iloc[0], so you can also change your code to:
df_melt.loc[...].iloc[0]
Edit
The above solution can still raise an exception (IndexError) if df_melt contains no row meeting the given criteria.
To make your code resistant to such cases (and return some default value), you can add a function that retrieves the given attribute (attr, actually a column) from the first row meeting the given criteria (gameweek, team, and againstteam):
def getAttr(gameweek, team, againstteam, attr, default=None):
    xx = df_melt.loc[(df_melt['GameWeek'] == gameweek)
                     & (df_melt['Team'] == team)
                     & (df_melt['AgainstTeam'] == againstteam)]
    return default if xx.empty else xx.iloc[0].loc[attr]
Then, instead of all three ... = df_melt.loc[...].item() instructions, run:
match_id = getAttr(gameweek, team, againstteam, 'MatchID', default=-1)
date = getAttr(gameweek, team, againstteam, 'Date')
home = getAttr(gameweek, team, againstteam, 'Home', default='????')
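If the loop gets slow on bigger frames, a merge-based alternative is possible (a sketch, not the original approach; it assumes GameWeek/Team/AgainstTeam uniquely identify a row in df_melt):
df_pm = df_pm.merge(
    df_melt[['GameWeek', 'Team', 'AgainstTeam', 'MatchID', 'Date', 'Home']],
    left_on=['GameWeek', 'ForTeam', 'AgainstTeam'],
    right_on=['GameWeek', 'Team', 'AgainstTeam'],
    how='left')  # unmatched rows get NaN instead of raising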
I'm trying to get the following output, but I'm stuck on getting the Total.
Here is my code:
import pandas as pd

def generate_invoice_summary_info():
    file_path = 'output.xlsx'
    df = pd.read_excel(file_path, sheet_name='Invoice Details', usecols="E:F,I,L:M")
    df['Price'] = df['Price'].astype(float)
    # df['Total'] = df.groupby(["Invoice Cost Centre", "Invoice Category"]).agg({'Price': 'sum'}).reset_index()
    df = pd.pivot_table(df, index=["Invoice Cost Centre", "Invoice Category"],
                        columns=['Price', 'Reporting Frequency', 'Data Feed'],
                        aggfunc=len, fill_value=0, margins=True)
    print(df.head())
    df.to_excel('a.xlsx', sheet_name='Invoice Summary')
The above code produces the following output (90% right).
I got stuck on finding the Total column. The Total column should be calculated for each row as count multiplied by price:
Total = count * price
How can I do that in a pivot table? I used the margins attribute, but it gives the row sum only.
Edit
print(df):
Price 10.4 ... 85.0 All
Reporting Frequency M ... M
Data Feed BWH EMAIL ... StarBOS
Invoice Cost Centre Invoice Category ...
D3TM Reseller Non Equity 21 10 ... 0 125
EQUITYEMP Baileys 0 7 ... 0 10
Energy NSW 16 0 ... 0 32
Far North Queensland 3 0 ... 0 6
South East 6 0 ... 0 16
Cooper & Dysart 0 0 ... 0 3
Petro Fuel & Lubricants 8 0 ... 0 20
South East QLD Fuels 0 0 ... 0 19
R1M Retail QLD 60 0 ... 0 867
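One possible approach to the Total column (a sketch, not tested against this exact sheet): build the pivot without margins, multiply every count column by the Price value sitting in its column header, and sum across each row. Here df is assumed to be the frame as read from Excel, before pivoting:
pivot = pd.pivot_table(df,
                       index=["Invoice Cost Centre", "Invoice Category"],
                       columns=['Price', 'Reporting Frequency', 'Data Feed'],
                       aggfunc=len, fill_value=0)
prices = pivot.columns.get_level_values('Price').to_numpy(dtype=float)  # price of each count column
pivot['Total'] = pivot.mul(prices, axis=1).sum(axis=1)                  # row total = sum of count * price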
I have a dataframe named growth with 4 columns.
State Name Average Fare ($)_x Average Fare ($)_y Average Fare ($)
0 AK 599.372368 577.790640 585.944324
1 AL 548.825867 545.144447 555.939466
2 AR 496.033146 511.867026 513.761296
3 AZ 324.641818 396.895324 389.545267
4 CA 368.937971 376.723839 366.918761
5 CO 502.611572 537.206439 531.191893
6 CT 394.105453 388.772428 370.904182
7 DC 390.872738 382.326510 392.394165
8 FL 324.941100 329.728524 337.249248
9 GA 485.335737 480.606365 489.574241
10 HI 326.084793 335.547369 298.709998
11 IA 428.151682 445.625840 462.614195
12 ID 482.092567 475.822275 491.714945
13 IL 329.449503 349.938794 346.022226
14 IN 391.627917 418.945137 412.242053
15 KS 452.312058 490.024059 420.182836
The last three columns are the average fare for each year for each state, the 2nd, 3rd, and 4th columns being years 2017, 2018, and 2019 respectively.
I want to find out which state has the highest growth in fare since 2017.
I tried this code of mine, and it gives some output that I can't really understand.
I just need to find the state that has the highest fare growth since 2017.
My code:
growth[['Average Fare ($)_x','Average Fare ($)_y','Average Fare ($)']].pct_change()
You can use this:
df.set_index('State_name').pct_change(periods = 1, axis='columns').idxmax()
Change the periods value to 2 if you want to calculate the difference between first year & the 3rd year.
output
Average_fare_x NaN
Average_fare_y AZ #state with max change between 1st & 2nd year
Average_fare WV #state with max change between 2nd & 3rd year
growth[['Average Fare ($)_x','Average Fare ($)_y','Average Fare ($)']].pct_change(axis='columns')
This should give you the percentage change between each year.
growth['variation_percentage'] = growth[['Average Fare ($)_x','Average Fare ($)_y','Average Fare ($)']].pct_change(axis='columns').sum(axis=1)
This should give you the cumulative percentage change.
Since you are talking about price variation, the total growth/decrease in fare will be the variation from 2017 to your last available data (2019). Therefore you can compute this ratio and then just take the max() to find the row with the most growth.
growth['variation_fare'] = growth['Average Fare ($)'] / growth['Average Fare ($)_x']
growth = growth.sort_values(['variation_fare'],ascending=False)
print(growth.head(1))
Example:
import pandas as pd
a = {'State':['AK','AL','AR','AZ','CA'],'2017':[100,200,300,400,500],'2018':[120,242,324,457,592],'2019':[220,393,484,593,582]}
growth = pd.DataFrame(a)
growth['2018-2017 variation'] = (growth['2018'] / growth['2017']) - 1
growth['2019-2018 variation'] = (growth['2019'] / growth['2018']) - 1
growth['total variation'] = (growth['2019'] / growth['2017']) - 1
growth = growth.sort_values(['total variation'],ascending=False)
print(growth.head(5)) #Showing top 5
Output:
State 2017 2018 2019 2018-2017 variation 2019-2018 variation total variation
0 AK 100 120 220 0.2000 0.833333 1.200000
1 AL 200 242 393 0.2100 0.623967 0.965000
2 AR 300 324 484 0.0800 0.493827 0.613333
3 AZ 400 457 593 0.1425 0.297593 0.482500
4 CA 500 592 582 0.1840 -0.016892 0.164000