I have a dataframe df1
    index   A   B   C   D   E
0       0  92  84
1       1  98  49
2       2  49  68
3       3   0  58
4       4  91  95
5       5  47  56  52  25  58
6       6  86  71  34  39  40
7       7  80  78   0  86  12
8       8   0   8  30  88  42
9       9  69  83   7  65  60
10     10  93  39  10  90  45
and also this data frame df2
   index   C   D   E   F
0      0  27  95  51  45
1      1  99  33  92  67
2      2  68  37  29  65
3      3  99  25  48  40
4      4  33  74  55  66
5     13  65  76  19  62
I wish to get to the following outcome when merging df1 and df2
    index   A   B   C   D   E   F
0       0  92  84  27  95  51  45
1       1  98  49  99  33  92  67
2       2  49  68  68  37  29  65
3       3   0  58  99  25  48  40
4       4  91  95  33  74  55  66
5       5  47  56  52  25  58
6       6  86  71  34  39  40
7       7  80  78   0  86  12
8       8   0   8  30  88  42
9       9  69  83   7  65  60
10     10  93  39  10  90  45
11     13          65  76  19  62
However, I keep getting this when using pd.merge():
df_total=df1.merge(df2,how="outer",on="index",suffixes=(None,"_"))
df_total.replace(to_replace=np.nan,value=" ", inplace=True)
df_total
    index   A   B   C   D   E  C_  D_  E_   F
0       0  92  84              27  95  51  45
1       1  98  49              99  33  92  67
2       2  49  68              68  37  29  65
3       3   0  58              99  25  48  40
4       4  91  95              33  74  55  66
5       5  47  56  52  25  58
6       6  86  71  34  39  40
7       7  80  78   0  86  12
8       8   0   8  30  88  42
9       9  69  83   7  65  60
10     10  93  39  10  90  45
11     13                      65  76  19  62
Is there a way to get the desired outcome using pd.merge() or a similar function?
Thanks
You can use .combine_first():
import numpy as np

# convert the empty cells ("") to NaNs
df1 = df1.replace("", np.nan)
df2 = df2.replace("", np.nan)
# set indices and combine the dataframes
df1 = df1.set_index("index")
print(df1.combine_first(df2.set_index("index")).reset_index().fillna(""))
Prints:
    index     A     B     C     D     E     F
0       0  92.0  84.0  27.0  95.0  51.0  45.0
1       1  98.0  49.0  99.0  33.0  92.0  67.0
2       2  49.0  68.0  68.0  37.0  29.0  65.0
3       3   0.0  58.0  99.0  25.0  48.0  40.0
4       4  91.0  95.0  33.0  74.0  55.0  66.0
5       5  47.0  56.0  52.0  25.0  58.0
6       6  86.0  71.0  34.0  39.0  40.0
7       7  80.0  78.0   0.0  86.0  12.0
8       8   0.0   8.0  30.0  88.0  42.0
9       9  69.0  83.0   7.0  65.0  60.0
10     10  93.0  39.0  10.0  90.0  45.0
11     13              65.0  76.0  19.0  62.0
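Note the trailing .0s: the unmatched cells are NaN before the final fillna(""), which forces the numeric columns to float. If you want the integers back, one option (a sketch I'm adding, not part of the original answer) is to convert each cell explicitly:

import pandas as pd

out = df1.combine_first(df2.set_index("index")).reset_index()
# blank out the NaNs and cast every remaining value back to int
out = out.apply(lambda col: col.map(lambda v: "" if pd.isna(v) else int(v)))
print(out)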
I have used seaborn's Titanic dataset as a proxy for my very large dataset, and created the chart and summary data below from it.
The following code runs without any errors:
import seaborn as sns
import pandas as pd
import numpy as np
sns.set_theme(style="darkgrid")
# Load the example Titanic dataset
df = sns.load_dataset("titanic")
# split fare into decile groups and order them
df['fare_grp'] = pd.qcut(df['fare'], q=10,labels=None, retbins=False, precision=0).astype(str)
df.groupby(['fare_grp'],dropna=False).size()
df['fare_grp_num'] = pd.qcut(df['fare'], q=10,labels=False, retbins=False, precision=0).astype(str)
df.groupby(['fare_grp_num'],dropna=False).size()
df['fare_ord_grp'] = df['fare_grp_num'] + ' ' +df['fare_grp']
df['fare_ord_grp']
# set variables
target = 'survived'
ydim = 'fare_ord_grp'
xdim = 'embark_town'
non_events = pd.DataFrame(df[df[target]==0].groupby([ydim,xdim],as_index=False, dropna=False)[target].count()).rename(columns={target: 'non_events'})
non_events[xdim]=non_events[xdim].replace(np.nan, 'Missing', regex=True)
non_events[ydim]=non_events[ydim].replace(np.nan, 'Missing', regex=True)
non_events_total = pd.DataFrame(df[df[target]==0].groupby([xdim],dropna=False,as_index=False)[target].count()).rename(columns={target: 'non_events_total_by_xdim'}).replace(np.nan, 'Missing', regex=True)
events = pd.DataFrame(df[df[target]==1].groupby([ydim,xdim],as_index=False, dropna=False)[target].count()).rename(columns={target: 'events'})
events[xdim]=events[xdim].replace(np.nan, 'Missing', regex=True)
events[ydim]=events[ydim].replace(np.nan, 'Missing', regex=True)
events_total = pd.DataFrame(df[df[target]==1].groupby([xdim],dropna=False,as_index=False)[target].count()).rename(columns={target: 'events_total_by_xdim'}).replace(np.nan, 'Missing', regex=True)
grand_total = pd.DataFrame(df.groupby([xdim],dropna=False,as_index=False)[target].count()).rename(columns={target: 'total_by_xdim'}).replace(np.nan, 'Missing', regex=True)
grand_total=grand_total.merge(non_events_total, how='left', on=xdim).merge(events_total, how='left', on=xdim)
result = pd.merge(non_events, events, how="outer",on=[ydim,xdim])
result['total'] = result['non_events'].fillna(0) + result['events'].fillna(0)
result[xdim] = result[xdim].replace(np.nan, 'Missing', regex=True)
result = pd.merge(result, grand_total, how="left",on=[xdim])
result['survival rate %'] = round(result['events']/result['total']*100,2)
result['% event dist by xdim'] = round(result['events']/result['events_total_by_xdim']*100,2)
result['% non-event dist by xdim'] = round(result['non_events']/result['non_events_total_by_xdim']*100,2)
result['% total dist by xdim'] = round(result['total']/result['total_by_xdim']*100,2)
display(result)
value_name1 = "% dist by " + str(xdim)
dfl = pd.melt(result, id_vars=[ydim, xdim],value_vars =['% total dist by xdim'], var_name = 'Type',value_name=value_name1).drop(columns='Type')
dfl2 = dfl.pivot(index=ydim, columns=xdim, values=value_name1)
print(dfl2)
title1 = "% dist by " + str(xdim)
ax=dfl2.T.plot(kind='bar', stacked=True, rot=1, figsize=(8, 8), title=title1)
ax.set_xticklabels(ax.get_xticklabels(), rotation=45)
ax.legend(bbox_to_anchor=(1.0, 1.0),title = 'Fare Range')
ax.set_ylabel('% Dist')
for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy()
    ax.text(x+width/2, y+height/2, '{:.0f}%'.format(height), horizontalalignment='center', verticalalignment='center')
It produces the following stacked percent bar chart, which shows the % of total distribution by embark town.
I also want to show the survival rate along with the %distribution in each block. For example, for Queenstown, fare range 1 (7.6, 7.9], the % total distribution is 56%. I want to display the survival rate 37.21% as (56%, 37.21%). I am not able to figure it out. Kindly offer any suggestions. Thanks.
Here is the output summary table for reference
    fare_ord_grp    embark_town  non_events  events  total  total_by_xdim  non_events_total_by_xdim  events_total_by_xdim  survival rate %  % event dist by xdim  % non-event dist by xdim  % total dist by xdim
0   0 (-0.1,7.6]    Cherbourg            22       7     29            168                        75                    93            24.14                  7.53                     29.33                 17.26
1   0 (-0.1,7.6]    Queenstown            4     NaN      4             77                        47                    30              NaN                   NaN                      8.51                  5.19
2   0 (-0.1,7.6]    Southampton          53       6     59            644                       427                   217            10.17                  2.76                     12.41                  9.16
3   1 (7.6,7.9]     Queenstown           27      16     43             77                        47                    30            37.21                 53.33                     57.45                 55.84
4   1 (7.6,7.9]     Southampton          34      10     44            644                       427                   217            22.73                  4.61                      7.96                  6.83
5   2 (7.9,8]       Cherbourg             4       1      5            168                        75                    93               20                  1.08                      5.33                  2.98
6   2 (7.9,8]       Southampton          83      13     96            644                       427                   217            13.54                  5.99                     19.44                 14.91
7   3 (8.0,10.5]    Cherbourg             2       1      3            168                        75                    93            33.33                  1.08                      2.67                  1.79
8   3 (8.0,10.5]    Queenstown            2     NaN      2             77                        47                    30              NaN                   NaN                      4.26                   2.6
9   3 (8.0,10.5]    Southampton          56      17     73            644                       427                   217            23.29                  7.83                     13.11                 11.34
10  4 (10.5,14.5]   Cherbourg             7       8     15            168                        75                    93            53.33                   8.6                      9.33                  8.93
11  4 (10.5,14.5]   Queenstown            1       2      3             77                        47                    30            66.67                  6.67                      2.13                   3.9
12  4 (10.5,14.5]   Southampton          40      26     66            644                       427                   217            39.39                 11.98                      9.37                 10.25
13  5 (14.5,21.7]   Cherbourg             9      10     19            168                        75                    93            52.63                 10.75                        12                 11.31
14  5 (14.5,21.7]   Queenstown            5       3      8             77                        47                    30             37.5                    10                     10.64                 10.39
15  5 (14.5,21.7]   Southampton          37      24     61            644                       427                   217            39.34                 11.06                      8.67                  9.47
16  6 (21.7,27]     Cherbourg             1       4      5            168                        75                    93               80                   4.3                      1.33                  2.98
17  6 (21.7,27]     Queenstown            2       3      5             77                        47                    30               60                    10                      4.26                  6.49
18  6 (21.7,27]     Southampton          40      39     79            644                       427                   217            49.37                 17.97                      9.37                 12.27
19  7 (27.0,39.7]   Cherbourg            14      10     24            168                        75                    93            41.67                 10.75                     18.67                 14.29
20  7 (27.0,39.7]   Queenstown            5     NaN      5             77                        47                    30              NaN                   NaN                     10.64                  6.49
21  7 (27.0,39.7]   Southampton          38      24     62            644                       427                   217            38.71                 11.06                       8.9                  9.63
22  8 (39.7,78]     Cherbourg             5      19     24            168                        75                    93            79.17                 20.43                      6.67                 14.29
23  8 (39.7,78]     Southampton          37      28     65            644                       427                   217            43.08                  12.9                      8.67                 10.09
24  9 (78.0,512.3]  Cherbourg            11      33     44            168                        75                    93               75                 35.48                     14.67                 26.19
25  9 (78.0,512.3]  Queenstown            1       1      2             77                        47                    30               50                  3.33                      2.13                   2.6
26  9 (78.0,512.3]  Southampton           9      30     39            644                       427                   217            76.92                 13.82                      2.11                  6.06
27  2 (7.9,8]       Queenstown          NaN       5      5             77                        47                    30              100                 16.67                       NaN                  6.49
28  9 (78.0,512.3]  Missing             NaN       2      2              2                       NaN                     2              100                   100                       NaN                   100
dfl2.T is being plotted, but 'survival rate %' is in result. As such, the indices for the values from dfl2.T do not correspond with 'survival rate %'.
Because the values in result['% total dist by xdim'] are not unique, we can't use a dict of matched key-value pairs.
Create a corresponding pivoted DataFrame for 'survival rate %', and then flatten it. All of the values will be in the same order as the '% total dist by xdim' values from dfl2.T. As such, they can be indexed.
With respect to dfl2.T, the plot API plots in column order, which means .flatten(order='F') must be used to flatten the array in the correct order to be indexed.
# create a corresponding pivoted dataframe for survival rate %
dfl3 = pd.melt(result, id_vars=[ydim, xdim],value_vars =['survival rate %'], var_name = 'Type',value_name=value_name1).drop(columns='Type')
dfl4 = dfl3.pivot(index=ydim, columns=xdim, values=value_name1)
# flatten dfl4.T in column order
dfl4_flattened = dfl4.T.to_numpy().flatten(order='F')
for i, p in enumerate(ax.patches):
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy()
    # only print values when height is not 0
    if height != 0:
        # create the text string
        text = f'{height:.0f}%, {dfl4_flattened[i]:.0f}%'
        # annotate the bar segments
        ax.text(x+width/2, y+height/2, text, horizontalalignment='center', verticalalignment='center')
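One caveat (my addition, not part of the original answer): a segment can have a nonzero height while its survival rate is NaN, e.g. Queenstown in the 0 (-0.1, 7.6] fare bin, which would render as "5%, nan%". A small guard avoids that:

import numpy as np

for i, p in enumerate(ax.patches):
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy()
    if height != 0:
        rate = dfl4_flattened[i]
        # fall back to a dash when the survival rate is missing
        rate_text = f'{rate:.0f}%' if not np.isnan(rate) else '-'
        ax.text(x + width/2, y + height/2, f'{height:.0f}%, {rate_text}',
                horizontalalignment='center', verticalalignment='center')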
Notes
Here we can see dfl2.T and dfl4.T
# dfl2.T
fare_ord_grp 0 (-0.1, 7.6] 1 (7.6, 7.9] 2 (7.9, 8.0] 3 (8.0, 10.5] 4 (10.5, 14.5] 5 (14.5, 21.7] 6 (21.7, 27.0] 7 (27.0, 39.7] 8 (39.7, 78.0] 9 (78.0, 512.3]
embark_town
Cherbourg 17.26 NaN 2.98 1.79 8.93 11.31 2.98 14.29 14.29 26.19
Missing NaN NaN NaN NaN NaN NaN NaN NaN NaN 100.00
Queenstown 5.19 55.84 6.49 2.60 3.90 10.39 6.49 6.49 NaN 2.60
Southampton 9.16 6.83 14.91 11.34 10.25 9.47 12.27 9.63 10.09 6.06
# dfl4.T
fare_ord_grp 0 (-0.1, 7.6] 1 (7.6, 7.9] 2 (7.9, 8.0] 3 (8.0, 10.5] 4 (10.5, 14.5] 5 (14.5, 21.7] 6 (21.7, 27.0] 7 (27.0, 39.7] 8 (39.7, 78.0] 9 (78.0, 512.3]
embark_town
Cherbourg 24.14 NaN 20.00 33.33 53.33 52.63 80.00 41.67 79.17 75.00
Missing NaN NaN NaN NaN NaN NaN NaN NaN NaN 100.00
Queenstown NaN 37.21 100.00 NaN 66.67 37.50 60.00 NaN NaN 50.00
Southampton 10.17 22.73 13.54 23.29 39.39 39.34 49.37 38.71 43.08 76.92
I am trying to extract only the rented prices (green dots) from this site, but I can't find where the data is coming from. I want to use BeautifulSoup or Scrapy to do the web scraping. Is the data being imported as JSON, or how is it appearing on the website? Apologies for such a broad question; I am relatively new to the Python programming language. There are URLs in the source code that may lead to the data, but I can't figure it out. Any push in the right direction would be greatly appreciated.
Here is the website: https://www.redweek.com/whats-my-timeshare-worth/P5035-wyndham-bonnet-creek-resort/rental-historical
I am only going to help you find where the data is coming from; parsing the JSON is up to you. If you open the Network tab in Chrome's Developer Tools, you can see:
xhr?resort_id=5035&type=rental&active=0
Now, when you click on that, you will get the Request URL option on the right hand side. This is where the data is coming from:
https://redweek.com/whats-my-timeshare-worth/xhr?resort_id=5035&type=rental&active=0
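If you just want to confirm the payload before writing a parser, a quick sketch (the User-Agent header is an assumption on my part; the default requests agent is often blocked):

import requests

url = 'https://www.redweek.com/whats-my-timeshare-worth/xhr?resort_id=5035&type=rental&active=0'
data = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}).json()
print(list(data))  # peek at the top-level keys of the JSON payload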
Hamza said where to find it, but to reiterate: when you are on the site, right-click and select "Inspect" (or press Ctrl-Shift-I). In the right panel you'll find it under Network, XHR, Headers (you may need to reload the page once the panel is open).
Here's the code to turn that JSON into a table:
import pandas as pd
import requests
url = 'https://www.redweek.com/whats-my-timeshare-worth/xhr?resort_id=5035&type=rental&active=0'
headers = {'User-Agent': 'Mozilla/5.0'}
jsonData = requests.get(url, headers=headers).json()
cols = {}
for idx, each in enumerate(jsonData['cols']):
    cols.update({idx: each['label']})
cols.update({0: 'Week'})

rows = []
for row in jsonData['rows']:
    temp_row = {}
    for idx, each in enumerate(row['c']):
        temp_row.update({cols[idx]: each['v']})
    rows.append(temp_row)
df = pd.DataFrame(rows)
df['Price'] = df['Rented'].fillna(df['Unknown'])
df = df.drop(['Not Rented','Active posting','Rented','Unknown'],axis=1)
Output:
print(df)
Bedrooms Status Week Price
0 1 Unknown 52 120.0
1 1 Unknown 52 130.0
2 1 Unknown 53 120.0
3 1 Unknown 1 60.0
4 1 Unknown 3 140.0
5 1 Unknown 5 100.0
6 1 Unknown 11 170.0
7 1 Unknown 11 90.0
8 1 Unknown 20 90.0
9 1 Unknown 22 130.0
10 1 Unknown 23 100.0
11 1 Unknown 24 100.0
12 1 Unknown 24 180.0
13 1 Unknown 25 100.0
14 1 Unknown 27 90.0
15 1 Unknown 28 90.0
16 1 Unknown 29 90.0
17 1 Unknown 30 90.0
18 1 Unknown 47 100.0
19 1 Unknown 52 100.0
20 1 Unknown 1 140.0
21 1 Unknown 10 140.0
22 1 Unknown 12 130.0
23 1 Unknown 14 100.0
24 1 Unknown 14 160.0
25 1 Unknown 26 110.0
26 1 Unknown 34 90.0
27 1 Unknown 39 140.0
28 1 Unknown 43 160.0
29 1 Unknown 51 100.0
... ... ... ...
4035 3 Rented 12 250.0
4036 3 Rented 13 270.0
4037 3 Rented 14 230.0
4038 3 Rented 18 280.0
4039 3 Rented 27 180.0
4040 3 Rented 35 90.0
4041 4 Rented 53 330.0
4042 4 Rented 15 170.0
4043 4 Rented 14 310.0
4044 4 Rented 18 250.0
4045 4 Rented 19 250.0
4046 4 Rented 46 300.0
4047 4 Rented 18 250.0
4048 4 Rented 19 200.0
4049 4 Rented 8 190.0
4050 4 Rented 12 240.0
4051 4 Rented 27 200.0
4052 4 Rented 7 240.0
4053 4 Rented 18 200.0
4054 4 Rented 47 310.0
4055 4 Rented 45 210.0
4056 4 Rented 7 320.0
4057 4 Rented 51 320.0
4058 4 Rented 9 300.0
4059 4 Rented 15 220.0
4060 4 Rented 39 210.0
4061 4 Rented 41 280.0
4062 4 Rented 4 200.0
4063 4 Rented 5 260.0
4064 4 Rented 35 130.0
[4065 rows x 4 columns]
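Since the question only asks for the rented prices (the green dots), you can then filter the resulting frame on the Status column shown above:

rented = df[df['Status'] == 'Rented']
print(rented)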
I would like to apply a function to one pandas dataframe column which does the following task:
I have a cycle counter that starts from a value but sometimes restarts.
I would like to have the counter continue and increase its value.
The function I use at the moment is the following one:
Code
import pandas as pd
d = {'Cycle':[100,100,100,100,101,101,101,102,102,102,102,102,102,103,103,103,100,100,100,100,101,101,101,101]}
df = pd.DataFrame(data=d)
df.loc[:,'counter'] = df['Cycle'].to_numpy()
df.loc[:,'counter'] = df['counter'].rolling(2).apply(lambda x: x[0] if (x[0] == x[1]) else x[0]+1, raw=True)
print(df)
Output
Cycle counter
0 100 NaN
1 100 100.0
2 100 100.0
3 100 100.0
4 101 101.0
5 101 101.0
6 101 101.0
7 102 102.0
8 102 102.0
9 102 102.0
10 102 102.0
11 102 102.0
12 102 102.0
13 103 103.0
14 103 103.0
15 103 103.0
16 100 104.0
17 100 100.0
18 100 100.0
19 100 100.0
20 101 101.0
21 101 101.0
22 101 101.0
23 101 101.0
My goal is to get a dataframe similar to this one:
Cycle counter
0 100 NaN
1 100 100.0
2 100 100.0
3 100 100.0
4 101 101.0
5 101 101.0
6 101 101.0
7 102 102.0
8 102 102.0
9 102 102.0
10 102 102.0
11 102 102.0
12 102 102.0
13 103 103.0
14 103 103.0
15 103 103.0
16 100 104.0
17 100 104.0
18 100 104.0
19 100 104.0
20 101 105.0
21 101 105.0
22 101 105.0
23 101 105.0
How do I use the rolling function with one overlap?
Do you have any recommendation to reach my goal?
Best regards,
Matteo
Another approach is to identify the points in the Cycle column where the value changes using .diff(), increment from the original initial cycle value at those points, and then merge back onto the original dataframe, forward-filling the new values.
df2 = df[df['Cycle'].diff().apply(lambda x: x!=0)].reset_index()
df2['Target Count'] = df[df['Cycle'].diff().apply(lambda x: x!=0)].reset_index().reset_index().apply(lambda x: df.iloc[0,0] + x['level_0'], axis = 1)
df = df.merge(df2.drop('Cycle', axis = 1), right_on = 'index', left_index = True, how = 'left').ffill().set_index('index', drop = True)
del df.index.name
df
Cycle Target Count
0 100 100.0
1 100 100.0
2 100 100.0
3 100 100.0
4 101 101.0
5 101 101.0
6 101 101.0
7 102 102.0
8 102 102.0
9 102 102.0
10 102 102.0
11 102 102.0
12 102 102.0
13 103 103.0
14 103 103.0
15 103 103.0
16 100 104.0
17 100 104.0
18 100 104.0
19 100 104.0
20 101 105.0
21 101 105.0
22 101 105.0
23 101 105.0
We can use shift and ne (same as !=) to check where the Cycle column changes.
Then we use cumsum to make a counter which changes each time Cycle changes.
We add the first value of Cycle, minus 1, to the counter so that it starts at 100:
groups = df['Cycle'].ne(df['Cycle'].shift()).cumsum()
df['counter'] = groups + df['Cycle'].iat[0] - 1
Cycle counter
0 100 100
1 100 100
2 100 100
3 100 100
4 101 101
5 101 101
6 101 101
7 102 102
8 102 102
9 102 102
10 102 102
11 102 102
12 102 102
13 103 103
14 103 103
15 103 103
16 100 104
17 100 104
18 100 104
19 100 104
20 101 105
21 101 105
22 101 105
23 101 105
Details: groups gives us a counter starting at 1:
print(groups)
0 1
1 1
2 1
3 1
4 2
5 2
6 2
7 3
8 3
9 3
10 3
11 3
12 3
13 4
14 4
15 4
16 5
17 5
18 5
19 5
20 6
21 6
22 6
23 6
Name: Cycle, dtype: int64
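One caveat worth noting (my observation, not part of the original answer): this counter advances by exactly 1 at every change, so it only matches the raw Cycle values while Cycle itself increments in steps of 1, as in the sample data. Jumps such as 100 to 102 would be compressed:

import pandas as pd

s = pd.Series([100, 100, 102, 102])        # Cycle jumps from 100 to 102
groups = s.ne(s.shift()).cumsum()
print((groups + s.iat[0] - 1).tolist())    # [100, 100, 101, 101], not [100, 100, 102, 102]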
I have 2 data frames, df1 and df2, which both have the same format.
For example, df1 looks like this:
Date A B C D E
2018-03-01 1 40 30 30 70
2018-03-02 3 60 70 50 55
2018-03-03 4 60 70 45 80
2018-03-04 5 80 90 30 47
2018-03-05 3 40 40 37 20
df2 may look like this; the only difference is the start date:
Date A B C D E
2018-03-03 4 60 70 45 80
2018-03-04 5 80 90 30 47
2018-03-05 3 40 40 37 20
2018-03-06 7 55 26 46 42
2018-03-07 2 73 46 33 25
I want to append all the rows from df2 to df1, in this case, all the rows from 2018-03-06 so that df1 becomes:
Date A B C D E
2018-03-01 1 40 30 30 70
2018-03-02 3 60 70 50 55
2018-03-03 4 60 70 45 80
2018-03-04 5 80 90 30 47
2018-03-05 3 40 40 37 20
2018-03-06 7 55 26 46 42
2018-03-07 2 73 46 33 25
Note: df2 may skip 2018-03-06, so all rows from 2018-03-07 will be copied and appended if that's the case.
My dtype for df['Date'] is datetime64. I got an error when I tried to index the last_date of df1 to find the next_date to copy from df2.
>>> last_date = df1['Date'].tail(1)
>>> next_date = datetime.datetime(last_date) + datetime.timedelta(days=1)
TypeError: int() argument must be a string, a bytes-like object or a number, not 'Timestamp'
Alternatively, how would you copy all the rows in df2 (starting from the date after the last date of df1) and append them to df1? Thanks.
Option 1
Set Date as the index and use combine_first:
i = df1.set_index('Date')
j = df2[df2.Date.gt(df1.Date.max())].set_index('Date')
i.combine_first(j).reset_index()
Date A B C D E
0 2018-03-01 1.0 40.0 30.0 30.0 70.0
1 2018-03-02 3.0 60.0 70.0 50.0 55.0
2 2018-03-03 4.0 60.0 70.0 45.0 80.0
3 2018-03-04 5.0 80.0 90.0 30.0 47.0
4 2018-03-05 3.0 40.0 40.0 37.0 20.0
5 2018-03-06 7.0 55.0 26.0 46.0 42.0
6 2018-03-07 2.0 73.0 46.0 33.0 25.0
Option 2
concat + groupby
pd.concat([i, j]).groupby('Date').first().reset_index()
Date A B C D E
0 2018-03-01 1 40 30 30 70
1 2018-03-02 3 60 70 50 55
2 2018-03-03 4 60 70 45 80
3 2018-03-04 5 80 90 30 47
4 2018-03-05 3 40 40 37 20
5 2018-03-06 7 55 26 46 42
6 2018-03-07 2 73 46 33 25
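As for the TypeError in the question: df1['Date'].tail(1) returns a one-element Series, and datetime.datetime() does not accept a Timestamp (or a Series) as its single argument. If you do want to compute the cutoff date explicitly, a sketch along these lines works:

import pandas as pd

last_date = df1['Date'].iloc[-1]           # a scalar Timestamp, not a Series
next_date = last_date + pd.Timedelta(days=1)
df1 = pd.concat([df1, df2[df2['Date'] >= next_date]], ignore_index=True)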