Pandas groupby using range and type - python

I have a DataFrame with "room_type" and "review_scores_rating" as columns.
The DataFrame looks like this:
room_type review_scores_rating
0 Private room 98.0
1 Private room 89.0
2 Entire home/apt 100.0
3 Private room 99.0
4 Private room 97.0
I already used groupby, so I also have this DataFrame:
review_scores_rating
room_type
Entire home/apt 11930
Hotel room 97
Private room 3116
Shared room 44
I want to create a DataFrame with the different room types as columns, where each row counts how many listings fall into a given rating range.
I was able to get to this point:
count
review_scores_rating
(19.92, 30.0] 24
(30.0, 40.0] 23
(40.0, 50.0] 9
(50.0, 60.0] 97
(60.0, 70.0] 74
(70.0, 80.0] 486
(80.0, 90.0] 1701
(90.0, 100.0] 12773
But I don't know how to make it count not only by score range but also by room type, so I can know, for example, how many private rooms have a review score rating between 30 and 40.

You can use a crosstab with cut:
pd.crosstab(pd.cut(df['review_scores_rating'], bins=range(0, 101, 10)),
            df['room_type'])
Output:
room_type Entire home/apt Private room
review_scores_rating
(80, 90] 0 1
(90, 100] 1 3
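For a self-contained run, here is a minimal sketch using only the five sample rows from the question (the full dataset would also produce Hotel room and Shared room columns):
import pandas as pd

# the five sample rows from the question
df = pd.DataFrame({
    'room_type': ['Private room', 'Private room', 'Entire home/apt',
                  'Private room', 'Private room'],
    'review_scores_rating': [98.0, 89.0, 100.0, 99.0, 97.0],
})

# rows: 10-point rating bins; columns: room types; cells: counts
out = pd.crosstab(pd.cut(df['review_scores_rating'], bins=range(0, 101, 10)),
                  df['room_type'])
print(out)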
Or groupby.count:
df.groupby(['room_type',
            pd.cut(df['review_scores_rating'], bins=range(0, 101, 10))]).count()
Output:
review_scores_rating
room_type review_scores_rating
Entire home/apt (0, 10] 0
(10, 20] 0
(20, 30] 0
(30, 40] 0
(40, 50] 0
(50, 60] 0
(60, 70] 0
(70, 80] 0
(80, 90] 0
(90, 100] 1
Private room (0, 10] 0
(10, 20] 0
(20, 30] 0
(30, 40] 0
(40, 50] 0
(50, 60] 0
(60, 70] 0
(70, 80] 0
(80, 90] 1
(90, 100] 3
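If you want the room types as columns, as the question asks, one option (a sketch building on the groupby above) is to count rows with size and pivot the first index level into columns:
# size() counts rows per (room_type, bin) pair; unstack(0) moves
# room_type from the index into the columns, filling missing pairs with 0
counts = (df.groupby(['room_type',
                      pd.cut(df['review_scores_rating'], bins=range(0, 101, 10))])
            .size()
            .unstack(0, fill_value=0))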

How do I convert this dictionary to a DataFrame in Python?

I have a dictionary of student information in this format. I cannot change this; it is the output from another program I am trying to use.
student_info_dict = {
    "Student_1_Name": "Alice",
    "Student_1_Age": 23,
    "Student_1_Phone_Number": 1111,
    "Student_1_before_after": (120, 109),
    "Student_2_Name": "Bob",
    "Student_2_Age": 56,
    "Student_2_Phone_Number": 1234,
    "Student_2_before_after": (115, 107),
    "Student_3_Name": "Casie",
    "Student_3_Age": 47,
    "Student_3_Phone_Number": 4567,
    "Student_3_before_after": (180, 140),
    "Student_4_Name": "Donna",
    "Student_4_Age": 33,
    "Student_4_Phone_Number": 6789,
    "Student_4_before_after": (150, 138),
}
The keys in my dictionary increment by 1 for each student's information. How do I convert this to a DataFrame that looks like this:
Name Age Phone_Number Before_and_After
0 Alice 23 1111 (120, 109)
1 Bob 56 1234 (115, 107)
2 Casie 47 4567 (180, 140)
3 Donna 33 6789 (150, 138)
Use:
# create Series
s = pd.Series(student_info_dict)
# split the index created by the keys at the second _
s.index = s.index.str.split('_', n=2, expand=True)
# remove the first level (Student) and reshape to a DataFrame
df = s.droplevel(0).unstack()
print(df)
Age Name Phone_Number before_after
1 23 Alice 1111 (120, 109)
2 56 Bob 1234 (115, 107)
3 47 Casie 4567 (180, 140)
4 33 Donna 6789 (150, 138)
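If you also want the column order and the 0-based index shown in the question, a small follow-up step (assuming the df built above):
# reorder the columns and replace the student-number index with 0..n-1
df = df[['Name', 'Age', 'Phone_Number', 'before_after']].reset_index(drop=True)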
You can use a simple dictionary comprehension feeding the Series constructor, using the student number and the field name as keys, then unstacking to a DataFrame:
df = (pd.Series({tuple(k.split('_', 2)[1:]): v
                 for k, v in student_info_dict.items()})
      .unstack(1))
Output:
Age Name Phone_Number before_after
1 23 Alice 1111 (120, 109)
2 56 Bob 1234 (115, 107)
3 47 Casie 4567 (180, 140)
4 33 Donna 6789 (150, 138)
Use this code:
import pandas as pd

name = []
age = []
pn = []
baa = []
for key, value in student_info_dict.items():
    if 'Name' in key:
        name.append(value)
    elif 'Age' in key:
        age.append(value)
    elif 'Phone' in key:
        pn.append(value)
    elif 'before' in key:
        baa.append(value)
df = pd.DataFrame({'Name': name, 'Age': age, 'Phone_number': pn, 'before_after': baa})

Matplotlib error plotting interval bins for discretized values from pandas dataframe

An error is returned when I try to plot an interval.
I created interval bins for my age column, so now I want to chart how the age intervals compare to revenue.
My code:
bins = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
clients['tranche'] = pd.cut(clients.age, bins)
clients.head()
client_id sales revenue birth age sex tranche
0 c_1 39 558.18 1955 66 m (60, 70]
1 c_10 58 1353.60 1956 65 m (60, 70]
2 c_100 8 254.85 1992 29 m (20, 30]
3 c_1000 125 2261.89 1966 55 f (50, 60]
4 c_1001 102 1812.86 1982 39 m (30, 40]
# Plot a scatter tranche x revenue
df = clients.groupby('tranche')[['revenue']].sum().reset_index().copy()
plt.scatter(df.tranche, df.revenue)
plt.show()
But an error appears, ending with
TypeError: float() argument must be a string or a number, not 'pandas._libs.interval.Interval'
How can I use an interval for plotting?
You'll need to add labels. (I tried to convert them to str using .astype(str), but that does not seem to work in 3.9.)
If you do the following, it will work just fine (pd.cut needs one label per bin, i.e. nine labels for these ten edges):
labels = ['10-20', '20-30', '30-40', '40-50', '50-60', '60-70', '70-80', '80-90', '90-100']
clients['tranche'] = pd.cut(clients.age, bins, labels=labels)
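If you would rather keep the Interval bins instead of relabelling, one workaround (a sketch, assuming the clients frame from the question) is to plot against integer positions and use the intervals as tick labels:
import matplotlib.pyplot as plt

# aggregate revenue per age bin, then plot at integer x positions
df = clients.groupby('tranche')['revenue'].sum().reset_index()
x = range(len(df))
plt.scatter(x, df['revenue'])
plt.xticks(x, df['tranche'].astype(str), rotation=45)  # intervals as strings
plt.show()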

Is there a more elegant way to do conditional cumulative sums in pandas?

I'm trying to build a little portfolio app and calculate my average entry price and the realised gains off the back of that. Here's what I have so far, which works, but I'm curious whether there's a more elegant way to get conditional cumulative sums without creating extra columns. It seems like a lot of steps for what is effectively a SUMIFS statement in Excel.
Input dataframe:
from datetime import datetime

import numpy as np
import pandas as pd

hist_pos = pd.DataFrame(data=[
    [datetime(2020, 5, 1), 'PPT.AX', 30, 20.00, 15.00, 'Buy'],
    [datetime(2020, 5, 2), 'RIO.AX', 25, 25.00, 15.00, 'Buy'],
    [datetime(2018, 5, 3), 'BHP.AX', 100, 4.00, 15.00, 'Buy'],
    [datetime(2019, 5, 3), 'BHP.AX', 50, 4.00, 15.00, 'Sell'],
    [datetime(2019, 12, 3), 'PPT.AX', 80, 4.00, 15.00, 'Buy'],
    [datetime(2020, 5, 3), 'RIO.AX', 100, 4.00, 15.00, 'Buy'],
    [datetime(2020, 5, 5), 'PPT.AX', 50, 40.00, 15.00, 'Sell'],
    [datetime(2020, 5, 10), 'PPT.AX', 15, 45.00, 15.00, 'Sell'],
    [datetime(2020, 5, 18), 'PPT.AX', 30, 20.00, 15.00, 'Sell']],
    columns=['Date', 'Ticker', 'Quantity', 'Price', 'Fees', 'Direction'])
Code base:
hist_pos.sort_values(['Ticker', 'Date'], inplace=True)
hist_pos.Quantity = pd.to_numeric(hist_pos.Quantity)  # convert to number
# where direction is a sale, make quantity negative
hist_pos['AdjQ'] = np.where(
    hist_pos.Direction == 'Buy', 1, -1) * hist_pos.Quantity
# sum AdjQ cumulatively to get the running quantity for each ticker
hist_pos['CumQuan'] = hist_pos.groupby('Ticker')['AdjQ'].cumsum()
Expected Output:
Date Ticker Quantity Price Fees Direction AdjQ CumQuan
2 2018-05-03 BHP.AX 100 4.0 15.0 Buy 100 100
3 2019-05-03 BHP.AX 50 4.0 15.0 Sell -50 50
4 2019-12-03 PPT.AX 80 4.0 15.0 Buy 80 80
0 2020-05-01 PPT.AX 30 20.0 15.0 Buy 30 110
6 2020-05-05 PPT.AX 50 40.0 15.0 Sell -50 60
7 2020-05-10 PPT.AX 15 45.0 15.0 Sell -15 45
8 2020-05-18 PPT.AX 30 20.0 15.0 Sell -30 15
1 2020-05-02 RIO.AX 25 25.0 15.0 Buy 25 25
5 2020-05-03 RIO.AX 100 4.0 15.0 Buy 100 125
The code above works fine and produces the expected output for the CumQuan column. However, in my broader code (here in Repl) I need to go through this process a number of times for various columns, so I'm wondering whether there is a simpler way to combine groupby and cumulative sums with a condition.
Grouping together is the only thing I can think of:
# CFBuy is a column defined in the broader code linked in the question
hist_pos2 = hist_pos.groupby('Ticker').agg(CumQuan=('AdjQ', 'cumsum'), CumCost=('CFBuy', 'cumsum'))
CumQuan CumCost
2 100 -415.0
3 50 -415.0
4 80 -335.0
0 110 -950.0
6 60 -950.0
7 45 -950.0
8 15 -950.0
1 25 -640.0
5 125 -1055.0
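Note that agg with a non-reducing function such as 'cumsum' relies on older pandas behaviour and may be rejected by newer versions. A pattern that avoids the intermediate columns (a sketch, assuming the hist_pos frame from the question) is a single grouped cumsum over a list of columns:
# one grouped cumsum call covering every column of interest;
# 'CFBuy' is a column from the broader linked code, shown here hypothetically
cols = ['AdjQ']  # e.g. cols = ['AdjQ', 'CFBuy']
cum = hist_pos.groupby('Ticker')[cols].cumsum().add_prefix('Cum')
hist_pos[cum.columns] = cum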

How do I join two columns into another separate column in Pandas?

Any help would be greatly appreciated. This is probably easy, but I'm new to Python.
I want to combine two columns, Latitude and Longitude, into a column called Location.
For example:
The first row of Latitude will have a value of 41.864073 and the first row of Longitude will have a value of -87.706819.
I would like the Location column to display 41.864073, -87.706819.
Please and thank you.
Setup
df = pd.DataFrame(dict(lat=range(10, 20), lon=range(100, 110)))
zip
This should be faster than using apply, since it avoids calling a Python function once per row:
df.assign(location=[*zip(df.lat, df.lon)])
lat lon location
0 10 100 (10, 100)
1 11 101 (11, 101)
2 12 102 (12, 102)
3 13 103 (13, 103)
4 14 104 (14, 104)
5 15 105 (15, 105)
6 16 106 (16, 106)
7 17 107 (17, 107)
8 18 108 (18, 108)
9 19 109 (19, 109)
list variant
Though I'd still suggest tuple
df.assign(location=df[['lat', 'lon']].values.tolist())
lat lon location
0 10 100 [10, 100]
1 11 101 [11, 101]
2 12 102 [12, 102]
3 13 103 [13, 103]
4 14 104 [14, 104]
5 15 105 [15, 105]
6 16 106 [16, 106]
7 17 107 [17, 107]
8 18 108 [18, 108]
9 19 109 [19, 109]
I question the usefulness of this column, but you can generate it by applying the tuple callable over the columns.
>>> df = pd.DataFrame([[1, 2], [3, 4]], columns=['lon', 'lat'])
>>> df
   lon  lat
0    1    2
1    3    4
>>> df['Location'] = df.apply(tuple, axis=1)
>>> df
   lon  lat Location
0    1    2   (1, 2)
1    3    4   (3, 4)
If there are other columns than 'lon' and 'lat' in your dataframe, use
df['Location'] = df[['lon', 'lat']].apply(tuple, axis=1)
Data from Pir
df['New'] = tuple(zip(*df[['lat', 'lon']].values.T))
df
Out[106]:
lat lon New
0 10 100 (10, 100)
1 11 101 (11, 101)
2 12 102 (12, 102)
3 13 103 (13, 103)
4 14 104 (14, 104)
5 15 105 (15, 105)
6 16 106 (16, 106)
7 17 107 (17, 107)
8 18 108 (18, 108)
9 19 109 (19, 109)
I definitely learned something from W-B and timgeb. My idea was to just convert to strings and concatenate. I posted my answer in case you wanted the result as a string. Otherwise it looks like the answers above are the way to go.
import pandas as pd

Dic = {'Latitude': [41.864073], 'Longitude': [-87.706819]}
DF = pd.DataFrame.from_dict(Dic)
DF['Location'] = DF['Latitude'].astype(str) + ', ' + DF['Longitude'].astype(str)
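If you prefer row-wise formatting, an equivalent with an f-string (same DF as above):
# per-row f-string; produces the same 'lat, lon' strings
DF['Location'] = DF.apply(lambda r: f"{r['Latitude']}, {r['Longitude']}", axis=1)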

Grouping data into bins

I want to group the following data frame df into bins of size 50:
ID FREQ
0 358081 6151
1 431511 952
2 410632 350
3 398149 220
4 177791 158
5 509179 151
6 485346 99
7 536655 50
8 389180 51
9 406622 45
10 410191 112
The result should be this one:
FREQ_BIN QTY_IDs
>200 3
150-200 2
100-150 1
50-100 3
<50 1
How can I do it? Should I use groupby or some other approach?
You could use pd.cut.
df.groupby(pd.cut(df.FREQ,
                  bins=[-np.inf, 50, 100, 150, 200, np.inf],
                  right=False)
           ).size()
right=False ensures that we take half-open intervals, as your output suggests; unlike np.digitize, we need to include np.inf in the bins for the "infinite endpoints".
Demo
>>> df.groupby(pd.cut(df.FREQ,
...                   bins=[-np.inf, 50, 100, 150, 200, np.inf],
...                   right=False)
...            ).size()
FREQ
[-inf, 50) 1
[50, 100) 3
[100, 150) 1
[150, 200) 2
[200, inf) 4
dtype: int64
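If you also want the labels and the ordering from your expected output, a sketch along the same lines (the labels are just display names for the same half-open bins):
out = (df.groupby(pd.cut(df.FREQ,
                         bins=[-np.inf, 50, 100, 150, 200, np.inf],
                         right=False,
                         labels=['<50', '50-100', '100-150', '150-200', '>200']))
         .size()
         .rename('QTY_IDs')
         .iloc[::-1])  # reverse so >200 comes first, as in the expected output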
