Pandas Groupby and generate "duplicate" columns for each groupby value - python

I have a vertical data frame that I am looking to make more horizontal by "duplicating" columns for each item in the groupby column.
I have the following data frame:
df = pd.DataFrame({'posteam': {0: 'ARI', 1: 'ARI', 2: 'ARI', 3: 'ARI', 4: 'ARI'},
                   'offense_grouping': {0: 'personnel_00',
                                        1: 'personnel_01',
                                        2: 'personnel_02',
                                        3: 'personnel_10',
                                        4: 'personnel_11'},
                   'snap_ct': {0: 1, 1: 6, 2: 4, 3: 396, 4: 1441},
                   'personnel_epa': {0: 0.1539720594882965,
                                     1: 0.7805194854736328,
                                     2: -0.2678736448287964,
                                     3: 0.1886662095785141,
                                     4: 0.005721719935536385}})
And in its current state, there are 5 duplicate values in the 'posteam' column and 5 different values in the 'offense_grouping' column. Ideally, I would like to group by 'posteam' (so each team has only one row) and by 'offense_grouping'. Each 'offense_grouping' value corresponds to 'snap_ct' and 'personnel_epa' values. I would like the end result of this grouping to be something like this:
posteam  personnel_00_snap_ct  personnel_00_personnel_epa  personnel_01_snap_ct  personnel_01_personnel_epa  personnel_02_snap_ct  personnel_02_personnel_epa
ARI      1                     .1539...                    6                     .7805...                    4                     -.2679
And so on. How can this be achieved?

Given the data you provide, the following would give the expected result. But there might be more complex cases in your data.
z = (
    df
    .set_index(['posteam', 'offense_grouping'])
    .unstack('offense_grouping')
    .swaplevel(axis=1)
    .sort_index(axis=1, ascending=[True, False])
)

# or, alternatively (might be better if you have multiple values
# for some given indices/columns):
z = (
    df
    .pivot_table(index='posteam', columns='offense_grouping', values=['snap_ct', 'personnel_epa'])
    .swaplevel(axis=1)
    .sort_index(axis=1, ascending=[True, False])
)
>>> z
offense_grouping personnel_00 personnel_01 \
snap_ct personnel_epa snap_ct personnel_epa
posteam
ARI 1 0.153972 6 0.780519
offense_grouping personnel_02 personnel_10 \
snap_ct personnel_epa snap_ct personnel_epa
posteam
ARI 4 -0.267874 396 0.188666
offense_grouping personnel_11
snap_ct personnel_epa
posteam
ARI 1441 0.005722
Then you can join the two levels of columns:
res = z.set_axis([f'{b}_{a}' for a, b in z.columns], axis=1)
>>> res
snap_ct_personnel_00 personnel_epa_personnel_00 snap_ct_personnel_01 personnel_epa_personnel_01 snap_ct_personnel_02 personnel_epa_personnel_02 snap_ct_personnel_10 personnel_epa_personnel_10 snap_ct_personnel_11 personnel_epa_personnel_11
posteam
ARI 1 0.153972 6 0.780519 4 -0.267874 396 0.188666 1441 0.005722
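If you prefer the column order shown in the question (grouping first, then the metric), swapping the two pieces of the f-string in the set_axis call above is enough. A minimal variant:
# Put the offense_grouping first in each flattened name,
# e.g. 'personnel_00_snap_ct' instead of 'snap_ct_personnel_00'.
res = z.set_axis([f'{a}_{b}' for a, b in z.columns], axis=1)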

Related

Pandas merging value of two rows in columns of a single row

I have data like this; it's the output of a groupby:
numUsers = df.groupby(["user","isvalid"]).count()
count
user isvalid
5 0.0 1336
1.0 387
But I need to have the counts as count_valid and count_invalid columns for each user, like this:
count_valid count_invalid
user
5 387 1336
How can I do it in an optimized way in Pandas?
You can use:
out = (df.groupby(["user", "isvalid"]).count()
         .rename({0: 'count_invalid', 1: 'count_valid'}, level=1)
         ['count'].unstack()
       )
Output:
isvalid count_invalid count_valid
user
5 1336 387
Or, more generic if you have multiple columns, using a MultiIndex:
out = (df.groupby(["user", "isvalid"]).count()
         .unstack().rename(columns={0: 'invalid', 1: 'valid'}, level=1)
       )
out.columns = out.columns.map('_'.join)
Output:
count_invalid count_valid
user
5 1336 387
Or from the original dataset with a crosstab:
pd.crosstab(df['user'], df['isvalid'].map({0: 'count_invalid', 1: 'count_valid'}))
You can replace the groupby + count with value_counts:
>>> (df.replace({'isvalid': {0: 'count_invalid', 1: 'count_valid'}})
       .value_counts(['user', 'isvalid']).unstack('isvalid')
       .rename_axis(columns=None))
count_invalid count_valid
user
5 1336 387
Another version with pivot_table:
>>> (df.replace({'isvalid': {0: 'count_invalid', 1: 'count_valid'}}).assign(count=1)
       .pivot_table(index='user', columns='isvalid', values='count', aggfunc='count')
       .rename_axis(columns=None))
count_invalid count_valid
user
5 1336 387

Filter dataframe with multiple conditions including OR

I wrote a little script that loops through constraints to filter a dataframe. Example and follow up explaining the issue are below.
constraints = [['stand','==','L'],['zone','<','20']]
for x in constraints:
    vari = x[2]
    df = df.query("{0} {1} @vari".format(x[0], x[1]))
   zone stand  speed type
0     2     L   83.7   CH
1     7     L   95.9   SI
2    14     L   94.9   FS
3    11     L   93.3   FS
4    13     L   86.9   CH
5     7     L   96.4   SI
6    13     L   82.6   SL
I can't figure out a way to filter when there is an OR condition. For example, in the table above I'd like to return a dataframe using the constraints in the code example along with any rows that contain SI or CH in the type column. Does anyone have ideas on how to accomplish this? Any help would be greatly appreciated.
This seems to have gotten the job done but there is probably a much better way of going about it.
for x in constraints:
    vari = x[2]
    if isinstance(vari, list):
        frame = frame[frame[x[0]].isin(vari)]
    else:
        frame = frame.query("{0} {1} @vari".format(x[0], x[1]))
IIUC (see my question in the comment) you can do it like this:
I made a slightly different df to show you the result (I guess the table you show is already filtered):
df = pd.DataFrame(
{'zone': {0: 2, 1: 11, 2: 25, 3: 11, 4: 23, 5: 7, 6: 13},
'stand': {0: 'L', 1: 'L', 2: 'L', 3: 'C', 4: 'L', 5: 'K', 6: 'L'},
'speed': {0: 83.7, 1: 95.9, 2: 94.9, 3: 93.3, 4: 86.9, 5: 96.4, 6: 82.6},
'type': {0: 'CH', 1: 'SI', 2: 'FS', 3: 'FS', 4: 'CH', 5: 'SI', 6: 'SL'}})
print(df)
zone stand speed type
0 2 L 83.7 CH
1 11 L 95.9 SI
2 25 L 94.9 FS
3 11 C 93.3 FS
4 23 L 86.9 CH
5 7 K 96.4 SI
6 13 L 82.6 SL
res = df.loc[ ( (df['type']=='SI') | (df['type']=='CH') ) & ( (df['zone']<20) & (df['stand']=='L') ) ]
print(res)
zone stand speed type
0 2 L 83.7 CH
1 11 L 95.9 SI
Let me know if that is what you are searching for.
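If you would rather keep everything in a single query string, note that pandas query also understands in, which covers the OR over several types. A minimal sketch under that assumption:
# One combined query; "type in ['SI', 'CH']" plays the role of the OR condition.
res = df.query("stand == 'L' and zone < 20 and type in ['SI', 'CH']")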

Create a trip report with end latitude and longitude

Please help, I have a data set structured like below
ss={'ride_id': {0: 'ride1',1: 'ride1',2: 'ride1',3: 'ride2',4: 'ride2',
5: 'ride2',6: 'ride2',7: 'ride3',8: 'ride3',9: 'ride3',10: 'ride3'},
'lat': {0: 5.616526,1: 5.623686, 2: 5.616555,3: 5.616556,4: 5.613834, 5: 5.612899,
6: 5.610804,7: 5.616614,8: 5.644431,9: 5.650771, 10: 5.610828},
'long': {0: -0.231901,1: -0.227248,2: -0.23192,3: -0.23168,4: -0.223812,
5: -0.22869,6: -0.226193,7: -0.231461,8: -0.237549,9: -0.271337,10: -0.226157},
'distance': {0: 0.0,1: 90.021,2: 138.0751,3: 0.0,4: 90.0041,5: 180.0293,6: 180.562, 7:0.0,8: 90.004,9: 180.0209,10: 189.0702},}
df=pd.DataFrame(ss)
the ride_id column indicates the number of trips taken in a window to make up the ride.
For example, ride1 consists of 2 trips, the first trip starts at index 0 and ends at index 1, then trip 2 starts at index 1 and ends at index 2.
I want to create a new data frame of trip reports, where each row has the start coordinates (lat, long), the trip end coordinates (end_lat, end_long) taken from the next row, and then the distance. The result should look like the data frame below:
sf={'ride_id': {0: 'ride1',1: 'ride1',2: 'ride2',3: 'ride2',4: 'ride2',},
'lat': {0: 5.616526,1: 5.623686,2: 5.616556,3: 3.613834, 4: 5.612899},
'long': {0: -0.231901,1: -0.227248,2: -0.23168,3: -0.223812,4: -0.22869},
'end_lat':{0: 5.623686,1: 5.616555,2: 5.613834,3: 5.612899,4: 5.610804},
'end_long':{0: -0.227248,1: -0.23192,2: -0.223812,3: -0.22869,4: -0.226193},
'distance': {0: 90.02100,1: 138.07510,2: 90.00410,3: 180.02930,4: 180.5621},}
df_s=pd.DataFrame(sf)
df_s
OUT:
ride_id lat long end_lat end_long distance
0 ride1 5.616526 -0.231901 5.623686 -0.227248 90.0210
1 ride1 5.623686 -0.227248 5.616555 -0.231920 138.0751
2 ride2 5.616556 -0.231680 5.613834 -0.223812 90.0041
3 ride2 3.613834 -0.223812 5.612899 -0.228690 180.0293
4 ride2 5.612899 -0.228690 5.610804 -0.226193 180.5621
I tried to group the data frame by ride_id to isolate each ride, but I'm stuck. Any ideas are warmly welcomed.
We can do groupby with shift, then dropna:
df['start_lat'] = df.groupby('ride_id')['lat'].shift()
df['start_long'] = df.groupby('ride_id')['long'].shift()
df = df.dropna()
df
Out[480]:
ride_id lat long distance start_lat start_long
1 ride1 5.623686 -0.227248 90.0210 5.616526 -0.231901
2 ride1 5.616555 -0.231920 138.0751 5.623686 -0.227248
4 ride2 5.613834 -0.223812 90.0041 5.616556 -0.231680
5 ride2 5.612899 -0.228690 180.0293 5.613834 -0.223812
6 ride2 5.610804 -0.226193 180.5620 5.612899 -0.228690
8 ride3 5.644431 -0.237549 90.0040 5.616614 -0.231461
9 ride3 5.650771 -0.271337 180.0209 5.644431 -0.237549
10 ride3 5.610828 -0.226157 189.0702 5.650771 -0.271337
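If you need the exact layout asked for in the question (end coordinates taken from the next row, rather than start coordinates from the previous one), a sketch using shift(-1) instead:
# Pull the next point of the same ride onto the current row,
# then drop each ride's last row, which has no following point.
df['end_lat'] = df.groupby('ride_id')['lat'].shift(-1)
df['end_long'] = df.groupby('ride_id')['long'].shift(-1)
df['distance'] = df.groupby('ride_id')['distance'].shift(-1)
df_s = df.dropna(subset=['end_lat']).reset_index(drop=True)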

Average, MIN, MAX in pandas for a column with strings

I have a column like this
User time Column
User1 time1 44 db
User1 time2 55 db
User1 time3 43 db
User1 time4 no_available
How can I calculate the average, min, and max for each user by taking just 44, 55, 43 (without 'db') and ignoring values like 'no_available' and 'no_power'?
Bonus: also, how to take the last value of the day if a user has, for example, 10 values for 10 times?
Regards, thank you.
If all integers, you can use str.extract() to pull out the numbers. Then, return the mean, max, etc:
df = pd.DataFrame({'User': {0: 'User1', 1: 'User1', 2: 'User1', 3: 'User1'},
'time': {0: 'time1', 1: 'time2', 2: 'time3', 3: 'time4'},
'Column': {0: '44 db', 1: '55 db', 2: '43 db', 3: 'no_available'}})
df['Numbers'] = df['Column'].str.extract(r'(\d+)').astype(float)
print(df['Numbers'].mean(), df['Numbers'].max())
Out [1]:
47.333333333333336 55.0
Example with -, ., or , in the number:
import pandas as pd
df = pd.DataFrame({'User': {0: 'User1', 1: 'User1', 2: 'User1', 3: 'User1'},
'time': {0: 'time1', 1: 'time2', 2: 'time3', 3: 'time4'},
'Column': {0: '44 db', 1: '-45.32 db', 2: '4,452.03 db', 3: 'no_available'}})
df['Numbers'] = df['Column'].str.replace(',', '').str.extract(r'(-?\d+\.?\d+)').astype(float)
print(df['Numbers'])
0 44.00
1 -45.32
2 4452.03
3 NaN
Name: Numbers, dtype: float64
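From there, the per-user statistics and the bonus "last value" question can be handled with a groupby on the extracted column. A short sketch, assuming the rows are already in time order within each user:
# Per-user statistics; NaN rows (e.g. 'no_available') are ignored automatically.
stats = df.groupby('User')['Numbers'].agg(['mean', 'min', 'max'])

# Bonus: last non-null value per user, assuming rows are ordered by time.
last_value = df.dropna(subset=['Numbers']).groupby('User')['Numbers'].last()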
Here are the MAX and MIN for that column:
import pandas as pd

a = ["user1", "user1", "user1", "user1"]
a2 = ["time1", "time2", "time3", "time4"]
a3 = ['45 db', '55 db', '43 db', 'no_available']

a = pd.DataFrame(a, columns=["user"])
a2 = pd.DataFrame(a2, columns=["time"])
a3 = pd.DataFrame(a3, columns=["column"])
data = pd.concat([a, a2, a3], axis=1)

data1 = list(data["column"])
h = []
for i in data1:
    try:
        # keep rows whose first two characters are digits
        if int(i[0:2]):
            h.append(int(i[0:2]))
    except ValueError:
        print(i)

max(h)
min(h)

How to do conditionals operations in columns in python pandas?

I'm trying to write code that calculates the variation of "prod" ("rgdpna"/"emp") relative to one specific year, in Excel data that contains several countries, and I need to do it for all of them.
(country, year, rgdpna and emp are the data from Excel)
Country year rgdpna emp "prod"(rgdpna/emp) "prodvar"
Brazil 1980 100 12 8.3 (8.3/8.3) = 1
Brazil 1981 120 12 10 (10/8.3) = 1.2
Brazil 1982 140 15 9.3 (9.3/8.3) = 1.1
...
Canada 1980 300 11 27.2 (27.2/27.2) = 1
Canada 1981 327 10 32.7 (32.7/27.2) = 1.2
Canada 1982 500 12 41.6 (41.6/27.2) = 1.5
...
Something like this: "prodvar" = ("prod" when "year" >= 1980) divided by ("prod" when "year" == 1980).
And I think I need to do it with "while", but I don't know.
df["prod"] = df["rgdpna"].div(df["emp"])
For pandas, avoid doing for and while loops wherever possible.
Try this.
df['prod'] = df.apply(lambda x: x['prod']/df['prod'].loc[(df['year']==1980)&(df['country']==x['country'])].values[0], axis=1)
First of all, let's get your data into a complete, minimal example. For that we don't need the intermediate columns, so let's keep only the relevant column and call it 'value' for clarity's sake:
data_dict = {'country': {0: 'Brazil',
1: 'Brazil',
2: 'Brazil',
3: 'Canada',
4: 'Canada',
5: 'Canada'},
'value': {0: 8.3, 1: 10, 2: 9.3, 3: 27.2, 4: 32.7, 5: 41.6},
'year': {0: 1980.0, 1: 1981.0, 2: 1982.0, 3: 1980.0, 4: 1981.0, 5: 1982.0}}
df = pd.DataFrame(data_dict)
(I'm also using clear column names in the rest of this answer, even if they're long)
Secondly, we will create an intermediate column that just holds the value for the year 1980:
df['value_1980'] = df.apply(lambda row: df.set_index(['year','country']).loc[1980]['value'][row['country']], axis=1)
Finally, we just divide the two, as in your example:
df['value_relative_to_1980'] = df['value'] / df['value_1980']
Check the result.
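An alternative sketch that avoids the row-wise apply: build a per-country 1980 baseline once, then map it onto each row (this assumes exactly one 1980 row per country):
# Baseline per country: the 'value' recorded for the year 1980.
base_1980 = df.loc[df['year'] == 1980].set_index('country')['value']

# Divide each row's value by its country's 1980 baseline.
df['value_relative_to_1980'] = df['value'] / df['country'].map(base_1980)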
