How to do conditional operations on columns in python pandas? - python

I'm trying to write code that calculates the variation of "prod" ("rgdpna"/"emp") relative to one specific year. The data comes from an Excel file containing several countries, and I need to do this for all of them.
(country, year, rgdpna and emp are the columns from Excel)
Country  year  rgdpna  emp  "prod" (rgdpna/emp)  "prodvar"
Brazil   1980  100     12   8.3                  (8.3/8.3) = 1
Brazil   1981  120     12   10                   (10/8.3) = 1.2
Brazil   1982  140     15   9.3                  (9.3/8.3) = 1.1
...
Canada   1980  300     11   27.2                 (27.2/27.2) = 1
Canada   1981  327     10   32.7                 (32.7/27.2) = 1.2
Canada   1982  500     12   41.6                 (41.6/27.2) = 1.5
...
Something like this: "prodvar" = ("prod" when "year" >= 1980) divided by ("prod" when "year" == 1980).
I think I need to use a while loop, but I'm not sure. So far I have:
df["prod"] = df["rgdpna"].div(df["emp"])

With pandas, avoid for and while loops wherever possible.
Try this:
df['prod'] = df.apply(lambda x: x['prod']/df['prod'].loc[(df['year']==1980)&(df['country']==x['country'])].values[0], axis=1)
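A vectorized alternative (my suggestion, not part of the answer above) avoids the row-wise apply entirely by broadcasting each country's 1980 value with a groupby transform; `base` is a name I introduce here for illustration:

```python
import pandas as pd

# sample data reconstructed from the question
df = pd.DataFrame({
    'country': ['Brazil', 'Brazil', 'Brazil', 'Canada', 'Canada', 'Canada'],
    'year':    [1980, 1981, 1982, 1980, 1981, 1982],
    'rgdpna':  [100, 120, 140, 300, 327, 500],
    'emp':     [12, 12, 15, 11, 10, 12],
})
df['prod'] = df['rgdpna'] / df['emp']

# keep 'prod' only where year == 1980, then broadcast that value to every
# row of the same country ('first' skips the NaNs produced by where)
base = df['prod'].where(df['year'] == 1980).groupby(df['country']).transform('first')
df['prodvar'] = df['prod'] / base
```

This scales much better than apply because the division runs over whole columns at once.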

First of all, let's get your data into a complete, minimal example. For that we don't need the intermediate columns, so let's keep only the relevant column and call it 'value' for clarity's sake:
data_dict = {'country': {0: 'Brazil',
1: 'Brazil',
2: 'Brazil',
3: 'Canada',
4: 'Canada',
5: 'Canada'},
'value': {0: 8.3, 1: 10, 2: 9.3, 3: 27.2, 4: 32.7, 5: 41.6},
'year': {0: 1980.0, 1: 1981.0, 2: 1982.0, 3: 1980.0, 4: 1981.0, 5: 1982.0}}
df = pd.DataFrame(data_dict)
(I'm also using clear column names in the rest of this answer, even if they're long)
Secondly, we will create an intermediate column that just holds each country's value where year is 1980:
df['value_1980'] = df.apply(lambda row: df.set_index(['year','country']).loc[1980]['value'][row['country']], axis=1)
Finally, we just divide the two, as in your example:
df['value_relative_to_1980'] = df['value'] / df['value_1980']
Check the result.
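An equivalent way to build the intermediate column (a sketch of mine, not part of the answer above) is a merge against the 1980 rows, which avoids the nested apply/set_index; column names follow the answer's:

```python
import pandas as pd

data_dict = {'country': {0: 'Brazil', 1: 'Brazil', 2: 'Brazil',
                         3: 'Canada', 4: 'Canada', 5: 'Canada'},
             'value': {0: 8.3, 1: 10, 2: 9.3, 3: 27.2, 4: 32.7, 5: 41.6},
             'year': {0: 1980.0, 1: 1981.0, 2: 1982.0,
                      3: 1980.0, 4: 1981.0, 5: 1982.0}}
df = pd.DataFrame(data_dict)

# one row per country holding its 1980 value
base = (df.loc[df['year'] == 1980, ['country', 'value']]
          .rename(columns={'value': 'value_1980'}))
df = df.merge(base, on='country')
df['value_relative_to_1980'] = df['value'] / df['value_1980']
```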


Pandas Groupby and generate "duplicate" columns for each groupby value

I have a vertical data frame that I am looking to make more horizontal by "duplicating" columns for each item in the groupby column.
I have the following data frame:
pd.DataFrame({'posteam': {0: 'ARI', 1: 'ARI', 2: 'ARI', 3: 'ARI', 4: 'ARI'},
'offense_grouping': {0: 'personnel_00',
1: 'personnel_01',
2: 'personnel_02',
3: 'personnel_10',
4: 'personnel_11'},
'snap_ct': {0: 1, 1: 6, 2: 4, 3: 396, 4: 1441},
'personnel_epa': {0: 0.1539720594882965,
1: 0.7805194854736328,
2: -0.2678736448287964,
3: 0.1886662095785141,
4: 0.005721719935536385}})
And in its current state, there are 5 duplicate values in the 'posteam' column and 5 different values in the 'offense_grouping' column. Ideally, I would like to group by 'posteam' (so each team has only one row) and by 'offense_grouping'. Each 'offense_grouping' value corresponds to 'snap_ct' and 'personnel_epa' values. I would like the end result to look something like this:
posteam  personnel_00_snap_ct  personnel_00_personnel_epa  personnel_01_snap_ct  personnel_01_personnel_epa  personnel_02_snap_ct  personnel_02_personnel_epa
ARI      1                     .1539...                    6                     .7805...                    4                     -.2679
And so on. How can this be achieved?
Given the data you provide, the following would give the expected result. But there might be more complex cases in your data.
z = (
    df
    .set_index(['posteam', 'offense_grouping'])
    .unstack('offense_grouping')
    .swaplevel(axis=1)
    .sort_index(axis=1, ascending=[True, False])
)
# or, alternatively (might be better if you have multiple values
# for some given indices/columns):
z = (
    df
    .pivot_table(index='posteam', columns='offense_grouping', values=['snap_ct', 'personnel_epa'])
    .swaplevel(axis=1)
    .sort_index(axis=1, ascending=[True, False])
)
>>> z
offense_grouping personnel_00 personnel_01 \
snap_ct personnel_epa snap_ct personnel_epa
posteam
ARI 1 0.153972 6 0.780519
offense_grouping personnel_02 personnel_10 \
snap_ct personnel_epa snap_ct personnel_epa
posteam
ARI 4 -0.267874 396 0.188666
offense_grouping personnel_11
snap_ct personnel_epa
posteam
ARI 1441 0.005722
Then you can join the two levels of columns:
res = z.set_axis([f'{b}_{a}' for a, b in z.columns], axis=1)
>>> res
snap_ct_personnel_00 personnel_epa_personnel_00 snap_ct_personnel_01 personnel_epa_personnel_01 snap_ct_personnel_02 personnel_epa_personnel_02 snap_ct_personnel_10 personnel_epa_personnel_10 snap_ct_personnel_11 personnel_epa_personnel_11
posteam
ARI 1 0.153972 6 0.780519 4 -0.267874 396 0.188666 1441 0.005722
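If you instead want the group name first, as the question's desired layout shows ('personnel_00_snap_ct'), you can simply swap the f-string order; this is my variation on the answer's flattening step:

```python
import pandas as pd

df = pd.DataFrame({'posteam': ['ARI'] * 5,
                   'offense_grouping': ['personnel_00', 'personnel_01', 'personnel_02',
                                        'personnel_10', 'personnel_11'],
                   'snap_ct': [1, 6, 4, 396, 1441],
                   'personnel_epa': [0.153972, 0.780519, -0.267874, 0.188666, 0.005722]})

z = (df.pivot_table(index='posteam', columns='offense_grouping',
                    values=['snap_ct', 'personnel_epa'])
       .swaplevel(axis=1)
       .sort_index(axis=1, ascending=[True, False]))

# group name first: 'personnel_00_snap_ct' instead of 'snap_ct_personnel_00'
res = z.set_axis([f'{a}_{b}' for a, b in z.columns], axis=1)
```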

How can I iterate through each row of a pandas dataframe, then conditionally set a new value in that row?

I am working on a school project, so please no exact answers.
I have a pandas dataframe that has numerators and denominators rating images of dogs out of 10. When there are multiple dogs in the image, the rating is out of number of dogs * 10. I am trying to adjust it so that for example... if there are 5 dogs, and the rating is 40/50, then the new numerator/denominator is 8/10.
Here is an example of my code. I am aware that the syntax does not work in line 3, but I believe it accurately represents what I am trying to accomplish. twitter_archive is the dataframe.
twitter_archive['new_denom'] = 10
twitter_archive['new_numer'] = 0
for numer, denom in twitter_archive['rating_numerator','rating_denominator']:
    if (denom > 10) & (denom % 10 == 0):
        num_denom = denom / 10
        new_numer = numer / num_denom
        twitter_archive['new_numer'] = new_numer
So basically I am checking whether the denominator is above 10, and if it is, whether it is divisible by 10. If so, I find out how many times 10 goes into it, then divide the numerator by that value to get a new numerator. I think that logic works fine, but the issue I have is grabbing the row and then adding the new value to the new column I created, in that row.
edit: added df head
   tweet_id      timestamp                  text                                                rating_numerator  rating_denominator  name      doggo  floofer  pupper  puppo  avg_numerator  avg_denom  avg_numer
0  8.924206e+17  2017-08-01 16:23:56+00:00  This is Phineas. He's a mystical boy. Only eve...  13.0              10.0                phineas   None   None     None    None   0.0            10         0
1  8.921774e+17  2017-08-01 00:17:27+00:00  This is Tilly. She's just checking pup on you....  13.0              10.0                tilly     None   None     None    None   0.0            10         0
2  8.918152e+17  2017-07-31 00:18:03+00:00  This is Archie. He is a rare Norwegian Pouncin...  12.0              10.0                archie    None   None     None    None   0.0            10         0
3  8.916896e+17  2017-07-30 15:58:51+00:00  This is Darla. She commenced a snooze mid meal...  13.0              10.0                darla     None   None     None    None   0.0            10         0
4  8.913276e+17  2017-07-29 16:00:24+00:00  This is Franklin. He would like you to stop ca...  12.0              10.0                franklin  None   None     None    None   0.0            10         0
copy/paste head below:
{'tweet_id': {0: 8.924206435553362e+17,
1: 8.921774213063434e+17,
2: 8.918151813780849e+17,
3: 8.916895572798587e+17,
4: 8.913275589266883e+17},
'timestamp': {0: Timestamp('2017-08-01 16:23:56+0000', tz='UTC'),
1: Timestamp('2017-08-01 00:17:27+0000', tz='UTC'),
2: Timestamp('2017-07-31 00:18:03+0000', tz='UTC'),
3: Timestamp('2017-07-30 15:58:51+0000', tz='UTC'),
4: Timestamp('2017-07-29 16:00:24+0000', tz='UTC')},
'text': {0: "This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 ",
1: "This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available for pats, snugs, boops, the whole bit. 13/10 ",
2: 'This is Archie. He is a rare Norwegian Pouncing Corgo. Lives in the tall grass. You never know when one may strike. 12/10 ',
3: 'This is Darla. She commenced a snooze mid meal. 13/10 happens to the best of us ',
4: 'This is Franklin. He would like you to stop calling him "cute." He is a very fierce shark and should be respected as such. 12/10 #BarkWeek '},
'rating_numerator': {0: 13.0, 1: 13.0, 2: 12.0, 3: 13.0, 4: 12.0},
'rating_denominator': {0: 10.0, 1: 10.0, 2: 10.0, 3: 10.0, 4: 10.0},
'name': {0: 'phineas', 1: 'tilly', 2: 'archie', 3: 'darla', 4: 'franklin'},
'doggo': {0: 'None', 1: 'None', 2: 'None', 3: 'None', 4: 'None'},
'floofer': {0: 'None', 1: 'None', 2: 'None', 3: 'None', 4: 'None'},
'pupper': {0: 'None', 1: 'None', 2: 'None', 3: 'None', 4: 'None'},
'puppo': {0: 'None', 1: 'None', 2: 'None', 3: 'None', 4: 'None'}}
If you want to use a for loop to get row values, you can use the iterrows() function.
for idx, row in twitter_archive.iterrows():
    denom = row['rating_denominator']
    numer = row['rating_numerator']
    # You can collect values in a list and concat it with df
A faster way to iterate over a df is itertuples():
for row in twitter_archive.itertuples():
    # attribute access is safer than positional indexing here
    denom = row.rating_denominator
    numer = row.rating_numerator
But I think the best way to create a new column from old ones is to use the pandas apply function.
df = pd.DataFrame(data={'a': [1, 2], 'b': [3, 5]})
df['c'] = df.apply(lambda x: 'sum_is_odd' if (x['a'] + x['b']) % 2 == 1 else 'sum_is_even', axis=1)
In this case, 'c' is a new column and value is calculated using 'a' and 'b' columns.
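Since the OP asked for hints rather than exact answers, here is only the general vectorized pattern (my sketch, with made-up placeholder data and column names): numpy.where applies an "if condition, transform, else keep" rule to whole columns at once, with no loop at all.

```python
import numpy as np
import pandas as pd

# placeholder data, not the OP's dataframe
df = pd.DataFrame({'numer': [40.0, 13.0, 120.0],
                   'denom': [50.0, 10.0, 80.0]})

# condition: denominator above 10 and divisible by 10
cond = (df['denom'] > 10) & (df['denom'] % 10 == 0)

# where the condition holds, rescale to a /10 rating; otherwise keep as-is
df['new_numer'] = np.where(cond, df['numer'] / (df['denom'] / 10), df['numer'])
df['new_denom'] = 10
```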

Filter dataframe with multiple conditions including OR

I wrote a little script that loops through constraints to filter a dataframe. Example and follow up explaining the issue are below.
constraints = [['stand','==','L'],['zone','<','20']]
for x in constraints:
    vari = x[2]
    df = df.query("{0} {1} @vari".format(x[0], x[1]))
   zone stand  speed type
0     2     L   83.7   CH
1     7     L   95.9   SI
2    14     L   94.9   FS
3    11     L   93.3   FS
4    13     L   86.9   CH
5     7     L   96.4   SI
6    13     L   82.6   SL
I can't figure out a way to filter when there is an OR condition. For example, in the table above I'd like to return a dataframe using the constraints in the code example along with any rows that contain SI or CH in the type column. Does anyone have ideas on how to accomplish this? Any help would be greatly appreciated.
This seems to have gotten the job done but there is probably a much better way of going about it.
for x in constraints:
    vari = x[2]
    if isinstance(vari, list):
        frame = frame[frame[x[0]].isin(vari)]
    else:
        frame = frame.query("{0} {1} @vari".format(x[0], x[1]))
IIUC (see my question in the comment) you can do it like this.
I made a slightly different df to show the result (I guess the table you showed is already filtered):
df = pd.DataFrame(
{'zone': {0: 2, 1: 11, 2: 25, 3: 11, 4: 23, 5: 7, 6: 13},
'stand': {0: 'L', 1: 'L', 2: 'L', 3: 'C', 4: 'L', 5: 'K', 6: 'L'},
'speed': {0: 83.7, 1: 95.9, 2: 94.9, 3: 93.3, 4: 86.9, 5: 96.4, 6: 82.6},
'type': {0: 'CH', 1: 'SI', 2: 'FS', 3: 'FS', 4: 'CH', 5: 'SI', 6: 'SL'}})
print(df)
zone stand speed type
0 2 L 83.7 CH
1 11 L 95.9 SI
2 25 L 94.9 FS
3 11 C 93.3 FS
4 23 L 86.9 CH
5 7 K 96.4 SI
6 13 L 82.6 SL
res = df.loc[ ( (df['type']=='SI') | (df['type']=='CH') ) & ( (df['zone']<20) & (df['stand']=='L') ) ]
print(res)
zone stand speed type
0 2 L 83.7 CH
1 11 L 95.9 SI
Let me know if that is what you are searching for.
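For what it's worth, query can also express the OR/membership condition directly with its `in` syntax, so the list constraint stays inside the query string; a sketch assuming the same df as above:

```python
import pandas as pd

df = pd.DataFrame(
    {'zone': {0: 2, 1: 11, 2: 25, 3: 11, 4: 23, 5: 7, 6: 13},
     'stand': {0: 'L', 1: 'L', 2: 'L', 3: 'C', 4: 'L', 5: 'K', 6: 'L'},
     'speed': {0: 83.7, 1: 95.9, 2: 94.9, 3: 93.3, 4: 86.9, 5: 96.4, 6: 82.6},
     'type': {0: 'CH', 1: 'SI', 2: 'FS', 3: 'FS', 4: 'CH', 5: 'SI', 6: 'SL'}})

# 'in' inside query handles the OR over several allowed values
res = df.query("stand == 'L' and zone < 20 and type in ['SI', 'CH']")
```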

Create a trip report with end latitude and longitude

Please help, I have a data set structured like below
ss={'ride_id': {0: 'ride1',1: 'ride1',2: 'ride1',3: 'ride2',4: 'ride2',
5: 'ride2',6: 'ride2',7: 'ride3',8: 'ride3',9: 'ride3',10: 'ride3'},
'lat': {0: 5.616526,1: 5.623686, 2: 5.616555,3: 5.616556,4: 5.613834, 5: 5.612899,
6: 5.610804,7: 5.616614,8: 5.644431,9: 5.650771, 10: 5.610828},
'long': {0: -0.231901,1: -0.227248,2: -0.23192,3: -0.23168,4: -0.223812,
5: -0.22869,6: -0.226193,7: -0.231461,8: -0.237549,9: -0.271337,10: -0.226157},
'distance': {0: 0.0,1: 90.021,2: 138.0751,3: 0.0,4: 90.0041,5: 180.0293,6: 180.562, 7:0.0,8: 90.004,9: 180.0209,10: 189.0702},}
df=pd.DataFrame(ss)
the ride_id column indicates the number of trips taken in a window to make up the ride.
For example, ride1 consists of 2 trips, the first trip starts at index 0 and ends at index 1, then trip 2 starts at index 1 and ends at index 2.
I want to create a new data frame of trip reports, where each row has the start coordinates (lat, long), the trip's end coordinates (end_lat, end_long) taken from the next row, and then the distance. The result should look like the data frame below:
sf={'ride_id': {0: 'ride1',1: 'ride1',2: 'ride2',3: 'ride2',4: 'ride2',},
'lat': {0: 5.616526,1: 5.623686,2: 5.616556,3: 3.613834, 4: 5.612899},
'long': {0: -0.231901,1: -0.227248,2: -0.23168,3: -0.223812,4: -0.22869},
'end_lat':{0: 5.623686,1: 5.616555,2: 5.613834,3: 5.612899,4: 5.610804},
'end_long':{0: -0.227248,1: -0.23192,2: -0.223812,3: -0.22869,4: -0.226193},
'distance': {0: 90.02100,1: 138.07510,2: 90.00410,3: 180.02930,4: 180.5621},}
df_s=pd.DataFrame(sf)
df_s
OUT:
ride_id lat long end_lat end_long distance
0 ride1 5.616526 -0.231901 5.623686 -0.227248 90.0210
1 ride1 5.623686 -0.227248 5.616555 -0.231920 138.0751
2 ride2 5.616556 -0.231680 5.613834 -0.223812 90.0041
3 ride2 3.613834 -0.223812 5.612899 -0.228690 180.0293
4 ride2 5.612899 -0.228690 5.610804 -0.226193 180.5621
I tried to group the data frame by the ride_id to isolate each ride_id, but I'm stuck, any ideas are warmly welcomed.
We can use groupby with shift, then dropna:
df['start_lat'] = df.groupby('ride_id')['lat'].shift()
df['start_long'] = df.groupby('ride_id')['long'].shift()
df = df.dropna()
df
Out[480]:
ride_id lat long distance start_lat start_long
1 ride1 5.623686 -0.227248 90.0210 5.616526 -0.231901
2 ride1 5.616555 -0.231920 138.0751 5.623686 -0.227248
4 ride2 5.613834 -0.223812 90.0041 5.616556 -0.231680
5 ride2 5.612899 -0.228690 180.0293 5.613834 -0.223812
6 ride2 5.610804 -0.226193 180.5620 5.612899 -0.228690
8 ride3 5.644431 -0.237549 90.0040 5.616614 -0.231461
9 ride3 5.650771 -0.271337 180.0209 5.644431 -0.237549
10 ride3 5.610828 -0.226157 189.0702 5.650771 -0.271337
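To get exactly the layout in the question (start coordinates on each row, with end_lat/end_long pulled from the next row), the same idea works with shift(-1); this is my variant of the answer above:

```python
import pandas as pd

ss = {'ride_id': {0: 'ride1', 1: 'ride1', 2: 'ride1', 3: 'ride2', 4: 'ride2',
                  5: 'ride2', 6: 'ride2', 7: 'ride3', 8: 'ride3', 9: 'ride3', 10: 'ride3'},
      'lat': {0: 5.616526, 1: 5.623686, 2: 5.616555, 3: 5.616556, 4: 5.613834, 5: 5.612899,
              6: 5.610804, 7: 5.616614, 8: 5.644431, 9: 5.650771, 10: 5.610828},
      'long': {0: -0.231901, 1: -0.227248, 2: -0.23192, 3: -0.23168, 4: -0.223812,
               5: -0.22869, 6: -0.226193, 7: -0.231461, 8: -0.237549, 9: -0.271337, 10: -0.226157},
      'distance': {0: 0.0, 1: 90.021, 2: 138.0751, 3: 0.0, 4: 90.0041, 5: 180.0293,
                   6: 180.562, 7: 0.0, 8: 90.004, 9: 180.0209, 10: 189.0702}}
df = pd.DataFrame(ss)

g = df.groupby('ride_id')
df['end_lat'] = g['lat'].shift(-1)        # next point's coordinates...
df['end_long'] = g['long'].shift(-1)
df['distance'] = g['distance'].shift(-1)  # ...and the distance covered to reach it
trips = df.dropna().reset_index(drop=True)
```

The last row of each ride gets NaN from shift(-1) and is dropped, leaving one row per trip.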

New index level name after DataFrame.stack()

(Note that this SO question is similar-looking but different.)
I have a MultiIndexed DataFrame with columns representing yearly data:
>>> x = pd.DataFrame({
'country': {0: 4.0, 1: 8.0, 2: 12.0},
'series': {0: 553.0, 1: 553.0, 2: 553.0},
'2000': {0: '1100', 1: '28', 2: '120'},
'2005': {0: '730', 1: '24', 2: '100'}
}).set_index(['country', 'series'])
>>> x
2000 2005
country series
4 553 1100 730
8 553 28 24
12 553 120 100
When I stack the years, the new index level has no name:
>>> x.stack()
country series
4 553 2000 1100
2005 730
8 553 2000 28
2005 24
12 553 2000 120
2005 100
dtype: object
Is there a nice way to tell stack I'd like the new level to be called 'year'? It doesn't mention this in the docs.
I can always do:
>>> x.columns.name = 'year'
>>> x.stack()
But, to my mind, this doesn't qualify as very 'nice'. Can anyone do it in one line?
There is a chaining-friendly way to do it in one line (although admittedly not much nicer) using DataFrame.rename_axis:
x.rename_axis('year', axis=1).stack()
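A quick sanity check (my own snippet, using the question's data) that the rename_axis one-liner really produces the named level:

```python
import pandas as pd

x = pd.DataFrame({
    'country': {0: 4.0, 1: 8.0, 2: 12.0},
    'series': {0: 553.0, 1: 553.0, 2: 553.0},
    '2000': {0: '1100', 1: '28', 2: '120'},
    '2005': {0: '730', 1: '24', 2: '100'}
}).set_index(['country', 'series'])

# name the column axis on the fly, then stack it into the index
stacked = x.rename_axis('year', axis=1).stack()
```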
