I have a data frame
Count ID Date
1 1 2020-07-09
2 1 2020-07-11
1 1 2020-07-21
1 2 2020-07-04
2 2 2020-07-09
3 2 2020-07-18
1 3 2020-07-02
2 3 2020-07-05
1 3 2020-07-19
2 3 2020-07-22
Within each ID group, I want to subtract from each row's date the date of the row above it that has the same Count. Rows without an earlier row with the same Count get a value of zero.
Expected output:
ID Date Days
1 2020-07-09 0
1 2020-07-11 0
1 2020-07-21 12 (2020-07-21 MINUS 2020-07-09)
2 2020-07-04 0
2 2020-07-09 0
2 2020-07-18 0
3 2020-07-02 0
3 2020-07-05 0
3 2020-07-19 17 (2020-07-19 MINUS 2020-07-02)
3 2020-07-22 17 (2020-07-22 MINUS 2020-07-05)
My initial thought was to filter out the Count-ID pairs and then do the calculation. I was wondering if there is a better way around this?
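For reference, the sample frame can be built like this (a sketch; Date starts out as strings, which the answer below accounts for):
import pandas as pd

df = pd.DataFrame({
    'Count': [1, 2, 1, 1, 2, 3, 1, 2, 1, 2],
    'ID':    [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
    'Date':  ['2020-07-09', '2020-07-11', '2020-07-21', '2020-07-04',
              '2020-07-09', '2020-07-18', '2020-07-02', '2020-07-05',
              '2020-07-19', '2020-07-22'],
})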
You can use groupby() to group by the columns ID and Count, get the difference in days with .diff(), and fill the NaN values with 0 by .fillna(), as follows:
df['Date'] = pd.to_datetime(df['Date']) # convert to datetime if not already in datetime format
df['Days'] = df.groupby(['ID', 'Count'])['Date'].diff().dt.days.fillna(0, downcast='infer')
Result:
print(df)
Count ID Date Days
0 1 1 2020-07-09 0
1 2 1 2020-07-11 0
2 1 1 2020-07-21 12
3 1 2 2020-07-04 0
4 2 2 2020-07-09 0
5 3 2 2020-07-18 0
6 1 3 2020-07-02 0
7 2 3 2020-07-05 0
8 1 3 2020-07-19 17
9 2 3 2020-07-22 17
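Note: in recent pandas versions the downcast argument to fillna() is deprecated. If you see a warning, an equivalent is to cast explicitly:
df['Days'] = df.groupby(['ID', 'Count'])['Date'].diff().dt.days.fillna(0).astype(int)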
I like SeaBean's answer, but here is what I was working on before I saw it:
df2 = df.sort_values(by=['ID', 'Count'])
df2['Date'] = pd.to_datetime(df2['Date'])
df2['shift1'] = df2.groupby(['ID', 'Count'])['Date'].shift(1)
df2['diff'] = (df2['Date'] - df2['shift1'].combine_first(df2['Date'])).dt.days
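As an optional cleanup step (not part of the answer above): drop the helper column and restore the input order via the index preserved by sort_values():
df2 = df2.drop(columns='shift1').sort_index()
print(df2)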
I have a DataFrame that looks like this (but with 149110 rows instead of these 29):
import numpy as np
import pandas as pd

df = {'group':['a','a','a','a',
'b','b','b','b','b','b','b','b','b',
'c','c','c','c','c',
'd','d','d','d','d','d','d',
'e','e','e','e',],
'date':[np.datetime64('2020-01-01'),np.datetime64('2020-01-01'),np.datetime64('2020-01-01'),np.datetime64('2020-01-01'),
np.datetime64('2019-03-12'),np.datetime64('2019-03-12'),np.datetime64('2019-03-12'),np.datetime64('2019-03-12'),
np.datetime64('2019-03-12'),np.datetime64('2019-03-12'),np.datetime64('2019-03-12'),np.datetime64('2019-03-12'),
np.datetime64('2019-03-12'),
np.datetime64('2020-01-01'),np.datetime64('2020-01-01'),np.datetime64('2020-01-01'),np.datetime64('2020-01-01'),
np.datetime64('2020-01-01'),
np.datetime64('2019-01-17'),np.datetime64('2019-01-17'),np.datetime64('2019-01-17'),np.datetime64('2019-01-17'),
np.datetime64('2019-01-17'),np.datetime64('2019-01-17'),np.datetime64('2019-01-17'),
np.datetime64('2018-12-03'),np.datetime64('2018-12-03'),np.datetime64('2018-12-03'),np.datetime64('2018-12-03')],
'id':['tom','taliha','alyssa','randyl',
'tom','taliha','edward','aaron','daniel','jean','sigmund','albus','riddle',
'fellicia','ron','fred','george','alex',
'taliha','alyssa','locke','jon','jamie','sam','sydney',
'jon','jamie','sam','arya'],
'value':[1,2,3,4,
7,6,4,8,2,3,5,9,1,
1,2,3,4,5,
5,7,6,3,4,1,2,
3,2,1,4]}
df = pd.DataFrame(df)
df
group date id value
0 a 2020-01-01 tom 1
1 a 2020-01-01 taliha 2
2 a 2020-01-01 alyssa 3
3 a 2020-01-01 randyl 4
4 b 2019-03-12 tom 7
5 b 2019-03-12 taliha 6
6 b 2019-03-12 edward 4
7 b 2019-03-12 aaron 8
8 b 2019-03-12 daniel 2
9 b 2019-03-12 jean 3
10 b 2019-03-12 sigmund 5
11 b 2019-03-12 albus 9
12 b 2019-03-12 riddle 1
13 c 2020-01-01 fellicia 1
14 c 2020-01-01 ron 2
15 c 2020-01-01 fred 3
16 c 2020-01-01 george 4
17 c 2020-01-01 alex 5
18 d 2019-01-17 taliha 5
19 d 2019-01-17 alyssa 7
20 d 2019-01-17 locke 6
21 d 2019-01-17 jon 3
22 d 2019-01-17 jamie 4
23 d 2019-01-17 sam 1
24 d 2019-01-17 sydney 2
25 e 2018-12-03 jon 3
26 e 2018-12-03 jamie 2
27 e 2018-12-03 sam 1
28 e 2018-12-03 arya 4
I need a column together that is 1 if the person has been in a group, within the past year, with another person from the current group.
For example, in group 'a' we have 4 people, but tom and taliha were in a group together on 2019-03-12, i.e. they were both in group 'b'. We can also see that taliha and alyssa were together in group 'd'. So I want the value of together for group 'a' to be 1 for tom, taliha, and alyssa, but 0 for randyl, because he hasn't been in a group with any of them in the past year.
Then for groups 'b' and 'c', because no one has been in a group with anyone else in the past year, I want the value of together to be 0 for everyone.
For group 'd', within the last year we can see that jon, jamie, and sam were in the same group, i.e. they were part of group 'e'. So the value of together for jon, jamie, and sam in group 'd' should be 1, and 0 for the rest of the people.
And as there is no data before group 'e', its members should all be assigned 0.
Then I want to create another new column rel based on this, depending on the value the people had in the previous group. I want rel to be 1 if the person had a lower value than the other person in the past group, and -1 if their value was higher.
For example, in group 'a' the value of rel for tom should be -1 because he had a higher value than taliha in 'b', and correspondingly the value of rel for taliha should be 1, because she had a lower value than tom in 'b'. For alyssa I want the value of rel to be -1 because in group 'd' she had a higher value than taliha.
Basically the idea is that the lower the value, the better; I am trying to rank people by their past values. So for group 'a' I need a system that shows that taliha > tom and taliha > alyssa. But we don't know the relationship between tom and alyssa, so I treat them as the same value. I also don't know the relationship between randyl and everyone else in group 'a', so I want his value of rel set to 0.
If, for example, I find a relationship like person 1 > person 2 > person 3 and no history for person 4, I want rel to reflect this: person 1 = 2, person 2 = 0, person 3 = -2, and person 4 = 0.
So I want the resulting DataFrame to look something like this:
group date id value together rel
0 a 2020-01-01 tom 1 1 -1
1 a 2020-01-01 taliha 2 1 1
2 a 2020-01-01 alyssa 3 1 -1
3 a 2020-01-01 randyl 4 0 0
4 b 2019-03-12 tom 7 0 0
5 b 2019-03-12 taliha 6 0 0
6 b 2019-03-12 edward 4 0 0
7 b 2019-03-12 aaron 8 0 0
8 b 2019-03-12 daniel 2 0 0
9 b 2019-03-12 jean 3 0 0
10 b 2019-03-12 sigmund 5 0 0
11 b 2019-03-12 albus 9 0 0
12 b 2019-03-12 riddle 1 0 0
13 c 2020-01-01 fellicia 1 0 0
14 c 2020-01-01 ron 2 0 0
15 c 2020-01-01 fred 3 0 0
16 c 2020-01-01 george 4 0 0
17 c 2020-01-01 alex 5 0 0
18 d 2019-01-17 taliha 5 0 0
19 d 2019-01-17 alyssa 7 0 0
20 d 2019-01-17 locke 6 0 0
21 d 2019-01-17 jon 3 1 -2
22 d 2019-01-17 jamie 4 1 0
23 d 2019-01-17 sam 1 1 2
24 d 2019-01-17 sydney 2 0 0
25 e 2018-12-03 jon 3 0 0
26 e 2018-12-03 jamie 2 0 0
27 e 2018-12-03 sam 1 0 0
28 e 2018-12-03 arya 4 0 0
I'll give it a try. The first task seems rather easy, the second gave me a headache. And my result for the second part differs slightly from your expectation. Maybe you've made a mistake, but most likely it's due to my misunderstanding.
from itertools import combinations

# Sets of ids per (year, group), and values indexed by (year, group, id)
df_grps = df.groupby([df.date.dt.year, 'group']).id.apply(set)
df_vals = df.set_index([df.date.dt.year, 'group', 'id']).value

results = {}
for year in sorted(df.date.dt.year.unique())[1:]:
    groups = {}
    for group in df_grps.loc[year].index:
        ids = df_grps.loc[year, group]
        # Everyone in the current group who shares a previous-year group
        # with at least one other member of the current group
        together = set().union(*(
            i for i in (ids & h for h in df_grps.loc[year - 1]) if len(i) > 1
        ))
        if not together:
            continue
        together = {i: 0 for i in together}
        for i, j in combinations(together, 2):
            for group_old in df_grps.loc[year - 1].index:
                if not {i, j} <= df_grps.at[year - 1, group_old]:
                    continue
                # Lower value in the old group scores +1, higher scores -1
                i_val = df_vals.at[year - 1, group_old, i]
                j_val = df_vals.at[year - 1, group_old, j]
                if i_val < j_val:
                    together[i] += 1
                    together[j] -= 1
                elif i_val > j_val:
                    together[i] -= 1
                    together[j] += 1
        groups[group] = together
    if groups:
        results[year] = groups

# Flatten the nested dict into a frame indexed like df below
df_res = pd.DataFrame(
    [
        [year, group, i, r]
        for year, groups in results.items()
        for group, rel in groups.items()
        for i, r in rel.items()
    ],
    columns=['date', 'group', 'id', 'rel']
).set_index(['date', 'group', 'id'])

df.set_index([df.date.dt.year, 'group', 'id'], inplace=True)
df['together'], df['rel'] = 0, 0
df.loc[df_res.index, 'together'] = 1
df.loc[df_res.index, 'rel'] = df_res.rel
Result for your sample frame:
date value together rel
date group id
2020 a tom 2020-01-01 1 1 -1
taliha 2020-01-01 2 1 2
alyssa 2020-01-01 3 1 -1
randyl 2020-01-01 4 0 0
2019 b tom 2019-03-12 7 0 0
taliha 2019-03-12 6 0 0
edward 2019-03-12 4 0 0
aaron 2019-03-12 8 0 0
daniel 2019-03-12 2 0 0
jean 2019-03-12 3 0 0
sigmund 2019-03-12 5 0 0
albus 2019-03-12 9 0 0
riddle 2019-03-12 1 0 0
2020 c fellicia 2020-01-01 1 0 0
ron 2020-01-01 2 0 0
fred 2020-01-01 3 0 0
george 2020-01-01 4 0 0
alex 2020-01-01 5 0 0
2019 d taliha 2019-01-17 5 0 0
alyssa 2019-01-17 7 0 0
locke 2019-01-17 6 0 0
jon 2019-01-17 3 1 -2
jamie 2019-01-17 4 1 0
sam 2019-01-17 1 1 2
sydney 2019-01-17 2 0 0
2018 e jon 2018-12-03 3 0 0
jamie 2018-12-03 2 0 0
sam 2018-12-03 1 0 0
arya 2018-12-03 4 0 0
PS: I also have a version that stays a bit more within the Pandas framework, but it's longer. I'll post it if you're interested.
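For the first task (the together flag) on its own, here is a merge-based sketch as an illustration (not the version mentioned above). It assumes the original flat df, before the set_index call, and uses a rolling 365-day look-back window rather than calendar years (an assumption; both agree on this sample):
# Pair every person with every other member of the same group
pairs = df.merge(df, on='group', suffixes=('', '_b'))
pairs = pairs[pairs.id != pairs.id_b][['group', 'date', 'id', 'id_b']]

# For each pair, look for an earlier co-occurrence within 365 days
hits = pairs.merge(pairs, on=['id', 'id_b'], suffixes=('', '_old'))
window = (hits.date_old < hits.date) & (hits.date_old >= hits.date - pd.Timedelta(days=365))
flagged = hits.loc[window, ['group', 'id']].drop_duplicates()

out = df.merge(flagged.assign(together=1), on=['group', 'id'], how='left')
out['together'] = out['together'].fillna(0).astype(int)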
I am trying to drop NA values from a pandas dataframe.
I have used dropna(), which should drop all rows containing NA from the dataframe. Yet, it does not work.
Here is the code:
import pandas as pd
import numpy as np
prison_data = pd.read_csv('https://andrewshinsuke.me/docs/compas-scores-two-years.csv')
That's how you get the data frame. As the following shows, the default read_csv method does indeed convert the NA data points to np.nan.
np.isnan(prison_data.head()['out_custody'][4])
Out[2]: True
Conveniently, the head() of the DF already contains NaN values (in the column out_custody), so when you print prison_data.head() you get:
id name first last compas_screening_date sex
0 1 miguel hernandez miguel hernandez 2013-08-14 Male
1 3 kevon dixon kevon dixon 2013-01-27 Male
2 4 ed philo ed philo 2013-04-14 Male
3 5 marcu brown marcu brown 2013-01-13 Male
4 6 bouthy pierrelouis bouthy pierrelouis 2013-03-26 Male
dob age age_cat race ...
0 1947-04-18 69 Greater than 45 Other ...
1 1982-01-22 34 25 - 45 African-American ...
2 1991-05-14 24 Less than 25 African-American ...
3 1993-01-21 23 Less than 25 African-American ...
4 1973-01-22 43 25 - 45 Other ...
v_decile_score v_score_text v_screening_date in_custody out_custody
0 1 Low 2013-08-14 2014-07-07 2014-07-14
1 1 Low 2013-01-27 2013-01-26 2013-02-05
2 3 Low 2013-04-14 2013-06-16 2013-06-16
3 6 Medium 2013-01-13 NaN NaN
4 1 Low 2013-03-26 NaN NaN
priors_count.1 start end event two_year_recid
0 0 0 327 0 0
1 0 9 159 1 1
2 4 0 63 0 1
3 1 0 1174 0 0
4 2 0 1102 0 0
However, running prison_data.dropna() does not change the dataframe in any way.
prison_data.dropna()
np.isnan(prison_data.head()['out_custody'][4])
Out[3]: True
df.dropna() by default returns a new DataFrame without the NaN values, so you have to assign the result back to the variable:
df = df.dropna()
If you want it to modify df in place, you have to explicitly specify:
df.dropna(inplace=True)
It also wasn't working as expected here because there was at least one NaN in every row, so dropna() with the default arguments drops every row.
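To illustrate with a made-up mini-frame: when every row contains at least one NaN, dropna() with default arguments removes all rows, so subset= is often what you want:
import numpy as np
import pandas as pd

demo = pd.DataFrame({'a': [1, np.nan, 3], 'b': [np.nan, 2, np.nan]})
print(demo.dropna())              # empty: every row has a NaN somewhere
print(demo.dropna(subset=['a']))  # drops only rows where column 'a' is NaN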
I have my data in a pandas dataframe
out[1]:
NAME STORE AMOUNT
0 GARY GAP 20
1 GARY GAP 10
2 GARY KROGER 15
3 ASHLEY FOREVER21 30
4 ASHLEY KROGER 10
5 MARK GAP 10
6 ROGER KROGER 30
I'm trying to group by name and sum the total amount spent, while also generating a column for each unique store in the dataframe.
Desired:
out[1]:
NAME GAP KROGER FOREVER21
0 GARY 30 15 0
1 ASHLEY 0 10 30
2 MARK 10 0 0
3 ROGER 0 30 0
Thanks for your help!
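For reference, the input frame can be reproduced like this:
import pandas as pd

df = pd.DataFrame({
    'NAME': ['GARY', 'GARY', 'GARY', 'ASHLEY', 'ASHLEY', 'MARK', 'ROGER'],
    'STORE': ['GAP', 'GAP', 'KROGER', 'FOREVER21', 'KROGER', 'GAP', 'KROGER'],
    'AMOUNT': [20, 10, 15, 30, 10, 10, 30],
})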
You need pivot_table:
df1 = df.pivot_table(index='NAME',
columns='STORE',
values='AMOUNT',
aggfunc='sum',
fill_value=0)
print (df1)
STORE FOREVER21 GAP KROGER
NAME
ASHLEY 30 0 10
GARY 0 30 15
MARK 0 10 0
ROGER 0 0 30
Alternative solution with aggregating by groupby and sum:
df1 = df.groupby(['NAME','STORE'])['AMOUNT'].sum().unstack(fill_value=0)
print (df1)
STORE FOREVER21 GAP KROGER
NAME
ASHLEY 30 0 10
GARY 0 30 15
MARK 0 10 0
ROGER 0 0 30
Finally, if you need the index values as a column and want to remove the column and index names:
print (df1.reset_index().rename_axis(None, axis=1).rename_axis(None))
NAME FOREVER21 GAP KROGER
0 ASHLEY 30 0 10
1 GARY 0 30 15
2 MARK 0 10 0
3 ROGER 0 0 30
I have a DataFrame (df) with various columns. In this assignment I have to find the difference between summer gold medals and winter gold medals, relative to total medals, for each country using stats about the olympics.
I must only include countries that have at least one gold medal. I am trying to use dropna() to exclude those countries that do not have at least one medal. My current code:
def answer_three():
    df['medal_count'] = df['Gold'] - df['Gold.1']
    df['medal_count'].dropna()
    df['medal_dif'] = df['medal_count'] / df['Gold.2']
    df['medal_dif'].dropna()
    return df.head()
print (answer_three())
This results in the following output:
# Summer Gold Silver Bronze Total # Winter Gold.1 \
Afghanistan 13 0 0 2 2 0 0
Algeria 12 5 2 8 15 3 0
Argentina 23 18 24 28 70 18 0
Armenia 5 1 2 9 12 6 0
Australasia 2 3 4 5 12 0 0
Silver.1 Bronze.1 Total.1 # Games Gold.2 Silver.2 Bronze.2 \
Afghanistan 0 0 0 13 0 0 2
Algeria 0 0 0 15 5 2 8
Argentina 0 0 0 41 18 24 28
Armenia 0 0 0 11 1 2 9
Australasia 0 0 0 2 3 4 5
Combined total ID medal_count medal_dif
Afghanistan 2 AFG 0 NaN
Algeria 15 ALG 5 1.0
Argentina 70 ARG 18 1.0
Armenia 12 ARM 1 1.0
Australasia 12 ANZ 3 1.0
I need to get rid of both the '0' values in "medal_count" and the NaN in "medal_dif".
I am also aware that the maths/the way I have written the code is probably not the correct way to solve the question, but I think I need to start by dropping these values? Any help with any of the above is greatly appreciated.
You can pass an axis into dropna(), e.g. axis=1.
An axis of 0 means rows and 1 means columns; 0 is the default.
With axis=1, any column that contains a NaN value is dropped entirely, as in the example below.
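A small sketch with a made-up mini-frame (the column names are only for illustration):
import numpy as np
import pandas as pd

demo = pd.DataFrame({'Gold': [0, 5, 18], 'medal_dif': [np.nan, 1.0, 1.0]})
print(demo.dropna())        # default axis=0: drops the first row, which has a NaN
print(demo.dropna(axis=1))  # drops the entire medal_dif column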