I have a dataset of the following form.
id year
0 A 2000
1 A 2001
2 B 2005
3 B 2006
4 B 2007
5 C 2003
6 C 2004
7 D 2002
8 D 2003
Now two or more IDs are assumed to be part of an aggregated ID if their years can be arranged in consecutive order. This means that in the end I would like to have the following grouping, in which A & D form one group and B & C another:
id year match
0 A 2000 1
1 A 2001 1
7 D 2002 1
8 D 2003 1
5 C 2003 2
6 C 2004 2
2 B 2005 2
3 B 2006 2
4 B 2007 2
EDIT: Addressing @Dimitris_ps' comment: assuming an additional row
id year
9 A 2002
would change the desired result to
id year match
0 A 2000 1
1 A 2001 1
9 A 2002 1
5 C 2003 1
6 C 2004 1
2 B 2005 1
3 B 2006 1
4 B 2007 1
7 D 2002 2
8 D 2003 2
because now there is no longer a consecutive order for A & D, but there is one for A, C, and B, with D left unmatched.
Recode your id to integer codes based on each id's minimum year; then you can sort by year and id.
import pandas as pd
df = pd.DataFrame({'id':['A', 'A', 'B', 'B', 'B', 'C', 'C', 'D', 'D'],
'year':[2000, 2001, 2005, 2006, 2007, 2003, 2004, 2002, 2003]}) # example dataframe
# Create a dict mapping id to values based on the minimum year
custom_dict = {el:i for i, el in enumerate(df.groupby('id')['year'].min().sort_values().index)}
# and the reverse to map back the values to the id
custom_dict_rev = {v:k for k, v in custom_dict.items()}
df['id'] = df['id'].map(custom_dict)
df = df.sort_values(['year', 'id'])
df['id'] = df['id'].map(custom_dict_rev)
df
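A minimal sketch for also deriving the match column the question asks for, assuming each id's years form one contiguous block; an id joins an existing group whenever its first year immediately follows that group's last year:
spans = df.groupby('id')['year'].agg(['min', 'max']).sort_values('min')
chains = {}   # last year covered by each open chain -> group number
match = {}    # id -> group number
next_group = 1
for id_, row in spans.iterrows():
    if row['min'] - 1 in chains:           # id extends an existing chain
        group = chains.pop(row['min'] - 1)
    else:                                  # id starts a new chain
        group = next_group
        next_group += 1
    chains[row['max']] = group
    match[id_] = group
df['match'] = df['id'].map(match)
On the example data this assigns match 1 to A & D and match 2 to C & B; with the extra row from the EDIT it groups A, C, and B together and leaves D in its own group.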
So I have a panel df that looks like this:
ID  year  value
1   2002  8
1   2003  9
1   2004  10
2   2002  11
2   2003  11
2   2004  12
I want to set the value for every ID and for all years to the value in 2004. How do I do this?
The df should then look like this:
ID  year  value
1   2002  10
1   2003  10
1   2004  10
2   2002  12
2   2003  12
2   2004  12
I could not find anything online. So far I have tried getting the value for each ID in 2004, creating a new df from that, and merging it back in; however, that is very slow.
We can use Series.map for this. First we select the values and create our mapping:
mapping = df[df["year"].eq(2004)].set_index("ID")["value"]
df["value"] = df["ID"].map(mapping)
ID year value
0 1 2002 10
1 1 2003 10
2 1 2004 10
3 2 2002 12
4 2 2003 12
5 2 2004 12
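Note that any ID without a row for 2004 has no entry in the mapping, so map leaves its value as NaN.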
Let's convert the values whose corresponding year is not 2004 to NaN, then take the max value per ID.
df['value'] = (df.assign(value=df['value'].mask(df['year'].ne(2004)))
.groupby('ID')['value'].transform('max'))
print(df)
ID year value
0 1 2002 10.0
1 1 2003 10.0
2 1 2004 10.0
3 2 2002 12.0
4 2 2003 12.0
5 2 2004 12.0
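This works because, after masking, each ID has at most one non-NaN value (the one from 2004), so transform('max') simply broadcasts it across the group; transform('first') would work just as well.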
Another method, for some variety.
import numpy as np

# Make everything that isn't 2004 null~
df.loc[df.year.ne(2004), 'value'] = np.nan
# Fill the values by ID~
df['value'] = df.groupby('ID')['value'].bfill()
Output:
ID year value
0 1 2002 10.0
1 1 2003 10.0
2 1 2004 10.0
3 2 2002 12.0
4 2 2003 12.0
5 2 2004 12.0
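Note that bfill alone only fills rows that come before the 2004 row within each ID; if any rows can come after it, a variant that fills in both directions (a small addition to the answer above, same assumptions otherwise):
df['value'] = df.groupby('ID')['value'].transform(lambda s: s.bfill().ffill())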
Yet another method; a bit longer, but quite intuitive: build a lookup table mapping ID to value, then perform the lookup using pandas.merge.
import pandas as pd
# Original dataframe
df_orig = pd.DataFrame([(1, 2002, 8), (1, 2003, 9), (1, 2004, 10), (2, 2002, 11), (2, 2003, 11), (2, 2004, 12)])
df_orig.columns = ['ID', 'year', 'value']
# Dataframe with 2004 IDs (selecting and dropping 'year' in one step avoids
# a SettingWithCopyWarning from an inplace drop on a slice)
df_2004 = df_orig.loc[df_orig['year'] == 2004].drop(columns=['year'])
print(df_2004)
print(df_2004)
# Drop values from df_orig and replace with those from df_2004
df_orig.drop(columns=['value'], inplace=True)
df_final = pd.merge(df_orig, df_2004, on='ID', how='right')
print(df_final)
df_2004:
ID value
2 1 10
5 2 12
df_final:
ID year value
0 1 2002 10
1 1 2003 10
2 1 2004 10
3 2 2002 12
4 2 2003 12
5 2 2004 12
I have a dataframe
A B Value FY
1 5 a 2020
2 6 b 2020
3 7 c 2021
4 8 d 2021
I want to create a column 'prev_FY' that, for each row, is populated with the Value from the corresponding row of the previous FY;
my desired output is:
A B Value FY prev_FY
1 5 a 2020
2 6 b 2020
3 7 c 2021 a
4 8 d 2021 b
I tried using pivot_table, but it does not work because the values stay tied to their own FY. shift is not feasible as I have millions of rows.
Use a self-merge: number the rows within each FY with cumcount, add one year to the helper frame's FY, and merge it back:
df['g'] = df.groupby('FY').cumcount()
df2 = df[['FY','Value','g']].assign(FY = df['FY'].add(1))
df = df.merge(df2, on=['FY','g'], how='left', suffixes=('','_prev')).drop('g', axis=1)
print (df)
A B Value FY Value_prev
0 1 5 a 2020 NaN
1 2 6 b 2020 NaN
2 3 7 c 2021 a
3 4 8 d 2021 b
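Note that this pairs rows by their position within each FY (the cumcount helper g), so the g-th row of 2021 receives the Value of the g-th row of 2020; rows without a counterpart in the previous year end up with NaN.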
I have a dataframe:
df = pd.DataFrame([[2, 4, 7, 8, 1, 3, 2013], [9, 2, 4, 5, 5, 6, 2014]], columns=['Amy', 'Bob', 'Carl', 'Chris', 'Ben', 'Other', 'Year'])
Amy Bob Carl Chris Ben Other Year
0 2 4 7 8 1 3 2013
1 9 2 4 5 5 6 2014
And a dictionary:
d = {'A': ['Amy'], 'B': ['Bob', 'Ben'], 'C': ['Carl', 'Chris']}
I would like to reshape my dataframe to look like this:
Group Name Year Value
0 A Amy 2013 2
1 A Amy 2014 9
2 B Bob 2013 4
3 B Bob 2014 2
4 B Ben 2013 1
5 B Ben 2014 5
6 C Carl 2013 7
7 C Carl 2014 4
8 C Chris 2013 8
9 C Chris 2014 5
10 Other 2013 3
11 Other 2014 6
Note that Other doesn't have any values in the Name column and the order of the rows does not matter. I think I should be using the melt function but the examples that I've come across aren't too clear.
melt gets you part way there.
In [29]: m = pd.melt(df, id_vars=['Year'], var_name='Name')
This has everything except Group. To get that, we need to reshape d a bit as well.
In [30]: d2 = {}
In [31]: for k, v in d.items():
for item in v:
d2[item] = k
....:
In [32]: d2
Out[32]: {'Amy': 'A', 'Ben': 'B', 'Bob': 'B', 'Carl': 'C', 'Chris': 'C'}
In [34]: m['Group'] = m['Name'].map(d2)
In [35]: m
Out[35]:
Year Name value Group
0 2013 Amy 2 A
1 2014 Amy 9 A
2 2013 Bob 4 B
3 2014 Bob 2 B
4 2013 Carl 7 C
.. ... ... ... ...
7 2014 Chris 5 C
8 2013 Ben 1 B
9 2014 Ben 5 B
10 2013 Other 3 NaN
11 2014 Other 6 NaN
[12 rows x 4 columns]
And moving 'Other' from Name to Group
In [8]: mask = m['Name'] == 'Other'
In [9]: m.loc[mask, 'Name'] = ''
In [10]: m.loc[mask, 'Group'] = 'Other'
In [11]: m
Out[11]:
Year Name value Group
0 2013 Amy 2 A
1 2014 Amy 9 A
2 2013 Bob 4 B
3 2014 Bob 2 B
4 2013 Carl 7 C
.. ... ... ... ...
7 2014 Chris 5 C
8 2013 Ben 1 B
9 2014 Ben 5 B
10 2013 3 Other
11 2014 6 Other
[12 rows x 4 columns]
Pandas melt function:
This function is useful to massage a DataFrame into a format where one or more columns are identifier variables (id_vars), while all other columns, considered measured variables (value_vars), are “unpivoted” to the row axis, leaving just two non-identifier columns, ‘variable’ and ‘value’.
For example, with a df that has a weekday identifier column:
melted = pd.melt(df, id_vars=["weekday"],
                 var_name="Person", value_name="Score")
We use melt to transform wide data into long data.
I have a dataframe like this
import pandas as pd
year = [2005, 2006, 2007]
A = [4, 5, 7]
B = [3, 3, 9]
C = [1, 7, 6]
df_old = pd.DataFrame({'year' : year, 'A' : A, 'B' : B, 'C' : C})
Out[25]:
A B C year
0 4 3 1 2005
1 5 3 7 2006
2 7 9 6 2007
I want to transform this to a new dataframe where the column headers 'A', 'B' and 'C' appear in the rows. I have this hack, which sort of does the job:
df_new = pd.DataFrame({'year' : list(df_old['year']) + list(df_old['year'])\
+ list(df_old['year']),
'col' : ['A']*len(df_old['A']) + ['B']*len(df_old['B'])\
+ ['C']*len(df_old['C']),
'val' : list(df_old['A']) + list(df_old['B'])\
+ list(df_old['C'])})
Out[27]:
col val year
0 A 4 2005
1 A 5 2006
2 A 7 2007
3 B 3 2005
4 B 3 2006
5 B 9 2007
6 C 1 2005
7 C 7 2006
8 C 6 2007
Is there a better, more compressed way to do this? Needless to say, this becomes cumbersome when there are a lot of columns.
Use melt:
print (df_old.melt('year', value_name='val', var_name='col'))
year col val
0 2005 A 4
1 2006 A 5
2 2007 A 7
3 2005 B 3
4 2006 B 3
5 2007 B 9
6 2005 C 1
7 2006 C 7
8 2007 C 6
and to reorder the columns, use reindex:
df = df_old.melt('year', value_name='val', var_name='col').reindex(columns=['col', 'val', 'year'])
print (df)
col val year
0 A 4 2005
1 A 5 2006
2 A 7 2007
3 B 3 2005
4 B 3 2006
5 B 9 2007
6 C 1 2005
7 C 7 2006
8 C 6 2007
I have the following DataFrame:
Date best a b c d
1990 a 5 4 7 2
1991 c 10 1 2 0
1992 d 2 1 4 12
1993 a 5 8 11 6
I would like to make a dataframe as follows:
Date best value
1990 a 5
1991 c 2
1992 d 12
1993 a 5
So I am looking to find a value based on another field in the same row by using column names. For instance, the 1990 row in the second df should look up column "a" (=5) from the first df, and the second row should look up "c" (=2).
Any ideas?
There is a built-in lookup function that can handle this type of situation (it looks up by row/column). I don't know how optimized it is, but it may be faster than the apply solution. (Note: DataFrame.lookup has since been deprecated, in pandas 1.2.0; see the later answers for alternatives.)
In [9]: df['value'] = df.lookup(df.index, df['best'])
In [10]: df
Out[10]:
Date best a b c d value
0 1990 a 5 4 7 2 5
1 1991 c 10 1 2 0 2
2 1992 d 2 1 4 12 12
3 1993 a 5 8 11 6 5
You can create a lookup function and call apply on your dataframe row-wise; this isn't very efficient for large dfs, though.
In [245]:
def lookup(x):
return x[x.best]
df['value'] = df.apply(lambda row: lookup(row), axis=1)
df
Out[245]:
Date best a b c d value
0 1990 a 5 4 7 2 5
1 1991 c 10 1 2 0 2
2 1992 d 2 1 4 12 12
3 1993 a 5 8 11 6 5
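As an aside, df.apply(lookup, axis=1) would work just as well here; the lambda wrapper is not needed.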
You can do this using np.where as shown below; I think it will be more efficient.
import numpy as np
import pandas as pd
df = pd.DataFrame([['1990', 'a', 5, 4, 7, 2], ['1991', 'c', 10, 1, 2, 0], ['1992', 'd', 2, 1, 4, 12], ['1993', 'a', 5, 8, 11, 6]], columns=('Date', 'best', 'a', 'b', 'c', 'd'))
arr = df.best.values
cols = df.columns[2:]
for col in cols:
arr2 = df[col].values
arr = np.where(arr==col, arr2, arr)
df.drop(columns=cols, inplace=True)
df["values"] = arr
df
Result
Date best values
0 1990 a 5
1 1991 c 2
2 1992 d 12
3 1993 a 5
lookup is deprecated since version 1.2.0. With melt you can 'unpivot' columns to the row axis, where the column names are stored by default in the column variable and their values in the column value. query keeps only the rows where the columns best and variable are equal. drop and sort_values are used to match your requested format.
df_new = (
df.melt(id_vars=['Date', 'best'], value_vars=['a', 'b', 'c', 'd'])
.query('best == variable')
.drop('variable', axis=1)
.sort_values('Date')
)
Output:
Date best value
0 1990 a 5
9 1991 c 2
14 1992 d 12
3 1993 a 5
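If you prefer a clean 0..n index in the result, append .reset_index(drop=True) to the chain.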
A simple solution that uses a mapper dictionary:
vals = df[['a','b','c','d']].to_dict('list')
# Positional lookup: this assumes the default RangeIndex, so the index label k
# also serves as the position within each column's value list
mapper = {k: vals[v][k] for k, v in zip(df.index, df['best'])}
df['value'] = df.index.map(mapper).to_numpy()
Output:
Date best a b c d value
0 1990 a 5 4 7 2 5
1 1991 c 10 1 2 0 2
2 1992 d 2 1 4 12 12
3 1993 a 5 8 11 6 5
Use looking up values by index/column labels (the documented replacement for DataFrame.lookup, which is deprecated since version 1.2.0):
import numpy as np

# factorize encodes 'best' as integer codes (idx) plus the unique labels (cols)
idx, cols = pd.factorize(df['best'])
# reindex orders the columns to match cols; the fancy indexing then picks,
# for each row i, the value in column idx[i]
df['value'] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
print (df)
Date best a b c d value
0 1990 a 5 4 7 2 5
1 1991 c 10 1 2 0 2
2 1992 d 2 1 4 12 12
3 1993 a 5 8 11 6 5