I have a DataFrame with a column named 'ID' that contains duplicate values. Each row holds one or more article values (in the 'Article' and 'Article_2' columns). I want to transpose the DataFrame grouped by 'ID', adding the extra values as new columns on the single row for each unique 'ID'. Order is important.
My DataFrame:
ID Article Article_2
1 Banana NaN
2 Apple NaN
1 Apple Coconut
3 Tomatoe Coconut
1 Pineapple Tropical
2 Banana Coconut
4 Apple Coconut
5 Apple Coconut
3 Apple Pineapple
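The frame above can be reproduced with the following (a minimal sketch, assuming the frame is named df1 as in the code below):

import pandas as pd
import numpy as np

# reproducible construction of the example frame
df1 = pd.DataFrame({
    'ID': [1, 2, 1, 3, 1, 2, 4, 5, 3],
    'Article': ['Banana', 'Apple', 'Apple', 'Tomatoe', 'Pineapple',
                'Banana', 'Apple', 'Apple', 'Apple'],
    'Article_2': [np.nan, np.nan, 'Coconut', 'Coconut', 'Tropical',
                  'Coconut', 'Coconut', 'Coconut', 'Pineapple'],
})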
My code (by @Erfan):
dfn = pd.melt(df1, id_vars='ID', value_vars=['Article', 'Article_2'])
dfn = dfn.pivot_table(index='ID',
                      columns=dfn.groupby('ID')['value'].cumcount().add(1),
                      values='value',
                      aggfunc='first').add_prefix('Article_').rename_axis(None, axis='index')
Output:
Article_1 Article_2 Article_3 Article_4 Article_5 Article_6
0001 Banana Apple Pineapple NaN Coconut Tropical
0002 Apple Banana NaN Coconut NaN NaN
0003 Tomatoe Apple Coconut Pineapple NaN NaN
0004 Apple Coconut NaN NaN NaN NaN
0005 Apple Coconut NaN NaN NaN NaN
In the first row, Article_4 is NaN while Article_5 and Article_6 have values. It should be Article_4 = Coconut, Article_5 = Tropical, and Article_6 = NaN.
In the second row, the same problem: Article_3 is NaN while Article_4 holds a valid value. It should be Article_3 that holds the value, with the rest (4, 5, 6) NaN.
Needed output:
Article_1 Article_2 Article_3 Article_4 Article_5 Article_6
0001 Banana Apple Pineapple Coconut Tropical NaN
0002 Apple Banana Coconut NaN NaN NaN
0003 Tomatoe Apple Coconut Pineapple NaN NaN
0004 Apple Coconut NaN NaN NaN NaN
0005 Apple Coconut NaN NaN NaN NaN
Add DataFrame.dropna after melt to remove rows with missing values in the value column:
dfn = pd.melt(df1, id_vars='ID', value_vars=['Article', 'Article_2']).dropna(subset=['value'])
dfn = dfn.pivot_table(index='ID',
                      columns=dfn.groupby('ID')['value'].cumcount().add(1),
                      values='value',
                      aggfunc='first').add_prefix('Article_').rename_axis(None, axis='index')
print (dfn)
Article_1 Article_2 Article_3 Article_4 Article_5
1 Banana Apple Pineapple Coconut Tropical
2 Apple Banana Coconut NaN NaN
3 Tomatoe Apple Coconut Pineapple NaN
4 Apple Coconut NaN NaN NaN
5 Apple Coconut NaN NaN NaN
If you need all the columns, use a slightly modified justify function:
dfn = pd.melt(df1, id_vars='ID', value_vars=['Article', 'Article_2'])
dfn = dfn.pivot_table(index='ID',
                      columns=dfn.groupby('ID')['value'].cumcount().add(1),
                      values='value',
                      aggfunc='first').add_prefix('Article_').rename_axis(None, axis='index')
#https://stackoverflow.com/a/44559180/2901002
def justify(a, invalid_val=0, axis=1, side='left'):
    """
    Justifies a 2D array

    Parameters
    ----------
    a : ndarray
        Input array to be justified
    invalid_val : scalar
        Value to treat as missing
    axis : int
        Axis along which justification is to be made
    side : str
        Direction of justification. It could be 'left', 'right', 'up', 'down'.
        It should be 'left' or 'right' for axis=1 and 'up' or 'down' for axis=0.
    """
    if invalid_val is np.nan:
        mask = pd.notna(a)
    else:
        mask = a != invalid_val
    # sorting a boolean mask pushes True values to the end of each row/column
    justified_mask = np.sort(mask, axis=axis)
    if (side == 'up') | (side == 'left'):
        justified_mask = np.flip(justified_mask, axis=axis)
    out = np.full(a.shape, invalid_val, dtype=object)
    if axis == 1:
        out[justified_mask] = a[mask]
    else:
        out.T[justified_mask.T] = a.T[mask.T]
    return out
dfn = pd.DataFrame(justify(dfn.values, invalid_val=np.nan, axis=1, side='left'),
                   index=dfn.index, columns=dfn.columns)
print (dfn)
Article_1 Article_2 Article_3 Article_4 Article_5 Article_6
1 Banana Apple Pineapple Coconut Tropical NaN
2 Apple Banana Coconut NaN NaN NaN
3 Tomatoe Apple Coconut Pineapple NaN NaN
4 Apple Coconut NaN NaN NaN NaN
5 Apple Coconut NaN NaN NaN NaN
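For completeness, the same helper can push values the other way; a quick sketch using side='right':

# right-justify instead: NaNs move to the front of each row
dfn_right = pd.DataFrame(justify(dfn.values, invalid_val=np.nan, axis=1, side='right'),
                         index=dfn.index, columns=dfn.columns)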
In short, I just want each unique value of the "ts_"-suffixed columns converted into a row index. I intend to use the 'ts' and 'id' columns as a multi-index.
rows = [{'id': 1, 'a_ts': '2020-10-02', 'a_energy': 6, 'a_money': 2, 'b_ts': '2020-10-02', 'b_color': 'blue'},
        {'id': 2, 'a_ts': '2020-02-02', 'a_energy': 2, 'a_money': 5, 'a_color': 'orange', 'b_ts': '2012-08-11', 'b_money': 10, 'b_color': 'blue'},
        {'id': 3, 'a_ts': '2011-02-02', 'a_energy': 4}]
df = pd.DataFrame(rows)
id a_ts a_energy a_money b_ts b_color a_color b_money
0 1 2020-10-02 6 2.0 2020-10-02 blue NaN NaN
1 2 2020-02-02 2 5.0 2012-08-11 blue orange 10.0
2 3 2011-02-02 4 NaN NaN NaN NaN NaN
I want my output to look something like this.
energy money color
id ts
1 2020-10-02 6.0 2.0 blue
2 2020-02-02 2.0 5.0 orange
2012-08-11 NaN 10.0 blue
3 2011-02-02 4.0 NaN NaN
The best I could come up with was splitting the columns on the underscore and resetting the indexes, but that creates rows where the ids and timestamps are NaN.
I cannot simply create rows with NaNs and then drop them all, because I would lose information about which IDs had no timestamp and which timestamps had no matching id (the dataframes are the result of a join).
df.columns = df.columns.str.split("ts_", expand=True)
df = df.stack().reset_index(drop=True)
Use:
df = df.set_index(['id'])
df.columns = df.columns.str.split("_", expand=True)
df = df.stack(0).reset_index(level=-1,drop=True).reset_index()
print (df)
id color energy money ts
0 1 NaN 6.0 2.0 2020-10-02
1 1 blue NaN NaN 2020-10-02
2 2 orange 2.0 5.0 2020-02-02
3 2 blue NaN 10.0 2012-08-11
4 3 NaN 4.0 NaN 2011-02-02
Then shift the values per group, with the NaN-only cells removed, using a custom lambda function:
f = lambda x: x.apply(lambda y: pd.Series(y.dropna().tolist()))
df = df.set_index(['id','ts']).groupby(['id','ts']).apply(f).droplevel(-1)
print (df)
color energy money
id ts
1 2020-10-02 blue 6.0 2.0
2 2012-08-11 blue NaN 10.0
2020-02-02 orange 2.0 5.0
3 2011-02-02 NaN 4.0 NaN
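Side note: since each (id, ts) group here holds at most one non-NaN value per column, the same collapse can be sketched with GroupBy.first on the intermediate frame printed above (not part of the original answer):

# equivalent under that assumption: first() picks the first non-NaN per column
df = df.groupby(['id', 'ts']).first()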
I tried to ask this question in a different format, and I got answers that addressed a specific part of the question but not the whole thing. In an effort not to confuse things I'm trying again and phrasing the question differently.
I have a dataframe where several columns have regular data but one column has, as elements, lists of dictionaries. Here's an example.
list_of_dicts = [{'a':'sam','b':2},{'a':'diana','c':'grape', 'd':5},{'a':'jody','c':7,'e':'foo','f':9}]
list_of_dicts_2 = [{'a':'joe','b':2},{'a':'steve','c':'pizza'},{'a':'alex','c':7,'e':'doh'}]
# df4 must exist before assigning; .at is used for the list cells because
# storing a list object in a single cell is unreliable with .loc
df4 = pd.DataFrame(index=[0, 1], columns=['other1', 'lists_of_stuff', 'other2'])
df4.at[0, 'lists_of_stuff'] = list_of_dicts
df4.at[1, 'lists_of_stuff'] = list_of_dicts_2
df4.loc[0, 'other1'] = 'Susie'
df4.loc[1, 'other1'] = 'Rachel'
df4.loc[0, 'other2'] = 123
df4.loc[1, 'other2'] = 456
df4
other1 lists_of_stuff other2
0 Susie [{'a':'sam','b':2},{'a':'diana','c':'grape', 'd':5},{'a':'jody','c':7,'e':'foo','f':9}] 123
1 Rachel [{'a':'joe','b':2},{'a':'steve','c':'pizza'},{'a':'alex','c':7,'e':'doh'}] 456
I'm trying to split up those dictionaries into columns so that I have a simpler dataframe. Something like this (column order might be different)
other1 a_1 b a_2 c d a_3 c_2 e f other2
0 Susie sam 2 diana grape 5 jody 7 foo 9 123
1 Rachel joe 2 steve pizza NaN alex 7 doh NaN 456
or like this
other1 a b c d e f other2
0 Susie sam 2 NaN NaN NaN NaN 123
1 Susie diana NaN grape 5 NaN NaN 123
2 Susie jody NaN 7 NaN foo 9 123
3 Rachel joe 2 NaN NaN NaN NaN 456
4 Rachel steve NaN pizza NaN NaN NaN 456
5 Rachel alex NaN 7 NaN doh NaN 456
Two things that don't work are pd.DataFrame(df4['lists_of_stuff']) (which just returns the column as a one-column dataframe; i.e. it doesn't change anything) and pd.json_normalize(df4['lists_of_stuff']) (which throws an error). Additionally, flatten_json and solutions involving Series have not yielded workable results.
What's the right pythonic way to turn df4 into one of the proposed outputs?
(Yes I asked nearly the same question elsewhere. List of variable size dicts to a dataframe. That question was unclear, so I decided to try again with a new question rather than adding a bunch of stuff to the other one and making it difficult to understand.)
Try:
# if the lists_of_stuff are strings, apply literal_eval
#from ast import literal_eval
#df["lists_of_stuff"] = df["lists_of_stuff"].apply(literal_eval)
df = df.explode("lists_of_stuff")
df = pd.concat([df, df.pop("lists_of_stuff").apply(pd.Series)], axis=1)
print(df)
Prints:
other1 other2 a b c d e f
0 Susie 123 sam 2.0 NaN NaN NaN NaN
0 Susie 123 diana NaN grape 5.0 NaN NaN
0 Susie 123 jody NaN 7 NaN foo 9.0
1 Rachel 456 joe 2.0 NaN NaN NaN NaN
1 Rachel 456 steve NaN pizza NaN NaN NaN
1 Rachel 456 alex NaN 7 NaN doh NaN
EDIT: To reindex columns:
#... code as above
df = df.reset_index(drop=True).reindex(
[*df.columns[:1]] + [*df.columns[2:]] + [*df.columns[1:2]], axis=1
)
print(df)
Prints:
other1 a b c d e f other2
0 Susie sam 2.0 NaN NaN NaN NaN 123
1 Susie diana NaN grape 5.0 NaN NaN 123
2 Susie jody NaN 7 NaN foo 9.0 123
3 Rachel joe 2.0 NaN NaN NaN NaN 456
4 Rachel steve NaN pizza NaN NaN NaN 456
5 Rachel alex NaN 7 NaN doh NaN 456
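A possible variant (a sketch, not benchmarked here): pd.json_normalize on the exploded column replaces the row-wise .apply(pd.Series), which is often slower on large frames:

# starting again from the original frame
exploded = df.explode("lists_of_stuff").reset_index(drop=True)
# flatten the column of dicts in one pass
flat = pd.json_normalize(exploded.pop("lists_of_stuff").tolist())
df = pd.concat([exploded, flat], axis=1)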
I have an empty column that depends on 4 other columns in the same df. Each row contains either the same string repeated or NaN, so I want to grab the first string that appears across those columns.
I want to iterate through the 4 columns; if one of them contains a value that is not NaN, I want to put it in empty, and if they are all NaN then empty should be NaN.
  empty     1    2      3    4
0   NaN   NaN  NaN  apple  NaN
1   NaN  duck  NaN   duck  NaN
2   NaN   NaN  NaN    NaN  NaN
This is my desired outcome.
  empty     1    2      3    4
0 apple   NaN  NaN  apple  NaN
1  duck  duck  NaN   duck  NaN
2   NaN   NaN  NaN    NaN  NaN
Try .bfill(axis=1):
df["Empty"] = df.loc[:, "1":].bfill(axis=1)["1"]
print(df)
Prints:
Empty 1 2 3 4
0 apple NaN NaN apple NaN
1 duck duck NaN duck NaN
2 NaN NaN NaN NaN NaN
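The same "first non-NaN across the row" logic can also be sketched with stack, since stacking drops NaNs (an alternative, not from the answer above):

# stack() drops NaNs; first() then takes the leftmost surviving value per row;
# all-NaN rows stay missing because they vanish from the stacked Series
df["Empty"] = df.loc[:, "1":].stack().groupby(level=0).first()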
I'm trying to match this final output, calculating the moving average (window 3) of Count.
Expected Output
Classification Name Count MA3
0 Fruits Apple inf NaN
1 Fruits Apple inf NaN
2 Fruits Apple inf NaN
3 Fruits Apple inf NaN
4 Fruits Apple 5.0 5.0
5 Fruits Apple 6.0 5.5
6 Fruits Apple 7.0 6.0
7 Fruits Apple 8.0 7.0
8 Veg Broc 10.0 NaN
9 Veg Broc 11.0 NaN
10 Veg Broc 12.0 11.0
But the .rolling code does not handle the inf values. Is there any workaround for this?
df['MA3'] = df.groupby(['Classification', 'Name'])['Count'].transform(lambda x: x.rolling(3,3).mean())
Current Output
Classification Name Count MA3
0 Fruits Apple inf NaN
1 Fruits Apple inf NaN
2 Fruits Apple inf NaN
3 Fruits Apple inf NaN
4 Fruits Apple 5.0 NaN
5 Fruits Apple 6.0 NaN
6 Fruits Apple 7.0 6.0
7 Fruits Apple 8.0 7.0
8 Veg Broc 10.0 NaN
9 Veg Broc 11.0 NaN
10 Veg Broc 12.0 11.0
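For reference, a minimal frame matching the tables above (a sketch):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Classification': ['Fruits'] * 8 + ['Veg'] * 3,
    'Name': ['Apple'] * 8 + ['Broc'] * 3,
    'Count': [np.inf] * 4 + [5.0, 6.0, 7.0, 8.0, 10.0, 11.0, 12.0],
})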
Create a Series S that contains the same calculation but with inf replaced by NaN and min_periods=1. Then create a mask for the rows that need to be modified, that is, the ones that sit one or two positions after an inf:
df['MA3'] = df.groupby(['Classification', 'Name'])['Count'].transform(lambda x: x.replace(np.inf, np.nan).rolling(3, min_periods=3).mean())
S = df.groupby(['Classification', 'Name'])['Count'].transform(lambda x: x.replace(np.inf, np.nan).rolling(3, min_periods=1).mean())
mask = df['Count'].lt(np.inf) & df['MA3'].isnull() & (df['Count'].shift(1).eq(np.inf) | df['Count'].shift(2).eq(np.inf))
df.loc[mask, 'MA3'] = S.loc[mask]
Why does this function return None for missing values? I only want NaN when there is a missing value.
def func(row):
    if (pd.notna(row['year'])):
        if (pd.notna(row['fruit'])) & (pd.notna(row['signifiance'])):
            return row['fruit']
        elif (pd.isna(row['fruit'])) & (pd.notna(row['signifiance'])):
            return row['fruit']
        else:
            return np.NaN
df['new_col'] = df.apply(func, axis=1)
df:
   fruit  year price vol signifiance
0 apple 2010 1 5 NaN
1 apple 2011 2 4 NaN
2 apple 2012 3 3 NaN
3 NaN 2013 3 3 NaN
4 NaN NaN NaN 3 NaN
5 apple 2015 3 3 important
Actual Output:
df:
   fruit  year price vol signifiance new_col
0 apple 2010 1 5 NaN None
1 apple 2011 2 4 NaN None
2 apple 2012 3 3 NaN None
3 NaN 2013 3 3 NaN None
4 NaN NaN NaN 3 NaN None
5 apple 2015 3 3 important apple
Expected Output:
df:
   fruit  year price vol signifiance new_col
0 apple 2010 1 5 NaN NaN
1 apple 2011 2 4 NaN NaN
2 apple 2012 3 3 NaN NaN
3 NaN 2013 3 3 NaN NaN
4 NaN NaN NaN 3 NaN NaN
5 apple 2015 3 3 important apple
The outer if has no matching else, so when 'year' is NaN the function falls through the end and Python implicitly returns None. Change it to:
def func(row):
    if (pd.notna(row['year'])):
        if (pd.notna(row['fruit'])) & (pd.notna(row['signifiance'])):
            return row['fruit']
        elif (pd.isna(row['fruit'])) & (pd.notna(row['signifiance'])):
            return row['fruit']
        else:
            return np.NaN
    else:
        return np.NaN
df.apply(func,axis=1)
Out[178]:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 apple
dtype: object
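Side note: because both inner branches return row['fruit'], the whole function collapses to a vectorized one-liner (a sketch, equivalent to the branching above):

# keep 'fruit' where both 'year' and 'signifiance' are present, NaN elsewhere
df['new_col'] = df['fruit'].where(df['year'].notna() & df['signifiance'].notna())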