In short, I just want each unique value of the "_ts" columns converted into a row index. I intend to use the 'ts' and 'id' columns as a MultiIndex.
rows = [{'id':1, 'a_ts':'2020-10-02','a_energy':6,'a_money':2,'b_ts':'2020-10-02', 'b_color':'blue'},
{'id':2, 'a_ts':'2020-02-02','a_energy':2,'a_money':5, 'a_color':'orange', 'b_ts':'2012-08-11', 'b_money':10, 'b_color':'blue'},
{'id':3,'a_ts':'2011-02-02', 'a_energy':4}]
df = pd.DataFrame(rows)
id a_ts a_energy a_money b_ts b_color a_color b_money
0 1 2020-10-02 6 2.0 2020-10-02 blue NaN NaN
1 2 2020-02-02 2 5.0 2012-08-11 blue orange 10.0
2 3 2011-02-02 4 NaN NaN NaN NaN NaN
I want my output to look something like this.
energy money color
id ts
1 2020-10-02 6.0 2.0 blue
2 2020-02-02 2.0 5.0 orange
2012-08-11 NaN 10.0 blue
3 2011-02-02 4.0 NaN NaN
The best I could come up with was splitting the columns on the underscore and resetting the indexes, but that creates rows where the ids and timestamps are NaN.
I cannot simply create rows with NaNs and then drop them all, as I'll lose information about which ids did not contain a timestamp or which timestamps did not have a matched id (this is because the dataframes are the result of a join).
df.columns = df.columns.str.split("_", expand=True)
df = df.stack().reset_index(drop=True)
Use:
df = df.set_index(['id'])
df.columns = df.columns.str.split("_", expand=True)
df = df.stack(0).reset_index(level=-1,drop=True).reset_index()
print (df)
id color energy money ts
0 1 NaN 6.0 2.0 2020-10-02
1 1 blue NaN NaN 2020-10-02
2 2 orange 2.0 5.0 2020-02-02
3 2 blue NaN 10.0 2012-08-11
4 3 NaN 4.0 NaN 2011-02-02
Then collapse the values per group, dropping the NaNs in each column, with a custom lambda function:
f = lambda x: x.apply(lambda y: pd.Series(y.dropna().tolist()))
df = df.set_index(['id','ts']).groupby(['id','ts']).apply(f).droplevel(-1)
print (df)
color energy money
id ts
1 2020-10-02 blue 6.0 2.0
2 2012-08-11 blue NaN 10.0
2020-02-02 orange 2.0 5.0
3 2011-02-02 NaN 4.0 NaN
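If you also want the column order from the desired output in the question (energy, money, color), a small optional follow-up on the result above:
df = df[['energy', 'money', 'color']]
print (df)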
Related
I have a pandas dataframe that effectively contains several different datasets. Between each dataset is a row full of NaN. Can I split the dataframe on the NaN row to make two dataframes? Thanks in advance.
You can use this to split into many data frames based on all NaN rows:
#index of all NaN rows (+ beginning and end of df)
idx = [0] + df.index[df.isnull().all(1)].tolist() + [df.shape[0]]
#list of data frames split at all NaN indices
list_of_dfs = [df.iloc[idx[n]:idx[n+1]] for n in range(len(idx)-1)]
And if you want to exclude the NaN rows from split data frames:
idx = [-1] + df.index[df.isnull().all(1)].tolist() + [df.shape[0]]
list_of_dfs = [df.iloc[idx[n]+1:idx[n+1]] for n in range(len(idx)-1)]
Example:
df:
0 1
0 1.0 1.0
1 NaN 1.0
2 1.0 NaN
3 NaN NaN
4 NaN NaN
5 1.0 1.0
6 1.0 1.0
7 NaN 1.0
8 1.0 NaN
9 1.0 NaN
list_of_dfs:
[ 0 1
0 1.0 1.0
1 NaN 1.0
2 1.0 NaN,
Empty DataFrame
Columns: [0, 1]
Index: [],
0 1
5 1.0 1.0
6 1.0 1.0
7 NaN 1.0
8 1.0 NaN
9 1.0 NaN]
Use df[df[COLUMN_NAME].isnull()].index.tolist() to get a list of indices corresponding to the NaN rows. You can then split the dataframe into multiple dataframes by using the indices.
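A minimal sketch of that idea (COLUMN_NAME is a placeholder, and a default RangeIndex is assumed so index labels line up with iloc positions):
# positions of the separator rows (NaN in COLUMN_NAME)
nan_idx = df[df[COLUMN_NAME].isnull()].index.tolist()
# slice between consecutive separators, dropping the separator rows themselves
bounds = [-1] + nan_idx + [len(df)]
list_of_dfs = [df.iloc[bounds[n] + 1:bounds[n + 1]] for n in range(len(bounds) - 1)]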
My solution allows you to split your DataFrame into any number of chunks, on each row full of NaNs.
Assume that the input DataFrame contains:
A B C
0 10.0 Abc 20.0
1 11.0 NaN 21.0
2 12.0 Ghi NaN
3 NaN NaN NaN
4 NaN Hkx 30.0
5 21.0 Jkl 32.0
6 22.0 Mno 33.0
7 NaN NaN NaN
8 30.0 Pqr 40.0
9 NaN Stu NaN
10 32.0 Vwx 44.0
so that "split points" are rows with indices 3 and 7.
To do your task:
Generate the grouping criterion Series:
grp = (df.isnull().sum(axis=1) == df.shape[1]).cumsum()
Drop rows full of NaN and group the result by the above criterion:
gr = df.dropna(axis=0, thresh=1).groupby(grp)
thresh=1 means that a row needs at least 1 non-NaN value to be kept in the result.
Perform the actual split, as a list comprehension:
result = [ gr.get_group(key) for key in gr.groups ]
To print the result, you can run:
for i, chunk in enumerate(result):
print(f'Chunk {i}:')
print(chunk, end='\n\n')
getting:
Chunk 0:
A B C
0 10.0 Abc 20.0
1 11.0 NaN 21.0
2 12.0 Ghi NaN
Chunk 1:
A B C
4 NaN Hkx 30.0
5 21.0 Jkl 32.0
6 22.0 Mno 33.0
Chunk 2:
A B C
8 30.0 Pqr 40.0
9 NaN Stu NaN
10 32.0 Vwx 44.0
I have a data frame and want to create a separate column. This column must be based on the right-most value in the data frame. But if that value is NaN/None, skip it and move left.
Data frame:
Column_0 Column_1 Column_2 Column_3
nan nan nan nan
1 2 nan nan
1 2 3 4
1 nan 3 nan
Output:
Column_Output
nan
2
4
3
I searched for solutions... but even finding the right search terms was causing me trouble. Thanks a lot in advance!
First forward fill the missing values along the rows, then select the last column:
df['Column_Output'] = df.ffill(axis=1).iloc[:, -1]
print (df)
Column_0 Column_1 Column_2 Column_3 Column_Output
0 NaN NaN NaN NaN NaN
1 1.0 2.0 NaN NaN 2.0
2 1.0 2.0 3.0 4.0 4.0
3 1.0 NaN 3.0 NaN 3.0
I'm facing the following situation. I have two dataframes, let's say df1 and df2, and I need to join them by a key (ID_ed, ID). The second dataframe may have more than one occurrence of the key; what I need is to join the two dataframes and add the repeated occurrences of the keys as new columns (as shown in the next image).
I tried merge = df2.join(df1, lsuffix='_ZID', rsuffix='_IID', how='left') and concat operations, but no luck so far. It seems that it only preserves the last occurrence (as if it were overwriting the data).
Any help in this is really appreciated, and thanks in advance.
Another approach is to create a serial counter for the ID_ed column, then set_index and unstack before calling pivot_table. The pivot_table aggregation function is 'first'. This approach is fairly similar to this SO answer.
Generate the data
import pandas as pd
import numpy as np
a = [['ID_ed','color'],[1,5],[2,8],[3,7]]
b = [['ID','code'],[1,1],[1,5],
[2,np.nan],[2,20],[2,74],
[3,10],[3,98],[3,85],
[3,21],[3,45]]
df1 = pd.DataFrame(a[1:], columns=a[0])
df2 = pd.DataFrame(b[1:], columns=b[0])
print(df1)
ID_ed color
0 1 5
1 2 8
2 3 7
print(df2)
ID code
0 1 1.0
1 1 5.0
2 2 NaN
3 2 20.0
4 2 74.0
5 3 10.0
6 3 98.0
7 3 85.0
8 3 21.0
9 3 45.0
First the merge and unstack
# Merge and add a serial counter column
df = df1.merge(df2, how='inner', left_on='ID_ed', right_on='ID')
df['counter'] = df.groupby('ID_ed').cumcount()+1
print(df)
ID_ed color ID code counter
0 1 5 1 1.0 1
1 1 5 1 5.0 2
2 2 8 2 NaN 1
3 2 8 2 20.0 2
4 2 8 2 74.0 3
5 3 7 3 10.0 1
6 3 7 3 98.0 2
7 3 7 3 85.0 3
8 3 7 3 21.0 4
9 3 7 3 45.0 5
# Set index and unstack
dfu = (df.set_index(['ID_ed','color','counter'])
         .unstack()
         .swaplevel(1,0,axis=1)
         .sort_index(level=0,axis=1)
         .add_prefix('counter_'))
print (dfu)
counter counter_1 counter_2 counter_3 counter_4 counter_5
counter_ID counter_code counter_ID counter_code counter_ID counter_code counter_ID counter_code counter_ID counter_code
ID_ed color
1 5 1.0 1.0 1.0 5.0 NaN NaN NaN NaN NaN NaN
2 8 2.0 NaN 2.0 20.0 2.0 74.0 NaN NaN NaN NaN
3 7 3.0 10.0 3.0 98.0 3.0 85.0 3.0 21.0 3.0 45.0
Next generate the pivot table
# Pivot table with 'first' aggregation
dfp = pd.pivot_table(df, index=['ID_ed','color'],
columns=['counter'],
values=['ID', 'code'],
aggfunc='first')
print(dfp)
ID code
counter 1 2 3 4 5 1 2 3 4 5
ID_ed color
1 5 1.0 1.0 NaN NaN NaN 1.0 5.0 NaN NaN NaN
2 8 2.0 2.0 2.0 NaN NaN NaN 20.0 74.0 NaN NaN
3 7 3.0 3.0 3.0 3.0 3.0 10.0 98.0 85.0 21.0 45.0
Finally rename the columns and slice by partial column name
# Rename columns
level_1_names = list(dfp.columns.get_level_values(1))
level_0_names = list(dfp.columns.get_level_values(0))
new_cnames = [b+'_'+str(f) for f, b in zip(level_1_names, level_0_names)]
dfp.columns = new_cnames
# Slice by new column names
print(dfp.loc[:, dfp.columns.str.contains('code')].reset_index(drop=False))
ID_ed color code_1 code_2 code_3 code_4 code_5
0 1 5 1.0 5.0 NaN NaN NaN
1 2 8 NaN 20.0 74.0 NaN NaN
2 3 7 10.0 98.0 85.0 21.0 45.0
I'd use cumcount and pivot_table:
In [11]: df1
Out[11]:
ID color
0 1 5
1 2 8
2 3 7
In [12]: df2
Out[12]:
ID code
0 1 1.0
1 1 5.0
2 2 NaN
3 2 20.0
4 2 74.0
In [13]: res = df1.merge(df2)  # merges on the shared column names by default
In [14]: res
Out[14]:
ID color code
0 1 5 1.0
1 1 5 5.0
2 2 8 NaN
3 2 8 20.0
4 2 8 74.0
In [15]: res['count'] = res.groupby('ID').cumcount()
In [16]: res.pivot_table('code', ['ID', 'color'], 'count')
Out[16]:
count 0 1 2
ID color
1 5 1.0 5.0 NaN
2 8 NaN 20.0 74.0
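If you also want flat column names like code_0, code_1, ... (as in the renaming step of the previous answer), a small optional follow-up on the pivoted result:
out = res.pivot_table('code', ['ID', 'color'], 'count')
out.columns = ['code_' + str(c) for c in out.columns]  # flatten the counter labels
print (out.reset_index())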
I have a df like below:
>>df
group sub_group max
0 A 1 30.0
1 B 1 300.0
2 B 2 3.0
3 A 2 2.0
I need to have group and sub_group as attributes (columns) and max as a row.
So I do
>>> df.set_index(['group','sub_group']).T
group A B A
sub_group 1 1 2 2
max 30.0 300.0 3.0 2.0
This gives me my intended formatting
Now I need to merge it to another similar dataframe say
>>df2
group sub_group max
0 C 1 3000.0
1 A 1 4000.0
Such that my merge results in
group A B A C
sub_group 1 1 2 2 1
max 30.0 300.0 3.0 2.0 NaN
max 4000.0 NaN NaN NaN 3000.0
Basically, for every new df we place its values under the appropriate heading, and if there is a new group or sub_group we add it to the larger df. I am not sure if my way of transposing and then trying to merge/append is a good approach.
Since these dfs are generated in a loop (the loop items being dates), I would like a way to replace the max printed in the first column (of the expected output) with the loop date.
dates=['20170525', '20170623', '20170726']
for date in dates:
df = pd.read_csv()
I think you can first pass the index_col parameter to read_csv, to create a MultiIndex from the first and second columns:
dfs = []
for date in dates:
df = pd.read_csv('name', index_col=[0,1])
dfs.append(df)
#another test df was added
print (df3)
max
group sub_group
D 1 3000.0
E 1 4000.0
Then concat them together with the keys parameter set to the list of dates, and reshape by unstack and transpose:
#dfs = [df,df2,df3]
dates=['20170525', '20170623', '20170726']
df = pd.concat(dfs, keys=dates)['max'].unstack(0).T
print (df)
group A B C D E
sub_group 1 2 1 2 1 1 1
20170525 30.0 2.0 300.0 3.0 NaN NaN NaN
20170623 4000.0 NaN NaN NaN 3000.0 NaN NaN
20170726 NaN NaN NaN NaN NaN 3000.0 4000.0
I'm preparing data for machine learning where data is in pandas DataFrame which looks like this:
Column v1 v2
first 1 2
second 3 4
third 5 6
Now I want to transform it into:
Column v1 v2 first-v1 first-v2 second-v1 second-v2 third-v1 third-v2
first 1 2 1 2 Nan Nan Nan Nan
second 3 4 Nan Nan 3 4 Nan Nan
third 5 6 Nan Nan Nan Nan 5 6
What I've tried is to do something like this:
# we know how many values there are but
# length can be changed into length of [1, 2, 3, ...] values
values = ['v1', 'v2']
# data with description from above is saved in data
for value in values:
data[ str(data['Column'] + '-' + value)] = data[ value]
The result is columns with names:
['first-v1' 'second-v1'..], ['first-v2' 'second-v2'..]
which contain the correct values. What am I doing wrong? Is there a more optimal way to do this, since my data is big?
Thank you for your time!
You can use unstack with swapping and sorting the MultiIndex in columns:
df = (data.set_index('Column', append=True)[values]
          .unstack()
          .swaplevel(0, 1, axis=1)
          .sort_index(axis=1))
df.columns = df.columns.map('-'.join)
print (df)
first-v1 first-v2 second-v1 second-v2 third-v1 third-v2
0 1.0 2.0 NaN NaN NaN NaN
1 NaN NaN 3.0 4.0 NaN NaN
2 NaN NaN NaN NaN 5.0 6.0
Or stack + unstack:
df = data.set_index('Column', append=True).stack().unstack([1,2])
df.columns = df.columns.map('-'.join)
print (df)
first-v1 first-v2 second-v1 second-v2 third-v1 third-v2
0 1.0 2.0 NaN NaN NaN NaN
1 NaN NaN 3.0 4.0 NaN NaN
2 NaN NaN NaN NaN 5.0 6.0
Last, join to the original:
df = data.join(df)
print (df)
Column v1 v2 first-v1 first-v2 second-v1 second-v2 third-v1 \
0 first 1 2 1.0 2.0 NaN NaN NaN
1 second 3 4 NaN NaN 3.0 4.0 NaN
2 third 5 6 NaN NaN NaN NaN 5.0
third-v2
0 NaN
1 NaN
2 6.0