I'm trying to reference data from source dataframes, using indices stored in another dataframe.
For example, let's say we have a "shifts" dataframe with the names of the people on duty on each date (some values can be NaN):
a b c
2023-01-01 Sam Max NaN
2023-01-02 Mia NaN Max
2023-01-03 NaN Sam Mia
Then we have a "performance" dataframe, with the performance of each employee on each date. Row indices are the same as the shifts dataframe, but column names are different:
Sam Mia Max Ian
2023-01-01 4.5 NaN 3.0 NaN
2023-01-02 NaN 2.0 3.0 NaN
2023-01-03 4.0 3.0 NaN 4.0
and finally we have a "salary" dataframe, whose structure and indices are different from the other two dataframes:
Employee Salary
0 Sam 100
1 Mia 90
2 Max 80
3 Ian 70
I need to create two output dataframes, with same structure and indices as "shifts".
In the first one, I need to substitute the employee name with his/her performance on that date.
In the second output dataframe, the employee name is replaced with his/her salary. These are the expected outputs:
Output 1:
a b c
2023-01-01 4.5 3.0 NaN
2023-01-02 2.0 NaN 3.0
2023-01-03 NaN 4.0 3.0
Output 2:
a b c
2023-01-01 100.0 80.0 NaN
2023-01-02 90.0 NaN 80.0
2023-01-03 NaN 100.0 90.0
Any idea of how to do it? Thanks
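For reference, here is one way the three sample frames might be constructed (just a sketch to make the example reproducible; the DatetimeIndex is an assumption):
import numpy as np
import pandas as pd

idx = pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-03'])
shifts = pd.DataFrame({'a': ['Sam', 'Mia', np.nan],
                       'b': ['Max', np.nan, 'Sam'],
                       'c': [np.nan, 'Max', 'Mia']}, index=idx)
performance = pd.DataFrame({'Sam': [4.5, np.nan, 4.0],
                            'Mia': [np.nan, 2.0, 3.0],
                            'Max': [3.0, 3.0, np.nan],
                            'Ian': [np.nan, np.nan, 4.0]}, index=idx)
salary = pd.DataFrame({'Employee': ['Sam', 'Mia', 'Max', 'Ian'],
                       'Salary': [100, 90, 80, 70]})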
For the first one:
(shifts
 .reset_index().melt('index')
 .merge(performance.stack().rename('p'),
        left_on=['index', 'value'], right_index=True)
 .pivot(index='index', columns='variable', values='p')
 .reindex_like(shifts)
)
Output:
a b c
2023-01-01 4.5 3.0 NaN
2023-01-02 2.0 NaN 3.0
2023-01-03 NaN 4.0 3.0
For the second:
shifts.replace(salary.set_index('Employee')['Salary'])
Output:
a b c
2023-01-01 100.0 80.0 NaN
2023-01-02 90.0 NaN 80.0
2023-01-03 NaN 100.0 90.0
Here's a way to do what your question asks:
out1 = (shifts.stack()
        .rename_axis(index=('date', 'shift'))
        .reset_index().rename(columns={0: 'employee'})
        .pipe(lambda df: df.assign(perf=df.apply(
            lambda row: performance.loc[row.date, row.employee], axis=1)))
        .pivot(index='date', columns='shift', values='perf')
        .rename_axis(index=None, columns=None))
out2 = shifts.replace(salary.set_index('Employee').Salary)
Output:
Output 1:
a b c
2023-01-01 4.5 3.0 NaN
2023-01-02 2.0 NaN 3.0
2023-01-03 NaN 4.0 3.0
Output 2:
a b c
2023-01-01 100.0 80.0 NaN
2023-01-02 90.0 NaN 80.0
2023-01-03 NaN 100.0 90.0
Related
For each row (x) of a dataframe, I need to get the value stored in the previous row (x-1) in a specific target column. The name of that target column is stored in a column (Target_col) of row x.
0 1 2 Target_col
Date
2022-01-01 37.0 26.0 NaN 0
2022-01-02 NaN 41.0 0.0 1
2022-01-03 NaN 40.0 43.0 1
2022-01-04 NaN NaN 23.0 2
For example, in the last row my Target_value is 43.0, which is stored in the column "2" of the previous row.
This is the expected output:
0 1 2 Target_col Target_value
Date
2022-01-01 37.0 26.0 NaN 0 NaN
2022-01-02 NaN 41.0 0.0 1 26.0
2022-01-03 NaN 40.0 43.0 1 41.0
2022-01-04 NaN NaN 23.0 2 43.0
I was able to get what I want by duplicating the df:
df2 = df.shift(periods=1)
df['Target_value'] = df2.lookup(df.index, df['Target_col'])
but I guess there is a smarter way to do that. Furthermore, lookup is deprecated. Any ideas?
Please note that I reshaped my question and the example df to make everything clearer, so itprorh66's answer and my comments to his answer are no longer relevant.
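Since lookup is deprecated, here is a sketch of a lookup-free equivalent of the same shift-then-index idea (assuming Target_col holds the actual labels of the data columns; if its values don't match the column labels, get_indexer returns -1):
import numpy as np

prev = df.drop(columns='Target_col').shift(1)           # values of the previous row
col_pos = prev.columns.get_indexer(df['Target_col'])    # position of the column named by each row's Target_col
df['Target_value'] = prev.to_numpy()[np.arange(len(df)), col_pos]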
I would approach the problem a little differently as illustrated below:
given a base dataframe of the form:
df:
date a b c
0 2022-01-01 12.0 11.0 NaN
1 2022-01-02 10.0 11.0 NaN
2 2022-01-03 NaN 10.0 10.0
3 2022-01-04 NaN 11.0 9.0
4 2022-01-05 NaN NaN 12.0
Instead of defining the column that contains the first valid data, I would create a column which just contains the first valid piece of data, as follows:
import numpy as np

# helper function to find first valid data
def findfirst(row, cols_list):
    # return the first non-NaN value found within the row
    for c in cols_list:
        if not np.isnan(row[c]):
            return row[c]
    return np.nan
Then using the above helper, I add the column 'First' which contains the desired data as follows:
df['First'] = df.apply(lambda row: findfirst(row, ['a', 'b', 'c']), axis=1)
This creates the following dataframe:
date a b c First
0 2022-01-01 12.0 11.0 NaN 12.0
1 2022-01-02 10.0 11.0 NaN 10.0
2 2022-01-03 NaN 10.0 10.0 10.0
3 2022-01-04 NaN 11.0 9.0 11.0
4 2022-01-05 NaN NaN 12.0 12.0
From the above you can then compute the change value as follows:
df['Change'] = (df['First']/df['First'].shift())-1
Which yields:
date a b c First Change
0 2022-01-01 12.0 11.0 NaN 12.0 NaN
1 2022-01-02 10.0 11.0 NaN 10.0 -0.166667
2 2022-01-03 NaN 10.0 10.0 10.0 0.000000
3 2022-01-04 NaN 11.0 9.0 11.0 0.100000
4 2022-01-05 NaN NaN 12.0 12.0 0.090909
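As an aside, if the goal is just the first non-NaN value per row, a vectorized sketch (assuming the value columns are 'a', 'b' and 'c') avoids the row-wise apply: back-filling along the rows pulls each row's first available value into column 'a', so taking the first column gives the same 'First' series as the helper above.
df['First'] = df[['a', 'b', 'c']].bfill(axis=1).iloc[:, 0]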
It's a bit convoluted but this works:
cols = df.columns[:-1]
temp = df['Target_col'].shift(-1).values[:-1]
temp = np.append(temp, 0)
target_values = df[cols].to_numpy()[np.arange(len(df)), temp.astype(int)][:-1]
target_values = np.insert(target_values, 0, 0, axis=0)
df['target_values'] = target_values.tolist()
I have a dataframe where in some cases a case has its records in more than one row, with nulls in some rows as so:
date_rounded 1 2 3 4 5
0 2020-04-01 00:05:00 0.0 NaN NaN NaN NaN
1 2020-04-01 00:05:00 NaN 1.0 44.0 44.0 46.454
2 2020-04-01 00:05:00 NaN NaN NaN NaN NaN
I want to have only one row with the filled data, so far I have:
df.groupby(['date_rounded']).apply(lambda df0: df0.fillna(method='ffill').fillna(method='bfill').drop_duplicates())
this works, but it is slow, any better ideas?
Thanks
You can also use groupby and first, which takes the first non-null value in each column within each group:
df.groupby("date_rounded").first()
1 2 3 4 5
date_rounded
2020-04-01 00:05:00 0.0 1.0 44.0 44.0 46.454
If you need to fill within each group, you can use groupby().apply and bfill:
df.groupby('date_rounded', as_index=False).apply(lambda x: x.bfill().iloc[0])
Output:
0 date_rounded 1 2 3 4 5
0 2020-04-01 00:05:00 0.0 1.0 44.0 44.0 46.454
I am working with a very large dataframe (~3 million rows) and I need the count of values from multiple columns, grouped by time-related data.
I have tried stacking the columns, but the resulting dataframe was very long and wouldn't fit in memory. Similarly, df.apply gave memory issues.
For example if my sample dataframe is like,
id,date,field1,field2,field3
1,1/1/2014,abc,,abc
2,1/1/2014,abc,,abc
3,1/2/2014,,abc,abc
4,1/4/2014,xyz,abc,
1,1/1/2014,,abc,abc
1,1/1/2014,xyz,qwe,xyz
4,1/7/2014,,qwe,abc
2,1/4/2014,qwe,,qwe
2,1/4/2014,qwe,abc,qwe
2,1/5/2014,abc,,abc
3,1/5/2014,xyz,xyz,
I have written the following script, which does what's needed for a small sample but fails on a large dataframe.
df.set_index(["id", "date"], inplace=True)
df = df.stack(level=[0])
df = df.groupby(level=[0,1]).value_counts()
df = df.unstack(level=[1,2])
I also have a solution via apply but it has the same complications.
The expected result is,
date 1/1/2014 1/4/2014 ... 1/5/2014 1/4/2014 1/7/2014
abc xyz qwe qwe ... xyz xyz abc qwe
id ...
1 4.0 2.0 1.0 NaN ... NaN NaN NaN NaN
2 2.0 NaN NaN 4.0 ... NaN NaN NaN NaN
3 NaN NaN NaN NaN ... 2.0 NaN NaN NaN
4 NaN NaN NaN NaN ... NaN 1.0 1.0 1.0
I am looking for a more optimized version of what I have written.
Thanks for the help!
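For anyone who wants to experiment, the sample above can be loaded directly from the CSV text, e.g.:
import io
import pandas as pd

csv_text = '''id,date,field1,field2,field3
1,1/1/2014,abc,,abc
2,1/1/2014,abc,,abc
3,1/2/2014,,abc,abc
4,1/4/2014,xyz,abc,
1,1/1/2014,,abc,abc
1,1/1/2014,xyz,qwe,xyz
4,1/7/2014,,qwe,abc
2,1/4/2014,qwe,,qwe
2,1/4/2014,qwe,abc,qwe
2,1/5/2014,abc,,abc
3,1/5/2014,xyz,xyz,'''
df = pd.read_csv(io.StringIO(csv_text))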
Since you don't want to use stack, another solution is to build a crosstab of id against date for each of the field columns, then concat them together, group by the index, and sum. A list comprehension over df.columns[2:] creates each crosstab (note: I assume the first two columns are id and date, as in your sample):
(pd.concat([pd.crosstab([df.id], [df.date, df[col]]) for col in df.columns[2:]])
   .groupby(level=0)
   .sum())
Out[497]:
1/1/2014 1/2/2014 1/4/2014 1/5/2014 1/7/2014
abc qwe xyz abc abc qwe xyz abc xyz abc qwe
id
1 4 1.0 2.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 2 0.0 0.0 0.0 1.0 4.0 0.0 2.0 0.0 0.0 0.0
3 0 0.0 0.0 2.0 0.0 0.0 0.0 0.0 2.0 0.0 0.0
4 0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 1.0 1.0
I think showing 0 is better than NaN. However, if you want NaN instead of 0, you just need to chain an additional replace, as follows:
(pd.concat([pd.crosstab([df.id], [df.date, df[col]]) for col in df.columns[2:]])
   .groupby(level=0)
   .sum()
   .replace({0: np.nan}))
Out[501]:
1/1/2014 1/2/2014 1/4/2014 1/5/2014 1/7/2014
abc qwe xyz abc abc qwe xyz abc xyz abc qwe
id
1 4.0 1.0 2.0 NaN NaN NaN NaN NaN NaN NaN NaN
2 2.0 NaN NaN NaN 1.0 4.0 NaN 2.0 NaN NaN NaN
3 NaN NaN NaN 2.0 NaN NaN NaN NaN 2.0 NaN NaN
4 NaN NaN NaN NaN 1.0 NaN 1.0 NaN NaN 1.0 1.0
I have a long list of columns and I want to subtract the previous column from the current column and replace the current column with the difference.
So if I have:
A B C D
1 NaN 3 7
3 NaN 8 10
2 NaN 6 11
I want the output to be:
A B C D
1 NaN 2 4
3 NaN 5 2
2 NaN 4 5
I have been trying to use this code:
df2 = df1.diff(axis=1)
but this does not produce the desired output
Thanks in advance.
You can do this with df.where combined with a row-wise forward-fill and diff, and then update to bring back the first non-null entry in each row of your DataFrame (which the row-wise diff would otherwise turn into NaN).
Sample Data: df
A B C D
0 1.0 NaN 3.0 7.0
1 1.0 4.0 5.0 9.0
2 NaN 4.0 NaN 4.0
3 NaN 4.0 NaN NaN
4 NaN NaN 3.0 7.0
5 3.0 NaN NaN 7.0
6 6.0 NaN NaN NaN
Code:
df_d = df.where(df.isnull(),
                df.fillna(method='ffill', axis=1).diff(axis=1))
df_d.update(df.where(df.notnull().cumsum(1).cumsum(1) == 1))
Output: df_d
A B C D
0 1.0 NaN 2.0 4.0
1 1.0 3.0 1.0 4.0
2 NaN 4.0 NaN 0.0
3 NaN 4.0 NaN NaN
4 NaN NaN 3.0 4.0
5 3.0 NaN NaN 4.0
6 6.0 NaN NaN NaN
Actually, it is producing the desired result: you are calculating diff against NaN values, which gives NaN, so diff is working as expected.
For your case, just fetch the first column from the original dataframe and you should be fine:
df2=df1.diff(axis=1)
df2.A=df1.A
print(df2)
Output
A B C D
1 NaN 2.0 4.0
I am trying to create a very large dataframe, made up of one column from many smaller dataframes (renamed to the dataframe name). I am using pd.concat() and looping through dictionary values (which represent dataframes) and over index values to create the large dataframe. The concat join_axes is the common index of all the dataframes. This works fine; however, I then have duplicate column names.
I must be able to loop over the indexes at specific windows as part of my final dataframe creation, so removing this step isn't an option.
For example, this results in the following final dataframe with duplicate columns:
Is there any way I can use pd.concat() exactly as I am, but merge the columns to produce an output like so?:
I think you need:
df = pd.concat([df1, df2])
Or, if you have duplicates in columns, use groupby, where any overlapping values are summed:
print (df.groupby(level=0, axis=1).sum())
Sample:
df1 = pd.DataFrame({'A':[5,8,7, np.nan],
'B':[1,np.nan,np.nan,9],
'C':[7,3,np.nan,0]})
df2 = pd.DataFrame({'A':[np.nan,np.nan,np.nan,2],
'B':[1,2,np.nan,np.nan],
'C':[np.nan,6,np.nan,3]})
print (df1)
A B C
0 5.0 1.0 7.0
1 8.0 NaN 3.0
2 7.0 NaN NaN
3 NaN 9.0 0.0
print (df2)
A B C
0 NaN 1.0 NaN
1 NaN 2.0 6.0
2 NaN NaN NaN
3 2.0 NaN 3.0
df = pd.concat([df1, df2],axis=1)
print (df)
A B C A B C
0 5.0 1.0 7.0 NaN 1.0 NaN
1 8.0 NaN 3.0 NaN 2.0 6.0
2 7.0 NaN NaN NaN NaN NaN
3 NaN 9.0 0.0 2.0 NaN 3.0
print (df.groupby(level=0, axis=1).sum())
A B C
0 5.0 2.0 7.0
1 8.0 2.0 9.0
2 7.0 NaN NaN
3 2.0 9.0 3.0
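A side note: newer pandas versions deprecate axis=1 in groupby. If you hit that warning, an equivalent sketch is to transpose, group on the duplicated column labels, and transpose back; min_count=1 keeps NaN where a whole group is NaN (as in row 2 above):
print (df.T.groupby(level=0).sum(min_count=1).T)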
What you want is df1.combine_first(df2). Refer to pandas documentation.
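For completeness, combine_first keeps df1's value wherever it is not NaN and falls back to df2 otherwise, so overlapping values are not summed. A quick sketch with df1/df2 from the answer above:
print (df1.combine_first(df2))
     A    B    C
0  5.0  1.0  7.0
1  8.0  2.0  3.0
2  7.0  NaN  NaN
3  2.0  9.0  0.0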