If I have a dataframe similar to this one:
Apples  Bananas  Grapes  Kiwis
2       3        nan     1
1       3        7       nan
nan     nan      2       3
I would like to add a column like this:
Apples  Bananas  Grapes  Kiwis  Fruit Total
2       3        nan     1      6
1       3        7       nan    11
nan     nan      2       3      5
I guess you could use df['Apples'] + df['Bananas'] and so on, but my actual dataframe is much larger than this. I was hoping a formula like df['Fruit Total']=df[-4:-1].sum could do the trick in one line of code. That didn't work however. Is there any way to do it without explicitly summing up all columns?
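For reference, a minimal sketch reconstructing the example frame from the tables above (assuming standard imports):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Apples':  [2, 1, np.nan],
                   'Bananas': [3, 3, np.nan],
                   'Grapes':  [np.nan, 7, 2],
                   'Kiwis':   [1, np.nan, 3]})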
You can first select by iloc and then sum:
df['Fruit Total']= df.iloc[:, -4:-1].sum(axis=1)
print (df)
Apples Bananas Grapes Kiwis Fruit Total
0 2.0 3.0 NaN 1.0 5.0
1 1.0 3.0 7.0 NaN 11.0
2 NaN NaN 2.0 3.0 2.0
To sum all columns use:
df['Fruit Total']= df.sum(axis=1)
This may be helpful for beginners, so for the sake of completeness, if you know the column names (e.g. they are in a list), you can use:
column_names = ['Apples', 'Bananas', 'Grapes', 'Kiwis']
df['Fruit Total']= df[column_names].sum(axis=1)
This gives you flexibility about which columns you use: you only have to manipulate the list column_names, so you can do things like pick only the columns whose names contain the letter 'a' (see the sketch after the next snippet). Another benefit is that column names make it easier for humans to understand what the code is doing. Combine this with list(df.columns) to get the column names as a list. Thus, if you want to drop the last column, all you have to do is:
column_names = list(df.columns)
df['Fruit Total']= df[column_names[:-1]].sum(axis=1)
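As a sketch of the name-filtering idea mentioned above (the column name 'A Total' is just illustrative, and this assumes only the original fruit columns are present):
# keep only the columns whose name contains the letter 'a' (case-insensitive)
cols_with_a = [c for c in df.columns if 'a' in c.lower()]
df['A Total'] = df[cols_with_a].sum(axis=1)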
It is possible to do it without knowing the number of columns and even without iloc:
print(df)
Apples Bananas Grapes Kiwis
0 2.0 3.0 NaN 1.0
1 1.0 3.0 7.0 NaN
2 NaN NaN 2.0 3.0
cols_to_sum = df.columns[ : df.shape[1]-1]
df['Fruit Total'] = df[cols_to_sum].sum(axis=1)
print(df)
Apples Bananas Grapes Kiwis Fruit Total
0 2.0 3.0 NaN 1.0 5.0
1 1.0 3.0 7.0 NaN 11.0
2 NaN NaN 2.0 3.0 2.0
Using df['Fruit Total']= df.iloc[:, -4:-1].sum(axis=1) over your original df won't include the last column ('Kiwis') in the sum; you should use df.iloc[:, -4:] instead to select all four columns:
print(df)
Apples Bananas Grapes Kiwis
0 2.0 3.0 NaN 1.0
1 1.0 3.0 7.0 NaN
2 NaN NaN 2.0 3.0
df['Fruit Total']=df.iloc[:,-4:].sum(axis=1)
print(df)
Apples Bananas Grapes Kiwis Fruit Total
0 2.0 3.0 NaN 1.0 6.0
1 1.0 3.0 7.0 NaN 11.0
2 NaN NaN 2.0 3.0 5.0
I want to build on Ramon's answer, for when you want to compute the total without knowing the shape/size of the dataframe.
I will use his answer below but fix one detail: it didn't include the last column in the total.
I changed this line, removing the -1 from the slice:
cols_to_sum = df.columns[ : df.shape[1]-1]
To this:
cols_to_sum = df.columns[ : df.shape[1]]
print(df)
Apples Bananas Grapes Kiwis
0 2.0 3.0 NaN 1.0
1 1.0 3.0 7.0 NaN
2 NaN NaN 2.0 3.0
cols_to_sum = df.columns[ : df.shape[1]]
df['Fruit Total'] = df[cols_to_sum].sum(axis=1)
print(df)
Apples Bananas Grapes Kiwis Fruit Total
0 2.0 3.0 NaN 1.0 6.0
1 1.0 3.0 7.0 NaN 11.0
2 NaN NaN 2.0 3.0 5.0
This gives you the correct total without skipping the last column.
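If the frame also contained non-numeric columns, a sketch of restricting the sum to the numeric ones (select_dtypes is standard pandas; run this before adding 'Fruit Total', or exclude it from numeric_cols):
numeric_cols = df.select_dtypes(include='number').columns
df['Fruit Total'] = df[numeric_cols].sum(axis=1)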
I'm trying to reference data from source dataframes, using indices stored in another dataframe.
For example, let's say we have a "shifts" dataframe with the names of the people on duty on each date (some values can be NaN):
a b c
2023-01-01 Sam Max NaN
2023-01-02 Mia NaN Max
2023-01-03 NaN Sam Mia
Then we have a "performance" dataframe, with the performance of each employee on each date. Row indices are the same as the shifts dataframe, but column names are different:
Sam Mia Max Ian
2023-01-01 4.5 NaN 3.0 NaN
2023-01-02 NaN 2.0 3.0 NaN
2023-01-03 4.0 3.0 NaN 4.0
and finally we have a "salary" dataframe, whose structure and indices are different from the other two dataframes:
Employee Salary
0 Sam 100
1 Mia 90
2 Max 80
3 Ian 70
I need to create two output dataframes, with the same structure and indices as "shifts".
In the first one, I need to substitute each employee name with his/her performance on that date.
In the second output dataframe, the employee name is replaced with his/her salary. These are the expected outputs:
Output 1:
a b c
2023-01-01 4.5 3.0 NaN
2023-01-02 2.0 NaN 3.0
2023-01-03 NaN 4.0 3.0
Output 2:
a b c
2023-01-01 100.0 80.0 NaN
2023-01-02 90.0 NaN 80.0
2023-01-03 NaN 100.0 90.0
Any idea of how to do it? Thanks
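For reference, a minimal sketch reconstructing the three example frames (names and values taken from the tables above):
import numpy as np
import pandas as pd

idx = pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-03'])
shifts = pd.DataFrame({'a': ['Sam', 'Mia', np.nan],
                       'b': ['Max', np.nan, 'Sam'],
                       'c': [np.nan, 'Max', 'Mia']}, index=idx)
performance = pd.DataFrame({'Sam': [4.5, np.nan, 4.0],
                            'Mia': [np.nan, 2.0, 3.0],
                            'Max': [3.0, 3.0, np.nan],
                            'Ian': [np.nan, np.nan, 4.0]}, index=idx)
salary = pd.DataFrame({'Employee': ['Sam', 'Mia', 'Max', 'Ian'],
                       'Salary': [100, 90, 80, 70]})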
For the first one:
(shifts
 .reset_index().melt('index')                            # long form: one row per (date, shift column, employee)
 .merge(performance.stack().rename('p'),                 # look up performance by (date, employee)
        left_on=['index', 'value'], right_index=True)
 .pivot(index='index', columns='variable', values='p')   # back to the shifts layout
 .reindex_like(shifts)                                   # restore original row/column order and NaN cells
)
Output:
a b c
2023-01-01 4.5 3.0 NaN
2023-01-02 2.0 NaN 3.0
2023-01-03 NaN 4.0 3.0
For the second:
shifts.replace(salary.set_index('Employee')['Salary'])
Output:
a b c
2023-01-01 100.0 80.0 NaN
2023-01-02 90.0 NaN 80.0
2023-01-03 NaN 100.0 90.0
Here's a way to do what your question asks:
out1 = (shifts.stack()                                   # long form: (date, shift) -> employee
        .rename_axis(index=('date', 'shift'))
        .reset_index().rename(columns={0: 'employee'})
        .pipe(lambda df: df.assign(perf=                 # look up each employee's performance
            df.apply(lambda row: performance.loc[row.date, row.employee], axis=1)))
        .pivot(index='date', columns='shift', values='perf')
        .rename_axis(index=None, columns=None))
out2 = shifts.replace(salary.set_index('Employee').Salary)
Output:
Output 1:
a b c
2023-01-01 4.5 3.0 NaN
2023-01-02 2.0 NaN 3.0
2023-01-03 NaN 4.0 3.0
Output 2:
a b c
2023-01-01 100.0 80.0 NaN
2023-01-02 90.0 NaN 80.0
2023-01-03 NaN 100.0 90.0
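A more explicit, loop-based sketch of the same lookup for Output 1, assuming the frames are named shifts and performance as in the question (slower than the answers above, but easy to follow):
import numpy as np
import pandas as pd

# for each shift column, look up performance by (date, employee); NaN shifts stay NaN
out1 = shifts.apply(lambda col: pd.Series(
    [performance.at[d, e] if pd.notna(e) else np.nan for d, e in col.items()],
    index=col.index))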
In short, I just want each unique value in the "*_ts" columns converted into a row index. I intend to use the 'ts' and 'id' columns as a multi-index.
import pandas as pd

rows = [{'id': 1, 'a_ts': '2020-10-02', 'a_energy': 6, 'a_money': 2, 'b_ts': '2020-10-02', 'b_color': 'blue'},
        {'id': 2, 'a_ts': '2020-02-02', 'a_energy': 2, 'a_money': 5, 'a_color': 'orange', 'b_ts': '2012-08-11', 'b_money': 10, 'b_color': 'blue'},
        {'id': 3, 'a_ts': '2011-02-02', 'a_energy': 4}]
df = pd.DataFrame(rows)
id a_ts a_energy a_money b_ts b_color a_color b_money
0 1 2020-10-02 6 2.0 2020-10-02 blue NaN NaN
1 2 2020-02-02 2 5.0 2012-08-11 blue orange 10.0
2 3 2011-02-02 4 NaN NaN NaN NaN NaN
I want my output to look something like this.
energy money color
id ts
1 2020-10-02 6.0 2.0 blue
2 2020-02-02 2.0 5.0 orange
2012-08-11 NaN 10.0 blue
3 2011-02-02 4.0 NaN NaN
The best I could come up with was splitting the columns on the underscore and resetting the indexes, but that creates rows where the ids and timestamps are NaN.
I cannot simply create rows with NaNs and then get rid of them, as I'll lose information about which IDs did not have a timestamp or which timestamps did not have a matched id (the dataframes are the result of a join).
df.columns = df.columns.str.split("ts_", expand=True)
df = df.stack().reset_index(drop=True)
Use:
df = df.set_index(['id'])
df.columns = df.columns.str.split("_", expand=True)
df = df.stack(0).reset_index(level=-1,drop=True).reset_index()
print (df)
id color energy money ts
0 1 NaN 6.0 2.0 2020-10-02
1 1 blue NaN NaN 2020-10-02
2 2 orange 2.0 5.0 2020-02-02
3 2 blue NaN 10.0 2012-08-11
4 3 NaN 4.0 NaN 2011-02-02
Then collapse the rows within each (id, ts) group, dropping the NaNs per column, using a custom lambda:
# within each column, drop NaNs and re-pack the remaining values from the top
f = lambda x: x.apply(lambda y: pd.Series(y.dropna().tolist()))
df = df.set_index(['id','ts']).groupby(['id','ts']).apply(f).droplevel(-1)
print (df)
color energy money
id ts
1 2020-10-02 blue 6.0 2.0
2 2012-08-11 blue NaN 10.0
2020-02-02 orange 2.0 5.0
3 2011-02-02 NaN 4.0 NaN
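A hedged variant of the same reshaping that collapses the duplicate (id, ts) rows with groupby().first() instead of the dropna lambda (assuming df from the question; first() keeps the first non-NaN value per column):
tmp = df.set_index('id')
tmp.columns = tmp.columns.str.split('_', expand=True)    # MultiIndex columns: (prefix, field)
out = (tmp.stack(0)                                       # one row per (id, prefix), all-NaN rows dropped
          .reset_index(level=-1, drop=True)
          .set_index('ts', append=True)
          .groupby(['id', 'ts']).first())                 # first non-NaN value per column
print(out)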
I am working with a pandas dataframe that also contains NaN values. I want to substitute the NaNs with interpolated values using df.interpolate, but only if the sequence of consecutive NaNs has length <= N. As an example, let's assume I choose N = 2 (so I want to fill sequences of up to 2 NaNs) and I have a dataframe with
print(df)
A B C
1 1 1
nan nan 2
nan nan 3
nan 4 nan
5 5 5
In this case I want to apply a function to df so that only NaN sequences of length <= 2 get filled, while longer sequences are left untouched, resulting in my desired output of
print(df)
A B C
1 1 1
nan 2 2
nan 3 3
nan 4 4
5 5 5
Note that I am aware of the limit=N option of df.interpolate, but it doesn't do what I want: it fills sequences of any length and just limits the filling to the first N NaNs of each run, resulting in the undesired output
print(df)
A B C
1 1 1
2 2 2
3 3 3
nan 4 4
5 5 5
So, do you know of a function, or how to construct code, that produces my desired output? Thanks
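For reference, a minimal sketch of the example frame (values taken from the table above):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, np.nan, np.nan, np.nan, 5],
                   'B': [1, np.nan, np.nan, 4, 5],
                   'C': [1, 2, 3, np.nan, 5]})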
You can perform run-length encoding and, for each column, identify the runs of NaN that are at most two elements long. One way to do that is to use get_id from the package pdrle (disclaimer: I wrote it).
import pdrle
chk = df.isna() & (df.apply(lambda x: x.groupby(pdrle.get_id(x)).transform(len)) <= 2)
df[chk] = df.interpolate()[chk]
# A B C
# 0 1.0 1.0 1.0
# 1 NaN 2.0 2.0
# 2 NaN 3.0 3.0
# 3 NaN 4.0 4.0
# 4 5.0 5.0 5.0
Try:
N = 2
df_interpolated = df.interpolate()
for c in df:
    mask = df[c].isna()
    # True for NaNs that belong to a run longer than N:
    x = (
        mask.groupby((mask != mask.shift()).cumsum()).transform(
            lambda x: len(x) > N
        )
        * mask
    )
    # keep the interpolated value only where the run was short enough
    df_interpolated[c] = df_interpolated.loc[~x, c]
print(df_interpolated)
Prints:
A B C
0 1.0 1.0 1.0
1 NaN 2.0 2.0
2 NaN 3.0 3.0
3 NaN 4.0 4.0
4 5.0 5.0 5.0
Trying with a different df:
A B C
0 1.0 1.0 1.0
1 NaN NaN 2.0
2 NaN NaN 3.0
3 NaN 4.0 NaN
4 5.0 5.0 5.0
5 NaN 5.0 NaN
6 NaN 5.0 NaN
7 8.0 5.0 NaN
produces:
A B C
0 1.0 1.0 1.0
1 NaN 2.0 2.0
2 NaN 3.0 3.0
3 NaN 4.0 4.0
4 5.0 5.0 5.0
5 6.0 5.0 NaN
6 7.0 5.0 NaN
7 8.0 5.0 NaN
You can try the following -
n=2
cols = df.columns[df.isna().sum()<=n]
df[cols] = df[cols].interpolate()
df
A B C
0 1.0 1.0 1.0
1 NaN 2.0 2.0
2 NaN 3.0 3.0
3 NaN 4.0 4.0
4 5.0 5.0 5.0
df.columns[df.isna().sum()<=n] keeps only the columns whose total NaN count is at most n. Then you simply overwrite those columns with their interpolated values.
I am trying to write a list of lists into an Excel sheet using pandas.
The list looks like:
List_of_Lists = [ [1,2,3,4],
[5,6,7,8],
[9,10,11,12],
........,
]
The number of these lists inside the main list could go up to 1000.
I also want to label them like columns1, columns2, up to columns100 for
instance, on the same sheet. Can anyone familiar with pandas help me,
as this could be really easy for some?
I believe you can just pass the list into pd.DataFrame() and you will just get NaNs for the values that don't exist.
For example:
import pandas as pd

List_of_Lists = [[1, 2, 3, 4],
                 [5, 6, 7],
                 [9, 10],
                 [11]]
df = pd.DataFrame(List_of_Lists)
print(df)
0 1 2 3
0 1 2.0 3.0 4.0
1 5 6.0 7.0 NaN
2 9 10.0 NaN NaN
3 11 NaN NaN NaN
Then, to get the naming the way you want, just use pandas.DataFrame.add_prefix:
df = df.add_prefix('Column')
print(df)
Column0 Column1 Column2 Column3
0 1 2.0 3.0 4.0
1 5 6.0 7.0 NaN
2 9 10.0 NaN NaN
3 11 NaN NaN NaN
Now, I guess it's also possible that you want each list to be a column. In that case you need to transpose List_of_Lists:
from itertools import zip_longest
df = pd.DataFrame(list(map(list, zip_longest(*List_of_Lists))))
print(df)
0 1 2 3
0 1 5.0 9.0 11.0
1 2 6.0 10.0 NaN
2 3 7.0 NaN NaN
3 4 NaN NaN NaN
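To actually write the result to an Excel sheet, as the question asks, a minimal sketch (the file name is hypothetical, and an Excel engine such as openpyxl needs to be installed):
# write the labelled frame to a single worksheet; index=False drops the row numbers
df.to_excel('list_of_lists.xlsx', sheet_name='Sheet1', index=False)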
I am trying to create a very large dataframe made up of one column from each of many smaller dataframes (renamed to the dataframe name). I am using pd.concat() and looping through dictionary values which represent dataframes, and looping over index values, to create the large dataframe. The join_axes argument of pd.concat() is the common index of all the dataframes. This works fine; however, I then have duplicate column names.
I must be able to loop over the indexes at specific windows as part of my final dataframe creation, so removing this step isn't an option.
For example, this results in the following final dataframe with duplicate columns:
Is there any way I can use pd.concat() exactly as I am, but merge the duplicate columns to produce an output like so?:
I think you need:
df = pd.concat([df1, df2])
Or, if there are duplicates in the columns, use groupby, so that overlapping values are summed:
print (df.groupby(level=0, axis=1).sum())
Sample:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [5, 8, 7, np.nan],
                    'B': [1, np.nan, np.nan, 9],
                    'C': [7, 3, np.nan, 0]})
df2 = pd.DataFrame({'A': [np.nan, np.nan, np.nan, 2],
                    'B': [1, 2, np.nan, np.nan],
                    'C': [np.nan, 6, np.nan, 3]})
print (df1)
A B C
0 5.0 1.0 7.0
1 8.0 NaN 3.0
2 7.0 NaN NaN
3 NaN 9.0 0.0
print (df2)
A B C
0 NaN 1.0 NaN
1 NaN 2.0 6.0
2 NaN NaN NaN
3 2.0 NaN 3.0
df = pd.concat([df1, df2],axis=1)
print (df)
A B C A B C
0 5.0 1.0 7.0 NaN 1.0 NaN
1 8.0 NaN 3.0 NaN 2.0 6.0
2 7.0 NaN NaN NaN NaN NaN
3 NaN 9.0 0.0 2.0 NaN 3.0
print (df.groupby(level=0, axis=1).sum())
A B C
0 5.0 2.0 7.0
1 8.0 2.0 9.0
2 7.0 NaN NaN
3 2.0 9.0 3.0
What you want is df1.combine_first(df2). Refer to the pandas documentation.
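Note that combine_first keeps df1's value and falls back to df2 where df1 is NaN; it does not sum overlapping values. A quick sketch with the sample frames above:
print(df1.combine_first(df2))
#      A    B    C
# 0  5.0  1.0  7.0
# 1  8.0  2.0  3.0
# 2  7.0  NaN  NaN
# 3  2.0  9.0  0.0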