Python pandas: creating a discrete series from a cumulative - python

I have a data frame where there are several groups of numeric series where the values are cumulative. Consider the following:
df = pd.DataFrame({'Cat': ['A', 'A','A','A', 'B','B','B','B'], 'Indicator': [1,2,3,4,1,2,3,4], 'Cumulative1': [1,3,6,7,2,4,6,9], 'Cumulative2': [1,3,4,6,1,5,7,12]})
In [74]:df
Out[74]:
Cat Cumulative1 Cumulative2 Indicator
0 A 1 1 1
1 A 3 3 2
2 A 6 4 3
3 A 7 6 4
4 B 2 1 1
5 B 4 5 2
6 B 6 7 3
7 B 9 12 4
I need to create discrete series for Cumulative1 and Cumulative2, with starting point being the earliest entry in 'Indicator'.
my Approach is to use diff()
In[82]: df['Discrete1'] = df.groupby('Cat')['Cumulative1'].diff()
Out[82]: df
Cat Cumulative1 Cumulative2 Indicator Discrete1
0 A 1 1 1 NaN
1 A 3 3 2 2.0
2 A 6 4 3 3.0
3 A 7 6 4 1.0
4 B 2 1 1 NaN
5 B 4 5 2 2.0
6 B 6 7 3 2.0
7 B 9 12 4 3.0
I have 3 questions:
How do I avoid the NaN in an elegant/Pythonic way? The correct values are to be found in the original Cumulative series.
Secondly, how do I elegantly apply this computation to all series, say -
cols = ['Cumulative1', 'Cumulative2']
Thirdly, I have a lot of data that needs this computation -- is this the most efficient way?

You do not want to avoid NaNs, you want to fill them with the start values from the "cumulative" column:
df['Discrete1'] = df['Discrete1'].combine_first(df['Cumulative1'])
To apply the operation to all (or select) columns, broadcast it to all columns of interest:
sources = 'Cumulative1', 'Cumulative2'
targets = ["Discrete" + x[len('Cumulative'):] for x in sources]
df[targets] = df.groupby('Cat')[sources].diff()
You still have to condition the NaNs in a loop:
for s,t in zip(sources, targets):
df[t] = df[t].combine_first(df[s])

Related

Add all column values repeated of one data frame to other in pandas

Having two data frames:
df1 = pd.DataFrame({'a':[1,2,3],'b':[4,5,6]})
a b
0 1 4
1 2 5
2 3 6
df2 = pd.DataFrame({'c':[7],'d':[8]})
c d
0 7 8
The goal is to add all df2 column values to df1, repeated and create the following result. It is assumed that both data frames do not share any column names.
a b c d
0 1 4 7 8
1 2 5 7 8
2 3 6 7 8
If there are strings columns names is possible use DataFrame.assign with unpack Series created by selecing first row of df2:
df = df1.assign(**df2.iloc[0])
print (df)
a b c d
0 1 4 7 8
1 2 5 7 8
2 3 6 7 8
Another idea is repeat values by df1.index with DataFrame.reindex and use DataFrame.join (here first index value of df2 is same like first index value of df1.index):
df = df1.join(df2.reindex(df1.index, method='ffill'))
print (df)
a b c d
0 1 4 7 8
1 2 5 7 8
2 3 6 7 8
If no missing values in original df is possible use forward filling missing values in last step, but also are types changed to floats, thanks #Dishin H Goyan:
df = df1.join(df2).ffill()
print (df)
a b c d
0 1 4 7.0 8.0
1 2 5 7.0 8.0
2 3 6 7.0 8.0

pandas using .where() to replace in .groupby() object

Consider a dataframe which contains several groups of integers:
d = pd.DataFrame({'label': ['a','a','a','a','b','b','b','b'], 'value': [1,2,3,2,7,1,8,9]})
d
label value
0 a 1
1 a 2
2 a 3
3 a 2
4 b 7
5 b 1
6 b 8
7 b 9
For each of these groups of integers, each integer has to be bigger or equal to the previous one. If not the case, it takes on the value of the previous integer. I replace using
s.where(~(s < s.shift()), s.shift())
which works fine for a single series. I can even group the dataframe, and loop through each extracted series:
grouped = s.groupby('label')['value']
for _, s in grouped:
print(s.where(~(s < s.shift()), s.shift()))
0 1.0
1 2.0
2 3.0
3 3.0
Name: value, dtype: float64
4 7.0
5 7.0
6 8.0
7 9.0
Name: value, dtype: float64
However, how do I now get these values back into my original dataframe?
Or, is there a better way to do this? I don't care for using .groupby and don't consider the for loop a pretty solution either...
IIUC, you can use cummax in the groupby like:
d['val_max'] = d.groupby('label')['value'].cummax()
print (d)
label value val_max
0 a 1 1
1 a 2 2
2 a 3 3
3 a 2 3
4 b 7 7
5 b 1 7
6 b 8 8
7 b 9 9

Pandas: Insert dataframe into other dataframe without preserving indices

I want to insert a pandas dataframe into another pandas dataframe at certain indices.
Lets say we have this dataframe:
original_df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]])
0 1 2
0 1 2 3
1 4 5 6
2 7 8 9
I can then change values at certain indices as following:
original_df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]])
original_df.iloc[[0,2],[0,1]] = 2
0 1 2
0 2 2 3
1 4 5 6
2 2 2 9
However, if i use the same technique to insert another dataframe, it doesn't work:
original_df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]])
df_to_insert = pd.DataFrame([[10,11],[12,13]])
original_df.iloc[[0,2],[0,1]] = df_to_insert
0 1 2
0 10.0 11.0 3.0
1 4.0 5.0 6.0
2 NaN NaN 9.0
I am looking for a way to get the following result:
0 1 2
0 10 11 3
1 4 5 6
2 12 13 9
It seems to me that with the syntax i am using, the values from df_to_insert are taken from the corresponding index at their target locations. Is there a way for me to avoid this?
When you do insert make sure change the df to values , pandas is index sensitive , which means it will always try to match with the index and column during calculation
original_df.iloc[[0,2],[0,1]] = df_to_insert.values
original_df
Out[651]:
0 1 2
0 10 11 3
1 4 5 6
2 12 13 9
It does work with an array rather than a df:
original_df.iloc[[0,2],[0,1]] = np.array([[10,11],[12,13]])

Pandas - remove row similar to other row

I need to remove all rows from a pandas.DataFrame, which satisfy an unusual condition.
In case there is an exactly the same row, except for it has Nan value in column "C", I want to remove this row.
Given a table:
A B C D
1 2 NaN 3
1 2 50 3
10 20 NaN 30
5 6 7 8
I need to remove the first row, since it has Nan in column C, but there is absolutely same row (second) with real value in column C.
However, 3rd row must stay, because there're no rows with same A, B and D values as it has.
How do you perform this using pandas? Thank you!
You can achieve in using drop_duplicates.
Initial DataFrame:
df=pd.DataFrame(columns=['a','b','c','d'], data=[[1,2,None,3],[1,2,50,3],[10,20,None,30],[5,6,7,8]])
df
a b c d
0 1 2 NaN 3
1 1 2 50 3
2 10 20 NaN 30
3 5 6 7 8
Then you can sort DataFrame by column C. This will drop NaNs to the bottom of column:
df = df.sort_values(['c'])
df
a b c d
3 5 6 7 8
1 1 2 50 3
0 1 2 NaN 3
2 10 20 NaN 30
And then remove duplicates selecting taken into account columns ignoring C and keeping first catched row:
df1 = df.drop_duplicates(['a','b','d'], keep='first')
a b c d
3 5 6 7 8
1 1 2 50 3
2 10 20 NaN 30
But it will be valid only if NaNs are in column C.
You can try fillna along with drop_duplicates
df.bfill().ffill().drop_duplicates(subset=['A', 'B', 'D'], keep = 'last')
This will handle the scenario such as A, B and D values are same but C has non-NaN values in both the rows.
You get
A B C D
1 1 2 50 3
2 10 20 Nan 30
3 5 6 7 8
This feels right to me
notdups = ~df.duplicated(df.columns.difference(['C']), keep=False)
notnans = df.C.notnull()
df[notdups | notnans]
A B C D
1 1 2 50.0 3
2 10 20 NaN 30
3 5 6 7.0 8

Pandas - create total column based on other column

I'm trying to create a total column that sums the numbers from another column based on a third column. I can do this by using .groupby(), but that creates a truncated column, whereas I want a column that is the same length.
My code:
df = pd.DataFrame({'a':[1,2,2,3,3,3], 'b':[1,2,3,4,5,6]})
df['total'] = df.groupby(['a']).sum().reset_index()['b']
My result:
a b total
0 1 1 1.0
1 2 2 5.0
2 2 3 15.0
3 3 4 NaN
4 3 5 NaN
5 3 6 NaN
My desired result:
a b total
0 1 1 1.0
1 2 2 5.0
2 2 3 5.0
3 3 4 15.0
4 3 5 15.0
5 3 6 15.0
...where each 'a' column has the same total as the other.
Returning the sum from a groupby operation in pandas produces a column only as long as the number of unique items in the index. Use transform to produce a column of the same length ("like-indexed") as the original data frame without performing any merges.
df['total'] = df.groupby('a')['b'].transform(sum)
>>> df
a b total
0 1 1 1
1 2 2 5
2 2 3 5
3 3 4 15
4 3 5 15
5 3 6 15

Categories