Flagging a second set of items in a series - python

I have a dataframe column which contains a list of numbers from a .csv. These numbers range from 1-1400, may or may not be repeated, and a NaN value can appear pretty much anywhere at random.
Two examples would be
a=[1,4,NaN,5,6,7,...1398,1400,1,2,3,NaN,8,9,...,1398,NaN]
b=[1,NaN,2,3,4,NaN,7,10,...,1398,1399,1400]
I would like to create another column that records a '1' at the indices of the first 1-1400 run and, if a second 1-1400 run exists, records a '2' at its indices in the new column
I can think of some roundabout ways using temporary placeholders and some other kind of checks, but I was wondering if there was a 1-3 liner to do this operation
Edit1: I would prefer there to be a single column returned
a1=[1,1,NaN,1,1,1,...1,1,2,2,2,NaN,2,2,...,2,NaN]
b1=[1,NaN,1,1,1,NaN,1,1,...,1,1,1]

You can use groupby() and cumcount() to count numbers in each column:
# create new columns for counting
df['a1'] = np.nan
df['b1'] = np.nan
# group by the values in columns `a` and `b` and number each occurrence
df.a1 = df.groupby('a').cumcount() + 1
df.b1 = df.groupby('b').cumcount() + 1
# keep the NaN positions as NaN
df.loc[df.a.isnull(), 'a1'] = np.nan
df.loc[df.b.isnull(), 'b1'] = np.nan
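As a minimal runnable sketch of this idea (the toy column below is my own, standing in for the 1-1400 data):
import numpy as np
import pandas as pd
# toy data: two passes over 1..3, with a NaN thrown in
df = pd.DataFrame({'a': [1, 2, np.nan, 3, 1, 2, 3]})
# each value's Nth occurrence gets the number N
df['a1'] = df.groupby('a').cumcount() + 1
df.loc[df.a.isnull(), 'a1'] = np.nan
print(df.a1.tolist())  # [1.0, 1.0, nan, 1.0, 2.0, 2.0, 2.0]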
EDIT (after a comment that this does not work):
df['a2'] = df.ffill().a.diff()
df['a1'] = df.loc[df.a2 < 0].groupby('a').cumcount() + 1
df['a1'] = df['a1'].bfill().shift(-1)
df.loc[df.a1.isnull(), 'a1'] = df.a1.max() + 1
df.drop('a2', axis=1, inplace=True)
df.loc[df.a.isnull(), 'a1'] = np.nan

You can use diff to check when the difference between two consecutive values is negative, which marks the start of a new range. Let's create a dataframe:
import pandas as pd
import numpy as np
# create a dataframe with two columns; my ranges go up to 12, but it works the same for 1400
df = pd.DataFrame({'a':[1,4,np.nan,5,10,12,2,3,4,np.nan,8,12],'b':range(1,13)})
df.loc[[4,8],'b'] = np.nan
Because you have NaN values, you need to use ffill to fill each NaN with the previous value, and then you want the opposite (using ~) of the rows where the diff is greater than or equal to 0. (I know that sounds like just "less than 0", but it is not exactly the same here, as diff is NaN on the first row of the dataframe.) For column 'a', for example:
print (df.loc[~(df.a.ffill().diff()>=0),'a'])
0 1.0
6 2.0
Name: a, dtype: float64
you get the two rows where a "new" range starts. To use this property to create 'a1', you can do:
# put 1 in the rows with a new range start
df.loc[~(df.a.ffill().diff()>=0),'a1'] = 1
# create a mask to select notnull row in a:
mask_a = df.a.notnull()
# use cumsum and ffill on column a1 with the mask_a
df.loc[mask_a,'a1'] = df.loc[mask_a,'a1'].cumsum().ffill()
Finally, for several columns, you can do:
list_col = ['a','b']
for col in list_col:
    df.loc[~(df[col].ffill().diff()>=0), col+'1'] = 1
    mask = df[col].notnull()
    df.loc[mask, col+'1'] = df.loc[mask, col+'1'].cumsum().ffill()
and with my input, you get:
a b a1 b1
0 1.0 1.0 1.0 1.0
1 4.0 2.0 1.0 1.0
2 NaN 3.0 NaN 1.0
3 5.0 4.0 1.0 1.0
4 10.0 NaN 1.0 NaN
5 12.0 6.0 1.0 1.0
6 1.0 7.0 2.0 1.0
7 3.0 8.0 2.0 1.0
8 4.0 NaN 2.0 NaN
9 NaN 10.0 NaN 1.0
10 8.0 11.0 2.0 1.0
11 12.0 12.0 2.0 1.0
EDIT: you can even do it in one line for each column, same result:
df['a1'] = df[df.a.notnull()].a.diff().fillna(-1).lt(0).cumsum()
df['b1'] = df[df.b.notnull()].b.diff().fillna(-1).lt(0).cumsum()
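If you want to reuse that one-liner on several columns, it wraps naturally into a small helper (the function name is my own):
def label_passes(s):
    # number each ascending pass through the values; NaN rows stay NaN
    return s[s.notnull()].diff().fillna(-1).lt(0).cumsum()
df['a1'] = label_passes(df.a)
df['b1'] = label_passes(df.b)
The assignment aligns on the index, so rows that are NaN in the source column simply stay NaN in the new column.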

Pandas Groupby Max of Multiple Columns

Given this data frame:
import pandas as pd
df = pd.DataFrame({'group':['a','a','b','c','c'],'strings':['ab',' ',' ','12',' '],'floats':[7.0,8.0,9.0,10.0,11.0]})
group strings floats
0 a ab 7.0
1 a 8.0
2 b 9.0
3 c 12 10.0
4 c 11.0
I want to group by "group" and get the max value of strings and floats.
Desired result:
strings floats
group
a ab 8.0
b 9.0
c 12 11.0
I know I can just do this:
df.groupby(['group'], sort=False)['strings','floats'].max()
But in reality, I have many columns so I want to refer to all columns (save for "group") in one go.
I wish I could just do this:
df.groupby(['group'], sort=False)[x for x in df.columns if x != 'group'].max()
But, alas, "invalid syntax".
If you need the max of all columns (except 'group', which becomes the index), you can use:
df = df.groupby('group', sort=False).max()
print (df)
strings floats
group
a ab 8.0
b 9.0
c 12 11.0
Your second solution works if you add another pair of brackets []:
df = df.groupby(['group'], sort=False)[[x for x in df.columns if x != 'group']].max()
print (df)
strings floats
group
a ab 8.0
b 9.0
c 12 11.0
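If you would rather not build the list by hand, Index.difference expresses "every column except 'group'" directly; note that it returns the column names sorted alphabetically:
import pandas as pd
df = pd.DataFrame({'group':['a','a','b','c','c'],'strings':['ab',' ',' ','12',' '],'floats':[7.0,8.0,9.0,10.0,11.0]})
# all columns except the grouping key
cols = df.columns.difference(['group'])
print(df.groupby('group', sort=False)[cols].max())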

Replace all NaN values with value from other column

I have the following dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, 5, np.nan],
                   [np.nan, 3, np.nan, 4]],
                  columns=list('ABCD'))
I want to do a ffill() on column B with df["B"].ffill(inplace=True) which results in the following df:
A B C D
0 NaN 2.0 NaN 0.0
1 3.0 4.0 NaN 1.0
2 NaN 4.0 5.0 NaN
3 NaN 3.0 NaN 4.0
Now I want to replace all NaN values with their corresponding value from column B. The documentation states that you can give fillna() a Series, so I tried df.fillna(df["B"], inplace=True). This results in the exact same dataframe as above.
However, if I put in a simple value (e.g. df.fillna(0, inplace=True)), then it does work:
A B C D
0 0.0 2.0 0.0 0.0
1 3.0 4.0 0.0 1.0
2 0.0 4.0 5.0 0.0
3 0.0 3.0 0.0 4.0
The funny thing is that the fillna() does seem to work with a Series as value parameter when operated on another Series object. For example, df["A"].fillna(df["B"], inplace=True) results in:
A B C D
0 2.0 2.0 NaN 0.0
1 3.0 4.0 NaN 1.0
2 4.0 4.0 5.0 NaN
3 3.0 3.0 NaN 4.0
My real dataframe has a lot of columns and I would hate to manually fillna() all of them. Am I overlooking something here? Didn't I understand the docs correctly perhaps?
EDIT I have clarified my example in such a way that 'ffill' with axis=1 does not work for me. In reality, my dataframe has many, many columns (hundreds) and I am looking for a way to not have to explicitly mention all the columns.
Try changing the axis to 1 (columns):
df = df.ffill(axis=1).bfill(axis=1)
If you need to specify the columns, you can do something like this:
df[["B","C"]] = df[["B","C"]].ffill(1)
EDIT:
Since you need something more general and df.fillna(df.B, axis=1) is not implemented yet, you can try:
df = df.T.fillna(df.B).T
Or, equivalently:
df.T.fillna(df.B, inplace=True)
This works because the indices of df.B coincide with the columns of df.T, so pandas knows how to replace the values. From the docs:
value: scalar, dict, Series, or DataFrame.
Value to use to fill holes (e.g. 0), alternately a dict/Series/DataFrame of values specifying which value to use for each index (for a Series) or column (for a DataFrame). Values not in the dict/Series/DataFrame will not be filled. This value cannot be a list.
So, for example, the NaN in column 0 at row A (in df.T) will be replaced by the value with index 0 in df.B.
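If the double transpose feels awkward, a column-wise apply is an alternative sketch of the same alignment (the lambda is my own; each column is matched against df.B on the shared row index):
import pandas as pd
import numpy as np
df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, 5, np.nan],
                   [np.nan, 3, np.nan, 4]],
                  columns=list('ABCD'))
df['B'] = df['B'].ffill()
# fill every column's NaNs from column B, row by row
df = df.apply(lambda col: col.fillna(df['B']))
print(df)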

Pandas: Replace missing dataframe values / conditional calculation: fillna

I want to calculate a column in a pandas dataframe, but some rows contain missing values. For those missing values, I want to use a different algorithm. Let's say:
If column B contains a value, then subtract A from B
If column B does not contain a value, then subtract A from C
import pandas as pd
df = pd.DataFrame({'a':[1,2,3,4], 'b':[1,1,None,1],'c':[2,2,2,2]})
df['calc'] = df['b']-df['a']
results in:
print(df)
a b c calc
0 1 1.0 2 0.0
1 2 1.0 2 -1.0
2 3 NaN 2 NaN
3 4 1.0 2 -3.0
Approach 1: fill the NaN rows using .where:
df['calc'].where(df['b'].isnull()) = df['c']-df['a']
which results in SyntaxError: cannot assign to function call.
Approach 2: fill the NaN rows using .iterrows():
for index, row in df.iterrows():
    i = df['calc'].iloc[index]
    if pd.isnull(row['b']):
        i = row['c']-row['a']
        print(i)
    else:
        i = row['b']-row['a']
        print(i)
This executes without errors and the calculation is correct; these i values are printed to the console:
0.0
-1.0
-1.0
-3.0
But the values are not written into df['calc']; the dataframe remains as is:
print(df['calc'])
0 0.0
1 -1.0
2 NaN
3 -3.0
What is the correct way of overwriting the NaN values?
Finally, I stumbled over .fillna:
df['calc'] = df['calc'].fillna( df['c']-df['a'] )
gets the job done! Can anyone explain what is wrong with the above two approaches?
Approach 2:
you are assigning the result to the local variable i, but this won't modify your original dataframe. Write the value back explicitly:
for index, row in df.iterrows():
    i = df['calc'].iloc[index]
    if pd.isnull(row['b']):
        i = row['c']-row['a']
        print(i)
    else:
        i = row['b']-row['a']
        print(i)
    df.loc[index,'calc'] = i #<------------- here
Also, don't use iterrows(); it is too slow.
Approach 1:
Pandas' where() method checks a dataframe against one or more conditions and returns the result accordingly. By default, the rows not satisfying the condition are filled with NaN.
it should be:
df['calc'] = df['calc'].where(df['b'].isnull(), df['c']-df['a'])
but this keeps df['calc'] only where b is null and overwrites everywhere else, which is the opposite of what you want. Invert the condition:
df['calc'] = df['calc'].where(~df['b'].isnull(), df['c']-df['a'])
OR
df['calc'] = np.where(df['b'].isnull(), df['c']-df['a'], df['calc'])
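To make the where() semantics concrete, a tiny toy example (data made up): where() keeps the caller's values where the condition is True and substitutes the second argument where it is False:
import pandas as pd
s = pd.Series([10, 20, 30])
cond = pd.Series([True, False, True])
print(s.where(cond, -1).tolist())  # [10, -1, 30]
That is why the condition has to be inverted: you want to keep 'calc' where b is present and replace it where b is missing.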
Instead of computing b - a and then c - a separately, you can first fill the NaN values in column b with the values from column c, then subtract column a:
df['calc'] = df['b'].fillna(df['c']) - df['a']
a b c calc
0 1 1.0 2 0.0
1 2 1.0 2 -1.0
2 3 NaN 2 -1.0
3 4 1.0 2 -3.0
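combine_first is an equivalent spelling of the same fill, if you find it more readable (take b where present, otherwise c, then subtract a):
import pandas as pd
df = pd.DataFrame({'a':[1,2,3,4], 'b':[1,1,None,1],'c':[2,2,2,2]})
# b.combine_first(c): b's values, with b's NaNs filled from c
df['calc'] = df['b'].combine_first(df['c']) - df['a']
print(df['calc'].tolist())  # [0.0, -1.0, -1.0, -3.0]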

combine rows with identical index

How do I combine values from two rows that have an identical index and no overlap in their values?
import pandas as pd
df = pd.DataFrame([[1,2,3],[4,None,None],[None,5,6]],index=['a','b','b'])
df
#input
0 1 2
a 1.0 2.0 3.0
b 4.0 NaN NaN
b NaN 5.0 6.0
Desired output
0 1 2
a 1.0 2.0 3.0
b 4.0 5.0 6.0
Use stack(), which drops all NaNs, then unstack():
df.stack().unstack()
If you want a simpler solution that takes the first non-missing value per index label, use GroupBy.first:
df1 = df.groupby(level=0).first()
If summing per label gives the same output (as it does for the sample data), use a per-label sum:
df1 = df.groupby(level=0).sum()
If there are multiple non-missing values per group, you need to specify the expected output; that is obviously more complicated.
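A runnable side-by-side of the three suggestions on the sample frame:
import pandas as pd
df = pd.DataFrame([[1,2,3],[4,None,None],[None,5,6]],index=['a','b','b'])
# works here because the non-NaN cells of the two 'b' rows don't overlap
print(df.stack().unstack())
# first non-missing value per index label
print(df.groupby(level=0).first())
# per-label sum -- the same result here, since each cell has at most one value
print(df.groupby(level=0).sum())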

Pandas update doesn't do anything

I have a dataframe with some information on it. I created another dataframe that is larger and has default values in it. I want to update the default dataframe with the values from the first dataframe. I'm using df.update but nothing is happening. Here is the code:
new_df = pd.DataFrame(index=range(25))
new_df['Column1'] = 1
new_df['Column2'] = 2
new_df.update(old_df)
Here, old_df has 2 rows, indexed 5,6 with some random values in Column1 and Column2 and nothing else. I'm expecting these rows to overwrite the default values in new_df, what am I doing wrong?
This works for me, so I assume the problem is in the part of the code you haven't shown us.
import pandas as pd
import numpy as np
new_df = pd.DataFrame(index=range(25))
old_df = pd.DataFrame(index=[5,6])
new_df['Column1'] = 1
new_df['Column2'] = 2
old_df['Column1'] = np.nan
old_df['Column2'] = np.nan
old_df.loc[5,'Column1'] = 9
old_df.loc[6,'Column2'] = 7
new_df.update(old_df)
print(new_df.head(10))
Output:
Column1 Column2
0 1.0 2.0
1 1.0 2.0
2 1.0 2.0
3 1.0 2.0
4 1.0 2.0
5 9.0 2.0
6 1.0 7.0
7 1.0 2.0
8 1.0 2.0
9 1.0 2.0
Since you don't show how old_df is constructed, before doing the update make sure both indexes have the same dtype:
new_df.index = new_df.index.astype('int64')
old_df.index = old_df.index.astype('int64')
An int is not equal to a string (1 != '1'), so update() doesn't find any common rows between your dataframes and has nothing to do.
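A minimal sketch of that failure mode (toy frames of my own):
import pandas as pd
new_df = pd.DataFrame({'Column1': [1] * 5})            # int64 index 0..4
old_df = pd.DataFrame({'Column1': [9]}, index=['2'])   # note: string index
new_df.update(old_df)                                  # no common labels -> no-op
print(new_df.loc[2, 'Column1'])                        # still 1: '2' and 2 never matched
old_df.index = old_df.index.astype('int64')
new_df.update(old_df)                                  # now the labels match
print(new_df.loc[2, 'Column1'])                        # updated to 9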
