pandas add rows to original df based on groupby - python

I want to group by text; if at least one name exists in a group, loop through the names, and for each name that does not appear, add a row to the original df.
example -
names_set = {'A','B','C'}
initial df:
import pandas as pd

columns = ['id','text','name','start','end']
data = [
    [1, "this is text 1", 'A', 0, 4],
    [2, "this is text 1", 'B', 4, 5],
    [3, "this is text 1", 'C', 4, 5],
    [3, "this is text 2", 'A', 6, 8],
    [4, 'this is text 3', None, None, None],
    [5, "this is text 4", 'B', 10, 13],
    [6, "this is text 4", 'B', 1, 5]
]
df1 = pd.DataFrame(data=data, columns=columns)
df1
id text name start end
0 1 this is text 1 A 0.0 4.0
1 2 this is text 1 B 4.0 5.0
2 3 this is text 1 C 4.0 5.0
3 3 this is text 2 A 6.0 8.0
4 4 this is text 3 None NaN NaN
5 5 this is text 4 B 10.0 13.0
6 6 this is text 4 B 1.0 5.0
output:
columns2 = ['id','text','name','start','end']
data2 = [
    [1, "this is text 1", 'A', 0, 4],
    [2, "this is text 1", 'B', 4, 5],
    [3, "this is text 1", 'C', 4, 5],
    [3, "this is text 2", 'A', 6, 8],
    [None, "this is text 2", 'B', None, None],
    [None, "this is text 2", 'C', None, None],
    [4, 'this is text 3', None, None, None],
    [None, "this is text 4", 'A', None, None],
    [5, "this is text 4", 'B', 10, 13],
    [6, "this is text 4", 'B', 1, 5],
    [None, "this is text 4", 'C', None, None]
]
df2 = pd.DataFrame(data=data2, columns=columns2)
df2
id text name start end
0 1.0 this is text 1 A 0.0 4.0
1 2.0 this is text 1 B 4.0 5.0
2 3.0 this is text 1 C 4.0 5.0
3 3.0 this is text 2 A 6.0 8.0
4 NaN this is text 2 B NaN NaN
5 NaN this is text 2 C NaN NaN
6 4.0 this is text 3 None NaN NaN
7 NaN this is text 4 A NaN NaN
8 5.0 this is text 4 B 10.0 13.0
9 6.0 this is text 4 B 1.0 5.0
10 NaN this is text 4 C NaN NaN
The code I have so far:
g = df1.groupby('text')
text_names_group = df1.groupby("text")["name"].agg(list)
text_names_group
for text in text_names_group:
    if len(text) == 1 and text[0] is None:
        continue
    cur_names = set(text)
    missing_names_per_text = names_set - cur_names
So missing_names_per_text holds the names missing for each text, but I want to add those as new rows to the original df, per text.
Thanks!
Edit: two rows can share the same text and name but have different start and end values; see the added row with id 6 in the input above.

First filter only the rows whose name matches names_set and add the missing combinations with reindex; then append the non-matching rows (inverting the mask with ~ in boolean indexing) and join everything together with concat:
names_set = {'A','B','C'}
m = df1['name'].isin(names_set)
df2 = (df1.set_index(['text', 'name'])
          .reindex(pd.MultiIndex.from_product([df1.loc[m, 'text'].unique(),
                                               sorted(names_set)],
                                              names=['text', 'name']))).reset_index()
df = pd.concat([df1[~m], df2]).sort_values(['text'], ignore_index=True)
print (df)
id text name start end
0 1.0 this is text 1 A 0.0 4.0
1 2.0 this is text 1 B 4.0 5.0
2 3.0 this is text 1 C 4.0 5.0
3 3.0 this is text 2 A 6.0 8.0
4 NaN this is text 2 B NaN NaN
5 NaN this is text 2 C NaN NaN
6 4.0 this is text 3 None NaN NaN
7 NaN this is text 4 A NaN NaN
8 5.0 this is text 4 B 10.0 13.0
9 NaN this is text 4 C NaN NaN
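Note this reindex-based variant assumes each (text, name) pair occurs at most once; with the duplicated row from the edit (id 6), set_index yields a non-unique MultiIndex and reindex raises a ValueError. A quick check:
# with both ('this is text 4', 'B') rows present the index is not unique,
# so reindex cannot be used; the merge-based solution below covers this case
df1.set_index(['text', 'name']).index.is_unique  # False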
With real data there can be a problem sorting by the text column, so here is a solution that preserves the original order by mapping through an enumerate dictionary:
columns = ['id','text','name','start','end']
data = [
    [1, "this is text 10", 'A', 0, 4],
    [2, "this is text 10", 'B', 4, 5],
    [3, "this is text 10", 'C', 4, 5],
    [3, "this is text 20", 'A', 6, 8],
    [4, 'this is text 13', None, None, None],
    [5, "this is text 14", 'B', 10, 13]
]
df1 = pd.DataFrame(data=data, columns=columns)
names_set = {'A','B','C'}
m = df1['name'].isin(names_set)
sorting = {v: k for k, v in enumerate(df1['text'].drop_duplicates())}
print (sorting)
{'this is text 10': 0, 'this is text 20': 1, 'this is text 13': 2, 'this is text 14': 3}
mux = pd.MultiIndex.from_product([df1.loc[m, 'text'].unique(),
                                  sorted(names_set)],
                                 names=['text', 'name'])
df2 = df1.set_index(['text', 'name']).reindex(mux).reset_index()
df = pd.concat([df1[~m], df2]).sort_values(['text'],
                                           ignore_index=True,
                                           key=lambda x: x.map(sorting))
print (df2)
text name id start end
0 this is text 10 A 1.0 0.0 4.0
1 this is text 10 B 2.0 4.0 5.0
2 this is text 10 C 3.0 4.0 5.0
3 this is text 20 A 3.0 6.0 8.0
4 this is text 20 B NaN NaN NaN
5 this is text 20 C NaN NaN NaN
6 this is text 14 A NaN NaN NaN
7 this is text 14 B 5.0 10.0 13.0
8 this is text 14 C NaN NaN NaN
The solution for duplicated names per group is similar:
names_set = {'A','B','C'}
m = df1['name'].isin(names_set)
sorting = {v: k for k, v in enumerate(df1['text'].drop_duplicates())}
#print (sorting)
Create a helper DataFrame df3 with MultiIndex.to_frame, add the missing text and name rows, and finally left join it with the original DataFrame:
df3 = pd.MultiIndex.from_product([df1.loc[m, 'text'].unique(),
                                  sorted(names_set)],
                                 names=['text', 'name']).to_frame(index=False)
df3 = (pd.concat([df1.loc[~m, ['text', 'name']], df3])
         .sort_values(['text'], ignore_index=True, key=lambda x: x.map(sorting)))
print (df3)
text name
0 this is text 1 A
1 this is text 1 B
2 this is text 1 C
3 this is text 2 A
4 this is text 2 B
5 this is text 2 C
6 this is text 3 None
7 this is text 4 A
8 this is text 4 B
9 this is text 4 C
df2 = df3.merge(df1, how='left')
print (df2)
text name id start end
0 this is text 1 A 1.0 0.0 4.0
1 this is text 1 B 2.0 4.0 5.0
2 this is text 1 C 3.0 4.0 5.0
3 this is text 2 A 3.0 6.0 8.0
4 this is text 2 B NaN NaN NaN
5 this is text 2 C NaN NaN NaN
6 this is text 3 None 4.0 NaN NaN
7 this is text 4 A NaN NaN NaN
8 this is text 4 B 5.0 10.0 13.0
9 this is text 4 B 6.0 1.0 5.0
10 this is text 4 C NaN NaN NaN
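For completeness, here is a sketch of an equivalent approach that builds the full (text, name) grid with a cross merge (pandas >= 1.2) instead of a MultiIndex; the grid and out names are just for illustration:
import pandas as pd

names = pd.DataFrame({'name': sorted(names_set)})
m = df1['name'].isin(names_set)

# all (text, name) combinations for texts that have at least one matching name
grid = df1.loc[m, ['text']].drop_duplicates().merge(names, how='cross')

# left-join the original rows back (duplicates survive the join),
# then re-append the rows whose name is not in names_set
out = (pd.concat([grid.merge(df1, on=['text', 'name'], how='left'), df1[~m]])
         .sort_values('text', kind='stable', ignore_index=True))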

Related

Python How to drop rows of Pandas DataFrame whose value in a certain column is NaN

I have this DataFrame and want only the records whose "Total" column is not NaN, and to drop the records where more than two of A~E are NaN:
A B C D E Total
1 1 3 5 5 8
1 4 3 5 5 NaN
3 6 NaN NaN NaN 6
2 2 5 9 NaN 8
i.e. something like df.dropna(...) to get this resulting dataframe:
A B C D E Total
1 1 3 5 5 8
2 2 5 9 NaN 8
Here's my code
import pandas as pd
dfInputData = pd.read_csv(path)
dfInputData = dfInputData.dropna(axis=1,how = 'any')
RowCnt = dfInputData.shape[0]
But it looks like no modification has been made, and no error is raised either.
Please help!! Thanks
Use boolean indexing: count the missing values across all columns except Total, and require a non-missing value in Total:
df = df[df.drop('Total', axis=1).isna().sum(axis=1).le(2) & df['Total'].notna()]
print (df)
A B C D E Total
0 1 1 3.0 5.0 5.0 8.0
3 2 2 5.0 9.0 NaN 8.0
Or filter columns between A:E:
df = df[df.loc[:, 'A':'E'].isna().sum(axis=1).le(2) & df['Total'].notna()]
print (df)
A B C D E Total
0 1 1 3.0 5.0 5.0 8.0
3 2 2 5.0 9.0 NaN 8.0
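Equivalently, assuming the columns really are named A through E plus Total, the same filter can be written with dropna and its thresh parameter (keep rows having at least 3 non-missing values among A~E):
# drop rows with missing Total, then rows with fewer than
# 3 non-missing values in columns A through E
df = df.dropna(subset=['Total']).dropna(subset=list('ABCDE'), thresh=3)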

Pandas: How to replace values of Nan in column based on another column?

I have a dataset as below:
import math
import numpy as np
import pandas as pd

data = {
    "A": [math.nan, math.nan, 1, math.nan, 2, math.nan, 3, 5],
    "B": np.random.randint(1, 5, size=8)
}
dt = pd.DataFrame(data)
My desired output is: if column A has a NaN, replace it with twice the value of column B in the same row. So, given that, the below is my dataset:
A B
NaN 1
NaN 1
1.0 3
NaN 2
2.0 3
NaN 1
3.0 1
5.0 3
My desired output is:
A B
2 1
2 1
1 3
4 2
2 3
2 1
3 1
5 3
My current solution is as below, which does not work (the chained indexing assigns to a temporary copy):
dt[pd.isna(dt["A"])]["A"] = dt[pd.isna(dt["A"])]["B"].apply( lambda x:2*x )
print(dt)
In your case, use fillna:
df.A.fillna(df.B*2, inplace=True)
df
A B
0 2.0 1
1 2.0 1
2 1.0 3
3 4.0 2
4 2.0 3
5 2.0 1
6 3.0 1
7 5.0 3
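Note that under recent pandas (with copy-on-write enabled), inplace fillna through an attribute-accessed column may not propagate back to the frame; a safer sketch, using the question's dt, is plain assignment:
# assign the filled column back instead of relying on inplace=True
dt['A'] = dt['A'].fillna(dt['B'] * 2)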

pandas Dataframe Replace NaN values with with previous value based on a key column

I have a pd.DataFrame that looks like this:
key_value a b c d e
value_01 1 10 x NaN NaN
value_01 NaN 12 NaN NaN NaN
value_01 NaN 7 NaN NaN NaN
value_02 7 4 y NaN NaN
value_02 NaN 5 NaN NaN NaN
value_02 NaN 6 NaN NaN NaN
value_03 19 15 z NaN NaN
So now, based on the key_value:
For columns 'a' and 'c', I want to fill each missing cell with the last value seen in the same column for that key_value.
For column 'd', I want to copy row i-1's value of column 'b' into row i of column 'd'.
Lastly, for column 'e', I want to put the running sum of column 'b' over the rows up to i-1 into row i of column 'e'.
For every key_value, columns 'a', 'b' and 'c' have a value in their first row, from which the next values are copied over or computed.
key_value a b c d e
value_01 1 10 x NaN NaN
value_01 1 12 x 10 10
value_01 1 7 x 12 22
value_02 7 4 y NaN NaN
value_02 7 5 y 4 4
value_02 7 6 y 5 9
value_03 19 15 z NaN NaN
My current approach:
size = df.key_value.size
for i in range(size):
    if pd.isna(df.a[i]) and df.key_value[i] == output.key_value[i - 1]:
        df.a[i] = df.a[i - 1]
        df.c[i] = df.c[i - 1]
        df.d[i] = df.b[i - 1]
        df.e[i] = df.e[i] + df.b[i - 1]
For columns like 'a' and 'b', the NaN values are all in the same row indexes.
My approach works but takes very long since my dataframe has over 50000 records. I was wondering if there is a faster way, since I have multiple columns like 'a' and 'b' whose values need to be copied over based on 'key_value', and some columns whose values are computed from a column like 'b'.
pd.concat with groupby and assign
pd.concat([
    g.ffill().assign(d=lambda d: d.b.shift(), e=lambda d: d.d.cumsum())
    for _, g in df.groupby('key_value')
])
key_value a b c d e
0 value_01 1.0 1 x NaN NaN
1 value_01 1.0 2 x 1.0 1.0
2 value_01 1.0 3 x 2.0 3.0
3 value_02 7.0 4 y NaN NaN
4 value_02 7.0 5 y 4.0 4.0
5 value_02 7.0 6 y 5.0 9.0
6 value_03 19.0 7 z NaN NaN
groupby and apply
def h(g):
    return g.ffill().assign(
        d=lambda d: d.b.shift(), e=lambda d: d.d.cumsum())

df.groupby('key_value', as_index=False, group_keys=False).apply(h)
You can use groupby + ffill for the groupwise filling. The other operations require shift and cumsum.
In general, note that many common operations have been implemented efficiently in Pandas.
g = df.groupby('key_value')
df['a'] = g['a'].ffill()
df['c'] = g['c'].ffill()
df['d'] = df['b'].shift()
df['e'] = df['d'].cumsum()
print(df)
key_value a b c d e
0 value_01 1.0 1 x NaN NaN
1 value_01 1.0 2 x 1.0 1.0
2 value_01 1.0 3 x 2.0 3.0
3 value_02 7.0 4 y 3.0 6.0
4 value_02 7.0 5 y 4.0 10.0
5 value_02 7.0 6 y 5.0 15.0
6 value_03 19.0 7 z 6.0 21.0
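Note that in the desired output, d and e restart within each key_value group, while the plain shift and cumsum above run across group boundaries (compare row 3). A grouped sketch that matches the expected output:
g = df.groupby('key_value')
df['a'] = g['a'].ffill()
df['c'] = g['c'].ffill()
df['d'] = g['b'].shift()                         # previous b within each group
df['e'] = df.groupby('key_value')['d'].cumsum()  # running sum of d within each group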

Pandas merge rows with ids in separate columns

Total meltdown here, need some assistance.
I have a DataFrame with +10m rows and some 150 columns with two ids, looking like below:
df = pd.DataFrame({'id1' : [1, 2, 5, 3, 6, 4],
                   'id2' : [2, 1, np.nan, 4, np.nan, 3],
                   'num' : [123, 3231, 123, 231, 6534, 2394]})
id1 id2 num
0 1 2.0 123
1 2 1.0 3231
2 5 NaN 123
3 3 4.0 231
4 6 NaN 6534
5 4 3.0 2394
Row indexes 0 and 1 form a pair given id1 and id2, and row indexes 3 and 5 form a pair in the same way. I want the table below, where the second row of each pair is merged into the first:
df = pd.DataFrame({'id1' : [1, 5, 3, 6],
                   'id2' : [2, np.nan, 4, np.nan],
                   'num' : [123, 123, 231, 6534],
                   '2_num' : [3231, np.nan, 2394, np.nan]})
id1 id2 num 2_num
0 1 2.0 123 3231.0
1 5 NaN 123 NaN
2 3 4.0 231 2394.0
3 6 NaN 6534 NaN
How can this be achieved using id1 and id2, labeling all the columns that come from the second row of the pair with "2_"?
Here's a merge-based approach (thanks to @piRSquared for the improvement), i.e.
ndf = (df.merge(df, 'left', left_on=['id1', 'id2'],
                right_on=['id2', 'id1'], suffixes=['', '_2'])
         .drop(['id1_2', 'id2_2'], axis=1))
cols = ['id1', 'id2']
ndf[cols] = np.sort(ndf[cols], axis=1)
new = ndf.drop_duplicates(subset=['id1', 'id2'], keep='first')
id1 id2 num num_2
0 1.0 2.0 123 3231.0
2 5.0 NaN 123 NaN
3 3.0 4.0 231 2394.0
4 6.0 NaN 6534 NaN
The idea is to sort each pair of ids so that we group by them.
cols = ['id1', 'id2']
df[cols] = np.sort(df[cols], axis=1)
df.set_index(
    # fillna(-1) keeps rows with NaN ids, which groupby would otherwise drop;
    # cumcount numbers the rows within each (id1, id2) pair
    cols + [df.fillna(-1).groupby(cols).cumcount() + 1]
).num.unstack().add_suffix('_num').reset_index()
id1 id2 1_num 2_num
0 1.0 2.0 123.0 3231.0
1 3.0 4.0 231.0 2394.0
2 5.0 NaN 123.0 NaN
3 6.0 NaN 6534.0 NaN
Use:
df[['id1','id2']] = pd.DataFrame(np.sort(df[['id1','id2']].values, axis=1)).fillna('tmp')
print (df)
id1 id2 num
0 1.0 2 123
1 1.0 2 3231
2 5.0 tmp 123
3 3.0 4 231
4 6.0 tmp 6534
5 3.0 4 2394
df1 = df.groupby(['id1','id2'])['num'].apply(list)
print (df1)
id1 id2
1.0 2.0 [123, 3231]
3.0 4.0 [231, 2394]
5.0 tmp [123]
6.0 tmp [6534]
Name: num, dtype: object
df2 = (pd.DataFrame(df1.values.tolist(),
                    index=df1.index,
                    columns=['num','2_num'])
         .reset_index()
         .replace('tmp', np.nan))
print (df2)
id1 id2 num 2_num
0 1.0 2.0 123 3231.0
1 3.0 4.0 231 2394.0
2 5.0 NaN 123 NaN
3 6.0 NaN 6534 NaN
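The 'tmp' placeholder is needed because groupby drops NaN keys by default; on pandas >= 1.1 the fillna/replace round trip can be skipped with dropna=False (a sketch):
# keep NaN ids as their own groups instead of masking them with 'tmp'
df1 = df.groupby(['id1', 'id2'], dropna=False)['num'].apply(list)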

Python pandas.DataFrame: Make whole row NaN according to condition

I want to make the whole row NaN according to a condition, based on a column. For example, if B > 5, I want to make the whole row NaN.
Unprocessed data frame looks like this:
A B
0 1 4
1 3 5
2 4 6
3 8 7
Make whole row NaN, if B > 5:
A B
0 1.0 4.0
1 3.0 5.0
2 NaN NaN
3 NaN NaN
Thank you.
Use boolean indexing to assign values by condition:
df[df['B'] > 5] = np.nan
print (df)
A B
0 1.0 4.0
1 3.0 5.0
2 NaN NaN
3 NaN NaN
Or DataFrame.mask, which by default fills NaN where the condition is True:
df = df.mask(df['B'] > 5)
print (df)
A B
0 1.0 4.0
1 3.0 5.0
2 NaN NaN
3 NaN NaN
Thank you Bharath shetty:
df = df.where(~(df['B']>5))
You can also use df.loc[df.B > 5, :] = np.nan
Example
In [14]: df
Out[14]:
A B
0 1 4
1 3 5
2 4 6
3 8 7
In [15]: df.loc[df.B > 5, :] = np.nan
In [16]: df
Out[16]:
A B
0 1.0 4.0
1 3.0 5.0
2 NaN NaN
3 NaN NaN
In human language, df.loc[df.B > 5, :] = np.nan can be translated to:
assign np.nan to every column (:) of the dataframe (df) in rows where the
condition df.B > 5 holds.
Or using reindex
df.loc[df.B<=5,:].reindex(df.index)
Out[83]:
A B
0 1.0 4.0
1 3.0 5.0
2 NaN NaN
3 NaN NaN
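As an aside, where and mask are complements, so the two approaches above agree; a quick sketch of the equivalence, starting from a fresh df:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 3, 4, 8], 'B': [4, 5, 6, 7]})

# mask keeps values where the condition is False;
# where keeps values where the condition is True
assert df.mask(df['B'] > 5).equals(df.where(~(df['B'] > 5)))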
