I have two dataframes and I need to conditionally update specific columns in the first dataframe.
df1 = pd.DataFrame([[1,'Foo',1,1,1,np.nan,np.nan,np.nan],[2,'Foo',2,2,2,np.nan,np.nan,np.nan],[3,'Bar',3,3,3,np.nan,np.nan,np.nan]], columns = ['Key','identifier','A','B','C','D','E','F'])
print(df1)
Key identifier A B C D E F
0 1 Foo 1 1 1 NaN NaN NaN
1 2 Foo 2 2 2 NaN NaN NaN
2 3 Bar 3 3 3 NaN NaN NaN
df2 = pd.DataFrame([[1,np.nan,10,10,10,5,6,7],[2,np.nan,12,12,12,8,9,10],[3,np.nan,13,13,13,11,12,13]], columns = ['Key','identifier','A','B','C','D','E','F'])
print(df2)
Key identifier A B C D E F
0 1 NaN 10 10 10 5 6 7
1 2 NaN 12 12 12 8 9 10
2 3 NaN 13 13 13 11 12 13
Where the identifier column in df1 == 'Foo', I need to update df1 columns D, E, F with the corresponding columns from df2. How can I conditionally update those three columns?
df3 = #code here
desired output:
print(df3)
Key identifier A B C D E F
0 1 Foo 1 1 1 5.0 6.0 7.0
1 2 Foo 2 2 2 8.0 9.0 10.0
2 3 Bar 3 3 3 NaN NaN NaN
Follow-Up
Say instead, df1 was the following:
df1 = pd.DataFrame([[1,'Foo',1,1,1,np.nan,np.nan,np.nan],[4,'Bar',4,4,4,np.nan,np.nan,np.nan],[2,'Foo',2,2,2,np.nan,np.nan,np.nan],[3,'Bar',3,3,3,np.nan,np.nan,np.nan]], columns = ['Key','identifier','A','B','C','D','E','F'])
Now the lengths of df1 and df2 aren't the same and the positioning of the records to be updated doesn't match. How is this still working? I get the following output:
df2[df1['identifier'] == 'Foo'].combine_first(df1)
Key identifier A B C D E F
0 1.0 Foo 10.0 10.0 10.0 5.0 6.0 7.0
1 4.0 Bar 4.0 4.0 4.0 NaN NaN NaN
2 3.0 Foo 13.0 13.0 13.0 11.0 12.0 13.0
3 3.0 Bar 3.0 3.0 3.0 NaN NaN NaN
Use combine_first, after setting Key to the index with set_index.
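That is, starting from the frames in the question (a minimal setup sketch):
df1 = df1.set_index('Key')
df2 = df2.set_index('Key')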
df1
identifier A B C D E F
Key
1 Foo 1 1 1 NaN NaN NaN
2 Foo 2 2 2 NaN NaN NaN
3 Bar 3 3 3 NaN NaN NaN
df2
identifier A B C D E F
Key
1 NaN 10 10 10 5 6 7
2 NaN 12 12 12 8 9 10
3 NaN 13 13 13 11 12 13
df2[df1.eval('identifier == "Foo"')].combine_first(df1)
identifier A B C D E F
Key
1 Foo 10.0 10.0 10.0 5.0 6.0 7.0
2 Foo 12.0 12.0 12.0 8.0 9.0 10.0
3 Bar 3.0 3.0 3.0 NaN NaN NaN
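If you want Key back as an ordinary column in the result, chaining reset_index restores it; for example:
df3 = df2[df1.eval('identifier == "Foo"')].combine_first(df1).reset_index()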
Related
I am looking for a method to create an array of numbers to label groups, based on the value of the 'number' column, if that's possible.
With this abbreviated example DF:
from numpy import nan

number = [nan,nan,1,nan,nan,nan,2,nan,nan,3,nan,nan,nan,nan,nan,4,nan,nan]
df = pd.DataFrame({'number': number})
Ideally I would like to make a new column, 'group', based on the int in column 'number', so there would effectively be runs of 1, 2, 3, etc. FWIW, the DF is thousands of lines long, with sporadically placed ints.
The result would be a new column, something like this:
number group
0 NaN 0
1 NaN 0
2 1.0 1
3 NaN 1
4 NaN 1
5 NaN 1
6 2.0 2
7 NaN 2
8 NaN 2
9 3.0 3
10 NaN 3
11 NaN 3
12 NaN 3
13 NaN 3
14 NaN 3
15 4.0 4
16 NaN 4
17 NaN 4
All advice much appreciated!
You can use notna combined with cumsum:
df['group'] = df['number'].notna().cumsum()
NB: if the gaps were zeros instead of NaN, use df['group'] = df['number'].ne(0).cumsum().
output:
number group
0 NaN 0
1 NaN 0
2 1.0 1
3 NaN 1
4 NaN 1
5 NaN 1
6 2.0 2
7 NaN 2
8 NaN 2
9 3.0 3
10 NaN 3
11 NaN 3
12 NaN 3
13 NaN 3
14 NaN 3
15 4.0 4
16 NaN 4
17 NaN 4
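To see why this works: notna marks each non-missing entry, and cumsum turns those marks into running group ids. A tiny illustration:
s = pd.Series([nan, nan, 1, nan, nan, 2])
print(s.notna().cumsum().tolist())  # [0, 0, 1, 1, 1, 2]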
You can use forward fill:
df['number'].ffill().fillna(0)
Output:
0 0.0
1 0.0
2 1.0
3 1.0
4 1.0
5 1.0
6 2.0
7 2.0
8 2.0
9 3.0
10 3.0
11 3.0
12 3.0
13 3.0
14 3.0
15 4.0
16 4.0
17 4.0
Name: number, dtype: float64
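Because the NaNs force a float dtype, you can cast the filled result back to integers if you want whole-number labels; for example:
df['group'] = df['number'].ffill().fillna(0).astype(int)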
The objective is to fill NaN with respect to two columns (i.e., a and b).
a,b,c,d
2,0,1,4
5,0,5,6
6,0,1,1
1,1,1,4
4,1,5,6
5,1,5,6
6,1,1,1
1,2,2,3
6,2,5,6
That is, for each fixed value in column b, column a should run continuously from 1 to 6, with the newly added rows getting NaN in the remaining columns.
The following code snippet does the trick:
import numpy as np
import pandas as pd
maxval_col_a=6
lowval_col_a=1
maxval_col_b=2
lowval_col_b=0
r=list(range(lowval_col_b,maxval_col_b+1))
df = pd.DataFrame(np.column_stack([[2,5,6,1,4,5,6,1,6],
                                   [0,0,0,1,1,1,1,2,2],
                                   [1,5,1,1,5,5,1,2,5],
                                   [4,6,1,4,6,6,1,3,6]]),
                  columns=['a','b','c','d'])
all_df = []
for idx in r:
    k = df.loc[df['b'] == idx].set_index('a').reindex(range(lowval_col_a, maxval_col_a + 1)).reset_index()
    k['b'] = idx
    all_df.append(k)
df = pd.concat(all_df)
But I am curious whether there is a more efficient and better way of doing this with pandas.
The expected output:
a b c d
0 1 0 NaN NaN
1 2 0 1.0 4.0
2 3 0 NaN NaN
3 4 0 NaN NaN
4 5 0 5.0 6.0
5 6 0 1.0 1.0
0 1 1 1.0 4.0
1 2 1 NaN NaN
2 3 1 NaN NaN
3 4 1 5.0 6.0
4 5 1 5.0 6.0
5 6 1 1.0 1.0
0 1 2 2.0 3.0
1 2 2 NaN NaN
2 3 2 NaN NaN
3 4 2 NaN NaN
4 5 2 NaN NaN
5 6 2 5.0 6.0
Create the cartesian product of combinations:
mi = pd.MultiIndex.from_product([df['b'].unique(), range(1, 7)],
                                names=['b', 'a']).swaplevel()
out = df.set_index(['a', 'b']).reindex(mi).reset_index()
print(out)
# Output
a b c d
0 1 0 NaN NaN
1 2 0 1.0 4.0
2 3 0 NaN NaN
3 4 0 NaN NaN
4 5 0 5.0 6.0
5 6 0 1.0 1.0
6 1 1 1.0 4.0
7 2 1 NaN NaN
8 3 1 NaN NaN
9 4 1 5.0 6.0
10 5 1 5.0 6.0
11 6 1 1.0 1.0
12 1 2 2.0 3.0
13 2 2 NaN NaN
14 3 2 NaN NaN
15 4 2 NaN NaN
16 5 2 NaN NaN
17 6 2 5.0 6.0
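The same cartesian scaffold also works with a plain left merge, which keeps a and b as ordinary columns throughout and preserves the b-major row order (a sketch of an equivalent approach):
base = (pd.MultiIndex.from_product([df['b'].unique(), range(1, 7)], names=['b', 'a'])
          .to_frame(index=False))
out = base.merge(df, on=['b', 'a'], how='left')[['a', 'b', 'c', 'd']]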
First create a MultiIndex from the columns [a, b], then a new MultiIndex with all the combinations, and then reindex with the new MultiIndex:
(showing all steps)
# set both a and b as index (it's a multiindex)
df.set_index(['a','b'],drop=True,inplace=True)
# create the new multindex
new_idx_a = np.tile(np.arange(1, 6+1), 3)
new_idx_b = np.repeat([0,1,2], 6)
new_multidx = pd.MultiIndex.from_arrays([new_idx_a, new_idx_b])
# reindex
df=df.reindex(new_multidx)
# convert the multindex back to columns
df.index.names=['a','b']
df.reset_index()
results:
a b c d
0 1 0 NaN NaN
1 2 0 1.0 4.0
2 3 0 NaN NaN
3 4 0 NaN NaN
4 5 0 5.0 6.0
5 6 0 1.0 1.0
6 1 1 1.0 4.0
7 2 1 NaN NaN
8 3 1 NaN NaN
9 4 1 5.0 6.0
10 5 1 5.0 6.0
11 6 1 1.0 1.0
12 1 2 2.0 3.0
13 2 2 NaN NaN
14 3 2 NaN NaN
15 4 2 NaN NaN
16 5 2 NaN NaN
17 6 2 5.0 6.0
We can do it by using a groupby on the column b, then setting a as the index and adding the missing values of a using numpy.arange.
To finish, reset the index to get the expected result:
import numpy as np
df.groupby('b').apply(lambda x: x.set_index('a').reindex(np.arange(1, 7))).drop(columns='b').reset_index()
Output :
b a c d
0 0 1 NaN NaN
1 0 2 1.0 4.0
2 0 3 NaN NaN
3 0 4 NaN NaN
4 0 5 5.0 6.0
5 0 6 1.0 1.0
6 1 1 1.0 4.0
7 1 2 NaN NaN
8 1 3 NaN NaN
9 1 4 5.0 6.0
10 1 5 5.0 6.0
11 1 6 1.0 1.0
12 2 1 2.0 3.0
13 2 2 NaN NaN
14 2 3 NaN NaN
15 2 4 NaN NaN
16 2 5 NaN NaN
17 2 6 5.0 6.0
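On recent pandas versions you can sidestep the drop entirely by selecting just c and d before the apply, since the group key b becomes an index level that reset_index turns back into a column (a sketch of the same idea):
(df.set_index('a')
   .groupby('b', group_keys=True)[['c', 'd']]
   .apply(lambda g: g.reindex(np.arange(1, 7)))
   .reset_index())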
Hi, I have the following dataframe, which is ~1,400,000 rows:
x = pd.DataFrame({'ID':['A','B','D','D','F'], 'start1':[1,2,3,4,5], 'start2':[12,11,10,6,7], 'start3':[1,6,2,4,5], 'start4':[5,4,2,3,1], 'start5':[0,0,0,0,0], 'end1':[2,3,4,7,9] })
ID start1 start2 start3 start4 start5 end1
A 1 12 1 5 0 2
B 2 11 6 4 0 3
D 3 10 2 2 0 4
D 4 6 4 3 0 7
F 5 7 5 1 0 9
I'm looking to collapse all of the columns whose headers contain 'start' or 'end' into the following format:
desired output:
ID start end
A 1 NaN
A 12 NaN
A 1 NaN
A 5 NaN
A 0 NaN
A NaN 2
B 2 NaN
B 11 NaN
B 6 NaN
B 4 NaN
B 0 NaN
B 3 NaN
...
F 1 NaN
F 0 NaN
F NaN 9
I have tried:
joined = df2.apply(lambda x: ' '.join([str(xi) for xi in x]), axis=1)
split = joined.str.split(' ', expand=True).reset_index(drop=False).melt(id_vars='index')
However, this seems to use up all my memory and the environment crashes.
Any help would be great.
Try melting the start columns and concatenating with the end column:
(pd.concat([x.iloc[:, :-1].melt('ID', value_name='start')
              .sort_values(['ID', 'variable']).drop('variable', axis=1),
            x[['ID', 'end1']]])
   .sort_values('ID', kind='mergesort'))
Output:
ID start end1
0 A 1.0 NaN
5 A 12.0 NaN
10 A 1.0 NaN
15 A 5.0 NaN
20 A 0.0 NaN
0 A NaN 2.0
1 B 2.0 NaN
6 B 11.0 NaN
11 B 6.0 NaN
16 B 4.0 NaN
21 B 0.0 NaN
1 B NaN 3.0
2 D 3.0 NaN
3 D 4.0 NaN
7 D 10.0 NaN
8 D 6.0 NaN
12 D 2.0 NaN
13 D 4.0 NaN
17 D 2.0 NaN
18 D 3.0 NaN
22 D 0.0 NaN
23 D 0.0 NaN
2 D NaN 4.0
3 D NaN 7.0
4 F 5.0 NaN
9 F 7.0 NaN
14 F 5.0 NaN
19 F 1.0 NaN
24 F 0.0 NaN
4 F NaN 9.0
Remember that you are trying to duplicate a large amount of data here, so you need to be careful.
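If the last column should be called end as in the desired output, a rename on the result does it; for example, assuming the chain above is stored in a variable named result (a hypothetical name):
result = result.rename(columns={'end1': 'end'})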
How about this?
import numpy as np
out = pd.DataFrame(columns=['ID', 'start', 'end'])
for col in x.columns:
    if 'start' in col:
        out_col = 'start'
    if 'end' in col:
        out_col = 'end'
    if 'ID' not in col:
        temp = x[['ID', col]].rename(columns={col: out_col})
        out = pd.concat([out, temp])
Output:
ID start end
0 A 1 NaN
0 A 0 NaN
0 A NaN 2
0 A 12 NaN
0 A 5 NaN
0 A 1 NaN
1 B 0 NaN
1 B 4 NaN
1 B NaN 3
1 B 6 NaN
1 B 11 NaN
1 B 2 NaN
2 D 10 NaN
2 D 0 NaN
2 D 2 NaN
3 D 4 NaN
3 D NaN 7
2 D NaN 4
2 D 2 NaN
3 D 3 NaN
3 D 4 NaN
2 D 3 NaN
3 D 6 NaN
3 D 0 NaN
4 F 5 NaN
4 F 1 NaN
4 F 7 NaN
4 F 5 NaN
4 F 0 NaN
4 F NaN 9
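Since the question mentions running out of memory, it is usually cheaper to collect the pieces in a list and concatenate once at the end, rather than growing out inside the loop; a sketch of the same idea:
pieces = []
for col in x.columns:
    if col == 'ID':
        continue
    # map start1..start5 to 'start' and end1 to 'end'
    out_col = 'start' if col.startswith('start') else 'end'
    pieces.append(x[['ID', col]].rename(columns={col: out_col}))
out = pd.concat(pieces)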
You can merge all the start columns into one with .ravel().
So: save end1 and ID in other variables:
end1values = x['end1']
idvalues = x['ID']
Remove end1 and ID from the data set:
x.drop('end1', axis='columns', inplace=True)
x.drop('ID', axis='columns', inplace=True)
Use ravel for the start columns:
df = pd.DataFrame({'start':x.values.ravel()})
Add end1 and ID back:
df['ID'] = idvalues
df['end'] = end1values
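One caveat, not in the original answer: after ravel the new frame has one row per start cell (five per original row here), while idvalues and end1values still have only five elements, so the plain assignments above fill just the first few rows via index alignment. Repeating the IDs explicitly lines them up (a sketch, assuming x now holds only the five start columns):
import numpy as np
df['ID'] = np.repeat(idvalues.values, x.shape[1])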
If I have a pandas data frame of ones like this:
NaN 1 1 1 1 NaN 1 1 1 NaN 1
NaN NaN 1 1 1 1 NaN NaN 1 NaN 1
NaN NaN 1 1 1 1 1 1 1 1 1
How do I do a cumulative sum in each row, but then set each contiguous group of ones to the maximum value of its cumulative sum, so that I get a pandas data frame like this:
NaN 4 4 4 4 NaN 3 3 3 NaN 1
NaN NaN 4 4 4 4 NaN NaN 1 NaN 1
NaN NaN 9 9 9 9 9 9 9 9 9
First we stack with isnull, then create the sub-groups with cumsum and count the consecutive 1s with transform; the last step is unstack to convert the data back:
# mark the NaN positions, reshaped into a MultiIndex Series
s=df.isnull().stack()
# number the runs within each row and keep only the non-NaN cells
s=s.groupby(level=0).cumsum()[~s]
# count the length of each run, then reshape back to the original frame
s=s.groupby([s.index.get_level_values(0),s]).transform('count').unstack().reindex_like(df)
0 1 2 3 4 5 6 7 8 9 10
0 NaN 4.0 4.0 4.0 4.0 NaN 3.0 3.0 3.0 NaN 1.0
1 NaN NaN 4.0 4.0 4.0 4.0 NaN NaN 1.0 NaN 1.0
2 NaN NaN 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0
Many more steps than @YOBEN_S's answer, but we can make use of melt and groupby.
We use cumcount to create a conditional helper column to group with.
from io import StringIO

import numpy as np
import pandas as pd
d = """ NaN 1 1 1 1 NaN 1 1 1 NaN 1
NaN NaN 1 1 1 1 NaN NaN 1 NaN 1
NaN NaN 1 1 1 1 1 1 1 1 1"""
df = pd.read_csv(StringIO(d), header=None, sep=r"\s+")
s = df.reset_index().melt(id_vars="index")
s.loc[s["value"].isnull(), "counter"] = s.groupby(
[s["index"], s["value"].isnull()]
).cumcount()
s["counter"] = s.groupby(["index"])["counter"].ffill()
s["val"] = s.groupby(["index", "counter"])["value"].cumsum()
s["val"] = s.groupby(["counter", "index"])["val"].transform("max")
s.loc[s["value"].isnull(), "val"] = np.nan
df2 = (
s.groupby(["index", "variable"])["val"]
.first()
.unstack()
.rename_axis(None, axis=1)
.rename_axis(None)
)
print(df2)
0 1 2 3 4 5 6 7 8 9 10
0 NaN 4.0 4.0 4.0 4.0 NaN 3.0 3.0 3.0 NaN 1.0
1 NaN NaN 4.0 4.0 4.0 4.0 NaN NaN 1.0 NaN 1.0
2 NaN NaN 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0
If I have a pandas data frame like this:
A B C D E F G H
0 0 2 3 5 NaN NaN NaN NaN
1 2 7 9 1 2 NaN NaN NaN
2 1 5 7 2 1 2 1 NaN
3 6 1 3 2 1 1 5 5
4 1 2 3 6 NaN NaN NaN NaN
How do I move all of the numerical values to the end of each row and place the NaNs before them, so that I get a pandas data frame like this:
A B C D E F G H
0 NaN NaN NaN NaN 0 2 3 5
1 NaN NaN NaN 2 7 9 1 2
2 NaN 1 5 7 2 1 2 1
3 6 1 3 2 1 1 5 5
4 NaN NaN NaN NaN 1 2 3 6
One-line solution:
df.apply(lambda x: pd.concat([x[x.isna()], x[x.notna()]], ignore_index=True), axis=1)
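For large frames, a vectorized alternative to the row-wise apply is a stable argsort on the notna mask, which sends the NaNs to the front while keeping the values in their original order (a sketch, not from the original answers):
import numpy as np
order = np.argsort(df.notna().to_numpy(), axis=1, kind='stable')
out = pd.DataFrame(np.take_along_axis(df.to_numpy(), order, axis=1),
                   index=df.index, columns=df.columns)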
I guess the best approach is to work row by row. Make a function to do the job and use apply or transform to use that function on each row.
def movenan(x):
    fl = len(x)
    nl = len(x.dropna())
    nanarr = np.empty(fl - nl)
    nanarr[:] = np.nan
    return pd.concat([pd.Series(nanarr), x.dropna()], ignore_index=True)

ddf = df.transform(movenan, axis=1)
ddf.columns = df.columns
Using your sample data, the resulting ddf is:
A B C D E F G H
0 NaN NaN NaN NaN 0.0 2.0 3.0 5.0
1 NaN NaN NaN 2.0 7.0 9.0 1.0 2.0
2 NaN 1.0 5.0 7.0 2.0 1.0 2.0 1.0
3 6.0 1.0 3.0 2.0 1.0 1.0 5.0 5.0
4 NaN NaN NaN NaN 1.0 2.0 3.0 6.0
The movenan function creates an array of nan of the required length, drops the nan from the row, and concatenates the two resulting Series.
ignore_index=True is required because you don't want to preserve data position in their columns (values are moved to different columns), but in doing so the column names are lost and replaced by integers. The last line simply copies the column names back into the new dataframe.