In a pandas DataFrame, I can create a Series B with the maximum value of another Series A, from the first row to the current one, by using an expanding window:
df['B'] = df['A'].expanding().max()
I can also extract the index of the overall maximum value of Series A:
idx_max_A = df['A'].idxmax()
What I want is an efficient way to combine both; that is, to create a Series B that holds the value of the index of the maximum value of Series A from the first row up to the current one. Ideally, something like this...
df['B'] = df['A'].expanding().idxmax()
...but, of course, the above fails because the Expanding object does not have idxmax. Is there a straightforward way to do this?
EDIT: For illustration purposes, for the following DataFrame...
df = pd.DataFrame([1, 2, 1, 3, 0], index=['a', 'b', 'c', 'd', 'e'], columns=['A'])
...I'd like to create an additional column B so that the DataFrame contains the following:
A B
a 1 a
b 2 b
c 1 b
d 3 d
e 0 d
I believe you can use expanding + max + groupby:
v = df.expanding().max().A
df['B'] = v.groupby(v).transform('idxmax')
df
A B
a 1 a
b 2 b
c 1 b
d 3 d
e 0 d
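To see why the groupby trick works, it may help to print the intermediate expanding maximum (a small sketch; the variable name v is just illustrative):

```python
import pandas as pd

df = pd.DataFrame([1, 2, 1, 3, 0], index=['a', 'b', 'c', 'd', 'e'], columns=['A'])

# The expanding maximum repeats each value until a new maximum appears,
# so equal values form one group, and the group's first row is exactly
# where that maximum was first reached.
v = df['A'].expanding().max()
print(v.tolist())  # [1.0, 2.0, 2.0, 3.0, 3.0]

# Within each group of equal maxima, idxmax returns the first index label
# holding that value, which transform then broadcasts back to every row.
df['B'] = v.groupby(v).transform('idxmax')
print(df['B'].tolist())  # ['a', 'b', 'b', 'd', 'd']
```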
It seems idxmax is only available in the latest version of pandas, which I don't have yet. Here's a solution that doesn't involve groupby or idxmax:
import pandas as pd
import numpy as np
df = pd.DataFrame([1, 2, 1, 3, 0], index=['a', 'b', 'c', 'd', 'e'], columns=['A'])
temp = df.A.expanding().max()
df['B'] = temp.apply(lambda x: temp[temp == x].index[0])
df
A B
a 1 a
b 2 b
c 1 b
d 3 d
e 0 d
I have a data frame and would like to add a column 'e' based on the following condition:
if the value in 'c' appears in column 'a' AND the value in 'd' appears in column 'b' in the same row, then the corresponding entry of 'e' should be 'OK';
otherwise it should be "".
import pandas as pd
import numpy as np
A = {'a':[0,2,1,4], 'b':[4,5,1,7],'c':['1','2','3','6'], 'd':['1','4','2','9']}
df = pd.DataFrame(A)
The result I want to get is
A = {'a':[0,2,1,4], 'b':[4,5,1,7],'c':['1','2','3','6'], 'd':['1','4','2','9'], 'e':['OK','','','']}
You can merge df with itself on ['a', 'b'] on the left and ['c', 'd'] on the right. If index 'survives' the merge, then e should be OK:
df['e'] = np.where(
    df.index.isin(df.merge(df, left_on=['a', 'b'], right_on=['c', 'd']).index),
    'OK', '')
df
Output:
a b c d e
0 0 4 1 1 OK
1 2 5 2 4
2 1 1 3 2
3 4 7 6 9
P.S. Before the merge, we need to convert a and b columns to str type (or c and d to numeric), so that we can compare c and a, and d and b:
df[['a', 'b']] = df[['a', 'b']].astype(str)
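Putting the type conversion and the merge together, a complete sketch might look like the following. It is a variant of the same idea that carries the original row labels through the merge explicitly via reset_index (rather than relying on the merge result's default RangeIndex), keying the left side on ['c', 'd'] so the surviving labels are the rows that get 'OK':

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [0, 2, 1, 4], 'b': [4, 5, 1, 7],
                   'c': ['1', '2', '3', '6'], 'd': ['1', '4', '2', '9']})

# Make the dtypes comparable before merging (a/b are int, c/d are str).
df[['a', 'b']] = df[['a', 'b']].astype(str)

# Self-merge: a row is 'OK' when its (c, d) pair appears as (a, b) in
# some row.  reset_index turns the original labels into an 'index'
# column that survives the merge.
matched = df.reset_index().merge(df, left_on=['c', 'd'], right_on=['a', 'b'])
df['e'] = np.where(df.index.isin(matched['index']), 'OK', '')
print(df['e'].tolist())  # ['OK', '', '', '']
```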
I have the data frame below with 5 columns. I need to check all columns for a specific string ("-") and, where it is found, add the value from the preceding column to a new column F. For example, "-" is located in column B at rows zero and two; hence 'a' and 'c' (the preceding column's values) are added to column F in those rows, and so on.
Source Data Frame:
Desired Data Frame would be:
I have written the code below, but I get a value-length error when I try to create the new column F. I'd appreciate your support.
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': {0: 'a', 1: 'b', 2: 'c', 3: 'd', 4: 'e'},
                   'B': {0: '-', 1: 'a', 2: '-', 3: 'b', 4: 'd'}})
df['C'] = np.where(df['B'].isin(df['A'].values), df['B'], np.nan)
df['C'] = df['C'].map(dict(zip(df.A.values, df.B.values)))
df['D'] = np.where(df['C'].isin(df['B'].values), df['C'], np.nan)
df['D'] = df['D'].map(dict(zip(df.B.values, df['C'].values)))
df['E'] = np.where(df['D'].isin(df['C'].values), df['D'], np.nan)
df['E'] = df['E'].map(dict(zip(df['C'].values, df['D'].values)))
a = np.array(df.iloc[:, :5])
g = []
for index, x in np.ndenumerate(a):
    temp = []
    if x == "-":
        temp.append(x - 1)
    g.append(temp)
df['F'] = g
print(df)
Replace the values in all columns with missing values via DataFrame.where, keeping only the cells whose neighbour to the right equals '-' (compared via DataFrame.shift along the columns); then back-fill the missing values and select the first column by position:
df['F'] = df.where(df.shift(-1, axis=1).eq('-')).bfill(axis=1).iloc[:, 0]
print (df)
A B F
0 a - a
1 b a NaN
2 c - c
3 d b NaN
4 e d NaN
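To see what this one-liner does step by step, the intermediate mask can be inspected (a sketch using the same toy frame):

```python
import pandas as pd

df = pd.DataFrame({'A': {0: 'a', 1: 'b', 2: 'c', 3: 'd', 4: 'e'},
                   'B': {0: '-', 1: 'a', 2: '-', 3: 'b', 4: 'd'}})

# Shifting one step to the left along the columns puts each cell's
# right-hand neighbour underneath it, so the mask marks cells whose
# *next* column holds '-'.
mask = df.shift(-1, axis=1).eq('-')
print(mask)

# where() keeps only those preceding cells; bfill pulls the first kept
# value into the leftmost position, which iloc[:, 0] then selects.
df['F'] = df.where(mask).bfill(axis=1).iloc[:, 0]
print(df['F'].tolist())  # ['a', nan, 'c', nan, nan]
```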
You can do:
df['F'] = [i[0][-1] if len(i) > 1 else np.nan for i in df.fillna('').sum(axis=1).str.split('-')]
output:
df['F']
Out[41]:
0 a
1 a
2 c
3 a
4 a
Name: F, dtype: object
List comprehension explanation:
fill the NaNs in df with '' and concatenate each row into a single string by summing across rows
split each concatenated string on '-'
if the split result has length > 1, a '-' was present: select its first element; otherwise no '-' is present, so fill with np.nan
take the last character of that first element using [-1] (the value immediately preceding the '-')
I'm a bit confused with data orientation when creating a Multiindexed DataFrame from a DataFrame.
I import data with read_excel() and I begin with something like:
import pandas as pd
df = pd.DataFrame([['A', 'B', 'A', 'B'], [1, 2, 3, 4]],
                  columns=['k', 'k', 'm', 'm'])
df
Out[3]:
k k m m
0 A B A B
1 1 2 3 4
I want to multiindex this and to obtain:
A B A B
k k m m
0 1 2 3 4
Mainly from Pandas' doc, I did:
arrays = df.iloc[0].tolist(), list(df)
tuples = list(zip(*arrays))
multiindex = pd.MultiIndex.from_tuples(tuples, names=['topLevel', 'downLevel'])
df = df.drop(0)
If I try
df2 = pd.DataFrame(df.values, index=multiindex)
(...)
ValueError: Shape of passed values is (4, 1), indices imply (4, 4)
I then have to transpose the values:
df2 = pd.DataFrame(df.values.T, index=multiindex)
df2
Out[11]:
0
topLevel downLevel
A k 1
B k 2
A m 3
B m 4
Lastly, I re-transpose this dataframe to obtain:
df2.T
Out[12]:
topLevel A B A B
downLevel k k m m
0 1 2 3 4
OK, this is what I want, but I don't understand why I have to transpose 2 times. It seems useless.
You can create the MultiIndex yourself, and then drop the row. From your starting df:
import pandas as pd
df.columns = pd.MultiIndex.from_arrays([df.iloc[0], df.columns], names=[None]*2)
df = df.iloc[1:].reset_index(drop=True)
A B A B
k k m m
0 1 2 3 4
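As a side note, the double transpose in the question can be avoided entirely: the index= parameter labels rows, and the MultiIndex here describes columns, so it can be passed as columns= instead (a sketch reusing the question's variables):

```python
import pandas as pd

df = pd.DataFrame([['A', 'B', 'A', 'B'], [1, 2, 3, 4]],
                  columns=['k', 'k', 'm', 'm'])

# Build the MultiIndex exactly as in the question.
arrays = df.iloc[0].tolist(), list(df)
tuples = list(zip(*arrays))
multiindex = pd.MultiIndex.from_tuples(tuples, names=['topLevel', 'downLevel'])
df = df.drop(0)

# Pass the MultiIndex as columns= (not index=), so no transposes are needed.
df2 = pd.DataFrame(df.values, columns=multiindex)
print(df2)
```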
I'm using Python pandas, and I want to stack multiple columns that share the same index into a single column. Where possible, I also want to delete the zero values.
I have this data frame
index A B C
a 8 0 1
b 2 3 0
c 0 4 0
d 3 2 7
I'd like my output to look like this
index data value
a A 8
b A 2
d A 3
b B 3
c B 4
d B 2
a C 1
d C 7
===
I solved this task as below. My original data has 2 indexes, and the zeros in the dataframe were NaN values.
At first I tried to apply the melt function while removing NaN values, following this (How to melt a dataframe in Pandas with the option for removing NA values), but I couldn't, because my original data has several columns ('value_vars'). So I reorganized the dataframe in 2 steps:
Firstly, I collapsed the multiple columns into one column with the melt function,
Then I removed the NaN values in each row with the dropna function.
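The two steps described above might be sketched like this, on the single-index example from the question (the zeros stand in for the NaN values of the original data; the column names are taken from the desired output):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [8, 2, 0, 3],
                   'B': [0, 3, 4, 2],
                   'C': [1, 0, 0, 7]},
                  index=['a', 'b', 'c', 'd'])

# Step 1: treat zeros as missing, then melt the columns into one
# 'data'/'value' pair per row, keeping the index as an ordinary column.
melted = df.replace(0, np.nan).reset_index().melt(
    id_vars='index', var_name='data', value_name='value')

# Step 2: drop the rows that were 0 (now NaN).
melted = melted.dropna(subset=['value']).reset_index(drop=True)
print(melted)
```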
This looks a little like the melt function in pandas, with the only difference being the index.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.melt.html
Here is some code you can run to test:
import pandas as pd
df = pd.DataFrame({'A': {0: 'a', 1: 'b', 2: 'c'},'B': {0: 1, 1: 3, 2: 5},'C': {0: 2, 1: 4, 2: 6}})
pd.melt(df)
With a little manipulation, you could solve for the indexing issue.
This is not particularly pythonic, but if you have a limited number of columns, you could make do with:
molten = pd.melt(df)
a = molten.merge(df, left_on='value', right_on='A')
b = molten.merge(df, left_on='value', right_on='B')
c = molten.merge(df, left_on='value', right_on='C')
merged = pd.concat([a, b, c])
try this:
a = [['a', 8, 0, 1], ['b', 2, 3, 0] ... ]
cols = ['A', 'B', 'C']
result = [[[a[i][0], cols[j], a[i][j + 1]] for i in range(len(a))] for j in range(len(cols))]
output:
[[['a', 'A', 8], ['b', 'A', 2]], [['a', 'B', 0], ['b', 'B', 3]] ... ]
I know that by using set_index I can convert an existing column into a dataframe index, but is there a way to specify, directly in the DataFrame constructor, that one of the data columns should be used as the index (instead of turning it into a column)?
Right now I initialize a DataFrame from data records, then use set_index to turn the columns into an index.
DataFrame([{'a':1,'b':1,"c":2,'d':1},{'a':1,'b':2,"c":2,'d':2}], index= ['a', 'b'], columns=('c', 'd'))
I want:
     c  d
a b
1 1  2  1
  2  2  2
Instead I get:
c d
a 2 1
b 2 2
You can use MultiIndex.from_tuples:
d = [{'a': 1, 'b': 1, 'c': 2, 'd': 1}, {'a': 1, 'b': 2, 'c': 2, 'd': 2}]
print(pd.MultiIndex.from_tuples([(x['a'], x['b']) for x in d], names=('a', 'b')))
MultiIndex(levels=[[1], [1, 2]],
           labels=[[0, 0], [0, 1]],
           names=['a', 'b'])
df = pd.DataFrame(d,
                  index=pd.MultiIndex.from_tuples([(x['a'], x['b']) for x in d],
                                                  names=('a', 'b')),
                  columns=('c', 'd'))
print (df)
c d
a b
1 1 2 1
2 2 2
You can just chain set_index onto the constructor call without specifying the index and columns params:
In [19]:
df=pd.DataFrame([{'a':1,'b':1,"c":2,'d':1},{'a':1,'b':2,"c":2,'d':2}]).set_index(['a','b'])
df
Out[19]:
c d
a b
1 1 2 1
2 2 2