I have an input data frame which looks like
0 1
0 0 10,30
1 1 10,40
2 2 20,50
Now I am trying to split the second column and store the result in a new column: if the value in column A is divisible by 2, take the first value from column B, otherwise the second, like below
A B C
0 0 10,30 10
1 1 10,40 10
2 2 20,50 50
My code:
import pandas as pd
import numpy as np
df = pd.DataFrame([(0, '10,30'), (1, '10,40'), (2, '20,50')])
df['n'] = np.where(df[0] % 2 == 0, df[0], 0)
df[2] = df[1].str.split(',').str[df['n'].fillna(0)]
print(df)
It's throwing an error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I believe you need to split the column into a DataFrame and use lookup on it, casting the boolean mask to int so that 0 selects the first split column and 1 selects the second:
df[2] = df[1].str.split(',', expand=True).lookup(df.index, (df[0] % 2 == 0).astype(int))
print (df)
0 1 2
0 0 10,30 30
1 1 10,40 10
2 2 20,50 50
print (df[0] % 2 == 0)
0 True
1 False
2 True
Name: 0, dtype: bool
#select second, first, second column
print ((df[0] % 2 == 0).astype(int))
0 1
1 0
2 1
Name: 0, dtype: int32
A similar solution with the condition inverted:
df[2] = df[1].str.split(',', expand=True).lookup(df.index, (df[0] % 2 != 0).astype(int))
print (df)
0 1 2
0 0 10,30 10
1 1 10,40 40
2 2 20,50 20
print (df[0] % 2 != 0)
0 False
1 True
2 False
Name: 0, dtype: bool
#select first, second, first column
print ((df[0] % 2 != 0).astype(int))
0 0
1 1
2 0
Name: 0, dtype: int32
print (df[1].str.split(',', expand=True))
0 1
0 10 30 <-first 10
1 10 40 <-second 40
2 20 50 <-first 20
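Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0. The same row-wise selection can be done with NumPy integer indexing; here is a sketch under that assumption:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([(0, '10,30'), (1, '10,40'), (2, '20,50')])
parts = df[1].str.split(',', expand=True)
# True -> 1 picks the second split value, False -> 0 picks the first
cols = (df[0] % 2 == 0).astype(int).to_numpy()
df[2] = parts.to_numpy()[np.arange(len(df)), cols]
print(df)
```

This reproduces the lookup result (30, 10, 50) without the removed API.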
I think you can also achieve this with the apply method.
First, let's put the split column 1 together with the target index into a new dataframe df1:
df1 = pd.concat({i:df[1].str.split(',').str.get(i) for i in range(2)}, axis=1)
df1['ind'] = df[0] % 2
df1
0 1 ind
0 10 30 0
1 10 40 1
2 20 50 0
Next, you can put the new values into column 2 with
df[2] = df1.apply(lambda p: p.loc[p["ind"]], axis=1)
df[2]
0 10
1 40
2 20
dtype: object
If you don't want to create a new data frame, you can also get the same result directly from df:
df[2] = df.apply(lambda p: p.loc[1].split(",")[p.loc[0] % 2], axis=1)
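For a small frame like this, a plain list comprehension over the two columns also works, taking the first split value when column 0 is even and the second when it is odd (a sketch):

```python
import pandas as pd

df = pd.DataFrame([(0, '10,30'), (1, '10,40'), (2, '20,50')])
# a % 2 is 0 for even values (first split part) and 1 for odd (second part)
df[2] = [v.split(',')[a % 2] for a, v in zip(df[0], df[1])]
print(df)
```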
Related
I have a dataframe as follows:
import pandas as pd
df = pd.DataFrame({'sent.1':[0,1,0,1],
'sent.2':[0,1,1,0],
'sent.3':[0,0,0,1],
'sent.4':[1,1,0,1]
})
I am trying to replace the non-zero values with the character at index 5 of the column name (which is the numeric part of the column name), so the output should be,
sent.1 sent.2 sent.3 sent.4
0 0 0 0 4
1 1 2 0 4
2 0 2 0 0
3 1 0 3 4
I have tried the following but it does not work,
print(df.replace(1, pd.Series([i[5] for i in df.columns], [i[5] for i in df.columns])))
However, when I replace the values with the column names themselves, the above code works, so I am not sure which part is wrong.
print(df.replace(1, pd.Series(df.columns, df.columns)))
Since you're dealing with 1's and 0's, you can actually just multiply the dataframe by a range:
df = df * range(1, df.shape[1] + 1)
Output:
sent.1 sent.2 sent.3 sent.4
0 0 0 0 4
1 1 2 0 4
2 0 2 0 0
3 1 0 3 4
Or, if you want to take the numbers from the column names:
df = df * df.columns.str.split('.').str[-1].astype(int)
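A runnable sketch of the multiplication approach, taking the numbers from the column names:

```python
import pandas as pd

df = pd.DataFrame({'sent.1': [0, 1, 0, 1],
                   'sent.2': [0, 1, 1, 0],
                   'sent.3': [0, 0, 0, 1],
                   'sent.4': [1, 1, 0, 1]})
# multiply each column by its numeric suffix; zeros stay zero
out = df * df.columns.str.split('.').str[-1].astype(int)
print(out)
```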
You could use string multiplication on the boolean mask to place the strings where the condition holds, and where to restore the zeros:
mask = df.ne(0)
(mask*df.columns.str[5]).where(mask, 0)
To have integers:
mask = df.ne(0)
(mask*df.columns.str[5].astype(int))
output:
sent.1 sent.2 sent.3 sent.4
0 0 0 0 4
1 1 2 0 4
2 0 2 0 0
3 1 0 3 4
And another one, working with an arbitrary condition (here s.ne(0)):
df.apply(lambda s: s.mask(s.ne(0), s.name.rpartition('.')[-1]))
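A runnable sketch of this apply/mask variant (the replacement values are strings here, since they come from the column names):

```python
import pandas as pd

df = pd.DataFrame({'sent.1': [0, 1, 0, 1],
                   'sent.2': [0, 1, 1, 0],
                   'sent.3': [0, 0, 0, 1],
                   'sent.4': [1, 1, 0, 1]})
# where a cell is non-zero, replace it with the numeric suffix of its column name
out = df.apply(lambda s: s.mask(s.ne(0), s.name.rpartition('.')[-1]))
print(out)
```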
I have a data frame:
df = pd.DataFrame([[0,4,0,0],
[1,5,1,0],
[2,6,0,0],
[3,7,1,0]], columns=['index', 'A', 'class', 'label'])
df:
   index  A  class  label
0      0  4      0      0
1      1  5      1      0
2      2  6      0      0
3      3  7      1      0
I want to change the label to 1 if the mean of column A for rows with class 0 is bigger than the mean of all the data in column A.
How can I do this in a few lines of code?
I tried this but didn't work:
if df[df['class'] == 0]['A'].mean() > df['A'].mean():
df[df['class']]['lable'] = 1
Use the following: groupby 'class' with pandas.DataFrame.groupby, take the mean of 'A' for each group, check whether it is greater than df['A'].mean(), then pandas.Series.map that boolean series onto df['class'], cast it astype(int), and assign it to df['label']:
>>> df['label'] = df['class'].map(
df.groupby('class')['A'].mean() > df['A'].mean()
).astype(int)
>>> df
index A class label
0 0 4 0 0
1 1 5 1 1
2 2 6 0 0
3 3 7 1 1
Since you are checking only for class == 0, you need to add another boolean mask on df['class']:
>>> df['label'] = (df['class'].map(
df.groupby('class')['A'].mean() > df['A'].mean()
) & (~df['class'].astype(bool))
).astype(int)
index A class label
0 0 4 0 0
1 1 5 1 0 # 0 because class != 0 is masked out
2 2 6 0 0
3 3 7 1 0 # 0 because class != 0 is masked out
So even if your code had worked, you would not notice, because for this data the condition is not fulfilled: the class-0 mean (4+6)/2 = 5.0 is not greater than the overall mean (4+5+6+7)/4 = 5.5.
If I understand correctly: if the condition you mentioned is fulfilled, then the labels of all rows change to 1, right? In that case what you did is correct, but you missed something; the code should look like this (note the parentheses on .mean()):
if df[df['class'] == 0]['A'].mean() > df['A'].mean():
    df['label'] = 1
This should work.
What you did does not work because df[df['class']] uses the values of the 'class' column as column labels rather than as a row filter, so the 'label' column you want to modify is never reached (there is also a typo, 'lable' instead of 'label').
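A runnable version of this conditional approach (for this data the condition is False, so the labels stay 0):

```python
import pandas as pd

df = pd.DataFrame([[0, 4, 0, 0],
                   [1, 5, 1, 0],
                   [2, 6, 0, 0],
                   [3, 7, 1, 0]], columns=['index', 'A', 'class', 'label'])
class0_mean = df.loc[df['class'] == 0, 'A'].mean()  # (4 + 6) / 2 = 5.0
overall_mean = df['A'].mean()                       # (4 + 5 + 6 + 7) / 4 = 5.5
if class0_mean > overall_mean:
    df['label'] = 1
print(df)  # label stays 0 because 5.0 > 5.5 is False
```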
I need to process my dataframe in Python such that I add the numeric values of numeric columns that lie between 2 rows of the dataframe.
The dataframe can be created using
df = pd.DataFrame(np.array([['a',0,1,0,0,0,0,'i'],
['b',1,0,0,0,0,0,'j'],
['c',0,0,1,0,0,0,'k'],
['None',0,0,0,1,0,0,'l'],
['e',0,0,0,0,1,0,'m'],
['f',0,1,0,0,0,0,'n'],
['None',0,0,0,1,0,0,'o'],
['h',0,0,0,0,1,0,'p']]),
columns=[0,1,2,3,4,5,6,7],
index=[0,1,2,3,4,5,6,7])
I need to add up all the rows that occur before each 'None' entry and move the aggregated rows to a new dataframe.
Your data frame's dtypes are messed up because you built it from a single NumPy array: an array accepts only one type, so all the ints were pushed to strings. We need to convert them back first:
df = df.apply(pd.to_numeric, errors='ignore')  # convert numeric columns back to int
df['newkey'] = df[0].eq('None').cumsum()  # cumsum over the 'None' rows creates the group key
df.loc[df[0].ne('None'), :].groupby('newkey').agg(lambda x: x.sum() if x.dtype == 'int64' else x.head(1))  # then aggregate
Out[742]:
0 1 2 3 4 5 6 7
newkey
0 a 1 1 1 0 0 0 i
1 e 0 1 0 0 1 0 m
2 h 0 0 0 0 1 0 p
You can also specify the agg funcs per column:
s = lambda s: sum(int(k) for k in s)
d = {i: s for i in range(8)}
d.update({0: 'first', 7: 'first'})
df.groupby((df[0] == 'None').cumsum().shift().fillna(0)).agg(d)
0 1 2 3 4 5 6 7
0
0.0 a 1 1 1 1 0 0 i
1.0 e 0 1 0 1 1 0 m
2.0 h 0 0 0 0 1 0 p
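A runnable sketch of the grouping idea that sidesteps errors='ignore' (deprecated in recent pandas) by casting the numeric columns explicitly:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([['a',0,1,0,0,0,0,'i'],
                            ['b',1,0,0,0,0,0,'j'],
                            ['c',0,0,1,0,0,0,'k'],
                            ['None',0,0,0,1,0,0,'l'],
                            ['e',0,0,0,0,1,0,'m'],
                            ['f',0,1,0,0,0,0,'n'],
                            ['None',0,0,0,1,0,0,'o'],
                            ['h',0,0,0,0,1,0,'p']]))
num_cols = [1, 2, 3, 4, 5, 6]
df[num_cols] = df[num_cols].astype(int)  # the single array made everything strings
df['key'] = df[0].eq('None').cumsum()    # key increments at each 'None' row
out = (df[df[0].ne('None')]
       .groupby('key')
       .agg(lambda x: x.sum() if x.dtype.kind == 'i' else x.iloc[0]))
print(out)
```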
I have two dataframes df1 and df2:
df1 is shown here:
age
0 42
1 52
2 36
3 24
4 73
df2 is shown here:
age
0 0
1 0
2 1
3 0
4 0
I want to replace all the zeros in df2 with their corresponding entries in df1. In more technical words, if the element at a certain index in df2 is zero, then I would want this element to be replaced by the corresponding entry in df1.
Hence, I want df2 to look like:
age
0 42
1 52
2 1
3 24
4 73
I tried using the replace method but it is not working. Please help :)
Thanks in advance.
You could use where:
In [19]: df2.where(df2 != 0, df1)
Out[19]:
age
0 42
1 52
2 1
3 24
4 73
Above, df2 != 0 is a boolean DataFrame.
In [16]: df2 != 0
Out[16]:
age
0 False
1 False
2 True
3 False
4 False
df2.where(df2 != 0, df1) returns a new DataFrame. Where df2 != 0 is True, the corresponding value of df2 is used. Where it is False, the corresponding value of df1 is used.
Another alternative is to make an assignment with df.loc:
df2.loc[df2['age'] == 0, 'age'] = df1['age']
df.loc[mask, col] selects rows of df where the boolean Series, mask is True, and where the column label is col.
In [17]: df2.loc[df2['age'] == 0, 'age']
Out[17]:
0 0
1 0
3 0
4 0
Name: age, dtype: int64
When used in an assignment, such as df2.loc[df2['age'] == 0, 'age'] = df1['age'],
Pandas performs automatic index label alignment. (Notice the index labels above are 0,1,3,4 -- with 2 being skipped). So the values in df2.loc[df2['age'] == 0, 'age'] are replaced by the corresponding values from df1['age']. Even though df1['age'] is a Series with index labels 0,1,2,3, and 4, the 2 is ignored because there is no corresponding index label on the left-hand side.
In other words,
df2.loc[df2['age'] == 0, 'age'] = df1.loc[df2['age'] == 0, 'age']
would work as well, but the added restriction on the right-hand side is unnecessary.
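A runnable sketch of the loc assignment:

```python
import pandas as pd

df1 = pd.DataFrame({'age': [42, 52, 36, 24, 73]})
df2 = pd.DataFrame({'age': [0, 0, 1, 0, 0]})
# index alignment fills only the rows where df2['age'] == 0
df2.loc[df2['age'] == 0, 'age'] = df1['age']
print(df2)
```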
In [30]: df2.mask(df2==0).combine_first(df1)
Out[30]:
age
0 42.0
1 52.0
2 1.0
3 24.0
4 73.0
Or, "negating" @unutbu's beautiful solution:
In [46]: df2.mask(df2==0, df1)
Out[46]:
age
0 42
1 52
2 1
3 24
4 73
Or try mul (note this trick relies on df2 containing only 0s and 1s, and on df1 having no zeros of its own):
df1.mul(np.where(df2==1,0,1)).replace({0:1})
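A runnable check of the mask variant, which keeps the original integer dtype since no NaNs are introduced:

```python
import pandas as pd

df1 = pd.DataFrame({'age': [42, 52, 36, 24, 73]})
df2 = pd.DataFrame({'age': [0, 0, 1, 0, 0]})
out = df2.mask(df2 == 0, df1)  # take df1's value wherever df2 is 0
print(out)
```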
I have to create a result pandas dataframe from a source pandas dataframe with two columns. The result dataframe should have headers of two types: one type is derived from one of the source columns by appending its values to the column header; the other is taken as-is from the source dataframe, with duplicates removed. Each cell in the result should be 1 if the corresponding derived header value occurred for that row's key, and 0 otherwise.
The dataframes are as below
dfm = pd.DataFrame({'v' : [44,39,39,8,40,8,15,15],'x':[1,1,1,1,2,2,2,2]})
dfm
v x
44 1
39 1
39 1
8 1
40 2
8 2
15 2
15 2
result
x v_8 v_15 v_39 v_40 v_44
1 1 0 1 0 1
2 1 1 0 1 0
You can use the crosstab function, then find values greater than 0 and convert the boolean result to 1 and 0 with astype:
dfm = pd.DataFrame({'v' : [44,39,39,8,40,8,15,15],'x':[1,1,1,1,2,2,2,2]})
print(dfm)
v x
0 44 1
1 39 1
2 39 1
3 8 1
4 40 2
5 8 2
6 15 2
7 15 2
df = pd.crosstab(dfm.x, dfm.v)
#rename columns
df.rename(columns=lambda x: 'v_' + str(x), inplace=True)
#reset column name v to ''
df.columns.name = ''
print(df)
v_8 v_15 v_39 v_40 v_44
x
1 1 0 2 0 1
2 1 2 0 1 0
print (df > 0)
v_8 v_15 v_39 v_40 v_44
x
1 True False True False True
2 True True False True False
print((df > 0).astype(int))
v_8 v_15 v_39 v_40 v_44
x
1 1 0 1 0 1
2 1 1 0 1 0
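In current pandas, the same answer can be written in one chain, using clip to cap the counts at 1 and add_prefix for the column names (a sketch):

```python
import pandas as pd

dfm = pd.DataFrame({'v': [44, 39, 39, 8, 40, 8, 15, 15],
                    'x': [1, 1, 1, 1, 2, 2, 2, 2]})
# crosstab counts occurrences; clip turns any count >= 1 into 1
out = pd.crosstab(dfm.x, dfm.v).clip(upper=1).add_prefix('v_')
out.columns.name = None
print(out)
```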
You could use something like this. I'm not sure it's the best solution, so I'm interested to see what others post...
import pandas as pd
dfm = pd.DataFrame({'v' : [44,39,39,8,40,8,15,15],'x':[1,1,1,1,2,2,2,2]}).sort_values(by="v")
col_heads = ["v_{}".format(x) for x in dfm.v.unique()]
row_index = dfm.x.unique()
nf = pd.DataFrame(index=row_index, columns=col_heads).fillna(0)
nf.index.name = 'x'
a = dfm.groupby("v")
for i,name in a:
for val in name.x:
nf.loc[val, "v_{}".format(i)] = 1
print(nf)
This outputs the following:
v_8 v_15 v_39 v_40 v_44
x
1 1 0 1 0 1
2 1 1 0 1 0
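Another option, assuming only presence/absence matters, is get_dummies followed by a groupby max (a sketch):

```python
import pandas as pd

dfm = pd.DataFrame({'v': [44, 39, 39, 8, 40, 8, 15, 15],
                    'x': [1, 1, 1, 1, 2, 2, 2, 2]})
# one-hot encode v, then collapse duplicate x rows with max (presence flag)
out = (pd.get_dummies(dfm.set_index('x')['v'], prefix='v')
       .groupby(level=0)
       .max()
       .astype(int))
print(out)
```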