Python alternative to R mutate

I want to convert R code into Python. The code in R is
df %>% mutate(N = if_else(Interval != lead(Interval) | row_number() == n(), criteria/Count, NA_real_))
In Python I wrote the following:
import pandas as pd
import numpy as np
df = pd.read_table('Fd.csv', sep=',')
for i in range(1,len(df.Interval)-1):
    x = df.Interval[i]
    n = df.Interval[i+1]
    if x != n | x==df.Interval.tail().all():
        df['new']=(df.criteria/df.Count)
    else:
        df['new']='NaN'
df.to_csv (r'dataframe.csv', index = False, header=True)
However, the output returns all NaNs.
Here is what the data looks like
Interval  Count  criteria
0         0      0
0         1      0
0         2      0
0         3      0
1         4      1
1         5      2
1         6      3
1         7      4
2         8      1
2         9      2
3         10     3
and this is what I want to get (I also need to consider the last line):
Interval  Count  criteria  new
0         0      0
0         1      0
0         2      0
0         3      0         0
1         4      1
1         5      2
1         6      3
1         7      4         0.5714
2         8      1
2         9      2         0.2222
3         10     3         0.3333
If anyone could help me find my mistake, I would greatly appreciate it.

1. Start indexing at 0
The first thing to note is that Python starts indexing at 0 (in contrast to R, which starts at 1), and range excludes its end point, so range(1, len(df.Interval)-1) skips both the first and the last row. You therefore need to modify the index range of your for-loop.
2. Specify row indices
When calling
df['new']=(df.criteria/df.Count)
or
df['new']='NaN'
you are setting/getting all the values in the "new" column. However, you intend to set the value only in some rows. Therefore, you need to specify the row.
3. Working example
import pandas as pd
import numpy as np

df = pd.DataFrame()
df["Interval"] = [0,0,0,0,1,1,1,1,2,2,3]
df["Count"] = [0,1,2,3,4,5,6,7,8,9,10]
df["criteria"] = [0,0,0,0,1,2,3,4,1,2,3]
df["new"] = np.nan  # start with a real missing value rather than the string 'NaN'
last_row = len(df.Interval) - 1
for row in range(0, len(df.Interval)):
    current_value = df.Interval[row]
    next_value = df.Interval[min(row + 1, last_row)]
    # fill "new" only where the Interval changes or on the last row
    if (current_value != next_value) or (row == last_row):
        result = df.loc[row, 'criteria'] / df.loc[row, 'Count']
        df.loc[row, 'new'] = result
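For reference, the same dplyr logic can also be written without an explicit loop. This is only a sketch assuming the same column names; Series.shift(-1) plays the role of lead(), np.where the role of if_else(), and is_group_end is just a helper name introduced here:
import pandas as pd
import numpy as np

df = pd.DataFrame({"Interval": [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 3],
                   "Count":    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   "criteria": [0, 0, 0, 0, 1, 2, 3, 4, 1, 2, 3]})

# True where the Interval changes on the next row, or on the last row
is_group_end = df["Interval"].ne(df["Interval"].shift(-1)) | (df.index == df.index[-1])
# ratio where the condition holds, NaN everywhere else (the NA_real_ branch in R)
df["new"] = np.where(is_group_end, df["criteria"] / df["Count"], np.nan)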

Related

pandas: replace values in column with the last character in the column name

I have a dataframe as follows:
import pandas as pd
df = pd.DataFrame({'sent.1':[0,1,0,1],
                   'sent.2':[0,1,1,0],
                   'sent.3':[0,0,0,1],
                   'sent.4':[1,1,0,1]
                  })
I am trying to replace the non-zero values with the 5th character in the column names (which is the numeric part of the column names), so the output should be,
sent.1 sent.2 sent.3 sent.4
0 0 0 0 4
1 1 2 0 4
2 0 2 0 0
3 1 0 3 4
I have tried the following but it does not work,
print(df.replace(1, pd.Series([i[5] for i in df.columns], [i[5] for i in df.columns])))
However, the following works when I replace the character with the full column name, so I am not sure which part is wrong.
print(df.replace(1, pd.Series(df.columns, df.columns)))
Since you're dealing with 1's and 0's, you can actually just multiply the dataframe by a range:
df = df * range(1, df.shape[1] + 1)
Output:
sent.1 sent.2 sent.3 sent.4
0 0 0 0 4
1 1 2 0 4
2 0 2 0 0
3 1 0 3 4
Or, if you want to take the numbers from the column names:
df = df * df.columns.str.split('.').str[-1].astype(int)
You could use string multiplication on a boolean mask to place the strings where the condition holds, and where to restore the zeros:
mask = df.ne(0)
(mask*df.columns.str[5]).where(mask, 0)
To have integers:
mask = df.ne(0)
(mask*df.columns.str[5].astype(int))
output:
sent.1 sent.2 sent.3 sent.4
0 0 0 0 4
1 1 2 0 4
2 0 2 0 0
3 1 0 3 4
And another one, working with an arbitrary condition (here s.ne(0)):
df.apply(lambda s: s.mask(s.ne(0), s.name.rpartition('.')[-1]))

Count how many initial elements in Pandas Series equal to a certain value?

As in the question. I know how to compute it, but is there a better/faster/more elegant way to do this?
cnt is the result.
s = pd.Series( np.random.randint(2, size=10) )
cnt = 0
for n in s:
    if n != 0:
        break
    else:
        cnt += 1
        continue
Use Series.eq to create a boolean mask, then use Series.cummin to take a cumulative minimum over this series, and finally use Series.sum to get the total count:
cnt = s.eq(0).cummin().sum()
Example:
np.random.seed(9)
s = pd.Series(np.random.randint(2, size=10))
# print(s)
0 0
1 0
2 0
3 1
4 0
5 0
6 1
7 0
8 1
9 1
dtype: int64
cnt = s.eq(0).cummin().sum()
#print(cnt)
3
I have done this in a DataFrame as it is easier to reproduce, but you can use the vectorized .cumsum to speed up your code, together with .loc to select the rows where the cumulative sum is still 0. Then just find the length with len:
import pandas as pd, numpy as np
s = pd.DataFrame(pd.Series(np.random.randint(2, size=10)))
s['t'] = s[0].cumsum()
o = len(s.loc[s['t']==0])
o
If you assign o to a column with s['o'] = o, then the output looks like this:
0 t o
0 0 0 2
1 0 0 2
2 1 1 2
3 1 2 2
4 0 2 2
5 1 3 2
6 1 4 2
7 1 5 2
8 1 6 2
9 0 6 2
You can use cumsum() in a mask and then sum() to get the number of initial 0s in the sequence:
s = pd.Series(np.random.randint(2, size=10))
(s.cumsum() == 0).sum()
Note that this method only works if you want to count 0s. If you want to count the initial run of whatever value the series starts with, you can generalize it, i.e.:
(s.sub(s[0]).cumsum() == 0).sum()
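As a quick check of the generalized version on a small hand-made 0/1 series (this example is not from the original post):
import pandas as pd

s = pd.Series([1, 1, 1, 0, 1])  # starts with a run of three 1s
initial_run = (s.sub(s[0]).cumsum() == 0).sum()
print(initial_run)  # 3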

Python Selecting and Adding row values of columns in dataframe to create an aggregated dataframe

I need to process my dataframe in Python such that I add the numeric values of numeric columns that lie between 2 rows of the dataframe.
The dataframe can be created using
df = pd.DataFrame(np.array([['a',0,1,0,0,0,0,'i'],
                            ['b',1,0,0,0,0,0,'j'],
                            ['c',0,0,1,0,0,0,'k'],
                            ['None',0,0,0,1,0,0,'l'],
                            ['e',0,0,0,0,1,0,'m'],
                            ['f',0,1,0,0,0,0,'n'],
                            ['None',0,0,0,1,0,0,'o'],
                            ['h',0,0,0,0,1,0,'p']]),
                  columns=[0,1,2,3,4,5,6,7],
                  index=[0,1,2,3,4,5,6,7])
I need to add up all the rows that occur before each 'None' entry and move the aggregated rows to a new dataframe.
Your dataframe's dtypes are messed up because you assigned the values from a single NumPy array; an array can only hold one dtype, so all the integers were pushed to strings. We need to convert them back first:
df = df.apply(pd.to_numeric, errors='ignore')  # convert the numeric columns back to numbers
df['newkey'] = df[0].eq('None').cumsum()  # cumsum over the 'None' rows creates the group key
df.loc[df[0].ne('None'), :].groupby('newkey').agg(lambda x: x.sum() if x.dtype == 'int64' else x.head(1))  # then we agg
Out[742]:
0 1 2 3 4 5 6 7
newkey
0 a 1 1 1 0 0 0 i
1 e 0 1 0 0 1 0 m
2 h 0 0 0 0 1 0 p
You can also specify the agg funcs
s = lambda s: sum(int(k) for k in s)
d = {i: s for i in range(8)}
d.update({0: 'first', 7: 'first'})
df.groupby((df[0] == 'None').cumsum().shift().fillna(0)).agg(d)
0 1 2 3 4 5 6 7
0
0.0 a 1 1 1 1 0 0 i
1.0 e 0 1 0 1 1 0 m
2.0 h 0 0 0 0 1 0 p

Append count of rows meeting a condition within a group to Pandas dataframe

I know how to append a column counting the number of elements in a group, but I need to do so just for the number within that group that meets a certain condition.
For example, if I have the following data:
import numpy as np
import pandas as pd
columns=['group1', 'value1']
data = np.array([np.arange(5)]*2).T
mydf = pd.DataFrame(data, columns=columns)
mydf.group1 = [0,0,1,1,2]
mydf.value1 = ['P','F',100,10,0]
valueslist={'50','51','52','53','54','55','56','57','58','59','60','61','62','63','64','65','66','67','68','69','70','71','72','73','74','75','76','77','78','79','80','81','82','83','84','85','86','87','88','89','90','91','92','93','94','95','96','97','98','99','100','A','B','C','D','P','S'}
and my dataframe therefore looks like this:
mydf
group1 value1
0 0 P
1 0 F
2 1 100
3 1 10
4 2 0
I would then want to count the number of rows within each group1 value where value1 is in valueslist.
My desired output is:
group1 value1 count
0 0 P 1
1 0 F 1
2 1 100 1
3 1 10 1
4 2 0 0
After changing the type of the value1 column to match your valueslist (or the other way around), you can use isin to get a True/False column, and convert that to 1s and 0s with astype(int). Then we can apply an ordinary groupby transform:
In [13]: mydf["value1"] = mydf["value1"].astype(str)
In [14]: mydf["count"] = (mydf["value1"].isin(valueslist).astype(int)
                          .groupby(mydf["group1"]).transform(sum))
In [15]: mydf
Out[15]:
group1 value1 count
0 0 P 1
1 0 F 1
2 1 100 1
3 1 10 1
4 2 0 0
mydf.value1=mydf.value1.astype(str)
mydf['count']=mydf.group1.map(mydf.groupby('group1').apply(lambda x : sum(x.value1.isin(valueslist))))
mydf
Out[412]:
group1 value1 count
0 0 P 1
1 0 F 1
2 1 100 1
3 1 10 1
4 2 0 0
Data input:
valueslist=['50','51','52','53','54','55','56','57','58','59','60','61','62','63','64','65','66','67','68','69','70','71','72','73','74','75','76','77','78','79','80','81','82','83','84','85','86','87','88','89','90','91','92','93','94','95','96','97','98','99','100','A','B','C','D','P','S']
You can group by group1 and then use transform to sum up whether the values in each group are in the list, which gives the per-group count.
mydf['count'] = mydf.groupby('group1').transform(lambda x: x.astype(str).isin(valueslist).sum())
group1 value1 count
0 0 P 1
1 0 F 1
2 1 100 1
3 1 10 1
4 2 0 0
Here is one way to do it, albeit as a one-liner:
mydf.merge(mydf.groupby('group1').apply(lambda x: len(set(x['value1'].values).intersection(valueslist))).reset_index().rename(columns={0: 'count'}), how='inner', on='group1')
group1 value1 count
0 0 P 1
1 0 F 1
2 1 100 1
3 1 10 1
4 2 0 0
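The same one-liner can be split into steps for readability; counts is just a hypothetical name for the intermediate result, and, like the one-liner, this counts distinct matching values per group (it assumes value1 has already been cast to str, as in the earlier answers, so that e.g. 100 matches '100'):
# per group, count how many distinct value1 entries fall inside valueslist
counts = (mydf.groupby('group1')
              .apply(lambda x: len(set(x['value1'].values).intersection(valueslist)))
              .reset_index()
              .rename(columns={0: 'count'}))

mydf = mydf.merge(counts, how='inner', on='group1')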

Generating new columns as a full-combination of other columns

Could not find similar cases here.
Suppose I have a DataFrame
df = pd.DataFrame({'A':[2,2,1,2],
                   'B':[2,2,3,3],
                   'C':[3,3,3,4],
                   'I':[1,0,0,1],
                   'II':[0,1,0,1]})
So it is:
A B C I II
0 2 2 3 1 0
1 2 2 3 0 1
2 1 3 3 0 0
3 2 3 4 1 1
I want to make a full pairwise combination between {A,B,C} and {I,II}, so I get {I-A,I-B,I-C,II-A,II-B,II-C}.
Each new column is just an elementwise multiplication of the corresponding base columns:
I-A I-B I-C II-A II-B II-C
0 2 2 3 0 0 0
1 0 0 0 2 2 3
2 0 0 0 0 0 0
3 2 3 4 2 3 4
At the moment I don't have any working solution. I'm trying to use loops (without success so far), but I hope there's a more efficient way.
It's pretty simple, really. You have two sets of columns that you want to combine pairwise. I won't even bother with permutation tools:
>>> new_df = pd.DataFrame()
>>>
>>> for i in ["I", "II"]:
for a in ["A", "B", "C"]:
new_df[i+"-"+a] = df[i] * df[a]
>>> new_df
I-A I-B I-C II-A II-B II-C
0 2 2 3 0 0 0
1 0 0 0 2 2 3
2 0 0 0 0 0 0
3 2 3 4 2 3 4
Of course you could obtain the lists of column names as slices off df.columns, or in whatever other way is convenient. E.g. for your example dataframe you could write
>>> for i in df.columns[3:]:
...     for a in df.columns[:3]:
...         new_df[i+"-"+a] = df[i] * df[a]
If you want to use loops, you can use this code. It's definitely not the most elegant solution, but it should work for your purpose. It only requires that you specify the columns you'd like to use for the pairwise multiplication. It is quite readable, though, which may be something you want.
def element_wise_mult(first, second):
    result = []
    for i, el in enumerate(first):
        result.append(el * second[i])
    return result

if __name__ == '__main__':
    import pandas as pd

    df = pd.DataFrame({'A':[2,2,1,2],
                       'B':[2,2,3,3],
                       'C':[3,3,3,4],
                       'I':[1,0,0,1],
                       'II':[0,1,0,1]})
    fs = ['I', 'II']
    sc = ['A', 'B', 'C']
    series = []
    names = []
    for i in fs:
        for j in sc:
            names.append(i + '-' + j)
            series.append(pd.Series(element_wise_mult(df[i], df[j])))  # each product is stored as a pandas Series
    print(pd.DataFrame(series, index=names).T)  # reconstruct the dataframe from the stored series and names
Returns:
I-A I-B I-C II-A II-B II-C
0 2 2 3 0 0 0
1 0 0 0 2 2 3
2 0 0 0 0 0 0
3 2 3 4 2 3 4
Here is a solution without for loops for your specific example:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':[2,2,1,2],
                   'B':[2,2,3,3],
                   'C':[3,3,3,4],
                   'I':[1,0,0,1],
                   'II':[0,1,0,1]})
cross_vals=np.tile(df[df.columns[:3]].values,(1,2))*np.repeat(df[df.columns[3:]].values,3,axis=1)
cros_cols=np.repeat(df.columns[3:].values,3)+np.array('-')+np.tile(df.columns[:3].values,(1,2))
new_df=pd.DataFrame(cross_vals,columns=cros_cols[0])
Then new_df is
I-A I-B I-C II-A II-B II-C
0 2 2 3 0 0 0
1 0 0 0 2 2 3
2 0 0 0 0 0 0
3 2 3 4 2 3 4
You could generalize it to any size as long as the columns A,B,C,... are consecutive and similarly the columns I,II,... are consecutive.
For the general case, if the columns are not necessarily consecutive, you can do the following:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':[2,2,1,2],
                   'B':[2,2,3,3],
                   'C':[3,3,3,4],
                   'I':[1,0,0,1],
                   'II':[0,1,0,1]})
let=np.array(['A','B','C'],dtype=object)
num=np.array(['I','II'],dtype=object)
cross_vals=np.tile(df[let].values,(1,len(num)))*np.repeat(df[num].values,len(let),axis=1)
cros_cols=np.repeat(num,len(let))+np.array('-')+np.tile(let,(1,len(num)))
new_df=pd.DataFrame(cross_vals,columns=cros_cols[0])
And the result is the same as above.
