Conditional statement and split in a DataFrame - Python

I am looking for a conditional statement in Python that looks for certain information in a specified column and puts the result in a new column.
Here is an example of my dataset:
OBJECTID  CODE_LITH
1         M4,BO
2         M4,BO
3         M4,BO
4         M1,HP-M7,HP-M1
and what I want as results:
OBJECTID  CODE_LITH       M4  M1
1         M4,BO           1   0
2         M4,BO           1   0
3         M4,BO           1   0
4         M1,HP-M7,HP-M1  0   1
What I have done so far:
import pandas as pd
import numpy as np
lookup = ['M4']
df.loc[df['CODE_LITH'].isin(lookup), 'M4'] = 1
df.loc[~df['CODE_LITH'].isin(lookup), 'M4'] = 0
Since there are multiple values per row in "CODE_LITH", the script cannot match just "M4"; it only matches the full cell value such as "M4,BO", so the new column does not get the 1/0 flags I expect.
I have also tried:
if 'M4' in df['CODE_LITH']:
    df['M4'] = 0
else:
    df['M4'] = 1
With the same results.
Thanks for your help.
PS: The dataframe contains about 2.6 million rows and I need to do this operation for 30-50 variables.

I think this is the Pythonic way to do it:
for mn in ['M1', 'M4']:  # add the other "M#" values as needed
    df[mn] = df['CODE_LITH'].map(lambda x: mn in x)  # note: this stores booleans; wrap in int(...) for 0/1

Use the str.contains method of the .str accessor:
>>> for key in ('M4', 'M1'):
...     df.loc[:, key] = df['CODE_LITH'].str.contains(key).astype(int)
...
>>> df
   OBJECTID       CODE_LITH  M4  M1
0         1           M4,BO   1   0
1         2           M4,BO   1   0
2         3           M4,BO   1   0
3         4  M1,HP-M7,HP-M1   0   1
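As an aside, str.contains interprets its pattern as a regular expression by default; codes that contain regex metacharacters would need regex=False (or re.escape) to be matched literally. A sketch of the same loop with literal matching:
for key in ('M4', 'M1'):
    # regex=False makes the search a plain substring test
    df[key] = df['CODE_LITH'].str.contains(key, regex=False).astype(int)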

I was able to do:
for index, data in enumerate(df['CODE_LITH']):
    if "I1" in data:
        df['Plut_Felsic'][index] = 1
    else:
        df['Plut_Felsic'][index] = 0
It does work, but takes quite some time to calculate.
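A hedged sketch of the same Plut_Felsic flag computed without the Python-level loop or the chained assignment, using the str.contains approach from the earlier answer (column names taken from the post):
# Vectorized substring test; regex=False because 'I1' is a literal code, not a pattern.
df['Plut_Felsic'] = df['CODE_LITH'].str.contains('I1', regex=False).astype(int)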


Use pandas to group by column and then create a new column based on a condition

I need to reproduce with pandas what SQL does so easily:
select
    del_month
    , sum(case when off0_on1 = 1 then 1 else 0 end) as on1
    , sum(case when off0_on1 = 0 then 1 else 0 end) as off0
from a1
group by del_month
order by del_month
Here is a sample, illustrative pandas dataframe to work on:
a1 = pd.DataFrame({'del_month':[1,1,1,1,2,2,2,2], 'off0_on1':[0,0,1,1,0,1,1,1]})
Here are my attempts to reproduce the above SQL with pandas. The first line works. The second line gives an error:
a1['on1'] = a1.groupby('del_month')['off0_on1'].transform(sum)
a1['off0'] = a1.groupby('del_month')['off0_on1'].transform(sum(lambda x: 1 if x == 0 else 0))
Here's the second line's error:
TypeError: 'function' object is not iterable
A previous question of mine had a problem with the lambda function, which was solved. The bigger problem is how to reproduce SQL's "sum(case when)" logic on grouped data. I'm looking for a general solution, since I need to do this sort of thing often. The answers to my previous question suggested using map() inside the lambda function, but the results that approach gives for the "off0" column are not what I need. The "on1" column is what I want. The answer should be the same for the whole group (i.e. per "del_month").
Simply sum the True values from your conditional expressions:
import pandas as pd
a1 = pd.DataFrame({'del_month': [1,1,1,1,2,2,2,2],
                   'off0_on1': [0,0,1,1,0,1,1,1]})
a1['on1'] = a1.groupby('del_month')['off0_on1'].transform(lambda x: sum(x==1))
a1['off0'] = a1.groupby('del_month')['off0_on1'].transform(lambda x: sum(x==0))
print(a1)
#    del_month  off0_on1  on1  off0
# 0          1         0    2     2
# 1          1         0    2     2
# 2          1         1    2     2
# 3          1         1    2     2
# 4          2         0    3     1
# 5          2         1    3     1
# 6          2         1    3     1
# 7          2         1    3     1
Similarly, you can do the same in SQL if your dialect supports summing boolean expressions directly, which most do:
select
    del_month
    , sum(off0_on1 = 1) as on1
    , sum(off0_on1 = 0) as off0
from a1
group by del_month
order by del_month
And to replicate the above SQL in pandas, don't use transform; instead compute multiple aggregates in a groupby().apply() call:
def aggfunc(x):
    data = {'on1': sum(x['off0_on1'] == 1),
            'off0': sum(x['off0_on1'] == 0)}
    return pd.Series(data)
g = a1.groupby('del_month').apply(aggfunc)
print(g)
#            on1  off0
# del_month
# 1            2     2
# 2            3     1
Using get_dummies needs only a single groupby call, which is simpler (here the question's a1 frame is referred to as df):
v = pd.get_dummies(df.pop('off0_on1')).groupby(df.del_month).transform(sum)
df = pd.concat([df, v.rename({0: 'off0', 1: 'on1'}, axis=1)], axis=1)
df
   del_month  off0  on1
0          1     2    2
1          1     2    2
2          1     2    2
3          1     2    2
4          2     1    3
5          2     1    3
6          2     1    3
7          2     1    3
Additionally, for the case of aggregation, call sum directly instead of using apply:
(pd.get_dummies(df.pop('off0_on1'))
   .groupby(df.del_month)
   .sum()
   .rename({0: 'off0', 1: 'on1'}, axis=1))
           off0  on1
del_month
1             2    2
2             1    3
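For the purely aggregated table, a pd.crosstab sketch gives the same counts in one call, assuming the original a1 frame with its off0_on1 column still intact:
import pandas as pd

a1 = pd.DataFrame({'del_month': [1, 1, 1, 1, 2, 2, 2, 2],
                   'off0_on1':  [0, 0, 1, 1, 0, 1, 1, 1]})

# Count occurrences of each off0_on1 value per del_month, then rename the columns.
g = (pd.crosstab(a1['del_month'], a1['off0_on1'])
       .rename(columns={0: 'off0', 1: 'on1'}))
print(g)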

Reformatting a dataframe without using for loops

I want to convert a dataframe like:
id  event_type  count
1   "a"         3
1   "b"         5
2   "a"         1
3   "b"         2
into a dataframe like:
id  a  b  a > b
1   3  5  0
2   1  0  1
3   0  2  0
Without using for loops. What's a proper Pythonic (Pandas-tonic?) way of doing this?
Well, not sure if this is exactly what you need or if it has to be more flexible than this. However, this would be one way to do it - assuming missing values can be replaced by 0.
import pandas as pd
from io import StringIO

# Creating and reading the data
data = """
id event_type count
1 "a" 3
1 "b" 5
2 "a" 1
3 "b" 2
"""
df = pd.read_csv(StringIO(data), sep=r'\s+')

# Transforming
df_ = pd.pivot_table(df, index='id', values='count', columns='event_type') \
        .fillna(0).astype(int)
df_['a > b'] = (df_['a'] > df_['b']).astype(int)
Where df_ will take the form:
event_type  a  b  a > b
id
1           3  5      0
2           1  0      1
3           0  2      0
This can be split up into two parts:
pivot (see the linked post)
assign the new column
Solution:
df.set_index(
    ['id', 'event_type']
)['count'].unstack(
    fill_value=0
).assign(**{
    'a > b': lambda d: d.eval('a > b')
})
Note that eval returns booleans here; chain .astype(int) if the 0/1 integers shown in the expected output are needed.

Search boolean matrix using pyspark

I have a boolean matrix of M x N, where M = 6000 and N = 1000
1 | 0 1 0 0 0 1  ----> 1000
2 | 1 0 1 0 1 0  ----> 1000
3 | 0 0 1 1 0 0  ----> 1000
|
v
6000
Now for each column, I want to find the first occurrence where the value is 1. For the six columns shown above, I want 2 1 2 3 2 1.
Now the code I have is
sig_matrix = list()
num_columns = df.columns
for col_name in num_columns:
    print('Processing column {}'.format(col_name))
    sig_index = df.filter(df[col_name] == 1).\
        select('perm').limit(1).collect()[0]['perm']
    sig_matrix.append(sig_index)
The above code is really slow; it takes 5-7 minutes to process 1000 columns. Is there any faster way to do this? I am also willing to use a pandas dataframe instead of a pyspark dataframe if that is faster.
Here is a numpy version that runs in under a second for me, so it should be preferable for this size of data:
import numpy as np

arr = np.random.choice([0, 1], size=(6000, 1000))
# First (0-based) row index of a 1 in each column; add 1 for the 1-based positions asked for.
[np.argwhere(arr[:, i] == 1)[0][0] for i in range(1000)]
There could well be more efficient numpy solutions.
I ended up solving my problem using numpy. Here is how I did it.
import numpy as np

sig_matrix = list()
columns = list(df)
for col_name in columns:
    sig_index = np.argmax(df[col_name]) + 1
    sig_matrix.append(sig_index)
As the values in my columns are 0 and 1, argmax will return the first occurrence of value 1.
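The per-column loop can be collapsed into a single vectorized call; this is only a sketch, assuming df is a pandas DataFrame of 0/1 values as above (older pandas versions without to_numpy can use df.values instead):
# argmax along axis 0 gives the row index of the first maximum (the first 1)
# in every column at once; + 1 converts to the 1-based positions used above.
# Columns containing no 1 would report position 1, the same caveat as the loop above.
sig_matrix = df.to_numpy().argmax(axis=0) + 1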

Generating new columns as a full-combination of other columns

Could not find similar cases here.
Suppose I have a DataFrame:
df = pd.DataFrame({'A': [2,2,1,2],
                   'B': [2,2,3,3],
                   'C': [3,3,3,4],
                   'I': [1,0,0,1],
                   'II': [0,1,0,1]})
So it is:
   A  B  C  I  II
0  2  2  3  1   0
1  2  2  3  0   1
2  1  3  3  0   0
3  2  3  4  1   1
I want to make a full pairwise combination between {A,B,C} and {I,II}, so I get {I-A,I-B,I-C,II-A,II-B,II-C}.
Each new column is just the elementwise multiplication of the corresponding base columns:
   I-A  I-B  I-C  II-A  II-B  II-C
0    2    2    3     0     0     0
1    0    0    0     2     2     3
2    0    0    0     0     0     0
3    2    3    4     2     3     4
At the moment I don't have any working solution. I'm trying to use loops (not succeeding so far), but I hope there's a more efficient way.
It's pretty simple, really. You have two sets of columns that you want to combine pairwise. I won't even bother with permutation tools:
>>> new_df = pd.DataFrame()
>>>
>>> for i in ["I", "II"]:
...     for a in ["A", "B", "C"]:
...         new_df[i+"-"+a] = df[i] * df[a]
...
>>> new_df
   I-A  I-B  I-C  II-A  II-B  II-C
0    2    2    3     0     0     0
1    0    0    0     2     2     3
2    0    0    0     0     0     0
3    2    3    4     2     3     4
Of course you could obtain the lists of column names as slices off df.columns, or in whatever other way is convenient. E.g. for your example dataframe you could write
>>> for i in df.columns[3:]:
...     for a in df.columns[:3]:
...         new_df[i+"-"+a] = df[i] * df[a]
Using loops, you can use this code. It's definitely not the most elegant solution but should work for your purpose. It only requires that you specify the columns that you'd like to use for the pairwise multiplication. It seems to be quite readable though, which is something you may want.
def element_wise_mult(first, second):
    # Multiply two sequences element by element and return the products as a list.
    result = []
    for i, el in enumerate(first):
        result.append(el * second[i])
    return result

if __name__ == '__main__':
    import pandas as pd

    df = pd.DataFrame({'A': [2,2,1,2],
                       'B': [2,2,3,3],
                       'C': [3,3,3,4],
                       'I': [1,0,0,1],
                       'II': [0,1,0,1]})
    fs = ['I', 'II']
    sc = ['A', 'B', 'C']
    series = []
    names = []
    for i in fs:
        for j in sc:
            names.append(i + '-' + j)
            series.append(pd.Series(element_wise_mult(df[i], df[j])))  # each product stored as a pandas Series
    print(pd.DataFrame(series, index=names).T)  # reconstruct the dataframe from the stored series and names
Returns:
   I-A  I-B  I-C  II-A  II-B  II-C
0    2    2    3     0     0     0
1    0    0    0     2     2     3
2    0    0    0     0     0     0
3    2    3    4     2     3     4
Here is a solution without for loops for your specific example:
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [2,2,1,2],
                   'B': [2,2,3,3],
                   'C': [3,3,3,4],
                   'I': [1,0,0,1],
                   'II': [0,1,0,1]})

cross_vals = np.tile(df[df.columns[:3]].values, (1, 2)) * np.repeat(df[df.columns[3:]].values, 3, axis=1)
cros_cols = np.repeat(df.columns[3:].values, 3) + np.array('-') + np.tile(df.columns[:3].values, (1, 2))
new_df = pd.DataFrame(cross_vals, columns=cros_cols[0])
Then new_df is
   I-A  I-B  I-C  II-A  II-B  II-C
0    2    2    3     0     0     0
1    0    0    0     2     2     3
2    0    0    0     0     0     0
3    2    3    4     2     3     4
You could generalize it to any size as long as the columns A,B,C,... are consecutive and similarly the columns I,II,... are consecutive.
For the general case, if the columns are not necessarily consecutive, you can do the following:
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [2,2,1,2],
                   'B': [2,2,3,3],
                   'C': [3,3,3,4],
                   'I': [1,0,0,1],
                   'II': [0,1,0,1]})

let = np.array(['A', 'B', 'C'], dtype=object)
num = np.array(['I', 'II'], dtype=object)
cross_vals = np.tile(df[let].values, (1, len(num))) * np.repeat(df[num].values, len(let), axis=1)
cros_cols = np.repeat(num, len(let)) + np.array('-') + np.tile(let, (1, len(num)))
new_df = pd.DataFrame(cross_vals, columns=cros_cols[0])
And the result is the same as above.
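A loop-free pandas variant of the same elementwise products can also be written as a dictionary comprehension fed straight into the DataFrame constructor; a sketch, assuming the two column groups from the question:
import pandas as pd

df = pd.DataFrame({'A': [2, 2, 1, 2],
                   'B': [2, 2, 3, 3],
                   'C': [3, 3, 3, 4],
                   'I': [1, 0, 0, 1],
                   'II': [0, 1, 0, 1]})

# Each new column is the elementwise product of one column from each group;
# on modern Python/pandas the insertion order I-A, I-B, ..., II-C is kept.
new_df = pd.DataFrame({i + '-' + a: df[i] * df[a]
                       for i in ['I', 'II']
                       for a in ['A', 'B', 'C']})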

pandas not setting column correctly

I have the following program in Python:
# input
import pandas as pd
import numpy as np
data = pd.DataFrame({'a':pd.Series([1.,2.,3.]), 'b':pd.Series([4.,np.nan,6.])})
Here the data is:
In: print data
   a   b
0  1   4
1  2 NaN
2  3   6
Now I want an isnull column indicating whether the row has any NaN:
# create data
data['isnull'] = np.zeros(len(data))
data['isnull'][pd.isnull(data).any(axis=1)] = 1
The output is not correct (the second row's isnull should be 1):
In: print data
   a   b  isnull
0  1   4       0
1  2 NaN       0
2  3   6       0
However, if I execute the exact same command again, the output is correct:
data['isnull'][pd.isnull(data).any(axis=1)] = 1
print data
   a   b  isnull
0  1   4       0
1  2 NaN       1
2  3   6       0
Is this a bug with pandas or am I missing something obvious?
My Python version is 2.7.6, pandas is 0.12.0, and numpy is 1.8.0.
You're using chained indexing, which doesn't give reliable results in pandas. I would do the following:
data['isnull'] = pd.isnull(data).any(axis=1).astype(int)
print data
   a   b  isnull
0  1   4       0
1  2 NaN       1
2  3   6       0
For more on the problems with chained indexing, see here:
http://pandas-docs.github.io/pandas-docs-travis/indexing.html#indexing-view-versus-copy
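If the two-step zero-then-set pattern from the question is preferred, a single .loc assignment avoids the chained indexing; a sketch on the same data frame:
import numpy as np
import pandas as pd

data = pd.DataFrame({'a': pd.Series([1., 2., 3.]), 'b': pd.Series([4., np.nan, 6.])})

# One .loc call selects rows and column together, so the assignment
# hits the original frame instead of a temporary copy.
data['isnull'] = 0
data.loc[pd.isnull(data).any(axis=1), 'isnull'] = 1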
