How to drop rows of a dataframe inside of a function - python

I have a dataframe and I want to drop some rows inside of a function:
def IncomeToGo(dataframe, mainCatName):
    for k in dataframe.name:
        if mainCatName in k:
            dataframe = dataframe.drop(dataframe.loc[dataframe.name == k].index)
This is the way I use that function:
print(len(df1))  # len = 21
IncomeToGo(df1, 'Apple')
print(len(df1))  # len = 21
But the drop part doesn't do anything and nothing is removed from my dataframe.

IIUC, here's one way:
def IncomeToGo(dataframe, mainCatName):
    return dataframe[dataframe.name.ne(mainCatName)]
Example:
Initial df:
name menu
0 A cheese
1 A cake
2 A sausage
3 B chicken
4 B cake
5 B water
6 C chicken
7 C sausage
8 C water
9 D water
10 D cheese
11 D sausage
df = pd.DataFrame(
    {'name': {0: 'A', 1: 'A', 2: 'A', 3: 'B', 4: 'B', 5: 'B',
              6: 'C', 7: 'C', 8: 'C', 9: 'D', 10: 'D', 11: 'D'},
     'menu': {0: 'cheese', 1: 'cake', 2: 'sausage', 3: 'chicken',
              4: 'cake', 5: 'water', 6: 'chicken', 7: 'sausage',
              8: 'water', 9: 'water', 10: 'cheese', 11: 'sausage'}})
def IncomeToGo(dataframe, mainCatName):
    return dataframe[dataframe.name.ne(mainCatName)]
IncomeToGo(df, 'A')
OUTPUT df:
name menu
3 B chicken
4 B cake
5 B water
6 C chicken
7 C sausage
8 C water
9 D water
10 D cheese
11 D sausage

You have 2 errors in your code:
You don't return anything from your function, so the caller's dataframe is never updated
You remove rows from the dataframe while you are looping over one of its columns, which is very bad practice
Try just by filtering out these rows:
def IncomeToGo(dataframe, mainCatName):
    return dataframe[dataframe.name != mainCatName]
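Since the function returns a new dataframe rather than modifying df1 in place, the result has to be assigned back to see the effect. A minimal usage sketch with the df1 from the question:
df1 = IncomeToGo(df1, 'Apple')
print(len(df1))  # now reflects the dropped rows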

You can try something like below if you want to use a for loop over the column values:
def IncomeToGo(dataframe, mainCatName):
    for k in dataframe.name.unique():
        if mainCatName == k:
            dataframe = dataframe.loc[dataframe.name != mainCatName].copy()
    return dataframe
I would advise not hardcoding column names in the function. Write functions in such a way that they can be used in multiple places dynamically.
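A sketch of that idea with the column name passed in as a parameter, so the helper is not tied to a column literally called name (the col argument is hypothetical, not from the original code):
def income_to_go(dataframe, col, main_cat_name):
    # keep only the rows whose value in the given column differs from main_cat_name
    return dataframe.loc[dataframe[col] != main_cat_name].copy()
filtered = income_to_go(df, 'name', 'A')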

Related

Converting pandas df from long to wide with duplicated rows and categorical variables as values [duplicate]

This question already has answers here:
How can I pivot a dataframe?
I have a simple dataframe
{'ID': {0: 101, 1: 101, 2: 101, 3: 102, 4: 102, 5: 102, 6: 102, 7: 102, 8: 103, 9: 103}, 'Category': {0: 'A', 1: 'B', 2: 'C', 3: 'A', 4: 'A', 5: 'A', 6: 'B', 7: 'B', 8: 'A', 9: 'B'}}
You can see that ID has duplicates, which gives me a reshaping error.
I want to convert it from long to wide and keep the values in the Category column as new columns (I will have at most 5 categories per ID, but some will have fewer than 5 and should get NaN). I also want to rename the columns to Role1, Role2, etc.
Expected output
ID Role1 Role2 Role3 Role4 Role5
101 A B C NaN NaN
102 A A A B B
103 A B NaN NaN NaN
You can assign an enumeration within each ID and query that before pivoting:
N = 5
(df.assign(role_enum=df.groupby('ID').cumcount()+1)
   .query('role_enum <= @N')
   .pivot(index='ID', columns='role_enum', values='Category')
   .add_prefix('Role').reset_index()  # book-keeping
)
Output:
role_enum ID Role1 Role2 Role3 Role4 Role5
0 101 A B C NaN NaN
1 102 A A A B B
2 103 A B NaN NaN NaN
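For reference, a self-contained sketch that builds the DataFrame from the dict in the question and reproduces the output above (assumes only pandas):
import pandas as pd

df = pd.DataFrame({'ID': {0: 101, 1: 101, 2: 101, 3: 102, 4: 102, 5: 102, 6: 102, 7: 102, 8: 103, 9: 103},
                   'Category': {0: 'A', 1: 'B', 2: 'C', 3: 'A', 4: 'A', 5: 'A', 6: 'B', 7: 'B', 8: 'A', 9: 'B'}})

N = 5  # maximum number of roles kept per ID
out = (df.assign(role_enum=df.groupby('ID').cumcount() + 1)   # enumerate rows within each ID
         .query('role_enum <= @N')                            # keep at most N roles
         .pivot(index='ID', columns='role_enum', values='Category')
         .add_prefix('Role').reset_index())
print(out)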

Increasing a value during merge in pandas

I have 2 dataframes
df1
product_id value name
abc 10 a
def 20 b
ggg 10 c
df2
Which I get after using df2.groupby(['prod_id'])['code'].count().reset_index()
prod_id code
abc 10
def 20
ggg 10
ooo 5
hhh 1
I want to merge values from df2 to df1 left on product_id, right on prod_id.
To get:
product_id value name
abc 20 a
def 40 b
ggg 20 c
I tried:
pd.merge(df1, df2.groupby(['prod_id'])['code'].count().reset_index(),
         left_on='product_id', right_on='prod_id', how='left')
Which returns df1 with 2 additional columns, prod_id and code, where code holds the amount by which I would like to increase value in df1. I could then just add the code column to value and drop the extras, but I would like to avoid that.
Here’s one alternative:
df1['value'] = df1.product_id.map(dict(df2.values)).fillna(0).add(df1.value)
Complete example:
df1 = pd.DataFrame({'product_id': {0: 'abc', 1: 'def', 2: 'ggg'},
                    'value': {0: 10, 1: 20, 2: 10},
                    'name': {0: 'a', 1: 'b', 2: 'c'}})
df2 = pd.DataFrame({'prod_id': {0: 'abc', 1: 'def', 2: 'ggg', 3: 'ooo', 4: 'hhh'},
                    'code': {0: 10, 1: 20, 2: 10, 3: 5, 4: 1}})
df1['value'] = df1.product_id.map(dict(df2.values)).fillna(0).add(df1.value)
OUTPUT:
product_id value name
0 abc 20 a
1 def 40 b
2 ggg 20 c
You could use reindex after the groupby count on df2 (without the reset_index), with the order of df1's product_id, like:
df1['value'] += (
    df2.groupby(['prod_id'])
       ['code'].count()
       .reindex(df1['product_id'], fill_value=0)
       .to_numpy()
)
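Alternatively, sticking with the merge from the question and simply folding the helper columns away afterwards also works. A sketch, assuming the df1 and df2 from the complete example above (where df2 already holds the counts):
merged = df1.merge(df2, left_on='product_id', right_on='prod_id', how='left')
# add the matched counts onto value; products missing from df2 contribute 0
df1['value'] = df1['value'] + merged['code'].fillna(0).astype(int)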

Compare two columns in two dataframes with a condition on another column

I have a multilevel dataframe and I want to compare the value in column secret with a condition on column group. If group = A, we allow the value in another dataframe to be empty or na. Otherwise, output INVALID for the mismatching ones.
multilevel dataframe:
name secret group
df1 df2 df1 df2 df1 df2
id
1 Tim Tim random na A A
2 Tom Tom tree A A
3 Alex Alex apple apple B B
4 May May file cheese C C
expected output for secret
id name secret group
1 Tim na A
2 Tom A
3 Alex apple B
4 May INVALID C
So far I have:
result_df['result'] = multilevel_df.groupby(level=0, axis=0).apply(lambda x: secret_check(x))
# take care of the rest by comparing column by column
result_df = multilevel_df.groupby(level=0, axis=1).apply(lambda x: validate(x))
def validate(x):
    if x[0] == x[1]:
        return x[1]
    else:
        return 'INVALID'
def secret_check(x):
    if x['group'] == 'A' and pd.isnull(x['secret']):  # this line is off
        return x[1]
    elif x[0] == x[1]:
        return x[1]
    else:
        return 'INVALID'
Assuming we have the following dataframe:
df = pd.DataFrame({0: {0: 1, 1: 2, 2: 3, 3: 4},
                   1: {0: 'Tim', 1: 'Tom', 2: 'Alex', 3: 'May'},
                   2: {0: 'Tim', 1: 'Tom', 2: 'Alex', 3: 'May'},
                   3: {0: 'random', 1: 'tree', 2: 'apple', 3: 'file'},
                   4: {0: 'na', 1: '', 2: 'apple', 3: 'cheese'},
                   5: {0: 'A', 1: 'A', 2: 'B', 3: 'C'},
                   6: {0: 'A', 1: 'A', 2: 'B', 3: 'C'}})
df
df.columns = pd.MultiIndex.from_tuples([('id', ''), ('name', 'df1'), ('name', 'df2'),
                                        ('secret', 'df1'), ('secret', 'df2'), ('group', 'df1'), ('group', 'df2')])
df
In[1]:
id name secret group
df1 df2 df1 df2 df1 df2
0 1 Tim Tim random na A A
1 2 Tom Tom tree A A
2 3 Alex Alex apple apple B B
3 4 May May file cheese C C
You can use np.select() to return results based on conditions,
.droplevel() to get out of a MultiIndex dataframe,
and df.loc[:, ~df.columns.duplicated()] to drop duplicate columns. Since we are writing the answer into the df1 columns, the df2 columns are not needed.
df[('secret', 'df1')] = np.select([(df[('group', 'df2')] != 'A') &
                                   (df[('secret', 'df1')] != df[('secret', 'df2')])],  # condition 1
                                  [df[('secret', 'df1')] + ' > ' + df[('secret', 'df2')]],  # result 1
                                  df[('secret', 'df2')])  # alternative if conditions not met
df.columns = df.columns.droplevel(1)
df = df.loc[:, ~df.columns.duplicated()]
df
Out[1]:
id name secret group
0 1 Tim na A
1 2 Tom A
2 3 Alex apple B
3 4 May file > cheese C
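If the literal INVALID marker from the expected output is wanted instead of showing both mismatching values, the same np.select call can return it; a small variation on the step above (run before the droplevel):
df[('secret', 'df1')] = np.select([(df[('group', 'df2')] != 'A') &
                                   (df[('secret', 'df1')] != df[('secret', 'df2')])],
                                  ['INVALID'],            # mark mismatches outside group A
                                  df[('secret', 'df2')])  # otherwise keep the df2 value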
If I understand you right, you want to mark "secret" in df2 as invalid if the secrets in df1 and df2 differ and the group is not A. There you go:
condition = ((df[('secret', 'df1')] != df[('secret', 'df2')]) &
             (df[('group', 'df1')] != 'A'))
df.loc[condition, ('secret', 'df2')] = 'INVALID'

Working with the output of groupby and groupby.size()

I have a pandas dataframe containing a row for each object manipulated by participants during a user study. Each participant participates in the study 3 times, one in each of 3 conditions (a,b,c), working with around 300-700 objects in each condition.
When I report the number of objects worked with I want to make sure that this didn't vary significantly by condition (I don't expect it to have done, but need to confirm this statistically).
I think I want to run an ANOVA to compare the 3 conditions, but I can't figure out how to get the data I need for the ANOVA.
I currently have some pandas code to group the data and count the number of rows per participant per condition (so I can then use mean() and similar to summarise the data). An example with a subset of the data follows:
>>> tmp = df.groupby([FIELD_PARTICIPANT, FIELD_CONDITION]).size()
>>> tmp
participant_id condition
1 a 576
2 b 367
3 a 703
4 c 309
dtype: int64
To calculate the ANOVA I would normally just filter these by the condition column, e.g.
cond1 = tmp[tmp[FIELD_CONDITION] == CONDITION_A]
cond2 = tmp[tmp[FIELD_CONDITION] == CONDITION_B]
cond3 = tmp[tmp[FIELD_CONDITION] == CONDITION_C]
f_val, p_val = scipy.stats.f_oneway(cond1, cond2, cond3)
However, since tmp is a Series rather than the DataFrame I'm used to, I can't figure out how to achieve this in the normal way.
>>> tmp[FIELD_CONDITION]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Python/2.7/site-packages/pandas/core/series.py", line 583, in __getitem__
result = self.index.get_value(self, key)
File "/Library/Python/2.7/site-packages/pandas/indexes/multi.py", line 626, in get_value
raise e1
KeyError: 'condition'
>>> type(tmp)
<class 'pandas.core.series.Series'>
>>> tmp.index
MultiIndex(levels=[[u'1', u'2', u'3', u'4'], [u'd', u's']],
labels=[[0, 1, 2, 3], [0, 0, 0, 1]],
names=[u'participant_id', u'condition'])
I feel sure this is a straightforward problem to solve, but I can't seem to get there without some help :)
I think you need reset_index, and then the output is a DataFrame:
tmp = df.groupby([FIELD_PARTICIPANT, FIELD_CONDITION]).size().reset_index(name='count')
Sample:
import pandas as pd
df = pd.DataFrame({'participant_id': {0: 1, 1: 1, 2: 1, 3: 1, 4: 2, 5: 2, 6: 2, 7: 3, 8: 4, 9: 4},
                   'condition': {0: 'a', 1: 'a', 2: 'a', 3: 'a', 4: 'b', 5: 'b', 6: 'b', 7: 'a', 8: 'c', 9: 'c'}})
print (df)
condition participant_id
0 a 1
1 a 1
2 a 1
3 a 1
4 b 2
5 b 2
6 b 2
7 a 3
8 c 4
9 c 4
tmp = df.groupby(['participant_id', 'condition']).size().reset_index(name='count')
print (tmp)
participant_id condition count
0 1 a 4
1 2 b 3
2 3 a 1
3 4 c 2
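With this reset_index DataFrame, the filtering style from the question works again, so the ANOVA can be run directly. A sketch, assuming scipy is available and the real data has several participants per condition:
import scipy.stats
cond1 = tmp.loc[tmp['condition'] == 'a', 'count']
cond2 = tmp.loc[tmp['condition'] == 'b', 'count']
cond3 = tmp.loc[tmp['condition'] == 'c', 'count']
f_val, p_val = scipy.stats.f_oneway(cond1, cond2, cond3)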
If you need to keep working with the Series, you can build a condition that selects values of the condition level of the MultiIndex with get_level_values:
tmp = df.groupby(['participant_id', 'condition']).size()
print (tmp)
participant_id condition
1 a 4
2 b 3
3 a 1
4 c 2
dtype: int64
print (tmp.index.get_level_values('condition'))
Index(['a', 'b', 'a', 'c'], dtype='object', name='condition')
print (tmp.index.get_level_values('condition') == 'a')
[ True False True False]
print (tmp[tmp.index.get_level_values('condition') == 'a'])
participant_id condition
1 a 4
3 a 1
dtype: int64
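The same selection feeds scipy.stats.f_oneway directly when staying with the Series; a sketch mirroring the call from the question:
cond1 = tmp[tmp.index.get_level_values('condition') == 'a']
cond2 = tmp[tmp.index.get_level_values('condition') == 'b']
cond3 = tmp[tmp.index.get_level_values('condition') == 'c']
f_val, p_val = scipy.stats.f_oneway(cond1, cond2, cond3)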

Applying regex between two dataframes efficiently

I'm having a bit of a performance issue while trying to match specific words between two dataframes. I need to return a 1 for every row containing a word and a 0 otherwise. The function I wrote looks as follows:
def matchWords(row):
    row = row[0].upper()
    for x in df_X.Names:
        if re.search("\\b" + x + "\\b", row):
            return 1
    return 0
This function is called from a lambda and although it works fine, it takes quite a long time to run. I have already applied multithreading in an effort to increase the speed, but I want it faster. Is there a way to maybe precompile the df_X.Names, or does anybody have another tip to make this faster / more efficient?
Thanks in advance for any help!
IIUC you need str.contains; multiple words can be joined by | (or). Last, use numpy.where:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'d': {0: 'wa', 1: 'rs', 2: 'qn'},
                    'e': {0: 'i', 1: 'r', 2: 't'},
                    'f': {0: 'a', 1: 's', 2: 'f'}})
print(df1)
d e f
0 wa i a
1 rs r s
2 qn t f
df = pd.DataFrame({'a': {0: 'wa ug dh', 1: 'rs sd qn', 2: 'ga mf rn'},
                   'c': {0: 'i', 1: 'r', 2: 't'},
                   'b': {0: 'a', 1: 's', 2: 'f'}})
print(df)
a b c
0 wa ug dh a i
1 rs sd qn s r
2 ga mf rn f t
Join values from column d with separator |:
words = "|".join(df1.d.tolist())
print(words)
wa|rs|qn
print(df.a.str.contains(words))
0 True
1 True
2 False
Name: a, dtype: bool
print(np.where(df.a.str.contains(words), 1, 0))
[1 1 0]
df['new'] = np.where(df.a.str.contains(words), 1, 0)
print(df)
a b c new
0 wa ug dh a i 1
1 rs sd qn s r 1
2 ga mf rn f t 0
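If the whole-word matching from the original \b regex matters (a plain | pattern in str.contains also matches substrings), the joined pattern can carry the word boundaries and escape any regex metacharacters in the names, and case can be ignored as in the original .upper() call. A sketch, assuming the df1 and df from above:
import re
pattern = r"\b(?:" + "|".join(re.escape(w) for w in df1.d) + r")\b"
df['new'] = np.where(df.a.str.contains(pattern, case=False, regex=True), 1, 0)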
