I'm working with Pandas. I need to create a new column in a dataframe according to conditions in other columns. I try to look for each value in a series if it contains a value (a condition to return text).This works when the values are exactly the same but not when the value is only a part of the value of the series.
Sample data :
df = pd.DataFrame([["ores"], ["ores + more texts"], ["anything else"]], columns=['Symptom'])
def conditions(df5):
if ("ores") in df5["Symptom"]: return "Things"
df["new_column"] = df.swifter.apply(conditions, axis=1)
It's doesn't work because any("something") is always True
So i tried :
df['new_column'] = np.where(df2["Symptom"].str.contains('ores'), 'yes', 'no') : return "Things"
It doesn't work because it's inside a loop.
I can't use np.select because it needed two separate lists and my code has to be easily editable (and it can't come from a dict).
It also doesn't work with find_all. And also not with :
df["new_column"] == "ores" is True: return "things"
I don't really understand why nothing work and what i have to do ?
Edit :
df5 = pd.DataFrame([["ores"], ["ores + more texts"], ["anything else"]], columns=['Symptom'])
def conditions(df5):
(df5["Symptom"].str.contains('ores'), 'Things')
df5["Deversement Service"] = np.where(conditions)
df5
For the moment i have a lenght of values problem
To add a new column with condition, use np.where:
df = pd.DataFrame([["ores"], ["ores + more texts"], ["anything else"]], columns=['Symptom'])
df['new'] = np.where(df["Symptom"].str.contains('ores'), 'Things', "")
print (df)
Symptom new
0 ores Things
1 ores + more texts Things
2 anything else
If you need a single boolean value, use pd.Series.any:
if df["Symptom"].str.contains('ores').any():
print ("Things")
# Things
Related
---Hello, everyone! New student of Python's Pandas here.
I have a dataframe I artificially constructed here: https://i.stack.imgur.com/cWgiB.png. Below is a text reconstruction.
df_dict = {
'header0' : [55,12,13,14,15],
'header1' : [21,22,23,24,25],
'header2' : [31,32,55,34,35],
'header3' : [41,42,43,44,45],
'header4' : [51,52,53,54,33]
}
index_list = {
0:'index0',
1:'index1',
2:'index2',
3:'index3',
4:'index4'
}
df = pd.DataFrame(df_dict).rename(index = index_list)
GOAL:
I want to pull the index row(s) and column header(s) of any ARBITRARY value(s) (int, float, str, etc.). So for eg, if I want the values of 55, this code will return: header0, index0, header2, index2 in some format. They could be list or tuple or print, etc.
CLARIFICATIONS:
Imagine the dataframe is of a large enough size that I cannot "just find it manually"
I do not know how large this value is in comparison to other values (so a "simple .idxmax()" probably won't cut it)
I do not know where this value is column or index wise (so "just .loc,.iloc where the value is" won't help either)
I do not know whether this value has duplicates or not, but if it does, return all its column/indexes.
WHAT I'VE TRIED SO FAR:
I've played around with .columns, .index, .loc, but just can't seem to get the answer. The farthest I've gotten is creating a boolean dataframe with df.values == 55 or df == 55, but cannot seem to do anything with it.
Another "farthest" way I've gotten is using df.unstack.idxmax(), which would return a tuple of the column and header, but has 2 major problems:
Only returns the max/min as per the .idxmax(), .idxmin() functions
Only returns the FIRST column/index matching my value, which doesn't help if there are duplicates
I know I could do a for loop to iterate through the entire dataframe, tracking which column and index I am on in temporary variables. Once I hit the value I am looking for, I'll break and return the current column and index. Was just hoping there was a less brute-force-y method out there, since I'd like a "high-speed calculation" method that would work on any dataframe of any size.
Thanks.
EDIT: Added text database, clarified questions.
Use np.where:
r, c = np.where(df == 55)
list(zip(df.index[r], df.columns[c]))
Output:
[('index0', 'header0'), ('index2', 'header2')]
There is a function in pandas that gives duplicate rows.
duplicate = df[df.duplicated()]
print(duplicate)
Use DataFrame.unstack for Series with MultiIndex and then filter duplicates by Series.duplicated with keep=False:
s = df.unstack()
out = s[s.duplicated(keep=False)].index.tolist()
If need also duplicates with values:
df1 = (s[s.duplicated(keep=False)]
.sort_values()
.rename_axis(index='idx', columns='cols')
.reset_index(name='val'))
If need tet specific value change mask for Series.eq (==):
s = df.unstack()
out = s[s.eq(55)].index.tolist()
So, in the code below, there is an iteration. However, it doesn't iterate over the whole DataFrame, but it just iterates over the columns, and then use .any() to check if there is any of the desierd value. Then using loc feature in the pandas it locates the value, and finally returns the index.
wanted_value = 55
for col in list(df.columns):
if df[col].eq(wanted_value).any() == True:
print("row:", *list(df.loc[df[col].eq(wanted_value)].index), ' col', col)
I have a dataframe with some clubs and their nationality. Just like this one:
I created a function that I will use to create a new column based on the Nationality. I tested and it works fine if I want to find values that are equal. However, I needed to search for strings that contains a certain character. E.g.: If the string contains 'Br' than I want to create a new column which will receive a certain value. If contains another string, than it will receive another value.
This is what I've done so far (and it is working fine, but I needed something like a 'contains'):
# Function
def label_race (row):
if row['Nationality'] == 'Brazil':
return 'Brasil'
else:
return 'NA'
df.apply (lambda row: label_race(row), axis=1)
I would like to do something like this:
# Function
def label_race (row):
if row['Nationality'] contains'Br':
return 'Brasil'
if row['Nationality'] contains'Brl':
return 'Brasil2'
else:
return 'NA'
df.apply (lambda row: label_race(row), axis=1)
I found some tips, but most of them use things like is.find() or df[].str.contains. And I couldn't adapt to what I want.
if you wanted to create a new column with binary values (if condition met then A else B), you could do something like this
#create a column 'new' with value 'Brasil' if 'Nationality' value contains 'Bra', else put 'NA'
df['new'] = df['Nationality'].apply(lambda x: 'Brasil' if 'Bra' in x else 'NA')
otherwise, if you wanted to create a column and use multiple rules in the same column, you could do something like this...
#create a column 'new' and insert value 'ARG' whenever 'Nationality' contains 'Arg',
df.loc[df['Nationality'].str.contains('Arg'), 'new'] = 'ARG'
#and 'BRA' whenever Nationality contains 'Brazil', without overriding any other values
df.loc[df['Nationality'].str.contains('Brazil'), 'new'] = 'BRA'
IIUC, you can make do with str.extract and dot:
df = pd.DataFrame({'Nationality': ['Brazil', 'abBrl', 'abcd', 'BrX']})
new_df = df.Nationality.str.extract('(?P<Brazil2>Brl)|(?P<Brazil>Br)')
new_df.notnull().dot(new_df.columns)
Output:
0 Brazil
1 Brazil2
2
3 Brazil
dtype: object
I'm trying to loop through the 'vol' dataframe, and conditionally check if the sample_date is between certain dates. If it is, assign a value to another column.
Here's the following code I have:
vol = pd.DataFrame(data=pd.date_range(start='11/3/2015', end='1/29/2019'))
vol.columns = ['sample_date']
vol['hydraulic_vol'] = np.nan
for i in vol.iterrows():
if pd.Timestamp('2015-11-03') <= vol.loc[i,'sample_date'] <= pd.Timestamp('2018-06-07'):
vol.loc[i,'hydraulic_vol'] = 319779
Here's the error I received:
TypeError: 'Series' objects are mutable, thus they cannot be hashed
This is how you would do it properly:
cond = (pd.Timestamp('2015-11-03') <= vol.sample_date) &
(vol.sample_date <= pd.Timestamp('2018-06-07'))
vol.loc[cond, 'hydraulic_vol'] = 319779
Another way to do this would be to use the np.where method from the numpy module, in combination with the .between method.
This method works like this:
np.where(condition, value if true, value if false)
Code example
cond = vol.sample_date.between('2015-11-03', '2018-06-07')
vol['hydraulic_vol'] = np.where(cond, 319779, np.nan)
Or you can combine them in one single line of code:
vol['hydraulic_vol'] = np.where(vol.sample_date.between('2015-11-03', '2018-06-07'), 319779, np.nan)
Edit
I see that you're new here, so here's something I had to learn as well coming to python/pandas.
Looping over a dataframe should be your last resort, try to use vectorized solutions, in this case .loc or np.where, these will perform better in terms of speed compared to looping.
I have some DataFrames with information about some elements, for instance:
my_df1=pd.DataFrame([[1,12],[1,15],[1,3],[1,6],[2,8],[2,1],[2,17]],columns=['Group','Value'])
my_df2=pd.DataFrame([[1,5],[1,7],[1,23],[2,6],[2,4]],columns=['Group','Value'])
I have used something like dfGroups = df.groupby('group').apply(my_agg).reset_index(), so now I have DataFrmaes with informations on groups of the previous elements, say
my_df1_Group=pd.DataFrame([[1,57],[2,63]],columns=['Group','Group_Value'])
my_df2_Group=pd.DataFrame([[1,38],[2,49]],columns=['Group','Group_Value'])
Now I want to clean my groups according to properties of their elements. Let's say that I want to discard groups containing an element with Value greater than 16. So in my_df1_Group, there should only be the first group left, while both groups qualify to stay in my_df2_Group.
As I don't know how to get my_df1_Group and my_df2_Group from my_df1 and my_df2 in Python (I know other languages where it would simply be name+"_Group" with name looping in [my_df1,my_df2], but how do you do that in Python?), I build a list of lists:
SampleList = [[my_df1,my_df1_Group],[my_df2,my_df2_Group]]
Then, I simply try this:
my_max=16
Bad=[]
for Sample in SampleList:
for n in Sample[1]['Group']:
df=Sample[0].loc[Sample[0]['Group']==n] #This is inelegant, but trying to work
#with Sample[1] in the for doesn't work
if (df['Value'].max()>my_max):
Bad.append(1)
else:
Bad.append(0)
Sample[1] = Sample[1].assign(Bad_Row=pd.Series(Bad))
Sample[1] = Sample[1].query('Bad_Row == 0')
Which runs without errors, but doesn't work. In particular, this doesn't add the column Bad_Row to my df, nor modifies my DataFrame (but the query runs smoothly even if Bad_Rowcolumn doesn't seem to exist...). On the other hand, if I run this technique manually on a df (i.e. not in a loop), it works.
How should I do?
Based on your comment below, I think you are wanting to check if a Group in your aggregated data frame has a Value in the input data greater than 16. One solution is to perform a row-wise calculation using a criterion of the input data. To accomplish this, my_func accepts a row from the aggregated data frame and the input data as a pandas groupby object. For each group in your grouped data frame, it will subset you initial data and use boolean logic to see if any of the 'Values' in your input data meet your specified criterion.
def my_func(row,grouped_df1):
if (grouped_df1.get_group(row['Group'])['Value']>16).any():
return 'Bad Row'
else:
return 'Good Row'
my_df1=pd.DataFrame([[1,12],[1,15],[1,3],[1,6],[2,8],[2,1],[2,17]],columns=['Group','Value'])
my_df1_Group=pd.DataFrame([[1,57],[2,63]],columns=['Group','Group_Value'])
grouped_df1 = my_df1.groupby('Group')
my_df1_Group['Bad_Row'] = my_df1_Group.apply(lambda x: my_func(x,grouped_df1), axis=1)
Returns:
Group Group_Value Bad_Row
0 1 57 Good Row
1 2 63 Bad Row
Based on dubbbdan idea, there is a code that works:
my_max=16
def my_func(row,grouped_df1):
if (grouped_df1.get_group(row['Group'])['Value']>my_max).any():
return 1
else:
return 0
SampleList = [[my_df1,my_df1_Group],[my_df2,my_df2_Group]]
for Sample in SampleList:
grouped_df = Sample[0].groupby('Group')
Sample[1]['Bad_Row'] = Sample[1].apply(lambda x: my_func(x,grouped_df), axis=1)
Sample[1].drop(Sample[1][Sample[1]['Bad_Row']!=0].index, inplace=True)
Sample[1].drop(['Bad_Row'], axis = 1, inplace = True)
I have 3 dataframes (df1, df2, df3) which are identically structured (# and labels of rows/columns), but populated with different values.
I want to populate df3 based on values in the associated column/rows in df1 and df2. I'm doing this with a FOR loop and a custom function:
for x in range(len(df3.columns)):
df3.iloc[:, x] = customFunction(x)
I want to populate df3 using this custom IF/ELSE function:
def customFunction(y):
if df1.iloc[:,y] <> 1 and df2.iloc[:,y] = 0:
return "NEW"
elif df2.iloc[:,y] = 2:
return "OLD"
else:
return "NEITHER"
I understand why I get an error message when i run this, but i can't figure out how to apply this function to a series. I could do it row by row with more complex code but i'm hoping there's a more efficient solution? I fear my approach is flawed.
v1 = df1.values
v2 = df2.values
df3.loc[:] = np.where(
(v1 != 1) & (v2 == 0), 'NEW',
np.where(v2 == 2, 'OLD', 'NEITHER'))
Yeah, try to avoid loops in pandas, its inefficient and built to be used with the underlying numpy vectorization.
You want to use the apply function.
Something like:
df3['new_col'] = df3.apply(lambda x: customFunction(x))
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html