String Modification On Pandas DataFrame Subset - python

I'm having a hard time updating string values in a subset of a Pandas data frame.
I am able to modify the action column using a regular expression with:
df['action'] = df.action.str.replace('([^a-z0-9\._]{2,})','')
However, if the string contains a specific word, I don't want to modify it, so I tried to only update a subset like this:
df[df['action'].str.contains('TIME')==False]['action'] = df[df['action'].str.contains('TIME')==False].action.str.replace('([^a-z0-9\._]{2,})','')
and also using .loc like:
df.loc('action',df.action.str.contains('TIME')==False) = df.loc('action',df.action.str.contains('TIME')==False).action.str.replace('([^a-z0-9\._]{2,})','')
but in both cases, nothing gets updated. Is there a better way to achieve this?

You can do it with .loc, but you have it the wrong way around: the row indexer comes first and the column second, and .loc uses square brackets [], not parentheses ().
mask_time = ~df['action'].str.contains('TIME') # same as df.action.str.contains('TIME')==False
df.loc[mask_time,'action'] = df.loc[mask_time,'action'].str.replace('([^a-z0-9\._]{2,})','')
example:
#dummy df
df = pd.DataFrame({'action': ['TIME 1', 'ABC 2']})
print (df)
   action
0  TIME 1
1   ABC 2
See the result after applying the method above:
   action
0  TIME 1
1       2
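Putting it together as a runnable sketch (note: newer pandas versions need regex=True for str.replace to treat the pattern as a regular expression, so it is passed explicitly here):
import pandas as pd

df = pd.DataFrame({'action': ['TIME 1', 'ABC 2']})
mask_time = ~df['action'].str.contains('TIME')
# replace runs of 2+ characters that are not lowercase letters, digits, dots or underscores,
# but only on rows whose action does not contain 'TIME'
df.loc[mask_time, 'action'] = df.loc[mask_time, 'action'].str.replace(
    r'([^a-z0-9\._]{2,})', '', regex=True)
print(df)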

Try this, it should work:
df.loc[df.action.str.contains('TIME')==False,'action'] = df.action.str.replace('([^a-z0-9\._]{2,})','')

Related

Add character to column based on text condition using pandas

I'm trying to do some data cleaning using pandas. Imagine I have a data frame with a column called "Number" that contains data like: "1203.10", "4221", "3452.11", etc. I want to add an "M" before the numbers that have a decimal point and end in zero. For this example, it would mean turning "1203.10" into "M1203.10".
I know how to obtain a data frame containing the numbers with a point and ending with zero.
Suppose the data frame is called "df".
pointzero = '[0-9]+[.][0-9]+[0]$'
pz = df[df.Number.str.match(pointzero)]
But I'm not sure how to add the "M" at the beginning after having "pz". The only way I know is using a for loop, but I think there is a better way. Any suggestions would be great!
You can use boolean indexing:
pointzero = '[0-9]+[.][0-9]+[0]$'
m = df.Number.str.match(pointzero)
df.loc[m, 'Number'] = 'M' + df.loc[m, 'Number']
Alternatively, using str.replace and a slightly different regex:
pointzero = '([0-9]+[.][0-9]+[0]$)'
df['Number'] = df['Number'].str.replace(pointzero, r'M\1', regex=True)
Example:
     Number
0  M1203.10
1      4221
2   3452.11
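A self-contained version of the boolean-indexing approach, using data assumed from the question (a sketch):
import pandas as pd

df = pd.DataFrame({'Number': ['1203.10', '4221', '3452.11']})
pointzero = '[0-9]+[.][0-9]+[0]$'
m = df.Number.str.match(pointzero)
# prepend 'M' only where the number has a decimal part ending in zero
df.loc[m, 'Number'] = 'M' + df.loc[m, 'Number']
print(df)
#      Number
# 0  M1203.10
# 1      4221
# 2   3452.11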
You should include a DataFrame or Series example when asking.
example:
s1 = pd.Series(["1203.10", "4221", "3452.11"])
s1
0    1203.10
1       4221
2    3452.11
dtype: object
str.contains + boolean masking
cond1 = s1.str.contains('[0-9]+[.][0-9]+[0]$')
s1.mask(cond1, 'M'+s1)
output:
0 M1203.10
1 4221
2 3452.11
dtype: object

How can I efficiently and idiomatically filter rows of a Pandas DataFrame based on multiple string methods on a single column?

I have a Pandas DataFrame df with many columns, of which one is:
col
---
abc:kk__LL-z12-1234-5678-kk__z
def:kk_A_LL-z12-1234-5678-kk_ss_z
abc:kk_AAA_LL-z12-5678-5678-keek_st_z
abc:kk_AA_LL-xx-xxs-4rt-z12-2345-5678-ek__x
...
I am trying to fetch all records where col starts with abc: and the first -num- token is between '1234' and '2345' (inclusive, compared as strings; the -num- parts are exactly 4 digits each).
In the case above, I'd return
col
---
abc:kk__LL-z12-1234-5678-kk__z
abc:kk_AA_LL-z12-2345-5678-ek__x
...
My current (working, I think) solution looks like:
df = df[df['col'].str.startswith('abc:')]
df = df[df['col'].str.extract('.*-(\d+)-(\d+)-.*')[0].ge('1234')]
df = df[df['col'].str.extract('.*-(\d+)-(\d+)-.*')[0].le('2345')]
What is a more idiomatic and efficient way to do this in Pandas?
Complex string operations are not as efficient as numeric calculations. So the following approach might be more efficient:
m1 = df['col'].str.startswith('abc')
m2 = pd.to_numeric(df['col'].str.split('-').str[2]).between(1234, 2345)
dfn = df[m1&m2]
col
0 abc:kk__LL-z12-1234-5678-kk__z
3 abc:kk_AA_LL-z12-2345-5678-ek__x
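For reference, a self-contained sketch of this approach. The sample values are taken from the question, except the last one, which follows the simpler layout used in the expected output (the longer variant in the question has extra dash-separated tokens, so the fixed index [2] would not land on the number); errors='coerce' is added so any non-numeric token does not raise:
import pandas as pd

df = pd.DataFrame({'col': [
    'abc:kk__LL-z12-1234-5678-kk__z',
    'def:kk_A_LL-z12-1234-5678-kk_ss_z',
    'abc:kk_AAA_LL-z12-5678-5678-keek_st_z',
    'abc:kk_AA_LL-z12-2345-5678-ek__x',
]})
m1 = df['col'].str.startswith('abc')
# the third dash-separated token is the first -num- part
m2 = pd.to_numeric(df['col'].str.split('-').str[2], errors='coerce').between(1234, 2345)
print(df[m1 & m2])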
One way would be to use a regexp and the apply function. I find it easier to play with a regexp in a separate function than to crowd the pandas expression.
import pandas as pd
import re

def filter_rows(string):
    z = re.match(r"abc:.*-(\d+)-(\d+)-.*", string)
    if z:
        return 1234 <= int(z.groups()[0]) <= 2345
    else:
        return False
Then use the defined function to select rows
df.loc[df['col'].apply(filter_rows)]
col
0 abc:kk__LL-z12-1234-5678-kk__z
3 abc:kk_AA_LL-z12-2345-5678-ek__x
Another play on regex:
# string starts with abc, greedy search,
# then look for either 1234- or 2345-,
# followed by a 4-digit number and whatever else after
pattern = r'(^abc.*(?<=1234-|2345-)\d{4}.*)'
df.col.str.extract(pattern).dropna()
0
0 abc:kk__LL-z12-1234-5678-kk__z
3 abc:kk_AA_LL-z12-2345-5678-ek__x

pandas dataframe how to convert object to array and extract the array value

Forgive me if my question is a bit ambiguous; I'm a junior here and will try to do better.
Question.
I have a DataFrame as below, which I received from a Hive DB.
How can I extract the values 'animal', 'cat', and 'dog' from column col2?
In[]:
sample = {'col1': ['cat', 'dog'], 'col2': ['WrappedArray([animal], [cat])', 'WrappedArray([animal], [dog])']}
df = pd.DataFrame(data=sample)
df
Out[]:
col1 col2
-----------------------------------------
0 cat WrappedArray([animal], [cat])
1 dog WrappedArray([animal], [dog])
I tried to convert the object to an array and extract the data like this:
In[]: df['col2'][0][1]
Out[]: cat
If I'm wrong, I'll have to try another way; I'm a newbie with Pandas DataFrames.
Could someone let me know how this works?
Thanks in advance.
The data in the second column col2 appear to be simply strings.
The output from df['col2'][0][1] would be "r", which is the second character (index 1) of the first string. To get "cat" you would need to alter the strings and remove the 'WrappedArray([animal]...' wrapper, leaving only the actual data: "cat", "dog", etc.
You could try df['col2'].iloc[0][24:27], but that's not a general solution. It would also be brittle and unmaintainable.
If you have any control over how the data is exported from the database, try to get the data out in a cleaner format, i.e. without the WrappedArray(... stuff.
Regular expressions might be helpful here.
You could try something like this:
import re
wrapped = re.compile(r'\[(.*?)\].+\[(.*?)\]')
element = wrapped.search(df['col2'].iloc[0]).group(2)
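To pull the second bracketed value out of every row at once, a sketch applying the same pattern with pandas' str.extract (column 1 below is the second capture group, counting from zero):
import pandas as pd

sample = {'col1': ['cat', 'dog'],
          'col2': ['WrappedArray([animal], [cat])', 'WrappedArray([animal], [dog])']}
df = pd.DataFrame(data=sample)
# each capture group becomes a column: 0 -> 'animal', 1 -> 'cat'/'dog'
extracted = df['col2'].str.extract(r'\[(.*?)\].+\[(.*?)\]')
print(extracted[1].tolist())   # ['cat', 'dog']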
* Danger Danger Danger *
If you need that functionality, you could create a WrappedArray function that returns the contents as a list of strings or the like, and then call it with eval(df['col2'][0]).
Don't do this.
FYI:
Your dtypes likely defaulted to object, because you didn't specify them when you created your data frame. You can do that like this:
df = pd.DataFrame(data=sample, dtype='string')
Also, it's recommended to use .iloc when indexing DataFrames by position.
I solved it as #rkedge advised.
The data is written in a foreign language; as I said, the DataFrame has object data like 'WrappedArray([우주ごぎゅ],[ぎゃ],[한국어])'.
df_ = df['col2'].str.extractall(r'([REGEX expression]+)')
df_
0 0 우주ごぎゅ
0 1 ぎゃ
0 2 한국어
1 0 cat
2 0 animal

How to modify DataFrame column without getting SettingWithCopyWarning?

I have a DataFrame object df, and I would like to modify the job column so that all retired people are 1 and the rest 0 (as shown here):
df['job'] = df['job'].apply(lambda x: 1 if x == "retired" else 0)
But I get a warning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
Why did I get it here, though? From what I read, it applies to situations where I take a slice of rows and then a column, but here I am just modifying elements of a single column. Is there a better way to do this?
Use:
df['job']=df['job'].eq('retired').astype(int)
or
df['job']=np.where(df['job'].eq('retired'),1,0)
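A quick check of the first form on a small frame (a sketch; the job values are assumed):
import pandas as pd

df = pd.DataFrame({'job': ['retired', 'a', 'b', 'retired']})
# eq gives a boolean Series; astype(int) turns True/False into 1/0
df['job'] = df['job'].eq('retired').astype(int)
print(df['job'].tolist())   # [1, 0, 0, 1]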
So here's an example dataframe:
import pandas as pd
import numpy as np
data = {'job':['retired', 'a', 'b', 'retired']}
df = pd.DataFrame(data)
print(df)
job
0 retired
1 a
2 b
3 retired
Now, you can make use of numpy's where function:
df['job'] = np.where(df['job']=='retired', 1, 0)
print(df)
job
0 1
1 0
2 0
3 1
I would not suggest using apply here, as with a large data frame it could hurt performance.
I would prefer numpy.select or numpy.where; see the NumPy documentation for both.
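For completeness, a sketch of the numpy.select variant mentioned above (the job column and values are assumed from the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({'job': ['retired', 'a', 'b', 'retired']})
# np.select takes a list of conditions and a list of matching choices, plus a default
df['job'] = np.select([df['job'].eq('retired')], [1], default=0)
print(df['job'].tolist())   # [1, 0, 0, 1]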

Drop Pandas DataFrame rows according to a GroupBy property

I have some DataFrames with information about some elements, for instance:
my_df1=pd.DataFrame([[1,12],[1,15],[1,3],[1,6],[2,8],[2,1],[2,17]],columns=['Group','Value'])
my_df2=pd.DataFrame([[1,5],[1,7],[1,23],[2,6],[2,4]],columns=['Group','Value'])
I have used something like dfGroups = df.groupby('group').apply(my_agg).reset_index(), so now I have DataFrames with information on groups of the previous elements, say
my_df1_Group=pd.DataFrame([[1,57],[2,63]],columns=['Group','Group_Value'])
my_df2_Group=pd.DataFrame([[1,38],[2,49]],columns=['Group','Group_Value'])
Now I want to clean my groups according to properties of their elements. Let's say that I want to discard groups containing an element with Value greater than 16. So in my_df1_Group only the first group should be left, and in my_df2_Group only the second group should remain (its first group contains 23, which is greater than 16).
As I don't know how to derive the names my_df1_Group and my_df2_Group from my_df1 and my_df2 in Python (in other languages it would simply be name+"_Group" with name looping over [my_df1, my_df2]; how do you do that in Python?), I build a list of lists:
SampleList = [[my_df1,my_df1_Group],[my_df2,my_df2_Group]]
Then, I simply try this:
my_max = 16
Bad = []
for Sample in SampleList:
    for n in Sample[1]['Group']:
        df = Sample[0].loc[Sample[0]['Group'] == n]  # This is inelegant, but trying to make it work;
                                                     # working with Sample[1] in the for loop doesn't work
        if df['Value'].max() > my_max:
            Bad.append(1)
        else:
            Bad.append(0)
    Sample[1] = Sample[1].assign(Bad_Row=pd.Series(Bad))
    Sample[1] = Sample[1].query('Bad_Row == 0')
This runs without errors, but doesn't work. In particular, it doesn't add the column Bad_Row to my df, nor does it modify my DataFrame (yet the query runs smoothly even though the Bad_Row column doesn't seem to exist...). On the other hand, if I run this technique manually on a df (i.e. not in a loop), it works.
How should I do this?
Based on your comment below, I think you want to check whether a Group in your aggregated data frame has a Value in the input data greater than 16. One solution is to perform a row-wise calculation using a criterion on the input data. To accomplish this, my_func accepts a row from the aggregated data frame and the input data as a pandas groupby object. For each group in your grouped data frame, it will subset your initial data and use boolean logic to see if any of the 'Values' in your input data meet your specified criterion.
def my_func(row, grouped_df1):
    if (grouped_df1.get_group(row['Group'])['Value'] > 16).any():
        return 'Bad Row'
    else:
        return 'Good Row'

my_df1 = pd.DataFrame([[1, 12], [1, 15], [1, 3], [1, 6], [2, 8], [2, 1], [2, 17]],
                      columns=['Group', 'Value'])
my_df1_Group = pd.DataFrame([[1, 57], [2, 63]], columns=['Group', 'Group_Value'])
grouped_df1 = my_df1.groupby('Group')
my_df1_Group['Bad_Row'] = my_df1_Group.apply(lambda x: my_func(x, grouped_df1), axis=1)
Returns:
   Group  Group_Value   Bad_Row
0      1           57  Good Row
1      2           63   Bad Row
Based on dubbbdan's idea, here is code that works:
my_max = 16

def my_func(row, grouped_df1):
    if (grouped_df1.get_group(row['Group'])['Value'] > my_max).any():
        return 1
    else:
        return 0

SampleList = [[my_df1, my_df1_Group], [my_df2, my_df2_Group]]
for Sample in SampleList:
    grouped_df = Sample[0].groupby('Group')
    Sample[1]['Bad_Row'] = Sample[1].apply(lambda x: my_func(x, grouped_df), axis=1)
    Sample[1].drop(Sample[1][Sample[1]['Bad_Row'] != 0].index, inplace=True)
    Sample[1].drop(['Bad_Row'], axis=1, inplace=True)
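For reference, the same filtering can be expressed without apply by computing each group's maximum Value once and keeping only groups within the limit (a sketch using groupby().max() and isin, with the column names from the question):
import pandas as pd

my_max = 16
my_df1 = pd.DataFrame([[1, 12], [1, 15], [1, 3], [1, 6], [2, 8], [2, 1], [2, 17]],
                      columns=['Group', 'Value'])
my_df1_Group = pd.DataFrame([[1, 57], [2, 63]], columns=['Group', 'Group_Value'])

# groups whose largest Value stays within the limit
good = my_df1.groupby('Group')['Value'].max().le(my_max)
good_ids = good[good].index
print(my_df1_Group[my_df1_Group['Group'].isin(good_ids)])
#    Group  Group_Value
# 0      1           57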
