How to trim string from reverse in Pandas column - python

I have a pandas dataframe column with values such as
"assdffjhjhjh(12tytyttyt)bhhh(AS7878788)"
I need to trim it from the back, i.e. my resulting value should be AS7878788.
This is what I am doing:
newdf = pd.DataFrame(df.COLUMNNAME.str.split('(', n=1).tolist(), columns=['col1', 'col2'])
df['newcol'] = newdf['col2'].str[:10]
For the column value above this gives the output "12tytyttyt", but my intended output is "AS7878788".
Can someone help, please?

Let's try first with a regular string in pure Python:
x = "assdffjhjhjh(12tytyt)bhhh(AS7878788)"
res = x.rsplit('(', 1)[-1][:-1] # 'AS7878788'
Here we split from the right by open bracket (limiting the split count to one for efficiency), extract the last split, and extract every character except the last.
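To make the three steps concrete, here is the same expression unpacked one step at a time. The intermediate variable names (parts, tail) are only for illustration.

```python
x = "assdffjhjhjh(12tytyt)bhhh(AS7878788)"

# Step 1: split once, starting from the right, on the open bracket.
parts = x.rsplit('(', 1)   # ['assdffjhjhjh(12tytyt)bhhh', 'AS7878788)']

# Step 2: keep the piece after the last open bracket.
tail = parts[-1]           # 'AS7878788)'

# Step 3: drop the trailing close bracket.
res = tail[:-1]            # 'AS7878788'
print(res)
```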
You can then apply this in Pandas via the pd.Series.str methods:
df['col'] = df['col'].str.rsplit('(', n=1).str[-1].str[:-1]
Here's a demo:
import pandas as pd
df = pd.DataFrame({'col': ["assdffjhjhjh(12tytyt)bhhh(AS7878788)"]})
df['col'] = df['col'].str.rsplit('(', n=1).str[-1].str[:-1]
print(df)
         col
0  AS7878788
Note the solution above is very specific to the string you have presented as an example. For a more flexible alternative, consider using regex.
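As one possible sketch of the regex route, assuming the goal is the content of the last bracket pair and that rows without one should become NaN, Series.str.extract can be combined with a greedy prefix. The second and third sample rows are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'col': ["assdffjhjhjh(12tytyt)bhhh(AS7878788)",
                           "trailing(AB12)text",   # made-up row: text after the bracket
                           "no brackets here"]})   # made-up row: nothing to extract

# The greedy .* consumes as much as it can, so the captured group
# ends up being the last "(...)" pair in each string.
df['newcol'] = df['col'].str.extract(r'.*\(([^()]*)\)', expand=False)
print(df['newcol'].tolist())   # ['AS7878788', 'AB12', nan]
```

Unlike the slicing approach, this does not depend on the close bracket being the very last character.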

You can use a regex to find all instances of "values between two brackets" and then pull out the final one. For example, if we have the following data:
df = pd.DataFrame({'col': ['assdffjhjhjh(12tytyt)bhhh(AS7878788)',
                           'asjhgdv(abjhsgf)(abjsdfvhg)afdsgf']})
and we do:
df['col'] = df['col'].str.findall(r'\(([^()]+)\)').str[-1]
this gets us:
         col
0  AS7878788
1  abjsdfvhg
To explain what the regex is doing, it finds every instance of:
\(        # an open bracket
([^()]+)  # one or more characters that are neither an open nor a close bracket (captured)
\)        # a close bracket
We can see how this works if we drop the .str[-1] from the end of the previous statement: df['col'] = df['col'].str.findall(r'\(([^()]+)\)') gives us:
                    col
0  [12tytyt, AS7878788]
1  [abjhsgf, abjsdfvhg]
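The same idea can be tried in plain Python with re.findall, which makes it easy to see what happens when a string has no brackets at all. This is a minimal sketch: last_bracketed is a hypothetical helper name, and the character class is written here as [^()].

```python
import re

pattern = re.compile(r'\(([^()]+)\)')

def last_bracketed(text):
    """Return the content of the last (...) pair, or None if there is none."""
    matches = pattern.findall(text)   # one entry per bracket pair, brackets excluded
    return matches[-1] if matches else None

print(last_bracketed('assdffjhjhjh(12tytyt)bhhh(AS7878788)'))  # AS7878788
print(last_bracketed('asjhgdv(abjhsgf)(abjsdfvhg)afdsgf'))     # abjsdfvhg
print(last_bracketed('no brackets'))                           # None
```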

Related

Stripping 0's from the middle of a dataframe

Basically, data is coming into my program in the format
0xxxx000xxxx, where the x's are unique to the data that I have in another system. I'm trying to remove those 0's, as they're always in the same place.
I tried
df['item'] = df['item'].str.replace('0','')
but sometimes an x can itself be a 0, and that gets removed too. I'm not sure how to remove only the 0's in those specific positions.
Example:
Input: 099890000890
Output (Desired): 99890890
Use the str accessor for indexing:
df['item'] = df['item'].str[1:5] + df['item'].str[8:]
Or str.replace:
df['item'] = df['item'].str.replace(r'0(.{4})000(.{4})', r'\1\2', regex=True)
Output (as a new column, item2):
           item     item2
0  099890000890  99890890
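To check that both approaches leave zeros inside the data digits alone, here is a small sketch. The second value is made up, and the ^ and $ anchors are an added assumption so that the regex only rewrites strings that match the whole 12-character layout.

```python
import pandas as pd

# The second row is a made-up value whose data digits include zeros.
df = pd.DataFrame({'item': ['099890000890', '010000000200']})

# Positional: keep characters 1-4 and 8-11, whatever they contain.
df['by_slice'] = df['item'].str[1:5] + df['item'].str[8:]

# Regex: same result, but only strings matching the full layout are rewritten.
df['by_regex'] = df['item'].str.replace(r'^0(.{4})000(.{4})$', r'\1\2', regex=True)
print(df)
```

Both new columns hold 99890890 and 10000200 for these two rows.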

How to get a specifically located text from a dataframe index?

I have a dataframe whose text index labels contain a piece of information that I want to copy into a list.
I don't know the exact text (the word always changes), but I know where it is located in the label:
'point.subclase.optimum.R31.done'. R31 is the value that I would like to put in the list; the text I want is always different, but it always sits between point.subclase.optimum. and .done.
I've tried:
info_list = []
for col in df.columns:
    if ('point.subclase.optimum.' in col) and ('.done' in col):
        info_list.append(col)
But that script just puts the entire label in the list.
Does anyone know how to solve it?
Use str.extract on the index, escaping each . as \. because . is a special regex character; then remove the missing values produced by non-matching labels with dropna, and finally convert the result to a list:
df = pd.DataFrame({'a':range(3)}, index=['point.subclase.optimum.R31.done',
                                         'point.subclase',
                                         'point.subclase.optimum.R98.done'])
print (df)
                                 a
point.subclase.optimum.R31.done  0
point.subclase                   1
point.subclase.optimum.R98.done  2
L = (df.index.str.extract(r'point\.subclase\.optimum\.(.*)\.done', expand=False)
       .dropna()
       .tolist())
print (L)
['R31', 'R98']
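If the prefix and suffix are held in variables, the escaping can be left to re.escape instead of being typed by hand. The following is a plain-Python sketch over a list of labels; the names prefix, suffix and labels are assumptions made for the example.

```python
import re

labels = ['point.subclase.optimum.R31.done',
          'point.subclase',
          'point.subclase.optimum.R98.done']

prefix, suffix = 'point.subclase.optimum.', '.done'

# re.escape turns each literal dot into \. so it only matches a real dot.
pattern = re.compile(re.escape(prefix) + r'(.*)' + re.escape(suffix))

info_list = []
for label in labels:
    m = pattern.fullmatch(label)
    if m:                      # skip labels that do not follow the layout
        info_list.append(m.group(1))

print(info_list)   # ['R31', 'R98']
```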

regular expression to delete the 0 after dash sign

I have a pandas dataframe column of numbers that all have a dash in the middle, for example:
"123-045"
Is there any way to delete the zero after the dash, so that the example above becomes
"123-45"
and is it possible to apply this to the entire column?
I have used a for loop to check each digit after the dash with Python string functions, but the number of rows is large and the loop takes forever.
Try the Series.str.replace method with the regex (?<=-)0+ to remove the zeros directly after the -:
df = pd.DataFrame({'a': ["123-045"]})
df
#         a
#0  123-045
df.a.str.replace(r'(?<=-)0+', '', regex=True)
#0    123-45
#Name: a, dtype: object
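If s is your string, re.sub works as well. Match the zero explicitly: a pattern such as "-." would remove whichever character follows the dash, zero or not.
import re
s = re.sub(r"-0+", "-", s)
Or with a pandas dataframe:
df = pd.DataFrame({'key': ["123-045"]})
print (df.key.str.replace(r"-0+", "-", regex=True))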
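To see how the patterns behave on different inputs, here is a small plain-Python comparison. The second and third sample strings are made up for illustration.

```python
import re

samples = ["123-045", "123-0045", "123-145"]

# Lookbehind: remove zeros only when they directly follow the dash.
lookbehind = [re.sub(r'(?<=-)0+', '', s) for s in samples]
print(lookbehind)   # ['123-45', '123-45', '123-145']

# "-." removes whichever single character follows the dash, zero or not.
any_char = [re.sub(r'-.', '-', s) for s in samples]
print(any_char)     # ['123-45', '123-045', '123-45']
```

Only the first pattern leaves "123-145" untouched and handles more than one leading zero.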

Remove all characters except alphabet in column rows

Let's say I have a dataset, and some of its columns contain lists. The first key problem is that there are many columns with such lists, where the strings can be separated by (';') or (';;'), and a string may itself start with whitespace or even with (';').
For some cases of this problem I implemented this function:
g = [';','']
f = []
for index, row in data_a.iterrows():
    for x in row['column_1']:
        if (x in g):
            norm = row['column_1'].split(x)
            f.append(norm)
            print(norm)
        else:
            pass  # the body of this branch is not shown in the original post
It did work, but it returned duplicated rows and could not handle the other separators.
Another problem is using dummies after I changed the way the column values are stored:
column_values = data_a['column_1']
data_a.insert(loc=0, column='new_column_8', value=column_values)
dummies_new_win = pd.get_dummies(data_a['column_1'].apply(pd.Series).stack()).groupby(level=0).sum()
Instead of getting 40 columns in my case, I get 50 or 60, because I am not able to write a function that removes everything except letters from the lists. I would like to understand how to implement such a function, because strings with the same meaning can be written in different ways:
name-Jack or name(Jack)
Desired output would look like this:
nameJack nameJack
I'm not sure I understood you correctly, but to remove all non-alphanumeric characters you can use a simple regex.
Example:
import re
n = '-s;a-d'
re.sub(r'\W+', '', n)
Output: 'sad'
You can use str.replace for a pandas Series:
df = pd.DataFrame({'names': ['name-Jack','name(Jack)']})
df
#         names
# 0   name-Jack
# 1  name(Jack)
df['names'] = df['names'].str.replace(r'\W+', '', regex=True)
df
#       names
# 0  nameJack
# 1  nameJack
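Note that \W keeps digits and the underscore as well as letters. If only letters should survive, a negated character class is one option. The sketch below assumes ASCII letters are enough; the third sample value is made up to show the difference.

```python
import pandas as pd

# 'id_7:Ann' is a made-up value that contains a digit and an underscore.
s = pd.Series(['name-Jack', 'name(Jack)', 'id_7:Ann'])

# \W+ drops punctuation but keeps letters, digits and the underscore.
word_only = s.str.replace(r'\W+', '', regex=True).tolist()
print(word_only)      # ['nameJack', 'nameJack', 'id_7Ann']

# [^A-Za-z]+ drops everything that is not an ASCII letter.
letters_only = s.str.replace(r'[^A-Za-z]+', '', regex=True).tolist()
print(letters_only)   # ['nameJack', 'nameJack', 'idAnn']
```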

Find String Pattern Match in Pandas Dataframe and Return Matched String

I have a dataframe column with variable comma-separated text, and I am just trying to extract the values that also appear in another list. So my dataframe looks like this:
col1 | col2
-----------
x | a,b
listformatch = ['c', 'd', 'f', 'b']
pattern = '|'.join(listformatch)
def test_for_pattern(x):
    if re.search(pattern, x):
        return pattern
    else:
        return x
#also can use col2.str.contains(pattern) for same results
The filtering above works, but when it finds a match it returns the whole pattern, such as a|b, instead of just b. I want to create another column holding the match that was found, such as b.
Here is my final function, which still gives a warning that I wish I could solve ("UserWarning: This pattern has match groups. To actually get the groups, use str.extract."):
def matching_func(file1, file2):
    file1 = pd.read_csv(fin)
    file2 = pd.read_excel(fin1, 0, skiprows=1)
    pattern = '|'.join(file1[col1].tolist())
    file2['new_col'] = file2[col1].map(lambda x: re.search(pattern, x).group()\
                                       if re.search(pattern, x) else None)
I think I understand how pandas extract works now but probably still rusty on regex. How do I create a pattern variable to use for the below example:
df[col1].str.extract('(word1|word2)')
Instead of having the words in the argument, I want to create variable as pattern = 'word1|word2' but that won't work because of the way the string is being created.
My final and preferred version with vectorized string method in pandas 0.13:
Using values from one column to extract from a second column:
df[col1].str.extract('({})'.format('|'.join(df[col2])))
You might like to use extract, or one of the other vectorised string methods:
In [11]: s = pd.Series(['a', 'a,b'])
In [12]: s.str.extract('([cdfb])', expand=False)
Out[12]:
0    NaN
1      b
dtype: object
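For words longer than one character, the pattern can be built from the list itself. This is a sketch under the assumption that the words should be matched literally, which is why each one is passed through re.escape; the third sample value is made up.

```python
import re
import pandas as pd

listformatch = ['c', 'd', 'f', 'b']
s = pd.Series(['a', 'a,b', 'x,d,b'])   # the last value is made up

# Escape each word, join with |, and wrap everything in one capture group.
pattern = '({})'.format('|'.join(map(re.escape, listformatch)))

# expand=False returns a Series holding the first match per row (NaN if none).
matched = s.str.extract(pattern, expand=False)
print(matched.tolist())   # [nan, 'b', 'd']
```

Only the first match in each row is returned; Series.str.findall with the same pattern would return all of them.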
