Pandas strip string from column - python

I have a csv file that looks as follows:
code  timestamp  message  name  Id              action
1     time       message  name  some text - id  action
I would like to target the Id column and, whenever it contains a -, strip everything from the beginning of the string up to (and including) the space after the -.
Basically, this is what I would like as output:
code  timestamp  message  name  Id  action
1     time       message  name  id  action
Looking at some documentation, I found this solution:
df['id'] = df['id'].map(lambda x: x.lstrip('-').rstrip(' - '))
but this just strips the given characters from the left and right ends of the string, which is not the result I want.
Can anyone help me understand how I can target that space after the - and remove everything before it, please?

Given:
Col1
0 some text - id
1 remove this text - 012345
Try:
import pandas as pd
mydf = pd.DataFrame({'Col1':['some text - id', 'remove this text - 012345']})
mydf['Col1'] = mydf['Col1'].str.extract(r'- (.*)')
print(mydf)
Outputs:
Col1
0 id
1 012345
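Note that the asker only wants to change rows that actually contain a -; with str.extract, rows without a match come back as NaN. A small variant (my sketch, not part of the original answer) keeps such rows unchanged by falling back to the original value:
# expand=False makes extract return a Series; fillna restores rows with no ' - '.
mydf['Col1'] = mydf['Col1'].str.extract(r'- (.*)', expand=False).fillna(mydf['Col1'])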

You can first replace everything up to the last - and then strip to remove the extra spaces. In recent pandas versions, pass regex=True so the pattern is treated as a regex rather than a literal string:
df["id"].str.replace(r".*-", "", regex=True).str.strip()

You can use str.split() to split the strings on - and then get the right part of the string, as follows:
df['id'] = df['id'].str.split(r'- ').str[-1]
If there can be more than one occurrence of - in the string and you want to split only on the first occurrence, you can use:
df['id'] = df['id'].str.split(r'- ', n=1).str[-1]
Demo
Splitting on every occurrence and taking the last part:
df = pd.DataFrame({'id': ['some text - id', 'another text - 2nd hyphen - abc']})
                                id
0                   some text - id
1  another text - 2nd hyphen - abc
df['id'] = df['id'].str.split(r'- ').str[-1]
    id
0   id
1  abc
Splitting only on the first occurrence:
df = pd.DataFrame({'id': ['some text - id', 'another text - 2nd hyphen - abc']})
df['id'] = df['id'].str.split(r'- ', n=1).str[-1]
                 id
0                id
1  2nd hyphen - abc

Related

Split period into two dates when the date has the same delimiter

Goal: derive period_start and period_end from the column period, whose values are (roughly) of the form
dd.mm.yyyy - dd.mm.yyyy
period
28-02-2022 - 30.09.2022
31.01.2022 - 31.12.2022
28.02.2019 - 30-04-2020
20.01.2019-22.02.2020
19.03.2020- 24.05.2021
13.09.2022-12-10.2022
df[['period_start','period_end']] = df['period'].str.split('-', expand=True)
will not work, because splitting on a bare - also splits inside the dates themselves.
Expected output
period_start period_end
28.02.2022 30.09.2022
31.01.2022 31.12.2022
28.02.2019 30.04.2020
20.01.2019 22.02.2020
19.03.2020 24.05.2021
13.09.2022 12.10.2022
We can use str.extract here for one option, matching a date-like token on each side of the separator and then normalizing any dashes left inside the dates to dots:
df[["period_start", "period_end"]] = (df["period"]
    .str.extract(r'(\d{2}[.-]\d{2}[.-]\d{4})\s*-\s*(\d{2}[.-]\d{2}[.-]\d{4})')
    .replace(r'-', '.', regex=True))
The problem is that you were trying to split on the dash alone, and there are many dashes in a single row. This works:
df[['period_start','period_end']] = df['period'].str.split(' - ', expand=True)
because we split on space + dash + space (so it only handles rows that actually have spaces around the separating dash).
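A quick sketch of that spaced-dash split on two of the rows above (my demo; note the second row has no spaces around its dash, so it is left unsplit and the second column comes back as None):
import pandas as pd

df = pd.DataFrame({'period': ['31.01.2022 - 31.12.2022', '20.01.2019-22.02.2020']})
df[['period_start', 'period_end']] = df['period'].str.split(' - ', expand=True)
print(df[['period_start', 'period_end']])
#             period_start  period_end
# 0             31.01.2022  31.12.2022
# 1  20.01.2019-22.02.2020        None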
Use a regex to split on the dash with surrounding spaces:
out = (df['period'].str.split(r'\s+-\s+', expand=True)
       .set_axis(['period_start', 'period_end'], axis=1))
or to remove the column and create new ones:
df[['period_start', 'period_end']] = df.pop('period').str.split(r'\s+-\s+',expand=True)
output (first three rows shown):
  period_start  period_end
0   28-02-2022  30.09.2022
1   31.01.2022  31.12.2022
2   28.02.2019  30-04-2020
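Either way, the split alone still leaves dashes inside some of the dates (e.g. 30-04-2020), while the expected output uses dots throughout. A short follow-up, assuming the dotted form is wanted (my addition, echoing the extract answer above):
# Normalize any dashes left inside the date strings to dots.
df[['period_start', 'period_end']] = df[['period_start', 'period_end']].replace(r'-', '.', regex=True)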

ValueError: Columns must be same length as key (Split column in multiple columns using python)

The question has been asked a lot; however, I'm still not close to the solution. I have a column which looks something like this:
Line Item
Vietnam_Vietnamese_display 1
Indonesia_Tamil__1
India_Tamil_Video_5
What I want to do is separate the country and language into different columns, like:
Country    Language
Vietnam    Vietnamese_display 1
Indonesia  Tamil__1
India      Tamil_Video_5
I'm using the following code to get it done; however, there are a lot of factors that need to be taken into account and I'm not sure how to handle them:
df[['Country', 'Language']] = df['Line Item'].str.split('_\s+', n=1, expand=True)
How can I split on just the first "_" to get my desired results? Thanks
You may use:
df[['Country', 'Language']] = df['Line Item'].str.extract(r'^_*([^_]+)_(.+)')
Details
^ - start of string
_* - 0 or more underscores
([^_]+) - Capturing group 1: any one or more chars other than _
_ - a _ char
(.+) - Group 2: any one or more chars other than line break chars.
Pandas test:
df = pd.DataFrame({'Line Item': ['Vietnam_Vietnamese_display 1','Indonesia_Tamil__1','India_Tamil_Video_5']})
df[['Country', 'Language']] = df['Line Item'].str.extract(r'^_*([^_]+)_(.+)')
df
# Line Item Country Language
# 0 Vietnam_Vietnamese_display 1 Vietnam Vietnamese_display 1
# 1 Indonesia_Tamil__1 Indonesia Tamil__1
# 2 India_Tamil_Video_5 India Tamil_Video_5
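For this particular data, where no value starts with an underscore, a plain split on the first underscore would also work; a minimal sketch under that assumption (not from the original answer):
import pandas as pd

df = pd.DataFrame({'Line Item': ['Vietnam_Vietnamese_display 1', 'Indonesia_Tamil__1', 'India_Tamil_Video_5']})
# n=1 splits only on the first underscore and keeps the rest intact.
df[['Country', 'Language']] = df['Line Item'].str.split('_', n=1, expand=True)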

python - Replace first five characters in a column with asterisks

I have a column called SSN in a CSV file with values like this
289-31-9165
I need to loop through the values in this column and replace the first five characters so it looks like this
***-**-9165
Here's the code I have so far:
emp_file = "Resources/employee_data1.csv"
emp_pd = pd.read_csv(emp_file)
new_ssn = emp_pd["SSN"].str.replace([:5], "*")
emp_pd["SSN"] = new_ssn
How do I loop through the values and replace just the first five numbers with asterisks, keeping the hyphens as-is?
Similar to Mr. Me's answer, this will instead remove everything before the first 6 characters and replace them with your new format.
emp_pd["SSN"] = emp_pd["SSN"].apply(lambda x: "***-**" + x[6:])
You can simply achieve this with the replace() method.
Example dataframe (borrowed from @AkshayNevrekar):
>>> df
ssn
0 111-22-3333
1 121-22-1123
2 345-87-3425
Result:
>>> df.replace(r'^\d{3}-\d{2}', "***-**", regex=True)
ssn
0 ***-**-3333
1 ***-**-1123
2 ***-**-3425
OR
>>> df.ssn.replace(r'^\d{3}-\d{2}', "***-**", regex=True)
0 ***-**-3333
1 ***-**-1123
2 ***-**-3425
Name: ssn, dtype: object
OR:
df['ssn'] = df['ssn'].str.replace(r'^\d{3}-\d{2}', "***-**", regex=True)
Put your asterisks in front, then grab the last 4 digits. Note that you need the .str accessor to slice each string (a plain [-4:] would slice the rows of the Series instead):
new_ssn = '***-**-' + emp_pd["SSN"].str[-4:]
You can use a regex:
import re
import pandas as pd

df = pd.DataFrame({'ssn': ['111-22-3333', '121-22-1123', '345-87-3425']})

def func(x):
    return re.sub(r'\d{3}-\d{2}', '***-**', x)

df['ssn'] = df['ssn'].apply(func)
print(df)
Output:
ssn
0 ***-**-3333
1 ***-**-1123
2 ***-**-3425

Python Pandas: Dataframe is not updating using string methods

I'm trying to update the strings in a .csv file that I am reading using Pandas. The .csv contains a column named 'About' which holds the rows of data I want to manipulate.
I've already used the .str methods to update the values, but the changes are not reflected in the exported .csv file. Some of my code can be seen below.
import pandas as pd
df = pd.read_csv('data.csv')
df.About.str.lower() #About is the column I am trying to update
df.About.str.replace('[^a-zA-Z ]', '')
df.to_csv('newdata.csv')
You need to assign the output back to the column. It is also possible to chain both operations together, because they work on the same About column; and since the values are converted to lowercase first, the regex can be simplified to remove everything that is not a lowercase letter or a space. Pass regex=True explicitly, as newer pandas treats the pattern as a literal string by default:
df = pd.read_csv('data.csv')
df.About = df.About.str.lower().str.replace('[^a-z ]', '', regex=True)
df.to_csv('newdata.csv', index=False)
Sample:
df = pd.DataFrame({'About':['AaSD14%', 'SDD Aa']})
df.About = df.About.str.lower().str.replace('[^a-z ]', '', regex=True)
print(df)
About
0 aasd
1 sdd aa
import pandas as pd

columns = ['About']
data = ["ALPHA", "OMEGA", "ALpHOmGA"]
df = pd.DataFrame(data, columns=columns)
df.About = df.About.str.lower().str.replace('[^a-zA-Z ]', '', regex=True)
print(df)
OUTPUT:
      About
0     alpha
1     omega
2  alphomga
Example Dataframe:
>>> df
About
0 JOHN23
1 PINKO22
2 MERRY jen
3 Soojan San
4 Remo55
Solution: another way, using a compiled regex with flags:
>>> import re
>>> regex_pat = re.compile(r'[^a-z]+$', flags=re.I)
>>> df.About.str.lower().str.replace(regex_pat, '', regex=True)
0 john
1 pinko
2 merry jen
3 soojan san
4 remo
Name: About, dtype: object
Explanation:
[^a-z]+ - matches one or more characters not present in the range a to z
+ Quantifier - matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
$ - asserts position at the end of the string
(the re.I flag makes the character class case-insensitive; it is harmless here since the strings are lowercased first)

pandas: Replace string is not replacing targeted substring

I am iterating over a list of strings from dataframe1, checking whether dataframe2 contains any of those strings and replacing them.
for index, row in nlp_df.iterrows():
    print(row['x1'])
    string1 = row['x1'].replace("(", "\\(")
    string1 = string1.replace(")", "\\)")
    string1 = string1.replace("[", "\\[")
    string1 = string1.replace("]", "\\]")
    nlp2_df['title'] = nlp2_df['title'].replace(string1, "")
In order to do this, I iterate with the code shown above, checking for and replacing any string found in df1.
The output below shows the strings in df1:
wait_timeout
interactive_timeout
pool_recycle
....
__all__
folder_name
re.compile('he(lo')
The output below shows df2 after the replacement attempt:
0 have you tried watching the traffic between th...
1 /dev/cu.xxxxx is the "callout" device, it's wh...
2 You'll want the struct package.\r\r\n
In the df2 output, strings like /dev/cu.xxxxx should have been replaced during the iteration, but as shown they were not removed. However, when I tried nlp2_df['title'] = nlp2_df['title'].replace("/dev/cu.xxxxx","") directly, it was removed successfully.
Is there a reason why writing the string directly works, but looping with a variable does not?
IIUC you can simply use regular expressions to escape all the brackets in one pass:
nlp2_df['title'] = nlp2_df['title'].str.replace(r'([\(\)\[\]])', r'\\\1', regex=True)
PS: you don't need the for loop at all...
Demo:
In [15]: df
Out[15]:
title
0 aaa (bbb) ccc
1 A [word] ...
In [16]: df['new'] = df['title'].str.replace(r'([\(\)\[\]])', r'\\\1', regex=True)
In [17]: df
Out[17]:
title new
0 aaa (bbb) ccc aaa \(bbb\) ccc
1 A [word] ... A \[word\] ...
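For the original looping approach, the underlying issue is twofold: Series.replace only swaps a cell on an exact full-value match unless regex=True is passed, and hand-escaping only the brackets misses other regex metacharacters such as the dots in /dev/cu.xxxxx. A hedged sketch using re.escape instead of the manual replace chain (nlp_df and nlp2_df here are small stand-ins for the asker's frames):
import re
import pandas as pd

# Stand-ins for the asker's dataframes.
nlp_df = pd.DataFrame({'x1': ['/dev/cu.xxxxx', "re.compile('he(lo')"]})
nlp2_df = pd.DataFrame({'title': ['/dev/cu.xxxxx is the "callout" device', 'nothing to remove here']})

for pattern in nlp_df['x1']:
    # re.escape neutralizes every metacharacter, so the pattern matches literally.
    nlp2_df['title'] = nlp2_df['title'].str.replace(re.escape(pattern), '', regex=True)

print(nlp2_df['title'][0])  # ' is the "callout" device'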
