I have a dataframe with multiple columns and I want to separate the numbers from the letters with a space in one column.
In this example I want to add space in the third column.
do you know how to do so?
import pandas as pd
data = {'first_column': ['first_value', 'second_value', 'third_value'],
'second_column': ['first_value', 'second_value', 'third_value'],
'third_column':['AA6589', 'GG6589', 'BXV6589'],
'fourth_column':['first_value', 'second_value', 'third_value'],
}
df = pd.DataFrame(data)
print (df)
Use str.replace with a short regex:
df['third_column'] = df['third_column'].str.replace(r'(\D+)(\d+)',
r'\1 \2', regex=True)
regex:
(\D+) # capture one or more non-digits
(\d+) # capture one or more digits
replace with \1 \2 (first captured group, then space, then second captured group).
Alternative with lookarounds:
df['third_column'] = df['third_column'].str.replace(r'(?<=\D)(?=\d)',
' ', regex=True)
meaning: insert a space at any position in-between a non-digit and a digit.
Similarly you could extract the digits and non digit characters from your 'third_column' and place them together with a space in between:
df.assign(
third_column=df["third_column"].str.extract(r'(\D+)') + " " + df["third_column"].str.extract(r'(\d+)')
)
first_column second_column third_column fourth_column
0 first_value first_value AA 6589 first_value
1 second_value second_value GG 6589 second_value
2 third_value third_value BXV 6589 third_value
Related
Goal: derive period start and period end from the column period, in the form of
dd.mm.yyyy - dd.mm.yyyy
period
28-02-2022 - 30.09.2022
31.01.2022 - 31.12.2022
28.02.2019 - 30-04-2020
20.01.2019-22.02.2020
19.03.2020- 24.05.2021
13.09.2022-12-10.2022
df[['period_start,'period_end]]= df['period'].str.split('-',expand=True)
will not work.
Expected output
period_start period_end
31.02.2022 30.09.2022
31.01.2022 31.12.2022
28.02.2019 30.04.2020
20.01.2019 22.02.2020
19.03.2020 24.05.2021
13.09.2022 12.10.2022
We can use str.extract here for one option:
df[["period_start", "period_end"]] = df["period"].str.extract(r'(\S+)\s*-\s*(\S+)')
.str.replace(r'-', '.')
the problem is you were trying to split on dash, and there's many dashes in the one row, this work :
df[['period_start','period_end']]= df['period'].str.split(' - ',expand=True)
because we split on space + dash
Use a regex to split on the dash with surrounding spaces:
out = (df['period'].str.split(r'\s+-\s+',expand=True)
.set_axis(['period_start', 'period_end'], axis=1)
)
or to remove the column and create new ones:
df[['period_start', 'period_end']] = df.pop('period').str.split(r'\s+-\s+',expand=True)
output:
period_start period_end
0 31-02-2022 30.09.2022
1 31.01.2022 31.12.2022
2 28.02.2019 30-04-2020
I have below the code
import pandas as pd
df = pd.DataFrame({'A': ['$5,756', '3434', '$45', '1,344']})
pattern = ','.join(['$', ','])
df['A'] = df['A'].str.replace('$|,', '', regex=True)
print(df['A'])
What I am trying to remove every occurrence of '$' or ','... so I am trying to replace with blank..
But its replacing only ,
Output I am getting
0 $5756
1 3434
2 $45
3 1344$
it should be
0 5756
1 3434
2 45
3 1344
What I am doing wrong
Any help appreciated
Thanks
Use:
import pandas as pd
df = pd.DataFrame({'A': ['$5,756', '3434', '$45', '1,344']})
df['A'] = df['A'].str.replace('[$,]', '', regex=True)
print(df)
Output
A
0 5756
1 3434
2 45
3 1344
The problem is that the character $ has a special meaning in regular expressions. From the documentation (emphasis mine):
$
Matches the end of the string or just before the newline at the end
of the string, and in MULTILINE mode also matches before a newline.
foo matches both ‘foo’ and ‘foobar’, while the regular expression foo$
matches only ‘foo’. More interestingly, searching for foo.$ in
'foo1\nfoo2\n' matches ‘foo2’ normally, but ‘foo1’ in MULTILINE mode;
searching for a single $ in 'foo\n' will find two (empty) matches: one
just before the newline, and one at the end of the string.mode;
searching for a single $ in 'foo\n' will find two (empty) matches: one
just before the newline, and one at the end of the string.
So you need to escape the character or put it inside a character class.
As an alternative use:
df['A'].str.replace('\$|,', '', regex=True) # note the escaping \
If you only have integer-like numbers an easy option is to remove all but digits \D, then you don't have to deal with other special regex characters like $:
df['A'] = df['A'].str.replace(r'\D', '', regex=True)
output:
A
0 5756
1 3434
2 45
3 1344
It might be useful for you:
import pandas as pd
df = pd.DataFrame({'A': ['$5,756', '3434', '$45', '1,344']})
df['A'] = df['A'].str.replace('$', '', regex=True)
print(df['A'])
I need to remove whitespaces in pandas df column. My data looks like this:
industry magazine
Home "Goodhousekeeping.com"; "Prevention.com";
Fashion "Cosmopolitan"; " Elle"; "Vogue"
Fashion " Vogue"; "Elle"
Below is my code:
# split magazine column values, create a new column in df
df['magazine_list'] = dfl['magazine'].str.split(';')
# stip the first whitespace from strings
df.magazine_list = df.magazine_list.str.lstrip()
This returns all NaN, I have also tried:
df.magazine = df.magazine.str.lstrip()
This didn't remove the white spaces either.
Use list comprehension with strip of splitted values, also strip values before split for remove trailing ;, spaces and " values:
f = lambda x: [y.strip('" ') for y in x.strip(';" ').split(';')]
df['magazine_list'] = df['magazine'].apply(f)
print (df)
industry magazine \
0 Home Goodhousekeeping.com; "Prevention.com";
1 Fashion Cosmopolitan; " Elle"; "Vogue"
2 Fashion Vogue; "Elle
magazine_list
0 [Goodhousekeeping.com, Prevention.com]
1 [Cosmopolitan, Elle, Vogue]
2 [Vogue, Elle]
Jezrael provides a good solution. It is useful to know that pandas has string accessors for similar operations without the need of list comprehensions. Normally a list comprehension is faster, but depending on the use case using pandas built-in functions could be more readable or simpler to code.
df['magazine'] = (
df['magazine']
.str.replace(' ', '', regex=False)
.str.replace('"', '', regex=False)
.str.strip(';')
.str.split(';')
)
Output
industry magazine
0 Home [Goodhousekeeping.com, Prevention.com]
1 Fashion [Cosmopolitan, Elle, Vogue]
2 Fashion [Vogue, Elle]
I have a column called SSN in a CSV file with values like this
289-31-9165
I need to loop through the values in this column and replace the first five characters so it looks like this
***-**-9165
Here's the code I have so far:
emp_file = "Resources/employee_data1.csv"
emp_pd = pd.read_csv(emp_file)
new_ssn = emp_pd["SSN"].str.replace([:5], "*")
emp_pd["SSN"] = new_ssn
How do I loop through the value and replace just the first five numbers (only) with asterisks and keep the hiphens as is?
Similar to Mr. Me, this will instead remove everything before the first 6 characters and replace them with your new format.
emp_pd["SSN"] = emp_pd["SSN"].apply(lambda x: "***-**" + x[6:])
You can simply achieve this with replace() method:
Example dataframe :
borrows from #AkshayNevrekar..
>>> df
ssn
0 111-22-3333
1 121-22-1123
2 345-87-3425
Result:
>>> df.replace(r'^\d{3}-\d{2}', "***-**", regex=True)
ssn
0 ***-**-3333
1 ***-**-1123
2 ***-**-3425
OR
>>> df.ssn.replace(r'^\d{3}-\d{2}', "***-**", regex=True)
0 ***-**-3333
1 ***-**-1123
2 ***-**-3425
Name: ssn, dtype: object
OR:
df['ssn'] = df['ssn'].str.replace(r'^\d{3}-\d{2}', "***-**", regex=True)
Put your asterisks in front, then grab the last 4 digits.
new_ssn = '***-**-' + emp_pd["SSN"][-4:]
You can use regex
df = pd.DataFrame({'ssn':['111-22-3333','121-22-1123','345-87-3425']})
def func(x):
return re.sub(r'\d{3}-\d{2}','***-**', x)
df['ssn'] = df['ssn'].apply(func)
print(df)
Output:
ssn
0 ***-**-3333
1 ***-**-1123
2 ***-**-3425
I'm trying to update the strings in a .csv file that I am reading using Pandas. The .csv contains the column name 'about' which contains the rows of data I want to manipulate.
I've already used str. to update but it is not reflecting in the exported .csv file. Some of my code can be seen below.
import pandas as pd
df = pd.read_csv('data.csv')
df.About.str.lower() #About is the column I am trying to update
df.About.str.replace('[^a-zA-Z ]', '')
df.to_csv('newdata.csv')
You need assign output to column, also is possible chain both operation together, because working with same column About and because values are converted to lowercase, is possible change regex to replace not uppercase:
df = pd.read_csv('data.csv')
df.About = df.About.str.lower().str.replace('[^a-z ]', '')
df.to_csv('newdata.csv', index=False)
Sample:
df = pd.DataFrame({'About':['AaSD14%', 'SDD Aa']})
df.About = df.About.str.lower().str.replace('[^a-z ]', '')
print (df)
About
0 aasd
1 sdd aa
import pandas as pd
import numpy as np
columns = ['About']
data = ["ALPHA","OMEGA","ALpHOmGA"]
df = pd.DataFrame(data, columns=columns)
df.About = df.About.str.lower().str.replace('[^a-zA-Z ]', '')
print(df)
OUTPUT:
Example Dataframe:
>>> df
About
0 JOHN23
1 PINKO22
2 MERRY jen
3 Soojan San
4 Remo55
Solution:,another way Using a compiled regex with flags
>>> df.About.str.lower().str.replace(regex_pat, '')
0 john
1 pinko
2 merry jen
3 soojan san
4 remo
Name: About, dtype: object
Explanation:
Match a single character not present in the list below [^a-z]+
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy) a-z a single character in
the range between a (index 97) and z (index 122) (case sensitive)
$ asserts position at the end of a line