how to remove whitespace from string in pandas column - python

I need to remove whitespace from a pandas df column. My data looks like this:
industry magazine
Home "Goodhousekeeping.com"; "Prevention.com";
Fashion "Cosmopolitan"; " Elle"; "Vogue"
Fashion " Vogue"; "Elle"
Below is my code:
# split magazine column values, create a new column in df
df['magazine_list'] = df['magazine'].str.split(';')
# strip the leading whitespace from the strings
df.magazine_list = df.magazine_list.str.lstrip()
This returns all NaN. I have also tried:
df.magazine = df.magazine.str.lstrip()
This didn't remove the whitespace either.

Use a list comprehension that strips each of the split values; also strip the whole value before splitting to remove the trailing ";", spaces and " characters. (Your original attempt returned all NaN because after str.split the column contains lists, and the .str string methods return NaN for non-string values.)
f = lambda x: [y.strip('" ') for y in x.strip(';" ').split(';')]
df['magazine_list'] = df['magazine'].apply(f)
print (df)
industry magazine \
0 Home Goodhousekeeping.com; "Prevention.com";
1 Fashion Cosmopolitan; " Elle"; "Vogue"
2 Fashion Vogue; "Elle
magazine_list
0 [Goodhousekeeping.com, Prevention.com]
1 [Cosmopolitan, Elle, Vogue]
2 [Vogue, Elle]

Jezrael provides a good solution. It is useful to know that pandas has string accessors for similar operations without the need for list comprehensions. Normally a list comprehension is faster, but depending on the use case the pandas built-in string methods can be more readable or simpler to code.
df['magazine'] = (
df['magazine']
.str.replace(' ', '', regex=False)
.str.replace('"', '', regex=False)
.str.strip(';')
.str.split(';')
)
Output
industry magazine
0 Home [Goodhousekeeping.com, Prevention.com]
1 Fashion [Cosmopolitan, Elle, Vogue]
2 Fashion [Vogue, Elle]
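The two literal replaces can also be collapsed into a single regex replace; a minor variation on the same idea:
df['magazine'] = (
    df['magazine']
    .str.replace(r'[" ]', '', regex=True)  # drop quotes and spaces in one pass
    .str.strip(';')
    .str.split(';')
)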

Related

Regular Expression Behavior in R unnest_token() v.s Python pandas str.split()

I want to replicate a result similar to df_long below using Python pandas. This is the R code:
df <- data.frame("id" = 1, "author" = 'trump', "Tweet" = "RT #kin2souls: #KimStrassel Anyone that votes")
unnest_regex <- "([^A-Za-z_\\d##']|'(?![A-Za-z_\\d##]))"
df_long <- df %>%
  unnest_tokens(
    word, Tweet, token = "regex", pattern = unnest_regex)
If I understand correctly, the unnest_regex is written in a way that it also finds numbers (among whitespace and a few punctuation marks). I don't get why R treats a number in a string, for example the 2 in "#kin2souls", as a non-match. Because of that, df_long keeps #kin2souls as a row of its own. However, when I try to replicate this in pandas:
unnest_regex = r"([^A-Za-z_\\d##']|'(?![A-Za-z_\\d##]))"
df_long = df.assign(word=df['Tweet'].str.split(unnest_regex)).explode('word')
df_long.drop("Tweet", axis=1, inplace=True)
It will split the "#kin2souls" string into "#kin" and "souls" as separate rows. Furthermore, since the unnest_regex uses capturing parenthesis, in Python, I modify it to:
unnest_regex = r"[^A-Za-z_\\d##']|'(?![A-Za-z_\\d##])"
This is to avoid empty strings in the result. I wonder if that is also a contributing factor. However, the split at "2" still happens. Could anyone propose a solution in Python and potentially explain why R behaves this way? Thank you!
Here's the data in Python:
data = {'id':[1], "author":["trump"], "Tweet": ["RT #kin2souls: #KimStrassel Anyone that votes"]}
df = pd.DataFrame.from_dict(data)
And the expected result:
data_long = {'id':[1,1,1,1,1,1], "author":["trump","trump","trump","trump","trump","trump"], "word": ["rt", "#kin2souls", "#kimstrassel", "anyone", "that", "votes"]}
df_long = pd.DataFrame.from_dict(data_long)
A combination of str.split and explode should replicate your output:
(df
.assign(Tweet=df.Tweet.str.lower().str.split(r"[:\s]"))
.explode("Tweet")
.query('Tweet != ""')
.reset_index(drop=True)
)
id author Tweet
0 1 trump rt
1 1 trump #kin2souls
2 1 trump #kimstrassel
3 1 trump anyone
4 1 trump that
5 1 trump votes
I took advantage of the fact that the text is delimited by spaces and the occasional ":".
Alternatively, you could use str.extractall - I feel it is a bit longer though:
(
df.set_index(["id", "author"])
.Tweet.str.lower()
.str.extractall(r"\s*([a-z#\d]+)[:\s]*")
.droplevel(-1)
.rename(columns={0: "Tweet"})
.reset_index()
)
Not sure how unnest_tokens works with regex - maybe someone else can resolve that.
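One plausible explanation for the split at "2" (not part of the original answer): in R, "\\d" inside a normal string literal reaches the regex engine as \d, so digits stay inside the negated character class and are never split points; in a Python raw string, r"\\d" is a literal backslash followed by a d, so digits fall outside the class and every digit becomes a delimiter. Writing \d with a single backslash in the raw string restores the R behaviour; a rough sketch:
import pandas as pd

data = {'id': [1], "author": ["trump"], "Tweet": ["RT #kin2souls: #KimStrassel Anyone that votes"]}
df = pd.DataFrame.from_dict(data)

# single backslash: \d keeps digits inside the negated character class, as in the R version
unnest_regex = r"[^A-Za-z_\d##']|'(?![A-Za-z_\d##])"
df_long = (df.assign(word=df['Tweet'].str.lower().str.split(unnest_regex))
             .explode('word')
             .query('word != ""')
             .drop(columns='Tweet')
             .reset_index(drop=True))
print(df_long)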

How to add a new column with multiple string contain conditions in python pandas other than using np.where?

I was trying to add a new column by checking multiple string-contains conditions using the str.contains() and np.where() functions. This way I get the final result I want.
But the code is very lengthy. Are there any better ways to implement this using pandas functions?
df5['new_column'] = np.where(df5['sr_description'].str.contains('gross to net', case=False).fillna(False),1,
np.where(df5['sr_description'].str.contains('gross up', case=False).fillna(False),1,
np.where(df5['sr_description'].str.contains('net to gross',case=False).fillna(False),1,
np.where(df5['sr_description'].str.contains('gross-to-net',case=False).fillna(False),1,
np.where(df5['sr_description'].str.contains('gross-up',case=False).fillna(False),1,
np.where(df5['sr_description'].str.contains('net-to-gross',case=False).fillna(False),1,
np.where(df5['sr_description'].str.contains('gross 2 net',case=False).fillna(False),1,
np.where(df5['sr_description'].str.contains('net 2 gross',case=False).fillna(False),1,
np.where(df5['sr_description'].str.contains('gross net',case=False).fillna(False),1,
np.where(df5['sr_description'].str.contains('net gross',case=False).fillna(False),1,
np.where(df5['sr_description'].str.contains('memo code',case=False).fillna(False),1,0)))))))))))
The output will be:
if 'sr_description' contains any of those strings, new_column gets a 1, else 0.
Maybe store the multiple string conditions in a list, then read and apply them in a function?
Edit:
Sample Data:
sr_description new_column
something with gross up. 1
without those words. 0
or with Net to gross 1
if not then we give a '0' 0
Here is what I came up with.
Code:
import re
import pandas as pd
import numpy as np
# list of the strings we want to check for
check_strs = ['gross to net', 'gross up', 'net to gross', 'gross-to-net', 'gross-up', 'net-to-gross', 'gross 2 net',
'net 2 gross', 'gross net', 'net gross', 'memo code']
# From the re.escape() docs: Escape special characters in pattern.
# This is useful if you want to match an arbitrary literal string that may have regular expression metacharacters in it.
check_strs_esc = [re.escape(curr_val) for curr_val in check_strs]
# join all the escaped strings as a single regex
check_strs_re = '|'.join(check_strs_esc)
test_col_1 = ['something with gross up.', 'without those words.', np.nan, 'or with Net to gross', 'if not then we give a "0"']
df_1 = pd.DataFrame(data=test_col_1, columns=['sr_description'])
df_1['contains_str'] = df_1['sr_description'].str.contains(check_strs_re, case=False, na=False)
print(df_1)
Result:
sr_description contains_str
0 something with gross up. True
1 without those words. False
2 NaN False
3 or with Net to gross True
4 if not then we give a "0" False
Note that numpy isn't required for the solution to function, I'm just using it to test a NaN value.
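If you need the 1/0 values from your expected output rather than booleans, the boolean column can simply be cast to int, for example:
df_1['new_column'] = df_1['contains_str'].astype(int)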
Let me know if anything is unclear or if you have any questions! :)

How do you go through a list of strings using the series.str.contains function?

I have credit card charge data that has a column containing the description for the charge. I also created a dictionary that contains categories for different charges. For example, I have a category called grocery expenses (value) and regular expressions (Ralphs, Target). I combined my values in a string with the separator |.
I am using the Series.str.contains(pat,case=True,flags=0,na=nan,regex=True) function to see if the string in each index contains my regular expressions.
# libraries needed
# import pandas as pd
# import re
joined_string=['|'.join(value) for value in values]
the_list=joined_string
Example output: the_list=[Gas|Internet|Water|Electricity,VONS|RALPHS|Ralphs|PAVILIONS|FOOD4LESS|TRADER JOE'S|GROCERY OUTLET|FOOD 4 LESS|SPROUTS|MARKET#WORK"]
df['Description']='FOOD4LESS 0508 0000FULLERTON CA'
The Dataframe contains a column of different charges on your credit card
for character_sequence in the_list:
    boolean_output=df['Description'].str.contains(character_sequence,regex=True)
For some reason, the code is not going through each character sequence in my list. It only goes through one character sequence, but I need it to go through multiple character sequences.
Since there is no data to compare with, I will just present some dummy data.
import pandas as pd
names = ['Adam','Barry','Chuck','Dennis','Elon','Fridman','George','Harry']
df = pd.DataFrame(names, columns=['Names'])
# Apply regex and save to column: Regex
df['Regex'] = df.Names.str.contains('[ae]', regex=True)
df
Output:
Names Regex
0 Adam True
1 Barry True
2 Chuck False
3 Dennis True
4 Elon False
5 Fridman True
6 George True
7 Harry True
Solution with another Example akin to the Problem
First, your the_list variable is not correct. Assuming it is a typo, I present my solution here. Please note that a regex (regular expression), when applied to a column of data, essentially means that you are trying to find some pattern. How would you know in the first place whether your pattern recognition is working fine? You would need a few data points to at least validate the regex results. Since you only provided one line of data, I will make some dummy data here and test whether the regex produces the expected results.
Note: Please check the Data Preparation section below to see the data so you can replicate and test the solution.
import pandas as pd
import re
# Make regex string from the list of target keywords
regex_expression = '|'.join(the_list)
# Make dataframe from the list of descriptions
# --> see under Data section of the solution.
df = pd.DataFrame(descriptions, columns=['Description'])
# Regex search results for a subset of
# target keywords: "Gas|Internet|Water|Electricity,VONS"
df['Regex_A'] = df.Description.str.contains("Gas|Internet|Water|Electricity,VONS", regex=True)
# Regex search result of all target keywords
df['Regex_B'] = df.Description.str.contains(regex_expression, regex=True)
df
Output:
Description Regex_A Regex_B
0 FOOD4LESS 0508 0000FULLERTON CA False True
1 Electricity,VONS 0777 0123FULLERTON NY True True
2 PAVILIONS 1248 9800Ralphs MA False True
3 SPROUTS 9823 0770MARKET#WORK WI False True
4 Internet 0333 1008Water NJ True True
5 Enternet 0444 1008Wager NJ False False
Data Preparation
In a practical scenario, I would assume that for the type of problem you presented in the question, you would have a list of words that you want to look for in the dataframe column.
So I took the liberty of first converting your string into a list of strings.
the_list="[Gas|Internet|Water|Electricity,VONS|RALPHS|Ralphs|PAVILIONS|FOOD4LESS|TRADER JOE'S|GROCERY OUTLET|FOOD 4 LESS|SPROUTS|MARKET#WORK]"
the_list = the_list.replace("[","").replace("]","").split("|")
the_list
Output:
['Gas',
'Internet',
'Water',
'Electricity,VONS',
'RALPHS',
'Ralphs',
'PAVILIONS',
'FOOD4LESS',
"TRADER JOE'S",
'GROCERY OUTLET',
'FOOD 4 LESS',
'SPROUTS',
'MARKET#WORK']
Also, we make five rows of data that contain the keywords we are looking for, and then add another row where we expect False as the result of the regex pattern search.
descriptions = [
'FOOD4LESS 0508 0000FULLERTON CA',
'Electricity,VONS 0777 0123FULLERTON NY',
'PAVILIONS 1248 9800Ralphs MA',
'SPROUTS 9823 0770MARKET#WORK WI',
'Internet 0333 1008Water NJ',
'Enternet 0444 1008Wager NJ',
]
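One caveat not covered above: if any keyword happened to contain regex metacharacters (a ., +, ( and so on), it would be safer to escape each entry before joining, and case=False makes the match case-insensitive. A small sketch building on the variables above:
import re

# escape each keyword so it is matched literally, then join into one pattern
regex_expression = '|'.join(re.escape(word) for word in the_list)
# Regex_C is just an illustrative column name
df['Regex_C'] = df.Description.str.contains(regex_expression, case=False, regex=True)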

Python Pandas: Dataframe is not updating using string methods

I'm trying to update the strings in a .csv file that I am reading using Pandas. The .csv contains the column name 'about' which contains the rows of data I want to manipulate.
I've already used the .str methods to update the values, but the changes are not reflected in the exported .csv file. Some of my code can be seen below.
import pandas as pd
df = pd.read_csv('data.csv')
df.About.str.lower() #About is the column I am trying to update
df.About.str.replace('[^a-zA-Z ]', '')
df.to_csv('newdata.csv')
You need to assign the output back to the column. It is also possible to chain both operations together, because they work on the same About column; and because the values are converted to lowercase first, the regex can be changed to replace everything that is not a lowercase letter or space:
df = pd.read_csv('data.csv')
df.About = df.About.str.lower().str.replace('[^a-z ]', '')
df.to_csv('newdata.csv', index=False)
Sample:
df = pd.DataFrame({'About':['AaSD14%', 'SDD Aa']})
df.About = df.About.str.lower().str.replace('[^a-z ]', '')
print (df)
About
0 aasd
1 sdd aa
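Note that in pandas 2.0 and later str.replace no longer treats the pattern as a regular expression by default, so on newer versions pass regex=True explicitly:
df.About = df.About.str.lower().str.replace('[^a-z ]', '', regex=True)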
import pandas as pd
import numpy as np
columns = ['About']
data = ["ALPHA","OMEGA","ALpHOmGA"]
df = pd.DataFrame(data, columns=columns)
df.About = df.About.str.lower().str.replace('[^a-zA-Z ]', '')
print(df)
OUTPUT:
      About
0     alpha
1     omega
2  alphomga
Example Dataframe:
>>> df
About
0 JOHN23
1 PINKO22
2 MERRY jen
3 Soojan San
4 Remo55
Solution: another way, using a compiled regex with flags
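The definition of regex_pat is not shown here; judging from the output below (digits are dropped, lowercase letters and spaces are kept), it was presumably something like the following, which is an assumption rather than the original code:
>>> import re
>>> regex_pat = re.compile(r'[^a-z ]+', flags=re.IGNORECASE)  # assumed pattern, not the original definition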
>>> df.About.str.lower().str.replace(regex_pat, '')
0 john
1 pinko
2 merry jen
3 soojan san
4 remo
Name: About, dtype: object
Explanation:
[^a-z] matches a single character not present in the list a-z, i.e. anything outside the range between a (index 97) and z (index 122), case sensitive.
+ is a greedy quantifier: it matches the preceding element between one and unlimited times, as many times as possible, giving back as needed.
$ asserts position at the end of a line.

pandas: Replace string is not replacing targeted substring

I am iterating over a list of strings from dataframe1 to check whether the other dataframe2 contains any of the strings found in dataframe1, and to replace them if it does.
for index, row in nlp_df.iterrows():
    print( row['x1'] )
    string1 = row['x1'].replace("(","\(")
    string1 = string1.replace(")","\)")
    string1 = string1.replace("[","\[")
    string1 = string1.replace("]","\]")
    nlp2_df['title'] = nlp2_df['title'].replace(string1,"")
To do this, I iterate using the code shown above, checking for and replacing any string found in df1.
The output below shows the strings in df1:
wait_timeout
interactive_timeout
pool_recycle
....
__all__
folder_name
re.compile('he(lo')
The output below shows df2 after the replacement:
0 have you tried watching the traffic between th...
1 /dev/cu.xxxxx is the "callout" device, it's wh...
2 You'll want the struct package.\r\r\n
In the df2 output, strings like /dev/cu.xxxxx should have been replaced during the iteration, but as shown they are not removed. However, when I tried nlp2_df['title'] = nlp2_df['title'].replace("/dev/cu.xxxxx","") directly, it removed the string successfully.
Is there a reason why writing the string directly works but looping with a variable does not?
IIUC you can simply use regular expressions:
nlp2_df['title'] = nlp2_df['title'].str.replace(r'([\(\)\[\]])',r'\\\1')
PS you don't need for loop at all...
Demo:
In [15]: df
Out[15]:
title
0 aaa (bbb) ccc
1 A [word] ...
In [16]: df['new'] = df['title'].str.replace(r'([\(\)\[\]])',r'\\\1')
In [17]: df
Out[17]:
title new
0 aaa (bbb) ccc aaa \(bbb\) ccc
1 A [word] ... A \[word\] ...
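If the goal is to strip every df1 string out of the df2 titles, another option (not from the original answer) is to let re.escape handle the special characters and do a single regex replace instead of the loop:
import re

# escape all df1 strings so characters like ( ) [ ] . are treated literally,
# then remove any of them from the titles in one pass
pattern = '|'.join(re.escape(x) for x in nlp_df['x1'])
nlp2_df['title'] = nlp2_df['title'].str.replace(pattern, '', regex=True)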
