Pandas remove non-alphanumeric characters from string column - python

With pandas in a Jupyter notebook I would like to delete everything that is not a letter or digit: hyphens, special characters, and so on.
E.g.:
firstname,birthday_date
joe-down§,02-12-1990
lucash brown_ :),06-09-1980
^antony,11-02-1987
mary|,14-12-2002
should become:
firstname,birthday_date
joe down,02-12-1990
lucash brown,06-09-1980
antony,11-02-1987
mary,14-12-2002
I'm trying with:
df['firstname'] = df['firstname'].str.replace(r'!', '')
df['firstname'] = df['firstname'].str.replace(r'^', '')
df['firstname'] = df['firstname'].str.replace(r'|', '')
df['firstname'] = df['firstname'].str.replace(r'§', '')
df['firstname'] = df['firstname'].str.replace(r':', '')
df['firstname'] = df['firstname'].str.replace(r')', '')
......
......
df
It seems to work, but on more heavily populated columns I always miss some characters.
Is there a way to eliminate all non-text characters completely and keep only a single word (or words) in the same column? In the example I used firstname to illustrate the idea, but it would also apply to columns containing whole words!
Thanks!
P.S. It should also handle encoded text for emoticons.

You can use regex for this.
df['firstname'] = df['firstname'].str.replace('[^a-zA-Z0-9]', ' ', regex=True).str.strip()
df.firstname.tolist()
>>> ['joe down', 'lucash brown', 'antony', 'mary']
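Note that [^a-zA-Z0-9] also strips accented letters. If you need to keep letters from any script, a Unicode-aware variant of the same idea works; this is a minimal sketch, assuming the column holds plain strings:
df['firstname'] = (
    df['firstname']
      .str.replace(r'[^\w\s]|_', ' ', regex=True)  # drop punctuation/symbols; \w keeps _, so strip it explicitly
      .str.replace(r'\s+', ' ', regex=True)        # collapse runs of whitespace
      .str.strip()
)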

Try the below. It works on the names you used in the post:
first_names = ['joe-down§', 'lucash brown_', '^antony', 'mary|']
clean_names = []
keep = {'-', ' '}
for name in first_names:
    clean_names.append(''.join(c if c not in keep else ' ' for c in name if c.isalnum() or c in keep))
print(clean_names)
Output:
['joe down', 'lucash brown', 'antony', 'mary']
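To apply the same filter to a DataFrame column, the loop body can be wrapped in a function and mapped over the column; a small sketch, assuming df['firstname'] holds plain strings:
def keep_alnum(name, keep=frozenset('- ')):
    # keep alphanumerics, turn '-' and ' ' into spaces, drop everything else
    return ''.join(c if c not in keep else ' '
                   for c in name if c.isalnum() or c in keep).strip()

df['firstname'] = df['firstname'].apply(keep_alnum)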

Related

How to remove emoji and other language in a DataFrame

I got a dataframe as follows:
Title        Content
補水法💦      Skin Care
📯 現貨 📯    รีบจัดด่วน‼️ ราคาเฉพาะรอบนี💕 Test
I tried to use the regex:
df1['Post Title'] = df1['Post Title'].str.replace(r'[^\x00-\x7F]+', '', regex=True)
df1['Post Detail'] = df1['Post Detail'].str.replace(r'[^\x00-\x7F]+', '', regex=True)
It successfully removes the emojis. However, it only keeps the English text, not the Chinese.
I would like to remove all emojis and any language that is neither Chinese nor English.
Expected result:
Title     Content
補水法    Skin Care
現貨      Test
This can be done using the demoji library in Python. First install it:
pip install demoji
Then:
import demoji
df = df.applymap(lambda x: demoji.replace(x,''))
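As a quick check on the frame from the question (a sketch; recent demoji releases bundle the emoji index, while older versions required a one-time demoji.download_codes() call):
import demoji
import pandas as pd

df = pd.DataFrame({"Title": ["補水法💦", "📯 現貨 📯"],
                   "Content": ["Skin Care", "รีบจัดด่วน‼️ ราคาเฉพาะรอบนี💕 Test"]})
print(df.applymap(lambda x: demoji.replace(x, '')))
# emojis are gone; the Thai text remains, since demoji only strips emoji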
In my opinion, the solution has two parts:
Remove emojis. Emojis can be removed by their Unicode code points. The detailed emoji list can be downloaded from http://www.unicode.org/emoji/charts/full-emoji-list.html. It is recommended to use existing packages such as emoji to obtain the list. For example, with emoji we can write the following code. Tools like demoji and cleantext implement their methods in a similar way.
import pandas as pd
from emoji import UNICODE_EMOJI

df = pd.DataFrame({"Title": ["補水法💦", "📯 現貨 📯"], "Content": ["Skin Care", "รีบจัดด่วน‼️ ราคาเฉพาะรอบนี💕 Test"]})

def remove_emoji(text):
    emojis = UNICODE_EMOJI["en"]
    if isinstance(emojis, str):
        emojis = [emojis]
    result = text
    for x in emojis:
        result = result.replace(x, "")
    return result

df = df.applymap(lambda x: remove_emoji(x))
print(df)
  Title                         Content
0  補水法                      Skin Care
1   現貨  รีบจัดด่วน ราคาเฉพาะรอบนี Test
Remove non-English or non-Chinese characters. One possible way is to use a language detection tool such as langid or langdetect. For example, with the following code, content in Thai is removed.
import langid
import pandas as pd

df = pd.DataFrame({"Title": ["補水法💦", "📯 現貨 📯"], "Content": ["Skin Care", "รีบจัดด่วน‼️ ราคาเฉพาะรอบนี💕 Test"]})
df = df.applymap(lambda x: '' if ((langid.classify(x)[0] != 'zh') and (langid.classify(x)[0] != 'en')) else x)
print(df)
       Title    Content
0   補水法💦  Skin Care
1  📯 現貨 📯
If you want to achieve a more fine-grained result, say the one below, I think you also need to tokenize the content. For Chinese tokenization, you can refer to https://github.com/mjpost/sacrebleu/blob/master/sacrebleu/tokenizers/tokenizer_zh.py, which also includes Unicode details of Hanzi.
Title Content
0 補水法 Skin Care
1 現貨 Test
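A rough token-level sketch of that idea: classify each whitespace-separated token and keep only the English or Chinese ones. This is only an illustration; langid is unreliable on very short tokens, and real Chinese text would need a proper tokenizer (such as jieba) first:
import langid

def keep_en_zh(cell):
    # keep a token only if langid thinks it is English or Chinese
    return ' '.join(tok for tok in cell.split()
                    if langid.classify(tok)[0] in ('en', 'zh'))

df = df.applymap(keep_en_zh)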
You can extract all English and Chinese words and join with a space:
import pandas as pd
df = pd.DataFrame({'Title':['補水法💦', '📯 現貨 📯', 'Text, text'], 'Content':['Skin Care', 'รีบจัดด่วน‼️ ราคาเฉพาะรอบนี💕 Test', 'More text!!!']})
pPunct = r'!-\/:-@\[-`\{-~\u00A1-\u00A9\u00AB\u00AC\u00AE-\u00B1\u00B4\u00B6-\u00B8\u00BB\u00BF\u00D7\u00F7\u02C2-\u02C5\u02D2-\u02DF\u02E5-\u02EB\u02ED\u02EF-\u02FF\u0375\u037E\u0384\u0385\u0387\u03F6\u0482\u055A-\u055F\u0589\u058A\u058D-\u058F\u05BE\u05C0\u05C3\u05C6\u05F3\u05F4\u0606-\u060F\u061B\u061D-\u061F\u066A-\u066D\u06D4\u06DE\u06E9\u06FD\u06FE\u0700-\u070D\u07F6-\u07F9\u07FE\u07FF\u0830-\u083E\u085E\u0888\u0964\u0965\u0970\u09F2\u09F3\u09FA\u09FB\u09FD\u0A76\u0AF0\u0AF1\u0B70\u0BF3-\u0BFA\u0C77\u0C7F\u0C84\u0D4F\u0D79\u0DF4\u0E3F\u0E4F\u0E5A\u0E5B\u0F01-\u0F17\u0F1A-\u0F1F\u0F34\u0F36\u0F38\u0F3A-\u0F3D\u0F85\u0FBE-\u0FC5\u0FC7-\u0FCC\u0FCE-\u0FDA\u104A-\u104F\u109E\u109F\u10FB\u1360-\u1368\u1390-\u1399\u1400\u166D\u166E\u169B\u169C\u16EB-\u16ED\u1735\u1736\u17D4-\u17D6\u17D8-\u17DB\u1800-\u180A\u1940\u1944\u1945\u19DE-\u19FF\u1A1E\u1A1F\u1AA0-\u1AA6\u1AA8-\u1AAD\u1B5A-\u1B6A\u1B74-\u1B7E\u1BFC-\u1BFF\u1C3B-\u1C3F\u1C7E\u1C7F\u1CC0-\u1CC7\u1CD3\u1FBD\u1FBF-\u1FC1\u1FCD-\u1FCF\u1FDD-\u1FDF\u1FED-\u1FEF\u1FFD\u1FFE\u2010-\u2027\u2030-\u205E\u207A-\u207E\u208A-\u208E\u20A0-\u20C0\u2100\u2101\u2103-\u2106\u2108\u2109\u2114\u2116-\u2118\u211E-\u2123\u2125\u2127\u2129\u212E\u213A\u213B\u2140-\u2144\u214A-\u214D\u214F\u218A\u218B\u2190-\u2426\u2440-\u244A\u249C-\u24E9\u2500-\u2775\u2794-\u2B73\u2B76-\u2B95\u2B97-\u2BFF\u2CE5-\u2CEA\u2CF9-\u2CFC\u2CFE\u2CFF\u2D70\u2E00-\u2E2E\u2E30-\u2E5D\u2E80-\u2E99\u2E9B-\u2EF3\u2F00-\u2FD5\u2FF0-\u2FFB\u3001-\u3004\u3008-\u3020\u3030\u3036\u3037\u303D-\u303F\u309B\u309C\u30A0\u30FB\u3190\u3191\u3196-\u319F\u31C0-\u31E3\u3200-\u321E\u322A-\u3247\u3250\u3260-\u327F\u328A-\u32B0\u32C0-\u33FF\u4DC0-\u4DFF\uA490-\uA4C6\uA4FE\uA4FF\uA60D-\uA60F\uA673\uA67E\uA6F2-\uA6F7\uA700-\uA716\uA720\uA721\uA789\uA78A\uA828-\uA82B\uA836-\uA839\uA874-\uA877\uA8CE\uA8CF\uA8F8-\uA8FA\uA8FC\uA92E\uA92F\uA95F\uA9C1-\uA9CD\uA9DE\uA9DF\uAA5C-\uAA5F\uAA77-\uAA79\uAADE\uAADF\uAAF0\uAAF1\uAB5B\uAB6A\uAB6B\uABEB\uFB29\uFBB2-\uFBC2\uFD3E-\uFD4F\uFDCF\uFDFC-\uFDFF\uFE10-\uFE19\uFE30-\uFE52\uFE54-\uFE66\uFE68-\uFE6B\uFF01-\uFF0F\uFF1A-\uFF20\uFF3B-\uFF40\uFF5B-\uFF65\uFFE0-\uFFE6\uFFE8-\uFFEE\uFFFC\uFFFD\U00010100-\U00010102\U00010137-\U0001013F\U00010179-\U00010189\U0001018C-\U0001018E\U00010190-\U0001019C\U000101A0\U000101D0-\U000101FC\U0001039F\U000103D0\U0001056F\U00010857\U00010877\U00010878\U0001091F\U0001093F\U00010A50-\U00010A58\U00010A7F\U00010AC8\U00010AF0-\U00010AF6\U00010B39-\U00010B3F\U00010B99-\U00010B9C\U00010EAD\U00010F55-\U00010F59\U00010F86-\U00010F89\U00011047-\U0001104D\U000110BB\U000110BC\U000110BE-\U000110C1\U00011140-\U00011143\U00011174\U00011175\U000111C5-\U000111C8\U000111CD\U000111DB\U000111DD-\U000111DF\U00011238-\U0001123D\U000112A9\U0001144B-\U0001144F\U0001145A\U0001145B\U0001145D\U000114C6\U000115C1-\U000115D7\U00011641-\U00011643\U00011660-\U0001166C\U000116B9\U0001173C-\U0001173F\U0001183B\U00011944-\U00011946\U000119E2\U00011A3F-\U00011A46\U00011A9A-\U00011A9C\U00011A9E-\U00011AA2\U00011C41-\U00011C45\U00011C70\U00011C71\U00011EF7\U00011EF8\U00011FD5-\U00011FF1\U00011FFF\U00012470-\U00012474\U00012FF1\U00012FF2\U00016A6E\U00016A6F\U00016AF5\U00016B37-\U00016B3F\U00016B44\U00016B45\U00016E97-\U00016E9A\U00016FE2\U0001BC9C\U0001BC9F\U0001CF50-\U0001CFC3\U0001D000-\U0001D0F5\U0001D100-\U0001D126\U0001D129-\U0001D164\U0001D16A-\U0001D16C\U0001D183\U0001D184\U0001D18C-\U0001D1A9\U0001D1AE-\U0001D1EA\U0001D200-\U0001D241\U0001D245\U0001D300-\U0001D356\U0001D6C1\U0001D6DB\U0001D6FB\U0001D715\U0001D735\U0001D74F\U0001D76F\U0001D789\U0001D7A9\U0001D7C3\U0001D800-\U0001D9FF\U0001DA37-\U0001DA3A\U0001DA6D-\U0001DA74\U0001DA76-\U0001DA83\U0001DA85-\U0001DA8B\U0001E14F\U0001E2FF\U0001E95E\U0001E95F\U0001ECAC\U0001ECB0\U0001ED2E\U0001EEF0\U0001EEF1\U0001F000-\U0001F02B\U0001F030-\U0001F093\U0001F0A0-\U0001F0AE\U0001F0B1-\U0001F0BF\U0001F0C1-\U0001F0CF\U0001F0D1-\U0001F0F5\U0001F10D-\U0001F1AD\U0001F1E6-\U0001F202\U0001F210-\U0001F23B\U0001F240-\U0001F248\U0001F250\U0001F251\U0001F260-\U0001F265\U0001F300-\U0001F6D7\U0001F6DD-\U0001F6EC\U0001F6F0-\U0001F6FC\U0001F700-\U0001F773\U0001F780-\U0001F7D8\U0001F7E0-\U0001F7EB\U0001F7F0\U0001F800-\U0001F80B\U0001F810-\U0001F847\U0001F850-\U0001F859\U0001F860-\U0001F887\U0001F890-\U0001F8AD\U0001F8B0\U0001F8B1\U0001F900-\U0001FA53\U0001FA60-\U0001FA6D\U0001FA70-\U0001FA74\U0001FA78-\U0001FA7C\U0001FA80-\U0001FA86\U0001FA90-\U0001FAAC\U0001FAB0-\U0001FABA\U0001FAC0-\U0001FAC5\U0001FAD0-\U0001FAD9\U0001FAE0-\U0001FAE7\U0001FAF0-\U0001FAF6\U0001FB00-\U0001FB92\U0001FB94-\U0001FBCA'
pLatin = r'A-Za-z\u00AA\u00BA\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02B8\u02E0-\u02E4\u1D00-\u1D25\u1D2C-\u1D5C\u1D62-\u1D65\u1D6B-\u1D77\u1D79-\u1DBE\u1E00-\u1EFF\u2071\u207F\u2090-\u209C\u212A\u212B\u2132\u214E\u2160-\u2188\u2C60-\u2C7F\uA722-\uA787\uA78B-\uA7CA\uA7D0\uA7D1\uA7D3\uA7D5-\uA7D9\uA7F2-\uA7FF\uAB30-\uAB5A\uAB5C-\uAB64\uAB66-\uAB69\uFB00-\uFB06\uFF21-\uFF3A\uFF41-\uFF5A\U00010780-\U00010785\U00010787-\U000107B0\U000107B2-\U000107BA\U0001DF00-\U0001DF1E'
pHan = r'\u2E80-\u2E99\u2E9B-\u2EF3\u2F00-\u2FD5\u3005\u3007\u3021-\u3029\u3038-\u303B\u3400-\u4DBF\u4E00-\u9FFF\uF900-\uFA6D\uFA70-\uFAD9\U00016FE2\U00016FE3\U00016FF0\U00016FF1\U00020000-\U0002A6DF\U0002A700-\U0002B738\U0002B740-\U0002B81D\U0002B820-\U0002CEA1\U0002CEB0-\U0002EBE0\U0002F800-\U0002FA1D\U00030000-\U0003134A'
pEmojiEx = r'0-9\u00A9\u00AE\u203C\u2049\u2122\u2139\u2194-\u2199\u21A9\u21AA\u231A\u231B\u2328\u23CF\u23E9-\u23F3\u23F8-\u23FA\u24C2\u25AA\u25AB\u25B6\u25C0\u25FB-\u25FE\u2600-\u2604\u260E\u2611\u2614\u2615\u2618\u261D\u2620\u2622\u2623\u2626\u262A\u262E\u262F\u2638-\u263A\u2640\u2642\u2648-\u2653\u265F\u2660\u2663\u2665\u2666\u2668\u267B\u267E\u267F\u2692-\u2697\u2699\u269B\u269C\u26A0\u26A1\u26A7\u26AA\u26AB\u26B0\u26B1\u26BD\u26BE\u26C4\u26C5\u26C8\u26CE\u26CF\u26D1\u26D3\u26D4\u26E9\u26EA\u26F0-\u26F5\u26F7-\u26FA\u26FD\u2702\u2705\u2708-\u270D\u270F\u2712\u2714\u2716\u271D\u2721\u2728\u2733\u2734\u2744\u2747\u274C\u274E\u2753-\u2755\u2757\u2763\u2764\u2795-\u2797\u27A1\u27B0\u27BF\u2934\u2935\u2B05-\u2B07\u2B1B\u2B1C\u2B50\u2B55\u3030\u303D\u3297\u3299\U0001F004\U0001F0CF\U0001F170\U0001F171\U0001F17E\U0001F17F\U0001F18E\U0001F191-\U0001F19A\U0001F1E6-\U0001F1FF\U0001F201\U0001F202\U0001F21A\U0001F22F\U0001F232-\U0001F23A\U0001F250\U0001F251\U0001F300-\U0001F321\U0001F324-\U0001F393\U0001F396\U0001F397\U0001F399-\U0001F39B\U0001F39E-\U0001F3F0\U0001F3F3-\U0001F3F5\U0001F3F7-\U0001F4FD\U0001F4FF-\U0001F53D\U0001F549-\U0001F54E\U0001F550-\U0001F567\U0001F56F\U0001F570\U0001F573-\U0001F57A\U0001F587\U0001F58A-\U0001F58D\U0001F590\U0001F595\U0001F596\U0001F5A4\U0001F5A5\U0001F5A8\U0001F5B1\U0001F5B2\U0001F5BC\U0001F5C2-\U0001F5C4\U0001F5D1-\U0001F5D3\U0001F5DC-\U0001F5DE\U0001F5E1\U0001F5E3\U0001F5E8\U0001F5EF\U0001F5F3\U0001F5FA-\U0001F64F\U0001F680-\U0001F6C5\U0001F6CB-\U0001F6D2\U0001F6D5-\U0001F6D7\U0001F6DD-\U0001F6E5\U0001F6E9\U0001F6EB\U0001F6EC\U0001F6F0\U0001F6F3-\U0001F6FC\U0001F7E0-\U0001F7EB\U0001F7F0\U0001F90C-\U0001F93A\U0001F93C-\U0001F945\U0001F947-\U0001F9FF\U0001FA70-\U0001FA74\U0001FA78-\U0001FA7C\U0001FA80-\U0001FA86\U0001FA90-\U0001FAAC\U0001FAB0-\U0001FABA\U0001FAC0-\U0001FAC5\U0001FAD0-\U0001FAD9\U0001FAE0-\U0001FAE7\U0001FAF0-\U0001FAF6'
pat = fr'[{pHan}{pLatin}{pPunct}]+'
df = df.replace(fr'[{pEmojiEx}]+', '', regex=True)
df['Title'] = df['Title'].str.findall(pat).str.join(" ")
df['Content'] = df['Content'].str.findall(pat).str.join(" ")
print(df.to_string())
Output:
        Title       Content
0     補水法     Skin Care
1      現貨          Test
2  Text, text  More text!!!
What it does is:
df.replace(fr'[{pEmojiEx}]+', '', regex=True) removes all chars that are marked with Emoji Unicode property (these include digits, but I removed * and #)
.str.findall(fr'[{pHan}{pLatin}{pPunct}]+').str.join(" ") extracts chunks of one or more Latin, Han or any punctuation chars and joins them with a space.
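For comparison, the third-party regex module (pip install regex) understands Unicode script classes directly, so the same idea fits in a few lines. This is a sketch rather than an exact equivalent of the classes above; the short whitelist of ASCII punctuation and digits is an assumption:
import regex  # third-party module, not the stdlib re

pat = regex.compile(r"[\p{Han}\p{Latin}0-9.,!?' ]+")

def keep_han_latin(s):
    # keep runs of Han/Latin characters (plus a few ASCII marks and spaces)
    return ' '.join(part.strip() for part in pat.findall(s) if part.strip())

df['Title'] = df['Title'].map(keep_han_latin)
df['Content'] = df['Content'].map(keep_han_latin)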

python bypass re.finditer match when searched words are in a defined expression

I have a list of words (find_list) that I want to find in a text, and a list of expressions containing those words (scape_list) that I want to skip when they appear in the text.
I can find all the words in the text using this code:
import re

find_list = ['name', 'small']
scape_list = ['small software', 'company name']
text = "My name is Klaus and my middle name is Smith. I work for a small company. The company name is Small Software. Small Software sells Software Name."
final_list = []
for word in find_list:
    s = r'\W{}\W'.format(word)
    matches = re.finditer(s, text, (re.MULTILINE | re.IGNORECASE))
    for word_ in matches:
        final_list.append(word_.group(0))
The final_list is:
[' name ', ' name ', ' name ', ' Name.', ' small ', ' Small ', ' Small ']
Is there a way to bypass expressions listed in scape_list and obtain a final_list like this one:
[' name ', ' name ', ' Name.', ' small ']
final_list and scape_list are always being updated. So I think that regex is a good approach.
You can capture the word before and after the find_list word with the regex and check that neither combination is present in scape_list. I have added comments where I changed the code. (Also, better to change scape_list to a set if it can become large in the future.)
import re

find_list = ['name', 'small']
scape_list = ['small software', 'company name']
text = "My name is Klaus and my middle name is Smith. I work for a small company. The company name is Small Software. Small Software sells Software Name."
final_list = []
for word in find_list:
    s = r'(\w*\W)({})(\W\w*)'.format(word)  # change the regex to capture adjacent words
    matches = re.finditer(s, text, (re.MULTILINE | re.IGNORECASE))
    for word_ in matches:
        if ((word_.group(1) + word_.group(2)).strip().lower() not in scape_list
                and (word_.group(2) + word_.group(3)).strip().lower() not in scape_list):  # added this condition
            final_list.append(word_.group(2))  # changed here
print(final_list)
['name', 'name', 'Name', 'small']
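A single pass is also possible: put the scape expressions first in an alternation so the engine consumes them before the bare words can match. A sketch; matches come out in text order rather than grouped per word:
import re

# scape phrases go first so the engine consumes them whole;
# only the capture group (the bare words) is collected
pattern = re.compile(
    '|'.join(re.escape(p) for p in scape_list)
    + '|(' + '|'.join(r'\b{}\b'.format(re.escape(w)) for w in find_list) + ')',
    re.IGNORECASE)
final_list = [m.group(1) for m in pattern.finditer(text) if m.group(1)]
print(final_list)  # ['name', 'name', 'small', 'Name']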

Delete unknown special character

Delete special characters:
s = "____Ç_apple___ _______new A_____"
print(re.sub(r'[^0-9a-zA-Z]\s+$', '', s))
result: ____Ç_apple___ _______new A_____   (unchanged, the pattern never matches)
s = "____Ç_apple___ _______new A_____"
print(re.sub(r'[^0-9a-zA-Z]', '', s))
result: applenewA
The final result I want is:
result = apple new A
but I cannot get it. I want to delete Ç and _ while keeping spaces and English letters.
Since you want to consolidate multiple spaces into one space, and then remove characters that are not words or spaces, you should do it in two separate regex substitutions:
print(re.sub(r'[^0-9a-zA-Z ]+', '', re.sub(r'\s+', ' ', s)))
This outputs:
apple new A
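Equivalently, you can extract the alphanumeric runs and join them with spaces, which sidesteps the substitution order entirely:
import re

s = "____Ç_apple___ _______new A_____"
print(' '.join(re.findall(r'[0-9a-zA-Z]+', s)))  # apple new A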
You want 'apple new A' for the result, right?
import re

s = "____Ç_apple___ _______new A_____"
result = re.sub(r'[^a-zA-Z\s]+', '', s)  # 'apple new A' (underscores and Ç removed)
result = ' '.join(result.split())        # collapse any repeated whitespace
print(result)  # apple new A

Python nested loops into one line

I need to clean a list of strings containing names: remove titles and then things like 's. The code works OK, but I'd like to turn it into two list comprehensions. Attempts like [name.replace(e, '') for name in names_ for e in replace] didn't work; I'm definitely missing something. I will appreciate your help!
names = ['Mrs Marple', 'Maj Gen Smith', "Tony Dobson's"]
replace = ['Mrs ', 'Maj ', 'Gen ']
names_new = []
for name in names:
    for e in replace:
        name = name.replace(e, '')
    names_new.append(name)
names_final = []
for name in names_new:
    if name.endswith("'s"):
        name = name[:-2]
        names_final.append(name)
    else:
        names_final.append(name)
print(names_final)
You can use re.sub() to do exactly what you want:
import re
names = ['Mrs Marple', 'Maj Gen Smith', "Tony Dobson's"]
replace = ['Mrs ', 'Maj ', 'Gen ']
names = [re.sub(r'(Mrs\s|Maj\s|Gen\s|\'s$)', '', x) for x in names]
print(names)
Output:
['Marple', 'Smith', 'Tony Dobson']
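If replace is generated at runtime rather than hardcoded, the same pattern can be assembled from the list with re.escape; a sketch:
import re

pattern = re.compile('|'.join(map(re.escape, replace)) + r"|'s$")
names = [pattern.sub('', x) for x in names]
print(names)  # ['Marple', 'Smith', 'Tony Dobson']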
The problem is the name = name.replace(e, '') statement in the for loop: we can't use an assignment inside a comprehension, so you used name.replace(e, ''), but replace() is not in place, because strings in Python are immutable.
The solution I have written is based on reduce; it replaces all occurrences of the elements of the replace sequence.
from functools import reduce
names = ['Mrs Marple', 'Maj Gen Smith', "Tony Dobson's"]
replace = ['Mrs ','Maj ','Gen ']
result = [reduce(lambda str, e: str.replace(e, ''), replace, name) for name in names]
Here is the result
print(result)
['Marple', 'Smith', "Tony Dobson's"]
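Since reduce only applies the replace list, the trailing 's survives; the second loop from the question can likewise become a comprehension:
result = [name[:-2] if name.endswith("'s") else name for name in result]
print(result)  # ['Marple', 'Smith', 'Tony Dobson']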
The solution by @chrisz works, but if the replace list is generated on the fly or is very long, we won't be able to hand-write a regex for it. This solution works in pretty much any scenario.

How to add whitespace after string.punctuation in Python?

I want to clean my reviews data. Here's my code:
import re
import string

def processData(data):
    data = data.lower()  # casefold
    data = re.sub(r'<[^>]*>', ' ', data)  # remove any html
    data = re.sub(r'#([^\s]+)', r'\1', data)  # replace #word with word
    remove = string.punctuation
    remove = remove.replace("'", "")  # don't remove '
    p = r"[{}]".format(remove)  # create the pattern
    data = re.sub(p, "", data)
    data = re.sub(r'\s+', ' ', data)  # remove additional whitespace
    pp = re.compile(r"(.)\1{1,}", re.DOTALL)  # pattern to remove character repetitions
    data = pp.sub(r"\1\1", data)
    return data
This code works almost perfectly, but there is still a problem.
For the sentence "she work in public-service" I get "she work in publicservice".
The problem is that no whitespace is left where the punctuation was removed.
I want the sentence to come out as "she work in public service".
Can you help me with my code?
I think you want this:
>>> import re
>>> import string
>>> st = 'she works in public-service'
>>> re.sub(r'([{}])'.format(string.punctuation), r' ', st)
'she works in public service'
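One caveat: string.punctuation contains ']', '\', '^' and '-', which are special inside a character class. The class above happens to parse correctly, but escaping the set with re.escape is safer:
>>> re.sub('[{}]'.format(re.escape(string.punctuation)), ' ', st)
'she works in public service'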
