I got a dataframe as follows:
Title         Content
補水法💦       Skin Care
📯 現貨 📯     รีบจัดด่วน‼️ ราคาเฉพาะรอบนี💕 Test
I tried to use the regex:
df1['Post Title'] = df1['Post Title'].str.replace(r'[^\x00-\x7F]+', '', regex=True)
df1['Post Detail'] = df1['Post Detail'].str.replace(r'[^\x00-\x7F]+', '', regex=True)
It successfully removes the emojis. However, it only keeps the English text; the Chinese is stripped as well.
I would like to remove all emojis and any text that is neither Chinese nor English.
Expected result:
Title      Content
補水法      Skin Care
現貨        Test
This can be done using the demoji library in Python. Install it with pip:
pip install demoji
Then:
import demoji
df = df.applymap(lambda x: demoji.replace(x,''))
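A minimal sketch on the sample frame, with one caveat: older demoji releases ship without the emoji data and need a one-time download (newer ones bundle it), so the download call is left commented out.

import pandas as pd
import demoji

# Older demoji releases (< 1.0) require a one-time data download;
# uncomment the next line if demoji complains about missing codes.
# demoji.download_codes()

df1 = pd.DataFrame({'Post Title': ['補水法💦', '📯 現貨 📯'],
                    'Post Detail': ['Skin Care', 'รีบจัดด่วน‼️ ราคาเฉพาะรอบนี💕 Test']})
df1 = df1.applymap(lambda x: demoji.replace(x, ''))
print(df1)
# The emojis are stripped, but the Thai text remains; the language
# filtering step is covered in the answers below.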
In my opinion, the solution contains two parts:
Remove emojis. Emojis can be removed by their Unicode code points. The detailed emoji Unicode list can be downloaded from http://www.unicode.org/emoji/charts/full-emoji-list.html. It is recommended to use existing packages such as emoji to obtain the list. For example, with emoji, we can write the following code. Tools like demoji and cleantext implement their methods in a similar way.
>>> import pandas as pd
>>> from emoji import UNICODE_EMOJI
>>> df = pd.DataFrame({"Title": ["補水法💦", "📯 現貨 📯"], "Content": ["Skin Care", "รีบจัดด่วน‼️ ราคาเฉพาะรอบนี💕 Test"]})
>>> def remove_emoji(text):
...     emojis = UNICODE_EMOJI["en"]
...     if isinstance(emojis, str):
...         emojis = [emojis]
...     result = text
...     for x in emojis:
...         result = result.replace(x, "")
...     return result
...
>>> df = df.applymap(lambda x: remove_emoji(x))
>>> print(df)
Title Content
0 補水法 Skin Care
1 現貨 รีบจัดด่วน ราคาเฉพาะรอบนี Test
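Note that the emoji package changed its API in 2.x and no longer exports UNICODE_EMOJI there; on a newer version the same step should, as far as I know, be the built-in replace_emoji helper:

import emoji  # emoji >= 2.0
df = df.applymap(lambda x: emoji.replace_emoji(x, replace=''))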
Remove non-English or non-Chinese characters. One possible way to do this is to use language detection tools such as langid, langdetect and so on. For example, with the following code, content in Thai will be removed.
>>> import langid
>>> import pandas as pd
>>> df = pd.DataFrame({"Title": ["補水法💦", "📯 現貨 📯"], "Content": ["Skin Care", "รีบจัดด่วน‼️ ราคาเฉพาะรอบนี💕 Test"]})
>>> df = df.applymap(lambda x: '' if ((langid.classify(x)[0] != 'zh') and (langid.classify(x)[0] != 'en')) else x)
>>> print(df)
Title Content
0 補水法💦 Skin Care
1 📯 現貨 📯
If you want to achieve a more fine-grained result, say the result shown below, I think you also need to tokenize the content (a rough sketch follows the table). For Chinese tokenization, you can refer to https://github.com/mjpost/sacrebleu/blob/master/sacrebleu/tokenizers/tokenizer_zh.py, which also includes Unicode details of Hanzi.
Title Content
0 補水法 Skin Care
1 現貨 Test
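For illustration only, here is a much cruder per-token filter than the sacrebleu tokenizer: a sketch that assumes the emojis were already stripped in step 1 and that whitespace is a good-enough token boundary.

import re
import pandas as pd

# Keep only whitespace-separated tokens made of Latin letters or common
# CJK ideographs; everything else (e.g. Thai) is dropped.
HAN_OR_LATIN = re.compile(r'^[A-Za-z\u4e00-\u9fff]+$')

def keep_en_zh(text):
    return " ".join(t for t in text.split() if HAN_OR_LATIN.match(t))

df = pd.DataFrame({"Title": ["補水法", "現貨"],
                   "Content": ["Skin Care", "รีบจัดด่วน ราคาเฉพาะรอบนี Test"]})
print(df.applymap(keep_en_zh))
#   Title    Content
# 0  補水法  Skin Care
# 1   現貨       Test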
You can extract all English and Chinese words and join with a space:
import pandas as pd
df = pd.DataFrame({'Title':['補水法💦', '📯 現貨 📯', 'Text, text'], 'Content':['Skin Care', 'รีบจัดด่วน‼️ ราคาเฉพาะรอบนี💕 Test', 'More text!!!']})
pPunct = r'!-/\:-@\[-`\{-~\u00A1-\u00A9\u00AB\u00AC\u00AE-\u00B1\u00B4\u00B6-\u00B8\u00BB\u00BF\u00D7\u00F7\u02C2-\u02C5\u02D2-\u02DF\u02E5-\u02EB\u02ED\u02EF-\u02FF\u0375\u037E\u0384\u0385\u0387\u03F6\u0482\u055A-\u055F\u0589\u058A\u058D-\u058F\u05BE\u05C0\u05C3\u05C6\u05F3\u05F4\u0606-\u060F\u061B\u061D-\u061F\u066A-\u066D\u06D4\u06DE\u06E9\u06FD\u06FE\u0700-\u070D\u07F6-\u07F9\u07FE\u07FF\u0830-\u083E\u085E\u0888\u0964\u0965\u0970\u09F2\u09F3\u09FA\u09FB\u09FD\u0A76\u0AF0\u0AF1\u0B70\u0BF3-\u0BFA\u0C77\u0C7F\u0C84\u0D4F\u0D79\u0DF4\u0E3F\u0E4F\u0E5A\u0E5B\u0F01-\u0F17\u0F1A-\u0F1F\u0F34\u0F36\u0F38\u0F3A-\u0F3D\u0F85\u0FBE-\u0FC5\u0FC7-\u0FCC\u0FCE-\u0FDA\u104A-\u104F\u109E\u109F\u10FB\u1360-\u1368\u1390-\u1399\u1400\u166D\u166E\u169B\u169C\u16EB-\u16ED\u1735\u1736\u17D4-\u17D6\u17D8-\u17DB\u1800-\u180A\u1940\u1944\u1945\u19DE-\u19FF\u1A1E\u1A1F\u1AA0-\u1AA6\u1AA8-\u1AAD\u1B5A-\u1B6A\u1B74-\u1B7E\u1BFC-\u1BFF\u1C3B-\u1C3F\u1C7E\u1C7F\u1CC0-\u1CC7\u1CD3\u1FBD\u1FBF-\u1FC1\u1FCD-\u1FCF\u1FDD-\u1FDF\u1FED-\u1FEF\u1FFD\u1FFE\u2010-\u2027\u2030-\u205E\u207A-\u207E\u208A-\u208E\u20A0-\u20C0\u2100\u2101\u2103-\u2106\u2108\u2109\u2114\u2116-\u2118\u211E-\u2123\u2125\u2127\u2129\u212E\u213A\u213B\u2140-\u2144\u214A-\u214D\u214F\u218A\u218B\u2190-\u2426\u2440-\u244A\u249C-\u24E9\u2500-\u2775\u2794-\u2B73\u2B76-\u2B95\u2B97-\u2BFF\u2CE5-\u2CEA\u2CF9-\u2CFC\u2CFE\u2CFF\u2D70\u2E00-\u2E2E\u2E30-\u2E5D\u2E80-\u2E99\u2E9B-\u2EF3\u2F00-\u2FD5\u2FF0-\u2FFB\u3001-\u3004\u3008-\u3020\u3030\u3036\u3037\u303D-\u303F\u309B\u309C\u30A0\u30FB\u3190\u3191\u3196-\u319F\u31C0-\u31E3\u3200-\u321E\u322A-\u3247\u3250\u3260-\u327F\u328A-\u32B0\u32C0-\u33FF\u4DC0-\u4DFF\uA490-\uA4C6\uA4FE\uA4FF\uA60D-\uA60F\uA673\uA67E\uA6F2-\uA6F7\uA700-\uA716\uA720\uA721\uA789\uA78A\uA828-\uA82B\uA836-\uA839\uA874-\uA877\uA8CE\uA8CF\uA8F8-\uA8FA\uA8FC\uA92E\uA92F\uA95F\uA9C1-\uA9CD\uA9DE\uA9DF\uAA5C-\uAA5F\uAA77-\uAA79\uAADE\uAADF\uAAF0\uAAF1\uAB5B\uAB6A\uAB6B\uABEB\uFB29\uFBB2-\uFBC2\uFD3E-\uFD4F\uFDCF\uFDFC-\uFDFF\uFE10-\uFE19\uFE30-\uFE52\uFE54-\uFE66\uFE68-\uFE6B\uFF01-\uFF0F\uFF1A-\uFF20\uFF3B-\uFF40\uFF5B-\uFF65\uFFE0-\uFFE6\uFFE8-\uFFEE\uFFFC\uFFFD\U00010100-\U00010102\U00010137-\U0001013F\U00010179-\U00010189\U0001018C-\U0001018E\U00010190-\U0001019C\U000101A0\U000101D0-\U000101FC\U0001039F\U000103D0\U0001056F\U00010857\U00010877\U00010878\U0001091F\U0001093F\U00010A50-\U00010A58\U00010A7F\U00010AC8\U00010AF0-\U00010AF6\U00010B39-\U00010B3F\U00010B99-\U00010B9C\U00010EAD\U00010F55-\U00010F59\U00010F86-\U00010F89\U00011047-\U0001104D\U000110BB\U000110BC\U000110BE-\U000110C1\U00011140-\U00011143\U00011174\U00011175\U000111C5-\U000111C8\U000111CD\U000111DB\U000111DD-\U000111DF\U00011238-\U0001123D\U000112A9\U0001144B-\U0001144F\U0001145A\U0001145B\U0001145D\U000114C6\U000115C1-\U000115D7\U00011641-\U00011643\U00011660-\U0001166C\U000116B9\U0001173C-\U0001173F\U0001183B\U00011944-\U00011946\U000119E2\U00011A3F-\U00011A46\U00011A9A-\U00011A9C\U00011A9E-\U00011AA2\U00011C41-\U00011C45\U00011C70\U00011C71\U00011EF7\U00011EF8\U00011FD5-\U00011FF1\U00011FFF\U00012470-\U00012474\U00012FF1\U00012FF2\U00016A6E\U00016A6F\U00016AF5\U00016B37-\U00016B3F\U00016B44\U00016B45\U00016E97-\U00016E9A\U00016FE2\U0001BC9C\U0001BC9F\U0001CF50-\U0001CFC3\U0001D000-\U0001D0F5\U0001D100-\U0001D126\U0001D129-\U0001D164\U0001D16A-\U0001D16C\U0001D183\U0001D184\U0001D18C-\U0001D1A9\U0001D1AE-\U0001D1EA\U0001D200-\U0001D241\U0001D245\U0001D300-\U0001D356\U0001D6C1\U0001D6DB\U0001D6FB\U0001D715\U0001D735\U0001D74F\U0001D76F\U0001D789\U0001D7A9\U0001D7C3\U0001D800-\U0001D9FF\U0001DA37-\U0001DA3A\U0001DA6D-\U0001DA74\U0001DA76-\U0001DA83\U0001DA85-\U0001DA8B\U0001E14F\U0001E2FF\U0001E95E\U0001E95F\U0001ECAC\U0001ECB0\U0001ED2E\U0001EEF0\U0001EEF1\U0001F000-\U0001F02B\U0001F030-\U0001F093\U0001F0A0-\U0001F0AE\U0001F0B1-\U0001F0BF\U0001F0C1-\U0001F0CF\U0001F0D1-\U0001F0F5\U0001F10D-\U0001F1AD\U0001F1E6-\U0001F202\U0001F210-\U0001F23B\U0001F240-\U0001F248\U0001F250\U0001F251\U0001F260-\U0001F265\U0001F300-\U0001F6D7\U0001F6DD-\U0001F6EC\U0001F6F0-\U0001F6FC\U0001F700-\U0001F773\U0001F780-\U0001F7D8\U0001F7E0-\U0001F7EB\U0001F7F0\U0001F800-\U0001F80B\U0001F810-\U0001F847\U0001F850-\U0001F859\U0001F860-\U0001F887\U0001F890-\U0001F8AD\U0001F8B0\U0001F8B1\U0001F900-\U0001FA53\U0001FA60-\U0001FA6D\U0001FA70-\U0001FA74\U0001FA78-\U0001FA7C\U0001FA80-\U0001FA86\U0001FA90-\U0001FAAC\U0001FAB0-\U0001FABA\U0001FAC0-\U0001FAC5\U0001FAD0-\U0001FAD9\U0001FAE0-\U0001FAE7\U0001FAF0-\U0001FAF6\U0001FB00-\U0001FB92\U0001FB94-\U0001FBCA'
pLatin = r'A-Za-z\u00AA\u00BA\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02B8\u02E0-\u02E4\u1D00-\u1D25\u1D2C-\u1D5C\u1D62-\u1D65\u1D6B-\u1D77\u1D79-\u1DBE\u1E00-\u1EFF\u2071\u207F\u2090-\u209C\u212A\u212B\u2132\u214E\u2160-\u2188\u2C60-\u2C7F\uA722-\uA787\uA78B-\uA7CA\uA7D0\uA7D1\uA7D3\uA7D5-\uA7D9\uA7F2-\uA7FF\uAB30-\uAB5A\uAB5C-\uAB64\uAB66-\uAB69\uFB00-\uFB06\uFF21-\uFF3A\uFF41-\uFF5A\U00010780-\U00010785\U00010787-\U000107B0\U000107B2-\U000107BA\U0001DF00-\U0001DF1E'
pHan = r'\u2E80-\u2E99\u2E9B-\u2EF3\u2F00-\u2FD5\u3005\u3007\u3021-\u3029\u3038-\u303B\u3400-\u4DBF\u4E00-\u9FFF\uF900-\uFA6D\uFA70-\uFAD9\U00016FE2\U00016FE3\U00016FF0\U00016FF1\U00020000-\U0002A6DF\U0002A700-\U0002B738\U0002B740-\U0002B81D\U0002B820-\U0002CEA1\U0002CEB0-\U0002EBE0\U0002F800-\U0002FA1D\U00030000-\U0003134A'
pEmojiEx = r'0-9\u00A9\u00AE\u203C\u2049\u2122\u2139\u2194-\u2199\u21A9\u21AA\u231A\u231B\u2328\u23CF\u23E9-\u23F3\u23F8-\u23FA\u24C2\u25AA\u25AB\u25B6\u25C0\u25FB-\u25FE\u2600-\u2604\u260E\u2611\u2614\u2615\u2618\u261D\u2620\u2622\u2623\u2626\u262A\u262E\u262F\u2638-\u263A\u2640\u2642\u2648-\u2653\u265F\u2660\u2663\u2665\u2666\u2668\u267B\u267E\u267F\u2692-\u2697\u2699\u269B\u269C\u26A0\u26A1\u26A7\u26AA\u26AB\u26B0\u26B1\u26BD\u26BE\u26C4\u26C5\u26C8\u26CE\u26CF\u26D1\u26D3\u26D4\u26E9\u26EA\u26F0-\u26F5\u26F7-\u26FA\u26FD\u2702\u2705\u2708-\u270D\u270F\u2712\u2714\u2716\u271D\u2721\u2728\u2733\u2734\u2744\u2747\u274C\u274E\u2753-\u2755\u2757\u2763\u2764\u2795-\u2797\u27A1\u27B0\u27BF\u2934\u2935\u2B05-\u2B07\u2B1B\u2B1C\u2B50\u2B55\u3030\u303D\u3297\u3299\U0001F004\U0001F0CF\U0001F170\U0001F171\U0001F17E\U0001F17F\U0001F18E\U0001F191-\U0001F19A\U0001F1E6-\U0001F1FF\U0001F201\U0001F202\U0001F21A\U0001F22F\U0001F232-\U0001F23A\U0001F250\U0001F251\U0001F300-\U0001F321\U0001F324-\U0001F393\U0001F396\U0001F397\U0001F399-\U0001F39B\U0001F39E-\U0001F3F0\U0001F3F3-\U0001F3F5\U0001F3F7-\U0001F4FD\U0001F4FF-\U0001F53D\U0001F549-\U0001F54E\U0001F550-\U0001F567\U0001F56F\U0001F570\U0001F573-\U0001F57A\U0001F587\U0001F58A-\U0001F58D\U0001F590\U0001F595\U0001F596\U0001F5A4\U0001F5A5\U0001F5A8\U0001F5B1\U0001F5B2\U0001F5BC\U0001F5C2-\U0001F5C4\U0001F5D1-\U0001F5D3\U0001F5DC-\U0001F5DE\U0001F5E1\U0001F5E3\U0001F5E8\U0001F5EF\U0001F5F3\U0001F5FA-\U0001F64F\U0001F680-\U0001F6C5\U0001F6CB-\U0001F6D2\U0001F6D5-\U0001F6D7\U0001F6DD-\U0001F6E5\U0001F6E9\U0001F6EB\U0001F6EC\U0001F6F0\U0001F6F3-\U0001F6FC\U0001F7E0-\U0001F7EB\U0001F7F0\U0001F90C-\U0001F93A\U0001F93C-\U0001F945\U0001F947-\U0001F9FF\U0001FA70-\U0001FA74\U0001FA78-\U0001FA7C\U0001FA80-\U0001FA86\U0001FA90-\U0001FAAC\U0001FAB0-\U0001FABA\U0001FAC0-\U0001FAC5\U0001FAD0-\U0001FAD9\U0001FAE0-\U0001FAE7\U0001FAF0-\U0001FAF6'
pat = fr'[{pHan}{pLatin}{pPunct}]+'
df = df.replace(fr'[{pEmojiEx}]+', '', regex=True)
df['Title'] = df['Title'].str.findall(pat).str.join(" ")
df['Content'] = df['Content'].str.findall(pat).str.join(" ")
print(df.to_string())
Output:
>>> df
Title Content
0 補水法 Skin Care
1 現貨 Test
2 Text, text More text!!!
What it does is:
df.replace(fr'[{pEmojiEx}]+', '', regex=True) removes all chars that carry the Emoji Unicode property (that set includes the digits 0-9 and the * and # keycap bases; I kept 0-9 in the character class but left * and # out so they are not stripped)
.str.findall(fr'[{pHan}{pLatin}{pPunct}]+').str.join(" ") extracts chunks of one or more Latin, Han or any punctuation chars and joins them with a space.
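If maintaining those hand-built ranges is a concern, the third-party regex module (pip install regex, not the stdlib re) understands Unicode script properties, so a rougher sketch of the same idea can lean on \p{Han} and \p{Latin}. Note that this simplified version drops punctuation, unlike the answer above.

import regex  # third-party module, not the built-in re
import pandas as pd

pat = regex.compile(r'[\p{Han}\p{Latin}]+')

df = pd.DataFrame({'Title': ['補水法💦', '📯 現貨 📯'],
                   'Content': ['Skin Care', 'รีบจัดด่วน‼️ ราคาเฉพาะรอบนี💕 Test']})
for col in ['Title', 'Content']:
    # Extract the Han/Latin chunks and join them with a space.
    df[col] = df[col].apply(lambda s: " ".join(pat.findall(s)))
print(df)
#   Title    Content
# 0  補水法  Skin Care
# 1   現貨       Test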
Related
With pandas and Jupyter Notebook I would like to delete everything that is not a letter: hyphens, special characters, etc.
e.g.:
firstname,birthday_date
joe-down§,02-12-1990
lucash brown_ :),06-09-1980
^antony,11-02-1987
mary|,14-12-2002
change with:
firstname,birthday_date
joe down,02-12-1990
lucash brown,06-09-1980
antony,11-02-1987
mary,14-12-2002
I'm trying with:
df['firstname'] = df['firstname'].str.replace(r'!', '')
df['firstname'] = df['firstname'].str.replace(r'^', '')
df['firstname'] = df['firstname'].str.replace(r'|', '')
df['firstname'] = df['firstname'].str.replace(r'§', '')
df['firstname'] = df['firstname'].str.replace(r':', '')
df['firstname'] = df['firstname'].str.replace(r')', '')
......
......
df
it seems to work, but on more populated columns I always miss some characters.
Is there a way to completely eliminate all NON-text characters and keep only a single word or words in the same column? In the example I used firstname to illustrate the idea, but it would also apply to columns containing whole words!
Thanks!
P.S. There is also encoded text for emoticons.
You can use regex for this.
df['firstname'] = df['firstname'].str.replace('[^a-zA-Z0-9]', ' ', regex=True).str.strip()
df.firstname.tolist()
>>> ['joe down', 'lucash brown', 'antony', 'mary']
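If an unwanted character sits in the middle of a name, the substitution above can leave a double space behind; a small follow-up sketch of the same idea collapses any run of whitespace before stripping:

df['firstname'] = (df['firstname']
                   .str.replace('[^a-zA-Z0-9]', ' ', regex=True)
                   .str.replace(r'\s+', ' ', regex=True)
                   .str.strip())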
Try the below. It works on the names you have used in the post:
first_names = ['joe-down§','lucash brown_','^antony','mary|']
clean_names = []
keep = {'-',' '}
for name in first_names:
    clean_names.append(''.join(c if c not in keep else ' ' for c in name if c.isalnum() or c in keep))
print(clean_names)
output
['joe down', 'lucash brown', 'antony', 'mary']
I have multiple strings to postprocess, where a lot of the acronyms have a missing closing bracket. Assume the string text below, but also assume that this type of missing bracket happens often.
My code below only works by adding the closing bracket to the missing acronym independently, but not within the full string/sentence. Any tips on how to do this efficiently, preferably without needing to iterate?
import re
#original string
text = "The dog walked (ABC in the park"
#Desired output:
desired_output = "The dog walked (ABC) in the park"
#My code:
acronyms = re.findall(r'\([A-Z]*\)?', text)
for acronym in acronyms:
    if ')' not in acronym:  # find those without a closing bracket ')'.
        print(acronym + ')')  # add the closing bracket ')'.
#current output:
>>'(ABC)'
You may use
text = re.sub(r'(\([A-Z]+(?!\))\b)', r"\1)", text)
With this approach, you can also get rid of the check if the text has ) in it before, see a demo on regex101.com.
In full:
import re
#original string
text = "The dog walked (ABC in the park"
text = re.sub(r'(\([A-Z]+(?!\))\b)', r"\1)", text)
print(text)
This yields
The dog walked (ABC) in the park
See a working demo on ideone.com.
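As a quick sanity check on a made-up sentence with several acronyms, the substitution closes only the ones that are actually missing a bracket:

import re

text = "The dog walked (ABC in the park (XYZ) near the (DEF pond"
print(re.sub(r'(\([A-Z]+(?!\))\b)', r"\1)", text))
# The dog walked (ABC) in the park (XYZ) near the (DEF) pond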
For the typical example you have provided, I don't see the need for regex.
You can just use some string methods:
text = "The dog walked (ABC in the park"
withoutClosing = [word for word in text.split() if word.startswith('(') and not word.endswith(')') ]
withoutClosing
Out[45]: ['(ABC']
Now that you have the words without a closing parenthesis, you can just replace them:
for eachWord in withoutClosing:
    text = text.replace(eachWord, eachWord+')')
text
Out[46]: 'The dog walked (ABC) in the park'
I have a complex text where I am categorizing different keywords stored in a dictionary:
text = 'data-ls-static="1">Making Bio Implants, Drug Delivery and 3D Printing in Medicine,MEDICINE</h3>'
sector = {"med tech": ['Drug Delivery', '3D printing', 'medicine', 'medical technology', 'bio cell']}
This can successfully find my keywords and categorize them, with some limitations:
pattern = r'[a-zA-Z0-9]+'
[cat for cat in sector if any(x in re.findall(pattern,text) for x in sector[cat])]
The limitations that I cannot solve are:
For example, keywords like "Drug Delivery" that contain a space are not recognized and therefore not categorized.
I was not able to make the pattern case-insensitive, so words like MEDICINE are not recognized. I tried to add (?i) to the pattern but it doesn't work.
The categorized keywords go into a pandas df, but they are printed inside []. I tried looping over the script again to take them out, but they are still there.
Data to pandas df:
ind_list = []
for site in url_list:
    ind = [cat for cat in indication if any(x in re.findall(pattern,soup_string) for x in indication[cat])]
    ind_list.append(ind)
websites['Indication'] = ind_list
Current output:
Website Sector Sub-sector Therapeutical Area Focus URL status
0 url3.com [med tech] [] [] [] []
1 www.url1.com [med tech, services] [] [oncology, gastroenterology] [] []
2 www.url2.com [med tech, services] [] [orthopedy] [] []
In the output I get [] that I'd like to avoid.
Can you help me with these points?
Thanks!
Here are some hints on the problems that can readily be spotted:
Why doesn't it match keywords like "Drug Delivery" that contain a space? This is because the regex pattern r'[a-zA-Z0-9]+' does not match a space. You can change it to r'[a-zA-Z0-9 ]+' (a space added after the 9) if you also want to match spaces. However, if you want to support other kinds of whitespace (e.g. \t, \n), you need to adjust the pattern further.
Why is the match not case-insensitive? Your code fragment any(x in re.findall(pattern,text) for x in sector[cat]) requires x to have the same upper/lower case BOTH in the result of re.findall and in sector[cat]. This constraint cannot even be bypassed by setting flags=re.I in the re.findall() call. I suggest converting everything to the same case before checking, for example lowercasing both sides: any(x.lower() in re.findall(pattern, text.lower()) for x in sector[cat]). Here .lower() is applied to both text and x.
With the above 2 changes, it should allow you to capture some categorized keywords.
Actually, for this particular case, you may not need regular expressions and re.findall at all. You may just check e.g. sector[cat][i].lower() in text.lower(). That is, change the list comprehension as follows:
[cat for cat in sector if any(x in text.lower() for x in [y.lower() for y in sector[cat]])]
Edit
Test Run with 2-word phrase:
text = 'drug delivery'
sector = {"med tech": ['Drug Delivery', '3D printing', 'medicine', 'medical technology', 'bio cell']}
[cat for cat in sector if any(x in text.lower() for x in [y.lower() for y in sector[cat]])]
Output: # Successfully got the categorizing keyword even with dictionary values of different upper/lower cases
['med tech']
text = 'Drug Store fast delivery'
[cat for cat in sector if any(x in text.lower() for x in [y.lower() for y in sector[cat]])]
Output: # Correctly doesn't match with extra words in between
[]
Can you try a different approach other than regex?
I would suggest difflib when you have two similar matching words.
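A minimal sketch of that idea with difflib.get_close_matches; the word list and cutoff here are made up for illustration:

import difflib

# Fuzzy-match a keyword against the words seen in the text; cutoff controls
# how similar a word must be to count as a match.
words_in_text = ['making', 'bio', 'implants', 'drug', 'delivery', 'medicin']
print(difflib.get_close_matches('medicine', words_in_text, n=1, cutoff=0.8))
# ['medicin']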
findall is pretty wasteful here since you are repeatedly breaking up the string for each keyword.
If you want to test whether the keyword is in the string:
[cat for cat in sector if any(re.search(word, text, re.I) for word in sector[cat])]
# Output: ['med tech']
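One caveat with re.search: if a keyword could ever contain regex metacharacters, it is safer to escape it, and word boundaries avoid matching inside longer words. A hedged variant:

import re

text = 'data-ls-static="1">Making Bio Implants, Drug Delivery and 3D Printing in Medicine,MEDICINE</h3>'
sector = {"med tech": ['Drug Delivery', '3D printing', 'medicine', 'medical technology', 'bio cell']}

# Escape each keyword and anchor it on word boundaries, case-insensitively.
print([cat for cat in sector
       if any(re.search(r'\b' + re.escape(word) + r'\b', text, re.I)
              for word in sector[cat])])
# ['med tech']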
I have a string that comes from an article with a few hundred sentences. I want to convert the string to a dataframe, with each sentence as a row. For example,
data = 'This is a book, to which I found exciting. I bought it for my cousin. He likes it.'
I hope it becomes:
This is a book, to which I found exciting.
I bought it for my cousin.
He likes it.
As a Python newbie, this is what I tried:
import pandas as pd
from io import StringIO
data_csv = StringIO(data)
data_df = pd.read_csv(data_csv, sep = ".")
With the code above, all sentences become column names. I actually want them in rows of a single column.
Don't use read_csv. Just split by '.' and use the standard pd.DataFrame:
data = 'This is a book, to which I found exciting. I bought it for my cousin. He likes it.'
data_df = pd.DataFrame([sentence for sentence in data.split('.') if sentence],
                       columns=['sentences'])
print(data_df)
# sentences
# 0 This is a book, to which I found exciting
# 1 I bought it for my cousin
# 2 He likes it
Keep in mind that this will break if there are floating-point numbers in some of the sentences. In that case you will need to change the format of your string (e.g. use '\n' instead of '.' to separate sentences).
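As a hedged workaround for the decimal-number case, you can split only on a period that is followed by whitespace or the end of the string, so values like 3.14 survive:

import re
import pandas as pd

data = 'Pi is roughly 3.14. I bought it for my cousin. He likes it.'
# Split on '.' only when it ends a sentence (followed by a space or the end).
sentences = [s for s in re.split(r'\.(?:\s+|$)', data) if s]
print(pd.DataFrame(sentences, columns=['sentences']))
# 0         Pi is roughly 3.14
# 1  I bought it for my cousin
# 2                He likes it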
This is a quick solution, but it solves your issue:
from io import StringIO
data_df = pd.read_csv(StringIO(data), sep=".", header=None).T
You can achieve this via a list comprehension:
data = 'This is a book, to which I found exciting. I bought it for my cousin. He likes it.'
df = pd.DataFrame({'sentence': [i+'.' for i in data.split('. ')]})
print(df)
# sentence
# 0 This is a book, to which I found exciting.
# 1 I bought it for my cousin.
# 2 He likes it.
What you are trying to do is called tokenizing sentences. The easiest way would be to use a Text-Mining library such as NLTK for it:
from nltk.tokenize import sent_tokenize
pd.DataFrame(sent_tokenize(data))
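One practical note: NLTK's sentence tokenizer needs its 'punkt' models downloaded once, otherwise sent_tokenize raises a LookupError:

import nltk
nltk.download('punkt')  # one-time download of the sentence tokenizer models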
Otherwise you could simply try something like:
pd.DataFrame(data.split('. '))
However, this will fail if you run into sentences like this:
problem = 'Tim likes to jump... but not always!'
Python regular expression: I have a string that contains keywords, but sometimes the keywords don't exist and they are not in any particular order. I need help with the regular expression.
Keywords are:
Up-to-date
date added
date trained
These are the keywords I need to find amongst a number of other keywords; they may not exist and can appear in any order.
What the string looks like:
<div>
<h2 class='someClass'>text</h2>
blah blah blah Up-to-date blah date added blah
</div>
What I've tried:
regex = re.compile('</h2>.*(Up\-to\-date|date\sadded|date\strained)*.*</div>')
regex = re.compile('</h2>.*(Up\-to\-date?)|(date\sadded?)|(date\strained?).*</div>')
re.findall(regex,string)
The outcome I'm looking for would be:
If all exist:
['Up-to-date','date added','date trained']
If some exist:
['Up-to-date','','date trained']
Does it have to be a regex? If not, you could use find:
In [12]: sentence = 'hello world cat dog'
In [13]: words = ['cat', 'bear', 'dog']
In [15]: [w*(sentence.find(w)>=0) for w in words]
Out[15]: ['cat', '', 'dog']
This code does what you want, but it smells:
import re
def check(the_str):
    output_list = []
    u2d = re.compile('</h2>.*Up\-to\-date*.*</div>')
    da = re.compile('</h2>.*date\sadded*.*</div>')
    dt = re.compile('</h2>.*date\strained*.*</div>')
    if re.match(u2d, the_str):
        output_list.append("Up-to-date")
    if re.match(da, the_str):
        output_list.append("date added")
    if re.match(dt, the_str):
        output_list.append("date trained")
    return output_list

the_str = "</h2>My super cool string with the date added and then some more text</div>"
print(check(the_str))

the_str2 = "</h2>My super cool string date added with the date trained and then some more text</div>"
print(check(the_str2))

the_str3 = "</h2>My super cool string date added with the date trained and then Up-to-date some more text</div>"
print(check(the_str3))
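A less repetitive sketch of the same idea (dropping the </h2>...</div> anchoring for brevity, and using '' placeholders as in the desired output):

import re

keywords = ['Up-to-date', 'date added', 'date trained']

def check(the_str):
    # Keep the keyword if it occurs anywhere in the string, else ''.
    return [kw if re.search(re.escape(kw), the_str) else '' for kw in keywords]

print(check("</h2>blah blah blah Up-to-date blah date added blah</div>"))
# ['Up-to-date', 'date added', '']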