I want to delete greek words with capital letters, such as:
text = 'Ο Κώστας θέλει να ΠΑΙΞΕΙ ΑΎΡΙΟ ποδόσφαιρο στο σχολείο'
the output should be
text = 'Ο Κώστας θέλει να ποδόσφαιρο στο σχολείο'
I checked this one Regular expression : Remove words with Capital letters, but I don't know how to adopt the code into Greek aphabet.
We can (by consulting an Unicode chart) see that the Greek letters are approximately in the range U+0370..U+1FFF, and then filter using the unicodedata module:
>>> import unicodedata
>>> greek_capital_chars = set(chr(cp) for cp in range(0x0370, 0x1FFF) if "GREEK CAPITAL" in unicodedata.name(chr(cp), ""))
{'Β', 'Χ', 'ᾛ', 'Ἁ', 'Ὼ', 'ᾜ', 'ᾫ', 'Ἂ', 'Ὰ', 'Ἑ', 'Ω', 'Ἤ', 'Ε', 'Ρ', 'Η', 'ᾏ', 'Ϳ', 'Ή', 'Ἣ', 'Ἵ', 'ᾋ', 'Ύ', 'ᾚ', 'Ή', 'Ϲ', 'Ί', 'Ὥ', 'Ύ', 'Ξ', 'Ὄ', 'Ο', 'Θ', 'Ϗ', 'Ϋ', 'Ͻ', 'ᾘ', 'Ὑ', 'Ώ', 'Ᾰ', 'ᾝ', 'Ἐ', 'Ὦ', 'Ά', 'Σ', 'Ὂ', 'Ἱ', 'Ὤ', 'Ͷ', 'Ὴ', 'Ό', 'Ψ', 'ῼ', 'Φ', 'Ἒ', 'Ὕ', 'ᾪ', 'Ἅ', 'Ῑ', 'Ἧ', 'Λ', 'Ἢ', 'Ϸ', 'Ἔ', 'Ί', 'Ἇ', 'Ἲ', 'Ὓ', 'Ζ', 'Τ', 'Ὗ', 'Ϊ', 'Ͽ', 'Μ', 'Ὀ', 'Ἄ', 'ᾊ', 'Κ', 'Γ', 'Ὶ', 'Ϻ', 'Ᾱ', 'ᾬ', 'Ώ', 'Ἳ', 'Ἥ', 'Ἦ', 'Ι', 'Ἃ', 'ᾌ', 'Ὁ', 'Έ', 'Δ', 'Ὡ', 'Ἆ', 'Ἰ', 'ϴ', 'Ͼ', 'Ῠ', 'ῌ', 'Ἓ', 'Ἕ', 'Έ', 'Ὃ', 'Ὠ', 'ᾈ', 'Ͱ', 'ᾼ', 'Ὢ', 'ᾙ', 'ᾞ', 'ᾎ', 'Ὸ', 'Ῥ', 'Ἀ', 'Ὣ', 'Ͳ', 'Ἶ', 'Ῐ', 'ᾮ', 'ᾍ', 'Ἡ', 'Ῡ', 'Ὧ', 'ᾉ', 'ᾩ', 'ᾯ', 'ᾭ', 'ᾟ', 'Ό', 'Α', 'Ὲ', 'Υ', 'Π', 'Ἴ', 'Ά', 'Ἷ', 'ᾨ', 'Ὅ', 'Ὺ', 'Ν', 'Ἠ'}
Then, you can form a regexp that matches words (continous runs) of such characters. We'll also include Latin capital characters.
>>> import re
>>> import string
>>> chars_class = re.escape("".join(greek_capital_chars.union(string.ascii_uppercase)))
>>> r = re.compile(f"[{chars_class}]+")
>>> text = 'Ο Κώστας θέλει να ΠΑΙΞΕΙ ΑΎΡΙΟ ποδόσφαιρο στο σχολείο'
>>> r.sub("", text)
' ώστας θέλει να ποδόσφαιρο στο σχολείο'
As it is, the regex will of course also remove any capital letter; you may wish to do
>>> r = re.compile(f"[{chars_class}]{{2,}}")
>>> r.sub("", text)
'Ο Κώστας θέλει να ποδόσφαιρο στο σχολείο'
or similar instead, depending on your use case.
For example, there is a string or txt
"""
asfas #111 dfsfds #222 dsfsdfsfsd dsfds
dsfsdfs sdfsdgsd #333 dsfsdfs dfsfsdf #444 dfsfsd
dsfssgs sdsdg #555 fsfh
"""
Desired result:
"""
#111
#222
#333
#444
#555
"""
Using the code below, I can only see the first result.
import re
html="asfas #111 dfsfds #222 dsfsdfsfsd dsfds"
result = re.search('#"(.+?) ', html)
x = (result.group(0))
print(x)
How do I improve my code?
You can use re.findall method instead of re.search (re.search searches only for the first location where the regular expression pattern produces a match):
import re
txt = '''asfas #111 dfsfds #222 dsfsdfsfsd dsfds
dsfsdfs sdfsdgsd #333 dsfsdfs dfsfsdf #444 dfsfsd
dsfssgs sdsdg #555 fsfh'''
print(*re.findall(r'#\d+', txt), sep='\n')
Prints:
#111
#222
#333
#444
#555
If you always have # followed by 3 digits then:
import re
text = '''asfas #111 dfsfds #222 dsfsdfsfsd dsfds
dsfsdfs sdfsdgsd #333 dsfsdfs dfsfsdf #444 dfsfsd
dsfssgs sdsdg #555 fsfh
'''
results = re.findall(r'(#\d{3})', text)
print(results)
So () means keep a pattern that has # followed by only 3 digits.
You can do this even without using regex :
html="asfas #111 dfsfds #222 dsfsdfsfsd dsfds"
x = [i for i in html.split() if i.startswith('#')]
Output :
['#111', '#222']
def text_process(text):
text = text.translate(str.maketrans('', '', string.punctuation))
return " ".join(text)
Input text: 'Transaction value was - RS.3456.63 '
Output : 'Transaction value was RS 345663 '
Could someone suggest me how to remove special characters (including '.' ) during text pre-processing but retain the decimal numbers?
Required Output : 'Transaction value was RS 3456.63 '
You can use a more generic regex to replace all special characters except .
import re
def text_process(text):
text = re.sub('[^\w.]+', ' ', text)
return text
s = 'Transaction: value* #was - 3456.63 Rupees'
text_process(s)
You get
'Transaction value was 3456.63 Rupees'
EDIT: The following function returns only the number with decimals.
def text_process(text):
text = re.sub('[^\d.]+', '', text)
return text
s = 'Transaction: value* #was - 3456.63 Rupees'
text_process(s)
'3456.63'
If I understand your question correctly, this code is for you:
text = 'Transaction value was, - 3456.63 Rupees'
regex = r"(?<!\d)[" + string.punctuation + "](?!\d)"
result = re.sub(regex, "", text)
# output: 'Transaction value was 3456.63 Rupees'
To solve your second question, try using this trick:
text = 'Transaction value was, - Rs.3456.63'
regex_space = r"([0-9]+(\.[0-9]+)?)"
regex_punct = r'[^\w.]+'
re.sub(r'[^\w.]+', ' ', re.sub(regex_space,r" \1 ", text).strip())
# output: 'Transaction value was Rs. 3456.63 Rupees'
I have a csv table from which I get my regex pattern, e.g. \bconden
Problem : I don't manage to specify to python that this is a raw string
How to put r before a pattern when it comes from a string ?
import re
a = 'de la matière condensée'
fromcsv = '\bconden'
print(re.search('r' + fromcsv, a))
result is None
You can use the str_to_raw function below to make a raw string out of an already declared plain string variable:
import re
a = 'de la matière condensée'
pattern = '\bconden'
escape_dict = {
'\a': r'\a',
'\b': r'\b',
'\c': r'\c',
'\f': r'\f',
'\n': r'\n',
'\r': r'\r',
'\t': r'\t',
'\v': r'\v',
'\'': r'\'',
'\"': r'\"',
'\0': r'\0',
'\1': r'\1',
'\2': r'\2',
'\3': r'\3',
'\4': r'\4',
'\5': r'\5',
'\6': r'\6',
'\7': r'\7',
'\8': r'\8',
'\9': r'\9'
}
def str_to_raw(s):
return r''.join(escape_dict.get(c, c) for c in s)
print(re.search(r'\bconden', a))
print(re.search(str_to_raw(pattern), a))
Output:
<re.Match object; span=(14, 20), match='conden'>
<re.Match object; span=(14, 20), match='conden'>
note: I got escape_dict from this page.
I have the following data and I need to extract the first string occurrence It is separated from rest of data with \t. I'm trying to use split(),regex but the problem is it is taking more than 1 second to do this for each line. Is there anyway that it could be done faster?
Data:
DT 0.00155095460731831934 0.00121897344629313064 0.00000391325536877105 0.09743272975663436197 0.00002271067721789807 0.00614528909266214615 0.00000445295550745487 0.70422975214810612510 0.00000042521183266708 0.00080380970031485965 0.00046229528280753270 0.00019894095277762626 0.00041012830368947716 0.00013156663380611624 0.00000001065986007929 0.00004244196517011733 0.00061444160944146384 0.02101761386512242258 0.00010328516871273944 0.00001128873771536226 0.00279163054567377073 0.00018903663417650421 0.00006490063677390687 0.00002151218889856898 0.00032824534915777535 0.00040349658620449016 0.00042393411014689220 0.00053643791028589382 0.00001032961180051124 0.00025743865541833909 0.00011497457801324625 0.00005359814320647386 0.00010336445810407512 0.00040942464084107332 0.00009098970100047888 0.00000091369931486168 0.00059479547081431436 0.00000009853464391239 0.00020303484015768289 0.00050594563648307127 0.15679657927655424321 0.00034115929559768240 0.00115490132012489345 0.00019823414624750937
PRP 0.00000131203717608417 0.99998368311809904263 0.00000002192874737415 0.00000073240710142655 0.00000000536610432900 0.00000195554704853124 0.00000000012203475361 0.00000017206852489982 0.00000040268728691384 0.00000034167449501884 0.00000077203219019333 0.00000003082351874675 0.00000052849070550174 0.00000319144710228690 0.00000000009512989203 0.00000002016363199180 0.00000005598551431381 0.00000129166108708107 0.00000004127954869435 0.00000099983230311242 0.00000032415702089502 0.00000010477525952469 0.00000000011045642123 0.00000006942075882668 0.00000017433924380308 0.00000028874823360049 0.00000048656924101513 0.00000017722073116061 0.00000037193481161874 0.00000000452174124394 0.00000081986547018432 0.00000001740977711224 0.00000000808377988046 0.00000001418892143074 0.00000045250939471023 0.00000000000050232556 0.00000043504206149021 0.00000011310292804313 0.00000000013241046549 0.00000015302998639348 0.00000002800056509608 0.00000038361859715043 0.00000000099713364069 0.00000001345362455494
VBD 0.00000002905639670475 0.00000000730896486886 0.00000000406530491040 0.00000009048972500851 0.00000000380338117015 0.00000000000390031394 0.00000000169948197867 0.00000000091890304843 0.00000000013856552537 0.00000191013917141413 0.00000002300239228881 0.00000003601993413087 0.00000004266629173115 0.00000000166497478879 0.00000000000079281873 0.00000180895378547175 0.00000000000159251758 0.00000000081310874277 0.00000000334322892919 0.99999591744268101490 0.00000000000454647012 0.00000000060884665646 0.00000000000010515727 0.00000000019245471748 0.00000000308524019147 0.00000001376847404364 0.00000001449670334202 0.00000001434634011983 0.00000000656887521298 0.00000000796791556475 0.00000000578334901413 0.00000000142124935798 0.00000000213053365838 0.00000000487780229311 0.00000001702409705978 0.00000000391793832836 0.00000001292779157438 0.00000000002447935587 0.00000000000435117453 0.00000000408872313468 0.00000000007201124397 0.00000000431736839121 0.00000000002970930698 0.00000000080852330796
RB 0.00000015663242474016 0.00000002464350694082 0.00000000095443410385 0.99998778106321006831 0.00000000021007124986 0.00000006156902517681 0.00000000277279124155 0.00000000301727284928 0.00000000030682776953 0.00000007379165980724 0.00000012399749754355 0.00000494600825959811 0.00000008488215978963 0.00000000897527112360 0.00000000000009257081 0.00000000223574222125 0.00000000371653801739 0.00000548300954899374 0.00000001802212638276 0.00000000022437343140 0.00000001084514551630 0.00000000328207000562 0.00000000672649111321 0.00000003640165688536 0.00000050812474700731 0.00000007422081603379 0.00000018000760320187 0.00000007733588104368 0.00000008890139839523 0.00000001494850369145 0.00000003233439691280 0.00000000299507821025 0.00000000501198681017 0.00000000271863832841 0.00000004782796496077 0.00000000000160157399 0.00000006968900381578 0.00000000003199719817 0.00000001234122837743 0.00000002204081342858 0.00000000038818632144 0.00000002327335651712 0.00000000016015202564 0.00000000435845392228
VBN 0.00222925562857408935 0.00055631931823257885 0.00000032474066230587 0.00333293927262896372 0.12594759350192680225 0.00142014631420757115 0.00008260266473343272 0.00001658664201138300 0.00000444848747905589 0.00025881226046863004 0.00176478222683846956 0.00226268536384150636 0.00120807701719786715 0.00016158429451364274 0.00000000200391980114 0.00012971908549403702 0.41488930515218963579 0.41237674095727266943 0.00025649814681915863 0.00001340291420511781 0.00067983726358035045 0.00001718712609473795 0.00009573412529081616 0.02342065200703593100 0.00010281749829896253 0.00243912549478067552 0.00111221146411718771 0.00110067534479759994 0.00048702441892562549 0.00014537544850052323 0.00046019613393571187 0.00004100416046505168 0.00001820421200359182 0.00013212194667244404 0.00112515351673182361 0.00000022002597310723 0.00099184191436586821 0.00000187809735682276 0.00000214888688830288 0.00031369371619907773 0.00000552482376141306 0.00033123576486582436 0.00000227934800338172 0.00006203126813779618
So,the bottom line is I need to extract DT, PRP, VBD... from the above text really fast.
You can just call split with maxsplit argument and wrap it into a list generator.
result = [line.split('\t', 1)[0] for line in data]
As you see, passing 1 in the method call makes it stop after the first splitting takes place. I bet this is the fastest solution in Python.
A manual alternative.
def end_of_loop():
raise StopIteration
def my_split(line):
return ''.join(end_of_loop() if char == '\t' else char for char in line)
result = [my_split(line) for line in lines]
Provided your data are in a file:
with open(file) as data:
result = [my_split(line) for line in data]
This will be a lot slower than the first one.
You can use split in a list comprehension :
>>> s="""DT 0.00155095460731831934 0.00121897344629313064 0.00000391325536877105 0.09743272975663436197 0.00002271067721789807 0.00614528909266214615 0.00000445295550745487 0.70422975214810612510 0.00000042521183266708 0.00080380970031485965 0.00046229528280753270 0.00019894095277762626 0.00041012830368947716 0.00013156663380611624 0.00000001065986007929 0.00004244196517011733 0.00061444160944146384 0.02101761386512242258 0.00010328516871273944 0.00001128873771536226 0.00279163054567377073 0.00018903663417650421 0.00006490063677390687 0.00002151218889856898 0.00032824534915777535 0.00040349658620449016 0.00042393411014689220 0.00053643791028589382 0.00001032961180051124 0.00025743865541833909 0.00011497457801324625 0.00005359814320647386 0.00010336445810407512 0.00040942464084107332 0.00009098970100047888 0.00000091369931486168 0.00059479547081431436 0.00000009853464391239 0.00020303484015768289 0.00050594563648307127 0.15679657927655424321 0.00034115929559768240 0.00115490132012489345 0.00019823414624750937
... PRP 0.00000131203717608417 0.99998368311809904263 0.00000002192874737415 0.00000073240710142655 0.00000000536610432900 0.00000195554704853124 0.00000000012203475361 0.00000017206852489982 0.00000040268728691384 0.00000034167449501884 0.00000077203219019333 0.00000003082351874675 0.00000052849070550174 0.00000319144710228690 0.00000000009512989203 0.00000002016363199180 0.00000005598551431381 0.00000129166108708107 0.00000004127954869435 0.00000099983230311242 0.00000032415702089502 0.00000010477525952469 0.00000000011045642123 0.00000006942075882668 0.00000017433924380308 0.00000028874823360049 0.00000048656924101513 0.00000017722073116061 0.00000037193481161874 0.00000000452174124394 0.00000081986547018432 0.00000001740977711224 0.00000000808377988046 0.00000001418892143074 0.00000045250939471023 0.00000000000050232556 0.00000043504206149021 0.00000011310292804313 0.00000000013241046549 0.00000015302998639348 0.00000002800056509608 0.00000038361859715043 0.00000000099713364069 0.00000001345362455494
... VBD 0.00000002905639670475 0.00000000730896486886 0.00000000406530491040 0.00000009048972500851 0.00000000380338117015 0.00000000000390031394 0.00000000169948197867 0.00000000091890304843 0.00000000013856552537 0.00000191013917141413 0.00000002300239228881 0.00000003601993413087 0.00000004266629173115 0.00000000166497478879 0.00000000000079281873 0.00000180895378547175 0.00000000000159251758 0.00000000081310874277 0.00000000334322892919 0.99999591744268101490 0.00000000000454647012 0.00000000060884665646 0.00000000000010515727 0.00000000019245471748 0.00000000308524019147 0.00000001376847404364 0.00000001449670334202 0.00000001434634011983 0.00000000656887521298 0.00000000796791556475 0.00000000578334901413 0.00000000142124935798 0.00000000213053365838 0.00000000487780229311 0.00000001702409705978 0.00000000391793832836 0.00000001292779157438 0.00000000002447935587 0.00000000000435117453 0.00000000408872313468 0.00000000007201124397 0.00000000431736839121 0.00000000002970930698 0.00000000080852330796
... RB 0.00000015663242474016 0.00000002464350694082 0.00000000095443410385 0.99998778106321006831 0.00000000021007124986 0.00000006156902517681 0.00000000277279124155 0.00000000301727284928 0.00000000030682776953 0.00000007379165980724 0.00000012399749754355 0.00000494600825959811 0.00000008488215978963 0.00000000897527112360 0.00000000000009257081 0.00000000223574222125 0.00000000371653801739 0.00000548300954899374 0.00000001802212638276 0.00000000022437343140 0.00000001084514551630 0.00000000328207000562 0.00000000672649111321 0.00000003640165688536 0.00000050812474700731 0.00000007422081603379 0.00000018000760320187 0.00000007733588104368 0.00000008890139839523 0.00000001494850369145 0.00000003233439691280 0.00000000299507821025 0.00000000501198681017 0.00000000271863832841 0.00000004782796496077 0.00000000000160157399 0.00000006968900381578 0.00000000003199719817 0.00000001234122837743 0.00000002204081342858 0.00000000038818632144 0.00000002327335651712 0.00000000016015202564 0.00000000435845392228
... VBN 0.00222925562857408935 0.00055631931823257885 0.00000032474066230587 0.00333293927262896372 0.12594759350192680225 0.00142014631420757115 0.00008260266473343272 0.00001658664201138300 0.00000444848747905589 0.00025881226046863004 0.00176478222683846956 0.00226268536384150636 0.00120807701719786715 0.00016158429451364274 0.00000000200391980114 0.00012971908549403702 0.41488930515218963579 0.41237674095727266943 0.00025649814681915863 0.00001340291420511781 0.00067983726358035045 0.00001718712609473795 0.00009573412529081616 0.02342065200703593100 0.00010281749829896253 0.00243912549478067552 0.00111221146411718771 0.00110067534479759994 0.00048702441892562549 0.00014537544850052323 0.00046019613393571187 0.00004100416046505168 0.00001820421200359182 0.00013212194667244404 0.00112515351673182361 0.00000022002597310723 0.00099184191436586821 0.00000187809735682276 0.00000214888688830288 0.00031369371619907773 0.00000552482376141306 0.00033123576486582436 0.00000227934800338172 0.00006203126813779618"""
>>> [i.split()[0] for i in s.split('\n')]
['DT', 'PRP', 'VBD', 'RB', 'VBN']
import re
p = re.compile(r'^\S+', re.MULTILINE)
re.findall(p, test_str)
You can simply do this to get a list of strings you want.