Python replace substring given pattern - python

I am doing some data cleansing and want to remove the whole string between the characters:
"<p.>kódy:" and "</p.>" The strings are located in a dataframe and for each record, different characters can be found between the two characters, so I thought using combining str.place or re.sub with some kind of a wilcard could work, but I've not been succesful.
This is my sample input:
<p.>kódy: 2008212017 2008212025 2008212041 2008212066 2008212074 2008212108 2008212116 2008212124 2008212132 2008212140 2008212165 2008212199 2008212207 2008212215 2008212223 2008212231 2008212249 2008212256 2008212264 2008212272 2008212314 2008212355 2008212363 2008212389 2052500028 2052500036 2052500051 2052500069 2052500093 2052500101 2054384017 2054384041 2054384066 2054384090 2054384116 2054384124 2054384132 2054384140 2054384157 2054384165 2054384181 2054384199 2054384207 2054384215 2054384223 2054384249 20543842494 2054384348 2081043032 2081043057 2081043081 2081043214 2081043222 311088575007 311095577004 311095711009 4210013769006 62008212110</p.>
<p.> 924071180 924071181 924071182 </p.>
And the desired output:
<p.> 924071180 924071181 924071182 </p.>
Any help would be appreciated!
Cheers,
Stepan

You can use a regular expression substitution to get what you want in a single call:
data = """<p.>kódy: 2008212017 2008212025 2008212041 2008212066 2008212074 2008212108 2008212116 2008212124 2008212132 2008212140 2008212165 2008212199 2008212207 2008212215 2008212223 2008212231 2008212249 2008212256 2008212264 2008212272 2008212314 2008212355 2008212363 2008212389 2052500028 2052500036 2052500051 2052500069 2052500093 2052500101 2054384017 2054384041 2054384066 2054384090 2054384116 2054384124 2054384132 2054384140 2054384157 2054384165 2054384181 2054384199 2054384207 2054384215 2054384223 2054384249 20543842494 2054384348 2081043032 2081043057 2081043081 2081043214 2081043222 311088575007 311095577004 311095711009 4210013769006 62008212110</p.>
<p.> 924071180 924071181 924071182 </p.>
"""
import re
r = re.sub(r"<p\.>kódy:.+?</p\.>", "", data)
print(r)
Result:
<p.> 924071180 924071181 924071182 </p.>

You can use split.
st = "<p.>kódy: 2008212017 2008212025 2008212041 2008212066 2008212074 2008212108 2008212116 2008212124 2008212132 2008212140 2008212165 2008212199 2008212207 2008212215 2008212223 2008212231 2008212249 2008212256 2008212264 2008212272 2008212314 2008212355 2008212363 2008212389 2052500028 2052500036 2052500051 2052500069 2052500093 2052500101 2054384017 2054384041 2054384066 2054384090 2054384116 2054384124 2054384132 2054384140 2054384157 2054384165 2054384181 2054384199 2054384207 2054384215 2054384223 2054384249 20543842494 2054384348 2081043032 2081043057 2081043081 2081043214 2081043222 311088575007 311095577004 311095711009 4210013769006 62008212110</p.> <p.> 924071180 924071181 924071182 </p.>"
st.split('</p.>')
Result:
['<p.>kódy: 2008212017 2008212025 2008212041 2008212066 2008212074 2008212108 2008212116 2008212124 2008212132 2008212140 2008212165 2008212199 2008212207 2008212215 2008212223 2008212231 2008212249 2008212256 2008212264 2008212272 2008212314 2008212355 2008212363 2008212389 2052500028 2052500036 2052500051 2052500069 2052500093 2052500101 2054384017 2054384041 2054384066 2054384090 2054384116 2054384124 2054384132 2054384140 2054384157 2054384165 2054384181 2054384199 2054384207 2054384215 2054384223 2054384249 20543842494 2054384348 2081043032 2081043057 2081043081 2081043214 2081043222 311088575007 311095577004 311095711009 4210013769006 62008212110', ' <p.> 924071180 924071181 924071182 ', '']
Or:
import re
t = re.sub('<p.>kódy.*?</p.>', '', st)
Result:
' <p.> 924071180 924071181 924071182 </p.>'

Related

regular expressions: remove Greek words with capital letters

I want to delete greek words with capital letters, such as:
text = 'Ο Κώστας θέλει να ΠΑΙΞΕΙ ΑΎΡΙΟ ποδόσφαιρο στο σχολείο'
the output should be
text = 'Ο Κώστας θέλει να ποδόσφαιρο στο σχολείο'
I checked this one Regular expression : Remove words with Capital letters, but I don't know how to adopt the code into Greek aphabet.
We can (by consulting an Unicode chart) see that the Greek letters are approximately in the range U+0370..U+1FFF, and then filter using the unicodedata module:
>>> import unicodedata
>>> greek_capital_chars = set(chr(cp) for cp in range(0x0370, 0x1FFF) if "GREEK CAPITAL" in unicodedata.name(chr(cp), ""))
{'Β', 'Χ', 'ᾛ', 'Ἁ', 'Ὼ', 'ᾜ', 'ᾫ', 'Ἂ', 'Ὰ', 'Ἑ', 'Ω', 'Ἤ', 'Ε', 'Ρ', 'Η', 'ᾏ', 'Ϳ', 'Ή', 'Ἣ', 'Ἵ', 'ᾋ', 'Ύ', 'ᾚ', 'Ή', 'Ϲ', 'Ί', 'Ὥ', 'Ύ', 'Ξ', 'Ὄ', 'Ο', 'Θ', 'Ϗ', 'Ϋ', 'Ͻ', 'ᾘ', 'Ὑ', 'Ώ', 'Ᾰ', 'ᾝ', 'Ἐ', 'Ὦ', 'Ά', 'Σ', 'Ὂ', 'Ἱ', 'Ὤ', 'Ͷ', 'Ὴ', 'Ό', 'Ψ', 'ῼ', 'Φ', 'Ἒ', 'Ὕ', 'ᾪ', 'Ἅ', 'Ῑ', 'Ἧ', 'Λ', 'Ἢ', 'Ϸ', 'Ἔ', 'Ί', 'Ἇ', 'Ἲ', 'Ὓ', 'Ζ', 'Τ', 'Ὗ', 'Ϊ', 'Ͽ', 'Μ', 'Ὀ', 'Ἄ', 'ᾊ', 'Κ', 'Γ', 'Ὶ', 'Ϻ', 'Ᾱ', 'ᾬ', 'Ώ', 'Ἳ', 'Ἥ', 'Ἦ', 'Ι', 'Ἃ', 'ᾌ', 'Ὁ', 'Έ', 'Δ', 'Ὡ', 'Ἆ', 'Ἰ', 'ϴ', 'Ͼ', 'Ῠ', 'ῌ', 'Ἓ', 'Ἕ', 'Έ', 'Ὃ', 'Ὠ', 'ᾈ', 'Ͱ', 'ᾼ', 'Ὢ', 'ᾙ', 'ᾞ', 'ᾎ', 'Ὸ', 'Ῥ', 'Ἀ', 'Ὣ', 'Ͳ', 'Ἶ', 'Ῐ', 'ᾮ', 'ᾍ', 'Ἡ', 'Ῡ', 'Ὧ', 'ᾉ', 'ᾩ', 'ᾯ', 'ᾭ', 'ᾟ', 'Ό', 'Α', 'Ὲ', 'Υ', 'Π', 'Ἴ', 'Ά', 'Ἷ', 'ᾨ', 'Ὅ', 'Ὺ', 'Ν', 'Ἠ'}
Then, you can form a regexp that matches words (continous runs) of such characters. We'll also include Latin capital characters.
>>> import re
>>> import string
>>> chars_class = re.escape("".join(greek_capital_chars.union(string.ascii_uppercase)))
>>> r = re.compile(f"[{chars_class}]+")
>>> text = 'Ο Κώστας θέλει να ΠΑΙΞΕΙ ΑΎΡΙΟ ποδόσφαιρο στο σχολείο'
>>> r.sub("", text)
' ώστας θέλει να ποδόσφαιρο στο σχολείο'
As it is, the regex will of course also remove any capital letter; you may wish to do
>>> r = re.compile(f"[{chars_class}]{{2,}}")
>>> r.sub("", text)
'Ο Κώστας θέλει να ποδόσφαιρο στο σχολείο'
or similar instead, depending on your use case.

Python: How can I find text between certain words in a string?

For example, there is a string or txt
"""
asfas #111 dfsfds #222 dsfsdfsfsd dsfds
dsfsdfs sdfsdgsd #333 dsfsdfs dfsfsdf #444 dfsfsd
dsfssgs sdsdg #555 fsfh
"""
Desired result:
"""
#111
#222
#333
#444
#555
"""
Using the code below, I can only see the first result.
import re
html="asfas #111 dfsfds #222 dsfsdfsfsd dsfds"
result = re.search('#"(.+?) ', html)
x = (result.group(0))
print(x)
How do I improve my code?
You can use re.findall method instead of re.search (re.search searches only for the first location where the regular expression pattern produces a match):
import re
txt = '''asfas #111 dfsfds #222 dsfsdfsfsd dsfds
dsfsdfs sdfsdgsd #333 dsfsdfs dfsfsdf #444 dfsfsd
dsfssgs sdsdg #555 fsfh'''
print(*re.findall(r'#\d+', txt), sep='\n')
Prints:
#111
#222
#333
#444
#555
If you always have # followed by 3 digits then:
import re
text = '''asfas #111 dfsfds #222 dsfsdfsfsd dsfds
dsfsdfs sdfsdgsd #333 dsfsdfs dfsfsdf #444 dfsfsd
dsfssgs sdsdg #555 fsfh
'''
results = re.findall(r'(#\d{3})', text)
print(results)
So () means keep a pattern that has # followed by only 3 digits.
You can do this even without using regex :
html="asfas #111 dfsfds #222 dsfsdfsfsd dsfds"
x = [i for i in html.split() if i.startswith('#')]
Output :
['#111', '#222']

How do I retain the decimal numbers during text preprocessing in python?(Edited)

def text_process(text):
text = text.translate(str.maketrans('', '', string.punctuation))
return " ".join(text)
Input text: 'Transaction value was - RS.3456.63 '
Output : 'Transaction value was RS 345663 '
Could someone suggest me how to remove special characters (including '.' ) during text pre-processing but retain the decimal numbers?
Required Output : 'Transaction value was RS 3456.63 '
You can use a more generic regex to replace all special characters except .
import re
def text_process(text):
text = re.sub('[^\w.]+', ' ', text)
return text
s = 'Transaction: value* #was - 3456.63 Rupees'
text_process(s)
You get
'Transaction value was 3456.63 Rupees'
EDIT: The following function returns only the number with decimals.
def text_process(text):
text = re.sub('[^\d.]+', '', text)
return text
s = 'Transaction: value* #was - 3456.63 Rupees'
text_process(s)
'3456.63'
If I understand your question correctly, this code is for you:
text = 'Transaction value was, - 3456.63 Rupees'
regex = r"(?<!\d)[" + string.punctuation + "](?!\d)"
result = re.sub(regex, "", text)
# output: 'Transaction value was 3456.63 Rupees'
To solve your second question, try using this trick:
text = 'Transaction value was, - Rs.3456.63'
regex_space = r"([0-9]+(\.[0-9]+)?)"
regex_punct = r'[^\w.]+'
re.sub(r'[^\w.]+', ' ', re.sub(regex_space,r" \1 ", text).strip())
# output: 'Transaction value was Rs. 3456.63 Rupees'

how to indicate raw string with regex() if my pattern come from another string?

I have a csv table from which I get my regex pattern, e.g. \bconden
Problem : I don't manage to specify to python that this is a raw string
How to put r before a pattern when it comes from a string ?
import re
a = 'de la matière condensée'
fromcsv = '\bconden'
print(re.search('r' + fromcsv, a))
result is None
You can use the str_to_raw function below to make a raw string out of an already declared plain string variable:
import re
a = 'de la matière condensée'
pattern = '\bconden'
escape_dict = {
'\a': r'\a',
'\b': r'\b',
'\c': r'\c',
'\f': r'\f',
'\n': r'\n',
'\r': r'\r',
'\t': r'\t',
'\v': r'\v',
'\'': r'\'',
'\"': r'\"',
'\0': r'\0',
'\1': r'\1',
'\2': r'\2',
'\3': r'\3',
'\4': r'\4',
'\5': r'\5',
'\6': r'\6',
'\7': r'\7',
'\8': r'\8',
'\9': r'\9'
}
def str_to_raw(s):
return r''.join(escape_dict.get(c, c) for c in s)
print(re.search(r'\bconden', a))
print(re.search(str_to_raw(pattern), a))
Output:
<re.Match object; span=(14, 20), match='conden'>
<re.Match object; span=(14, 20), match='conden'>
note: I got escape_dict from this page.

python extracting string from data

I have the following data and I need to extract the first string occurrence It is separated from rest of data with \t. I'm trying to use split(),regex but the problem is it is taking more than 1 second to do this for each line. Is there anyway that it could be done faster?
Data:
DT 0.00155095460731831934 0.00121897344629313064 0.00000391325536877105 0.09743272975663436197 0.00002271067721789807 0.00614528909266214615 0.00000445295550745487 0.70422975214810612510 0.00000042521183266708 0.00080380970031485965 0.00046229528280753270 0.00019894095277762626 0.00041012830368947716 0.00013156663380611624 0.00000001065986007929 0.00004244196517011733 0.00061444160944146384 0.02101761386512242258 0.00010328516871273944 0.00001128873771536226 0.00279163054567377073 0.00018903663417650421 0.00006490063677390687 0.00002151218889856898 0.00032824534915777535 0.00040349658620449016 0.00042393411014689220 0.00053643791028589382 0.00001032961180051124 0.00025743865541833909 0.00011497457801324625 0.00005359814320647386 0.00010336445810407512 0.00040942464084107332 0.00009098970100047888 0.00000091369931486168 0.00059479547081431436 0.00000009853464391239 0.00020303484015768289 0.00050594563648307127 0.15679657927655424321 0.00034115929559768240 0.00115490132012489345 0.00019823414624750937
PRP 0.00000131203717608417 0.99998368311809904263 0.00000002192874737415 0.00000073240710142655 0.00000000536610432900 0.00000195554704853124 0.00000000012203475361 0.00000017206852489982 0.00000040268728691384 0.00000034167449501884 0.00000077203219019333 0.00000003082351874675 0.00000052849070550174 0.00000319144710228690 0.00000000009512989203 0.00000002016363199180 0.00000005598551431381 0.00000129166108708107 0.00000004127954869435 0.00000099983230311242 0.00000032415702089502 0.00000010477525952469 0.00000000011045642123 0.00000006942075882668 0.00000017433924380308 0.00000028874823360049 0.00000048656924101513 0.00000017722073116061 0.00000037193481161874 0.00000000452174124394 0.00000081986547018432 0.00000001740977711224 0.00000000808377988046 0.00000001418892143074 0.00000045250939471023 0.00000000000050232556 0.00000043504206149021 0.00000011310292804313 0.00000000013241046549 0.00000015302998639348 0.00000002800056509608 0.00000038361859715043 0.00000000099713364069 0.00000001345362455494
VBD 0.00000002905639670475 0.00000000730896486886 0.00000000406530491040 0.00000009048972500851 0.00000000380338117015 0.00000000000390031394 0.00000000169948197867 0.00000000091890304843 0.00000000013856552537 0.00000191013917141413 0.00000002300239228881 0.00000003601993413087 0.00000004266629173115 0.00000000166497478879 0.00000000000079281873 0.00000180895378547175 0.00000000000159251758 0.00000000081310874277 0.00000000334322892919 0.99999591744268101490 0.00000000000454647012 0.00000000060884665646 0.00000000000010515727 0.00000000019245471748 0.00000000308524019147 0.00000001376847404364 0.00000001449670334202 0.00000001434634011983 0.00000000656887521298 0.00000000796791556475 0.00000000578334901413 0.00000000142124935798 0.00000000213053365838 0.00000000487780229311 0.00000001702409705978 0.00000000391793832836 0.00000001292779157438 0.00000000002447935587 0.00000000000435117453 0.00000000408872313468 0.00000000007201124397 0.00000000431736839121 0.00000000002970930698 0.00000000080852330796
RB 0.00000015663242474016 0.00000002464350694082 0.00000000095443410385 0.99998778106321006831 0.00000000021007124986 0.00000006156902517681 0.00000000277279124155 0.00000000301727284928 0.00000000030682776953 0.00000007379165980724 0.00000012399749754355 0.00000494600825959811 0.00000008488215978963 0.00000000897527112360 0.00000000000009257081 0.00000000223574222125 0.00000000371653801739 0.00000548300954899374 0.00000001802212638276 0.00000000022437343140 0.00000001084514551630 0.00000000328207000562 0.00000000672649111321 0.00000003640165688536 0.00000050812474700731 0.00000007422081603379 0.00000018000760320187 0.00000007733588104368 0.00000008890139839523 0.00000001494850369145 0.00000003233439691280 0.00000000299507821025 0.00000000501198681017 0.00000000271863832841 0.00000004782796496077 0.00000000000160157399 0.00000006968900381578 0.00000000003199719817 0.00000001234122837743 0.00000002204081342858 0.00000000038818632144 0.00000002327335651712 0.00000000016015202564 0.00000000435845392228
VBN 0.00222925562857408935 0.00055631931823257885 0.00000032474066230587 0.00333293927262896372 0.12594759350192680225 0.00142014631420757115 0.00008260266473343272 0.00001658664201138300 0.00000444848747905589 0.00025881226046863004 0.00176478222683846956 0.00226268536384150636 0.00120807701719786715 0.00016158429451364274 0.00000000200391980114 0.00012971908549403702 0.41488930515218963579 0.41237674095727266943 0.00025649814681915863 0.00001340291420511781 0.00067983726358035045 0.00001718712609473795 0.00009573412529081616 0.02342065200703593100 0.00010281749829896253 0.00243912549478067552 0.00111221146411718771 0.00110067534479759994 0.00048702441892562549 0.00014537544850052323 0.00046019613393571187 0.00004100416046505168 0.00001820421200359182 0.00013212194667244404 0.00112515351673182361 0.00000022002597310723 0.00099184191436586821 0.00000187809735682276 0.00000214888688830288 0.00031369371619907773 0.00000552482376141306 0.00033123576486582436 0.00000227934800338172 0.00006203126813779618
So,the bottom line is I need to extract DT, PRP, VBD... from the above text really fast.
You can just call split with maxsplit argument and wrap it into a list generator.
result = [line.split('\t', 1)[0] for line in data]
As you see, passing 1 in the method call makes it stop after the first splitting takes place. I bet this is the fastest solution in Python.
A manual alternative.
def end_of_loop():
raise StopIteration
def my_split(line):
return ''.join(end_of_loop() if char == '\t' else char for char in line)
result = [my_split(line) for line in lines]
Provided your data are in a file:
with open(file) as data:
result = [my_split(line) for line in data]
This will be a lot slower than the first one.
You can use split in a list comprehension :
>>> s="""DT 0.00155095460731831934 0.00121897344629313064 0.00000391325536877105 0.09743272975663436197 0.00002271067721789807 0.00614528909266214615 0.00000445295550745487 0.70422975214810612510 0.00000042521183266708 0.00080380970031485965 0.00046229528280753270 0.00019894095277762626 0.00041012830368947716 0.00013156663380611624 0.00000001065986007929 0.00004244196517011733 0.00061444160944146384 0.02101761386512242258 0.00010328516871273944 0.00001128873771536226 0.00279163054567377073 0.00018903663417650421 0.00006490063677390687 0.00002151218889856898 0.00032824534915777535 0.00040349658620449016 0.00042393411014689220 0.00053643791028589382 0.00001032961180051124 0.00025743865541833909 0.00011497457801324625 0.00005359814320647386 0.00010336445810407512 0.00040942464084107332 0.00009098970100047888 0.00000091369931486168 0.00059479547081431436 0.00000009853464391239 0.00020303484015768289 0.00050594563648307127 0.15679657927655424321 0.00034115929559768240 0.00115490132012489345 0.00019823414624750937
... PRP 0.00000131203717608417 0.99998368311809904263 0.00000002192874737415 0.00000073240710142655 0.00000000536610432900 0.00000195554704853124 0.00000000012203475361 0.00000017206852489982 0.00000040268728691384 0.00000034167449501884 0.00000077203219019333 0.00000003082351874675 0.00000052849070550174 0.00000319144710228690 0.00000000009512989203 0.00000002016363199180 0.00000005598551431381 0.00000129166108708107 0.00000004127954869435 0.00000099983230311242 0.00000032415702089502 0.00000010477525952469 0.00000000011045642123 0.00000006942075882668 0.00000017433924380308 0.00000028874823360049 0.00000048656924101513 0.00000017722073116061 0.00000037193481161874 0.00000000452174124394 0.00000081986547018432 0.00000001740977711224 0.00000000808377988046 0.00000001418892143074 0.00000045250939471023 0.00000000000050232556 0.00000043504206149021 0.00000011310292804313 0.00000000013241046549 0.00000015302998639348 0.00000002800056509608 0.00000038361859715043 0.00000000099713364069 0.00000001345362455494
... VBD 0.00000002905639670475 0.00000000730896486886 0.00000000406530491040 0.00000009048972500851 0.00000000380338117015 0.00000000000390031394 0.00000000169948197867 0.00000000091890304843 0.00000000013856552537 0.00000191013917141413 0.00000002300239228881 0.00000003601993413087 0.00000004266629173115 0.00000000166497478879 0.00000000000079281873 0.00000180895378547175 0.00000000000159251758 0.00000000081310874277 0.00000000334322892919 0.99999591744268101490 0.00000000000454647012 0.00000000060884665646 0.00000000000010515727 0.00000000019245471748 0.00000000308524019147 0.00000001376847404364 0.00000001449670334202 0.00000001434634011983 0.00000000656887521298 0.00000000796791556475 0.00000000578334901413 0.00000000142124935798 0.00000000213053365838 0.00000000487780229311 0.00000001702409705978 0.00000000391793832836 0.00000001292779157438 0.00000000002447935587 0.00000000000435117453 0.00000000408872313468 0.00000000007201124397 0.00000000431736839121 0.00000000002970930698 0.00000000080852330796
... RB 0.00000015663242474016 0.00000002464350694082 0.00000000095443410385 0.99998778106321006831 0.00000000021007124986 0.00000006156902517681 0.00000000277279124155 0.00000000301727284928 0.00000000030682776953 0.00000007379165980724 0.00000012399749754355 0.00000494600825959811 0.00000008488215978963 0.00000000897527112360 0.00000000000009257081 0.00000000223574222125 0.00000000371653801739 0.00000548300954899374 0.00000001802212638276 0.00000000022437343140 0.00000001084514551630 0.00000000328207000562 0.00000000672649111321 0.00000003640165688536 0.00000050812474700731 0.00000007422081603379 0.00000018000760320187 0.00000007733588104368 0.00000008890139839523 0.00000001494850369145 0.00000003233439691280 0.00000000299507821025 0.00000000501198681017 0.00000000271863832841 0.00000004782796496077 0.00000000000160157399 0.00000006968900381578 0.00000000003199719817 0.00000001234122837743 0.00000002204081342858 0.00000000038818632144 0.00000002327335651712 0.00000000016015202564 0.00000000435845392228
... VBN 0.00222925562857408935 0.00055631931823257885 0.00000032474066230587 0.00333293927262896372 0.12594759350192680225 0.00142014631420757115 0.00008260266473343272 0.00001658664201138300 0.00000444848747905589 0.00025881226046863004 0.00176478222683846956 0.00226268536384150636 0.00120807701719786715 0.00016158429451364274 0.00000000200391980114 0.00012971908549403702 0.41488930515218963579 0.41237674095727266943 0.00025649814681915863 0.00001340291420511781 0.00067983726358035045 0.00001718712609473795 0.00009573412529081616 0.02342065200703593100 0.00010281749829896253 0.00243912549478067552 0.00111221146411718771 0.00110067534479759994 0.00048702441892562549 0.00014537544850052323 0.00046019613393571187 0.00004100416046505168 0.00001820421200359182 0.00013212194667244404 0.00112515351673182361 0.00000022002597310723 0.00099184191436586821 0.00000187809735682276 0.00000214888688830288 0.00031369371619907773 0.00000552482376141306 0.00033123576486582436 0.00000227934800338172 0.00006203126813779618"""
>>> [i.split()[0] for i in s.split('\n')]
['DT', 'PRP', 'VBD', 'RB', 'VBN']
import re
p = re.compile(r'^\S+', re.MULTILINE)
re.findall(p, test_str)
You can simply do this to get a list of strings you want.

Categories