Extracting a string from data in Python

I have the following data and I need to extract the first string on each line. It is separated from the rest of the data by \t. I've tried split() and regular expressions, but each line takes more than 1 second to process. Is there any way to do this faster?
Data:
DT 0.00155095460731831934 0.00121897344629313064 0.00000391325536877105 0.09743272975663436197 0.00002271067721789807 0.00614528909266214615 0.00000445295550745487 0.70422975214810612510 0.00000042521183266708 0.00080380970031485965 0.00046229528280753270 0.00019894095277762626 0.00041012830368947716 0.00013156663380611624 0.00000001065986007929 0.00004244196517011733 0.00061444160944146384 0.02101761386512242258 0.00010328516871273944 0.00001128873771536226 0.00279163054567377073 0.00018903663417650421 0.00006490063677390687 0.00002151218889856898 0.00032824534915777535 0.00040349658620449016 0.00042393411014689220 0.00053643791028589382 0.00001032961180051124 0.00025743865541833909 0.00011497457801324625 0.00005359814320647386 0.00010336445810407512 0.00040942464084107332 0.00009098970100047888 0.00000091369931486168 0.00059479547081431436 0.00000009853464391239 0.00020303484015768289 0.00050594563648307127 0.15679657927655424321 0.00034115929559768240 0.00115490132012489345 0.00019823414624750937
PRP 0.00000131203717608417 0.99998368311809904263 0.00000002192874737415 0.00000073240710142655 0.00000000536610432900 0.00000195554704853124 0.00000000012203475361 0.00000017206852489982 0.00000040268728691384 0.00000034167449501884 0.00000077203219019333 0.00000003082351874675 0.00000052849070550174 0.00000319144710228690 0.00000000009512989203 0.00000002016363199180 0.00000005598551431381 0.00000129166108708107 0.00000004127954869435 0.00000099983230311242 0.00000032415702089502 0.00000010477525952469 0.00000000011045642123 0.00000006942075882668 0.00000017433924380308 0.00000028874823360049 0.00000048656924101513 0.00000017722073116061 0.00000037193481161874 0.00000000452174124394 0.00000081986547018432 0.00000001740977711224 0.00000000808377988046 0.00000001418892143074 0.00000045250939471023 0.00000000000050232556 0.00000043504206149021 0.00000011310292804313 0.00000000013241046549 0.00000015302998639348 0.00000002800056509608 0.00000038361859715043 0.00000000099713364069 0.00000001345362455494
VBD 0.00000002905639670475 0.00000000730896486886 0.00000000406530491040 0.00000009048972500851 0.00000000380338117015 0.00000000000390031394 0.00000000169948197867 0.00000000091890304843 0.00000000013856552537 0.00000191013917141413 0.00000002300239228881 0.00000003601993413087 0.00000004266629173115 0.00000000166497478879 0.00000000000079281873 0.00000180895378547175 0.00000000000159251758 0.00000000081310874277 0.00000000334322892919 0.99999591744268101490 0.00000000000454647012 0.00000000060884665646 0.00000000000010515727 0.00000000019245471748 0.00000000308524019147 0.00000001376847404364 0.00000001449670334202 0.00000001434634011983 0.00000000656887521298 0.00000000796791556475 0.00000000578334901413 0.00000000142124935798 0.00000000213053365838 0.00000000487780229311 0.00000001702409705978 0.00000000391793832836 0.00000001292779157438 0.00000000002447935587 0.00000000000435117453 0.00000000408872313468 0.00000000007201124397 0.00000000431736839121 0.00000000002970930698 0.00000000080852330796
RB 0.00000015663242474016 0.00000002464350694082 0.00000000095443410385 0.99998778106321006831 0.00000000021007124986 0.00000006156902517681 0.00000000277279124155 0.00000000301727284928 0.00000000030682776953 0.00000007379165980724 0.00000012399749754355 0.00000494600825959811 0.00000008488215978963 0.00000000897527112360 0.00000000000009257081 0.00000000223574222125 0.00000000371653801739 0.00000548300954899374 0.00000001802212638276 0.00000000022437343140 0.00000001084514551630 0.00000000328207000562 0.00000000672649111321 0.00000003640165688536 0.00000050812474700731 0.00000007422081603379 0.00000018000760320187 0.00000007733588104368 0.00000008890139839523 0.00000001494850369145 0.00000003233439691280 0.00000000299507821025 0.00000000501198681017 0.00000000271863832841 0.00000004782796496077 0.00000000000160157399 0.00000006968900381578 0.00000000003199719817 0.00000001234122837743 0.00000002204081342858 0.00000000038818632144 0.00000002327335651712 0.00000000016015202564 0.00000000435845392228
VBN 0.00222925562857408935 0.00055631931823257885 0.00000032474066230587 0.00333293927262896372 0.12594759350192680225 0.00142014631420757115 0.00008260266473343272 0.00001658664201138300 0.00000444848747905589 0.00025881226046863004 0.00176478222683846956 0.00226268536384150636 0.00120807701719786715 0.00016158429451364274 0.00000000200391980114 0.00012971908549403702 0.41488930515218963579 0.41237674095727266943 0.00025649814681915863 0.00001340291420511781 0.00067983726358035045 0.00001718712609473795 0.00009573412529081616 0.02342065200703593100 0.00010281749829896253 0.00243912549478067552 0.00111221146411718771 0.00110067534479759994 0.00048702441892562549 0.00014537544850052323 0.00046019613393571187 0.00004100416046505168 0.00001820421200359182 0.00013212194667244404 0.00112515351673182361 0.00000022002597310723 0.00099184191436586821 0.00000187809735682276 0.00000214888688830288 0.00031369371619907773 0.00000552482376141306 0.00033123576486582436 0.00000227934800338172 0.00006203126813779618
So, the bottom line is that I need to extract DT, PRP, VBD... from the above text really fast.

You can call split with the maxsplit argument and wrap it in a list comprehension.
result = [line.split('\t', 1)[0] for line in data]
As you can see, passing 1 as maxsplit makes split stop after the first separator, so the rest of the line is never scanned. I bet this is the fastest solution in pure Python.
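A manual alternative.
from itertools import takewhile
def my_split(line):
    # Collect characters up to (not including) the first tab.
    # Raising StopIteration inside a generator expression to end it early
    # no longer works: since Python 3.7 (PEP 479) it becomes a RuntimeError.
    return ''.join(takewhile(lambda char: char != '\t', line))
result = [my_split(line) for line in data]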
Provided your data is in a file:
with open(file) as data:
    result = [my_split(line) for line in data]
The manual version will be a lot slower than the split-based one.
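The speed claims above are easy to check with timeit. The following is a rough sketch, assuming tab-separated lines like those in the question. The sample line and the helper names (first_by_split, first_by_partition, first_by_takewhile) are made up for illustration, and the absolute timings will differ from machine to machine.

```python
import timeit
from itertools import takewhile

# Hypothetical sample: a tag followed by 44 tab-separated numbers.
line = "DT\t" + "\t".join(["0.00155095460731831934"] * 44)
data = [line] * 1000

def first_by_split(lines):
    # maxsplit=1 stops scanning after the first tab.
    return [ln.split("\t", 1)[0] for ln in lines]

def first_by_partition(lines):
    # str.partition also stops at the first separator.
    return [ln.partition("\t")[0] for ln in lines]

def first_by_takewhile(lines):
    # Character-by-character version, expected to be the slowest.
    return ["".join(takewhile(lambda c: c != "\t", ln)) for ln in lines]

for func in (first_by_split, first_by_partition, first_by_takewhile):
    seconds = timeit.timeit(lambda: func(data), number=20)
    print(f"{func.__name__}: {seconds:.4f} s")
```

If even the split-based version seems to need a second per line, the time is probably being spent somewhere else, for example in how the file is read.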

You can use split in a list comprehension:
>>> s="""DT 0.00155095460731831934 0.00121897344629313064 0.00000391325536877105 0.09743272975663436197 0.00002271067721789807 0.00614528909266214615 0.00000445295550745487 0.70422975214810612510 0.00000042521183266708 0.00080380970031485965 0.00046229528280753270 0.00019894095277762626 0.00041012830368947716 0.00013156663380611624 0.00000001065986007929 0.00004244196517011733 0.00061444160944146384 0.02101761386512242258 0.00010328516871273944 0.00001128873771536226 0.00279163054567377073 0.00018903663417650421 0.00006490063677390687 0.00002151218889856898 0.00032824534915777535 0.00040349658620449016 0.00042393411014689220 0.00053643791028589382 0.00001032961180051124 0.00025743865541833909 0.00011497457801324625 0.00005359814320647386 0.00010336445810407512 0.00040942464084107332 0.00009098970100047888 0.00000091369931486168 0.00059479547081431436 0.00000009853464391239 0.00020303484015768289 0.00050594563648307127 0.15679657927655424321 0.00034115929559768240 0.00115490132012489345 0.00019823414624750937
... PRP 0.00000131203717608417 0.99998368311809904263 0.00000002192874737415 0.00000073240710142655 0.00000000536610432900 0.00000195554704853124 0.00000000012203475361 0.00000017206852489982 0.00000040268728691384 0.00000034167449501884 0.00000077203219019333 0.00000003082351874675 0.00000052849070550174 0.00000319144710228690 0.00000000009512989203 0.00000002016363199180 0.00000005598551431381 0.00000129166108708107 0.00000004127954869435 0.00000099983230311242 0.00000032415702089502 0.00000010477525952469 0.00000000011045642123 0.00000006942075882668 0.00000017433924380308 0.00000028874823360049 0.00000048656924101513 0.00000017722073116061 0.00000037193481161874 0.00000000452174124394 0.00000081986547018432 0.00000001740977711224 0.00000000808377988046 0.00000001418892143074 0.00000045250939471023 0.00000000000050232556 0.00000043504206149021 0.00000011310292804313 0.00000000013241046549 0.00000015302998639348 0.00000002800056509608 0.00000038361859715043 0.00000000099713364069 0.00000001345362455494
... VBD 0.00000002905639670475 0.00000000730896486886 0.00000000406530491040 0.00000009048972500851 0.00000000380338117015 0.00000000000390031394 0.00000000169948197867 0.00000000091890304843 0.00000000013856552537 0.00000191013917141413 0.00000002300239228881 0.00000003601993413087 0.00000004266629173115 0.00000000166497478879 0.00000000000079281873 0.00000180895378547175 0.00000000000159251758 0.00000000081310874277 0.00000000334322892919 0.99999591744268101490 0.00000000000454647012 0.00000000060884665646 0.00000000000010515727 0.00000000019245471748 0.00000000308524019147 0.00000001376847404364 0.00000001449670334202 0.00000001434634011983 0.00000000656887521298 0.00000000796791556475 0.00000000578334901413 0.00000000142124935798 0.00000000213053365838 0.00000000487780229311 0.00000001702409705978 0.00000000391793832836 0.00000001292779157438 0.00000000002447935587 0.00000000000435117453 0.00000000408872313468 0.00000000007201124397 0.00000000431736839121 0.00000000002970930698 0.00000000080852330796
... RB 0.00000015663242474016 0.00000002464350694082 0.00000000095443410385 0.99998778106321006831 0.00000000021007124986 0.00000006156902517681 0.00000000277279124155 0.00000000301727284928 0.00000000030682776953 0.00000007379165980724 0.00000012399749754355 0.00000494600825959811 0.00000008488215978963 0.00000000897527112360 0.00000000000009257081 0.00000000223574222125 0.00000000371653801739 0.00000548300954899374 0.00000001802212638276 0.00000000022437343140 0.00000001084514551630 0.00000000328207000562 0.00000000672649111321 0.00000003640165688536 0.00000050812474700731 0.00000007422081603379 0.00000018000760320187 0.00000007733588104368 0.00000008890139839523 0.00000001494850369145 0.00000003233439691280 0.00000000299507821025 0.00000000501198681017 0.00000000271863832841 0.00000004782796496077 0.00000000000160157399 0.00000006968900381578 0.00000000003199719817 0.00000001234122837743 0.00000002204081342858 0.00000000038818632144 0.00000002327335651712 0.00000000016015202564 0.00000000435845392228
... VBN 0.00222925562857408935 0.00055631931823257885 0.00000032474066230587 0.00333293927262896372 0.12594759350192680225 0.00142014631420757115 0.00008260266473343272 0.00001658664201138300 0.00000444848747905589 0.00025881226046863004 0.00176478222683846956 0.00226268536384150636 0.00120807701719786715 0.00016158429451364274 0.00000000200391980114 0.00012971908549403702 0.41488930515218963579 0.41237674095727266943 0.00025649814681915863 0.00001340291420511781 0.00067983726358035045 0.00001718712609473795 0.00009573412529081616 0.02342065200703593100 0.00010281749829896253 0.00243912549478067552 0.00111221146411718771 0.00110067534479759994 0.00048702441892562549 0.00014537544850052323 0.00046019613393571187 0.00004100416046505168 0.00001820421200359182 0.00013212194667244404 0.00112515351673182361 0.00000022002597310723 0.00099184191436586821 0.00000187809735682276 0.00000214888688830288 0.00031369371619907773 0.00000552482376141306 0.00033123576486582436 0.00000227934800338172 0.00006203126813779618"""
>>> [i.split()[0] for i in s.split('\n')]
['DT', 'PRP', 'VBD', 'RB', 'VBN']

import re
p = re.compile(r'^\S+', re.MULTILINE)
p.findall(test_str)
You can simply do this to get a list of the strings you want. Here test_str is the whole text as a single string.
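For completeness, here is a self-contained version of the regex approach. The contents of test_str are a shortened, hypothetical stand-in for the question's data.

```python
import re

# Hypothetical stand-in for the question's data: one tag per line, then numbers.
test_str = "DT\t0.0015\t0.0012\nPRP\t0.0000\t0.9999\nVBD\t0.0000\t0.0000"

# With re.MULTILINE, ^ matches at the start of every line;
# \S+ then takes everything up to the first whitespace character.
pattern = re.compile(r"^\S+", re.MULTILINE)
tags = pattern.findall(test_str)
print(tags)  # ['DT', 'PRP', 'VBD']
```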

Related

Is there a way where I can replace the first '.' with '-' in my code for a domain generator

Last time I got some help making a website name generator. I feel bad, but I'm stuck at the moment and need some help again to improve it. In my code there's a .txt file called combined which includes these lines.
After that I created variables to add to the domain:
web = 'web'
suffix = 'co.id'
And then I write it out so that each output line goes to Combined.txt:
output_count = 50
subdomain_count = 2
for i in range(output_count):
    out = []
    for j in range(subdomain_count):
        out.append(random.choice(Test))
    out.append(web)
    out.append(suffix)
    Example.write('.'.join(out)+"\n")
with open("dictionaries/examples.txt") as f:
    websamples = [line.rstrip() for line in f]
I want output where, instead of just login.download.web.co.id, there is more variety, like login-download.web.co.id or login.download-web.co.id. In the code I used Example.write('.'.join(out)+"\n") so that . would be the separator between the words. I was thinking of adding more by writing a similar line of code and saving the result to a different .txt file, but I feel like that would be too long. Is there a way to vary each separator, using - or _ instead of just a . in the output?
Thanks!
Sure, just iterate through a list of delimiters and add each of them to the output.
web = 'web'
suffix = 'co.id'
output_count = 50
subdomain_count = 2
delimeters = ['_', '-', '.']
for i in range(output_count):
    out = []
    for j in range(subdomain_count):
        out.append(random.choice(Test))
    for delimeter in delimeters:
        addr = delimeter.join(out)
        addrs = '.'.join([addr, web, suffix])
        print(addrs)
        Example.write(addrs + '\n')
Output:
my_pay.web.co.id
my-pay.web.co.id
my.pay.web.co.id
pay_download.web.co.id
pay-download.web.co.id
pay.download.web.co.id
group_login.web.co.id
group-login.web.co.id
group.login.web.co.id
install_group.web.co.id
install-group.web.co.id
install.group.web.co.id
...
...
Update:
import itertools
Test = ['download', 'login', 'my', 'ip', 'site', 'ssl', 'pay', 'install']
delimeters = ['-', '.']
web = 'web'
suffix = 'co.id'
output_count = 50
subdomain_count = 2
for combo in itertools.combinations(Test, 2):
    out = ''
    for i, d in enumerate(delimeters):
        out = d.join(combo)
        out = delimeters[i-1].join([out, web])
        addr = '.'.join([out, suffix])
        print(addr)
        # Example.write(addr+'\n')
Output:
download-login.web.co.id
download.login-web.co.id
download-my.web.co.id
download.my-web.co.id
download-ip.web.co.id
download.ip-web.co.id
download-site.web.co.id
download.site-web.co.id
download-ssl.web.co.id
download.ssl-web.co.id
download-pay.web.co.id
download.pay-web.co.id
download-install.web.co.id
download.install-web.co.id
login-my.web.co.id
login.my-web.co.id
login-ip.web.co.id
login.ip-web.co.id
login-site.web.co.id
login.site-web.co.id
login-ssl.web.co.id
login.ssl-web.co.id
login-pay.web.co.id
login.pay-web.co.id
login-install.web.co.id
login.install-web.co.id
my-ip.web.co.id
my.ip-web.co.id
my-site.web.co.id
my.site-web.co.id
my-ssl.web.co.id
my.ssl-web.co.id
my-pay.web.co.id
my.pay-web.co.id
my-install.web.co.id
my.install-web.co.id
ip-site.web.co.id
ip.site-web.co.id
ip-ssl.web.co.id
ip.ssl-web.co.id
ip-pay.web.co.id
ip.pay-web.co.id
ip-install.web.co.id
ip.install-web.co.id
site-ssl.web.co.id
site.ssl-web.co.id
site-pay.web.co.id
site.pay-web.co.id
site-install.web.co.id
site.install-web.co.id
ssl-pay.web.co.id
ssl.pay-web.co.id
ssl-install.web.co.id
ssl.install-web.co.id
pay-install.web.co.id
pay.install-web.co.id
As an alternative to replacing the final output, you could make the separator random:
import random
separators = ['-', '_', '.']
Example.write(random.choice(separators).join(out)+"\n")
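Note that joining the whole list with one random separator also changes the dots in front of web and the suffix. A minimal sketch of a variant that picks a separator independently for each gap and always attaches the suffix with a dot could look like this. The random_join helper and the word list are made up for illustration.

```python
import random

def random_join(parts, separators="-_.", rng=random):
    """Join parts, picking a separator independently for each gap."""
    pieces = [parts[0]]
    for part in parts[1:]:
        pieces.append(rng.choice(separators))
        pieces.append(part)
    return "".join(pieces)

words = ["login", "download", "my", "pay"]  # hypothetical word list
suffix = "co.id"

rng = random.Random(0)  # seeded only to make the sketch repeatable
for _ in range(5):
    labels = rng.sample(words, 2) + ["web"]
    # Only the gaps before the suffix vary; the suffix always gets a dot.
    print(random_join(labels, rng=rng) + "." + suffix)
```

If the names have to be usable as real host names, drop '_' from separators, since an underscore is not allowed by the preferred name syntax of RFC 1035.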
In order to ensure compliance with RFC 1035 I would suggest:
from random import choices as CHOICES, choice as CHOICE
output_count = 50
subdomain_count = 2
web = 'web'
suffix = 'co.id'
dotdash = '.-'
filename = 'output.txt'
Test = [
    'auth',
    'access',
    'account',
    'admin'
    # etc
]
with open(filename, 'w') as output:
    for _ in range(output_count):
        sd = CHOICE(dotdash).join(CHOICES(Test, k=subdomain_count))
        print('.'.join((sd, web, suffix)), file=output)
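To check generated names against the rule this answer refers to, a small validator can help. This is only a sketch of the preferred name syntax in RFC 1035, section 2.3.1. The function name is_valid_hostname is made up, and the 253-character overall limit is the commonly used bound for the textual form rather than something taken from the answer.

```python
import re

# Preferred name syntax from RFC 1035, section 2.3.1: a label starts with a
# letter, ends with a letter or digit, and has only letters, digits and
# hyphens in between. Labels are limited to 63 characters.
LABEL_RE = re.compile(r"[A-Za-z](?:[A-Za-z0-9-]*[A-Za-z0-9])?")

def is_valid_hostname(name):
    """Return True if every dot-separated label fits the RFC 1035 syntax."""
    if not name or len(name) > 253:
        return False
    return all(
        len(label) <= 63 and LABEL_RE.fullmatch(label) is not None
        for label in name.split(".")
    )

for candidate in ("auth-admin.web.co.id", "auth_admin.web.co.id", "-auth.web.co.id"):
    print(candidate, is_valid_hostname(candidate))
```

Under these rules the name with an underscore and the name with a leading hyphen are both rejected, which is why the answer above limits itself to '.' and '-'.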

Python - How to count specific section in a list

I'm brand new to Python and I'm struggling with how to add up certain sections of a CSV file in Python. I'm not allowed to use "import csv".
I'm importing the TipJoke CSV file from https://vincentarelbundock.github.io/Rdatasets/datasets.html
This is the only code I have so far that worked and I'm at a total loss on where to go from here.
if __name__ == '__main__':
    from pprint import pprint
    from string import punctuation
    f = open("TipJoke.csv", "r")
    tipList = []
    for line in f:
        #deletes the quotes
        line = line.replace('"', '')
        tipList.append(line)
    pprint(tipList)
Output:
[',Card,Tip,Ad,Joke,None\n',
'1,None,1,0,0,1\n',
'2,Joke,1,0,1,0\n',
'3,Ad,0,1,0,0\n',
'4,None,0,0,0,1\n',
'5,None,1,0,0,1\n',
'6,None,0,0,0,1\n',
'7,Ad,0,1,0,0\n',
'8,Ad,0,1,0,0\n',
'9,None,0,0,0,1\n',
'10,None,0,0,0,1\n',
'11,None,1,0,0,1\n',
'12,Ad,0,1,0,0\n',
'13,None,0,0,0,1\n',
'14,Ad,1,1,0,0\n',
'15,Joke,1,0,1,0\n',
'16,Joke,0,0,1,0\n',
'17,Joke,1,0,1,0\n',
'18,None,0,0,0,1\n',
'19,Joke,0,0,1,0\n',
'20,None,0,0,0,1\n',
'21,Ad,1,1,0,0\n',
'22,Ad,1,1,0,0\n',
'23,Ad,0,1,0,0\n',
'24,Joke,0,0,1,0\n',
'25,Joke,1,0,1,0\n',
'26,Joke,0,0,1,0\n',
'27,None,1,0,0,1\n',
'28,Joke,1,0,1,0\n',
'29,Joke,1,0,1,0\n',
'30,None,1,0,0,1\n',
'31,Joke,0,0,1,0\n',
'32,None,1,0,0,1\n',
'33,Joke,1,0,1,0\n',
'34,Ad,0,1,0,0\n',
'35,Joke,0,0,1,0\n',
'36,Ad,1,1,0,0\n',
'37,Joke,0,0,1,0\n',
'38,Ad,0,1,0,0\n',
'39,Joke,0,0,1,0\n',
'40,Joke,0,0,1,0\n',
'41,Joke,1,0,1,0\n',
'42,None,0,0,0,1\n',
'43,None,0,0,0,1\n',
'44,Ad,0,1,0,0\n',
'45,None,0,0,0,1\n',
'46,None,0,0,0,1\n',
'47,Ad,0,1,0,0\n',
'48,Joke,0,0,1,0\n',
'49,Joke,1,0,1,0\n',
'50,None,1,0,0,1\n',
'51,None,0,0,0,1\n',
'52,Joke,1,0,1,0\n',
'53,Joke,1,0,1,0\n',
'54,Joke,0,0,1,0\n',
'55,None,1,0,0,1\n',
'56,Ad,0,1,0,0\n',
'57,Joke,0,0,1,0\n',
'58,None,0,0,0,1\n',
'59,Ad,0,1,0,0\n',
'60,Joke,1,0,1,0\n',
'61,Ad,0,1,0,0\n',
'62,None,1,0,0,1\n',
'63,Joke,0,0,1,0\n',
'64,Ad,0,1,0,0\n',
'65,Joke,0,0,1,0\n',
'66,Ad,0,1,0,0\n',
'67,Ad,0,1,0,0\n',
'68,Ad,0,1,0,0\n',
'69,None,0,0,0,1\n',
'70,Joke,1,0,1,0\n',
'71,None,1,0,0,1\n',
'72,None,0,0,0,1\n',
'73,None,0,0,0,1\n',
'74,Joke,0,0,1,0\n',
'75,Ad,1,1,0,0\n',
'76,Ad,0,1,0,0\n',
'77,Ad,1,1,0,0\n',
'78,Joke,0,0,1,0\n',
'79,Joke,0,0,1,0\n',
'80,Ad,1,1,0,0\n',
'81,Ad,0,1,0,0\n',
'82,None,0,0,0,1\n',
'83,Ad,0,1,0,0\n',
'84,Joke,0,0,1,0\n',
'85,Joke,0,0,1,0\n',
'86,Ad,1,1,0,0\n',
'87,None,1,0,0,1\n',
'88,Joke,1,0,1,0\n',
'89,Ad,0,1,0,0\n',
'90,None,0,0,0,1\n',
'91,None,0,0,0,1\n',
'92,Joke,0,0,1,0\n',
'93,Joke,0,0,1,0\n',
'94,Ad,0,1,0,0\n',
'95,Ad,0,1,0,0\n',
'96,Ad,0,1,0,0\n',
'97,Joke,1,0,1,0\n',
'98,None,0,0,0,1\n',
'99,None,0,0,0,1\n',
'100,None,1,0,0,1\n',
'101,Joke,0,0,1,0\n',
'102,Joke,0,0,1,0\n',
'103,Ad,1,1,0,0\n',
'104,Ad,0,1,0,0\n',
'105,Ad,0,1,0,0\n',
'106,Ad,1,1,0,0\n',
'107,Ad,0,1,0,0\n',
'108,None,0,0,0,1\n',
'109,Ad,0,1,0,0\n',
'110,Joke,1,0,1,0\n',
'111,None,0,0,0,1\n',
'112,Ad,0,1,0,0\n',
'113,Ad,0,1,0,0\n',
'114,None,0,0,0,1\n',
'115,Ad,0,1,0,0\n',
'116,None,0,0,0,1\n',
'117,None,0,0,0,1\n',
'118,Ad,0,1,0,0\n',
'119,None,1,0,0,1\n',
'120,Ad,1,1,0,0\n',
'121,Ad,0,1,0,0\n',
'122,Ad,1,1,0,0\n',
'123,None,0,0,0,1\n',
'124,None,0,0,0,1\n',
'125,Joke,1,0,1,0\n',
'126,Joke,1,0,1,0\n',
'127,Ad,0,1,0,0\n',
'128,Joke,0,0,1,0\n',
'129,Joke,0,0,1,0\n',
'130,Ad,0,1,0,0\n',
'131,None,0,0,0,1\n',
'132,None,0,0,0,1\n',
'133,None,0,0,0,1\n',
'134,Joke,1,0,1,0\n',
'135,Ad,0,1,0,0\n',
'136,None,0,0,0,1\n',
'137,Joke,0,0,1,0\n',
'138,Ad,0,1,0,0\n',
'139,Ad,0,1,0,0\n',
'140,None,0,0,0,1\n',
'141,Joke,0,0,1,0\n',
'142,None,0,0,0,1\n',
'143,Ad,0,1,0,0\n',
'144,None,1,0,0,1\n',
'145,Joke,0,0,1,0\n',
'146,Ad,0,1,0,0\n',
'147,Ad,0,1,0,0\n',
'148,Ad,0,1,0,0\n',
'149,Joke,1,0,1,0\n',
'150,Ad,1,1,0,0\n',
'151,Joke,1,0,1,0\n',
'152,None,0,0,0,1\n',
'153,Ad,0,1,0,0\n',
'154,None,0,0,0,1\n',
'155,None,0,0,0,1\n',
'156,Ad,0,1,0,0\n',
'157,Ad,0,1,0,0\n',
'158,Joke,0,0,1,0\n',
'159,None,0,0,0,1\n',
'160,Joke,1,0,1,0\n',
'161,None,1,0,0,1\n',
'162,Ad,1,1,0,0\n',
'163,Joke,0,0,1,0\n',
'164,Joke,0,0,1,0\n',
'165,Ad,0,1,0,0\n',
'166,Joke,1,0,1,0\n',
'167,Joke,1,0,1,0\n',
'168,Ad,0,1,0,0\n',
'169,Joke,1,0,1,0\n',
'170,Joke,0,0,1,0\n',
'171,Ad,0,1,0,0\n',
'172,Joke,0,0,1,0\n',
'173,Joke,0,0,1,0\n',
'174,Ad,0,1,0,0\n',
'175,None,0,0,0,1\n',
'176,Joke,1,0,1,0\n',
'177,Ad,0,1,0,0\n',
'178,Joke,0,0,1,0\n',
'179,Joke,0,0,1,0\n',
'180,None,0,0,0,1\n',
'181,None,0,0,0,1\n',
'182,Ad,0,1,0,0\n',
'183,None,0,0,0,1\n',
'184,None,0,0,0,1\n',
'185,None,0,0,0,1\n',
'186,None,0,0,0,1\n',
'187,Ad,0,1,0,0\n',
'188,None,1,0,0,1\n',
'189,Ad,0,1,0,0\n',
'190,Ad,0,1,0,0\n',
'191,Ad,0,1,0,0\n',
'192,Joke,1,0,1,0\n',
'193,Joke,0,0,1,0\n',
'194,Ad,0,1,0,0\n',
'195,None,0,0,0,1\n',
'196,Joke,1,0,1,0\n',
'197,Joke,0,0,1,0\n',
'198,Joke,1,0,1,0\n',
'199,Ad,0,1,0,0\n',
'200,None,0,0,0,1\n',
'201,Joke,1,0,1,0\n',
'202,Joke,0,0,1,0\n',
'203,Joke,0,0,1,0\n',
'204,Ad,0,1,0,0\n',
'205,None,0,0,0,1\n',
'206,Ad,0,1,0,0\n',
'207,Ad,0,1,0,0\n',
'208,Joke,0,0,1,0\n',
'209,Ad,0,1,0,0\n',
'210,Joke,0,0,1,0\n',
'211,None,0,0,0,1\n']
I'm currently trying to find the total number of entries for a specified card type and the percentage of tips given for that card type, to two decimal places of precision. The tip column is the 0 or 1 right after the card type (None, Ad, Joke).
If you are allowed to use the pandas library, then:
import pandas as pd
df = pd.read_csv("TipJoke.csv")
df is a pandas DataFrame object on which you can perform whatever filtering you need.
For example, if you want the rows for Joke, you can filter like this:
print(df[df["Card"] == "Joke"])
I'm only pointing you in a direction here, not providing the whole logic for your question.
This sums the tips given for each card type:
counts = {"Joke": 0, "Ad": 0, "None": 0}
with open("TipJoke.csv", "r") as f:
    for line in f:
        line_clean = line.replace('"', "").replace("\n", "").split(",")
        try:
            counts[line_clean[1]] += int(line_clean[2])
        except (KeyError, ValueError, IndexError):
            # Skip the header row and any malformed line.
            pass
print(counts)
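The question also asks for the total number of entries per card type and the percentage tipped, which the snippet above does not produce. A minimal sketch without the csv module, assuming the column layout shown in the question's output (index, Card, Tip, ...), might look like this. The summarize function and the short sample list are made up for illustration.

```python
def summarize(lines):
    """Return {card: (entries, tips, percent_tipped)} from TipJoke-style lines."""
    totals = {}  # card type -> number of rows
    tips = {}    # card type -> number of rows where Tip == 1
    for line in lines:
        fields = line.replace('"', "").strip().split(",")
        # Skip the header row and anything that is not "index,Card,Tip,...".
        if len(fields) < 3 or not fields[2].isdigit():
            continue
        card, tip = fields[1], int(fields[2])
        totals[card] = totals.get(card, 0) + 1
        tips[card] = tips.get(card, 0) + tip
    return {
        card: (totals[card], tips[card], round(100 * tips[card] / totals[card], 2))
        for card in totals
    }

# A few rows copied from the question's output, used instead of the real file.
sample = [
    ',Card,Tip,Ad,Joke,None\n',
    '1,None,1,0,0,1\n',
    '2,Joke,1,0,1,0\n',
    '3,Ad,0,1,0,0\n',
    '4,None,0,0,0,1\n',
]
for card, (entries, tipped, percent) in summarize(sample).items():
    print(f"{card}: {entries} entries, {percent:.2f}% tipped")
```

With the real data, pass the open file object instead of the sample list, for example summarize(f) inside a with open("TipJoke.csv") as f: block.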

Fastest way to count non spacing chars in Unicode text in Python

Given the Unicode non spacing marks list - https://www.fileformat.info/info/unicode/category/Mn/list.htm
UNICODE_NSM = ['\u0300', '\u0301', '\u0302', '\u0303', '\u0304', '\u0305', '\u0306', '\u0307', '\u0308', '\u0309', '\u030A', '\u030B', '\u030C', '\u030D', '\u030E', '\u030F', '\u0310', '\u0311', '\u0312', '\u0313', '\u0314', '\u0315', '\u0316', '\u0317', '\u0318', '\u0319', '\u031A', '\u031B', '\u031C', '\u031D', '\u031E', '\u031F', '\u0320', '\u0321', '\u0322', '\u0323', '\u0324', '\u0325', '\u0326', '\u0327', '\u0328', '\u0329', '\u032A', '\u032B', '\u032C', '\u032D', '\u032E', '\u032F', '\u0330', '\u0331', '\u0332', '\u0333', '\u0334', '\u0335', '\u0336', '\u0337', '\u0338', '\u0339', '\u033A', '\u033B', '\u033C', '\u033D', '\u033E', '\u033F', '\u0340', '\u0341', '\u0342', '\u0343', '\u0344', '\u0345', '\u0346', '\u0347', '\u0348', '\u0349', '\u034A', '\u034B', '\u034C', '\u034D', '\u034E', '\u034F', '\u0350', '\u0351', '\u0352', '\u0353', '\u0354', '\u0355', '\u0356', '\u0357', '\u0358', '\u0359', '\u035A', '\u035B', '\u035C', '\u035D', '\u035E', '\u035F', '\u0360', '\u0361', '\u0362', '\u0363', '\u0364', '\u0365', '\u0366', '\u0367', '\u0368', '\u0369', '\u036A', '\u036B', '\u036C', '\u036D', '\u036E', '\u036F', '\u0483', '\u0484', '\u0485', '\u0486', '\u0487', '\u0591', '\u0592', '\u0593', '\u0594', '\u0595', '\u0596', '\u0597', '\u0598', '\u0599', '\u059A', '\u059B', '\u059C', '\u059D', '\u059E', '\u059F', '\u05A0', '\u05A1', '\u05A2', '\u05A3', '\u05A4', '\u05A5', '\u05A6', '\u05A7', '\u05A8', '\u05A9', '\u05AA', '\u05AB', '\u05AC', '\u05AD', '\u05AE', '\u05AF', '\u05B0', '\u05B1', '\u05B2', '\u05B3', '\u05B4', '\u05B5', '\u05B6', '\u05B7', '\u05B8', '\u05B9', '\u05BA', '\u05BB', '\u05BC', '\u05BD', '\u05BF', '\u05C1', '\u05C2', '\u05C4', '\u05C5', '\u05C7', '\u0610', '\u0611', '\u0612', '\u0613', '\u0614', '\u0615', '\u0616', '\u0617', '\u0618', '\u0619', '\u061A', '\u064B', '\u064C', '\u064D', '\u064E', '\u064F', '\u0650', '\u0651', '\u0652', '\u0653', '\u0654', '\u0655', '\u0656', '\u0657', '\u0658', '\u0659', '\u065A', '\u065B', '\u065C', '\u065D', 
'\u065E', '\u065F', '\u0670', '\u06D6', '\u06D7', '\u06D8', '\u06D9', '\u06DA', '\u06DB', '\u06DC', '\u06DF', '\u06E0', '\u06E1', '\u06E2', '\u06E3', '\u06E4', '\u06E7', '\u06E8', '\u06EA', '\u06EB', '\u06EC', '\u06ED', '\u0711', '\u0730', '\u0731', '\u0732', '\u0733', '\u0734', '\u0735', '\u0736', '\u0737', '\u0738', '\u0739', '\u073A', '\u073B', '\u073C', '\u073D', '\u073E', '\u073F', '\u0740', '\u0741', '\u0742', '\u0743', '\u0744', '\u0745', '\u0746', '\u0747', '\u0748', '\u0749', '\u074A', '\u07A6', '\u07A7', '\u07A8', '\u07A9', '\u07AA', '\u07AB', '\u07AC', '\u07AD', '\u07AE', '\u07AF', '\u07B0', '\u07EB', '\u07EC', '\u07ED', '\u07EE', '\u07EF', '\u07F0', '\u07F1', '\u07F2', '\u07F3', '\u0816', '\u0817', '\u0818', '\u0819', '\u081B', '\u081C', '\u081D', '\u081E', '\u081F', '\u0820', '\u0821', '\u0822', '\u0823', '\u0825', '\u0826', '\u0827', '\u0829', '\u082A', '\u082B', '\u082C', '\u082D', '\u0859', '\u085A', '\u085B', '\u08E4', '\u08E5', '\u08E6', '\u08E7', '\u08E8', '\u08E9', '\u08EA', '\u08EB', '\u08EC', '\u08ED', '\u08EE', '\u08EF', '\u08F0', '\u08F1', '\u08F2', '\u08F3', '\u08F4', '\u08F5', '\u08F6', '\u08F7', '\u08F8', '\u08F9', '\u08FA', '\u08FB', '\u08FC', '\u08FD', '\u08FE', '\u0900', '\u0901', '\u0902', '\u093A', '\u093C', '\u093E', '\u0941', '\u0942', '\u0943', '\u0944', '\u0945', '\u0946', '\u0947', '\u0948', '\u094D', '\u0951', '\u0952', '\u0953', '\u0954', '\u0955', '\u0956', '\u0957', '\u0962', '\u0963', '\u0981', '\u09BC', '\u09C1', '\u09C2', '\u09C3', '\u09C4', '\u09CD', '\u09E2', '\u09E3', '\u0A01', '\u0A02', '\u0A3C', '\u0A41', '\u0A42', '\u0A47', '\u0A48', '\u0A4B', '\u0A4C', '\u0A4D', '\u0A51', '\u0A70', '\u0A71', '\u0A75', '\u0A81', '\u0A82', '\u0ABC', '\u0AC1', '\u0AC2', '\u0AC3', '\u0AC4', '\u0AC5', '\u0AC7', '\u0AC8', '\u0ACD', '\u0AE2', '\u0AE3', '\u0B01', '\u0B3C', '\u0B3F', '\u0B41', '\u0B42', '\u0B43', '\u0B44', '\u0B4D', '\u0B56', '\u0B62', '\u0B63', '\u0B82', '\u0BC0', '\u0BCD', '\u0C3E', '\u0C3F', '\u0C40', '\u0C46', '\u0C47', 
'\u0C48', '\u0C4A', '\u0C4B', '\u0C4C', '\u0C4D', '\u0C55', '\u0C56', '\u0C62', '\u0C63', '\u0CBC', '\u0CBF', '\u0CC6', '\u0CCC', '\u0CCD', '\u0CE2', '\u0CE3', '\u0D41', '\u0D42', '\u0D43', '\u0D44', '\u0D4D', '\u0D62', '\u0D63', '\u0DCA', '\u0DD2', '\u0DD3', '\u0DD4', '\u0DD6', '\u0E31', '\u0E34', '\u0E35', '\u0E36', '\u0E37', '\u0E38', '\u0E39', '\u0E3A', '\u0E47', '\u0E48', '\u0E49', '\u0E4A', '\u0E4B', '\u0E4C', '\u0E4D', '\u0E4E', '\u0EB1', '\u0EB4', '\u0EB5', '\u0EB6', '\u0EB7', '\u0EB8', '\u0EB9', '\u0EBB', '\u0EBC', '\u0EC8', '\u0EC9', '\u0ECA', '\u0ECB', '\u0ECC', '\u0ECD', '\u0F18', '\u0F19', '\u0F35', '\u0F37', '\u0F39', '\u0F71', '\u0F72', '\u0F73', '\u0F74', '\u0F75', '\u0F76', '\u0F77', '\u0F78', '\u0F79', '\u0F7A', '\u0F7B', '\u0F7C', '\u0F7D', '\u0F7E', '\u0F80', '\u0F81', '\u0F82', '\u0F83', '\u0F84', '\u0F86', '\u0F87', '\u0F8D', '\u0F8E', '\u0F8F', '\u0F90', '\u0F91', '\u0F92', '\u0F93', '\u0F94', '\u0F95', '\u0F96', '\u0F97', '\u0F99', '\u0F9A', '\u0F9B', '\u0F9C', '\u0F9D', '\u0F9E', '\u0F9F', '\u0FA0', '\u0FA1', '\u0FA2', '\u0FA3', '\u0FA4', '\u0FA5', '\u0FA6', '\u0FA7', '\u0FA8', '\u0FA9', '\u0FAA', '\u0FAB', '\u0FAC', '\u0FAD', '\u0FAE', '\u0FAF', '\u0FB0', '\u0FB1', '\u0FB2', '\u0FB3', '\u0FB4', '\u0FB5', '\u0FB6', '\u0FB7', '\u0FB8', '\u0FB9', '\u0FBA', '\u0FBB', '\u0FBC', '\u0FC6', '\u102D', '\u102E', '\u102F', '\u1030', '\u1032', '\u1033', '\u1034', '\u1035', '\u1036', '\u1037', '\u1039', '\u103A', '\u103D', '\u103E', '\u1058', '\u1059', '\u105E', '\u105F', '\u1060', '\u1071', '\u1072', '\u1073', '\u1074', '\u1082', '\u1085', '\u1086', '\u108D', '\u109D', '\u135D', '\u135E', '\u135F', '\u1712', '\u1713', '\u1714', '\u1732', '\u1733', '\u1734', '\u1752', '\u1753', '\u1772', '\u1773', '\u17B4', '\u17B5', '\u17B7', '\u17B8', '\u17B9', '\u17BA', '\u17BB', '\u17BC', '\u17BD', '\u17C6', '\u17C9', '\u17CA', '\u17CB', '\u17CC', '\u17CD', '\u17CE', '\u17CF', '\u17D0', '\u17D1', '\u17D2', '\u17D3', '\u17DD', '\u180B', '\u180C', '\u180D', '\u18A9', 
'\u1920', '\u1921', '\u1922', '\u1927', '\u1928', '\u1932', '\u1939', '\u193A', '\u193B', '\u1A17', '\u1A18', '\u1A56', '\u1A58', '\u1A59', '\u1A5A', '\u1A5B', '\u1A5C', '\u1A5D', '\u1A5E', '\u1A60', '\u1A62', '\u1A65', '\u1A66', '\u1A67', '\u1A68', '\u1A69', '\u1A6A', '\u1A6B', '\u1A6C', '\u1A73', '\u1A74', '\u1A75', '\u1A76', '\u1A77', '\u1A78', '\u1A79', '\u1A7A', '\u1A7B', '\u1A7C', '\u1A7F', '\u1B00', '\u1B01', '\u1B02', '\u1B03', '\u1B34', '\u1B36', '\u1B37', '\u1B38', '\u1B39', '\u1B3A', '\u1B3C', '\u1B42', '\u1B6B', '\u1B6C', '\u1B6D', '\u1B6E', '\u1B6F', '\u1B70', '\u1B71', '\u1B72', '\u1B73', '\u1B80', '\u1B81', '\u1BA2', '\u1BA3', '\u1BA4', '\u1BA5', '\u1BA8', '\u1BA9', '\u1BAB', '\u1BE6', '\u1BE8', '\u1BE9', '\u1BED', '\u1BEF', '\u1BF0', '\u1BF1', '\u1C2C', '\u1C2D', '\u1C2E', '\u1C2F', '\u1C30', '\u1C31', '\u1C32', '\u1C33', '\u1C36', '\u1C37', '\u1CD0', '\u1CD1', '\u1CD2', '\u1CD4', '\u1CD5', '\u1CD6', '\u1CD7', '\u1CD8', '\u1CD9', '\u1CDA', '\u1CDB', '\u1CDC', '\u1CDD', '\u1CDE', '\u1CDF', '\u1CE0', '\u1CE2', '\u1CE3', '\u1CE4', '\u1CE5', '\u1CE6', '\u1CE7', '\u1CE8', '\u1CED', '\u1CF4', '\u1DC0', '\u1DC1', '\u1DC2', '\u1DC3', '\u1DC4', '\u1DC5', '\u1DC6', '\u1DC7', '\u1DC8', '\u1DC9', '\u1DCA', '\u1DCB', '\u1DCC', '\u1DCD', '\u1DCE', '\u1DCF', '\u1DD0', '\u1DD1', '\u1DD2', '\u1DD3', '\u1DD4', '\u1DD5', '\u1DD6', '\u1DD7', '\u1DD8', '\u1DD9', '\u1DDA', '\u1DDB', '\u1DDC', '\u1DDD', '\u1DDE', '\u1DDF', '\u1DE0', '\u1DE1', '\u1DE2', '\u1DE3', '\u1DE4', '\u1DE5', '\u1DE6', '\u1DFC', '\u1DFD', '\u1DFE', '\u1DFF', '\u20D0', '\u20D1', '\u20D2', '\u20D3', '\u20D4', '\u20D5', '\u20D6', '\u20D7', '\u20D8', '\u20D9', '\u20DA', '\u20DB', '\u20DC', '\u20E1', '\u20E5', '\u20E6', '\u20E7', '\u20E8', '\u20E9', '\u20EA', '\u20EB', '\u20EC', '\u20ED', '\u20EE', '\u20EF', '\u20F0', '\u2CEF', '\u2CF0', '\u2CF1', '\u2D7F', '\u2DE0', '\u2DE1', '\u2DE2', '\u2DE3', '\u2DE4', '\u2DE5', '\u2DE6', '\u2DE7', '\u2DE8', '\u2DE9', '\u2DEA', '\u2DEB', '\u2DEC', '\u2DED', '\u2DEE', 
'\u2DEF', '\u2DF0', '\u2DF1', '\u2DF2', '\u2DF3', '\u2DF4', '\u2DF5', '\u2DF6', '\u2DF7', '\u2DF8', '\u2DF9', '\u2DFA', '\u2DFB', '\u2DFC', '\u2DFD', '\u2DFE', '\u2DFF', '\u302A', '\u302B', '\u302C', '\u302D', '\u3099', '\u309A', '\uA66F', '\uA674', '\uA675', '\uA676', '\uA677', '\uA678', '\uA679', '\uA67A', '\uA67B', '\uA67C', '\uA67D', '\uA69F', '\uA6F0', '\uA6F1', '\uA802', '\uA806', '\uA80B', '\uA825', '\uA826', '\uA8C4', '\uA8E0', '\uA8E1', '\uA8E2', '\uA8E3', '\uA8E4', '\uA8E5', '\uA8E6', '\uA8E7', '\uA8E8', '\uA8E9', '\uA8EA', '\uA8EB', '\uA8EC', '\uA8ED', '\uA8EE', '\uA8EF', '\uA8F0', '\uA8F1', '\uA926', '\uA927', '\uA928', '\uA929', '\uA92A', '\uA92B', '\uA92C', '\uA92D', '\uA947', '\uA948', '\uA949', '\uA94A', '\uA94B', '\uA94C', '\uA94D', '\uA94E', '\uA94F', '\uA950', '\uA951', '\uA980', '\uA981', '\uA982', '\uA9B3', '\uA9B6', '\uA9B7', '\uA9B8', '\uA9B9', '\uA9BC', '\uAA29', '\uAA2A', '\uAA2B', '\uAA2C', '\uAA2D', '\uAA2E', '\uAA31', '\uAA32', '\uAA35', '\uAA36', '\uAA43', '\uAA4C', '\uAAB0', '\uAAB2', '\uAAB3', '\uAAB4', '\uAAB7', '\uAAB8', '\uAABE', '\uAABF', '\uAAC1', '\uAAEC', '\uAAED', '\uAAF6', '\uABE5', '\uABE8', '\uABED', '\uFB1E', '\uFE00', '\uFE01', '\uFE02', '\uFE03', '\uFE04', '\uFE05', '\uFE06', '\uFE07', '\uFE08', '\uFE09', '\uFE0A', '\uFE0B', '\uFE0C', '\uFE0D', '\uFE0E', '\uFE0F', '\uFE20', '\uFE21', '\uFE22', '\uFE23', '\uFE24', '\uFE25', '\uFE26', '\U000101FD', '\U00010A01', '\U00010A02', '\U00010A03', '\U00010A05', '\U00010A06', '\U00010A0C', '\U00010A0D', '\U00010A0E', '\U00010A0F', '\U00010A38', '\U00010A39', '\U00010A3A', '\U00010A3F', '\U00011001', '\U00011038', '\U00011039', '\U0001103A', '\U0001103B', '\U0001103C', '\U0001103D', '\U0001103E', '\U0001103F', '\U00011040', '\U00011041', '\U00011042', '\U00011043', '\U00011044', '\U00011045', '\U00011046', '\U00011080', '\U00011081', '\U000110B3', '\U000110B4', '\U000110B5', '\U000110B6', '\U000110B9', '\U000110BA', '\U00011100', '\U00011101', '\U00011102', '\U00011127', 
'\U00011128', '\U00011129', '\U0001112A', '\U0001112B', '\U0001112D', '\U0001112E', '\U0001112F', '\U00011130', '\U00011131', '\U00011132', '\U00011133', '\U00011134', '\U00011180', '\U00011181', '\U000111B6', '\U000111B7', '\U000111B8', '\U000111B9', '\U000111BA', '\U000111BB', '\U000111BC', '\U000111BD', '\U000111BE', '\U000116AB', '\U000116AD', '\U000116B0', '\U000116B1', '\U000116B2', '\U000116B3', '\U000116B4', '\U000116B5', '\U000116B7', '\U00016F8F', '\U00016F90', '\U00016F91', '\U00016F92', '\U0001D167', '\U0001D168', '\U0001D169', '\U0001D17B', '\U0001D17C', '\U0001D17D', '\U0001D17E', '\U0001D17F', '\U0001D180', '\U0001D181', '\U0001D182', '\U0001D185', '\U0001D186', '\U0001D187', '\U0001D188', '\U0001D189', '\U0001D18A', '\U0001D18B', '\U0001D1AA', '\U0001D1AB', '\U0001D1AC', '\U0001D1AD', '\U0001D242', '\U0001D243', '\U0001D244', '\U000E0100', '\U000E0101', '\U000E0102', '\U000E0103', '\U000E0104', '\U000E0105', '\U000E0106', '\U000E0107', '\U000E0108', '\U000E0109', '\U000E010A', '\U000E010B', '\U000E010C', '\U000E010D', '\U000E010E', '\U000E010F', '\U000E0110', '\U000E0111', '\U000E0112', '\U000E0113', '\U000E0114', '\U000E0115', '\U000E0116', '\U000E0117', '\U000E0118', '\U000E0119', '\U000E011A', '\U000E011B', '\U000E011C', '\U000E011D', '\U000E011E', '\U000E011F', '\U000E0120', '\U000E0121', '\U000E0122', '\U000E0123', '\U000E0124', '\U000E0125', '\U000E0126', '\U000E0127', '\U000E0128', '\U000E0129', '\U000E012A', '\U000E012B', '\U000E012C', '\U000E012D', '\U000E012E', '\U000E012F', '\U000E0130', '\U000E0131', '\U000E0132', '\U000E0133', '\U000E0134', '\U000E0135', '\U000E0136', '\U000E0137', '\U000E0138', '\U000E0139', '\U000E013A', '\U000E013B', '\U000E013C', '\U000E013D', '\U000E013E', '\U000E013F', '\U000E0140', '\U000E0141', '\U000E0142', '\U000E0143', '\U000E0144', '\U000E0145', '\U000E0146', '\U000E0147', '\U000E0148', '\U000E0149', '\U000E014A', '\U000E014B', '\U000E014C', '\U000E014D', '\U000E014E', '\U000E014F', '\U000E0150', '\U000E0151',
'\U000E0152', '\U000E0153', '\U000E0154', '\U000E0155', '\U000E0156', '\U000E0157', '\U000E0158', '\U000E0159', '\U000E015A', '\U000E015B', '\U000E015C', '\U000E015D', '\U000E015E', '\U000E015F', '\U000E0160', '\U000E0161', '\U000E0162', '\U000E0163', '\U000E0164', '\U000E0165', '\U000E0166', '\U000E0167', '\U000E0168', '\U000E0169', '\U000E016A', '\U000E016B', '\U000E016C', '\U000E016D', '\U000E016E', '\U000E016F', '\U000E0170', '\U000E0171', '\U000E0172', '\U000E0173', '\U000E0174', '\U000E0175', '\U000E0176', '\U000E0177', '\U000E0178', '\U000E0179', '\U000E017A', '\U000E017B', '\U000E017C', '\U000E017D', '\U000E017E', '\U000E017F', '\U000E0180', '\U000E0181', '\U000E0182', '\U000E0183', '\U000E0184', '\U000E0185', '\uE0186', '\U000E0187', '\U000E0188', '\U000E0189', '\U000E018A', '\U000E018B', '\U000E018C', '\U000E018D', '\U000E018E', '\U000E018F', '\U000E0190', '\U000E0191', '\U000E0192', '\U000E0193', '\U000E0194', '\U000E0195', '\U000E0196', '\U000E0197', '\U000E0198', '\U000E0199', '\U000E019A', '\U000E019B', '\U000E019C', '\U000E019D', '\U000E019E', '\U000E019F', '\U000E01A0', '\U000E01A1', '\U000E01A2', '\U000E01A3', '\U000E01A4', '\U000E01A5', '\U000E01A6', '\U000E01A7', '\U000E01A8', '\U000E01A9', '\U000E01AA', '\U000E01AB', '\U000E01AC', '\U000E01AD', '\U000E01AE', '\U000E01AF', '\U000E01B0', '\U000E01B1', '\U000E01B2', '\U000E01B3', '\U000E01B4', '\U000E01B5', '\U000E01B6', '\U000E01B7', '\U000E01B8', '\U000E01B9', '\U000E01BA', '\U000E01BB', '\U000E01BC', '\U000E01BD', '\U000E01BE', '\U000E01BF', '\U000E01C0', '\U000E01C1', '\U000E01C2', '\U000E01C3', '\U000E01C4', '\U000E01C5', '\U000E01C6', '\U000E01C7', '\U000E01C8', '\U000E01C9', '\U000E01CA', '\U000E01CB', '\U000E01CC', '\U000E01CD', '\U000E01CE', '\U000E01CF', '\U000E01D0', '\U000E01D1', '\U000E01D2', '\U000E01D3', '\U000E01D4', '\U000E01D5', '\U000E01D6', '\U000E01D7', '\U000E01D8', '\U000E01D9', '\U000E01DA', '\U000E01DB', '\U000E01DC', '\U000E01DD', '\U000E01DE', '\U000E01DF', '\U000E01E0', 
'\U000E01E1', '\U000E01E2', '\U000E01E3', '\U000E01E4', '\U000E01E5', '\U000E01E6', '\U000E01E7', '\U000E01E8', '\U000E01E9', '\U000E01EA', '\U000E01EB', '\U000E01EC', '\U000E01ED', '\U000E01EE', '\U000E01EF'];
NOTE.
Please note that we have both \U000XXXXX and \uXXXX representations here.
I want to count the characters in Unicode input text, such as the Hindi string "अब यहां से कहा जाएँ हम" or a single token like "समझा", excluding the non-spacing characters.
My implementation looks like
def countNonSpacingCharString(str):
    count = 0
    for char in str:
        if char not in UNICODE_NSM:
            count = count + 1
    return count
Thanks to the help provided in the answers below, I have put everything together in this GitHub repository, which also contains a list of mark code points ready to be used in JavaScript / Node.js: https://github.com/loretoparisi/unicode_marks
This is the fastest way I came up with; len was slightly faster than sum. The set of all combining mark characters is built once, in the setup.
test.py:
import sys
from unicodedata import category
MARK_SET = set(chr(c) for c in range(sys.maxunicode + 1) if category(chr(c))[0] == 'M')
s = "अब यहां से कहा जाएँ हम"
def count_len(s):
    return len([c for c in s if c not in MARK_SET])
def count_sum(s):
    return sum([c not in MARK_SET for c in s])
if __name__ == '__main__':
    print(len(s))
    print(count_len(s))
    print(count_sum(s))
Output:
22
16
16
Timings:
C:\>py -m timeit -s "from test import count_sum,s" "count_sum(s)"
50000 loops, best of 5: 4.62 usec per loop
C:\>py -m timeit -s "from test import count_len,s" "count_len(s)"
50000 loops, best of 5: 3.97 usec per loop
It's worth noting that there is a grapheme 3rd party library. grapheme.length(s) == 16, but it was much slower (118us). The full grapheme-detecting algorithm is more complicated than skipping the modifier category. Consider the combining emojis for families and skin colors.
See also Unicode Text Segmentation.
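To illustrate why skipping the mark categories only approximates grapheme counting, here is a minimal sketch that uses just the standard unicodedata module. The helper name count_non_marks is made up for this example, and the emoji are written as escapes so that the code points are explicit.

```python
import unicodedata

def count_non_marks(text):
    """Count code points whose general category is not a mark (Mn, Mc, Me)."""
    return sum(not unicodedata.category(ch).startswith('M') for ch in text)

# Family emoji: man + ZWJ + woman + ZWJ + girl. It is displayed as a single
# symbol, but none of its five code points is a mark (ZWJ has category Cf).
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"

# Thumbs up followed by a skin tone modifier (a symbol, not a mark).
thumbs_up = "\U0001F44D\U0001F3FD"

print(len(family), count_non_marks(family))        # 5 5
print(len(thumbs_up), count_non_marks(thumbs_up))  # 2 2
```

Both strings are perceived as one character each, so a full grapheme segmentation would report 1 where the mark-skipping count reports 5 and 2.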
This might be a better alternative:
def countNonSpacingCharString(str):
    return len([char for char in str if char not in UNICODE_NSM])
How about using a dictionary to look up the values and incrementing the count when the character is not present? It should be faster than the former approach because checking whether a character is present becomes an O(1) operation on average, instead of a scan of the whole list.
The implementation should look somewhat like this:
Create a dict and populate it:
lookup_dict = {}
for alpha in UNICODE_NSM:
    lookup_dict[alpha] = 1
Look it up while looping through the string:
import time
def countNonSpacingCharString(str):
    count = 0
    start_time = time.time()  # time the whole loop, not a single character
    for char in str:
        if not lookup_dict.get(char):
            count = count + 1
    print("--- %s seconds ---" % (time.time() - start_time))
    return count
I must note that using str as a variable name in Python is a bad idea, because it shadows the built-in str type. Anyway, I would implement your function the following way:
def countNonSpacingCharString(s):
    return len(filter(lambda x: x not in UNICODE_NSM, s))
in Python 2
def countNonSpacingCharString(s):
    return sum(1 for _ in filter(lambda x: x not in UNICODE_NSM, s))
in Python 3
Inspecting my function with dis.dis showed that it produces less bytecode than your version with count, which suggests it might be faster, though bytecode length alone does not determine speed, so this needs to be measured.
EDIT: I tested my code in Python 2 but not in Python 3; the Python 3 version was added using Mohammad Banisaeid's answer from this topic.
EDIT 2: If you use UNICODE_NSM only for this, you might try a set instead of a list, which should speed up the in operator, though again this needs further investigation. For a discussion of list vs set performance, see this thread.
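A minimal sketch of the set suggestion from EDIT 2. The short NSM_LIST is only a stand-in for the full UNICODE_NSM list from the question, and the names NSM_SET and count_non_nsm are illustrative, not part of the original code.

```python
# Stand-in for the question's UNICODE_NSM list (only three entries here).
NSM_LIST = ['\u0300', '\u0301', '\u093c']

# Build the set once; each "in" test is then a hash lookup instead of a
# scan over the whole list.
NSM_SET = frozenset(NSM_LIST)

def count_non_nsm(s, marks=NSM_SET):
    """Count the characters of s that are not contained in marks."""
    return sum(1 for ch in s if ch not in marks)

sample = "e\u0301a\u0300x"  # 'e' + combining acute, 'a' + combining grave, 'x'
print(count_non_nsm(sample))            # 3
print(count_non_nsm(sample, NSM_LIST))  # 3 (same result, linear scan per test)
```

The result is identical either way; only the cost of each membership test changes, which matters with a list of well over a thousand entries.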
Perhaps the easiest way to do this is to use the unicodedata module, in part because it is more rigorously tested than a hand-maintained list. Indeed, your list appears to include categories other than Mn: it also contains code points from Mc (Mark, spacing combining), although you said you only wanted to exclude code points from Mn (Mark, nonspacing).
For example:
import unicodedata
def countNonSpacingCharString(string):
    category = unicodedata.category
    return sum(category(char) != 'Mn' for char in string)
This appears to be about 60 times faster than the list-based version, according to timeit.
You might get a TypeError if your version of Python, and therefore of unicodedata, is not up to date and so is not aware of recent additions to Unicode. You can get around this by installing unicodedata2 and using that instead.
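The difference between excluding only Mn and excluding every mark category can be seen on the token from the question. This is a small sketch; count_excluding is a hypothetical helper name, and the word is written with escapes so that each code point is visible.

```python
import unicodedata

def count_excluding(text, excluded):
    """Count code points whose general category is not in excluded."""
    return sum(unicodedata.category(ch) not in excluded for ch in text)

# "समझा" as escapes: SA, MA, JHA, VOWEL SIGN AA (category Mc, not Mn).
word = "\u0938\u092e\u091d\u093e"

print(count_excluding(word, {'Mn'}))              # 4: the Mc vowel sign is counted
print(count_excluding(word, {'Mn', 'Mc', 'Me'}))  # 3: all marks are skipped
```

Which of the two results is wanted depends on whether spacing combining marks should count as characters of their own.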
From your comments it looks like you're really after counting "user perceived characters". This is a complicated process with a number of edge cases. If you can, you should install the regex module in your environment (that would be MicroPython?). You can then do:
>>> import regex
>>> parts = regex.findall(r'\X', 'अब यहां से कहा जाएँ हम')
>>> parts
['अ', 'ब', ' ', 'य', 'हां', ' ', 'से', ' ', 'क', 'हा', ' ', 'जा', 'एँ', ' ', 'ह', 'म']
>>> len(parts)
16
This splits your string into "user perceived characters", and you can then work on this list of strings to get what you need.
Failing that, your current solution of just ignoring mark code points is an 80/20 solution (it gets you most of the way there for the least amount of effort). You will have to revise your list of Unicode marks, though. My tests showed that your list was missing 113 code points across all the Indo-European and Dravidian scripts in Unicode (Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, and Sinhala).
I extracted these characters by downloading and parsing: https://www.unicode.org/Public/11.0.0/ucd/UnicodeData.txt with the following code:
indian_script_range = range(0x0900, 0x0E00) # doesn't include all indic scripts (eg. Thai)
basic_multilingual_plane = range(0x0000, 0x10000)
# use the latter if you want to be more thorough and include all indic scripts and non-indic scripts
codepoint_range = indian_script_range
codepoints = []
with open('UnicodeData.txt') as f:
    for line in f:
        hex_string, name, category, *rest = line.strip().split(';')
        codepoint_number = int(hex_string, base=16)
        if (
            category in ('Mn', 'Mc', 'Me')
            and (
                codepoint_number in codepoint_range
                or name.startswith('VARIATION SELECTOR') # you seemed to want to include these
            )
        ):
            codepoints.append(chr(codepoint_number))
missing = set(codepoints) - set(UNICODE_NSM)
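A similar check can be sketched without downloading UnicodeData.txt, using the Unicode database bundled with Python, so the result depends on the Unicode version of the interpreter. UNICODE_NSM_SAMPLE is a four-entry stand-in for the question's list; the entry '\uE012C' is copied from it on purpose, because it is not a single code point.

```python
import unicodedata

# Stand-in for the question's list; '\uE012C' is copied from it verbatim.
UNICODE_NSM_SAMPLE = ['\u0900', '\u0901', '\uE012C', '\U000E012D']

# '\uXXXX' takes exactly four hex digits, so '\uE012C' is the two-character
# string '\ue012' + 'C', not the variation selector U+E012C.
malformed = [entry for entry in UNICODE_NSM_SAMPLE if len(entry) != 1]
print(malformed == ['\ue012' + 'C'])  # True

# Marks in the 0x0900-0x0DFF range according to this Python's database.
indic_marks = {
    chr(cp) for cp in range(0x0900, 0x0E00)
    if unicodedata.category(chr(cp)) in ('Mn', 'Mc', 'Me')
}
missing = indic_marks - set(UNICODE_NSM_SAMPLE)
print(unicodedata.unidata_version, len(missing))
```

Entries flagged as malformed can never match in a per-character membership test, so they are worth rewriting in the eight-digit '\U000XXXXX' form.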
Mark Tolonen's answer is the fastest because it uses a set for the membership test. If you have a text of length n and m mark characters to compare against, the worst-case runtime with two lists is O(nm). Using a set for the mark characters reduces that to O(n) on average.
Using unicodedata.category is just nicer because it is shorter and less prone to human error.
Performance comparison
In the box plot produced by the script below, markset_count and category_count are much faster than generator_count and loop_count, and the timings of the latter two also vary much more. Interestingly, generator_count is slower than loop_count.
markset_count is a bit faster than category_count, probably because looking up the category and comparing the strings also takes some time. The difference becomes much clearer when you plot only these two and increase the text length. The benchmark script:
import timeit
import sys
import unicodedata
import numpy as np
UNICODE_NSM = ['\u0300', '\u0301', '\u0302', '\u0303', '\u0304', '\u0305', '\u0306', '\u0307', '\u0308', '\u0309', '\u030A', '\u030B', '\u030C', '\u030D', '\u030E', '\u030F', '\u0310', '\u0311', '\u0312', '\u0313', '\u0314', '\u0315', '\u0316', '\u0317', '\u0318', '\u0319', '\u031A', '\u031B', '\u031C', '\u031D', '\u031E', '\u031F', '\u0320', '\u0321', '\u0322', '\u0323', '\u0324', '\u0325', '\u0326', '\u0327', '\u0328', '\u0329', '\u032A', '\u032B', '\u032C', '\u032D', '\u032E', '\u032F', '\u0330', '\u0331', '\u0332', '\u0333', '\u0334', '\u0335', '\u0336', '\u0337', '\u0338', '\u0339', '\u033A', '\u033B', '\u033C', '\u033D', '\u033E', '\u033F', '\u0340', '\u0341', '\u0342', '\u0343', '\u0344', '\u0345', '\u0346', '\u0347', '\u0348', '\u0349', '\u034A', '\u034B', '\u034C', '\u034D', '\u034E', '\u034F', '\u0350', '\u0351', '\u0352', '\u0353', '\u0354', '\u0355', '\u0356', '\u0357', '\u0358', '\u0359', '\u035A', '\u035B', '\u035C', '\u035D', '\u035E', '\u035F', '\u0360', '\u0361', '\u0362', '\u0363', '\u0364', '\u0365', '\u0366', '\u0367', '\u0368', '\u0369', '\u036A', '\u036B', '\u036C', '\u036D', '\u036E', '\u036F', '\u0483', '\u0484', '\u0485', '\u0486', '\u0487', '\u0591', '\u0592', '\u0593', '\u0594', '\u0595', '\u0596', '\u0597', '\u0598', '\u0599', '\u059A', '\u059B', '\u059C', '\u059D', '\u059E', '\u059F', '\u05A0', '\u05A1', '\u05A2', '\u05A3', '\u05A4', '\u05A5', '\u05A6', '\u05A7', '\u05A8', '\u05A9', '\u05AA', '\u05AB', '\u05AC', '\u05AD', '\u05AE', '\u05AF', '\u05B0', '\u05B1', '\u05B2', '\u05B3', '\u05B4', '\u05B5', '\u05B6', '\u05B7', '\u05B8', '\u05B9', '\u05BA', '\u05BB', '\u05BC', '\u05BD', '\u05BF', '\u05C1', '\u05C2', '\u05C4', '\u05C5', '\u05C7', '\u0610', '\u0611', '\u0612', '\u0613', '\u0614', '\u0615', '\u0616', '\u0617', '\u0618', '\u0619', '\u061A', '\u064B', '\u064C', '\u064D', '\u064E', '\u064F', '\u0650', '\u0651', '\u0652', '\u0653', '\u0654', '\u0655', '\u0656', '\u0657', '\u0658', '\u0659', '\u065A', '\u065B', '\u065C', '\u065D', 
'\u065E', '\u065F', '\u0670', '\u06D6', '\u06D7', '\u06D8', '\u06D9', '\u06DA', '\u06DB', '\u06DC', '\u06DF', '\u06E0', '\u06E1', '\u06E2', '\u06E3', '\u06E4', '\u06E7', '\u06E8', '\u06EA', '\u06EB', '\u06EC', '\u06ED', '\u0711', '\u0730', '\u0731', '\u0732', '\u0733', '\u0734', '\u0735', '\u0736', '\u0737', '\u0738', '\u0739', '\u073A', '\u073B', '\u073C', '\u073D', '\u073E', '\u073F', '\u0740', '\u0741', '\u0742', '\u0743', '\u0744', '\u0745', '\u0746', '\u0747', '\u0748', '\u0749', '\u074A', '\u07A6', '\u07A7', '\u07A8', '\u07A9', '\u07AA', '\u07AB', '\u07AC', '\u07AD', '\u07AE', '\u07AF', '\u07B0', '\u07EB', '\u07EC', '\u07ED', '\u07EE', '\u07EF', '\u07F0', '\u07F1', '\u07F2', '\u07F3', '\u0816', '\u0817', '\u0818', '\u0819', '\u081B', '\u081C', '\u081D', '\u081E', '\u081F', '\u0820', '\u0821', '\u0822', '\u0823', '\u0825', '\u0826', '\u0827', '\u0829', '\u082A', '\u082B', '\u082C', '\u082D', '\u0859', '\u085A', '\u085B', '\u08E4', '\u08E5', '\u08E6', '\u08E7', '\u08E8', '\u08E9', '\u08EA', '\u08EB', '\u08EC', '\u08ED', '\u08EE', '\u08EF', '\u08F0', '\u08F1', '\u08F2', '\u08F3', '\u08F4', '\u08F5', '\u08F6', '\u08F7', '\u08F8', '\u08F9', '\u08FA', '\u08FB', '\u08FC', '\u08FD', '\u08FE', '\u0900', '\u0901', '\u0902', '\u093A', '\u093C', '\u093E', '\u0941', '\u0942', '\u0943', '\u0944', '\u0945', '\u0946', '\u0947', '\u0948', '\u094D', '\u0951', '\u0952', '\u0953', '\u0954', '\u0955', '\u0956', '\u0957', '\u0962', '\u0963', '\u0981', '\u09BC', '\u09C1', '\u09C2', '\u09C3', '\u09C4', '\u09CD', '\u09E2', '\u09E3', '\u0A01', '\u0A02', '\u0A3C', '\u0A41', '\u0A42', '\u0A47', '\u0A48', '\u0A4B', '\u0A4C', '\u0A4D', '\u0A51', '\u0A70', '\u0A71', '\u0A75', '\u0A81', '\u0A82', '\u0ABC', '\u0AC1', '\u0AC2', '\u0AC3', '\u0AC4', '\u0AC5', '\u0AC7', '\u0AC8', '\u0ACD', '\u0AE2', '\u0AE3', '\u0B01', '\u0B3C', '\u0B3F', '\u0B41', '\u0B42', '\u0B43', '\u0B44', '\u0B4D', '\u0B56', '\u0B62', '\u0B63', '\u0B82', '\u0BC0', '\u0BCD', '\u0C3E', '\u0C3F', '\u0C40', '\u0C46', '\u0C47', 
'\u0C48', '\u0C4A', '\u0C4B', '\u0C4C', '\u0C4D', '\u0C55', '\u0C56', '\u0C62', '\u0C63', '\u0CBC', '\u0CBF', '\u0CC6', '\u0CCC', '\u0CCD', '\u0CE2', '\u0CE3', '\u0D41', '\u0D42', '\u0D43', '\u0D44', '\u0D4D', '\u0D62', '\u0D63', '\u0DCA', '\u0DD2', '\u0DD3', '\u0DD4', '\u0DD6', '\u0E31', '\u0E34', '\u0E35', '\u0E36', '\u0E37', '\u0E38', '\u0E39', '\u0E3A', '\u0E47', '\u0E48', '\u0E49', '\u0E4A', '\u0E4B', '\u0E4C', '\u0E4D', '\u0E4E', '\u0EB1', '\u0EB4', '\u0EB5', '\u0EB6', '\u0EB7', '\u0EB8', '\u0EB9', '\u0EBB', '\u0EBC', '\u0EC8', '\u0EC9', '\u0ECA', '\u0ECB', '\u0ECC', '\u0ECD', '\u0F18', '\u0F19', '\u0F35', '\u0F37', '\u0F39', '\u0F71', '\u0F72', '\u0F73', '\u0F74', '\u0F75', '\u0F76', '\u0F77', '\u0F78', '\u0F79', '\u0F7A', '\u0F7B', '\u0F7C', '\u0F7D', '\u0F7E', '\u0F80', '\u0F81', '\u0F82', '\u0F83', '\u0F84', '\u0F86', '\u0F87', '\u0F8D', '\u0F8E', '\u0F8F', '\u0F90', '\u0F91', '\u0F92', '\u0F93', '\u0F94', '\u0F95', '\u0F96', '\u0F97', '\u0F99', '\u0F9A', '\u0F9B', '\u0F9C', '\u0F9D', '\u0F9E', '\u0F9F', '\u0FA0', '\u0FA1', '\u0FA2', '\u0FA3', '\u0FA4', '\u0FA5', '\u0FA6', '\u0FA7', '\u0FA8', '\u0FA9', '\u0FAA', '\u0FAB', '\u0FAC', '\u0FAD', '\u0FAE', '\u0FAF', '\u0FB0', '\u0FB1', '\u0FB2', '\u0FB3', '\u0FB4', '\u0FB5', '\u0FB6', '\u0FB7', '\u0FB8', '\u0FB9', '\u0FBA', '\u0FBB', '\u0FBC', '\u0FC6', '\u102D', '\u102E', '\u102F', '\u1030', '\u1032', '\u1033', '\u1034', '\u1035', '\u1036', '\u1037', '\u1039', '\u103A', '\u103D', '\u103E', '\u1058', '\u1059', '\u105E', '\u105F', '\u1060', '\u1071', '\u1072', '\u1073', '\u1074', '\u1082', '\u1085', '\u1086', '\u108D', '\u109D', '\u135D', '\u135E', '\u135F', '\u1712', '\u1713', '\u1714', '\u1732', '\u1733', '\u1734', '\u1752', '\u1753', '\u1772', '\u1773', '\u17B4', '\u17B5', '\u17B7', '\u17B8', '\u17B9', '\u17BA', '\u17BB', '\u17BC', '\u17BD', '\u17C6', '\u17C9', '\u17CA', '\u17CB', '\u17CC', '\u17CD', '\u17CE', '\u17CF', '\u17D0', '\u17D1', '\u17D2', '\u17D3', '\u17DD', '\u180B', '\u180C', '\u180D', '\u18A9', 
'\u1920', '\u1921', '\u1922', '\u1927', '\u1928', '\u1932', '\u1939', '\u193A', '\u193B', '\u1A17', '\u1A18', '\u1A56', '\u1A58', '\u1A59', '\u1A5A', '\u1A5B', '\u1A5C', '\u1A5D', '\u1A5E', '\u1A60', '\u1A62', '\u1A65', '\u1A66', '\u1A67', '\u1A68', '\u1A69', '\u1A6A', '\u1A6B', '\u1A6C', '\u1A73', '\u1A74', '\u1A75', '\u1A76', '\u1A77', '\u1A78', '\u1A79', '\u1A7A', '\u1A7B', '\u1A7C', '\u1A7F', '\u1B00', '\u1B01', '\u1B02', '\u1B03', '\u1B34', '\u1B36', '\u1B37', '\u1B38', '\u1B39', '\u1B3A', '\u1B3C', '\u1B42', '\u1B6B', '\u1B6C', '\u1B6D', '\u1B6E', '\u1B6F', '\u1B70', '\u1B71', '\u1B72', '\u1B73', '\u1B80', '\u1B81', '\u1BA2', '\u1BA3', '\u1BA4', '\u1BA5', '\u1BA8', '\u1BA9', '\u1BAB', '\u1BE6', '\u1BE8', '\u1BE9', '\u1BED', '\u1BEF', '\u1BF0', '\u1BF1', '\u1C2C', '\u1C2D', '\u1C2E', '\u1C2F', '\u1C30', '\u1C31', '\u1C32', '\u1C33', '\u1C36', '\u1C37', '\u1CD0', '\u1CD1', '\u1CD2', '\u1CD4', '\u1CD5', '\u1CD6', '\u1CD7', '\u1CD8', '\u1CD9', '\u1CDA', '\u1CDB', '\u1CDC', '\u1CDD', '\u1CDE', '\u1CDF', '\u1CE0', '\u1CE2', '\u1CE3', '\u1CE4', '\u1CE5', '\u1CE6', '\u1CE7', '\u1CE8', '\u1CED', '\u1CF4', '\u1DC0', '\u1DC1', '\u1DC2', '\u1DC3', '\u1DC4', '\u1DC5', '\u1DC6', '\u1DC7', '\u1DC8', '\u1DC9', '\u1DCA', '\u1DCB', '\u1DCC', '\u1DCD', '\u1DCE', '\u1DCF', '\u1DD0', '\u1DD1', '\u1DD2', '\u1DD3', '\u1DD4', '\u1DD5', '\u1DD6', '\u1DD7', '\u1DD8', '\u1DD9', '\u1DDA', '\u1DDB', '\u1DDC', '\u1DDD', '\u1DDE', '\u1DDF', '\u1DE0', '\u1DE1', '\u1DE2', '\u1DE3', '\u1DE4', '\u1DE5', '\u1DE6', '\u1DFC', '\u1DFD', '\u1DFE', '\u1DFF', '\u20D0', '\u20D1', '\u20D2', '\u20D3', '\u20D4', '\u20D5', '\u20D6', '\u20D7', '\u20D8', '\u20D9', '\u20DA', '\u20DB', '\u20DC', '\u20E1', '\u20E5', '\u20E6', '\u20E7', '\u20E8', '\u20E9', '\u20EA', '\u20EB', '\u20EC', '\u20ED', '\u20EE', '\u20EF', '\u20F0', '\u2CEF', '\u2CF0', '\u2CF1', '\u2D7F', '\u2DE0', '\u2DE1', '\u2DE2', '\u2DE3', '\u2DE4', '\u2DE5', '\u2DE6', '\u2DE7', '\u2DE8', '\u2DE9', '\u2DEA', '\u2DEB', '\u2DEC', '\u2DED', '\u2DEE', 
'\u2DEF', '\u2DF0', '\u2DF1', '\u2DF2', '\u2DF3', '\u2DF4', '\u2DF5', '\u2DF6', '\u2DF7', '\u2DF8', '\u2DF9', '\u2DFA', '\u2DFB', '\u2DFC', '\u2DFD', '\u2DFE', '\u2DFF', '\u302A', '\u302B', '\u302C', '\u302D', '\u3099', '\u309A', '\uA66F', '\uA674', '\uA675', '\uA676', '\uA677', '\uA678', '\uA679', '\uA67A', '\uA67B', '\uA67C', '\uA67D', '\uA69F', '\uA6F0', '\uA6F1', '\uA802', '\uA806', '\uA80B', '\uA825', '\uA826', '\uA8C4', '\uA8E0', '\uA8E1', '\uA8E2', '\uA8E3', '\uA8E4', '\uA8E5', '\uA8E6', '\uA8E7', '\uA8E8', '\uA8E9', '\uA8EA', '\uA8EB', '\uA8EC', '\uA8ED', '\uA8EE', '\uA8EF', '\uA8F0', '\uA8F1', '\uA926', '\uA927', '\uA928', '\uA929', '\uA92A', '\uA92B', '\uA92C', '\uA92D', '\uA947', '\uA948', '\uA949', '\uA94A', '\uA94B', '\uA94C', '\uA94D', '\uA94E', '\uA94F', '\uA950', '\uA951', '\uA980', '\uA981', '\uA982', '\uA9B3', '\uA9B6', '\uA9B7', '\uA9B8', '\uA9B9', '\uA9BC', '\uAA29', '\uAA2A', '\uAA2B', '\uAA2C', '\uAA2D', '\uAA2E', '\uAA31', '\uAA32', '\uAA35', '\uAA36', '\uAA43', '\uAA4C', '\uAAB0', '\uAAB2', '\uAAB3', '\uAAB4', '\uAAB7', '\uAAB8', '\uAABE', '\uAABF', '\uAAC1', '\uAAEC', '\uAAED', '\uAAF6', '\uABE5', '\uABE8', '\uABED', '\uFB1E', '\uFE00', '\uFE01', '\uFE02', '\uFE03', '\uFE04', '\uFE05', '\uFE06', '\uFE07', '\uFE08', '\uFE09', '\uFE0A', '\uFE0B', '\uFE0C', '\uFE0D', '\uFE0E', '\uFE0F', '\uFE20', '\uFE21', '\uFE22', '\uFE23', '\uFE24', '\uFE25', '\uFE26', '\U000101FD', '\U00010A01', '\U00010A02', '\U00010A03', '\U00010A05', '\U00010A06', '\U00010A0C', '\U00010A0D', '\U00010A0E', '\U00010A0F', '\U00010A38', '\U00010A39', '\U00010A3A', '\U00010A3F', '\U00011001', '\U00011038', '\U00011039', '\U0001103A', '\U0001103B', '\U0001103C', '\U0001103D', '\U0001103E', '\U0001103F', '\U00011040', '\U00011041', '\U00011042', '\U00011043', '\U00011044', '\U00011045', '\U00011046', '\U00011080', '\U00011081', '\U000110B3', '\U000110B4', '\U000110B5', '\U000110B6', '\U000110B9', '\U000110BA', '\U00011100', '\U00011101', '\U00011102', '\U00011127', 
'\U00011128', '\U00011129', '\U0001112A', '\U0001112B', '\U0001112D', '\U0001112E', '\U0001112F', '\U00011130', '\U00011131', '\U00011132', '\U00011133', '\U00011134', '\U00011180', '\U00011181', '\U000111B6', '\U000111B7', '\U000111B8', '\U000111B9', '\U000111BA', '\U000111BB', '\U000111BC', '\U000111BD', '\U000111BE', '\U000116AB', '\U000116AD', '\U000116B0', '\U000116B1', '\U000116B2', '\U000116B3', '\U000116B4', '\U000116B5', '\U000116B7', '\U00016F8F', '\U00016F90', '\U00016F91', '\U00016F92', '\U0001D167', '\U0001D168', '\U0001D169', '\U0001D17B', '\U0001D17C', '\U0001D17D', '\U0001D17E', '\U0001D17F', '\U0001D180', '\U0001D181', '\U0001D182', '\U0001D185', '\U0001D186', '\U0001D187', '\U0001D188', '\U0001D189', '\U0001D18A', '\U0001D18B', '\U0001D1AA', '\U0001D1AB', '\U0001D1AC', '\U0001D1AD', '\U0001D242', '\U0001D243', '\U0001D244', '\U000E0100', '\U000E0101', '\U000E0102', '\U000E0103', '\U000E0104', '\U000E0105', '\U000E0106', '\U000E0107', '\U000E0108', '\U000E0109', '\U000E010A', '\U000E010B', '\U000E010C', '\U000E010D', '\U000E010E', '\U000E010F', '\U000E0110', '\U000E0111', '\U000E0112', '\U000E0113', '\U000E0114', '\U000E0115', '\U000E0116', '\U000E0117', '\U000E0118', '\U000E0119', '\U000E011A', '\U000E011B', '\U000E011C', '\U000E011D', '\U000E011E', '\U000E011F', '\U000E0120', '\U000E0121', '\U000E0122', '\U000E0123', '\U000E0124', '\U000E0125', '\U000E0126', '\U000E0127', '\U000E0128', '\U000E0129', '\U000E012A', '\U000E012B', '\uE012C', '\U000E012D', '\U000E012E', '\U000E012F', '\U000E0130', '\U000E0131', '\U000E0132', '\U000E0133', '\U000E0134', '\U000E0135', '\U000E0136', '\U000E0137', '\U000E0138', '\U000E0139', '\U000E013A', '\U000E013B', '\U000E013C', '\U000E013D', '\U000E013E', '\U000E013F', '\U000E0140', '\U000E0141', '\U000E0142', '\U000E0143', '\U000E0144', '\U000E0145', '\U000E0146', '\U000E0147', '\U000E0148', '\U000E0149', '\U000E014A', '\U000E014B', '\U000E014C', '\U000E014D', '\U000E014E', '\U000E014F', '\U000E0150', '\U000E0151', 
'\U000E0152', '\U000E0153', '\U000E0154', '\U000E0155', '\U000E0156', '\U000E0157', '\U000E0158', '\U000E0159', '\U000E015A', '\U000E015B', '\U000E015C', '\U000E015D', '\U000E015E', '\U000E015F', '\U000E0160', '\U000E0161', '\U000E0162', '\U000E0163', '\U000E0164', '\U000E0165', '\U000E0166', '\U000E0167', '\U000E0168', '\U000E0169', '\U000E016A', '\U000E016B', '\U000E016C', '\U000E016D', '\U000E016E', '\U000E016F', '\U000E0170', '\U000E0171', '\U000E0172', '\U000E0173', '\U000E0174', '\U000E0175', '\U000E0176', '\U000E0177', '\U000E0178', '\U000E0179', '\U000E017A', '\U000E017B', '\U000E017C', '\U000E017D', '\U000E017E', '\U000E017F', '\U000E0180', '\U000E0181', '\U000E0182', '\U000E0183', '\U000E0184', '\U000E0185', '\uE0186', '\U000E0187', '\U000E0188', '\U000E0189', '\U000E018A', '\U000E018B', '\U000E018C', '\U000E018D', '\U000E018E', '\U000E018F', '\U000E0190', '\U000E0191', '\U000E0192', '\U000E0193', '\U000E0194', '\U000E0195', '\U000E0196', '\U000E0197', '\U000E0198', '\U000E0199', '\U000E019A', '\U000E019B', '\U000E019C', '\U000E019D', '\U000E019E', '\U000E019F', '\U000E01A0', '\U000E01A1', '\U000E01A2', '\U000E01A3', '\U000E01A4', '\U000E01A5', '\U000E01A6', '\U000E01A7', '\U000E01A8', '\U000E01A9', '\U000E01AA', '\U000E01AB', '\U000E01AC', '\U000E01AD', '\U000E01AE', '\U000E01AF', '\U000E01B0', '\U000E01B1', '\U000E01B2', '\U000E01B3', '\U000E01B4', '\U000E01B5', '\U000E01B6', '\U000E01B7', '\U000E01B8', '\U000E01B9', '\U000E01BA', '\U000E01BB', '\U000E01BC', '\U000E01BD', '\U000E01BE', '\U000E01BF', '\U000E01C0', '\U000E01C1', '\U000E01C2', '\U000E01C3', '\U000E01C4', '\U000E01C5', '\U000E01C6', '\U000E01C7', '\U000E01C8', '\U000E01C9', '\U000E01CA', '\U000E01CB', '\U000E01CC', '\U000E01CD', '\U000E01CE', '\U000E01CF', '\U000E01D0', '\U000E01D1', '\U000E01D2', '\U000E01D3', '\U000E01D4', '\U000E01D5', '\U000E01D6', '\U000E01D7', '\U000E01D8', '\U000E01D9', '\U000E01DA', '\U000E01DB', '\U000E01DC', '\U000E01DD', '\U000E01DE', '\U000E01DF', '\U000E01E0', 
'\U000E01E1', '\U000E01E2', '\U000E01E3', '\U000E01E4', '\U000E01E5', '\U000E01E6', '\U000E01E7', '\U000E01E8', '\U000E01E9', '\U000E01EA', '\U000E01EB', '\U000E01EC', '\U000E01ED', '\U000E01EE', '\U000E01EF']
MARK_SET = set(chr(c) for c in range(sys.maxunicode + 1) if unicodedata.category(chr(c))[0] == 'M')
print('len(UNICODE_NSM) = {}'.format(len(UNICODE_NSM)))
print('len(MARK_SET) = {}'.format(len(MARK_SET)))
filepath = "UnicodeData.txt"
with open(filepath) as f:
    text = f.read()
text = text[:1000]
def main():
    ground_truth = loop_count(text)
    functions = [(loop_count, 'loop_count'),
                 (generator_count, 'generator_count'),
                 (category_count, 'category_count'),
                 (markset_count, 'markset_count'),
                 ]
    functions = functions[::-1]
    duration_list = {}
    for func, name in functions:
        is_correct = func(text) == ground_truth
        durations = timeit.repeat(lambda: func(text), repeat=500, number=3)
        if is_correct:
            correctness = 'correct'
        else:
            correctness = 'NOT correct'
        duration_list[name] = durations
        print('{func:<20}: {correctness}, '
              'min: {min:0.3f}s, mean: {mean:0.3f}s, max: {max:0.3f}s'
              .format(func=name,
                      correctness=correctness,
                      min=min(durations),
                      mean=np.mean(durations),
                      max=max(durations),
                      ))
    create_boxplot(duration_list)
def create_boxplot(duration_list):
    import seaborn as sns
    import matplotlib.pyplot as plt
    import operator
    plt.figure(num=None, figsize=(8, 4), dpi=300,
               facecolor='w', edgecolor='k')
    sns.set(style="whitegrid")
    sorted_keys, sorted_vals = zip(*sorted(duration_list.items(), key=operator.itemgetter(1)))
    flierprops = dict(markerfacecolor='0.75', markersize=1,
                      linestyle='none')
    ax = sns.boxplot(data=sorted_vals, width=.3, orient='h',
                     flierprops=flierprops,)
    ax.set(xlabel="Time in s", ylabel="")
    plt.yticks(plt.yticks()[0], sorted_keys)
    plt.tight_layout()
    plt.savefig("output.png")
def generator_count(text):
    return sum(1 for char in text if char not in UNICODE_NSM)
def loop_count(text):
    # 1769137
    count = 0
    for char in text:
        if char not in UNICODE_NSM:
            count += 1
    return count
def markset_count(text):
    return sum(char not in MARK_SET for char in text)
def category_count(text):
    return sum(unicodedata.category(char) != 'Mn' for char in text)
if __name__ == '__main__':
    main()

Extracting numbers in text file

I have a text file that was exported from Excel. I don't know how to extract the five digits that follow a specific marker.
I want to take only the five digits after #ACA in the text file.
My text looks like this:
ERROR_MESSAGE
(((#ACA16018)|(#ACA16019))&(#AQV71767='')&(#AQV71765='2'))?1:((#AQV71765='4')?1:((#AQV71767$'')?(((#AQV71765='1')|(#AQV71765='3'))?1:'Hasar veya Lehe Hukuk seçebilirsiniz'):'Rücu sıra numarasını yazıp Hasar veya Lehe Hukuk seçebilirsiniz'))
Rücu Oranı Girilmesi Zorunludur...'
#ACA17660
#ACA16560
#ACA15623
#ACA17804
BU ALANI BOŞ GEÇEMEZSİNİZ.EKSPER RAPORU GELMEDEN DY YE GERİ GÖNDEREMEZSİNİZ. PERT İHBARI VARSA PERT ÇALINMA OPERASYONU AKTİVİTESİ OLUŞTURULMALIDIR.
(#TSC[T008UNSMAS;FIRM_CODE=2 AND UNIT_TYPE='SG' AND UNIT_NO=#AQV71830]>0)?1:'Girdiğiniz değer fihristte yoktur'
#ACA17602
#ACA17604
#ACA56169
BU ALANI BOŞ GEÇEMEZSİNİZ
#ACA17606
#ACA17608
(#AQV71835='')?'Boş geçilemez':1
Lütfen Gönderilecek Kişinin Mail Adresini Giriniz ! '
LÜTFEN RED NEDENİNİ GİRİNİZ.
EKSİK BİLGİ / BELGE ALANINA GİRMİŞ OLDUĞUNUZ DEĞER YANLIŞ VEYA GEÇERŞİZDİR!!! LÜTFEN KONTROL EDİP TEKRAR DENEYİNİZ.'
BU ALAN BOŞ GEÇİLEMEZ. ÖDEME YAPILMADAN EK ÖDEME SÜRECİNİ BAŞLATAMAZSINIZ.
ONAYLANDI VE REDDEDİLDİ SEÇENEKLERİNİ KULLANAMAZSINIZ
BU ALAN BOŞ GEÇİLEMEZ.EVRAKLARINIZI , VARSA EKSPER RAPORUNU VE MUALLAĞI KONTROL EDİNİZ.
Muallak Tutarını kontrol ediniz.
'OTO BRANŞINDA REDDEDİLDİ NEDENİ SEÇMELİSİNİZ'
'OTODIŞI BRANŞINDA REDDEDİLDİ NEDENİ SEÇMELİSİNİZ'
(#AQV70003$'')?((#TSC[T001HASIHB;FIRM_CODE=#FP10100 AND COMPANY_CODE=2 AND CLAIM_NO=#AQV70003]$0)?1:'Bu dosya sistemde bulunmamaktadır'):'Bu alan boş geçilemez'
(#AQV70503='')?'Bu alan boş geçilemez.':((#ACA18635=1)?1:'Mağdura ait uygun kriterli ödeme kaydı mevcut değildir.')
(#AQV71809=0)?'Boş geçilemez':1
(#FD101AQV71904_AFDS<0)?'Tarih bugünün tarihinden büyük olamaz
I want every group of five digits that comes after #ACA, so:
16018, 16019, 17660, etc...
grep -oP '#ACA\K[0-9]{5}' file.txt
#ACA\K matches #ACA but, because of \K, leaves it out of the printed output (-P needs a grep built with PCRE support, such as GNU grep)
[0-9]{5} matches the five digits following #ACA
If a variable number of digits is needed, use
grep -oP '#ACA\K[0-9]+' file.txt
If you don't know or don't like regular expressions, you can do this, although the code is a bit longer:
if __name__ == '__main__':
    pattern = '#ACA'
    filename = 'yourfile.txt'
    res = list()
    with open(filename, encoding='utf-8') as f:  # open 'yourfile.txt' as text (UTF-8 is an assumption)
        for line in f:  # for each line in the file
            for s in line.split(pattern)[1:]:  # split the line on '#ACA'
                try:
                    nb = int(s[:5])  # take the first 5 characters after as an int
                    res.append(nb)  # add it to the list of numbers we found
                except ValueError:  # if conversion fails, that wasn't an int
                    pass
    print(res)  # if you want them in the same order as in the file
    print(sorted(res))  # if you want them in ascending order
This should do it, if you have the whole text in the variable str_var:
import re
print(re.findall(r"#ACA(\d+)", str_var))
Output:
['16018', '16019', '17660', '16560', '15623', '17804', '17602', '17604', '56169', '17606', '17608', '18635']
re.findall(r'#ACA(\d{5})', str_var)
[x[:5] for x in content.split("#ACA")[1:]]
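The two one-liners above differ when #ACA is not followed by five digits: the regex skips such occurrences, while the split version returns whatever five characters come next. A small sketch with made-up sample text (the last line of the sample is invented to show the edge cases):

```python
import re

text = "(((#ACA16018)|(#ACA16019))&(#AQV71767=''))\n#ACA17660\n#ACAxyz #ACA123"

# Regex: exactly five digits right after the marker.
with_regex = re.findall(r'#ACA(\d{5})', text)

# Split-based: take up to five characters after each marker, then keep only
# the chunks that really are five ASCII digits.
chunks = [part[:5] for part in text.split('#ACA')[1:]]
with_split = [c for c in chunks if len(c) == 5 and c.isascii() and c.isdigit()]

print(with_regex)  # ['16018', '16019', '17660']
print(chunks)      # ['16018', '16019', '17660', 'xyz ', '123']
print(with_split)  # ['16018', '16019', '17660']
```

With the extra validation step the split version gives the same result as the regex; without it, the unfiltered chunks would leak into the output.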
PowerShell solution:
$content = Get-Content -Raw 'your_file'
$match = [regex]::Matches($content, '#ACA(\d{5})')
$match | ForEach-Object {
    $_.Groups[1].Value
}
Output:
16018
16019
17660
16560
15623
17804
17602
17604
56169
17606
17608
18635

how to find a particular string in an element of array in python

I have a list of strings in Python, and if an element of the list contains the word "parthipan" I should print a message. But the script below is not working:
import re
a = ["paul Parthipan","paul","sdds","sdsdd"]
last_name = "Parthipan"
my_regex = r"(?mis){0}".format(re.escape(last_name))
if my_regex in a:
    print "matched"
The first element of the list contains the word "parthipan", so it should print the message.
If you want to do this with a regexp, you can't use the in operator. Use re.search() instead. But it works with strings, not a whole list.
for elt in a:
    if re.search(my_regex, elt):
        print("Matched")
        break # stop looking
Or in more functional style:
if any(re.search(my_regex, elt) for elt in a):
    print("Matched")
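A variant of the same idea, sketched with a pattern compiled once and the re.IGNORECASE flag instead of the inline (?mis) prefix. The variable names pattern and matches are chosen for this example only.

```python
import re

a = ["paul Parthipan", "paul", "sdds", "sdsdd"]
last_name = "parthipan"

# re.escape guards against regex metacharacters in the name;
# re.IGNORECASE makes the search case-insensitive.
pattern = re.compile(re.escape(last_name), re.IGNORECASE)

# Collect every element that contains the name anywhere in the string.
matches = [elt for elt in a if pattern.search(elt)]
print(matches)  # ['paul Parthipan']

if matches:
    print("Matched")
```

Collecting the matching elements instead of only testing with any() also shows which entries triggered the message.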
You don't need regex for this; simply use any.
>>> a = ["paul Parthipan","paul","sdds","sdsdd"]
>>> last_name = "Parthipan".lower()
>>> if any(last_name in name.lower() for name in a):
... print("Matched")
...
Matched
Why not:
a = ["paul Parthipan","paul","sdds","sdsdd"]
last_name = "Parthipan"
if any(last_name in ai for ai in a):
    print("matched")
Also, what is this part for:
...
import re
my_regex = r"(?mis){0}".format(re.escape(last_name))
...
EDIT:
I can't see why you need a regex here. It would be best if you gave some real input and output. Here is a small example that could also be done this way:
a = ["paul Parthipan","paul","sdds","sdsdd",'Mala_Koala','Czarna,Pala']
last_name = "Parthipan"
names=[]
breakers=[' ','_',',']
for ai in a:
    for b in breakers:
        if b in ai:
            names.append(ai.split(b))
full_names=[ai for ai in names if len(ai)==2]
last_names=[ai[1] for ai in full_names]
if any(last_name in ai for ai in last_names):
    print("matched")
But if the regex part is really needed, I can't imagine how to find '(?mis)Parthipan' in 'Parthipan' with the in operator. The simplest option would be the reverse direction, 'Parthipan' in '(?mis)Parthipan', like here:
import re
a = ["paul Parthipan","paul","sdds","sdsdd",'Mala_Koala','Czarna,Pala']
last_name = "Parthipan"
names=[]
breakers=[' ','_',',']
for ai in a:
    for b in breakers:
        if b in ai:
            names.append(ai.split(b))
full_names=[ai for ai in names if len(ai)==2]
last_names=[r"(?mis){0}".format(re.escape(ai[1])) for ai in full_names]
print(last_names)
if any(last_name in ai for ai in last_names):
    print("matched")
EDIT:
Hm, with regex you have a few possibilities:
import re
a = ["paul Parthipan","paul","sdds","sdsdd",'jony-Parthipan','koala_Parthipan','Parthipan']
lastName = "Parthipan"
myRegex = r"(?mis){0}".format(re.escape(lastName))
strA=';'.join(a)
se = re.search(myRegex, strA)
ma = re.match(myRegex, strA)
fa = re.findall(myRegex, strA)
fi=[i.group() for i in re.finditer(myRegex, strA, flags=0)]
se = '' if se is None else se.group()
ma = '' if ma is None else ma.group()
print se, 'match' if any(se) else 'no match'
print ma, 'match' if any(ma) else 'no match'
print fa, 'match' if any(fa) else 'no match'
print fi, 'match' if any(fi) else 'no match'
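Output. re.match finds nothing because it only matches at the very start of the string, which here begins with "paul"; re.search, re.findall and re.finditer all find the name, and re.search is the simplest choice for a yes/no test: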
Parthipan match
no match
['Parthipan', 'Parthipan', 'Parthipan', 'Parthipan'] match
['Parthipan', 'Parthipan', 'Parthipan', 'Parthipan'] match
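The behaviour in the output above comes from re.match anchoring at the start of the string. A short sketch, using a shortened joined string of my own rather than the full list from the answer:

```python
import re

pattern = re.compile(r"(?i)parthipan")
joined = "paul Parthipan;paul;Parthipan"

# match() only succeeds if the pattern matches at position 0.
print(pattern.match(joined))                   # None
# search() scans the whole string and returns the first hit.
print(pattern.search(joined).group())          # Parthipan
# findall() returns every non-overlapping hit.
print(pattern.findall(joined))                 # ['Parthipan', 'Parthipan']
# match() does succeed when the string starts with the name.
print(pattern.match("Parthipan") is not None)  # True
```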
