Extracting numbers in text file - python

I have a text file that was exported from Excel. I don't know how to extract the five digits that follow a specific marker.
I want only the five digits after #ACA in the text file.
My text looks like this:
ERROR_MESSAGE
(((#ACA16018)|(#ACA16019))&(#AQV71767='')&(#AQV71765='2'))?1:((#AQV71765='4')?1:((#AQV71767$'')?(((#AQV71765='1')|(#AQV71765='3'))?1:'Hasar veya Lehe Hukuk seçebilirsiniz'):'Rücu sıra numarasını yazıp Hasar veya Lehe Hukuk seçebilirsiniz'))
Rücu Oranı Girilmesi Zorunludur...'
#ACA17660
#ACA16560
#ACA15623
#ACA17804
BU ALANI BOŞ GEÇEMEZSİNİZ.EKSPER RAPORU GELMEDEN DY YE GERİ GÖNDEREMEZSİNİZ. PERT İHBARI VARSA PERT ÇALINMA OPERASYONU AKTİVİTESİ OLUŞTURULMALIDIR.
(#TSC[T008UNSMAS;FIRM_CODE=2 AND UNIT_TYPE='SG' AND UNIT_NO=#AQV71830]>0)?1:'Girdiğiniz değer fihristte yoktur'
#ACA17602
#ACA17604
#ACA56169
BU ALANI BOŞ GEÇEMEZSİNİZ
#ACA17606
#ACA17608
(#AQV71835='')?'Boş geçilemez':1
Lütfen Gönderilecek Kişinin Mail Adresini Giriniz ! '
LÜTFEN RED NEDENİNİ GİRİNİZ.
EKSİK BİLGİ / BELGE ALANINA GİRMİŞ OLDUĞUNUZ DEĞER YANLIŞ VEYA GEÇERŞİZDİR!!! LÜTFEN KONTROL EDİP TEKRAR DENEYİNİZ.'
BU ALAN BOŞ GEÇİLEMEZ. ÖDEME YAPILMADAN EK ÖDEME SÜRECİNİ BAŞLATAMAZSINIZ.
ONAYLANDI VE REDDEDİLDİ SEÇENEKLERİNİ KULLANAMAZSINIZ
BU ALAN BOŞ GEÇİLEMEZ.EVRAKLARINIZI , VARSA EKSPER RAPORUNU VE MUALLAĞI KONTROL EDİNİZ.
Muallak Tutarını kontrol ediniz.
'OTO BRANŞINDA REDDEDİLDİ NEDENİ SEÇMELİSİNİZ'
'OTODIŞI BRANŞINDA REDDEDİLDİ NEDENİ SEÇMELİSİNİZ'
(#AQV70003$'')?((#TSC[T001HASIHB;FIRM_CODE=#FP10100 AND COMPANY_CODE=2 AND CLAIM_NO=#AQV70003]$0)?1:'Bu dosya sistemde bulunmamaktadır'):'Bu alan boş geçilemez'
(#AQV70503='')?'Bu alan boş geçilemez.':((#ACA18635=1)?1:'Mağdura ait uygun kriterli ödeme kaydı mevcut değildir.')
(#AQV71809=0)?'Boş geçilemez':1
(#FD101AQV71904_AFDS<0)?'Tarih bugünün tarihinden büyük olamaz
I want every five-digit number that comes after #ACA, so:
16018, 16019, 17660, etc...

grep -oP '#ACA\K[0-9]{5}' file.txt
#ACA\K matches #ACA, but \K drops it from the printed output
[0-9]{5} matches the five digits following #ACA
If a variable number of digits is needed, use
grep -oP '#ACA\K[0-9]+' file.txt

If you don't know or don't like regular expressions, you can do this, although the code is a bit longer:
if __name__ == '__main__':
    pattern = b'#ACA'  # bytes, because the file is opened in binary mode
    filename = 'yourfile.txt'
    res = []
    with open(filename, 'rb') as f:  # open 'yourfile.txt' in byte-reading mode
        for line in f:  # for each line in the file
            for s in line.split(pattern)[1:]:  # split the line on '#ACA'
                try:
                    nb = int(s[:5])  # take the first 5 characters after as an int
                    res.append(nb)  # add it to the list of numbers we found
                except ValueError:  # if conversion fails, that wasn't an int
                    pass
    print(res)  # if you want them in the same order as in the file
    print(sorted(res))  # if you want them in ascending order

This should do it
import re
print(re.findall(r"#ACA(\d+)", str_var))
If you have the whole text in the variable str_var
Output:
['16018', '16019', '17660', '16560', '15623', '17804', '17602', '17604', '56169', '17606', '17608', '18635']

re.findall(r'#ACA(\d{5})', str_var)
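To tie the regex answers together, here is a minimal, self-contained sketch. The function name extract_codes and the shortened sample string are illustrative assumptions, not part of the original answers; the sample is a trimmed excerpt of the text in the question.

```python
import re

# Five digits immediately following the literal marker "#ACA".
ACA_CODE = re.compile(r'#ACA(\d{5})')

def extract_codes(text):
    """Return every five-digit code that follows '#ACA', in file order."""
    return ACA_CODE.findall(text)

# A shortened excerpt of the text from the question (illustrative only).
sample = "(((#ACA16018)|(#ACA16019))&(#AQV71767=''))\n#ACA17660\n(#AQV71835='')?'x':1"
print(extract_codes(sample))
```

For a real file, the text could be read with open(...).read() and passed to the same function; the encoding of an Excel export is not known here, so it would need to be chosen to match the file.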

[x[:5] for x in content.split("#ACA")[1:]]

PowerShell solution:
$content = Get-Content -Raw 'your_file'
$match = [regex]::Matches($content, '#ACA(\d{5})')
$match | ForEach-Object {
    $_.Groups[1].Value
}
Output:
16018
16019
17660
16560
15623
17804
17602
17604
56169
17606
17608
18635


matching names across multiple lists

the following code is what I've tried to do so far:
import json
uids = {'483775843796': '"jared trav"','483843796': '"azu jared"', '483843996': '"hello azu"', '44384376': '"bitten virgo"', '48384326': '"bitten hello"', '61063868': '"charm feline voxela derp virgo"', '11136664': '"jessica"', '11485423': '"yukkixxtsuki"', '10401438': '"howen"', '29176667': '"zaku ramba char"', '36976082': '"bulma zelda dame prince"', '99661300': '"voxela"', '76923817': '"juniperrose"', '16179876': '"gnollfighter"', '45012369': '"pianist fuzz t travis blunt trav ttttttttttttttttttyt whole ryann lol tiper cuz"', '62797501': '"asriel"', '73647929': '"voxela"', '95019796': '"dao daoisms"', '70094978': '"mort"', '16233382': '"purrs"', '89270209': '"apocalevie waify"', '42873540': '"tear slash peaches attitude maso lyra juvia innocent"', '61284894': '"pup"', '68487075': '"ninja"', '66451758': '"az"', '23492247': '"vegeta"', '77980169': '"virus"'}
def _whois(string):
    a = []
    for i in uids:
        i = json.loads(uids[i])
        i = i.split()
        if string in i:
            a += i
    for i in uids:
        i = json.loads(uids[i])
        i = i.split()
        if bool(set(i) & set(a)) == True:
            a += i
    return list(set(a))
def whois(string):
    a = []
    ret = _whois(string)
    for i in ret:
        a += _whois(i)
    return list(set(a))
print(whois("charm"))
I am trying to match a search term against the accounts that share an id with it, then match each of those accounts against the accounts on other ids, and so on, so that I can see all of the linked accounts reachable from a single term.
For example, if I searched "charm" it would return: "charm feline voxela derp virgo bitten hello" from the example uids above.
After a certain depth along the chain of connected accounts it stops matching. How can I make it follow the links to any depth?
I think I got it to work:
import json
terms = {'4837759863453450996': '"mamma riyoken"','4833480984509580996': '"mamma heika"','483775980980996': '"nemo heika"','4867568843796': '"control nemo"','4956775843796': '"t control"','483775843796': '"jared trav"','483843796': '"azu jared"', '483843996': '"hello azu"', '44384376': '"bitten virgo"', '48384326': '"bitten hello"', '61063868': '"charm feline voxela derp virgo"', '11136664': '"jessica"', '11485423': '"yukkixxtsuki"', '10401438': '"howen"', '29176667': '"zaku ramba char"', '36976082': '"bulma zelda dame prince"', '99661300': '"voxela"', '76923817': '"juniperrose"', '16179876': '"gnollfighter"', '45012369': '"pianist fuzz t travis blunt trav ttttttttttttttttttyt whole ryann lol tiper cuz"', '62797501': '"asriel"', '73647929': '"voxela"', '95019796': '"dao daoisms"', '70094978': '"mort"', '16233382': '"purrs"', '89270209': '"apocalevie waify"', '42873540': '"tear slash peaches attitude maso lyra juvia innocent"', '61284894': '"pup"', '68487075': '"ninja"', '66451758': '"az"', '23492247': '"vegeta"', '77980169': '"virus"'}
def _search(string):
    a = []
    for i in terms:
        i = json.loads(terms[i])
        i = i.split()
        if string in i:
            a += i
    return list(set(a))
def search(string):
    a = [string]
    while True:
        l = len(a)
        for n in list(a):  # iterate over a snapshot, because a grows inside the loop
            a += _search(n)
        a = list(set(a))
        if l == len(a):  # nothing new was found, so every link has been followed
            break
    return a
print(search("charm"))
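The same "follow links until nothing new turns up" idea can also be written as a breadth-first search, which visits each name once instead of rescanning every id on every pass. This is only a sketch: the function name linked_names and the four-entry sample dictionary are assumptions made for illustration, and the values use the same quoted-string format as the dictionaries above.

```python
import json
from collections import deque

def linked_names(start, groups):
    """Return every name reachable from `start` through shared ids."""
    # Index: name -> list of groups (lists of names) that contain it.
    by_name = {}
    for names in groups:
        for name in names:
            by_name.setdefault(name, []).append(names)

    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for names in by_name.get(current, []):
            for name in names:
                if name not in seen:  # each name is queued at most once
                    seen.add(name)
                    queue.append(name)
    return seen

# Hypothetical, shortened sample in the same format as the question's data.
sample = {'1': '"charm feline voxela derp virgo"', '2': '"bitten virgo"',
          '3': '"bitten hello"', '4': '"jessica"'}
groups = [json.loads(value).split() for value in sample.values()]
print(sorted(linked_names("charm", groups)))
```

Because every name enters the queue at most once, the search always terminates, however long the chain of linked accounts is.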
Try this:
ids = {'483775843796': '"jared trav"','483843796': '"azu jared"', '483843996': '"hello azu"', '44384376': '"bitten virgo"', '48384326': '"bitten hello"', '61063868': '"charm feline voxela derp virgo"', '11136664': '"jessica"', '11485423': '"yukkixxtsuki"', '10401438': '"howen"', '29176667': '"zaku ramba char"', '36976082': '"bulma zelda dame prince"', '99661300': '"voxela"', '76923817': '"juniperrose"', '16179876': '"gnollfighter"', '45012369': '"pianist fuzz t travis blunt trav ttttttttttttttttttyt whole ryann lol tiper cuz"', '62797501': '"asriel"', '73647929': '"voxela"', '95019796': '"dao daoisms"', '70094978': '"mort"', '16233382': '"purrs"', '89270209': '"apocalevie waify"', '42873540': '"tear slash peaches attitude maso lyra juvia innocent"', '61284894': '"pup"', '68487075': '"ninja"', '66451758': '"az"', '23492247': '"vegeta"', '77980169': '"virus"'}
def find_word(word, mapping):
    for i, j in mapping.items():
        if word.lower() in j.lower():
            print(i, j)
find_word('jared', ids)
Result:
483775843796 "jared trav"
483843796 "azu jared"

How to transform an affirmative sentence into a general question using Python Udapi?

I would like to transform some pretty simple affirmative sentences into general questions (the language of choice is Spanish). Consider the following example:
Esto es muy difícil. -> Es esto muy difícil?
So I just need to shift the position of subject and predicate (wherever they are).
Normally it can be done with the shift_before_node() method:
pron_node, aux_node = tree.descendants[0], tree.descendants[1]
aux_node.shift_before_node(pron_node)
However, if I want to automate the process (because the subject and predicate will not always be in the same position), I need a loop over the nodes of the tree (see "The problem" paragraph below) that checks whether a node's part of speech (upos) is PRON or PROPN and whether it is followed (not necessarily directly) by a node that is a VERB or AUX; if so, it should shift the second node before the first one (as in the example above). But I don't know how to implement this as a loop. Any suggestions?
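The rule described above can be sketched without Udapi, on plain (form, upos) pairs, just to make the intended logic concrete. The function name move_verb_before_subject and the tuple representation are assumptions for illustration; real Udapi nodes would be reordered with methods such as shift_before_node() instead.

```python
SUBJECT_TAGS = {"PRON", "PROPN"}
VERB_TAGS = {"VERB", "AUX"}

def move_verb_before_subject(tokens):
    """tokens is a list of (form, upos) pairs; return a reordered copy."""
    # Position of the first subject-like token, if any.
    subj = next((i for i, (_, upos) in enumerate(tokens) if upos in SUBJECT_TAGS), None)
    if subj is None:
        return list(tokens)
    # Position of the first verb-like token after it (not necessarily adjacent).
    verb = next((j for j in range(subj + 1, len(tokens)) if tokens[j][1] in VERB_TAGS), None)
    if verb is None:
        return list(tokens)
    reordered = list(tokens)
    reordered.insert(subj, reordered.pop(verb))  # move the verb in front of the subject
    return reordered

sentence = [("Esto", "PRON"), ("es", "AUX"), ("muy", "ADV"), ("difícil", "ADJ"), (".", "PUNCT")]
print(" ".join(form for form, _ in move_verb_before_subject(sentence)))
```

This only reorders tokens; capitalization, punctuation and dependent clauses are not handled in this sketch.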
Here is my code so far (done in Google Colab). I apologize for excluding some of the console text, otherwise it would be too lengthy:
Request to UDPipe server
import requests
response = requests.get("http://lindat.mff.cuni.cz/services/udpipe/api/models")
info = response.json()
info
for key, data in info["models"].items():
    if "spanish" in key:
        print(key, data)
params = {"tokenizer": "", "tagger": "", "parser": "", "model": "spanish-gsd-ud-2.6-200830"}
text = "Esto es muy difícil."
params["data"] = text
response = requests.get("http://lindat.mff.cuni.cz/services/udpipe/api/process", params)
json_response = response.json()
parse = json_response["result"]
print(parse)
Output #1 (print (parse)):
# generator = UDPipe 2, https://lindat.mff.cuni.cz/services/udpipe
# udpipe_model = spanish-gsd-ud-2.6-200830
# udpipe_model_licence = CC BY-NC-SA
# newdoc
# newpar
# sent_id = 1
# text = Esto es muy difícil.
1 Esto este PRON _ Number=Sing|PronType=Dem 4 nsubj _ _
2 es ser AUX _ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin 4 cop _ _
3 muy mucho ADV _ _ 4 advmod _ _
4 difícil difícil ADJ _ Number=Sing 0 root _ SpaceAfter=No
5 . . PUNCT _ _ 4 punct _ SpaceAfter=No
Udapi Installation:
!pip install --upgrade git+https://github.com/udapi/udapi-python.git
import os
os.environ['PATH'] += ":" + os.path.join(os.environ['HOME'], ".local/bin")
from udapi.block.read.conllu import Conllu
from udapi.core.document import Document
from udapi.block.write.textmodetrees import TextModeTrees
from io import StringIO
Building a tree:
As I understand it, a tree is an instance of a built-in Udapi class: a structured version of the parse variable that holds all the information about each word of the sentence, such as its order (ord), given form (form), initial form (lemma), part of speech (upos) and so on:
tree = Conllu(filehandle=StringIO(parse)).read_tree()
writer = TextModeTrees(attributes="ord,form,lemma,upos,feats,deprel", layout="align")
writer.process_tree(tree)
Output #2 (writer.process_tree(tree)):
# sent_id = 1
# text = Esto es muy difícil.
─┮
│ ╭─╼ 1 Esto este PRON Number=Sing|PronType=Dem nsubj
│ ┢─╼ 2 es ser AUX Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin cop
│ ┢─╼ 3 muy mucho ADV _ advmod
╰─┾ 4 difícil difícil ADJ Number=Sing root
╰─╼ 5 . . PUNCT _ punct
It is also possible to print out all the dependents for each node of a given tree. As already correctly noted in the comments, tree.descendants consists of a list of nodes:
for node in tree.descendants:
    print(f"{node.ord}:{node.form}")
    left_children = node.children(preceding_only=True)
    if len(left_children) > 0:
        print("Left dependents:", end=" ")
        for child in left_children:
            print(f"{child.ord}:{child.form}", end=" ")
        print("")
    right_children = node.children(following_only=True)
    if len(right_children) > 0:
        print("Right dependents:", end=" ")
        for child in right_children:
            print(f"{child.ord}:{child.form}", end=" ")
        print("")
Output #3:
1:Esto
2:es
3:muy
4:difícil
Left dependents: 1:Esto 2:es 3:muy
Right dependents: 5:.
5:.
The problem (beginning of a loop):
for node in tree.descendants:
    if node.upos == "VERB" or node.upos == "AUX":
UPDATE 1
So, I've come to the first somewhat complete version of the loop I need, and it now looks like this:
for i, curr_node in enumerate(nodes[1:], 1):
    prev_node = nodes[i-1]
    if (prev_node.upos == "PRON" or prev_node.upos == "PROPN") and (curr_node.upos == "VERB" or curr_node.upos == "AUX"):
        curr_node.shift_before_node(prev_node)
But now I get this error:
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-11-a967bbd730fe> in <module>()
9
10
---> 11 for i, curr_node in enumerate(nodes[1:], 1):
12 prev_node = nodes[i-1]
13 if (prev_node.upos == "PRON" or prev_node.upos == "PROPN") and (curr_node.upos == "VERB" or curr_node.upos == "AUX"):
NameError: name 'nodes' is not defined
UPDATE 2
I tried defining nodes like this:
nodes = tree.descendants
Now my loop at least runs, but it still does not change the structure of the sentence:
nodes = tree.descendants
for i, curr_node in enumerate(nodes[1:], 1):
    prev_node = nodes[i-1]
    if (prev_node.upos == "PRON" or prev_node.upos == "PROPN") and (curr_node.upos == "VERB" or curr_node.upos == "AUX"):
        curr_node.shift_before_node(prev_node)
Checking the tree:
tree = Conllu(filehandle=StringIO(parse)).read_tree()
writer = TextModeTrees(attributes="ord,form,lemma,upos,feats,deprel", layout="align")
writer.process_tree(tree)
# sent_id = 1
# text = Esto es muy difícil.
─┮
│ ╭─╼ 1 Esto este PRON Number=Sing|PronType=Dem nsubj
│ ┢─╼ 2 es ser AUX Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin cop
│ ┢─╼ 3 muy mucho ADV _ advmod
╰─┾ 4 difícil difícil ADJ Number=Sing root
╰─╼ 5 . . PUNCT _
Nothing changed.
UPDATE 3
I've also tried to check whether the loop swaps the subject and predicate back again (a second time), making the sentence look like the original, but I guess that is not the case, because even with the break part commented out, flag has increased by 1 only:
nodes = tree.descendants
flag = 1
for i, curr_node in enumerate(nodes[1:], 1):
    prev_node = nodes[i-1]
    if ((prev_node.upos == "PRON") or (prev_node.upos == "PROPN")) and ((curr_node.upos == "VERB") or (curr_node.upos == "AUX")):
        curr_node.shift_before_node(prev_node)
        flag = flag + 1
        # if flag == 2:
        #     break
print(flag)
Output
2
However, this means that the condition if ((prev_node.upos == "PRON") or (prev_node.upos == "PROPN")) and ((curr_node.upos == "VERB") or (curr_node.upos == "AUX")) was satisfied.
Suppose there is one sentence per line in affirm.txt with affirmative Spanish sentences such as "Esto es muy difícil." or "Tus padres compraron esa casa de la que me hablaste.".
As an alternative to using the UDPipe web service, we can parse the sentences locally (I slightly prefer the es_ancora model over es_gsd):
import udapi
doc = udapi.Document('affirm.txt')
udapi.create_block('udpipe.Base', model_alias='es_ancora').apply_on_document(doc)
To make repeated experiments faster, we can now store the parsed trees to a CoNLL-U file using doc.store_conllu('affirm.conllu') and later load it using doc = udapi.Document('affirm.conllu').
To draw the trees we can use the doc.draw() method (or even tree.draw()), which is syntactic sugar that uses TextModeTrees() behind the scenes. So to compare the sentences before and after changing the word order, we can use:
print("Original word order:")
doc.draw() # or doc.draw(attributes="ord,form,lemma,deprel,feats,misc")
for tree in doc.trees:
    process_tree(tree)
print("Question-like word order:")
doc.draw()
Now comes the main work: implementing the process_tree() subroutine. Note that:
1. We need to change the word order of the main clause only (e.g. "Tus padres compraron esa casa."), not of any dependent clauses (e.g. "de la que me hablaste"). So we don't want to iterate over all nodes (tree.descendants); we just need to find the main predicate (usually a verb) and its subject.
2. The subject does not have to be a PRON or PROPN; it can be a NOUN, or even just an ADJ if the governing noun is omitted. So it is safer to just ask for deprel=nsubj (handling csubj is beyond the scope of this question).
3. I don't speak Spanish, but I think the rule is not as simple as moving the verb before the subject (or moving the subject after the verb). At the very least, we need to distinguish transitive verbs (with objects) from copula constructions. Of course, even the solution below is not perfect; it is rather an example of how to use Udapi.
4. We should handle the nasty details like capitalization and spacing.
def process_tree(tree):
    # Find the main predicate and its subject
    main_predicate = tree.children[0]
    nsubj = next((n for n in main_predicate.children if n.udeprel == 'nsubj'), None)
    if not nsubj:
        return
    # Move the subject
    # - after the auxiliary copula verb if present
    # - or after the last object if present
    # - or after the main predicate (verb)
    cop = next((n for n in main_predicate.children if n.udeprel == 'cop'), None)
    if cop:
        nsubj.shift_after_subtree(cop)
    else:
        objects = [n for n in main_predicate.children if n.udeprel in ('obj', 'iobj')]
        if objects:
            nsubj.shift_after_subtree(objects[-1])
        else:
            nsubj.shift_after_node(main_predicate)
    # Fix the capitalization
    nsubj_start = nsubj.descendants(add_self=True)[0]
    if nsubj_start.lemma[0].islower() and nsubj_start.form[0].isupper():
        nsubj_start.form = nsubj_start.form.lower()
    tree.descendants[0].form = tree.descendants[0].form.capitalize()
    # Add a question mark (instead of fullstop)
    dots = [n for n in main_predicate.children if n.form == '.']
    if not dots:
        dots = [main_predicate.create_child(upos="PUNCT", deprel="punct")]
    dots[-1].form = '?'
    # Fix spacing
    dots[-1].prev_node.misc["SpaceAfter"] = "No"
    nsubj_start.prev_node.misc["SpaceAfter"] = ""
    # Recompute the string representation of the sentence
    tree.text = tree.compute_text()
The solution above uses Udapi as a library. An alternative would be to move the main code into a Udapi block called e.g. MakeQuestions:
from udapi.core.block import Block
class MakeQuestions(Block):
    def process_tree(self, tree):
        # the rest is same as in the solution above
If we store this block in the current directory in file makequestions.py, we can call it from the command line in many ways:
# parse the affirmative sentences
cat affirm.txt | udapy -s \
read.Sentences \
udpipe.Base model_alias=es_ancora \
> affirm.conllu
# visualize the output with TextModeTrees (-T)
cat affirm.conllu | udapy -T .MakeQuestions | less -R
# store the output in CoNLL-U
udapy -s .MakeQuestions < affirm.conllu > questions.conllu
# show just the plain-text sentences
udapy write.Sentences < questions.conllu > questions.txt
# visualize the differences in HTML
udapy -H \
read.Conllu files=affirm.conllu zone=affirm \
read.Conllu files=questions.conllu zone=questions \
util.MarkDiff gold_zone=affirm attributes=form ignore_parent=1 \
> differences.html

How do I remove the last character of a variable until there is only one left in python

I am working on a program that prints every possible combination of a word. Everything works, but I wanted to take it a step further so it doesn't only print all combinations of a word. It should remove the last character of a word until there is only one letter left.
Here is what I have written:
# Enter Word
print("Enter your word:")
print()
s = input(colorGreen)
# If s is a string, print all combinations
if all(s.isalpha() or s.isspace() for s in s):
    t=list(itertools.permutations(s,len(s)))
    for i in range(0,len(t)):
        print(colorGreen + "".join(t[i]))
    while len(s) != 1:
        t=list(itertools.permutations(s,len(s)))
        for i in range(0,len(t)):
            print(colorGreen + "".join(t[i]))
    print()
    print("Finished!")
    print()
    input("Press anything to quit...")
# If s is not a string, print error
if not all(s.isalpha() or s.isspace() for s in s):
    print(colorRed + "You did not enter a correct word")
    print("Only use a word for Word combinations")
    print("Please restart the program.")
    print()
    input("Press anything to quit...")
Thanks in advance :D
You can do it this way:
from itertools import permutations
string = 'hello'
c = reversed([''.join(s) for i in range(1,len(string)+1) for s in permutations(string,i)])
print('\n'.join(c))
Output:
olleh
ollhe
olelh
olehl
olhle
olhel
olleh
ollhe
olelh
olehl
olhle
olhel
oellh
oelhl
oellh
oelhl
oehll
oehll
ohlle
ohlel
ohlle
ohlel
ohell
ohell
loleh
lolhe
loelh
loehl
lohle
lohel
lloeh
llohe
lleoh
lleho
llhoe
llheo
leolh
leohl
leloh
lelho
lehol
lehlo
lhole
lhoel
lhloe
lhleo
lheol
lhelo
loleh
lolhe
loelh
loehl
lohle
lohel
lloeh
llohe
lleoh
lleho
llhoe
llheo
leolh
leohl
leloh
lelho
lehol
lehlo
lhole
lhoel
lhloe
lhleo
lheol
lhelo
eollh
eolhl
eollh
eolhl
eohll
eohll
elolh
elohl
elloh
ellho
elhol
elhlo
elolh
elohl
elloh
ellho
elhol
elhlo
eholl
eholl
ehlol
ehllo
ehlol
ehllo
holle
holel
holle
holel
hoell
hoell
hlole
hloel
hlloe
hlleo
hleol
hlelo
hlole
hloel
hlloe
hlleo
hleol
hlelo
heoll
heoll
helol
hello
helol
hello
olle
ollh
olel
oleh
olhl
olhe
olle
ollh
olel
oleh
olhl
olhe
oell
oelh
oell
oelh
oehl
oehl
ohll
ohle
ohll
ohle
ohel
ohel
lole
lolh
loel
loeh
lohl
lohe
lloe
lloh
lleo
lleh
llho
llhe
leol
leoh
lelo
lelh
leho
lehl
lhol
lhoe
lhlo
lhle
lheo
lhel
lole
lolh
loel
loeh
lohl
lohe
lloe
lloh
lleo
lleh
llho
llhe
leol
leoh
lelo
lelh
leho
lehl
lhol
lhoe
lhlo
lhle
lheo
lhel
eoll
eolh
eoll
eolh
eohl
eohl
elol
eloh
ello
ellh
elho
elhl
elol
eloh
ello
ellh
elho
elhl
ehol
ehol
ehlo
ehll
ehlo
ehll
holl
hole
holl
hole
hoel
hoel
hlol
hloe
hllo
hlle
hleo
hlel
hlol
hloe
hllo
hlle
hleo
hlel
heol
heol
helo
hell
helo
hell
oll
ole
olh
oll
ole
olh
oel
oel
oeh
ohl
ohl
ohe
lol
loe
loh
llo
lle
llh
leo
lel
leh
lho
lhl
lhe
lol
loe
loh
llo
lle
llh
leo
lel
leh
lho
lhl
lhe
eol
eol
eoh
elo
ell
elh
elo
ell
elh
eho
ehl
ehl
hol
hol
hoe
hlo
hll
hle
hlo
hll
hle
heo
hel
hel
ol
ol
oe
oh
lo
ll
le
lh
lo
ll
le
lh
eo
el
el
eh
ho
hl
hl
he
o
l
l
e
h
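The answer above permutes every selection of letters from the whole word. If the goal is literally to drop the last character each round, as the question describes ("hello", then "hell", then "hel", and so on), a minimal sketch could look like this. The generator name shrinking_permutations is an assumption made for illustration, and a short word is used to keep the output small.

```python
from itertools import permutations

def shrinking_permutations(word):
    """Yield all permutations of word, then of word[:-1], down to one letter."""
    while word:
        for perm in permutations(word):
            yield "".join(perm)
        word = word[:-1]  # drop the last character before the next round

for line in shrinking_permutations("abc"):
    print(line)
```

Repeated letters (as in "hello") produce repeated lines here too; wrapping each round in set() would remove them, at the cost of losing the order.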

python extracting string from data

I have the following data and I need to extract the first string on each line. It is separated from the rest of the data by \t. I have tried split() and a regex, but the problem is that it takes more than 1 second to do this for each line. Is there any way it could be done faster?
Data:
DT 0.00155095460731831934 0.00121897344629313064 0.00000391325536877105 0.09743272975663436197 0.00002271067721789807 0.00614528909266214615 0.00000445295550745487 0.70422975214810612510 0.00000042521183266708 0.00080380970031485965 0.00046229528280753270 0.00019894095277762626 0.00041012830368947716 0.00013156663380611624 0.00000001065986007929 0.00004244196517011733 0.00061444160944146384 0.02101761386512242258 0.00010328516871273944 0.00001128873771536226 0.00279163054567377073 0.00018903663417650421 0.00006490063677390687 0.00002151218889856898 0.00032824534915777535 0.00040349658620449016 0.00042393411014689220 0.00053643791028589382 0.00001032961180051124 0.00025743865541833909 0.00011497457801324625 0.00005359814320647386 0.00010336445810407512 0.00040942464084107332 0.00009098970100047888 0.00000091369931486168 0.00059479547081431436 0.00000009853464391239 0.00020303484015768289 0.00050594563648307127 0.15679657927655424321 0.00034115929559768240 0.00115490132012489345 0.00019823414624750937
PRP 0.00000131203717608417 0.99998368311809904263 0.00000002192874737415 0.00000073240710142655 0.00000000536610432900 0.00000195554704853124 0.00000000012203475361 0.00000017206852489982 0.00000040268728691384 0.00000034167449501884 0.00000077203219019333 0.00000003082351874675 0.00000052849070550174 0.00000319144710228690 0.00000000009512989203 0.00000002016363199180 0.00000005598551431381 0.00000129166108708107 0.00000004127954869435 0.00000099983230311242 0.00000032415702089502 0.00000010477525952469 0.00000000011045642123 0.00000006942075882668 0.00000017433924380308 0.00000028874823360049 0.00000048656924101513 0.00000017722073116061 0.00000037193481161874 0.00000000452174124394 0.00000081986547018432 0.00000001740977711224 0.00000000808377988046 0.00000001418892143074 0.00000045250939471023 0.00000000000050232556 0.00000043504206149021 0.00000011310292804313 0.00000000013241046549 0.00000015302998639348 0.00000002800056509608 0.00000038361859715043 0.00000000099713364069 0.00000001345362455494
VBD 0.00000002905639670475 0.00000000730896486886 0.00000000406530491040 0.00000009048972500851 0.00000000380338117015 0.00000000000390031394 0.00000000169948197867 0.00000000091890304843 0.00000000013856552537 0.00000191013917141413 0.00000002300239228881 0.00000003601993413087 0.00000004266629173115 0.00000000166497478879 0.00000000000079281873 0.00000180895378547175 0.00000000000159251758 0.00000000081310874277 0.00000000334322892919 0.99999591744268101490 0.00000000000454647012 0.00000000060884665646 0.00000000000010515727 0.00000000019245471748 0.00000000308524019147 0.00000001376847404364 0.00000001449670334202 0.00000001434634011983 0.00000000656887521298 0.00000000796791556475 0.00000000578334901413 0.00000000142124935798 0.00000000213053365838 0.00000000487780229311 0.00000001702409705978 0.00000000391793832836 0.00000001292779157438 0.00000000002447935587 0.00000000000435117453 0.00000000408872313468 0.00000000007201124397 0.00000000431736839121 0.00000000002970930698 0.00000000080852330796
RB 0.00000015663242474016 0.00000002464350694082 0.00000000095443410385 0.99998778106321006831 0.00000000021007124986 0.00000006156902517681 0.00000000277279124155 0.00000000301727284928 0.00000000030682776953 0.00000007379165980724 0.00000012399749754355 0.00000494600825959811 0.00000008488215978963 0.00000000897527112360 0.00000000000009257081 0.00000000223574222125 0.00000000371653801739 0.00000548300954899374 0.00000001802212638276 0.00000000022437343140 0.00000001084514551630 0.00000000328207000562 0.00000000672649111321 0.00000003640165688536 0.00000050812474700731 0.00000007422081603379 0.00000018000760320187 0.00000007733588104368 0.00000008890139839523 0.00000001494850369145 0.00000003233439691280 0.00000000299507821025 0.00000000501198681017 0.00000000271863832841 0.00000004782796496077 0.00000000000160157399 0.00000006968900381578 0.00000000003199719817 0.00000001234122837743 0.00000002204081342858 0.00000000038818632144 0.00000002327335651712 0.00000000016015202564 0.00000000435845392228
VBN 0.00222925562857408935 0.00055631931823257885 0.00000032474066230587 0.00333293927262896372 0.12594759350192680225 0.00142014631420757115 0.00008260266473343272 0.00001658664201138300 0.00000444848747905589 0.00025881226046863004 0.00176478222683846956 0.00226268536384150636 0.00120807701719786715 0.00016158429451364274 0.00000000200391980114 0.00012971908549403702 0.41488930515218963579 0.41237674095727266943 0.00025649814681915863 0.00001340291420511781 0.00067983726358035045 0.00001718712609473795 0.00009573412529081616 0.02342065200703593100 0.00010281749829896253 0.00243912549478067552 0.00111221146411718771 0.00110067534479759994 0.00048702441892562549 0.00014537544850052323 0.00046019613393571187 0.00004100416046505168 0.00001820421200359182 0.00013212194667244404 0.00112515351673182361 0.00000022002597310723 0.00099184191436586821 0.00000187809735682276 0.00000214888688830288 0.00031369371619907773 0.00000552482376141306 0.00033123576486582436 0.00000227934800338172 0.00006203126813779618
So,the bottom line is I need to extract DT, PRP, VBD... from the above text really fast.
You can just call split with the maxsplit argument and wrap it in a list comprehension.
result = [line.split('\t', 1)[0] for line in data]
As you see, passing 1 in the method call makes it stop after the first splitting takes place. I bet this is the fastest solution in Python.
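As a small self-contained check of the maxsplit approach, the sketch below runs it on three shortened, made-up lines in the same tab-separated shape as the data. It also shows str.partition, which is a different built-in brought in here only for comparison, not something the answer above uses; no claim is made about which of the two is faster.

```python
# Shortened, hypothetical lines: a tag, a tab, then tab-separated numbers.
lines = [
    "DT\t0.0015\t0.0012",
    "PRP\t0.0000013\t0.99998",
    "VBD\t0.00000003\t0.000000007",
]

# Stop splitting after the first tab and keep the part before it.
tags_split = [line.split('\t', 1)[0] for line in lines]

# partition() cuts at the first tab and returns (head, separator, tail).
tags_partition = [line.partition('\t')[0] for line in lines]

print(tags_split)
print(tags_partition)
```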
A manual alternative.
from itertools import takewhile
def my_split(line):
    # Keep characters up to (not including) the first tab.
    return ''.join(takewhile(lambda char: char != '\t', line))
result = [my_split(line) for line in lines]
Provided your data are in a file:
with open(file) as data:
    result = [my_split(line) for line in data]
This will be a lot slower than the first one.
You can use split in a list comprehension :
>>> s="""DT 0.00155095460731831934 0.00121897344629313064 0.00000391325536877105 0.09743272975663436197 0.00002271067721789807 0.00614528909266214615 0.00000445295550745487 0.70422975214810612510 0.00000042521183266708 0.00080380970031485965 0.00046229528280753270 0.00019894095277762626 0.00041012830368947716 0.00013156663380611624 0.00000001065986007929 0.00004244196517011733 0.00061444160944146384 0.02101761386512242258 0.00010328516871273944 0.00001128873771536226 0.00279163054567377073 0.00018903663417650421 0.00006490063677390687 0.00002151218889856898 0.00032824534915777535 0.00040349658620449016 0.00042393411014689220 0.00053643791028589382 0.00001032961180051124 0.00025743865541833909 0.00011497457801324625 0.00005359814320647386 0.00010336445810407512 0.00040942464084107332 0.00009098970100047888 0.00000091369931486168 0.00059479547081431436 0.00000009853464391239 0.00020303484015768289 0.00050594563648307127 0.15679657927655424321 0.00034115929559768240 0.00115490132012489345 0.00019823414624750937
... PRP 0.00000131203717608417 0.99998368311809904263 0.00000002192874737415 0.00000073240710142655 0.00000000536610432900 0.00000195554704853124 0.00000000012203475361 0.00000017206852489982 0.00000040268728691384 0.00000034167449501884 0.00000077203219019333 0.00000003082351874675 0.00000052849070550174 0.00000319144710228690 0.00000000009512989203 0.00000002016363199180 0.00000005598551431381 0.00000129166108708107 0.00000004127954869435 0.00000099983230311242 0.00000032415702089502 0.00000010477525952469 0.00000000011045642123 0.00000006942075882668 0.00000017433924380308 0.00000028874823360049 0.00000048656924101513 0.00000017722073116061 0.00000037193481161874 0.00000000452174124394 0.00000081986547018432 0.00000001740977711224 0.00000000808377988046 0.00000001418892143074 0.00000045250939471023 0.00000000000050232556 0.00000043504206149021 0.00000011310292804313 0.00000000013241046549 0.00000015302998639348 0.00000002800056509608 0.00000038361859715043 0.00000000099713364069 0.00000001345362455494
... VBD 0.00000002905639670475 0.00000000730896486886 0.00000000406530491040 0.00000009048972500851 0.00000000380338117015 0.00000000000390031394 0.00000000169948197867 0.00000000091890304843 0.00000000013856552537 0.00000191013917141413 0.00000002300239228881 0.00000003601993413087 0.00000004266629173115 0.00000000166497478879 0.00000000000079281873 0.00000180895378547175 0.00000000000159251758 0.00000000081310874277 0.00000000334322892919 0.99999591744268101490 0.00000000000454647012 0.00000000060884665646 0.00000000000010515727 0.00000000019245471748 0.00000000308524019147 0.00000001376847404364 0.00000001449670334202 0.00000001434634011983 0.00000000656887521298 0.00000000796791556475 0.00000000578334901413 0.00000000142124935798 0.00000000213053365838 0.00000000487780229311 0.00000001702409705978 0.00000000391793832836 0.00000001292779157438 0.00000000002447935587 0.00000000000435117453 0.00000000408872313468 0.00000000007201124397 0.00000000431736839121 0.00000000002970930698 0.00000000080852330796
... RB 0.00000015663242474016 0.00000002464350694082 0.00000000095443410385 0.99998778106321006831 0.00000000021007124986 0.00000006156902517681 0.00000000277279124155 0.00000000301727284928 0.00000000030682776953 0.00000007379165980724 0.00000012399749754355 0.00000494600825959811 0.00000008488215978963 0.00000000897527112360 0.00000000000009257081 0.00000000223574222125 0.00000000371653801739 0.00000548300954899374 0.00000001802212638276 0.00000000022437343140 0.00000001084514551630 0.00000000328207000562 0.00000000672649111321 0.00000003640165688536 0.00000050812474700731 0.00000007422081603379 0.00000018000760320187 0.00000007733588104368 0.00000008890139839523 0.00000001494850369145 0.00000003233439691280 0.00000000299507821025 0.00000000501198681017 0.00000000271863832841 0.00000004782796496077 0.00000000000160157399 0.00000006968900381578 0.00000000003199719817 0.00000001234122837743 0.00000002204081342858 0.00000000038818632144 0.00000002327335651712 0.00000000016015202564 0.00000000435845392228
... VBN 0.00222925562857408935 0.00055631931823257885 0.00000032474066230587 0.00333293927262896372 0.12594759350192680225 0.00142014631420757115 0.00008260266473343272 0.00001658664201138300 0.00000444848747905589 0.00025881226046863004 0.00176478222683846956 0.00226268536384150636 0.00120807701719786715 0.00016158429451364274 0.00000000200391980114 0.00012971908549403702 0.41488930515218963579 0.41237674095727266943 0.00025649814681915863 0.00001340291420511781 0.00067983726358035045 0.00001718712609473795 0.00009573412529081616 0.02342065200703593100 0.00010281749829896253 0.00243912549478067552 0.00111221146411718771 0.00110067534479759994 0.00048702441892562549 0.00014537544850052323 0.00046019613393571187 0.00004100416046505168 0.00001820421200359182 0.00013212194667244404 0.00112515351673182361 0.00000022002597310723 0.00099184191436586821 0.00000187809735682276 0.00000214888688830288 0.00031369371619907773 0.00000552482376141306 0.00033123576486582436 0.00000227934800338172 0.00006203126813779618"""
>>> [i.split()[0] for i in s.split('\n')]
['DT', 'PRP', 'VBD', 'RB', 'VBN']
import re
p = re.compile(r'^\S+', re.MULTILINE)
re.findall(p, test_str)
You can simply do this to get a list of strings you want.

Why does my code print this when I read and write a file?

def sss(request):
    handle=open('b.txt','r+')
    handle.write("I AM NEW FILE")
    var=handle.read();
    return HttpResponse(var)
urlpatterns = patterns('',
    ('^$',sss),
)
1. My b.txt is empty.
2. When I run my code, it prints this:
I AM NEW FILE7 鸸?; ??x 鸸鸸v1鸸pZ€0 鸸鸸燛?鸸8N鸸鸸p 坮 愵) 犭 ?`16鸸鸸 S6鸸鸸榑 鸸? 鸸# 鸸鸸p叠 {鸸€1鸸鸸 V 鸸鸸 #+ 爏 鸸 职 鑮 鸸鸸鸸`埤 >?) ?鸸鸸#? Z!x`%鸸p?鸸? 鸸鸸鄧鸸鸸#?`7鸸鸸鸸`? 柜 鸸鸸鑎1X 鸸鸸鸸鸸鸸?#鸸餷?鸸€0鸸(Q?鸸H?鸸P?#鸸 ' 鸸(5 ?, 7鸸啵6H宏 0??+噌? k%8除 `烋 鸸爐"繳` 鸸埻 鸸0?郤 鸸鸸鸸?爛/啊 鸸鸸鸸睾8S1`?`?鸸鸸悀0鸸 ?`??鸸繧爅 鸸餡 鸸些 鸸鸸鸸鸸鸸#]鄡HE,鸸鸸?瘅+?+鸸鸸鸸p戙 #O鸸?? 鸸鸸 37€P6蠯7鸸#= 鸸嘣 囗 ?+xP?x?如?70暡 鸸鸸鸸鸸鸸鸸 €鸸鸸鸸€ h *??x 纙1鸸鸸鸸€K 叠 鸸鹞8? ?鸸 鸸萰 鸸`?辣 #?饆 鸸鸸鸸鸸? 鸸€?鸸鸸鸸鸸鸸鄧鸸8(鸸P⒊ ?鸸? p(0B?鸸鸸嗨鸸鸸鸸鸸李 鸸鸸鸸邪 P?鸸鸸鸫 爛/爦+鸸蜣 9 鸸 楈 ?鸸鸸怱1鸸鸸恏鸸鸸鸸鸸袖 ; 鸸€?鸸€札 `?(?鸸ㄈ 鸸鸸+ 鸸栉0鸸愵 鸸鸸恾谿6 ?1谹,鸸鸸鸸 {0鸸鸸? X?鸸€D 鸸&?€?` 鸸H{ ?鸸葉Xw鸸鸸鸸皢 鸸狑 鸸鄩0缊0堩)€Q 鸸? ?鸸 ④ #?鸸鸸鸸鸸鸸 ?XA6鸸鸸? O 鸸0 h 鸸 鸸鸸李 鸸 ? j鸸鸸鸸鸸0昌 57極7#?H+ 鸸鸩 尛 `?鸸 18戙 鸸P ?噍6嗤0鸸鸸鸸楧6鸸坆 鸸a 鸸` 鸸鸸鸸鸸鸸鸸鸸惍砾 pG8s鸸鸸鸸# ?  (, 蠵 ( 鄭? 鸸╒&鸸缞鸸鐽圡7鸸繮!0[ 0m 鸸鸸鸸鸸#?発0鸸鸸鸸鸸鸸? ?鸸饗 p?pZ爦+鸸#?€\1鸸犎 0如 ?艾 鸸棱? 鸸€;鸸? 鸸鸸`? 褶 ? 鸸鸸鸸给*`7鸸#嵀 6 R 恈鸸鸸鸸鸸p?鸸饇鸸埪00^#燽 鸸鸸8褶 h €,h ? 鸸鸸x+ 鸸鸸€37鸸鸸鸸鸸`+鸸P?鸸 1 杞 鸸鸸鸸鸸惥*鸸郔6鸸李 鸸鸸h: 鸸鸸83 ? 哀犎鸸鸸0s 鸸鸸鸸鸸? 蝎p篆 鸸鸸鸸鸸鸸纞" s找( ??x Q s l??x ndies".
* If value is 1, cand{{ value|pluralize:"y,ies" }} displays "1 candy".
* If value is 2, cand{{ value|pluralize:"y,ies" }} displays "2 candies".
u ,i u i ( RE RG R5 R3 R4 ( R R< R t singular_suffixt
plural_suffix( ( s? D:\Python25\lib\site-packages\django\template\defaultfilters.pyt pluralize4 s$
c C s d d k l } | | ? S( sD Takes a phone number and converts it in to its numerical equivalent.i( t
phone2numeric( Rc R ( R R
why?
thanks
The only way I can repro this is to open an existing non-empty file using 'r+' (Are you absolutely sure it's empty?). In any event, opening the file in the 'w+' mode truncates it.
What middleware are you using? I guess that you have a lot of middleware installed, which might explain some of the garbage.
For debugging, use a logging module to log what var was. Otherwise you can't isolate the problem, right?
Also, should you convert the string to unicode before sending it off to HttpResponse?
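One more thing worth checking, offered as an assumption rather than a confirmed diagnosis: the view reads from the handle immediately after writing to it, without moving the file position in between. Rewinding with seek(0) before reading makes the read start at the beginning of what was just written. The sketch below leaves Django out entirely; the helper name write_then_read and the temporary file are illustrative only, and 'w+' is used because, as noted above, it truncates the file.

```python
import os
import tempfile

def write_then_read(path, text):
    """Write text to path, then read it back through the same handle."""
    with open(path, 'w+') as handle:  # 'w+' creates or truncates the file
        handle.write(text)
        handle.seek(0)                # rewind before switching from writing to reading
        return handle.read()

# Use a throwaway directory so no real file is touched.
demo_path = os.path.join(tempfile.mkdtemp(), 'b.txt')
print(write_then_read(demo_path, "I AM NEW FILE"))
```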
