I am looking for an efficient way to strip numbers, dates, or any other characters from the end of a string until the first letter is found.
string - '12.abd23yahoo 04/44 231'
Output - '12.abd23yahoo'
line_inp = "12.abd23yahoo 04/44 231"
line_out = line_inp.rstrip('0123456789./')
This rstrip() call doesn't work as expected: I get '12.abd23yahoo 04/44 ' instead.
I am also trying the code below, and it doesn't seem to be working either.
for fname in filenames:
    with open(fname) as infile:
        for line in infile:
            outfile.write(line.rstrip('0123456789./ '))
You need to strip spaces too:
line_out = line_inp.rstrip('0123456789./ ')
Demo:
>>> line_inp = "12.abd23yahoo 04/44 231"
>>> line_inp.rstrip('0123456789./ ')
'12.abd23yahoo'
You need to strip the newline as well and add it back before you write:
for fname in filenames:
    with open(fname) as infile:
        outfile.writelines(line.rstrip('0123456789./ \n') + "\n"
                           for line in infile)
If the format is always the same you can just split:
with open(fname) as infile:
    outfile.writelines(line.split(None, 1)[0] + "\n"
                       for line in infile)
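The two file-based variants above can be compared side by side without touching the disk. This is only a sketch: io.StringIO stands in for the real files, the second sample line is invented, and the helper names strip_tail and first_field are made up for the example.

```python
import io

def strip_tail(lines):
    # Drop trailing digits, dots, slashes, spaces and the newline,
    # then put a single newline back.
    return [line.rstrip('0123456789./ \n') + "\n" for line in lines]

def first_field(lines):
    # Keep only the text before the first run of whitespace.
    # Assumes every line has at least one non-blank field.
    return [line.split(None, 1)[0] + "\n" for line in lines]

# StringIO objects stand in for the input and output files (assumption).
infile = io.StringIO("12.abd23yahoo 04/44 231\nfoo.bar7baz 01/02 5\n")
outfile = io.StringIO()
outfile.writelines(strip_tail(infile))
print(outfile.getvalue())
```

Note that the split variant would raise an IndexError on a blank line, so it is only safe when the format really is always the same.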
Here's a solution using a regular expression:
import re
line_inp = "12.abd23yahoo 04/44 231"
r = re.compile('^(.*[a-zA-Z])')
m = re.match(r, line_inp)
line_out = m.group(0) # 12.abd23yahoo
The regular expression matches a group of arbitrary characters, anchored at the start of the string, that ends in a letter.
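The pattern keeps everything up to the last letter (not the first) because .* is greedy. A small sketch of that behaviour, wrapped in a hypothetical helper called strip_to_last_letter; returning an empty string when there is no letter at all is an assumption made here, not something the answer specifies.

```python
import re

# Greedy ".*" grabs as much as it can, so the group ends at the LAST letter.
LAST_LETTER = re.compile(r'^(.*[a-zA-Z])')

def strip_to_last_letter(text):
    m = LAST_LETTER.match(text)
    # No letter anywhere: nothing to keep (assumption for this sketch).
    return m.group(1) if m else ''

print(strip_to_last_letter("12.abd23yahoo 04/44 231"))
print(repr(strip_to_last_letter("04/44 231")))
```

Checking for a failed match matters in practice: calling m.group() on None raises an AttributeError.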
Hello friends, I hope you are all well!
I need help using regex in Python.
I need to normalize some numbers so that they all follow the same pattern. Let me explain:
Numbers (example):
0206071240004013000
04073240304015000
0001304-45.2034.4.01.2326
I need the script to read the numbers and change them so that they all have the following pattern:
20 numeric characters (numbers that are shorter must be padded with "0" on the left):
04073240304015000 = 00004073240304015000
Insert "-" and "." like this:
0000407-32.4030.4.01.5000
I was writing the code as follows: first I remove the non-numeric characters, then I check whether the number has 20 digits and pad it if it doesn't. Now I need to insert the punctuation, but I'm having difficulties.
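import re
with open("num.txt", "r") as arquivo:
    leitura = arquivo.readlines()
dados = leitura
for num in dados:
    # Remove every non-digit character (this also drops the newline)
    non_numeric = re.sub("[^0-9]", "", num)
    # Left-pad with zeros to 20 characters
    characters = f'{non_numeric:0>20}'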
Try:
import re
txt = '''\
0206071240004013000
04073240304015000
0001304-45.2034.4.01.2326'''
for n in re.findall(r'[\d.-]+', txt):
    n = '{:0>20}'.format(n.replace('.', '').replace('-', ''))
    print('{}-{}.{}.{}.{}.{}'.format(n[:7], n[7:9], n[9:13], n[13:14], n[14:16], n[16:]))
Prints:
0020607-12.4000.4.01.3000
0000407-32.4030.4.01.5000
0001304-45.2034.4.01.2326
EDIT: To read the text from file and write to a new file you can do:
import re
with open('in.txt', 'r') as f_in, open('out.txt', 'w') as f_out:
    for n in re.findall(r'[\d\.\-]+', f_in.read()):
        n = '{:0>20}'.format(n.replace('.', '').replace('-', ''))
        print('{}-{}.{}.{}.{}.{}'.format(n[:7], n[7:9], n[9:13], n[13:14], n[14:16], n[16:]), file=f_out)
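The same padding and slicing can also be packaged as a reusable function, together with a check that a value already has the target shape. This is just a sketch: the names normalize and is_normalized are hypothetical, and it assumes no input has more than 20 digits.

```python
import re

def normalize(number):
    # Keep digits only, then left-pad with zeros to 20 characters.
    digits = re.sub(r'\D', '', number).zfill(20)
    # Slice into the 7-2-4-1-2-4 groups of the target pattern.
    return '{}-{}.{}.{}.{}.{}'.format(
        digits[:7], digits[7:9], digits[9:13],
        digits[13:14], digits[14:16], digits[16:])

def is_normalized(number):
    # True if the string already looks like NNNNNNN-NN.NNNN.N.NN.NNNN.
    pattern = r'\d{7}-\d{2}\.\d{4}\.\d\.\d{2}\.\d{4}'
    return re.fullmatch(pattern, number) is not None

print(normalize('04073240304015000'))
print(is_normalized(normalize('0206071240004013000')))
```

Stripping everything that is not a digit first means the function is safe to run on values that are already formatted.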
I have a huge file containing lines such as DDD-1126N|refseq:NP_285726|uniprotkb:P00112 and DDD-1081N|uniprotkb:P12121. I want to grab the number after uniprotkb.
Here's my code:
x = 'uniprotkb:P'
f = open('m.txt')
for line in f:
    print(line.find(x))
    print(line[36:31 + len(x)])
The problem is that line.find(x) returns 10 for some lines and 26 for others, and I only grab the complete number when it is 26. I'm new to programming, so I'm looking for a way to grab the complete number after the word, something like:
x = 'uniprotkb:'
f = open('m.txt')
for line in f:
    if x in line:
        pass  # pseudocode: print the number after x
Use regular expressions:
import re
for line in open('m.txt'):
    # Raw string so that \d reaches the regex engine unchanged
    match = re.search(r'uniprotkb:P(\d+)', line)
    if match:
        print(match.group(1))
import re
regex = re.compile(r'uniprotkb:P([0-9]*)')
# `string` holds the text to search, e.g. the contents of m.txt
print(regex.findall(string))
The re module is quite unnecessary here if x is static and always matches a substring at the end of each line (like "DDD-1126N|refseq:NP_285726|uniprotkb:P00112"):
x = 'uniprotkb:'
f = open('m.txt')
for line in f:
    if x in line:
        # Everything after x, without the trailing newline
        print(line[line.find(x)+len(x):].rstrip('\n'))
Edit:
To answer your comment: if they are separated by the pipe character (|), then you could do this:
sep = "|"
x = 'uniprotkb:'
f = open('m.txt')
for line in f:
    if x in line:
        # Keep only the fields that start with x, and drop the prefix
        matches = [l[len(x):] for l in line.strip().split(sep) if l.startswith(x)]
        print(matches)
If m.txt has the following line:
DDD-1126N|uniprotkb:285726|uniprotkb:P00112
Then the above will output:
['285726', 'P00112']
Replace sep = "|" with whatever the column separator would be.
Um, for one thing I'd suggest you use the csv module to read a TSV file.
But generally, you can use a regular expression:
import re
regex = re.compile(r"(?<=\buniprotkb:)\w+")
with open('m.txt') as f:
    for line in f:
        match = regex.search(line)
        if match:
            print(match.group())
The regular expression matches a string of alphanumeric characters if it's preceded by uniprotkb:.
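The csv suggestion could look roughly like this. It assumes the fields are pipe-separated, as in the sample lines, so delimiter='|' is passed to csv.reader; io.StringIO stands in for m.txt, and the function name uniprot_ids is made up for the sketch.

```python
import csv
import io

PREFIX = 'uniprotkb:'

def uniprot_ids(handle):
    # Yield the text after "uniprotkb:" for every field that starts with it.
    for row in csv.reader(handle, delimiter='|'):
        for field in row:
            if field.startswith(PREFIX):
                yield field[len(PREFIX):]

# In-memory stand-in for the real file (assumption).
sample = io.StringIO(
    "DDD-1126N|refseq:NP_285726|uniprotkb:P00112\n"
    "DDD-1081N|uniprotkb:P12121\n"
)
print(list(uniprot_ids(sample)))
```

For a tab-separated file the only change would be delimiter='\t'.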
import string
remove = dict.fromkeys(map(ord, '\n ' + string.punctuation))
with open('data10.txt', 'r') as f:
    for line in f:
        for word in line.split():
            w = f.read().translate(remove)
            print(word.lower())
I have this code, and for some reason translate(remove) is leaving a good amount of punctuation in the parsed output.
Why are you reading the whole file within the for loop?
Try this:
import string
remove = dict.fromkeys(map(ord, '\n ' + string.punctuation))
with open('data10.txt', 'r') as f:
    for line in f:
        for word in line.split():
            word = word.translate(remove)
            print(word.lower())
This will print out the lowercased and stripped words, one per line. I'm not really sure if that's what you want.
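To see what the table is doing: dict.fromkeys(map(ord, chars)) maps each code point to None, and str.translate deletes every character mapped to None. A minimal sketch using an inline sample string instead of data10.txt; the helper name clean_words is hypothetical.

```python
import string

# Map every ASCII punctuation character, space and newline to None (delete).
remove = dict.fromkeys(map(ord, '\n ' + string.punctuation))

def clean_words(text):
    # Split on whitespace first, then strip punctuation from each word.
    words = (word.translate(remove).lower() for word in text.split())
    # Drop tokens that consisted only of punctuation and are now empty.
    return [w for w in words if w]

print(clean_words('Hello, World! "Quoted" -- text.\n'))
```

One thing to keep in mind is that string.punctuation only covers ASCII, so typographic quotes or dashes in the file would still come through.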
Given that the infile contains:
aaaaaaa"pic01.jpg"bbbwrtwbbbsize 110KB
aawerwefrewqa"pic02.jpg"bbbertebbbsize 100KB
atyrtyruraa"pic03.jpg"bbbwtrwtbbbsize 190KB
How can I obtain the following outfile?
pic01.jpg 110KB
pic02.jpg 100KB
pic03.jpg 190KB
My code is:
with open('test.txt', 'r') as infile, open('outfile.txt', 'w') as outfile:
    for line in infile:
        lines_set1 = line.split('"')
        lines_set2 = line.split(' ')
        for item_set1 in lines_set1:
            for item_set2 in lines_set2:
                if item_set1.endswith('.jpg'):
                    if item_set2.endswith('KB'):
                        outfile.write(item_set1 + ' ' + item_set2 + '\n')
What is wrong with my code? Please help!
The problem has been solved here:
what is wrong in the code written in Python
Often you can solve string manipulation problems without regex, as Python has an amazing string library. In your case, just calling str.split twice with different delimiters (quote and space) solves your issue.
Demo
>>> st = """aaaaaaa"pic01.jpg"bbbwrtwbbbsize 110KB
aawerwefrewqa"pic02.jpg"bbbertebbbsize 100KB
atyrtyruraa"pic03.jpg"bbbwtrwtbbbsize 190KB"""
>>> def foo(st):
        # Split the string on the quotation mark
        _, fname, rest = st.split('"')
        # From the remaining part, split on whitespace
        # and select the last piece
        rest = rest.split()[-1]
        # Join and return fname and the remainder
        return ' '.join([fname, rest])
>>> for e in st.splitlines():
        print(foo(e))
pic01.jpg 110KB
pic02.jpg 100KB
pic03.jpg 190KB
Regex would be easier:
import re
with open('test.txt', 'r') as infile, open('outfile.txt', 'w') as outfile:
    for line in infile:
        # Group 1: the quoted file name; group 2: the size after the last space
        m = re.search(r'"([^"]+)".*? (\d+.B)', line)
        if m:
            outfile.write(m.group(1) + ' ' + m.group(2) + '\n')
You can use regex and str.rsplit here; your code seems to be overkill for this simple task:
>>> import re
>>> strs = 'aaaaaaa"pic01.jpg"bbbwrtwbbbsize 110KB\n'
>>> name = re.search(r'"(.*?)"', strs).group(1)
>>> size = strs.rsplit(None, 1)[-1]
>>> name, size
('pic01.jpg', '110KB')
or
>>> name, size = re.search(r'"(.*?)".*?(\w+)$', strs).groups()
>>> name, size
('pic01.jpg', '110KB')
Now use string formatting:
>>> "{} {}\n".format(name, size) #write this to file
'pic01.jpg 110KB\n'
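Putting the pieces together for all three sample lines might look like the sketch below. The function name name_and_size is made up, and returning None for a line without a quoted name is an assumption about how malformed lines should be handled.

```python
import re

def name_and_size(line):
    # File name: the text between the first pair of double quotes.
    name = re.search(r'"(.*?)"', line)
    # Size: the last whitespace-separated token on the line.
    parts = line.split()
    if name is None or not parts:
        return None  # line does not have the expected shape
    return '{} {}'.format(name.group(1), parts[-1])

sample = '''aaaaaaa"pic01.jpg"bbbwrtwbbbsize 110KB
aawerwefrewqa"pic02.jpg"bbbertebbbsize 100KB
atyrtyruraa"pic03.jpg"bbbwtrwtbbbsize 190KB
'''
results = [name_and_size(line) for line in sample.splitlines()]
print('\n'.join(results))
```

To write a file instead, the same function can be called for each line of infile, skipping any result that is None.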