I'm trying to print all of my reptile subspecies in my python program. I have a text file with a bunch of subspecies and their DNA sequence IDs. I just want to create a dictionary of subspecies (keys) and their respective DNA sequence IDs (values). But to do that I need to first learn how to separate the two.
So I want to print all of the subspecies names only, and to ignore the sequence IDs.
So far I have:

import re

file = open('repCleanSubs2.txt')
for line in file:
    if line.startswith('[a-zA-Z]'):
        print(line)
I believe the interpreter takes '[a-zA-Z]' as a string literal, rather than as a pattern matching any letter of the alphabet regardless of case, which is what I want.
Is there some syntax that I'm missing in my if statement?
Thanks!
startswith does not interpret regular expressions. Use the re module you have already imported to check whether a string matches:

if re.match('^[a-zA-Z]+', line) is not None:
    print(line)

starts with: ^ (redundant here, since re.match is already anchored at the start of the string)
one or more matching characters: +
http://www.fon.hum.uva.nl/praat/manual/Regular_expressions_1__Special_characters.html
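To see the difference concretely, a quick sketch using made-up strings:

```python
import re

# startswith compares plain strings, so the pattern is taken literally:
print("Alligator".startswith("[a-zA-Z]"))              # False

# re.match checks the pattern at the start of the string:
print(re.match("[a-zA-Z]+", "Alligator") is not None)  # True
print(re.match("[a-zA-Z]+", "12345") is not None)      # False
```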
import re

file = open('repCleanSubs2.txt')
for line in file:
    match = re.findall('^[a-zA-Z]+', line)
    if match:
        print(line, match)

The ^ sign means: match from the beginning of the line; [a-zA-Z] matches letters between a-z and A-Z.
+ means at least one character in [a-zA-Z] must be found.
re.findall returns a list of all the matches it could find in the string you supplied.
Try the following lines instead of the startswith:

if re.match("^[a-zA-Z]", line):
    print(line)
Try this, it's working for me:

import re

file = open('repCleanSubs2.txt')
for line in file:
    if re.match('[a-zA-Z]', line):
        print(line)
Without using re:

import string

with open('repCleanSubs2.txt') as c_file:
    for line in c_file:
        if any(line.startswith(c) for c in string.ascii_letters):
            print(line)
Try this:

file = open("abc.xyz")
file_content = file.read()
lines = file_content.splitlines()
output_data = []
for i in lines:
    if i and i[0].isalpha():
        output_data.append(i)
        print(i)
It can be done without regular expressions:

data = open('repCleanSubs2.txt').read().splitlines()  # read the file and extract its lines as a list
print([i for i in data if i and i[0].isalpha()])
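Since the end goal was a dictionary of subspecies to sequence IDs, here is a sketch of that step. The file layout is an assumption (subspecies lines start with a letter, and the ID lines that follow them do not), and build_subspecies_dict is a made-up name:

```python
def build_subspecies_dict(lines):
    """Map each subspecies name to the list of sequence IDs that follow it."""
    subspecies = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line[0].isalpha():        # a subspecies name
            current = line
            subspecies[current] = []
        elif current is not None:    # a sequence ID belonging to it
            subspecies[current].append(line)
    return subspecies

sample = ["Crotalus oreganus helleri", "123456", "789012", "Python regius", "345678"]
print(build_subspecies_dict(sample))
# {'Crotalus oreganus helleri': ['123456', '789012'], 'Python regius': ['345678']}
```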
Related
I have a csv file 'svclist.csv' which contains a single-column list as follows:
pf=/usr/sap/PL5/SYS/profile/PL5_D00_s4prd1
pf=/usr/sap/PL5/SYS/profile/PL5_ASCS01_s4prdascs
I need to strip each line of everything except the PL5 directory and the two digits in the last directory.
The output should look like this:
PL5,00
PL5,01
I started the code as follows:

clean_data = []
with open('svclist.csv', 'rt') as f:
    for line in f:
        if line.__contains__('profile'):
            print(line, end='')

and I'm stuck here.
Thanks in advance for the help.
You can use the regular expression (PL5)[^/].{0,}([0-9]{2,2}).
For an explanation, just copy the regex and paste it at https://regexr.com - it will explain how the regex works and lets you make the required changes.
import re

test_string_list = ['pf=/usr/sap/PL5/SYS/profile/PL5_D00_s4prd1',
                    'pf=/usr/sap/PL5/SYS/profile/PL5_ASCS01_s4prdascs']
regex = re.compile("(PL5)[^/].{0,}([0-9]{2,2})")
result = []
for test_string in test_string_list:
    matchArray = regex.findall(test_string)
    result.append(matchArray[0])
with open('outfile.txt', 'w') as f:
    for row in result:
        f.write(f'{str(row)[1:-1]}\n')
In the above code, I've created an empty list to hold the tuples. Then, when writing to the file, I need to remove the ( and ) at the start and end of each tuple; str(row)[1:-1] slices them off.
Then I use a formatted string to write the content into 'outfile.txt'.
You can use regex for this (in general, when trying to extract a pattern, this is a good option):

import re

pattern = re.compile(r"pf=/usr/sap/PL5/SYS/profile/PL5_.*(\d{2})")
with open('svclist.csv', 'rt') as f:
    for line in f:
        if 'profile' in line:
            last_two_numbers = pattern.findall(line)[0]
            print(f'PL5,{last_two_numbers}')
This code goes over each line, checks whether "profile" is in the line (this is the same as __contains__), then extracts the last two digits according to the pattern.
I made the assumption that the number is always between the two underscores. You could run something similar to this within your for-loop:

import string

test_str = "pf=/usr/sap/PL5/SYS/profile/PL5_D00_s4prd1"
test_list = test_str.split("_")  # split the string at the underscores
output = test_list[1].strip(string.ascii_letters)  # remove any letters, keeping the digits
try:
    int(output)  # check whether any non-digit characters are left
    print(f"PL5,{output}")
except ValueError:
    print(f"Something went wrong! Output is PL5,{output}")
I am trying to count the number of occurrences of a word in a file, using Python, but I have to ignore comments in the file.
I have a function like this:
def getWordCount(file_name, word):
    count = file_name.read().count(word)
    file_name.seek(0)
    return count

How do I ignore lines that begin with a #?
I know this can be done by reading the file line by line, as stated in this question. Is there a faster, more Pythonic way to do so?
You can use a regular expression to filter out comments:

import re

text = """ This line contains a word. # empty
This line contains two: word word # word
newline
# another word
"""
filtered = ''.join(re.split('#.*', text))
print(filtered)
# This line contains a word.
# This line contains two: word word
# newline
print(text.count('word'))      # 5
print(filtered.count('word'))  # 3

Just replace text with your file_name.read().
Another option: first create a copy of the file without the commented lines, then run your code on that. Ex.

infile = open('./file_with_comment.txt')
newopen = open('./newfile.txt', 'w')
for line in infile:
    li = line.strip()
    if not li.startswith("#"):
        newopen.write(line)
newopen.close()

This removes every line starting with #; then run your function on newfile.txt:

def getWordCount(file_name, word):
    count = file_name.read().count(word)
    file_name.seek(0)
    return count
More Pythonic would be this:

def getWordCount(file_name, word):
    with open(file_name) as wordFile:
        return sum(line.count(word)
                   for line in wordFile
                   if not line.startswith('#'))

Faster (which is independent of being Pythonic) could be to read the whole file into one string, then use regexes to find the words not in a line starting with a hash.
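That whole-file idea could look like this (a sketch; it counts the word only on lines that do not start with a hash):

```python
import re

def count_word_ignoring_comments(text, word):
    # (?m) makes ^/$ match at each line boundary; the lookahead (?!\s*#)
    # skips lines whose first non-space character is '#'
    kept = re.findall(r'(?m)^(?!\s*#).*$', text)
    return sum(line.count(word) for line in kept)

text = "word here\n# word in a comment\nword word\n"
print(count_word_ignoring_comments(text, "word"))  # 3
```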
I have thousands of values (as a list, but I might convert it to a dictionary or so if that helps) and want to compare them against files with millions of lines. What I want to do is filter the lines in those files down to only the ones starting with values from the list.
What is the fastest way to do it?
My slow code:

for line in source_file:
    # go through all IDs
    for id in my_ids:
        if line.startswith(str(id) + "|"):
            # replace commas with semicolons and pipes with commas
            target_file.write(line.replace(",", ";").replace("|", ","))
If you are sure the line starts with id + "|", and "|" cannot appear inside an id, you can play a trick with "|". For example:

my_id_strs = set(map(str, my_ids))  # a set gives O(1) membership tests; a bare map object is a one-shot iterator
for line in source_file:
    first_part = line.split("|")[0]
    if first_part in my_id_strs:
        target_file.write(line.replace(",", ";").replace("|", ","))

Hope this will help :)
Use str.translate to do the replacements. You can also break after matching an id.

trantab = str.maketrans(",|", ";,")
ids = ['%d|' % id for id in my_ids]
for line in source_file:
    # go through all IDs
    for id in ids:
        if line.startswith(id):
            # replace commas with semicolons and pipes with commas
            target_file.write(line.translate(trantab))
            break

or

# replace commas with semicolons and pipes with commas
trantab = str.maketrans(",|", ";,")
idset = set(str(id) for id in my_ids)
for line in source_file:
    try:
        if line[:line.index('|')] in idset:
            target_file.write(line.translate(trantab))
    except ValueError:
        pass
Use a regular expression. Here is an implementation:

import re

def filterlines(prefixes, lines):
    pattern = "|".join([re.escape(p) for p in prefixes])
    regex = re.compile(pattern)
    for line in lines:
        if regex.match(line):
            yield line

We build and compile a regular expression first (expensive, but done once only), but then the matching is very, very fast.
Test code for the above:

with open("/usr/share/dict/words") as words:
    prefixes = [line.strip() for line in words]

lines = [
    "zoo this should match",
    "000 this shouldn't match",
]
print(list(filterlines(prefixes, lines)))
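An alternative worth benchmarking: str.startswith also accepts a tuple of prefixes, which keeps the whole check in C code without any regex (filterlines_tuple is a made-up name):

```python
def filterlines_tuple(prefixes, lines):
    # startswith with a tuple argument tests all prefixes at once
    prefix_tuple = tuple(prefixes)
    return [line for line in lines if line.startswith(prefix_tuple)]

lines = ["zoo this should match", "000 this shouldn't match"]
print(filterlines_tuple(["zoo", "ape"], lines))  # ['zoo this should match']
```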
I am trying to take an input file with a list of DNS lookups in which the subdomain/domain separators are string lengths in parentheses rather than periods. It looks like this:
(8)subdomain(5)domain(3)com(0)
(8)subdomain(5)domain(3)com(0)
(8)subdomain(5)domain(3)com(0)
I would like to replace the parentheses and numbers with periods and then remove the first and last period. My code currently does this but leaves the last period. Any help is appreciated. Here is the code:
import re

file = open('test.txt', 'r')
writer = open('outfile.txt', 'w')
for line in file:
    newline1 = re.sub(r"\(\d+\)", ".", line)
    if newline1.startswith('.'):
        newline1 = newline1[1:-1]
    writer.write(newline1)
You can split the lines with a \(\d+\) regex and then join with ., stripping periods at both ends (strip the line first so the trailing newline doesn't shield the final period):

for line in file:
    res = ".".join(re.split(r'\(\d+\)', line.strip()))
    writer.write(res.strip('.') + '\n')

See IDEONE demo
Given that your re.sub call works like this:

>>> re.sub(r"\(\d+\)", ".", "(8)subdomain(5)domain(3)com(0)")
'.subdomain.domain.com.'

the only thing you need to do is strip the resulting string of any leading and trailing .:

>>> s = re.sub(r"\(\d+\)", ".", "(8)subdomain(5)domain(3)com(0)")
>>> s.strip(".")
'subdomain.domain.com'

Full drop-in solution (stripping the line first, so the trailing newline doesn't block the final period):

for line in file:
    newline1 = re.sub(r"\(\d+\)", ".", line.strip()).strip(".")
    writer.write(newline1 + "\n")
import re

def repl(matchobj):
    if matchobj.group(1):
        return "."
    else:
        return ""

x = "(8)subdomain(5)domain(3)com(0)"
print(re.sub(r"^\(\d+\)|((?<!^)\(\d+\))(?!$)|\(\d+\)$", repl, x))

Output: subdomain.domain.com
You can also do the substitution and then strip the periods (stripping whitespace first, so the newline doesn't block the trailing period):

import re

for line in file:
    line = re.sub(r'\(\d+\)', '.', line)
    line = line.strip().strip('.')
I have the following matched strings:
punctacros="Tasla"_TONTA
punctacros="Tasla"_SONTA
punctacros="Tasla"_JONTA
punctacros="Tasla"_BONTA
I want to replace only the part before the underscore in each matched string; the rest of it should remain the same as in the original string.
The result should look like this:
TROGA_TONTA
TROGA_SONTA
TROGA_JONTA
TROGA_BONTA
Edit:
This should work:

from re import sub

with open("/path/to/file") as myfile:
    lines = []
    for line in myfile:
        line = sub('punctacros="Tasla"(_.*)', r'TROGA\1', line)
        lines.append(line)

with open("/path/to/file", "w") as myfile:
    myfile.writelines(lines)
Result:
TROGA_TONTA
TROGA_SONTA
TROGA_JONTA
TROGA_BONTA
Note, however, that if your file is exactly like the sample given, you can replace the re.sub line with this:

line = "TROGA_" + line.split("_", 1)[1]

eliminating the need for regex altogether. I didn't do this, though, because you seem to want a regex solution.
mystring.replace('punctacros="Tasla"', 'TROGA')

where mystring is the string with those four lines. It will return the string with the values replaced.
If you want to replace everything before the first underscore, try this:

#! /usr/bin/python3
data = ['punctacros="Tasla"_TONTA',
        'punctacros="Tasla"_SONTA',
        'punctacros="Tasla"_JONTA',
        'punctacros="Tasla"_BONTA',
        'somethingelse!="Tucku"_CONTA']

for s in data:
    print('TROGA' + s[s.find('_'):])
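For completeness, the same "replace everything before the first underscore" idea as a single regex substitution: ^[^_]* matches everything up to, but not including, the first underscore:

```python
import re

for s in ['punctacros="Tasla"_TONTA', 'punctacros="Tasla"_SONTA']:
    print(re.sub(r'^[^_]*', 'TROGA', s))
# TROGA_TONTA
# TROGA_SONTA
```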