Fastest check if line starts with value in list? - python
I have thousands of values (as list but might convert to dictionary or so if that helps) and want to compare to files with millions of lines. What I want to do is to filter lines in files to only the ones starting with values in the list.
What is the fastest way to do it?
My slow code:
for line in source_file:
    # Go through all IDs
    for id in my_ids:
        if line.startswith(str(id) + "|"):
            # replace commas with semicolons and pipes with commas
            target_file.write(line.replace(",", ";").replace("|", ","))
If you are sure each line starts with id + "|", and "|" never appears inside an id, you can split on "|" and look the first field up in a set. For example:
my_id_strs = set(map(str, my_ids))  # a set gives constant-time membership tests
for line in source_file:
    first_part = line.split("|", 1)[0]
    if first_part in my_id_strs:
        target_file.write(line.replace(",", ";").replace("|", ","))
Hope this helps :)
Use str.translate to do the replacement. You can also break as soon as an id matches.
# Python 3; on Python 2 use "from string import maketrans" instead
trantab = str.maketrans(",|", ";,")
ids = ['%s|' % id for id in my_ids]
for line in source_file:
    # Go through all IDs
    for id in ids:
        if line.startswith(id):
            # replace commas with semicolons and pipes with commas
            target_file.write(line.translate(trantab))
            break
or
# replace commas with semicolons and pipes with commas
# Python 3; on Python 2 use "from string import maketrans" instead
trantab = str.maketrans(",|", ";,")
idset = set(map(str, my_ids))  # compare strings with strings
for line in source_file:
    try:
        if line[:line.index('|')] in idset:
            target_file.write(line.translate(trantab))
    except ValueError:
        pass  # the line contains no '|'
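The set lookup and the translation table above can be combined into one generator. This is only a minimal Python 3 sketch: the name filter_by_id and the sample lines are made up for illustration, and it assumes the id is everything before the first "|".

```python
def filter_by_id(lines, ids):
    """Yield converted lines whose first '|'-delimited field is a known id."""
    id_set = {str(i) for i in ids}            # constant-time membership tests
    table = str.maketrans(",|", ";,")         # ',' -> ';' and '|' -> ','
    for line in lines:
        prefix, sep, _ = line.partition("|")  # sep is '' when there is no '|'
        if sep and prefix in id_set:
            yield line.translate(table)


# Hypothetical sample data, not taken from the question
sample = ["12|a,b|c\n", "99|x|y\n", "120|p|q\n", "no pipe here\n"]
result = list(filter_by_id(sample, [12, 120]))
```

Because it is a generator, it can be fed an open file object directly and the output written line by line, so the whole file never has to sit in memory.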
Use a regular expression. Here is an implementation:
import re
def filterlines(prefixes, lines):
    pattern = "|".join([re.escape(p) for p in prefixes])
    regex = re.compile(pattern)
    for line in lines:
        if regex.match(line):
            yield line
We build and compile the regular expression first (expensive, but done only once). After that, each line needs a single regex.match call instead of a Python-level loop over every prefix, which is usually much faster.
Test code for the above:
with open("/usr/share/dict/words") as words:
    prefixes = [line.strip() for line in words]
lines = [
    "zoo this should match",
    "000 this shouldn't match",
]
print(list(filterlines(prefixes, lines)))
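For the original question, where every id must be followed directly by "|", the same idea might be adapted as follows. This is a sketch under that assumption; compile_id_regex and the sample lines are hypothetical names and data.

```python
import re


def compile_id_regex(ids):
    """Build one regex that matches any of the ids followed by a literal '|'."""
    alternatives = "|".join(re.escape(str(i)) for i in ids)
    return re.compile(r"(?:%s)\|" % alternatives)


regex = compile_id_regex([12, 120])
lines = ["12|a", "120|b", "121|c", "1|d"]  # made-up sample lines
kept = [line for line in lines if regex.match(line)]
```

Requiring the "|" right after the id prevents id 12 from also matching a line that starts with 121.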
Related
python open csv search for pattern and strip everything else
I have a CSV file 'svclist.csv' which contains a single-column list as follows:
pf=/usr/sap/PL5/SYS/profile/PL5_D00_s4prd1
pf=/usr/sap/PL5/SYS/profile/PL5_ASCS01_s4prdascs
I need to strip everything from each line except the PL5 directory and the two digits in the last path component, so the result should look like this:
PL5,00
PL5,01
I started the code as follows:
clean_data = []
with open('svclist.csv', 'rt') as f:
    for line in f:
        if line.__contains__('profile'):
            print(line, end='')
and I'm stuck here. Thanks in advance for the help.
You can use the regular expression (PL5)[^/].{0,}([0-9]{2,2}). For an explanation, paste the regex into https://regexr.com, which shows how it works so you can make any changes you need.
import re
test_string_list = ['pf=/usr/sap/PL5/SYS/profile/PL5_D00_s4prd1', 'pf=/usr/sap/PL5/SYS/profile/PL5_ASCS01_s4prdascs']
regex = re.compile("(PL5)[^/].{0,}([0-9]{2,2})")
result = []
for test_string in test_string_list:
    matchArray = regex.findall(test_string)
    result.append(matchArray[0])
with open('outfile.csv', 'w') as f:
    for row in result:
        f.write(','.join(row) + '\n')
In the above code, I collect the matched tuples, such as ('PL5', '00'), in a list and then write them to 'outfile.csv'. Joining each tuple with ',' gives PL5,00. Slicing with str(row)[1:-1] would only remove the parentheses and leave the quotes and a space in the output.
You can use a regex for this (in general, a regex is a good option when you need to extract a pattern):
import re
pattern = re.compile(r"pf=/usr/sap/PL5/SYS/profile/PL5_.*(\d{2})")  # compiled so that .findall is available
with open('svclist.csv', 'rt') as f:
    for line in f:
        if 'profile' in line:
            last_two_numbers = pattern.findall(line)[0]
            print(f'PL5,{last_two_numbers}')
This code goes over each line, checks whether "profile" is in the line (the same test as __contains__), then extracts the two digits according to the pattern.
I made the assumption that the number is always between the two underscores. You could run something similar to this within your for-loop:
test_str = "pf=/usr/sap/PL5/SYS/profile/PL5_D00_s4prd1"
test_list = test_str.split("_")  # splits the string at the underscores
output = test_list[1].strip("abcdefghijklmnopqrstuvwxyz" + str.swapcase("abcdefghijklmnopqrstuvwxyz"))  # removes leading and trailing letters
try:
    int(output)  # raises ValueError if non-numeric characters are left
    print(f"PL5,{output}")
except ValueError:
    print(f'Something went wrong! Output is PL5,{output}')
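The answers above hard-code PL5. A variant that also captures the system name from the last path component could be sketched as below. It assumes the component always looks like NAME_LETTERSnn_host, which is inferred only from the two sample lines; PROFILE_RE and sid_and_instance are hypothetical names.

```python
import re

# Assumed layout of the last path component: <NAME>_<letters><two digits>_<host>
PROFILE_RE = re.compile(r"/profile/([A-Z0-9]+)_[A-Za-z]*(\d{2})_")


def sid_and_instance(line):
    """Return 'NAME,nn' for a profile path, or None if the line does not fit."""
    m = PROFILE_RE.search(line)
    if m is None:
        return None
    return "%s,%s" % m.groups()


rows = [
    sid_and_instance("pf=/usr/sap/PL5/SYS/profile/PL5_D00_s4prd1"),
    sid_and_instance("pf=/usr/sap/PL5/SYS/profile/PL5_ASCS01_s4prdascs"),
]
```

Returning None for lines that do not fit lets the caller skip them instead of failing on an empty match list.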
Getting the line number of a string
Suppose I have a very long string taken from a file:
lf = open(filename, 'r')
text = lf.readlines()
lf.close()
or
lineList = [line.strip() for line in open(filename)]
text = '\n'.join(lineList)
How can one find the line number of a specific regular expression match in this string (in this case the line number of 'match')?
regex = re.compile(somepattern)
for match in re.findall(regex, text):
    continue
Thank you for your time in advance.
Edit: Forgot to add that the pattern we are searching for spans multiple lines, and I am interested in the starting line.
We need to get re.Match objects rather than the strings themselves, using re.finditer, which gives access to the starting position. Consider the following example: say I want to find every pair of digits located immediately before and after a newline (\n):
import re
lineList = ["123","456","789","ABC","XYZ"]
text = '\n'.join(lineList)
for match in re.finditer(r"\d\n\d", text, re.MULTILINE):
    start = match.span()[0]  # .span() gives tuple (start, end)
    line_no = text[:start].count("\n")
    print(line_no)
Output:
0
1
Explanation: once I have the starting position, I simply count the newlines before it, which is the same as the line number. Note: I assumed line numbers start from 0.
Perhaps something like this (note that it searches one line at a time, so it only finds patterns that fit on a single line):
lf = open(filename, 'r')
text_lines = lf.readlines()
lf.close()
regex = re.compile(somepattern)
for line_number, line in enumerate(text_lines):
    for match in re.findall(regex, line):
        print('Match found on line %d: %s' % (line_number, match))
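When there are many matches in a long text, counting the newlines before every match rescans the text each time. One possible alternative, sketched here with the made-up name match_start_lines, records the newline offsets once and looks each match start up with bisect. Line numbers are 0-based.

```python
import bisect
import re


def match_start_lines(pattern, text):
    """Return the 0-based line number on which each match of pattern starts."""
    # Offsets of every newline, collected once and already sorted
    newline_offsets = [m.start() for m in re.finditer("\n", text)]
    regex = re.compile(pattern)
    # The number of newlines strictly before the match start is the line number
    return [bisect.bisect_left(newline_offsets, m.start())
            for m in regex.finditer(text)]


text = "\n".join(["123", "456", "789", "ABC", "XYZ"])
digit_pairs = match_start_lines(r"\d\n\d", text)
multi_line = match_start_lines(r"789\nABC", text)
```

The second call shows a pattern that spans two lines; the reported number is the line where the match begins.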
Print line if line starts with any letter of the alphabet
I'm trying to print all of my reptile subspecies in my Python program. I have a text file with a bunch of subspecies and their DNA sequence IDs. I just want to create a dictionary of subspecies (keys) and their respective DNA sequence IDs (values). But to do that I first need to learn how to separate the two, so I want to print only the subspecies names and ignore the sequence IDs. So far I have:
import re
file = open('repCleanSubs2.txt')
for line in file:
    if line.startswith('[a-zA-Z]'):
        print line
I believe the interpreter takes '[a-zA-Z]' as a string literal rather than as a search for any letter of the alphabet regardless of case, which is what I want. Is there some syntax that I'm missing in my if statement? Thanks!
startswith does not interpret regular expressions. Use the re module you have imported to check whether the string matches:
if re.match('^[a-zA-Z]+', line) is not None:
    print line
starts with: ^
one or more matching characters: +
http://www.fon.hum.uva.nl/praat/manual/Regular_expressions_1__Special_characters.html
import re
file = open('repCleanSubs2.txt')
for line in file:
    match = re.findall('^[a-zA-Z]+', line)
    if match:
        print line, match
The ^ sign means match from the beginning of the line.
[a-zA-Z] matches letters between a-z and A-Z.
+ means at least one character in [a-zA-Z] must be found.
re.findall returns a list of all the matches of the pattern it could find in the string you supplied.
Try the following lines instead of the startswith check:
if re.match("^[a-zA-Z]", line):
    print line
Try this, it's working for me:
import re
file = open('repCleanSubs2.txt')
for line in file:
    if (re.match('[a-zA-Z]',line)):
        print line
Without using re:
import string
with open('repCleanSubs2.txt') as c_file:
    for line in c_file:
        # startswith accepts a tuple of candidate prefixes
        if line.startswith(tuple(string.ascii_letters)):
            print line
Try this:
file = open("abc.xyz")
file_content = file.read()
line = file_content.splitlines()
output_data = []
for i in line:
    if i[:1].isalpha():  # True when the first character is a letter; empty lines are skipped
        output_data.append(i)
        print(i)
It can be done without a regular expression:
data = open('repCleanSubs2.txt').read().splitlines()  # read the file and get its lines as a list
print [i for i in data if i[:1].isalpha()]  # i[:1] is safe on empty lines, unlike i[0]
Replace part of a matched string in python
I have the following matched strings:
punctacros="Tasla"_TONTA
punctacros="Tasla"_SONTA
punctacros="Tasla"_JONTA
punctacros="Tasla"_BONTA
I want to replace only the part before the underscore in each matched string, and the rest of each original string should remain the same. The result should look like this:
TROGA_TONTA
TROGA_SONTA
TROGA_JONTA
TROGA_BONTA
Edit: This should work:
from re import sub
with open("/path/to/file") as myfile:
    lines = []
    for line in myfile:
        line = sub('punctacros="Tasla"(_.*)', r'TROGA\1', line)
        lines.append(line)
with open("/path/to/file", "w") as myfile:
    myfile.writelines(lines)
Result:
TROGA_TONTA
TROGA_SONTA
TROGA_JONTA
TROGA_BONTA
Note, however, that if your file is exactly like the sample given, you can replace the re.sub line with this:
line = "TROGA_"+line.split("_", 1)[1]
eliminating the need for a regex altogether. I didn't do this, though, because you seem to want a regex solution.
mystring.replace('punctacros="Tasla"', 'TROGA')
where mystring is the string containing those four lines. It returns a new string with the values replaced.
If you want to replace everything before the first underscore, try this:
#! /usr/bin/python3
data = ['punctacros="Tasla"_TONTA', 'punctacros="Tasla"_SONTA', 'punctacros="Tasla"_JONTA', 'punctacros="Tasla"_BONTA', 'somethingelse!="Tucku"_CONTA']
for s in data:
    print('TROGA' + s[s.find('_'):])
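The s.find('_') approach assumes every string contains an underscore; when there is none, find returns -1 and only the last character is kept. A slightly more defensive sketch uses str.partition and leaves such strings unchanged. The name replace_prefix is made up for illustration.

```python
def replace_prefix(line, new_prefix="TROGA", sep="_"):
    """Replace everything before the first sep; return line unchanged if sep is absent."""
    head, found, tail = line.partition(sep)
    if not found:
        return line
    return new_prefix + found + tail


converted = [replace_prefix(s) for s in
             ['punctacros="Tasla"_TONTA', 'somethingelse!="Tucku"_CONTA', 'nounderscore']]
```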
Splitting lines in python based on some character
Input:
!,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/1
2/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:14,000.
0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W
55.576,+0013!,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013!,A,56
281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34
:18,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:19,000.0,0,37N22.
Output:
!,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:14,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:18,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:19,000.0,0,37N22.
'!' is the starting character and +0013 should be the ending of each line (if present).
The problem I am getting is that the output looks like this:
!,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/1
2/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:14,000.
0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W
Any help would be highly appreciated!
My code:
file_open= open('sample.txt','r')
file_read= file_open.read()
file_open2= open('output.txt','w+')
counter =0
for i in file_read:
    if '!' in i:
        if counter == 1:
            file_open2.write('\n')
            counter= counter -1
        counter= counter +1
    file_open2.write(i)
You can try something like this:
with open("abc.txt") as f:
    data=f.read().replace("\r\n","")  # replace the newlines with ""
    # the newline can be "\n" in your system instead of "\r\n"
    ans=filter(None,data.split("!"))  # split the data at '!', then filter out empty lines
    for x in ans:
        print "!"+x  # or write to some other file
Output:
!,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:14,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:18,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:19,000.0,0,37N22.
Could you just use str.split?
lines = file_read.split('!')
Now lines is a list which holds the split data. These are almost the lines you want to write; the only difference is that they have no trailing newline and no '!' at the start. We can add both with string formatting, e.g. '!{0}\n'.format(line). Then we can put that in a generator expression and pass it to file.writelines to write the data to a new file:
file_open2.writelines('!{0}\n'.format(line) for line in lines)
You might need:
file_open2.writelines('!{0}\n'.format(line.replace('\n','')) for line in lines)
if you find that you're getting more newlines than you wanted in the output. Because the data starts with '!', the first element of lines is an empty string, so skip empty pieces if you don't want a lone '!' line at the top.
A few other points: when opening files, it's nice to use a context manager, which makes sure that the file is closed properly:
with open('inputfile') as fin:
    lines = fin.read().split('!')
with open('outputfile','w') as fout:
    fout.writelines('!{0}\n'.format(line.replace('\n','')) for line in lines if line)
Another option, using replace instead of split, since you know the starting and ending characters of each line:
In [14]: data = """!,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/1
2/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:14,000.
0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W
55.576,+0013!,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013!,A,56
281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34
:18,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:19,000.0,0,37N22.""".replace('\n', '')
In [15]: print data.replace('+0013!', "+0013\n!")
!,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:14,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:18,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:19,000.0,0,37N22.
Just for some variety, here is a regular expression answer:
import re
outputFile = open('output.txt', 'w+')
with open('sample.txt', 'r') as f:
    for line in re.findall("!.+?(?=!|$)", f.read(), re.DOTALL):
        outputFile.write(line.replace("\n", "") + '\n')
outputFile.close()
It opens the output file, reads the contents of the input file, and loops through all the matches of the regular expression !.+?(?=!|$) with the re.DOTALL flag. An explanation of the regular expression and what it matches can be found here: http://regex101.com/r/aK6aV4
For each match, we strip out the newlines and write the result to the file.
Let's remove the existing line breaks, add a \n before every "!", and then let Python split the lines :-)
[line for line in file_read.replace("\n", "").replace("!", "\n!").splitlines() if line]
I would actually implement this as a generator, so that you can work on the data stream rather than on the entire content of the file. This is quite memory friendly when working with huge files. Note that the yielded parts do not include the leading "!".
def split_on_stream(it,sep="!"):
    prev = ""
    for line in it:
        line = (prev + line.strip()).split(sep)
        for parts in line[:-1]:
            yield parts
        prev = line[-1]
    yield prev
with open("test.txt") as fin:
    for parts in split_on_stream(fin):
        print parts
Output:
,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:14,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:18,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:19,000.0,0,37N22.
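A Python 3 variation on the same streaming idea is sketched below. It reads fixed-size chunks instead of lines, keeps the leading "!" on each record, and skips empty pieces. The name records, the chunk_size parameter, and the short sample string are assumptions made for illustration, and the data is assumed to start with the separator.

```python
import io


def records(stream, sep="!", chunk_size=4096):
    """Yield sep-prefixed records from a text stream, ignoring line breaks."""
    pending = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        # Drop line breaks, then split off every record that is now complete
        pending += chunk.replace("\r", "").replace("\n", "")
        *complete, pending = pending.split(sep)
        for part in complete:
            if part:  # skips the empty piece before the first separator
                yield sep + part
    if pending:
        yield sep + pending  # last, possibly truncated, record


# Tiny made-up sample; a small chunk_size forces records to span chunks
sample = io.StringIO("!a,b\n!c\n,d!e")
out = list(records(sample, chunk_size=3))
```

With a real file, the result of open('sample.txt') can be passed in place of the StringIO object, and each yielded record written out followed by a newline.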