How do I copy multiple lines? - python

I have the following file:
this is the first line
and this is the second line
now it is the third line
wow, the fourth line
but now it's the fifth line
etc...
etc...
etc...
Starting from "now it is the third line" to "but now it's the fifth line", how do I copy those three lines (without knowing the line numbers of those lines)? In perl, you would do something like:
/^now it is/../^but now/
What is the equivalent in python?
I have (which obviously only grabs 1 of the lines):
regex = re.compile("now it is")
for line in content:
    if regex.match(line):
        print line
EDIT:
reg = re.compile(r"now it is.*but now it.*", re.MULTILINE | re.DOTALL)
matches = reg.search(urllib2.urlopen(url).read())
for match in matches.group():
    print match
This prints:
n
o
w
i
t
i
s
.
.
.
i.e. it returns single characters, not the complete lines.

I think you just need the re.MULTILINE flag. With it you can write a similar match and get back the text spanning the lines you want.
EDIT:
The complete solution involves using re.MULTILINE and re.DOTALL flags, plus non-greedy regexp:
>>> text = """this is the first line
and this is the second line
now it is the third line
wow, the fourth line
but now it's the fifth line
etc...
etc...
etc..."""
>>> import re
>>> match = re.search('^(now it is.*?but now.*?)$', text, flags=re.MULTILINE|re.DOTALL)
>>> print match.group()
now it is the third line
wow, the fourth line
but now it's the fifth line

You can easily write a generator to do this:
def re_range(f, re_start, re_end):
    # skip lines until the start pattern matches, then yield that line
    for line in f:
        if re_start.match(line):
            yield line
            break
    # keep yielding up to and including the line matching the end pattern
    for line in f:
        yield line
        if re_end.match(line):
            break
and you can call it like this:
import re
re_start = re.compile("now it is")
re_end = re.compile("but now")
with open('in.txt') as f:
    for line in re_range(f, re_start, re_end):
        print line,
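For reference, here is a minimal sketch of the same generator written for Python 3. It assumes the input may be any iterable of lines, not only an open file, so it wraps the input in iter() first. Without that, the second loop would restart from the top when given a list. The sample list stands in for the file from the question.

```python
import re

def re_range(lines, re_start, re_end):
    """Yield lines from the first re_start match through the next re_end match."""
    lines = iter(lines)  # both loops must consume the same iterator
    for line in lines:
        if re_start.match(line):
            yield line
            break
    for line in lines:
        yield line
        if re_end.match(line):
            break

# Stand-in for the file contents shown in the question.
sample = [
    "this is the first line\n",
    "and this is the second line\n",
    "now it is the third line\n",
    "wow, the fourth line\n",
    "but now it's the fifth line\n",
    "etc...\n",
]
selected = list(re_range(sample, re.compile("now it is"), re.compile("but now")))
print("".join(selected), end="")
```

Like the Perl range operator, this stops at the first line that matches the end pattern after the start line.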

f = open("yourfile") #that is, the name of your file with extension in quotes
f = f.readlines()
Now f will be a list of each line in the file. f[0] will be the first line, f[1] the second and so on. To grab the third to fifth line you would use f[2:5]

Something like this?
import re
start = re.compile("now it is")
end = re.compile("but now")
valid = False
for line in open("/path/to/file.txt", "r"):
    if start.match(line):
        valid = True
    if valid:
        print line,
    # check the end pattern after printing so the end line itself is included
    if end.match(line):
        valid = False
This way you hold only one line in memory at a time, whereas readlines() would load the whole file into memory.
This assumes the regex patterns are unique within your text block. If that is not the case, please give more information on exactly how you identify the start line and the end line.
If you only need to check the beginning of the line for a match, it's even easier:
valid = False
for line in open("/path/to/file.txt", "r"):
    if line.startswith("now it is"):
        valid = True
    if valid:
        print line,
    # check the end marker after printing so the end line itself is included
    if line.startswith("but now"):
        valid = False

Related

extract the dimensions from the head lines of text file

Please see the attached image showing the format of the text file. I need to extract the dimensions of the data matrix given in the first line of the file, here 49 * 70 * 1 for the case shown in the image. Note that the length of the name "gd_fac" can vary. How can I extract these numbers as integers? I am using Python 3.6.
The specification is not very clear. I am assuming that the information you want will always be in the first line, and always in parentheses. After that:
with open(filename) as infile:
    line = infile.readline()
string = line[line.find('(')+1:line.find(')')]
# split() returns strings, so convert each part to an integer
lst = [int(n) for n in string.split('x')]
This will create the list lst = [49, 70, 1].
What is happening here:
First I open the file (you will need to replace filename with the name of your file, as a string). The with ... as ... structure ensures that the file is closed after use. Then I read the first line. After that, I select only the part of that line that falls after the open paren ( and before the close paren ). Finally, I break the string into parts, with the character x as the separator. This creates a list that contains the values in the first line of the file that fall between the parentheses and are separated by x.
Since you have mentioned that the length of 'gd_fac' can vary, the best solution is to use a regular expression.
import re
with open("a.txt") as fh:
    for line in fh:
        if '(' in line and ')' in line:
            dimension = re.findall(r'.*\((.*)\)', line)[0]
            break
print(dimension)
Output:
49x70x1
This looks for "gd_fac" in each line and, if it is there, strips everything unneeded so that only the part you want remains.
with open('test.txt', 'r') as infile:
    for line in infile:
        if "gd_fac" in line:
            line = line.replace("gd_fac", "")
            line = line.replace("x", "*")
            line = line.replace("(", "")
            line = line.replace(")", "")
            print(line)
            break
OUTPUT: "49*70*1"
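Since the question asks for integers, the answers above can be combined into a short Python 3 sketch. The function name parse_dimensions and the sample header lines are assumptions for illustration only, because the real file layout was shown only in an image. The sketch assumes the dimensions sit inside the first pair of parentheses on the line.

```python
import re

def parse_dimensions(header_line):
    """Return the numbers inside the first (...) group as a tuple of ints."""
    inside = re.search(r'\(([^)]*)\)', header_line)
    if inside is None:
        raise ValueError("no parenthesised dimensions in: %r" % header_line)
    # \d+ picks out each run of digits, whatever separates them ('x', '*', ...)
    return tuple(int(number) for number in re.findall(r'\d+', inside.group(1)))

# Hypothetical header line; adjust to the actual first line of the file.
dims = parse_dimensions("gd_fac (49x70x1)")
print(dims)
```

Because the regular expression only looks at the parentheses, the length of the name in front of them does not matter.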

Print line if line starts with any letter of the alphabet

I'm trying to print all of my reptile subspecies in my python program. I have a text file with a bunch of subspecies and their DNA sequence IDs. I just want to create a dictionary of subspecies (keys) and their respective DNA sequence IDs (values). But to do that I need to first learn how to separate the two.
So I want to print all of the subspecies names only, and to ignore the sequence IDs.
So far I have
import re
file = open('repCleanSubs2.txt')
for line in file:
    if line.startswith('[a-zA-Z]'):
        print line
I believe the interpreter takes the '[a-zA-Z]' as a string literal, rather than as a search for any letter of the alphabet regardless of case, which is what I want.
Is there some syntax that I'm missing in my if statement?
Thanks!
startswith does not interpret regular expressions. Use the re module you have imported to check whether the string matches:
if re.match('^[a-zA-Z]+', line) is not None:
    print line
starts with: ^
one or more matching characters: +
http://www.fon.hum.uva.nl/praat/manual/Regular_expressions_1__Special_characters.html
import re
file = open('repCleanSubs2.txt')
for line in file:
    match = re.findall('^[a-zA-Z]+', line)
    if match:
        print line, match
The ^ sign means match from the beginning of the line; [a-zA-Z] matches letters between a-z and A-Z.
+ means one or more characters in [a-zA-Z] must be found.
re.findall returns a list of all the matches of the pattern it could find in the string you supplied to it.
Try the following lines instead of the startswith.
if re.match("^[a-zA-Z]", line):
    print line
Try this, it's working for me:
import re
file = open('repCleanSubs2.txt')
for line in file:
    if re.match('[a-zA-Z]', line):
        print line
Without using re:
import string
with open('repCleanSubs2.txt') as c_file:
    for line in c_file:
        # ascii_letters is a-z plus A-Z and is not locale-dependent
        if any(line.startswith(c) for c in string.ascii_letters):
            print line
Try this:
file = open("abc.xyz")
file_content = file.read()
line = file_content.splitlines()
output_data = []
for i in line:
    # i[:1] is '' for an empty line, so blank lines are skipped safely
    if i[:1].isalpha():
        output_data.append(i)
        print(i)
It can be done without a regular expression:
data = open('repCleanSubs2.txt').read().splitlines() ## Read file and extract data as list
print [i for i in data if i[:1].isalpha()]  ## i[:1] avoids an IndexError on empty lines
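As a compact summary of the regex-based answers, here is a Python 3 sketch. The helper name and the sample lines are made up for illustration and do not come from the asker's file. Note that re.match already anchors at the start of the string, so the leading ^ is optional.

```python
import re

STARTS_WITH_LETTER = re.compile(r'[a-zA-Z]')  # match() only tries position 0

def lines_starting_with_letter(lines):
    """Keep only the lines whose first character is an ASCII letter."""
    return [line for line in lines if STARTS_WITH_LETTER.match(line)]

# Hypothetical file contents: names mixed with ID-like lines.
sample = ["Anolis carolinensis\n", "12345 AB001\n", "\n", "Python regius\n", ">seq7\n"]
names = lines_starting_with_letter(sample)
print("".join(names), end="")
```

One difference from the str.isalpha() variants: isalpha() also accepts non-ASCII letters, while [a-zA-Z] does not.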

python string find and replace in text file

I'm reading a file and replacing a text string. I'm doing this for a few different strings. I know this isn't the most efficient code; I'm just trying to get it working. Here's what I have:
for line in fileinput.input(new_config, inplace=1):
    print line.replace("http://domain.com", self.site_address)
    print line.replace("dbname", self.db_name)
    print line.replace("root", self.db_user)
    print line.replace("password", self.db_pass)
    print line.replace("smvc_", self.prefix
This works, but it also writes every line in the file 5 times, and it only replaces the string on the first attempt and not on the new lines it creates (it doesn't matter whether the line matches the string or not).
You just need to treat it as one string and apply all replacements to it.
for line in fileinput.input(new_config, inplace=1):
    line = line.replace("http://domain.com", self.site_address)
    line = line.replace("dbname", self.db_name)
    line = line.replace("root", self.db_user)
    line = line.replace("password", self.db_pass)
    line = line.replace("smvc_", self.prefix)
    print line,  # trailing comma: line already ends with a newline
If a target is not found, replace() leaves the line unchanged, so only the targets that ARE found get replaced.
for line in fileinput.input(new_config, inplace=1):
    print line.replace(
        "http://domain.com", self.site_address).replace(
        "dbname", self.db_name).replace(
        "root", self.db_user).replace(
        "password", self.db_pass).replace("smvc_", self.prefix)
Done by copying and pasting what you wrote and using only the delete key and re-indenting. No characters added except the last closing paren.
Alternatively, this format may be clearer. It uses the backslash character to split the one statement across lines, for better readability:
print line.replace("http://domain.com", self.site_address) \
          .replace("dbname", self.db_name) \
          .replace("root", self.db_user) \
          .replace("password", self.db_pass) \
          .replace("smvc_", self.prefix)
You can read a file line by line.
At every line, look for the word you would like to replace.
For example:
line1 = 'Hello World There'
def Dummy():
    lineA = line1.replace('H', 'A')
    lineB = lineA.replace('e', 'o')
    print(lineB)
Dummy()
Then write lineB to a file.
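The idea shared by these answers, applying every replacement to the same string before writing it out once, can be sketched in Python 3 as a small helper. The function name apply_replacements and the replacement values below are placeholders, not the asker's real settings.

```python
def apply_replacements(line, replacements):
    """Apply each (old, new) pair in order to the same line and return the result."""
    for old, new in replacements:
        line = line.replace(old, new)
    return line

# Placeholder values standing in for self.site_address, self.db_name, etc.
replacements = [
    ("http://domain.com", "https://example.org"),
    ("dbname", "shop"),
    ("root", "admin"),
]
result = apply_replacements("url=http://domain.com db=dbname user=root\n", replacements)
print(result, end="")
```

The pairs are applied in order, so a later pair can also match text that an earlier replacement inserted. Choose the order, or the target strings, with that in mind.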

How to only read a file between certain phrases

Just a basic question. I know how to read information from a file etc but how would I go about only including the lines that are in between certain lines?
Say I have this :
Information Included in file but before "beginning of text"
" Beginning of text "
information I want
" end of text "
Information included in file but after the "end of text"
Thank you for any help you can give to get me started.
You can read the file line by line until you reach the start marker line, then do something with the lines (print them, store them in a list, etc.) until you reach the end marker line.
with open('myfile.txt') as f:
    line = f.readline()
    # "while line" stops at end of file if the marker is never found
    while line and line != ' Beginning of text \n':
        line = f.readline()
    # step past the start marker itself
    line = f.readline()
    while line and line != ' end of text \n':
        # add code to do something with the line here
        line = f.readline()
Make sure to match the start and end marker lines exactly. In your example they have a leading and a trailing blank.
Yet another way to do it is to use the two-argument version of iter():
start = '" Beginning of text "\n'
end = '" end of text "\n'
with open('myfile.txt') as f:
    for line in iter(f.readline, start):
        pass
    for line in iter(f.readline, end):
        print line
See https://docs.python.org/2/library/functions.html#iter for details.
I would just read the file line by line and check whether each line matches the beginning or the end string. The boolean readData then indicates whether you are between beginning and end, so you can read the actual information into another variable.
# Open the file
f = open('myTextFile.txt')
# Read the first line
line = f.readline()
readData = False
# If the file is not empty keep reading line one at a time
# until the file is empty
while line:
    # Compare without the trailing newline and surrounding blanks
    text = line.strip()
    # Check if line matches beginning
    if text == "Beginning of text":
        readData = True
    # Check if line matches end
    elif text == "end of text":
        readData = False
    # We are between beginning and end
    elif readData:
        (...)
    line = f.readline()
f.close()
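The flag-based approach can also be packaged as a generator. The following is a Python 3 sketch under a few assumptions: the name lines_between is made up, the marker lines are compared after stripping surrounding whitespace, and the markers themselves are excluded from the output. The sample list mirrors the example in the question, including the quote characters around the markers.

```python
def lines_between(lines, start, end):
    """Yield the lines strictly between the start marker and the end marker."""
    inside = False
    for line in lines:
        text = line.strip()
        if not inside:
            inside = (text == start)  # keep skipping until the start marker
        elif text == end:
            return                    # stop at the first end marker
        else:
            yield line

sample = [
    'Information Included in file but before "beginning of text"\n',
    '" Beginning of text "\n',
    'information I want\n',
    '" end of text "\n',
    'Information included in file but after the "end of text"\n',
]
wanted = list(lines_between(sample, '" Beginning of text "', '" end of text "'))
print("".join(wanted), end="")
```

If the start marker never appears, the generator simply yields nothing. If the end marker is missing, it yields everything after the start marker up to the end of the input.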

Match an element of every line

I have a list of rules for a given input file for my function. If any of them are violated in the file given, I want my program to return an error message and quit.
Every gene in the file should be on the same chromosome
Thus, for lines such as:
NM_001003443 chr11 + 5997152 5927598 5921052 5926098 1 5928752,5925972, 5927204,5396098,
NM_001003444 chr11 + 5925152 5926098 5925152 5926098 2 5925152,5925652, 5925404,5926098,
NM_001003489 chr11 + 5925145 5926093 5925115 5926045 4 5925151,5925762, 5987404,5908098,
etc.
Each line in the file will be a variation of these lines.
Thus, I want to make sure every line in the file is on chr11.
Yet I may be given a file with a different chr value (chr followed by any number of digits). Thus I want to write a function that makes sure whatever number follows chr is the same on every line.
Should I use a regular expression for this, or what should I do? This is in Python, by the way.
Such as: chr\d+ ?
I am unsure how to make sure that whatever is matched is the same on every line, though...
I currently have:
from re import *
for line in file:
    r = 'chr\d+'
    i = search(r, line)
    if i in line:
but I don't know how to make sure it is the same on every line...
In reference to sajattack's answer
fp = open(infile, 'r')
for line in fp:
    filestring = ''
    filestring += line
    chrlist = search('chr\d+', filestring)
    chrlist = chrlist.group()
for chr in chrlist:
    if chr != chrlist[0]:
        print('Every gene in file not on same chromosome')
Just read the file and have a while loop check each line to make sure it contains chr11. There are string functions to search for substrings in a string. As soon as you find a line that returns false (does not contain chr11) then break out of the loop and set a flag valid = false.
import re
fp = open(infile, 'r')
fp.readline()
tar = re.findall(r'chr\d+', fp.readline())[0]
for line in fp:
    if (line.find(tar) == -1):
        print("Not valid")
        break
This should search for a number in the line and check for validity.
Is it safe to assume that the first chr is the correct one? If so, use this:
import re
chrlist = re.findall("chr[0-9]+", open('file').read())
# ^ this is a list with all chr(whatever numbers)
for chr in chrlist:
    if chr != chrlist[0]:
        print("Chr does not match")
        break
My solution uses a "match group" to collect the matched numbers from the "chr" string.
import re
pat = re.compile(r'\schr(\d+)\s')
def chr_val(line):
    m = re.search(pat, line)
    if m is not None:
        return m.group(1)
    else:
        return ''
def is_valid(f):
    line = f.readline()
    v = chr_val(line)
    if not v:
        return False
    return all(chr_val(line) == v for line in f)
with open("test.txt", "r") as f:
    print("The file is {0}".format("valid" if is_valid(f) else "NOT valid"))
Notes:
Pre-compiles the regular expression for speed.
Uses a raw string (r'') to specify the regular expression.
The pattern requires white space (\s) on either side of the chr string.
is_valid() returns False if the first line doesn't have a good chr value. Then it returns a Boolean value that is true if all of the following lines match the chr value of the first line.
Your sample code just prints something like The file is True so I made it a bit friendlier.
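Another way to express the same check is to collect the distinct chr values into a set and require exactly one. This is only a Python 3 sketch: the function names are made up, the pattern chr\d+ is taken from the question, and lines without any chr token are silently ignored, which may or may not be what you want.

```python
import re

CHR_PATTERN = re.compile(r'\bchr\d+\b')

def chromosomes_in(lines):
    """Return the set of distinct chrN tokens, taking the first one on each line."""
    found = set()
    for line in lines:
        match = CHR_PATTERN.search(line)
        if match:
            found.add(match.group())
    return found

def all_on_same_chromosome(lines):
    """True only if exactly one distinct chrN token occurs across all lines."""
    return len(chromosomes_in(lines)) == 1

good = [
    "NM_001003443 chr11 + 5997152 5927598 5921052 5926098 1\n",
    "NM_001003444 chr11 + 5925152 5926098 5925152 5926098 2\n",
]
bad = good + ["NM_001003489 chr12 + 5925145 5926093 5925115 5926045 4\n"]
print(all_on_same_chromosome(good), all_on_same_chromosome(bad))
```

A side benefit of the set is that the error message can list every chromosome that was found, rather than only reporting the first mismatch.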
