Remove string and all lines before string from file - python
I have a filename with thousands of lines of data in it.
I am reading in the filename and editing it.
The following tag is about ~900 lines in or more (it varies per file):
<Report name="test" xmlns:cm="http://www.example.org/cm">
I need to remove that line and everything before it in several files.
so I need code that searches for that tag and deletes it and everything above it
it will not always be 900 lines down, it will vary; however, the tag will always be the same.
I already have the code to read in the lines and write to a file. I just need the logic behind finding that line and removing it and everything before it.
I tried reading the file in line by line and then writing to a new file once it hits on that string, but the logic is incorrect:
readFile = open(firstFile)
lines = readFile.readlines()
readFile.close()
w = open('test', 'w')
for item in lines:
    if item == '<Report name="test" xmlns:cm="http://www.example.org/cm">':
        w.writelines(item)
w.close()
In addition, the exact string will not be the same in each file. The value "test" will be different. I perhaps need to check only for the tag prefix "<Report name" instead.
You can use a flag like tag_found to check when lines should be written to the output. You initially set the flag to False, and then change it to True once you've found the right tag. When the flag is True, you copy the line to the output file.
TAG = '<Report name="test" xmlns:cm="http://www.example.org/cm">'

tag_found = False
with open('tag_input.txt') as in_file:
    with open('tag_output.txt', 'w') as out_file:
        for line in in_file:
            if not tag_found:
                if line.strip() == TAG:
                    tag_found = True
            else:
                out_file.write(line)
PS: The with open(filename) as in_file: syntax uses what Python calls a "context manager" - see here for an overview. The short explanation is that a context manager automatically takes care of closing the file safely for you when the with: block finishes, so you don't have to remember to put in my_file.close() statements.
You can use a regular expression to match your line:

regex1 = '^<Report name=.*xmlns:cm="http://www.example.org/cm">$'

Get the index of the first item that matches the regex:

listIndex = [i for i, item in enumerate(lines) if re.search(regex1, item)]

Slice the list (listIndex is itself a list, so take its first element):

listLines = lines[listIndex[0]:]

And write to a file (the lines already end in newlines, so writelines is enough):

with open("filename.txt", "w") as fileOutput:
    fileOutput.writelines(listLines)
Putting that pseudocode together, try something like this:
import re

regex1 = '^<Report name=.*xmlns:cm="http://www.example.org/cm">$'  # variable name
regex2 = '^<Report name=.*xmlns:cm=.*>$'                           # variable name & xmlns:cm

with open(firstFile, "r") as fileInput:
    listLines = fileInput.readlines()

listIndex = [i for i, item in enumerate(listLines) if re.search(regex1, item)]
# listIndex = [i for i, item in enumerate(listLines) if re.search(regex2, item)]  # uncomment for variable name & xmlns:cm

with open("out_" + firstFile, "w") as fileOutput:
    fileOutput.writelines(listLines[listIndex[0]:])
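One caveat with the regex approach: listIndex is a list of indices, and it is empty when no line matches, so slicing with it directly fails. A defensive sketch - the +1 also drops the tag line itself, as the question asks; returning the lines unchanged when the tag is absent is my assumed policy:

```python
import re

def lines_after_tag(listLines, pattern=r'^<Report name=.*xmlns:cm=.*>$'):
    # Indices of every line matching the tag pattern (whitespace stripped).
    matches = [i for i, item in enumerate(listLines) if re.search(pattern, item.strip())]
    if not matches:
        return listLines               # tag absent: keep the file unchanged
    return listLines[matches[0] + 1:]  # drop the tag line and everything before it
```
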
Related
delete all rows up to a specific row
How can I delete all the lines in a text document up to a certain line? I find the line number using this code:

#!/usr/bin/env python
lookup = '00:00:00'
filename = "test.txt"
with open(filename) as text_file:
    for num, line in enumerate(text_file, 1):
        if lookup in line:
            print(num)

print(num) outputs the number of the matching line, for example 66. How do I delete all the lines up to line 66, i.e. up to the line found by the keyword?
As proposed here, with a small modification for your case: read all lines of the file, iterate the lines list until you reach the keyword, then write all remaining lines.

with open("yourfile.txt", "r") as f:
    lines = iter(f.readlines())

with open("yourfile.txt", "w") as f:
    for line in lines:
        if lookup in line:
            f.write(line)
            break
    for line in lines:
        f.write(line)
That's easy:

filename = "test.txt"
lookup = '00:00:00'
with open(filename, 'r') as text_file:
    lines = text_file.readlines()

res = []
for i in range(0, len(lines), 1):
    if lookup in lines[i]:
        res = lines[i:]
        break

with open(filename, 'w') as text_file:
    text_file.writelines(res)
Do you know which lines you want to delete?

#!/usr/bin/env python
lookup = '00:00:00'
filename = "test.txt"
with open(filename) as text_file, open('okfile.txt', 'w') as ok:
    lines = text_file.readlines()
    ok.writelines(lines[4:])

This will delete the first 4 lines and store the rest in a different document, in case you want to keep the original. The with statement takes care of closing the files when you're done with them :)
Providing three alternate solutions. All begin with the same first part - reading:

filename = "test.txt"
lookup = '00:00:00'
with open(filename) as text_file:
    lines = text_file.readlines()

The variations for the second part follow.

Using itertools.dropwhile, which discards items from the iterator until the predicate (condition) returns False (i.e. discards while the predicate is True), and from that point on yields all the remaining items without re-checking the predicate:

import itertools

with open(filename, 'w') as text_file:
    text_file.writelines(itertools.dropwhile(lambda line: lookup not in line, lines))

Note that it says not in, so all the lines before lookup is found are discarded. Bonus: if you wanted to do the opposite - write lines until you find the lookup and then stop - replace itertools.dropwhile with itertools.takewhile.

Using a flag value (found) to determine when to start writing the file:

with open(filename, 'w') as text_file:
    found = False
    for line in lines:
        if not found and lookup in line:  # 2nd expression not checked once `found` is True
            found = True                  # value remains True for all remaining iterations
        if found:
            text_file.write(line)

Similar to c yj's answer, with some refinements - use enumerate instead of range, and then use the last index (idx) to write the lines from that point on, with no other intermediate variables needed:

for idx, line in enumerate(lines):
    if lookup in line:
        break

with open(filename, 'w') as text_file:
    text_file.writelines(lines[idx:])
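The dropwhile behaviour described above can be checked in isolation on made-up sample lines:

```python
import itertools

lines = ['boot\n', 'init\n', '00:00:00 start\n', '00:00:01 next\n']
# Discard lines while the lookup is absent; once it appears, everything
# from that line onward is yielded without re-checking the predicate.
kept = list(itertools.dropwhile(lambda line: '00:00:00' not in line, lines))
```
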
Open and Read a CSV File without libraries
I have the following problem: I am supposed to open a CSV file (an Excel export) and read it without using any library. I have tried a lot and now have the first row as a tuple inside a list - but only the first line, the header, and no other rows. This is what I have so far:

with open(path, 'r+') as file:
    results = []
    text = file.readline()
    while text != '':
        for line in text.split('\n'):
            a = line.split(',')
            b = tuple(a)
            results.append(b)
        return results

The output should be every line as a tuple, and all the tuples in a list. My question is: how can I read the other lines? I am really sorry, I am new to programming altogether, so I have a hard time finding my mistake. Thank you very much in advance for helping me out!
This problem has appeared many times on Stack Overflow, so you should be able to find working code - but it is much better to use the csv module for this. You have wrong indentation, and you use return results after reading the first line, so the function exits and never tries to read the other lines. Even after changing that, there are still other problems, so it still won't read the next lines: you use readline(), so you read only the first line, and your loop works on the same line forever - it may never end, because you never set text = ''. You should use read() to get all the text, which you later split into lines using split("\n"); or you could use readlines() to get all lines as a list, and then you don't need split(); or you can use for line in file:. In all of these situations you don't need while.

def read_csv(path):
    with open(path, 'r+') as file:
        results = []
        text = file.read()
        for line in text.split('\n'):
            items = line.split(',')
            results.append(tuple(items))
        # after for-loop
        return results

def read_csv(path):
    with open(path, 'r+') as file:
        results = []
        lines = file.readlines()
        for line in lines:
            line = line.rstrip('\n')  # remove `\n` at the end of line
            items = line.split(',')
            results.append(tuple(items))
        # after for-loop
        return results

def read_csv(path):
    with open(path, 'r+') as file:
        results = []
        for line in file:
            line = line.rstrip('\n')  # remove `\n` at the end of line
            items = line.split(',')
            results.append(tuple(items))
        # after for-loop
        return results

None of these versions will work correctly if an item contains '\n' or ',' that shouldn't be treated as the end of a row or as a separator between items. Such items are wrapped in " ", which also makes them a problem to handle by hand. You can solve all of these problems with the standard csv module.
Your code is pretty good, and you are near the goal:

with open(path, 'r+') as file:
    results = []
    text = file.read()
    # while text != '':   <- not needed once you read() everything
    for line in text.split('\n'):
        a = line.split(',')
        b = tuple(a)
        results.append(b)
    return results

So enjoy learning :)

One caveat is that the CSV may not end with a blank line, as this would result in an ugly tuple at the end of the list like ('',) (which looks like a smiley). To prevent this you have to check for empty lines: an if line != '': after the for will do the trick.
How to open a file in python, read the comments ("#"), find a word after the comments and select the word after it?
I have a function that loops through a file that looks like this:

"#" XDI/1.0 XDAC/1.4 Athena/0.9.25
"#" Column.4: pre_edge
Content

That is to say that after the "#" there is a comment. My function aims to read each line and, if it starts with a specific word, select what comes after the ":". For example, given these two lines, I would like to read through them and, if the line starts with "#" and contains the word "Column.4", store the word "pre_edge". An example of my current approach follows:

with open(file, "r") as f:
    for line in f:
        if line.startswith('#'):
            word = line.split(" Column.4:")[1]
        else:
            print("n")

My trouble is, after finding a line that starts with "#", how can I parse/search through it and save its content if it contains the desired word?
In case the # comment contains the string Column.4: as stated above, you could parse it this way:

with open(filepath) as f:
    for line in f:
        if line.startswith('#'):
            # Here you process comment lines
            if 'Column.4' in line:
                first, remainder = line.split('Column.4: ')
                # remainder contains everything after '# Column.4: ',
                # so to get the first word:
                word = remainder.split()[0]
        else:
            # Here you can process lines that are not comments
            pass

Note: it is also good practice to use the for line in f: statement instead of f.readlines() (as mentioned in other answers), because this way you don't load all lines into memory but process them one by one.
You could start by reading the file into a list and then working through that instead:

file = 'test.txt'  # <- call the file whatever you want
with open(file, "r") as f:
    txt = f.readlines()

for line in txt:
    if line.startswith('"#"'):
        word = line.split(" Column.4: ")
        try:
            print(word[1])
        except IndexError:
            print(word)
    else:
        print("n")

Output:

['"#" XDI/1.0 XDAC/1.4 Athena/0.9.25\n']
pre_edge

I used a try/except because the first line also starts with "#" and we can't split it with your current logic. As a side note, in the question the lines start with "#" including the quotation marks, so the startswith() call was altered accordingly.
with open('stuff.txt', 'r+') as f:
    data = f.readlines()

for line in data:
    words = line.split()
    if words and ('#' in words[0]) and ("Column.4:" in words):
        print(words[-1])  # pre_edge
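The answers above can be folded into one small helper that returns the value after a given comment key; the key name and the quoted-"#" line format are taken from the question, and the helper name is mine:

```python
def comment_value(lines, key='Column.4:'):
    # Scan comment lines (starting with '#', optionally quoted as '"#"')
    # and return the first word after `key`, or None if it never appears.
    for line in lines:
        stripped = line.lstrip('"').lstrip()
        if stripped.startswith('#') and key in line:
            return line.split(key, 1)[1].split()[0]
    return None
```
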
Removing word from the beginning of my text object?
I have a function that scrapes speeches from millercenter.org and returns the processed speech. However, every one of my speeches has the word "transcript" at the beginning (that's just how it's coded into the HTML). So all of my text files look like this:

\n  <- there's really just a new line here, not literally '\n'
transcript
fourscore and seven years ago, blah blah blah

I have these saved in my U:/ drive - how can I iterate through these files and remove 'transcript'? Edit:

speech_dict = {}
for filename in glob.glob("U:/FALL 2015/ENGL 305/NLP Project/Speeches/*.txt"):
    with open(filename, 'r') as inputFile:
        filecontent = inputFile.read()
        filecontent.replace('transcript', '', 1)
        speech_dict[filename] = filecontent  # put the speeches into a dictionary to run through the algorithm

This is not doing anything to change my speeches; 'transcript' is still there. I also tried putting it into my text-processing function, but that doesn't work either:

def processURL(l):
    open_url = urllib2.urlopen(l).read()
    item_soup = BeautifulSoup(open_url)
    item_div = item_soup.find('div', {'id': 'transcript'}, {'class': 'displaytext'})
    item_str = item_div.text.lower()
    item_str_processed = punctuation.sub(' ', item_str)
    item_str_processed_final = item_str_processed.replace('—', ' ').replace('transcript', '', 1)
    splitlink = l.split("/")
    president = splitlink[4]
    speech_num = splitlink[-1]
    filename = "{0}_{1}".format(president, speech_num)
    return filename, item_str_processed_final  # giving back filename and the text itself

Here's an example URL I run through processURL: http://millercenter.org/president/harding/speeches/speech-3805
You can use Python's excellent replace() for this:

data = data.replace('transcript', '', 1)

This line replaces 'transcript' with '' (the empty string). The final parameter is the number of replacements to make: 1 replaces only the first instance of 'transcript'; omit it to replace all instances. Note that strings are immutable, so replace() returns a new string - you must assign the result back, which is why the bare filecontent.replace('transcript', '', 1) in your loop changes nothing.
If you know that the data you want always starts on line x, then do this:

with open('filename.txt', 'r') as fin:
    for _ in range(x):  # this loop skips x lines
        next(fin)
    for line in fin:
        # do something with the line
        print(line)

Or, say you want to remove any lines before transcript:

with open('filename.txt', 'r') as fin:
    while next(fin).strip() != 'transcript':
        pass  # skip lines until the *transcript* line has been read
    next(fin)  # if you want to skip the empty line after *transcript*
    for line in fin:
        # do something with the line
        print(line)
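The same skip-until-marker idea can be written with itertools.dropwhile, which avoids the manual while loop; the marker value comes from the question, the helper name is mine:

```python
import itertools

def lines_after_marker(lines, marker='transcript'):
    # Drop everything up to the marker line, then drop the marker itself.
    it = itertools.dropwhile(lambda l: l.strip() != marker, lines)
    next(it, None)  # consume the marker line (no-op if it never appeared)
    return list(it)
```
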
Splitting lines in python based on some character
Input:

!,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/1
2/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:14,000.
0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W
55.576,+0013!,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013!,A,56
281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34
:18,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:19,000.0,0,37N22.

Output:

!,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:14,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:18,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:19,000.0,0,37N22.

'!' is the starting character and +0013 should be the ending of each line (if present).

The problem I am getting - the output looks like:

!,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/1
2/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:14,000.
0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W

Any help would be highly appreciated!

My code:

file_open = open('sample.txt', 'r')
file_read = file_open.read()
file_open2 = open('output.txt', 'w+')
counter = 0
for i in file_read:
    if '!' in i:
        if counter == 1:
            file_open2.write('\n')
            counter = counter - 1
        counter = counter + 1
    file_open2.write(i)
You can try something like this:

with open("abc.txt") as f:
    data = f.read().replace("\r\n", "")  # replace the newlines with ""
    # the newline may be "\n" on your system instead of "\r\n"

ans = filter(None, data.split("!"))  # split the data at '!', then filter out empty pieces
for x in ans:
    print "!" + x  # or write to some other file

Output:

!,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:14,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:18,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:19,000.0,0,37N22.
Could you just use str.split?

lines = file_read.split('!')

Now lines is a list which holds the split data. This is almost the set of lines you want to write - the only difference is that they don't have trailing newlines and they don't have '!' at the start. We can put those in easily with string formatting - e.g. '!{0}\n'.format(line). Then we can put that whole thing in a generator expression which we pass to file.writelines to put the data in a new file:

file_open2.writelines('!{0}\n'.format(line) for line in lines)

You might need:

file_open2.writelines('!{0}\n'.format(line.replace('\n', '')) for line in lines)

if you find that you're getting more newlines than you wanted in the output. A few other points: when opening files, it's nice to use a context manager, which makes sure that the file is closed properly:

with open('inputfile') as fin:
    lines = fin.read().split('!')
with open('outputfile', 'w') as fout:
    fout.writelines('!{0}\n'.format(line.replace('\n', '')) for line in lines)
Another option, using replace instead of split, since you know the starting and ending characters of each line:

In [14]: data = """!,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/1
2/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:14,000.
0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W
55.576,+0013!,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013!,A,56
281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34
:18,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:19,000.0,0,37N22.""".replace('\n', '')

In [15]: print data.replace('+0013!', "+0013\n!")
!,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:14,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:18,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:19,000.0,0,37N22.
Just for some variance, here is a regular expression answer:

import re

outputFile = open('output.txt', 'w+')
with open('sample.txt', 'r') as f:
    for line in re.findall("!.+?(?=!|$)", f.read(), re.DOTALL):
        outputFile.write(line.replace("\n", "") + '\n')
outputFile.close()

It opens the output file, reads the contents of the input file, and loops through all the matches of the regular expression !.+?(?=!|$) with the re.DOTALL flag. An explanation of the regex and what it matches can be found here: http://regex101.com/r/aK6aV4 . After we have a match, we strip the newlines out of it and write it to the file.
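The regex above can be sanity-checked on a small in-memory string before running it on the real file (sample records shortened from the question):

```python
import re

raw = '!,A,56281,12/12/19,19:34:12,+0013!,A,56281,12/12/19,19:34:13,+0013'
# Each record starts at '!' and runs up to (but not including) the next
# '!' or the end of the string; DOTALL lets '.' cross embedded newlines.
records = re.findall(r'!.+?(?=!|$)', raw, re.DOTALL)
```
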
Let's add a \n before every "!" and then let Python split the lines :-) :

file_read.replace("!", "\n!").splitlines()
I would actually implement this as a generator, so you can work on the data stream rather than the entire content of the file. This is quite memory-friendly when working with huge files:

>>> def split_on_stream(it, sep="!"):
        prev = ""
        for line in it:
            line = (prev + line.strip()).split(sep)
            for parts in line[:-1]:
                yield parts
            prev = line[-1]
        yield prev

>>> with open("test.txt") as fin:
        for parts in split_on_stream(fin):
            print parts

,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:14,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:18,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:19,000.0,0,37N22.
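A quick sanity check of the generator on an in-memory list of wrapped lines (made-up sample data). Note the leading empty chunk, which appears because the text starts with the separator - the earlier answer removes those with filter(None, ...):

```python
def split_on_stream(it, sep='!'):
    # Restated from the answer above: yield chunks separated by `sep`,
    # buffering the trailing partial chunk across physical lines.
    prev = ''
    for line in it:
        parts = (prev + line.strip()).split(sep)
        for part in parts[:-1]:
            yield part
        prev = parts[-1]
    yield prev

# A chunk ('c,d' here) that is split across two physical lines is
# reassembled via the `prev` buffer.
chunks = list(split_on_stream(['!a,b!c,\n', 'd!e\n']))
```
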