Python 2.7 Search Line if match pattern and replace string - python
How can I read a file and find all lines that match a pattern starting with \d+\s, and then replace the whitespace with a comma? Some of the lines contain English characters, but some are Chinese. I guess the whitespace in the Chinese encoding is different from the English one?
Example (text.txt)
asdfasdf
1 abcd
2 asdfajklsd
3 asdfasdf
4 ...
asdfasdf
66 ...
aasdfasdf
99 ...
100 中文
101 中文
102 asdfga
103 中文
My Test Code:
    with open('text.txt', 'r') as t:
        with open('newtext.txt', 'w') as nt:
            content = t.readlines()
            for line in content:
                okline = re.compile('^[\d+]\s')
                if okline:
                    ntext = re.sub('\s', ',', okline)
                    nt.write(ntext)
With a single re.subn() call:
    import re

    with open('text.txt', 'r') as text, open('newtext.txt', 'w') as new_text:
        lines = text.read().splitlines()
        for l in lines:
            rpl = re.subn(r'^(\d+)\s+', '\\1,', l)
            if rpl[1]:
                new_text.write(rpl[0] + '\n')
The main advantage is that re.subn returns a tuple (new_string, number_of_subs_made). number_of_subs_made is non-zero only for lines where the pattern matched, so it tells us which lines were changed and should be written.
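To make the tuple concrete, here is a minimal sketch that applies the same re.subn call to a few in-memory lines instead of a file. The helper name convert_lines and the sample list are hypothetical and only for illustration.

```python
import re

def convert_lines(lines):
    """Keep lines that start with digits + whitespace, turning that whitespace into a comma."""
    converted = []
    for line in lines:
        # subn returns (new_string, number_of_substitutions)
        new_line, count = re.subn(r'^(\d+)\s+', r'\1,', line)
        if count:  # count is 0 when the pattern did not match this line
            converted.append(new_line)
    return converted

sample = ['asdfasdf', '1 abcd', '66 ...', 'aasdfasdf', '102 asdfga']
print(convert_lines(sample))
```

Lines that do not start with a number are dropped because their substitution count is 0.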
You could do this:
    import re

    # Reading lines from input file
    with open('text.txt', 'r') as t:
        content = t.readlines()
    # Opening file for writing
    with open('newtext.txt', 'w') as nt:
        # For each line
        for line in content:
            # We search for regular expression
            if re.search(r'^\d+\s', line):
                # Only if the pattern is found do we continue: replace the first
                # whitespace with a comma and write to the output file
                # (count=1 keeps the trailing newline from becoming a comma too)
                ntext = re.sub(r'\s', ',', line, count=1)
                nt.write(ntext)
There were multiple problems with your code. For starters, \d is a character class, equivalent to [0-9], so you don't need to put it inside square brackets; [\d+] matches a single digit or a literal +. Also, you were checking whether the compiled pattern object is truthy. Since the compile operation succeeds, that object is always truthy, so the check never actually tests the line.
Also, you should avoid nested with statements. A more Pythonic way is to open the input file using with, read it, and let the block close it before opening the output file.
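On the Chinese whitespace part of the question: none of the code above decodes the file, so a full-width (ideographic) space, U+3000, would not be seen as whitespace when the data is read as raw bytes. The following is only a sketch, assuming the file is UTF-8 encoded (the question does not say); the names LEADING_NUMBER, number_lines_to_csv and convert_file are made up for illustration. io.open and re.UNICODE are used so the same code runs on Python 2.7 and Python 3.

```python
from __future__ import print_function, unicode_literals
import io
import re

# With re.UNICODE (the default for str patterns on Python 3), \s also covers
# U+3000, the ideographic space. [^\S\r\n] means "whitespace except newlines",
# so the line ending is never swallowed.
LEADING_NUMBER = re.compile(r'^(\d+)[^\S\r\n]+', re.UNICODE)

def number_lines_to_csv(lines):
    """Yield only the lines that start with a number, with a comma after it."""
    for line in lines:
        new_line, count = LEADING_NUMBER.subn(r'\1,', line)
        if count:
            yield new_line

def convert_file(src_path, dst_path, encoding='utf-8'):
    # The encoding is an assumption; change it if the file is e.g. GBK.
    with io.open(src_path, 'r', encoding=encoding) as src, \
         io.open(dst_path, 'w', encoding=encoding) as dst:
        for new_line in number_lines_to_csv(src):
            dst.write(new_line)

# '\u3000' is the ideographic space; '\u4e2d\u6587' is the two-character word from the example file.
sample = ['100\u3000\u4e2d\u6587\n', 'asdfasdf\n', '101 \u4e2d\u6587\n']
print(list(number_lines_to_csv(sample)))
```

Calling convert_file('text.txt', 'newtext.txt') would then process the example file from the question.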
Compact code
    import re

    with open('esempio.txt', 'r') as original, open('newtext2.txt', 'w') as newtext:
        for l in original.read().split('\n'):
            if re.search(r"^\d+\s", l):
                newtext.write(re.sub(r'\s', ',', l) + '\n')
Related
Why does counting the records in a file return an abnormal count
I have a file containing a lot of records. A few empty lines are in the middle, some of them containing spaces and tabs as well.
File content: ABC GSHJSKK jjj ajjk
So the count should be 4, but the code below returns 6.
My code:
    num_lines = sum(1 for line in open('myfile.txt'))
I suggest reading the lines using regular expressions. Regular expressions can help you filter the lines down to the content you consider relevant. From what you wrote, I understand that you want to count only the lines containing alphanumeric strings and ignore everything else. You can filter the alphanumeric lines of the file by using the pattern ^\w+$. Your code could become something like:
    import re

    pattern = r"^\w+$"
    line_count = 0
    with open("myfile.txt", "r") as f:
        for line in f:  # for each line in file
            if re.search(pattern, line):  # if the line read matches the pattern
                line_count += 1
If you're not so familiar with regular expressions (or you need to verify how your pattern works), an online regex tester is very useful.
sum([1 for i in open('myfile.txt',"r").readlines() if i.strip()])
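The two answers above are not quite equivalent, which a small sketch can show. The helper names and the sample list below are hypothetical; the extra 'two words' line is added only to expose the difference.

```python
import re

def count_non_blank(lines):
    """Count lines that contain at least one non-whitespace character."""
    return sum(1 for line in lines if line.strip())

def count_word_only(lines):
    """Count lines made up solely of word characters (the ^\\w+$ approach)."""
    return sum(1 for line in lines if re.search(r'^\w+$', line))

sample = ['ABC\n', 'GSHJSKK\n', '\n', ' \t \n', 'jjj\n', 'ajjk\n', 'two words\n']
print(count_non_blank(sample))
print(count_word_only(sample))
```

The strip() version counts any line with visible content, while ^\w+$ skips lines that contain spaces or punctuation, so which one is right depends on what counts as a record.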
Regex to exclude a specific pattern python
I'm trying to find any occurrence of "fiction" preceded or followed by anything, except for "non-". I tried .*[^(n-)]fiction.* but it's not working as I want it to. Can anyone help me out?
Check if this works for you: .*(?<!non\-)fiction.*
You should avoid patterns starting with .*: they cause too many backtracking steps and slow down the code execution. In Python, you can always get the lines either by reading a file line by line, or by splitting a multiline string with splitlines(), and then pick the necessary lines by testing them against a pattern without .*s.
Reading a file line by line:
    final_output = []
    with open(filepath, 'r', newline="\n", encoding="utf8") as f:
        for line in f:
            if "fiction" in line and "non-fiction" not in line:
                final_output.append(line.strip())
Or, getting the lines even with non-fiction if there is fiction with no non- in front, using a slightly modified version of @jlesuffleur's regex:
    import re

    final_output = []
    rx = re.compile(r'\b(?<!non-)fiction\b')
    with open(filepath, 'r', newline="\n", encoding="utf8") as f:
        for line in f:
            if rx.search(line):
                final_output.append(line.strip())
Getting lines from a multiline string (with both approaches mentioned above):
    import re

    text = "Your input string line 1\nLine 2 with fiction\nLine 3 with non-fiction\nLine 4 with fiction and non-fiction"
    rx = re.compile(r'\b(?<!non-)fiction\b')
    # Approach with regex returning any line containing fiction with no non- prefix:
    final_output = [line.strip() for line in text.splitlines() if rx.search(line)]
    # => ['Line 2 with fiction', 'Line 4 with fiction and non-fiction']
    # Non-regex approach that does not return lines that may contain non-fiction
    # (even if they also contain fiction with no non- prefix):
    final_output = [line.strip() for line in text.splitlines() if "fiction" in line and "non-fiction" not in line]
    # => ['Line 2 with fiction']
What about a negative lookbehind?
    import re

    s = 'fiction non-fiction'
    res = re.findall("(?<!non-)fiction", s)
    res
Extract chunks of text from document and write them to new text file
I have a large text file from which I want to read several lines and write them out as one line to a new text file. For instance, I want to start reading at a certain start word and end on a lone parenthesis. So if my start word is 'CAR', I would want to keep reading until a lone parenthesis followed by a line break is read. The start and end words are to be kept as well.
What is the best way to achieve this? I have tried pattern matching and avoiding regex, but I don't think that is possible.
Code:
    array = []
    f = open('text.txt','r') as infile
    w = open(r'temp2.txt', 'w') as outfile
    for line in f:
        data = f.read()
        x = re.findall(r'CAR(.*?)\)(?:\\n|$)',data,re.DOTALL)
        array.append(x)
        outfile.write(x)
    return array
What the text may look like:
    ( CAR: *random info* *random info* - could be many lines of this )
Using regular expressions is totally fine for this type of problem. You cannot use them when your pattern contains recursion, such as getting the content from nested parentheses: ((text1)(text2)). You can use the following regular expression:
    (CAR[\s\S]*?(?=\)))
We can match the text you're interested in using the regex pattern (CAR.*)\) with flags gms. Then we just have to remove the newline characters from the resulting matches and write them to a file.
    import re

    with open("text.txt", 'r') as f:
        matches = re.findall(r"(CAR.*)\)", f.read(), re.DOTALL)
    with open("output.txt", 'w') as f:
        for match in matches:
            f.write(" ".join(match.split('\n')))
            f.write('\n')
The output file looks like this:
    CAR: *random info* *random info* - could be many lines of this
EDIT: updated code to put a newline between matches in the output file.
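One caveat: (CAR.*)\) is greedy, so with re.DOTALL it runs to the last ")" in the whole file, and the lazy variant stops at the first ")" even when it appears inside the random info. The sketch below is one possible way to end each chunk only on a parenthesis that sits alone on its line, as the question describes. It assumes the lone ")" may be surrounded by spaces or tabs; the names CHUNK and extract_chunks and the sample text are invented for illustration.

```python
import re

# Lazy match from CAR up to the first line that holds nothing but ")".
# re.MULTILINE lets ^ and $ match at line boundaries; re.DOTALL lets . cross lines.
CHUNK = re.compile(r'CAR.*?^[ \t]*\)[ \t]*$', re.DOTALL | re.MULTILINE)

def extract_chunks(text):
    """Return each CAR ... ) chunk collapsed onto a single line."""
    return [' '.join(chunk.split()) for chunk in CHUNK.findall(text)]

sample = """(
CAR: first (inline) note
more info
)
unrelated
(
CAR: second
)
"""
for chunk in extract_chunks(sample):
    print(chunk)
```

Both the start word and the closing parenthesis are kept, and the ")" inside "(inline)" does not end the first chunk because it is not alone on its line.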
Python Make newline after character
I would like to make a newline after a dot in a file. For example:
    Hello. I am damn cool. Lol
Output:
    Hello.
    I am damn cool.
    Lol
I tried it like this, but somehow it's not working:
    f2 = open(path, "w+")
    for line in f2.readlines():
        f2.write("\n".join(line))
    f2.close()
Could you help me there? I don't want just one newline, I want a newline after every dot in a single file. It should iterate through the whole file and make a newline after every single dot. Thank you in advance!
This should be enough to do the trick:
    with open('file.txt', 'r') as f:
        contents = f.read()
    with open('file.txt', 'w') as f:
        f.write(contents.replace('. ', '.\n'))
You could split your string on . and store the pieces in a list, then just print out the list.
    s = 'Hello. I am damn cool. Lol'
    lines = s.split('.')
    for line in lines:
        print(line)
If you do this, the output will be:
    Hello
     I am damn cool
     Lol
To remove the leading spaces, you could split on '. ' (with a space), or else use lstrip() when printing. So, to do this for a file:
    # open file for reading
    with open('file.txt') as fr:
        # get the text in the file
        text = fr.read()
    # split up the file into lines based on '.'
    lines = text.split('.')
    # open the file for writing
    with open('file.txt', 'w') as fw:
        # loop over each line
        for line in lines:
            # remove leading whitespace, and write to the file with a newline
            fw.write(line.lstrip() + '\n')
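Note that splitting on '.' drops the dots themselves, while the question's expected output keeps them. A possible alternative, shown here only as a sketch, is a regular expression substitution that keeps each dot and replaces the spaces after it with a newline. The function name break_after_dots is hypothetical.

```python
import re

def break_after_dots(text):
    """Put a newline after every '.', dropping any spaces or tabs that followed it."""
    return re.sub(r'\.[ \t]*', '.\n', text)

print(break_after_dots('Hello. I am damn cool. Lol'))
```

This treats every dot as a sentence end, so a number such as 3.14 would also be split; that is an accepted limitation of the sketch.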
Splitting lines in python based on some character
Input:
    !,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/1
    2/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:14,000.
    0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W
    55.576,+0013!,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013!,A,56
    281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34
    :18,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:19,000.0,0,37N22.
Output:
    !,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013
    !,A,56281,12/12/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013
    !,A,56281,12/12/19,19:34:14,000.0,0,37N22.714,121W55.576,+0013
    !,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W55.576,+0013
    !,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013
    !,A,56281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013
    !,A,56281,12/12/19,19:34:18,000.0,0,37N22.714,121W55.576,+0013
    !,A,56281,12/12/19,19:34:19,000.0,0,37N22.
'!' is the starting character and +0013 should be the ending of each line (if present).
The problem I am getting is that the output looks like this:
    !,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013
    !,A,56281,12/1
    2/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013
    !,A,56281,12/12/19,19:34:14,000.
    0,0,37N22.714,121W55.576,+0013
    !,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W
Any help would be highly appreciated!
My code:
    file_open= open('sample.txt','r')
    file_read= file_open.read()
    file_open2= open('output.txt','w+')
    counter =0
    for i in file_read:
        if '!' in i:
            if counter == 1:
                file_open2.write('\n')
                counter= counter -1
            counter= counter +1
        file_open2.write(i)
You can try something like this:
    with open("abc.txt") as f:
        data=f.read().replace("\r\n","")   #replace the newlines with ""
        #the newline can be "\n" in your system instead of "\r\n"
        ans=filter(None,data.split("!"))   #split the data at '!', then filter out empty lines
        for x in ans:
            print "!"+x    #or write to some other file
Output:
    !,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013
    !,A,56281,12/12/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013
    !,A,56281,12/12/19,19:34:14,000.0,0,37N22.714,121W55.576,+0013
    !,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W55.576,+0013
    !,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013
    !,A,56281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013
    !,A,56281,12/12/19,19:34:18,000.0,0,37N22.714,121W55.576,+0013
    !,A,56281,12/12/19,19:34:19,000.0,0,37N22.
Could you just use str.split?
    lines = file_read.split('!')
Now lines is a list which holds the split data. These are almost the lines you want to write. The only difference is that they don't have trailing newlines and they don't have '!' at the start. We can put those in easily with string formatting, e.g. '!{0}\n'.format(line). Then we can put that whole thing in a generator expression which we'll pass to file.writelines to put the data in a new file:
    file_open2.writelines('!{0}\n'.format(line) for line in lines)
You might need:
    file_open2.writelines('!{0}\n'.format(line.replace('\n','')) for line in lines)
if you find that you're getting more newlines than you wanted in the output.
A few other points: when opening files, it's nice to use a context manager, which makes sure that the file is closed properly. Splitting on '!' also leaves an empty first element when the data starts with '!', so it is skipped here:
    with open('inputfile') as fin:
        lines = fin.read().split('!')
    with open('outputfile','w') as fout:
        fout.writelines('!{0}\n'.format(line.replace('\n','')) for line in lines if line)
Another option, using replace instead of split, since you know the starting and ending characters of each line:
    In [14]: data = """!,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/1
    2/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:14,000.
    0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W
    55.576,+0013!,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013!,A,56
    281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34
    :18,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:19,000.0,0,37N22.""".replace('\n', '')

    In [15]: print data.replace('+0013!', "+0013\n!")
    !,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013
    !,A,56281,12/12/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013
    !,A,56281,12/12/19,19:34:14,000.0,0,37N22.714,121W55.576,+0013
    !,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W55.576,+0013
    !,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013
    !,A,56281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013
    !,A,56281,12/12/19,19:34:18,000.0,0,37N22.714,121W55.576,+0013
    !,A,56281,12/12/19,19:34:19,000.0,0,37N22.
Just for some variance, here is a regular expression answer:
    import re

    outputFile = open('output.txt', 'w+')
    with open('sample.txt', 'r') as f:
        for line in re.findall("!.+?(?=!|$)", f.read(), re.DOTALL):
            outputFile.write(line.replace("\n", "") + '\n')
    outputFile.close()
It will open the output file, get the contents of the input file, and loop through all the matches of the regular expression !.+?(?=!|$) with the re.DOTALL flag. An explanation of the regular expression and what it matches can be found here: http://regex101.com/r/aK6aV4 After we have a match, we strip the newlines out of it and write it to the file.
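To see what the pattern does without touching any files, the same regular expression can be run on a short in-memory string. This is only a sketch: the sample string is a shortened, made-up stand-in for the real records, and the names RECORD and split_records are hypothetical.

```python
import re

# "!" followed by as few characters as possible (newlines included, thanks to
# re.DOTALL), stopping right before the next "!" or at the end of the input.
RECORD = re.compile(r'!.+?(?=!|$)', re.DOTALL)

def split_records(text):
    """Return one string per '!'-prefixed record, with wrap newlines removed."""
    return [m.replace('\r', '').replace('\n', '') for m in RECORD.findall(text)]

wrapped = '!,A,1,+0013!,A,2\n,+0013!,A,3,\n'
for record in split_records(wrapped):
    print(record)
```

The lookahead (?=!|$) does not consume the next "!", so that character is still available to start the following match.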
Let's add a \n before every "!" and then let Python split the lines (the original line breaks have to be stripped first, since the input is wrapped):
    file_read.replace("\n", "").replace("!", "\n!").splitlines()
The first element of the result is an empty string, because the data starts with "!".
I would actually implement this as a generator, so that you can work on the data stream rather than the entire content of the file. This will be quite memory friendly when working with huge files:
    >>> def split_on_stream(it,sep="!"):
            prev = ""
            for line in it:
                line = (prev + line.strip()).split(sep)
                for parts in line[:-1]:
                    yield parts
                prev = line[-1]
            yield prev

    >>> with open("test.txt") as fin:
            for parts in split_on_stream(fin):
                print parts

    ,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013
    ,A,56281,12/12/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013
    ,A,56281,12/12/19,19:34:14,000.0,0,37N22.714,121W55.576,+0013
    ,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W55.576,+0013
    ,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013
    ,A,56281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013
    ,A,56281,12/12/19,19:34:18,000.0,0,37N22.714,121W55.576,+0013
    ,A,56281,12/12/19,19:34:19,000.0,0,37N22.
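The generator above yields the records without their leading "!" and starts with an empty item. A possible variant, given here only as a sketch, puts the separator back and skips empty pieces. It assumes the input begins with the separator, as in the question's data; the name iter_records and the short sample lines are hypothetical.

```python
def iter_records(lines, sep='!'):
    """Yield sep-prefixed records from an iterable of wrapped lines, one at a time."""
    pending = ''  # tail of the previous line that may continue on the next one
    for line in lines:
        parts = (pending + line.rstrip('\r\n')).split(sep)
        for part in parts[:-1]:
            if part:  # skip the empty piece in front of the first separator
                yield sep + part
        pending = parts[-1]
    if pending:
        yield sep + pending

wrapped_lines = ['!,A,1,+0013!,A,2\n', ',+0013!,A,3,\n']
for record in iter_records(wrapped_lines):
    print(record)
```

Because it accepts any iterable of lines, an open file object can be passed in directly, so only one line plus the unfinished record is held in memory at a time.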