Python 2.7 Search Line if match pattern and replace string

How can I read a file and find all lines that match a pattern starting with \d+\s, and then replace the whitespace with a comma? Some of the lines contain English characters, but some are Chinese. I guess whitespace in a Chinese encoding is different from whitespace in English?
Example (text.txt)
asdfasdf
1 abcd
2 asdfajklsd
3 asdfasdf
4 ...
asdfasdf
66 ...
aasdfasdf
99 ...
100 中文
101 中文
102 asdfga
103 中文
My test code:
with open('text.txt', 'r') as t:
    with open('newtext.txt', 'w') as nt:
        content = t.readlines()
        for line in content:
            okline = re.compile('^[\d+]\s')
            if okline:
                ntext = re.sub('\s', ',', okline)
                nt.write(ntext)

With a single re.subn() call:
import re

with open('text.txt', 'r') as text, open('newtext.txt', 'w') as new_text:
    lines = text.read().splitlines()
    for l in lines:
        # rpl is a tuple: (new_string, number_of_subs_made)
        rpl = re.subn(r'^(\d+)\s+', '\\1,', l)
        if rpl[1]:
            new_text.write(rpl[0] + '\n')
The main advantage is that re.subn returns a tuple (new_string, number_of_subs_made); a non-zero number_of_subs_made tells you that the line matched and the substitution was made.
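As a quick illustration of the tuple that re.subn returns, here is a minimal sketch that applies the same pattern to single lines. The helper name number_prefix_to_comma is made up for this example and is not part of the answer above.

```python
import re

def number_prefix_to_comma(line):
    """Return the rewritten line, or None when the line has no numeric prefix.

    Hypothetical helper: it wraps the re.subn call from the answer above.
    """
    new_line, count = re.subn(r'^(\d+)\s+', r'\1,', line)
    # count is 0 when the pattern did not match, so the line is skipped
    return new_line if count else None

print(number_prefix_to_comma('1 abcd'))
print(number_prefix_to_comma('asdfasdf'))
```

Lines without a leading number come back as None, so the caller can skip them without running a separate re.search first.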

You could do this:
import re

# Read the lines from the input file
with open('text.txt', 'r') as t:
    content = t.readlines()
# Open the output file for writing
with open('newtext.txt', 'w') as nt:
    for line in content:
        # Continue only if the line starts with digits followed by whitespace
        if re.search(r'^\d+\s', line):
            # Strip the line ending first so that \s does not turn it into a comma,
            # then replace the remaining whitespace with commas and write the line
            ntext = re.sub(r'\s', ',', line.rstrip('\n'))
            nt.write(ntext + '\n')
There were multiple problems with your code. For starters, \d is a character class, basically the same as [0-9], so you don't need to put it inside square brackets; [\d+] matches a single character that is a digit or a literal +. You were also checking whether the compiled pattern object is true; since compilation succeeds, the pattern object is always truthy, so the check never filters anything.
Also, you should avoid nested with statements here. A more Pythonic way is to open the input file using with, read it, let the block close it, and then open the output file.

Compact code
import re
with open('esempio.txt', 'r') as original, open('newtext2.txt', 'w') as newtext:
    for l in original.read().split('\n'):
        if re.search(r"^\d+\s", l):
            newtext.write(re.sub(r'\s', ',', l) + '\n')
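None of the answers above touch the part of the question about Chinese text. One possible cause, assuming the file is UTF-8 encoded and the Chinese lines use the full-width ideographic space (U+3000), is that on Python 2.7 \s applied to a byte string only matches ASCII whitespace. A minimal sketch that decodes the file with io.open and compiles the pattern with re.UNICODE follows; the function name convert and the default encoding are assumptions made for this example.

```python
import io
import re

# re.UNICODE makes \d and \s Unicode-aware on Python 2.7;
# on Python 3 this is already the default for str patterns.
PREFIX = re.compile(u'^(\\d+)\\s+', re.UNICODE)

def convert(src_path, dst_path, encoding='utf-8'):
    """Copy the numbered lines, replacing the whitespace after the number with a comma.

    Hypothetical helper; the UTF-8 default is an assumption about the input file.
    """
    with io.open(src_path, 'r', encoding=encoding) as src, \
            io.open(dst_path, 'w', encoding=encoding) as dst:
        for line in src:
            # Drop the line ending so that \s+ can never swallow it
            new_line, count = PREFIX.subn(u'\\1,', line.rstrip(u'\r\n'))
            if count:
                dst.write(new_line + u'\n')
```

io.open behaves the same on Python 2.7 and Python 3, so the lines are always handled as Unicode text rather than raw bytes.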

Related

Why is the record count from a file abnormal?

I have a file with a lot of records.
There are a few empty lines in the middle, and some lines contain only spaces or tabs.
File content:
ABC
GSHJSKK
jjj
ajjk
So the count should be 4, but the code below returns 6.
My code:
num_lines = sum(1 for line in open('myfile.txt'))

I suggest reading the lines with regular expressions. Regular expressions can help you filter the lines down to the content you consider relevant.
From what you wrote, I understand that you want to count only the lines containing alphanumeric strings, and ignore everything else.
You can filter the alphanumeric lines of the file with the pattern ^\w+$.
Your code could become something like:
import re

pattern = r"^\w+$"
line_count = 0
with open("myfile.txt", "r") as file:
    for line in file:  # for each line in the file
        if re.search(pattern, line):  # if the line matches the pattern
            line_count += 1
If you're not familiar with regular expressions (or you need to verify how your pattern works), an online regex tester is very useful.

sum([1 for i in open('myfile.txt',"r").readlines() if i.strip()])
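Both answers come down to skipping lines that contain nothing but whitespace. A small sketch of the strip-based version that also closes the file; the helper name count_nonblank_lines is an assumption made for this example.

```python
def count_nonblank_lines(path):
    """Count the lines that contain at least one non-whitespace character."""
    with open(path) as handle:
        # line.strip() is an empty (falsy) string for blank or whitespace-only lines
        return sum(1 for line in handle if line.strip())
```

Unlike the ^\w+$ pattern, this also counts lines that contain punctuation or internal spaces, which may or may not be what you want.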

Regex to exclude a specific pattern python

I'm trying to find any occurrence of "fiction" preceded or followed by anything, except when it is preceded by "non-".
I tried:
.*[^(n-)]fiction.*
but it's not working as I want it to.
Can anyone help me out?

Check if this works for you:
.*(?<!non\-)fiction.*
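To see why a bracket expression cannot express "not preceded by non-", note that a character class always matches exactly one character. The sketch below compares a simplified class, [^non-], which is an illustrative stand-in and not the exact pattern from the question, with the lookbehind from the answer.

```python
import re

# Matches one character that is not 'n', 'o' or '-', followed by 'fiction'
char_class = re.compile(r'[^non-]fiction')
# Matches 'fiction' unless the four characters before it are 'non-'
lookbehind = re.compile(r'(?<!non-)fiction')

def compare(text):
    """Return (class_matches, lookbehind_matches) for the given text."""
    return bool(char_class.search(text)), bool(lookbehind.search(text))

for sample in ['science fiction', 'fiction', 'pulp-fiction', 'non-fiction']:
    print(sample, compare(sample))
```

The class version needs some character in front of "fiction" and rejects any hyphen, so it misses "fiction" at the start of a string and "pulp-fiction"; the lookbehind only rejects the exact prefix "non-".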

You should avoid patterns starting with .*: they can cause a lot of backtracking and slow down execution.
In Python, you can always get lines either by reading a file line by line, or by splitting a string with splitlines(), and then pick the lines you need by testing them against a pattern without any .*.
Reading a file line by line:
final_output = []
with open(filepath, 'r', newline="\n", encoding="utf8") as f:
    for line in f:
        if "fiction" in line and "non-fiction" not in line:
            final_output.append(line.strip())
Or, to also get lines that contain non-fiction as long as they contain a fiction with no non- in front, using a slightly modified version of @jlesuffleur's regex:
import re
final_output = []
rx = re.compile(r'\b(?<!non-)fiction\b')
with open(filepath, 'r', newline="\n", encoding="utf8") as f:
    for line in f:
        if rx.search(line):
            final_output.append(line.strip())
Getting lines from a multiline string (with both approaches mentioned above):
import re
text = "Your input string line 1\nLine 2 with fiction\nLine 3 with non-fiction\nLine 4 with fiction and non-fiction"
rx = re.compile(r'\b(?<!non-)fiction\b')
# Regex approach: returns any line containing fiction with no non- prefix
final_output = [line.strip() for line in text.splitlines() if rx.search(line)]
# => ['Line 2 with fiction', 'Line 4 with fiction and non-fiction']
# Non-regex approach: skips every line that contains non-fiction, even if it also contains a plain fiction
final_output = [line.strip() for line in text.splitlines() if "fiction" in line and "non-fiction" not in line]
# => ['Line 2 with fiction']

What about a negative lookbehind?
import re
s = 'fiction non-fiction'
res = re.findall("(?<!non-)fiction", s)
print(res)  # ['fiction']

Extract chunks of text from document and write them to new text file

I have a large text file from which I want to read several lines and write them out as one line to a new text file. For instance, I want to start reading at a certain start word and stop at a lone parenthesis. So if my start word is 'CAR', I want to read from there until a parenthesis followed by a line break. The start and end markers are to be kept as well.
What is the best way to achieve this? I have tried pattern matching while avoiding regex, but I don't think that is possible.
Code:
array = []
f = open('text.txt','r') as infile
w = open(r'temp2.txt', 'w') as outfile
for line in f:
    data = f.read()
    x = re.findall(r'CAR(.*?)\)(?:\\n|$)',data,re.DOTALL)
    array.append(x)
    outfile.write(x)
return array
What the text may look like
( CAR: *random info*
*random info* - could be many lines of this
)

Using regular expressions is totally fine for this type of problem. You cannot use them when the pattern involves recursion, for example extracting the contents of nested parentheses such as ((text1)(text2)).
You can use the following regular expression: (CAR[\s\S]*?(?=\)))

We can match the text you're interested in using the regex pattern (CAR.*)\) with the flags gms (global, multiline, dotall).
Then we just have to remove the newline characters from the resulting matches and write them to a file.
import re

with open("text.txt", 'r') as f:
    matches = re.findall(r"(CAR.*)\)", f.read(), re.DOTALL)
with open("output.txt", 'w') as f:
    for match in matches:
        f.write(" ".join(match.split('\n')))
        f.write('\n')
The output file looks like this:
CAR: *random info* *random info* - could be many lines of this
EDIT:
updated code to put newline between matches in output file
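One caveat with a greedy .* under re.DOTALL is that it runs to the last ")" in the whole file, so several CAR blocks end up in a single match. A possible variant, sketched under the assumption that every block ends with a ")" that stands alone on its line, uses a lazy quantifier together with re.MULTILINE. The names BLOCK and extract_blocks are made up for this example.

```python
import re

# Lazy match from 'CAR' up to the first ')' that starts a line and is
# followed only by spaces or tabs; the closing parenthesis is kept.
BLOCK = re.compile(r'CAR.*?^\)[ \t]*$', re.DOTALL | re.MULTILINE)

def extract_blocks(text):
    """Return each CAR block collapsed onto a single line."""
    # split() with no arguments splits on any whitespace, including newlines
    return [' '.join(block.split()) for block in BLOCK.findall(text)]

sample = "( CAR: a (x)\nb - many lines\n)\nother text\n( CAR: c\n)\n"
for block in extract_blocks(sample):
    print(block)
```

Parentheses that appear in the middle of a line, such as (x) in the sample, do not end the block because the pattern only accepts a ")" at the start of a line.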

Python Make newline after character

I would like to make a newline after a dot in a file.
For example:
Hello. I am damn cool. Lol
Output:
Hello.
I am damn cool.
Lol
I tried it like this, but somehow it's not working:
f2 = open(path, "w+")
for line in f2.readlines():
    f2.write("\n".join(line))
f2.close()
Could you help me there?
I don't want just one newline, I want a newline after every dot in a single file. It should iterate through the whole file and insert a newline after every single dot.
Thank you in advance!

This should be enough to do the trick:
with open('file.txt', 'r') as f:
    contents = f.read()
with open('file.txt', 'w') as f:
    f.write(contents.replace('. ', '.\n'))

You could split your string on . and store the pieces in a list, then just print out the list.
s = 'Hello. I am damn cool. Lol'
lines = s.split('.')
for line in lines:
    print(line)
If you do this, the output will be:
Hello
 I am damn cool
 Lol
To remove the leading spaces, you could split on '. ' (a dot followed by a space), or else use lstrip() when printing.
So, to do this for a file:
# open file for reading
with open('file.txt') as fr:
    # get the text in the file
    text = fr.read()
# split up the file into lines based on '.'
lines = text.split('.')
# open the file for writing
with open('file.txt', 'w') as fw:
    # loop over each line
    for line in lines:
        # remove leading whitespace, and write to the file with a newline
        fw.write(line.lstrip() + '\n')
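Splitting on '.' drops the dots, while the desired output in the question keeps them. A small regex-based sketch that keeps each dot and also handles dots with no space or several spaces after them; the function name break_after_dots is an assumption made for this example.

```python
import re

def break_after_dots(text):
    """Put a newline after every dot and drop the spaces or tabs that followed it."""
    return re.sub(r'\.[ \t]*', '.\n', text)

print(break_after_dots('Hello. I am damn cool. Lol'))
```

Note that this treats every dot as the end of a sentence, so decimal numbers and abbreviations would be broken across lines as well.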

Splitting lines in python based on some character

Input:
!,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/1
2/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:14,000.
0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W
55.576,+0013!,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013!,A,56
281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34
:18,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:19,000.0,0,37N22.
Output:
!,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:14,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:18,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:19,000.0,0,37N22.
'!' is the starting character and +0013 should be the ending of each line (if present).
The problem I am getting is that the output looks like this:
!,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/1
2/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:14,000.
0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W
Any help would be highly appreciated...!!!
My code:
file_open= open('sample.txt','r')
file_read= file_open.read()
file_open2= open('output.txt','w+')
counter =0
for i in file_read:
    if '!' in i:
        if counter == 1:
            file_open2.write('\n')
            counter= counter -1
        counter= counter +1
    file_open2.write(i)

You can try something like this:
with open("abc.txt") as f:
    data = f.read().replace("\r\n", "")  # remove the line breaks
    # the line break can be "\n" on your system instead of "\r\n"
ans = filter(None, data.split("!"))  # split the data at '!', then filter out empty pieces
for x in ans:
    print("!" + x)  # or write to some other file
Output:
!,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:14,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:18,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:19,000.0,0,37N22.

Could you just use str.split?
lines = file_read.split('!')
Now lines is a list which holds the split data. These are almost the lines you want to write. The only difference is that they don't have trailing newlines and they don't have '!' at the start. We can put those in easily with string formatting, e.g. '!{0}\n'.format(line). Then we can put that whole thing in a generator expression which we pass to file.writelines to put the data in a new file:
file_open2.writelines('!{0}\n'.format(line) for line in lines)
You might need:
file_open2.writelines('!{0}\n'.format(line.replace('\n','')) for line in lines)
if you find that you're getting more newlines than you wanted in the output.
A few other points: when opening files, it's nice to use a context manager. This makes sure that the file is closed properly:
with open('inputfile') as fin:
    # split at '!' and drop the empty piece in front of the first '!'
    lines = [line for line in fin.read().split('!') if line]
with open('outputfile','w') as fout:
    fout.writelines('!{0}\n'.format(line.replace('\n','')) for line in lines)

Another option, using replace instead of split, since you know the starting and ending characters of each line:
In [14]: data = """!,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/1
2/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:14,000.
0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W
55.576,+0013!,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013!,A,56
281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34
:18,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:19,000.0,0,37N22.""".replace('\n', '')
In [15]: print(data.replace('+0013!', "+0013\n!"))
!,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:14,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:18,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:19,000.0,0,37N22.

Just for some variety, here is a regular expression answer:
import re
outputFile = open('output.txt', 'w+')
with open('sample.txt', 'r') as f:
    for line in re.findall("!.+?(?=!|$)", f.read(), re.DOTALL):
        outputFile.write(line.replace("\n", "") + '\n')
outputFile.close()
It opens the output file, reads the contents of the input file, and loops through all the matches of the regular expression !.+?(?=!|$) with the re.DOTALL flag. An explanation of the regular expression and what it matches can be found here: http://regex101.com/r/aK6aV4
After we have a match, we strip the newlines out of it and write it to the file.

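Let's add a \n before every "!" (after removing the existing line breaks), then let Python's splitlines do the work :-) :
file_read.replace("\n", "").replace("!", "\n!").splitlines()
The first element of the result is an empty string when the data starts with "!".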

I would implement this as a generator, so that you can work on the data as a stream rather than on the entire content of the file. This is quite memory friendly when working with huge files.
>>> def split_on_stream(it,sep="!"):
        prev = ""
        for line in it:
            line = (prev + line.strip()).split(sep)
            for parts in line[:-1]:
                yield parts
            prev = line[-1]
        yield prev
>>> with open("test.txt") as fin:
        for parts in split_on_stream(fin):
            print(parts)
,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:14,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:18,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:19,000.0,0,37N22.
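The generator above drops the "!" and can yield an empty first piece. A possible variant that puts the separator back and skips empty pieces is sketched below; the name iter_records is made up for this example, and it assumes that the data starts with the separator.

```python
def iter_records(lines, sep='!'):
    """Yield complete records, each with its leading separator restored.

    Hypothetical helper; assumes the very first record starts with `sep`.
    """
    pending = ''
    for line in lines:
        # Glue the unfinished tail of the previous line to this line
        pieces = (pending + line.rstrip('\r\n')).split(sep)
        for piece in pieces[:-1]:
            if piece:  # skip the empty piece in front of the first separator
                yield sep + piece
        pending = pieces[-1]
    if pending:
        yield sep + pending

wrapped = ['!,A,1,+0013!,A,12/1\n', '2/19,+0013!,A,3\n']
for record in iter_records(wrapped):
    print(record)
```

Because it accepts any iterable of lines, the same function works on an open file object, so only one physical line plus the unfinished record is held in memory at a time.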
