Given that the infile contains:
aaaaaaa"pic01.jpg"bbbwrtwbbbsize 110KB
aawerwefrewqa"pic02.jpg"bbbertebbbsize 100KB
atyrtyruraa"pic03.jpg"bbbwtrwtbbbsize 190KB
How to obtain the outfile as:
pic01.jpg 110KB
pic02.jpg 100KB
pic03.jpg 190KB
My code is:
with open('test.txt', 'r') as infile, open('outfile.txt', 'w') as outfile:
    for line in infile:
        lines_set1 = line.split('"')
        lines_set2 = line.split(' ')
        for item_set1 in lines_set1:
            for item_set2 in lines_set2:
                if item_set1.endswith('.jpg'):
                    if item_set2.endswith('KB'):
                        outfile.write(item_set1 + ' ' + item_set2 + '\n')
What is wrong with my code, please help!
The problem has been solved here:
what is wrong in the code written in python
Often you can solve string manipulation problems without regex, as Python has an amazing string library. In your case, calling str.split twice with different delimiters (a quote, then whitespace) solves your issue.
Demo
>>> st = """aaaaaaa"pic01.jpg"bbbwrtwbbbsize 110KB
aawerwefrewqa"pic02.jpg"bbbertebbbsize 100KB
atyrtyruraa"pic03.jpg"bbbwtrwtbbbsize 190KB"""
>>> def foo(st):
...     # Split the string on the quotation mark
...     _, fname, rest = st.split('"')
...     # From the remaining part, split on whitespace
...     # and select the last field
...     rest = rest.split()[-1]
...     # Join and return fname and the size
...     return ' '.join([fname, rest])
...
>>> for e in st.splitlines():
...     print(foo(e))
...
pic01.jpg 110KB
pic02.jpg 100KB
pic03.jpg 190KB
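A likely reason the nested loops in the question write almost nothing is that every line read from the file still ends with '\n', so the last field is '110KB\n' and endswith('KB') is false. The two-split idea can be applied to lines exactly as they come from a file. This is only a minimal sketch; the helper name extract_name_and_size is made up for illustration and it assumes each line contains exactly one quoted name.

```python
def extract_name_and_size(line):
    """Return 'name size' for a line like aaa"pic01.jpg"bbbsize 110KB."""
    # The quoted file name is the second field when splitting on '"'.
    _, fname, rest = line.split('"')
    # split() with no argument also discards the trailing newline.
    size = rest.split()[-1]
    return fname + ' ' + size


sample = [
    'aaaaaaa"pic01.jpg"bbbwrtwbbbsize 110KB\n',
    'aawerwefrewqa"pic02.jpg"bbbertebbbsize 100KB\n',
]
results = [extract_name_and_size(line) for line in sample]
```

When writing to the output file, add the '\n' back to each result.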
Regex would be easier:
import re

with open('test.txt', 'r') as infile, open('outfile.txt', 'w') as outfile:
    for line in infile:
        # group 1: the quoted file name; group 2: the size, e.g. 110KB
        m = re.search(r'"([^"]+)".*? (\d+.B)', line)
        if m:
            outfile.write(m.group(1) + ' ' + m.group(2) + '\n')
You can use regex and str.rsplit here; your code seems to be overkill for this simple task:
>>> import re
>>> strs = 'aaaaaaa"pic01.jpg"bbbwrtwbbbsize 110KB\n'
>>> name = re.search(r'"(.*?)"', strs).group(1)
>>> size = strs.rsplit(None, 1)[-1]
>>> name, size
('pic01.jpg', '110KB')
or
>>> name, size = re.search(r'"(.*?)".*?(\w+)$', strs).groups()
>>> name, size
('pic01.jpg', '110KB')
Now use string formatting:
>>> "{} {}\n".format(name, size) #write this to file
'pic01.jpg 110KB\n'
I am trying to find the best way to overwrite a file with zeros; every character in the file will be replaced by 0.
Currently I have this working:
import fileinput
for line in fileinput.FileInput('/path/to/file', inplace=1):
    for x in line:
        x = 0
But this looks very inefficient; is there a better way to do it?
Instead of replacing the characters one by one, I prefer to create a new file with the same name and the same size.
Obtaining the size of the current file:
>>> import os
>>> file_info = os.stat("/path/to/file")
>>> size = file_info.st_size
Creating another file of the same size, filled with 0x00 bytes:
>>> f = open("/path/to/file", "wb")
>>> f.seek(size - 1)
>>> f.write(b"\x00")
>>> f.close()
I assumed that by 0 you meant the byte value 0x00. Note that this fails for an empty file, because size - 1 is then negative.
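If the bytes should be overwritten in place rather than by truncating and re-creating the file, something along these lines may work. It is a sketch under assumptions: the function name zero_fill and the chunk size are made up, and the file is assumed to fit the usual open/write semantics (no special devices).

```python
import os
import tempfile


def zero_fill(path, chunk_size=64 * 1024):
    """Overwrite every byte of the file at `path` with 0x00, keeping its size."""
    remaining = os.path.getsize(path)
    # "r+b" opens for in-place binary update without truncating the file.
    with open(path, "r+b") as f:
        while remaining > 0:
            n = min(chunk_size, remaining)
            f.write(b"\x00" * n)
            remaining -= n


# Usage on a throwaway file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello\nworld")
zero_fill(tmp.name, chunk_size=4)
with open(tmp.name, "rb") as f:
    data = f.read()
os.remove(tmp.name)
```

Writing in chunks keeps memory use bounded for large files, and an empty file is simply left untouched.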
Use regex replacement, maybe?
import re
path = "test.txt"
with open(path, "r") as f:
    # "." matches every character except a newline, so line breaks are kept
    data = re.sub(".", "0", f.read())
with open(path, "w") as f:
    f.write(data)
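Whether the line breaks survive depends on the regex flags. The small sketch below (the function name zero_text is hypothetical) contrasts the default behaviour of "." with re.DOTALL, which makes "." match newlines too.

```python
import re


def zero_text(text):
    # re.DOTALL is not set, so "." skips "\n" and the line structure survives.
    return re.sub(r".", "0", text)


zeroed = zero_text("ab\ncd\n")
# With re.DOTALL the newlines are replaced as well.
all_zero = re.sub(r".", "0", "ab\ncd\n", flags=re.DOTALL)
```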
Using a regex is probably cleaner, but here is a solution using fileinput:
import fileinput
import sys
for line in fileinput.FileInput('/path/to/file', inplace=True):
    # Exclude the trailing newline from the count, then add it back
    line = '0' * len(line.rstrip('\n'))
    sys.stdout.write(line + "\n")
Note: if you use print here, an extra newline is added to each line, so I used sys.stdout.write.
You can check this:
import fileinput
for line in fileinput.FileInput('/path/to/file', inplace=1):
    # print adds a newline, so count the line without its own newline
    print(len(line.rstrip('\n')) * '0')
I am looking for an efficient way to strip numbers, dates, or any other characters from the end of a string, up to the last letter.
string - '12.abd23yahoo 04/44 231'
Output - '12.abd23yahoo'
line_inp = "12.abd23yahoo 04/44 231"
line_out = line_inp.rstrip('0123456789./')
This rstrip() call doesn't work as expected; I get '12.abd23yahoo 04/44 ' instead.
I am trying the code below, and it doesn't seem to be working.
for fname in filenames:
    with open(fname) as infile:
        for line in infile:
            outfile.write(line.rstrip('0123456789./ '))
You need to strip spaces too:
line_out = line_inp.rstrip('0123456789./ ')
Demo:
>>> line_inp = "12.abd23yahoo 04/44 231"
>>> line_inp.rstrip('0123456789./ ')
'12.abd23yahoo'
You need to strip the newline and add it back before you write:
for fname in filenames:
    with open(fname) as infile:
        outfile.writelines(line.rstrip('0123456789./ \n') + "\n"
                           for line in infile)
If the format is always the same, you can just split:
with open(fname) as infile:
    outfile.writelines(line.split(None, 1)[0] + "\n"
                       for line in infile)
Here's a solution using a regular expression:
import re
line_inp = "12.abd23yahoo 04/44 231"
r = re.compile(r'^(.*[a-zA-Z])')
m = re.match(r, line_inp)
line_out = m.group(0) # 12.abd23yahoo
The regular expression matches a group of arbitrary characters that ends in a letter. Because .* is greedy, the match extends to the last letter in the string.
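rstrip() treats its argument as a set of characters to remove, so every unwanted character has to be listed. An alternative that needs no such list is to delete the trailing run of non-letters with re.sub. This is a sketch only: the function name is made up, and "letter" is assumed to mean ASCII a-z and A-Z.

```python
import re


def strip_after_last_letter(s):
    # Remove the run of non-letters at the end of the string, if any.
    return re.sub(r"[^a-zA-Z]+$", "", s)


cleaned = strip_after_last_letter("12.abd23yahoo 04/44 231")
with_newline = strip_after_last_letter("12.abd23yahoo 04/44 231\n")
```

Since '\n' is also a non-letter, a trailing newline is removed too and has to be written back when processing a file line by line.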
Started with python after a long time:
Basically I am trying to read a line from a file:
MY_FILE ='test1.hgx'
Eventually I want to change this test1.hgx with:
test1_para1_para2_para3.hgx
where para1, para2 and para3 are the parameters I want to write.
I wrote the code below:
add_name= '%s'%(filename)+'_'+'%s'%(para1)+'_'+'%s'%(para2)+'_'+'%s'%(para3)+'.hgx'
print "added_name:",add_name
with open(filename) as f: lines = f.read().splitlines()
with open(filename, 'w') as f:
    for line in lines:
        if line.startswith(' MY_FILE'):
            f.write(line.rsplit(' ', 1)[0] + "\'%s'\n"%add_name)
        else:
            f.write(line + '\n')
f.close
The above code works as expected and writes out when I execute the python code once:
MY_FILE ='test1_01_02_03.hgx'
However, when I run the code a second time, it eats up the '=' and writes the following:
MY_FILE 'test1_01_02_03.hgx'
Can I add something to my existing code so that it always writes ='test1_01_02_03.hgx'? I think there is a problem with:
f.write(line.rsplit(' ', 1)[0] + "\'%s'\n"%add_name)
However I am not able to figure out the problem. Any ideas would be helpful. Thanks.
Change:
f.write(line.rsplit(' ', 1)[0] + "\'%s'\n"%add_name)
to
f.write(line.rsplit('=', 1)[0] + "=\'%s'\n"%add_name)
Incidentally, are you sure that in the original file, there wasn't a space after the =? If there is no space after the =, this code will always eat up the =. If there is a space, it won't eat it up until the second time the code is run.
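The effect of the space can be checked directly, and splitting on the '=' itself gives a rewrite that produces the same line no matter how often it runs. The helper below is a sketch; the name set_value is hypothetical and it assumes the key never contains '='.

```python
def set_value(line, new_value):
    """Return `line` with everything after the first '=' replaced."""
    key, sep, _ = line.partition('=')
    if not sep:
        # No '=' on this line: leave it alone.
        return line
    return "%s='%s'" % (key, new_value)


# What rsplit(' ', 1) keeps with and without a space after the '='.
kept_with_space = "MY_FILE = 'test1.hgx'".rsplit(' ', 1)[0]
kept_without_space = "MY_FILE ='test1.hgx'".rsplit(' ', 1)[0]

once = set_value("MY_FILE ='test1.hgx'", "test1_01_02_03.hgx")
twice = set_value(once, "test1_01_02_03.hgx")
```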
You are splitting on ' ', which is before the =, but not adding another = back. There are many ways you can do this, but the easiest may be to simply add the = back in:
f.write(line.rsplit(' ', 1)[0] + "='%s'\n" % add_name)
Another, cleaner, way to do it would be to use replace:
f.write(line.replace(filename, new_name))
As an aside, you can write the first line much better as:
add_name = '%s_%s_%s_%s.hgx' % (filename, para1, para2, para3)
Try using the fileinput module. Also, use format() to write into strings.
# Using the fileinput module we can rewrite a file in place by writing to stdout.
import fileinput
# Using sys.stdout.write we avoid print's formatting and its extra newlines.
import sys
filename = 'testfile'
params = ['foo', 'bar', 'baz']
# Build the new replacement value.
newline = '{0}_{1}_{2}_{3}.hgx'.format(filename, params[0], params[1], params[2])
for line in fileinput.input(filename, inplace=1):
    if line.startswith('MY_FILE'):
        sys.stdout.write("MY_FILE = '{0}'\n".format(newline))
    else:
        sys.stdout.write(line)
This should replace any line starting with MY_FILE with the line MY_FILE = 'testfile_foo_bar_baz.hgx'
I am reading the text file which gives me output something like:
o hi! My name is Saurabh.
o I like python.
I have to convert the above output into:
*1 hi! My name is Saurabh
*2 I like python.
Simple string replace (replacing "\no" with "") followed by adding numbers in python gave me :
*1
o hi! My name is Saurabh
*2
o I like python.
Could anybody help me in getting the right output as
*1 hi! My name is Saurabh
*2 I like python.
with open('sample.txt', 'r') as fin:
    lines = fin.readlines()
with open('sample_output.txt', 'w') as fout:
    index = 1
    for line in lines:
        if line[0] == 'o':
            line = '*' + str(index) + line[1:]
            index += 1
        fout.write(line.rstrip() + '\n')
If you read line by line, replacing '\no' is not a solution, because the '\n' is not at the start of your line.
You will need to use regex in this case:
import re
f = open('test.txt')
h = open('op.txt','w')
gen = (line.strip() for line in f)
for line in enumerate(gen,1):
    h.write(re.sub('^o','*'+str(line[0]),line[1]) + '\n')
f.close()
h.close()
PS: You might want to check whether the line is empty and, if so, skip it; otherwise write it to the new file.
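The empty-line check from the PS could look roughly like this. It is a sketch with assumptions: the generator name number_bullets is made up, and a bullet is taken to be a line that starts with 'o' followed by a space, so that ordinary words beginning with "o" are not renumbered.

```python
def number_bullets(lines):
    """Yield lines with a leading 'o ' bullet replaced by '*1 ', '*2 ', ..."""
    index = 1
    for line in lines:
        text = line.rstrip('\n')
        if not text.strip():
            # Skip blank lines so they neither appear nor consume a number.
            continue
        if text.startswith('o '):
            text = '*%d%s' % (index, text[1:])
            index += 1
        yield text + '\n'


numbered = list(number_bullets([
    "o hi! My name is Saurabh.\n",
    "\n",
    "o I like python.\n",
]))
```

The result can be passed straight to writelines() on the output file.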
This is my solution:
import re
f_in=open('data_in.txt', 'r')
f_out=open('data_out.txt', 'w')
ln=1
for line in f_in:
    s = re.sub('^o+','*%-3i' % ln,line)
    f_out.write(s)
    if not line=='\n': ln += 1
f_in.close()
f_out.close()
I have a text file running into 20,000 lines. A block of meaningful data for me would consist of name, address, city, state,zip, phone. My file has each of these on a new line, so a file would go like:
StoreName1
, Address
, City
,State
,Zip
, Phone

StoreName2
, Address
, City
,State
,Zip
, Phone
I need to create a CSV file and will need the above information for each store in 1 single line :
StoreName1, Address, City,State,Zip, Phone
StoreName2, Address, City,State,Zip, Phone
So essentially, I am trying to remove \r\n only at the appropriate points. How do I do this with Python's re module? Examples would be very helpful; I am a newbie at this.
Thanks.
s/[\r\n]+,/,/g
Globally substitute 'linebreak(s),' with ','
Edit:
If you want to reduce it further with a single linebreak between records:
s/[\r\n]+(,|[\r\n])/$1/g
Globally substitute 'linebreak(s) followed by a comma or a linebreak' with capture group 1.
Edit:
And, if it really gets out of whack, this might cure it:
s/[\r\n]+\s*(,|[\r\n])\s*/$1/g
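The substitutions above are written in Perl/sed notation. In Python the first one might be expressed with re.sub as sketched below; the sample text is an assumption that mirrors the question, with a blank line between records.

```python
import re

text = ("StoreName1\r\n, Address\r\n, City\r\n,State\r\n,Zip\r\n, Phone\r\n"
        "\r\n"
        "StoreName2\r\n, Address\r\n, City\r\n,State\r\n,Zip\r\n, Phone")

# Python equivalent of s/[\r\n]+,/,/g: drop line breaks that precede a comma.
joined = re.sub(r"[\r\n]+,", ",", text)
# Collapse the remaining runs of line breaks between records and split there.
records = re.sub(r"[\r\n]+", "\n", joined).split("\n")
```

re.sub replaces every match by default, so no equivalent of the /g flag is needed.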
This iterator/generator version doesn't require reading the entire file into memory at once:
from itertools import groupby
with open("inputfile.txt") as f:
    groups = groupby(f, key=str.isspace)
    for row in ("".join(map(str.strip,x[1])) for x in groups if not x[0]):
        ...
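One way the loop body might be filled in is to hand each group to the csv module, which takes care of quoting. This is a sketch written for Python 3, not the answerer's code: the generator name records is made up, the sample data is held in io.StringIO instead of real files, and records are assumed to be separated by blank lines.

```python
import csv
import io
from itertools import groupby


def records(lines):
    """Yield one list of fields per blank-line-separated block."""
    for is_blank, block in groupby(lines, key=str.isspace):
        if not is_blank:
            # Drop the leading comma and surrounding whitespace of each field.
            yield [field.strip().lstrip(',').strip() for field in block]


sample = io.StringIO(
    "StoreName1\n, Address\n, City\n,State\n,Zip\n, Phone\n"
    "\n"
    "StoreName2\n, Address\n, City\n,State\n,Zip\n, Phone\n"
)
out = io.StringIO()
csv.writer(out, lineterminator="\n").writerows(records(sample))
```

With real files, replace the two StringIO objects with open() calls; the output file should be opened with newline='' as the csv documentation recommends.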
Assuming the data is "normal" - see my comment - I'd approach the problem this way:
with open('data.txt') as fhi, open('newdata.txt', 'w') as fho:
    # Iterate over the input file.
    for store in fhi:
        # Read in the rest of the pertinent data
        fields = [next(fhi).rstrip() for _ in range(5)]
        # Generate a list of all fields for this store.
        row = [store.rstrip()] + fields
        # Output to the new data file.
        fho.write('%s\n' % ''.join(row))
        # Consume the blank line after the record, if there is one.
        next(fhi, None)
First mind-numbing solution
import re
ch = ('StoreName1\r\n'
      ', Address\r\n'
      ', City\r\n'
      ',State\r\n'
      ',Zip\r\n'
      ', Phone\r\n'
      '\r\n'
      'StoreName2\r\n'
      ', Address\r\n'
      ', City\r\n'
      ',State\r\n'
      ',Zip\r\n'
      ', Phone')
regx = re.compile('(?:(?<=\r\n\r\n)|(?<=\A)|(?<=\A\r\n))'
                  '(.+?)\r\n(,.+?)\r\n(,.+?)\r\n(,.+?)\r\n(,.+?)\r\n(,[^\r\n]+)')
with open('csvoutput.txt','wb') as f:
    f.writelines(''.join(mat.groups())+'\r\n' for mat in regx.finditer(ch))
ch mimics the content of a file on a Windows platform (newlines == \r\n)
Second mind-numbing solution
regx = re.compile('(?:(?<=\r\n\r\n)|(?<=\A)|(?<=\A\r\n))'
                  '.+?\r\n,.+?\r\n,.+?\r\n,.+?\r\n,.+?\r\n,[^\r\n]+')
with open('csvoutput.txt','wb') as f:
    f.writelines(mat.group().replace('\r\n','')+'\r\n' for mat in regx.finditer(ch))
Third mind-numbing solution, if you want to create a CSV file with a delimiter other than a comma:
regx = re.compile('(?:(?<=\r\n\r\n)|(?<=\A)|(?<=\A\r\n))'
                  '(.+?)\r\n,(.+?)\r\n,(.+?)\r\n,(.+?)\r\n,(.+?)\r\n,([^\r\n]+)')
import csv
with open('csvtry3.txt','wb') as f:
    csvw = csv.writer(f,delimiter='#')
    for mat in regx.finditer(ch):
        csvw.writerow(mat.groups())
EDIT 1
You are right, tchrist, the following solution is far simpler:
regx = re.compile('(?<!\r\n)\r\n')
with open('csvtry.txt','wb') as f:
    f.write(regx.sub('',ch))
EDIT 2
A regex isn't required:
with open('csvtry.txt','wb') as f:
    f.writelines(x.replace('\r\n','')+'\r\n' for x in ch.split('\r\n\r\n'))
EDIT 3
Treating a file, no more ch:
An "à la gnibbler" solution, for cases where the file is too big to be read into memory all at once:
from itertools import groupby
with open('csvinput.txt','r') as f,open('csvoutput.txt','w') as g:
    groups = groupby(f,key= lambda v: not str.isspace(v))
    g.writelines(''.join(x).replace('\n','')+'\n' for k,x in groups if k)
I have another solution with regex:
import re
regx = re.compile('^((?:.+?\n)+?)(?=\n|\Z)',re.MULTILINE)
with open('input.txt','r') as f,open('csvoutput.txt','w') as g:
    g.writelines(mat.group().replace('\n','')+'\n' for mat in regx.finditer(f.read()))
I find it similar to the gnibbler-like solution
f = open(infilepath, 'r')
s = ''.join([line for line in f])
s = s.replace('\n\n', '\\n')
s = s.replace('\n', '')
s = s.replace("\\n", "\n")
f.close()
f = open(infilepath, 'w')
f.write(s)
f.close()
That should do it. It will replace your input file with the new format