Need to remove line breaks from a text file with certain conditions - python

I have a text file running into 20,000 lines. A block of meaningful data for me would consist of name, address, city, state,zip, phone. My file has each of these on a new line, so a file would go like:
StoreName1
, Address
, City
,State
,Zip
, Phone
StoreName2
, Address
, City
,State
,Zip
, Phone
I need to create a CSV file and will need the above information for each store in 1 single line :
StoreName1, Address, City,State,Zip, Phone
StoreName2, Address, City,State,Zip, Phone
So essentially, I am trying to remove \r\n only at the appropriate points. How do I do this with python re. Examples would be very helpful, am a newbie at this.
Thanks.

s/[\r\n]+,/,/g
Globally substitute 'linebreak(s),' with ','
Edit:
If you want to reduce it further with a single linebreak between records:
s/[\r\n]+(,|[\r\n])/$1/g
Globally substitute 'linebreaks(s) (comma or linebreak) with capture group 1.
Edit:
And, if it really gets out of whack, this might cure it:
s/[\r\n]+\s*(,|[\r\n])\s*/$1/g

This iterator/generator version doesn't require reading the entire file into memory at once
from itertools import groupby
with open("inputfile.txt") as f:
groups = groupby(f, key=str.isspace)
for row in ("".join(map(str.strip,x[1])) for x in groups if not x[0]):
...

Assuming the data is "normal" - see my comment - I'd approach the problem this way:
with open('data.txt') as fhi, open('newdata.txt', 'w') as fho:
# Iterate over the input file.
for store in fhi:
# Read in the rest of the pertinent data
fields = [next(fhi).rstrip() for _ in range(5)]
# Generate a list of all fields for this store.
row = [store.rstrip()] + fields
# Output to the new data file.
fho.write('%s\n' % ''.join(row))
# Consume a blank line in the input file.
next(fhi)

First mind-numbigly solution
import re
ch = ('StoreName1\r\n'
', Address\r\n'
', City\r\n'
',State\r\n'
',Zip\r\n'
', Phone\r\n'
'\r\n'
'StoreName2\r\n'
', Address\r\n'
', City\r\n'
',State\r\n'
',Zip\r\n'
', Phone')
regx = re.compile('(?:(?<=\r\n\r\n)|(?<=\A)|(?<=\A\r\n))'
'(.+?)\r\n(,.+?)\r\n(,.+?)\r\n(,.+?)\r\n(,.+?)\r\n(,[^\r\n]+)')
with open('csvoutput.txt','wb') as f:
f.writelines(''.join(mat.groups())+'\r\n' for mat in regx.finditer(ch))
ch mimics the content of a file on a Windows platform (newlines == \r\n)
Second mind-numbigly solution
regx = re.compile('(?:(?<=\r\n\r\n)|(?<=\A)|(?<=\A\r\n))'
'.+?\r\n,.+?\r\n,.+?\r\n,.+?\r\n,.+?\r\n,[^\r\n]+')
with open('csvoutput.txt','wb') as f:
f.writelines(mat.group().replace('\r\n','')+'\r\n' for mat in regx.finditer(ch))
Third mind-numbigly solution, if you want to create a CSV file with other delimiters than commas:
regx = re.compile('(?:(?<=\r\n\r\n)|(?<=\A)|(?<=\A\r\n))'
'(.+?)\r\n,(.+?)\r\n,(.+?)\r\n,(.+?)\r\n,(.+?)\r\n,([^\r\n]+)')
import csv
with open('csvtry3.txt','wb') as f:
csvw = csv.writer(f,delimiter='#')
for mat in regx.finditer(ch):
csvw.writerow(mat.groups())
.
EDIT 1
You are right , tchrist, the following solution is far simpler:
regx = re.compile('(?<!\r\n)\r\n')
with open('csvtry.txt','wb') as f:
f.write(regx.sub('',ch))
.
EDIT 2
A regex isn't required:
with open('csvtry.txt','wb') as f:
f.writelines(x.replace('\r\n','')+'\r\n' for x in ch.split('\r\n\r\n'))
.
EDIT 3
Treating a file, no more ch:
'à la gnibbler" solution, in cases when the file can't be read all at once in memory because it is too big:
from itertools import groupby
with open('csvinput.txt','r') as f,open('csvoutput.txt','w') as g:
groups = groupby(f,key= lambda v: not str.isspace(v))
g.writelines(''.join(x).replace('\n','')+'\n' for k,x in groups if k)
I have another solution with regex:
import re
regx = re.compile('^((?:.+?\n)+?)(?=\n|\Z)',re.MULTILINE)
with open('input.txt','r') as f,open('csvoutput.txt','w') as g:
g.writelines(mat.group().replace('\n','')+'\n' for mat in regx.finditer(f.read()))
I find it similar to the gnibbler-like solution

f = open(infilepath, 'r')
s = ''.join([line for line in f])
s = s.replace('\n\n', '\\n')
s = s.replace('\n', '')
s = s.replace("\\n", "\n")
f.close()
f = open(infilepath, 'r')
f.write(s)
f.close()
That should do it. It will replace your input file with the new format

Related

Extracting substrings from CSV table

I'm trying to clean up the data from a csv table that looks like this:
KATY PERRY#katyperry
1,084,149,282,038,820
Justin Bieber#justinbieber
10,527,300,631,674,900,000
Barack Obama#BarackObama
9,959,243,562,511,110,000
I want to extract just the "#" handles, such as:
#katyperry
#justinbieber
#BarackObama
This is the code I've put togheter, but all it does is repeat the second line of the table over and over:
import csv
import re
with open('C:\\Users\\TK\\Steemit\\Scripts\\twitter.csv', 'rt', encoding='UTF-8') as inp:
read = csv.reader(inp)
for row in read:
for i in row:
if i.isalpha():
stringafterword = re.split('\\#\\',row)[-1]
print(stringafterword)
If you are willing to use re, you can get a list of strings in one line:
import re
#content string added to make it a working example
content = """KATY PERRY#katyperry
1,084,149,282,038,820
Justin Bieber#justinbieber
10,527,300,631,674,900,000
Barack Obama#BarackObama
9,959,243,562,511,110,000"""
#solution using 're':
m = re.findall('#.*', content)
print(m)
#option without 're' but using string.find() based on your loop:
for row in content.split():
pos_of_at = row.find('#')
if pos_of_at > -1: #-1 indicates "substring not found"
print(row[pos_of_at:])
You should of course replace the contentstring with the file content.
Firstly the "#" symbol is a symbol. Therefore the if i.isalpha(): will return False as it is NOT a alpha character. Your re.split() won't even be called.
Try this:
import csv
import re
with open('C:\\Users\\input.csv', 'rt', encoding='UTF-8') as inp:
read = csv.reader(inp)
for row in read:
for i in row:
stringafterword = re.findall('#.*',i)
print(stringafterword)
Here I have removed the if-condition and changed the re.split() index to 1 as that is the section you want.
Hope it works.

Removing punctuation and change to lowercase in python CSV file

The code below allow me to open the CSV file and change all the texts to lowercase. However, i have difficulties trying to also remove the punctuation in the CSV file. How can i do that? Do i use string.punctuation?
file = open('names.csv','r')
lines = [line.lower() for line in file]
with open('names.csv','w') as out
out.writelines(sorted(lines))
print (lines)
sample of my few lines from the file:
Justine_123
ANDY*#3
ADRIAN
hEnNy!
You can achieve this by importing strings and make use of the following example code below.
The other way you can achieve this is by using regex.
import string
str(lines).translate(None, string.punctuation)
Also you may want to learn more about how import string works and its features
The working example you requested for.
import string
with open("sample.csv") as csvfile:
lines = [line.lower() for line in csvfile]
print(lines)
will give you ['justine_123\n', 'andy*#3\n', 'adrian\n', 'henny!']
punc_table = str.maketrans({key: None for key in string.punctuation})
new_res = str(lines).translate(punc_table)
print(new_res)
new_s the result will give you justine123n andy3n adriann henny
Example with regular expressions.
import csv
import re
filename = ('names.csv')
def reg_test(name):
reg_result = ''
with open(name, 'r') as csvfile:
reader = csv.reader(csvfile)
for row in reader:
row = re.sub('[^A-Za-z0-9]+', '', str(row))
reg_result += row + ','
return reg_result
print(reg_test(filename).lower())
justine123,andy3,adrian,henny,

Reading a structured text file in Python

I have a text file in the following format:
1. AUTHOR1
(blank line, with a carriage return)
Citation1
2. AUTHOR2
(blank line, with a carriage return)
Citation2
(...)
That is, in this file, some lines begin with an integer number, followed by a dot, a space, and text indicating an author's name; these lines are followed by a blank line (which includes a carriage return), and then for a line of text beginning with an alphabetic character (an article or book citation).
What I want is to read this file into a Python list, joining the author's names and citation, so that each list element is of the form:
['AUTHOR1 Citation1', 'AUTHOR2 Citation2', '...']
It looks like a simple programming problem, but I could not figure out a solution to it. What I attempted was as follows:
articles = []
with open("sample.txt", "rb") as infile:
while True:
text = infile.readline()
if not text: break
authors = ""
citation = ""
if text == '\n': continue
if text[0].isdigit():
authors = text.strip('\n')
else:
citation = text.strip('\n'
articles.append(authors+' '+citation)
but the articles list gets authors and citations stored as separate elements!
Thanks in advance for any help in solving this vexing problem... :-(
Assuming your input file structure:
"""
1. AUTHOR1
Citation1
2. AUTHOR2
Citation2
"""
is not going to change I would use readlines() and slicing:
with open('sample.txt', 'r') as infile:
lines = infile.readlines()
if lines:
lines = filter( lambda x : x != '\n', lines ) # remove empty lines
auth = map( lambda x : x.strip().split('.')[-1].strip(), lines[0::2] )
cita = map( lambda x : x.strip(), lines[1::2] )
result = [ '%s %s'%(auth[i], cita[i]) for i in xrange( len( auth )) ]
print result
# ['AUTHOR1 Citation1', 'AUTHOR2 Citation2']
The problem is that, in each looping iteration you are only getting one, author or citation and not both. So, when you do the append you only have one element.
One way to fix this is to read both in each looping iteration.
This should work:
articles = []
with open("sample.txt") as infile:
for raw_line in infile:
line = raw_line.strip()
if not line:
continue
if line[0].isdigit():
author = line.split(None, 1)[-1]
else:
articles.append('{} {}'.format(author, line))
Solution processing a full entry in each loop iteration:
citations = []
with open('sample.txt') as file:
for author in file: # Reads an author line
next(file) # Reads and ignores the empty line
citation = next(file).strip() # Reads the citation line
author = author.strip().split(' ', 1)[1]
citations.append(author + ' ' + citation)
print(citations)
Solution first reading all lines and then going through them:
citations = []
with open('sample.txt') as file:
lines = list(map(str.strip, file))
for author, citation in zip(lines[::3], lines[2::3]):
author = author.split(' ', 1)[1]
citations.append(author + ' ' + citation)
print(citations)
The solutions based on slicing are pretty neat, but if there's just one blank line out of place, it throws the whole thing off. Here's a solution using regex which should work even if there's a variation in the structure:
import re
pattern = re.compile(r'(^\d\..*$)\n*(^\w.*$)', re.MULTILINE)
with open("sample.txt", "rb") as infile:
lines = infile.readlines()
matches = pattern.findall(lines)
formatted_output = [author + ' ' + citation for author, citation in matches]
You can use readline to skip empty lines.
Here's your loop body:
author = infile.readline().strip().split(' ')[1]
infile.readline()
citation = infile.readline()
articles.append("{} {}".format(author, citation))

extract float numbers from data file

I'm trying to extract the values (floats) from my datafile. I only want to extract the first value on the line, the second one is the error. (eg. xo # 9.95322254_0.00108217853
means 9.953... is value, 0.0010.. is error)
Here is my code:
import sys
import re
inf = sys.argv[1]
out = sys.argv[2]
f = inf
outf = open(out, 'w')
intensity = []
with open(inf) as f:
pattern = re.compile(r"[^-\d]*([\-]{0,1}\d+\.\d+)[^-\d]*")
for line in f:
f.split("\n")
match = pattern.match(line)
if match:
intensity.append(match.group(0))
for k in range(len(intensity)):
outf.write(intensity[k])
but it doesn't work. The output file is empty.
the lines in data file look like:
xo_Is
xo # 9.95322254`_0.00108217853
SPVII_to_PVII_Peak_type
PVII_m(#, 1.61879`_0.08117)
PVII_h(#, 0.11649`_0.00216)
I # 0.101760618`_0.00190314017
each time the first number is the value I want to extract and the second one is the error.
You were almost there, but your code contains errors preventing it from running. The following works:
pattern = re.compile(r"[^-\d]*(-?\d+\.\d+)[^-\d]*")
with open(inf) as f, open(out, 'w') as outf:
for line in f:
match = pattern.match(line)
if match:
outf.write(match.group(1) + '\n')
I think you should test your pattern on a simple string instead of file. This will show where is the error: in pattern or in code which parsing file. Pattern looks good. Additionally in most languages i know group(0) is all captured data and for your number you need to use group(1)
Are you sure that f.slit('\n') must be inside for?

Read a multi-line string as one line in Python

I am writing a program that analyzes a large directory text file line-by-line. In doing so, I am trying to extract different parts of the file and categorize them as 'Name', 'Address', etc. However, due to the format of the file, I am running into a problem. Some of the text i have is split into two lines, such as:
'123 ABCDEF ST
APT 456'
How can I make it so that even through line-by-line analysis, Python returns this as a single-line string in the form of
'123 ABCDEF ST APT 456'?
if you want to remove newlines:
"".join( my_string.splitlines())
Assuming you are using windows if you do a print of the file to your screen you will see
'123 ABCDEF ST\nAPT 456\n'
the \n represent the line breaks.
so there are a number of ways to get rid of the new lines in the file. One easy way is to split the string on the newline characters and then rejoin the items from the list that will be created when you do the split
myList = [item for item in myFile.split('\n')]
newString = ' '.join(myList)
To replace the newlines with a space:
address = '123 ABCDEF ST\nAPT 456\n'
address.replace("\n", " ")
import re
def mergeline(c, l):
if c: return c.rstrip() + " " + l
else: return l
def getline(fname):
qstart = re.compile(r'^\'[^\']*$')
qend = re.compile(r'.*\'$')
with open(fname) as f:
linecache, halfline = ("", False)
for line in f:
if not halfline: linecache = ""
linecache = mergeline(linecache, line)
if halfline: halfline = not re.match(qend, line)
else: halfline = re.match(qstart, line)
if not halfline:
yield linecache
if halfline:
yield linecache
for line in getline('input'):
print line.rstrip()
Assuming you're iterating through your file with something like this:
with open('myfile.txt') as fh:
for line in fh:
# Code here
And also assuming strings in your text file are delimited with single quotes, I would do this:
while not line.endswith("'"):
line += next(fh)
That's a lot of assuming though.
i think i might have found a easy solution just put .replace('\n', " ") to whatever string u want to convert
Example u have
my_string = "hi i am an programmer\nand i like to code in python"
like anything and if u want to convert it u can just do
my_string.replace('\n', " ")
hope it helps

Categories