Python CSV: remove new lines denoted by &#x0D

I have a BCP file that contains lots of &#x0D carriage return symbols. They are not meant to be there, and since I have no control over the original output I am left with trying to parse the file to remove them.
A sample of the data looks like:
"test1","apples","this is &#x0D
some sample","3877"
"test66","bananas","this represents more &#x0D
wrong data","378"
I am trying to end up with:
"test1","apples","this is some sample","3877"
"test66","bananas","this represents more wrong data","378"
Is there a simple way to do this, preferably using the Python csv module?

You can try:
import re

with open("old.csv") as f, open("new.csv", "w") as w:
    for line in f:
        line = re.sub(r"&#x0D\s*", "", line)
        w.write(line)
"test1","apples","this is some sample","3877"
"test66","bananas","this represents more wrong data","378"
Demo
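Since the question asks for the csv module: csv.reader keeps a quoted field together even when it spans physical lines, so you can also clean each field after parsing. A sketch along those lines (the &#x0D marker handling mirrors the regex above; the file names are reused from it):

import csv

with open("old.csv", newline="") as f, open("new.csv", "w", newline="") as w:
    reader = csv.reader(f)
    # QUOTE_ALL matches the sample data, where every field is quoted
    writer = csv.writer(w, quoting=csv.QUOTE_ALL)
    for row in reader:
        # drop the stray markers and collapse any leftover whitespace
        writer.writerow(" ".join(field.replace("&#x0D", " ").split())
                        for field in row)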

Related

read the header and replace a column value with another one in Python

I am a newbie to Python and I am trying to read in a file with the below format:
ORDER_NUMBER!Speed_Status!Days!
10!YES!100!
10!NO!100!
10!TRUE!100!
And the output to be written to the same file is
ORDER_NUMBER!STATUS!Days!
10!YES!100!
10!NO!100!
10!TRUE!100!
so far I tried:
# a file named "repo", will be opened with the reading mode.
file = open('repo.dat', 'r+')
# This will print every line one by one in the file
for line in file:
    if line.startswith('ORDER_NUMBER'):
        words = [w.replace('Speed_Status', 'STATUS') for w in line.partition('!')]
        file.write(words)
input()
But somehow it's not working. What am I missing?
Read file ⇒ replace content ⇒ write to file:
with open('repo.dat', 'r') as f:
    data = f.read()

data = data.replace('Speed_Status', 'STATUS')

with open('repo.dat', 'w') as f:
    f.write(data)
The ideal way would be to use the fileinput module to replace the file contents in place, instead of opening the file in update mode r+:
from __future__ import print_function
import fileinput

for line in fileinput.input("repo.dat", inplace=True):
    if line.startswith('ORDER_NUMBER'):
        print(line.replace("Speed_Status", "STATUS"), end="")
    else:
        print(line, end="")
As for why your attempt didn't work: when you partition the line on !, the list you build is ['ORDER_NUMBER', '!', 'STATUS!Days!\n'], with the newline still embedded, rather than the single rewritten line you want. Also, your write() call will never accept a list; you need to join the parts back into one string before writing it.
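You can see this in the REPL:
>>> line = 'ORDER_NUMBER!Speed_Status!Days!\n'
>>> [w.replace('Speed_Status', 'STATUS') for w in line.partition('!')]
['ORDER_NUMBER', '!', 'STATUS!Days!\n']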

Replacing character in JSON file with Python. Problem with editing due to big file (over 1 GB)

I have a big JSON file (exported via Azure Data Factory). If Data Factory finds an issue, it adds $ signs between objects, so the file looks like this:
{...}
{...}
{...}${...}
So I get an error, for example: json.decoder.JSONDecodeError: Extra data: line 1 column 21994 (char 21993)
I used to deal with it the easy way: replacing $ with \n in Notepad++, and it was good ;) but now my file is about 1.3 GB and I don't have a tool that can edit such a big file.
I use Python to extract the data from all the JSON objects in the file and export it to XML files.
Now I'm looking for a solution to replace all of the $ signs with newlines (\n) and clean the file.
The beginning of my code is:
a = open('test.json', 'r', encoding='UTF8')
data1 = a.readlines()
a.close()
for i in range(len(data1)):
    print('Done %d/%d' % (i, len(data1)))
    jsI = json.loads(data1[i])
and as soon as a line containing a $ sign comes up, it fails.
May I ask for some advice on how to replace $ signs with newlines in a file using Python?
The problem is probably a.readlines(), because it brings the entire file into memory. When dealing with huge files it's much better to read line by line, like this:
with open(fname) as f:
    for line in f:
        # Do your magic here, on this loop
# No need to close it, since the `with` will take care of that.
If your objective is to replace every $ with a \n, note that str.replace returns a new string rather than changing the line, so the result has to be written somewhere, for example to a second file (out_fname here is illustrative):
with open(fname) as f, open(out_fname, "w") as out:
    for line in f:
        out.write(line.replace("$", "\n"))
To handle possible $ characters inside strings within the JSON objects, you can split the input string data1 on $ into fragments, then join the fragments back together one by one until the accumulated string parses as JSON; at that point you output the string and reset it for the next fragment:
import json

candidate = ''
for fragment in data1.split('$'):
    candidate += fragment
    try:
        json.loads(candidate)
        print(candidate)
        candidate = ''
    except json.decoder.JSONDecodeError:
        candidate += '$'
        continue
Given data1 = '''{}${"a":"$"}${"b":{"c":2}}''', for example, this outputs:
{}
{"a":"$"}
{"b":{"c":2}}

python search for string in file return entire line + next line into new text file

I have a very large text file (50,000+ lines) that should always be in the same sequence. In Python I want to search the text file for each of the $INGGA lines and join each one with the subsequent $INHDT line to create a new text file. I need to do this without reading the file into memory, as that causes a crash every time. I can find and return the $INGGA line, but I'm not sure of the best way to then get the next line and join the two into a new string in a memory-efficient way.
Thanks
Phil
=~=~=~=~=~=~=~=~=~=~=~= PuTTY log 2016.05.06 09:11:34 =~=~=~=~=~=~=~=~=~=~=~= > $PRDID,2.15,-0.10,31.87*6E
$INGGA,091124.00,5249.8336,N,00120.9619,W,1,20,0.6,95.0,M,49.4,M,,*50
$INHDT,31.9,T*1E $INZDA,091124.0055,06,05,2016,,*7F
$INVTG,22.0,T,,M,4.4,N,8.1,K,A*24 $PRDID,2.13,-0.06,34.09*6C
$INGGA,091124.20,5249.8338,N,00120.9618,W,1,20,0.6,95.0,M,49.4,M,,*5D
$INHDT,34.1,T*13 $INZDA,091124.2055,06,05,2016,,*7D
$INVTG,24.9,T,,M,4.4,N,8.1,K,A*2B $PRDID,2.16,-0.03,36.24*61
$INGGA,091124.40,5249.8340,N,00120.9616,W,1,20,0.6,95.0,M,49.4,M,,*5A
$INHDT,36.3,T*13 $INZDA,091124.4055,06,05,2016,,*7B
$INVTG,27.3,T,,M,4.4,N,8.1,K,A*22 $PRDID,2.11,-0.05,38.33*68
$INGGA,091124.60,5249.8343,N,00120.9614,W,1,20,0.6,95.1,M,49.4,M,,*58
$INHDT,38.4,T*1A $INZDA,091124.6055,06,05,2016,,*79
$INVTG,29.5,T,,M,4.4,N,8.1,K,A*2A $PRDID,2.09,-0.02,40.37*6D
$INGGA,091124.80,5249.8345,N,00120.9612,W,1,20,0.6,95.1,M,49.4,M,,*56
$INHDT,40.4,T*15 $INZDA,091124.8055,06,05,2016,,*77
$INVTG,31.7,T,,M,4.4,N,8.1,K,A*21 $PRDID,2.09,0.02,42.42*40
$INGGA,091125.00,5249.8347,N,00120.9610,W,1,20,0.6,95.1,M,49.4,M,,*5F
$INHDT,42.4,T*17
You can just read the file a line at a time and write to another, new file. Like this:
import re

# open the new file for appending
nf = open('newfile', 'at')
# open the source file for reading
with open('file', 'rt') as f:
    for line in f:
        r = re.match(r'\$INGGA', line)
        if r is not None:
            nf.write(line)
            nf.write("$INHDT,31.9,T*1E" + '\n')
You can use mode 'at' to append and 'rt' to read text.
I tried this on a 150,000-line file and it ran well.
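Note the sketch above writes the same fixed $INHDT line after every match. A variant that copies whichever $INHDT line actually follows each $INGGA (same illustrative file names) could look like:

import re

with open('file', 'rt') as f, open('newfile', 'at') as nf:
    want_inhdt = False
    for line in f:
        if re.match(r'\$INGGA', line):
            nf.write(line)
            want_inhdt = True            # copy the next $INHDT we see
        elif want_inhdt and re.match(r'\$INHDT', line):
            nf.write(line)
            want_inhdt = False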
I suggest using a simple regex to parse and capture the parts you care about. Here is a pattern that captures the pair:
(\$INGGA.*\n\$INHDT.*\n)
https://regex101.com/r/tK1hF0/3
As in my above link, you'll notice that I used the "global" g setting on the regex, telling it to capture all groups that match. Otherwise, it'll stop after the first match.
I also had trouble determining where the actual line breaks exist in your above example file, so you can tweak the above to match exactly where the breaks occur.
Here is some starter Python example code:
import re

test_str = ...  # load your file contents here
p = re.compile(r'(\$INGGA.*\n\$INHDT.*\n)')
matches = re.findall(p, test_str)
In the example PuTTY log you give, it's all one line separated with spaces. So in this case you can use this to replace each space with a newline and get a new file:
cat large_file | sed 's/ /\n/g' > new_large_file
To iterate over the newline-separated file, run this:
cat new_large_file | python your_script.py
Your script gets its input line by line, so your computer should not crash.
your_script.py:
import sys

INGGA_line = ""
for line in sys.stdin:
    line_striped = line.strip()
    if line_striped.startswith("$INGGA"):
        INGGA_line = line_striped
    elif line_striped.startswith("$INZDA"):
        print line_striped, INGGA_line
    else:
        print line_striped
This answer is aimed at Python 3.
According to this other answer (and the docs), you can iterate your file line by line memory-efficiently:
with open(filename, 'r') as f:
    for line in f:
        ...process...
An example of how you could fulfill the above criteria:
# Target file write-only, source file read-only
with open(targetfile, 'w') as tf, open(sourcefile, 'r') as sf:
    # Flag for whether we are looking for the 1st or 2nd part
    look_for_ingga = True
    for line in sf:
        if look_for_ingga:
            if line.startswith('$INGGA,'):
                tf.write(line)
                look_for_ingga = False
        elif line.startswith('$INHDT,'):
            tf.write(line)
            look_for_ingga = True
In the case where you have multiple '$INGGA,' lines before the '$INHDT,', this grabs the first one and disregards the rest. If you want only the last '$INGGA,' before the '$INHDT,', store the latest '$INGGA,' in a variable instead of writing it immediately; then, when you find your '$INHDT,', write both, as sketched below.
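A sketch of that variant (same variable names as above; only the most recent $INGGA is kept):

with open(targetfile, 'w') as tf, open(sourcefile, 'r') as sf:
    last_ingga = None
    for line in sf:
        if line.startswith('$INGGA,'):
            last_ingga = line                  # remember only the latest $INGGA
        elif line.startswith('$INHDT,') and last_ingga is not None:
            tf.write(last_ingga)               # write the pair together
            tf.write(line)
            last_ingga = None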
In case you meant that you want to write to a separate new file for each INGGA-INHDT pair, the target file with-statement should be nested inside for line in sf instead, or the results should be buffered in a list for later storage.
Refer to the docs for introductions to with-statements and file reading/writing.

Using python to parse a text file without delimiters

I have searched thoroughly, possibly with incorrect search terms, for a way to use Python to parse a text file WITHOUT the use of delimiters. All prior discussion found assumes the use of the CSV library (with comma delimited text) but since the input file does not use a comma-delimited format, csv does not seem to be the correct library to use.
For example, I would like to parse the 18th to 29th text character of each line regardless of context. The input file is general text, say, each line is 132 characters in length.
I could post an example input but don't see the point in it if the input is general text and is to be parsed without the use of any patterns to delimit.
Ideas?
The struct module can be used to parse fixed-length format files. Simply construct a format string using the appropriate length modifier for the s format character.
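A minimal sketch of that idea (my assumptions: plain ASCII text and the slice line[18:30] as the field of interest, matching the other answers here):

import struct

# '18x' skips 18 pad bytes, '12s' captures the next 12 bytes
field_fmt = struct.Struct('18x 12s')

with open('fixed_width.txt', 'rb') as f:   # binary mode: struct works on bytes
    for line in f:
        if len(line) >= field_fmt.size:    # skip lines that are too short
            (field,) = field_fmt.unpack_from(line)
            print(field.decode('ascii'))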
with open(filename, 'r') as f:
    for line in f:
        print line[18:30]
You can simply use something like this:
Res = []
fo = open(filename)  # open your file for reading ('r' by default)
for line in fo:  # parse the file line by line
    Res.append(line[18:30])  # extract the desired text from the current line
fo.close()
print(Res)  # exploit the extracted data
If you want the 18th to 29th characters of every line...
f = open(<path>, 'r')
results = [line[18:30] for line in f.readlines() if len(line) > 29]
f.close()
for r in results:
    print r

How to remove large spaces between sentences in a text file?

I am working with a Unicode file; after processing it I am getting very large spacing between sentences, for example:
തൃശൂരില്‍ ഹര്‍ത്താല്‍ പൂര്‍ണം
തൃശൂവില്‍ ഇടതുമുന്നണി ഹര്‍ത്താലില്‍ ജനജീവിതം പൂര്‍ണമായും സ്‌...
ഡി.വൈ.എഫ്‌.ഐ. ഉപരോധം; കലക്‌ടറേറ്റ്‌ സ്‌തംഭിച്ചു
തൃശൂര്‍: നിയമനനിരോധനം, അഴിമതി, വിലക്കയറ്റം എന്നീ വിഷയങ്ങള്‍ മുന്‍...
ബൈക്ക്‌ പോസ്‌റ്റിലിടിച്ച്‌ പതിന്നേഴുകാരന്‍ മരിച്ചു
How do I remove these large spaces?
I have tried this:
" ".join(raw.split())
It is not working at all. Any suggestions?
The easiest way is to write the results to another file, or rewrite your file. Most operating systems don't let us edit directly into a file (other than appending to it). For simple cases like this, rewriting is much simpler:
with open('f.txt') as raw:
    data = ''.join(raw.read().split())  # if you want to remove newlines only, use split('\n')
with open('f.txt', 'w') as raw:
    raw.write(data)
Hope this helps!
Assuming raw is your raw data, you need to split it using str.splitlines, filter out the empty lines, and rejoin them with a newline:
print '\n'.join(line for line in raw.splitlines() if line.strip())
If you are open to using a regex, you may also try:
import re
print re.sub("\n+", "\n", raw)
If instead raw is a file object, you can collapse each run of consecutive identical lines (here, the repeated blank lines) into one:
from itertools import groupby
with open("<some-file>") as raw:
    data = ''.join(k for k, _ in groupby(raw))
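A quick illustration of what groupby contributes here (toy data of my own):

from itertools import groupby

lines = ['a\n', '\n', '\n', '\n', 'b\n']
print(''.join(k for k, _ in groupby(lines)))
# prints 'a', a single blank line, then 'b'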
Assuming the unwanted lines are empty (only a newline), using Python:
import re
import sys

f = sys.argv[1]
for line in open(f, 'r'):
    if not re.search('^$', line):
        print line
or if you prefer:
egrep -v "^$" <filename>
