I have a BCP file that contains lots of carriage return symbols. They are not meant to be there, and I have no control over the original output, so I am left trying to parse the file to remove them.
A sample of the data looks like:
"test1","apples","this is
some sample","3877"
"test66","bananas","this represents more
wrong data","378"
I am trying to end up with:
"test1","apples","this is some sample","3877"
"test66","bananas","this represents more wrong data","378"
Is there a simple way to do this, preferably using the Python csv module?
You can try:
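import re

with open("old.csv") as f, open("new.csv", "w") as w:
    for line in f:
        # A line break that is not preceded by a closing quote falls inside a
        # field, so replace it (and any whitespace after it) with a space.
        line = re.sub(r'(?<!")\n\s*', " ", line)
        w.write(line)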
"test1","apples","this is some sample","3877"
"test66","bananas","this represents more wrong data","378"
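Since the question asks about the csv module, here is a minimal sketch of that route. It assumes every field is quoted as in the sample, so the csv reader treats a line break inside quotes as part of the field. The function name join_broken_fields is made up for illustration.

```python
import csv
import io

def join_broken_fields(src, dst):
    """Copy CSV rows from src to dst, replacing line breaks inside fields with a space."""
    reader = csv.reader(src)
    writer = csv.writer(dst, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for row in reader:
        # split() + join() collapses any run of whitespace, including
        # embedded line breaks, into a single space.
        writer.writerow([" ".join(field.split()) for field in row])

# Small in-memory demonstration using the sample from the question
sample = (
    '"test1","apples","this is\nsome sample","3877"\n'
    '"test66","bananas","this represents more\nwrong data","378"\n'
)
out = io.StringIO()
join_broken_fields(io.StringIO(sample), out)
print(out.getvalue(), end="")
```

For real files, open both with newline="" as the csv documentation recommends, for example open("old.csv", newline="") and open("new.csv", "w", newline="").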
I have the following text in a csv file:
b'DataMart\n\nDate/Time Generated,11/7/16 8:54 PM\nReport Time Zone,America/New_York\nAccount ID,8967\nDate Range,10/8/16 - 11/6/16\n\nReport Fields\nSite (DCM),Creative\nGlobest.com,2016-08_CB_018_1040x320_Globe St_16_PropertyFilter\nGlobest.com,2016-08_CB_018_1040x320_Globe St_16_PropertyFilter'
Essentially, the file contains multiple newline characters rather than being a single long string, so you can picture the same text as follows:
DataMart
Date/Time Generated,11/7/16 8:54 PM
Report Time Zone,America/New_York
Account ID,8967
Date Range,10/8/16 - 11/6/16
Report Fields
Site (DCM),Creative
Globest.com,2016-08_CB_018_1040x320_Globe St_16_PropertyFilter
Globest.com,2016-08_CB_018_1040x320_Globe St_16_PropertyFilter
I need to grab the last two lines, which are basically the data. I tried using a for loop:
with open('file.csv','r') as f:
    for line in f:
        print(line)
Instead, it prints the entire text again, including the \n sequences.
Just read the file and get the last two lines:
# file() only exists in Python 2; open() works in both Python 2 and 3
with open("/path/to/file") as f:
    my_file = f.read()
print(my_file.splitlines()[-2:])
The [-2:] is known as slicing: it creates a slice, starting from the second to last element, going to the end.
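If the file is too large to read in one go, the same "last two lines" idea can be sketched with a bounded deque, which only ever holds the most recent n lines. The helper name last_lines is an assumption for illustration.

```python
from collections import deque

def last_lines(lines, n=2):
    """Return the last n lines of an iterable of lines, without trailing newlines."""
    # deque(maxlen=n) discards older items as new ones arrive,
    # so at most n lines are kept in memory at any time.
    return [line.rstrip("\n") for line in deque(lines, maxlen=n)]

# Works with any iterable of lines, including an open file object:
#     with open("/path/to/file") as f:
#         print(last_lines(f))
print(last_lines(["a\n", "b\n", "c\n"]))
```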
OK, after struggling for a bit, I found out that I needed to decode the file from binary to 'utf-8' before I could apply the split functions. The problem was that the string split functions could not be applied directly to the binary data.
This is the code that seems to be working for me now:
with open('BinaryFile.csv','rb') as f1:
    data=f1.read()
text=data.decode('utf-8')
with open('TextFile.csv', 'w') as f2:
    f2.write(text)
with open('TextFile.csv','r') as f3:
    for line in f3:
        print(line.split('\\n')[9:])
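If the bytes contain real newline characters (an assumption; the code above splits on a literal backslash followed by n), the intermediate text file may not be needed. A minimal sketch, with the hypothetical helper last_rows:

```python
def last_rows(raw, n=2, encoding="utf-8"):
    """Decode raw bytes and return the last n non-empty lines."""
    text = raw.decode(encoding)
    # splitlines() handles \n and \r\n; blank lines are dropped
    rows = [line for line in text.splitlines() if line]
    return rows[-n:]

raw = b"DataMart\n\nReport Fields\nSite (DCM),Creative\nGlobest.com,row1\nGlobest.com,row2"
print(last_rows(raw))
```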
thanks for your help guys
I have a very large text file (50,000+ lines) that should always be in the same sequence. In Python, I want to search the text file for each of the $INGGA lines and join each one with the subsequent $INHDT line to create a new text file. I need to do this without reading the whole file into memory, as this causes it to crash every time. I can find and return the $INGGA line, but I'm not sure of the best memory-efficient way to get the next line and join the two into a new string.
Thanks
Phil
=~=~=~=~=~=~=~=~=~=~=~= PuTTY log 2016.05.06 09:11:34 =~=~=~=~=~=~=~=~=~=~=~= > $PRDID,2.15,-0.10,31.87*6E
$INGGA,091124.00,5249.8336,N,00120.9619,W,1,20,0.6,95.0,M,49.4,M,,*50
$INHDT,31.9,T*1E $INZDA,091124.0055,06,05,2016,,*7F
$INVTG,22.0,T,,M,4.4,N,8.1,K,A*24 $PRDID,2.13,-0.06,34.09*6C
$INGGA,091124.20,5249.8338,N,00120.9618,W,1,20,0.6,95.0,M,49.4,M,,*5D
$INHDT,34.1,T*13 $INZDA,091124.2055,06,05,2016,,*7D
$INVTG,24.9,T,,M,4.4,N,8.1,K,A*2B $PRDID,2.16,-0.03,36.24*61
$INGGA,091124.40,5249.8340,N,00120.9616,W,1,20,0.6,95.0,M,49.4,M,,*5A
$INHDT,36.3,T*13 $INZDA,091124.4055,06,05,2016,,*7B
$INVTG,27.3,T,,M,4.4,N,8.1,K,A*22 $PRDID,2.11,-0.05,38.33*68
$INGGA,091124.60,5249.8343,N,00120.9614,W,1,20,0.6,95.1,M,49.4,M,,*58
$INHDT,38.4,T*1A $INZDA,091124.6055,06,05,2016,,*79
$INVTG,29.5,T,,M,4.4,N,8.1,K,A*2A $PRDID,2.09,-0.02,40.37*6D
$INGGA,091124.80,5249.8345,N,00120.9612,W,1,20,0.6,95.1,M,49.4,M,,*56
$INHDT,40.4,T*15 $INZDA,091124.8055,06,05,2016,,*77
$INVTG,31.7,T,,M,4.4,N,8.1,K,A*21 $PRDID,2.09,0.02,42.42*40
$INGGA,091125.00,5249.8347,N,00120.9610,W,1,20,0.6,95.1,M,49.4,M,,*5F
$INHDT,42.4,T*17
You can read the file one line at a time and write the lines you need to a new file, like this:
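import re

# Open the new file for appending and the source file for reading (text mode)
with open('newfile', 'at') as nf, open('file', 'rt') as f:
    found_ingga = False
    for line in f:
        if re.match(r'\$INGGA', line):
            nf.write(line)
            found_ingga = True
        elif found_ingga and re.match(r'\$INHDT', line):
            # Write the $INHDT line that follows the $INGGA line
            nf.write(line)
            found_ingga = False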
The mode 'at' opens a file for appending text, and 'rt' opens it for reading text.
I tested this on a 150,000-line file and it ran well.
I suggest using a simple regex that will parse and capture the parts you care about. Here is an example that will capture the piece you care about:
(\$INGGA.*\n\$INHDT.*\n)
https://regex101.com/r/tK1hF0/3
In the link above, I used the "global" g setting on the regex, telling it to capture all groups that match. Otherwise, it'll stop after the first match. In Python, re.findall returns every match, so no extra flag is needed.
I also had trouble determining where the actual line breaks exist in your above example file, so you can tweak the above to match exactly where the breaks occur.
Here is some starter Python example code:
import re

# 'file' is a placeholder path; note that read() loads the whole file into memory
with open('file') as f:
    test_str = f.read()
p = re.compile(r'(\$INGGA.*\n\$INHDT.*\n)')
matches = p.findall(test_str)
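Because the question rules out reading the file into memory, one possible variation is to run the same pattern over a memory-mapped file, so the operating system pages the data in as needed. This is only a sketch: the function name copy_pairs is made up, the pattern has to be a bytes pattern, and it assumes each $INHDT sentence starts on the line directly after its $INGGA sentence.

```python
import mmap
import re

# Bytes version of the pattern above: an $INGGA line directly followed by an $INHDT line
PAIR = re.compile(rb'\$INGGA[^\n]*\n\$INHDT[^\n]*\n')

def copy_pairs(src_path, dst_path):
    """Write every $INGGA/$INHDT pair found in src_path to dst_path; return the pair count."""
    count = 0
    with open(src_path, 'rb') as src, open(dst_path, 'wb') as dst:
        # Length 0 maps the whole file; mapping an empty file raises ValueError
        with mmap.mmap(src.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for match in PAIR.finditer(mm):
                dst.write(match.group(0))
                count += 1
    return count
```

A call would look like copy_pairs('putty.log', 'pairs.txt'), where both file names are placeholders.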
In the example PuTTY log you give, it's all one line, separated by spaces.
In that case, you can replace the spaces with newlines and write the result to a new file:
cat large_file | sed 's/ /\n/g' > new_large_file
To iterate over the newline-separated file, run:
cat new_large_file | python your_script.py
The script receives its input line by line, so your computer should not crash.
your_script.py:
import sys

INGGA_line = ""
for line in sys.stdin:
    line_stripped = line.strip()
    if line_stripped.startswith("$INGGA"):
        # Remember the $INGGA sentence until its $INHDT sentence arrives
        INGGA_line = line_stripped
    elif line_stripped.startswith("$INHDT"):
        # Print the pair joined on one line
        print(INGGA_line, line_stripped)
    else:
        print(line_stripped)
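If sed is not available, the space-to-newline step could also be sketched in Python by reading fixed-size chunks, so that a file consisting of one enormous line never has to fit in memory. The generator name iter_tokens and the chunk size are assumptions for illustration.

```python
import io

def iter_tokens(stream, chunk_size=65536):
    """Yield whitespace-separated tokens from a text stream, one chunk at a time."""
    tail = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        parts = (tail + chunk).split()
        if parts and not chunk[-1].isspace():
            # The chunk ended mid-token: hold the last piece back
            tail = parts.pop()
        else:
            tail = ""
        yield from parts
    if tail:
        yield tail

demo = io.StringIO("$INGGA,1 $INHDT,2\n$INZDA,3")
print(list(iter_tokens(demo, chunk_size=4)))
```

Each token then plays the role of one line in the script above.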
This answer is aimed at Python 3.
According to this other answer (and the docs), you can iterate your file line-by-line memory-efficiently:
with open(filename, 'r') as f:
    for line in f:
        ...  # process the line here
One way to meet your criteria could be:
# Target file write-only, source file read-only
with open(targetfile, 'w') as tf, open(sourcefile, 'r') as sf:
    # Flag for whether we are looking for 1st or 2nd part
    look_for_ingga = True
    for line in sf:
        if look_for_ingga:
            if line.startswith('$INGGA,'):
                tf.write(line)
                look_for_ingga = False
        elif line.startswith('$INHDT,'):
            tf.write(line)
            look_for_ingga = True
In the case where you have multiple '$INGGA,' lines prior to the '$INHDT,', this grabs the first one and disregards the rest. In case you want to take only the last '$INGGA,' before the '$INHDT,', store the last '$INGGA,' in a variable instead of writing it to disk. Then, when you find your '$INHDT,', write both to the target file.
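The "keep only the last '$INGGA,'" variant could look roughly like this. It is a sketch rather than a tested drop-in; the function name write_last_ingga_pairs is hypothetical, and it accepts any iterable of lines plus any writable text object so it works with open files as well.

```python
import io

def write_last_ingga_pairs(lines, out):
    """Write each $INHDT line preceded by the most recent $INGGA line; return the pair count."""
    last_ingga = None
    pairs = 0
    for line in lines:
        if line.startswith('$INGGA,'):
            # Overwrite any earlier $INGGA that never got its $INHDT
            last_ingga = line
        elif line.startswith('$INHDT,') and last_ingga is not None:
            out.write(last_ingga)
            out.write(line)
            last_ingga = None
            pairs += 1
    return pairs

source = io.StringIO("$INGGA,first\n$INGGA,second\n$INHDT,31.9\n$INZDA,x\n")
target = io.StringIO()
print(write_last_ingga_pairs(source, target))
```

With real files, pass the two file objects from the with-statement instead of the StringIO objects.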
In case you meant that you want to write to a separate new file for each INGGA-INHDT pair, the target file with-statement should be nested inside for line in sf instead, or the results should be buffered in a list for later storage.
Refer to the docs for introductions to with-statements and file reading/writing.