I've got a CSV file that has around 100 rows. Some of the cells have filepaths like:
C:\\\\Users\\\Simon\\\\Desktop\\\\file.jpg
I want to open the csv file in python and change only the rows that have triple-slashes and convert them to a single backslash. Here is my code so far:
import csv

with open('myCsvFile', 'rb') as csvfile:
    SysIndexTwo = csv.reader(csvfile)

for allRows in SysIndexTwo:
    if '\\\\' in allRows:
        writer.writerows(allRows.replace('\\\\', '\\'))
Tried the suggestions and get the following error:
simon@ubuntu:~/Desktop$ python SIPHON2.py
Traceback (most recent call last):
  File "SIPHON2.py", line 7, in <module>
    for allRows in SysIndexTwo:
ValueError: I/O operation on closed file
This doesn't seem to work. Any ideas?
Thanks
You need to indent your actual processing. Right now, you drop out of the context manager (the with statement where you define your CSV reader) before you try to use it. Thus, you get the "IO operation on closed file" error because the context manager closed the file when you left it.
You want this:
with open('myCsvFile', 'rb') as csvfile:
    reader = csv.reader(csvfile)  # Simple names are good, esp. in small scope!
    for row in reader:  # Indent me!
        pass  # Do stuff here.
The with statement is handy for automatically closing files (among other things) for you. However, this means that any work you do that requires the file you're using must be done before you leave the block, because once you leave, the file is closed!
The csv reader doesn't read the whole file when you initialize it: it reads it on demand. Thus, you need to still be inside the block when you read lines from the csv reader.
Other Notes
You've got a bunch of other problems. You seem to be unsure whether you're trying to clean three or four backslashes--make sure you know what you're doing before you try to do it!
Your actual row replacement is broken, because as you've written it, allRows is a list, not a string, so you're probably not going to find the backslash pattern you're looking for. Instead, you need an inner loop to look through each cell in each row:
for row in reader:
    corrected = []
    for cell in row:
        corrected.append(cell.replace('\\\\\\', '\\'))  # Gross! See below.
    writer.writerow(corrected)
Note that I can't see where writer is defined, but it looks like it might be subject to the same problem as your reader, if it's defined in a context manager someplace else!
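For completeness, here's a minimal end-to-end sketch with both files managed in one with statement, so neither the reader nor the writer outlives its file (assuming Python 2, as in the question; the output filename cleaned.csv is made up):

import csv

# Keep both files open for the whole loop; 'rb'/'wb' are the Python 2 csv conventions.
with open('myCsvFile', 'rb') as csvfile, open('cleaned.csv', 'wb') as outfile:
    reader = csv.reader(csvfile)
    writer = csv.writer(outfile)
    for row in reader:
        # Replace three literal backslashes with one in every cell.
        writer.writerow([cell.replace('\\\\\\', '\\') for cell in row])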
Finally, raw strings are your friends (though they may not help you much here). In general, anytime you want a literal backslash in your strings, put an r in front of the string to save yourself a lot of headache. However, replacing odd numbers of backslashes is still a problem, because even raw strings cannot end in an odd number of backslashes.
So, to replace \\\ with \ (replace three backslashes with one), you'll have to double up on the backslashes like I did in the example above. If you wanted to replace four backslashes with two, you could use raw strings to your advantage: cell.replace(r'\\\\', r'\\') works just fine.
For posterity: you could also do something just as ugly, but in a different way, by adding a space to the end of the pattern strings so they no longer end with backslashes, and then stripping off the extra space. The following line replaces three backslashes with one, but it's much hackier (and slower if you're doing it a whole lot):
s = r'This is a \\\ string with \\\ sets \ of \\ three backslash\\\es.'
print(s.replace(r'\\\ '.strip(), r'\ '.strip()))
The slashes you're trying to match are getting treated as escapes, so '\\\\' is actually looking for '\\'.
Try using raw strings, i.e. r'\\\\' (you'll want raw strings for both the match and the replacement).
You could also double up the slashes, so use \\ every time you want \, but that gets cumbersome very quickly.
Try:
allRows.replace('\\\\\\', '\\')
Please note that the \ symbols need to be escaped by doubling them.
>>> d
'C:\\\\\\Users\\\\\\Simon\\\\\\Desktop\\\\\\file.jpg\n'
>>> d.replace('\\\\\\', '\\')
'C:\\Users\\Simon\\Desktop\\file.jpg\n'
>>> print d.replace('\\\\\\', '\\')
C:\Users\Simon\Desktop\file.jpg
Related
When loading JSON data into Spark (v2.4.2) on AWS EMR from S3 using Pyspark, I've observed that a trailing line separator (\n) in the file results in an empty row being created on the end of the Dataframe. Thus, a file with 10,000 lines in it will produce a Dataframe with 10,001 rows, the last of which is empty/all nulls.
The file looks like this:
{line of JSON}\n
{line of JSON}\n
... <-- 9996 similar lines
{line of JSON}\n
{line of JSON}\n
There are no newlines in the JSON itself, i.e. I don't need to read the JSON as multi-line. I am reading it with the following Pyspark command:
df = spark.read.json('s3://{bucket}/{filename}.json.gz')
df.count()
-> 10001
My understanding of this quote from http://jsonlines.org/:
The last character in the file may be a line separator, and it will be treated the same as if there was no line separator present.
... is that that last empty line should not be considered. Am I missing something? I haven't seen anyone else on SO or elsewhere having this problem, yet it seems very obvious in practice. I don't see an option in the Spark Python API docs for suppressing empty lines, nor have I been able to work around it by trying different line separators and specifying them in the load command.
I have verified that removing the final line separator results in a Dataframe that has the correct number of lines.
I found the problem. The file I was uploading had an unexpected encoding (UCS-2 LE BOM instead of UTF-8). I should have thought to check it, but didn't. After I switched the encoding to the expected one (UTF-8) the load worked as intended.
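For anyone who hits the same thing: you can check a local copy of the file for a byte-order mark, and Spark's JSON reader also accepts an encoding option (available since Spark 2.4), so a sketch like this would have surfaced the problem (the local filename here is made up):

# A UTF-16/UCS-2 LE file starts with the BOM b'\xff\xfe'; UTF-8 files
# usually start with the raw JSON text instead.
with open('local_copy.json', 'rb') as f:
    print(f.read(4))

# Or tell Spark the encoding explicitly instead of relying on detection:
df = spark.read.option('encoding', 'UTF-8').json('s3://{bucket}/{filename}.json.gz')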
I had been getting a path error (No such file or directory) for hours. After much debugging, I realised that Python 2 left an invisible '\r' at the end of each line.
The input: (trainval.txt)
Images/K0KKI1.jpg Labels/K0KKI1.xml
Images/2KVW51.jpg Labels/2KVW51.xml
Images/MMCPZY.jpg Labels/MMCPZY.xml
Images/LCW6RB.jpg Labels/LCW6RB.xml
The code I used to debug the error:
with open('trainval.txt', "r") as lf:
    for line in lf.readlines():
        print((line), repr(line))
        img_file, anno = line.strip("\n").split(" ")
        print(repr(img_file), repr(anno))
Python 2 output:
("'Images/K0KKI1.jpg'", "'Labels/K0KKI1.xml\\r'")
('Images/2KVW51.jpg Labels/2KVW51.xml\r\n', "'Images/2KVW51.jpg Labels/2KVW51.xml\\r\\n'")
("'Images/2KVW51.jpg'", "'Labels/2KVW51.xml\\r'")
('Images/MMCPZY.jpg Labels/MMCPZY.xml\r\n', "'Images/MMCPZY.jpg Labels/MMCPZY.xml\\r\\n'")
("'Images/MMCPZY.jpg'", "'Labels/MMCPZY.xml\\r'")
('Images/LCW6RB.jpg Labels/LCW6RB.xml\r\n', "'Images/LCW6RB.jpg Labels/LCW6RB.xml\\r\\n'")
("'Images/LCW6RB.jpg'", "'Labels/LCW6RB.xml\\r'")
Python 3 output:
Images/K0KKI1.jpg Labels/K0KKI1.xml
'Images/K0KKI1.jpg Labels/K0KKI1.xml\n'
'Images/K0KKI1.jpg' 'Labels/K0KKI1.xml'
Images/2KVW51.jpg Labels/2KVW51.xml
'Images/2KVW51.jpg Labels/2KVW51.xml\n'
'Images/2KVW51.jpg' 'Labels/2KVW51.xml'
Images/MMCPZY.jpg Labels/MMCPZY.xml
'Images/MMCPZY.jpg Labels/MMCPZY.xml\n'
'Images/MMCPZY.jpg' 'Labels/MMCPZY.xml'
Images/LCW6RB.jpg Labels/LCW6RB.xml
'Images/LCW6RB.jpg Labels/LCW6RB.xml\n'
'Images/LCW6RB.jpg' 'Labels/LCW6RB.xml'
As annoying as it was, it was that small '\r' that caused the path error. I could not see it in my console until I wrote the script above. My question is: why is this '\r' even there? I did not create it; something somewhere added it. It would be helpful if someone could tell me what this small '\r' is for, why it appeared in Python 2 and not in Python 3, and how to avoid getting bugs due to it.
There's a subtle difference in how Python 2 and Python 3 process Windows text files: Python 3 opens text files with universal newlines by default, so '\r\n' is translated to '\n', while Python 2's plain open() leaves the carriage return in place. The issue here is that your file has a Windows text format and contains one or several carriage-return chars before the linefeed. A quick & generic fix would be to change:
img_file, anno = line.strip("\n").split(" ")
by just:
img_file, anno = line.split()
Without arguments str.split is very smart:
it splits according to any kind of whitespace (linefeed, space, carriage return, tab)
it removes empty fields (no need for strip after all)
So use that cross-platform, Python-version-agnostic form unless you need a really specific split operation, and your problems will be history.
As an aside, don't do for line in lf.readlines(): but just for line in lf:; it will read and yield the lines one by one, which is handy when the file is big so you don't consume too much memory.
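Putting both suggestions together, the debug loop becomes:

with open('trainval.txt') as lf:
    for line in lf:  # iterate the file object directly instead of readlines()
        # split() with no arguments drops the trailing '\r' and '\n' too
        img_file, anno = line.split()
        print(repr(img_file), repr(anno))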
I am creating a simple file in python to reorganize some text data I grabbed from a website. I put the data in a .txt file and then want to use the "tail" command to get rid of the first 5 lines. I'm able to make this work for a simple filename shown below, but when I try to change the filename (to what I'd actually like it to be) I get an error. My code:
start = 2010
end = 2010
for i in range(start, end+1):
    year = str(i)
    # ...write data to a file called file...
    teamname = open(file).readline()  # want to use this in the new filename
    teamfname = teamname.replace(" ", "")  # getting rid of spaces
    file2 = "gotdata2_" + year + ".txt"
    os.system("tail -n +5 gotdata_" + year + ".txt > " + file2)
The above code works as intended, creating file, then creating file2 that excludes the first 5 lines of file. However, when I change the name of file2 to be:
file2 = teamfname+"_"+year+".txt"
I get the error:
sh: line 1: _2010.txt: command not found
It's as if the end of my file2 statement is getting chopped off and the .txt part isn't being recognized. In this case, my code outputs a file but is missing the _2010.txt at the end. I've double-checked that both year and teamfname are strings. I've also tried it with and without spaces in the teamfname string. I get the same error when I try to include an os.system mv statement that would rename the file to what I want it to be, so there must be something wrong with my understanding of how to specify the string here.
Does anyone have any ideas about what causes this? I haven't been able to find a solution, but I've found this problem difficult to search for.
Without knowing what your actual strings are, it's impossible to be sure what the problem is. However, it's almost certainly something to do with failing to properly quote and/or escape arguments for the command line.
My first guess would be that you have a newline in the middle of your filename, and the shell is truncating the command at the newline. But I wouldn't bet too heavily on that. If you actually printed out the repr of the pathname, I could tell you for sure. But why go through all this headache?
The solution to almost any problem with os.system is to not use os.system.
If you look at the docs, they even tell you this:
The subprocess module provides more powerful facilities for spawning new processes and retrieving their results; using that module is preferable to using this function. See the Replacing Older Functions with the subprocess Module section in the subprocess documentation for some helpful recipes.
If you use subprocess instead of os.system, you can avoid the shell entirely. You can also pass arguments as a list instead of trying to figure out how to quote them and escape them properly. Which would completely avoid the exact problem you're having.
For example, if you do this:
file2 = "gotdata2_"+year+".txt"
with open(file2, 'wb') as f:
subprocess.check_call(['tail', '-n', '+5', "gotdata_"+year+".txt"], stdout=f)
Then, if you change that first line to this:
file2 = teamfname+"_"+year+".txt"
It will still work even if teamfname has a space or a quote or another special character in it.
That being said, I'm not sure why you want to use tail in the first place. You can skip the first 5 lines just as easily directly in Python.
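For example, here's a minimal pure-Python sketch reusing year and file2 from your code (note that tail -n +5 starts printing at line 5, i.e. it only skips the first 4 lines):

import itertools

with open("gotdata_" + year + ".txt") as fin, open(file2, "w") as fout:
    # islice(fin, 4, None) skips 4 lines, matching tail -n +5; use 5
    # instead if you really do want to drop the first 5 lines.
    fout.writelines(itertools.islice(fin, 4, None))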
I have a small problem with reading in my file. My code:
import csv as csv
import numpy
with open("train_data.csv","rb") as training:
csv_file_object = csv.reader(training)
header = csv_file_object.next()
data = []
for row in csv_file_object:
data.append(row)
data = numpy.array(data)
I get the error no such file "train_data.csv", so I know the problem lies with the location. But whenever I specify the path like this: open("C:\Desktop...etc) it doesn't work either. What am I doing wrong?
If you give the full file path, your script should work. Since it is not working, you most likely have escape characters in your path. To fix this, use a raw string to specify the file path:
# Put an 'r' at the start of the string to make it a raw-string.
with open(r"C:\path\to\file\train_data.csv","rb") as training:
Raw strings do not process escape characters.
Also, just a technical fact, not giving the full file path causes Python to look for the file in the directory that the script is launched from. If it is not there, an error is thrown.
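A quick way to check which directory that is:

import os
print(os.getcwd())  # relative paths like "train_data.csv" are resolved against this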
When you use open() on Windows, you need to deal with the backslashes properly.
Option 1.) Use the raw string, this will be the string prefixed with an r.
open(r'C:\Users\Me\Desktop\train_data.csv')
Option 2.) Escape the backslashes
open('C:\\Users\\Me\\Desktop\\train_data.csv')
Option 3.) Use forward slashes
open('C:/Users/Me/Desktop/train_data.csv')
As for finding the file you are using, if you just do open('train_data.csv') it is looking in the directory you are running the python script from. So, if you are running it from C:\Users\Me\Desktop\, your train_data.csv needs to be on the desktop as well.
I have a log file that has lines that look like this:
"1","2546857-23541","f_last","user","4:19 P.M.","11/02/2009","START","27","27","3","c2546857-23541",""
Each line in the log has 12 double-quote sections, and the 7th double-quote section in the string comes from where the user typed something into the chat window:
"22","2546857-23541","f_last","john","4:38 P.M.","11/02/2009","
What's up","245","47","1","c2546857-23541",""
This string also shows the issue I'm having: there are areas in the chat log where the text the user typed is on a new line in the log file instead of on the same line, as in the first example.
So basically I want the lines in the second example to look like the first example.
I've tried using Find/Replace in N++ and I am able to find each "orphaned" line but I was unable to make it join the line above it.
Then I thought of making a python file to automate it for me, but I'm kind of stuck about how to actually code it.
Python errors out at this line when running unutbu's code:
"1760","4746880-00129","bwhiteside","tom","11:47 A.M.","12/10/2009","I do not see ^"refresh your knowledge
^" on the screen","422","0","0","c4746871-00128",""
The csv module is smart enough to recognize when a quoted item is not finished (and thus must contain a newline character).
import csv

with open('data.log', "r") as fin:
    with open('data2.log', 'w') as fout:
        reader = csv.reader(fin, delimiter=',', quotechar='"', escapechar='^')
        writer = csv.writer(fout, delimiter=',',
                            doublequote=False, quoting=csv.QUOTE_ALL)
        for row in reader:
            row[6] = row[6].replace('\n', ' ')
            writer.writerow(row)
If your data is valid CSV you can use Python's csv.reader class. It should work just fine with your sample data, though it may not work correctly depending on what an embedded double quote looks like coming from the source system. See: http://docs.python.org/library/csv.html#module-contents.
Unless I'm misunderstanding the problem, you simply need to read in the file and remove any newline characters that occur between double-quote characters.
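If you'd rather do that by hand than through the csv module, here is a rough sketch of the idea. Note it naively toggles on every double quote, so unlike the csv answer above it does not account for the ^-escaped quotes shown in the update:

def join_broken_lines(text):
    # Walk the text, tracking whether we're inside a double-quoted field,
    # and replace any newline that falls inside quotes with a space.
    out = []
    in_quotes = False
    for ch in text:
        if ch == '"':
            in_quotes = not in_quotes
            out.append(ch)
        elif ch == '\n' and in_quotes:
            out.append(' ')
        else:
            out.append(ch)
    return ''.join(out)

with open('data.log') as fin, open('data2.log', 'w') as fout:
    fout.write(join_broken_lines(fin.read()))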