Text replacement on different lines - Python

I have multiple entries in a file as mentioned below.
"Item_1";"Item_1";"Products///Item///ABC///XYZ";"Item_1.jpg}";"";"Buy item
<br><strong>Items</strong>
<br><strong>Time</strong>";"";"";"";"";"";"Category: M[Item]";"";"";"Y";"N";"N";"None";""
"Item_2";....
In the above text, there is a newline after "Buy item" in the first line and after '/strong>' in the second line.
The changes I want to make are:
1. Replace Products///Item///ABC///XYZ with Products///ABC///XYZ
2. Replace "Category: M[Item]" with "Category: M[ABC]"
3. If the entry is Products///Item///ABC or Products///ABC, I don't want to replace "Category: M[Item]" with "Category: M[ABC]"; just change Products///Item///ABC to Products///ABC
I am trying to read the entire file line by line and then split by '///', storing the number of entries and the third entry. But this causes issues because I have multiple newlines.
Is there a simpler way of doing it with a regex or something else?

Like @Casimir suggested, you can use the csv module to parse your file (because it will handle the newlines), like this:
import csv

with open(your_filename) as f:
    reader = csv.reader(f, delimiter=';', quotechar='"')
    rows = list(reader)
and then do what you want with the parsed result (I'm not quite sure what you want to achieve here; comment if this isn't what you want):
for row in rows:
    if 'Products///Item///ABC///XYZ' in row:
        index = row.index('Products///Item///ABC///XYZ')
        row[index] = 'Products///ABC///XYZ'
        continue  # if we replaced the first thing, skip to the next row
    elif 'Category: M[Item]' in row:
        index = row.index('Category: M[Item]')
        row[index] = 'Category: M[ABC]'
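To then write the modified rows back out with the same semicolon/quote style, csv.writer can finish the job. A minimal sketch with made-up sample rows (the real records have many more fields) and a hypothetical output.csv destination:

```python
import csv

# Made-up sample rows standing in for the parsed file.
rows = [
    ["Item_1", "Products///Item///ABC///XYZ", "Category: M[Item]"],
    ["Item_2", "Products///ABC", "Category: M[Item]"],
]

for row in rows:
    if "Products///Item///ABC///XYZ" in row:
        row[row.index("Products///Item///ABC///XYZ")] = "Products///ABC///XYZ"
        # Only the full four-part path also triggers the category rename,
        # matching rule 3 from the question.
        if "Category: M[Item]" in row:
            row[row.index("Category: M[Item]")] = "Category: M[ABC]"

# QUOTE_ALL reproduces the quoted ;-separated style of the input file.
with open("output.csv", "w", newline="") as f:
    writer = csv.writer(f, delimiter=";", quoting=csv.QUOTE_ALL)
    writer.writerows(rows)
```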


Truncate a column of a csv file?

I'm new to Python and I have the following csv file (let's call it out.csv):
DATE,TIME,PRICE1,PRICE2
2017-01-15,05:44:27.363000+00:00,0.9987,1.0113
2017-01-15,13:03:46.660000+00:00,0.9987,1.0113
2017-01-15,21:25:07.320000+00:00,0.9987,1.0113
2017-01-15,21:26:46.164000+00:00,0.9987,1.0113
2017-01-16,12:40:11.593000+00:00,,1.0154
2017-01-16,12:40:11.593000+00:00,1.0004,
2017-01-16,12:43:34.696000+00:00,,1.0095
and I want to truncate the second column so the csv looks like:
DATE,TIME,PRICE1,PRICE2
2017-01-15,05:44:27,0.9987,1.0113
2017-01-15,13:03:46,0.9987,1.0113
2017-01-15,21:25:07,0.9987,1.0113
2017-01-15,21:26:46,0.9987,1.0113
2017-01-16,12:40:11,,1.0154
2017-01-16,12:40:11,1.0004,
2017-01-16,12:43:34,,1.0095
This is what I have so far:
with open('out.csv','r+b') as nL, open('outy_3.csv','w+b') as nL3:
    new_csv = []
    reader = csv.reader(nL)
    for row in reader:
        time = row[1].split('.')
        new_row = []
        new_row.append(row[0])
        new_row.append(time[0])
        new_row.append(row[2])
        new_row.append(row[3])
        print new_row
        nL3.writelines(new_row)
I can't seem to get a newline in after writing each line to the new csv file.
This definitely doesn't look or feel Pythonic.
Thanks
The missing newlines are because the file.writelines() method doesn't automatically add line separators to the elements of the argument it's passed, which it expects to be a sequence of strings. If those elements represent separate lines, then it's your responsibility to ensure each one ends in a newline.
However, your code tries to use it to output only a single line. To fix that, use file.write() instead, because it expects its argument to be a single string; and if you want that string to be a separate line in the file, it must end with a newline or have one added manually.
Below is code that does what you want. It works by changing one of the elements of the list of strings that csv.reader returns in place, join()ing them back together into a single string, and writing that string (stored in new_row) to the output file with a newline manually added at the end.
import csv

with open('out.csv','rb') as nL, open('outy_3.csv','wt') as nL3:
    for row in csv.reader(nL):
        time_col = row[1]
        try:
            period_location = time_col.index('.')
            row[1] = time_col[:period_location]  # only keep characters in front of the period
        except ValueError:  # no period character found
            pass  # leave row unchanged
        new_row = ','.join(row)
        print(new_row)
        nL3.write(new_row + '\n')
Printed (and file) output:
DATE,TIME,PRICE1,PRICE2
2017-01-15,05:44:27,0.9987,1.0113
2017-01-15,13:03:46,0.9987,1.0113
2017-01-15,21:25:07,0.9987,1.0113
2017-01-15,21:26:46,0.9987,1.0113
2017-01-16,12:40:11,,1.0154
2017-01-16,12:40:11,1.0004,
2017-01-16,12:43:34,,1.0095
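As a side note, the manual join()/write() above can be replaced with a csv.writer, which takes care of the delimiters and line endings itself. A Python 3 sketch, with a tiny generated stand-in for the question's out.csv:

```python
import csv

# Tiny stand-in for the question's out.csv (header plus one data row).
with open('out.csv', 'w', newline='') as f:
    f.write('DATE,TIME,PRICE1,PRICE2\n')
    f.write('2017-01-15,05:44:27.363000+00:00,0.9987,1.0113\n')

# newline='' is the documented way to open csv files in Python 3 so the
# writer fully controls line endings; no manual '\n' is needed.
with open('out.csv', newline='') as nL, open('outy_3.csv', 'w', newline='') as nL3:
    writer = csv.writer(nL3)
    for row in csv.reader(nL):
        row[1] = row[1].split('.')[0]  # drop the fractional seconds, if any
        writer.writerow(row)
```

The header row passes through unchanged because 'TIME' contains no period, so split('.')[0] returns it as-is.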

Sort by ignoring the first column and whitespace in csv file in Python

I have a csv file which I want to sort one row at a time. While sorting each row, I want to ignore the whitespace (or empty cells). I also want to ignore the first row and first column while sorting.
This is what my code looks like:
import csv, sys, operator

fname = "Source.csv"
new_fname = "Dest.csv"
data = csv.reader(open(fname,"rb"), delimiter=',')
num = 1
sortedlist = []
ind = 0
for row in data:
    if num == 1:
        sortedlist.append(row)
        with open(new_fname,"wb") as f:
            filewriter = csv.writer(f, delimiter=",")
            filewriter.writerow(sortedlist[ind])
            ind += 1
    elif num > 1:
        sortedlist.append(sorted(row))
        with open(new_fname,"ab") as f:
            filewriter = csv.writer(f, delimiter=",")
            filewriter.writerow(sortedlist[ind])
            ind += 1
    num += 1
I was able to ignore the first row. But, I am not sure how to ignore the whitespace and the first column while sorting. Any suggestions are welcome.
I simplified your code significantly and here's what I got (although I didn't understand the part about empty columns; they are values as well... Did you mean that you want to keep empty columns in the same place instead of putting them at the start?)
import csv

if __name__ == '__main__':
    reader = csv.reader(open("Source.csv","r"), delimiter=',')
    out_file = open("Dest.csv","w")
    writer = csv.writer(out_file, delimiter=",")
    writer.writerow(reader.next())
    for row in reader:
        writer.writerow([row[0]] + sorted(row[1:]))
    out_file.close()
Always put executable code inside if __name__ == '__main__':; this ensures your code is not executed if your script is imported by another script rather than run directly.
We keep the out_file variable so we can out_file.close() it cleanly later; the code works without it, but this is the clean way to write files.
Do not use "wb", "rb", "ab" for text files; the "b" part stands for "binary" and should be used for binary files.
reader.next() gets the first line of the csv file (or raises an error if the file is empty).
for row in reader: then starts from the second line (because we already consumed the first with reader.next()), so we no longer need any line-number conditionals.
row[0] gets the first element of the list and row[1:] gets all elements except the first. For example, row[3:] would skip the first 3 elements and return the rest of the list. In this case, we sort the row without its first element by doing sorted(row[1:]).
EDIT: If you really want to remove empty columns from your csv, replace sorted(row[1:]) with sorted(filter(lambda x: x.strip() != '', row[1:])). This removes the empty columns from the list before sorting, but keep in mind that empty values in csv are still values.
EDIT2: As correctly pointed out by @user3468054, the values will be sorted as strings; if you want them sorted as numbers, pass the named parameter key=int to sorted, or key=float if your values are floats.
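Taken together, the two edits behave like this on a made-up row:

```python
row = ['label', '12', '', '3', ' ', '101']

# Drop blank cells from everything after the first column (EDIT),
# then sort the remainder numerically rather than as strings (EDIT2).
cleaned = [x for x in row[1:] if x.strip() != '']
result = [row[0]] + sorted(cleaned, key=int)
print(result)   # ['label', '3', '12', '101']
```

Without key=int the same data would sort lexicographically as ['101', '12', '3'].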

Python 3.4 CSV Deleting items using the in function

This is my current code; the issue is that search returns nothing. How do I get a string value into this variable?
count = 0
with open("userDatabase.csv","r") as myFile:
    with open("newFile.csv","w") as newFile:
        row_count = sum(1 for row in myFile)
        print("aba")
        for x in range(row_count):
            print("aaa")
            for row in myFile:
                search = row[count].readline
                print(search)
                if self.delName.get("1.0","end-1c") in search:
                    count = count + 1
                else:
                    newFile.write(row[count])
                    count = count + 1
The output is:
aba
aaa
aaa
So it runs through it twice, which is good as my userDatabase consists of two rows of data.
The file in question has this data:
"lukefinney","0000000","0000000","a"
"nictaylor","0000000","0000000","a"
You cannot just iterate over an open file more than once without rewinding the file object back to the start.
You'll need to add a file.seek(0) call to put the file reader back to the beginning each time you want to start reading from the first row again:
myFile.seek(0)
for row in myFile:
The rest of your code makes little sense; when iterating over a file you get individual lines from the file, so each row is a string object. Indexing into strings gives you new strings with just one character in it; 'foo'[1] is the character 'o', for example.
If you wanted to copy across rows that don't match a string, you don't need to know the row count up front at all. You are not handling a list of rows here, you can look at each row individually instead:
filter_string = self.delName.get("1.0","end-1c")
with open("userDatabase.csv","r") as myFile:
    with open("newFile.csv","w") as newFile:
        for row in myFile:
            if filter_string not in row:
                newFile.write(row)
This does a sub-string match. If you need to match whole columns, use the csv module to give you individual columns to match against. The module handles the quotes around column values:
import csv

filter_string = self.delName.get("1.0","end-1c")
with open("userDatabase.csv", "r", newline='') as myFile:
    with open("newFile.csv", "w", newline='') as newFile:
        writer = csv.writer(newFile)
        for row in csv.reader(myFile):
            # row is now a list of strings, like ['lukefinney', '0000000', '0000000', 'a']
            if filter_string != row[0]:  # test against the first column
                # copied across if the first column does not match exactly
                writer.writerow(row)
One problem is that row_count = sum(1 for row in myFile) consumes all rows from myFile. Subsequent reads on myFile will return an empty string which signifies end of file. This means that for loop later in your code where you execute for row in myFile: is not entered because all rows have already been consumed.
A way around this is to add a call to myFile.seek(0) just before for row in myFile:. This will reset the file pointer and the for loop should then work.
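A self-contained demonstration of the problem and the fix, using an in-memory StringIO as a stand-in for the open file:

```python
from io import StringIO

f = StringIO('line1\nline2\n')     # stands in for an open file object
row_count = sum(1 for row in f)    # this consumes every line
print(row_count)                   # 2
print(list(f))                     # [] -- the iterator is exhausted
f.seek(0)                          # rewind the file pointer to the start
print(list(f))                     # ['line1\n', 'line2\n']
```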
It's not very clear from your code what you are trying to do, but it looks like you want to filter out rows that contain a certain string. Try this:
with open("userDatabase.csv","r") as myFile:
    with open("newFile.csv","w") as newFile:
        for row in myFile:
            if self.delName.get("1.0","end-1c") not in row:
                newFile.write(row)

Python csv reader: null/empty value at end of line not being parsed

I have a tab delimited file with lines of data as such:
8600tab8661tab000000000003148415tab10037-434tabXEOL
8600tab8662tab000000000003076447tab6134505tabEOL
8600tab8661tab000000000003426726tab470005-063tabXEOL
There should be 5 fields with the possibility of the last field having a value 'X' or being empty as shown above.
I am trying to parse this file in Python (2.7) using the csv reader module as such:
import csv

file = open(fname)
reader = csv.reader(file, delimiter='\t', quoting=csv.QUOTE_NONE)
for row in reader:
    for i in range(5):  # there are 5 fields
        print row[i]    # this fails if there is no 'X' in the last column:
                        # index out of bounds error
If the last column is empty the row structure will end up looking like:
list: ['8600', '8662', '000000000003076447', '6134505']
So when row[4] is called, the error follows..
I was hoping for something like this:
list: ['8600', '8662', '000000000003076447', '6134505', '']
This problem only seems to occur if the very last column is empty. I have been looking through the reader arguments and dialect options to see if there is a simple option to pass to csv.reader that fixes the way it handles an empty field at the end of the line. So far no luck.
Any help will be much appreciated!
The easiest option would be to check the length of the row beforehand. If the length is 4, append an empty string to your list.
for row in reader:
    if len(row) == 4:
        row.append('')
    for i in range(5):
        print row[i]
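A slightly more general variant pads any short row out to the expected width, which also covers rows missing more than one trailing field (FIELD_COUNT here is just the 5 fields from the question):

```python
FIELD_COUNT = 5

row = ['8600', '8662', '000000000003076447', '6134505']   # trailing field lost
row += [''] * (FIELD_COUNT - len(row))                    # pad with empty strings
print(row)   # ['8600', '8662', '000000000003076447', '6134505', '']
```

Rows that already have all 5 fields are left untouched, since the padding list is then empty.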
There was a minor PEBCAK on my part. I was going back and forth between editing the file in Notepad++ and Gvim. At some point I lost the last tab at the end of the line. I fixed the file and it parsed as expected.

Trouble in saving a list to csv

I am saving a list to a csv file using the writerow function from the csv module. Something went wrong when I opened the final file in MS Office Excel.
Before I encountered this issue, the main problem I was trying to deal with was getting the list saved one record per row; it was saving every line into a cell in row 1. I made some small changes, and now this happened. As a Python novice I am certainly very confused.
import csv

inputfile = open('small.csv', 'r')
header_list = []
header = inputfile.readline()
header_list.append(header)
input_lines = []
for line in inputfile:
    input_lines.append(line)
inputfile.close()
AA_list = []
for i in range(0, len(input_lines)):
    if (input_lines[i].split(',')[4]) == 'AA':  # column 4 has different names, including 'AA'
        AA_list.append(input_lines[i])
full_list = header_list + AA_list
resultFile = open("AA2013.csv", 'w+')
wr = csv.writer(resultFile, delimiter=',')
wr.writerow(full_list)
Thanks!
UPDATE:
The full_list looks like this: ['1,2,3,"MEM",...]
UPDATE2(APR.22nd):
Now I get three cells of data: the header in A1 and the rest in A2 and A3 respectively. Apparently the newline characters are not working for three items in one big list. I think the more specific question now is: how do I save a list of records, each ending with '\n', to a csv file?
UPDATE3(APR.23rd):
original file
Importing the csv module is not enough; you need to actually use it. Right now, you're appending each line as one entire string to your list instead of as a list of fields.
Start with
import csv

with open('small.csv', 'rb') as inputfile:
    reader = csv.reader(inputfile, delimiter=",")
    header_list = next(reader)
    input_lines = list(reader)
Now header_list contains all the headers, and input_lines contains a nested list of all the rows, each one split into columns.
I think the rest should be pretty straightforward.
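To spell out "the rest", a sketch of the remaining steps under the same column-index-4 filter from the question, written in Python 3 style with a made-up stand-in for small.csv:

```python
import csv

# Made-up stand-in for small.csv; column index 4 holds the name to filter on.
with open('small.csv', 'w', newline='') as f:
    csv.writer(f).writerows([
        ['h1', 'h2', 'h3', 'h4', 'name'],
        ['1', '2', '3', 'MEM', 'AA'],
        ['4', '5', '6', 'MEM', 'BB'],
    ])

with open('small.csv', newline='') as inputfile, \
     open('AA2013.csv', 'w', newline='') as resultFile:
    reader = csv.reader(inputfile)
    writer = csv.writer(resultFile)
    writer.writerow(next(reader))                          # copy the header through
    writer.writerows(row for row in reader if row[4] == 'AA')
```

Because every row is a list of fields, writerow puts each field in its own cell, and the one-cell-per-record problem from the question never arises.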
append() adds its argument as a single element at the end of a list. So when you write header_list.append(header), the entire header line (one string) is appended as a single item. You should write
headers = header.split(',')
header_list.append(headers)
This splits the header row by commas, so headers is the list of header words, which is then appended properly to header_list.
The same thing goes for AA_list.append(input_lines[i]).
I figured it out.
The difference between [val], val, and val.split(",") inside the writerow call was:
[val]: a single string containing everything, taking only the first column in Excel (the header and "2013, 1, 2, ..." in A1, B1, C1 and so on).
val: each letter, comma, or space (I forget the technical term) takes its own cell in Excel.
val.split(","): splits the string from [val] at the commas and puts each comma-separated piece into its own Excel cell.
Here is what I found out: 1. the right way to export the flat list line by line is to use the with syntax; 2. split each list item when writing the row:
csvwriter.writerow(JD.split())
full_list = header_list + AA_list
with open("AA2013.csv", 'w+') as resultFile:
    wr = csv.writer(resultFile, delimiter=",", lineterminator='\n')
    for val in full_list:
        wr.writerow(val.split(','))
The wanted output
Please correct any terms or syntax I misused! Thanks.
