I'm trying to extract series numbers (that is, bus stop numbers) from a CSV file and write them to a new CSV file. The fields that hold them usually look like this: "Queen Street, Bus Station - Platform A3 [BT000998]". I only need the content enclosed in the brackets. Some fields contain unwanted commas (as in the example above), and using the csv module avoids that problem. To do that, I wrote the following code:
import csv
import re
fp = open(r'C:\data\input.csv')
fpw = open(r'C:\data\output.csv','w')
data = csv.reader(fp)
writer = csv.writer(fpw)
for row in data:
    line = ','.join(row)
    lst = line.split(',')
    try:
        stop = lst[11] # find the location that contains stop number
        extr = re.search(r"\[([A-Za-z0-9_]+)\]", stop) # extract stop number enclosed by brackets
        stop_id = str(extr.group(1))
        lst[11] = stop_id # replace the original content with the extracted stop number
        writer.writerow(lst) # write in the output file (fpw)
    except Exception, e: # this part is in case there is error such as AttributeError
        writer.writerow(row)
After running this code, no error is raised, but only an empty CSV file is generated. I'm quite new to Python and would much appreciate it if anyone could help me make this code work.
Thank you in advance.
Sui
====UPDATE====
Based on everyone's reply, I revised the code as follows:
import csv
import re
fp = r'C:\data\input.csv'
fpw = r'C:\data\output.csv'
with open(fp, 'rb') as input, open(fpw, 'wb') as output:
    for row in csv.reader(input):
        try:
            stop = row[11]
            extr = re.search(r"\[([A-Za-z0-9_]+)\]", stop)
            stop_id = str(extr.group(1))
            row[11] = stop_id
            repl_row = ','.join(row) + '\n'
            output.write(repl_row)
        except csv.Error:
            pass
Now the code seems to work. HOWEVER, partway through the run, the error 'line contains NULL byte' was raised and Python stopped, even though I added try/except as shown above. Does anyone have a suggestion for dealing with this issue and letting the code continue? By the way, the CSV file I'm working on is over 2GB.
Many thanks, Sui
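If that's the whole code, you need to close the file with fpw.close() after you are done with all the writer operations. Until the file is closed (or flushed), the rows you wrote may still be sitting in a buffer, which is why the output looks empty.
You can also use the with statement, as shown in the official Python documentation; it closes the file for you when the block ends.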
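A minimal sketch of that approach applied to the question's task, written for Python 3 (the question itself uses Python 2). The function name extract_stop_ids and its column parameter are hypothetical; the default of 11 is taken from the question's code. Both files are closed when the with block ends, so the output is flushed to disk. Rows that are too short or contain no bracketed ID are written unchanged, which is an assumption about the desired behaviour.

```python
import csv
import re

# Same pattern as in the question: the ID is whatever sits inside [ ].
STOP_ID = re.compile(r"\[([A-Za-z0-9_]+)\]")

def extract_stop_ids(in_path, out_path, column=11):
    """Copy in_path to out_path, replacing one column with its bracketed ID."""
    # newline='' lets the csv module handle line endings itself (Python 3).
    with open(in_path, newline='') as src, \
         open(out_path, 'w', newline='') as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            if len(row) > column:
                match = STOP_ID.search(row[column])
                if match:
                    row[column] = match.group(1)
            # Short rows and rows without a bracketed ID pass through as-is.
            writer.writerow(row)
```

It could be called as extract_stop_ids(r'C:\data\input.csv', r'C:\data\output.csv'). Because csv.writer re-quotes fields when needed, any commas remaining in other columns stay intact.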
I would like to read in delimited data (length unknown) embedded in a larger .txt file. The usual ways, using np.loadtxt, np.genfromtxt, or pd.read_csv don't seem to work as they throw an error when encountering a bad line. Of course, you can handle bad lines but I haven't found an option to just stop and return the already imported data.
Is there such an option that I overlooked, or do I have to go back and evaluate the file line by line?
Any suggestions would be appreciated :)
Something like this should work, though it might well be better to pre-process the file to fix whatever is causing the issue instead of only reading in data up to that point.
import csv
with open('try.csv', newline='') as csvfile:
    rows = []
    reader = csv.reader(csvfile)
    try:
        for row in reader:
            rows.append(row)
    # You should change Exception to be more specific
    except Exception as e:
        print("Caught", e)
# These are the rows that could be read
print(rows)
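A slightly more specific variant, as a sketch only: it catches csv.Error instead of Exception and also stops at the first row whose field count differs from the first row's. The helper name read_until_bad_row and that definition of a "bad" row are assumptions, since the question does not say what the bad lines look like.

```python
import csv

def read_until_bad_row(lines):
    """Return the rows read before the first parse error or ragged row."""
    rows = []
    # strict=True makes the reader raise csv.Error on malformed quoting.
    reader = csv.reader(lines, strict=True)
    try:
        for row in reader:
            # Assumption: a "bad" row has a different field count
            # than the first row that was read.
            if rows and len(row) != len(rows[0]):
                break
            rows.append(row)
    except csv.Error:
        # Keep whatever was parsed before the error.
        pass
    return rows
```

The argument can be any iterable of lines, for example a file opened with open('try.csv', newline='').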
I started learning Python and I'm taking a Google course on Coursera about automation and IT using it. In the Practice Quiz: Reading & Writing CSV Files, the first question is:
We're working with a list of flowers and some information about each one. The create_file function writes this information to a CSV file. The contents_of_file function reads this file into records and returns the information in a nicely formatted block. Fill in the gaps of the contents_of_file function to turn the data in the CSV file into a dictionary using DictReader.
After giving an answer I receive "Incorrect. Something went wrong! Contact Coursera Support about this question!". I found a page here and copied that code, but the result is always the same. So I contacted Coursera, but they say there's no problem on their end. This is the code I provided:
import os
import csv
# Create a file with data in it
def create_file(filename):
    with open(filename, "w") as file:
        file.write("name,color,type\n")
        file.write("carnation,pink,annual\n")
        file.write("daffodil,yellow,perennial\n")
        file.write("iris,blue,perennial\n")
        file.write("poinsettia,red,perennial\n")
        file.write("sunflower,yellow,annual\n")
# Read the file contents and format the information about each row
def contents_of_file(filename):
    return_string = ""
    # Call the function to create the file
    create_file(filename)
    # Open the file
    with open(filename) as f:
        # Read the rows of the file into a dictionary
        x = csv.DictReader(f)
        # Process each item of the dictionary
        for row in x:
            return_string += "a {} {} is {}\n".format(row["color"], row["name"], row["type"])
    return return_string
#Call the function
print(contents_of_file("flowers.csv"))
Has anyone encountered the same issues? Or can you explain to me why it doesn't work?
Adding the console log of the browser here. Tried with Firefox, Chrome and now on Opera.
Console Log
As it is an online evaluation platform, it might prohibit things like import os for security reasons. Besides, the os module isn't used anywhere in your code. Did you try removing that line?
It seems you left out some options (delimiter and newline='') when opening the file and creating the reader. Here is the working code:
import os
import csv
# Create a file with data in it
def create_file(filename):
    with open(filename, "w") as file:
        file.write("name,color,type\n")
        file.write("carnation,pink,annual\n")
        file.write("daffodil,yellow,perennial\n")
        file.write("iris,blue,perennial\n")
        file.write("poinsettia,red,perennial\n")
        file.write("sunflower,yellow,annual\n")
# Read the file contents and format the information about each row
def contents_of_file(filename):
    return_string = ""
    # Call the function to create the file
    create_file(filename)
    # Open the file
    with open(filename, "r", newline='') as f:
        # Read the rows of the file into a dictionary
        reader = csv.DictReader(f, delimiter=",")
        # Process each item of the dictionary
        for row in reader:
            return_string += "a {} {} is {}\n".format(row["color"], row["name"], row["type"])
    return return_string
#Call the function
print(contents_of_file("flowers.csv"))
and result is:
a pink carnation is annual
a yellow daffodil is perennial
a blue iris is perennial
a red poinsettia is perennial
a yellow sunflower is annual
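Keep in mind that newline='' applies to Python 3. The csv module already uses a comma as its default delimiter, so passing delimiter="," only makes that choice explicit; it matters when the file uses a different separator.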
This issue still persists. I reported it to Coursera today. There has to be an error on their side. Well, at least it's not a graded assessment, just a practice quiz. But frustrating nevertheless.
I am new to Python and struggle with the following. Users upload CSV files, which I then parse. However, a lot of things can go wrong. The principal issues I have found are a) the files they upload aren't CSV files after all, or b) the files are not encoded as UTF-8 (which is the default on our system).
The question is: where exactly should I check for these issues? This is my script:
with open(path) as f:
    reader = csv.reader(f)
    for row in reader:
        (do stuff...)
I have tried adding this:
try:
    reader = csv.reader(f)
except:
    error = "There was an error..."
But if the user uploads a file with the wrong encoding then this is not caught. It only seems to be caught when the loop starts (for row in reader), and only for the particular row that causes trouble. Does this mean that I should have this kind of error checking inside the for statement? It seems much better to me to do it only once, not on every item, but I'm not sure what makes most sense here...
There may be more direct ways to catch these sorts of things with the CSV reader (I'm not sure), but when I have input files coming from other users with the potential for these sorts of errors I just parse the files manually. E.g.:
import sys
NumFields = 3
with open(path) as inf:
    for line in inf:
        line = line.strip()  # get rid of weird end-of-line characters from bad encoding
        ls = line.split(",")
        if len(ls) != NumFields:
            ls = line.split("\t")  # if you can't get the number of fields with comma split, try tab
            if len(ls) != NumFields:
                sys.exit(1)
        print(ls)
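On the original question of where to put the check: csv.reader() does not read anything when it is created; decoding and parsing happen while the loop runs, so the try has to enclose the loop. That still means a single try/except around the whole loop, not one per row. A minimal sketch, assuming Python 3; the function name parse_upload and the error messages are hypothetical.

```python
import csv

def parse_upload(path, encoding="utf-8"):
    """Return (rows, error); error is None when the whole file parsed."""
    rows = []
    try:
        # Decoding happens lazily while iterating, so the loop
        # must be inside the try block.
        with open(path, encoding=encoding, newline='') as f:
            for row in csv.reader(f):
                rows.append(row)
    except UnicodeDecodeError as exc:
        return [], "File is not valid {}: {}".format(encoding, exc)
    except csv.Error as exc:
        return [], "File is not valid CSV: {}".format(exc)
    return rows, None
```

Note that the csv module is lenient: many non-CSV text files parse without raising csv.Error, so an extra sanity check (for example on the number of fields per row, as in the manual parser above) may still be needed.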
As I'm out of ideas, I've turned to the geniuses on this site.
What I want to be able to do is to have two separate csv files. One of which has a bunch of store names on it, and the other to have black listed stores.
I'd like to be able to run a python script that reads the 'black listed' sheet, then checks if those specific names are within the other sheet, and if they are, then delete those off the main sheet.
I've tried for about two days straight and cannot for the life of me get it to work. So I'm coming to you guys to help me out.
Thanks so much in advance.
P.S. If you can comment the hell out of the script so I know what's going on, it would be greatly appreciated.
EDIT: I deleted the code I originally had but hopefully this will give you an idea of what I was trying to do. (I also realise it's completely incorrect)
import csv
with open('Black List.csv', 'r') as bl:
    reader = csv.reader(bl)
    with open('Destinations.csv', 'r') as dest:
        readern = csv.reader(dest)
        for line in reader:
            if line in readern:
                with open('Destinations.csv', 'w'):
                    del(line)
The first thing you need to be aware of is that you can't update the file you are reading. Text files (which include .csv files) don't work like that. So you have to read the whole of Destinations.csv into memory, and then write it out again, under a new name, but skipping the rows you don't want. (You can overwrite your input file, but you will very quickly discover that is a bad idea.)
import csv
blacklist_rows = []
with open('Black List.csv', 'r') as bl:
    reader = csv.reader(bl)
    for line in reader:
        blacklist_rows.append(line)
destination_rows = []
with open('Destinations.csv', 'r') as dest:
    readern = csv.reader(dest)
    for line in readern:
        destination_rows.append(line)
Now at this point you need to loop through destination_rows, drop any that match something in blacklist_rows, and write out the rest. I can't suggest what the matching test should look like, because you haven't shown us your input data, so I don't actually know what blacklist_rows and destination_rows contain.
with open('FilteredDestinations.csv', 'w') as output:
    writer = csv.writer(output)
    for r in destination_rows:
        if not r: # trap for blank rows in the input
            continue
        if r *matches something in blacklist_rows*: # you have to code this
            continue
        writer.writerow(r)
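One possible way to code the matching test, as a sketch only: it assumes the store name is the first field of every row in both files and that names should match exactly after stripping surrounding whitespace. Neither assumption comes from the question, and the function name filter_destinations is hypothetical.

```python
import csv

def filter_destinations(dest_path, blacklist_path, out_path):
    """Write the rows of dest_path whose first field is not blacklisted."""
    with open(blacklist_path, newline='') as bl:
        # Assumption: the store name is the first field of every row.
        # A set makes each membership test fast.
        blacklist = {row[0].strip() for row in csv.reader(bl) if row}
    with open(dest_path, newline='') as dest, \
         open(out_path, 'w', newline='') as out:
        writer = csv.writer(out)
        for row in csv.reader(dest):
            if not row:  # trap for blank rows in the input
                continue
            if row[0].strip() in blacklist:  # drop blacklisted stores
                continue
            writer.writerow(row)
```

For the files in the question this would be called as filter_destinations('Destinations.csv', 'Black List.csv', 'FilteredDestinations.csv'), leaving the original Destinations.csv untouched.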
You could try Pandas
import pandas as pd
df1 = pd.read_csv("Destinations.csv")
df2 = pd.read_csv("Black List.csv")
blacklist = df2["column_name_in_blacklist_file"].tolist()
df3 = df1[~df1['destination_column_name'].isin(blacklist)]
df3.to_csv("results.csv")
print(df3)
I am trying to remove duplicates from a 3-column tab-delimited txt file: whenever the first two columns of two rows are identical, the duplicate row should be removed even if the third columns differ.
from operator import itemgetter
import sys
input = sys.argv[1]
output = sys.argv[2]
#Pass any column number you want, note that indexing starts at 0
ig = itemgetter(0,1)
seen = set()
data = []
for line in input.splitlines():
    key = ig(line.split())
    if key not in seen:
        data.append(line)
        seen.add(key)
    file = open(output, "w")
    file.write(data)
    file.close()
First, I get error
key = ig(line.split())
IndexError: list index out of range
Also, I can't see how to save the result to output.txt
People say saving to output.txt is a really basic matter. But no tutorial helped.
I tried methods that use codec, those that use with, those that use file.write(data) and all didn't help.
I could learn MatLab quite easily. The online tutorial was fantastic and a series of Googling always helped a lot.
But I can't find a helpful Python tutorial yet. This is obviously because I am a complete novice. For complete novices like me, what would be the best tutorial that offers 1) comprehensiveness, 2) lots of examples, and 3) line-by-line explanations that don't leave any line unexplained?
And why is the above code causing error and not saving result?
I'm assuming, since you assign input to the first command line argument with input = sys.argv[1] and output to the second, that you intend those to be your input and output file names. But you never open any file for the input data, so you're calling .splitlines() on a file name, not on file contents.
Next, splitlines() is the wrong approach here anyway. To iterate over a file line by line, simply use for line in f, where f is an open file. Those lines will include the newline at the end of the line, so it needs to be stripped if it's not supposed to be part of the third column's data.
Then you're opening and closing the file inside your loop, which means you'll try to write the entire contents of data to the file every iteration, effectively overwriting any data written to the file before. Therefore I moved that block out of the loop.
It's good practice to use the with statement for opening files. with open(out_fn, "w") as outfile will open the file named out_fn and assign the open file to outfile, and close it for you as soon as you exit that indented block.
input is a builtin function in Python. I therefore renamed your variables so no builtin names get shadowed.
You're trying to write data directly to the output file. This won't work, since data is a list of lines. You need to join those lines first to turn them into a single string again before writing it to a file.
So here's your code with all those issues addressed:
from operator import itemgetter
import sys
in_fn = sys.argv[1]
out_fn = sys.argv[2]
getkey = itemgetter(0, 1)
seen = set()
data = []
with open(in_fn, 'r') as infile:
    for line in infile:
        line = line.strip()
        key = getkey(line.split())
        if key not in seen:
            data.append(line)
            seen.add(key)
with open(out_fn, "w") as outfile:
    outfile.write('\n'.join(data))
Why is the above code causing error?
Because you haven't opened the file, you are trying to work with the string input.txt rather than with the file. Then when you try to access your item, you get a list index out of range because line.split() returns ['input.txt'], which has only one element.
How to fix that: open the file and then work with it, not with its name.
For example, you can do (I tried to stay as close to your code as possible)
input = sys.argv[1]
infile = open(input, 'r')
(...)
lines = infile.readlines()
infile.close()
for line in lines:
    (...)
Why is this not saving result?
Because you are opening/closing the file inside the loop. What you need to do is write the data once you're out of the loop. Also, you cannot write a list directly to a file. Hence, you need to do something like this (outside of your loop):
outfile = open(output, "w")
for item in data:
    outfile.write(item)
outfile.close()
All together
There are other ways of reading and writing files, and they are pretty well documented on the internet, but I tried to stay close to your code so that you would better understand what was wrong with it.
from operator import itemgetter
import sys
input = sys.argv[1]
infile = open(input, 'r')
output = sys.argv[2]
#Pass any column number you want, note that indexing starts at 0
ig = itemgetter(0,1)
seen = set()
data = []
lines = infile.readlines()
infile.close()
for line in lines:
    print(line)
    key = ig(line.split())
    if key not in seen:
        data.append(line)
        seen.add(key)
print(data)
outfile = open(output, "w")
for item in data:
    outfile.write(item)
outfile.close()
PS: it seems to produce the result that you needed in your other question, "Python to remove duplicates using only some, not all, columns".