List from CSV column in Python

So I had written a little snippet for Scrapy to search a site by zip code across the country, but it seems like a waste to go through all the nonexistent zip codes. First, this is what I had:
def start_requests(self):
    for i in xrange(100000):
        yield self.make_requests_from_url("http://www.example.com/zipcode/%05d/search.php" % i)
The idea is obvious, but I downloaded a CSV with all of the US zip codes in a column - how would I easily use this as a list (or a more efficient structure than a list) in the above example? I have pandas if that would make things easier.

If I'm understanding you correctly, you have a comma-delimited file formatted such that a zip code is present on each row in a particular column (perhaps titled 'ZipCodes').
If there's a header line and different columns and you know the name of the column that contains the zip codes, you could do this:
def start_requests(self, filename, columnname):
    with open(filename) as f:
        headers = f.readline().strip().split(',')
        for line in f:
            zipcode = line.strip().split(',')[headers.index(columnname)]
            # the zip code is already a string here, so use %s rather than %05d
            yield self.make_requests_from_url("http://www.example.com/zipcode/%s/search.php" % zipcode)

Open file, read lines, grab zip codes, yield ...
for line in open('zipcodes.csv', 'r'):
    zipcode = line.split(',')[columnNumberOfTheZipCodesStartingFrom0]
    yield self.make_requests_from_url("http://foo.com/blah/%s/search.php" % (zipcode,))

Just to round out the array of perfectly good suggestions, here's another. The main appeal of this approach is that it doesn't require a special library like pandas, but it isn't just reading raw file contents either, which would force you to re-invent the wheel as far as CSV parsing goes (not the hardest thing, but why bother?). If your csv file is simple enough, it might be easier just to read the file contents directly, as suggested by dg99.
Use python's built-in csv library!
import csv

ziplist = []
with open('zipcodes.csv', 'rb') as csvfile:
    zipreader = csv.reader(csvfile)
    for row in zipreader:
        ziplist.append(row[i])
Notes:
I have row[i] where i is the column index for the zip codes in your csv file. If the file lists zip+4 codes, you might use row[i][:5]. If you don't know which column number the zip codes are in, but you do know the column header (field name), you can use:
zipreader = csv.DictReader(csvfile)
for zipDict in zipreader:
    ziplist.append(zipDict['Zip Code Column Name Here'])
According to this post, getting info back out of a list is just as efficient as a tuple, so this seems like the way to go.
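Once ziplist is built, the loop from the question can iterate over it directly — a sketch reusing the question's URL pattern:
def start_requests(self):
    for zipcode in ziplist:
        yield self.make_requests_from_url(
            "http://www.example.com/zipcode/%s/search.php" % zipcode)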

So you want to read a csv into a list... well, I think this should be easy:
import pandas

colname = ['zip code', 'city']
# note: names= tells pandas the file has no header row;
# if it does have one, also pass header=0 so it isn't read as data
zipdata = pandas.read_csv('uszipcodes.csv', names=colname)
I hope I understood you right!
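To then get the zip codes out as a plain Python list (assuming the column really is labeled 'zip code', as above):
ziplist = zipdata['zip code'].tolist()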

Maybe like this?
#!/usr/local/cpython-3.3/bin/python

import csv
import pprint

def gen_zipcodes(file_):
    reader = csv.reader(file_, delimiter='|', quotechar='"')
    for row in reader:
        yield row[0]

def main():
    with open('zipcodes_2006.txt', 'r') as file_:
        zipcodes = list(gen_zipcodes(file_))
    pprint.pprint(zipcodes[:10])

main()

Related

Cross referencing two csv files in python

As I'm out of ideas, I've turned to the geniuses on this site.
What I want to be able to do is have two separate csv files: one with a bunch of store names on it, and the other with black-listed stores.
I'd like to be able to run a python script that reads the 'black listed' sheet, checks whether those specific names are in the other sheet, and if they are, deletes them from the main sheet.
I've tried for about two days straight and cannot for the life of me get it to work, so I'm coming to you guys to help me out.
Thanks so much in advance.
P.S. If you can comment the hell out of the script so I know what's going on, it would be greatly appreciated.
EDIT: I deleted the code I originally had but hopefully this will give you an idea of what I was trying to do. (I also realise it's completely incorrect)
import csv
with open('Black List.csv', 'r') as bl:
    reader = csv.reader(bl)
    with open('Destinations.csv', 'r') as dest:
        readern = csv.reader(dest)
        for line in reader:
            if line in readern:
                with open('Destinations.csv', 'w'):
                    del(line)
The first thing you need to be aware of is that you can't update the file you are reading. Text files (which include .csv files) don't work like that. So you have to read the whole of Destinations.csv into memory, and then write it out again, under a new name, but skipping the rows you don't want. (You can overwrite your input file, but you will very quickly discover that is a bad idea.)
import csv

blacklist_rows = []
with open('Black List.csv', 'r') as bl:
    reader = csv.reader(bl)
    for line in reader:
        blacklist_rows.append(line)

destination_rows = []
with open('Destinations.csv', 'r') as dest:
    readern = csv.reader(dest)
    for line in readern:
        destination_rows.append(line)
Now at this point you need to loop through destination_rows, drop any that match something in blacklist_rows, and write out the rest. I can't suggest what the matching test should look like, because you haven't shown us your input data, so I don't actually know what blacklist_rows and destination_rows contain.
with open('FilteredDestinations.csv', 'w') as output:
    writer = csv.writer(output)
    for r in destination_rows:
        if not r:  # trap for blank rows in the input
            continue
        if r *matches something in blacklist_rows*:  # you have to code this
            continue
        writer.writerow(r)
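For illustration only — assuming the store name is the first column in both files (a guess, since the input data wasn't shown), the test might become a set-membership check:
# Hypothetical: store names live in column 0 of both files
blacklist_names = {row[0] for row in blacklist_rows if row}

with open('FilteredDestinations.csv', 'w') as output:
    writer = csv.writer(output)
    for r in destination_rows:
        if not r:  # trap for blank rows in the input
            continue
        if r[0] in blacklist_names:  # skip black-listed stores
            continue
        writer.writerow(r)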
You could try Pandas:
import pandas as pd

df1 = pd.read_csv("Destinations.csv")
df2 = pd.read_csv("Black List.csv")
blacklist = df2["column_name_in_blacklist_file"].tolist()
# keep the destination rows whose names are NOT in the blacklist
df3 = df1[~df1['destination_column_name'].isin(blacklist)]
df3.to_csv("results.csv")
print(df3)

How can I open multiple csv files in a folder, take the average of a column and save in a separate file using python?

I am extremely new to python and need some help with this one. I've tried various snippets and none seem to work, so suggestions would be awesome.
I have a folder with about 1500 csv files that each contain multiple columns of data. I need to take the average of the first column, called "agr", and save this value in a different Excel or csv file. It would be great if I could also somehow save the name of the file with its averaged value so that I can keep track of which file it came from. The names of the files are crop_city (e.g. corn_omaha).
import glob
import csv
import numpy as np
import pandas as pd

path = ('C:/test/*.csv')
for fname in glob.glob(path):
    with open(fname) as csvfile:
        agr = []
        reader = csv.DictReader(csvfile)  # pass the file object, not the filename
        for row in reader:
            print(row['agr'])
I know the code above is extremely rudimentary, so any help would be great thanks everyone!
Assuming the first column in these CSV files is a decimal or float, you don't really need to parse the entire line. Just split at the first separator and parse the first token. There is no real advantage to numpy or pandas either. Just use the builtin sum function.
import glob
import os

path = ('test/*.csv')  # using local dir for test
with open('output.csv', 'w', newline='') as outfile:
    outfile.write("Filename,Sum\r\n")  # header for output
    for fname in glob.glob(path):
        with open(fname) as csvfile:
            next(csvfile)  # skip header
            outfile.write("{},{}\r\n".format(os.path.basename(fname),
                sum(float(line.split(',', 1)[0].strip())
                    for line in csvfile)))
Contrary to the answer by @tdelaney, I would not advise you to limit your code by relying on the fact that you are adding up the first column; what if you need to work with the third column next week? It's easy to do this properly by building on the code you provide. Parsing a couple of thousand text files is not going to slow you down.
The csv.DictReader constructor will automatically treat the first row of its input as a header (unless you explicitly specify a list of column names with the fieldnames parameter). So your code can look like this:
import csv
import glob

path = 'C:/test/*.csv'  # same pattern as in the question
averages = []
for fname in glob.glob(path):
    with open(fname, "rb") as csvfile:
        reader = csv.DictReader(csvfile)
        values = [float(row["agr"]) for row in reader]
        avg = sum(values) / len(values)
        averages.append((fname, avg))
The list averages now contains the numbers you want. This is how you write it out to another CSV file:
with open("avegages.csv", "wb") as outfile:
writer = csv.writer(outfile)
writer.writerow(["File", "Average agr"])
for row in averages:
writer.writerow(row)
PS. Since you included pandas in your imports, here's one way to do the same thing with pandas. However, I recommend sticking with csv for now. The pandas object model is complex, and hard to wrap your head around.
averages = []
for fname in glob.glob(path):
    data = pd.read_csv(fname)  # read_csv keeps all columns as data
    averages.append((fname, data["agr"].mean()))
df_out = pd.DataFrame.from_records(averages, columns=["File", "Average agr"])
df_out.to_csv("averages.csv", index=False)
As you can see the code is a lot shorter, since file i/o and calculations can be done with one statement.

How do I make the filename of the output csv file equal to the content of a column

I have a huge csv file with all our student rosters inside of it. So:
1) I want to separate the rosters into smaller csv files based on the course name.
2) If I can have the output csv file's name be equal to the course name (example: Algebra1.csv), that would make my life so much better.
Is it possible to iterate through the courses column of the csv file and, when the name of the course changes, make a new csv file for that course? I think I could read the keys of the dictionary 'read_rosters' and then do a while loop?
An example of the csv input file would look like this:
Student firstname, Student lastname, Class Instructor, Course name, primary learning center
johnny, doe, smith, algebra1, online
jane, doe, austin, geometry, campus
Here is what I have so far:
import os
import csv

path = "/PATH/TO/FILE"
with open(os.path.join(path, "student_rosters.csv"), "rU") as rosters:
    read_rosters = csv.DictReader(rosters)
    for row in read_rosters:
        course_name = row['COURSES_COLUMN_HEADER']
        csv_file = os.path.join(course_name, ".csv")
        course_csv = csv.writer(open(csv_file, 'wb').next()
In your current code, you're opening an output csv file for each line you read. This will be slow, and, as you've currently written it, it won't work. That's because using the "wb" mode when you open the file erases everything that was in the file before. You might use an "a" mode, but this will still be slow.
How you can best solve the problem depends a bit on your data. If you can rely upon the input always having the rows with the same course next to one another, you could use groupby from the itertools module to easily write the appropriate lines out together:
from itertools import groupby
from operator import itemgetter

with open(os.path.join(path, "student_rosters.csv"), "rb") as rosters:
    reader = csv.DictReader(rosters)
    for course, rows in groupby(reader, itemgetter('COURSES_COLUMN_HEADER')):
        with open(os.path.join(path, course + ".csv"), "wb") as outfile:
            writer = csv.DictWriter(outfile, reader.fieldnames)
            writer.writerows(rows)
If you can't rely upon the organization of the rows, you have a couple options. One would be to read all the rows into a list, then sort them by course and use itertools.groupby like in the code above.
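That first option might look like this (a sketch, with the same placeholder column name as above):
from itertools import groupby
from operator import itemgetter

with open(os.path.join(path, "student_rosters.csv"), "rb") as rosters:
    reader = csv.DictReader(rosters)
    # sort so that groupby sees each course as one contiguous run
    rows = sorted(reader, key=itemgetter('COURSES_COLUMN_HEADER'))
    fieldnames = reader.fieldnames

for course, group in groupby(rows, itemgetter('COURSES_COLUMN_HEADER')):
    with open(os.path.join(path, course + ".csv"), "wb") as outfile:
        writer = csv.DictWriter(outfile, fieldnames)
        writer.writerows(group)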
Another option would be to keep reading just one line at a time, with each output row going into an appropriate file. I'd suggest keeping a dictionary of writer objects, indexed by course name. Here's what that could look like:
writers = {}
with open(os.path.join(path, "student_rosters.csv"), "rb") as rosters:
    reader = csv.DictReader(rosters)
    for row in reader:
        course = row['COURSES_COLUMN_HEADER']
        if course not in writers:
            outfile = open(os.path.join(path, course + ".csv"), "wb")
            writers[course] = csv.DictWriter(outfile, reader.fieldnames)
        writers[course].writerow(row)
If you were using this in production, you'd probably want to add some code to close the files after you were done with them, since you can't use with statements to close them automatically.
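A minimal sketch of that bookkeeping — track the file objects alongside the writers and close them once the input is exhausted:
files = {}    # course name -> open file object
writers = {}  # course name -> DictWriter over that file
with open(os.path.join(path, "student_rosters.csv"), "rb") as rosters:
    reader = csv.DictReader(rosters)
    for row in reader:
        course = row['COURSES_COLUMN_HEADER']
        if course not in writers:
            files[course] = open(os.path.join(path, course + ".csv"), "wb")
            writers[course] = csv.DictWriter(files[course], reader.fieldnames)
        writers[course].writerow(row)
for f in files.values():
    f.close()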
In my example codes above, I've made the code write out the full rows, just as they were in the input. If you don't want that, you can change the second argument to DictWriter to a sequence of the column names you want to write. You'll also want to include the parameter extrasaction="ignore" so that the extra values in the row dicts will be ignored when the columns you do want are written.
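For instance, with the sample header from the question, writing only the student name columns might look like this (a sketch; the field names are taken from the example rows above):
# Write only the two name columns; extrasaction='ignore' drops the rest
writer = csv.DictWriter(outfile,
                        ['Student firstname', 'Student lastname'],
                        extrasaction='ignore')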
First, this is not what you want:
csv_file = os.path.join(course_name, ".csv")
It will create a file named .csv in a subdirectory named course_name. You likely want something like:
csv_file = os.path.join(path, course_name + ".csv")
Also, the following has two issues: (a) unbalanced parens and (b) writer objects don't have a next method:
course_csv = csv.writer(open(csv_file, 'wb').next()
Try instead:
course_csv = csv.writer(open(csv_file, 'wb'))
And then you need to write something of your choosing to the new file, most likely with the writerow or writerows method (note that plain csv.writer objects have no writeheader; that belongs to DictWriter):
course_csv.writerow(header_row_of_your_choosing)
course_csv.writerows(data_rows_of_your_choosing)

Iteratively copy specific rows from CSV file to new file

I have a large tab-delimited csv file with the following format:
#mirbase_acc mirna_name gene_id gene_symbol transcript_id ext_transcript_id mirna_alignment gene_alignment mirna_start mirna_end gene_start gene_end genome_coordinates conservation align_score seed_cat energy mirsvr_score
What I would like to be able to do is iterate through rows and select items based on data (strings) in the "gene_id" field, then copy those rows to a new file.
I am a python noob and thought this would be a good way to get my feet wet, but it is harder than it looks! I have been trying to use the csv package to manipulate the files, reading and writing basic stuff using DictReader and DictWriter. If anyone can help me come up with a template for the iterative searching aspect, I would be greatly indebted. So far I have:
import csv
f = open("C:\Documents and Settings\Administrator\Desktop\miRNA Scripting\mirna_predictions_short.txt", "r")
reader = csv.DictReader(f, delimiter='\t')
writer = open("output.txt",'wb')
writer = csv.writer(writer, delimiter='\t')
Then the iterative bit, bleurgh:
for row in reader:
    if reader.gene_id == str(CG11710):
        writer.writerow
This obviously doesn't work. Any ideas on better ways to structure this?
You're almost there! The code is nearly correct :)
Accessing dicts goes like this:
some_dict['some_key']
Instead of:
some_object.some_attribute
Creating a string isn't done with str(...) but with quotes, like 'CG11710'.
In your case:
for row in reader:
    if row['gene_id'] == 'CG11710':
        writer.writerow(row)
Dictionaries in python are addressed like dictionary['key'], so for you it'd be row['gene_id']. Also, strings are declared in quotes ("text"), not as str(text); str(text) will try to cast whatever is stored in the variable text to a string, which is not what I think you want. Finally, writer.writerow is a function, and functions take arguments, so you need to do writer.writerow(row).

How to update a particular column of a row in a csv using python

I have a csv file with the following contents:
1,2,3,4
a,b,c,d
w,x,y,z
And I want to update the contents to:
1,2,3,4
a,b,c,d,k
w,x,y,z
Can someone please share Python code for this update process?
Using the csv library, we read in the data, convert it to a list of lists, append the k to the relevant list, then write it out again.
import csv
data = csv.reader(open("input.csv"))
l = list(data)
l[1].append("k")
our_writer = csv.writer(open("output.csv", "wb"))
our_writer.writerows(l)
While the csv library isn't completely necessary for your toy case, it's often good to use the approach that scales well.
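The title's broader case — changing an existing cell rather than appending one — works the same way: index into the nested list before writing (coordinates here are hypothetical):
l[1][2] = "k"  # hypothetical: overwrite row 1, column 2 instead of appending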
You don't need csv for such a simple case:
# sed '2s/$/,k/'
import fileinput

for lineno, line in enumerate(fileinput.input(inplace=True), start=1):
    if lineno == 2:
        line = line.rstrip('\n') + ",k\n"
    print line,
Example:
$ python update_2nd_line.py data_to_update.csv
It updates the data_to_update.csv file in place.
