Cross referencing two csv files in python - python

As I'm out of ideas, I've turned to the geniuses on this site.
What I want to be able to do is to have two separate csv files. One of which has a bunch of store names on it, and the other to have blacklisted stores.
I'd like to be able to run a python script that reads the 'blacklisted' sheet, then checks if those specific names are within the other sheet, and if they are, then deletes them from the main sheet.
I've tried for about two days straight and cannot for the life of me get it to work. So I'm coming to you guys to help me out.
Thanks so much in advance.
P.S. If you can comment the hell out of the script so I know what's going on, it would be greatly appreciated.
EDIT: I deleted the code I originally had but hopefully this will give you an idea of what I was trying to do. (I also realise it's completely incorrect)
import csv
with open('Black List.csv', 'r') as bl:
    reader = csv.reader(bl)
    with open('Destinations.csv', 'r') as dest:
        readern = csv.reader(dest)
        for line in reader:
            if line in readern:
                with open('Destinations.csv', 'w'):
                    del(line)

The first thing you need to be aware of is that you can't update a file in place while you are reading it. Text files (which include .csv files) don't work like that. So you have to read the whole of Destinations.csv into memory, and then write it out again, under a new name, skipping the rows you don't want. (You can overwrite your input file, but you will very quickly discover that is a bad idea.)
import csv
# Read every row of the blacklist into memory
blacklist_rows = []
with open('Black List.csv', 'r') as bl:
    reader = csv.reader(bl)
    for line in reader:
        blacklist_rows.append(line)
# Read every row of the main sheet into memory
destination_rows = []
with open('Destinations.csv', 'r') as dest:
    readern = csv.reader(dest)
    for line in readern:
        destination_rows.append(line)
Now at this point you need to loop through destination_rows, drop any that match something in blacklist_rows, and write out the rest. I can't suggest what the matching test should look like, because you haven't shown us your input data, so I don't actually know what blacklist_rows and destination_rows contain.
# newline='' stops the csv module from writing blank lines between rows on Windows
with open('FilteredDestinations.csv', 'w', newline='') as output:
    writer = csv.writer(output)
    for r in destination_rows:
        if not r: # trap for blank rows in the input
            continue
        if r *matches something in blacklist_rows*: # you have to code this
            continue
        writer.writerow(r)
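As a minimal sketch of what the matching test could look like: the version below assumes the store name sits in the first column of both files and that names should match regardless of case or surrounding spaces. Neither assumption comes from the question, and the helper names `normalise` and `filter_rows` are hypothetical.

```python
def normalise(name):
    # Assumption: "Acme " and "ACME" should count as the same store.
    return name.strip().lower()


def filter_rows(destination_rows, blacklist_rows, column=0):
    """Return the destination rows whose store name is not blacklisted.

    Assumes the store name is in the same column (default: the first)
    of both the destination rows and the blacklist rows.
    """
    # A set makes each membership test fast, however long the blacklist is.
    blocked = {normalise(row[column]) for row in blacklist_rows if row}
    kept = []
    for row in destination_rows:
        if not row:  # skip blank rows in the input
            continue
        if normalise(row[column]) in blocked:  # blacklisted: drop it
            continue
        kept.append(row)
    return kept
```

The rows returned by `filter_rows(destination_rows, blacklist_rows)` can then be passed to `writer.writerows(...)` in place of the loop above.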

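You could try pandas:
import pandas as pd
df1 = pd.read_csv("Destinations.csv")
df2 = pd.read_csv("Black List.csv")
# list of blacklisted store names
blacklist = df2["column_name_in_blacklist_file"].tolist()
# keep only the destination rows whose store name is not blacklisted
df3 = df1[~df1['destination_column_name'].isin(blacklist)]
# index=False leaves the pandas row index out of the output file
df3.to_csv("results.csv", index=False)
print(df3)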

Related

Python import csv file and replace blank values

I have just started a data quality class in which I got zero instruction on Python but am expected to create a script. There are three instructions for my Python script:
Create a script that loads an entire CSV file and replace all the blank values to NAN
Use genfromtxt function
Write the results set into a different file
I have been working on this for a few hours, but with no previous experience with Python, I am completely stuck! This is what I have so far:
import csv
file = open('quality.csv', 'r')  # the filename must be a quoted string
csvreader = csv.reader(file)
header = next(csvreader)
print(header)
rows = []
for row in csvreader:
    rows.append(row)
print(rows)
My first problem is that when I tried using genfromtxt, it would not print out the headers or the entire csv file, it would only print out a few lines. If it matters, all of the values of the csv file are ints/floats, but the headers are strings.
See here
The next problem is I have tried several different ways to replace blank values, but I was not successful. All of the blank fields in this file are in the last column. When I print out the csv in full, this is what the line looks like (I've highlighted the empty value):
See here
Finally, I have no idea what instruction #3 means. I am completely new at this with zero Python knowledge! I think I am unsure of the Python syntax and rules - which I will look into more and learn, however I only had two days to complete this assignment and I do not know anything yet! Thank you in advance.
What you did with genfromtxt already seems correct. With data this large, the terminal only shows some records from the beginning and the end; the three dots in the middle stand for the records you're not seeing.
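For the three instructions together, here is a rough sketch, assuming a comma-delimited file with one header line and numeric data (as the question describes). The function name `fill_blanks_with_nan` and the output path are hypothetical. With its default float dtype, `genfromtxt` already reads empty fields as NaN, and "write the results set into a different file" is handled here with `np.savetxt`.

```python
import numpy as np


def fill_blanks_with_nan(src_path, dst_path):
    """Load a numeric CSV, turn blank fields into NaN, save to a new file."""
    # Read the header line separately so it can be written back unchanged.
    with open(src_path, "r") as f:
        header = f.readline().strip()
    # With the default float dtype, genfromtxt fills empty fields with nan.
    data = np.genfromtxt(src_path, delimiter=",", skip_header=1)
    # comments="" stops savetxt from prefixing the header with "# ".
    np.savetxt(dst_path, data, delimiter=",", header=header,
               comments="", fmt="%g")
    return data
```

Calling `fill_blanks_with_nan("quality.csv", "quality_clean.csv")` would leave the input untouched and write the cleaned copy next to it.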

From Excel to Python: from columns to strings

I have an Excel file with 2 columns (it is a CSV file). The first contains dates, while the second contains numbers.
I want to make a list out of this with Python, the following way: ['date1','number1','date2','number2']. At the moment I always get: ['date1;number1','date2;number2']. I basically want every element to be treated as a string on its own.
At the moment I'm using following code:
import csv
abc = []
with open('C:...doc.csv', 'r', encoding = "utf8") as f:
    reader = csv.reader(f)
    for row in reader:
        abc.extend([row])
I've tried everything I could come up with, e.g. nested for etc, but nothing seems to work.
Can anyone help me out please? Thank you!
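The 'date1;number1' output suggests the file is semicolon-delimited, while csv.reader splits on commas by default. The sketch below assumes that is the cause; the function name `flatten_csv` is hypothetical. It passes `delimiter=';'` and extends the list with the row's fields rather than with the row itself.

```python
import csv


def flatten_csv(lines, delimiter=";"):
    """Return every field of every row as one flat list of strings.

    `lines` can be an open file or any iterable of text lines.
    """
    flat = []
    for row in csv.reader(lines, delimiter=delimiter):
        # extend(row) adds each field separately;
        # extend([row]) would add the whole row as a single element.
        flat.extend(row)
    return flat
```

With a real file this would be used as `with open(path, newline='', encoding='utf8') as f: abc = flatten_csv(f)`.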

List from CSV column in Python

So I had written a little snip for Scrapy to search the country on a site by zip code, but it seems like a waste to go through all the nonexistent zip codes, so, first, this is what I had...
def start_requests(self):
    for i in xrange(100000):
        yield self.make_requests_from_url("http://www.example.com/zipcode/%05d/search.php" % i)
The idea is obvious, but I downloaded a CSV with all of the US zip codes in a column - how would I easily use this as a list (or more efficient method than a list) to use in the above example? I have pandas if that would make things easier.
If I'm understanding you correctly, you have a file that is comma-delimited and formatted such that in a particular column (Perhaps titled 'ZipCodes') a zipcode is present on each row.
If there's a header line and different columns and you know the name of the column that contains the zipcodes you could do this:
def start_requests(self, filename, columnname):
    with open(filename) as file:
        headers = file.readline().strip().split(',')
        column = headers.index(columnname)  # look the column up once
        for line in file:
            zipcode = line.strip().split(',')[column]
            # zipcode is a string here, so format it with %s rather than %05d
            yield self.make_requests_from_url("http://www.example.com/zipcode/%s/search.php" % zipcode)
Open file, read lines, grab zip codes, yield ...
for line in open('zipcodes.csv', 'r').readlines():
    zipcode = line.split(',')[columnNumberOfTheZipCodesStartingFrom0]
    yield self.make_requests_from_url("http://foo.com/blah/%s/search.php" % (zipcode,))
Just to round out the array of perfectly good suggestions, here's another. The main idea of this approach is that it doesn't require special libraries like pandas, but it isn't just reading plain file contents either, in which case you have to re-invent the wheel as far as CSV markup goes (not the hardest thing, but why bother?). If your csv file is simple enough, it might be easier just to read out the file contents, as suggested by dg99.
Use python's built-in csv library!
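import csv
ziplist = []
i = 0  # column index of the zip codes; 0 is a placeholder, adjust for your file
# Python 3: open in text mode with newline='' (on Python 2, use 'rb' instead)
with open('zipcodes.csv', newline='') as csvfile:
    zipreader = csv.reader(csvfile)
    for row in zipreader:
        ziplist.append(row[i])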
Notes:
I have row[i] where i is the column index for the zipcodes in your csv file. If the file lists zip+4 codes, you might use row[i][:5]. Interestingly, if you don't know what column number the zipcodes will be in, but you do know the column header (field name), you can use
zipreader = csv.DictReader(csvfile)
for zipDict in zipreader:
    # each row is a dict keyed by column header
    ziplist.append(zipDict['Zip Code Column Name Here'])
According to this post, getting info back out of a list is just as efficient as a tuple, so this seems like the way to go.
so you want to read in a csv to a list....well:
i think this should be easy:
import pandas
colname = ['zip code','city']
zipdata = pandas.read_csv('uszipcodes.csv', names=colname)
i hope i understood you right!
Maybe like this?
#!/usr/local/cpython-3.3/bin/python
import csv
import pprint
def gen_zipcodes(file_):
    reader = csv.reader(file_, delimiter='|', quotechar='"')
    for row in reader:
        yield row[0]
def main():
    with open('zipcodes_2006.txt', 'r') as file_:
        zipcodes = list(gen_zipcodes(file_))
    pprint.pprint(zipcodes[:10])
main()

Writing to CSV from list, write.row seems to stop in a strange place

I am attempting to merge a number of CSV files. My Initial function is aimed to:
Look Inside a directory and count the number of files within (assume all are .csv)
Open the first CSV and append each row into a list
Clip the top three rows (there's some useless column title info I don't want)
Store these results in a list I've called 'archive'
Open the next CSV file and repeat(clip and append em to 'archive')
When we're out of CSV files I wanted to write the complete 'archive' to a file in separate folder.
So for instance if i were to start with three CSV files that look something like this.
CSV 1
[]
[['Title'],['Date'],['etc']]
[]
[['Spam'],['01/01/2013'],['Spam is the spammiest spam']]
[['Ham'],['01/04/2013'],['ham is ok']]
[['Lamb'],['04/01/2013'],['Welsh like lamb']]
[['Sam'],['01/12/2013'],["Sam doesn't taste as good and the last three"]]
CSV 2
[]
[['Title'],['Date'],['etc']]
[]
[['Dolphin'],['01/01/2013'],['People might get angry if you eat it']]
[['Bear'],['01/04/2013'],['Best of Luck']]
CSV 3
[]
[['Title'],['Date'],['etc']]
[]
[['Spinach'],['04/01/2013'],['Spinach has lots of iron']]
[['Melon'],['02/06/2013'],['Not a big fan of melon']]
At the end of which I'd hope to get something like...
CSV OUTPUT
[['Spam'],['01/01/2013'],['Spam is the spammiest spam']]
[['Ham'],['01/04/2013'],['ham is ok']]
[['Lamb'],['04/01/2013'],['Welsh like lamb']]
[['Sam'],['01/12/2013'],["Sam doesn't taste as good and the last three"]]
[['Dolphin'],['01/01/2013'],['People might get angry if you eat it']]
[['Bear'],['01/04/2013'],['Best of Luck']]
[['Spinach'],['04/01/2013'],['Spinach has lots of iron']]
[['Melon'],['02/06/2013'],['Not a big fan of melon']]
So... I set about writing this:
import os
import csv
path = './Path/further/into/file/structure'
directory_list = os.listdir(path)
directory_list.sort()
archive = []
for file_name in directory_list:
    temp_storage = []
    path_to = path + '/' + file_name
    file_data = open(path_to, 'r')
    file_CSV = csv.reader(file_data)
    for row in file_CSV:
        temp_storage.append(row)
    for row in temp_storage[3:-1]:
        archive.append(row)
archive_file = open("./Path/elsewhere/in/file/structure/archive.csv", 'wb')
wr = csv.writer(archive_file)
for row in range(len(archive)):
    lastrow = row
    wr.writerow(archive[row])
    print row
This seems to work... except when I check my output file, it seems to have stopped writing at a strange point near the end,
eg:
[['Spam'],['01/01/2013'],['Spam is the spammiest spam']]
[['Ham'],['01/04/2013'],['ham is ok']]
[['Lamb'],['04/01/2013'],['Welsh like lamb']]
[['Sam'],['01/12/2013'],["Sam doesn't taste as good and the last three"]]
[['Dolphin'],['01/01/2013'],['People might get angry if you eat it']]
[['Bear'],['01/04/2013'],['Best of Luck']]
[['Spinach'],['04/0
It's really weird; I can't work out what's gone wrong. It seemed to be writing fine but decided to stop halfway through a list entry. Tracing it back, I'm sure that this has something to do with my last write "for loop", but I'm not too familiar with the csv methods. I have had a read through the documentation and am still stumped.
Can anyone point out where I've gone wrong, how I might fix it and perhaps if there would be a better way of going about all this!
Many Thanks -Huw
Close the filehandle before the script ends. Closing the filehandle will also flush any strings waiting to be written. If you don't flush and the script ends, some output may never get written.
Using the with open(...) as f syntax is useful because it will close the file for you when Python leaves the with-suite. With with, you'll never omit closing a file again.
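# On Python 3, open with 'w' and newline='' instead of 'wb'.
with open("./Path/elsewhere/in/file/structure/archive.csv", 'wb') as archive_file:
    wr = csv.writer(archive_file)
    for row in archive:
        wr.writerow(row)
        print(row)
# The file is closed (and flushed) as soon as the with-block ends.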
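On the "better way of going about all this" part, here is one possible Python 3 sketch of the whole merge. It assumes every .csv file in the directory has exactly three leading rows to discard and that blank rows can be dropped; the function name `merge_csv_files` and its parameters are hypothetical. Every file is opened in a with-block, so nothing is left unflushed.

```python
import csv
import os


def merge_csv_files(src_dir, dst_path, skip_rows=3):
    """Concatenate all CSV files in src_dir into dst_path.

    The first `skip_rows` rows of each file are discarded (assumed to be
    title rows), and so are completely empty rows.
    """
    archive = []
    for file_name in sorted(os.listdir(src_dir)):
        if not file_name.lower().endswith(".csv"):
            continue  # ignore anything that is not a CSV file
        with open(os.path.join(src_dir, file_name), newline="") as src:
            rows = list(csv.reader(src))
        archive.extend(row for row in rows[skip_rows:] if row)
    # Leaving the with-block closes the file and flushes buffered output.
    with open(dst_path, "w", newline="") as dst:
        csv.writer(dst).writerows(archive)
    return archive
```

Unlike the `[3:-1]` slice in the question, this keeps the last row of each file; whether that matches the real data is something to check against the input files.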

Why can't I repeat the 'for' loop for csv.Reader?

I am a beginner at Python. I am trying to figure out why the second 'for' loop doesn't work in the following script. I mean that I only get the result of the first 'for' loop, but nothing from the second one. I have copied and pasted my script and the data csv below.
It will be helpful if you tell me why it goes in this way and how to make the second 'for' loop work as well.
My SCRIPT:
import csv
file = "data.csv"
fh = open(file, 'rb')
read = csv.DictReader(fh)
for e in read:
    print(e['a'])
for e in read:
    print(e['b'])
"data.csv":
a,b,c
tree,bough,trunk
animal,leg,trunk
fish,fin,body
The csv reader is an iterator over the file. Once you go through it once, you read to the end of the file, so there is no more to read. If you need to go through it again, you can seek to the beginning of the file:
fh.seek(0)
This will reset the file to the beginning so you can read it again. Depending on the code, it may also be necessary to skip the field name header:
next(fh)
This is necessary for your code, since the DictReader consumed that line the first time around to determine the field names, and it's not going to do that again. It may not be necessary for other uses of csv.
If the file isn't too big and you need to do several things with the data, you could also just read the whole thing into a list:
data = list(read)
Then you can do what you want with data.
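Both options can be sketched side by side. To stay self-contained, the example below uses an in-memory `io.StringIO` holding the sample data from the question in place of a real file (an assumption of the sketch); the function names are hypothetical.

```python
import csv
import io

SAMPLE = "a,b,c\ntree,bough,trunk\nanimal,leg,trunk\nfish,fin,body\n"


def read_twice_with_list(text):
    # Read all rows into a list once; a list can be looped over repeatedly.
    rows = list(csv.DictReader(io.StringIO(text)))
    first = [r["a"] for r in rows]
    second = [r["b"] for r in rows]
    return first, second


def read_twice_with_seek(text):
    fh = io.StringIO(text)
    reader = csv.DictReader(fh)
    first = [r["a"] for r in reader]  # this consumes the reader
    exhausted = list(reader)          # nothing is left to read now
    fh.seek(0)                        # rewind the underlying file
    next(fh)                          # skip the header line
    second = [r["b"] for r in reader]
    return first, exhausted, second
```

The list version is usually simpler when the file fits in memory; the seek version avoids holding all rows at once.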
I have written a small function which takes the path of a csv file, reads it, and returns a list of dicts in one go, so you can then loop through the list very easily:
import csv
def read_csv_data(path):
    """
    Reads a CSV file from the given path and returns a list of dicts
    mapping column names to values.
    """
    with open(path) as f:  # 'with' closes the file when done
        data = csv.reader(f)
        # Read the column names from the first line of the file
        fields = next(data)
        data_lines = []
        for row in data:
            items = dict(zip(fields, row))
            data_lines.append(items)
    return data_lines
Regards
