I'm trying to find all the rows inside training_full.csv (two columns, "macroclass" and "description") that match entries from contatti.csv (two columns, "name" and "surname").
In other words, I want to retrieve every "description" row of training_full.csv that contains a "name" and "surname" pair from contatti.csv.
The script I've created seems to evaluate only the first row of training_full.csv and, for this reason, prints only the first row (in which it finds a match).
If I modify training_full.csv so that there is no match in the first row, the result is empty.
Here is the code:
import csv

match = []
with open('xxxxxxxxxxx/training_full1.csv', encoding='utf-8') as csvfile, open('output.csv', 'wb') as output, open('xxxxxxxxxxx/contatti.CSV') as contatti:
    spamreader = csv.reader(csvfile)
    spamreader_contacts = csv.reader(contatti, delimiter=';')
    spamwriter = csv.writer(output)
    for row_desc in spamreader:
        #print(righe[0])
        for row_cont in spamreader_contacts:
            #print(row[0])
            if (row_cont[0] + " " + row_cont[1]) in row_desc[0]:
                match.append(row_desc[0])
print(match)
Thanks for any help,
Filippo.
Looking at your problem, it can be separated into three parts:
1) Read the names and build a list
2) Compare the training file against the names list
3) Write the matches
Doing that, we end up with a solution similar to:
import csv

names = []
with open('xxxxxxxxxxx/contatti.csv', newline='') as f:
    contatti = csv.reader(f, delimiter=';')
    for row in contatti:
        names.append("{} {}".format(row[0], row[1]))

matches = []
with open('xxxxxxxxxxx/training_full1.csv', encoding='utf-8', newline='') as f:
    training = csv.reader(f)
    for row in training:
        for name in names:
            if name in row[1]:  # description being the second column
                matches.append(row[1])
                break

with open('output.csv', 'w', newline='') as f:
    output = csv.writer(f)
    for match in matches:
        output.writerow([match])  # wrap in a list so the description is written as one field

print(matches)
The main issue with your solution attempt was, as pointed out in the comments, that once you had looked for the first match, you had exhausted your csv reader. In the solution presented here, the list of names is built first, which ensures that we can search the names multiple times.
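To illustrate the exhaustion issue with a minimal sketch (the path is just a placeholder): a csv.reader reads from the underlying file object, so after one full pass the file pointer sits at the end of the file and a second loop yields nothing unless you seek back to the start or keep the rows in a list.

import csv

# Minimal illustration of reader exhaustion; 'contatti.csv' is a placeholder path.
with open('contatti.csv', newline='') as f:
    reader = csv.reader(f, delimiter=';')
    first_pass = [row for row in reader]   # consumes the whole file
    second_pass = [row for row in reader]  # reader is exhausted -> empty list

print(len(first_pass))   # number of rows in the file
print(len(second_pass))  # 0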
Related
I started learning Python and was wondering if there is a way to create multiple files from the unique values of a column. I know there are hundreds of ways of getting it done through pandas, but I am looking to have it done through the built-in libraries. I couldn't find a single example where it's done with the built-in libraries.
Here is the sample csv file data:
uniquevalue|count
a|123
b|345
c|567
d|789
a|123
b|345
c|567
Sample output file:
a.csv
uniquevalue|count
a|123
a|123
b.csv
b|345
b|345
I am struggling with looping over the unique values in a column and then writing them out. Can someone explain the logic for how to do it? That will be much appreciated. Thanks.
import csv
from collections import defaultdict

header = []
data = defaultdict(list)
DELIMITER = "|"

with open("inputfile.csv", newline="") as csvfile:
    reader = csv.reader(csvfile, delimiter=DELIMITER)
    for i, row in enumerate(reader):
        if i == 0:
            header = row
        else:
            key = row[0]
            data[key].append(row)

for key, value in data.items():
    filename = f"{key}.csv"
    with open(filename, "w", newline="") as f:
        writer = csv.writer(f, delimiter=DELIMITER)
        rows = [header] + value
        writer.writerows(rows)
import csv

with open('sample.csv', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter='|')
    next(reader)  # skip the header row
    for row in reader:
        # open (or create) one output file per unique value in the first column
        with open(f"{row[0]}.csv", 'a', newline='') as inner:
            # note: csv.writer takes no fieldnames argument (that belongs to DictWriter)
            writer = csv.writer(inner, delimiter='|')
            writer.writerow(row)
The task can also be done without using the csv module. The lines of the file are read, and with read_file.read().splitlines()[1:] the newline characters are stripped off while the header line of the csv file is skipped. With a set, a unique collection of the input data is created, which is used to count the number of duplicates and to create the output files.
with open("unique_sample.csv", "r") as read_file:
items = read_file.read().splitlines()[1:]
for line in set(items):
with open(line[:line.index('|')] + '.csv', 'w') as output:
output.write((line + '\n') * items.count(line))
import csv

with open('example.csv', 'r') as f:
    csvfile = csv.reader(f, delimiter = ',')
    client_email = ['#example.co.uk', '#moreexamples.com', 'lastexample.com']
    for row in csvfile:
        if row not in client_email:
            print row
Assume the code is formatted in blocks properly; it's not translating properly when I copy and paste. I've created a list of company email domain names (as seen in the example), and a loop to print out every row in my CSV that is not present in the list. Other columns in the CSV file include first name, second name, company name, etc., so it is not limited to only emails.
The problem is that when I'm testing, it is printing off rows with the emails in the list, e.g. jackson#example.co.uk.
Any ideas?
In your example, row refers to a list of strings. So each row is ['First name', 'Second name', 'Company Name'] etc.
You're currently checking whether the whole row (a list of strings) is exactly one of the elements in your client_email, which will never be the case.
I suspect you want to check whether the text of any column contains one of the elements in client_email.
You could use another loop:
for row in csvfile:
    for column in row:
        # check if the column contains any of the email domains here
        # if it does:
        print row
        break  # break (not continue) so each matching row is printed at most once
To check if a string contains any strings in another list, I often find this approach useful:
s = "xxabcxx"
stop_list = ["abc", "def", "ghi"]
if any(elem in s for elem in stop_list):
    pass
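Putting the two ideas together, a minimal sketch could look like this (file name and domain list taken from the question):

import csv

client_email = ['#example.co.uk', '#moreexamples.com', 'lastexample.com']

with open('example.csv', 'r') as f:
    for row in csv.reader(f, delimiter=','):
        # True if any cell of the row contains any of the listed domains
        if any(domain in cell for cell in row for domain in client_email):
            print(row)
        # use "if not any(...)" instead to keep only the rows without a match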
One way to check may be to see whether the set of client_email and the set of values in row have common elements (by changing the if condition in the loop):
import csv

with open('example.csv', 'r') as f:
    csvfile = csv.reader(f, delimiter = ',')
    client_email = ['#example.co.uk', '#moreexamples.com', 'lastexample.com']
    for row in csvfile:
        if (set(row) & set(client_email)):
            print (row)
You can also use any as follows:
import csv

with open('untitled.csv', 'r') as f:
    csvfile = csv.reader(f, delimiter = ',')
    client_email = ['#example.co.uk', '#moreexamples.com', 'lastexample.com']
    for row in csvfile:
        if any(item in row for item in client_email):
            print (row)
Another possible way,
import csv

data = csv.reader(open('example.csv', 'r'))
emails = {'#example.co.uk', '#moreexamples.com', 'lastexample.com'}
for row in data:
    if any(email in cell for cell in row for email in emails):
        print(row)
The content of the csv is as follows:
"Washington-Arlington-Al, DC-VA-MD-WV (MSAD)" 47894 1976
"Grand-Forks, ND-MN" 24220 2006
"Abilene, TX" 10180 1977
The required output: read through the csv, find the content between the quotes in column 1, fetch only DC-VA-MD-WV, ND-MN, and TX, and put this content in a new column (for normalization).
So far I have tried a lot of regex patterns in Python, but could not get the right one.
sample=""" "Washington-Arlington-Al, DC-VA-MD-WV (MSAD)",47894,1976
"Grand-Forks, ND-MN",24220,2006
"Abilene, TX",10180,1977 """
open('sample.csv','w').write(sample)
with open('sample.csv') as sample, open('output.csv','w') as output:
reader = csv.reader(sample)
writer = csv.writer(output)
for comsplit in row[0].split(','):
writer.writerow([ comsplit, row[1]])
print open('output.csv').read()
The expected output, with each value in a new row, is:
DC-VA-MD-WV
ND-MN
TX
There is no need to use regex here, provided a couple of things hold:
The city (?) always has a comma after it, followed by exactly one space (though I could add a modification to accept more whitespace if needed).
There is a space after your letter sequence before encountering something like (MSAD).
This code gives your expected output against the sample input:
import csv

with open('sample.csv', 'r') as infile, open('expected_output.csv', 'wb') as outfile:
    reader = csv.reader(infile)
    expected_output = []
    for row in reader:
        split_by_comma = row[0].split(',')[1]
        split_by_space = split_by_comma.split(' ')[1]
        print split_by_space
        expected_output.append([split_by_space])
    writer = csv.writer(outfile)
    writer.writerows(expected_output)
I'd do it like this:
import csv

with open('csv_file.csv', 'r') as f_in, open('output.csv', 'w') as f_out:
    csv_reader = csv.reader(f_in, quotechar='"', delimiter=',',
                            quoting=csv.QUOTE_ALL, skipinitialspace=True)
    csv_writer = csv.writer(f_out)
    new_csv_list = []
    for row in csv_reader:
        first_entry = row[0].strip('"')
        # strip the leading space left after splitting on the comma,
        # then take the first whitespace-separated token
        relevant_info = first_entry.split(',')[1].strip().split(' ')[0]
        row += [relevant_info]
        new_csv_list += [row]
    for row in new_csv_list:
        csv_writer.writerow(row)
Let me know if you have any questions.
I believe you could use this regex pattern, which will extract any alphanumeric expression (with hyphen or not) between a comma and a parenthesis:
import re
BETWEEN_COMMA_PAR = re.compile(ur',\s+([\w-]+)\s+\(')
test_str = 'Washington-Arlington-Al, DC-VA-MD-WV (MSAD)'
result = BETWEEN_COMMA_PAR.search(test_str)
if result is not None:
    print result.group(1)
This will print as a result: DC-VA-MD-WV, as expected.
It seems that you are having trouble finding the right regex for the expected values.
I have created a small sample on Pythex which satisfies your requirement.
Basically, when you check the content of every value of the first column, you could use a regex like (TX|ND-MN|DC-VA-MD-WV).
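As a rough sketch of how such a pattern could be applied with the csv module (a hypothetical illustration reusing the sample.csv file from the question; note the alternation is hard-coded, so it only matches the three values listed):

import csv
import re

# Hypothetical sketch: apply the suggested alternation to column 1 of sample.csv.
pattern = re.compile(r'(TX|ND-MN|DC-VA-MD-WV)')

with open('sample.csv') as f:
    for row in csv.reader(f):
        match = pattern.search(row[0])
        if match:
            print(match.group(1))  # DC-VA-MD-WV, ND-MN, TX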
I hope this was useful! Let me know if you need further explanations.
I have a CSV file with one column that holds a person's first and last name. I am trying to use the csv module to split each name into two columns, first and last. The code below puts all of the first names into one row and all of the last names into another row, instead of putting each first name in one column with the last name in the column next to it. Thanks for your time.
Code:
import csv

with open('fullnames.csv','r') as f:
    reader = csv.reader(f)
    newcsvdict = {"first name": [], "last name": []}
    for row in reader:
        first = row[0].split()[0]
        last = row[0].split()[1]
        newcsvdict["first name"].append(first)
        newcsvdict["last name"].append(last)

with open('new.csv','w') as f:
    w = csv.DictWriter(f, newcsvdict.keys())
    w.writeheader()
    w.writerow(newcsvdict)
In this simple case there is little benefit in using a csv.DictWriter, just use csv.writer:
import csv
header = ['first name', 'last name']
with open('fullnames.csv', 'r') as infile, open('new.csv', 'w') as outfile:
    writer = csv.writer(outfile)
    writer.writerow(header)
    writer.writerows(row[0].split() for row in csv.reader(infile))
This works fine provided that the name column in the input CSV always consists of exactly one first name and one surname separated by whitespace. However, if there can be double-barrelled surnames, e.g. Helena Bonham Carter, you need to be more careful about splitting the name. This might work:
row[0].split(' ', 1)
but it assumes that the separator is exactly one space.
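For example, a quick illustration of the difference (the name is just an example): splitting on None, i.e. any run of whitespace, at most once keeps a multi-part surname together without assuming single spaces.

# Illustrative only: splitting a full name into first name and surname.
name = "Helena Bonham Carter"

print(name.split())          # ['Helena', 'Bonham', 'Carter'] -> surname broken up
print(name.split(' ', 1))    # ['Helena', 'Bonham Carter'] -> assumes single spaces
print(name.split(None, 1))   # ['Helena', 'Bonham Carter'] -> any run of whitespace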
You can use pandas to write your csv (you could actually use pandas for the whole problem); this will automatically transpose your data from a dict of columns to a list of rows:
import pandas as pd
df = pd.DataFrame(newcsvdict)
df.to_csv('new.csv', index=False)
You're creating a single list associated with each key. Either use pandas, as #maxymoo suggested, or write each line separately.
import csv
with open(r'~/Documents/names.csv', 'r') as fh:
    reader = csv.reader(fh)
    with open(r'~/Documents/output.csv', 'w+') as o:
        writer = csv.writer(o)
        for row in reader:
            output = row[0].split(' ', 1)
            writer.writerow(output)
I've seen a few related posts about the numpy module, etc. I need to use the csv module, and it should work for this. While a lot has been written on using the csv module here, I didn't quite find the answer I was looking for. Thanks so much in advance
Essentially I have the following function/pseudocode (tab didn't copy over well...):
import csv
def copy(inname, outname):
    infile = open(inname, "r")
    outfile = open(outname, "w")
    copying = False  # not copying yet
    # if the first string up to the first whitespace in the "name" column of a row
    # equals the first string up to the first whitespace in the "name" column of
    # the row directly below it AND the value in the "ID" column of the first row
    # does NOT equal the value in the "ID" column of the second row, copy these two
    # rows in full to a new table.
For example, if inname looks like this:
ID,NAME,YEAR, SPORTS_ALMANAC,NOTES
(first thousand rows)
1001,New York Mets,1900,ESPN
1002,New York Yankees,1920,Guiness
1003,Boston Red Sox,1918,ESPN
1004,Washington Nationals,2010
(final large amount of rows until last row)
1231231231235,Detroit Tigers,1990,ESPN
Then I want my output to look like:
ID,NAME,YEAR,SPORTS_ALMANAC,NOTES
1001,New York Mets,1900,ESPN
1002,New York Yankees,1920,Guiness
Because the string "New" is the same string up to the first whitespace in the "Name" column, and the ID's are different. To be clear, I need the code to be as generalizable as possible, since a regular expression on "New" is not what I need, since the common first string could be really any string. And it doesn't matter what happens after the first whitespace (ie "Washington Nationals" and "Washington DC" should still give me a hit, as should the New York examples above...)
I'm confused because in R there is a way to do inname$name to easily access the values in a specific column. I tried writing my script in R first, but it got confusing, so I want to stick with Python.
Does this do what you want (Python 3)?
import csv

def first_word(value):
    return value.split(" ", 1)[0]

with open(inname, "r") as infile:
    with open(outname, "w", newline="") as outfile:
        in_csv = csv.reader(infile)
        out_csv = csv.writer(outfile)

        column_names = next(in_csv)
        out_csv.writerow(column_names)
        id_index = column_names.index("ID")
        name_index = column_names.index("NAME")

        try:
            row_1 = next(in_csv)
            written_row = False
            for row_2 in in_csv:
                if first_word(row_1[name_index]) == first_word(row_2[name_index]) and row_1[id_index] != row_2[id_index]:
                    if not written_row:
                        out_csv.writerow(row_1)
                    out_csv.writerow(row_2)
                    written_row = True
                else:
                    written_row = False
                row_1 = row_2
        except StopIteration:
            # No data rows!
            pass
For Python 2, use:
with open(outname, "w") as outfile:
    in_csv = csv.reader(infile)
    out_csv = csv.writer(outfile, lineterminator="\n")
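As an aside, if you would rather refer to columns by name, similar to inname$name in R, csv.DictReader supports that directly. A minimal sketch under the same assumptions as above (inname is the input file from the question):

import csv

# DictReader yields one dict per data row, keyed by the header names,
# so columns can be accessed as row["NAME"] or row["ID"].
with open(inname, "r", newline="") as infile:
    for row in csv.DictReader(infile):
        print(row["NAME"].split(" ", 1)[0], row["ID"])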