Reading and Overwriting .CSV file data - python

I am attempting to produce a read and write sequence for a Python program. I am fairly new to Python and, as a result, do not know the language well.
I am struggling with reading data from a .CSV file. The file contains 3 columns, and the number of rows depends on how many users use the program I have created. Now, I know how to locate rows, but the problem is that it returns the entire row, with all three columns of data within it. So, how do I isolate pieces of data? And subsequently, how do I turn these pieces of data into variables which can be read, written or overwritten?
Please bear in mind that the context of the program is the Computing A453 coursework. Also remember I am not asking you to do my work for me: I have already completed the other 2 tasks, and all the logic and planning for task 3. It's just that I only have 2 weeks left until I have to hand this coursework in, and trying to work out the code that can read and overwrite data is extremely hard for a beginner like me.
with open('results.csv', 'a', newline='') as fp:
    g = csv.writer(fp, delimiter=',')
    data = [[name, score, classroom]]
    g.writerows(data)
    fp.close()
# Do the reading
result = open('results.csv', 'r')
reader = csv.reader(result)
new_rows_list = []
for row in reader:
    if row[2] == name:
        if row[2] < score:
            result.close()
            file2 = open('results.csv', 'wb')
            writer = csv.writer(file2)
            new_row = [row[2], row[2], name, score, classnumber]
            new_rows_list.append(new_row)
            file2.close()
At the moment, this code reads the file, but not in the way I want it to. I want it to isolate the "name" of the user on record (within the .csv file). Instead, it reads the entire row as a whole, and I do not know how to narrow that down to just the name of the user.
Here is the data in the .CSV file:
Name Score Class number
jor 0 2
jor 0 2
jor 1 2

I'm assuming what you're getting looks like this:
Jacob, 14, Class B, Number 3
and that it is a string.
If that is the case, str.split() is your answer.
str.split() takes a separator string as an argument, in your case a comma (Comma Separated Values), and returns a list of everything in between every instance of that separator in the string.
From there, if you want to use the results as data in your program, you should convert the values to the datatype you want (like float(x) or int(x)).
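If you read the file through csv.reader, each row already arrives as a list of strings, so you can index it directly instead of splitting. Here is a minimal sketch, assuming the columns are name, score and class number in that order and are comma-separated; the in-memory sample data and the names records and best_for_jor are made up for illustration:

```python
import csv
import io

# Stand-in for results.csv; in the real program use open('results.csv', newline='')
sample = io.StringIO("jor,0,2\njor,1,2\nann,3,1\n")

records = []
for row in csv.reader(sample):
    # row is already a list such as ['jor', '0', '2'], so index it directly
    name, score, classroom = row[0], int(row[1]), int(row[2])
    records.append((name, score, classroom))

# Highest score recorded for one user (hypothetical lookup)
best_for_jor = max(score for name, score, _ in records if name == "jor")
print(best_for_jor)
```

Once the values are in ordinary variables like this, you can compare or change them and write the updated rows back out with csv.writer.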
Hope this helped

Related

Need to print CSV output into separate rows in Python, instead of one long string

I am trying to print the output of a webscrape project into a CSV file.
So for example I have this list of supplier names under a list called SUPP_NAME: (just an example, the actual list has 50 items inside it)
['"FULIAN\\u0020\\u0028M\\u0029\\u0020SENDIRIAN\\u0020BERHAD"', '"RISO\\u0020SEKKEN\\u0020SDN.\\u0020BHD."', '"NATURE\\u0020PROFUSION\\u0020SDN.\\u0020BHD."']
and a list of numbers indicating years, under a list called SUPP_YEARS:
['"9"', '"4"', '"1"', '"1"']
My plan is to put them into a CSV, and then read them back in as a pandas dataframe, then perform decoding to get a bunch of values.
Code so far:
import csv
with open('output3.csv' , 'w') as f:
    writer = csv.writer(f)
    headers = "Supplier_name,Years\n"
    f.write(headers)
    supp_names = re.findall(r'("supplierName"):("\w+.+")', results[17].text)
    supp_years = re.findall(r'("supplierYear"):("\d+")', results[17].text)
    SUPP_NAME = []
    for title, name in supp_names:
        print (name)
        SUPP_NAME.append(name)
        #f.write(name + "\n")
    SUPP_YEAR = []
    for year,number in supp_years:
        print (number)
        SUPP_YEAR.append(number)
        #f.write(number + "\n")
    writer.writerow([SUPP_NAME, SUPP_YEAR])
However, what I get is that under the Supplier_name and Years columns, one cell under each of these 2 columns is filled with a LONG list of items still contained in the list, instead of the items separated one by one.
What am I doing wrong? Thanks in advance for answering.
The two re.findall() calls are giving you lists of items (hopefully both the same length). The idea is to then extract an element from each and write this to your output file. Python has a useful function called zip() to do this. You give it both of your lists and the loop will give you an item from each on each iteration:
import csv
import re
with open('output3.csv', 'w', newline='') as f_output:
    writer = csv.writer(f_output)
    writer.writerow(["Supplier_name", "Years"])
    supp_names = re.findall(r'("supplierName"):("\w+.+")', results[17].text)
    supp_years = re.findall(r'("supplierYear"):("\d+")', results[17].text)
    # each match is a (key, value) tuple because the pattern has two groups
    for (_, name), (_, year) in zip(supp_names, supp_years):
        writer.writerow([name, year])
The csv.writer() object is designed to take a list of items and write them to your file with the desired (i.e. comma) delimiter automatically added between them.
I assume you are using Python 3.x? If not (i.e. you are on Python 2), change the open() call to the following:
with open('output3.csv', 'wb') as f_output:
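To see the zip() and writerow() pairing on its own, here is a small self-contained sketch. The names and years are made-up sample values, and an in-memory buffer stands in for output3.csv, so none of this depends on the scraped page:

```python
import csv
import io

# Made-up sample values standing in for the scraped names and years
names = ['"FULIAN"', '"RISO SEKKEN"', '"NATURE PROFUSION"']
years = ['"9"', '"4"', '"1"']

buffer = io.StringIO()  # in-memory stand-in for output3.csv
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["Supplier_name", "Years"])
for name, year in zip(names, years):
    # strip the surrounding double quotes captured by the regex
    writer.writerow([name.strip('"'), year.strip('"')])

lines = buffer.getvalue().splitlines()
print(lines)
```

Note that zip() stops at the shorter list, so if the two lists differ in length the extra items are silently dropped.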

Trying to copy column1 from a csv file to another empty file using python

I'm looking for a way using python to copy the first column from a csv into an empty file. I'm trying to learn python so any help would be great!
So if this is test.csv
A 32
D 21
C 2
B 20
I want this output
A
D
C
B
I've tried the following commands in python but the output file is empty
f = open("test.csv", 'r')
import csv
reader = csv.reader(f, delimiter="\t")
names = ""
for each_line in reader:
    names = each_line[0]
First, you want to open your files. A good practice is to use the with statement (that, technically speaking, introduces a context manager) so that when your code exits from the with block all the files are automatically closed:
with open('test.csv') as inpfile, open('out.csv', 'w') as outfile:
Next you want a loop on the lines of the input file (note the indentation, we are inside the with block). Line splitting is automatic when you read a text file with lines separated by newlines…
    for line in inpfile:
Each line is a string, but you think of it as two fields separated by white space. This situation is so common that strings have a method to deal with it (note again the increasing indent, we are in the for loop block):
        fields = line.split()
By default .split() splits on white space, but you can use, e.g., split(',') to split on commas, etc. That said, fields is a list of strings; for your first record it is equal to ['A', '32'] and you want to output just the first field in this list. For this purpose a file object has the .write() method, which writes a string, just a string, to the file, and fields[0] IS a string, but we have to add a newline character to it because, in this respect, .write() is different from print().
        outfile.write(fields[0]+'\n')
That's all, and if you omit my comments it's 4 lines of code:
with open('test.csv') as inpfile, open('out.csv', 'w') as outfile:
    for line in inpfile:
        fields = line.split()
        outfile.write(fields[0]+'\n')
When you are done with learning (some) Python, ask for an explanation of this...
with open('test.csv') as ifl, open('out.csv', 'w') as ofl:
    ofl.write('\n'.join(line.split()[0] for line in ifl))
Addendum
The csv module in such a simple case adds the additional conveniences of
- auto-splitting each line into a list of strings
- taking care of the details of output (newlines, etc)
and when learning Python it's more fruitful to see how these steps can be done using the bare language, or at least that is my opinion…
The situation is different when your data file is complex, has headers, has quoted strings possibly containing quoted delimiters, etc. In those cases the use of csv is recommended, as it takes into account all the gory details. For complex data analysis requirements you will need other packages, not included in the standard library, e.g., numpy and pandas, but that is another story.
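For comparison, here is roughly what the same first-column copy looks like with the csv module. This is only a sketch: it assumes the file is tab-separated (as the delimiter in the question suggests), and it uses in-memory buffers in place of test.csv and out.csv so it runs without any files on disk:

```python
import csv
import io

# In-memory stand-ins for test.csv and out.csv (tab-separated is an assumption)
inpfile = io.StringIO("A\t32\nD\t21\nC\t2\nB\t20\n")
outfile = io.StringIO()

reader = csv.reader(inpfile, delimiter="\t")
writer = csv.writer(outfile, lineterminator="\n")
for fields in reader:
    writer.writerow([fields[0]])  # writerow() wants a list, even for one field

first_column = outfile.getvalue().split()
print(first_column)
```

With real files you would open both with newline='' and pass the file objects to csv.reader and csv.writer in the same way.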
This answer reads the CSV file, treating the space character as the column delimiter. You have to add header=None, otherwise the first row will be taken to be the header (the names of the columns).
ss is a slice: the 0th column, taking all rows as denoted by :
The last line writes the slice to a new filename.
import pandas as pd
df = pd.read_csv('test.csv', sep=' ', header=None)
# .ix was removed from pandas; .iloc selects by integer position
ss = df.iloc[:, 0]
# header=False keeps the column label 0 out of the output file
ss.to_csv('new_path.csv', sep=' ', index=False, header=False)
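import csv
# Python 3: open CSV files in text mode with newline=''
with open("test.csv", newline='') as inp, open("output.csv", "w", newline='') as out:
    reader = csv.reader(inp, delimiter='\t')
    writer = csv.writer(out)
    for e in reader:
        # writerow() expects a sequence of fields, so wrap the single value in a list
        writer.writerow([e[0]])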
The best you can do is create an empty list, append the column values to it, and then write that new list into another csv, for example:
import csv
def writetocsv(l):
    # make a list copy of the values
    b = list(l)
    print(b)
    with open("newfile.csv", 'w', newline='') as f:
        w = csv.writer(f, delimiter=',')
        for value in b:
            w.writerow([value])
adcb_list = []
with open("test.csv", 'r', newline='') as f:
    reader = csv.reader(f, delimiter="\t")
    for each_line in reader:
        adcb_list.append(each_line[0])  # keep only the first column
writetocsv(adcb_list)
Hope this works for you :-)

Business student totally new to Python wants a script for strings fuzzy matching

I am a business student who just began to learn Python. My professor asked me to do fuzzy matching between two files: US Patent information and Company information downloaded from a stock exchange website. My task is to compare the company names that show up in the US Patent documentation (column 1 from file 1) with the names found on the stock exchange website (column 1 from file 2). From what I understand, (1) the first step is to change all the letters in file 1 and file 2 to lower case; (2) pick each name from file 2, match it with all the names in file 1 and return the 15 closest matches; (3) repeat step 2, running through all the names in file 2; (4) with every match, there is one similarity level.
I guess I will use the SequenceMatcher() object. I just learned how to import data from my csv file (I have 2 files), see below
import csv
with open('USPTO.csv', 'rb') as csvfile:
    data = csv.reader(csvfile, delimiter=',')
    for row in data:
        print "------------------"
        print row
        print "------------------"
        for cell in row:
            print cell
Sorry about my silly question but I am too new to replace the strings (“abcde”, “abcde”, as shown below) data with my own data. I have no idea how to change the data I imported to lower cases. And I don’t even know how to set the 15 closest matches standard. My professor told me this was an easy task, but I really felt defeated. Thank you for reading! Hopefully someone can give me some instructions. I am not that stupid :)
>>> import difflib
>>> difflib.SequenceMatcher(None, 'abcde', 'abcde').ratio()
1.0
To answer your questions one by one.
1) "I have no idea how to change the data I imported to lower cases."
In order to change the cell to lower case, you would use [string].lower()
The following code will print out each cell in lower case
import csv
with open('USPTO.csv', 'rb') as csvfile:
    data = csv.reader(csvfile, delimiter=',')
    for row in data:
        print "------------------"
        print row
        print "------------------"
        for cell in row:
            print cell.lower()
So to change each cell to lower case you would do
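import csv
with open('USPTO.csv', 'rb') as csvfile:
    data = csv.reader(csvfile, delimiter=',')
    lowered_rows = []
    for row in data:
        # build a new list; assigning to the loop variable alone would not change the row
        lowered_rows.append([cell.lower() for cell in row])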
2) "I don’t even know how to set the 15 closest matches standard."
For this you should set up a dictionary, the key will be the first string, the value will be a list of pairs, (string2, the value from difflib.SequenceMatcher(None, string1, string2).ratio()).
Please attempt to write some code and we will help you fix it.
Look at https://docs.python.org/2/tutorial/datastructures.html for how to construct a dictionary
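As a starting point, the dictionary idea can be sketched like this in Python 3. Treat it as an illustration under assumptions rather than a finished script: closest_matches is a hypothetical helper name, the company names are invented sample data, and n is set to 2 only to keep the demo short (you would pass n=15):

```python
import difflib
import heapq

def closest_matches(name, candidates, n=15):
    """Return the n candidates most similar to name as (ratio, candidate) pairs."""
    target = name.lower()
    scored = (
        (difflib.SequenceMatcher(None, target, cand.lower()).ratio(), cand)
        for cand in candidates
    )
    return heapq.nlargest(n, scored)

# Invented sample data standing in for column 1 of each file
patent_names = ["Apple Inc.", "Alphabet Inc", "Appel Incorporated", "Microsoft Corp"]
exchange_names = ["APPLE INC", "MICROSOFT CORPORATION"]

matches = {name: closest_matches(name, patent_names, n=2) for name in exchange_names}
for name, pairs in matches.items():
    print(name, pairs)
```

Each dictionary value is a list of (similarity, name) pairs sorted from best to worst, which covers both the "closest matches" and the "similarity level" parts of the task. The standard library also has difflib.get_close_matches(), which returns the best matches directly but without their ratios.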

Writing multiple values in single cell in csv

For each user I have the list of events in which he participated.
e.g. bob : [event1,event2,...]
I want to write it in csv file. I created a dictionary (key - user & value - list of events)
I wrote it in csv. The following is the sample output
username, frnds
"abc" ['event1','event2']
where username is first col and frnds 2nd col
This is code
writer = csv.writer(open('eventlist.csv', 'ab'))
for key, value in evnt_list.items():
    writer.writerow([key, value])
when I am reading the csv I am not getting the list directly. But I am getting it in following way
['e','v','e','n','t','1','','...]
I also tried to write the list directly in csv but while reading am getting the same output.
What I want is multiple values in a single cell so that when I read a column for a row I get list of all events.
e.g
colA colB
user1,event1,event2,...
I think it's not difficult but somehow I am not getting it.
Reading
I am reading it with the help of the following code:
reader = csv.reader(open("eventlist.csv"))
reader.next()
for row in reader:
    tmp=row[1]
    print tmp # it is printing the whole list but
    print tmp[0] #the output is [
    print tmp[1] #output is 'e' it should have been 'event1'
    print tmp[2] #output is 'v' it should have been 'event2'
You have to format your values into a single string:
with open('eventlist.csv', 'ab') as f:
    writer = csv.writer(f, delimiter=' ')
    for key, value in evnt_list.items():
        writer.writerow([key, ','.join(value)])
exports as
key1 val11,val12,val13
key2 val21,val22,val23
READING: Here you have to keep in mind, that you converted your Python list into a formatted string. Therefore you cannot use standard csv tools to read it:
with open("eventlist.csv") as f:
    csvr = csv.reader(f, delimiter=' ')
    csvr.next()
    for rec in csvr:
        key, values_txt = rec
        values = values_txt.split(',')
        print key, values
works as expected.
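The same join-then-split round trip can be sketched in Python 3 as follows. The data is made up, an in-memory buffer stands in for eventlist.csv, and the ';' used inside the cell is an arbitrary choice (any character that cannot appear in an event name would do):

```python
import csv
import io

evnt_list = {"bob": ["event1", "event2"], "amy": ["event3"]}  # made-up data

buffer = io.StringIO()  # stand-in for eventlist.csv
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["username", "events"])
for key, value in evnt_list.items():
    writer.writerow([key, ";".join(value)])  # ';' inside the cell is an arbitrary choice

buffer.seek(0)
reader = csv.reader(buffer)
next(reader)  # skip the header row
restored = {key: cell.split(";") for key, cell in reader}
print(restored)
```

One caveat: a user with no events is written as an empty cell, and splitting an empty string gives [''] rather than an empty list, so that case needs a separate check.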
You seem to be saying that your evnt_list is a dictionary whose keys are strings and whose values are lists of strings. If so, then the CSV-writing code you've given in your question will write a string representation of a Python list into the second column. When you read anything in from CSV, it will just be a string, so once again you'll have a string representation of your list. For example, if you have a cell that contains "['event1', 'event2']" you will be reading in a string whose first character (at position 0) is [, second character is ', third character is e, etc. (I don't think your tmp[1] is right; I think it is really ', not e.)
It sounds like you want to reconstruct the Python object, in this case a list of strings. To do that, use ast.literal_eval:
import ast
cell_string_value = "['event1', 'event2']"
cell_object = ast.literal_eval(cell_string_value)
Incidentally, the reason to use ast.literal_eval instead of just eval is safety. eval allows arbitrary Python expressions and is thus a security risk.
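To make that concrete, here is a short Python 3 sketch that writes a list into one cell, reads it back, and shows both the single-character indexing and the ast.literal_eval fix. The row contents are sample values and the buffer is an in-memory stand-in for the file:

```python
import ast
import csv
import io

buffer = io.StringIO()
# csv.writer converts the list to its string representation before writing
csv.writer(buffer, lineterminator="\n").writerow(["abc", ["event1", "event2"]])

buffer.seek(0)
username, cell = next(csv.reader(buffer))
# cell is the text "['event1', 'event2']", so indexing gives single characters
first_char = cell[0]
events = ast.literal_eval(cell)  # rebuilds the actual list of strings
print(repr(cell), first_char, events)
```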
Also, what is the purpose of the CSV, if you want to get the list back as a list? Will people be reading it (in Excel or something)? If not, then you may want to simply save the evnt_list object using pickle or json, and not bother with the CSV at all.
Edit: I should have read more carefully; the data from evnt_list is being appended to the CSV, and neither pickle nor json is easily appendable. So I suppose CSV is a reasonable and lightweight way to accumulate the data. A full-blown database might be better, but that would not be as lightweight.

Use Python to select rows with a particular range of values in one column

I know this is simple, but I'm a new user to Python so I'm having a bit of trouble here. I'm using Python 3 by the way.
I have multiple files that look something like this:
NAME DATE AGE SEX COLOR
Name Date Age Sex Color
Ray May 25.1 M Gray
Alex Apr 22.3 F Green
Ann Jun 15.7 F Blue
(Pretend this is tab delimited. I should add that the real file will have about 3,000 rows and 17-18 columns)
What I want to do is select all the rows which have a value in the age column which is less than 23.
In this example, the output would be:
Name Date Age Sex Color
Alex Apr 22.3 F Green
Ann Jun 15.7 F Blue
Here's what I tried to do:
f = open("addressbook1.txt",'r')
line = f.readlines()
file_data =[line.split("\t")]
f.close()
for name, date, age, sex, color in file_data:
    if age in line_data < 23:
        g = open("college_age.txt",'a')
        g.write(line)
    else:
        h = open("adult_age.txt",'a')
        h.write(line)
Now, ideally, I have 20-30 of these "addressbook" inputfiles and I wanted this script to loop through them all and add all the entries with an age under 23 to the same output file ("college_age.txt"). I really don't need to keep the other lines, but I didn't know what else to do with them.
This script, when I run it, generates an error.
AttributeError: 'list' object has no attribute 'split'
Then I change the third line to:
file_data=[line.split("\t") for line in f.readlines()]
And it no longer gives me an error, but simply does nothing at all. It just starts and then stops.
Any help? :) Remember I'm dumb with Python.
I should have added that my actual data has decimals and are not integers. I have edited the data above to reflect that.
The issue here is that you are using readlines() twice, which means that the data is read the first time, then nothing is left the second time.
You can iterate directly over the file without using readlines() - in fact, this is the better way, as it doesn't read the whole file in at once.
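You can see the effect with a small sketch. An in-memory io.StringIO object stands in for the open file here, and the two sample rows are made up:

```python
import io

f = io.StringIO("Ray\tMay\t25.1\nAlex\tApr\t22.3\n")  # stand-in for an open file

first = f.readlines()    # consumes the whole file
second = f.readlines()   # nothing left: the position is already at the end
f.seek(0)                # rewinding makes the data readable again
third = [line.split("\t") for line in f]
print(len(first), len(second), len(third))
```

The second call returns an empty list, which is why the loop in the question has nothing to work on.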
While you could do what you are trying to do by using str.split() as you have, the better option is to use the csv module, which is designed for the task.
import csv
with open("addressbook1.txt", newline="") as infile, open("college_age.txt", "w", newline="") as college, open("adult_age.txt", "w", newline="") as adult:
    reader = csv.DictReader(infile, dialect="excel-tab")
    fieldnames = reader.fieldnames
    writer_college = csv.DictWriter(college, fieldnames, dialect="excel-tab")
    writer_adult = csv.DictWriter(adult, fieldnames, dialect="excel-tab")
    writer_college.writeheader()
    writer_adult.writeheader()
    for row in reader:
        if float(row["Age"]) < 23:  # float() because the ages have decimals
            writer_college.writerow(row)
        else:
            writer_adult.writerow(row)
So what are we doing here? First of all we use the with statement for opening files. It's not only more pythonic and readable but handles closing for you, even when exceptions occur.
Next we create a DictReader that reads rows from the file as dictionaries, automatically using the first row as the field names. We then make writers to write back to our split files, and write the headers in. Using the DictReader is a matter of preference. It's generally used more where you access the data a lot (and when you don't know the order of the columns), but it makes the code nice and readable here. You could, however, just use a standard csv.reader().
Next we loop through the rows in the file, checking the age (which we convert to a float so we can do a numerical comparison, since the ages in the question have decimals) to know what file to write to. The with statement closes our files for us.
For multiple input files:
import csv
fieldnames = ["Name", "Date", "Age", "Sex", "Color"]
filenames = ["addressbook1.txt", "addressbook2.txt", ...]
with open("college_age.txt", "w", newline="") as college, open("adult_age.txt", "w", newline="") as adult:
    writer_college = csv.DictWriter(college, fieldnames, dialect="excel-tab")
    writer_adult = csv.DictWriter(adult, fieldnames, dialect="excel-tab")
    writer_college.writeheader()
    writer_adult.writeheader()
    for filename in filenames:
        with open(filename, "r", newline="") as infile:
            reader = csv.DictReader(infile, dialect="excel-tab")
            for row in reader:
                if float(row["Age"]) < 23:  # float() because the ages have decimals
                    writer_college.writerow(row)
                else:
                    writer_adult.writerow(row)
We just add a loop in to work over multiple files. Please note that I also added a list of field names. Before I just used the fields and order from the file, but as we have multiple files, I figured it would be more sensible to do that here. An alternative would be to use the first file to get the field names.
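If you would rather not type out 20-30 file names, one option is to collect them with glob. The following is a sketch under assumptions, not part of the code above: split_by_age is a hypothetical helper, the "addressbook*.txt" naming pattern is assumed from the question, and the demo writes two tiny made-up files into a temporary directory so that it runs on its own. It only keeps the under-23 rows, since the question says the rest are not needed:

```python
import csv
import glob
import os
import tempfile

def split_by_age(filenames, college_path, limit=23.0):
    """Write rows with Age < limit from every input file to one output file."""
    fieldnames = ["Name", "Date", "Age", "Sex", "Color"]
    with open(college_path, "w", newline="") as college:
        writer = csv.DictWriter(college, fieldnames, dialect="excel-tab")
        writer.writeheader()
        for filename in filenames:
            with open(filename, newline="") as infile:
                for row in csv.DictReader(infile, dialect="excel-tab"):
                    if float(row["Age"]) < limit:  # float() handles 22.3
                        writer.writerow(row)

# Demo with two tiny made-up files in a temporary directory
workdir = tempfile.mkdtemp()
samples = {
    "addressbook1.txt": "Name\tDate\tAge\tSex\tColor\nRay\tMay\t25.1\tM\tGray\nAlex\tApr\t22.3\tF\tGreen\n",
    "addressbook2.txt": "Name\tDate\tAge\tSex\tColor\nAnn\tJun\t15.7\tF\tBlue\n",
}
for fname, text in samples.items():
    with open(os.path.join(workdir, fname), "w", newline="") as fh:
        fh.write(text)

inputs = sorted(glob.glob(os.path.join(workdir, "addressbook*.txt")))
out_path = os.path.join(workdir, "college_age.txt")
split_by_age(inputs, out_path)

with open(out_path, newline="") as fh:
    selected = [row["Name"] for row in csv.DictReader(fh, dialect="excel-tab")]
print(selected)
```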
I think it is better to use csv module for reading such files http://docs.python.org/library/csv.html
ITYM
with open("addressbook1.txt", 'r') as f:
    # with automatically closes
    header = next(f)  # keep the header row away from float()
    file_data = ((line, line.split("\t")) for line in f)
    with open("college_age.txt", 'w') as g, open("adult_age.txt", 'w') as h:
        g.write(header)
        h.write(header)
        for line, (name, date, age, sex, color) in file_data:
            if float(age) < 23:  # float() because the ages have decimals
                g.write(line)
            else:
                h.write(line)
It might look like the file data is iterated through several times. But thanks to the generator expression, file_data is just a generator handing out the next line of the file when asked to do so, and it is asked to do so in the for loop. That means every item retrieved by the for loop comes from the generator file_data, where on request each file line gets transformed into a tuple holding the complete line (for copying) as well as its components (for testing).
An alternative could be
file_data = ((line, line.split("\t")) for line in iter(f.readline, ''))
It is closer to readlines() than iterating over the file. As readline() behaves slightly differently behind the scenes from iteration over the file, it might be necessary to do so.
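The two-argument form iter(callable, sentinel) keeps calling the callable until it returns the sentinel, here the empty string that readline() gives at end of file. A minimal sketch, with an in-memory stand-in for the file and made-up rows without a header line:

```python
import io

f = io.StringIO("Ray\tMay\t25.1\tM\tGray\nAlex\tApr\t22.3\tF\tGreen\n")  # stand-in file

# iter(callable, sentinel) keeps calling f.readline() until it returns ''
file_data = ((line, line.split("\t")) for line in iter(f.readline, ''))

under_23 = [fields[0] for line, fields in file_data if float(fields[2]) < 23]
print(under_23)
```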
(If you don't like functional programming, you could just as well create a generator function that manually calls readline() until an empty string is returned.
And if you don't like nested generators at all, you can do
with open("addressbook1.txt", 'r') as f, open("college_age.txt", 'w') as g, open("adult_age.txt", 'w') as h:
    header = next(f)  # keep the header row away from float()
    g.write(header)
    h.write(header)
    for line in f:
        name, date, age, sex, color = line.split("\t")
        if float(age) < 23:  # float() because the ages have decimals
            g.write(line)
        else:
            h.write(line)
which does exactly the same.)
