I am pretty new to Python. I need to create a class that loads CSV data into a dictionary.
I want to be able to control the keys and values.
For example, with the following code I can pull out worker1.name or worker1.age any time I want.
class ageName(object):
    '''class to represent a person'''
    def __init__(self, name, age):
        self.name = name
        self.age = age
worker1 = ageName('jon', 40)
worker2 = ageName('lise', 22)
#Now if we print this you see that it's stored in a dictionary
print worker1.__dict__
print worker2.__dict__
#
'''
{'age': 40, 'name': 'jon'}
#
{'age': 22, 'name': 'lise'}
#
'''
#
#when we call (key)worker1.name we are getting the (value)
print worker1.name
#
'''
#
jon
#
'''
But I am stuck at loading my CSV data into keys and values.
[1] I want to create my own keys
worker1 = ageName([name],[age],[id],[gender])
[2] Each of [name], [age], [id] and [gender] comes from a specific column in a CSV data file
I really do not know how to approach this. I have tried many methods, but all of them failed. I need some help to get started.
---- Edit
This is my original code
import csv
# let us first make student an object
class Student():
    def __init__(self):
        self.fname = []
        self.lname = []
        self.ID = []
        self.sport = []
        # let us read this file
        for row in list(csv.reader(open("copy-john.csv", "rb")))[1:]:
            self.fname.append(row[0])
            self.lname.append(row[1])
            self.ID.append(row[2])
            self.sport.append(row[3])
    def Tableformat(self):
        print "%-14s|%-10s|%-5s|%-11s" %('First Name','Last Name','ID','Favorite Sport')
        print "-" * 45
        for (i, fname) in enumerate(self.fname):
            print "%-14s|%-10s|%-5s|%3s" %(fname,self.lname[i],self.ID[i],self.sport[i])
    def Table(self):
        print self.lname
class Database(Student):
    def __init__(self):
        g = 0
        choice = ['Basketball','Football','Other','Baseball','Handball','Soccer','Volleyball','I do not like sport']
        data = student.sport
        k = len(student.fname)
        print k
        freq = {}
        for i in data:
            freq[i] = freq.get(i, 0) + 1
        for i in choice:
            if i not in freq:
                freq[i] = 0
            print i, freq[i]
student = Student()
database = Database()
This is my current code (incomplete)
import csv
class Student(object):
    '''class to represent a person'''
    def __init__(self, lname, fname, ID, sport):
        self.lname = lname
        self.fname = fname
        self.ID = ID
        self.sport = sport
reader = csv.reader(open('copy-john.csv'), delimiter=',', quotechar='"')
student = [Student(row[0], row[1], row[2], row[3]) for row in reader][1::]
print "%-14s|%-10s|%-5s|%-11s" %('First Name','Last Name','ID','Favorite Sport')
print "-" * 45
for i in range(len(student)):
    print "%-14s|%-10s|%-5s|%3s" %(student[i].lname,student[i].fname,student[i].ID,student[i].sport)
choice = ['Basketball','Football','Other','Baseball','Handball','Soccer','Volleyball','I do not like sport']
lst = []
h = 0
k = len(student)
# 23
for i in range(len(student)):
    lst.append(student[i].sport) # merge together
for a in set(lst):
    print a, lst.count(a)
for i in set(choice):
    if i not in set(lst):
        lst.append(i)
        lst.count(i) = 0
        print lst.count(i)
import csv
reader = csv.reader(open('workers.csv', newline=''), delimiter=',', quotechar='"')
workers = [ageName(row[0], row[1]) for row in reader]
workers now has a list of all the workers
>>> workers[0].name
'jon'
added edit after question was altered
Is there any reason you're using old style classes? I'm using new style here.
class Student(object):
    sports = []
    def __init__(self, row):
        self.lname, self.fname, self.ID, self.sport = row
        self.sports.append(self.sport)
    def get(self):
        return (self.lname, self.fname, self.ID, self.sport)
reader = csv.reader(open('copy-john.csv'), delimiter=',', quotechar='"')
print "%-14s|%-10s|%-5s|%-11s" % tuple(reader.next()) # read header line from csv
print "-" * 45
students = list(map(Student, reader)) # read all remaining lines
for student in students:
    print "%-14s|%-10s|%-5s|%3s" % student.get()
# Printing all sports that are specified by students
for s in set(Student.sports): # class attribute
    print s, Student.sports.count(s)
# Printing sports that are not picked
allsports = ['Basketball','Football','Other','Baseball','Handball','Soccer','Volleyball','I do not like sport']
for s in set(allsports) - set(Student.sports):
    print s, 0
Hope this gives you some ideas of the power of python sequences. ;)
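As a side note on the counting step, here is a minimal sketch using collections.Counter. It is written for Python 3 (unlike the Python 2 code in this thread), and the `picked` list is made-up sample data standing in for the sport column of the CSV file.

```python
from collections import Counter

all_sports = ['Basketball', 'Football', 'Other', 'Baseball', 'Handball',
              'Soccer', 'Volleyball', 'I do not like sport']

# Hypothetical picks, standing in for the sport column of the CSV file.
picked = ['Soccer', 'Football', 'Soccer', 'Other']

counts = Counter(picked)

# A Counter returns 0 for missing keys, so unpicked sports need no special case.
tally = [(sport, counts[sport]) for sport in all_sports]
for sport, n in tally:
    print(sport, n)
```

Because missing keys count as zero, the separate "sports that are not picked" loop is not needed with this approach.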
edit 2, shortened as much as possible... just to show off :P
Ladies and gentlemen, 7(.5) lines.
allsports = ['Basketball','Football','Other','Baseball','Handball',
             'Soccer','Volleyball','I do not like sport']
sports = []
reader = csv.reader(open('copy-john.csv'))
for row in reader:
    if reader.line_num > 1: sports.append(row[3])  # skip the header line
    print "%-14s|%-10s|%-5s|%-11s" % tuple(row)
for s in allsports: print s, sports.count(s)
I know this is a pretty old question, but it's impossible to read this and not think of the amazing new(ish) Python library, pandas. Its main unit of analysis is a thing called a DataFrame, which is modelled after the way R handles data.
Let's say you have a (very silly) csv file called example.csv which looks like this:
day,fruit,sales
Monday,Banana,10
Monday,Orange,20
Tuesday,Banana,12
Tuesday,Orange,22
If you want to read in a csv in double-quick time, and do 'stuff' with it, you'd be hard pressed to beat the following code for either brevity or ease of use:
>>> import pandas as pd
>>> csv = pd.read_csv('example.csv')
>>> csv
day fruit sales
0 Monday Banana 10
1 Monday Orange 20
2 Tuesday Banana 12
3 Tuesday Orange 22
>>> csv[csv.fruit=='Banana']
day fruit sales
0 Monday Banana 10
2 Tuesday Banana 12
>>> csv[(csv.fruit=='Banana') & (csv.day=='Monday')]
day fruit sales
0 Monday Banana 10
In my opinion, this is really fantastic stuff. Never iterate over a csv.reader object again!
I second Mark's suggestion. In particular, look at DictReader from the csv module, which allows reading a comma-separated (or, more generally, delimited) file as a sequence of dictionaries.
Look at PyMotW's coverage of the csv module for a quick reference and examples of using DictReader and DictWriter.
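To make the DictReader suggestion concrete, here is a minimal sketch under a few assumptions: it targets Python 3, the `Worker` class and `load_workers` helper are hypothetical names, and the in-memory `SAMPLE` text stands in for a real file whose first line is a header.

```python
import csv
import io

# Hypothetical sample data standing in for a real file such as workers.csv.
SAMPLE = "name,age,id,gender\njon,40,1,m\nlise,22,2,f\n"


class Worker(object):
    """Holds one CSV row; attribute names come from the header line."""

    def __init__(self, **fields):
        self.__dict__.update(fields)


def load_workers(handle):
    # DictReader uses the first line as keys for every following row.
    return [Worker(**row) for row in csv.DictReader(handle)]


workers = load_workers(io.StringIO(SAMPLE))
print(workers[0].name, workers[1].age)
```

Every value arrives as a string, so a field such as age would still need an explicit int() conversion if numbers are wanted.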
Have you looked at the csv module?
import csv
Related
I'm trying to transform a class I created:
class LabeledSourceFeatures:
    label = ''
    features = FeatureSet()
    def __init__(self, label, features):
        self.label = label
        self.features = features
    # added this as part of the workaround
    def flat_features(self):
        return self.features.__dict__
It is easier to create the FeatureSet then add it to this class (just saying that to mention that I don't want to just take the members of FeatureSet and put them in LabeledSourceFeatures), but the end result, I want to put into a pandas dataframe. The problem was, I was getting a data frame with 2 columns, one for the label string and one for the FeatureSet object. What I really want is take all the keys and values of my FeatureSet and make them their own columns.
This is what I've tried so far:
intermediate_data = [(t.__dict__, x.__dict__ for x in t.features) for t in labeledFeatures]
# This fails for syntax error.
A workaround is this:
intermediate_data = [(t.label, t.flat_features()) for t in labeledFeatures]
final_data = []
for row in intermediate_data:
    new_row = row[1]
    new_row['label'] = row[0]
    final_data.append(new_row)
but this looks very inefficient.
Edit:
a FeatureSet looks like this:
class FeatureSet:
    """
    Adapted from the CSFS presented in De-anonymizing Programmers via Code Stylometry
    by:
    Aylin Caliskan-Islam, Drexel University; Richard Harang, U.S. Army Research Laboratory;
    Andrew Liu, University of Maryland; Arvind Narayanan, Princeton University;
    Clare Voss, U.S. Army Research Laboratory; Fabian Yamaguchi, University of Goettingen;
    Rachel Greenstadt, Drexel University
    """
    # LEXICAL FEATURES
    ln_keyword_length = 0
    ln_unique_keyword_length = 0
    ln_comments_length = 0
    ln_token_length = 0
    avg_line_length = 0
    # LAYOUT FEATURES
    ln_tabs_length = 0
    ln_space_length = 0
    ln_empty_length = 0
    white_space_ratio = 0
    is_brace_on_new_line = False
    do_tabs_lead_lines = False
    comment_text = ''
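One way to flatten each labeled object into a single row is sketched below. It is only a sketch: the two classes are trimmed, hypothetical stand-ins for the ones in the question, and it assumes the feature values are set on the instance, because vars() (like __dict__) does not include class-level defaults that were never assigned on the object.

```python
class FeatureSet(object):
    # Trimmed, hypothetical stand-in for the real FeatureSet.
    def __init__(self, avg_line_length=0, white_space_ratio=0):
        self.avg_line_length = avg_line_length
        self.white_space_ratio = white_space_ratio


class LabeledSourceFeatures(object):
    def __init__(self, label, features):
        self.label = label
        self.features = features


def flatten(item):
    # Copy the feature attributes so the original object is not modified,
    # then add the label as one more column.
    row = dict(vars(item.features))
    row['label'] = item.label
    return row


labeled = [
    LabeledSourceFeatures('alice', FeatureSet(31.5, 0.2)),
    LabeledSourceFeatures('bob', FeatureSet(40.0, 0.1)),
]
rows = [flatten(item) for item in labeled]
print(rows[0])
```

A list of dicts such as `rows` can be passed to pandas.DataFrame, which turns each key into its own column. Copying with dict() also avoids the side effect of the workaround above, which writes 'label' into the feature object's own __dict__.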
import PyPDF2
import re
import xlsxwriter
docsFile = open('image0001.pdf','rb')
pdfReader = PyPDF2.PdfFileReader(docsFile)
loanNumberlist = []
loan2Matchlist = []
poolNumlist = []
borrowerNamelist = []
wb = xlsxwriter.Workbook('docInfo.xlsx')
ws = wb.add_worksheet('sheet2')
row = 0
columnHeaders = ['Borrower Name', 'Loan Number', 'LD Loan Number', 'Pool #']
for col, colname in enumerate(columnHeaders, start=0):
    ws.write(row, col, colname)
class pdfExtract:
    def __init__(self, pg):
        self.pg = pg
    def extractShit(self):
        pageObj = pdfReader.getpage(self.pg)
        pgData = pageObj.extractText()
        loanNumber = re.split('\\bLoan #:\\b', pgData)[-1]
        loanNumberlist.append(loanNumber)
        loan2Match = re.match(r"?:/\d{0,10}", pgData)[-1]
        loan2Matchlist.append(loan2Match)
        poolNumber = re.split('\\bPool #:\\b',pgData)[-1]
        poolNumlist.append(poolNumber)
        borrowerName =re.split('\\bBorrower #:\\b',pgData)[-1]
        borrowerNamelist.append(borrowerName)
for page in range(0, 223):
    pdfExtract(page)
for row, rowvar in enumerate(borrowerNamelist, start=1):#write Borrower name
    col = 0
    ws.write_string(row, col, rowvar)
for row, lnNM in enumerate(loanNumberlist, start=1):#write loan number 1
    col = 1
    ws.write_number(row, col, lnNM)
for row, lnNM2 in enumerate(loan2Matchlist, start=1):#write loan number 2
    col = 2
    ws.write_number(row, col, lnNM2)
for row, plNm in enumerate(poolNumlist, start=1):#write pool number
    col = 3
    ws.write_number(row, col, plNm)
wb.close()
So, I wrote this program to grab data from a PDF file, pull 4 things from each page, and put them into an Excel file. Each page looks like:
Loan #: 0065192080/3000009289
Pool#: AK1576
Borrower: David h Theman
I have to go through each page and get the first loan number, then the second loan number (after the "/"), then the rest.
It runs through, but all I see in the Excel file is the headers: no data and no errors.
I thought the problem was how I'm returning the data or how I'm writing it, but changing those made no difference. Could it have to do with my for loops? I got the regex code from different answers on here. I've changed how I write to the Excel file, but no luck.
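One thing visible in the code above is that the loop only constructs pdfExtract(page) and never calls its extraction method, so the four lists stay empty and only the headers get written. Separately, the field parsing can be sketched with plain regular expressions. This is a minimal sketch, assuming each page's text follows the three-line layout shown in the question; `parse_page` and the pattern names are hypothetical, and real extracted PDF text may be spaced differently.

```python
import re

# Sample page text copied from the layout shown in the question; real
# extracted PDF text may be spaced or ordered differently.
PAGE_TEXT = """Loan #: 0065192080/3000009289
Pool#: AK1576
Borrower: David h Theman
"""

# One pattern per field; \s* tolerates "Loan #:" as well as "Pool#:".
LOAN_RE = re.compile(r"Loan\s*#:\s*(\d+)/(\d+)")
POOL_RE = re.compile(r"Pool\s*#:\s*(\S+)")
BORROWER_RE = re.compile(r"Borrower:\s*(.+)")


def parse_page(text):
    """Return (borrower, loan, ld_loan, pool), or None if a field is missing."""
    loan = LOAN_RE.search(text)
    pool = POOL_RE.search(text)
    borrower = BORROWER_RE.search(text)
    if not (loan and pool and borrower):
        return None
    return (borrower.group(1).strip(), loan.group(1), loan.group(2),
            pool.group(1))


record = parse_page(PAGE_TEXT)
print(record)
```

The loan numbers are kept as strings here so that leading zeros survive; writing them as numbers would drop them.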
I am trying to write a script to generate data, using the random package. The script runs without errors, but when I checked the results I found that the last 100+ rows are missing from the output.
Can someone suggest why this could be happening?
from __future__ import print_function
from faker import Faker;
import random;
## Value declaration
population = 3;
product = 3;
years = 3;
months = 13;
days = 30;
tax= 3.5;
## Define Column Header
Column_Names = "Population_ID",";","Product_Name",";","Product_ID",";","Year",";",
"Month",";","Day","Quantity_sold",";","Sales_Price",";","Discount",
";","Actual_Sales_Price",tax;
## Function to generate sales related information
def sales_data():
    for x in range(0,1):
        quantity_sold = random.randint(5,20);
        discount = random.choice(range(5,11));
        sales_price = random.uniform(20,30);
        return quantity_sold,round(sales_price,2),discount,round((sales_price)-(sales_price*discount)+(sales_price*tax));
## Format the month to quarter and return the value
def quarter(month):
    if month >= 1 and month <= 3:
        return "Q1";
    elif month > 3 and month <= 6:
        return "Q2";
    elif month > 6 and month <= 9:
        return "Q3";
    else:
        return "Q4";
## Generate product_id
def product_name():
    str2 = "PROD";
    sample2 = random.sample([1,2,3,4,5,6,7,8,9],5);
    string_list = [];
    for x in sample2:
        string_list.append(str(x));
    return (str2+''.join(string_list));
### Main starts here ###
result_log = open("C:/Users/Sangamesh.sangamad/Dropbox/Thesis/Data Preparation/GenData.csv",'w')
print (Column_Names, result_log);
### Loop and Generate Data ###
for pop in range(0,population):
    pop = random.randint(55000,85000);
    for prod_id in range(0,product):
        product_name2 = product_name();
        for year in range(1,years):
            for month in range(1,months):
                for day in range(1,31):
                    a = sales_data();
                    rows = str(pop)+";"+product_name2+";"+str(prod_id)+";"+str(year)+";"+str(month)+";"+quarter(month)+";"+str(day)+";"+str(a[0])+";"+str(a[1])+";"+str(a[2])+";"+str(tax)+";"+str(a[3]);
                    print(rows,file=result_log);
                    #print (rows);
            tax = tax+1;
You need to close a file to have the buffers flushed:
result_log.close()
Better still, use the file object as a context manager and have the with statement close it for you when the block exits:
filename = "C:/Users/Sangamesh.sangamad/Dropbox/Thesis/Data Preparation/GenData.csv"
with open(filename, 'w') as result_log:
    # code writing to result_log
Rather than manually writing strings with delimiters in between, you should really use the csv module:
import csv
# ..
column_names = (
    "Population_ID", "Product_Name", "Product_ID", "Year",
    "Month", "Day", "Quantity_sold", "Sales_Price", "Discount",
    "Actual_Sales_Price", tax)
# ..
with open(filename, 'wb') as result_log:
    writer = csv.writer(result_log, delimiter=';')
    writer.writerow(column_names)
    # looping
    row = [pop, product_name2, prod_id, year, month, quarter(month), day,
           a[0], a[1], a[2], tax, a[3]]
    writer.writerow(row)
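For reference, a small self-contained version of the same pattern is sketched below. It assumes Python 3, where the csv module wants a text-mode file opened with newline='' instead of 'wb'; the header, rows and temporary path are made up for illustration.

```python
import csv
import os
import tempfile

# Hypothetical rows standing in for the generated sales data.
header = ["Population_ID", "Product_Name", "Year", "Quantity_sold"]
rows = [[61000, "PROD12345", 1, 7], [61000, "PROD12345", 1, 12]]

path = os.path.join(tempfile.mkdtemp(), "GenData.csv")

# Python 3: open in text mode with newline='' for the csv module.
with open(path, "w", newline="") as result_log:
    writer = csv.writer(result_log, delimiter=";")
    writer.writerow(header)
    writer.writerows(rows)
# Leaving the with block closes the file, which flushes buffered rows.

with open(path, newline="") as check:
    written = list(csv.reader(check, delimiter=";"))
print(len(written))
```

Reading the file back after the with block shows every row on disk, which is the behaviour the missing close() call was preventing.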
I wrote the following code, which works, but I want to improve it. I don't want to re-read the file, but if I delete sales_input.seek(0) it won't iterate through each row in sales. How can I improve this?
def computeCritics(mode, cleaned_sales_input = "data/cleaned_sales.csv"):
    if mode == 1:
        print "creating customer.critics.recommendations"
        critics_output = open("data/customer/customer.critics.recommendations",
                              "wb")
        ID = getCustomerSet(cleaned_sales_input)
        sales_dict = pickle.load(open("data/customer/books.dict.recommendations",
                                      "r"))
    else:
        print "creating books.critics.recommendations"
        critics_output = open("data/books/books.critics.recommendations",
                              "wb")
        ID = getBookSet(cleaned_sales_input)
        sales_dict = pickle.load(open("data/books/users.dict.recommendations",
                                      "r"))
    critics = {}
    # make critics dict and pickle it
    for i in ID:
        with open(cleaned_sales_input, 'rb') as sales_input:
            sales = csv.reader(sales_input) # read new
            for j in sales:
                if mode == 1:
                    if int(i) == int(j[2]):
                        sales_dict[int(j[6])] = 1
                else:
                    if int(i) == int(j[6]):
                        sales_dict[int(j[2])] = 1
        critics[int(i)] = sales_dict
    pickle.dump(critics, critics_output)
    print "done"
cleaned_sales_input looks like
6042772,2723,3546414,9782072488887,1,9.99,314968
6042769,2723,3546414,9782072488887,1,9.99,314968
...
where number 6 is the book ID and number 0 is the customer ID
I want to get a dict which looks like
critics = {
    CustomerID1: {
        BookID1: 1,
        BookID2: 0,
        ........
        BookIDX: 0
    },
    CustomerID2: {
        BookID1: 0,
        BookID2: 1,
        ...
    }
}
or
critics = {
    BookID1: {
        CustomerID1: 1,
        CustomerID2: 0,
        ........
        CustomerIDX: 0
    },
    BookID2: {
        CustomerID1: 0,
        CustomerID2: 1,
        ...
        CustomerIDX: 0
    }
}
I hope this isn't too much information.
Here are some suggestions:
Let's first look at this code pattern:
for i in ID:
    for j in sales:
        if int(i) == int(j[2]):
Notice that i is only being compared with j[2]. That's its only purpose in the loop. For a given row j, int(i) == int(j[2]) can be True for at most one i, since ID holds distinct values.
So, we can completely remove the for i in ID loop by rewriting it as
for j in sales:
    key = j[2]
    if key in ID:
Based on the function names getCustomerSet and getBookSet, it sounds as if
ID is a set (as opposed to a list or tuple). We want ID to be a set since
testing membership in a set is O(1) (as opposed to O(n) for a list or tuple).
Next, consider this line:
critics[int(i)] = sales_dict
There is a potential pitfall here. This line is assigning sales_dict to
critics[int(i)] for each i in ID. Each key int(i) is being mapped to the very same dict. As we loop through sales and ID, we are modifying sales_dict like this, for example:
sales_dict[int(j[6])] = 1
But this will cause all values in critics to be modified simultaneously, since all keys in critics point to the same dict, sales_dict. I doubt that is what you want.
To avoid this pitfall, we need to make copies of the sales_dict:
critics = {i:sales_dict.copy() for i in ID}
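The difference between sharing one dict and copying it can be seen in a tiny sketch (Python 3; the template dicts and IDs are made-up values for illustration):

```python
template = {'book_a': 0, 'book_b': 0}
ids = [1, 2]

# Every key refers to the same dict object.
shared = {i: template for i in ids}
shared[1]['book_a'] = 1

# Every key gets its own shallow copy.
template2 = {'book_a': 0, 'book_b': 0}
separate = {i: template2.copy() for i in ids}
separate[1]['book_a'] = 1

print(shared[2]['book_a'], separate[2]['book_a'])
```

In the shared version the change made through key 1 is also visible through key 2, while the copied version keeps the two entries independent. A shallow copy is enough here because the values are plain integers.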
Putting these changes together:
import csv
import os
import pickle

def computeCritics(mode, cleaned_sales_input="data/cleaned_sales.csv"):
    if mode == 1:
        filename = 'customer.critics.recommendations'
        path = os.path.join("data/customer", filename)
        ID = getCustomerSet(cleaned_sales_input)
        sales_dict = pickle.load(
            open("data/customer/books.dict.recommendations", "r"))
        key_idx, other_idx = 2, 6
    else:
        filename = 'books.critics.recommendations'
        path = os.path.join("data/books", filename)
        ID = getBookSet(cleaned_sales_input)
        sales_dict = pickle.load(
            open("data/books/users.dict.recommendations", "r"))
        key_idx, other_idx = 6, 2
    print "creating {}".format(filename)
    ID = {int(item) for item in ID}
    # give every key its own copy of sales_dict
    critics = {i:sales_dict.copy() for i in ID}
    with open(path, "wb") as critics_output:
        # make critics dict and pickle it
        with open(cleaned_sales_input, 'rb') as sales_input:
            sales = csv.reader(sales_input) # read the file only once
            for j in sales:
                key = int(j[key_idx])
                if key in ID:
                    other_key = int(j[other_idx])
                    critics[key][other_key] = 1
        pickle.dump(dict(critics), critics_output)
    print "done"
@unutbu's answer is better, but if you are stuck with this structure you can read the whole file into memory:
with open(cleaned_sales_input, 'rb') as sales_input:
    sales = list(csv.reader(sales_input))
for i in ID:
    for j in sales:
        #do stuff
I have a huge file, which has some missing rows. The data needs to be rooted at Country.
The input data is like:
csv_str = """Type,Country,State,County,City,
1,USA,,,
2,USA,OH,,
3,USA,OH,Franklin,
4,USA,OH,Franklin,Columbus
4,USA,OH,Franklin,Springfield
4,USA,WI,Dane,Madison
"""
which needs to become:
csv_str = """Type,Country,State,County,City,
1,USA,,,
2,USA,OH,,
3,USA,OH,Franklin,
4,USA,OH,Franklin,Columbus
4,USA,OH,Franklin,Springfield
4,USA,WI,,
4,USA,WI,Dane,
4,USA,WI,Dane,Madison
"""
The key, as per my logic, is the Type field: if I cannot find a County (type 3) for a City (type 4), then I insert a row with the fields up to County.
The same goes for County. If I cannot find a State (type 2) for a County (type 3), then I insert a row with the fields up to State.
Given my limited understanding of the facilities in Python, I was trying more of a brute-force approach. It is a bit problematic, as I need a lot of iterations over the same file.
I also tried google-refine, but couldn't get it to work. Doing it manually is quite error-prone.
Any help is appreciated.
import csv
import io
csv_str = """Type,Country,State,County,City,
1,USA,,,
2,USA,OH,,
3,USA,OH,Franklin,
4,USA,OH,Franklin,Columbus
4,USA,OH,Franklin,Springfield
4,USA,WI,Dane,Madison
"""
found_county = []
missing_county = []
def check_missing_county(row):
    found = False
    for elm in found_county:
        if elm.Type == row.Type:
            found = True
    if not found:
        missing_county.append(row)
        print(row)
reader = csv.reader(io.StringIO(csv_str))
for row in reader:
    check_missing_county(row)
I've knocked up the following based on my understanding of the question:
import csv
import io
csv_str = u"""Type,Country,State,County,City,
1,USA,,,
2,USA,OH,,
3,USA,OH,Franklin,
4,USA,OH,Franklin,Columbus
4,USA,OH,Franklin,Springfield
4,USA,WI,Dane,Madison
"""
counties = []
states = []
def handle_missing_data(row):
    try:
        int(row[0])  # skip the header row, whose first field is not a number
    except ValueError:
        return []
    rtype = row[0]
    country = row[1]
    state = row[2]
    county = row[3]
    rows = []
    # if a state is present and it doesn't have a row of its own
    if state and state not in states:
        rows.append([rtype, country, state, '', ''])
        states.append(state)
    # if a county is present and it doesn't have a row of its own
    if county and county not in counties:
        rows.append([rtype, country, state, county, ''])
        counties.append(county)
    # if the row hasn't already been added, add it now
    if row not in rows:
        rows.append(row)
    return rows
csvf = io.StringIO(csv_str)
reader = csv.reader(csvf)
for row in reader:
    new_rows = handle_missing_data(row)
    for new_row in new_rows:
        print new_row
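A variation on the same idea is sketched below for Python 3. It is a sketch under stated assumptions: `fill_missing` is a hypothetical helper, parents are tracked as (country, state, ...) tuples so that the same county name in two states is not confused, and inserted rows copy the Type of the row that triggered them, as in the expected output of the question.

```python
import csv
import io

CSV_TEXT = """Type,Country,State,County,City,
1,USA,,,
2,USA,OH,,
3,USA,OH,Franklin,
4,USA,OH,Franklin,Columbus
4,USA,OH,Franklin,Springfield
4,USA,WI,Dane,Madison
"""


def fill_missing(rows):
    """Yield rows, inserting a parent row whenever one has not been seen."""
    seen = set()
    for row in rows:
        if not row:
            continue  # ignore blank lines
        rtype, places = row[0], row[1:5]
        # Position of the last non-empty place field: 1=country ... 4=city.
        depth = max((i + 1 for i, p in enumerate(places) if p), default=0)
        for level in range(1, depth + 1):
            key = tuple(places[:level])
            if key not in seen:
                seen.add(key)
                if level < depth:
                    # Missing ancestor: pad the remaining fields with ''.
                    yield [rtype] + list(key) + [''] * (4 - level)
        yield row


reader = csv.reader(io.StringIO(CSV_TEXT))
header = next(reader)  # keep the header out of the hierarchy check
result = list(fill_missing(reader))
for row in result:
    print(','.join(row))
```

Using a set of tuples keeps each lookup cheap and the file is read only once, which matters for a huge input.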