I have a huge file that is missing some rows. The data needs to be rooted at Country.
The input data looks like this:
csv_str = """Type,Country,State,County,City,
1,USA,,,
2,USA,OH,,
3,USA,OH,Franklin,
4,USA,OH,Franklin,Columbus
4,USA,OH,Franklin,Springfield
4,USA,WI,Dane,Madison
"""
and it needs to become:
csv_str = """Type,Country,State,County,City,
1,USA,,,
2,USA,OH,,
3,USA,OH,Franklin,
4,USA,OH,Franklin,Columbus
4,USA,OH,Franklin,Springfield
4,USA,WI,,
4,USA,WI,Dane,
4,USA,WI,Dane,Madison
"""
By my logic the key is the Type field: if I cannot find a County row (type 3) for a City (type 4), a row with the fields up to County should be inserted.
The same goes for County: if I cannot find a State row (type 2) for a County (type 3), a row with the fields up to State should be inserted.
Since I don't know Python's facilities well, I was trying a brute-force approach. It is a bit problematic because it needs many iterations over the same file.
I also tried Google Refine, but couldn't get it to work. Doing it manually is quite error-prone.
Any help is appreciated.
import csv
import io
csv_str = """Type,Country,State,County,City,
1,USA,,,
2,USA,OH,,
3,USA,OH,Franklin,
4,USA,OH,Franklin,Columbus
4,USA,OH,Franklin,Springfield
4,USA,WI,Dane,Madison
"""
found_county = []
missing_county = []
def check_missing_county(row):
    found = False
    for elm in found_county:
        if elm.Type == row.Type:
            found = True
    if not found:
        missing_county.append(row)
        print(row)
reader = csv.reader(io.StringIO(csv_str))
for row in reader:
    check_missing_county(row)
I've knocked up the following based on my understanding of the question:
import csv
import io
csv_str = u"""Type,Country,State,County,City,
1,USA,,,
2,USA,OH,,
3,USA,OH,Franklin,
4,USA,OH,Franklin,Columbus
4,USA,OH,Franklin,Springfield
4,USA,WI,Dane,Madison
"""
counties = []
states = []
def handle_missing_data(row):
    try:
        int(row[0])  # the header row has a non-numeric Type, so skip it
    except ValueError:
        return []
    rtype = row[0]
    country = row[1]
    state = row[2]
    county = row[3]
    rows = []
    # if a state is present and it hasn't a row of its own
    if state and state not in states:
        rows.append([rtype, country, state, '', ''])
        states.append(state)
    # if a county is present and it hasn't a row of its own
    if county and county not in counties:
        rows.append([rtype, country, state, county, ''])
        counties.append(county)
    # if the row hasn't already been added add it now
    if row not in rows:
        rows.append(row)
    return rows

csvf = io.StringIO(csv_str)
reader = csv.reader(csvf)
for row in reader:
    new_rows = handle_missing_data(row)
    for new_row in new_rows:
        print(new_row)
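One thing to watch with the approach above: it tracks states and counties by name only, so two counties with the same name in different states would be treated as one. Here is a minimal sketch of the same idea that keys on the full path instead. The function name `fill_missing_parents` and the use of sets are my own choices, not something from the question, and it assumes every data row has exactly five fields.

```python
import csv
import io

csv_str = """Type,Country,State,County,City,
1,USA,,,
2,USA,OH,,
3,USA,OH,Franklin,
4,USA,OH,Franklin,Columbus
4,USA,OH,Franklin,Springfield
4,USA,WI,Dane,Madison
"""

def fill_missing_parents(rows):
    """Yield every row, preceded by any State or County row it depends on."""
    seen_states = set()    # (country, state)
    seen_counties = set()  # (country, state, county)
    for rtype, country, state, county, city in rows:
        if state and (country, state) not in seen_states:
            seen_states.add((country, state))
            if county or city:  # this row is not the State row itself
                yield [rtype, country, state, "", ""]
        if county and (country, state, county) not in seen_counties:
            seen_counties.add((country, state, county))
            if city:  # this row is not the County row itself
                yield [rtype, country, state, county, ""]
        yield [rtype, country, state, county, city]

reader = csv.reader(io.StringIO(csv_str))
next(reader)  # skip the header line
filled = list(fill_missing_parents(reader))
for row in filled:
    print(",".join(row))
```

This is a single pass over the file, and set lookups stay fast even for a huge input. Like the expected output in the question, the inserted rows keep the Type of the row that triggered them.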
I've written a python program that takes some inputs and turns them into a matplotlib graph. Specifically, it displays wealth distributions by percentile for a country of the user's choosing. However, these inputs are currently given by changing variables in the program.
I want to put this code on a website, allowing users to choose any country and see the wealth distribution for that country, as well as how they compare. Essentially, I am trying to recreate this: https://wid.world/income-comparator/
The Python code is all done, but I am struggling to incorporate it into an HTML file. I tried PyScript, but the page currently loads forever and displays nothing. I would rather not rewrite it in JavaScript (mainly because I don't know JS). My guess is that it has something to do with the code reading CSV files from my device?
import csv
from typing import List
import matplotlib.pyplot as plt
import collections
import math
from forex_python.converter import CurrencyRates
# ---------------- #
# whether or not the graph includes the top 1 percent in the graph (makes the rest of the graph visible!)
one_percent = False # True or False
# pick which country(ies) you want to view
country = 'China' # String
# what currency should the graph use
currency_used = 'Canada' # String
# if you want to compare an income
compare_income = True # True or False
# what income do you want to compare
income = 100000 # Int
# ---------------- #
codes = {}
# get dictionary of monetary country codes
monetary_codes = {}
with open('codes-all.csv') as csv_file:
    # named code_rows rather than list, so the builtin list is not shadowed
    code_rows = csv.reader(csv_file, delimiter=',')
    for row in code_rows:
        if row[5] == "":
            monetary_codes[row[0]] = (row[2], row[1])
# get dictionary of country names and codes for WID
with open('WID_countries.csv') as csv_file:
    WID_codes = csv.reader(csv_file, delimiter=',')
    next(WID_codes)
    for row in WID_codes:
        if len(row[0]) == 2:
            if row[2] != "":
                monetary_code = monetary_codes[row[1].upper()][0]
                currency_name = monetary_codes[row[1].upper()][1]
                codes[row[1].upper()] = (row[0], monetary_code, currency_name)
            elif row[2] == "":
                codes[row[1].upper()] = (row[0], 'USD', 'United States Dollar')
        elif row[0][0] == 'U' and row[0][1] == 'S':
            codes[row[1].upper()] = (row[0], 'USD', 'United States Dollar')
# converts user input to upper case
country = country.upper()
currency_used = currency_used.upper()
# gets conversion rate
c = CurrencyRates()
conversion_rate = c.get_rate(codes[country][1], codes[currency_used][1])
# convert money into correct currency
def convert_money(conversion_rate, value):
    return float(value) * conversion_rate
# get and clean data
def get_data(country):
    aptinc = {}
    # cleaning the data
    with open(f'country_data/WID_data_{codes[country][0]}.csv') as csv_file:
        data = csv.reader(csv_file, delimiter=';')
        for row in data:
            # I only care about the year 2021 and the variable 'aptinc'
            if 'aptinc992' in row[1] and row[3] == '2021':
                # translates percentile string into a numerical value
                index = 0
                for i in row[2]:
                    # index 0 is always 'p', so we get rid of that
                    if index == 0:
                        row[2] = row[2][1:]
                    # each string has a p in the middle of the numbers we care about. I also only
                    # care about the rows which measure a single percentile
                    # (upper bound - lower bound <= 1)
                    elif i == 'p':
                        lb = float(row[2][:index - 1])
                        ub = float(row[2][index:])
                        # if the top one percent is being filtered out adds another requirement
                        if not one_percent:
                            if ub - lb <= 1 and ub <= 99:
                                row[2] = ub
                            else:
                                row[2] = 0
                        else:
                            if ub - lb <= 1:
                                row[2] = ub
                            else: row[2] = 0
                    index += 1
                # adds wanted, cleaned data to a dictionary. Also converts all values to one currency
                if row[2] != 0:
                    aptinc[row[2]] = convert_money(conversion_rate, row[4])
    return aptinc
# find the closest percentile to an income
def closest_percentile(income, data):
    closest = math.inf
    percentile = float()
    for i in data:
        # compare absolute distances, so incomes below and above are treated alike
        difference = abs(income - data[i])
        if difference < closest:
            closest = difference
            percentile = i
    return percentile
# ---------------- #
unsorted_data = {}
percentiles = []
average_income = []
# gets data for the country
data = get_data(country)
for i in data:
    unsorted_data[i] = data[i]
# sorts the data (named sorted_data so the builtin sorted is not shadowed)
sorted_data = collections.OrderedDict(sorted(unsorted_data.items()))
for i in sorted_data:
    percentiles.append(i)
    average_income.append(data[i])
# makes countries pretty for printing
country = country.lower()
country = country.capitalize()
# calculates where the income places against incomes from country(ies)
blurb = ""
if compare_income:
    percentile = closest_percentile(income, sorted_data)
    blurb = f"You are richer than {round(percentile)} percent of {country}'s population"
# plot this data!
plt.plot(percentiles,average_income)
plt.title(f'{country} Average Annual Income by Percentile')
plt.xlabel(f'Percentile\n{blurb}')
plt.ylabel(f'Average Annual Income of {country}({codes[currency_used][1]})')
plt.axvline(x = 99, color = 'r', label = '99th percentile', linestyle=':')
if compare_income:
    plt.axvline(x = percentile, color = 'g', label = f'{income} {codes[currency_used][2]}')
plt.legend(bbox_to_anchor = (0, 1), loc = 'upper left')
plt.show()
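As a side note, the percentile handling described in the comments of get_data can be written more directly. The sketch below assumes the percentile codes always look like 'p90p91' (a leading 'p', then lower bound, 'p', upper bound). The function `parse_percentile` and the sample figures are hypothetical and only there for illustration.

```python
def parse_percentile(code):
    """Return the upper bound of a bracket at most one point wide, else None."""
    # drop the leading 'p', then split on the 'p' between the two bounds
    lower_text, _, upper_text = code[1:].partition("p")
    lower, upper = float(lower_text), float(upper_text)
    return upper if upper - lower <= 1 else None

def closest_percentile(income, income_by_percentile):
    """Return the percentile whose income is nearest to the given income."""
    return min(income_by_percentile,
               key=lambda p: abs(income - income_by_percentile[p]))

# made-up numbers, not real WID data
sample = {10.0: 5000.0, 50.0: 20000.0, 90.0: 80000.0}
print(parse_percentile("p90p91"))
print(parse_percentile("p0p50"))
print(closest_percentile(25000, sample))
```

Keeping the parsing in a small pure function like this also makes it easier to test separately from the file reading and plotting.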
I have a CSV, OutputA with format:
Position,Category,Name,Team,Points
1,A,James,Team 1,100
2,A,Mark,Team 2,95
3,A,Tom,Team 1,90
I am trying to get an output of a CSV which gets the total points for each team, the average points per team and the number of riders.
So output would be:
Team,Points,AvgPoints,NumOfRiders
Team1,190,95,2
Team2,95,95,1
I have this function to convert each row to a namedtuple:
fields = ("Position", "Category", "Name", "Team", "Points")
Results = namedtuple('CategoryResults', fields)
def csv_to_tuple(path):
    with open(path, 'r', errors='ignore') as file:
        reader = csv.reader(file)
        for row in map(Results._make, reader):
            yield row
Then this sorts the rows into a list ordered by their club:
moutputA = sorted(list(csv_to_tuple("Male/outputA.csv")), key=lambda k: k[3])
This returns a list like:
[CategoryResults(Position='13', Category='A', Name='Marek', Team='1', Points='48'), CategoryResults(Position='7', Category='A', Name='', Team='1', Points='70')]
I am confident that this so far is right although I could be wrong.
I am trying to create a new list of teams with the points (not yet added up).
For example:
[Team 1(1,2,3,4,5)]
[Team 2 (6,9,10)]
etc.
The idea is that I can find how many unique values of points there are (this equals the number of riders). However, when trying to group the list I have this code:
Clubs = []
Club_Points = []
for Names, Club in groupby(moutputA, lambda x: x[3]):
    for Teams in Names:
        Clubs.append(list(Teams))
for Club, Points in groupby(moutputA, lambda x: x[4]):
    for Point in Clubs:
        Club_Points.append(list(Point))
print(Clubs)
but this returns this error:
Teams.append(list(Team))
AttributeError: 'itertools._grouper' object has no attribute 'append'
If data.csv contains:
Position,Category,Name,Team,Points
1,A,James,Team 1,100
2,A,Mark,Team 2,95
3,A,Tom,Team 1,90
Then this script:
import csv
from collections import namedtuple
from itertools import groupby
from statistics import mean
fields = ("Position", "Category", "Name", "Team", "Points")
Results = namedtuple('CategoryResults', fields)
def csv_to_tuple(path):
    with open(path, 'r', errors='ignore') as file:
        next(file) # skip header
        reader = csv.reader(file)
        for row in map(Results._make, reader):
            yield row
moutputA = sorted(csv_to_tuple("data.csv"), key=lambda k: k.Team)
out = []
for team, group in groupby(moutputA, lambda x: x.Team):
    group = list(group)
    d = {}
    d['Team'] = team
    d['Points'] = sum(int(i.Points) for i in group)
    d['AvgPoints'] = mean(int(i.Points) for i in group)
    d['NumOfRider'] = len(group)
    out.append(d)
with open('data_out.csv', 'w', newline='') as csvfile:
    fieldnames = ['Team', 'Points', 'AvgPoints', 'NumOfRider']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    for row in out:
        writer.writerow(row)
Produces data_out.csv:
Team,Points,AvgPoints,NumOfRider
Team 1,190,95,2
Team 2,95,95,1
Here's a start. You should be able to figure out how to get what you want from this.
import csv, io
from collections import namedtuple
from itertools import groupby
data = '''\
Position,Category,Name,Team,Points
1,A,James,Team 1,100
2,A,Mark,Team 2,95
3,A,Tom,Team 1,90
'''
b = io.StringIO(data)
next(b)
fields = ("Position", "Category", "Name", "Team", "Points")
Results = namedtuple('CategoryResults', fields)
def csv_to_tuple(file):
    reader = csv.reader(file)
    for row in map(Results._make, reader):
        yield row
rows = sorted(list(csv_to_tuple(b)), key=lambda k: k[3])
for TeamName, TeamRows in groupby(rows, lambda x: x[3]):
    print(TeamName)
    TeamPoints = [row.Points for row in TeamRows]
    print(TeamPoints)
    print()
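If the sort-then-groupby step feels awkward, the per-team totals can also be collected in one pass with a dictionary of lists. This is only a sketch of that alternative, using the sample data from the question. The names `points_by_team` and `summary` are mine.

```python
import csv
import io
from collections import defaultdict

data = """Position,Category,Name,Team,Points
1,A,James,Team 1,100
2,A,Mark,Team 2,95
3,A,Tom,Team 1,90
"""

# collect each team's points; no sorting is needed beforehand
points_by_team = defaultdict(list)
for row in csv.DictReader(io.StringIO(data)):
    points_by_team[row["Team"]].append(int(row["Points"]))

summary = [
    {
        "Team": team,
        "Points": sum(pts),
        "AvgPoints": sum(pts) / len(pts),
        "NumOfRiders": len(pts),
    }
    for team, pts in sorted(points_by_team.items())
]
for entry in summary:
    print(entry)
```

Unlike groupby, this does not depend on the rows being sorted by team first, which is an easy thing to forget.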
All of this would be made easier by just using pandas. Check out the code below.
import pandas as pd
import numpy as np
df = pd.read_csv(input_path)
teams = list(set(df['Team'])) # unique list of all the teams
num_teams = len(teams)
points = np.empty(shape=num_teams)
avg_points = np.empty(shape=num_teams)
num_riders = np.empty(shape=num_teams)
for i in range(num_teams):
    # find all rows where the entry in the 'Team' column
    # is the same as teams[i]
    req = df.loc[df['Team'] == teams[i]]
    points[i] = np.sum(req['Points'])
    num_riders[i] = len(req)
    avg_points[i] = points[i]/num_riders[i]
dict_out = {
    'Team':teams,
    'Points':points,
    'AvgPoints':avg_points,
    'NumOfRiders':num_riders
}
df_out = pd.DataFrame(data=dict_out)
df_out.to_csv(output_path)
import PyPDF2
import re
import xlsxwriter
docsFile = open('image0001.pdf','rb')
pdfReader = PyPDF2.PdfFileReader(docsFile)
loanNumberlist = []
loan2Matchlist = []
poolNumlist = []
borrowerNamelist = []
wb = xlsxwriter.Workbook('docInfo.xlsx')
ws = wb.add_worksheet('sheet2')
row = 0
columnHeaders = ['Borrower Name', 'Loan Number', 'LD Loan Number', 'Pool #']
for col, colname in enumerate(columnHeaders, start=0):
    ws.write(row, col, colname)
class pdfExtract:
    def __init__(self, pg):
        self.pg = pg
    def extractShit(self):
        pageObj = pdfReader.getpage(self.pg)
        pgData = pageObj.extractText()
        loanNumber = re.split('\\bLoan #:\\b', pgData)[-1]
        loanNumberlist.append(loanNumber)
        loan2Match = re.match(r"?:/\d{0,10}", pgData)[-1]
        loan2Matchlist.append(loan2Match)
        poolNumber = re.split('\\bPool #:\\b',pgData)[-1]
        poolNumlist.append(poolNumber)
        borrowerName =re.split('\\bBorrower #:\\b',pgData)[-1]
        borrowerNamelist.append(borrowerName)
for page in range(0, 223):
    pdfExtract(page)
for row, rowvar in enumerate(borrowerNamelist, start=1):#write Borrower name
    col = 0
    ws.write_string(row, col, rowvar)
for row, lnNM in enumerate(loanNumberlist, start=1):#write loan number 1
    col = 1
    ws.write_number(row, col, lnNM)
for row, lnNM2 in enumerate(loan2Matchlist, start=1):#write loan number 2
    col = 2
    ws.write_number(row, col, lnNM2)
for row, plNm in enumerate(poolNumlist, start=1):#write pool number
    col = 3
    ws.write_number(row, col, plNm)
wb.close()
So, I wrote this program to grab data from a PDF file, extract 4 values from each page, and put them into an Excel file. A page looks like this:
Loan #: 0065192080/3000009289
Pool#: AK1576
Borrower: David h Theman
For each page I have to grab the first loan number, then the second loan number (after the "/"), then the rest.
The program runs through, but all I see in the Excel file is the headers: no data and no errors.
I thought the problem was in how I return the data or how I write it, but changing those made no difference. Could it have to do with my for loops? I got the regex code from different answers on here. I've changed how I write to the Excel file, but no luck.
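One thing that stands out in the script: the loop over the pages only constructs pdfExtract objects and never calls the extraction method, so the four lists would stay empty. Separately, the field extraction can be sketched with one search pattern per field. The example below runs on the three sample lines from the post rather than on a real PDF, and the pattern names and the `parse_page` helper are hypothetical.

```python
import re

# sample page text copied from the post; real extracted text may differ
page_text = """Loan #: 0065192080/3000009289
Pool#: AK1576
Borrower: David h Theman
"""

LOAN_RE = re.compile(r"Loan\s*#:\s*(\d+)/(\d+)")
POOL_RE = re.compile(r"Pool\s*#:\s*(\S+)")
BORROWER_RE = re.compile(r"Borrower:\s*(.+)")

def parse_page(text):
    """Return the four fields of one page; a field is None when not found."""
    loan = LOAN_RE.search(text)
    pool = POOL_RE.search(text)
    borrower = BORROWER_RE.search(text)
    return {
        "loan_number": loan.group(1) if loan else None,
        "ld_loan_number": loan.group(2) if loan else None,
        "pool": pool.group(1) if pool else None,
        "borrower": borrower.group(1).strip() if borrower else None,
    }

print(parse_page(page_text))
```

Using search with capture groups avoids the guesswork of re.split(...)[-1], which returns everything after the label, including all the following lines. Note that the loan numbers come back as strings, which also preserves the leading zeros.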
I am working on a function to pull specific rows out of a CSV. Every CSV row has a unique ID that identifies it to the function. Some IDs are missing. After iterating, I want to somehow find these invalid IDs.
Example:
(a sample CSV db_short.csv with rows 1-52 and then 99)
import csv
def get_row(csvfile, row_id):
    with open(csvfile, 'rb') as csvfile:
        newfile = csv.DictReader(csvfile, delimiter=',', quotechar='|')
        somevalue = 'default'
        for row in newfile:
            if row['id'] == str(row_id):
                somevalue = 'id = {}'.format(row['id'])
            else:
                pass
    return somevalue
db = "db_short.csv"
flatlist = [1, 18, 42, 51, 53, 99]
new_entries = []
for i in flatlist:
    new_entries.append(get_row(db, i))
print new_entries
Note that flatlist deliberately includes a missing ID, 53. This code predictably produces output where the search for 'id': '53' returns 'default'.
['id = 1', 'id = 18', 'id = 42', 'id = 51', 'default', 'id = 99']
I would, however, like to replace somevalue = 'default' with, say, a customized message alerting to a missing ID, one that appears only if DictReader went through the whole CSV and did not find any row that contains 'id': '53':
somevalue = '{} id missing!'.format(row_id)
So how do I have to change my code?
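One possible way to get this behaviour, sketched in Python 3: read the ID column once into a set, then check each wanted ID against it. That also avoids reopening the file for every lookup. The inline sample CSV, its name column and the helper `describe_ids` are made up for the example.

```python
import csv
import io

# stand-in for db_short.csv; only the 'id' column matters here
sample_csv = "id,name\n1,a\n18,b\n42,c\n51,d\n99,e\n"

def describe_ids(csv_file, wanted_ids):
    """Return one message per wanted ID, flagging IDs absent from the file."""
    present = {row["id"] for row in csv.DictReader(csv_file)}
    return [
        "id = {}".format(i) if str(i) in present else "{} id missing!".format(i)
        for i in wanted_ids
    ]

new_entries = describe_ids(io.StringIO(sample_csv), [1, 18, 42, 51, 53, 99])
print(new_entries)
```

The "missing" message is only produced after the whole file has been read, because the set is complete before any ID is checked.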
I am pretty new to Python. I need to create a class that loads CSV data into a dictionary.
I want to be able to control the keys and values.
So, with the following code, I can pull out worker1.name or worker1.age any time I want.
class ageName(object):
    '''class to represent a person'''
    def __init__(self, name, age):
        self.name = name
        self.age = age
worker1 = ageName('jon', 40)
worker2 = ageName('lise', 22)
#Now if we print this you see that it`s stored in a dictionary
print worker1.__dict__
print worker2.__dict__
#
'''
{'age': 40, 'name': 'jon'}
#
{'age': 22, 'name': 'lise'}
#
'''
#
#when we call (key)worker1.name we are getting the (value)
print worker1.name
#
'''
#
jon
#
'''
But I am stuck at loading my CSV data into keys and values.
[1] I want to create my own keys
worker1 = ageName([name],[age],[id],[gender])
[2] each of [name], [age], [id] and [gender] comes from a specific column in a CSV data file
I really do not know how to approach this. I tried many methods but failed. I need some help to get started.
---- Edit
This is my original code
import csv
# let us first make student an object
class Student():
    def __init__(self):
        self.fname = []
        self.lname = []
        self.ID = []
        self.sport = []
        # let us read this file
        for row in list(csv.reader(open("copy-john.csv", "rb")))[1:]:
            self.fname.append(row[0])
            self.lname.append(row[1])
            self.ID.append(row[2])
            self.sport.append(row[3])
    def Tableformat(self):
        print "%-14s|%-10s|%-5s|%-11s" %('First Name','Last Name','ID','Favorite Sport')
        print "-" * 45
        for (i, fname) in enumerate(self.fname):
            print "%-14s|%-10s|%-5s|%3s" %(fname,self.lname[i],self.ID[i],self.sport[i])
    def Table(self):
        print self.lname
class Database(Student):
    def __init__(self):
        g = 0
        choice = ['Basketball','Football','Other','Baseball','Handball','Soccer','Volleyball','I do not like sport']
        data = student.sport
        k = len(student.fname)
        print k
        freq = {}
        for i in data:
            freq[i] = freq.get(i, 0) + 1
        for i in choice:
            if i not in freq:
                freq[i] = 0
            print i, freq[i]
student = Student()
database = Database()
This is my current code (incomplete)
import csv
class Student(object):
    '''class to represent a person'''
    def __init__(self, lname, fname, ID, sport):
        self.lname = lname
        self.fname = fname
        self.ID = ID
        self.sport = sport
reader = csv.reader(open('copy-john.csv'), delimiter=',', quotechar='"')
student = [Student(row[0], row[1], row[2], row[3]) for row in reader][1::]
print "%-14s|%-10s|%-5s|%-11s" %('First Name','Last Name','ID','Favorite Sport')
print "-" * 45
for i in range(len(student)):
    print "%-14s|%-10s|%-5s|%3s" %(student[i].lname,student[i].fname,student[i].ID,student[i].sport)
choice = ['Basketball','Football','Other','Baseball','Handball','Soccer','Volleyball','I do not like sport']
lst = []
h = 0
k = len(student)
# 23
for i in range(len(student)):
    lst.append(student[i].sport) # merge together
for a in set(lst):
    print a, lst.count(a)
for i in set(choice):
    if i not in set(lst):
        lst.append(i)
        lst.count(i) = 0
        print lst.count(i)
import csv
reader = csv.reader(open('workers.csv', newline=''), delimiter=',', quotechar='"')
workers = [ageName(row[0], row[1]) for row in reader]
workers now has a list of all the workers
>>> workers[0].name
'jon'
added edit after question was altered
Is there any reason you're using old style classes? I'm using new style here.
class Student(object):
    sports = []
    def __init__(self, row):
        self.lname, self.fname, self.ID, self.sport = row
        self.sports.append(self.sport)
    def get(self):
        return (self.lname, self.fname, self.ID, self.sport)
reader = csv.reader(open('copy-john.csv'), delimiter=',', quotechar='"')
print "%-14s|%-10s|%-5s|%-11s" % tuple(reader.next()) # read header line from csv
print "-" * 45
students = list(map(Student, reader)) # read all remaining lines
for student in students:
    print "%-14s|%-10s|%-5s|%3s" % student.get()
# Printing all sports that are specified by students
for s in set(Student.sports): # class attribute
    print s, Student.sports.count(s)
# Printing sports that are not picked
allsports = ['Basketball','Football','Other','Baseball','Handball','Soccer','Volleyball','I do not like sport']
for s in set(allsports) - set(Student.sports):
    print s, 0
Hope this gives you some ideas of the power of python sequences. ;)
edit 2, shortened as much as possible... just to show off :P
Ladies and gentlemen, 7(.5) lines.
allsports = ['Basketball','Football','Other','Baseball','Handball',
'Soccer','Volleyball','I do not like sport']
sports = []
reader = csv.reader(open('copy-john.csv'))
for row in reader:
    if reader.line_num > 1: sports.append(row[3])  # line 1 is the header
    print "%-14s|%-10s|%-5s|%-11s" % tuple(row)
for s in allsports: print s, sports.count(s)
I know this is a pretty old question, but it's impossible to read this and not think of the amazing new(ish) Python library, pandas. Its main unit of analysis is a thing called a DataFrame, which is modelled after the way R handles data.
Let's say you have a (very silly) csv file called example.csv which looks like this:
day,fruit,sales
Monday,Banana,10
Monday,Orange,20
Tuesday,Banana,12
Tuesday,Orange,22
If you want to read in a csv in double-quick time, and do 'stuff' with it, you'd be hard pressed to beat the following code for either brevity or ease of use:
>>> import pandas as pd
>>> csv = pd.read_csv('example.csv')
>>> csv
       day   fruit  sales
0   Monday  Banana     10
1   Monday  Orange     20
2  Tuesday  Banana     12
3  Tuesday  Orange     22
>>> csv[csv.fruit=='Banana']
       day   fruit  sales
0   Monday  Banana     10
2  Tuesday  Banana     12
>>> csv[(csv.fruit=='Banana') & (csv.day=='Monday')]
      day   fruit  sales
0  Monday  Banana     10
In my opinion, this is really fantastic stuff. Never iterate over a csv.reader object again!
I second Mark's suggestion. In particular, look at DictReader from the csv module, which lets you read a comma-separated (or, more generally, delimited) file as a sequence of dictionaries.
Look at PyMOTW's coverage of the csv module for a quick reference and examples of DictReader and DictWriter usage.
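To make the DictReader suggestion concrete, here is a rough Python 3 sketch of loading rows into objects with named attributes. The column names (name, age, id, gender) follow the question, but the sample rows, the `Worker` class and the `by_id` lookup are assumptions made up for this example.

```python
import csv
import io

# stand-in for the real file; the header line supplies the dictionary keys
workers_csv = "name,age,id,gender\njon,40,1,m\nlise,22,2,f\n"

class Worker:
    """One CSV row, with each column exposed as an attribute."""
    def __init__(self, name, age, worker_id, gender):
        self.name = name
        self.age = int(age)
        self.worker_id = worker_id
        self.gender = gender

reader = csv.DictReader(io.StringIO(workers_csv))
workers = [
    Worker(row["name"], row["age"], row["id"], row["gender"])
    for row in reader
]
# index the workers by ID so one can be looked up directly
by_id = {w.worker_id: w for w in workers}

print(workers[0].name, workers[0].age)
print(by_id["2"].name)
```

Because DictReader keys each row by the header names, the mapping from column to attribute stays explicit and does not depend on column order.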
Have you looked at the csv module?
import csv