Handling data from csv file with Python

I know Python is almost made for this kind of purpose, but I am really struggling to understand how to access specific values in the dataset. I tried both the pandas and csv modules; it is probably a matter of syntax. Here's the thing: I have a csv file in the form of
Nation, Year, No. of refugees
Afghanistan,2013,6657
Albania,2013,199
Algeria,2013,91
Angola,2013,47
Armenia,2013,156
...
...
Afghanistan,2012,6960
Albania,2012,157
Algeria,2012,67
Angola,2012,43
Armenia,2012,143
...
and so on. What I would like to do is get the total number of refugees per year, i.e. select all the rows with a certain year and sum all the elements in the related "No. of refugees" column. I managed to do this:
import csv
with open('refugees.csv', 'r') as f:
    d_reader = csv.DictReader(f)
    headers = d_reader.fieldnames
    print headers
    #2013
    list2013=[]
    for line in d_reader:
        if (line['Year']=='2013'):
            list2013.append(line['Refugees'])
    list2013=map(int,list2013) #I have str values in my file
    ref13=sum(list2013)
but I am looking for a more elegant (and, above all, iterative) solution. Moreover, if I perform that procedure multiple times for different years, I always get 0: it works for 2013 only, not sure why.
Edit: I tried this as well, without success, but I think this could be totally wrong:
import csv
refugees_dict={}
a=range(2005,2014)
a=map(str, a)
with open('refugees.csv', 'r') as f:
    d_reader = csv.DictReader(f)
    for element in a:
        for line in d_reader:
            if (line['Year']==element):
                print 'hello!'
                temp_list=[]
                temp_list.append(line['Refugees'])
                temp_list=map(int, temp_list)
        refugees_dict[a]=sum(temp_list)
print refugees_dict
The next step of my work will involve further studies on the dataset, eg I am probably gonna need to access data nation-wise instead of year-wise, and I really appreciate any hint so I understand how to manipulate data.
Thanks a lot.

Since you tagged pandas in the question, here's a pandas solution to getting the number of refugees per year.
Let's say my input csv looks like this (note that I've eliminated the extra space before the column names):
Nation,Year,No. of refugees
Afghanistan,2013,6657
Albania,2013,199
Algeria,2013,91
Angola,2013,47
Armenia,2013,156
Afghanistan,2012,6960
Albania,2012,157
Algeria,2012,67
Angola,2012,43
Armenia,2012,143
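You can read that into a pandas DataFrame like this:
import pandas as pd
df = pd.read_csv('data.csv')
You can then get the total like this (selecting the column to sum keeps the text column Nation out of the result):
df.groupby('Year')[['No. of refugees']].sum()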
This gives:
      No. of refugees
Year
2012             7370
2013             7150
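As a sketch of how the same idea extends to the nation-wise totals the question mentions, the snippet below reads the sample rows from an in-memory string (a stand-in for data.csv, used only so the example is self-contained) and groups once by Year and once by Nation.

```python
import io

import pandas as pd

# Sample rows from the answer above; io.StringIO stands in for 'data.csv'.
sample = """Nation,Year,No. of refugees
Afghanistan,2013,6657
Albania,2013,199
Algeria,2013,91
Angola,2013,47
Armenia,2013,156
Afghanistan,2012,6960
Albania,2012,157
Algeria,2012,67
Angola,2012,43
Armenia,2012,143
"""

df = pd.read_csv(io.StringIO(sample))

# Selecting the column before summing returns one total per group.
per_year = df.groupby('Year')['No. of refugees'].sum()
per_nation = df.groupby('Nation')['No. of refugees'].sum()

print(per_year)
print(per_nation)
```

Individual totals can then be looked up by label, for example per_year.loc[2013] or per_nation.loc['Albania'].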

Consider:
from collections import defaultdict
by_year = defaultdict(int) # a dict that returns 0 for any key it has not seen yet.
and then, inside your loop over d_reader:
by_year[line['Year']] += int(line['Refugees'])
Now you can just look at by_year['2013'] and see your sum (same for other years).
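Put together, this might look like the sketch below. It also shows why repeating the loop for another year gave 0 in the question: a csv reader is consumed by the first loop over it, so a second loop sees no rows. Updating every total in one pass avoids the problem. The in-memory sample and the header name Refugees are assumptions (the header is taken from the question's code, not from the file excerpt).

```python
import csv
import io
from collections import defaultdict

# Assumed sample; io.StringIO stands in for open('refugees.csv').
sample = """Nation,Year,Refugees
Afghanistan,2013,6657
Albania,2013,199
Afghanistan,2012,6960
Albania,2012,157
"""

by_year = defaultdict(int)
by_nation = defaultdict(int)

# The reader can only be iterated once, so both totals are
# accumulated in the same pass over the rows.
for line in csv.DictReader(io.StringIO(sample)):
    refugees = int(line['Refugees'])  # csv values are always strings
    by_year[line['Year']] += refugees
    by_nation[line['Nation']] += refugees

print(dict(by_year))
print(dict(by_nation))
```

Note that the keys are strings such as '2013', because the csv module does not convert values.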

To sum by year you can try this:
with open('file.csv') as fh:
    # skip the header row, then split every remaining line into its fields
    f = [i.strip('\n').split(',') for i in fh.readlines()[1:]]
years = {i[1]:0 for i in f}
for i in f:
    years[i[1]] += int(i[-1])
Now, you have a dictionary that has the sum of all the refugees by year.
To access nation-wise:
nations = {i[0]:0 for i in f}
for i in f:
    nations[i[0]] += int(i[-1])

Related

How to split csv data

I have a problem where I got a csv data like this:
AgeGroup  Where do you hear our company from?  How long have you using our platform?
18-24     Word of mouth; Supermarket Product   0-1 years
36-50     Social Media; Word of mouth          1-2 years
18-24     Advertisement                        +4 years
and I am trying to convert the file into this format, either in a Jupyter notebook or in Excel:
AgeGroup  Where do you hear our company from?  How long have you using our platform?
18-24     Word of mouth                        0-1 years
18-24     Supermarket Product                  0-1 years
36-50     Social Media                         1-2 years
36-50     Word of mouth                        1-2 years
18-24     Advertisement                        +4 years
Let's say the csv file is Untitled form.csv and I import the data to jupyter notebook:
data = pd.read_csv('Untitled form.csv')
Can anyone tell me how I should do it?
I have tried doing it in Excel using data-column, but of course that only separates the data into columns, whereas I want the data separated into rows while still retaining the data from the other columns.
Anyway, I found another, more roundabout way to do it. First I edit the file through PowerSource excel and save it to a different file. Then, if a utf-8 encoding error appears, I just add encoding='cp1252'.
So it would become like this:
import pandas as pd
data_split = pd.read_csv('Untitled form split.csv',
                         skipinitialspace=True,
                         usecols=range(1,7),
                         encoding='cp1252')
However if there's a more efficient way, please let me know. Thanks
I'm not 100% sure about your question since I think it might be two separate issues but hopefully this should fix it.
import pandas as pd
data = pd.read_fwf('Untitled form.csv')
cols = data.columns
data_long = pd.DataFrame(columns=data.columns)
for idx, row in data.iterrows():
    hear_from = row['Where do you hear our company from?'].split(';')
    hear_from_fmt = list(map(lambda x: x.strip(), hear_from))
    n_items = len(hear_from_fmt)
    # repeat the first and last column once per answer in this row
    d = {
        cols[0] : [row[cols[0]]]*n_items,
        cols[1] : hear_from_fmt,
        cols[2] : [row[cols[2]]]*n_items,
    }
    data_long = pd.concat([data_long, pd.DataFrame(d)], ignore_index=True)
Let's break it down.
The line data = pd.read_fwf('Untitled form.csv') reads the file, inferring the columns from the spacing between them. This is only useful because I am not sure your file is a proper CSV. If it is, you can open it normally with pd.read_csv; if not, this might help.
Now for the rest. We iterate through each row and select the ways someone could have heard of your company. These are split on ; and then "stripped" to ensure there are no surrounding spaces. A new temporary dataframe is created in which the first and last columns repeat the row's values, with as many rows as there are elements in the hear_from_fmt list. The dataframes are then concatenated together.
There might be a more efficient solution, but this should work.
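One possibly more compact alternative, sketched here under the assumption that pandas 0.25 or later is available (for DataFrame.explode), is to split the cell into a list and let pandas expand each list element into its own row. The DataFrame is built inline from the question's sample rather than read from 'Untitled form.csv', so the example is self-contained.

```python
import pandas as pd

col = 'Where do you hear our company from?'

# Inline copy of the sample from the question (stand-in for the real file).
data = pd.DataFrame({
    'AgeGroup': ['18-24', '36-50', '18-24'],
    col: ['Word of mouth; Supermarket Product',
          'Social Media; Word of mouth',
          'Advertisement'],
    'How long have you using our platform?': ['0-1 years', '1-2 years', '+4 years'],
})

data_long = data.copy()
# Turn each cell into a list of answers ...
data_long[col] = data_long[col].str.split(';')
# ... then give every answer its own row; the other columns are repeated.
data_long = data_long.explode(col).reset_index(drop=True)
# Remove the space left behind after each ';'.
data_long[col] = data_long[col].str.strip()

print(data_long)
```

This avoids the row-by-row loop and the repeated pd.concat calls, which may matter for larger files.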

Working with data from CSV with Python without using Pandas

I am very new to using Python to process data in CSV files. I have a CSV file with the data below. I want to take the averages of the times in each of the Sprint, Jog, and Walk columns by session. The example below has the subject John Doe with Session2 and Session3, whose averages I would like to compute separately and write to a new CSV file. Is there a way, without using pandas but with other modules like csv or NumPy, to gather the data by person (subject) and then by session? I have tried to make a dictionary, but the keys get overwritten. I have also tried using a list, but I cannot figure out how to target the sessions to average them out. Not sure what I am doing wrong. I also tried using DictReader to read the fieldnames and then process the data, but I cannot figure out how to group all the John Doe Session2 data to find the average of the times.
Subject, Session, Course, Size, Category, Sprint, Jog, Walk
John Doe, Session2, 17, 2, Bad, 25s, 36s, 55s
John Doe, Session2, 3, 2, Good, 26s, 35s, 45s
John Doe, Session2, 1, 2, Good, 22s, 31s, 47s
John Doe, Session3, 5, 2, Good, 16s, 32s, 55s
John Doe, Session3, 2, 2, Good, 13s, 24s, 52s
John Doe, Session3, 16, 2, Bad, 15s, 26s, 49s
PS I say no PANDAS because my groupmates are not adding this module since we have so many other dependencies.
Given your input, these built-in Python libraries can generate the output you want:
import csv
from itertools import groupby
from operator import itemgetter
from collections import defaultdict
with open('input.csv','r',newline='') as fin,open('output.csv','w',newline='') as fout:
    # skip needed because sample data had spaces after comma delimiters.
    reader = csv.DictReader(fin,skipinitialspace=True)
    # Output file will have these fieldnames
    writer = csv.DictWriter(fout,fieldnames='Subject Session Sprint Jog Walk'.split())
    writer.writeheader()
    # for each subject/session, groupby returns a 2-tuple of sort key and an
    # iterator over the rows of that key. Data must be sorted by the key already!
    for (subject,session),group in groupby(reader,key=itemgetter('Subject','Session')):
        # built the row to output. defaultdict(int) assumes integer(0) if key doesn't exist.
        row = defaultdict(int)
        row['Subject'] = subject
        row['Session'] = session
        # Count the items for average.
        count = 0
        for item in group:
            count += 1
            # sum the rows, removing the 's'
            for col in ('Sprint','Jog','Walk'):
                row[col] += int(item[col][:-1])
        # produce the average
        for col in ('Sprint','Jog','Walk'):
            row[col] /= count
        writer.writerow(row)
Output:
Subject,Session,Sprint,Jog,Walk
John Doe,Session2,24.333333333333332,34.0,49.0
John Doe,Session3,14.666666666666666,27.333333333333332,52.0
Function links: itemgetter, groupby, defaultdict
If your data is not pre-sorted, you can use the following replacement lines to read in and sort the data by using the same key used in groupby. However, in this implementation the data must be small enough to load it all into memory at once.
sortkey = itemgetter('Subject','Session')
data = sorted(reader,key=sortkey)
for (subject,session),group in groupby(data,key=sortkey):
    ...
As you want the average grouped by subject and session, just compose unique keys out of that information:
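import csv
times = {}
with open('yourfile.csv', 'r', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',', skipinitialspace=True)
    next(reader)  # skip the header row
    for row in reader:
        key = row[0]+row[1]
        # store one [Sprint, Jog, Walk] list per row, with the trailing 's' removed
        times.setdefault(key, []).append([int(entry[:-1]) for entry in row[-3:]])
# average each of the three columns separately for every subject/session key
average = {k: [sum(col)/len(col) for col in zip(*v)] for k, v in times.items()}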
This assumes that the first two entries do have regular structure as in your example and there is no ambiguity when composing the two first entries per row. To be sure one could insert a special delimiter between them in the key.
If you are also the person storing the data: Writing the unit of a column in the column header saves transformation effort later and avoids redundant information storage.
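As a sketch of the unambiguous-key idea, a (subject, session) tuple can serve as the dictionary key instead of a concatenated string. The example below reads the question's sample from an in-memory string (a stand-in for the real file) and does not need the rows to be sorted; the function name average_times and the constant names are made up for this illustration.

```python
import csv
import io
from collections import defaultdict

# Sample from the question; io.StringIO stands in for the real CSV file.
SAMPLE = """Subject, Session, Course, Size, Category, Sprint, Jog, Walk
John Doe, Session2, 17, 2, Bad, 25s, 36s, 55s
John Doe, Session2, 3, 2, Good, 26s, 35s, 45s
John Doe, Session2, 1, 2, Good, 22s, 31s, 47s
John Doe, Session3, 5, 2, Good, 16s, 32s, 55s
John Doe, Session3, 2, 2, Good, 13s, 24s, 52s
John Doe, Session3, 16, 2, Bad, 15s, 26s, 49s
"""
COLUMNS = ('Sprint', 'Jog', 'Walk')


def average_times(lines):
    """Return {(subject, session): {column: average in seconds}}."""
    sums = defaultdict(lambda: dict.fromkeys(COLUMNS, 0))
    counts = defaultdict(int)
    for row in csv.DictReader(lines, skipinitialspace=True):
        # A tuple key keeps subject and session separate, so two
        # different pairs can never collapse into the same key.
        key = (row['Subject'], row['Session'])
        counts[key] += 1
        for col in COLUMNS:
            sums[key][col] += int(row[col].rstrip('s'))
    return {key: {col: total / counts[key] for col, total in cols.items()}
            for key, cols in sums.items()}


averages = average_times(io.StringIO(SAMPLE))
for (subject, session), cols in averages.items():
    print(subject, session, cols)
```

Writing the result to a new file would then be a matter of passing each group to csv.DictWriter, as in the first answer.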

What is the most efficient way to group values in an array based on the month that they occur?

I have some data that I have read in the following way:
filename = 'minamORE.txt'
f1 = open(filename, 'r')
lines = f1.readlines()
mOREt = []
mOREdis = []
import pandas as pd
import numpy as np
data = pd.read_csv('minamORE.txt',sep='\t',header=None,usecols=[2,3])
mOREdate = data[2].values
mOREdis = data[3].values
mOREdis = np.float64(mOREdis)
mOREdate = np.array(mOREdate, dtype = "datetime64")
The date array spans over 20 years and has an entry for each day. I would like to somehow group all of the January measurements with all of the other January measurements, and so on through December.
I'm not experienced enough with python to really think of any solution but to manually do it as follows:
(NOTE: October 1 is the first measurement)
OCTMeasurements = [mOREdis[0,31], mOREdis[0+365, 31+365], ..... [0+20*365, 31+20*365]
For obvious reasons, I'd like to avoid doing this if possible.
The dates are stored in the following format: YYYY-MM-DD.
If I could somehow refer to the values based on the MM value, I feel this would be the most efficient way, but again, inexperience renders me unable to do so.
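One way to refer to values by their MM part, sketched here with NumPy only, is to truncate the datetime64 array to month precision and use the result to build a boolean mask per calendar month. The arrays dates and values below are hypothetical stand-ins for mOREdate and mOREdis (two years of daily data starting on October 1), not the real file contents.

```python
import numpy as np

# Hypothetical stand-ins for mOREdate and mOREdis.
dates = np.arange('2000-10-01', '2002-10-01', dtype='datetime64[D]')
values = np.arange(dates.size, dtype=np.float64)

# At month precision a datetime64 is stored as the number of months
# since 1970-01, so the remainder modulo 12 is the calendar month
# (0 = January); adding 1 gives the usual 1..12 numbering.
months = dates.astype('datetime64[M]').astype(int) % 12 + 1

# One array of measurements per calendar month, across all years.
by_month = {m: values[months == m] for m in range(1, 13)}

print(by_month[10].size)  # number of October measurements
```

Because the mask is derived from the dates themselves, leap years and months of different lengths need no special handling.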

Matching cells in CSV to return calculation

I am trying to create a program that will take the most recent 30 CSV files of data within a folder and calculate totals of certain columns. There are 4 columns of data, with the first column being the identifier and the rest being the data related to the identifier. Here's an example:
file1
Asset X Y Z
12345 250 100 150
23456 225 150 200
34567 300 175 225
file2
Asset X Y Z
12345 270 130 100
23456 235 190 270
34567 390 115 265
I want to be able to match the asset# in both CSVs to return each columns value and then perform calculations on each column. Once I have completed those calculations I intend on graphing various data as well. So far the only thing I have been able to complete is extracting ALL the data from the CSV file using the following code:
import glob
import pandas as pd
csvfile = glob.glob('C:\\Users\\tdjones\\Desktop\\Python Work Files\\FDR*.csv')
listData = []
for files in csvfile:
    df = pd.read_csv(files, index_col=0)
    listData.append(df)
concatenated_data = pd.concat(listData, sort=False)
group = concatenated_data.groupby('ASSET')[['Slip Expense ($)', 'Net Win ($)']].sum()
group.to_csv("C:\\Users\\tdjones\\Desktop\\Python Work Files\\Test\\NewFDRConcat.csv", header=('Slip Expense', 'Net WIn'))
I am very new to Python so any and all direction is welcome. Thank you!
I'd probably also set the asset number as the index while you're reading the data, since this can help with sifting through data. So
rd = pd.read_csv(files, index_col=0)
Then you can do as Alex Yu suggested and just pick all the data from a specific asset number out when you're done using
asset_data = rd.loc[asset_number, column_name]
You'll generally need to format the data in the DataFrame before you append it to the list if you only want specific inputs. Exactly how to do that naturally depends specifically on what you want i.e. what kind of calculations you perform.
If you want a function that just returns all the data for one specific asset, you could do something along the lines of
def get_asset(asset_number):
    csvfile = glob.glob('C:\\Users\\tdjones\\Desktop\\Python Work Files\\*.csv')
    asset_data = []
    for file in csvfile:
        # keep only the lines whose first field is the requested asset number
        data = [line for line in open(file, 'r').read().splitlines()
                if line.split(',')[0] == str(asset_number)]
        for line in data:
            asset_data.append(line.split(','))
    return pd.DataFrame(asset_data, columns=['Asset', 'X', 'Y', 'Z'], dtype=float)
How well the above performs will depend on how large the dataset you're going through is. A method like this needs to search through every line and run several string operations on each one, so it could be problematic if you have millions of lines of data in each file.
Also, the above assumes that all data elements are strings of numbers (so they can be cast to integers or floats). If that's not the case, leave the dtype argument out of the DataFrame definition, but keep in mind that everything returned is then stored as a string.
I suppose you need to add a pandas.concat of your listData to your code.
So it will become:
csvfile = glob.glob('C:\\Users\\tdjones\\Desktop\\Python Work Files\\*.csv')
listData = []
for files in csvfile:
    rd = pd.read_csv(files)
    listData.append(rd)
concatenated_data = pd.concat(listData)
After that you can use aggregate functions on this concatenated_data DataFrame, such as concatenated_data['A'].max(), concatenated_data['A'].count(), groupby, etc.
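To illustrate the matching step, here is a minimal sketch that uses the two small sample files from the question, built inline as DataFrames instead of being read with glob (the names file1 and file2 are stand-ins). After concatenation, grouping by the asset column lines up the rows for each asset across files, and any aggregate can be applied per column.

```python
import pandas as pd

# Inline copies of the question's file1 and file2 samples.
file1 = pd.DataFrame({'Asset': [12345, 23456, 34567],
                      'X': [250, 225, 300],
                      'Y': [100, 150, 175],
                      'Z': [150, 200, 225]})
file2 = pd.DataFrame({'Asset': [12345, 23456, 34567],
                      'X': [270, 235, 390],
                      'Y': [130, 190, 115],
                      'Z': [100, 270, 265]})

concatenated_data = pd.concat([file1, file2], ignore_index=True)

# Rows with the same asset number end up in the same group,
# regardless of which file they came from.
totals = concatenated_data.groupby('Asset')[['X', 'Y', 'Z']].sum()
means = concatenated_data.groupby('Asset')[['X', 'Y', 'Z']].mean()

print(totals)
print(means)
```

The resulting DataFrames are indexed by asset number, so totals.loc[12345, 'X'] returns the summed X value for that asset.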

Python data wrangling issues

I'm currently stumped by some basic issues with a small data set. Here are the first three lines to illustrate the format of the data:
"Sport","Entry","Contest_Date_EST","Place","Points","Winnings_Non_Ticket","Winnings_Ticket","Contest_Entries","Entry_Fee","Prize_Pool","Places_Paid"
"NBA","NBA 3K Crossover #3 [3,000 Guaranteed] (Early Only) (1/15)","2015-03-01 13:00:00",35,283.25,"13.33","0.00",171,"20.00","3,000.00",35
"NBA","NBA 1,500 Layup #4 [1,500 Guaranteed] (Early Only) (1/25)","2015-03-01 13:00:00",148,283.25,"3.00","0.00",862,"2.00","1,500.00",200
The issues I am having after using read_csv to create a DataFrame:
The presence of commas in certain categorical values (such as Prize_Pool) results in python considering these entries as strings. I need to convert these to floats in order to make certain calculations. I've used python's replace() function to get rid of the commas, but that's as far as I've gotten.
The category Contest_Date_EST contains timestamps, but some are repeated. I'd like to subset the entire dataset into one that has only unique timestamps. It would be nice to have a choice in which repeated entry or entries are removed, but at the moment I'd just like to be able to filter the data with unique timestamps.
Use thousands=',' argument for numbers that contain a comma
In [1]: from pandas import read_csv
In [2]: d = read_csv('data.csv', thousands=',')
You can check Prize_Pool is numerical
In [3]: type(d.loc[0, 'Prize_Pool'])
Out[3]: numpy.float64
To drop rows with a repeated timestamp, keeping the first one observed (use keep='last' to keep the last instead)
In [7]: d.drop_duplicates('Contest_Date_EST', keep='first')
Out[7]:
Sport Entry \
0 NBA NBA 3K Crossover #3 [3,000 Guaranteed] (Early ...
Contest_Date_EST Place Points Winnings_Non_Ticket Winnings_Ticket \
0 2015-03-01 13:00:00 35 283.25 13.33 0
Contest_Entries Entry_Fee Prize_Pool Places_Paid
0 171 20 3000 35
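The two steps can be tried end to end with the sketch below, which reads the three sample lines from the question out of an in-memory string (a stand-in for data.csv). Parsing the timestamp column with parse_dates is an addition made here for convenience and is not required for drop_duplicates to work.

```python
import io

import pandas as pd

# The header and two data rows quoted in the question.
sample = '''"Sport","Entry","Contest_Date_EST","Place","Points","Winnings_Non_Ticket","Winnings_Ticket","Contest_Entries","Entry_Fee","Prize_Pool","Places_Paid"
"NBA","NBA 3K Crossover #3 [3,000 Guaranteed] (Early Only) (1/15)","2015-03-01 13:00:00",35,283.25,"13.33","0.00",171,"20.00","3,000.00",35
"NBA","NBA 1,500 Layup #4 [1,500 Guaranteed] (Early Only) (1/25)","2015-03-01 13:00:00",148,283.25,"3.00","0.00",862,"2.00","1,500.00",200
'''

# thousands=',' lets values such as "3,000.00" be parsed as numbers.
d = pd.read_csv(io.StringIO(sample), thousands=',',
                parse_dates=['Contest_Date_EST'])

# keep chooses which of the rows sharing a timestamp survives.
first = d.drop_duplicates(subset='Contest_Date_EST', keep='first')
last = d.drop_duplicates(subset='Contest_Date_EST', keep='last')

print(d['Prize_Pool'].tolist())
print(first['Place'].tolist(), last['Place'].tolist())
```

Passing keep=False instead would drop every row whose timestamp occurs more than once.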
Edit: Just realized you're using pandas - should have looked at that.
I'll leave this here for now in case it's applicable but if it gets
downvoted I'll take it down by virtue of peer pressure :)
I'll try and update it to use pandas later tonight
Seems like itertools.groupby() is the tool for this job;
Something like this?
import csv
import itertools
class CsvImport():
    def Run(self, filename):
        # Get the formatted rows from CSV file
        rows = self.readCsv(filename)
        for key in rows.keys():
            print("\nKey: " + key)
            i = 1
            for value in rows[key]:
                print("\nValue {index} : {value}".format(index = i, value = value))
                i += 1
    def readCsv(self, fileName):
        with open(fileName, 'r', newline='') as csvfile:
            reader = csv.DictReader(csvfile)
            # Keys may or may not be pulled in with extra space by DictReader()
            # The next line simply creates a small dict of stripped keys to original padded keys
            keys = { key.strip(): key for (key) in reader.fieldnames }
            # Format each row into the final string
            # (groupby only merges consecutive rows that share a timestamp)
            groupedRows = {}
            for k, g in itertools.groupby(reader, lambda x : x["Contest_Date_EST"]):
                groupedRows[k] = [self.normalizeRow(v) for v in g]
        return groupedRows
    def normalizeRow(self, row):
        # look the column up by name rather than by position
        row["Prize_Pool"] = float(row["Prize_Pool"].replace(',',''))
        # and so on
        return row
if __name__ == "__main__":
    CsvImport().Run("./Test1.csv")
More info:
https://docs.python.org/2/library/itertools.html
Hope this helps :)
