find average value from CSV columns that contain a specific character - python

I am trying to write a simple Python function that reads a CSV file and finds the average for some columns and rows.
The function will examine the first row, and for each column whose header
starts with the letter 'Q' it will calculate the average of the values in
that column and print it to the screen. Then, for each row of the
data, it will calculate the student's average over all items in columns
that start with 'Q'. It will calculate this average normally and also
with the lowest quiz dropped, printing two values per student.
The CSV file contains grades for students and looks like this:
hw1 hw2 Quiz3 hw4 Quiz2 Quiz1
john 87 98 76 67 90 56
marie 45 67 65 98 78 67
paul 54 64 93 28 83 98
fred 67 87 45 98 56 87
The code I have so far is this, but I have no idea how to continue:
import csv
def practice():
    newlist=[]
    afile= input('enter file name')
    a = open(afile, 'r')
    reader = csv.reader(a, delimiter = ",")
    for each in reader:
        newlist.append(each)
    y=sum(int(x[2] for x in reader))
    print (y)
    filtered = []
    total = 0
    for i in range (0,len(newlist)):
        if 'Q' in [i][1]:
            filtered.append(newlist[i])
    return filtered

May I suggest the use of Pandas:
>>> import pandas as pd
>>> data = pd.read_csv('file.csv', sep=r'\s+')
>>> q_columns = [name for name in data.columns if name.startswith('Q')]
>>> reduced_data = data[q_columns].copy()
>>> reduced_data.mean()
Quiz3 69.75
Quiz2 76.75
Quiz1 77.00
dtype: float64
>>> reduced_data.mean(axis=1)
john 74.000000
marie 70.000000
paul 91.333333
fred 62.666667
dtype: float64
>>> import numpy as np
>>> reduced_data = reduced_data.astype(float)  # NaN needs a float dtype
>>> for index, column in reduced_data.idxmin(axis=1).items():
...     reduced_data.loc[index, column] = np.nan
>>> reduced_data.mean(axis=1)
john 83.0
marie 72.5
paul 95.5
fred 71.5
dtype: float64
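The "drop the lowest quiz" step can also be computed without overwriting any values. The following is only a sketch, assuming the whitespace-separated layout from the question; the in-memory TEXT string and the names quizzes and dropped are illustrative and not part of the original answer.

```python
import io

import pandas as pd

# In-memory stand-in for the whitespace-separated file from the question.
TEXT = """hw1 hw2 Quiz3 hw4 Quiz2 Quiz1
john 87 98 76 67 90 56
marie 45 67 65 98 78 67
paul 54 64 93 28 83 98
fred 67 87 45 98 56 87
"""

# The rows have one more field than the header, so pandas uses the
# first field (the student name) as the index.
data = pd.read_csv(io.StringIO(TEXT), sep=r"\s+")

# Keep only the columns whose header starts with 'Q'.
quizzes = data.filter(regex=r"^Q")

# Average with the lowest quiz dropped: (row sum - row minimum) / (n - 1).
dropped = (quizzes.sum(axis=1) - quizzes.min(axis=1)) / (quizzes.shape[1] - 1)
print(dropped)
```

Because nothing is replaced with NaN, the quizzes frame stays intact for the plain averages as well.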

Your code would be cleaner if you changed your .csv format to comma-separated values with a header for the name column. Then we can use DictReader easily.
grades.csv:
name,hw1,hw2,Quiz3,hw4,Quiz2,Quiz1
john,87,98,76,67,90,56
marie,45,67,65,98,78,67
paul,54,64,93,28,83,98
fred,67,87,45,98,56,87
Code:
import numpy as np
from collections import defaultdict
import csv
result = defaultdict(list)
with open('grades.csv', 'r', newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        for k in row:
            if k.startswith('Q'):
                # collect every quiz score under the student's name
                result[row['name']].append(int(row[k]))
for name, lst in result.items():
    # sorted(lst)[1:] drops the lowest quiz before averaging
    print(name, np.mean(sorted(lst)[1:]))
Output:
john 83.0
marie 72.5
paul 95.5
fred 71.5
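The question asks for three numbers: the average of each 'Q' column, each student's plain average, and the average with the lowest quiz dropped. A standard-library-only sketch covering all three could look like the following. It assumes the comma-separated grades.csv layout shown above; the function name quiz_report and the in-memory GRADES string are made up for illustration.

```python
import csv
import io
from statistics import mean

# In-memory stand-in for grades.csv (comma-separated, with a "name" column).
GRADES = """name,hw1,hw2,Quiz3,hw4,Quiz2,Quiz1
john,87,98,76,67,90,56
marie,45,67,65,98,78,67
paul,54,64,93,28,83,98
fred,67,87,45,98,56,87
"""


def quiz_report(csvfile):
    """Return (column_averages, student_averages) for columns starting with 'Q'."""
    reader = csv.DictReader(csvfile)
    quiz_cols = [c for c in reader.fieldnames if c.startswith("Q")]
    rows = list(reader)
    # Average of every quiz column over all students.
    column_avg = {c: mean(int(r[c]) for r in rows) for c in quiz_cols}
    student_avg = {}
    for r in rows:
        scores = sorted(int(r[c]) for c in quiz_cols)
        # (plain average, average with the lowest quiz dropped)
        student_avg[r["name"]] = (mean(scores), mean(scores[1:]))
    return column_avg, student_avg


column_avg, student_avg = quiz_report(io.StringIO(GRADES))
for col, value in column_avg.items():
    print(col, value)
for name, (plain, dropped) in student_avg.items():
    print(name, plain, dropped)
```

To read a real file instead, pass an open file object, for example `with open('grades.csv', newline='') as f: quiz_report(f)`.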

Related

Reading file with delimiter using pandas

I have data in a file, and I don't know whether it is delimited by spaces or tabs.
Data In:
id Name year Age Score
123456 ALEX BROWNNIS VND 0 19 115
123457 MARIA BROWNNIS VND 0 57 170
123458 jORDAN BROWNNIS VND 0 27 191
I read the data with read_csv, using tab as the delimiter:
df = pd.read_csv('data.txt', sep='\t')
out:
id Name year Age Score
0 123456 ALEX BROWNNIS VND ... 0 19 115
1 123457 MARIA BROWNNIS VND ... 0 57 170
2 123458 jORDAN BROWNNIS VND ... 0 27 191
There is a lot of whitespace between the columns. Am I using the delimiter correctly? When I try to access a column by name I get a KeyError, so I think the fault is in the use of \t.
What are the possible ways to fix this problem?
Since the Name column contains a variable number of words, splitting on whitespace gives a different number of fields per line. Read it as a regular file instead and join every word between the id and the last three fields into Name.
import pandas as pd
rows = []
with open('data.txt') as f:
    next(f, None)  # skip the header line
    for line in f:
        words = line.split()
        if len(words) < 5:
            continue  # skip blank or incomplete lines
        # the first word is the id, the last three are year, Age and Score,
        # and everything in between is the Name
        rows.append([words[0], ' '.join(words[1:-3])] + words[-3:])
df = pd.DataFrame(rows, columns=['id', 'Name', 'year', 'Age', 'Score'])
Edit: now that you've posted the full data, I've changed my answer to fit it.
You can use the skipinitialspace parameter, as in the following example. Pass either sep or delimiter, not both, because they are aliases of each other.
df2 = pd.read_csv('data.txt', sep='\t', encoding="utf-8", skipinitialspace=True)
Pandas documentation: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
Problem Solved:
df = pd.read_csv('data.txt', sep='\t',engine="python")
I then added this line to strip the whitespace around the column names, and it works:
df.columns = df.columns.str.strip()
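To see why stripping the column names cures the KeyError, here is a small sketch. The sample text is made up: it assumes a tab-separated file whose header cells carry stray padding spaces, which may or may not match the real data.txt.

```python
import io

import pandas as pd

# Made-up tab-separated sample; note the padding spaces around some header names.
TEXT = (
    "id \t Name\tyear \tAge\t Score\n"
    "123456\tALEX BROWNNIS VND\t0\t19\t115\n"
    "123457\tMARIA BROWNNIS VND\t0\t57\t170\n"
)

df = pd.read_csv(io.StringIO(TEXT), sep="\t")
raw_columns = list(df.columns)  # padded names, so df["Score"] would raise KeyError

# Strip the whitespace from every column name.
df.columns = df.columns.str.strip()
clean_columns = list(df.columns)

print(raw_columns)
print(clean_columns)
print(df["Score"].tolist())
```

Printing `list(df.columns)` before stripping is a quick way to check whether padded names are the cause of a KeyError.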

How do I work with the results of pytrends?

I'm new to Python and ran into a problem using pytrends. I'm trying to compare 5 search terms and store the sum in a CSV.
The problem I'm having right now is I can't seem to isolate an individual element returned. I have the data, I can see it, but I can't seem to isolate an element to be able to do anything meaningful with it.
I found elsewhere a suggestion to use iloc, but that doesn't return anything for what's shown, and if I pass only one parameter it seems to display everything.
It feels really dumb, but I just can't figure this out, nor can I find anything online.
from pytrends.request import TrendReq
import csv
import pandas
import numpy
import time
# Login to Google. Only need to run this once, the rest of requests will use the same session.
pytrend = TrendReq(hl='en-US', tz=360)
with open('database.csv',"r") as f:
    reader = csv.reader(f,delimiter = ",")
    data = list(reader)
    row_count = len(data)
comparator_string = data[1][0] + " opening"
print("comparator: ",comparator_string,"\n")
#Initialize search term list including comparator_string as the first item, plus 4 search terms
kw_list=[]
kw_list.append(comparator_string)
for x in range(1, 5, 1):
    search_string = data[x][0] + " opening"
    kw_list.append(search_string)
# Create payload and capture API tokens. Only needed for interest_over_time(), interest_by_region() & related_queries()
pytrend.build_payload(kw_list, cat=0, timeframe='today 3-m',geo='',gprop='')
# Interest Over Time
interest_over_time_df = pytrend.interest_over_time()
#time.sleep(randint(5, 10))
#printer = interest_over_time_df.sum()
printer = interest_over_time_df.iloc[1,1]
print("printer: \n",printer)
pytrends returns pandas.DataFrame objects, and there are a number of ways to index and select data in them.
Take the following bit of code, for example:
kw_list = ['apples', 'oranges', 'bananas']
pytrend.build_payload(kw_list)
interest_over_time_df = pytrend.interest_over_time()
If you run print(interest_over_time_df) you will see something like this:
apples oranges bananas isPartial
date
2017-10-23 77 15 43 False
2017-10-24 77 15 46 False
2017-10-25 78 14 41 False
2017-10-26 78 14 43 False
2017-10-27 81 17 42 False
2017-10-28 91 17 39 False
...
You'll see an index column, date, on the left, as well as the four data columns apples, oranges, bananas, and isPartial. You can ignore isPartial for now: it flags data points that are still incomplete for that particular date.
At this point you can select data by column, by columns + index, etc.:
>>> interest_over_time_df['apples']
date
2017-10-23 77
2017-10-24 77
2017-10-25 78
2017-10-26 78
2017-10-27 81
>>> interest_over_time_df['apples']['2017-10-26']
78
>>> interest_over_time_df.iloc[4] # Row at position 4 (zero-based, so the fifth row)
apples 81
oranges 17
bananas 42
isPartial False
Name: 2017-10-27 00:00:00, dtype: object
>>> interest_over_time_df.iloc[4, 0] # Row at position 4, column at position 0
81
You may also be interested in pandas.DataFrame.loc, which selects rows by label, as opposed to pandas.DataFrame.iloc, which selects rows by integer position:
>>> interest_over_time_df.loc['2017-10-26']
apples 78
oranges 14
bananas 43
isPartial False
Name: 2017-10-26 00:00:00, dtype: object
>>> interest_over_time_df.loc['2017-10-26', 'apples']
78
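Since the question is about pulling out single values and storing sums, here is a sketch that rebuilds the frame printed above by hand, so it runs without contacting Google. The variable names are illustrative, and a real frame from pytrends will hold different values.

```python
import pandas as pd

# Hand-built stand-in for the frame shown above (no request is made).
dates = pd.to_datetime(
    ["2017-10-23", "2017-10-24", "2017-10-25",
     "2017-10-26", "2017-10-27", "2017-10-28"]
)
df = pd.DataFrame(
    {
        "apples": [77, 77, 78, 78, 81, 91],
        "oranges": [15, 15, 14, 14, 17, 17],
        "bananas": [43, 46, 41, 43, 42, 39],
        "isPartial": [False] * 6,
    },
    index=dates,
)
df.index.name = "date"

# One element by integer position: row position 4, column position 0.
by_position = df.iloc[4, 0]

# One element by label: the row for a given date, the "apples" column.
by_label = df.loc[pd.Timestamp("2017-10-26"), "apples"]

# Sum of every search term, leaving out the boolean isPartial column.
totals = df.drop(columns="isPartial").sum()

print(by_position, by_label)
print(totals)
```

The totals object is a pandas Series, so something like `totals.to_csv('totals.csv')` (file name chosen only for illustration) would store the sums in a CSV.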
Hope that helps.

Summating CSV rows in Python

I have a csv file with data like this:
Name Value Value2 Value3 Rating
ddf 34 45 46 ok
ddf 67 23 11 ok
ghd 23 11 78 bad
ghd 56 33 78 bad
.....
What I want to do is loop through my CSV and add together the rows that have the same name. The string at the end of each row will always remain the same for that name, so there is no fear of it changing. How would I go about changing it to this in Python?
Name Value Value2 Value3 Rating
ddf 101 68 57 ok
ghd 79 44 156 bad
EDIT:
In my code, the first thing I did was sort the list so that rows with the same name would be next to each other. Then I tried to use a for loop to add the numeric fields together by checking whether the name in the first column is the same. It's a very ugly way of doing it and I am at my wits' end.
sortedList = csv.reader(open("keywordReport.csv"))
editedFile = open("output.csv",'w')
wr = csv.writer(editedFile, delimiter = ',')
name = ""
sortedList = sorted(sortedList, key=operator.itemgetter(0), reverse=True)
newKeyword = ["","","","","",""]
for row in sortedList:
    if row[0] != name:
        wr.writerow(newKeyword)
        name = row[0]
    else:
        newKeyword[0] = row[0] #Name
        newKeyword[1] = str(float(newKeyword[1]) + float(row[1]))
        newKeyword[2] = str(float(newKeyword[2]) + float(row[2]))
        newKeyword[3] = str(float(newKeyword[3]) + float(row[3]))
The pandas way is very simple:
import pandas as pd
aframe = pd.read_csv('thefile.csv')
Out[19]:
Name Value Value2 Value3 Rating
0 ddf 34 45 46 ok
1 ddf 67 23 11 ok
2 ghd 23 11 78 bad
3 ghd 56 33 78 bad
r = aframe.groupby(['Name','Rating'],as_index=False).sum()
Out[40]:
Name Rating Value Value2 Value3
0 ddf ok 101 68 57
1 ghd bad 79 44 156
If you need to do further analysis and statistics, pandas will take you a long way with little effort. For this use case it is like using a hammer to kill a fly, but I wanted to provide this alternative.
file.csv
Name,Value,Value2,Value3,Rating
ddf,34,45,46,ok
ddf,67,23,11,ok
ghd,23,11,78,bad
ghd,56,33,78,bad
code
import csv
def map_csv_rows(f):
    # turn each data row into a dict keyed by the header,
    # converting purely numeric fields to int
    c = [x for x in csv.reader(f)]
    return [dict(zip(c[0], map(lambda p: int(p) if p.isdigit() else p, x))) for x in c[1:]]
# Python 3: open csv files in text mode with newline=''
with open('file.csv', newline='') as f:
    my_csv = map_csv_rows(f)
output = {}
for row in my_csv:
    # the first row seen for a name creates its zeroed accumulator
    output.setdefault(row.get('Name'), {'Name': row.get('Name'), 'Value': 0,'Value2': 0, 'Value3': 0, 'Rating': row.get('Rating')})
    for val in ['Value', 'Value2', 'Value3']:
        output[row.get('Name')][val] = output[row.get('Name')][val] + row.get(val)
with open('out.csv', 'w', newline='') as f:
    fieldnames = ['Name', 'Value', 'Value2', 'Value3', 'Rating']
    writer = csv.DictWriter(f, fieldnames = fieldnames)
    writer.writeheader()
    for out in output.values():
        writer.writerow(out)
For comparison purposes, here is an equivalent awk program:
$ awk -v OFS="\t" '
NR==1{$1=$1;print;next}
{k=$1;a[k]+=$2;b[k]+=$3;c[k]+=$4;d[k]=$5}
END{for(i in a) print i,a[i],b[i],c[i],d[i]}' input
will print
Name Value Value2 Value3 Rating
ddf 101 68 57 ok
ghd 79 44 156 bad
If the input is CSV and you want CSV output, add the -F, argument and change OFS to a comma (OFS=,).

How to get average from two columns in csv file

I have a csv file with 2 columns
rw1, 24
rw2, 34
rw3, 56
rw1, 78
rw2, 56
rw2, 45
rw2, 64
rw3, 32
rw1, 28
Now I want an average.py file that calculates the average of all rw1, rw2 and rw3 values respectively and writes the result to an average.txt file:
rw1 - average value,
rw2 - average value,
rw3 - average value
With pandas, it's kind of short :
import pandas as pd
df = pd.read_csv(file, header=None)
In [1]: df
Out[1]:
0 1
0 rw1 24
1 rw2 34
2 rw3 56
3 rw1 78
4 rw2 56
5 rw2 45
6 rw2 64
7 rw3 32
8 rw1 28
In [2]: df.groupby(df[0]).mean() # it groups on the column "0", and calculates the mean on the different group
Out[2]:
1
0
rw1 43.333333
rw2 49.750000
rw3 44.000000
Hope this helps !
Read the CSV rows and convert them to tuples, then sort them so that itertools.groupby can group them:
import itertools
import csv
fileLocation = 'newslot.csv'
with open(fileLocation, newline='') as f:
    r = csv.reader(f)
    # groupby only merges adjacent rows, so sort by key first
    lis = sorted([(i[0], i[1]) for i in r])
for k, g in itertools.groupby(lis, key=lambda x: x[0]):
    g = list(g)
    print(k, sum(int(i[1]) for i in g) / len(g))
from itertools import groupby
from operator import itemgetter
import csv
def avg(lst):
    # lst must be a real list: len() does not work on a map object in Python 3
    return sum(map(float, lst)) / len(lst)
def avgcsv(filename, k=0, v=1):
    with open(filename, newline='') as f:
        data = sorted(csv.reader(f, skipinitialspace=True), key=itemgetter(k))
    return ['%s - %g' % (name, avg([row[v] for row in group]))
            for name, group in groupby(data, key=itemgetter(k))]
with open('average.txt', 'w') as f:
    f.write(',\n'.join(avgcsv('filename', 0, 1)))
Output
rw1 - 43.3333,
rw2 - 49.75,
rw3 - 44
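Both itertools.groupby answers sort first because groupby only merges adjacent rows. A sort-free variant keeps a running sum and count per key in a dictionary. This is only a sketch: the function name averages and the in-memory SAMPLE string are illustrative, and a real script would open the CSV file instead.

```python
import csv
import io
from collections import defaultdict

# In-memory stand-in for the two-column file from the question.
SAMPLE = (
    "rw1, 24\nrw2, 34\nrw3, 56\nrw1, 78\nrw2, 56\n"
    "rw2, 45\nrw2, 64\nrw3, 32\nrw1, 28\n"
)


def averages(csvfile):
    """Return {key: average} using one pass and no sorting."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [running sum, count]
    for row in csv.reader(csvfile, skipinitialspace=True):
        if not row:
            continue  # ignore blank lines
        totals[row[0]][0] += float(row[1])
        totals[row[0]][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


result = averages(io.StringIO(SAMPLE))
# Same "name - value" layout as the output shown above.
report = ",\n".join("%s - %g" % (key, value) for key, value in sorted(result.items()))
print(report)
```

Writing `report` to average.txt with `open('average.txt', 'w')` gives the file the question asks for.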

Python: How to write values to a csv file from another csv file

In index.csv, the fourth column holds ten numbers ranging from 1 to 5. Each number can be regarded as an index, and each index corresponds to an array of numbers in filename.csv.
The row number in filename.csv represents the index, and each row has three numbers. My question is about using a nested loop to transfer the numbers in filename.csv to index.csv.
from numpy import genfromtxt
import numpy as np
import csv
import collections
data1 = genfromtxt('filename.csv', delimiter=',')
data2 = genfromtxt('index.csv', delimiter=',')
out = np.zeros((len(data2),len(data1)))
for row in data2:
    for ch_row in range(len(data1)):
        if (row[3] == ch_row + 1):
            out = row.tolist() + data1[ch_row].tolist()
            print(out)
writer = csv.writer(open('dn.csv','w'), delimiter=',',quoting=csv.QUOTE_ALL)
writer.writerow(out)
For example, the fourth column of index.csv contains 1,2,5,3,4,1,4,5,2,3 and filename.csv contains:
# filename.csv
20 30 50
70 60 45
35 26 77
93 37 68
13 08 55
What I need is to write the indexed row from filename.csv to index.csv and store these number in 5th, 6th and 7th column:
# index.csv
# 4 5 6 7
... 1 20 30 50
... 2 70 60 45
... 5 13 08 55
... 3 35 26 77
... 4 93 37 68
... 1 20 30 50
... 4 93 37 68
... 5 13 08 55
... 2 70 60 45
... 3 35 26 77
If I call print(out) inside the loop, the output is correct. However, when I type out in the shell afterwards, only one row appears, such as [1.0, 1.0, 1.0, 1.0, 20.0, 30.0, 50.0].
What I need is to store all the values in the out variable and write them to the dn.csv file.
This ought to do the trick for you:
Code:
from csv import reader, writer
data = list(reader(open("filename.csv", "r"), delimiter=" "))
out = writer(open("output.csv", "w"), delimiter=" ")
for row in reader(open("index.csv", "r"), delimiter=" "):
    out.writerow(row + data[int(row[3])])
index.csv:
0 0 0 1
0 0 0 2
0 0 0 3
filename.csv:
20 30 50
70 60 45
35 26 77
93 37 68
13 08 55
This produces the output:
0 0 0 1 70 60 45
0 0 0 2 35 26 77
0 0 0 3 93 37 68
Note: There's no need to use numpy here. The standard library csv module will do most of the work for you.
I also had to modify your sample datasets a bit, because the indexes you showed were out of bounds for the sample data in filename.csv.
Please also note that Python (like most languages) uses zero-based indexing, so you may have to adjust the above code to fit your needs exactly.
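If the fourth column of index.csv counts from 1, as in the question, the adjustment is to subtract one before the lookup. The sketch below uses in-memory stand-ins for both files so it can be run as is. The first three columns of index.csv are not shown in the question, so the zeros here are placeholders.

```python
import csv
import io

# Stand-ins for the two files. The leading zeros in index_csv are placeholders.
filename_csv = "20 30 50\n70 60 45\n35 26 77\n93 37 68\n13 08 55\n"
index_csv = "0 0 0 1\n0 0 0 2\n0 0 0 5\n"

data = list(csv.reader(io.StringIO(filename_csv), delimiter=" "))

out = io.StringIO()
writer = csv.writer(out, delimiter=" ", lineterminator="\n")
for row in csv.reader(io.StringIO(index_csv), delimiter=" "):
    # The index column counts from 1 and Python lists from 0, hence the "- 1".
    writer.writerow(row + data[int(row[3]) - 1])

result = out.getvalue()
print(result)
```

With real files, replace the io.StringIO objects with files opened using `newline=''`.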
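# data1 and data2 are the arrays loaded with genfromtxt in the question
with open('dn.csv', 'w', newline='') as f:
    writer = csv.writer(f, delimiter=',', quoting=csv.QUOTE_ALL)
    for row in data2:
        idx = int(row[3])  # genfromtxt returns floats, but an index must be an int
        # index.csv counts from 1, so idx - 1 is the matching row of data1
        out = [idx] + list(data1[idx - 1])
        writer.writerow(out)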
