I have a csv file with data like this:
Name Value Value2 Value3 Rating
ddf 34 45 46 ok
ddf 67 23 11 ok
ghd 23 11 78 bad
ghd 56 33 78 bad
.....
What I want to do is loop through my CSV and add together the rows that have the same name; the string at the end of each row will always be the same for a given name, so there is no fear of it changing. How would I go about changing it to this in Python?
Name Value Value2 Value3 Rating
ddf 101 68 57 ok
ghd 79 44 156 bad
EDIT:
In my code, the first thing I did was sort the list so that rows with the same name would be next to each other; then I tried to use a for loop to add the numeric columns together by checking whether the name in the first column is the same. It's a very ugly way of doing it and I am at my wits' end.
import csv
import operator

sortedList = csv.reader(open("keywordReport.csv"))
editedFile = open("output.csv", 'w')
wr = csv.writer(editedFile, delimiter=',')
name = ""
# sort so rows with the same name are adjacent
sortedList = sorted(sortedList, key=operator.itemgetter(0), reverse=True)
newKeyword = ["", "", "", "", "", ""]
for row in sortedList:
    if row[0] != name:
        wr.writerow(newKeyword)
        name = row[0]
    else:
        newKeyword[0] = row[0]  # Name
        newKeyword[1] = str(float(newKeyword[1]) + float(row[1]))
        newKeyword[2] = str(float(newKeyword[2]) + float(row[2]))
        newKeyword[3] = str(float(newKeyword[3]) + float(row[3]))
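For reference, the same job can be done in plain Python without sorting at all, by accumulating sums in a dictionary keyed by name (a minimal sketch, assuming a comma-delimited file with a header row and reusing the file names from the snippet above):

import csv

totals = {}  # name -> [name, value, value2, value3, rating]
with open("keywordReport.csv", newline='') as f:
    reader = csv.reader(f)
    header = next(reader)  # keep the header for the output file
    for name, v1, v2, v3, rating in reader:
        if name not in totals:
            totals[name] = [name, 0.0, 0.0, 0.0, rating]
        totals[name][1] += float(v1)
        totals[name][2] += float(v2)
        totals[name][3] += float(v3)

with open("output.csv", 'w', newline='') as f:
    wr = csv.writer(f)
    wr.writerow(header)
    wr.writerows(totals.values())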
The pandas way is very simple:
import pandas as pd
aframe = pd.read_csv('thefile.csv')
aframe
Out[19]:
  Name  Value  Value2  Value3 Rating
0  ddf     34      45      46     ok
1  ddf     67      23      11     ok
2  ghd     23      11      78    bad
3  ghd     56      33      78    bad
r = aframe.groupby(['Name', 'Rating'], as_index=False).sum()
r
Out[40]:
  Name Rating  Value  Value2  Value3
0  ddf     ok    101      68      57
1  ghd    bad     79      44     156
If you need to do further analysis and statistics, pandas will take you a long way with little effort. For the use case here it is like using a hammer to kill a fly, but I wanted to provide this alternative.
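To write the aggregated frame back out to disk as a CSV like the original, one more line does it (index=False keeps pandas' row numbers out of the file):

r.to_csv('output.csv', index=False)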
file.csv
Name,Value,Value2,Value3,Rating
ddf,34,45,46,ok
ddf,67,23,11,ok
ghd,23,11,78,bad
ghd,56,33,78,bad
code
import csv

def map_csv_rows(f):
    # read every row, then turn each data row into a dict keyed by the
    # header row, converting digit-only fields to int
    c = [x for x in csv.reader(f)]
    return [dict(zip(c[0], map(lambda p: int(p) if p.isdigit() else p, x))) for x in c[1:]]

my_csv = map_csv_rows(open('file.csv', 'rb'))

output = {}
for row in my_csv:
    # create the accumulator for this Name on first sight, keeping its Rating
    output.setdefault(row.get('Name'), {'Name': row.get('Name'), 'Value': 0, 'Value2': 0, 'Value3': 0, 'Rating': row.get('Rating')})
    for val in ['Value', 'Value2', 'Value3']:
        output[row.get('Name')][val] = output[row.get('Name')][val] + row.get(val)

with open('out.csv', 'wb') as f:
    fieldnames = ['Name', 'Value', 'Value2', 'Value3', 'Rating']
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    for out in output.values():
        writer.writerow(out)
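Note that the 'rb'/'wb' file modes are Python 2 style; on Python 3 the csv module wants text-mode files opened with newline='' instead (a minimal adjustment of the same code):

# Python 3 variant of the I/O above
my_csv = map_csv_rows(open('file.csv', newline=''))

with open('out.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['Name', 'Value', 'Value2', 'Value3', 'Rating'])
    writer.writeheader()
    writer.writerows(output.values())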
For comparison purposes, here is an equivalent awk program:
$ awk -v OFS="\t" '
NR==1{$1=$1;print;next}
{k=$1;a[k]+=$2;b[k]+=$3;c[k]+=$4;d[k]=$5}
END{for(i in a) print i,a[i],b[i],c[i],d[i]}' input
will print
Name Value Value2 Value3 Rating
ddf 101 68 57 ok
ghd 79 44 156 bad
If the input is CSV and you want CSV output, you need to add the -F, argument and change OFS="\t" to OFS=",".
I have data in a file and I don't know whether it is delimited by spaces or tabs.
Data In:
id Name year Age Score
123456 ALEX BROWNNIS VND 0 19 115
123457 MARIA BROWNNIS VND 0 57 170
123458 jORDAN BROWNNIS VND 0 27 191
I read the data with read_csv using the tab delimiter:
df = pd.read_csv('data.txt', sep='\t')
out:
id Name year Age Score
0 123456 ALEX BROWNNIS VND ... 0 19 115
1 123457 MARIA BROWNNIS VND ... 0 57 170
2 123458 jORDAN BROWNNIS VND ... 0 27 191
There is a lot of white space between the columns. Am I using the delimiter correctly? When I try to access a column by name, I get a KeyError, so I basically think the fault is the use of \t.
What are the possible ways to fix this problem?
Since the Name column contains a variable number of words, you need to read the file line by line and join the words between the first field and the last three back into a single name.
import pandas as pd

id = []  # note: this shadows the built-in id(); consider renaming
Name = []
year = []
Age = []
Score = []
with open('data.txt') as f:
    text = f.read()
lines = text.split('\n')[1:]  # skip the header row
for line in lines:
    if len(line) < 3: continue  # skip blank lines
    words = line.split()
    id.append(words[0])
    Name.append(' '.join(words[1:-3]))  # everything between the id and the last three fields
    year.append(words[-3])
    Age.append(words[-2])
    Score.append(words[-1])
df = pd.DataFrame.from_dict({'id': id, 'Name': Name,
                             'year': year, 'Age': Age, 'Score': Score})
Edit: you've posted the full data, so I'll change my answer to fit it.
You can use the skipinitialspace parameter as in the following example (sep and delimiter are aliases in read_csv, so pass only one of them):
df2 = pd.read_csv('data.txt', sep='\t', encoding="utf-8", skipinitialspace=True)
Pandas documentation: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
Problem solved:
df = pd.read_csv('data.txt', sep='\t', engine="python")
I added this line of code to strip the whitespace around the column names, and it works:
df.columns = df.columns.str.strip()
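Putting the thread's pieces together, the whole fix is only a few lines (a sketch combining the accepted fix above with the earlier skipinitialspace suggestion):

import pandas as pd

df = pd.read_csv('data.txt', sep='\t', engine='python', skipinitialspace=True)
df.columns = df.columns.str.strip()  # strip stray whitespace from header names
print(df.columns.tolist())           # verify the names are clean before indexing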
I have a very large file with lines like follows:
....
0.040027 a b c d e 12 34 56 78 90 12 34 56
0.050027 f g h i l 12 34 56 78 90 12 34 56
0.060027 a b c d e 12 34 56 78 90 12 34 56
0.070027 f g h i l 12 34 56 78 90 12 34 56
0.080027 a b c d e 12 34 56 78 90 12 34 56
0.090027 f g h i l 12 34 56 78 90 12 34 56
....
I need to build a dictionary like the one shown below, in the fastest way possible.
I am using the following code:
import time

ascFile = open('C:\\example.txt', 'r', encoding='UTF-8')
tag1 = ' a b c d e '
tag2 = ' f g h i l '
tags = [tag1, tag2]
temp = {'k1': [], 'k2': []}
key_tag = {'k1': tag1, 'k2': tag2}

t1 = time.time()
for line in ascFile:
    for path, tag in key_tag.items():
        if tag in line:
            # split on the tag once: timestamp before it, numbers after it
            columns = line.strip().split(tag, 1)
            temp[path].append([columns[0], columns[-1].replace(' ', '')])
t2 = time.time()
print(t2 - t1)
I get the following result in 6 seconds when parsing a 360 MB file; I'd like to improve on that time.
temp = {'k1': [['0.040027', '1234567890123456'], ['0.060027', '1234567890123456'], ['0.080027', '1234567890123456']],
        'k2': [['0.050027', '1234567890123456'], ['0.070027', '1234567890123456'], ['0.090027', '1234567890123456']]}
I assume you have a fixed number of words in the file that are your keys. Use split to break the string, then take a slice of the split list to compute your key directly:
import collections

# raw strings don't need \\ for backslash:
FILESPEC = r'C:\example.txt'

lines_by_key = collections.defaultdict(list)

with open(FILESPEC, 'r', encoding='UTF-8') as f:
    for line in f:
        cols = line.split()
        key = ' '.join(cols[1:6])            # the five words that form the tag
        pair = (cols[0], ''.join(cols[6:]))  # tuple, not list, could be changed
        lines_by_key[key].append(pair)

print(lines_by_key)
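If you still need the 'k1'/'k2' labels from the question rather than the joined words as keys, a small lookup converts them afterwards (a sketch reusing the question's tags, minus their surrounding spaces):

tag_to_label = {'a b c d e': 'k1', 'f g h i l': 'k2'}
temp = {label: lines_by_key.get(tag, []) for tag, label in tag_to_label.items()}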
I used partition instead of split so that the 'in' test and splitting can be done in a single pass.
for line in ascFile:
    for path, tag in key_tag.items():
        val0, tag_found, val1 = line.partition(tag)
        if tag_found:
            temp[path].append([val0, val1.replace(' ', '')])
            break
Is this any better with your 360MB file?
You might also do a simple test where all you do is loop through the file a line at a time:
for line in ascFile:
    pass
This will tell you what your best possible time will be.
I am trying to write a simple Python function which will read in a CSV file and find the average for some columns and rows.
The function will examine the first row, and for each column whose header
starts with the letter 'Q' it will calculate the average of the values in
that column and print it to the screen. Then for each row of the
data it will calculate the student's average for all items in columns
that start with 'Q'. It will calculate this average normally and also
with the lowest quiz dropped. It will print out two values per student.
The CSV file contains grades for students and looks like this:
hw1 hw2 Quiz3 hw4 Quiz2 Quiz1
john 87 98 76 67 90 56
marie 45 67 65 98 78 67
paul 54 64 93 28 83 98
fred 67 87 45 98 56 87
The code I have so far is below, but I have no idea how to continue:
import csv

def practice():
    newlist = []
    afile = input('enter file name')
    a = open(afile, 'r')
    reader = csv.reader(a, delimiter=",")
    for each in reader:
        newlist.append(each)
    y = sum(int(x[2]) for x in newlist[1:])  # was sum(int(x[2] for x in reader)): misplaced paren, and reader was already exhausted
    print(y)
    filtered = []
    total = 0
    for i in range(0, len(newlist)):
        if 'Q' in newlist[i][1]:  # was [i][1], which indexed a one-element list
            filtered.append(newlist[i])
    return filtered
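One way to continue in plain Python (a sketch with a made-up helper name, assuming the file is comma-delimited as in the code above, with the header in the first row and student names in the first column of each data row):

import csv

def quiz_averages(filename):
    with open(filename, newline='') as f:
        rows = list(csv.reader(f))
    header, data = rows[0], rows[1:]
    # data rows carry a leading name field the header lacks, hence the +1 offset
    q_idx = [i + 1 for i, h in enumerate(header) if h.startswith('Q')]
    # per-column averages
    for i in q_idx:
        col = [int(r[i]) for r in data]
        print(header[i - 1], sum(col) / len(col))
    # per-student averages: normal, then with the lowest quiz dropped
    for r in data:
        scores = sorted(int(r[i]) for i in q_idx)
        print(r[0], sum(scores) / len(scores), sum(scores[1:]) / len(scores[1:]))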
May I suggest the use of Pandas:
>>> import pandas as pd
>>> data = pd.read_csv('file.csv', sep=r'\s+')
>>> q_columns = [name for name in data.columns if name.startswith('Q')]
>>> reduced_data = data[q_columns].copy()
>>> reduced_data.mean()
Quiz3    69.75
Quiz2    76.75
Quiz1    77.00
dtype: float64
>>> reduced_data.mean(axis=1)
john     74.000000
marie    70.000000
paul     91.333333
fred     62.666667
dtype: float64
>>> import numpy as np
>>> for index, column in reduced_data.idxmin(axis=1).items():
...     reduced_data.loc[index, column] = np.nan
>>> reduced_data.mean(axis=1)
john     83.0
marie    72.5
paul     95.5
fred     71.5
dtype: float64
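The same drop-the-lowest average can be computed without the loop, starting from the unmodified reduced_data: subtract each row's minimum from its sum and divide by one fewer than the count (a small vectorized alternative):

dropped = (reduced_data.sum(axis=1) - reduced_data.min(axis=1)) / (reduced_data.count(axis=1) - 1)
print(dropped)  # john 83.0, marie 72.5, paul 95.5, fred 71.5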
Your code would be nicer if you changed your .csv format; then we can use DictReader easily.
grades.csv:
name,hw1,hw2,Quiz3,hw4,Quiz2,Quiz1
john,87,98,76,67,90,56
marie,45,67,65,98,78,67
paul,54,64,93,28,83,98
fred,67,87,45,98,56,87
Code:
import numpy as np
from collections import defaultdict
import csv

result = defaultdict(list)

with open('grades.csv', 'r') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        for k in row:
            if k.startswith('Q'):
                result[row['name']].append(int(row[k]))

for name, lst in result.items():
    print name, np.mean(sorted(lst)[1:])  # average with the lowest quiz dropped
Output:
paul 95.5
john 83.0
marie 72.5
fred 71.5
In the index.csv file, the fourth column has ten numbers ranging from 1 to 5. Each number can be regarded as an index, and each index corresponds to a row of numbers in filename.csv.
The row number of filename.csv represents the index, and each row has three numbers. My question is about using a nested loop to transfer the numbers in filename.csv to index.csv.
from numpy import genfromtxt
import numpy as np
import csv
import collections

data1 = genfromtxt('filename.csv', delimiter=',')
data2 = genfromtxt('index.csv', delimiter=',')

out = np.zeros((len(data2), len(data1)))
for row in data2:
    for ch_row in range(len(data1)):
        if row[3] == ch_row + 1:
            out = row.tolist() + data1[ch_row].tolist()
            print(out)

writer = csv.writer(open('dn.csv', 'w'), delimiter=',', quoting=csv.QUOTE_ALL)
writer.writerow(out)
For example, the fourth column of index.csv contains 1,2,5,3,4,1,4,5,2,3 and filename.csv contains:
# filename.csv
20 30 50
70 60 45
35 26 77
93 37 68
13 08 55
What I need is to write the indexed row from filename.csv to index.csv and store these numbers in the 5th, 6th and 7th columns:
# index.csv
# 4 5 6 7
... 1 20 30 50
... 2 70 60 45
... 5 13 08 55
... 3 35 26 77
... 4 93 37 68
... 1 20 30 50
... 4 93 37 68
... 5 13 08 55
... 2 70 60 45
... 3 35 26 77
If I do "print(out)", it comes out a correct answer. However, when I input "out" in the shell, there are only one row appears like [1.0, 1.0, 1.0, 1.0, 20.0, 30.0, 50.0]
What I need is to store all the values in the "out" variables and write them to the dn.csv file.
This ought to do the trick for you:
Code:
from csv import reader, writer

data = list(reader(open("filename.csv", "r"), delimiter=" "))
out = writer(open("output.csv", "w"), delimiter=" ")

for row in reader(open("index.csv", "r"), delimiter=" "):
    out.writerow(row + data[int(row[3])])
index.csv:
0 0 0 1
0 0 0 2
0 0 0 3
filename.csv:
20 30 50
70 60 45
35 26 77
93 37 68
13 08 55
This produces the output:
0 0 0 1 70 60 45
0 0 0 2 35 26 77
0 0 0 3 93 37 68
Note: there's no need to use numpy here. The standard library csv module will do most of the work for you.
I also had to modify your sample datasets a bit, as what you showed had indexes out of bounds of the sample data in filename.csv.
Please also note that Python (like most languages) uses zero-based indexes, so you may have to fiddle with the above code to exactly fit your needs.
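For example, if the fourth column really is 1-based as in the question, the lookup is a one-line tweak:

out.writerow(row + data[int(row[3]) - 1])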
Another approach, sticking with the numpy arrays the question's own code already loads:
with open('dn.csv', 'w') as f:
    writer = csv.writer(f, delimiter=',', quoting=csv.QUOTE_ALL)
    for row in data2:
        idx = int(row[3])  # genfromtxt yields floats, so cast before using as an index
        out = [idx] + [x for x in data1[idx - 1]]
        writer.writerow(out)
Let's consider that I have two lists
Person 1 :
2012-08 person 1 23
2012-09 person 1 63
2012-10 person 1 99
2012-11 person 1 62
and
Person 2 :
2012-08 person 2 45
2012-09 person 2 69
2012-10 person 2 12
2012-11 person 2 53
What would you suggest if I'd like to get tabular data with the following pattern?
Date Person 1 Person 2
----- --------- ---------
2012-08 23 45
2012-09 63 69
2012-10 99 12
2012-11 62 53
UPDATE:
Here are the lists:
List1 = [(u'201206', u'Customer_1', 0.19048299999999993), (u'201207', u'Customer_1', 15.409000999998593), (u'201208', u'Customer_1', 71.1695730000299), (u'201209', u'Customer_1', 135.73918600011424), (u'201210', u'Customer_1', 235.26299999991522), (u'201211', u'Customer_1', 271.768984999485), (u'201212', u'Customer_1', 355.90968299883934), (u'201301', u'Customer_1', 508.39194049821526), (u'201302', u'Customer_1', 631.136656500077), (u'201303', u'Customer_1', 901.9127695088399), (u'201304', u'Customer_1', 951.9143960094264)]
List2 = [(None, None, None), (None, None, None), (None, None, None), (None, None, None), (None, None, None), (None, None, None), (None, None, None), (u'201301', u'Customer_2', 3.7276289999999657), (u'201302', u'Customer_2', 25.39122749999623), (u'201303', u'Customer_2', 186.77777299985306), (u'201304', u'Customer_2', 387.97834699805617)]
Use itertools.izip() to combine two input sequences while processing:
import csv
import itertools

reader1 = csv.reader(file1)
reader2 = csv.reader(file2)

for row1, row2 in itertools.izip(reader1, reader2):
    pass  # process row1 and row2 together
This will work with lists too; izip() makes merging of long lists efficient, since it is the iterator version of the zip() function, which, in Python 2, materializes the whole combined list in memory. (In Python 3, zip() is itself lazy, so izip() is no longer needed.)
If you can possibly retool the functions that create your input lists into generators, use that:
def function_for_list1(inputfilename):
    with open(inputfilename, 'rb') as f:
        reader = csv.reader(f)
        for row in reader:
            # process row
            yield row

def function_for_list2(inputfilename):
    with open(inputfilename, 'rb') as f:
        reader = csv.reader(f)
        for row in reader:
            # process row
            yield row

for row1, row2 in itertools.izip(function_for_list1(somename),
                                 function_for_list2(someothername)):
    # process row1 and row2 together
    pass
This arrangement means you can process gigabytes of information while only holding in memory what you need for one small set of rows at a time.
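Putting it together for the tabular output the question asks for (a sketch under assumptions: both files are space-delimited and aligned row for row after sorting by date, the date is the first field and the value the last, and the filenames are made up):

import csv
import itertools

with open('person1.csv', 'rb') as f1, open('person2.csv', 'rb') as f2:
    reader1 = csv.reader(f1, delimiter=' ')
    reader2 = csv.reader(f2, delimiter=' ')
    print '{:<10}{:<10}{:<10}'.format('Date', 'Person 1', 'Person 2')
    for row1, row2 in itertools.izip(reader1, reader2):
        # the date is the first field, the value is the last field of each row
        print '{:<10}{:<10}{:<10}'.format(row1[0], row1[-1], row2[-1])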
If Python is not a requirement, and the generation of the two CSV files happens in a plain old bash script, you can combine join and awk (or even cut).
Example:
Let's say this file is called one:
2012-08 person1 23
2012-09 person1 63
2012-10 person1 99
2012-11 person1 62
and this file is called two:
2012-08 person2 45
2012-09 person2 69
2012-10 person2 12
2012-11 person2 53
Then the command
join one two | awk '{print $1 " " $3 " " $5}'
will output:
2012-08 23 45
2012-09 63 69
2012-10 99 12
2012-11 62 53
To put the CSV headers on the output, or to choose a different delimiter, is not difficult.
One caveat: the two files must be sorted on the join column for this to work (for example, run each file through sort first). But you probably already know this, because you say the two CSV files are massive, so you do not want to read them all into memory at once. Plain Unix tools are really good for this sort of thing, IMHO.
l1=[ ['2012-08','person 1',23], ['2012-09','person 1',63],
['2012-10','person 1',99], ['2012-11','person 1',62]]
l2=[ ['2012-08','person 2',45], ['2012-09','person 2',69],
['2012-10','person 2',12], ['2012-11','person 2',53]]
h1 = { x:z for x,y,z in l1}
h2 = { x:z for x,y,z in l2}
print "{:<10}{:<10}{:<10}".format("Date", "Person 1", "Person 2")
print "{:<10}{:<10}{:<10}".format('-'*5, '-'*8, '-'*8)
for d in sorted(h1): print "{:<10} {:<10}{:<10}".format(d,h1[d],h2[d])
Output
Date Person 1 Person 2
----- -------- --------
2012-08 23 45
2012-09 63 69
2012-10 99 12
2012-11 62 53
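Note this assumes h1 and h2 cover the same dates; with the real lists from the update, Person 2 is missing the early months, so a guard with dict.get avoids a KeyError:

for d in sorted(h1):
    print "{:<10} {:<10}{:<10}".format(d, h1[d], h2.get(d, ''))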