import csv

with open('Met.csv', 'r') as f:
    reader = csv.reader(f, delimiter=':', quoting=csv.QUOTE_NONE)
    for row in reader:
        print(row)
I am not able to work out how to get a single column from the CSV file. I tried
print(row[:column_name])
but that doesn't work.
name id nametype recclass mass (g) fall year GeoLocation
Aachen 1 Valid L5 21 Fell 01/01/1880 (50.775000, 6.083330)
Aarhus 2 Valid H6 720 Fell 1/1/1951 (53.775000, 6.586560)
Abee 6 Valid EH4 -- Fell 1/1/1952 (50.775000, 6.083330)
Acapul 10 Valid A 353 Fell 1/1/1952 (50.775000, 6.083330)
Acapul 1914 valid A -- Fell 1/1/1952 (50.775000, 6.083330)
AdhiK 379 Valid EH4 56655 Fell 1/1/1919 (50.775000, 6.083330)
and I want the average of mass (g).
Try pandas instead of reading from csv
import pandas as pd
data = pd.read_csv('Met.csv')
It is far easier to grab columns and perform operations using pandas.
Here I am loading the csv contents to a dataframe.
Loaded data : (sample data)
>>> data
name id nametype recclass mass
0 Aarhus 2 Valid H6 720
1 Abee 6 Valid EH4 107000
2 Acapulco 10 Valid Acapulcoite 914
3 Achiras 370 Valid L6 780
4 Adhi Kot 379 Valid EH4 4239
5 Adzhi 390 Valid LL3-6 910
6 Agen 392 Valid H5 30000
Just the Mass column :
You can access individual columns as data['column name']
>>> data['mass']
0 720
1 107000
2 914
3 780
4 4239
5 910
6 30000
Name: mass, dtype: int64
Average of Mass column :
>>> data['mass'].mean()
20651.857142857141
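Note that the sample file in the question also contains "--" placeholders in the mass column. A minimal sketch (using an inline string instead of 'Met.csv') showing how pandas can treat those as missing values, so mean() simply skips them:

```python
import io

import pandas as pd

# Inline stand-in for the question's file; "--" marks a missing mass.
csv_text = """name,id,mass
Aachen,1,21
Abee,6,--
AdhiK,379,56655
"""

# na_values="--" turns the placeholder into NaN, which mean() ignores.
data = pd.read_csv(io.StringIO(csv_text), na_values="--")
print(data["mass"].mean())  # (21 + 56655) / 2 = 28338.0
```

With a real file, pass the path and na_values="--" to pd.read_csv the same way.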
You can use csv.DictReader() instead of csv.reader(). The following code works for me:
import csv

mass_list = []
with open("../data/Met.csv", "r") as f:
    reader = csv.DictReader(f, delimiter="\t")
    for row in reader:
        mass = row["mass"]
        if mass is not None and mass != "--":
            mass_list.append(float(mass))
avg_mass = sum(mass_list) / len(mass_list)
print("avg of mass:", avg_mass)
Hope it helps.
Related
I have lots of live data coming from a sensor. Currently, I store the data in a csv file as follows:
0 2 1 437 464 385 171 0:44:4 dog.jpg
1 1 3 452 254 444 525 0:56:2 cat.jpg
2 3 2 552 525 785 522 0:52:8 car.jpg
3 8 4 552 525 233 555 0:52:8 car.jpg
4 7 5 552 525 433 522 1:52:8 phone.jpg
5 9 3 552 525 555 522 1:52:8 car.jpg
6 6 6 444 392 111 232 1:43:4 dog.jpg
7 1 1 234 322 191 112 1:43:4 dog.jpg
.
.
.
.
The third column has numbers between 1 and 6. I want to read the values in columns #4 and #5 for every row that has 2 or 5 in the third column. I also want to write them to another csv file line by line, one line every 2 seconds.
I do this because I have another program that goes through that file and reads the data from there. How can I write out the information for the lines that have 2 or 5 in their third column? Please advise!
for example:
2 552 525
5 552 525
......
......
.....
.
import csv
with open('newfilename.csv', 'w') as f2:
    with open('mydata.csv', mode='r') as infile:
        reader = csv.reader(infile)  # no conversion to list
        header = next(reader)        # get first line
        for row in reader:           # continue to read one line per loop
            if row[5] == 2 & 5:
The third column has index 2 so you should be checking if row[2] is one of '2' or '5'. I have done this by defining the set select = {'2', '5'} and checking if row[2] in select.
I don't see what you are using header for, but I assume you have more code that processes it somewhere. If you don't need header and just want to skip the first line, call next(reader) without assigning it to a name; I have kept header in my code under the assumption that you use it later.
We can use time.sleep(2) from the time module to help us write a row every 2 seconds.
Below, "in.txt" is the csv file containing the sample input you provided and "out.txt" is the file we write to.
Code
import csv
import time

select = {'2', '5'}

with open("in.txt") as f_in, open("out.txt", "w") as f_out:
    reader = csv.reader(f_in)
    writer = csv.writer(f_out)
    header = next(reader)
    for row in reader:
        if row[2] in select:
            print(f"Writing {row[2:5]} at {time.time()}")
            writer.writerow(row[2:5])
            # f_out.flush() may need to be run here
            time.sleep(2)
Output
Writing ['2', '552', '525'] at 1650526118.9760585
Writing ['5', '552', '525'] at 1650526120.9763758
"out.txt"
2,552,525
5,552,525
Input
"in.txt"
0,2,1,437,464,385,171,0:44:4,dog.jpg
1,1,3,452,254,444,525,0:56:2,cat.jpg
2,3,2,552,525,785,522,0:52:8,car.jpg
3,8,4,552,525,233,555,0:52:8,car.jpg
4,7,5,552,525,433,522,1:52:8,phone.jpg
5,9,3,552,525,555,522,1:52:8,car.jpg
6,6,6,444,392,111,232,1:43:4,dog.jpg
7,1,1,234,322,191,112,1:43:4,dog.jpg
I think you'd just need to change your if statement to be able to get the rows you want.
for example:
import csv
with open('newfilename.csv', 'w') as f2:
    with open('mydata.csv', mode='r') as infile:
        reader = csv.reader(infile)  # no conversion to list
        header = next(reader)        # get first line
        for row in reader:           # continue to read one line per loop
            if row[2] in ['2', '5']:
Inside the if, you'll get the rows that have 2 or 5. Note that csv.reader yields strings and the third column has index 2, so compare row[2] against the strings '2' and '5'.
The text file looks like this:
421 2 1 8 34 27
421 0 0 8 37 27
435 0 1 9 8 44
435 4 0 9 10 50
for row in file_content[0:]:
    id, place, inout, hour, min, sec = row.split(" ")
    print(id)
In the code I wanted to separate the rows: the first column contains the IDs of persons, the second the IDs of places, the third whether the person went in or out (0/1), and the last three are the time (hour:min:sec).
Could someone help me correct this code so I can continue practicing for my exam? (I'm a beginner.)
with open("Text.txt", "r") as f:
    id, place, inout, hour, min, sec = zip(*map(str.split, f))

print(id)
# [OUT] ('421', '421', '435', '435')
zip() transposes the split rows here, so each variable receives an entire column as a tuple.
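A minimal illustration of what the zip(*...) trick does, using two toy rows:

```python
rows = [["421", "2", "1", "8", "34", "27"],
        ["421", "0", "0", "8", "37", "27"]]

# zip(*rows) transposes: the first tuple gathers column 0 of every row,
# the second tuple column 1, and so on.
columns = list(zip(*rows))
print(columns[0])  # ('421', '421')
```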
>>> filecontent =open("test.txt",'r')
>>> for row in filecontent:
... id, place, inout, hour, min, sec = row.split(" ")
... print("id is", id)
...
id is 421
id is 421
id is 435
id is 435
I'm trying to use Python to find value ranges in two columns (TableA) based on values from the first column of TableB. Columns 1 and 2 in TableA represent ranges of values. Whenever a value from column 1 of TableB falls within one of those ranges, I want to extract that row from TableA, as shown in the output, and also count how many there are.
TableA:
1 524
677 822
902 1103
1239 1790
2001 2321
3900 4567
TableB:
351 aux
1256 sle
4002 aim
Required output:
1 524
1239 1790
3900 4567
Total count = 3
Here's my attempt that didn't work:
datA = open('TableA.txt','r')
datB = open('TableB.txt','r')
count=0
for line1 in datB:
    line1 = line1.strip().split()
    for line2 in datA:
        line2 = line2.strip().split('\t')
        for col1, col2 in zip(line2[0], line2[1]):
            if line2 > col1 and line2 < col2:
                print(col1 + '\t' + col2)
                count=+1
print(count)
datA.close()
datB.close()
Can someone please help? Thanks
You could try it this way:
tableBcol1 = [int(i.split()[0]) for i in open('TableB.txt')]
tableA = [i.strip() for i in open('TableA.txt')]
count = 0
for bcol1 in tableBcol1:
    for line in tableA:
        lbound, hbound = line.split()
        if bcol1 in range(int(lbound), int(hbound) + 1):
            print(line.strip())
            count += 1
print(count)
tableBcol1 contains all the values of column 1 from TableB.txt in integer form (i.e. 351, 1256, 4002).
lbound and hbound contain the values from column1 and column2 from TableA.txt.
Finally, you check for membership in the if statement. If the value from column1 of TableB.txt is in the range then print the line from TableA.txt. Note, one is added to hbound in the range because the upper bound is non-inclusive.
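As an aside, the range(...) membership test can be written as a chained comparison, which avoids building a range object at all. A self-contained sketch on the question's data, with the two tables inlined as lists:

```python
# Inlined stand-ins for TableA.txt and TableB.txt from the question.
table_a = [(1, 524), (677, 822), (902, 1103),
           (1239, 1790), (2001, 2321), (3900, 4567)]
values_b = [351, 1256, 4002]

count = 0
for value in values_b:
    for lbound, hbound in table_a:
        # Inclusive on both ends, equivalent to range(lbound, hbound + 1).
        if lbound <= value <= hbound:
            print(lbound, hbound)
            count += 1
print("Total count =", count)  # Total count = 3
```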
TableA = """1 524
677 822
902 1103
1239 1790
2001 2321
3900 4567"""
TableB = """351 aux
1256 sle
4002 aim"""
#TableA = open('TableA.txt', 'r').read()
#TableB = open('TableB.txt', 'r').read()
ranksA = [tuple(map(int, e.split())) for e in TableA.split("\n")]
valuesB = [int(e.split()[0]) for e in TableB.split("\n")]
resultsraw = [(v, [(ri, rf) for ri, rf in ranksA if ri <= v <= rf]) for v in valuesB]
results = "\n".join("%6s\t%6s" % e[1][0] for e in resultsraw)
print(results)
print("Total count: %s" % len(resultsraw))
output:
1 524
1239 1790
3900 4567
Total count: 3
Report and total the range for every item in Table B that falls within a range in Table A.
Here is another approach using defaultdict as a counter, csv for reading tabular data and with statements for safely opening/closing files:
import csv
from collections import defaultdict

# Build reference dict
with open("Table A.txt", "r") as f:
    reader = csv.reader(f)
    # reference = defaultdict(list)
    reference = defaultdict(int)
    for row in reader:
        reference[row[0]]

# Read data and tally
with open("Table B.txt", "r") as f:
    reader2 = csv.reader(f)
    # header = next(reader2)
    for row in reader2:
        col1 = int(row[0].split()[0])
        for key in reference:
            first, last = map(int, key.split())
            if first <= col1 <= last:
                # reference[key].append(row)
                reference[key] += 1

reference
The result is a dictionary that tallies the entries that fit within the given range.
defaultdict(int,
{'1 524': 1,
'1239 1790': 1,
'2001 2321': 0,
'3900 4567': 1,
'677 822': 0,
'902 1103': 0})
defaultdict gives you the option to store integers or append values to a list (see commented lines). However, for your desired output:
total = 0
for k, v in reference.items():
    if v:
        print(k)
        total += v
print("Total:", total)
Final output:
1 524
1239 1790
3900 4567
Total: 3
I am trying to write a simple Python function that will read in a CSV file and find the average for some columns and rows.
The function will examine the first row and for each column whose header
starts with the letter 'Q' it will calculate the average of values in
that column and then print it to the screen. Then for each row of the
data it will calculate the students average for all items in columns
that start with 'Q'. It will calulate this average normally and also
with the lowest quiz dropped. It will print out two values per student.
the CSV file contains grades for students and looks like this:
hw1 hw2 Quiz3 hw4 Quiz2 Quiz1
john 87 98 76 67 90 56
marie 45 67 65 98 78 67
paul 54 64 93 28 83 98
fred 67 87 45 98 56 87
the code I have so far is this but I have no idea how to continue:
import csv

def practice():
    newlist = []
    afile = input('enter file name')
    a = open(afile, 'r')
    reader = csv.reader(a, delimiter=",")
    for each in reader:
        newlist.append(each)
    y = sum(int(x[2] for x in reader))
    print(y)
    filtered = []
    total = 0
    for i in range(0, len(newlist)):
        if 'Q' in [i][1]:
            filtered.append(newlist[i])
    return filtered
May I suggest the use of Pandas:
>>> import pandas as pd
>>> data = pd.read_csv('file.csv', sep=r'\s+')
>>> q_columns = [name for name in data.columns if name.startswith('Q')]
>>> reduced_data = data[q_columns].copy()
>>> reduced_data.mean()
Quiz3 69.75
Quiz2 76.75
Quiz1 77.00
dtype: float64
>>> reduced_data.mean(axis=1)
john 74.000000
marie 70.000000
paul 91.333333
fred 62.666667
dtype: float64
>>> import numpy as np
>>> for index, column in reduced_data.idxmin(axis=1).items():
...     reduced_data.loc[index, column] = np.nan
>>> reduced_data.mean(axis=1)
john 83.0
marie 72.5
paul 95.5
fred 71.5
dtype: float64
Your code would be nicer if you changed your .csv format; then we can use DictReader easily.
grades.csv:
name,hw1,hw2,Quiz3,hw4,Quiz2,Quiz1
john,87,98,76,67,90,56
marie,45,67,65,98,78,67
paul,54,64,93,28,83,98
fred,67,87,45,98,56,87
Code:
import numpy as np
from collections import defaultdict
import csv

result = defaultdict(list)

with open('grades.csv', 'r') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        for k in row:
            if k.startswith('Q'):
                result[row['name']].append(int(row[k]))

for name, lst in result.items():
    print(name, np.mean(sorted(lst)[1:]))
Output:
paul 95.5
john 83.0
marie 72.5
fred 71.5
I wrote a piece of code that finds common IDs in line[1] of two different files. My input file is huge (2 million lines). If I split it into many small files, it finds more intersecting IDs, while if I run the whole file at once, far fewer. I cannot figure out why; can you suggest what is wrong and how to improve this code to avoid the problem?
fileA = open("file1.txt",'r')
fileB = open("file2.txt",'r')
output = open("result.txt",'w')
dictA = dict()
for line1 in fileA:
    listA = line1.split('\t')
    dictA[listA[1]] = listA
dictB = dict()
for line1 in fileB:
    listB = line1.split('\t')
    dictB[listB[1]] = listB
for key in dictB:
    if key in dictA:
        output.write(dictA[key][0]+'\t'+dictA[key][1]+'\t'+dictB[key][4]+'\t'+dictB[key][5]+'\t'+dictB[key][9]+'\t'+dictB[key][10])
My file1 is sorted by field[0] and has fields 0-15:
contig17 GRMZM2G052619_P03 98 109 2 0 15 67 78.8 0 127 5 420 0 304 45
contig33 AT2G41790.1 98 420 2 0 21 23 78.8 1 127 5 420 2 607 67
contig98 GRMZM5G888620_P01 87 470 1 0 17 28 78.8 1 127 7 420 2 522 18
contig102 GRMZM5G886789_P02 73 115 1 0 34 45 78.8 0 134 5 421 0 456 50
contig123 AT3G57470.1 83 201 2 1 12 43 78.8 0 134 9 420 0 305 50
My file2 is not sorted and has fields 0-10:
GRMZM2G052619 GRMZM2G052619_P03 4 2345 GO:0043531 ADP binding "Interacting selectively and non-covalently with ADP" [GOC:jl] molecular_function PF07525 1
GRMZM5G888620 GRMZM5G888620_P01 1 2367 GO:0011551 DNA binding "Any molecular function by which a gene product interacts selectively and non-covalently with DNA" [GOC:jl] molecular_function PF07589 4
GRMZM5G886789 GRMZM5G886789_P02 1 4567 GO:0055516 ADP binding "Interacting selectively and non-covalently with ADP" [GOC:jl] molecular_function PF07526 0
My desired output,
contig17 GRMZM2G052619_P03 GO:0043531 ADP binding molecular_function PF07525
contig98 GRMZM5G888620_P01 GO:0011551 DNA binding molecular_function PF07589
contig102 GRMZM5G886789_P02 GO:0055516 ADP binding molecular_function PF07526
I really recommend using pandas to deal with this kind of problem.
As proof that it can be done simply with pandas:
import pandas as pd      # install this, and read the docs
from io import StringIO  # you don't need this when reading real files

# simulating reading the first file
first_file = """contig17 GRMZM2G052619_P03 x
contig33 AT2G41790.1 x
contig98 GRMZM5G888620_P01 x
contig102 GRMZM5G886789_P02 x
contig123 AT3G57470.1 x"""
#simulating reading the second file
second_file = """y GRMZM2G052619_P03 y
y GRMZM5G888620_P01 y
y GRMZM5G886789_P02 y"""
# Here is how you open the files. Instead of using StringIO,
# pass the file path directly. Give the correct separator,
# e.g. sep="\t" for tab-separated data; here I'm using a space.
# In names, put some relevant names for your columns.
f_df = pd.read_table(StringIO(first_file),
                     header=None,
                     sep=" ",
                     names=['a', 'b', 'c'])
s_df = pd.read_table(StringIO(second_file),
                     header=None,
                     sep=" ",
                     names=['d', 'e', 'f'])
# This is the hard bit, using a little pandas experience:
# select the rows of the second data frame whose value in
# column e "isin" column b of the first data frame.
my_df = s_df[s_df.e.isin(f_df.b)]
Output:
Out[180]:
d e f
0 y GRMZM2G052619_P03 y
1 y GRMZM5G888620_P01 y
2 y GRMZM5G886789_P02 y
#you can save this with:
my_df.to_csv("result.txt", sep="\t")
Cheers!
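If you also need columns from the first file in the output (as the desired result above does), pd.merge can join the two frames on their ID columns. A self-contained sketch with toy frames mirroring f_df and s_df above:

```python
import pandas as pd

# Toy frames mirroring f_df and s_df from the answer above.
f_df = pd.DataFrame({"a": ["contig17", "contig33", "contig98"],
                     "b": ["GRMZM2G052619_P03", "AT2G41790.1", "GRMZM5G888620_P01"],
                     "c": ["x", "x", "x"]})
s_df = pd.DataFrame({"d": ["y", "y"],
                     "e": ["GRMZM2G052619_P03", "GRMZM5G888620_P01"],
                     "f": ["y", "y"]})

# An inner merge keeps only IDs present in both frames and
# puts the columns of both frames side by side in one row.
my_df = pd.merge(f_df, s_df, left_on="b", right_on="e")
print(my_df[["a", "b", "f"]])
```

From there, my_df[["a", "b", "f"]].to_csv(...) writes just the columns you want.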
This is almost the same but within a function.
# Creates a function to do the reading for each file
import re

def read_store(file_, dictio_):
    """Given a file name and a dictionary, store each line of the
    file in the dictionary, keyed by the ID in its second column."""
    with open(file_, 'r') as file_0:
        for line in file_0:
            # Capture the word (letters, numbers or underscore) that
            # follows the first whitespace-separated field.
            match = re.findall(r"^\S+\s+(\w+)", line)
            if match:
                dictio_[match[0]] = line
To use do:
file1 = {}
read_store("file1.txt", file1)
And then compare the dictionaries as you do now. I would also split on \s rather than \t; even though that also splits between words, it is easy to rejoin them with " ".join(dictA[1:5]).
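For completeness, the key comparison itself can also be written as a set intersection of the two dictionaries' keys. A small self-contained sketch with inline stand-ins for a few lines of the two files:

```python
# Inline stand-ins for a few tab-separated lines of file1 and file2.
file_a_lines = ["contig17\tGRMZM2G052619_P03\t98",
                "contig33\tAT2G41790.1\t98"]
file_b_lines = ["GRMZM2G052619\tGRMZM2G052619_P03\t4"]

# Key each line on its second column, as in the question's code.
dict_a = {line.split("\t")[1]: line.split("\t") for line in file_a_lines}
dict_b = {line.split("\t")[1]: line.split("\t") for line in file_b_lines}

# dict.keys() behaves like a set, so & yields the common IDs directly.
common = dict_a.keys() & dict_b.keys()
print(sorted(common))  # ['GRMZM2G052619_P03']
```

Note that, like the question's code, this keeps only one line per ID: duplicate IDs within a file overwrite earlier entries, which is why splitting the input into small files can appear to produce more matches.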