I am given a .txt file of data:
1,2,3,0,0
1,0,4,5,0
1,1,1,1,1
3,4,5,6,0
1,0,1,0,3
3,3,4,0,0
My objective is to calculate the min, max, avg, range, and median of each column of the given data and write them to an output .txt file.
My logic for approaching this problem is as follows:
Step 1) Read the data
infile = open("Data.txt", "r")
tempLine = infile.readline()
while tempLine:
    print(tempLine.split(','))
    tempLine = infile.readline()
Obviously it's not perfect but the idea is that the data can be read by this...
Step 2) Store the data into corresponding list variables? row1, row2,... row6
Step 3) Combine above lists all into one, giving a final list like this...
flist =[[1,2,3,0,0],[1,0,4,5,0],[1,1,1,1,1],[3,4,5,6,0],[1,0,1,0,3],[3,3,4,0,0]]
Step 4) Using nested for loop, access elements individually and store them into list variables
col1, col2, col3, ... , col5
Step 5) Calculate min, max etc and write to output file
My question is: with my rather beginner knowledge of computer science and Python, is this logic inefficient, and could there be an easier and better approach to solving this problem?
My main problem is probably steps 2 through 5. The rest I know how to do for sure.
Any advice would be helpful!
Try numpy. The numpy library provides fast options for dealing with nested lists of lists, or simply, matrices.
To use numpy, you must import numpy at the beginning of your code.
numpy.matrix('1,2,3,0,0; 1,0,4,5,0; ...; 3,3,4,0,0')  (note that the whole matrix is passed as a single string, with semicolons separating the rows)
will give you
flist = [[1,2,3,0,0],[1,0,4,5,0],[1,1,1,1,1],[3,4,5,6,0],[1,0,1,0,3],[3,3,4,0,0]] straight off the bat (as a matrix object).
Also, you can work along an axis (axis=0 runs down the rows, giving one result per column) and get the mean, min, and max easily using:
max([axis, out]) Return the maximum value along an axis.
mean([axis, dtype, out]) Returns the average of the matrix elements along the given axis.
min([axis, out]) Return the minimum value along an axis.
These are from the numpy documentation at https://docs.scipy.org/doc/numpy/reference/generated/numpy.matrix.html; see it for more information.
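For example, a minimal sketch along these lines (using numpy.loadtxt to read the file rather than typing the matrix literal, and assuming the data lives in Data.txt) computes all five column statistics at once:

import numpy as np

# load the comma-separated file into a 2-D array (6 rows x 5 columns here)
data = np.loadtxt("Data.txt", delimiter=",")

col_min = data.min(axis=0)            # minimum of each column
col_max = data.max(axis=0)            # maximum of each column
col_avg = data.mean(axis=0)           # average of each column
col_range = col_max - col_min         # range of each column
col_median = np.median(data, axis=0)  # median of each column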
To get the data I would do something like this:
from statistics import median

infile = open("Data.txt", "r")
# convert each value to int; split() alone leaves strings, and min/max/sum on strings gives the wrong result
rows = [[int(v) for v in line.split(',')] for line in infile.readlines()]
infile.close()

for row in rows:
    minRow = min(row)
    maxRow = max(row)
    avgRow = sum(row) / len(row)
    rangeRow = maxRow - minRow
    medianRow = median(row)
    # then write the data to the output file
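Since the question asks for the statistics of the columns rather than the rows, the same loop can be run over the transposed data; a minimal sketch (zip(*rows) regroups the values column by column):

for col in zip(*rows):  # each col is a tuple holding one column's values
    minCol = min(col)
    maxCol = max(col)
    avgCol = sum(col) / len(col)
    rangeCol = maxCol - minCol
    medianCol = median(col)
    # write these to the output file, one line per column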
You can use the pandas library for this (http://pandas.pydata.org/)
The code below worked for me:
import pandas as pd
df = pd.read_csv('data.txt',header=None)
somestats = df.describe()
somestats.to_csv('dataOut.txt')
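Note that describe() reports the median as the 50% row; if the range is also wanted, a small sketch of one way to append it before writing out:

somestats.loc['range'] = df.max() - df.min()  # add a 'range' row (all columns are numeric here)
somestats.to_csv('dataOut.txt')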
This is how I ended up doing it if anyone is curious
import numpy

infile = open("Data1.txt", "r")
outfile = open("ColStats.txt", "w")

oMat = numpy.loadtxt(infile)
tMat = numpy.transpose(oMat)  # create new matrix where columns of oMat become rows and rows become columns
#print(tMat)

for x in range(5):
    tempM = tMat[x]
    mn = min(tempM)
    mx = max(tempM)
    avg = sum(tempM) / 6.0
    rng = mx - mn
    median = numpy.median(tempM)
    out = "[{} {} {} {} {}]".format(mn, mx, avg, rng, median)
    outfile.write(out + '\n')

infile.close()
outfile.close()
#print(tMat)
Related
I have a large semicolon-delimited text file that weighs in at a little over 100GB. It comprises ~18,000,000 rows of data and 772 columns.
The columns are: 'sc16' (int), 'cpid' (int), 'type' (str), 'pubyr' (int) and then 768 columns labeled 'dim_0', 'dim_1', 'dim_2' ... 'dim_767', which are all ints.
The file is already arranged/sorted by sc16 and pubyr so that each combination of sc16+pubyr are grouped together in ascending order.
What I'm trying to do is get the average of each 'dim_' column for each unique combination of sc16 & pubyr, then output the row to a new dataframe and save the final result to a new text file.
The problem is that in my script below, the processing gradually gets slower and slower until it's just creeping along by row 5,000,000. I'm working on a machine with 96GB of RAM, and I'm not used to working with a file so large I can't simply load it into memory. This is my first attempt trying to work with something like itertools, so no doubt I'm being really inefficient. Any help you can provide would be much appreciated!
import itertools
import pandas as pd

# Step 1: create an empty dataframe to store the mean values
mean_df = pd.DataFrame(columns=['sc16', 'pubyr'] + [f"dim_{i}" for i in range(768)])

# Step 2: open the file and iterate through the rows
with open('C:\Python_scratch\scibert_embeddings_sorted.txt') as f:
    counter = 0
    total_lines = sum(1 for line in f)
    f.seek(0)
    # group by the first (sc16) and fourth (pubyr) column
    for key, group in itertools.groupby(f, key=lambda x: (x.split(';')[0], x.split(';')[3])):
        sc16, pubyr = key
        rows = [row.strip().split(';') for row in group]
        columns = rows[0]
        rows = rows[1:]
        # Step 3: convert the group of rows to a dataframe
        group_df = pd.DataFrame(rows, columns=columns)
        # Step 4: calculate the mean for the group
        mean_row = {'sc16': sc16, 'pubyr': pubyr}
        for col in group_df.columns:
            if col.startswith('dim_'):
                mean_row[col] = group_df[col].astype(float).mean()
        # Step 5: append the mean row to the mean dataframe
        mean_df = pd.concat([mean_df, pd.DataFrame([mean_row])], ignore_index=True)
        counter += len(rows)
        print(f"{counter} of {total_lines}")

# Step 6: save the mean dataframe to a new file
mean_df.to_csv('C:\Python_scratch\scibert_embeddings_mean.txt', sep=';', index=False)
You might not want to use Pandas at all, since your data is already neatly pre-sorted.
Try something like this; it uses numpy to make the dim-wise averaging fast, but is plain Python otherwise. It processes a 43,000-line example file I generated in about 7.6 seconds on my machine, and I don't see a reason why this should slow down over time. (If you know your file won't have a header line or empty lines, you could get rid of those checks.)
Your original code also spends extra time parsing the read lines over and over again, and the pd.concat inside the loop copies the ever-growing mean_df on every iteration, which is the likely reason it keeps getting slower; this version uses a generator that parses each line only once.
import itertools
import operator

import numpy as np


def read_embeddings_file(filename):
    # Read the (pre-sorted) embeddings file,
    # yielding tuples of ((sc16, pubyr) and a list of dimensions).
    with open(filename) as in_file:
        for line in in_file:
            if not line or line.startswith("sc16"):  # Header or empty line
                continue
            line = line.split(";")
            sc16, cpid, type, pubyr, *dims = line
            # list(map(... is faster than the equivalent listcomp
            yield (sc16, pubyr), list(map(int, dims))


def main():
    output_name = "scibert_embeddings_mean.txt"
    input_name = "scibert_embeddings_sorted.txt"
    with open(output_name, "w") as out_f:
        print("sc16", "pubyr", *[f"dim_{i}" for i in range(768)], sep=";", file=out_f)
        counter = 0
        for group, group_contents in itertools.groupby(
            read_embeddings_file(input_name),
            key=operator.itemgetter(0),  # Group by (sc16, pubyr)
        ):
            dims = [d[1] for d in group_contents]
            # Calculate the mean of each dimension
            mean_dims = np.mean(np.array(dims).astype(float), axis=0)
            # Write group to output
            print(*group, *mean_dims, sep=";", file=out_f)
            # Print progress
            counter += len(dims)
            print(f"Processed: {counter}; group: {group}, entries in group: {len(dims)}")


if __name__ == "__main__":
    main()
I imported a CSV file into Python and organized it into lists.
I need to print the median carat for the 'Premium' category (marked in yellow in my spreadsheet).
Here is my code:
diamonds_file = open('diamonds.csv', 'r')
lines = diamonds_file.readlines()
table = []
for i in range(len(lines)):
    lines[i] = lines[i].replace('\n', '')
    splitted_line = lines[i].split(',')
    print(splitted_line)
The attached screenshot shows the output of the code above.
I hope you can use external libraries.
import statistics

diamonds_file = open('diamonds.csv', 'r')
lines = diamonds_file.readlines()
table = []
values = []
for i in range(len(lines)):
    lines[i] = lines[i].replace('\n', '')
    splitted_line = lines[i].split(',')
    if splitted_line[1] == '"Premium"':
        values.append(float(splitted_line[0]))
print(statistics.median(values))
Without an external library:
diamonds_file = open('diamonds.csv', 'r')
lines = diamonds_file.readlines()
table = []
values = []
for i in range(len(lines)):
    lines[i] = lines[i].replace('\n', '')
    splitted_line = lines[i].split(',')
    if splitted_line[1] == '"Premium"':
        values.append(float(splitted_line[0]))
# compute the median by hand: sort the values and take the middle one
values.sort()
n = len(values)
if n % 2 == 1:
    print(values[n // 2])
else:
    print((values[n // 2 - 1] + values[n // 2]) / 2)
Read the csv into pandas...
import pandas as pd
df = pd.read_csv('diamonds.csv')
If the csv has no headers, read it with header=None so the columns can be selected by index number (this is what I do below), or rename the columns... and continue.
df_Premium = df[df[1] == 'Premium']
stats = df_Premium.describe()
display(stats)
The median will be in the stats printed out.
Please use the Pandas library; it is a data analysis library.
import pandas as pd
df = pd.read_csv("diamonds.csv")
And you can see the uniform table stored in a dataframe df.
Now you want the median of a specific metric:
df.groupby('cut').median()
which shows the median of every numerical metric.
Now, indicate the specific column that you need:
df.groupby('cut').median()['carat']
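To pull out just the Premium figure, a small sketch (assuming the header names the columns cut and carat, as in the standard diamonds data):

print(df[df['cut'] == 'Premium']['carat'].median())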
def premiummedian(splitted_line):
    premium_carat = []
    n = 0
    for line in splitted_line:
        if line[1] == "Premium":
            premium_carat.append(float(line[0]))
            n += 1
    # the values must be sorted before taking the median
    premium_carat.sort()
    if n % 2 == 1:  # if the length is odd, return the middle element
        return premium_carat[n // 2]
    else:  # if the length is even, return the average of the two middle elements
        return (premium_carat[n // 2 - 1] + premium_carat[n // 2]) / 2
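A hedged usage sketch, feeding the function the split rows built by the loop in the question (note that if the CSV quotes its strings, the comparison inside the function may need to match '"Premium"' instead, as in the first answer):

table = []
with open('diamonds.csv') as f:
    for line in f:
        table.append(line.strip().split(','))

print(premiummedian(table))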
So I'm having a problem printing the max and min values from a file. The file has over 3000 lines and looks like this:
3968 #number of lines
2000-01-03, 3.738314
2000-01-04, 3.423135
2000-01-05, 3.473229
...
...
2015-10-07, 110.779999
2015-10-08, 109.50
2015-10-09, 112.120003
This is my current code, and I have no idea how to make it work: right now it only prints the 3968 value, obviously because that line is the largest as a string, but I want the largest and smallest values from the second column (all the stock prices).
def apple():
    stock_file = open('apple_USD.txt', 'r')
    data = stock_file.readlines()
    data = data[0:]
    stock_file.close()
    print(max(data))
Your current code produces that output only by chance, since max() is comparing the lines as strings.
Consider this:
with open('test.txt') as f:
    lines = [line.split(', ') for line in f.readlines()[1:]]

# lines is a list of lists; each sub-list represents a line in the format [date, value]
max_value_date, max_value = max(lines, key=lambda line: float(line[-1].strip()))
print(max_value_date, max_value)
# '2015-10-09' '112.120003'
Your current code reads each line as a string and then finds the max and min lines of that list. You can use pandas to read the file as a CSV, load it into a data frame, and then run your min/max operations on the data frame.
I hope the following answers your question:
stocks = []
data = data[1:]
for d in data:
    stocks.append(float(d.split(',')[1]))
print(max(stocks))
print(min(stocks))
I recommend the Pandas module for working with tabular data, and its read_csv function in particular. It is very well documented, optimized, and very popular for this purpose. You can install it with pip using pip install pandas.
I created a dummy file in your format and stored it in a file called test.csv:
3968 #number of lines
2000-01-03, 3.738314
2000-01-04, 3.423135
2000-01-05, 3.473229
2015-10-07, 110.779999
2015-10-08, 109.50
2015-10-09, 112.120003
Then, to parse the file you can do as follows. The names parameter defines the column names, and skiprows skips the first line.
#import module
import pandas as pd
#load file
df = pd.read_csv('test.csv', names=['date', 'value'], skiprows=[0])
#get max and min values
max_value = df['value'].max()
min_value = df['value'].min()
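If the dates belonging to those extremes are also of interest, idxmax/idxmin give the matching rows; a small sketch:

max_date = df.loc[df['value'].idxmax(), 'date']  # date of the highest price
min_date = df.loc[df['value'].idxmin(), 'date']  # date of the lowest price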
You want to extract the second column into a float using float(datum.split(', ')[1].strip()), and ignore the first line.
def apple():
    stock_file = open('apple_USD.txt', 'r')
    data = stock_file.readlines()
    data = data[1:]  # ignore first line
    stock_file.close()
    data = [datum.split(', ') for datum in data]
    max_value_date, max_value = max(data, key=lambda data: float(data[-1].strip()))
    print(max_value_date, max_value)
Or you can do it in a simpler way: make a list of prices and then get the maximum and minimum, like this:
# as the first line in your txt is not data
datanew = data[1:]
prices = []
line_after = []
for line in datanew:
    line_after = line.split(',')
    price = line_after[1]
    prices.append(float(price))
maxprice = max(prices)
minprice = min(prices)
I would like to know (in Python) how to count occurrences and compare values from different columns in different spreadsheets. After counting, I would need to know whether those values fulfill a condition, i.e. if Ana (a user) from the first spreadsheet appears 1 time in the second spreadsheet and 5 times in the third one, I would like to add 1 to a variable X.
I am new to Python, but I have tried getting the .values() after using the Counter from collections. However, I am not sure if the actual value Ana is being considered when iterating over the results of the Counter. All in all, I need to iterate over each element in spreadsheet one and see if it appears one time in the second spreadsheet and five times in the third spreadsheet; if that happens, the variable X is incremented by one.
def XInputOutputs():
    list1 = []
    with open(file1, 'r') as fr:
        r = csv.reader(fr)
        for row in r:
            list1.append(row[1])
    number_of_occurrences_in_list_1 = Counter(list1)
    list1_ocurrences = number_of_occurrences_in_list_1.values()

    list2 = []
    with open(file2, 'r') as fr:
        r = csv.reader(fr)
        for row in r:
            list2.append(row[1])
    number_of_occurrences_in_list_2 = Counter(list2)
    list2_ocurrences = number_of_occurrences_in_list_2.values()

    X = 0
    for x, y in zip(list1_ocurrences, list2_ocurrences):
        if x == 1 and y == 5:
            X += 1
    return X
I tested this with small spreadsheets, but it only works for pre-ordered values; if Ana appears after 100,000 rows, everything breaks. I think I need to iterate over each value (e.g. Ana) and check all the spreadsheets simultaneously, adding to the variable X.
I am at work, so I will be able to write a full answer only later.
If you can import modules, I suggest you to try using pandas: a real super-useful tool to quickly and efficiently manage data frames.
You can easily import a .csv spreadsheet with
import pandas as pd
df = pd.read_csv()
method, then perform almost any kind of operation.
Check this answer out; I only had a little time to read it, but I hope it helps:
what is the most efficient way of counting occurrences in pandas?
UPDATE: then try with this
# not tested but should work
import os
import pandas as pd

# read all csv sheets from folder - I assume your folder is named "CSVs"
for files in os.walk("CSVs"):
    files = files[-1]

# here a list of dataframes is generated
df_list = []
for file in files:
    df = pd.read_csv("CSVs/" + file)
    df_list.append(df)

name_i_wanna_count = ""  # this will be your query
columun_name = ""  # here insert the column you wanna analyze

count = 0
for df in df_list:
    # retrieve a series matching your query and then count the elements inside
    matching_serie = df.loc[df[columun_name] == name_i_wanna_count]
    partial_count = len(matching_serie)
    count = count + partial_count

print(count)
I hope it helps
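For the exact condition in the question (a value from the first sheet appearing exactly once in the second and five times in the third), a plain-Python sketch using collections.Counter, with placeholder file names, might look like this:

import csv
from collections import Counter

def count_matching_users(file1, file2, file3):
    # count how many distinct values from file1's second column appear
    # exactly once in file2 and exactly five times in file3
    def column_counts(path):
        with open(path, newline='') as f:
            return Counter(row[1] for row in csv.reader(f))

    counts2 = column_counts(file2)
    counts3 = column_counts(file3)

    X = 0
    with open(file1, newline='') as f:
        for name in {row[1] for row in csv.reader(f)}:  # each distinct name once
            if counts2[name] == 1 and counts3[name] == 5:
                X += 1
    return X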
I am new to Python. I use Fortran to generate the data file I wish to read. For many reasons I would like to use Python rather than Fortran to calculate averages and statistics on the data.
I need to read the entries in the first three rows as strings, and then the data, which begins in the fourth row, as numbers. I don't need the first column, but I do need each of the remaining columns as its own array.
# Instantaneous properties
# MC_STEP Density Pressure Energy_Total
# (molec/A^3) (bar) (kJ/mol)-Ext
0 0.34130959E-01 0.52255964E+05 0.26562549E+04
10 0.34130959E-01 0.52174646E+05 0.25835710E+04
20 0.34130959E-01 0.52050492E+05 0.25278775E+04
And the data goes on for thousands, and sometimes millions of lines.
I have tried the following, but I run into problems since I can't analyze the lists I have made, and I can't seem to convert them to arrays. I would prefer to just create arrays to begin with, but converting my lists to arrays would work too. With my current approach I get an error when I try to use an element of one of the lists, i.e. Energy(i).
with open('nvt_test_1.out.box1.prp1') as f:
    Title = f.readline()
    Properties = f.readline()
    Units = f.readline()

    Density = []
    Pressure = []
    Energy = []

    for line in f:
        row = line.split()
        Density.append(row[1])
        Pressure.append(row[2])
        Energy.append(row[3])
I appreciate any help!
I would use the pandas module for this task:
import pandas as pd
In [9]: df = pd.read_csv('a.csv', delim_whitespace=True,
comment='#', skiprows=3,header=None,
names=['MC_STEP','Density','Pressure','Energy_Total'])
Data Frame:
In [10]: df
Out[10]:
MC_STEP Density Pressure Energy_Total
0 0 0.034131 52255.964 2656.2549
1 10 0.034131 52174.646 2583.5710
2 20 0.034131 52050.492 2527.8775
Average values for all columns:
In [11]: df.mean()
Out[11]:
MC_STEP 10.000000
Density 0.034131
Pressure 52160.367333
Energy_Total 2589.234467
dtype: float64
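If each column is then wanted as its own array, the columns can be pulled out of the dataframe individually, for example (a sketch; .values also works on older pandas versions):

density = df['Density'].to_numpy()
pressure = df['Pressure'].to_numpy()
energy = df['Energy_Total'].to_numpy()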
You can consider a list in Python like an array in other languages, and it is very optimised. If you have special needs there is an array type available, but it is rarely used; alternatively there is numpy.array, which is designed for scientific computation (you have to install the Numpy package for that).
Before performing calculations cast the string to a float, like in energy.append(float(row[3]))
Maybe do it all at once using the map function (wrapped in list so the result can be indexed):
row = list(map(float, line.split()))
Last, as @Hamms said, access the elements using square brackets, e.g. e = energy[i]
You can also use the csv module's DictReader to read each row into a dictionary, as follows:
import csv

with open('filename', 'r') as f:
    # csv only accepts a single-character delimiter, so collapse runs of whitespace first
    cleaned = (' '.join(line.split()) for line in f)
    reader = csv.DictReader(cleaned, delimiter=' ',
                            fieldnames=('MC_STEP', 'DENSITY', 'PRESSURE', 'ENERGY_TOTAL'))
    for row in reader:
        Density.append(float(row['DENSITY']))
        Pressure.append(float(row['PRESSURE']))
        Energy.append(float(row['ENERGY_TOTAL']))
Of course this assumes that the file is formatted more like a CSV (that is, no comments). If the file does have comment lines at the top, you can skip them before the loop by calling next() once per comment line, as follows:
next(f)