How to find a median from a list of values - Python

I imported a CSV file into Python and organized it into lists.
I need to print the median carat for the 'Premium' category (marked in yellow in the attached output).
Here is my code:
diamonds_file = open('diamonds.csv', 'r')
lines = diamonds_file.readlines()
table = []
for i in range(len(lines)):
    lines[i] = lines[i].replace('\n', '')
    splitted_line = lines[i].split(',')
    print(splitted_line)
Please see the attached output of this code above:

I hope you can use external libraries.
import statistics

diamonds_file = open('diamonds.csv', 'r')
lines = diamonds_file.readlines()
values = []
for i in range(len(lines)):
    lines[i] = lines[i].replace('\n', '')
    splitted_line = lines[i].split(',')
    if splitted_line[1] == '"Premium"':
        values.append(float(splitted_line[0]))
print(statistics.median(values))
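As a side note, the manual comparison against '"Premium"' only works because of how this particular file happens to be quoted; the csv module strips the quotes for you. A minimal sketch, assuming carat is column 0, cut is column 1, and the first line is a header:
import csv
import statistics

with open('diamonds.csv') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    values = [float(row[0]) for row in reader if row[1] == 'Premium']
print(statistics.median(values))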
Without an external lib.
diamonds_file = open('diamonds.csv', 'r')
lines = diamonds_file.readlines()
values = []
for i in range(len(lines)):
    lines[i] = lines[i].replace('\n', '')
    splitted_line = lines[i].split(',')
    if splitted_line[1] == '"Premium"':
        values.append(float(splitted_line[0]))
# a median needs the values in sorted order
values.sort()
n = len(values)
if n % 2 == 1:  # odd count: the middle element
    print(values[n // 2])
else:  # even count: the average of the two middle elements
    print((values[n // 2 - 1] + values[n // 2]) / 2)

Read the csv into pandas...
import pandas as pd
df = pd.read_csv('diamonds.csv', header=None)
If the csv has no headers, pass header=None (as above) and select columns by index number (this is what I do below), or rename the columns...and continue.
df_Premium = df[df[1] == 'Premium']
stats = df_Premium.describe()
display(stats)
The median will be the 50% row of the stats printed out.
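If you only need the single number rather than the whole describe() table, you can also take it directly; a small sketch, assuming the carat values end up in column 0 and the file really has no header row:
# median of the carat column (index 0) for the Premium rows
print(df_Premium[0].median())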

Please use the pandas library; it is a data analysis library.
import pandas as pd
df = pd.read_csv("diamonds.csv")
And you can see the uniform table stored in a dataframe df.
Now you want the median of a specific metric:
df.groupby('cut').median()
which shows the median of every numerical metric.
Now, indicate the specific column that you need:
df.groupby('cut').median()['carat']
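One caveat: on newer pandas releases (2.0 and later), a bare .median() on a groupby that still carries non-numeric columns raises an error, so it is safer to select the column before aggregating:
# select the column first, then aggregate
df.groupby('cut')['carat'].median()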

def premiummedian(splitted_lines):
    premium_carat = []
    for line in splitted_lines:
        if line[1] == "Premium":
            premium_carat.append(float(line[0]))
    # the values must be sorted before taking the median
    premium_carat.sort()
    n = len(premium_carat)
    if n % 2 == 1:  # odd length: return the middle element
        return premium_carat[n // 2]
    else:  # even length: return the average of the two middle elements
        return (premium_carat[n // 2 - 1] + premium_carat[n // 2]) / 2
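A minimal sketch of how premiummedian could be called with rows parsed from the file; the quote-stripping is an assumption based on the output shown in the question:
with open('diamonds.csv') as f:
    # strip the newline and the quotes, then split into fields
    rows = [line.strip().replace('"', '').split(',') for line in f]
print(premiummedian(rows[1:]))  # rows[0] is the header row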

Related

sort one giant string into 7 columns

I have a file which I read in as a string. In Sublime the file looks like this:
Filename
Dataset
Level
Duration
Accuracy
Speed Ratio
Completed
file_001.mp3
datasetname_here
value
00:09:29
0.00%
7.36x
2019-07-18
file_002.mp3
datasetname_here
value
00:22:01
...etc.
in Bash:
['Filename\n', 'Dataset\n', 'Level\n', 'Duration\n', 'Accuracy\n', 'Speed Ratio\n', 'Completed\n', 'file_001.mp3\n', 'datasetname_here\n', 'value\n', '00:09:29\n', '0.00%\n', '7.36x\n', '2019-07-18\n', 'file_002.mp3\n', 'datasetname_here\n', 'L1\n', '00:20:01\n', ...etc.
I want to split this into a 7 column csv. As you can see, the fields repeat every 7th line. I know I can use a for loop and modulus to read each line. I have done this successfully before.
How can I use pandas to read things into columns?
I don't know how to approach the Pandas library. I have looked at other examples and all seem to start with csv.
import sys
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('file', help="this is the file you want to open")
args = parser.parse_args()
print("file name:", args.file)
with open(args.file, 'r') as word:
    print(word.readlines())  # here is where I was making sure it read in properly
    # here is where I will start to manipulate the data
First remove '\n':
raw_data = ['Filename\n', 'Dataset\n', 'Level\n', 'Duration\n', 'Accuracy\n', 'Speed Ratio\n', 'Completed\n', 'file_001.mp3\n', 'datasetname_here\n', 'value\n', '00:09:29\n', '0.00%\n', '7.36x\n', '2019-07-18\n', 'file_002.mp3\n', 'datasetname_here\n', 'L1\n', '00:20:01\n', '0.01%\n', '7.39x\n', '2019-07-20\n']
raw_data = [string.replace('\n', '') for string in raw_data]
Then pack your data into 7-length arrays inside a big array:
data = [raw_data[x:x+7] for x in range(0, len(raw_data), 7)]
Finally, read your data into a DataFrame; the first row contains the names of the columns:
import pandas as pd

df = pd.DataFrame(data[1:], columns=data[0])
print(df.to_string())
Filename Dataset Level Duration Accuracy Speed Ratio Completed
0 file_001.mp3 datasetname_here value 00:09:29 0.00% 7.36x 2019-07-18
1 file_002.mp3 datasetname_here L1 00:20:01 0.01% 7.39x 2019-07-20
Try this:
import pandas as pd

with open("data.txt") as f:
    list_str = f.readlines()
list_str = [s.strip() for s in list_str]  # remove '\n' (a map object cannot be sliced in Python 3)
n = 7
list_str = [list_str[k:k+n] for k in range(0, len(list_str), n)]
df = pd.DataFrame(list_str[1:])
df.columns = list_str[0]
df.to_csv("Data_generated.csv", index=False)
Pandas is not a library for reading data into columns as such; it supports reading and writing many formats (one of them being comma-separated values) and is mainly used as a Python-based data analysis tool.
The best place to learn is the documentation, plus practice.
I think you don't have to use pandas or any other library. My approach:
data = []
row = []
with open(args.file, 'r') as file:
    for line in file:
        row.append(line.strip())  # strip the trailing '\n'
        if len(row) == 7:
            data.append(row)
            row = []
How does it work?
The for loop reads the file line by line.
Add the line to row
When row's length is 7, it's completed and you can add the row to data
Create a new list for row
Repeat
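To produce the 7-column csv the question actually asks for, the collected rows can then be written out. A small sketch, where out.csv is a made-up output name and the first chunk is assumed to hold the column headers:
import csv

# data[0] is the header row, the remaining entries are records
with open('out.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    writer.writerows(data)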

How to print max and min value from a long file?

So I'm having a problem printing the max and min values from a file. The file has over 3000 lines and looks like this:
3968 #number of lines
2000-01-03, 3.738314
2000-01-04, 3.423135
2000-01-05, 3.473229
...
...
2015-10-07, 110.779999
2015-10-08, 109.50
2015-10-09, 112.120003
So this is my current code. I have no idea how to make it work: right now it only prints the 3968 value (obviously, because it is the largest), but I want the largest and smallest values from the second column (all the stock prices).
def apple():
    stock_file = open('apple_USD.txt', 'r')
    data = stock_file.readlines()
    data = data[0:]
    stock_file.close()
    print(max(data))
Your current code outputs the "correct" output by chance, since it is using string comparison.
Consider this:
with open('test.txt') as f:
    lines = [line.split(', ') for line in f.readlines()[1:]]

# lines is a list of lists; each sub-list represents a line in the format [date, value]
max_value_date, max_value = max(lines, key=lambda line: float(line[-1].strip()))
print(max_value_date, max_value)
# '2015-10-09' '112.120003'
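The minimum works the same way, reusing the same lines list:
# same key function, min() instead of max()
min_value_date, min_value = min(lines, key=lambda line: float(line[-1].strip()))
print(min_value_date, min_value)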
Your current code reads each line as a string and then finds the max and min lines in your list. You can use pandas to read the file as a CSV, load it into a data frame, and then do your min/max operations on the data frame.
Hope the following answers your question:
stocks = []
data = data[1:]
for d in data:
    stocks.append(float(d.split(',')[1]))
print(max(stocks))
print(min(stocks))
I recommend the pandas module to work with tabular data; use the read_csv function. It is very well documented, optimized, and very popular for these purposes. You can install it with pip using pip install pandas.
I created a dummy file with your format and stored it in a file called test.csv:
3968 #number of lines
2000-01-03, 3.738314
2000-01-04, 3.423135
2000-01-05, 3.473229
2015-10-07, 110.779999
2015-10-08, 109.50
2015-10-09, 112.120003
Then, to parse the file you can do as follows. The names parameter defines the names of the columns; skiprows lets you skip the first line.
#import module
import pandas as pd
#load file
df = pd.read_csv('test.csv', names=['date', 'value'], skiprows=[0])
#get max and min values
max_value = df['value'].max()
min_value = df['value'].min()
You want to extract the second column into a float using float(datum.split(', ')[1].strip()), and ignore the first line.
def apple():
    stock_file = open('apple_USD.txt', 'r')
    data = stock_file.readlines()
    data = data[1:]  # ignore first line
    stock_file.close()
    data = [datum.split(', ') for datum in data]
    max_value_date, max_value = max(data, key=lambda datum: float(datum[-1].strip()))
    print(max_value_date, max_value)
or you can do it in a simpler way: make a list of prices and then get the maximum and minimum, like this:
# as the first line in your txt is not data
datanew = data[1:]
prices = []
for line in datanew:
    line_after = line.split(',')
    price = line_after[1]
    prices.append(float(price))
maxprice = max(prices)
minprice = min(prices)
print(maxprice, minprice)

Count and compare occurrences across different columns in different spreadsheets

I would like to know (in Python) how to count occurrences and compare values from different columns in different spreadsheets. After counting, I would need to know if those values fulfill a condition, i.e. if Ana (a user) from the first spreadsheet appears 1 time in the second spreadsheet and 5 times in the third one, I would like to add 1 to a variable X.
I am new to Python, but I have tried getting the .values() after using the Counter from collections. However, I am not sure the actual value Ana is being considered when iterating over the results of the Counter. All in all, I need to iterate over each element in spreadsheet one and see if it appears one time in the second spreadsheet and five times in the third spreadsheet; if that happens, the variable X is incremented by one.
def XInputOutputs():
    list1 = []
    with open(file1, 'r') as fr:
        r = csv.reader(fr)
        for row in r:
            list1.append(row[1])
    number_of_occurrences_in_list_1 = Counter(list1)
    list1_ocurrences = number_of_occurrences_in_list_1.values()

    list2 = []
    with open(file2, 'r') as fr:
        r = csv.reader(fr)
        for row in r:
            list2.append(row[1])
    number_of_occurrences_in_list_2 = Counter(list2)
    list2_ocurrences = number_of_occurrences_in_list_2.values()

    X = 0
    for x, y in zip(list1_ocurrences, list2_ocurrences):
        if x == 1 and y == 5:
            X += 1
    return X
I tested with small spreadsheets, but this only works for pre-ordered values. If Ana appears after 100000 rows, everything breaks. I think I need to iterate over each value (Ana) and check it in all the spreadsheets simultaneously, then increment the variable X.
I am at work, so I will be able to write a full answer only later.
If you can import modules, I suggest you try pandas: a really useful tool to quickly and efficiently manage data frames.
You can easily load a .csv spreadsheet with the read_csv method:
import pandas as pd
df = pd.read_csv('your_file.csv')  # the file name here is a placeholder
and then perform almost any kind of operation.
Check this answer out; I only had a little time to read it, but I hope it helps:
what is the most efficient way of counting occurrences in pandas?
UPDATE: then try with this
# not tested but should work
import os
import pandas as pd

# read all csv sheets from folder - I assume your folder is named "CSVs" -
# and generate a list of dataframes, one per file
df_list = []
for root, dirs, files in os.walk("CSVs"):
    for file in files:
        df_list.append(pd.read_csv(os.path.join(root, file)))

name_i_wanna_count = ""  # this will be your query
column_name = ""  # here insert the column you wanna analyze
count = 0
for df in df_list:
    # retrieve the rows matching your query, then count them
    matching_rows = df.loc[df[column_name] == name_i_wanna_count]
    count += len(matching_rows)
print(count)
I hope it helps
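For the exact condition in the question (appears once in the second sheet and five times in the third), here is a hedged sketch with Counter; the file names and the column index are assumptions:
import csv
from collections import Counter

def column_counter(path, col=1):
    # count how often each value appears in the given column of a csv
    with open(path, 'r') as f:
        return Counter(row[col] for row in csv.reader(f))

first = column_counter('file1.csv')
second = column_counter('file2.csv')
third = column_counter('file3.csv')

# add 1 to X for every value from the first sheet that appears exactly
# once in the second and exactly five times in the third
X = sum(1 for name in first if second[name] == 1 and third[name] == 5)
print(X)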

Column statistics from given input file?

I am given a .txt file of data:
1,2,3,0,0
1,0,4,5,0
1,1,1,1,1
3,4,5,6,0
1,0,1,0,3
3,3,4,0,0
My objective is to calculate the min, max, avg, range, and median of the columns of the given data and write them to an output .txt file.
My logic in approaching this question is as follows
Step 1) Read the data
infile = open("Data.txt", "r")
tempLine = infile.readline()
while tempLine:
    print(tempLine.split(','))
    tempLine = infile.readline()
Obviously it's not perfect but the idea is that the data can be read by this...
Step 2) Store the data into corresponding list variables? row1, row2,... row6
Step 3) Combine above lists all into one, giving a final list like this...
flist =[[1,2,3,0,0],[1,0,4,5,0],[1,1,1,1,1],[3,4,5,6,0],[1,0,1,0,3],[3,3,4,0,0]]
Step 4) Using nested for loop, access elements individually and store them into list variables
col1, col2, col3, ... , col5
Step 5) Calculate min, max etc and write to output file
My question is, with my rather beginner knowledge of computer science and Python: is this logic inefficient, and could there be an easier, better approach to this problem?
My main problem is probably steps 2 through 5. The rest I know how to do for sure.
Any advice would be helpful!
Try numpy. The numpy library provides fast operations when dealing with nested lists inside a list or, simply, matrices.
To use numpy, you must import numpy at the beginning of your code.
numpy.matrix('1,2,3,0,0; 1,0,4,5,0; ...; 3,3,4,0,0')
will give you
flist = [[1,2,3,0,0],[1,0,4,5,0],[1,1,1,1,1],[3,4,5,6,0],[1,0,1,0,3],[3,3,4,0,0]] straight off the bat.
Also, you can aggregate along an axis (axis=0 goes down the rows, giving per-column results) and get the mean, min, and max easily using
max([axis, out]) Return the maximum value along an axis.
mean([axis, dtype, out]) Returns the average of the matrix elements along the given axis.
min([axis, out]) Return the minimum value along an axis.
This is from https://docs.scipy.org/doc/numpy/reference/generated/numpy.matrix.html, a numpy document, so for more information, please read the numpy document.
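Putting this together, a minimal runnable sketch with the data from the question (note that numpy nowadays recommends plain arrays over numpy.matrix):
import numpy as np

m = np.matrix('1,2,3,0,0; 1,0,4,5,0; 1,1,1,1,1; 3,4,5,6,0; 1,0,1,0,3; 3,3,4,0,0')
print(m.min(axis=0))                   # per-column minimum
print(m.max(axis=0))                   # per-column maximum
print(m.mean(axis=0))                  # per-column mean
print(m.max(axis=0) - m.min(axis=0))   # per-column range
print(np.median(m, axis=0))            # per-column median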
To get the data I would do something like this (the split values are strings, so they have to be converted to numbers, and since the stats are per column the rows are transposed first):
from statistics import median

infile = open("Data.txt", "r")
rows = [[int(v) for v in line.split(',')] for line in infile]
infile.close()
cols = list(zip(*rows))  # transpose: each entry is now one column
for col in cols:
    minCol = min(col)
    maxCol = max(col)
    avgCol = sum(col) / len(col)
    rangeCol = maxCol - minCol
    medianCol = median(col)
    # then write the data to the output file
You can use the pandas library for this (http://pandas.pydata.org/)
The code below worked for me:
import pandas as pd
df = pd.read_csv('data.txt',header=None)
somestats = df.describe()
somestats.to_csv('dataOut.txt')
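The median appears as the 50% row of describe(). If only the medians are wanted, they can also be taken directly on the same df:
medians = df.median()  # one median per column
print(medians)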
This is how I ended up doing it, if anyone is curious:
import numpy

infile = open("Data1.txt", "r")
outfile = open("ColStats.txt", "w")
oMat = numpy.loadtxt(infile, delimiter=',')  # the data is comma separated
tMat = numpy.transpose(oMat)  # columns of oMat become rows and rows become columns
#print(tMat)
for x in range(5):  # one iteration per column
    tempM = tMat[x]
    mn = min(tempM)
    mx = max(tempM)
    avg = sum(tempM) / len(tempM)
    rng = mx - mn
    median = numpy.median(tempM)
    out = "[{} {} {} {} {}]".format(mn, mx, avg, rng, median)
    outfile.write(out + '\n')
infile.close()
outfile.close()

How to Perform Mathematical Operation on One Value of a CSV file?

I am dealing with a csv file that contains three columns and three rows containing numeric data. The csv data file simply looks like the following:
Colum1,Colum2,Colum3
1,2,3
1,2,3
1,2,3
My question is how to write Python code that takes a single value from one of the columns and performs a specific operation. For example, let's say I want to take the first value in 'Colum1' and subtract it from the sum of all the values in the column.
Here is my attempt:
import csv
f = open('columns.csv')
rows = csv.DictReader(f)
value_of_single_row = 0.0
for i in rows:
    value_of_single_Row += float(i)  # trying to isolate a single value here!
print value_of_single_row - sum(float(r['Colum1']) for r in rows)
f.close()
Based on the code you provided, I suggest you take a look at the docs to see the preferred approach for reading through a csv file. Take a look here:
How to use CsvReader
With that being said, you can modify the beginning of your code slightly, to this:
import csv

with open('data.csv', 'rb') as f:
    rows = csv.DictReader(f)
    for row in rows:
        # perform operation per row
        pass
From there you now have access to each row.
This should give you what you need to do proper row-by-row operations.
What I suggest you do is play around with printing out your rows to see what your data looks like. You will see that each row being output is a dictionary.
So if you were going through each row, you can simply do something like this:
for row in rows:
    row['Colum1']  # or row.get('Colum1')
    # to do some math to add everything in Colum1
    s += float(row['Colum1'])
So all of that will look like this:
import csv

s = 0
with open('data.csv', 'rb') as f:
    rows = csv.DictReader(f)
    for row in rows:
        s += float(row['Colum1'])
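For the specific operation in the question, the column can be collected into a list first; a small sketch under the same file assumptions (written Python 3 style):
import csv

with open('data.csv') as f:
    col = [float(row['Colum1']) for row in csv.DictReader(f)]
# subtract the first value from the sum of all values in the column
print(sum(col) - col[0])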
You can do pretty much all of this with pandas:
import pandas as pd

Location = r'path/test.csv'
df = pd.read_csv(Location, names=['Colum1','Colum2','Colum3'])
df = df[1:]  # remove the header row, since it was read in as data
print df
df.loc[1, 'Colum1'] = int(df.loc[1, 'Colum1']) + 5
print df
You can write back to your csv using df.to_csv('File path', index=False, header=True). Having header=True will add the headers back in.
To do this more along the lines of what you have, you can do it like this:
Location = r'C:/Users/tnabrelsfo/Documents/Programs/Stack/test.csv'
data = []
with open(Location, 'r') as f:
    for line in f:
        data.append(line.replace('\n', '').replace(' ', '').split(','))
data = data[1:]
print data
data[1][1] = 5
print data
It will read in each row, cut out the column names, and then you can modify the values by index.
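A small follow-up sketch for writing the modified rows back out, putting the header line back on top (the output file name is made up):
with open('test_out.csv', 'w') as f:
    f.write('Colum1,Colum2,Colum3\n')  # restore the header
    for row in data:
        f.write(','.join(str(v) for v in row) + '\n')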
So here is my simple solution using the pandas library. Suppose we have a file sample.csv:
import pandas as pd
df = pd.read_csv('sample.csv') # df is now a DataFrame
df['Colum1'] = df['Colum1'] - df['Colum1'].sum() # here we replace the column by subtracting sum of value in the column
print df
df.to_csv('sample.csv', index=False) # save dataframe back to csv file
You can also use the map function to apply an operation to one column, for example:
import pandas as pd
df = pd.read_csv('sample.csv')
col_sum = df['Colum1'].sum() # sum of the first column
df['Colum1'] = df['Colum1'].map(lambda x: x - col_sum)
