CSV Python list - python

Name Gender Physics Maths
A 45 55
X 22 64
C 0 86
I have a CSV file like this. I have processed it to get a list containing only the marks, in the form [[45,55],[22,64]].
I want to find the minimum for each subject.
But when I run my code, I only get the minimum for the first subject, and the other values are copied from the row.
The answer I want - [0,55]
The answer I get - [0,86]
def find_min(marks,cols,rows):
    minimum = []
    temp = []
    for list in marks:
        min1 = min([x for x in list])
        minimum.append(min1)
    # for j in range(rows):
    #     for i in range(cols):
    #         temp.append(marks)
    #         x = min(temp)
    #         minimum.append(x)
    return minimum
How do I modify my code?
I can't use any other modules/libraries like csv or pandas.
I tried using zip(*marks), but that just prints my marks list as is.
Is there any way to separate the inner lists from the larger list?

This will calculate the minimum per subject:
In [707]: marks = [[45,55],[22,64]]
In [697]: [min(idx) for idx in zip(*marks)]
Out[697]: [22, 55]

Try transposing the marks array (which is one student per row) so each list entry corresponds to a column ("subject") from your CSV:
def find_min(marks):
    mt = zip(*marks)
    mins = [min(row) for row in mt]
    return mins
example usage:
marks = [[45,55],[22,64],[0,86]]
print(find_min(marks))
which prints:
[0, 55]
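To make the difference between the two approaches concrete, here is a minimal sketch that contrasts a per-row minimum (what a loop over the inner lists computes) with a per-column minimum obtained through zip(*marks). The variable names row_mins, columns and col_mins are only illustrative.

```python
marks = [[45, 55], [22, 64], [0, 86]]

# Row-wise: one minimum per student (a loop over the inner lists does this)
row_mins = [min(row) for row in marks]

# Column-wise: zip(*marks) regroups the values by subject first
columns = list(zip(*marks))            # [(45, 22, 0), (55, 64, 86)]
col_mins = [min(col) for col in columns]

print(row_mins)  # [45, 22, 0]
print(col_mins)  # [0, 55]
```

Only built-in functions are used, so this stays within the "no csv or pandas" restriction from the question.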


How to select list elements based on criteria from other lists

I am new to Python, coming from SciLab (an open source MatLab ersatz), which I am using as a toolbox for my analyses (test data analysis, reliability, acoustics, ...); I am definitely not a computer science lad.
I have data in the form of lists of same length (vectors of same size in SciLab).
I use some of them as parameter in order to select data from another one; e.g.
t_v = [1:10]; // a parameter vector
p_v = [20:29]; // another parameter vector
res_v(t_v > 5 & p_v < 28); // the elements of res_v whose corresponding t_v and p_v values meet my criteria; I can use them for analyses.
This is very direct and simple in SciLab; I have not found a way to achieve the same in Python, either "Pythonically" or simply translated.
Any idea that could help me, please?
Have a nice day,
Patrick.
You could use numpy arrays. It's easy:
import numpy as np
par1 = np.array([1,1,5,5,5,1,1])
par2 = np.array([-1,1,1,-1,1,1,1])
data = np.array([1,2,3,4,5,6,7])
print(par1)
print(par2)
print(data)
bool_filter = (par1 > 1) & (par2 < 0)
# example of filtering directly with one condition
filtered_data = data[par1 > 1]
print(filtered_data)
# filtering with the two parameters
filtered_data_twice = data[bool_filter]
print(filtered_data_twice)
output:
[1 1 5 5 5 1 1]
[-1 1 1 -1 1 1 1]
[1 2 3 4 5 6 7]
[3 4 5]
[4]
Note that the filtered arrays are shorter than the originals: only the elements that satisfy the condition are kept.
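If NumPy is not an option, the same kind of selection can be sketched with plain lists by walking the three lists in parallel with zip. The values of res_v below are hypothetical placeholders (the question does not give them); t_v and p_v mirror the SciLab ranges [1:10] and [20:29].

```python
t_v = list(range(1, 11))     # 1..10, like [1:10] in SciLab
p_v = list(range(20, 30))    # 20..29, like [20:29]
res_v = list(range(30, 40))  # hypothetical result values of the same length

# Keep the res_v elements whose matching t_v and p_v values meet both criteria
selected = [r for t, p, r in zip(t_v, p_v, res_v) if t > 5 and p < 28]
print(selected)  # [35, 36, 37]
```

This reads close to the SciLab expression res_v(t_v > 5 & p_v < 28), at the cost of an explicit loop over the elements.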
Here's my modified solution according to your last comment.
t_v = list(range(1,10))
p_v = list(range(20,29))
res_v = list(range(30,39))
def first_idex_greater_than(search_number, lst):
    for count, number in enumerate(lst):
        if number > search_number:
            return count
def first_idex_lower_than(search_number, lst):
    for count, number in enumerate(lst[::-1]):
        if number < search_number:
            return len(lst) - count  # since I searched lst from the end,
                                     # I also need to reverse count
t_v_index = first_idex_greater_than(5, t_v)
p_v_index = first_idex_lower_than(28, p_v)
print(res_v[min(t_v_index, p_v_index):max(t_v_index, p_v_index)])
It prints the list [35, 36, 37].
I'm sure you can optimize it better according to your needs.
The problem statement is not clearly defined, but this is what I interpret to be a likely solution.
import pandas as pd
tv = list(range(1, 11))
pv = list(range(20, 30))
res = list(range(30, 40))
df = pd.DataFrame({'tv': tv, 'pv': pv, 'res': res})
print(df)
def criteria(row, col1, a, col2, b):
    if (row[col1] > a) & (row[col2] < b):
        return True
    else:
        return False
df['select'] = df.apply(lambda row: criteria(row, 'tv', 5, 'pv', 28), axis=1)
selected_res = df.loc[df['select']]['res'].tolist()
print(selected_res)
# ... or another way ..
print(df.loc[(df.tv > 5) & (df.pv < 28)]['res'])
This builds a DataFrame whose columns are the original lists, applies the selection criteria to columns tv and pv row by row, and stores the result in a new boolean column, select, that marks the rows where both conditions hold. After the full table, the last two print calls show the selected values:
[35, 36, 37]
5 35
6 36
7 37

Remove following rows that are above or under by X amount from the current row['x']

I am calculating correlations and the data frame I have needs to be filtered.
I am looking to remove the rows under the current row from the data frame that are above or under by X amount starting with the first row and looping through the dataframe all the way until the last row.
example:
df['y'] has the values 50,51,52,53,54,55,70,71,72,73,74,75
If X = 10, it would start at 50, see 51, 52, 53, 54 and 55 as within the ±10 range, and delete those rows. 70 would stay, as it is not within that range, and the same test would start again at 70, where 71, 72, 73, 74, 75 and their respective rows would be deleted.
With X = 10, the filter would thus leave us with the rows containing 50 and 70.
It would leave me with a clean dataframe that deletes the instances that are linked to the first instance of what is essentially the same observed period. I tried coding a loop to do that but I am left with the wrong result and desperate at this point. Hopefully someone can correct the mistake or point me in the right direction.
df6['index'] = df6.index
df6.sort_values('index')
boom = len(dataframe1.index)/3
#Taking initial comparison values from first row
c = df6.iloc[0]['index']
#Including first row in result
filters = [True]
#Skipping first row in comparisons
for index, row in df6.iloc[1:].iterrows():
    if c-boom <= row['index'] <= c+boom:
        filters.append(False)
    else:
        filters.append(True)
        # Updating values to compare based on latest accepted row
        c = row['index']
df2 = df6.loc[filters].sort_values('correlation').drop('index', 1)
df2
IIUC, your main issue is to filter consecutive values within a threshold.
You can use a custom function that acts on a Series (= column) and returns the list of valid indices:
def consecutive(s, threshold = 10):
    prev = float('-inf')
    idx = []
    # Series.items() replaces iteritems(), which was removed in pandas 2.0
    for i, val in s.items():
        if val-prev > threshold:
            idx.append(i)
            prev = val
    return idx
Example of use:
import pandas as pd
df = pd.DataFrame({'y': [50,51,52,53,54,55,70,71,72,73,74,75]})
df2 = df.loc[consecutive(df['y'])]
Output:
y
0 50
6 70
Variant
If you prefer the function to return a boolean indexer, here is a variant (it assumes the default integer index, because idx[i] uses the label as a position):
def consecutive(s, threshold = 10):
    prev = float('-inf')
    idx = [False]*len(s)
    for i, val in s.items():
        if val-prev > threshold:
            idx[i] = True
            prev = val
    return idx
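The functions above compare val-prev, which works for ascending data. Since the question asks about values "above or under" the reference by X, a sketch using the absolute difference may be closer to the intent. The helper name keep_spaced is hypothetical, and it works on any plain sequence of numbers rather than on a Series.

```python
def keep_spaced(values, threshold=10):
    """Return positions of values that differ by more than `threshold`
    (in either direction) from the most recently kept value."""
    kept = []
    prev = None  # last kept value; None until the first row is accepted
    for pos, val in enumerate(values):
        if prev is None or abs(val - prev) > threshold:
            kept.append(pos)
            prev = val
    return kept

y = [50, 51, 52, 53, 54, 55, 70, 71, 72, 73, 74, 75]
positions = keep_spaced(y)
print(positions)                  # [0, 6]
print([y[p] for p in positions])  # [50, 70]
```

Because the result is a list of positions, it could be passed to df.iloc[...] to select the matching rows, regardless of how the DataFrame is indexed.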

Python Pandas: Find a pattern in a DataFrame

I have the following DataFrame (1.2 million rows):
df_test_2 = pd.DataFrame({"A":["end","beginn","end","end","beginn","beginn","end","end","end","beginn","end"],"B":[1,10,50,60,70,80,90,100,110,111,112]})
Now I am trying to find sequences. Each "beginn" should match the first "end" where the distance, based on column B, is at least 40.
For the provided Dataframe that would mean:
The sould problem is that
Your help is highly appreciated.
I will assume that as your output you want a list of sequences with the starting and ending value. The second sequence that you identify in your picture has a distance lower than 40, so I also assumed that that was an error.
import pandas as pd
from collections import namedtuple
df_test_2 = pd.DataFrame({"A":["end","beginn","end","end","beginn","beginn","end","end","end","beginn","end"],"B":[1,10,50,60,70,80,90,100,110,111,112]})
sequence_list = []
Sequence = namedtuple('Sequence', ['beginn', 'end'])
beginn_flag = False
beginn_value = 0
for i, row in df_test_2.iterrows():
    state = row['A']
    value = row['B']
    if not beginn_flag and state == 'beginn':
        beginn_flag = True
        beginn_value = value
    elif beginn_flag and state == 'end':
        if value >= beginn_value + 40:
            new_seq = Sequence(beginn_value, value)
            sequence_list.append(new_seq)
            beginn_flag = False
print(sequence_list)
This code outputs the following:
[Sequence(beginn=10, end=50), Sequence(beginn=70, end=110)]
Two sequences, one starting at 10 and ending at 50 and the other one starting at 70 and ending at 110.
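With 1.2 million rows, iterrows() can be slow, because it builds a Series object for every row. The same matching logic can be sketched over two plain sequences walked in parallel with zip. The function name find_sequences and the min_distance parameter are hypothetical, and the lists below simply repeat the sample data from the question.

```python
def find_sequences(states, values, min_distance=40):
    """Pair each open 'beginn' with the first later 'end' that is
    at least `min_distance` away."""
    sequences = []
    start = None  # value of the currently open 'beginn', if any
    for state, value in zip(states, values):
        if start is None and state == "beginn":
            start = value
        elif start is not None and state == "end" and value >= start + min_distance:
            sequences.append((start, value))
            start = None
    return sequences

states = ["end", "beginn", "end", "end", "beginn", "beginn",
          "end", "end", "end", "beginn", "end"]
values = [1, 10, 50, 60, 70, 80, 90, 100, 110, 111, 112]
print(find_sequences(states, values))  # [(10, 50), (70, 110)]
```

With a DataFrame, the two columns (for example df_test_2["A"] and df_test_2["B"]) could be passed in directly; iterating over columns this way is usually faster than iterating over rows.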

Math error using data from sqlite3 in python program?

from math import *
import sqlite3
conn = sqlite3.connect('person.sqlite3')
def main():
    agelist = conn.execute("SELECT age from person where age!='NA'")
    ages = []
    for row in agelist: ages += [row [0]]
    sumthis = []
    for row in agelist:
        sumthis += [row[0**2]]
    sqrted=sum(sumthis)
    print(sqrted)
I am trying to square every row of data in agelist, and find the sum of all of those squared numbers. Right now this is giving me 0 as an answer. I want sum(age^2 for each age in ages list)
How can I correct this?
I think you should replace
sumthis += [row[0**2]]
with
sumthis += [row[0]**2]
or, more appropriately,
sumthis.append(row[0]**2)
That's because building a new list and concatenating two lists at every iteration isn't a good idea.
For the same reason, change
for row in agelist: ages += [row [0]]
To:
for row in agelist: ages.append(row [0])
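One more point that may explain the result of 0: agelist is a cursor, and a cursor can only be iterated once. The first loop consumes every row, so the second loop over agelist has nothing left to process and sumthis stays empty. A minimal sketch of one way around this is to read the ages into a list once and compute the sum from that list. The in-memory table and its rows below are made up purely for illustration, and the sketch assumes the remaining age values are numeric.

```python
import sqlite3

# In-memory database with made-up rows, only for illustration
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (age)")
conn.executemany("INSERT INTO person VALUES (?)", [(10,), ("NA",), (20,), (30,)])

cursor = conn.execute("SELECT age FROM person WHERE age != 'NA'")
ages = [row[0] for row in cursor]  # consume the cursor exactly once

# Square each age (row[0] ** 2, not row[0 ** 2]) and add the results up
sum_of_squares = sum(age ** 2 for age in ages)
print(sum_of_squares)  # 1400
conn.close()
```

Keeping the values in the ages list means they can be reused for further calculations without querying the database again.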

Python - Shuffling a list with constraints

I've been working for a couple of months, on and off, on a script to shuffle a list in a textfile. I am a beginner in Python (the only language I sort of understand a bit), and after a while I have managed to come up with a few lines of code which do sort of what I need.
The input file I have is a tab-separated list. It has 5 words per row, but I'll use numbers so it looks clearer in the example:
01 02 03 04 05
06 07 08 09 10
11 12 13 14 15
16 17 18 19 20
21 22 23 24 25
Now, after a few efforts and a huge amount of work from SO users, I've managed to shuffle these elements so that they don't appear in the same line as their original "partners". This is the code I'm using:
import csv, io  # io.StringIO replaces the Python 2 StringIO module
import random
from random import shuffle
datalist = open('lista.txt', 'r')
leyendo = datalist.read()
separando = csv.reader(io.StringIO(leyendo), delimiter = '\t')
macrolist = list(separando)
l = [group[:] for group in macrolist]
random.shuffle(l)
nicendone = []
prev_i = -1
while any(a for a in l):
    new_i = max(((i,a) for i,a in enumerate(l) if i != prev_i), key=lambda x: len(x[1]))[0]
    nicendone.append(l[new_i].pop(random.randint(0, len(l[new_i]) - 1)))
    prev_i = new_i
with open('randolista.txt', 'w') as newdoc:
    for i, m in enumerate(nicendone, 1):
        newdoc.write(m + [', ', '\n'][i % 5 == 0])
datalist.close()
This does the job, but what I actually need is a bit more complicated. I need to shuffle the list with the following restrictions:
The words in the first and second column should be shuffled ONLY within their own column.
The new randomised list should have no two elements appearing in the same line again.
What I'd like to get is something like the following:
01 17 25 19 13
16 22 13 03 20
etc
So that items in the first and second column are only shuffled within their own columns, and no two items are in the same row in the output that were in the same row in the input. I realise in a 5 row example this last constraint is constantly broken, but the real input file has 100 rows.
I really don't know how to even start doing this. My programming abilities are limited, but the problem is that I can't even come up with a pseudocode for it. How can I make Python identify the elements of the first two columns so that it only shuffles them vertically?
Thanks in advance
Shuffling the first two columns so that two values that used to be on the same row no longer appear on the same row can be accomplished by cyclically shifting each column by a random number of rows. For example, you could push the first column 20 rows down and the second column 10 rows down, where 20 and 10 are random integers smaller than the number of rows.
Sample code that randomizes the first two columns:
from random import sample
text = \
"""a b c d e
f g h i j
k l m n o
p q r s t"""
# Translate file to matrix (list of lists)
matrix = [line.split(" ") for line in text.split("\n")]
# Determine height and width of matrix
height = len(matrix)
width = len(matrix[0])
# Choose two (unique) numbers for shifting the first two columns
transpose_list = sample(range(0, height), 2)
# Now build a new matrix, shifting only the first two
# columns.
new_matrix = []
for y in range(0, height):
    row = []
    for x in range(0, 2):
        transpose = (y + transpose_list[x]) % height
        row.append(matrix[transpose][x])
    for x in range(2, width):
        row.append(matrix[y][x])
    new_matrix.append(row)
# And create a text again
new_text = "\n".join(" ".join(row) for row in new_matrix)
print(new_text)
This results in something like:
a l c d e
f q h i j
k b m n o
p g r s t
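One detail worth noting: when the offsets are drawn from the full range starting at 0, one column can receive offset 0 and stay in place, as the first column does in the output above. Its values then remain on the same rows as their original partners in columns 3 to 5. A possible refinement, sketched here under the assumption that there are at least three rows, is to draw two distinct offsets from 1 to height - 1. The function name shift_first_two_columns is hypothetical.

```python
import random

def shift_first_two_columns(matrix):
    """Cyclically shift columns 0 and 1 by two different non-zero offsets."""
    height = len(matrix)
    if height < 3:
        raise ValueError("need at least 3 rows for two distinct non-zero offsets")
    # Distinct offsets in 1..height-1: neither column keeps its original rows,
    # and the two columns do not move together.
    offsets = random.sample(range(1, height), 2)
    result = []
    for y, row in enumerate(matrix):
        new_row = [matrix[(y + offsets[x]) % height][x] for x in range(2)]
        new_row.extend(row[2:])  # remaining columns are left untouched here
        result.append(new_row)
    return result

text = "a b c d e\nf g h i j\nk l m n o\np q r s t"
matrix = [line.split(" ") for line in text.split("\n")]
shifted = shift_first_two_columns(matrix)
for row in shifted:
    print(" ".join(row))
```

Columns 3 to 5 are not shuffled in this sketch; they would still need the separate randomization mentioned in the question.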
If I understand your post correctly, you already have an algorithm for randomizing the rest of the table?
I hope this helps :-).
Wout
