What I want to do is save part of the dataframe into a list.
I have my DataFrame data_S and when it is printed it looks like this:
data_S
open high low close volume datetime
0 329.30 334.860 327.8800 333.74 5357973.0 1.578290e+12
1 334.26 344.190 330.7100 337.28 9942277.0 1.578377e+12
2 332.40 334.030 329.6000 331.37 8246250.0 1.578463e+12
3 334.95 341.730 332.0500 336.34 8183707.0 1.578550e+12
4 335.56 337.700 329.4548 329.92 7170124.0 1.578636e+12
.. ... ... ... ... ... ...
249 216.36 218.554 214.3650 216.67 10812617.0 1.609308e+12
250 216.24 216.900 212.7000 214.06 10487567.0 1.609394e+12
251 210.00 210.200 202.4911 202.72 21225594.0 1.609740e+12
252 204.74 213.350 204.6000 211.63 19338304.0 1.609826e+12
253 210.22 215.610 209.3400 211.03 16202157.0 1.609913e+12
I want to be able to replicate the code below, swapping the bolded slice bounds (-5: and -6:-1) for a for-loop variable nums.
list_of_five_day_range = []
#so then it starts with the first list being the most recent and then in
#[X,Y,Z] Z is the most recent high or is in data_S[253]['high']
list_of_max_value = []
bars = data_S.iloc[-5:]['high']
list_of_five_day_range.append(list(bars))
max_value = bars.max()
list_of_max_value.append(max_value)
bars1 = data_S.iloc[-6:-1]['high']
list_of_five_day_range.append(list(bars1))
max_value1 = bars1.max()
list_of_max_value.append(max_value1)
max_id = bars.max()
# [X,Y,Z] Z is the most recent with the list at [0] is the most recent data
print(str(list_of_five_day_range) + " this is last 5 days of data")
# with the first number in the list is for the most recent first day high.
print(str(list_of_max_value)+" this is maxium number in last 5 days")
This returns:
[[218.554, 216.9, 210.2, 213.35, 215.61], [221.68, 218.554, 216.9, 210.2, 213.35]] this is last 5 days of data
[218.554, 221.68] this is maxium number in last 5 days
but in a function, and it should go through the whole DataFrame. This is what I have so far:
def five_day_range(price_history):
    for nums in range(len(price_history.index) - 1):
        list_of_five_day_range = []
        # so then it starts with the first list being the most recent and then [X,Y,Z] Z is the most recent
        list_of_max_value = []
        bars = price_history.iloc[-5 + int(-nums): int(-nums)]["high"]
        list_of_five_day_range.append(bars)
        max_value = bars.max()
        list_of_max_value.append(max_value)
        # print(str(bars) + " this is the very first list of range")
    return list_of_five_day_range, list_of_max_value
However, when I print(five_day_range(data_S)), this is what I get:
([0 334.86
1 344.19
Name: high, dtype: float64], [344.19])
I don't understand why it is printing this when the nums value should be moving up the DataFrame. This is how I thought the for loop would go through the DataFrame:
I thought it would first append the yellow data, then the blue and green, and so on until it hits index[0].
To answer your first question:
I believe data_S.drop(columns=['high', 'low']) returns a new DataFrame, which by default requires you to assign the result to a variable.
With that in mind, you can either do:
data_S = data_S.drop(columns=["high", "low"])
or
data_S.drop(columns=["high", "low"], inplace=True)
UPDATED based on updated question:
Try this, and your list should be returned as expected.
You were defining list_of_five_day_range and list_of_max_value as [] on every iteration of the loop, so they were reset each time. Hope that makes sense.
def five_day_range(price_history):
    list_of_five_day_range = []
    list_of_max_value = []
    for nums in range(len(price_history.index) - 1):
        bars = price_history.iloc[-5 + int(-nums): int(-nums)]["high"]
        list_of_five_day_range.append(bars)
        max_value = bars.max()
        list_of_max_value.append(max_value)
    return list_of_five_day_range, list_of_max_value
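As an aside, the same per-window maxima can be computed without an explicit loop via pandas' built-in rolling window. This is a sketch with made-up sample highs, not the asker's data:

```python
import pandas as pd

# Hypothetical sample of daily highs (not the asker's data_S)
data_S = pd.DataFrame({"high": [218.554, 216.9, 210.2, 213.35, 215.61,
                                221.68, 219.0, 217.5, 216.0, 214.0]})

# rolling(5).max() gives the maximum of each 5-bar window in one pass
rolling_max = data_S["high"].rolling(5).max()
print(rolling_max.iloc[-1])  # max of the most recent 5 highs -> 221.68
```

rolling_max.iloc[-1] corresponds to bars.max() for the most recent window; the earlier entries cover the shifted windows the loop builds one at a time.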
My current code works and produces a graph if there is only one sensor, i.e. if col2 and col3 are deleted from the example data provided below, leaving one column.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
d = {'col1': [-2587.944231, -1897.324231,-2510.304231,-2203.814231,-2105.734231,-2446.964231,-2963.904231,-2177.254231, 2796.354231,-2085.304231], 'col2': [-3764.468462,-3723.608462,-3750.168462,-3694.998462,-3991.268462,-3972.878462,3676.608462,-3827.808462,-3629.618462,-1841.758462,], 'col3': [-166.1357692,-35.36576923, 321.4157692,108.9257692,-123.2257692, -10.84576923, -100.7457692, 89.27423077, -211.0857692, 101.5342308]}
df = pd.DataFrame(data=d)
sensors = 3
window_size = 5
dfn = df.rolling(window_size).corr(pairwise = True)
index = df.index #index of values in the data frame.
rows = len(index) #len(index) returns number of rows in the data.
sensors = 3
baseline_num = [0]*(rows) #baseline numerator, by default zero
baseline = [0]*(rows) #initialize baseline value
baseline = pd.DataFrame(baseline)
baseline_num = pd.DataFrame(baseline_num)
v = [None]*(rows) # Initialize an empty array v[] equal to amount of rows in .csv file
s = [None]*(rows) #Initialize another empty array for the slope values for detecting when there is an exposure
d = [0]*(rows)
sensors_on = True #Is the sensor detecting something (True) or not (False).
off_count = 0
off_require = 8 # how many offs until baseline is updated
sensitivity = 1000
for i in range(0, rows): #This iterates over each index value, i.e. each row, and sums the values and returns them in list format.
    v[i] = dfn.loc[i].to_numpy().sum() - sensors

for colname, colitems in df.iteritems():
    for rownum, rowitem in colitems.iteritems():
        #d[rownum] = dfone.loc[rownum].to_numpy()
        #d[colname][rownum] = df.loc[colname][rownum]
        if v[rownum] >= sensitivity:
            sensors_on = True
            off_count = 0
            baseline_num[rownum] = 0
        else:
            sensors_on = False
            off_count += 1
            if off_count == off_require:
                for x in range(0, off_require):
                    baseline_num[colname][rownum] += df[colname][rownum - x]
            elif off_count > off_require:
                baseline_num[colname][rownum] += baseline_num[colname][rownum - 1] + df[colname][rownum] - (df[colname][rownum - off_require]) #this loop is just an optimization, one calculation per loop once the first calculation is established
            baseline[colname][rownum] = (baseline_num[colname][rownum]) // off_require #mean of the last "off_require" points
dfx = pd.DataFrame(v, columns=['Sensor Correlation']) #converts the summed correlation tables back from list format to a DataFrame, with the sole column name 'Sensor Correlation'
dft = pd.DataFrame(baseline, columns =['baseline'])
dft = dft.astype(float)
dfx.plot(figsize=(50,25), linewidth=5, fontsize=40) # plots dfx dataframe which contains correlated and summed data
dft.plot(figsize=(50,25), linewidth=5, fontsize=40)
Basically, instead of the single graph this produces, I would like to iterate over each column, but only for this loop:
for colname, colitems in df.iteritems():
    for rownum, rowitem in colitems.iteritems():
        #d[rownum] = dfone.loc[rownum].to_numpy()
        #d[colname][rownum] = df.loc[colname][rownum]
        if v[rownum] >= sensitivity:
            sensors_on = True
            off_count = 0
            baseline_num[rownum] = 0
        else:
            sensors_on = False
            off_count += 1
            if off_count == off_require:
                for x in range(0, off_require):
                    baseline_num[colname][rownum] += df[colname][rownum - x]
            elif off_count > off_require:
                baseline_num[colname][rownum] += baseline_num[colname][rownum - 1] + df[colname][rownum] - (df[colname][rownum - off_require]) #this loop is just an optimization, one calculation per loop once the first calculation is established
I've tried some other solutions from other questions but none of them seem to solve this case.
As of now, I've tried multiple conversions to things like lists and tuples, and then calling them something like this:
baseline_num[i,column] += d[i - x,column]
as well as
baseline_num[i][column] += d[i - x][column]
while iterating over the loop using
for column in columns:
However, no matter how I arrange the solution, there is always some KeyError, or an error about expecting integer or slice indices, among others.
See the pictures for expected/possible outputs of one column on actual data, with varying input parameters (the sensitivity value and off_require differ between cases).
One such solution which didn't work was the looping method from this link:
https://www.geeksforgeeks.org/iterating-over-rows-and-columns-in-pandas-dataframe/
I've also tried creating a loop using iteritems as the outer loop. This did not work either.
Below are links to possible graph outputs for various sensitivity values, and windows in my actual dataset, with only one column. (i.e i manually deleted other columns, and plotted just the one using the current program)
sensitivity 1000, window 8
sensitivity 800, window 5
sensitivity 1500, window 5
If there's anything I've left out that would be helpful to solving this, please let me know so I can rectify it immediately.
See this picture for my original df.head:
Did you try,
for colname, colitems in df.iteritems():
    for rownum, rowitem in colitems.iteritems():
        print(df[colname][rownum])
The first loop iterates over all the columns, and the second loop iterates over all the rows of that column.
edit:
From our conversation below, I think that your baseline and df dataframes don't have the same column names because of how you created them and how you are accessing the elements.
My suggestion is that you create the baseline dataframe to be a copy of your df dataframe and edit the information within it from there.
Edit:
I have managed to make your code work for one loop, but I run into an index error. I am not sure what your optimisation step does, but I think that is what is causing it; take a look.
It is this part: baseline_num[colname][rownum - 1]. On the second loop, because you do rownum (0) - 1, you get index -1. You need to change it so that the look-back only starts once rownum is at least 1; I am not sure what you are trying to do there.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
d = {'col1': [-2587.944231, -1897.324231,-2510.304231,-2203.814231,-2105.734231,-2446.964231,-2963.904231,-2177.254231, 2796.354231,-2085.304231], 'col2': [-3764.468462,-3723.608462,-3750.168462,-3694.998462,-3991.268462,-3972.878462,3676.608462,-3827.808462,-3629.618462,-1841.758462,], 'col3': [-166.1357692,-35.36576923, 321.4157692,108.9257692,-123.2257692, -10.84576923, -100.7457692, 89.27423077, -211.0857692, 101.5342308]}
df = pd.DataFrame(data=d)
sensors = 3
window_size = 5
dfn = df.rolling(window_size).corr(pairwise = True)
index = df.index #index of values in the data frame.
rows = len(index) #len(index) returns number of rows in the data.
sensors = 3
baseline_num = [0]*(rows) #baseline numerator, by default zero
baseline = [0]*(rows) #initialize baseline value
baseline = pd.DataFrame(df)
baseline_num = pd.DataFrame(df)
#print(baseline_num)
v = [None]*(rows) # Initialize an empty array v[] equal to amount of rows in .csv file
s = [None]*(rows) #Initialize another empty array for the slope values for detecting when there is an exposure
d = [0]*(rows)
sensors_on = True #Is the sensor detecting something (True) or not (False).
off_count = 0
off_require = 8 # how many offs until baseline is updated
sensitivity = 1000
for i in range(0, rows): #This iterates over each index value, i.e. each row, and sums the values and returns them in list format.
    v[i] = dfn.loc[i].to_numpy().sum() - sensors

for colname, colitems in df.iteritems():
    #print(colname)
    for rownum, rowitem in colitems.iteritems():
        #print(rownum)
        #display(baseline[colname][rownum])
        #d[rownum] = dfone.loc[rownum].to_numpy()
        #d[colname][rownum] = df.loc[colname][rownum]
        if v[rownum] >= sensitivity:
            sensors_on = True
            off_count = 0
            baseline_num[rownum] = 0
        else:
            sensors_on = False
            off_count += 1
            if off_count == off_require:
                for x in range(0, off_require):
                    baseline_num[colname][rownum] += df[colname][rownum - x]
            elif off_count > off_require:
                baseline_num[colname][rownum] += baseline_num[colname][rownum - 1] + df[colname][rownum] - (df[colname][rownum - off_require]) #this loop is just an optimization, one calculation per loop once the first calculation is established
            baseline[colname][rownum] = (baseline_num[colname][rownum]) // off_require #mean of the last "off_require" points
            print(baseline[colname][rownum])
dfx = pd.DataFrame(v, columns =['Sensor Correlation']) #converts the summed correlation tables back from list format to a DataFrame, with the sole column name 'Sensor Correlation'
dft = pd.DataFrame(baseline, columns =['baseline'])
dft = dft.astype(float)
dfx.plot(figsize=(50,25), linewidth=5, fontsize=40) # plots dfx dataframe which contains correlated and summed data
dft.plot(figsize=(50,25), linewidth=5, fontsize=40)
My output looks like this,
-324.0
-238.0
-314.0
-276.0
-264.0
-306.0
-371.0
-806.0
638.0
-412.0
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/pandas/core/indexes/range.py in get_loc(self, key, method, tolerance)
354 try:
--> 355 return self._range.index(new_key)
356 except ValueError as err:
ValueError: -1 is not in range
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
3 frames
/usr/local/lib/python3.7/dist-packages/pandas/core/indexes/range.py in get_loc(self, key, method, tolerance)
355 return self._range.index(new_key)
356 except ValueError as err:
--> 357 raise KeyError(key) from err
358 raise KeyError(key)
359 return super().get_loc(key, method=method, tolerance=tolerance)
KeyError: -1
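To make the look-back fix concrete: a minimal sketch of guarding the negative offsets, using a hypothetical single-column frame rather than the sensor data above:

```python
import pandas as pd

# Hypothetical single-column frame standing in for df
df = pd.DataFrame({"col1": [1.0, 2.0, 3.0]})
baseline_num = df.copy()
off_require = 2

for rownum in df.index:
    # df.loc[-1] raises KeyError on a default RangeIndex,
    # so only look back once both offsets land on real rows
    if rownum - 1 >= 0 and rownum - off_require >= 0:
        baseline_num.loc[rownum, "col1"] = (
            baseline_num.loc[rownum - 1, "col1"]
            + df.loc[rownum, "col1"]
            - df.loc[rownum - off_require, "col1"]
        )

print(baseline_num["col1"].tolist())  # [1.0, 2.0, 4.0]
```

The same guard, dropped into the elif branch above, keeps the rolling update from ever indexing before row 0.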
I don't have enough rep to comment, but below is what I was able to work out. Hope it helps!
I tried to use the to_list() function while working out an answer, and it threw me an error:
AttributeError: 'DataFrame' object has no attribute 'to_list'
So, I decided to circumvent that method and came up with this:
indexes = [x for x in df.index]
row_vals = []
for index in indexes:
    for val in df.iloc[index].values:
        row_vals.append(val)
The object row_vals will contain all values in row order.
If you only want to get the row values for a particular row or set of rows, you would need to do this:
indx_subset = [`list of row indices`] #(Ex. [1, 2, 5, 6, etc...])
row_vals = []
for indx in indx_subset:
    for val in df.loc[indx].values:
        row_vals.append(val)
row_vals will then have all the row values from the specified indices.
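If a loop isn't required, the same row-major flattening can be done in one line via NumPy. This is a sketch with a made-up frame; it assumes all columns are numeric so no dtype coercion surprises occur:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# to_numpy() yields the rows in order; flatten() walks them row-major
row_vals = df.to_numpy().flatten().tolist()
print(row_vals)  # [1, 3, 2, 4]
```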
I'm trying to write code that finds the indices where a value changes from 0 to 1 and saves them in a variable called idx. Then, the two rows before each index should be extracted and processed. The code for extracting the rows is included below:
df1=pd.DataFrame({'A':[1,3,4,7,8,11,1,15,20,15,16,87],
'B':[1,3,4,6,8,11,1,19,20,15,16,87],
'flag':[0,0,0,0,1,1,1,0,0,1,0,0]})
N = 2
s = [x for s, e in zip(idx-N,idx) for x in range(s, e+1)]
df_before_2rows=df1.loc[df1.index.intersection(s)]
This works. But if I run this in a for-loop that processes each index one by one, I get an error:
df1=pd.DataFrame({'A':[1,3,4,7,8,11,1,15,20,15,16,87],
'B':[1,3,4,6,8,11,1,19,20,15,16,87],
'flag':[0,0,0,0,1,1,1,0,0,1,0,0]})
for item in idx:
    N = 2
    s = [x for s, e in zip(item-N, item) for x in range(s, e+1)]
    df_before_2rows = df1.loc[df1.index.intersection(s)]
TypeError: zip argument #1 must support iteration
The main goal is to get the two rows before each point where the flag changes from 0 to 1 and process them, then move on and check for the next 0-to-1 change and do the same.
IIUC, you can choose a different approach using groupby with cumsum of diff:
df = pd.DataFrame({'A':[1,3,4,7,8,11,1,15,20,15,16,87],
'B':[1,3,4,6,8,11,1,19,20,15,16,87],
'flag':[0,0,0,0,1,1,1,0,0,1,0,0]})
for _, i in df.groupby(df["flag"].shift(1).diff().eq(1).cumsum()):
    if i["flag"].eq(1).any(): # this is done to skip the last group with no flag of 1
        print(i.tail(3))
        # do your thing with i.tail(3)...
EDIT using your original method:
idx = [4, 8] # I assume you retrieved the idx already
for item in idx:
    N = 2
    df_before_2rows = df.loc[range(item-N, item+1)]
    print(df_before_2rows)
item is an element of idx; item-N is also just a number, hence the error.
for item in idx:
    N = 2
    s = [x for s, e in zip(item-N, item) for x in range(s, e+1)]
simplifies to:
for item in idx:
    N = 2
    # s = [x for x in range(item-N, item+1)]
    s = list(range(item-N, item+1))
    # s = np.arange(item-N, item+1)
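Putting the pieces together, here is a sketch of deriving idx from the flag column and pulling the N rows before each change, using the sample frame from the question:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 3, 4, 7, 8, 11, 1, 15, 20, 15, 16, 87],
                    'B': [1, 3, 4, 6, 8, 11, 1, 19, 20, 15, 16, 87],
                    'flag': [0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0]})

# rows where flag steps from 0 to 1 (diff is exactly 1 at those points)
idx = df1.index[df1['flag'].diff().eq(1)].tolist()
print(idx)  # [4, 9]

N = 2
for item in idx:
    # the N rows before the change plus the change row itself
    rows = df1.loc[df1.index.intersection(range(item - N, item + 1))]
    print(rows)
```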
Name  Gender  Physics  Maths
A             45       55
X             22       64
C             0        86
I have a csv file like this, and I have made some modifications to get a list with only the marks, in the form [[45,55],[22,64]].
I want to find the minimum for each subject.
But when I run my code, I only get the minimum for the first subject; the other values are copied straight from a row.
The answer I want - [0,55]
The answer I get - [0,86]
def find_min(marks, cols, rows):
    minimum = []
    temp = []
    for list in marks:
        min1 = min([x for x in list])
        minimum.append(min1)
    # for j in range(rows):
    #     for i in range(cols):
    #         temp.append(marks)
    #     x = min(temp)
    #     minimum.append(x)
    return minimum
How do I modify my code?
I can't use any other modules/libraries like csv or pandas.
I tried using zip(*marks), but that just prints my marks list as is.
Is there any way to separate the inner lists from the larger list?
This will calculate the minimum per subject:
In [707]: marks = [[45,55],[22,64]]
In [697]: [min(idx) for idx in zip(*marks)]
Out[697]: [22, 55]
Try transposing the marks array (which is one student per row) so each list entry corresponds to a column ("subject") from your CSV:
def find_min(marks):
    mt = zip(*marks)
    mins = [min(row) for row in mt]
    return mins
example usage:
marks = [[45,55],[22,64],[0,86]]
print(find_min(marks))
which prints:
[0, 55]
The code should go through a pandas Series and stop once a value has increased 5 times in a row. With a simple example it works so far:
list2 = pd.Series([2,3,3,4,5,1,4,6,7,8,9,10,2,3,2,3,2,3,4])
def cut(x):
    y = iter(x)
    for i in y:
        if x[i] < x[i+1] < x[i+2] < x[i+3] < x[i+4] < x[i+5]:
            return x[i]
            break
out = cut(list2)
index = list2[list2 == out].index[0]
So I get the correct output of 1 and the index of 5.
But if I use a second Series of shape (23999,) instead of (19,), I get the error:
pandas._libs.hashtable.Int64HashTable.get_item
KeyError: 3489660928
You can do something like this:
# compare list2 with the previous values
s = list2.gt(list2.shift())
# looking at last 5 values
s = s.rolling(5).sum()
# select those equal 5
list2[s.eq(5)]
Output:
10 9
11 10
dtype: int64
The first index where it happens is
s.eq(5).idxmax()
# output 10
Also, you can chain them together:
(list2.gt(list2.shift())
.rolling(5).sum()
.eq(5).idxmax()
)
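Run on the sample series from the question, the chained version picks out index 10, the first point reached after five consecutive increases (a quick check of the expression above, not new logic):

```python
import pandas as pd

list2 = pd.Series([2, 3, 3, 4, 5, 1, 4, 6, 7, 8, 9, 10, 2, 3, 2, 3, 2, 3, 4])

# first index where the last 5 steps were all increases
end = (list2.gt(list2.shift())
            .rolling(5).sum()
            .eq(5).idxmax())
print(end, list2[end])  # 10 9
```

Note this returns the index at the end of the run; the question's own loop returned the value just before the run started (1 at index 5), so adjust the result if that convention is needed.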
My code uses the lengths of lists to try to find the percentage of scores that are over an entered number. It all makes sense to me, but some of the code needs editing because it raises an error. How can I fix it?
Here is the code:
result = [("bob",7),("jeff",2),("harold",3)]
score = [7,2,3]
lower = []
higher = []
index2 = len(score)
indexy = int(index2)
index1 = 0
chosen = int(input("the number of marks you want the percentage to be displayed higher than:"))
for counter in score[indexy]:
    if score[index1] >= chosen:
        higher.append(score[index1])
    else:
        lower.append(score[index1])
    index1 = index1 + 1
original = indexy
new = len(higher)
decrease = int(original) - int(new)
finished1 = decrease/original
finished = finished1 * 100
finishedlow = original - finished
print(finished,"% of the students got over",chosen,"marks")
print(finishedlow,"% of the students got under",chosen,"marks")
Just notice one thing:
>>> score = [7,2,3]
>>> len(score)
3
but the index of a list starts counting from 0, so:
>>> score[3]
IndexError: list index out of range
fix your row 12 to:
...
for counter in score:
    if counter >= chosen:
        ...
if you really want to get the index and use them:
....
for index, number in enumerate(score):
    if score[index] >= chosen:
        ......
Your mistake is in Line 9: for counter in score[indexy]:
counter should iterate through a list, not through an int, and on top of that you are referring to a value that is out of the index range of your list:
1 - Remember indexing runs from 0 to (len(list) - 1).
2 - You cannot iterate over a fixed int value.
So, you should change Line 9 to :
for counter in score:
But I'm not sure of the result you will get from your code, you need to check out your code logic.
There is a lot to optimize in your code.
index2 is already an int, so there is no need to convert it to indexy. Indices in Python are counted from 0, so the highest index is len(list)-1.
You have a loop variable counter, so why use index1 inside the for-loop? And you cannot iterate over a number, which is what score[indexy] is.
results = [("bob",7),("jeff",2),("harold",3)]
chosen = int(input("the number of marks you want the percentage to be displayed higher than:"))
higher = sum(score >= chosen for name, score in results)
finished = higher / len(results)
finishedlow = 1 - finished
print("{0:.0%} of the students got over {1} marks".format(finished, chosen))
print("{0:.0%} of the students got under {1} marks".format(finishedlow, chosen))
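For a quick non-interactive check of the rewritten version, fix chosen instead of reading input() (the threshold 3 here is an arbitrary assumption):

```python
results = [("bob", 7), ("jeff", 2), ("harold", 3)]
chosen = 3  # assumed threshold in place of input()

# count of scores at or above the threshold
higher = sum(score >= chosen for name, score in results)
finished = higher / len(results)
finishedlow = 1 - finished

print("{0:.0%} of the students got over {1} marks".format(finished, chosen))    # 67% ...
print("{0:.0%} of the students got under {1} marks".format(finishedlow, chosen))
```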