Sum of rows for a given column - python

I am trying to add the elements in each row of "list1" and "list2" using a while loop, but I am getting "KeyError: 'the label [7] is not in the [index]'". I know the simple way to do this is:
df['sum'] = (df["list1"]+df["list2"])
But I want to try this with loop for learning purposes.
import pandas as pd

df = pd.DataFrame({"list1": [2, 5, 4, 8, 4, 7, 8],
                   "list2": [5, 8, 4, 8, 7, 5, 5],
                   "list3": [50, 65, 4, 82, 89, 90, 76]})
d = []
count = 0
x = 0
while count < len(df):
    df1 = df.loc[x, "list1"] + df.loc[x, "list2"]
    d.append(df1)
    x = x + 1
    count = count + 1
df["sum"] = d

You are really close, but just a few suggestions:
- there is no need for both the count and x variables
- you are getting the error because len(df) (7) falls outside the index, which is what .loc looks up; that can be fixed by comparing against len(df) - 1
- you do not need to write x = x + 1; you can use x += 1
d = []
x = 0
while x <= len(df) - 1:
    df1 = df.loc[x, "list1"] + df.loc[x, "list2"]
    d.append(df1)
    x += 1
df["sum"] = d
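As a further learning exercise, the same column can be built with itertuples, which avoids the .loc label lookups that caused the KeyError; a small self-contained sketch:

```python
import pandas as pd

df = pd.DataFrame({"list1": [2, 5, 4, 8, 4, 7, 8],
                   "list2": [5, 8, 4, 8, 7, 5, 5]})

# itertuples yields one namedtuple per row, so no index arithmetic is needed
d = [row.list1 + row.list2 for row in df.itertuples()]
df["sum"] = d

print(df["sum"].tolist())  # [7, 13, 8, 16, 11, 12, 13]
```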


Remove following rows that are above or under by X amount from the current row['x']

I am calculating correlations, and the data frame I have needs to be filtered.
Starting with the first row and looping through the dataframe to the last row, I want to remove the rows below the current row whose values are within X of it.
Example:
df['y'] has the values 50,51,52,53,54,55,70,71,72,73,74,75.
If X = 10, it would start at 50, see 51,52,53,54,55 as within the ±10 range, and delete those rows. 70 would stay, as it is not within that range, and the same test would start again at 70, where 71,72,73,74,75 and their respective rows would be deleted.
The filter with X = 10 would thus leave df with the rows containing 50 and 70.
It would leave me with a clean dataframe that drops the instances linked to the first instance of what is essentially the same observed period. I tried coding a loop to do that, but I am left with the wrong result and am desperate at this point. Hopefully someone can correct the mistake or point me in the right direction.
df6['index'] = df6.index
df6.sort_values('index')
boom = len(dataframe1.index) / 3
# Taking initial comparison values from first row
c = df6.iloc[0]['index']
# Including first row in result
filters = [True]
# Skipping first row in comparisons
for index, row in df6.iloc[1:].iterrows():
    if c - boom <= row['index'] <= c + boom:
        filters.append(False)
    else:
        filters.append(True)
        # Updating values to compare based on latest accepted row
        c = row['index']
df2 = df6.loc[filters].sort_values('correlation').drop('index', 1)
df2
[screenshots of the output before and after filtering]
IIUC, your main issue is filtering out consecutive values that lie within a threshold.
You can use a custom function that acts on a Series (i.e., a column) and returns the list of valid indices:
def consecutive(s, threshold=10):
    prev = float('-inf')
    idx = []
    for i, val in s.items():  # Series.iteritems() was removed in pandas 2.0
        if val - prev > threshold:
            idx.append(i)
            prev = val
    return idx
Example of use:
import pandas as pd
df = pd.DataFrame({'y': [50,51,52,53,54,55,70,71,72,73,74,75]})
df2 = df.loc[consecutive(df['y'])]
Output:
y
0 50
6 70
Variant
If you prefer the function to return a boolean indexer, here is a variant:
def consecutive(s, threshold=10):
    prev = float('-inf')
    idx = [False] * len(s)
    for i, val in s.items():
        if val - prev > threshold:
            idx[i] = True
            prev = val
    return idx
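Usage is the same as before, except the boolean list goes straight into .loc. Note this variant assumes the default RangeIndex, since the label i is also used as a position into idx; a quick sketch:

```python
import pandas as pd

def consecutive(s, threshold=10):
    # Boolean-indexer variant: True marks the rows to keep
    prev = float('-inf')
    idx = [False] * len(s)
    for i, val in s.items():
        if val - prev > threshold:
            idx[i] = True
            prev = val
    return idx

df = pd.DataFrame({'y': [50, 51, 52, 53, 54, 55, 70, 71, 72, 73, 74, 75]})
df2 = df.loc[consecutive(df['y'])]
print(df2['y'].tolist())  # [50, 70]
```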

How to calculate values row wise using lambda function?

So I have a dataframe where I want to count all the days a student was present. The dataframe headers are the days of the month, and I want to count the frequency of the character 'P' row-wise over all the columns and store the counts in a new column. What I have done until now is define a function which should accept each row and count the frequency of 'P':
def count_P(row):  # avoid shadowing the built-in name "list"
    frequency = 0
    for item in row:
        if item == 'P':
            frequency += 1
    return frequency
And then I am trying to apply this function which is what I am confused about:
df['Attendance'] = df.apply(lambda x: count_P(x) for x in , axis = 1)
In the above line I need to pass x each time as a row of the dataframe, so do I write
for x in range(df.iloc[0], df.iloc[df.shape[0]])? But that gives me a SyntaxError. And do I need axis here? Or does it need to be done some other way?
Edit:
The error message I am getting:
df['Attendance'] = df.apply(lambda x: count_P(x) for x in range(df.iloc[0],df.iloc[df.shape[0]]),axis=1)
^
SyntaxError: Generator expression must be parenthesized
Assuming your dataframe looks like this:
df = pd.DataFrame({'2021-03-01': ['P','P'], '2021-03-02': ['P','X']})
You can do:
df["p_count"] = (df == 'P').sum(axis=1)
which yields:
  2021-03-01 2021-03-02  p_count
0          P          P        2
1          P          X        1
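If you want to keep the row-wise apply from the question, you can pass count_P directly; no generator expression is needed. A small sketch using the same example frame:

```python
import pandas as pd

def count_P(row):
    # Count how many cells in this row equal 'P'
    return sum(1 for item in row if item == 'P')

df = pd.DataFrame({'2021-03-01': ['P', 'P'], '2021-03-02': ['P', 'X']})

# axis=1 passes each row (as a Series) to count_P
df['Attendance'] = df.apply(count_P, axis=1)
print(df['Attendance'].tolist())  # [2, 1]
```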

Finding the smallest number not smaller than x for a rolling data of a dataframe in python

Suppose I have data in rows for a column (O): 1,2,3,4,5,6,7,8,9,10.
Its average is 5.5.
I need to find the smallest number that is larger than the average 5.5, i.e. 6.
Here is what I have tried so far.
method 1:
df["test1"] = df["O"].shift().rolling(min_periods=1, window=10).apply(lambda x: pd.Series(x).nlargest(5).iloc[-1])
Discarded as that number may not always be the 6th number.
method 2:
great = []
df['test1'] = ''
df["avg"] = df["O"].shift().rolling(min_periods=1, window=10).mean()
for i in range(1, len(df)):
    for j in range(0, 10):
        if df.loc[i - j, 'O'] > df.loc[i, 'avg']:
            great.append(df.loc[i - j, 'O'])
    df.loc[i, 'test1'] = min(great)
This throws an error:
KeyError: -1
Please help to find the small error in the code as soon as possible.
Thanks,
D.
Mask the Series when it is greater than the mean, then sort, then take the first row.
import pandas as pd
df = pd.DataFrame([1,2,3,4,5,6,7,8,9,10], columns=("vals",))
df[df.vals > df.vals.mean()].sort_values("vals").head(1)
# > vals
# 5 6
Try with
n = 10
output = df.vals.rolling(n).apply(lambda x: x[x > x.mean()].min())
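As a quick check of the rolling approach on the question's numbers (1 through 10 with a window of 10), only the last row has a full window; its mean is 5.5, so the smallest value above it is 6:

```python
import pandas as pd

df = pd.DataFrame([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], columns=("vals",))

# With raw=False (the default), the lambda receives each window as a Series
out = df.vals.rolling(10).apply(lambda x: x[x > x.mean()].min())

print(out.iloc[-1])  # 6.0 (the first nine rows are NaN: incomplete windows)
```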

customize step in loop through pandas

I know this question was asked a few times, but I couldn't understand the answers or apply them to my case.
I'm trying to iterate over a dataframe, and for each row: if column A has 1, add one to a counter; if it has 0, don't count the line (but don't skip it either).
When the counter reaches 10, take all those rows, put them in an array, and restart the counter. After searching a bit, it seems that generators could do the trick, but I have a bit of trouble with them. So far I have something like this, thanks to the help of the SO community!
import pandas as pd
import numpy as np

data = pd.DataFrame(np.random.randint(0, 50, size=(50, 4)), columns=list('ABCD'))
data['C'] = np.random.randint(2, size=50)
data

counter = 0
chunk = 10
arrays = []
for x in range(0, len(data), chunk):
    array = data.iloc[x: x + chunk]
    arrays.append(array)
    print(array)
the idea looks something like this :
while counter <= 10:
    if data['A'] == 1:
        counter += 1
        yield counter
    if counter > 10:
        counter = 0
But I don't know how to combine this pseudo code with my current for loop.
When we use pandas, we should try to avoid for loops; based on your question, we can use groupby:
arrays = [frame for _, frame in data.groupby(data.A.eq(1).cumsum().sub(1) // 10)]
Explanation:
we take the cumsum of A == 1, so the running count goes up at each 1, while a 0 keeps the same sum as the previous row. Then // is integer division, which splits the dataframe in steps of 10; for example 10 // 10 returns 1 and 20 // 10 returns 2.
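A small sketch of what that grouping key produces, using a toy column A (hypothetical data, not the random frame from the question):

```python
import pandas as pd

# Toy data: A repeats the pattern 1, 0, 1 over 30 rows (20 ones in total)
data = pd.DataFrame({'A': [1, 0, 1] * 10})

# Running count of 1s; 0-rows repeat the previous count, // 10 buckets per ten 1s
key = data.A.eq(1).cumsum().sub(1) // 10
arrays = [frame for _, frame in data.groupby(key)]

print([len(a) for a in arrays])  # [15, 15]: each chunk holds exactly ten 1s
```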

Write a code that Calculates the average number of non-zero ratings per individual in our data set

So I've written my code below, but I am having a hard time getting it to not include the zeros. It runs, but unfortunately not the way I want it to. Can anyone shed some light?
The variable movies is basically a list of movies we were given. The code will be right if an average of 21.4 is the output.
all_ratings =[
[5,5,4,4,3,1,2,3,4,4,4,3,4,0,0,0,1,2,3,4,4,4,1,4,0,0,0,1,2,5],
[5,0,1,2,3,1,2,3,4,4,4,5,4,2,1,0,1,2,0,5,0,4,1,4,2,0,0,1,0,5],
[5,2,3,4,4,0,0,0,4,5,0,3,0,0,0,3,4,0,1,4,4,4,0,4,0,3,0,1,2,5],
[5,0,4,0,0,4,2,3,0,0,4,0,3,0,1,0,1,2,3,0,2,0,1,0,0,0,4,0,1,5],
[5,4,3,2,1,1,2,3,4,3,4,3,4,0,3,0,1,2,4,4,4,4,1,4,0,0,0,1,2,5],
]
total = []
average = []
for index in range(len(all_ratings)):
    total += [sum(all_ratings[index])]
for index in range(len(all_ratings)):
    average = average + [total[index] / 30]
for index in range(len(movies)):
    print(average)
Define
def mean(l):
    return sum(l) / len(l)
and now:
[mean([y for y in x if y > 0]) for x in all_ratings]
You can use a numpy array to select the nonzero elements and numpy mean to get the average:
from numpy import array, mean

answ = [mean(array(el)[array(el) != 0]) for el in all_ratings]
print(answ)
Output:
[3.2083333333333335, 2.869565217391304, 3.4210526315789473, 2.8125, 2.96]
Perhaps this might be a solution.
all_ratings =[
[5,5,4,4,3,1,2,3,4,4,4,3,4,0,0,0,1,2,3,4,4,4,1,4,0,0,0,1,2,5],
[5,0,1,2,3,1,2,3,4,4,4,5,4,2,1,0,1,2,0,5,0,4,1,4,2,0,0,1,0,5],
[5,2,3,4,4,0,0,0,4,5,0,3,0,0,0,3,4,0,1,4,4,4,0,4,0,3,0,1,2,5],
[5,0,4,0,0,4,2,3,0,0,4,0,3,0,1,0,1,2,3,0,2,0,1,0,0,0,4,0,1,5],
[5,4,3,2,1,1,2,3,4,3,4,3,4,0,3,0,1,2,4,4,4,4,1,4,0,0,0,1,2,5],
]
def get_average(vals):  # ignores zeros
    counter = 0
    total = 0
    for i in range(len(vals)):
        if vals[i] != 0:
            counter += 1
            total += vals[i]
    return round(total / counter, 2)

for i in range(len(all_ratings)):
    print(get_average(all_ratings[i]))
Update: I guess 21.4 is the average number of movies that were rated in each list of ratings. The code below returns 21.4:
import numpy as np

counter = 0
for index, item in np.ndenumerate(all_ratings):
    if item != 0:
        counter += 1
print(counter / 5)
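The same 21.4 can be reached without numpy by counting the nonzero ratings per list and averaging across the five lists; a quick cross-check on the data above:

```python
all_ratings = [
    [5,5,4,4,3,1,2,3,4,4,4,3,4,0,0,0,1,2,3,4,4,4,1,4,0,0,0,1,2,5],
    [5,0,1,2,3,1,2,3,4,4,4,5,4,2,1,0,1,2,0,5,0,4,1,4,2,0,0,1,0,5],
    [5,2,3,4,4,0,0,0,4,5,0,3,0,0,0,3,4,0,1,4,4,4,0,4,0,3,0,1,2,5],
    [5,0,4,0,0,4,2,3,0,0,4,0,3,0,1,0,1,2,3,0,2,0,1,0,0,0,4,0,1,5],
    [5,4,3,2,1,1,2,3,4,3,4,3,4,0,3,0,1,2,4,4,4,4,1,4,0,0,0,1,2,5],
]

# Number of nonzero ratings in each individual's list
nonzero_counts = [sum(1 for r in ratings if r != 0) for ratings in all_ratings]

# Average count of nonzero ratings across all individuals
print(sum(nonzero_counts) / len(all_ratings))  # 21.4
```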
