Count items with condition in a DataFrame in Python

I have a DataFrame like this:
index  column1  column2  column3
1      30       55       62
2      69       20       40
3      23       62       23
...
How can I count the number of values greater than 50 across all the elements in the table above?
I'm trying the method below:
count = 0
for column in df.columns:
    count += df[df[column] > 50][column].count()
Is this a proper way to do it, or is there a more effective suggestion?

You can just check all the values at once and then sum() them, since True evaluates to 1 and False to 0:
df.gt(50).sum().sum()
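For example, on the sample data from the question (a minimal sketch; the frame is reconstructed from the table at the top):
import pandas as pd

df = pd.DataFrame({'column1': [30, 69, 23],
                   'column2': [55, 20, 62],
                   'column3': [62, 40, 23]})
print(df.gt(50).sum())        # per-column counts: 1, 2, 1
print(df.gt(50).sum().sum())  # grand total: 4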

(df > 54).values.sum() will do what you're looking for. Here is the complete code to get the results:
>>> df = pd.DataFrame(np.random.randint(0,100,size=(5, 2)), columns=list('AB'))
>>> df
    A   B
0  68  92
1  47  53
2   5  35
3  75  82
4  51  89
>>> (df > 54).values.sum()
5
>>>
Basically, what I'm doing is creating a mask of True/False values over the entire DataFrame based on the condition (in this case > 54), and then just rolling up the DataFrame, because True/False is equal to 1/0 when summed.
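To make the intermediate step visible, the mask for the frame above looks like this:
>>> (df > 54)
       A      B
0   True   True
1  False  False
2  False  False
3   True   True
4  False   True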

Related

How do I create a new column with values from the next row of another column in python?

Say I have a dataframe as such:
id  pos_X  pos_y
1   100    0
2   68     17
3   42     28
4   94     35
5   15     59
6   84     19
This is my desired dataframe:
id  pos_X  pos_y  pos_xend  pos_yend
1   100    0      68        17
2   42     28     94        35
3   15     59     84        19
Basically the new column will have the values from the next row. How can I do this?
You can use a pivot:
import numpy as np

out = (df
       .drop(columns='id')
       .assign(idx=np.arange(len(df)) // 2,
               col=np.where(np.arange(len(df)) % 2, 'end', ''))
       .pivot(index='idx', columns='col')
       .pipe(lambda d: d.set_axis(d.columns.map(''.join), axis=1))
       )
output:
     pos_X  pos_Xend  pos_y  pos_yend
idx
0      100        68      0        17
1       42        94     28        35
2       15        84     59        19
You only need to create a new DataFrame. You can build it with a for loop that walks over the old DataFrame:
import pandas as pf
old_datas = {'id': [1, 2], 'pos_x': [100, 68], 'pos_y': [0, 17]}
old_df = pf.DataFrame(data=old_datas)
new_pos_x = []
new_pos_y = []
pos_xend = []
pos_yend = []
new_id = []
for i in range(len(old_df)):
    if i % 2 == 0:
        # even rows open a new pair
        new_pos_x.append(old_df.iloc[i]['pos_x'])
        new_pos_y.append(old_df.iloc[i]['pos_y'])
        new_id.append(i // 2 + 1)
    else:
        # odd rows close the pair with the "end" values
        pos_xend.append(old_df.iloc[i]['pos_x'])
        pos_yend.append(old_df.iloc[i]['pos_y'])
new_datas = {'id': new_id, 'pos_x': new_pos_x, 'pos_y': new_pos_y,
             'pos_xend': pos_xend, 'pos_yend': pos_yend}
new_df = pf.DataFrame(data=new_datas)
print(new_df)
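With the two-row sample above, this should print something like:
   id  pos_x  pos_y  pos_xend  pos_yend
0   1    100      0        68        17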
# select alternate rows, the first set starting from 0, the second from 1;
# reset the index and concat based on the index,
# choosing which columns to use in the concat
df2 = pd.concat(
    [df.loc[::2][['id', 'pos_X', 'pos_y']].reset_index(drop=True),
     df.loc[1::2][['pos_X', 'pos_y']].reset_index(drop=True).add_suffix('end')],
    axis=1)
# reset the ID column
df2['id'] = np.arange(0, len(df2))
df2
or
# drop the extra column after concat
df2 = pd.concat(
    [df.loc[::2].reset_index(drop=True),
     df.loc[1::2].reset_index(drop=True).add_suffix('end')],
    axis=1).drop(columns='idend')
# reset the ID column
df2['id'] = np.arange(0, len(df2))
df2
   id  pos_X  pos_y  pos_Xend  pos_yend
0   0    100      0        68        17
1   1     42     28        94        35
2   2     15     59        84        19
Selecting odd and even rows and then concatenating them would solve the problem. Something like this:
import pandas as pd
df = pd.DataFrame({'X': [100, 68, 12, 6, 21], 'Y': [0, 17, 32, 23, 14]})
print(df)
# select even and odd rows
even_df = df.iloc[::2].reset_index(drop=True)
odd_df = df.iloc[1::2].reset_index(drop=True) # odd
# concatenate columns
result = pd.concat([even_df, odd_df], axis=1)
print(result)
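Note (a hedged aside): with an odd number of rows, as in this sample, the last even row has no partner, so the concatenated frame ends with NaNs, and the columns carry duplicate names. A small sketch that handles both:
import pandas as pd

df = pd.DataFrame({'X': [100, 68, 12, 6, 21], 'Y': [0, 17, 32, 23, 14]})
even_df = df.iloc[::2].reset_index(drop=True)
odd_df = df.iloc[1::2].reset_index(drop=True)
# suffix the "end" columns and drop the unpaired final row
result = pd.concat([even_df, odd_df.add_suffix('_end')], axis=1).dropna()
print(result)
#      X   Y  X_end  Y_end
# 0  100   0   68.0   17.0
# 1   12  32    6.0   23.0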
I think you are taking alternate rows, so I would suggest something like this, considering data to be a pandas DataFrame:
df = """your data"""
posx_end = df.loc[df['id'] % 2 == 0]['pos_X'].values
posy_end = df.loc[df['id'] % 2 == 0]['pos_y'].values
df = df.loc[df['id'] % 2 != 0].copy()
df['posx_end'] = posx_end
df['posy_end'] = posy_end
edit:
add the following lines as well for id column formatting
df['id'] = range(1, len(df)+1)
df.set_index('id', inplace=True)
result:
    pos_X  pos_y  posx_end  posy_end
id
1     100      0        68        17
2      42     28        94        35
3      15     59        84        19
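A further alternative, sketched here on the question's data: shift(-1) pulls each next row's values alongside, after which every other row is kept. This is a hedged sketch, not a tested answer from the thread.
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6],
                   'pos_X': [100, 68, 42, 94, 15, 84],
                   'pos_y': [0, 17, 28, 35, 59, 19]})
out = (df.assign(pos_xend=df['pos_X'].shift(-1),
                 pos_yend=df['pos_y'].shift(-1))
         .iloc[::2]                 # keep the first row of each pair
         .reset_index(drop=True))
out['id'] = range(1, len(out) + 1)  # renumber the ids
print(out)
#    id  pos_X  pos_y  pos_xend  pos_yend
# 0   1    100      0      68.0      17.0
# 1   2     42     28      94.0      35.0
# 2   3     15     59      84.0      19.0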

Pandas Multiindex get values from first entry of index

I have the following multiindex dataframe:
from io import StringIO
import pandas as pd
datastring = StringIO("""File,no,runtime,value1,value2
A,0, 0,12,34
A,0, 1,13,34
A,0, 2,23,34
A,1, 6,23,38
A,1, 7,22,38
B,0,17,15,35
B,0,18,17,35
C,0,34,23,32
C,0,35,21,32
""")
df = pd.read_csv(datastring, sep=',')
df.set_index(['File','no',df.index], inplace=True)
>>> df
            runtime  value1  value2
File no
A    0  0         0      12      34
        1         1      13      34
        2         2      23      34
     1  3         6      23      38
        4         7      22      38
B    0  5        17      15      35
        6        18      17      35
C    0  7        34      23      32
        8        35      21      32
What I would like to get is just the first value2 for each combination of File and no:
A 0 34
A 1 38
B 0 35
C 0 32
The most similar questions I could find were these:
Resample pandas dataframe only knowing result measurement count
MultiIndex-based indexing in pandas
Select rows in pandas MultiIndex DataFrame
but I was unable to construct a solution from them. The best I got was slicing with pd.IndexSlice, but as the values technically are still there (just not on display), something like
idx = pd.IndexSlice
df.loc[idx[:, 0], :]
could, for example, filter for no == 0, but it would still return all the remaining rows of the dataframe.
Is a multiindex even the right tool for the task at hand? How to solve this?
Use GroupBy.first on the first and second levels of the MultiIndex:
s = df.groupby(level=[0,1])['value2'].first()
print (s)
File  no
A     0    34
      1    38
B     0    35
C     0    32
Name: value2, dtype: int64
If a one-column DataFrame is needed, use a one-element list:
df1 = df.groupby(level=[0,1])[['value2']].first()
print (df1)
         value2
File no
A    0       34
     1       38
B    0       35
C    0       32
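If a flat, unindexed frame is preferred, a small follow-up on the same data (a sketch; the reset_index call is the only addition):
flat = df.groupby(level=[0, 1])['value2'].first().reset_index()
print(flat)
#   File  no  value2
# 0    A   0      34
# 1    A   1      38
# 2    B   0      35
# 3    C   0      32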
Another idea is to remove the 3rd level with DataFrame.reset_index and filter out duplicated (File, no) pairs with Index.duplicated and boolean indexing:
df2 = df.reset_index(level=2, drop=True)
s = df2.loc[~df2.index.duplicated(), 'value2']
print (s)
File  no
A     0    34
      1    38
B     0    35
C     0    32
Name: value2, dtype: int64
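first() also works without selecting a column, returning the first row per (File, no) group for every column (same data as above):
print(df.groupby(level=[0, 1]).first())
#          runtime  value1  value2
# File no
# A    0         0      12      34
#      1         6      23      38
# B    0        17      15      35
# C    0        34      23      32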
For the sake of completeness, I would like to add another method (which I would not have found without the answer by jezrael).
s = df.groupby(level=[0,1])['value2'].nth(0)
This can be generalized to finding any entry, not merely the first:
t = df.groupby(level=[0,1])['value1'].nth(1)
Note that the selection was changed from value2 to value1 because, for the former, the results of nth(0) and nth(1) would have been identical.
Pandas documentation link: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.GroupBy.nth.html

Generate Column Value in Pandas based on previous rows

Let us assume I am taking a temperature measurement at a regular interval and recording the values in a pandas DataFrame:
day  temperature [F]
0    89
1    91
2    93
3    88
4    91
5    91
6    93
Now I want to create another column which is set to 1 if and only if the current value and the value before it are both above a certain level. In my scenario I want a column value of 1 when two consecutive values are above 90, thus yielding:
day  temperature  Above limit?
0    89           0
1    91           0
2    93           1
3    88           0
4    91           0
5    91           1
6    93           1
Despite some SO and Google digging, it's not clear whether I should use iloc[x], loc[x], or something else in a for loop.
You are looking for the shift function in pandas.
import io
import pandas as pd
data = """
day temperature Expected
0 89 0
1 91 0
2 93 1
3 88 0
4 91 0
5 91 1
6 93 1
"""
data = io.StringIO(data)
df = pd.read_csv(data, sep=r'\s+')
df['Result'] = ((df['temperature'].shift(1) > 90) & (df['temperature'] > 90)).astype(int)
# Validation
(df['Result'] == df['Expected']).all()
Try this:
df = pd.DataFrame({'temperature': [89, 91, 93, 88, 90, 91, 91, 93]})
limit = 90
df['Above'] = ((df['temperature'] > limit) & (df['temperature'].shift(1) > limit)).astype(int)
df
In the future, please include the code needed for testing (in this case, the DataFrame construction line).
df['limit'] = ""
df.iloc[0, 2] = 0
for i in range(1, len(df)):
    if df.iloc[i, 1] > 90 and df.iloc[i-1, 1] > 90:
        df.iloc[i, 2] = 1
    else:
        df.iloc[i, 2] = 0
Here iloc[i, 2] refers to the i-th row and column index 2 (the limit column). Hope this helps.
Solution using shift():
>>> threshold = 90
>>> df['Above limit?'] = 0
>>> df.loc[(df['temperature [F]'] > threshold) & (df['temperature [F]'].shift(1) > threshold), 'Above limit?'] = 1
>>> df
   day  temperature [F]  Above limit?
0    0               89             0
1    1               91             0
2    2               93             1
3    3               88             0
4    4               91             0
5    5               91             1
6    6               93             1
Try using rolling(window=2) and then apply() as follows:
df["limit"] = df['temperature'].rolling(2).apply(lambda x: int(x[0] > 90) & int(x[-1] > 90), raw=True)

how to add complementary intervals in pandas dataframe

Let's say that I have a signal of 100 samples, L = 100.
In this signal I found some intervals that I label as "OK". The intervals are stored in a Pandas DataFrame that looks like this:
c = pd.DataFrame(np.array([[10, 26], [50, 84]]), columns=['Start', 'End'])
c['Value']='OK'
How can I add the complementary intervals in another dataframe in order to have something like this:
d = pd.DataFrame(np.array([[0, 9], [10, 26], [27, 49], [50, 84], [85, 100]]), columns=['Start', 'End'])
d['Value']=['Check','OK','Check','OK','Check']
You can use the first DataFrame to create the second one and merge, as suggested by @jezrael:
d = pd.DataFrame({"Start":[0] + sorted(pd.concat([c.Start , c.End+1])), "End": sorted(pd.concat([c.Start-1 , c.End]))+[100]} )
d = pd.merge(d, c, how='left')
d['Value'] = d['Value'].fillna('Check')
d = d.reindex_axis(["Start","End","Value"], axis=1)
output
   Start  End  Value
0      0    9  Check
1     10   26     OK
2     27   49  Check
3     50   84     OK
4     85  100  Check
I think you need:
d = pd.merge(d, c, how='left')
d['Value'] = d['Value'].fillna('Check')
print (d)
   Start  End  Value
0      0    9  Check
1     10   26     OK
2     27   49  Check
3     50   84     OK
4     85  100  Check
EDIT:
You can use numpy.concatenate with numpy.sort, numpy.column_stack, and the DataFrame constructor for the new df. Finally, merge with fillna, passing a dict to replace the missing values in the Value column:
s = np.sort(np.concatenate([[0], c['Start'].values, c['End'].values + 1]))
e = np.sort(np.concatenate([c['Start'].values - 1, c['End'].values, [100]]))
d = pd.DataFrame(np.column_stack([s,e]), columns=['Start','End'])
d = pd.merge(d, c, how='left').fillna({'Value':'Check'})
print (d)
   Start  End  Value
0      0    9  Check
1     10   26     OK
2     27   49  Check
3     50   84     OK
4     85  100  Check
EDIT1:
New values are first added to c with loc, then the frame is reshaped to a Series with stack and shifted. Finally, the DataFrame is built back with unstack:
b = c.copy()
max_val = 100
min_val = 0
c.loc[-1, 'Start'] = max_val + 1
a = c[['Start','End']].stack(dropna=False).shift().fillna(min_val - 1).astype(int).unstack()
a['Start'] = a['Start'] + 1
a['End'] = a['End'] - 1
a['Value'] = 'Check'
print (a)
    Start  End  Value
 0      0    9  Check
 1     27   49  Check
-1     85  100  Check
d = pd.concat([b, a]).sort_values('Start').reset_index(drop=True)
print (d)
   Start  End  Value
0      0    9  Check
1     10   26     OK
2     27   49  Check
3     50   84     OK
4     85  100  Check
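For completeness, a plain-loop sketch of the same idea (a hedged sketch, assuming the intervals in c are sorted and non-overlapping, and the signal spans 0 to L = 100 as in the question):
import numpy as np
import pandas as pd

c = pd.DataFrame(np.array([[10, 26], [50, 84]]), columns=['Start', 'End'])
c['Value'] = 'OK'

rows, cursor, L = [], 0, 100
for start, end in zip(c['Start'], c['End']):
    if cursor < start:                      # gap before this "OK" interval
        rows.append((cursor, start - 1, 'Check'))
    rows.append((start, end, 'OK'))
    cursor = end + 1
if cursor <= L:                             # trailing gap after the last interval
    rows.append((cursor, L, 'Check'))
d = pd.DataFrame(rows, columns=['Start', 'End', 'Value'])
print(d)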

Pandas individual item using index and column

I have a csv file, test.csv. I am trying to use pandas to select items depending on whether the second value is above a certain value, e.g.
index  A   B
0      44  1
1      45  2
2      46  57
3      47  598
4      48  5
So what I would like is: if B is larger than 50, give me the values in A as integers which I could assign to a variable.
edit 1:
Sorry for the poor explanation. The final purpose of this is that I want to look in table 1:
index  A   B
0      44  1
1      45  2
2      46  57
3      47  598
4      48  5
for any values above 50 in column B and get the column A value and then look in table 2:
index  A   B
5      44  12
6      45  13
7      46  14
8      47  15
9      48  16
so in the end I want to end up with the value in column B of table two, which I can print out as an integer and not as a Series. If this is not possible using pandas then OK, but is there a way to do it in any case?
You can use DataFrame slicing to get the values you want:
import pandas as pd
f = pd.read_csv('yourfile.csv')
f[f['B'] > 50].A
In this code,
f['B'] > 50
is the condition, returning a boolean array of True/False according to whether each value meets the condition; the corresponding A values are then selected.
This would be the output:
2 46
3 47
Name: A, dtype: int64
Is this what you wanted?
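For the edited follow-up, a hedged sketch (the table variable names are assumed): take the A values that pass the filter in table 1, look them up in table 2, and print the matching B values as plain ints rather than a Series.
import pandas as pd

table1 = pd.DataFrame({'A': [44, 45, 46, 47, 48], 'B': [1, 2, 57, 598, 5]})
table2 = pd.DataFrame({'A': [44, 45, 46, 47, 48], 'B': [12, 13, 14, 15, 16]})

wanted_a = table1.loc[table1['B'] > 50, 'A']          # A values where B > 50 in table 1
matches = table2.loc[table2['A'].isin(wanted_a), 'B']  # corresponding B values in table 2
for value in matches:
    print(int(value))   # prints 14, then 15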
