I am creating four columns labeled flagMin, flagMax, flagLow, and flagUp. I update these DataFrame columns on each pass through the loop, but my original data gets overridden. I would like to keep the previous data in the four columns, since they contain 1s where the condition was true.
import pandas as pd
import numpy as np
df = pd.read_excel('help test 1.xlsx')
#groupby separates the different Name parameters within the Name column and performs functions like finding the lowest of the "minimum" and "lower" columns and the highest of the "maximum" and "upper" columns.
flagMin = df.groupby(['Name'], as_index=False)['Min'].min()
flagMax = df.groupby(['Name'], as_index=False)['Max'].max()
flagLow = df.groupby(['Name'], as_index=False)['Lower'].min()
flagUp = df.groupby(['Name'], as_index=False)['Upper'].max()
print(flagMin)
print(flagMax)
print(flagLow)
print(flagUp)
num = len(flagMin) #size of 2, works for all flags in this case
for i in range(num):
    # iterate through each row of parameters and column number 1 (the Min, Max, Lower, Upper columns)
    colMin = flagMin.iloc[i, 1]
    colMax = flagMax.iloc[i, 1]
    colLow = flagLow.iloc[i, 1]
    colUp = flagUp.iloc[i, 1]
    # set a flag if the column's value matches the flag DataFrame's value: 1 if true, 0 if false
    df['flagMin'] = np.where(df['Min'] == colMin, '1', '0')
    df['flagMax'] = np.where(df['Max'] == colMax, '1', '0')
    df['flagLow'] = np.where(df['Lower'] == colLow, '1', '0')
    df['flagUp'] = np.where(df['Upper'] == colUp, '1', '0')
    print(df)
The four flag DataFrames printed above:
Name Min
0 Vo 12.8
1 Vi -51.3
Name Max
0 Vo 39.9
1 Vi -25.7
Name Low
0 Vo -46.0
1 Vi -66.1
Name Up
0 Vo 94.3
1 Vi -14.1
Output 1st iteration
flagMax flagLow flagUp
0 0 0 0
1 0 0 0
2 0 0 0
3 1 0 0
4 0 0 0
5 0 0 0
6 0 0 1
7 0 1 0
8 0 0 0
9 0 0 0
10 0 0 0
11 0 0 0
12 0 0 0
13 0 0 0
14 0 0 0
15 0 0 0
16 0 0 0
17 0 0 0
Output 2nd Iteration
flagMax flagLow flagUp
0 0 0 0
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
5 0 0 0
6 0 0 0
7 0 0 0
8 0 0 0
9 1 0 1
10 0 0 0
11 0 0 0
12 0 0 0
13 0 0 0
14 0 0 0
15 0 1 0
16 0 0 0
17 0 0 0
I lose the 1s from rows 3, 6, and 7. I would like to keep the 1s from both sets of data. Thank you.
Just set to '1' only those elements you want to update, not the whole column.
import pandas as pd
import numpy as np
df = pd.read_excel('help test 1.xlsx')
#groupby separates the different Name parameters within the Name column and performs functions like finding the lowest of the "minimum" and "lower" columns and the highest of the "maximum" and "upper" columns.
flagMin = df.groupby(['Name'], as_index=False)['Min'].min()
flagMax = df.groupby(['Name'], as_index=False)['Max'].max()
flagLow = df.groupby(['Name'], as_index=False)['Lower'].min()
flagUp = df.groupby(['Name'], as_index=False)['Upper'].max()
print(flagMin)
print(flagMax)
print(flagLow)
print(flagUp)
num = len(flagMin) #size of 2, works for all flags in this case
df['flagMin'] = '0'
df['flagMax'] = '0'
df['flagLow'] = '0'
df['flagUp'] = '0'
for i in range(num):
    # iterate through each row of parameters and column number 1 (the Min, Max, Lower, Upper columns)
    colMin = flagMin.iloc[i, 1]
    colMax = flagMax.iloc[i, 1]
    colLow = flagLow.iloc[i, 1]
    colUp = flagUp.iloc[i, 1]
    # set a flag to 1 only where the column's value matches; other rows keep their previous value
    df.loc[df['Min'] == colMin, 'flagMin'] = '1'
    df.loc[df['Max'] == colMax, 'flagMax'] = '1'
    df.loc[df['Lower'] == colLow, 'flagLow'] = '1'
    df.loc[df['Upper'] == colUp, 'flagUp'] = '1'
print(df)
P.S. I don't know why you are using strings of '0' and '1' instead of just 0 and 1, but that's up to you.
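For reference, here is a loop-free sketch of the same flagging (my own suggestion, not part of the answer above), assuming the column names from the question; it writes integer 0/1 flags rather than strings:

import pandas as pd

df = pd.read_excel('help test 1.xlsx')  # the file from the question

# a row is flagged when its value equals the group-wise extreme for its Name
df['flagMin'] = (df['Min'] == df.groupby('Name')['Min'].transform('min')).astype(int)
df['flagMax'] = (df['Max'] == df.groupby('Name')['Max'].transform('max')).astype(int)
df['flagLow'] = (df['Lower'] == df.groupby('Name')['Lower'].transform('min')).astype(int)
df['flagUp'] = (df['Upper'] == df.groupby('Name')['Upper'].transform('max')).astype(int)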
I have a pandas DataFrame like this:
Movie Rate
0 5821 4
1 2124 2
2 7582 1
3 3029 5
4 17479 1
Both the movie and the rating can be repeated. I need to transform this DataFrame into something like this:
Movie Rate_1_Count Rate_2_Count ... Rate_5_Count
0 5821 20 1 5
1 2124 2 0 99
2 7582 50 22 22
...
where the movie IDs are unique and Rate_{Number}_Count is the count of ratings for that movie that are equal to {Number}.
I already accomplished this task using the code below, which I believe is very messy. I guess there must be a neater way to do it. Can anyone help me with it?
self.movie_df_tmp = self.rating_df[['MovieId', 'Rate']]
self.movie_df_tmp['RaCount'] = self.movie_df_tmp.groupby(['MovieId'])['Rate'].transform('count')
self.movie_df_tmp['Sum'] = self.movie_df_tmp.groupby(['MovieId'])['Rate'].transform('sum')
self.movie_df_tmp['NORC'] = self.movie_df_tmp.groupby(['MovieId', 'Rate'])['Rate'].transform('count')
self.movie_df_tmp = self.movie_df_tmp.drop_duplicates()
self.movie_df_tmp['Rate1C'] = self.movie_df_tmp[self.movie_df_tmp['Rate'] == 1]['NORC']
self.movie_df_tmp['Rate2C'] = self.movie_df_tmp[self.movie_df_tmp['Rate'] == 2]['NORC']
self.movie_df_tmp['Rate3C'] = self.movie_df_tmp[self.movie_df_tmp['Rate'] == 3]['NORC']
self.movie_df_tmp['Rate4C'] = self.movie_df_tmp[self.movie_df_tmp['Rate'] == 4]['NORC']
self.movie_df_tmp['Rate5C'] = self.movie_df_tmp[self.movie_df_tmp['Rate'] == 5]['NORC']
self.movie_df_tmp = self.movie_df_tmp.replace(np.nan, 0)
self.movie_df = self.movie_df_tmp[['MovieId', 'RaCount', 'Sum']].drop_duplicates()
self.movie_df_tmp = self.movie_df_tmp.drop(columns=['Rate', 'NORC', 'Sum', 'RaCount'])
self.movie_df_tmp = self.movie_df_tmp.groupby(['MovieId'])[
    ["Rate1C", "Rate2C", "Rate3C", "Rate4C", "Rate5C"]].apply(lambda x: x.astype(int).sum())
self.movie_df = self.movie_df.merge(self.movie_df_tmp, left_on='MovieId', right_on='MovieId')
self.movie_df = pd.DataFrame(self.movie_df.values,
                             columns=['MovieId', 'Rate1C', 'Rate2C', 'Rate3C', 'Rate4C', 'Rate5C'])
Try with pd.crosstab:
pd.crosstab(df['Movie'], df['Rate'])
Rate 1 2 4 5
Movie
2124 0 1 0 0
3029 0 0 0 1
5821 0 0 1 0
7582 1 0 0 0
17479 1 0 0 0
To fix the axis name and column names, chain rename + reset_index + rename_axis:
new_df = (
pd.crosstab(df['Movie'], df['Rate'])
.rename(columns=lambda c: f'Rate_{c}_Count')
.reset_index()
.rename_axis(columns=None)
)
Movie Rate_1_Count Rate_2_Count Rate_4_Count Rate_5_Count
0 2124 0 1 0 0
1 3029 0 0 0 1
2 5821 0 0 1 0
3 7582 1 0 0 0
4 17479 1 0 0 0
This should give you the desired output:
grouper = df.groupby(['Movie', 'Rate']).size()
dg = pd.DataFrame()
dg['Movie'] = df['Movie'].unique()
for i in [1, 2, 3, 4, 5]:
    dg['Rate_' + str(i) + '_Count'] = dg['Movie'].apply(
        lambda x: grouper[x, i] if (x, i) in grouper.index else 0)
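For completeness, here is an equivalent loop-free sketch using groupby + unstack (my own suggestion, assuming the columns are named 'Movie' and 'Rate' as in the question); the reindex step also forces all five rating columns to exist even when a rating never occurs:

import pandas as pd

counts = (
    df.groupby(['Movie', 'Rate']).size()
      .unstack(fill_value=0)
      .reindex(columns=range(1, 6), fill_value=0)   # ensure Rate 1..5 columns all exist
      .rename(columns=lambda c: f'Rate_{c}_Count')
      .reset_index()
      .rename_axis(columns=None)
)
print(counts)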
Confusing title, let me explain. I have two DataFrames like this.
DataFrame df1 looks like this (with millions of rows in the original):
id text c1
1 Hello world how are you people 1
2 Hello people I am fine people 1
3 Good Morning people -1
4 Good Evening -1
DataFrame df2 looks like this:
Word count Points Percentage
hello 2 2 100
world 1 1 100
how 1 1 100
are 1 1 100
you 1 1 100
people 3 1 33.33
I 1 1 100
am 1 1 100
fine 1 1 100
Good 2 -2 -100
Morning 1 -1 -100
Evening 1 -1 -100
df2 columns explanation:
count means the total number of times that word appeared in df1
points is points given to each word by some kind of algorithm
percentage = points/count*100
Now, I want to add 40 new columns to df1, according to the points & percentage. They will look like this:
perc_-90_2 perc_-80_2 perc_-70_2 perc_-60_2 perc_-50_2 perc_-40_2 perc_-20_2 perc_-10_2 perc_0_2 perc_10_2 perc_20_2 perc_30_2 perc_40_2 perc_50_2 perc_60_2 perc_70_2 perc_80_2 perc_90_2
perc_-90_1 perc_-80_1 perc_-70_1 perc_-60_1 perc_-50_1 perc_-40_1 perc_-20_1 perc_-10_1 perc_0_1 perc_10_1 perc_20_1 perc_30_1 perc_40_1 perc_50_1 perc_60_1 perc_70_1 perc_80_1 perc_90_1
Let me break it down. The column name contains 3 parts:
1.) perc is just a string; it means nothing.
2.) Numbers in the range -90 to +90. For example, -90 here means the percentage is -90 in df2. So, for example, if a word has a percentage value in the range 81-90, there will be a value of 1 in that row, in the column named perc_-80_xx. The xx is the third part.
3.) The third part is the count. Here I want two types of counts: 1 and 2. Following the example in point 2, if the word count is in the range 0 to 1, the value will be 1 in the perc_-80_1 column. If the word count is 2 or more, the value will be 1 in the perc_-80_2 column.
I hope it is not too confusing.
Use:
#change the previous answer by also adding id, for matching
df2 = (df.drop_duplicates(['id', 'Word'])
         .groupby('Word', sort=False)
         .agg({'c1': ['sum', 'size'], 'id': 'first'}))
df2.columns = df2.columns.map(''.join)
df2 = df2.reset_index()
df2 = df2.rename(columns={'c1sum': 'Points', 'c1size': 'Totalcount', 'idfirst': 'id'})
df2['Percentage'] = df2['Points'] / df2['Totalcount'] * 100
s1 = df2['Percentage'].div(10).astype(int).mul(10).astype(str)
s2 = np.where(df2['Totalcount'] == 1, '1', '2')
#s2 = np.where(df1['Totalcount'].isin([0, 1]), '1', '2')
#create the column name by joining the pieces
df2['new'] = 'perc_' + s1 + '_' + s2
#create an indicator DataFrame
df3 = pd.get_dummies(df2[['id', 'new']].drop_duplicates().set_index('id'),
                     prefix='',
                     prefix_sep='').groupby(level=0).max()
print (df3)
#reindex to add the missing columns
c = 'perc_' + pd.Series(np.arange(-100, 110, 10).astype(str)) + '_'
cols = (c + '1').append(c + '2')
#join to original df1
df = df1.join(df3.reindex(columns=cols, fill_value=0), on='id')
print (df)
id text c1 perc_-100_1 perc_-90_1 \
0 1 Hello world how are you people 1 0 0
1 2 Hello people I am fine people 1 0 0
2 3 Good Morning people -1 1 0
3 4 Good Evening -1 1 0
perc_-80_1 perc_-70_1 perc_-60_1 perc_-50_1 perc_-40_1 ... perc_10_2 \
0 0 0 0 0 0 ... 0
1 0 0 0 0 0 ... 0
2 0 0 0 0 0 ... 0
3 0 0 0 0 0 ... 0
perc_20_2 perc_30_2 perc_40_2 perc_50_2 perc_60_2 perc_70_2 \
0 0 1 0 0 0 0
1 0 0 0 0 0 0
2 0 0 0 0 0 0
3 0 0 0 0 0 0
perc_80_2 perc_90_2 perc_100_2
0 0 0 1
1 0 0 0
2 0 0 0
3 0 0 0
[4 rows x 45 columns]
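One caveat worth flagging (my own observation, not part of the answer above): .astype(int) truncates toward zero, so a negative percentage such as -33.33 is labelled with the -30 bucket while 33.33 gets the 30 bucket. If the intent is that every percentage falls into the 10-wide bucket below it, a floor-based label is a possible alternative sketch:

import numpy as np

# floor-based bucket label: -33.33 -> '-40', 33.33 -> '30'
s1 = (np.floor(df2['Percentage'] / 10).astype(int) * 10).astype(str)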
I want to count the number of consecutive zeros in each row of my DataFrame, shown below. Help please.
DEC JAN FEB MARCH APRIL MAY consecutive zeros
0 X X X 1 0 1 0
1 X X X 1 0 1 0
2 0 0 1 0 0 1 2
3 1 0 0 0 1 1 3
4 0 0 0 0 0 1 5
5 X 1 1 0 0 0 3
6 1 0 0 1 0 0 2
7 0 0 0 0 1 0 4
For each row, you want the cumulative sum of (1 - row), reset at every point where row == 1. Then you take the row's max.
For example
ts = pd.Series([0,0,0,0,1,1,0,0,1,1,1,0])
ts2 = 1-ts
tsgroup = ts.cumsum()
consec_0 = ts2.groupby(tsgroup).transform(pd.Series.cumsum)
consec_0.max()
will give you 4 as needed.
Write that in a function and apply it to your DataFrame.
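A minimal sketch of wrapping that idea into a row-wise helper (my own adaptation, assuming that any non-zero cell such as 'X' or 1 simply breaks a run of zeros, and that isolated single zeros count as 0, as in the expected output above):

import pandas as pd

def max_consecutive_zeros(row):
    is_zero = (row.astype(str) == '0').astype(int)  # 1 where the cell is a literal zero
    groups = (1 - is_zero).cumsum()                 # run id: increments at every non-zero cell
    longest = is_zero.groupby(groups).cumsum().max()
    return longest if longest >= 2 else 0           # single zeros count as 0, per the expected output

df['consecutive zeros'] = df.apply(max_consecutive_zeros, axis=1)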
Here's my two cents...
Think of all the other non-zero elements as 1, then you will have a binary code. All you need to do now is find the 'largest interval' where there's no bit flip starting with 0.
We can write a function and 'apply' it with a lambda:
import numpy as np

def len_consec_zeros(a):
    a = np.array(list(a))               # split the joined row string into single characters
    rr = np.argwhere(a == '0').ravel()  # positions of '0'
    if not rr.size:                     # if there are no zeros, return 0
        return 0
    full = np.arange(rr[0], rr[-1] + 1)  # the span from the first zero to the last zero
    # indices inside that span where '0' was flipped to something else
    diff = np.setdiff1d(full, rr)
    if not diff.size:                    # no bit flips: the whole span is zeros
        # single zeros count as 0, matching the expected output
        return len(full) if len(full) != 1 else 0
    # break the span into pieces wherever there's a bit flip;
    # the result is the size of the largest chunk
    pos, difs = full[0], []
    for el in diff:
        difs.append(el - pos)
        pos = el + 1
    difs.append(full[-1] + 1 - pos)
    # return the size of the largest chunk (runs of length 1 count as 0)
    res = max(difs) if max(difs) != 1 else 0
    return res
Now that you have this function, call it on every row...
# join all columns to get a string column
# assuming you have your data in `df`
df['concated'] = df.astype(str).apply(lambda x: ''.join(x), axis=1)
df['consecutive_zeros'] = df.concated.apply(lambda x: len_consec_zeros(x))
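A quick sanity check of the helper on a hypothetical joined row string:

print(len_consec_zeros('1001001'))  # -> 2, the longest run of zeros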
Here's one approach -
# Inspired by https://stackoverflow.com/a/44385183/
def pos_neg_counts(mask):
    idx = np.flatnonzero(mask[1:] != mask[:-1])
    if len(idx) == 0:  # to handle all-0s or all-1s cases
        if mask[0]:
            return np.array([mask.size]), np.array([0])
        else:
            return np.array([0]), np.array([mask.size])
    else:
        count = np.r_[[idx[0] + 1], idx[1:] - idx[:-1], [mask.size - 1 - idx[-1]]]
        if mask[0]:
            return count[::2], count[1::2]  # True, False counts
        else:
            return count[1::2], count[::2]  # True, False counts

def get_consecutive_zeros(df):
    arr = df.values
    mask = (arr == 0) | (arr == '0')
    zero_count = np.array([pos_neg_counts(i)[0].max() for i in mask])
    zero_count[zero_count < 2] = 0
    return zero_count
Sample run -
In [272]: df
Out[272]:
DEC JAN FEB MARCH APRIL MAY
0 X X X 1 0 1
1 X X X 1 0 1
2 0 0 1 0 0 1
3 1 0 0 0 1 1
4 0 0 0 0 0 1
5 X 1 1 0 0 0
6 1 0 0 1 0 0
7 0 0 0 0 1 0
In [273]: df['consecutive_zeros'] = get_consecutive_zeros(df)
In [274]: df
Out[274]:
DEC JAN FEB MARCH APRIL MAY consecutive_zeros
0 X X X 1 0 1 0
1 X X X 1 0 1 0
2 0 0 1 0 0 1 2
3 1 0 0 0 1 1 3
4 0 0 0 0 0 1 5
5 X 1 1 0 0 0 3
6 1 0 0 1 0 0 2
7 0 0 0 0 1 0 4
What's the best way to do the following in Python/pandas, please?
I want to count the occurrences where trend data 2 steps out of line with trend data 1, and reset the counter each time trend data 1 changes.
I'm struggling with the right way to do it on the DataFrame, creating a new column df['D'] in this example.
df['A'] = trend data 1
df['B'] = boolean indicator if trend data 1 changes
df['C'] = trend data 2
df['D'] = desired result
df['A'] df['B'] df['C'] df['D']
1 0 1 0
1 0 1 0
-1 1 -1 0
-1 0 -1 0
-1 0 1 1
-1 0 -1 1
-1 0 -1 1
-1 0 1 2
-1 0 1 2
-1 0 -1 2
1 1 1 0
1 0 1 0
1 0 -1 1
1 0 1 1
1 0 -1 2
1 0 1 2
1 0 1 2
In Excel I would simply use:
=IF(B2=1,0,IF(AND((C2<>C1),(C2<>A2)),D1+1,D1))
However, I've always struggled with not being able to reference prior cells in pandas.
I can't use np.where(). I'm sure it's just a matter of applying a function in the correct way, but I can't seem to make it work while referencing other columns and resetting the variable. I've looked at other answers but can't find anything that works in this situation.
Something like this (note: first create df['E'] = df['C'].shift(1)):
def corrections(x):
    if df['B'] == 1:
        x = 0
    elif (df['C'] != df['E']) and (df['C'] != df['A']):
        x = x + 1
    else:
        x
Apologies, as I feel I'm missing something rather simple with this question, but I just keep going round in circles!
def make_D(df):
    counter = 0
    array = []
    for index in df.index:
        # count only when C differs from A and C has just changed from the previous row
        if index > 0 and df.loc[index, 'A'] != df.loc[index, 'C'] \
                and df.loc[index, 'C'] != df.loc[index - 1, 'C']:
            counter = counter + 1
        # reset the counter whenever trend data 1 changes (B == 1)
        if df.loc[index, 'B'] == 1:
            counter = 0
        array.append(counter)
    df['D'] = array
    return df

new_df = make_D(df)
hope it helps!
#Set a list to store values for column D
d = []
#calculate D using the given conditions
df.apply(lambda x: d.append(0) if ((x.name == 0) | (x.B == 1))
         else d.append(d[-1] + 1) if (x.C != df.iloc[x.name - 1].C) & (x.C != x.A)
         else d.append(d[-1]), axis=1)
#set column D using the values from the list d
df['D'] = d
Out[594]:
A B C D
0 1 0 1 0
1 1 0 1 0
2 -1 1 -1 0
3 -1 0 -1 0
4 -1 0 1 1
5 -1 0 -1 1
6 -1 0 -1 1
7 -1 0 1 2
8 -1 0 1 2
9 -1 0 -1 2
10 1 1 1 0
11 1 0 1 0
12 1 0 -1 1
13 1 0 1 1
14 1 0 -1 2
15 1 0 1 2
16 1 0 1 2
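For reference, here is a vectorized sketch of the same logic (my own suggestion, not taken from either answer), assuming columns A, B and C as in the question: a new counting block starts whenever B == 1, and within each block we count the rows where C has just changed and also differs from A, with the reset row itself never counting.

import pandas as pd

block = df['B'].cumsum()                                    # block id: increments at every reset row
step = (df['C'] != df['C'].shift()) & (df['C'] != df['A'])  # C changed and disagrees with A
step &= df['B'].ne(1)                                       # the reset row itself contributes 0
df['D'] = step.astype(int).groupby(block).cumsum()

On the sample data above this reproduces column D; note that the very first row relies on C.shift() being NaN, so other inputs may need that edge case handled explicitly.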
I'm looking for a way to filter/search for sequences/patterns in the rows of a DataFrame that looks like this:
sensor A B C D E F
date
2011-11-02 19:22:32 0 0 0 0 1 0
2011-11-02 19:29:18 0 0 0 0 1 0
2011-11-02 19:29:30 0 0 1 0 1 0
2011-11-02 19:29:34 0 0 1 1 1 0
2011-11-02 19:29:35 0 0 1 1 0 0
2011-11-02 19:30:06 0 0 1 0 0 0
2011-11-02 19:30:10 0 0 1 0 1 0
2011-11-02 19:30:46 0 0 0 0 1 0
2011-11-02 19:31:25 0 0 1 0 1 0
2011-11-02 19:31:26 0 0 1 0 0 0
2011-11-02 19:31:31 0 0 1 1 0 0
2011-11-02 19:31:41 0 0 0 1 0 0
I need to know in which time frames the sensors (A, B, C, ...) were active (value == 1). E.g. for sensor C there are two intervals:
start: 2011-11-02 19:29:30, end: 2011-11-02 19:30:46
start: 2011-11-02 19:31:25, end: 2011-11-02 19:31:41
So:
0 -> 1: start date and
1 -> 0: end date
My first solution was to iterate over the rows. But since the real dataset is quite big, I was wondering if there is any way this can be done with pandas.
Thanks.
res = {}
t = df - df.shift(1)
for col in df.columns:
    res[col] = t[col][t[col] != 0]
When a value for a particular column is 1, it means the time frame has started; when it's -1, it means it's over.
Also, you could use a dict comprehension instead:
res = {col: t[col][t[col] != 0] for col in df.columns}
You can do it like this:
col = df['A'].astype(bool)
scol = col.shift(fill_value=False)
starts = col & ~scol
ends = ~col & scol
if col.iloc[-1]:
    ends.iloc[-1] = True
Then starts and ends will be two boolean series marking all start dates and end dates in the column 'A'.
The last two lines create an end date that would otherwise be missing if a column ends like ... 1 1. Likewise, if the column begins with 1 1 ... (as DSM noted in the question's comments), a start date will be created.
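A small usage sketch (my own addition, assuming the DatetimeIndex from the question) for pulling the actual start and end timestamps out of the boolean masks above:

start_times = starts[starts].index
end_times = ends[ends].index
for s, e in zip(start_times, end_times):
    print(f"start: {s}, end: {e}")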