I have a pandas series containing zeros and ones:
df1 = pd.Series([ 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0])
df1
Out[3]:
0 0
1 0
2 0
3 0
4 0
5 1
6 1
7 1
8 0
9 0
10 0
I would like to create a dataframe df2 that contains the start and the end of intervals with the same value, together with the value associated... df2 in this case should be...
df2
Out[5]:
Start End Value
0 0 4 0
1 5 7 1
2 8 10 0
My attempt was:
from operator import itemgetter
from itertools import groupby
a=[next(group) for key, group in groupby(enumerate(df1), key=itemgetter(1))]
df2 = pd.DataFrame(a,columns=['Start','Value'])
but I don't know how to get the 'End' indices.
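For reference, one way to finish this attempt is to materialize each group so its last element is reachable; a minimal sketch building on the code above:
from operator import itemgetter
from itertools import groupby

rows = []
for key, group in groupby(enumerate(df1), key=itemgetter(1)):
    group = list(group)                            # materialize to reach the last (index, value) pair
    rows.append((group[0][0], group[-1][0], key))  # (start, end, value)

df2 = pd.DataFrame(rows, columns=['Start', 'End', 'Value'])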
You can group by a helper Series created by comparing df1 with its shifted self (shift) and taking the cumulative sum (cumsum). Then apply a custom function and finally reshape by unstack.
s = df1.ne(df1.shift()).cumsum()
df2 = (df1.groupby(s)
          .apply(lambda x: pd.Series([x.index[0], x.index[-1], x.iat[0]],
                                     index=['Start', 'End', 'Value']))
          .unstack()
          .reset_index(drop=True))
print (df2)
Start End Value
0 0 4 0
1 5 7 1
2 8 10 0
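For the sample data, the helper s assigns each run of equal values its own group id; a quick check:
print(s.tolist())
# [1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3]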
Another solution aggregates with agg using first and last, but more code is needed to shape the result into the desired output.
s = df1.ne(df1.shift()).cumsum()
d = {'first':'Start','last':'End'}
df2 = (df1.reset_index(name='Value')
          .groupby([s, 'Value'])['index']
          .agg(['first', 'last'])
          .reset_index(level=0, drop=True)
          .reset_index()
          .rename(columns=d)
          .reindex(['Start', 'End', 'Value'], axis=1))
print (df2)
Start End Value
0 0 4 0
1 5 7 1
2 8 10 0
You could use the pd.Series.diff() method to identify the starting indexes:
df2 = pd.DataFrame()
df2['Start'] = df1[df1.diff().fillna(1) != 0].index
Then compute end indexes from this:
df2['End'] = [e - 1 for e in df2['Start'][1:]] + [df1.index.max()]
And finally gather the associated values:
df2['Value'] = df1[df2['Start']].values
Output:
Start End Value
0 0 4 0
1 5 7 1
2 8 10 0
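The End column can also be derived without the list comprehension, by shifting Start (a small sketch on the same df2):
df2['End'] = df2['Start'].shift(-1).sub(1).fillna(df1.index.max()).astype(int)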
What you are looking for is getting the first and last values in a groupby:
import pandas as pd

def first_last(df):
    return df.iloc[[0, -1]]

df = pd.DataFrame([3]*4 + [4]*4 + [1]*4 + [3]*3, columns=['value'])
print(df)
df['block'] = (df.value.shift(1) != df.value).astype(int).cumsum()
df = df.reset_index().groupby(['block', 'value'])['index'].agg(['first', 'last']).reset_index()
del df['block']
print(df)
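For this sample frame, the final print should show something like:
value first last
0 3 0 3
1 4 4 7
2 1 8 11
3 3 12 14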
You can group using shift and cumsum and take the first and last index of each group:
import numpy as np

df2 = df1.groupby((df1 != df1.shift()).cumsum()).apply(
    lambda x: np.ravel([x.index[0], x.index[-1], x.unique()]))
df2 = pd.DataFrame(df2.values.tolist()).rename(columns={0: 'Start', 1: 'End', 2: 'Value'})
You get
Start End Value
0 0 4 0
1 5 7 1
2 8 10 0
Related
I have the following data
attr1_A attr1_B attr1_C attr1_D attr2_A attr2_B attr2_C
1 0 0 1 1 0 0
0 1 1 0 0 0 1
0 0 0 0 0 1 0
1 1 1 0 1 1 0
I want to retain attr1_A, attr1_B and combine attr1_C and attr1_D into attr1_others. As long as attr1_C and/or attr1_D is 1, then attr1_others will be 1. Similarly, I want to keep attr2_A but combine the remaining attr2_* into attr2_others. Like this:
attr1_A attr1_B attr1_others attr2_A attr2_others
1 0 1 1 0
0 1 1 0 1
0 0 0 0 1
1 1 1 1 1
In other words, for any group of attributes, I want to retain a few known columns but combine the remaining ones (and I don't know in advance how many remaining attributes each group has).
I am thinking of doing each group separately: processing all attr1_*, and then attr2_* because there are a limited number of groups in my dataset, but many attr under each group.
What I can think of right now is to retrieve the 'others' columns like:
# for group 1
df[[x for x in df.columns if "A" not in x and "B" not in x and "attr1_" in x]]
# for group 2
df[[x for x in df.columns if "A" not in x and "attr2_" in x]]
And to combine, I am thinking of using any function, but I can't come up with the syntax. Could you help?
Updated attempt:
I tried this
# for group 1
df['attr1_others'] = df[df[[x for x in list(df.columns)
                            if "attr1_" in x
                            and "A" not in x
                            and "B" not in x]].any(axis = 'column')]
but got the below error:
ValueError: No axis named column for object type <class 'pandas.core.frame.DataFrame'>
DataFrames make it easy to manipulate data directly, without having to write complex Python logic.
To create your attr1_others and attr2_others columns, you can combine the columns with or conditions using this:
df['attr1_others'] = df['attr1_C'] | df['attr1_D']
df['attr2_others'] = df['attr2_B'] | df['attr2_C']
If instead, you wanted an and condition, you could use:
df['attr1_others'] = df['attr1_C'] & df['attr1_D']
df['attr2_others'] = df['attr2_B'] & df['attr2_C']
You can then delete the lingering original values using del:
del df['attr1_C']
del df['attr1_D']
del df['attr2_B']
del df['attr2_C']
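Equivalently, the four del statements can be collapsed into a single drop call:
df = df.drop(columns=['attr1_C', 'attr1_D', 'attr2_B', 'attr2_C'])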
Create a list of kept columns. Drop those columns and assign the leftover columns to a new dataframe df1. Group df1 by the split column-name prefixes with axis=1; call any; add_suffix '_others' and assign the result to df2. Finally, join and sort_index.
keep_cols = ['attr1_A', 'attr1_B', 'attr2_A']
df1 = df.drop(keep_cols, axis=1)
df2 = (df1.groupby(df1.columns.str.split('_').str[0], axis=1)
          .any().add_suffix('_others').astype(int))
Out[512]:
attr1_others attr2_others
0 1 0
1 1 1
2 0 1
3 1 1
df_final = df[keep_cols].join(df2).sort_index(axis=1)
Out[514]:
attr1_A attr1_B attr1_others attr2_A attr2_others
0 1 0 1 1 0
1 0 1 1 0 1
2 0 0 0 0 1
3 1 1 1 1 1
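Note that groupby with axis=1 is deprecated in recent pandas; the same column-wise grouping can be sketched on the transposed frame instead (same df1 as above assumed):
df2 = (df1.T.groupby(df1.columns.str.split('_').str[0])
          .any().T
          .add_suffix('_others').astype(int))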
You can use a custom list to select columns, and then .any() with the axis=1 parameter. To convert to integer, use .astype(int).
For example:
import pandas as pd
df = pd.DataFrame({
'attr1_A': [1, 0, 0, 1],
'attr1_B': [0, 1, 0, 1],
'attr1_C': [0, 1, 0, 1],
'attr1_D': [1, 0, 0, 0],
'attr2_A': [1, 0, 0, 1],
'attr2_B': [0, 0, 1, 1],
'attr2_C': [0, 1, 0, 0]})
cols = [col for col in df.columns.values if col.startswith('attr1') and col.split('_')[1] not in ('A', 'B')]
df['attr1_others'] = df[cols].any(axis=1).astype(int)
df.drop(cols, axis=1, inplace=True)
cols = [col for col in df.columns.values if col.startswith('attr2') and col.split('_')[1] not in ('A', )]
df['attr2_others'] = df[cols].any(axis=1).astype(int)
df.drop(cols, axis=1, inplace=True)
print(df)
Prints:
attr1_A attr1_B attr2_A attr1_others attr2_others
0 1 0 1 1 0
1 0 1 0 1 1
2 0 0 0 0 1
3 1 1 1 1 1
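If there are several groups, the same idea generalizes to a small loop over a prefix-to-kept-suffixes mapping (the mapping below is hypothetical):
keep = {'attr1': ('A', 'B'), 'attr2': ('A',)}  # hypothetical: which suffixes to keep per group
for prefix, kept in keep.items():
    cols = [c for c in df.columns
            if c.startswith(prefix) and c.split('_')[1] not in kept]
    df[prefix + '_others'] = df[cols].any(axis=1).astype(int)
    df = df.drop(columns=cols)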
I have a list of columns to create:
new_cols = ['new_1', 'new_2', 'new_3']
I want to create these columns in a dataframe and fill them with zero:
df[new_cols] = 0
I get this error:
"['new_1', 'new_2', 'new_3'] not in index"
which is true but unfortunate as I want to create them...
EDIT: This is a duplicate of this question: Add multiple empty columns to pandas DataFrame. However, I keep this one too because the accepted answer here was the simple solution I was looking for, and it was not the accepted answer over there.
EDIT 2: While the accepted answer is the simplest, interesting one-liner solutions were posted below.
You need to add the columns one by one.
for col in new_cols:
df[col] = 0
Also see the answers in here for other methods.
Use assign by dictionary:
df = pd.DataFrame({
'A': ['a','a','a','a','b','b','b','c','d'],
'B': list(range(9))
})
print (df)
A B
0 a 0
1 a 1
2 a 2
3 a 3
4 b 4
5 b 5
6 b 6
7 c 7
8 d 8
new_cols = ['new_1', 'new_2', 'new_3']
df = df.assign(**dict.fromkeys(new_cols, 0))
print (df)
A B new_1 new_2 new_3
0 a 0 0 0 0
1 a 1 0 0 0
2 a 2 0 0 0
3 a 3 0 0 0
4 b 4 0 0 0
5 b 5 0 0 0
6 b 6 0 0 0
7 c 7 0 0 0
8 d 8 0 0 0
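Another one-liner in the same spirit uses reindex with fill_value:
df = df.reindex(columns=[*df.columns, *new_cols], fill_value=0)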
import pandas as pd
new_cols = ['new_1', 'new_2', 'new_3']
df = pd.DataFrame.from_records([(0, 0, 0)], columns=new_cols)
Is this what you're looking for? (Note this builds a new one-row DataFrame rather than adding columns to an existing one.)
You can use assign:
new_cols = ['new_1', 'new_2', 'new_3']
values = [0, 0, 0] # could be anything, also pd.Series
df = df.assign(**dict(zip(new_cols, values)))
Try looping through the column names before creating the column:
for col in new_cols:
df[col] = 0
We can use the apply function to loop through the columns in the dataframe and assign each element to a new field.
For instance, for a dataframe with a list column named keys, where each entry looks like
[10,20,30]
In your case, since it's all 0, we can directly assign them as 0 instead of looping through. But if we have values, we can populate them as below:
...
df['new_01']=df['keys'].apply(lambda x: x[0])
df['new_02']=df['keys'].apply(lambda x: x[1])
df['new_03']=df['keys'].apply(lambda x: x[2])
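A vectorized alternative expands the list column in one step (a sketch, assuming every list in keys has exactly three elements):
expanded = pd.DataFrame(df['keys'].tolist(), index=df.index,
                        columns=['new_01', 'new_02', 'new_03'])
df = df.join(expanded)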
I am trying to develop a process that automatically scales each Series in a pandas df to zero. For instance, if we use the df below:
import pandas as pd
d = ({
'A' : [0,1,2,3],
'B' : [6,7,8,9],
'C' : [10,11,12,13],
'D' : [-4,-5,-4,-3],
})
df = pd.DataFrame(data=d)
I'm manually adjusting each Column so it begins at zero. You'll notice the increments are either +1 or -1 but the starting integers vary.
df['B'] = df['B'] - 6
df['C'] = df['C'] - 10
df['D'] = df['D'] + 4
Output:
A B C D
0 0 0 0 0
1 1 1 1 -1
2 2 2 2 0
3 3 3 3 1
This isn't very efficient as I have to go through each series to determine the scaling factor. Is there a more efficient way to determine this?
You can subtract the first row, selected by iloc, with sub:
df = df.sub(df.iloc[0])
#same as
#df = df - df.iloc[0]
print (df)
A B C D
0 0 0 0 0
1 1 1 1 -1
2 2 2 2 0
3 3 3 3 1
Detail:
print (df.iloc[0])
A 0
B 6
C 10
D -4
Name: 0, dtype: int64
I have a DataFrame with the following structure:
I want to transform the DataFrame so that for every unique user_id, if a column contains a 1, then the whole column should contain 1s for that user_id. Assume that I don't know all of the column names in advance. Based on the above input, the output would be:
I have the following code (please excuse how unsuccinct it is):
df = df.groupby('user_id').apply(self.transform_columns)

def transform_columns(self, x):
    x.apply(self.transform)

def transform(self, x):
    if 1 in x:
        for element in x:
            element = 1
    var = x
At the point of the transform function, x is definitely a series. For some reason this code is returning an empty DataFrame. Btw if you also know a way of excluding certain columns from the transformation (e.g. user_id) that would be great. Please help.
I'm going to explain how I transformed the data into the initial state for the input, as after attempting Jezrael's answer, I am getting a KeyError on the 'user_id' column (which definitely exists in the df). The initial state of the data was as below:
I transformed it to the state shown in the first image in the question with the following code:
df2 = self.add_support_columns(df)
df = df.join(df2)

def add_support_columns(self, df):
    df['pivot_column'] = df.apply(self.get_col_name, axis=1)
    df['flag'] = 1
    df = df.pivot(index='user_id', columns='pivot_column')['flag']
    df.reset_index(inplace=True)
    df = df.fillna(0)
    return df
You can use set_index + groupby + transform with any + reset_index:
It works because any treats 1s as True, so if a group contains at least one 1 it returns 1, otherwise 0.
df = pd.DataFrame({
'user_id' : [33,33,33,33,22,22],
'q1' : [1,0,0,0,0,0],
'q2' : [0,0,0,0,1,0],
'q3' : [0,1,0,0,0,1],
})
df = df.reindex(['user_id','q1','q2','q3'], axis=1)
print (df)
user_id q1 q2 q3
0 33 1 0 0
1 33 0 0 1
2 33 0 0 0
3 33 0 0 0
4 22 0 1 0
5 22 0 0 1
df = (df.set_index('user_id')
        .groupby('user_id')  # or groupby(level=0)
        .transform(lambda x: 1 if x.any() else 0)
        .reset_index())
print (df)
user_id q1 q2 q3
0 33 1 0 1
1 33 1 0 1
2 33 1 0 1
3 33 1 0 1
4 22 0 1 1
5 22 0 1 1
Solution with join:
df = df[['user_id']].join(df.groupby('user_id').transform(lambda x: 1 if x.any() else 0))
print (df)
user_id q1 q2 q3
0 33 1 0 1
1 33 1 0 1
2 33 1 0 1
3 33 1 0 1
4 22 0 1 1
5 22 0 1 1
EDIT:
A more dynamic solution with difference + reindex:
#select only some columns
cols = ['q1','q2']
#all another columns are not transforming
cols2 = df.columns.difference(cols)
df1 = df[cols2].join(df.groupby('user_id')[cols].transform(lambda x: 1 if x.any() else 0))
#if need same order of columns as original
df1 = df1.reindex(df.columns, axis=1)
print (df1)
user_id q1 q2 q3
0 33 1 0 0
1 33 1 0 1
2 33 1 0 0
3 33 1 0 0
4 22 0 1 0
5 22 0 1 1
Also logic can be inverted:
#select only columns which are not transforming
cols = ['user_id']
#all another columns are transforming
cols2 = df.columns.difference(cols)
df1 = df[cols].join(df.groupby('user_id')[cols2].transform(lambda x: 1 if x.any() else 0))
df1 = df1.reindex(df.columns, axis=1)
print (df1)
user_id q1 q2 q3
0 33 1 0 1
1 33 1 0 1
2 33 1 0 1
3 33 1 0 1
4 22 0 1 1
5 22 0 1 1
EDIT:
A more efficient solution returns only a boolean mask and then converts it to int:
df1 = df.groupby('user_id').transform('any').astype(int)
Timings:
In [170]: %timeit (df.groupby('user_id').transform(lambda x: 1 if x.any() else 0))
1 loop, best of 3: 514 ms per loop
In [171]: %timeit (df.groupby('user_id').transform('any').astype(int))
10 loops, best of 3: 84 ms per loop
Sample for timings:
import numpy as np

np.random.seed(123)
N = 1000
df = pd.DataFrame(np.random.choice([0, 1], size=(N, 3)),
                  index=np.random.randint(1000, size=N))
df.index.name = 'user_id'
df = df.add_prefix('q').reset_index()
#print (df)
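Note that transform excludes the grouping column, so user_id can be re-attached with join as in the earlier solution (assuming the original sample df):
df1 = df[['user_id']].join(df.groupby('user_id').transform('any').astype(int))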
I have a DataFrame like this:
a b c d
1 0 0 0
0 1 0 7
5 2 0 4
6 3 0 0
0 0 8 8
0 7 7 7
0 0 0 1
1: for each row, if the count of zeros is > 90% of the number of columns (in this case 0.9*4), then delete the row.
2: for each column, if the count of zeros is > 90% of the number of rows (in this case 0.9*7), then delete the column.
I guess you want something like:
mask_rows = (df == 0).sum(axis=1) > 0.9 * len(df.columns)
mask_cols = (df == 0).sum(axis=0) > 0.9 * len(df)
This creates masks following my interpretation of your question...
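The masks can then be applied with boolean indexing:
df = df.loc[~mask_rows, ~mask_cols]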
First create a mask that reveals where the zeros are:
df_temp = (df == 0)
Then drop the rows:
df.drop(df_temp.index[df_temp.mean(axis=1) > 0.9], inplace=True)
And finally the columns:
df.drop(columns=df_temp.columns[df_temp.mean() > 0.9], inplace=True)