How to obtain NaN values in pandas.groupby - python

When I pass my data to the groupby function with the following code:
x =x.groupby(['Time', 'Distance'],as_index=True,observed=False).size().reset_index()
x.columns=['Time','Distance','Flow']
x.head(3)
I get this output:
Time Distance Flow
0 0 5 1
1 0 7 170
2 0 8 10
However, I need to do some smoothing, so I also need the skipped values, such as:
Time Distance Flow
0 0 0 0
1 0 1 0
2 0 2 0
3 0 3 0
4 0 4 0
5 0 5 1
and so on. In short, I also need the missing grouping values. How can I do this?

Use:
x = pd.DataFrame({
'Time':[0,1,1,1,1,0],
'Distance':[4,5,4,5,5,3],
})
df = x.groupby(['Time', 'Distance'],as_index=True,observed=False).size()
print (df)
Time Distance
0 3 1
4 1
1 4 1
5 3
dtype: int64
df1 = df.unstack(fill_value=0).stack().reset_index(name='Flow')
print (df1)
Time Distance Flow
0 0 3 1
1 0 4 1
2 0 5 0
3 1 3 0
4 1 4 1
5 1 5 3
Or:
m = pd.MultiIndex.from_product(df.index.levels, names=df.index.names)
df1 = df.reindex(m, fill_value=0).reset_index(name='Flow')
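Both approaches above fill only combinations of values that occur somewhere in the data. The question's desired output also lists distances that were never observed (0 to 4), so one option is to reindex against an explicit range instead. This is a minimal sketch, assuming the distances are integers and every value from 0 up to the observed maximum is wanted; the name full_index is illustrative.

```python
import pandas as pd

x = pd.DataFrame({
    'Time': [0, 1, 1, 1, 1, 0],
    'Distance': [4, 5, 4, 5, 5, 3],
})

# Count rows per observed (Time, Distance) pair.
counts = x.groupby(['Time', 'Distance']).size()

# Assumption: every integer distance from 0 to the observed maximum is wanted,
# for every observed time value.
full_index = pd.MultiIndex.from_product(
    [sorted(x['Time'].unique()), range(x['Distance'].max() + 1)],
    names=['Time', 'Distance'],
)

# Pairs missing from the data get a count of 0.
flow = counts.reindex(full_index, fill_value=0).reset_index(name='Flow')
print(flow)
```

If a different set of distances is needed, replace the range with that list.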

Related

Sum specific number of columns for each row with Pandas

I have the following dataframe:
name code 1 2 3 4 5 6 7 .........155 days
0 Lari EH214 0 5 2 1 0 0 0 0 3
1 Suzi FK362 0 0 0 0 2 3 0 0 108
2 Jil LM121 0 0 4 2 1 0 0 0 5
...
I want to sum the columns from column 1 up to the column whose number appears in "days". For example,
for row 1, I will sum 3 days -> 0+5+2
for row 2, 108 days,
for row 3, 5 days -> 0+4+2+1+0
How can I do something like this?
I am looking for a method.
For a vectorized solution, first select the day columns by position, then build a mask by comparing the column numbers with days using NumPy broadcasting, replace the non-matching values with 0 in DataFrame.where, and finally sum:
df1 = df.iloc[:, 2:-1]
m = df1.columns.astype(int).to_numpy() <= df['days'].to_numpy()[:, None]
df['sum'] = df1.where(m, 0).sum(axis=1)
print (df)
name code 1 2 3 4 5 6 7 155 days sum
0 Lari EH214 0 5 2 1 0 0 0 0 3 7
1 Suzi FK362 0 0 0 0 2 3 0 0 108 5
2 Jil LM121 0 0 4 2 1 0 0 0 5 7
IIUC, use:
df['sum'] = df.apply(lambda r: r.loc[1: r['days']].sum(), axis=1)
or, if the column names are strings:
df['sum'] = df.apply(lambda r: r.loc['1': str(r['days'])].sum(), axis=1)
output:
name code 1 2 3 4 5 6 7 155 days sum
0 Lari EH214 0 5 2 1 0 0 0 0 3 7
1 Suzi FK362 0 0 0 0 2 3 0 0 108 5
2 Jil LM121 0 0 4 2 1 0 0 0 5 7
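The vectorized approach can be tried on a reduced frame. This is a sketch under stated assumptions: the frame below has only seven day columns (the original has 155) and their labels are integers; the names day_cols, day_numbers and mask are illustrative.

```python
import pandas as pd

# Hypothetical reduced frame: seven day columns instead of 155.
df = pd.DataFrame({
    'name': ['Lari', 'Suzi', 'Jil'],
    'code': ['EH214', 'FK362', 'LM121'],
    1: [0, 0, 0], 2: [5, 0, 0], 3: [2, 0, 4], 4: [1, 0, 2],
    5: [0, 2, 1], 6: [0, 3, 0], 7: [0, 0, 0],
    'days': [3, 108, 5],
})

# Day columns sit between the two leading columns and the trailing 'days'.
day_cols = df.iloc[:, 2:-1]
day_numbers = day_cols.columns.astype(int).to_numpy()

# Broadcasting: shape (7,) against shape (3, 1) gives a (3, 7) boolean mask
# that is True where the day number is within that row's 'days' limit.
mask = day_numbers <= df['days'].to_numpy()[:, None]

# Zero out the days beyond the limit, then sum across each row.
df['sum'] = day_cols.where(mask, 0).sum(axis=1)
print(df)
```

A row whose days value exceeds the last day column, such as 108 here, simply sums every day column.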

Using previous row value while creating a new column

I have a df in python that looks something like this:
'A'
0
1
0
0
1
1
1
1
0
I want to create another column that adds cumulative 1's from column A, and starts over if the value in column A becomes 0 again. So desired output:
'A' 'B'
0 0
1 1
0 0
0 0
1 1
1 2
1 3
1 4
0 0
This is what I am trying, but it's just replicating column A:
df.B[df.A ==0] = 0
df.B[df.A !=0] = df.A + df.B.shift(1)
Let us do cumsum with groupby cumcount
df['B']=(df.groupby(df.A.eq(0).cumsum()).cumcount()).where(df.A==1,0)
Out[81]:
0 0
1 1
2 0
3 0
4 1
5 2
6 3
7 4
8 0
dtype: int64
Use shift with ne and groupby.cumsum:
df['B'] = df.groupby(df['A'].shift().ne(df['A']).cumsum())['A'].cumsum()
print(df)
A B
0 0 0
1 1 1
2 0 0
3 0 0
4 1 1
5 1 2
6 1 3
7 1 4
8 0 0
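To see why the run-based grouping works, it can help to look at the intermediate group labels. A small sketch, assuming column A contains only 0 and 1; the name run_id is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [0, 1, 0, 0, 1, 1, 1, 1, 0]})

# A new run starts wherever the value differs from the previous row,
# so the cumulative sum of those change points labels each run.
run_id = df['A'].ne(df['A'].shift()).cumsum()

# Within a run of 1s the cumulative sum counts 1, 2, 3, ...;
# within a run of 0s it stays at 0.
df['B'] = df.groupby(run_id)['A'].cumsum()
print(df.assign(run_id=run_id))
```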

Count number of occurrences of values per column of DataFrame

I have the following dataframe:
df = pd.DataFrame(np.array([[4, 1], [1,1], [5,1], [1,3], [7,8], [np.nan,8]]), columns=['a', 'b'])
a b
0 4 1
1 1 1
2 5 1
3 1 3
4 7 8
5 NaN 8
Now I would like to do a value_counts() on the columns for values from 1 to 9 which should give me the following:
a b
1 2 3
2 0 0
3 0 1
4 1 0
5 1 0
6 0 0
7 1 0
8 0 2
9 0 0
That means I just count the number of occurrences of the values 1 to 9 for each column. How can this be done? I would like to get this format so that I can afterwards apply df.plot(kind='bar', stacked=True) to get a stacked bar plot with the discrete values from 1 to 9 on the x axis and the counts for a and b on the y axis.
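Use value_counts on each column (the top-level pd.value_counts is deprecated in recent pandas versions, so pd.Series.value_counts is used here):
df.apply(pd.Series.value_counts).reindex(range(10)).fillna(0)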
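Use np.bincount on each column. The columns are float because of the NaN, and np.bincount only accepts integers, so cast after dropping the NaN:
df.apply(lambda x: np.bincount(x.dropna().astype(int), minlength=10))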
a b
0 0 0
1 2 3
2 0 0
3 0 1
4 1 0
5 1 0
6 0 0
7 1 0
8 0 2
9 0 0
Alternatively, use a list comprehension instead of apply:
pd.DataFrame([
np.bincount(df[c].dropna().astype(int), minlength=10) for c in df
], index=df.columns).T
a b
0 0 0
1 2 3
2 0 0
3 0 1
4 1 0
5 1 0
6 0 0
7 1 0
8 0 2
9 0 0
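The outputs above include a row for 0, while the question asks only for the values 1 to 9. A minimal sketch of one way to get exactly that index, assuming the values of interest are the integers 1 through 9; the name counts is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.array([[4, 1], [1, 1], [5, 1], [1, 3], [7, 8], [np.nan, 8]]),
    columns=['a', 'b'],
)

counts = (
    df.apply(lambda col: col.value_counts())  # NaN is dropped by default
      .reindex(range(1, 10), fill_value=0)    # keep exactly the values 1..9
      .fillna(0)                              # values seen in only one column
      .astype(int)
)
print(counts)
```

The result can then be passed to counts.plot(kind='bar', stacked=True) as described in the question.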

Subset pandas dataframe up to when condition is met the first time

I have not had any luck with a task where I want to subset a pandas dataframe up to a value, grouped by id. In the actual dataset I have several columns between 'id' and 'status'.
For example:
d = {'id': [1,1,1,1,1,1,1,2,2,2,2,2,2,2], 'status': [0,0,0,0,1,1,1,0,0,0,0,1,0,1]}
df = pd.DataFrame(data=d)
id status
0 1 0
1 1 0
2 1 0
3 1 0
4 1 1
5 1 1
6 1 1
7 2 0
8 2 0
9 2 0
10 2 0
11 2 1
12 2 0
13 2 1
The desired subset would be:
id status
0 1 0
1 1 0
2 1 0
3 1 0
4 1 1
5 2 0
6 2 0
7 2 0
8 2 0
9 2 1
Let's try groupby + cumsum:
df = df.groupby('id', group_keys=False)\
.apply(lambda x: x[x.status.cumsum().cumsum().le(1)])\
.reset_index(drop=True)
df
id status
0 1 0
1 1 0
2 1 0
3 1 0
4 1 1
5 2 0
6 2 0
7 2 0
8 2 0
9 2 1
Here's an alternative that performs a groupby to create a mask to be used as an indexer:
df = df[df.status.eq(1).groupby(df.id, group_keys=False)\
.apply(lambda x: x.cumsum().cumsum().le(1))]\
.reset_index(drop=True)
df
id status
0 1 0
1 1 0
2 1 0
3 1 0
4 1 1
5 2 0
6 2 0
7 2 0
8 2 0
9 2 1
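The same subset can also be built without apply. This is a sketch, assuming status contains only 0 and 1; the names seen, seen_before and subset are illustrative. It keeps each row for which no 1 has occurred earlier within the same id.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2],
    'status': [0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1],
})

# Number of 1s seen so far within each id, including the current row.
seen = df.groupby('id')['status'].cumsum()

# Number of 1s seen strictly before the current row.
seen_before = seen - df['status']

# Keep rows up to and including the first 1 of each id.
subset = df[seen_before.eq(0)].reset_index(drop=True)
print(subset)
```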

Apply a value to all instances of a number based on conditions

I have a df like this:
ID Number
1 0
1 0
1 1
2 0
2 0
3 1
3 1
3 0
I want to apply a 5 to any ids that have a 1 anywhere in the number column and a zero to those that don't. For example, if the number "1" appears anywhere in the Number column for ID 1, I want to place a 5 in the total column for every instance of that ID.
My desired output would look as such
ID Number Total
1 0 5
1 0 5
1 1 5
2 0 0
2 0 0
3 1 5
3 1 5
3 0 5
I am trying to think of a way to leverage applymap for this, but I am not sure how to implement it.
Use transform to add a column to your df as a result of a groupby on 'ID':
In [6]:
df['Total'] = df.groupby('ID')['Number'].transform(lambda x: 5 if (x == 1).any() else 0)
df
Out[6]:
ID Number Total
0 1 0 5
1 1 0 5
2 1 1 5
3 2 0 0
4 2 0 0
5 3 1 5
6 3 1 5
7 3 0 5
You can use DataFrame.groupby() on the ID column, take the max() of the Number column, convert the result to a dictionary, and then use that dictionary to create the 'Total' column. Example -
grouped = df.groupby('ID')['Number'].max().to_dict()
df['Total'] = df.apply((lambda row:5 if grouped[row['ID']] else 0), axis=1)
Demo -
In [44]: df
Out[44]:
ID Number
0 1 0
1 1 0
2 1 1
3 2 0
4 2 0
5 3 1
6 3 1
7 3 0
In [56]: grouped = df.groupby('ID')['Number'].max().to_dict()
In [58]: df['Total'] = df.apply((lambda row:5 if grouped[row['ID']] else 0), axis=1)
In [59]: df
Out[59]:
ID Number Total
0 1 0 5
1 1 0 5
2 1 1 5
3 2 0 0
4 2 0 0
5 3 1 5
6 3 1 5
7 3 0 5
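Both answers above call a Python lambda per group or per row. As a sketch of an alternative that avoids the lambda, the built-in 'any' aggregation can be broadcast back to the rows with transform; the name has_one is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 1, 1, 2, 2, 3, 3, 3],
    'Number': [0, 0, 1, 0, 0, 1, 1, 0],
})

# True for every row whose ID has at least one Number equal to 1.
has_one = df['Number'].eq(1).groupby(df['ID']).transform('any')

# Map the boolean flag to the requested values.
df['Total'] = np.where(has_one, 5, 0)
print(df)
```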
