pandas: replicating an Excel formula in pandas - Python

What I have is a dataframe like:
total_sum  pid
5          2
1          2
6          7
3          7
1          7
1          7
0          7
5          10
1          10
1          10
What I want is another column pos like:
total_sum  pid  pos
5          2    1
1          2    2
6          7    1
3          7    2
1          7    3
1          7    3
0          7    4
5          10   1
1          10   2
1          10   2
The logic behind it is:
The initial pos value for a new pid is 1.
If pid does not change but total_sum changes, pos is incremented by 1 (see the first two rows); otherwise pos keeps the previous value (see the last two rows).
What I tried:
df['pos'] = 1
df['pos'] = np.where(((df.pid.diff(-1)) == 0 & (df.total_sum.diff(-1) == 0)),
df.pos, (np.where(df.total_sum.diff(1) < 1, df.pos + 1, df.pos )))
Currently, I am doing it in an Excel sheet, where I initially write 1 manually in the first cell of pos and then put the formula in the second cell of pos:
=IF(A3<>A2,1,IF(B3=B2,C2,C2+1))
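In plain Python terms, the formula above corresponds to this row-by-row logic (a rough sketch of what I am trying to vectorize; the dataframe is built here just to mirror the example above):
import pandas as pd

df = pd.DataFrame({'total_sum': [5, 1, 6, 3, 1, 1, 0, 5, 1, 1],
                   'pid':       [2, 2, 7, 7, 7, 7, 7, 10, 10, 10]})

pos = []
for i in range(len(df)):
    if i == 0 or df['pid'].iat[i] != df['pid'].iat[i - 1]:
        pos.append(1)                   # new pid -> restart at 1
    elif df['total_sum'].iat[i] == df['total_sum'].iat[i - 1]:
        pos.append(pos[-1])             # same total_sum -> keep the previous pos
    else:
        pos.append(pos[-1] + 1)         # total_sum changed -> increment
df['pos'] = pos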

Explanation:
Do a groupby on pid to split rows with the same pid into separate groups. On each group, apply the following operations:
- Call diff on each group. diff returns the difference between two consecutive rows, or NaN where there is no previous row. The first row of each group has no previous row, so diff always returns NaN for the first row of each group:
df.groupby('pid').total_sum.transform(lambda x: x.diff())
Out[120]:
0 NaN
1 -4.0
2 NaN
3 -3.0
4 -2.0
5 0.0
6 -1.0
7 NaN
8 -4.0
9 0.0
Name: total_sum, dtype: float64
- ne checks whether each value is not equal to 0. It returns True for anything that is not 0 (NaN also compares as not equal to 0, so the first row of each group is True):
df.groupby('pid').total_sum.transform(lambda x: x.diff().ne(0))
Out[121]:
0 True
1 True
2 True
3 True
4 True
5 False
6 True
7 True
8 True
9 False
Name: total_sum, dtype: bool
- cumsum is a cumulative sum that successively adds up the rows. In Python, True is interpreted as 1 and False as 0. The first row of each group is always True, so cumsum starts from 1 in each group and adds up the following rows to get the desired output:
df.groupby('pid').total_sum.transform(lambda x: x.diff().ne(0).cumsum())
Out[122]:
0 1
1 2
2 1
3 2
4 3
5 3
6 4
7 1
8 2
9 2
Name: total_sum, dtype: int32
Chain all the commands into a one-liner as follows:
df['pos'] = df.groupby('pid').total_sum.transform(lambda x: x.diff().ne(0).cumsum())
df
Out[99]:
total_sum pid pos
0 5 2 1
1 1 2 2
2 6 7 1
3 3 7 2
4 1 7 3
5 1 7 3
6 0 7 4
7 5 10 1
8 1 10 2
9 1 10 2

Related

How to count the number of occurrences in a comma-delimited column in Python Pandas

How can I count the number of occurrences of the comma-separated values across the whole column?
The data frame is like this:
id  column
1
2   1
3   1
4   1,2
5   1,2
6   1,2,4
7   1,2,4
8   1,2,4,6
9   1,2,4,6
10  1,2,4,6,8
11  1,2,4,6,8
Desired output is:
id  column     count
1              10
2   1          7
3   1          0
4   1,2        6
5   1,2        0
6   1,2,4      4
7   1,2,4      0
8   1,2,4,6    2
9   1,2,4,6    0
10  1,2,4,6,8  0
11  1,2,4,6,8  0
Tried this:
df = pd.read_csv('parentsplit/parentlist.csv')
df['count'] = df['parent_list'].str.split(',', expand=True).stack().value_counts()
It's not working.
You can do it as follows:
df['count'] = df['id'].apply(lambda x: df['column'].fillna('X').str.contains(str(x)).sum())
This basically counts the number of occurrences of each id in the column (see the note after the output about substring matching).
Output:
id column count
0 1 None 10
1 2 1 8
2 3 1 0
3 4 1,2 6
4 5 1,2 0
5 6 1,2,4 4
6 7 1,2,4 0
7 8 1,2,4,6 2
8 9 1,2,4,6 0
9 10 1,2,4,6,8 0
10 11 1,2,4,6,8 0
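Note that str.contains does plain substring matching by default, so an id such as 1 would also match a cell containing 10. The ids in this example never collide that way, but if yours might, a word-boundary regex keeps the match exact (a sketch using the same column names as above):
df['count'] = df['id'].apply(
    lambda x: df['column'].fillna('').str.contains(rf'\b{x}\b').sum()
)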
Split and explode the column, count the occurrences with value_counts, then map the counts onto the id column:
s = df['column'].str.split(',').explode().value_counts()
df['count'] = df['id'].astype(str).map(s).fillna(0)
id column count
0 1 None 10.0
1 2 1 8.0
2 3 1 0.0
3 4 1,2 6.0
4 5 1,2 0.0
5 6 1,2,4 4.0
6 7 1,2,4 0.0
7 8 1,2,4,6 2.0
8 9 1,2,4,6 0.0
9 10 1,2,4,6,8 0.0
10 11 1,2,4,6,8 0.0
A fast method is to skip the pandas string methods and use pure Python: itertools.chain and collections.Counter. Because Counter returns 0 for missing keys, ids that never appear in the column map straight to 0 without needing fillna:
from itertools import chain
from collections import Counter
c = Counter(chain(*df['column'].str.split(',').values))
df['count'] = df['id'].astype(str).map(c)
output:
id column count
0 1 10
1 2 1 8
2 3 1 0
3 4 1,2 6
4 5 1,2 0
5 6 1,2,4 4
6 7 1,2,4 0
7 8 1,2,4,6 2
8 9 1,2,4,6 0
9 10 1,2,4,6,8 0
10 11 1,2,4,6,8 0

How to renumber a dataframe according to a periodic and successive column?

The original dataframe df is:
type month
0 a 1
1 b 1
2 c 1
3 e 5
4 a 5
5 c 5
6 b 9
7 e 9
8 a 9
9 e 9
10 a 1
11 a 1
Notice that month is arranged in successive segments that repeat periodically, and the segments are not always the same size. I would like to add a column num that, for each successive month segment, restarts numbering from 0. The order of the original sequence should not be changed. The expected output should be:
type month num
0 a 1 0
1 b 1 1
2 c 1 2
3 e 5 0
4 a 5 1
5 c 5 2
6 b 9 0
7 e 9 1
8 a 9 2
9 e 9 3
10 a 1 0
11 a 1 1
I can't simply group by month, since the same month values reappear in separate segments.
First we create the groups by checking whether each value differs from the previous one with Series.shift, then take the cumsum of the booleans.
Then we groupby on those groups and use cumcount:
grps = df['month'].ne(df['month'].shift()).cumsum()
df['num'] = df.groupby(grps).cumcount()
type month num
0 a 1 0
1 b 1 1
2 c 1 2
3 e 5 0
4 a 5 1
5 c 5 2
6 b 9 0
7 e 9 1
8 a 9 2
9 e 9 3
10 a 1 0
11 a 1 1
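The two steps can also be chained into a single assignment; this is the same computation written as a one-liner:
df['num'] = df.groupby(df['month'].ne(df['month'].shift()).cumsum()).cumcount()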

How to get the average of values in one column based on another column's value in Python (pandas, Jupyter)

The image shows the test dataset I am using to verify that the right averages are being calculated.
I want to be able to get the average of the corresponding values in the 'G' column based on the filtered values in the 'T' column.
So I set the values for the 'T' column over which I want to sum the values in the 'G' column, and then divide the total by the count to get an average, which is appended to a variable.
However, the average is not calculated correctly. See below:
total = 0
g_avg = []
output = []
counter = 0
for i, row in df_new.iterrows():
    if row['T'] > 2:
        counter += 1
        total += row['G']
    if counter != 0 and row['T'] == 10:
        g_avg.append(total / counter)
        counter = 0
        total = 0
print(g_avg)
Below is a better set of data, as there is repetition in the 'T' values, so I would need a counter in order to get my average of the G values when the T value is in a certain range, i.e. from 2 am to 10 am, etc.
Sorry, it won't allow me to just paste the dataset, so I took a snip of it.
If you want the average of column G values when T is between 2 and 7:
df_new.loc[(df_new['T']>2) & (df_new['T']<7), 'G'].mean()
Update
It's difficult to know exactly what you want without any expected output. If you have some data that looks like this:
print(df)
T G
0 0 0
1 0 0
2 1 0
3 2 1
4 3 3
5 4 0
6 5 4
7 6 5
8 7 0
9 8 6
10 9 7
And you want something like this:
print(df)
T G
0 0 0
1 0 0
2 1 0
3 2 1
4 3 3
5 4 3
6 5 3
7 6 3
8 7 0
9 8 6
10 9 7
Then you could use boolean indexing and DataFrame.loc:
avg = df.loc[(df['T']>2) & (df['T']<7), 'G'].mean()
df.loc[(df['T']>2) & (df['T']<7), 'G'] = avg
print(df)
T G
0 0 0.0
1 0 0.0
2 1 0.0
3 2 1.0
4 3 3.0
5 4 3.0
6 5 3.0
7 6 3.0
8 7 0.0
9 8 6.0
10 9 7.0
Update 2
If you have some sample data:
print(df)
T G
0 0 1
1 2 2
2 3 3
3 3 1
4 3 2
5 10 4
6 2 5
7 2 5
8 2 5
9 10 5
Method 1: To simply get a list of those means, you could create groups for your interval and filter on m:
m = df['T'].between(0,5,inclusive=False)
g = m.ne(m.shift()).cumsum()[m]
lst = df.groupby(g).mean()['G'].tolist()
print(lst)
[2.0, 5.0]
Method 2: If you want to include these means at their respective T values, then you could do this instead:
m = df['T'].between(0,5,inclusive=False)
g = m.ne(m.shift()).cumsum()
df['G_new'] = df.groupby(g)['G'].transform('mean')
print(df)
T G G_new
0 0 1 1
1 2 2 2
2 3 3 2
3 3 1 2
4 3 2 2
5 10 4 4
6 2 5 5
7 2 5 5
8 2 5 5
9 10 5 5
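One caveat if you are on a newer pandas release: Series.between now expects a string for inclusive instead of a boolean, so the mask in both methods would be written as (assuming pandas 1.3 or later):
m = df['T'].between(0, 5, inclusive='neither')   # replaces inclusive=False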

Pandas sum integers separated by commas in a string column

I have a pandas data frame with a column of type string, looking like:
1 1
2 3,1
3 1
4 1
5 2,1,2
6 1
7 1
8 1
9 1
10 4,3,1
I want to sum all integers separated by the commas, obtaining as a result:
1 1
2 4
3 1
4 1
5 5
6 1
7 1
8 1
9 1
10 8
My attempt so far has been:
qty = []
for i in df['Qty']:
    i = i.split(",")
    i = sum(i)
    qty.append(i)
df['Qty'] = qty
Although, I get the error:
TypeError: cannot perform reduce with flexible type
Use apply on the column: df['B'].apply(lambda x: sum(map(int, x.split(','))))
In [81]: df
Out[81]:
A B
0 1 1
1 2 3,1
2 3 1
3 4 1
4 5 2,1,2
5 6 1
6 7 1
7 8 1
8 9 1
9 10 4,3,1
In [82]: df['B'].apply(lambda x: sum(map(int, x.split(','))))
Out[82]:
0 1
1 4
2 1
3 1
4 5
5 1
6 1
7 1
8 1
9 8
Name: B, dtype: int64
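If you prefer to stay with pandas string methods, roughly the same result can be had by splitting and exploding the column, then summing per original row; a sketch, assuming column B holds the comma-separated strings as above:
df['B'].str.split(',').explode().astype(int).groupby(level=0).sum()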

Masking a Pandas DataFrame rows based on the whole row

Background:
I'm working with 8-band multispectral satellite imagery and estimating water depth from reflectance values. Using statsmodels, I've come up with an OLS model that will predict depth for each pixel based on the 8 reflectance values of that pixel. In order to work easily with the OLS model, I've stuck all the pixel reflectance values into a pandas dataframe formatted like the one in the example below, where each row represents a pixel and each column is a spectral band of the multispectral image.
Due to some pre-processing steps, all the on-shore pixels have been transformed to all zeros. I don't want to try and predict the 'depth' of those pixels so I want to restrict my OLS model predictions to the rows that are NOT all zero values.
I will need to reshape my results back to the row x col dimensions of the original image so I can't just drop the all zero rows.
Specific Question:
I've got a Pandas dataframe. Some rows contain all zeros. I would like to mask those rows for some calculations but I need to keep the rows. I can't figure out how to mask all the entries for rows that are all zero.
For example:
In [1]: import pandas as pd
In [2]: import numpy as np
# my actual data has about 16 million rows so
# I'll simulate some data for the example.
In [3]: cols = ['band1','band2','band3','band4','band5','band6','band7','band8']
In [4]: rdf = pd.DataFrame(np.random.randint(0,10,80).reshape(10,8),columns=cols)
In [5]: zdf = pd.DataFrame(np.zeros( (3,8) ),columns=cols)
In [6]: df = pd.concat((rdf,zdf)).reset_index(drop=True)
In [7]: df
Out[7]:
band1 band2 band3 band4 band5 band6 band7 band8
0 9 9 8 7 2 7 5 6
1 7 7 5 6 3 0 9 8
2 5 4 3 6 0 3 8 8
3 6 4 5 0 5 7 4 5
4 8 3 2 4 1 3 2 5
5 9 7 6 3 8 7 8 4
6 6 2 8 2 2 6 9 8
7 9 4 0 2 7 6 4 8
8 1 3 5 3 3 3 0 1
9 4 2 9 7 3 5 5 0
10 0 0 0 0 0 0 0 0
11 0 0 0 0 0 0 0 0
12 0 0 0 0 0 0 0 0
[13 rows x 8 columns]
I know I can get just the rows I'm interested in by doing this:
In [8]: df[df.any(axis=1)==True]
Out[8]:
band1 band2 band3 band4 band5 band6 band7 band8
0 9 9 8 7 2 7 5 6
1 7 7 5 6 3 0 9 8
2 5 4 3 6 0 3 8 8
3 6 4 5 0 5 7 4 5
4 8 3 2 4 1 3 2 5
5 9 7 6 3 8 7 8 4
6 6 2 8 2 2 6 9 8
7 9 4 0 2 7 6 4 8
8 1 3 5 3 3 3 0 1
9 4 2 9 7 3 5 5 0
[10 rows x 8 columns]
But I need to reshape the data again later so I'll need those rows to be in the right place. I've tried all sorts of things including df.where(df.any(axis=1)==True) but I can't find anything that works.
Fails:
df.any(axis=1)==True gives me True for the rows I'm interested in and False for the rows I'd like to mask, but when I try df.where(df.any(axis=1)==True) I just get back the whole data frame complete with all the zeros. I want the whole data frame but with all the values in those zero rows masked, so, as I understand it, they should show up as NaN, right?
I tried getting the indexes of the rows with all zeros and masking by row:
mskidxs = df[df.any(axis=1)==False].index
df.mask(df.index.isin(mskidxs))
That didn't work either; it gave me:
ValueError: Array conditional must be same shape as self
The .index is just giving an Int64Index back. I need a boolean array the same dimensions as my data frame and I just can't figure out how to get one.
Thanks in advance for your help.
-Jared
The process of clarifying my question has lead me, in a roundabout way, to finding the answer. This question also helped point me in the right direction. Here's what I figured out:
import pandas as pd
import numpy as np
# Set up my fake test data again. My actual data is described
# in the question.
cols = ['band1','band2','band3','band4','band5','band6','band7','band8']
rdf = pd.DataFrame(np.random.randint(0,10,80).reshape(10,8),columns=cols)
zdf = pd.DataFrame(np.zeros( (3,8) ),columns=cols)
df = pd.concat((zdf,rdf)).reset_index(drop=True)
# View the dataframe. (sorry about the alignment, I don't
# want to spend the time putting in all the spaces)
df
band1 band2 band3 band4 band5 band6 band7 band8
0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0
3 6 3 7 0 1 7 1 8
4 9 2 6 8 7 1 4 3
5 4 2 1 1 3 2 1 9
6 5 3 8 7 3 7 5 2
7 8 2 6 0 7 2 0 7
8 1 3 5 0 7 3 3 5
9 1 8 6 0 1 5 7 7
10 4 2 6 2 2 2 4 9
11 8 7 8 0 9 3 3 0
12 6 1 6 8 2 0 2 5
13 rows × 8 columns
# This is essentially the same as item #2 under Fails
# in my question. It gives me the indexes of the rows
# I want unmasked as True and those I want masked as
# False. However, the result is not the right shape to
# use as a mask.
df.apply(lambda row: any([i != 0 for i in row]), axis=1)
0 False
1 False
2 False
3 True
4 True
5 True
6 True
7 True
8 True
9 True
10 True
11 True
12 True
dtype: bool
# This is what actually works. By setting broadcast to
# True, I get a result that's the right shape to use.
land_rows = df.apply(lambda row: any([i != 0 for i in row]), axis=1,
                     broadcast=True)
land_rows
Out[92]:
band1 band2 band3 band4 band5 band6 band7 band8
0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0
3 1 1 1 1 1 1 1 1
4 1 1 1 1 1 1 1 1
5 1 1 1 1 1 1 1 1
6 1 1 1 1 1 1 1 1
7 1 1 1 1 1 1 1 1
8 1 1 1 1 1 1 1 1
9 1 1 1 1 1 1 1 1
10 1 1 1 1 1 1 1 1
11 1 1 1 1 1 1 1 1
12 1 1 1 1 1 1 1 1
13 rows × 8 columns
# This produces the result I was looking for:
df.where(land_rows)
Out[93]:
band1 band2 band3 band4 band5 band6 band7 band8
0 NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN
3 6 3 7 0 1 7 1 8
4 9 2 6 8 7 1 4 3
5 4 2 1 1 3 2 1 9
6 5 3 8 7 3 7 5 2
7 8 2 6 0 7 2 0 7
8 1 3 5 0 7 3 3 5
9 1 8 6 0 1 5 7 7
10 4 2 6 2 2 2 4 9
11 8 7 8 0 9 3 3 0
12 6 1 6 8 2 0 2 5
13 rows × 8 columns
Thanks again to those who helped. Hopefully the solution I found will be of use to somebody at some point.
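A note for anyone reading this on a current pandas version: the broadcast argument to apply has since been deprecated and removed in favour of result_type='broadcast', and the masking can be done more directly with boolean row selection. A minimal sketch, assuming the same df as above:
import numpy as np

masked = df.copy()
masked.loc[~df.any(axis=1)] = np.nan   # blank out the rows that are entirely zero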
I found another way to do the same thing. There are more steps involved but, according to %timeit, it is about 9 times faster. Here it is:
def mask_all_zero_rows_numpy(df):
    """
    Take a dataframe, find all the rows that contain only zeros
    and mask them. Return a dataframe of the same shape with all
    NaN rows in place of the all zero rows.
    """
    no_data = -99
    # pull the values out as a numpy array
    # (as_matrix() is the old pandas spelling; newer versions use .to_numpy())
    arr = df.as_matrix().astype(np.int16)
    # make a row full of the 'no data' value
    replacement_row = np.array([no_data for x in range(arr.shape[1])], dtype=np.int16)
    # find out which rows are all zeros
    mask_rows = ~arr.any(axis=1)
    # replace those all zero rows with all 'no_data' rows
    arr[mask_rows] = replacement_row
    # create a masked array with the no_data value masked
    marr = np.ma.masked_where(arr == no_data, arr)
    # turn the masked array into a data frame
    mdf = pd.DataFrame(marr, columns=df.columns)
    return mdf
The result of mask_all_zero_rows_numpy(df) should be the same as Out[93]: above.
It is not clear to me why you cannot simply perform the calculations on only a subset of the rows:
np.average(df[1][:11])
to exclude the zero rows.
Or you can just make calculations on a slice and read the computed values back into the original dataframe:
dfs = df[:10]
dfs['1_deviation_from_mean'] = pd.Series([abs(np.average(dfs[1]) - val) for val in dfs[1]])
df['deviation_from_mean'] = dfs['1_deviation_from_mean']
Alternatively, you could create a list of the index points you want to mask, and then make calculations using numpy masked arrays, created with the np.ma.masked_where() method, specifying that the values in those index positions should be masked:
row_for_mask = [row for row in df.index if all(df.loc[row] == 0)]
masked_array = np.ma.masked_where(df[1].index.isin(row_for_mask), df[1])
np.mean(masked_array)
The masked array looks like this:
masked_array(data =
0 5
1 0
2 0
3 4
4 4
5 4
6 3
7 1
8 0
9 9
10 --
11 --
12 --
Name: 1, dtype: object,
