I have a dataframe of "sentences", from which I wish to search for a keyword. Let's say that my keyword is just the letter 'A'. Sample data:
year | sentence | index
-----------------------
2015 | AAX | 0
2015 | BAX | 1
2015 | XXY | -1
2016 | AWY | 0
2017 | BWY | -1
That is, the "index" column shows the index of the first occurrence of "A" in each sentence (-1 if not found). I want to group the rows by year, with a column showing the fraction of records in each year that contain 'A'. That is:
year | index
-------------
2015 | 0.667
2016 | 1.0
2017 | 0
I have a feeling that this involves agg or groupby in some fashion, but I'm not clear how to string these together. I've gotten as far as:
df.groupby("index").count()
But the issue here is that some kind of conditional count() is needed first: count the number of rows in year 201X containing 'A', and then divide that by the total number of rows in year 201X.
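For reference in the answers below, the sample frame can be reconstructed like this (a minimal sketch; the column names come from the question, the dtypes are assumed):
import pandas as pd

df = pd.DataFrame({
    'year': [2015, 2015, 2015, 2016, 2017],
    'sentence': ['AAX', 'BAX', 'XXY', 'AWY', 'BWY'],
    'index': [0, 1, -1, 0, -1],
})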
You can use value_counts or GroupBy.size with boolean indexing (see also: What is the difference between size and count in pandas?):
df2 = df['year'].value_counts()
print (df2)
2015 3
2017 1
2016 1
Name: year, dtype: int64
df1 = df.loc[df['index'] != -1, 'year'].value_counts()
print (df1)
2015 2
2016 1
Name: year, dtype: int64
Or:
df2 = df.groupby('year').size()
print (df2)
year
2015 3
2016 1
2017 1
dtype: int64
df1 = df.loc[df['index'] != -1, ['year']].groupby('year').size()
print (df1)
year
2015 2
2016 1
dtype: int64
And finally divide with div:
print (df1.div(df2, fill_value=0))
2015 0.666667
2016 1.000000
2017 0.000000
Name: year, dtype: float64
from __future__ import division  # float division on Python 2
import pandas as pd

x_df = ...  # your dataframe

# fraction of sentences per year that contain 'A'
y = x_df.groupby('year')['sentence'].apply(lambda x: sum(i.count('A') > 0 for i in x) / len(x))
# or, using the precomputed index column
y = x_df.groupby('year')['index'].apply(lambda x: sum(i >= 0 for i in x) / len(x))
Using the sentence column to check:
df.sentence.str.contains('A').groupby(df.year).mean()
year
2015 0.666667
2016 1.000000
2017 0.000000
Name: sentence, dtype: float64
Using the index column that has already been computed:
df['index'].ne(-1).groupby(df.year).mean()
year
2015 0.666667
2016 1.000000
2017 0.000000
Name: index, dtype: float64
There are different ways to do it, but no 'native' way as far as I know.
Here's one example, with only one groupby:
g = df.groupby('year')['index'].agg([lambda x: x[x>=0].count(), 'count'])
g['<lambda>'] / g['count']
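If your pandas supports named aggregation (0.25+), the same idea can avoid the auto-generated '<lambda>' column name; a small sketch:
g = df.groupby('year')['index'].agg(found=lambda s: s.ge(0).sum(), total='count')
ratio = g['found'] / g['total']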
Check also:
pandas dataframe groupby: sum/count of only positive numbers
Pandas Very Simple Percent of total size from Group by
I was trying to remove the rows with NaN values in a Python dataframe, and when I do so, I want the row identifiers to shift in such a way that the identifiers in the new dataframe start from 0 and are one number away from each other. By identifiers I mean the numbers at the left of the following example. Notice that this is not an actual column of my df; it is placed by default in every dataframe.
If my Df is like:
name toy born
0 a 1 2020
1 na 2 2020
2 c 5 2020
3 na 1 2020
4 asf 1 2020
What I want after dropna():
name toy born
0 a 1 2020
1 c 5 2020
2 asf 1 2020
I don't want this, but this is what I get:
name toy born
0 a 1 2020
2 c 5 2020
4 asf 1 2020
You can simply add df.reset_index(drop=True)
By default, df.dropna and df.reset_index are not performed in place. Therefore, the complete answer would be as follows.
df = df.dropna().reset_index(drop=True)
Results and Explanations
The above code yields the following result.
>>> df = df.dropna().reset_index(drop=True)
>>> df
name toy born
0 a 1 2020
1 c 5 2020
2 asf 1 2020
We use the argument drop=True so the old index is discarded instead of being inserted as a new column. Otherwise, the result would look like this:
>>> df = df.dropna().reset_index()
>>> df
index name toy born
0 0 a 1 2020
1 2 c 5 2020
2 4 asf 1 2020
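In recent pandas versions (2.0+), dropna can reset the index in the same call via ignore_index, so the two steps collapse into one; a minimal sketch:
df = df.dropna(ignore_index=True)  # equivalent to dropna().reset_index(drop=True) on newer pandas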
I want to create a column churn as shown.
The code should group the data and compare each year's Col values, assigning 0 if the Col value also appears in the next year.
In this example, the 3rd row's Col value ('A') is missing from 2017, hence it is assigned 1.
How do I do this in pandas?
State ID Col Year cost Churn
CT 123 M 2016 10 0
CT 123 C 2016 15 0
CT 123 A 2016 10 1
CT 123 C 2016 20 0
CT 123 M 2017 10 0
CT 123 C 2017 15 0
First add all missing combinations of the first 4 columns using Series.reindex with MultiIndex.from_product, then shift within groups of the first 3 columns with DataFrameGroupBy.shift, and finally use DataFrame.merge to restore the original order and remove all the added rows (with no on parameter, merge uses all columns that exist in both DataFrames):
s = df.assign(Churn=0).set_index(['State','ID','Col','Year'])['Churn']
df1 = df.merge(s.reindex(pd.MultiIndex.from_product(s.index.levels), fill_value=1)
.groupby(level=[0,1,2])
.shift(-1, fill_value=0)
.reset_index())
print (df1)
State ID Col Year Churn
0 CT 123 M 2016 0
1 CT 123 C 2016 0
2 CT 123 A 2016 1
3 CT 123 M 2017 0
4 CT 123 C 2017 0
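A hedged alternative sketch of the same idea that avoids building the full MultiIndex product: shift each unique State/ID/Col/Year combination back one year and flag the rows that have no "next year" match. Column names are taken from the question, and rows in the final year are kept at 0 to match the expected output:
# unique combinations, shifted back one year so they line up with "previous year" rows
nxt = (df[['State', 'ID', 'Col', 'Year']]
       .drop_duplicates()
       .assign(Year=lambda d: d['Year'] - 1, present=1))
out = df.merge(nxt, on=['State', 'ID', 'Col', 'Year'], how='left')
# Churn = 1 when there is no match in the following year, except for the last year
out['Churn'] = (out['present'].isna() & out['Year'].lt(df['Year'].max())).astype(int)
out = out.drop(columns='present')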
There is a dataframe like the following:
id year number
1 2016 3
1 2017 5
2 2016 1
2 2017 5
...
I want to group by id and extract the rows where the value of the number column is at least 3 in both 2016 and 2017.
For example, for the first 4 rows above, the result is:
id year number
1 2016 3
1 2017 5
Thanks!
Compare with >= 3 and use GroupBy.transform to get a Series the same size as the original, so you can filter with boolean indexing:
df1 = df[(df["number"] >= 3).groupby(df["id"]).transform('all')]
# alternative: reassign the mask to the column first
# df = df[df.assign(number=df["number"] >= 3).groupby("id")['number'].transform('all')]
print (df1)
id year number
0 1 2016 3
1 1 2017 5
Or use filter, but it can be slow for a large DataFrame or many groups:
df1 = df.groupby("id").filter(lambda x: (x["number"] >= 3).all())
>>> great_in_both_years = df.groupby("id").apply(lambda x: (x["number"] >= 3).all())
>>> great_in_both_years
id
1 True
2 False
dtype: bool
>>> df.loc[lambda x: x["id"].map(great_in_both_years)]
id year number
0 1 2016 3
1 1 2017 5
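An equivalent spelling of the transform idea, keeping ids whose minimum number across years is at least 3 (a sketch with the same column names):
df1 = df[df.groupby('id')['number'].transform('min').ge(3)]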
Please find below the input and sample output:
If the count is null, then the next year's weight becomes zero. We need to group by company and year. Please note that the starting and ending years may be different for different companies. Also, if a year is missing, then the next available year should automatically get zero. For example, def has data up to 2016 and then 2018 (2017 is missing). Since 2017 is missing, the 2018 weight should be zero, as we are assuming missing years have null values.
I have also added an image of sample input and output
df has columns company, year, weight, count
flag = False
for index, row in df.iterrows():
    if flag:
        row['weight'] = 0
        flag = False
    if row['count'] is None:
        flag = True
If I understand your question correctly, what you need is pandas.DataFrame.shift:
Assume your pandas.DataFrame is named df:
import numpy as np
df.sort_values(['company', 'year'], inplace=True)
is_previous_null = df.loc[:, 'count'].shift(1).isnull() # Is the previous 'count' value null?
is_same_company = (df.loc[:, 'company'] == df.loc[:, 'company'].shift(1)) # Check if the previous row's 'company' value is the same as the current one
df.loc[is_previous_null & is_same_company, 'weight'] = 0
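The same assignment can also be written with Series.mask, reusing the two boolean conditions above (a sketch):
df['weight'] = df['weight'].mask(is_previous_null & is_same_company, 0)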
Solution if years are consecutive per company: first replace the missing values with a helper value, e.g. 'tmp', then use DataFrameGroupBy.shift and compare against 'tmp'.
Last, set 0 with DataFrame.loc:
df = df.sort_values(['company', 'year'])
mask = df.assign(count=df['count'].fillna('tmp')).groupby('company')['count'].shift().eq('tmp')
df.loc[mask, 'weight'] = 0
print (df)
company year weight count
0 abc 2016 0.7 1.0
1 abc 2017 0.3 NaN
2 abc 2018 0.0 3.0
3 def 2015 0.6 6.0
4 def 2016 0.6 NaN
5 def 2017 0.0 7.0
6 def 2018 0.7 5.0
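The same idea also works without the 'tmp' sentinel by shifting the null mask itself within each company (a sketch, assuming consecutive years per company as above):
df = df.sort_values(['company', 'year'])
mask = df['count'].isna().groupby(df['company']).shift(fill_value=False).astype(bool)
df.loc[mask, 'weight'] = 0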
EDIT:
First add the missing years by reindexing per group from the minimal to the maximal year:
import numpy as np

s = (df.set_index('year')
       .groupby('company')['count']
       .apply(lambda x: x.reindex(np.arange(x.index.min(), x.index.max() + 1)).fillna('tmp')))
print (s)
company year
abc 2016 1
2017 tmp
2018 3
def 2015 6
2016 8
2017 tmp
2018 5
Name: count, dtype: object
Then shift per company as in the original solution, here by the first index level (company), and compare against 'tmp':
m = s.groupby(level=0).shift().eq('tmp').rename('m')
print (m)
company year
abc 2016 False
2017 False
2018 True
def 2015 False
2016 False
2017 False
2018 True
Name: m, dtype: bool
Create a mask with the same index as the original DataFrame using join:
mask = df.join(m, on=['company','year'])['m']
print (mask)
0 False
1 False
2 True
3 False
4 False
5 True
Name: m, dtype: bool
Set 0 values:
df.loc[mask, 'weight'] = 0
print (df)
company year weight count
0 abc 2016 0.7 1.0
1 abc 2017 0.3 NaN
2 abc 2018 0.0 3.0
3 def 2015 0.6 6.0
4 def 2016 0.6 8.0
5 def 2018 0.0 5.0
I have a pandas dataframe something like this
Date ID
01/01/2016 a
05/01/2016 a
10/05/2017 a
05/05/2014 b
07/09/2014 b
12/08/2017 b
What I need to do is to add a column which shows the number of entries for each ID that occurred within the last year and another column showing the number within the next year. I've written some horrible code that iterates through the whole dataframe (millions of lines) and does the computations but there must be a better way!
I think you need between with boolean indexing to filter first, and then groupby with size for the aggregation.
The outputs are concatenated, and reindex adds the missing rows, filled with 0:
print (df)
Date ID
0 01/01/2016 a
1 05/01/2016 a
2 10/05/2017 a
3 05/05/2018 b
4 07/09/2014 b
5 07/09/2014 c
6 12/08/2018 b
#convert to datetime (if first number is day, add parameter dayfirst)
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
now = pd.Timestamp.today()  # pd.datetime is deprecated/removed in newer pandas
print (now)
one_year_before_now = now - pd.offsets.DateOffset(years=1)
one_year_after_now = now + pd.offsets.DateOffset(years=1)
#first filter by the last year, then by the next year
a = df[df['Date'].between(one_year_before_now, now)].groupby('ID').size()
b = df[df['Date'].between(now, one_year_after_now)].groupby('ID').size()
print (a)
ID
a 1
dtype: int64
print (b)
ID
b 2
dtype: int64
df1 = pd.concat([a,b],axis=1).fillna(0).astype(int).reindex(df['ID'].unique(),fill_value=0)
print (df1)
0 1
a 1 0
b 0 2
c 0 0
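The concatenated count columns come out labeled 0 and 1; if you want descriptive names, a small sketch (the names are only a suggestion):
df1.columns = ['last_year', 'next_year']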
EDIT:
If you need to compare each date against a per-group reference date (here the last date in each group) plus or minus a one-year offset, use a custom function and sum the True values:
offs = pd.offsets.DateOffset(years=1)
f = lambda x: pd.Series([(x > x.iat[-1] - offs).sum(), \
(x < x.iat[-1] + offs).sum()], index=['last','next'])
df = df.groupby('ID')['Date'].apply(f).unstack(fill_value=0).reset_index()
print (df)
ID last next
0 a 1 3
1 b 3 2
2 c 1 1
In [19]: x['date'] = pd.to_datetime( x['date']) # convert string date to datetime pd object
In [20]: x['date'] = x['date'].dt.year # get year from the date
In [21]: x
Out[21]:
date id
0 2016 a
1 2016 a
2 2017 a
3 2014 b
4 2014 b
5 2017 b
In [27]: x.groupby(['date','id']).size() # group by both columns
Out[27]:
date id
2014 b 2
2016 a 2
2017 a 1
b 1
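If you want one column per year (similar in spirit to the resample answer that follows), the grouped sizes can be unstacked; a small sketch:
x.groupby(['date', 'id']).size().unstack('date', fill_value=0)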
Using resample takes care of the missing in-between years; see year 2015 in the output below.
In [550]: df.set_index('Date').groupby('ID').resample('Y').size().unstack(fill_value=0)
Out[550]:
Date 2014-12-31 2015-12-31 2016-12-31 2017-12-31
ID
a 0 0 2 1
b 2 0 0 1
Use rename if you want only the year in the column labels:
In [551]: (df.set_index('Date').groupby('ID').resample('Y').size().unstack(fill_value=0)
.rename(columns=lambda x: x.year))
Out[551]:
Date 2014 2015 2016 2017
ID
a 0 0 2 1
b 2 0 0 1
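If "last year" and "next year" are meant relative to each row's own Date (as the question suggests), a per-row sketch that avoids iterating over the whole frame row by row is shown below. It is still O(n²) per ID, but the comparisons are vectorized; column names are taken from the question:
import pandas as pd

df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)  # ensure datetime, day first as in the sample
off = pd.DateOffset(years=1)

def window_counts(dates):
    # for each date, count the group's other dates within one year before / after it
    last = [((dates >= t - off) & (dates < t)).sum() for t in dates]
    nxt = [((dates > t) & (dates <= t + off)).sum() for t in dates]
    return pd.DataFrame({'last_year': last, 'next_year': nxt}, index=dates.index)

counts = pd.concat(window_counts(g) for _, g in df.groupby('ID')['Date'])
df = df.join(counts)
For millions of rows, sorting each group's dates and counting window members with numpy.searchsorted would scale much better than this quadratic comparison.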