I have a df with a year column, and I'm trying to combine each row with the row below it.
df
year
0 2020
1 2019
2 2018
3 2017
4 2016
Final df
year combine
0 2020 2020-2019
1 2019 2019-2018
2 2018 2018-2017
3 2017 2017-2016
4 2016 NaN
Let us do shift
df['combine'] = df.year.astype(str) + '-' + df.year.astype(str).shift(-1)
df
Out[302]:
year combine
0 2020 2020-2019
1 2019 2019-2018
2 2018 2018-2017
3 2017 2017-2016
4 2016 NaN
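For reference, a minimal self-contained version of the above; the only assumption is that the year column starts out as plain integers:
import pandas as pd

df = pd.DataFrame({'year': [2020, 2019, 2018, 2017, 2016]})

# shift(-1) pulls each following row's year up one position, so every row is
# paired with the year below it; the last row has nothing to pair with, and
# concatenating a string with NaN therefore yields NaN
df['combine'] = df.year.astype(str) + '-' + df.year.astype(str).shift(-1)
print(df)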
Related
For df2, which only has data for the year 2019:
type year value
0 a 2019 13
1 b 2019 5
2 c 2019 5
3 d 2019 20
df1 has data for multiple years:
type year value
0 a 2015 12
1 a 2016 2
2 a 2019 3
3 b 2018 50
4 b 2019 10
5 c 2017 1
6 c 2016 5
7 c 2019 8
I need to concatenate them together, replacing df1's 2019 values with the values from df2 for the same year.
The expected result will look like this:
type date value
0 a 2015 12
1 a 2016 2
2 b 2018 50
3 c 2017 1
4 c 2016 5
5 a 2019 13
6 b 2019 5
7 c 2019 5
8 d 2019 20
The result from pd.concat([df1, df2], ignore_index=True, sort=False) is shown below, and it clearly has multiple values in 2019 for a single type. How should I improve the code? Thank you.
type date value
0 a 2019 13
1 b 2019 5
2 c 2019 5
3 d 2019 20
4 a 2015 12
5 a 2016 2
6 a 2019 3
7 b 2018 50
8 b 2019 10
9 c 2017 1
10 c 2016 5
11 c 2019 8
Add DataFrame.drop_duplicates to get the last row per type and date after the concat.
This solution works if the type and date pairs are unique within each DataFrame.
df = (pd.concat([df1, df2], ignore_index=True, sort=False)
        .drop_duplicates(['type','date'], keep='last'))
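A self-contained sketch of the same idea, built from the sample frames in the question; note that the frames shown use a year column, so that name is used here in place of date:
import pandas as pd

df1 = pd.DataFrame({'type': ['a', 'a', 'a', 'b', 'b', 'c', 'c', 'c'],
                    'year': [2015, 2016, 2019, 2018, 2019, 2017, 2016, 2019],
                    'value': [12, 2, 3, 50, 10, 1, 5, 8]})
df2 = pd.DataFrame({'type': ['a', 'b', 'c', 'd'],
                    'year': [2019, 2019, 2019, 2019],
                    'value': [13, 5, 5, 20]})

# df2 is listed second, so keep='last' retains its rows whenever a
# (type, year) pair exists in both frames; all other rows pass through
df = (pd.concat([df1, df2], ignore_index=True, sort=False)
        .drop_duplicates(['type', 'year'], keep='last'))
print(df)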
I have years of transaction data which I am working with by customer id. The transaction information is at the invoice level, and an id could easily have multiple invoices on the same day or no invoices for years. I am attempting to create dataframes which contain sums of invoices by customer for each year, but which also show years where no invoices were added. Something akin to:
tmp = invoices[invoices['invoice_year'].isin([2018, 2019, 2020])]
tmp = tmp.groupby(['id', pd.Grouper(key='invoice_date', freq='Y')])['sales'].sum()
This would return something akin to:
id invoice_year sales
1 2018 483982.20
1 2019 3453
1 2020 453533
2 2018 243
2 2020 23423
3 2020 2330202
However the desired output would be:
id invoice_year sales
1 2018 483982.20
1 2019 3453
1 2020 453533
2 2018 243
2 2019 nan
2 2020 23423
3 2018 nan
3 2019 nan
3 2020 2330202
Ideas?
Let's suppose the original values are defined in a dataframe named df; then you can try the following:
output = (df.groupby(['id', 'invoice_date'])['val'].sum()
            .unstack(fill_value=0)
            .stack()
            .reset_index(name='val'))
Otherwise, you can first create the column invoice_year:
df['invoice_year'] = df['invoice_date'].dt.year
And repeat the same code with invoice_year in place of invoice_date; this outputs:
id invoice_year val
0 1 2018 1
1 1 2019 1
2 1 2020 0
3 2 2018 1
4 2 2019 0
5 2 2020 1
6 3 2018 0
7 3 2019 1
8 3 2020 1
Using the following data as an example:
df = pd.DataFrame({'id': [1]*2 + [2]*2 + [3]*2,
                   'invoice_date': pd.to_datetime(['2018-12-01', '2019-12-01', '2020-12-01']*2,
                                                  infer_datetime_format=True),
                   'val': [1]*6})
Stefan has posted a comment that may help. Simply passing dropna=False to your .groupby seems like the best bet, but you could also take the approach of bringing the NaNs back afterward, which may be required on earlier versions of pandas that don't have the dropna=False parameter (a sketch of that approach follows the table below):
id invoice_year sales
1 2018 483982.20
1 2019 3453
1 2020 453533
2 2018 243
2 2020 23423
3 2020 2330202
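A minimal sketch of bringing the NaNs back afterward, assuming the grouped result is a Series named tmp whose MultiIndex levels are id and invoice_year (e.g. after grouping on df['invoice_date'].dt.year instead of a Grouper):
# unstack turns the year level into columns, creating NaN for the missing
# years; stack(dropna=False) goes back to long form while keeping those rows
full = (tmp.unstack('invoice_year')
           .stack(dropna=False)
           .rename('sales')
           .reset_index())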
You can use pd.MultiIndex.from_product and reindex the dataframe from a newly created index called idx:
i, iy = df['id'], df['invoice_year']
idx = pd.MultiIndex.from_product([range(i.min(), i.max()+1),
                                  range(iy.min(), iy.max()+1)],
                                 names=[i.name, iy.name])
df = df.set_index([i.name, iy.name]).reindex(idx).reset_index()
df
Out[1]:
id invoice_year sales
0 1 2018 483982.2
1 1 2019 3453.0
2 1 2020 453533.0
3 2 2018 243.0
4 2 2019 NaN
5 2 2020 23423.0
6 3 2018 NaN
7 3 2019 NaN
8 3 2020 2330202.0
I have a df with 4 observations per company (4 quarters). However, for several companies I have fewer than 4 observations. When I don't have all 4 quarters for a firm, I would like to delete all observations for that firm. Any ideas how to do this?
This is what the df looks like:
Quarter Year Company
1 2018 A
2 2018 A
3 2018 A
4 2018 A
1 2018 B
2 2018 B
1 2018 C
2 2018 C
3 2018 C
4 2018 C
In this df I would like to delete the rows for company B because I only have 2 quarters.
Many thanks!
Use transform with size to get a Series the same length as the original DataFrame, which makes filtering possible:
df = df[df.groupby('Company')['Quarter'].transform('size') == 4]
#if want check by Companies and years
#df = df[df.groupby(['Company','Year'])['Quarter'].transform('size') == 4]
print (df)
Quarter Year Company
0 1 2018 A
1 2 2018 A
2 3 2018 A
3 4 2018 A
6 1 2018 C
7 2 2018 C
8 3 2018 C
9 4 2018 C
If performance is not important or the DataFrame is small, use DataFrameGroupBy.filter:
df = df.groupby('Company').filter(lambda x: len(x) == 4)
Using value_counts
s=df.Company.value_counts()
df.loc[df.Company.isin(s[s==4].index)]
Out[527]:
Quarter Year Company
0 1 2018 A
1 2 2018 A
2 3 2018 A
3 4 2018 A
6 1 2018 C
7 2 2018 C
8 3 2018 C
9 4 2018 C
You can go through your company column and check whether you have all 4 quarter results.
for i in set(df['Company']):
    if len(df[df['Company'] == i]) != 4:
        df = df[df['Company'] != i]
I have a data set which looks like the following
doc_created_month doc_created_year speciality doc_id count
8 2016 Acupuncturist 1
2 2017 Acupuncturist 1
4 2017 Acupuncturist 1
4 2017 Allergist 1
5 2018 Allergist 1
10 2018 Allergist 2
I want to group by the month, year, and speciality and get a cumulative sum of the 'doc_id count' column.
These are the approaches I tried:
1) docProfileDf2.groupby(by=['speciality','doc_created_year','doc_created_month']).sum().groupby(level=[0]).cumsum()
2) docProfileDf2.groupby(['doc_created_month','doc_created_year','speciality'])['doc_id count'].apply(lambda x: x.cumsum())
None of them are returning the proper cumulative sum.
Any solution would help.
The expected output should be:
doc_created_month doc_created_year speciality doc_id count
8 2016 Acupuncturist 1
2 2017 Acupuncturist 2
4 2017 Acupuncturist 3
4 2017 Allergist 1
5 2018 Allergist 2
10 2018 Allergist 4
For each year, month and speciality I want the cumsum of 'doc_id count'
It is simple. The solution is:
df.groupby(by=['speciality','doc_created_year','doc_created_month']).sum().groupby(level=[0]).cumsum()
I had to sum and then group by at the speciality level.
Please note that I changed 'doc_id count' to 'doc_id_count'.
You first call groupby('speciality') to group your data by that column, then call apply() to run a function on each group. In this case the function performs another groupby on the remaining key columns and calls .sum().cumsum() to get the desired result.
from io import StringIO
import pandas as pd
data = """
doc_created_month doc_created_year speciality doc_id_count
8 2016 Acupuncturist 1
2 2017 Acupuncturist 1
4 2017 Acupuncturist 1
4 2017 Allergist 1
5 2018 Allergist 1
10 2018 Allergist 2
"""
df = pd.read_csv(StringIO(data), sep=r'\s+')

(df.groupby('speciality')
   .apply(lambda df_: df_.groupby(['doc_created_year', 'doc_created_month'])
                         .sum().cumsum())
)
which outputs:
doc_id_count
speciality doc_created_year doc_created_month
Acupuncturist 2016 8 1
2017 2 2
4 3
Allergist 2017 4 1
2018 5 2
10 4
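If a flat table like the expected output is preferred, the same chain can select the doc_id_count column explicitly and flatten the MultiIndex afterwards; this is a small follow-up sketch, not part of the original answer:
# selecting the column keeps the result numeric; reset_index turns the
# (speciality, year, month) index levels back into ordinary columns
out = (df.groupby('speciality')
         .apply(lambda df_: df_.groupby(['doc_created_year', 'doc_created_month'])
                               ['doc_id_count'].sum().cumsum())
         .reset_index(name='doc_id_count'))
print(out)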