I have a df with 4 observations per company (4 quarters). However, for several companies I have fewer than 4 observations. When I don't have all 4 quarters for a firm, I would like to delete all observations belonging to that firm. Any ideas how to do this?
This is what the df looks like:
Quarter Year Company
1 2018 A
2 2018 A
3 2018 A
4 2018 A
1 2018 B
2 2018 B
1 2018 C
2 2018 C
3 2018 C
4 2018 C
In this df I would like to delete the rows for company B, because I only have 2 quarters for it.
Many thanks!
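For reference, the example frame above can be rebuilt with something like this (a sketch), so the snippets below can be run directly:
import pandas as pd

# Reconstruction of the example frame shown in the question (a sketch)
df = pd.DataFrame({
    'Quarter': [1, 2, 3, 4, 1, 2, 1, 2, 3, 4],
    'Year':    [2018] * 10,
    'Company': ['A', 'A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'C'],
})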
Use GroupBy.transform with 'size' to get a Series the same length as the original DataFrame, which makes boolean filtering possible:
df = df[df.groupby('Company')['Quarter'].transform('size') == 4]
# if you want to check per Company and Year:
#df = df[df.groupby(['Company','Year'])['Quarter'].transform('size') == 4]
print (df)
Quarter Year Company
0 1 2018 A
1 2 2018 A
2 3 2018 A
3 4 2018 A
6 1 2018 C
7 2 2018 C
8 3 2018 C
9 4 2018 C
If performance is not important, or the DataFrame is small, use DataFrameGroupBy.filter:
df = df.groupby('Company').filter(lambda x: len(x) == 4)
Using value_counts:
# count rows per company
s = df.Company.value_counts()
# keep only companies that appear exactly 4 times
df.loc[df.Company.isin(s[s == 4].index)]
Out[527]:
Quarter Year Company
0 1 2018 A
1 2 2018 A
2 3 2018 A
3 4 2018 A
6 1 2018 C
7 2 2018 C
8 3 2018 C
9 4 2018 C
You can loop over the unique values of the Company column and check whether you have all 4 quarterly results:
for i in set(df['Company']):
    if len(df[df['Company'] == i]) != 4:
        df = df[df['Company'] != i]
For df2, which only has data for the year 2019:
type year value
0 a 2019 13
1 b 2019 5
2 c 2019 5
3 d 2019 20
df1 has data for multiple years:
type year value
0 a 2015 12
1 a 2016 2
2 a 2019 3
3 b 2018 50
4 b 2019 10
5 c 2017 1
6 c 2016 5
7 c 2019 8
I need to concatenate them together while replacing df1's 2019 values with the values from df2 for the same year.
The expected result will look like this:
type date value
0 a 2015 12
1 a 2016 2
2 b 2018 50
3 c 2017 1
4 c 2016 5
5 a 2019 13
6 b 2019 5
7 c 2019 5
8 d 2019 20
The result from pd.concat([df1, df2], ignore_index=True, sort=False) is shown below; it clearly has multiple values in the year 2019 for a single type. How should I improve the code? Thank you.
type date value
0 a 2019 13
1 b 2019 5
2 c 2019 5
3 d 2019 20
4 a 2015 12
5 a 2016 2
6 a 2019 3
7 b 2018 50
8 b 2019 10
9 c 2017 1
10 c 2016 5
11 c 2019 8
Add DataFrame.drop_duplicates to keep only the last row per type and date after the concat.
This works as long as the type and date pairs are unique within each DataFrame.
df = (pd.concat([df1, df2], ignore_index=True, sort=False)
        .drop_duplicates(['type','date'], keep='last'))
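For completeness, a minimal end-to-end sketch of the above; note the question's input tables label the key column year while the expected output uses date, so date is assumed here to match the drop_duplicates call:
import pandas as pd

# Reconstruction of the two frames from the question (a sketch; the year/date
# column is assumed to be named 'date' to match the expected output)
df1 = pd.DataFrame({
    'type':  ['a', 'a', 'a', 'b', 'b', 'c', 'c', 'c'],
    'date':  [2015, 2016, 2019, 2018, 2019, 2017, 2016, 2019],
    'value': [12, 2, 3, 50, 10, 1, 5, 8],
})
df2 = pd.DataFrame({
    'type':  ['a', 'b', 'c', 'd'],
    'date':  [2019, 2019, 2019, 2019],
    'value': [13, 5, 5, 20],
})

# df2 is concatenated after df1, so keep='last' prefers df2's row whenever
# the same (type, date) pair occurs in both frames
df = (pd.concat([df1, df2], ignore_index=True, sort=False)
        .drop_duplicates(['type', 'date'], keep='last')
        .reset_index(drop=True))
print(df)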
I have years of transaction data which I am working with by customer id. The transaction information is at an invoice level, and an id could easily have multiple invoices on the same day or could have no invoices for years. I am attempting to create dataframes which contain sums of invoices by customer for each year, but which also show years where no invoices were added. Something akin to:
tmp = invoices[invoices['invoice_year'].isin([2018, 2019, 2020])]
tmp = tmp.groupby(['id', pd.Grouper(key = 'invoice_date', freq = 'Y')])['sales'].sum()
This would return something akin to:
id invoice_year sales
1 2018 483982.20
1 2019 3453
1 2020 453533
2 2018 243
2 2020 23423
3 2020 2330202
However the desired output would be:
id invoice_year sales
1 2018 483982.20
1 2019 3453
1 2020 453533
2 2018 243
2 2019 nan
2 2020 23423
3 2018 nan
3 2019 nan
3 2020 2330202
Ideas?
Let's suppose the original values are defined in a dataframe named df; then you can try the following:
output = (df.groupby(['id', 'invoice_date'])['val'].sum()
            .unstack(fill_value=0)
            .stack()
            .reset_index(name='val'))
Alternatively, you can first create the column invoice_year:
df['invoice_year'] = df['invoice_date'].dt.year
Then repeat the same code grouping on invoice_year instead of invoice_date, which outputs:
id invoice_year val
0 1 2018 1
1 1 2019 1
2 1 2020 0
3 2 2018 1
4 2 2019 0
5 2 2020 1
6 3 2018 0
7 3 2019 1
8 3 2020 1
Using the following data as an example:
df = pd.DataFrame({'id':[1]*2+[2]*2+[3]*2,'invoice_date':pd.to_datetime(['2018-12-01','2019-12-01','2020-12-01']*2,infer_datetime_format=True),'val':[1]*6})
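One design note: fill_value=0 turns the missing (id, year) combinations into 0, whereas the desired output in the question shows NaN. If NaN is preferred, a possible variant (a sketch, relying on the classic stack behaviour where dropna defaults to True) is:
# Same reshape, but missing (id, invoice_year) pairs stay NaN instead of 0 (a sketch)
output = (df.groupby(['id', 'invoice_year'])['val'].sum()
            .unstack()            # missing combinations become NaN
            .stack(dropna=False)  # keep the NaN cells when stacking back
            .reset_index(name='val'))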
Stefan has posted a comment that may help: simply passing dropna=False to your .groupby seems like the best bet. But you could also take the approach of bringing the NaNs back afterward, which may be required on earlier versions of pandas that don't have the dropna=False parameter. Starting from your current output as df:
id invoice_year sales
1 2018 483982.20
1 2019 3453
1 2020 453533
2 2018 243
2 2020 23423
3 2020 2330202
You can use pd.MultiIndex.from_product and reindex the dataframe from a newly created index called idx:
i, iy = df['id'], df['invoice_year']
idx = pd.MultiIndex.from_product([range(i.min(), i.max()+1),
                                  range(iy.min(), iy.max()+1)],
                                 names=[i.name, iy.name])
df = df.set_index([i.name, iy.name]).reindex(idx).reset_index()
df
Out[1]:
id invoice_year sales
0 1 2018 483982.2
1 1 2019 3453.0
2 1 2020 453533.0
3 2 2018 243.0
4 2 2019 NaN
5 2 2020 23423.0
6 3 2018 NaN
7 3 2019 NaN
8 3 2020 2330202.0
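One caveat worth noting: range(i.min(), i.max()+1) assumes the ids form a consecutive integer range. If they do not, the first level of the index can be built from the observed ids instead (a sketch):
# Build the id level from the observed values rather than a consecutive range (a sketch)
idx = pd.MultiIndex.from_product(
    [sorted(i.unique()), range(iy.min(), iy.max() + 1)],
    names=[i.name, iy.name])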
Suppose, I have a dataframe as below:
year month message
0 2018 2 txt1
1 2017 4 txt2
2 2019 5 txt3
3 2017 5 txt5
4 2017 5 txt4
5 2020 4 txt3
6 2020 6 txt3
7 2020 6 txt3
8 2020 6 txt4
I want to figure out the months with the top three message counts in each year. So I grouped the data as below:
df.groupby(['year','month']).count()
which results in:
message
year month
2017 4 1
5 2
2018 2 1
2019 5 1
2020 4 1
6 3
The data is in ascending order for both index levels. But how do I get the results shown below, where the data is sorted by year (ascending) and count (descending) for the top n values? The 'month' level does not need to be sorted on.
message
year month
2017 5 2
4 1
2018 2 1
2019 5 1
2020 6 3
4 1
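For reference, the example frame above can be rebuilt with (a sketch):
import pandas as pd

# Reconstruction of the example frame shown in the question (a sketch)
df = pd.DataFrame({
    'year':    [2018, 2017, 2019, 2017, 2017, 2020, 2020, 2020, 2020],
    'month':   [2, 4, 5, 5, 5, 4, 6, 6, 6],
    'message': ['txt1', 'txt2', 'txt3', 'txt5', 'txt4', 'txt3', 'txt3', 'txt3', 'txt4'],
})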
value_counts sorts the counts in descending order by default:
df.groupby('year')['month'].value_counts()
Output:
year month
2017 5 2
4 1
2018 2 1
2019 5 1
2020 6 3
4 1
Name: month, dtype: int64
If you want only the top 2 values for each year, do another groupby:
(df.groupby('year')['month'].value_counts()
   .groupby('year').head(2))
Output:
year month
2017 5 2
4 1
2018 2 1
2019 5 1
2020 6 3
4 1
Name: month, dtype: int64
This will sort by year (ascending) and count (descending).
df = df.groupby(['year', 'month']).count().sort_values(['year', 'message'], ascending=[True, False])
You can use sort_index, specifying ascending=[True,False] so that only the second level is sorted in descending order:
df = df.groupby(['year','month']).count().sort_index(ascending=[True,False])
message
year month
2017 5 2
4 1
2018 2 1
2019 5 1
2020 6 3
4 1
Here you go:
(df.groupby(['year', 'month']).count()
   .sort_values(by='message', ascending=False)
   .sort_values(by='year', ascending=True))
You can use this code for it:
(df.groupby(['year', 'month']).count()
   .sort_index(ascending=False)
   .sort_values(by='year', ascending=True))
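A small caveat on the last two snippets: they rely on the second sort preserving the order produced by the first, and pandas' default quicksort is not guaranteed to be stable. Requesting a stable sort makes the intent explicit (a sketch based on the first of the two):
# Second sort made explicitly stable so the per-year descending message-count
# order from the first sort is preserved (a sketch)
(df.groupby(['year', 'month']).count()
   .sort_values(by='message', ascending=False)
   .sort_values(by='year', ascending=True, kind='stable'))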
Following up on my previous question:
import pandas as pd
d = pd.DataFrame({'value':['a', 'b'],'2019Q1':[1, 5], '2019Q2':[2, 6], '2019Q3':[3, 7]})
which displays like this:
value 2019Q1 2019Q2 2019Q3
0 a 1 2 3
1 b 5 6 7
How can I transform it into this shape:
Year measure Quarter Value
2019 a 1 1
2019 a 2 2
2019 a 3 3
2019 b 1 5
2019 b 2 6
2019 b 3 7
Use pd.wide_to_long with DataFrame.melt:
df2 = d.copy()
df2.columns = d.columns.str.split('Q').str[::-1].str.join('_')
new_df = (pd.wide_to_long(df2.rename(columns={'value': 'Measure'}),
                          ['1', '2', '3'],
                          j='Year',
                          i='Measure',
                          sep='_')
            .reset_index()
            .melt(['Measure', 'Year'], var_name='Quarter', value_name='Value')
            .loc[:, ['Year', 'Measure', 'Quarter', 'Value']]
            .sort_values(['Year', 'Measure', 'Quarter']))
print(new_df)
Year Measure Quarter Value
0 2019 a 1 1
2 2019 a 2 2
4 2019 a 3 3
1 2019 b 1 5
3 2019 b 2 6
5 2019 b 3 7
This is just an addition for future visitors: when you split the columns and use expand=True, you get a MultiIndex. This allows reshaping using the stack method.
#set value column as index
d = d.set_index('value')
#split columns and convert to multiindex
d.columns = d.columns.str.split('Q',expand=True)
#reshape dataframe
d.stack([0,1]).rename_axis(['measure','year','quarter']).reset_index(name='Value')
measure year quarter Value
0 a 2019 1 1
1 a 2019 2 2
2 a 2019 3 3
3 b 2019 1 5
4 b 2019 2 6
5 b 2019 3 7
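One small detail about this output: because year and quarter come from splitting the string column labels, they are strings rather than integers; they can be cast back if needed (a sketch):
# year/quarter originate from string column labels, so cast them back to int if needed (a sketch)
out = d.stack([0, 1]).rename_axis(['measure', 'year', 'quarter']).reset_index(name='Value')
out[['year', 'quarter']] = out[['year', 'quarter']].astype(int)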