pandas remove observations depending on multi-index level value - python

I have a multi-index data frame with levels 'id' and 'year':
          value
id year
10 2001     100
   2002     200
11 2001     110
12 2001     200
   2002     300
13 2002     210
I want to keep the ids that have values for both years 2001 and 2002. This means I want to obtain:
          value
id year
10 2001     100
   2002     200
12 2001     200
   2002     300
I know that df.loc[df.index.get_level_values('year') == 2002] works but I cannot extend that to account for both 2001 and 2002.
Thanks in advance.

How about using groupby and filter:
import numpy as np

df.groupby(level=0).filter(
    lambda d: np.in1d([2001, 2002], d.index.get_level_values(1)).all()
)
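For a runnable illustration, here is a minimal sketch that rebuilds the example frame and applies the same filter (np.isin is used in place of the older np.in1d; the frame construction is made up to match the data shown above):
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [(10, 2001), (10, 2002), (11, 2001), (12, 2001), (12, 2002), (13, 2002)],
    names=['id', 'year'],
)
df = pd.DataFrame({'value': [100, 200, 110, 200, 300, 210]}, index=idx)

# keep only the ids whose 'year' level contains both 2001 and 2002
kept = df.groupby(level='id').filter(
    lambda d: np.isin([2001, 2002], d.index.get_level_values('year')).all()
)
print(kept)  # ids 10 and 12 remain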

Related

Python summing selected values in a column that match given condition

Here's the data after the preliminary data cleaning.
year  country  employees
2001  US               9
2001  Canada          81
2001  France          22
2001  Japan           31
2001  Chile            7
2001  Mexico          15
2001  Total          165
2002  US               5
2002  Canada          80
2002  France          20
2002  Japan           30
2002  Egypt           35
2002  Total          170
...   ...            ...
2010  US              32
...   ...            ...
What I want to get is the table below, which sums all countries except US, Canada, France, and Japan into 'Others'. The list of countries varies every year from 2001 to 2010, so I want to use a for loop with an if condition to loop over every year.
year  country  employees
2001  US               9
2001  Canada          81
2001  France          22
2001  Japan           31
2001  Others          22
2001  Total          165
2002  US               5
2002  Canada          80
2002  France          20
2002  Japan           30
2002  Others          35
2002  Total          170
Any leads would be greatly appreciated!
You may consider dropping Total from your dataframe.
However, as stated, your question can be solved by using Series.where to map away values that you don't recognize:
country = df["country"].where(
    df["country"].isin(["US", "Canada", "France", "Japan", "Total"]), "Others"
)
df.groupby([df["year"], country]).sum(numeric_only=True)
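A self-contained sketch of the same idea, grouping just the employees column (the toy frame below is made up to mirror the 2001 rows of the question):
import pandas as pd

df = pd.DataFrame({
    'year':      [2001, 2001, 2001, 2001, 2001, 2001, 2001],
    'country':   ['US', 'Canada', 'France', 'Japan', 'Chile', 'Mexico', 'Total'],
    'employees': [9, 81, 22, 31, 7, 15, 165],
})

keep = ["US", "Canada", "France", "Japan", "Total"]
country = df["country"].where(df["country"].isin(keep), "Others")

# group by year and the remapped country label, summing only the employee counts
out = (df.groupby([df["year"], country], sort=False)["employees"]
         .sum()
         .reset_index())
print(out)  # Chile and Mexico collapse into a single 'Others' row (7 + 15 = 22)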

Pandas dataframe sort by date while keeping a certain order in a second column

I have multiple time-series dataframes from different countries. I want to merge these together, keeping the order of the dates (instead of merely stacking one dataframe below the other), while making sure that within each date the country column follows a consistent pattern. However, when I do this the countries seem to be randomly ordered: for one date Australia comes first, but for another date Japan is put first.
To clarify with an example:
Australia
country crime-index
2000 AU 100
2001 AU 110
2002 AU 120
Japan
country crime-index
2000 JP 90
2001 JP 100
2002 JP 95
United Kingdom
country crime-index
2000 UK 120
2001 UK 130
2002 UK 130
Merged
country crime-index
2000 AU 100
2000 JP 90
2000 UK 120
2001 AU 110
2001 JP 100
2001 UK 130
2002 AU 120
2002 JP 95
2002 UK 130
You can simply use the sort_values function of pandas to sort your dataframe by multiple columns, or by columns together with the index. With this, the ordering of the country column will be the same for each date.
df.rename_axis('dates').sort_values(["dates","country"])
You can try with
df['temp'] = df.index
df.sort_values(['temp', 'country'])
del df['temp']
df['temp'] copies the index (the dates) into a column, then sort_values sorts by the two columns before the helper column is dropped.
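Putting it together, a minimal sketch (the frame names au, jp and uk are made up for illustration) that concatenates the per-country frames and then applies the index-plus-column sort from the first answer:
import pandas as pd

au = pd.DataFrame({'country': 'AU', 'crime-index': [100, 110, 120]}, index=[2000, 2001, 2002])
jp = pd.DataFrame({'country': 'JP', 'crime-index': [90, 100, 95]}, index=[2000, 2001, 2002])
uk = pd.DataFrame({'country': 'UK', 'crime-index': [120, 130, 130]}, index=[2000, 2001, 2002])

merged = (pd.concat([au, jp, uk])           # stack the three frames
            .rename_axis('dates')           # name the year index so it can be sorted on
            .sort_values(['dates', 'country']))
print(merged)  # each year block now lists AU, JP, UK in the same order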

Python Pandas convert selective columns into rows

My dataset has some information about price and sales for different years. The problem is that each year is actually a separate column header, both for price and for sales. For example, the CSV looks like:
Items  Price in 2018  Price in 2019  Price in 2020  Sales in 2018  Sales in 2019  Sales in 2020
A                100            120            135           5000           6000           6500
B                110            130            150           2000           4000           4500
C                150            110            175           1000           3000           3000
I want to show it like this:
Items  Year  Price  Sales
A      2018    100   5000
A      2019    120   6000
A      2020    135   6500
B      2018    110   2000
B      2019    130   4000
B      2020    150   4500
C      2018    150   1000
C      2019    110   3000
C      2020    175   3000
I used the melt function from pandas like this:
df.melt(id_vars=['Items'], var_name="Year", value_name="Price")
But I'm struggling to get separate columns for Price and Sales, as this puts both into a single column. Thanks.
Let us try pandas wide_to_long:
pd.wide_to_long(df, i='Items', j='year',
                stubnames=['Price', 'Sales'],
                suffix=r'\d+', sep=' in ').sort_index()
            Price  Sales
Items year
A     2018    100   5000
      2019    120   6000
      2020    135   6500
B     2018    110   2000
      2019    130   4000
      2020    150   4500
C     2018    150   1000
      2019    110   3000
      2020    175   3000
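If wide_to_long is not a good fit, a melt-based sketch can achieve the same reshape (the helper column names 'variable' and 'measure' are made up; df is assumed to hold the wide table above): split the melted column header into a measure and a year, then pivot the measure back out into columns.
long = df.melt(id_vars='Items', var_name='variable', value_name='value')
long[['measure', 'Year']] = long['variable'].str.split(' in ', expand=True)

out = (long.pivot_table(index=['Items', 'Year'], columns='measure', values='value')
           .reset_index()
           .rename_axis(columns=None))
print(out)  # columns: Items, Year, Price, Sales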

Applying a condition to a df to get the aggregate counts

I have this df structured like this, where each year has the same rows/entries:
Year Name Expire
2001 Bob 2002
2001 Tim 2003
2001 Will 2004
2002 Bob 2002
2002 Tim 2003
2002 Will 2004
2003 Bob 2002
2003 Tim 2003
2003 Will 2004
I have subsetted the df with df[df['Expire'] > df['Year']]:
2001 Bob 2002
2001 Tim 2003
2001 Will 2004
2002 Tim 2003
2002 Will 2004
2003 Will 2004
Now I want to return, for each year, the count of names that have expired, something like:
Year count
2001 0
2002 1
2003 1
How can I accomplish this? I can't do (df[df['Expire'] <= df['Year']])['Name'].groupby('Year').agg(['count']), because that would return unnecessary rows for me. Is there any way to count only the last instance?
You can use groupby on the boolean mask and aggregate with sum:
print (df['Expire']<= df['Year'])
0 False
1 False
2 False
3 True
4 False
5 False
6 True
7 True
8 False
dtype: bool
out = (df['Expire'] <= df['Year']).groupby(df['Year']).sum().astype(int).reset_index(name='count')
print (out)
Year count
0 2001 0
1 2002 1
2 2003 2
Verifying:
print (df[df['Expire']<= df['Year']])
Year Name Expire
3 2002 Bob 2002
6 2003 Bob 2002
7 2003 Tim 2003
IIUC: you can use .apply and sum the True values, i.e.
df.groupby('Year').apply(lambda x: (x['Expire']<=x['Year']).sum())
Output:
Year
2001 0
2002 1
2003 2
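For completeness, a self-contained sketch of the mask-and-groupby idea, rebuilding the question's data inline:
import pandas as pd

df = pd.DataFrame({
    'Year':   [2001, 2001, 2001, 2002, 2002, 2002, 2003, 2003, 2003],
    'Name':   ['Bob', 'Tim', 'Will'] * 3,
    'Expire': [2002, 2003, 2004] * 3,
})

expired = df['Expire'] <= df['Year']            # True where the name has expired by that year
counts = expired.groupby(df['Year']).sum().reset_index(name='count')
print(counts)
# Year 2001 has 0 expirations, 2002 has 1 (Bob), 2003 has 2 (Bob and Tim)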

Pandas dataframe: how to find missing years in a timeseries?

I have a DataFrame with a timestamp index and some 100,000 rows. Via
df['year'] = df.index.year
it is easy to create a new column which contains the year of each row. Now I want to find out which years are missing from my timeseries. So far, I understand that I can use groupby to obtain "something" which allows me to find the unique values. Thus,
grouped = df.groupby('year')
grouped.groups.keys()
will give me the years which are present in my dataset. I could now build a complete year vector with
pd.date_range(df.index.min(), df.index.max(), freq='AS')
and through reindex I should then be able to find the missing years as those years which have NaN values.
However, this sounds awfully complicated for such a seemingly simple task, and the grouped.groups operation actually takes quite a while; presumably because it doesn't just look for the unique keys, but also builds the index lists of rows that belong to each key, which is a feature I don't need here.
Is there any way to obtain the unique elements of a dataframe column more directly/efficiently?
One method would be to construct a series of the years of interest and then use isin to see the missing values:
In [89]:
year_s = pd.Series(np.arange(1993, 2015))
year_s
Out[89]:
0 1993
1 1994
2 1995
3 1996
4 1997
5 1998
6 1999
7 2000
8 2001
9 2002
10 2003
11 2004
12 2005
13 2006
14 2007
15 2008
16 2009
17 2010
18 2011
19 2012
20 2013
21 2014
dtype: int32
In [88]:
df = pd.DataFrame({'year':[1999, 2000, 2013]})
df
Out[88]:
year
0 1999
1 2000
2 2013
In [91]:
year_s[~year_s.isin(df['year'])]
Out[91]:
0 1993
1 1994
2 1995
3 1996
4 1997
5 1998
8 2001
9 2002
10 2003
11 2004
12 2005
13 2006
14 2007
15 2008
16 2009
17 2010
18 2011
19 2012
21 2014
dtype: int32
So in your case you can generate the year series as above, then for your df you can get the years using:
df.index.year.unique()
which will be much quicker than performing a groupby.
Take care that the last value passed to arange is not included in the range
If all you want is a list of missing years, you can take the unique years from the column and build the list of missing ones with a list comprehension:
years = df['year'].unique()
missing_years = [y for y in range(min(years), max(years)+1) if y not in years]
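Alternatively, a short sketch working straight from the DatetimeIndex with a set difference (the example index below is made up so the snippet runs on its own):
import numpy as np
import pandas as pd

rng = pd.date_range('2000-01-15', '2010-06-01', freq='90D')
df = pd.DataFrame({'value': np.arange(len(rng))}, index=rng)
df = df.drop(df.loc['2004'].index).drop(df.loc['2007'].index)  # remove two whole years

years = df.index.year
missing = sorted(set(range(years.min(), years.max() + 1)) - set(years))
print(missing)  # -> [2004, 2007]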
