Here's the data after the preliminary data cleaning.
year  country  employees
2001  US             9
2001  Canada        81
2001  France        22
2001  Japan         31
2001  Chile          7
2001  Mexico        15
2001  Total        165
2002  US             5
2002  Canada        80
2002  France        20
2002  Japan         30
2002  Egypt         35
2002  Total        170
...   ...          ...
2010  US            32
...   ...          ...
What I want to get is the table below, which sums up all countries except US, Canada, France, and Japan into 'Others'. The list of countries varies every year from 2001 to 2010, so I want to use a for loop with an if condition to loop over every year.
year  country  employees
2001  US             9
2001  Canada        81
2001  France        22
2001  Japan         31
2001  Others        22
2001  Total        165
2002  US             5
2002  Canada        80
2002  France        20
2002  Japan         30
2002  Others        35
2002  Total        170
Any leads would be greatly appreciated!
You may consider dropping Total from your dataframe.
However, as stated, your question can be solved by using Series.where to map away values that you don't recognize:
country = df["country"].where(df["country"].isin(["US", "Canada", "France", "Japan", "Total"]), "Others")
df.groupby([df["year"], country]).sum(numeric_only=True)
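For completeness, a minimal runnable sketch on the 2001-2002 sample above; df is rebuilt here only for illustration, and no per-year loop is needed:

import pandas as pd

# Rebuild the 2001-2002 slice of the sample data.
df = pd.DataFrame({
    "year": [2001] * 7 + [2002] * 6,
    "country": ["US", "Canada", "France", "Japan", "Chile", "Mexico", "Total",
                "US", "Canada", "France", "Japan", "Egypt", "Total"],
    "employees": [9, 81, 22, 31, 7, 15, 165, 5, 80, 20, 30, 35, 170],
})

# Everything outside the keep-list is mapped to "Others"; Total is kept on purpose.
keep = ["US", "Canada", "France", "Japan", "Total"]
country = df["country"].where(df["country"].isin(keep), "Others")
print(df.groupby([df["year"], country]).sum(numeric_only=True))

This should print something like:

              employees
year country
2001 Canada          81
     France          22
     Japan           31
     Others          22
     Total          165
     US               9
2002 Canada          80
     France          20
     Japan           30
     Others          35
     Total          170
     US               5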
I am working with a dataframe in Pandas and need a solution to automatically modify one of the columns, which has duplicate values. It is an 'object' column, and I need to modify the names of the duplicated values. The dataframe is the following:
City Year Restaurants
0 New York 2001 20
1 Paris 2000 40
2 New York 1999 41
3 Los Angeles 2004 35
4 Madrid 2001 22
5 New York 1998 33
6 Barcelona 2001 15
As you can see, New York is repeated 3 times. I would like to create a new dataframe in which this value would be automatically modified and the result would be the following:
City Year Restaurants
0 New York 2001 2001 20
1 Paris 2000 40
2 New York 1999 1999 41
3 Los Angeles 2004 35
4 Madrid 2001 22
5 New York 1998 1998 33
6 Barcelona 2001 15
I would also be happy with "New York 1", "New York 2" and "New York 3". Any option would be good.
Use np.where to modify the City column where it is duplicated:
import numpy as np

df['City'] = np.where(df['City'].duplicated(keep=False),
                      df['City'] + ' ' + df['Year'].astype(str),
                      df['City'])
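For reference, duplicated(keep=False) marks every occurrence of a repeated value, not just the second and later ones, so on the sample frame only the three New York rows are True and only they pick up the year suffix:

print(df['City'].duplicated(keep=False))
# 0     True
# 1    False
# 2     True
# 3    False
# 4    False
# 5     True
# 6    False
# Name: City, dtype: bool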
A different approach, without numpy, is groupby.cumcount(), which gives you your alternative "New York 1", "New York 2" naming, but applied to all values:
df['City'] = df['City'] + ' ' + df.groupby('City').cumcount().add(1).astype(str)
City Year Restaurants
0 New York 1 2001 20
1 Paris 1 2000 40
2 New York 2 1999 41
3 Los Angeles 1 2004 35
4 Madrid 1 2001 22
5 New York 3 1998 33
6 Barcelona 1 2001 15
To have the counter only in the duplicated cases, you can use loc:
df.loc[df[df.City.duplicated(keep=False)].index, 'City'] = df['City'] + ' ' + df.groupby('City').cumcount().add(1).astype(str)
City Year Restaurants
0 New York 1 2001 20
1 Paris 2000 40
2 New York 2 1999 41
3 Los Angeles 2004 35
4 Madrid 2001 22
5 New York 3 1998 33
6 Barcelona 2001 15
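As a side note, the row selection in the loc version can be written directly with a boolean mask (mask is an illustrative name); assignment through .loc aligns on the index, so only the masked rows change even though the right-hand side is computed on the whole frame:

# Rows whose City appears more than once.
mask = df['City'].duplicated(keep=False)
df.loc[mask, 'City'] = df['City'] + ' ' + df.groupby('City').cumcount().add(1).astype(str)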
I am building a chart from a dataframe with a series of yearly values for six countries. This table is created by an SQL query and then passed to pandas with the read_sql command...
country date value
0 CA 2000 123
1 CA 2001 125
2 US 1999 223
3 US 2000 235
4 US 2001 344
5 US 2002 355
...
Unfortunately, not every year has a value in each country; nevertheless, the chart tool requires each country to have the same number of years in the dataframe. Years that have no values need a NaN (null) row added.
In the end, I want the pandas dataframe to look as follows for all six countries....
country date value
0 CA 1999 NaN
1 CA 2000 123
2 CA 2001 125
3 CA 2002 NaN
4 US 1999 223
5 US 2000 235
6 US 2001 344
7 US 2002 355
8 DE 1999 NaN
9 DE 2000 NaN
10 DE 2001 423
11 DE 2002 326
...
Are there any tools or shortcuts for determining min-max dates and then ensuring a new nan row is created if needed?
Use the DataFrame.unstack with DataFrame.stack trick: unstacking pivots date into columns and introduces NaN for the missing combinations, and stacking back with dropna=False keeps those NaN rows:
df = df.set_index(['country','date']).unstack().stack(dropna=False).reset_index()
print (df)
country date value
0 CA 1999 NaN
1 CA 2000 123.0
2 CA 2001 125.0
3 CA 2002 NaN
4 US 1999 223.0
5 US 2000 235.0
6 US 2001 344.0
7 US 2002 355.0
Another idea with DataFrame.reindex:
mux = pd.MultiIndex.from_product([df['country'].unique(),
                                  range(df['date'].min(), df['date'].max() + 1)],
                                 names=['country','date'])
df = df.set_index(['country','date']).reindex(mux).reset_index()
print (df)
country date value
0 CA 1999 NaN
1 CA 2000 123.0
2 CA 2001 125.0
3 CA 2002 NaN
4 US 1999 223.0
5 US 2000 235.0
6 US 2001 344.0
7 US 2002 355.0
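One practical difference between the two ideas: unstack/stack can only produce combinations of countries and dates that each occur somewhere in the data, while reindex builds the grid explicitly. If all six countries must appear even when one has no rows at all, pass an explicit list to from_product; the codes beyond CA, US, and DE below are placeholders for the other three countries:

countries = ['CA', 'US', 'DE', 'XX', 'YY', 'ZZ']  # placeholder codes for the six countries
mux = pd.MultiIndex.from_product(
          [countries, range(df['date'].min(), df['date'].max() + 1)],
          names=['country', 'date'])
df = df.set_index(['country', 'date']).reindex(mux).reset_index()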
I have a dataframe of the following type:
Country Year Age Male Female
0 Canada 2005 50 400 25
1 Canada 2005 51 100 25
2 Canada 2006 50 100 70
3 Columbia 2005 50 75 75
I would like to, for example, get the total number of males+females of any age, grouped by country and year. I.e. I'm trying to understand what operation could allow me to see a table such as
Country Year Total over ages and sexes
0 Canada 2005 550
1 Canada 2006 170
2 Columbia 2005 150
In the above example, the value 550 comes from the total number of males and females in Canada for the year 2005, regardless of age: so 550 = 400+25+100+25.
I probably need to groupby Country and Year, but I'm not sure how to collapse the ages and total the number of males and females.
df["Total"] = df.Male + df.Female
df.groupby(["Country", "Year"]).Total.sum()
Output:
Country Year
Canada 2005 550
2006 170
Columbia 2005 150
Name: Total, dtype: int64
Update
cᴏʟᴅsᴘᴇᴇᴅ's chained version:
(df.assign(Total=df.Male + df.Female)
.groupby(['Country', 'Year'])
.Total
.sum()
.reset_index(name='Total over ages and sexes'))
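If you'd rather not add a helper column to df, a hedged alternative is to aggregate both columns first and then sum across them; it produces the same Series:

# Sum Male and Female within each (Country, Year) group, then add the two columns.
df.groupby(['Country', 'Year'])[['Male', 'Female']].sum().sum(axis=1)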
I have this df structured like this, where each year has the same rows/entries:
Year Name Expire
2001 Bob 2002
2001 Tim 2003
2001 Will 2004
2002 Bob 2002
2002 Tim 2003
2002 Will 2004
2003 Bob 2002
2003 Tim 2003
2003 Will 2004
I have subsetted the df with df[df['Expire'] > df['Year']]:
2001 Bob 2002
2001 Tim 2003
2001 Will 2004
2002 Tim 2003
2002 Will 2004
2003 Will 2004
Now I want to return, for each year, the count of names that expired, something like:
Year count
2001 0
2002 1
2003 1
How can I accomplish this? I can't do df[df['Expire'] <= df['Year']]['Name'].groupby('Year').agg(['count']), because that would return unnecessary rows for me. Any way to count only the last instance?
You can use groupby on a boolean mask and aggregate with sum:
print (df['Expire']<= df['Year'])
0 False
1 False
2 False
3 True
4 False
5 False
6 True
7 True
8 False
dtype: bool
df = (df['Expire'] <= df['Year']).groupby(df['Year']).sum().astype(int).reset_index(name='count')
print (df)
Year count
0 2001 0
1 2002 1
2 2003 2
Verifying:
print (df[df['Expire']<= df['Year']])
Year Name Expire
3 2002 Bob 2002
6 2003 Bob 2002
7 2003 Tim 2003
IIUC: you can use .apply and sum the True values, i.e.:
df.groupby('Year').apply(lambda x: (x['Expire']<=x['Year']).sum())
Output:
Year
2001 0
2002 1
2003 2
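Note that both answers count Bob again in 2003, hence 2003 → 2, while the question asks for 2003 → 1. If the intent is to count each name only in the first year it shows up as expired, one possible sketch (expired is an illustrative name):

# First row per name among the expired rows only.
expired = df[df['Expire'] <= df['Year']].drop_duplicates('Name')
# Count per year, filling years with no new expirations with 0.
(expired.groupby('Year').size()
        .reindex(df['Year'].unique(), fill_value=0)
        .reset_index(name='count'))

#    Year  count
# 0  2001      0
# 1  2002      1
# 2  2003      1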
I'm trying to prune some data from my dataframe, but only the rows where there are duplicates in the "To country" column.
My data frame looks like this:
Year From country To country Points
0 2016 Albania Armenia 0
1 2016 Albania Armenia 2
2 2016 Albania Australia 12
...
2129 2016 United Kingdom The Netherlands 0
2130 2016 United Kingdom Ukraine 10
2131 2016 United Kingdom Ukraine 5
[2132 rows x 4 columns]
I try this on it:
df.drop_duplicates(subset='To country', inplace=True)
And what happens is this:
Year From country To country Points
0 2016 Albania Armenia 0
2 2016 Albania Australia 12
4 2016 Albania Austria 0
...
46 2016 Albania The Netherlands 0
48 2016 Albania Ukraine 0
50 2016 Albania United Kingdom 5
[50 rows x 4 columns]
While this does get rid of the duplicated 'To country' entries, it also removes all the values of the 'From country' column except Albania. I must be using drop_duplicates() wrong, but the pandas documentation isn't helping me understand why it's dropping more than I'd expect it to.
No, this behavior is correct: assuming every team played every other team, it's finding the firsts, and all of those firsts are "From" Albania.
From what you've said below, you want to keep row 0, but not row 1 because it repeats both the To and From countries. The way to eliminate those is:
df.drop_duplicates(subset=['To country', 'From country'], inplace=True)
The simplest solution is to group by the 'To country' column and take the first (or the last, if you prefer) row from each group:
df.groupby('To country').first().reset_index()
# To country Year From country Points
#0 Armenia 2016 Albania 0
#1 Australia 2016 Albania 12
#2 The Netherlands 2016 United Kingdom 0
#3 Ukraine 2016 United Kingdom 10
Compared to aryamccarthy's solution, this one gives you more control over which duplicates to keep.
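For example, keeping the last row of each group instead of the first is just a different aggregator:

df.groupby('To country').last().reset_index()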