Here's the data after the preliminary data cleaning.
year  country  employees
2001  US       9
2001  Canada   81
2001  France   22
2001  Japan    31
2001  Chile    7
2001  Mexico   15
2001  Total    165
2002  US       5
2002  Canada   80
2002  France   20
2002  Japan    30
2002  Egypt    35
2002  Total    170
...   ...      ...
2010  US       32
...   ...      ...
What I want to get is the table below, which sums all countries other than US, Canada, France, and Japan into 'Others'. The list of countries varies every year from 2001 to 2010, so I want to use a for loop with an if condition to loop over every year.
year  country  employees
2001  US       9
2001  Canada   81
2001  France   22
2001  Japan    31
2001  Others   22
2001  Total    165
2002  US       5
2002  Canada   80
2002  France   20
2002  Japan    30
2002  Others   35
2002  Total    170
Any leads would be greatly appreciated!
You may consider dropping Total from your dataframe.
However, as stated, your question can be solved by using Series.where to map away values that you don't recognize:
country = df["country"].where(df["country"].isin(["US", "Canada", "France", "Japan", "Total"]), "Others")
df.groupby([df["year"], country]).sum(numeric_only=True)
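If you want the result back as a flat table like the expected output above, a minimal follow-up (assuming df and the country Series from the snippet above) is to aggregate just the employees column and reset the index:
out = df.groupby([df["year"], country])["employees"].sum().reset_index()
print(out)
Note that within each year the countries come back in alphabetical order rather than in the order they appear in the raw data.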
I have a sample dataframe:
city
0 Los Angles
1 New York
2 Texas
3 Washington DC
How can I replicate the above dataframe several times over without changing the order of the rows?
Expected outcome:
city
0 Los Angles
1 New York
2 Texas
3 Washington DC
4 Los Angles
5 New York
6 Texas
7 Washington DC
8 Los Angles
9 New York
10 Texas
11 Washington DC
How about:
pd.concat([df]*3, ignore_index=True)
Output:
city
0 Los Angles
1 New York
2 Texas
3 Washington DC
4 Los Angles
5 New York
6 Texas
7 Washington DC
8 Los Angles
9 New York
10 Texas
11 Washington DC
You can use pd.concat, where x is the number of copies you want:
result = pd.concat([df] * x).reset_index(drop=True)
print(result)
Output (for x=3):
city
0 Los Angles
1 New York
2 Texas
3 Washington DC
4 Los Angles
5 New York
6 Texas
7 Washington DC
8 Los Angles
9 New York
10 Texas
11 Washington DC
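For what it's worth, ignore_index=True and reset_index(drop=True) do the same job here; a quick sanity check, rebuilding the sample frame from the question:
import pandas as pd

df = pd.DataFrame({'city': ['Los Angles', 'New York', 'Texas', 'Washington DC']})
a = pd.concat([df] * 3, ignore_index=True)   # first answer
b = pd.concat([df] * 3).reset_index(drop=True)   # second answer
print(a.equals(b))  # True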
My data is in a CSV like this:
ID  Date   Year  Home Team  Away Team  HP  AP
1   09/02  1966  Miami      Oakland    14  23
2   09/03  1966  Houston    Denver     45   7
3   09/10  1966  Oakland    Houston    31   0
4   09/27  1966  Houston    Oakland    18  10
5   10/20  1966  Oakland    Houston    21  18
On each row I want to sum the previously accumulated home and away points for both the home team and the away team.
I have used pandas groupby to get the home points for the home team and the away points for the away team, similar to below:
df1['HT_HP'] = df1.groupby('Home Team')['HP'].apply(lambda x: x.shift().cumsum())
But I can't work out how to get the previously scored away points for the home team, or the previously scored home points for the away team.
So, for example, for the first Oakland vs Houston game there would be a column with Oakland's 23 previous away points and a separate column with Houston's 45 previous home points.
Expected outcome:
ID  Date   Year  Home Team  Away Team  HP  AP  HT_AP  AT_HP
1   09/02  1966  Miami      Oakland    14  23  NaN    NaN
2   09/03  1966  Houston    Denver     45   7  NaN    NaN
3   09/10  1966  Oakland    Houston    31   0  23     45
4   09/27  1966  Houston    Oakland    18  10  0      31
5   10/20  1966  Oakland    Houston    21  18  33     63
I've tried this:
df1['HT_AGS'] = df1.where(df1['AwayTeam']==df1['HomeTeam']).groupby('HomeTeam')['FTHG'].apply(lambda x : x.shift().cumsum())
but it returns a full column of NaN values.
In Excel it would be something like SUMIFS(F1:F3, D1:D3, E4).
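One way to get HT_AP and AT_HP (a sketch, assuming the dataframe is df1 with the column names from the sample above) is to walk the rows in order and keep running home/away totals per team, recording each total as it stood before the game:
import numpy as np

home_pts = {}   # cumulative HP per team from its previous home games
away_pts = {}   # cumulative AP per team from its previous away games
ht_ap, at_hp = [], []

for _, row in df1.iterrows():
    ht, at = row['Home Team'], row['Away Team']
    # totals accumulated before this game (NaN if no prior games in that role, as in the expected output)
    ht_ap.append(away_pts.get(ht, np.nan))
    at_hp.append(home_pts.get(at, np.nan))
    # update the running totals with this game's points
    home_pts[ht] = home_pts.get(ht, 0) + row['HP']
    away_pts[at] = away_pts.get(at, 0) + row['AP']

df1['HT_AP'] = ht_ap
df1['AT_HP'] = at_hp
On the sample above this reproduces the expected values (23/45, 0/31 and 33/63 for the three Oakland/Houston matchups).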
I have a dataframe and want the output to be formatted to save paper for printing.
GameA GameB
Country
London 5 20
London 5 10
London 3 5
London 3 6
London 8
London 40
France 2 20
France 2 22
France 3
France 3
France 3
USA 10
Is there a way to format the dataframe to look like this:
GameA GameB
Country
London 5 London 20
London 5 London 10
London 3 London 5
London 3 London 6
London London 8
London London 40
GameA GameB
France 2 France 20
France 2 France 22
France 3
France 3
France 3
GameA
USA 10
The formatting is a bit off because of how the text results copied and pasted (due to the missing values), but this should work with your actual data.
countries = df.index.unique()
for country in countries:
    print(df.loc[df.index == country])
    print(' ')
GameA GameB
Country
London 5 20
London 5 10
London 3 5
London 3 6
London 8 NaN
London 40 NaN
GameA GameB
Country
France 2 20
France 2 22
France 3 NaN
France 3 NaN
France 3 NaN
GameA GameB
Country
USA 10 NaN
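An equivalent sketch using groupby, which avoids the repeated boolean masks and keeps the countries in their original order (assuming Country is the index, as in the question):
for country, block in df.groupby(level=0, sort=False):
    print(block)
    print(' ')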
I have the dataframe:
df = pd.DataFrame({'Continent':['North America','North America','North America','Europe','Europe','Europe','Europe'],
'Country': ['US','Canada','Mexico','France','Germany','Spain','Italy'],
'Status': ['Member','Non-Member','Non-Member','Member','Non-Member','Member','Non-Member'],
'Units': [27,5,4,10,15,8,8]})
print(df)
  Continent      Country  Status      Units
0 North America  US       Member      27
1 North America  Canada   Non-Member  5
2 North America  Mexico   Non-Member  4
3 Europe         France   Member      10
4 Europe         Germany  Non-Member  15
5 Europe         Spain    Member      8
6 Europe         Italy    Non-Member  8
I need to add 2 columns of summary statistics about the Continents: one with the sum of Units for Member countries, and one for Non-Member countries, so that the final output would look like:
  Continent      Member Units  Non-Member Units  Country  Status      Units
0 North America  27            9                 US       Member      27
1 North America  27            9                 Canada   Non-Member  5
2 North America  27            9                 Mexico   Non-Member  4
3 Europe         18            23                France   Member      10
4 Europe         18            23                Germany  Non-Member  15
5 Europe         18            23                Spain    Member      8
6 Europe         18            23                Italy    Non-Member  8
It seems like I need to use groupby but I can't figure out how to take the groupby values and re-insert them into the dataframe as new columns.
summary_stats = df.groupby(['Continent','Status'])['Units'].sum()
print(summary_stats)
Continent      Status
Europe         Member        18
               Non-Member    23
North America  Member        27
               Non-Member     9
Name: Units, dtype: int64
I also tried not using groupby with these:
df['Member Units'] = df['Units'][df['Status'] == 'Member'].sum()
df['Non-Member Units'] = df['Units'][df['Status'] == 'Non-Member'].sum()
But that doesn't differentiate by Continent, so it just adds up all the Members and all the Non-Members.
Any help is greatly appreciated!
I think you first need groupby with transform('sum') to create a new Series all_sum. Then it's better to use numpy.where: if the row is a Member, take the value from the Series, otherwise 0. Similarly for non-members:
import numpy as np

all_sum = df.groupby(['Continent','Status'])['Units'].transform('sum')
print(all_sum)
0 27
1 9
2 9
3 18
4 23
5 18
6 23
dtype: int64
df['Member Units'] = np.where(df['Status'] == 'Member', all_sum, 0)
df['Non-Member Units'] = np.where(df['Status'] != 'Member', all_sum, 0)
print(df)
  Continent      Country  Status      Units  Member Units  Non-Member Units
0 North America  US       Member      27     27            0
1 North America  Canada   Non-Member  5      0             9
2 North America  Mexico   Non-Member  4      0             9
3 Europe         France   Member      10     18            0
4 Europe         Germany  Non-Member  15     0             23
5 Europe         Spain    Member      8      18            0
6 Europe         Italy    Non-Member  8      0             23
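If you want both columns filled on every row, as in the desired output in the question, a sketch building on the same idea is to zero out the other status first and then take a per-Continent transform:
member = df['Units'].where(df['Status'] == 'Member', 0)
non_member = df['Units'].where(df['Status'] != 'Member', 0)
df['Member Units'] = member.groupby(df['Continent']).transform('sum')
df['Non-Member Units'] = non_member.groupby(df['Continent']).transform('sum')
This gives 27/9 on every North America row and 18/23 on every Europe row, matching the table in the question.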
Once you have summary_stats, I think you can do something like:
df['Member Units'] = summary_stats[list(zip(df['Continent'].values, df['Status'].values))].values
You need to zip the raw values because df['Continent'] on its own is a Series carrying its own index; zipping gives plain (Continent, Status) tuples to look up in summary_stats, and the trailing .values avoids index alignment problems when assigning the result back into df.
Since you have summary_stats, you can use merge() after reshaping it:
summary = summary_stats.reset_index().pivot(index='Continent', columns='Status', values='Units')
summary['Continent'] = summary.index
df = df.merge(summary, on='Continent')
Then just rename the columns as you want.
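For example, assuming the pivot produced columns named Member and Non-Member, something like:
# hypothetical rename; adjust to the actual pivoted column names
df = df.rename(columns={'Member': 'Member Units', 'Non-Member': 'Non-Member Units'})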