Performing Split on Pandas dataframe and create a new frame

I have a pandas dataframe with one column like this:
Merged_Cities
New York, Wisconsin, Atlanta
Tokyo, Kyoto, Suzuki
Paris, Bordeaux, Lyon
Mumbai, Delhi, Bangalore
London, Manchester, Bermingham
And I want a new dataframe with the output like this:
Merged_Cities                     Cities
New York, Wisconsin, Atlanta      New York
New York, Wisconsin, Atlanta      Wisconsin
New York, Wisconsin, Atlanta      Atlanta
Tokyo, Kyoto, Suzuki              Tokyo
Tokyo, Kyoto, Suzuki              Kyoto
Tokyo, Kyoto, Suzuki              Suzuki
Paris, Bordeaux, Lyon             Paris
Paris, Bordeaux, Lyon             Bordeaux
Paris, Bordeaux, Lyon             Lyon
Mumbai, Delhi, Bangalore          Mumbai
Mumbai, Delhi, Bangalore          Delhi
Mumbai, Delhi, Bangalore          Bangalore
London, Manchester, Bermingham    London
London, Manchester, Bermingham    Manchester
London, Manchester, Bermingham    Bermingham
In short I want to split all the cities into different rows while maintaining the 'Merged_Cities' column.
Here's a replicable version of df:
import pandas as pd
df = pd.DataFrame({'Merged_Cities': ['New York, Wisconsin, Atlanta',
                                     'Tokyo, Kyoto, Suzuki',
                                     'Paris, Bordeaux, Lyon',
                                     'Mumbai, Delhi, Bangalore',
                                     'London, Manchester, Bermingham']})

Use .str.split() and .explode():
df = df.assign(Cities=df["Merged_Cities"].str.split(", ")).explode("Cities")
print(df)
Prints:
Merged_Cities Cities
0 New York, Wisconsin, Atlanta New York
0 New York, Wisconsin, Atlanta Wisconsin
0 New York, Wisconsin, Atlanta Atlanta
1 Tokyo, Kyoto, Suzuki Tokyo
1 Tokyo, Kyoto, Suzuki Kyoto
1 Tokyo, Kyoto, Suzuki Suzuki
2 Paris, Bordeaux, Lyon Paris
2 Paris, Bordeaux, Lyon Bordeaux
2 Paris, Bordeaux, Lyon Lyon
3 Mumbai, Delhi, Bangalore Mumbai
3 Mumbai, Delhi, Bangalore Delhi
3 Mumbai, Delhi, Bangalore Bangalore
4 London, Manchester, Bermingham London
4 London, Manchester, Bermingham Manchester
4 London, Manchester, Bermingham Bermingham
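As a variation on the approach above, here is a minimal sketch that also tolerates inconsistent spacing around the commas and renumbers the rows. The irregular second row is made-up sample data, and `ignore_index=True` assumes pandas 1.1 or later.

```python
import pandas as pd

# Hypothetical input: the second row has irregular spacing around the commas
df = pd.DataFrame({"Merged_Cities": ["New York, Wisconsin, Atlanta",
                                     "Tokyo,Kyoto , Suzuki"]})

# Split on the bare comma, give each city its own row, and renumber the index
out = (
    df.assign(Cities=df["Merged_Cities"].str.split(","))
      .explode("Cities", ignore_index=True)
)

# Strip the whitespace left over around each city name
out["Cities"] = out["Cities"].str.strip()
print(out)
```

Splitting on `","` and stripping afterwards is just one way to handle stray spaces; if the separator is always exactly `", "`, the simpler split shown above is enough.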

This is really similar to @AndrejKesely's answer, except that it builds the exploded cities as a separate Series and merges it with df on the index.
# Create pandas.Series from splitting the column on ', '
s = df['Merged_Cities'].str.split(', ').explode().rename('Cities')
# Merge df with s on their index
df = df.merge(s, left_index=True, right_index=True)
# Result
print(df)
Merged_Cities Cities
0 New York, Wisconsin, Atlanta New York
0 New York, Wisconsin, Atlanta Wisconsin
0 New York, Wisconsin, Atlanta Atlanta
1 Tokyo, Kyoto, Suzuki Tokyo
1 Tokyo, Kyoto, Suzuki Kyoto
1 Tokyo, Kyoto, Suzuki Suzuki

Related

Search pandas dataframe and edit values

I want to replace each missing country with the country that corresponds to the city, i.e. find another row with the same city and copy its country. If there is no other record with the same city, the row should be removed.
Here's the dataframe:
id city lat lng country
1036323110 Katherine -14.4667 132.2667 Australia
1840015979 South Pasadena 27.7526 -82.7394
1124755118 Beaconsfield 45.4333 -73.8667 Canada
1250921305 Ferney-Voltaire 46.2558 6.1081 France
1156346497 Jiangshan 28.7412 118.6225
1231393325 Dīla 6.4104 38.3100 Ethiopia
1840015979 South Pasadena 27.7526 -82.7394 United States
1192391794 Kigoma 21.1072 -76.1367
1840054954 Hampstead 42.8821 -71.1709 United States
1840005111 West Islip 40.7097 -73.2971 United States
1076327352 Paulínia -22.7611 -47.1542 Brazil
1250921305 Ferney-Voltaire 46.2558 6.1081
1250921305 Ferney-Voltaire 46.2558 6.1081 France
1156346497 Jiangshan 28.7412 118.6225 China
1231393325 Dīla 6.4104 38.3100 Ethiopia
1192391794 Gibara 21.1072 -76.1367 Cuba
1840054954 Dodoma 42.8821 -71.1709
1840005111 West Islip 40.7097 -73.2971 United States
Here's my code so far:
df[df.isin(['city'])].stack()

You can group by the three columns city, lat and lng, and fill each missing value with the first non-NaN value in its group.
df['country'] = df['country'].fillna(
    df.groupby(['city', 'lat', 'lng'])['country'].transform(
        # compare against None: a first valid index of 0 is falsy but still valid
        lambda x: x.loc[x.first_valid_index()] if x.first_valid_index() is not None else x
    )
)
print(df)
id city lat lng country
0 1036323110 Katherine -14.4667 132.2667 Australia
1 1840015979 South Pasadena 27.7526 -82.7394 United States
2 1124755118 Beaconsfield 45.4333 -73.8667 Canada
3 1250921305 Ferney-Voltaire 46.2558 6.1081 France
4 1156346497 Jiangshan 28.7412 118.6225 China
5 1231393325 Dīla 6.4104 38.3100 Ethiopia
6 1840015979 South Pasadena 27.7526 -82.7394 United States
7 1192391794 Kigoma 21.1072 -76.1367 NaN
8 1840054954 Hampstead 42.8821 -71.1709 United States
9 1840005111 West Islip 40.7097 -73.2971 United States
10 1076327352 Paulínia -22.7611 -47.1542 Brazil
11 1250921305 Ferney-Voltaire 46.2558 6.1081 France
12 1250921305 Ferney-Voltaire 46.2558 6.1081 France
13 1156346497 Jiangshan 28.7412 118.6225 China
14 1231393325 Dīla 6.4104 38.3100 Ethiopia
15 1192391794 Gibara 21.1072 -76.1367 Cuba
16 1840054954 Dodoma 42.8821 -71.1709 NaN
17 1840005111 West Islip 40.7097 -73.2971 United States
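The output above still contains NaN for cities that have no other record, while the question asks for those rows to be removed. A minimal sketch of that last step follows, on a cut-down, hypothetical version of the data; it groups by city only, as the question describes, and uses `transform("first")` as a shorter alternative to the lambda.

```python
import numpy as np
import pandas as pd

# Cut-down sample: Kigoma has no other record to copy a country from
cities = pd.DataFrame({
    "city": ["South Pasadena", "South Pasadena", "Kigoma", "Jiangshan", "Jiangshan"],
    "country": [np.nan, "United States", np.nan, np.nan, "China"],
})

# "first" returns the first non-null country within each city group
first_known = cities.groupby("city")["country"].transform("first")
cities["country"] = cities["country"].fillna(first_known)

# Rows that are still missing a country have no matching record: drop them
cities = cities.dropna(subset=["country"]).reset_index(drop=True)
print(cities)
```

Whether to group by city alone or by city, lat and lng depends on how far the city names can be trusted to be unique.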

I've solved a problem like this with the geopy package.
You can reverse-geocode each lat/lng pair and then pick the country out of the geopy output. This way the country comes from the geographic information instead of from other rows, so you avoid most NaNs.
Install the geopy package with pip:
pip3 install geopy
Then look up each row:
from geopy.geocoders import Nominatim
geo = Nominatim(user_agent="geoapiExercises")
for i in range(len(df)):
    lat = str(df.iloc[i, 2])  # lat column
    lon = str(df.iloc[i, 3])  # lng column
    df.iloc[i, 4] = geo.reverse(lat + ',' + lon).raw['address']['country']
Please read up on the user_agent parameter, which identifies your application to the Nominatim service. For exercise purposes this value should work.

Printing a column from a .txt file

I'm a beginner at Python.
I'm having trouble printing the values from a .txt file.
NFL_teams = csv.DictReader(txt_file, delimiter = "\t")
print(type(NFL_teams))
columns = NFL_teams.fieldnames
print('Columns: {0}'.format(columns))
print(columns[2])
This gets me to the name of the 3rd column ("team_id") but I can't seem to print any of the data that would be in the corresponding column. I have no idea what I'm doing.
Thanks in advance.
EDIT: The .txt file looks like this:
team_name team_name_short team_id team_id_pfr team_conference team_division team_conference_pre2002 team_division_pre2002 first_year last_year
Arizona Cardinals Cardinals ARI CRD NFC NFC West NFC NFC West 1994 2021
Phoenix Cardinals Cardinals ARI CRD NFC - NFC NFC East 1988 1993
St. Louis Cardinals Cardinals ARI ARI NFC - NFC NFC East 1966 1987
Atlanta Falcons Falcons ATL ATL NFC NFC South NFC NFC West 1966 2021
Baltimore Ravens Ravens BAL RAV AFC AFC North AFC AFC Central 1996 2021
Buffalo Bills Bills BUF BUF AFC AFC East AFC AFC East 1966 2021
Carolina Panthers Panthers CAR CAR NFC NFC South NFC NFC West 1995 2021
There are tabs between each item.
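One way to get at the data is to iterate over the reader: `csv.DictReader` yields one dict per row, keyed by the column names. A minimal sketch follows, using an in-memory stand-in for the file (only three of the columns, with values taken from the sample above) so that it runs on its own; with the real file you would pass the opened file object instead of the `io.StringIO` stand-in.

```python
import csv
import io

# Stand-in for the opened .txt file: tab-separated, header in the first row
sample = (
    "team_name\tteam_name_short\tteam_id\n"
    "Arizona Cardinals\tCardinals\tARI\n"
    "Atlanta Falcons\tFalcons\tATL\n"
    "Baltimore Ravens\tRavens\tBAL\n"
)
txt_file = io.StringIO(sample)

NFL_teams = csv.DictReader(txt_file, delimiter="\t")

# Each row is a dict, so a column is read by its name
team_ids = [row["team_id"] for row in NFL_teams]
print(team_ids)
```

Note that the reader can only be iterated once; to use several columns, collect the rows into a list first, for example with `rows = list(NFL_teams)`.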

How to resample the data by month and plot monthly percentages?

I have a dataframe that lists the matches played by a team in a year. Match Date is a column.
Team 1 Team 2 Winner Match Date
5 Australia England England 2018-01-14
12 Australia England England 2018-01-19
14 Australia England England 2018-01-21
20 Australia England Australia 2018-01-26
22 Australia England England 2018-01-28
34 New Zealand England New Zealand 2018-02-25
35 New Zealand England England 2018-02-28
36 New Zealand England England 2018-03-03
43 New Zealand England New Zealand 2018-03-07
46 New Zealand England England 2018-03-10
62 Scotland England Scotland 2018-06-10
63 England Australia England 2018-06-13
64 England Australia England 2018-06-16
65 England Australia England 2018-06-19
66 England Australia England 2018-06-21
67 England Australia England 2018-06-24
68 England India India 2018-07-12
70 England India England 2018-07-14
72 England India England 2018-07-17
106 Sri Lanka England no result 2018-10-10
107 Sri Lanka England England 2018-10-13
108 Sri Lanka England England 2018-10-17
109 Sri Lanka England England 2018-10-20
112 Sri Lanka England Sri Lanka 2018-10-23
Match Date is in datetime. I could plot the number of matches played versus winning matches. This is the code I used.
England.set_index('Match Date', inplace = True)
England.resample('1M').count()['Winner'].plot()
England_win.resample('1M').count()['Winner'].plot()
But I would like to plot the winning percentage by month. Please help.
Thank you

I am sure there are more efficient ways to do this, but one way to plot this using an approach similar to yours:
import matplotlib.pyplot as plt
import pandas as pd
# read the sample data (assumes the unnamed first column is stored under the header "ID")
df = pd.read_csv("test.txt", sep=r"\s{2,}", parse_dates=["Match Date"], index_col="ID", engine="python")
df.set_index('Match Date', inplace=True)
# count England's wins per month
df1 = df[df["Winner"] == "England"].resample("1M").count()
# divide by the matches played per month and plot the percentage; months without games give NaN, which is replaced with zero
df1.Winner.div(df.resample('1M').count()['Winner']).mul(100).fillna(0).plot()
plt.tight_layout()
plt.show()
Sample output: a line plot of the monthly winning percentage (image not included).
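An alternative that avoids building two separate counts is to take the mean of a boolean "England won" column per month. This is a minimal sketch on a few rows from the question, not the code from the answer above; grouping by `dt.to_period("M")` is a substitution for `resample`, and it only produces months in which matches were played.

```python
import pandas as pd

# A few rows from the question's data
matches = pd.DataFrame({
    "Winner": ["England", "England", "Australia", "New Zealand", "England"],
    "Match Date": pd.to_datetime(
        ["2018-01-14", "2018-01-19", "2018-01-26", "2018-02-25", "2018-02-28"]
    ),
})

# True where England won; the mean of a boolean column is the win rate
won = matches["Winner"].eq("England")
monthly_pct = won.groupby(matches["Match Date"].dt.to_period("M")).mean().mul(100)
print(monthly_pct)
```

Calling `monthly_pct.plot()` would then draw the percentages, assuming matplotlib is installed.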

How to Groupby columns(ignore order) in Pandas DataFrame?

I have a pandas dataframe(4 of 8 columns):
df = pd.DataFrame({
    "departure_country": ["Mexico","Mexico","United States","United States","United States","United States","Japan","United States","United States","United States"],
    "departure_city": ["Guadalajara","Guadalajara","New York","Chicago","Los Angeles","Michigan","Tokyo","New York","New York","Chicago"],
    "destination_country": ["United States","United States","United States","United States","Mexico","United States","United States","Mexico","United States","Japan"],
    "destination_city": ["Los Angeles","Los Angeles","Chicago","New York","Guadalajara","New York","Chicago","Guadalajara","Michigan","Tokyo"]})
df
departure_country departure_city destination_country destination_city
0 Mexico Guadalajara United States Los Angeles
1 Mexico Guadalajara United States Los Angeles
2 United States New York United States Chicago
3 United States Chicago United States New York
4 United States Los Angeles Mexico Guadalajara
5 United States Michigan United States New York
6 Japan Tokyo United States Chicago
7 United States New York Mexico Guadalajara
8 United States New York United States Michigan
9 United States Chicago Japan Tokyo
I want to analyze the data in each group so I would like to groupby "the same pair" of departure and destination first, something like:
departure_country departure_city destination_country destination_city
0 Mexico Guadalajara United States Los Angeles
1 Mexico Guadalajara United States Los Angeles
2 United States Los Angeles Mexico Guadalajara
3 United States New York United States Chicago
4 United States Chicago United States New York
5 United States Michigan United States New York
6 United States New York United States Michigan
7 Japan Tokyo United States Chicago
8 United States Chicago Japan Tokyo
9 United States New York Mexico Guadalajara
Is it possible to make it in a DataFrame? I have tried groupby and key-value, but I failed.
Really appreciate your help with this, thanks!

I'm sure someone could think of a better optimized solution, but one way is to create sorted tuples of your country/city pairs and sort by it:
print (df.assign(country=[tuple(sorted(i)) for i in df.filter(like="country").to_numpy()],
city=[tuple(sorted(i)) for i in df.filter(like="city").to_numpy()])
.sort_values(["country","city"], ascending=False).filter(like="_"))
departure_country departure_city destination_country destination_city
5 United States Michigan United States New York
8 United States New York United States Michigan
2 United States New York United States Chicago
3 United States Chicago United States New York
7 United States New York Mexico Guadalajara
0 Mexico Guadalajara United States Los Angeles
1 Mexico Guadalajara United States Los Angeles
4 United States Los Angeles Mexico Guadalajara
6 Japan Tokyo United States Chicago
9 United States Chicago Japan Tokyo
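If an actual groupby is needed rather than a sort, the same idea can be turned into a single order-independent key column. A minimal sketch on four rows from the question follows; the `route` column and the `route_key` helper are hypothetical names introduced for illustration.

```python
import pandas as pd

flights = pd.DataFrame({
    "departure_country": ["Mexico", "United States", "United States", "Japan"],
    "departure_city": ["Guadalajara", "Los Angeles", "New York", "Tokyo"],
    "destination_country": ["United States", "Mexico", "United States", "United States"],
    "destination_city": ["Los Angeles", "Guadalajara", "Chicago", "Chicago"],
})

def route_key(dep_country, dep_city, dest_country, dest_city):
    """Build a key that is identical for A -> B and B -> A."""
    ends = sorted([f"{dep_city}, {dep_country}", f"{dest_city}, {dest_country}"])
    return " <-> ".join(ends)

cols = ["departure_country", "departure_city", "destination_country", "destination_city"]
flights["route"] = [route_key(*row) for row in flights[cols].to_numpy()]

# Rows with the same pair of endpoints now fall into the same group
route_sizes = flights.groupby("route").size()
print(route_sizes)
```

A string key is used here because it keeps the group labels simple; a sorted tuple would work as well.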

creating new column by merging on column name and other column value

I'm trying to create a new column in DF1 that lists the home team's number of All-Stars for that year.
DF1
Date Visitor V_PTS Home H_PTS \
0 2012-10-30 19:00:00 Washington Wizards 84 Cleveland Cavaliers 94
1 2012-10-30 19:30:00 Dallas Mavericks 99 Los Angeles Lakers 91
2 2012-10-30 20:00:00 Boston Celtics 107 Miami Heat 120
3 2012-10-31 19:00:00 Dallas Mavericks 94 Utah Jazz 113
4 2012-10-31 19:00:00 San Antonio Spurs 99 New Orleans Pelicans 95
Attendance Arena Location Capacity \
0 20562 Quicken Loans Arena Cleveland, Ohio 20562
1 18997 Staples Center Los Angeles, California 18997
2 20296 American Airlines Arena Miami, Florida 19600
3 17634 Vivint Smart Home Arena Salt Lake City, Utah 18303
4 15358 Smoothie King Center New Orleans, Louisiana 16867
Yr Arena Opened Season
0 1994 2012-13
1 1992 2012-13
2 1999 2012-13
3 1991 2012-13
4 1999 2012-13
DF2
2012-13 2013-14 2014-15 2015-16 2016-17
Cleveland Cavaliers 1 1 2 1 3
Los Angeles Lakers 2 1 1 1 0
Miami Heat 3 3 2 2 1
Chicago Bulls 2 1 2 2 1
Detroit Pistons 0 0 0 1 1
Los Angeles Clippers 2 2 2 1 1
New Orleans Pelicans 0 1 1 1 1
Philadelphia 76ers 1 0 0 0 0
Phoenix Suns 0 0 0 0 0
Portland Trail Blazers 1 2 2 0 0
Toronto Raptors 0 1 1 2 2
DF1['H_Allstars'] = DF2[DF1['Season'], DF1['Home']]
results in TypeError: 'Series' objects are mutable, thus they cannot be hashed
I understand the error just am not sure how else to do it.

I've removed the extra columns and just focused on the necessary ones for demonstration.
Input:
df1
Home 2012-13 2013-14 2014-15 2015-16 2016-17
0 Cleveland Cavaliers 1 1 2 1 3
1 Los Angeles Lakers 2 1 1 1 0
2 Miami Heat 3 3 2 2 1
3 Chicago Bulls 2 1 2 2 1
4 Detroit Pistons 0 0 0 1 1
5 Los Angeles Clippers 2 2 2 1 1
6 New Orleans Pelicans 0 1 1 1 1
7 Philadelphia 76ers 1 0 0 0 0
8 Phoenix Suns 0 0 0 0 0
9 Portland Trail Blazers 1 2 2 0 0
10 Toronto Raptors 0 1 1 2 2
df2
Visitor Home Season
0 Washington Wizards Cleveland Cavaliers 2012-13
1 Dallas Mavericks Los Angeles Lakers 2012-13
2 Boston Celtics Miami Heat 2012-13
3 Dallas Mavericks Utah Jazz 2012-13
4 San Antonio Spurs New Orleans Pelicans 2012-13
Step 1: Melt df1 to get the allstars column
df3 = pd.melt(df1, id_vars='Home', value_vars=df1.columns[df1.columns.str.contains('20')], var_name='Season', value_name='H_Allstars')
Output:
Home Season H_Allstars
0 Cleveland Cavaliers 2012-13 1
1 Los Angeles Lakers 2012-13 2
2 Miami Heat 2012-13 3
3 Chicago Bulls 2012-13 2
4 Detroit Pistons 2012-13 0
5 Los Angeles Clippers 2012-13 2
6 New Orleans Pelicans 2012-13 0
7 Philadelphia 76ers 2012-13 1
8 Phoenix Suns 2012-13 0
...
Step 2: Merge this new dataframe with df2 to get the H_Allstars column
df4 = pd.merge(df2, df3, how='left', on=['Home', 'Season'])
Output:
Visitor Home Season H_Allstars
0 Washington Wizards Cleveland Cavaliers 2012-13 1.0
1 Dallas Mavericks Los Angeles Lakers 2012-13 2.0
2 Boston Celtics Miami Heat 2012-13 3.0
3 Dallas Mavericks Utah Jazz 2012-13 NaN
4 San Antonio Spurs New Orleans Pelicans 2012-13 0.0
Step 3: Add the V_Allstars column
# renaming column as required
df3.rename(columns={'Home': 'Visitor', 'H_Allstars': 'V_Allstars'}, inplace=True)
df5 = pd.merge(df4, df3, how='left', on=['Visitor', 'Season'])
Output:
Visitor Home Season H_Allstars V_Allstars
0 Washington Wizards Cleveland Cavaliers 2012-13 1.0 NaN
1 Dallas Mavericks Los Angeles Lakers 2012-13 2.0 NaN
2 Boston Celtics Miami Heat 2012-13 3.0 NaN
3 Dallas Mavericks Utah Jazz 2012-13 NaN NaN
4 San Antonio Spurs New Orleans Pelicans 2012-13 0.0 NaN

You can use pandas.melt. Bring df2 into long format, i.e. with Home and Season as columns and the All-Star counts as values, and then merge it into df1 on 'Home' and 'Season'.
import pandas as pd
df2['Home'] = df2.index
df2 = pd.melt(df2, id_vars = 'Home', value_vars = ['2012-13', '2013-14', '2014-15', '2015-16', '2016-17'], var_name = 'Season', value_name='H_Allstars')
df = df1.merge(df2, on=['Home','Season'], how='left')
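The melt-then-merge idea can be checked on a small, self-contained sample. This is a minimal sketch with a few values taken from the question; `games`, `allstars` and `long_allstars` are hypothetical names standing in for DF1 and DF2.

```python
import pandas as pd

# Stand-in for DF1: one row per game
games = pd.DataFrame({
    "Home": ["Cleveland Cavaliers", "Utah Jazz", "Miami Heat"],
    "Season": ["2012-13", "2012-13", "2013-14"],
})

# Stand-in for DF2: teams in the index, one column per season
allstars = pd.DataFrame(
    {"2012-13": [1, 3], "2013-14": [1, 3]},
    index=["Cleveland Cavaliers", "Miami Heat"],
)

# Wide to long: one row per (Home, Season) pair
long_allstars = (
    allstars.rename_axis("Home")
    .reset_index()
    .melt(id_vars="Home", var_name="Season", value_name="H_Allstars")
)

# A left merge keeps every game; teams missing from allstars get NaN
games = games.merge(long_allstars, on=["Home", "Season"], how="left")
print(games)
```

Because of the NaN for teams that are missing from the All-Star table, the new column comes back as float; `fillna(0).astype(int)` would convert it if zero is the right default.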
