Here is my question. If I have a dataframe like:
Metropolitan area Population NHL
0 New York City 20153634 RangersIslandersDevils
1 Los Angeles 13310447 KingsDucks
2 Washington 23131112 New London
3 Alabama 11111112 Lighting
I want to get a new dataframe like:
Metropolitan area Population NHL
0 New York City 20153634 Rangers
1 New York City 20153634 Islanders
2 New York City 20153634 Devils
3 Los Angeles 13310447 Kings
4 Los Angeles 13310447 Ducks
5 Washington 23131112 New London
6 Alabama 11111112 Lighting
So, as you can see, I need to split the NHL team names on upper-case letters, but names that contain a space should be left intact.
You can use a combination of findall and explode:
out = (
    df.assign(NHL=df["NHL"].str.findall(r"[A-Z](?:\s[A-Z]|[^A-Z])+"))
    .explode("NHL")
    .reset_index(drop=True)
)
print(out)
Metropolitan area Population NHL
0 New York City 20153634 Rangers
1 New York City 20153634 Islanders
2 New York City 20153634 Devils
3 Los Angeles 13310447 Kings
4 Los Angeles 13310447 Ducks
5 Washington 23131112 New London
6 Alabama 11111112 Lighting
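To see what the regex does on its own, here is a minimal sketch with plain re and the same pattern:
import re

pattern = r"[A-Z](?:\s[A-Z]|[^A-Z])+"
# A match starts at an upper-case letter and continues with either a space
# followed by another upper-case letter, or any non-upper-case character,
# so multi-word names survive as a single match.
print(re.findall(pattern, "RangersIslandersDevils"))  # ['Rangers', 'Islanders', 'Devils']
print(re.findall(pattern, "New London"))              # ['New London']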
Here is one way:
df.drop('NHL', axis=1).merge(df['NHL'].str.extractall(r'([A-Z](?:\s[A-Z]|[^A-Z])+)')
                                      .reset_index(level=1, drop=True)
                                      .rename(columns={0: 'NHL'}),
                             left_index=True, right_index=True)
Output:
Metropolitan area Population NHL
0 New York City 20153634 Rangers
0 New York City 20153634 Islanders
0 New York City 20153634 Devils
1 Los Angeles 13310447 Kings
1 Los Angeles 13310447 Ducks
2 Washington 23131112 New London
3 Alabama 11111112 Lighting
Borrowed @CameronRiddell's regex to correctly parse the teams.
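Note that the merge keeps the original index values, which is why they repeat in the output; appending .reset_index(drop=True) to the chain produces a fresh 0..6 RangeIndex like the first answer.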
Another way is to split the frame by whether the name contains a space, and explode only the rows that need splitting:
import pandas as pd

# rows whose team name already contains a space stay as-is
df1 = df[df["NHL"].str.contains(" ")]
# rows without a space get split at each upper-case letter
df2 = df[~df["NHL"].str.contains(" ")].copy()
df2["NHL"] = df2["NHL"].str.findall(r"[A-Z][^A-Z]*")
df2 = df2.explode("NHL")
pd.concat([df2, df1])
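Because concat simply appends df1 after the exploded df2, a multi-word name that sits between one-word names in the original frame would end up out of place; a sort_index() restores the original row order (a small addition, assuming the default integer index is still intact):
pd.concat([df2, df1]).sort_index().reset_index(drop=True)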
I am working with a dataframe in Pandas and need a way to automatically modify a column that contains duplicate values. The column has dtype 'object', and I need to rename the duplicated values. The dataframe is the following:
City Year Restaurants
0 New York 2001 20
1 Paris 2000 40
2 New York 1999 41
3 Los Angeles 2004 35
4 Madrid 2001 22
5 New York 1998 33
6 Barcelona 2001 15
As you can see, New York is repeated 3 times. I would like to create a new dataframe in which this value would be automatically modified and the result would be the following:
City Year Restaurants
0 New York 2001 2001 20
1 Paris 2000 40
2 New York 1999 1999 41
3 Los Angeles 2004 35
4 Madrid 2001 22
5 New York 1998 1998 33
6 Barcelona 2001 15
I would also be happy with "New York 1", "New York 2" and "New York 3". Any option would be good.
Use np.where to modify the City column where it is duplicated:
import numpy as np

df['City'] = np.where(df['City'].duplicated(keep=False),
                      df['City'] + ' ' + df['Year'].astype(str),
                      df['City'])
A different approach, without numpy, is groupby.cumcount(), which gives you the alternative New York 1, New York 2 numbering, but applied to all values:
df['City'] = df['City'] + ' ' + df.groupby('City').cumcount().add(1).astype(str)
City Year Restaurants
0 New York 1 2001 20
1 Paris 1 2000 40
2 New York 2 1999 41
3 Los Angeles 1 2004 35
4 Madrid 1 2001 22
5 New York 3 1998 33
6 Barcelona 1 2001 15
To increment only the duplicated cases, you can use loc:
df.loc[df[df.City.duplicated(keep=False)].index, 'City'] = df['City'] + ' ' + df.groupby('City').cumcount().add(1).astype(str)
City Year Restaurants
0 New York 1 2001 20
1 Paris 2000 40
2 New York 2 1999 41
3 Los Angeles 2004 35
4 Madrid 2001 22
5 New York 3 1998 33
6 Barcelona 2001 15
I have this df
nhl_df=pd.read_csv("assets/nhl.csv")
cities=pd.read_html("assets/wikipedia_data.html")[1]
cities=cities.iloc[:-1,[0,3,5,6,7,8]]
cities = cities.rename(columns={'Population (2016 est.)[8]': 'Population'})
cities = cities[['Metropolitan area','Population']]
print(cities)
Metropolitan area Population
0 New York City 20153634
1 Los Angeles 13310447
2 San Francisco Bay Area 6657982
3 Chicago 9512999
4 Dallas–Fort Worth 7233323
5 Washington, D.C. 6131977
6 Philadelphia 6070500
7 Boston 4794447
8 Minneapolis–Saint Paul 3551036
9 Denver 2853077
10 Miami–Fort Lauderdale 6066387
11 Phoenix 4661537
12 Detroit 4297617
13 Toronto 5928040
14 Houston 6772470
15 Atlanta 5789700
16 Tampa Bay Area 3032171
17 Pittsburgh 2342299
18 Cleveland 2055612
19 Seattle 3798902
20 Cincinnati 2165139
21 Kansas City 2104509
22 St. Louis 2807002
23 Baltimore 2798886
24 Charlotte 2474314
25 Indianapolis 2004230
26 Nashville 1865298
27 Milwaukee 1572482
28 New Orleans 1268883
29 Buffalo 1132804
30 Montreal 4098927
31 Vancouver 2463431
32 Orlando 2441257
33 Portland 2424955
34 Columbus 2041520
35 Calgary 1392609
36 Ottawa 1323783
37 Edmonton 1321426
38 Salt Lake City 1186187
39 Winnipeg 778489
40 San Diego 3317749
41 San Antonio 2429609
42 Sacramento 2296418
43 Las Vegas 2155664
44 Jacksonville 1478212
45 Oklahoma City 1373211
46 Memphis 1342842
47 Raleigh 1302946
48 Green Bay 318236
49 Hamilton 747545
50 Regina 236481
It has 51 rows.
My second df has 28 rows:
W/L Ratio
city
Arizona 0.707317
Boston 2.500000
Buffalo 0.555556
Calgary 1.057143
Carolina 1.028571
Chicago 0.846154
Colorado 1.433333
Columbus 1.500000
Dallas–Fort Worth 1.312500
Detroit 0.769231
Edmonton 0.900000
Florida 1.466667
Los Angeles 1.655862
Minnesota 1.730769
Montreal 0.725000
Nashville 2.944444
New York City 1.111661
Ottawa 0.651163
Philadelphia 1.615385
Pittsburgh 1.620690
San Jose 1.666667
St. Louis 1.375000
Tampa Bay 2.347826
Toronto 1.884615
Vancouver 0.775000
Vegas 2.125000
Washington 1.884615
Winnipeg 2.600000
I need to remove from the first dataframe the rows where the metropolitan area is not in the city column of the 2nd data frame.
I tried this:
cond = nhl_df['city'].isin(cities['Metropolitan Area'])
But I got this error, which makes no sense to me:
KeyError: 'city'
You need to test the Metropolitan area column against the index of the second DataFrame (city is its index, not a column), then keep the matching rows with boolean indexing:
cond = cities['Metropolitan area'].isin(nhl_df.index)
df = cities[cond]
If the first column is not the index in the second DataFrame:
cond = cities['Metropolitan area'].isin(nhl_df['city'])
df = cities[cond]
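For reference, the KeyError itself occurs because city is the index of the 28-row frame, not a regular column (note how city sits on its own line under W/L Ratio in the printout); a minimal check, assuming nhl_df is that frame:
'city' in nhl_df.columns       # False, hence the KeyError
nhl_df.index                   # the city names live here
nhl_df.reset_index()['city']   # or move the index back into a column first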
I have a sample dataframe:
city
0 Los Angles
1 New York
2 Texas
3 Washington DC
How can I replicate the above dataframe without changing the order?
Expected outcome:
city
0 Los Angles
1 New York
2 Texas
3 Washington DC
4 Los Angles
5 New York
6 Texas
7 Washington DC
8 Los Angles
9 New York
10 Texas
11 Washington DC
How about:
pd.concat([df]*3, ignore_index=True)
Output:
city
0 Los Angles
1 New York
2 Texas
3 Washington DC
4 Los Angles
5 New York
6 Texas
7 Washington DC
8 Los Angles
9 New York
10 Texas
11 Washington DC
You can use pd.concat, where x is the number of copies:
result = pd.concat([df] * x).reset_index(drop=True)
print(result)
Output (for x=3):
city
0 Los Angles
1 New York
2 Texas
3 Washington DC
4 Los Angles
5 New York
6 Texas
7 Washington DC
8 Los Angles
9 New York
10 Texas
11 Washington DC
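For completeness, the same tiling can be done positionally with numpy (a sketch, not from the answers above; it assumes df is the four-row sample):
import numpy as np

# row positions 0..n-1 repeated 3 times, preserving the block order
out = df.iloc[np.tile(np.arange(len(df)), 3)].reset_index(drop=True)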
I read a CSV file and group it by two of its columns. I want to count the values of one column within each group, compute the percentage (count/total), and add both to the dataframe.
There is a lot of data in test.csv.
==example==
country city name
KOREA busan Kim
KOREA busan choi
KOREA Seoul park
USA LA Jane
Spain Madrid Torres
(names do not overlap)
==========
csv_file = pd.read_csv("test.csv")
need_group = csv_file.groupby(['country', 'city'])
returns
country city names
0 KOREA Seoul, Busan, ...
1 KOREA Daegu, Seoul
2 USA LA, New York...
What I want (count refers to the number of names in each group):
country city names count percent
0 KOREA Seoul 2 20%
1 KOREA Daegu 1 10%
2 USA LA 2 20%
3 USA New York 1 10%
4 Spain Madrid 4 40%
I believe you need counts per country and city via GroupBy.size, then divide by the length of the DataFrame for the percentage:
print (csv_file)
country city name
0 KOREA busan Kim
1 KOREA busan Dongs
2 KOREA Seoul park
3 USA LA Jane
4 Spain Madrid Torres
df = csv_file.groupby(['country','city']).size().reset_index(name='count')
df['percent'] = df['count'].div(df['count'].sum()).mul(100)
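With the printed sample this gives (groupby sorts the group keys, and upper-case 'Seoul' sorts before lower-case 'busan'):
country city count percent
0 KOREA Seoul 1 20.0
1 KOREA busan 2 40.0
2 Spain Madrid 1 20.0
3 USA LA 1 20.0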
I have one column containing all the data which looks something like this (values that need to be separated have a mark like (c)):
UK (c)
London
Wales
Liverpool
US (c)
Chicago
New York
San Francisco
Seattle
Australia (c)
Sydney
Perth
And I want it split into two columns looking like this:
London UK
Wales UK
Liverpool UK
Chicago US
New York US
San Francisco US
Seattle US
Sydney Australia
Perth Australia
Question 2: What if the countries did not have a pattern like (c)?
Step by step with endswith and ffill + str.strip
df['country'] = df.loc[df.city.str.endswith('(c)'), 'city']  # take the country rows
df.country = df.country.ffill()                              # propagate each country down to its cities
df = df[df.city.ne(df.country)]                              # drop the country rows themselves
df.country = df.country.str.strip(' (c)')                    # strips the characters " ", "(", "c", ")" from both ends
extract and ffill
Start with extract and ffill, then remove redundant rows.
df['country'] = (
    df['data'].str.extract(r'(.*)\s+\(c\)', expand=False).ffill())
df[~df['data'].str.contains('(c)', regex=False)].reset_index(drop=True)
data country
0 London UK
1 Wales UK
2 Liverpool UK
3 Chicago US
4 New York US
5 San Francisco US
6 Seattle US
7 Sydney Australia
8 Perth Australia
Where,
df['data'].str.extract(r'(.*)\s+\(c\)', expand=False).ffill()
0 UK
1 UK
2 UK
3 UK
4 US
5 US
6 US
7 US
8 US
9 Australia
10 Australia
11 Australia
Name: country, dtype: object
The pattern '(.*)\s+\(c\)' matches strings of the form "country (c)" and extracts the country name. Anything not matching this pattern is replaced with NaN, so you can conveniently forward fill on rows.
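A quick sanity check of the pattern on two sample values (a minimal standalone sketch):
import pandas as pd

s = pd.Series(['UK (c)', 'London'])
print(s.str.extract(r'(.*)\s+\(c\)', expand=False))
# 0     UK
# 1    NaN
# dtype: object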
split with np.where and ffill
This splits on "(c)".
u = df['data'].str.split(r'\s+\(c\)')
df['country'] = pd.Series(np.where(u.str.len() == 2, u.str[0], np.nan)).ffill()
df[~df['data'].str.contains('(c)', regex=False)].reset_index(drop=True)
data country
0 London UK
1 Wales UK
2 Liverpool UK
3 Chicago US
4 New York US
5 San Francisco US
6 Seattle US
7 Sydney Australia
8 Perth Australia
You can first use str.extract to locate the cities ending in (c) and extract the country name, then ffill to populate a new country column.
The same extracted matches can be used to locate the rows to be dropped, i.e. the rows which are notna:
m = df.city.str.extract('^(.*?)(?=\(c\)$)')
ix = m[m.squeeze().notna()].index
df['country'] = m.ffill()
df.drop(ix)
city country
1 London UK
2 Wales UK
3 Liverpool UK
5 Chicago US
6 New York US
7 San Francisco US
8 Seattle US
10 Sydney Australia
11 Perth Australia
You can use np.where with str.contains too:
mask = df['places'].str.contains('(c)', regex=False)
df['country'] = np.where(mask, df['places'], np.nan)
df['country'] = df['country'].str.replace(r'\s*\(c\)', '', regex=True).ffill()
df = df[~mask]
df
places country
1 London UK
2 Wales UK
3 Liverpool UK
5 Chicago US
6 New York US
7 San Francisco US
8 Seattle US
10 Sydney Australia
11 Perth Australia
str.contains looks for (c) and returns True for the rows that contain it. Where the condition is True, np.where copies the place value into the country column; elsewhere it leaves NaN, which the forward fill then populates from the nearest country row above.
You could do the following:
data = ['UK (c)','London','Wales','Liverpool','US (c)','Chicago','New York','San Francisco','Seattle','Australia (c)','Sydney','Perth']
df = pd.DataFrame(data, columns=['city'])
df['country'] = df.city.apply(lambda x: x.replace('(c)', '').strip() if '(c)' in x else None)
df['country'] = df['country'].ffill()
df = df[df['city'].str.contains(r'\(c\)') == False]
Output
+-----+----------------+-----------+
| | city | country |
+-----+----------------+-----------+
| 1 | London | UK |
| 2 | Wales | UK |
| 3 | Liverpool | UK |
| 5 | Chicago | US |
| 6 | New York | US |
| 7 | San Francisco | US |
| 8 | Seattle | US |
| 10 | Sydney | Australia |
| 11 | Perth | Australia |
+-----+----------------+-----------+