I have a DataFrame with start_time in proper datetime format and start_station_name as a string that looks like this:
start_time           start_station_name
2019-03-20 11:04:16  San Francisco Caltrain (Townsend St at 4th St)
2019-04-06 14:19:06  Folsom St at 9th St
2019-05-24 17:21:11  Golden Gate Ave at Hyde St
2019-03-27 18:53:27  4th St at Mission Bay Blvd S
2019-04-16 08:45:16  Esprit Park
Now I would like to simply plot the frequency of each name over the year in months. To group the data accordingly, I used this:
data = df_clean.groupby(df_clean['start_time'].dt.strftime('%B'))['start_station_name'].value_counts()
Then I get something that is not a DataFrame but a Series (printed with dtype: int64), indexed by month and station name:
start_time  start_station_name
April       San Francisco Caltrain Station 2 (Townsend St at 4th St)    4866
            Market St at 10th St                                        4609
            San Francisco Ferry Building (Harry Bridges Plaza)          4270
            Berry St at 4th St                                          3994
            Montgomery St BART Station (Market St at 2nd St)            3550
...
September   Mission Bay Kids Park                                       1026
            11th St at Natoma St                                        1023
            Victoria Manalo Draves Park                                 1018
            Davis St at Jackson St                                      1015
            San Francisco Caltrain Station (King St at 4th St)          1014
Now, I would like to simply plot it as a clustered bar chart using Seaborn's countplot(), only for an absolute frequency above 1000, where the x-axis represents the month, the hue is the name and y-axis should show the counts:
sns.countplot(data = data[data > 1000], x = 'start_time', hue = 'start_station_name')
Then I get the error message Could not interpret input 'start_time', probably because it's not a proper DataFrame. How can I group/aggregate it in the first place, so that the visualization works?
Try:
data = df.groupby([df['start_time'].dt.strftime('%B'), 'start_station_name']) \
.count() \
.rename(columns={"start_time": "count"}) \
.reset_index()
ax = sns.countplot(x="start_time", hue="start_station_name", data=data[data["count"] > 1000])
Explanations:
I change the key in the groupby by adding the start_station_name column.
Use count to get the number of rows in each group
Rename the start_time column (which now holds the counts) to count using rename
Reset the index from the groupby using reset_index
Subset the dataset to the rows whose count is above 1000
Plot the result using countplot (using the second example from the doc).
Full code
import matplotlib.pyplot as plt
import seaborn as sns
print(df)
# start_time start_station_name
# 0 2019-03-20 11:04:16 San Francisco Caltrain (Townsend St at 4th St)
# 1 2019-04-06 14:19:06 Folsom St at 9th St
# 2 2019-05-24 17:21:11 Golden Gate Ave at Hyde St
# 3 2019-03-27 18:53:27 4th St at Mission Bay Blvd S
# 4 2019-04-16 08:45:16 Esprit Park
data = df.groupby([df['start_time'].dt.strftime('%B'), 'start_station_name']) \
.count() \
.rename(columns={"start_time": "count"}) \
.reset_index()
print(data)
# start_time start_station_name count
# 0 April Esprit Park 1
# 1 April Folsom St at 9th St 1
# 2 March 4th St at Mission Bay Blvd S 1
# 3 March San Francisco Caltrain (Townsend St at 4th St) 1
# 4 May Golden Gate Ave at Hyde St 1
# Filter as you desired
# data = data[data["count"] > 1000]
# Plot
ax = sns.countplot(x="start_time", hue="start_station_name", data=data)
plt.show()
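As a hedged sketch of the grouping step only: the snippet below builds the per-month counts as an ordinary DataFrame, so that month, station name and count are all real columns. The sample rows, the column name month and the threshold variable are made up for illustration. Because the result is already aggregated, a plot that takes an explicit y value (for example seaborn's barplot with y="count") may fit better than countplot, which counts rows itself.

```python
import pandas as pd

# Hypothetical sample data in the same shape as the question's DataFrame.
df = pd.DataFrame({
    "start_time": pd.to_datetime([
        "2019-03-20 11:04:16", "2019-03-27 18:53:27", "2019-03-28 09:00:00",
        "2019-04-06 14:19:06", "2019-04-16 08:45:16",
    ]),
    "start_station_name": [
        "Esprit Park", "Esprit Park", "Folsom St at 9th St",
        "Folsom St at 9th St", "Esprit Park",
    ],
})

# Month name per row; "month" is an illustrative column name.
month = df["start_time"].dt.strftime("%B").rename("month")

# size() counts rows per (month, station); reset_index turns the
# resulting Series into a DataFrame with a "count" column.
counts = (
    df.groupby([month, "start_station_name"])
      .size()
      .reset_index(name="count")
)

# Keep only the combinations above a threshold (1000 in the question).
threshold = 1
filtered = counts[counts["count"] > threshold]
print(filtered)

# Possible plot call on the aggregated frame (not executed here):
# sns.barplot(data=filtered, x="month", y="count", hue="start_station_name")
```

With the real data, threshold would be 1000 and filtered would hold one row per month and station that exceeds it.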
Related              Percentage
NaN                  1.576020
Redmond              4.264524
England              4.975278
England - Street XY  5.346106
Denmark Street x     7.601978
England – Street wy  11.773795
England – Street AU  13.936959
Redmond street COX   50.525340
Baharin              0
I need to create another DataFrame that sums the Percentage of all rows starting with Redmond, of all rows starting with England followed by a street name, and of all rows that are England only.
How can I do this in Python?
In the above case, the output should be:
                     Percentage
NaN                  1.576020
Redmond              50.525340
England              4.975278
England with street  11.773795
Denmark              7.60
Baharin              0
One way to do this:
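import pandas as pd
# Move the row labels into a regular column named 'index'
df = df.reset_index()
# Mark the rows whose label contains 'Street' (case-sensitive)
m = df['index'].astype(str).str.contains('Street')
street_df = df.loc[m]
# Sum Percentage per first word of the label (e.g. 'England')
street_df = street_df.groupby(street_df['index'].str.split(' ').str[0]).agg({'Percentage': 'sum'}).reset_index()
street_df['index'] = street_df['index'] + ' with street'
# Combine the untouched rows with the aggregated street rows
result = pd.concat([df[~m], street_df])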
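To see what this approach does on the question's sample values, here is a small self-contained run. It assumes the labels sit in an unnamed index (which is what the use of df['index'] implies) and, as a deliberate deviation, matches "street" case-insensitively so that "Redmond street COX" is also treated as a street row. The variable names flat, mask and street_sum are illustrative only.

```python
import numpy as np
import pandas as pd

# Sample values copied from the question; labels are the (unnamed) index.
df = pd.DataFrame(
    {"Percentage": [1.576020, 4.264524, 4.975278, 5.346106, 7.601978,
                    11.773795, 13.936959, 50.525340, 0.0]},
    index=[np.nan, "Redmond", "England", "England - Street XY",
           "Denmark Street x", "England – Street wy", "England – Street AU",
           "Redmond street COX", "Baharin"],
)

# Labels become a regular column called "index".
flat = df.reset_index()

# Assumption: ignore case; na=False keeps the NaN label out of the mask.
mask = flat["index"].str.contains("street", case=False, na=False)

# Sum the street rows per first word of the label.
street = flat[mask]
first_word = street["index"].str.split().str[0]
street_sum = street.groupby(first_word)["Percentage"].sum().reset_index()
street_sum["index"] = street_sum["index"] + " with street"

# Non-street rows stay as they are; street rows are replaced by their sums.
result = pd.concat([flat[~mask], street_sum], ignore_index=True)
print(result)
```

Note that this produces true sums per prefix, so the values for the "with street" rows will differ from the expected output in the question wherever that output shows a single row's value.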
I have the following dataframe where each row is a unique state-city pair:
State City
NY Albany
NY NYC
MA Boston
MA Cambridge
I want to add a column of years ranging from 2000 to 2018:
State City Year
NY Albany 2000
NY Albany 2001
NY Albany 2002
...
NY Albany 2018
NY NYC 2000
NY NYC 2018
...
MA Cambridge 2018
I know I can create a list of numbers using Year = list(range(2000,2019))
Does anyone know how to put this list as a column in the dataframe for each state-city?
You could try adding it as a list and then performing explode. I think it should work:
df['Year'] = [list(range(2000,2019))] * len(df)
df = df.explode('Year')
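An alternative that may read more directly is a cross join between the cities and a one-column frame of years. This is only a sketch: it assumes pandas 1.2 or newer (where merge accepts how="cross"), and the names cities, years and expanded are chosen for illustration.

```python
import pandas as pd

# The state-city pairs from the question.
cities = pd.DataFrame({
    "State": ["NY", "NY", "MA", "MA"],
    "City": ["Albany", "NYC", "Boston", "Cambridge"],
})

# One row per year from 2000 to 2018 inclusive.
years = pd.DataFrame({"Year": range(2000, 2019)})

# Cross join: every city row is paired with every year row.
expanded = cities.merge(years, how="cross")
print(expanded.head())
print(len(expanded))
```

Unlike explode, this keeps Year as an integer column and gives a fresh 0..n-1 index without an extra reset_index call.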
One way is to use the DataFrame.stack() method.
Here is a sample of your current data:
import numpy as np
import pandas as pd
data = [['NY', 'Albany'],
['NY', 'NYC'],
['MA', 'Boston'],
['MA', 'Cambridge']]
cities = pd.DataFrame(data, columns=['State', 'City'])
print(cities)
# State City
# 0 NY Albany
# 1 NY NYC
# 2 MA Boston
# 3 MA Cambridge
First, make this into a multi-level index (this will end up in the final dataframe):
cities_index = pd.MultiIndex.from_frame(cities)
print(cities_index)
# MultiIndex([('NY', 'Albany'),
# ('NY', 'NYC'),
# ('MA', 'Boston'),
# ('MA', 'Cambridge')],
# names=['State', 'City'])
Now, make a dataframe with all the years in it (I only use 3 years for brevity):
years = list(range(2000, 2003))
n_cities = len(cities)
years_data = np.repeat(years, n_cities).reshape(len(years), n_cities).T
years_data = pd.DataFrame(years_data, index=cities_index)
years_data.columns.name = 'Year index'
print(years_data)
# Year index 0 1 2
# State City
# NY Albany 2000 2001 2002
# NYC 2000 2001 2002
# MA Boston 2000 2001 2002
# Cambridge 2000 2001 2002
Finally, use stack to transform this dataframe into a vertically-stacked series which I think is what you want:
years_by_city = years_data.stack().rename('Year')
print(years_by_city.head())
# State City Year index
# NY Albany 0 2000
# 1 2001
# 2 2002
# NYC 0 2000
# 1 2001
# Name: Year, dtype: int64
If you want to remove the index and have all the values as a dataframe just do
cities_and_years = years_by_city.reset_index()
I have the following table and would like to split each row into three columns: state, postcode and city. State and postcode are easy, but I'm unable to extract the city. I thought about splitting each string after the street synonyms and before the state, but I seem to be getting the loop wrong as it will only use the last item in my list.
Input data:
Address Text
0 11 North Warren Circle Lisbon Falls ME 04252
1 227 Cony Street Augusta ME 04330
2 70 Buckner Drive Battle Creek MI
3 718 Perry Street Big Rapids MI
4 14857 Martinsville Road Van Buren MI
5 823 Woodlawn Ave Dallas TX 75208
6 2525 Washington Avenue Waco TX 76710
7 123 South Main St Dallas TX 75201
The output I'm trying to achieve (for all rows, but I only wrote out the first two to save time)
City State Postcode
0 Lisbon Falls ME 04252
1 Augusta ME 04330
My code:
# Extract postcode and state
df["Zip"] = df["Address Text"].str.extract(r'(\d{5})', expand = True)
df["State"] = df["Address Text"].str.extract(r'([A-Z]{2})', expand = True)
# Split after these substrings
street_synonyms = ["Circle", "Street", "Drive", "Road", "Ave", "Avenue", "St"]
# This is where I got stuck
df["Syn"] = df["Address Text"].apply(lambda x: x.split(syn))
df
Here's a way to do that:
import pandas as pd
# data
df = pd.DataFrame(
['11 North Warren Circle Lisbon Falls ME 04252',
'227 Cony Street Augusta ME 04330',
'70 Buckner Drive Battle Creek MI',
'718 Perry Street Big Rapids MI',
'14857 Martinsville Road Van Buren MI',
'823 Woodlawn Ave Dallas TX 75208',
'2525 Washington Avenue Waco TX 76710',
'123 South Main St Dallas TX 75201'],
columns=['Address Text'])
# Extract postcode and state
df["Zip"] = df["Address Text"].str.extract(r'(\d{5})', expand=True)
df["State"] = df["Address Text"].str.extract(r'([A-Z]{2})', expand=True)
# Split after these substrings
street_synonyms = ["Circle", "Street", "Drive", "Road", "Ave", "Avenue", "St"]
def find_city(address, state, street_synonyms):
    for syn in street_synonyms:
        if syn in address:
            # remove street
            city = address.split(syn)[-1]
            # remove State and postcode
            city = city.split(state)[0]
            return city
df['City'] = df.apply(lambda x: find_city(x['Address Text'], x['State'], street_synonyms), axis=1)
print(df[['City', 'State', 'Zip']])
"""
City State Zip
0 Lisbon Falls ME 04252
1 Augusta ME 04330
2 Battle Creek MI NaN
3 Big Rapids MI NaN
4 Van Buren MI 14857
5 Dallas TX 75208
6 nue Waco TX 76710
7 Dallas TX 75201
"""
I know this should be easy but it's driving me mad...
I am trying to turn a dataframe into a grouped dataframe.
df outputs:
Postcode Borough Neighbourhood
0 M3A North York Parkwoods
1 M4A North York Victoria Village
2 M5A Downtown Toronto Harbourfront
3 M5A Downtown Toronto Regent Park
4 M6A North York Lawrence Heights
5 M6A North York Lawrence Manor
6 M7A Queen's Park Not assigned
7 M9A Etobicoke Islington Avenue
8 M1B Scarborough Rouge
9 M1B Scarborough Malvern
10 M3B North York Don Mills North
...
I want to make a grouped dataframe where the Neighbourhood is grouped by Postcode and all neighborhoods then become a concatenated string of Neighbourhoods as grouped by Postcode...
something like:
Postcode Borough Neighbourhood
0 M3A North York Parkwoods
1 M4A North York Victoria Village
2 M5A Downtown Toronto Harbourfront, Regent Park
...
I am trying to use:
df.groupby(['Postcode'])['Neighbourhood'].apply(lambda strs: ', '.join(strs))
But this does not return a new dataframe .. it outputs the same original dataframe when I use df after running.
If I use:
df = df.groupby(['Postcode'])['Neighbourhood'].apply(lambda strs: ', '.join(strs))
it turns df into an object?
Use this code
new_df = df.groupby(['Postcode', 'Borough']).agg({'Neighbourhood':lambda x:', '.join(x)}).reset_index()
reset_index() moves the group-by columns out of the index and back into regular columns, and creates a new integer index.
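A small runnable illustration, using a few rows from the question, may help show the difference between the two results. Without reset_index the grouped result is a Series indexed by the group keys, which is why df appeared to turn into "an object"; with it, you get a regular DataFrame. The names joined and new_df are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Postcode": ["M3A", "M4A", "M5A", "M5A", "M6A", "M6A"],
    "Borough": ["North York", "North York", "Downtown Toronto",
                "Downtown Toronto", "North York", "North York"],
    "Neighbourhood": ["Parkwoods", "Victoria Village", "Harbourfront",
                      "Regent Park", "Lawrence Heights", "Lawrence Manor"],
})

# groupby never changes df in place; it returns a new object.
# Here that object is a Series indexed by (Postcode, Borough).
joined = df.groupby(["Postcode", "Borough"])["Neighbourhood"].agg(", ".join)

# reset_index turns the index levels back into columns,
# giving a DataFrame with a fresh integer index.
new_df = joined.reset_index()
print(new_df)
```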
I have below dataframe nbr2:
Postal_Code Borough Neighborhood
0 M1B Scarborough Rouge, Malvern
1 M4C East York Woodbine Heights
2 M4E East Toronto The Beaches
3 M4L East Toronto The Beaches West, India Bazaar
4 M4M East Toronto Studio District
5 M4N Central Toronto Lawrence Park
On applying below code to filter out rows:
neighbor = nbr2.drop(nbr2[nbr2['Borough'].str.contains("Toronto")==False].index, axis=0, inplace=True)
the dataframe's printed output gets split across lines like below:
Postal_Code Borough \
37 M4E East Toronto
41 M4K East Toronto
42 M4L East Toronto
43 M4M East Toronto
Neighborhood
37 The Beaches
41 The Danforth West\n, Riverdale
42 The Beaches West\n, India Bazaar
43 Studio District\n
The code below also results in a similar structure:
# define the dataframe columns
column_names = ['Postal_Code','Borough', 'Neighborhood']
# instantiate the dataframe
neighbor = pd.DataFrame(columns=column_names)
neighbor = nbr2.drop(nbr2[nbr2['Borough'].str.contains("Toronto")==False].index, axis=0, inplace=True)
Use:
pd.set_option('display.expand_frame_repr', False)
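Separately from the display setting, note that drop(..., inplace=True) modifies the frame it is called on and returns None, so neighbor in the question ends up as None. The sketch below, using the rows shown in the question, filters with a boolean mask instead and then applies the display option; the names scratch and returned exist only to demonstrate the None return value.

```python
import pandas as pd

nbr2 = pd.DataFrame({
    "Postal_Code": ["M1B", "M4C", "M4E", "M4L", "M4M", "M4N"],
    "Borough": ["Scarborough", "East York", "East Toronto",
                "East Toronto", "East Toronto", "Central Toronto"],
    "Neighborhood": ["Rouge, Malvern", "Woodbine Heights", "The Beaches",
                     "The Beaches West, India Bazaar", "Studio District",
                     "Lawrence Park"],
})

# inplace=True changes `scratch` itself and returns None.
scratch = nbr2.copy()
returned = scratch.drop(
    scratch[~scratch["Borough"].str.contains("Toronto")].index,
    inplace=True,
)

# A boolean mask returns a new, filtered DataFrame and leaves nbr2 untouched.
neighbor = nbr2[nbr2["Borough"].str.contains("Toronto")].reset_index(drop=True)

# Keep wide frames on one block instead of wrapping columns onto new lines.
pd.set_option("display.expand_frame_repr", False)
print(neighbor)
```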