                     Percentage
NaN                    1.576020
Redmond                4.264524
England                4.975278
England - Street XY    5.346106
Denmark Street x       7.601978
England – Street wy   11.773795
England – Street AU   13.936959
Redmond street COX    50.525340
Baharin                0
I need to create another data frame which sums the Percentage of all rows starting with Redmond, all rows starting with England followed by a street name, all rows starting with England only, and so on for the other prefixes.
How can I do this in Python?
In the above case the output should be:
                     Percentage
NaN                    1.576020
Redmond               50.525340
England                4.975278
England with street   11.773795
Denmark                7.60
Baharin                0
One way to do this:
import pandas as pd

df = df.reset_index()
# flag rows whose label mentions a street; case-insensitive, since the data mixes 'Street' and 'street'
m = df['index'].astype(str).str.contains('street', case=False)
street_df = df.loc[m]
# sum Percentage per leading word of the label (Redmond, England, Denmark, ...)
street_df = street_df.groupby(street_df['index'].str.split(' ').str[0]).agg({'Percentage': 'sum'}).reset_index()
street_df['index'] = street_df['index'] + ' with street'
# keep the non-street rows as-is and append the aggregated street rows
result = pd.concat([df[~m], street_df])
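For reference, here is a minimal sketch that rebuilds the question's frame so the snippet above can be run end to end (labels and values copied from the table above; the NaN label is assumed to be a genuinely missing index value):
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'Percentage': [1.576020, 4.264524, 4.975278, 5.346106, 7.601978,
                    11.773795, 13.936959, 50.525340, 0]},
    index=[np.nan, 'Redmond', 'England', 'England - Street XY',
           'Denmark Street x', 'England – Street wy',
           'England – Street AU', 'Redmond street COX', 'Baharin'])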
I want to look up the value of the lookup_table's Code column based on the combination of text from two different columns of the data table. See the example below:
Data:
VMType  Location
DSv3    East Europe
ESv3    East US
ESv3    East Asia
DSv4    Central US
Ca2     Central US
lookup_table:
Type                                Code
Dv3/DSv3 - Gen Purpose East Europe  abc123
Dv3/D1 - Gen Purpose West US        abc321
Dav4/DSv4 - Gen Purpose Central US  bbb321
Eav3/ESv3 - Hi Tech East Asia       def321
Eav3/ESv3 - Hi Tech East US         xcd321
Csv2/Ca2 - Hi Tech Central US       xcc321
I want to do something like
data['new_column'] = lookup_table['Code'] where lookup_table['Type'] == data['VMType'] + '*' + data['Location']
or, removing the wildcard, it could be evaluated as:
data['new_column'] = lookup_table['Code'] where lookup_table['Type'] contains data['VMType'] AND lookup_table['Type'] contains data['Location']
Resulting in:
Data:
VMType  Location     new_column
DSv3    East Europe  abc123
ESv3    East US      xcd321
ESv3    East Asia    def321
DSv4    Central US   abc321
Ca2     Central US   xcc321
Ideally this can be done without iterating through the df.
First, extract the VMType and Location columns from lookup_table['Type']. Then merge with your data dataframe:
# VMType is the token between '/' and the next space, e.g. 'DSv3'
lookup_table['VMType'] = lookup_table['Type'].str.split('/').str[1].str.split().str[0]
# Location is the trailing two words, e.g. 'East Europe' (every location here is two words)
lookup_table['Location'] = lookup_table['Type'].str.split().str[-2:].str.join(' ')
lookup_table = lookup_table[['VMType', 'Location', 'Code']]
data = data.merge(lookup_table, on=['VMType', 'Location'])
Output:
  VMType     Location    Code
0   DSv3  East Europe  abc123
1   ESv3      East US  xcd321
2   ESv3    East Asia  def321
3   DSv4   Central US  abc321
4    Ca2   Central US  xcc321
Suppose I have two dataframes
df_1
city state salary
New York NY 85000
Chicago IL 65000
Miami FL 75000
Dallas TX 78000
Seattle WA 96000
df_2
city state taxes
New York NY 15000
Chicago IL 5000
Miami FL 6500
Next, I join the two dataframes
joined_df = df_1.merge(df_2, how='inner', left_on=['city'], right_on = ['city'])
The Result:
joined_df
city state salary city state taxes
New York NY 85000 New York NY 15000
Chicago IL 65000 Chicago IL 5000
Miami FL 75000 Miami FL 6500
Is there any way I can stack the two dataframes on top of each other, joining on the city, instead of extending the row horizontally, like below:
Requested:
joined_df
city state salary taxes
New York NY 85000
New York NY 15000
Chicago IL 65000
Chicago IL 5000
Miami FL 75000
Miami FL 6500
How can I do this in pandas?
In this case we might need to use merge to restrict to the relevant rows before concat, if we need to consider both city and state.
# keep only rows whose (city, state) pair appears in the other frame, then stack
rel_df_1 = df_1.merge(df_2)[df_1.columns]
rel_df_2 = df_2.merge(df_1)[df_2.columns]
df = pd.concat([rel_df_1, rel_df_2]).sort_values(['city', 'state'])
You can use append (a shortcut for concat) to achieve that; note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current versions use pd.concat instead (see the sketch after the update below):
result = df1.append(df2, sort=False)
If your dataframes have overlapping indexes, you can use:
df1.append(df2, ignore_index=True, sort=False)
UPDATE: After appending your dataframes, you can filter the result to keep only the rows whose city appears in both dataframes:
result = result.loc[result['city'].isin(df1['city'])
                    & result['city'].isin(df2['city'])]
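Since append is gone in pandas 2.0, here is the same answer as a direct pd.concat translation (same behaviour, not a new approach):
import pandas as pd

result = pd.concat([df1, df2], ignore_index=True, sort=False)
# keep only the cities present in both frames
result = result.loc[result['city'].isin(df1['city'])
                    & result['city'].isin(df2['city'])]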
Try with stack():
stacked = df_1.merge(df_2, on=["city", "state"]).set_index(["city", "state"]).stack()
output = pd.concat([stacked.where(stacked.index.get_level_values(-1) == "salary"),
                    stacked.where(stacked.index.get_level_values(-1) == "taxes")],
                   axis=1,
                   keys=["salary", "taxes"]) \
           .droplevel(-1) \
           .reset_index()
>>> output
city state salary taxes
0 New York NY 85000.0 NaN
1 New York NY NaN 15000.0
2 Chicago IL 65000.0 NaN
3 Chicago IL NaN 5000.0
4 Miami FL 75000.0 NaN
5 Miami FL NaN 6500.0
I have a DataFrame with start_time in proper datetime format and start_station_name as a string that looks like this:
start_time start_station_name
2019-03-20 11:04:16 San Francisco Caltrain (Townsend St at 4th St)
2019-04-06 14:19:06 Folsom St at 9th St
2019-05-24 17:21:11 Golden Gate Ave at Hyde St
2019-03-27 18:53:27 4th St at Mission Bay Blvd S
2019-04-16 08:45:16 Esprit Park
Now I would like to plot the frequency of each station name per month over the year. To group the data accordingly, I used this:
data = df_clean.groupby(df_clean['start_time'].dt.strftime('%B'))['start_station_name'].value_counts()
This returns a Series (dtype: int64) rather than a DataFrame:
start_time start_station_name
April San Francisco Caltrain Station 2 (Townsend St at 4th St) 4866
Market St at 10th St 4609
San Francisco Ferry Building (Harry Bridges Plaza) 4270
Berry St at 4th St 3994
Montgomery St BART Station (Market St at 2nd St) 3550
...
September Mission Bay Kids Park 1026
11th St at Natoma St 1023
Victoria Manalo Draves Park 1018
Davis St at Jackson St 1015
San Francisco Caltrain Station (King St at 4th St) 1014
Now, I would like to simply plot it as a clustered bar chart using Seaborn's countplot(), only for an absolute frequency above 1000, where the x-axis represents the month, the hue is the name and y-axis should show the counts:
sns.countplot(data = data[data > 1000], x = 'start_time', hue = 'start_station_name')
Then I get the error message Could not interpret input 'start_time', probably because it's not a proper DataFrame. How can I group/aggregate it in the first place, so that the visualization works?
Try:
data = df.groupby([df['start_time'].dt.strftime('%B'), 'start_station_name']) \
.count() \
.rename(columns={"start_time": "count"}) \
.reset_index()
ax = sns.countplot(x="start_time", hue="start_station_name", data=data[data["count"] > 1000])
Explanations:
Change the groupby key by adding the start_station_name column.
Use count to get the number of rows in each (month, station) group.
Rename the resulting start_time column to count using rename.
Reset the index from the groupby using reset_index.
Subset the dataset on the count column (note the brackets: data.count is the DataFrame method, so data.count > 1000 raises an error).
Plot the result using countplot (see the second example in its doc); a barplot variant for pre-aggregated counts is sketched after the full code below.
Full code:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

print(df)
# start_time start_station_name
# 0 2019-03-20 11:04:16 San Francisco Caltrain (Townsend St at 4th St)
# 1 2019-04-06 14:19:06 Folsom St at 9th St
# 2 2019-05-24 17:21:11 Golden Gate Ave at Hyde St
# 3 2019-03-27 18:53:27 4th St at Mission Bay Blvd S
# 4 2019-04-16 08:45:16 Esprit Park
data = df.groupby([df['start_time'].dt.strftime('%B'), 'start_station_name']) \
.count() \
.rename(columns={"start_time": "count"}) \
.reset_index()
print(data)
# start_time start_station_name count
# 0 April Esprit Park 1
# 1 April Folsom St at 9th St 1
# 2 March 4th St at Mission Bay Blvd S 1
# 3 March San Francisco Caltrain (Townsend St at 4th St) 1
# 4 May Golden Gate Ave at Hyde St 1
# Filter as desired (brackets, not attribute access, to reach the count column)
# data = data[data["count"] > 1000]
# Plot
ax = sns.countplot(x="start_time", hue="start_station_name", data=data)
plt.show()
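Note that countplot counts rows, so on this pre-aggregated frame every bar would have height 1. If you want the precomputed counts on the y-axis, a barplot sketch using the same column names as above:
import seaborn as sns

# barplot plots the given y values instead of re-counting rows
ax = sns.barplot(x="start_time", y="count", hue="start_station_name",
                 data=data[data["count"] > 1000])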
I know this should be easy but it's driving me mad...
I am trying to turn a dataframe into a grouped dataframe.
df outputs:
Postcode Borough Neighbourhood
0 M3A North York Parkwoods
1 M4A North York Victoria Village
2 M5A Downtown Toronto Harbourfront
3 M5A Downtown Toronto Regent Park
4 M6A North York Lawrence Heights
5 M6A North York Lawrence Manor
6 M7A Queen's Park Not assigned
7 M9A Etobicoke Islington Avenue
8 M1B Scarborough Rouge
9 M1B Scarborough Malvern
10 M3B North York Don Mills North
...
I want to make a grouped dataframe where the rows are grouped by Postcode and all Neighbourhood values become one concatenated string per Postcode...
something like:
Postcode Borough Neighbourhood
0 M3A North York Parkwoods
1 M4A North York Victoria Village
2 M5A Downtown Toronto Harbourfront, Regent Park
...
I am trying to use:
df.groupby(['Postcode'])['Neighbourhood'].apply(lambda strs: ', '.join(strs))
But this does not give me a new dataframe: df is unchanged when I use it after running.
If I use:
df = df.groupby(['Postcode'])['Neighbourhood'].apply(lambda strs: ', '.join(strs))
it turns df into a Series of dtype object?
Use this code:
new_df = df.groupby(['Postcode', 'Borough']).agg({'Neighbourhood':lambda x:', '.join(x)}).reset_index()
reset_index() will move your groupby columns out of the index, return them as regular columns of the dataframe, and create a new integer index.
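Equivalently, a slightly shorter form that selects the column first and passes the join directly (same result, just more idiomatic):
new_df = df.groupby(['Postcode', 'Borough'])['Neighbourhood'].agg(', '.join).reset_index()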
I have a dataframe which looks like below:
Country City
UK London
USA Washington
UK London
UK Manchester
USA Washington
USA Chicago
I want to group by country and aggregate to the most repeated city in each country.
My desired output should be:
Country City
UK London
USA Washington
Because London and Washington appear 2 times, whereas Manchester and Chicago appear only 1 time.
I tried
from scipy.stats import mode
df_summary = df.groupby('Country')['City'].\
apply(lambda x: mode(x)[0][0]).reset_index()
But it seems it won't work on strings
I can't replicate your error, but you can use pd.Series.mode, which accepts strings and returns a Series; use iat to extract the first value:
res = df.groupby('Country')['City'].apply(lambda x: x.mode().iat[0]).reset_index()
print(res)
Country City
0 UK London
1 USA Washington
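One caveat: on a tie, Series.mode returns all modal values in sorted order, so iat[0] silently picks the alphabetically first one. A tiny sketch of that behaviour:
import pandas as pd

s = pd.Series(['York', 'York', 'Bath', 'Bath'])
print(s.mode().iat[0])  # 'Bath': ties are returned sorted, first one wins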
Try the following:
>>> df.City.mode()
0        London
1    Washington
dtype: object
Or, you can use scipy.stats with a lambda:
import pandas as pd
from scipy import stats

df.groupby('Country').agg({'City': lambda x: stats.mode(x)[0]})
               City
Country
UK           London
USA      Washington
# df.groupby('Country').agg({'City': lambda x:stats.mode(x)[0]}).reset_index()
However, it also gives a nice count if you don't want to return only the first value:
>>> df.groupby('Country').agg({'City': lambda x: stats.mode(x)})
                        City
Country
UK          ([London], [2])
USA     ([Washington], [2])
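If you want both the city and its count as proper columns without scipy, a pandas-only sketch (the ('name', func) tuples in agg set the output column names):
out = df.groupby('Country')['City'].agg(
    [('City', lambda x: x.mode().iat[0]),
     ('Count', lambda x: x.value_counts().iat[0])]).reset_index()
#   Country        City  Count
# 0      UK      London      2
# 1     USA  Washington      2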