I have a dataframe like this:
CITY LOCATION PRODUCT
CHICAGO CHI1 A
CHICAGO CHI1 B
CHICAGO CHI4 C
NEWYORK NY1 D
NEWYORK NY2 E
NEWYORK NY2 F
NEWYORK NY2 G
ATLANTA ATL1 H
ATLANTA ATL1 I
And I want to get 2 different stats based on the same grouping.
The grouping is [CITY, LOCATION]. I want to be able to get the number of products per location as well as the name of the first product (in alphabetical order) for that location.
The result would be:
CITY LOCATION FIRST COUNT
CHICAGO CHI1 A 2
CHICAGO CHI4 C 1
NEWYORK NY1 D 1
NEWYORK NY2 E 3
ATLANTA ATL1 H 2
The only way I've managed to do this is by:
gb = data.groupby(['CITY', 'LOCATION'])
df = gb.min().join(other=gb.count(), how='left', on=['CITY', 'LOCATION'], rsuffix='_r')
But I'm sure there's a better way to re-use the same groupby() object without having to join 2 dataframes.
Something similar to SQL:
SELECT city, location, min(product), count(product) FROM table GROUP BY city, location
Is there a better way to do this?
Use agg:
df.groupby(['CITY', 'LOCATION'], sort=False).PRODUCT.agg(['min', 'count']).reset_index()
CITY LOCATION min count
0 CHICAGO CHI1 A 2
1 CHICAGO CHI4 C 1
2 NEWYORK NY1 D 1
3 NEWYORK NY2 E 3
4 ATLANTA ATL1 H 2
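If you also want the output columns named FIRST and COUNT as in the question, named aggregation (added in pandas 0.25) lets you set the names in the same call; a minimal sketch:

out = (df.groupby(['CITY', 'LOCATION'], sort=False)
         .agg(FIRST=('PRODUCT', 'min'), COUNT=('PRODUCT', 'count'))
         .reset_index())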
This is not the best approach, but this is what I did so far:
I have this example df:
df = pd.DataFrame({
    'City': ['I lived Los Angeles', 'I visited London and Toronto',
             'the best one is Toronto', 'business hub is in New York',
             ' Mexico city is stunning']
})
df
gives:
City
0 I lived Los Angeles
1 I visited London and Toronto
2 the best one is Toronto
3 business hub is in New York
4 Mexico city is stunning
I am trying to match (case-insensitively) city names from a nested dict and create new columns named after each country, holding int values for statistical purposes.
So, here is my nested dict as a reference for countries and cities:
country = {'US': ['New York', 'Los Angeles', 'San Diego'],
           'CA': ['Montreal', 'Toronto', 'Manitoba'],
           'UK': ['London', 'Liverpool', 'Manchester']
          }
and I created a function that should look for each city in the df, match it against the dict, and then create a column with the country name:
def get_country(x):
    count = 0
    for k, v in country.items():
        for y in v:
            if y.lower() in x:
                df[k] = count + 1
            else:
                return None
then applied it to df:
df.City.apply(lambda x: get_country(x.lower()))
I got the following output:
City US
0 I lived Los Angeles 1
1 I visited London and Toronto 1
2 the best one is Toronto 1
3 business hub is in New York 1
4 Mexico city is stunning 1
Expected output:
City US CA UK
0 I lived Los Angeles 1 0 0
1 I visited London and Toronto 0 1 1
2 the best one is Toronto 0 1 0
3 business hub is in New York 1 0 0
4 Mexico city is stunning 0 0 0
Here is a solution based on your function. I changed the variable names to make it more readable and easier to follow.
df = pd.DataFrame({
    'City': ['I lived Los Angeles',
             'I visited London and Toronto',
             'the best one is Toronto',
             'business hub is in New York',
             ' Mexico city is stunning']
})
country_cities = {
    'US': ['New York', 'Los Angeles', 'San Diego'],
    'CA': ['Montreal', 'Toronto', 'Manitoba'],
    'UK': ['London', 'Liverpool', 'Manchester']
}
def get_country(text):
    text = text.lower()
    country_counts = dict.fromkeys(country_cities, 0)
    for country, cities in country_cities.items():
        for city in cities:
            if city.lower() in text:
                country_counts[country] += 1
    return pd.Series(country_counts)
df = df.join(df.City.apply(get_country))
Output:
City US CA UK
0 I lived Los Angeles 1 0 0
1 I visited London and Toronto 0 1 1
2 the best one is Toronto 0 1 0
3 business hub is in New York 1 0 0
4 Mexico city is stunning 0 0 0
Solution based on Series.str.count
A simpler solution is to use Series.str.count to count the occurrences of the regex pattern city1|city2|... for each country (the pattern matches city1 or city2, and so on). Using the same setup as above:
country_patterns = {country: '|'.join(cities) for country, cities in country_cities.items()}
for country, pat in country_patterns.items():
    df[country] = df['City'].str.count(pat)
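Note that str.count is case-sensitive by default, while the question asks for case-insensitive matching; a regex flag handles that (a sketch, assuming the city names contain no regex metacharacters):

import re

for country, pat in country_patterns.items():
    df[country] = df['City'].str.count(pat, flags=re.IGNORECASE)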
Why doesn't your solution work?
if y.lower() in x:
    df[k] = count + 1
else:
    return None
The reason your function doesn't produce the right output is that you return None as soon as a city is not found in the text: the remaining countries and cities are never checked, because the return statement immediately exits the function.
What actually happens is that only US cities are checked, and the line df[k] = count + 1 (always 1, since count stays 0; here k = 'US') creates an entire column named k filled with the value 1. It does not set a single value for that row; it creates or overwrites the whole column. When using apply you should only transform the single row or value passed to the function, so don't modify the main DataFrame directly inside it.
You can achieve this result by using a lambda function that checks whether any city of each country is contained in the string, after first lower-casing the city names in country:
cl = {k: list(map(str.lower, v)) for k, v in country.items()}
for ctry, cities in cl.items():
    df[ctry] = df['City'].apply(lambda s: any(c in s.lower() for c in cities)).astype(int)
Output:
City US CA UK
0 I lived Los Angeles 1 0 0
1 I visited London and Toronto 0 1 1
2 the best one is Toronto 0 1 0
3 business hub is in New York 1 0 0
4 Mexico city is stunning 0 0 0
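A caveat that applies to all of the substring approaches above: a plain in check or an unanchored pattern will also match a city name embedded inside a longer word. A sketch that anchors on word boundaries instead, using the original country dict (re.escape guards against special characters in city names):

import re

for ctry, cities in country.items():
    pat = r'\b(?:' + '|'.join(map(re.escape, cities)) + r')\b'
    df[ctry] = df['City'].str.contains(pat, case=False, regex=True).astype(int)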
Suppose I have two dataframes
df_1
city state salary
New York NY 85000
Chicago IL 65000
Miami FL 75000
Dallas TX 78000
Seattle WA 96000
df_2
city state taxes
New York NY 15000
Chicago IL 5000
Miami FL 6500
Next, I join the two dataframes
joined_df = df_1.merge(df_2, how='inner', left_on=['city'], right_on = ['city'])
The Result:
joined_df
city state_x salary state_y taxes
New York NY 85000 NY 15000
Chicago IL 65000 IL 5000
Miami FL 75000 FL 6500
Is there any way I can stack the two dataframes on top of each other, joining on the city, instead of extending the rows horizontally, like below:
Requested:
joined_df
city state salary taxes
New York NY 85000
New York NY 15000
Chicago IL 65000
Chicago IL 5000
Miami FL 75000
Miami FL 6500
How can I do this in pandas?
In this case we can use merge to restrict each frame to the relevant rows before concat; a bare merge() joins on all shared columns, so both city and state are taken into account.
rel_df_1 = df_1.merge(df_2)[df_1.columns]
rel_df_2 = df_2.merge(df_1)[df_2.columns]
df = pd.concat([rel_df_1, rel_df_2]).sort_values(['city', 'state'])
You can use append (a shortcut for concat) to achieve that:
result = df1.append(df2, sort=False)
If your dataframes have overlapping indexes, you can use:
df1.append(df2, ignore_index=True, sort=False)
UPDATE: After appending your dataframes, you can filter your result to get only the rows that contain the city in both dataframes:
result = result.loc[result['city'].isin(df1['city'])
& result['city'].isin(df2['city'])]
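Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; on current versions the equivalent call is pd.concat:

result = pd.concat([df1, df2], ignore_index=True, sort=False)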
Try with stack():
stacked = df_1.merge(df_2, on=["city", "state"]).set_index(["city", "state"]).stack()
output = pd.concat([stacked.where(stacked.index.get_level_values(-1) == "salary"),
                    stacked.where(stacked.index.get_level_values(-1) == "taxes")],
                   axis=1,
                   keys=["salary", "taxes"]) \
           .droplevel(-1) \
           .reset_index()
>>> output
city state salary taxes
0 New York NY 85000.0 NaN
1 New York NY NaN 15000.0
2 Chicago IL 65000.0 NaN
3 Chicago IL NaN 5000.0
4 Miami FL 75000.0 NaN
5 Miami FL NaN 6500.0
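As an alternative sketch (not from the original answer): the same interleaved layout can be produced by concatenating the salary and taxes slices and interleaving the rows with a stable sort on the shared index:

merged = df_1.merge(df_2, on=["city", "state"])
out = (pd.concat([merged[["city", "state", "salary"]],
                  merged[["city", "state", "taxes"]]])
         .sort_index(kind="stable")
         .reset_index(drop=True))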
I have two functions which shift a row of a pandas dataframe to the top or bottom, respectively. After applying them more than once to a dataframe, they seem to work incorrectly.
These are the 2 functions to move the row to top / bottom:
def shift_row_to_bottom(df, index_to_shift):
    """Shift row, given by index_to_shift, to bottom of df."""
    idx = df.index.tolist()
    idx.pop(index_to_shift)
    df = df.reindex(idx + [index_to_shift])
    return df

def shift_row_to_top(df, index_to_shift):
    """Shift row, given by index_to_shift, to top of df."""
    idx = df.index.tolist()
    idx.pop(index_to_shift)
    df = df.reindex([index_to_shift] + idx)
    return df
Note: I don't want to reset_index for the returned df.
Example:
df = pd.DataFrame({'Country': ['USA', 'GE', 'Russia', 'BR', 'France'],
                   'ID': ['11', '22', '33', '44', '55'],
                   'City': ['New-York', 'Berlin', 'Moscow', 'London', 'Paris'],
                   'short_name': ['NY', 'Ber', 'Mosc', 'Lon', 'Pa']})
df =
Country ID City short_name
0 USA 11 New-York NY
1 GE 22 Berlin Ber
2 Russia 33 Moscow Mosc
3 BR 44 London Lon
4 France 55 Paris Pa
Now, apply the function for the first time and move the row with index 0 to the bottom:
df_shifted = shift_row_to_bottom(df,0)
df_shifted =
Country ID City short_name
1 GE 22 Berlin Ber
2 Russia 33 Moscow Mosc
3 BR 44 London Lon
4 France 55 Paris Pa
0 USA 11 New-York NY
The result is exactly what I want.
Now, apply the function again. This time, move the row with index 2 to the bottom:
df_shifted = shift_row_to_bottom(df_shifted,2)
df_shifted =
Country ID City short_name
1 GE 22 Berlin Ber
2 Russia 33 Moscow Mosc
4 France 55 Paris Pa
0 USA 11 New-York NY
2 Russia 33 Moscow Mosc
Well, this is not what I was expecting. There must be a problem when the function is applied a second time. The problem is analogous for shift_row_to_top.
My question is:
What's going on here?
Is there a better way to shift a specific row to top / bottom of the dataframe? Maybe a pandas-function?
If not, how would you do it?
Your problem is these two lines:
idx = df.index.tolist()
idx.pop(index_to_shift)
idx is a list, and idx.pop(index_to_shift) removes the item at position index_to_shift of idx, not the item whose value (the row label) is index_to_shift. After the first shift, the positions no longer match the labels, which is why the second call picks the wrong row.
Try this function:
def shift_row_to_bottom(df, index_to_shift):
    idx = [i for i in df.index if i != index_to_shift]
    return df.loc[idx + [index_to_shift]]
# call the function twice
for i in range(2): df = shift_row_to_bottom(df, 2)
Output:
Country ID City short_name
0 USA 11 New-York NY
1 GE 22 Berlin Ber
3 BR 44 London Lon
4 France 55 Paris Pa
2 Russia 33 Moscow Mosc
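For completeness, the same label-based fix applies to shift_row_to_top; a minimal sketch:

def shift_row_to_top(df, index_to_shift):
    """Move the row whose label is index_to_shift to the top of df."""
    idx = [i for i in df.index if i != index_to_shift]
    return df.loc[[index_to_shift] + idx]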
I want to split one column from my dataframe into multiple columns, then attach those columns back to my original dataframe and divide my original dataframe based on whether the split columns include a specific string.
I have a dataframe that has a column with values separated by semicolons like below.
import pandas as pd
data = {'ID': ['1', '2', '3', '4', '5', '6', '7'],
        'Residence': ['USA;CA;Los Angeles;Los Angeles', 'USA;MA;Suffolk;Boston', 'Canada;ON',
                      'USA;FL;Charlotte', 'NA', 'Canada;QC', 'USA;AZ'],
        'Name': ['Ann', 'Betty', 'Carl', 'David', 'Emily', 'Frank', 'George'],
        'Gender': ['F', 'F', 'M', 'M', 'F', 'M', 'M']}
df = pd.DataFrame(data)
Then I split the column as below, and separated the split column into two based on whether it contains the string USA or not.
address = df['Residence'].str.split(';',expand=True)
country = address[0] != 'USA'
USA, nonUSA = address[~country], address[country]
Now if you run USA and nonUSA, you'll note that there are extra columns in nonUSA, and also a row with no country information. So I got rid of those NA values.
USA.columns = ['Country', 'State', 'County', 'City']
nonUSA.columns = ['Country', 'State']
nonUSA = nonUSA.dropna(axis=0, subset=[1])
nonUSA = nonUSA[nonUSA.columns[0:2]]
Now I want to attach USA and nonUSA to my original dataframe, so that I will get two dataframes that look like below:
USAdata = pd.DataFrame({'ID': ['1', '2', '4', '7'],
                        'Name': ['Ann', 'Betty', 'David', 'George'],
                        'Gender': ['F', 'F', 'M', 'M'],
                        'Country': ['USA', 'USA', 'USA', 'USA'],
                        'State': ['CA', 'MA', 'FL', 'AZ'],
                        'County': ['Los Angeles', 'Suffolk', 'Charlotte', 'None'],
                        'City': ['Los Angeles', 'Boston', 'None', 'None']})
nonUSAdata = pd.DataFrame({'ID': ['3', '6'],
                           'Name': ['Carl', 'Frank'],
                           'Gender': ['M', 'M'],
                           'Country': ['Canada', 'Canada'],
                           'State': ['ON', 'QC']})
I'm stuck here though. How can I split my original dataframe into people whose Residence includes USA and those whose Residence doesn't, and attach the split columns from Residence (USA and nonUSA) back to my original dataframe?
(Also, I just uploaded everything I had so far, but I'm curious if there's a cleaner/smarter way to do this.)
The original data has a unique index that is left unchanged by the code below for both DataFrames, so you can use concat to put USA and nonUSA back together, then add them to the original with DataFrame.join or concat with axis=1:
address = df['Residence'].str.split(';',expand=True)
country = address[0] != 'USA'
USA, nonUSA = address[~country], address[country]
USA.columns = ['Country', 'State', 'County', 'City']
nonUSA = nonUSA.dropna(axis=0, subset=[1])
nonUSA = nonUSA[nonUSA.columns[0:2]]
# changed order to avoid an error: dropna(subset=[1]) needs the original integer column label
nonUSA.columns = ['Country', 'State']
df = pd.concat([df, pd.concat([USA, nonUSA])], axis=1)
Or:
df = df.join(pd.concat([USA, nonUSA]))
print (df)
ID Residence Name Gender Country State \
0 1 USA;CA;Los Angeles;Los Angeles Ann F USA CA
1 2 USA;MA;Suffolk;Boston Betty F USA MA
2 3 Canada;ON Carl M Canada ON
3 4 USA;FL;Charlotte David M USA FL
4 5 NA Emily F NaN NaN
5 6 Canada;QC Frank M Canada QC
6 7 USA;AZ George M USA AZ
County City
0 Los Angeles Los Angeles
1 Suffolk Boston
2 NaN NaN
3 Charlotte None
4 NaN NaN
5 NaN NaN
6 None None
But it seems this can be simplified:
c = ['Country', 'State', 'County', 'City']
df[c] = df['Residence'].str.split(';',expand=True)
print (df)
ID Residence Name Gender Country State \
0 1 USA;CA;Los Angeles;Los Angeles Ann F USA CA
1 2 USA;MA;Suffolk;Boston Betty F USA MA
2 3 Canada;ON Carl M Canada ON
3 4 USA;FL;Charlotte David M USA FL
4 5 NA Emily F NA None
5 6 Canada;QC Frank M Canada QC
6 7 USA;AZ George M USA AZ
County City
0 Los Angeles Los Angeles
1 Suffolk Boston
2 None None
3 Charlotte None
4 None None
5 None None
6 None None
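If you still want the two separate frames from the question, you can then filter the joined result on Country; a minimal sketch based on the first (join) variant above, where the row with no residence information has NaN in Country:

USAdata = df[df['Country'] == 'USA']
nonUSAdata = df[df['Country'].notna() & (df['Country'] != 'USA')].drop(columns=['County', 'City'])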
I have two dataframes as shown below.
Company Name BOD Position Ethnicity DOB Age Gender Degree (Specialization) Remark
0 Big Lots Inc. David J. Campisi Director, President and Chief Executive Offic... American 1956 61 Male Graduate NaN
1 Big Lots Inc. Philip E. Mallott Chairman of the Board American 1958 59 Male MBA, Finance NaN
2 Big Lots Inc. James R. Chambers Independent Director American 1958 59 Male MBA NaN
3 Momentive Performance Materials Inc Mahesh Balakrishnan director Asian 1983 34 Male BA Economics NaN
Company Name Net Sale Gross Profit Remark
0 Big Lots Inc. 5.2B 2.1B NaN
1 Momentive Performance Materials Inc 544M 146m NaN
2 Markel Corporation 5.61B 2.06B NaN
3 Noble Energy, Inc. 3.49B 2.41B NaN
4 Leidos Holding, Inc. 7.04B 852M NaN
I want to create a new dataframe from these two, so that the 2nd dataframe gets new columns with the count of each ethnicity per company, such as American: 2, Mexican: 5, and so on, so that later on I can calculate a diversity score.
The columns in the output dataframe would look like:
Company Name Net Sale Gross Profit Remark American Mexican German .....
Big Lots Inc. 5.2B 2.1B NaN 2 0 5 ....
First get counts per group with groupby and size, reshape with unstack, and finally join to the second DataFrame:
df1 = pd.DataFrame({'Company Name': list('aabcac'),
                    'Ethnicity': ['American'] * 3 + ['Mexican'] * 3})
df1 = df1.groupby(['Company Name', 'Ethnicity']).size().unstack(fill_value=0)
# slower alternative
# df1 = pd.crosstab(df1['Company Name'], df1['Ethnicity'])
print (df1)
Ethnicity American Mexican
Company Name
a 2 1
b 1 0
c 0 2
df2 = pd.DataFrame({'Company Name':list('abc')})
print (df2)
Company Name
0 a
1 b
2 c
df3 = df2.join(df1, on=['Company Name'])
print (df3)
Company Name American Mexican
0 a 2 1
1 b 1 0
2 c 0 2
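Applied to the frames in the question (df_bod and df_financial are hypothetical names for the first and second frame shown), the same pattern would look like:

# one count column per ethnicity, indexed by company
counts = df_bod.groupby(['Company Name', 'Ethnicity']).size().unstack(fill_value=0)
out = df_financial.join(counts, on='Company Name')
# companies with no board rows get NaN counts; make them 0
out[counts.columns] = out[counts.columns].fillna(0).astype(int)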
EDIT: You need to replace the unit suffix with zeros and convert to floats (note that sorting the Series before assigning it back to a column has no effect, because assignment aligns on the index):
print (df)
Name sale
0 A 100M
1 B 200M
2 C 5M
3 D 40M
4 E 10B
5 F 2B
d = {'M': '0' * 6, 'B': '0' * 9}
df['a'] = df['sale'].replace(d, regex=True).astype(float)
print (df)
Name sale a
0 A 100M 1.000000e+08
1 B 200M 2.000000e+08
2 C 5M 5.000000e+06
3 D 40M 4.000000e+07
4 E 10B 1.000000e+10
5 F 2B 2.000000e+09
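One wrinkle in the question's actual data: the Gross Profit column contains a lowercase suffix (146m), which the mapping above would miss. A sketch that normalizes case first, assuming only M and B suffixes occur:

d = {'M': '0' * 6, 'B': '0' * 9}
df['a'] = df['sale'].str.upper().replace(d, regex=True).astype(float)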