Python Pandas Groupby isin

I have a dataframe that lists different teams (green, blue, yellow, orange, etc.; there are hundreds of teams) along with their revenue on a monthly basis. I want to create a list of the top 10 teams based on revenue and then feed that into a groupby statement, so that I am only looking at those teams as I work through various dataframes. These are the statements I have created and that I am having trouble with:
Rev = df['Revenue'].head(10)
and I have also used
Rev = df.nlargest(10, ['Revenue'])
grpby = df.groupby([df['team'].isin(rev), 'team'], as_index=False)['Revenue'].sum().sort_values('Revenue', ascending=False).reset_index()
*Edit: Other code leading up to this request:
df = pd.read_excel('c:/Test.xlsx', sheet_name="Sheet1", index_col='Date', parse_dates=True)
df = pd.DataFrame(df)
I can make the groupby statement work, but I cannot feed the 'Rev' list into the groupby statement to limit/filter which groups to look at.
Also, when using a groupby statement to create a dataframe, how do I add back in other columns that are not being grouped? For example, in my statement above I try to utilize 'team' and 'Revenue', but if I also wanted to add in other columns like 'location' or 'team lead', what is the syntax for adding more columns?
*Edit
Sample input via excel file:
Teams   Revenue
Green   10
Blue    15
Red     20
Orange  5
In the above example, I would like to use a statement that takes the top three, saves them as a list, and then feeds that into the groupby statement. Now it looks like I have not filled the actual dataframe?
From the console:
Empty DataFrame
Columns: [Team, Revenue]
Index: []

You need to filter as a first step, by boolean indexing:
Sample:
df = pd.DataFrame({'Teams': ['Green', 'Blue', 'Red', 'Orange', 'Green', 'Blue', 'Grey', 'Purple'],
                   'Revenue': [18, 15, 20, 5, 10, 15, 2, 5],
                   'Location': ['A', 'B', 'V', 'G', 'A', 'D', 'B', 'C']})
print (df)
    Teams  Revenue Location
0   Green       18        A
1    Blue       15        B
2     Red       20        V
3  Orange        5        G
4   Green       10        A
5    Blue       15        D
6    Grey        2        B
7  Purple        5        C
First get the top values and select the column Teams:
Rev = df.nlargest(3,'Revenue')['Teams']
print (Rev)
2 Red
0 Green
1 Blue
Name: Teams, dtype: object
Then filter by boolean indexing:
print (df[df['Teams'].isin(Rev)])
   Teams  Revenue Location
0  Green       18        A
1   Blue       15        B
2    Red       20        V
4  Green       10        A
5   Blue       15        D
df1 = (df[df['Teams'].isin(Rev)]
       .groupby('Teams', as_index=False)['Revenue']
       .sum()
       .sort_values('Revenue', ascending=False))
print (df1)
   Teams  Revenue
0   Blue       30
1  Green       28
2    Red       20
If you need multiple columns in the output, it is necessary to set an aggregation function for each of them (e.g. 'mean' for another numeric column), like:
df2 = (df[df['Teams'].isin(Rev)]
       .groupby('Teams', as_index=False)
       .agg({'Revenue': 'sum', 'Location': ', '.join}))
print (df2)
   Teams  Revenue Location
0   Blue       30     B, D
1  Green       28     A, A
2    Red       20        V
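On newer pandas (0.25+), named aggregation does the same job and lets you name the output columns in one step. A minimal sketch, assuming the same sample df and Rev as above (the output names total_revenue and locations are my own choices):
df3 = (df[df['Teams'].isin(Rev)]
       .groupby('Teams', as_index=False)
       .agg(total_revenue=('Revenue', 'sum'),
            locations=('Location', ', '.join)))
print(df3)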

Related

Substitute column values of a dataframe with the corresponding items in an array

I have a column in a dataframe which contains numbers from 1 to 5, and I have an array containing five words. I would like to find the simplest, most compact and most elegant way in Python to replace, "in place", the numbers in the column with the corresponding words. For example:
import pandas as pd
# This is the dataframe
df = pd.DataFrame({'code': ['A', 'B', 'C', 'D', 'E'], 'color': [4, 1, 2, 5, 1]})
# code color
# 0 A 4
# 1 B 1
# 2 C 2
# 3 D 5
# 4 E 1
# This is the array
colors = ["blue", "yellow", "red", "white", "black"]
# This is what I wish to obtain
# code color
# 0 A white
# 1 B blue
# 2 C yellow
# 3 D black
# 4 E blue
I am certain that the numbers in column "color" are not NaN and not outside the range [1, 5], so no check is necessary. Any suggestion?
This should do it:
df['color'] = df['color'].apply(lambda c: colors[c-1])
Create a helper dictionary with enumerate and map the values with Series.map; if there is no match, NaN missing values are created:
df['color'] = df['color'].map(dict(enumerate(colors, 1)))
print (df)
  code   color
0    A   white
1    B    blue
2    C  yellow
3    D   black
4    E    blue
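For reference, here is the helper dictionary that enumerate builds from the colors list above:
mapping = dict(enumerate(colors, 1))
print(mapping)
# {1: 'blue', 2: 'yellow', 3: 'red', 4: 'white', 5: 'black'}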

How to add a column into Pandas with a condition

Here is a simple pandas DataFrame:
import pandas as pd

data = {'Name': ['John', 'Dav', 'Ann', 'Mike', 'Dany'],
        'Number': ['2', '3', '2', '4', '2']}
df = pd.DataFrame(data, columns=['Name', 'Number'])
df
I would like to add a third column named "color", where the value is 'Red' if Number = 2 and 'Blue' if Number = 3.
This dataframe has just 5 rows; in reality it has thousands of rows, so I cannot just fill in the column manually.
You can use .map:
dct = {2: "Red", 3: "Blue"}
df["color"] = df["Number"].astype(int).map(dct) # remove .astype(int) if the values are already integer
print(df)
Prints:
   Name Number color
0  John      2   Red
1   Dav      3  Blue
2   Ann      2   Red
3  Mike      4   NaN
4  Dany      2   Red
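If you want a default value instead of NaN for unmatched numbers (like Mike's 4 above), one option is to chain fillna; a small sketch, with 'Unknown' as a made-up default:
df["color"] = df["Number"].astype(int).map(dct).fillna("Unknown")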

Sorting and ranking subsets of complex data

I have a large and complex GIS datafile on road accidents in "cities" within "counties". Rows represent roads. Columns provide "City", "County" and "Sum of accidents in city". A city thus contains several roads (repeated values of accident sums) and a county several cities.
For each County, I now want to rank cities according to the number of accidents, so that within each County the city with the most accidents is ranked "1" and cities with fewer accidents are ranked "2" and above. This rank value shall be written into the original datafile.
My original approach was to:
1. Sort data according to 'County_ID' and 'Accidents' (descending)
2. Then calculate for each row:

if ('County' in row n+1 == 'County' in row n) AND ('Accidents' in row n+1 == 'Accidents' in row n):
    return value: n  # maintain same rank for cities within 'County'
elif ('County' in row n+1 == 'County' in row n) AND ('Accidents' in row n+1 < 'Accidents' in row n):
    return value: n+1  # increasing rank value within 'County'
elif ('County' in row n+1 < 'County' in row n) AND ('Accidents' in row n+1 < 'Accidents' in row n):
    return value: 1  # new 'County', i.e. start ranking from 1
else:
    return value: 0  # error
However, I could not figure out how to code this properly, and maybe this approach is not appropriate anyway; maybe a loop would do the trick?
Any recommendations?
Suggest using the Python pandas module.
Fictitious data
Create data with columns county, accidents, city. (Would use pandas read_csv to load actual data.)
import pandas as pd

df = pd.DataFrame([
    ['a', 1, 'A'],
    ['a', 2, 'B'],
    ['a', 5, 'C'],
    ['b', 5, 'D'],
    ['b', 5, 'E'],
    ['b', 6, 'F'],
    ['b', 8, 'G'],
    ['c', 2, 'H'],
    ['c', 2, 'I'],
    ['c', 7, 'J'],
    ['c', 7, 'K']
], columns=['county', 'accidents', 'city'])
Resultant Dataframe
df:
   county  accidents city
0       a          1    A
1       a          2    B
2       a          5    C
3       b          5    D
4       b          5    E
5       b          6    F
6       b          8    G
7       c          2    H
8       c          2    I
9       c          7    J
10      c          7    K
Group data rows by county, and rank rows within group by accidents
Ranking Code
# ascending=False causes cities with the most accidents to be ranked 1
df["rank"] = df.groupby("county")["accidents"].rank("dense", ascending=False)
Result
df:
   county  accidents city  rank
0       a          1    A   3.0
1       a          2    B   2.0
2       a          5    C   1.0
3       b          5    D   3.0
4       b          5    E   3.0
5       b          6    F   2.0
6       b          8    G   1.0
7       c          2    H   2.0
8       c          2    I   2.0
9       c          7    J   1.0
10      c          7    K   1.0
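Note that rank returns floats. If you prefer integer ranks, one way (assuming there are no NaNs in accidents) is a cast:
df["rank"] = df["rank"].astype(int)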
I think the approach by @DarrylG is correct, but it doesn't consider that the environment is ArcGIS.
Since you tagged your question with Python I came up with a workflow utilizing Pandas. There are other ways to do the same, using ArcGIS tools and or the Field Calculator.
import arcpy  # if you are using this script outside ArcGIS
import pandas as pd

# change this to your actual shapefile, you might have to include a path
filename = "road_accidents"
sFields = ['County', 'City', 'SumOfAccidents']  # consider this to be your columns

# read everything in your file into a Pandas DataFrame with a SearchCursor
with arcpy.da.SearchCursor(filename, sFields) as sCursor:
    df = pd.DataFrame(data=[row for row in sCursor], columns=sFields)
df = df.drop_duplicates()  # since each row represents a street, we can remove duplicates

# we use this code from DarrylG to calculate a rank
df['Rank'] = df.groupby('County')['SumOfAccidents'].rank('dense', ascending=False)

# set a multiindex, since there might be duplicate city names
df = df.set_index(['County', 'City'])
dct = df.to_dict()  # convert the dataframe into a dictionary

# add a field to your shapefile
arcpy.AddField_management(filename, 'Rank', 'SHORT')

# now we can update the shapefile
uFields = ['County', 'City', 'Rank']
with arcpy.da.UpdateCursor(filename, uFields) as uCursor:  # open an UpdateCursor on the file
    for row in uCursor:  # for each row (street)
        # get the county/city combo
        County_City = (row[uFields.index('County')], row[uFields.index('City')])
        if County_City in dct['Rank']:  # see if it is in your dictionary (it should be)
            # give it the value from the dictionary
            row[uFields.index('Rank')] = dct['Rank'][County_City]
        else:
            # otherwise...
            row[uFields.index('Rank')] = 999
        uCursor.updateRow(row)  # update the row
You can run this code inside the ArcGIS Pro Python console or in a Jupyter Notebook. Hope it helps!
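To illustrate the dictionary lookup above, here is a minimal pandas-only sketch (no arcpy needed, made-up values) of what to_dict produces on a multiindexed frame:
import pandas as pd
df = pd.DataFrame({'County': ['a', 'a', 'b'],
                   'City': ['X', 'Y', 'Z'],
                   'Rank': [1, 2, 1]})
dct = df.set_index(['County', 'City']).to_dict()
print(dct)
# {'Rank': {('a', 'X'): 1, ('a', 'Y'): 2, ('b', 'Z'): 1}}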

Sort the dataframe column value based on a priority list in python

I have a dataframe with two columns, as listed below:
Col1  Col2
A     RED
B     GREEN
C     AMBER
D     RED
E     GREEN
I want the output dataframe as:
Col1  Col2
A     RED
D     RED
C     AMBER
B     GREEN
E     GREEN
I want the column to be sorted in the priority order red, amber, green, irrespective of the Col1 value.
Thanks for any help in advance.
Another solution:
#create a mapping of the sort order
sortbox = {'RED':1,'AMBER':2,'GREEN':3}
#create new column with the sort order
df['sort_column'] = df.Col2.map(sortbox)
#sort with sort_column
df.sort_values('sort_column').drop('sort_column',axis=1).reset_index(drop=True)
  Col1   Col2
0    A    RED
1    D    RED
2    C  AMBER
3    B  GREEN
4    E  GREEN
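On pandas 1.1+ the helper column can be skipped by passing the mapping through the key parameter of sort_values; a small sketch reusing the sortbox dict from above:
df.sort_values('Col2', key=lambda s: s.map(sortbox)).reset_index(drop=True)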
One way to do this is by adding another column which contains the second letter of each row in col2 and sorting by it (E in RED < M in AMBER < R in GREEN, which happens to be the required order):
d1 = {'col1': ['A', 'B', 'C', 'D', 'E'], 'col2': ['RED', 'GREEN', 'AMBER', 'RED', 'GREEN']}
df1 = pd.DataFrame(data=d1)
df1['col3'] = [i[1] for i in df1['col2']]
df1 = df1.sort_values(by='col3')
The result, after excluding the 3rd column, is like the one you posted.
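An alternative worth noting (not from the answers above) is an ordered Categorical, which states the priority explicitly instead of relying on second letters; a sketch using the same df1:
df1['col2'] = pd.Categorical(df1['col2'], categories=['RED', 'AMBER', 'GREEN'], ordered=True)
df1 = df1.sort_values('col2')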

find difference between any two columns of dataframes with a common key column pandas

I have two dataframes, one having
Title Name Quantity ID
as the columns, and the 2nd dataframe has
ID Quantity
as the columns, with fewer rows than the first dataframe.
I need to find the difference between the Quantity of both dataframes based on a match in the ID columns, and I want to store this difference in a separate column in the first dataframe.
I tried this (didn't work):
DF1[['ID','Quantity']].reset_index(drop=True).apply(lambda id_qty_tup : DF2[DF2.ID==asin_qty_tup[0]].quantity - id_qty_tup[1] , axis = 1)
Another approach is to apply over the ID and Quantity of DF1 and iterate through each row of DF2, but it takes more time. I'm sure there is a better way.
You can perform index-aligned subtraction, and pandas takes care of the rest.
df['Diff'] = df.set_index('ID').Quantity.sub(df2.set_index('ID').Quantity).values
Demo
Here, changetype is the index, and I've already set it, so pd.Series.sub will align subtraction by default. Otherwise, you'd need to set the index as above.
df1
                 strings      test
changetype
0                 a very -1.250150
1       very boring text -1.376637
2       I cannot read it -1.011108
3            Hi everyone -0.527900
4         please go home -1.010845
5           or I will go  0.008159
6                    now -0.470354
df2
                            strings      test
changetype
0           a very very boring text  0.625465
1                  I cannot read it -1.487183
2                       Hi everyone  0.292866
3   please go home or I will go now  1.430081
df1.test.sub(df2.test)
changetype
0   -1.875614
1    0.110546
2   -1.303974
3   -1.957981
4         NaN
5         NaN
6         NaN
Name: test, dtype: float64
You can use map in this case:
df['diff'] = df['ID'].map(df2.set_index('ID').Quantity) - df.Quantity
Some Data
import pandas as pd

df = pd.DataFrame({'Title': ['A', 'B', 'C', 'D', 'E'],
                   'Name': ['AA', 'BB', 'CC', 'DD', 'EE'],
                   'Quantity': [1, 21, 14, 15, 611],
                   'ID': ['A1', 'A1', 'B2', 'B2', 'C1']})
df2 = pd.DataFrame({'Quantity': [11, 51, 44],
                    'ID': ['A1', 'B2', 'C1']})
We will use df2 to create a dictionary which can be used to map ID to Quantity. So anywhere there is ID == A1 in df it gets assigned the Quantity 11, B2 gets assigned 51, and C1 gets assigned 44. Here I'll add it as another column just for illustration purposes.
df['Quantity2'] = df['ID'].map(df2.set_index('ID').Quantity)
print(df)
   ID Name  Quantity Title  Quantity2
0  A1   AA         1     A         11
1  A1   BB        21     B         11
2  B2   CC        14     C         51
3  B2   DD        15     D         51
4  C1   EE       611     E         44
Then you can just subtract df['Quantity'] from the column we just created to get the difference (or subtract the new column from df['Quantity'] if you want the other sign).
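Putting it together, a minimal sketch of that last step (column names as in the demo above):
df['Diff'] = df['Quantity2'] - df['Quantity']
print(df[['ID', 'Quantity', 'Quantity2', 'Diff']])
#    ID  Quantity  Quantity2  Diff
# 0  A1         1         11    10
# 1  A1        21         11   -10
# 2  B2        14         51    37
# 3  B2        15         51    36
# 4  C1       611         44  -567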
