I'm following
https://nikolaygrozev.wordpress.com/2015/07/01/reshaping-in-pandas-pivot-pivot-table-stack-and-unstack-explained-with-pictures/
but am facing a different scenario for pivoting DataFrames.
The basic pivot command is like this:
d.pivot(index='Item', columns='CType', values='USD')
Now suppose each 'Item' belongs to two categories, 'Area' and 'Region', stored in two other data columns. I want the pivoted result to contain those three index levels (Region, Area, Item). How can I do that?
I have been looking for answers everywhere and have tried methods like 'unstack', 'droplevel', 'reset_index', etc., but couldn't make them work.
Please help.
Thanks
First, you probably want to use pd.pivot_table. Second, when you want to have multiple columns along a dimension, you need to pass them as a list (e.g. index=['Item', 'Area', 'Region']).
# Random data.
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame({'Area': ['A', 'A', 'A', 'B', 'B', 'B'],
                   'Region': ['r', 's', 'r', 's', 'r', 'r'],
                   'Item': ['car', 'car', 'car', 'truck', 'bus', 'bus'],
                   'CType': [3, 4, 3, 3, 5, 5],
                   'USD': np.random.rand(6) * 100})
>>> df
  Area  CType   Item Region        USD
0    A      3    car      r  54.881350
1    A      4    car      s  71.518937
2    A      3    car      r  60.276338
3    B      3  truck      s  54.488318
4    B      5    bus      r  42.365480
5    B      5    bus      r  64.589411
>>> pd.pivot_table(df,
...                index=['Item', 'Area', 'Region'],
...                columns='CType',
...                values='USD',
...                aggfunc='sum')
CType                       3          4           5
Item  Area Region
bus   B    r              NaN        NaN  106.954891
car   A    r       115.157688        NaN         NaN
           s              NaN  71.518937         NaN
truck B    s        54.488318        NaN         NaN
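The index levels appear in the order you list them, so to get the (Region, Area, Item) ordering from the question you can either pass index=['Region', 'Area', 'Item'] directly, or rearrange an existing result:
>>> pivoted = pd.pivot_table(df, index=['Item', 'Area', 'Region'],
...                          columns='CType', values='USD', aggfunc='sum')
>>> pivoted.reorder_levels(['Region', 'Area', 'Item']).sort_index()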
I have a DataFrame with the following columns:
import pandas as pd
data = {'Country': ['A', 'A', 'A', 'B', 'B'],
        'Capital': ['CC', 'CD', 'CE', 'CF', 'CG'],
        'Population': [5, 35, 20, 34, 65]}
df = pd.DataFrame(data,columns=['Country', 'Capital', 'Population'])
I want to compare each row with all the others, and whenever a pair of rows shares the same Country, concatenate the pair into a new DataFrame (and then write it to a new csv). The desired result looks like this:
new_data = {'Country': ['A', 'A', 'B'],
            'Capital': ['CC', 'CD', 'CF'],
            'Population': [5, 35, 34],
            'Country_2': ['A', 'A', 'B'],
            'Capital_2': ['CD', 'CE', 'CG'],
            'Population_2': [35, 20, 65]}
df_new = pd.DataFrame(new_data,columns=['Country', 'Capital', 'Population','Country_2','Capital_2','Population_2'])
NOTE: This is a simplification of my data; I have more than 5000 rows and would like to do it automatically.
I tried comparing dictionaries and comparing one row at a time, but I couldn't get it to work.
Thanks for your attention.
>>> df.join(df.groupby('Country').shift(-1), rsuffix='_2')\
... .dropna(how='any')
  Country Capital  Population Capital_2  Population_2
0       A      CC           5        CD          35.0
1       A      CD          35        CE          20.0
3       B      CF          34        CG          65.0
This pairs every row with the next one using join + shift, while groupby restricts the shifting to rows within the same country. Here is what the groupby + shift does on its own:
>>> df.groupby('Country').shift(-1)
  Capital  Population
0      CD        35.0
1      CE        20.0
2     NaN         NaN
3      CG        65.0
4     NaN         NaN
Then once these values are added to the right of your data with the _2 suffix, the rows that have NaNs are dropped with dropna().
Finally, note that Country_2 is not repeated, as it's the same as Country, but it would be very easy to add, as shown below.
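For instance, since each pair always shares a country, you could simply copy the column after the join:
>>> out = df.join(df.groupby('Country').shift(-1), rsuffix='_2').dropna(how='any')
>>> out['Country_2'] = out['Country']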
To get all pairwise combinations within each country (not just consecutive rows), you can try:
from itertools import chain, combinations

import numpy as np
import pandas as pd

df = pd.concat(
    [pd.DataFrame(
        np.array(list(chain(*combinations(k.values, 2)))).reshape(-1, len(df.columns) * 2),
        columns=df.columns.append(df.columns.map(lambda x: x + '_2')))
     for g, k in df.groupby('Country')]
)
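On the sample data above, this yields every within-country pair, along the lines of:
  Country Capital Population Country_2 Capital_2 Population_2
0       A      CC          5         A        CD           35
1       A      CC          5         A        CE           20
2       A      CD         35         A        CE           20
0       B      CF         34         B        CG           65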
I have a large and complex GIS data file on road accidents in 'cities' within 'counties'. Rows represent roads. Columns provide 'City', 'County' and 'Sum of accidents in city'. A city thus contains several roads (with the accident sum repeated on each road's row) and a county contains several cities.
For each 'County', I now want to rank cities according to their number of accidents, so that within each 'County' the city with the most accidents is ranked '1' and cities with fewer accidents get higher rank numbers ('2', '3', ...). This rank value shall be written into the original data file.
My original approach was to:
1. Sort the data by 'County_ID' and 'Accidents' (descending)
2. Then calculate for each row:

if ('County' in row n+1 == 'County' in row n) AND ('Accidents' in row n+1 == 'Accidents' in row n):
    return value n      # maintain the same rank for cities within the 'County'
else if ('County' in row n+1 == 'County' in row n) AND ('Accidents' in row n+1 < 'Accidents' in row n):
    return value n+1    # increase the rank value within the 'County'
else if ('County' in row n+1 < 'County' in row n) AND ('Accidents' in row n+1 < 'Accidents' in row n):
    return value 1      # new 'County', i.e. start ranking from 1
else:
    return 0            # error
However, I could not figure out how to code this properly, and maybe this approach is not appropriate anyway; maybe a loop would do the trick?
Any recommendations?
I suggest using the Python pandas module.
Fictitious Data
Create data with columns county, accidents, and city. For actual data, use pandas read_csv to load it.
import pandas as pd

df = pd.DataFrame([
    ['a', 1, 'A'],
    ['a', 2, 'B'],
    ['a', 5, 'C'],
    ['b', 5, 'D'],
    ['b', 5, 'E'],
    ['b', 6, 'F'],
    ['b', 8, 'G'],
    ['c', 2, 'H'],
    ['c', 2, 'I'],
    ['c', 7, 'J'],
    ['c', 7, 'K']
], columns=['county', 'accidents', 'city'])
Resultant Dataframe
df:
   county  accidents city
0       a          1    A
1       a          2    B
2       a          5    C
3       b          5    D
4       b          5    E
5       b          6    F
6       b          8    G
7       c          2    H
8       c          2    I
9       c          7    J
10      c          7    K
Group data rows by county, and rank rows within group by accidents
Ranking Code
# ascending=False causes cities with the most accidents to be ranked 1
df["rank"] = df.groupby("county")["accidents"].rank(method="dense", ascending=False)
Result
df:
   county  accidents city  rank
0       a          1    A   3.0
1       a          2    B   2.0
2       a          5    C   1.0
3       b          5    D   3.0
4       b          5    E   3.0
5       b          6    F   2.0
6       b          8    G   1.0
7       c          2    H   2.0
8       c          2    I   2.0
9       c          7    J   1.0
10      c          7    K   1.0
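Since rank returns floats, you may want integer ranks (for instance before writing them back to a file); you can cast with:
df["rank"] = df["rank"].astype(int)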
I think the approach by @DarrylG is correct, but it doesn't take into account that the environment is ArcGIS.
Since you tagged your question with Python, I came up with a workflow utilizing pandas. There are other ways to do the same using ArcGIS tools and/or the Field Calculator.
import arcpy  # needed when running this script outside ArcGIS
import pandas as pd

# change this to your actual shapefile; you might have to include a path
filename = "road_accidents"
sFields = ['County', 'City', 'SumOfAccidents']  # consider these to be your columns

# read everything in your file into a pandas DataFrame with a SearchCursor
with arcpy.da.SearchCursor(filename, sFields) as sCursor:
    df = pd.DataFrame(data=[row for row in sCursor], columns=sFields)
df = df.drop_duplicates()  # since each row represents a street, remove duplicates

# we use the code from DarrylG to calculate a rank
# (ascending=False so the city with the most accidents gets rank 1)
df['Rank'] = df.groupby('County')['SumOfAccidents'].rank(method='dense', ascending=False)

# set a MultiIndex, since there might be duplicate city names across counties
df = df.set_index(['County', 'City'])
dct = df.to_dict()  # nested dict: {'Rank': {(county, city): rank, ...}}

# add a field to your shapefile
arcpy.AddField_management(filename, 'Rank', 'SHORT')

# now we can update the shapefile
uFields = ['County', 'City', 'Rank']
with arcpy.da.UpdateCursor(filename, uFields) as uCursor:  # open an UpdateCursor on the file
    for row in uCursor:  # for each row (street)
        # get the county/city combo
        County_City = (row[uFields.index('County')], row[uFields.index('City')])
        if County_City in dct['Rank']:  # see if it is in the dictionary (it should be)
            # give it the value from the dictionary
            row[uFields.index('Rank')] = int(dct['Rank'][County_City])
        else:
            # otherwise flag it
            row[uFields.index('Rank')] = 999
        uCursor.updateRow(row)  # update the row
You can run this code inside the ArcGIS Pro Python console or in a Jupyter notebook. Hope it helps!
I have a table of sites with a land cover class and a state. I have another table with values linked to class and state. In the second table, however, some of the rows are linked only to class:
sites = pd.DataFrame({'id': ['a', 'b', 'c'],
'class': [1, 2, 23],
'state': ['al', 'ar', 'wy']})
values = pd.DataFrame({'class': [1, 1, 2, 2, 23],
'state': ['al', 'ar', 'al', 'ar', None],
'val': [10, 11, 12, 13, 16]})
I'd like to link the tables by class and state, except for those rows in the values table where state is None; those rows should be linked only by class.
A merge has the following result:
combined = sites.merge(values, how='left', on=['class', 'state'])
  id  class state   val
0  a      1    al  10.0
1  b      2    ar  13.0
2  c     23    wy   NaN
But I'd like val in the last row to be 16. Is there an inexpensive way to do this, short of breaking up both tables, performing separate merges, and then concatenating the result?
How about merging them separately and concatenating the results:
pd.concat([sites.merge(values, on=['class', 'state']),
           sites.merge(values[values['state'].isna()].drop('state', axis=1),
                       on=['class'])])
Output:
  id  class state  val
0  a      1    al   10
1  b      2    ar   13
0  c     23    wy   16
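Note that this gives each site a single match only because the class-and-state rows and the class-only rows in values never both apply here. If a site could match both, you would get duplicate rows and might want to keep only the more specific match, which comes first in the concat:
(pd.concat([sites.merge(values, on=['class', 'state']),
            sites.merge(values[values['state'].isna()].drop('state', axis=1),
                        on=['class'])])
   .drop_duplicates(subset='id'))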
We can use combine_first here:
(sites.set_index(['class', 'state'])
      .combine_first(values.set_index(['class', 'state']))
      .dropna()
      .reset_index())
   class state id   val
0      1    al  a  10.0
1      2    ar  b  13.0
2     23    wy  c  16.0
I am trying to build a dataframe that includes the following two outputs side by side as columns:
finalcust = mainorder_df, custname1_df
print(finalcust)
finalcust
Out[46]:
(10      10103.0
 26      10104.0
 39      10105.0
 54      10106.0
 72      10107.0
          ...
 2932    10418.0
 2941    10419.0
 2955    10420.0
 2977    10424.0
 2983    10425.0
 Name: ordernumber, Length: 213, dtype: float64,
 1                Signal Gift Stores
 2        Australian Collectors, Co.
 3                 La Rochelle Gifts
 4                Baane Mini Imports
 5      Mini Gifts Distributors Ltd.
                   ...
 117    Motor Mint Distributors Inc.
 118        Signal Collectibles Ltd.
 119  Double Decker Gift Stores, Ltd
 120            Diecast Collectables
 121               Kelly's Gift Shop
 Name: customerName, Length: 91, dtype: object)
I have tried pd.merge, but it fails because there is no common column.
Does anyone have any idea?
What are you actually trying to accomplish?
General Merging with df.merge()
The data frames cannot be merged because they are not related in any way. Pandas expects them to share a common column so that it knows how to merge; see the pandas.DataFrame.merge docs.
Example: If you wanted to take information from a customer information sheet and add it to an order list.
import pandas as pd
customers = ['A', 'B', 'C', 'D']
addresses = ['Address_A', 'Address_B', 'Address_C', 'Address_D']
df1 = pd.DataFrame({'Customer': customers,
                    'Info': addresses})
df2 = pd.DataFrame({'Customer': ['A', 'B', 'C', 'D', 'A', 'B', 'C', 'D', 'A', 'B'],
                    'Order': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})
df = df1.merge(df2)
df =
  Customer       Info  Order
0        A  Address_A      1
1        A  Address_A      5
2        A  Address_A      9
3        B  Address_B      2
4        B  Address_B      6
5        B  Address_B     10
6        C  Address_C      3
7        C  Address_C      7
8        D  Address_D      4
9        D  Address_D      8
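Note that df1.merge(df2) finds the shared Customer column automatically and performs an inner join by default; you can spell that out explicitly:
df = df1.merge(df2, on='Customer', how='inner')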
Combining with df.concat()
If they were the same size, you would use concat to combine them. There is a post about it here
Example: Adding a new list of customers to the customer df
import pandas as pd
customers = ['A', 'B', 'C', 'D']
addresses = ['Address_A', 'Address_B', 'Address_C', 'Address_D']
new_customers = ['E', 'F', 'G', 'H']
new_addresses = ['Address_E', 'Address_F', 'Address_G', 'Address_H']
df1 = pd.DataFrame({'Customer': customers,
                    'Info': addresses})
df2 = pd.DataFrame({'Customer': new_customers,
                    'Info': new_addresses})
df = pd.concat([df1, df2])
df =
  Customer       Info
0        A  Address_A
1        B  Address_B
2        C  Address_C
3        D  Address_D
0        E  Address_E
1        F  Address_F
2        G  Address_G
3        H  Address_H
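pd.concat keeps each frame's original index, which is why 0 through 3 appear twice above; pass ignore_index=True if you want a fresh 0 through 7 index:
df = pd.concat([df1, df2], ignore_index=True)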
Combining "Side by Side" by Adding a New Column
The side by side method of combination would be adding a column.
Example: Adding a new column to customer information df.
import pandas as pd
customers = ['A', 'B', 'C', 'D']
addresses = ['Address_A', 'Address_B', 'Address_C', 'Address_D']
phones = [1,2,3,4]
df = pd.DataFrame({'Customer': customers,
                   'Info': addresses})
df['Phones'] = phones
df =
  Customer       Info  Phones
0        A  Address_A       1
1        B  Address_B       2
2        C  Address_C       3
3        D  Address_D       4
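If you really do want two unrelated columns of different lengths side by side, one option (using the mainorder_df and custname1_df names from the question) is to concatenate them along axis=1 after resetting their indexes; the shorter column gets padded with NaN. Note, though, that the row pairing produced this way is arbitrary:
finalcust = pd.concat([mainorder_df.reset_index(drop=True),
                       custname1_df.reset_index(drop=True)], axis=1)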
Actually Doing...?
If you are trying to assign a customer name to an order, that can't be done with the data you have here.
Hope this helps.
I have searched around and have my own solution, but I believe there is a better way to achieve the result.
I have a dataframe with the following columns:
from_country to_country score
from_country and to_country columns have the same set of entries, e.g. US, UK, China, and so on. For each combination of from-to, there is a specific score.
I need to calculate the average score for each country, regardless of whether it appears in the from_country or the to_country field.
df_from = df[["from_country", "score"]].copy()
df_from.rename(columns={"from_country":"country"}, inplace=True)
df_to = df[["to_country", "score"]].copy()
df_to.rename(columns={"to_country":"country"}, inplace=True)
df_countries = pd.concat([df_from, df_to])
and then finally calculated the average over the new dataframe.
Is there a way to do it better?
Thanks
You can first stack the columns and then a simple groupby will get you all of the averages.
df.set_index('score').stack().reset_index().groupby(0).score.mean()
Here's a step-by-step example, which also gives the stacked column a name:
import pandas as pd

df = pd.DataFrame({'from_country': ['A', 'B', 'C', 'D', 'E', 'G'],
                   'to_country': ['G', 'C', 'Z', 'X', 'A', 'A'],
                   'score': [1, 2, 3, 4, 5, 6]})
stacked = (df.set_index('score')
             .stack()
             .to_frame('country')
             .reset_index()
             .drop(columns='level_1'))
#   score country
#0      1       A
#1      1       G
#2      2       B
#3      2       C
#4      3       C
#5      3       Z
#...
stacked.groupby('country').score.mean()
Outputs:
country
A    4.0
B    2.0
C    2.5
D    4.0
E    5.0
G    3.5
X    4.0
Z    3.0
Name: score, dtype: float64
Another way with set_index + concat:
pd.concat((
    df.set_index('from_country').score,
    df.set_index('to_country').score
)).groupby(level=0).mean()
A    4.0
B    2.0
C    2.5
D    4.0
E    5.0
G    3.5
X    4.0
Z    3.0
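For completeness, the same averages can also be computed with melt, which avoids the index juggling:
df.melt(id_vars='score', value_name='country').groupby('country').score.mean()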