I have a dataset in a format similar to this:
CITY - YEAR - ATTRIBUTE - VALUE
## example:
dallas-2002-crime-100
dallas-2003-crime-101
dallas-2002-population-4000
houston-2002-population-4100
etc....
I'm trying to reshape this from long to wide format so that each city+year combination is a row and the distinct attributes become the column names.
Thus the new dataframe would look like:
###
city - year - population - crime - median_income - etc.
I've looked at the pivot function, but it doesn't seem to support a multi-index for reshaping. Can someone suggest a workaround? Additionally, I tried
pd.pivot_table, but it seems to work only with numerical data and aggregations like sums, means, etc. Most of my VALUE entries are actually strings, so I don't seem to be able to use it.
### doesn't work - can't use a MultiIndex
df.pivot(index=['city', 'year'], columns='attribute', values='value')
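As an aside: newer pandas releases, 1.1 and later if I recall correctly, accept a list for pivot's index argument, so the line above works as-is there, assuming df holds the long-format data:
# On pandas >= 1.1, pivot accepts list-likes for index/columns.
wide = df.pivot(index=['city', 'year'], columns='attribute', values='value')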
Thank you for your help!
Is this what you are looking for?
import pandas as pd
from io import StringIO
data = """city-year-attribute-value
dallas-2002-crime-100
dallas-2003-crime-101
dallas-2002-population-4000
houston-2002-population-4100"""
df = pd.read_csv(StringIO(data), sep="-")
pivoted = df.pivot_table(
    index=["city", "year"],
    columns=["attribute"],
    values=["value"],
)
print(pivoted.reset_index())
Result:
      city  year  value
attribute               crime population
0   dallas  2002    100.0     4000.0
1   dallas  2003    101.0        NaN
2  houston  2002      NaN     4100.0
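One caveat, since the question notes that most VALUE entries are strings: pivot_table's default mean aggregation only works on numeric data. Two sketches that sidestep aggregation entirely, assuming each city/year/attribute combination appears at most once:
# Option 1: keep pivot_table, but use a non-numeric aggregator.
pivoted = df.pivot_table(index=["city", "year"], columns="attribute",
                         values="value", aggfunc="first")

# Option 2: set_index + unstack never aggregates
# (it raises if a city/year/attribute combination is duplicated).
wide = df.set_index(["city", "year", "attribute"])["value"].unstack()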
Let's say I'm working on this dummy dataset:
import pandas as pd

data = pd.DataFrame({
    "Name_id": ["John", "Deep", "Julia", "John", "Sandy", "Deep"],
    "Month_id": ["December", "March", "May", "April", "May", "July"],
    "Colour_id": ["Red", "Purple", "Green", "Black", "Yellow", "Orange"],
})
data
How can I convert this data frame into one where Name_id is unique and the other columns form new columns based on both their values and their existence or non-existence, in order of appearance? I have tried to use pivot, but I noticed it seems geared toward numerical data rather than categorical.
You can use pivot after numbering each name's rows:
data['Rowid'] = data.groupby('Name_id').cumcount() + 1
d = data.pivot(index='Name_id', columns='Rowid', values=['Month_id', 'Colour_id'])
d.reset_index(inplace=True)
d.columns = ['Name_id', 'Month_id1', 'Month_id2', 'Colour_id1', 'Colour_id2']
which gives
  Name_id Month_id1 Month_id2 Colour_id1 Colour_id2
0    Deep     March      July     Purple     Orange
1    John  December     April        Red      Black
2   Julia       May       NaN      Green        NaN
3   Sandy       May       NaN     Yellow        NaN
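If the number of repeats per name is not fixed, hard-coding the column list breaks. A sketch that flattens the MultiIndex columns programmatically, used in place of the manual assignment above:
# ('Month_id', 1) -> 'Month_id1'; the reset 'Name_id' column has an empty second level.
d.columns = [f"{name}{rowid}" if rowid != "" else name
             for name, rowid in d.columns]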
I have a CSV that loads as one single column, and I want to get a multi-column DataFrame in Python to work on it.
My test.csv is the following (in one single column):
ID;location;level;temperature;season
001;leeds;63;11;autumn
002;sydney;36;11;spring
003;lyon;250;11;autumn
004;edmonton;645;8;autumn
I want to get the information in a Data Frame like this:
ID Location Level Temperature Season
001 Leeds 63 11 Autumn
002 Sydney 36 11 Spring
003 Lyon 250 11 Autumn
004 Edmonton 645 8 Autumn
I've tried:
df = pd.read_csv(r'test.csv', header=0, index_col=0)
df= df.str.split(';', expand=True)
But I got this error: 'DataFrame' object has no attribute 'str'. This was an attempt to use it as with a Series. Is there a similar way to do it with DataFrames?
I would like to know if there is a pythonic way to do it, or whether I should iterate over the rows.
Is str.split deprecated? I found it for Series, but it seemed to be deprecated.
Any guidance is much appreciated.
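For what it's worth, the usual fix is to let read_csv split the file itself rather than splitting afterwards. A minimal sketch, assuming the file really is semicolon-delimited:
import pandas as pd

# Pass the delimiter to read_csv; dtype=str keeps the leading zeros in ID ("001").
df = pd.read_csv('test.csv', sep=';', dtype={'ID': str})
print(df.head())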
I'm having some difficulty converting the values from object to float.
I saw some examples, but I couldn't get them to work.
I would like a for loop that converts the values in all columns.
I don't have a script yet because I've seen different ways to do it:
Terawatt-hours Total Asia Pacific Total CIS Total Europe
2000 0.428429 0 0.134473
2001 0.608465 0 0.170166
2002 0.829254 0 0.276783
2003 1.11654 0 0.468726
2004 1.46406 0 0.751126
2005 1.85281 0 1.48641
2006 2.29128 0 2.52412
2007 2.74858 0 3.81573
2008 3.3306 0 7.5011
2009 4.3835 7.375e-06 14.1928
2010 6.73875 0.000240125 23.2634
2011 12.1544 0.00182275 46.7135
I tried this:
df = pd.read_excel(r'bp-stats-review-2019-all-data.xls')
columns = list(df.head(0))
for i in range(len(columns)):
    df[columns[i]].astype(float)
Your question is not clear as to which column you are trying to convert, so I am sharing an example for the first column in your screenshot. Note the bracket syntax: attribute access (df.Terawatt-hours) fails on a name containing a hyphen.
df['Terawatt-hours'] = df['Terawatt-hours'].astype(float)
The same works for any other column.
EDIT
To loop over the dataframe and change the type of all the columns, you can do the following.
Generate a dummy dataframe:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=(20, 4)), columns=list('abcd'))
Check the type of each column in the dataframe:
for column in df.columns:
    print(df[column].dtype)
Change the type of all the columns to float:
for column in df.columns:
    df[column] = df[column].astype(float)
Your question is not clear: which columns are you trying to convert to float? Also, post what you have done.
EDIT:
What you tried is right up until the last line of your code, where you failed to reassign the column:
df[columns[i]] = df[columns[i]].astype(float)
Also try using df.columns to get the column names instead of list(df.head(0)).
See the pandas docs on astype for how to cast a pandas object to a specified dtype.
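If the goal is simply to make every column numeric, the loop can also be collapsed into a single call. A sketch, where errors='coerce' is an assumption about the desired behaviour (unparseable cells become NaN):
import pandas as pd

# Convert all columns at once instead of looping.
df = df.apply(pd.to_numeric, errors='coerce')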
I am new to Python and am trying to play around with pandas pivot tables. I have searched and searched, but none of the answers have been what I am looking for. Basically, I am trying to sort the pivot table below.
import numpy as np
import pandas as pd
df = pd.DataFrame({
    "TIME": ["FQ1", "FQ2", "FQ2", "FQ2"],
    "NAME": ["Robert", "Miranda", "Robert", "Robert"],
    "TOTAL": [900, 42, 360, 2000],
    "TYPE": ["Air", "Ground", "Air", "Ground"],
    "GROUP": ["A", "A", "A", "A"],
})
pt = pd.pivot_table(data=df,
                    values=["TOTAL"], aggfunc=np.sum,
                    index=["GROUP", "TYPE", "NAME"],
                    columns="TIME",
                    fill_value=0,
                    margins=True)
Basically, I am hoping to sort the TYPE and NAME levels based on the sum of each row.
The end goal in this case is for the Ground type to appear before Air, and within Ground, for Robert to appear before Miranda, since his sum is higher.
Here is how it appears now:
                     TOTAL
TIME                   FQ1   FQ2   All
GROUP TYPE   NAME
A     Air    Robert    900   360  1260
      Ground Miranda     0    42    42
             Robert      0  2000  2000
All                    900  2402  3302
Thanks to anyone who is able to help!!
Try this. Because your column header is a MultiIndex, you need to use a tuple to access the column:
pt.sort_values(['GROUP', 'TYPE', ('TOTAL', 'All')],
               ascending=[True, True, False])
Output:
                     TOTAL
TIME                   FQ1   FQ2   All
GROUP TYPE   NAME
A     Air    Robert    900   360  1260
      Ground Robert      0  2000  2000
             Miranda     0    42    42
All                    900  2402  3302
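Note that this still leaves TYPE in alphabetical order (Air before Ground). If the TYPE groups themselves should be ordered by their summed totals, as the question asks, one sketch against the same pt is:
import numpy as np

# Drop the margins row so it stays out of the sort.
body = pt.drop(index=('All', '', ''))

# Sum each TYPE's 'All' column, then broadcast it back as a per-row sort key.
type_totals = body[('TOTAL', 'All')].groupby(level='TYPE').sum()
group_key = body.index.get_level_values('TYPE').map(type_totals)

# Primary key: TYPE total (descending); tie-break: each name's own total (descending).
body = body.iloc[np.lexsort((-body[('TOTAL', 'All')].to_numpy(),
                             -group_key.to_numpy()))]
print(body)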
I am trying to do some file merging with Latitude and Longitude.
Input File1.csv
Name,Lat,Lon,timeseries(n)
London,80.5234,121.0452,523
London,80.5234,121.0452,732
London,80.5234,121.0452,848
Paris,90.4414,130.0252,464
Paris,90.4414,130.0252,829
Paris,90.4414,130.0252,98
New York,110.5324,90.0023,572
New York,110.5324,90.0023,689
New York,110.5324,90.0023,794
File2.csv
Name,lat,lon,timeseries1
London,80.5234,121.0452,500
Paris,90.4414,130.0252,400
New York,110.5324,90.0023,700
Now the expected output is:
File2.csv
Name,lat,lon,timeseries1,timeseries(n) #timeseries is 24 hrs format 17:45:00
London,80.5234,121.0452,500,2103 #Addition of all three values
Paris,90.4414,130.0252,400,1391
New York,110.5324,90.0023,700,2055
With plain Python, NumPy, and dictionaries this would be straightforward (key = sum of values), but I want to use pandas.
Please suggest how to start, or point me to an example. I have not seen anything like dictionary types in pandas for latitude and longitude.
Perform a groupby aggregation on the first df, call sum and then merge this with the other df:
In [12]:
gp = df.groupby('Name')['timeseries(n)'].sum().reset_index()
df1.merge(gp, on='Name')
Out[14]:
Name Lat Lon timeseries1 timeseries(n)
0 London 80.5234 121.0452 500 2103
1 Paris 90.4414 130.0252 400 1391
2 New York 110.5324 90.0023 700 2055
the aggregation looks like this:
In [15]:
gp
Out[15]:
Name timeseries(n)
0 London 2103
1 New York 2055
2 Paris 1391
Your csv files can be loaded using read_csv, so something like:
df = pd.read_csv('File1.csv')
df1 = pd.read_csv('File2.csv')
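End to end, under the assumption that the two files are exactly as shown (the output filename is illustrative):
import pandas as pd

df = pd.read_csv('File1.csv')    # repeated timeseries readings per city
df1 = pd.read_csv('File2.csv')   # one row per city

# Sum the timeseries(n) readings per city, then attach them to File2's rows.
gp = df.groupby('Name', as_index=False)['timeseries(n)'].sum()
out = df1.merge(gp, on='Name')
out.to_csv('File2_merged.csv', index=False)  # hypothetical output path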