Reshaping pandas dataframe with unstack() - python

I am trying to reshape a pandas DataFrame so that one of its columns is unstacked into a wider format. After calling unstack(), new column levels appear, but I seem unable to rearrange the headers the way I want.
Firstly, I have following df:
from pandas import *
fList = [['Packs', 'Brablik', 'Holesovice', '2017', 100],
['Decorations', 'Drapp-design', 'Holesovice', '2017', 150],
['Decorations', 'Klapetkovi', 'Holesovice', '2017', 200],
['Decorations', 'Lezecké dárky', 'Fler', '2017', 100],
['Decorations', 'PP', 'Other', '2017', 350],
['Decorations', 'Pavlimila', 'Akce', '2017', 20],
['Decorations', 'Pavlimila', 'Holesovice', '2017', 50],
['Decorations', 'Wiccare', 'Holesovice', '2017', 70],
['Toys', 'Klára Vágnerová', 'Holesovice', '2017', 100],
['Toys', 'Lucie Polonyiová', 'Holesovice', '2017', 80],
['Dresses', 'PP', 'Other', '2018', 200]]
df = DataFrame(fList, columns = ['Section', 'Seller', 'Store', 'Selected_period', 'Total_pieces'])
This produces:
Consequently I reshape it like:
df = df.set_index(['Section', 'Seller', 'Store', 'Selected_period']).unstack(level = -1)
df = df.fillna(0)
df.columns = df.columns.droplevel(0)
That outputs:
However, I would like the final dataframe to have just the following columns: Section, Seller, Store, 2017, 2018. I still fail to rearrange it to get the output I want, even though I tried to adapt the solutions posted here and here and here. Any suggestions?

If I understand correctly, you seem to just be missing a reset_index() call. Try this:
df = df.set_index(['Section', 'Seller', 'Store', 'Selected_period']).unstack(level = -1).fillna(0)
df.columns = df.columns.droplevel(0).rename('')
df = df.reset_index()
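Putting the answer together on the sample data above, the whole reshape can be sketched end to end (a minimal, self-contained version using a two-row subset of the data):

```python
import pandas as pd

fList = [['Packs', 'Brablik', 'Holesovice', '2017', 100],
         ['Dresses', 'PP', 'Other', '2018', 200]]
df = pd.DataFrame(fList, columns=['Section', 'Seller', 'Store',
                                  'Selected_period', 'Total_pieces'])

# Unstack the period into columns, then flatten the column index back down.
df = (df.set_index(['Section', 'Seller', 'Store', 'Selected_period'])
        .unstack(level=-1)
        .fillna(0))
df.columns = df.columns.droplevel(0).rename('')  # drop the 'Total_pieces' level
df = df.reset_index()
print(list(df.columns))  # ['Section', 'Seller', 'Store', '2017', '2018']
```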

Related

Python: How to validate and append non-existing row in a dataset/dataframe?

How can we append a row that doesn't already exist to a dataset? I have a sample table with a list of names, and the objective is to first check whether the name exists and, if it doesn't, append it to the dataset.
Please see code below for reference:
import pandas as pd
df = pd.DataFrame.from_dict({
'Name': ['Nik', 'Kate', 'Evan', 'Kyra'],
'Age': [31, 30, 40, 33],
'Location': ['Toronto', 'London', 'Kingston', 'Hamilton']
})
df = df.append({'Name':'Jane', 'Age':25, 'Location':'Madrid'}, ignore_index=True)
print(df)
You can check the condition before inserting into the dataframe:
import pandas as pd
df = pd.DataFrame.from_dict({
'Name': ['Nik', 'Kate', 'Evan', 'Kyra'],
'Age': [31, 30, 40, 33],
'Location': ['Toronto', 'London', 'Kingston', 'Hamilton']
})
if 'Jane' not in df.Name.values:
    df = df.append({'Name':'Jane', 'Age':25, 'Location':'Madrid'}, ignore_index=True)
print(df)

How do I rearrange nested Pandas DataFrame columns?

In the DataFrame below, I want to rearrange the nested columns - i.e. to have 'region_sea' appearing before 'region_inland'
df = pd.DataFrame( {'state': ['WA', 'CA', 'NY', 'NY', 'CA', 'CA', 'WA' ]
, 'region': ['region_sea', 'region_inland', 'region_sea', 'region_inland', 'region_sea', 'region_sea', 'region_inland',]
, 'count': [1, 3, 4, 6, 7, 8, 4]
, 'income': [100, 200, 300, 400, 600, 400, 300]
}
)
df = df.pivot_table(index='state', columns='region', values=['count', 'income'], aggfunc={'count': 'sum', 'income': 'mean'})
df
I tried the code below but it's not working. Any idea how to do this? Thanks
df[['count']]['region_sea', 'region_inland']
You can use sort_index to sort it. However, because the columns are nested, sorting descending on level 0 will also swap income and count.
df.sort_index(axis='columns', level=0, ascending=False, inplace=True)
If you don't want income/count swapped, sort on the 'region' level instead; the sort then groups by region first, so count and income no longer share a common header block.
df.sort_index(axis='columns', level='region', ascending=False, inplace=True)
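If you want an explicit ordering rather than a reverse alphabetical sort, reindex can impose it on just the 'region' level while leaving the count/income grouping untouched (a sketch built on the pivot table from the question):

```python
import pandas as pd

df = pd.DataFrame({'state': ['WA', 'CA', 'NY', 'NY', 'CA', 'CA', 'WA'],
                   'region': ['region_sea', 'region_inland', 'region_sea',
                              'region_inland', 'region_sea', 'region_sea',
                              'region_inland'],
                   'count': [1, 3, 4, 6, 7, 8, 4],
                   'income': [100, 200, 300, 400, 600, 400, 300]})
pt = df.pivot_table(index='state', columns='region',
                    values=['count', 'income'],
                    aggfunc={'count': 'sum', 'income': 'mean'})

# Reorder only the inner 'region' level; 'count'/'income' stay where they are.
pt = pt.reindex(columns=['region_sea', 'region_inland'], level='region')
print(pt.columns.tolist())
```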

How to combine two (or more) DataFrames with different lengths and indexes, giving both an appropriate index in a single DataFrame

I have two dataframes (DF and DF2). Could anybody help me understand how I can combine these two dataframes to look like the third one (DF3)? I presented a simple example, but I need this to compile dataframes that include different samples (or observations). Occasionally, samples encompass the same group of variables, but in most cases the samples contain different variables. Each column corresponds to one sample.
Any help is welcome!
DF -
import pandas as pd
raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'],
'age': [42, 52, 36, 24, 73],
'preTestScore': [4, 24, 31, 2, 3],
'postTestScore': [25, 94, 57, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'preTestScore', 'postTestScore'])
print(df)
DF2 -
raw_data2 = {'first_name': ['Molly', 'Jake'],
'civil_status': ['Single', 'Single']}
df2 = pd.DataFrame(raw_data2, columns = ['first_name', 'civil_status'])
print(df2)
DF3 -
raw_data3 = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'],
'age': [42, 52, 36, 24, 73],
'preTestScore': [4, 24, 31, 2, 3],
'postTestScore': [25, 94, 57, 62, 70],
'civil_status': [None, 'Single', None, 'Single', None]}
df3 = pd.DataFrame(raw_data3, columns = ['first_name', 'last_name', 'age', 'preTestScore', 'postTestScore',
'civil_status'])
print(df3)
Use join:
df.set_index("first_name").join(df2.set_index("first_name"))
I applied the solution above in the real context, by using the code below:
arquivo1 = pd.read_excel(f1, header=7, index_col=False)
arquivo2 = pd.read_excel(f2, header=7, index_col=False)
joined = arquivo1.set_index("Element").join(arquivo2.set_index("Element"))
It produced:
ValueError: columns overlap but no suffix specified: Index(['AN', 'series', ' Net', ' [wt.%]', ' [norm. wt.%]', '[norm. at.%]', 'Error in %'], dtype='object')
The pictures below represent "arquivo1" and "arquivo2"
arquivo1
arquivo2
When I include a suffix for the right and left sides, it actually joins both dataframes.
joined = arquivo1.set_index("Element").join(arquivo2.set_index("Element"), lsuffix='Element', rsuffix='Element')
But when a dataframe containing more variables (rows) is joined to the first, it simply drops the new variables. Does anybody know how to fix it?
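The likely cause is that join defaults to a left join, so rows that exist only in the second file are dropped. A hedged sketch (with made-up element data, since the original Excel files aren't available): passing how='outer' keeps rows present in either file, and distinct suffixes disambiguate the overlapping columns without the "columns overlap" ValueError.

```python
import pandas as pd

# Hypothetical stand-ins for arquivo1 / arquivo2 (the real spreadsheets
# are not available here); 'Cr' exists only in the second frame.
arquivo1 = pd.DataFrame({'Element': ['Fe', 'Ni'], 'Net': [10, 20]})
arquivo2 = pd.DataFrame({'Element': ['Fe', 'Ni', 'Cr'], 'Net': [11, 21, 31]})

# how='outer' keeps elements present in only one file;
# distinct suffixes keep both 'Net' columns apart.
joined = arquivo1.set_index('Element').join(
    arquivo2.set_index('Element'), how='outer', lsuffix='_1', rsuffix='_2')
print(joined)
```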

Adding values together from different rows where a value in a column matches

I currently have 2 datasets.
The first contains a list of football teams with values I have worked out.
I have a second dataset has a list of teams that are playing today
What I would like to do is add to dataset2, for each statistic, the mean of the two teams that are playing each other, so the outcome would be
I have looked through Stack Overflow and not found anything that has been able to help. I am fairly new to working with Pandas, so I am not sure whether this is possible.
As an example data set:
data1 = {
'DATAMECI': ['17/06/2020', '17/06/2020'],
'ORAMECI': ['11:30', '15:30'],
'TXTECHIPA1': ['Everton', 'Man City'],
'TXTECHIPA2': ['Hull', 'Leeds'],
}
data2 = {
'Team': ['Hull', 'Leeds','Everton', 'Man City'],
'Home0-0': ['80', '78','80', '66'],
'Home1-0': ['81', '100','90', '70'],
'Away0-1': ['88', '42','75', '69'],
}
with the desired output being
Desired = {
'DATAMECI': ['17/06/2020', '17/06/2020'],
'ORAMECI': ['11:30', '15:30'],
'TXTECHIPA1': ['Everton', 'Man City'],
'TXTECHIPA2': ['Hull', 'Leeds'],
'Home0-0': ['80', '72'],
'Home1-0': ['86', '85'],
'Away0-1': ['86', '56',],
}
Another option, as opposed to iterating over rows, is to merge the datasets and then iterate over the columns.
I also noticed your desired output is rounded, so that is handled here as well.
Sample:
import pandas as pd

data1 = pd.DataFrame({
'DATAMECI': ['17/06/2020', '17/06/2020'],
'ORAMECI': ['11:30', '15:30'],
'TXTECHIPA1': ['Everton', 'Man City'],
'TXTECHIPA2': ['Hull', 'Leeds'],
})
data2 = pd.DataFrame({
'Team': ['Hull', 'Leeds','Everton', 'Man City'],
'Home0-0': ['80', '78','80', '66'],
'Home1-0': ['81', '100','90', '70'],
'Away0-1': ['88', '42','75', '69'],
})
Desired = pd.DataFrame({
'DATAMECI': ['17/06/2020', '17/06/2020'],
'ORAMECI': ['11:30', '15:30'],
'TXTECHIPA1': ['Everton', 'Man City'],
'TXTECHIPA2': ['Hull', 'Leeds'],
'Home0-0': ['80', '72'],
'Home1-0': ['86', '85'],
'Away0-1': ['86', '56',],
})
Code:
import pandas as pd
cols = [ x for x in data2 if 'Home' in x or 'Away' in x ]
data1 = data1.merge(data2.rename(columns={'Team':'TXTECHIPA1'}), how='left', on=['TXTECHIPA1'])
data1 = data1.merge(data2.rename(columns={'Team':'TXTECHIPA2'}), how='left', on=['TXTECHIPA2'])
for col in cols:
    data1[col] = data1[[col + '_x', col + '_y']].astype(int).mean(axis=1).round(0)
    data1 = data1.drop([col + '_x', col + '_y'], axis=1)
Output:
print(data1)
DATAMECI ORAMECI TXTECHIPA1 TXTECHIPA2 Home0-0 Home1-0 Away0-1
0 17/06/2020 11:30 Everton Hull 80.0 86.0 82.0
1 17/06/2020 15:30 Man City Leeds 72.0 85.0 56.0
Thanks for adding the data. Here is a simple way using loops: loop through df2 (upcoming matches), find the corresponding rows of the two participating teams in df1 (team statistics), average the desired columns of those two rows, and add the result to df2.
Considering similar structure as your data set, here is an example:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'team': ['one', 'two', 'three', 'four', 'five'],
'home0-0': [86, 78, 65, 67, 100],
'home1-0': [76, 86, 67, 100, 0],
'home0-1': [91, 88, 75, 100, 67],
'home1-1': [75, 67, 67, 100, 100],
'away0-0': [57, 86, 71, 91, 50],
'away1-0': [73, 50, 71, 100, 100],
'away0-1': [78, 62, 40, 80, 0],
'away1-1': [50, 71, 33, 100, 0]})
df2 = pd.DataFrame({'date': ['2020-06-17', '2020-06-17', '2020-06-17', '2020-06-17', '2020-06-17', '2020-06-17', '2020-06-17'],
'time': [1800, 1200, 1100, 2005, 1000, 1800, 1800],
'team1': ['one', 'two', 'three', 'four', 'five', 'one', 'three'],
'team2': ['five', 'four', 'two', 'one', 'three', 'two', 'four']})
for i, row in df2.iterrows():
    team1 = df1[df1['team'] == row['team1']]
    team2 = df1[df1['team'] == row['team2']]
    for col in df1.columns[1:]:
        df2.loc[i, col] = np.mean([team1[col].values[0], team2[col].values[0]])
print(df2)
For your sample data set:
for i, row in data1.iterrows():
    team1 = data2[data2['Team'] == row['TXTECHIPA1']]
    team2 = data2[data2['Team'] == row['TXTECHIPA2']]
    for col in data2.columns[1:]:
        data1.loc[i, col] = np.mean([int(team1[col].values[0]), int(team2[col].values[0])])
print(data1)
Result:
DATAMECI ORAMECI TXTECHIPA1 TXTECHIPA2 Home0-0 Home1-0 Away0-1
0 17/06/2020 11:30 Everton Hull 80.0 85.5 81.5
1 17/06/2020 15:30 Man City Leeds 72.0 85.0 55.5

Using unique on a DataFrame column where each row is a list, not a "1d array-like"

I have a dataframe that looks like:
import pandas as pd
data = {
'Other':['A1', 'A2', 'A3', 'A4', 'A5'],
'category':[['Transfer'], ['Unknown'], ['Transfer','Facebook'], ['Facebook', 'Google', 'Other'], ['C3']]
}
df = pd.DataFrame(data)
I am trying to get a list of the unique category values; unfortunately, using
categories = df['category'].unique()
doesn't work, because each cell holds a list rather than a scalar. I am not sure what the approach should be to end up with the outcome
['Transfer', 'Unknown', 'Facebook', 'Google', 'Other', 'C3']
Let us try explode
df.category.explode().unique()
array(['Transfer', 'Unknown', 'Facebook', 'Google', 'Other', 'C3'],
dtype=object)
If you need the unique list, you may use pd.unique on the flattened df.category via np.concatenate:
import numpy as np
l = pd.unique(np.concatenate(df.category))
Out[100]:
array(['Transfer', 'Unknown', 'Facebook', 'Google', 'Other', 'C3'],
dtype=object)
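If you'd rather stay in plain Python, flattening the lists with itertools.chain and deduplicating with dict.fromkeys (which preserves first-seen order) gives the same result:

```python
from itertools import chain

import pandas as pd

df = pd.DataFrame({
    'Other': ['A1', 'A2', 'A3', 'A4', 'A5'],
    'category': [['Transfer'], ['Unknown'], ['Transfer', 'Facebook'],
                 ['Facebook', 'Google', 'Other'], ['C3']]
})

# dict.fromkeys keeps insertion order, so duplicates are dropped
# while the first-seen order is preserved.
categories = list(dict.fromkeys(chain.from_iterable(df['category'])))
print(categories)  # ['Transfer', 'Unknown', 'Facebook', 'Google', 'Other', 'C3']
```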
