How do I turn the headers inside the rows into columns?
For example, I have the DataFrame shown below (reproduction code in the EDIT) and would like to reshape it into the output shown under "looking to get".
EDIT:
Code to reproduce the current df example:
import pandas as pd
df = pd.DataFrame({'Date': [2020, 2021, 2022],
                   'James': '', ' Sales': [3, 4, 5], ' City': 'NY', ' DIV': 'a',
                   'KIM': '', ' Sales ': [3, 4, 5], ' City ': 'SF', ' DIV ': 'b'}).T.reset_index()
index 0 1 2
0 Date 2020 2021 2022
1 James
2 Sales 3 4 5
3 City NY NY NY
4 DIV a a a
5 KIM
6 Sales 3 4 5
7 City SF SF SF
8 DIV b b b
looking to get
Name City DIV Account 2020 2021 2022
James NY a Sales 3 4 5
KIM SF b Sales 3 4 5
I think the best way is to iterate over the first column: if a label (e.g. James) has no indent, it becomes a value in a new column, applied to the rows beneath it until the next unindented value (KIM) is hit. So I need a way to assign each unindented header to a new column, stopping when the next unindented header (KIM) comes up.
EDIT 2: There are not only two names (KIM or JAMES); there are about 20. Nor is it always just the three second levels (Sales, City, DIV): some names have more than 3 second levels, some have 7. The only thing that is consistent is that the names are not indented but the second levels are.
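A rough sketch of that indent-based idea (illustrative only, not a tested answer; it rebuilds the example df from the EDIT, and the helper names is_name, attrs, sales and out are made up):
import pandas as pd

# Rebuild the example df from the EDIT (the leading/trailing spaces keep the
# duplicate Sales/City/DIV keys distinct).
df = pd.DataFrame({'Date': [2020, 2021, 2022],
                   'James': '', ' Sales': [3, 4, 5], ' City': 'NY', ' DIV': 'a',
                   'KIM': '', ' Sales ': [3, 4, 5], ' City ': 'SF', ' DIV ': 'b'}).T.reset_index()

df.columns = ['label'] + list(df.iloc[0, 1:])    # use the Date row as year headers
df = df.iloc[1:].copy()                          # drop the Date row
is_name = ~df['label'].str.startswith(' ')       # unindented rows are the names
df['Name'] = df['label'].where(is_name).ffill()  # carry each name down its block
df = df[~is_name].copy()                         # drop the name rows themselves
df['label'] = df['label'].str.strip()            # ' Sales ' -> 'Sales'

# City/DIV become per-name attributes; the Sales rows keep the yearly values.
attrs = df[df['label'].isin(['City', 'DIV'])].pivot(index='Name', columns='label', values=2020)
sales = df[df['label'] == 'Sales'].rename(columns={'label': 'Account'}).set_index('Name')
out = attrs.join(sales).reset_index()
This should generalize to ~20 names and a varying number of second levels, as long as the "names are unindented, second levels are indented" rule holds.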
Using a slightly simpler example, this works, but it sure ain't pretty:
df = pd.DataFrame({
    'date': ['James', 'Sales', 'City', 'Kim', 'Sales', 'City'],
    '2020': ['', '3', 'NY', '', '4', 'SF'],
    '2021': ['', '4', 'NY', '', '5', 'SF'],
})

def rows_to_columns(group):
    for value in group.date.values:
        if value != group.person.values[0] and value != 'Sales':
            temp_column = '_' + value
            group.loc[group['date'] == value, temp_column] = group['2020']
            group[value.lower()] = (
                group[temp_column]
                .fillna(method='ffill')
                .fillna(method='bfill')
            )
            group.drop([temp_column], axis=1, inplace=True)
    return group

df.loc[df['2020'] == '', 'person'] = df.date
df.person = df.person.fillna(method='ffill')

new_df = (df
          .groupby('person')
          .apply(lambda x: rows_to_columns(x))
          .drop(['date'], axis=1)
          .loc[df.date == 'Sales']
          )
The basic idea is to:
1. Copy the name into a separate column and fill that column using .fillna(method='ffill'). This works if the assumption holds that every person's block begins with the person's name; otherwise it wreaks havoc.
2. Convert all other values, such as 'div' and 'city', with rows_to_columns(group). The function iterates over all rows in a group that are neither the person's name nor 'Sales', copies the value from the row into a temp column, creates a new column for that row, and uses ffill and bfill to fill it out. It then deletes the temp column and returns the group.
The resulting data frame is the intended format once the column 'Sales' is dropped.
Note: This solution probably does not work well on larger datasets.
You gave more details, and I see you are not working with multi-level indexes. In that case, the best approach would be to create the DataFrame in the format you need from the start. The way you are creating the first DataFrame is not well structured: the names (James/KIM) are columns with empty values and have no link to the other values, and the stacking relies on blank spaces in strings. Take a look at multi-indexing and generate a DataFrame you can work with, or create the DataFrame in the final format you need.
-- Answer considering multi-level indexes --
Based on the limited information provided, I see your DataFrame is stacked, meaning it has multiple index levels. The first level is the person (James/KIM) and the second level is Sales/City/DIV. Your DataFrame should therefore be created like this:
import pandas
multi_index = pandas.MultiIndex.from_tuples([
('James', 'Sales'), ('James', 'City'), ('James', 'DIV'),
('KIM', 'Sales'), ('KIM', 'City'), ('KIM', 'DIV')])
year_2020 = pandas.Series([3, 'NY', 'a', 4, 'SF', 'b'], index=multi_index)
year_2021 = pandas.Series([4, 'NY', 'a', 5, 'SF', 'b'], index=multi_index)
year_2022 = pandas.Series([5, 'NY', 'a', 6, 'SF', 'b'], index=multi_index)
frame = { '2020': year_2020, '2021': year_2021, '2022': year_2022}
df = pandas.DataFrame(frame)
print(df)
2020 2021 2022
James Sales 3 4 5
City NY NY NY
DIV a a a
KIM Sales 4 5 6
City SF SF SF
DIV b b b
Now that you have the multi-level DataFrame, you have many ways to transform it. This is what we will do to flatten it to one level:
sales_df = df.xs('Sales', axis=0, level=1).copy()
div_df = df.xs('DIV', axis=0, level=1).copy()
city_df = df.xs('City', axis=0, level=1).copy()
The results will be:
print(sales_df)
2020 2021 2022
James 3 4 5
KIM 4 5 6
print(div_df)
2020 2021 2022
James a a a
KIM b b b
print(city_df)
2020 2021 2022
James NY NY NY
KIM SF SF SF
You are discarding any information about DIV or City changing across years, so we can reduce the City and DIV dataframes to Series, taking the first year as reference:
div_series = div_df.iloc[:,0]
city_series = city_df.iloc[:,0]
Take the sales DF as reference, and add the City and DIV series:
sales_df['DIV'] = div_series
sales_df['City'] = city_series
sales_df['Account'] = 'Sales'
Now reorder the columns as you wish:
sales_df = sales_df[['City', 'DIV', 'Account', '2020', '2021', '2022']]
print(sales_df)
City DIV Account 2020 2021 2022
James NY a Sales 3 4 5
KIM SF b Sales 4 5 6
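As an aside, the three xs calls above can also be collapsed into a single unstack (a sketch, not part of the answer above; the resulting (year, level) column pairs would still need tidying):
# pivot the second index level (Sales/City/DIV) into columns;
# 'flat' then has a ('2020'..'2022', Sales/City/DIV) column MultiIndex
flat = df.unstack(level=1)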
I'd like to compare the differences between two data frames. xyz has all of the same columns as abc, but it has an additional column.
In the comparison, I'd like to match up the two like columns (Sport) but only show the SportLeague in the output (if a difference exists, that is). For example, instead of showing 'Soccer' as a difference, show 'Soccer:MLS', which is the adjacent column in xyz.
The two data frames are built as follows:
import pandas as pd
import numpy as np
abc = {'Sport' : ['Football', 'Basketball', 'Baseball', 'Hockey'], 'Year' : ['2021','2021','2022','2022'], 'ID' : ['1','2','3','4']}
abc = pd.DataFrame({k: pd.Series(v) for k, v in abc.items()})
abc
xyz = {'Sport' : ['Football', 'Football', 'Basketball', 'Baseball', 'Hockey', 'Soccer'], 'SportLeague' : ['Football:NFL', 'Football:XFL', 'Basketball:NBA', 'Baseball:MLB', 'Hockey:NHL', 'Soccer:MLS'], 'Year' : ['2022','2019', '2022','2022','2022', '2022'], 'ID' : ['2','0', '3','2','4', '1']}
xyz = pd.DataFrame({k: pd.Series(v) for k, v in xyz.items()})
xyz = xyz.sort_values(by = ['ID'], ascending = True)
xyz
Code already tried:
abc.compare(xyz, align_axis=1, keep_shape=False, keep_equal=False)
This raises an error, since the data frames don't have the exact same columns.
Example: if xyz['Sport'] does not show up anywhere within abc['Sport'], then show xyz['SportLeague'] as the difference between the data frames.
Further clarification of the logic:
Does abc['Sport'] appear anywhere in xyz['Sport']? If not, indicate "Not Found in xyz data frame". If it does exist, are its corresponding abc['Year'] and abc['ID'] values the same? If not, show "Change from xyz['Year'] and xyz['ID'] to abc['Year'] and abc['ID']".
Does xyz['Sport'] appear anywhere in abc['Sport']? If not, indicate "Remove xyz['SportLeague']".
What I've explained above is similar to the .compare method. However, the data frames in this example may not be the same length and may have different numbers of columns.
If I understand you correctly, we basically want to merge both DataFrames, apply a number of comparisons between them, and add a column that states the course of action to take for each comparison result.
Note: in the example here I have added one sport ('Cricket') to your df abc, to trigger the condition where abc['Sport'] does not exist in xyz['Sport'].
abc = {'Sport' : ['Football', 'Basketball', 'Baseball', 'Hockey','Cricket'], 'Year' : ['2021','2021','2022','2022','2022'], 'ID' : ['1','2','3','4','5']}
abc = pd.DataFrame({k: pd.Series(v) for k, v in abc.items()})
print(abc)
Sport Year ID
0 Football 2021 1
1 Basketball 2021 2
2 Baseball 2022 3
3 Hockey 2022 4
4 Cricket 2022 5
I've left xyz unaltered. Now, let's merge these two dfs:
df = xyz.merge(abc, on='Sport', how='outer', suffixes=('_xyz','_abc'))
print(df)
Sport SportLeague Year_xyz ID_xyz Year_abc ID_abc
0 Football Football:XFL 2019 0 2021 1
1 Football Football:NFL 2022 2 2021 1
2 Soccer Soccer:MLS 2022 1 NaN NaN
3 Baseball Baseball:MLB 2022 2 2022 3
4 Basketball Basketball:NBA 2022 3 2021 2
5 Hockey Hockey:NHL 2022 4 2022 4
6 Cricket NaN NaN NaN 2022 5
Now, we have a df where we can evaluate your set of conditions using np.select(conditions, choices, default). Like this:
conditions = [
    df.Year_abc.isnull(),
    df.Year_xyz.isnull(),
    (df.Year_xyz != df.Year_abc) & (df.ID_xyz != df.ID_abc),
    df.Year_xyz != df.Year_abc,
    df.ID_xyz != df.ID_abc,
]
choices = [
    'Sport not in abc',
    'Sport not in xyz',
    'Change year and ID to xyz',
    'Change year to xyz',
    'Change ID to xyz',
]
df['action'] = np.select(conditions, choices, default=np.nan)
The result is below, with a new column action containing notes on which course of action to take.
Sport SportLeague Year_xyz ID_xyz Year_abc ID_abc \
0 Football Football:XFL 2019 0 2021 1
1 Football Football:NFL 2022 2 2021 1
2 Soccer Soccer:MLS 2022 1 NaN NaN
3 Baseball Baseball:MLB 2022 2 2022 3
4 Basketball Basketball:NBA 2022 3 2021 2
5 Hockey Hockey:NHL 2022 4 2022 4
6 Cricket NaN NaN NaN 2022 5
action
0 Change year and ID to xyz # match, but mismatch year and ID
1 Change year and ID to xyz # match, but mismatch year and ID
2 Sport not in abc # no match: Sport in xyz, but not in abc
3 Change ID to xyz # match, but mismatch ID
4 Change year and ID to xyz # match, but mismatch year and ID
5 nan # complete match: no action needed
6 Sport not in xyz # no match: Sport in abc, but not in xyz
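One caveat, visible in row 5 above: because the choices are strings, np.select casts the default np.nan to the literal string 'nan' rather than a real missing value. If you prefer a readable default, a one-line tweak (the wording of the default is illustrative):
# a string default avoids the complete-match case becoming the string 'nan'
df['action'] = np.select(conditions, choices, default='no action needed')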
Let me know if this is a correct interpretation of what you are looking to achieve.
I have two dataframes with different numbers of rows containing information about players. The first has all the names I need.
df1 = pd.DataFrame({'Player': ["John Sepi", 'Zan Fred', 'Mark Daniel', 'Adam Pop', 'Paul Sepi', 'John Hernandez', 'Price Josiah', 'John Hernandez', 'Adam Pop'],
'Team': ['A', 'C', 'E', 'C', 'B', 'D', 'B', 'A', 'D']})
The other dataframe is missing some players, but has a column with ages. Some player names differ slightly between the two.
df2 = pd.DataFrame({'Player': ["John Sepi", 'Mark A. Daniel', 'John Hernandez', 'Price Josiah', 'John Hernandez', 'Adam Pop'],
'Team': ['A', 'E', 'D', 'B', 'A', 'D'],
'Age': [22, 21, 26, 18, 19, 25]})
Identical names can belong to different persons, so I need to match on Player and Team at the same time. I want to create a new dataframe with all names from the first dataframe and the respective ages from the second. For players missing from the second, fill the new dataframe with a constant value (like XX; it can be any placeholder, just to illustrate). The final dataframe:
print(final_df)
Player Team Age
0 John Sepi A 22
1 Zan Fred C XX
2 Mark Daniel E 21
3 Adam Pop C XX
4 Paul Sepi B XX
5 John Hernandez D 26
6 Price Josiah B 18
7 John Hernandez A 19
8 Adam Pop D 25
You can use the text-matching capabilities of the fuzzywuzzy library mixed with pandas functions in Python.
First, import the following libraries:
import pandas as pd
import numpy as np
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
Then do the matching team by team:
# get the list of unique teams existing in df1
lst_teams = list(np.unique(np.array(df1['Team'])))

# define an arbitrary threshold
thres = 70

# for each team, match similar texts
for team in lst_teams:
    # iterate over the dataframe filtered by team
    for index, row in df1.loc[df1['Team'] == team].iterrows():
        # get the list of players in this team in df2
        lst_player_per_team = list(np.array(df2.loc[df2['Team'] == team]['Player']))
        # use fuzzywuzzy to do the text matching
        output_ratio = process.extract(row['Player'], lst_player_per_team,
                                       scorer=fuzz.token_sort_ratio)
        # check if there are players from df2 in this team
        if output_ratio != []:
            # apply the arbitrary threshold to keep only the most similar text
            if output_ratio[0][1] > thres:
                df1.loc[index, 'Age'] = df2.loc[(df2['Team'] == team)
                                                & (df2['Player'] == output_ratio[0][0])]['Age'].values[0]

df1 = df1.fillna('XX')
With this code and a threshold of 70, you get the following result:
print(df1)
Player Team Age
0 John Sepi A 22
1 Zan Fred C XX
2 Mark Daniel E 21
3 Adam Pop C XX
4 Paul Sepi B XX
5 John Hernandez D 26
6 Price Josiah B 18
7 John Hernandez A 19
8 Adam Pop D 25
It is possible to adjust the threshold to tune the accuracy of the text matching between the two dataframes.
Please note that you should be careful when using .iterrows(), as iterating over a dataframe row by row is generally discouraged.
You can check the fuzzywuzzy docs here: https://pypi.org/project/fuzzywuzzy/
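A sketch of an alternative that replaces the cell-by-cell .loc writes with a single merge (the matching itself is still row-wise; it assumes the same df1/df2 as above, and best_match plus the 70 threshold are illustrative, not part of the code above):
from fuzzywuzzy import fuzz, process

def best_match(row):
    # candidate players from df2 restricted to the same team
    candidates = list(df2.loc[df2['Team'] == row['Team'], 'Player'])
    match = process.extractOne(row['Player'], candidates, scorer=fuzz.token_sort_ratio)
    return match[0] if match and match[1] > 70 else None

# compute the best fuzzy match per row, then attach ages with one merge
df1['matched'] = df1.apply(best_match, axis=1)
result = (df1.merge(df2.rename(columns={'Player': 'matched'}),
                    on=['matched', 'Team'], how='left')
             .drop(columns='matched'))
result['Age'] = result['Age'].fillna('XX')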
Here is one way:
df1 = df1.merge(df2, how='left', on=['Player', 'Team']).fillna('XX')
Let's say I have a dataframe of leads as such:
import pandas as pd
leads = {'Unique Identifier': ['1', '2', '3', '4', '5', '6', '7', '8'],
         'Name': ['brad', 'stacy', 'holly', 'mike', 'phil', 'chris', 'jane', 'glenn'],
         'Channel': [None, None, None, None, 'facebook', 'facebook', 'google', 'facebook'],
         'Campaign': [None, None, None, None, 'A', 'B', 'B', 'C'],
         'Gender': ['M', 'F', 'F', 'M', 'M', 'M', 'F', 'M'],
         'Signup Month': ['Mar', 'Mar', 'Apr', 'May', 'May', 'May', 'Jun', 'Jun']}
leads_df = pd.DataFrame(leads)
leads_df
It has missing data for Channel and Campaign for the first four leads.
I have a separate dataframe with the missing data:
missing = {'Unique Identifier': ['1', '2', '3', '4'],
           'Channel': ['google', 'email', 'facebook', 'google'],
           'Campaign': ['B', 'A', 'C', 'B']}
missing_df = pd.DataFrame(missing)
missing_df
Using the Unique Identifiers in both tables, how would I go about plugging the missing data into the main leads table? For context, there are about 6,000 leads with missing data.
You can merge the two dataframes together, update the columns using the results from the merge and then proceed to drop the merged columns.
data = leads_df.merge(missing_df, how='outer', on='Unique Identifier')
data['Channel'] = data['Channel_y'].fillna(data['Channel_x'])
data['Campaign'] = data['Campaign_y'].fillna(data['Campaign_x'])
data.drop(['Channel_x', 'Channel_y', 'Campaign_x', 'Campaign_y'], axis=1, inplace=True)
The result:
data
Unique Identifier Name Gender Signup Month Channel Campaign
0 1 brad M Mar google B
1 2 stacy F Mar email A
2 3 holly F Apr facebook C
3 4 mike M May google B
4 5 phil M May facebook A
5 6 chris M May facebook B
6 7 jane F Jun google B
7 8 glenn M Jun facebook C
You can set the index of both dataframes to the unique identifier and use combine_first to fill the null values in leads_df:
(leads_df
.set_index("Unique Identifier")
.combine_first(missing_df.set_index("Unique Identifier"))
.reset_index()
)
The approach I use in cases like this is similar to Excel's VLOOKUP function.
leads_df.loc[leads_df.Channel.isna(), 'Channel'] = pd.merge(
    leads_df.loc[leads_df.Channel.isna(), 'Unique Identifier'],
    missing_df,
    how='left')['Channel']
This code will result in:
Unique Identifier Name Channel Campaign Gender Signup Month
0 1 brad google None M Mar
1 2 stacy email None F Mar
2 3 holly facebook None F Apr
3 4 mike google None M May
4 5 phil facebook A M May
5 6 chris facebook B M May
6 7 jane google B F Jun
7 8 glenn facebook C M Jun
You can do the same for 'Campaign'.
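For completeness, the analogous fill for 'Campaign' would look like this (same pattern as above, a sketch rather than tested code):
# same vlookup-style fill, applied to the Campaign column
leads_df.loc[leads_df.Campaign.isna(), 'Campaign'] = pd.merge(
    leads_df.loc[leads_df.Campaign.isna(), 'Unique Identifier'],
    missing_df,
    how='left')['Campaign']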
You just need to fill it using fillna() ...
leads_df.fillna(missing_df, inplace=True)
There is a pandas DataFrame method for this called combine_first:
voltron = leads_df.combine_first(missing_df)
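Note that both this and the fillna() approach above align on the index, so they work here only because identifiers 1 through 4 happen to sit at positions 0 through 3 in both frames. A sketch that makes the alignment explicit (same idea as the set_index answer earlier; the variable name mirrors the one above):
# align on the identifier itself so the fill is independent of row order
voltron = (leads_df.set_index('Unique Identifier')
           .combine_first(missing_df.set_index('Unique Identifier'))
           .reset_index())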
Let's say I have a dataframe as follows:
d = {'name': ['spain', 'greece','belgium','germany','italy'], 'davalue': [3, 4, 6, 9, 3]}
df = pd.DataFrame(data=d)
index name davalue
0 spain 3
1 greece 4
2 belgium 6
3 germany 9
4 italy 3
I would like to aggregate and sum based on a list of strings in the name column. So for example, I may have: southern=['spain', 'greece', 'italy'] and northern=['belgium','germany'].
My goal is to aggregate by using sum, and obtain:
index name davalue
0 southern 10
1 northern 15
where 10=3+4+3 and 15=6+9
I imagined something like:
df.groupby(by=[['spain','greece','italy'],['belgium','germany']])
could exist. The docs say
A label or list of labels may be passed to group by the columns in self
but I'm not sure I understand what that means in terms of syntax.
I would build a dictionary and map:
d = {v:'southern' for v in southern}
d.update({v:'northern' for v in northern})
df['davalue'].groupby(df['name'].map(d)).sum()
Output:
name
northern 15
southern 10
Name: davalue, dtype: int64
One way could be using np.select and using the result as a grouper:
import numpy as np
southern=['spain', 'greece', 'italy']
northern=['belgium','germany']
g = np.select([df.name.isin(southern),
               df.name.isin(northern)],
              ['southern', 'northern'],
              'others')
df.groupby(g).sum()
davalue
northern 15
southern 10
df["regional_group"]=df.apply(lambda x: "north" if x["home_team_name"] in ['belgium','germany'] else "south",axis=1)
You create a new column by which you later groubpy.
df.groupby("regional_group")["davavalue"].sum()
I have a requirement where I want to merge two data frames without any key column.
From the input table, I am treating the first three columns as one data frame and the last column as another. My plan is to sort the second data frame and then merge it to the first without any key column, so that the non-null comments line up with the first rows, as in the desired output.
Is it possible to merge in this way or if there are any alternatives?
One way is to use pd.DataFrame.join after filtering out null values.
Data from #ALollz.
import pandas as pd
df1 = pd.DataFrame({'Country': ['USA', 'UK', 'Finland', 'Spain', 'Australia']})
df2 = pd.DataFrame({'Comments': ['X', None, 'Y', None, 'Z']})
res = df1.join(pd.DataFrame(list(filter(None, df2.values)), columns=['comments']))
Result:
Country comments
0 USA X
1 UK Y
2 Finland Z
3 Spain NaN
4 Australia NaN
If by "sort the second dataframe" you mean move the NULL values to the end of the list and keep the rest of the order in tact, then this will get the job done.
import pandas as pd
df1 = pd.DataFrame({'Country': ['USA', 'UK', 'Finland', 'Spain', 'Australia'],
'Name': ['Sam', 'Chris', 'Jeff', 'Kartik', 'Mavenn']})
df2 = pd.DataFrame({'Comments': ['X', None, 'Y', None, 'Z']})
df1['Comments'] = df2[df2.Comments.notnull()].reset_index().drop(columns='index')
Country Name Comments
0 USA Sam X
1 UK Chris Y
2 Finland Jeff Z
3 Spain Kartik NaN
4 Australia Mavenn NaN
IIUC:
input['Comments'] = input.Comments.sort_values().values
Output:
Comments Country Name
1 X USA Sam
2 Y UK Chris
3 Z Finland Jeff
4 NaN Spain Kartik
5 NaN Australia Maven