Create multiindex from existing dataframe - python

I've spent hours browsing everywhere trying to create a MultiIndex from an existing dataframe in pandas. This is the dataframe I have (posting an Excel-sheet mockup; I do have this as a pandas dataframe):
And this is what I want:
I have tried
newmulti = currentDataFrame.set_index(['user_id','account_num'])
But it returns a dataframe, not a multiindex. Also, I could not figure out how to make 'user_id' level 0 and 'account_num' level 1. I think this must be trivial but I've read so many posts, tutorials, etc. and still could not figure it out. Partly because I'm a very visual person and most posts are not. Please help!

You could simply use groupby in this case, which will create the multi-index automatically when it sums the sales along the requested columns.
df.groupby(['user_id', 'account_num', 'dates']).sales.sum().to_frame()
You should also be able to simply do this:
df.set_index(['user_id', 'account_num', 'dates'])
Although you probably want to avoid any duplicates (e.g. two or more rows with identical user_id, account_num and date values but different sales figures) by summing them, which is why I recommended using groupby.
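A quick way to check whether such duplicates exist before setting the index (a sketch):
df.duplicated(subset=['user_id', 'account_num', 'dates']).any()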
If you need the MultiIndex itself, you can simply access it via new_df.index, where new_df is the new dataframe created by either of the two operations above.
And user_id will be level 0 and account_num will be level 1.

For the benefit of future users, I would like to add the following:
As said by Alexander,
df.set_index(['user_id', 'account_num', 'dates'])
with a possible inplace=True does the job.
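That is:
df.set_index(['user_id', 'account_num', 'dates'], inplace=True)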
The type(df) gives
pandas.core.frame.DataFrame
whereas type(df.index) is indeed the expected
pandas.core.indexes.multi.MultiIndex

Use pd.MultiIndex.from_arrays
lvl0 = currentDataFrame.user_id.values
lvl1 = currentDataFrame.account_num.values
midx = pd.MultiIndex.from_arrays([lvl0, lvl1], names=['level 0', 'level 1'])
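You can then attach it to the existing frame directly; the first array becomes level 0 and the second level 1:
currentDataFrame.index = midx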

There are two ways to do it; neither is exactly like what you have shown, but both work.
Say you have the following df:
     A      B  C    D
0  nil    one  1  NaN
1  bar    one  5  5.0
2  foo    two  3  8.0
3  bar  three  2  1.0
4  foo    two  4  2.0
5  bar    two  6  NaN
1. Workaround 1:
df.set_index('A', append=True, drop=False).reorder_levels([1, 0]).sort_index()
This will return:
         A      B  C    D
A
bar 1  bar    one  5  5.0
    3  bar  three  2  1.0
    5  bar    two  6  NaN
foo 2  foo    two  3  8.0
    4  foo    two  4  2.0
nil 0  nil    one  1  NaN
2. Workaround 2:
df.set_index(['A', 'B']).sort_index()
This will return:
           C    D
A   B
bar one    5  5.0
    three  2  1.0
    two    6  NaN
foo two    3  8.0
    two    4  2.0
nil one    1  NaN

The DataFrame returned by currentDataFrame.set_index(['user_id','account_num']) has its index set to ['user_id','account_num'].
newmulti.index will return the MultiIndex object.

Related

Highlight result of dataframe comparison where values differ

I need to compare 2 DataFrames (which should be identical) and output an Excel sheet that shows the comparison between them, with any mismatched values highlighted. This was the format requested by the analysts working with the reports.
I'm currently using df.compare() to do this, which gives a result like the below, where orig is the original df and new is the new df.
In the below, both values in col_1 at index 3 should be highlighted, because they didn't match between the dataframes:
       col_1       col_2       col_3
        orig  new   orig  new   orig  new
index
1          1    1      2    2      3    3
2          1    1      2    2      3    3
3          1    2      2    2      3    3
While I can do this on my own, the dataframes could be very large, and there will be hundreds of comparisons. So I need your help in doing it efficiently!
My idea was to do
orig.compare(new, keep_equal=False)
and use that to create a mask. This would work because keep_equal=False only returns the values that differ; all other cells are NaN. Then I could run the comparison again with keep_equal=True, which populates all cells. Then finally apply the mask using
df.style.apply
to highlight the values that didn't match.
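In code, that idea looks roughly like this (a sketch; the highlight colour is an arbitrary placeholder):
import numpy as np

diff = orig.compare(new, keep_equal=False)  # equal cells are NaN
full = orig.compare(new, keep_equal=True)   # same shape, every cell populated

def highlight_mismatches(_):
    # diff is non-NaN exactly where the two frames disagreed
    return np.where(diff.notna(), 'background-color: yellow', '')

styled = full.style.apply(highlight_mismatches, axis=None)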
Is there a faster way to do this? My current approach processes all the cells in the df several times.
Thanks for any help you can provide.
orig and new are the two dataframes you want to compare.
Use:
import numpy as np

def highlight_diffs(data, props=''):
    return np.where(orig != new, props, '')

orig.style.apply(highlight_diffs, props='color:white;background-color:darkblue', axis=None)
Reference: Styler Functions. Acting on Data.
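If you also need the highlighted comparison written to an Excel file, the styled object can be exported directly; a sketch (Styler.to_excel requires an engine such as openpyxl, and the file name is a placeholder):
orig.style.apply(highlight_diffs, props='color:white;background-color:darkblue', axis=None).to_excel('comparison.xlsx')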

python pandas how to organize similar group data

I want to organize similar group data. Here is my dataframe:
SKU
FATUT
GUYGE
FATUT-01
SUPAU
GUYPE
SUPAU-01
FATUT-02
GUYGE-01
My expected dataframe will look like this:
SKU
FATUT
FATUT-01
FATUT-02
GUYGE
GUYGE-01
SUPAU
SUPAU-01
GUYPE
I want to organize similar groups of data sequentially.
One option is to use groupby with the parameter sort=False; then concatenate the split DataFrames.
How it works:
1. Group df by the strings before the dash.
2. groupby sorts by the group keys by default; specifying sort=False keeps the keys in the order they first appear in df, i.e. "GUYPE" stays behind "SUPAU".
3. A groupby object contains information about the groups that you can unpack like a dictionary. Unpack it in a generator expression that yields the grouped DataFrames.
4. Using concat, concatenate the split DataFrames into one; ignore_index=True discards the index coming from the split DataFrames and resets it.
out = pd.concat((d for _, d in df.groupby(df['SKU'].str.split('-').str[0], sort=False)), ignore_index=True)
Output:
SKU
0 FATUT
1 FATUT-01
2 FATUT-02
3 GUYGE
4 GUYGE-01
5 SUPAU
6 SUPAU-01
7 GUYPE
But I feel like, for your task, sort_values might work as well, even though the order is not exactly the same as in the desired output:
df = df.sort_values(by='SKU', ignore_index=True)
Output:
SKU
0 FATUT
1 FATUT-01
2 FATUT-02
3 GUYGE
4 GUYGE-01
5 GUYPE
6 SUPAU
7 SUPAU-01

Create dataframe column using another column for source variable suffix

Difficult to title, so apologies for that...
Here is some example data:
region  FC_EA  FC_EM  FC_GL  FC_XX  FC_YY  ...
GL          4      2      8      6      1  ...
YY          9      7      2      1      3  ...
There are many columns with a suffix, hence the ...
[edit] And there are many other columns. I want to keep all columns.
The aim is to create a column called FC whose value comes from the FC_ column that matches the region column value.
So, for this data the resultant column would be:
FC
8
3
I have a couple of ways to achieve this at present - one way is minimal code (perhaps fine for a small dataset):
df['FC'] = df.apply(lambda x: x['FC_'+x.region], axis=1)
Another way is a stacked np.where query, which I am advised is faster for large datasets:
df['FC'] = np.where(df.region == 'EA', df.FC_EA,
           np.where(df.region == 'EM', df.FC_EM,
           np.where(df.region == 'GL', df.FC_GL, ...
I am wondering if anyone out there can suggest the best way to do this, if there is something better than these options?
That would be great.
Thanks!
You could use melt:
(df.melt(id_vars='region', value_name='FC')
   .loc[lambda d: d['region'].eq(d['variable'].str[3:]), ['region', 'FC']]
)
or using apply (probably quite a bit slower):
df['FC'] = (df.set_index('region')
              .apply(lambda r: r.loc[f'FC_{r.name}'], axis=1)
              .values
            )
output:
  region  FC
4     GL   8
9     YY   3
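For large frames, a fully vectorized alternative is a positional lookup with numpy. This is only a sketch, assuming every region value has a matching FC_ column (get_indexer returns -1 for any that don't, which would silently pick the last column):
import numpy as np

col_pos = df.columns.get_indexer('FC_' + df['region'])  # column position for each row
df['FC'] = df.to_numpy()[np.arange(len(df)), col_pos]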

Assign value to dataframe from another dataframe based on two conditions

I am trying to assign values from a column in df2['values'] to a column df1['values']. However, values should only be assigned if:
df2['category'] is equal to df1['category'] (rows are part of the same category), and
df1['date'] is in df2['date_range'] (the date is in a certain range for a specific category).
So far I have this code, which works, but is far from efficient, since it takes me two days to process the two dfs (df1 has ca. 700k rows).
for i in df1.category.unique():
    for j in df2.category.unique():
        if i == j:  # matching categories
            for ia, ra in df1.loc[df1['category'] == i].iterrows():
                for ib, rb in df2.loc[df2['category'] == j].iterrows():
                    if df1['date'][ia] in df2['date_range'][ib]:
                        df1.loc[ia, 'values'] = rb['values']
                        break
I read that I should try to avoid for-loops when working with dataframes. List comprehensions are great; however, since I do not have a lot of experience yet, I struggle to formulate more complicated code.
How can I work through this problem more efficiently? What key aspects should I think about when iterating over dataframes with conditions?
The code above tends to skip some rows or assign them wrongly, so I need to do a cleanup afterwards. And the biggest problem is that it is really slow.
Thank you.
Some df1 insight:
df1.head()
         date category
0  2015-01-07       f2
1  2015-01-26       f2
2  2015-01-26       f2
3  2015-04-08       f2
4  2015-04-10       f2
Some df2 insight:
df2.date_range[0]
DatetimeIndex(['2011-11-02', '2011-11-03', '2011-11-04', '2011-11-05',
'2011-11-06', '2011-11-07', '2011-11-08', '2011-11-09',
'2011-11-10', '2011-11-11', '2011-11-12', '2011-11-13',
'2011-11-14', '2011-11-15', '2011-11-16', '2011-11-17',
'2011-11-18'],
dtype='datetime64[ns]', freq='D')
df2 other two columns:
df2[['values','category']].head()
  values category
0     01       f1
1     02       f1
2    2.1       f1
3    2.2       f1
4     03       f1
Edit: Corrected erroneous code and added OP input from a comment
Alright, so if you want to join the dataframes on matching categories, you can merge them:
import numpy as np
import pandas as pd

df3 = df1.merge(df2, on="category")
Next, since date is a timestamp and the "date_range" is actually generated from two columns (per OP's comment), we use:
mask = (df3["startdate"] <= df3["date"]) & (df3["date"] <= df3["enddate"])
subset = df3.loc[mask]
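If df2 only carries the materialized date_range, the two bounding columns can be derived first; a sketch, assuming each entry is a DatetimeIndex as shown above:
df2["startdate"] = df2["date_range"].map(lambda r: r[0])
df2["enddate"] = df2["date_range"].map(lambda r: r[-1])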
Now we get back to df1 and merge on the common dates while keeping all the values from df1. This will create NaN for the subset values where they didn't match with df1 in the earlier merge.
As such, we set df1["values"] where the entries in common are not NaN and we leave them be otherwise.
common_dates = df1.merge(subset, on="date", how="left")  # keeping df1 values
df1["values"] = np.where(common_dates["values_y"].notna(),
                         common_dates["values_y"], df1["values"])
N.B.: If more than one df1["date"] matches a date range, you'll have to drop some values first, otherwise the duplicates will break the alignment described above.
You could accomplish the first point:
1. df2['category'] is equal to the df1['category']
with the use of a join.
You could then use a for loop to filter the data points from df1['date'] inside the merged dataframe that are not covered by df2['date_range']. Unfortunately, I would need more information about the content of df1['date'] and df2['date_range'] to write code here that does exactly that.

How do I groupby a dataframe based on values that are common to multiple columns?

I am trying to aggregate a dataframe based on values that are found in two columns. I am trying to aggregate the dataframe such that the rows that have some value X in either column A or column B are aggregated together.
More concretely, I am trying to do something like this. Let's say I have a dataframe gameStats:
   awayTeam  homeTeam  awayGoals  homeGoals
    Chelsea     Barca          1          2
  R. Madrid     Barca          2          5
      Barca  Valencia          2          2
      Barca   Sevilla          1          0
... and so on
I want to construct a dataframe such that among my rows I would have something like:
   team  goalsFor  goalsAgainst
  Barca        10             5
One obvious solution, since the set of unique elements is small, is something like this:
for team in teamList:
    aggregateDf = gameStats[(gameStats['homeTeam'] == team) | (gameStats['awayTeam'] == team)]
    # do other manipulations of the data, then append it to a final dataframe
However, going through a loop seems less elegant. And since I have had this problem before with many unique identifiers, I was wondering if there was a way to do this without using a loop as that seems very inefficient to me.
The solution is twofold: first compute the goals for each team when they are away and when they are home, then combine them. Something like:
goals_when_away = gameStats.groupby('awayTeam')[['awayGoals', 'homeGoals']].agg('sum').reset_index().sort_values('awayTeam')
goals_when_home = gameStats.groupby('homeTeam')[['homeGoals', 'awayGoals']].agg('sum').reset_index().sort_values('homeTeam')
then combine them
np_result = goals_when_away.iloc[:, 1:].values + goals_when_home.iloc[:, 1:].values
pd_result = pd.DataFrame(np_result, columns=['goal_for', 'goal_against'])
result = pd.concat([goals_when_away.iloc[:, :1], pd_result], axis=1, ignore_index=True)
Note the use of .values when summing, to get the result as a numpy array, and ignore_index=True in the concat; both avoid the pandas trap of aligning on column and index labels when combining the two frames.
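An alternative that avoids relying on both groupbys yielding the teams in the same order is to reshape to long form first and aggregate once. A sketch (it also handles teams that only ever appear home or away):
import pandas as pd

away = gameStats.rename(columns={'awayTeam': 'team', 'awayGoals': 'goalsFor', 'homeGoals': 'goalsAgainst'})
home = gameStats.rename(columns={'homeTeam': 'team', 'homeGoals': 'goalsFor', 'awayGoals': 'goalsAgainst'})
cols = ['team', 'goalsFor', 'goalsAgainst']
result = pd.concat([away[cols], home[cols]]).groupby('team', as_index=False).sum()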
