Create dataframe column using another column for source variable suffix - python

Difficult to title, so apologies for that...
Here is some example data:
region  FC_EA  FC_EM  FC_GL  FC_XX  FC_YY  ...
GL          4      2      8      6      1  ...
YY          9      7      2      1      3  ...
There are many columns with a suffix, hence the ...
[edit] And there are many other columns. I want to keep all columns.
The aim is to create a column called FC that is the value according to the region column value.
So, for this data the resultant column would be:
FC
8
3
I have a couple of ways to achieve this at present - one way is minimal code (perhaps fine for a small dataset):
df['FC'] = df.apply(lambda x: x['FC_'+x.region], axis=1)
Another way is a stacked np.where query - faster for large datasets I am advised...:
df['FC'] = np.where(df.region=='EA', df.FC_EA,
            np.where(df.region=='EM', df.FC_EM,
            np.where(df.region=='GL', df.FC_GL, ...
I am wondering if anyone out there can suggest the best way to do this, if there is something better than these options?
That would be great.
Thanks!

You could use melt:
(df.melt(id_vars='region', value_name='FC')
   .loc[lambda d: d['region'].eq(d['variable'].str[3:]), ['region', 'FC']]
)
or using apply (probably quite a bit slower):
df['FC'] = (df.set_index('region')
              .apply(lambda r: r.loc[f'FC_{r.name}'], axis=1)
              .values
            )
output:
  region  FC
4     GL   8
9     YY   3
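For completeness, a vectorized alternative that keeps all the other columns (a minimal sketch, assuming every region value has a matching FC_<region> column) is to look the value up by position with NumPy:
import numpy as np
import pandas as pd

df = pd.DataFrame({'region': ['GL', 'YY'],
                   'FC_EA': [4, 9], 'FC_EM': [2, 7], 'FC_GL': [8, 2],
                   'FC_XX': [6, 1], 'FC_YY': [1, 3]})

# column position of 'FC_' + region, row by row
col_pos = df.columns.get_indexer('FC_' + df['region'])

# pick one value per row at that position; every other column is left untouched
df['FC'] = df.to_numpy()[np.arange(len(df)), col_pos]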

Related

How to create calculated column off variable result of same row? Pandas & Python 3

Fairly new to Python, I have been struggling with creating a calculated column based on the variable values of each item.
I have this table below, with DF being the dataframe name.
I am trying to create a 'PE Comp' column that gets the PE value for each ticker and divides it by the Industry average PE Ratio.
My most successful attempt required creating a .groupby industry dataframe (y) which holds the calculated mean per industry. These numbers are correct. Once I did that, I created this code block:
for i in DF['Industry']:
    DF['PE Comp'] = DF['PE Ratio'] / y.loc[i, 'PE Ratio']
However the numbers are coming out incorrect. I've tested this and the y.loc divisor is working fine with the right numbers, meaning that the issue is coming from the dividend.
Any suggestions on how I can overcome this?
Thanks in advance!
You can use the Pandas Groupby transform:
The following takes the PE Ratio column and divides it by the mean of the grouped industries (expressed three different ways in order of speed of calculation):
import pandas as pd
df = pd.DataFrame({"PE Ratio": [1,2,3,4,5,6,7],
"Industry": list("AABCBBC")})
# option 1
df["PE Comp"] = df["PE Ratio"] / df.groupby("Industry")["PE Ratio"].transform("mean")
# option 2
df["PE Comp"] = df.groupby("Industry")["PE Ratio"].transform(lambda x: x/x.mean())
# option 3
import numpy as np
df["PE Comp"] = df.groupby("Industry")["PE Ratio"].transform(lambda x: x/np.mean(x))
df
#Out[]:
#   PE Ratio Industry   PE Comp
#0         1        A  0.666667
#1         2        A  1.333333
#2         3        B  0.642857
#3         4        C  0.727273
#4         5        B  1.071429
#5         6        B  1.285714
#6         7        C  1.272727
First, you MUST NOT ITERATE through a dataframe. It is not optimized at all, and it is a misuse of Pandas' DataFrame.
Creating a new dataframe containing the averages is a good approach in my opinion. I think the line you want to write after is :
df['PE comp'] = df['PE ratio'] / y.loc[df['Industry']].value
I just have a doubt about y.loc[df['Industry']].value: maybe you don't need .value, or maybe you need to cast the value; I didn't test. But the spirit is that your new y DataFrame is like a dict containing the average of each Industry.
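For what it's worth, here is a minimal sketch of that idea (assuming y is the per-industry mean table, indexed by Industry, with a 'PE Ratio' column; the sample values are just the ones from the answer above):
import pandas as pd

DF = pd.DataFrame({"PE Ratio": [1, 2, 3, 4, 5, 6, 7],
                   "Industry": list("AABCBBC")})

# per-industry mean PE Ratio, indexed by Industry (this plays the role of y)
y = DF.groupby("Industry")[["PE Ratio"]].mean()

# look up each row's industry mean, then divide element-wise
# (.to_numpy() avoids index alignment between the two Series)
DF["PE Comp"] = DF["PE Ratio"] / y.loc[DF["Industry"], "PE Ratio"].to_numpy()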

Python: Simplest way to join two DataFrames by unique combinations?

I have two DataFrames:
fuels = pd.DataFrame({'Fuel_Type':['Gasoline', 'Diesel', 'E85']})
years = pd.DataFrame()
years['Year_Model'] = range(2012, 2041)
My desired output is a single new DataFrame which combines these two dataframes as two columns, but for each value in 'years', have it repeated for every unique fuel type in 'fuels'.
In other words, there should be three repetitions for each distinct year, one for each type of fuel.
I can do this very simply in R with:
df <- merge(
  data.frame(years = c(2012:2040)),
  data.frame(fuels = c("Gasoline", "Diesel", "E85")),
  allow.cartesian = T)
I have looked at answers for similar questions such as:
Create all possible combinations of multiple columns in a Pandas DataFrame
Performant cartesian product (CROSS JOIN) with pandas
cartesian product in pandas
But, either I cannot seem to apply the answers' code to my own data, or the answers are too complex for me to understand (as I am very new to Python).
Is there a nice and 'easy to understand' way of doing this?
The second link you posted has a good solution, but it also has a lot of other stuff, so it might be hard to extract if you're new to python. You want:
df = fuels.assign(key=0).merge(years.assign(key=0), on='key').drop('key', axis=1)
This is kind of a slick one-liner, because we're doing a few things at once. We're essentially adding a column of 0s to each dataframe, joining on that, and then getting rid of that column. Here it is broken down into steps:
fuels = fuels.assign(key=0) # add a 'key' column to fuels with all 0s for values
years = years.assign(key=0) # add a 'key' column to years with all 0s for values
df = fuels.merge(years, on='key') # sql-style join on the key column
df = df.drop('key', axis=1) # get rid of the key column in the final product
The merge method defaults to an inner join, so we don't need to specify that; we just have to tell it to join on the right column with on='key'. The axis=1 in .drop('key', axis=1) tells it to drop the column called key; if we left it off (.drop('key')), or passed axis=0, it would try to drop a row labeled key.
The below answer should help you:
import pandas as pd
fuels = pd.DataFrame({'Fuel_Type': ['Gasoline', 'Diesel', 'E85']})
years = pd.DataFrame()
years['Year_Model'] = range(2012, 2041)
fuels['key'] = 1
years['key'] = 1
print(pd.merge(fuels, years, on='key').drop("key", axis=1))
Output:
   Fuel_Type  Year_Model
0   Gasoline        2012
1   Gasoline        2013
2   Gasoline        2014
3   Gasoline        2015
4   Gasoline        2016
..       ...         ...
82       E85        2036
83       E85        2037
84       E85        2038
85       E85        2039
86       E85        2040

[87 rows x 2 columns]
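As an aside, pandas 1.2 added a built-in cross join, so the dummy key column isn't needed there; a minimal sketch:
import pandas as pd

fuels = pd.DataFrame({'Fuel_Type': ['Gasoline', 'Diesel', 'E85']})
years = pd.DataFrame({'Year_Model': range(2012, 2041)})

# every fuel paired with every year (3 x 29 = 87 rows)
df = fuels.merge(years, how='cross')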

Assign value to dataframe from another dataframe based on two conditions

I am trying to assign values from a column in df2['values'] to a column df1['values']. However values should only be assigned if:
df2['category'] is equal to the df1['category'] (rows are part of the same category)
df1['date'] is in df2['date_range'] (date is in a certain range for a specific category)
So far I have this code, which works, but is far from efficient, since it takes me two days to process the two dfs (df1 has ca. 700k rows).
for i in df1.category.unique():
    for j in df2.category.unique():
        if i == j:  # matching categories
            for ia, ra in df1.loc[df1['category'] == i].iterrows():
                for ib, rb in df2.loc[df2['category'] == j].iterrows():
                    if df1['date'][ia] in df2['date_range'][ib]:
                        df1.loc[ia, 'values'] = rb['values']
                        break
I read that I should try to avoid using for-loops when working with dataframes. List comprehensions are great; however, since I do not have a lot of experience yet, I struggle to formulate more complicated code.
How can I approach this problem more efficiently? What essential aspects should I think about when iterating over dataframes with conditions?
The code above tends to skip some rows or assign them wrongly, so I need to do a cleanup afterwards. And the biggest problem is that it is really slow.
Thank you.
Some df1 insight:
df1.head()
         date category
0  2015-01-07       f2
1  2015-01-26       f2
2  2015-01-26       f2
3  2015-04-08       f2
4  2015-04-10       f2
Some df2 insight:
df2.date_range[0]
DatetimeIndex(['2011-11-02', '2011-11-03', '2011-11-04', '2011-11-05',
'2011-11-06', '2011-11-07', '2011-11-08', '2011-11-09',
'2011-11-10', '2011-11-11', '2011-11-12', '2011-11-13',
'2011-11-14', '2011-11-15', '2011-11-16', '2011-11-17',
'2011-11-18'],
dtype='datetime64[ns]', freq='D')
df2 other two columns:
df2[['values','category']].head()
  values category
0     01       f1
1     02       f1
2    2.1       f1
3    2.2       f1
4     03       f1
Edit: Corrected erroneous code and added OP input from a comment
Alright, so if you want to join the dataframes on matching categories, you can merge them:
import pandas as pd
df3 = df1.merge(df2, on = "category")
Next, since date is a timestamp and the "date_range" is actually generated from two columns (per the OP's comment), we use instead:
mask = (df3["startdate"] <= df3["date"]) & (df3["date"] <= df3["enddate"])
subset = df3.loc[mask]
Now we get back to df1 and merge on the common dates while keeping all the values from df1. This will create NaN for the subset values where they didn't match with df1 in the earlier merge.
As such, we set df1["values"] where the entries in common are not NaN and we leave them be otherwise.
import numpy as np

common_dates = df1.merge(subset, on="date", how="left")  # keeping df1 values
df1["values"] = np.where(common_dates["values_y"].notna(),
                         common_dates["values_y"], df1["values"])
N.B.: If more than one df1["date"] matches the date range, you'll have to drop some values first, otherwise the duplicates break the row alignment above.
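Putting the steps together, a minimal end-to-end sketch on toy data (assuming, per the OP's comment, that df2 carries startdate and enddate columns and that each df1 date matches at most one range; the values used here are purely illustrative):
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'date': pd.to_datetime(['2015-01-07', '2015-01-26']),
                    'category': ['f2', 'f2'],
                    'values': [np.nan, np.nan]})
df2 = pd.DataFrame({'category': ['f2'],
                    'startdate': pd.to_datetime(['2015-01-01']),
                    'enddate': pd.to_datetime(['2015-01-15']),
                    'values': [7.5]})

# join on category; df1's column becomes values_x, df2's becomes values_y
df3 = df1.merge(df2, on='category')
mask = (df3['startdate'] <= df3['date']) & (df3['date'] <= df3['enddate'])
subset = df3.loc[mask]

# bring the matched values back onto df1 by date
common_dates = df1.merge(subset[['date', 'values_y']], on='date', how='left')
df1['values'] = np.where(common_dates['values_y'].notna(),
                         common_dates['values_y'], df1['values'])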
You could accomplish the first point:
1. df2['category'] is equal to the df1['category']
with the use of a join.
You could then use a for loop to filter out the data points from df1[date] inside the merged dataframe that are not covered by df2[date_range]. Unfortunately, I would need more information about the content of df1[date] and df2[date_range] to write code here that does exactly that.

How do I groupby a dataframe based on values that are common to multiple columns?

I am trying to aggregate a dataframe based on values that are found in two columns. I am trying to aggregate the dataframe such that the rows that have some value X in either column A or column B are aggregated together.
More concretely, I am trying to do something like this. Let's say I have a dataframe gameStats:
awayTeam   homeTeam  awayGoals  homeGoals
Chelsea    Barca             1          2
R. Madrid  Barca             2          5
Barca      Valencia          2          2
Barca      Sevilla           1          0
... and so on
I want to construct a dataframe such that among my rows I would have something like:
team   goalsFor  goalsAgainst
Barca        10             5
One obvious solution, since the set of unique elements is small, is something like this:
for team in teamList:
    aggregateDf = gameStats[(gameStats['homeTeam'] == team) | (gameStats['awayTeam'] == team)]
    # do other manipulations of the data then append it to a final dataframe
However, going through a loop seems less elegant. And since I have had this problem before with many unique identifiers, I was wondering if there was a way to do this without using a loop as that seems very inefficient to me.
The solution is two-fold: first compute the goals for each team when they are away and when they are home, then combine them. Something like:
goals_when_away = gameStats.groupby(['awayTeam'])[['awayGoals', 'homeGoals']].agg('sum').reset_index().sort_values('awayTeam')
goals_when_home = gameStats.groupby(['homeTeam'])[['homeGoals', 'awayGoals']].agg('sum').reset_index().sort_values('homeTeam')
then combine them
np_result = goals_when_away.iloc[:, 1:].values + goals_when_home.iloc[:, 1:].values
pd_result = pd.DataFrame(np_result, columns=['goal_for', 'goal_against'])
result = pd.concat([goals_when_away.iloc[:, :1], pd_result], axis=1, ignore_index=True)
Note the .values when summing (to get the result as a NumPy array) and the ignore_index=True in the concat; these avoid the pandas trap of aligning on column and index names. Also note that the positional sum assumes every team appears both home and away, so that the two sorted frames line up row for row.
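If that assumption doesn't hold (e.g. a team only ever plays away), a label-aligned sketch of the same idea is to reshape to one row per (team, game) and group once; the column names below are just illustrative:
import pandas as pd

gameStats = pd.DataFrame({'awayTeam': ['Chelsea', 'R. Madrid', 'Barca', 'Barca'],
                          'homeTeam': ['Barca', 'Barca', 'Valencia', 'Sevilla'],
                          'awayGoals': [1, 2, 2, 1],
                          'homeGoals': [2, 5, 2, 0]})

cols = ['team', 'goalsFor', 'goalsAgainst']
home = gameStats.rename(columns={'homeTeam': 'team', 'homeGoals': 'goalsFor',
                                 'awayGoals': 'goalsAgainst'})[cols]
away = gameStats.rename(columns={'awayTeam': 'team', 'awayGoals': 'goalsFor',
                                 'homeGoals': 'goalsAgainst'})[cols]

# one row per (team, game), then a single label-based groupby
result = pd.concat([home, away]).groupby('team').sum()
# Barca ends up with goalsFor 10 and goalsAgainst 5, matching the expected output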

Create multiindex from existing dataframe

I've spent hours browsing everywhere now to try to create a MultiIndex from a dataframe in pandas. This is the dataframe I have (posting an Excel sheet mockup; I do have this in a pandas dataframe):
And this is what I want:
I have tried
newmulti = currentDataFrame.set_index(['user_id','account_num'])
But it returns a dataframe, not a multiindex. Also, I could not figure out how to make 'user_id' level 0 and 'account_num' level 1. I think this must be trivial but I've read so many posts, tutorials, etc. and still could not figure it out. Partly because I'm a very visual person and most posts are not. Please help!
You could simply use groupby in this case, which will create the multi-index automatically when it sums the sales along the requested columns.
df.groupby(['user_id', 'account_num', 'dates']).sales.sum().to_frame()
You should also be able to simply do this:
df.set_index(['user_id', 'account_num', 'dates'])
Although you probably want to avoid any duplicates (e.g. two or more rows with identical user_id, account_num and date values but different sales figures) by summing them, which is why I recommended using groupby.
If you need the MultiIndex itself, you can simply access it via new_df.index, where new_df is the new dataframe created from either of the two operations above.
And user_id will be level 0 and account_num will be level 1.
For clarification for future users, I would like to add the following:
As said by Alexander,
df.set_index(['user_id', 'account_num', 'dates'])
with a possible inplace=True does the job.
The type(df) gives
pandas.core.frame.DataFrame
whereas type(df.index) is indeed the expected
pandas.core.indexes.multi.MultiIndex
Use pd.MultiIndex.from_arrays
lvl0 = currentDataFrame.user_id.values
lvl1 = currentDataFrame.account_num.values
midx = pd.MultiIndex.from_arrays([lvl0, lvl1], names=['level 0', 'level 1'])
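To actually attach it to the frame, assign it back (a one-line sketch):
currentDataFrame.index = midx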
There are two ways to do it; albeit not exactly like you have shown, they work.
Say you have the following df:
     A      B  C    D
0  nil    one  1  NaN
1  bar    one  5  5.0
2  foo    two  3  8.0
3  bar  three  2  1.0
4  foo    two  4  2.0
5  bar    two  6  NaN
1. Workaround 1:
df.set_index('A', append = True, drop = False).reorder_levels(order = [1,0]).sort_index()
This will return:
2. Workaround 2:
df.set_index(['A', 'B']).sort_index()
This will return:
The DataFrame returned by currentDataFrame.set_index(['user_id','account_num']) has its index set to ['user_id','account_num'].
newmulti.index will return the MultiIndex object.
