I have a df that looks like this
import pandas as pd

data = [{'Stock': 'Apple',  'Weight': 0.2,  'Price': 101.99, 'Beta': 1.1},
        {'Stock': 'MCSFT',  'Weight': 0.1,  'Price': 143.12, 'Beta': 0.9},
        {'Stock': 'WARNER', 'Weight': 0.15, 'Price': 76.12,  'Beta': -1.1},
        {'Stock': 'ASOS',   'Weight': 0.35, 'Price': 76.12,  'Beta': -1.1},
        {'Stock': 'TESCO',  'Weight': 0.2,  'Price': 76.12,  'Beta': -1.1}]
data_df = pd.DataFrame(data)
and a custom function that will calculate weighted averages
def calc_weighted_averages(data_in, weighted_by):
    return sum(x * y for x, y in zip(data_in, weighted_by)) / sum(weighted_by)
I want to apply this custom formula to all the columns in my df; my first idea was to do something like this
data_df = data_df[['Weight','Price','Beta']]
data_df = data_df.apply(lambda x: calc_weighted_averages(x['Price'], x['Weight']), axis=1)
How can I keep my weighted_by column fixed and apply the custom function to the other columns? I should end up with a weighted average number for Price and Beta.
I think you need to take a subset of the columns first and then pass the Weight column as the second argument:
s1 = data_df[['Price','Beta']].apply(lambda x: calc_weighted_averages(x, data_df['Weight']))
print (s1)
Price 87.994
Beta -0.460
dtype: float64
Another solution, without apply, is faster:
s1 = data_df[['Price','Beta']].mul(data_df['Weight'], 0).sum().div(data_df['Weight'].sum())
print (s1)
Price 87.994
Beta -0.460
dtype: float64
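If NumPy is available, np.average accepts a weights argument directly, which keeps the intent explicit; a minimal sketch of the same calculation (an alternative, not part of the original answers):

import numpy as np

# np.average does the multiply-sum-divide in one call, per column
s2 = data_df[['Price', 'Beta']].apply(lambda col: np.average(col, weights=data_df['Weight']))
print (s2)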
I have a large DataFrame with multiple columns: one holds string IDs and the others hold floats.
Example of the DataFrame:
df = pd.DataFrame({'ID': ['Child', 'Child', 'Child', 'Child', 'Baby', 'Baby', 'Baby', 'Baby'],
                   'income': [40000, 50000, 42000, 300, 2000, 4000, 2000, 3000],
                   'Height': [1.3, 1.5, 1.9, 2.0, 2.3, 1.4, 0.9, 0.8]})
What I want to do is calculate the average of every n rows of all columns, within every ID group.
desired output:
steps = 3
df = pd.DataFrame({'ID': ['Child', 'Child', 'Baby', 'Baby'],
                   'income': [44000, 300, 2666.67, 3000],
                   'Height': [1.567, 2.0, 1.533, 0.8],
                   'Values': [3, 1, 3, 1]})
Where the rows are first grouped by ID and then the mean is taken over every 3 values in the same group. I added Values so that I can track how many rows went into that row's average across all columns.
I have found similar questions, but I cannot seem to combine them to solve my problem:
This question gives averages of n rows.
This question covers pd.cut, which I might need as well; I just don't understand how the bins work.
How can I make this happen?
You can use a double groupby:
# set up secondary grouper
group = df.groupby('ID').cumcount().floordiv(steps)
# groupby + agg
(df.groupby(['ID', group], as_index=False, sort=False)
   .agg(**{'income': ('income', 'mean'),
           'Height': ('Height', 'mean'),
           'Values': ('Height', 'count'),
           })
)
output:
ID income Height Values
0 Child 44000.000000 1.566667 3
1 Child 300.000000 2.000000 1
2 Baby 2666.666667 1.533333 3
3 Baby 3000.000000 0.800000 1
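For reference, the secondary grouper simply numbers the rows inside each ID group and integer-divides by steps, so every chunk of 3 consecutive rows per ID shares one label:

print (group.tolist())
[0, 0, 0, 1, 0, 0, 0, 1]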
I have a data frame where each row is some permutation of an (ordered) list of elements. The same row cannot contain the same element twice, but it may contain none of them. For example, if a row contains five values and the possible values are "alpha" through "epsilon", {"alpha", "beta", "", "", ""} is allowed, {"beta", "alpha", "", "", ""} is also allowed, but {"alpha", "alpha", "", "", ""} is not. It cannot appear in the frame, by construction.
The rows of the data frame can therefore be unordered. What I want is to sort each row according to a predefined relation, e.g. a dict. For example, the data frame may look like
yy = {
    'x1': ['alpha', '', 'beta', '', 'gamma'],
    'x2': ['', '', '', '', 'alpha'],
    'x3': ['', 'beta', '', 'alpha', ''],
}
df = pd.DataFrame(yy)
df
The given (= predefined) order is sort_order = {'alpha': 0, 'beta': 1, 'gamma': 2} and using this the desired output is
# Desired output
yy = {
    'x1': ['alpha', '', '', 'alpha', 'alpha'],
    'x2': ['', 'beta', 'beta', '', ''],
    'x3': ['', '', '', '', 'gamma']
}
df = pd.DataFrame(yy)
df
How is it possible to do that? My actual data frame is not really big, but it's still ~ 20K x 200, so it pays to (1) avoid looping over all rows with if-then statements ordering each row on every iteration, and (2) pass all the columns at once rather than spelling out something like [['x1', 'x2', ... , 'x200']].
First create a helper Series from the ordering, with the column names as its index; then, for each row, test membership with Series.isin and use Series.where to set non-matching values to an empty string:
sort_order = {'alpha': 0, 'beta': 1, 'gamma': 2}
s = pd.Series(list(sort_order), index=df.columns)
df = df.apply(lambda x: s.where(s.isin(x), ''), axis=1)
print (df)
      x1    x2     x3
0  alpha
1         beta
2         beta
3  alpha
4  alpha        gamma
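To see why this works, the helper Series just maps each column name to the value expected at that position:

print (s)
x1    alpha
x2     beta
x3    gamma
dtype: object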
Alternative solution with numpy.where:
import numpy as np

s = pd.Series(list(sort_order), index=df.columns)
df = pd.DataFrame(np.where(df.apply(lambda x: s.isin(x), axis=1), s, ''),
                  index=df.index,
                  columns=df.columns)
print (df)
      x1    x2     x3
0  alpha
1         beta
2         beta
3  alpha
4  alpha        gamma
I have a pandas dataframe with a column where each value is a list containing a single dictionary.
For example:
col1
[{'type': 'yellow', 'id': 2, ...}]
[{'type': 'brown', 'id': 13, ...}]
...
I need to extract the value associated with the 'type' key. There are different ways to do it, but since my dataframe is huge (several million rows), I need an efficient method, and I am not sure which is best.
Let us try this:
import numpy as np
import pandas as pd

data = {
    'col': [[{'type': 'yellow', 'id': 2}], [{'type': 'brown', 'id': 13}], np.nan]
}
df = pd.DataFrame(data)
print(df)
col
0 [{'type': 'yellow', 'id': 2}]
1 [{'type': 'brown', 'id': 13}]
2 NaN
Use explode and the str accessor:
df['result'] = df.col.explode().str['type']
output:
col result
0 [{'type': 'yellow', 'id': 2}] yellow
1 [{'type': 'brown', 'id': 13}] brown
2 NaN NaN
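Since the question emphasizes speed over several million rows, a plain list comprehension can beat explode for this flat pattern; a minimal sketch, assuming every non-null entry is a one-element list:

# fall back to None for NaN rows or empty lists
df['result'] = [d[0]['type'] if isinstance(d, list) and d else None
                for d in df['col']]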
Accessing any element in most data structures is an O(1) operation, and I'm sure a pandas data frame is no different. The only issue you will face is looping through the rows; there's probably no way around it.
I am looking into Pythonic ways of extracting part of the dictionary below and turning it into a pandas DataFrame as shown. I appreciate your help with that!
{'data': [{'x': {'name': 'Gamma', 'unit': 'cps', 'values': [10, 20, 30]},
           'y': {'name': 'Depth', 'unit': 'm', 'values': [34.3, 34.5, 34.7]}}]}
   Depth  Gamma
1   34.3     10
2   34.5     20
3   34.7     30
Sure, basically, you need to iterate over the values of each dict in the 'data' list, which is itself a dict of column information:
In [1]: data = {'data': [{'x': {'name': 'Gamma', 'unit': 'cps', 'values': [10, 20, 30]},
...: 'y': {'name': 'Depth', 'unit': 'm', 'values': [34.3, 34.5, 34.7]}}]}
In [2]: import pandas as pd
In [3]: pd.DataFrame({
...: col["name"]: col["values"]
...: for d in data['data']
...: for col in d.values()
...: })
Out[3]:
Gamma Depth
0 10 34.3
1 20 34.5
2 30 34.7
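If the unit fields are worth keeping, a small variation folds them into the column names; the bracketed format here is just an illustrative choice:

In [4]: pd.DataFrame({
   ...:     f"{col['name']} [{col['unit']}]": col["values"]
   ...:     for d in data['data']
   ...:     for col in d.values()
   ...: })
Out[4]:
   Gamma [cps]  Depth [m]
0           10       34.3
1           20       34.5
2           30       34.7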
I would like to know how to take a start date from a data frame column and add rows to the dataframe based on the number of days in another column: one new date per day.
Essentially, I am trying to turn this data frame:
df = pd.DataFrame({
    'Name': ['Peter', 'Peter', 'Peter', 'Peter'],
    'Planned_Start': ['1/1/2019', '1/2/2019', '1/15/2019', '1/2/2019'],
    'Duration': [2, 3, 5, 6],
    'Hrs': [0.6, 1, 1.2, 0.3]})
...into this data frame:
df_2 = pd.DataFrame({
    'Name': ['Peter', 'Peter', 'Peter', 'Peter', 'Peter', 'Peter', 'Peter', 'Peter'],
    'Date': ['1/1/2019', '1/2/2019', '1/2/2019', '1/3/2019', '1/4/2019', '1/10/2019', '1/15/2019', '1/16/2019'],
    'Hrs': [0.6, 0.6, 1, 1, 1, 1.2, 0.3, 0.3]})
I'm new to programming in general and have tried the following:
df_2 = pd.DataFrame({
    'date': pd.date_range(
        start=df.Planned_Start,
        end=pd.to_timedelta(df.Duration, unit='D'),
        freq='D'
    )
})
... and ...
df["date"] = df.Planned_Start + timedelta(int(df.Duration))
with no luck.
I am not entirely sure what you are trying to achieve, as your df_2 looks a bit wrong from what I can see.
If you want to take the Duration column as a number of days and add that many dates to a date column, the code below achieves that.
You can also drop any columns you don't need with the pd.Series.drop() method:
from datetime import datetime, timedelta

import pandas as pd

df = pd.DataFrame({
    'Name': ['Peter', 'Peter', 'Peter', 'Peter'],
    'Planned_Start': ['1/1/2019', '1/2/2019', '1/15/2019', '1/2/2019'],
    'Duration': [2, 3, 5, 6],
    'Hrs': [0.6, 1, 1.2, 0.3]})

rows = []
for i, row in df.iterrows():
    for duration in range(row.Duration):
        # one output row per day, offset from the planned start
        date = pd.Series([datetime.strptime(row.Planned_Start, '%m/%d/%Y')
                          + timedelta(days=duration)], index=['date'])
        rows.append(pd.concat([row, date]))
df_new = pd.DataFrame(rows).reset_index(drop=True)
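For larger frames, a loop-free variant of the same expansion is possible with index.repeat plus a per-row day offset; a sketch, assuming the df defined above:

# repeat each row Duration times, then offset the start date by 0..Duration-1 days
df_fast = df.loc[df.index.repeat(df['Duration'])].copy()
offsets = df_fast.groupby(level=0).cumcount()
df_fast['Date'] = (pd.to_datetime(df_fast['Planned_Start'], format='%m/%d/%Y')
                   + pd.to_timedelta(offsets, unit='D'))
df_fast = df_fast[['Name', 'Date', 'Hrs']].reset_index(drop=True)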