How to save the result from an equation (float) to a column in Python

I have a data frame that looks like this:
df:
1 2 3.4
-2 2 1.1
2 3 4
-5 5 5
I can use this data in my equation like:
result = abs(int(df[0])) + (int(df[1]) / 2 + float(df[2]) / 32)
After this calculation I receive a list with a result for each line of df, and the resulting type is float.
Question: How can I save it as a single column or DataFrame, and add that result column to another DataFrame that is the same as df?
I've tried pd.DataFrame(result), which doesn't work.

Assign directly to the new column you're trying to create, using vectorized Series operations rather than int()/float(), which don't work on a whole Series:
df[3] = df[0].abs() + df[1] / 2 + df[2] / 32

I think you need Series.astype to cast the columns, combined with Series.abs:
df = pd.DataFrame({0: [1, -2, 2, -5], 1: [2, 2, 3, 5], 2: [3.4, 1.1, 4.0, 5.0]})
print (df)
0 1 2
0 1 2 3.4
1 -2 2 1.1
2 2 3 4.0
3 -5 5 5.0
df[3] = df[1].astype(int).abs() + df[1].astype(int) / 2 + df[2].astype(float) / 32
print (df)
0 1 2 3
0 1 2 3.4 3.106250
1 -2 2 1.1 3.034375
2 2 3 4.0 4.625000
3 -5 5 5.0 7.656250
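If the result needs to go into a separate DataFrame rather than into df itself, a minimal sketch (assuming a hypothetical df2 that shares df's index) could look like this:
result = df[0].abs() + df[1] / 2 + df[2] / 32  # vectorized version of the question's formula
df2 = df.copy()  # stand-in for "another dataframe that's the same as df"
df2['result'] = result  # assignment aligns on the index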

Trying to multiply a certain data cell by another certain data cell in pandas

Because my real scenario led to some misunderstanding, I am going to create a simplified one.
Here is the DataFrame.
import pandas as pd
num1df = pd.DataFrame({'Number 1': [1, 4, 3, 2, 100]})
num2df = pd.DataFrame({'Number 2': [1, 2, 'NaN', 4, 5]})
num3df = pd.DataFrame({'Number 3': [1, 2, 3, 1000, 0]})
numsdf = pd.concat([num1df, num2df, num3df], axis=1, join="inner")
print(numsdf)
Number 1 Number 2 Number 3
0 1 1 1
1 4 2 2
2 3 NaN 3
3 2 4 1000
4 100 5 0
I want to be able to do the following addition: column Number 1, row 4 plus column Number 3, row 3 should go into column Number 2, row 2, i.e. 100 + 1000 = 1100 (the answer should take the place of the NaN).
This should be the expected outcome:
Number 1 Number 2 Number 3
0 1 1 1
1 4 2 2
2 3 1100 3
3 2 4 1000
4 100 5 0
How would I do that? I cannot figure it out.
Note: this solution works only if the indices are the same in all 3 DataFrames.
If possible, replace non-numeric values with missing values and then forward fill the last non-missing value in the same column:
marketcapdf['Market Cap'] = stockpricedf['Stock Price'] * pd.to_numeric(outstandingdf['Outstanding'], errors='coerce').ffill()
If working in one DataFrame:
df['Market Cap'] = df['Stock Price'] * pd.to_numeric(df['Outstanding'], errors='coerce').ffill()
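As a self-contained sketch of the same pattern (the DataFrame and its values here are invented for illustration): to_numeric with errors='coerce' turns non-numeric entries into NaN, and ffill carries the last valid value forward:
import pandas as pd
tmp = pd.DataFrame({'Stock Price': [10.0, 11.0, 12.0],
                    'Outstanding': [100, 'n/a', 120]})
tmp['Market Cap'] = tmp['Stock Price'] * pd.to_numeric(tmp['Outstanding'], errors='coerce').ffill()
print(tmp)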
EDIT: If you need to multiply by the shifted second column, leaving the first value unchanged, use:
numsdf['new'] = numsdf['Number 1'] * numsdf['Number 2'].shift(fill_value=1)
print(numsdf)
Number 1 Number 2 new
0 5 1 5
1 4 2 4
2 3 3 6
3 2 4 6
4 1 5 4
EDIT1: I create the new columns step by step for better understanding:
import numpy as np
num1df = pd.DataFrame({'Number 1': [1, 4, 3, 2, 100]})
num2df = pd.DataFrame({'Number 2': [1, 2, np.nan, 4, 5]})
num3df = pd.DataFrame({'Number 3': [1, 2, 3, 1000, 0]})
numsdf = pd.concat([num1df, num2df, num3df], axis=1, join="inner")
#add by shifted values
numsdf['new'] = numsdf['Number 1'].shift(-1, fill_value=0) + numsdf['Number 3']
#shift again
numsdf['new1'] = numsdf['new'].shift(-1, fill_value=0)
#replace NaN by another column
numsdf['new2'] = numsdf['Number 2'].fillna(numsdf['new1'])
print(numsdf)
Number 1 Number 2 Number 3 new new1 new2
0 1 1.0 1 5 5 1.0
1 4 2.0 2 5 5 2.0
2 3 NaN 3 5 1100 1100.0
3 2 4.0 1000 1100 0 4.0
4 100 5.0 0 0 0 5.0
If you only need to patch that single cell, read the two values by position with iloc and write the sum back with at:
foo = numsdf.iloc[4, 0]
bar = numsdf.iloc[3, 2]
numsdf.at[2, 'Number 2'] = foo + bar
Output:
Number 1 Number 2 Number 3
0 1 1 1
1 4 2 2
2 3 1100 3
3 2 4 1000
4 100 5 0
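As a side note (a sketch, not part of the answers above), the same single-cell update can be written with purely positional access via iat, since 'Number 2' is the second column:
numsdf.iat[2, 1] = numsdf.iat[4, 0] + numsdf.iat[3, 2]  # 100 + 1000 -> 1100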

Speed up dropping rows based on pandas column values

I have a very large pandas data frame, which looks something like
df = pd.DataFrame({"Station": [2, 2, 2, 5, 5, 5, 6, 6],
"Day": [1, 2, 3, 1, 2, 3, 1, 2],
"Temp": [-7.0, 2.7, -1.3, -1.9, 0.2, 0.5, 1.3, 6.4]})
and I would like to filter out, as efficiently (quickly) as possible, all rows whose 'Station' value does not occur in exactly n rows.
stations = pd.unique(df['Station'])
n = 3
def complete(temp):
    for k in range(len(stations)):
        if len(temp[temp['Station'] == stations[k]].Temp) != n:
            temp.drop(temp.index[temp['Station'] == stations[k]], inplace=True)
I've been looking into using @jit(nopython=True) or Cython along the lines of the pandas "Enhancing performance" tutorial, but in the examples I have found, the columns are treated separately from each other. I'm wondering: is the fastest way to somehow use @jit to build a list v of the 'Station' values I want to keep and then filter the whole data frame with df = df[df.Station.isin(v)], or is there a better way?
Use value_counts:
out = df[df['Station'].isin(df['Station'].value_counts().loc[lambda x: x==n].index)]
print(out)
# Output
Station Day Temp
0 2 1 -7.0
1 2 2 2.7
2 2 3 -1.3
3 5 1 -1.9
4 5 2 0.2
5 5 3 0.5
Result of value_counts:
>>> df['Station'].value_counts()
2 3
5 3
6 2
Name: Station, dtype: int64
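The same value_counts idea, broken into steps for readability (a sketch equivalent to the one-liner above):
counts = df['Station'].value_counts()  # number of rows per station
keep = counts[counts == n].index       # stations that appear exactly n times
out = df[df['Station'].isin(keep)]     # keep only rows for those stations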
You can group by "Station", transform with the count method, and compare the counts with n to create a boolean Series. Then use this mask to filter the relevant rows:
n=3
msk = df.groupby('Station')['Temp'].transform('count').eq(n)
df = df[msk]
Output:
Station Day Temp
0 2 1 -7.0
1 2 2 2.7
2 2 3 -1.3
3 5 1 -1.9
4 5 2 0.2
5 5 3 0.5
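Another option in the same spirit (an alternative, not taken from the answers above) is groupby with filter, which keeps only the groups that satisfy a condition, though it is usually slower than the transform-based mask:
out = df.groupby('Station').filter(lambda g: len(g) == n)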

Why do I get an error whilst slicing using df_grouped.loc[] after grouping a few columns from df?

I am a SAS user, playing around with some data manipulation in Python.
isc_summary_sales = isc.groupby(['country', 'sales_channel', 'item_type'], as_index=False).aggregate({'order_id': ['count'], 'units_sold': ['sum'], 'unit_cost': ['mean'], 'unit_price': ['mean'], 'total_cost': ['sum']})
The above code works just fine, but on trying to slice, let's say
isc_summary_sales.loc[:, 'country':'total_cost']
I get an error
UnsortedIndexError: 'Key length (1) was greater than MultiIndex lexsort depth (0)'
However with isc_summary_sales.iloc[:,0:7] it works fine.
I don't understand what this means. Why does it occur?
The reason it throws that error is that after you aggregate, your columns have two levels of indexing.
For example
import pandas as pd
df = pd.DataFrame({"a":[1, 1, 1, 2, 3, 2], "b":[1, 1, 3, 1, 2, 4], "c":[1, 2, 3, 1, 2, 4], "d":[1, 2, 3, 1, 2, 4]})
df_summary = df.groupby(["a", "b"], as_index=False).aggregate({"c":["mean", "sum"], "d":['sum']})
print(df_summary)
a b c d
mean sum sum
0 1 1 1.5 3 3
1 1 3 3.0 3 3
2 2 1 1.0 1 1
3 2 4 4.0 4 4
4 3 2 2.0 2 2
As you can see, you no longer have the simple columns "a", "b", "c" and "d"; instead you have multilevel columns. It seems like the loc method requires the column index to be lexically sorted, and when we aggregated the original DataFrame we created new columns that are no longer sorted. We can, however, sort them again with:
df_summary = df_summary.sort_index(axis=1)
# And now this works
print(df_summary.loc[:, "b" : "d"])
b c d
mean sum sum
0 1 1.5 3 3
1 3 3.0 3 3
2 1 1.0 1 1
3 4 4.0 4 4
4 2 2.0 2 2
You may also want to reduce the columns by one level. I can do this with:
df_summary.columns = [col[0] if col[1] == '' else '_'.join(col) for col in df_summary.columns]
# Which makes my DataFrame look like this
print(df_summary)
a b c_mean c_sum d_sum
0 1 1 1.5 3 3
1 1 3 3.0 3 3
2 2 1 1.0 1 1
3 2 4 4.0 4 4
4 3 2 2.0 2 2
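As a quick check (a sketch continuing this example), label-based slicing with loc now works on the flattened single-level columns:
print(df_summary.loc[:, 'b':'c_sum'])  # selects the 'b', 'c_mean' and 'c_sum' columns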
More information about MultiIndex indexing can be found here: https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html

How to create conditional columns in a pandas DataFrame in which column values are based on other columns

I am new to Python; I am attempting what would be a conditional mutate in R's dplyr.
In short, I would like to create a new column in the DataFrame called Result where: if df['test'] is greater than 1, df['Result'] equals the respective df['count'] for that row; if it is lower than 1, then df['Result'] is
df['count'] * df['test']
I have tried df['Result'] = df['test'].apply(lambda x: df['count'] if x >= 1 else ...). Unfortunately this results in a Series; I have also attempted to write small functions, which also return a Series.
I would like the final DataFrame to look like this...
no_ Test Count Result
1 2 1 1
2 3 5 5
3 4 1 1
4 6 2 2
5 0.5 2 1
You can use np.where:
import numpy as np
df['Result'] = np.where(df['Test'] > 1, df['Count'], df['Count'] * df['Test'])
Output:
No_ Test Count Result
0 1 2.0 1 1.0
1 2 3.0 5 5.0
2 3 4.0 1 1.0
3 4 6.0 2 2.0
4 5 0.5 2 1.0
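For comparison, the apply attempt in the question fails because df['count'] inside the lambda refers to the whole column, so every element maps to an entire Series. A row-wise version needs apply with axis=1 (a sketch using the capitalized column names from this answer; np.where above remains the faster, more idiomatic choice):
df['Result'] = df.apply(lambda r: r['Count'] if r['Test'] > 1 else r['Count'] * r['Test'], axis=1)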
You can work it out with a list comprehension:
df['Result'] = [df['count'][i] if df['test'][i] > 1 else
                df['count'][i] * df['test'][i]
                for i in range(df.shape[0])]
Here is a way to do this:
import pandas as pd
df = pd.DataFrame(columns = ['Test', 'Count'],
data={'Test':[2, 3, 4, 6, 0.5], 'Count':[1, 5, 1, 2, 2]})
df['Result'] = df['Count']
df.loc[df['Test'] < 1, 'Result'] = df['Test'] * df['Count']
Output:
Test Count Result
0 2.0 1 1.0
1 3.0 5 5.0
2 4.0 1 1.0
3 6.0 2 2.0
4 0.5 2 1.0

DataFrame calculating average purchase price

I have a dataframe with two columns: quantity and price.
df = pd.DataFrame([
[ 1, 5],
[-1, 6],
[ 2, 3],
[-1, 2],
[-1, 4],
[ 1, 2],
[ 1, 3],
[ 1, 4],
[-2, 5]], columns=['quantity', 'price'])
df['amount'] = df['quantity'] * df['price']
df['cum_qty'] = df['quantity'].cumsum()
I have added two new columns amount and cum_qty (cumulative quantity).
Now dataframe looks like this (positive quantity represents buys, negative quantity represents sells):
quantity price amount cum_qty
0 1 5 5 1
1 -1 6 -6 0
2 2 3 6 2
3 -1 2 -2 1
4 -1 4 -4 0
5 1 2 2 1
6 1 3 3 2
7 1 4 4 3
8 -2 5 -10 1
I would like to calculate average buy price.
Every time cum_qty = 0, quantity and amount should be reset to zero.
So we are looking at rows with index = [5, 6, 7].
In each of those rows one item is bought, at prices 2, 3 and 4, which means I have 3 items on stock at an average price of 3 [(2 + 3 + 4) / 3].
After the sell at index = 8 has happened (sell transactions don't change the buy price), I will have one item at price 3.
So, basically, I have to divide the cumulative buy amounts by the cumulative quantities, starting after the last point where the cumulative quantity was zero.
How to calculate buy on hand as result of all transactions with pandas DataFrame?
Here is a different solution using a loop:
import pandas as pd
import numpy as np
# Original data
df = pd.DataFrame({
'quantity': [ 1, -1, 2, -1, -1, 1, 1, 1, -2],
'price': [5, 6, 3, 2, 4, 2, 3, 4, 5]
})
# Process the data and add the new columns
df['amount'] = df['quantity'] * df['price']
df['cum_qty'] = df['quantity'].cumsum()
df['prev_cum_qty'] = df['cum_qty'].shift(1, fill_value=0)
df['average_price'] = np.nan
for i, row in df.iterrows():
    if row['quantity'] > 0:
        df.iloc[i, df.columns == 'average_price'] = (
            row['amount'] +
            df['average_price'].shift(1, fill_value=df['price'][0])[i] *
            df['prev_cum_qty'][i]
        ) / df['cum_qty'][i]
    else:
        df.iloc[i, df.columns == 'average_price'] = df['average_price'][i - 1]
df = df.drop('prev_cum_qty', axis=1)
An advantage of this approach is that it will also work if there are new buys
before the cum_qty gets to zero. As an example, suppose there was a new buy
of 5 at the price of 3, that is, run the following line before processing the
data:
# Add more data, exemplifying a different situation
df = pd.concat([df, pd.DataFrame([{'quantity': 5, 'price': 3}])], ignore_index=True)
I would expect the following result:
quantity price amount cum_qty average_price
0 1 5 5 1 5.0
1 -1 6 -6 0 5.0
2 2 3 6 2 3.0
3 -1 2 -2 1 3.0
4 -1 4 -4 0 3.0
5 1 2 2 1 2.0
6 1 3 3 2 2.5
7 1 4 4 3 3.0
8 -2 5 -10 1 3.0
9 5 3 15 6 3.0 # Not 4.0
That is, since there was still 1 item bought at the price 3, the cum_qty is now 6, and the average price is still 3.
Based on my understanding, you need the buy price for each trading cycle; then you can try this:
df['new_index'] = df.cum_qty.eq(0).shift().cumsum().fillna(0.)  # assign a group id to each trading cycle
df = df.loc[df.quantity > 0]  # drop the sell rows
df.groupby('new_index').apply(lambda x: (x.amount.sum() / x.quantity.sum()))
new_index
0.0 5.0  # 1st average price: 5
1.0 3.0  # 2nd average price: 3
2.0 3.0  # 3rd average price: 3; this cycle has not ended yet, you still hold a position of 1
dtype: float64
EDIT1: for your additional requirement:
DF=df.groupby('new_index',as_index=False).apply(lambda x : x.amount.cumsum()/ x.cum_qty).reset_index()
DF.columns=['Index','AvePrice']
DF.index=DF.level_1
DF.drop(['level_0', 'level_1'],axis=1,inplace=True)
pd.concat([df,DF],axis=1)
Out[572]:
quantity price amount cum_qty new_index 0
level_1
0 1 5 5 1 0.0 5.0
2 2 3 6 2 1.0 3.0
5 1 2 2 1 2.0 2.0
6 1 3 3 2 2.0 2.5
7 1 4 4 3 2.0 3.0
df[df['cum_qty'].map(lambda x: x == 0)].index
will give you the rows at which cum_qty is 0.
df[df['cum_qty'].map(lambda x: x == 0)].index.max()
gives you the last row with a cum_qty of 0.
start = df[df['cum_qty'].map(lambda x: x == 0)].index.max() + 1
end = len(df) - 1
gives you the start and end row numbers of the range you are referring to.
df['price'][start:end].sum() / df['quantity'][start:end].sum()
gives you the answer you calculated in the example you gave.
If you want to know this value for each occurrence of cum_qty 0, you can apply the same start/end logic using the index of each occurrence (the result of my first line of code).
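Putting those steps together (a sketch that mirrors the lines above; it relies on every open-position buy having quantity 1, as in the example, which is why summing price rather than amount works here):
last_zero = df[df['cum_qty'] == 0].index.max()  # last row where the position was flat
start, end = last_zero + 1, len(df) - 1         # open-position buys, excluding the final sell
avg_buy_price = df['price'][start:end].sum() / df['quantity'][start:end].sum()  # 3.0 for the sample data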
