I have the following pandas data frame:
code:
df = pd.DataFrame({'A1': [0.1,0.5,3.0, 9.0], 'A2':[2.0,4.5,1.2,9.0],'Random data':[300,4500,800,900],'Random data2':[3000,450,80,90]})
output:
A1 A2 Randomdata Randomdata2
0 0.1 2.0 300 3000
1 0.5 4.5 4500 450
2 3.0 1.2 800 80
3 9.0 9.0 900 90
It is only showing A1 and A2 but it actually goes from A1 to A30 of data. I want to calculate the average and standard deviation for each row but only columns A1 to A30 (not including the columns Randomdata and Randomdata2) and add 2 new columns with the average and standard deviation like shown below.
A1 A2 Randomdata Randomdata2 Average Stddev
0 0.1 2.0 300 3000
1 0.5 4.5 4500 450
2 3.0 1.2 800 80
3 9.0 9.0 900 90
Preferred Approach
Use pd.DataFrame.filter
Your choice for regex pattern can be as explicit as you'd like. In this case, I specified that the column must start with 'A' and have 1 or more digits afterwards.
d = df.filter(regex=r'^A\d+')
df.assign(Average=d.mean(axis=1), Stddev=d.std(axis=1))
A1 A2 Random data Random data2 Average Stddev
0 0.1 2.0 300 3000 1.05 1.343503
1 0.5 4.5 4500 450 2.50 2.828427
2 3.0 1.2 800 80 2.10 1.272792
3 9.0 9.0 900 90 9.00 0.000000
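For anyone who wants to run the preferred approach end to end, here is a minimal, self-contained sketch. It assumes the four-row sample from the question (only A1 and A2 rather than the full A1 to A30), and the names a_cols and result are mine, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'A1': [0.1, 0.5, 3.0, 9.0],
    'A2': [2.0, 4.5, 1.2, 9.0],
    'Random data': [300, 4500, 800, 900],
    'Random data2': [3000, 450, 80, 90],
})

# Keep only columns named 'A' followed by digits (A1, A2, ..., A30).
a_cols = df.filter(regex=r'^A\d+$')

# Row-wise mean and sample standard deviation (ddof=1 is the pandas default).
result = df.assign(Average=a_cols.mean(axis=1), Stddev=a_cols.std(axis=1))
print(result)
```

The trailing $ in the pattern is an extra precaution so that a column such as 'A1_old' would not be picked up; drop it if you do want such columns included.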
Alt 1
This is trying too hard.
rnm = dict(mean='Average', std='Stddev')
df.join(df.T[df.columns.str.startswith('A')].agg(['mean', 'std']).T.rename(columns=rnm))
A1 A2 Random data Random data2 Average Stddev
0 0.1 2.0 300 3000 1.05 1.343503
1 0.5 4.5 4500 450 2.50 2.828427
2 3.0 1.2 800 80 2.10 1.272792
3 9.0 9.0 900 90 9.00 0.000000
Related
I have a DataFrame similar to this:
MACD
0 -2.3
1 -0.3
2 0.8
3 0.1
4 0.6
5 -0.7
6 1.1
7 2.4
How can I add an extra column showing the number of rows since MACD was on the opposite side of the origin (positive/negative)?
Desired Outcome:
MACD RowsSince
0 -2.3 NaN
1 -0.3 NaN
2 0.8 1
3 0.1 2
4 0.6 3
5 -0.7 1
6 1.1 1
7 2.4 2
We can use np.sign with diff to create the subgroups, then groupby + cumcount:
s = np.sign(df['MACD']).diff().ne(0).cumsum()
df['new'] = (df.groupby(s).cumcount()+1).mask(s.eq(1))
df
Out[80]:
MACD new
0 -2.3 NaN
1 -0.3 NaN
2 0.8 1.0
3 0.1 2.0
4 0.6 3.0
5 -0.7 1.0
6 1.1 1.0
7 2.4 2.0
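Broken into steps, the same idea can be sketched as follows. This is a self-contained version using the sample data from the question; the names run_id and RowsSince are my own choices.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'MACD': [-2.3, -0.3, 0.8, 0.1, 0.6, -0.7, 1.1, 2.4]})

# A new run starts wherever the sign differs from the previous row.
# The first diff is NaN, which also counts as "not equal to 0",
# so the run ids start at 1.
run_id = np.sign(df['MACD']).diff().ne(0).cumsum()

# Count rows within each run, then blank out the first run,
# because no earlier sign change exists for it.
df['RowsSince'] = (df.groupby(run_id).cumcount() + 1).mask(run_id.eq(1))
print(df)
```

Note that a MACD of exactly 0 has sign 0 and would form its own run under this logic, so decide beforehand how zeros should be treated.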
I would like to copy a single row of dataframe2 to multiple rows in dataframe 1 based on postcode matching
Example dataframe1
postcode price type
1 2000 150 A
2 2000 250 B
3 2001 350 C
4 2001 550 A
5 2001 650 B
6 2004 750 C
Example dataframe2
postcode lat lon
1 2000 1.2 1.2
2 2001 1.3 1.5
3 2002 1.5 1.2
4 2003 1.6 1.5
5 2004 1.7 1.8
6 2005 1.9 1.98
7 2006 1.2 1.2
8 2007 1.3 1.5
9 2008 1.5 1.2
10 2009 1.6 1.5
11 2010 1.7 1.8
12 2011 1.9 1.98
Merged final dataframe according to postcode, with unnecessary data from dataframe2 discarded
postcode price type lat lon
1 2000 150 A 1.2 1.2
2 2000 250 B 1.2 1.2
3 2001 350 C 1.3 1.5
4 2001 550 A 1.3 1.5
5 2001 650 B 1.3 1.5
6 2004 750 C 1.7 1.8
Please note I do not want to use geopandas or gmaps api, I want this merged as simple as possible using an if statement or something similar.
Thanks
Yes, there is a very simple way of doing this. You can use the pandas merge function:
final = pd.merge(df1,df2,on='postcode')
returns
postcode price type lat lon
0 2000 150 A 1.2 1.2
1 2000 250 B 1.2 1.2
2 2001 350 C 1.3 1.5
3 2001 550 A 1.3 1.5
4 2001 650 B 1.3 1.5
5 2004 750 C 1.7 1.8
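One thing worth knowing: pd.merge performs an inner join by default, so any row of df1 whose postcode is missing from df2 is dropped. If you would rather keep such rows with empty lat/lon, how='left' does that. Here is a small sketch, assuming a shortened df2 and one made-up postcode (2999) added to df1 purely to show the difference.

```python
import pandas as pd

df1 = pd.DataFrame({
    # 2999 is a hypothetical postcode with no match in df2
    'postcode': [2000, 2000, 2001, 2001, 2001, 2004, 2999],
    'price': [150, 250, 350, 550, 650, 750, 850],
    'type': ['A', 'B', 'C', 'A', 'B', 'C', 'A'],
})
df2 = pd.DataFrame({
    'postcode': [2000, 2001, 2002, 2003, 2004],
    'lat': [1.2, 1.3, 1.5, 1.6, 1.7],
    'lon': [1.2, 1.5, 1.2, 1.5, 1.8],
})

inner = pd.merge(df1, df2, on='postcode')             # drops the 2999 row
left = pd.merge(df1, df2, on='postcode', how='left')  # keeps it, lat/lon become NaN
print(left)
```

Either way, postcodes that appear only in df2 (2002 and 2003 here) never reach the result, which matches the requirement to discard the unneeded rows.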
How do I write Python code to calculate values for a new column, "UEC_saving", by offsetting a given number of rows in the "UEC" column? The table is a pandas DataFrame.
A positive number in the "Offset_rows" column means shift rows down, and vice versa.
For example, "UEC_saving" for index 0 is 8.6-7.2 and for index 2 is 0.2-7.2
The output for "UEC_saving" should look like this:
Product_Class UEC Offset_rows UEC_saving
0 PC1 8.6 1 1.4
1 PC1 7.2 0 0.0
2 PC1 0.2 -1 -7.0
3 PC2 18.8 2 2.2
4 PC2 10.0 1 1.4
5 PC2 8.6 0 0.0
6 PC2 0.3 -1 -8.3
You can do:
for i, row in df.iterrows():
    # iterrows may return Offset_rows as a float, so cast it before using it as a label
    df.at[i, 'UEC_saving'] = row['UEC'] - df.at[i + int(row['Offset_rows']), 'UEC']
df
UEC Offset_rows UEC_saving
0 8.6 1 1.4
1 7.2 0 0.0
2 0.2 -1 -7.0
3 18.8 2 10.2
4 10.0 1 1.4
5 8.6 0 0.0
6 0.3 -1 -8.3
This lines up with all your desired answers except for index 3. Please let me know if there was a typo, or explain further in your question exactly what occurs there.
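If the frame is large, the row-by-row loop can be replaced by positional indexing with NumPy. The following is only a sketch: it assumes the default 0..n-1 index and that every offset points at a row inside the frame, and the names uec and target are mine.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'UEC': [8.6, 7.2, 0.2, 18.8, 10.0, 8.6, 0.3],
    'Offset_rows': [1, 0, -1, 2, 1, 0, -1],
})

uec = df['UEC'].to_numpy()
# Position of the row to subtract: own position plus the offset.
target = np.arange(len(df)) + df['Offset_rows'].to_numpy()

# Assumption: every target lies inside the frame. Fail loudly otherwise,
# because NumPy would silently wrap negative positions around.
if target.min() < 0 or target.max() >= len(df):
    raise IndexError('an offset points outside the DataFrame')

# Round to hide floating-point noise such as 1.4000000000000004.
df['UEC_saving'] = (uec - uec[target]).round(2)
print(df)
```

This gives the same values as the loop, including 10.2 at index 3.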
Suppose I have the following pandas dataframe and need to rank the values within each row, storing the ranks in new columns (if I rank 4 columns, I will create 4 new columns).
The dataframe below has three numerical columns, so I need to create three new columns, each holding the rank of that column's value within its row:
Revenue SaleCount salesprices ranka rankb rankc
300 10 8000 2 1 3
100 9000 1000 1 3 2
How can I do that with simple code, using a for loop?
Thanks in advance
import pandas as pd
df = pd.DataFrame({'Revenue':[300,9000,1000,750,500,2000,0,600,50,500],
'Date':['2016-12-02' for i in range(10)],
'SaleCount':[10,100,30,35,20,100,0,30,2,20],
'salesprices':[8000,1000,500,700,2500,3800,16,7400,3200,21]})
print(df)
We can write a loop with string.ascii_lowercase and create each new column from the rank over axis=1:
import string
cols = ['Revenue', 'SaleCount', 'salesprices']
for index, col in enumerate(cols):
    df[f'rank{string.ascii_lowercase[index]}'] = df[cols].rank(axis=1)[col]
Output:
print(df)
Revenue Date SaleCount salesprices ranka rankb rankc
0 300 2016-12-02 10 8000 2.0 1.0 3.0
1 9000 2016-12-02 100 1000 3.0 1.0 2.0
2 1000 2016-12-02 30 500 3.0 1.0 2.0
3 750 2016-12-02 35 700 3.0 1.0 2.0
4 500 2016-12-02 20 2500 2.0 1.0 3.0
5 2000 2016-12-02 100 3800 2.0 1.0 3.0
6 0 2016-12-02 0 16 1.5 1.5 3.0
7 600 2016-12-02 30 7400 2.0 1.0 3.0
8 50 2016-12-02 2 3200 2.0 1.0 3.0
9 500 2016-12-02 20 21 3.0 1.0 2.0
Note that I used an f-string, which is only supported in Python 3.6 and later.
Otherwise, use .format string formatting like the following:
import string
cols = ['Revenue', 'SaleCount', 'salesprices']
for index, col in enumerate(cols):
    df['rank{}'.format(string.ascii_lowercase[index])] = df[cols].rank(axis=1)[col]
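The loop above recomputes df[cols].rank(axis=1) on every pass. If the for loop is not a hard requirement, a variant that ranks once and renames the result might look like this. It is a sketch on the first three rows of the sample data; the variable ranks is my own name.

```python
import string
import pandas as pd

df = pd.DataFrame({'Revenue': [300, 9000, 1000],
                   'SaleCount': [10, 100, 30],
                   'salesprices': [8000, 1000, 500]})
cols = ['Revenue', 'SaleCount', 'salesprices']

# Rank once across each row, then rename the columns to ranka, rankb, rankc.
ranks = df[cols].rank(axis=1)
ranks.columns = [f'rank{letter}' for letter in string.ascii_lowercase[:len(cols)]]

# join aligns on the index, so each rank lands next to its source row.
df = df.join(ranks)
print(df)
```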
I have a dataframe with multiple columns and rows.
For all columns, I need each row's value to equal 0.5 of that row plus 0.5 of the previous row's value.
I currently have a loop that works, but I feel there is a better way without using a loop. Does anyone have any thoughts?
dataframe = df_input
df_output = df_input.copy()
for i in range(1, df_input.shape[0]):
    try:
        df_output.iloc[[i]] = (df_input.iloc[[i-1]]*(1/2)).values + (df_input.iloc[[i]]*(1/2)).values
    except:
        pass
Do you mean something like this?
First, create some test data:
np.random.seed(42)
df = pd.DataFrame(np.random.randint(0, 20, [5, 3]), columns=['A', 'B', 'C'])
A B C
0 6 19 14
1 10 7 6
2 18 10 10
3 3 7 2
4 1 11 5
Your requested function:
(df*.5).rolling(2).sum()
A B C
0 NaN NaN NaN
1 8.0 13.0 10.0
2 14.0 8.5 8.0
3 10.5 8.5 6.0
4 2.0 9.0 3.5
EDIT:
For an unbalanced sum you can define an auxiliary function:
def weighted_mean(arr):
    return sum(arr*[.25, .75])
df.rolling(2).apply(weighted_mean, raw=True)
A B C
0 NaN NaN NaN
1 9.00 10.00 8.00
2 16.00 9.25 9.00
3 6.75 7.75 4.00
4 1.50 10.00 4.25
EDIT2:
...and if the weights should be set at runtime:
def weighted_mean(arr, weights=[.5, .5]):
    return sum(arr*weights/sum(weights))
With no additional argument it defaults to the balanced mean:
df.rolling(2).apply(weighted_mean, raw=True)
A B C
0 NaN NaN NaN
1 8.0 13.0 10.0
2 14.0 8.5 8.0
3 10.5 8.5 6.0
4 2.0 9.0 3.5
An unbalanced mean:
df.rolling(2).apply(weighted_mean, raw=True, args=[[.25, .75]])
A B C
0 NaN NaN NaN
1 9.00 10.00 8.00
2 16.00 9.25 9.00
3 6.75 7.75 4.00
4 1.50 10.00 4.25
Dividing by sum(weights) means the weights need not be fractions that sum to one; any ratio works:
df.rolling(2).apply(weighted_mean, raw=True, args=[[1, 3]])
A B C
0 NaN NaN NaN
1 9.00 10.00 8.00
2 16.00 9.25 9.00
3 6.75 7.75 4.00
4 1.50 10.00 4.25
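Since the window is only two rows wide, the same weighted means can also be computed without rolling().apply, by lining each row up with its predecessor via shift. The helper below is a sketch of that alternative, not part of the answer above; shifted_weighted_mean is a name I made up.

```python
import numpy as np
import pandas as pd

np.random.seed(42)
df = pd.DataFrame(np.random.randint(0, 20, [5, 3]), columns=['A', 'B', 'C'])

def shifted_weighted_mean(frame, weights=(0.5, 0.5)):
    """Weight the previous row by weights[0] and the current row by weights[1]."""
    w_prev, w_cur = np.asarray(weights, dtype=float) / sum(weights)
    # shift() moves every row down by one, so row i lines up with row i-1.
    return w_prev * frame.shift() + w_cur * frame

balanced = shifted_weighted_mean(df)
unbalanced = shifted_weighted_mean(df, weights=(1, 3))
print(unbalanced)
```

As with rolling(2), the first row comes out as NaN because it has no predecessor.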
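df.rolling(window=2, min_periods=1).apply(lambda x: 0.5*x[0] + 0.5*x[1] if len(x) > 1 else x[0], raw=True)
This will do the same operation for all columns.
Explanation: rolling processes each column separately. For every row i the lambda receives the window x = [this_col[i-1], this_col[i]] (only [this_col[0]] for the first row, because min_periods=1), so doing custom arithmetic on it is straightforward. Passing raw=True hands the lambda a plain NumPy array, so x[0] and x[1] are positional.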
Something like below?
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(low=0, high=10, size=(5, 1)), columns=['a'])
df["cumsum_a"] = 0.5*df["a"].cumsum() + 0.5*df["a"]