I am using Python3 pandas to read a CSV file which contains 4 columns, named {a,b,c,d}.
Now I want to add a new column e where the data is given by (d-last.d)/last.d.
How can I do it?
Use Series.sub with Series.div, and iat to select the last value:
df = pd.DataFrame({
'a':[4,5,4,5,5,4],
'b':[7,8,9,4,2,3],
'c':[1,3,5,7,1,0],
'd':[5,3,6,9,2,10],
})
df['e'] = df['d'].sub(df['d'].iat[-1]).div(df['d'].iat[-1])
print (df)
a b c d e
0 4 7 1 5 -0.5
1 5 8 3 3 -0.7
2 4 9 5 6 -0.4
3 5 4 7 9 -0.1
4 5 2 1 2 -0.8
5 4 3 0 10 0.0
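The same computation can also be written with plain arithmetic operators, which some find easier to read (a minimal sketch using the sample frame above):

```python
import pandas as pd

df = pd.DataFrame({
    'a': [4, 5, 4, 5, 5, 4],
    'b': [7, 8, 9, 4, 2, 3],
    'c': [1, 3, 5, 7, 1, 0],
    'd': [5, 3, 6, 9, 2, 10],
})

# select the last value of d by position
last = df['d'].iat[-1]
# elementwise (d - last) / last
df['e'] = (df['d'] - last) / last
print(df)
```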
Related
Having two data frames:
df1 = pd.DataFrame({'a':[1,2,3],'b':[4,5,6]})
a b
0 1 4
1 2 5
2 3 6
df2 = pd.DataFrame({'c':[7],'d':[8]})
c d
0 7 8
The goal is to append all df2 columns to df1, with df2's single row repeated for every row of df1, producing the following result. It is assumed that the two data frames share no column names.
a b c d
0 1 4 7 8
1 2 5 7 8
2 3 6 7 8
If the column names are strings, you can use DataFrame.assign, unpacking the Series created by selecting the first row of df2:
df = df1.assign(**df2.iloc[0])
print (df)
a b c d
0 1 4 7 8
1 2 5 7 8
2 3 6 7 8
Another idea is to repeat the values over df1.index with DataFrame.reindex and then use DataFrame.join (this assumes the first index value of df2 matches the first index value of df1.index):
df = df1.join(df2.reindex(df1.index, method='ffill'))
print (df)
a b c d
0 1 4 7 8
1 2 5 7 8
2 3 6 7 8
If the original df1 has no missing values, you can instead forward-fill the missing values as a last step, but note the joined columns are upcast to floats (thanks @Dishin H Goyan):
df = df1.join(df2).ffill()
print (df)
a b c d
0 1 4 7.0 8.0
1 2 5 7.0 8.0
2 3 6 7.0 8.0
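A further option, assuming pandas 1.2 or newer, is a cross join: because df2 has exactly one row, merge(how='cross') pairs it with every row of df1 and, unlike the forward-fill approach, keeps the integer dtypes:

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df2 = pd.DataFrame({'c': [7], 'd': [8]})

# pair every row of df1 with the single row of df2
df = df1.merge(df2, how='cross')
print(df)
```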
I have two dataframes as follows:
df1:
A B C D E
0 8 6 4 9 7
1 2 6 3 8 5
2 0 7 6 5 8
df2:
M N O P Q R S T
0 1 2 3
1 4 5 6
2 7 8 9
3 8 6 5
4 5 4 3
I have taken out a slice of data from df1 as follows:
>data_1 = df1.loc[0:1]
>data_1
A B C D E
0 8 6 4 9 7
1 2 6 3 8 5
Now I need to insert data_1 into df2 at the specific location (0, 'P') (row, column). Is there any way to do it without disturbing the other columns in df2?
I could extract the individual values and set them cell by cell, but since I have to do this for a large dataset, cell-wise assignment is not feasible.
Cellwise method:
>var1 = df1.iat[0,1]
>var2 = df1.iat[0,0]
>df2.at[0, 'P'] = var1
>df2.at[0, 'Q'] = var2
If you specify all the columns, it is possible to do it as follows:
df2.loc[0:1, ['P', 'Q', 'R', 'S', 'T']] = df1.loc[0:1].values
Resulting dataframe:
M N O P Q R S T
0 1 2 3 8.0 6.0 4.0 9.0 7.0
1 4 5 6 2.0 6.0 3.0 8.0 5.0
2 7 8 9 NaN NaN NaN NaN NaN
3 8 6 5 NaN NaN NaN NaN NaN
4 5 4 3 NaN NaN NaN NaN NaN
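If the target is known only as a (row label, column label) pair, the slice bounds can be derived positionally with get_loc instead of spelling out the column list; a sketch, assuming the slice fits inside df2:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame([[8, 6, 4, 9, 7],
                    [2, 6, 3, 8, 5]], columns=list('ABCDE'))
df2 = pd.DataFrame(np.nan, index=range(5), columns=list('MNOPQRST'))
df2[['M', 'N', 'O']] = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [8, 6, 5], [5, 4, 3]]

row, col = 0, 'P'                       # top-left target cell
r = df2.index.get_loc(row)              # positional row offset
c = df2.columns.get_loc(col)            # positional column offset
n_rows, n_cols = df1.shape
df2.iloc[r:r + n_rows, c:c + n_cols] = df1.values
print(df2)
```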
You can rename the columns and index of the slice to match the second DataFrame, then use DataFrame.update, with the target position specified by the tuple pos:
data_1 = df1.loc[0:1]
print (data_1)
A B C D E
0 8 6 4 9 7
1 2 6 3 8 5
pos = (2, 'P')
data_1 = data_1.rename(columns=dict(zip(data_1.columns, df2.loc[:, pos[1]:].columns)),
index=dict(zip(data_1.index, df2.loc[pos[0]:].index)))
print (data_1)
P Q R S T
2 8 6 4 9 7
3 2 6 3 8 5
df2.update(data_1)
print (df2)
M N O P Q R S T
0 1 2 3 NaN NaN NaN NaN NaN
1 4 5 6 NaN NaN NaN NaN NaN
2 7 8 9 8.0 6.0 4.0 9.0 7.0
3 8 6 5 2.0 6.0 3.0 8.0 5.0
4 5 4 3 NaN NaN NaN NaN NaN
How the rename works: the idea is to select all columns and all index values of df2 from the specified column and index label onwards with loc, zip them with data_1's column and index labels, and convert the pairs to dictionaries. The last step then replaces both the index and the column names of data_1 with the target values in df2.
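The two mapping dictionaries that zip builds can be inspected in isolation; with pos = (2, 'P') and the frames above they come out as follows (a plain-Python sketch of the intermediate step):

```python
df2_cols = list('MNOPQRST')     # columns of df2
data_1_cols = list('ABCDE')     # columns of data_1

pos = (2, 'P')
# df2's columns from 'P' onwards, zipped with data_1's columns
target_cols = df2_cols[df2_cols.index(pos[1]):]
col_map = dict(zip(data_1_cols, target_cols))

# df2's index values from 2 onwards, zipped with data_1's index [0, 1]
idx_map = dict(zip([0, 1], range(pos[0], 5)))

print(col_map)   # {'A': 'P', 'B': 'Q', 'C': 'R', 'D': 'S', 'E': 'T'}
print(idx_map)   # {0: 2, 1: 3}
```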
Say I have an incomplete dataset in a Pandas DataFrame such as:
incData = pd.DataFrame({'comp': ['A']*3 + ['B']*5 + ['C']*4,
'x': [1,2,3] + [1,2,3,4,5] + [1,2,3,4],
'y': [3,None,7] + [1,4,7,None,None] + [4,None,2,1]})
And also a DataFrame with fitting parameters that I could use to fill holes:
fitTable = pd.DataFrame({'slope': [2,3,-1],
'intercept': [1,-2,5]},
index=['A','B','C'])
I would like to achieve the following using y=x*slope+intercept for the None entries only:
comp x y
0 A 1 3.0
1 A 2 5.0
2 A 3 7.0
3 B 1 1.0
4 B 2 4.0
5 B 3 7.0
6 B 4 10.0
7 B 5 13.0
8 C 1 4.0
9 C 2 3.0
10 C 3 2.0
11 C 4 1.0
One way I envisioned is by using join and drop:
incData = incData.join(fitTable,on='comp')
incData.loc[incData['y'].isnull(),'y'] = incData[incData['y'].isnull()]['x']*\
incData[incData['y'].isnull()]['slope']+\
incData[incData['y'].isnull()]['intercept']
incData.drop(['slope','intercept'], axis=1, inplace=True)
However, that does not seem very efficient, because it adds and then removes columns. It seems that I am making this too complicated; am I overlooking a simpler, more direct solution? Something more like this non-functional code:
incData.loc[incData['y'].isnull(),'y'] = incData[incData['y'].isnull()]['x']*\
fitTable[incData[incData['y'].isnull()]['comp']]['slope']+\
fitTable[incData[incData['y'].isnull()]['comp']]['intercept']
I am pretty new to Pandas, so I sometimes get a bit mixed up with the strict indexing rules...
You can use map on the column 'comp', after masking the null values in 'y', like:
mask = incData['y'].isna()
incData.loc[mask, 'y'] = incData.loc[mask, 'x']*\
incData.loc[mask,'comp'].map(fitTable['slope']) +\
incData.loc[mask,'comp'].map(fitTable['intercept'])
As for your non-functional code, a working version would be something like:
incData.loc[mask,'y'] = incData.loc[mask, 'x']*\
fitTable.loc[incData.loc[mask, 'comp'],'slope'].to_numpy()+\
fitTable.loc[incData.loc[mask, 'comp'],'intercept'].to_numpy()
IIUC:
incData.loc[pd.isna(incData['y']), 'y'] = incData[pd.isna(incData['y'])].apply(
    lambda row: row['x'] * fitTable.loc[row['comp'], 'slope']
                + fitTable.loc[row['comp'], 'intercept'],
    axis=1)
incData
comp x y
0 A 1 3.0
1 A 2 5.0
2 A 3 7.0
3 B 1 1.0
4 B 2 4.0
5 B 3 7.0
6 B 4 10.0
7 B 5 13.0
8 C 1 4.0
9 C 2 3.0
10 C 3 2.0
11 C 4 1.0
merge is another option
# merge two dataframe together on comp
m = incData.merge(fitTable, left_on='comp', right_index=True)
# y = mx+b
m['y'] = m['x']*m['slope']+m['intercept']
comp x y slope intercept
0 A 1 3 2 1
1 A 2 5 2 1
2 A 3 7 2 1
3 B 1 1 3 -2
4 B 2 4 3 -2
5 B 3 7 3 -2
6 B 4 10 3 -2
7 B 5 13 3 -2
8 C 1 4 -1 5
9 C 2 3 -1 5
10 C 3 2 -1 5
11 C 4 1 -1 5
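If the helper columns should not survive, and the observed y values should be kept rather than recomputed, a variant of the merge approach fills only the gaps and then drops slope and intercept (a sketch, not the answer's exact code):

```python
import pandas as pd

incData = pd.DataFrame({'comp': ['A']*3 + ['B']*5 + ['C']*4,
                        'x': [1, 2, 3] + [1, 2, 3, 4, 5] + [1, 2, 3, 4],
                        'y': [3, None, 7] + [1, 4, 7, None, None] + [4, None, 2, 1]})
fitTable = pd.DataFrame({'slope': [2, 3, -1],
                         'intercept': [1, -2, 5]},
                        index=['A', 'B', 'C'])

m = incData.merge(fitTable, left_on='comp', right_index=True)
# fill only the missing y values from the fit, keep observed ones
m['y'] = m['y'].fillna(m['x'] * m['slope'] + m['intercept'])
out = m.drop(columns=['slope', 'intercept'])
print(out)
```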
I would like to discard every cell that contains a value below a given threshold: not just whole rows or whole columns, but individual cells.
I tried the code below, intending that every cell value be at least 3. It doesn't work.
df[(df >= 3).any(axis=1)]
Example
import pandas as pd
my_dict = {'A':[1,5,6,2],'B':[9,9,1,2],'C':[1,1,3,5]}
df = pd.DataFrame(my_dict)
df
A B C
0 1 9 1
1 5 9 1
2 6 1 3
3 2 2 5
I want to keep only the cells that are at least 3.
If you want "all values in each cell should be at least 3"
df[df < 3] = 3
df
A B C
0 3 9 3
1 5 9 3
2 6 3 3
3 3 3 5
If you want "to keep only the cells that are at least 3"
df = df[df >= 3]
df
A B C
0 NaN 9.0 NaN
1 5.0 9.0 NaN
2 6.0 NaN 3.0
3 3.0 3.0 5.0
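The boolean-mask selection df[df >= 3] is equivalent to DataFrame.where, which makes the replacement value explicit; DataFrame.clip covers the "at least 3" reading without introducing NaN:

```python
import pandas as pd

my_dict = {'A': [1, 5, 6, 2], 'B': [9, 9, 1, 2], 'C': [1, 1, 3, 5]}
df = pd.DataFrame(my_dict)

# cells failing the condition become NaN (the default `other` value)
masked = df.where(df >= 3)
# cap from below instead: same result as df[df < 3] = 3
clipped = df.clip(lower=3)
print(masked)
print(clipped)
```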
You can mask the values that are >= 3, then drop every row that contains a NaN:
df[df >= 3].dropna()
DEMO:
import pandas as pd
my_dict = {'A':[1,5,6,3],'B':[9,9,1,3],'C':[1,1,3,5]}
df = pd.DataFrame(my_dict)
df
A B C
0 1 9 1
1 5 9 1
2 6 1 3
3 3 3 5
df = df[df >= 3].dropna().reset_index(drop=True)
df
A B C
0 3.0 3.0 5.0
I have this pandas Dataframe :
A B C
20 6 7
5 3.8 9
34 4 1
I want to split a row into multiple rows when its value in A exceeds 10: repeated rows with A capped at 10, plus a final row carrying the remainder.
So the Dataframe should finally look like:
A B C
10 6 7
10 6 7
5 3.8 9
10 4 1
10 4 1
10 4 1
4 4 1
Is there a way in pandas to do this elegantly? Or I will have to loop over rows and do it manually..?
I have already browsed similar queries on StackOverflow, but none of them does exactly what I want.
Use:
#create default index
df = df.reset_index(drop=True)
#get floor and modulo divisions
a = df['A'] // 10
b = (df['A'] % 10)
#repeat once where the modulo is non-zero, to keep a remainder row
df2 = df.loc[df.index.repeat(b.ne(0).astype(int))]
#replace values of A with the remainders, mapped by index
df2['A'] = df2.index.map(b.get)
#repeat floor-division times with the scalar 10 assigned to A
df1 = df.loc[df.index.repeat(a)].assign(A=10)
#join together (append is deprecated, use concat), sort index and create default RangeIndex
df = pd.concat([df1, df2]).sort_index().reset_index(drop=True)
print (df)
A B C
0 10 6.0 7
1 10 6.0 7
2 5 3.8 9
3 10 4.0 1
4 10 4.0 1
5 10 4.0 1
6 4 4.0 1
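The same result can be built more compactly with divmod-style arithmetic and numpy (an alternative sketch, not the answer's exact code): repeat each row once per output chunk, then rebuild A from the chunk values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [20, 5, 34], 'B': [6, 3.8, 4], 'C': [7, 9, 1]})

q, r = df['A'] // 10, df['A'] % 10
# output rows per input row: full tens plus one remainder row if any
counts = q + r.ne(0)

out = df.loc[df.index.repeat(counts)].reset_index(drop=True)
# per input row: `q` tens followed by the remainder (if non-zero)
out['A'] = np.concatenate([[10] * qi + ([ri] if ri else [])
                           for qi, ri in zip(q, r)])
print(out)
```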