Adding values in a column by "formula" in a pandas dataframe - python

I am trying to add values to a column using a formula, using the information from this question: Is there a way in Pandas to use previous row value in dataframe.apply when previous value is also calculated in the apply?
I already have the first value of column B and I want to compute the rest of the column with a formula.
The dataframe looks something like this:
A B C
0.16 0.001433 25.775485
0.28 0 25.784443
0.28 0 25.792396
...
And the method I tried was:
for i in range(1, len(df)):
    df.loc[i, "B"] = df.loc[i-1, "B"] + df.loc[i, "A"] * (df.loc[i, "C"] - df.loc[i-1, "C"])
But this code produces what looks like an infinite loop; can someone help me with this?

You can use shift and a simple assignment.
The general rule in pandas: if you are using explicit loops, you are probably doing something wrong; row-by-row iteration is considered an anti-pattern.
df['B_new'] = df['B'].shift(-1) - df['A'] * ((df['C'] - df['C'].shift(-1)))
A B C B_new
0 0.16 0.001433 25.775485 0.001433
1 0.28 0.000000 25.784443 0.002227
2 0.28 0.000000 25.792396 NaN

Related

How to divide one dataframe by the other without converting to numpy first?

I have a dataframe with two columns, x and y, and a few hundred rows.
I have another dataframe with only one row and two columns, x and y.
I want to divide column x of the big dataframe by the value in x of the small dataframe, and column y by column y.
If I divide one dataframe by the other, I get all NaNs. For the division to work, I must convert the small dataframe to numpy.
Why can't I divide one dataframe by the other? What am I missing? I have a toy example below.
import numpy as np
import pandas as pd
df = pd.DataFrame()
r = 10
df['x'] = np.arange(0, r)
df['y'] = df['x'] * 2
other_df = pd.DataFrame()
other_df['x'] = [100]
other_df['y'] = [400]
# This doesn't work - I get all nans
new = df / other_df
# this works - it gives me what I want
new2 = df / [100,400]
# this also works
new3 = df / other_df.to_numpy()
You can convert the one-row DataFrame to a Series so that the columns align correctly, e.g. by selecting the first row with DataFrame.iloc:
new = df / other_df.iloc[0]
print (new)
x y
0 0.00 0.000
1 0.01 0.005
2 0.02 0.010
3 0.03 0.015
4 0.04 0.020
5 0.05 0.025
6 0.06 0.030
7 0.07 0.035
8 0.08 0.040
9 0.09 0.045
You can also use numpy.divide(), which relies on NumPy's broadcasting:
new = np.divide(df, other_df)
Note that recent pandas versions align pandas objects inside ufuncs too, so if this still yields NaNs, pass the underlying array instead: np.divide(df, other_df.to_numpy()).

Merging pandas dataframes, alternating rows without sorting rows

I'm trying to mimic an SPSS-style correlation table in my pandas output to make it easier to read for supervisors who are used to seeing matrices laid out this way (and who are annoyed that I don't use SPSS anymore because my output is harder for them to read).
This means a table where each p-value is placed directly above its correlation coefficient. I have produced both the p-values and the coefficients and saved each into a separate dataframe, like the ones below.
pvals
T 4 Rw Af
T |0.00|0.05|0.24|0.01
4 |0.05|0.00|0.76|0.03
Rw|0.24|0.76|0.00|0.44
...
rs
T 4 Rw Af
T |1.00|0.65|0.28|0.44
4 |0.65|1.00|0.01|0.03
Rw|-0.03|0.01|1.00|0.32
...
What I'd like to do is make a table where the two dataframes are merged without changing the order of the index. It would look like
T |P |0.00|0.05|0.24|0.01
  |r |1.00|0.65|0.28|0.44
4 |P |0.05|0.00|0.76|0.03
  |r |0.65|1.00|0.01|0.03
...
Now, I understand that if my columns had alphabetically ordered names I could use something like
pd.concat([pvals, rs]).sort_index(kind='merge')
However, my columns have descriptive, non-alphabetical names, so this doesn't work: it reorders the index alphabetically. I also know that
df.corr()
will produce a matrix like the rs example I've given above but this is not what I'm looking for.
If anyone has any advice I'd really appreciate it.
Kev
You can build a helper MultiIndex level with np.arange and DataFrame.set_index with append=True, use the keys parameter of pd.concat to tag the P and r values, sort by the helper level, drop it, and finally swap the remaining levels with DataFrame.swaplevel:
s1 = pvals.set_index(np.arange(len(pvals)), append=True)
s2 = rs.set_index(np.arange(len(rs)), append=True)
df = (pd.concat([s1, s2], keys=('P', 'r'))
        .sort_index(kind='merge', level=2)
        .reset_index(level=2, drop=True)
        .swaplevel(0, 1))
print (df)
T 4 Rw Af
T P 0.00 0.05 0.24 0.01
r 1.00 0.65 0.28 0.44
4 P 0.05 0.00 0.76 0.03
r 0.65 1.00 0.01 0.03
Rw P 0.24 0.76 0.00 0.44
r -0.03 0.01 1.00 0.32
Asker Edit
This answer worked once the code was changed to
s1 = pvals.assign(a = np.arange(len(pvals))).set_index('a', append=True)
s2 = rs.assign(a = np.arange(len(rs))).set_index('a', append=True)
df = (pd.concat([s1, s2], keys=('P', 'r'))
        .sort_index(kind='merge', level=2)
        .reset_index(level=2, drop=True)
        .swaplevel(0, 1))
which was recommended by the answerer.

iterating re.split() on a dataframe

I am trying to use re.split() to split a single variable in a pandas dataframe into two other variables.
My data looks like:
xg
0.05+0.43
0.93+0.05
0.00
0.11+0.11
0.00
3.94-2.06
I want to create
e a
0.05 0.43
0.93 0.05
0.00
0.11 0.11
0.00
3.94 2.06
I can do this using a for loop and indexing.
for i in range(len(df)):
    if df['xg'].str.len()[i] < 5:
        df['e'][i] = df['xg'][i]
    else:
        df['e'][i], df['a'][i] = re.split("[\+ \-]", df['xg'][i])
However, this is slow and I don't believe it is a good way of doing this; I am trying to improve my code and my Python understanding.
I made various attempts with np.where, a list comprehension, and apply with a lambda, but I can't get them to run. I think all my issues come from applying the functions to the whole series rather than to the positional value.
If anyone has an idea of a better method than my ugly for loop I would be very interested.
Borrowed from this answer using the str.split method with the expand argument:
https://stackoverflow.com/a/14745484/3084939
df = pd.DataFrame({'col': ['1+2','3+4','20','0.6-1.6']})
df[['left','right']] = df['col'].str.split(r'[+-]', expand=True)
df.head()
col left right
0 1+2 1 2
1 3+4 3 4
2 20 20 None
3 0.6-1.6 0.6 1.6
This may be what you want. Not sure it's elegant, but should be faster than a python loop.
import pandas as pd
import numpy as np
data = ['0.05+0.43','0.93+0.05','0.00','0.11+0.11','0.00','3.94-2.06']
df = pd.DataFrame(data, columns=['xg'])
# Solution
tmp = df['xg'].str.split(r'[ \-+]')
df['e'] = tmp.apply(lambda x: x[0])
df['a'] = tmp.apply(lambda x: x[1] if len(x) > 1 else np.nan)
del tmp
A regex approach that retains the negative sign:
import pandas as pd
import re
df1 = pd.DataFrame({'col': ['1+2','3+4','20','0.6-1.6']})
data = [[i] + re.findall('-*[0-9.]+', i) for i in df1['col']]
df = pd.DataFrame(data, columns=["col", "left", "right"])
print(df.head())
col left right
0 1+2 1 2
1 3+4 3 4
2 20 20 None
3 0.6-1.6 0.6 -1.6

Pandas data frame excluding rows in a particular range

I have a DataFrame df as below. I want to exclude rows where a particular column, say Vader_Sentiment, has a value in the range -0.1 to 0.1, and keep the remaining rows.
I have tried df = [df['Vader_Sentiment'] < -0.1 & df['Vader_Sentiment'] > 0.1] but it doesn't seem to work.
Text Vader_Sentiment
A -0.010
B 0.206
C 0.003
D -0.089
E 0.025
You can use Series.between():
df.loc[~df.Vader_Sentiment.between(-0.1, 0.1)]
Text Vader_Sentiment
1 B 0.206
Three things:
The tilde (~) operator denotes an inverse/complement.
Make sure you have numeric data: df.dtypes should show float64 for Vader_Sentiment, not object.
Series.between takes an inclusive parameter that controls whether the interval bounds are closed or open.

Pandas - remove cells based on value

I have a dataframe with z-scores for several values. It looks like this:
ID Cat1 Cat2 Cat3
A 1.05 -1.67 0.94
B -0.88 0.22 -0.56
C 1.33 0.84 1.19
I want to write a script that will tell me which IDs correspond with values in each category relative to a cut-off value I specify as needed. Because I am working with z-scores, I will need to compare the absolute value against my cut-off.
So if I set my cut-off at 0.75, the resulting dataframe would be:
Cat1 Cat2 Cat3
A A A
B C C
C
If I set 1.0 as my cut-off value: the dataframe above would return:
Cat1 Cat2 Cat3
A A C
C
I know that I can do queries like this:
df1 = df[df['Cat1'] > 1]
df1
df1 = df[df['Cat1'] < -1]
df1
to individually query each column and find the information I'm looking for, but this is tedious even if I figure out how to use abs to combine the two queries into one. How can I apply this filter to the whole dataframe?
I've come up with this skeleton of a script:
cut_off = 1.0
cols = list(df.columns)
cols.remove('ID')
for col in cols:
    # FOR EACH CELL IN COLUMN:
    if abs(CELL) < cut_off:
        CELL = NaN
to basically just eliminate any values that don't meet the cut-off. If I can get this to work, it will bring me closer to my goal, but I am stuck and don't even know if I am on the right track. Again, the overall goal is to quickly figure out which cells have absolute values above the cut-off in each category and to list the corresponding IDs.
I apologize if anything is confusing or vague; let me know in the comments and I'll fix it. I've been trying to figure this out for most of today and my brain is somewhat fried.
You don't have to apply the filter column by column; you can mask the whole DataFrame at once:
df[df > 1]
and likewise assign into the masked cells:
df[df > 1] = np.NaN
