iterating re.split() on a dataframe - python

I am trying to use re.split() to split a single variable in a pandas dataframe into two other variables.
My data looks like:
xg
0.05+0.43
0.93+0.05
0.00
0.11+0.11
0.00
3.94-2.06
I want to create
e a
0.05 0.43
0.93 0.05
0.00
0.11 0.11
0.00
3.94 2.06
I can do this using a for loop and indexing.
for i in range(len(df)):
    if df['xg'].str.len()[i] < 5:
        df['e'][i] = df['xg'][i]
    else:
        df['e'][i], df['a'][i] = re.split(r"[\+ \-]", df['xg'][i])
However, this is slow, and I do not believe it is a good way of doing this. I am trying to improve my code and my understanding of Python.
I have made various attempts to write it using np.where, a list comprehension, or apply with a lambda, but I can't get it to run. I think all the issues I have arise because I am trying to apply the functions to the whole series rather than to the value at each position.
If anyone has an idea of a better method than my ugly for loop I would be very interested.

Borrowed from this answer using the str.split method with the expand argument:
https://stackoverflow.com/a/14745484/3084939
df = pd.DataFrame({'col': ['1+2','3+4','20','0.6-1.6']})
df[['left','right']] = df['col'].str.split(r'[+-]', expand=True)
df.head()
col left right
0 1+2 1 2
1 3+4 3 4
2 20 20 None
3 0.6-1.6 0.6 1.6

This may be what you want. Not sure it's elegant, but should be faster than a python loop.
import pandas as pd
import numpy as np
data = ['0.05+0.43','0.93+0.05','0.00','0.11+0.11','0.00','3.94-2.06']
df = pd.DataFrame(data, columns=['xg'])
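# Solution: split each string on a space, '-' or '+' into a list of parts
tmp = df['xg'].str.split(r'[ \-+]')
# The first part always exists
df['e'] = tmp.apply(lambda x: x[0])
# The second part is missing for values such as '0.00'
df['a'] = tmp.apply(lambda x: x[1] if len(x) > 1 else np.nan)
del tmp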
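As a further alternative (not taken from the answers above), a single regular expression with an optional second group can be passed to Series.str.extract, which returns one column per group. This is only a sketch: the pattern and the conversion to float assume the values are unsigned decimals separated by at most one + or -, as in the sample data.

```python
import pandas as pd

df = pd.DataFrame({"xg": ["0.05+0.43", "0.93+0.05", "0.00", "0.11+0.11", "0.00", "3.94-2.06"]})

# Group 1 is the first number; group 2 (optional) is the number after '+' or '-'.
parts = df["xg"].str.extract(r"^([\d.]+)(?:[+-]([\d.]+))?$")
parts.columns = ["e", "a"]

# Rows without a second number get NaN in 'a'; convert both columns to floats.
df = df.join(parts.astype(float))
print(df)
```

Because the whole column is processed in one call, there is no Python-level loop over rows.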

Regex to retain the negative sign:
import pandas as pd
import re
df1 = pd.DataFrame({'col': ['1+2','3+4','20','0.6-1.6']})
data = [[i] + re.findall('-*[0-9.]+', i) for i in df1['col']]
df = pd.DataFrame(data, columns=["col", "left", "right"])
print(df.head())
col left right
0 1+2 1 2
1 3+4 3 4
2 20 20 None
3 0.6-1.6 0.6 -1.6

Related

Adding values in a column by "formula" in a pandas dataframe

I am trying to add values to a column using a formula, using the information from this question: Is there a way in Pandas to use previous row value in dataframe.apply when previous value is also calculated in the apply?
I already have the first number of the column B and I want to make a formula for the rest of column B.
The dataframe looks something like this:
A B C
0.16 0.001433 25.775485
0.28 0 25.784443
0.28 0 25.792396
...
And the method I tried was:
for i in range(1, len(df)):
df.loc[i, "B"] = df.loc[i-1, "B"] + df.loc[i,"A"]*((df.loc[i,"C"]) - (df.loc[i-1,"C"]))
But this code produces an infinite loop, can someone help me with this?
You can use shift and a simple assignment.
As a general rule in pandas, an explicit row-by-row loop usually means a vectorized alternative exists; such loops are considered an anti-pattern.
df['B_new'] = df['B'].shift(-1) - df['A'] * ((df['C'] - df['C'].shift(-1)))
A B C B_new
0 0.16 0.001433 25.775485 0.001433
1 0.28 0.000000 25.784443 0.002227
2 0.28 0.000000 25.792396 NaN
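The loop in the question defines each B from the previous B, which amounts to B[i] = B[0] + the running sum of A[k] * (C[k] - C[k-1]) for k = 1..i. The shift expression above computes a one-step difference rather than that running total. Under that reading of the question, a sketch using diff and cumsum could look like this; the frame is just the three sample rows shown, and `step` is a name introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [0.16, 0.28, 0.28],
    "B": [0.001433, 0.0, 0.0],
    "C": [25.775485, 25.784443, 25.792396],
})

# Per-row increment A[i] * (C[i] - C[i-1]); the first row has no predecessor.
step = df["A"] * df["C"].diff()

# Seed the running total with the known first value of B, then accumulate.
step.iloc[0] = df["B"].iloc[0]
df["B"] = step.cumsum()
print(df)
```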

How to decode column value from rare label by matching column names

I have two dataframes like as shown below
import numpy as np
import pandas as pd
from numpy.random import default_rng
rng = default_rng(100)
cdf = pd.DataFrame({'Id':[1,2,3,4,5],
'grade': rng.choice(list('ACD'),size=(5)),
'dash': rng.choice(list('PQRS'),size=(5)),
'dumeel': rng.choice(list('QWER'),size=(5)),
'dumma': rng.choice((1234),size=(5)),
'target': rng.choice([0,1],size=(5))
})
tdf = pd.DataFrame({'Id': [1,1,1,1,3,3,3],
'feature': ['grade=Rare','dash=Q','dumma=rare','dumeel=R','dash=Rare','dumma=rare','grade=D'],
'value': [0.2,0.45,-0.32,0.56,1.3,1.5,3.7]})
My objective is to
a) Replace the Rare or rare values in feature column of tdf dataframe by original value from cdf dataframe.
b) To identify original value, we can make use of the string before = Rare or =rare or = rare etc. That string represents the column name in cdf dataframe (from where original value to replace rare can be found)
I was trying something like the below but not sure how to go from here
replace_df = cdf.merge(tdf,how='inner',on='Id')
replace_df ["replaced_feature"] = np.where(((replace_df["feature"].str.contains('rare',regex=True)]) & (replace_df["feature"].str.split('='))])
I have to apply this to a big dataset with a million rows and more than 1000 replacements of this kind.
I expect my output to look like the one shown below.
Here is one possible approach using MultiIndex.map to substitute values from cdf into tdf:
s = tdf['feature'].str.split('=')
m = s.str[1].isin(['rare', 'Rare'])
v = tdf[m].set_index(['Id', s[m].str[0]]).index.map(cdf.set_index('Id').stack())
tdf.loc[m, 'feature'] = s[m].str[0] + '=' + v.astype(str)
print(tdf)
Id feature value
0 1 grade=D 0.20
1 1 dash=Q 0.45
2 1 dumma=1123 -0.32
3 1 dumeel=R 0.56
4 3 dash=P 1.30
5 3 dumma=849 1.50
6 3 grade=D 3.70
# list comprehension to find where rare is in the feature col
tdf['feature'] = [x if y.lower()=='rare' else x+'='+y for x,y in tdf['feature'].str.split('=')]
# create a mask where feature is in columns of cdf
mask = tdf['feature'].isin(cdf.columns)
# use loc to filter your frame and use merge to join cdf on the id and feature column - after you use stack
tdf.loc[mask, 'feature'] = tdf.loc[mask, 'feature']+'='+tdf.loc[mask].merge(cdf.set_index('Id').stack().to_frame(),
right_index=True, left_on=['Id', 'feature'])[0].astype(str)
Id feature value
0 1 grade=D 0.20
1 1 dash=Q 0.45
2 1 dumma=1123 -0.32
3 1 dumeel=R 0.56
4 3 dash=P 1.30
5 3 dumma=849 1.50
6 3 grade=D 3.70
My feeling is there's no need to look for Rare values.
Extract the column name from tdf to look up in cdf. Then flatten your cdf dataframe to extract the right values:
r = tdf.set_index('Id')['feature'].str.split('=').str[0].str.lower()
tdf['feature'] = r.values + '=' + cdf.set_index('Id').unstack() \
.loc[zip(r.values, r.index)] \
.astype(str).values
Output:
>>> tdf
Id feature value
0 1 grade=D 0.20
1 1 dash=Q 0.45
2 1 dumma=1123 -0.32
3 1 dumeel=R 0.56
4 3 dash=P 1.30
5 3 dumma=849 1.50
6 3 grade=A 3.70
>>> r
Id # <- the index is the row of cdf
1 grade # <- the values are the column of cdf
1 dash
1 dumma
1 dumeel
3 dash
3 dumma
3 grade
Name: feature, dtype: object
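The answers above all rest on the same idea: reshape cdf into a long lookup keyed by (Id, column name), then fetch the original value for each rare row. A minimal sketch of that idea follows, using melt instead of stack and small fixed frames with made-up values so the result is reproducible. It also strips whitespace and ignores case, on the assumption that variants such as "= rare" should be treated as rare; `parts`, `name`, `is_rare`, `lookup`, `keys` and `originals` are names introduced here.

```python
import pandas as pd

# Small fixed frames (hypothetical values) standing in for cdf and tdf.
cdf = pd.DataFrame({"Id": [1, 3], "grade": ["D", "A"], "dash": ["Q", "P"], "dumma": [1123, 849]})
tdf = pd.DataFrame({
    "Id": [1, 1, 1, 3, 3],
    "feature": ["grade=Rare", "dash=Q", "dumma= rare", "dash=Rare", "grade=D"],
})

# Split once on '=' and normalise whitespace and case before testing for "rare".
parts = tdf["feature"].str.split("=", n=1, expand=True)
name = parts[0].str.strip()
is_rare = parts[1].str.strip().str.lower().eq("rare")

# Long-format lookup: one original value per (Id, column name) pair.
lookup = cdf.melt(id_vars="Id").set_index(["Id", "variable"])["value"]

# Fetch the original value for every rare row and rebuild "name=value".
keys = pd.MultiIndex.from_arrays([tdf.loc[is_rare, "Id"], name[is_rare]])
originals = pd.Series(lookup.reindex(keys).to_numpy(), index=tdf.index[is_rare]).astype(str)
tdf.loc[is_rare, "feature"] = name[is_rare] + "=" + originals
print(tdf)
```

Only the rows flagged as rare are rewritten; every other feature string is left as it was.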

How to divide one dataframe by the other without converting to numpy first?

I have a dataframe with two columns, x and y, and a few hundred rows.
I have another dataframe with only one row and two columns, x and y.
I want to divide column x of the big dataframe by the value in x of the small dataframe, and column y by column y.
If I divide one dataframe by the other, I get all NaNs. For the division to work, I must convert the small dataframe to numpy.
Why can't I divide one dataframe by the other? What am I missing? I have a toy example below.
import numpy as np
import pandas as pd
df = pd.DataFrame()
r = int(10)
df['x'] = np.arange(0,r)
df['y'] = df['x'] * 2
other_df = pd.DataFrame()
other_df['x'] = [100]
other_df['y'] = [400]
# This doesn't work - I get all nans
new = df / other_df
# this works - it gives me what I want
new2 = df / [100,400]
# this also works
new3 = df / other_df.to_numpy()
You can convert the one-row DataFrame to a Series so that the columns align correctly, e.g. by selecting the first row with DataFrame.iloc:
new = df / other_df.iloc[0]
print (new)
x y
0 0.00 0.000
1 0.01 0.005
2 0.02 0.010
3 0.03 0.015
4 0.04 0.020
5 0.05 0.025
6 0.06 0.030
7 0.07 0.035
8 0.08 0.040
9 0.09 0.045
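You can also use numpy.divide(), which relies on NumPy broadcasting. Pass the small frame as a NumPy array: recent pandas versions align two DataFrames before applying a ufunc, which would give the same NaNs as df / other_df.
new = np.divide(df, other_df.to_numpy())
See the NumPy broadcasting documentation for more details.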
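The NaNs in the question come from label alignment: arithmetic between two DataFrames matches both row labels and column labels, so only rows whose label exists in both frames receive a value. The sketch below, built on the question's toy data, contrasts that with dividing by a Series, which is matched on column labels only and broadcast down the rows. The names `aligned`, `by_row` and `same` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.arange(10), "y": np.arange(10) * 2})
other_df = pd.DataFrame({"x": [100], "y": [400]})

# DataFrame / DataFrame aligns on row labels AND column labels, so only
# row label 0 (present in both frames) gets a value; the rest become NaN.
aligned = df / other_df

# A Series is aligned on column labels only and broadcast down the rows.
by_row = df / other_df.iloc[0]

# DataFrame.div spells out the same operation with an explicit axis.
same = df.div(other_df.iloc[0], axis="columns")

print(aligned.isna().sum())
print(by_row.tail(2))
```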

Pandas divide one row by another and output to another row in the same dataframe

For a Dataframe such as:
dt
COL000 COL001
STK_ID
Rowname1 2 2
Rowname2 1 4
Rowname3 1 1
What's the easiest way to append to the same data frame the result of dividing Row1 by Row2? i.e. the desired outcome is:
COL000 COL001
STK_ID
Rowname1 2 2
Rowname2 1 4
Rowname3 1 1
Newrow 2 0.5
Sorry if this is a simple question, I'm slowly getting to grips with pandas from an R background.
Thanks in advance!!!
The code below will create a new row with index d which is formed from dividing rows a and b.
import pandas as pd
df = pd.DataFrame(data={'x':[1,2,3], 'y':[4,5,6]}, index=['a', 'b', 'c'])
df.loc['d'] = df.loc['a'] / df.loc['b']
print(df)
# x y
# a 1.0 4.0
# b 2.0 5.0
# c 3.0 6.0
# d 0.5 0.8
To access the first two rows without caring about the index labels, you can use:
df.loc['newrow'] = df.iloc[0] / df.iloc[1]
Then just follow @Ffisegydd's solution.
In addition, if you want to append multiple rows, use pd.concat (the older pd.DataFrame.append method was removed in pandas 2.0).
pandas does all the work row by row. When you assign the result to a new key, pandas interprets this as a request for a new column:
data['new_row_with_division'] = data['row_name1_values'] / data['row_name2_values']
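For the frame in the question, the row division and the append can be sketched with pd.concat as follows. The frame is rebuilt from the values shown in the question, and `new_row` and `result` are names introduced here; treat this as one possible way to do it, not the only one.

```python
import pandas as pd

dt = pd.DataFrame(
    {"COL000": [2, 1, 1], "COL001": [2, 4, 1]},
    index=pd.Index(["Rowname1", "Rowname2", "Rowname3"], name="STK_ID"),
)

# Divide one row by another; the result is a Series indexed by column name.
new_row = (dt.loc["Rowname1"] / dt.loc["Rowname2"]).rename("Newrow")

# Turn the Series into a one-row frame and stack it under the original.
result = pd.concat([dt, new_row.to_frame().T])

# Restore the index name in case concat did not carry it over.
result.index.name = dt.index.name
print(result)
```

Several computed rows can be appended in one call by putting more one-row frames in the list passed to pd.concat.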

Creating DataFrame with Hierarchical Columns

What is the easiest way to create a DataFrame with hierarchical columns?
I am currently creating a DataFrame from a dict of names -> Series using:
df = pd.DataFrame(data=serieses)
I would like to use the same columns names but add an additional level of hierarchy on the columns. For the time being I want the additional level to have the same value for columns, let's say "Estimates".
I am trying the following but that does not seem to work:
pd.DataFrame(data=serieses,columns=pd.MultiIndex.from_tuples([(x, "Estimates") for x in serieses.keys()]))
All I get is a DataFrame with all NaNs.
For example, what I am looking for is roughly:
l1 Estimates
l2 one two one two one two one two
r1 1 2 3 4 5 6 7 8
r2 1.1 2 3 4 5 6 71 8.2
where l1 and l2 are the labels for the MultiIndex
This appears to work:
import pandas as pd
data = {'a': [1,2,3,4], 'b': [10,20,30,40],'c': [100,200,300,400]}
df = pd.concat({"Estimates": pd.DataFrame(data)}, axis=1, names=["l1", "l2"])
l1 Estimates
l2 a b c
0 1 10 100
1 2 20 200
2 3 30 300
3 4 40 400
I know the question is really old but for pandas version 0.19.1 one can use direct dict-initialization:
d = {('a','b'):[1,2,3,4], ('a','c'):[5,6,7,8]}
df = pd.DataFrame(d, index=['r1','r2','r3','r4'])
df.columns.names = ('l1','l2')
print(df)
l1 a
l2 b c
r1 1 5
r2 2 6
r3 3 7
r4 4 8
I'm not sure, but I think a dict as input for your DataFrame and a MultiIndex don't play well together. Using an array as input instead makes it work.
I often prefer dicts as input though, so one way is to set the columns after creating the DataFrame:
import numpy as np
import pandas as pd
data = {'a': [1,2,3,4], 'b': [10,20,30,40],'c': [100,200,300,400]}
# list() is needed on Python 3, where dict.values() returns a view
df = pd.DataFrame(np.array(list(data.values())).T, index=['r1','r2','r3','r4'])
# Pair the constant top-level label with each dict key
tups = zip(*[['Estimates']*len(data),data.keys()])
df.columns = pd.MultiIndex.from_tuples(tups, names=['l1','l2'])
l1 Estimates
l2 a b c
r1 1 10 100
r2 2 20 200
r3 3 30 300
r4 4 40 400
Or when using an array as input for the df:
data_arr = np.array([[1,2,3,4],[10,20,30,40],[100,200,300,400]])
tups = zip(*[['Estimates']*data_arr.shape[0],['a','b','c']])
df = pd.DataFrame(data_arr.T, index=['r1','r2','r3','r4'], columns=pd.MultiIndex.from_tuples(tups, names=['l1','l2']))
Which gives the same result.
The solution by Rutger Kassies worked in my case, but I have
more than one column in the "upper level" of the column hierarchy.
Just want to provide what worked for me as an example since it is a more general case.
First, I have data that looks like this:
> df
(A, a) (A, b) (B, a) (B, b)
0 0.00 9.75 0.00 0.00
1 8.85 8.86 35.75 35.50
2 8.51 9.60 66.67 50.70
3 0.03 508.99 56.00 8.58
I would like it to look like this:
> df
A B
a b a b
0 0.00 9.75 0.00 0.00
1 8.85 8.86 35.75 35.50
...
The solution is:
tuples = df.transpose().index
new_columns = pd.MultiIndex.from_tuples(tuples, names=['Upper', 'Lower'])
df.columns = new_columns
This is counter-intuitive because in order to create columns, I have to do it through index.
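A likely explanation for the all-NaN frame in the original question is that, when the data is a dict, the columns argument selects dict keys by label, and tuple labels such as (x, "Estimates") match none of the plain string keys. One way to sidestep this, sketched below, is to build the frame first and then replace its columns with a MultiIndex made by MultiIndex.from_product. The `serieses` dict here is a hypothetical stand-in with made-up values.

```python
import pandas as pd

# Hypothetical stand-in for the question's dict of name -> Series.
serieses = {
    "one": pd.Series([1.0, 1.1], index=["r1", "r2"]),
    "two": pd.Series([2.0, 2.0], index=["r1", "r2"]),
}

df = pd.DataFrame(serieses)

# Pair the constant top-level label with every existing column name.
df.columns = pd.MultiIndex.from_product([["Estimates"], df.columns], names=["l1", "l2"])
print(df)
```

Selecting df["Estimates"] then returns the original flat frame, which is a quick way to check that the extra level was added without disturbing the data.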
