I'm trying to add values in a pandas DataFrame based on the inputs of a user and an agent. This is the example I have so far.
import numpy as np
import pandas as pd
import random
ls = np.zeros((9,3))
choices = ['R','P','S']
df = pd.DataFrame(ls, columns=['R','P','S'], index = ['RR','RP','RS','PR','PP','PS','SR','SP','SS'])
for _ in range(100):
    user_choice = random.choice(choices)
    agent_choice = random.choice(choices)
    #print(user_choice, agent_choice)
    for _ in range(len(df)):
        for _ in range(len(df['R'])):
            df[user_choice + agent_choice][agent_choice] += 1
The desired result will look something like:
Any help will be much appreciated
Not sure this is really what you want, but Python, NumPy, and Pandas provide some nice conveniences to do these things:
>>> import random
>>> import numpy as np, pandas as pd
>>> from itertools import product
>>> choices = 'RPS'
>>> df = pd.DataFrame(np.zeros((9,3)), columns=list(choices), index=[''.join(l) for l in product(choices, repeat=2)])
>>> user_choices = np.array([random.choice(choices) for _ in range(100)], dtype=str)
>>> agent_choices = np.array([random.choice(choices) for _ in range(100)], dtype=str)
>>> for ac, cc in zip(agent_choices, np.char.add(user_choices, agent_choices)):
...     if ac == cc[-1]:
...         df[ac][cc] += 1
...
>>> df
R P S
RR 14.0 0.0 0.0
RP 0.0 14.0 0.0
RS 0.0 0.0 7.0
PR 11.0 0.0 0.0
PP 0.0 13.0 0.0
PS 0.0 0.0 8.0
SR 10.0 0.0 0.0
SP 0.0 8.0 0.0
SS 0.0 0.0 15.0
Since you seem to want it normalized to a percentage:
>>> df / 100
R P S
RR 0.14 0.00 0.00
RP 0.00 0.14 0.00
RS 0.00 0.00 0.07
PR 0.11 0.00 0.00
PP 0.00 0.13 0.00
PS 0.00 0.00 0.08
SR 0.10 0.00 0.00
SP 0.00 0.08 0.00
SS 0.00 0.00 0.15
The obvious issue is that this will always give you a sparse matrix. Since you're counting (user_choice, agent_choice) pairs by agent_choice, the only cells that can ever be filled are those where the second character of the row index matches the column header. You may as well collapse the table so that both the index and the columns are ['R', 'P', 'S'], and count how many times the user chose 'R' when the agent chose 'R', and so on.
>>> df = pd.DataFrame(np.zeros((3,3)), columns=list(choices), index=list(choices))
>>> for a, c in zip(agent_choices, user_choices):
...     df[a][c] += 1
...
>>> df
R P S
R 14.0 14.0 7.0
P 11.0 13.0 8.0
S 10.0 8.0 15.0
>>> df / 100
R P S
R 0.14 0.14 0.07
P 0.11 0.13 0.08
S 0.10 0.08 0.15
You can see that contains the same information in a smaller matrix.
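As a side note (not part of the original answer), pd.crosstab can build the same 3×3 count table without the explicit loop; the seed and the 100 draws below are arbitrary, chosen only to make the sketch reproducible:

```python
import random

import numpy as np
import pandas as pd

choices = 'RPS'
random.seed(0)  # arbitrary seed so the counts are reproducible
user_choices = pd.Series([random.choice(choices) for _ in range(100)])
agent_choices = pd.Series([random.choice(choices) for _ in range(100)])

# crosstab counts each (user, agent) pair; reindex fixes the R/P/S order
counts = pd.crosstab(user_choices, agent_choices)
counts = counts.reindex(index=list(choices), columns=list(choices), fill_value=0)
print(counts)
print(counts / counts.to_numpy().sum())  # normalized to proportions
```

Dividing by the grand total gives the same percentage view as df / 100 above.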
I'm fairly new to pandas and Python. I'm trying to compute a few selected interaction terms out of all possible interactions in a data frame, and then add them as new features to the df.
My solution was to calculate the interactions of interest using sklearn's PolynomialFeatures() and attach them to the df in a for loop. See example:
import numpy as np
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
np.random.seed(1111)
a1 = np.random.randint(2, size = (5,3))
a2 = np.round(np.random.random((5,3)),2)
df = pd.DataFrame(np.concatenate([a1, a2], axis = 1), columns = ['a','b','c','d','e','f'])
combinations = [['a', 'e'], ['a', 'f'], ['b', 'f']]
for comb in combinations:
    polynomizer = PolynomialFeatures(interaction_only=True, include_bias=False).fit(df[comb])
    newcol_nam = polynomizer.get_feature_names(comb)[2]
    newcol_val = polynomizer.transform(df[comb])[:, 2]
    df[newcol_nam] = newcol_val
df
a b c d e f a e a f b f
0 0.0 1.0 1.0 0.51 0.45 0.10 0.00 0.00 0.10
1 1.0 0.0 0.0 0.67 0.36 0.23 0.36 0.23 0.00
2 0.0 0.0 0.0 0.97 0.79 0.02 0.00 0.00 0.00
3 0.0 1.0 0.0 0.44 0.37 0.52 0.00 0.00 0.52
4 0.0 0.0 0.0 0.16 0.02 0.94 0.00 0.00 0.00
Another solution would be to run
PolynomialFeatures(2, interaction_only=True, include_bias=False).fit_transform(df)
and then drop the interactions I'm not interested in.
However, neither option is ideal in terms of performance and I'm wondering if there is a better solution.
As commented, you can try:
df = df.join(pd.DataFrame({
    f'{x} {y}': df[x]*df[y] for x, y in combinations
}))
Or simply:
for comb in combinations:
    df[' '.join(comb)] = df[comb].prod(1)
Output:
a b c d e f a e a f b f
0 0.0 1.0 1.0 0.51 0.45 0.10 0.00 0.00 0.10
1 1.0 0.0 0.0 0.67 0.36 0.23 0.36 0.23 0.00
2 0.0 0.0 0.0 0.97 0.79 0.02 0.00 0.00 0.00
3 0.0 1.0 0.0 0.44 0.37 0.52 0.00 0.00 0.52
4 0.0 0.0 0.0 0.16 0.02 0.94 0.00 0.00 0.00
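For completeness, here is a self-contained version of the join approach, rebuilding the question's data with the same seed:

```python
import numpy as np
import pandas as pd

np.random.seed(1111)
a1 = np.random.randint(2, size=(5, 3))
a2 = np.round(np.random.random((5, 3)), 2)
df = pd.DataFrame(np.concatenate([a1, a2], axis=1),
                  columns=['a', 'b', 'c', 'd', 'e', 'f'])
combinations = [['a', 'e'], ['a', 'f'], ['b', 'f']]

# Build all interaction columns in one DataFrame and join them in a single pass
df = df.join(pd.DataFrame({f'{x} {y}': df[x] * df[y] for x, y in combinations}))
print(df)
```

Because the products are plain elementwise multiplications, this avoids fitting a PolynomialFeatures transformer per combination.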
I have two dataframes df1 and df2.
One with clients' debts, the other with client payments and their dates.
I want to create a new data frame with the % of the debt paid in the month of the payment until 01-2017.
import pandas as pd
d1 = {'client number': ['2', '2','3','6','7','7','8','8','8','8','8','8','8','8'],
'month': [1, 2, 3,1,10,12,3,5,8,1,2,4,5,8],
'year':[2013,2013,2013,2019,2013,2013,2013,2013,2013,2014,2014,2015,2016,2017],
'payment' :[100,100,200,10000,200,100,300,500,200,100,200,200,500,50]}
df1 = pd.DataFrame(data=d1).set_index('client number')
df1
d2 = {'client number': ['2','3','6','7','8'],
'debt': [200, 600,10000,300,3000]}
df2 = pd.DataFrame(data=d2)
x = [1,2,3,4,5,6,7,8,9,10]
y = [2013,2014,2015,2016,2017]
for x in month and y in year
    if df1['month']=x and df1['year']=year :
        df2[month&year] = df1['payment']/df2['debt']
The result needs to be something like this for all the clients.
What am I missing?
Thank you for your time and help.
First make sure client number is the index of both dataframes. Then use Index.map to map each client number in df1 to its corresponding debt in df2, and Series.div to divide each payment by that debt, giving the fraction of the debt paid. Next build a date column in df1 from the month and year columns, and finally combine DataFrame.join with DataFrame.pivot_table:
# df1 already has 'client number' as its index (set when it was created)
df2 = df2.set_index('client number')
df1['pct'] = df1['payment'].div(df1.index.map(df2['debt'])).round(2)
df1['date'] = df1['year'].astype(str) + '-' + df1['month'].astype(str).str.zfill(2)
df3 = (
    df2.join(
        df1.pivot_table(index=df1.index, columns='date', values='pct', aggfunc='sum').fillna(0))
    .reset_index()
)
Result:
# print(df3)
client number debt 2013-01 2013-02 2013-03 2013-05 2013-08 ... 2013-12 2014-01 2014-02 2015-04 2016-05 2017-08 2019-01
0 2 200 0.5 0.5 0.00 0.00 0.00 ... 0.00 0.00 0.00 0.00 0.00 0.00 0.0
1 3 600 0.0 0.0 0.33 0.00 0.00 ... 0.00 0.00 0.00 0.00 0.00 0.00 0.0
2 6 10000 0.0 0.0 0.00 0.00 0.00 ... 0.00 0.00 0.00 0.00 0.00 0.00 1.0
3 7 300 0.0 0.0 0.00 0.00 0.00 ... 0.33 0.00 0.00 0.00 0.00 0.00 0.0
4 8 3000 0.0 0.0 0.10 0.17 0.07 ... 0.00 0.03 0.07 0.07 0.17 0.02 0.0
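The Index.map lookup at the heart of this answer can be illustrated in isolation; the client numbers and amounts below are made up for the sketch:

```python
import pandas as pd

# Hypothetical mini-versions of df1 (payments) and df2's debt column
payments = pd.DataFrame({'client number': ['2', '2', '3'],
                         'payment': [100, 100, 200]}).set_index('client number')
debts = pd.Series([200, 600], index=['2', '3'], name='debt')

# Map each row's client number to that client's debt, then divide
pct = payments['payment'].div(payments.index.map(debts))
print(pct.round(2))
```

Each payment row is matched to its client's debt by index value, so duplicate client numbers (multiple payments) are handled naturally.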
I have a dataframe like this,
ds 0 1 2 4 5 6
0 1991Q3 nan nan nan nan 1.0 nan
1 2014Q2 1.0 3.0 nan nan 1.0 nan
2 2014Q3 1.0 nan nan 1.0 4.0 nan
3 2014Q4 nan nan nan 2.0 3.0 nan
4 2015Q1 nan 1.0 2.0 4.0 4.0 nan
I would like the proportions for each column 0-6 like this,
ds 0 1 2 4 5 6
0 1991Q3 0.00 0.00 0.00 0.00 1.00 0.00
1 2014Q2 0.20 0.60 0.00 0.00 0.20 0.00
2 2014Q3 0.16 0.00 0.00 0.16 0.67 0.00
3 2014Q4 0.00 0.00 0.00 0.40 0.60 0.00
4 2015Q1 0.00 0.09 0.18 0.36 0.36 0.00
Is there a pandas way to do this? Any suggestion would be great.
You can do this:
df = df.replace(np.nan, 0)
df = df.set_index('ds')
In [3194]: df.div(df.sum(1),0).reset_index()
Out[3194]:
ds 0 1 2 4 5 6
0 1991Q3 0.00 0.00 0.00 0.00 1.00 0.00
1 2014Q2 0.20 0.60 0.00 0.00 0.20 0.00
2 2014Q3 0.17 0.00 0.00 0.17 0.67 0.00
3 2014Q4 0.00 0.00 0.00 0.40 0.60 0.00
4 2015Q1 0.00 0.09 0.18 0.36 0.36 0.00
OR you can use df.apply:
In [3196]: df = df.replace(np.nan, 0)
In [3197]: df.iloc[:,1:] = df.iloc[:,1:].apply(lambda x: x/x.sum(), axis=1)
In [3198]: df
Out[3198]:
ds 0 1 2 4 5 6
0 1991Q3 0.00 0.00 0.00 0.00 1.00 0.00
1 2014Q2 0.20 0.60 0.00 0.00 0.20 0.00
2 2014Q3 0.17 0.00 0.00 0.17 0.67 0.00
3 2014Q4 0.00 0.00 0.00 0.40 0.60 0.00
4 2015Q1 0.00 0.09 0.18 0.36 0.36 0.00
Set the first column as the index, fill the null entries with 0, get the sum of each row, and divide the dataframe by the row sums:
res = df.set_index("ds")
res.fillna(0).div(res.sum(1),axis=0)
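Put together as a runnable sketch (reconstructing the first two rows of the question's frame by hand):

```python
import numpy as np
import pandas as pd

# First two rows of the question's frame, rebuilt by hand
df = pd.DataFrame({'ds': ['1991Q3', '2014Q2'],
                   0: [np.nan, 1.0], 1: [np.nan, 3.0], 2: [np.nan, np.nan],
                   4: [np.nan, np.nan], 5: [1.0, 1.0], 6: [np.nan, np.nan]})

res = df.set_index('ds')
# Row sums skip NaN; dividing with axis=0 broadcasts one sum per row
out = res.fillna(0).div(res.sum(axis=1), axis=0)
print(out.round(2))
```

Note that axis=0 in div is what makes the division row-wise; without it pandas would try to align the sums against the column labels.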
Is there a good way to split a dataframe into chunks and automatically name each chunk as its own dataframe?
For example, dfmaster has 1000 records. Split by 200 and create df1, df2, …, df5.
any guidance would be much appreciated.
I've looked on other boards and there is no guidance for a function that can automatically create new dataframes.
Use numpy for splitting; see the example below:
In [2095]: df
Out[2095]:
0 1 2 3 4 5 6 7 8 9 10
0 0.25 0.00 0.00 0.0 0.00 0.0 0.94 0.00 0.00 0.63 0.00
1 0.51 0.51 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 0.54 0.54 0.00 0.0 0.63 0.0 0.51 0.54 0.51 1.00 0.51
3 0.81 0.05 0.13 0.7 0.02 NaN NaN NaN NaN NaN NaN
In [2096]: np.split(df, 2)
Out[2096]:
[ 0 1 2 3 4 5 6 7 8 9 10
0 0.25 0.00 0.0 0.0 0.0 0.0 0.94 0.0 0.0 0.63 0.0
1 0.51 0.51 NaN NaN NaN NaN NaN NaN NaN NaN NaN,
0 1 2 3 4 5 6 7 8 9 10
2 0.54 0.54 0.00 0.0 0.63 0.0 0.51 0.54 0.51 1.0 0.51
3 0.81 0.05 0.13 0.7 0.02 NaN NaN NaN NaN NaN NaN]
df gets split into 2 dataframes having 2 rows each.
For your case, you can do np.split(dfmaster, 5) to split the 1000-record dataframe into 5 chunks of 200 rows each.
I find these ideas helpful:
solution via list:
https://stackoverflow.com/a/49563326/10396469
solution using numpy.split:
https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.split.html
just use df = df.values first to convert the dataframe to a numpy array.
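On the "name each chunk" part of the question: rather than creating variables df1 … df5 dynamically, the usual approach is a dictionary keyed by name. This sketch uses np.array_split, which, unlike np.split, also tolerates lengths that don't divide evenly; dfmaster here is a stand-in frame:

```python
import numpy as np
import pandas as pd

dfmaster = pd.DataFrame({'x': range(1000)})  # stand-in for the real data

# 5 chunks of 200 rows each, stored under the names df1..df5
chunks = {f'df{i + 1}': chunk
          for i, chunk in enumerate(np.array_split(dfmaster, 5))}
print(chunks['df1'].shape)
```

Each chunk is then reachable as chunks['df1'], chunks['df2'], and so on, without polluting the namespace.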
I have a large dataframe, ~ 1 million rows and 9 columns with some rows missing data in a few of the columns.
dat = pd.read_table( 'file path', delimiter = ';')
I z Sp S B B/T r gf k
0 0.0303 2 0.606 0.31 0.04 0.23 0.03 0.38
1 0.0779 2 0.00 0.00 0.05 0.01 0.00
The first few columns are being read in as strings, and the last few as NaN, even when there is a numeric value there. When I include dtype='float64' I get:
ValueError: could not convert string to float:
Any help in fixing this?
You can use replace with a regex (one or more whitespace characters to NaN), then cast to float.
Empty strings in the data are converted to NaN by read_table.
df = df.replace({r'\s+': np.nan}, regex=True).astype(float)
print (df)
I z Sp S B B/T r gf k
0 0.0 0.0303 2.0 0.606 0.31 0.04 0.23 0.03 0.38
1 1.0 0.0779 2.0 NaN 0.00 0.00 0.05 0.01 0.00
If the data contains strings that need to be replaced with NaN, you can use to_numeric with apply:
df = df.apply(lambda x: pd.to_numeric(x, errors='coerce'))
print (df)
I z Sp S B B/T r gf k
0 0 0.0303 2 0.606 0.31 0.04 0.23 0.03 0.38
1 1 0.0779 2 NaN 0.00 0.00 0.05 0.01 0.00
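A minimal runnable sketch of the to_numeric approach, using made-up data that contains an empty string and a stray word:

```python
import pandas as pd

# Made-up data: an empty string and a stray word among the numbers
df = pd.DataFrame({'S': ['0.606', '', 'bad'],
                   'B': ['0.31', '0.00', '0.12']})

# errors='coerce' turns anything non-numeric (including '') into NaN
df = df.apply(lambda col: pd.to_numeric(col, errors='coerce'))
print(df)
```

After the coercion every column has a numeric dtype, so aggregations and plotting work as expected.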