I have a dataframe like below. The column Mfr Number is a categorical data type. I'd like to perform get_dummies or one hot encoding on it, but instead of filling in the new column with a 1 if it's from that row, I want it to fill in the value from the quantity column. All the other new 'dummies' should remain a 0 on that row. Is this possible?
Datetime Mfr Number quantity
0 2016-03-15 07:02:00 MWS0460MB 1
1 2016-03-15 07:03:00 TM-120-6X 3
2 2016-03-15 08:33:00 40.50699.0095 5
3 2016-03-15 08:42:00 40.50699.0100 1
4 2016-03-15 08:46:00 CXS-04T098-00-0703R-1025 10
Do it in two steps:
dummies = pd.get_dummies(df['Mfr Number'])
dummies.values[dummies != 0] = df['quantity']
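A self-contained sketch of this two-step approach using the sample data from the question (hedged: on newer pandas get_dummies returns booleans, hence the astype(int), and with copy-on-write enabled writing through .values may not stick, in which case the mul-based answers below are safer):
import pandas as pd

# sample frame reconstructed from the question
df = pd.DataFrame({
    'Mfr Number': pd.Categorical(['MWS0460MB', 'TM-120-6X', '40.50699.0095',
                                  '40.50699.0100', 'CXS-04T098-00-0703R-1025']),
    'quantity': [1, 3, 5, 1, 10],
})

dummies = pd.get_dummies(df['Mfr Number']).astype(int)  # int, not bool
# each row has exactly one non-zero dummy, so the row-major boolean selection
# lines up with the quantities in row order
dummies.values[dummies != 0] = df['quantity']
print(dummies)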
Check with str.get_dummies and mul
df['Mfr Number'].str.get_dummies().mul(df['quantity'], axis=0)
40.50699.0095 40.50699.0100 ... MWS0460MB TM-120-6X
0 0 0 ... 1 0
1 0 0 ... 0 3
2 5 0 ... 0 0
3 0 1 ... 0 0
4 0 0 ... 0 0
[5 rows x 5 columns]
df = pd.get_dummies(df, columns = ['Mfr Number'])
for col in df.columns[2:]:
    df[col] = df[col] * df['quantity']
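The loop can also be collapsed into a single row-wise multiplication over the dummy columns; a small sketch, assuming Datetime and quantity are the first two columns as above:
dummy_cols = df.columns[2:]  # the columns created by get_dummies
df[dummy_cols] = df[dummy_cols].mul(df['quantity'], axis=0)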
Related
I have a dataframe as follows:
import pandas as pd
df = pd.DataFrame({'sent.1':[0,1,0,1],
'sent.2':[0,1,1,0],
'sent.3':[0,0,0,1],
'sent.4':[1,1,0,1]
})
I am trying to replace the non-zero values with the 5th character in the column names (which is the numeric part of the column names), so the output should be,
sent.1 sent.2 sent.3 sent.4
0 0 0 0 4
1 1 2 0 4
2 0 2 0 0
3 1 0 3 4
I have tried the following but it does not work,
print(df.replace(1, pd.Series([i[5] for i in df.columns], [i[5] for i in df.columns])))
However, when I replace with the column names themselves, the code works, so I am not sure which part is wrong.
print(df.replace(1, pd.Series(df.columns, df.columns)))
Since you're dealing with 1's and 0's, you can actually just multiply the dataframe by a range:
df = df * range(1, df.shape[1] + 1)
Output:
sent.1 sent.2 sent.3 sent.4
0 0 0 0 4
1 1 2 0 4
2 0 2 0 0
3 1 0 3 4
Or, if you want to take the numbers from the column names:
df = df * df.columns.str.split('.').str[-1].astype(int)
You could use string multiplication on a boolean mask to place the strings where the condition holds, and where to restore the zeros:
mask = df.ne(0)
(mask*df.columns.str[5]).where(mask, 0)
To have integers:
mask = df.ne(0)
(mask*df.columns.str[5].astype(int))
output:
sent.1 sent.2 sent.3 sent.4
0 0 0 0 4
1 1 2 0 4
2 0 2 0 0
3 1 0 3 4
And another one, working with an arbitrary condition (here s.ne(0)):
df.apply(lambda s: s.mask(s.ne(0), s.name.rpartition('.')[-1]))
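As an aside on why the original replace attempt did not work: when value is a dict (or Series), pandas matches its keys against the column names, and the keys '1'..'4' are not column names, so nothing is replaced. Keying by the actual column names works; a small sketch using the frame from the question:
import pandas as pd

df = pd.DataFrame({'sent.1': [0, 1, 0, 1],
                   'sent.2': [0, 1, 1, 0],
                   'sent.3': [0, 0, 0, 1],
                   'sent.4': [1, 1, 0, 1]})

# replace the 1s with each column's trailing digit, mapping the value by column name
out = df.replace(1, {c: int(c.split('.')[-1]) for c in df.columns})
print(out)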
I am trying to create a Python function that takes in 2 dataframes (dfA, dfB) and merges them based on their date column. When merging, B looks for the nearest date in A that is either equal to or comes before the given date. This is to prevent the data in dfAB from looking into the future (which is why dfAB.iloc[3]['date'] = 1/4/21 and not 1/9/21).
dfA
date i
0 1/1/21 0
1 1/3/21 0
2 1/4/21 0
3 1/10/21 0
dfB
date j k
0 1/1/21 0 0
1 1/2/21 0 0
2 1/3/21 0 0
3 1/9/21 0 0
4 1/12/21 0 0
dfAB (note that for each row of dfB, there is a row of dfAB)
date j k i
0 1/1/21 0 0 0
1 1/1/21 0 0 0
2 1/3/21 0 0 0
3 1/4/21 0 0 0
4 1/10/21 0 0 0
The values in columns i, j, k are just arbitrary values
So to do this we can use pd.merge_asof and a bit of trickery to push the merged date column back to the date that was matched in dfA.
# a.csv
date i
1/1/21 0
1/3/21 0
1/4/21 0
1/10/21 0
# b.csv
date j k
1/1/21 0 0
1/2/21 0 0
1/3/21 0 0
1/9/21 0 0
1/12/21 0 0
# merge_ab.py
import pandas as pd
dfA = pd.read_csv(
'a.csv',
delim_whitespace=True,
parse_dates=['date'],
dayfirst=True,
)
dfB = pd.read_csv(
'b.csv',
delim_whitespace=True,
parse_dates=['date'],
dayfirst=True,
)
dfA['new_date'] = dfA['date']
dfAB = pd.merge_asof(dfB, dfA, on='date', direction='backward')
dfAB['date'] = dfAB['new_date']
dfAB = dfAB.drop(columns=['new_date'])
print(dfAB)
# date j k i
# 0 2021-01-01 0 0 0
# 1 2021-01-01 0 0 0
# 2 2021-03-01 0 0 0
# 3 2021-04-01 0 0 0
# 4 2021-10-01 0 0 0
Here pd.merge_asof is doing the heavy lifting. We merge the rows of dfB backwards with the rows of dfA, so any row of dfAB only contains data from dates equal to or before the corresponding row in dfB. We do a little song and dance to copy the date column in dfA and then copy that over to the date column in dfAB to get the desired output.
It's not 100% clear to me that you want direction='backward' since all your sample data is 0, but if it doesn't look right you can always switch to direction='forward'.
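For reference, the same merge can be reproduced without the intermediate CSV files; a hedged in-memory sketch, leaving pd.to_datetime at its month-first default so '1/3/21' is read as January 3rd:
import pandas as pd

dfA = pd.DataFrame({'date': pd.to_datetime(['1/1/21', '1/3/21', '1/4/21', '1/10/21']),
                    'i': [0, 0, 0, 0]})
dfB = pd.DataFrame({'date': pd.to_datetime(['1/1/21', '1/2/21', '1/3/21', '1/9/21', '1/12/21']),
                    'j': [0, 0, 0, 0, 0],
                    'k': [0, 0, 0, 0, 0]})

# keep dfA's own date around, merge backwards, then restore it as the date column
dfA['new_date'] = dfA['date']
dfAB = pd.merge_asof(dfB, dfA, on='date', direction='backward')
dfAB['date'] = dfAB['new_date']
dfAB = dfAB.drop(columns=['new_date'])
print(dfAB)
# the date column now reads 1/1, 1/1, 1/3, 1/4, 1/10 as in the desired dfAB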
In continuation to my previous Question I need some more help.
The dataframe is like
time eve_id sub_id flag
0 5 2 0
1 5 2 0
2 5 2 1
3 5 2 1
4 5 2 0
5 4 25 0
6 4 30 0
7 5 2 1
I need to count the eve_id rows for each run of the flag, i.e. from the time the flag changes from 0 to 1, and again from the time it changes from 1 back to 0,
the output will look like this
time flag count
0 0 2
2 1 2
4 0 3
Can someone help me here?
First we make a grouper indicator which checks whether the difference between two rows is not equal to 0, i.e. whether the flag has changed.
Then we groupby on this indicator and use agg. Since pandas 0.25.0 we have named aggregations:
s = df['flag'].diff().ne(0).cumsum()
grpd = df.groupby(s).agg(time=('time', 'first'),
                         flag=('flag', 'first'),
                         count=('flag', 'size')).reset_index(drop=True)
Output
time flag count
0 0 0 2
1 2 1 2
2 4 0 3
3 7 1 1
If time is your index, use:
grpd = df.assign(time=df.index).groupby(s).agg(time=('time', 'first'),
                                               flag=('flag', 'first'),
                                               count=('flag', 'size')).reset_index(drop=True)
Notice: the extra row appears because there is also a difference between the last row and the row before it.
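To make the grouper concrete, here is what s evaluates to for the sample data (a small sketch reconstructing the frame from the question):
import pandas as pd

df = pd.DataFrame({'time': range(8),
                   'eve_id': [5, 5, 5, 5, 5, 4, 4, 5],
                   'sub_id': [2, 2, 2, 2, 2, 25, 30, 2],
                   'flag':   [0, 0, 1, 1, 0, 0, 0, 1]})

s = df['flag'].diff().ne(0).cumsum()
print(s.tolist())  # [1, 1, 2, 2, 3, 3, 3, 4] -> four runs of consecutive flag values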
Change aggregate function sum to GroupBy.size:
df1 = (df.groupby([df['flag'].ne(df['flag'].shift()).cumsum(), 'flag'])
         .size()
         .reset_index(level=0, drop=True)
         .reset_index(name='count'))
print (df1)
flag count
0 0 2
1 1 2
2 0 3
3 1 1
I need to process my dataframe in Python such that I add the numeric values of numeric columns that lie between 2 rows of the dataframe.
The dataframe can be created using
df = pd.DataFrame(np.array([['a', 0, 1, 0, 0, 0, 0, 'i'],
                            ['b', 1, 0, 0, 0, 0, 0, 'j'],
                            ['c', 0, 0, 1, 0, 0, 0, 'k'],
                            ['None', 0, 0, 0, 1, 0, 0, 'l'],
                            ['e', 0, 0, 0, 0, 1, 0, 'm'],
                            ['f', 0, 1, 0, 0, 0, 0, 'n'],
                            ['None', 0, 0, 0, 1, 0, 0, 'o'],
                            ['h', 0, 0, 0, 0, 1, 0, 'p']]),
                  columns=[0, 1, 2, 3, 4, 5, 6, 7],
                  index=[0, 1, 2, 3, 4, 5, 6, 7])
I need to add all rows that occur before the 'None' entries and move the aggregated row to a new dataframe that should look like:
Your DataFrame's dtypes are messed up because you built it from a single array: an array can only hold one type, so all the ints were pushed to strings. We need to convert them first.
df = df.apply(pd.to_numeric, errors='ignore')  # convert the numeric columns
df['newkey'] = df[0].eq('None').cumsum()  # use cumsum to create the group key
df.loc[df[0].ne('None'), :].groupby('newkey').agg(lambda x: x.sum() if x.dtype == 'int64' else x.iloc[0])  # then aggregate
Out[742]:
0 1 2 3 4 5 6 7
newkey
0 a 1 1 1 0 0 0 i
1 e 0 1 0 0 1 0 m
2 h 0 0 0 0 1 0 p
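A quick way to see what the conversion step does, using a shortened version of the frame from the question (note that errors='ignore' is deprecated in recent pandas releases, but it mirrors the answer above):
import numpy as np
import pandas as pd

# the mixed array forces every column to strings / object dtype
df = pd.DataFrame(np.array([['a', 0, 1, 0, 0, 0, 0, 'i'],
                            ['b', 1, 0, 0, 0, 0, 0, 'j'],
                            ['c', 0, 0, 1, 0, 0, 0, 'k']]),
                  columns=range(8))
print(df.dtypes.unique())   # [dtype('O')] -- everything is object

# per-column conversion; errors='ignore' leaves the text columns untouched
df = df.apply(pd.to_numeric, errors='ignore')
print(df.dtypes.tolist())   # columns 1-6 are now int64, columns 0 and 7 stay object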
You can also specify the agg funcs
s = lambda s: sum(int(k) for k in s)
d = {i: s for i in range(8)}
d.update({0: 'first', 7: 'first'})
df.groupby((df[0] == 'None').cumsum().shift().fillna(0)).agg(d)
0 1 2 3 4 5 6 7
0
0.0 a 1 1 1 1 0 0 i
1.0 e 0 1 0 1 1 0 m
2.0 h 0 0 0 0 1 0 p
I thought that with the huge pandas.DataFrame library it should be pretty straightforward to do all the standard stuff you can do with an SQL table... but after looking into many options I still haven't found a good working solution.
Requirements:
a table with 4 columns of different data types (uint32, string, ...), 3 of them should work as index
many (>10k) additional columns of type int8
initially I had the idea to add rows and columns dynamically, but that turned out to be very slow (using df.at[row, col] = y)
I ended up creating a DataFrame with a few columns of different types and joining it with another large DataFrame created from a numpy array with elements of type uint8
... that looked quite good, but now nothing works to access, add or set array elements using the index
import numpy as np
import pandas as pd
# create DataFrame
idx_names = ['A','B','C']
col_names = ['y']
df = pd.DataFrame(columns = idx_names + col_names)
# create DataFrame from numpy array
npa = np.zeros((5,10),dtype=np.uint8)
dfa = pd.DataFrame(npa)
# add DataFrames column-wise
t = pd.concat([df,dfa], axis=1)
# set index columns
t.set_index(idx_names,inplace=True)
y 0 1 2 3 4 5 6 7 8 9
A B C
NaN NaN NaN NaN 0 0 0 0 0 0 0 0 0 0
NaN NaN 0 0 0 0 0 0 0 0 0 0
NaN NaN 0 0 0 0 0 0 0 0 0 0
NaN NaN 0 0 0 0 0 0 0 0 0 0
NaN NaN 0 0 0 0 0 0 0 0 0 0
Now I would like to set values in the columns (y, 0, ... 9) by providing an index.
If the index is not already available it should be added to the table, e.g. (pseudo-code):
t( (t['A']='US',t['B']='CA',t['C']='SFO') , 'y') = "IT"
t( (t['A']='US',t['B']='CA',t['C']='LA' ) , '1') = 255
Assuming you have the following multi-index DataFrame:
In [44]: df
Out[44]:
d
a b c
0 0 1 1
4 4 4 3
0 1 4 4
2 6 1 3
0 1 3 6
and you want to add the following 2D array as 10 new columns:
In [45]: data
Out[45]:
array([[ 0.76021523, 0.92020945, 0.20205685, 0.03888115, 0.41166093, 0.67509844, 0.15351393, 0.00926459, 0.09297956, 0.72930072],
[ 0.38229582, 0.88199428, 0.08153019, 0.08367272, 0.88548522, 0.50332168, 0.94652147, 0.83362442, 0.219431 , 0.09399454],
[ 0.43743926, 0.79447959, 0.18430898, 0.31534202, 0.63229928, 0.80921108, 0.76570853, 0.09890863, 0.33604303, 0.92960105],
[ 0.6561763 , 0.26731786, 0.1266551 , 0.78960943, 0.900017 , 0.02468355, 0.99110764, 0.40402032, 0.46224193, 0.44569296],
[ 0.1509643 , 0.26830514, 0.69337022, 0.1339183 , 0.42711838, 0.0883597 , 0.6923594 , 0.01451872, 0.56684861, 0.46792245]])
Solution:
In [47]: df = df.join(pd.DataFrame(data, index=df.index))
In [48]: df
Out[48]:
d 0 1 2 3 4 5 6 7 8 9
a b c
0 0 1 1 0.760215 0.920209 0.202057 0.038881 0.411661 0.675098 0.153514 0.009265 0.092980 0.729301
4 4 4 3 0.382296 0.881994 0.081530 0.083673 0.885485 0.503322 0.946521 0.833624 0.219431 0.093995
0 1 4 4 0.437439 0.794480 0.184309 0.315342 0.632299 0.809211 0.765709 0.098909 0.336043 0.929601
2 6 1 3 0.656176 0.267318 0.126655 0.789609 0.900017 0.024684 0.991108 0.404020 0.462242 0.445693
0 1 3 6 0.150964 0.268305 0.693370 0.133918 0.427118 0.088360 0.692359 0.014519 0.566849 0.467922
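To also cover the part of the question about setting values by providing an index (adding the row when the key is missing), here is a hedged sketch that sidesteps .loc enlargement on a MultiIndex by appending the new row explicitly; set_cell and the frame layout are illustrative, not an established API:
import pandas as pd

# illustrative frame mirroring the question's setup (3 index levels, y, 10 data columns)
idx_names = ['A', 'B', 'C']
t = pd.DataFrame(columns=idx_names + ['y'] + list(range(10))).set_index(idx_names)

def set_cell(t, key, col, value):
    # append a fresh row for a new key first (concat is stable across versions),
    # then set the single cell by label with .loc
    if key not in t.index:
        new_row = pd.DataFrame([[None] + [0] * 10], columns=t.columns,
                               index=pd.MultiIndex.from_tuples([key], names=t.index.names))
        t = pd.concat([t, new_row])
    t.loc[key, col] = value
    return t

t = set_cell(t, ('US', 'CA', 'SFO'), 'y', 'IT')
t = set_cell(t, ('US', 'CA', 'LA'), 1, 255)
print(t)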