I am not sure if this is a good idea. I am using transfer learning to train on some new data. The model expects 180 columns (features), but the new data has 500 columns. Cutting columns from the new data is not a good option, so I am thinking of adding more columns to the dataset used for the original model. If I want to add columns 180 to 499 and assign 0 to those cells, how can I do it? Please ignore the label column for now. Thanks for your help.
Original df:
   0        1        2        3        4        5       ...  179  label
0  0.28001  0.32042  0.93222  0.87534  0.44252  0.2321
1
2
Expected output:
   0        1        2        3        4        5       ...  179  180  181  182  ...  499  label
0  0.28001  0.32042  0.93222  0.87534  0.44252  0.2321  ...  0    0    0    0    ...  0
1  0.38001  0.42042  0.13222  0.67534  0.64252  0.4321  ...  0    0    0    0    ...  0
2
Since you don't care about the label column, use pd.concat with a new DataFrame constructed from np.zeros.
Sample df
In [336]: df
Out[336]:
         0        1        2        3        4       5
0  0.28001  0.32042  0.93222  0.87534  0.44252  0.2321
1  0.38001  0.42042  0.13222  0.67534  0.64252  0.4321
m = 20  # 20 for this demo; change it to 500 for your real data
x, y = df.shape
df_final = pd.concat([df, pd.DataFrame(np.zeros((x, m - y))).add_prefix('n_')], axis=1)
In [340]: df_final
Out[340]:
0 1 2 3 4 5 n_0 n_1 n_2 n_3 \
0 0.28001 0.32042 0.93222 0.87534 0.44252 0.2321 0.0 0.0 0.0 0.0
1 0.38001 0.42042 0.13222 0.67534 0.64252 0.4321 0.0 0.0 0.0 0.0
n_4 n_5 n_6 n_7 n_8 n_9 n_10 n_11 n_12 n_13
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
If you need the new columns numbered sequentially:
m = 20
x, y = df.shape
df_final = pd.concat([df, pd.DataFrame(np.zeros((x, m - y)), columns=range(y, m))], axis=1)
Out[341]:
0 1 2 3 4 5 6 7 8 9 \
0 0.28001 0.32042 0.93222 0.87534 0.44252 0.2321 0.0 0.0 0.0 0.0
1 0.38001 0.42042 0.13222 0.67534 0.64252 0.4321 0.0 0.0 0.0 0.0
10 11 12 13 14 15 16 17 18 19
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
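Put together, a self-contained sketch of the sequential-numbering variant, using a small random frame to stand in for the real 180-feature data:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real frame: 6 feature columns, to be padded to 20.
df = pd.DataFrame(np.random.rand(3, 6))

target_width = 20  # use 500 for the real data
rows, cols = df.shape

# New zero columns numbered right after the last existing column,
# sharing df's index so concat aligns the rows correctly.
padding = pd.DataFrame(np.zeros((rows, target_width - cols)),
                       columns=range(cols, target_width), index=df.index)
df_final = pd.concat([df, padding], axis=1)
```

After this, `df_final` has columns 0 through 19, with everything past column 5 equal to zero.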
Related
I have two medium-sized datasets which look like:
books_df.head()
ISBN Book-Title Book-Author
0 0195153448 Classical Mythology Mark P. O. Morford
1 0002005018 Clara Callan Richard Bruce Wright
2 0060973129 Decision in Normandy Carlo D'Este
3 0374157065 Flu: The Story of the Great Influenza Pandemic... Gina Bari Kolata
4 0393045218 The Mummies of Urumchi E. J. W. Barber
and
ratings_df.head()
User-ID ISBN Book-Rating
0 276725 034545104X 0
1 276726 0155061224 5
2 276727 0446520802 0
3 276729 052165615X 3
4 276729 0521795028 6
And I want to get a pivot table like this:
ISBN 1 2 3 4 5 6 7 8 9 10 ... 3943 3944 3945 3946 3947 3948 3949 3950 3951 3952
User-ID
1 5.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 0.0 0.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
I've tried:
R_df = ratings_df.pivot(index = 'User-ID', columns ='ISBN', values = 'Book-Rating').fillna(0) # Memory overflow
which failed for:
MemoryError:
and this:
R_df = q_data.groupby(['User-ID', 'ISBN'])['Book-Rating'].mean().unstack()
which failed with the same error.
I want to use it for singular value decomposition and matrix factorization.
Any ideas?
The dataset I'm working with is: http://www2.informatik.uni-freiburg.de/~cziegler/BX/
One option is to use pandas Sparse functionality, since your data here is (very) sparse:
In [11]: df
Out[11]:
User-ID ISBN Book-Rating
0 276725 034545104X 0
1 276726 0155061224 5
2 276727 0446520802 0
3 276729 052165615X 3
4 276729 0521795028 6
In [12]: res = df.groupby(['User-ID', 'ISBN'])['Book-Rating'].mean().astype('Sparse[int]')
In [13]: res.unstack(fill_value=0)
Out[13]:
ISBN 0155061224 034545104X 0446520802 052165615X 0521795028
User-ID
276725 0 0 0 0 0
276726 5 0 0 0 0
276727 0 0 0 0 0
276729 0 0 0 3 6
In [14]: _.dtypes
Out[14]:
ISBN
0155061224 Sparse[int64, 0]
034545104X Sparse[int64, 0]
0446520802 Sparse[int64, 0]
052165615X Sparse[int64, 0]
0521795028 Sparse[int64, 0]
dtype: object
My understanding is that you can then use this with scipy e.g. for SVD:
In [15]: res.unstack(fill_value=0).sparse.to_coo()
Out[15]:
<4x5 sparse matrix of type '<class 'numpy.int64'>'
with 3 stored elements in COOrdinate format>
Let's say I have this DataFrame:
df = pd.DataFrame({'age':[10,11,10,20,25,10],'field':['cat','cat','cat','dog','cow','cat']})
>>> df
age field
0 10 cat
1 11 cat
2 10 cat
3 20 dog
4 25 cow
5 10 cat
My goal is to groupby('field'), use it as the index, have age columns from 1 to 90, and get the percentage distribution for each field, like this:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 ...
field
cat 0 0 0 0 0 0 0 0 0 75 25 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
dog 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 100 0 0 0 0 0 ...
cow 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 100 ...
Please help me... thanks for your support!
I believe what you're looking for is pivot_table:
df = pd.DataFrame({'age':[10,11,10,20,25,10],'field':['cat','cat','cat','dog','cow','cat']})
pivot = \
    (df
     .assign(vals=1)
     .pivot_table(values='vals', index='field', columns='age', aggfunc='sum')
     .fillna(0)
    )
row_totals = pivot.sum(axis=1)
percentages = pivot.div(row_totals, axis=0) * 100
final = percentages.reindex(range(1,91), axis=1, fill_value=0.0)
pivot_table calculates the frequency of each age/field combination;
row_totals counts the occurrences per row, which are then used to calculate the row-wise percentages in percentages;
finally, we reindex to add the empty columns for ages 1-90.
age 1 2 3 4 5 6 7 8 9 10 ... 81 82 83 \
field ...
cat 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 75.0 ... 0.0 0.0 0.0
cow 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0
dog 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0
age 84 85 86 87 88 89 90
field
cat 0.0 0.0 0.0 0.0 0.0 0.0 0.0
cow 0.0 0.0 0.0 0.0 0.0 0.0 0.0
dog 0.0 0.0 0.0 0.0 0.0 0.0 0.0
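As an aside, pd.crosstab with normalize='index' can collapse the assign/pivot_table/div steps into a single call; a sketch on the same toy data:

```python
import pandas as pd

df = pd.DataFrame({'age': [10, 11, 10, 20, 25, 10],
                   'field': ['cat', 'cat', 'cat', 'dog', 'cow', 'cat']})

# normalize='index' yields row-wise fractions directly; reindex pads ages 1-90.
final = (pd.crosstab(df['field'], df['age'], normalize='index') * 100)
final = final.reindex(range(1, 91), axis=1, fill_value=0.0)
```

The result matches the pivot_table route: 75/25 for cat at ages 10/11, 100 for dog at 20 and cow at 25.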
I am attempting to multiply specific columns by a value in their respective row.
For example:
X Y Z
A 10 1 0 1
B 50 0 0 0
C 80 1 1 1
Would become:
X Y Z
A 10 10 0 10
B 50 0 0 0
C 80 80 80 80
The problem I am having is that it times out when I use mul(), since my real dataset is very large. I tried to iterate with a loop in my real code as follows:
for i in range(1, df_final_small.shape[0]):
    df_final_small.iloc[i].values[3:248] = df_final_small.iloc[i].values[3:248] * df_final_small.iloc[i].values[2]
Which when applied to the example dataframe would look like this:
for i in range(1, df_final_small.shape[0]):
    df_final_small.iloc[i].values[1:4] = df_final_small.iloc[i].values[1:4] * df_final_small.iloc[i].values[0]
There must be a better way to do this; I am having trouble figuring out how to apply the multiplication only to certain columns in each row rather than to the entire row.
EDIT:
To detail further here is my df.head(5).
id gross 150413 Welcome Email 150413 Welcome Email Repeat Cust 151001 Welcome Email 151001 Welcome Email Repeat Cust 161116 eKomi 1702 Hot Leads Email 1702 Welcome Email - All Purchases 1804 Hot Leads ... SILVER GOLD PLATINUM Acquisition Direct Mail Conversion Direct Mail Retention Direct Mail Retention eMail cluster x y
0 0033333 46.2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 1.0 0.0 0.0 0.0 1.0 0.0 10 -0.230876 0.461990
1 0033331 2359.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 ... 0.0 1.0 0.0 0.0 0.0 1.0 0.0 9 0.231935 -0.648713
2 0033332 117.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 ... 0.0 1.0 0.0 0.0 0.0 1.0 0.0 5 -0.812921 -0.139403
3 0033334 89.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 ... 0.0 1.0 0.0 0.0 0.0 1.0 0.0 5 -0.812921 -0.139403
4 0033335 1908.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 1.0 0.0 0.0 1.0 0.0 0.0 7 -0.974142 0.145032
Just specify the columns you want to multiply. Example:
df = pd.DataFrame({'A': 10, 'X': 1, 'Y': 1, 'Z': 1}, index=[1])
df.loc[:, ['X', 'Y', 'Z']] = df.loc[:, ['X', 'Y', 'Z']].values * df.iloc[:, 0:1].values
If you want to provide an arbitrary range of columns, use iloc (in Python 3, wrap each range in list before concatenating):

range_of_columns = list(range(10, 5001)) + list(range(5030, 10001))
df.iloc[:, range_of_columns].values * df.iloc[:, 0:1].values  # multiply the range of columns by the first column
Using mul with axis=0, taking the multiplier from the index with get_level_values:
df.mul(df.index.get_level_values(1),axis=0)
Out[167]:
X Y Z
A 10 10 0 10
B 50 0 0 0
C 80 80 80 80
Also, when the DataFrame is way too big, you can split it and process it in chunks.
dfs = np.split(df, [2], axis=0)
pd.concat([x.mul(x.index.get_level_values(1), axis=0) for x in dfs])
Out[174]:
X Y Z
A 10 10 0 10
B 50 0 0 0
C 80 80 80 80
I would also recommend NumPy broadcasting:
df.values*df.index.get_level_values(1)[:,None]
Out[177]:
array([[10,  0, 10],
       [ 0,  0,  0],
       [80, 80, 80]])
pd.DataFrame(df.values*df.index.get_level_values(1)[:,None],index=df.index,columns=df.columns)
Out[181]:
X Y Z
A 10 10 0 10
B 50 0 0 0
C 80 80 80 80
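Putting the sample data and the mul call together into a runnable sketch (the multiplier sits in the second level of a MultiIndex, matching the layout shown above):

```python
import pandas as pd

# Rebuild the sample: the multiplier (10/50/80) lives in the second index level.
df = pd.DataFrame({'X': [1, 0, 1], 'Y': [0, 0, 1], 'Z': [1, 0, 1]},
                  index=pd.MultiIndex.from_tuples([('A', 10), ('B', 50), ('C', 80)]))

# Row-wise multiplication by the second index level, vectorized.
out = df.mul(df.index.get_level_values(1), axis=0)
```

This avoids the Python-level loop entirely, which is what makes it fast on large frames.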
I have a dataframe like this
value
msno features days
B num_50 1 0
C num_100 3 1
A num_100 400 2
I used
df = df.unstack(level=-1,fill_value = '0')
df = df.unstack(level=-1,fill_value = '0')
df = df.stack()
then df looks like :
value
days 1 3 400
msno features
B num_50 0 0 0
num_100 0 0 0
C num_50 0 0 0
num_100 0 1 0
A num_50 0 0 0
num_100 0 0 2
Now I want to fill this df with 0 but still keep the original data, like this:
value
days 1 2 3 4 ... 400
msno features
B num_50 0 0 0 0 ... 0
num_100 0 0 0 0 ... 0
C num_50 0 0 0 0 ... 0
num_100 0 0 1 0 ... 0
A num_50 0 0 0 0 ... 0
num_100 0 0 0 0 ... 2
I want to add all the columns from 1 to 400 and fill them with 0. Could someone tell me how to do that?
By using reindex (range(1, 20) here for demo purposes; use range(1, 401) for your real data):
df.columns=df.columns.droplevel()
df.reindex(columns=list(range(1,20))).fillna(0)
Out[414]:
days 1 2 3 4 5 6 7 8 9 10 11 12 13 \
msno features
A num_100 0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
num_50 0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
B num_100 0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
num_50 0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
C num_100 0 0.0 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
num_50 0 0.0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
days 14 15 16 17 18 19
msno features
A num_100 0.0 0.0 0.0 0.0 0.0 0.0
num_50 0.0 0.0 0.0 0.0 0.0 0.0
B num_100 0.0 0.0 0.0 0.0 0.0 0.0
num_50 0.0 0.0 0.0 0.0 0.0 0.0
C num_100 0.0 0.0 0.0 0.0 0.0 0.0
num_50 0.0 0.0 0.0 0.0 0.0 0.0
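For the full 1-400 range, reindex also accepts a fill_value, which avoids the float upcast that fillna causes; a sketch on a rebuilt toy version of the frame:

```python
import pandas as pd

# Rebuild a small version: (msno, features) row index, 'days' columns 1/3/400.
idx = pd.MultiIndex.from_product([['B', 'C', 'A'], ['num_50', 'num_100']],
                                 names=['msno', 'features'])
df = pd.DataFrame(0, index=idx, columns=pd.Index([1, 3, 400], name='days'))
df.loc[('C', 'num_100'), 3] = 1
df.loc[('A', 'num_100'), 400] = 2

# reindex with fill_value inserts the missing day columns as integer zeros.
full = df.reindex(columns=range(1, 401), fill_value=0)
```

The original nonzero entries survive, and every day from 1 to 400 now has a column.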
I have two dataframes:
dayData
power_comparison final_average_delta_power calculated_power
1 0.0 0.0 0
2 0.0 0.0 0
3 0.0 0.0 0
4 0.0 0.0 0
5 0.0 0.0 0
7 0.0 0.0 0
and
historicPower
power
0 0.0
1 0.0
2 0.0
3 -1.0
4 0.0
5 1.0
7 0.0
I'm trying to reindex the historicPower dataframe to have the same shape as the dayData dataframe (so in this example it would look like):
power
1 0.0
2 0.0
3 -1.0
4 0.0
5 1.0
7 0.0
In reality the dataframes will be a lot larger, with different shapes.
I think you can use reindex if the index has no duplicates:
historicPower = historicPower.reindex(dayData.index)
print (historicPower)
power
1 0.0
2 0.0
3 -1.0
4 0.0
5 1.0
7 0.0
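A runnable sketch of the same idea on the sample data:

```python
import pandas as pd

# Toy versions of the two frames; only the indexes matter for the alignment.
dayData = pd.DataFrame({'calculated_power': [0, 0, 0, 0, 0, 0]},
                       index=[1, 2, 3, 4, 5, 7])
historicPower = pd.DataFrame({'power': [0.0, 0.0, 0.0, -1.0, 0.0, 1.0, 0.0]},
                             index=[0, 1, 2, 3, 4, 5, 7])

# Align historicPower onto dayData's index: row 0 is dropped, rows 1-7 kept.
aligned = historicPower.reindex(dayData.index)
```

Rows present in historicPower but missing from dayData are simply discarded; any index labels in dayData that historicPower lacks would come back as NaN.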