I am trying to take the dot product of some columns in my dataset:
df_disorders_3col = df.iloc[:,disorders_indexes]
df_disorders_3col.drop([5278, 10122, 10124, 10125, 10126], axis=0, inplace=True)
df_disorders_3col = df_disorders_3col.astype(int)
df_disorders_3col['Disorders'] = df_disorders_3col.dot(df.columns + ',').str.rstrip(',')
df_disorders_3col.head()
but I get this error when running this block of code:
ValueError: Dot product shape mismatch, (10133, 38) vs (498,)
this is a sample of my data:
>>>df_disorders_3col.sample(5)
HasDiabetes HasHypertension HasCardiacDisease ... HasMS HasPregnancyHypertension HasPregnancyDiabetes
752 0 0 0 1 0 0
6312 0 0 0 0 0 0
6984 1 0 0 0 0 0
9016 0 0 0 0 0 1
8923 0 0 0 0 0 0
5 rows × 38 columns
also this is the shape of df_disorders_3col:
>>>df_disorders_3col.shape
(10133, 38)
and df:
>>>df.shape
(10138, 498)
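For context on the shapes: df.dot(other) requires the length of other to match the frame's column count, so passing df.columns (498 labels) to the 38-column df_disorders_3col raises exactly this mismatch. A likely fix, assuming the goal is to join the names of the columns holding a 1, is to use the subset's own columns:

df_disorders_3col['Disorders'] = df_disorders_3col.dot(df_disorders_3col.columns + ',').str.rstrip(',')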
I'm trying to append predictions to my original data, which is:
product_id date views wishlists cartadds orders order_units gmv score
mp000000000001321 01-09-2022 0 0 0 0 0 0 0
mp000000000001321 02-09-2022 0 0 0 0 0 0 0
mp000000000001321 03-09-2022 0 0 0 0 0 0 0
mp000000000001321 04-09-2022 0 0 0 0 0 0 0
I have sequence lengths of [1, 3], and for each sequence length I have a prediction. I want to add those predictions to my original data so that my output is like this:
product_id date views wishlists cartadds orders order_units gmv score prediction sequence_length
mp000000000001321 01-09-2022 0 0 0 0 0 0 0 5.75 1
mp000000000001321 01-09-2022 0 0 0 0 0 0 0 5.88 3
mp000000000001321 02-09-2022 0 0 0 0 0 0 0 5.88 3
mp000000000001321 03-09-2022 0 0 0 0 0 0 0 5.88 3
I have tried the following:
df1 = df_batch.head(sequence_length)
dfff = pd.DataFrame.from_dict(predictions_dict, orient='index')
dfff.index.names = ['product_id']
merged_df = df1.merge(dfff, on='product_id')
merged_df.to_csv('data_prediction'+str(sequence_length)+'.csv', index_label='product_id')
but this only saves the data for the last product_id that was sent, and it saves each sequence length to a different CSV. I want everything to be in one CSV instead. How do I do that?
Edit: sample predictions_dict:
{'mp000000000001321': {'sequence_length': 1, 'prediction': 5.75}}
{'mp000000000001321': {'sequence_length': 3, 'prediction': 5.88}}
So, I found a fix:
# new_df is initialized once (new_df = pd.DataFrame()) before this block runs
df1 = df_batch[df_batch['product_id'] == product_id].iloc[:sequence_length]
dfff = pd.DataFrame.from_dict(predictions_dict, orient='index')
dfff.index.names = ['product_id']
merged_df = df1.merge(dfff, on='product_id')
new_df = pd.concat([new_df, merged_df], ignore_index=True)
This way I'm able to get the desired output for unique product IDs.
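For reference, here is a minimal sketch of how that fix fits into a single loop that writes one CSV at the end. The all_predictions driver below is hypothetical; it stands in for however the (product_id, sequence_length, predictions_dict) triples are actually produced:

import pandas as pd

new_df = pd.DataFrame()  # accumulator, created once before the loop

# hypothetical driver: one (product_id, sequence_length, predictions_dict) triple per run
for product_id, sequence_length, predictions_dict in all_predictions:
    df1 = df_batch[df_batch['product_id'] == product_id].iloc[:sequence_length]
    dfff = pd.DataFrame.from_dict(predictions_dict, orient='index')
    dfff.index.names = ['product_id']
    merged_df = df1.merge(dfff, on='product_id')
    new_df = pd.concat([new_df, merged_df], ignore_index=True)

# a single CSV for all products and sequence lengths
new_df.to_csv('data_prediction.csv', index=False)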
#df =
order_number Product1 ... Product 16 Product17
0 4329374937 1 ... 0 0
1 3483872349 1 ... 0 0
2 2394287383 1 ... 0 0
3 3423984902 1 ... 1 0
4 9378374873 0 ... 0 0
Batch1 = ["Product1", "Product2", "Product 6"]
for indices in df.index:
    for column in columns:
        if df[column] > 0 and in Batch1 df[B1] = True
        else df[B1] = False
print(df.head))
I am trying to determine a way to look through each order number and see whether the products ordered (quantities greater than 0) fall within my listed batch. I want to create a new boolean column that is set for each row. I am getting a syntax error.
From what I understand, you want to take the columns in your batch, add up the quantities, and see where the sum is > 0. So:
batch_1 = ["Product1", "Product2", "Product6"]
df['B1'] = df[batch_1].sum(axis=1)>0
df
output:
order_number Product1 Product2 Product6 Product16 Product17 B1
-- -------------- ---------- ---------- ---------- ----------- ----------- -----
0 4329374937 1 1 3 0 0 True
1 3483872349 1 0 1 0 0 True
2 2394287383 1 0 2 0 0 True
3 3423984902 1 0 1 1 0 True
4 9378374873 0 0 0 0 0 False
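A closely related variant, assuming the quantities are non-negative (as in the sample): check whether any product in the batch appears on the order at all, which reads closer to the question's wording and gives the same result here:

df['B1'] = df[batch_1].gt(0).any(axis=1)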
I am creating four columns, which are labeled flagMin, flagMax, flagLow, and flagUp. I am updating these dataframe columns each time it runs through the loop; however, my original data is being overwritten. I would like to keep the previous data in the 4 columns, since they contain 1s when true.
import pandas as pd
import numpy as np

df = pd.read_excel('help test 1.xlsx')

# groupby separates the different Name parameters within the Name column,
# taking the lowest of the "Min" and "Lower" columns and the highest of the
# "Max" and "Upper" columns for each Name.
flagMin = df.groupby(['Name'], as_index=False)['Min'].min()
flagMax = df.groupby(['Name'], as_index=False)['Max'].max()
flagLow = df.groupby(['Name'], as_index=False)['Lower'].min()
flagUp = df.groupby(['Name'], as_index=False)['Upper'].max()
print(flagMin)
print(flagMax)
print(flagLow)
print(flagUp)

num = len(flagMin)  # size of 2, works for all flags in this case
for i in range(num):
    # iterate through each row of parameters, column number 1 (the Min/Max/Lower/Upper column)
    colMin = flagMin.iloc[i, 1]
    colMax = flagMax.iloc[i, 1]
    colLow = flagLow.iloc[i, 1]
    colUp = flagUp.iloc[i, 1]
    # set a flag to '1' where the row's parameter matches the flag dataframe's
    # parameter, '0' otherwise -- note this reassigns the whole column every iteration
    df['flagMin'] = np.where(df['Min'] == colMin, '1', '0')
    df['flagMax'] = np.where(df['Max'] == colMax, '1', '0')
    df['flagLow'] = np.where(df['Lower'] == colLow, '1', '0')
    df['flagUp'] = np.where(df['Upper'] == colUp, '1', '0')
    print(df)
The 4 DataFrames for each flag, printed by the code above:
Name Min
0 Vo 12.8
1 Vi -51.3
Name Max
0 Vo 39.9
1 Vi -25.7
Name Lower
0 Vo -46.0
1 Vi -66.1
Name Upper
0 Vo 94.3
1 Vi -14.1
Output 1st iteration
flagMax flagLow flagUp
0 0 0 0
1 0 0 0
2 0 0 0
3 1 0 0
4 0 0 0
5 0 0 0
6 0 0 1
7 0 1 0
8 0 0 0
9 0 0 0
10 0 0 0
11 0 0 0
12 0 0 0
13 0 0 0
14 0 0 0
15 0 0 0
16 0 0 0
17 0 0 0
Output 2nd Iteration
flagMax flagLow flagUp
0 0 0 0
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
5 0 0 0
6 0 0 0
7 0 0 0
8 0 0 0
9 1 0 1
10 0 0 0
11 0 0 0
12 0 0 0
13 0 0 0
14 0 0 0
15 0 1 0
16 0 0 0
17 0 0 0
I lose the 1s from rows 3, 6, and 7. I would like to keep the 1s from both sets of data. Thank you.
Just set to '1' only the elements you want to update, rather than reassigning the whole column:
import pandas as pd
import numpy as np

df = pd.read_excel('help test 1.xlsx')

# groupby separates the different Name parameters within the Name column,
# taking the lowest of the "Min" and "Lower" columns and the highest of the
# "Max" and "Upper" columns for each Name.
flagMin = df.groupby(['Name'], as_index=False)['Min'].min()
flagMax = df.groupby(['Name'], as_index=False)['Max'].max()
flagLow = df.groupby(['Name'], as_index=False)['Lower'].min()
flagUp = df.groupby(['Name'], as_index=False)['Upper'].max()
print(flagMin)
print(flagMax)
print(flagLow)
print(flagUp)

num = len(flagMin)  # size of 2, works for all flags in this case

# initialize the flag columns to '0' once, outside the loop
df['flagMin'] = '0'
df['flagMax'] = '0'
df['flagLow'] = '0'
df['flagUp'] = '0'

for i in range(num):
    # iterate through each row of parameters, column number 1 (the Min/Max/Lower/Upper column)
    colMin = flagMin.iloc[i, 1]
    colMax = flagMax.iloc[i, 1]
    colLow = flagLow.iloc[i, 1]
    colUp = flagUp.iloc[i, 1]
    # set '1' only on matching rows; 1s from earlier iterations are preserved
    # (df.loc[mask, col] avoids chained-indexing assignment, which may not stick)
    df.loc[df['Min'] == colMin, 'flagMin'] = '1'
    df.loc[df['Max'] == colMax, 'flagMax'] = '1'
    df.loc[df['Lower'] == colLow, 'flagLow'] = '1'
    df.loc[df['Upper'] == colUp, 'flagUp'] = '1'

print(df)
P.S. I don't know why you are using strings of '0' and '1' instead of just 0 and 1, but that's up to you.
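As a side note, the loop can be avoided entirely with groupby().transform(), which broadcasts each group's extreme back onto its rows. A sketch assuming the same column names as above:

import pandas as pd
import numpy as np

df = pd.read_excel('help test 1.xlsx')

# compare each row's value against its Name group's extreme in one vectorized pass
df['flagMin'] = np.where(df['Min'] == df.groupby('Name')['Min'].transform('min'), '1', '0')
df['flagMax'] = np.where(df['Max'] == df.groupby('Name')['Max'].transform('max'), '1', '0')
df['flagLow'] = np.where(df['Lower'] == df.groupby('Name')['Lower'].transform('min'), '1', '0')
df['flagUp'] = np.where(df['Upper'] == df.groupby('Name')['Upper'].transform('max'), '1', '0')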
I have a dict as follows:
data_dict = {'1.160.139.117': ['712907','742068'],
'1.161.135.205': ['667386','742068'],
'1.162.51.21': ['326136', '663056', '742068']}
I want to convert the dict into a dataframe:
df= pd.DataFrame.from_dict(data_dict, orient='index')
How can I create a dataframe that has columns representing the values of the dictionary and rows representing the keys of the dictionary, as below?
The best option is #4
pd.get_dummies(df.stack()).sum(level=0)
Option 1:
One way you could do it:
df.stack().reset_index(level=1)\
.set_index(0,append=True)['level_1']\
.unstack().notnull().mul(1)
Output:
326136 663056 667386 712907 742068
1.160.139.117 0 0 0 1 1
1.161.135.205 0 0 1 0 1
1.162.51.21 1 1 0 0 1
Option 2
Or with a little reshaping and pd.crosstab:
df2 = df.stack().reset_index(name='Values')
pd.crosstab(df2.level_0,df2.Values)
Output:
Values 326136 663056 667386 712907 742068
level_0
1.160.139.117 0 0 0 1 1
1.161.135.205 0 0 1 0 1
1.162.51.21 1 1 0 0 1
Option 3
df.stack().reset_index(name="Values")\
.pivot(index='level_0',columns='Values')['level_1']\
.notnull().astype(int)
Output:
Values 326136 663056 667386 712907 742068
level_0
1.160.139.117 0 0 0 1 1
1.161.135.205 0 0 1 0 1
1.162.51.21 1 1 0 0 1
Option 4 (@Wen pointed out a short solution, and the fastest so far)
pd.get_dummies(df.stack()).sum(level=0)
Output:
326136 663056 667386 712907 742068
1.160.139.117 0 0 0 1 1
1.161.135.205 0 0 1 0 1
1.162.51.21 1 1 0 0 1
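Note: Series.sum(level=0) has since been deprecated and removed in modern pandas (2.0+); the equivalent spelling of Option 4 there is:

pd.get_dummies(df.stack()).groupby(level=0).sum()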
I have a huge co-occurrence matrix with skill_id as the index and skill_id as the column names, and the matrix is filled with their co-occurrence counts. Please find the sample below.
I want the data in a 3-column dataframe: skillid1, skillid2, count.
Any help would be highly appreciated.
Supposing your co-occurrence matrix is called df and looks like this:
4044 4092 4651 6168 6229 6284 6295
4044 0 0 0 1 1 0 0
4092 0 0 1 0 0 0 0
4651 0 1 0 0 0 0 0
6168 1 0 0 0 1 0 0
6229 1 0 0 1 0 0 0
6284 0 0 0 0 0 0 1
6295 0 0 0 0 0 1 0
I'd suggest the following:
import itertools
import pandas as pd

# get all possible pairs of (skillid1, skillid2)
edges = list(itertools.combinations(df.columns, 2))
# look up the associated weights in the original df
edges_with_weights = [(node1, node2, df.loc[node1, node2]) for (node1, node2) in edges]
# put it all in a new dataframe
new_df = pd.DataFrame(edges_with_weights, columns=["skillid1", "skillid2", "count"])
Such that now you have your desired new_df:
skillid1 skillid2 count
0 4044 4092 0
1 4044 4651 0
2 4044 6168 1
3 4044 6229 1
4 4044 6284 0
5 4044 6295 0
6 4092 4651 1
7 4092 6168 0
...
...
...
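An equivalent vectorized sketch, assuming df is the symmetric co-occurrence matrix shown above: mask everything except the strict upper triangle, then stack, so each unordered pair appears exactly once:

import numpy as np
import pandas as pd

# keep only the strict upper triangle so each pair is listed once
mask = np.triu(np.ones(df.shape, dtype=bool), k=1)
new_df = (df.where(mask)                  # lower triangle and diagonal become NaN
            .stack()                      # stack drops the NaN cells
            .rename_axis(['skillid1', 'skillid2'])
            .reset_index(name='count'))
new_df['count'] = new_df['count'].astype(int)  # restore int after NaNs made the values float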
from itertools import combinations

weights = []

# diagonal entries: a skill co-occurring with itself
for skill_id in skills.skill_id:
    if str(skill_id) in count_model.vocabulary_.keys():
        i = count_model.vocabulary_[str(skill_id)]
        j = count_model.vocabulary_[str(skill_id)]
        if skills_occurrences[i][j] > 0:
            weights.append([skill_id, skill_id, skills_occurrences[i][j]])

# off-diagonal entries: every pair of distinct skills
for combination in combinations(skills.skill_id, 2):
    if str(combination[0]) in count_model.vocabulary_.keys() and str(combination[1]) in count_model.vocabulary_.keys():
        i = count_model.vocabulary_[str(combination[0])]
        j = count_model.vocabulary_[str(combination[1])]
        if skills_occurrences[i][j] > 0:
            weights.append([str(combination[0]), str(combination[1]), skills_occurrences[i][j]])
I had one more data set to process; after that, I just nested-looped over both skill_ids, compared them, and kept appending the pair together with the value at those indices.