After a GroupBy operation I have the following DataFrame:
The user_id is grouped with their respective aisle_id as I want. Now I want to turn the aisle_id values into columns, with user_id as the index and all the aisle_id values as columns. The values should be the number of times each user_id/aisle_id pair appeared in the previous DataFrame. For example, if user_id 1 has bought from aisle_id 12 on 3 occasions, the value at DF[1, 12] would be 3.
With Pandas pivot tables I can get the template with user_id as the index and aisle_id as columns, but I can't seem to find a way to produce the values described above.
Considering your first dataframe is df, I think you could try this:
import pandas as pd

df2 = pd.DataFrame(index=df['user_id'].unique(), columns=df['aisle_id'].unique())
for i in df2.index:
    for j in df2.columns:
        # count the rows where this user bought from this aisle
        df2.at[i, j] = len(df.loc[(df['user_id'] == i) & (df['aisle_id'] == j)])
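For anything beyond small frames, the nested loops get slow. As a sketch, the same counts can come from a single `pd.crosstab` call; the sample data here is invented to match the question's description (user 1 buying from aisle 12 three times):

```python
import pandas as pd

# Hypothetical data shaped like the question's frame
df = pd.DataFrame({
    'user_id':  [1, 1, 1, 2, 2],
    'aisle_id': [12, 12, 12, 12, 7],
})

# crosstab counts how often each (user_id, aisle_id) pair occurs,
# with user_id as the index and aisle_id as the columns
df2 = pd.crosstab(df['user_id'], df['aisle_id'])
print(df2.loc[1, 12])  # user 1 bought from aisle 12 on 3 occasions -> 3
```

Pairs that never occur come out as 0 rather than NaN, which is usually what you want for a count matrix.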
I'd like to look up several columns from another dataframe, which I have in a list, and bring them over to my main dataframe, essentially doing a "v-lookup" of ~30 columns using ID as the key or lookup value for all of them.
However, for the columns that are the same between the two dataframes, I don't want to bring over the duplicate columns but have those values be filled in df1 from df2.
I've tried below:
df = pd.merge(df, df2[['ID', [look_up_cols]]],
              on='ID',
              how='left',
              # suffixes=(False, False)
              )
but it brings in the shared columns from df2 when I want df2's values filled into the same columns in df1.
I've also tried creating a dictionary (lookup_map) with the column pairs from each df and doing this for loop to look up each item in the dictionary in the other df, using ID as the key:
for col in look_up_cols:
    df1[col] = df2['ID'].map(lookup_map)
but this just returns NaNs.
You should be able to do something like the following:
df = pd.merge(df, df2[look_up_cols + ['ID']],
              on='ID',
              how='left')
This just adds the ID column to the look_up_cols list, which makes the selection a valid list of column names and thereby allows ID to be used as the key in the merge function.
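A minimal runnable sketch of the fixed call; the frames and column names here are made up, with `look_up_cols` standing in for the ~30-column list from the question:

```python
import pandas as pd

# Hypothetical frames: df1 has the ID key, df2 carries the lookup columns
df1 = pd.DataFrame({'ID': [1, 2, 3], 'val': ['a', 'b', 'c']})
df2 = pd.DataFrame({'ID': [1, 2, 3], 'x': [10, 20, 30], 'y': [0.1, 0.2, 0.3]})

look_up_cols = ['x', 'y']

# Select the key plus the lookup columns, then left-merge on ID
df1 = pd.merge(df1, df2[look_up_cols + ['ID']], on='ID', how='left')
print(df1.columns.tolist())  # ['ID', 'val', 'x', 'y']
```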
I have a big .csv file, and I needed to group products with the same name based on ordered quantity, which I have done through groupby(). However, I need all 7 of the columns in my file, but after joining the rows only qty_ordered and name_hash are left; the rest of my columns disappear. Is there any way to keep all 7 columns in my dataframe while joining rows based on name_hash and qty_ordered?
this is my code:
import pandas as pd
data = pd.read_csv('in/tables/sales-order-item.csv')
data = data.groupby('qty_ordered')['name_hash'].apply(' '.join).reset_index()
The output of this is the name_hash and qty_ordered columns combined, but I need to keep the rest of my columns as well. Any ideas on how to do this?
Let's say you have 20 rows in your original dataframe; when grouped by qty_ordered it may have only 10 unique values of qty_ordered.
What should the other column values be for those 10 unique values of qty_ordered? You have that figured out for the column name_hash. If you know what to do with the other columns similarly, you could use agg.
For example, if you want the same aggregation, ' '.join, applied everywhere, you could do
data.groupby('qty_ordered', as_index = False).agg(' '.join)
which will do the same operation ' '.join for all columns.
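If the other columns need a different aggregation (say, keeping the first value per group), you can pass agg a per-column dict. A sketch with an invented `price` column alongside the question's `name_hash`:

```python
import pandas as pd

# Hypothetical data: qty_ordered repeats, and other columns must survive the groupby
data = pd.DataFrame({
    'qty_ordered': [1, 1, 2],
    'name_hash':   ['a', 'b', 'c'],
    'price':       [10.0, 12.0, 5.0],
})

# Join the name_hash strings, but keep the first price per group
out = data.groupby('qty_ordered', as_index=False).agg(
    {'name_hash': ' '.join, 'price': 'first'}
)
print(out)
```

Note that ' '.join only works on string columns, so numeric columns need their own aggregation ('first', 'sum', 'mean', etc.) in the dict.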
I just edited the question as maybe I didn't make myself clear.
I have two dataframes (MR and DT)
The column 'A' in dataframe DT is a subset of the column 'A' in dataframe MR; the two frames are only similar (not equal) in this ID column, and the rest of the columns differ, as does the number of rows.
How can I get the rows from dataframe MR whose 'ID' values are equal to values in DT['ID']? Note that values in 'ID' can appear several times in the same column.
(DT has 1538 rows and MR has 2060 rows.)
I tried some of the lines proposed here: https://stackoverflow.com/questions/28901683/pandas-get-rows-which-are-not-in-other-dataframe but I got bizarre results, as I don't fully understand the methods proposed there (and my goal is a little different).
Thanks!
Take a look at pandas.Series.isin() method. In your case you'd want to use something like:
matching_id = MR.ID.isin(DT.ID) # This returns a boolean Series of whether values match or not
# Now filter your dataframe to keep only matching rows
new_df = MR.loc[matching_id, :]
Or, if you want a new dataframe of combined records for the same ID, you need to use merge():
new_df = pd.merge(MR, DT, on='ID')
This will create a new dataframe with columns from both original dfs but only where ID is the same.
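A small sketch showing both options side by side; the frames are toy stand-ins for MR and DT, with the 'ID' column name taken from the question:

```python
import pandas as pd

# Hypothetical frames standing in for MR and DT; note ID 2 repeats in MR
MR = pd.DataFrame({'ID': [1, 2, 2, 3], 'mr_col': ['a', 'b', 'c', 'd']})
DT = pd.DataFrame({'ID': [2, 3], 'dt_col': ['x', 'y']})

# Option 1: keep only MR rows whose ID also appears in DT (duplicates preserved)
matching = MR.loc[MR['ID'].isin(DT['ID'])]

# Option 2: inner merge combines columns from both frames on ID
combined = pd.merge(MR, DT, on='ID')
print(len(matching), len(combined))  # 3 3
```

With duplicated IDs the two results can differ in shape: isin() only filters MR's rows, while merge() produces one row per matching pair, so a repeated ID on both sides would multiply rows.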
My df1 is something like the first table in the image below, with the key column being Name. I want to add new rows from another dataframe, df2, which has only Name, Year, and Value columns. The new rows should be added based on Name; the other columns would just repeat the same value per Name. The result should be similar to the second table in the image below. How can I do this in pandas?
Create a sub-table df3 of df1 consisting of Group, Name, and Other, keeping only distinct records. Then left-join df2 and df3 to get the desired result.
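In pandas terms, that sub-table plus left-join might look like the following sketch; the Group/Name/Other/Year/Value column names come from the question, but the data is invented:

```python
import pandas as pd

# Hypothetical df1: two years of data for Name 'A', with repeating Group/Other
df1 = pd.DataFrame({
    'Group': ['G1', 'G1'], 'Name': ['A', 'A'],
    'Other': ['o', 'o'], 'Year': [2019, 2020], 'Value': [1, 2],
})
# df2 carries only the new Name/Year/Value rows
df2 = pd.DataFrame({'Name': ['A'], 'Year': [2021], 'Value': [3]})

# df3: one row per Name with the columns that should repeat
df3 = df1[['Group', 'Name', 'Other']].drop_duplicates()

# Left-join the new rows onto df3 to fill in the repeating columns,
# then append them to df1
new_rows = df2.merge(df3, on='Name', how='left')
result = pd.concat([df1, new_rows], ignore_index=True)
print(len(result))  # 3
```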
I have the above dataframe df, and I have the following dataframe as df2.
I want to fill the missing values in df with the values in df2 corresponding to the id.
Also, for Player1, Player2, and Player3: if the value is missing, I want to replace Player1, Player2, Player3 of df with the corresponding values from df2.
Thus the resultant dataframe would look like this.
Notice that Rick, Scott, and Andrew are still forwards, as they are in df. I just replaced players in df with the corresponding players from df2.
So far, I have attempted to fill the blank values in df with the values in df2.
import pandas as pd
import numpy as np

df = pd.read_csv('file1.csv')
for s in list(df.columns[1:]):
    df[s] = df[s].str.strip()
df.fillna('', inplace=True)
# turn whitespace-only cells into real NaN
df.replace(r'', np.nan, regex=True, inplace=True)

df2 = pd.read_csv('file2.csv')
for s in list(df2.columns[1:]):
    df2[s] = df2[s].str.strip()

# align both frames on Team ID so fillna matches rows by key
df.set_index('Team ID', inplace=True)
df2.set_index('Team ID', inplace=True)
df.fillna(df2, inplace=True)
df.reset_index(inplace=True)
I am getting the above result. How can I get the result in Image Number 3?
Using combine_first:
df1 = df1.set_index('Team ID')
df2 = df2.set_index('Team ID')
df2 = df2.combine_first(df1.replace('', np.nan)).reset_index()
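A minimal sketch of what combine_first does here; the frames are toy data using the question's 'Team ID' key and a single Player column, with invented names:

```python
import pandas as pd
import numpy as np

# Toy frames: df1 holds the original roster, df2 the replacement players
df1 = pd.DataFrame({'Team ID': [1, 2],
                    'Player1': ['Rick', 'Old']}).set_index('Team ID')
df2 = pd.DataFrame({'Team ID': [1, 2],
                    'Player1': [np.nan, 'New']}).set_index('Team ID')

# df2's non-null values take priority; gaps in df2 fall back to df1
out = df2.combine_first(df1).reset_index()
print(out['Player1'].tolist())  # ['Rick', 'New']
```

Because both frames are indexed on Team ID first, combine_first matches rows by key rather than by position, which is what makes the per-team replacement work.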