Get unique values of multiple columns as a new dataframe in pandas - python

Given a pandas DataFrame df with at least the columns C1, C2, C3, how would you get all the unique (C1, C2, C3) combinations as a new DataFrame?
In other words, similar to:
SELECT C1,C2,C3
FROM T
GROUP BY C1,C2,C3
I tried this:
print(df.groupby(by=['C1','C2','C3']))
but I'm getting
<pandas.core.groupby.DataFrameGroupBy object at 0x000000000769A9E8>

I believe you need drop_duplicates if you want all unique triples:
df = df.drop_duplicates(subset=['C1','C2','C3'])
If you want to use groupby, append first():
df = df.groupby(by=['C1','C2','C3'], as_index=False).first()
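Both calls above keep one full row per triple, including any other columns. To mirror the SQL query more closely and return only C1, C2, C3, one option is to select the columns before dropping duplicates. The following is a minimal sketch on made-up data; the `other` column and the name `unique_triples` are illustrative assumptions, not from the question.

```python
import pandas as pd

# Hypothetical sample data; "other" stands in for any extra columns.
df = pd.DataFrame({
    "C1": [1, 1, 2, 2],
    "C2": ["a", "a", "b", "b"],
    "C3": [True, True, False, True],
    "other": [10, 20, 30, 40],
})

# Select the three columns first so the result contains only C1, C2, C3,
# then drop repeated rows and renumber the index.
unique_triples = df[["C1", "C2", "C3"]].drop_duplicates().reset_index(drop=True)
print(unique_triples)
```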

Related

How can I take values from a selection of columns in my Pandas dataframe to create a new column that contains a list of those values?

I have a dataframe in Pandas where I would like to turn the values of a set of columns (more specifically, from column index 3 to the end) into a new single column that contains a list of those values in each row.
Right now, I have code that can print out a list of the values in the columns, but only for a single row. How can I do this for the whole dataframe?
import pandas as pd
orig_df = pd.read_csv('zipcode_price_dataset.csv')
df = orig_df.loc[(orig_df['State'] == "CA")]
row = df.head(1)
print(row[df.columns[3:].values].values[0])
I could iterate through the rows using a for loop, but is there a more concise way to do this?
Something like the following:
df['new'] = df[df.columns[3:]].values.tolist()
Use .iloc:
df.iloc[: , 3:].agg(list, axis=1)
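As a rough end-to-end illustration of the first suggestion, here is a small sketch with invented data in place of the CSV. The column names and the `prices` column are assumptions. The `.copy()` after filtering is there because assigning a new column to a filtered slice can otherwise trigger pandas' SettingWithCopyWarning.

```python
import pandas as pd

# Hypothetical stand-in for zipcode_price_dataset.csv.
orig_df = pd.DataFrame({
    "Zip": [90001, 94105, 10001],
    "City": ["Los Angeles", "San Francisco", "New York"],
    "State": ["CA", "CA", "NY"],
    "2020": [500, 1200, 900],
    "2021": [550, 1300, 950],
})

# Copy the filtered rows so the new column is added to an independent frame.
df = orig_df.loc[orig_df["State"] == "CA"].copy()

# Take every column from position 3 onward and turn each row into a list.
df["prices"] = df.iloc[:, 3:].to_numpy().tolist()
print(df[["Zip", "prices"]])
```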

Merge Dataframes using List of Columns (Pandas Vlookup)

I'd like to look up several columns, which I have in a list, from another dataframe and bring them over to my main dataframe, essentially doing a "v-lookup" of ~30 columns using ID as the key or lookup value for all columns.
However, for the columns that are the same between the two dataframes, I don't want to bring over the duplicate columns but have those values be filled in df1 from df2.
I've tried below:
df = pd.merge(df,df2[['ID', [look_up_cols]]] ,
on ='ID',
how ='left',
#suffixes=(False,False)
)
but it brings in the shared columns from df2 when I want df2's values filled into the same columns in df1.
I've also tried creating a dictionary with the column pairs from each df and using this for loop to look up each item in the dictionary (lookup_map) in the other df using ID as the key:
for col in look_up_cols:
    df1[col] = df2['ID'].map(lookup_map)
but this just returns NaNs.
You should be able to do something like the following:
df = pd.merge(df, df2[look_up_cols + ['ID']],
              on='ID',
              how='left')
This appends the ID column to the look_up_cols list (which must be a plain list of column names), so that ID is present in the subset of df2 passed to merge and can be used as the key.
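The merge above still produces suffixed duplicates for columns that exist in both frames. One possible way to fold those back into the original columns is sketched below. It assumes that ID is unique in df2 and that "filled from df2" means df2's value wins wherever it is present, with df1's value kept otherwise. The sample frames and the `_df2` suffix are made up for illustration.

```python
import pandas as pd

# Hypothetical frames: "a" exists in both, "b" only in df2, "keep" only in df1.
df1 = pd.DataFrame({"ID": [1, 2, 3], "a": [None, 20.0, 30.0], "keep": ["x", "y", "z"]})
df2 = pd.DataFrame({"ID": [1, 2], "a": [10.0, 99.0], "b": ["p", "q"]})
look_up_cols = ["a", "b"]

# Left columns keep their names; overlapping df2 columns get the "_df2" suffix.
merged = df1.merge(df2[["ID"] + look_up_cols], on="ID", how="left", suffixes=("", "_df2"))

for col in look_up_cols:
    dup = col + "_df2"
    if dup in merged.columns:
        # Prefer df2's value; fall back to df1's where df2 has none.
        merged[col] = merged[dup].combine_first(merged[col])
        merged = merged.drop(columns=dup)

print(merged)
```

If df1's existing values should take priority instead, swapping the two operands of `combine_first` gives that behavior.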

how to get column names in pandas of getdummies

After I created a data frame and applied get_dummies to it:
df_final=pd.get_dummies(df,columns=['type'])
I got the new columns that I want and everything is working.
My question is, how can I get the names of the new columns created by get_dummies? My dataframe is dynamic, so I can't refer to them statically; I want to save all the new column names in a list.
An option would be:
df_dummy = pd.get_dummies(df, columns=target_cols)
df_dummy.columns.difference(df.columns).tolist()
where df is your original dataframe, df_dummy the output from pd.get_dummies, and target_cols your list of columns to get the dummies.
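Note that Index.difference returns its result sorted. If the original column order matters, a list comprehension is one alternative. This is a small sketch on invented data; the `value` column and the name `new_cols` are assumptions.

```python
import pandas as pd

# Hypothetical input with one categorical column to encode.
df = pd.DataFrame({"type": ["a", "b", "a"], "value": [1, 2, 3]})
target_cols = ["type"]

df_dummy = pd.get_dummies(df, columns=target_cols)

# Keep only the columns that were not in the original frame,
# in the order get_dummies produced them.
new_cols = [c for c in df_dummy.columns if c not in df.columns]
print(new_cols)
```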

Python Pandas - Create list of values AND count by two columns

So I have the following pandas dataframe:
What I would like to do is create a new column that contains a unique list of all the dest_hostnames by the user_agent and user columns.
I also want another column that has the total count of events based on the user_agent and user columns.
So the final dataset should look like:
I've been doing the following, but can't figure out a way to do both so that the result is in one dataframe:
browsers.groupby(['user','user_agent'])['dest_hostname'].apply(list).reset_index(name='browser_hosts')
browsers.value_counts(["user", "user_agent"])
If I understand correctly, use agg:
df.groupby(['user', 'user_agent'])['dest_hostname'].agg(['unique', 'count'])
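To get named output columns and plain Python lists rather than NumPy arrays, named aggregation is one option. The sketch below uses invented rows, since the original table is not shown. The column name `event_count` is an assumption, and `"count"` only counts rows where dest_hostname is not null.

```python
import pandas as pd

# Hypothetical event log.
browsers = pd.DataFrame({
    "user": ["alice", "alice", "alice", "bob"],
    "user_agent": ["Firefox", "Firefox", "Firefox", "Chrome"],
    "dest_hostname": ["a.com", "b.com", "a.com", "c.com"],
})

summary = (
    browsers.groupby(["user", "user_agent"])["dest_hostname"]
    .agg(
        browser_hosts=lambda s: list(s.unique()),  # unique hosts, first-seen order
        event_count="count",                       # non-null rows per group
    )
    .reset_index()
)
print(summary)
```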

Averaging data of dataframe columns based on redundancy of another column

I want to average the data in one column of a pandas dataframe for rows that share the same 'id', which is stored in another column of the same dataframe. To make it simple, I have:
and I want:
Here it is clear that the elements of the 'nx' and 'ny' columns have been averaged wherever the value of 'nodes' was the same. The column 'maille', on the other hand, has to remain untouched.
I'm trying with groupby but so far haven't managed to keep the column 'maille' as it is.
Any idea?
Use GroupBy.transform, specifying the column names to aggregate in a list, and assign the result back:
cols = ['nx','ny']
df[cols] = df.groupby('nodes')[cols].transform('mean')
print(df)
Another idea with DataFrame.update:
df.update(df.groupby('nodes')[cols].transform('mean'))
print(df)
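Because the question's tables are not shown, here is a small self-contained sketch of the transform approach on made-up values. The numbers are assumptions chosen so that two rows share the same 'nodes' value.

```python
import pandas as pd

# Hypothetical data: rows 0 and 2 share nodes == 10.
df = pd.DataFrame({
    "maille": [1, 1, 2, 2],
    "nodes": [10, 20, 10, 30],
    "nx": [1.0, 2.0, 3.0, 4.0],
    "ny": [0.0, 1.0, 2.0, 3.0],
})

cols = ["nx", "ny"]
# transform returns a frame aligned with df's rows, so every row gets
# its group's mean while the row count and 'maille' stay unchanged.
df[cols] = df.groupby("nodes")[cols].transform("mean")
print(df)
```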
