How to create a dataframe with the column included in groupby clause? - python

I have a data frame with two columns, A and Amount. I have grouped it by 'A'. Now I want to put these values into a new data frame; how can I achieve this?
top_plt=pd.DataFrame(top_plt.groupby('A')['Amount'].sum())
The resulting dataframe contains only the Amount column but the groupby 'A' column is missing.

The DataFrame constructor is not necessary; instead, add as_index=False to groupby:
top_plt= top_plt.groupby('A', as_index=False)['Amount'].sum()
Or add DataFrame.reset_index:
top_plt= top_plt.groupby('A')['Amount'].sum().reset_index()
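Both approaches can be sketched with hypothetical sample data (the column values below are made up for illustration):

```python
import pandas as pd

# Hypothetical sample data
top_plt = pd.DataFrame({'A': ['x', 'x', 'y'], 'Amount': [10, 20, 5]})

# as_index=False keeps 'A' as a regular column in the result
out1 = top_plt.groupby('A', as_index=False)['Amount'].sum()

# reset_index converts the 'A' index back into a column
out2 = top_plt.groupby('A')['Amount'].sum().reset_index()

print(out1)
#    A  Amount
# 0  x      30
# 1  y       5
```

Either way, 'A' ends up as an ordinary column rather than the index.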

Related

How to get column names in pandas from get_dummies

After creating a data frame, I applied get_dummies to it:
df_final=pd.get_dummies(df,columns=['type'])
I got the new columns that I want and everything is working.
My question is: how can I get the names of the new columns created by get_dummies? My dataframe is dynamic, so I can't reference them statically; I want to save all the new column names in a list.
An option would be:
df_dummy = pd.get_dummies(df, columns=target_cols)
df_dummy.columns.difference(df.columns).tolist()
where df is your original dataframe, df_dummy the output of pd.get_dummies, and target_cols your list of columns to encode.
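A minimal runnable sketch, using a made-up frame with a 'type' column as in the question:

```python
import pandas as pd

# Hypothetical frame with a categorical 'type' column
df = pd.DataFrame({'type': ['a', 'b', 'a'], 'val': [1, 2, 3]})

df_dummy = pd.get_dummies(df, columns=['type'])

# Columns present after get_dummies but absent before are the new dummy columns
new_cols = df_dummy.columns.difference(df.columns).tolist()
print(new_cols)  # ['type_a', 'type_b']
```

Note that Index.difference returns the names sorted, not in frame order.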

How to get rows from one dataframe based on another dataframe

I just edited the question as maybe I didn't make myself clear.
I have two dataframes (MR and DT)
The column 'A' in dataframe DT is a subset of the column 'A' in dataframe MR; the two frames overlap only in this ID column, while the rest of the columns differ, as does the number of rows.
How can I get the rows from MR whose 'ID' values appear in DT['ID']? Note that values in 'ID' can appear several times in the same column.
(DT has 1538 rows and MR has 2060 rows.)
I tried some of the lines proposed at https://stackoverflow.com/questions/28901683/pandas-get-rows-which-are-not-in-other-dataframe but got strange results, as I don't fully understand the methods proposed there (and my goal is slightly different).
Thanks!
Take a look at pandas.Series.isin() method. In your case you'd want to use something like:
matching_id = MR.ID.isin(DT.ID) # This returns a boolean Series of whether values match or not
# Now filter your dataframe to keep only matching rows
new_df = MR.loc[matching_id, :]
Or, if you want a new dataframe combining records for the same ID, use merge():
new_df = pd.merge(MR, DT, on='ID')
This will create a new dataframe with columns from both original dfs but only where ID is the same.
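The two options can be sketched with small hypothetical frames (the data below is invented; only the 'ID' column matters):

```python
import pandas as pd

# Hypothetical frames; the ID values in DT are a subset of those in MR
MR = pd.DataFrame({'ID': [1, 2, 2, 3, 4], 'x': [10, 20, 21, 30, 40]})
DT = pd.DataFrame({'ID': [2, 3], 'y': ['a', 'b']})

# Boolean mask: True where MR's ID appears anywhere in DT's ID
matching_id = MR['ID'].isin(DT['ID'])
filtered = MR.loc[matching_id]
print(filtered['ID'].tolist())  # [2, 2, 3]

# merge combines columns from both frames on matching IDs;
# repeated IDs in MR each pick up the matching DT row
combined = pd.merge(MR, DT, on='ID')
```

isin keeps only MR's columns; merge also brings in DT's columns for the matching rows.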

Averaging data of dataframe columns based on redundancy of another column

I want to average the data of one column in a pandas dataframe if the rows share the same 'id', which is stored in another column of the same dataframe. To keep it simple, I have:
and I want:
Where it is clear that the 'nx' and 'ny' column elements have been averaged wherever their 'nodes' value was the same. The column 'maille', on the other hand, has to remain untouched.
I'm trying with groupby, but so far I couldn't manage to keep the column 'maille' as it is.
Any idea?
Use GroupBy.transform, specifying the column names to aggregate in a list, and assign back:
cols = ['nx','ny']
df[cols] = df.groupby('nodes')[cols].transform('mean')
print (df)
Another idea with DataFrame.update:
df.update(df.groupby('nodes')[cols].transform('mean'))
print (df)
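A runnable sketch with invented data (the question's example tables did not survive, so the values below are assumptions):

```python
import pandas as pd

# Hypothetical data: 'nodes' is the grouping id, 'maille' must stay untouched
df = pd.DataFrame({'nodes': [1, 1, 2],
                   'maille': ['m1', 'm2', 'm3'],
                   'nx': [1.0, 3.0, 5.0],
                   'ny': [2.0, 4.0, 6.0]})

cols = ['nx', 'ny']
# transform returns a frame aligned to the original index,
# so assignment overwrites only the selected columns
df[cols] = df.groupby('nodes')[cols].transform('mean')
print(df)
#    nodes maille   nx   ny
# 0      1     m1  2.0  3.0
# 1      1     m2  2.0  3.0
# 2      2     m3  5.0  6.0
```

Because transform preserves the row count, 'maille' keeps its per-row values.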

Dropping rows with same values from another dataframe

I have one dataframe (df) with a column called "id". I have another dataframe (df2) with only one column called "id". I want to drop the rows in df that have the same values in "id" as df2.
How would I go about doing this?
Use boolean indexing with the isin method.
Note that the tilde ~ indicates that I take the negation of the boolean series returned by df['id'].isin(df2['id'])
df[~df['id'].isin(df2['id'])]
query
Using a query string, we refer to df2 with the @ symbol:
df.query('id not in @df2.id')
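Both forms can be sketched with made-up frames (note that query references local variables with the @ symbol):

```python
import pandas as pd

# Hypothetical frames
df = pd.DataFrame({'id': [1, 2, 3, 4], 'val': ['a', 'b', 'c', 'd']})
df2 = pd.DataFrame({'id': [2, 4]})

# ~ negates the mask, keeping rows whose id is NOT in df2
kept = df[~df['id'].isin(df2['id'])]
print(kept['id'].tolist())  # [1, 3]

# Equivalent with query; @df2 refers to the local variable df2
kept_q = df.query('id not in @df2.id')
```

Both return the same rows; query is often more readable for longer conditions.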

Get unique values of multiple columns as a new dataframe in pandas

Having pandas data frame df with at least columns C1,C2,C3 how would you get all the unique C1,C2,C3 values as a new DataFrame?
In other words, similar to:
SELECT C1,C2,C3
FROM T
GROUP BY C1,C2,C3
I tried
print df.groupby(by=['C1','C2','C3'])
but I'm getting
<pandas.core.groupby.DataFrameGroupBy object at 0x000000000769A9E8>
I believe you need drop_duplicates if you want all unique triples:
df = df.drop_duplicates(subset=['C1','C2','C3'])
If you want to use groupby, add first:
df = df.groupby(by=['C1','C2','C3'], as_index=False).first()
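A small sketch with invented data showing the drop_duplicates route:

```python
import pandas as pd

# Hypothetical frame with a duplicated (C1, C2, C3) triple
df = pd.DataFrame({'C1': [1, 1, 2],
                   'C2': ['a', 'a', 'b'],
                   'C3': [True, True, False],
                   'other': [9, 8, 7]})

# Keep one row per unique triple (the first occurrence wins)
uniq = df.drop_duplicates(subset=['C1', 'C2', 'C3'])
print(uniq[['C1', 'C2', 'C3']])
#    C1 C2     C3
# 0   1  a   True
# 2   2  b  False
```

Unlike the SQL GROUP BY, drop_duplicates also keeps the other columns' values from the first matching row.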
