How can I convert a GroupBy Series to a DataFrame? - python

I have this DataFrame:
import pandas as pd

df = pd.DataFrame({
    "Name": ["Bob", "Bryan", "Bob", "Bryan", "Bryan"],
    "Value": [10, 20, 15, 50, 45],
})
Then I got the minimum value per person:
df1 = df.groupby(["Name"])["Value"].min()
This is quite simple. However, I want to keep working with DataFrames, but df1 is a Series:
type(df1)  # pandas.core.series.Series
How can I convert it to a dataframe again?

Use the parameter as_index=False in DataFrame.groupby:
df1 = df.groupby(["Name"], as_index=False)["Value"].min()
Or add Series.reset_index:
df1 = df.groupby(["Name"])["Value"].min().reset_index()
print(df1)

    Name  Value
0    Bob     10
1  Bryan     20

Alternatively, you may use agg(), which returns a DataFrame:
df1 = df.groupby("Name").agg({'Value': 'min'}).reset_index()

Related

How can I add a column that has the same value in every row

I was trying to add a new column to my dataset, but when I did, the column only had one index.
Is there a way to make one value appear in all indexes of a column?
import pandas as pd
df = pd.read_json('file_1.json', lines=True)
df2 = pd.read_json('file_2.json', lines=True)
df3 = pd.concat([df,df2])
df3 = df.loc[:, ['renderedContent']]
görüş_column = ['Milet İttifakı']
df3['Siyasi Yönelim'] = görüş_column
As I understand it, this could be a possible solution. You mentioned these lines of code:
df3 = pd.concat([df, df2])
df3 = df.loc[:, ['renderedContent']]
You can modify the first one into
df3 = pd.concat([df, df2], axis=1)  # axis=1 appends the second dataframe as columns; the default axis=0 appends rows
The second point is that you probably meant to write
df3 = df3.loc[:, ['renderedContent']]
instead of df3 = df.loc[:, ['renderedContent']] (note df3 versus df).
Hope this solves your problem.
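On the original goal of putting one value in every row: assigning a scalar (rather than a one-element list) broadcasts it to the whole column. A minimal sketch with invented data:

import pandas as pd

df3 = pd.DataFrame({'renderedContent': ['text a', 'text b', 'text c']})

# a scalar assignment is broadcast to every row of the new column
df3['Siyasi Yönelim'] = 'Milet İttifakı'
print(df3)  # every row now holds 'Milet İttifakı'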

How can I show only some columns using Python Pandas?

I have tried the following code and it works; however, the output shows excess columns that I don't require:
import pandas as pd
df = pd.read_csv("data.csv")
df = df.groupby(['City1', 'City2']).sum('PassengerTrips')
df['Vacancy'] = 1-df['PassengerTrips'] / df['Seats']
df = df.groupby(['City1','City2']).max('Vacancy')
df = df.sort_values('Vacancy', ascending=False)
print('The 10 routes with the highest proportion of vacant seats:')
print(df[:11])
I have tried adding the following code after sorting the vacancy values, however it gives me an error:
df = df[['City1', 'City2', 'Vacancy']]
City1 and City2 are in the index since you applied a groupby on them.
You can move them back to columns using reset_index to get the expected result:
df = df.reset_index(drop=False)
df = df[['City1', 'City2', 'Vacancy']]
Or, if you want to leave City1 and City2 in the index, you can do as @Corralien said in his comment: df = df['Vacancy']
And even df = df['Vacancy'].to_frame() to get a DataFrame instead of a Series.
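A self-contained sketch of both routes, with the column names taken from the question but the data invented in place of data.csv:

import pandas as pd

# invented sample standing in for data.csv
df = pd.DataFrame({
    'City1': ['A', 'A', 'B'],
    'City2': ['B', 'B', 'C'],
    'PassengerTrips': [80, 40, 10],
    'Seats': [100, 100, 100],
})
df = df.groupby(['City1', 'City2']).sum()
df['Vacancy'] = 1 - df['PassengerTrips'] / df['Seats']

# move the index levels back to columns, then keep only what we need
out = df.reset_index()[['City1', 'City2', 'Vacancy']]
print(out)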

Efficient method comparing 2 different tables columns

Hi all,
I have two dfs and I need to check whether values from the first match those in the second, only for one specific column in each, and save the matching values in a new list. This is what I did, but it is taking quite a lot of time, and I was wondering if there's a more efficient way. The lists are like in the image above, from two different tables.
for x in df_bd_names['Building_Name']:
    for y in df_sup['Source_String']:
        if x == y:
            matching_words_sup.append(x)
Thanks
Let's create both dataframes:
df1 = pd.DataFrame({
    'Building_Name': ['Exces', 'Excs', 'Exec', 'Executer', 'Executor']
})
df2 = pd.DataFrame({
    'Source_String': ['Executer', 'Executor', 'Executor Of', 'Executor For', 'Exeutor']
})
Perform an inner merge between the dataframes and convert the first column to a list:
pd.merge(df1, df2, left_on='Building_Name', right_on='Source_String', how='inner')['Building_Name'].tolist()
Output:
['Executer', 'Executor']
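A second posted answer takes a NumPy route with intersect1d, but it arrived as bare methods with no class definition or imports; a cleaned-up, runnable version follows (the class name DFComparer is assumed, since the original answer never showed one):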
import numpy as np
import pandas as pd

class DFComparer:  # class name assumed; the original answer posted only the methods
    def __init__(self, df1, df2):
        self.df1 = df1
        self.df2 = df2

    def compareDFsEffectively(self):
        # flatten both frames to arrays and keep only the common values
        np1 = self.df1.to_numpy()
        np2 = self.df2.to_numpy()
        np_new = np.intersect1d(np1, np2)
        print(np_new)
        df_new = pd.DataFrame(np_new)
        print(df_new)
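Usage with the df1 and df2 built above:

DFComparer(df1, df2).compareDFsEffectively()
# prints ['Executer' 'Executor'], matching the merge approach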

Retrieve multiple lookup values in large dataset?

I have two dataframes:
import pandas as pd
data = [
    ['138249', 'Cat'],
    ['103669', 'Cat'],
    ['191826', 'Cat'],
    ['196655', 'Cat'],
    ['103669', 'Cat'],
    ['116780', 'Dog'],
    ['184831', 'Dog'],
    ['196655', 'Dog'],
    ['114333', 'Dog'],
    ['123757', 'Dog'],
]
df1 = pd.DataFrame(data, columns=['Hash', 'Name'])
print(df1)
data2 = [
    '138249',
    '103669',
    '191826',
    '196655',
    '116780',
    '184831',
    '114333',
    '123757',
]
df2 = pd.DataFrame(data2, columns=['Hash'])
I want to write code that will take each item in the second dataframe, scan the leftmost values in the first dataframe, and return all matching values from the first dataframe into a single cell in the second dataframe.
Here's the result I am aiming for:
Here's what I have tried:
# attempt one: use groupby to squish up the dataset. No results
past = df1.groupby('Hash')
print(past)

# attempt two: use merge. Result: empty dataframe
past1 = pd.merge(df1, df2, right_index=True, left_on='Hash')
print(past1)

# attempt three: use pivot. Result: not the right format
past2 = df1.pivot(index=None, columns='Hash', values='Name')
print(past2)
I can do this in Excel with the VBA code here, but that code crashes when I apply it to my real dataset (likely because it is too big: approximately 30,000 rows long).
IIUC, first aggregate Name with ','.join on df1, then reindex using df2:
df1.groupby('Hash')['Name'].agg(','.join).reindex(df2.Hash).reset_index()
     Hash     Name
0  138249      Cat
1  103669  Cat,Cat
2  191826      Cat
3  196655  Cat,Dog
4  116780      Dog
5  184831      Dog
6  114333      Dog
7  123757      Dog
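For the record, attempt two likely came back empty because right_index=True matched df1's 'Hash' strings against df2's integer positional index. A sketch of the equivalent merge-based route, aggregating first and then merging on the 'Hash' column itself:

# aggregate the matches per Hash, then merge on the column rather than the index
joined = df1.groupby('Hash')['Name'].agg(','.join).reset_index()
past1 = pd.merge(df2, joined, on='Hash', how='left')
print(past1)  # same table as above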

I want to extract the QSTS_ID column, delimit it by full stop, and append it to the existing list as a separate column

When applying the code below, I am getting NaN values in the entire QSTS_ID column:
df['QSTS_ID'] = df['QSTS_ID'].str.split('.', expand=True)
df
I want to copy the entire QSTS_ID column and append it at the end. I also have to delimit it by full stop and apply new headers.
The problem is that with the parameter expand=True, str.split returns a DataFrame with one or more columns, so assigning it back to a single column produces NaNs.
The solution is to add the new columns to the original DataFrame with join or concat; add_prefix changes the new columns' names:
df = df.join(df['QSTS_ID'].str.split('.', expand=True).add_prefix('QSTS_ID_'))
df = pd.concat([df, df['QSTS_ID'].str.split('.', expand=True).add_prefix('QSTS_ID_')], axis=1)
If you also want to remove the original column:
df = df.join(df.pop('QSTS_ID').str.split('.', expand=True).add_prefix('QSTS_ID_'))
df = pd.concat([df, df.pop('QSTS_ID').str.split('.', expand=True).add_prefix('QSTS_ID_')], axis=1)
Sample:
df = pd.DataFrame({
    'QSTS_ID': ['val_k.lo', 'val2.s', 'val3.t'],
    'F': list('abc')
})

df1 = df['QSTS_ID'].str.split('.', expand=True).add_prefix('QSTS_ID_')
df = df.join(df1)
print(df)

    QSTS_ID  F QSTS_ID_0 QSTS_ID_1
0  val_k.lo  a     val_k        lo
1    val2.s  b      val2         s
2    val3.t  c      val3         t

# check the column names of the new columns
print(df1.columns)
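Since the question also asks to apply new headers, the joined columns can simply be renamed afterwards; the names Prefix and Suffix below are invented for illustration:

# hypothetical header names; replace with whatever the report needs
df = df.rename(columns={'QSTS_ID_0': 'Prefix', 'QSTS_ID_1': 'Suffix'})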
