How to create a dataframe by aggregating (grouping?) a dataframe containing only strings - python

I would like to create a dataframe "aggregating" a larger data set.
Starting:
df:
col1 col2
1 A B
2 A C
3 A B
and getting:
df_aggregated:
col1 col2
1 A B
2 A C
without using any calculation (such as count())
I would write:
df_aggreagated = df.groupby('col1')
but I do not get anything
print ( df_aggregated )
"error"
any help appreciated

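You can accomplish this by simply dropping the duplicate rows with df.drop_duplicates. Leave keep at its default ('first') so that one copy of each duplicated row survives; keep=False would remove every duplicated row and leave only the A C row:
df_aggregated = df.drop_duplicates(subset=['col1', 'col2'])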
print(df_aggregated)
col1 col2
1 A B
2 A C
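As a minimal sketch of how the keep argument changes the result, the snippet below rebuilds the three-row frame from the question (the index values 1 to 3 are copied from the question's listing, and the variable names first_kept and none_kept are only illustrative) and compares the default keep='first' with keep=False.

```python
import pandas as pd

# Frame from the question; index 1..3 mirrors the listing in the question.
df = pd.DataFrame(
    {"col1": ["A", "A", "A"], "col2": ["B", "C", "B"]},
    index=[1, 2, 3],
)

# Default keep='first': one representative of each distinct row survives.
first_kept = df.drop_duplicates(subset=["col1", "col2"])

# keep=False: every row that has a duplicate is removed entirely.
none_kept = df.drop_duplicates(subset=["col1", "col2"], keep=False)

print(first_kept)
print(none_kept)
```

With this data, first_kept holds the A B and A C rows, while none_kept holds only A C.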

You can use groupby with a function:
In [849]: df.groupby('col2', as_index=False).max()
Out[849]:
col2 col1
0 B A
1 C A
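It may help to see why the question's bare groupby call does not produce a table: groupby on its own returns a lazy GroupBy object rather than a DataFrame, and a frame only appears once a reduction is applied. The sketch below uses first() instead of max(), which is an assumption about what "no calculation" should mean here; the names grouped and result are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"col1": ["A", "A", "A"], "col2": ["B", "C", "B"]})

grouped = df.groupby("col2", as_index=False)
# grouped is a DataFrameGroupBy object, not a DataFrame.
is_frame = isinstance(grouped, pd.DataFrame)
print(type(grouped).__name__, is_frame)

# Applying a reduction collapses each group to a single row.
result = grouped.first()
print(result)
```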

Related

Add a new column with matching values in a list in pandas

I have a dataframe such as :
the_list =['LjHH','Lhy_kd','Ljk']
COL1 COL2
A ADJJDUD878_Lhy_kd
B Y0_0099JJ_Ljk
C YTUUDBBDHHD
D POL0990E_LjHH'
And I would like to add a new COL3 column where if within COL2 I have a match with a value in the_list, I add in that column the matching element of the_list.
Expected result;
COL1 COL2 COL3
A ADJJDUD878_Lhy_kd Lhy_kd
B Y0_0099JJ_2_Ljk Ljk
C YTUUDBBDHHD NA
D POL0990E_LjHH' LjHH
To get only the first matched value, use Series.str.extract with the list values joined by | (regex OR):
the_list =['LjHH','Lhy_kd','Ljk']
df['COL3'] = df['COL2'].str.extract(f'({"|".join(the_list)})', expand=False)
print (df)
COL1 COL2 COL3
0 A ADJJDUD878_Lhy_kd Lhy_kd
1 B Y0_0099JJ_Ljk Ljk
2 C YTUUDBBDHHD NaN
3 D POL0990E_LjHH' LjHH
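One caveat with joining the list by | is that any entry containing regex metacharacters (a dot, a plus sign, parentheses) would be interpreted as pattern syntax. A hedged variant, assuming the entries are meant as literal substrings, escapes each one with re.escape first; the sample frame below is rebuilt from the question's data.

```python
import re

import pandas as pd

the_list = ["LjHH", "Lhy_kd", "Ljk"]
df = pd.DataFrame({
    "COL1": ["A", "B", "C", "D"],
    "COL2": ["ADJJDUD878_Lhy_kd", "Y0_0099JJ_Ljk", "YTUUDBBDHHD", "POL0990E_LjHH'"],
})

# Escape each entry so it is matched literally, then wrap the
# alternation in one capture group, as str.extract requires.
pattern = "(" + "|".join(re.escape(item) for item in the_list) + ")"
df["COL3"] = df["COL2"].str.extract(pattern, expand=False)
print(df)
```

Rows with no match get NaN in COL3, as in the answer above.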
To get all matched values (there may be several per row), use Series.str.findall with Series.str.join, then replace empty strings with NaN:
import numpy as np
the_list = ['LjHH', 'Lhy_kd', 'Ljk']
df['COL3'] = df['COL2'].str.findall('|'.join(the_list)).str.join(',').replace('', np.nan)
print (df)
COL1 COL2 COL3
0 A ADJJDUD878_Lhy_kd Lhy_kd
1 B Y0_0099JJ_Ljk Ljk
2 C YTUUDBBDHHD NaN
3 D POL0990E_LjHH' LjHH
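The sample data never has more than one match per row, so the difference from str.extract is not visible in the output above. The sketch below adds a made-up value (X_Ljk_LjHH, not part of the question) that contains two list entries, to show what the findall approach returns in that case.

```python
import numpy as np
import pandas as pd

the_list = ["LjHH", "Lhy_kd", "Ljk"]
df = pd.DataFrame({
    "COL1": ["A", "C", "E"],
    # The last value is a hypothetical example containing two list entries.
    "COL2": ["ADJJDUD878_Lhy_kd", "YTUUDBBDHHD", "X_Ljk_LjHH"],
})

# findall returns a list of every match per row, in order of appearance.
matches = df["COL2"].str.findall("|".join(the_list))

# Join each list into one string; rows without matches become '' and then NaN.
df["COL3"] = matches.str.join(",").replace("", np.nan)
print(df)
```

For the hypothetical third row, COL3 holds both matches joined by a comma.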

How do I stop aggregate functions from adding unwanted rows to dataframe?

I wrote a line of code that groups the dataframe by column
df = df.groupby(['where','when']).agg({'col1': ['max'], 'col2': ['sum']})
After running the above code, the output has two header rows: 'max' and 'sum' appear on a second row below 'col1' and 'col2'. It looks like this:
           col1 col2
            max  sum
where when
home  1       a    a
work  2       b    b
This is my expected outcome:
where when col1 col2
home  1    a    a
work  2    b    b
I want to bring both col1 and col2 down to the same row as location and month (where and when), and at the same time stop 'max' and 'sum' from showing. I couldn't think of a way to make this work, so help would be appreciated.
What you need is reset_index, plus named aggregation so that the output column names are given up front.
Use the following:
df = df.groupby(['where','when']).agg(col1 = ('col1', 'max'), col2 = ('col2', 'sum')).reset_index()
Dataframe:
where when col1 col2
0 home 1 1 1
1 work 2 2 2
2 home 1 3 3
Output:
where when col1 col2
0 home 1 3 4
1 work 2 2 2
Update:
We can pass as_index=False to groupby, which stops pandas from putting the group keys in the index, so we don't need to reset the index afterwards.
df = df.groupby(['where','when'], as_index = False).agg(col1 = ('col1', 'max'), col2 = ('col2', 'sum'))
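To make the difference between the two agg call styles concrete, here is a small sketch on the sample frame from this answer. It compares the column labels each style produces; the names nested and flat are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "where": ["home", "work", "home"],
    "when": [1, 2, 1],
    "col1": [1, 2, 3],
    "col2": [1, 2, 3],
})

# Dict-of-lists aggregation: the columns become a two-level MultiIndex,
# which is what produces the extra 'max' / 'sum' header row.
nested = df.groupby(["where", "when"]).agg({"col1": ["max"], "col2": ["sum"]})
print(nested.columns.tolist())  # [('col1', 'max'), ('col2', 'sum')]

# Named aggregation: a single flat header, with the group keys kept
# as ordinary columns because of as_index=False.
flat = df.groupby(["where", "when"], as_index=False).agg(
    col1=("col1", "max"),
    col2=("col2", "sum"),
)
print(flat.columns.tolist())  # ['where', 'when', 'col1', 'col2']
```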

Mapping a Column from One Dataframe to Another

I would like to map the values in df2['col2'] to df['col1']:
df col1 col2
0 w a
1 1 2
2 2 3
I would like to use a column from the dataframe as a dictionary to get:
col1 col2
0 w a
1 A 2
2 B 3
However the data dictionary is just a column in df2, which looks like
df2 col1 col2
1 1 A
2 2 B
I have tried using this:
di = {df2['col1']: df2['col2']}
final = df1.replace({"df2['col2']": di})
But get an error: TypeError: 'Series' objects are mutable, thus they cannot be hashed
I have about 200,000 rows. Any help would be appreciated.
Edit:
The sample dictionary would look like di = {1: "A", 2: "B"}, but is in df2['col1']: df2['col2']. I have 200k+ rows, can I convert df2['col1']: df2['col2'] to a tuple, etc?
You can build a lookup dictionary based on the col1:col2 of df2 and then use that to replace the values in df1.col1.
import pandas as pd
df1 = pd.DataFrame({'col1':['w',1,2],'col2':['a',2,3]})
df2 = pd.DataFrame({'col1':[1,2],'col2':['A','B']})
print(df1)
# col1 col2
#0 w a
#1 1 2
#2 2 3
print(df2)
# col1 col2
#0 1 A
#1 2 B
# Build a {col1 value: col2 value} dictionary from df2
dataLookUpDict = dict(zip(df2['col1'], df2['col2']))
final = df1.replace({'col1': dataLookUpDict})
print(final)
# col1 col2
#0 w a
#1 A 2
#2 B 3
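With around 200,000 rows, replace with a large dictionary may turn out to be slow. One alternative sketch, not taken from the answer above, looks each value up in a plain dictionary via Series.map and falls back to the original value when there is no entry. The name lookup is illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({"col1": ["w", 1, 2], "col2": ["a", 2, 3]})
df2 = pd.DataFrame({"col1": [1, 2], "col2": ["A", "B"]})

# Plain dict keyed by df2.col1, with df2.col2 as the values.
lookup = dict(zip(df2["col1"], df2["col2"]))

final = df1.copy()
# dict.get(v, v) returns the mapped value, or v itself when v is not a key,
# so unmatched entries such as 'w' are left unchanged.
final["col1"] = df1["col1"].map(lambda v: lookup.get(v, v))
print(final)
```

Whether this is actually faster than replace on a given dataset is worth measuring rather than assuming.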

Python-Pandas Join two columns by adding a character

There are three columns. col2 and col3 need to be joined with the character "/" between them, and the joined column should be named col2. Please help!
col1 col2 col3
B 0.0.0.0 0 0
B 2.145.26.0 24
B 2.145.27.0 24
B 10.0.0.0 8 20
Expected output:
col1 col2
B 0.0.0.0 0/0
B 2.145.26.0/24
B 2.145.27.0/24
B 10.0.0.0 8/20
df.col2 = df.col2.astype(str).str.cat(df.col3.astype(str), sep='/')
See https://pandas.pydata.org/pandas-docs/version/0.23/api.html#string-handling for string operations.
IIUC
df['col2']+='/'+df.col3.astype(str)
df
Out[74]:
col1 col2 col3
0 B 0.0.0.00/0 0
1 B 2.145.26.0/24 24
2 B 2.145.27.0/24 24
3 B 10.0.0.08/20 20
Are you looking for something like this? I'm not sure whether it counts as Pythonic.
import pandas as pd
temp_list = [{'col1':'B','col2':'0.0.0.0 0', 'col3':'0'},
{'col1':'B','col2':'2.145.26.0', 'col3':'24'},
{'col1':'B','col2':'2.145.27.0', 'col3':'24'},
{'col1':'B','col2':'10.0.0.0 8', 'col3':'20'}]
df = pd.DataFrame(data=temp_list)
# Build the joined value in a temporary column, then swap it in for col2
df['col4'] = df['col2'] + "/" + df['col3']
df.drop(columns=['col2','col3'],inplace=True)
# rename returns a new frame, so assign the result back
df = df.rename(columns = {'col4':'col2'})
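The answers above can be condensed into one short sketch that also drops col3, which the expected output in the question leaves out. It assumes col3 is stored as integers (hence the astype(str)) and uses only the two unambiguous rows from the question's sample.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": ["B", "B"],
    "col2": ["2.145.26.0", "2.145.27.0"],
    "col3": [24, 24],  # assumed to be integers in the real data
})

# str.cat needs strings on both sides, so convert col3 before joining.
df["col2"] = df["col2"].str.cat(df["col3"].astype(str), sep="/")

# col3 is now redundant; drop it so only col1 and the joined col2 remain.
df = df.drop(columns="col3")
print(df)
```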

Return groupby columns as new dataframe in Python Pandas

Input: CSV with 5 columns.
Expected Output: Unique combinations of 'col1', 'col2', 'col3'.
Sample Input:
col1 col2 col3 col4 col5
0 A B C 11 30
1 A B C 52 10
2 B C A 15 14
3 B C A 1 91
Sample Expected Output:
col1 col2 col3
A B C
B C A
Just expecting this as output. I don't need col4 and col5 in output. And also don't need any sum, count, mean etc. Tried using pandas to achieve this but no luck.
My code:
input_df = pd.read_csv("input.csv");
output_df = input_df.groupby(['col1', 'col2', 'col3'])
This code is returning 'pandas.core.groupby.DataFrameGroupBy object at 0x0000000009134278'.
But I need dataframe like above. Any help much appreciated.
df[['col1', 'col2', 'col3']].drop_duplicates()
First, you can use .drop() to delete col4 and col5, since you said you don't need them.
df = df.drop(['col4', 'col5'], axis=1)
Then you can use .drop_duplicates() to remove the rows that repeat the same col1, col2 and col3 values.
df = df.drop_duplicates(['col1', 'col2', 'col3'])
df
The output:
col1 col2 col3
0 A B C
2 B C A
Notice that the index in the output is 0, 2 instead of 0, 1. To fix that you can do this:
df.index = range(len(df))
df
The output:
col1 col2 col3
0 A B C
1 B C A
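The steps above can also be chained into a single expression. This is a minimal sketch that builds the sample input inline instead of reading input.csv, and uses reset_index(drop=True) as an alternative way to renumber the rows.

```python
import pandas as pd

# Inline copy of the sample input from the question (stands in for input.csv).
input_df = pd.DataFrame({
    "col1": ["A", "A", "B", "B"],
    "col2": ["B", "B", "C", "C"],
    "col3": ["C", "C", "A", "A"],
    "col4": [11, 52, 15, 1],
    "col5": [30, 10, 14, 91],
})

output_df = (
    input_df[["col1", "col2", "col3"]]  # keep only the key columns
    .drop_duplicates()                  # one row per unique combination
    .reset_index(drop=True)             # renumber rows 0, 1, ...
)
print(output_df)
```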
