Aggregate values pandas - python

I have a pandas dataframe like this:
Id  A  B  C  D
 1  a  b  c  d
 2  a  b     d
 2  a     c  d
 3  a        d
 3  a  b  c
I want to fill the empty values in columns B, C and D using the values contained in the other rows with the same Id.
The resulting data frame should be the following:
Id  A  B  C  D
 1  a  b  c  d
 2  a  b  c  d
 3  a  b  c  d
It is also possible for the same Id to have different values in the first column (A). In that case, instead of keeping the first instance, I would prefer to put another value indicating this event.
So, for example:
Id  A  B  C  D
 1  a  b  c  d
 2  a  b     d
 2  x     c  d
It becomes:
Id  A  B  C  D
 1  a  b  c  d
 2  f  b  c  d

IIUC, you can use groupby + agg:
>>> df.groupby('Id').agg({'A': lambda x: x.iloc[0] if len(x.unique()) == 1 else 'f',
...                       'B': 'first', 'C': 'first', 'D': 'first'})
    A  B  C  D
Id
1   a  b  c  d
2   f  b  c  d
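Note that the 'first' aggregation takes the first non-NaN value in each group, so this works as expected only if the blanks really are NaN. If they are stored as empty strings, convert them first; a minimal sketch:

import numpy as np

df = df.replace('', np.nan)  # only needed if the blanks are '' rather than NaN

You can also chain .reset_index() on the result if you want Id back as a regular column instead of the index.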

The best way I can think of to do this is to iterate over each unique Id, slice its rows out of the original dataframe, and construct a new row by merging the relevant rows:
import pandas as pd

def aggregate(df):
    ids = df['Id'].unique()
    rows = []
    for id in ids:
        # slice out the rows belonging to this Id
        relevant = df[df['Id'] == id]
        # start from an empty row and fill it column by column
        newrow = {c: "" for c in df.columns}
        for _, row in relevant.iterrows():
            for col in newrow:
                if row[col]:
                    # skip the value if it matches what was collected last
                    if len(newrow[col]):
                        if newrow[col][-1] == row[col]:
                            continue
                    newrow[col] += row[col]
        rows.append(newrow)
    return pd.DataFrame(rows)
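For example, with the second sample above (assuming every cell is a string and the blanks are empty strings), conflicting values for the same Id end up concatenated rather than replaced by 'f':

df = pd.DataFrame({'Id': ['1', '2', '2'],
                   'A':  ['a', 'a', 'x'],
                   'B':  ['b', 'b', ''],
                   'C':  ['c', '',  'c'],
                   'D':  ['d', 'd', 'd']})
print(aggregate(df))
#   Id   A  B  C  D
# 0  1   a  b  c  d
# 1  2  ax  b  c  d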

How do I give score (0/1) to CSV rows

My CSV file's row data looks like this:
a a a a a
b b b b b
c c c c c
d d d d d
a b c d e
a d b c c
When I have patterns like rows 1-5, I want to return the value 0.
When I have a row like row 6, or random alphabets (not like rows 1-5), I want to return the value 1.
How do I do it using Python? It must be done using a CSV file.
You can read your CSV file into a pandas dataframe using:
df = pd.read_csv('input.csv', header=None)  # 'input.csv' is a placeholder for your file
output:
   0  1  2  3  4
0  a  a  a  a  a
1  b  b  b  b  b
2  c  c  c  c  c
3  d  d  d  d  d
4  a  b  c  d  e
5  a  d  b  c  c
Then use nunique to count the number of unique values per row: if it is 1 or 5 (the max), the pattern is valid and the score is 0, otherwise the score is 1. Use between for that.
df.nunique(axis=1).between(2, len(df.columns) - 1).astype(int)
output:
0    0
1    0
2    0
3    0
4    0
5    1
dtype: int64
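Putting the two steps together, a minimal end-to-end sketch (the file names are placeholders):

import pandas as pd

df = pd.read_csv('input.csv', header=None)                 # placeholder path
scores = df.nunique(axis=1).between(2, len(df.columns) - 1).astype(int)
scores.to_csv('scores.csv', index=False, header=False)     # placeholder path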

How to combine string from one column to another column at same index in pandas DataFrame?

I am working on an NLP project.
My input is:
index name lst
0     a    c
0          d
0          e
1          f
1     b    g
I need output like this:
index name lst combine
0     a    c   a c
0          d   a d
0          e   a e
1          f   b f
1     b    g   b g
How can I achieve this?
You can use groupby + transform('max') to fill the empty cells with the letter of their group, since any letter compares greater than the empty string. The rest is a simple string concatenation per column:
df['combine'] = df.groupby('index')['name'].transform('max') + ' ' + df['lst']
Used input:
df = pd.DataFrame({'index': [0,0,0,1,1],
                   'name': ['a','','','','b'],
                   'lst': list('cdefg'),
                  })
NB: I considered "index" to be a column here; if it is the actual index, you should use df.index in the groupby.
Output:
   index name lst combine
0      0    a   c     a c
1      0        d     a d
2      0        e     a e
3      1        f     b f
4      1    b   g     b g
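For the case mentioned in the NB, where "index" is the real index rather than a column, a minimal sketch of the same idea:

df2 = df.set_index('index')
df2['combine'] = df2.groupby(level=0)['name'].transform('max') + ' ' + df2['lst']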

perform df.loc to groupby df

I have a df consisting of person, origin (O) and destination (D):
df = pd.DataFrame({'PersonID':['1','1','2','2','2','3'],'O':['A','B','C','B','A','X'],'D':['B','A','B','A','B','Y']})
the df:
PersonID O D
1 A B
1 B A
2 C B
2 B A
2 A B
3 X Y
I grouped the df with df_grouped = df.groupby(['O','D']) and matched it against another dataframe, taxi:
TaxiID O D
T1 B A
T2 A B
T3 C B
Similarly, I grouped taxi by its O and D. Then I merged the two after aggregating and counting the PersonID and TaxiID per O-D pair, to see how many taxis are available for how many people:
O  D  PersonID  TaxiID
      count     count
A  B  2         1
B  A  2         1
C  B  1         1
Now, I want to use df.loc to take only those PersonID values that were counted in the merged frame. How can I do this? I've tried:
seek = df.loc[df.PersonID.isin(merged['PersonID'])]
but it returns an empty dataframe. What can I do?
Edit: here is the complete code for this case, using dummy data:
df = pd.DataFrame({'PersonID':['1','1','2','2','2','3'],'O':['A','B','C','B','A','X'],'D':['B','A','B','A','B','Y']})
taxi = pd.DataFrame({'TaxiID':['T1','T2','T3'],'O':['B','A','C'],'D':['A','B','B']})
df_grouped = df.groupby(['O','D'])
taxi_grouped = taxi.groupby(['O','D'])
dfm = df_grouped.agg({'PersonID':['count',list]}).reset_index()
tgm = taxi_grouped.agg({'TaxiID':['count',list]}).reset_index()
merged = pd.merge(dfm, tgm, how='inner')
seek = df.loc[df.PersonID.isin(merged['PersonID'])]
The aggregated columns form a MultiIndex, so select the PersonID lists by tuple, then use Series.explode to get scalars out of the nested lists:
seek = df.loc[df.PersonID.isin(merged[('PersonID', 'list')].explode().unique())]
print (seek)
PersonID O D
0 1 A B
1 1 B A
2 2 C B
3 2 B A
4 2 A B
For better performance, it is possible to use a set comprehension to flatten the lists:
seek = df.loc[df.PersonID.isin(set(z for x in merged[('PersonID', 'list')] for z in x))]
print (seek)
PersonID O D
0 1 A B
1 1 B A
2 2 C B
3 2 B A
4 2 A B
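A different way to look at it: after the inner merge, the surviving O-D pairs are exactly those present in both frames, so you can skip the nested lists entirely. A sketch:

# O-D pairs that occur in both the person and the taxi data
pairs = set(zip(df['O'], df['D'])) & set(zip(taxi['O'], taxi['D']))
seek = df[[od in pairs for od in zip(df['O'], df['D'])]]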

How to apply a function by checking a specific column for null values

I am trying to apply a function to the dataframe while checking for NULL values in each row of a specific column.
I have created the function, but I can't work out how to apply it only to the rows that have values.
Input:
   A  B  C  D  E  F
0  f  e  b  a  d  a
1  c  b  a  c  b
2  f  f  a  b  c  c
3  d  c  c  d  c  d
4  f  b  b  b  e  b
5  b  a  f  c  d  a
Expected Output
   A  B  C  D  E  F  MATCHES           Comments
0  f  e  b  a  d  a  AD, BC Unmatched
1  c  b  a  c  b     BC Unmatched      F is having blank values
2  f  f  a  b  c  c  AD, BC Unmatched
3  d  c  c  d  c  d  ALL MATCHED
4  f  b  b  b  e  b  AD Unmatched
5  b  a  f  c  d  a  AD, BC Unmatched
The script works when we don't have to check for NaN values in the df['F'] column, but when we check for the empty rows in df['F'], it gives an error.
Code I have been trying:
def test(x):
    try:
        for idx in df.index:
            unmatch_list = []
            if not df.loc[idx, 'A'] == df.loc[idx, 'D']:
                unmatch_list.append('AD')
            if not df.loc[idx, 'B'] == df.loc[idx, 'C']:
                unmatch_list.append('BC')
            # etcetera...
            if len(unmatch_list):
                unmatch_string = ', '.join(unmatch_list) + ' Unmatched'
            else:
                unmatch_string = 'ALL MATCHED'
            df.loc[idx, 'MATCHES'] = unmatch_string
    except ValueError:
        print('error')
    return df

## df = df.apply(lambda x: test(x) if(pd.notna(df['F'])) else x)
for row in df:
    if row['F'].isna() == True:
        row['Comments'] = "F is having blank values"
    else:
        df = test(df)
Please suggest how I can use the function.
You could try something like this:
# get combis
df1 = df.copy().reset_index().melt(id_vars=['index'])
df1 = df1.merge(df1, on=['index', 'value'], how='inner')
df1 = df1[df1['variable_x'] != df1['variable_y']]
df1['combis'] = df1['variable_x'] + ':' + df1['variable_y']
df1 = df1.groupby(['index'])['combis'].apply(list)
# get empty rows
df2 = df.copy().reset_index().melt(id_vars=['index'])
df2 = df2[df2['value'].isna()]
df2 = df2.groupby(['index'])['variable'].apply(list)
# combine
df.join(df1).join(df2)
# A B C ... F combis variable
# 0 f e b ... a [D:F, F:D] NaN
# 1 c b a ... None [A:D, D:A, B:E, E:B] [F]
# 2 f f a ... c [A:B, B:A, E:F, F:E] NaN
# 3 d c c ... d [A:D, A:F, D:A, D:F, F:A, F:D, B:C, B:E, C:B, ... NaN
# 4 f b b ... b [B:C, B:D, B:F, C:B, C:D, C:F, D:B, D:C, D:F, ... NaN
# 5 b a f ... a [B:F, F:B] NaN
# [6 rows x 8 columns]
If you are only interested in the unmatched combinations you can use this:
import itertools
combis = [x+':'+y for x,y in itertools.permutations(df.columns, 2)]
df.join(df1).join(df2)['combis'].map(lambda lst: list(set(combis) - set(lst)))
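If you want the exact MATCHES and Comments columns from the expected output, a minimal sketch (assuming, as in the question's snippet, that only the A-D and B-C pairs are compared):

import numpy as np

pairs = [('A', 'D'), ('B', 'C')]  # only the pairs from the question's snippet

def matches(row):
    # collect the pair labels whose two columns disagree in this row
    unmatched = [a + b for a, b in pairs if row[a] != row[b]]
    return ', '.join(unmatched) + ' Unmatched' if unmatched else 'ALL MATCHED'

df['MATCHES'] = df.apply(matches, axis=1)
df['Comments'] = np.where(df['F'].isna(), 'F is having blank values', '')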

Pandas find duplicates with reversed values between columns

What is the fastest way to find duplicates where the value from Column A has been reversed with the value from Column B?
For example, if I have a DataFrame with:
Column A Column B
0 C P
1 D C
2 L G
3 A D
4 B P
5 B G
6 P B
7 J T
8 P C
9 J T
The result should be:
Column A Column B
0 C P
8 P C
4 B P
6 P B
I tried:
df1 = df
df2 = df
for i in df2.index:
    res = df1[(df1['Column A'] == df2['Column A'][i]) & (df1['Column B'] == df2['Column B'][i])]
But this is very slow and it iterates over the same values...
Use merge with a column-renamed DataFrame:
d = {'Column A':'Column B','Column B':'Column A'}
df = df.merge(df.rename(columns=d))
print (df)
Column A Column B
0 C P
1 B P
2 P B
3 P C
You could try using reindex to invert the columns:
column_names = ["Column B", "Column A"]
df = df.reindex(columns=column_names)
Or you could just do this:
col_list = list(df) # get a list of the columns
col_list[0], col_list[1] = col_list[1], col_list[0]
df.columns = col_list # assign back
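Note that the merge answer above loses the original row labels. A sketch that keeps them, combining the column swap with a merge through reset_index:

# swap the column names on a copy, then inner-join on both columns;
# the 'index' column from reset_index preserves each row's original position
swapped = df.rename(columns={'Column A': 'Column B', 'Column B': 'Column A'})
idx = df.reset_index().merge(swapped, on=['Column A', 'Column B'])['index']
print(df.loc[idx])
#   Column A Column B
# 0        C        P
# 4        B        P
# 6        P        B
# 8        P        C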
