Aggregate values pandas - python

I have a pandas dataframe like this:
Id  A  B  C  D
 1  a  b  c  d
 2  a  b     d
 2  a     c  d
 3  a        d
 3  a  b  c
I want to fill the empty values in columns B, C and D using the values contained in the other rows with the same Id.
The resulting data frame should be the following:
Id  A  B  C  D
 1  a  b  c  d
 2  a  b  c  d
 3  a  b  c  d
It is also possible for the same Id to have different values in the first column (A). In that case, instead of keeping the first instance, I would prefer to put another value indicating this event.
So, for example:
Id  A  B  C  D
 1  a  b  c  d
 2  a  b     d
 2  x     c  d
It becomes:
Id  A  B  C  D
 1  a  b  c  d
 2  f  b  c  d

IIUC, you can use groupby + agg:
>>> df.groupby('Id').agg({'A': lambda x: x.iloc[0] if len(x.unique()) == 1 else 'f',
...                       'B': 'first', 'C': 'first', 'D': 'first'})
    A  B  C  D
Id
1   a  b  c  d
2   f  b  c  d
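Note that the 'first' aggregation takes the first non-NaN value in each group, so this works as expected only if the blanks really are NaN. If they are stored as empty strings, convert them first; a minimal sketch:

import numpy as np

df = df.replace('', np.nan)  # only needed if the blanks are '' rather than NaN

You can also chain .reset_index() on the result if you want Id back as a regular column instead of the index.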

The best way I can think of to do this is to iterate over each unique Id, slice its rows out of the original dataframe, and construct a new row by merging the relevant rows:
import pandas as pd

def aggregate(df):
    ids = df['Id'].unique()
    rows = []
    for id in ids:
        # slice out the rows belonging to this Id
        relevant = df[df['Id'] == id]
        # start from an empty row and fill it column by column
        newrow = {c: "" for c in df.columns}
        for _, row in relevant.iterrows():
            for col in newrow:
                if row[col]:
                    # skip the value if it matches what was collected last
                    if len(newrow[col]):
                        if newrow[col][-1] == row[col]:
                            continue
                    newrow[col] += row[col]
        rows.append(newrow)
    return pd.DataFrame(rows)
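For example, with the second sample above (assuming every cell is a string and the blanks are empty strings), conflicting values for the same Id end up concatenated rather than replaced by 'f':

df = pd.DataFrame({'Id': ['1', '2', '2'],
                   'A':  ['a', 'a', 'x'],
                   'B':  ['b', 'b', ''],
                   'C':  ['c', '',  'c'],
                   'D':  ['d', 'd', 'd']})
print(aggregate(df))
#   Id   A  B  C  D
# 0  1   a  b  c  d
# 1  2  ax  b  c  d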

How do I give score (0/1) to CSV rows

My CSV file's row data looks like this:
a a a a a
b b b b b
c c c c c
d d d d d
a b c d e
a d b c c
When I have patterns like rows 1-5, I want to return the value 0.
When I have a row like row 6, or random alphabets (not like rows 1-5), I want to return the value 1.
How do I do it using Python? It must be done using a CSV file.
You can read your CSV file into a pandas dataframe using:
df = pd.read_csv('input.csv', header=None)  # 'input.csv' is a placeholder for your file
output:
   0  1  2  3  4
0  a  a  a  a  a
1  b  b  b  b  b
2  c  c  c  c  c
3  d  d  d  d  d
4  a  b  c  d  e
5  a  d  b  c  c
Then use nunique to count the number of unique values per row: if it is 1 or 5 (the max), the pattern is valid and the score is 0, otherwise the score is 1. Use between for that.
df.nunique(axis=1).between(2, len(df.columns) - 1).astype(int)
output:
0    0
1    0
2    0
3    0
4    0
5    1
dtype: int64
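Putting the two steps together, a minimal end-to-end sketch (the file names are placeholders):

import pandas as pd

df = pd.read_csv('input.csv', header=None)                 # placeholder path
scores = df.nunique(axis=1).between(2, len(df.columns) - 1).astype(int)
scores.to_csv('scores.csv', index=False, header=False)     # placeholder path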

How to combine string from one column to another column at same index in pandas DataFrame?

I am working on an NLP project.
My input is:
index name lst
0     a    c
0          d
0          e
1          f
1     b    g
I need output like this:
index name lst combine
0     a    c   a c
0          d   a d
0          e   a e
1          f   b f
1     b    g   b g
How can I achieve this?
You can use groupby + transform('max') to fill the empty cells with the letter of their group, since any letter compares greater than the empty string. The rest is a simple string concatenation per column:
df['combine'] = df.groupby('index')['name'].transform('max') + ' ' + df['lst']
Used input:
df = pd.DataFrame({'index': [0,0,0,1,1],
                   'name': ['a','','','','b'],
                   'lst': list('cdefg'),
                  })
NB: I considered "index" to be a column here; if it is the actual index, you should use df.index in the groupby.
Output:
   index name lst combine
0      0    a   c     a c
1      0        d     a d
2      0        e     a e
3      1        f     b f
4      1    b   g     b g
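For the case mentioned in the NB, where "index" is the real index rather than a column, a minimal sketch of the same idea:

df2 = df.set_index('index')
df2['combine'] = df2.groupby(level=0)['name'].transform('max') + ' ' + df2['lst']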

perform df.loc to groupby df

I have a df consisting of person, origin (O) and destination (D):
df = pd.DataFrame({'PersonID':['1','1','2','2','2','3'],'O':['A','B','C','B','A','X'],'D':['B','A','B','A','B','Y']})
the df:
PersonID O D
1 A B
1 B A
2 C B
2 B A
2 A B
3 X Y
I grouped the df with df_grouped = df.groupby(['O','D']) and matched it against another dataframe, taxi:
TaxiID O D
T1 B A
T2 A B
T3 C B
Similarly, I grouped taxi by its O and D. Then I merged the two after aggregating and counting the PersonID and TaxiID per O-D pair, to see how many taxis are available for how many people:
O  D  PersonID  TaxiID
      count     count
A  B  2         1
B  A  2         1
C  B  1         1
Now, I want to use df.loc to take only those PersonID values that were counted in the merged frame. How can I do this? I've tried:
seek = df.loc[df.PersonID.isin(merged['PersonID'])]
but it returns an empty dataframe. What can I do?
Edit: here is the complete code for this case, using dummy data:
df = pd.DataFrame({'PersonID':['1','1','2','2','2','3'],'O':['A','B','C','B','A','X'],'D':['B','A','B','A','B','Y']})
taxi = pd.DataFrame({'TaxiID':['T1','T2','T3'],'O':['B','A','C'],'D':['A','B','B']})
df_grouped = df.groupby(['O','D'])
taxi_grouped = taxi.groupby(['O','D'])
dfm = df_grouped.agg({'PersonID':['count',list]}).reset_index()
tgm = taxi_grouped.agg({'TaxiID':['count',list]}).reset_index()
merged = pd.merge(dfm, tgm, how='inner')
seek = df.loc[df.PersonID.isin(merged['PersonID'])]
The aggregated columns form a MultiIndex, so select the PersonID lists by tuple, then use Series.explode to get scalars out of the nested lists:
seek = df.loc[df.PersonID.isin(merged[('PersonID', 'list')].explode().unique())]
print (seek)
PersonID O D
0 1 A B
1 1 B A
2 2 C B
3 2 B A
4 2 A B
For better performance, it is possible to use a set comprehension to flatten the lists:
seek = df.loc[df.PersonID.isin(set(z for x in merged[('PersonID', 'list')] for z in x))]
print (seek)
PersonID O D
0 1 A B
1 1 B A
2 2 C B
3 2 B A
4 2 A B
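A different way to look at it: after the inner merge, the surviving O-D pairs are exactly those present in both frames, so you can skip the nested lists entirely. A sketch:

# O-D pairs that occur in both the person and the taxi data
pairs = set(zip(df['O'], df['D'])) & set(zip(taxi['O'], taxi['D']))
seek = df[[od in pairs for od in zip(df['O'], df['D'])]]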

How to apply a function by checking a specific column for null values

I am trying to apply a function to the dataframe while checking for NULL values in each row of a specific column.
I have created the function, but I can't work out how to apply it only to the rows that have values.
Input:
   A  B  C  D  E  F
0  f  e  b  a  d  a
1  c  b  a  c  b
2  f  f  a  b  c  c
3  d  c  c  d  c  d
4  f  b  b  b  e  b
5  b  a  f  c  d  a
Expected Output
   A  B  C  D  E  F  MATCHES           Comments
0  f  e  b  a  d  a  AD, BC Unmatched
1  c  b  a  c  b     BC Unmatched      F is having blank values
2  f  f  a  b  c  c  AD, BC Unmatched
3  d  c  c  d  c  d  ALL MATCHED
4  f  b  b  b  e  b  AD Unmatched
5  b  a  f  c  d  a  AD, BC Unmatched
The script works when we don't have to check for NaN values in the df['F'] column, but when we check for the empty rows in df['F'], it gives an error.
Code I have been trying:
def test(x):
    try:
        for idx in df.index:
            unmatch_list = []
            if not df.loc[idx, 'A'] == df.loc[idx, 'D']:
                unmatch_list.append('AD')
            if not df.loc[idx, 'B'] == df.loc[idx, 'C']:
                unmatch_list.append('BC')
            # etcetera...
            if len(unmatch_list):
                unmatch_string = ', '.join(unmatch_list) + ' Unmatched'
            else:
                unmatch_string = 'ALL MATCHED'
            df.loc[idx, 'MATCHES'] = unmatch_string
    except ValueError:
        print('error')
    return df

## df = df.apply(lambda x: test(x) if(pd.notna(df['F'])) else x)
for row in df:
    if row['F'].isna() == True:
        row['Comments'] = "F is having blank values"
    else:
        df = test(df)
Please suggest how I can use the function.
You could try something like this:
# get combis
df1 = df.copy().reset_index().melt(id_vars=['index'])
df1 = df1.merge(df1, on=['index', 'value'], how='inner')
df1 = df1[df1['variable_x'] != df1['variable_y']]
df1['combis'] = df1['variable_x'] + ':' + df1['variable_y']
df1 = df1.groupby(['index'])['combis'].apply(list)
# get empty rows
df2 = df.copy().reset_index().melt(id_vars=['index'])
df2 = df2[df2['value'].isna()]
df2 = df2.groupby(['index'])['variable'].apply(list)
# combine
df.join(df1).join(df2)
# A B C ... F combis variable
# 0 f e b ... a [D:F, F:D] NaN
# 1 c b a ... None [A:D, D:A, B:E, E:B] [F]
# 2 f f a ... c [A:B, B:A, E:F, F:E] NaN
# 3 d c c ... d [A:D, A:F, D:A, D:F, F:A, F:D, B:C, B:E, C:B, ... NaN
# 4 f b b ... b [B:C, B:D, B:F, C:B, C:D, C:F, D:B, D:C, D:F, ... NaN
# 5 b a f ... a [B:F, F:B] NaN
# [6 rows x 8 columns]
If you are only interested in the unmatched combinations you can use this:
import itertools
combis = [x+':'+y for x,y in itertools.permutations(df.columns, 2)]
df.join(df1).join(df2)['combis'].map(lambda lst: list(set(combis) - set(lst)))
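If you want the exact MATCHES and Comments columns from the expected output, a minimal sketch (assuming, as in the question's snippet, that only the A-D and B-C pairs are compared):

import numpy as np

pairs = [('A', 'D'), ('B', 'C')]  # only the pairs from the question's snippet

def matches(row):
    # collect the pair labels whose two columns disagree in this row
    unmatched = [a + b for a, b in pairs if row[a] != row[b]]
    return ', '.join(unmatched) + ' Unmatched' if unmatched else 'ALL MATCHED'

df['MATCHES'] = df.apply(matches, axis=1)
df['Comments'] = np.where(df['F'].isna(), 'F is having blank values', '')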

Pandas find duplicates with reversed values between columns

What is the fastest way to find duplicates where the value from Column A has been reversed with the value from Column B?
For example, if I have a DataFrame with:
Column A Column B
0 C P
1 D C
2 L G
3 A D
4 B P
5 B G
6 P B
7 J T
8 P C
9 J T
The result should be:
Column A Column B
0 C P
8 P C
4 B P
6 P B
I tried:
df1 = df
df2 = df
for i in df2.index:
    res = df1[(df1['Column A'] == df2['Column A'][i]) & (df1['Column B'] == df2['Column B'][i])]
But this is very slow and it iterates over the same values...
Use merge with a column-renamed DataFrame:
d = {'Column A':'Column B','Column B':'Column A'}
df = df.merge(df.rename(columns=d))
print (df)
Column A Column B
0 C P
1 B P
2 P B
3 P C
You could try using reindex to invert the columns:
column_names = ["Column B", "Column A"]
df = df.reindex(columns=column_names)
Or you could just do this:
col_list = list(df) # get a list of the columns
col_list[0], col_list[1] = col_list[1], col_list[0]
df.columns = col_list # assign back
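Note that the merge answer above loses the original row labels. A sketch that keeps them, combining the column swap with a merge through reset_index:

# swap the column names on a copy, then inner-join on both columns;
# the 'index' column from reset_index preserves each row's original position
swapped = df.rename(columns={'Column A': 'Column B', 'Column B': 'Column A'})
idx = df.reset_index().merge(swapped, on=['Column A', 'Column B'])['index']
print(df.loc[idx])
#   Column A Column B
# 0        C        P
# 4        B        P
# 6        P        B
# 8        P        C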
