Replicating GROUP_CONCAT for pandas.DataFrame - python

I have a pandas DataFrame df:
+------+---------+
| team | user |
+------+---------+
| A | elmer |
| A | daffy |
| A | bugs |
| B | dawg |
| A | foghorn |
| B | speedy |
| A | goofy |
| A | marvin |
| B | pepe |
| C | petunia |
| C | porky |
+------+---------+
I want to find or write a function that returns the DataFrame I would get in MySQL using the following:
SELECT
    team,
    GROUP_CONCAT(user)
FROM
    df
GROUP BY
    team
for the following result:
+------+---------------------------------------+
| team | group_concat(user) |
+------+---------------------------------------+
| A | elmer,daffy,bugs,foghorn,goofy,marvin |
| B | dawg,speedy,pepe |
| C | petunia,porky |
+------+---------------------------------------+
I can think of nasty ways to do this by iterating over rows and adding to a dictionary, but there's got to be a better way.

Do the following:
df.groupby('team').apply(lambda x: ','.join(x.user))
to get a Series of strings or
df.groupby('team').apply(lambda x: list(x.user))
to get a Series of lists of strings.
Here's what the results look like:
In [33]: df.groupby('team').apply(lambda x: ', '.join(x.user))
Out[33]:
team
A    elmer, daffy, bugs, foghorn, goofy, marvin
B                            dawg, speedy, pepe
C                                petunia, porky
dtype: object
In [34]: df.groupby('team').apply(lambda x: list(x.user))
Out[34]:
team
A    [elmer, daffy, bugs, foghorn, goofy, marvin]
B                           [dawg, speedy, pepe]
C                               [petunia, porky]
dtype: object
Note that any further operations on these kinds of Series will be slow and are generally discouraged. If there is another way to aggregate without putting a list inside a Series, you should consider that approach instead.

A more general solution if you want to use agg:
df.groupby('team').agg({'user' : lambda x: ', '.join(x)})
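As a rough end-to-end sketch of the same idea (assuming pandas 0.25+ for named aggregation; data and column names taken from the question, the output column name users is illustrative):
import pandas as pd

df = pd.DataFrame({
    "team": ["A", "A", "A", "B", "A", "B", "A", "A", "B", "C", "C"],
    "user": ["elmer", "daffy", "bugs", "dawg", "foghorn", "speedy",
             "goofy", "marvin", "pepe", "petunia", "porky"],
})

# GROUP_CONCAT equivalent: join the users of each team into one comma-separated string
out = df.groupby("team", as_index=False).agg(users=("user", ",".join))
print(out)
#   team                                  users
# 0    A  elmer,daffy,bugs,foghorn,goofy,marvin
# 1    B                       dawg,speedy,pepe
# 2    C                          petunia,porky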

Related

Looking for a solution to add numeric and float elements stored in list format in one of the columns of a dataframe

| Index | col1         |
|-------|--------------|
| 0     | [0,0]        |
| 2     | [7.9, 11.06] |
| 3     | [0.9, 4]     |
| 4     | NaN          |
I have data similar to this. I want to add the elements of each list and store the result in another column, say Total, using a loop, such that the output looks like this:
| Index | col1         | Total |
|-------|--------------|-------|
| 0     | [0,0]        | 0     |
| 2     | [7.9, 11.06] | 18.96 |
| 3     | [0.9, 4]     | 4.9   |
| 4     | NaN          | NaN   |
Using the na_action parameter in map should work as well:
df['Total'] = df['col1'].map(sum,na_action='ignore')
Use apply with a lambda to sum the lists, or return pd.NA if the value is not a list:
df['Total'] = df['col1'].apply(lambda x: sum(x) if isinstance(x, list) else pd.NA)
I tried df.fillna([]), but a list is not a valid fill value for fillna.
Edit: consider using awkward arrays instead of lists: https://awkward-array.readthedocs.io/en/latest/
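Putting both suggestions together, a minimal runnable sketch (data and column names reconstructed from the question's table; pd.NA assumes pandas 1.0+):
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": [[0, 0], [7.9, 11.06], [0.9, 4], np.nan]},
                  index=[0, 2, 3, 4])

# Option 1: map sums each list; na_action='ignore' leaves NaN cells untouched
df["Total"] = df["col1"].map(sum, na_action="ignore")

# Option 2: apply sums only when the cell actually holds a list, otherwise pd.NA
df["Total"] = df["col1"].apply(lambda x: sum(x) if isinstance(x, list) else pd.NA)

print(df)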

Pandas: Replace list value with a string of values from another dataframe

I did my best to find an answer here or on Google, without success.
I'm trying to replace a list of IDs inside a cell with a ", "-joined string of names from another DataFrame, which maps each "id" to a "name".
| id   | setting | queues             |
|------|---------|--------------------|
| 1ade | A       | ['asdf']           |
| 2ade | B       |                    |
| 3cfg | C       | ['asdf', 'qwerty'] |

| id     | name  |
|--------|-------|
| asdf   | 'Foo' |
| qwerty | 'Bar' |
Result:
| id   | setting | queues   |
|------|---------|----------|
| 1ade | A       | Foo      |
| 2ade | B       |          |
| 3cfg | C       | Foo, Bar |
I've tried merge, replace, and lambda without success. For example, using this:
merged["queues"] = merged["queues"].apply(lambda q: ", ".join(pd.merge(pd.DataFrame(data=list(q)), queues, right_on="id")["name"]))
Any help will be appreciated because I am losing my mind.
First, if possible, replace any non-list values with empty lists; then convert the second DataFrame to a dictionary and look up each id in the dict, filtering out ids that are missing:
merged["queues"] = merged["queues"].apply(lambda x: x if isinstance(x, list) else [])
d = df2.set_index('id')['name'].to_dict()
merged["queues"] = merged["queues"].apply(lambda x: ",".join(d[y] for y in x if y in d))
print (merged)
id setting queues
0 1ade A Foo
1 2ade B
2 3cfg C Foo,Bar
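For reference, the same steps as one self-contained, runnable sketch (DataFrames rebuilt from the tables above; variable names merged and df2 taken from the answer):
import numpy as np
import pandas as pd

merged = pd.DataFrame({
    "id": ["1ade", "2ade", "3cfg"],
    "setting": ["A", "B", "C"],
    "queues": [["asdf"], np.nan, ["asdf", "qwerty"]],
})
df2 = pd.DataFrame({"id": ["asdf", "qwerty"], "name": ["Foo", "Bar"]})

# Normalise non-list cells to empty lists, build an id -> name dict,
# then join the looked-up names, skipping ids missing from the dict
merged["queues"] = merged["queues"].apply(lambda x: x if isinstance(x, list) else [])
d = df2.set_index("id")["name"].to_dict()
merged["queues"] = merged["queues"].apply(lambda x: ", ".join(d[y] for y in x if y in d))
print(merged)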

Data Profiling using Python

I have a data frame as below:

| member_id | loan_amnt  | Age | Marital_status |
|-----------|------------|-----|----------------|
| AK219     | 49539.09   | 34  | Married        |
| AK314     | 1022454.00 | 37  | NA             |
| BN204     | 75422.00   | 34  | Single         |
I want to create an output file in the below format:

| Columns        | Null Values | Duplicate |
|----------------|-------------|-----------|
| member_id      | N           | N         |
| loan_amnt      | N           | N         |
| Age            | N           | Y         |
| Marital_status | Y           | N         |
I know about the Python package pandas-profiling, but I want to build this myself in the above manner so that I can extend the code for my data sets.
Use something like:
m=df.apply(lambda x: x.duplicated())
n=df.isna()
df_new=(pd.concat([pd.Series(n.any(),name='Null_Values'),pd.Series(m.any(),name='Duplicates')],axis=1)
.replace({True:'Y',False:'N'}))
Here is a Python one-liner:
pd.concat([df.isnull().any(), df.apply(lambda x: x.count() != x.nunique())], axis=1).replace({True: "Y", False: "N"})
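For completeness, the same profiling as one self-contained, runnable sketch (sample data rebuilt from the question; True/False mapped to Y/N as in the expected output):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "member_id": ["AK219", "AK314", "BN204"],
    "loan_amnt": [49539.09, 1022454.00, 75422.00],
    "Age": [34, 37, 34],
    "Marital_status": ["Married", np.nan, "Single"],
})

# One row per column: does it contain any nulls, and does it contain any duplicated values?
profile = pd.concat(
    [df.isna().any().rename("Null Values"),
     df.apply(lambda col: col.duplicated().any()).rename("Duplicate")],
    axis=1,
).replace({True: "Y", False: "N"})
print(profile)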
Actually, pandas-profiling gives you multiple options for figuring out whether there are repeated values.

pandas dataframe get rows based on matched strings in cells

Given the following data frame
+-----+----------------+--------+---------+
| | A | B | C |
+-----+----------------+--------+---------+
| 0 | hello#me.com | 2.0 | Hello |
| 1 | you#you.com | 3.0 | World |
| 2 | us#world.com | hi | holiday |
+-----+----------------+--------+---------+
How can I get all the rows where re.compile(r'[Hh](i|ello)') would match in a cell? That is, from the above example, I would like to get the following output:
+-----+----------------+--------+---------+
| | A | B | C |
+-----+----------------+--------+---------+
| 0 | hello#me.com | 2.0 | Hello |
| 2 | us#world.com | hi | holiday |
+-----+----------------+--------+---------+
I have not been able to come up with a solution for this. Any help would be very much appreciated.
Using stack to avoid apply:
df.loc[df.stack().str.match(r'[Hh](i|ello)').unstack().any(axis=1)]
Using match generates a future warning. The warning is consistent with what we are doing, so that's good. However, findall accomplishes the same thing:
df.loc[df.stack().str.findall(r'[Hh](i|ello)').unstack().any(axis=1)]
You can use the findall function, which takes regular expressions.
msk = df.apply(lambda x: x.str.findall(r'[Hh](i|ello)')).any(axis=1)
df[msk]
+---+--------------+-----+---------+
|   | A            | B   | C       |
+---+--------------+-----+---------+
| 0 | hello#me.com | 2   | Hello   |
| 2 | us#world.com | hi  | holiday |
+---+--------------+-----+---------+
any(axis=1) will check if any of the columns in a given row are true. So msk is a single column of True/False values indicating whether or not the regular expression was found in that row.
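A self-contained variant of the same idea, swapping findall for str.contains so the check yields plain booleans (data rebuilt from the table above; every cell is cast to text so the numeric column does not break the .str accessor):
import pandas as pd

df = pd.DataFrame({
    "A": ["hello#me.com", "you#you.com", "us#world.com"],
    "B": [2.0, 3.0, "hi"],
    "C": ["Hello", "World", "holiday"],
})

# Test the pattern against every cell as text, then keep rows with at least one hit
pattern = r"[Hh](?:i|ello)"
mask = df.astype(str).apply(lambda col: col.str.contains(pattern, regex=True)).any(axis=1)
print(df[mask])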

Python Pandas: count unique values in row [duplicate]

So I have a dataframe with some values. This is my dataframe:
| in | x | y | z |
|----|---|---|---|
| 1  | a | a | b |
| 2  | a | b | b |
| 3  | a | b | c |
| 4  | b | b | c |
I would like to get the number of unique values in each row, and the number of values that are not equal to the value in column x. The result should look like this:
| in | x | y | z   | count of not x | unique |
|----|---|---|-----|----------------|--------|
| 1  | a | a | b   | 1              | 2      |
| 2  | a | b | b   | 2              | 2      |
| 3  | a | b | c   | 2              | 3      |
| 4  | b | b | nan | 0              | 1      |
I could come up with some dirty solutions here, but there must be a more elegant way of doing this. My mind keeps circling around drop_duplicates (which does not work on a Series the way I need), converting to an array and using .unique(), df.iterrows() (which I want to avoid), and .apply on each row.
Here are solutions using apply.
df['count of not x'] = df.apply(lambda x: (x[['y','z']] != x['x']).sum(), axis=1)
df['unique'] = df.apply(lambda x: x[['x','y','z']].nunique(), axis=1)
A non-apply solution for getting count of not x:
df['count of not x'] = (~df[['y','z']].isin(df['x'])).sum(axis=1)
Can't think of anything great for unique. This uses apply, but may be faster, depending on the shape of the data.
df['unique'] = df[['x','y','z']].T.apply(lambda x: x.nunique())
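If row-wise operations without apply are preferred, here is a sketch using vectorised comparisons and DataFrame.nunique(axis=1), which ignores NaN by default (data taken from the input table above):
import pandas as pd

df = pd.DataFrame({
    "in": [1, 2, 3, 4],
    "x": ["a", "a", "a", "b"],
    "y": ["a", "b", "b", "b"],
    "z": ["b", "b", "c", "c"],
})

# Element-wise "not equal to column x", summed across y and z for each row
df["count of not x"] = df[["y", "z"]].ne(df["x"], axis=0).sum(axis=1)
# Row-wise unique count without transposing or apply
df["unique"] = df[["x", "y", "z"]].nunique(axis=1)
print(df)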
