Python Pandas: count unique values in row [duplicate]

So I have a dataframe with some values. This is my dataframe:
+----+---+---+---+
| in | x | y | z |
+----+---+---+---+
| 1  | a | a | b |
| 2  | a | b | b |
| 3  | a | b | c |
| 4  | b | b | c |
+----+---+---+---+
I would like to get number of unique values of each row, and number of values that are not equal to value in column x. The result should look like this:
+----+---+---+-----+----------------+--------+
| in | x | y | z   | count of not x | unique |
+----+---+---+-----+----------------+--------+
| 1  | a | a | b   | 1              | 2      |
| 2  | a | b | b   | 2              | 2      |
| 3  | a | b | c   | 2              | 3      |
| 4  | b | b | nan | 0              | 1      |
+----+---+---+-----+----------------+--------+
I could come up with some dirty solutions here, but there must be a more elegant way of doing this. My mind keeps circling around drop_duplicates (which does not work on a Series), converting to an array and using .unique(), df.iterrows() (which I want to avoid), and .apply on each row.

Here are solutions using apply.
df['count of not x'] = df.apply(lambda x: (x[['y','z']] != x['x']).sum(), axis=1)
df['unique'] = df.apply(lambda x: x[['x','y','z']].nunique(), axis=1)
A non-apply solution for getting count of not x:
df['count of not x'] = (~df[['y','z']].isin(df['x'])).sum(1)
Can't think of anything great for unique. This uses apply, but may be faster, depending on the shape of the data.
df['unique'] = df[['x','y','z']].T.apply(lambda x: x.nunique())
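On recent pandas both columns can also be computed without apply; a minimal sketch (DataFrame.nunique(axis=1) requires pandas >= 0.20):
# compare y and z against x column-wise, aligned on the index, then count mismatches per row
df['count of not x'] = df[['y', 'z']].ne(df['x'], axis=0).sum(axis=1)
# count distinct values across the three columns, row by row
df['unique'] = df[['x', 'y', 'z']].nunique(axis=1)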

Related

Looking for a solution to sum numeric and float elements stored as lists in one of the columns of a dataframe

| Index | col1 |
| -------- | -------------- |
| 0 | [0,0] |
| 2 | [7.9, 11.06] |
| 3 | [0.9, 4] |
| 4 | NAN |
I have data similar to this. I want to add the elements of each list and store the result in another column, say Total, so that the output looks like this:
| Index | col1 |Total |
| -------- | -------------- | --------|
| 0 | [0,0] |0 |
| 2 | [7.9, 11.06] |18.9 |
| 3 | [0.9, 4] |4.9 |
| 4 | NAN |NAN |
Using the na_action parameter of map should work as well:
df['Total'] = df['col1'].map(sum,na_action='ignore')
Use apply with a lambda to sum each list, or return pd.NA when the value is not a list:
df['Total'] = df['col1'].apply(lambda x: sum(x) if isinstance(x, list) else pd.NA)
I tried df.fillna([]), but a list is not a valid value argument for fillna.
Edit: consider using awkward arrays instead of lists: https://awkward-array.readthedocs.io/en/latest/
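Putting the map approach together on the sample data, as a minimal sketch (assuming the column holds plain Python lists with NaN for the missing entry):
import pandas as pd
df = pd.DataFrame({'col1': [[0, 0], [7.9, 11.06], [0.9, 4], float('nan')]},
                  index=[0, 2, 3, 4])
# na_action='ignore' skips the NaN row and leaves it as NaN in the result
df['Total'] = df['col1'].map(sum, na_action='ignore')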

Replace column name by Index

I have the below data in a Dataframe.
+----+------+----+------+
| Id | Name | Id | Name |
+----+------+----+------+
| 1 | A | 1 | C |
| 2 | B | 2 | B |
+----+------+----+------+
Though the column names repeat, this is really a comparison of the first two columns (old data) with the last two columns (new data).
I was trying to rename the second-to-last column by appending _New to it via its index, using the code below. Unfortunately, _New is also being appended to the first column.
df.rename(columns={df.columns[2]: df.columns[2] + '_New'}, inplace=True)
Here's the result I am getting using the above code.
+--------+------+--------+------+
| Id_New | Name | Id_New | Name |
+--------+------+--------+------+
| 1 | A | 1 | C |
| 2 | B | 2 | B |
+--------+------+--------+------+
My understanding is that it should add _New only to the second-to-last column. Below is the expected result.
+----+------+--------+------+
| Id | Name | Id_New | Name |
+----+------+--------+------+
| 1 | A | 1 | C |
| 2 | B | 2 | B |
+----+------+--------+------+
Is there any way to accomplish this?
You can use a simple loop with a dictionary to keep track of the increments. I generalized the logic here to handle an arbitrary number of duplicates:
cols = {}        # running count of how many times each name has been seen
new_cols = []
for c in df.columns:
    if c in cols:
        # duplicate name: append a suffix with the running count
        new_cols.append(f'{c}_New{cols[c]}')
        cols[c] += 1
    else:
        # first occurrence: keep the name unchanged
        new_cols.append(c)
        cols[c] = 1
df.columns = new_cols
output:
Id Name Id_New1 Name_New1
0 1 A 1 C
1 2 B 2 B
If you really want Id_New, then Id_New2, etc., change:
new_cols.append(f'{c}_New{cols[c]}')
to
i = cols[c] if cols[c] != 1 else ''
new_cols.append(f'{c}_New{i}')
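If, as in the original question, you only need to rename the one column at a known position, you can also rebuild the column list positionally instead of going through rename; a minimal sketch:
cols = df.columns.tolist()    # plain Python list of the current names
cols[2] = cols[2] + '_New'    # touch only the column at position 2
df.columns = cols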

What is the most efficient way of replacing negative values in PySpark DataFrame column with zero?

My goal is to replace all negative elements in a column of a PySpark.DataFrame with zero.
input data
+------+
| col1 |
+------+
| -2 |
| 1 |
| 3 |
| 0 |
| 2 |
| -7 |
| -14 |
| 3 |
+------+
desired output data
+------+
| col1 |
+------+
| 0 |
| 1 |
| 3 |
| 0 |
| 2 |
| 0 |
| 0 |
| 3 |
+------+
Basically I can do this as below:
df = df.withColumn('col1', F.when(F.col('col1') < 0, 0).otherwise(F.col('col1')))
or a udf can be defined as
import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType
smooth = F.udf(lambda x: x if x > 0 else 0, IntegerType())
df = df.withColumn('col1', smooth(F.col('col1')))
or
df = df.withColumn('col1', (F.col('col1') + F.abs('col1')) / 2)
or
df = df.withColumn('col1', F.greatest(F.col('col1'), F.lit(0)))
My question is: which of these is the most efficient? A udf has optimization issues, so it is almost certainly not the right approach, but I don't know how to compare the other cases. One answer is obviously to run experiments and compare mean running times, but I want to compare these approaches (and any new ones) theoretically.
Thanks in advance...
You can simply build the column with a conditional expression: if x > 0 keep x, else 0. This would be the best approach.
The question has already been addressed, theoretically: Spark functions vs UDF performance?
import pyspark.sql.functions as F
df = df.withColumn("only_positive", F.when(F.col("col1") > 0, F.col("col1")).otherwise(0))
You can overwrite col1 in the original dataframe if you pass that column name to withColumn().
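For example, overwriting col1 in place with the greatest variant from the question (a minimal sketch):
import pyspark.sql.functions as F
# element-wise maximum of col1 and the literal 0 clamps negatives to zero
df = df.withColumn('col1', F.greatest(F.col('col1'), F.lit(0)))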

What is the smartest way to get the rest of a pandas.DataFrame?

Here is a pandas.DataFrame df.
| Foo | Bar |
|-----|-----|
| 0 | A |
| 1 | B |
| 2 | C |
| 3 | D |
| 4 | E |
I selected some rows and defined a new dataframe, by df1 = df.iloc[[1,3],:].
| Foo | Bar |
|-----|-----|
| 1 | B |
| 3 | D |
What is the best way to get the rest of df, like the following.
| Foo | Bar |
|-----|-----|
| 0 | A |
| 2 | C |
| 4 | E |
Fast set-based diffing.
df2 = df.loc[df.index.difference(df1.index)]
df2
Foo Bar
0 0 A
2 2 C
4 4 E
Works as long as your index values are unique.
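Under the same assumption (unique index labels taken straight from df), df.drop gives the same result; a minimal sketch:
# drop the rows whose index labels appear in df1
df2 = df.drop(df1.index)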
If I'm understanding correctly, you want to take a dataframe, select some rows from it and store those in a variable df2, and then select rows in df that are not in df2.
If that's the case, you can do df[~df.isin(df2)].dropna().
df[ x ] subsets the dataframe df based on the condition x
~df.isin(df2) is the negation of df.isin(df2), which evaluates element-wise to True for values of df that also appear in df2 under the same index and column labels.
.dropna() drops rows with a NaN value. In this case the rows we don't want were coerced to NaN in the filtering expression above, so we get rid of those.
I assume that Foo can be treated as a unique index.
First select Foo values from df1:
idx = df1['Foo'].values
Then filter your original dataframe:
df2 = df[~df['Foo'].isin(idx)]

Replicating GROUP_CONCAT for pandas.DataFrame

I have a pandas DataFrame df:
+------+---------+
| team | user |
+------+---------+
| A | elmer |
| A | daffy |
| A | bugs |
| B | dawg |
| A | foghorn |
| B | speedy |
| A | goofy |
| A | marvin |
| B | pepe |
| C | petunia |
| C | porky |
+------+---------+
I want to find or write a function that returns the DataFrame I would get in MySQL using the following:
SELECT
    team,
    GROUP_CONCAT(user)
FROM
    df
GROUP BY
    team
for the following result:
+------+---------------------------------------+
| team | group_concat(user) |
+------+---------------------------------------+
| A | elmer,daffy,bugs,foghorn,goofy,marvin |
| B | dawg,speedy,pepe |
| C | petunia,porky |
+------+---------------------------------------+
I can think of nasty ways to do this by iterating over rows and adding to a dictionary, but there's got to be a better way.
Do the following:
df.groupby('team').apply(lambda x: ','.join(x.user))
to get a Series of strings or
df.groupby('team').apply(lambda x: list(x.user))
to get a Series of lists of strings.
Here's what the results look like:
In [33]: df.groupby('team').apply(lambda x: ', '.join(x.user))
Out[33]:
team
a elmer, daffy, bugs, foghorn, goofy, marvin
b dawg, speedy, pepe
c petunia, porky
dtype: object
In [34]: df.groupby('team').apply(lambda x: list(x.user))
Out[34]:
team
a [elmer, daffy, bugs, foghorn, goofy, marvin]
b [dawg, speedy, pepe]
c [petunia, porky]
dtype: object
Note that in general any further operations on these types of Series will be slow and are generally discouraged. If there's another way to aggregate without putting a list inside of a Series you should consider using that approach instead.
A more general solution if you want to use agg:
df.groupby('team').agg({'user' : lambda x: ', '.join(x)})
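On current pandas you can also aggregate the selected column directly, which avoids wrapping the group in a lambda; a minimal sketch (assuming user holds plain strings):
# one comma-separated string per team; .reset_index() turns the result back into a DataFrame
out = df.groupby('team')['user'].agg(','.join).reset_index()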
