Python newbie here. I have written code that solves the issue, but there must be a much better way of doing it.
I have two Series that come from the same table but, due to an earlier process, I get them as separate objects. (They could be joined into a single DataFrame again, since the entries belong to the same records.)
Ser1
| id |
|----|
| 1  |
| 2  |
| 2  |
| 3  |

Ser2
| section |
|---------|
| A       |
| B       |
| C       |
| D       |
df2
| id | section |
| ---|---------|
| 1 | A |
| 2 | B |
| 2 | Z |
| 2 | Y |
| 4 | X |
First, I would like to find the entries in Ser1 whose id also appears in df2. Then, for those entries, check whether the corresponding value in Ser2 can NOT be found in the section column of df2.
My expected results:
| id | section | result |
|----|---------|--------|
| 1  | A       | False  |
| 2  | B       | False  |
| 2  | C       | True   |
| 3  | D       | False  |

- Row 1: both id(1) and section(A) are also in df2, so False.
- Row 2: both id(2) and section(B) are also in df2, so False.
- Row 3: id(2) is in df2 but section(C) is not, so True.
- Row 4: id(3) is not in df2; in that case the result should also be False.
My code:
for k, v in Ser2.items():
    rslt_df = df2[df2['id'] == Ser1[k]]
    if rslt_df.empty:
        print(False)
    elif v not in rslt_df['section'].tolist():
        print(True)
    else:
        print(False)
I know the code is not very good, but after reading about merging and list comprehensions I am confused about the best way to improve it.
You can concat the series and compute the "result" with boolean arithmetic (XOR):
out = (
pd.concat([ser1, ser2], axis=1)
.assign(result=ser1.isin(df2['id'])!=ser2.isin(df2['section']))
)
Output:
id section result
0 1 A False
1 2 B False
2 2 C True
3 3 D False
Intermediates:
m1 = ser1.isin(df2['id'])
m2 = ser2.isin(df2['section'])
m1 m2 m1!=m2
0 True True False
1 True True False
2 True False True
3 False False False
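For reference, a minimal self-contained sketch that rebuilds the example data from the question and computes the same result:

import pandas as pd

# rebuild the example data from the question
ser1 = pd.Series([1, 2, 2, 3], name='id')
ser2 = pd.Series(['A', 'B', 'C', 'D'], name='section')
df2 = pd.DataFrame({'id': [1, 2, 2, 2, 4],
                    'section': ['A', 'B', 'Z', 'Y', 'X']})

# result is the XOR of the two membership checks
out = pd.concat([ser1, ser2], axis=1).assign(
    result=ser1.isin(df2['id']) != ser2.isin(df2['section'])
)
print(out)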
| Index | col1         |
|-------|--------------|
| 0     | [0, 0]       |
| 2     | [7.9, 11.06] |
| 3     | [0.9, 4]     |
| 4     | NaN          |
I have data similar to this. I want to add the elements of each list and store the result in another column, say Total, using a loop, so that the output looks like this:
| Index | col1         | Total |
|-------|--------------|-------|
| 0     | [0, 0]       | 0     |
| 2     | [7.9, 11.06] | 18.96 |
| 3     | [0.9, 4]     | 4.9   |
| 4     | NaN          | NaN   |
Using na_action parameter in map should work as well:
df['Total'] = df['col1'].map(sum,na_action='ignore')
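A minimal sketch of that approach, assuming col1 holds Python lists and the missing entry is NaN, as in the sample table:

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [[0, 0], [7.9, 11.06], [0.9, 4], np.nan]},
                  index=[0, 2, 3, 4])

# sum each list; NaN entries are passed through unchanged thanks to na_action
df['Total'] = df['col1'].map(sum, na_action='ignore')
print(df)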
Use apply with a lambda to sum the lists, or return pd.NA if the value is not a list:
df['Total'] = df['col1'].apply(lambda x: sum(x) if isinstance(x, list) else pd.NA)
I tried with df.fillna([]), but lists are not a valid fill value for fillna.
Edit: consider using awkward arrays instead of lists: https://awkward-array.readthedocs.io/en/latest/
I have a pyspark dataframe that looks like this.
+--------------------+-------+--------------------+
| ID |country| attrs|
+--------------------+-------+--------------------+
|ffae10af | US|[1,2,3,4...] |
|3de27656 | US|[1,7,2,4...] |
|75ce4e58 | US|[1,2,1,4...] |
|908df65c | US|[1,8,3,0...] |
|f0503257 | US|[1,2,3,2...] |
|2tBxD6j | US|[1,2,3,4...] |
|33811685 | US|[1,5,3,5...] |
|aad21639 | US|[7,8,9,4...] |
|e3d9e3bb | US|[1,10,9,4...] |
|463f6f69 | US|[12,2,13,4...] |
+--------------------+-------+--------------------+
I also have a set that looks like this
reference_set = (1,2,100,500,821)
What I want to do is create a new list as a column in the dataframe, maybe using a list comprehension like [attr for attr in attrs if attr in reference_set].
So my final dataframe should be something like this:
+--------------------+-------+--------------------+
| ID |country| filtered_attrs|
+--------------------+-------+--------------------+
|ffae10af | US|[1,2] |
|3de27656 | US|[1,2] |
|75ce4e58 | US|[1,2] |
|908df65c | US|[1] |
|f0503257 | US|[1,2] |
|2tBxD6j | US|[1,2] |
|33811685 | US|[1] |
|aad21639 | US|[] |
|e3d9e3bb | US|[1] |
|463f6f69 | US|[2] |
+--------------------+-------+--------------------+
How can I do this? As I'm new to PySpark, I can't come up with the logic for it.
Edit: I posted an approach below; if there's a more efficient way of doing this, please let me know.
You can use the built-in function array_intersect.
from pyspark.sql.functions import array_intersect, lit, split

# Sample dataframe
df = spark.createDataFrame([('ffae10af', 'US', [1, 2, 3, 4])], ["ID", "Country", "attrs"])
reference_set = {1, 2, 100, 500, 821}

# Add the reference set as an array column, then intersect it with attrs
set_to_string = ",".join([str(x) for x in reference_set])
df.withColumn('reference_set', split(lit(set_to_string), ',').cast('array<bigint>')) \
  .withColumn('filtered_attrs', array_intersect('attrs', 'reference_set')) \
  .show(truncate=False)
+--------+-------+------------+---------------------+--------------+
|ID |Country|attrs |reference_set |filtered_attrs|
+--------+-------+------------+---------------------+--------------+
|ffae10af|US |[1, 2, 3, 4]|[1, 2, 100, 500, 821]|[1, 2] |
+--------+-------+------------+---------------------+--------------+
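If you'd rather not go through a string, a sketch of the same idea (same df and reference_set as above) that builds the reference array directly with array and lit:

from pyspark.sql.functions import array, array_intersect, lit

# build the reference set as a literal array column and intersect it with attrs
reference_col = array(*[lit(x) for x in sorted(reference_set)])
df.withColumn('filtered_attrs', array_intersect('attrs', reference_col)) \
  .show(truncate=False)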
I managed to use the filter function paired with a UDF to make this work.
from pyspark.sql.functions import col, udf
from pyspark.sql.types import ArrayType, LongType

def filter_items(item):
    return item in reference_set

# declare the return type so filtered_attrs is an array column rather than a string
custom_udf = udf(lambda attributes: list(filter(filter_items, attributes)),
                 ArrayType(LongType()))
processed_df = df.withColumn('filtered_attrs', custom_udf(col('attrs')))
This gives me the required output.
If I have a DataFrame as so:
| id | attribute_1 | attribute_2 |
|--------|-------------|-------------|
| 123abc | TRUE | TRUE |
| 123abc | TRUE | FALSE |
| 456def | TRUE | FALSE |
| 789ghi | TRUE | TRUE |
| 789ghi | FALSE | FALSE |
| 789ghi | FALSE | FALSE |
How do I apply a groupby or some equivalent filter to count the unique number of id elements in a subset of the DataFrame that looks like this:
| id | attribute_1 | attribute_2 |
|--------|-------------|-------------|
| 123abc | TRUE | TRUE |
| 123abc | TRUE | FALSE |
Meaning, I want to get the number of unique id values where attribute_1 == True for all instances of a given id AND attribute_2 is True for at least one of them.
So, 456def would not be included in the filter because it does not have at least one True for attribute_2.
789ghi would not be included because not all of its attribute_1 entries are True.
You'll need to groupby twice, once with transform('all') on "attribute_1" and the second time with transform('any') on "attribute_2".
i = df[df.groupby('id').attribute_1.transform('all')]
j = i[i.groupby('id').attribute_2.transform('any')]
print(j)
id attribute_1 attribute_2
0 123abc True True
1 123abc True False
Finally, to get the unique IDs that satisfy this condition, call nunique:
print(j['id'].nunique())
1
This is easiest to do when your attribute_* columns are boolean. If they are strings, fix them first:
df = df.replace({'TRUE': True, 'FALSE': False})
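Equivalently, a minimal sketch (same df and column names, with the booleans already fixed as above) that builds both masks from a single groupby and filters in one step:

g = df.groupby('id')

# ids where every attribute_1 is True AND at least one attribute_2 is True
mask = g.attribute_1.transform('all') & g.attribute_2.transform('any')

print(df.loc[mask, 'id'].nunique())
# 1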
I'm working with Spark and Python. I would like to transform my input dataset.
My input dataset (RDD)
-------------------------------------------------------------
| id | var |
-------------------------------------------------------------
| 1 |"[{index: 1, value: 200}, {index: 2, value: A}, ...]" |
| 2 |"[{index: 1, value: 140}, {index: 2, value: C}, ...]" |
| .. | ... |
-------------------------------------------------------------
I would like to have this DataFrame (output dataset)
----------------------
| id | index | value |
----------------------
| 1 | 1 | 200 |
| 1 | 2 | A |
| 1 | ... | ... |
| 2 | 1 | 140 |
| 2 | 2 | C |
| ...| ... | ... |
----------------------
I created a map function:
def process(row):
    # build one record per (index, value) pair, carrying the row id
    records = []
    for item in row['var']:
        my_dict = {}
        my_dict['id'] = row['id']
        my_dict['index'] = item['index']
        my_dict['value'] = item['value']
        records.append(my_dict)
    return records
I would like to map my process function like this:
output_rdd = input_rdd.map(process)
Is it possible to do it this way (or is there a simpler way)?
I found the solution:
output_rdd = input_rdd.map(lambda row:process(row)).flatMap(lambda x: x)
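Since process returns a list of records for each row, a slightly simpler sketch (same input_rdd, assuming an active SparkSession for the optional DataFrame step):

from pyspark.sql import Row

# flatMap applies process to each row and flattens the per-row lists of records
output_rdd = input_rdd.flatMap(process)

# optional: turn the flattened records into the desired DataFrame (id, index, value)
output_df = output_rdd.map(lambda d: Row(**d)).toDF()
output_df.show()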