Python newbie here. I have written code that solves the issue, but there must be a much better way of doing it.
I have two Series that come from the same table but, due to an earlier process, I get them as separate objects. (They could be joined into a single DataFrame again, since the entries belong to the same records.)
Ser1
| id |
|----|
| 1  |
| 2  |
| 2  |
| 3  |

Ser2
| section |
|---------|
| A       |
| B       |
| C       |
| D       |
df2
| id | section |
| ---|---------|
| 1 | A |
| 2 | B |
| 2 | Z |
| 2 | Y |
| 4 | X |
First, I would like to find the entries in Ser1 whose id also appears in df2. Then, for those entries, check whether the corresponding value in Ser2 can NOT be found in the section column of df2.
My expected results:
| id | section | result |
|----|---------|--------|
| 1  | A       | False  |
| 2  | B       | False  |
| 2  | C       | True   |
| 3  | D       | False  |

- Row 1: both id(1) and section(A) are also in df2, so False.
- Row 2: both id(2) and section(B) are also in df2, so False.
- Row 3: id(2) is in df2 but section(C) is not, so True.
- Row 4: id(3) is not in df2; in that case the result should also be False.
My code:
for k, v in Ser2.items():
    rslt_df = df2[df2['id'] == Ser1[k]]
    if rslt_df.empty:
        print(False)
    elif v not in rslt_df['section'].tolist():
        print(True)
    else:
        print(False)
I know the code is not very good, but after reading about merging and list comprehensions I am confused about the best way to improve it.
You can concat the series and compute the "result" with boolean arithmetic (XOR):
out = (
pd.concat([ser1, ser2], axis=1)
.assign(result=ser1.isin(df2['id'])!=ser2.isin(df2['section']))
)
Output:
id section result
0 1 A False
1 2 B False
2 2 C True
3 3 D False
Intermediates:
m1 = ser1.isin(df2['id'])
m2 = ser2.isin(df2['section'])
m1 m2 m1!=m2
0 True True False
1 True True False
2 True False True
3 False False False
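For reference, a minimal self-contained sketch that rebuilds the example data from the question and computes the same result:

import pandas as pd

# rebuild the example data from the question
ser1 = pd.Series([1, 2, 2, 3], name='id')
ser2 = pd.Series(['A', 'B', 'C', 'D'], name='section')
df2 = pd.DataFrame({'id': [1, 2, 2, 2, 4],
                    'section': ['A', 'B', 'Z', 'Y', 'X']})

# result is the XOR of the two membership checks
out = pd.concat([ser1, ser2], axis=1).assign(
    result=ser1.isin(df2['id']) != ser2.isin(df2['section'])
)
print(out)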
| Index | col1         |
|-------|--------------|
| 0     | [0, 0]       |
| 2     | [7.9, 11.06] |
| 3     | [0.9, 4]     |
| 4     | NaN          |
I have data similar to this. I want to add the elements of each list and store the result in another column, say Total, using a loop, so that the output looks like this:
| Index | col1         | Total |
|-------|--------------|-------|
| 0     | [0, 0]       | 0     |
| 2     | [7.9, 11.06] | 18.96 |
| 3     | [0.9, 4]     | 4.9   |
| 4     | NaN          | NaN   |
Using na_action parameter in map should work as well:
df['Total'] = df['col1'].map(sum,na_action='ignore')
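A minimal sketch of that approach, assuming col1 holds Python lists and the missing entry is NaN, as in the sample table:

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [[0, 0], [7.9, 11.06], [0.9, 4], np.nan]},
                  index=[0, 2, 3, 4])

# sum each list; NaN entries are passed through unchanged thanks to na_action
df['Total'] = df['col1'].map(sum, na_action='ignore')
print(df)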
Use apply with a lambda to sum the lists, or return pd.NA if the value is not a list:
df['Total'] = df['col1'].apply(lambda x: sum(x) if isinstance(x, list) else pd.NA)
I tried with df.fillna([]), but lists are not a valid fill value for fillna.
Edit: consider using awkward arrays instead of lists: https://awkward-array.readthedocs.io/en/latest/
I have a pyspark dataframe that looks like this.
+--------------------+-------+--------------------+
| ID |country| attrs|
+--------------------+-------+--------------------+
|ffae10af | US|[1,2,3,4...] |
|3de27656 | US|[1,7,2,4...] |
|75ce4e58 | US|[1,2,1,4...] |
|908df65c | US|[1,8,3,0...] |
|f0503257 | US|[1,2,3,2...] |
|2tBxD6j | US|[1,2,3,4...] |
|33811685 | US|[1,5,3,5...] |
|aad21639 | US|[7,8,9,4...] |
|e3d9e3bb | US|[1,10,9,4...] |
|463f6f69 | US|[12,2,13,4...] |
+--------------------+-------+--------------------+
I also have a set that looks like this
reference_set = (1,2,100,500,821)
What I want to do is create a new list as a column in the dataframe, maybe using a list comprehension like [attr for attr in attrs if attr in reference_set].
So my final dataframe should be something like this:
+--------------------+-------+--------------------+
| ID |country| filtered_attrs|
+--------------------+-------+--------------------+
|ffae10af | US|[1,2] |
|3de27656 | US|[1,2] |
|75ce4e58 | US|[1,2] |
|908df65c | US|[1] |
|f0503257 | US|[1,2] |
|2tBxD6j | US|[1,2] |
|33811685 | US|[1] |
|aad21639 | US|[] |
|e3d9e3bb | US|[1] |
|463f6f69 | US|[2] |
+--------------------+-------+--------------------+
How can I do this? As I'm new to PySpark, I can't come up with the logic for it.
Edit: I posted an approach below; if there's a more efficient way of doing this, please let me know.
You can use the built-in function array_intersect.
from pyspark.sql.functions import array_intersect, lit, split

# Sample dataframe
df = spark.createDataFrame([('ffae10af', 'US', [1, 2, 3, 4])], ["ID", "Country", "attrs"])
reference_set = {1, 2, 100, 500, 821}

# Add the reference set as an array column, then intersect it with attrs
set_to_string = ",".join([str(x) for x in reference_set])
df.withColumn('reference_set', split(lit(set_to_string), ',').cast('array<bigint>')) \
  .withColumn('filtered_attrs', array_intersect('attrs', 'reference_set')) \
  .show(truncate=False)
+--------+-------+------------+---------------------+--------------+
|ID |Country|attrs |reference_set |filtered_attrs|
+--------+-------+------------+---------------------+--------------+
|ffae10af|US |[1, 2, 3, 4]|[1, 2, 100, 500, 821]|[1, 2] |
+--------+-------+------------+---------------------+--------------+
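If you'd rather not go through a string, a sketch of the same idea (same df and reference_set as above) that builds the reference array directly with array and lit:

from pyspark.sql.functions import array, array_intersect, lit

# build the reference set as a literal array column and intersect it with attrs
reference_col = array(*[lit(x) for x in sorted(reference_set)])
df.withColumn('filtered_attrs', array_intersect('attrs', reference_col)) \
  .show(truncate=False)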
I managed to use the filter function paired with a UDF to make this work.
from pyspark.sql.functions import col, udf
from pyspark.sql.types import ArrayType, LongType

def filter_items(item):
    return item in reference_set

# declare the return type so filtered_attrs is an array column rather than a string
custom_udf = udf(lambda attributes: list(filter(filter_items, attributes)),
                 ArrayType(LongType()))
processed_df = df.withColumn('filtered_attrs', custom_udf(col('attrs')))
This gives me the required output.
If I have a DataFrame as so:
| id | attribute_1 | attribute_2 |
|--------|-------------|-------------|
| 123abc | TRUE | TRUE |
| 123abc | TRUE | FALSE |
| 456def | TRUE | FALSE |
| 789ghi | TRUE | TRUE |
| 789ghi | FALSE | FALSE |
| 789ghi | FALSE | FALSE |
How do I apply a groupby or some equivalent filter to count the unique number of id elements in a subset of the DataFrame that looks like this:
| id | attribute_1 | attribute_2 |
|--------|-------------|-------------|
| 123abc | TRUE | TRUE |
| 123abc | TRUE | FALSE |
Meaning, I want to get the number of unique id values where attribute_1 == True for all instances of a given id AND attribute_2 is True for at least one of them.
So, 456def would not be included in the filter because it does not have at least one True for attribute_2.
789ghi would not be included because not all of its attribute_1 entries are True.
You'll need to groupby twice, once with transform('all') on "attribute_1" and the second time with transform('any') on "attribute_2".
i = df[df.groupby('id').attribute_1.transform('all')]
j = i[i.groupby('id').attribute_2.transform('any')]
print(j)
id attribute_1 attribute_2
0 123abc True True
1 123abc True False
Finally, to get the unique IDs that satisfy this condition, call nunique:
print(j['id'].nunique())
1
This is easiest to do when your attribute_* columns are boolean. If they are strings, fix them first:
df = df.replace({'TRUE': True, 'FALSE': False})
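Equivalently, a minimal sketch (same df and column names, with the booleans already fixed as above) that builds both masks from a single groupby and filters in one step:

g = df.groupby('id')

# ids where every attribute_1 is True AND at least one attribute_2 is True
mask = g.attribute_1.transform('all') & g.attribute_2.transform('any')

print(df.loc[mask, 'id'].nunique())
# 1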
I'm working with Spark and Python. I would like to transform my input dataset.
My input dataset (RDD)
-------------------------------------------------------------
| id | var |
-------------------------------------------------------------
| 1 |"[{index: 1, value: 200}, {index: 2, value: A}, ...]" |
| 2 |"[{index: 1, value: 140}, {index: 2, value: C}, ...]" |
| .. | ... |
-------------------------------------------------------------
I would like to have this DataFrame (output dataset)
----------------------
| id | index | value |
----------------------
| 1 | 1 | 200 |
| 1 | 2 | A |
| 1 | ... | ... |
| 2 | 1 | 140 |
| 2 | 2 | C |
| ...| ... | ... |
----------------------
I created a map function:
def process(row):
    # build one record per (index, value) pair, carrying the row id
    records = []
    for item in row['var']:
        my_dict = {}
        my_dict['id'] = row['id']
        my_dict['index'] = item['index']
        my_dict['value'] = item['value']
        records.append(my_dict)
    return records
I would like to map my process function like this:
output_rdd = input_rdd.map(process)
Is it possible to do it this way (or is there a simpler way)?
I found the solution:
output_rdd = input_rdd.map(lambda row:process(row)).flatMap(lambda x: x)
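Since process returns a list of records for each row, a slightly simpler sketch (same input_rdd, assuming an active SparkSession for the optional DataFrame step):

from pyspark.sql import Row

# flatMap applies process to each row and flattens the per-row lists of records
output_rdd = input_rdd.flatMap(process)

# optional: turn the flattened records into the desired DataFrame (id, index, value)
output_df = output_rdd.map(lambda d: Row(**d)).toDF()
output_df.show()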