delete duplicates between two rows Tableau - python

How can I delete duplicates between two values and keep only the first value, per user ID, in Tableau?
For example, for a certain user:
| status  | date     |
| ------- | -------- |
| success | 1/1/2022 |
| fail    | 1/2/2022 |
| fail    | 1/3/2022 |
| fail    | 1/4/2022 |
| success | 1/5/2022 |
I want the result to be:
| status  | date     |
| ------- | -------- |
| success | 1/1/2022 |
| fail    | 1/2/2022 |
| success | 1/5/2022 |
In Python it would look like this:
edited_data = []
for key in d:  # d maps each user id to its DataFrame of status/date rows
    # Keep a row only if its status differs from the previous row's status
    dup = [True]
    total_len = len(d[key].index)
    for i in range(1, total_len):
        if d[key].iloc[i]['status'] == d[key].iloc[i - 1]['status']:
            dup.append(False)
        else:
            dup.append(True)
    edited_data.append(d[key][dup])

One way you could do this is with the LOOKUP() function. Since this particular problem requires each row to know what came before it, it will be important to make sure your dates are sorted correctly and that the table calculation is computed correctly. Something like this should work:
IF LOOKUP(MIN([Status]),-1) = MIN([Status]) THEN "Hide" ELSE "Show" END
And then simply hide or exclude the "Hide" rows.
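On the Python side, the per-user loop in the question can also be vectorized with pandas. This is just a sketch, assuming each user's rows are already sorted by date and the column is named status as in the example:

```python
import pandas as pd

def drop_consecutive_duplicates(df: pd.DataFrame) -> pd.DataFrame:
    # Keep a row only when its status differs from the previous row's status;
    # shift() makes the first row compare against NaN, so it is always kept.
    keep = df['status'] != df['status'].shift()
    return df[keep]

# Applied per user, mirroring the loop in the question:
# edited_data = [drop_consecutive_duplicates(d[key]) for key in d]
```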

Related

Detect Changes In Two Or More CSVs Using Pandas

I am trying to use Pandas to detect changes across two CSVs. I would like it ideally to highlight which UIDs have been changed. I've attached an example of the ideal output here.
CSV 1 (imported as DataFrame):
| UID | Email         |
| --- | ------------- |
| U01 | u01@email.com |
| U02 | u02@email.com |
| U03 | u03@email.com |
| U04 | u04@email.com |
CSV 2 (imported as DataFrame):
| UID | Email               |
| --- | ------------------- |
| U01 | u01@email.com       |
| U02 | newemail@email.com  |
| U03 | u03@email.com       |
| U04 | newemail2@email.com |
| U05 | u05@email.com       |
| U06 | u06@email.com       |
Over the two CSVs, U02 and U04 saw email changes, whereas U05 and U06 were new records entirely.
I have tried using the pandas compare function, and unfortunately it doesn't work because CSV2 has more records than CSV1.
I have since concatenated the UID and Email fields, like so, and created a new field called "Unique" that shows whether the concatenated value is a duplicate (True or False), but it doesn't show whether a record is entirely new:
df3['Concatenated'] = df3["UID"] +"~"+ df3["Email"]
df3['Unique'] = ~df3['Concatenated'].duplicated(keep=False)
This works to an extent, but it feels clunky, and I was wondering if anyone had a smarter way of doing this, especially when it comes to showing whether a record is new or not.
The strategy here is to merge the two dataframes on UID, then compare the email columns, and finally see if the new UIDs are in the UID list.
import pandas as pd

# Outer merge keeps UIDs that exist in only one of the two frames.
df_compare = pd.merge(left=df, right=df_new, how='outer', on='UID')
# Rows whose old and new emails match are unchanged, otherwise changed.
df_compare['Change Status'] = df_compare.apply(lambda x: 'No Change' if x.Email_x == x.Email_y else 'Change', axis=1)
# UIDs that were not in the original frame are new records.
df_compare.loc[~df_compare.UID.isin(df.UID), 'Change Status'] = 'New Record'
df_compare = df_compare.drop(columns=['Email_x']).rename(columns={'Email_y': 'Email'})
gives df_compare as:
  UID                Email Change Status
0 U01        u01@email.com     No Change
1 U02   newemail@email.com        Change
2 U03        u03@email.com     No Change
3 U04  newemail2@email.com        Change
4 U05        u05@email.com    New Record
5 U06        u06@email.com    New Record
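As a design note, pandas' merge can also flag where each row came from via indicator=True, which avoids the separate isin check for new records. A minimal sketch of that variant, using the same df and df_new:

```python
import pandas as pd

# indicator=True adds a '_merge' column with 'both', 'left_only' or 'right_only',
# so new records are simply the rows that exist only in the newer CSV.
merged = pd.merge(df, df_new, how='outer', on='UID', indicator=True)
merged['Change Status'] = 'No Change'
merged.loc[merged['Email_x'] != merged['Email_y'], 'Change Status'] = 'Change'
merged.loc[merged['_merge'] == 'right_only', 'Change Status'] = 'New Record'
result = merged.drop(columns=['Email_x', '_merge']).rename(columns={'Email_y': 'Email'})
```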

How to map 2 datasets to check if a value from Dataset_A is present in Dataset_B and create a new column in Dataset_A as 'Present or Not'?

I am working on two datasets in PySpark, let's say Dataset_A and Dataset_B. I want to check if the 'P/N' column in Dataset_A is present in the 'Assembly_P/N' column in Dataset_B, and then create a new column in Dataset_A titled 'Present or Not' with the value 'Present' or 'Not Present' depending on the search result.
PS. Both datasets are huge and I am trying to figure out an efficient solution without actually joining the tables.
Sample:
Dataset_A
| P/N |
| -------- |
| 1bc |
| 2df |
| 1cd |
Dataset_B
| Assembly_P/N |
| -------- |
| 1bc |
| 6gh |
| 2df |
Expected Result
Dataset_A
| P/N | Present or Not |
| -------- | -------- |
| 1bc | Present |
| 2df | Present |
| 1cd | Not Present |
from pyspark.sql.functions import udf
from pyspark.sql.functions import when, col, lit
from pyspark.sql.types import StringType

def check_value(PN):
    # Referencing dataset_B inside the UDF is what triggers the PicklingError:
    # Spark cannot serialize a DataFrame into the function shipped to the workers.
    if dataset_B(col("Assembly_P/N")).isNotNull().rlike("%PN%"):
        return 'Present'
    else:
        return 'Not Present'

check_value_udf = udf(check_value, StringType())
# Note: dataset_A.P/N is parsed as a division; a column name containing "/"
# has to be referenced as dataset_A["P/N"] or col("P/N").
dataset_A = dataset_A.withColumn('Present or Not', check_value_udf(dataset_A.P/N))
When I run this, I get a PicklingError.
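Even though the question hopes to avoid a join, one common way to express this kind of check without a Python UDF is a left join against the distinct part numbers, broadcast if Dataset_B's key column is small enough to fit in memory. A minimal sketch, assuming the DataFrames are named dataset_A and dataset_B as above:

```python
from pyspark.sql import functions as F

# Distinct assembly part numbers, renamed to match Dataset_A's key,
# plus a marker column to test after the join.
assembly = (dataset_B
            .select(F.col("Assembly_P/N").alias("P/N"))
            .distinct()
            .withColumn("_hit", F.lit(1)))

dataset_A = (dataset_A
             .join(F.broadcast(assembly), on="P/N", how="left")
             .withColumn("Present or Not",
                         F.when(F.col("_hit").isNotNull(), "Present")
                          .otherwise("Not Present"))
             .drop("_hit"))
```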

Creating new column from API lookup using groupby

I have a dataframe of weather data that looks like this:
+----+------------+----------+-----------+
| ID | Station_ID | Latitude | Longitude |
+----+------------+----------+-----------+
| 0  | 6010400    | 52.93    | -82.43    |
| 1  | 6010400    | 52.93    | -82.43    |
| 2  | 6010400    | 52.93    | -82.43    |
| 3  | 616I001    | 45.07    | -77.88    |
| 4  | 616I001    | 45.07    | -77.88    |
| 5  | 616I001    | 45.07    | -77.88    |
+----+------------+----------+-----------+
I want to create a new column called postal_code using an API lookup based on the latitude and longitude values. I cannot perform a lookup for each row in the dataframe as that would be inefficient, since there are over 500,000 rows and only 186 unique Station_IDs. It's also unfeasible due to rate limiting on the API I need to use.
I believe I need to perform a groupby transform but can't quite figure out how to get it to work correctly.
Any help with this would be greatly appreciated.
I believe you can use groupby only for aggregations, which is not what you want here.
First, combine 'Latitude' and 'Longitude' into a new column of tuples:
df['coordinates'] = list(zip(df['Latitude'],df['Longitude']))
Then use this 'coordinates' column to build the set of all unique (Latitude, Longitude) pairs, so it contains no duplicates:
set(list(df['coordinates']))
Then fetch the postal codes for these coordinates via API calls, as you said, and store them in a dict keyed by coordinate pair.
You can then use this dict to populate the postal code for each row:
postal_code_dict = {'key':'value'} #sample dictionary
df['postal_code'] = df['coordinates'].apply(lambda x: postal_code_dict[x])
Hope this helps.
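Putting those steps together, a minimal sketch of the whole flow; lookup_postal_code is a hypothetical stand-in for whatever API call you end up using:

```python
def lookup_postal_code(coords):
    # Hypothetical wrapper: takes a (lat, lon) tuple and returns a postal code
    # string from your geocoding API (subject to its rate limits).
    lat, lon = coords
    ...  # call the geocoding API here

df['coordinates'] = list(zip(df['Latitude'], df['Longitude']))

# One API call per unique coordinate pair (186 stations), not per row (500,000+).
unique_coords = set(df['coordinates'])
postal_code_dict = {c: lookup_postal_code(c) for c in unique_coords}

df['postal_code'] = df['coordinates'].map(postal_code_dict)
```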

Filter SQL elements with adjacent ID

I don't really know how to properly state this question in the title.
Suppose I have a table Word like the following:
| id | text |
| --- | --- |
| 0 | Hello |
| 1 | Adam |
| 2 | Hello |
| 3 | Max |
| 4 | foo |
| 5 | bar |
Is it possible to query this table based on text and receive the objects whose primary key (id) is exactly one off?
So, if I do
Word.objects.filter(text='Hello')
I get a QuerySet containing the rows
| id | text |
| --- | --- |
| 0 | Hello |
| 2 | Hello |
but I want the rows
| id | text |
| --- | --- |
| 1 | Adam |
| 3 | Max |
I guess I could do
word_ids = Word.objects.filter(text='Hello').values_list('id', flat=True)
word_ids = [w_id + 1 for w_id in word_ids] # or use a numpy array for this
Word.objects.filter(id__in=word_ids)
but that doesn't seem overly efficient. Is there a straight SQL way to do this in one call? Preferably directly using Django's QuerySets?
EDIT: The idea is that in fact I want to filter those words that are in the second QuerySet. Something like:
Word.objects.filter(text__of__previous__word='Hello', text='Max')
In plain Postgres you could use the lag window function (https://www.postgresql.org/docs/current/static/functions-window.html)
-- here "test"/"name" stand in for the question's "Word" table and "text" column
SELECT
  id,
  name
FROM (
  SELECT
    *,
    lag(name) OVER (ORDER BY id) as prev_name
  FROM test
) s
WHERE prev_name = 'Hello'
The lag function adds a column with the text of the previous row. So you can filter by this text in a subquery.
demo:db<>fiddle
I am not really into Django, but according to the documentation, support for window functions was added in version 2.0.
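For what it's worth, a minimal sketch of that approach with Django's ORM, assuming the Word model from the question; filtering directly on a window expression like this needs Django 4.2 or newer, so on older versions you would filter the annotated rows in Python or fall back to raw SQL:

```python
from django.db.models import F, Window
from django.db.models.functions import Lag

# Annotate each word with the text of the previous row (ordered by id),
# then keep only the rows whose predecessor is 'Hello'.
qs = (
    Word.objects
    .annotate(prev_text=Window(expression=Lag('text'), order_by=F('id').asc()))
    .filter(prev_text='Hello')
)
```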
If by "1 off" you mean that the difference is exactly 1, then you can do:
select w.*
from words w
where w.id in (select w2.id + 1 from words w2 where w2.text = 'Hello');
lag() is also a very reasonable solution. This seems like a direct interpretation of your question. If you have gaps (and the intention is + 1), then lag() is a bit trickier.

Identify duplicate values in dictionary and print in a table

I have a dictionary (d) where every key can have multiple values (appended as a list).
For example, the dictionary has the following two key/value pairs, where one has duplicate values and the other doesn't:
SPECIFIC-THREATS , ['5 SPECIFIC-THREATS Microsoft Windows print spooler little endian DoS attempt',
                    '4 SPECIFIC-THREATS obfuscated RealPlayer Ierpplug.dll ActiveX exploit attempt',
                    '4 SPECIFIC-THREATS obfuscated RealPlayer Ierpplug.dll ActiveX exploit attempt']
and
TELNET , ['1 TELNET bsd exploit client finishing']
I want to go through the whole dictionary, check whether any key has duplicate values, and then print the results in a table whose columns include the key, the number of duplicate values, the value that appears multiple times, etc.
Here is what I have so far:
import texttable
import collections

def dupechecker():
    t = texttable.Texttable()
    for key, value in d.iteritems():  # Python 2 dict iteration; use d.items() on Python 3
        for x, y in collections.Counter(value).items():
            if y > 1:
                t.add_rows([["Category", "Number of dupe values", "Value which appears multiple times"], [key, y, x]])
                print t.draw()
It works, but keys which do not have any duplicate values (i.e. TELNET in this case) won't appear in the table output, since the table is printed inside the if statement. This is what I am getting:
+-------------------------+-------------------------+-------------------------+
| Category | Number of dupe values | Value which appears |
| | | multiple times |
+=========================+=========================+=========================+
| SPECIFIC-THREATS | 2 | 4 SPECIFIC-THREATS |
| | | obfuscated RealPlayer |
| | | Ierpplug.dll ActiveX |
| | | exploit attempt |
+-------------------------+-------------------------+-------------------------+
Is there any way to keep track of the interesting parameters (number of duplicate values and the value that appears multiple times) for each key and then print them together? I want the output to be like:
+-------------------------+-------------------------+-------------------------+
| Category | Number of dupe values | Value which appears |
| | | multiple times |
+=========================+=========================+=========================+
| SPECIFIC-THREATS | 2 | 4 SPECIFIC-THREATS |
| | | obfuscated RealPlayer |
| | | Ierpplug.dll ActiveX |
| | | exploit attempt |
+-------------------------+-------------------------+-------------------------+
| TELNET | 0 | |
| | | |
| | | |
| | | |
+-------------------------+-------------------------+-------------------------+
UPDATE
Resolved
Just change your dupechecker to also add rows for "non-duplicates" (but only once per category), add the header before the loop, and print the table when you are done.
def dupechecker():
    t = texttable.Texttable()
    t.header(["Category", "Number of dupe values", "Value which appears multiple times"])
    for key, value in d.iteritems():
        has_dupe = False
        for x, y in collections.Counter(value).items():
            if y > 1:
                has_dupe = True
                t.add_row([key, y, x])
        if not has_dupe:
            t.add_row([key, 0, ''])
    print t.draw()
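For reference, calling it with the sample data from the question (the function reads the module-level d, as in the original code) produces a table with one row per category, like the desired output above:

```python
d = {
    'SPECIFIC-THREATS': [
        '5 SPECIFIC-THREATS Microsoft Windows print spooler little endian DoS attempt',
        '4 SPECIFIC-THREATS obfuscated RealPlayer Ierpplug.dll ActiveX exploit attempt',
        '4 SPECIFIC-THREATS obfuscated RealPlayer Ierpplug.dll ActiveX exploit attempt',
    ],
    'TELNET': ['1 TELNET bsd exploit client finishing'],
}

dupechecker()
```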
