Detect Changes In Two Or More CSVs Using Pandas - python

I am trying to use Pandas to detect changes across two CSVs. I would like it ideally to highlight which UIDs have been changed. I've attached an example of the ideal output here.
CSV 1 (imported as DataFrame):
| UID | Email |
| -------- | --------------- |
| U01 | u01#email.com |
| U02 | u02#email.com |
| U03 | u03#email.com |
| U04 | u04#email.com |
CSV 2 (imported as DataFrame):
| UID | Email |
| -------- | --------------- |
| U01 | u01#email.com |
| U02 | newemail#email.com |
| U03 | u03#email.com |
| U04 | newemail2#email.com |
| U05 | u05#email.com |
| U06 | u06#email.com |
Over the two CSVs, U02 and U04 saw email changes, whereas U05 and U06 were new records entirely.
I have tried using the pandas compare function, and unfortunately it doesn't work because CSV2 has more records than CSV1.
I have since concatenated the UID and Email fields, like so, and then created a new field called "Unique" that shows (as True or False) whether the concatenated value is a duplicate, but this still doesn't show whether a record is entirely new:
df3['Concatenated'] = df3["UID"] +"~"+ df3["Email"]
df3['Unique'] = ~df3['Concatenated'].duplicated(keep=False)
This works to an extent, but it feels clunky, and I was wondering if anyone had a smarter way of doing this - especially when it comes into showing whether the record is new or not.

The strategy here is to merge the two DataFrames on UID with an outer join, compare the two email columns, and finally flag any UID that does not appear in the first CSV as a new record.
df_compare = pd.merge(left=df, right=df_new, how='outer', on='UID')
# Email_x comes from the first CSV, Email_y from the second; rows where they differ are changes
df_compare['Change Status'] = df_compare.apply(lambda x: 'No Change' if x.Email_x == x.Email_y else 'Change', axis=1)
# UIDs that are missing from the first CSV are new records
df_compare.loc[~df_compare.UID.isin(df.UID), 'Change Status'] = 'New Record'
df_compare = df_compare.drop(columns=['Email_x']).rename(columns={'Email_y': 'Email'})
gives df_compare as:
UID Email Change Status
0 U01 u01#email.com No Change
1 U02 newemail#email.com Change
2 U03 u03#email.com No Change
3 U04 newemail2#email.com Change
4 U05 u05#email.com New Record
5 U06 u06#email.com New Record
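For reference, here is a self-contained version of the same approach with the example data built inline; the DataFrame names df and df_new, and the swap of the apply lambda for a plain vectorised comparison, are my own choices rather than part of the original answer.
import pandas as pd

# Rebuild the two example CSVs as DataFrames
df = pd.DataFrame({'UID': ['U01', 'U02', 'U03', 'U04'],
                   'Email': ['u01#email.com', 'u02#email.com', 'u03#email.com', 'u04#email.com']})
df_new = pd.DataFrame({'UID': ['U01', 'U02', 'U03', 'U04', 'U05', 'U06'],
                       'Email': ['u01#email.com', 'newemail#email.com', 'u03#email.com',
                                 'newemail2#email.com', 'u05#email.com', 'u06#email.com']})

# Outer merge keeps every UID from both CSVs
df_compare = pd.merge(left=df, right=df_new, how='outer', on='UID')

# Rows whose old and new emails differ are changes
df_compare['Change Status'] = 'No Change'
df_compare.loc[df_compare['Email_x'] != df_compare['Email_y'], 'Change Status'] = 'Change'

# UIDs absent from the first CSV are new records
df_compare.loc[~df_compare['UID'].isin(df['UID']), 'Change Status'] = 'New Record'

df_compare = df_compare.drop(columns=['Email_x']).rename(columns={'Email_y': 'Email'})
print(df_compare)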

Related

How to map 2 dataset to check if a value from Dataset_A is present in Dataset_B and create a new column in Dataset_A as 'Present or Not'?

I am working on 2 datasets in PySpark, let's say Dataset_A and Dataset_B. I want to check if the 'P/N' column in Dataset_A is present in the 'Assembly_P/N' column in Dataset_B. Then I need to create a new column in Dataset_A titled 'Present or Not' with the values 'Present' or 'Not Present' depending on the search result.
P.S. Both datasets are huge and I am trying to figure out an efficient solution to do this without actually joining the tables.
Sample:
Dataset_A
| P/N |
| -------- |
| 1bc |
| 2df |
| 1cd |
Dataset_B
| Assembly_P/N |
| -------- |
| 1bc |
| 6gh |
| 2df |
Expected Result
Dataset_A
| P/N | Present or Not |
| -------- | -------- |
| 1bc | Present |
| 2df | Present |
| 1cd | Not Present |
from pyspark.sql.functions import udf
from pyspark.sql.functions import when, col, lit

def check_value(PN):
    if dataset_B(col("Assembly_P/N")).isNotNull().rlike("%PN%"):
        return 'Present'
    else:
        return 'Not Present'

check_value_udf = udf(check_value, StringType())
dataset_A = dataset_A.withColumn('Present or Not', check_value_udf(dataset_A.P/N))
I am getting PicklingError
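The PicklingError happens because the UDF references dataset_B, and a UDF cannot look up another DataFrame row by row. As a hedged sketch (assuming dataset_A and dataset_B are the DataFrames described above), the usual alternative is a left join against the distinct Assembly_P/N values, which Spark can broadcast when they fit in memory:
from pyspark.sql import functions as F

# Distinct assembly part numbers from Dataset_B, flagged as present
matches = (dataset_B
           .select(F.col("Assembly_P/N").alias("P/N"))
           .distinct()
           .withColumn("Present or Not", F.lit("Present")))

# Left join back onto Dataset_A; rows with no match get the "Not Present" label
dataset_A = (dataset_A
             .join(F.broadcast(matches), on="P/N", how="left")
             .fillna({"Present or Not": "Not Present"}))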

delete duplicates between two rows Tableau

How do I delete duplicates between two values and keep only the first value in Tableau, for each user ID?
For example, for a certain user:
| status | date |
| -------- | -------------- |
| success| 1/1/2022|
| fail| 1/2/2022|
| fail| 1/3/2022|
| fail| 1/4/2022|
| success| 1/5/2022|
i want the results to be :
| status | date |
| -------- | -------------- |
| success| 1/1/2022|
| fail| 1/2/2022|
| success| 1/5/2022|
In Python it would be like this:
edited_data = []
for key in d:
    dup = [True]
    total_len = len(d[key].index)
    for i in range(1, total_len):
        if d[key].iloc[i]['status'] == d[key].iloc[i-1]['status']:
            dup.append(False)
        else:
            dup.append(True)
    edited_data.append(d[key][dup])
One way you could do this is with the LOOKUP() function. Since this particular problem requires each row to know what came before it, it will be important to make sure your dates are sorted correctly and that the table calculation is computed correctly. Something like this should work:
IF LOOKUP(MIN([Status]),-1) = MIN([Status]) THEN "Hide" ELSE "Show" END
And then simply hide or exclude the "Hide" rows.
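As a side note, the Python loop in the question does the same previous-row comparison that LOOKUP(..., -1) performs, and it can be vectorised with pandas shift(); this sketch assumes d is the dict of per-user DataFrames from the question, already ordered by date:
edited_data = []
for key in d:
    frame = d[key]
    # Keep a row only when its status differs from the previous row;
    # the first row compares against NaN and is therefore always kept
    keep = frame['status'] != frame['status'].shift()
    edited_data.append(frame[keep])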

split JSON/list of dictionaries in the column in dataframe to new rows in python

I am quite new to Python. I tried to find an answer, but nothing I tried seems to work, and most of the answers assume the whole dataset is in JSON format.
Through pyodbc I use the following code to retrieve the data:
formula = """select id, type, custbody_attachment_1 from transaction """
lineitem = pd.read_sql_query(formula, cnxn)
It gives me something like the following
| Internal_ID | Type | Formula_Text |
| -------- | -------- | -------- |
| 2895531 | Bill | |
| 3492009 | Bill | [{"FL":"https://.app.netsuite.com/core/media/media.nl?id=someLinkToTheFile0","NM":"someFileName0"}] |
| 3529162 | Bill | [{"FL":"5https://.app.netsuite.com/core/media/media.nl?id=someLinkToTheFile1","NM":"someFileName1"},{"FL":"https://.app.netsuite.com/core/media/media.nl?id=someLinkToTheFile2","NM":"someFileName2"}] |
I need the output like this. (There might be more than 2 links in the cell.)
| Internal_ID | Type | FL | NM |
| -------- | -------- | -------- | -------- |
| 2895531 | Bill | | |
| 3492009 | Bill | https://.app.netsuite.com/core/media/media.nl?id=someLinkToTheFile0 | someFileName0 |
| 3529162 | Bill | https://.app.netsuite.com/core/media/media.nl?id=someLinkToTheFile1 | someFileName1 |
| 3529162 | Bill | https://.app.netsuite.com/core/media/media.nl?id=someLinkToTheFile2 | someFileName2 |
I tried to play with JSON, but there was one problem after another (because it looked like JSON data to me). In the end I ran
print(lineitem['custbody_attachment_1'])
and got the following in Python console
999 [{"FL":"https://4811553.app.netsuite.com/core/...
Name: custbody_attachment_1, Length: 1000, dtype: object
So, I have no idea how to transform this so I could create new rows
df = df.explode('Formula_Text')  # one row per dict in the list
df = pd.concat([df.drop(['Formula_Text'], axis=1), df['Formula_Text'].apply(pd.Series)], axis=1)  # expand each dict into FL / NM columns
print(df)
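A fuller sketch of the same idea, under the assumption that custbody_attachment_1 comes back from pyodbc as a raw JSON string (or an empty cell) and therefore has to be parsed before exploding; the Formula_Text/FL/NM names follow the question:
import json
import pandas as pd

df = lineitem.rename(columns={'custbody_attachment_1': 'Formula_Text'})

# Parse each JSON string into a list of dicts; empty cells become a single empty dict
df['Formula_Text'] = df['Formula_Text'].apply(
    lambda s: json.loads(s) if isinstance(s, str) and s.strip() else [{}]
)

df = df.explode('Formula_Text').reset_index(drop=True)  # one row per attachment dict
expanded = df['Formula_Text'].apply(pd.Series)          # expand dicts into FL / NM columns
df = pd.concat([df.drop(columns=['Formula_Text']), expanded], axis=1)
print(df)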

Filter SQL elements with adjacent ID

I don't really know how to properly state this question in the title.
Suppose I have a table Word like the following:
| id | text |
| --- | --- |
| 0 | Hello |
| 1 | Adam |
| 2 | Hello |
| 3 | Max |
| 4 | foo |
| 5 | bar |
Is it possible to query this table based on text and receive the objects whose primary key (id) is exactly one off?
So, if I do
Word.objects.filter(text='Hello')
I get a QuerySet containing the rows
| id | text |
| --- | --- |
| 0 | Hello |
| 2 | Hello |
but I want the rows
| id | text |
| --- | --- |
| 1 | Adam |
| 3 | Max |
I guess I could do
word_ids = Word.objects.filter(text='Hello').values_list('id', flat=True)
word_ids = [w_id + 1 for w_id in word_ids] # or use a numpy array for this
Word.objects.filter(id__in=word_ids)
but that doesn't seem overly efficient. Is there a straight SQL way to do this in one call? Preferably directly using Django's QuerySets?
EDIT: The idea is that in fact I want to filter those words that are in the second QuerySet. Something like:
Word.objects.filter(text__of__previous__word='Hello', text='Max')
In plain Postgres you could use the lag window function (https://www.postgresql.org/docs/current/static/functions-window.html)
SELECT
    id,
    text
FROM (
    SELECT
        *,
        lag(text) OVER (ORDER BY id) AS prev_text
    FROM word
) s
WHERE prev_text = 'Hello'
The lag function adds a column with the text of the previous row (computed in the subquery), so you can filter on that text in the outer query.
demo:db<>fiddle
I am not really into Django, but according to the documentation, support for window functions was added in version 2.0.
If by "1 off" you mean that the difference is exactly 1, then you can do:
select w.*
from words w
where w.id in (select w2.id + 1 from words w2 where w2.text = 'Hello');
lag() is also a very reasonable solution. This seems like a direct interpretation of your question. If you have gaps (and the intention is + 1), then lag() is a bit trickier.
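Since Django has shipped window functions since 2.0, here is a hedged ORM sketch of the same lag() idea; older Django versions cannot filter directly on a window annotation, so the final filter is done in Python here (recent versions allow filtering on the annotation itself):
from django.db.models import F, Window
from django.db.models.functions import Lag

# Annotate each Word with the text of the previous row (ordered by id), mirroring the SQL above
qs = Word.objects.annotate(
    prev_text=Window(expression=Lag('text'), order_by=F('id').asc())
)

# Filtering on window annotations is only supported in recent Django, so filter in Python
matches = [w for w in qs if w.prev_text == 'Hello']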

How to efficiently generate a special co-author network in python pandas?

I'm trying to generate a network graph of individual authors given a table of articles. The table I start with is of articles with a single column for the "lead author" and a single column for "co-author". Since each article can have up to 5 authors, article rows may repeat as such:
| paper_ID | project_name | lead_id | co_lead_id | published |
|----------+--------------+---------+------------+-----------|
| 1234 | "fubar" | 999 | 555 | yes |
| 1234 | "fubar" | 999 | 234 | yes |
| 1234 | "fubar" | 999 | 115 | yes |
| 2513 | "fubar2" | 765 | 369 | no |
| 2513 | "fubar2" | 765 | 372 | no |
| 5198 | "fubar3" | 369 | 325 | yes |
My end goal is to have a nodes table, where each row is a unique author, and an edge table, where each row contains source and target author_id columns. The edges table is trivial, as I can merely create a dataframe using the requisite columns of the article table.
For example, for the above table I would have the following node table:
| author_id | is_published |
|-----------+--------------|
| 999 | yes |
| 555 | yes |
| 234 | yes |
| 115 | yes |
| 765 | no |
| 369 | yes |
| 372 | no |
| 325 | yes |
Notice how the "is_published" shows if the author was ever a lead or co-author on at least one published paper. This is where I'm running into trouble creating a nodes table efficiently. Currently I iterate through every row in the article table and run checks on if an author exists yet in the nodes table and whether to turn on the "is_published" flag. See the following code snippet as an example:
articles = pd.read_excel('excel_file_with_articles_table')
nodes = pd.DataFrame(columns=['is_published'])
nodes.index.name = 'author_id'
for row in articles.itertuples():
    if row.lead_id not in nodes.index:
        author = pd.Series([False], index=["is_published"], name=row.lead_id)
        nodes = nodes.append(author)
    if row.co_lead_id not in nodes.index:
        investigator = pd.Series([False], index=["is_published"], name=row.co_lead_id)
        nodes = nodes.append(investigator)
    if row.published == "yes":
        nodes.at[row.lead_id, "is_published"] = True
        nodes.at[row.co_lead_id, "is_published"] = True
For my data set (with tens of thousands of rows), this is somewhat slow, and I understand that loops should be avoided when possible when using pandas dataframes. I feel like the pandas apply function may be able to do what I need, but I'm at a loss as to how to implement it.
With df as your first DataFrame, you should be able to:
nodes = pd.concat([
    df.loc[:, ['lead_id', 'is_published']].rename(columns={'lead_id': 'author_id'}),
    df.loc[:, ['co_lead_id', 'is_published']].rename(columns={'co_lead_id': 'author_id'}),
]).drop_duplicates()
for a unique list of lead and co-lead author_ids with their respective is_published information.
To only keep is_published=True if there is also a False entry:
nodes = nodes.sort_values('is_published', ascending=False).drop_duplicates(subset=['author_id'])
.sort_values() will sort True (==1) before False, and .drop_duplicates() by default keeps the first occurrence (see docs). With this addition I guess you don't really need the first .drop_duplicates() anymore.
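As a self-contained illustration, here is a minimal sketch on the sample table, under the assumption that the boolean flag still has to be derived from the yes/no published column; using groupby(...).any() instead of the sort/drop-duplicates trick is just an alternative way to keep True when an author has both values:
import pandas as pd

# Rebuild the sample article table from the question
articles = pd.DataFrame({
    'paper_ID':   [1234, 1234, 1234, 2513, 2513, 5198],
    'lead_id':    [999, 999, 999, 765, 765, 369],
    'co_lead_id': [555, 234, 115, 369, 372, 325],
    'published':  ['yes', 'yes', 'yes', 'no', 'no', 'yes'],
})
articles['is_published'] = articles['published'].eq('yes')

# Stack lead and co-lead ids into a single author_id column
nodes = pd.concat([
    articles[['lead_id', 'is_published']].rename(columns={'lead_id': 'author_id'}),
    articles[['co_lead_id', 'is_published']].rename(columns={'co_lead_id': 'author_id'}),
])

# An author counts as published if any of their papers is published
nodes = nodes.groupby('author_id', sort=False)['is_published'].any().reset_index()
print(nodes)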
