Identify duplicate values in dictionary and print in a table - python

I have a dictionary (d) where every key can have multiple values (appended as a list).
For example, the dictionary has the following two key/value pairs, where one has duplicate values while the other doesn't:
SPECIFIC-THREATS , ['5 SPECIFIC-THREATS Microsoft Windows print
spooler little endian DoS attempt', '4 SPECIFIC-THREATS obfuscated
RealPlayer Ierpplug.dll ActiveX exploit attempt', '4 SPECIFIC-THREATS
obfuscated RealPlayer Ierpplug.dll ActiveX exploit attempt']
and
TELNET , ['1 TELNET bsd exploit client finishing']
I want to go through the whole dictionary, check if any key has duplicate values, and then print the results in a table which has the key, the number of duplicate values, the value which appears multiple times, etc. as columns.
Here is what I have so far:
import texttable
import collections

def dupechecker():
    t = texttable.Texttable()
    for key, value in d.iteritems():
        for x, y in collections.Counter(value).items():
            if y > 1:
                t.add_rows([["Category", "Number of dupe values", "Value which appears multiple times"], [key, y, x]])
                print t.draw()
It works, but keys which do not have any duplicate values (i.e. TELNET in this case) won't appear in the table output (since the table is printed inside the if statement). This is what I am getting:
+-------------------------+-------------------------+-------------------------+
| Category | Number of dupe values | Value which appears |
| | | multiple times |
+=========================+=========================+=========================+
| SPECIFIC-THREATS | 2 | 4 SPECIFIC-THREATS |
| | | obfuscated RealPlayer |
| | | Ierpplug.dll ActiveX |
| | | exploit attempt |
+-------------------------+-------------------------+-------------------------+
Is there any way I can keep track of the interesting parameters (the number of duplicate values and the value which appears multiple times) for each key and then print them all together? I want the output to be like:
+-------------------------+-------------------------+-------------------------+
| Category | Number of dupe values | Value which appears |
| | | multiple times |
+=========================+=========================+=========================+
| SPECIFIC-THREATS | 2 | 4 SPECIFIC-THREATS |
| | | obfuscated RealPlayer |
| | | Ierpplug.dll ActiveX |
| | | exploit attempt |
+-------------------------+-------------------------+-------------------------+
| TELNET | 0 | |
| | | |
| | | |
| | | |
+-------------------------+-------------------------+-------------------------+
UPDATE
Resolved

Just change your dupechecker to also add rows for "non-duplicates" (but only once per category), add the header before the loop, and print the table when you are done.
def dupechecker():
    t = texttable.Texttable()
    t.header(["Category", "Number of dupe values", "Value which appears multiple times"])
    for key, value in d.iteritems():
        has_dupe = False
        for x, y in collections.Counter(value).items():
            if y > 1:
                has_dupe = True
                t.add_row([key, y, x])
        if not has_dupe:
            t.add_row([key, 0, ''])
    print t.draw()
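If you are on Python 3, dict.iteritems() is gone and print is a function; here is a minimal sketch of the same fix, assuming the same d, texttable, and collections as above:

import collections
import texttable

def dupechecker(d):
    t = texttable.Texttable()
    t.header(["Category", "Number of dupe values", "Value which appears multiple times"])
    for key, values in d.items():  # items() replaces iteritems() on Python 3
        has_dupe = False
        for value, count in collections.Counter(values).items():
            if count > 1:
                has_dupe = True
                t.add_row([key, count, value])
        if not has_dupe:
            # still emit one row per category without duplicates
            t.add_row([key, 0, ''])
    print(t.draw())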

Related

Update a column with respect to values in other columns via regex match in DataFrames

I have a DataFrame with more than 1,000,000 rows and 15 columns.
I have to create new columns and assign them values based on the string values in the other columns, matching those either with a regex or an exact character match.
For example, say there is a column called File path. I have to create a feature column whose values are assigned by taking a folder path (full or partial), matching it against the file path, and updating the feature column.
I thought about iterating with a for loop, but it is very time-consuming, and with pandas I think iterating would take even longer if the number of looping components grows in the future.
Is there an efficient way to do this type of operation in pandas?
Please help me with this.
Example:
I have a df as:
| ID | File     |
| -- | -------- |
| 1  | SWE_Toot |
| 2  | SWE_Thun |
| 3  | IDH_Toet |
| 4  | SDF_Then |
| 5  | SWE_Toot |
| 6  | SWE_Thun |
| 7  | SEH_Toot |
| 8  | SFD_Thun |
I will get the components in other tables, like:
| ID       | File           |
| -------- | -------------- |
| Software | */SWE_Toot/*.h |
|          | */IDH_Toet/*.c |
|          | */SFD_Toto/*.c |
and a second one as:
| ID   | File           |
| ---- | -------------- |
| Wire | */SDF_Then/*.h |
|      | */SFD_Thun/*.c |
|      | */SFD_Toto/*.c |
etc.; in total there will be around 1,000,000 files, and 278 components are received.
I want:
| ID | File     | Component |
| -- | -------- | --------- |
| 1  | SWE_Toot | Software  |
| 2  | SWE_Thun | Other     |
| 3  | IDH_Toet | Software  |
| 4  | SDF_Then | Wire      |
| 5  | SWE_Toto | Various   |
| 6  | SWE_Thun | Other     |
| 7  | SEH_Toto | Various   |
| 8  | SFD_Thun | Wire      |
Other - filled in last, once all the fields and regexes have been checked and the file does not belong to any component.
Various - it may belong to more than one component (or we can give a list of the components it belongs to).
I was able to read the component tables and create the regexes, but to create the component column I would have to write for loops over all 278 components and loop over the same table once per component.
Is there an easier way to do this with pandas? The data will be very large.
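A minimal vectorized sketch of one way to do this, assuming the component tables have been collected into a dict mapping each component name to its list of patterns (the dict shape and names here are assumptions, with the patterns reduced to the folder names from the example):

import numpy as np
import pandas as pd

df = pd.DataFrame({"ID": range(1, 9),
                   "File": ["SWE_Toot", "SWE_Thun", "IDH_Toet", "SDF_Then",
                            "SWE_Toot", "SWE_Thun", "SEH_Toot", "SFD_Thun"]})

components = {
    "Software": ["SWE_Toot", "IDH_Toet", "SFD_Toto"],
    "Wire": ["SDF_Then", "SFD_Thun", "SFD_Toto"],
}

# One vectorized str.contains pass per component (278 passes) instead of a
# Python-level loop over 1,000,000 rows.
matches = pd.DataFrame({
    name: df["File"].str.contains("|".join(patterns), regex=True)
    for name, patterns in components.items()
})

# More than one match -> "Various"; exactly one -> that component's name
# (idxmax returns the matching column); no match at all -> "Other".
hits = matches.sum(axis=1)
df["Component"] = np.select(
    [hits > 1, hits == 1],
    ["Various", matches.idxmax(axis=1)],
    default="Other",
)
print(df)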

Delete duplicates between two rows in Tableau

How can I delete duplicates between two values and keep only the first value, for each user id, in Tableau?
For example, for a certain user:
| status  | date     |
| ------- | -------- |
| success | 1/1/2022 |
| fail    | 1/2/2022 |
| fail    | 1/3/2022 |
| fail    | 1/4/2022 |
| success | 1/5/2022 |
I want the results to be:
| status  | date     |
| ------- | -------- |
| success | 1/1/2022 |
| fail    | 1/2/2022 |
| success | 1/5/2022 |
In Python it would be like this:
edited_data = []
for key in d:
    dup = [True]
    total_len = len(d[key].index)
    for i in range(1, total_len):
        if d[key].iloc[i]['status'] == d[key].iloc[i-1]['status']:
            dup.append(False)
        else:
            dup.append(True)
    edited_data.append(d[key][dup])
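For reference, the same keep-the-first-of-each-run filter can be written without the inner loop by comparing each status to the previous row with shift() (a sketch assuming, as above, that d maps each user id to a DataFrame with a status column):

import pandas as pd

edited_data = []
for key in d:
    user_df = d[key]
    # True where the status differs from the row before it (the first row
    # compares against NaN and is therefore always kept).
    keep = user_df['status'] != user_df['status'].shift()
    edited_data.append(user_df[keep])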
One way you could do this is with the LOOKUP() function. Since this particular problem requires each row to know what came before it, it will be important to make sure your dates are sorted correctly and that the table calculation is computed correctly. Something like this should work:
IF LOOKUP(MIN([Status]),-1) = MIN([Status]) THEN "Hide" ELSE "Show" END
And then simply hide or exclude the "Hide" rows.

Pandas: Replace list values with a string of values from another DataFrame

I did my best to try to find any answer here or google without success.
I'm trying to replace a list of IDs inside a cell with a ", ".join of values from another DataFrame which contains the "id" and "name" of each element.
| id   | setting | queues             |
| ---- | ------- | ------------------ |
| 1ade | A       | ['asdf']           |
| 2ade | B       |                    |
| 3cfg | C       | ['asdf', 'qwerty'] |

| id     | name  |
| ------ | ----- |
| asdf   | 'Foo' |
| qwerty | 'Bar' |

Result:
| id   | setting | queues   |
| ---- | ------- | -------- |
| 1ade | A       | Foo      |
| 2ade | B       |          |
| 3cfg | C       | Foo, Bar |
I'm losing my mind because I tried with merge, replace and lambda. For example using this:
merged["queues"] = merged["queues"].apply(lambda q: ", ".join(pd.merge(pd.DataFrame(data=list(q)), queues, right_on="id")["name"]))
Any answer will be appreciated because I am losing my mind.
First, if possible, replace any non-list values with empty lists, then convert the second DataFrame to a dictionary and look each id up in the dict, filtering with an if:
merged["queues"] = merged["queues"].apply(lambda x: x if isinstance(x, list) else [])
d = df2.set_index('id')['name'].to_dict()
merged["queues"] = merged["queues"].apply(lambda x: ",".join(d[y] for y in x if y in d))
print (merged)
id setting queues
0 1ade A Foo
1 2ade B
2 3cfg C Foo,Bar
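Note that the desired result in the question joins with ", " (comma and space) while the snippet above joins with a bare comma; to match the "Foo, Bar" output exactly, join with ", " instead:

merged["queues"] = merged["queues"].apply(lambda x: ", ".join(d[y] for y in x if y in d))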

Filter SQL elements with adjacent ID

I don't really know how to properly state this question in the title.
Suppose I have a table Word like the following:
| id | text |
| --- | --- |
| 0 | Hello |
| 1 | Adam |
| 2 | Hello |
| 3 | Max |
| 4 | foo |
| 5 | bar |
Is it possible to query this table based on text and receive the objects whose primary key (id) is exactly one off?
So, if I do
Word.objects.filter(text='Hello')
I get a QuerySet containing the rows
| id | text |
| --- | --- |
| 0 | Hello |
| 2 | Hello |
but I want the rows
| id | text |
| --- | --- |
| 1 | Adam |
| 3 | Max |
I guess I could do
word_ids = Word.objects.filter(text='Hello').values_list('id', flat=True)
word_ids = [w_id + 1 for w_id in word_ids] # or use a numpy array for this
Word.objects.filter(id__in=word_ids)
but that doesn't seem overly efficient. Is there a straight SQL way to do this in one call? Preferably directly using Django's QuerySets?
EDIT: The idea is that in fact I want to filter those words that are in the second QuerySet. Something like:
Word.objects.filter(text__of__previous__word='Hello', text='Max')
In plain Postgres you could use the lag window function (https://www.postgresql.org/docs/current/static/functions-window.html)
SELECT
    id,
    name
FROM (
    SELECT
        *,
        lag(name) OVER (ORDER BY id) AS prev_name
    FROM test
) s
WHERE prev_name = 'Hello'
The lag function adds a column with the text of the previous row. So you can filter by this text in a subquery.
I am not really into Django, but according to the documentation, window function support was added in version 2.0.
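For completeness, a sketch of the same lag() idea in the Django ORM (2.0+), using the Word model from the question:

from django.db.models import F, Window
from django.db.models.functions import Lag

words = Word.objects.annotate(
    prev_text=Window(expression=Lag('text'), order_by=F('id').asc())
)
# Filtering directly on a window annotation is only supported in recent
# Django versions, so filter in Python here as a conservative fallback:
result = [w for w in words if w.prev_text == 'Hello']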
If by "1 off" you mean that the difference is exactly 1, then you can do:
select w.*
from words w
where w.id in (select w2.id + 1 from words w2 where w2.text = 'Hello');
lag() is also a very reasonable solution. This seems like a direct interpretation of your question. If you have gaps (and the intention is + 1), then lag() is a bit trickier.

How to group columns alphabetically with tabulate module?

I am using the tabulate module to print information nicely to the console. I am using Python 2.6.
I currently have this:
+---------+----------+----------+
| Task    | Status   | Rating   |
|---------+----------+----------|
| A       | Done     | Good     |
| B       | Done     | Bad      |
| C       | Pending  |          |
| D       | Done     | Good     |
+---------+----------+----------+
I want to go to this:
+---------+----------+----------+
| Task    | Status   | Rating   |
|---------+----------+----------|
| A       | Done     | Good     |
| B       | Done     | Bad      |
| D       | Done     | Good     |
| C       | Pending  |          |
+---------+----------+----------+
So that all of the Dones are grouped together.
Currently tabulate receives a dictionary and I unpack the values like this:
def generate_table(data):
    table = []
    headers = ['Task', 'Status', 'Rating']
    for key, value in data.iteritems():
        print key, value
        if 'Rating' in value:
            m, l = value['Status'], value['Rating']
        else:
            m, l = value['Status'], None
        m = m.split('/')[-1]
        table.append([key, m, l])
    print tabulate(table, headers, tablefmt="psql")
You can sort your resulting table in place by the Status column after your for loop, before calling tabulate:
table.sort(key=lambda row: row[1])
This will effectively "group" the values alphabetically.
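If you also want a deterministic order inside each Status group, you can sort on a tuple (a small variation, not part of the original answer):

# Sort by Status first, then by Task, so rows within a group are ordered too.
table.sort(key=lambda row: (row[1], row[0]))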
