I have an array of arrays with the following format:
x = [["Username1","id3"],
["Username1", "id4"],
["Username1", "id4"],
["Username3", "id3"]]
I want to group by the ids and display all the unique usernames for each id.
How would I get output like this:
id3: Username1, Username3
id4: Username1
Edit: I was able to group by the second column, but I can't get it to display only unique values. Here is my code:
from itertools import groupby

data = {}
# sort by the same key used for grouping, otherwise groupby
# can emit the same id more than once
for key, group in groupby(sorted(x, key=lambda v: v[1]), key=lambda v: v[1]):
    data[key] = [v[0] for v in group]
print(data)
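One small change to that loop fixes the uniqueness problem: build each group's value from a set (a sketch based on the code above):
data = {}
for key, group in groupby(sorted(x, key=lambda v: v[1]), key=lambda v: v[1]):
    # the set comprehension drops repeated usernames; sorted() gives stable output
    data[key] = sorted({v[0] for v in group})
print(data)  # {'id3': ['Username1', 'Username3'], 'id4': ['Username1']}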
Use a dict keyed by id and Python sets as the values, so only unique names are stored for each key:
items = [
    ["Username1", "id3"],
    ["Username1", "id4"],
    ["Username1", "id4"],
    ["Username3", "id3"],
]

data = {}
for item in items:
    if item[1] in data:  # dict.has_key() was removed in Python 3
        data[item[1]].add(item[0])
    else:
        data[item[1]] = {item[0]}
print(data)  # {'id3': {'Username1', 'Username3'}, 'id4': {'Username1'}}
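Set iteration order is arbitrary, so to get the exact display asked for in the question, sort the names when printing:
for key in sorted(data):
    print(f"{key}: {', '.join(sorted(data[key]))}")
# id3: Username1, Username3
# id4: Username1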
You could use a plain for loop, but a LINQ-style grouping might be cleaner for future use.
https://stackoverflow.com/a/3926105/4564614
has some great ways to incorporate LINQ-style idioms in Python to solve this issue. I think what you are looking for is grouping.
Example:
from collections import defaultdict
from operator import attrgetter

def group_by(iterable, group_func):
    groups = defaultdict(list)
    for item in iterable:
        groups[group_func(item)].append(item)
    return groups

group_by((x.foo for x in ...), attrgetter('bar'))
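Applied to the username/id pairs from the question (a sketch; itemgetter(1) selects the id as the group key):
from operator import itemgetter

groups = group_by(x, itemgetter(1))
unique_names = {key: sorted({name for name, _ in pairs})
                for key, pairs in groups.items()}
print(unique_names)  # {'id3': ['Username1', 'Username3'], 'id4': ['Username1']}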
I have a bunch of files with names as follows:
tif_files = av_v5_1983_001.tif, av_v5_1983_002.tif, av_v5_1983_003.tif...av_v5_1984_001.tif, av_v5_1984_002.tif...av_v5_2021_001.tif, av_v5_2021_002.tif
However, they are not guaranteed to be in any sort of order.
I want to sort them based on names such that files from the same year are sorted together. When I do this
sorted(tif_files, key=lambda x:x.split('_')[-1][:-4])
I get the following result:
av_v5_1983_001.tif, av_v5_1984_001.tif, av_v5_1985_001.tif...av_v5_2021_001.tif
but I want this:
av_v5_1983_001.tif, av_v5_1983_002.tif, av_v5_1983_003.tif...av_v5_1984_001.tif, av_v5_1984_002.tif...av_v5_2021_001.tif, av_v5_2021_002.tif
Take the last two pieces using [2:], for example ['1984', '001.tif']; lists compare element-wise, so the year is compared first and the index second:
tif_files = 'av_v5_1983_001.tif', 'av_v5_1983_002.tif', 'av_v5_1983_003.tif',\
'av_v5_1984_001.tif', 'av_v5_1984_002.tif', 'av_v5_2021_001.tif', 'av_v5_2021_002.tif'
sorted(tif_files, key=lambda x: x.split('_')[2:])
# ['av_v5_1983_001.tif',
# 'av_v5_1983_002.tif',
# 'av_v5_1983_003.tif',
# 'av_v5_1984_001.tif',
# 'av_v5_1984_002.tif',
# 'av_v5_2021_001.tif',
# 'av_v5_2021_002.tif']
If you have v1, v2, ... v5 and so on, you also need to consider the version number, like below:
tif_files = ['av_v1_1983_001.tif', 'av_v5_1983_002.tif', 'av_v6_1983_002.tif','av_v5_1984_001.tif', 'av_v5_1984_002.tif', 'av_v4_2021_001.tif','av_v5_2021_001.tif', 'av_v5_2021_002.tif', 'av_v4_1984_002.tif']
sorted(tif_files, key=lambda x: [x.split('_')[2:], x.split('_')[1]])
Output:
['av_v1_1983_001.tif',
'av_v5_1983_002.tif',
'av_v6_1983_002.tif',
'av_v5_1984_001.tif',
'av_v4_1984_002.tif',
'av_v5_1984_002.tif',
'av_v4_2021_001.tif',
'av_v5_2021_001.tif',
'av_v5_2021_002.tif']
What you did was sort only by the 00x index, since x.split('_')[-1] produces 001 and so on. Because Python's sort is stable, you can chain two passes: sort by the secondary key (the index) first, then sort again by the primary key (the year). Remember that sorted() returns a new list:
tif_files = sorted(tif_files, key=lambda x: x.split('_')[-1][:-4])  # secondary key: index
tif_files = sorted(tif_files, key=lambda x: x.split('_')[2])        # primary key: year
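With a tuple key you can also do it in a single pass, equivalent to the two chained sorts:
tif_files = sorted(tif_files, key=lambda x: (x.split('_')[2], x.split('_')[-1][:-4]))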
As long as your naming convention remains consistent, you should be able to just sort them alphanumerically. As such, the below code should work:
sorted(tif_files)
If you instead wanted to sort by the last two numbers in the file name while ignoring the prefix, you would need something a bit more dramatic that would break those numbers out and let you order by them. You could use something like the below:
import pandas as pd
tif_files_list = [[xx, int(xx.split("_")[2]), int(xx.split("_")[3])] for xx in tif_files]
tif_files_frame = pd.DataFrame(tif_files_list, columns=["Name", "Primary Index", "Secondary Index"])
tif_files_frame_ordered = tif_files_frame.sort_values(["Primary Index", "Secondary Index"], axis=0)
tif_files_ordered = tif_files_frame_ordered["Name"].tolist()
This breaks the numbers in the names out into separate columns of a Pandas Dataframe, then sorts your entries by those broken out columns, at which point you can extract the ordered name column on its own.
If the key function returns a tuple of two values, sort compares the first value, then the second.
Please refer to: https://stackoverflow.com/a/5292332/9532450
tif_files = [
"hea_der_1983_002.tif",
"hea_der_1983_001.tif",
"hea_der_1984_002.tif",
"hea_der_1984_001.tif",
]
def parse(filename: str) -> tuple[str, str]:
    split = filename.split("_")
    return split[2], split[3]
sort = sorted(tif_files, key=parse)
print(sort)
Output:
['hea_der_1983_001.tif', 'hea_der_1983_002.tif', 'hea_der_1984_001.tif', 'hea_der_1984_002.tif']
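If the zero padding ever varies, comparing integers instead of strings is safer. A sketch under that assumption (parse_numeric is a hypothetical variant; positions 2 and 3 are assumed to hold the year and index):
def parse_numeric(filename: str) -> tuple[int, int]:
    split = filename.split("_")
    # strip the ".tif" extension before converting the index
    return int(split[2]), int(split[3][:-4])

print(sorted(tif_files, key=parse_numeric))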
Right click your folder and click Sort by >> Name.
I have a for loop that runs through a CSV file, grabs certain elements, and creates a dictionary from two of the fields.
Code:
for ind, row in sf1.iterrows():
    sf1_date = row['datekey']
    sf1_ticker = row['ticker']
    company_date[sf1_ticker] = [sf1_date]
If, for example, during the first iteration of the for loop sf1_ticker = 'AAPL' and sf1_date = '2020/03/01', and the next time around sf1_ticker = 'AAPL' and sf1_date = '2020/06/01', how do I make the value for the key 'AAPL' in the dictionary equal to ['2020/03/01', '2020/06/01']?
It appears that when you say "key" you actually mean "value". The keys for a dictionary are the things that you use to lookup values in the dictionary. In your case ticker is the key and a list of dates are the values, e.g. you want a dictionary that looks like this:
{'AAPL': ['2020/03/01', '2020/06/01'],
 'MSFT': ['2020/04/01', '2020/09/01']}
Here the strings AAPL and MSFT are dictionary keys. The date lists are the values associated with each key.
Your code cannot construct such a dictionary because it assigns a new value to the key on every iteration. The following code either creates a new key in the dictionary company_date if the key does not already exist, or replaces the existing value if it does:
company_date[sf1_ticker] = [sf1_date]
You need to append to a list of values in the dict, rather than replace the current list, if any. There are a couple of ways to do it; dict.setdefault() is one:
company_date = {}
for ind, row in sf1.iterrows():
    sf1_date = row['datekey']
    sf1_ticker = row['ticker']
    company_date.setdefault(sf1_ticker, []).append(sf1_date)
Another way is with a collections.defaultdict of list:
from collections import defaultdict

company_date = defaultdict(list)
for ind, row in sf1.iterrows():
    sf1_date = row['datekey']
    sf1_ticker = row['ticker']
    company_date[sf1_ticker].append(sf1_date)
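Since sf1 is iterated with iterrows it is presumably a pandas DataFrame, so the same grouping can also be done without an explicit loop. A sketch, assuming the column names above:
company_date = sf1.groupby('ticker')['datekey'].apply(list).to_dict()
# {'AAPL': ['2020/03/01', '2020/06/01'], ...}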
You could create a new dictionary and add the date to the list if it exists. Otherwise, create the entry.
ticker_dates = {}
# would give ticker_dates = {'AAPL': ['2020/03/01', '2020/06/01']}
for ind, row in sf1.iterrows():
    sf1_ticker = row['ticker']
    sf1_date = row['datekey']
    if sf1_ticker in ticker_dates:
        ticker_dates[sf1_ticker].append(sf1_date)
    else:
        ticker_dates[sf1_ticker] = [sf1_date]
You can use a defaultdict, which can be set up to add an empty list for any key that doesn't exist yet. It otherwise acts like a regular dictionary.
from collections import defaultdict
rows = [
    ['AAPL', '2020/03/01'],
    ['AAPL', '2020/06/01'],
    ['GOOGL', '2021/01/01'],
]

company_date = defaultdict(list)
for ticker, date in rows:
    company_date[ticker].append(date)
print(company_date)
# defaultdict(<class 'list'>, {'AAPL': ['2020/03/01', '2020/06/01'], 'GOOGL': ['2021/01/01']})
I want to create a list of the items that have the same value in the Transaction column. I tried using a for loop, but it gives me what look like object addresses instead of the values.
[code and error screenshot omitted]
For example:
lst = [["bread"],["Scandinavian", "Scandinavian"],["Hot chocolate", "Hot chocolate"]]
You can use groupby:
df.groupby(['Transaction'])['Item'].transform(lambda x: ','.join(x))
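Note that transform returns a column aligned with the original rows; if you want the list of lists from the question instead, apply(list) on the groups may be closer. A sketch, assuming the same 'Transaction' and 'Item' columns:
lst = df.groupby('Transaction')['Item'].apply(list).tolist()
# e.g. [['bread'], ['Scandinavian', 'Scandinavian'], ['Hot chocolate', 'Hot chocolate'], ...]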
You can use a defaultdict(list) to keep track of recurring Transaction IDs in conjunction with Item, and append each item's value to refHash:
from collections import defaultdict

refHash = defaultdict(list)
# note: df.loc[i, 'Item'] assumes the default RangeIndex,
# where position i and label i coincide
for i, id in enumerate(df['Transaction'].values):
    refHash[id] += [df.loc[i, 'Item']]
lst = list(refHash.values())
I'm familiar with grouping records together based on a single field by:
from itertools import groupby
from operator import itemgetter
rows.sort(key=itemgetter('some_field'))
groups_list = []
for data, items in groupby(rows, key=itemgetter('some_field')):
    group_list = []
    for item in items:
        group_list.append(item)
    groups_list.append(group_list)
If I wanted to group the records together based on two fields without having to iterate over them twice, how could I accomplish this?
You can use an anonymous function to sort and group on those two fields:
f = lambda x: (x['field1'], x['field2'])
rows.sort(key=f)
groups_list = []
for data, items in groupby(rows, key=f):
    ...
Or update your itemgetter to fetch those two fields:
f = itemgetter('field1', 'field2')
rows.sort(key=f)
groups_list = []
for data, items in groupby(rows, key=f):
    ...
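Tying it together, a runnable sketch with made-up field1/field2 records:
from itertools import groupby
from operator import itemgetter

rows = [
    {'field1': 'a', 'field2': 1, 'val': 10},
    {'field1': 'b', 'field2': 2, 'val': 30},
    {'field1': 'a', 'field2': 1, 'val': 20},
]

f = itemgetter('field1', 'field2')
rows.sort(key=f)  # groupby only merges adjacent rows, so sort on the same key
groups_list = [list(items) for _, items in groupby(rows, key=f)]
# the two rows sharing ('a', 1) end up in one group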
Say I have an array of tuples which looks like this:
[('url#id1', 'url#predicate1', 'value1'),
('url#id1', 'url#predicate2', 'value2'),
('url#id1', 'url#predicate3', 'value3'),
('url#id2', 'url#predicate1', 'value4'),
('url#id2', 'url#predicate2', 'value5')]
I would like to be able to return a nice 2D array so I can display it as-is in my page through Django.
The table would look like this:
[['', 'predicate1', 'predicate2', 'predicate3'],
['id1', 'value1', 'value2', 'value3'],
['id2', 'value4', 'value5', '']]
You will notice that the 2nd item of each tuple became the column titles and that we now have rows of ids and values.
How would you do that? Of course if you have a better idea than using the table example I gave I would be happy to have your thoughts :)
Right now I am generating a dict of dicts and displaying that in Django. But since my key/value pairs are not always in the same order in my dicts, the data cannot be displayed correctly.
Thanks!
Your dict of dict is probably on the right track. While you create that dict of dict, you could also maintain a list of ids and a list of predicates. That way, you can remember the ordering and build the table by looping through those lists.
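A sketch of that suggestion, assuming dataset holds the (id, predicate, value) tuples from the question:
# build the dict of dicts while remembering first-seen order
table = {}
ids, predicates = [], []
for id_, pred, value in dataset:
    if id_ not in table:
        table[id_] = {}
        ids.append(id_)
    if pred not in predicates:
        predicates.append(pred)
    table[id_][pred] = value

header = [''] + predicates
body = [[id_] + [table[id_].get(p, '') for p in predicates] for id_ in ids]
result = [header] + body
# [['', 'url#predicate1', 'url#predicate2', 'url#predicate3'],
#  ['url#id1', 'value1', 'value2', 'value3'],
#  ['url#id2', 'value4', 'value5', '']]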
Using the zip function on your initial array will give you three lists: the ids, the predicates, and the values.
To get rid of duplicates, try the reduce function (imported from functools in Python 3), keeping the first occurrence of each element:
list_without_duplicates = reduce(
    lambda l, x: l if x in l else l + [x], list_with_duplicates, [])
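And a short sketch of the zip idea, again assuming dataset is the list of tuples from the question:
from functools import reduce

ids, predicates, values = zip(*dataset)
dedup = lambda seq: reduce(lambda l, x: l if x in l else l + [x], seq, [])
unique_ids = dedup(ids)                # e.g. ['url#id1', 'url#id2']
unique_predicates = dedup(predicates)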
OK, at last I came up with this code:
columns = dict()
columnsTitles = [""]  # blank cell above the id column
rows = dict()
colIdxCounter = 1  # start at 1 because the first column holds the ids
rowIdxCounter = 1  # start at 1 because row 0 holds the column titles

for i in dataset:
    if i[0] not in rows:  # dict.has_key() is gone in Python 3
        rows[i[0]] = rowIdxCounter
        rowIdxCounter += 1
    if i[1] not in columns:
        columns[i[1]] = colIdxCounter
        colIdxCounter += 1
        columnsTitles.append(i[1])

toRet = [columnsTitles]
for i in range(len(rows)):
    toRet.append([""] * colIdxCounter)
for rowId, rowIdx in rows.items():
    toRet[rowIdx][0] = rowId  # put each id in the first cell of its row
for i in dataset:
    toRet[rows[i[0]]][columns[i[1]]] = i[2]

for i in toRet:
    print(i)
Please don't hesitate to comment/improve it :)