Format an array of tuples in a nice "table" - python

Say I have an array of tuples which look like that:
[('url#id1', 'url#predicate1', 'value1'),
('url#id1', 'url#predicate2', 'value2'),
('url#id1', 'url#predicate3', 'value3'),
('url#id2', 'url#predicate1', 'value4'),
('url#id2', 'url#predicate2', 'value5')]
I would like to be able to return a nice 2D array so that I can display it "as is" in my page through Django.
The table would look like this:
[['', 'predicate1', 'predicate2', 'predicate3'],
['id1', 'value1', 'value2', 'value3'],
['id2', 'value4', 'value5', '']]
You will notice that the 2nd item of each tuple became the table's column title, and that we now have rows of ids and columns of values.
How would you do that? Of course, if you have a better idea than the table example I gave, I would be happy to hear your thoughts :)
Right now I am generating a dict of dicts and displaying that in Django. But as the key/value pairs are not always in the same order in my dicts, my data is not displayed correctly.
Thanks!

Your dict of dict is probably on the right track. While you create that dict of dict, you could also maintain a list of ids and a list of predicates. That way, you can remember the ordering and build the table by looping through those lists.
Using the zip function on your initial array, as in zip(*dataset), will give you three sequences: the ids, the predicates and the values.
To get rid of duplicates while keeping the order, try the reduce function (in Python 3, import it from functools):
list_without_duplicates = reduce(
    lambda l, x: l if x in l else l + [x], list_with_duplicates, [])
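The first suggestion above (a dict of dicts plus ordered lists of ids and predicates) can be sketched as follows. This is only an illustration: the names `to_table` and `short` are made up here, and stripping everything up to the `#` is an assumption based on the expected output in the question.

```python
def short(name):
    # Keep only the part after '#', e.g. 'url#id1' -> 'id1' (assumed naming scheme)
    return name.rsplit("#", 1)[-1]


def to_table(dataset):
    ids, predicates = [], []  # remember first-seen order
    cells = {}  # cells[row_id][predicate] = value
    for row_id, predicate, value in dataset:
        if row_id not in cells:
            cells[row_id] = {}
            ids.append(row_id)
        if predicate not in predicates:
            predicates.append(predicate)
        cells[row_id][predicate] = value
    header = [""] + [short(p) for p in predicates]
    # Missing (id, predicate) pairs become empty strings
    body = [[short(i)] + [cells[i].get(p, "") for p in predicates] for i in ids]
    return [header] + body


dataset = [('url#id1', 'url#predicate1', 'value1'),
           ('url#id1', 'url#predicate2', 'value2'),
           ('url#id1', 'url#predicate3', 'value3'),
           ('url#id2', 'url#predicate1', 'value4'),
           ('url#id2', 'url#predicate2', 'value5')]
table = to_table(dataset)
for row in table:
    print(row)
```

Because the column order lives in the `predicates` list rather than in dict iteration order, every row is emitted with its cells in the same positions.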

Ok,
At last I came up with that code:
columns = dict()
columnsTitles = [""]  # Top-left cell is empty: the first column holds the ids
rows = dict()
colIdxCounter = 1  # Start with 1 because the first column holds the ids
rowIdxCounter = 1  # Start with 1 because the first row holds the column titles
for i in dataset:
    if i[0] not in rows:
        rows[i[0]] = rowIdxCounter
        rowIdxCounter += 1
    if i[1] not in columns:
        columns[i[1]] = colIdxCounter
        colIdxCounter += 1
        columnsTitles.append(i[1])
toRet = [columnsTitles]
for i in range(len(rows)):
    toAppend = []
    for j in range(colIdxCounter):
        toAppend.append("")
    toRet.append(toAppend)
for i in dataset:
    toRet[rows[i[0]]][0] = i[0]  # Put the id in the first column
    toRet[rows[i[0]]][columns[i[1]]] = i[2]
for i in toRet:
    print(i)
Please don't hesitate to comment/improve it :)

Related

updating dictionary in a nested loop

In the code below, I would like to update the fruit_dict dictionary with the mean price of each row. But the code is not working as expected. Kindly help.
#!/usr/bin/python3
import random
import numpy as np
import pandas as pd
price=np.array(range(20)).reshape(5,4) #sample data for illustration
fruit_keys = [] # list of keys for dictionary
for i in range(5):
    key = "fruit_" + str(i)
    fruit_keys.append(key)
# initialize a dictionary
fruit_dict = dict.fromkeys(fruit_keys)
fruit_list = []
# print(fruit_dict)
# update dictionary values
for i in range(price.shape[1]):
    for key,value in fruit_dict.items():
        for j in range(price.shape[0]):
            fruit_dict[key] = np.mean(price[j])
    fruit_list.append(fruit_dict)
fruit_df = pd.DataFrame(fruit_list)
print(fruit_df)
Instead of creating the dictionary keys up front, you can build each key from the string pattern while iterating over the rows only.
If you need keys that follow a certain pattern, you can set each value in a single loop by building the key from that pattern. You don't need an additional list to create a data frame; see the pandas documentation on creating data frames from a dictionary. I have provided a sample output which may suit your requirement.
If you need an output with the mean value as a column and the fruits as rows, you can use the implementation below.
#!/usr/bin/python3
import random
import numpy as np
import pandas as pd
row = 5
column = 4
price = np.array(range(20)).reshape(row, column)  # sample data for illustration
# initialize a dictionary
fruit_dict = {}
for j in range(row):
    fruit_dict['fruit_'+str(j)] = np.mean(price[j])
fruit_df = pd.DataFrame.from_dict(fruit_dict,orient='index',columns=['mean_value'])
print(fruit_df)
This gives the output below. As mentioned, you can shape the data frame as you wish from a dictionary by referring to the data frame documentation.
         mean_value
fruit_0         1.5
fruit_1         5.5
fruit_2         9.5
fruit_3        13.5
fruit_4        17.5
You shouldn't nest the loop over the range inside the loop over the dictionary items; you should iterate over them together. You can do this with enumerate().
You're also not using value, so there's no need to use items().
for i, key in enumerate(fruit_dict):
    fruit_dict[key] = np.mean(price[i])
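As a self-contained illustration of the single enumerate() loop, here is a minimal sketch that uses only the standard library. It assumes a plain nested list standing in for the NumPy array from the question, and `statistics.mean` in place of `np.mean`.

```python
from statistics import mean

# Plain nested list standing in for np.array(range(20)).reshape(5, 4)
price = [list(range(start, start + 4)) for start in range(0, 20, 4)]

# Keys fruit_0 .. fruit_4, all initialised to None
fruit_dict = dict.fromkeys("fruit_" + str(i) for i in range(len(price)))

# enumerate() pairs each key with its row index, so one loop is enough
for i, key in enumerate(fruit_dict):
    fruit_dict[key] = mean(price[i])

print(fruit_dict)
```

Assigning to existing keys while iterating is safe here because the dictionary does not change size.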
I arrived at a solution based on the answer provided by Sangeerththan. Here it is:
#!/usr/bin/python3
import numpy as np
import pandas as pd
fruit_dict = {}
fruit_list = []
price = np.array(range(40)).reshape(4, 10)
for i in range(price.shape[0]):
    mark_price = np.square(price[i])
    for j in range(mark_price.shape[0]):
        fruit_dict['proj_fruit_price_'+str(j)] = np.mean(mark_price[j])
    fruit_list.append(fruit_dict.copy())
fruit_df = pd.DataFrame(fruit_list)
You can use this instead of your loops:
fruit_keys = []  # list of keys for dictionary
for i in range(5):
    key = "fruit_" + str(i)
    fruit_keys.append(key)
out = {fruit_keys[index]: np.mean(price[index]) for index in range(price.shape[0])}
Output (NumPy 2.x shows each value wrapped as np.float64(...)):
{'fruit_0': 1.5, 'fruit_1': 5.5, 'fruit_2': 9.5, 'fruit_3': 13.5, 'fruit_4': 17.5}

How do I create a list as a key of a dictionary and add to the list in different parts?

I have a for loop that runs through a CSV file and grabs certain elements and creates a dictionary based on two variables.
Code:
for ind, row in sf1.iterrows():
    sf1_date = row['datekey']
    sf1_ticker = row['ticker']
    company_date[sf1_ticker] = [sf1_date]
If, for example, during the first iteration of the for loop sf1_ticker = 'AAPL' and sf1_date = '2020/03/01', and the next time around sf1_ticker = 'AAPL' and sf1_date = '2020/06/01', how do I make the key 'AAPL' in the dictionary equal to ['2020/03/01', '2020/06/01']?
It appears that when you say "key" you actually mean "value". The keys of a dictionary are the things that you use to look up values in the dictionary. In your case the ticker is the key and a list of dates is the value, e.g. you want a dictionary that looks like this:
{'AAPL': ['2020/03/01', '2020/06/01'],
 'MSFT': ['2020/04/01', '2020/09/01']}
Here the strings AAPL and MSFT are dictionary keys. The date lists are the values associated with each key.
Your code cannot construct such a dictionary because it assigns a new value to the key on every iteration. The following line either creates a new key in the dictionary company_date if the key does not already exist, or replaces the existing value if it does:
company_date[sf1_ticker] = [sf1_date]
You need to append to a list of values in the dict, rather than replace the current list, if any. There are a couple of ways to do it; dict.setdefault() is one:
company_date = {}
for ind, row in sf1.iterrows():
    sf1_date = row['datekey']
    sf1_ticker = row['ticker']
    company_date.setdefault(sf1_ticker, []).append(sf1_date)
Another way is with a collections.defaultdict of list:
from collections import defaultdict
company_date = defaultdict(list)
for ind, row in sf1.iterrows():
    sf1_date = row['datekey']
    sf1_ticker = row['ticker']
    company_date[sf1_ticker].append(sf1_date)
You could create a new dictionary and add the date to the list if it exists. Otherwise, create the entry.
ticker_dates = {}
# Would give ticker_dates = {"AAPL": ['2020/03/01', '2020/06/01']}
for ind, row in sf1.iterrows():
    sf1_ticker = row['ticker']
    sf1_date = row['datekey']
    if sf1_ticker in ticker_dates:
        ticker_dates[sf1_ticker].append(sf1_date)
    else:
        ticker_dates[sf1_ticker] = [sf1_date]
You can use a defaultdict, which can be set up to add an empty list for any key that doesn't exist. It generally acts like a dictionary otherwise.
from collections import defaultdict
rows = [
    ['AAPL', '2020/03/01'],
    ['AAPL', '2020/06/01'],
    ['GOOGL', '2021/01/01']
]
company_date = defaultdict(list)
for ticker, date in rows:
    company_date[ticker].append(date)
print(company_date)
# defaultdict(<class 'list'>, {'AAPL': ['2020/03/01', '2020/06/01'], 'GOOGL': ['2021/01/01']})

Pythonic way to match multiple values in a list of lists to another list of lists and return a value

I'm trying to match two or more values from a list of lists to another list of lists and return a value from one of the lists. Much like SQL's on clause - on x.field = y.field and x.field = y.field.
Picture a list of transactions from your Amazon account. The ids are unique, but the names change (darn Amazon!). I want to use the last name/title, based on max date. I could probably do the below with the initial data set, but couldn't think of how. I'm reading in the rows as a list of lists.
I'm just working on a personal project combing through Amazon purchases, but could see this being very useful down the road. I have a solution, but I think it will run very long depending on the size of the data. I've seen people call out Pandas' dataframe as a solution, but I'm trying to learn Python's standard libraries first. It's my first question on Stack, I apologize and thank you in advance.
# Example data set comes from a csv I've read into different lists of lists
# Fields in order are ID, date (max date from csv to id) -- data set is unique, row count 140
X = [['b12', 8/1/2019], ['c34', 7/25/2018],..]
# Fields in order are ID, date, Name -- data set is unique, due to date, row count 1,231
Y = [['b12', 6/23/19, 'item 1'], ['b12', 7/21/19, 'item 1.0'], ['b12', 8/1/19, 'item 1.1'],..]
# Code that works, but I'm sure is 'expensive'
for i in X:
    for n in Y:
        if i[0] == n[0] and i[1] == n[1]:
            i.append(n[2])
        else: continue
# Result is either I append to X (like I have) or create a new list of lists altogether
X
[['b12', 8/1/2019, 'item 1.1'], ['c34', 7/25/2018, 'item 2.8'],...]
You can create a mapping dict from your list Y with (id, date) as the key and the name as the value. Then use a list comprehension to create a new list from list X with the mapped value from the mapping dict. This replaces the nested loops with a single pass over each list.
>>> X = [['b12', '8/1/2019'], ['c34', '7/25/2018']]
>>> Y = [['b12', '6/23/19', 'item 1'], ['b12', '7/21/19', 'item 1.0'], ['b12', '8/1/19', 'item 1.1'], ['c34', '7/25/18', 'item2.1']]
>>>
>>> mapping = {(id, date):name for id,date,name in Y}
>>> res = [[id, date, mapping[(id, date.replace('/20', '/'))]] for id,date in X]
>>>
>>> print (res)
[['b12', '8/1/2019', 'item 1.1'], ['c34', '7/25/2018', 'item2.1']]
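One caveat with the date.replace('/20', '/') trick: it also rewrites a day between 20 and 29, so '8/20/2019' becomes '8//19' and the lookup fails. A possible alternative, sketched below, is to parse both date styles into date objects before building the mapping. The helper name `parse_date` and the month/day/year order are assumptions for illustration.

```python
from datetime import datetime


def parse_date(text):
    # Try a four-digit year first, then a two-digit year (assumed m/d/y order)
    for fmt in ("%m/%d/%Y", "%m/%d/%y"):
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    raise ValueError("unrecognised date: " + text)


X = [['b12', '8/1/2019'], ['c34', '7/25/2018']]
Y = [['b12', '6/23/19', 'item 1'], ['b12', '7/21/19', 'item 1.0'],
     ['b12', '8/1/19', 'item 1.1'], ['c34', '7/25/18', 'item2.1']]

# Key on (id, parsed date) so '8/1/2019' and '8/1/19' compare equal
mapping = {(id_, parse_date(date)): name for id_, date, name in Y}
# .get() yields None instead of raising when X has no match in Y
res = [[id_, date, mapping.get((id_, parse_date(date)))] for id_, date in X]
print(res)
```

Using .get() rather than indexing is a design choice here: a row of X without a counterpart in Y ends up with None as its name instead of stopping the run with a KeyError.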

How to group by and sum when all elements of one list are in another list

I have a data frame df1. "transactions" column has an array of int.
id transactions
1 [1,2,3]
2 [2,3]
data frame df2. "items" column has an array of int.
items cost
[1,2] 2.0
[2] 1.0
[2,4] 4.0
I need to check whether all elements of items are in each transaction and, if so, sum up the costs.
Expected Result
id transaction score
1 [1,2,3] 3.0
2 [2,3] 1.0
I did the following
# cross join
# -----------
def cartesian_product_simplified(left, right):
    la, lb = len(left), len(right)
    ia2, ib2 = np.broadcast_arrays(*np.ogrid[:la,:lb])
    return pd.DataFrame(
        np.column_stack([left.values[ia2.ravel()],
                         right.values[ib2.ravel()]]))
out=cartesian_product_simplified(df1,df2)
# column names assigning
out.columns=['id', 'transactions', 'cost', 'items']
# converting pandas series to list
t=out["transactions"].tolist()
item=out["items"].tolist()
# check list present in another list
# -------------------------------------
def check(trans,itm):
    out_list=list()
    for row in trans:
        ret =np.all(np.in1d(itm, row))
        out_list.append(ret)
    return out_list
# if true: group and sum
# -----------------------
a=check(t,item)
for i in a:
    if(i):
        print(out.groupby(['id','transactions']))['cost'].sum()
    else:
        print("no")
Throws TypeError: 'NoneType' object is not subscriptable.
I am new to python and don't know how to put all these together. How to group by and sum the cost when all items of one list in another list?
The simplest way is just to check all items for all transactions:
# df1 and df2 are initialized
def sum_score(transaction):
    score = 0
    for _, row in df2.iterrows():
        if all(item in transaction for item in row["items"]):
            score += row["cost"]
    return score
df1["score"] = df1["transactions"].map(sum_score)
This will be extremely slow at scale. If that is a problem, we should not iterate over every row of df2 but preselect only the rows that can possibly match. If you have enough memory, it can be done like this: for each item we remember all the row numbers in df2 where it appears. Then, for each transaction, we take its items, collect all the candidate rows and check only those.
import collections
# df1 and df2 are initialized
def get_sum_score_precalculated_func(items_cost_df):
    # create a dict of possible indexes to search for an item
    items_search_dict = collections.defaultdict(set)
    for i, (_, row) in enumerate(items_cost_df.iterrows()):
        for item in row["items"]:
            items_search_dict[item].add(i)
    def sum_score(transaction):
        possible_indexes = set()
        for i in transaction:
            # set union; sets do not support +=
            possible_indexes |= items_search_dict[i]
        score = 0
        for i in possible_indexes:
            row = items_cost_df.iloc[i]
            if all(item in transaction for item in row["items"]):
                score += row["cost"]
        return score
    return sum_score
df1["score"] = df1["transactions"].map(get_sum_score_precalculated_func(df2))
Here I use:
set, which is an unordered collection of unique values (it helps to combine the possible line numbers and avoids double counting).
collections.defaultdict, which is a regular dict, except that accessing a missing key fills it with a default value (an empty set in my case). It helps to avoid if x not in my_dict: my_dict[x] = set().
I also use a so-called "closure", which means the sum_score function keeps access to items_cost_df and items_search_dict, which were accessible at the level where sum_score was declared, even after it has been returned and get_sum_score_precalculated_func has finished.
That should be much faster when the items are fairly unique and each appears in only a few rows of df2.
If you have only a few unique items and many identical transactions, you had better calculate the score for each unique transaction first and then join the result.
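# lists are not hashable, so unique() and merge() need a tuple version of the column
df1["transactions_key"] = df1["transactions"].map(tuple)
transaction_score = []
for transaction in df1["transactions_key"].unique():
    score = sum_score(transaction)
    transaction_score.append([transaction, score])
transaction_score = pd.DataFrame(
    transaction_score,
    columns=["transactions_key", "score"])
df1 = df1.merge(transaction_score, on="transactions_key", how="left")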
Here I use sum_score from the first code example.
P.S. A Python error message comes with a line number, which helps a lot in understanding the problem.
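The idea of scoring each distinct transaction only once can also be sketched without pandas, by caching on a hashable key. This is an illustration only: `ITEMS_COST`, `score_for` and the use of `functools.lru_cache` are assumptions, with plain lists standing in for df1 and df2 from the question.

```python
from functools import lru_cache

# Stand-ins for df2 (items, cost) and for df1["transactions"]
ITEMS_COST = [([1, 2], 2.0), ([2], 1.0), ([2, 4], 4.0)]
transactions = [[1, 2, 3], [2, 3], [1, 2, 3]]


@lru_cache(maxsize=None)
def score_for(transaction):
    # transaction must be hashable (a tuple) so lru_cache can use it as a key
    present = set(transaction)
    # Add the cost of every item list fully contained in the transaction
    return sum(cost for items, cost in ITEMS_COST if present.issuperset(items))


scores = [score_for(tuple(t)) for t in transactions]
print(scores)
```

The repeated transaction is served from the cache, so the subset checks run once per distinct transaction rather than once per row.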
# convert df_1 to dictionary for iteration
df_1_dict = dict(zip(df_1["id"], df_1["transactions"]))
# convert df_2 to list for iteration as there is no unique column
df_2_list = df_2.values.tolist()
# iterate through each combination to find a valid one
new_data = []
for rows in df_2_list:
    items = rows[0]
    costs = rows[1]
    for key, value in df_1_dict.items():
        # find common items in both
        common = set(value).intersection(set(items))
        # keep the row only if every item is present in the transaction
        if len(common) == len(items):
            new_row = {"id": key, "transactions": value, "costs": costs}
            new_data.append(new_row)
merged_df = pd.DataFrame(new_data)
merged_df = merged_df[["id", "transactions", "costs"]]
# group the data by id to get total cost for each id
merged_df = (
    merged_df
    .groupby(["id"])
    .agg({"costs": "sum"})
    .reset_index()
)

Python Group Array by Column and Display Unique Values

I have an Array of Arrays with following format:
x = [["Username1","id3"],
["Username1", "id4"],
["Username1", "id4"],
["Username3", "id3"]]
I want to group by the ids and display all the unique usernames
How would I get an output that is like:
id3: Username1, Username3
id4: Username1
Edit: Was able to group by second column but I cannot only display unique values. Here is my code:
data = {}
for key, group in groupby(sorted(x), key=lambda x: x[1]):
    data[key] = [v[0] for v in group]
print(data)
Use a dict to create unique keys by id and Python sets to store the values (so you store only unique names for each key):
items = [
    ["Username1","id3"],
    ["Username1", "id4"],
    ["Username1", "id4"],
    ["Username3", "id3"]
]
data = {}
for item in items:
    if item[1] in data:
        data[item[1]].add(item[0])
    else:
        data[item[1]] = {item[0]}
print(data)
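The same dict-of-sets idea can be written a little more compactly with collections.defaultdict, and formatted to match the output requested in the question. A minimal sketch; the variable names `users_by_id` and `lines` are chosen here for illustration.

```python
from collections import defaultdict

x = [["Username1", "id3"],
     ["Username1", "id4"],
     ["Username1", "id4"],
     ["Username3", "id3"]]

# Each id maps to a set, so duplicate usernames collapse automatically
users_by_id = defaultdict(set)
for username, user_id in x:
    users_by_id[user_id].add(username)

# Sets are unordered, so sort both ids and names for stable output
lines = [f"{user_id}: {', '.join(sorted(names))}"
         for user_id, names in sorted(users_by_id.items())]
print("\n".join(lines))
```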
You may use a for loop, but a LINQ-style helper might be cleaner for future use.
https://stackoverflow.com/a/3926105/4564614
has some great ways to incorporate LINQ-style operations to solve this issue. I think what you are looking for is grouping by.
Example:
from collections import defaultdict
from operator import attrgetter
def group_by(iterable, group_func):
    groups = defaultdict(list)
    for item in iterable:
        groups[group_func(item)].append(item)
    return groups
group_by((x.foo for x in ...), attrgetter('bar'))
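Applied to the list of lists from the question, the helper might be used as sketched below. Since the rows are lists rather than objects with attributes, operator.itemgetter is assumed in place of attrgetter; the `unique_names` step is an addition for illustration.

```python
from collections import defaultdict
from operator import itemgetter


def group_by(iterable, group_func):
    # Collect items into lists keyed by group_func(item)
    groups = defaultdict(list)
    for item in iterable:
        groups[group_func(item)].append(item)
    return groups


x = [["Username1", "id3"],
     ["Username1", "id4"],
     ["Username1", "id4"],
     ["Username3", "id3"]]

# Group whole rows by their second element (the id)
groups = group_by(x, itemgetter(1))
# Reduce each group to its sorted, de-duplicated usernames
unique_names = {key: sorted({name for name, _ in rows})
                for key, rows in groups.items()}
print(unique_names)
```

Unlike itertools.groupby, this helper does not require the input to be sorted by the grouping key first.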
