How to combine values having same value in some columns - python

I want to create a list of the items that have the same value in the Transaction column. I tried using a for loop, but it gives some random addresses.
CODE AND ERROR SNIPPET
For example:
lst = [["bread"],["Scandinavian", "Scandinavian"],["Hot chocolate", "Hot chocolate"]]

You can use groupby:
df.groupby(['Transaction'])['Item'].transform(lambda x: ','.join(x))
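If the goal is the list of lists shown in the question rather than one joined string per row, one option is to aggregate each group into a list. This is a minimal sketch, assuming a DataFrame with Transaction and Item columns; the sample rows are made up for illustration because the real data is not shown in the question.

```python
import pandas as pd

# Hypothetical sample data standing in for the asker's DataFrame.
df = pd.DataFrame({
    "Transaction": [1, 2, 2, 3, 3],
    "Item": ["bread", "Scandinavian", "Scandinavian",
             "Hot chocolate", "Hot chocolate"],
})

# Collect the items of each transaction into a list,
# then convert the resulting Series into a list of lists.
lst = df.groupby("Transaction")["Item"].apply(list).tolist()
print(lst)
```

By default groupby sorts the group keys, so the inner lists come out in order of Transaction ID.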

You can use a defaultdict(list) to keep track of recurring Transaction IDs together with their Items, appending each item to the list stored for its transaction in refHash:
from collections import defaultdict
refHash = defaultdict(list)
# Pair each transaction ID with its item; this does not depend on the DataFrame's index labels
for tid, item in zip(df['Transaction'], df['Item']):
    refHash[tid].append(item)
lst = list(refHash.values())

Related

updating dictionary in a nested loop

In the code below, I would like to update the fruit_dict dictionary with the mean price of each row. But the code is not working as expected. Kindly help.
#!/usr/bin/python3
import random
import numpy as np
import pandas as pd
price=np.array(range(20)).reshape(5,4) #sample data for illustration
fruit_keys = [] # list of keys for dictionary
for i in range(5):
    key = "fruit_" + str(i)
    fruit_keys.append(key)
# initialize a dictionary
fruit_dict = dict.fromkeys(fruit_keys)
fruit_list = []
# print(fruit_dict)
# update dictionary values
for i in range(price.shape[1]):
    for key,value in fruit_dict.items():
        for j in range(price.shape[0]):
            fruit_dict[key] = np.mean(price[j])
    fruit_list.append(fruit_dict)
fruit_df = pd.DataFrame(fruit_list)
print(fruit_df)
Instead of creating the dictionary with the key pattern up front, you can build each key while iterating over the rows and assign the row mean to it directly.
If you already have a dictionary with a given key pattern, you can update its values in a single loop by constructing the key you need. You also don't need an additional list to create the data frame; pandas can build a data frame directly from a dictionary (see the pandas documentation on creating data frames from dictionaries). I have provided a sample output that may suit your requirement.
If you need an output with the mean value as a column and the fruits as rows, you can use the implementation below.
#!/usr/bin/python3
import random
import numpy as np
import pandas as pd
row = 5
column = 4
price = np.array(range(20)).reshape(row, column) # sample data for illustration
# initialize a dictionary
fruit_dict = {}
for j in range(row):
    fruit_dict['fruit_'+str(j)] = np.mean(price[j])
fruit_df = pd.DataFrame.from_dict(fruit_dict,orient='index',columns=['mean_value'])
print(fruit_df)
This produces the output below. As already mentioned, you can shape the data frame as you wish from a dictionary by following the pandas data frame documentation.
         mean_value
fruit_0         1.5
fruit_1         5.5
fruit_2         9.5
fruit_3        13.5
fruit_4        17.5
You shouldn't nest the loop over the range inside the loop over the dictionary items; you should iterate over them together. You can do this with enumerate().
You're also not using value, so there's no need to use items().
for i, key in enumerate(fruit_dict):
    # i is the row index that belongs to this key
    fruit_dict[key] = np.mean(price[i])
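The same idea, pairing each key with its own row instead of nesting the loops, can be sketched without NumPy. This assumes plain nested lists in place of the price array and uses statistics.mean as a stand-in for np.mean.

```python
from statistics import mean

# Stand-in for the 5x4 price array from the question (values 0..19).
price = [list(range(r * 4, r * 4 + 4)) for r in range(5)]
fruit_dict = dict.fromkeys(f"fruit_{i}" for i in range(5))

# enumerate() yields the row index that belongs to each key,
# so every key receives the mean of exactly one row.
for i, key in enumerate(fruit_dict):
    fruit_dict[key] = mean(price[i])

print(fruit_dict)
```

Assigning to existing keys while iterating is safe here because the set of keys does not change.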
I arrived at a solution based on the answer provided by Sangeerththan. It is shown below.
#!/usr/bin/python3
import numpy as np
import pandas as pd
fruit_dict = {}
fruit_list =[]
price=np.array(range(40)).reshape(4,10)
for i in range(price.shape[0]):
    mark_price = np.square(price[i])
    for j in range(mark_price.shape[0]):
        fruit_dict['proj_fruit_price_'+str(j)] = np.mean(mark_price[j])
    # store a copy so that later iterations do not overwrite this row
    fruit_list.append(fruit_dict.copy())
fruit_df = pd.DataFrame(fruit_list)
You can use this instead of your loops:
fruit_keys = [] # list of keys for dictionary
for i in range(5):
    key = "fruit_" + str(i)
    fruit_keys.append(key)
out = {fruit_keys[index]: np.mean(price[index]) for index in range(price.shape[0])}
Output (the values are NumPy floats; NumPy 2 and later display them as np.float64(1.5) and so on):
{'fruit_0': 1.5, 'fruit_1': 5.5, 'fruit_2': 9.5, 'fruit_3': 13.5, 'fruit_4': 17.5}

How do I create a list as a key of a dictionary and add to the list in different places?

I have a for loop that runs through a CSV file and grabs certain elements and creates a dictionary based on two variables.
Code:
for ind, row in sf1.iterrows():
    sf1_date = row['datekey']
    sf1_ticker = row['ticker']
    company_date[sf1_ticker] = [sf1_date]
If, for example, during the first iteration of the for loop sf1_ticker = 'AAPL' and sf1_date = '2020/03/01', and the next time around sf1_ticker = 'AAPL' and sf1_date = '2020/06/01', how do I make the key 'AAPL' in the dictionary equal to ['2020/03/01', '2020/06/01']?
It appears that when you say "key" you actually mean "value". The keys of a dictionary are the things that you use to look up values in the dictionary. In your case the ticker is the key and a list of dates is the value, i.e. you want a dictionary that looks like this:
{'AAPL': ['2020/03/01', '2020/06/01'],
 'MSFT': ['2020/04/01', '2020/09/01']}
Here the strings AAPL and MSFT are dictionary keys. The date lists are the values associated with each key.
Your code cannot construct such a dictionary because it assigns a new value to the key on every iteration. The following line either creates a new key in the dictionary company_date if the key does not exist yet, or replaces the existing value if it does:
company_date[sf1_ticker] = [sf1_date]
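To make the difference concrete, here is a small sketch that contrasts plain assignment with appending. It uses the two AAPL dates from the question as literal values instead of reading them from the CSV.

```python
company_date = {}

# Plain assignment: the second assignment replaces the first list.
company_date["AAPL"] = ["2020/03/01"]
company_date["AAPL"] = ["2020/06/01"]
overwritten = dict(company_date)

# Appending to the list already stored under the key keeps both dates.
company_date = {"AAPL": ["2020/03/01"]}
company_date["AAPL"].append("2020/06/01")

print(overwritten)   # only the last date survives
print(company_date)  # both dates are kept
```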
You need to append to a list of values in the dict, rather than replace the current list, if any. There are a couple of ways to do it; dict.setdefault() is one:
company_date = {}
for ind, row in sf1.iterrows():
    sf1_date = row['datekey']
    sf1_ticker = row['ticker']
    company_date.setdefault(sf1_ticker, []).append(sf1_date)
Another way is with a collections.defaultdict of list:
from collections import defaultdict
company_date = defaultdict(list)
for ind, row in sf1.iterrows():
    sf1_date = row['datekey']
    sf1_ticker = row['ticker']
    company_date[sf1_ticker].append(sf1_date)
You could create a new dictionary and append the date to the list if the ticker already exists. Otherwise, create the entry.
ticker_dates = {}
# Would give ticker_dates = {"AAPL":['2020/03/01', '2020/06/01']}
for ind,row in sf1.iterrows():
    sf1_ticker = row['ticker']
    sf1_date = row['datekey']
    if sf1_ticker in ticker_dates:
        ticker_dates[sf1_ticker].append(sf1_date)
    else:
        ticker_dates[sf1_ticker] = [sf1_date]
You can use a defaultdict, which can be set up to create an empty list for any key that doesn't exist yet. Otherwise it generally acts like a dictionary.
from collections import defaultdict
rows = [
    ['AAPL', '2020/03/01'],
    ['AAPL', '2020/06/01'],
    ['GOOGL', '2021/01/01']
]
company_date = defaultdict(list)
for ticker, date in rows:
    company_date[ticker].append(date)
print(company_date)
# defaultdict(<class 'list'>, {'AAPL': ['2020/03/01', '2020/06/01'], 'GOOGL': ['2021/01/01']})

Python Group Array by Column and Display Unique Values

I have an array of arrays with the following format:
x = [["Username1","id3"],
     ["Username1", "id4"],
     ["Username1", "id4"],
     ["Username3", "id3"]]
I want to group by the ids and display all the unique usernames.
How would I get output like this:
id3: Username1, Username3
id4: Username1
Edit: I was able to group by the second column, but I cannot display only the unique values. Here is my code:
from itertools import groupby
data={}
for key, group in groupby(sorted(x), key=lambda x: x[1]):
    data[key]=[v[0] for v in group]
print(data)
Use a dict keyed by id and Python sets to store the values (so only unique names are stored for each key):
items = [
    ["Username1","id3"],
    ["Username1", "id4"],
    ["Username1", "id4"],
    ["Username3", "id3"]
]
data = {}
for item in items:
    # dict.has_key() was removed in Python 3; use the in operator instead
    if item[1] in data:
        data[item[1]].add(item[0])
    else:
        data[item[1]] = set([item[0]])
print(data)
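To get from the dict of sets to the exact output asked for in the question, a sketch along these lines may help. It assumes the same sample data and swaps in a defaultdict(set) to avoid the explicit membership check; the names names_by_id and lines are chosen here for illustration.

```python
from collections import defaultdict

items = [
    ["Username1", "id3"],
    ["Username1", "id4"],
    ["Username1", "id4"],
    ["Username3", "id3"],
]

# Map each id to the set of usernames seen for it (sets drop duplicates).
names_by_id = defaultdict(set)
for username, item_id in items:
    names_by_id[item_id].add(username)

# Sort ids and names so the output order is deterministic.
lines = [
    f"{item_id}: {', '.join(sorted(names))}"
    for item_id, names in sorted(names_by_id.items())
]
print("\n".join(lines))
```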
You may use a for loop, but a LINQ-style grouping helper might be cleaner for future use.
https://stackoverflow.com/a/3926105/4564614
has some great ways to emulate LINQ in Python for this kind of problem. I think what you are looking for is a group-by.
Example:
from collections import defaultdict
from operator import attrgetter
def group_by(iterable, group_func):
    # collect the items of the iterable under the key returned by group_func
    groups = defaultdict(list)
    for item in iterable:
        groups[group_func(item)].append(item)
    return groups
group_by((x.foo for x in ...), attrgetter('bar'))
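Applied to the list-of-lists data from the question, the helper needs a key function that works on positions rather than attributes. The sketch below assumes operator.itemgetter(1) as that key function and adds a de-duplication step that is not part of the linked answer.

```python
from collections import defaultdict
from operator import itemgetter

def group_by(iterable, group_func):
    """Collect the items of iterable under the key returned by group_func."""
    groups = defaultdict(list)
    for item in iterable:
        groups[group_func(item)].append(item)
    return groups

x = [["Username1", "id3"],
     ["Username1", "id4"],
     ["Username1", "id4"],
     ["Username3", "id3"]]

# Group the rows by their second element (the id).
grouped = group_by(x, itemgetter(1))

# Reduce each group to its sorted, unique usernames.
unique_names = {key: sorted({row[0] for row in rows})
                for key, rows in grouped.items()}
print(unique_names)
```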

Is there a way to remove nan from a dictionary filled with data?

I have a dictionary that is filled with data from two files I imported, but some of the data comes out as nan. How do I remove the pieces of data with nan?
My code is:
import matplotlib.pyplot as plt
from pandas import Timestamp
import numpy as np
from datetime import datetime
import pandas as pd
import collections
# raw strings (r'...') keep the backslashes in the Windows paths from being read as escape sequences
orangebook = pd.read_csv(r'C:\Users\WEGWEIS_JAKE\Desktop\Work Programs\Code Files\products2.txt',sep='~', parse_dates=['Approval_Date'])
specificdrugs=pd.read_csv(r'C:\Users\WEGWEIS_JAKE\Desktop\Work Programs\Code Files\Drugs.txt',sep=',')
"""This is a dictionary that collects data from the .txt file
This dictionary has a key,value pair for every generic name with its corresponding approval date """
drugdict={}
for d in specificdrugs['Generic Name']:
    drugdict.dropna()
    drugdict[d]=orangebook[orangebook.Ingredient==d.upper()]['Approval_Date'].min()
What should I add or take away from this code to make sure that there are no key,value pairs in the dictionary with a value of nan?
from math import isnan
If NaNs are being stored as keys:
# functional (filter the items, then rebuild the dict)
clean_dict = dict(filter(lambda kv: not isnan(kv[0]), my_dict.items()))
# dict comprehension
clean_dict = {k: my_dict[k] for k in my_dict if not isnan(k)}
If NaNs are being stored as values:
# functional (filter the items, then rebuild the dict)
clean_dict = dict(filter(lambda kv: not isnan(kv[1]), my_dict.items()))
# dict comprehension
clean_dict = {k: my_dict[k] for k in my_dict if not isnan(my_dict[k])}
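With simplejson:
import simplejson
# ignore_nan=True serializes NaN as null, so the values come back as None rather than being removed
clean_dict = simplejson.loads(simplejson.dumps(my_dict, ignore_nan=True))
## or, depending on your needs, fail loudly: allow_nan=False raises ValueError if a NaN is present
clean_dict = simplejson.loads(simplejson.dumps(my_dict, allow_nan=False))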
Instead of trying to remove the NaNs from your dictionary, you should further investigate why NaNs are getting there in the first place.
It gets difficult to use NaNs in a dictionary, as a NaN does not equal itself.
Check this out for more information: NaNs as key in dictionaries
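The point about NaN not equalling itself can be seen with plain Python floats. This short sketch uses only the standard library; the variable names are illustrative.

```python
nan = float("nan")

# NaN never compares equal, not even to itself.
equal_to_itself = (nan == nan)

d = {nan: "first"}

# Lookup with the very same object succeeds, because dict lookups
# check identity before falling back to equality.
same_object_found = nan in d

# A different NaN object is neither identical nor equal, so it is not found.
other_nan_found = float("nan") in d

print(equal_to_itself, same_object_found, other_nan_found)
```

This is why a NaN key can end up effectively unreachable once the original object is gone.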
A slightly modified version of twinlakes's approach is to use pandas.isna() as follows:
If NaNs are being stored as keys:
# functional (filter the items, then rebuild the dict)
clean_dict = dict(filter(lambda kv: not pd.isna(kv[0]), my_dict.items()))
# dict comprehension
clean_dict = {k: my_dict[k] for k in my_dict if not pd.isna(k)}
If NaNs are being stored as values:
# functional (filter the items, then rebuild the dict)
clean_dict = dict(filter(lambda kv: not pd.isna(kv[1]), my_dict.items()))
# dict comprehension
clean_dict = {k: my_dict[k] for k in my_dict if not pd.isna(my_dict[k])}
This way it still works even when the fields are non-numeric.
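The reason non-numeric fields matter is that math.isnan raises TypeError for values such as strings. Where pandas is not available, a guarded check can be sketched as below; is_nan is a hypothetical helper written for this illustration, and unlike pd.isna it only recognises float NaN, not None or NaT.

```python
import math

def is_nan(value):
    """Hypothetical helper: True only for float NaN, False for anything else."""
    return isinstance(value, float) and math.isnan(value)

my_dict = {"a": 1.0, "b": float("nan"), "c": "text", "d": None}

# math.isnan alone cannot handle the string value.
try:
    math.isnan(my_dict["c"])
    raised = False
except TypeError:
    raised = True

# The guarded helper skips non-float values instead of failing.
clean_dict = {k: v for k, v in my_dict.items() if not is_nan(v)}
print(raised, clean_dict)
```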
I know this is old, but here is a simple approach that worked for me: remove the NaNs up front when reading the CSV:
orangebook = pd.read_csv(r'C:\Users\WEGWEIS_JAKE\Desktop\Work Programs\Code Files\products2.txt',sep='~', parse_dates=['Approval_Date']).dropna()
I also like to convert to a dictionary at the same time:
orangebook = pd.read_csv(r'C:\Users\WEGWEIS_JAKE\Desktop\Work Programs\Code Files\products2.txt',sep='~', parse_dates=['Approval_Date']).dropna().to_dict()

Format an array of tuples in a nice "table"

Say I have an array of tuples that looks like this:
[('url#id1', 'url#predicate1', 'value1'),
('url#id1', 'url#predicate2', 'value2'),
('url#id1', 'url#predicate3', 'value3'),
('url#id2', 'url#predicate1', 'value4'),
('url#id2', 'url#predicate2', 'value5')]
I would like to be able to return a nice 2D array so that I can display it "as is" in my page through Django.
The table would look like this:
[['', 'predicate1', 'predicate2', 'predicate3'],
['id1', 'value1', 'value2', 'value3'],
['id2', 'value4', 'value5', '']]
You will notice that the second item of each tuple became a column title of the table, and that we now have one row per id with the values in the columns.
How would you do that? Of course, if you have a better idea than the table example I gave, I would be happy to hear your thoughts :)
Right now I am generating a dict of dicts and displaying that in Django. But because the key/value pairs are not always in the same order in my dicts, my data is not displayed correctly.
Thanks!
Your dict of dict is probably on the right track. While you create that dict of dict, you could also maintain a list of ids and a list of predicates. That way, you can remember the ordering and build the table by looping through those lists.
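A minimal sketch of that suggestion, using the sample tuples from the question. The names cells, row_ids and predicates are chosen here for illustration, and the ids and predicates are kept as the full 'url#...' strings rather than shortened.

```python
dataset = [
    ('url#id1', 'url#predicate1', 'value1'),
    ('url#id1', 'url#predicate2', 'value2'),
    ('url#id1', 'url#predicate3', 'value3'),
    ('url#id2', 'url#predicate1', 'value4'),
    ('url#id2', 'url#predicate2', 'value5'),
]

cells = {}        # cells[row_id][predicate] = value
row_ids = []      # ids in order of first appearance
predicates = []   # predicates in order of first appearance

for row_id, predicate, value in dataset:
    if row_id not in cells:
        cells[row_id] = {}
        row_ids.append(row_id)
    if predicate not in predicates:
        predicates.append(predicate)
    cells[row_id][predicate] = value

# Header row first, then one row per id; missing cells become ''.
table = [[''] + predicates]
for row_id in row_ids:
    table.append([row_id] + [cells[row_id].get(p, '') for p in predicates])

for row in table:
    print(row)
```

Because the two lists fix the ordering, the table no longer depends on the order in which the inner dicts were filled.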
Using the zip function on your initial array (zip(*array)) will give you three sequences: the ids, the predicates and the values.
To get rid of duplicates, try the reduce function:
from functools import reduce
# Keep the first occurrence of each element, preserving order
list_without_duplicates = reduce(
    lambda l, x: l if x in l else l + [x], list_with_duplicates, [])
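The zip step can be sketched as follows with the sample tuples from the question. For removing duplicates this sketch uses dict.fromkeys, which keeps the first occurrence of each element in order; it is an alternative shown for illustration, not the method from the answer above.

```python
dataset = [
    ('url#id1', 'url#predicate1', 'value1'),
    ('url#id1', 'url#predicate2', 'value2'),
    ('url#id1', 'url#predicate3', 'value3'),
    ('url#id2', 'url#predicate1', 'value4'),
    ('url#id2', 'url#predicate2', 'value5'),
]

# Unpacking the rows into zip() transposes them into three tuples.
ids, predicates, values = zip(*dataset)

# dict keys are unique and keep insertion order, so this drops
# duplicates while preserving the order of first appearance.
unique_ids = list(dict.fromkeys(ids))
unique_predicates = list(dict.fromkeys(predicates))

print(unique_ids)
print(unique_predicates)
```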
OK,
In the end I came up with this code:
columns = dict()
columnsTitles = ['']  # first header cell is empty because the first column holds the ids
rows = dict()
colIdxCounter = 1 # Start with 1 because the first col are ids
rowIdxCounter = 1 # Start with 1 because the columns titles
for i in dataset:
    if i[0] not in rows:
        rows[i[0]] = rowIdxCounter
        rowIdxCounter += 1
    if i[1] not in columns:
        columns[i[1]] = colIdxCounter
        colIdxCounter += 1
        columnsTitles.append(i[1])
toRet = [columnsTitles]
# one row per id (dicts keep insertion order), with the id in the first cell
for rowId in rows:
    toRet.append([rowId] + [""] * (colIdxCounter - 1))
for i in dataset:
    toRet[rows[i[0]]][columns[i[1]]] = i[2]
for i in toRet:
    print(i)
Please don't hesitate to comment on it or improve it :)
