In the code below, I would like to update the fruit_dict dictionary with the mean price of each row. But the code is not working as expected. Kindly help.
#!/usr/bin/python3
import random
import numpy as np
import pandas as pd
price=np.array(range(20)).reshape(5,4) #sample data for illustration
fruit_keys = [] # list of keys for dictionary
for i in range(5):
key = "fruit_" + str(i)
fruit_keys.append(key)
# initialize a dictionary
fruit_dict = dict.fromkeys(fruit_keys)
fruit_list = []
# print(fruit_dict)
# update dictionary values
for i in range(price.shape[1]):
for key,value in fruit_dict.items():
for j in range(price.shape[0]):
fruit_dict[key] = np.mean(price[j])
fruit_list.append(fruit_dict)
fruit_df = pd.DataFrame(fruit_list)
print(fruit_df)
Instead of creating the dictionary with the string pattern you can append the values for the means of rows as a string pattern by iterating the rows only.
In case if you have a dictionary with a certain pattern you can update the value in a single loop by assigning the key as the pattern which you need for displaying. you don't need to create an additional list for creating a data frame instead you can refer the documentation for creating data frames from dictionary itself Here. I have provided a sample output which may be suitable for your requirement.
In case you need an output with mean value as a column and fruits as rows you can use the below implementation.
#!/usr/bin/python3
import random
import numpy as np
import pandas as pd
row = 5
column = 4
price = np.array(range(20)).reshape(row, column) # sample data for illustration
# initialize a dictionary
fruit_dict = {}
for j in range(row):
fruit_dict['fruit_'+str(j)] = np.mean(price[j])
fruit_df = pd.DataFrame.from_dict(fruit_dict,orient='index',columns=['mean_value'])
print(fruit_df)
This will provide an output like below. As I already mentioned you can create the data frame as you wish from a dictionary by referring the above data frame documentation.
mean_value
fruit_0 1.5
fruit_1 5.5
fruit_2 9.5
fruit_3 13.5
fruit_4 17.5
`
You shouldn't nest the loop over the range and the dictionary items, you should iterate over them together. You can do this with enumerate().
You're also not using value, so there's no need to use items().
for i, key in enumerate(fruit_dict):
fruit_dict[key] = np.mean(price[j])
Could arrive on a solution based on the answer provided by Sangeerththan. Please find the same below.
#!/usr/bin/python3
fruit_dict = {}
fruit_list =[]
price=np.array(range(40)).reshape(4,10)
for i in range(price.shape[0]):
mark_price = np.square(price[i])
for j in range(mark_price.shape[0]):
fruit_dict['proj_fruit_price_'+str(j)] = np.mean(mark_price[j])
fruit_list.append(fruit_dict.copy())
fruit_df = pd.DataFrame(fruit_list)
You can use this instead of your loops:
fruit_keys = [] # list of keys for dictionary
for i in range(5):
key = "fruit_" + str(i)
fruit_keys.append(key)
out = {fruit_keys[index]: np.mean(price[index]) for index in range(price.shape[0])}
Output:
{'fruit_1': '1.5', 'fruit_2': '5.5', 'fruit_3': '9.5', 'fruit_4': '13.5', 'fruit_5': '17.5'}
Related
I really do not understand how there was the command (if "entry" in langs_count) is possible when the dictionary was initialized to be empty, so what is inside the dictionary and how did it get there? I'm really confused
`
import pandas as pd
# Import Twitter data as DataFrame: df
df = pd.read_csv("tweets.csv")
# Initialize an empty dictionary: langs_count
langs_count = {}
# Extract column from DataFrame: col
col = df['lang']
# Iterate over lang column in DataFrame
for entry in col:
# If the language is in langs_count, add 1
if entry in langs_count.keys():
langs_count[entry]+=1
# Else add the language to langs_count, set the value to 1
else:
langs_count[entry]=1
# Print the populated dictionary
print(langs_count)
`
You can implement the count functionality using groupby.
import pandas as pd
# Import Twitter data as DataFrame: df
df = pd.read_csv("tweets.csv")
# Populate dictionary with count of occurrences in 'lang' column
langs_count = dict(df.groupby(['lang']).size())
# Print the populated dictionary
print(langs_count)
I have a for loop that runs through a CSV file and grabs certain elements and creates a dictionary based on two variables.
Code:
for ind, row in sf1.iterrows():
sf1_date = row['datekey']
sf1_ticker = row['ticker']
company_date[sf1_ticker] = [sf1_date]
I for example during the first iteration of the for loop, sf1_ticker = 'AAPL' and sf1_date = '2020/03/01' and the next time around, sf1_ticker = 'AAPL' and sf1_date = '2020/06/01', how do I make the key of 'AAPL' in the dictionary equal to ['2020/03/01', '2020/06/01']
It appears that when you say "key" you actually mean "value". The keys for a dictionary are the things that you use to lookup values in the dictionary. In your case ticker is the key and a list of dates are the values, e.g. you want a dictionary that looks like this:
{'AAPL': ['2020/03/01', '2020/06/01'].
'MSFT': ['2020/04/01', '2020/09/01']}
Here the strings AAPL and MSFT are dictionary keys. The date lists are the values associated with each key.
Your code can not construct such a dictionary because it is assigning a new value to the key. The following code will either create a new key in the dictionary company_date if the key does not already exist in the dictionary, or replace the existing value if the key already exists:
company_date[sf1_ticker] = [sf1_date]
You need to append to a list of values in the dict, rather than replace the current list, if any. There are a couple of ways to do it; dict.setdefault() is one:
company_date = {}
for ind, row in sf1.iterrows():
sf1_date = row['datekey']
sf1_ticker = row['ticker']
company_date.setdefault(sf1_ticker, []).append(sf1_date)
Another way is with a collections.defaultdict of list:
from collections import defaultdict
company_date = defaultdict(list)
for ind, row in sf1.iterrows():
sf1_date = row['datekey']
sf1_ticker = row['ticker']
company_date[sf1_ticker].append(sf1_date)
You could create a new dictionary and add the date to the list if it exists. Otherwise, create the entry.
ticker_dates = {}
# Would give ticker_dates = {"AAPL":['2020/03/01', '2020/06/01']}
for ind,row in sft1.iterrows():
sf1_ticker = row['ticker']
sf1_date = row['datekey']
if sf1_ticker in ticker_dates:
ticker_date[sf1_ticker].append(sf1_date)
else:
ticker_dates[sf1_ticker] = [sf1_date]
You can use a defaultdict, which can be setup to add an empty list to any item that doesn't exist. It generally acts like a dictionary otherwise.
from collections import defaultdict
rows = [
['AAPL', '2020/03/01'],
['AAPL', '2020/06/01'],
['GOOGL', '2021/01/01']
]
company_date = defaultdict(list)
for ticker, date in rows:
company_date[ticker].append(date)
print(company_date)
# defaultdict(<class 'list'>, {'AAPL': ['2020/03/01', '2020/06/01'], 'GOOGL': ['2021/01/01']})
I have a data frame in pandas, one of the columns contains time intervals presented as strings like 'P1Y4M1D'.
The example of the whole CSV:
oci,citing,cited,creation,timespan,journal_sc,author_sc
0200100000236252421370109080537010700020300040001-020010000073609070863016304060103630305070563074902,"10.1002/pol.1985.170230401","10.1007/978-1-4613-3575-7_2",1985-04,P2Y,no,no
...
I created a parsing function, that takes that string 'P1Y4M1D' and returns an integer number.
I am wondering how is it possible to change all the column values to parsed values using that function?
def do_process_citation_data(f_path):
global my_ocan
my_ocan = pd.read_csv("citations.csv",
names=['oci', 'citing', 'cited', 'creation', 'timespan', 'journal_sc', 'author_sc'],
parse_dates=['creation', 'timespan'])
my_ocan = my_ocan.iloc[1:] # to remove the first row iloc - to select data by row numbers
my_ocan['creation'] = pd.to_datetime(my_ocan['creation'], format="%Y-%m-%d", yearfirst=True)
return my_ocan
def parse():
mydict = dict()
mydict2 = dict()
i = 1
r = 1
for x in my_ocan['oci']:
mydict[x] = str(my_ocan['timespan'][i])
i +=1
print(mydict)
for key, value in mydict.items():
is_negative = value.startswith('-')
if is_negative:
date_info = re.findall(r"P(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)D)?$", value[1:])
else:
date_info = re.findall(r"P(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)D)?$", value)
year, month, day = [int(num) if num else 0 for num in date_info[0]] if date_info else [0,0,0]
daystotal = (year * 365) + (month * 30) + day
if not is_negative:
#mydict2[key] = daystotal
return daystotal
else:
#mydict2[key] = -daystotal
return -daystotal
#print(mydict2)
#return mydict2
Probably I do not even need to change the whole column with new parsed values, the final goal is to write a new function that returns average time of ['timespan'] of docs created in a particular year. Since I need parsed values, I thought it would be easier to change the whole column and manipulate a new data frame.
Also, I am curious what could be a way to apply the parsing function on each ['timespan'] row without modifying a data frame, I can only assume It could be smth like this, but I don't have a full understanding of how to do that:
for x in my_ocan['timespan']:
x = parse(str(my_ocan['timespan'])
How can I get a column with new values? Thank you! Peace :)
A df['timespan'].apply(parse) (as mentioned by #Dan) should work. You would need to modify only the parse function in order to receive the string as an argument and return the parsed string at the end. Something like this:
import pandas as pd
def parse_postal_code(postal_code):
# Splitting postal code and getting first letters
letters = postal_code.split('_')[0]
return letters
# Example dataframe with three columns and three rows
df = pd.DataFrame({'Age': [20, 21, 22], 'Name': ['John', 'Joe', 'Carla'], 'Postal Code': ['FF_222', 'AA_555', 'BB_111']})
# This returns a new pd.Series
print(df['Postal Code'].apply(parse_postal_code))
# Can also be assigned to another column
df['Postal Code Letter'] = df['Postal Code'].apply(parse_postal_code)
print(df['Postal Code Letter'])
import csv
def statistics():
statistics2 = {}
with open("BLS_private.csv") as f:
reader = csv.reader(f)
for row in reader:
statistics2 = row
return statistics2
statistics()
Dicionary sample data:
['2005', '110718', '110949', '111094', '111440', '111583', '111844', '112124', '112311', '112395', '112491', '112795', '112935']
['2006', '113250', '113535', '113793', '113958', '113965', '114045', '114203', '114348', '114434', '114439', '114628', '114794']
How would I go about adding all of the values together in a row except for the first value in a dictionary?
The first value is always the year; like in the sample data i have 2005 and 2006. I don't need to add the year.
Then I want to add together all of the values after that in the row. How would I do that?
(I also have a lot of years)
Welcome to StackOverflow :)
You can achieve this by utilizing list.pop(index) to grab the first item in the list as the key, and list comprehension to calculate the sums of the remaining values.
## Assign variable for dictionary ##
dictionary = {}
## Data assuming it is formatted as lists within a list ##
data = [['2005', '110718', '110949', '111094', '111440', '111583', '111844', '112124', '112311', '112395', '112491', '112795', '112935'],
['2006', '113250', '113535', '113793', '113958', '113965', '114045', '114203', '114348', '114434', '114439', '114628', '114794']]
## Iterate over all lists within your data list
for i in data:
## Utilize list.pop(index) to grab and remove the first item in the list (Index 0)
key = i.pop(0)
## Create a key in the dictonary using the value we popped off the list
## Sum all values left in the list as an integer if it is a digit
dictionary[key] = sum([int(v) for v in i if v.isdigit()])
## Result
dictionary
{'2005': 1342679, '2006': 1369392}
Hi there’s an Excel spreadsheet showing Product ID and Location, like below.
I want to list all the locations of each product ID in sequence with no duplication.
For example:
53424 has Phoenix, Matsuyama, Phoenix, Matsuyama, Phoenix, Matsuyama, Phoenix.
56224 has Samarinda, Boise. Seoul.
etc.
What's the best way to achieve it with Python?
I can only read the cells in the spreadsheet but have no idea what’s good to proceed.
Thank you.
the_file = xlrd.open_workbook("C:\\excel file.xlsx")
the_sheet = the_file.sheet_by_name("Sheet1")
for row_index in range(0, the_sheet.nrows):
product_id = the_sheet.cell(row_index, 0).value
location = the_sheet.cell(row_index, 1).value
You need to make use of Python's groupby() function to take away the duplicates as follows:
from collections import defaultdict
from itertools import groupby
import xlrd
the_file = xlrd.open_workbook(r"excel file.xlsx")
the_sheet = the_file.sheet_by_name("Sheet1")
products = defaultdict(list)
for row_index in range(1, the_sheet.nrows):
products[int(the_sheet.cell(row_index, 0).value)].append(the_sheet.cell(row_index, 1).value)
for product, v in sorted(products.items()):
print "{} has {}.".format(product, ', '.join(k for k, g in groupby(v)))
This uses a defaultlist() with a dictionary to build your products. So each key in the dictionary holds your product ID and the contents is automatically a list of the matching entries. Finally the groupby() is used to read out each raw value and only give you one entry for the cases where there are consecutive identically values. Finally the list this produces is joined together with commas between them.
You should use a dictionary to store the data from excel and then traverse it according to product ID.
So, following code should help you out -
the_file = xlrd.open_workbook("C:\\excel file.xlsx")
the_sheet = the_file.sheet_by_name("Sheet1")
dataset = dict()
for row_index in range(0, the_sheet.nrows):
product_id = the_sheet.cell(row_index, 0).value
location = the_sheet.cell(row_index, 1).value
if product_id in dataset:
dataset[product_id].append(location)
else:
dataset[product_id] = [location]
for product_id in sorted(dataset.keys()):
print "{0} has {1}.".format(product_id, ", ".join(dataset[product_id]))
Above will preserve the order of locations as per product_id (in sequence).