import csv

def statistics():
    statistics2 = {}
    with open("BLS_private.csv") as f:
        reader = csv.reader(f)
        for row in reader:
            statistics2 = row
    return statistics2

statistics()
Dictionary sample data:
['2005', '110718', '110949', '111094', '111440', '111583', '111844', '112124', '112311', '112395', '112491', '112795', '112935']
['2006', '113250', '113535', '113793', '113958', '113965', '114045', '114203', '114348', '114434', '114439', '114628', '114794']
How would I go about adding all of the values together in a row except for the first value in a dictionary?
The first value is always the year; like in the sample data, I have 2005 and 2006. I don't need to add the year.
Then I want to add together all of the values after that in the row. How would I do that?
(I also have a lot of years)
Welcome to StackOverflow :)
You can achieve this by utilizing list.pop(index) to grab the first item in the list as the key, and a list comprehension to calculate the sum of the remaining values.
## Assign variable for dictionary ##
dictionary = {}

## Data assuming it is formatted as lists within a list ##
data = [['2005', '110718', '110949', '111094', '111440', '111583', '111844', '112124', '112311', '112395', '112491', '112795', '112935'],
        ['2006', '113250', '113535', '113793', '113958', '113965', '114045', '114203', '114348', '114434', '114439', '114628', '114794']]

## Iterate over all lists within your data list
for i in data:
    ## Utilize list.pop(index) to grab and remove the first item in the list (index 0)
    key = i.pop(0)
    ## Create a key in the dictionary using the value we popped off the list,
    ## summing all values left in the list as integers if they are digits
    dictionary[key] = sum(int(v) for v in i if v.isdigit())
## Result
dictionary
{'2005': 1342679, '2006': 1369392}
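Applied to your original function, the same idea works directly on the rows coming out of csv.reader. A minimal sketch, assuming each row of BLS_private.csv starts with the year followed by numeric monthly values as in the sample above (sums_by_year is just an illustrative name):

import csv

def statistics():
    sums_by_year = {}
    with open("BLS_private.csv") as f:
        for row in csv.reader(f):
            # row[0] is the year; sum everything after it, skipping non-numeric cells
            sums_by_year[row[0]] = sum(int(v) for v in row[1:] if v.isdigit())
    return sums_by_year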
In the code below, I would like to update the fruit_dict dictionary with the mean price of each row. But the code is not working as expected. Kindly help.
#!/usr/bin/python3
import random
import numpy as np
import pandas as pd

price = np.array(range(20)).reshape(5, 4)  # sample data for illustration

fruit_keys = []  # list of keys for dictionary
for i in range(5):
    key = "fruit_" + str(i)
    fruit_keys.append(key)

# initialize a dictionary
fruit_dict = dict.fromkeys(fruit_keys)
fruit_list = []
# print(fruit_dict)

# update dictionary values
for i in range(price.shape[1]):
    for key, value in fruit_dict.items():
        for j in range(price.shape[0]):
            fruit_dict[key] = np.mean(price[j])
fruit_list.append(fruit_dict)

fruit_df = pd.DataFrame(fruit_list)
print(fruit_df)
Instead of creating the dictionary keys from the string pattern up front, you can build each key from the pattern and append the mean of the corresponding row in a single pass over the rows.
If you have a dictionary with a certain key pattern, you can update the values in a single loop, assigning each key in the pattern you need for display. You also don't need to create an additional list to build a data frame; you can create the data frame from the dictionary itself, as described in the documentation Here. I have provided a sample output which may be suitable for your requirement.
If you need an output with the mean values as a column and the fruits as rows, you can use the implementation below.
#!/usr/bin/python3
import random
import numpy as np
import pandas as pd

row = 5
column = 4
price = np.array(range(20)).reshape(row, column)  # sample data for illustration

# initialize a dictionary
fruit_dict = {}
for j in range(row):
    fruit_dict['fruit_' + str(j)] = np.mean(price[j])

fruit_df = pd.DataFrame.from_dict(fruit_dict, orient='index', columns=['mean_value'])
print(fruit_df)
This produces the output below. As mentioned, you can shape the data frame however you like from a dictionary by referring to the data frame documentation above.
mean_value
fruit_0 1.5
fruit_1 5.5
fruit_2 9.5
fruit_3 13.5
fruit_4 17.5
You shouldn't nest the loop over the range inside the loop over the dictionary items; you should iterate over them together, which you can do with enumerate().
You're also not using value, so there's no need to use items().
for i, key in enumerate(fruit_dict):
    fruit_dict[key] = np.mean(price[i])
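For completeness, a minimal sketch of that fix in context, reusing the price array and fruit_dict setup from the question (the 'mean_value' column name is just illustrative):

import numpy as np
import pandas as pd

price = np.array(range(20)).reshape(5, 4)
fruit_dict = dict.fromkeys("fruit_" + str(i) for i in range(5))

# one pass: row i of price supplies the mean for the i-th key
for i, key in enumerate(fruit_dict):
    fruit_dict[key] = np.mean(price[i])

fruit_df = pd.DataFrame.from_dict(fruit_dict, orient='index', columns=['mean_value'])
print(fruit_df)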
I arrived at a solution based on the answer provided by Sangeerththan; please find it below.
#!/usr/bin/python3
import numpy as np
import pandas as pd

fruit_dict = {}
fruit_list = []
price = np.array(range(40)).reshape(4, 10)

for i in range(price.shape[0]):
    mark_price = np.square(price[i])
    for j in range(mark_price.shape[0]):
        fruit_dict['proj_fruit_price_' + str(j)] = np.mean(mark_price[j])
    fruit_list.append(fruit_dict.copy())

fruit_df = pd.DataFrame(fruit_list)
You can use this instead of your loops:
fruit_keys = []  # list of keys for dictionary
for i in range(5):
    key = "fruit_" + str(i)
    fruit_keys.append(key)

out = {fruit_keys[index]: np.mean(price[index]) for index in range(price.shape[0])}
Output:
{'fruit_0': 1.5, 'fruit_1': 5.5, 'fruit_2': 9.5, 'fruit_3': 13.5, 'fruit_4': 17.5}
We have a dict with three different values, {value1: 1, value2: 0, value3: 2}, saved in a JSON file. Each time we run the code, those values increase by one; for example, it becomes {value1: 2, value2: 1, value3: 3}.
I am trying to add a rolling counter check to my code that verifies whether the values have incremented by 1. Once checked, another dict is created with the new values (so we end up with two dicts). If we run the code again after manually increasing the values in the first dict by 1, it should again compare against the previous run and overwrite the new dict with the new values.
In other words, this is what I tried:

#generally, this step saves three different values into X.json from another file.
#The values in that file increase in each iteration, hence this part of the code
#where I am checking whether the increase by one is happening
X_dict = dict()

with open('X.json', 'w') as f4:
    f4.write(json.dumps(X_dict))

#check if the file exists
if os.path.exists("X.json"):
    #read the file
    with open('X.json', 'r') as old:
        #create a new dictionary
        NEW = dict()
        #save the data of the initial X into OLD to use for comparison
        OLD = json.load(old)
        for key, value in NEW.items():
            #I am trying to compare old and new values: the new values are obtained
            #after running the code a second time with the values in X changed,
            #while the old dict has the older values
            if new_value = old_value + 1:
                NEW[key] = X[key]
            else:
                NEW[key] = OLD[key]
#if the file does not exist to start with
else:
    with open('X.json', 'w') as f5:
        f5.write(json.dumps(X_dict))
Values in the NEW dict should be added or updated; however, I am not getting anything and the dict remains empty.
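A few things stand out: NEW starts empty, so the for loop never runs; the file is overwritten with an empty dict before it is read; and `if new_value = old_value + 1` is an assignment rather than a comparison (`==`). A minimal sketch of the comparison logic, assuming X_dict holds this run's freshly computed values and X.json holds the previous run's values (names taken from the question; the exact file layout is an assumption):

import json
import os

# X_dict holds this run's values (assumed to be populated elsewhere)
X_dict = {"value1": 2, "value2": 1, "value3": 3}

NEW = {}
if os.path.exists("X.json"):
    with open("X.json") as f:
        OLD = json.load(f)
    for key, new_value in X_dict.items():
        # use == to compare; keep the new value only if it incremented by 1
        if new_value == OLD.get(key, 0) + 1:
            NEW[key] = new_value
        else:
            NEW[key] = OLD.get(key, new_value)
else:
    NEW = dict(X_dict)

# persist this run's values so the next run can compare against them
with open("X.json", "w") as f:
    json.dump(X_dict, f)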
I have a for loop that runs through a CSV file, grabs certain elements, and creates a dictionary based on two variables.
Code:
for ind, row in sf1.iterrows():
    sf1_date = row['datekey']
    sf1_ticker = row['ticker']
    company_date[sf1_ticker] = [sf1_date]
If, for example, during the first iteration of the for loop sf1_ticker = 'AAPL' and sf1_date = '2020/03/01', and the next time around sf1_ticker = 'AAPL' and sf1_date = '2020/06/01', how do I make the key 'AAPL' in the dictionary map to ['2020/03/01', '2020/06/01']?
It appears that when you say "key" you actually mean "value". The keys for a dictionary are the things that you use to lookup values in the dictionary. In your case ticker is the key and a list of dates are the values, e.g. you want a dictionary that looks like this:
{'AAPL': ['2020/03/01', '2020/06/01'],
 'MSFT': ['2020/04/01', '2020/09/01']}
Here the strings AAPL and MSFT are dictionary keys. The date lists are the values associated with each key.
Your code cannot construct such a dictionary because it assigns a new value to the key on every iteration. The following code will either create a new key in the dictionary company_date if the key does not already exist, or replace the existing value if it does:
company_date[sf1_ticker] = [sf1_date]
You need to append to a list of values in the dict, rather than replace the current list, if any. There are a couple of ways to do it; dict.setdefault() is one:
company_date = {}
for ind, row in sf1.iterrows():
    sf1_date = row['datekey']
    sf1_ticker = row['ticker']
    company_date.setdefault(sf1_ticker, []).append(sf1_date)
Another way is with a collections.defaultdict of list:
from collections import defaultdict

company_date = defaultdict(list)
for ind, row in sf1.iterrows():
    sf1_date = row['datekey']
    sf1_ticker = row['ticker']
    company_date[sf1_ticker].append(sf1_date)
You could create a new dictionary and add the date to the list if it exists. Otherwise, create the entry.
ticker_dates = {}
# Would give ticker_dates = {"AAPL": ['2020/03/01', '2020/06/01']}
for ind, row in sf1.iterrows():
    sf1_ticker = row['ticker']
    sf1_date = row['datekey']
    if sf1_ticker in ticker_dates:
        ticker_dates[sf1_ticker].append(sf1_date)
    else:
        ticker_dates[sf1_ticker] = [sf1_date]
You can use a defaultdict, which can be set up to add an empty list for any key that doesn't exist yet. It generally acts like a dictionary otherwise.
from collections import defaultdict
rows = [
    ['AAPL', '2020/03/01'],
    ['AAPL', '2020/06/01'],
    ['GOOGL', '2021/01/01']
]

company_date = defaultdict(list)
for ticker, date in rows:
    company_date[ticker].append(date)
print(company_date)
# defaultdict(<class 'list'>, {'AAPL': ['2020/03/01', '2020/06/01'], 'GOOGL': ['2021/01/01']})
The code below maps values and column names of my reference df against my actual dataset, finding exact matches; if an exact match is found, it returns the OutputValue. However, I'm trying to add the rule that when PrimaryValue = DEFAULT, the OutputValue is also returned.
The solution I'm trying out is to create a new dataframe with null values where the code below found no match. The next step would then be to target the null values whose corresponding PrimaryValue = DEFAULT and replace the nulls with the OutputValue.
#create a map based on columns from reference_df
map_key = concat_ws('\0', final_reference.PrimaryName, final_reference.PrimaryValue)
map_value = final_reference.OutputValue
#dataframe of concatenated mappings to get the corresponding OutputValues from the reference table
d = final_reference.agg(collect_set(array(concat_ws('\0','PrimaryName','PrimaryValue'), 'OutputValue')).alias('m')).first().m
#display(d)
#iterate through mapped values
mappings = create_map([lit(i) for i in chain.from_iterable(d)])
#dataframe with corresponding matched OutputValues
dataset = datasetM.select("*",*[ mappings[concat_ws('\0', lit(c), col(c))].alias(c_name) for c,c_name in matched_List.items()])
display(dataset)
From the discussion in the comments, I think you just need to build a default mapping from the existing one and then use the coalesce() function to find the first non-null value; see below:
from pyspark.sql.functions import collect_set, array, concat_ws, lit, col, create_map, coalesce
# skip some old code
d
#[['LeaseStatus\x00Abandoned', 'Active'],
# ['LeaseStatus\x00DEFAULT', 'Pending'],
# ['LeaseRecoveryType\x00Gross-modified', 'Modified Gross'],
# ['LeaseStatus\x00Archive', 'Expired'],
# ['LeaseStatus\x00Terminated', 'Terminated'],
# ['LeaseRecoveryType\x00Gross w/base year', 'Modified Gross'],
# ['LeaseRecoveryType\x00Gross', 'Gross']]
# original mapping
mappings = create_map([ lit(j) for i in d for j in i ])
# default mapping
mappings_default = create_map([ lit(j.split('\0')[0]) for i in d if i[0].upper().endswith('\x00DEFAULT') for j in i ])
#Column<b'map(LeaseStatus, Pending)'>
# a set of available PrimaryLookupAttributeName
available_list = set([ i[0].split('\0')[0] for i in d ])
# {'LeaseRecoveryType', 'LeaseStatus'}
# use coalesce to find the first non-null value from mappings, mappings_default, etc.
datasetPrimaryAttributes_False = datasetMatchedPortfolio.select("*", *[
    coalesce(
        mappings[concat_ws('\0', lit(c), col(c))],
        mappings_default[c],
        lit("Not Specified at Source" if c in available_list else "Lookup not found")
    ).alias(c_name) for c, c_name in matchedAttributeName_List.items()
])
Some explanation:
(1) d is a list of lists retrieved from reference_df. We use the list comprehension [ lit(j) for i in d for j in i ] to flatten it into a single list and pass the flattened list to the create_map function.
(2) mappings_default is similar, but adds an if condition as a filter to keep only entries whose PrimaryLookupAttributeValue (the first item of the inner list, i[0]) ends with \x00DEFAULT, and then uses split to strip the \x00DEFAULT suffix off the map key.
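As a plain-Python illustration of the flattening in (1) and the filter in (2), using two entries from the sample d above (without the lit() wrapper, since that part is pure list manipulation):

d = [['LeaseStatus\x00Abandoned', 'Active'],
     ['LeaseStatus\x00DEFAULT', 'Pending']]

# flatten the list of [key, value] pairs into one alternating list,
# which is the shape create_map expects
flat = [j for i in d for j in i]
print(flat)
# ['LeaseStatus\x00Abandoned', 'Active', 'LeaseStatus\x00DEFAULT', 'Pending']

# the DEFAULT filter from (2): keep only entries ending with '\x00DEFAULT',
# stripping the key down to the attribute name
defaults = [j.split('\x00')[0] for i in d if i[0].upper().endswith('\x00DEFAULT') for j in i]
print(defaults)
# ['LeaseStatus', 'Pending']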
I'm trying to find the average and standard deviation of multiple columns of my dataset and save them as new columns in a new dataframe, i.e. for every 'GROUP' in the dataset, I want columns in the new dataframe with its average and SD. I came up with the following script, but I'm not able to name the lists dynamically.
Average_F1_S_list, Average_F1_M_list, SD_F1_S_list, SD_F1_M_list = ([] for i in range(4))

Groups = DF['GROUP'].unique().tolist()
for key in Groups:
    Average_F1_S = DF_DICT[key]['F1_S'].mean()
    Average_F1_S_list.append(Average_F1_S)
    SD_F1_S = DF_DICT[key]['F1_S'].std()
    SD_F1_S_list.append(SD_F1_S)

    Average_F1_M = DF_DICT[key]['F1_M'].mean()
    Average_F1_M_list.append(Average_F1_M)
    SD_F1_M = DF_DICT[key]['F1_M'].std()
    SD_F1_M_list.append(SD_F1_M)

df = pd.DataFrame({'Group': Groups,
                   'Average_F1_S': Average_F1_S_list, 'Standard_Dev_F1_S': SD_F1_S_list,
                   'Average_F1_M': Average_F1_M_list, 'Standard_Dev_F1_M': SD_F1_M_list},
                  columns=['Group', 'Average_F1_S', 'Standard_Dev_F1_S', 'Average_F1_M', 'Standard_Dev_F1_M'])
This will not be a good solution as there are too many features. Is there any way I can create the lists dynamically?
This should do the trick! Hope this helps.
# These are all the keys you want
key_names = ['F1_S', 'F1_M']

# Holds the data you want to pass to the dataframe
df_info = {'Groups': Groups}

for group_name in Groups:
    # For each group in the groups, we iterate over all the keys we want
    for key in key_names:
        # Generate the column names you want for your dataframe
        avg_key_name = key + '_Average'
        std_key_name = key + '_Standard_Dev'
        if avg_key_name not in df_info:
            df_info[avg_key_name] = []
            df_info[std_key_name] = []
        df_info[avg_key_name].append(DF_DICT[group_name][key].mean())
        df_info[std_key_name].append(DF_DICT[group_name][key].std())

df = pd.DataFrame(df_info)
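As a side note, if all groups live in the single frame DF (which has the 'GROUP' column), pandas can compute the same table in one call with groupby/agg. A sketch, assuming DF also contains the feature columns such as 'F1_S' and 'F1_M':

feature_cols = ['F1_S', 'F1_M']  # extend with however many features you have

stats = DF.groupby('GROUP')[feature_cols].agg(['mean', 'std'])

# flatten the (feature, stat) column MultiIndex into the naming used above
stats.columns = [
    ('Average_' if stat == 'mean' else 'Standard_Dev_') + feature
    for feature, stat in stats.columns
]
stats = stats.reset_index().rename(columns={'GROUP': 'Group'})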