I have two python dictionaries:
ccyAr = {'AUDCAD','AUDCHF','AUDJPY','AUDNZD','AUDUSD','CADCHF','CADJPY','CHFJPY','EURAUD','EURCAD','EURCHF','EURGBP','EURJPY','EURNZD','EURUSD','GBPAUD','GBPCAD','GBPCHF','GBPJPY','GBPNZD','GBPUSD','NZDCAD','NZDCHF','NZDJPY','NZDUSD','USDCAD','USDCHF','USDJPY'}
data = {'BTrades', 'BPips', 'BProfit', 'STrades', 'SPips', 'SProfit', 'Trades', 'Pips', 'Profit', 'Won', 'WonPC', 'Lost', 'LostPC'}
I've been trying to get my head round how to most elegantly create a construct in which each element of 'data' exists for each element of 'ccyAr'. The following two feel closest, but the first (I now realise) produces lists as values, and the second is more like pseudocode:
1.
table={ { data:[] for d in data } for ccy in ccyAr }
2.
for ccy in ccyAr:
    for d in data:
        table['ccy']['d'] = 0
I also want to set each of the entries to int 0, and I'd like to do it in one go. I'm struggling with the comprehension method, as I end up creating each value of each inner dictionary as a list instead of the value 0.
I've seen the autovivification piece but I don't want to mimic perl, I want to do it the pythonic way. Any help = cheers.
for ccy in ccyAr:
    for d in data:
        table['ccy']['d'] = 0
is close.
table = {}
for ccy in ccyAr:
    table[ccy] = {}
    for d in data:
        table[ccy][d] = 0
Also, ccyAr and data in your question are sets, not dictionaries.
What you are searching for is a pandas DataFrame of shape data x ccyAr. I give a minimal example here:
import pandas as pd
import numpy as np

data = {'1', '2'}
ccyAr = {'a', 'b', 'c'}
df = pd.DataFrame(np.zeros((len(data), len(ccyAr))))
Then the most important step is to set both the columns and the index. If your two so-called dictionaries are in fact sets (as it seems in your code), use:
df.columns = ccyAr
df.index = data
If they are indeed dictionaries, you instead have to call their keys method:
df.columns = ccyAr.keys()
df.index = data.keys()
You can print df to see that this is actually what you wanted:
| a | c | b
-------------
1 | 0 0 0
2 | 0 0 0
And now if you access it via df['a'][1], it returns 0. This is arguably the best fit for your problem.
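As a side note, the zeros array and the two separate assignments can be collapsed into one call, since the DataFrame constructor accepts a scalar fill value when index and columns are given. A minimal sketch, assuming the same set inputs as above (sorted() just pins down an order, since sets are unordered):

```python
import pandas as pd

data = {'1', '2'}
ccyAr = {'a', 'b', 'c'}

# Scalar fill value + explicit index/columns builds the zero table directly.
df = pd.DataFrame(0, index=sorted(data), columns=sorted(ccyAr))
print(df.loc['1', 'a'])  # 0
```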
How to do this using a dictionary comprehension:
table = {ccy:{d:0 for d in data} for ccy in ccyAr}
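A quick check (my own sketch, with two small sample sets) that the comprehension gives each currency its own inner dict. This is worth verifying, because the superficially similar dict.fromkeys(ccyAr, {d: 0 for d in data}) would share a single inner dict across all keys:

```python
ccyAr = {'EURUSD', 'GBPUSD'}
data = {'Trades', 'Pips'}

table = {ccy: {d: 0 for d in data} for ccy in ccyAr}

# Mutating one inner dict must not affect the others.
table['EURUSD']['Trades'] += 1
print(table['EURUSD']['Trades'])  # 1
print(table['GBPUSD']['Trades'])  # still 0
```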
I have a dictionary of dataframes (the key is the name of the data frame and the value is the rows/columns). Each dataframe within the dictionary has just 2 columns and varying numbers of rows. I also have a list that has all of the keys in it.
I need to use a for-loop to iteratively name each dataframe with the key and have it saved outside of the dictionary. I know I can access each data frame through the dictionary, but I don't want to do it that way. I am using Spyder, so I like to look at my tables in the Variable Explorer rather than printing them to the console. Additionally, I would like to modify some of the completed data frames, and I need them to be their own objects for that.
Here is my code to make the dictionary (I did this because I wanted to look at all of the categories in each column with the frequency of those values):
import pandas as pd
mydict = {
    "dummy": [1, 1, 1],
    "type": ["new", "old", "new"],
    "location": ["AB", "BC", "ON"]
}
mydf = pd.DataFrame(mydict)
colnames = mydf.columns.tolist()
mydict2 = {}
for i in colnames:
    mydict2[i] = pd.DataFrame(mydf.groupby([i, 'dummy']).size())
print(mydict2)
mydf looks like this:
   dummy type location
0      1  new       AB
1      1  old       BC
2      1  new       ON
the output of print(mydict2) looks like this:
{'dummy': 0
dummy dummy
1 1 3, 'type': 0
type dummy
new 1 2
old 1 1, 'location': 0
location dummy
AB 1 1
BC 1 1
ON 1 1}
I want the final output to look like this:
Type:
Type   Dummy
new        2
old        1

Location:
Location   Dummy
AB             1
BC             1
ON             1
I am basically just trying to generate a frequency table for each column in the original table, using a loop. Any help would be much appreciated!
I believe this yields the correct output:
type_count = mydf[["type", "dummy"]].groupby(by=['type'])['dummy'].sum().reset_index()
loca_count = mydf[["location", "dummy"]].groupby(by=['location'])['dummy'].sum().reset_index()
Edit:
Dynamically, you could build all the dataframes in a loop like below (assuming you want to aggregate based on the dummy column):
df_list = []
for name in colnames:
    if name != "dummy":
        df_list.append(mydf[[name, "dummy"]].groupby(by=[name])['dummy'].sum().reset_index())
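Putting the two ideas together, here is a self-contained sketch (variable names of my own choosing) that builds one frequency table per non-dummy column and binds each to its own variable, which is what makes them show up individually in Spyder's Variable Explorer:

```python
import pandas as pd

mydf = pd.DataFrame({"dummy": [1, 1, 1],
                     "type": ["new", "old", "new"],
                     "location": ["AB", "BC", "ON"]})

# One frequency table per column, renamed to match the desired layout.
tables = {}
for name in ["type", "location"]:
    tables[name] = (mydf.groupby(name)["dummy"].sum()
                        .reset_index()
                        .rename(columns={name: name.capitalize(),
                                         "dummy": "Dummy"}))

# Plain variables, visible in the Variable Explorer and free to modify.
type_count = tables["type"]
loca_count = tables["location"]
print(type_count)
```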
I was able to solve the problem described below, but as I am a newbie, I am not sure if my solution is good. I'd be grateful for any tips on how to do it in a more efficient and/or more elegant manner.
What I have:
...and so on (the table's quite big).
What I need:
How I solved it:
Load the file
df = pd.read_csv("survey_data_cleaned_ver2.csv")
Define a function
def transform_df(df, list_2, column_2, list_1, column_1='Respondent'):
    for ind in df.index:
        elements = df[column_2][ind].split(';')
        num_of_elements = len(elements)
        for num in range(num_of_elements):
            list_1.append(df['Respondent'][ind])
        for el in elements:
            list_2.append(el)
Dropna because NaNs are floats and that was causing errors later on.
df_LanguageWorkedWith = df[['Respondent', 'LanguageWorkedWith']]
df_LanguageWorkedWith.dropna(subset=['LanguageWorkedWith'], inplace=True)
Create empty lists
Respondent_As_List = []
LanguageWorkedWith_As_List = []
Call the function
transform_df(df_LanguageWorkedWith, LanguageWorkedWith_As_List, 'LanguageWorkedWith', Respondent_As_List)
Transform the lists into dataframes
df_Respondent = pd.DataFrame(Respondent_As_List, columns=["Respondent"])
df_LanguageWorked = pd.DataFrame(LanguageWorkedWith_As_List, columns=["LanguageWorkedWith"])
Concatenate those dataframes
df_LanguageWorkedWith_final = pd.concat([df_Respondent, df_LanguageWorked], axis=1)
And that's it.
The code and input file can be found on my GitHub: https://github.com/jarsonX/Temp_files
Thanks in advance!
You can try it like this. I haven't tested it exhaustively, but it should work:
df['LanguageWorkedWith'] = df['LanguageWorkedWith'].str.replace(';',',')
df =df.assign(LanguageWorkedWith=df['LanguageWorkedWith'].str.split(',')).explode('LanguageWorkedWith')
#Tested
LanguageWorkedWith Respondent
0 C 4
0 C++ 4
0 C# 4
0 Python 4
0 SQL 4
... ... ...
10319 Go 25142
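For reference, here is a minimal self-contained version of the split-and-explode approach, with a made-up two-row frame standing in for the survey data (explode requires pandas 0.25 or newer):

```python
import pandas as pd

df = pd.DataFrame({"Respondent": [4, 25142],
                   "LanguageWorkedWith": ["C;C++;Python", "Go"]})

# Split each semicolon-separated string into a list, then explode the
# lists into one row per language, repeating the Respondent value.
out = (df.assign(LanguageWorkedWith=df["LanguageWorkedWith"].str.split(";"))
         .explode("LanguageWorkedWith")
         .reset_index(drop=True))
print(out)
```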
I'm searching for the difference between columns in a DataFrame and data in a list.
I'm doing it this way:
# pickled_data => list of dics
pickled_names = [d['company'] for d in pickled_data] # get values from dictionary to list
diff = df[~df['company_name'].isin(pickled_names)]
which works fine, but I realized that I need to check not only for company_name but also for place, because there could be two companies with the same name.
df also contains a place column, and each dictionary in pickled_data has a place key.
I would like to be able to do something like this
pickled_data = [(d['company'], d['place']) for d in pickled_data]
diff = df[~df['company_name', 'place'].isin(pickled_data)] # For two values in same row
You can convert values to MultiIndex by MultiIndex.from_tuples, then convert both columns too and compare:
pickled_data = [(d['company'], d['place']) for d in pickled_data]
mux = pd.MultiIndex.from_tuples(pickled_data)
diff = df[~df.set_index(['company_name', 'place']).index.isin(mux)]
Sample:
data = {'company_name': ['A1','A2','A2','A1','A1','A3'],
        'place': list('sdasas')}
df = pd.DataFrame(data)
pickled_data = [('A1','s'),('A2','d')]
mux = pd.MultiIndex.from_tuples(pickled_data)
diff = df[~df.set_index(['company_name', 'place']).index.isin(mux)]
print (diff)
company_name place
2 A2 a
4 A1 a
5 A3 s
You can form a set of tuples from your pickled_data for faster lookup later, then using a list comprehension over company_name and place columns of the frame, we get a boolean list of whether they are in the frame or not. Then we use this to index into the frame:
comps_and_places = set((d["company"], d["place"]) for d in pickled_data)
not_in_list = [(c, p) not in comps_and_places
for c, p in zip(df.company_name, df.place)]
diff = df[not_in_list]
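Another option worth knowing (my own sketch, not part of the answers above) is a left merge with indicator=True, which flags each row of df as matched or 'left_only'; using the same sample data:

```python
import pandas as pd

df = pd.DataFrame({"company_name": ["A1", "A2", "A2", "A1", "A1", "A3"],
                   "place": list("sdasas")})
pickled = pd.DataFrame([("A1", "s"), ("A2", "d")],
                       columns=["company_name", "place"])

# Rows whose (company_name, place) pair never matched a pickled row
# come back flagged 'left_only' in the merge indicator column.
diff = (df.merge(pickled, on=["company_name", "place"],
                 how="left", indicator=True)
          .query("_merge == 'left_only'")
          .drop(columns="_merge"))
print(diff)
```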
I have created a dictionary with the piece of code:
dat[r["author_name"]] = (r["num_deletions"], r["num_insertions"],
                         r["num_lines_changed"], r["num_files_changed"],
                         r["author_date"])
I then want to take this dictionary and create a pandas DataFrame with columns
author_name | num_deletions | num_insertions | num_lines_changed |num_files changed | author_date
I tried this:
df = pd.DataFrame(list(dat.iteritems()),
columns=['author_name',"num_deletions", "num_insertions", "num_lines_changed",
"num_files_changed", "author_date"])
But it does not work since it is reading the key and the tuple of the dictionary as only two columns instead of six. So how can I take each of the five entries in the tuple and divide them into their own columns
You need the key and value at the same nesting level:
df = pd.DataFrame([(key,)+val for key, val in dat.items()],
columns=["author_name", "num_deletions",
"num_insertions", "num_lines_changed",
"num_files_changed", "author_date"])
You could also use
df = pd.DataFrame.from_dict(dat, orient='index').reset_index()
df.columns = ["author_name", "num_deletions",
"num_insertions", "num_lines_changed",
"num_files_changed", "author_date"]
Which seems to be a bit faster if you have roughly 10,000 rows or more.
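A tiny check of the (key,) + val unpacking, with made-up commit stats (the names and numbers are illustrative only):

```python
import pandas as pd

dat = {"alice": (1, 2, 3, 4, "2020-01-01"),
       "bob":   (5, 6, 7, 8, "2020-02-01")}

cols = ["author_name", "num_deletions", "num_insertions",
        "num_lines_changed", "num_files_changed", "author_date"]

# Prepending the key to each value tuple flattens everything to one level,
# so each row is a single 6-tuple matching the 6 column names.
df = pd.DataFrame([(key,) + val for key, val in dat.items()], columns=cols)
print(df.shape)  # (2, 6)
```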
This should work.
import pandas as pd
df = pd.DataFrame(columns=['author_name', 'num_deletions', 'num_insertions',
                           'num_lines_changed', 'num_files_changed', 'author_date'])
I have a first data frame looking like this:
item_id | options
------------------------------------------
item_1_id | [option_1_id, option_2_id]
And a second like this:
option_id | option_name
---------------------------
option_1_id | option_1_name
And I'd like to transform my first data set to:
item_id | options
----------------------------------------------
item_1_id | [option_1_name, option_2_name]
What is an elegant way to do so using Pandas' data frames?
You can use apply.
For the record, storing lists in DataFrames is typically unnecessary and not very "pandonic". Also, if you only have one column, you can do this with a Series (though this solution also works for DataFrames).
Setup
Build the Series with the lists of options.
index = list('abcde')
s = pd.Series([['opt1'], ['opt1', 'opt2'], ['opt0'], ['opt1', 'opt4'], ['opt3']], index=index)
Build the Series with the names.
index_opts = ['opt%s' % i for i in range(5)]
vals_opts = ['name%s' % i for i in range(5)]
s_opts = pd.Series(vals_opts, index=index_opts)
Solution
Map options to names using apply. The lambda function looks up each option in the Series mapping options to names. It is applied to each element of the Series.
s.apply(lambda l: [s_opts[opt] for opt in l])
outputs
a [name1]
b [name1, name2]
c [name0]
d [name1, name4]
e [name3]
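A vectorized alternative (my own sketch, not part of the answer above) avoids the Python-level lambda by exploding the lists, mapping the names, and regrouping; it assumes pandas 0.25+ for Series.explode:

```python
import pandas as pd

s = pd.Series([['opt1'], ['opt1', 'opt2']], index=['a', 'b'])
s_opts = pd.Series(['name%s' % i for i in range(5)],
                   index=['opt%s' % i for i in range(5)])

# explode flattens the lists, map does the vectorized lookup, and
# groupby(level=0) rebuilds one list per original index label.
mapped = s.explode().map(s_opts).groupby(level=0).agg(list)
print(mapped['b'])  # ['name1', 'name2']
```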