I have a list of dictionaries where the keys are identical but the values in each dictionary differ, and the order of each dictionary is strictly preserved. I am trying to find an automatic way to populate these dictionaries into a pandas DataFrame as new columns, but I didn't get the expected output.
Original data on gist
Here is the data that I have: old data on gist.
My attempt
Here is my attempt to map multiple dictionaries that share the same keys but hold different values (two possible values each); my goal is to write a handy function to vectorize the code. Here is my inefficient code, which works, on gist:
import pandas as pd

dat = pd.read_csv('old_data.csv', encoding='utf-8')

# typ, anim, bov, cat and foo are the code -> label dictionaries (see gist)
dat['type'] = dat['code'].astype(str).map(typ)
dat['anim'] = dat['code'].astype(str).map(anim)
dat['bovin'] = dat['code'].astype(str).map(bov)
dat['catg'] = dat['code'].astype(str).map(cat)
dat['foot'] = dat['code'].astype(str).map(foo)
My code works, but it is not vectorized (not efficient, I think). I am wondering how I can turn these few lines into a simple function. Any idea how to make this happen as efficiently as possible?
Here are my current and desired outputs:
I get the correct output, but the code is not efficient. This is my current output on gist.
If you restructure your dictionaries into a dictionary of dictionaries, you can one-line it:
for key, mapping in values.items():
    dat[key] = dat['code'].astype(str).map(mapping)
Full code:
values = {"typ" :{
'20230' : 'A',
'20130' : 'A',
'20220' : 'A',
'20120' : 'A',
'20329' : 'A',
'20322' : 'A',
'20321' : 'B',
'20110' : 'B',
'20210' : 'B',
'20311' : 'B'
} ,
"anim" :{
'20230' : 'AOB',
'20130' : 'AOB',
'20220' : 'AOB',
'20120' : 'AOB',
'20329' : 'AOC',
'20322' : 'AOC',
'20321' : 'AOC',
'20110' : 'AOB',
'20210' : 'AOB',
'20311' : 'AOC'
} ,
"bov" :{
'20230' : 'AOD',
'20130' : 'AOD',
'20220' : 'AOD',
'20120' : 'AOD',
'20329' : 'AOE',
'20322' : 'AOE',
'20321' : 'AOE',
'20110' : 'AOD',
'20210' : 'AOD',
'20311' : 'AOE'
} ,
"cat" :{
'20230' : 'AOF',
'20130' : 'AOG',
'20220' : 'AOF',
'20120' : 'AOG',
'20329' : 'AOF',
'20322' : 'AOF',
'20321' : 'AOF',
'20110' : 'AOG',
'20210' : 'AOF',
'20311' : 'AOG'
} ,
"foo" :{
'20230' : 'AOL',
'20130' : 'AOL',
'20220' : 'AOM',
'20120' : 'AOM',
'20329' : 'AOL',
'20322' : 'AOM',
'20321' : 'AOM',
'20110' : 'AOM',
'20210' : 'AOM',
'20311' : 'AOM'
}
}
import pandas as pd

dat = pd.read_csv('old_data.csv', encoding='utf-8')
for key, mapping in values.items():
    dat[key] = dat['code'].astype(str).map(mapping)
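Note that each .map call is already a vectorized operation, so the loop over five mappings is cheap. If you would rather avoid the loop entirely, here is a sketch of an alternative (assuming every inner dictionary covers the same set of codes) that turns values into a single lookup table and joins once:

import pandas as pd

dat = pd.read_csv('old_data.csv', encoding='utf-8')

# Outer keys (typ, anim, bov, cat, foo) become columns;
# inner keys (the code strings) become the index.
lookup = pd.DataFrame(values)

dat['code_str'] = dat['code'].astype(str)   # temporary join key
dat = dat.join(lookup, on='code_str').drop(columns='code_str')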
Related
I just want to create DataFrames, one per company, containing financial quotes of the stocks, using a loop and a dict:
import yfinance as yf

financials = {'jp_morgan': 'JPM', 'bank_of_amerika': 'BAC', 'credit_suisse': 'CS',
              'visa': 'V', 'mastercard': 'MA', 'morgan_stanley': 'MS',
              'citigroup': 'C', 'wells_fargo': 'WFC', 'blackrock': 'CLOA',
              'goldman_sachs': 'GS'}

for i in financials:
    i = yf.download(financials[i], '2016-01-01', '2019-08-01')
but this only rebinds the loop variable i on each iteration. How do I actually get a DataFrame for each company?
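A common pattern for this (a sketch, not from the original post) is to collect the results in a dict keyed by company name instead of overwriting the loop variable:

import yfinance as yf

# One DataFrame per company, accessible by name
quotes = {name: yf.download(ticker, '2016-01-01', '2019-08-01')
          for name, ticker in financials.items()}

quotes['jp_morgan'].head()   # e.g. the JPMorgan quotes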
I have 640,000 rows of data in Excel.
I want to append some rows to the data, so I used
pd.read_excel and pd.concat([excel, some_data]).
After that, I used df.to_excel() to write back to Excel.
But read_excel is taking a long time, about 3 minutes, and so is to_excel.
How can I fix it?
import subprocess

import pandas as pd

def update_mecab(new_word_list):
    user_dicpath = 'C:\\mecab\\user-dic\\custom.csv'

    # The slow part: the whole dictionary is read from and written back to .xlsx
    dictionary = pd.read_excel('./first_dictionary.xlsx')
    dictionary = pd.concat([dictionary, new_word_list])

    # Korean part-of-speech names mapped to MeCab tags
    part_names = {
        '일반 명사': 'NNG',
        '고유 명사': 'NNP',
        '의존 명사': 'NNB',
        '수사': 'NR',
        '대명사': 'NP',
        '동사': 'VV',
        '형용사': 'VA',
        '보조 용언': 'VX',
        '관형사': 'MM',
        '일반 부사': 'MAG',
        '접속 부사': 'MAJ',
        '감탄사': 'IC'
    }
    new_word_pt = new_word_list.replace({"part": part_names})

    # Append each new word to the MeCab user-dictionary CSV
    user_dict = open(user_dicpath, 'a', encoding="UTF-8")
    for index, item in new_word_pt.iterrows():
        custom_word = item['word'] + ',*,*,*,' + item['part'] + ',*,T,' + item['word'] + ',*,*,*,*,*\n'
        user_dict.write(custom_word)
    user_dict.close()
    del user_dict

    dictionary = dictionary.reset_index()
    dictionary = dictionary[['word', 'part']]
    dictionary.to_excel('first_dictionary.xlsx', sheet_name="Sheet_1", index=None)
    subprocess.call("powershell C:\\mecab\\add-userdic-win.ps1")
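Since the Excel I/O itself is the bottleneck, one common mitigation (a sketch, not from the original post; it assumes the working dictionary can live in a CSV file kept alongside the .xlsx) is to avoid read_excel/to_excel for the working copy and only export to .xlsx when a spreadsheet is actually needed:

import pandas as pd

# Hypothetical fast path: CSV reads and writes take seconds, not minutes
dictionary = pd.read_csv('first_dictionary.csv')
dictionary = pd.concat([dictionary, new_word_list])
dictionary = dictionary.reset_index()[['word', 'part']]
dictionary.to_csv('first_dictionary.csv', index=False)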
I have a dict that I need to convert to a pandas DataFrame. The dict contains arrays; if the arrays are all the same length it works fine, but arrays of different lengths throw a ValueError. My second question: I need to access only a few key-value pairs from the dict.
This case works; as expected, I get two rows:
my_dict = {
"ColA" : "No",
"ColB" : [
{
"ColB_a" : "2011-10-26T00:00:00Z",
"ColB_b" : 8.3
},
{
"ColB_a" : "2013-10-26T00:00:00Z",
"ColB_b" : 5.3
}
],
"ColC" : "Graduate",
"ColD" : [
{
"ColD_a" : 5436.0,
"ColD_b" : "RD"
},
{
"ColD_a" : 4658.0,
"ColD_b" : "DV"
}
],
"ColE" : "Work"
}
sa = pd.DataFrame(my_dict)
In this case, ColB has only one value:
my_dict = {
"ColA" : "No",
"ColB" : [
{
"ColB_a" : "2011-10-26T00:00:00Z",
"ColB_b" : 8.3
}
],
"ColC" : "Graduate",
"ColD" : [
{
"ColD_a" : 5436.0,
"ColD_b" : "RD"
},
{
"ColD_a" : 4658.0,
"ColD_b" : "DV"
}
],
"ColE" : "Work"
}
sa = pd.DataFrame(my_dict)
This throws ValueError: arrays must all be same length. How can this be fixed?
The expected output is:
I can do:
sa = pd.DataFrame.from_dict(my_dict, orient='index').transpose()
But I have to melt and join again.
Second question: if I need to choose only ColA and ColB from the dict to create the DataFrame, how should this be done?
For your second question, you can select a subset of columns from your dictionary using the columns parameter. For example:
sa = pd.DataFrame(my_dict, columns = ['ColA', 'ColD'])
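This also sidesteps the ValueError on the second example above: ColB is excluded from the length check, and the remaining scalar keys broadcast against ColD. Roughly (a usage sketch; the printed layout is approximate):

print(sa)
#   ColA                                ColD
# 0   No  {'ColD_a': 5436.0, 'ColD_b': 'RD'}
# 1   No  {'ColD_a': 4658.0, 'ColD_b': 'DV'}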
I have a JSON-array from a mongoexport containing data from the Beddit sleeptracker. Below is an example of one of the truncated documents (removed some unneeded detail).
{
"user" : "xxx",
"provider" : "beddit",
"date" : ISODate("2016-11-30T23:00:00.000Z"),
"data" : [
{
"end_timestamp" : 1480570804.26226,
"properties" : {
"sleep_efficiency" : 0.8772404,
"resting_heart_rate" : 67.67578,
"short_term_resting_heart_rate" : 61.36963,
"activity_index" : 50.51958,
"average_respiration_rate" : 16.25667,
"total_sleep_score" : 64,
},
"date" : "2016-12-01",
"session_range_start" : 1480545636.55059,
"start_timestamp" : 1480545636.55059,
"session_range_end" : 1480570804.26226,
"tags" : [
"not_enough_sleep",
"long_sleep_latency"
],
"updated" : 1480570805.25201
}
],
"__v" : 0
}
Several related questions like this and this do not seem to work for the data structure above. As recommended in other related questions I am trying to stay away from looping over each row for performance reasons (the full dataset is ~150MB). How would I flatten out the "data"-key with json_normalize so that each key is at the top-level? I would prefer one DataFrame where e.g. total_sleep_score is a column.
Any help is much appreciated! Even though I know how to 'prepare' the data using JavaScript, I would like to be able to understand and do it using Python.
edit (request from comment to show preferred structure):
{
"user" : "xxx",
"provider" : "beddit",
"date" : ISODate("2016-11-30T23:00:00.000Z"),
"end_timestamp" : 1480570804.26226,
"properties.sleep_efficiency" : 0.8772404,
"properties.resting_heart_rate" : 67.67578,
"properties.short_term_resting_heart_rate" : 61.36963,
"properties.activity_index" : 50.51958,
"properties.average_respiration_rate" : 16.25667,
"properties.total_sleep_score" : 64,
"date" : "2016-12-01",
"session_range_start" : 1480545636.55059,
"start_timestamp" : 1480545636.55059,
"session_range_end" : 1480570804.26226,
"updated" : 1480570805.25201,
"__v" : 0
}
The 'properties.' prefix is not necessary but would be nice.
Try this algorithm for flattening:
def flattenPattern(pattern):
    """Recursively flatten nested dicts/lists into one flat dict
    with dot-separated keys; the 'data' level itself is dropped."""
    newPattern = {}
    if type(pattern) is list:
        pattern = pattern[0]   # only the first list element is kept
    if type(pattern) is not str:
        for key, value in pattern.items():
            if type(value) in (list, dict):
                returnedData = flattenPattern(value)
                for i, j in returnedData.items():
                    if key == "data":
                        newPattern[i] = j   # hoist children of 'data' to the top level
                    else:
                        newPattern[key + "." + i] = j   # e.g. 'properties.sleep_efficiency'
            else:
                newPattern[key] = value
    return newPattern

# dictFromJson is the parsed document shown above
print(flattenPattern(dictFromJson))
Output:
{
'session_range_start':1480545636.55059,
'start_timestamp':1480545636.55059,
'properties.average_respiration_rate':16.25667,
'session_range_end':1480570804.26226,
'properties.resting_heart_rate':67.67578,
'properties.short_term_resting_heart_rate':61.36963,
'updated':1480570805.25201,
'properties.total_sleep_score':64,
'properties.activity_index':50.51958,
'__v':0,
'user':'xxx',
'provider':'beddit',
'date':'2016-12-01',
'properties.sleep_efficiency':0.8772404,
'end_timestamp':1480570804.26226
}
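To get from here to the single DataFrame the question asks for, the flattened dicts can be fed straight to the constructor (a sketch; documents stands in for the hypothetical parsed list of mongoexport documents):

import pandas as pd

# documents: hypothetical list of parsed mongoexport dicts
rows = [flattenPattern(doc) for doc in documents]
df = pd.DataFrame(rows)   # one column per flattened key, e.g. properties.total_sleep_score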
Although not explicitly what I asked for, the following worked for me so far:
Step 1
Normalize the data record using json_normalize on the original dataset (not inside a pandas DataFrame) and prefix the record columns with 'data.':
beddit_data = pd.io.json.json_normalize(beddit, record_path='data', record_prefix='data.', meta='_id')  # pd.json_normalize in newer pandas
Step 2
The resulting 'data.properties' column is a Series of dicts, so these can be expanded into columns with .apply(pd.Series):
beddit_data_properties = beddit_data['data.properties'].apply(pd.Series)
Step 3
The final step is to merge both DataFrames. In Step 1, I kept meta='_id' so that the DataFrame can be merged with the original DataFrame from Beddit. I didn't include that in the final step yet because I first want to spend some time on the results so far.
beddit_final = pd.concat([beddit_data_properties[:], beddit_data[:]], axis=1)
If anyone is interested, I can share the final Jupyter Notebook when it is ready :)
I am trying to load JSON data using Python; however, it looks like this:
{
"instrument" : "EUR_USD",
"granularity" : "D",
"candles" : [
{
"time" : "2014-07-02T04:00:00.000000Z", // time in RFC3339 format
"openMid" : 1.36803,
"highMid" : 1.368125,
"lowMid" : 1.364275,
"closeMid" : 1.365315,
"volume" : 28242,
"complete" : true
},
{
"time" : "2014-07-03T04:00:00.000000Z", // time in RFC3339 format
"openMid" : 1.36532,
"highMid" : 1.366445,
"lowMid" : 1.35963,
"closeMid" : 1.3613,
"volume" : 30487,
"complete" : false
}
]
}
My problem is that when I load it using pandas, instrument, granularity, and candles are treated as the column titles. However, I want to use time, openMid, highMid, lowMid, closeMid, volume, and complete to create my columns; as it stands, they are just treated as belonging to candles. Any ideas on how I can accomplish this? Thanks.
You'll have to read the string using the json library first (note that the // comments shown above are not valid JSON and must be absent from the actual string):
import json
data = json.loads(string)
And then you can extract the candles data from the resulting dictionary and build your DataFrame that way, e.g.:
candles_data = data.pop('candles')   # list of dicts -> one row per candle
df = pd.DataFrame(candles_data)

# Broadcast the remaining scalar keys (instrument, granularity) into columns
for k, v in data.items():            # .iteritems() in Python 2
    df[k] = v
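In newer pandas the same result can be had in one call with json_normalize (an alternative sketch, not part of the original answer):

import json
import pandas as pd

data = json.loads(string)   # string = the raw JSON, without the // comments
df = pd.json_normalize(data, record_path='candles',
                       meta=['instrument', 'granularity'])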