Speed improvement for DataFrame of dict? - python

I have a dict of symbol: DataFrame. Each DataFrame is a time series with an arbitrary number of columns. I want to transform this data structure into a unique time series DataFrame (indexed by date) where each column contains the values of a symbol as a dict.
The following code does what I want, but is slow when it is performed on a dict with hundreds of symbols and DataFrames of 10k rows / 10 columns. I'm looking for ways to improve its speed.
import pandas as pd
dates = pd.bdate_range('2010-01-01', '2049-12-31')[:100]
data = {
'A': pd.DataFrame(data={'col1': range(100), 'col2': range(200, 300)}, index=dates),
'B': pd.DataFrame(data={'col1': range(100), 'col2': range(300, 400)}, index=dates),
'C': pd.DataFrame(data={'col1': range(100), 'col2': range(400, 500)}, index=dates)
}
def convert(data, name):
data = pd.concat([
pd.DataFrame(data={symbol: [dict(zip(df.columns, v)) for v in df.values]},
index=df.index)
for symbol, df in data.items()
], axis=1, join='outer')
data['type'] = name
data.index.name = 'date'
return data
result = convert(data, name='system')
result.head()
A B C type
date
2010-05-18 {'col1': 97, 'col2': 297} {'col1': 97, 'col2': 397} {'col1': 97, 'col2': 497} system
2010-05-19 {'col1': 98, 'col2': 298} {'col1': 98, 'col2': 398} {'col1': 98, 'col2': 498} system
2010-05-20 {'col1': 99, 'col2': 299} {'col1': 99, 'col2': 399} {'col1': 99, 'col2': 499} system
Any help is greatly appreciated! Thank you.

Related

Using column values as key in pandas json

I have a pandas dataframe as below. I'm just wondering if there's any way to have my column values as my key to the json.
df:
|symbol | price|
|:------|------|
|a. |120|
|b. |100|
|c |200|
I expect the json to look like {'a': 120, 'b': 100, 'c': 200}
I've tried the below and got the result as {symbol: 'a', price: 120}{symbol: 'b', price: 100}{symbol: 'c', price: 200}
df.to_json('price.json', orient='records', lines=True)
Let's start by creating the dataframe that OP mentions
import pandas as pd
df = pd.DataFrame({'symbol': ['a', 'b', 'c'], 'price': [120, 100, 200]})
Considering that OP doesn't want the JSON values as a list (as OP commented here), the following will do the work
df.groupby('symbol').price.apply(lambda x: x.iloc[0]).to_dict()
[Out]: {'a': 120, 'b': 100, 'c': 200}
If one wants the JSON values as a list, the following will do the work
df.groupby('symbol').price.apply(list).to_json()
[Out]: {"a":[120],"b":[100],"c":[200]}
Try like this :
import pandas as pd
d = {'symbol': ['a', 'b', 'c'], 'price': [120, 100, 200]}
df = pd.DataFrame(data=d)
print(df)
print (df.set_index('symbol').rename(columns={'price':'json_data'}).to_json())
# EXPORT TO FILE
df.set_index('symbol').rename(columns={'price':'json_data'}).to_json('price.json')
Output :
symbol price
0 a 120
1 b 100
2 c 200
{"json_data":{"a":120,"b":100,"c":200}}

How to aggregate columns in pandas

I have a data frame with number of columns I want to group them under two main groups A and B while preserving the old columns names as dictionary as follow
index userid col1 col2 col3 col4 col5 col6 col7
0 1 6 3 Nora 100 11 22 44
the desired data frame is as follow
index userid A B
0 1 {"col1":6, "col2":3, "col3":"Nora","col4":100} {"col5":11, "col6":22, "col7":44}
To match your desired dataframe exactly:
>>> import pandas as pd
# recreating your data
>>> df = pd.DataFrame.from_dict({'index': [0], 'userid': [1], 'col1': [6], 'col2': [3], 'col3': ['Nora'], 'col4': [100], 'col5': [11], 'col6': [22], 'col7': [44]})
# copy of unchanged columns
>>> df_new = df[['index', 'userid']].copy()
# grouping columns together
>>> df_new['A'] = df[['col1', 'col2', 'col3', 'col4']].copy().to_dict(orient='records')
>>> df_new['B'] = df[['col5', 'col6', 'col7']].copy().to_dict(orient='records')
>>> df_new
index userid A B
0 0 1 {'col1': 6, 'col2': 3, 'col3': 'Nora', 'col4': 100} {'col5': 11, 'col6': 22, 'col7': 44}
You can try something like this:
d = {'col1': 'A',
'col2': 'A',
'col3': 'A',
'col4': 'A',
'col5': 'B',
'col6': 'B',
'col7': 'B'}
df.groupby(d, axis=1).apply(pd.DataFrame.to_dict, orient='series').to_frame().T
Output:
A B
0 {'col1': [6], 'col2': [3], 'col3': ['Nora'], '... {'col5': [11], 'col6': [22], 'col7': [44]}
Working with the original dataframe.
import pandas as pd
df1 = pd.DataFrame({'index':[0], 'userid':[1],
'col1': [6], 'col2': [3], 'col3': ['Nora'] ,'col4':[100],
'col5':[11], 'col6': [22], 'col7':[44]})
df1['A'] = df1[['col1', 'col2', 'col3', 'col4']].to_dict(orient='records')
df1['B'] = df1[['col5', 'col6', 'col7']].to_dict(orient='records')
df1.drop(df1.columns[range(2, 9)], axis=1, inplace=True)
print(df1)

how to convert pyspark data frame columns into a dict

I have a dataframe with 2 columns.
Col1: String, Col2:String.
I want to create a dict like {'col1':'col2'}.
For example, the below csv data:
var1,InternalCampaignCode
var2,DownloadFileName
var3,ExternalCampaignCode
has to become :
{'var1':'InternalCampaignCode','var2':'DownloadFileName', ...}
The dataframe is having around 200 records.
Please let me know how to achieve this.
The following should do the trick:
df_as_dict = map(lambda row: row.asDict(), df.collect())
Note that this is going to generate a list of dictionaries, where each dictionary represents a single record of your pyspark dataframe:
[
{'Col1': 'var1', 'Col2': 'InternalCampaignCode'},
{'Col1': 'var2', 'Col2': 'DownloadFileName'},
{'Col1': 'var3', 'Col3': 'ExternalCampaignCode'},
]
You can do a dict comprehension:
result = {r[0]: r[1] for r in df.collect()}
which gives
{'var1': 'InternalCampaignCode', 'var2': 'DownloadFileName', 'var3': 'ExternalCampaignCode'}

Convert csv into list of dictionaries in python

I got a CSV file where first row are headers, then other rows are data in columns.
I am using python to parse this data into the list of dictionaries
Normally I would use this code:
def csv_to_list_of_dictionaries(file):
with open(file) as f:
a = []
for row in csv.DictReader(f, skipinitialspace=True):
a.append({k: v for k, v in row.items()})
return a
but because data in one column are stored in dictionary, this code doesn't work (it separates key:value pairs in this dictionary
so data in my csv file looks like this:
col1,col2,col3,col4
1,{'a':'b', 'c':'d'},'bla',sometimestamp
dictionary from this is created as this: {col1:1, col2:{'a':'b', col3: 'c':'d'}, col4: 'bla'}
What I wish to have as result is: {col1:1, col2:{'a':'b', 'c':'d'}, col3: 'bla', col4: sometimestamp}
Don't use the csv module use a regular expression to extract the fields from each row. Then make dictionaries from the extracted rows.
Example file:
col1,col2,col3,col4
1,{'a':'b', 'c':'d'},'bla',sometimestamp
2,{'a':'b', 'c':'d'},'bla',sometimestamp
3,{'a':'b', 'c':'d'},'bla',sometimestamp
4,{'a':'b', 'c':'d'},'bla',sometimestamp
5,{'a':'b', 'c':'d'},'bla',sometimestamp
6,{'a':'b', 'c':'d'},'bla',sometimestamp
.
import re
pattern = r'^([^,]*),({.*}),([^,]*),([^,]*)$'
regex = re.compile(pattern,flags=re.M)
def csv_to_list_of_dictionaries(file):
with open(file) as f:
columns = next(f).strip().split(',')
stuff = regex.findall(f.read())
a = [dict(zip(columns,values)) for values in stuff]
return a
stuff = csv_to_list_of_dictionaries(f)
In [20]: stuff
Out[20]:
[{'col1': '1',
'col2': "{'a':'b', 'c':'d'}",
'col3': "'bla'",
'col4': 'sometimestamp'},
{'col1': '2',
'col2': "{'a':'b', 'c':'d'}",
'col3': "'bla'",
'col4': 'sometimestamp'},
{'col1': '3',
'col2': "{'a':'b', 'c':'d'}",
'col3': "'bla'",
'col4': 'sometimestamp'},
{'col1': '4',
'col2': "{'a':'b', 'c':'d'}",
'col3': "'bla'",
'col4': 'sometimestamp'},
{'col1': '5',
'col2': "{'a':'b', 'c':'d'}",
'col3': "'bla'",
'col4': 'sometimestamp'},
{'col1': '6',
'col2': "{'a':'b', 'c':'d'}",
'col3': "'bla'",
'col4': 'sometimestamp'}]

Turn pandas columns into JSON string

I have a pandas dataframe with columns col1, col2 and col3 and respective values. I would need to transform column names and values into a JSON string.
For instance, if the dataset is
data= pd.DataFrame({'col1': ['bravo', 'charlie','price'], 'col2': [1, 2, 3],'col3':['alpha','beta','gamma']})
I need to obtain an output like this
newdata= pd.DataFrame({'index': [0,1,2], 'payload': ['{"col1":"bravo", "col2":"1", "col3":"alpha"}', '{"col1":"charlie", "col2":"2", "col3":"beta"}', '{"col1":"price", "col2":"3", "col3":"gamma"}']})
I didn't find any function or iterative tool to perform this.
Thank you in advance!
You can use:
df = data.agg(lambda s: dict(zip(s.index, s)), axis=1).rename('payload').to_frame()
Result:
# print(df)
payload
0 {'col1': 'bravo', 'col2': 1, 'col3': 'alpha'}
1 {'col1': 'charlie', 'col2': 2, 'col3': 'beta'}
2 {'col1': 'price', 'col2': 3, 'col3': 'gamma'}
Here you go:
import pandas as pd
data= pd.DataFrame({'col1': ['bravo', 'charlie','price'], 'col2': [1, 2, 3],'col3':['alpha','beta','gamma']})
new_data = pd.DataFrame({
'payload': data.to_dict(orient='records')
})
print(new_data)
## -- End pasted text --
payload
0 {'col1': 'bravo', 'col2': 1, 'col3': 'alpha'}
1 {'col1': 'charlie', 'col2': 2, 'col3': 'beta'}
2 {'col1': 'price', 'col2': 3, 'col3': 'gamma'}
If my understanding is correct, you want the index and the data records as a dict.
So:
dict(index=list(data.index), payload=data.to_dict(orient='records'))
For your example data:
>>> import pprint
>>> pprint.pprint(dict(index=list(data.index), payload=data.to_dict(orient='records')))
{'index': [0, 1, 2],
'payload': [{'col1': 'bravo', 'col2': 1, 'col3': 'alpha'},
{'col1': 'charlie', 'col2': 2, 'col3': 'beta'},
{'col1': 'price', 'col2': 3, 'col3': 'gamma'}]}
This is one approach using .to_dict('index').
Ex:
import pandas as pd
data= pd.DataFrame({'col1': ['bravo', 'charlie','price'], 'col2': [1, 2, 3],'col3':['alpha','beta','gamma']})
newdata = data.to_dict('index')
print({'index': list(newdata.keys()), 'payload': list(newdata.values())})
#OR -->newdata= pd.DataFrame({'index': list(newdata.keys()), 'payload': list(newdata.values())})
Output:
{'index': [0, 1, 2],
'payload': [{'col1': 'bravo', 'col2': 1, 'col3': 'alpha'},
{'col1': 'charlie', 'col2': 2, 'col3': 'beta'},
{'col1': 'price', 'col2': 3, 'col3': 'gamma'}]}
Use to_dict: newdata = data.T.to_dict()
>>> print(newdata.values())
[
{'col2': 1, 'col3': 'alpha', 'col1': 'bravo'},
{'col2': 2, 'col3': 'beta', 'col1': 'charlie'},
{'col2': 3, 'col3': 'gamma', 'col1': 'price'}
]

Categories