I have a CSV file where the first row is the header and the remaining rows are data.
I am using Python to parse this data into a list of dictionaries.
Normally I would use this code:
import csv

def csv_to_list_of_dictionaries(file):
    with open(file) as f:
        a = []
        for row in csv.DictReader(f, skipinitialspace=True):
            a.append({k: v for k, v in row.items()})
        return a
but because the data in one column is itself a dictionary, this code doesn't work (it splits the key:value pairs of that dictionary across columns).
so data in my csv file looks like this:
col1,col2,col3,col4
1,{'a':'b', 'c':'d'},'bla',sometimestamp
The dictionary this produces is: {col1:1, col2:{'a':'b', col3: 'c':'d'}, col4: 'bla'}
What I want as the result is: {col1:1, col2:{'a':'b', 'c':'d'}, col3: 'bla', col4: sometimestamp}
Don't use the csv module; use a regular expression to extract the fields from each row, then build dictionaries from the extracted rows.
Example file:
col1,col2,col3,col4
1,{'a':'b', 'c':'d'},'bla',sometimestamp
2,{'a':'b', 'c':'d'},'bla',sometimestamp
3,{'a':'b', 'c':'d'},'bla',sometimestamp
4,{'a':'b', 'c':'d'},'bla',sometimestamp
5,{'a':'b', 'c':'d'},'bla',sometimestamp
6,{'a':'b', 'c':'d'},'bla',sometimestamp
import re

# One capture group per column; the {.*} group keeps the embedded dict in one piece.
pattern = r'^([^,]*),({.*}),([^,]*),([^,]*)$'
regex = re.compile(pattern, flags=re.M)

def csv_to_list_of_dictionaries(file):
    with open(file) as f:
        columns = next(f).strip().split(',')  # header row
        stuff = regex.findall(f.read())       # one tuple of fields per data row
        a = [dict(zip(columns, values)) for values in stuff]
        return a

stuff = csv_to_list_of_dictionaries('example.csv')  # hypothetical path to the file shown above
In [20]: stuff
Out[20]:
[{'col1': '1',
'col2': "{'a':'b', 'c':'d'}",
'col3': "'bla'",
'col4': 'sometimestamp'},
{'col1': '2',
'col2': "{'a':'b', 'c':'d'}",
'col3': "'bla'",
'col4': 'sometimestamp'},
{'col1': '3',
'col2': "{'a':'b', 'c':'d'}",
'col3': "'bla'",
'col4': 'sometimestamp'},
{'col1': '4',
'col2': "{'a':'b', 'c':'d'}",
'col3': "'bla'",
'col4': 'sometimestamp'},
{'col1': '5',
'col2': "{'a':'b', 'c':'d'}",
'col3': "'bla'",
'col4': 'sometimestamp'},
{'col1': '6',
'col2': "{'a':'b', 'c':'d'}",
'col3': "'bla'",
'col4': 'sometimestamp'}]
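Note that the values of col2 and col3 come back as strings ("{'a':'b', 'c':'d'}" and "'bla'"). If you want col2 as an actual dict, as in your desired result, one option is ast.literal_eval — a minimal sketch, assuming every such value is a valid Python literal:

import ast

def parse_value(v):
    # Turn a Python-literal string like "{'a':'b'}" or "'bla'" into the
    # corresponding object; leave anything else (e.g. the timestamp) as-is.
    try:
        return ast.literal_eval(v)
    except (ValueError, SyntaxError):
        return v

parsed = [{k: parse_value(v) for k, v in row.items()} for row in stuff]
# parsed[0] -> {'col1': 1, 'col2': {'a': 'b', 'c': 'd'}, 'col3': 'bla', 'col4': 'sometimestamp'}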
Related
I have a dataframe with 2 columns: Col1 (String) and Col2 (String).
I want to create a dict like {'col1':'col2'}.
For example, the below csv data:
var1,InternalCampaignCode
var2,DownloadFileName
var3,ExternalCampaignCode
has to become :
{'var1':'InternalCampaignCode','var2':'DownloadFileName', ...}
The dataframe has around 200 records.
Please let me know how to achieve this.
The following should do the trick:
df_as_dict = list(map(lambda row: row.asDict(), df.collect()))
Note that this is going to generate a list of dictionaries, where each dictionary represents a single record of your pyspark dataframe:
[
{'Col1': 'var1', 'Col2': 'InternalCampaignCode'},
{'Col1': 'var2', 'Col2': 'DownloadFileName'},
{'Col1': 'var3', 'Col2': 'ExternalCampaignCode'},
]
You can do a dict comprehension:
result = {r[0]: r[1] for r in df.collect()}
which gives
{'var1': 'InternalCampaignCode', 'var2': 'DownloadFileName', 'var3': 'ExternalCampaignCode'}
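For a runnable end-to-end sketch (assuming an active SparkSession bound to the name spark):

# Hypothetical setup mirroring the question's data; `spark` is assumed to exist.
df = spark.createDataFrame(
    [('var1', 'InternalCampaignCode'),
     ('var2', 'DownloadFileName'),
     ('var3', 'ExternalCampaignCode')],
    ['Col1', 'Col2'],
)
result = {r[0]: r[1] for r in df.collect()}
# {'var1': 'InternalCampaignCode', 'var2': 'DownloadFileName', 'var3': 'ExternalCampaignCode'}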
I have a pandas dataframe with columns col1, col2 and col3 and respective values. I would need to transform column names and values into a JSON string.
For instance, if the dataset is
data= pd.DataFrame({'col1': ['bravo', 'charlie','price'], 'col2': [1, 2, 3],'col3':['alpha','beta','gamma']})
I need to obtain an output like this
newdata= pd.DataFrame({'index': [0,1,2], 'payload': ['{"col1":"bravo", "col2":"1", "col3":"alpha"}', '{"col1":"charlie", "col2":"2", "col3":"beta"}', '{"col1":"price", "col2":"3", "col3":"gamma"}']})
I didn't find any function or iterative tool to perform this.
Thank you in advance!
You can use:
df = data.agg(lambda s: dict(zip(s.index, s)), axis=1).rename('payload').to_frame()
Result:
# print(df)
payload
0 {'col1': 'bravo', 'col2': 1, 'col3': 'alpha'}
1 {'col1': 'charlie', 'col2': 2, 'col3': 'beta'}
2 {'col1': 'price', 'col2': 3, 'col3': 'gamma'}
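Note that payload here holds Python dicts, not JSON strings. If you need actual JSON text, as in the question's expected output, one way (a sketch using the standard json module) is to serialize each record:

import json
import pandas as pd

data = pd.DataFrame({'col1': ['bravo', 'charlie', 'price'],
                     'col2': [1, 2, 3],
                     'col3': ['alpha', 'beta', 'gamma']})

# json.dumps turns each record dict into a JSON string;
# reset_index() lifts the index into its own column.
newdata = pd.DataFrame({
    'payload': [json.dumps(rec) for rec in data.to_dict(orient='records')]
}).reset_index()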
Here you go:
import pandas as pd
data= pd.DataFrame({'col1': ['bravo', 'charlie','price'], 'col2': [1, 2, 3],'col3':['alpha','beta','gamma']})
new_data = pd.DataFrame({
    'payload': data.to_dict(orient='records')
})
print(new_data)
payload
0 {'col1': 'bravo', 'col2': 1, 'col3': 'alpha'}
1 {'col1': 'charlie', 'col2': 2, 'col3': 'beta'}
2 {'col1': 'price', 'col2': 3, 'col3': 'gamma'}
If my understanding is correct, you want the index and the data records as a dict.
So:
dict(index=list(data.index), payload=data.to_dict(orient='records'))
For your example data:
>>> import pprint
>>> pprint.pprint(dict(index=list(data.index), payload=data.to_dict(orient='records')))
{'index': [0, 1, 2],
'payload': [{'col1': 'bravo', 'col2': 1, 'col3': 'alpha'},
{'col1': 'charlie', 'col2': 2, 'col3': 'beta'},
{'col1': 'price', 'col2': 3, 'col3': 'gamma'}]}
This is one approach using .to_dict('index').
Ex:
import pandas as pd
data= pd.DataFrame({'col1': ['bravo', 'charlie','price'], 'col2': [1, 2, 3],'col3':['alpha','beta','gamma']})
newdata = data.to_dict('index')
print({'index': list(newdata.keys()), 'payload': list(newdata.values())})
# or: newdata = pd.DataFrame({'index': list(newdata.keys()), 'payload': list(newdata.values())})
Output:
{'index': [0, 1, 2],
'payload': [{'col1': 'bravo', 'col2': 1, 'col3': 'alpha'},
{'col1': 'charlie', 'col2': 2, 'col3': 'beta'},
{'col1': 'price', 'col2': 3, 'col3': 'gamma'}]}
Use to_dict: newdata = data.T.to_dict()
>>> print(list(newdata.values()))
[
{'col2': 1, 'col3': 'alpha', 'col1': 'bravo'},
{'col2': 2, 'col3': 'beta', 'col1': 'charlie'},
{'col2': 3, 'col3': 'gamma', 'col1': 'price'}
]
I need to convert numeric values of a column (pandas data frame) to float, but they are in string format.
d = {'col1': ['1', '2.1', '3.1'],
'col2': ['yes', '4', '6'],
'col3': ['1', '4', 'not']}
Expected:
{'col1': [1, 2.1, 3.1],
'col2': ['yes', 4, 6],
'col3': [1, 4, 'not']}
It is possible, but not recommended, because with mixed values in a column some functions will fail:
import pandas as pd

d = {'col1': ['1', '2.1', '3.1'],
     'col2': ['yes', '4', '6'],
     'col3': ['1', '4', 'not']}
df = pd.DataFrame(d)

def func(x):
    # Convert to float where possible; keep the original value otherwise.
    try:
        return float(x)
    except Exception:
        return x

df = df.applymap(func)
print (df)
col1 col2 col3
0 1.0 yes 1
1 2.1 4 4
2 3.1 6 not
print (df.to_dict('list'))
{'col1': [1.0, 2.1, 3.1], 'col2': ['yes', 4.0, 6.0], 'col3': [1.0, 4.0, 'not']}
Another solution coerces everything to numeric (non-numeric values become NaN) and then restores the original values where coercion failed:
df = df.apply(lambda x: pd.to_numeric(x, errors='coerce')).fillna(df)
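Applied to the example data, this reproduces the expected mixed-type result:

df = pd.DataFrame(d)
df = df.apply(lambda x: pd.to_numeric(x, errors='coerce')).fillna(df)
print(df.to_dict('list'))
# {'col1': [1.0, 2.1, 3.1], 'col2': ['yes', 4.0, 6.0], 'col3': [1.0, 4.0, 'not']}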
I have a dict of symbol: DataFrame. Each DataFrame is a time series with an arbitrary number of columns. I want to transform this data structure into a single time-series DataFrame (indexed by date) where each column contains a symbol's values as a dict.
The following code does what I want, but is slow when it is performed on a dict with hundreds of symbols and DataFrames of 10k rows / 10 columns. I'm looking for ways to improve its speed.
import pandas as pd
dates = pd.bdate_range('2010-01-01', '2049-12-31')[:100]
data = {
'A': pd.DataFrame(data={'col1': range(100), 'col2': range(200, 300)}, index=dates),
'B': pd.DataFrame(data={'col1': range(100), 'col2': range(300, 400)}, index=dates),
'C': pd.DataFrame(data={'col1': range(100), 'col2': range(400, 500)}, index=dates)
}
def convert(data, name):
    data = pd.concat([
        pd.DataFrame(data={symbol: [dict(zip(df.columns, v)) for v in df.values]},
                     index=df.index)
        for symbol, df in data.items()
    ], axis=1, join='outer')
    data['type'] = name
    data.index.name = 'date'
    return data
result = convert(data, name='system')
result.tail(3)
A B C type
date
2010-05-18 {'col1': 97, 'col2': 297} {'col1': 97, 'col2': 397} {'col1': 97, 'col2': 497} system
2010-05-19 {'col1': 98, 'col2': 298} {'col1': 98, 'col2': 398} {'col1': 98, 'col2': 498} system
2010-05-20 {'col1': 99, 'col2': 299} {'col1': 99, 'col2': 399} {'col1': 99, 'col2': 499} system
Any help is greatly appreciated! Thank you.
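One possible speed-up, as a sketch (not benchmarked here): let to_dict('records') build the row dicts in one call per DataFrame instead of zipping df.columns against every row of df.values:

def convert_fast(data, name):
    # One to_dict('records') call per symbol; pd.DataFrame aligns the
    # resulting Series on their (shared) date index.
    out = pd.DataFrame({
        symbol: pd.Series(df.to_dict('records'), index=df.index)
        for symbol, df in data.items()
    })
    out['type'] = name
    out.index.name = 'date'
    return out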
The gist of this post is that I have "23" in my original data, and I want "23" in my resulting dict (not "23.0"). Here's how I've tried to handle it with Pandas.
My Excel worksheet has a coded Region column:
23
11
27
(blank)
25
Initially, I created a dataframe, and Pandas set the dtype of Region to float64:
import pandas as pd
filepath = 'data_file.xlsx'
df = pd.read_excel(filepath, sheet_name=0, header=0)
df
23.0
11.0
27.0
NaN
25.0
Pandas will convert the dtype to object if I use fillna() to replace NaNs with blanks, which seems to eliminate the decimals.
df.fillna('', inplace=True)
df
23
11
27
(blank)
25
Except I still get decimals when I convert the dataframe to a dict:
data = df.to_dict('records')
data
[{'region': 23.0,},
 {'region': 11.0,},
 {'region': 27.0,},
 {'region': '',},
 {'region': 25.0,}]
Is there a way I can create the dict without the decimal places? By the way, I'm writing a generic utility, so I won't always know the column names and/or value types, which means I'm looking for a generic solution (vs. explicitly handling Region).
Any help is much appreciated, thanks!
The problem is that after fillna('') your underlying values are still float, despite the column being of type object:

import pandas as pd
import numpy as np

s = pd.Series([23., 11., 27., np.nan, 25.])
s.fillna('').iloc[0]
23.0
Instead, apply a formatter, then replace the leftover 'nan' strings:
s.apply('{:0.0f}'.format).replace('nan', '').to_dict()
{0: '23', 1: '11', 2: '27', 3: '', 4: '25'}
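Since the column names and types aren't known in advance, the same idea can be generalized to a whole DataFrame — a sketch (with a hypothetical extra string column, name) that formats only the float columns and leaves everything else alone:

import pandas as pd
import numpy as np

df = pd.DataFrame({'region': [23., 11., 27., np.nan, 25.],
                   'name': list('abcde')})

out = df.copy()
for col in out.select_dtypes(include='float').columns:
    # Blank out NaNs and drop the trailing .0 on whole numbers.
    out[col] = out[col].apply(lambda v: '' if pd.isna(v) else '{:0.0f}'.format(v))

out.to_dict('records')
# [{'region': '23', 'name': 'a'}, ..., {'region': '', 'name': 'd'}, {'region': '25', 'name': 'e'}]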
Using a custom function takes care of integers and keeps strings as strings:

import pprint
import pandas as pd

def func(x):
    # Convert to int where possible; non-numeric values pass through unchanged.
    try:
        return int(x)
    except ValueError:
        return x
df = pd.DataFrame({'region': [1, 2, 3, float('nan')],
'col2': ['a', 'b', 'c', float('nan')]})
df.fillna('', inplace=True)
pprint.pprint(df.applymap(func).to_dict('records'))
Output:
[{'col2': 'a', 'region': 1},
{'col2': 'b', 'region': 2},
{'col2': 'c', 'region': 3},
{'col2': '', 'region': ''}]
A variation that also keeps floats as floats:
import pprint
import pandas as pd

def func(x):
    # Whole-number floats become ints; genuine floats and strings pass through.
    try:
        if int(x) == x:
            return int(x)
        else:
            return x
    except ValueError:
        return x
df = pd.DataFrame({'region1': [1, 2, 3, float('nan')],
'region2': [1.5, 2.7, 3, float('nan')],
'region3': ['a', 'b', 'c', float('nan')]})
df.fillna('', inplace=True)
pprint.pprint(df.applymap(func).to_dict('records'))
Output:
[{'region1': 1, 'region2': 1.5, 'region3': 'a'},
{'region1': 2, 'region2': 2.7, 'region3': 'b'},
{'region1': 3, 'region2': 3, 'region3': 'c'},
{'region1': '', 'region2': '', 'region3': ''}]
You could add: dtype=str
import pandas as pd
filepath = 'data_file.xlsx'
df = pd.read_excel(filepath, sheet_name=0, header=0, dtype=str)