How to convert a CSV into a dictionary using pandas - Python

How can I convert a CSV into a dictionary using pandas? For example, I have 2 columns and would like column1 to be the key and column2 to be the value. My data looks like this:
"name","position"
"UCLA","73"
"SUNY","36"
This is how I'm reading the file so far:
cols = ['name', 'position']
df = pd.read_csv(filename, names=cols)

Since the 1st line of your sample CSV data is a "header", you may read it as a pd.Series using the squeeze keyword of pandas.read_csv():
>>> pd.read_csv(filename, index_col=0, squeeze=True).to_dict()
{'UCLA': 73, 'SUNY': 36}
If you want to include the 1st line as data as well, pass header=None.
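Note that the squeeze keyword of read_csv was deprecated in pandas 1.4 and removed in 2.0; on newer versions the same idea (a sketch of the equivalent) is to squeeze the single remaining column yourself:
>>> pd.read_csv(filename, index_col=0).squeeze("columns").to_dict()
{'UCLA': 73, 'SUNY': 36}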

Convert the columns to a list, then zip and convert to a dict:
In [37]:
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['first', 'second', 'third'], 'col2': np.random.rand(3)})
print(df)
dict(zip(list(df.col1), list(df.col2)))
     col1      col2
0   first  0.278247
1  second  0.459753
2   third  0.151873

[3 rows x 2 columns]
Out[37]:
{'third': 0.15187291615699894,
'first': 0.27824681093923298,
'second': 0.4597530377539677}
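If the DataFrame is already in memory, the same mapping can also be built without zip (a sketch, assuming the values in col1 are unique, since duplicate keys would silently collapse):
dict_from_df = df.set_index('col1')['col2'].to_dict()
# {'first': 0.278..., 'second': 0.459..., 'third': 0.151...}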

ankostis' answer is in my opinion the most elegant solution when you have the file on disk.
However, if you do not want to, or cannot, go the detour of saving and loading from the file system, you can also do it like this:
df = pd.DataFrame({"name": ["UCLA", "SUNY"], "position": [73, 36]})
series = df["position"]
series.index = df["name"]
series.to_dict()
Result:
{'UCLA': 73, 'SUNY': 36}
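The same can be collapsed into a one-liner (a sketch of the equivalent):
df.set_index("name")["position"].to_dict()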

Related

Create a pandas DataFrame where each cell is a set of strings

I am trying to create a DataFrame like so:
       col_a       col_b
0  {'soln_a'}  {'soln_b'}
In case it helps, here are some of my failed attempts:
import pandas as pd
my_dict_a = {"col_a": set(["soln_a"]), "col_b": set("soln_b")}
df_0 = pd.DataFrame.from_dict(my_dict_a) # ValueError: All arrays must be of the same length
df_1 = pd.DataFrame.from_dict(my_dict_a, orient="index").T # splits 'soln_b' into individual letters
my_dict_b = {"col_a": ["soln_a"], "col_b": ["soln_b"]}
df_2 = pd.DataFrame(my_dict_b).apply(set) # TypeError: 'set' type is unordered
df_3 = pd.DataFrame.from_dict(my_dict_b, orient="index").T # creates DataFrame of lists
df_3.apply(set, axis=1) # combines into single set of {soln_a, soln_b}
What's the best way to do this?
You just need to ensure your input data structure is formatted correctly.
The (default) dictionary -> DataFrame constructor asks for the values in the dictionary to be a collection of some type. You just need to make sure you have a collection of set objects, instead of having the key link directly to a single set.
So, if I change my input dictionary to have a list of sets, then it works as expected.
import pandas as pd
my_dict = {
    "col_a": [{"soln_a"}, {"soln_c"}],
    "col_b": [{"soln_b", "soln_d"}, {"soln_c"}],
}
df = pd.DataFrame.from_dict(my_dict)
print(df)
      col_a             col_b
0  {soln_a}  {soln_d, soln_b}
1  {soln_c}          {soln_c}
You could apply a list comprehension on the columns:
my_dict_b = {"col_a": ["soln_a"], "col_b": ["soln_b"]}
df_2 = pd.DataFrame(my_dict_b)
df_2 = df_2.apply(lambda col: [set([x]) for x in col])
Output:
      col_a     col_b
0  {soln_a}  {soln_b}
Why not something like this?
df = pd.DataFrame({
    'col_a': [set(['soln_a'])],
    'col_b': [set(['soln_b'])],
})
Output:
>>> df
      col_a     col_b
0  {soln_a}  {soln_b}

Dataframe with empty column in the data

I have a list of lists with a header row followed by the value rows.
It can happen that in some cases the last "column" has an empty value for all the rows (if even just one row has a value, it works fine), but DataFrame is not happy about that, as the number of columns differs from the header.
I'm thinking of appending a None value to the rows that lack one before creating the DF, but I'm wondering if there is a better way to handle this case?
import pandas

data = [
    ["data1", "data2", "data3"],
    ["value11", "value12"],
    ["value21", "value22"],
    ["value31", "value32"],
]
headers = data.pop(0)
dataframe = pandas.DataFrame(data, columns=headers)
You could do this:
import pandas as pd
data = [
    ["data1", "data2", "data3"],
    ["value11", "value12"],
    ["value21", "value22"],
    ["value31", "value32"],
]

# create dataframe
df = pd.DataFrame(data)

# set new column names
# this will use ["data1", "data2", "data3"] as the new columns, because they are in the first row
df.columns = df.iloc[0].tolist()

# now that you have the right column names, just skip the first row
df = df.iloc[1:].reset_index(drop=True)
df

     data1    data2 data3
0  value11  value12  None
1  value21  value22  None
2  value31  value32  None

Is this what you want?
You can use the DataFrame.reindex method to add the missing columns. You can do something like this:
import pandas as pd
df = pd.DataFrame(data)
# Name only the columns that actually exist, to prevent a length-mismatch exception.
df.columns = headers[:df.shape[1]]
# Reindex against the full header list; the missing columns are added and filled with NaN.
df = df.reindex(headers, axis=1)
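For the sample data above (with headers popped off data as in the question), this should give roughly:
     data1    data2  data3
0  value11  value12    NaN
1  value21  value22    NaN
2  value31  value32    NaN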

Dictionary in Pandas DataFrame, how to split the columns

I have a DataFrame that consists of one column ('Vals') whose entries are dictionaries. The DataFrame looks more or less like this:
In[215]: fff
Out[213]:
Vals
0 {u'TradeId': u'JP32767', u'TradeSourceNam...
1 {u'TradeId': u'UUJ2X16', u'TradeSourceNam...
2 {u'TradeId': u'JJ35A12', u'TradeSourceNam...
When looking at an individual row the dictionary looks like this:
In[220]: fff['Vals'][100]
Out[218]:
{u'BrdsTraderBookCode': u'dffH',
 u'Measures': [{u'AssetName': u'Ie0',
                u'DefinitionId': u'6dbb',
                u'MeasureValues': [{u'Amount': -18.64}],
                u'ReportingCurrency': u'USD',
                u'ValuationId': u'669bb'}],
 u'SnapshotId': 12739,
 u'TradeId': u'17304M',
 u'TradeLegId': u'31827',
 u'TradeSourceName': u'xxxeee',
 u'TradeVersion': 1}
How can I split the columns and create a new DataFrame, so that I get one column with TradeId and another one with MeasureValues?
try this:
l = []
for idx, row in fff['Vals'].iteritems():  # on pandas >= 2.0, use .items() instead
    temp_df = pd.DataFrame(row['Measures'][0]['MeasureValues'])
    temp_df['TradeId'] = row['TradeId']
    l.append(temp_df)
pd.concat(l, axis=0)
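Alternatively, newer pandas versions ship pd.json_normalize, which can flatten the nested dictionaries directly (a sketch, assuming every row's dict nests Measures/MeasureValues the same way as the sample row):
import pandas as pd

flat = pd.json_normalize(
    fff['Vals'].tolist(),                       # list of nested dicts
    record_path=['Measures', 'MeasureValues'],  # descend into the nested lists
    meta=['TradeId'],                           # carry the top-level TradeId along
)
# flat has one row per MeasureValues entry, with columns 'Amount' and 'TradeId'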
Here's a way to get TradeId and MeasureValues (using your sample row twice to illustrate the iteration). Note that .ix has been removed from pandas; positional .iloc does the same job here:
new_df = pd.DataFrame()
for idx, data in fff.iterrows():
    d = {'TradeId': data.iloc[0]['TradeId']}
    d.update(data.iloc[0]['Measures'][0]['MeasureValues'][0])
    new_df = pd.concat([new_df, pd.DataFrame.from_dict(d, orient='index').T])

   Amount TradeId
0  -18.64  17304M
0  -18.64  17304M

Selectively replacing DataFrames column names

I have a time series dataset in a .csv file that I want to process with Pandas (using Canopy). The column names from the file are a mix of strings and isotopic numbers.
   cycles      40   38.02   35.98      P4
0       1  1.1e-8  4.4e-8  7.7e-8  8.8e-7
1       2  2.2e-8  5.5e-8  8.8e-8  8.7e-7
2       3  3.3e-8  6.6e-8  9.9e-8  8.6e-7
I would like this DataFrame to look like this:
   cycles      40      38      36      P4
0       1  1.1e-8  4.4e-8  7.7e-8  8.8e-7
1       2  2.2e-8  5.5e-8  8.8e-8  8.7e-7
2       3  3.3e-8  6.6e-8  9.9e-8  8.6e-7
The .csv files won't always have exactly the same column names; the numbers could be slightly different from file to file. To handle this, I've sampled the column names and rounded the values to the nearest integer. This is what my code looks like so far:
import pandas as pd
import numpy as np
df = {'cycles': [1, 2, 3],
      '40': [1.1e-8, 2.2e-8, 3.3e-8],
      '38.02': [4.4e-8, 5.5e-8, 6.6e-8],
      '35.98': [7.7e-8, 8.8e-8, 9.9e-8],
      'P4': [8.8e-7, 8.7e-7, 8.6e-7]}
df = pd.DataFrame(df, columns=['cycles', '40', '38.02', '35.98', 'P4'])
colHeaders = df.columns.values.tolist()
colHeaders[1:4] = list(map(float, colHeaders[1:4]))
colHeaders[1:4] = list(map(np.around, colHeaders[1:4]))
colHeaders[1:4] = list(map(int, colHeaders[1:4]))
colHeaders = list(map(str, colHeaders))
I tried df.rename(columns={df.loc[ 1 ]:colHeaders[ 0 ]}, ...), but I get this error:
TypeError: 'Series' objects are mutable, thus they cannot be hashed
I've read this post as well as the pandas 0.17 documentation, but I can't figure out how to use it to selectively replace the column names in a way that doesn't require me to assign new column names manually like this post.
I'm fairly new to Python and I've never posted on StackOverflow before, so any help would be greatly appreciated.
You could use a variant of your approach, but assign the new columns directly:
>>> cols = list(df.columns)
>>> cols[1:-1] = [int(round(float(x))) for x in cols[1:-1]]
>>> df.columns = cols
>>> df
   cycles            40            38            36            P4
0       1  1.100000e-08  4.400000e-08  7.700000e-08  8.800000e-07
1       2  2.200000e-08  5.500000e-08  8.800000e-08  8.700000e-07
2       3  3.300000e-08  6.600000e-08  9.900000e-08  8.600000e-07
>>> df.columns
Index(['cycles', 40, 38, 36, 'P4'], dtype='object')
Or you could pass a function to rename:
>>> df = df.rename(columns=lambda x: x if x[0].isalpha() else int(round(float(x))))
>>> df.columns
Index(['cycles', 40, 38, 36, 'P4'], dtype='object')
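Note that the lambda assumes every column label is a string, so running the rename twice would fail once the labels are ints. A slightly more defensive variant (a sketch) guards against that:
>>> df = df.rename(columns=lambda x: x if isinstance(x, str) and x[0].isalpha() else int(round(float(x))))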

Group by in pandas dataframe and unioning a numpy array column

I have a CSV file where one of the columns looks like a numpy array. The first few lines look like the following
first,second,third
170.0,2,[19 234 376]
170.0,3,[19 23 23]
162.0,4,[1 2 3]
162.0,5,[1 3 4]
When I load this CSV into a pandas DataFrame using the following code:
import numpy as np
import pandas as pd

data = pd.read_csv('myfile.csv', converters={'first': np.float64, 'second': np.int64, 'third': np.array})
Now, I want to group by based on the 'first' column and union the 'third' column. So after doing this my dataframe should look like
170.0, [19 23 234 376]
162.0, [1 2 3 4]
How do I achieve this? I tried multiple ways, like the following, but nothing seems to work.
group_data = data.groupby('first')
group_data['third'].apply(lambda x: np.unique(np.concatenate(x)))
With your current csv file the 'third' column comes in as a string, instead of a list.
There might be nicer ways to convert to a list, but here goes...
from ast import literal_eval

import numpy as np
import pandas as pd

data = pd.read_csv('test_groupby.csv')
# Convert to a string representation of a list...
data['third'] = data['third'].str.replace(' ', ',')
# Convert string to list...
data['third'] = data['third'].apply(literal_eval)

group_data = data.groupby('first')
# Two secrets revealed here:
#  - x.values instead of x, since x is a Series
#  - list(...) to return an aggregated value
#    (np.array should work here, but...?)
ans = group_data.aggregate(
    {'third': lambda x: list(np.unique(np.concatenate(x.values)))})
print(ans)
print(ans)
                    third
first
162          [1, 2, 3, 4]
170    [19, 23, 234, 376]
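On pandas 0.25 or newer you can also avoid the concatenate by exploding the lists first (a sketch, assuming data['third'] already holds lists as above):
ans = (
    data.explode('third')                      # one row per list element
        .groupby('first')['third']
        .apply(lambda s: sorted(s.unique()))   # union, sorted like np.unique
)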
