Print Row and Column Header if column/row is not NaN Pandas - python

It's a weird question - but can y'all think of a good way to just print the rows or a list of the rows and their corresponding column headers if the dataframe cell is not NaN?
Imagine a dataframe like this:
   col1  col2  col3  col4
1     1   NaN     2   NaN
2   NaN   NaN     1     2
3     2   NaN   NaN     1
Result should look something like this:
1 [col1: 1, col3: 2]
2 [col3: 1, col4: 2]
3 [col1: 2, col4: 1]
Thanks in advance!

You can transpose the dataframe, and for each row, drop NaNs and convert to dict:
out = df.T.apply(lambda x: dict(x.dropna().astype(int)))
Output:
>>> out
1 {'col1': 1, 'col3': 2}
2 {'col3': 1, 'col4': 2}
3 {'col1': 2, 'col4': 1}
dtype: object
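A self-contained run of the approach above, rebuilding the question's frame (a sketch; np.nan stands in for the blank cells):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"col1": [1, np.nan, 2], "col2": [np.nan] * 3,
     "col3": [2, 1, np.nan], "col4": [np.nan, 2, 1]},
    index=[1, 2, 3],
)
# Transpose so each original row becomes a column, then drop NaNs
# and build a {column: value} dict per row.
out = df.T.apply(lambda x: dict(x.dropna().astype(int)))
# out.loc[1] == {'col1': 1, 'col3': 2}
```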

Let us try stack
df.stack().reset_index(level=0).groupby('level_0')[0].agg(dict)
Out[184]:
level_0
1 {'col1': 1.0, 'col3': 2.0}
2 {'col3': 1.0, 'col4': 2.0}
3 {'col1': 2.0, 'col4': 1.0}
Name: 0, dtype: object

Combine agg(dict) and a list comprehension:
d = [{k: v for k, v in x.items() if v == v} for x in df.agg(dict, axis=1)]  # v == v is False only for NaN
[{'col1': 1.0, 'col3': 2.0},
{'col3': 1.0, 'col4': 2.0},
{'col1': 2.0, 'col4': 1.0}]
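The v == v test works because NaN is the one float value that compares unequal to itself; a minimal demonstration:

```python
import math

# NaN != NaN, so `v == v` keeps exactly the non-NaN values.
row = {"col1": 1.0, "col2": math.nan, "col3": 2.0}
clean = {k: v for k, v in row.items() if v == v}
# clean == {'col1': 1.0, 'col3': 2.0}
```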

Related

How to show data with different number of columns Pandas?

I load a CSV document with a varying number of columns, which gives this error:
Expected 12 fields in line 29, saw 13
To avoid this error I use the hack names=range(24):
df = pd.read_csv(filename, header=None, quoting=csv.QUOTE_NONE, dtype='object', sep=data_file_delimiter, engine='python', encoding = "utf-8", names=range(24))
The problem is that I need to know the real number of fields in each row to group the data further into a dict:
data = {}
with open(filename) as f:
    for line in f:
        row = line.strip().split(data_file_delimiter)
        if len(row) not in data:
            data[len(row)] = []
        data[len(row)].append(row)
You can get the number of columns with len(df.columns), but if you only want to convert a pandas DataFrame to a dictionary, there are already many built-in methods, as shown below.
df = pd.DataFrame({'col1': [1, 2], 'col2': [0.5, 0.75]},index=['row1', 'row2'])
df
      col1  col2
row1     1  0.50
row2     2  0.75
df.to_dict()
{'col1': {'row1': 1, 'row2': 2}, 'col2': {'row1': 0.5, 'row2': 0.75}}
# You can specify the return orientation.
df.to_dict('series')
{'col1': row1 1
row2 2
Name: col1, dtype: int64,
'col2': row1 0.50
row2 0.75
Name: col2, dtype: float64}
df.to_dict('split')
{'index': ['row1', 'row2'], 'columns': ['col1', 'col2'],
'data': [[1, 0.5], [2, 0.75]]}
df.to_dict('records')
[{'col1': 1, 'col2': 0.5}, {'col1': 2, 'col2': 0.75}]
df.to_dict('index')
{'row1': {'col1': 1, 'col2': 0.5}, 'row2': {'col1': 2, 'col2': 0.75}}
df.to_dict('tight')
{'index': ['row1', 'row2'], 'columns': ['col1', 'col2'],
'data': [[1, 0.5], [2, 0.75]], 'index_names': [None], 'column_names': [None]}
# You can also specify the mapping type.
from collections import OrderedDict, defaultdict
df.to_dict(into=OrderedDict)
OrderedDict([('col1', OrderedDict([('row1', 1), ('row2', 2)])),
('col2', OrderedDict([('row1', 0.5), ('row2', 0.75)]))])
Taken from here
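To group the padded rows by their real field count, one sketch (assuming the frame was read with names=range(24) as in the question, so short rows are padded with missing values; the 3-column frame here is a stand-in):

```python
import pandas as pd

# Hypothetical ragged input padded to a fixed width, as read_csv
# with names=range(N) would produce (missing fields become NaN/None).
df = pd.DataFrame(
    [["a", "b", None], ["c", "d", "e"], ["f", "g", None]],
    columns=range(3), dtype="object",
)

data = {}
for row in df.itertuples(index=False):
    fields = [v for v in row if not pd.isna(v)]  # drop the padding
    data.setdefault(len(fields), []).append(fields)
# data == {2: [['a', 'b'], ['f', 'g']], 3: [['c', 'd', 'e']]}
```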

Convert Pandas Series to Dictionary Without Index

I need to convert Pandas Series to a Dictionary, without Index (like pandas.DataFrame.to_dict('r')) - code is below:
grouped_df = df.groupby(index_column)
for key, val in tqdm(grouped_df):
    json_dict[key] = val.apply(lambda x: x.to_dict(), axis=1).to_dict()
Currently, I get output like so:
{
"15717":{
"col1":1.61,
"col2":1.53,
"col3":1.0
},
"15718":{
"col1":10.97,
"col2":5.79,
"col3":2.0
},
"15719":{
"col1":15.38,
"col2":12.81,
"col3":1.0
}
}
but I need output like:
[
{
"col1":1.61,
"col2":1.53,
"col3":1.0
},
{
"col1":10.97,
"col2":5.79,
"col3":2.0
},
{
"col1":15.38,
"col2":12.81,
"col3":1.0
}
]
Thanks for your help!
Edit: Here is the original dataframe:
       col1  col2   col3
2751   5.46   1.0   1.11
2752  16.47   0.0   6.54
2753  26.51   0.0  18.25
2754  31.04   1.0  28.95
2755  36.45   0.0  32.91
Two ways of doing that:
[v for _, v in df.to_dict(orient="index").items()]
Another one:
df.to_dict(orient="records")
The output, either way, is:
[{'col1': 1.61, 'col2': 1.53, 'col3': 1.0},
{'col1': 10.97, 'col2': 5.79, 'col3': 2.0},
{'col1': 15.38, 'col2': 12.81, 'col3': 1.0}]
You can also take the values of the index-oriented dict of the transposed frame (note: the 'r' shorthand for 'records' was removed in pandas 2.0):
list(df.T.to_dict().values())
Output:
[{'col1': 1.61, 'col2': 1.53, 'col3': 1.0},
{'col1': 10.97, 'col2': 5.79, 'col3': 2.0},
{'col1': 15.38, 'col2': 12.81, 'col3': 1.0}]

Convert Pandas DataFrame to dictionary

I have a simple DataFrame:
      Name Format
0    cntry    int
1  dweight    str
2  pspwght    str
3  pweight    str
4   nwspol    str
I want a dictionary like this:
{
"cntry":"int",
"dweight":"str",
"pspwght":"str",
"pweight":"str",
"nwspol":"str"
}
Where dict["cntry"] would return int or dict["dweight"] would return str.
How could I do this?
How about this:
import pandas as pd
df = pd.DataFrame({'col_1': ['A', 'B', 'C', 'D'], 'col_2': [1, 1, 2, 3], 'col_3': ['Bla', 'Foo', 'Sup', 'Asdf']})
res_dict = dict(zip(df['col_1'], df['col_3']))
Contents of res_dict:
{'A': 'Bla', 'B': 'Foo', 'C': 'Sup', 'D': 'Asdf'}
You're looking for DataFrame.to_dict()
From the documentation:
>>> df = pd.DataFrame({'col1': [1, 2],
... 'col2': [0.5, 0.75]},
... index=['row1', 'row2'])
>>> df
col1 col2
row1 1 0.50
row2 2 0.75
>>> df.to_dict()
{'col1': {'row1': 1, 'row2': 2}, 'col2': {'row1': 0.5, 'row2': 0.75}}
You can always invert an internal dictionary if it's not mapped how you'd like it to be:
inv_dict = {v: k for k, v in original_dict['Name'].items()}
I think what you want is:
df.set_index('Name').to_dict()['Format']
Since you want to use the values in the Name column as the keys to your dict.
Note that you might want to do:
df.set_index('Name').astype(str).to_dict()['Format']
if you want the values of the dictionary to be strings.
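Put together on the question's own columns, the set_index route can be sketched as follows (grabbing the Format column as a Series makes the dict directly, without the extra ['Format'] lookup):

```python
import pandas as pd

df = pd.DataFrame({"Name": ["cntry", "dweight", "pspwght", "pweight", "nwspol"],
                   "Format": ["int", "str", "str", "str", "str"]})
# Name becomes the index, so Series.to_dict() maps Name -> Format.
d = df.set_index("Name")["Format"].to_dict()
# d['cntry'] == 'int', d['dweight'] == 'str'
```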

python aggregate list to dictionary

I have a file that looks like this -
Col1  Col2  Key  Value
 101     a   f1    abc
 101     a   f2    def
 102     a   f2    xyz
 102     a   f3    fgh
 103     b   f1    rst
and I need output file that looks like:
{"Col1":101, "Col2":"a", "kvpairs":{"f1":"abc","f2":"def"}}
{"Col1":102, "Col2":"a", "kvpairs":{"f2":"xyz","f3":"fgh"}}
{"Col1":103, "Col2":"b", "kvpairs":{"f1":"rst"}}
I can loop through the file, clubbing the key/value pairs for the grouping fields Col1 and Col2 into a list and dropping it into a dict, but I was hoping there was a more Pythonic way of doing it. There are questions answered using pandas aggregation, but I can't find a neat (and efficient) way of building that nested map. Also, the source file is going to be large: around 80M records crunching down to 8M in the resulting file.
I can see those eyes lighting up :)
Using itertools.groupby():
from itertools import groupby
for (c1, c2), items in groupby(lines, key=lambda x: x[:2]):
    d = {"Col1": c1, "Col2": c2, "kvpairs": dict(x[2:] for x in items)}
    print(d)
Produces:
{'Col1': '101', 'Col2': 'a', 'kvpairs': {'f1': 'abc', 'f2': 'def'}}
{'Col1': '102', 'Col2': 'a', 'kvpairs': {'f2': 'xyz', 'f3': 'fgh'}}
{'Col1': '103', 'Col2': 'b', 'kvpairs': {'f1': 'rst'}}
It looks like you're parsing some of the values to literals -- the int you can do with int(c1), but I'm not sure how you want to deal with turning "a" into a.
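A fully self-contained version of the groupby sketch, with the sort it relies on made explicit (groupby only merges consecutive rows with equal keys), and Col1 parsed to int:

```python
from itertools import groupby

lines = [
    ["101", "a", "f1", "abc"],
    ["101", "a", "f2", "def"],
    ["102", "a", "f2", "xyz"],
    ["102", "a", "f3", "fgh"],
    ["103", "b", "f1", "rst"],
]
lines.sort(key=lambda x: x[:2])  # groupby needs group-adjacent input
records = [
    {"Col1": int(c1), "Col2": c2, "kvpairs": dict(x[2:] for x in items)}
    for (c1, c2), items in groupby(lines, key=lambda x: x[:2])
]
# records[0] == {'Col1': 101, 'Col2': 'a', 'kvpairs': {'f1': 'abc', 'f2': 'def'}}
```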
(Assuming you have a list of iterables, maybe from the csv module:)
lines = [
['101','a','f1','abc'],
['101','a','f2','def'],
['102','a','f2','xyz'],
['102','a','f3','fgh'],
['103','b','f1','rst']
]
data = []
for col1, col2, key, value in lines:
    # look for an existing dict with col1 and col2
    for d in data:
        if d['col1'] == col1 and d['col2'] == col2:
            d['kvpairs'][key] = value
            break
    # no existing dict was found
    else:
        data.append({'col1': col1, 'col2': col2, 'kvpairs': {key: value}})
for d in data:
    print(d)
groupby + agg + to_dict
df.groupby(["Col1", "Col2"])[["Key", "Value"]].agg(list).apply(lambda x: dict(zip(*x)), axis=1).reset_index(name='kvpairs').to_dict('records')
[{'Col1': 101, 'Col2': 'a', 'kvpairs': {'f1': 'abc', 'f2': 'def'}},
{'Col1': 102, 'Col2': 'a', 'kvpairs': {'f2': 'xyz', 'f3': 'fgh'}},
{'Col1': 103, 'Col2': 'b', 'kvpairs': {'f1': 'rst'}}]
Assuming of course, df is
import io
import pandas as pd

z = io.StringIO("""Col1 Col2 Key Value
101 a f1 abc
101 a f2 def
102 a f2 xyz
102 a f3 fgh
103 b f1 rst""")
df = pd.read_table(z, sep=r'\s+')
Explanation
First you aggregate using list
df.groupby(["Col1", "Col2"])[["Key", "Value"]].agg(list)
           Key       Value
Col1 Col2
101  a     [f1, f2]  [abc, def]
102  a     [f2, f3]  [xyz, fgh]
103  b     [f1]      [rst]
Then turn each row into a dictionary and reset the index (transform cannot return one dict per row, so apply is used):
.apply(lambda x: dict(zip(*x)), axis=1).reset_index(name='kvpairs')
   Col1 Col2                     kvpairs
0   101    a  {'f1': 'abc', 'f2': 'def'}
1   102    a  {'f2': 'xyz', 'f3': 'fgh'}
2   103    b               {'f1': 'rst'}
Finally, use to_dict('records') to get your list of dictionaries
.to_dict('records')
[{'Col1': 101, 'Col2': 'a', 'kvpairs': {'f1': 'abc', 'f2': 'def'}},
{'Col1': 102, 'Col2': 'a', 'kvpairs': {'f2': 'xyz', 'f3': 'fgh'}},
{'Col1': 103, 'Col2': 'b', 'kvpairs': {'f1': 'rst'}}]
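Since the question asks for an output file with one JSON object per line, the record list serializes directly (a sketch using the standard json module; json.dumps handles the nesting):

```python
import json

records = [
    {"Col1": 101, "Col2": "a", "kvpairs": {"f1": "abc", "f2": "def"}},
    {"Col1": 102, "Col2": "a", "kvpairs": {"f2": "xyz", "f3": "fgh"}},
    {"Col1": 103, "Col2": "b", "kvpairs": {"f1": "rst"}},
]
# One JSON object per element; join with newlines to write the file.
out_lines = [json.dumps(r) for r in records]
```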

Convert a Pandas DataFrame to a dictionary

I have a DataFrame with four columns. I want to convert this DataFrame to a Python dictionary, with the elements of the first column as the keys and the elements of the other columns in the same row as the values.
DataFrame:
  ID  A  B  C
0  p  1  3  2
1  q  4  3  2
2  r  4  0  9
Output should be like this:
Dictionary:
{'p': [1,3,2], 'q': [4,3,2], 'r': [4,0,9]}
The to_dict() method sets the column names as dictionary keys so you'll need to reshape your DataFrame slightly. Setting the 'ID' column as the index and then transposing the DataFrame is one way to achieve this.
to_dict() also accepts an 'orient' argument which you'll need in order to output a list of values for each column. Otherwise, a dictionary of the form {index: value} will be returned for each column.
These steps can be done with the following line:
>>> df.set_index('ID').T.to_dict('list')
{'p': [1, 3, 2], 'q': [4, 3, 2], 'r': [4, 0, 9]}
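Run end to end on the question's data (rebuilt here from the frame shown above), the one-liner gives:

```python
import pandas as pd

df = pd.DataFrame({"ID": ["p", "q", "r"],
                   "A": [1, 4, 4], "B": [3, 3, 0], "C": [2, 2, 9]})
# ID becomes the columns after the transpose, so each ID maps to
# its column of values, read top to bottom (A, B, C).
d = df.set_index("ID").T.to_dict("list")
# d == {'p': [1, 3, 2], 'q': [4, 3, 2], 'r': [4, 0, 9]}
```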
In case a different dictionary format is needed, here are examples of the possible orient arguments. Consider the following simple DataFrame:
>>> df = pd.DataFrame({'a': ['red', 'yellow', 'blue'], 'b': [0.5, 0.25, 0.125]})
>>> df
        a      b
0     red  0.500
1  yellow  0.250
2    blue  0.125
Then the options are as follows.
dict - the default: column names are keys, values are dictionaries of index:data pairs
>>> df.to_dict('dict')
{'a': {0: 'red', 1: 'yellow', 2: 'blue'},
'b': {0: 0.5, 1: 0.25, 2: 0.125}}
list - keys are column names, values are lists of column data
>>> df.to_dict('list')
{'a': ['red', 'yellow', 'blue'],
'b': [0.5, 0.25, 0.125]}
series - like 'list', but values are Series
>>> df.to_dict('series')
{'a': 0 red
1 yellow
2 blue
Name: a, dtype: object,
'b': 0 0.500
1 0.250
2 0.125
Name: b, dtype: float64}
split - splits columns/data/index as keys with values being column names, data values by row and index labels respectively
>>> df.to_dict('split')
{'columns': ['a', 'b'],
'data': [['red', 0.5], ['yellow', 0.25], ['blue', 0.125]],
'index': [0, 1, 2]}
records - each row becomes a dictionary where key is column name and value is the data in the cell
>>> df.to_dict('records')
[{'a': 'red', 'b': 0.5},
{'a': 'yellow', 'b': 0.25},
{'a': 'blue', 'b': 0.125}]
index - like 'records', but a dictionary of dictionaries with keys as index labels (rather than a list)
>>> df.to_dict('index')
{0: {'a': 'red', 'b': 0.5},
1: {'a': 'yellow', 'b': 0.25},
2: {'a': 'blue', 'b': 0.125}}
Should a dictionary like:
{'red': '0.500', 'yellow': '0.250', 'blue': '0.125'}
be required out of a dataframe like:
        a      b
0     red  0.500
1  yellow  0.250
2    blue  0.125
the simplest way would be (this relies on the frame having exactly two columns, which become the keys and values):
dict(df.values)
working snippet below:
import pandas as pd
df = pd.DataFrame({'a': ['red', 'yellow', 'blue'], 'b': [0.5, 0.25, 0.125]})
dict(df.values)
Follow these steps:
Suppose your dataframe is as follows:
>>> df
   A  B  C ID
0  1  3  2  p
1  4  3  2  q
2  4  0  9  r
1. Use set_index to set ID columns as the dataframe index.
df.set_index("ID", drop=True, inplace=True)
2. Use the orient=index parameter to have the index as dictionary keys.
dictionary = df.to_dict(orient="index")
The results will be as follows:
>>> dictionary
{'p': {'A': 1, 'B': 3, 'C': 2}, 'q': {'A': 4, 'B': 3, 'C': 2}, 'r': {'A': 4, 'B': 0, 'C': 9}}
3. If you need to have each sample as a list, run the following code, determining the column order first:
column_order = ["A", "B", "C"]  # determine your preferred order of columns
d = {}  # initialize the new dictionary as an empty dictionary
for k in dictionary:
    d[k] = [dictionary[k][column_name] for column_name in column_order]
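The three steps above, run end to end on the question's data (a sketch):

```python
import pandas as pd

df = pd.DataFrame({"ID": ["p", "q", "r"],
                   "A": [1, 4, 4], "B": [3, 3, 0], "C": [2, 2, 9]})
df.set_index("ID", drop=True, inplace=True)   # step 1: ID as index
dictionary = df.to_dict(orient="index")       # step 2: index as dict keys

column_order = ["A", "B", "C"]                # step 3: dicts -> ordered lists
d = {k: [dictionary[k][c] for c in column_order] for k in dictionary}
# d == {'p': [1, 3, 2], 'q': [4, 3, 2], 'r': [4, 0, 9]}
```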
Try zip:
df = pd.read_csv("file")
d = dict([(i, [a, b, c]) for i, a, b, c in zip(df.ID, df.A, df.B, df.C)])
print(d)
Output:
{'p': [1, 3, 2], 'q': [4, 3, 2], 'r': [4, 0, 9]}
If you don't mind the dictionary values being tuples, you can use itertuples:
>>> {x[0]: x[1:] for x in df.itertuples(index=False)}
{'p': (1, 3, 2), 'q': (4, 3, 2), 'r': (4, 0, 9)}
For my use (node names with xy positions) I found #user4179775's answer to be the most helpful / intuitive:
import pandas as pd
df = pd.read_csv('glycolysis_nodes_xy.tsv', sep='\t')
df.head()
    nodes    x    y
0  c00033  146  958
1  c00031  601  195
...
xy_dict_list=dict([(i,[a,b]) for i, a,b in zip(df.nodes, df.x,df.y)])
xy_dict_list
{'c00022': [483, 868],
'c00024': [146, 868],
... }
xy_dict_tuples=dict([(i,(a,b)) for i, a,b in zip(df.nodes, df.x,df.y)])
xy_dict_tuples
{'c00022': (483, 868),
'c00024': (146, 868),
... }
Addendum
I later returned to this issue, for other, but related, work. Here is an approach that more closely mirrors the [excellent] accepted answer.
node_df = pd.read_csv('node_prop-glycolysis_tca-from_pg.tsv', sep='\t')
node_df.head()
   node  kegg_id kegg_cid        name  wt  vis
0    22       22   c00022    pyruvate   1    1
1    24       24   c00024  acetyl-CoA   1    1
...
Convert Pandas dataframe to a [list], {dict}, {dict of {dict}}, ...
Per accepted answer:
node_df.set_index('kegg_cid').T.to_dict('list')
{'c00022': [22, 22, 'pyruvate', 1, 1],
'c00024': [24, 24, 'acetyl-CoA', 1, 1],
... }
node_df.set_index('kegg_cid').T.to_dict('dict')
{'c00022': {'kegg_id': 22, 'name': 'pyruvate', 'node': 22, 'vis': 1, 'wt': 1},
'c00024': {'kegg_id': 24, 'name': 'acetyl-CoA', 'node': 24, 'vis': 1, 'wt': 1},
... }
In my case, I wanted to do the same thing but with selected columns from the Pandas dataframe, so I needed to slice the columns. There are two approaches.
Directly:
(see: Convert pandas to dictionary defining the columns used for the key values)
node_df.set_index('kegg_cid')[['name', 'wt', 'vis']].T.to_dict('dict')
{'c00022': {'name': 'pyruvate', 'vis': 1, 'wt': 1},
'c00024': {'name': 'acetyl-CoA', 'vis': 1, 'wt': 1},
... }
"Indirectly:" first, slice the desired columns/data from the Pandas dataframe (again, two approaches),
node_df_sliced = node_df[['kegg_cid', 'name', 'wt', 'vis']]
or
node_df_sliced2 = node_df.loc[:, ['kegg_cid', 'name', 'wt', 'vis']]
which can then be used to create a dictionary of dictionaries
node_df_sliced.set_index('kegg_cid').T.to_dict('dict')
{'c00022': {'name': 'pyruvate', 'vis': 1, 'wt': 1},
'c00024': {'name': 'acetyl-CoA', 'vis': 1, 'wt': 1},
... }
Most of the answers do not deal with the situation where an ID can appear multiple times in the dataframe. If ID can be duplicated in the DataFrame df, you want to use a list to store the values (a.k.a. a list of lists), grouped by ID:
{k: [g['A'].tolist(), g['B'].tolist(), g['C'].tolist()] for k,g in df.groupby('ID')}
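A quick sketch with a duplicated ID, showing how the grouped version collects the values:

```python
import pandas as pd

# 'p' appears twice, so each of its columns yields a list of two values.
df = pd.DataFrame({"ID": ["p", "p", "q"],
                   "A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]})
out = {k: [g["A"].tolist(), g["B"].tolist(), g["C"].tolist()]
       for k, g in df.groupby("ID")}
# out == {'p': [[1, 2], [4, 5], [7, 8]], 'q': [[3], [6], [9]]}
```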
Dictionary comprehension & iterrows() method could also be used to get the desired output.
result = {row.ID: [row.A, row.B, row.C] for (index, row) in df.iterrows()}
df = pd.DataFrame([['p',1,3,2], ['q',4,3,2], ['r',4,0,9]], columns=['ID','A','B','C'])
my_dict = {k:list(v) for k,v in zip(df['ID'], df.drop(columns='ID').values)}
print(my_dict)
with output
{'p': [1, 3, 2], 'q': [4, 3, 2], 'r': [4, 0, 9]}
With this method, the columns of the dataframe will be the keys and the series of the dataframe will be the values:
data_dict = dict()
for col in dataframe.columns:
    data_dict[col] = dataframe[col].values.tolist()
DataFrame.to_dict() converts DataFrame to dictionary.
Example
>>> df = pd.DataFrame(
{'col1': [1, 2], 'col2': [0.5, 0.75]}, index=['a', 'b'])
>>> df
   col1  col2
a     1  0.50
b     2  0.75
>>> df.to_dict()
{'col1': {'a': 1, 'b': 2}, 'col2': {'a': 0.5, 'b': 0.75}}
See the DataFrame.to_dict documentation for details.
