get rows based on a condition and separate them into subsets - python

I am trying to split a dataset into subsets based on a condition: each subset starts at a row that meets the condition and continues until the next row where column A is 0.
Condition: where column A == 0, column B should start with 'a'.
Dataset:
A B
0 aa
1 ss
2 dd
3 ff
0 ee
1 ff
2 bb
3 gg
0 ar
1 hh
2 ww
0 jj
1 ll
expected:
{0: {'A': [0,1,2,3], 'B': ['aa','ss','dd','ff']}, 1: {'A': [0,1,2], 'B': ['ar','hh','ww']}}
Each series starts where column A == 0 and ends just before the next 0.
In total there are 4 such series in that dataframe.

Maybe try it with cumsum as well:
{x: y.to_dict('list') for x, y in df.groupby(df['A'].eq(0).cumsum())}
Out[87]:
{1: {'A': [0, 1, 2, 3], 'B': ['aa', 'ss', 'dd', 'ff']},
2: {'A': [0, 1, 2, 3], 'B': ['ee', 'ff', 'bb', 'gg']},
3: {'A': [0, 1, 2], 'B': ['ar', 'hh', 'ww']},
4: {'A': [0, 1], 'B': ['jj', 'll']}}

Do a cumsum on the condition to identify the groups, then groupby:
groups = (df['A'].eq(0) & df['B'].str.startswith('a')).cumsum()
{k:v.to_dict(orient='list') for k,v in df.groupby(groups)}
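Output (this matches a variant of the data in which the second block's first B value is 'ae' rather than 'ee'):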
{1: {'A': [0, 1, 2, 3], 'B': ['aa', 'ss', 'dd', 'ff']},
2: {'A': [0, 1, 2, 3], 'B': ['ae', 'ff', 'bb', 'gg']},
3: {'A': [0, 1, 2, 0, 1], 'B': ['ar', 'hh', 'ww', 'jj', 'll']}}
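To see why a cumulative sum works as a grouping key, it can help to look at the intermediate labels. The sketch below rebuilds the question's sample data (the DataFrame construction is an assumption about how the frame was created) and uses the simpler key that opens a new block at every A == 0.

```python
import pandas as pd

# Rebuild the question's sample data (assumed construction).
df = pd.DataFrame({
    'A': [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 0, 1],
    'B': ['aa', 'ss', 'dd', 'ff', 'ee', 'ff', 'bb', 'gg',
          'ar', 'hh', 'ww', 'jj', 'll'],
})

# Every row where A == 0 opens a new block; cumsum turns the
# boolean "block start" flags into a running block number.
starts = df['A'].eq(0)
labels = starts.cumsum()

print(labels.tolist())
# [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4]

# Grouping by the labels yields one sub-frame per block.
subsets = {k: g.to_dict('list') for k, g in df.groupby(labels)}
```

Adding a second condition to `starts` (such as the startswith test) only changes which rows open a new block; rows that fail it are absorbed into the preceding block rather than dropped.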

This answers this question's revision 2020-11-04 19:29:39Z. Later additions/edits to the question or additional requirements in the comments will not be considered.
First find the desired rows and select them into a new dataframe. Group the rows and convert them to dicts.
g = (df.A.eq(0).astype(int) + df.B.str.startswith('a')).replace(0, method='ffill') - 1
df_BeqA = df[g.astype('bool')]
{x: y.to_dict('list') for x , y in df_BeqA.groupby(df_BeqA.A.eq(0).cumsum() - 1)}
Out:
{0: {'A': [0, 1, 2, 3], 'B': ['aa', 'ss', 'dd', 'ff']},
1: {'A': [0, 1, 2], 'B': ['ar', 'hh', 'ww']}}
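An alternative sketch of the same idea, which labels the blocks first and then keeps only the blocks whose first row's B starts with 'a'. It avoids the `method=` argument of `replace`, which recent pandas releases deprecate. The DataFrame construction and the names `block`, `first_b` and `wanted` are assumptions made for illustration.

```python
import pandas as pd

# Rebuild the question's sample data (assumed construction).
df = pd.DataFrame({
    'A': [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 0, 1],
    'B': ['aa', 'ss', 'dd', 'ff', 'ee', 'ff', 'bb', 'gg',
          'ar', 'hh', 'ww', 'jj', 'll'],
})

# Label each block: a new block begins at every A == 0.
block = df['A'].eq(0).cumsum()

# Broadcast each block's first B value to all of its rows,
# then keep the rows of blocks whose first B starts with 'a'.
first_b = df.groupby(block)['B'].transform('first')
wanted = df[first_b.str.startswith('a')]

# Renumber the surviving blocks from 0 and convert each to a dict of lists.
result = {
    i: g.to_dict('list')
    for i, (_, g) in enumerate(wanted.groupby(block.loc[wanted.index]))
}
```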

Related

Is it possible to use two (non-nested) for loops inside a dictionary?

I have these two lists:
a = ['A', 'B', 'C']
b = [ 1 , 2 , 3 ]
And I want to merge them into a dictionary like this:
{'A': 1, 'B': 2, 'C': 3}
I already tried doing stuff like:
{i: j for i in a for j in b}
dict(*a: *b)
Which outputs
{'A': 3, 'B': 3, 'C': 3}
SyntaxError: invalid syntax
a = ['A', 'B', 'C']
b = [ 1 , 2 , 3 ]
print (dict(zip(a,b)))
Output:
{'A': 1, 'B': 2, 'C': 3}
It is better to use zip for this:
a = ['A', 'B', 'C']
b = [ 1 , 2 , 3 ]
{i:k for i,k in zip(a,b)}
#{'A': 1, 'B': 2, 'C': 3}
You can also use enumerate:
d = {elem: b[i] for i, elem in enumerate(a)}
d
{'A': 1, 'B': 2, 'C': 3}
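As a short illustration of why the attempt in the question produced 3 for every key: two for clauses in one comprehension always nest, so each key is assigned every value in turn and the last one wins. The names `nested`, `expanded` and `paired` below are made up for this sketch.

```python
a = ['A', 'B', 'C']
b = [1, 2, 3]

# Two "for" clauses nest: every key is paired with every value in turn,
# and later assignments overwrite earlier ones, so the last value wins.
nested = {i: j for i in a for j in b}

# The equivalent explicit loops.
expanded = {}
for i in a:
    for j in b:
        expanded[i] = j

# zip walks both lists in lockstep instead of nesting them.
paired = dict(zip(a, b))

print(nested)   # {'A': 3, 'B': 3, 'C': 3}
print(paired)   # {'A': 1, 'B': 2, 'C': 3}
```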

remove a dictionary from a dictionary based on a condition

I want to remove a dictionary from a nested dictionary based on a condition.
dict :
{1: {'A': [1, 2, 3, 0], 'B': ['ss', 'dd', 'ff', 'aa']},
2: {'A': [0, 1, 2, 3], 'B': ['ee', 'ff', 'bb', 'gg']},
3: {'A': [0, 1, 2], 'B': ['ar', 'hh', 'ww']},
4: {'A': [ 1, 0], 'B': [ 'll', 'jj']}}
For each inner dictionary, I want to check where 'A' == 0: if the corresponding value in 'B' does not start with 'a', I want to delete that inner dictionary.
Expected:
{1: {'A': [1, 2, 3, 0], 'B': ['ss', 'dd', 'ff', 'aa']},
2: {'A': [0, 1, 2], 'B': ['ar', 'hh', 'ww']},
}
Check
s=pd.DataFrame(d)
new_d = s.loc[:,s.loc['B'].str[0].str[0]=='a'].to_dict()
Out[99]:
{1: {'A': [0, 1, 2, 3], 'B': ['aa', 'ss', 'dd', 'ff']},
3: {'A': [0, 1, 2], 'B': ['ar', 'hh', 'ww']}}
I assume the change of keys in the new dictionary is on purpose, so here is a simple solution with no extra libraries (nested is your dictionary):
temp = [x for x in nested.values() if 0 in x['A'] and any([i[0] == 'a' for i in x['B']])]
new_dict = dict(enumerate(temp, 1))
If you want, you can make it a one-liner, or you can split the condition out into a function, like this:
def check(x):
    if 0 in x['A']:
        return any([i[0] == 'a' for i in x['B']])
    return True
temp = [x for x in nested.values() if check(x)]
If you prefer filter(), the two variations above become:
temp = filter(lambda x: 0 in x['A'] and any(i[0] == 'a' for i in x['B']), nested.values())
And if you declare check() as above, you can do it in this compact way (note that I use new_dict here instead of temp, since it doesn't make sense to split it when it is so compact):
new_dict = dict(enumerate(filter(check, nested.values()), 1))
Your question needs more clarity.
Assuming you intend to delete
{
1: {'A': [0, 1, 2, 3], 'B': ['aa', 'ss', 'dd', 'ff']},
2: {'A': [0, 1, 2], 'B': ['ar', 'hh', 'ww']},
}
from
{1: {'A': [0, 1, 2, 3], 'B': ['aa', 'ss', 'dd', 'ff']},
2: {'A': [0, 1, 2, 3], 'B': ['ee', 'ff', 'bb', 'gg']},
3: {'A': [0, 1, 2], 'B': ['ar', 'hh', 'ww']},
4: {'A': [0, 1], 'B': ['jj', 'll']}}
You can try this:
import re
r = re.compile("a.*")
input_dictionary = {
    1: {'A': [0, 1, 2, 3], 'B': ['aa', 'ss', 'dd', 'ff']},
    2: {'A': [0, 1, 2, 3], 'B': ['ee', 'ff', 'bb', 'gg']},
    3: {'A': [0, 1, 2], 'B': ['ar', 'hh', 'ww']},
    4: {'A': [0, 1], 'B': ['jj', 'll']}}
key_list = list(input_dictionary.keys())
for key in key_list:
    value = input_dictionary[key]
    if any(r.match(b_element) for b_element in value['B']) and 0 in value['A']:
        del input_dictionary[key]
        print("Deleting key: ", key)
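The expected output in the question suggests a positional reading of the condition: an inner dictionary survives when the 'B' value at the same position as each 0 in 'A' starts with 'a'. The answers above test "0 in A" and "some B starts with 'a'" independently, which is not quite the same thing. A minimal sketch of the positional check, assuming that reading is what was intended (`keep` is a hypothetical helper name):

```python
nested = {
    1: {'A': [1, 2, 3, 0], 'B': ['ss', 'dd', 'ff', 'aa']},
    2: {'A': [0, 1, 2, 3], 'B': ['ee', 'ff', 'bb', 'gg']},
    3: {'A': [0, 1, 2], 'B': ['ar', 'hh', 'ww']},
    4: {'A': [1, 0], 'B': ['ll', 'jj']},
}

def keep(inner):
    """Return True when every B value paired with A == 0 starts with 'a'."""
    return all(b.startswith('a')
               for a, b in zip(inner['A'], inner['B']) if a == 0)

# Filter, then renumber the survivors from 1 as in the expected output.
kept = [inner for inner in nested.values() if keep(inner)]
new_dict = dict(enumerate(kept, 1))
```

Note that an inner dictionary with no 0 in 'A' is kept by this version, because `all` over an empty sequence is True; whether that is desired depends on the intent of the question.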

Counting common elements in a list that occur on the same day in a pandas dataframe

I have a DataFrame which looks like:
Users Date
['A', 'B'] 2017-10-21
['B', 'C'] 2017-10-21
['A', 'D'] 2017-10-21
['D', 'E'] 2017-10-22
['A', 'E'] 2017-10-22
['A', 'E', 'D'] 2017-10-22
['C', 'B', 'E'] 2017-10-23
['D', 'C', 'F'] 2017-11-23
I need to make a new DataFrame from this DataFrame that counts the number of times each item shows up in the lists on each day. The count, therefore, would be across different rows on the same date.
For example, the new DataFrame would look like:
Users Date
[A=2, B=2, C=1, D=1] 2017-10-21
[E=3, D=2, A=2] 2017-10-22
[B=1, C=2, D=1, E=1, F=1] 2017-10-23
Some things to note: all the items in the first dataset are lists whose individual elements are strings. The Date column is of DateTime type.
I understand there would be a groupby on the Date column, but I can't figure out how to write the function to apply to each group.
Using groupby and apply with collections.Counter:
import collections
df.groupby('Date').Users.sum().apply(collections.Counter)
Date
2017-10-21 {'A': 2, 'B': 2, 'C': 1, 'D': 1}
2017-10-22 {'D': 2, 'E': 3, 'A': 2}
2017-10-23 {'C': 1, 'B': 1, 'E': 1}
2017-11-23 {'D': 1, 'C': 1, 'F': 1}
Name: Users, dtype: object
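Summing a column of lists concatenates them pairwise, which can get slow for large groups. A possible variant, sketched under the assumption that a plain dictionary keyed by date is an acceptable result, chains each group's lists together and counts them directly. The DataFrame below is rebuilt from the question's sample.

```python
from collections import Counter
from itertools import chain

import pandas as pd

# The question's sample data (assumed construction).
df = pd.DataFrame({
    'Users': [['A', 'B'], ['B', 'C'], ['A', 'D'], ['D', 'E'],
              ['A', 'E'], ['A', 'E', 'D'], ['C', 'B', 'E'], ['D', 'C', 'F']],
    'Date': pd.to_datetime(['2017-10-21', '2017-10-21', '2017-10-21',
                            '2017-10-22', '2017-10-22', '2017-10-22',
                            '2017-10-23', '2017-11-23']),
})

# Iterating the grouped column yields (date, Series of lists) pairs;
# chain.from_iterable flattens the lists so Counter can tally the items.
counts = {
    date: Counter(chain.from_iterable(lists))
    for date, lists in df.groupby('Date')['Users']
}
```

The keys of `counts` are pandas Timestamp objects, for example `counts[pd.Timestamp('2017-10-21')]`.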
If you have multiple columns that you want to count per group:
Setup
import collections
import random
import pandas as pd
s = 'ABCDE'
df = pd.DataFrame({
    'Users': [random.sample(s, random.randint(1, 5)) for _ in range(10)],
    'Tools': [random.sample(s, random.randint(1, 5)) for _ in range(10)],
    'Hours': [random.sample(s, random.randint(1, 5)) for _ in range(10)],
    'Date': ['2017-10-21', '2017-10-21', '2017-10-21', '2017-10-22',
             '2017-10-22', '2017-10-22', '2017-10-23', '2017-10-23', '2017-10-23', '2017-11-23']
})
Using agg:
df.groupby('Date').sum().agg({
    'Users': collections.Counter,
    'Tools': collections.Counter,
    'Hours': collections.Counter
})
Users Tools Hours
Date
2017-10-21 {'C': 2, 'E': 2, 'A': 2, 'B': 2, 'D': 1} {'E': 3, 'A': 2, 'B': 3, 'D': 2, 'C': 2} {'B': 2, 'C': 2, 'E': 1, 'A': 1, 'D': 1}
2017-10-22 {'D': 2, 'A': 2, 'E': 1, 'C': 1, 'B': 2} {'E': 2, 'B': 3, 'A': 3, 'D': 1, 'C': 1} {'B': 1, 'C': 2, 'E': 2, 'A': 2, 'D': 2}
2017-10-23 {'B': 2, 'A': 2, 'D': 1, 'E': 1, 'C': 2} {'D': 3, 'E': 2, 'B': 2, 'C': 3, 'A': 2} {'C': 3, 'E': 2, 'D': 2, 'B': 1, 'A': 2}
2017-11-23 {'D': 1, 'B': 1, 'C': 1} {'B': 1} {'C': 1, 'E': 1}

Add tuples into a list without unpacking the tuple

In Python, I have some tuples like b, and I want to add them to an empty list without unpacking them. Here I simplify by repeating b; in reality the values would differ, so there would be b1, b2, b3, and so on.
b = ({'a': 1, 'b': 1, 'c': 1}, 'y')
bb = [b, b, b]
print(len(bb))
print(len(bb[0]))
bb
This gives
3
2
Out[204]: [({'a': 1, 'b': 1, 'c': 1}, 'y'), ({'a': 1, 'b': 1, 'c': 1}, 'y'), ({'a': 1, 'b': 1, 'c': 1}, 'y')]
which is what I want. But since I am now doing this in a loop, I cannot write bb = [b, b, b]. The syntax I came up with creates a hierarchy that I do not want.
bb = ()
b = ({'a': 1, 'b': 1, 'c': 1}, 'y')
bb = [bb, b]
# in reality I loop bb with 3 times in for loop
bb = [bb, b]
bb = [bb, b]
print(len(bb))
print(len(bb[0]))
bb
This gives
[[[(), ({'a': 1, 'b': 1, 'c': 1}, 'y')], ({'a': 1, 'b': 1, 'c': 1},'y')], ({'a': 1, 'b': 1, 'c': 1}, 'y')]
which is not what I wanted. How can I build the first result in a loop?
Just use a list comprehension:
b = ({'a': 1, 'b': 1, 'c': 1}, 'y')
bb = [b for i in range(3)]
Output:
[({'a': 1, 'c': 1, 'b': 1}, 'y'), ({'a': 1, 'c': 1, 'b': 1}, 'y'), ({'a': 1, 'c': 1, 'b': 1}, 'y')]
Start with a list and use append:
bb = []
b = ({'a': 1, 'b': 1, 'c': 1}, 'y')
for _ in range(3):
    bb.append(b)
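For contrast, here is a small sketch of how append differs from the list operations that do unpack a tuple. The variable names are chosen only for this illustration.

```python
b = ({'a': 1, 'b': 1, 'c': 1}, 'y')

appended = []
appended.append(b)      # adds the tuple itself as a single element
appended.append(b)

extended = []
extended.extend(b)      # iterates over the tuple and adds its two items
extended.extend(b)

concatenated = []
concatenated += [b]     # wrapping the tuple in a list keeps it intact

print(len(appended))    # 2
print(len(extended))    # 4
```

So append (or += with a one-element list) is the operation to use when each tuple should stay a single list entry.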

Convert a Pandas DataFrame to a dictionary

I have a DataFrame with four columns. I want to convert this DataFrame to a Python dictionary. I want the elements of the first column to be the keys and the elements of the other columns in the same row to be the values.
DataFrame:
ID A B C
0 p 1 3 2
1 q 4 3 2
2 r 4 0 9
Output should be like this:
Dictionary:
{'p': [1,3,2], 'q': [4,3,2], 'r': [4,0,9]}
The to_dict() method sets the column names as dictionary keys so you'll need to reshape your DataFrame slightly. Setting the 'ID' column as the index and then transposing the DataFrame is one way to achieve this.
to_dict() also accepts an 'orient' argument which you'll need in order to output a list of values for each column. Otherwise, a dictionary of the form {index: value} will be returned for each column.
These steps can be done with the following line:
>>> df.set_index('ID').T.to_dict('list')
{'p': [1, 3, 2], 'q': [4, 3, 2], 'r': [4, 0, 9]}
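A self-contained version of that one-liner, with the question's frame rebuilt by hand (the construction is an assumption) and the transposed intermediate kept in its own variable so it can be inspected:

```python
import pandas as pd

# The question's sample frame (assumed construction).
df = pd.DataFrame({'ID': ['p', 'q', 'r'],
                   'A': [1, 4, 4],
                   'B': [3, 3, 0],
                   'C': [2, 2, 9]})

# After set_index + transpose, each ID is a column label,
# so to_dict('list') emits one list per original row.
transposed = df.set_index('ID').T
by_id = transposed.to_dict('list')

print(list(transposed.columns))  # ['p', 'q', 'r']
print(by_id)                     # {'p': [1, 3, 2], 'q': [4, 3, 2], 'r': [4, 0, 9]}
```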
In case a different dictionary format is needed, here are examples of the possible orient arguments. Consider the following simple DataFrame:
>>> df = pd.DataFrame({'a': ['red', 'yellow', 'blue'], 'b': [0.5, 0.25, 0.125]})
>>> df
a b
0 red 0.500
1 yellow 0.250
2 blue 0.125
Then the options are as follows.
dict - the default: column names are keys, values are dictionaries of index:data pairs
>>> df.to_dict('dict')
{'a': {0: 'red', 1: 'yellow', 2: 'blue'},
'b': {0: 0.5, 1: 0.25, 2: 0.125}}
list - keys are column names, values are lists of column data
>>> df.to_dict('list')
{'a': ['red', 'yellow', 'blue'],
'b': [0.5, 0.25, 0.125]}
series - like 'list', but values are Series
>>> df.to_dict('series')
{'a': 0 red
1 yellow
2 blue
Name: a, dtype: object,
'b': 0 0.500
1 0.250
2 0.125
Name: b, dtype: float64}
split - splits columns/data/index as keys with values being column names, data values by row and index labels respectively
>>> df.to_dict('split')
{'columns': ['a', 'b'],
'data': [['red', 0.5], ['yellow', 0.25], ['blue', 0.125]],
'index': [0, 1, 2]}
records - each row becomes a dictionary where key is column name and value is the data in the cell
>>> df.to_dict('records')
[{'a': 'red', 'b': 0.5},
{'a': 'yellow', 'b': 0.25},
{'a': 'blue', 'b': 0.125}]
index - like 'records', but a dictionary of dictionaries with keys as index labels (rather than a list)
>>> df.to_dict('index')
{0: {'a': 'red', 'b': 0.5},
1: {'a': 'yellow', 'b': 0.25},
2: {'a': 'blue', 'b': 0.125}}
Should a dictionary like:
{'red': '0.500', 'yellow': '0.250', 'blue': '0.125'}
be required out of a dataframe like:
a b
0 red 0.500
1 yellow 0.250
2 blue 0.125
the simplest way would be:
dict(df.values)
working snippet below:
import pandas as pd
df = pd.DataFrame({'a': ['red', 'yellow', 'blue'], 'b': [0.5, 0.25, 0.125]})
dict(df.values)
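dict(df.values) relies on the frame having exactly two columns in key, value order. A slightly more explicit sketch pairs the two columns by name; `mapping` is just an illustrative variable name.

```python
import pandas as pd

df = pd.DataFrame({'a': ['red', 'yellow', 'blue'], 'b': [0.5, 0.25, 0.125]})

# Pair the two columns explicitly; this does not depend on column order
# or on the frame having exactly two columns.
mapping = dict(zip(df['a'], df['b']))

print(mapping)  # {'red': 0.5, 'yellow': 0.25, 'blue': 0.125}
```

Either way, the values come out as floats, not as the formatted strings such as '0.500'.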
Follow these steps:
Suppose your dataframe is as follows:
>>> df
A B C ID
0 1 3 2 p
1 4 3 2 q
2 4 0 9 r
1. Use set_index to set ID columns as the dataframe index.
df.set_index("ID", drop=True, inplace=True)
2. Use the orient=index parameter to have the index as dictionary keys.
dictionary = df.to_dict(orient="index")
The results will be as follows:
>>> dictionary
{'q': {'A': 4, 'B': 3, 'C': 2}, 'p': {'A': 1, 'B': 3, 'C': 2}, 'r': {'A': 4, 'B': 0, 'C': 9}}
3. If you need each sample as a list, run the following code, choosing the column order first.
column_order = ["A", "B", "C"]  # Your preferred order of columns
d = {}  # Initialize the new dictionary as an empty dictionary
for k in dictionary:
    d[k] = [dictionary[k][column_name] for column_name in column_order]
Try using zip:
df = pd.read_csv("file")
d = dict([(i, [a, b, c]) for i, a, b, c in zip(df.ID, df.A, df.B, df.C)])
print(d)
Output:
{'p': [1, 3, 2], 'q': [4, 3, 2], 'r': [4, 0, 9]}
If you don't mind the dictionary values being tuples, you can use itertuples:
>>> {x[0]: x[1:] for x in df.itertuples(index=False)}
{'p': (1, 3, 2), 'q': (4, 3, 2), 'r': (4, 0, 9)}
For my use (node names with xy positions) I found @user4179775's answer to be the most helpful / intuitive:
import pandas as pd
df = pd.read_csv('glycolysis_nodes_xy.tsv', sep='\t')
df.head()
nodes x y
0 c00033 146 958
1 c00031 601 195
...
xy_dict_list=dict([(i,[a,b]) for i, a,b in zip(df.nodes, df.x,df.y)])
xy_dict_list
{'c00022': [483, 868],
'c00024': [146, 868],
... }
xy_dict_tuples=dict([(i,(a,b)) for i, a,b in zip(df.nodes, df.x,df.y)])
xy_dict_tuples
{'c00022': (483, 868),
'c00024': (146, 868),
... }
Addendum
I later returned to this issue, for other, but related, work. Here is an approach that more closely mirrors the [excellent] accepted answer.
node_df = pd.read_csv('node_prop-glycolysis_tca-from_pg.tsv', sep='\t')
node_df.head()
node kegg_id kegg_cid name wt vis
0 22 22 c00022 pyruvate 1 1
1 24 24 c00024 acetyl-CoA 1 1
...
Convert Pandas dataframe to a [list], {dict}, {dict of {dict}}, ...
Per accepted answer:
node_df.set_index('kegg_cid').T.to_dict('list')
{'c00022': [22, 22, 'pyruvate', 1, 1],
'c00024': [24, 24, 'acetyl-CoA', 1, 1],
... }
node_df.set_index('kegg_cid').T.to_dict('dict')
{'c00022': {'kegg_id': 22, 'name': 'pyruvate', 'node': 22, 'vis': 1, 'wt': 1},
'c00024': {'kegg_id': 24, 'name': 'acetyl-CoA', 'node': 24, 'vis': 1, 'wt': 1},
... }
In my case, I wanted to do the same thing but with selected columns from the Pandas dataframe, so I needed to slice the columns. There are two approaches.
Directly:
(see: Convert pandas to dictionary defining the columns used fo the key values)
node_df.set_index('kegg_cid')[['name', 'wt', 'vis']].T.to_dict('dict')
{'c00022': {'name': 'pyruvate', 'vis': 1, 'wt': 1},
'c00024': {'name': 'acetyl-CoA', 'vis': 1, 'wt': 1},
... }
"Indirectly:" first, slice the desired columns/data from the Pandas dataframe (again, two approaches),
node_df_sliced = node_df[['kegg_cid', 'name', 'wt', 'vis']]
or
node_df_sliced2 = node_df.loc[:, ['kegg_cid', 'name', 'wt', 'vis']]
which can then be used to create a dictionary of dictionaries
node_df_sliced.set_index('kegg_cid').T.to_dict('dict')
{'c00022': {'name': 'pyruvate', 'vis': 1, 'wt': 1},
'c00024': {'name': 'acetyl-CoA', 'vis': 1, 'wt': 1},
... }
Most of the answers do not deal with the situation where ID can exist multiple times in the dataframe. In case ID can be duplicated in the Dataframe df you want to use a list to store the values (a.k.a a list of lists), grouped by ID:
{k: [g['A'].tolist(), g['B'].tolist(), g['C'].tolist()] for k,g in df.groupby('ID')}
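The comprehension above collects one list per column for each ID. If one list per row is preferred instead, a possible row-wise variant looks like this; the sample data with a repeated ID is hypothetical.

```python
import pandas as pd

# Hypothetical data in which ID 'p' appears twice.
df = pd.DataFrame({'ID': ['p', 'q', 'p'],
                   'A': [1, 4, 7],
                   'B': [3, 3, 8],
                   'C': [2, 2, 9]})

# For each ID, take the value columns and convert the block of rows
# into a list of row lists.
rows_by_id = {
    key: group[['A', 'B', 'C']].values.tolist()
    for key, group in df.groupby('ID')
}

print(rows_by_id)  # {'p': [[1, 3, 2], [7, 8, 9]], 'q': [[4, 3, 2]]}
```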
Dictionary comprehension & iterrows() method could also be used to get the desired output.
result = {row.ID: [row.A, row.B, row.C] for (index, row) in df.iterrows()}
df = pd.DataFrame([['p',1,3,2], ['q',4,3,2], ['r',4,0,9]], columns=['ID','A','B','C'])
my_dict = {k:list(v) for k,v in zip(df['ID'], df.drop(columns='ID').values)}
print(my_dict)
with output
{'p': [1, 3, 2], 'q': [4, 3, 2], 'r': [4, 0, 9]}
With this method, the columns of the dataframe will be the keys and the values of each column will be the values.
data_dict = dict()
for col in dataframe.columns:
    data_dict[col] = dataframe[col].values.tolist()
DataFrame.to_dict() converts DataFrame to dictionary.
Example
>>> df = pd.DataFrame(
...     {'col1': [1, 2], 'col2': [0.5, 0.75]}, index=['a', 'b'])
>>> df
   col1  col2
a     1  0.50
b     2  0.75
>>> df.to_dict()
{'col1': {'a': 1, 'b': 2}, 'col2': {'a': 0.5, 'b': 0.75}}
See the DataFrame.to_dict documentation for details.
