Delete some rows in dataframe based on condition in another column - python

I have a dataframe as follows:
name  value
aa    0
aa    0
aa    1
aa    0
aa    0
bb    0
bb    0
bb    1
bb    0
bb    0
bb    0
For each 'name' group, I want to delete every row that comes after the first 1 in the 'value' column, keeping the rows up to and including that 1:
name  value
aa    0
aa    0
aa    1
bb    0
bb    0
bb    1
What is the best way to do so? I thought about the pd.groupby method with some condition inside, but I cannot figure out how to make it work.

Not the most beautiful way to do it, but this should work:
df = df.loc[df['value'].groupby(df['name']).cumsum().groupby(df['name']).cumsum() <= 1]
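To see why the double cumsum works, here is the same expression broken into steps (a sketch on the sample frame, assuming 'value' holds a single 1 per 'name' group, as in the question):
# First cumsum per group: 0 before the group's 1, and 1 from that row on.
reached_one = df['value'].groupby(df['name']).cumsum()
# Second cumsum per group: 0 before the first 1, exactly 1 on the row
# of the first 1, and larger on every row after it.
rows_past = reached_one.groupby(df['name']).cumsum()
# Keep rows strictly before the first 1 (0) or the first 1 itself (1).
df_filtered = df.loc[rows_past <= 1]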

Here's my approach to solving this.
# Imports.
import pandas as pd

# Creating a DataFrame.
df = pd.DataFrame([{'name': 'aa', 'value': 0},
                   {'name': 'aa', 'value': 0},
                   {'name': 'aa', 'value': 1},
                   {'name': 'aa', 'value': 0},
                   {'name': 'aa', 'value': 0},
                   {'name': 'bb', 'value': 0},
                   {'name': 'bb', 'value': 0},
                   {'name': 'bb', 'value': 1},
                   {'name': 'bb', 'value': 0},
                   {'name': 'bb', 'value': 0},
                   {'name': 'bb', 'value': 0},
                   {'name': 'bb', 'value': 0}])

# Filtering the DataFrame: keep each group's rows up to and including
# the first occurrence of the maximum of 'value' (the first 1).
df_filtered = df.groupby('name').apply(lambda x: x[x.index <= x['value'].idxmax()]).reset_index(drop=True)
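For the sample above this yields the expected six rows. A quick check (note that idxmax picks a group's first row when the group is all zeros, so this approach assumes every group contains at least one 1):
print(df_filtered)
#   name  value
# 0   aa      0
# 1   aa      0
# 2   aa      1
# 3   bb      0
# 4   bb      0
# 5   bb      1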

Related

Converting complex dictionary to pandas dataframe

I am looking for a Pythonic way of extracting part of the dictionary below and turning it into the pandas DataFrame shown. I'd appreciate your help with that!
{'data': [{'x': {'name': 'Gamma', 'unit': 'cps', 'values': [10, 20, 30]},
           'y': {'name': 'Depth', 'unit': 'm', 'values': [34.3, 34.5, 34.7]}}]}
   Depth  Gamma
1   34.3     10
2   34.5     20
3   34.7     30
Sure. Basically, you need to iterate over the values of each dict in the 'data' list; each value is itself a dict of column information:
In [1]: data = {'data': [{'x': {'name': 'Gamma', 'unit': 'cps', 'values': [10, 20, 30]},
...: 'y': {'name': 'Depth', 'unit': 'm', 'values': [34.3, 34.5, 34.7]}}]}
In [2]: import pandas as pd
In [3]: pd.DataFrame({
...: col["name"]: col["values"]
...: for d in data['data']
...: for col in d.values()
...: })
Out[3]:
Gamma Depth
0 10 34.3
1 20 34.5
2 30 34.7
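The same construction written as an explicit loop, if the nested comprehension is hard to read (continuing the session above; like the comprehension, it assumes all 'values' lists have equal length):
cols = {}
for d in data['data']:
    for col in d.values():
        # Each column dict contributes one DataFrame column: name -> values.
        cols[col['name']] = col['values']
df = pd.DataFrame(cols)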

Pyspark preserving the order of fields when converting dataframe rows to dictionary

I have a dataframe df with below data:
Name Value Code
a 1 1
b 2 1
c 3 2
d 4 2
I want to convert this dataframe to a list of dictionaries. I tried using asDict():
map(lambda row: row.asDict(), df.collect())
and it is giving the following output:
[{'Code': 1, 'Name': u'a', 'Value': 1}, {'Code': 1, 'Name': u'b', 'Value': 2}, {'Code': 2, 'Name': u'c', 'Value': 3}, {'Code': 2, 'Name': u'd', 'Value': 4}]
Here the fields are sorted. But I want to preserve the order of the fields.
My output should look like this:
[{'Name': u'a', 'Value': 1, 'Code': 1}, {'Name': u'b', 'Value': 2, 'Code': 1}, {'Name': u'c', 'Value': 3, 'Code': 2}, {'Name': u'd', 'Value': 4, 'Code': 2}]
Is there any other way to achieve this other than using asDict() method?
In Python (before 3.7), a dict doesn't preserve any notion of order; you need an OrderedDict for that. You could do something like this:
from collections import OrderedDict
...
list(map(lambda row: OrderedDict(zip(df.columns, row)), df.collect()))
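As a side note, on Python 3.7+ plain dicts preserve insertion order, so a plain asDict() should already keep the column order there (a sketch assuming a modern PySpark, where Row fields follow the DataFrame schema):
result = [row.asDict() for row in df.collect()]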

list of dictionary column in a dataframe

A column in my dataframe is a list of dictionaries.
How can I filter the rows that have a specific value for the 'id' key in the 'tag' column? For instance, the rows that contain {"id": 18}?
Since your tag column is list-valued, you could use explode if you are on pandas 0.25+:
# toy data
df = pd.DataFrame({'type': ['df', 'fg', 'ff'],
                   'tag': [[{"id": 12}, {"id": 13}],
                           [{"id": 12}],
                           [{'id': 10}]]})

# make each row contain exactly one dict: {id: val}
s = df['tag'].explode()

# the indexes of the rows of interest
idx = s.index[pd.DataFrame(s.to_list())['id'].values == 12]

# output
df.loc[idx]
Output:
type tag
0 df [{'id': 12}, {'id': 13}]
1 fg [{'id': 12}]
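One caveat: if a single list could contain the target id more than once, idx would repeat that row's index and df.loc would duplicate the row. De-duplicating the index first avoids that (not needed for the toy data):
df.loc[idx.unique()]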
Sample DataFrame:
df = pd.DataFrame({'type': ['dg', 'fg', 'ff'],
                   'tag': [[{"id": 12}, {"id": 13}], [{"id": 12}], [{"id": 29}]]})
print(df)
type tag
0 dg [{'id': 12}, {'id': 13}]
1 fg [{'id': 12}]
2 ff [{'id': 29}]
Then you can use Series.apply to check each cell:
df_filtered = df[df['tag'].apply(lambda x: pd.Series([d['id'] for d in x]).eq(12).any())]
print(df_filtered)
type tag
0 dg [{'id': 12}, {'id': 13}]
1 fg [{'id': 12}]
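A lighter-weight variant of the same check, using a plain generator instead of building a Series per row (a sketch that assumes every dict in the list has an 'id' key):
df_filtered = df[df['tag'].apply(lambda tags: any(d['id'] == 12 for d in tags))]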

Making new columns of keys with values stores as a list of values from a list of dicts?

I have a data frame (10 million rows) which looks like the following. For better understanding, I have simplified it.
user_id event_params
10 [{'key': 'x', 'value': '1'}, {'key': 'y', 'value': '3'}, {'key': 'z', 'value': '4'}]
11 [{'key': 'y', 'value': '5'}, {'key': 'z', 'value': '9'}]
12 [{'key': 'a', 'value': '5'}]
I want to make new columns from all the unique keys in the dataframe, with the corresponding values stored under each key. The output should look like below:
user_id x y z a
10 1 3 4 NA
11 NA 5 9 NA
12 NA NA NA 5
Just create a new dataframe and append one record at a time via the append function. (Note: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; on newer versions, collect the records in a list and build the frame once, or see the vectorized sketch after the code below.)
import pandas as pd

df = pd.DataFrame()
data = [
    [12, [{'key': 'x', 'value': '1'}, {'key': 'y', 'value': '3'}, {'key': 'z', 'value': '4'}]],
    [13, [{'key': 'a', 'value': '5'}]]
]
for user_id, event_params in data:
    record = {e['key']: e['value'] for e in event_params}
    record['user_id'] = user_id
    df = df.append(record, ignore_index=True)
df
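With 10 million rows, appending one record at a time will be slow. Here is a vectorized sketch using explode and pivot (it assumes the real frame has the 'user_id' and 'event_params' columns exactly as shown in the question, and needs pandas 0.25+ for explode):
import pandas as pd

df = pd.DataFrame({
    'user_id': [10, 11, 12],
    'event_params': [
        [{'key': 'x', 'value': '1'}, {'key': 'y', 'value': '3'}, {'key': 'z', 'value': '4'}],
        [{'key': 'y', 'value': '5'}, {'key': 'z', 'value': '9'}],
        [{'key': 'a', 'value': '5'}],
    ],
})

# One row per {'key': ..., 'value': ...} dict, carrying user_id alongside.
exploded = df.explode('event_params')
kv = pd.DataFrame(exploded['event_params'].tolist(), index=exploded['user_id'])
# Pivot keys into columns; keys a user never sent become NaN.
wide = kv.reset_index().pivot(index='user_id', columns='key', values='value')
print(wide)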

python efficient group by

I am looking for the most efficient way to extract items from a list of dictionaries. I have a list of about 5k dictionaries. I need to extract those records for which grouping by a particular field gives more than a threshold T number of records. For example, if T = 2 and the dictionary key is 'id':
list = [{'name': 'abc', 'id' : 1}, {'name': 'bc', 'id' : 1}, {'name': 'c', 'id' : 1}, {'name': 'bbc', 'id' : 2}]
The result should be:
list = [{'name': 'abc', 'id' : 1}, {'name': 'bc', 'id' : 1}, {'name': 'c', 'id' : 1}]
i.e. all the records whose id occurs in at least 3 records (more than T = 2) of the same id.
from collections import defaultdict
from itertools import chain

l = [{'name': 'abc', 'id': 1}, {'name': 'bc', 'id': 1},
     {'name': 'c', 'id': 1}, {'name': 'bbc', 'id': 2}]

T = 2
d = defaultdict(list)
for dct in l:
    d[dct["id"]].append(dct)

print(list(chain.from_iterable(v for v in d.values() if len(v) > T)))
[{'name': 'abc', 'id': 1}, {'name': 'bc', 'id': 1}, {'name': 'c', 'id': 1}]
If you want to keep them in groups, don't chain; just use each value:
[v for v in d.values() if len(v) > T] # itervalues for python2
[[{'name': 'abc', 'id': 1}, {'name': 'bc', 'id': 1}, {'name': 'c', 'id': 1}]]
Avoid using list as a variable name: it shadows the built-in list type, and with a variable named list the code above would break at d = defaultdict(list).
To start out, I would make a dictionary to group by your id:
control = {}
for d in list:
    control.setdefault(d['id'], []).append(d)
From here, all you have to do is check the length of each group in control to see if it's greater than your specified threshold.
Put it in a function like so:
def find_by_id(obj, threshold):
    control = {}
    for d in obj:
        control.setdefault(d['id'], []).append(d)
    for val in control.values():
        if len(val) > threshold:
            print(val)
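A quick usage sketch (with the sample data renamed to records, to avoid shadowing the list builtin as the other answer recommends):
records = [{'name': 'abc', 'id': 1}, {'name': 'bc', 'id': 1},
           {'name': 'c', 'id': 1}, {'name': 'bbc', 'id': 2}]
find_by_id(records, 2)
# prints: [{'name': 'abc', 'id': 1}, {'name': 'bc', 'id': 1}, {'name': 'c', 'id': 1}]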
