Is there a way to remove duplicate rows for two specified columns using dplython?
This is an example of what I want to accomplish:
import pandas as pd
from dplython import *
data = {'store': [1, 1, 2, 2, 4, 4],
        'Type': ['A', 'A', 'A', 'B', 'B', 'B'],
        'weekly_sales': [100, 200, 300, 400, 500, 200]}
df = pd.DataFrame(data)
df.drop_duplicates(subset=["store", "Type"])
This is my dplython attempt:
df_R = DplyFrame(df)
df_R >> sift(drop_duplicates(subset=[X.store,X.Type]))
Thanks!
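A note on why the attempt fails, plus a possible workaround: sift is dplython's equivalent of dplyr's filter, so it expects boolean conditions, and drop_duplicates is not a dplython verb. However, DplyFrame subclasses pandas.DataFrame, so (a minimal sketch, assuming nothing dplython-specific is required) the inherited pandas method can simply be called directly:
import pandas as pd
from dplython import DplyFrame

data = {'store': [1, 1, 2, 2, 4, 4],
        'Type': ['A', 'A', 'A', 'B', 'B', 'B'],
        'weekly_sales': [100, 200, 300, 400, 500, 200]}
df_R = DplyFrame(pd.DataFrame(data))

# drop_duplicates is inherited from pandas.DataFrame, so it takes plain
# column names (not X expressions)
deduped = df_R.drop_duplicates(subset=['store', 'Type'])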
I have a pandas dataframe as below. I'm wondering if there's any way to use the values of one column as the keys of the JSON.
df:
| symbol | price |
|:-------|-------|
| a      | 120   |
| b      | 100   |
| c      | 200   |
I expect the json to look like {'a': 120, 'b': 100, 'c': 200}
I've tried the below and got the result as {"symbol":"a","price":120}{"symbol":"b","price":100}{"symbol":"c","price":200}
df.to_json('price.json', orient='records', lines=True)
Let's start by creating the dataframe that the OP mentions:
import pandas as pd
df = pd.DataFrame({'symbol': ['a', 'b', 'c'], 'price': [120, 100, 200]})
Considering that the OP doesn't want the JSON values as lists (as the OP noted in a comment), the following will do the job:
df.groupby('symbol').price.apply(lambda x: x.iloc[0]).to_dict()
[Out]: {'a': 120, 'b': 100, 'c': 200}
If one wants the JSON values as lists, the following will do the job:
df.groupby('symbol').price.apply(list).to_json()
[Out]: {"a":[120],"b":[100],"c":[200]}
Try it like this:
import pandas as pd
d = {'symbol': ['a', 'b', 'c'], 'price': [120, 100, 200]}
df = pd.DataFrame(data=d)
print(df)
print(df.set_index('symbol').rename(columns={'price':'json_data'}).to_json())
# EXPORT TO FILE
df.set_index('symbol').rename(columns={'price':'json_data'}).to_json('price.json')
Output:
  symbol  price
0      a    120
1      b    100
2      c    200
{"json_data":{"a":120,"b":100,"c":200}}
Given the following code:
import pandas as pd
import plotly.express as px
d = {'col1': ['a', 'a', 'b', 'b', 'b'], 'col2': [5, 6, 7, 8, 9]}
df = pd.DataFrame(data=d)
fig = px.bar(df, y='col1', color='col1')
fig.show()
that generates a bar plot whose tooltip shows a count field. How do I remove count from hover_data?
plotly==5.1.0
You can remove it by overriding the hovertemplate:
import pandas as pd
import plotly.express as px
d = {'col1': ['a', 'a', 'b', 'b', 'b'], 'col2': [5, 6, 7, 8, 9]}
df = pd.DataFrame(data=d)
fig = px.bar(df, y='col1', color='col1').update_traces(
    hovertemplate='col1=%{y}<br><extra></extra>')
fig.show()
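Alternatively (a sketch, assuming the goal is the same count-per-category bar and a reasonably recent pandas), you can pre-aggregate the counts yourself. The count then lives in an ordinary dataframe column, which hover_data can suppress with a False entry:
import pandas as pd
import plotly.express as px

d = {'col1': ['a', 'a', 'b', 'b', 'b'], 'col2': [5, 6, 7, 8, 9]}
df = pd.DataFrame(data=d)

# one row per category, with its count in a column named 'size'
counts = df.groupby('col1', as_index=False).size()

# hover_data={'size': False} hides the count in the tooltip
fig = px.bar(counts, x='size', y='col1', color='col1',
             hover_data={'size': False})
fig.show()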
I have the following illustrative example dataframe df:
df = pd.DataFrame({'name': ['A', 'B', 'C'],
                   'value': [100, 300, 150]})
The real dataframe has many more columns and rows. As I said, this is only an illustrative example.
I want to change the order of the rows, so that I get the following result:
df = pd.DataFrame({'name': ['A', 'C', 'B'],
                   'value': [100, 150, 300]})
How can I do this?
And how can I drop the row with name 'A' after reordering, so that I get the new df:
df = pd.DataFrame({'name': ['C', 'B'],
                   'value': [150, 300]})
You can use sort_values, then slice the df by position with iloc:
out = df.sort_values('value').iloc[1:]
Out[190]:
  name  value
2    C    150
1    B    300
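If 'A' is not guaranteed to sort first, a sketch of a label-based variant filters on the name column instead of slicing by position:
import pandas as pd

df = pd.DataFrame({'name': ['A', 'B', 'C'],
                   'value': [100, 300, 150]})

# sort the rows, then keep everything except the row(s) named 'A'
out = df.sort_values('value')
out = out[out['name'] != 'A']
print(out)
#   name  value
# 2    C    150
# 1    B    300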
I'm a new Python user familiar with R.
I want to calculate user-defined quantiles for groups complete with the count of observations in each group.
In R I would do:
df_sum <- df %>%
  group_by(group) %>%
  dplyr::summarise(q85 = quantile(obsval, probs = 0.85, type = 8),
                   n = n())
In python I can get the grouped percentile by:
df_sum = df.groupby(['group'])['obsval'].quantile(0.85)
How do I add the group count to this?
I have tried:
df_sum = df.groupby(['group'])['obsval'].describe(percentile=[0.85])[[count]]
df_sum = df.groupby(['group'])['obsval'].quantile(0.85).describe(['count'])
Example data:
data = {'group':['A', 'B', 'A', 'A', 'B', 'B', 'B', 'A', 'A'], 'obsval':[1, 3, 3, 5, 4, 6, 7, 7, 8]}
df = pd.DataFrame(data)
df
Expected result:
group  percentile  count
A      7.4         5
B      6.55        4
You can use agg() on the groupby to apply multiple functions in one pass.
For the quantile itself you can use numpy.quantile().
import pandas as pd
import numpy as np
data = {'group':['A', 'B', 'A', 'A', 'B', 'B', 'B', 'A', 'A'], 'obsval':[1, 3, 3, 5, 4, 6, 7, 7, 8]}
df = pd.DataFrame(data)
df_sum = df.groupby(['group'])['obsval'].agg([lambda x: np.quantile(x, q=0.85), "count"])
df_sum.columns = ['percentile', 'count']
print(df_sum)
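A variant of the same idea (a sketch using pandas named aggregation, available since pandas 0.25) sets the output column names directly inside agg, so the separate rename step is not needed. Note that pandas' default linear interpolation corresponds to R's type = 7 quantile, not the type = 8 the OP used, so results can differ slightly from R:
import pandas as pd

data = {'group': ['A', 'B', 'A', 'A', 'B', 'B', 'B', 'A', 'A'],
        'obsval': [1, 3, 3, 5, 4, 6, 7, 7, 8]}
df = pd.DataFrame(data)

# named aggregation: output_name=(aggregation to apply)
df_sum = df.groupby('group')['obsval'].agg(
    percentile=lambda s: s.quantile(0.85),  # linear interpolation (R type 7)
    count='count',
)
print(df_sum)
#        percentile  count
# group
# A            7.40      5
# B            6.55      4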
I'm trying to change the values of only certain values in a dataframe:
test = pd.DataFrame({'col1': ['a', 'a', 'b', 'c'], 'col2': [1, 2, 3, 4]})
dict_curr = {'a':2}
test['col2'] = np.where(test.col1 == 'a', test.col1.map(lambda x: dict_curr[x]), test.col2)
However, this doesn't seem to work: even though I'm only targeting the values in col1 that are 'a', the error says
KeyError: 'b'
implying that it also looks at the rows where col1 is 'b'. Why is this? And how do I fix it?
The error is originating from the test.col1.map(lambda x: dict_curr[x]) part. You look up the values from col1 in dict_curr, which only has an entry for 'a', not for 'b'.
You can also just index the dataframe:
test.loc[test.col1 == 'a', 'col2'] = 2
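If dict_curr had more than one key, the same indexing idea generalizes (a sketch): build a mask of the rows whose col1 value is a key of the dictionary, and map only those rows:
import pandas as pd

test = pd.DataFrame({'col1': ['a', 'a', 'b', 'c'], 'col2': [1, 2, 3, 4]})
dict_curr = {'a': 2}

# only rows whose col1 value appears in dict_curr are touched,
# so the lookup can never raise a KeyError
mask = test['col1'].isin(dict_curr.keys())
test.loc[mask, 'col2'] = test.loc[mask, 'col1'].map(dict_curr)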
The problem is that when you call np.where, all of its arguments are evaluated first, and only then is the result selected based on the condition. So the dictionary is queried for 'b' and 'c' too, even though those values will be discarded later. Probably the easiest fix is:
import pandas as pd
import numpy as np
test = pd.DataFrame({'col1': ['a', 'a', 'b', 'c'], 'col2': [1, 2, 3, 4]})
dict_curr = {'a': 2}
test['col2'] = np.where(test.col1 == 'a', test.col1.map(lambda x: dict_curr.get(x, 0)), test.col2)
This will give the value 0 for keys not in the dictionary, but since those values are discarded by the condition anyway, it does not matter which default you use.
Another easy way of getting the same result is:
import pandas as pd
test = pd.DataFrame({'col1': ['a', 'a', 'b', 'c'], 'col2': [1, 2, 3, 4]})
dict_curr = {'a': 2}
test['col2'] = test.apply(lambda x: dict_curr.get(x.col1, x.col2), axis=1)
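A third idiom (a sketch; note that map produces NaN for missing keys, which upcasts the column to float, hence the cast back to int at the end) maps the whole column and fills the misses from the original values:
import pandas as pd

test = pd.DataFrame({'col1': ['a', 'a', 'b', 'c'], 'col2': [1, 2, 3, 4]})
dict_curr = {'a': 2}

# map gives NaN where col1 has no entry in dict_curr;
# fillna restores the original col2 value there
test['col2'] = test['col1'].map(dict_curr).fillna(test['col2']).astype(int)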