I have a pandas dataframe as below. I'm just wondering if there's any way to use my column values as the keys of the JSON.
df:
| symbol | price |
|:-------|:------|
| a      | 120   |
| b      | 100   |
| c      | 200   |
I expect the json to look like {'a': 120, 'b': 100, 'c': 200}
I've tried the below and got one JSON object per row instead: {"symbol":"a","price":120} {"symbol":"b","price":100} {"symbol":"c","price":200}
df.to_json('price.json', orient='records', lines=True)
Let's start by creating the dataframe that the OP mentions:
import pandas as pd
df = pd.DataFrame({'symbol': ['a', 'b', 'c'], 'price': [120, 100, 200]})
Considering that the OP doesn't want the JSON values as lists (as the OP commented), the following will do the job:
df.groupby('symbol').price.apply(lambda x: x.iloc[0]).to_dict()
[Out]: {'a': 120, 'b': 100, 'c': 200}
If one does want the JSON values as lists, the following will do the job:
df.groupby('symbol').price.apply(list).to_json()
[Out]: {"a":[120],"b":[100],"c":[200]}
Try it like this:
import pandas as pd
d = {'symbol': ['a', 'b', 'c'], 'price': [120, 100, 200]}
df = pd.DataFrame(data=d)
print(df)
print(df.set_index('symbol').rename(columns={'price': 'json_data'}).to_json())
# EXPORT TO FILE
df.set_index('symbol').rename(columns={'price':'json_data'}).to_json('price.json')
Output:
symbol price
0 a 120
1 b 100
2 c 200
{"json_data":{"a":120,"b":100,"c":200}}
Is there a way to remove duplicate rows for two specified columns using dplython?
This is an example of what I want to accomplish:
import pandas as pd
from dplython import *
data = {'store': [1, 1, 2, 2, 4, 4],
        'Type': ['A', 'A', 'A', 'B', 'B', 'B'],
        'weekly_sales': [100, 200, 300, 400, 500, 200]}
df = pd.DataFrame(data)
df.drop_duplicates(subset=["store", "Type"])
This is my dplython attempt:
df_R = DplyFrame(df)
df_R >> sift(drop_duplicates(subset=[X.store,X.Type]))
Thanks!
I have this dataframe:
import pandas as pd
import numpy as np

data = [{'a': 2, 'b': 2, 'c': 3}, {'b': 2, 'c': np.nan}, {'a': 10, 'b': 20, 'c': 30}, {'a': 10, 'b': np.nan, 'c': np.nan}]
df = pd.DataFrame(data, index=['John', 'John', 'Mike', 'Mike'])
What I am trying to do is to fill the missing data of every user.
My goal dataframe would be:
data = [{'a': 2, 'b': 2, 'c': 3}, {'a': 2, 'b': 2, 'c': 3}, {'a': 10, 'b': 20, 'c': 30}, {'a': 10, 'b': 20, 'c': 30}]
df = pd.DataFrame(data, index=['John', 'John', 'Mike', 'Mike'])
This needs to be applied to thousands of rows, but I believe this minimal example is enough to illustrate the approach on a big dataframe.
I do not want to use pd.merge, since my original dataframes have thousands of columns and a merge would add that many columns to my dataframe again.
You can use groupby().transform('first') to extract the first valid values for each user, then fillna:
df = df.fillna(df.groupby(level=0).transform('first'))
Note: you can
- replace 'first' with other functions, e.g. 'mean', if you like;
- apply the function directly instead of transform, i.e. groupby().first(), since you are grouping based on the index (see the sketch below).
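For instance, the second point could look like this (a sketch; it relies on fillna aligning the one-row-per-user result back onto the duplicated index labels):

import numpy as np
import pandas as pd

data = [{'a': 2, 'b': 2, 'c': 3}, {'b': 2, 'c': np.nan},
        {'a': 10, 'b': 20, 'c': 30}, {'a': 10, 'b': np.nan, 'c': np.nan}]
df = pd.DataFrame(data, index=['John', 'John', 'Mike', 'Mike'])

# first() keeps one row of first valid values per user; fillna then
# broadcasts those values back onto every row with the same label.
filled = df.fillna(df.groupby(level=0).first())
print(filled)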
I have the following illustrative example dataframe df:
df = pd.DataFrame({'name': ['A', 'B', 'C'],
                   'value': [100, 300, 150]})
The real dataframe has many more columns and rows. As I said, this is only an illustrative example.
I want to change the order of the rows, so that I get the following result:
df = pd.DataFrame({'name': ['A', 'C', 'B'],
                   'value': [100, 150, 300]})
How can I do this?
And how can I drop the row with name A after reordering, so that I get the new df:
df = pd.DataFrame({'name': ['C', 'B'],
                   'value': [150, 300]})
You can use sort_values and then slice the df by position with iloc:
out = df.sort_values('value').iloc[1:]
Out[190]:
name value
2 C 150
1 B 300
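If the desired order doesn't follow from sorting any column, an alternative sketch is to reorder rows explicitly by index label with reindex and then drop the unwanted one (the labels 0, 2, 1 are read off the example above):

import pandas as pd

df = pd.DataFrame({'name': ['A', 'B', 'C'],
                   'value': [100, 300, 150]})

# Put the rows in the explicit order A, C, B, then drop the row for 'A'.
out = df.reindex([0, 2, 1]).drop(0)
print(out)
#   name  value
# 2    C    150
# 1    B    300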
I'm trying to change only certain values in a dataframe:
import pandas as pd
import numpy as np

test = pd.DataFrame({'col1': ['a', 'a', 'b', 'c'], 'col2': [1, 2, 3, 4]})
dict_curr = {'a':2}
test['col2'] = np.where(test.col1 == 'a', test.col1.map(lambda x: dict_curr[x]), test.col2)
However, this doesn't seem to work: even though I'm only targeting the rows where col1 is 'a', the error says
KeyError: 'b'
implying that the lookup also runs for the rows where col1 is 'b'. Why is this? And how do I fix it?
The error is originating from the test.col1.map(lambda x: dict_curr[x]) part. You look up the values from col1 in dict_curr, which only has an entry for 'a', not for 'b'.
You can also just index the dataframe:
test.loc[test.col1 == 'a', 'col2'] = 2
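If dict_curr grows beyond a single key, the same loc-based idea generalizes; a sketch (the second dictionary entry is made up for illustration):

import pandas as pd

test = pd.DataFrame({'col1': ['a', 'a', 'b', 'c'], 'col2': [1, 2, 3, 4]})
dict_curr = {'a': 2, 'c': 9}  # hypothetical extra entry

# Only rows whose col1 appears in the dict are looked up, so no KeyError.
mask = test['col1'].isin(dict_curr.keys())
test.loc[mask, 'col2'] = test.loc[mask, 'col1'].map(dict_curr)
print(test)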
The problem is that when you call np.where, all of its arguments are evaluated first, and only then is the result selected depending on the condition. So the dictionary is queried for 'b' and 'c' as well, even though those values will be discarded later. Probably the easiest fix is:
import pandas as pd
import numpy as np
test = pd.DataFrame({'col1': ['a', 'a', 'b', 'c'], 'col2': [1, 2, 3, 4]})
dict_curr = {'a': 2}
test['col2'] = np.where(test.col1 == 'a', test.col1.map(lambda x: dict_curr.get(x, 0)), test.col2)
This will give the value 0 for keys not in the dictionary, but since it will be discarded later it does not matter which value you use.
Another easy way of getting the same result is:
import pandas as pd
test = pd.DataFrame({'col1': ['a', 'a', 'b', 'c'], 'col2': [1, 2, 3, 4]})
dict_curr = {'a': 2}
test['col2'] = test.apply(lambda x: dict_curr.get(x.col1, x.col2), axis=1)
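A third common idiom (a sketch; note that map yields NaN for keys missing from the dict, so the intermediate result is float and is cast back at the end) avoids both np.where and apply:

import pandas as pd

test = pd.DataFrame({'col1': ['a', 'a', 'b', 'c'], 'col2': [1, 2, 3, 4]})
dict_curr = {'a': 2}

# map gives NaN where col1 has no dict entry; fillna restores col2 there.
test['col2'] = test['col1'].map(dict_curr).fillna(test['col2']).astype(test['col2'].dtype)
print(test)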
The gist of this post is that I have "23" in my original data, and I want "23" in my resulting dict (not "23.0"). Here's how I've tried to handle it with Pandas.
My Excel worksheet has a coded Region column:
23
11
27
(blank)
25
Initially, I created a dataframe, and pandas set the dtype of Region to float64:
import pandas as pd
filepath = 'data_file.xlsx'
df = pd.read_excel(filepath, sheet_name=0, header=0)
df
   region
0    23.0
1    11.0
2    27.0
3     NaN
4    25.0
Pandas will convert the dtype to object if I use fillna() to replace the NaNs with blanks, which seems to eliminate the decimals.
df.fillna('', inplace=True)
df
  region
0     23
1     11
2     27
3
4     25
Except I still get decimals when I convert the dataframe to a dict:
data = df.to_dict('records')
data
[{'region': 23.0},
 {'region': 11.0},
 {'region': 27.0},
 {'region': ''},
 {'region': 25.0}]
Is there a way I can create the dict without the decimal places? By the way, I'm writing a generic utility, so I won't always know the column names and/or value types, which means I'm looking for a generic solution (vs. explicitly handling Region).
Any help is much appreciated, thanks!
The problem is that after fillna('') your underlying values are still float, despite the column being of type object:
import numpy as np
import pandas as pd

s = pd.Series([23., 11., 27., np.nan, 25.])
s.fillna('').iloc[0]
23.0
Instead, apply a formatter, then replace:
s.apply('{:0.0f}'.format).replace('nan', '').to_dict()
{0: '23', 1: '11', 2: '27', 3: '', 4: '25'}
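Since the OP wants a generic utility over unknown columns, one way to extend this Series recipe to a whole frame is to apply the formatter only to the float columns; a sketch, assuming (as in the Region example) that the floats are whole numbers:

import numpy as np
import pandas as pd

df = pd.DataFrame({'region': [23., 11., 27., np.nan, 25.]})

out = df.copy()
# Format only the float columns; other dtypes are left untouched.
for col in out.select_dtypes(include='float').columns:
    out[col] = out[col].apply('{:0.0f}'.format).replace('nan', '')

print(out.to_dict('records'))
# [{'region': '23'}, {'region': '11'}, {'region': '27'}, {'region': ''}, {'region': '25'}]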
Using a custom function takes care of integers and keeps strings as strings:
import pprint
import pandas as pd

def func(x):
    try:
        return int(x)
    except ValueError:
        return x

df = pd.DataFrame({'region': [1, 2, 3, float('nan')],
                   'col2': ['a', 'b', 'c', float('nan')]})
df.fillna('', inplace=True)
pprint.pprint(df.applymap(func).to_dict('records'))
Output:
[{'col2': 'a', 'region': 1},
{'col2': 'b', 'region': 2},
{'col2': 'c', 'region': 3},
{'col2': '', 'region': ''}]
A variation that also keeps floats as floats:
import pprint
import pandas as pd

def func(x):
    try:
        if int(x) == x:
            return int(x)
        else:
            return x
    except ValueError:
        return x

df = pd.DataFrame({'region1': [1, 2, 3, float('nan')],
                   'region2': [1.5, 2.7, 3, float('nan')],
                   'region3': ['a', 'b', 'c', float('nan')]})
df.fillna('', inplace=True)
pprint.pprint(df.applymap(func).to_dict('records'))
Output:
[{'region1': 1, 'region2': 1.5, 'region3': 'a'},
{'region1': 2, 'region2': 2.7, 'region3': 'b'},
{'region1': 3, 'region2': 3, 'region3': 'c'},
{'region1': '', 'region2': '', 'region3': ''}]
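Note that on pandas 2.1 and later applymap is deprecated in favour of the element-wise DataFrame.map; reusing df and func from above, the last line would become:

# pandas >= 2.1: same element-wise behaviour under a new name
pprint.pprint(df.map(func).to_dict('records'))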
You could add dtype=str:
import pandas as pd
filepath = 'data_file.xlsx'
df = pd.read_excel(filepath, sheet_name=0, header=0, dtype=str)
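With everything read as strings the blanks still arrive as NaN, so a fillna('') step is still needed; a sketch of the expected end-to-end result, building a stand-in frame instead of reading the OP's file:

import numpy as np
import pandas as pd

# Stand-in for read_excel(..., dtype=str): codes arrive as strings, blanks as NaN.
df = pd.DataFrame({'region': ['23', '11', '27', np.nan, '25']})
df.fillna('', inplace=True)

print(df.to_dict('records'))
# [{'region': '23'}, {'region': '11'}, {'region': '27'}, {'region': ''}, {'region': '25'}]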