I am trying to create a new dataframe out of a dictionary which includes lists. It looks something like this:
{'prices': [[1574121600000, 1.000650588888066], [1574208000000, 0.9954110116644869]...
Those are UNIX timestamps and stablecoin prices, but the columns are not named properly since everything sits under the 'prices' key. How could I create a new df with 2 columns (Date, Price) using the values from this dictionary?
My goal is to get something like this:
| Date          | Price              |
| ------------- | ------------------ |
| 1574121600000 | 1.000650588888066  |
| 1574208000000 | 0.9954110116644869 |
You can use pd.DataFrame directly and then pass a list of column names to the columns parameter:
import pandas as pd
a = {'prices': [[1574121600000, 1.000650588888066], [1574208000000, 0.9954110116644869]]}
df = pd.DataFrame(a['prices'], columns=['Date', 'Price'])
print(df)
# prints
Date Price
0 1574121600000 1.000651
1 1574208000000 0.995411
import pandas as pd

d = {'prices': [[1574121600000, 1.000650588888066], [1574208000000, 0.9954110116644869]]}
df = {"date": [], "prices": []}
for k, v in d.items():
    for item in v:
        df["date"].append(item[0])
        df["prices"].append(item[1])
dataframe = pd.DataFrame(df)
I have the following data frame:
| Category | idlist |
| -------- | -----------------------------------------------------------------|
| new | {100: ['1A', '2B'], 200: ['2A', '3B'], 300: ['3A', '3B', '4B']} |
| old | ['99A', '99B'] |
I want it to be converted into this:
| Category | id | list |
| -------- | --- | -------------------- |
| new | 100 | ['1A', '2B'] |
| new | 200 | ['2A', '3B'] |
| new | 300 | ['3A', '3B', '4B'] |
| old | -1 | ['99A', '99B'] |
The column 'idlist' holds a dict for category 'new', with an id as key and a list of items as value. The category 'old' just has a list of items; since it doesn't have any id, we can default its id to -1 or 0.
I tried some examples from the internet, but I was not able to flatten the dict for category 'new' because the key is a number. Can this be achieved? I can handle 'old' separately as a subset by setting its id to '-1', because I don't need to flatten that cell. But is there a way to handle all of this together?
As requested, I am sharing the original JSON format before I converted it into a df:
{'new' : {100: ['1-A', '2-B'], 200: ['2-A', '3-B'], 300: ['3-A', '3-B', '4-B']}, 'old' : ['99-A', '99-B']}
For your original JSON of this:
jj = '''
{
"new": {"100": ["1A", "2B"], "200": ["2A", "3B"], "300": ["3A", "3B", "4B"]},
"old": ["99A", "99B"]
}
'''
I think it's easier to pre-process the JSON rather than post-process the dataframe:
import json
import pandas as pd
dd = json.loads(jj)
ml = []
for k, v in dd.items():
    if isinstance(v, dict):
        ml += [{'Category': k, 'id': k2, 'list': v2} for k2, v2 in v.items()]
    else:
        ml += [{'Category': k, 'id': -1, 'list': v}]
df = pd.DataFrame(ml)
Output:
Category id list
0 new 100 [1A, 2B]
1 new 200 [2A, 3B]
2 new 300 [3A, 3B, 4B]
3 old -1 [99A, 99B]
def function1(s: pd.Series):
    global df2
    list1 = []
    var1 = eval(s.idlist)  # assumes idlist is stored as a string; use the dict directly otherwise
    if isinstance(var1, dict):
        for i in var1:
            list1.append({'Category': s.Category, 'id': i, 'list': var1[i]})
    else:
        list1.append({'Category': s.Category, 'id': -1, 'list': var1})
    df2 = pd.concat([df2, pd.DataFrame(list1)])

df2 = pd.DataFrame()
df1.apply(function1, axis=1)
df2 = df2.set_index('Category')
First isolate new and old in separate dataframes.
Create new columns corresponding to each key in the idlist dictionary for df_new.
Use pd.melt to "unpivot" df_new.
Modify df_old to match df_new column for column.
Use pd.concat to rejoin the two dataframes on axis 0 (rows).
See below:
# Recreated before your JSON file was attached
import pandas as pd
df = pd.DataFrame({'Category': ['new', 'old'],
                   'idlist': [{100: ['1A', '2B'], 200: ['2A', '3B'], 300: ['3A', '3B', '4B']}, ['99A', '99B']]})
# isolate new and old in separate dataframes
df_new = df[df['Category'] == "new"]
df_old = df[df['Category'] == "old"]
# Create new columns corresponding to each key in the idlist dictionary for df_new
my_dict = df_new['idlist'][0]
keys = list(my_dict.keys())
values = list(my_dict.values())
for k, v in zip(keys, values):
    df_new[k] = f"{v}"  # f-string to format the values in output as "string lists"
# Drop idlist because you don't need it anymore
df_new = df_new.drop('idlist', axis=1)
# Use pd.melt to "unpivot" df_new
df_new = pd.melt(df_new,
                 id_vars='Category',
                 value_vars=list(df_new.columns[1:]),
                 var_name='id',
                 value_name='list')
# Modify df_old to match column for column with df_new
df_old['list'] = df_old['idlist']
df_old = df_old.drop('idlist', axis = 1)
df_old['id'] = -1
# Use pd.concat to rejoin the two dataframes on axis 0 (rows)
clean_df = pd.concat([df_new, df_old], axis = 0).reset_index(drop = True)
Using pd.json_normalize, this can be accomplished without using loops.
The data in the OP is a dict, not a JSON string.
A dict or JSON from an API, that isn't a str, can be loaded directly into pandas.
data = {"new": {"100": ["1-A", "2-B"], "200": ["2-A", "3-B"], "300": ["3-A", "3-B", "4-B"]}, "old": ["99-A", "99-B"]}  # this is a dict
If the data is a str, then use data = json.loads(json_str), where json_str is your data.
json_str = '{"new": {"100": ["1-A", "2-B"], "200": ["2-A", "3-B"], "300": ["3-A", "3-B", "4-B"]}, "old": ["99-A", "99-B"]}'  # this is a JSON string because it's surrounded by quotes
import json
import pandas as pd
# 1. load data into pandas
df = pd.json_normalize(data)
# 2. melt the dataframe from wide to long form
df = df.melt(var_name='Category', value_name='list')
# 3. split the numbers from the Category column and add -1
df[['Category', 'id']] = df.Category.str.split('.', expand=True).fillna(-1)
df views:
1.
old new.100 new.200 new.300
0 [99-A, 99-B] [1-A, 2-B] [2-A, 3-B] [3-A, 3-B, 4-B]
2.
Category list
0 old [99-A, 99-B]
1 new.100 [1-A, 2-B]
2 new.200 [2-A, 3-B]
3 new.300 [3-A, 3-B, 4-B]
3.
Category list id
0 old [99-A, 99-B] -1
1 new [1-A, 2-B] 100
2 new [2-A, 3-B] 200
3 new [3-A, 3-B, 4-B] 300
def create_df(src, header=None):
    df = spark.read.csv(src, header=header)
    return df

result = source_df.filter(f.col('Job_name') == job_name).select(source_df['dfname'], source_df['srcpath']).collect()
for x in result:
    src = str('"' + x[1] + '"'.strip(' '))
    src = str(src)
    x[0] = create_df(src, header=True)  # throwing a utf-8 encoding error
result is a list with 2 columns, dfname and source path; I need to loop over the result list and, based on the dfname value, create a dataframe with that name dynamically.
| dfname | SPath |
|------------+--------------|
| Account_Df | s3://path... |
| ProdMet_Df | s3://path... |
Based on the dfname values, how do I create the dataframes?
Expected output: Account_Df and ProdMet_Df as two separate dfs.
If you are absolutely sure you need to do this, you can update the globals() dictionary to create a variable in the global (module) namespace. Your last line of code should then be:
globals()[x[0]] = create_df(src, header=True)
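As a minimal, self-contained sketch of that globals() trick (using plain pandas instead of Spark, with hypothetical names and paths standing in for the collected rows):

```python
import pandas as pd

# Hypothetical (dfname, path) pairs standing in for the collected Spark rows
result = [("Account_Df", "s3://path/accounts.csv"), ("ProdMet_Df", "s3://path/prodmet.csv")]

for dfname, path in result:
    # In the real code this would be create_df(path, header=True);
    # here we build a placeholder dataframe so the example runs.
    globals()[dfname] = pd.DataFrame({"source": [path]})

print(Account_Df["source"][0])  # Account_Df now exists as a module-level variable
```

Note this only creates module-level names; in most cases a plain dict mapping dfname to dataframe is safer and easier to debug.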
I have a PySpark dataframe df:
+---------+------------------+
|ceil_temp| test2|
+---------+------------------+
| -1|[6397024, 6425417]|
| 0|[6397024, 6425417]|
| 0|[6397024, 6425417]|
| 0|[6469640, 6531963]|
| 0|[6469640, 6531963]|
| 1|[6469640, 6531963]|
+---------+------------------+
I eventually want to add a new column (final) to this dataframe, whose values are elements of the list in the test2 column selected by the index given in the ceil_temp column. For example: if the ceil_temp column has a value <0 or 0 in it, the final column gets the element at index 0 of the test2 column. Something like this:
+---------+------------------+--------
|ceil_temp| test2|final |
+---------+------------------+--------
| -1|[6397024, 6425417]|6397024|
| 0|[6397024, 6425417]|6397024|
| 0|[6397024, 6425417]|6397024|
| 0|[6469640, 6531963]|6469640|
| 0|[6469640, 6531963]|6469640|
| 1|[6469640, 6531963]|6531963|
+---------+------------------+--------
To achieve this, I tried to extract ceil_temp and test2 as lists using flatMap:
m = df.select("ceil_temp").rdd.flatMap(lambda x: x).collect()
q = df.select("test2").rdd.flatMap(lambda x: x).collect()
l = []
for i in range(len(m)):
    if m[i] < 0:
        m[i] = 0
    l.append(q[i][m[i]])
Then I convert this list l to a new df and join it with the original dataframe on a row-index column that I add with a window function:
w = Window().orderBy()
df = df.withColumn("columnindex", rowNumber().over(w))
However, the order of the lists extracted by flatMap doesn't seem to remain the same as that of parent dataframe df. I get the following:
m=[-1,0,0,0,0,1]
q=[[6469640, 6531963],[6469640, 6531963],[6469640, 6531963],[6397024, 6425417],[6397024, 6425417],[6397024, 6425417]]
Expected result:
m=[-1,0,0,0,0,1]
q=[[6397024, 6425417],[6397024, 6425417],[6397024, 6425417],[6469640, 6531963],[6469640, 6531963],[6469640, 6531963]]
Please advise on how to achieve the "final" column.
I think you could achieve your desired outcome using a UDF on the rows of your dataframe.
You could then use withColumn with the result of your UDF.
val df = spark.sparkContext.parallelize(List(
  (-1, List(6397024, 6425417)),
  (0, List(6397024, 6425417)),
  (0, List(6397024, 6425417)),
  (0, List(6469640, 6531963)),
  (0, List(6469640, 6531963)),
  (1, List(6469640, 6531963)))).toDF("ceil_temp", "test2")

import org.apache.spark.sql.functions.udf

val selectRightElement = udf {
  (ceilTemp: Int, test2: Seq[Int]) => {
    // dummy code for the example
    if (ceilTemp <= 0) test2(0) else test2(1)
  }
}

df.withColumn("final", selectRightElement(df("ceil_temp"), df("test2"))).show
Doing it like that will prevent shuffling of your row order.
I solved the above issue by:
df=df.withColumn("final",(df.test2).getItem(df.ceil_temp))
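For intuition, the same selection written in plain pandas (not Spark), clamping negative indices to 0 the way the question's loop does:

```python
import pandas as pd

df = pd.DataFrame({"ceil_temp": [-1, 0, 0, 0, 0, 1],
                   "test2": [[6397024, 6425417]] * 3 + [[6469640, 6531963]] * 3})

# Clamp negative indices to 0, then pick that element from each row's list
idx = df["ceil_temp"].clip(lower=0)
df["final"] = [lst[i] for lst, i in zip(df["test2"], idx)]
print(df["final"].tolist())  # [6397024, 6397024, 6397024, 6469640, 6469640, 6531963]
```

Note that in Spark, getItem(-1) would yield null rather than the 0th element, so clamping (or a when/otherwise expression) is still needed for negative ceil_temp values.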
I have created a dictionary with this piece of code:
dat[r["author_name"]] = (r["num_deletions"], r["num_insertions"],
                         r["num_lines_changed"], r["num_files_changed"], r["author_date"])
I then want to take this dictionary and create a pandas dataframe with the columns
author_name | num_deletions | num_insertions | num_lines_changed |num_files changed | author_date
I tried this:
df = pd.DataFrame(list(dat.items()),
                  columns=['author_name', "num_deletions", "num_insertions", "num_lines_changed",
                           "num_files_changed", "author_date"])
But it does not work, since it reads the key and the tuple of the dictionary as only two columns instead of six. So how can I take each of the five entries in the tuple and split them into their own columns?
You need the key and value at the same nesting level:
df = pd.DataFrame([(key,) + val for key, val in dat.items()],
                  columns=["author_name", "num_deletions",
                           "num_insertions", "num_lines_changed",
                           "num_files_changed", "author_date"])
You could also use
df = pd.DataFrame.from_dict(dat, orient='index').reset_index()
df.columns = ["author_name", "num_deletions",
              "num_insertions", "num_lines_changed",
              "num_files_changed", "author_date"]
Which seems to be a bit faster if you have roughly 10,000 rows or more.
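The from_dict route can be sketched with the same kind of hypothetical data; orient='index' makes the dict keys the index, and reset_index turns them back into a column:

```python
import pandas as pd

# Hypothetical sample data: author -> 5-tuple of stats
dat = {"alice": (3, 10, 13, 2, "2020-01-01"),
       "bob": (1, 4, 5, 1, "2020-01-02")}

df = pd.DataFrame.from_dict(dat, orient='index').reset_index()
df.columns = ["author_name", "num_deletions",
              "num_insertions", "num_lines_changed",
              "num_files_changed", "author_date"]
print(df.shape)  # (2, 6)
```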
This should work.
import pandas as pd
df = pd.DataFrame(columns=['author_name', 'num_deletions', 'num_insertions',
                           'num_lines_changed', 'num_files_changed', 'author_date'])
I have a first data frame looking like this:
item_id | options
------------------------------------------
item_1_id | [option_1_id, option_2_id]
And a second like this:
option_id | option_name
---------------------------
option_1_id | option_1_name
And I'd like to transform my first data set to:
item_id | options
----------------------------------------------
item_1_id | [option_1_name, option_2_name]
What is an elegant way to do so using Pandas' data frames?
You can use apply.
For the record, storing lists in DataFrames is typically unnecessary and not very "pandonic". Also, if you only have one column, you can do this with a Series (though this solution also works for DataFrames).
Setup
Build the Series with the lists of options.
index = list('abcde')
s = pd.Series([['opt1'], ['opt1', 'opt2'], ['opt0'], ['opt1', 'opt4'], ['opt3']], index=index)
Build the Series with the names.
index_opts = ['opt%s' % i for i in range(5)]
vals_opts = ['name%s' % i for i in range(5)]
s_opts = pd.Series(vals_opts, index=index_opts)
Solution
Map options to names using apply. The lambda function looks up each option in the Series mapping options to names. It is applied to each element of the Series.
s.apply(lambda l: [s_opts[opt] for opt in l])
outputs
a [name1]
b [name1, name2]
c [name0]
d [name1, name4]
e [name3]
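Applied to the two dataframes from the question (with hypothetical ids and names filled in), the same apply pattern looks like this: build a lookup Series from the second frame, then map each list element through it.

```python
import pandas as pd

items = pd.DataFrame({"item_id": ["item_1_id"],
                      "options": [["option_1_id", "option_2_id"]]})
options = pd.DataFrame({"option_id": ["option_1_id", "option_2_id"],
                        "option_name": ["option_1_name", "option_2_name"]})

# Lookup Series mapping option_id -> option_name
lookup = options.set_index("option_id")["option_name"]
items["options"] = items["options"].apply(lambda l: [lookup[o] for o in l])
print(items["options"][0])  # ['option_1_name', 'option_2_name']
```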