How to create df dynamically while looping list in python? - python

def create_df(src, header=None):
    df = spark.read.csv(src, header=header)
    return df

result = source_df.filter(f.col('Job_name') == job_name) \
                  .select(source_df['dfname'], source_df['srcpath']).collect()

for x in result:
    src = str('"' + x[1] + '"'.strip(' '))
    src = str(src)
    x[0] = create_df(src, header=True)  # throws a UTF-8 encoding error
result is a list with two columns, dfname and source path. I need to loop over the result list and, based on the dfname value, create each dataframe with its name assigned dynamically.
| dfname     | SPath        |
|------------+--------------|
| Account_Df | s3://path... |
| ProdMet_Df | s3://path... |
How can I create the dataframes based on the dfname values?
Expected output: Account_Df and ProdMet_Df as two separate dfs.

If you are absolutely sure you need to do this, you can update the globals() dictionary to create a variable in the global (module) namespace. Your last line of code should then be:
globals()[x[0]] = create_df(src, header=True)
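Put together with the loop from the question, that looks roughly like this (a sketch; the extra quoting around src is dropped and the path is passed straight to spark.read.csv, which is an assumption on my part):

for x in result:
    dfname, src = x[0], x[1].strip()
    # creates module-level variables named Account_Df, ProdMet_Df, ...
    globals()[dfname] = create_df(src, header=True)

Account_Df.show()  # now usable like any other dataframe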

Related

How to add dummy record to the existing column in pyspark?

I have a data frame where I want to add one dummy record. To do that, I read the dataframe from a parquet file, created lists out of its columns, and then used Python's dict(zip()) to combine them. Below is the code snippet.
prem_df = read_parquet_file(folder_path, logger)
row_list = prem_df.select(col("cat")).collect()
y = [o[0] for o in row_list]
t = y.append("ABC")
row_list1 = prem_df.select(col("Val")).collect()
x = [o[0] for o in row_list1]
p = x.append("23.54")
dict(zip(t, p))
But I am not sure how I would create a dataframe out of it again, as I need to merge it back into the DF prem_df.
Basically, I want to add "ABC" at the end of the "cat" column, and "23.54" at the end of the "Val" column, in such a way that if I filter on cat == "ABC" I should get "Val" as 23.54:
df.filter(col("cat") == "ABC").select(col("cat"), col("Val"))
Note: the parquet file has 43 columns in total.
Please suggest. Thank you.
You can simply concat.
import pyspark.sql.functions as f

df = spark.createDataFrame([['cat1', 1]], ['cat', 'Val'])
df.show(truncate=False)

df.withColumn('cat', f.concat(f.col('cat'), f.lit('ABC'))) \
  .withColumn('Val', f.concat(f.col('Val'), f.lit(23.54))) \
  .show(truncate=False)
+----+---+
|cat |Val|
+----+---+
|cat1|1 |
+----+---+
+-------+------+
|cat |Val |
+-------+------+
|cat1ABC|123.54|
+-------+------+
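If the goal is instead to end up with "ABC" / 23.54 as an extra row of prem_df rather than as modified values, a union with a one-row dataframe is another option (a sketch under that assumption; only the two relevant columns are selected, since the full frame has 43 columns):

new_row = spark.createDataFrame([("ABC", 23.54)], ["cat", "Val"])
# assumes Val is numeric in prem_df; pass "23.54" as a string otherwise
appended = prem_df.select("cat", "Val").unionByName(new_row)
appended.filter(f.col("cat") == "ABC").show()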

Lists in dictionary using Pandas

I am trying to create a new dataframe out of a dictionary which includes lists.
It looks something like this:
{'prices': [[1574121600000, 1.000650588888066], [1574208000000, 0.9954110116644869]...
Those are UNIX dates and prices of stablecoins; however, the columns are not named properly, as everything sits under the 'prices' key.
How could I create a new df with two columns (Date, Price) using the values from this dictionary?
My goal is to get something like this:
| Date          | Price              |
|---------------+--------------------|
| 1574121600000 | 1.000650588888066  |
| 1574208000000 | 0.9954110116644869 |
You can use pd.DataFrame directly and pass a list of column names to the columns parameter:
import pandas as pd
a = {'prices': [[1574121600000, 1.000650588888066], [1574208000000, 0.9954110116644869]]}
df = pd.DataFrame(a['prices'], columns=['Date', 'Price'])
print(df)
# prints
Date Price
0 1574121600000 1.000651
1 1574208000000 0.995411
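Since the Date values are UNIX timestamps in milliseconds, they can also be converted to proper datetimes afterwards if needed:

df['Date'] = pd.to_datetime(df['Date'], unit='ms')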
import pandas as pd

d = {'prices': [[1574121600000, 1.000650588888066], [1574208000000, 0.9954110116644869]]}

df = {"date": [], "prices": []}
for k, v in d.items():
    for item in v:
        df["date"].append(item[0])
        df["prices"].append(item[1])

dataframe = pd.DataFrame(df)
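This builds the same two-column frame as the first answer, just via an explicit loop; note that the column names here are date and prices rather than Date and Price.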

Flatten JSON with columns having values as a list in dataframe

I have the following data frame:
| Category | idlist |
| -------- | -----------------------------------------------------------------|
| new | {100: ['1A', '2B'], 200: ['2A', '3B'], 300: ['3A', '3B', '4B']} |
| old |  ['99A', '99B'] |
I want it to be converted into this:
| Category | id | list |
| -------- | --- | -------------------- |
| new | 100 | ['1A', '2B'] |
| new | 200 | ['2A', '3B'] |
| new | 300 | ['3A', '3B', '4B'] |
| old | -1 | ['99A', '99B'] |
The column 'idlist' is a JSON for category 'new', with the id as key and a list of items as value. Then there is the 'old' category, which just has a list of items; since it doesn't have any id, we can default it to -1 or 0.
I was trying some examples from the internet, but I was not able to flatten the JSON for category 'new' because the key is a number. Can this be achieved? I can handle 'old' separately as a subset by setting its id to -1, because I don't really need to flatten that cell. But is there a way to handle all of this together?
As requested, I am sharing the original JSON format before I converted it into a df:
{'new' : {100: ['1-A', '2-B'], 200: ['2-A', '3-B'], 300: ['3-A', '3-B', '4-B']}, 'old' : ['99-A', '99-B']}
For your original JSON of this:
jj = '''
{
"new": {"100": ["1A", "2B"], "200": ["2A", "3B"], "300": ["3A", "3B", "4B"]},
"old": ["99A", "99B"]
}
'''
I think it's easier to pre-process the JSON rather than post-process the dataframe:
import json
import pandas as pd

dd = json.loads(jj)
ml = []
for k, v in dd.items():
    if isinstance(v, dict):
        ml += [{'Category': k, 'id': k2, 'list': v2} for k2, v2 in v.items()]
    else:
        ml += [{'Category': k, 'id': -1, 'list': v}]
df = pd.DataFrame(ml)
Output:
Category id list
0 new 100 [1A, 2B]
1 new 200 [2A, 3B]
2 new 300 [3A, 3B, 4B]
3 old -1 [99A, 99B]
Another option is to apply a row-wise function and collect the pieces into a second dataframe:
def function1(s: pd.Series):
    global df2
    list1 = []
    var1 = eval(s.idlist)  # parse the cell value; assumes idlist is stored as a string
    if isinstance(var1, dict):
        for i in var1:
            list1.append({'Category': s.Category, 'id': i, 'list': var1[i]})
    else:
        list1.append({'Category': s.Category, 'id': -1, 'list': var1})
    df2 = pd.concat([df2, pd.DataFrame(list1)])

df2 = pd.DataFrame()
df1.apply(function1, axis=1)
df2.set_index('Category')
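Note that this relies on a module-level df2 and on eval, so it assumes the idlist cells are stored as strings; if they already hold Python dicts and lists, the eval call can simply be dropped.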
First isolate new and old in separate dataframes.
Create new columns corresponding to each key in the idlist dictionary for df_new.
Use pd.melt to "unpivot" df_new.
Modify df_old to match df_new column for column.
Use pd.concat to rejoin the two dataframes on axis 0 (rows).
See below:
# Recreated before your json file was attached
import pandas as pd

df = pd.DataFrame({'Category': ['new', 'old'],
                   'idlist': [{100: ['1A', '2B'], 200: ['2A', '3B'], 300: ['3A', '3B', '4B']}, ['99A', '99B']]})

# isolate new and old in separate dataframes
df_new = df[df['Category'] == "new"]
df_old = df[df['Category'] == "old"]

# Create new columns corresponding to each key in the idlist dictionary for df_new
my_dict = df_new['idlist'][0]
keys = list(my_dict.keys())
values = list(my_dict.values())
for k, v in zip(keys, values):
    df_new[k] = f"{v}"  # F-string to format the values in output as "string lists"

# Drop idlist because you don't need it anymore
df_new = df_new.drop('idlist', axis=1)

# Use pd.melt to "unpivot" df_new
df_new = pd.melt(df_new,
                 id_vars='Category',
                 value_vars=list(df_new.columns[1:]),
                 var_name='id',
                 value_name='list')

# Modify df_old to match column for column with df_new
df_old['list'] = df_old['idlist']
df_old = df_old.drop('idlist', axis=1)
df_old['id'] = -1

# Use pd.concat to rejoin the two dataframes on axis 0 (rows)
clean_df = pd.concat([df_new, df_old], axis=0).reset_index(drop=True)
Using pd.json_normalize, this can be accomplished without using loops.
The data in the OP is a dict, not a JSON string.
A dict, or JSON from an API that isn't a str, can be loaded directly into pandas.
data = {"new": {"100": ["1-A", "2-B"], "200": ["2-A", "3-B"], "300": ["3-A", "3-B", "4-B"]}, "old": ["99-A", "99-B"]} - This is a dict
If the data is a str, then use data = json.loads(json_str), where json_str is your data.
json_str = '{"new": {"100": ["1-A", "2-B"], "200": ["2-A", "3-B"], "300": ["3-A", "3-B", "4-B"]}, "old": ["99-A", "99-B"]}' - This is a JSON string because it's surrounded by quotes.
import json
import pandas as pd

# 1. load data into pandas
df = pd.json_normalize(data)
# 2. melt the dataframe from wide to long form
df = df.melt(var_name='Category', value_name='list')
# 3. split the numbers out of the Category column and fill the missing id with -1
df[['Category', 'id']] = df.Category.str.split('.', expand=True).fillna(-1)
df views:
1.
old new.100 new.200 new.300
0 [99-A, 99-B] [1-A, 2-B] [2-A, 3-B] [3-A, 3-B, 4-B]
2.
Category list
0 old [99-A, 99-B]
1 new.100 [1-A, 2-B]
2 new.200 [2-A, 3-B]
3 new.300 [3-A, 3-B, 4-B]
3.
Category list id
0 old [99-A, 99-B] -1
1 new [1-A, 2-B] 100
2 new [2-A, 3-B] 200
3 new [3-A, 3-B, 4-B] 300

flatMap doesn't preserve order when creating lists from pyspark dataframe columns

I have a PySpark dataframe df:
+---------+------------------+
|ceil_temp| test2|
+---------+------------------+
| -1|[6397024, 6425417]|
| 0|[6397024, 6425417]|
| 0|[6397024, 6425417]|
| 0|[6469640, 6531963]|
| 0|[6469640, 6531963]|
| 1|[6469640, 6531963]|
+---------+------------------+
I eventually want to add a new column (final) to this dataframe whose values are elements of the list in the test2 column, selected by the index given in the ceil_temp column. For example: if the ceil_temp column holds 0 or a value below 0, the final column gets the element at index 0 of the test2 column. Something like this:
+---------+------------------+-------+
|ceil_temp|             test2|  final|
+---------+------------------+-------+
|       -1|[6397024, 6425417]|6397024|
|        0|[6397024, 6425417]|6397024|
|        0|[6397024, 6425417]|6397024|
|        0|[6469640, 6531963]|6469640|
|        0|[6469640, 6531963]|6469640|
|        1|[6469640, 6531963]|6531963|
+---------+------------------+-------+
To achieve this, I tried to extract ceil_temp and test2 as lists using flatMap:
m = df.select("ceil_temp").rdd.flatMap(lambda x: x).collect()
q = df.select("test2").rdd.flatMap(lambda x: x).collect()

l = []
for i in range(len(m)):
    if m[i] < 0:
        m[i] = 0
    l.append(q[i][m[i]])
Then I convert this list l to a new df and join it with the original dataframe based on a row index column that I add using a window function:
w = Window().orderBy()
df = df.withColumn("columnindex", rowNumber().over(w))
However, the order of the lists extracted by flatMap doesn't seem to remain the same as that of the parent dataframe df. I get the following:
m=[-1,0,0,0,0,1]
q=[[6469640, 6531963],[6469640, 6531963],[6469640, 6531963],[6397024, 6425417],[6397024, 6425417],[6397024, 6425417]]
Expected result:
m=[-1,0,0,0,0,1]
q=[[6397024, 6425417],[6397024, 6425417],[6397024, 6425417],[6469640, 6531963],[6469640, 6531963],[6469640, 6531963]]
Please advise on how to achieve the "final" column.
I think you could achieve your desired outcome using a UDF on the rows of your dataframe.
You could then use withColumn with the result of your udf.
val df = spark.sparkContext.parallelize(List(
  (-1, List(6397024, 6425417)),
  (0, List(6397024, 6425417)),
  (0, List(6397024, 6425417)),
  (0, List(6469640, 6531963)),
  (0, List(6469640, 6531963)),
  (1, List(6469640, 6531963)))).toDF("ceil_temp", "test2")

import org.apache.spark.sql.functions.udf

val selectRightElement = udf {
  (ceilTemp: Int, test2: Seq[Int]) => {
    // dummy code for the example
    if (ceilTemp <= 0) test2(0) else test2(1)
  }
}

df.withColumn("final", selectRightElement(df("ceil_temp"), df("test2"))).show
Doing it like that will prevent shuffling of your row order.
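Since the question uses PySpark, the same idea translates roughly as follows (a sketch that mirrors the dummy index logic of the Scala example above):

from pyspark.sql import functions as f
from pyspark.sql.types import IntegerType

@f.udf(returnType=IntegerType())
def select_right_element(ceil_temp, test2):
    # same dummy logic as the Scala version: index 0 for <= 0, else index 1
    return test2[0] if ceil_temp <= 0 else test2[1]

df.withColumn("final", select_right_element("ceil_temp", "test2")).show()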
I solved the above issue by:
df=df.withColumn("final",(df.test2).getItem(df.ceil_temp))
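One caveat: for the row where ceil_temp is -1, getItem will not return the element at index 0, so negative values would need to be clamped first, along these lines (a rough sketch; on newer Spark versions the bracket operator df.test2[idx] may be needed instead of getItem with a column argument):

idx = f.when(df.ceil_temp < 0, 0).otherwise(df.ceil_temp)
df = df.withColumn("final", df.test2.getItem(idx))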

Pandas - join item from different dataframe within an array

I have a first data frame looking like this:
item_id | options
------------------------------------------
item_1_id | [option_1_id, option_2_id]
And a second like this:
option_id | option_name
---------------------------
option_1_id | option_1_name
And I'd like to transform my first data set to:
item_id | options
----------------------------------------------
item_1_id | [option_1_name, option_2_name]
What is an elegant way to do so using Pandas' data frames?
You can use apply.
For the record, storing lists in DataFrames is typically unnecessary and not very "pandonic". Also, if you only have one column, you can do this with a Series (though this solution also works for DataFrames).
Setup
Build the Series with the lists of options.
import pandas as pd

index = list('abcde')
s = pd.Series([['opt1'], ['opt1', 'opt2'], ['opt0'], ['opt1', 'opt4'], ['opt3']], index=index)
Build the Series with the names.
index_opts = ['opt%s' % i for i in range(5)]
vals_opts = ['name%s' % i for i in range(5)]
s_opts = pd.Series(vals_opts, index=index_opts)
Solution
Map options to names using apply. The lambda function looks up each option in the Series mapping options to names. It is applied to each element of the Series.
s.apply(lambda l: [s_opts[opt] for opt in l])
outputs
a [name1]
b [name1, name2]
c [name0]
d [name1, name4]
e [name3]
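Applied to the question's layout, where the options live in a column of the first dataframe, the same idea works on that column (a sketch; df1 and df2 are assumed names for the two frames in the question):

name_by_id = df2.set_index('option_id')['option_name']
df1['options'] = df1['options'].apply(lambda opts: [name_by_id[o] for o in opts])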
