Pandas - join item from different dataframe within an array - python

I have a first data frame looking like this:
item_id | options
------------------------------------------
item_1_id | [option_1_id, option_2_id]
And a second like this:
option_id | option_name
---------------------------
option_1_id | option_1_name
And I'd like to transform my first data set to:
item_id | options
----------------------------------------------
item_1_id | [option_1_name, option_2_name]
What is an elegant way to do so using Pandas' data frames?

You can use apply.
For the record, storing lists in DataFrames is typically unnecessary and not very "pandonic". Also, if you only have one column, you can do this with a Series (though this solution also works for DataFrames).
Setup
Build the Series with the lists of options.
import pandas as pd

index = list('abcde')
s = pd.Series([['opt1'], ['opt1', 'opt2'], ['opt0'], ['opt1', 'opt4'], ['opt3']], index=index)
Build the Series with the names.
index_opts = ['opt%s' % i for i in range(5)]
vals_opts = ['name%s' % i for i in range(5)]
s_opts = pd.Series(vals_opts, index=index_opts)
Solution
Map options to names using apply. The lambda function looks up each option in the Series mapping options to names. It is applied to each element of the Series.
s.apply(lambda l: [s_opts[opt] for opt in l])
outputs
a           [name1]
b    [name1, name2]
c           [name0]
d    [name1, name4]
e           [name3]
dtype: object
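If you would rather avoid apply, the lists can also be flattened first. A minimal sketch, assuming pandas 0.25+ (where Series.explode was added):
# Flatten to one option per row, map each option to its name via the
# lookup Series, then regroup by the original index.
s.explode().map(s_opts).groupby(level=0).agg(list)
This gives the same Series as the apply version.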

Related

Lists in dictionary using Pandas

I am trying to create a new dataframe out of a dictionary which includes lists.
It looks something like this:
{'prices': [[1574121600000, 1.000650588888066], [1574208000000, 0.9954110116644869]...
Those are UNIX dates and prices of stablecoins; however, the columns are not named properly, as everything sits under the 'prices' key.
How could I create a new df with 2 columns (Date, Price) using the values from this dictionary?
My goal is to get something like this:
| Date | Price |
| 1574121600000 | 1.000650588888066 |
| 1574208000000 | 0.9954110116644869 |
You can use pd.DataFrame directly and pass a list of column names to the columns parameter:
import pandas as pd
a = {'prices': [[1574121600000, 1.000650588888066], [1574208000000, 0.9954110116644869]]}
df = pd.DataFrame(a['prices'], columns=['Date', 'Price'])
print(df)
# prints
            Date     Price
0  1574121600000  1.000651
1  1574208000000  0.995411
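The Date column still holds millisecond UNIX timestamps; if you also want readable dates, they can be converted afterwards (an optional extra step, not part of the original answer):
# Convert millisecond UNIX timestamps to pandas datetimes
df['Date'] = pd.to_datetime(df['Date'], unit='ms')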
d = {'prices': [[1574121600000, 1.000650588888066], [1574208000000, 0.9954110116644869]]}
df = {"date": [], "prices": []}
for k, v in d.items():
    for item in v:
        df["date"].append(item[0])
        df["prices"].append(item[1])
dataframe = pd.DataFrame(df)

Check for existence of data from two Dataframe's columns in List

I'm searching for the difference between columns in a DataFrame and data in a list.
I'm doing it this way:
# pickled_data => list of dics
pickled_names = [d['company'] for d in pickled_data] # get values from dictionary to list
diff = df[~df['company_name'].isin(pickled_names)]
which works fine, but I realized that I need to check not only for company_name but also for place, because there could be two companies with the same name.
df also contains a place column, just as each dictionary in pickled_data contains a place key.
I would like to be able to do something like this
pickled_data = [(d['company'], d['place']) for d in pickled_data]
diff = df[~df['company_name', 'place'].isin(pickled_data)] # For two values in same row
You can convert values to MultiIndex by MultiIndex.from_tuples, then convert both columns too and compare:
pickled_data = [(d['company'], d['place']) for d in pickled_data]
mux = pd.MultiIndex.from_tuples(pickled_data)
diff = df[~df.set_index(['company_name', 'place']).index.isin(mux)]
Sample:
data = {'company_name': ['A1','A2','A2','A1','A1','A3'],
        'place': list('sdasas')}
df = pd.DataFrame(data)
pickled_data = [('A1','s'),('A2','d')]
mux = pd.MultiIndex.from_tuples(pickled_data)
diff = df[~df.set_index(['company_name', 'place']).index.isin(mux)]
print (diff)
  company_name place
2           A2     a
4           A1     a
5           A3     s
You can form a set of tuples from your pickled_data for faster lookup later, then using a list comprehension over company_name and place columns of the frame, we get a boolean list of whether they are in the frame or not. Then we use this to index into the frame:
comps_and_places = set((d["company"], d["place"]) for d in pickled_data)
not_in_list = [(c, p) not in comps_and_places
               for c, p in zip(df.company_name, df.place)]
diff = df[not_in_list]
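A third option, offered as a sketch rather than part of either answer, is a pandas anti-join: load pickled_data into its own frame and merge with indicator=True (pickled_df is a name introduced for illustration):
pickled_df = pd.DataFrame(pickled_data, columns=['company_name', 'place'])
merged = df.merge(pickled_df, on=['company_name', 'place'], how='left', indicator=True)
diff = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')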

flatMap doesn't preserve order when creating lists from pyspark dataframe columns

I have a PySpark dataframe df:
+---------+------------------+
|ceil_temp|             test2|
+---------+------------------+
|       -1|[6397024, 6425417]|
|        0|[6397024, 6425417]|
|        0|[6397024, 6425417]|
|        0|[6469640, 6531963]|
|        0|[6469640, 6531963]|
|        1|[6469640, 6531963]|
+---------+------------------+
I eventually want to add a new column (final) to this dataframe whose values are elements of the list in the test2 column, selected by the index in the ceil_temp column. For example: if the ceil_temp column holds 0 or a negative value, the final column gets the element at index 0 of test2. Something like this:
+---------+------------------+-------+
|ceil_temp|             test2|  final|
+---------+------------------+-------+
|       -1|[6397024, 6425417]|6397024|
|        0|[6397024, 6425417]|6397024|
|        0|[6397024, 6425417]|6397024|
|        0|[6469640, 6531963]|6469640|
|        0|[6469640, 6531963]|6469640|
|        1|[6469640, 6531963]|6531963|
+---------+------------------+-------+
To achieve this, I tried to extract ceil_temp and test2 as lists using flatMap:
m =df.select("ceil_temp").rdd.flatMap(lambda x: x).collect()
q= df.select("test2").rdd.flatMap(lambda x: x).collect()
l = []
for i in range(len(m)):  # the original read range(len(num)), but num is undefined
    if m[i] < 0:
        m[i] = 0
    l.append(q[i][m[i]])
Then I convert this list l to a new df and join it with the original dataframe, based on a row index column that I add using a window function:
w = Window().orderBy()
df = df.withColumn("columnindex", rowNumber().over(w))
However, the order of the lists extracted by flatMap doesn't seem to match the order of the parent dataframe df. I get the following:
m=[-1,0,0,0,0,1]
q=[[6469640, 6531963],[6469640, 6531963],[6469640, 6531963],[6397024, 6425417],[6397024, 6425417],[6397024, 6425417]]
Expected result:
m=[-1,0,0,0,0,1]
q=[[6397024, 6425417],[6397024, 6425417],[6397024, 6425417],[6469640, 6531963],[6469640, 6531963],[6469640, 6531963]]
Please advise on how to achieve the "final" column.
I think you could achieve your desired outcome by using a UDF on the rows of your dataframe.
You could then use withColumn with the result of your UDF.
val df = spark.sparkContext.parallelize(List(
  (-1, List(6397024, 6425417)),
  (0, List(6397024, 6425417)),
  (0, List(6397024, 6425417)),
  (0, List(6469640, 6531963)),
  (0, List(6469640, 6531963)),
  (1, List(6469640, 6531963)))).toDF("ceil_temp", "test2")

import org.apache.spark.sql.functions.udf

val selectRightElement = udf {
  (ceilTemp: Int, test2: Seq[Int]) => {
    // dummy code for the example
    if (ceilTemp <= 0) test2(0) else test2(1)
  }
}
df.withColumn("final", selectRightElement(df("ceil_temp"), df("test2"))).show
Doing it like that will prevent shuffling of your row order.
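The snippet above is Scala; since the question uses PySpark, here is a rough Python equivalent of the same idea, offered as a sketch rather than a tested translation:
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

# Same dummy selection logic as the Scala UDF above
select_right_element = F.udf(
    lambda ceil_temp, test2: test2[0] if ceil_temp <= 0 else test2[1],
    IntegerType())
df = df.withColumn("final", select_right_element(df.ceil_temp, df.test2))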
I solved the above issue by:
df = df.withColumn("final", df.test2.getItem(df.ceil_temp))
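One caveat, flagged here as an assumption rather than part of the answer: for the -1 row, getItem with a negative index returns null in Spark SQL instead of the first element, so the index may need clamping first. A sketch, assuming a PySpark version where getItem accepts a column expression (which the line above relies on as well):
from pyspark.sql import functions as F
# Clamp negative indices to 0 before indexing into the test2 array
df = df.withColumn("final", df.test2.getItem(F.greatest(F.lit(0), df.ceil_temp)))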

Splitting a Dictionary of tuples into a pandas dataframe

I have created a dictionary with the piece of code:
dat[r["author_name"]] = (r["num_deletions"], r["num_insertions"],
                         r["num_lines_changed"], r["num_files_changed"],
                         r["author_date"])
I then want to take this dictionary and create a pandas DataFrame with the columns
author_name | num_deletions | num_insertions | num_lines_changed |num_files changed | author_date
I tried this:
df = pd.DataFrame(list(dat.iteritems()),
                  columns=['author_name', "num_deletions", "num_insertions",
                           "num_lines_changed", "num_files_changed", "author_date"])
But it does not work, since it reads the key and the tuple of the dictionary as only two columns instead of six. So how can I take each of the five entries in the tuple and split them into their own columns?
You need the key and value at the same nesting level:
df = pd.DataFrame([(key,) + val for key, val in dat.items()],
                  columns=["author_name", "num_deletions",
                           "num_insertions", "num_lines_changed",
                           "num_files_changed", "author_date"])
You could also use
df = pd.DataFrame.from_dict(dat, orient='index').reset_index()
df.columns = ["author_name", "num_deletions",
              "num_insertions", "num_lines_changed",
              "num_files_changed", "author_date"]
Which seems to be a bit faster if you have roughly 10,000 rows or more.
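If you want to check that claim on your own data, a quick hedged comparison with timeit (the exact numbers will depend on your dictionary):
import timeit
# Time both constructions on the real dat dictionary, 100 repeats each
t_list = timeit.timeit(lambda: pd.DataFrame([(k,) + v for k, v in dat.items()]), number=100)
t_dict = timeit.timeit(lambda: pd.DataFrame.from_dict(dat, orient='index'), number=100)
print(t_list, t_dict)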
This should work, once you fill the empty frame row by row:
import pandas as pd

df = pd.DataFrame(columns=['author_name', 'num_deletions', 'num_insertions',
                           'num_lines_changed', 'num_files_changed', 'author_date'])
for name, vals in dat.items():
    df.loc[len(df)] = (name,) + vals

Create and instantiate python 2d dictionary

I have two python dictionaries:
ccyAr = {'AUDCAD','AUDCHF','AUDJPY','AUDNZD','AUDUSD','CADCHF','CADJPY','CHFJPY','EURAUD','EURCAD','EURCHF','EURGBP','EURJPY','EURNZD','EURUSD','GBPAUD','GBPCAD','GBPCHF','GBPJPY','GBPNZD','GBPUSD','NZDCAD','NZDCHF','NZDJPY','NZDUSD','USDCAD','USDCHF','USDJPY'}
data = {'BTrades', 'BPips', 'BProfit', 'STrades', 'SPips', 'SProfit', 'Trades', 'Pips', 'Profit', 'Won', 'WonPC', 'Lost', 'LostPC'}
I've been trying to get my head round how to most elegantly create a construct in which each of 'data' exists for each of 'ccyAr'. The following two are the ones I feel are closest, but the first (I now realise) results in arrays and the latter is more like pseudocode:
1.
table={ { data:[] for d in data } for ccy in ccyAr }
2.
for ccy in ccyAr:
    for d in data:
        table['ccy']['d'] = 0
I also want to set each of the entries to int 0, and I'd like to do it in one go. I'm struggling with the comprehension method, as I end up creating each value of each inner dictionary as a list instead of the value 0.
I've seen the autovivification piece but I don't want to mimic perl, I want to do it the pythonic way. Any help = cheers.
for ccy in ccyAr:
    for d in data:
        table['ccy']['d'] = 0
Is close.
table = {}
for ccy in ccyAr:
    table[ccy] = {}
    for d in data:
        table[ccy][d] = 0
Also, ccyAr and data in your question are sets, not dictionaries.
What you are searching for is a pandas DataFrame of shape data x ccyAr. I give a minimal example here:
import numpy as np
import pandas as pd

data = {'1', '2'}
ccyAr = {'a', 'b', 'c'}
df = pd.DataFrame(np.zeros((len(data), len(ccyAr))))
Then the most important step is to set both the columns and the index. If your two so-called dictionaries are in fact sets (as it seems in your code), use:
df.columns = ccyAr
df.index = data
If they are indeed dictionaries, you instead have to call their keys method:
df.columns = ccyAr.keys()
df.index = data.keys()
You can print df to see that this is actually what you wanted:
| a | c | b
-------------
1 | 0 0 0
2 | 0 0 0
And now if you try to access it via df['a']['1'], it returns 0. It is the best solution to your problem.
How to do this using a dictionary comprehension:
table = {ccy:{d:0 for d in data} for ccy in ccyAr}
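As a quick usage check with the names from the question, every entry then starts at int 0:
print(table['AUDCAD']['BTrades'])  # prints 0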
