Match entire list with values from another dataframe - python

I have a dataframe with a list in one column and want to match all items in this list with a second dataframe. The matched values should then be added (as a list) to a new column in the first dataframe.
import pandas as pd

data = {'froots': [['apple','banana'], ['apple','strawberry']]}
df1 = pd.DataFrame(data)

data = {'froot': ['apple','banana','strawberry'],
        'age': [2,3,5]}
df2 = pd.DataFrame(data)
DF1
index  froots
1      ['apple','banana']
2      ['apple','strawberry']

DF2
index  froot       age
1      apple       2
2      banana      3
3      strawberry  5

New DF1
index  froots                  age
1      ['apple','banana']      [2,3]
2      ['apple','strawberry']  [2,5]
I have a simple solution that takes way too long:
age = list()
for index, row in df1.iterrows():
    numbers = row.froots
    tmp = df2[['froot','age']].apply(lambda x: x['age'] if x['froot'] in numbers else None, axis=1).dropna().tolist()
    age.append(tmp)
df1['age'] = age
Is there maybe a faster solution to this problem?
Thanks in Advance!

Use a list comprehension with a dictionary created from df2, and add a value to the list only if it exists in the dictionary, tested by if:
d = df2.set_index('froot')['age'].to_dict()
df1['age'] = df1['froots'].apply(lambda x: [d[y] for y in x if y in d])
print(df1)
                froots     age
0      [apple, banana]  [2, 3]
1  [apple, strawberry]  [2, 5]
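A variant of the same idea (a sketch, not from the original answer) avoids building the dict explicitly: explode the lists to one fruit per row, map each fruit to its age, and collect the ages back into lists per original row.

```python
import pandas as pd

df1 = pd.DataFrame({'froots': [['apple', 'banana'], ['apple', 'strawberry']]})
df2 = pd.DataFrame({'froot': ['apple', 'banana', 'strawberry'],
                    'age': [2, 3, 5]})

# One fruit per row, map each to its age, drop fruits missing from df2,
# then re-group into one list per original row (the exploded index keeps
# the original row label in level 0).
df1['age'] = (df1['froots'].explode()
              .map(df2.set_index('froot')['age'])
              .dropna()
              .groupby(level=0)
              .agg(list))
```

Like the dict-comprehension answer, fruits absent from df2 are silently dropped rather than emitted as NaN.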


Matching value with column to retrieve index value

Please see example dataframe below:
I'm trying to match the values of column X with the column names and retrieve the value from the matched column, so that:
A  B  C  X  result
1  2  3  B  2
5  6  7  A  5
8  9  1  C  1
Any ideas?
Here are a couple of methods:
# Apply method:
df['result'] = df.apply(lambda x: df.loc[x.name, x['X']], axis=1)

# List comprehension method:
df['result'] = [df.loc[i, x] for i, x in enumerate(df.X)]

# Pure pandas method:
df['result'] = (df.melt('X', ignore_index=False)
                  .loc[lambda x: x['X'].eq(x['variable']), 'value'])
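For larger frames, a numpy fancy-indexing lookup (a sketch in the spirit of the pattern pandas suggested after `DataFrame.lookup` was deprecated) does the same row-to-column match in one vectorized step:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': (1, 5, 8), 'B': (2, 6, 9), 'C': (3, 7, 1),
                   'X': ('B', 'A', 'C')})

# Pair each row position with the position of the column named in X,
# then index the underlying array once.
rows = np.arange(len(df))
cols = df.columns.get_indexer(df['X'])
df['result'] = df.to_numpy()[rows, cols]
```

Note `get_indexer` returns -1 for an X value that is not a column name, which would silently pick the last column; validate X first if that can happen.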
Here I just build a dataframe from your example and call it df (using data rather than dict as the variable name, so the builtin isn't shadowed):
data = {
    'A': (1, 5, 8),
    'B': (2, 6, 9),
    'C': (3, 7, 1),
    'X': ('B', 'A', 'C')}
df = pd.DataFrame(data)
You can extract the value from another column based on 'X' using the following code. There may be a better way to do this without first converting to a list and retrieving the first element.
list(df.loc[df['X'] == 'B', 'B'])[0]
I'm going to create a column called 'result', fill it with 'NA', and then replace the values based on your conditions. The loop below extracts each value and uses .loc to write it back into your dataframe.
df['result'] = 'NA'
for idx, val in enumerate(df['X']):
    extracted = list(df.loc[df['X'] == val, val])[0]
    df.loc[idx, 'result'] = extracted
Here it is as a function:
def search_replace(dataframe, search_col='X', new_col_name='result'):
    dataframe[new_col_name] = 'NA'
    for idx, val in enumerate(dataframe[search_col]):
        extracted = list(dataframe.loc[dataframe[search_col] == val, val])[0]
        dataframe.loc[idx, new_col_name] = extracted
    return dataframe
and the output
>>> search_replace(df)
A B C X result
0 1 2 3 B 2
1 5 6 7 A 5
2 8 9 1 C 1

Match column to another column containing array

I have a very junior question in python - I have a dataframe with a column containing some IDs, and a separate dataframe that contains 2 columns, one of which is an array:
df1 = pd.DataFrame({"some_id": [1, 2, 3, 4, 5]})
df2 = pd.DataFrame([["A", [1, 2]], ["B", [3, 4]], ["C", [5]]], columns=['letter', 'some_ids'])
I want to add to df1 a new column "letter" that, for a given "some_id", looks up df2, checks if this id is in df2['some_ids'], and returns df2['letter'].
I tried this:
df1['letter'] = df2[df1['some_id'].isin(df2['some_ids'])].letter
and got NaNs - any suggestion where I made a mistake?
Create a dictionary by flattening the nested lists in a dict comprehension and then use Series.map:
d = {x: a for a,b in zip(df2['letter'], df2['some_ids']) for x in b}
df1['letter'] = df1['some_id'].map(d)
Or map by a Series created by DataFrame.explode with DataFrame.set_index:
df1['letter'] = df1['some_id'].map(df2.explode('some_ids').set_index('some_ids')['letter'])
Or use a left join after renaming the column:
df1 = df1.merge(df2.explode('some_ids').rename(columns={'some_ids':'some_id'}), how='left')
print (df1)
some_id letter
0 1 A
1 2 A
2 3 B
3 4 B
4 5 C
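One edge case worth checking (a sketch with made-up sample data, not from the original answer): an id that appears in none of the lists maps to NaN, which fillna can label explicitly.

```python
import pandas as pd

df1 = pd.DataFrame({'some_id': [1, 2, 6]})   # 6 appears in no list in df2
df2 = pd.DataFrame([["A", [1, 2]], ["B", [3, 4]], ["C", [5]]],
                   columns=['letter', 'some_ids'])

# Flatten the lists into an id -> letter dict, then map; ids without a
# match become NaN, which fillna replaces with a visible marker.
d = {x: a for a, b in zip(df2['letter'], df2['some_ids']) for x in b}
df1['letter'] = df1['some_id'].map(d).fillna('no match')
```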

How to get new pandas dataframe with certain columns and rows depending on list elements?

I have such a list:
l = ['A','B']
And such a dataframe df
Name x y
A 1 2
B 2 1
C 2 2
I now want to get a new dataframe where only the entries for Name and x which are included in l are kept.
new_df should look like this:
Name x
A 1
B 2
I was playing around with isin but could not solve this problem.
Use DataFrame.loc with Series.isin:
new_df = df.loc[df.Name.isin(l), ["Name", "x"]]
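End to end with the sample frame from the question (a quick sketch for checking the selection):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['A', 'B', 'C'], 'x': [1, 2, 2], 'y': [2, 1, 2]})
l = ['A', 'B']

# Boolean row mask from isin, column subset in the same .loc call.
new_df = df.loc[df.Name.isin(l), ['Name', 'x']]
```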
This should do it:
# assuming Name is the index
new_df = df[df.index.isin(l)]
# if you only want column x
new_df = df.loc[df.index.isin(l), "x"]
simple as that
l = ['A','B']

def make_empty(row):
    # blank out any value in the row that is not in l
    for idx, value in enumerate(row):
        row.iloc[idx] = value if value in l else ''
    return row

df_new = df[df['Name'].isin(l) | df['x'].isin(l)][['Name','x']]
df_new = df_new.apply(make_empty, axis=1)
Output:
  Name x
0    A
1    B

Question regarding how to generate a new key in a panda dataframe

I have a data frame like this:
df = pd.DataFrame({"item" : ["a", "b"], "beg": [1, 2], "end" : [10, 11]})
And I want to create a new key (let's call it "range") that holds every value of "beg" and "end" for every item, i.e.:
  item  beg  end range
0    a    1   10  1 10
1    b    2   11  2 11
I'm guessing I have to make a list of lists, but I can't figure out how.
I know that I should loop over the df like this:
for index, row in df.iterrows():
    df["range"] = # here goes the function
Thanks in advance.
You can try:
df['range'] = df[['beg','end']].astype(str).agg(' '.join, axis=1)
or:
df['range'] = df.apply(lambda x: f"{x['beg']} {x['end']}", axis=1)
Which gives you:
  item  beg  end range
0    a    1   10  1 10
1    b    2   11  2 11
where range is string type. If you want list type:
df['range'] = [list(x) for x in zip(df['beg'], df['end'])]
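The list-type column can also be built without zip (a sketch of an equivalent form): convert the two columns to a 2-D array and let numpy split it into one Python list per row.

```python
import pandas as pd

df = pd.DataFrame({"item": ["a", "b"], "beg": [1, 2], "end": [10, 11]})

# to_numpy() gives a (n_rows, 2) array; tolist() turns each row into a list.
df['range'] = df[['beg', 'end']].to_numpy().tolist()
```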

Selecting rows from pandas DataFrame using a list

I have a list of lists as below
[[1, 2], [1, 3]]
The DataFrame is similar to
A B C
0 1 2 4
1 0 1 2
2 1 3 0
I would like a DataFrame that keeps a row if the value in column A equals the first element of any of the nested lists and the value in column B of the same row equals the second element of that same nested list.
Thus the resulting DataFrame should be
A B C
0 1 2 4
2 1 3 0
The code below does what you need:
import pandas

tmp_filter = pandas.DataFrame(None)  # the dataframe you want

# Create your list and your dataframe
tmp_list = [[1, 2], [1, 3]]
tmp_df = pandas.DataFrame([[1,2,4],[0,1,2],[1,3,0]], columns=['A','B','C'])

# This function walks the df column by column and
# only keeps the rows with the values you want
def pass_true_df(df, cond):
    for i, c in enumerate(cond):
        df = df[df.iloc[:, i] == c]
    return df

# Pass through your list and add the rows you want to keep
for i in tmp_list:
    tmp_filter = pandas.concat([tmp_filter, pass_true_df(tmp_df, i)])
import pandas
df = pandas.DataFrame([[1,2,4],[0,1,2],[1,3,0],[0,2,5],[1,4,0]],
                      columns=['A','B','C'])
filt = pandas.DataFrame([[1, 2], [1, 3], [0, 2]],
                        columns=['A','B'])
accum = []
# grouped to-filter
data_g = df.groupby('A')
for k2, v2 in data_g:
    accum.append(v2[v2.B.isin(filt.B[filt.A == k2])])
print(pandas.concat(accum))
result:
A B C
3 0 2 5
0 1 2 4
2 1 3 0
(I made the data and filter a little more complicated as a test.)
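A vectorized alternative (a sketch, not part of either answer above): build a MultiIndex from the two columns and test it against the pair list, which also preserves the original row index (0 and 2) shown in the expected output.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 4], [0, 1, 2], [1, 3, 0]], columns=['A', 'B', 'C'])
pairs = [(1, 2), (1, 3)]

# One boolean per row: True where the (A, B) tuple appears in pairs.
mask = pd.MultiIndex.from_frame(df[['A', 'B']]).isin(pairs)
result = df[mask]
```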
