I have a very junior question in Python: I have a dataframe with a column containing some IDs, and a separate dataframe with two columns, one of which holds a list:
df1 = pd.DataFrame({"some_id": [1, 2, 3, 4, 5]})
df2 = pd.DataFrame([["A", [1, 2]], ["B", [3, 4]], ["C", [5]]], columns=['letter', 'some_ids'])
I want to add to df1 a new column "letter" that, for a given "some_id", looks up df2, checks whether that id is in df2['some_ids'], and returns the matching df2['letter'].
I tried this:
df1['letter'] = df2[df1['some_id'].isin(df2['some_ids'])].letter
and get NaNs. Any suggestion where I'm making a mistake?
Create a dictionary by flattening the nested lists in a dict comprehension, then use Series.map:
d = {x: a for a,b in zip(df2['letter'], df2['some_ids']) for x in b}
df1['letter'] = df1['some_id'].map(d)
Or map by a Series created with DataFrame.explode and DataFrame.set_index:
df1['letter'] = df1['some_id'].map(df2.explode('some_ids').set_index('some_ids')['letter'])
Or use a left join after renaming the column:
df1 = df1.merge(df2.explode('some_ids').rename(columns={'some_ids':'some_id'}), how='left')
print (df1)
some_id letter
0 1 A
1 2 A
2 3 B
3 4 B
4 5 C
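For reference, the explode-based variant can be run end to end; any some_id missing from df2 simply maps to NaN:

```python
import pandas as pd

df1 = pd.DataFrame({"some_id": [1, 2, 3, 4, 5]})
df2 = pd.DataFrame([["A", [1, 2]], ["B", [3, 4]], ["C", [5]]],
                   columns=['letter', 'some_ids'])

# One row per (letter, id) pair, indexed by id, so map can look each id up
lookup = df2.explode('some_ids').set_index('some_ids')['letter']
df1['letter'] = df1['some_id'].map(lookup)
```

Note this assumes each id appears in at most one list; duplicate ids across rows would make the exploded index ambiguous for map.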
I have two dataframes like so:
data = {'A': [3, 2, 1, 0], 'B': [1, 2, 3, 4]}
data2 = {'A': [3, 2, 1, 0, 3, 2], 'B': [1, 2, 3, 4, 20, 2], 'C':[5,3,2,1, 5, 1]}
df1 = pd.DataFrame.from_dict(data)
df2 = pd.DataFrame.from_dict(data2)
Now I did a groupby of df2 for C
values_to_map = df2.groupby(['A', 'B']).mean().to_dict()
Now I would like to map df1['new C'] where the columns A and B match.
A B new_C
0 3 1 1.0
1 2 2 2.0
2 1 3 2.0
3 0 4 12.5
where new_C is basically the average of C for every pair (A, B) from df2.
Note that A and B don't have to be keys of the dataframe (i.e. they aren't unique identifiers, which is why I originally wanted to map with a dictionary, but failed with multiple keys).
How would I go about that?
Thank you for looking into it with me!
I found a solution to this
values_to_map = df2.groupby(['A', 'B'])['C'].mean().to_dict()
df1['new_c'] = df1.apply(lambda x: values_to_map[(x['A'], x['B'])], axis=1)
Thanks for looking into it!
Just do np.vectorize:
values_to_map = df2.groupby(['A', 'B'])['C'].mean().to_dict()
df1['new_c'] = np.vectorize(lambda a, b: values_to_map[(a, b)])(df1['A'], df1['B'])
You can first form a MultiIndex from the [["A", "B"]] subset of the frame df1 and use its map function to map the A-B pairs to the desired grouped mean values:
cols = ["A", "B"]
mapper = df2.groupby(cols).C.mean()
df1["new_c"] = pd.MultiIndex.from_frame(df1[cols]).map(mapper)
to get
>>> df1
A B new_c
0 3 1 5.0
1 2 2 2.0
2 1 3 2.0
3 0 4 1.0
(if an A-B pair in df1 isn't found in df2's groups, new_c corresponding to that pair will be NaN with this method.)
Note that neither pandas' apply nor np.vectorize is a truly "vectorized" routine. However, they might be fast enough for one's purposes and can prove more readable in places.
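The same grouped means can also be attached with a plain left join; the column name new_c below is my choice, matching the answers above:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [3, 2, 1, 0], 'B': [1, 2, 3, 4]})
df2 = pd.DataFrame({'A': [3, 2, 1, 0, 3, 2],
                    'B': [1, 2, 3, 4, 20, 2],
                    'C': [5, 3, 2, 1, 5, 1]})

# Per-pair means of C, then a left join so unmatched pairs get NaN
means = (df2.groupby(['A', 'B'], as_index=False)['C'].mean()
            .rename(columns={'C': 'new_c'}))
df1 = df1.merge(means, on=['A', 'B'], how='left')
```

Like the MultiIndex.map approach, this leaves NaN for any A-B pair in df1 that has no group in df2.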
I have a list of Pandas dataframes:
df_list = [df1, df2, df3]
The dataframes have the same column names; let's call them "col1", "col2" and "col3".
How can I change the column names to "colnew1", "colnew2" and "colnew3", without using a loop?
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
df.rename(columns={"A": "a", "B": "c"})
a c
0 1 4
1 2 5
2 3 6
This is taken right from the pandas website.
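To apply that rename to every frame in df_list, a list comprehension avoids an explicit for-loop body (though it still iterates under the hood); the mapping dict below is assumed from the question's column names:

```python
import pandas as pd

df1 = pd.DataFrame({"col1": [1], "col2": [2], "col3": [3]})
df2, df3 = df1.copy(), df1.copy()
df_list = [df1, df2, df3]

mapping = {"col1": "colnew1", "col2": "colnew2", "col3": "colnew3"}
# rename returns a new frame, so the original frames are left untouched
df_list = [df.rename(columns=mapping) for df in df_list]
```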
Is there a way to filter a large dataframe by comparing multiple columns against a set of tuples where each element in the tuple corresponds to a different column value?
For example, is there a .isin() method that compares multiple columns of the DataFrame against a set of tuples?
Example:
df = pd.DataFrame({
'a': [1, 1, 1],
'b': [2, 2, 0],
'c': [3, 3, 3],
'd': ['not', 'relevant', 'column'],
})
# Filter the DataFrame by checking if the values in columns [a, b, c] match any tuple in value_set
value_set = set([(1,2,3), (1, 1, 1)])
new_df = ?? # should contain just the first two rows of df
You can use Series.isin, but first it is necessary to create tuples from the first 3 columns:
print (df[df[['a','b','c']].apply(tuple, axis=1).isin(value_set)])
Or convert the columns to an index and use Index.isin:
print (df[df.set_index(['a','b','c']).index.isin(value_set)])
a b c d
0 1 2 3 not
1 1 2 3 relevant
Another idea is an inner join with DataFrame.merge against a helper DataFrame with the same 3 column names; the on parameter can be omitted because the join uses the intersection of column names of both DataFrames. Note the set must be converted to a list first, since the DataFrame constructor rejects unordered sets:
print (df.merge(pd.DataFrame(list(value_set), columns=['a','b','c'])))
a b c d
0 1 2 3 not
1 1 2 3 relevant
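A closely related variant (my sketch, not from the answers above) builds a MultiIndex directly from the three columns and calls its isin, which accepts tuples:

```python
import pandas as pd

df = pd.DataFrame({
    'a': [1, 1, 1],
    'b': [2, 2, 0],
    'c': [3, 3, 3],
    'd': ['not', 'relevant', 'column'],
})
value_set = {(1, 2, 3), (1, 1, 1)}

# MultiIndex.isin compares whole (a, b, c) tuples against value_set
mask = pd.MultiIndex.from_frame(df[['a', 'b', 'c']]).isin(value_set)
new_df = df[mask]
```

This avoids the row-wise apply(tuple, axis=1), which can be slow on large frames.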
I'm trying to union several pd.DataFrames row-wise, using the index to remove duplicates (A and B come from the same source "table", filtered by different predicates, and I'm trying to recombine them).
A = pd.DataFrame({"values": [1, 2]}, pd.MultiIndex.from_tuples([(1,1),(1,2)], names=('l1', 'l2')))
B = pd.DataFrame({"values": [2, 3, 2]}, pd.MultiIndex.from_tuples([(1,2),(2,1),(2,2)], names=('l1', 'l2')))
pd.concat([A,B]).drop_duplicates() doesn't work, since it ignores the index and de-dups on the values alone, removing the index item (2,2).
pd.concat([A.reset_index(),B.reset_index()]).drop_duplicates(subset=('l1', 'l2')).set_index(['l1', 'l2']) does what I want, but I feel like there should be a better way.
You may do a simple concat and filter out dups using Index.duplicated:
df1 = pd.concat([A,B])
df1[~df1.index.duplicated()]
Out[123]:
values
l1 l2
1 1 1
2 2
2 1 3
2 2
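If you would rather keep B's version of a duplicated index entry instead of A's, Index.duplicated takes a keep argument; a quick sketch:

```python
import pandas as pd

A = pd.DataFrame({"values": [1, 2]},
                 index=pd.MultiIndex.from_tuples([(1, 1), (1, 2)], names=('l1', 'l2')))
B = pd.DataFrame({"values": [2, 3, 2]},
                 index=pd.MultiIndex.from_tuples([(1, 2), (2, 1), (2, 2)], names=('l1', 'l2')))

out = pd.concat([A, B])
# keep='last' retains the last occurrence of each index entry, i.e. B's rows win
out = out[~out.index.duplicated(keep='last')]
```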
I have a pandas data frame known as "df":
x y
0 1 2
1 2 4
2 3 8
I am splitting it up into two frames, and then trying to merge back together:
df_1 = df[df['x']==1]
df_2 = df[df['x']!=1]
My goal is to get it back in the same order, but when I concat, I am getting the following:
frames = [df_2, df_1]
solution = pd.concat(frames)
solution.sort_values(by='x', inplace=False)
x y
1 2 4
2 3 8
0 1 2
The problem is I need the 'x' values to go back into the new dataframe in the same order that I extracted. Is there a solution?
Use .loc to specify the order you want. Choose the original index:
solution.loc[df.index]
Or, if you trust the index values in each component, then
solution.sort_index()
setup
df = pd.DataFrame([[1, 2], [2, 4], [3, 8]], columns=['x', 'y'])
df_1 = df[df['x']==1]
df_2 = df[df['x']!=1]
frames = [df_1, df_2]
solution = pd.concat(frames)
Try this:
In [14]: pd.concat([df_1, df_2.sort_values('y')])
Out[14]:
x y
0 1 2
1 2 4
2 3 8
When you are sorting the solution using
solution.sort_values(by='x', inplace=False)
you need to specify inplace=True, or assign the returned result back (solution = solution.sort_values(by='x')); with inplace=False the call returns a sorted copy and leaves solution unchanged.
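Alternatively, assign the sorted result back instead of sorting in place; a minimal sketch with the question's data:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [2, 4], [3, 8]], columns=['x', 'y'])
solution = pd.concat([df[df['x'] != 1], df[df['x'] == 1]])

# sort_values returns a sorted copy by default; assign it back
solution = solution.sort_values(by='x')
```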
Based on these assumptions on df:
Columns x and y are not necessarily ordered.
The index is ordered.
Just order your result by index:
df = pd.DataFrame({'x': [1, 2, 3], 'y': [2, 4, 8]})
df_1 = df[df['x']==1]
df_2 = df[df['x']!=1]
frames = [df_2, df_1]
solution = pd.concat(frames).sort_index()
Now, solution looks like this:
x y
0 1 2
1 2 4
2 3 8