Python Pandas: How can I make labels for dropped data?

I used drop_duplicates() on my original data (subset = A and B), and I made labels for the refined data.
Now I have to make labels for the original data, but my current approach takes too much time and is not efficient.
For example,
My original dataframe is as follows:
A B
1 1
1 1
2 2
2 3
5 3
6 4
5 4
5 4
after drop_duplicates():
A B
1 1
2 2
2 3
5 3
6 4
5 4
after labeling:
A B label
1 1 1
2 2 0
2 3 1
5 3 1
6 4 0
5 4 1
Following is my expected output:
A B label
1 1 1
1 1 1
2 2 0
2 3 1
5 3 1
6 4 0
5 4 1
5 4 1
My current code for achieving above result is as follows:
for i in range(len(origin_data)):
    check = False
    j = 0
    while not check:
        if (origin_data['A'].iloc[i] == dropped_data['A'].iloc[j]
                and origin_data['B'].iloc[i] == dropped_data['B'].iloc[j]):
            origin_data['label'].iloc[i] = dropped_data['label'].iloc[j]
            check = True
        j += 1
My code takes a long time to run; is there any way I can do this more efficiently?

You can merge the labeled dataset with the original one:
original.merge(labeled, how="left", on=["A", "B"])
result:
A B label
0 1 1 1
1 1 1 1
2 1 2 0
3 1 3 0
4 1 4 1
5 1 4 1
Full code:
import pandas as pd

original = pd.DataFrame(
    {'A': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1},
     'B': {0: 1, 1: 1, 2: 2, 3: 3, 4: 4, 5: 4}}
)
labeled = pd.DataFrame(
    {'A': {0: 1, 1: 1, 2: 1, 3: 1},
     'B': {0: 1, 1: 2, 2: 3, 3: 4},
     'label': {0: 1, 1: 0, 2: 0, 3: 1}}
)
print(original.merge(labeled, how="left", on=["A", "B"]))
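For reference, here is a sketch of the same merge applied to the dataframes from the question, using the question's origin_data/dropped_data names:
import pandas as pd

# Original data from the question, duplicates included
origin_data = pd.DataFrame({'A': [1, 1, 2, 2, 5, 6, 5, 5],
                            'B': [1, 1, 2, 3, 3, 4, 4, 4]})

# Deduplicated data with labels, as shown in the question
dropped_data = pd.DataFrame({'A': [1, 2, 2, 5, 6, 5],
                             'B': [1, 2, 3, 3, 4, 4],
                             'label': [1, 0, 1, 1, 0, 1]})

# The left merge broadcasts each label back to every duplicate row
print(origin_data.merge(dropped_data, how='left', on=['A', 'B']))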

If the problem is just mapping the 'B' labels to the original dataframe, you can use map (note this assumes each value of B corresponds to exactly one label):
origin_data.B.map(dropped_data.set_index('B').label)

Related

How to convert df with item ratings to df that contains number of users that rated a pair of items?

I'm trying to create a df that contains, for every pair of movies, the number of users that rated both movies in that pair.
My original df
df = pd.DataFrame({'index': {0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6},
                   'userID': {0: 1, 1: 2, 2: 2, 3: 3, 4: 3, 5: 1, 6: 2},
                   'MovieID': {0: 0, 1: 1, 2: 0, 3: 2, 4: 1, 5: 1, 6: 2},
                   'rating': {0: 4, 1: 1, 2: 3, 3: 2, 4: 2, 5: 2, 6: 3}})
which looks like:
index  userID  MovieID  rating
    0       1        0       4
    1       2        1       1
    2       2        0       3
    3       3        2       2
    4       3        1       2
    5       1        1       2
    6       2        2       3
What I want to achieve:
movieID/movieID    0    1    2
0                nan    2    1
1                  2  nan    2
2                  1    2  nan
Currently I'm computing this df iteratively: for each unique combination of MovieIDs, I pass the ids to this function:
def foo(id1, id2):
    id1_users = set(df[df["MovieID"] == id1]["userID"].to_list())
    id2_users = set(df[df["MovieID"] == id2]["userID"].to_list())
    combined = len(id1_users & id2_users)
    return combined
Is there a faster way to compute this?
Here's an alternative way to do it. Using itertools.combinations, we can enumerate the pairs of MovieIDs and, for each pair, count the users who rated both movies, giving the values dictionary.
Then we reformat this dictionary into the out dictionary, which we cast to a DataFrame:
from itertools import combinations

users = df.groupby('MovieID')['userID'].apply(list).to_dict()
values = {(mov1, mov2): len(set(users[mov1]).intersection(users[mov2]))
          for mov1, mov2 in combinations(set(df['MovieID']), 2)}

out = {}
for (i, c), v in values.items():
    out.setdefault(c, {})[i] = v
    out.setdefault(i, {})[c] = v

df = pd.DataFrame(out)[[0, 1, 2]].sort_index()
Output:
0 1 2
0 NaN 2.0 1.0
1 2.0 NaN 2.0
2 1.0 2.0 NaN
Note that this outcome is different from yours, but it appears your expected outcome has a mistake: for MovieIDs 1 and 2, userIDs 2 and 3 both rated them, so the value of the corresponding cell should be 2, not 0.
If you want to compute your table without loops, you can first generate a pivot_table with aggfunc=any to identify the users that voted at least once for a movie, then use a dot product to count the cross-correlations, optionally with numpy.fill_diagonal to hide the self-correlations.
d = df.pivot_table(index='userID',
                   columns='MovieID',
                   values='rating',
                   aggfunc=any,
                   fill_value=False)
out = d.T.dot(d)

# optional, to remove self correlations (in place)
import numpy as np
np.fill_diagonal(out.values, np.nan)
Output:
MovieID 0 1 2
MovieID
0 NaN 2 1
1 2 NaN 2
2 1 2 NaN
Intermediate pivot table:
MovieID 0 1 2
userID
1 1 1 0
2 1 1 1
3 0 1 1
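As a side note (my addition, not from the answers above), pd.crosstab can build the same user-by-movie incidence matrix, after which the dot-product trick works identically; a minimal sketch:
import numpy as np
import pandas as pd

df = pd.DataFrame({'userID': [1, 2, 2, 3, 3, 1, 2],
                   'MovieID': [0, 1, 0, 2, 1, 1, 2]})

# 0/1 incidence matrix: 1 if the user rated the movie at least once
incidence = pd.crosstab(df['userID'], df['MovieID']).gt(0).astype(int)

# Co-rating counts via the same dot product as above
out = incidence.T.dot(incidence).astype(float)
np.fill_diagonal(out.values, np.nan)  # hide self-correlations
print(out)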

Creating a new binary column by series mapping from a dictionary

Good day,
I have a dictionary like this:
dict_one = {'M': [1, 3, 5, 10, 12, 14], 'A': [2, 4, 6, 7, 9, 11, 13, 15]}
I wish to map this dictionary onto a dataframe, turning the keys M and A into binary numbers where M = 1 and A = 0, and placing them in a new column. The new column should be filled in based on the 'object' column, which already exists in the dataframe, like this:
object new_column
1 1
2 0
3 1
4 0
5 1
6 0
How do I go about doing this? Help would be truly appreciated.
Thanks
We can just use np.where:
np.where(df.object.isin(dict_one['A']),0,1)
Out[690]: array([1, 0, 1, 0, 1, 0])
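A runnable version of this approach might look like the following sketch (the dataframe construction is my assumption, based on the 'object' column shown in the question):
import numpy as np
import pandas as pd

dict_one = {'M': [1, 3, 5, 10, 12, 14], 'A': [2, 4, 6, 7, 9, 11, 13, 15]}
df = pd.DataFrame({'object': [1, 2, 3, 4, 5, 6]})  # assumed existing frame

# Values found in the 'A' list become 0, everything else becomes 1
df['new_column'] = np.where(df['object'].isin(dict_one['A']), 0, 1)
print(df)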
You can create your dataframe using a list comprehension and then use map:
df = (pd.DataFrame([(x, key) for key, i in dict_one.items() for x in i],
                   columns=['object', 'new_column'])
      .sort_values('object'))
df['new_column'] = df.new_column.map({'M': 1, 'A': 0})
>>> df
object new_column
0 1 1
6 2 0
1 3 1
7 4 0
2 5 1
8 6 0
9 7 0
10 9 0
3 10 1
11 11 0
4 12 1
12 13 0
5 14 1
13 15 0
You could even do it all in one go using replace instead of map:
df = (pd.DataFrame([(x, key) for key, i in dict_one.items() for x in i],
                   columns=['object', 'new_column'])
      .sort_values('object')
      .replace({'new_column': {'M': 1, 'A': 0}}))
EDIT Based on your comments, it seems like you are starting from a dataframe, which I am assuming looks something like:
>>> df
object
0 1
1 2
2 3
3 4
4 5
5 6
In this case, I think your best bet is to create a new mapping dictionary, and just use map:
new_dict = {x:(1 if key=='M' else 0) for key, i in dict_one.items() for x in i}
# {1: 1, 3: 1, 5: 1, 10: 1, 12: 1, 14: 1, 2: 0, 4: 0, 6: 0, 7: 0, 9: 0, 11: 0, 13: 0, 15: 0}
df['new_column'] = df.object.map(new_dict)
>>> df
object new_column
0 1 1
1 2 0
2 3 1
3 4 0
4 5 1
5 6 0

What does "col_level" do in the melt function?

From the documentation:
pd.melt(frame, id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None)
What does col_level do?
Examples with different values of col_level would be great.
My current dataframe is created by the following:
df = pd.DataFrame({'A': {0: 'a', 1: 'b', 2: 'c'},
                   'B': {0: 1, 1: 3, 2: 5},
                   'C': {0: 2, 1: 4, 2: 6}})
df.columns = [list('ABC'), list('DEF'), list('GHI')]
Thanks.
You can check melt:
col_level : int or string, optional
If columns are a MultiIndex then use this level to melt.
And examples:
df = pd.DataFrame({'A': {0: 'a', 1: 'b', 2: 'c'},
                   'B': {0: 1, 1: 3, 2: 5},
                   'C': {0: 2, 1: 4, 2: 6}})

# use MultiIndex.from_arrays to set the level names
df.columns = pd.MultiIndex.from_arrays([list('ABC'), list('DEF'), list('GHI')],
                                       names=list('abc'))
print (df)
a A B C
b D E F
c G H I
0 a 1 2
1 b 3 4
2 c 5 6
#melt by first level of MultiIndex
print (df.melt(col_level=0))
a value
0 A a
1 A b
2 A c
3 B 1
4 B 3
5 B 5
6 C 2
7 C 4
8 C 6
#melt by level a of MultiIndex
print (df.melt(col_level='a'))
a value
0 A a
1 A b
2 A c
3 B 1
4 B 3
5 B 5
6 C 2
7 C 4
8 C 6
#melt by level c of MultiIndex
print (df.melt(col_level='c'))
c value
0 G a
1 G b
2 G c
3 H 1
4 H 3
5 H 5
6 I 2
7 I 4
8 I 6
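For contrast, here is a small sketch (my addition) of what melt does on the same frame when col_level is omitted: on recent pandas, instead of picking one level, it keeps all of them, giving one identifier column per level.
import pandas as pd

df = pd.DataFrame({'A': {0: 'a', 1: 'b', 2: 'c'},
                   'B': {0: 1, 1: 3, 2: 5},
                   'C': {0: 2, 1: 4, 2: 6}})
df.columns = pd.MultiIndex.from_arrays([list('ABC'), list('DEF'), list('GHI')],
                                       names=list('abc'))

# With no col_level, each MultiIndex level becomes its own variable
# column ('a', 'b', 'c'), followed by the 'value' column
print(df.melt())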

Convert a Pandas dataframe to a N-to-1 dictionary, where N columns are keys pointing to a single column as the values

I'd like to make a single dictionary from a Pandas dataframe where each row from N columns points to the values in a single column, and was wondering if there is an efficient way to do this without having to construct a bunch of for loops and dictionary updates.
For example, is there a more programmatic/Pandas'y way to accomplish the following?
import pandas as pd
columns = ["A", "B", "C"]
data = [[1, 11, 111],
        [2, 22, 222],
        [3, 33, 333]]
df = pd.DataFrame(data=data, columns=columns)
df
Out[1]:
A B C
0 1 11 111
1 2 22 222
2 3 33 333
mdict = {}
for c in df.columns:
    mdict.update(dict(zip(df[c], df['A'])))
mdict
Out[2]:
{1: 1, 2: 2, 3: 3, 11: 1, 22: 2, 33: 3, 111: 1, 222: 2, 333: 3}
I'm ultimately trying to create a long dictionary of keys all pointing back to the same value so that I can go to another dataframe and apply the map function to standardize the entries. Is this dictionary step even needed, or is there a simpler way to accomplish this without having to go through an intermediate dictionary? Thanks!
df2 = pd.DataFrame(data=[1, 11, 111, 2, 22, 222, 3, 33, 333], columns=['D'])
df2['D'] = df2['D'].map(mdict)
df2
Out[3]:
D
0 1
1 1
2 1
3 2
4 2
5 2
6 3
7 3
8 3
Another way of doing this would be:
g = df.set_index('A', drop=False).unstack()
m = dict(zip(g.values, g.index.get_level_values(1)))
m
{1: 1, 2: 2, 3: 3, 11: 1, 22: 2, 33: 3, 111: 1, 222: 2, 333: 3}
df2.D.map(m)
0 1
1 1
2 1
3 2
4 2
5 2
6 3
7 3
8 3
Name: D, dtype: int64
In a similar manner, you can pass a pd.Series object to map.
s = pd.Series(g.index.get_level_values(1), index=g.values)
s
1 1
2 2
3 3
11 1
22 2
33 3
111 1
222 2
333 3
Name: A, dtype: int64
df2.D.map(s)
0 1
1 1
2 1
3 2
4 2
5 2
6 3
7 3
8 3
Name: D, dtype: int64
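A closely related sketch of the same idea, using stack instead of unstack (my variation, not from the answers above):
# Stacking aligns every cell value with its row's 'A' key
stacked = df.set_index('A', drop=False).stack()
m = dict(zip(stacked.values, stacked.index.get_level_values('A')))
# {1: 1, 11: 1, 111: 1, 2: 2, 22: 2, 222: 2, 3: 3, 33: 3, 333: 3}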

pandas DataFrame split the column and extend the rows

I have a DataFrame like this:
A B C D
1 1 2 3 ['a','b']
2 4 6 7 ['b','c']
3 1 0 1 ['a']
4 2 1 1 ['b']
5 1 2 3 []
and I want to transform it to:
A B C D
1 1 2 3 ['a']
2 1 2 3 ['b']
3 4 6 7 ['b']
4 4 6 7 ['c']
5 1 0 1 ['a']
6 2 1 1 ['b']
7 1 2 3 []
P.S. In other words: split the lists in column "D" and extend the rows accordingly, using pandas DataFrame operations.
One way would be to use a list comprehension with a doubly nested for-loop:
>>> [(key + (item,))
...  for key, val in df.set_index(['A', 'B', 'C'])['D'].items()
...  for item in list(map(list, val)) or [[]]]
# [(1, 2, 3, ['a']),
# (1, 2, 3, ['b']),
# (4, 6, 7, ['b']),
# (4, 6, 7, ['c']),
# (1, 0, 1, ['a']),
# (2, 1, 1, ['b']),
# (1, 2, 3, [])]
Passing the data in this form to pd.DataFrame produces the desired result:
import pandas as pd

df = pd.DataFrame({'A': {1: 1, 2: 4, 3: 1, 4: 2, 5: 1},
                   'B': {1: 2, 2: 6, 3: 0, 4: 1, 5: 2},
                   'C': {1: 3, 2: 7, 3: 1, 4: 1, 5: 3},
                   'D': {1: ['a', 'b'], 2: ['b', 'c'], 3: ['a'], 4: ['b'], 5: []}})
result = pd.DataFrame(
    [(key + (item,))
     for key, val in df.set_index(['A', 'B', 'C'])['D'].items()
     for item in list(map(list, val)) or [[]]])
yields
0 1 2 3
0 1 2 3 [a]
1 1 2 3 [b]
2 4 6 7 [b]
3 4 6 7 [c]
4 1 0 1 [a]
5 2 1 1 [b]
6 1 2 3 []
Another option is to use df['D'].apply to expand the items in the list into different columns, and then use stack to expand the rows:
df = pd.DataFrame({'A': {1: 1, 2: 4, 3: 1, 4: 2, 5: 1},
                   'B': {1: 2, 2: 6, 3: 0, 4: 1, 5: 2},
                   'C': {1: 3, 2: 7, 3: 1, 4: 1, 5: 3},
                   'D': {1: ['a', 'b'], 2: ['b', 'c'], 3: ['a'], 4: ['b'], 5: []}})
df = df.set_index(['A', 'B', 'C'])
result = df['D'].apply(lambda x: pd.Series(list(map(list, x)) if x else [[]]))
# 0 1
# A B C
# 1 2 3 [a] [b]
# 4 6 7 [b] [c]
# 1 0 1 [a] NaN
# 2 1 1 [b] NaN
# 1 2 3 [] NaN
result = result.stack()
# A B C
# 1 2 3 0 [a]
# 1 [b]
# 4 6 7 0 [b]
# 1 [c]
# 1 0 1 0 [a]
# 2 1 1 0 [b]
# 1 2 3 0 []
# dtype: object
result.index = result.index.droplevel(-1)
result = result.reset_index()
# A B C 0
# 0 1 2 3 [a]
# 1 1 2 3 [b]
# 2 4 6 7 [b]
# 3 4 6 7 [c]
# 4 1 0 1 [a]
# 5 2 1 1 [b]
# 6 1 2 3 []
Although this does not use explicit for-loops or a list comprehension, there is an implicit for-loop hidden in the call to apply. In fact, it is much slower than using a list comprehension:
In [170]: df = pd.concat([df]*10)

In [171]: %%timeit
   .....: result = df['D'].apply(lambda x: pd.Series(list(map(list, x)) if x else [[]]))
   .....: result = result.stack()
   .....: result.index = result.index.droplevel(-1)
   .....: result = result.reset_index()
100 loops, best of 3: 11.5 ms per loop

In [172]: %%timeit
   .....: result = pd.DataFrame(
   .....:     [(key + (item,))
   .....:      for key, val in df['D'].items()
   .....:      for item in list(map(list, val)) or [[]]])
1000 loops, best of 3: 618 µs per loop
Assuming your column D content is of type string:
print(type(df.loc[0, 'D']))
<class 'str'>
df = df.set_index(['A', 'B', 'C']).sort_index()
df.loc[:, 'D'] = df.loc[:, 'D'].str.strip('[').str.strip(']')
df = df.loc[:, 'D'].str.split(',', expand=True).stack()
df = (df.str.strip()
        .apply(lambda x: '[{}]'.format(x))
        .reset_index()
        .drop('level_3', axis=1)
        .rename(columns={0: 'D'}))
A B C D
0 1 0 1 ['a']
1 1 2 3 ['a']
2 1 2 3 ['b']
3 1 2 3 []
4 2 1 1 ['b']
5 4 6 7 ['b']
6 4 6 7 ['c']
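As a closing note (my addition, not part of the original answers): pandas 0.25+ ships DataFrame.explode, which covers most of this use case directly. A minimal sketch, with its caveats spelled out:
import pandas as pd

df = pd.DataFrame({'A': [1, 4, 1, 2, 1],
                   'B': [2, 6, 0, 1, 2],
                   'C': [3, 7, 1, 1, 3],
                   'D': [['a', 'b'], ['b', 'c'], ['a'], ['b'], []]})

# Each list element in D gets its own row; A, B, C are repeated.
result = df.explode('D').reset_index(drop=True)

# Caveats versus the expected output above: the empty list becomes
# NaN rather than [], and D then holds plain strings ('a') rather
# than one-element lists (['a']).
print(result)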
