Good day,
I have a dictionary like this:
dict_one = {'M': [1, 3, 5, 10, 12, 14], 'A': [2, 4, 6, 7, 9, 11, 13, 15]}
I wish to map the dictionary to a data frame with the respective values under each key. However, I also wish to turn the keys M and A into binary numbers, where M = 1 and A = 0, and place them in a new column like this. The new column should be mapped against the 'object' column, which already exists in my data frame.
object new_column
1 1
2 0
3 1
4 0
5 1
6 0
How do I go about doing this? Help would be truly appreciated.
Thanks
We can just use np.where:
np.where(df.object.isin(dict_one['A']),0,1)
Out[690]: array([1, 0, 1, 0, 1, 0])
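If you want to attach that result to the existing dataframe as the new column, you could assign it directly (a minimal sketch, assuming your frame is called df and dict_one uses string keys):
import numpy as np

# 0 where the object value belongs to the 'A' list, otherwise 1
df['new_column'] = np.where(df['object'].isin(dict_one['A']), 0, 1)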
You can create your dataframe using a list comprehension and then use map:
df = (pd.DataFrame([(x, key) for key, i in dict_one.items() for x in i],
                   columns=['object', 'new_column'])
        .sort_values('object'))
df['new_column'] = df.new_column.map({'M':1,'A':0})
>>> df
object new_column
0 1 1
6 2 0
1 3 1
7 4 0
2 5 1
8 6 0
9 7 0
10 9 0
3 10 1
11 11 0
4 12 1
12 13 0
5 14 1
13 15 0
You could even do it all in one go using replace instead of map:
df = (pd.DataFrame([(x, key) for key, i in dict_one.items() for x in i],
                   columns=['object', 'new_column'])
        .sort_values('object')
        .replace({'new_column': {'M': 1, 'A': 0}}))
EDIT Based on your comments, it seems like you are starting from a dataframe, which I am assuming looks something like:
>>> df
object
0 1
1 2
2 3
3 4
4 5
5 6
In this case, I think your best bet is to create a new mapping dictionary, and just use map:
new_dict = {x:(1 if key=='M' else 0) for key, i in dict_one.items() for x in i}
# {1: 1, 3: 1, 5: 1, 10: 1, 12: 1, 14: 1, 2: 0, 4: 0, 6: 0, 7: 0, 9: 0, 11: 0, 13: 0, 15: 0}
df['new_column'] = df.object.map(new_dict)
>>> df
object new_column
0 1 1
1 2 0
2 3 1
3 4 0
4 5 1
5 6 0
I'm trying to create a df that contains, for every pair of movies, the number of users that rated both movies in the pair.
My original df
df = pd.DataFrame({'index': {0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6}, 'userID': {0: 1, 1: 2, 2: 2, 3: 3, 4: 3, 5: 1, 6: 2}, 'MovieID': {0: 0, 1: 1, 2: 0, 3: 2, 4: 1, 5: 1, 6: 2}, 'rating': {0: 4, 1: 1, 2: 3, 3: 2, 4: 2, 5: 2, 6: 3}})
which looks like:
index  userID  MovieID  rating
    0       1        0       4
    1       2        1       1
    2       2        0       3
    3       3        2       2
    4       3        1       2
    5       1        1       2
    6       2        2       3
What I want to achieve:
movieID / movieID    0    1    2
0                  nan    2    1
1                    2  nan    2
2                    1    2  nan
Currently I'm computing this df iteratively: for each unique combination of MovieIDs, I pass the two ids to this function
def foo(id1, id2):
    id1_users = set(df[df["MovieID"] == id1]["userID"].to_list())
    id2_users = set(df[df["MovieID"] == id2]["userID"].to_list())
    combined = len(id1_users & id2_users)
    return combined
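The surrounding driver loop isn't shown in the question; a minimal sketch of what it might look like (the result frame name and the use of itertools.combinations here are assumptions):
from itertools import combinations

# hypothetical driver: call foo once per unordered pair of MovieIDs
movie_ids = sorted(df['MovieID'].unique())
result = pd.DataFrame(index=movie_ids, columns=movie_ids, dtype=float)
for id1, id2 in combinations(movie_ids, 2):
    result.loc[id1, id2] = result.loc[id2, id1] = foo(id1, id2)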
Is there a faster way to compute this?
Here's an alternative way to do it. Using itertools.combinations, we can iterate over all pairs of MovieIDs and, for each pair, count the users who rated both, giving the values dictionary. We then reshape this dictionary into the out dictionary, which we cast to a DataFrame:
from itertools import combinations

# list of raters per movie
users = df.groupby('MovieID')['userID'].apply(list).to_dict()

# number of common raters for each pair of MovieIDs
values = {(mov1, mov2): len(set(users[mov1]).intersection(users[mov2]))
          for mov1, mov2 in combinations(set(df['MovieID']), 2)}

# mirror each pair into a nested dict so the resulting matrix is symmetric
out = {}
for (i, c), v in values.items():
    out.setdefault(c, {})[i] = v
    out.setdefault(i, {})[c] = v

df = pd.DataFrame(out)[[0, 1, 2]].sort_index()
Output:
0 1 2
0 NaN 2.0 1.0
1 2.0 NaN 2.0
2 1.0 2.0 NaN
Note that this outcome is different from yours but it appears your expected outcome has a mistake because for MovieIDs 1 and 2, userIDs 2 and 3 both rate them, so the value of the corresponding cell should be 2 not 0.
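As a quick check of that claim, using the users dict built above (a minimal sketch):
# common raters of MovieIDs 1 and 2 in the question's data
set(users[1]) & set(users[2])   # {2, 3} -> 2 common raters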
If you want to compute your table without loops, you should first generate a pivot_table with any to identify the users that voted at least once for a movie. Then use a dot product to count the cross correlations, optionally with numpy.fill_diagonal to hide the self-correlations.
d = df.pivot_table(index='userID',
                   columns='MovieID',
                   values='rating',
                   aggfunc=any,
                   fill_value=False)
out = d.T.dot(d)
# optional, to remove self correlations (in place)
import numpy as np
np.fill_diagonal(out.values, np.nan)
Output:
MovieID 0 1 2
MovieID
0 NaN 2 1
1 2 NaN 2
2 1 2 NaN
Intermediate pivot table:
MovieID 0 1 2
userID
1 1 1 0
2 1 1 1
3 0 1 1
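For intuition, the dot product counts, for each pair of movies, how many users have a True in both columns of d. A minimal check of one pair (a sketch, assuming the d built above):
# users who rated both MovieID 0 and MovieID 1
int((d[0].astype(bool) & d[1].astype(bool)).sum())   # 2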
I used drop_duplicates() on the original data (subset = A and B), and I made labels for the refined data.
Now I have to make labels for the original data, but my current approach takes too much time and is not efficient.
For example,
My original dataframe is as follows:
A B
1 1
1 1
2 2
2 3
5 3
6 4
5 4
5 4
after drop_duplicates():
A B
1 1
2 2
2 3
5 3
6 4
5 4
after labeling:
A B label
1 1 1
2 2 0
2 3 1
5 3 1
6 4 0
5 4 1
Following is my expected output:
A B label
1 1 1
1 1 1
2 2 0
2 3 1
5 3 1
6 4 0
5 4 1
5 4 1
My current code for achieving above result is as follows:
for i in range(len(origin_data)):
    check = False
    j = 0
    while not check:
        if origin_data['A'].iloc[i] == dropped_data['A'].iloc[j] and origin_data['B'].iloc[i] == dropped_data['B'].iloc[j]:
            origin_data['label'].iloc[i] = dropped_data['label'].iloc[j]
            check = True
        j += 1
As my code takes too much time, is there any way I can do this more efficiently?
You can merge the labeled dataset with the original one:
original.merge(labeled, how="left", on=["A", "B"])
result:
A B label
0 1 1 1
1 1 1 1
2 1 2 0
3 1 3 0
4 1 4 1
5 1 4 1
Full code:
import pandas as pd
original = pd.DataFrame(
    {'A': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1},
     'B': {0: 1, 1: 1, 2: 2, 3: 3, 4: 4, 5: 4}}
)
labeled = pd.DataFrame(
    {'A': {0: 1, 1: 1, 2: 1, 3: 1},
     'B': {0: 1, 1: 2, 2: 3, 3: 4},
     'label': {0: 1, 1: 0, 2: 0, 3: 1}}
)
print(original.merge(labeled, how="left", on=["A", "B"]))
If the problem is just mapping the 'B' labels to the original dataframe, you can use map:
origin_data.B.map(dropped_data.set_index('B').label)
I'm trying to create a column which contains a cumulative sum of the number of entries, tid, grouped according to unique values of (rid, tid). The cumulative sum should increment by the number of entries in the grouping, as shown in the df3 dataframe below, rather than one at a time.
import pandas as pd
df1 = pd.DataFrame({
    'rid': [1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 5, 5, 5],
    'tid': [1, 2, 2, 1, 1, 3, 1, 4, 5, 1, 1, 1, 3]})
rid tid
0 1 1
1 1 2
2 1 2
3 2 1
4 2 1
5 2 3
6 3 1
7 3 4
8 4 5
9 5 1
10 5 1
11 5 1
12 5 3
After the required operation, this gives:
df3 = pd.DataFrame({
    'rid': [1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 5, 5, 5],
    'tid': [1, 2, 2, 1, 1, 3, 1, 4, 5, 1, 1, 1, 3],
    'groupentries': [1, 2, 2, 2, 2, 1, 1, 1, 1, 3, 3, 3, 1],
    'cumulativeentries': [1, 2, 2, 3, 3, 1, 4, 1, 1, 7, 7, 7, 2]})
rid tid groupentries cumulativeentries
0 1 1 1 1
1 1 2 2 2
2 1 2 2 2
3 2 1 2 3
4 2 1 2 3
5 2 3 1 1
6 3 1 1 4
7 3 4 1 1
8 4 5 1 1
9 5 1 3 7
10 5 1 3 7
11 5 1 3 7
12 5 3 1 2
The derived column that I'm after is cumulativeentries, although I've only figured out how to generate the intermediate groupentries column using pandas:
df1.groupby(["rid", "tid"]).size()
Values in cumulativeentries are actually a kind of running count.
The task is to count occurrences of the current tid in the "source area" of the tid column:
from the beginning of the DataFrame,
up to (including) the end of the current group.
To compute both required values for each group, I defined the following function:
def fn(grp):
    lastRow = grp.iloc[-1]   # last row of the current group
    lastId = lastRow.name    # index of this row
    tids = df1.truncate(after=lastId).tid
    return [grp.index.size, tids[tids == lastRow.tid].size]
To get the "source area" mentioned above I used the truncate function.
In my opinion it is a very intuitive solution, based on the notion of the
"source area".
The function returns a list containing both required values:
the size of the current group,
how many tids equal to the current tid are in the
truncated tid column.
To apply this function, run:
df2 = df1.groupby(['rid', 'tid']).apply(fn).apply(pd.Series)\
         .rename(columns={0: 'groupentries', 1: 'cumulativeentries'})
Details:
apply(fn) generates a Series containing 2-element lists.
apply(pd.Series) converts it to a DataFrame (with default column names).
rename sets the target column names.
And the last thing to do is to join this table to df1:
df1.join(df2, on=['rid', 'tid'])
For the first column, use GroupBy.transform with DataFrameGroupBy.size. For the second, use a custom function that takes all values of the tid column up to the last index of the current group, compares them with the group's last value, and counts the matches with sum:
# count how many tid values, from the start of the frame up to the end
# of the current group, equal the group's tid
f = lambda x: (df1['tid'].iloc[:x.index[-1]+1] == x.iat[-1]).sum()

df1['groupentries'] = df1.groupby(["rid", "tid"])['rid'].transform('size')
df1['cumulativeentries'] = df1.groupby(["rid", "tid"])['tid'].transform(f)
print(df1)
rid tid groupentries cumulativeentries
0 1 1 1 1
1 1 2 2 2
2 1 2 2 2
3 2 1 2 3
4 2 1 2 3
5 2 3 1 1
6 3 1 1 4
7 3 4 1 1
8 4 5 1 1
9 5 1 3 7
10 5 1 3 7
11 5 1 3 7
12 5 3 1 2
I've a dataset where one of the columns is as below. I'd like to create a new column based on the following condition.
For values in column_name, if 1 is present, create a new id; if 0 is present, also create a new id. But if 1 is repeated across more than one consecutive row, the id should be the same for all of those rows. The sample output can be seen below.
column_name
1
0
0
1
1
1
1
0
0
1
column_name -- ID
1 -- 1
0 -- 2
0 -- 3
1 -- 4
1 -- 4
1 -- 4
1 -- 4
0 -- 5
0 -- 6
1 -- 7
Say your Series is
s = pd.Series([1, 0, 0, 1, 1, 1, 1, 0, 0, 1])
Then you can use:
>>> ((s != 1) | (s.shift(1) != 1)).cumsum()
0 1
1 2
2 3
3 4
4 4
5 4
6 4
7 5
8 6
9 7
dtype: int64
This checks that either the current entry is not 1, or that the previous entry is not 1, and then performs a cumulative sum on the result.
This essentially leverages the fact that a 1 in the Series preceded by another 1 should be treated as part of the same group, while every 0 calls for an increment. One of four things will happen:
1) 0 with a preceding 0: increment by 1
2) 0 with a preceding 1: increment by 1
3) 1 with a preceding 1: increment by 0
4) 1 with a preceding 0: increment by 1
(
    (df['column_name'] + df['column_name'].shift(1))  # Series with values 0, 1 or 2 (first entry is NaN)
    .fillna(0)                                         # fill the first entry with 0
    .isin([0, 1])                                      # True for cases 1, 2 and 4 above, False for case 3
    .astype('int')                                     # turn the booleans into integers
    .cumsum()
)
Output:
0 1
1 2
2 3
3 4
4 4
5 4
6 4
7 5
8 6
9 7
At this stage I would just use a regular Python for loop:
column_name = pd.Series([1, 0, 0, 1, 1, 1, 1, 0, 0, 1])
ID = [1]
for i in range(1, len(column_name)):
    ID.append(ID[-1] + ((column_name[i] + column_name[i-1]) < 2))
print(ID)
>>> [1, 2, 3, 4, 4, 4, 4, 5, 6, 7]
And then you can assign ID as a column in your dataframe
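For example (a sketch, assuming your dataframe is named df and contains the column_name column):
# attach the running ID as a new column
df['ID'] = ID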
I'd like to make a single dictionary from a Pandas dataframe where each row from N columns points to the values in a single column, and was wondering if there is an efficient way to do this without having to construct a bunch of for loops and dictionary updates.
For example, is there a more programmatic/pandas-y way to accomplish the following?
import pandas as pd
columns = ["A", "B", "C"]
data = [[1, 11, 111],
        [2, 22, 222],
        [3, 33, 333]]
df = pd.DataFrame(data=data, columns=columns)
df
Out[1]:
A B C
0 1 11 111
1 2 22 222
2 3 33 333
mdict = {}
for c in df.columns:
    mdict.update(dict(zip(df[c], df['A'])))
mdict
Out[2]:
{1: 1, 2: 2, 3: 3, 11: 1, 22: 2, 33: 3, 111: 1, 222: 2, 333: 3}
I'm ultimately trying to create a long dictionary of keys all pointing back to the same value so that I can go to another dataframe and apply the map function to standardize the entries. Is this dictionary step even needed, or is there a simpler way to accomplish this without having to go through an intermediate dictionary? Thanks!
df2 = pd.DataFrame(data=[1, 11, 111, 2, 22, 222, 3, 33, 333], columns=['D'])
df2['D'] = df2['D'].map(mdict)
df2
Out[3]:
D
0 1
1 1
2 1
3 2
4 2
5 2
6 3
7 3
8 3
Another way of doing this would be:
g = df.set_index('A', drop=False).unstack()
m = dict(zip(g.values, g.index.get_level_values(1)))
m
{1: 1, 2: 2, 3: 3, 11: 1, 22: 2, 33: 3, 111: 1, 222: 2, 333: 3}
df1.D.map(m)
0 1
1 1
2 1
3 2
4 2
5 2
6 3
7 3
8 3
Name: D, dtype: int64
In a similar manner, you can pass a pd.Series object to map.
s = pd.Series(g.index.get_level_values(1), index=g.values)
s
1 1
2 2
3 3
11 1
22 2
33 3
111 1
222 2
333 3
Name: A, dtype: int64
df1.D.map(s)
0 1
1 1
2 1
3 2
4 2
5 2
6 3
7 3
8 3
Name: D, dtype: int64