Extracting specific rows from a data frame - python

I have a data frame df1 with two columns 'ids' and 'names' -
ids names
fhj56 abc
ty67s pqr
yu34o xyz
I have another data frame df2, some of whose columns are -
user values
1 ['fhj56','fg7uy8']
2 ['glao0','rt56yu','re23u']
3 ['fhj56','ty67s','hgjl09']
My result should give me those users from df2 whose values contain at least one of the ids from df1, and also tell me which ids are responsible for putting them into the resultant table. The result should look like -
user values_responsible names
1 ['fhj56'] ['abc']
3 ['fhj56','ty67s'] ['abc','pqr']
User 2 doesn't appear in the resultant table because none of its values exists in df1.
I was trying to do it as follows -
df2.query('values in @df1.ids')
But this doesn't seem to work well.

You can iterate through the rows of df2 and use .loc together with isin to find the matching rows from df1, then collect the matches into lists and build the result DataFrame from a dictionary of those lists.
ids = []
names = []
users = []
for _, row in df2.iterrows():
    result = df1.loc[df1['ids'].isin(row['values'])]
    if not result.empty:
        ids.append(result['ids'].tolist())
        names.append(result['names'].tolist())
        users.append(row['user'])
>>> pd.DataFrame({'user': users, 'values_responsible': ids, 'names': names})[['user', 'values_responsible', 'names']]
user values_responsible names
0 1 [fhj56] [abc]
1 3 [fhj56, ty67s] [abc, pqr]
Or, for tidy data:
ids = []
names = []
users = []
for _, row in df2.iterrows():
    result = df1.loc[df1['ids'].isin(row['values'])]
    if not result.empty:
        ids.extend(result['ids'].tolist())
        names.extend(result['names'].tolist())
        users.extend([row['user']] * len(result['ids']))
>>> pd.DataFrame({'user': users, 'values_responsible': ids, 'names': names})[['user', 'values_responsible', 'names']]
user values_responsible names
0 1 fhj56 abc
1 3 fhj56 abc
2 3 ty67s pqr

Try this, using the idea of unnesting a list cell.
Temp_unnest = pd.DataFrame([[i, x]
                            for i, y in df2['values'].apply(list).items()
                            for x in y], columns=list('IV'))
Temp_unnest['user'] = Temp_unnest.I.map(df2.user)
df1.index = df1.ids
Temp_unnest.assign(names=Temp_unnest.V.map(df1.names)).dropna().groupby('user')[['V', 'names']].agg([lambda x: list(x)])
Out[942]:
V names
<lambda> <lambda>
user
1 [fhj56] [abc]
3 [fhj56, ty67s] [abc, pqr]

I would refactor your second dataframe (essentially, normalizing your database). Something like
user gid id
1 1 'fhj56'
1 1 'fg7uy8'
2 1 'glao0'
2 1 'rt56yu'
2 1 're23u'
3 1 'fhj56'
3 1 'ty67s'
3 1 'hgjl09'
Then, all you have to do is merge the first and second dataframe on the id column.
r = df2.merge(df1, left_on='id', right_on='ids', how='left')
You can exclude any gids for which some of the ids don't have a matching name.
r[~r['gid'].isin( r[r['names'].isna()]['gid'].unique() )]
where r[r['names'].isna()]['gid'].unique() finds all the gids that have at least one id with no matching name, and r[~r['gid'].isin( ... )] keeps only the entries whose gid is not in that list.
If you had more id groups, the second table might look like
user gid id
1 1 'fhj56'
1 1 'fg7uy8'
1 2 '1asdf3'
1 2 '7ada2a'
1 2 'asd341'
2 1 'glao0'
2 1 'rt56yu'
2 1 're23u'
3 1 'fhj56'
3 1 'ty67s'
3 1 'hgjl09'
which would be equivalent to
user values
1 ['fhj56','fg7uy8']
1 ['1asdf3', '7ada2a', 'asd341']
2 ['glao0','rt56yu','re23u']
3 ['fhj56','ty67s','hgjl09']
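To actually build that normalized table from the original df2, one possibility is the rough sketch below, using DataFrame.explode (the 'gid' column is only needed if a user has several id groups):
long_df2 = (df2.explode('values')
               .rename(columns={'values': 'id'})
               .reset_index(drop=True))
long_df2['gid'] = 1  # a single id group per user in this example
long_df2 is then the long-form table shown above and can be merged with df1 as described.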

Related

How can I create a column in one DataFrame containing the count of matches in another DataFrame's column?

I have two pandas DataFrames. The first one, df1, contains a column of file paths and a column of lists containing what users have read access to these file paths. The second DataFrame, df2, contains a list of all possible users. I've created an example below:
df1 = pd.DataFrame()
df1['path'] = ['C:/pathA', 'C:/pathB', 'C:/pathC', 'C:/pathD']
df1['read'] = [['userA', 'userC', 'userD'],
               ['userA', 'userB'],
               ['userB', 'userD'],
               ['userA', 'userB', 'userC', 'userD']]
print(df1)
path read
0 C:/pathA [userA, userC, userD]
1 C:/pathB [userA, userB]
2 C:/pathC [userB, userD]
3 C:/pathD [userA, userB, userC, userD]
df2 = pd.DataFrame(data=['userA', 'userB', 'userC', 'userD'], columns=['user'])
print(df2)
user
0 userA
1 userB
2 userC
3 userD
The end goal is to create a new column df2['read_count'], which should take each user string from df2['user'] and find the total number of matches in the column df1['read'].
The expected output would be exactly that - a count of matches of each user string in the column of lists in df1['read']. Here is what I am expecting based on the example:
df2
user read_count
0 userA 3
1 userB 3
2 userC 2
3 userD 3
I tried putting something together using another question and list comprehension, but no luck. Here is what I currently have:
df2['read_count'] = [sum(all(val in cell for val in row)
                         for cell in df1['read'])
                     for row in df2['user']]
print(df2)
user read_count
0 userA 0
1 userB 0
2 userC 0
3 userD 0
What is wrong with the code I currently have? I've tried stepping through the loops and it all seemed right, but my code can't detect the matches I want.
You can use:
df2.merge(df1['read'].explode().value_counts(),
          left_on='user', right_index=True)
Or, if you really need to use a different kind of aggregation that depends on "path":
df2.merge(df1.explode('read').groupby('read')['path'].count(),
          left_on='user', right_index=True)
output:
user path
0 userA 3
1 userB 3
2 userC 2
3 userD 3
Without df2:
df1['read'].explode().value_counts().reset_index()
# or
# df1.explode('read').groupby('read', as_index=False)['path'].count()
output:
index path
0 userA 3
1 userB 3
2 userC 2
3 userD 3
Let's use np.hstack to flatten the lists, then count the unique values using Counter and map the counts back to df2:
import numpy as np
from collections import Counter

df2['count'] = df2['user'].map(Counter(np.hstack(df1.read)))
Result
user count
0 userA 3
1 userB 3
2 userC 2
3 userD 3
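As for what was wrong with the original comprehension: for val in row iterates over the characters of the user string, so all(val in cell for val in row) tests single letters against each list and never matches. A minimal corrected version of that comprehension might be:
df2['read_count'] = [sum(user in cell for cell in df1['read'])
                     for user in df2['user']]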

pandas groupby with length of lists

I need to display in dataframe columns both the user_id and the length of content_id, which is a list object, but I am struggling to do it using groupby.
Please help with both the groupby and the question asked at the bottom of this post (how do I get the results along with user_id in a dataframe?).
Dataframe types:
df.dtypes
output:
user_id object
content_id object
dtype: object
Sample Data:
user_id content_id
0 user_18085 [cont_2598_4_4, cont_2738_2_49, cont_4482_2_19...
1 user_16044 [cont_2738_2_49, cont_4482_2_19, cont_4994_18_...
2 user_13110 [cont_2598_4_4, cont_2738_2_49, cont_4482_2_19...
3 user_18909 [cont_3170_2_28]
4 user_15509 [cont_2598_4_4, cont_2738_2_49, cont_4482_2_19...
Pandas query:
df.groupby('user_id')['content_id'].count().reset_index()
df.groupby(['user_id'])['content_id'].apply(lambda x: get_count(x))
output:
user_id content_id
0 user_10013 1
1 user_10034 1
2 user_10042 1
When I tried without grouping, it works fine, as below -
df['content_id'].apply(lambda x: len(x))
0 11
1 9
2 11
3 1
But how do I get the results along with user_id in a dataframe? I want it in the format below -
user_id content_id
some xxx 11
some yyy 6
pandas.GroupBy returns a grouper element, not the contents of each cell. As such it is not possible (without a lot of workarounds) to do what you want. Instead you need to simply rewrite the columns (as suggested by @ifly6).
Using
df_agg = df.copy()
df_agg.content_id = df_agg.content_id.apply(len)
df_agg = df_agg.groupby('user_id').sum()
will result in the same dataframe as the Groupby you described.
For completeness' sake, the instruction for a single groupby would be
df.groupby('user_id').agg(lambda x: x.apply(len).sum())
Try converting content_id to a string, splitting it by comma, then reassembling it as a list of lists and counting the list items.
data="""index user_id content_id
0 user_18085 [cont_2598_4_4,cont_2738_2_49,cont_4482_2_19]
1 user_16044 [cont_2738_2_49,cont_4482_2_19,cont_4994_18_]
2 user_13110 [cont_2598_4_4,cont_2738_2_49,cont_4482_2_19]
3 user_18909 [cont_3170_2_28]
4 user_15509 [cont_2598_4_4,cont_2738_2_49,cont_4482_2_19]
"""
df = pd.read_csv(StringIO(data), sep='\s+')
def convert_to_list(x):
x=re.sub(r'[\[\]]', '', x)
lst=list(x.split(','))
return lst
df['content_id2']= [list() for x in range(len(df.index))]
for key,item in df.iterrows():
lst=convert_to_list(str(item['content_id']))
for item in lst:
df.loc[key,'content_id2'].append(item)
def count_items(x):
return len(x)
df['count'] = df['content_id2'].apply(count_items)
df.drop(['content_id'],axis=1,inplace=True)
df.rename(columns={'content_id2':'content_id'},inplace=True)
print(df)
output:
index user_id content_id count
0 0 user_18085 [cont_2598_4_4, cont_2738_2_49, cont_4482_2_19] 3
1 1 user_16044 [cont_2738_2_49, cont_4482_2_19, cont_4994_18_] 3
2 2 user_13110 [cont_2598_4_4, cont_2738_2_49, cont_4482_2_19] 3
3 3 user_18909 [cont_3170_2_28] 1
4 4 user_15509 [cont_2598_4_4, cont_2738_2_49, cont_4482_2_19] 3

How to parse from one column to create another with Pandas and Regex?

I have a pd dataframe with one column user_id and each row ends with "\tgroup..."
I want to create a new column group_id where each row will have the corresponding "tgroup..." matching user_id.
This is my implementation so far:
user_id
0 87dce49a-f752-47f8-9bc4-b97a446a85f5\tgroup-0
1 87dce49a-f752-47f8-9bc4-b97a446a85f5\tgroup-1
2 87dce49a-f752-47f8-9bc4-b97a446a85f5\tgroup-2
3 87dce49a-f752-47f8-9bc4-b97a446a85f5\tgroup-3
4 87dce49a-f752-47f8-9bc4-b97a446a85f5\tgroup-4
df['group_id'] = df['user_id'].apply(lambda x: re.findall('(^\t)',x))
print(df.head())
user_id group_id
0 87dce49a-f752-47f8-9bc4-b97a446a85f5\tgroup-0 []
1 87dce49a-f752-47f8-9bc4-b97a446a85f5\tgroup-1 []
2 87dce49a-f752-47f8-9bc4-b97a446a85f5\tgroup-2 []
3 87dce49a-f752-47f8-9bc4-b97a446a85f5\tgroup-3 []
4 87dce49a-f752-47f8-9bc4-b97a446a85f5\tgroup-4 []
Clearly lambda/regex method is not grabbing the string selection that I want.
Any ideas?
Is \t the tab character or a backslash followed by t? If the latter, you can try:
df['group_id'] = df.user_id.str.extract(r'\\t(.*)')
Output:
user_id group_id
0 87dce49a-f752-47f8-9bc4-b97a446a85f5\tgroup-0 group-0
1 87dce49a-f752-47f8-9bc4-b97a446a85f5\tgroup-1 group-1
2 87dce49a-f752-47f8-9bc4-b97a446a85f5\tgroup-2 group-2
3 87dce49a-f752-47f8-9bc4-b97a446a85f5\tgroup-3 group-3
4 87dce49a-f752-47f8-9bc4-b97a446a85f5\tgroup-4 group-4
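If it is instead a literal tab character, a similar extract with an unescaped \t should do (a sketch along the same lines):
df['group_id'] = df.user_id.str.extract(r'\t(.*)')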

Efficient way to check if 2 columns of DataFrame are subsets of each other

I have 2 DataFrames, each of which has a column whose values are sets of 8-digit integers.
df1 (contains around 200k rows)
id s1
0 0 {43649632, 95799329, 40649644, 23335890, 81779...
1 1 {69900026, 74441229}
2 2 {85195648, 55750338, 98936902, 82000264, 43544...
3 3 {21916700, 13627806}
4 4 {62929026, 38592365, 44179790, 38355127}
df2 (contains around 900k rows)
id s1
0 0 {58209736, 25405713, 28691898, 94682562}
1 1 {81089732, 82343077}
2 2 {59692896, 33234306, 40445479, 18728345, 24464...
3 3 {71406042, 69900026, 74441229}
4 4 {62929026}
I want to know the FASTEST way to find the pairs of ids from df1 and df2 that match ONE OF these conditions:
df1.s1 is a subset of df2.s1
OR
df2.s1 is a subset of df1.s1
For example,
id=1 of df1 is a subset of id=3 of df2, so (1, 3) is a valid pair
id=4 of df2 is a subset of id=4 of df1, so (4, 4) is a valid pair
I have tried the code below, but it's going to take about 20 hours:
id_pairs = []
for i in tqdm(list(df2.itertuples(index=False))):
    for j in df1.itertuples(index=False):
        if i.s1.issubset(j.s1) or j.s1.issubset(i.s1):
            id_pairs.append((i.id, j.id))
Is there a faster or more efficient way to do this?
You could do a cartesian join and then apply the condition
df1["key"] = 0
df2["key"] = 0
merged = df1.merge(df2, how="outer", on="key")
def subset(row):
if (row.s1.issubset(row.s2)) or (row.s2.issubset(row.s1)):
return (row.id1, row.id2)
else:
return None
merged.apply(lambda row: subset(row), axis=1).dropna()

Editing then concatenating values of several columns into a single one (pandas, python)

I'm looking for a way to use pandas and python to combine several columns in an excel sheet with known column names into a new, single one, keeping all the important information as in the example below:
input:
ID,tp_c,tp_b,tp_p
0,transportation - cars,transportation - boats,transportation - planes
1,checked,-,-
2,-,checked,-
3,checked,checked,-
4,-,checked,checked
5,checked,checked,checked
desired output:
ID,tp_all
0,transportation
1,cars
2,boats
3,cars+boats
4,boats+planes
5,cars+boats+planes
The row with ID of 0 contains a description of the contents of each column. Ideally the code would parse the description in that row, look after the '-' and concatenate those values in the new "tp_all" column.
This is quite interesting as it's a reverse get_dummies...
I think I would manually munge the column names so that you have a boolean DataFrame:
In [11]: df1 # df == 'checked'
Out[11]:
cars boats planes
0
1 True False False
2 False True False
3 True True False
4 False True True
5 True True True
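The munging step itself is left implicit; one way it could look is the following sketch (assuming the original frame is df and the transport names follow ' - ' in the ID 0 row):
names = df.iloc[0, 1:].str.split(' - ').str[1]                    # cars, boats, planes
df1 = (df.iloc[1:, 1:] == 'checked').rename(columns=names.to_dict())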
Now you can use an apply with zip:
In [12]: df1.apply(lambda row: '+'.join([col for col, b in zip(df1.columns, row) if b]),
                   axis=1)
Out[12]:
0
1 cars
2 boats
3 cars+boats
4 boats+planes
5 cars+boats+planes
dtype: object
Now you just have to tweak the headers, to get the desired csv.
It would be nice if there were a less manual / faster way to do a reverse get_dummies...
OK a more dynamic method:
In [63]:
# get a list of the columns
col_list = list(df.columns)
# remove 'ID' column
col_list.remove('ID')
# create a dict as a lookup
col_dict = dict(zip(col_list, [df.iloc[0][col].split(' - ')[1] for col in col_list]))
col_dict
Out[63]:
{'tp_b': 'boats', 'tp_c': 'cars', 'tp_p': 'planes'}
In [64]:
# define a func that tests the value and uses the dict to create our string
def func(x):
temp = ''
for col in col_list:
if x[col] == 'checked':
if len(temp) == 0:
temp = col_dict[col]
else:
temp = temp + '+' + col_dict[col]
return temp
df['combined'] = df[1:].apply(lambda row: func(row), axis=1)
df
Out[64]:
ID tp_c tp_b tp_p \
0 0 transportation - cars transportation - boats transportation - planes
1 1 checked NaN NaN
2 2 NaN checked NaN
3 3 checked checked NaN
4 4 NaN checked checked
5 5 checked checked checked
combined
0 NaN
1 cars
2 boats
3 cars+boats
4 boats+planes
5 cars+boats+planes
[6 rows x 5 columns]
In [65]:
df = df.loc[1:, ['ID', 'combined']]
df
Out[65]:
ID combined
1 1 cars
2 2 boats
3 3 cars+boats
4 4 boats+planes
5 5 cars+boats+planes
[5 rows x 2 columns]
Here is one way:
newCol = pandas.Series('', index=d.index)
for col in d.iloc[:, 1:]:
    name = '+' + col.split('-')[1].strip()
    newCol[d[col] == 'checked'] += name
newCol = newCol.str.strip('+')
Then:
>>> newCol
0 cars
1 boats
2 cars+boats
3 boats+planes
4 cars+boats+planes
dtype: object
You can create a new DataFrame with this column or do what you like with it.
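For instance (a sketch, assuming d still carries the 'ID' column):
result = pandas.DataFrame({'ID': d['ID'], 'tp_all': newCol})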
Edit: I see that you have edited your question so that the names of the modes of transportation are now in row 0 instead of in the column headers. It is easier if they're in the column headers (as my answer assumes), and your new column headers don't seem to contain any additional useful information, so you should probably start by just setting the column names to the info from row 0, and deleting row 0.
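Promoting row 0 to column names might look roughly like this (a sketch, assuming the raw frame is df and the transport names follow ' - ' in that row):
df.columns = ['ID'] + [v.split(' - ')[1] for v in df.iloc[0, 1:]]
df = df.iloc[1:].reset_index(drop=True)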
