Select rows based on a where statement - python

How can I select rows whose values contain the word "link" and assign them category 1, assign values containing "popcorn" category 2, and put everything else in category 3?
Here is a sample, but my actual dataset has hundreds of rows:
import pandas as pd

data = {'model': [['Lisa', 'link'], ['Lisa 2', 'popcorn'], ['telephone', 'rabbit']],
        'launched': [1983, 1984, 1991]}
df = pd.DataFrame(data, columns=['model', 'launched'])
Desired:
model                    launched  category
['Lisa', 'link']         1983      1
['Lisa 2', 'popcorn']    1984      2
['telephone', 'rabbit']  1991      3

You could use np.select to set category to 1 or 2 depending on whether 'link' or 'popcorn' is contained in a given list. Set the default to 3 for the case where neither is contained:
import numpy as np
c1 = ['link' in i for i in df.model]
c2 = ['popcorn' in i for i in df.model]
df['category'] = np.select([c1,c2], [1,2], 3)
model launched category
0 [Lisa, link] 1983 1
1 [Lisa 2, popcorn] 1984 2
2 [telephone, rabbit] 1991 3
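As a side note, if model held plain strings instead of lists, the same np.select pattern would work with vectorized string matching via str.contains. A minimal sketch under that assumption (df_str is a hypothetical example frame, not from the question):
import numpy as np
import pandas as pd

df_str = pd.DataFrame({'model': ['Lisa link', 'Lisa 2 popcorn', 'telephone rabbit'],
                       'launched': [1983, 1984, 1991]})
c1 = df_str['model'].str.contains('link')
c2 = df_str['model'].str.contains('popcorn')
df_str['category'] = np.select([c1, c2], [1, 2], default=3)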

You can use the apply function.
Create a function:
def get_categories(row):
    if 'link' in row.model:
        return 1
    elif 'popcorn' in row.model:
        return 2
    else:
        return 3
And then call it like this:
df['category'] = df.apply(get_categories, axis=1)
df
Outputs:
model launched category
0 [Lisa, link] 1983 1
1 [Lisa 2, popcorn] 1984 2
2 [telephone, rabbit] 1991 3
EDIT:
Based on #gred_data's comment, you can actually do that in one line to improve performance:
df['category'] = df.model.apply(lambda x: 1 if 'link' in x else 2 if 'popcorn' in x else 3)
df
Gets you the same result.

Related

JSON list flatten to dataframe as multiple columns with prefix

I have a JSON with some nested/array items like the one below.
I'm looking at flattening it before saving it into a CSV.
[{'SKU':'SKU1','name':'test name 1',
'ItemSalesPrices':[{'SourceNumber': 'OEM', 'AssetNumber': 'TEST1A', 'UnitPrice': 1600}, {'SourceNumber': 'RRP', 'AssetNumber': 'TEST1B', 'UnitPrice': 1500}],
},
{'SKU':'SKU2','name':'test name 2',
'ItemSalesPrices':[{'SourceNumber': 'RRP', 'AssetNumber': 'TEST2', 'UnitPrice': 1500}],
}
]
I have attempted the good solution here, flatten nested JSON and retain columns (or Pandas json_normalize), but got nowhere, so I'm hoping to get some tips from the community.
SKU   Name         ItemSalesPrices_OEM_UnitPrice  ItemSalesPrices_OEM_AssetNumber  ItemSalesPrices_RRP_UnitPrice  ItemSalesPrices_RRP_AssetNumber
SKU1  test name 1  1600                           TEST1A                           1500                           TEST1B
SKU2  test name 2                                                                  1500                           TEST2
Thank you
Use json_normalize:
first = ['SKU', 'name']
# L is the list of dictionaries shown in the question
df = pd.json_normalize(L, 'ItemSalesPrices', first)
print(df)
  SourceNumber AssetNumber  UnitPrice   SKU         name
0          OEM      TEST1A       1600  SKU1  test name 1
1          RRP      TEST1B       1500  SKU1  test name 1
2          RRP       TEST2       1500  SKU2  test name 2
Then you can pivot the values: aggregate numeric columns with sum and string columns with join:
import numpy as np

f = lambda x: x.sum() if np.issubdtype(x.dtype, np.number) else ','.join(x)
df1 = df.pivot_table(index=first,
                     columns='SourceNumber',
                     aggfunc=f)
df1.columns = df1.columns.map(lambda x: f'{x[0]}_{x[1]}')
df1 = df1.rename_axis(None, axis=1).reset_index()
print(df1)
SKU name AssetNumber_OEM AssetNumber_RRP UnitPrice_OEM \
0 SKU1 test name 1 TEST1A TEST1B 1600.0
1 SKU2 test name 2 NaN TEST2 NaN
UnitPrice_RRP
0 1500.0
1 1500.0

subset columns based on partial match and group level in python

I am trying to split my dataframe based on a partial match of the column name, using a group level stored in a separate dataframe. The dataframes are here, and the expected output is below.
import numpy as np
import pandas as pd

df = pd.DataFrame(data={'a19-76': [0, 1, 2],
                        'a23pz': [0, 1, 2],
                        'a23pze': [0, 1, 2],
                        'b887': [0, 1, 2],
                        'b59lp': [0, 1, 2],
                        'c56-6u': [0, 1, 2],
                        'c56-6uY': [np.nan, np.nan, np.nan]})
ids = pd.DataFrame(data={'id': ['a19', 'a23', 'b8', 'b59', 'c56'],
                         'group': ['test', 'sub', 'test', 'pass', 'fail']})
Desired output:
test_ids = 'a19-76', 'b887'
sub_ids = 'a23pz', 'a23pze'
pass_ids = 'b59lp'
fail_ids = 'c56-6u', 'c56-6uY'
I have written this one-liner, which assigns the group to each column name, but it doesn't create the separate lists required above:
gb = ids.groupby([[col for col in df.columns if col.startswith(tuple(i for i in ids.id))], 'group']).agg(lambda x: list(x)).reset_index()
gb.groupby('group').agg({'level_0':lambda x: list(x)})
thanks for reading
Maybe not what you are looking for, but anyway.
A pending question is what to do with unmatched columns; the answer obviously depends on what you will do after matching.
Plain Python solution
Simple collections wrangling, but there may be a simpler way.
from collections import defaultdict

groups = defaultdict(list)
idsr = ids.to_records(index=False)
for col in df.columns:
    for id, group in idsr:
        if col.startswith(id):
            groups[group].append(col)
            break
    # the following 'else' clause is optional; it creates a group for unmatched columns
    else:  # for ... else ...
        groups['UNGROUPED'].append(col)
Groups =
{'sub': ['a23pz', 'c56-6u'], 'test': ['a19-76', 'b887', 'b59lp']}
Then, afterwards:
df.columns = pd.MultiIndex.from_tuples(sorted([(k, col) for k,id in groups.items() for col in id]))
df =
sub test
a23pz c56-6u a19-76 b59lp b887
0 0 0 0 0 0
1 1 1 1 1 1
2 2 2 2 2 2
pandas solution
1. columns to dataframe
2. product of dataframes (join)
3. filtering of the resulting dataframe
There is surely a better way.
df1 = ids.copy()
df2 = df.columns.to_frame(index=False)
df2.columns = ['col']
# Not tested enhancement:
# with pandas version >= 1.2, the four following lines may be replaced by a single one :
# dfm = df1.merge(df2, how='cross')
df1['join'] = 1
df2['join'] = 1
dfm = df1.merge(df2, on='join').drop('join', axis=1)
df1.drop('join', axis=1, inplace = True)
dfm['match'] = dfm.apply(lambda x: x.col.find(x.id), axis=1).ge(0)
dfm = dfm[dfm.match][['group', 'col']].sort_values(by=['group', 'col'], axis=0)
dfm =
group col
6 sub a23pz
24 sub c56-6u
0 test a19-76
18 test b59lp
12 test b887
# Note 1: The index can be removed
# Note 2: Unmatched columns are not taken into account
Then, afterwards:
df.columns = pd.MultiIndex.from_frame(dfm)
df =
group sub test
col a23pz c56-6u a19-76 b59lp b887
0 0 0 0 0 0
1 1 1 1 1 1
2 2 2 2 2 2
You can use a regex generated from the values in iidf (the ids dataframe) and filter:
Example with "test":
s = iidf.set_index('group')['id']
regex_test = '^(%s)' % '|'.join(s.loc['test'])
# the generated regex is: '^(a19|b8|b59)'
df.filter(regex=regex_test)
output:
a19-76 b887 b59lp
0 0 0 0
1 1 1 1
2 2 2 2
To get a list of columns for each unique group in iidf, apply the same process in a dictionary comprehension:
{x: list(df.filter(regex='^(%s)' % '|'.join(s.loc[x])).columns)
for x in s.index.unique()}
output:
{'test': ['a19-76', 'b887', 'b59lp'],
'sub': ['a23pz', 'c56-6u']}
NB: this should generalize to any number of groups; however, if there really are many groups, it will be preferable to loop over the column names rather than calling filter repeatedly.
A straightforward groupby(...).apply(...) can achieve this result:
def id_match(group, to_match):
    # build an anchored alternation from the group's id prefixes, e.g. '^(a19|b8)'
    regex = "^({})".format("|".join(group))
    matches = to_match.str.match(regex)
    return pd.Series(to_match[matches])

matched_ids = ids.groupby("group")["id"].apply(id_match, df.columns)
print(matched_ids)
group
fail   0     c56-6u
       1    c56-6uY
pass   0      b59lp
sub    0      a23pz
       1     a23pze
test   0     a19-76
       1       b887
You can treat this Series as a dictionary-like entity to access each of the groups independently:
print(matched_ids["fail"])
0 c56-6u
1 c56-6uY
Name: id, dtype: object
print(matched_ids["pass"])
0    b59lp
Name: id, dtype: object
Then you can take it a step further and subset your original DataFrame with this new Series like so:
print(df[matched_ids["fail"]])
c56-6u c56-6uY
0 0 NaN
1 1 NaN
2 2 NaN
print(df[matched_ids["pass"]])
   b59lp
0      0
1      1
2      2

Compare two dataframes in Python with different numbers of rows and a composite key

I have two different dataframes which I need to compare.
These two dataframes have different numbers of rows and don't have a single PK; the composite primary key is (id||ver||name||prd||loc).
df1:
id ver name prd loc
a 1 surya 1a x
a 1 surya 1a y
a 2 ram 1a x
b 1 alex 1b z
b 1 alex 1b y
b 2 david 1b z
df2:
id ver name prd loc
a 1 surya 1a x
a 1 surya 1a y
a 2 ram 1a x
b 1 alex 1b z
I tried the code below, and it works if there is the same number of rows, but in a case like the above it does not.
df1 = pd.DataFrame(Source)
df1 = df1.astype(str)  # converting all elements to strings for easy comparison
df2 = pd.DataFrame(Target)
df2 = df2.astype(str)  # converting all elements to strings for easy comparison
header_list = df1.columns.tolist()  # column names from df1, as both dfs have the same structure
df3 = pd.DataFrame(data=None, columns=df1.columns, index=df1.index)
for x in range(len(header_list)):
    df3[header_list[x]] = np.where(df1[header_list[x]] == df2[header_list[x]], 'True', 'False')
df3.to_csv('Output', index=False)
Please let me know how to compare the datasets if there are different numbers of rows.
You can try this:
~df1.isin(df2)
# df1[~df1.isin(df2)].dropna()
Let's consider a quick example:
df1 = pd.DataFrame({
    'Buyer': ['Carl', 'Carl', 'Carl'],
    'Quantity': [18, 3, 5]})
#   Buyer  Quantity
# 0  Carl        18
# 1  Carl         3
# 2  Carl         5
df2 = pd.DataFrame({
    'Buyer': ['Carl', 'Mark', 'Carl', 'Carl'],
    'Quantity': [2, 1, 18, 5]})
#   Buyer  Quantity
# 0  Carl         2
# 1  Mark         1
# 2  Carl        18
# 3  Carl         5
~df2.isin(df1)
# Buyer Quantity
# 0 False True
# 1 True True
# 2 False True
# 3 True True
df2[~df2.isin(df1)].dropna()
# Buyer Quantity
# 1 Mark 1
# 3 Carl 5
Another idea would be to merge on the same column names.
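For instance, a minimal sketch of that merge idea using the indicator flag, applied to the composite key from the question (my own illustration, not the original answer's code):
key = ['id', 'ver', 'name', 'prd', 'loc']
merged = df1.merge(df2, on=key, how='left', indicator=True)
# rows of df1 that have no exact match in df2 on the full composite key
only_in_df1 = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')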
Sure, tweak the code to your needs. Hope this helped :)

count unique values in groups pandas

I have a dataframe like this:
data = {'id': [1, 1, 1, 2, 2, 3],
        'value': ['a', 'a', 'a', 'b', 'b', 'c'],
        'obj_id': [1, 2, 3, 3, 3, 4]
        }
df = pd.DataFrame(data, columns=['id', 'value', 'obj_id'])
I would like to get the unique counts of obj_id grouped by id and value:
1 a 3
2 b 1
3 c 1
But when I do:
result=df.groupby(['id','value'])['obj_id'].nunique().reset_index(name='obj_counts')
the result I got was:
1 a 2
1 a 1
2 b 1
3 c 1
so the first two rows with the same id and value don't group together.
How can I fix this? Many thanks!
For me, your solution works fine with the sample data.
As mentioned by #YOBEN_S in the comments, the likely problem is trailing whitespace; the solution then is to add Series.str.strip:
data = {'id': [1, 1, 1, 2, 2, 3],
        'value': ['a ', 'a', 'a', 'b', 'b', 'c'],
        'obj_id': [1, 2, 3, 3, 3, 4]
        }
df = pd.DataFrame(data, columns=['id', 'value', 'obj_id'])
df['value'] = df['value'].str.strip()
df = df.groupby(['id', 'value'])['obj_id'].nunique().reset_index(name='obj_counts')
print(df)
id value obj_counts
0 1 a 3
1 2 b 1
2 3 c 1

How to factorize two data frames at the same time with python-pandas?

I have two data frames: one is user-item-rating and the other is side information about the items:
#df1
A12VH45Q3H5R5I B000NWJTKW 5.0
A3J8AQWNNI3WSN B000NWJTKW 4.0
A1XOBWIL4MILVM BDASK99000 1.0
#df2
B000NWJTKW ....
BDASK99000 ....
Now I'd like to map the item and user names to integer IDs. I know there is a way with factorize:
df.apply(lambda x: pd.factorize(x)[0] + 1)
But I'd like to ensure that the item integers in the two data frames are consistent, so the resulting data frames are:
#df1
1 1 5.0
2 1 4.0
3 2 1.0
#df2
1 ...
2 ...
Do you know how to ensure that? Thanks in advance!
Concatenate the common column(s), and apply pd.factorize (or pd.Categorical) on that:
codes, uniques = pd.factorize(pd.concat([df1['item'], df2['item']]))
df1['item'] = codes[:len(df1)] + 1
df2['item'] = codes[len(df1):] + 1
For example,
import pandas as pd
df1 = pd.DataFrame(
[('A12VH45Q3H5R5I', 'B000NWJTKW', 5.0),
('A3J8AQWNNI3WSN', 'B000NWJTKW', 4.0),
('A1XOBWIL4MILVM', 'BDASK99000', 1.0)], columns=['user', 'item', 'rating'])
df2 = pd.DataFrame(
[('B000NWJTKW', 10),
('BDASK99000', 20)], columns=['item', 'extra'])
codes, uniques = pd.factorize(pd.concat([df1['item'], df2['item']]))
df1['item'] = codes[:len(df1)] + 1
df2['item'] = codes[len(df1):] + 1
codes, uniques = pd.factorize(df1['user'])
df1['user'] = codes + 1
print(df1)
print(df2)
yields
# df1
user item rating
0 1 1 5
1 2 1 4
2 3 2 1
# df2
item extra
0 1 10
1 2 20
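The pd.Categorical route mentioned at the top works the same way; a minimal sketch, starting again from the original string columns (note that Categorical sorts its categories, so the numbering can differ from factorize, but it stays consistent across both frames):
cats = pd.Categorical(pd.concat([df1['item'], df2['item']]))
df1['item'] = cats.codes[:len(df1)] + 1
df2['item'] = cats.codes[len(df1):] + 1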
Another way to work around the problem (if you have enough memory) would be to merge the two DataFrames: df3 = pd.merge(df1, df2, on='item', how='outer'), and then factorize df3['item']:
df3 = pd.merge(df1, df2, on='item', how='outer')
for col in ['item', 'user']:
df3[col] = pd.factorize(df3[col])[0] + 1
print(df3)
yields
user item rating extra
0 1 1 5 10
1 2 1 4 10
2 3 2 1 20
Another option could be to apply factorize on the first dataframe, and then apply the resulting mapping to the second dataframe:
# create factorization:
idx, levels = pd.factorize(df1['item'])
# replace the items in the first dataframe with the new integer codes
df1['item'] = idx
# create a dictionary mapping each original item name to its new integer code
d = {code: i for i, code in enumerate(levels)}
# apply this mapping to the second dataframe
df2['item'] = df2.item.apply(lambda code: d[code])
This approach will only work if every item in the second dataframe is also present in the first.
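If that is not guaranteed, one workaround (my own suggestion, not part of the original answer) is to map with a fallback instead of a plain dict lookup:
# items never seen in df1 get the sentinel -1 instead of raising a KeyError
df2['item'] = df2['item'].map(d).fillna(-1).astype(int)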
