I have the following list of strings:
my_list=["health","nutrition","nature","nutritionist", "nutritionists", "wellness", "food", "drink", "diet"]
I would like to assign a label to all the rows which contain one or more of the above words:
Search_Col
heathen
dietcoke
loveguru
drinkwine
lovefood
Pringles
then
Search_Col Tag
heathen 1
dietcoke 1
loveguru 0
drinkwine 1
lovefood 1
Pringles 0
I first tried to select the rows which contain elements of my_list as follows:
df.Search_Col.str.contains(my_list)
but it does not select any rows.
Join the values in the list with | to form a regex OR pattern, then convert the boolean mask to 0/1 with Series.view:
df['Tag'] = df.Search_Col.str.contains('|'.join(my_list)).view('i1')
print (df)
Search_Col Tag
0 heathen 0
1 dietcoke 1
2 loveguru 0
3 drinkwine 1
4 lovefood 1
5 Pringles 0
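A side note (not part of the original answer): str.contains treats the joined string as a regular expression, so if any of the search words could contain regex metacharacters it is safer to escape them first, for example:
import re
pattern = '|'.join(re.escape(w) for w in my_list)  # escape any regex metacharacters
df['Tag'] = df.Search_Col.str.contains(pattern).astype('int8')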
This should do what you are looking for. The output differs from your expected result because the first row in your example does not contain any of the entries in the list you provided. (Perhaps you meant "healthen"?)
import pandas as pd
my_list=["health","nutrition","nature","nutritionist", "nutritionists", "wellness", "food", "drink", "diet"]
df = pd.DataFrame(['heathen','dietcoke','loveguru','drinkwine','lovefood','Pringles'], columns = ['Search_Col'] )
df['Tag'] = df.Search_Col.str.contains('|'.join(my_list)).astype('int')
Gives:
Search_Col Tag
0 heathen 0
1 dietcoke 1
2 loveguru 0
3 drinkwine 1
4 lovefood 1
5 Pringles 0
I'm really new to pandas and python in general, so I apologize if this is too basic.
I have a list of indices that I must use to take a subset of the rows of a dataframe. First, I simply sliced the dataframe using the indices, producing df_1. Then I tried index.isin just to see if it also works, producing df_2. Well, it works, but it produces a shorter dataframe (and seemingly ignores some of the rows that are supposed to be selected).
df_1 = df.iloc[df_idx]
df_2 = df[df.index.isin(df_idx)]
So my question is, why are they different? How exactly does index.isin work and when is it appropriate to use it?
Synthesising an index with duplicates reproduces the behaviour you note. If your index has duplicates, it is entirely expected that the two will give different results. If you want to use these interchangeably, you need to ensure that your index values uniquely identify a row.
n = 6
df = pd.DataFrame({"idx":[i//2 for i in range(n)],"col1":[f"text {i}" for i in range(n)]}).set_index("idx")
df_idx = df.index
print(f"""
{df}
{df.iloc[df_idx]}
{df[df.index.isin(df_idx)]}
""")
output
col1
idx
0 text 0
0 text 1
1 text 2
1 text 3
2 text 4
2 text 5
col1
idx
0 text 0
0 text 0
0 text 1
0 text 1
1 text 2
1 text 2
col1
idx
0 text 0
0 text 1
1 text 2
1 text 3
2 text 4
2 text 5
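A quick way to guard against this (a minimal sketch, not from the original answer) is to check whether the index is unique before using positional and label-based selection interchangeably:
# ensure index values uniquely identify rows
if not df.index.is_unique:
    df = df.reset_index()  # keeps the old index as a regular column and gives a unique RangeIndex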
We have a large dataset that needs to be modified based on specific criteria.
Here is a sample of the data:
Input
BL.DB BL.KB MI.RO MI.RA MI.XZ MAY.BE
0 0 1 1 1 0 1
1 0 0 1 0 0 1
SampleData1 = pd.DataFrame([[0,1,1,1,1],[0,0,1,0,0]],columns =
['BL.DB',
'BL.KB',
'MI.RO',
'MI.RA',
'MI.XZ'])
The fields of this data are all formatted 'family.member', and a family may have any number of members. We need to remove all rows of the dataframe which have all 0's for any family.
Simply put, we want to keep only the rows of the data that contain at least one non-zero member of every family.
We have no reproducible code for this problem because we are unsure of where to start.
We thought about using iterrows() but the documentation says:
#You should **never modify** something you are iterating over.
#This is not guaranteed to work in all cases. Depending on the
#data types, the iterator returns a copy and not a view, and writing
#to it will have no effect.
Other questions on S.O. do not quite solve our problem.
Here is what we want the SampleData to look like after we run it:
Expected output
BL.DB BL.KB MI.RO MI.RA MI.XZ MAY.BE
0 0 1 1 1 0 1
SampleData1 = pd.DataFrame([[0,1,1,1,0]],columns = ['BL.DB',
'BL.KB',
'MI.RO',
'MI.RA',
'MI.XZ'])
Also, could you please explain why we should not modify data we iterate over, when we do that all the time with for loops, and what the correct way to modify DataFrames is?
Thanks for the help in advance!
Start by copying df and reformatting its columns into a MultiIndex:
df2 = df.copy()
df2.columns = df.columns.str.split(r'\.', expand=True)
The result is:
BL MI
DB KB RO RA XZ
0 0 1 1 1 0
1 0 0 1 0 0
To generate "family totals", i.e. sums of elements in rows over the top
(0) level of column index, run:
df2.groupby(level=[0], axis=1).sum()
The result is:
BL MI
0 1 2
1 0 1
But actually we want to count zeroes in each row of the above table,
so extend the above code to:
(df2.groupby(level=[0], axis=1).sum() == 0).astype(int).sum(axis=1)
The result is:
0 0
1 1
dtype: int64
meaning:
row with index 0 has no "family zeroes",
row with index 1 has one such zero (for one family).
And to print what we are looking for, run:
df[(df2.groupby(level=[0], axis=1).sum() == 0)\
.astype(int).sum(axis=1) == 0]
i.e. print rows from df, with indices for which the count of
"family zeroes" in df2 is zero.
It's possible to group along axis=1. For each row, check that all families (grouped on the column name before '.') have at least one 1, then slice by this Boolean Series to retain these rows.
m = df.groupby(df.columns.str.split('.').str[0], axis=1).any(1).all(1)
df[m]
# BL.DB BL.KB MI.RO MI.RA MI.XZ MAY.BE
#0 0 1 1 1 0 1
As an illustration, here's what grouping along axis=1 looks like; it partitions the DataFrame by columns.
for idx, gp in df.groupby(df.columns.str.split('.').str[0], axis=1):
    print(idx, gp, '\n')
#BL BL.DB BL.KB
#0 0 1
#1 0 0
#MAY MAY.BE
#0 1
#1 1
#MI MI.RO MI.RA MI.XZ
#0 1 1 0
#1 1 0 0
Now it's straightforward to find the rows where every one of these groups has at least one non-zero column, by chaining any and all along axis=1 as shown above.
You basically want to group on families and retain rows where there is one or more member for all families in the row.
One way to do this is to transpose the original dataframe and then split the index on the period, taking the first element which is the family identifier. The columns are the index values in the original dataframe.
We can then group on the families (level=0) and sum the member flags in each family for every record (df2.groupby(level=0).sum()). Now we retain the index values where every family has at least one non-zero member (.gt(0).all()). We use these values as a mask and apply it as a boolean index on the original dataframe to get the relevant rows.
df2 = SampleData1.T
df2.index = [idx.split('.')[0] for idx in df2.index]
# >>> df2
# 0 1
# BL 0 0
# BL 1 0
# MI 1 1
# MI 1 0
# MI 0 0
# >>> df2.groupby(level=0).sum()
# 0 1
# BL 1 0
# MI 2 1
mask = df2.groupby(level=0).sum().gt(0).all()
>>> SampleData1[mask]
BL.DB BL.KB MI.RO MI.RA MI.XZ
0 0 1 1 1 0
I have a data frame df1 with two columns 'ids' and 'names' -
ids names
fhj56 abc
ty67s pqr
yu34o xyz
I have another data frame df2 which has some of the columns being -
user values
1 ['fhj56','fg7uy8']
2 ['glao0','rt56yu','re23u']
3 ['fhj56','ty67s','hgjl09']
My result should give me those users from df2 whose values contain at least one of the ids from df1, and also tell which ids are responsible for putting them into the resulting table. The result should look like -
user values_responsible names
1 ['fhj56'] ['abc']
3 ['fhj56','ty67s'] ['abc','pqr']
User 2 doesn't appear in the resulting table because none of its values exist in df1.
I was trying to do it as follows -
df2.query('values in #df1.ids')
But this doesn't seem to work well.
You can iterate through the rows of df2 and use .loc together with isin to find the matching rows from df1, collecting the results into lists and building the output DataFrame from a dictionary of those lists:
ids = []
names = []
users = []
for _, row in df2.iterrows():
    result = df1.loc[df1['ids'].isin(row['values'])]
    if not result.empty:
        ids.append(result['ids'].tolist())
        names.append(result['names'].tolist())
        users.append(row['user'])
>>> pd.DataFrame({'user': users, 'values_responsible': ids, 'names': names})[['user', 'values_responsible', 'names']]
user values_responsible names
0 1 [fhj56] [abc]
1 3 [fhj56, ty67s] [abc, pqr]
Or, for tidy data:
ids = []
names = []
users = []
for _, row in df2.iterrows():
    result = df1.loc[df1['ids'].isin(row['values'])]
    if not result.empty:
        ids.extend(result['ids'].tolist())
        names.extend(result['names'].tolist())
        users.extend([row['user']] * len(result['ids']))
>>> pd.DataFrame({'user': users, 'values_responsible': ids, 'names': names})[['user', 'values_responsible', 'names']]
user values_responsible names
0 1 fhj56 abc
1 3 fhj56 abc
2 3 ty67s pqr
Try this, using the idea of unnesting a list cell.
Temp_unnest = pd.DataFrame([[i, x]
                            for i, y in df2['values'].apply(list).iteritems()
                            for x in y], columns=list('IV'))
Temp_unnest['user'] = Temp_unnest.I.map(df2.user)
df1.index = df1.ids
Temp_unnest.assign(names=Temp_unnest.V.map(df1.names)).dropna().groupby('user')['V','names'].agg({(lambda x: list(x))})
Out[942]:
V names
<lambda> <lambda>
user
1 [fhj56] [abc]
3 [fhj56, ty67s] [abc, pqr]
I would refactor your second dataframe (essentially, normalizing your database). Something like
user gid id
1 1 'fhj56'
1 1 'fg7uy8'
2 1 'glao0'
2 1 'rt56yu'
2 1 're23u'
3 1 'fhj56'
3 1 'ty67s'
3 1 'hgjl09'
Then, all you have to do is merge the first and second dataframe on the id column.
r = df2.merge(df1, left_on='id', right_on='ids', how='left')
You can exclude any gids for which some of the ids don't have a matching name.
r[~r['gid'].isin( r[r['names'].isna()]['gid'].unique() )]
where r[r['names'].isna()]['gid'].unique() finds all the gids that have no matching name, and r[~r['gid'].isin( ... )] keeps only the entries whose gid is not in that list.
If you had more id groups, the second table might look like
user gid id
1 1 'fhj56'
1 1 'fg7uy8'
1 2 '1asdf3'
1 2 '7ada2a'
1 2 'asd341'
2 1 'glao0'
2 1 'rt56yu'
2 1 're23u'
3 1 'fhj56'
3 1 'ty67s'
3 1 'hgjl09'
which would be equivalent to
user values
1 ['fhj56','fg7uy8']
1 ['1asdf3', '7ada2a', 'asd341']
2 ['glao0','rt56yu','re23u']
3 ['fhj56','ty67s','hgjl09']
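A minimal sketch of this flatten-then-merge idea on newer pandas (>= 0.25, which provides DataFrame.explode); note it uses an inner merge to keep only matching ids rather than the gid-exclusion step above:
# flatten the list column, one row per (user, id)
flat = df2.explode('values').rename(columns={'values': 'id'})
# keep only ids that exist in df1
r = flat.merge(df1, left_on='id', right_on='ids', how='inner')
# aggregate back to lists per user
result = (r.groupby('user')
            .agg(values_responsible=('ids', list), names=('names', list))
            .reset_index())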
This is my pandas DataFrame with original column names.
old_dt_cm1_tt old_dm_cm1 old_rr_cm2_epf old_gt
1 3 0 0
2 1 1 5
Firstly I want to extract all unique variations of cm, e.g. in this case cm1 and cm2.
After this I want to create a new column per each unique cm. In this example there should be 2 new columns.
Finally in each new column I should store the total count of non-zero original column values, i.e.
old_dt_cm1_tt old_dm_cm1 old_rr_cm2_epf old_gt cm1 cm2
1 3 0 0 2 0
2 1 1 5 2 1
I implemented the first step as follows:
cols = pd.DataFrame(list(df.columns))
ind = [c for c in df.columns if 'cm' in c]
df.ix[:, ind].columns
How can I proceed with steps 2 and 3, so that the solution is automatic? (I don't want to manually define the column names cm1 and cm2, because the original data set might have many cm variations.)
You can use:
print df
old_dt_cm1_tt old_dm_cm1 old_rr_cm2_epf old_gt
0 1 3 0 0
1 2 1 1 5
First you can filter the columns containing the string cm, so columns without cm are removed.
df1 = df.filter(regex='cm')
Now you can change columns to new values like cm1, cm2, cm3.
print [cm for c in df1.columns for cm in c.split('_') if cm[:2] == 'cm']
['cm1', 'cm1', 'cm2']
df1.columns = [cm for c in df1.columns for cm in c.split('_') if cm[:2] == 'cm']
print df1
cm1 cm1 cm2
0 1 3 0
1 2 1 1
Now you can count the non-zero values - convert df1 to a boolean DataFrame and sum; True is counted as 1 and False as 0. You need to count by unique column names, so group by the columns and sum the values.
df1 = df1.astype(bool)
print df1
cm1 cm1 cm2
0 True True False
1 True True True
print df1.groupby(df1.columns, axis=1).sum()
cm1 cm2
0 2 0
1 2 1
You need unique columns, which are added to original df:
print df1.columns.unique()
['cm1' 'cm2']
Last, you can add the new columns by assigning the groupby result to df[['cm1','cm2']]:
df[df1.columns.unique()] = df1.groupby(df1.columns, axis=1).sum()
print df
old_dt_cm1_tt old_dm_cm1 old_rr_cm2_epf old_gt cm1 cm2
0 1 3 0 0 2 0
1 2 1 1 5 2 1
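If you are on a newer pandas version where groupby(..., axis=1) is deprecated, a roughly equivalent sketch (an assumption on my part, not part of the original answer) transposes before grouping:
cm_cols = df.filter(regex='cm')                                    # columns containing 'cm'
families = cm_cols.columns.str.extract(r'(cm\d+)', expand=False)   # assumes families look like cm1, cm2, ...
counts = cm_cols.astype(bool).T.groupby(families.values).sum().T   # non-zero counts per family
df[counts.columns] = counts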
Once you know which columns have cm in them you can map them (with a dict) to the desired new column with an adapted version of this answer:
col_map = {c:'cm'+c[c.index('cm') + len('cm')] for c in ind}
# ^ if you are hard coding this in you might as well use 2
so that each original column maps to 'cm' plus the character directly following it; in this case the mapping would be:
{'old_dm_cm1': 'cm1', 'old_dt_cm1_tt': 'cm1', 'old_rr_cm2_epf': 'cm2'}
Then add the new columns to the DataFrame by iterating over the dict:
for col, new_col in col_map.items():
    if new_col not in df:
        df[new_col] = [int(a != 0) for a in df[col]]
    else:
        df[new_col] += [int(a != 0) for a in df[col]]
Note that int(a!=0) simply gives 0 if the value is 0 and 1 otherwise. The only issue is that dicts are inherently unordered, so it may be preferable to add the new columns in order according to the values (like the answer here):
import operator
for col, new_col in sorted(col_map.items(), key=operator.itemgetter(1)):
    if new_col in df:
        df[new_col] += [int(a != 0) for a in df[col]]
    else:
        df[new_col] = [int(a != 0) for a in df[col]]
to ensure the new columns are inserted in order.
I'm learning data science and would like to make dummy variables for my dataset.
I have a DataFrame with a "Product Category" column whose values are lists of matching categories, looking like ["Category1", "Category2", ..., "CategoryN"].
I know that pandas has a nice function that makes dummy variables automatically (pandas.get_dummies), but in this case I can't use it, I guess(?).
I know how to loop over each row and append 1 to the matching elements of each column.
My current code is this:
for column_name in df.columns[1:]: #first column is "Product Category"; dummy columns (product category names) were appended to the right previously
    for index, _ in enumerate(df[column_name][:10]): #limit 10 rows
        if column_name in df["Product Category"][index]:
            df[column_name][index] = 1
However, the above code is not efficient and I cannot use it since I have more than 100,000 rows. I'd like to somehow do the operations on the whole array, but I can't figure out how to do it.
Could someone help?
I assume your problem is that every row can have multiple dummies set, so "Product Category" is a column of lists of categories. Maybe this should work, although I'm not sure how memory efficient it would be.
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({"Product Category": [['Category1', 'Category2'],
...: ['Category3'],
...: ['Category1', 'Category4'],
...: ['Category1', 'Category3', 'Category5']]})
In [3]: df
Out[3]:
Product Category
0 [Category1, Category2]
1 [Category3]
2 [Category1, Category4]
3 [Category1, Category3, Category5]
In [4]: def list_to_dict(category_list):
...: n_categories = len(category_list)
...: return dict(zip(category_list, [1]*n_categories))
...:
In [5]: df_dummies = pd.DataFrame(list(df['Product Category'].apply(list_to_dict).values)).fillna(0)
In [6]: df_new = df.join(df_dummies)
In [7]: df_new
Out[7]:
Product Category Category1 Category2 Category3 Category4 Category5
0 [Category1, Category2] 1 1 0 0 0
1 [Category3] 0 0 1 0 0
2 [Category1, Category4] 1 0 0 1 0
3 [Category1, Category3, Category5] 1 0 1 0 1
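As a side note, the same dummies can be built more compactly with the string accessor (a sketch on my part, assuming every cell of 'Product Category' is a list of strings and no category name contains '|'):
# join each list into one delimited string, then split it back out into dummy columns
dummies = df['Product Category'].str.join('|').str.get_dummies(sep='|')
df_new = df.join(dummies)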
Using get_dummies(), you can specify which columns to transform into dummy variables. Consider the following example, where multiple items can share the same category but each will fall into only one dummy variable:
df = pd.DataFrame({'Languages': ['R', 'Python', 'C#', 'PHP', 'Java', 'XSLT', 'SQL'],
'ProductCategory': ['Statistical', 'General Purpose',
'General Purpose', 'Web', 'General Purpose',
'Special Purpose', 'Special Purpose']})
# BEFORE
print(df)
# Languages ProductCategory
# 0 R Statistical
# 1 Python General Purpose
# 2 C# General Purpose
# 3 PHP Web
# 4 Java General Purpose
# 5 XSLT Special Purpose
# 6 SQL Special Purpose
newdf = pd.get_dummies(df, columns=['ProductCategory'], prefix=['Categ'])
# AFTER
print(newdf)
# Languages Categ_General Purpose Categ_Special Purpose Categ_Statistical Categ_Web
# 0 R 0 0 1 0
# 1 Python 1 0 0 0
# 2 C# 1 0 0 0
# 3 PHP 0 0 0 1
# 4 Java 1 0 0 0
# 5 XSLT 0 1 0 0
# 6 SQL 0 1 0 0