I would like to group a DataFrame by partial substrings. This is a sample .csv file:
GridCode,Key
1000,Colour
1000,Colours
1001,Behaviours
1001,Behaviour
1002,Favourite
1003,COLORS
1004,Honours
So far I have imported the file with df = pd.read_csv('sample.csv') and converted all the strings to lowercase with df['Key'] = df['Key'].str.lower(). The first thing I tried was a groupby on GridCode and Key:
g = df.groupby([df['GridCode'],df['Key']]).size()
then unstack and fill:
d = g.unstack().fillna(0)
and the resulting DataFrame is:
Key behaviour behaviours colors colour colours favourite honours
GridCode
1000 0 0 0 1 1 0 0
1001 1 1 0 0 0 0 0
1002 0 0 0 0 0 1 0
1003 0 0 1 0 0 0 0
1004 0 0 0 0 0 0 1
Now what I would like to do is group only the strings containing the substring 'our' (in this case excluding only the colors key), creating a new column named after the desired substring.
The expected result would be like:
Key 'our'
GridCode
1000 2
1001 2
1002 1
1003 0
1004 1
I also tried masking the DataFrame with mask = df['Key'].str.contains('our') and then df1 = df[mask], but I can't figure out how to make a new column with the new groupby counts. Any help would be really appreciated.
>>> import re # for the re.IGNORECASE flag
>>> df['Key'].str.contains('our', flags=re.IGNORECASE).groupby(df['GridCode']).sum()
GridCode
1000 2
1001 2
1002 1
1003 0
1004 1
Name: Key, dtype: float64
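The sum of the boolean mask gives the per-group count directly. If you would rather have integer counts than the float64 shown above, tacking .astype(int) onto the end should do it:
>>> df['Key'].str.contains('our', flags=re.IGNORECASE).groupby(df['GridCode']).sum().astype(int)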
Also, instead of
df.groupby([df['GridCode'], df['Key']])
it is better to write:
df.groupby(['GridCode', 'Key'])
When the keys are columns of the frame you are grouping, passing the column names is equivalent and avoids repeating the frame.
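For example, the whole pipeline from the question could be condensed to the following sketch (unstack's fill_value makes the separate fillna step unnecessary):
g = df.groupby(['GridCode', 'Key']).size()
d = g.unstack(fill_value=0)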
A DataFrame df has a column:
id data_words
1 [salt,major,lab,water]
2 [lab,plays,critical,salt]
3 [water,success,major]
I want to one-hot encode the column:
id critical lab major plays salt success water
1 0 1 1 0 1 0 1
2 1 1 0 1 1 0 0
3 0 0 1 0 0 1 1
What I tried:
Attempt 1:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
df = df.join(pd.DataFrame(mlb.fit_transform(df.pop('data_words')),
columns=mlb.classes_,
index=df.index))
Error: ValueError: columns overlap but no suffix specified: Index(['class'], dtype='object')
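As an aside, that ValueError comes from DataFrame.join: it is raised when the two frames share a column name (here apparently a 'class' column in the real data) and no suffix is given. A hedged sketch of one workaround, where the rsuffix value is an arbitrary choice:
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
df = df.join(
    pd.DataFrame(mlb.fit_transform(df.pop('data_words')),
                 columns=mlb.classes_,
                 index=df.index),
    rsuffix='_mlb',  # appended to overlapping columns coming from the right frame
)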
Attempt 2:
I converted each list into a simple comma-separated string with the following code:
df['data_words_Joined'] = df.data_words.apply(','.join)
which makes the DataFrame look as follows:
id data_words_Joined
1 salt,major,lab,water
2 lab,plays,critical,salt
3 water,success,major
Then I tried
pd.concat([df,pd.get_dummies(df['data_words_Joined'])],axis=1)
But it makes each whole string a single column name, instead of separate words as separate columns:
id salt,major,lab,water lab,plays,critical,salt water,success,major
1 1 0 0
2 0 1 0
3 0 0 1
You can try explode followed by pivot_table:
df_e = df.explode('data_words')
print(df_e.pivot_table(index=df_e['id'],columns=df_e['data_words'],values='id',aggfunc='count',fill_value=0))
Returning the following output:
data_words critical lab major plays salt success water
id
1 0 1 1 0 1 0 1
2 1 1 0 1 1 0 0
3 0 0 1 0 0 1 1
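For what it's worth, pd.crosstab should produce the same table once the list column has been exploded, with a little less typing:
df_e = df.explode('data_words')
print(pd.crosstab(df_e['id'], df_e['data_words']))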
Edit: Adding data for replication purposes:
df = pd.DataFrame({'id':[1,2,3],
'data_words':[['salt','major','lab','water'],['lab','plays','critical','salt'],['water','success','major']]})
Which looks like:
id data_words
0 1 [salt, major, lab, water]
1 2 [lab, plays, critical, salt]
2 3 [water, success, major]
One possible approach could be to use get_dummies with your apply function:
new_df = df.data_words.apply(','.join).str.get_dummies(sep=',')
print(new_df)
Output:
critical lab major plays salt success water
0 0 1 1 0 1 0 1
1 1 1 0 1 1 0 0
2 0 0 1 0 0 1 1
Tested with pandas version 1.1.2 and borrowed input data from Celius Stingher's Answer.
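Since the question's expected output keeps the id column, one small extension is to concatenate it back in:
out = pd.concat([df[['id']], new_df], axis=1)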
I would like to extract all distinct values from specific columns, create a new column for each of them, and count their frequency in every row.
My Input Dataframe is:
import pandas as pd
data = {'user_id': ['abc','def','ghi'],
'alpha': ['A','B,C,D,A','B,C,A'],
'beta': ['1|20|30','350','376']}
df = pd.DataFrame(data = data, columns = ['user_id','alpha','beta'])
print(df)
It looks like this:
user_id alpha beta
0 abc A 1|20|30
1 def B,C,D,A 350
2 ghi B,C,A 376
I want something like this,
user_id alpha beta A B C D 1 20 30 350 376
0 abc A 1|20|30 1 0 0 0 1 1 1 0 0
1 def B,C,D,A 350 1 1 1 1 0 0 0 1 0
2 ghi B,C,A 376 1 1 1 0 0 0 0 0 1
My original data contains 11K rows. And these distinct values in alpha & beta are around 550.
I created a list from all the values in the alpha & beta columns and applied pd.get_dummies, but it resulted in a lot of rows. I would like all the rows to be rolled up based on user_id.
A similar idea is used by CountVectorizer on documents, where it creates columns based on all the words in the sentence and checks the frequency of a word. However, I am guessing Pandas has a better and efficient way to do that.
Grateful for all your assistance. :)
You can use Series.str.get_dummies to create a dummy indicator dataframe for each of the columns alpha and beta, then use pd.concat to concatenate these dataframes along axis=1:
cs = (('alpha', ','), ('beta', '|'))
df1 = pd.concat([df] + [df[c].str.get_dummies(sep=s) for c, s in cs], axis=1)
Result:
print(df1)
user_id alpha beta A B C D 1 20 30 350 376
0 abc A 1|20|30 1 0 0 0 1 1 1 0 0
1 def B,C,D,A 350 1 1 1 1 0 0 0 1 0
2 ghi B,C,A 376 1 1 1 0 0 0 0 0 1
I know how to append a column counting the number of elements in a group, but I need to count only the elements within each group that meet a certain condition.
For example, if I have the following data:
import numpy as np
import pandas as pd
columns=['group1', 'value1']
data = np.array([np.arange(5)]*2).T
mydf = pd.DataFrame(data, columns=columns)
mydf.group1 = [0,0,1,1,2]
mydf.value1 = ['P','F',100,10,0]
valueslist={'50','51','52','53','54','55','56','57','58','59','60','61','62','63','64','65','66','67','68','69','70','71','72','73','74','75','76','77','78','79','80','81','82','83','84','85','86','87','88','89','90','91','92','93','94','95','96','97','98','99','100','A','B','C','D','P','S'}
and my dataframe therefore looks like this:
mydf
group1 value1
0 0 P
1 0 F
2 1 100
3 1 10
4 2 0
I would then want to count the number of rows within each group1 value where value1 is in valueslist.
My desired output is:
group1 value1 count
0 0 P 1
1 0 F 1
2 1 100 1
3 1 10 1
4 2 0 0
After changing the type of the value1 column to match your valueslist (or the other way around), you can use isin to get a True/False column, and convert that to 1s and 0s with astype(int). Then we can apply an ordinary groupby transform:
In [13]: mydf["value1"] = mydf["value1"].astype(str)
In [14]: mydf["count"] = (mydf["value1"].isin(valueslist).astype(int)
.groupby(mydf["group1"]).transform(sum))
In [15]: mydf
Out[15]:
group1 value1 count
0 0 P 1
1 0 F 1
2 1 100 1
3 1 10 1
4 2 0 0
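A minor aside: passing the string name, .transform('sum'), rather than the Python built-in sum lets pandas use its cythonized aggregation path, which tends to be faster on large frames:
mydf["count"] = (mydf["value1"].isin(valueslist).astype(int)
                               .groupby(mydf["group1"]).transform("sum"))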
mydf.value1=mydf.value1.astype(str)
mydf['count']=mydf.group1.map(mydf.groupby('group1').apply(lambda x : sum(x.value1.isin(valueslist))))
mydf
Out[412]:
group1 value1 count
0 0 P 1
1 0 F 1
2 1 100 1
3 1 10 1
4 2 0 0
Data input:
valueslist=['50','51','52','53','54','55','56','57','58','59','60','61','62','63','64','65','66','67','68','69','70','71','72','73','74','75','76','77','78','79','80','81','82','83','84','85','86','87','88','89','90','91','92','93','94','95','96','97','98','99','100','A','B','C','D','P','S']
You can group by group1 and then use transform to sum up how many of each group's values are in the list.
mydf['count'] = mydf.groupby('group1').transform(lambda x: x.astype(str).isin(valueslist).sum())
group1 value1 count
0 0 P 1
1 0 F 1
2 1 100 1
3 1 10 1
4 2 0 0
Here is one way to do it, albeit a one-liner:
mydf.merge(mydf.groupby('group1').apply(lambda x: len(set(x['value1'].values).intersection(valueslist))).reset_index().rename(columns={0: 'count'}), how='inner', on='group1')
group1 value1 count
0 0 P 1
1 0 F 1
2 1 100 1
3 1 10 1
4 2 0 0
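One caveat: because this intersects a set of each group's values with valueslist, it counts distinct matching values, so a group containing the same matching value twice would be counted once, unlike the sum-based answers above.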
I have this dataset called 'event'
id event_type_1 event_type_2 event_type_3
234 0 1 0
234 1 0 0
345 0 0 0
and I want to produce this
id event_type_1 event_type_2 event_type_3
234 1 1 0
345 0 0 0
I tried using
event.groupby('id').sum()
but that just produced
id event_type_1 event_type_2 event_type_3
1 1 1 0
2 0 0 0
The id has been replaced with an incremental value starting at '1'. Why? And how do I get my desired result?
Use the as_index=False parameter:
In [163]: event.groupby('id', as_index=False).sum()
Out[163]:
id event_type_1 event_type_2 event_type_3
0 234 1 1 0
1 345 0 0 0
From the docs:
as_index : boolean, default True
For aggregated output, return object with group labels as the index.
Only relevant for DataFrame input. as_index=False is effectively
“SQL-style” grouped output
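Equivalently, you can keep the default as_index=True and turn the group labels back into a column afterwards:
event.groupby('id').sum().reset_index()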
I am working with data in a .txt file in the format below:
family1 1 0 0 2 0 2 2 0 0 0 1 0 1 1 0 0 0 0 1 NA NA 4
family1 2 0 0 2 2 1 4 0 0 0 0 0 0 0 0 0 0 0 0 NA NA 4
family1 3 0 0 2 5 1 2 0 0 0 1 1 0 1 1 1 0 0 0 NA NA 2
family2 1 0 0 2 5 2 1 1 1 1 0 0 0 0 0 0 0 0 0 NA NA 3
etc.
where the second column is a member of the family and the other columns are numbers that correspond to traits.
I need to compare the relatives listed in this data set to create an output like this:
family1 1 2 traitnumber traitnumber ...
family1 1 3 traitnumber traitnumber ...
family1 2 3 traitnumber traitnumber ...
where the numbers are the relatives.
I have created a data frame using:
import pandas as pd
data = pd.read_csv('file.txt', sep=" ", header=None)
print(data)
Can you offer any advice on the most efficient way to combine this data into the desired rows? I am having trouble thinking of a way to write code for the different combinations, i.e. relatives 1 and 2, 1 and 3, and 2 and 3.
Thank you!
You might find combinations from itertools to be helpful.
from itertools import combinations
print(list(combinations((1, 2, 3), 2)))
Yields
[(1, 2), (1, 3), (2, 3)]
Building on DragonBobZ's comment, you could do something like this, using the DataFrame's groupby to split out the families:
import pandas as pd
data = pd.read_csv('file.txt', sep=" ", header = None)
print(data)
from itertools import combinations
grouped_df = data.groupby(0)
for key, current_subgroup in grouped_df:  # each iteration yields (family name, sub-DataFrame)
    print(key)
    print(current_subgroup)
    print(current_subgroup.shape, "\n")
    print(list(combinations(range(current_subgroup.shape[0]), 2)))
Grabbing the output of the "combinations" line will give you a list of tuples that you can use in conjunction with row indexing to perform the comparisons for the appropriate columns.
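To make that concrete, here is a minimal sketch of how those tuples could drive the pairwise comparison. The element-wise equality of the trait columns is only an assumption; swap in whatever comparison your 'traitnumber' values actually require:
from itertools import combinations

import pandas as pd

data = pd.read_csv('file.txt', sep=" ", header=None)

rows = []
for family, group in data.groupby(0):
    # enumerate every pair of relatives within this family
    for i, j in combinations(range(len(group)), 2):
        left = group.iloc[i]
        right = group.iloc[j]
        # columns 0 and 1 hold the family and the member number;
        # compare the remaining trait columns element-wise
        traits = (left.iloc[2:] == right.iloc[2:]).astype(int).tolist()
        rows.append([family, left.iloc[1], right.iloc[1]] + traits)

result = pd.DataFrame(rows)
print(result)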