I have the following dataframe df:
df =
REGION GROUP_1 GROUP_2 GROUP_3
Reg1 AAA BBB AAA
Reg2 BBB AAA CCC
Reg1 BBB CCC CCC
I need to count the number of unique occurrences of the values of GROUP_1, GROUP_2 and GROUP_3, grouped per REGION (there are 50 GROUP_ columns in my real dataset).
For the above example, the result should be the following:
result =
REGION COUNT_AAA COUNT_BBB COUNT_CCC
Reg1 1 2 1
Reg2 1 1 1
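For reference, the example frame can be built directly from the table above:
import pandas as pd

df = pd.DataFrame({'REGION':  ['Reg1', 'Reg2', 'Reg1'],
                   'GROUP_1': ['AAA', 'BBB', 'BBB'],
                   'GROUP_2': ['BBB', 'AAA', 'CCC'],
                   'GROUP_3': ['AAA', 'CCC', 'CCC']})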
This is my code:
df = (pd.melt(df, id_vars=['REGION'], value_name='GROUP')
        .drop('variable', axis=1).drop_duplicates()
        .groupby(['REGION', 'GROUP']).agg({'GROUP': 'count'})
        .reset_index())
The problem is that it takes too much time for 1 GB of data. I cannot even check the result on the whole dataset because of the very long calculation time. I suspect there is something wrong in the code, or that it can be simplified.
You could start off by dropping duplicated values within the GROUP_X columns of each row. Then, with the help of lreshape, consolidate them into a single GROUP column.
Perform a groupby with REGION as the key and compute value_counts to get the respective unique counts in the GROUP column.
Finally, unstack the resulting multi-index series into a dataframe and add an optional prefix to the column headers.
slow approach:
(pd.lreshape(df.apply(lambda x: x.drop_duplicates(), 1),
{"GROUP": df.filter(like='GROUP').columns})
.groupby('REGION')['GROUP'].value_counts().unstack().add_prefix('COUNT_'))
To obtain a flat DF:
(pd.lreshape(df.apply(lambda x: x.drop_duplicates(), 1),
{"GROUP": df.filter(like='GROUP').columns}).groupby('REGION')['GROUP'].value_counts()
.unstack().add_prefix('COUNT_').rename_axis(None, 1).reset_index())
slightly faster approach:
With the help of MultiIndex.from_arrays, we could compute unique rows too.
midx = pd.MultiIndex.from_arrays(df.filter(like='GROUP').values, names=df.REGION.values)
d = pd.DataFrame(midx.levels, midx.names)
d.stack().groupby(level=0).value_counts().unstack().rename_axis('REGION')
faster approach:
A faster way is to build the unique row values with pd.unique (faster than np.unique, as it does not sort after finding the unique elements) while iterating through the array of GROUP_X columns; this takes the major chunk of the time. Then stack, groupby, value_counts and finally unstack back.
d = pd.DataFrame([pd.unique(i) for i in df.filter(like='GROUP').values], df.REGION)
d.stack().groupby(level=0).value_counts(sort=False).unstack()
set_index on REGION, apply pd.value_counts across each row, use notnull to convert the 1s and 2s to True and np.nan to False, then groupby + sum.
df.set_index('REGION').apply(
pd.value_counts, 1).notnull().groupby(level=0).sum().astype(int)
AAA BBB CCC
REGION
Reg1 1 2 1
Reg2 1 1 1
Even Faster
val = df.filter(like='GROUP').values            # 2-D array of the GROUP_X values
reg = df.REGION.values.repeat(val.shape[1])     # REGION label repeated once per GROUP_X column
idx = df.index.values.repeat(val.shape[1])      # original row label repeated the same way
grp = val.ravel()                               # flattened GROUP values
# the dict comprehension keeps one entry per (row, region, value) triple, i.e. de-duplicates within rows
pd.Series({(i, r, g): 1 for i, r, g in zip(idx, reg, grp)}).groupby(level=[1, 2]).sum().unstack()
Faster Still
from collections import Counter
val = df.filter(like='GROUP').values
reg = df.REGION.values.repeat(val.shape[1])
idx = df.index.values.repeat(val.shape[1])
grp = val.ravel()
pd.Series(Counter([(r, g) for _, r, g in pd.unique([(i, r, g) for i, r, g in zip(idx, reg, grp)]).tolist()])).unstack()
Related
I need to add the number of unique values in column C (right table) to the related row in the left table, based on the values in the common column A (as shown in the picture).
Thank you in advance.
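Since the tables are only shown as a picture, here are hypothetical sample frames for the answers below; the C values in df2 are invented, chosen only so that the per-A unique counts match the output shown:
import pandas as pd

# left table: columns A and B, as reflected in the output below
df1 = pd.DataFrame({'A': [2, 3, 5, 1, 6],
                    'B': [22, 23, 21, 24, 21]})

# right table: columns A and C; the C values are made up so that the number of
# unique C per A is 3, 2, 1, 1, 1 for A = 2, 3, 5, 1, 6 respectively
df2 = pd.DataFrame({'A': [2, 2, 2, 3, 3, 5, 1, 6],
                    'C': [10, 11, 12, 20, 21, 30, 40, 50]})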
Group by column A in the second dataset and calculate the number of unique values in column C. Merge it with the first dataset on column A. Rename column C to C-count if needed:
>>> count_df = df2.groupby('A', as_index=False).C.nunique()
>>> output = pd.merge(df1, count_df, on='A')
>>> output.rename(columns={'C':'C-count'}, inplace=True)
>>> output
A B C-count
0 2 22 3
1 3 23 2
2 5 21 1
3 1 24 1
4 6 21 1
Use GroupBy.nunique with Series.map to create the new column in df1:
df1['C-count'] = df1['A'].map(df2.groupby('A')['C'].nunique())
This may not be the most efficient way of doing this, so be careful if your dataframes are large.
Define the following function:
def c_value(a_value, right_table):
    c_ids = []
    for index, row in right_table.iterrows():
        if row['A'] == a_value:
            if row['C'] not in c_ids:
                c_ids.append(row['C'])
    return len(c_ids)
For this function I'm assuming that right_table is a pandas.DataFrame.
Now, do the following to build the new column (assuming the left table is also a pandas.DataFrame):
new_column = []
for index, row in left_table.iterrows():
    new_column.append(c_value(row['A'], right_table))
left_table["C-count"] = new_column
After this, the left_table DataFrame should be the desired one (as far as I understand what you need).
I'm trying to do a conditional count across records in a pandas dataframe. I'm new to Python and have a working solution using a for loop, but running it on a large dataframe with ~200k rows takes a long time. I believe there is a better way, by defining a function and using apply, but I'm having trouble figuring it out.
Here's a simple example.
Create a pandas dataframe with two columns:
import pandas as pd
data = {'color': ['blue','green','yellow','blue','green','yellow','orange','purple','red','red'],
'weight': [4,5,6,4,1,3,9,8,4,1]
}
df = pd.DataFrame(data)
# for each row, count the number of other rows with the same color and a lesser weight
counts = []
for i in df.index:
    c = df.loc[i, 'color']
    w = df.loc[i, 'weight']
    ct = len(df.loc[(df['color'] == c) & (df['weight'] < w)])
    counts.append(ct)
df['counts, same color & less weight'] = counts
For each record, the 'counts, same color & less weight' column is intended to get a count of the other records in the df with the same color and a lesser weight. For example, the result for row 0 (blue, 4) is zero because no other records with color=='blue' have lesser weight. The result for row 1 (green, 5) is 1 because row 4 is also color=='green' but weight==1.
How do I define a function that can be applied to the dataframe to achieve the same?
I'm familiar with apply, for example to square the weight column I'd use:
df['weight squared'] = df['weight'].apply(lambda x: x**2)
... but I'm unclear how to use apply to do a conditional calculation that refers to the entire df.
Thanks in advance for any help.
We can do this with a groupby transform using min:
df.weight.gt(df.groupby('color').weight.transform('min')).astype(int)
0 0
1 1
2 1
3 0
4 0
5 0
6 0
7 0
8 1
9 0
Name: weight, dtype: int64
#df['c...]=df.weight.gt(df.groupby('color').weight.transform('min')).astype(int)
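Note that the 0/1 indicator above matches the sample output only because no color appears more than twice. If you need an actual count of strictly lighter rows of the same color, one possible sketch (not part of the original answer) uses a groupby rank:
# rank(method='min') equals 1 + the number of strictly smaller weights within
# each color group, so subtracting 1 gives the requested count per row
df['counts, same color & less weight'] = (df.groupby('color')['weight']
                                            .rank(method='min')
                                            .sub(1)
                                            .astype(int))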
I have a dataframe
df = pd.DataFrame({'col1': [1,2,1,2], 'col2': ['aa bb cc', 'ee-ff-gg', 'hh ii kk', 'll-mm-nn']})
I want to:
Split col2 on the ' ' where col1==1
Split on the '-' where col1==2
Append this data to 3 new columns: (col20, col21, col22)
Ideally the code would look like this:
subdf=df.loc[df['col1']==1]
#list of columns to use
col_list=['col20', 'col21', 'col22']
#append to dataframe new columns from split function
subdf[col_list] = subdf.col2.str.split(' ', 2, expand=True)
However, this hasn't worked.
I have tried using merge and join, however:
join doesn't work if the columns are already populated
merge doesn't work if they aren't.
I have also tried:
#subset dataframes
subdf=df.loc[df['col1']==1]
subdf2=df.loc[df['col1']==2]
#trying the join method, only works if columns aren't already present
subdf.join(subdf.col2.str.split(' ', 2, expand=True).rename(columns={0:'col20', 1:'col21', 2: 'col22'}))
#merge doesn't work if columns aren't present
subdf2=subdf2.merge(subdf2.col2.str.split('-', 2, expand=True).rename(columns={0:'col20', 1:'col21', 2: 'col22'}))
subdf2
The error message when I run it:
subdf2=subdf2.merge(subdf2.col2.str.split('-', 2, expand=True).rename(columns={0:'col20', 1:'col21', 2: 'col22'}))
MergeError: No common columns to perform merge on. Merge options: left_on=None, right_on=None, left_index=False, right_index=False
EDIT: giving information after Mark's comment on regex
My original col1 was actually the regex combination I had used to extract col2 from some strings.
#the combination I used to extract the col2
combinations= ['(\d+)[-](\d+)[-](\d+)[-](\d+)', '(\d+)[-](\d+)[-](\d+)'... ]
here is the original dataframe
col1 col2
(\d+)[-](\d+)[-](\d+)[-](\d+) 350-300-50-10
(\d+)[-](\d+)[-](\w+)(\d+) 150-180-G31
I then created a dictionary that connected every combination to what the split values of col2 represented:
filtermap={'(\d+)[-](\d+)[-](\w+)(\d+)': 'thickness temperature sample', '(\d+)[-](\d+)[-](\d+)[-](\d+)': 'thickness temperature width height' }
With this filter I wanted to:
subset the dataframe based on the regex combinations
use split on col2 to find the values corresponding to each combination using the filtermap (thickness, temperature, ...)
add these values to new columns on the dataframe
col1 col2 thickness temperature width length sample
(\d+)[-](\d+)[-](\d+)[-](\d+) 350-300-50-10 350 300 50 10
(\d+)[-](\d+)[-](\w+)(\d+) 150-180-G31 150 180 G31
Since you mentioned regex, maybe you know of a way to do this directly?
EDIT 2: input-output
In the input there are strings like these:
'this is the first example string 350-300-50-10 ',
'this is the second example string 150-180-G31'
The formats are:
number-number-number-number (350-300-50-10) carries this ordered information: thickness(350)-temperature(300)-width(50)-length(10)
number-number-letternumber (150-180-G31) carries this ordered information: thickness-temperature-sample
desired output:
col2, thickness, temperature, width, length, sample
350-300-50-10 350 300 50 10 None
150-180-G31 150 180 None None G31
I used, e.g.:
re.search('(\d+)[-](\d+)[-](\d+)[-](\d+)', ...)
to find the col2 values in the strings.
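Since the raw strings and the patterns are known, one possible direct route (a sketch, not taken from the answers below; the exact patterns, including the letter-prefix for the sample group, are my assumptions) is str.extract with named capture groups:
import pandas as pd

# hypothetical raw strings from the edit above
s = pd.Series(['this is the first example string 350-300-50-10 ',
               'this is the second example string 150-180-G31'])

# named capture groups become column names directly
four_part = s.str.extract(r'(?P<thickness>\d+)-(?P<temperature>\d+)-(?P<width>\d+)-(?P<length>\d+)')
three_part = s.str.extract(r'(?P<thickness>\d+)-(?P<temperature>\d+)-(?P<sample>[A-Za-z]\d+)')

# prefer the four-part match, fall back to the three-part one where it is missing
result = four_part.combine_first(three_part)
result.insert(0, 'col2', s.str.extract(r'(\d+-[\w-]+)', expand=False))
print(result)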
You can use np.where to simplify this problem.
import pandas as pd
import numpy as np
df = pd.DataFrame({'col1': [1,2,1,2],
'col2': ['aa bb cc', 'ee-ff-gg', 'hh ii kk', 'll-mm-nn']
})
temp = np.where(df['col1'] == 1, #a boolean array/series indicating where the values are equal to 1.
df['col2'].str.split(' '), #Use the output of this if True
df['col2'].str.split('-') #Else use this.
)
temp_df = pd.DataFrame(temp.tolist()) #create a new dataframe with the columns we need
#Output:
0 1 2
0 aa bb cc
1 ee ff gg
2 hh ii kk
3 ll mm nn
Now just assign the result back to the original df. You can use a concat or join, but a simple assignment suffices as well.
df[[f'col2_{i}' for i in temp_df.columns]] = temp_df
print(df)
col1 col2 col2_0 col2_1 col2_2
0 1 aa bb cc aa bb cc
1 2 ee-ff-gg ee ff gg
2 1 hh ii kk hh ii kk
3 2 ll-mm-nn ll mm nn
EDIT: To address more than two conditional splits
np.where is designed only for a binary selection. If you need more than two conditions, you can opt for a "custom" approach that works with as many splits as you like.
splits = [ ' ', '-', '---']
all_splits = pd.DataFrame({s:df['col2'].str.split(s).values for s in splits})
#Output:
- ---
0 [aa, bb, cc] [aa bb cc] [aa bb cc]
1 [ee-ff-gg] [ee, ff, gg] [ee-ff-gg]
2 [hh, ii, kk] [hh ii kk] [hh ii kk]
3 [ll-mm-nn] [ll, mm, nn] [ll-mm-nn]
First we split df['col2'] on all splits, without expanding. Now, it's just a question of selecting the correct list based on the value of df['col1']
We can use numpy's advanced indexing for this.
temp = all_splits.values[np.arange(len(df)), df['col1']-1]
After this point, the steps are the same as above, starting with creating temp_df.
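Spelled out for completeness (this just reuses the code from earlier in the answer):
temp_df = pd.DataFrame(temp.tolist())                  # expand each selected list into columns
df[[f'col2_{i}' for i in temp_df.columns]] = temp_df   # assign the pieces back to the original frame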
You are pretty close. To generate a column based on some condition, where is often handy; see the code below.
col2_exp1 = df.col2.str.split(' ',expand=True)
col2_exp2 = df.col2.str.split('-',expand=True)
col2_combine = (col2_exp1.where(df.col1.eq(1),col2_exp2)
.rename(columns=lambda x:f'col2{x}'))
Finally,
df.join(col2_combine)
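With the sample frame from the question this should give something like (the same split values as in the np.where answer above, only with different column names):
   col1      col2 col20 col21 col22
0     1  aa bb cc    aa    bb    cc
1     2  ee-ff-gg    ee    ff    gg
2     1  hh ii kk    hh    ii    kk
3     2  ll-mm-nn    ll    mm    nn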
With this df as a base I want the following output:
Everything should be aggregated by column 0, the strings from column 1 should be collected, and the numbers from column 2 should be summed where the strings in column 1 have the same name.
With the following code I could aggregate the strings, but without summing the numbers:
df2= df1.groupby([0]).agg(lambda x: ','.join(set(x))).reset_index()
df2
Avoid an arbitrary number of columns
Your desired output suggests you want an arbitrary number of columns, dependent on the number of values in column 1 for each group in column 0. This is anti-Pandas, which is strongly geared towards an arbitrary number of rows; hence series-wise operations are preferred.
So you can just use groupby + sum to store all the information you require.
df = pd.DataFrame({0: ['2008-04_E.pdf']*3,
1: ['Mat1', 'Mat2', 'Mat2'],
2: [3, 1, 1]})
df_sum = df.groupby([0, 1]).sum().reset_index()
print(df_sum)
0 1 2
0 2008-04_E.pdf Mat1 3
1 2008-04_E.pdf Mat2 2
But if you insist...
If you insist on your unusual requirement, you can achieve it as follows via df_sum calculated as above.
key = df_sum.groupby(0)[1].cumcount().add(1).map('Key{}'.format)
res = df_sum.set_index([0, key]).unstack().reset_index()
res.columns = res.columns.droplevel(0)
print(res)
Key1 Key2 Key1 Key2
0 2008-04_E.pdf Mat1 Mat2 3 2
This seems like a 2-step process. It also requires that each group in column 0 has the same number of unique values in column 1, so the output is rectangular. First, groupby the columns you want grouped:
df_grouped = df.groupby([0,1]).sum().reset_index()
Then reshape to the form you want:
def group_to_row(group):
    group = group.sort_values(1)
    output = []
    for i, row in group[[1, 2]].iterrows():
        output += row.tolist()
    return pd.DataFrame(data=[output])

df_output = df_grouped.groupby(0).apply(group_to_row).reset_index()
This is untested but this is also quite a non-standard form so unfortunately I don't think there's a standard Pandas function for you.
consider the df
idx = list(map('first {}'.format, range(2))) + list(map('last {}'.format, range(3)))
df = pd.DataFrame(np.arange(25).reshape(5, -1), idx, idx)
df
I want to group the dataframe into four quadrants based on the text in the row and column headers. Meaning that the upper left quadrant consists of columns with 'first' and rows with 'first'. The upper right quadrant consists of columns with 'last' and rows with 'first' and so on.
Then within each group, I want to
roll each element one to the right if it can,
otherwise start on the next row at the beginning if it can,
otherwise start at the very beginning.
This should help illustrate; the expected output matches the result shown below.
Use a nested groupby-apply pattern and np.roll: perform a groupby on the columns, followed by a groupby on the index, to get the desired subgroups to roll. Then use np.roll to perform the roll, wrapping the output in a DataFrame since np.roll only returns an array.
def roll_frame(df, shift):
    return pd.DataFrame(np.roll(df, shift), index=df.index, columns=df.columns)

# Groupers for the index and the columns.
idx_groups = df.index.map(lambda x: x.split()[0])
col_groups = df.columns.map(lambda x: x.split()[0])

# Nested groupby, then perform the roll.
df = df.groupby(col_groups, axis=1) \
       .apply(lambda grp: grp.groupby(idx_groups).apply(roll_frame, 1))
Kind of gross, but gets the job done. The order in which you perform the nested groupby doesn't really matter.
The resulting output:
first 0 first 1 last 0 last 1 last 2
first 0 6 0 9 2 3
first 1 1 5 4 7 8
last 0 21 10 24 12 13
last 1 11 15 14 17 18
last 2 16 20 19 22 23
my solution
# stack into a Series indexed by (row label, column label)
sdf = df.stack()
# map each (row, column) pair to its quadrant, e.g. ('first', 'last')
tups = sdf.index.to_series().apply(lambda x: tuple(pd.Series(x).str.split().str[0]))
# roll the values within each quadrant, then pivot back to the original shape
sdf.groupby(tups).apply(lambda x: pd.Series(np.roll(x.values, 1), x.index)).unstack()