Maybe someone knows how to sum rows of a DataFrame by grouping them under a specific condition.
dfa.groupby(['Col1','Col2'])[['Quantity']].sum()
Say we have this df:
Col1 Col2 Quantity
0 1 1 10
1 1 1 10
2 2 1 3
3 1 2 3
4 1 2 3
And I'm trying to get this:
Condition for summing: the Col1 value of one row equals the Col1 value of another row AND
the Col2 value of that row equals the Col2 value of the other row.
Col1 Col2 Quantity
0 1 1 20
2 2 1 3
3 1 2 6
This seems like what you are looking for — a plain groupby over both columns. No row filter is needed, since the condition is simply that both key columns match:
dfa.groupby(['Col1','Col2'], as_index=False)['Quantity'].sum()
I would think that Groupby would do the trick.
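A minimal runnable sketch of that groupby, reconstructing the example frame from the question:

```python
import pandas as pd

# The example data from the question
dfa = pd.DataFrame({'Col1': [1, 1, 2, 1, 1],
                    'Col2': [1, 1, 1, 2, 2],
                    'Quantity': [10, 10, 3, 3, 3]})

# Rows whose (Col1, Col2) pair matches are summed together;
# as_index=False keeps the keys as ordinary columns
out = dfa.groupby(['Col1', 'Col2'], as_index=False)['Quantity'].sum()
print(out)
```

Note that groupby sorts by the key columns by default, so the result comes back ordered by (Col1, Col2).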
It should be straightforward, but I cannot find the right command.
I want to add a new column (Col3) to my NumPy array that counts, for each row, the occurrences of that row's Col2 value. Take this example:
before:
Col1 Col2
1    4
2    4
3    1500
4    60
5    60
6    60
after:
Col1 Col2 Col3
1    4    2
2    4    2
3    1500 1
4    60   3
5    60   3
6    60   3
Any idea?
Using NumPy together with collections.Counter:
Create a frequency dictionary from the values in Col2:
from collections import Counter
freq = Counter(arr[:, 1])
Generate the values of Col3 by iterating over the elements of Col2 (Counter already returns 0 for missing keys, so no membership check is needed):
new_col = np.array([freq[val] for val in arr[:, 1]]).reshape(-1, 1)
Concatenate the new column to the existing array:
new_arr = np.concatenate([arr, new_col], axis=1)
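A pure-NumPy alternative, sketched on the question's example data, uses np.unique with return_counts=True instead of Counter:

```python
import numpy as np

# The example array from the question: columns are Col1, Col2
arr = np.array([[1, 4], [2, 4], [3, 1500], [4, 60], [5, 60], [6, 60]])

# Unique values in Col2 and how often each occurs
vals, counts = np.unique(arr[:, 1], return_counts=True)
freq = dict(zip(vals, counts))

# Build Col3 by looking up each row's Col2 frequency
col3 = np.array([freq[v] for v in arr[:, 1]]).reshape(-1, 1)
new_arr = np.hstack([arr, col3])
```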
I have a pandas DataFrame as follow:
col1 col2 col3
0 1 3 ABCDEFG
1 1 5 HIJKLMNO
2 1 2 PQRSTUV
I want to add another column which should be a substring of col3 from position as indicated in col1 to position as indicated in col2. Something like col3[(col1-1):(col2-1)], which should result in:
col1 col2 col3 new_col
0 1 3 ABCDEFG ABC
1 1 5 HIJKLMNO HIJK
2 1 2 PQRSTUV PQ
I tried with the following:
my_df['new_col'] = my_df.col3.str.slice(my_df['col1']-1, my_df['col2']-1)
and
my_df['new_col'] = data['col3'].str[(my_df['col1']-1):(my_df['col2']-1)]
Both of them result in a column of NaN, while if I insert two numerical values (e.g. data['col3'].str[1:3]) it works fine. I checked and the types are correct (int64, int64, and object). Also, outside this context (e.g. using a for loop) I can get the job done, but I'd prefer a one-liner that exploits the DataFrame. What am I doing wrong?
Use apply, because each row has to be processed separately:
my_df['new_col'] = my_df.apply(lambda x: x['col3'][x['col1']-1:x['col2']], axis=1)
print (my_df)
col1 col2 col3 new_col
0 1 3 ABCDEFG ABC
1 1 5 HIJKLMNO HIJKL
2 1 2 PQRSTUV PQ
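As an alternative to apply, a plain list comprehension over zipped columns does the same row-wise slicing and is usually faster; a minimal sketch on the question's data:

```python
import pandas as pd

my_df = pd.DataFrame({'col1': [1, 1, 1],
                      'col2': [3, 5, 2],
                      'col3': ['ABCDEFG', 'HIJKLMNO', 'PQRSTUV']})

# Slice each string from (col1 - 1) up to col2, row by row
my_df['new_col'] = [s[a - 1:b] for s, a, b in
                    zip(my_df['col3'], my_df['col1'], my_df['col2'])]
```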
Given a simple DataFrame like this in PySpark:
col1 col2 count
A 1 4
A 2 8
A 3 2
B 1 3
C 1 6
I would like to add rows so that each value of col1 appears with each value of col2, with the count column filled with 0 for the combinations missing from the original. It would look like this:
col1 col2 count
A 1 4
A 2 8
A 3 2
B 1 3
B 2 0
B 3 0
C 1 6
C 2 0
C 3 0
Do you have any idea how to do that efficiently?
You're looking for crossJoin.
# this gives you all combinations of col1 and col2
all_combinations = (df.select('col1').distinct()
                      .crossJoin(df.select('col2').distinct()))
# this left-joins the count column back from the original dataset;
# combinations missing from the original come back as null, which
# fillna turns into the 0 you want
result = (all_combinations.join(df, on=['col1', 'col2'], how='left')
                          .fillna(0, subset=['count']))
I have a dataframe :
row1 col1 col2
1 U 1
2 U 1
3 U 1
4 D 1
5 D 1
6 U 1
7 U 1
When I did groupby sum I got :
col1 col2
1 U 5
2 D 2
But what I want is :
col1 col2
1 U 3
2 D 2
3 U 2
Someone answered a similar question, but using Oracle SQL, and I only have pandas and Python available: Group rows Keeping the Order of values (done with SQL).
How can I achieve that output?
Group by consecutive runs: check whether each row differs from the previous one and take a cumulative sum of the changes, i.e.
df = pd.DataFrame({'col1':['U','U','D','U','U'],'col2':[3,1,2,1,1]})
mask = df['col1'].ne(df['col1'].shift()).cumsum()
ndf = df.groupby(mask).agg({'col1':'first','col2':'sum'})
col1 col2
col1
1 U 4
2 D 2
3 U 2