It should be straightforward, but I cannot find the right command. I want to add a new column (Col3) to my NumPy array that counts, for each row, the occurrences of that row's Col2 value. Take this example:
before:
Col1  Col2
1     4
2     4
3     1500
4     60
5     60
6     60
after:
Col1  Col2  Col3
1     4     2
2     4     2
3     1500  1
4     60    3
5     60    3
6     60    3
Any idea?
Using numpy:
Create a frequency dictionary based on the values in Col2:
import numpy as np
from collections import Counter
freq = Counter(arr[:, 1])
Generate the values of Col3 by iterating over the elements of Col2 (a Counter returns 0 for missing keys, so no explicit fallback is needed):
new_col = np.array([freq[val] for val in arr[:, 1]]).reshape(-1, 1)
Concatenate the new column to the existing array:
new_arr = np.concatenate([arr, new_col], axis=1)
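Alternatively, the same column can be built with NumPy alone via np.unique; a minimal sketch, assuming arr is a 2-D integer array shaped like the example above:

import numpy as np

arr = np.array([[1, 4], [2, 4], [3, 1500], [4, 60], [5, 60], [6, 60]])

# counts holds the frequency of each unique Col2 value; inverse maps
# every row back to its position in the unique-value array
_, inverse, counts = np.unique(arr[:, 1], return_inverse=True, return_counts=True)
new_arr = np.column_stack([arr, counts[inverse]])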
Related
In a data frame, match two columns and, if any value from the second column is also present in the first column, remove that value from the second column.
col1 col2
1
2    1
3    9
4
5    1
6    2
Output:
col1 col2
1
2
3    9
4
5
6
Here, 1 and 2 from col2 are also present in col1, so these repeated values should be removed.
Using Series.mask to match and replace values, we can do something along the lines of:
df['col2'] = df['col2'].mask(pd.to_numeric(df['col2'], errors='coerce').isin(df['col1']), "")
  col1 col2
0    1
1    2
2    3  9.0
3    4
4    5
5    6
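For reference, one possible reconstruction of the input frame this assumes (col2 held as strings with blanks, which is why to_numeric with errors='coerce' is used above; the exact dtypes in the original question are not shown):

import pandas as pd

# hypothetical reconstruction of the question's data
df = pd.DataFrame({'col1': [1, 2, 3, 4, 5, 6],
                   'col2': ['', '1', '9', '', '1', '2']})
df['col2'] = df['col2'].mask(
    pd.to_numeric(df['col2'], errors='coerce').isin(df['col1']), '')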
import pandas as pd

col1 = [1, 2, 3, 4, 5, 6]
col2 = [None, 1, 9, None, 1, 2]
df = pd.DataFrame({'col1': col1, 'col2': col2})

# iterate over columns
for col in df.columns:
    # remove values that already appear in previous columns
    for prev_col in df.columns[:df.columns.get_loc(col)]:
        df[col] = df[col].where(~df[col].isin(df[prev_col]), None)

# OUTPUT
#    col1  col2
# 0     1   NaN
# 1     2   NaN
# 2     3   9.0
# 3     4   NaN
# 4     5   NaN
# 5     6   NaN
I have the following pandas data frame.
ID col1 col2 value
1 4 New 20
2 4 OLD 30
3 5 OLD 60
4 5 New 50
5 3 New 70
I would like to select only the rows that satisfy the following rules: in col1, values 4 and 3 should go with 'New' in col2, and value 5 should go with 'OLD'. Drop the other rows otherwise.
ID col1 col2 value
1  4    New  20
3  5    OLD  60
5  3    New  70
Can anyone help with this in Python pandas?
Use DataFrame.query, filtering with in, chaining the first pair of conditions with & for bitwise AND and the second condition with | for bitwise OR:
df1 = df.query("(col1 in [4,3] & col2 == 'New') | (col1 == 5 & col2 == 'OLD')")
print(df1)
   ID  col1 col2  value
0   1     4  New     20
2   3     5  OLD     60
4   5     3  New     70
Or use boolean indexing with Series.isin:
df1 = df[(df['col1'].isin([3,4]) & df['col2'].eq('New')) |
         (df['col1'].eq(5) & df['col2'].eq('OLD'))]
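If the rules grow, a dictionary keyed by col1 value keeps them in one place. A minimal sketch; the rules mapping is a name introduced here for illustration:

import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
                   'col1': [4, 4, 5, 5, 3],
                   'col2': ['New', 'OLD', 'OLD', 'New', 'New'],
                   'value': [20, 30, 60, 50, 70]})

# required col2 label per col1 value (hypothetical helper mapping)
rules = {4: 'New', 3: 'New', 5: 'OLD'}
df1 = df[df['col2'] == df['col1'].map(rules)]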
I have a dataframe in pandas and I would like to subtract one column (let's say col1) from col2 and col3 (or from more columns, if there are more) without writing the assign statement below for each column.
df = pd.DataFrame({'col1':[1,2,3,4], 'col2':[2,5,6,8], 'col3':[5,5,5,9]})
df = (df
      ...
      .assign(col2 = lambda x: x.col2 - x.col1)
     )
How can I do this? Or would it work with apply? How would you be able to do this with method chaining?
Edit: (using **kwargs with method chaining)
As in your comment, if you want to chain methods on the intermediate (on-the-fly calculated) dataframe, you need to define a custom dictionary of per-column calculations to use with assign, as follows (you can't use a lambda to construct the dictionary directly inside assign).
In this example I add 5 to the dataframe before chaining assign, to show how it works in chained processing as you want:
d = {cl: lambda x, cl=cl: x[cl] - x['col1'] for cl in ['col2','col3']}
df_final = df.add(5).assign(**d)
In [63]: df
Out[63]:
   col1  col2  col3
0     1     2     5
1     2     5     5
2     3     6     5
3     4     8     9

In [64]: df_final
Out[64]:
   col1  col2  col3
0     6     1     4
1     7     3     3
2     8     3     2
3     9     4     5
Note: df_final.col1 is different from df.col1 because of the add operation before assign. Don't forget cl=cl in the dictionary's lambdas; it is there to avoid Python's late-binding issue.
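To see why cl=cl matters, here is a minimal sketch of what goes wrong without it: every lambda closes over the loop variable itself, so they all read its final value when called:

# WRONG: cl is looked up when the lambda runs, after the comprehension
# has finished, so both entries compute x['col3'] - x['col1']
bad = {cl: lambda x: x[cl] - x['col1'] for cl in ['col2', 'col3']}

# RIGHT: cl=cl freezes the current value as a default argument
good = {cl: lambda x, cl=cl: x[cl] - x['col1'] for cl in ['col2', 'col3']}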
Use df.sub:
df_sub = df.assign(**df[['col2','col3']].sub(df.col1, axis=0).add_prefix('sub_'))
Out[22]:
   col1  col2  col3  sub_col2  sub_col3
0     1     2     5         1         4
1     2     5     5         3         3
2     3     6     5         3         2
3     4     8     9         4         5
If you want to assign the values back to col2 and col3, additionally use DataFrame.update:
df.update(df[['col2','col3']].sub(df.col1, axis=0))
print(df)
Output:
   col1  col2  col3
0     1     1     4
1     2     3     3
2     3     3     2
3     4     4     5
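If chaining is not required, a plain one-step sketch that writes the subtraction back directly (equivalent to the update above for these columns):

cols = ['col2', 'col3']
# subtract col1 row-wise from every listed column and assign back
df[cols] = df[cols].sub(df['col1'], axis=0)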
From a simple dataframe like this in PySpark:
col1 col2 count
A 1 4
A 2 8
A 3 2
B 1 3
C 1 6
I would like to duplicate the rows so that each value of col1 appears with each value of col2, with the count column filled with 0 for the combinations that don't exist in the original. It would look like this:
col1 col2 count
A 1 4
A 2 8
A 3 2
B 1 3
B 2 0
B 3 0
C 1 6
C 2 0
C 3 0
Do you have any idea how to do that efficiently?
You're looking for crossJoin.
from pyspark.sql.functions import col

# this one gives you all combinations of col1 x col2 (distinct, so duplicate source rows don't multiply the combinations)
all_combinations = df.select('col1').distinct().crossJoin(df.select('col2').distinct())

# this one appends the count column from the original dataset, with null for all other records
result = (all_combinations.alias('a')
          .join(df.alias('b'),
                on=(col('a.col1') == col('b.col1')) & (col('a.col2') == col('b.col2')),
                how='left')
          .select('a.col1', 'a.col2', 'b.count'))
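Since the desired output shows 0 rather than null for the missing combinations, a final fillna on the joined result handles that:

# replace the nulls produced by the left join with 0
result = result.fillna(0, subset=['count'])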
Maybe someone knows how to add two rows of a data frame by grouping with a specific condition.
dfa.groupby(['Col1','Col2'])[['Quantity']].sum()
Say we have this df:
Col1 Col2 Quantity
0 1 1 10
1 1 1 10
2 2 1 3
3 1 2 3
4 1 2 3
And I'm trying to get this:
Condition to sum:
the Col1 element of one row is equal to the Col1 element of another row, AND
the Col2 element of that row is equal to the Col2 element of the other row
Col1 Col2 Quantity
0 1 1 20
2 2 1 3
3 1 2 6
This seems like what you are looking for: group by both key columns and sum, keeping the keys as regular columns:
dfa.groupby(['Col1','Col2'], as_index=False, sort=False)['Quantity'].sum()
I would think that Groupby would do the trick.
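As a quick check, a minimal sketch against the sample frame (sort=False keeps the groups in input order):

import pandas as pd

dfa = pd.DataFrame({'Col1': [1, 1, 2, 1, 1],
                    'Col2': [1, 1, 1, 2, 2],
                    'Quantity': [10, 10, 3, 3, 3]})

print(dfa.groupby(['Col1', 'Col2'], as_index=False, sort=False)['Quantity'].sum())
#    Col1  Col2  Quantity
# 0     1     1        20
# 1     2     1         3
# 2     1     2         6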