I have the following pandas data frame.
ID col1 col2 value
1 4 New 20
2 4 OLD 30
3 5 OLD 60
4 5 New 50
5 3 New 70
I would like to select only the rows that satisfy the following rules: in col1, values 4 and 3 should have 'New' in col2, and value 5 should have 'OLD' in col2. Drop the other rows otherwise. Expected output:
ID col1 col2 value
1 4 New 20
3 5 OLD 60
5 3 New 70
Can anyone help with this in Python pandas?
Use DataFrame.query with in for the membership test, chaining the first condition with & for bitwise AND and the second condition with | for bitwise OR:
df1 = df.query("(col1 in [4,3] & col2 == 'New') | (col1 == 5 & col2 == 'OLD')")
print (df1)
   ID  col1 col2  value
0   1     4  New     20
2   3     5  OLD     60
4   5     3  New     70
Or use boolean indexing with Series.isin:
df1 = df[(df['col1'].isin([3, 4]) & df['col2'].eq('New')) |
         (df['col1'].eq(5) & df['col2'].eq('OLD'))]
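For reference, a minimal reproducible sketch (the frame is reconstructed from the sample data above):
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
                   'col1': [4, 4, 5, 5, 3],
                   'col2': ['New', 'OLD', 'OLD', 'New', 'New'],
                   'value': [20, 30, 60, 50, 70]})
mask = ((df['col1'].isin([3, 4]) & df['col2'].eq('New')) |
        (df['col1'].eq(5) & df['col2'].eq('OLD')))
print(df[mask])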
It should be straightforward, but I cannot find the right command.
I want to add a new column (Col3) to my NumPy array that counts, for each row, the occurrences of that row's Col2 value. Take this example:
before:
Col1  Col2
1     4
2     4
3     1500
4     60
5     60
6     60
after:
Col1  Col2  Col3
1     4     2
2     4     2
3     1500  1
4     60    3
5     60    3
6     60    3
any idea?
Using numpy:
Create a frequency dictionary based on the values in Col2:
import numpy as np
from collections import Counter

freq = Counter(arr[:, 1])
Generate the values of Col3 by looking up each element of Col2 (a Counter returns 0 for missing keys, so no explicit fallback is needed):
new_col = np.array([freq[val] for val in arr[:, 1]]).reshape(-1, 1)
Concatenate the new column to the existing array:
new_arr = np.concatenate([arr, new_col], axis=1)
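Alternatively, to stay entirely in NumPy, np.unique with return_counts=True yields the same frequencies; a minimal sketch assuming the same two-column integer array arr:
import numpy as np

arr = np.array([[1, 4], [2, 4], [3, 1500], [4, 60], [5, 60], [6, 60]])
# unique Col2 values, the inverse index mapping each row to its unique
# value, and the count of each unique value
uniq, inv, counts = np.unique(arr[:, 1], return_inverse=True, return_counts=True)
new_col = counts[inv].reshape(-1, 1)             # per-row occurrence count
new_arr = np.concatenate([arr, new_col], axis=1)
print(new_arr)   # last column: 2, 2, 1, 3, 3, 3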
I have a dataframe in pandas and I would like to subtract one column (let's say col1) from col2 and col3 (or from more columns if there are more) without writing the assign statement below for each column.
df = pd.DataFrame({'col1':[1,2,3,4], 'col2':[2,5,6,8], 'col3':[5,5,5,9]})
df = (df
...
.assign(col2 = lambda x: x.col2 - x.col1)
)
How can I do this? Or would it work with apply? How would you be able to do this with method chaining?
Edit (using **kwargs with method chaining):
As in your comment, if you want to chain methods on the intermediate (partially computed) dataframe, you need to define a dictionary of per-column functions to use with assign, as follows (you can't use a lambda to construct the dictionary directly inside assign).
In this example, 5 is added to the dataframe before chaining assign, to show that it works mid-chain as you want:
d = {cl: lambda x, cl=cl: x[cl] - x['col1'] for cl in ['col2','col3']}
df_final = df.add(5).assign(**d)
In [63]: df
Out[63]:
col1 col2 col3
0 1 2 5
1 2 5 5
2 3 6 5
3 4 8 9
In [64]: df_final
Out[64]:
col1 col2 col3
0 6 1 4
1 7 3 3
2 8 3 2
3 9 4 5
Note: df_final.col1 is different from df.col1 because of the add operation before assign. Don't forget cl=cl in the lambdas of the dictionary; it is there to avoid Python's late-binding issue with closures.
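To see why cl=cl matters, here is a small standalone illustration of Python's late binding in closures:
# without a default argument, every closure sees the final loop value
fns = [lambda: i for i in range(3)]
print([f() for f in fns])   # [2, 2, 2]

# binding i as a default argument captures its value at definition time
fns = [lambda i=i: i for i in range(3)]
print([f() for f in fns])   # [0, 1, 2]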
Use df.sub
df_sub = df.assign(**df[['col2','col3']].sub(df.col1, axis=0).add_prefix('sub_'))
Out[22]:
col1 col2 col3 sub_col2 sub_col3
0 1 2 5 1 4
1 2 5 5 3 3
2 3 6 5 3 2
3 4 8 9 4 5
If you want to assign the values back to col2 and col3, use an additional DataFrame.update:
df.update(df[['col2','col3']].sub(df.col1, axis=0))
print(df)
Output:
col1 col2 col3
0 1 1 4
1 2 3 3
2 3 3 2
3 4 4 5
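If you want to overwrite col2 and col3 in one chainable expression instead of adding sub_ columns, the two ideas combine; a sketch (note it references df twice, so for a truly intermediate frame the dictionary-of-lambdas approach above is still needed):
df_final = df.assign(**df[['col2', 'col3']].sub(df['col1'], axis=0))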
I have a dataframe with columns like this -
Name Id 2019col1 2019col2 2019col3 2020col1 2020col2 2020col3 2021col1 2021col2 2021col3
That is, the columns are repeated for each year.
I want to take the year out and make it a column, so that the final dataframe looks like -
Name Id Year col1 col2 col3
Is there a way in pandas to achieve something like this?
Use wide_to_long, but first move the year to the end of each column name (e.g. 2019col1 becomes col12019) with a list comprehension:
print (df)
Name Id 2019col1 2019col2 2019col3 2020col1 2020col2 2020col3 \
0 a 456 4 5 6 2 3 4
2021col1 2021col2 2021col3
0 5 2 1
df.columns = [x[4:] + x[:4] if x[:4].isnumeric() else x for x in df.columns]
df = (pd.wide_to_long(df.reset_index(),
                      ['col1', 'col2', 'col3'],
                      i='index',
                      j='Year')
        .reset_index(level=0, drop=True)
        .reset_index())
print (df)
Year Id Name col1 col2 col3
0 2019 456 a 4 5 6
1 2020 456 a 2 3 4
2 2021 456 a 5 2 1
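An alternative sketch: split the year out of the original (year-first, before the rename above) column names into a MultiIndex, then stack the year level:
import pandas as pd

df2 = df.set_index(['Name', 'Id'])
df2.columns = pd.MultiIndex.from_tuples(
    [(int(c[:4]), c[4:]) for c in df2.columns], names=['Year', None])
df_long = df2.stack('Year').reset_index()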
From a simple dataframe like that in PySpark :
col1 col2 count
A 1 4
A 2 8
A 3 2
B 1 3
C 1 6
I would like to duplicate the rows so that each value of col1 appears with each value of col2, with the count column filled with 0 for the combinations that don't exist in the original data. It would be like this:
col1 col2 count
A 1 4
A 2 8
A 3 2
B 1 3
B 2 0
B 3 0
C 1 6
C 2 0
C 3 0
Do you have any idea how to do that efficiently?
You're looking for crossJoin.
from pyspark.sql.functions import col

# distinct col1 values crossed with distinct col2 values: every combination
all_combinations = df.select('col1').distinct().crossJoin(df.select('col2').distinct())
# left-join the original counts back; missing combinations are null, filled with 0
result = (all_combinations.alias('a')
          .join(df.alias('b'),
                on=(col('a.col1') == col('b.col1')) & (col('a.col2') == col('b.col2')),
                how='left')
          .select('a.*', 'b.count').fillna(0, subset=['count']))
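For reference, a minimal setup to try the sketch above (assuming a local SparkSession; the frame is reconstructed from the sample data):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [('A', 1, 4), ('A', 2, 8), ('A', 3, 2), ('B', 1, 3), ('C', 1, 6)],
    ['col1', 'col2', 'count'])
# after running the snippet above:
# result.orderBy('col1', 'col2').show()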
Maybe someone knows how to sum rows of a dataframe by grouping with a specific condition.
dfa.groupby(['Col1','Col2'])[['Quantity']].sum()
Say we have this df:
Col1 Col2 Quantity
0 1 1 10
1 1 1 10
2 2 1 3
3 1 2 3
4 1 2 3
And I'm trying to get this:
Condition to sum:
the Col1 element of one row is equal to the Col1 element of another row, AND
the Col2 element of that row is equal to the Col2 element of the other row.
Col1 Col2 Quantity
0 1 1 20
2 2 1 3
3 1 2 6
This seems like what you are looking for: the condition you describe is exactly a plain groupby on both columns, with as_index=False to keep Col1 and Col2 as columns:
dfa.groupby(['Col1', 'Col2'], as_index=False)[['Quantity']].sum()
I would think that Groupby would do the trick.
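A quick sanity check against the sample frame (a sketch; note the result is ordered by the group keys):
import pandas as pd

dfa = pd.DataFrame({'Col1': [1, 1, 2, 1, 1],
                    'Col2': [1, 1, 1, 2, 2],
                    'Quantity': [10, 10, 3, 3, 3]})
print(dfa.groupby(['Col1', 'Col2'], as_index=False)[['Quantity']].sum())
#    Col1  Col2  Quantity
# 0     1     1        20
# 1     1     2         6
# 2     2     1         3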