I want to apply an if condition to a pandas column, but nothing gets applied. I checked the column dtypes and it says Float32; am I wrong about the typing, or is something else going on that I don't know about?
I have already tried the approaches from these questions:
How to use numpy.where with logical operators
Numpy "where" with multiple conditions
multiple if else conditions in pandas dataframe and derive multiple columns
I also tried numpy.where, but nothing happens: data['growth'] always ends up as 2.
My code:
data['growth'] = np.where(np.logical_and(data['tqty'] < 0, data['qty_x'] < 0), 1, 2)
I'd prefer not to use if or numpy.where.
Many thanks.
Try using the DataFrame method apply:
Using this sample data:
df = pd.DataFrame(data=[(-1, -1),
                        (-1, 0),
                        (0, -1),
                        (0, 0),
                        (1, 1)],
                  columns=['tqty', 'qtyx'])
| | tqty | qtyx |
|---:|-------:|-------:|
| 0 | -1 | -1 |
| 1 | -1 | 0 |
| 2 | 0 | -1 |
| 3 | 0 | 0 |
| 4 | 1 | 1 |
You can get this using lambda functions:
df['growth'] = df.apply(
    lambda row: 1 if ((row['tqty'] < 0) & (row['qtyx'] < 0)) else 2,
    axis=1)
| | tqty | qtyx | growth |
|---:|-------:|-------:|---------:|
| 0 | -1 | -1 | 1 |
| 1 | -1 | 0 | 2 |
| 2 | 0 | -1 | 2 |
| 3 | 0 | 0 | 2 |
| 4 | 1 | 1 | 2 |
axis=1 makes apply iterate over the rows, so the lambda is evaluated once per row, testing both conditions.
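If you want to avoid apply as well, a vectorized variant is possible; this is just a sketch on the same sample columns, building the combined boolean mask once and mapping it to 1/2:

mask = (df['tqty'] < 0) & (df['qtyx'] < 0)    # True where both values are negative
df['growth'] = mask.map({True: 1, False: 2})  # same result as the apply version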
Python newbie here. I have written code that solves the issue, but there should be a much better way of doing it.
I have two Series that come from the same table, but due to some earlier processing I get them as separate objects. (They could be joined into a single dataframe again, since the entries belong to the same records.)
Ser1
| id |
|----|
| 1  |
| 2  |
| 2  |
| 3  |

Ser2
| section |
|---------|
| A       |
| B       |
| C       |
| D       |
df2
| id | section |
| ---|---------|
| 1 | A |
| 2 | B |
| 2 | Z |
| 2 | Y |
| 4 | X |
First, I would like to find the entries in Ser1 whose id also appears in df2. Then, check whether the corresponding values in Ser2 can NOT be found in the section column of df2.
My expected results:
| id | section | result |
| ---|-------- |---------|
| 1 | A | False | # Both id(1) and section(A) are also in df2
| 2 | B | False | # Both id(2) and section(B) are also in df2
| 2 | C | True | # id(2) is in df2 but section(C) is not
| 3 | D | False | # id(3) is not in df2, in that case the result should also be False
My code:
for k, v in Ser2.items():
    rslt_df = df2[df2['id'] == Ser1[k]]
    if rslt_df.empty:
        print(False)
    elif v not in rslt_df['section'].tolist():
        print(True)
    else:
        print(False)
I know the code is not very good, but after reading about merging and list comprehensions I am getting confused about the best way to improve it.
You can concat the series and compute the "result" with boolean arithmetic (XOR):
out = (
    pd.concat([ser1, ser2], axis=1)
      .assign(result=ser1.isin(df2['id']) != ser2.isin(df2['section']))
)
Output:
id section result
0 1 A False
1 2 B False
2 2 C True
3 3 D False
Intermediates:
m1 = ser1.isin(df2['id'])
m2 = ser2.isin(df2['section'])
      m1     m2  m1!=m2
0   True   True   False
1   True   True   False
2   True  False    True
3  False  False   False
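For reference, a minimal reconstruction of the example objects (names assumed from the question) so the snippet above can be run end to end:

import pandas as pd

# Assumed reconstruction of the question's data
ser1 = pd.Series([1, 2, 2, 3], name='id')
ser2 = pd.Series(['A', 'B', 'C', 'D'], name='section')
df2 = pd.DataFrame({'id': [1, 2, 2, 2, 4],
                    'section': ['A', 'B', 'Z', 'Y', 'X']})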
As the title states, I want to pivot/crosstab my dataframe.
Let's say I have a df that looks like this:
df = pd.DataFrame({'ID'    : [0, 0, 1, 1, 1],
                   'REV'   : [0, 0, 1, 1, 1],
                   'GROUP' : [1, 2, 1, 2, 3],
                   'APPR'  : [True, True, None, None, True]})
+----+-----+-------+------+
| ID | REV | GROUP | APPR |
+----+-----+-------+------+
| 0 | 0 | 1 | True |
| 0 | 0 | 2 | True |
| 1 | 1 | 1 | NULL |
| 1 | 1 | 2 | NULL |
| 1 | 1 | 3 | True |
+----+-----+-------+------+
I want to do some kind of pivot so my result of the table looks like
+----+-----+------+------+-------+
| ID | REV | 1 | 2 | 3 |
+----+-----+------+------+-------+
| 0 | 0 | True | True | False |
| 1 | 1 | NULL | NULL | True |
+----+-----+------+------+-------+
The values from the GROUP column become their own columns. The value in each of those columns is True/NULL taken from APPR; I want it to be False when the group didn't exist for that ID/REV combination.
This is similar to a question I've asked before, but I wasn't sure how to make that answer work with my new scenario:
Pandas pivot dataframe and setting the new columns as True/False based on if they existed or not
Hope that makes sense!
Have you tried to pivot?
pd.pivot(df, index=['ID','REV'], columns=['GROUP'], values='APPR').fillna(False).reset_index()
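As a quick sketch against the sample df above (with None standing in for NULL), note that fillna(False) replaces every NaN produced by the pivot, including the ones that came from APPR itself, so drop it (or fill only the genuinely missing ID/REV/GROUP combinations) if those NULLs should survive:

out = pd.pivot(df, index=['ID', 'REV'], columns='GROUP', values='APPR').reset_index()
# Group 3 is NaN for (ID, REV) = (0, 0) because that combination never existed;
# groups 1 and 2 are NaN for (1, 1) because APPR itself was NULL there.
# A plain .fillna(False) would turn both kinds of NaN into False.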
I have the following dataframe
+-------+------------+
| index | keep       |
+-------+------------+
| 0     | not useful |
| 1     | start_1    |
| 2     | useful     |
| 3     | end_1      |
| 4     | not useful |
| 5     | start_2    |
| 6     | useful     |
| 7     | useful     |
| 8     | end_2      |
+-------+------------+
There are two pairs of marker strings (start_1/end_1 and start_2/end_2) that indicate that the rows between them are the only relevant ones in the data. Hence, for the dataframe below, the output would be composed only of the rows at index 2, 6, and 7 (since 2 is between start_1 and end_1, and 6 and 7 are between start_2 and end_2).
d = {'keep': ["not useful", "start_1", "useful", "end_1", "not useful", "start_2", "useful", "useful", "end_2"]}
df = pd.DataFrame(data=d)
What is the most Pythonic/Pandas approach to this problem?
Thanks
Here's one way to do that (in a couple of steps, for clarity). There might be others:
df["sections"] = 0
df.loc[df.keep.str.startswith("start"), "sections"] = 1
df.loc[df.keep.str.startswith("end"), "sections"] = -1
df["in_section"] = df.sections.cumsum()
res = df[(df.in_section == 1) & ~df.keep.str.startswith("start")]
Output:
     keep  sections  in_section
2  useful         0           1
6  useful         0           1
7  useful         0           1
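An equivalent variant of the same cumsum idea without adding helper columns to df (just a sketch):

# Running count of 'start_*' markers seen so far minus 'end_*' markers seen so far
inside = df['keep'].str.startswith('start').cumsum() - df['keep'].str.startswith('end').cumsum()
# Keep rows that fall inside a section but are not the 'start_*' marker itself
res = df[(inside == 1) & ~df['keep'].str.startswith('start')]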
My goal is to replace all negative elements in a column of a PySpark.DataFrame with zero.
input data
+------+
| col1 |
+------+
| -2 |
| 1 |
| 3 |
| 0 |
| 2 |
| -7 |
| -14 |
| 3 |
+------+
desired output data
+------+
| col1 |
+------+
| 0 |
| 1 |
| 3 |
| 0 |
| 2 |
| 0 |
| 0 |
| 3 |
+------+
Basically I can do this as below:
df = df.withColumn('col1', F.when(F.col('col1') < 0, 0).otherwise(F.col('col1')))
or udf can be defined as
import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType

smooth = F.udf(lambda x: x if x > 0 else 0, IntegerType())
df = df.withColumn('col1', smooth(F.col('col1')))
or
df = df.withColumn('col1', (F.col('col1') + F.abs('col1')) / 2)
or
df = df.withColumn('col1', F.greatest(F.col('col1'), F.lit(0)))
My question is: which is the most efficient way of doing this? A udf has optimization issues, so it is surely not the right approach. But I don't know how to compare the other cases. One answer is obviously to run experiments and compare mean running times, but I want to compare these approaches (and new approaches) theoretically.
Thanks in advance...
You can simply build the column with a built-in expression that says: if x > 0 then x else 0. This would be the best approach.
The question has already been addressed, theoretically: Spark functions vs UDF performance?
import pyspark.sql.functions as F
df = df.withColumn("only_positive", F.when(F.col("col1") > 0, F.col("col1")).otherwise(0))
You can overwrite col1 in the original dataframe by passing 'col1' as the column name to withColumn().
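If you want to compare the approaches without timing them, one option (a sketch, assuming the df and smooth udf from above) is to inspect the physical plans: the built-in expressions (when, greatest, the abs trick) compile to native Catalyst expressions, while the udf shows up as a separate Python evaluation step with serialization overhead.

import pyspark.sql.functions as F

# Built-in expression: stays inside Catalyst/Tungsten
df.withColumn('col1', F.greatest(F.col('col1'), F.lit(0))).explain()

# UDF: the plan contains a BatchEvalPython / Python UDF node
df.withColumn('col1', smooth(F.col('col1'))).explain()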
So I have a dataframe with some values. This is my dataframe:
|in|x|y|z|
+--+-+-+-+
| 1|a|a|b|
| 2|a|b|b|
| 3|a|b|c|
| 4|b|b|c|
I would like to get the number of unique values in each row, and the number of values that are not equal to the value in column x. The result should look like this:
| in | x | y | z   | count of not x | unique |
|----|---|---|-----|----------------|--------|
| 1  | a | a | b   | 1              | 2      |
| 2  | a | b | b   | 2              | 2      |
| 3  | a | b | c   | 2              | 3      |
| 4  | b | b | nan | 0              | 1      |
I could come up with some dirty solutions here, but there must be a more elegant way of doing this. My mind keeps circling around drop_duplicates (which does not work on a Series), converting to an array and using .unique(), df.iterrows() which I want to avoid, and .apply on each row.
Here are solutions using apply.
df['count of not x'] = df.apply(lambda x: (x[['y','z']] != x['x']).sum(), axis=1)
df['unique'] = df.apply(lambda x: x[['x','y','z']].nunique(), axis=1)
A non-apply solution for getting count of not x:
df['count of not x'] = (~df[['y','z']].isin(df['x'])).sum(axis=1)
Can't think of anything great for unique. This uses apply, but may be faster, depending on the shape of the data.
df['unique'] = df[['x','y','z']].T.apply(lambda x: x.nunique())
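If avoiding apply entirely matters, both columns can also be computed with vectorized DataFrame methods (a sketch, not taken from the answers above):

# Compare y and z against x row-wise (axis=0 aligns the Series on the index)
df['count of not x'] = df[['y', 'z']].ne(df['x'], axis=0).sum(axis=1)

# Number of distinct values per row across x, y, z (NaN is ignored by nunique)
df['unique'] = df[['x', 'y', 'z']].nunique(axis=1)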