I have a dataframe df1 in which column 1 (col1) contains customer ids. col2 contains sales figures, and some of the values are missing.
My problem is that I want to drop duplicate customer ids in col1, but only where the sales value is missing.
I tried writing a function:
def drop(i):
    if i[col2] == np.nan:
        i.drop_duplicates(subset = 'col1')
    else:
        return i['col1']
I am getting an error saying "The truth value of a Series is ambiguous".
Thank you for reading. I would appreciate a solution.
The following should work, using groupby, apply, dropna, and reset_index, assuming your data is something like this:
input:
col1 col2
0 1001 2.0
1 1001 NaN
2 1002 4.0
3 1002 NaN
code:
import pandas as pd
import numpy as np
#Dummy data
data = {
'col1':[1001,1001,1002,1002],
'col2':[2,np.nan,4,np.nan],
}
df = pd.DataFrame(data)
#Solution
df.groupby('col1').apply(lambda group: group.dropna(subset=['col2'])).reset_index(drop=True)
output:
col1 col2
0 1001 2.0
1 1002 4.0
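An alternative that stays closer to the original ask (drop a row only when its sales value is missing AND its customer id appears in more than one row) is a boolean mask instead of groupby. This is a minimal sketch, assuming the same col1/col2 names; the extra 1003 row is made up here only to show the difference, since a customer whose single row has missing sales is kept:
import pandas as pd
import numpy as np

df = pd.DataFrame({'col1': [1001, 1001, 1002, 1002, 1003],
                   'col2': [2, np.nan, 4, np.nan, np.nan]})

# Drop a row only if col2 is NaN and col1 occurs in more than one row.
mask = df['col2'].isna() & df.duplicated(subset='col1', keep=False)
print(df[~mask].reset_index(drop=True))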
I download data using URL calls. The dataframe's columns are not static. For example, one URL call can return a dataframe with x columns while another returns y columns, etc.
The column that is always included in the dataframe is the id column. The potential column names (besides id) are: col1, col2, col3, col4, col5, col6.
I would like to select only the rows of the dataframe that are not NaN in all of those columns. It can also happen that the dataframe contains only the id column, in which case there is no need to select any rows.
Let's say that one URL call gives the following dataframe using this hypothetical code:
data = {'id': [1000,2000,3000,4000],
'col1': [np.nan,25000,np.nan,np.nan],
'col2': [np.nan,27000,np.nan,30000],
'col3': [28000,np.nan,np.nan,25000]
}
dfexp = pd.DataFrame(data, columns = ['id', 'col1', 'col2', 'col3'])
id col1 col2 col3
0 1000 NaN NaN 28000.0
1 2000 25000.0 27000.0 NaN
2 3000 NaN NaN NaN
3 4000 NaN 30000.0 25000.0
For example from the above dataframe I would like to select only the rows 0, 1 and 3.
A second URL call can give the following dataframe, using this hypothetical code:
data2 = {'id': [1500,2500,3500,4500],
'col1': [np.nan,25000,np.nan,np.nan],
'col4': [np.nan,np.nan,np.nan,np.nan],
'col5': [np.nan,np.nan,np.nan,np.nan],
'col6': [np.nan,np.nan,np.nan,np.nan]
}
dfexp2 = pd.DataFrame(data2, columns = ['id', 'col1', 'col4', 'col5', 'col6'])
id col1 col4 col5 col6
0 1500 NaN NaN NaN NaN
1 2500 25000.0 NaN NaN NaN
2 3500 NaN NaN NaN NaN
3 4500 NaN NaN NaN NaN
From this second dataframe I want only to select row 1.
In general, I would like to select only the rows that have at least one non-NaN element (besides id). I am a beginner and the dynamic set of columns is tricky for me. Do you have any thoughts?
Thank you in advance!
Use:
df.set_index('id').dropna(how='all').reset_index()
Explanation
As you are a beginner, let me explain it a bit.
This will (Step 1) temporarily set column id as the index, then (Step 2) drop all rows whose remaining columns are ALL NaN (the original id column is not checked, because it is now the index and dropna() does not look at the index). We need the parameter how='all' (thanks to anon01 for the reminder) because the default how='any' would drop a row whenever any single column contains NaN. Finally, (Step 3) reset_index() restores id by moving it back from the index into a regular column.
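If you prefer not to touch the index, an equivalent sketch (assuming id is the only column to exclude from the check) passes every other column to dropna's subset parameter; when the frame comes back with only the id column, the column list is empty and you can simply skip the call:
import pandas as pd
import numpy as np

dfexp = pd.DataFrame({'id': [1000, 2000, 3000, 4000],
                      'col1': [np.nan, 25000, np.nan, np.nan],
                      'col2': [np.nan, 27000, np.nan, 30000],
                      'col3': [28000, np.nan, np.nan, 25000]})

# Every column except 'id'; works no matter which of col1..col6 came back.
value_cols = dfexp.columns.difference(['id'])
if len(value_cols) > 0:
    dfexp = dfexp.dropna(subset=value_cols, how='all')
print(dfexp)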
I have to slice my DataFrame according to values (imported from a txt file) that occur in one of my DataFrame's columns. This is what I have:
>df
col1 col2
a 1
b 2
c 3
d 4
>'mytxt.txt'
2
3
This is what I need: drop rows whenever the value in col2 is not among the values in mytxt.txt.
Expected result must be:
>df
col1 col2
b 2
c 3
I tried:
values = pd.read_csv('mytxt.txt', header=None)
df = df.col2.isin(values)
But it doesn't work. Help would be very much appreciated, thanks!
When you read values, I would do it as a Series, and then convert it to a set, which will be more efficient for lookups:
values = pd.read_csv('mytxt.txt', header=None, squeeze=True)
values = set(values.tolist())
Then slicing will work:
>>> df[df.col2.isin(values)]
col1 col2
1 b 2
2 c 3
What was happening is you were reading values in as a DataFrame rather than a Series, so the .isin method was not behaving as you expected.
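Note that in recent pandas versions (2.0 and later) the squeeze keyword of read_csv has been removed, so an equivalent sketch is to select the single column explicitly; the mytxt.txt path is the one from the question:
import pandas as pd

df = pd.DataFrame({'col1': ['a', 'b', 'c', 'd'], 'col2': [1, 2, 3, 4]})

# Read the file as a one-column DataFrame and take column 0 as a Series.
values = set(pd.read_csv('mytxt.txt', header=None)[0])

print(df[df['col2'].isin(values)])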
I imported an Excel file and now I need to multiply certain values, but if the value in the first column is NaN, Python should take another column for the calculation. I have the following code:
if pd['Column1'] == 'NaN':
    pd['Column2'] * pd['Column3']
else:
    pd['Column1'] * pd['Column3']
Thank you for your help.
You can use isna() together with any() or all(). Here is an example:
import pandas as pd
import numpy as np
#generating test data assuming all the values in Col1 are 'NaN'
df = pd.DataFrame({'Col1':[np.nan,np.nan,np.nan,np.nan], 'Col2':[1,2,3,4], 'Col3':[2,3,4,5]})
if df['Col1'].isna().all(): # you can also use 'any()' instead of all()
    df['Col4'] = df['Col2']*df['Col3']
else:
    df['Col4'] = df['Col1']*df['Col3']
print(df)
Output:
Col1 Col2 Col3 Col4
0 NaN 1 2 2
1 NaN 2 3 6
2 NaN 3 4 12
3 NaN 4 5 20
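If the intention is actually a row-by-row choice (take Column1 when it is present in that row, otherwise Column2), a vectorized sketch, assuming the column names from the question, would be:
import pandas as pd
import numpy as np

df = pd.DataFrame({'Column1': [10, np.nan, 30, np.nan],
                   'Column2': [1, 2, 3, 4],
                   'Column3': [2, 3, 4, 5]})

# Per row: use Column1 if it is not NaN, otherwise fall back to Column2,
# then multiply by Column3.
df['Result'] = df['Column1'].fillna(df['Column2']) * df['Column3']
print(df)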
I have a dataframe with a column that contains strings. I want to know if it is possible to create a new column based on it.
This is an example of the column:
col1
016e3d588c
071b4x718g
011e3d598c
041e0i608g
I want to create a new column based on the last character of the string. This is what I tried:
for i in DF['col1']:
    if i[-1] == 'g':
        DF['col2'] = 1
    else:
        DF['col2'] = 0
I want the new column like this:
col2
0
1
0
1
but my code has the following output:
col2
0
0
0
0
Is it possible to do this?
Thanks in advance
You can try this using numpy. (In your loop, DF['col2'] = 1 assigns a single scalar to the whole column on every iteration, so only the value from the last row survives; a vectorized check avoids that.)
import numpy as np
DF["col2"] = np.where(DF["col1"].str[-1]=="g",1,0)
Using str.endswith()
Ex:
import pandas as pd

df = pd.DataFrame({"Col1": ['016e3d588c', '071b4x718g', '011e3d598c', '041e0i608g']})
df["Col2"] = df["Col1"].str.endswith("g").astype(int)
print(df)
Output:
Col1 Col2
0 016e3d588c 0
1 071b4x718g 1
2 011e3d598c 0
3 041e0i608g 1
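Either line can be applied directly to the DF and col1 names from the question, e.g. DF['col2'] = DF['col1'].str.endswith('g').astype(int); endswith returns a boolean Series and astype(int) turns True/False into 1/0.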
In my code the df.fillna() method is not working, even though the df.dropna() method works. I don't want to drop the rows, though. What can I do so that the fillna() method works?
def preprocess_df(df):
    for col in df.columns:  # go through all of the columns
        if col != "target":  # normalize all ... except for the target itself!
            df[col] = df[col].pct_change()  # pct change "normalizes" the different currencies (each crypto coin has vastly diff values, we're really more interested in the other coin's movements)
            # df.dropna(inplace=True)  # remove the nas created by pct_change
            df.fillna(method="ffill", inplace=True)
            print(df)
            break
            df[col] = preprocessing.scale(df[col].values)  # scale between 0 and 1.
It should work, unless it is not placed within the loop as intended. You should consider filling the NaN values before you construct the loop, or during DataFrame construction.
The example below clearly shows it working:
>>> df
col1
0 one
1 NaN
2 two
3 NaN
Works as expected:
>>> df['col1'].fillna( method ='ffill') # This is showing column specific to `col1`
0 one
1 one
2 two
3 two
Name: col1, dtype: object
Secondly, if you wish to change only a few selected columns, use the method below.
Let's suppose you have 3 columns and want to fillna with ffill for only 2 of them.
>>> df
col1 col2 col3
0 one test new
1 NaN NaN NaN
2 two rest NaN
3 NaN NaN NaN
Define the columns to be changed:
cols = ['col1', 'col2']
>>> df[cols] = df[cols].fillna(method ='ffill')
>>> df
col1 col2 col3
0 one test new
1 one test NaN
2 two rest NaN
3 two rest NaN
If you want this to happen across the entire DataFrame, use it as follows:
>>> df
col1 col2
0 one test
1 NaN NaN
2 two rest
3 NaN NaN
>>> df.fillna(method ='ffill') # pass inplace=True if you want the change to be permanent
col1 col2
0 one test
1 one test
2 two rest
3 two rest
The first value was NaN, so I had to use the bfill method instead. Thanks, everyone!
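For completeness, a minimal sketch (on a made-up one-column frame) of handling a leading NaN by chaining a forward fill with a backward fill; note that in recent pandas versions the method= keyword of fillna is deprecated in favour of the dedicated ffill()/bfill() methods:
import pandas as pd
import numpy as np

df = pd.DataFrame({'col1': [np.nan, 1.0, np.nan, 3.0]})

# Forward-fill first, then backward-fill whatever is still NaN
# (here only the leading value), using the dedicated methods.
df['col1'] = df['col1'].ffill().bfill()
print(df)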