I am trying to remove rows containing 'NA' values with pandas' .dropna() method, but when I apply it, the method returns a None object.
import pandas as pd
# importing data #
df = pd.read_csv(path, sep=',', na_values='NA')
# this is what the df looks like
df = {'col1': [1, 2], 'col2': ['NA', 4]}
df = pd.DataFrame(df)
# trying to drop NA
d = df.dropna(how='any', inplace=True)
This code returns a None object; the expected output could look like this:
#    col1  col2
# 0     2     4
How could I adjust this method?
Is there any simpler method to accomplish this task?
import numpy as np
import pandas as pd
First, replace the string 'NA' values in your dataframe with actual NaN values using the replace() method:
df = df.replace('NA', np.nan)
Finally:
df.dropna(how='any', inplace=True)
Now if you print df you will get your desired output:
   col1  col2
1     2   4.0
If you want exact same output that you mentioned in question then just use reset_index() method:
df=df.reset_index(drop=True)
Now if you print df you will get:
   col1  col2
0     2   4.0
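For reference, the replace / dropna / reset_index steps above can also be chained into a single expression (a sketch using the toy frame from the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': ['NA', 4]})

# replace the string 'NA' with a real NaN, drop incomplete rows,
# then renumber the index, all in one method chain
out = (df.replace('NA', np.nan)
         .dropna(how='any')
         .reset_index(drop=True))
print(out)
```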
Remove records with string 'NA'
df[~df.eq('NA').any(axis=1)]
   col1 col2
1     2    4
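A self-contained version of that one-liner, using the toy data from the question:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': ['NA', 4]})

# mask out any row where at least one cell equals the string 'NA'
result = df[~df.eq('NA').any(axis=1)]
print(result)
```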
I am trying to create a dataframe in pandas and directly use one of the generated columns to assign a new column to the same df.
As a simplified example, I tried to multiply a column of a df using assign:
import pandas as pd
df = pd.DataFrame([['A', 1], ['B', 2], ['C', 3]] , columns = ['col1', 'col2'])\
.assign(col3 = 2 * col2)
but then I get an error NameError: name 'col2' is not defined.
Using R/dplyr, I would be able to do this in a pipe using
df <- data.frame(col1 = LETTERS[1:3], col2 = 1:3) %>% mutate(col3 = 2 * col2)
Also, in a general sense, pipe notation in R/dplyr allows the usage of the "." to refer to the data forwarded by the pipe.
Is there a way to refer to the columns just created (or to the data that goes into the assign statement), thus doing the same thing in Pandas?
Use a lambda function; more information in Assigning New Columns in Method Chains:
df = (pd.DataFrame([['A', 1], ['B', 2], ['C', 3]] , columns = ['col1', 'col2'])
.assign(col3 = lambda x: 2 * x.col2))
print(df)
col1 col2 col3
0 A 1 2
1 B 2 4
2 C 3 6
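Since pandas 0.23 (on Python 3.6+), keyword arguments to assign are evaluated in order, so a later lambda in the same call can refer to a column created just before it; a small sketch:

```python
import pandas as pd

df = (pd.DataFrame([['A', 1], ['B', 2], ['C', 3]], columns=['col1', 'col2'])
        .assign(col3=lambda x: 2 * x.col2,     # uses the original col2
                col4=lambda x: x.col3 + 10))   # uses col3 created just above
print(df)
```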
I wrote a package datar to port dplyr and family to python. Now you can do it with (almost) the same syntax as you do it in R:
>>> from datar.all import f, tibble, LETTERS, mutate
>>> tibble(col1=LETTERS[:3], col2=f[1:3]) >> mutate(col3=2*f.col2)
col1 col2 col3
<object> <int64> <int64>
0 A 1 2
1 B 2 4
2 C 3 6
import pandas as pd
d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)
def calculation(text):
    return text*2

for idx, row in df.iterrows():
    df.at[idx, 'col3'] = dict(cats=calculation(row['col1']))
df
So as you can see from the code above I have tried a few different things.
Basically I am trying to get the dictionary in to col3.
However, when you run for the first time on new dataframe - you get a
   col1  col2         col3
0     1     3         cats
1     2     4  {'cats': 4}
If you run the for loop again on the same dataframe you get what I am looking for which is
   col1  col2         col3
0     1     3  {'cats': 2}
1     2     4  {'cats': 4}
How do I go straight to having the dictionary in there to start without having to run the loop again?
I have tried other ways like df.loc and others, still no joy.
Try to stay away from df.iterrows().
You can use df.apply instead:
import pandas as pd
d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)
def calculation(text):
    return text*2

def calc_dict(row):
    return dict(cats=calculation(row['col1']))
df['col3'] = df.apply(calc_dict, axis=1)
df
Which outputs the result you expect.
The error seems to creep in with the creation and assignment of an object datatype to column col3. I tried to pre-allocate NaNs with df['col3'] = np.nan (note that pd.np is deprecated; use numpy directly), which did not have an effect (inspect with print(df.dtypes)). Anyway, this seems like buggy behaviour. Use df.apply instead; it's faster and less prone to these types of issues.
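If you prefer to avoid apply as well, a plain list comprehension builds the whole object column before assignment and sidesteps the dtype issue entirely (a sketch with the same toy data):

```python
import pandas as pd

d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)

def calculation(text):
    return text * 2

# build all the dicts first, then assign the column in one shot
df['col3'] = [dict(cats=calculation(v)) for v in df['col1']]
print(df)
```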
Here is a very simple dataframe:
df = pd.DataFrame({'col1': [1, 2, 3],
                   'col2': [1, 3, 3]})
I'm trying to remove rows where there are duplicate values (e.g., row 3)
This doesn't work,
df = df[(df.col1 != 3 & df.col2 != 3)]
and the documentation is pretty clear about why, which makes sense.
But I still don't know how to delete that row.
Does anyone have any ideas? Thanks. Monica.
If I understand your question correctly, I think you were close.
Starting from your data:
In [20]: df
Out[20]:
col1 col2
0 1 1
1 2 3
2 3 3
And doing this:
In [21]: df = df[df['col1'] != df['col2']]
Returns:
In [22]: df
Out[22]:
col1 col2
1 2 3
What about:
In [43]: df = pd.DataFrame({'col1': [1, 2, 3],
    ...:                    'col2': [1, 3, 3]})
In [44]: df[df.max(axis=1) != df.min(axis=1)]
Out[44]:
col1 col2
1 2 3
[1 rows x 2 columns]
We want to remove rows whose values show up in all columns, or in other words, rows whose values are all equal: their minimums and maximums are then equal. This method works on a DataFrame with any number of columns. Applying the above removes rows 0 and 2.
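Another way to express "not all values in the row are equal", for any number of columns, is to count the distinct values per row; a sketch:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3],
                   'col2': [1, 3, 3]})

# keep only rows with more than one distinct value across the columns
result = df[df.nunique(axis=1) > 1]
print(result)
```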
Any row with all the same values with have zero as the standard deviation. One way to filter them is as follows:
import pandas as pd
import numpy as np
df = pd.DataFrame({'col1': [1, 2, 3, np.nan],
                   'col2': [1, 3, 3, np.nan]})
>>> df.loc[df.std(axis=1, skipna=False) > 0]
   col1  col2
1   2.0   3.0
So I have initialized an empty pandas DataFrame and I would like to iteratively append lists (or Series) as rows in this DataFrame. What is the best way of doing this?
df = pd.DataFrame(columns=list("ABC"))
df.loc[len(df)] = [1,2,3]
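For example, building a frame up row by row with this pattern (note that each such assignment is relatively slow, so for many rows it is better to collect the data in a list first):

```python
import pandas as pd

df = pd.DataFrame(columns=list("ABC"))

# on a default RangeIndex, len(df) is always the next free label
df.loc[len(df)] = [1, 2, 3]
df.loc[len(df)] = [4, 5, 6]
print(df)
```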
Sometimes it's easier to do all the appending outside of pandas, then, just create the DataFrame in one shot.
>>> import pandas as pd
>>> simple_list=[['a','b']]
>>> simple_list.append(['e','f'])
>>> df=pd.DataFrame(simple_list,columns=['col1','col2'])
col1 col2
0 a b
1 e f
Here's a simple and dumb solution:
>>> import pandas as pd
>>> df = pd.DataFrame()
>>> df = df.append({'foo':1, 'bar':2}, ignore_index=True)
Could you do something like this?
>>> import pandas as pd
>>> df = pd.DataFrame(columns=['col1', 'col2'])
>>> df = df.append(pd.Series(['a', 'b'], index=['col1','col2']), ignore_index=True)
>>> df = df.append(pd.Series(['d', 'e'], index=['col1','col2']), ignore_index=True)
>>> df
col1 col2
0 a b
1 d e
Does anyone have a more elegant solution?
Following on from Mike Chirico's answer... if you want to append a list after the dataframe is already populated... (using lst rather than list, to avoid shadowing the builtin)
>>> lst = [['f','g']]
>>> df = df.append(pd.DataFrame(lst, columns=['col1','col2']), ignore_index=True)
>>> df
col1 col2
0 a b
1 d e
2 f g
There are several ways to append a list to a pandas DataFrame in Python. Let's consider the following dataframe and list:
import pandas as pd
# Dataframe
df = pd.DataFrame([[1, 2], [3, 4]], columns = ["col1", "col2"])
# List to append (named row to avoid shadowing the built-in list)
row = [5, 6]
Option 1: append the list at the end of the dataframe with pandas.DataFrame.loc.
df.loc[len(df)] = row
Option 2: convert the list to a dataframe and append with pandas.DataFrame.append().
df = df.append(pd.DataFrame([row], columns=df.columns), ignore_index=True)
Option 3: convert the list to a series and append with pandas.DataFrame.append().
df = df.append(pd.Series(row, index=df.columns), ignore_index=True)
Each of the above options should output something like:
>>> print (df)
col1 col2
0 1 2
1 3 4
2 5 6
Reference : How to append a list as a row to a Pandas DataFrame in Python?
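Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions options 2 and 3 need pd.concat instead; a rough equivalent of option 2:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=["col1", "col2"])
row = [5, 6]

# pd.concat replaces the removed DataFrame.append
df = pd.concat([df, pd.DataFrame([row], columns=df.columns)],
               ignore_index=True)
print(df)
```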
Converting the list to a DataFrame within the append call also works when applied in a loop:
import pandas as pd
mylist = [1,2,3]
df = pd.DataFrame()
df = df.append(pd.DataFrame([mylist]))
Here's a function that, given an already created dataframe, will append a list as a new row. This should probably have error catchers thrown in, but if you know exactly what you're adding then it shouldn't be an issue.
import pandas as pd
import numpy as np
def addRow(df, ls):
    """
    Given a dataframe and a list, append the list as a new row to the dataframe.

    :param df: <DataFrame> The original dataframe
    :param ls: <list> The new row to be added
    :return: <DataFrame> The dataframe with the newly appended row
    """
    numEl = len(ls)
    newRow = pd.DataFrame(np.array(ls).reshape(1, numEl), columns=list(df.columns))
    df = df.append(newRow, ignore_index=True)
    return df
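On pandas 2.x, where DataFrame.append has been removed, the same helper can be rewritten with pd.concat; a sketch with a quick usage example (note that np.array coerces mixed types, so rows mixing ints and strings come back as strings):

```python
import numpy as np
import pandas as pd

def add_row(df, ls):
    """Append the list ls as a new row to df, using pd.concat."""
    new_row = pd.DataFrame(np.array(ls).reshape(1, len(ls)),
                           columns=list(df.columns))
    return pd.concat([df, new_row], ignore_index=True)

df = pd.DataFrame({'col1': [1], 'col2': [2]})
df = add_row(df, [3, 4])
print(df)
```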
If you want to add a Series and use the Series' index as columns of the DataFrame, you only need to append the Series between brackets:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame()
In [3]: row=pd.Series([1,2,3],["A","B","C"])
In [4]: row
Out[4]:
A 1
B 2
C 3
dtype: int64
In [5]: df.append([row],ignore_index=True)
Out[5]:
A B C
0 1 2 3
[1 rows x 3 columns]
Without ignore_index=True you don't get a proper index.
Simply use loc:
>>> df
A B C
one 1 2 3
>>> df.loc["two"] = [4,5,6]
>>> df
A B C
one 1 2 3
two 4 5 6
As mentioned here - https://kite.com/python/answers/how-to-append-a-list-as-a-row-to-a-pandas-dataframe-in-python, you'll need to first convert the list to a series then append the series to dataframe.
df = pd.DataFrame([[1, 2], [3, 4]], columns = ["a", "b"])
to_append = [5, 6]
a_series = pd.Series(to_append, index = df.columns)
df = df.append(a_series, ignore_index=True)
Consider a DataFrame A with N rows and 2 columns. To add one more row, use the following.
A.loc[A.shape[0]] = [3,4]
The simplest way:
my_list = [1,2,3,4,5]
df['new_column'] = pd.Series(my_list).values
Edit:
Don't forget that the length of the new list must be the same as the length of the corresponding DataFrame.
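Note that this assigns a new column, not a new row; a quick self-contained check (the list length must match the frame's length):

```python
import pandas as pd

df = pd.DataFrame({'a': [10, 20, 30, 40, 50]})
my_list = [1, 2, 3, 4, 5]

# .values strips the Series index, so index alignment cannot introduce NaNs
df['new_column'] = pd.Series(my_list).values
print(df)
```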