Drop/edit rows in dataframe where entry doesn't meet condition


I know this has been asked before, but I cannot find an answer that works for me. I have a dataframe df that contains a column age, but the values are not all integers; some are strings like 35-59. I want to drop those entries. I have tried these two solutions as suggested by Kite, but they both give me AttributeError: 'Series' object has no attribute 'isnumeric':
df.drop(df[df.age.isnumeric()].index, inplace=True)
df = df.query("age.isnumeric()")
df = df.reset_index(drop=True)
Additionally, is there a simple way to edit the value of an entry if it matches a certain condition? For example, instead of deleting rows whose age is a range of values, I could replace the range with a random value within it.

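Try:
df.drop(df[df.age.str.isnumeric() == False].index, inplace=True)
According to the documentation, isnumeric is a method of Series.str, not of Series, which is why you get that error.
You also need the == False comparison because the column has mixed types: for entries that are not strings (the real integers), .str.isnumeric() returns NaN rather than True or False. Comparing with == False produces a Series containing only booleans that is True exactly for the non-numeric strings, so only those rows are dropped.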
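A minimal sketch of the same filter, assuming a small made-up df whose age column mixes integers and strings. Converting the column to str first is an alternative to the == False comparison, since it avoids the NaN results for non-string entries. The sample data and variable names are illustrative only.

```python
import pandas as pd

# Hypothetical sample data: real integers, a numeric string and a range string.
df = pd.DataFrame({"age": [25, "35-59", "40", 61]})

# Cast everything to str so that .str.isnumeric() returns a boolean for every row.
is_numeric = df["age"].astype(str).str.isnumeric()

# Keep only the rows whose age consists of digits, then renumber the index.
clean = df[is_numeric].reset_index(drop=True)
print(clean)
```

Building a new frame from a boolean mask like this avoids inplace=True and leaves the original df untouched.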

I'm posting this in case it also helps with your last question. You can use pandas.DataFrame.at together with pandas.DataFrame.itertuples to iterate over the rows of the dataframe and replace values:
for row in df.itertuples():
    # iterate over every row and change the value of that column
    if row.age == 'non_desirable_value':
        df.at[row.Index, "age"] = 'desirable_value'
Hence, it could be:
for row in df.itertuples():
    # row.age is a plain Python value here, so call the str method directly instead of .str
    if not str(row.age).isnumeric() or row.age == 'non_desirable_value':
        df.at[row.Index, "age"] = 'desirable_value'
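For the idea of replacing a range with a random value inside it, one possible sketch follows. It assumes every non-integer entry has the form low-high (as in 35-59); the helper resolve_age and the sample data are hypothetical, not part of the original answer.

```python
import random

import pandas as pd


def resolve_age(value, rng):
    """Return an int age; for a 'low-high' string, draw a random int in that range."""
    text = str(value)
    if "-" in text:
        low, high = (int(part) for part in text.split("-", 1))
        return rng.randint(low, high)  # both bounds are inclusive
    return int(text)


# Hypothetical sample data mixing integers, numeric strings and ranges.
df = pd.DataFrame({"age": [25, "35-59", "40"]})

rng = random.Random(0)  # seeded only so that repeated runs give the same draw
df["age"] = df["age"].apply(lambda value: resolve_age(value, rng))
print(df)
```

Using apply on the column keeps the logic in one small function and avoids writing to the frame row by row.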

Related

Extract strings values from DataFrame column

I have the following DataFrame:
Student  food
1        R0100000
2        R0200000
3        R0300000
4        R0400000
I need to extract as a string the values of the "food" column of the df DataFrame when I filter the data.
For example, when I filter by the Student=1, I need the return value of "R0100000" as a string value, without any other characters or spaces.
This is the code to create the same DataFrame as mine:
import pandas as pd
data = {'Student': [1, 2, 3, 4], 'food': ['R0100000', 'R0200000', 'R0300000', 'R0400000']}
df = pd.DataFrame(data)
I tried to select the DataFrame column and apply str(), but it does not return the desired result:
df_new=df.loc[df['Student'] == 1]
df_new=df_new.food
df_str=str(df_new)
del df_new

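This works for me:
s = df[df.Student == 1]['food'].iloc[0]
s.strip()
Use .iloc[0] rather than [0]: after filtering, the row keeps its original index label, so [0] only works when the matching row happens to have label 0.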

It's pretty simple. First get the column, for example col = df["food"], and then use col[index] to get the respective value.
So your answer would be df["food"][0].
You can also use iloc and loc for this:
df.iloc[rows, columns] selects by position, so the answer is df.iloc[0, 1].
df.loc[rows, column_names] selects by label, for example df.loc[0, "food"].
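Putting the answers above together, a short sketch of pulling a single plain string out of the filtered frame could look like this. It reuses the DataFrame from the question; the variable name value is just for illustration.

```python
import pandas as pd

data = {"Student": [1, 2, 3, 4],
        "food": ["R0100000", "R0200000", "R0300000", "R0400000"]}
df = pd.DataFrame(data)

# Filter rows and pick the column in one .loc call, then take the first match
# by position so the result does not depend on the row's index label.
value = df.loc[df["Student"] == 2, "food"].iloc[0]
print(value)
```

Here value is an ordinary Python str, not a Series, so it carries no index or dtype information.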

Pandas conditionally copy values from one column to another row

I have this Dataframe:
I would like to copy the value of the Date column to the New_Date column, not only in the same row but in every row that has the same User_ID value.
So, it will be:
I tried groupby and then copy, but groupby made all values become lists and other columns with same user_id can have different values in different rows and then it messes up many things.
I tried also:
df['New_Date'] = df.apply(lambda x: x['Date'] if x['User_ID'] == x['User_ID'] else x['New_Date'], axis=1)
But it only copied values to the same row and left the other two empty.
And this:
if (df['User_ID'] == df['User_ID']):
df['New_Date'] = np.where(df['New_Date'] == '', df['Date'], df['New_Date'])
None accomplished my intention.
Help is appreciated, Thanks!

Try this:
df['New_Date'] = df.groupby('User_ID')['Date'].transform('first')

If I'm understanding you correctly, just copy the Date column and then forward-fill the missing values with .ffill() (or .fillna(method='ffill') on older pandas versions). If you post your data as text I can provide example code.
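Since the question's data is not available as text, here is a hedged sketch of the groupby/transform approach on invented data. It assumes the empty Date cells are real missing values (None/NaN) rather than empty strings; the column values are made up.

```python
import pandas as pd

# Hypothetical data: each user has exactly one row with a Date filled in.
df = pd.DataFrame({
    "User_ID": [1, 1, 1, 2, 2],
    "Date": ["2021-01-05", None, None, None, "2021-02-10"],
})

# 'first' takes the first non-missing Date in each User_ID group, and
# transform broadcasts it back to every row of that group.
df["New_Date"] = df.groupby("User_ID")["Date"].transform("first")
print(df)
```

If the blanks are empty strings instead, they would need to be converted to missing values first, because 'first' only skips NaN/None.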

Selecting specific columns in where condition using Pandas

I have the DataFrame below with 3 columns:
df = DataFrame(query, columns=["Processid", "Processdate", "ISofficial"])
With the code below, I get Processdate for Processid == 204 (without column names):
result = df[df.Processid == 204].Processdate.to_string(index=False)
But I want the same result for two columns at once, without column names, something like the code below:
result = df[df.Processid == 204].df["Processdate","ISofficial"].to_string(index=False)
I know how to get the above result, but I don't want the column names, index, or data type.
Can someone help?

I think you are looking for the header argument of to_string. Set it to False:
df[df.Processid==204][['Processdate', 'ISofficial']].to_string(index=False, header=False)
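A small self-contained sketch of this, with invented rows standing in for the query result from the question:

```python
import pandas as pd

# Hypothetical rows; the real data comes from `query` in the question.
df = pd.DataFrame({
    "Processid": [203, 204],
    "Processdate": ["2020-01-01", "2020-01-02"],
    "ISofficial": [True, False],
})

# Select the matching row and both columns, then render without index or header.
result = df.loc[df["Processid"] == 204, ["Processdate", "ISofficial"]].to_string(
    index=False, header=False
)
print(result)
```

The result is a single string containing only the cell values, separated by whitespace.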

Updating element of dataframe while referencing column name and row number

I am coming from an R background and am used to being able to retrieve a value from a dataframe with syntax like:
r_dataframe$some_column_name[row_number]
And I can assign a value to the dataframe with the following syntax:
r_dataframe$some_column_name[row_number] <- some_value
or without the arrow:
r_dataframe$some_column_name[row_number] = some_value
For example:
#create R dataframe data
employee <- c('John Doe','Peter Gynn','Jolie Hope')
salary <- c(21000, 23400, 26800)
startdate <- as.Date(c('2010-11-1','2008-3-25','2007-3-14'))
employ.data <- data.frame(employee, salary, startdate)
#print out the name of this employee
employ.data$employee[2]
#assign the name
employ.data$employee[2] <- 'Some other name'
I'm now learning some Python, and from what I can see the most straightforward way to retrieve a value from a pandas dataframe is:
pandas_dataframe['SomeColumnName'][row_number]
I can see the similarities to R.
However, what confuses me is that when it comes to modifying/assigning the value in the pandas dataframe I need to completely change the syntax to something like:
pandas_dataframe.at[row_number, 'SomeColumnName'] = some_value
To read this code is going to require a lot more concentration because the column name and row number have changed order.
Is this the only way to perform this pair of operations? Is there a more logical way to do this that respects the consistent use of column name and row number order?

If I understand you correctly, as @sammywemmy mentioned, you can use .loc and .iloc to get or change the value in any row and column.
If the order of your dataframe rows may change, define an explicit index so that you can still get each row (data point) by its label, even after reordering.
For example:
df = pd.DataFrame(index=['a', 'b', 'c'], columns=['time', 'date', 'name'])
Now you can get the first row by its index:
df.loc['a'] # equivalent to df.iloc[0]

It turns out that pandas_dataframe.at[row_number, 'SomeColumnName'] can be used to modify AND retrieve information.
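As a rough illustration, the following sketch reads and writes the same cell with one consistent (row, column) order. The frame mirrors the R example from the question; it assumes the default integer index, so pandas rows are numbered from 0 while R starts at 1.

```python
import pandas as pd

employ_data = pd.DataFrame({
    "employee": ["John Doe", "Peter Gynn", "Jolie Hope"],
    "salary": [21000, 23400, 26800],
})

# Read a single cell: row label first, then column name.
before = employ_data.at[1, "employee"]

# Write the same cell with exactly the same indexing expression.
employ_data.at[1, "employee"] = "Some other name"

# .loc accepts the same (row, column) order and also works for slices.
after = employ_data.loc[1, "employee"]
print(before, "->", after)
```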

range(1:len(df)) assigns NaN to last rows in dataframe

I have this weird problem with my code. I am trying to generate an auto ID for my dataframe with this code:
df['id'] = pd.Series(range(1,(len(df)+1))).astype(str).apply('{:0>8}'.format)
Now, len(df) equals 799734,
but df['id'] is NaN after row 77998.
I tried to print the values using:
[print(i) for i in range(1,(len(df)+1))]
On the first attempt it printed None after 77998 values. On the second attempt it printed all values to the end normally, but the dataframe still has NaN in the last rows.
Maybe it has something to do with memory? I am not getting any hint. Please help me solve this issue.

Missing values mean that the Series and the DataFrame have different index values. The assignment aligns on the index, so both need the same one.
So pass df.index to the Series constructor:
df['id'] = pd.Series(range(1,(len(df)+1)), index=df.index).astype(str).apply('{:0>8}'.format)
Or use a two-line solution that assigns the range directly:
df['id'] = range(1,(len(df)+1))
df['id'] = df['id'].astype(str).apply('{:0>8}'.format)
Or reset the DataFrame to a default index so that it matches the Series:
df = df.reset_index(drop=True)
df['id'] = pd.Series(range(1,(len(df)+1))).astype(str).apply('{:0>8}'.format)
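The alignment effect can be reproduced on a tiny frame. This sketch assumes a DataFrame whose index does not start at 0 (for example after filtering); the data and the bad_id column are invented for the demonstration.

```python
import pandas as pd

# Hypothetical frame whose index labels are 5, 6, 7 instead of 0, 1, 2.
df = pd.DataFrame({"x": [10, 20, 30]}, index=[5, 6, 7])

# A new Series gets the default index 0..2, so no label matches the frame's
# index and the assigned column ends up entirely NaN.
df["bad_id"] = pd.Series(range(1, len(df) + 1))

# Assigning a plain list is positional, so no index alignment takes place.
df["id"] = [f"{n:0>8}" for n in range(1, len(df) + 1)]
print(df)
```

In the question only part of the index overlapped with 0..len(df)-1, which would explain why only some of the rows turned into NaN.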

We Keep Coding

