Pandas conditionally copy values from one column to another row - python

I have this Dataframe:
I would like to copy the value of the Date column to the New_Date column, but not only to the same exact row, I want to every row that has the same User_ID value.
So, it will be:
I tried groupby and then copy, but groupby made all values become lists and other columns with same user_id can have different values in different rows and then it messes up many things.
I tried also:
df['New_Date'] = df.apply(lambda x: x['Date'] if x['User_ID'] == x['User_ID'] else x['New_Date'], axis=1)
But it only copied values to the same row and left the other two empty.
And this:
if (df['User_ID'] == df['User_ID']):
df['New_Date'] = np.where(df['New_Date'] == '', df['Date'], df['New_Date'])
None accomplished my intention.
Help is appreciated, Thanks!

try this:
df['New_Date'] = df.groupby('User_Id')['Date'].transform('first')

If I'm understanding you correctly, just copy the Date column and then use .fillna() with ffill=True. If you post your data as text I can provide example code.

Related

Extract strings values from DataFrame column

I have the following DataFrame:
Student
food
1
R0100000
2
R0200000
3
R0300000
4
R0400000
I need to extract as a string the values of the "food" column of the df DataFrame when I filter the data.
For example, when I filter by the Student=1, I need the return value of "R0100000" as a string value, without any other characters or spaces.
This is the code to create the same DataFrame as mine:
data={'Student':[1,2,3,4],'food':['R0100000', 'R0200000', 'R0300000', 'R0400000']}
df=pd.DataFrame(data)
I tried to select the Dataframe Column and apply str(), but it does not return me the desired results:
df_new=df.loc[df['Student'] == 1]
df_new=df_new.food
df_str=str(df_new)
del df_new
This works for me:
s = df[df.Student==1]['food'][0]
s.strip()
It's pretty simple, first get the column.
like, col =data["food"] and then use col[index] to get respective value
So, you answer would be data["food"][0]
Also, you can use iloc and loc search for these.
(df.iloc[rows,columns], so we can use this property to get answer as, df.iloc[0,1])
df.loc[rows, column_names] example: df.loc[0,"food"]

Selecting specific columns in where condition using Pandas

I have a below Dataframe with 3 columns:
df = DataFrame(query, columns=["Processid", "Processdate", "ISofficial"])
In Below code, I get Processdate based on Processid==204 (without Column Names):
result = df[df.Processid == 204].Processdate.to_string(index=False)
But I wan the same result for Two columns at once without column names, Something like below code:
result = df[df.Processid == 204].df["Processdate","ISofficial"].to_string(index=False)
I know how to get above result but I dont want Column names, Index and data type.
Can someone help?
I think you are looking for header argument in to_string parameters. Set it to False.
df[df.Processid==204][['Processdate', 'ISofficial']].to_string(index=False, header=False)

Drop/edit rows in dataframe where entry doesn't meet condition

I know this has been asked before but I cannot find an answer that is working for me. I have a dataframe df that contains a column age, but the values are not all integers, some are strings like 35-59. I want to drop those entries. I have tried these two solutions as suggested by kite but they both give me AttributeError: 'Series' object has no attribute 'isnumeric'
df.drop(df[df.age.isnumeric()].index, inplace=True)
df = df.query("age.isnumeric()")
df = df.reset_index(drop=True)
Additionally is there a simple way to edit the value of an entry if it matches a certain condition? For example instead of deleting rows that have age as a range of values, I could replace it with a random value within that range.
Try with:
df.drop(df[df.age.str.isnumeric() == False].index, inplace=True)
If you check documentation isnumeric is a method of Series.str and not of Series. That's why you get that error.
Also you will need the ==False because you have mixed types and get a series with only booleans.
I'm posting it in case this also helps you with your last question. You can use pandas.DataFrame.at with pandas.DataFrame.Itertuples for iteration over rows of the dataframe and replace values:
for row in df.itertuples():
# iterate every row and change the value of that column
if row.age == 'non_desirable_value:
df.at[row.Index, "age"] = 'desirable_value'
Hence, it could be:
for row in df.itertuples():
if row.age.str.isnumeric() == False or row.age == 'non_desirable_value':
df.at[row.Index, "age"] = 'desirable_value'

range(1:len(df)) assigns NaN to last rows in dataframe

I have this weird problem with my code . I am trying to generate Auto Id to my dataframe with this code
df['id'] = pd.Series(range(1,(len(df)+1))).astype(str).apply('{:0>8}'.format
now, len(df) is equals to 799734
but df['id'] is Nan after row 77998
I tried to print the values using:
[print(i) for i in range(1,(len(df)+1))]
In first attempt it printed None after 77998 values. In second attempt it printed all values to the end normally. but dataframe has still Nan in last rows.
May be it has something to do with memory? I am not getting any hint. Please help me solve this issue.
Missing values means there is different index values in Series and DataFrame, for correct working need same.
So need pass df.index to Series constructor:
df['id'] = pd.Series(range(1,(len(df)+1)), index=df.index).astype(str).apply('{:0>8}'.format
Or 2 rows solution with assign range:
df['id'] = range(1,(len(df)+1))
df['id'] = df['id'].astype(str).apply('{:0>8}'.format
Or create default index values in DataFrame for same like Series:
df = df.reset_index(drop=True)
df['id'] = pd.Series(range(1,(len(df)+1))).astype(str).apply('{:0>8}'.format

Using pd.Dataframe.replace with an apply function as the replace value

I have several dataframes that have mixed in some columns with dates in this ASP.NET format "/Date(1239018869048)/". I've figured out how to parse this into python's datetime format for a given column. However I would like to put this logic into a function so that I can pass it any dataframe and have it replace all the dates that it finds that match a regex using pd.Dataframe.replace.
something like:
def pretty_dates():
#Messy logic here
df.replace(to_replace=r'\/Date(d+)', value=pretty_dates(df), regex=True)
Problem with this is that the df that is being passed to pretty_dates is the whole dataframe not just the cell that is needed to be replaced.
So the concept I'm trying to figure out is if there is a way that the value that should be replaced when using df.replace can be a function instead of a static value.
Thank you so much in advance
EDIT
To try to add some clarity, I have many columns in a dataframe, over a hundred that contain this date format. I would like not to list out every single column that has a date. Is there a way to apply the function the clean my dates across all the columns in my dataset? So I do not want to clean 1 column but all the hundreds of columns of my dataframe.
I'm sure you can use regex to do this in one step, but here is how to apply it to the whole column at once:
df = pd.Series(['/Date(1239018869048)/',
'/Date(1239018869048)/'],dtype=str)
df = df.str.replace('\/Date\(', '')
df = df.str.replace('\)\/', '')
print(df)
0 1239018869048
1 1239018869048
dtype: object
As far as I understand, you need to apply custom function to selected cells in specified column. Hope, that the following example helps you:
import pandas as pd
df = pd.DataFrame({'x': ['one', 'two', 'three']})
selection = df.x.str.contains('t', regex=True) # put your regexp here
df.loc[selection, 'x'] = df.loc[selection, 'x'].map(lambda x: x+x) # do some logic instead
You can apply this procedure to all columns of the df in a loop:
for col in df.columns:
selection = df.loc[:, col].str.contains('t', regex=True) # put your regexp here
df.loc[selection, col] = df.loc[selection, col].map(lambda x: x+x) # do some logic instead

Categories