Take the last row per ID in a DataFrame using Python

How do I select the row at the last position for every user ID in a DataFrame? Any ideas?
import numpy as np
import pandas as pd

data = pd.DataFrame({'User_ID': ['122', '122', '122', '233', '233', '233', '233', '366', '366', '366'],
                     'Age': [23, 23, np.nan, 24, 24, 24, 24, 21, 21, np.nan]})
data
and the outcome should look like this:
data_new=pd.DataFrame({'User_ID':['122','233','366'],'Age':[np.nan,24,np.nan]})
So I just want to take the last row for every User_ID. I'm a total beginner; any ideas?

As you want to keep the NaN, you can groupby.tail (groupby.last would drop the NaNs):
out = data.groupby('User_ID').tail(1)
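To see the difference on the question's data (a quick sketch reusing the data frame built above):
# tail(1) keeps the last row of each group exactly as it is, NaN included
print(data.groupby('User_ID').tail(1))

# last() returns the last non-missing value per group, so 366 shows 21.0 instead of NaN
print(data.groupby('User_ID')['Age'].last())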
Another option is to drop_duplicates:
out = data.drop_duplicates(subset='User_ID', keep='last')
output:
User_ID Age
2 122 NaN
6 233 24.0
9 366 NaN
If you want to reset the index in the process use ignore_index=True:
out = data.drop_duplicates(subset='User_ID', keep='last', ignore_index=True)
output:
User_ID Age
0 122 NaN
1 233 24.0
2 366 NaN

data_new = data.drop_duplicates(subset='User_ID', keep='last')

Related

Grouping a Python Dataframe containing columns with column locations

I have the dataframe below, where I am trying to group the rows into a single row per person and purchase_id. The column purchase_date_location contains the name of the column in which the date for that purchase is located. I am trying to use that location to determine the earliest date a purchase was made.
person  purchase_id  purchase_date_location  column_z    column_x    final_purchase_date
a       1            column_z                NaN         NaN
a       1            column_z                2022-01-01  NaN
a       1            column_z                2022-02-01  NaN
b       2            column_x                NaN         NaN
b       2            column_x                NaN         2022-03-03
I have tried this so far:
groupings = {df.purchase_date_location.iloc[0]: 'min'}
df2 = df.groupby('purchase_id', as_index=False).agg(groupings)
My problem here is that, because of the iloc[0], the value will always be column_z. My question is: how do I make this value change according to the row instead of being fixed to the first one?
I would try to solve it like this:
df['purchase_date'] = df[['purchase_date_location']].apply(
    lambda x: df.loc[x.name, x.iloc[0]], axis=1)
df2 = df.groupby('purchase_id', as_index=False).agg({"purchase_date": min})
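A self-contained sketch of this approach, assuming the table above is loaded as df with the dates as strings and NaN for missing cells; converting the looked-up dates with to_datetime before taking the min keeps the missing values out of the result:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'person': ['a', 'a', 'a', 'b', 'b'],
    'purchase_id': [1, 1, 1, 2, 2],
    'purchase_date_location': ['column_z', 'column_z', 'column_z', 'column_x', 'column_x'],
    'column_z': [np.nan, '2022-01-01', '2022-02-01', np.nan, np.nan],
    'column_x': [np.nan, np.nan, np.nan, np.nan, '2022-03-03'],
})

# per row, look up the column whose name is stored in purchase_date_location
df['purchase_date'] = df[['purchase_date_location']].apply(
    lambda x: df.loc[x.name, x.iloc[0]], axis=1)

# convert to real datetimes so min skips the missing values
df['purchase_date'] = pd.to_datetime(df['purchase_date'])
df2 = df.groupby('purchase_id', as_index=False).agg({'purchase_date': 'min'})
print(df2)
#    purchase_id purchase_date
# 0            1    2022-01-01
# 1            2    2022-03-03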

How to calculate time-difference in pandas with NaN-values

I'm relatively new to Pandas and tried the search already, but I couldn't find a solution.
I have a dataframe with Transaction-No., customerId and the date of purchase which looks like this:
Transaction       12345       12346  12347       12348          12349
customerID
1                   NaN  2019-09-01    NaN  2019-09-11  2019-09-22 ...
2            2019-10-01         NaN    NaN         NaN  2019-10-07 ...
3           ...
The dataframe has [6334 rows x 8557 columns].
Every row has NaN-values, as the Transaction-No. is unique.
I would like to calculate the date difference for each row so I get
customerID  Datedifference1  Datedifference2  ...
1           10               11
2           6
3           ...
I'm struggling to get a list with the date differences for every customerId.
Is there a way to ignore NaN in the dataframe and to only calculate on the values that are not NaN?
I would like to have a list with customerId and the datediff between purchase 1 and 2, 2 and 3, etc. to estimate the days until the next purchase will occur.
Is there a solution for that?
The idea is to reshape the data with DataFrame.stack, compute the differences, remove the first missing value per group and reshape back:
df = df.apply(pd.to_datetime)
df1 = (df.stack()
         .groupby(level=0)
         .diff()
         .dropna()
         .dt.days
         .reset_index(level=1, drop=True)
         .to_frame())
df1 = (df1.set_index(df1.groupby(['customerID']).cumcount(), append=True)[0]
          .unstack()
          .add_prefix('Datedifference'))
print (df1)
Datedifference0 Datedifference1
Transaction
1 10.0 11.0
2 6.0 NaN
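For reference, a sketch of a wide frame matching the question's layout (only the dates visible above; index = customerID, columns = Transaction numbers) on which the chain above yields the 10/11 and 6 day differences shown:
import pandas as pd

df = pd.DataFrame(
    {12345: [None, '2019-10-01'],
     12346: ['2019-09-01', None],
     12347: [None, None],
     12348: ['2019-09-11', None],
     12349: ['2019-09-22', '2019-10-07']},
    index=pd.Index([1, 2], name='customerID'))
df.columns.name = 'Transaction'
# df.apply(pd.to_datetime) turns the strings into datetimes (NaT for missing),
# and the stack/diff/unstack chain above then produces Datedifference0/1 per customer.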
EDIT: If the input data are in a different (long) format, the solution changes: convert the column to datetimes, create a new column of differences with DataFrameGroupBy.diff, remove only the NaN rows with DataFrame.dropna, and finally reshape with DataFrame.set_index and unstack, using a counter Series from GroupBy.cumcount:
print (df1)
customerID Transaction date
0 1 12346 2019-09-01
1 1 12348 2019-09-11
2 1 12349 2019-09-22
3 2 12345 2019-10-01
4 2 12349 2019-10-07
df1['date'] = pd.to_datetime(df1['date'])
df1['diff'] = df1.groupby('customerID')['date'].diff().dt.days
df1 = df1.dropna(subset=['diff'])
df2 = (df1.set_index(['customerID', df1.groupby('customerID').cumcount()])['diff']
          .unstack()
          .add_prefix('Datedifference'))
print (df2)
Datedifference0 Datedifference1
customerID
1 10.0 11.0
2 6.0 NaN
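If you want to reproduce this long-format example, the printed df1 can be built like this (a sketch; the date column starts out as strings):
import pandas as pd

df1 = pd.DataFrame({'customerID': [1, 1, 1, 2, 2],
                    'Transaction': [12346, 12348, 12349, 12345, 12349],
                    'date': ['2019-09-01', '2019-09-11', '2019-09-22',
                             '2019-10-01', '2019-10-07']})
# running the four statements above on this frame gives the df2 printed above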

Data Rearrangement using python Pandas | Create a column based on repeated index and fill with the column value

Rearranging Python CSV data into rows and different columns.
I have a CSV database which contains the name and friend list in the format below.
The expected output is like this:
Name and Value in one row, with as many value columns as there are repetitions of the name.
What is the best way to produce this output?
You could also use groupby and create a new DataFrame with from_dict:
new_dict = (df.groupby('Name')
              .apply(lambda x: list(map(lambda x: x, x['Value'])))
              .to_dict())
new_df = pd.DataFrame.from_dict(new_dict, orient='index')
This will give you:
0 1 2
Ajay C529 C530 None
Djna A-506 A-507 A-508
Patc2 B-526 B-527 B-528
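The input isn't shown in text here, but assuming a two-column frame of repeated Name rows with one Value each (a hypothetical layout consistent with the output above), a runnable sketch:
import pandas as pd

# hypothetical input: one row per (Name, Value) pair, names repeated
df = pd.DataFrame({'Name': ['Ajay', 'Ajay', 'Djna', 'Djna', 'Djna', 'Patc2', 'Patc2', 'Patc2'],
                   'Value': ['C529', 'C530', 'A-506', 'A-507', 'A-508', 'B-526', 'B-527', 'B-528']})

# collect each name's values into a list (equivalent to the groupby/apply above)
new_dict = df.groupby('Name')['Value'].apply(list).to_dict()
new_df = pd.DataFrame.from_dict(new_dict, orient='index')
print(new_df)
#            0      1      2
# Ajay    C529   C530   None
# Djna   A-506  A-507  A-508
# Patc2  B-526  B-527  B-528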
IIUC you would need df.pivot() and then shift the values to the left:
df_new = df.pivot(index='Name', columns='Value', values='Value')\
           .apply(lambda x: pd.Series(x.dropna().values), axis=1).fillna(np.nan)
df_new.columns = ['value_' + str(i + 1) for i in df_new.columns]
print(df_new)
value_1 value_2 value_3 value_4 value_5 value_6 value_7 value_8 value_9 \
Name
Ajay C529 C530 C531 C532 C533 C534 C535 NaN NaN
Djna A-506 A-507 A-508 A-509 A-510 A-511 A-512 A-513 A-514
Patc2 B-526 B-527 B-528 NaN NaN NaN NaN NaN NaN
value_10
Name
Ajay NaN
Djna A-515
Patc2 NaN

How to use groupby in lambda and call first and last value of same id?

I have a very basic question: how do I get the first and last value of two columns for the same id using a lambda function?
For example, if I have the following data,
df = pd.DataFrame()
df["id"] = ['A1','A2','A1','A1','A1','2A','2A','2C','A2','2C']
df["Start"] = [ '- 24.432972' ,'-33.94611','48.12358','-108.235678','75.56794','300.235689‌​','-80.26598','55.23‌​4987','208.29574','1‌​01.235689']
df["End"] = ['-12.234859','-78.26574','40.59862','81.265987','78.245798'‌​,'36.159648','-88.22‌​2256','-51.624566','‌​-205.235894','108.23‌​5684']
How do I group by id and get the first value of Start and the last value of End for each id?
df = df.groupby('id')['Start'].first().apply(pd.to_numeric, errors='coerce').fillna(0)
This gives me NaN values as output
It seems you need to groupby first with agg, taking first and last, and then get the max per column or per row with max:
df1 = df.groupby('id').agg({'start': 'first', 'end': 'last'})
print (df1)
start end
id
123 4 1
213 6 6
456 2 8
df2 = df1.max()
print (df2)
start 6.0
end 8.0
dtype: float64
df3 = df1.max(axis=1)
print (df3)
id
123 4.0
213 6.0
456 8.0
dtype: float64
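Applied to the question's own frame, a sketch: the Start/End values are strings, so they are converted with to_numeric first; stripping the stray space (as in '- 24.432972') keeps errors='coerce' from turning that value into NaN:
import pandas as pd

df = pd.DataFrame()
df["id"] = ['A1', 'A2', 'A1', 'A1', 'A1', '2A', '2A', '2C', 'A2', '2C']
df["Start"] = ['- 24.432972', '-33.94611', '48.12358', '-108.235678', '75.56794',
               '300.235689', '-80.26598', '55.234987', '208.29574', '101.235689']
df["End"] = ['-12.234859', '-78.26574', '40.59862', '81.265987', '78.245798',
             '36.159648', '-88.222256', '-51.624566', '-205.235894', '108.235684']

# remove stray spaces, then convert the string columns to numbers
for col in ['Start', 'End']:
    df[col] = pd.to_numeric(df[col].str.replace(' ', '', regex=False), errors='coerce')

# first Start and last End per id
out = df.groupby('id').agg({'Start': 'first', 'End': 'last'})
print(out)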

Pandas - merge two DataFrames with Identical Column Names

I have two Data Frames with identical column names and identical IDs in the first column. With the exception of the ID column, every cell that contains a value in one DataFrame contains NaN in the other.
Here's an example of what they look like:
ID Cat1 Cat2 Cat3
1 NaN 75 NaN
2 61 NaN 84
3 NaN NaN NaN
ID Cat1 Cat2 Cat3
1 54 NaN 44
2 NaN 38 NaN
3 49 50 53
I want to merge them into one DataFrame while keeping the same Column Names. So the result would look like this:
ID Cat1 Cat2 Cat3
1 54 75 44
2 61 38 84
3 49 50 53
I tried:
df3 = pd.merge(df1, df2, on='ID', how='outer')
Which gave me a DataFrame containing twice as many columns. How can I merge the values from each DataFrame into one?
You probably want df.update. See the documentation.
df1.update(df2, errors='raise')  # raise_conflict=True in pandas < 0.24
In this case, the combine_first function is appropriate. (http://pandas.pydata.org/pandas-docs/version/0.13.1/merging.html)
As the name implies, combine_first takes the first DataFrame and adds to it with values from the second wherever it finds a NaN value in the first.
So:
df3 = df1.combine_first(df2)
produces a new DataFrame, df3, that is essentially just df1 with values from df2 filled in whenever possible.
You could also just replace the NaN values in df1 with the non-NaN values from df2:
df1[pd.isnull(df1)] = df2[~pd.isnull(df2)]
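A minimal end-to-end sketch of the combine_first route, setting ID as the index so the two frames align on ID rather than on row position:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3],
                    'Cat1': [np.nan, 61, np.nan],
                    'Cat2': [75, np.nan, np.nan],
                    'Cat3': [np.nan, 84, np.nan]})
df2 = pd.DataFrame({'ID': [1, 2, 3],
                    'Cat1': [54, np.nan, 49],
                    'Cat2': [np.nan, 38, 50],
                    'Cat3': [44, np.nan, 53]})

# fill df1's gaps with df2's values, aligned on ID
df3 = df1.set_index('ID').combine_first(df2.set_index('ID')).reset_index()
print(df3)
#    ID  Cat1  Cat2  Cat3
# 0   1  54.0  75.0  44.0
# 1   2  61.0  38.0  84.0
# 2   3  49.0  50.0  53.0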
