Python, Pandas: Keep only the newest and unique data inside dataframe

Good evening,
the objects inside my dataframe can appear as many times as they want, always with additional (changing) extra data and at least a unique timestamp (the date column alone is not unique), something like this...
id | object | additional_data | date | timestamp
1 | item_a | ... | 2014-04-15 | 10:16:22
2 | item_a | ... | 2014-04-10 | 18:19:01
3 | item_a | ... | 2014-04-10 | 17:59:43
4 | item_b | ... | 2014-04-13 | 10:16:22
5 | item_c | ... | 2014-04-15 | 00:01:59
6 | item_c | ... | 2014-04-14 | 08:46:00
7 | item_d | ... | 2014-04-15 | 10:12:47
Is it possible to filter the dataframe down to only the unique, newest data? For example like this:
id | object | additional_data | date | timestamp
1 | item_a | ... | 2014-04-15 | 10:16:22
4 | item_b | ... | 2014-04-13 | 10:16:22
5 | item_c | ... | 2014-04-15 | 00:01:59
7 | item_d | ... | 2014-04-15 | 10:12:47
Thanks for all your help and have a great day!

First, sort your dataframe by the 'date' and 'timestamp' columns using sort_values():
df = df.sort_values(by=['date', 'timestamp'], ascending=[False, False])
Now use the drop_duplicates() method, which keeps the first (i.e. newest) row per object:
df = df.drop_duplicates(subset=['object'], ignore_index=True)
Alternatively, you can do this with sort_values() and groupby():
df.sort_values(by=['date','timestamp'],ascending=[False,False]).groupby('object',as_index=False).first()
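For completeness, a minimal runnable sketch with the sample data from the question (sorting ISO-format date strings lexicographically is equivalent to sorting chronologically; the final sort on 'id' only restores the original row order):
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6, 7],
    'object': ['item_a', 'item_a', 'item_a', 'item_b', 'item_c', 'item_c', 'item_d'],
    'date': ['2014-04-15', '2014-04-10', '2014-04-10', '2014-04-13',
             '2014-04-15', '2014-04-14', '2014-04-15'],
    'timestamp': ['10:16:22', '18:19:01', '17:59:43', '10:16:22',
                  '00:01:59', '08:46:00', '10:12:47'],
})

# Newest first, keep the first row per object, then restore id order.
newest = (df.sort_values(['date', 'timestamp'], ascending=[False, False])
            .drop_duplicates(subset=['object'])
            .sort_values('id', ignore_index=True))
print(newest)  # rows with id 1, 4, 5, 7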

Related

Extract utc format for datetime object in a new Python column

Consider the following pandas DataFrame:
| ID | date |
|--------------|---------------------------------------|
| 0 | 2022-03-02 18:00:20+01:00 |
| 0 | 2022-03-12 17:08:30+01:00 |
| 1 | 2022-04-23 12:11:50+01:00 |
| 1 | 2022-04-04 10:15:11+01:00 |
| 2 | 2022-04-07 08:24:19+02:00 |
| 3 | 2022-04-11 02:33:22+02:00 |
I want to separate the date column into two columns, one for the date in the format "yyyy-mm-dd" and one for the time in the format "hh:mm:ss+tmz".
That is, I want to get the following resulting DataFrame:
| ID | date_only | time_only |
|--------------|-------------------------|----------------|
| 0 | 2022-03-02 | 18:00:20+01:00 |
| 0 | 2022-03-12 | 17:08:30+01:00 |
| 1 | 2022-04-23 | 12:11:50+01:00 |
| 1 | 2022-04-04 | 10:15:11+01:00 |
| 2 | 2022-04-07 | 08:24:19+02:00 |
| 3 | 2022-04-11 | 02:33:22+02:00 |
Right now I am using the following code, but it does not return the time with the UTC offset (+hh:mm).
df['date_only'] = df['date'].apply(lambda a: a.date())
df['time_only'] = df['date'].apply(lambda a: a.time())
| ID | date_only |time_only |
|--------------|-------------------------|----------|
| 0 | 2022-03-02 | 18:00:20 |
| 0 | 2022-03-12 | 17:08:30 |
| ... | ... | ... |
| 3 | 2022-04-11 | 02:33:22 |
I hope you can help me, thank you in advance.
Convert the column to datetimes, then extract the dates with Series.dt.date and the times with their timezones with Series.dt.strftime:
df['date'] = pd.to_datetime(df['date'])
df['date_only'] = df['date'].dt.date
df['time_only'] = df['date'].dt.strftime('%H:%M:%S%z')
Or convert the values to strings, split them on the space, and select the second part:
df['date'] = pd.to_datetime(df['date'])
df['date_only'] = df['date'].dt.date
df['time_only'] = df['date'].astype(str).str.split().str[1]
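For reference, a self-contained sketch of both approaches (offsets kept uniform here, since mixed offsets need extra handling such as utc=True in newer pandas):
import pandas as pd

df = pd.DataFrame({
    'ID': [0, 0, 1],
    'date': ['2022-03-02 18:00:20+01:00',
             '2022-03-12 17:08:30+01:00',
             '2022-04-23 12:11:50+01:00'],
})

df['date'] = pd.to_datetime(df['date'])
df['date_only'] = df['date'].dt.date
# Note: %z renders the offset without a colon, e.g. '18:00:20+0100'.
df['time_only'] = df['date'].dt.strftime('%H:%M:%S%z')
# The string-split variant keeps the colon: '18:00:20+01:00'.
df['time_only_alt'] = df['date'].astype(str).str.split().str[1]
print(df[['ID', 'date_only', 'time_only', 'time_only_alt']])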

How to dimensionalize a pandas dataframe

I'm looking for a more elegant way of doing this, other than a for-loop and unpacking manually...
Imagine I have a dataframe that looks like this
| id | value | date | name |
| -- | ----- | ---------- | ---- |
| 1 | 5 | 2021-04-05 | foo |
| 1 | 6 | 2021-04-06 | foo |
| 5 | 7 | 2021-04-05 | bar |
| 5 | 9 | 2021-04-06 | bar |
If I wanted to dimensionalize this, I could split it up into two different tables. One, perhaps, would contain "meta" information about the person, and the other serving as "records" that would all relate back to one person... a pretty simple idea as far as SQL-ian ideas go...
The resulting tables would look like this...
Meta
| id | name |
| -- | ---- |
| 1 | foo |
| 5 | bar |
Records
| id | value | date |
| -- | ----- | ---------- |
| 1 | 5 | 2021-04-05 |
| 1 | 6 | 2021-04-06 |
| 5 | 7 | 2021-04-05 |
| 5 | 9 | 2021-04-06 |
My question is, how can I achieve this "dimensionalizing" of a dataframe with pandas, without having to write a for loop on the unique id key field and unpacking manually?
Think about this not as "splitting" the existing dataframe, but as creating two new dataframes from the original. You can do this in a couple of lines:
meta = df[['id','name']].drop_duplicates() #Select the relevant columns and remove duplicates
records = df.drop("name", axis=1) #Replicate the original dataframe but drop the name column
You could use drop_duplicates() on the subset of columns you want to keep. For the second dataframe, drop the name column:
df1 = df.drop_duplicates(['id', 'name']).loc[:,['id', 'name']] # perigon's answer is simpler with df[['id','name']].drop_duplicates()
df2 = df.drop('name', axis=1)
df1, df2
Output:
( id name
0 1 foo
2 5 bar,
id value date
0 1 5 2021-04-05
1 1 6 2021-04-06
2 5 7 2021-04-05
3 5 9 2021-04-06)
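As a quick sanity check (a sketch, not part of the original answers), you can merge the two frames back together and confirm they reproduce the original:
rebuilt = df2.merge(df1, on='id', how='left')[['id', 'value', 'date', 'name']]
assert rebuilt.equals(df.reset_index(drop=True))  # the split round-trips cleanly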

How to apply an aggregate function to all columns of a pivot table in Pandas

A pivot table is counting the monthly occurrences of a phenomenon. Here's the simplified sample data followed by the pivot:
+--------+------------+------------+
| ad_id | entreprise | date |
+--------+------------+------------+
| 172788 | A | 2020-01-28 |
| 172931 | A | 2020-01-26 |
| 172793 | B | 2020-01-26 |
| 172768 | C | 2020-01-19 |
| 173219 | C | 2020-01-14 |
| 173213 | D | 2020-01-13 |
+--------+------------+------------+
My pivot_table code is the following:
my_pivot_table = pd.pivot_table(df[(df['date'] >= some_date) & (df['date'] <= some_other_date)],
                                values=['ad_id'], index=['entreprise'],
                                columns=['year', 'month'], aggfunc=['count'])
The resulting table looks like this:
+-------------+---------+----------+-----+----------+
| | 2018 | | | |
+-------------+---------+----------+-----+----------+
| entreprise | january | february | ... | december |
| A | 12 | 10 | ... | 8 |
| B | 24 | 12 | ... | 3 |
| ... | ... | ... | ... | ... |
| D | 31 | 18 | ... | 24 |
+-------------+---------+----------+-----+----------+
Now, I would like to add a column that gives me the monthly average, and perform other operations such as comparing last month's count to the monthly average of, say, the last 12 months...
I tried to fiddle with the aggfunc parameter of the pivot_table, as well as trying to add an average column to the original dataframe, but without success.
Thanks in advance!
Because pivot_table returns a table with MultiIndex columns, you can average across a column level (here df is the pivot table; mean with level= works in pandas versions before 2.0):
df1 = df.mean(axis=1, level=0)
df1.columns = pd.MultiIndex.from_product([df1.columns, ['mean']])
Or:
df2 = df.mean(axis=1, level=1)
df2.columns = pd.MultiIndex.from_product([['all_years'], df2.columns])
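Note that DataFrame.mean(axis=1, level=...) was removed in pandas 2.0; an equivalent sketch for newer versions groups the transposed table by column level (my_pivot_table is the pivot table from the question):
# Mean across columns within each level-0 group (adjust level= to wherever
# the year/month levels sit in your MultiIndex).
df1 = my_pivot_table.T.groupby(level=0).mean().T
df1.columns = pd.MultiIndex.from_product([df1.columns, ['mean']])

df2 = my_pivot_table.T.groupby(level=1).mean().T
df2.columns = pd.MultiIndex.from_product([['all_years'], df2.columns])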

Value error when merging 2 dataframes with identical numbers of rows

I have a dataframe like this:
+-----+-------+---------+
| id | Time | Name |
+-----+-------+---------+
| 1 | 1 | John |
+-----+-------+---------+
| 2 | 2 | David |
+-----+-------+---------+
| 3 | 4 | Rebecca |
+-----+-------+---------+
| 4 | later | Taylor |
+-----+-------+---------+
| 5 | later | Li |
+-----+-------+---------+
| 6 | 8 | Maria |
+-----+-------+---------+
I want to merge with another table based on 'id' and time:
data1 = pd.merge(data1, data2, on=['id', 'time'], how='left')
The other table's data:
+-----+-------+--------------+
| id | Time | Job |
+-----+-------+--------------+
| 2 | 2 | Doctor |
+-----+-------+--------------+
| 1 | 1 | Engineer |
+-----+-------+--------------+
| 4 | later | Receptionist |
+-----+-------+--------------+
| 3 | 4 | Professor |
+-----+-------+--------------+
| 5 | later | Lawyer |
+-----+-------+--------------+
| 6 | 8 | Trainer |
+-----+-------+--------------+
It raised this error:
ValueError: You are trying to merge on int64 and object columns. If you wish to proceed you should use pd.concat
What I tried:
data1['time']=data1['time'].astype(str)
data2['time']=data2['time'].astype(str)
That did not work. What can I do?
PS: in this example the ids are all different, but in my data the same id can appear multiple times, so I need to merge on both Time and id.
Have you tried also casting the 'id' column to either str or int?
Sorry, but I don't have enough reputation to just comment on your question.
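To make that concrete, a sketch (column names as in the question's code) that casts the join keys on both frames to a common dtype before merging:
# 'time' mixes numbers and 'later', so cast both keys on both frames to str
# so the dtypes match exactly before merging.
for col in ['id', 'time']:
    data1[col] = data1[col].astype(str)
    data2[col] = data2[col].astype(str)

data1 = pd.merge(data1, data2, on=['id', 'time'], how='left')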

Select certain row values and make them columns in pandas

I have a dataset that looks like the below:
+-------------------------+-------------+------+--------+-------------+--------+--+
| | impressions | name | shares | video_views | diff | |
+-------------------------+-------------+------+--------+-------------+--------+--+
| _ts | | | | | | |
| 2016-09-12 23:15:04.120 | 1 | Vidz | 7 | 10318 | 15mins | |
| 2016-09-12 23:16:45.869 | 2 | Vidz | 7 | 10318 | 16mins | |
| 2016-09-12 23:30:03.129 | 3 | Vidz | 18 | 29291 | 30mins | |
| 2016-09-12 23:32:08.317 | 4 | Vidz | 18 | 29291 | 32mins | |
+-------------------------+-------------+------+--------+-------------+--------+--+
I am trying to build a dataframe to feed to a regression model, and I'd like to parse out specific rows as features. To do this I would like the dataframe to resemble this
+-------------------------+------+--------------+-------------------+-------------------+--------------+-------------------+-------------------+
| | name | 15min_shares | 15min_impressions | 15min_video_views | 30min_shares | 30min_impressions | 30min_video_views |
+-------------------------+------+--------------+-------------------+-------------------+--------------+-------------------+-------------------+
| _ts | | | | | | | |
| 2016-09-12 23:15:04.120 | Vidz | 7 | 1 | 10318 | 18 | 3 | 29291 |
+-------------------------+------+--------------+-------------------+-------------------+--------------+-------------------+-------------------+
What would be the best way to do this? I think this would be easier if I were only trying to select one row (15mins): just parse out the unneeded rows and pivot.
However, I need both the 15min and 30min features, and I am unsure how to proceed given the need for these columns.
You could take subsets of your DF for the 15mins and 30mins rows and concatenate them, backfilling the NaN values of the first row (15mins) with those of the next row (30mins) and then dropping the 30mins rows, as shown:
prefix_15 = "15mins"
prefix_30 = "30mins"
fifteen_mins = (df['diff'] == prefix_15)
thirty_mins = (df['diff'] == prefix_30)
df = df[fifteen_mins | thirty_mins].drop(['diff'], axis=1)

# Concatenate side by side, backfill the 30mins values into the 15mins rows,
# then drop the rows that still contain NaNs (the 30mins rows).
df_ = pd.concat([df[fifteen_mins].add_prefix(prefix_15 + '_'),
                 df[thirty_mins].add_prefix(prefix_30 + '_')], axis=1) \
        .bfill().dropna(how='any')

del df_['30mins_name']
df_.rename(columns={'15mins_name': 'name'}, inplace=True)
df_
Stacking to pivot and collapsing your columns:
df1 = df.set_index('diff', append=True).stack().unstack(0).T
df1.columns = df1.columns.map('_'.join)
To see just the first row:
df1.iloc[[0]].dropna(axis=1)
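A runnable sketch of this approach with the sample data from the question (index and dtypes assumed):
import pandas as pd

df = pd.DataFrame(
    {'impressions': [1, 2, 3, 4],
     'name': ['Vidz'] * 4,
     'shares': [7, 7, 18, 18],
     'video_views': [10318, 10318, 29291, 29291],
     'diff': ['15mins', '16mins', '30mins', '32mins']},
    index=pd.to_datetime(['2016-09-12 23:15:04.120', '2016-09-12 23:16:45.869',
                          '2016-09-12 23:30:03.129', '2016-09-12 23:32:08.317']),
).rename_axis('_ts')

# Move 'diff' into the index, stack the remaining columns, pivot timestamps
# to columns and back, then flatten the (diff, column) pairs into names.
df1 = df.set_index('diff', append=True).stack().unstack(0).T
df1.columns = df1.columns.map('_'.join)
print(df1.iloc[[0]].dropna(axis=1))  # the 15mins_* features of the first timestamp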
