Drop rows in a pandas DataFrame up to a certain value - python

I'm currently working with a pandas data frame, with approximately 80000 rows, like the following one:
artist            date
Drake             2014-10-12
Kendrick Lamar    2014-10-12
Ed Sheeran        2014-10-12
Maroon 5          2014-10-12
Rihanna           2014-10-19
Foo Fighters      2014-10-19
Bad Bunny         2014-10-19
Eminem            2014-10-19
Drake             2014-10-26
Eminem            2014-10-26
Taylor Swift      2014-10-26
Kendrick Lamar    2014-10-26
Rihanna           2014-11-02
Ed Sheeran        2014-11-02
Kanye West        2014-11-02
Lime Cordiale     2014-11-02
I want to drop the rows up to, but not including, the last date. The result should be something like:
artist            date
Drake             2014-10-26
Eminem            2014-10-26
Taylor Swift      2014-10-26
Kendrick Lamar    2014-10-26
Rihanna           2014-11-02
Ed Sheeran        2014-11-02
Kanye West        2014-11-02
Lime Cordiale     2014-11-02
I tried using the pandas .drop() method, as in the last line of the following block:
dataset = pd.read_csv("charts.csv")
dataset = pd.DataFrame(dataset)
dataset = dataset.loc[dataset['country'] == "us", :]
dataset = dataset.sort_values(by= ["date", "position"])
dataset = dataset.drop(dataset.loc[dataset['date'] <= "2014-10-19", :])
but I get an error after running it.

You could use:
last_date_to_drop = pd.to_datetime("2014-10-19")
dataset["date"] = pd.to_datetime(dataset["date"])
dataset = dataset.loc[dataset["date"].gt(last_date_to_drop)].copy()
You don't need to sort or drop here; just subset the dataframe and take a copy, as above.
Also, drop does not do what you think it does: it doesn't drop by row values, it drops by index or column labels.
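If the cutoff date shouldn't be hardcoded, here is a minimal sketch that derives it from the data instead; the assumption (matching your expected output) is that you want to keep the last two distinct dates:
# Derive the cutoff instead of hardcoding it: the second-to-last distinct date
dataset["date"] = pd.to_datetime(dataset["date"])
cutoff = dataset["date"].drop_duplicates().nlargest(2).min()
dataset = dataset.loc[dataset["date"].ge(cutoff)].copy()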

Not sure what error you got; you should include the error log in the question.
Anyway, you can drop rows by index: filter the data to get the matching index labels, then drop them:
idx = dataset[dataset['date'] <= "2014-10-19"].index
dataset.drop(idx, inplace=True)
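For reference, a self-contained sketch of the same index-based drop on a toy frame:
import pandas as pd

df = pd.DataFrame({
    "artist": ["Drake", "Rihanna", "Taylor Swift", "Kanye West"],
    "date": ["2014-10-12", "2014-10-19", "2014-10-26", "2014-11-02"],
})
# ISO-formatted date strings compare correctly as plain strings
to_drop = df[df["date"] <= "2014-10-19"].index
df = df.drop(index=to_drop)
print(df)
#          artist        date
# 2  Taylor Swift  2014-10-26
# 3    Kanye West  2014-11-02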

If what you actually want is to keep only each artist's most recent entry, sort by date and drop duplicates:
import pandas as pd

df = pd.DataFrame({'artist': ['Drake', 'Kendrick Lamar', 'Kendrick Lamar', 'Drake'],
                   'date': ['2014-10-12', '2014-10-12', '2014-10-26', '2014-10-26']})
# Be careful: sort first, so keep='last' retains each artist's latest entry
df = (df.sort_values(by='date', key=lambda t: pd.to_datetime(t, format='%Y-%m-%d'))
        .drop_duplicates(subset=['artist'], keep='last'))
print(df)
#            artist        date
# 2  Kendrick Lamar  2014-10-26
# 3           Drake  2014-10-26


Add days to date conditionally using Pandas

I have a table that I need to add days to, creating a new column with that information. The problem I am having is that there are two different date calculations, based on another column. Here is a table similar to the one I am working with:
Type Name Date
A Abe 6/2/2021
B Joe 6/15/2021
A Jin 6/25/2021
A Jen 6/1/2021
B Pan 6/21/2021
B Pin 6/22/2021
B Hon 6/11/2021
A Hen 6/23/2021
A Bin 6/23/2021
A Ban 6/5/2021
I am trying to get the table to return like this, where Type A goes up by 7 days and Type B goes up by 2 business days:
Type Name Date NewDate
A Abe 6/2/2021 6/9/2021
B Joe 6/15/2021 6/19/2021
A Jin 6/25/2021 7/2/2021
A Jen 6/1/2021 6/8/2021
B Pan 6/21/2021 6/23/2021
B Pin 6/22/2021 6/26/2021
B Hon 6/11/2021 6/13/2021
A Hen 6/23/2021 6/30/2021
A Bin 6/23/2021 6/30/2021
A Ban 6/5/2021 6/12/2021
So far I have tried these:
import pandas as pd
from pandas.tseries.offsets import BDay
from datetime import datetime, timedelta
df1['NewDate'] = df1.apply(df1['Date'] + timedelta(days=7)
if x=='Emergency' else df1['Date'] + BDay(2) for x in df1['Type'])
Don't run that; you will either end up in what looks like an infinite loop or it will take a very long time.
I've also run this:
df1['NewDate'] = [df1['Date'] + timedelta(days=7) if i=='Emergency' else df1['Date'] + BDay(2)
                  for i in df1.Type]
(I also tried with df1['Type'], with the same results.)
This puts an entire column into every row (it almost looks like the truncated output Jupyter Notebook shows with the ...).
I have also tried this:
df1['NewDate'] = df1['Type'].apply(lambda x: df1['Date'] + timedelta(days=7) if x=='Emergency'
else df1['Date'] + BDay(2))
When I run that one, it goes through each row of Type and applies the correct logic (7 days if Emergency, otherwise 2 business days), but the problem is that every row returned is calculated from the first row of the entire table.
At this point I am a little lost; any help would be greatly appreciated. For simplicity's sake it can be calculated as plus timedelta(7) and plus timedelta(2). Also, what would change if I had to add more conditions, say on the Name column?
To use apply, try:
df["Date"] = pd.to_datetime(df["Date"])
df["NewDate"] = df.apply(lambda x: x["Date"]+BDay(2) if x["Type"]=="B" else x["Date"]+pd.DateOffset(days=7), axis=1)
>>> df
Type Name Date NewDate
0 A Abe 2021-06-02 2021-06-09
1 B Joe 2021-06-15 2021-06-17
2 A Jin 2021-06-25 2021-07-02
3 A Jen 2021-06-01 2021-06-08
4 B Pan 2021-06-21 2021-06-23
5 B Pin 2021-06-22 2021-06-24
6 B Hon 2021-06-11 2021-06-15
7 A Hen 2021-06-23 2021-06-30
8 A Bin 2021-06-23 2021-06-30
9 A Ban 2021-06-05 2021-06-12
Alternatively, you can use numpy.where:
import numpy as np
df["NewDate"] = np.where(df["Type"]=="B", df["Date"]+BDay(2), df["Date"]+pd.DateOffset(7))

Checking if a date is a holiday using the holidays library - python

I have a dataset from the last 3 years, and I would like to add a new column flagging holidays.
When I try this:
import holidays

de_holidays = holidays.DE()
for date, name in sorted(holidays.DE(years=2021).items()):
    print(date, name)
I get the result
2021-01-01 Neujahr
2021-04-02 Karfreitag
2021-04-05 Ostermontag
2021-05-01 Erster Mai
2021-05-13 Christi Himmelfahrt
2021-05-24 Pfingstmontag
2021-10-03 Tag der Deutschen Einheit
2021-12-25 Erster Weihnachtstag
2021-12-26 Zweiter Weihnachtstag
Now I want to create a new column in my existing dataset with True/False for holidays.
I tried the code snippet below.
My Date column looks something like this (dtype is datetime64[ns]):
2021-07-22
2021-07-21
2021-07-20
2021-07-19
#I used the code
import holidays
de_holidays = holidays.DE()
df['Holiday'] = df['Date'].isin(de_holidays)
rslt_df
rslt_df.loc[rslt_df['Date'] == '2021-05-13']
The result I was expecting is True, since May 13th was a holiday, but this code gives all False values. Can anyone help?
Edit:
12390 2021-07-22
12380 2021-07-21
12370 2021-07-20
12360 2021-07-19
12350 2021-07-18
...
40 2018-03-05
30 2018-03-04
20 2018-03-03
10 2018-03-02
0 2018-03-01
Name: Date, Length: 1240, dtype: datetime64[ns]
Now when I use
df['Holiday'] = df['Date'].isin(holidays.DE(years=2021))
I get the correct True/False values, but as soon as I remove the years argument I get all False values:
df['Holiday'] = df['Date'].isin(holidays.DE())
This works well to get the Boolean values. The holidays object is populated lazily: checking a single date with the in operator fills in that year's holidays, but Series.isin does a hash lookup that never triggers this, so the years have to be listed up front:
from datetime import date
import holidays
de_holidays = holidays.DE()
# date(2021, 7, 22) in de_holidays  # a plain membership check would populate 2021 lazily
rslt_df['Holiday'] = rslt_df['Date'].isin(holidays.DE(years=[2018, 2019, 2020, 2021]))
rslt_df
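If you'd rather not hardcode the year list, here is a sketch that derives it from the column itself (assuming Date is already datetime64):
import holidays

# isin never triggers the lazy per-year population that a plain membership
# check does, so pass every year present in the data explicitly
years = [int(y) for y in rslt_df["Date"].dt.year.unique()]
rslt_df["Holiday"] = rslt_df["Date"].isin(holidays.DE(years=years))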

Pandas: how to calculate percentage of a population from elsewhere

I found this data file on covid vaccinations, and I'd like to see the vaccination coverage in (parts of) the population. It'll probably become more clear with the actual example, so bear with me.
If I read the csv using df = pd.read_csv('https://epistat.sciensano.be/Data/COVID19BE_VACC.csv', parse_dates=['DATE']) I get this result:
DATE REGION AGEGROUP SEX BRAND DOSE COUNT
0 2020-12-28 Brussels 18-34 F Pfizer-BioNTech A 1
1 2020-12-28 Brussels 45-54 F Pfizer-BioNTech A 2
2 2020-12-28 Brussels 55-64 F Pfizer-BioNTech A 3
3 2020-12-28 Brussels 55-64 M Pfizer-BioNTech A 1
4 2020-12-28 Brussels 65-74 F Pfizer-BioNTech A 2
I'm particularly interested in the numbers by region & date.
So I regrouped using df.groupby(['REGION','DATE']).sum()
COUNT
REGION DATE
Brussels 2020-12-28 56
2020-12-30 5
2021-01-05 725
2021-01-06 989
2021-01-07 994
... ...
Wallonia 2021-06-18 49567
2021-06-19 43577
2021-06-20 2730
2021-06-21 37193
2021-06-22 16938
In order to compare vaccination 'speeds' in different regions I have to transform the data from absolute to relative numbers, using the population from each region.
I have found some posts explaining how to calculate percentages in a multi-index dataframe like this, but the problem is that I want to divide each COUNT by a population number that is not in the original dataframe.
The population numbers are here below
REGION POP
Flanders 6629143
Wallonia 3645243
Brussels 1218255
I think the solution must be in looping through the original df and checking both REGIONs or index levels, but I have absolutely no idea how. It's a technique I'd like to master, because it might come in handy when I want some other subsets with different populations (AGEGROUP or SEX maybe).
Thank you so much for reading this far!
Disclaimer: I've only just started out using Python, and this is my very first question on Stack Overflow, so please be gentle with me... The reason why I'm posting this is because I can't find an answer anywhere else. This is probably because I haven't got the terminology down and I don't exactly know what to look for ^_^
One option would be to reformat the population_df with set_index + rename:
population_df = pd.DataFrame({
'REGION': {0: 'Flanders', 1: 'Wallonia', 2: 'Brussels'},
'POP': {0: 6629143, 1: 3645243, 2: 1218255}
})
denom = population_df.set_index('REGION').rename(columns={'POP': 'COUNT'})
denom:
COUNT
REGION
Flanders 6629143
Wallonia 3645243
Brussels 1218255
Then div the result of the groupby sum against denom, aligning on level=0 (the rename above is what makes the COUNT column labels line up):
new_df = df.groupby(['REGION', 'DATE']).agg({'COUNT': 'sum'}).div(denom, level=0)
new_df:
COUNT
REGION DATE
Brussels 2020-12-28 0.000046
2020-12-30 0.000004
2021-01-05 0.000595
2021-01-06 0.000812
2021-01-07 0.000816
... ...
Wallonia 2021-06-18 0.013598
2021-06-19 0.011954
2021-06-20 0.000749
2021-06-21 0.010203
2021-06-22 0.004647
Or as a new column:
new_df = df.groupby(['REGION', 'DATE']).agg({'COUNT': 'sum'})
new_df['NEW'] = new_df.div(denom, level=0)
new_df:
COUNT NEW
REGION DATE
Brussels 2020-12-28 56 0.000046
2020-12-30 5 0.000004
2021-01-05 725 0.000595
2021-01-06 989 0.000812
2021-01-07 994 0.000816
... ... ...
Wallonia 2021-06-18 49567 0.013598
2021-06-19 43577 0.011954
2021-06-20 2730 0.000749
2021-06-21 37193 0.010203
2021-06-22 16938 0.004647
You could run reset_index() on the groupby and then run df.apply on a custom function that does the calculations:
import pandas as pd
df = pd.read_csv('https://epistat.sciensano.be/Data/COVID19BE_VACC.csv', parse_dates=['DATE'])
df = df.groupby(['REGION','DATE']).sum().reset_index()
def calculate(row):
    if row['REGION'] == 'Flanders':
        return row['COUNT'] / 6629143
    elif row['REGION'] == 'Wallonia':
        return row['COUNT'] / 3645243
    elif row['REGION'] == 'Brussels':
        return row['COUNT'] / 1218255

df['REL_COUNT'] = df.apply(calculate, axis=1)  # axis=1 passes rows as input; axis=0 would pass columns
Output df.head():
     REGION                 DATE  COUNT  REL_COUNT
0  Brussels  2020-12-28 00:00:00     56   0.000046
1  Brussels  2020-12-30 00:00:00      5   0.000004
2  Brussels  2021-01-05 00:00:00    725   0.000595
3  Brussels  2021-01-06 00:00:00    989   0.000812
4  Brussels  2021-01-07 00:00:00    994   0.000816
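A more vectorized alternative (a sketch, starting again from the raw df as read from the CSV): keep the populations in a plain dict and let Series.map supply the denominator, avoiding the per-row Python call:
import pandas as pd

pop = {"Flanders": 6629143, "Wallonia": 3645243, "Brussels": 1218255}

df = df.groupby(["REGION", "DATE"], as_index=False)["COUNT"].sum()
df["REL_COUNT"] = df["COUNT"] / df["REGION"].map(pop)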

Change year/quarter date format to previous period in python

I have a dataset containing monthly observations of a time-series.
What I want to do is transform the datetime to a year/quarter format and then set the first value, DATE[0], to the previous quarter. For example, 2006-10-31 belongs to Q4 of 2006, but I want to change it to 2006Q3.
For the extraction of the subsequent values I will just use the last value from each quarter.
So, for 2006Q4 I will keep BBGN, SSD, and QQ4567 values only from DATE[2]. Similarly, for 2007Q1 I will keep only DATE[5] values, and so forth.
Original dataset:
DATE BBGN SSD QQ4567
0 2006-10-31 00:00:00 1.210 22.022 9726.550
1 2006-11-30 00:00:00 1.270 22.060 9891.008
2 2006-12-31 00:00:00 1.300 22.080 10055.466
3 2007-01-31 00:00:00 1.330 22.099 10219.924
4 2007-02-28 00:00:00 1.393 22.110 10350.406
5 2007-03-31 00:00:00 1.440 22.125 10480.888
After processing the DATE
DATE BBGN SSD QQ4567
0 2006Q3 1.210 22.022 9726.550
2 2006Q4 1.300 22.080 10055.466
5 2007Q1 1.440 22.125 10480.888
The steps I have taken so far are:
Turn the values from the yyyy-mm-dd hh format to yyyyQQ format
DF['DATE'] = pd.to_datetime(DF['DATE']).dt.to_period('Q')
and I get this
DATE BBGN SSD QQ4567
0 2006Q4 1.210 22.022 9726.550
1 2006Q4 1.270 22.060 9891.008
2 2006Q4 1.300 22.080 10055.466
3 2007Q1 1.330 22.099 10219.924
4 2007Q1 1.393 22.110 10350.406
5 2007Q1 1.440 22.125 10480.888
The next step is to extract the last values from each quarter. But because I always want to keep the first row I will exclude DATE[0] from the function.
quarterDF = DF.iloc[1:,].drop_duplicates(subset='DATE', keep='last')
Now, my question is how can I change the value in DATE[0] to always be the previous quarter. So, from 2006Q4 to be 2006Q3. Also, how this will work if DATE[0] is 2007Q1, can I change it to 2006Q4?
My suggestion would be to create a new date column shifted 3 months into the past, like this:
import pandas as pd
df = pd.DataFrame()
df['Date'] = pd.to_datetime(['2006-10-31', '2007-01-31'])
one_quarter = pd.tseries.offsets.DateOffset(months=3)
df['Last_quarter'] = df.Date - one_quarter
This will give you
Date Last_quarter
0 2006-10-31 2006-07-31
1 2007-01-31 2006-10-31
Then you can do the same process as you described above on Last_quarter
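For completeness, a minimal sketch of that remaining step on the frame built above:
# Format the shifted date as a quarterly period
df['Last_quarter'] = df['Last_quarter'].dt.to_period('Q')
print(df)
#         Date Last_quarter
# 0 2006-10-31       2006Q3
# 1 2007-01-31       2006Q4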
Here is a pivot_table approach
# Subtract a quarter-end offset from the date and save it in a column
df['Q'] = df['DATE'] - pd.tseries.offsets.QuarterEnd()
#0 2006-09-30
#1 2006-09-30
#2 2006-09-30
#3 2006-12-31
#4 2006-12-31
#5 2006-12-31
#Name: Q, dtype: datetime64[ns]
# Drop the date columns and pivot, keeping the last row per quarter
ndf = df.drop(columns=['DATE','Q']).pivot_table(index=df['Q'].dt.to_period('Q'), aggfunc='last')
        BBGN     QQ4567     SSD
Q
2006Q3  1.30  10055.466  22.080
2006Q4  1.44  10480.888  22.125
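An equivalent groupby formulation, if you prefer it over pivot_table (a sketch; it keys on the shifted quarter and keeps the last row of each group):
# Same result with groupby: key on the shifted quarter, take the last rows
key = df['Q'].dt.to_period('Q')
ndf = df.drop(columns=['DATE', 'Q']).groupby(key).last()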

changing relative times to actual dates in a pandas dataframe

I have currently a dataframe I created by scraping google news headlines. One of my columns is "Time", which refers to time of publication of an article.
Unfortunately, for recent articles, google news uses a "relative" date, e.g., 6 hours ago, or 1 day ago instead of Nov 1, 2017.
I really want to convert these relative dates to be consistent with the other entries (so they also say Nov 12, 2017, for example), but I have no idea where to even start on this.
My thoughts are to maybe make a variable which represents today's date, then search through the dataframe for entries that don't match my format and subtract those relative times from the current date. I would also have to make some sort of filter for entries with "hours ago" and just set those equal to the current date.
I don't really want a solution but rather a general idea of what to read to try to solve this. Am I supposed to try using numpy?
Example of some rows:
Publication Time Headline
0 The San Diego Union-Tribune 6 hours ago I am not opposed to new therapeutic modalities...
1 Devon Live 13 hours ago If you're looking for a bargain this Christmas...
15 ABS-CBN News 1 day ago Now, Thirdy has a chance to do something that ...
26 New York Times Nov 2, 2017 Shepherds lead their sheep through the centre ...
You can use to_datetime with to_timedelta first and then use combine_first with floor:
#create dates
dates = pd.to_datetime(df['Time'], errors='coerce')
#create timedeltas from the relative entries
times = pd.to_timedelta(df['Time'].str.extract(r'(.*)\s+ago', expand=False))
#combine final datetimes (pd.datetime has been removed from pandas, so use pd.Timestamp)
df['Time'] = (pd.Timestamp.now() - times).combine_first(dates).dt.floor('D')
print(df)
Publication Time \
0 The San Diego Union-Tribune 2017-11-12
1 Devon Live 2017-11-11
2 ABS-CBN News 2017-11-11
3 New York Times 2017-11-02
Headline
0 I am not opposed to new therapeutic modalities
1 If you're looking for a bargain this Christmas
2 Now, Thirdy has a chance to do something that
3 Shepherds lead their sheep through the centre
print (df['Time'])
0 2017-11-12
1 2017-11-11
2 2017-11-11
3 2017-11-02
Name: Time, dtype: datetime64[ns]
Your approach should work. Use Pandas Timedelta to subtract relative dates from the current date.
For example, given your sample data as:
Publication;Time;Headline
The San Diego Union-Tribune;6 hours ago;I am not opposed to new therapeutic modalities
Devon Live;13 hours ago;If you're looking for a bargain this Christmas
ABS-CBN News;1 day ago;Now, Thirdy has a chance to do something that
New York Times;Nov 2, 2017;Shepherds lead their sheep through the centre
Read in the data from the clipboard (although you could just as easily substitute with read_csv() or some other file format):
import pandas as pd
from datetime import datetime
df = pd.read_clipboard(sep=";")
For the dates that are already in date format, Pandas is smart enough to convert them with to_datetime():
absolute_date = pd.to_datetime(df.Time, errors="coerce")
absolute_date
0 NaT
1 NaT
2 NaT
3 2017-11-02
Name: Time, dtype: datetime64[ns]
For the relative dates, once we drop the "ago" part, they're basically in the right format to convert with pd.Timedelta:
relative_date = (datetime.today() -
                 df.Time.str.extract("(.*) ago", expand=False).apply(pd.Timedelta))
relative_date
0 2017-11-11 17:05:54.143548
1 2017-11-11 10:05:54.143548
2 2017-11-10 23:05:54.143548
3 NaT
Name: Time, dtype: datetime64[ns]
Now fill in the respective NaN values from each set, absolute and relative (updated to use combine_first(), via Jezrael's answer):
date = relative_date.combine_first(absolute_date)
date
0 2017-11-11 17:06:29.658925
1 2017-11-11 10:06:29.658925
2 2017-11-10 23:06:29.658925
3 2017-11-02 00:00:00.000000
Name: Time, dtype: datetime64[ns]
Finally, pull out just the date from the datetime:
date.dt.date
0 2017-11-11
1 2017-11-11
2 2017-11-10
3 2017-11-02
Name: Time, dtype: object
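For quick testing on a current pandas, a condensed self-contained sketch of the same combine_first idea (toy data assumed):
import pandas as pd

df = pd.DataFrame({"Time": ["6 hours ago", "1 day ago", "Nov 2, 2017"]})

absolute = pd.to_datetime(df["Time"], errors="coerce")  # NaT for the relative entries
deltas = pd.to_timedelta(df["Time"].str.extract(r"(.*)\s+ago", expand=False))
df["Time"] = (pd.Timestamp.now() - deltas).combine_first(absolute).dt.floor("D")
print(df["Time"])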
