Changing the values of a column in pandas dataframe - python

I scraped an HTML table from an NBA game as a pandas dataframe.
import pandas as pd
url = 'https://www.basketball-reference.com/boxscores/pbp/200911060GSW.html'
dfs = pd.read_html(url)
df = dfs[0]
df.rename(columns={'Unnamed: 2_level_1': 'PM1', 'Unnamed: 4_level_1': 'PM2'}, inplace=True)
df
I have the column "Time", which starts at 12:00.0 and counts down to 0:00.0, and this repeats for every quarter.
I want the time as an overall game time, so that it begins at 48:00.0 and counts down.
My approach: overall_time(i) = overall_time(i-1) - (quarter_time(i-1) - quarter_time(i))
e.g. 48:00.0 - (12:00.0 - 11:46.0) = 47:46.0 for the first row of my dataframe
I think this should work, but I am struggling to implement it in Python. Maybe someone can help me with this.

There is probably a better way, but I felt I needed to convert from the 'Time' string format like 11:30, which is hard to subtract, to a fractional 11.5 and back again, and then do a bit of fussing with formatting.
import pandas as pd
import numpy as np
pd.options.mode.chained_assignment = None  # default='warn'
url = 'https://www.basketball-reference.com/boxscores/pbp/200911060GSW.html'
dfs = pd.read_html(url)
df = dfs[0]
df.rename(columns={'Unnamed: 2_level_1': 'PM1', 'Unnamed: 4_level_1': 'PM2'}, inplace=True)
df.columns = df.columns.droplevel() #columns currently multiindex, you don't need 1st Q, drop it
df = df[df['Time'].str.contains(':')] #only include rows with a real 'Time' that contains a colon, excludes headers
#Identify which rows signify the start of a new quarter
#has to have 12 minutes of time and text of 'Start of...' in the 'Score' column
quarter_start_rows = df['Time'].eq('12:00.0') & df['Score'].str.startswith('Start of')
#create a new column called quarter w/ 1 at new quarter, 0 otherwise then cumsum
df['Quarter'] = np.where(quarter_start_rows,1,0).cumsum()
#separate the minutes and seconds and make them int and float respectively
df[['Minutes','Seconds']] = df['Time'].str.split(':',expand=True).astype({0:'int',1:'float'})
#represent Q2 11:30 as 11.5 etc so it is easy to add/subtract times
fractional_time = df['Minutes'].add(df['Seconds'].div(60))
#convert from Q2 11:30 (11.5) to 'global time' which would be 35.5
global_fractional_time = fractional_time.add((4-df['Quarter'])*12)
#convert from fractional time back to Minutes and Seconds
minutes = global_fractional_time.astype(int)
seconds = global_fractional_time.sub(minutes).multiply(60).round(1)
#Make a new string column to show the global minutes and seconds more nicely
df['Overall Time'] = minutes.astype(str).str.zfill(2)+':'+seconds.astype(str).str.zfill(4)
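If you prefer to avoid the manual fractional-minutes arithmetic, here is a rough alternative sketch that reuses the Minutes, Seconds and Quarter columns built above and lets pandas timedeltas do the adding (like the code above, it assumes a regulation four-quarter game):
# time remaining in the current quarter as a timedelta
quarter_remaining = pd.to_timedelta(df['Minutes'], unit='min') + pd.to_timedelta(df['Seconds'], unit='s')
# add 12 minutes for every quarter still to come
overall = quarter_remaining + pd.to_timedelta((4 - df['Quarter']) * 12, unit='min')
# format back to a MM:SS.s string
total_seconds = overall.dt.total_seconds()
df['Overall Time'] = (
    (total_seconds // 60).astype(int).astype(str).str.zfill(2)
    + ':'
    + (total_seconds % 60).round(1).astype(str).str.zfill(4)
)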

How can I merge the data of two columns within the same DataFrame?

Here is a picture of df1 = fatalities.
So, in order to create a diagram that displays the years with the most injuries (I have an assignment about plane crash incidents in Greece from 2000-2020), I need to create a column out of the minor_injuries and serious_injuries ones.
I had a first df with more data, but I tried to keep only the columns that I needed, so we have the fatalities df1, which contains the years, the fatal_injuries, the minor_injuries, the serious_injuries and the total number of incidents per year (all_incidents). What I wish to do is merge the minor and serious injuries into a column named total_injuries or just injuries.
import pandas as pd
pd.set_option('display.max_rows', None)
df = pd.read_csv('all_incidents_cleaned.csv')
df.head()
df['Year'] = pd.to_datetime(df.incident_date).dt.year
fatalities = df.groupby('Year').fatalities.value_counts().unstack().reset_index()
fatalities['all_incidents'] = fatalities[['Θανάσιμος τραυματισμός', 'Μικρός τραυματισμός',
    'Σοβαρός τραυματισμός', 'Χωρίς Τραυματισμό']].sum(axis=1)
df['percentage_deaths_to_all_incidents'] = round((fatalities['Θανάσιμος τραυματισμός'] / fatalities['all_incidents']) * 100, 1)
df1 = fatalities
fatalities_pd = pd.DataFrame(fatalities)
df1
fatalities_pd.rename(columns={'Θανάσιμος τραυματισμός': 'fatal_injuries', 'Μικρός τραυματισμός': 'minor_injuries',
    'Σοβαρός τραυματισμός': 'serious_injuries', 'Χωρίς Τραυματισμό': 'no_injuries'}, inplace=True)
df1
For your current dataset, two steps are needed.
First, I would replace the NaN values with 0.
This can be done with:
df1 = df1.fillna(0)
Then you can create a new column "total_injuries" with the sum of minor and serious injuries:
df1["total_injuries"]=df1["minor_injuries"]+df1["serious_injuries"]
It's always nice to check your data for consistency before working on it. Helpful commands look like:
data.shape
data.info()
data.isna().values.any()
data.duplicated().values.any()
duplicated_rows = data[data.duplicated()]
len(duplicated_rows)
data.describe()

How to append each day's new record as a new row in a pandas dataframe?

I have a dataframe which contains an aggregated value up to today. My data stream updates every day, and I want to monitor the change for each day. How can I append each new day's data to the dataframe?
The format will be like this,
Date Agg_Value
2022-12-07 0.43
2022-12-08 0.44
2022-12-09 0.41
... ...
You want to use pandas.concat to bring together 2 dataframes:
import pandas as pd
s1 = pd.Series(['a', 'b'])
s2 = pd.Series(['c', 'd'])
print(pd.concat([s1, s2]))
Assuming that you always have yesterday's dataframe available, you can simply execute a script that runs every day to get today's date and concatenate the result.
# Assuming you have todays value
today_value = 42
# Get today's date
from datetime import date
today = date.today()
# Create a new dataframe with today's value
df1 = pd.DataFrame(
    {
        'Date': [str(today)],
        'Agg_value': [today_value]
    }
)
# Update df by concatenating today's data
df = pd.concat([df, df1], axis=0)
There are multiple ways.
Creating an example dataframe:
import pandas as pd
import numpy as np
Date = pd.date_range("2022-12-01", periods=4, freq="D")
Avg_value = np.random.rand(4)
df = pd.DataFrame(list(zip(Date, Avg_value)), columns=['Date','Avg_value'])
df
This gives the example dataframe.
Using .loc
Add a single row with:
df.loc[len(df.index)] = [pd.Timestamp("2022-12-05"), 0.67]
Using append
new_row = {'Date': pd.Timestamp("2022-12-06"), 'Avg_value': 0.39}
df = df.append(new_row, ignore_index=True)
df
Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so this only works on older pandas versions.
The concat example is explained in Arturo Sbr's answer.
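For newer pandas versions, a concat-based sketch of that same single-row insert (reusing the new_row dict above) would be:
# wrap the dict in a one-row DataFrame, then concatenate it onto df
new_row = {'Date': pd.Timestamp("2022-12-06"), 'Avg_value': 0.39}
df = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)
df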

Pandas DF, DateOffset, creating new column

So I'm working with the JHU covid19 data, and they've let their recovered dataset go; they're no longer tracking it, just confirmed cases and deaths. What I'm trying to do here is recreate it. The table holds the confirmed cases and deaths for every country for every date, sorted by date, and my getRecovered function below attempts to take the date of each row, find the date two weeks before that for the country of that row, and build a 'Recovered' column, which is the confirmed count of two weeks ago minus the deaths today.
Maybe a pointless exercise, but I would still like to know how to do it haha. I know it's a big dataset and there are a lot of operations there, but I've been running it for 20 minutes now and it's still going. Did I do something wrong, or would it just take this long?
Thanks for any help, friends.
import wget
import pandas as pd

urls = [
'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv',
'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv'
]
[wget.download(url) for url in urls]
confirmed = pd.read_csv('time_series_covid19_confirmed_global.csv')
deaths = pd.read_csv('time_series_covid19_deaths_global.csv')
dates = confirmed.columns[4:]
confirmed_long_form = confirmed.melt(
    id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'],
    value_vars=dates,
    var_name='Date',
    value_name='Confirmed'
)
deaths_long_form = deaths.melt(
    id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'],
    value_vars=dates,
    var_name='Date',
    value_name='Deaths'
)
full_table = confirmed_long_form.merge(
    right=deaths_long_form,
    how='left',
    on=['Province/State', 'Country/Region', 'Date', 'Lat', 'Long']
)
full_table['Date'] = pd.to_datetime(full_table['Date'])
full_table = full_table.sort_values(by='Date', ascending=True)
def getRecovered(row):
    ts = row['Date']
    country = row['Country/Region']
    ts = pd.Timestamp(ts)
    do = pd.tseries.offsets.DateOffset(n=14)
    newTimeStamp = ts - do
    oldrow = full_table.loc[(full_table['Date'] == newTimeStamp) & (full_table['Country/Region'] == country)]
    return oldrow['Confirmed'] - row['Deaths']

full_table['Recovered'] = full_table.apply(lambda row: getRecovered(row), axis=1)
full_table
Your function is being applied row by row, which is likely why performance is suffering. Pandas is fastest when you make use of vectorised functions. For example you can use
pd.to_datetime(full_table['Date'])
to convert the whole date column much faster (see here: Convert DataFrame column type from string to datetime).
You can then add the date offset to that column, something like:
full_table['Recovery_date'] = pd.to_datetime(full_table['Date']) - pd.tseries.offsets.DateOffset(n = 14)
You can then self merge the table on date==recovery_date (plus any other keys) and subtract the numbers.
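A minimal sketch of that self-merge, assuming the full_table from the question and the Recovery_date column created above (the lookup frame and the Confirmed_14d_ago name are just illustrative):
# build a lookup of confirmed counts keyed by date, then join it back on the shifted date
lookup = full_table[['Province/State', 'Country/Region', 'Date', 'Confirmed']].rename(
    columns={'Date': 'Recovery_date', 'Confirmed': 'Confirmed_14d_ago'}
)
full_table = full_table.merge(
    lookup,
    how='left',
    on=['Province/State', 'Country/Region', 'Recovery_date']
)
# recovered = confirmed two weeks ago minus deaths today
full_table['Recovered'] = full_table['Confirmed_14d_ago'] - full_table['Deaths']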

SyntaxError comparing a date with a field in Python

Q) Resample the data to get prices for the end of the business month. Select the Adjusted Close for each stock.
import pandas as pd
mmm = pd.read_csv('mmm.csv')
ibm = pd.read_csv('ibm.csv')
fb = pd.read_csv('fb.csv')
amz_date = amz.loc['Date']==2017-6-30 #dataframe showing entire row for that date
amz_price = amz_date[:, ['Date', 'AdjClose']] #dataframe with only these 2 columns
mmm_date = mmm.loc['Date']==2017-6-30 #dataframe showing entire row for that date
mmm_price = mmm_date[:, ['Date', 'AdjClose']]
ibm_date = ibm.loc['Date']==2017-6-30 #dataframe showing entire row for that date
ibm_price = ibm_date[:, ['Date', 'AdjClose']]
fb_date = fb.loc['Date']==2017-6-30 #dataframe showing entire row for that date
fb_price = fb_date[:, ['Date', 'AdjClose']]
KeyError: 'Date'
What am I doing wrong? Also, Date is a column in the csv file.
Your particular problem is that "06" is not a legal way to describe a value here. To do arithmetic, you'd need to drop the leading zero.
Your next problem is that 2017-06-30 is (would be) an arithmetic expression that evaluates to the integer 1981. You need to express this as a date, such as datetime.strptime("6-30-2017", "%m-%d-%Y").
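For completeness, a sketch of what the lookup could look like once the dates are parsed properly (assuming each CSV has Date and AdjClose columns, as in the question):
import pandas as pd

# parse the Date column as datetimes while reading
mmm = pd.read_csv('mmm.csv', parse_dates=['Date'])

# boolean mask for the rows on 30 June 2017, keeping only the two columns of interest
mmm_price = mmm.loc[mmm['Date'] == pd.Timestamp('2017-06-30'), ['Date', 'AdjClose']]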

Dataframe drop between_time multiple rows by shifting timedelta

I would like to drop multiple groups of rows by a time criterion. The date criterion may be ignored.
I have a dataframe that contains 100 million rows, with a sampling frequency of around 0.001 s, although it varies between columns.
The goal is to drop multiple rows by a "shifting" criterion. The leave duration might be 0.01 seconds and the drop duration might be 0.1 seconds, as shown in the figure.
I have many problems with Timestamp-to-Time conversions and with defining the one-liner that will drop multiple groups of rows.
I made attempts with the following code:
import pandas as pd
from datetime import timedelta#, timestamp
from datetime import datetime
import numpy as np
# leave_duration=0.01 seconds
# drop_duration=0.1 seconds
i = pd.date_range('2018-01-01 00:01:15.004', periods=1000, freq='2ms')
i=i.append(pd.date_range('2018-01-01 00:01:15.004', periods=1000, freq='3ms'))
i=i.append(pd.date_range('2018-01-01 00:01:15.004', periods=1000, freq='0.5ms'))
df = pd.DataFrame({'A': range(len(i))}, index=i)
df=df.sort_index()
minimum_time=df.index.min()
print("Minimum time:",minimum_time)
maximum_time=df.index.max()
print("Maximum time:",maximum_time)
# futuredate = minimum_time + timedelta(microseconds=100)
print("Dataframe before dropping:\n",df)
df.drop(df.between_time(*pd.to_datetime([minimum_time, maximum_time]).time).index, inplace=True)
print("Dataframe after dropping:\n",df)
# minimum_time=str(minimum_time).split()
# minimum_time=minimum_time[1]
# print(minimum_time)
# maximum_time=str(maximum_time).split()
# maximum_time=maximum_time[1]
# print(maximum_time)
How can I drop rows by time criterion, with shifting?
Working for me:
df = df.loc[(df.index - df.index[0]) % pd.to_timedelta('110ms') > pd.to_timedelta('100ms')]
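In words: the elapsed time since the first sample is folded into 110 ms cycles (100 ms drop + 10 ms leave), and only rows in the trailing 10 ms of each cycle are kept. A slightly more explicit sketch with the question's durations:
# same idea, parameterized by the question's leave/drop durations
leave = pd.to_timedelta('10ms')    # keep 0.01 s
drop = pd.to_timedelta('100ms')    # after dropping 0.1 s
cycle = drop + leave               # 110 ms per cycle

elapsed = df.index - df.index[0]       # time since the first sample
df = df.loc[elapsed % cycle > drop]    # keep only the trailing 'leave' window of each cycle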
I think this is what you're looking for. If not, it hopefully gets you closer.
I defined drop periods by taking the minimum time and incrementing it by your drop/leave times, appending each to a dictionary where the key is the start of the drop period and the value is the end of the drop period.
Lastly I just iterate through the dictionary and drop rows that fall between those two times in your dataframe, shedding rows at each step.
drop_periods = {}
start_drop = minimum_time + timedelta(seconds=0.01)
end_drop = start_drop + timedelta(seconds=0.1)
drop_periods[start_drop] = end_drop
while end_drop < maximum_time:
    start_drop = end_drop + timedelta(seconds=0.01)
    end_drop = start_drop + timedelta(seconds=0.1)
    drop_periods[start_drop] = end_drop

for start, end in drop_periods.items():
    print("Dataframe before dropping:\n", len(df))
    df.drop(df.between_time(*pd.to_datetime([start, end]).time).index, inplace=True)
    print("Dataframe after dropping:\n", len(df))
