For about half a year now I have been into Python and all its incredible libraries, such as pandas DataFrames.
I am struggling to implement the iteration logic (see attached image) in my code. The logic is pretty clear to me, but unfortunately I am not able to get the snippet coded!
I was wondering if there is someone out there who can give me the right hint?
Thank you very much in advance!
Transparent iteration logic
df1 = pd.to_datetime(['01.01.2020', '15.01.2020', '01.02.2020', '01.03.2020', '15.03.2020', '01.04.2020', '01.05.2020', '01.06.2020', '01.07.2020', '01.08.2020', '01.09.2020', '01.10.2020'], format="%d.%m.%Y")
df2 = pd.to_datetime(['01.01.2020', '14.01.2020', '04.03.2020', '20.03.2020', '17.07.2020', '19.09.2020'], format="%d.%m.%Y")
import pandas as pd
df1 = pd.to_datetime(['01.01.2020', '15.01.2020', '01.02.2020', '01.03.2020', '15.03.2020', '01.04.2020', '01.05.2020', '01.06.2020', '01.07.2020', '01.08.2020', '01.09.2020', '01.10.2020'], format="%d.%m.%Y")
df2 = pd.to_datetime(['01.01.2020', '14.01.2020', '04.03.2020', '20.03.2020', '17.07.2020', '19.09.2020', '03.11.2021'], format="%d.%m.%Y")
# sort both date lists so we can sweep through them in order
lst_df1 = list(df1.sort_values())
lst_df2 = list(df2.sort_values())

dict_df3 = {}
window_start = lst_df2[0]
window_stop = lst_df2[1]

for date in lst_df1:
    # advance the df2 window until it contains the current df1 date
    while date > window_stop:
        window_start = lst_df2[0]
        lst_df2 = lst_df2[1:]
        window_stop = lst_df2[0]
    # map the df1 date to the start of its window
    dict_df3[date] = window_start

df3 = pd.DataFrame.from_dict(dict_df3, orient='index').reset_index()
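For reference, the same mapping can also be written without an explicit loop. Below is a minimal sketch using pd.merge_asof, assuming the goal is to pair each df1 date with the most recent df2 date that is not after it (the names left, right and df3_alt are mine):

import pandas as pd

# Both frames must be sorted on their keys for merge_asof.
left = pd.DataFrame({'date': df1.sort_values()})
right = pd.DataFrame({'window_start': df2.sort_values()})

# direction='backward' picks, for every df1 date, the closest df2 date <= it.
df3_alt = pd.merge_asof(left, right,
                        left_on='date', right_on='window_start',
                        direction='backward')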
I've seen other people pose this question, but I have been through the solutions and nothing is working so far.
My DataFrame isn't dropping duplicates. I don't know how to fix it.
import numpy as np
import pandas as pd
from datetime import date, timedelta

combined_input_dat = pd.concat([input_dat, subcatch_dat_big], axis=1)
combined_dat = pd.concat([combined_input_dat, output_nolid_dat], axis=1)

# one date label per day from 2010-01-01 to 2020-12-31
date_first = date(2010, 1, 1)
date_last = date(2020, 12, 31)
date_delta = date_last - date_first
row_names_date = [(date_first + timedelta(days=i)).strftime('%m/%d/%Y') for i in range(date_delta.days + 1)]

# repeat each date once per subcatchment
n_subcatchments = 140
row_names_date_long = np.repeat(row_names_date, n_subcatchments)
combined_input_dat['date'] = row_names_date_long

combined_input_dat.drop_duplicates(keep=False, inplace=True)
I'm able to drop the duplicates before this code, but not after. Any suggestions would be greatly appreciated.
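As a side note on how drop_duplicates behaves (a small self-contained sketch, not based on the actual data above): keep=False removes every row that occurs more than once rather than keeping one copy, and rows only count as duplicates when all compared columns match, so adding a new column can make previously duplicated rows unique:

import pandas as pd

demo = pd.DataFrame({'site': ['A', 'A', 'B'], 'value': [1, 1, 2]})

print(demo.drop_duplicates(keep='first'))   # keeps one 'A' row and the 'B' row
print(demo.drop_duplicates(keep=False))     # drops both 'A' rows, keeps only 'B'

# Adding a column that differs between the 'A' rows makes them unique,
# so drop_duplicates no longer removes anything.
demo['date'] = ['01/01/2010', '01/02/2010', '01/01/2010']
print(demo.drop_duplicates(keep=False))     # all three rows remain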
So I'm working with the JHU COVID-19 data, and they've let their recovered dataset go; they're no longer tracking it, just confirmed cases and deaths. What I'm trying to do here is recreate it. The table contains the confirmed cases and deaths for every country for every date, sorted by date, and my getRecovered function below attempts to take the date and country of each row, find the row for that country two weeks earlier, and return a 'Recovered' column, which is the confirmed count of two weeks ago minus the deaths today.
Maybe a pointless exercise, but I would still like to know how to do it, haha. I know it's a big dataset and there are a lot of operations, but it has been running for 20 minutes now and it's still going. Did I do something wrong, or would it just take this long?
Thanks for any help, friends.
import wget
import pandas as pd

urls = [
    'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv',
    'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv'
]

# download both CSVs into the working directory
[wget.download(url) for url in urls]

confirmed = pd.read_csv('time_series_covid19_confirmed_global.csv')
deaths = pd.read_csv('time_series_covid19_deaths_global.csv')
dates = confirmed.columns[4:]
confirmed_long_form = confirmed.melt(
    id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'],
    value_vars=dates,
    var_name='Date',
    value_name='Confirmed'
)

deaths_long_form = deaths.melt(
    id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'],
    value_vars=dates,
    var_name='Date',
    value_name='Deaths'
)

full_table = confirmed_long_form.merge(
    right=deaths_long_form,
    how='left',
    on=['Province/State', 'Country/Region', 'Date', 'Lat', 'Long']
)
full_table['Date'] = pd.to_datetime(full_table['Date'])
full_table = full_table.sort_values(by='Date', ascending=True)
def getRecovered(row):
    ts = pd.Timestamp(row['Date'])
    country = row['Country/Region']
    # offset of 14 days (two weeks)
    do = pd.tseries.offsets.DateOffset(n=14)
    newTimeStamp = ts - do
    # row(s) for the same country two weeks earlier
    oldrow = full_table.loc[(full_table['Date'] == newTimeStamp) &
                            (full_table['Country/Region'] == country)]
    return oldrow['Confirmed'] - row['Deaths']

full_table['Recovered'] = full_table.apply(lambda row: getRecovered(row), axis=1)
full_table
Your function is being applied row by row, which is likely why performance is suffering. Pandas is fastest when you make use of vectorised functions. For example you can use
pd.to_datetime(full_table['Date'])
to convert the whole date column much faster (see here: Convert DataFrame column type from string to datetime).
You can then add the date offset to that column, something like:
full_table['Recovery_date'] = pd.to_datetime(full_table['Date']) - pd.tseries.offsets.DateOffset(n = 14)
You can then self-merge the table on date == recovery_date (plus any other keys) and subtract the numbers.
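A rough sketch of that self-merge, assuming the 'Recovered' definition from the question (confirmed cases 14 days earlier minus deaths today, matched per province/country); the lookup frame and the Confirmed_14d_ago column name are only illustrative:

full_table['Date'] = pd.to_datetime(full_table['Date'])
full_table['Recovery_date'] = full_table['Date'] - pd.tseries.offsets.DateOffset(n=14)

# Confirmed counts keyed by their own date, renamed so they join onto each
# row's Recovery_date (i.e. the date 14 days earlier).
lookup = full_table[['Province/State', 'Country/Region', 'Date', 'Confirmed']].rename(
    columns={'Date': 'Recovery_date', 'Confirmed': 'Confirmed_14d_ago'})

full_table = full_table.merge(
    lookup,
    how='left',
    on=['Province/State', 'Country/Region', 'Recovery_date'])

full_table['Recovered'] = full_table['Confirmed_14d_ago'] - full_table['Deaths']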
To make my purpose clear, here is the code that works perfectly (of course I am only including the beginning; the rest is not important here):
df = pd.read_csv(
    'https://github.com/pcm-dpc/COVID-19/raw/master/dati-andamento-nazionale/'
    'dpc-covid19-ita-andamento-nazionale.csv',
    parse_dates=['data'], index_col='data')
df.index = df.index.normalize()
ts = df[['nuovi_positivi']].dropna()
sts = ts.nuovi_positivi
So basically it takes some data from the online GitHub CSV that you may find here:
Link NAZIONALE. It looks at "data", which is Italian for "date", extracts for every date the value of nuovi_positivi, and then puts it into the program.
Now I have to do the same thing with this JSON, which you may find here:
Link Json
As you may see, for every date there are now 21 different values, because Italy has 21 regions (Abruzzo, Basilicata, Campania, and so on), but I am interested ONLY in the values for the region "Veneto". I want to extract only the rows that contain "Veneto" under the label "denominazione_regione", and get for every day the value of "nuovi_positivi".
I tried with:
df = pd.read_json('https://raw.githubusercontent.com/pcm-dpc/COVID-19/master/dati-json/dpc-covid19-ita-regioni.json',
                  parse_dates=['data'], index_col='data', index_row='Veneto')
df.index = df.index.normalize()
ts = df[['nuovi_positivi']].dropna()
sts = ts.nuovi_positivi
but of course it doesn't work. How can I solve this problem? Thanks
try this:
df = pd.read_json('https://raw.githubusercontent.com/pcm-dpc/COVID-19/master/dati-json/dpc-covid19-ita-regioni.json',
                  convert_dates=['data'])
df.index = df['data']
df.index = df.index.normalize()
df = df[df["denominazione_regione"] == 'Veneto']
ts = df[['nuovi_positivi']].dropna()
sts = ts.nuovi_positivi
I currently have the following wikipedia scraper:
import wikipedia as wp
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Wikipedia scraper
wiki_page = 'Climate_of_Italy'
html = wp.page(wiki_page).html().replace(u'\u2212', '-')
def dataframe_cleaning(table_number: int):
    global html
    df = pd.read_html(html, encoding='utf-8')[table_number]
    # keep only the first five rows of the climate table
    df.drop(np.arange(5, len(df.index)), inplace=True)
    df.columns = df.columns.droplevel()
    df.drop('Year', axis=1, inplace=True)
    # pull the parenthesised value and convert it from Fahrenheit to Celsius
    find = r'\((.*?)\)'
    for i, column in enumerate(df.columns):
        if i > 0:
            df[column] = (df[column]
                          .str.findall(find)
                          .map(lambda x: np.round((float(x[0]) - 32) * (5 / 9), 2)))
    return df
potenza_df = dataframe_cleaning(3)
milan_df = dataframe_cleaning(4)
florence_df = dataframe_cleaning(6)
italy_df = pd.concat((potenza_df, milan_df, florence_df))
Produces the following DataFrame:
As you may see, I have concatenated the DataFrames, which results in a number of repeating rows. Using groupby I want to collapse these into a single DataFrame, and using the .agg method I want to apply min, max, and mean. The issue I am facing is that I am unable to apply the .agg method on a row-by-row basis. I know it is a very simple question, but I've been looking through the documentation and sadly cannot figure it out.
Thank you for your help in advance.
P.S. Sorry if this is a repeated question, but I was unable to find a similar solution.
EDIT:
Added desired output (note: this was done in Excel).
Just a quick update: I was able to achieve my desired goal, although I was not able to find a clean way to do it.
concat_df = pd.concat((potenza_df, milan_df, florence_df))
italy_df = pd.DataFrame()

for i, index in enumerate(list(set(concat_df['Month']))):
    if i == 0:
        temp_df = concat_df[concat_df['Month'] == index]
        temp_df = temp_df.groupby('Month').agg(np.max)
    if i in range(1, 4):
        temp_df = concat_df[concat_df['Month'] == index]
        temp_df = temp_df.groupby('Month').agg(np.mean)
    if i == 4:
        temp_df = concat_df[concat_df['Month'] == index]
        temp_df = temp_df.groupby('Month').agg(np.min)
    italy_df = italy_df.append(temp_df)

italy_df = italy_df.apply(lambda x: np.round(x, 2))
italy_df
The code above achieves the desired result; however, it is highly dependent on manual configuration.
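If the goal is simply to see min, mean and max for every temperature column per month, a minimal groupby/agg sketch (assuming concat_df still has a 'Month' column, as in the snippet above; italy_stats is just an illustrative name) could look like this:

concat_df = pd.concat((potenza_df, milan_df, florence_df))

# One row per month, with min/mean/max computed for every numeric column.
italy_stats = (concat_df
               .groupby('Month')
               .agg(['min', 'mean', 'max'])
               .round(2))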
I want to calculate the average number of successful Rattata catches per hour for this whole dataset. I am looking for an efficient way to do this with pandas; I'm new to Python and pandas.
You don't need any loops. Try this; I think the logic is fairly clear.
import pandas as pd

# read the csv
df = pd.read_csv('pkmn.csv', header=0)

# extract the date (and hour) from the timestamp
df['time'] = df['time'].apply(lambda x: pd.to_datetime(str(x)))
df['date'] = df['time'].dt.date
df['hour'] = df['time'].dt.hour  # assumes the csv has no separate 'hour' column

# main transformations
df = df.query("Pokemon == 'rattata' and caught == True").groupby('hour')

result = pd.DataFrame()
result['caught total'] = df['hour'].count()
result['days'] = df['date'].nunique()
result['caught average'] = result['caught total'] / result['days']
If you have your pandas DataFrame saved as df, this should work:

rats = df.loc[df.Pokemon == "rattata"]   # subset of rows relating to Rattata
total = sum(rats.Caught)                 # total number caught

# difference between the first and last timestamps (assumes the 'time' column is datetime)
diff = rats.time.iloc[-1] - rats.time.iloc[0]
hours = diff.total_seconds() / 3600      # convert the Timedelta to hours

average = total / hours                  # number caught per hour