PYTHON, Pandas Dataframe: how to select and read only certain rows - python

For the purpose of being clear, here is the code that works perfectly (of course I put only the beginning; the rest is not important here):
import pandas as pd

df = pd.read_csv(
    'https://github.com/pcm-dpc/COVID-19/raw/master/dati-andamento-nazionale/'
    'dpc-covid19-ita-andamento-nazionale.csv',
    parse_dates=['data'], index_col='data')
df.index = df.index.normalize()
ts = df[['nuovi_positivi']].dropna()
sts = ts.nuovi_positivi
So basically it takes the data from the online GitHub CSV that you may find here:
Link NAZIONALE, looks at the column "data" (which is Italian for "date"), extracts the value nuovi_positivi for every date, and then feeds it into the program.
Now I have to do the same thing with this JSON that you may find here:
Link Json
As you may see, now for every date there are 21 different values, because Italy has 21 regions (Abruzzo, Basilicata, Campania and so on), but I am interested ONLY in the values of the region "Veneto". I want to extract only the rows that contain "Veneto" under the label "denominazione_regione", to get the value "nuovi_positivi" for every day.
I tried with:
df = pd.read_json('https://raw.githubusercontent.com/pcm-dpc/COVID-19/master/dati-json/dpc-covid19-ita-regioni.json',
                  parse_dates=['data'], index_col='data', index_row='Veneto')
df.index = df.index.normalize()
ts = df[['nuovi_positivi']].dropna()
sts = ts.nuovi_positivi
but of course it doesn't work. How to solve the problem? Thanks

try this:
df = pd.read_json('https://raw.githubusercontent.com/pcm-dpc/COVID-19/master/dati-json/dpc-covid19-ita-regioni.json',
                  convert_dates=['data'])
df.index = df['data']
df.index = df.index.normalize()
df = df[df["denominazione_regione"] == 'Veneto']
ts = df[['nuovi_positivi']].dropna()
sts = ts.nuovi_positivi
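For reference, here is a slightly more chained variant of the same idea (just a sketch using the same column names; it keeps only the Veneto rows, indexes them by the normalized date and pulls out the daily series):
import pandas as pd

url = ('https://raw.githubusercontent.com/pcm-dpc/COVID-19/master/'
       'dati-json/dpc-covid19-ita-regioni.json')
df = pd.read_json(url, convert_dates=['data'])

# keep only the Veneto rows and index them by the normalized date
veneto = df[df['denominazione_regione'] == 'Veneto'].set_index('data')
veneto.index = veneto.index.normalize()

# daily new positives for Veneto
sts = veneto['nuovi_positivi'].dropna()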

Related

How to add a new row with new header information in same dataframe

I have written code to retrieve JSON data from a URL. It works fine. I give the start and end date and it loops through the date range and appends everything to a dataframe.
The columns are populated with the JSON data's sensors and their corresponding values, hence column names like sensor_1. When I request the data from the URL it sometimes happens that there are new sensors, the old ones are switched off and deliver no data anymore, and often the number of columns changes. In that case my code just adds new columns.
What I want, instead of new columns, is a new header in the ongoing dataframe.
What I currently get with my code:
datetime;sensor_1;sensor_2;sensor_3;new_sensor_8;new_sensor_9;sensor_10;sensor_11;
2023-01-01;23.2;43.5;45.2;NaN;NaN;NaN;NaN;NaN;
2023-01-02;13.2;33.5;55.2;NaN;NaN;NaN;NaN;NaN;
2023-01-03;26.2;23.5;76.2;NaN;NaN;NaN;NaN;NaN;
2023-01-04;NaN;NaN;NaN;75;12;75;93;123;
2023-01-05;NaN;NaN;NaN;23;31;24;15;136;
2023-01-06;NaN;NaN;NaN;79;12;96;65;72;
What I want:
datetime;sensor_1;sensor_2;sensor_3;
2023-01-01;23.2;43.5;45.2;
2023-01-02;13.2;33.5;55.2;
2023-01-03;26.2;23.5;76.2;
datetime;new_sensor_8;new_sensor_9;sensor_10;sensor_11;
2023-01-04;75;12;75;93;123;
2023-01-05;23;31;24;15;136;
2023-01-06;79;12;96;65;72;
My loop to retrieve the data:
import datetime
import json
from datetime import timedelta

import numpy as np
import pandas as pd
import requests

start_date = datetime.datetime(2023, 1, 1, 0, 0)
end_date = datetime.datetime(2023, 1, 6, 0, 0)
sensor_data = pd.DataFrame()
while start_date < end_date:
    q = 'url'
    r = requests.get(q)
    j = json.loads(r.text)
    sub_data = pd.DataFrame()
    if 'result' in j:
        timestamps = pd.to_datetime(np.array(j['result']['data'])[:, 0])
        sensors = np.array(j['result']['sensors'])
        data = np.array(j['result']['data'])[:, 1:]
        df_new = pd.DataFrame(data, index=timestamps, columns=sensors)
        sub_data = pd.concat([sub_data, df_new])
    sensor_data = pd.concat([sensor_data, sub_data])
    start_date += timedelta(days=1)
If two DataFrames will do for you, then you can simply split using the column names:
df1 = df[['datetime', 'sensor_1', 'sensor_2', 'sensor_3']]
df2 = df[['datetime', 'new_sensor_8', 'new_sensor_9', 'sensor_10', 'sensor_11']]
Note the double brackets [[ used.
Then use .dropna() to get rid of the NaN rows.
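For example, applied to the split above (a sketch; the subset lists just repeat the column names from the example data, and how='all' keeps rows where at least one of the selected sensors has data):
# drop rows where every selected sensor is NaN
df1 = df1.dropna(how='all', subset=['sensor_1', 'sensor_2', 'sensor_3'])
df2 = df2.dropna(how='all', subset=['new_sensor_8', 'new_sensor_9', 'sensor_10', 'sensor_11'])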

How can I merge the numerous data of two columns within the same DataFrame?

Here is a picture of df1 = fatalities.
So, in order to create a diagram that displays the years with the most injuries (I have an assignment about plane crash incidents in Greece from 2000-2020), I need to create a column out of the minor_injuries and serious_injuries ones.
I had a first df with more data, but I kept only the columns that I needed, so we have the fatalities df1, which contains the years, the fatal_injuries, the minor_injuries, the serious_injuries and the total number of incidents per year (all_incidents). What I wish to do is merge the minor and serious injuries into a column named total_injuries, or just injuries.
import pandas as pd

pd.set_option('display.max_rows', None)
df = pd.read_csv('all_incidents_cleaned.csv')
df.head()
df['Year'] = pd.to_datetime(df.incident_date).dt.year
fatalities = df.groupby('Year').fatalities.value_counts().unstack().reset_index()
fatalities['all_incidents'] = fatalities[['Θανάσιμος τραυματισμός',
    'Μικρός τραυματισμός', 'Σοβαρός τραυματισμός', 'Χωρίς Τραυματισμό']].sum(axis=1)
df['percentage_deaths_to_all_incidents'] = round(
    (fatalities['Θανάσιμος τραυματισμός'] / fatalities['all_incidents']) * 100, 1)
df1 = fatalities
fatalities_pd = pd.DataFrame(fatalities)
df1
fatalities_pd.rename(columns={'Θανάσιμος τραυματισμός': 'fatal_injuries',
                              'Μικρός τραυματισμός': 'minor_injuries',
                              'Σοβαρός τραυματισμός': 'serious_injuries',
                              'Χωρίς Τραυματισμό': 'no_injuries'}, inplace=True)
df1
For your current dataset two steps are needed.
First I would replace the NaN values with 0. This could be done with:
df1 = df1.fillna(0)
Then you can create a new column "total_injuries" with the sum of minor and serious injuries:
df1["total_injuries"]=df1["minor_injuries"]+df1["serious_injuries"]
It's always good to check your data for consistency before working on it. Helpful commands look like:
data.shape
data.info()
data.isna().values.any()
data.duplicated().values.any()
duplicated_rows = data[data.duplicated()]
len(duplicated_rows)
data.describe()

Pandas DF, DateOffset, creating new column

So I'm working with the JHU COVID-19 data, and they've let their recovered dataset go; they're no longer tracking it, just confirmed cases and deaths. What I'm trying to do here is recreate it. The table holds the confirmed cases and deaths for every country for every date, sorted by date, and my getRecovered function below tries to take the date of the row, find the date two weeks before that for the country of that row, and return a 'Recovered' column, which is the confirmed count of two weeks ago minus the deaths today.
Maybe a pointless exercise, but I would still like to know how to do it, haha. I know it's a big dataset and there are a lot of operations there, but it's been running for 20 minutes now and it's still going. Did I do something wrong, or would it just take this long?
Thanks for any help, friends.
import pandas as pd
import wget

urls = [
'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv',
'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv'
]
[wget.download(url) for url in urls]
confirmed = pd.read_csv('time_series_covid19_confirmed_global.csv')
deaths = pd.read_csv('time_series_covid19_deaths_global.csv')
dates = confirmed.columns[4:]
confirmed_long_form = confirmed.melt(
id_vars =['Province/State', 'Country/Region', 'Lat', 'Long'],
value_vars=dates,
var_name='Date',
value_name='Confirmed'
)
deaths_long_form = deaths.melt(
id_vars =['Province/State', 'Country/Region', 'Lat', 'Long'],
value_vars=dates,
var_name='Date',
value_name='Deaths'
)
full_table = confirmed_long_form.merge(
right=deaths_long_form,
how='left',
on=['Province/State', 'Country/Region', 'Date', 'Lat', 'Long']
)
full_table['Date'] = pd.to_datetime(full_table['Date'])
full_table = full_table.sort_values(by='Date', ascending=True)
def getRecovered(row):
    ts = row['Date']
    country = row['Country/Region']
    ts = pd.Timestamp(ts)
    do = pd.tseries.offsets.DateOffset(n=14)
    newTimeStamp = ts - do
    oldrow = full_table.loc[(full_table['Date'] == newTimeStamp) & (full_table['Country/Region'] == country)]
    return oldrow['Confirmed'] - row['Deaths']

full_table['Recovered'] = full_table.apply(lambda row: getRecovered(row), axis=1)
full_table
Your function is being applied row by row, which is likely why performance is suffering. Pandas is fastest when you make use of vectorised functions. For example you can use
pd.to_datetime(full_table['Date'])
to convert the whole date column much faster (see here: Convert DataFrame column type from string to datetime).
You can then add the date offset to that column, something like:
full_table['Recovery_date'] = pd.to_datetime(full_table['Date']) - pd.tseries.offsets.DateOffset(n = 14)
You can then self-merge the table on Date == Recovery_date (plus any other keys) and subtract the numbers, as sketched below.
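A minimal sketch of that self-merge, reusing the column names from the question (the exact set of merge keys is an assumption):
full_table['Recovery_date'] = full_table['Date'] - pd.tseries.offsets.DateOffset(n=14)

# lookup table of confirmed counts keyed by place and date
lookup = full_table[['Province/State', 'Country/Region', 'Date', 'Confirmed']].rename(
    columns={'Date': 'Recovery_date', 'Confirmed': 'Confirmed_14d_ago'})

# each row now sees the confirmed count from 14 days earlier for its own location
full_table = full_table.merge(
    lookup, on=['Province/State', 'Country/Region', 'Recovery_date'], how='left')

full_table['Recovered'] = full_table['Confirmed_14d_ago'] - full_table['Deaths']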

Filtering Pandas Dataframe by date not working

I am downloading Bitcoin price data using the Cryptowatch API. Downloading the price data works well, but I only need the data for the last month, i.e. from 29.10.2019 until 28.11.2019. I read several answers to similar questions, but it does not seem to work for my code: I get the same output after filtering as without filtering.
Here is my code:
import datetime

import pandas as pd
import requests

# define 1-day period
periods = '86400'
#get price data from cryptowatchAPI
resp = requests.get('https://api.cryptowat.ch/markets/bitfinex/btcusd/ohlc', params={'periods': periods})
resp.ok
#create pandas dataframe
data = resp.json()
df = pd.DataFrame(data['result'][periods], columns=[
    'CloseTime', 'OpenPrice', 'HighPrice', 'LowPrice', 'ClosePrice', 'Volume', 'NA'])
#Make a date out of CloseTime
df['CloseTime'] = pd.to_datetime(df['CloseTime'], unit='s')
#make CloseTime Index of the Dataframe
df.set_index('CloseTime', inplace=True)
#filter df by date until 1 month ago
df.loc[datetime.date(year=2019,month=10,day=29):datetime.date(year=2019,month=11,day=28)]
df
There is no error or anything, but the output is always the same, so filtering does not work.
Thank you very much in advance!!
Use the string format of datetimes for filtering; see also more information in the docs:
df1 = df['2019-10-29':'2019-11-28']
Or:
s = datetime.datetime(year=2019,month=10,day=29)
e = datetime.datetime(year=2019,month=11,day=28)
df1 = df[s:e]
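Note that the slice has to be assigned back; the snippet in the question calls df.loc[...] but never stores the result, which is why the output looks unchanged. For example, with CloseTime already set as the index:
# keep only the rows between the two dates and overwrite df with the result
df = df.loc['2019-10-29':'2019-11-28']
df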

How to extract weekends and bank holidays for stock price data

markowitz = pd.read_excel('C:/Users/jordan/Desktop/book2.xlsx')
markowitz = markowitz.set_index('Dates')
markowitz
There are some NaN values in the data; some of them are weekends and some of them are holidays. I have to identify the holidays and set them to the previous value.
Is there a simple way I can do this? I used
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar as calendar
dr = pd.date_range(start='2013-01-01', end='2018-06-12')
df = pd.DataFrame()
df['Date'] = dr
cal = calendar()
holidays = cal.holidays(start=dr.min(), end=dr.max())
df['Holiday'] = df['Date'].isin(holidays)
print (df)
df = df[df['Holiday'] == True]
df
but there are still a lot of dates I have to copy and paste (can I just display the "Date" column?) and then set to the previous trading day's value. Is there a simpler way to do this? Thanks a lot in advance.
There may be a simpler way, if I understand what you are trying to do. The fillna method on dataframes lets you forward fill. So if you don't want to fill weekend days but do want to fill all other NaNs (i.e. holidays), you can just exclude Saturdays and Sundays as follows:
df.loc[~df['Date'].dt.day_name().isin(['Saturday','Sunday'])] = df.loc[~df['Date'].dt.day_name().isin(['Saturday','Sunday'])].fillna(method='ffill')
You can use this on the whole dataframe or on particular columns.
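If instead only the holiday rows should be filled, the USFederalHolidayCalendar from the question can be combined with the same idea (a sketch; it assumes the frame has a datetime 'Date' column like above):
from pandas.tseries.holiday import USFederalHolidayCalendar

cal = USFederalHolidayCalendar()
holidays = cal.holidays(start=df['Date'].min(), end=df['Date'].max())

# forward-fill a copy of the whole frame, then copy the filled values
# back only into the rows whose date is a listed holiday
is_holiday = df['Date'].isin(holidays)
df.loc[is_holiday] = df.ffill().loc[is_holiday]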
I think your best bet is to get an API key from quandl.com. It's free and it gives you access to all kinds of historical time series data. There used to be access to Yahoo Finance and Google Finance, but I think both were deprecated well over a year ago.
Here is a small sample of code that can definitely help you.
import quandl
quandl.ApiConfig.api_key = 'your_api_key_goes_here'
# get the table for daily stock prices and,
# filter the table for selected tickers, columns within a time range
# set paginate to True because Quandl limits tables API to 10,000 rows per call
data = quandl.get_table('WIKI/PRICES', ticker=['AAPL', 'MSFT', 'WMT'],
                        qopts={'columns': ['ticker', 'date', 'adj_close']},
                        date={'gte': '2015-12-31', 'lte': '2016-12-31'},
                        paginate=True)
print(data)
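To get one price column per ticker from that long table, a pivot works (a sketch, assuming the columns requested above come back as 'ticker', 'date' and 'adj_close'):
# one adjusted-close column per ticker, indexed by date
prices = data.pivot(index='date', columns='ticker', values='adj_close').sort_index()
print(prices.head())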
Check the link below for info about how to get the data you need.
https://blog.quandl.com/api-for-stock-data
Also, please see this for more details about using Python for quantitative finance.
https://financetrain.com/best-python-librariespackages-finance-financial-data-scientists/
Finally, and I apologize if this is a little off topic, but I think it may be helpful at some level...consider something like this...
import requests
from bs4 import BeautifulSoup
base_url = 'http://finviz.com/screener.ashx?v=152&s=ta_topgainers&o=price&c=0,1,2,3,4,5,6,7,25,63,64,65,66,67'
html = requests.get(base_url)
soup = BeautifulSoup(html.content, "html.parser")
main_div = soup.find('div', attrs = {'id':'screener-content'})
light_rows = main_div.find_all('tr', class_="table-light-row-cp")
dark_rows = main_div.find_all('tr', class_="table-dark-row-cp")
data = []
for rows_set in (light_rows, dark_rows):
    for row in rows_set:
        row_data = []
        for cell in row.find_all('td'):
            val = cell.a.get_text()
            row_data.append(val)
        data.append(row_data)
# sort rows to maintain original order
data.sort(key=lambda x: int(x[0]))
import pandas
pandas.DataFrame(data).to_csv("AAA.csv", header=False)
It's not time series data, but rather fundamental data. I haven't spent a lot of time on that site, but maybe you can poke around and find something there that suits your needs. Just a thought.
