I'm having difficulty getting my dataframe data into their own timeseries. Here is my code so far:
data = pd.read_csv(s3_data_path, index_col=0, parse_dates=True, dtype="float64")
num_timeseries = data.shape[1]
print("This is the number of time series you're running:")
print(num_timeseries)
data_length = num_timeseries
t0 = data.index[0]
print("This is the beginning date:")
print(t0)
time_series=[]
for i in range(num_timeseries):
    index = pd.DatetimeIndex(start=t0, freq=freq, periods=data_length)
    time_series.append(pd.Series(data=data.iloc[:,i], index=index))
print(data.head(10))
print(time_series[0])
Whenever I run print(data.head(10)) I see the following, which is what I expect:
However, my time_series object has values of NaN:
I don't quite understand what I'm missing to get the correct values in my time_series object. Thanks for the help!
Edit---
Whenever I remove index=index from the pd.Series call inside time_series.append, the code generates my expected result (pictured below). However, this creates a new issue: no frequency is defined, and a frequency is a requirement.
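A likely cause of the NaN values is index alignment: when pd.Series receives another Series as data together with an explicit index, pandas reindexes the data to the new labels instead of relabeling it, so every new label that is missing from the original index becomes NaN. Below is a minimal sketch of one way around this, not a definitive fix. The sample data, the column names, and the daily frequency "D" are assumptions made for illustration. It passes the raw values with .to_numpy() so no alignment happens, takes the number of periods from the number of rows, and uses pd.date_range, since recent pandas versions no longer accept start/periods in the pd.DatetimeIndex constructor.

```python
import pandas as pd

# Hypothetical stand-in for the CSV: two series with irregular dates.
data = pd.DataFrame(
    {"a": [1.0, 2.0, 3.0, 4.0, 5.0], "b": [10.0, 20.0, 30.0, 40.0, 50.0]},
    index=pd.to_datetime(
        ["2020-01-01", "2020-01-02", "2020-01-04", "2020-01-05", "2020-01-07"]
    ),
)

freq = "D"                   # assumed frequency, not taken from the question
t0 = data.index[0]
data_length = data.shape[0]  # number of rows, not number of columns
index = pd.date_range(start=t0, freq=freq, periods=data_length)

# Passing a Series as data aligns on labels: 2020-01-03 has no match -> NaN.
aligned = pd.Series(data=data.iloc[:, 0], index=index)

# Passing plain values relabels them positionally and keeps the frequency.
time_series = [
    pd.Series(data=data.iloc[:, i].to_numpy(), index=index)
    for i in range(data.shape[1])
]

print(aligned)
print(time_series[0])
```

Note that relabeling by position changes the timestamps whenever the original index is irregular. If the original dates are already evenly spaced, data.asfreq(freq) is an alternative that keeps them and only attaches the frequency.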
I have written code to retrieve JSON data from a URL. It works fine: I give it a start and end date, and it loops through the date range and appends everything to a dataframe.
The columns are populated with the sensors from the JSON data and their corresponding values, so the column names look like sensor_1. Sometimes, when I request the data from the URL, new sensors have appeared and old ones have been switched off and no longer deliver data, so the number of columns often changes. In that case my code just adds new columns.
What I want instead of new columns is a new header row in the ongoing dataframe.
What I currently get with my code:
datetime;sensor_1;sensor_2;sensor_3;new_sensor_8;new_sensor_9;sensor_10;sensor_11;
2023-01-01;23.2;43.5;45.2;NaN;NaN;NaN;NaN;NaN;
2023-01-02;13.2;33.5;55.2;NaN;NaN;NaN;NaN;NaN;
2023-01-03;26.2;23.5;76.2;NaN;NaN;NaN;NaN;NaN;
2023-01-04;NaN;NaN;NaN;75;12;75;93;123;
2023-01-05;NaN;NaN;NaN;23;31;24;15;136;
2023-01-06;NaN;NaN;NaN;79;12;96;65;72;
What I want:
datetime;sensor_1;sensor_2;sensor_3;
2023-01-01;23.2;43.5;45.2;
2023-01-02;13.2;33.5;55.2;
2023-01-03;26.2;23.5;76.2;
datetime;new_sensor_8;new_sensor_9;sensor_10;sensor_11;
2023-01-04;75;12;75;93;123;
2023-01-05;23;31;24;15;136;
2023-01-06;79;12;96;65;72;
My loop to retrieve the data:
import datetime
import json
import numpy as np
import pandas as pd
import requests

start_date = datetime.datetime(2023,1,1,0,0)
end_date = datetime.datetime(2023,1,6,0,0)
sensor_data = pd.DataFrame()
while start_date < end_date:
    q = 'url'
    r = requests.get(q)
    j = json.loads(r.text)
    sub_data = pd.DataFrame()
    if 'result' in j:
        # first column holds the timestamps, the remaining columns the sensor values
        timestamps = pd.to_datetime(np.array(j['result']['data'])[:,0])
        sensors = np.array(j['result']['sensors'])
        data = np.array(j['result']['data'])[:,1:]
        df_new = pd.DataFrame(data, index=timestamps, columns=sensors)
        sub_data = pd.concat([sub_data, df_new])
    sensor_data = pd.concat([sensor_data, sub_data])
    start_date += datetime.timedelta(days=1)
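If two DataFrames will do, you can simply split by column names:
df1 = df[['datetime', 'sensor_1', 'sensor_2', 'sensor_3']]
df2 = df[['datetime', 'new_sensor_8', 'new_sensor_9', 'sensor_10', 'sensor_11']]
Note the double brackets [[ ]], which select a list of columns. This assumes datetime is a regular column; if it is the index, leave it out of the lists.
Then call .dropna() on each result to drop the rows that contain NaN.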
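Here is a minimal sketch of that split, assuming the timestamps are the index (as in the question's loop) and that the two sensor groups never report on the same day. The sample values and the shortened column lists are made up for illustration. Each block is written with its own header, which gives the "new header in the ongoing output" layout.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: old sensors stop reporting, new sensors start.
df = pd.DataFrame(
    {
        "sensor_1": [23.2, 13.2, np.nan, np.nan],
        "sensor_2": [43.5, 33.5, np.nan, np.nan],
        "new_sensor_8": [np.nan, np.nan, 75.0, 23.0],
        "new_sensor_9": [np.nan, np.nan, 12.0, 31.0],
    },
    index=pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-04", "2023-01-05"]),
)
df.index.name = "datetime"

old_cols = ["sensor_1", "sensor_2"]
new_cols = ["new_sensor_8", "new_sensor_9"]

# Select each group of columns, then drop the rows where the whole group is NaN.
df1 = df[old_cols].dropna(how="all")
df2 = df[new_cols].dropna(how="all")

# to_csv() without a path returns a string; concatenating the two strings
# yields one text with a fresh header line before the second block.
csv_text = df1.to_csv(sep=";") + df2.to_csv(sep=";")
print(csv_text)
```

Using how="all" keeps a row when only some sensors of a group are missing; plain .dropna() would discard it.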
I have a df of 300000 rows and 25 columns.
Here's a link to 21 rows of the dataset
I have added a unique index to all the rows, using uuid.uuid4().
Now I only want a random portion of the dataset (say 25%). Here is what I am trying to do to get it, but it's not working:
def gen_uuid(self, df, percentage = 1.0, uuid_list = []):
    for i in range(df.shape[0]):
        uuid_list.append(str(uuid.uuid4()))
    uuid_pd = pd.Series(uuid_list)
    df_uuid = df.copy()
    df_uuid['id'] = uuid_pd
    df_uuid = df_uuid.set_index('id')
    if (percentage == 1.0) : return df_uuid
    else:
        uuid_list_sample = random.sample(uuid_list, int(len(uuid_list) * percentage))
        return df_uuid[df_uuid.index.any() in uuid_list_sample]
But this gives an error: KeyError: False
The uuid_list_sample that I generate has the correct length.
So I have 2 questions:
How do I get the above code to work as intended, i.e. return a random portion of the pandas df based on its index?
How do I, in general, get a percentage of the whole pandas data frame? I was looking at pandas.DataFrame.quantile, but I'm not sure if that does what I'm looking for.
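A minimal sketch covering both questions, using a made-up 100-row frame in place of the real data. The expression df_uuid.index.any() in uuid_list_sample evaluates to a single Python boolean, and pandas then tries to look that boolean up as a column label, which would explain the KeyError: False. Index.isin builds a per-row boolean mask instead. For a random fraction in general, DataFrame.sample with frac is the usual tool; quantile computes value quantiles per column and does not select rows.

```python
import random
import uuid

import pandas as pd

# Hypothetical data standing in for the 300000 x 25 frame.
df = pd.DataFrame({"value": range(100)})
df.index = pd.Index([str(uuid.uuid4()) for _ in range(len(df))], name="id")

# Question 2: a random 25% of the rows (random_state only makes it repeatable).
quarter = df.sample(frac=0.25, random_state=0)

# Question 1: keep the hand-made sample of ids, but filter with a row mask.
uuid_list_sample = random.sample(list(df.index), int(len(df) * 0.25))
subset = df[df.index.isin(uuid_list_sample)]

print(len(quarter), len(subset))
```

Separately, the mutable default uuid_list = [] in the original function is shared between calls, so ids from earlier calls pile up in it; creating the list inside the function avoids that.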
I have this code where I want to change the date format, but I only manage to change a single value and not the whole dataset.
Code:
import pandas as pd
df = pd.read_csv ("data_q_3.csv")
result = df.groupby ("Country/Region").max().sort_values(by='Confirmed', ascending=False)[:10]
pd.set_option('display.max_column', None)
print ("Covid 19 top 10 countries based on confirmed case:")
print(result)
from datetime import datetime
datetime.fromisoformat("2020-03-18T12:13:09").strftime("%Y-%m-%d-%H:%M")
Does anyone know how to adapt the code so that the datetime format changes across the whole dataset?
Thanks!
After looking at your problem for a while, I figured out how to change the values in the 'DateTime' column. The only problem that may arise is if the 'Country/Region' column has duplicate location names.
Editing the time is simple, as all you have to do is make use of Python's slicing. You can slice a string by typing
string = 'abcdefghijklmnopqrstuvwxyz'
print(string[0:5])
which will result in abcde (the end index is exclusive).
Below is the finished code.
import pandas as pd
# read unknown data
df = pd.read_csv("data_q_3.csv")
# List of unknown data
result = df.groupby("Country/Region").max().sort_values(by='Confirmed', ascending=False)[:10]
pd.set_option('display.max_column', None)
# you need a for loop to go through the whole column
for row in result.index:
    # get the currently stored time
    time = result.at[row, 'DateTime']
    # reformat the time string, assuming values like 2020-03-18T12:13:09:
    # keep characters 0-9 (the date) and 11-15 (hours and minutes),
    # and put a dash in the middle
    time = time[0:10] + "-" + time[11:16]
    # store the new time in the result
    result.at[row, 'DateTime'] = time
# print result
print ("Covid 19 top 10 countries based on confirmed case:")
print(result)
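As an alternative to slicing in a loop, the whole column can be converted in one step. This is a minimal sketch, assuming the column is called 'DateTime' and holds ISO-style strings such as the one in the question; the two sample rows and the index labels are made up. It parses the strings with pd.to_datetime and formats them with the same strftime pattern the question uses.

```python
import pandas as pd

# Hypothetical stand-in for the grouped result.
result = pd.DataFrame(
    {"DateTime": ["2020-03-18T12:13:09", "2020-03-19T08:05:30"]},
    index=pd.Index(["A", "B"], name="Country/Region"),
)

# Parse every value, then format the whole column at once.
result["DateTime"] = pd.to_datetime(result["DateTime"]).dt.strftime("%Y-%m-%d-%H:%M")

print(result)
```

Parsing first has the advantage that malformed values raise an error instead of being silently cut at fixed positions.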
When I try to add a new column to an existing dataframe, the new column only contains empty values. However, when I print "result" before assigning it to the dataframe, it looks fine! As a consequence I get this weird error about the max() argument:
ValueError: max() arg is an empty sequence
I'm using mplfinance to plot the data
strategy.py
def moving_average (self, df , i):
    signal = df['sma20'][i]*1.10
    if (df['sma20'][i] > df['sma50'][i]) & (signal >df['Close'][i]):
        return df['Close'][i]
    else:
        return None
trading.py
for i in range(0, len(df['Close'])-1):
    result = strategy.moving_average(df , i)
    print(result)
    df['buy']= result
df.to_csv('test.csv', encoding='utf-8')
apd = mpf.make_addplot(df['buy'],scatter=True,marker='^')
mpf.plot(df, type='candle', addplot=apd)
Based on the very small amount of information here, and on your comment
"because df['buy'] column has nan values only."
I'm going to guess that your problem is that strategy.moving_average() is returning None instead of nan when there is no signal.
There is a big difference between None and nan. (The main issue is that nan supports math, whereas None does not; and as a general rule plotting packages always do math).
I suggest you import numpy as np and then, in strategy.moving_average(), change return None to return np.nan.
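To illustrate the difference between None and nan described above, here is a small sketch that is independent of mplfinance. It shows that arithmetic works with np.nan but not with None, and that a pandas column built only from None values gets the generic object dtype rather than a numeric one.

```python
import numpy as np
import pandas as pd

# nan takes part in arithmetic (the result is again nan).
nan_result = np.nan + 1.0

# None does not: adding a number to it raises a TypeError.
try:
    None + 1.0
    none_supports_math = True
except TypeError:
    none_supports_math = False

# A column of only None is stored as generic Python objects,
# while a column of only nan is a regular float column.
all_none = pd.Series([None, None])
all_nan = pd.Series([np.nan, np.nan])

print(nan_result, none_supports_math)
print(all_none.dtype, all_nan.dtype)
```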
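I also just saw another problem.
You are only assigning a single value to df['buy'], because the assignment sits inside the loop and overwrites the column on every pass.
You need to take it out of the loop.
I suggest initializing result as an empty list before the loop and appending to it inside the loop. The range also has to cover every row, otherwise the list is shorter than the dataframe and the assignment fails with a length mismatch.
Then:
result = []
for i in range(len(df['Close'])):
    result.append(strategy.moving_average(df , i))
print(result)
df['buy']= result
df.to_csv('test.csv', encoding='utf-8')
apd = mpf.make_addplot(df['buy'],scatter=True,marker='^')
mpf.plot(df, type='candle', addplot=apd)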
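The same column can also be computed without a loop. The following is a minimal sketch, assuming the dataframe has the 'Close', 'sma20' and 'sma50' columns used in the question; the four sample rows are made up. np.where picks the closing price where the condition holds and nan everywhere else, so the column always has one value per row.

```python
import numpy as np
import pandas as pd

# Hypothetical prices and moving averages.
df = pd.DataFrame(
    {
        "Close": [100.0, 105.0, 120.0, 90.0],
        "sma20": [101.0, 99.0, 104.0, 95.0],
        "sma50": [100.0, 100.0, 100.0, 96.0],
    }
)

# Same rule as strategy.moving_average(), applied to whole columns.
signal = df["sma20"] * 1.10
buy_mask = (df["sma20"] > df["sma50"]) & (signal > df["Close"])

# Closing price where the rule fires, nan otherwise.
df["buy"] = np.where(buy_mask, df["Close"], np.nan)

print(df)
```

If no row satisfies the rule, the column is entirely nan, so it is worth checking df['buy'].notna().any() before handing the column to a plotting call.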
I want to calculate the average number of successful Rattata catches per hour for this whole dataset. I am looking for an efficient way to do this with pandas. I'm new to Python and pandas.
You don't need any loops. Try this. I think the logic is fairly clear.
import pandas as pd
#read csv
df = pd.read_csv('pkmn.csv', header=0)
#we need to apply some transformations to extract the date from the timestamp
df['time'] = pd.to_datetime(df['time'].astype(str))
df['date'] = df['time'].dt.date
#main transformations
df = df.query("Pokemon == 'rattata' and caught == True").groupby('hour')
result = pd.DataFrame()
result['caught total'] = df['hour'].count()
result['days'] = df['date'].nunique()
result['caught average'] = result['caught total'] / result['days']
If you have your pandas dataframe saved as df, this should work:
rats = df.loc[df.Pokemon == "rattata"] #Gives you subset of rows relating to Rattata
total = rats.Caught.sum() #Gives you the number caught total
diff = rats.time.iloc[-1] - rats.time.iloc[0] #Difference between last and first row (assumes time is a datetime column sorted in ascending order)
hours = diff / pd.Timedelta(hours=1) #Converts the time difference to a number of hours
average = total/hours #Should give you the number caught per hour
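Another way to read "average per hour" is to count the successful catches in each clock hour and average those counts, including the hours with no catch. Here is a minimal sketch under stated assumptions: the column names 'time', 'Pokemon' and 'caught' and the six sample rows are made up, since the real dataset is not shown.

```python
import pandas as pd

# Hypothetical catch log.
df = pd.DataFrame(
    {
        "time": pd.to_datetime(
            [
                "2016-07-01 10:05", "2016-07-01 10:40", "2016-07-01 11:15",
                "2016-07-01 13:30", "2016-07-01 13:45", "2016-07-01 13:50",
            ]
        ),
        "Pokemon": ["rattata", "rattata", "pidgey", "rattata", "rattata", "rattata"],
        "caught": [True, True, True, True, False, True],
    }
)

# Keep only the successful Rattata catches.
caught = df[(df["Pokemon"] == "rattata") & df["caught"]]

# Count catches per one-hour bin; empty hours inside the range count as 0.
per_hour = caught.set_index("time").resample(pd.Timedelta(hours=1)).size()

average_per_hour = per_hour.mean()
print(per_hour)
print(average_per_hour)
```

Unlike dividing the total by the elapsed time, this treats partially covered first and last hours as full hours, so the two approaches can give slightly different numbers.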